A machine learning based methodology for automatic annotation and anonymisation of privacy-related items in textual documents for justice domain
Di Martino B.; Marulli F.
2021
Abstract
Annotation of textual resources is currently performed both manually, by human experts selecting hand-crafted features, and automatically, by trained systems. Human experts’ annotations are very accurate, but they require heavy effort to compile and are most often not publicly accessible. Automatic approaches save effort but do not yet reach the required accuracy, mostly because of the great difficulty and labor required to represent domain experts’ knowledge in a machine-readable format. This work tackles the problem of automatically annotating plain-text resources; it was motivated by the need to support Italian justice officers in detecting sensitive information contained in large collections of judgement documents, for privacy-preservation purposes. We propose a novel methodology, based on unsupervised machine learning techniques, to help human experts detect sensitive information. We ran experiments on about 20,000 plain-text documents and obtained an accuracy of about 75% in the preliminary validation stage.