A machine learning based methodology for automatic annotation and anonymisation of privacy-related items in textual documents for justice domain

Abstract

Textual resources are currently annotated both manually, by human experts who select hand-crafted features, and automatically, by trained systems. Expert annotations are very accurate, but they take considerable effort to compile and are most often not publicly accessible. Automatic approaches save effort but do not yet reach the required accuracy, mostly because representing domain experts’ knowledge in a machine-readable format is difficult and labor-intensive. This work tackles the problem of automatically annotating plain-text resources; it was motivated by the need to support Italian justice officers in detecting sensitive information contained in large collections of judgment documents, for privacy-preservation purposes. We propose a novel methodology, based on unsupervised machine learning techniques, to help human experts detect sensitive information. We ran experiments on about 20,000 plain-text documents and obtained an accuracy of about 75% in the preliminary validation stage.
2021
Di Martino, B.; Marulli, F.; Lupi, P.; Cataldi, A.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11591/442221
Citations
  • Scopus 12