A machine learning based methodology for automatic annotation and anonymisation of privacy-related items in textual documents for justice domain

Di Martino B.;Marulli F.;Lupi P.;Cataldi A.

2021

Abstract

Textual resources are currently annotated either manually, by human experts who select hand-crafted features, or automatically, by trained systems. Expert annotations are very accurate, but they take considerable effort to compile and are rarely publicly accessible. Automatic approaches save effort but do not yet reach the required accuracy, mostly because representing domain experts' knowledge in a machine-readable format is difficult and labor-intensive. This work tackles the problem of automatically annotating plain-text resources; it was motivated by the need to support Italian justice officers in detecting sensitive information in large collections of judgment documents, for privacy-preservation purposes. We propose a novel methodology, based on unsupervised machine learning techniques, to help human experts detect sensitive information. In experiments on about 20,000 plain-text documents, we obtained an accuracy of about 75% in the preliminary validation stage.

Year: 2021

Series title: ADVANCES IN INTELLIGENT SYSTEMS AND COMPUTING

All authors: Di Martino, B.; Marulli, F.; Lupi, P.; Cataldi, A.

Publication type: 2.1 Contribution in volume (Chapter or Essay)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11591/442221
