Multilingual POS tagging by a composite deep architecture based on character-level features and on-the-fly enriched Word Embeddings
Marulli F.
Methodology
2019
Abstract
The field of Natural Language Processing (NLP) is benefiting greatly from the adoption of models and methodologies from Artificial Intelligence. In particular, Part-Of-Speech (POS) tagging is a building block for many NLP applications. In this paper, a POS tagging system based on a deep neural network is proposed. It comprises a static, task-independent pre-trained model that represents word semantics enriched by morphological information; this model approximates the Word Embedding representation learned from an unlabelled corpus by the fastText model, so that common and known words are handled consistently with rare and Out-of-Vocabulary words. A character-level representation of words is dynamically learned for the POS tagging task and is concatenated to the previous one. This joint representation is fed to the main network, comprising a Bi-LSTM layer, which is trained to associate a sequence of tags with a sequence of words. The effectiveness of the proposed system's contributions with respect to the state of the art is demonstrated by an extensive experimental campaign, which provides evidence that POS tagging accuracy improves by using Word Embeddings enriched with morphological information, by estimating embeddings for both known and unknown words, and by concatenating Word Embeddings with character-level information of the same size. Similar trends are obtained for two languages with different characteristics, namely English and Italian: in both cases, the overall accuracy on the POS tagging test set increased with respect to the most advanced existing systems, with particular improvements in accuracy on Out-of-Vocabulary words. Finally, the method has a general basis and could be used effectively for all languages, particularly those with rich morphology. (C) 2018 Elsevier B.V. All rights reserved.
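
The joint word representation described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it only shows the general subword idea associated with fastText (composing a word vector from character n-gram vectors, so that Out-of-Vocabulary words still receive an embedding) and the concatenation with a character-level vector of the same size. The names `char_ngrams`, `subword_embedding`, `char_level_embedding` and `joint_representation`, the values of `DIM` and `BUCKETS`, the n-gram range, the MD5-based hashing and the random n-gram table are all assumptions made for illustration; in the proposed system both components would be learned.

```python
import hashlib
import random

DIM = 8          # illustrative embedding size (assumption)
BUCKETS = 1000   # illustrative number of n-gram hash buckets (assumption)


def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


def bucket(ngram):
    """Map an n-gram to a stable bucket index (a stand-in hashing scheme)."""
    digest = hashlib.md5(ngram.encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


# Randomly initialised n-gram table; a real table would be learned from a corpus.
_rng = random.Random(0)
NGRAM_TABLE = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]


def subword_embedding(word):
    """Average the n-gram vectors, so any word (even an OOV one) gets a vector."""
    grams = char_ngrams(word) or [f"<{word}>"]  # fallback for very short input
    vecs = [NGRAM_TABLE[bucket(g)] for g in grams]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def char_level_embedding(word):
    """Placeholder for the task-trained character-level vector (size DIM)."""
    vec = [0.0] * DIM
    for i, ch in enumerate(word):
        vec[i % DIM] += ord(ch) / 1000.0
    return vec


def joint_representation(word):
    """Concatenate the word-level and character-level vectors of equal size."""
    return subword_embedding(word) + char_level_embedding(word)
```

For each token in a sentence, `joint_representation(token)` yields a vector of length `2 * DIM`; a sequence of such vectors is the kind of input that a Bi-LSTM tagger would consume to predict one tag per word.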