Considerations about learning Word2Vec
Giovanni Di Gennaro; Francesco Palmieri
2021
Abstract
Despite the wide diffusion and use of embeddings generated through Word2Vec, there are still many open questions about the reasons for its results and about its real capabilities. In particular, to our knowledge, no author seems to have analysed in detail how learning may be affected by the various choices of hyperparameters. In this work, we try to shed some light on various issues, focusing on a typical dataset. It is shown that the learning rate prevents the exact mapping of the co-occurrence matrix, that Word2Vec is unable to learn syntactic relationships, and that it does not suffer from overfitting. Furthermore, through the creation of an ad-hoc network, it is also shown that Word2Vec can be improved directly on the analogies, obtaining very high accuracy without damaging the pre-existing embedding. This analogy-enhanced Word2Vec may be convenient in various NLP scenarios, but it is used here as an optimal starting point to evaluate the limits of Word2Vec.