Comparing Word Embeddings with Gensim
Parul Sethi (~parulsethi)
Python has many natural language processing tools. Anyone who wants to implement a recommender or a document classifier, for example, faces the problem of choosing among the many open source word embeddings available. In this talk, I will highlight the differences between them. I'll go through some evaluations, primarily of three word embeddings, Word2Vec, FastText and WordRank, all of which are available in the widely used Python library gensim, either as a direct implementation or through a wrapper. The results will show how these embeddings specialize in different downstream NLP tasks.
Visualization is also a crucial part of data analysis, since it helps reveal the structure and underlying patterns in the data, so I'll also cover visualizing word embeddings using TensorBoard and gensim.
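As a rough illustration of the visualization step, the sketch below serializes a small word-to-vector mapping into the two tab-separated texts (one of vectors, one of labels) that TensorBoard's Embedding Projector can load. The function name `to_projector_tsv` and the toy vectors are assumptions made for this example; they are not part of gensim or TensorBoard.

```python
def to_projector_tsv(embedding):
    """Return (vectors_tsv, metadata_tsv) strings for a {word: vector} dict.

    Hypothetical helper for illustration only. Each line of vectors_tsv
    holds one vector with tab-separated components; the same line of
    metadata_tsv holds the matching word. A single-column metadata file
    has no header row.
    """
    words = sorted(embedding)  # fixed order keeps both texts aligned
    vector_lines = ["\t".join(str(x) for x in embedding[w]) for w in words]
    vectors_tsv = "\n".join(vector_lines) + "\n"
    metadata_tsv = "\n".join(words) + "\n"
    return vectors_tsv, metadata_tsv


# Toy 3-dimensional vectors, invented for the example (not trained output).
toy_embedding = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "apple": [0.1, 0.1, 0.9],
}

vectors_tsv, metadata_tsv = to_projector_tsv(toy_embedding)
print(vectors_tsv)
print(metadata_tsv)
```

Saving the two strings as, say, `vectors.tsv` and `metadata.tsv` (file names chosen here for illustration) gives a pair of files that can be loaded into the projector to explore the points with PCA or t-SNE.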
- What are word embeddings.
- Why are they useful.
- Examples of some popular word embeddings
- Why you need to choose carefully between those different embeddings.
- Example of their different results for similarity with a single word.
- Benchmark performance overview:
- What is Word Similarity data, and how different embeddings perform on it.
- What is Word Analogy data, and how different embeddings perform on it.
- PCA, t-SNE
- Using TensorBoard with an example embedding
- Relation between word frequency and embedding performance
- How the differences between embeddings discussed above could affect downstream applications.
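The single-word similarity comparison and the word analogy task from the outline can be sketched with plain cosine similarity. This is a minimal illustration under stated assumptions: the vectors are hand-made toys, not output from Word2Vec, FastText or WordRank, and the helper names (`cosine`, `most_similar`, `analogy`) are hypothetical stand-ins written for this example rather than gensim's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(embedding, word, topn=1):
    """Rank every other word by cosine similarity to `word`."""
    query = embedding[word]
    scored = [(other, cosine(query, vec))
              for other, vec in embedding.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


def analogy(embedding, a, b, c):
    """Answer 'a is to b as c is to ?' by ranking words against b - a + c."""
    target = [vb - va + vc for va, vb, vc
              in zip(embedding[a], embedding[b], embedding[c])]
    candidates = [(w, cosine(target, vec))
                  for w, vec in embedding.items() if w not in (a, b, c)]
    return max(candidates, key=lambda pair: pair[1])[0]


# Two toy embeddings (invented values) that disagree about "night":
# one places it near a related-meaning word, the other near a related form.
TOY_A = {"night": [1.0, 0.0], "evening": [0.9, 0.1], "nightly": [0.2, 0.9]}
TOY_B = {"night": [1.0, 0.0], "evening": [0.1, 0.9], "nightly": [0.9, 0.2]}

print(most_similar(TOY_A, "night")[0][0])  # evening
print(most_similar(TOY_B, "night")[0][0])  # nightly

# Toy vectors laid out so the analogy works: (royalty, maleness, fruitiness).
TOY_C = {
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.1],
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "apple": [0.1, 0.1, 0.9],
}

print(analogy(TOY_C, "man", "woman", "king"))  # queen
```

With real models, gensim's `KeyedVectors.most_similar` plays the role of the two helpers; the point of the sketch is only that the same query word can come back with different neighbours depending on the embedding.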
Prerequisites: just a basic idea of what word embeddings are.
Gensim on GitHub: https://github.com/RaRe-Technologies/gensim
Word Embedding comparisons Jupyter notebook: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Wordrank_comparisons.ipynb
Slides: https://docs.google.com/presentation/d/1hse_hrTU9MVDTYNyhVilWuR5OTf_fN4M8xLEMnBtwbU/edit?usp=sharing (work in progress)
I'm a third-year undergraduate student of Maths and IT at Cluster Innovation Centre, University of Delhi. I contributed the WordRank wrapper and the embedding comparison tutorial to gensim as part of my Incubator project with RaRe Technologies.