Assignment 2 Report
Training a Dense Embedding Model
The goal of this project is to train a Word2Vec model to learn semantic relationships between words from their embeddings. For this, the Word2Vec implementation from the Gensim library was used.
Model Selection: The Word2Vec model was trained with the skip-gram architecture, as skip-gram tends to perform better on smaller datasets, which was the case here: the corpus consists of 10 documents in the “extracted_data_text” folder. Skip-gram was chosen over the continuous bag-of-words (CBOW) architecture because it predicts the surrounding context words from the target word, which typically produces more useful embeddings for tasks such as semantic similarity.
Training: Before training, the text files were pre-processed: the text was cleaned and tokenized, stopwords were removed, and lemmatization was applied. After pre-processing, the Word2Vec model was trained with the following hyperparameters (a sketch of the pipeline follows the list):
Embedding dimension: 30. Each word is represented by a 30-dimensional vector.
Context window size: 5. The window captures the words surrounding the target word.
Min count: 2. Words that appear fewer than two times are excluded from the vocabulary.
Number of epochs: 10. The number of passes over the corpus during training.
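The following is a minimal sketch of how such a pre-processing and training pipeline could look with NLTK and Gensim under the stated hyperparameters; the file paths, the helper name preprocess, and the exact cleaning steps are assumptions for illustration, not the exact code used in this project.

    import os
    import re
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from gensim.models import Word2Vec

    # Requires the NLTK resources punkt_tab, stopwords, and wordnet
    # (downloadable via nltk.download(...)).
    STOPWORDS = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        # Lowercase, strip non-alphabetic characters, tokenize,
        # drop stopwords, and lemmatize the remaining tokens.
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        tokens = word_tokenize(text)
        return [lemmatizer.lemmatize(t) for t in tokens if t not in STOPWORDS]

    # One tokenized word list per document in the corpus folder.
    corpus = []
    for name in os.listdir("extracted_data_text"):
        with open(os.path.join("extracted_data_text", name), encoding="utf-8") as f:
            corpus.append(preprocess(f.read()))

    # Skip-gram (sg=1) with the hyperparameters listed above.
    model = Word2Vec(
        sentences=corpus,
        sg=1,            # skip-gram instead of CBOW
        vector_size=30,  # embedding dimension
        window=5,        # context window size
        min_count=2,     # ignore words appearing fewer than twice
        epochs=10,       # passes over the corpus
    )
    model.save("word2vec_skipgram.model")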
Evaluation: After training the Word2Vec model, the embeddings were evaluated visually. t-SNE (t-distributed Stochastic Neighbor Embedding), a dimensionality-reduction technique, was used to project the embeddings into two-dimensional space, and the top 50 words in the vocabulary were plotted.
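A minimal sketch of this evaluation step with scikit-learn and matplotlib is shown below; the perplexity value, random seed, and output file name are assumptions rather than the settings actually used.

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Take the 50 most frequent words and their trained vectors.
    words = model.wv.index_to_key[:50]
    vectors = model.wv[words]

    # Reduce the 30-dimensional embeddings to 2 dimensions.
    # perplexity must be smaller than the number of points (50 here).
    tsne = TSNE(n_components=2, perplexity=15, random_state=42)
    points = tsne.fit_transform(vectors)

    plt.scatter(points[:, 0], points[:, 1])
    for (x, y), word in zip(points, words):
        plt.annotate(word, (x, y))
    plt.title("t-SNE projection of the top 50 word embeddings")
    plt.savefig("tsne_top50.png")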
Issues in visualization: One of the foremost challenges was maintaining clarity in the scatter plot after applying t-SNE. Initially, the word vectors appeared congested in the two-dimensional space, which made the plot hard to read. This was addressed by increasing the plot size, which made the visualization considerably clearer.
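As an illustration, with matplotlib this fix amounts to requesting a larger canvas before drawing the same scatter plot; the exact figure dimensions below are an assumption.

    # Re-draw the scatter plot on a larger canvas so the labels spread out.
    plt.figure(figsize=(16, 12))  # inches; the default is roughly 6.4 x 4.8
    plt.scatter(points[:, 0], points[:, 1])
    for (x, y), word in zip(points, words):
        plt.annotate(word, (x, y))
    plt.savefig("tsne_top50_large.png")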
Issues in Text Pre-Processing: Another challenge was encountered in the pre-processing steps. While removing stopwords, certain words were missed. In addition, tokenization did not work correctly for languages other than English; this was resolved by downloading the punkt_tab resource, which provides additional tokenizer data for other languages.
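The fix described above can be reproduced with NLTK's download utility, as in the minimal sketch below; the German example sentence is purely illustrative.

    import nltk
    from nltk.tokenize import word_tokenize

    # punkt_tab bundles the Punkt tokenizer data,
    # including models for several non-English languages.
    nltk.download("punkt_tab")

    # Illustrative check: tokenizing a non-English sentence.
    print(word_tokenize("Dies ist ein kurzer Beispielsatz.", language="german"))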