
Name: Hassan Masood

Course: Natural Language Processing


Instructor: Dr. Usman Zia

Assignment 2 Report
Training a Dense Embedding Model

IGIS, NATIONAL UNIVERSITY OF SCIENCES AND TECHNOLOGY, ISLAMABAD
A. Approach to Model Selection, Training, and Evaluation:

The goal of this project is to train a Word2Vec model to learn semantic relationships between words based on their embeddings. For this, the Word2Vec implementation from the Gensim library was used.
Model Selection: Word2Vec was used with the skip-gram architecture, as skip-gram performs better on smaller datasets, which was the case here: the corpus consists of 10 documents in the “extracted_data_text” folder. Skip-gram was chosen over the continuous bag of words (CBOW) because it predicts the context from the target word, which typically produces more useful embeddings for tasks such as semantic similarity.
Training: Before training, the text files were pre-processed: the text was cleaned and tokenized, stopwords were removed, and lemmatization was applied. After pre-processing, the Word2Vec model was trained with the following hyperparameters (a code sketch follows the list):
• Embedding dimension: 30. Each word is represented by a vector of 30 values.
• Context window size: 5. The window captures the words surrounding the target word.
• Min count: 2. Words that appear fewer than twice in the corpus are excluded.
• Number of epochs: 10. The number of passes over the corpus during training.
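As an illustration, the following is a minimal sketch of the pre-processing and training steps described above. It assumes NLTK for tokenization, stopword removal, and lemmatization, and a folder named “extracted_data_text” containing plain-text files; the helper names and file layout are assumptions for illustration, not the submitted code.

```python
import os
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec

# Resources assumed for tokenization, stopword removal, and lemmatization
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Lower-case, tokenize, drop stopwords and non-alphabetic tokens, lemmatize."""
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]

# Read and pre-process the 10 documents (folder name as described in the report)
corpus = []
data_dir = "extracted_data_text"
for fname in os.listdir(data_dir):
    with open(os.path.join(data_dir, fname), encoding="utf-8") as f:
        corpus.append(preprocess(f.read()))

# Skip-gram Word2Vec with the hyperparameters listed above
model = Word2Vec(
    sentences=corpus,
    vector_size=30,   # embedding dimension
    window=5,         # context window size
    min_count=2,      # exclude words appearing fewer than twice
    sg=1,             # 1 = skip-gram, 0 = CBOW
    epochs=10,        # passes over the corpus
)
```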
Evaluation: After training the Word2Vec model, the embeddings were evaluated using t-SNE (t-distributed Stochastic Neighbor Embedding), a technique for reducing the dimensionality of data, to project the embeddings into two-dimensional space. Finally, the top 50 words in the vocabulary were visualized (a sketch of this step follows).
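Below is a minimal sketch of the visualization step, assuming scikit-learn’s TSNE and matplotlib, and reusing the `model` variable from the training sketch above; the perplexity value and figure size are illustrative choices, not values taken from the report.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Take the 50 most frequent words in the trained vocabulary
top_words = model.wv.index_to_key[:50]
vectors = model.wv[top_words]

# Reduce the 30-dimensional embeddings to 2 dimensions
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
coords = tsne.fit_transform(vectors)

# A larger figure reduces label congestion (the fix described under Challenges below)
plt.figure(figsize=(14, 10))
plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(top_words, coords):
    plt.annotate(word, (x, y), fontsize=9)
plt.title("t-SNE projection of the top 50 word embeddings")
plt.show()
```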

B. Challenges Encountered and How They Were Addressed:

Issues in visualization: One of the main challenges was maintaining clarity in the scatter plot after applying t-SNE. Initially, the word vectors appeared quite congested in the projected space, which made the visualization difficult to read. This was addressed by increasing the plot size, which made the visualization much clearer.
Issues in Text Pre-Processing: Another challenge was encountered in the pre-processing steps. While removing stopwords, certain words were missed. In addition, tokenization did not work correctly for languages other than English; this was resolved by downloading the punkt_tab resource, which provides additional tokenization data for other languages (a short snippet follows).
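For reference, a sketch of that fix, assuming NLTK is the tokenizer in use (as in the training sketch above); the German sentence is only an illustrative input.

```python
import nltk
from nltk.tokenize import word_tokenize

# punkt_tab supplies the tokenization tables required by newer NLTK versions
# and covers languages beyond English
nltk.download("punkt_tab")

# Example: tokenizing a non-English sentence with the matching language setting
tokens = word_tokenize("Die Einbettungen erfassen semantische Ähnlichkeit.", language="german")
```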

C. Observations from Embeddings:


Semantic Relationships: While the model successfully captured most of the semantic relationships (as shown by the clustering of contextually related words), the quality of some of the relationships was not as strong or consistent as expected. Words that should be semantically similar were often placed far apart in the embedding space, and clusters sometimes lacked clear cohesion.
