Assignment 2 Report
Training a Dense Embedding Model
The goal of this project is to train a Word2Vec model to learn semantic relationships between words from their embeddings. For this, the Word2Vec implementation from the Gensim library was used.
Model Selection: The Word2Vec model was trained with the skip-gram architecture, as skip-gram tends to perform better on smaller datasets, which was the case here: the corpus consists of 10 documents in the “extracted_data_text” folder. Skip-gram was chosen over the continuous bag-of-words (CBOW) architecture because it predicts the surrounding context words from the target word, which typically produces more useful embeddings for tasks such as semantic similarity.
Training: Before training, the text files were pre-processed: the text was cleaned and tokenized, stopwords were removed, and lemmatization was applied. After pre-processing, the Word2Vec model was trained with the following hyperparameters (a sketch of the pipeline follows the list):
Embedding dimension: 30. Each word is represented by a 30-dimensional vector.
Context window size: 5. The window captures the words surrounding the target word.
Min count: 2. Words that appear fewer than two times are excluded from the vocabulary.
Number of epochs: 10. The number of passes over the corpus during training.
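The following is a minimal sketch of how such a pre-processing and training pipeline could look with NLTK and Gensim under the stated hyperparameters; the file paths, the helper name preprocess, and the exact cleaning steps are assumptions for illustration, not the exact code used in this project.

    import os
    import re
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from gensim.models import Word2Vec

    # Requires the NLTK resources punkt_tab, stopwords, and wordnet
    # (downloadable via nltk.download(...)).
    STOPWORDS = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        # Lowercase, strip non-alphabetic characters, tokenize,
        # drop stopwords, and lemmatize the remaining tokens.
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        tokens = word_tokenize(text)
        return [lemmatizer.lemmatize(t) for t in tokens if t not in STOPWORDS]

    # One tokenized word list per document in the corpus folder.
    corpus = []
    for name in os.listdir("extracted_data_text"):
        with open(os.path.join("extracted_data_text", name), encoding="utf-8") as f:
            corpus.append(preprocess(f.read()))

    # Skip-gram (sg=1) with the hyperparameters listed above.
    model = Word2Vec(
        sentences=corpus,
        sg=1,            # skip-gram instead of CBOW
        vector_size=30,  # embedding dimension
        window=5,        # context window size
        min_count=2,     # ignore words appearing fewer than twice
        epochs=10,       # passes over the corpus
    )
    model.save("word2vec_skipgram.model")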
Evaluation: After training the Word2Vec model, the embeddings were evaluated visually. t-SNE (t-distributed Stochastic Neighbor Embedding), a dimensionality-reduction technique, was used to project the embeddings into two-dimensional space, and the top 50 words in the vocabulary were plotted.
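A minimal sketch of this evaluation step with scikit-learn and matplotlib is shown below; the perplexity value, random seed, and output file name are assumptions rather than the settings actually used.

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Take the 50 most frequent words and their trained vectors.
    words = model.wv.index_to_key[:50]
    vectors = model.wv[words]

    # Reduce the 30-dimensional embeddings to 2 dimensions.
    # perplexity must be smaller than the number of points (50 here).
    tsne = TSNE(n_components=2, perplexity=15, random_state=42)
    points = tsne.fit_transform(vectors)

    plt.scatter(points[:, 0], points[:, 1])
    for (x, y), word in zip(points, words):
        plt.annotate(word, (x, y))
    plt.title("t-SNE projection of the top 50 word embeddings")
    plt.savefig("tsne_top50.png")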
Issues in visualization: One of the foremost challenges was maintaining clarity in the scatter plot after applying t-SNE. Initially, the word vectors appeared congested in the two-dimensional space, which made the plot hard to read. This was addressed by increasing the plot size, which made the visualization considerably clearer.
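As an illustration, with matplotlib this fix amounts to requesting a larger canvas before drawing the same scatter plot; the exact figure dimensions below are an assumption.

    # Re-draw the scatter plot on a larger canvas so the labels spread out.
    plt.figure(figsize=(16, 12))  # inches; the default is roughly 6.4 x 4.8
    plt.scatter(points[:, 0], points[:, 1])
    for (x, y), word in zip(points, words):
        plt.annotate(word, (x, y))
    plt.savefig("tsne_top50_large.png")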
Issues in Text Pre-Processing: Another challenge was encountered in the pre-processing steps. While removing stopwords, certain words were missed. In addition, tokenization did not work correctly for languages other than English; this was resolved by downloading the punkt_tab resource, which provides additional tokenizer data for other languages.
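The fix described above can be reproduced with NLTK's download utility, as in the minimal sketch below; the German example sentence is purely illustrative.

    import nltk
    from nltk.tokenize import word_tokenize

    # punkt_tab bundles the Punkt tokenizer data,
    # including models for several non-English languages.
    nltk.download("punkt_tab")

    # Illustrative check: tokenizing a non-English sentence.
    print(word_tokenize("Dies ist ein kurzer Beispielsatz.", language="german"))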