
Word Embedding

A report by:
Naveen Kumar
Sarthak Sharma
Ayush Purohit
Topics to be covered:
Prerequisite

Problem of previous technique

About Word Embedding

Some word embedding techniques

Programming for word embedding

Graphical visualization
Prerequisite

Need for encoding techniques

Some commonly used encoding techniques:

1. One-Hot Encoding
2. Label Encoding
3. Mapping
Need for Encoding Techniques:
Gender City Age Income (k$) Buys Product

Male Delhi 25 50 Yes

Female Mumbai 30 60 No

Male Delhi 22 45 Yes

Female Bangalore 28 55 No

Male Mumbai 35 80 Yes

Female Delhi 24 48 Yes

Male Bangalore 40 90 No

Note: Machines only understand numbers.


One Hot Encoding
One-Hot Encoding is a way to convert categories into numbers by creating a new column for each
category.

If a category is present, we put 1, otherwise 0.

Gender_Male Gender_Female City_Delhi City_Mumbai City_Bangalore Age Income (k$) Buys Product

1 0 1 0 0 25 50 Yes

0 1 0 1 0 30 60 No

1 0 1 0 0 22 45 Yes

0 1 0 0 1 28 55 No

1 0 0 1 0 35 80 Yes

0 1 1 0 0 24 48 Yes

1 0 0 0 1 40 90 No
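
As a quick illustration, here is a minimal sketch of producing this one-hot table with pandas (the DataFrame below simply re-creates the example data; pd.get_dummies is the standard helper for this):

```python
# Minimal sketch: one-hot encoding the example table with pandas.
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female", "Male"],
    "City": ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai", "Delhi", "Bangalore"],
    "Age": [25, 30, 22, 28, 35, 24, 40],
    "Income (k$)": [50, 60, 45, 55, 80, 48, 90],
    "Buys Product": ["Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
})

# get_dummies creates one 0/1 column per category of the selected features
encoded = pd.get_dummies(df, columns=["Gender", "City"])
print(encoded)
```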
Label Encoding
Label Encoding is a technique used to convert categorical variables into numerical values by assigning each
unique category a distinct integer.

Key Points:

● Each category is mapped to a unique number.

● Typically used when the categorical feature has a meaningful order or only a few distinct values (for unordered categories, One-Hot Encoding is usually preferred).

● Fast and memory-efficient.

Category Encoded Value

Red 0

Blue 1

Green 2
Label Encoding Example

Gender City Age Income (k$) Buys Product

1 1 25 50 1

0 2 30 60 0

1 1 22 45 1

0 3 28 55 0

1 2 35 80 1

0 1 24 48 1

1 3 40 90 0
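
A minimal sketch of label encoding with scikit-learn's LabelEncoder (note that it assigns integers in alphabetical order, so the exact codes may differ from the 1/2/3 values shown in the table above):

```python
# Minimal sketch: label encoding the City column with scikit-learn.
from sklearn.preprocessing import LabelEncoder

cities = ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai", "Delhi", "Bangalore"]

encoder = LabelEncoder()
city_codes = encoder.fit_transform(cities)          # each unique city -> one integer

print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
# {'Bangalore': 0, 'Delhi': 1, 'Mumbai': 2}
print(city_codes)                                    # [1 2 1 0 2 1 0]
```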
Mapping
Mapping encoding is a technique to convert categorical data into numerical data by assigning specific
numbers to each category manually.

It gives flexibility to assign any number to any category based on your choice or logic.

Example:
Suppose we have City = {Delhi, Mumbai, Bangalore}.
We can map them in any way we choose, for example:
● Delhi → 10, Mumbai → 20, Bangalore → 30
● Delhi → 5, Mumbai → 6, Bangalore → 7
● Delhi → 1, Mumbai → 2, Bangalore → 3
Mapping used for the example table below:
Gender: Male → 5, Female → 10
City: Delhi → 100, Mumbai → 200, Bangalore → 300
Buys Product: Yes → 7, No → 3

Gender City Age Income (k$) Buys Product

5 100 25 50 7

10 200 30 60 3

5 100 22 45 7

10 300 28 55 3

5 200 35 80 7

10 100 24 48 7

5 300 40 90 3
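
A minimal sketch of this manual mapping with pandas .map(), using the same hand-picked numbers as above:

```python
# Minimal sketch: manual mapping encoding with pandas .map().
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "City": ["Delhi", "Mumbai", "Delhi", "Bangalore"],
    "Buys Product": ["Yes", "No", "Yes", "No"],
})

# The mapping is chosen manually, based on our own logic
gender_map = {"Male": 5, "Female": 10}
city_map = {"Delhi": 100, "Mumbai": 200, "Bangalore": 300}
buys_map = {"Yes": 7, "No": 3}

df["Gender"] = df["Gender"].map(gender_map)
df["City"] = df["City"].map(city_map)
df["Buys Product"] = df["Buys Product"].map(buys_map)
print(df)
```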
One Hot Encoding vs Label Encoding vs Mapping
What it does: Label Encoding gives a number to each value; One-Hot Encoding makes a new column for each value; with Mapping you decide the numbers yourself.

When to use: Label Encoding when categories have an order or only a few types; One-Hot Encoding when categories have no order; Mapping when you know the ranking.

Result: Label Encoding gives one column with numbers; One-Hot Encoding gives many columns with 0s and 1s; Mapping gives one column with custom numbers.

Memory use: Label Encoding uses very little; One-Hot Encoding uses more (because of more columns); Mapping uses very little.

Example: Label Encoding: Red → 0, Blue → 1, Green → 2; One-Hot Encoding: Red → [1,0,0], Blue → [0,1,0]; Mapping: Low → 5, Medium → 25, High → 150.
Problem with previous Encoding techniques
Label Encoding Problems:

● The machine may think the numbers have an order (like 0 < 1 < 2), even when there is no real order.

● This can confuse models and lead to wrong predictions.

One Hot Encoding Problems:

● Creates many new columns (especially if there are many categories).

● Takes more memory and slows down training.

● Can cause "curse of dimensionality" (too many features).

Mapping Problems:

● Depends on human logic; if wrong numbers are assigned, the model learns wrong patterns.

● Not automatic — needs manual work and good domain knowledge.


Main Drawback:

The previously discussed encoding techniques can convert categorical data into numerical form.

Previous methods => Categorical → Numerical form (good work)

But they do not tell us anything about the relationships / similarities between these words.

Hence, Word Embedding comes into the picture.

Word Embedding => Categorical → Numerical + Similarity (nice work)


Word Embedding
What is Word Embedding?

● Word Embedding is a way to turn words into numbers, but in a smart way.
● It gives similar words similar numbers.
● Instead of simple numbers (like 1, 2, 3), words are mapped into vectors (groups of numbers) that capture their meaning.

Key Points:

● It helps machine learning models understand the meaning of words.


● Used in NLP tasks like translation, sentiment analysis, chatbots, etc.

Need for Word Embedding?


● To reduce dimensionality
● To use a word to predict the words around it.
● Inter-word semantics must be captured.

How are Word Embeddings used?


● They are used as input to machine learning models:
take the words → get their numeric representations → use them for training or inference.
● They can also be used to represent or visualize underlying patterns of usage in the corpus that was used to train them.
Some Word Embedding Techniques
Word2Vec: Uses shallow neural networks to learn word associations. Models: CBOW (predicts the center word from its context) and Skip-Gram (predicts the context from the center word). Efficient at capturing semantic relationships.

GloVe: Matrix factorization technique based on word co-occurrence matrix, capturing both local and global
statistics.

FastText: Represents words as character n-grams, improving handling of out-of-vocabulary words and subword information (a short sketch follows after this list).

ELMo: Context-dependent embeddings from bidirectional LSTM, offering dynamic word representations based
on context.

BERT: Pre-trains deep bidirectional representations, using context from both directions to generate contextual
embeddings for superior NLP task performance.

Transformer Models (e.g., GPT-3, T5): Use attention mechanisms for generating highly accurate, complex, and
contextual embeddings.
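
As a small illustration of FastText's subword idea, here is a sketch using Gensim on a toy corpus (the sentences and hyperparameters are made up for the example). Because vectors are built from character n-grams, even a word never seen in training still gets an embedding:

```python
# Minimal sketch: FastText with Gensim, showing out-of-vocabulary handling.
from gensim.models import FastText

sentences = [["i", "love", "dogs"],
             ["i", "love", "cats"],
             ["dogs", "chase", "cats"]]

model = FastText(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["dogs"][:5])     # in-vocabulary word
print(model.wv["doggos"][:5])   # out-of-vocabulary word: vector built from its character n-grams
```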
Example for better clarity
Example Continued
Example Visualization
Programming - Part 1 - (From Scratch)
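
Below is one possible from-scratch sketch of Skip-Gram style embeddings on a toy corpus, using only NumPy (the corpus, vector size, and learning rate are assumptions chosen for illustration):

```python
# Minimal sketch: Skip-Gram style word embeddings from scratch with NumPy.
import numpy as np

corpus = ["i love dogs", "i love cats", "dogs and cats are animals"]
tokens = [sentence.split() for sentence in corpus]

# Build the vocabulary
vocab = sorted({w for sent in tokens for w in sent})
word2idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Build (center, context) training pairs with a window of 1
pairs = []
for sent in tokens:
    for i, center in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                pairs.append((word2idx[center], word2idx[sent[j]]))

# Two weight matrices: input (embedding) vectors and output vectors
dim, lr = 10, 0.05
rng = np.random.default_rng(42)
W_in = rng.normal(scale=0.1, size=(V, dim))
W_out = rng.normal(scale=0.1, size=(dim, V))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for epoch in range(200):
    for center, context in pairs:
        h = W_in[center]                     # center word's embedding
        probs = softmax(W_out.T @ h)         # predicted distribution over the vocabulary
        grad = probs.copy()
        grad[context] -= 1.0                 # gradient of cross-entropy loss w.r.t. the scores
        grad_in = W_out @ grad               # gradient w.r.t. the center embedding
        W_out -= lr * np.outer(h, grad)      # update output vectors
        W_in[center] -= lr * grad_in         # update the center word's embedding

print("Embedding for 'dogs':", W_in[word2idx["dogs"]])
```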
Programming - Part 2 - Using the Gensim Library (Word2Vec)
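
A minimal sketch using Gensim's Word2Vec (gensim 4.x API); the toy sentences and hyperparameters below are assumptions for illustration:

```python
# Minimal sketch: training Word2Vec with the Gensim library.
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "dogs"],
    ["i", "love", "cats"],
    ["dogs", "and", "cats", "are", "animals"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the embeddings
    window=2,         # context window size
    min_count=1,      # keep every word, even if it appears only once
    sg=1,             # 1 = Skip-Gram, 0 = CBOW
    epochs=100,
)

print(model.wv["dogs"])                        # the learned vector for "dogs"
print(model.wv.most_similar("dogs", topn=3))   # nearest words by cosine similarity
```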
Output: the learned vector for the queried word and its most similar words (exact values vary from run to run).
Example of Bag of Words
Definition:

● Represents text by counting word occurrences.


● Word order is ignored, only frequency matters.
● Each document is converted into a vector.

Example: Sentences:

● I love dogs
● I love cats

Vocabulary: [I, love, dogs, cats]

Word "I love dogs" "I love cats"

I 1 1

love 1 1

dogs 1 0

cats 0 1
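
This count table can be reproduced with scikit-learn's CountVectorizer, as in the sketch below (the token pattern is widened so the one-letter word "I" is not dropped, and text is lowercased by default):

```python
# Minimal sketch: Bag-of-Words counts with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love dogs", "I love cats"]

vectorizer = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs)       # sparse matrix of word counts

print(vectorizer.get_feature_names_out())     # ['cats' 'dogs' 'i' 'love']
print(counts.toarray())                       # [[0 1 1 1]
                                              #  [1 0 1 1]]
```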
Thank You 😊
