NLP MINI PROJECT
Submitted by
Lunawat Sujal Kailash
This is to certify that Lunawat Sujal Kailash has successfully completed the Mini Project entitled
"Multilingual Text Dataset Creation and Tokenization using Python" under the guidance of Prof.
Ramesh P. Daund in partial fulfillment of the requirements for the Final Year of Engineering in Computer
Engineering under Savitribai Phule Pune University during the academic year 2024–2025.
Date: ……………….
Place: Chandwad
With a deep sense of gratitude, I would like to thank all the people who have lit my path with their kind
guidance. I am very grateful to all those who did their best to help during my project work.
It is my proud privilege to express a deep sense of gratitude to Prof. Dr. R. G. Tated, Principal of SNJB’s Late
Sau KBJ COE, Chandwad, for his comments and kind permission to complete this project. I remain indebted
to Dr. K. M. Sanghavi, H.O.D., Computer Engineering Department, for her timely suggestions and valuable
guidance.
Special gratitude goes to Prof. Ramesh P. Daund for his excellent and invaluable guidance in completing
this work. I am also thankful to my parents for their constant support, and to my friends and everyone who
directly or indirectly contributed to the successful completion of this project.
1. Problem Statement
2. Introduction
3. Methodology
4. Challenges Faced
5. Conclusion
6. Python Code
1. Problem Statement
To collect and preprocess text data in Marathi, Hindi, and Gujarati, tokenize the text, and export the
cleaned and tokenized data in CSV or JSON format. The objective is to build a small multilingual dataset
suitable for downstream Natural Language Processing (NLP) tasks.
2. Introduction
Natural Language Processing (NLP) is a growing field in Artificial Intelligence (AI) that helps computers
understand and work with human language. In India, where people speak many different languages, it is
important to develop NLP tools for regional languages as well, not just English.
This project focuses on three Indian languages: Marathi, Hindi, and Gujarati. We collected text data,
cleaned it, split it into words (tokenized), and saved it in a structured format. The created dataset can be
used in applications like machine translation, sentiment analysis, or text classification.
This mini project gives practical experience in building and preparing multilingual datasets for future NLP
tasks.
3. Methodology
I. Text Collection
Collected text samples from public sources like books, news articles, and open datasets in Marathi, Hindi,
and Gujarati.
II. Preprocessing
Removed punctuation, special characters, and extra spaces, and lowercased any Latin-script text, using
basic Python string operations and regular expressions (lowercasing has no effect on Devanagari or
Gujarati script). A Unicode-aware cleaning variant is sketched below.
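One caveat with the clean_text regex shown later in this report: Python's re module does not treat combining vowel signs (matras) as word characters, so a pattern like [^\w\s] strips them along with the punctuation, which is visible in the cleaned output in Section 6. Below is a minimal sketch of an alternative that whitelists the Devanagari and Gujarati Unicode blocks instead; the helper name is illustrative and not part of the original project.

import re

# Keep Latin letters, whitespace, and the full Devanagari (U+0900-U+097F)
# and Gujarati (U+0A80-U+0AFF) blocks, so matras survive the cleaning step.
KEEP_PATTERN = re.compile(r"[^\sA-Za-z\u0900-\u097F\u0A80-\u0AFF]")
# The danda (।) and double danda (॥) sit inside the Devanagari block,
# so they must be removed explicitly before applying the whitelist.
DANDA_PATTERN = re.compile(r"[\u0964\u0965]")

def clean_text_unicode(text):
    """Illustrative Unicode-aware cleaner (name and approach are assumptions)."""
    text = DANDA_PATTERN.sub(" ", text)
    text = KEEP_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_text_unicode("भारत एक विविधतापूर्ण देश है।"))
# -> भारत एक विविधतापूर्ण देश है (vowel signs preserved)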
III. Tokenization
Used tokenizers from the spaCy and NLTK libraries. Language-specific tokenizers were used where
necessary to split text into tokens accurately; a short sketch follows.
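As a concrete illustration, spaCy ships rule-based tokenizers for Hindi ("hi"), Marathi ("mr"), and Gujarati ("gu") that work as blank pipelines without downloading a trained model. The snippet below is a sketch rather than the project's exact code; the sample sentences are the ones used elsewhere in this report.

import spacy

samples = {
    "mr": "भारत हा एक विविधतेने नटलेला देश आहे.",   # Marathi
    "hi": "भारत एक विविधतापूर्ण देश है।",           # Hindi
    "gu": "ગુજરાત એ ભારતનું એક રાજ્ય છે.",          # Gujarati
}

for lang, text in samples.items():
    nlp = spacy.blank(lang)  # tokenizer-only pipeline; no model download needed
    print(lang, [token.text for token in nlp(text)])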
IV. Dataset Creation
Stored the dataset in CSV and JSON formats with the following fields (an example record is shown after the list):
● Language
● Original Text
● Tokenized Text
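For illustration, a single JSON record in this schema might look as follows; the sentence and tokens are taken from the Hindi sample used later in this report.

{
    "language": "Hindi",
    "original_text": "भारत एक विविधतापूर्ण देश है।",
    "tokenized_text": ["भारत", "एक", "विविधतापूर्ण", "देश", "है।"]
}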
V. Visualization
Created word clouds and token frequency graphs using Python libraries to visualize commonly used
words in each language; a minimal word-cloud sketch follows.
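Rendering Indic scripts in a word cloud requires a font that actually contains the glyphs; the default font does not, which caused the display issues noted in the next section. A minimal sketch, assuming the wordcloud and matplotlib packages; the font path is an assumption and should point at any locally installed font covering Devanagari, such as Noto Sans Devanagari.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Illustrative tokens; in the project these come from the tokenized dataset
tokens = ["भारत", "देश", "भाषा", "लोक", "विविध"]

# font_path is an assumption: without a Devanagari-capable font,
# the cloud renders empty boxes instead of letters.
wc = WordCloud(font_path="NotoSansDevanagari-Regular.ttf",
               width=800, height=400,
               background_color="white").generate(" ".join(tokens))

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()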
4. Challenges Faced
● Difficulty in finding good quality open-source text data in regional languages.
● Display issues in word clouds for non-English scripts due to font compatibility.
5. Conclusion
This mini project covered all the basic steps in handling multilingual text data, from collection and
preprocessing to tokenization and visualization. It provided a hands-on understanding of working with
Indian languages in NLP and helped build a small but useful dataset for future research or application
development.
6. Python Code

import pandas as pd
import json

# Sample records (illustrative -- the report does not show how `data` was built)
data = [
    {"language": "Marathi", "original_text": "भारत हा एक विविधतेने नटलेला देश आहे."},
    {"language": "Hindi", "original_text": "भारत एक विविधतापूर्ण देश है।"},
    {"language": "Gujarati", "original_text": "ગુજરાત એ ભારતનું એક રાજ્ય છે."},
]

# Save as CSV
df = pd.DataFrame(data)
df.to_csv("text_samples.csv", index=False, encoding="utf-8")

# Save as JSON
with open("text_samples.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)
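As a quick sanity check (not part of the original report), the saved files can be read back with the same UTF-8 encoding:

# Read the files back to verify the round trip (illustrative)
df_check = pd.read_csv("text_samples.csv", encoding="utf-8")
print(df_check.head())

with open("text_samples.json", encoding="utf-8") as f:
    records = json.load(f)
print(len(records), "records loaded")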
import re

def clean_text(text):
    """Removes punctuation, numbers, and extra spaces from the text."""
    # Note: \w in Python's re does not match combining vowel signs (matras),
    # so this pattern strips them along with the punctuation (see output below).
    text = re.sub(r'[^\w\s]', '', text)        # Remove punctuation
    text = re.sub(r'\d+', '', text)            # Remove numbers
    text = re.sub(r'\s+', ' ', text).strip()   # Remove extra spaces
    return text

# Sample texts in Marathi, Hindi, and Gujarati
texts = [
    "भारत हा एक विविधतेने नटलेला देश आहे. येथे विविध धर्म, भाषा आणि संस्कृतींचे लोक एकत्र राहतात.",
    "भारत एक विविधतापूर्ण देश है। यहाँ विभिन्न धर्म, भाषाओं और संस्कृतियों के लोग एक साथ रहते हैं।",
    "ગુજરાત એ ભારતનું એક રાજ્ય છે. તે તેની સમૃદ્ધ સંસ્કૃતિ, ઇતિહાસ અને વારસા માટે જાણીતું છે.",
]
cleaned_texts = [clean_text(text) for text in texts]

# Display results
for original, cleaned in zip(texts, cleaned_texts):
    print(f"Original: {original}")
    print(f"Cleaned: {cleaned}\n")
Original: भारत हा एक विविधतेने नटलेला देश आहे. येथे विविध धर्म, भाषा आणि संस्कृतींचे लोक एकत्र राहतात
Cleaned: भरत ह एक ववधतन नटलल दश आह यथ ववध धरम भष आण ससकतच लक एकतर रहतत भरतच
Original: भारत एक विविधतापूर्ण देश है। यहाँ विभिन्न धर्म, भाषाओं और संस्कृतियों के लोग एक साथ रहते
Cleaned: भरत एक ववधतपरण दश ह यह वभनन धरम भषओ और ससकतय क लग एक सथ रहत ह भरत क
Original: ગુજરાત એ ભારતનું એક રાજ્ય છે. તે તેની સમૃદ્ધ સંસ્કૃતિ, ઇતિહાસ અને વારસા માટે જાણીતું
Cleaned: ગજરત એ ભરતન એક રજય છ ત તન સમદધ સસકત ઇતહસ અન વરસ મટ જણત છ ગજરતમ
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # Newer NLTK versions also need the punkt_tab resource
from nltk.tokenize import word_tokenize

# Tokenize text
tokenized_texts = [word_tokenize(text) for text in texts]

# Display results
for original, tokens in zip(texts, tokenized_texts):
    print(f"Original: {original}")
    print(f"Tokenized: {tokens}\n")
Original: भारत हा एक विविधतेने नटलेला देश आहे. येथे विविध धर्म, भाषा आणि संस्कृतींचे लोक एकत्र राहतात
Tokenized: ['भारत', 'हा', 'एक', 'विविधतेने', 'नटलेला', 'देश', 'आहे', '.', 'येथे', 'विविध
Original: भारत एक विविधतापूर्ण देश है। यहाँ विभिन्न धर्म, भाषाओं और संस्कृतियों के लोग एक साथ रहते
Tokenized: ['भारत', 'एक', 'विविधतापूर्ण', 'देश', 'है।', 'यहाँ', 'विभिन्न', 'धर्म', ',', 'भाषाओं
Original: ગુજરાત એ ભારતનું એક રાજ્ય છે. તે તેની સમૃદ્ધ સંસ્કૃતિ, ઇતિહાસ અને વારસા માટે જાણીતું
Tokenized: ['ગુજરાત', 'એ', 'ભારતનું', 'એક', 'રાજ્ય', 'છે', '.', 'તે', 'તેની', 'સમૃદ્ધ
import nltk
import string
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Tokenize and drop tokens that are pure ASCII punctuation
# (note: string.punctuation does not include the danda '।', so 'है।' survives)
tokenized_texts = [
    [tok for tok in word_tokenize(text) if tok not in string.punctuation]
    for text in texts
]

# Display results
for original, tokens in zip(texts, tokenized_texts):
    print(f"Original: {original}")
    print(f"Tokenized (without punctuation): {tokens}\n")
Original: भारत हा एक विविधतेने नटलेला देश आहे. येथे विविध धर्म, भाषा आणि संस्कृतींचे लोक एकत्र राहतात
Tokenized (without punctuation): ['भारत', 'हा', 'एक', 'विविधतेने', 'नटलेला', 'देश', '
Original: भारत एक विविधतापूर्ण देश है। यहाँ विभिन्न धर्म, भाषाओं और संस्कृतियों के लोग एक साथ रहते
Tokenized (without punctuation): ['भारत', 'एक', 'विविधतापूर्ण', 'देश', 'है।', 'यहाँ', '
Original: ગુજરાત એ ભારતનું એક રાજ્ય છે. તે તેની સમૃદ્ધ સંસ્કૃતિ, ઇતિહાસ અને વારસા માટે જાણીતું
Tokenized (without punctuation): ['ગુજરાત', 'એ', 'ભારતનું', 'એક', 'રાજ્ય', 'છે', '
import nltk
import string
from nltk.tokenize import word_tokenize

# Define stopwords manually for Hindi, Marathi, and Gujarati
# (since NLTK does not provide stopword lists for these languages)
hindi_stopwords = ["का", "के", "की", "है", "यह", "और", "को", "में", "से", "कि", "पर"]
marathi_stopwords = ["आहे", "आणि", "या", "मध्ये", "वरील", "साठी", "हा", "हे", "की", "ते"]
gujarati_stopwords = ["છે", "અને", "કે", "પર", "થી", "હા", "ના", "માં", "તે", "આ"]

# Filter stopwords (and punctuation) out of each tokenized text
all_stopwords = set(hindi_stopwords + marathi_stopwords + gujarati_stopwords)
filtered_texts = [
    [tok for tok in word_tokenize(text)
     if tok not in all_stopwords and tok not in string.punctuation]
    for text in texts
]

# Display results
for original, tokens in zip(texts, filtered_texts):
    print(f"Original: {original}")
    print(f"Tokenized (without stopwords): {tokens}\n")
Original: भारत हा एक विविधतेने नटलेला देश आहे. येथे विविध धर्म, भाषा आणि संस्कृतींचे लोक एकत्र राहतात
Tokenized (without stopwords): ['भारत', 'एक', 'विविधतेने', 'नटलेला', 'देश', 'येथे', '
Original: भारत एक विविधतापूर्ण देश है। यहाँ विभिन्न धर्म, भाषाओं और संस्कृतियों के लोग एक साथ रहते
Tokenized (without stopwords): ['भारत', 'एक', 'विविधतापूर्ण', 'देश', 'है।', 'यहाँ', 'विभिन्न
Original: ગુજરાત એ ભારતનું એક રાજ્ય છે. તે તેની સમૃદ્ધ સંસ્કૃતિ, ઇતિહાસ અને વારસા માટે જાણીતું
Tokenized (without stopwords): ['ગુજરાત', 'એ', 'ભારતનું', 'એક', 'રાજ્ય', 'તેની', '
import nltk
import spacy
import csv
import json
from nltk.tokenize import word_tokenize

# Build the dataset: one record per (language, text) pair
# (the languages list is assumed to align with the three sample texts)
languages = ["Marathi", "Hindi", "Gujarati"]
dataset = []
for lang, text in zip(languages, texts):
    tokens = word_tokenize(text)
    # Append to dataset
    dataset.append({"language": lang, "original_text": text, "tokenized_text": tokens})

# Save as CSV
csv_filename = "tokenized_dataset.csv"
with open(csv_filename, mode="w", encoding="utf-8", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["language", "original_text", "tokenized_text"])
    writer.writeheader()
    for row in dataset:
        writer.writerow(row)

# Save as JSON
json_filename = "tokenized_dataset.json"
with open(json_filename, mode="w", encoding="utf-8") as file:
    json.dump(dataset, file, ensure_ascii=False, indent=4)

# Display result
print(f"Dataset saved as '{csv_filename}' and '{json_filename}' successfully!")
Token frequency output (top 10):
व: 7
भ: 5
स: 5
ત: 5
क: 4
ह: 3
एक: 3
न: 3
ल: 3
त: 3
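The cell that printed these counts is not included above; a plausible reconstruction using collections.Counter, assuming the filtered_texts list built in the stopword-removal step, would be:

from collections import Counter

# Count token frequencies across all filtered texts and print the ten most common
freq = Counter(token for tokens in filtered_texts for token in tokens)
for token, count in freq.most_common(10):
    print(f"{token}: {count}")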
(Generating the word clouds emitted a long run of repeated IPython/matplotlib UserWarning messages from pylabtools.py, consistent with the missing-glyph font issue for Indic scripts described in the Challenges section.)