0% found this document useful (0 votes)

7 views

NLP MINI PROJECT

The document is a mini project report submitted by Lunawat Sujal Kailash to Savitribai Phule Pune University, focusing on the creation and tokenization of a multilingual text dataset in Marathi, Hindi, and Gujarati using Python. It details the problem statement, methodology, challenges faced, and includes Python code for data processing and visualization. The project aims to enhance Natural Language Processing (NLP) capabilities for regional Indian languages.

Uploaded by

25-A-Sujal Lunawat

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

7 views

NLP MINI PROJECT

Uploaded by

25-A-Sujal Lunawat

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 17

LP - VI (NLP)

Multilingual Text Dataset Creation and Tokenization using Python

Mini Project Report submitted to Savitribai Phule Pune University, Pune

In partial fulfillment for the award of Degree of Engineering in Computer Engineering

Submitted by

Lunawat Sujal Kailash

Under the Guidance of

Prof. Ramesh P. Daund

Designation of Guide: Assistant Professor

Academic Year : 2024-25

Department of Computer Engineering

SNJB’s Late Sau. Kantabai Bhavarlalji Jain, College of Engineering,

Chandwad, Dist. Nashik

CERTIFICATE
Department of Computer Engineering
SNJB’s Late Sau. Kantabai Bhavarlalji Jain, College of Engineering, Chandwad, Dist.
Nashik

This is to certify that Lunawat Sujal Kailash as successfully completed the Mini Project entitled
"Multilingual Text Dataset Creation and Tokenization using Python" under the guidance of Prof.
Ramesh P. Daund in partial fulfillment of the requirements for the Final Year of Engineering in Computer
Engineering under Savitribai Phule Pune University during the academic year 2024–2025.

Date: ……………….
Place: Chandwad

Sign of Guide HoD Principal

Ramesh P. Daund Dr. K. M. Sanghavi Dr. R. G. Tated
Acknowledgement

With a deep sense of gratitude, I would like to thank all the people who have lit our path with their kind
guidance. I am very grateful to these intellectuals who did their best to help during my project work.
It is my proud privilege to express a deep sense of gratitude to Prof. Dr. R. G. Tated, Principal of SNJB’s Late
Sau KBJ COE, Chandwad, for his comments and kind permission to complete this project. I remain indebted
to Dr. K. M. Sanghavi, H.O.D. Computer Engineering Department for her timely suggestion and valuable
guidance.

The special gratitude goes to Prof. Ramesh P. Daund for excellent and precious guidance on completion of
this work. I am also thankful to my parents for their constant support, and to my friends and everyone who
directly or indirectly contributed to the successful completion of this project.

Lunawat Sujal Kailash

Content:

1.Problem Statement

2.Introduction

3.Methodology

4.Challenges Faced

5.Conclusion

6.Python Code

7.Output & Visualizations

1. Problem Statement
To collect and preprocess text data in Marathi, Hindi, and Gujarati, tokenize the text, and export the
cleaned and tokenized data in CSV or JSON format. The objective is to build a small multilingual dataset
suitable for downstream Natural Language Processing (NLP) tasks.

2. Introduction
Natural Language Processing (NLP) is a growing field in Artificial Intelligence (AI) that helps computers
understand and work with human language. In India, where people speak many different languages, it is
important to develop NLP tools for regional languages as well, not just English.

This project focuses on three Indian languages: Marathi, Hindi, and Gujarati. We collected text data,
cleaned it, split it into words (tokenized), and saved it in a structured format. The created dataset can be
used in applications like machine translation, sentiment analysis, or text classification.

This mini project gives practical experience in building and preparing multilingual datasets for future NLP
tasks.

3. Steps
I. Text Collection
Collected text samples from public sources like books, news articles, and open datasets in Marathi, Hindi,
and Gujarati.

II. Preprocessing
Removed punctuation, special characters, extra spaces, and converted all text to lowercase using basic
Python string operations.

III. Tokenization
Used tokenization tools from spaCy and NLTK libraries. Language-specific tokenizers were used when
necessary to accurately split text into tokens.
IV. Dataset Creation
Stored the dataset in CSV and JSON formats with the following fields:
● Language

● Original Text

● Tokenized Text

V. Visualization
Created word clouds and token frequency graphs using Python libraries to visualize commonly used
words in each language.

4. Challenges Faced
● Difficulty in finding good quality open-source text data in regional languages.

● Language-specific tokenization required manual adjustments.

● Display issues in word clouds for non-English scripts due to font compatibility.

● Encoding problems while saving and reading multilingual text files.

5. Conclusion
This mini project covered all the basic steps in handling multilingual text data from collection and
preprocessing to tokenization and visualization. It provided a hands-on understanding of working with
Indian languages in NLP and helped build a small but useful dataset for future research or application
development.
import pandas as pd
import json

# Sample text data

data = [
{"language": "Marathi", "original_text": "भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध
{"language": "Hindi", "original_text": "भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं
{"language": "Gujarati", "original_text": "ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ�
]

# Save as CSV
df = pd.DataFrame(data)
df.to_csv("text_samples.csv", index=False, encoding="utf-8")

# Save as JSON
with open("text_samples.json", "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False, indent=4)

print("Text samples saved successfully!")

Text samples saved successfully!

import re

def clean_text(text):
"""Removes punctuation, numbers, and extra spaces from the text."""
text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
text = re.sub(r'\d+', '', text) # Remove numbers
text = re.sub(r'\s+', ' ', text).strip() # Remove extra spaces
return text

# Sample text data

texts = [
"भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात. भारताची
"भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�। भारत क�
"ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે . ગુજરાતમાં
]

# Clean the text

cleaned_texts = [clean_text(text) for text in texts]

# Display results
for original, cleaned in zip(texts, cleaned_texts):
print(f"Original: {original}")
print(f"Cleaned: {cleaned}\n")

Original: भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
Cleaned: भरत ह एक ववधतन नटलल दश आह यथ ववध धरम भष आण ससकतच लक एकतर रहतत भरतच

Original: भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते
Cleaned: भरत एक ववधतपरण दश ह यह वभनन धरम भषओ और ससकतय क लग एक सथ रहत ह भरत क

Original: ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું
Original: ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું
Cleaned: ગજરત એ ભરતન એક રજય છ ત તન સમદધ સસકત ઇતહસ અન વરસ મટ જણત છ ગજરતમ

import nltk
nltk.download('punkt')
nltk.download('punkt_tab') # Download the punkt_tab resource
from nltk.tokenize import word_tokenize

# Sample text data

# Tokenize text
tokenized_texts = [word_tokenize(text) for text in texts]

# Display results
for original, tokens in zip(texts, tokenized_texts):
print(f"Original: {original}")
print(f"Tokenized: {tokens}\n")

[nltk_data] Downloading package punkt to /root/nltk_data...

[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.
Original: भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
Tokenized: ['भारत', 'हा', 'एक', 'िविवधतेने', 'नटलेला', 'देश', 'आहे', '.', 'येथे', 'िविवध

Original: भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते
Tokenized: ['भारत', 'एक', 'िविवधतापूण�', 'देश', 'है।', 'यहाँ', 'िव�भ�', 'धम�', ',', 'भाषाओं

Original: ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું
Tokenized: ['ગુજરાત', 'એ', 'ભારતનું', 'એક', 'રા�ય', 'છે ', '.', 'તે', 'તેની', 'સમૃ�

import nltk
import string
nltk.download('punkt')

from nltk.tokenize import word_tokenize

# Sample text data

texts = [
"भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात.",
"भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�।",
"ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે ."
]

# Tokenize text and remove punctuation

tokenized_texts = [
[word for word in word_tokenize(text) if word not in string.punctuation]
[word for word in word_tokenize(text) if word not in string.punctuation]
for text in texts
]

# Display results
for original, tokens in zip(texts, tokenized_texts):
print(f"Original: {original}")
print(f"Tokenized (without punctuation): {tokens}\n")

Original: भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
Tokenized (without punctuation): ['भारत', 'हा', 'एक', 'िविवधतेने', 'नटलेला', 'देश', '

Original: भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते
Tokenized (without punctuation): ['भारत', 'एक', 'िविवधतापूण�', 'देश', 'है।', 'यहाँ', '

Original: ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું
Tokenized (without punctuation): ['ગુજરાત', 'એ', 'ભારતનું', 'એક', 'રા�ય', 'છે ', '

[nltk_data] Downloading package punkt to /root/nltk_data...

[nltk_data] Package punkt is already up-to-date!

import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download necessary resources

nltk.download('punkt')
nltk.download('stopwords')

# Define stopwords manually for Hindi, Marathi, and Gujarati (since NLTK does not prov
hindi_stopwords = ["का", "के", "क�", "है", "यह", "और", "को", "म�", "से", "िक", "पर",
marathi_stopwords = ["आहे", "आ�ण", "या", "म�ये", "वरील", "साठी", "हा", "हे", "क�", "ते
gujarati_stopwords = ["છે ", "અને", "કે ", "પર", "થી", "હા", "ના", "માં", "તે", "આ"]

# Combine all stopwords

stop_words = set(stopwords.words('english') + hindi_stopwords + marathi_stopwords + gu

# Sample text data

# Tokenize text and remove stopwords

filtered_texts = [
[word for word in word_tokenize(text) if word.lower() not in stop_words and
for text in texts
]

# Display results
for original, tokens in zip(texts, filtered_texts):
print(f"Original: {original}")
print(f"Tokenized (without stopwords): {tokens}\n")

Original: भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
Tokenized (without stopwords): ['भारत', 'एक', 'िविवधतेने', 'नटलेला', 'देश', 'येथे', '

Original: भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते
Tokenized (without stopwords): ['भारत', 'एक', 'िविवधतापूण�', 'देश', 'है।', 'यहाँ', 'िव�भ�

Original: ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું
Tokenized (without stopwords): ['ગુજરાત', 'એ', 'ભારતનું', 'એક', 'રા�ય', 'તેની', '

[nltk_data] Downloading package punkt to /root/nltk_data...

[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.

import nltk
from nltk.tokenize import word_tokenize

# Download necessary resources

nltk.download('punkt')

# Sample text data

# Tokenize text
tokenized_texts = [word_tokenize(text) for text in texts]

# Display results
for original, tokens in zip(texts, tokenized_texts):
print(f"Original: {original}")
print(f"Tokenized: {tokens}\n")

Original: भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
Tokenized: ['भारत', 'हा', 'एक', 'िविवधतेने', 'नटलेला', 'देश', 'आहे', '.', 'येथे', 'िविवध

[nltk_data] Downloading package punkt to /root/nltk_data...

[nltk_data] Package punkt is already up-to-date!

import nltk
import spacy
import csv
import json
import json
from nltk.tokenize import word_tokenize

# Download necessary resources

nltk.download('punkt')

# Load spaCy models for multilingual tokenization

nlp_hi = spacy.blank("hi") # Hindi (also works for Marathi)
nlp_gu = spacy.blank("gu") # Gujarati

# Sample text data

texts = [
("hi", "भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
("hi", "भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�।
("gu", "ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે
]

# Tokenize using NLTK & spaCy

dataset = []
for lang, text in texts:
if lang == "gu":
doc = nlp_gu(text) # Use Gujarati spaCy model
else:
doc = nlp_hi(text) # Use Hindi/Marathi spaCy model

tokenized_text = [token.text for token in doc]

# Append to dataset
dataset.append({"language": lang, "original_text": text, "tokenized_text": tokeniz

# Save as CSV
csv_filename = "tokenized_dataset.csv"
with open(csv_filename, mode="w", encoding="utf-8", newline="") as file:
writer = csv.DictWriter(file, fieldnames=["language", "original_text", "tokenized_
writer.writeheader()
for row in dataset:
writer.writerow(row)

# Save as JSON
json_filename = "tokenized_dataset.json"
with open(json_filename, mode="w", encoding="utf-8") as file:
json.dump(dataset, file, ensure_ascii=False, indent=4)

# Display result
print(f"Dataset saved as '{csv_filename}' and '{json_filename}' successfully!")

Dataset saved as 'tokenized_dataset.csv' and 'tokenized_dataset.json' successfully

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!

!sudo apt update

!sudo apt install fonts-noto
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import os
import os

# Given text data

texts = [
("mr", "भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
("hi", "भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�।")
("gu", "ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે ."
]

# Combine all text data into one string

combined_text = " ".join(text for _, text in texts)

# Set font path for full Indic script support

font_path = "/usr/share/fonts/truetype/noto/NotoSansDevanagari-Regular.ttf"

# Check if font exists; otherwise, install it

if not os.path.exists(font_path):
print("Please install NotoSansDevanagari and NotoSansGujarati fonts for proper ren

# Function to generate a word cloud

def generate_wordcloud(text, font_path):
wordcloud = WordCloud(
font_path=font_path,
width=1000,
height=600,
background_color="white",
max_words=200,
min_font_size=10,
colormap="viridis",
collocations=False
).generate(text)

# Display the Word Cloud

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# Generate and display the word cloud

generate_wordcloud(combined_text, font_path)

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632

Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:11 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:12 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,683 kB]
Get:13 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [2,788
Get:14 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [8,804 kB]
Get:15 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages
http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages
Get:16 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1
Get:17 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1,5
Get:18 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [4
Get:19 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [3,097 k
Fetched 30.1 MB in 4s (7,480 kB/s)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
37 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: Skipping acquire of configured file 'main/source/Sources' as repository '
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
fonts-noto-cjk fonts-noto-cjk-extra fonts-noto-color-emoji fonts-noto-core
fonts-noto-extra fonts-noto-mono fonts-noto-ui-core fonts-noto-ui-extra
fonts-noto-unhinted
The following NEW packages will be installed:
fonts-noto fonts-noto-cjk fonts-noto-cjk-extra fonts-noto-color-emoji
fonts-noto-core fonts-noto-extra fonts-noto-mono fonts-noto-ui-core
fonts-noto-ui-extra fonts-noto-unhinted
0 upgraded, 10 newly installed, 0 to remove and 37 not upgraded.
Need to get 317 MB of archives.
After this operation, 790 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 fonts-noto-core all 202012
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 fonts-noto all 2020122
Get:3 http://archive.ubuntu.com/ubuntu jammy/main amd64 fonts-noto-cjk all 1:20220
Get:4 http://archive.ubuntu.com/ubuntu jammy/main amd64 fonts-noto-cjk-extra all 1
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 fonts-noto-color-e
Get:6 http://archive.ubuntu.com/ubuntu jammy/universe amd64 fonts-noto-extra all 2
Get:7 http://archive.ubuntu.com/ubuntu jammy/main amd64 fonts-noto-mono all 202012
Get:8 http://archive.ubuntu.com/ubuntu jammy/main amd64 fonts-noto-ui-core all 202
Get:9 http://archive.ubuntu.com/ubuntu jammy/universe amd64 fonts-noto-ui-extra al
Get:10 http://archive.ubuntu.com/ubuntu jammy/universe amd64 fonts-noto-unhinted a
Fetched 317 MB in 11s (30.0 MB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
Selecting previously unselected package fonts-noto-core.
(Reading database ... 126213 files and directories currently installed.)
Preparing to unpack .../0-fonts-noto-core_20201225-1build1_all.deb ...
Unpacking fonts-noto-core (20201225-1build1) ...
Selecting previously unselected package fonts-noto.
Preparing to unpack .../1-fonts-noto_20201225-1build1_all.deb ...
Unpacking fonts-noto (20201225-1build1) ...
Selecting previously unselected package fonts-noto-cjk.
Preparing to unpack .../2-fonts-noto-cjk_1%3a20220127+repack1-1_all.deb ...
Unpacking fonts-noto-cjk (1:20220127+repack1-1) ...
Selecting previously unselected package fonts-noto-cjk-extra.
Preparing to unpack .../3-fonts-noto-cjk-extra_1%3a20220127+repack1-1_all.deb ...
Unpacking fonts-noto-cjk-extra (1:20220127+repack1-1) ...
Selecting previously unselected package fonts-noto-color-emoji.
Preparing to unpack .../4-fonts-noto-color-emoji_2.047-0ubuntu0.22.04.1_all.deb ..
Unpacking fonts-noto-color-emoji (2.047-0ubuntu0.22.04.1) ...
Selecting previously unselected package fonts-noto-extra.
Preparing to unpack .../5-fonts-noto-extra_20201225-1build1_all.deb ...
Unpacking fonts-noto-extra (20201225-1build1) ...
Selecting previously unselected package fonts-noto-mono.
Preparing to unpack .../6-fonts-noto-mono_20201225-1build1_all.deb ...
Unpacking fonts-noto-mono (20201225-1build1) ...
Selecting previously unselected package fonts-noto-ui-core.
Preparing to unpack .../7-fonts-noto-ui-core_20201225-1build1_all.deb ...
Unpacking fonts-noto-ui-core (20201225-1build1) ...
Selecting previously unselected package fonts-noto-ui-extra.
Preparing to unpack .../8-fonts-noto-ui-extra_20201225-1build1_all.deb ...
Unpacking fonts-noto-ui-extra (20201225-1build1) ...
Selecting previously unselected package fonts-noto-unhinted.
Preparing to unpack .../9-fonts-noto-unhinted_20201225-1build1_all.deb ...
Unpacking fonts-noto-unhinted (20201225-1build1) ...
Setting up fonts-noto-mono (20201225-1build1) ...
Setting up fonts-noto-color-emoji (2.047-0ubuntu0.22.04.1) ...
Setting up fonts-noto-ui-extra (20201225-1build1) ...
Setting up fonts-noto-extra (20201225-1build1) ...
Setting up fonts-noto-cjk (1:20220127+repack1-1) ...
Setting up fonts-noto-unhinted (20201225-1build1) ...
Setting up fonts-noto-ui-core (20201225-1build1) ...
Setting up fonts-noto-core (20201225-1build1) ...
Setting up fonts-noto-cjk-extra (1:20220127+repack1-1) ...
Setting up fonts-noto (20201225-1build1) ...
Processing triggers for fontconfig (2.13.1-4.2ubuntu5) ...

from wordcloud import WordCloud

import matplotlib.pyplot as plt

# Input texts in different languages

[
texts = [
("mr", "भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
("hi", "भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�।")
("gu", "ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે ."
]

# Combine all text into one

combined_text = " ".join([text for lang, text in texts])

# Path to a Unicode font supporting Devanagari & Gujarati scripts

font_path = "/path/to/NotoSansDevanagari-Regular.ttf" # Update this with the actual f

# Generate word cloud

wordcloud = WordCloud(
font_path=font_path, # Ensure correct font
width=1000,
height=500,
background_color="white",
colormap="viridis",
regexp=r"[\u0900-\u097F\u0A80-\u0AFF]+" # Supports Devanagari (Marathi, Hindi) an
).generate(combined_text)

# Display the word cloud

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
from collections import Counter
import re

# Given text data

# Combine all text data

combined_text = " ".join(text for _, text in texts)

# Preprocessing: Remove punctuation and split words

words = re.findall(r'\b\w+\b', combined_text)

# Count word frequency

word_freq = Counter(words)

# Display word frequency

for word, freq in word_freq.most_common(10): # Show top 10 words
print(f"{word}: {freq}")

व: 7
भ: 5
स: 5
ત: 5
क: 4
ह: 3
एक: 3
न: 3
ल: 3
त: 3

import matplotlib.pyplot as plt

from collections import Counter
import re

# Given text data

texts = [
("hi", "भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
("hi", "भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�।")
("gu", "ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે ."
]

# Combine all text data

combined_text = " ".join(text for _, text in texts)

# Tokenize words (extract words)

tokenized_words = re.findall(r'\b\w+\b', combined_text)

# Count word frequency

# Count word frequency
word_counts = Counter(tokenized_words)
common_words = word_counts.most_common(10) # Top 10 words

# Extract words and their counts

words, counts = zip(*common_words)

# Create a bar chart

plt.figure(figsize=(10, 5))
bars = plt.bar(words, counts, color="skyblue")

# Add frequency labels on top of bars

for bar, count in zip(bars, counts):
plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), str(count), ha='center

# Formatting the chart

plt.xlabel("Words")
plt.ylabel("Frequency")
plt.title("Top 10 Frequent Words")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot

plt.show()

/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarnin
fig.canvas.print_figure(bytes_io, **kw)
/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarnin
fig.canvas.print_figure(bytes_io, **kw)
/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarnin
fig.canvas.print_figure(bytes_io, **kw)
/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarnin
fig.canvas.print_figure(bytes_io, **kw)
/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarnin
fig.canvas.print_figure(bytes_io, **kw)
/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarnin
fig.canvas.print_figure(bytes_io, **kw)
/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarnin
fig.canvas.print_figure(bytes_io, **kw)
/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarnin
fig.canvas.print_figure(bytes_io, **kw)
/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarnin
fig.canvas.print_figure(bytes_io, **kw)
/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarnin
fig.canvas.print_figure(bytes_io, **kw)
/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarnin
fig.canvas.print_figure(bytes_io, **kw)
/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarnin
fig.canvas.print_figure(bytes_io, **kw)

Corpus Linguistics Volume 1
100% (3)
Corpus Linguistics Volume 1
796 pages
Language Detector: Bachelor of Engineering (Sem-VIII)
No ratings yet
Language Detector: Bachelor of Engineering (Sem-VIII)
10 pages
NLP - PBL - Project Report - Draft.02
No ratings yet
NLP - PBL - Project Report - Draft.02
32 pages
Text Classification and Processing using NLP
No ratings yet
Text Classification and Processing using NLP
21 pages
Project Report
No ratings yet
Project Report
12 pages
Report in ML
No ratings yet
Report in ML
9 pages
Lecture-1-Introduction To Natural Language Processing-2021
No ratings yet
Lecture-1-Introduction To Natural Language Processing-2021
46 pages
Towards Developing Tools For Indian Lang
No ratings yet
Towards Developing Tools For Indian Lang
59 pages
Natural Language Processing
No ratings yet
Natural Language Processing
8 pages
NLP Report
No ratings yet
NLP Report
44 pages
Language Translator Python
No ratings yet
Language Translator Python
25 pages
NLP Manual
No ratings yet
NLP Manual
34 pages
Multilingual_Mysteries_The_Art_of_Automated_Language_Identification
No ratings yet
Multilingual_Mysteries_The_Art_of_Automated_Language_Identification
6 pages
PROJECT REPORT For Machine Learning
100% (1)
PROJECT REPORT For Machine Learning
22 pages
Xyz
No ratings yet
Xyz
62 pages
YASH hpc final (1)
No ratings yet
YASH hpc final (1)
13 pages
final_final_final_final_final_Report
No ratings yet
final_final_final_final_final_Report
45 pages
Voice Based System Assistant Using NLP and Deep Learning-1
No ratings yet
Voice Based System Assistant Using NLP and Deep Learning-1
82 pages
PROJECT REPORT For Machine Learning
No ratings yet
PROJECT REPORT For Machine Learning
22 pages
CV Format
No ratings yet
CV Format
1 page
Satya Final Minor Report
100% (1)
Satya Final Minor Report
25 pages
Natural Language Processing (2) Finalll
No ratings yet
Natural Language Processing (2) Finalll
20 pages
A_Res
No ratings yet
A_Res
1 page
Arup_Das_Resume - Arup Das
No ratings yet
Arup_Das_Resume - Arup Das
2 pages
Resume
No ratings yet
Resume
2 pages
Sample Project Final Document
No ratings yet
Sample Project Final Document
68 pages
Text Modication Methods For Natural Language Generation: Universitat Autònoma de Barcelona
No ratings yet
Text Modication Methods For Natural Language Generation: Universitat Autònoma de Barcelona
44 pages
Temp Research Paper
No ratings yet
Temp Research Paper
5 pages
Mini Project Report Final
No ratings yet
Mini Project Report Final
25 pages
NLP Report
No ratings yet
NLP Report
20 pages
Techical Seminar Report sam_edit
No ratings yet
Techical Seminar Report sam_edit
16 pages
final report -12
No ratings yet
final report -12
60 pages
Machine Learning Project Report1
No ratings yet
Machine Learning Project Report1
20 pages
81 Cse e
No ratings yet
81 Cse e
5 pages
Final Research Paper
100% (1)
Final Research Paper
5 pages
batch11_review_ppt[1]
No ratings yet
batch11_review_ppt[1]
7 pages
Major Report9 Pages
No ratings yet
Major Report9 Pages
46 pages
Team8 Tamil LLM
No ratings yet
Team8 Tamil LLM
13 pages
Sanjulika Sharma MLE Data Scientist Resume
No ratings yet
Sanjulika Sharma MLE Data Scientist Resume
2 pages
Project Report -7_merged
No ratings yet
Project Report -7_merged
46 pages
PHD Regulation 2015
No ratings yet
PHD Regulation 2015
5 pages
Exploratory_Project_Report
No ratings yet
Exploratory_Project_Report
57 pages
Sample
No ratings yet
Sample
8 pages
UG-PP2-Report-Format-Part2_Abstract.docx
No ratings yet
UG-PP2-Report-Format-Part2_Abstract.docx
3 pages
draft sem 8
No ratings yet
draft sem 8
70 pages
Project Synopsis-1
No ratings yet
Project Synopsis-1
11 pages
Sentiment Analysis Using NLP
No ratings yet
Sentiment Analysis Using NLP
42 pages
Unit 4 Natural Language Generation
No ratings yet
Unit 4 Natural Language Generation
48 pages
NLP Interpreter Project Report
No ratings yet
NLP Interpreter Project Report
20 pages
Project Report on Natural Language Processing
No ratings yet
Project Report on Natural Language Processing
4 pages
NLP Manual (1-12) 2
No ratings yet
NLP Manual (1-12) 2
5 pages
Mainprojectsample Documentation.
No ratings yet
Mainprojectsample Documentation.
51 pages
mini-proj
No ratings yet
mini-proj
23 pages
NLP Lab Internal Question Bank
No ratings yet
NLP Lab Internal Question Bank
5 pages
NLP Lab Manual LP 6
No ratings yet
NLP Lab Manual LP 6
43 pages
Report Sentiment Analysis Using NLP and Deep Learning
No ratings yet
Report Sentiment Analysis Using NLP and Deep Learning
65 pages
Report SEM I
No ratings yet
Report SEM I
56 pages
Pretraining Data and Tokenizer for Indic LLM
No ratings yet
Pretraining Data and Tokenizer for Indic LLM
9 pages
NLP Mini Project
No ratings yet
NLP Mini Project
19 pages
Internals of Python 3.x: Derive Maximum Code Performance and Delve Further into Iterations, Objects, GIL, Memory management, and various Internals
From Everand
Internals of Python 3.x: Derive Maximum Code Performance and Delve Further into Iterations, Objects, GIL, Memory management, and various Internals
Prashanth Raghu
No ratings yet
Python GUI with PyQt: Learn to build modern and stunning GUIs in Python with PyQt5 and Qt Designer (English Edition)
From Everand
Python GUI with PyQt: Learn to build modern and stunning GUIs in Python with PyQt5 and Qt Designer (English Edition)
Saurabh Chandrakar
No ratings yet
Introduction to AI
No ratings yet
Introduction to AI
4 pages
Simplification Complex Sentences in Indonesia Language Using Rule-Based Reasoning PDF
No ratings yet
Simplification Complex Sentences in Indonesia Language Using Rule-Based Reasoning PDF
7 pages
Chat Bots Review 2019
No ratings yet
Chat Bots Review 2019
11 pages
AI Preboard II Paper G-10
No ratings yet
AI Preboard II Paper G-10
6 pages
ai &ml
No ratings yet
ai &ml
14 pages
Sentiment Mining Model For Opinionated Amharic Texts
No ratings yet
Sentiment Mining Model For Opinionated Amharic Texts
86 pages
Webanno: A Flexible, Web-Based and Visually Supported System For Distributed Annotations
No ratings yet
Webanno: A Flexible, Web-Based and Visually Supported System For Distributed Annotations
6 pages
2022 Sigul-1 18
No ratings yet
2022 Sigul-1 18
9 pages
Combating Depression in Students Using An Intelligent ChatBot A Cognitive Behavioral Therapy
No ratings yet
Combating Depression in Students Using An Intelligent ChatBot A Cognitive Behavioral Therapy
4 pages
Sih 1614
No ratings yet
Sih 1614
6 pages
AI and Its Implications For Market Knowledge in b2b Marketing
No ratings yet
AI and Its Implications For Market Knowledge in b2b Marketing
10 pages
Study On Ambiguity and NLP Application
No ratings yet
Study On Ambiguity and NLP Application
14 pages
FIGURATIVE LANGUAGES - Figurative Language Classifier and Translator Using Artificial Neural Network With Rule-Based Technique
No ratings yet
FIGURATIVE LANGUAGES - Figurative Language Classifier and Translator Using Artificial Neural Network With Rule-Based Technique
10 pages
Iit Data Science
No ratings yet
Iit Data Science
20 pages
WCAIMLDS Paris 2023 - Brochure
No ratings yet
WCAIMLDS Paris 2023 - Brochure
8 pages
Duolingo English Test: Technical Manual: Geoffrey T. Laflair and Burr Settles
No ratings yet
Duolingo English Test: Technical Manual: Geoffrey T. Laflair and Burr Settles
33 pages
Remarks - AI - LECTURES - DR ALI HUSSEIN HASAN - Upload by Diaa Saed
100% (1)
Remarks - AI - LECTURES - DR ALI HUSSEIN HASAN - Upload by Diaa Saed
93 pages
Grade 11
No ratings yet
Grade 11
24 pages
Expert system MCQs
No ratings yet
Expert system MCQs
5 pages
Arabic Word-Level Readability Visualization For Assisted Text Simplification
No ratings yet
Arabic Word-Level Readability Visualization For Assisted Text Simplification
8 pages
Research Proposal Sample
No ratings yet
Research Proposal Sample
6 pages
Natural Language Processing: A Beginner's Guide To Fundamentals of
No ratings yet
Natural Language Processing: A Beginner's Guide To Fundamentals of
14 pages
Ai Concept Paper
No ratings yet
Ai Concept Paper
3 pages
Ek-1 2209-A Arastirma Onerisi Formu 28.09.2022
No ratings yet
Ek-1 2209-A Arastirma Onerisi Formu 28.09.2022
11 pages
ICICC - 2023 - Without Ref For Plag
No ratings yet
ICICC - 2023 - Without Ref For Plag
4 pages
Gen AI
No ratings yet
Gen AI
35 pages
Sachin CV
No ratings yet
Sachin CV
2 pages
Everest Report
No ratings yet
Everest Report
52 pages
Spelling Bee Solver, Answers and Hints
No ratings yet
Spelling Bee Solver, Answers and Hints
7 pages