
LP - VI (NLP)

Multilingual Text Dataset Creation and Tokenization using Python

Mini Project Report submitted to Savitribai Phule Pune University, Pune

In partial fulfillment for the award of Degree of Engineering in Computer Engineering

Submitted by

Lunawat Sujal Kailash

Under the Guidance of

Prof. Ramesh P. Daund

Designation of Guide: Assistant Professor

Academic Year : 2024-25

Department of Computer Engineering


SNJB’s Late Sau. Kantabai Bhavarlalji Jain, College of Engineering,

Chandwad, Dist. Nashik


CERTIFICATE
Department of Computer Engineering
SNJB’s Late Sau. Kantabai Bhavarlalji Jain, College of Engineering, Chandwad, Dist.
Nashik

This is to certify that Lunawat Sujal Kailash has successfully completed the Mini Project entitled
"Multilingual Text Dataset Creation and Tokenization using Python" under the guidance of Prof.
Ramesh P. Daund in partial fulfillment of the requirements for the Final Year of Engineering in Computer
Engineering under Savitribai Phule Pune University during the academic year 2024–2025.

Date: ……………….
Place: Chandwad

Sign of Guide HoD Principal


Ramesh P. Daund Dr. K. M. Sanghavi Dr. R. G. Tated
Acknowledgement

With a deep sense of gratitude, I would like to thank all the people who have lit my path with their kind
guidance. I am very grateful to these intellectuals who did their best to help me during my project work.
It is my proud privilege to express a deep sense of gratitude to Prof. Dr. R. G. Tated, Principal of SNJB’s Late
Sau KBJ COE, Chandwad, for his comments and kind permission to complete this project. I remain indebted
to Dr. K. M. Sanghavi, H.O.D., Computer Engineering Department, for her timely suggestions and valuable
guidance.

Special gratitude goes to Prof. Ramesh P. Daund for his excellent and precious guidance throughout the
completion of this work. I am also thankful to my parents for their constant support, and to my friends and
everyone who directly or indirectly contributed to the successful completion of this project.

Lunawat Sujal Kailash


Content:

1. Problem Statement

2. Introduction

3. Methodology

4. Challenges Faced

5. Conclusion

6. Python Code

7. Output & Visualizations

1. Problem Statement
To collect and preprocess text data in Marathi, Hindi, and Gujarati, tokenize the text, and export the
cleaned and tokenized data in CSV or JSON format. The objective is to build a small multilingual dataset
suitable for downstream Natural Language Processing (NLP) tasks.

2. Introduction
Natural Language Processing (NLP) is a growing field in Artificial Intelligence (AI) that helps computers
understand and work with human language. In India, where people speak many different languages, it is
important to develop NLP tools for regional languages as well, not just English.

This project focuses on three Indian languages: Marathi, Hindi, and Gujarati. We collected text data,
cleaned it, split it into words (tokenized), and saved it in a structured format. The created dataset can be
used in applications like machine translation, sentiment analysis, or text classification.

This mini project gives practical experience in building and preparing multilingual datasets for future NLP
tasks.

3. Methodology
I. Text Collection
Collected text samples from public sources like books, news articles, and open datasets in Marathi, Hindi,
and Gujarati.

II. Preprocessing
Removed punctuation, special characters, extra spaces, and converted all text to lowercase using basic
Python string operations.
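
As an illustrative sketch (not the exact code from the code section below), such a cleaning step can be written with the standard re module. Two caveats are worth noting: lowercasing only affects Latin-script content, since Devanagari and Gujarati have no letter case, and Python's \w does not treat Indic combining vowel signs as word characters, so a blanket [^\w\s] pattern also strips matras; listing the punctuation to remove explicitly avoids that.

import re
import string

# Punctuation to strip: ASCII punctuation plus the Devanagari danda marks.
PUNCT_CLASS = "[" + re.escape(string.punctuation + "।॥") + "]"

def basic_clean(text):
    """Remove punctuation and digits, collapse whitespace, and lowercase."""
    text = re.sub(PUNCT_CLASS, " ", text)  # drop only the listed punctuation
    text = re.sub(r"\d+", " ", text)       # drop digits
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip().lower()            # lowercasing only affects Latin text

print(basic_clean("भारत एक विविधतापूर्ण देश है। यहाँ 28 राज्य हैं।"))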

III. Tokenization
Used tokenization tools from spaCy and NLTK libraries. Language-specific tokenizers were used when
necessary to accurately split text into tokens.
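
As an illustrative sketch (an assumption using spaCy's built-in rule-based tokenizers; the code section below maps Marathi onto the Hindi pipeline, while spaCy also ships a dedicated "mr" tokenizer), the language selection can be kept in a small dictionary:

import spacy

# Blank pipelines provide rule-based tokenization only, which is enough here.
tokenizers = {
    "mr": spacy.blank("mr"),  # Marathi
    "hi": spacy.blank("hi"),  # Hindi
    "gu": spacy.blank("gu"),  # Gujarati
}

def tokenize(lang, text):
    """Tokenize text with the pipeline registered for the language code."""
    return [token.text for token in tokenizers[lang](text)]

print(tokenize("gu", "ગુજરાત એ ભારતનું એક રાજ્ય છે."))
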
IV. Dataset Creation
Stored the dataset in CSV and JSON formats (a sample record is shown after the field list) with the following fields:
● Language

● Original Text

● Tokenized Text
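
For illustration, a single record with these fields (field names as used in the code section; the token list is abbreviated) serializes to JSON as shown below. Setting ensure_ascii=False keeps the Indic characters readable in the file.

import json

# A hypothetical record illustrating the dataset schema.
record = {
    "language": "Hindi",
    "original_text": "भारत एक विविधतापूर्ण देश है।",
    "tokenized_text": ["भारत", "एक", "विविधतापूर्ण", "देश", "है", "।"],
}
print(json.dumps(record, ensure_ascii=False, indent=2))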

V. Visualization
Created word clouds and token frequency graphs using Python libraries to visualize commonly used
words in each language.

4. Challenges Faced
● Difficulty in finding good quality open-source text data in regional languages.

● Language-specific tokenization required manual adjustments.

● Display issues in word clouds for non-English scripts due to font compatibility.

● Encoding problems while saving and reading multilingual text files (a mitigation sketch follows this list).
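
A short sketch of the mitigations used for the last two points (the font path is an assumption for an Ubuntu machine with the fonts-noto package installed; note that a single font file covers only one script, which is itself part of the display problem):

import pandas as pd
from wordcloud import WordCloud

# An explicit encoding prevents the platform default from corrupting the text.
df = pd.read_csv("text_samples.csv", encoding="utf-8")

# Devanagari-capable font; Gujarati text would additionally need NotoSansGujarati.
FONT = "/usr/share/fonts/truetype/noto/NotoSansDevanagari-Regular.ttf"
cloud = WordCloud(
    font_path=FONT,
    background_color="white",
    regexp=r"[\u0900-\u097F\u0A80-\u0AFF]+",  # keep Devanagari/Gujarati words intact
).generate(" ".join(df["original_text"]))
cloud.to_file("wordcloud.png")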

5. Conclusion
This mini project covered all the basic steps in handling multilingual text data, from collection and
preprocessing to tokenization and visualization. It provided a hands-on understanding of working with
Indian languages in NLP and helped build a small but useful dataset for future research or application
development.

6. Python Code
import pandas as pd
import json

# Sample text data


data = [
{"language": "Marathi", "original_text": "भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध
{"language": "Hindi", "original_text": "भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं
{"language": "Gujarati", "original_text": "ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ�
]

# Save as CSV
df = pd.DataFrame(data)
df.to_csv("text_samples.csv", index=False, encoding="utf-8")

# Save as JSON
with open("text_samples.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

print("Text samples saved successfully!")

Text samples saved successfully!

import re

def clean_text(text):
    """Removes punctuation, numbers, and extra spaces from the text."""
    text = re.sub(r'[^\w\s]', '', text)        # Remove punctuation
    text = re.sub(r'\d+', '', text)            # Remove numbers
    text = re.sub(r'\s+', ' ', text).strip()   # Remove extra spaces
    return text

# Sample text data


texts = [
"भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात. भारताची
"भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�। भारत क�
"ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે . ગુજરાતમાં
]

# Clean the text


cleaned_texts = [clean_text(text) for text in texts]

# Display results
for original, cleaned in zip(texts, cleaned_texts):
    print(f"Original: {original}")
    print(f"Cleaned: {cleaned}\n")

Original: भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
Cleaned: भरत ह एक ववधतन नटलल दश आह यथ ववध धरम भष आण ससकतच लक एकतर रहतत भरतच

Original: भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते
Cleaned: भरत एक ववधतपरण दश ह यह वभनन धरम भषओ और ससकतय क लग एक सथ रहत ह भरत क

Original: ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું
Cleaned: ગજરત એ ભરતન એક રજય છ ત તન સમદધ સસકત ઇતહસ અન વરસ મટ જણત છ ગજરતમ

import nltk
nltk.download('punkt')
nltk.download('punkt_tab') # Download the punkt_tab resource
from nltk.tokenize import word_tokenize

# Sample text data


texts = [
"भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात. भारताची
"भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�। भारत क�
"ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે . ગુજરાતમાં
]

# Tokenize text
tokenized_texts = [word_tokenize(text) for text in texts]

# Display results
for original, tokens in zip(texts, tokenized_texts):
    print(f"Original: {original}")
    print(f"Tokenized: {tokens}\n")

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt_tab.zip.
Original: भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
Tokenized: ['भारत', 'हा', 'एक', 'िविवधतेने', 'नटलेला', 'देश', 'आहे', '.', 'येथे', 'िविवध

Original: भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते
Tokenized: ['भारत', 'एक', 'िविवधतापूण�', 'देश', 'है।', 'यहाँ', 'िव�भ�', 'धम�', ',', 'भाषाओं

Original: ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું
Tokenized: ['ગુજરાત', 'એ', 'ભારતનું', 'એક', 'રા�ય', 'છે ', '.', 'તે', 'તેની', 'સમૃ�

import nltk
import string
nltk.download('punkt')

from nltk.tokenize import word_tokenize

# Sample text data


texts = [
    "भारत हा एक विविधतेने नटलेला देश आहे. येथे विविध धर्म, भाषा आणि संस्कृतींचे लोक एकत्र राहतात.",
    "भारत एक विविधतापूर्ण देश है। यहाँ विभिन्न धर्मों, भाषाओं और संस्कृतियों के लोग एक साथ रहते हैं।",
    "ગુજરાત એ ભારતનું એક રાજ્ય છે. તે તેની સમૃદ્ધ સંસ્કૃતિ, ઇતિહાસ અને વારસા માટે જાણીતું છે."
]

# Tokenize text and remove punctuation


tokenized_texts = [
    [word for word in word_tokenize(text) if word not in string.punctuation]
    for text in texts
]

# Display results
for original, tokens in zip(texts, tokenized_texts):
    print(f"Original: {original}")
    print(f"Tokenized (without punctuation): {tokens}\n")

Original: भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
Tokenized (without punctuation): ['भारत', 'हा', 'एक', 'िविवधतेने', 'नटलेला', 'देश', '

Original: भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते
Tokenized (without punctuation): ['भारत', 'एक', 'िविवधतापूण�', 'देश', 'है।', 'यहाँ', '

Original: ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું
Tokenized (without punctuation): ['ગુજરાત', 'એ', 'ભારતનું', 'એક', 'રા�ય', 'છે ', '

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Package punkt is already up-to-date!

import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download necessary resources


nltk.download('punkt')
nltk.download('stopwords')

# Define stopwords manually for Hindi, Marathi, and Gujarati (since NLTK does not provide them)
hindi_stopwords = ["का", "के", "की", "है", "यह", "और", "को", "में", "से", "कि", "पर"]
marathi_stopwords = ["आहे", "आणि", "या", "मध्ये", "वरील", "साठी", "हा", "हे", "की", "ते"]
gujarati_stopwords = ["છે", "અને", "કે", "પર", "થી", "હા", "ના", "માં", "તે", "આ"]

# Combine all stopwords


stop_words = set(stopwords.words('english') + hindi_stopwords + marathi_stopwords + gujarati_stopwords)

# Sample text data


texts = [
    "भारत हा एक विविधतेने नटलेला देश आहे. येथे विविध धर्म, भाषा आणि संस्कृतींचे लोक एकत्र राहतात.",
    "भारत एक विविधतापूर्ण देश है। यहाँ विभिन्न धर्मों, भाषाओं और संस्कृतियों के लोग एक साथ रहते हैं।",
    "ગુજરાત એ ભારતનું એક રાજ્ય છે. તે તેની સમૃદ્ધ સંસ્કૃતિ, ઇતિહાસ અને વારસા માટે જાણીતું છે."
]

# Tokenize text and remove stopwords


filtered_texts = [
    [word for word in word_tokenize(text) if word.lower() not in stop_words and word not in string.punctuation]
    for text in texts
]

# Display results
for original, tokens in zip(texts, filtered_texts):
    print(f"Original: {original}")
    print(f"Tokenized (without stopwords): {tokens}\n")

Original: भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
Tokenized (without stopwords): ['भारत', 'एक', 'िविवधतेने', 'नटलेला', 'देश', 'येथे', '

Original: भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते
Tokenized (without stopwords): ['भारत', 'एक', 'िविवधतापूण�', 'देश', 'है।', 'यहाँ', 'िव�भ�

Original: ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું
Tokenized (without stopwords): ['ગુજરાત', 'એ', 'ભારતનું', 'એક', 'રા�ય', 'તેની', '

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.

import nltk
from nltk.tokenize import word_tokenize

# Download necessary resources


nltk.download('punkt')

# Sample text data


texts = [
    "भारत हा एक विविधतेने नटलेला देश आहे. येथे विविध धर्म, भाषा आणि संस्कृतींचे लोक एकत्र राहतात.",
    "भारत एक विविधतापूर्ण देश है। यहाँ विभिन्न धर्मों, भाषाओं और संस्कृतियों के लोग एक साथ रहते हैं।",
    "ગુજરાત એ ભારતનું એક રાજ્ય છે. તે તેની સમૃદ્ધ સંસ્કૃતિ, ઇતિહાસ અને વારસા માટે જાણીતું છે."
]

# Tokenize text
tokenized_texts = [word_tokenize(text) for text in texts]

# Display results
for original, tokens in zip(texts, tokenized_texts):
    print(f"Original: {original}")
    print(f"Tokenized: {tokens}\n")

Original: भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
Tokenized: ['भारत', 'हा', 'एक', 'िविवधतेने', 'नटलेला', 'देश', 'आहे', '.', 'येथे', 'िविवध

Original: भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते
Tokenized: ['भारत', 'एक', 'िविवधतापूण�', 'देश', 'है।', 'यहाँ', 'िव�भ�', 'धम�', ',', 'भाषाओं

Original: ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું
Tokenized: ['ગુજરાત', 'એ', 'ભારતનું', 'એક', 'રા�ય', 'છે ', '.', 'તે', 'તેની', 'સમૃ�

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Package punkt is already up-to-date!

import nltk
import spacy
import csv
import json
from nltk.tokenize import word_tokenize

# Download necessary resources


nltk.download('punkt')

# Load spaCy models for multilingual tokenization


nlp_hi = spacy.blank("hi") # Hindi (also works for Marathi)
nlp_gu = spacy.blank("gu") # Gujarati

# Sample text data


texts = [
("hi", "भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
("hi", "भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�।
("gu", "ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે
]

# Tokenize using NLTK & spaCy


dataset = []
for lang, text in texts:
    if lang == "gu":
        doc = nlp_gu(text)  # Use Gujarati spaCy model
    else:
        doc = nlp_hi(text)  # Use Hindi/Marathi spaCy model

    tokenized_text = [token.text for token in doc]

    # Append to dataset
    dataset.append({"language": lang, "original_text": text, "tokenized_text": tokenized_text})

# Save as CSV
csv_filename = "tokenized_dataset.csv"
with open(csv_filename, mode="w", encoding="utf-8", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["language", "original_text", "tokenized_text"])
    writer.writeheader()
    for row in dataset:
        writer.writerow(row)

# Save as JSON
json_filename = "tokenized_dataset.json"
with open(json_filename, mode="w", encoding="utf-8") as file:
    json.dump(dataset, file, ensure_ascii=False, indent=4)

# Display result
print(f"Dataset saved as '{csv_filename}' and '{json_filename}' successfully!")

Dataset saved as 'tokenized_dataset.csv' and 'tokenized_dataset.json' successfully!


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
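
As a quick sanity check (an illustration added here, not part of the original notebook), the exported files can be read back to confirm that the scripts survived the round trip; the CSV stores the token list as its string representation, while the JSON keeps it as a real list.

import json
import pandas as pd

# Reload both exports and spot-check the first record.
df = pd.read_csv("tokenized_dataset.csv", encoding="utf-8")
with open("tokenized_dataset.json", encoding="utf-8") as f:
    records = json.load(f)

print(df[["language", "original_text"]].head())
print(records[0]["language"], records[0]["tokenized_text"][:5])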

!sudo apt update


!sudo apt install fonts-noto
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import os

# Given text data


texts = [
("mr", "भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
("hi", "भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�।")
("gu", "ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે ."
]

# Combine all text data into one string


combined_text = " ".join(text for _, text in texts)

# Set font path for full Indic script support


font_path = "/usr/share/fonts/truetype/noto/NotoSansDevanagari-Regular.ttf"

# Check if font exists; otherwise, install it


if not os.path.exists(font_path):
    print("Please install NotoSansDevanagari and NotoSansGujarati fonts for proper rendering.")

# Function to generate a word cloud


def generate_wordcloud(text, font_path):
    wordcloud = WordCloud(
        font_path=font_path,
        width=1000,
        height=600,
        background_color="white",
        max_words=200,
        min_font_size=10,
        colormap="viridis",
        collocations=False
    ).generate(text)

    # Display the word cloud
    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

# Generate and display the word cloud


generate_wordcloud(combined_text, font_path)

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  fonts-noto fonts-noto-cjk fonts-noto-cjk-extra fonts-noto-color-emoji
  fonts-noto-core fonts-noto-extra fonts-noto-mono fonts-noto-ui-core
  fonts-noto-ui-extra fonts-noto-unhinted
0 upgraded, 10 newly installed, 0 to remove and 37 not upgraded.
Need to get 317 MB of archives.
Setting up fonts-noto (20201225-1build1) ...
Processing triggers for fontconfig (2.13.1-4.2ubuntu5) ...
[apt update/install output abridged]

from wordcloud import WordCloud


import matplotlib.pyplot as plt

# Input texts in different languages


texts = [
("mr", "भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
("hi", "भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�।")
("gu", "ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે ."
]

# Combine all text into one


combined_text = " ".join([text for lang, text in texts])

# Path to a Unicode font supporting Devanagari & Gujarati scripts


font_path = "/path/to/NotoSansDevanagari-Regular.ttf"  # Update this with the actual font path

# Generate word cloud


wordcloud = WordCloud(
    font_path=font_path,  # Ensure correct font
    width=1000,
    height=500,
    background_color="white",
    colormap="viridis",
    regexp=r"[\u0900-\u097F\u0A80-\u0AFF]+"  # Supports Devanagari (Marathi, Hindi) and Gujarati scripts
).generate(combined_text)

# Display the word cloud


plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
from collections import Counter
import re

# Given text data


texts = [
("hi", "भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
("hi", "भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�।
("gu", "ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે
]

# Combine all text data


combined_text = " ".join(text for _, text in texts)

# Preprocessing: Remove punctuation and split words


words = re.findall(r'\b\w+\b', combined_text)

# Count word frequency


word_freq = Counter(words)

# Display word frequency


for word, freq in word_freq.most_common(10):  # Show top 10 words
    print(f"{word}: {freq}")

व: 7
भ: 5
स: 5
ત: 5
क: 4
ह: 3
एक: 3
न: 3
ल: 3
त: 3
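
The counts above come out as single characters rather than words: Python's \w does not match Indic combining vowel signs, so \b\w+\b splits each word at its matras (the same effect seen in the cleaned text earlier). A hedged alternative is to count the tokens already produced by a tokenizer instead of re-matching with a regex, for example:

from collections import Counter
import spacy

nlp_hi = spacy.blank("hi")  # rule-based tokenization is enough for counting

text = "भारत एक विविधतापूर्ण देश है। यहाँ विभिन्न धर्मों के लोग एक साथ रहते हैं।"
tokens = [t.text for t in nlp_hi(text) if not t.is_punct and not t.is_space]
print(Counter(tokens).most_common(5))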

import matplotlib.pyplot as plt


from collections import Counter
import re

# Given text data


texts = [
("hi", "भारत हा एक िविवधतेने नटलेला देश आहे. येथे िविवध धम� , भाषा आ�ण सं�कृत�चे लोक एक� राहतात
("hi", "भारत एक िविवधतापूण� देश है। यहाँ िव�भ� धम�, भाषाओं और सं�कृ�तय� के लोग एक साथ रहते ह�।")
("gu", "ગુજરાત એ ભારતનું એક રા�ય છે . તે તેની સમૃ� સં�કૃ ￵ત, ઇ￵તહાસ અને વારસા માટે �ણીતું છે ."
]

# Combine all text data


combined_text = " ".join(text for _, text in texts)

# Tokenize words (extract words)


tokenized_words = re.findall(r'\b\w+\b', combined_text)

# Count word frequency
word_counts = Counter(tokenized_words)
common_words = word_counts.most_common(10) # Top 10 words

# Extract words and their counts


words, counts = zip(*common_words)

# Create a bar chart


plt.figure(figsize=(10, 5))
bars = plt.bar(words, counts, color="skyblue")

# Add frequency labels on top of bars


for bar, count in zip(bars, counts):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height(), str(count), ha='center')

# Formatting the chart


plt.xlabel("Words")
plt.ylabel("Frequency")
plt.title("Top 10 Frequent Words")
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot


plt.show()

/usr/local/lib/python3.11/dist-packages/IPython/core/pylabtools.py:151: UserWarning
  fig.canvas.print_figure(bytes_io, **kw)
[the same UserWarning is repeated several more times; output abridged]
