Artificial Intelligence For Topic Modelling in Hin
Abstract
A distinct feature of Hindu religious and philosophical texts is that they come from a
library of texts rather than a single source. The Upanishads are known as some of the
oldest philosophical texts in the world and form the foundation of Hindu philosophy. The
Bhagavad Gita is a core text of Hindu philosophy and is known as a text that summarises
the key philosophies of the Upanishads, with a major focus on the philosophy of karma.
These texts have been translated into many languages and there exist studies about
themes and topics that are prominent; however, there has been little study of topic
modelling using language models that are powered by deep learning. In this paper, we
use advanced language models such as BERT to provide topic modelling of the key
texts of the Upanishads and the Bhagavad Gita. We analyse the distinct and
overlapping topics amongst the texts and visualise the link between selected texts of the
Upanishads and the Bhagavad Gita. Our results show a very high similarity between the
topics of these two texts, with a mean cosine similarity of 73%. We find that out of
the fourteen topics extracted from the Bhagavad Gita, nine have a cosine
similarity of more than 70% with the topics of the Upanishads. We also find that
topics generated by the BERT-based models show very high coherence compared to
those of conventional models. Our best-performing model gives a coherence score of 73%
on the Bhagavad Gita and 69% on the Upanishads. The visualisation of the
low-dimensional embeddings of these texts shows a clear overlap among their topics,
adding another level of validation to our results.
Author summary
Introduction
Philosophy of religion [1, 2] is a field of study that covers key themes and ideas in
religions and culture that relate to philosophical topics such as ethics and metaphysics.
Hindu philosophy [3–5] consists of schools developed for thousands of years which focus
on themes such as ethics [6], consciousness [4], karma [7, 8], logic and ultimate reality [5].
Hindu philosophy is at times referred to as Indian philosophy [9, 10]. The philosophy of
karma and reincarnation is central to Hindu philosophy [10]. The Upanishads form the
key texts of Hindu philosophy and are seen as the conclusion of the Vedas [11–15]. Hindu
1 Background
1.1 BERT language model
BERT is an attention-based Transformer model [44] for learning contextualized
language representations, where the vector representation of every input token
depends on the context of its occurrence in a sentence. The Transformer model [44]
builds on the encoder-decoder architecture [70] previously used with long short-term
memory (LSTM) recurrent neural networks [42, 69], and replaces recurrence with
attention. Transformer models implement the mechanism of attention by weighting the
significance of each part of the input data, which has made them prominent for language
modelling tasks [44, 71].
BERT is first trained to understand the language and its context (the pre-training
phase), after which it is fine-tuned to learn a specific task such as neural machine
translation (NMT) [46, 72–76], question answering [77–82], and sentiment
analysis [83–87]. The pre-training phase of BERT involves two different NLP tasks such
as seen in Equation 3.
$$\log p(D) \geq \sum_{m=1}^{M} \left( \mathbb{E}_{q_m}[\log p(\theta, z, w)] - \mathbb{E}_{q_m}[\log q_m(\theta, z)] \right) \tag{3}$$
LDA has been used for several language modeling tasks that include the study of the
relationship between two corpora using topic modeling [108] which is also the focus of
our study.
2 Methodology
2.1 Datasets
We evaluated a number of prominent translations of the Bhagavad Gita and the
Upanishads. In order to maintain the originality of the themes and ideas of these two
classical Indian texts, we used the older and the most popular translations for this work.
We chose Eknath Easwaran's translation since he translated both texts directly from
Sanskrit to English [109, 110]; hence, it would not introduce a
translation bias for topic modelling and comparison of the topics between the texts.
Eknath Easwaran (1910 – 1999) was a professor of English literature in India and later
moved to the United States where he translated these texts. In addition, we chose the
translation by Shri Purohit Swami and William Butler Yeats [111] for further
comparison. W. B. Yeats (1865 – 1939) was an Irish poet, dramatist and prose writer,
known as one of the foremost figures of 20th-century literature. Shri Purohit Swami
(1882 – 1941) was a Hindu teacher from Maharashtra, India. Their translation of the
Upanishads is special since it was done jointly by prominent Indian and
Irish scholars and captures Eastern and Western viewpoints. Table 1 provides further
details of the texts. Note that Shri Purohit Swami also translated the Bhagavad
Gita [112], which is not used in this work but can be used in future analysis.
The Bhagavad Gita consists of 18 chapters which feature a series of questions and
answers between Lord Krishna and Arjuna that cover a range of topics, including
the philosophy of karma. The Mahabharata war lasted for 18 days [114];
hence, the organisation of the Gita is symbolic.
The Upanishads [110] translated by Eknath Easwaran provides a commentary and
translation of the 11 major and 4 minor Upanishads. The 108 Upanishads [113] is a
collection of the translation and commentary of all 108 Upanishads in a single book
compiled by the Gita Society. The translation and commentary were done by a group of
spiritual teachers who tried to recover Upanishads that were believed to have been
lost; however, there are not many details about how they recovered
them [113]. The Chandogya Upanishad has the highest number of words, followed by the
Katha Upanishad and the Brihadaranyaka Upanishad. The Ten Principal
Upanishads [111] consists of the translation of the 10 major Upanishads. This text does
not have a separate explanation for each Upanishad, unlike the Upanishads by Eknath
Easwaran. The Brihadaranyaka Upanishad has the highest number of words,
followed by the Chandogya Upanishad and the Katha Upanishad. The Chandogya
2.2 Framework
Our major goal is to map the topics in the Bhagavad Gita to those of the Upanishads. We
begin by selecting 12 prominent Upanishads (Isha, Katha, Kena, Prashna, Mundaka, Mandukya,
Taittiriya, Aitareya, Chandogya, Brihadaranyaka, Brahma, Svetasvatara) from the text
translated by Eknath Easwaran [110]. The major reason that we selected both texts by the
same translator is to eliminate any translation bias in topic modelling.
However, we also considered the other translations listed in Table 1 and found
that this bias does affect the similarity matrix. For example, when we compared
the Upanishads by Eknath Easwaran with the Bhagavad Gita by the
same translator, the average similarity score was 3% higher than that between the
Bhagavad Gita by Eknath Easwaran and the Upanishads by Shri Purohit Swami. Finally, we
also present a visualization of the topic space of the 108 Upanishads and its different
parts, divided according to the Vedas from which these Upanishads originate.
Next, we present a framework that employs different machine learning methods for
topic modelling. Figure 1 presents the complete framework for the analysis and topic
modelling of the respective texts given in Table 1. In Figure 1, the first stage consists of
the conversion of PDF files and text pre-processing as discussed in the previous section. In
the second stage, we use two different sentence embedding models, 1.) the universal
sentence encoder (USE) and 2.) Sentence-BERT (SBERT), to generate the word and
document embeddings, which are then passed through the topic extraction pipeline to
generate the topic vectors. Finally, we compare our results with the classical topic
modelling algorithm LDA [47] across the different corpora. Our framework to generate
topics is similar to Top2Vec [52]; however, we also use other clustering algorithms such as
K-Means. First, USE and SBERT are used to generate the joint semantic embeddings
of documents and words. Since these embeddings are high-dimensional and the embedding
space is therefore very sparse, we reduce the dimension of the embeddings before searching
for dense areas. We use dimensionality reduction techniques such as UMAP and PCA for
reducing the high-dimensional embedding vectors generated by SBERT and
USE. Next, we find dense clusters of topics in the document vectors of the corpus using
clustering algorithms such as HDBSCAN and K-Means. Each cluster is represented by
the centroid of its document vectors in the original dimension, which is called the topic
vector [52]. Next, we find the top N (N = 50 in our case) nearest words to each topic
vector, which represent our final topic. Topic vectors also allow us to group similar topics
and hence reduce the number of topics using hierarchical topic reduction [52].
[Fig 1. Framework: PDF files → text files → preprocessing and cleaning → sentence
embeddings (SBERT, USE) → topic extraction pipeline → final topics → topic comparison.]
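The topic-vector step described above can be illustrated with a minimal sketch. It assumes that document and word embeddings already exist in a shared space and that cluster labels have already been produced by a clustering step; the two-dimensional toy vectors, the label `-1` for unclustered documents, and the function names are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_vectors(doc_vectors, labels):
    """Return one centroid per cluster; label -1 (assumed noise) is skipped."""
    groups = {}
    for vec, label in zip(doc_vectors, labels):
        if label == -1:
            continue
        groups.setdefault(label, []).append(vec)
    return {
        label: [sum(column) / len(vecs) for column in zip(*vecs)]
        for label, vecs in groups.items()
    }


def nearest_words(topic_vec, word_vectors, n):
    """Return the n words whose vectors are closest to the topic vector."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine(topic_vec, word_vectors[word]),
        reverse=True,
    )
    return ranked[:n]


# Toy embeddings (hypothetical values chosen only for illustration).
doc_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.5, 0.5]]
labels = [0, 0, 1, 1, -1]
word_vectors = {
    "meditation": [1.0, 0.0],
    "yoga": [0.8, 0.3],
    "death": [0.0, 1.0],
    "immortal": [0.2, 0.8],
}

topics = topic_vectors(doc_vectors, labels)
for label in sorted(topics):
    print(label, nearest_words(topics[label], word_vectors, n=2))
```

In the paper's setting, N would be 50 and the vectors would come from SBERT or USE; the sketch only shows how a centroid and its nearest words relate.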
Most topic modelling research [52, 116, 117] benchmarks model
results on pre-existing datasets such as the 20 News Groups dataset [118], the Yahoo
Answers dataset [119, 120], the Web Snippets dataset [121] and the W2E datasets [122].
These datasets have been prepared for algorithm benchmarking tasks and
consist of a fixed number of documents and words. The 20 News Groups dataset, for
example, consists of 15,465 documents and 4,159 words [116]. Tweets have also been
used for topic modelling tasks [123–125]. Jonsson et al. [123], for example, collected
tweets from Twitter to prepare a dataset of 129,530 tweets and used LDA [47], the
Biterm Topic Model (BTM) [124] and a variation of LDA for topic modelling
to compare their performance. In Twitter-based topic modelling datasets, a
tweet is considered as a document, though Jonsson et al. [123] aggregated documents to
form pseudo-documents and found that this addresses the poor performance of LDA on
shorter documents. Murakami et al. [126] used research papers published in the journal
Global Environmental Change (GEC) from the first volume (1990/1991) to Volume 20
(2010) as the corpus for topic modelling. They divided each paper into several
paragraph blocks and modelled them as the documents of the corpus.
Our dataset can be seen as similar to that of Murakami et al. [126]. The Bhagavad Gita and
the Upanishads are written in verse form and, to maintain the originality of the texts, most
of the translations also preserve the numbering of the verses. Other than the verses, the
translations also contain commentary by the translator of the texts. While creating the
datasets, we first created documents based on the verse number in the texts, i.e. a verse
is considered as a document of the corpus where the numbering is clearly mentioned.
In other cases, when verse numbers are not mentioned clearly, we considered one
paragraph as one document. In the case of the commentary, we split the commentary into
smaller parts to make each a document, as done by Murakami et al. [126]. The statistics
in terms of number of documents, number of words (# words), average number of words
(avg # words), and number of verses (# verses) of the different corpora (text files) and
their details can be found in Table 3.
1. Removing unicode characters generated in the text files due to noise in the PDF
files;
2. Normalizing (assigning uniform verses from each text) verse numbering in the
Upanishads and the Bhagavad Gita;
3. Replacing archaic English words such as "thy" and "thou" with modern
English words such as "your" and "you";
4. Removing punctuation and extra spaces, and lower-casing;
5. Removing repetitive and redundant sentences such as "End of the Commentary".
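The cleaning steps listed above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the `ARCHAIC` mapping contains only the two words named in the list, `BOILERPLATE` contains only the one example phrase, and step 2 (verse-number normalization) is left out because its rules are not described here.

```python
import string
import unicodedata

# Illustrative mappings; only the examples named in the text are included.
ARCHAIC = {"thy": "your", "thou": "you"}
BOILERPLATE = ("end of the commentary",)


def preprocess(text):
    """Apply steps 1, 3, 4 and 5 of the cleaning list to one passage."""
    # Step 1: drop characters that cannot be represented in ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Step 4 (part): lower-case before matching phrases and words.
    text = text.lower()
    # Step 5: remove repetitive boilerplate phrases.
    for phrase in BOILERPLATE:
        text = text.replace(phrase, " ")
    # Step 4 (part): remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Step 3 and step 4 (part): modernise archaic words, collapse extra spaces.
    return " ".join(ARCHAIC.get(token, token) for token in text.split())


print(preprocess("Thou art THY own friend. End of the Commentary."))
```

Stripping all non-ASCII characters is a blunt choice that also removes diacritics from transliterated Sanskrit; a real pipeline may want to whitelist those instead.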
Examples of selected text from the original documents along with the processed text
are shown in Table 2. The original text and processed text are available in the GitHub
repository 2. In the topic modelling literature, a word is the basic unit of data, which is
defined to be an item from a vocabulary indexed by $\{1, ..., V\}$, where $V$ is the
vocabulary size. A document is a collection of $N$ words represented by
$w = \{w_1, w_2, ..., w_N\}$, where $w_i$ is the $i$th word in the sequence. The corpus is
considered as a collection of $M$ documents denoted by $D = \{w_1, w_2, ..., w_M\}$ [47].
1 https://github.com/writecrow/ocr2text
2 https://github.com/sydney-machine-learning/topicmodelling_vedictexts/
3 Results
3.1 Data Analysis
We begin by reporting key features of the selected texts (datasets) as shown in Table 3.
The Upanishads by Eknath Easwaran contains 862 documents, 40737 words and 705
verses. Since this text also contains explanations by the author, the number of
documents is greater than the number of verses. The Ten Principal Upanishads
by W.B. Yeats and Shri Purohit Swami consists of 1267 documents and the same number
of verses. This corpus consists of 27492 words with an average of 21.70
words per document. The Bhagavad Gita by Eknath Easwaran consists of 700 verses
and the same number of documents, along with 20299 words with an average of 21.70
words per document.
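The corpus statistics reported above (documents, words, average words per document) can be computed with a few lines. The sketch below assumes each document is a whitespace-tokenised string after pre-processing; the function name and the toy documents are hypothetical.

```python
def corpus_statistics(documents):
    """Count documents and words, and average words per document."""
    n_docs = len(documents)
    n_words = sum(len(doc.split()) for doc in documents)
    return {
        "documents": n_docs,
        "words": n_words,
        "avg_words": round(n_words / n_docs, 2),
    }


# Toy corpus of two already-cleaned documents.
print(corpus_statistics(["the self is brahman", "know the self"]))

# The same ratio applied to the Ten Principal Upanishads figures quoted above.
print(round(27492 / 1267, 2))
```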
3 https://huggingface.co/sentence-transformers/
[Figure: two horizontal bar charts, one of bigrams and trigrams and one titled "Top 20 Words"; horizontal axes show counts.]
Fig 3. Visualisation of top 20 words, and top 10 bigrams and trigrams for the
Bhagavad Gita.
[Figure: two horizontal bar charts, one of bigrams and trigrams and one titled "Top 20 Words"; horizontal axes show counts.]
Fig 4. Visualisation of top 20 words, and top 10 bigrams and trigrams for the
Upanishads by Eknath Easwaran.
where the joint probability $P(w_i, w_j)$ and the probability of a single word $P(w_i)$
are calculated by the Boolean sliding window approach (window length $s$ set to the
default value of 110). Each window position is treated as a virtual document; the number
of virtual documents in which the word ($w_i$) or the word pair ($w_i, w_j$) occurs is
counted and then divided by the total number of virtual documents.
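The probability estimates described above can be sketched in a few lines. The NPMI formula used here, $\log\frac{P(w_i, w_j)}{P(w_i)P(w_j)} / -\log P(w_i, w_j)$, is the standard definition and is an assumption about the measure intended in this paper; the function names, the handling of the two degenerate cases, and the toy corpus with a window of 2 (instead of 110) are illustrative choices.

```python
import math
from itertools import combinations


def window_probabilities(documents, window_size):
    """Build a probability estimator from Boolean sliding windows.

    Each window position becomes one virtual document (a set of words);
    documents shorter than the window form a single virtual document.
    """
    windows = []
    for tokens in documents:
        if len(tokens) <= window_size:
            windows.append(set(tokens))
        else:
            for start in range(len(tokens) - window_size + 1):
                windows.append(set(tokens[start:start + window_size]))
    total = len(windows)

    def prob(*words):
        hits = sum(1 for w in windows if all(word in w for word in words))
        return hits / total

    return prob


def npmi(prob, wi, wj):
    """Normalised pointwise mutual information of a word pair, in [-1, 1]."""
    p_ij = prob(wi, wj)
    if p_ij == 0.0:
        return -1.0  # the words never co-occur
    if p_ij == 1.0:
        return 0.0  # degenerate case: both words occur in every window
    return math.log(p_ij / (prob(wi) * prob(wj))) / -math.log(p_ij)


def topic_coherence(prob, top_words):
    """Mean NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(prob, a, b) for a, b in pairs) / len(pairs)


docs = [["self", "truth", "self", "lord"], ["lord", "love", "war"]]
prob = window_probabilities(docs, window_size=2)
print(npmi(prob, "self", "truth"), topic_coherence(prob, ["self", "truth", "lord"]))
```

Averaging the per-topic coherence over all topics of a model gives a single score per model and corpus, which is how such a measure is typically used to compare models.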
We use TC-NPMI as the topic-coherence measure to evaluate different topic models
and tune the hyper-parameters of the different algorithms. Table 4 shows the value of
the metric for the different models on the different datasets. We trained the LDA model
for 200 iterations with the other hyper-parameters set to the default values given in the
gensim [146] library. We tuned the number-of-topics parameter to obtain the optimal
value of TC-NPMI.
Next, we evaluate different components in the BERT-based topic model presented
earlier (Figure 1), from which we have five major approaches: 1.)
SBERT-UMAP-HDBSCAN, 2.) SBERT-UMAP-KMeans, 3.) USE-UMAP-HDBSCAN,
4.) USE-UMAP-KMeans, and 5.) LDA. In Table 4, we observe that in the case of the
Bhagavad Gita, the combination of USE-UMAP-KMeans gives the best TC-NPMI score
on both the datasets, with a very slight difference when compared to
USE-UMAP-HDBSCAN and SBERT-UMAP-KMeans. Note that a higher TC-NPMI
value indicates more coherent topics. In the case of the Upanishads, we find a similar trend.
We also observe that LDA does not perform well, even after tuning the number-of-topics
parameter to optimize the topic coherence.
Although the use of KMeans for the clustering component gives the best result, we
choose USE-UMAP-HDBSCAN to find the topic similarity between the Upanishads and
the Bhagavad Gita in the next section. This is because HDBSCAN does not require us
to specify beforehand the number of clusters, which corresponds to the number of topics.
USE-UMAP-HDBSCAN gave 18 topics for the Upanishads corpus [110] at the
optimal value of the topic coherence reported in Table 4. Similarly, we obtained 14 topics
from the Bhagavad Gita [147]. In the case of the 108 Upanishads, which contains a larger
number of documents than the rest of the texts, we obtained more topics at the
optimal value of topic coherence. However, we reduced the number of topics using
hierarchical topic reduction [52] in some cases, for example, while comparing the topic
similarity of the Bhagavad Gita and the Upanishads. Since the number of documents
and words differs between corpora, as seen in Table 3, the number of topics
obtained also differs. For example, the Ten Principal
Upanishads has 1267 documents, for which we obtained 28 topics at the optimal
value of topic coherence. Similarly, the 108 Upanishads has 6191 documents, which
give 115 topics (Table 4) for the model SBERT-UMAP-HDBSCAN at the optimal
value of topic coherence. Also, while plotting the semantic space for the different topics
obtained by our model, as shown in Figure 8, Figure 10, and Figure 11, we reduced the
number of topics to 10 in order to visualize the topics' semantic space clearly.
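One way to picture the topic reduction mentioned above is the following sketch, which repeatedly merges the smallest topic into its most similar topic until a target count is reached. This follows the general idea of hierarchical topic reduction in Top2Vec [52], but the size-weighted averaging of topic vectors, the function names, and the toy values are assumptions made for illustration rather than the exact procedure used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def reduce_topics(topic_vecs, sizes, target):
    """Merge the smallest topic into its nearest topic until `target` remain.

    `sizes` holds the number of documents per topic. The merged vector is a
    size-weighted mean of the two topic vectors (an assumed simplification).
    """
    vecs = [list(v) for v in topic_vecs]
    sizes = list(sizes)
    while len(vecs) > target:
        small = min(range(len(vecs)), key=sizes.__getitem__)
        others = [i for i in range(len(vecs)) if i != small]
        nearest = max(others, key=lambda i: cosine(vecs[small], vecs[i]))
        total = sizes[small] + sizes[nearest]
        vecs[nearest] = [
            (a * sizes[small] + b * sizes[nearest]) / total
            for a, b in zip(vecs[small], vecs[nearest])
        ]
        sizes[nearest] = total
        del vecs[small]
        del sizes[small]
    return vecs, sizes


# Three toy topics reduced to two: the small middle topic joins the first.
vecs, sizes = reduce_topics([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], [10, 2, 8], target=2)
print(vecs, sizes)
```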
3.2.2 Topic similarity between the Bhagavad Gita and the Upanishads
There are studies that suggest that the Bhagavad Gita summarizes the key themes of
the Upanishads and various other Hindu texts [148–150]. The Bhagavad Gita, along
with the Upanishads and the Brahma Sutras, is known as the Prasthanatrayi [151–155]
(literally meaning the three points of departure [151], or the three sources [153]), which
make up the three foundational texts of the Vedanta school of Hindu
philosophy [13, 14, 149, 150, 156]. Sargeant et al. [148] stated that the Bhagavad Gita is
the summation of the Vedanta. Nicholson et al. [150] and Singh et al. [149] regarded the
Bhagavad Gita to be the key text of Vedanta.
Another source which discusses a direct relationship between the Bhagavad Gita and
the Upanishads is the Gita Dhyanam (also sometimes called Gita Dhyana and Dhyana
Slokas), which refers to the invocation of the Bhagavad Gita [147, 157, 158]. We need to
note that the Gita Dhyanam is an accompanying text with 9 verses used for prayer and
meditation that complements the Bhagavad Gita. These 9 verses are traditionally
attributed to Sri Madhusudana Sarasvati and are generally chanted by students of the
Gita before they start their daily studies [157]. These verses offer salutations to various
Hindu entities such as Vyasa, Lord Krishna, Lord Varuna, Lord Indra, Lord Rudra
and the Lord of the Maruta, and also characterise the relationship between the
Bhagavad Gita and the Upanishads. The 4th verse of the Gita Dhyanam states a direct
cow and milk relationship between the Upanishads and the Gita. Eknath
Easwaran [147] translated the 4th verse as "The Upanishads are the cows milked by
Gopala, the son of Nanda, and Arjuna is the calf. Wise and pure men drink the milk,
the supreme, immortal nectar of the Gita". Although these relationships have been
studied and retold for centuries, there are no existing studies that establish a
quantitative measure of this relationship using modern language models. Next, we
evaluate and discuss these relationships both quantitatively, using a mathematical
formulation, and qualitatively, by looking at the topics generated by our models as
shown in Tables 5, 6, and Figures 12, 13.
In order to evaluate the relationship between the Bhagavad Gita and the
Upanishads, we used the obtained topics to find a similarity matrix as shown in the
heatmap of Figure 6. The vertical axis of the heatmap shows the topics of the Bhagavad
Gita while the horizontal axis represents the topics of the Upanishads.
The heatmap represents the cosine similarity of the topic vectors obtained by the topic
model. For each of the topics obtained from the Bhagavad Gita, we calculate
its similarity with all the topics of the Upanishads and then find the topic with
maximum similarity. This operation can be mathematically represented by
Equation 5a. We represent the number of topics in the Gita by $N_{gita}$ and the number of
topics in the Upanishads by $N_{upan}$. For each topic $T_i^{gita}$ from the Bhagavad Gita,
we find the most similar topic $T_i^{upan}$ from the Upanishads. The topics and their
similarity scores can be found in Table 5 and Table 6. We observe a very high similarity
between the topics of the Bhagavad Gita and two different texts of the Upanishads (shown
in Table 5). Here, $V_i^{gita}$ and $V_i^{upan}$ represent the $i$th topic vectors of the
Bhagavad Gita and the Upanishads, respectively. Sim(.) represents the similarity measure
defined by Equation 6, which is cosine similarity in our case. There are various other
measures of similarity between two vectors; however, cosine similarity is used widely in the
literature [159–161]. One of the major reasons for this is its interpretability. In general,
the cosine similarity between any two vectors lies between −1 and 1; the scores reported in
Table 5 all lie between 0 and 1. A value closer to 1 indicates that the vectors point in
almost the same direction, and a value closer to 0 indicates that they are unrelated.
The cosine similarity between any two vectors U and V is given by Equation 6.
Since the topic vector contains contextual and thematic information about a topic, the
similarity score gives us the extent of closeness of the themes and topics of the Bhagavad
Gita and the Upanishads.
$$\mathrm{Sim}(\mathbf{U}, \mathbf{V}) = \cos(\theta) = \frac{\mathbf{U} \cdot \mathbf{V}}{\|\mathbf{U}\| \, \|\mathbf{V}\|} \tag{6}$$
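The matching procedure described above (for each Gita topic, take the Upanishads topic with the highest cosine similarity, then average the best scores) can be sketched as follows. The function names `most_similar_topics` and `avg_sim` and the toy two-dimensional topic vectors are hypothetical; the averaging step is an assumed reading of the mean similarity score (AvgSim) reported in the paper.

```python
import math


def cosine_similarity(u, v):
    """Equation 6: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar_topics(gita_topics, upan_topics):
    """For each Gita topic, return (gita index, best Upanishads index, score)."""
    matches = []
    for i, gita_vec in enumerate(gita_topics):
        scores = [cosine_similarity(gita_vec, upan_vec) for upan_vec in upan_topics]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, best, scores[best]))
    return matches


def avg_sim(matches):
    """Mean of the best-match similarity scores over all Gita topics."""
    return sum(score for _, _, score in matches) / len(matches)


# Toy topic vectors (hypothetical values).
gita = [[1.0, 0.0], [0.0, 1.0]]
upan = [[1.0, 1.0], [0.0, 2.0]]
matches = most_similar_topics(gita, upan)
print(matches, avg_sim(matches))
```

Note that the matching is one-directional: several Gita topics may map to the same Upanishads topic, which is also what Table 5 shows.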
We can observe from Table 5 that a number of the topics of the Bhagavad Gita are
similar to the topics of the Upanishads with more than 70% similarity. We also find
that topic 4 of the Bhagavad Gita is similar to topic 5 of the Upanishads
(Eknath Easwaran) with a similarity of 90%. We can see that both of these topics
contain almost the same words. Similarly, topic-5 of the Bhagavad Gita has a similarity
of 86% when compared with topic 6 of the Upanishads. Both of these topics are
related to immortality and death. The similarity can be observed via Table 5; for
example, topic-1 of both the Bhagavad Gita and the Upanishads (Eknath Easwaran)
consists of words related to Hindu deities and entities such as Krishna, Arjuna,
Vishnu and Samashrava, and the two topics have a similarity of 76%.
Figure 8 represents a visualization of the semantic space of the Bhagavad Gita and
the Upanishads with given topic labels. Although we find in Table 4 that the Bhagavad
Gita and the Upanishads gave 14 and 18 topics respectively, we present only 10 topics
from each of these texts in order to have a clear visualization. Each dot in the diagram
represents the two-dimensional (2D) embedding of one of the documents of the corpus.
These topics can be seen in Figure 12 along with some of the most relevant documents
of the text with their source. Figure 12 represents the themes related to the deities and
the entities of Hindu philosophy. We can also observe that the documents relevant to
topic-1 originate from chapters 1, 3 and 10. These are all verses
containing the names of Hindu deities. Topic-2 of the same table encapsulates the
idea of self, worship, desire and fulfillment. A similar pattern can be observed in Table 6,
which represents the topics and documents of the Ten Principal Upanishads [111].
Fig 6. Heatmap showing the similarity between different topics of Bhagavad Gita and Upanishads generated from a selected
approach (SBERT-UMAP-HDBSCAN). Vertical axis: topic vectors of the Bhagavad Gita; horizontal axis: topic vectors of the
Upanishads (topic-1 to topic-18).
Gita topic ID | Topics of Gita | Upanishads topic ID | Most similar topics in Upanishads | Similarity score
topic-1 | krishna, jayadratha, shraddha, ahamkara, arjuna, ikshvaku, sankhya, ashvattha, kusha, vishnu | topic-1 | sage, wisdom, devotee, sages, vishnu, mahabharata, devotees, samashrava, hindu, mahavakyas, theravada | 0.76
topic-2 | selfless, selflessly, selfish, selfishly, desires, unkindness, desire, suffering, greed, themselves | topic-11 | desires, happiness, eternal, selfless, beings, spiritual, existence, spirituality, desire, joy, eternity, buddhism | 0.76
topic-3 | worships, worship, devotion, devotees, eternal, myself, beings, eternity, spiritually, spiritual | topic-3 | eternal, divine, deity, eternity, lords, lord, everlasting, devotional, omnipotent, gods, beings, soul, beloved | 0.75
topic-4 | meditation, meditate, spiritually, spiritual, yoga, minds, asceticism, spirit, nirvana, wisdom | topic-5 | meditation, meditating, meditates, meditate, meditated, minds, mind, spiritually, interiorize, enlightenment, spiritual | 0.90
topic-5 | immortality, death, mortality, immortal, deathless, eternity, eternal, dying, mortal, dead, mortals | topic-6 | immortality, death, immortal, mortality, deathless, mortal, dying, mortals, eternity, deathlessness, eternal | 0.86
topic-6 | gods, eternal, universe, beings, eternity, heavens, celestial, immortality, heavenly, divine, god | topic-8 | celestial, sun, heavens, earth, earthly, heaven, heavenly, luminous, sunrise, sky, universe, illumined, light, illumine | 0.80
topic-7 | brahman, wisdom, devotees, brahma, devotee, teachings, sages, worships, divine, devote | topic-1 | sage, wisdom, devotee, sages, vishnu, mahabharata, devotees, samashrava, hindu, mahavakyas, theravada, hindus, buddhi | 0.74
topic-8 | existence, universe, beings, eternal, nonexistence, immortality, creatures, eternity, creature, cosmos | topic-7 | universe, omnipotent, eternal, cosmos, eternity, beings, cosmic, immortal, gods, celestial, deity, beyondness, god, heavens | 0.73
topic-9 | ignorance, ignorant, wisdom, delusions, darkness, delusion, evils, intellects, eternal, asceticism | topic-5 | meditation, meditating, meditates, meditate, meditated, minds, mind, spiritually, interiorize, enlightenment, spiritual | 0.67
topic-10 | senses, sense, feeling, selflessly, selfless, selfishly, feel, selfish, minds, oneself, themselves, perceive | topic-5 | meditation, meditating, meditates, meditate, meditated, minds, mind, spiritually, interiorize, enlightenment, spiritual | 0.78
topic-11 | enemy, enemies, conquer, defeat, fight, fighting, conquered, battle, fought, nonviolence, dishonor | topic-6 | immortality, death, immortal, mortality, deathless, mortal, dying, mortals, eternity, deathlessness, eternal, dead, deaths | 0.60
topic-12 | forgiving, renunciation, fulfill, selfless, nonbeing, unpleasant, selflessly, fulfilling, insatiable, indulging | topic-14 | selfs, self, selfless, oneself, himself, themselves, selfish, ego, itself, egoism, yourself, independently, ourselves, autonomic | 0.55
topic-13 | actions, act, action, acts, acting, inaction, selflessly, selfless, themselves, unaffected, indifference, ignorance | topic-14 | selfs, self, selfless, oneself, himself, themselves, selfish, ego, itself, egoism, yourself, independently, ourselves, autonomic | 0.60
topic-14 | beings, spiritual, gods, divine, heavens, ocean, spiritually, shudra, ahamkara, rudras, sacred, worships | topic-1 | sage, wisdom, devotee, sages, vishnu, mahabharata, devotees, samashrava, hindu, mahavakyas, theravada, hindus, buddhi | 0.68
Mean Similarity Score (AvgSim): 0.73
Table 5. Topics of the Bhagavad Gita (Eknath Easwaran) with most similar topics from the Upanishads (Eknath Easwaran)
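The summary figures quoted for Table 5 can be checked directly from its similarity-score column. The short sketch below copies the fourteen scores from the table; the variable names are illustrative.

```python
# Similarity scores for Gita topic-1 to topic-14, copied from Table 5.
scores = [0.76, 0.76, 0.75, 0.90, 0.86, 0.80, 0.74,
          0.73, 0.67, 0.78, 0.60, 0.55, 0.60, 0.68]

mean_score = sum(scores) / len(scores)
above_70 = sum(1 for s in scores if s > 0.70)

print(round(mean_score, 2), above_70)  # prints: 0.73 9
```

The result agrees with the reported mean similarity of 0.73 and with nine of the fourteen topics exceeding 70% similarity.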
Fig 7. Heatmap showing the similarity between different topics of Bhagavad Gita (Eknath Easwaran) and the Ten Principal
Upanishads (Shri Purohit Swami) generated from a selected approach (SBERT-UMAP-HDBSCAN). Vertical axis: topic vectors of
the Bhagavad Gita; horizontal axis: topic vectors of the Upanishads (topic-1 to topic-18).
Upanishads that fall under 4 different categories identified by the four Vedas [12] (Rig
Veda, Sama Veda, Yajur Veda, Atharva Veda), which are known as the founding texts of
Hinduism. The Rig Veda is the oldest Hindu text, written in ancient Sanskrit and
believed to have been transmitted orally through the guru-student tradition of
mantra recital [162] for thousands of years before being written down [11]. It has been
difficult to translate the Vedas and to understand the significance of certain aspects of
them, since they were written in ancient Sanskrit in verse with symbolism [163]. The
Upanishads are known as the texts that explain the philosophy of the Vedas and are also
known as the concluding chapters that were added to the four Vedas [164]. Table 7 gives
information about how the 108 Upanishads have been grouped according to their historical
relevance to the respective Vedas.
Figure 11 presents a visualization of the semantic space of the different parts (divided by
the 4 Vedas as shown in Table 7) of the 108 Upanishads.
4 Discussion
The high level of semantic and topic similarity between the Bhagavad Gita and the
different sets of the Upanishads by the respective authors is not surprising. It verifies
well-known thematic similarities as pointed out by Hindu scholars such as Swami
Vivekananda [165] and western scholars [14]. The Bhagavad Gita is well known as the
central text of Hinduism that summarises the rest of the Vedic corpus. The Bhagavad
Gita is a conversation between Lord Krishna and Arjuna in a situation where Arjuna
has to go to war. The Bhagavad Gita is a chapter from the Mahabharata that uses a
conflict to summarize the philosophy of the Upanishads and the Vedic corpus. The
Mahabharata is one of the oldest and longest texts written in verse form in Sanskrit,
and describes a historical event (118,087 sentences, 2,858,609 words) [166]. We note
that most ancient Hindu texts were composed in verse so that they could be sung and
remembered through an oral tradition in the absence of a writing system.
The goal of Lord Krishna was to motivate Arjuna to do his duty (karma) and go to
war to protect ethical standards (dharma) in society. Arjuna, in the Bhagavad Gita,
begins by renouncing his duties as a warrior. We note that the Mahabharata war is
believed to have taken place after the Vedas and the Upanishads were composed. Note that
composition does not mean that these texts were written down; they were sung, and verses
became key mantras that were remembered through a guru-student tradition for
thousands of years. There are accounts where the Vedas are mentioned in the
Mahabharata. Hence, Krishna is known as a student of the Vedic corpus which also
Acknowledgement
We thank Shweta Bindal from the Indian Institute of Technology Guwahati for contributing
to discussions about the workflow used in this work.
Appendix
References
1. Meister C. Introducing philosophy of religion. Routledge; 2009.
2. Reese WL. Dictionary of philosophy and religion: Eastern and Western thought.
1996;.
8. Mulla ZR, Krishnan VR. Karma Yoga: A conceptualization and validation of the
Indian philosophy of work. Journal of Indian Psychology. 2006;24(1/2):26–43.
4 https://github.com/sydney-machine-learning/topicmodelling_vedictexts
49. Silveira R, Fernandes C, Neto JAM, Furtado V, Pimentel Filho JE. Topic
Modelling of Legal Documents via LEGAL-BERT. Proceedings http://ceur-ws.org,
ISSN 1613-0073. 2021;.
50. Peinelt N, Nguyen D, Liakata M. tBERT: Topic models and BERT joining
forces for semantic similarity detection. In: Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics; 2020. p. 7047–7055.
51. Grootendorst M. BERTopic: leveraging BERT and c-TF-IDF to create easily
interpretable topics (2020). URL https://doi.org/10.5281/zenodo.4381785.
52. Angelov D. Top2vec: Distributed representations of topics. arXiv preprint
arXiv:2008.09470. 2020;.
53. Sia S, Dalmia A, Mielke SJ. Tired of Topic Models? Clusters of Pretrained
Word Embeddings Make for Fast and Good Topics too! arXiv preprint
arXiv:2004.14914. 2020;.
54. Thompson L, Mimno D. Topic modeling with contextualized word
representation clusters. arXiv preprint arXiv:2010.12626. 2020;.
55. Grootendorst M. BERTopic: Neural topic modeling with a class-based TF-IDF
procedure. arXiv preprint arXiv:2203.05794. 2022;.
56. Glazkova A. Identifying topics of scientific articles with BERT-based approaches
and topic modeling. In: Pacific-Asia Conference on Knowledge Discovery and
Data Mining. Springer; 2021. p. 98–105.
57. Scott M. Religious Language. In: Zalta EN, editor. The Stanford Encyclopedia
of Philosophy. Winter 2017 ed. Metaphysics Research Lab, Stanford University;
2017.
58. Pandharipande RV. Language of Religion: What does it inform the field of
Linguistics?;.
59. Keane W. Religious language. Annual review of anthropology. 1997;26(1):47–71.
60. Downes W. Linguistics and the Scientific Study of Religion. Religion, language,
and the human mind. 2018; p. 89.
61. Theodor I. Exploring the Bhagavad Gita: Philosophy, structure and meaning.
Routledge; 2016.
62. Stein D. Multi-Word Expressions in the Spanish Bhagavad Gita, Extracted with
Local Grammars Based on Semantic Classes. In: LREC 2012 Workshop
Language Resources and Evaluation for Religious Texts (LRE-Rel); 2012. p.
88–94.
63. Rajandran K. From matter to spirit: Metaphors of enlightenment in
Bhagavad-Gita. Journal of Language Studies. 2017;17(2):163–176.
64. Rajput NK, Ahuja B, Riyal MK. A statistical probe into the word frequency
and length distributions prevalent in the translations of Bhagavad Gita.
Pramana. 2019;92(4):1–6.
66. Bhawuk DP. Anchoring cognition, emotion, and behavior in desire: A model
from the Bhagavad-Gita. Handbook of Indian psychology. 2008; p. 390–413.
67. Chandra R, Kulkarni V. Semantic and sentiment analysis of the Bhagavad Gita
translations using BERT-based language models. arXiv preprint arXiv:xxx.
2022;.
68. Haas GC. Recurrent and parallel passages in the principal Upanishads and the
Bhagavad-Gītā. Journal of the American Oriental Society. 1922; p. 1–43.
69. Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J. LSTM: A
search space odyssey. IEEE transactions on neural networks and learning
systems. 2016;28(10):2222–2232.
75. Clinchant S, Jung KW, Nikoulina V. On the use of BERT for neural machine
translation. arXiv preprint arXiv:1909.12744. 2019;.
76. Shavarani HS, Sarkar A. Better Neural Machine Translation by Extracting
Linguistic Information from BERT. arXiv preprint arXiv:2104.02831. 2021;.
79. Geva M, Khashabi D, Segal E, Khot T, Roth D, Berant J. Did Aristotle Use a
Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies.
Transactions of the Association for Computational Linguistics. 2021;9:346–361.
80. Ozyurt IB, Bandrowski A, Grethe JS. Bio-AnswerFinder: a system to find
answers to questions from biomedical texts. Database. 2020;2020.
104. McInnes L, Healy J. Accelerated hierarchical density based clustering. In: 2017
IEEE International Conference on Data Mining Workshops (ICDMW). IEEE;
2017. p. 33–42.
105. McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and
projection for dimension reduction. arXiv preprint arXiv:1802.03426. 2018;.
106. Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of machine
learning research. 2008;9(11).
107. Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometrics
and intelligent laboratory systems. 1987;2(1-3):37–52.
111. Swami SP, Yeats WB. The Ten Principal Upanishads. Rupa Publications India
Pvt. Ltd.; 2012. Available from:
https://books.google.co.in/books?id=f1KeAwAAQBAJ.
112. Swami SP. The Holy Geeta. Jaico Publishing House; 1935.
115. Witz KG. The supreme wisdom of the Upaniṣads: an introduction. Motilal
Banarsidass Publ.; 1998.
133. Indich WM. Consciousness in advaita vedanta. Motilal Banarsidass Publ.; 1995.
135. Ackerman RW. The Debate of the Body and the Soul and Parochial
Christianity. Speculum. 1962;37(4):541–565.
136. Sharma S. Corporate Gita: lessons for management, administration and
leadership. Journal of Human Values. 1999;5(2):103–123.
137. Nayak AK. Effective leadership traits from Bhagavad Gita. International
Journal of Indian Culture and Business Management. 2018;16(1):1–18.
138. Reddy M. Psychotherapy-insights from bhagavad gita. Indian journal of
psychological medicine. 2012;34(1):100–104.
139. Chang J, Gerrish S, Wang C, Boyd-Graber JL, Blei DM. Reading tea leaves:
How humans interpret topic models. In: Advances in neural information
processing systems; 2009. p. 288–296.
140. Syed S, Spruit M. Full-text or abstract? examining topic coherence scores using
latent dirichlet allocation. In: 2017 IEEE International conference on data
science and advanced analytics (DSAA). IEEE; 2017. p. 165–174.
144. Newman D, Bonilla EV, Buntine W. Improving topic coherence with regularized
topic models. Advances in neural information processing systems.
2011;24:496–504.
145. Röder M, Both A, Hinneburg A. Exploring the space of topic coherence
measures. In: Proceedings of the eighth ACM international conference on Web
search and data mining; 2015. p. 399–408.
146. Řehůřek R, Sojka P. Software Framework for Topic Modelling with Large
Corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for
NLP Frameworks. Valletta, Malta: ELRA; 2010. p. 45–50.
147. Easwaran E. The Bhagavad Gita for Daily Living: Commentary, Translation,
and Sanskrit Text, Chapters 13 Through 18. Nilgiri Press; 1984.
148. Sargeant W, Chapple CK. The Bhagavad Gita: Twenty-fifth–Anniversary
Edition. SUNY Press; 2009.
149. Singh K. The Sterling Book of Hinduism. Sterling Publishers Pvt. Ltd; 2011.
152. Rao M. A Brief History of the Bhagavad Gita’s Modern Canonization. Religion
Compass. 2013;7(11):467–475.
153. Lattanzio NG. I Am that I Am: Self-Inquiry, Nondual Awareness, and Nondual
Therapy as an Eclectic Framework. Argosy University/Schaumburg (Chicago
Northwest); 2020.
157. Chinmayananda S. Srimad Bhagawad Geeta Chapter I & II. Central Chinmaya
Mission Trust; 2014.
158. Ranganathananda S. Universal Message of the Bhagavad Gita: An exposition of
the Gita in the Light of Modern Thought and Modern Needs. Advaita Ashrama
(A Publication House of Ramakrishna Math, Belur Math); 2000.
159. Salicchi L, Lenci A. PIHKers at CMCL 2021 Shared Task: Cosine Similarity and
Surprisal to Predict Human Reading Patterns. In: Proceedings of the Workshop
on Cognitive Modeling and Computational Linguistics; 2021. p. 102–107.
160. Thongtan T, Phienthrakul T. Sentiment classification using document
embeddings trained with cosine similarity. In: Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics: Student Research
Workshop; 2019. p. 407–414.
161. Gunawan D, Sembiring C, Budiman MA. The implementation of cosine
similarity to calculate text relevance between two documents. In: Journal of
physics: conference series. vol. 978. IOP Publishing; 2018. p. 012120.
162. Yelle RA. Explaining mantras: Ritual, rhetoric, and the dream of a natural
language in Hindu Tantra. Routledge; 2004.
163. Aurobindo S. Secret of the Veda. Lotus Press; 2018.
164. Pandit MP. Mystic Approach to the Veda and the Upanishad. Lotus Press;
1974.
165. Vivekananda S. Essentials of Hinduism. Advaita Ashrama (A publication
branch of Ramakrishna Math, Belur Math); 1937.
169. Tai W, Kung H, Dong XL, Comiter M, Kuo CF. exBERT: Extending
pre-trained models with domain-specific vocabulary under constrained training
resources. In: Findings of the Association for Computational Linguistics:
EMNLP 2020; 2020. p. 1433–1439.
170. Beltagy I, Lo K, Cohan A. SciBERT: A pretrained language model for scientific
text. arXiv preprint arXiv:1903.10676. 2019;.
171. Chalkidis I, Fergadiotis M, Malakasiotis P, Aletras N, Androutsopoulos I.
LEGAL-BERT: The muppets straight out of law school. arXiv preprint
arXiv:2010.02559. 2020;.