Finding Movie Similarity Based On Plot Summaries
CONTENTS

1. INTRODUCTION
   1.1 Problem Statement
   1.2 Abstract
   1.3 Limitations
2. ANALYSIS
   2.1 Software Requirement Specification
   2.1.1 Software Requirements
   2.1.2 Hardware Requirements
   2.1.3 Modules
   2.1.4 Architecture
3. DESIGN
   3.1 Data Set Descriptions
   3.2 Methods & Algorithm
4. DEPLOYMENT AND RESULTS
   4.1 Source Code
   4.2 Final Results
5. CONCLUSION
   5.1 Project Conclusion
   5.2 Future Scope
1. INTRODUCTION
1.1 PROBLEM STATEMENT
1.2 ABSTRACT
This project focuses on leveraging Natural Language Processing (NLP) techniques and clustering algorithms to identify similarities among movie plot summaries. The first step involves collecting a dataset of movie plot summaries, either from existing sources or from platforms such as IMDb and Wikipedia. Subsequently, the collected text data undergoes preprocessing. Feature extraction techniques are then employed to convert the textual content into numerical vectors; common methods include tokenization, stemming, and TF-IDF vectorization, enabling the representation of the plot summaries in a format suitable for clustering analysis. The core of the project revolves around the K-Means clustering algorithm, which groups together movies with similar plot structures, allowing for the identification of thematic and narrative similarities. The outcome of this project provides a valuable tool for movie enthusiasts, filmmakers, and researchers by offering a systematic approach to explore and understand connections between movies based on their plot summaries.
1.3 LIMITATIONS
• Semantic Understanding: Plot summaries might not capture the full essence of a movie. They may
miss nuances, themes, character development, and other important elements that contribute to the
overall similarity between movies.
• Subjectivity: Different individuals may interpret and summarize movie plots differently. This
subjectivity can lead to inconsistencies and inaccuracies in the similarity assessment.
• Overfitting: Depending solely on plot summaries for similarity calculation may lead to overfitting, especially if the dataset is small or if the model is tuned too closely to the specific wording of the summaries.
2. ANALYSIS

2.1 Software Requirement Specification

2.1.1 Software Requirements
• Jupyter Notebook
• Python
• Google Chrome or Microsoft Edge (latest version)
• Libraries: Python (pandas, NumPy, Matplotlib, Seaborn, scikit-learn)
2.1.3 Modules
• scikit-learn: A popular machine learning library that includes modules for TF-IDF
vectorization, cosine similarity, and other machine learning algorithms.
• Gensim: A library for topic modeling, document similarity analysis, and word embeddings.
It includes implementations of Word2Vec and Doc2Vec models.
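As an illustration of how Gensim's Doc2Vec could be applied to plot summaries, the minimal sketch below trains a tiny model and queries it for the most similar plot. The toy summaries and the training parameters are assumptions for demonstration only, not the project's actual data or settings.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical toy corpus of plot summaries.
plots = [
    "A young wizard attends a school of magic and battles a dark lord.",
    "A hobbit journeys across a fantasy world to destroy a powerful ring.",
    "A detective investigates a series of murders in a rainy city.",
]

# Gensim expects tokenized documents wrapped in TaggedDocument objects.
documents = [TaggedDocument(words=plot.lower().split(), tags=[i])
             for i, plot in enumerate(plots)]

# Train a small Doc2Vec model (parameters are illustrative only).
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40)

# Infer a vector for a new summary and find the most similar plots.
query = "a boy learns magic and fights an evil sorcerer".split()
vector = model.infer_vector(query)
print(model.dv.most_similar([vector], topn=2))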
2.1.4 Architecture
At a high level, the system pipeline moves from data collection, through text preprocessing and TF-IDF feature extraction, to K-Means clustering and similarity analysis of the resulting clusters.
3. DESIGN

3.1 Introduction
Machine learning algorithms are used for a variety of purposes such as data mining, image processing, and predictive analytics, to name a few. The main advantage of using machine learning is that, once an algorithm learns what to do with data, it can do its work automatically.
3.3 Data Preprocessing Techniques
Data preprocessing is a crucial step in building machine learning models for finding movie similarity based on plot summaries, or for any other predictive task. It involves cleaning, transforming, and organizing raw data to make it suitable for training and testing machine learning algorithms. Initially, the raw text needs to be segmented into individual tokens. The first step involves collecting a dataset of movie plot summaries, either from existing sources or from platforms such as IMDb and Wikipedia. Subsequently, the collected text data undergoes preprocessing, and feature extraction techniques are then employed to convert the textual content into numerical vectors, as sketched in the example below.
1. K-Means Clustering Analysis:
K-Means clustering partitions the TF-IDF vectors into a fixed number of groups so that movies within the same cluster share similar plot vocabulary. The accuracy of the K-Means clustering on this dataset is 95%.
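As a rough illustration (not the project's exact code), scikit-learn's KMeans can be applied directly to the TF-IDF matrix; the number of clusters chosen here is an arbitrary assumption.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

plots = [
    "A retired hitman returns for one last job to avenge his friend.",
    "An aging boxer trains a young fighter for a championship bout.",
    "A former assassin comes out of retirement to settle an old score.",
    "A rookie boxer rises through the ranks against all odds.",
]

# Vectorize the summaries with TF-IDF.
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(plots)

# Cluster the movies; n_clusters=2 is an illustrative choice.
km = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = km.fit_predict(tfidf_matrix)

for plot, label in zip(plots, labels):
    print(label, plot)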
2. Natural Language Processing:
• NLP combines computer science, linguistics, and machine learning to study how computers and humans communicate in natural language.
• NLTK (Natural Language Toolkit): A comprehensive library for processing human language data, including modules for tokenization, stemming, and TF-IDF calculations.
• SpaCy: A library for advanced natural language processing tasks, providing pre-trained models for various languages and efficient tokenization. It also ships neural network-based pipeline components for tagging, parsing, and named entity recognition (see the short sketch after this list).
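For comparison, the sketch below shows equivalent token cleanup with spaCy. It assumes the small English model en_core_web_sm has been installed, which is an assumption beyond the setup listed earlier.

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

summary = "A retired hitman returns for one last job to avenge his friend."
doc = nlp(summary)

# Keep lemmatized, alphabetic, non-stop-word tokens.
tokens = [token.lemma_.lower() for token in doc
          if token.is_alpha and not token.is_stop]
print(tokens)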
3. Transformer Models:
• Hugging Face Transformers: A library that provides pre-trained transformer-based models, including BERT. You can use this library to easily obtain contextual embeddings for words and sentences.
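The sketch below shows one way to obtain a contextual sentence embedding with the Transformers library; mean-pooling over BERT's token embeddings is an illustrative choice, not necessarily the approach used in this project.

import torch
from transformers import AutoTokenizer, AutoModel

# Load a pre-trained BERT model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

summary = "A detective investigates a series of murders in a rainy city."
inputs = tokenizer(summary, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into a single vector for the summary.
embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(embedding.shape)  # torch.Size([768])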
4. Python Modules:
We use standard Python modules such as pandas, Matplotlib, and Seaborn for plotting the data in the form of graphs, making the results easier for users to interpret.
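As one possible visualization (an assumption about how results might be presented, not the project's actual figures), the distribution of movies across clusters can be plotted with pandas and Seaborn:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical cluster labels produced by K-Means for ten movies.
labels = [0, 1, 0, 2, 1, 0, 2, 2, 1, 0]
df = pd.DataFrame({"cluster": labels})

# Bar chart showing how many movies fall into each cluster.
sns.countplot(data=df, x="cluster")
plt.title("Number of movies per cluster")
plt.show()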
5. Count Vectorizers:
Count Vectorizer converts a collection of text documents into a matrix where the rows represent the
documents, and the columns represent the tokens (words or n-grams).
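A small, self-contained illustration of CountVectorizer on hypothetical summaries follows; the vocabulary and parameters are illustrative only.

from sklearn.feature_extraction.text import CountVectorizer

plots = [
    "a hobbit journeys to destroy a powerful ring",
    "a wizard and a hobbit fight a dark lord",
]

# Each row is a document, each column a token; values are raw counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(plots)

print(vectorizer.get_feature_names_out())
print(counts.toarray())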
4. DEPLOYMENT AND RESULTS
4.1 Source Code
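The original source-code listings are not reproduced here. The sketch below is a minimal reconstruction of the core similarity computation described in this report, combining TF-IDF vectorization with cosine similarity; the movie titles, summaries, and overall structure are hypothetical stand-ins for the real dataset and code.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical movie titles and plot summaries standing in for the dataset.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "plot": [
        "A retired hitman returns for one last job to avenge his friend.",
        "An aging boxer trains a young fighter for a championship bout.",
        "A former assassin comes out of retirement to settle an old score.",
    ],
})

# Vectorize the plots and compute pairwise cosine similarity.
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["plot"])
similarity = cosine_similarity(tfidf_matrix)

# For each movie, report its most similar counterpart (excluding itself).
for i, title in enumerate(movies["title"]):
    scores = similarity[i].copy()
    scores[i] = -1  # ignore self-similarity
    print(title, "is most similar to", movies["title"][scores.argmax()])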
4.2. Final Results:
5. CONCLUSION

5.1 Project Conclusion:
From this project we can conclude that finding movie similarity from plot summaries involves combining traditional and modern methods to understand and measure how similar movie stories are. We start by cleaning and organizing the plot information, then use different techniques to represent and compare movies. These include traditional methods such as counting words and more advanced methods that use artificial intelligence to understand the meaning of words in context.
The combination of these methods, including advanced neural networks and graph-based techniques, helps us create a holistic view of movie similarity. We also use an ensemble approach, combining insights from different sources, to make our final judgment about how similar movies are.
5.2 Future Scope:
In future work, this project can be extended with diverse combinations of machine learning techniques to further improve the quality of similarity predictions.