Finding Movie Similarity Based On Plot Summaries
CONTENTS

1. INTRODUCTION
   1.1 Problem Statement
   1.2 Abstract
   1.3 Limitations
2. ANALYSIS
   2.1 Software Requirement Specification
   2.1.1 Software Requirements
   2.1.2 Hardware Requirements
   2.1.3 Modules
   2.1.4 Architecture
3. DESIGN
   3.1 Data Set Descriptions
   3.2 Methods & Algorithm
4. DEPLOYMENT AND RESULTS
   4.1 Source Code
   4.2 Final Results
5. CONCLUSION
   5.1 Project Conclusion
   5.2 Future Scope
1. INTRODUCTION
1.1 PROBLEM STATEMENT
1.2 ABSTRACT
This project focuses on leveraging Natural Language Processing (NLP) techniques and clustering algorithms to identify similarities among movie plot summaries. The first step involves collecting a dataset of movie plot summaries, either from existing sources or from platforms such as IMDb and Wikipedia. Subsequently, the collected text data undergoes preprocessing. Feature extraction techniques are then employed to convert the textual content into numerical vectors; common methods include tokenization, stemming, and TF-IDF vectorization, enabling the representation of the plot summaries in a format suitable for clustering analysis. The core of the project revolves around the K-Means clustering algorithm, which groups together movies with similar plot structures, allowing for the identification of thematic and narrative similarities. The outcome of this project provides a valuable tool for movie enthusiasts, filmmakers, and researchers by offering a systematic approach to explore and understand connections between movies based on their plot summaries.
1.3 LIMITATIONS
• Semantic Understanding: Plot summaries might not capture the full essence of a movie. They may
miss nuances, themes, character development, and other important elements that contribute to the
overall similarity between movies.
• Subjectivity: Different individuals may interpret and summarize movie plots differently. This
subjectivity can lead to inconsistencies and inaccuracies in the similarity assessment.
• Overfitting: Depending solely on plot summaries for similarity calculation may lead to overfitting, especially if the dataset is small or if the model is tuned too closely to the specific wording of the summaries.
2. ANALYSIS

2.1 Software Requirement Specification

2.1.1 Software Requirements
• Jupyter Notebook
• Python
• Google Chrome or Microsoft Edge (latest version)
• Libraries: Python (pandas, NumPy, Matplotlib, Seaborn, scikit-learn)
2.1.3 Modules
• scikit-learn: A popular machine learning library that includes modules for TF-IDF
vectorization, cosine similarity, and other machine learning algorithms.
• Gensim: A library for topic modeling, document similarity analysis, and word embeddings.
It includes implementations of Word2Vec and Doc2Vec models.
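As an illustration of how Gensim's Doc2Vec could be applied to plot summaries, the minimal sketch below trains a tiny model and queries it for the most similar plot. The toy summaries and the training parameters are assumptions for demonstration only, not the project's actual data or settings.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical toy corpus of plot summaries.
plots = [
    "A young wizard attends a school of magic and battles a dark lord.",
    "A hobbit journeys across a fantasy world to destroy a powerful ring.",
    "A detective investigates a series of murders in a rainy city.",
]

# Gensim expects tokenized documents wrapped in TaggedDocument objects.
documents = [TaggedDocument(words=plot.lower().split(), tags=[i])
             for i, plot in enumerate(plots)]

# Train a small Doc2Vec model (parameters are illustrative only).
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40)

# Infer a vector for a new summary and find the most similar plots.
query = "a boy learns magic and fights an evil sorcerer".split()
vector = model.infer_vector(query)
print(model.dv.most_similar([vector], topn=2))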
2.1.4 Architecture
At a high level, the system pipeline moves from data collection, through text preprocessing and TF-IDF feature extraction, to K-Means clustering and similarity analysis of the resulting clusters.
3. DESIGN

3.1 Introduction
Machine learning algorithms are used for a variety of purposes such as data mining, image processing, and predictive analytics, to name a few. The main advantage of using machine learning is that, once an algorithm learns what to do with data, it can do its work automatically.
3.3 Data Preprocessing Techniques
Data preprocessing is a crucial step in building machine learning models for finding movie similarity based on plot summaries, or for any other predictive task. It involves cleaning, transforming, and organizing raw data to make it suitable for training and testing machine learning algorithms. Initially, the raw text needs to be segmented into individual tokens. The first step involves collecting a dataset of movie plot summaries, either from existing sources or from platforms such as IMDb and Wikipedia. Subsequently, the collected text data undergoes preprocessing, and feature extraction techniques are then employed to convert the textual content into numerical vectors, as sketched in the example below.
1. K-Means Clustering Analysis:
K-Means clustering partitions the TF-IDF vectors into a fixed number of groups so that movies within the same cluster share similar plot vocabulary. The accuracy of the K-Means clustering on this dataset is 95%.
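As a rough illustration (not the project's exact code), scikit-learn's KMeans can be applied directly to the TF-IDF matrix; the number of clusters chosen here is an arbitrary assumption.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

plots = [
    "A retired hitman returns for one last job to avenge his friend.",
    "An aging boxer trains a young fighter for a championship bout.",
    "A former assassin comes out of retirement to settle an old score.",
    "A rookie boxer rises through the ranks against all odds.",
]

# Vectorize the summaries with TF-IDF.
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(plots)

# Cluster the movies; n_clusters=2 is an illustrative choice.
km = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = km.fit_predict(tfidf_matrix)

for plot, label in zip(plots, labels):
    print(label, plot)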
2. Natural Language Processing:
• NLP combines computer science, linguistics, and machine learning to study how computers and humans communicate in natural language.
• NLTK (Natural Language Toolkit): A comprehensive library for processing human language data, including modules for tokenization, stemming, and TF-IDF calculations.
• SpaCy: A library for advanced natural language processing tasks, providing pre-trained models for various languages and efficient tokenization. It also ships neural network-based pipeline components for tagging, parsing, and named entity recognition (see the short sketch after this list).
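For comparison, the sketch below shows equivalent token cleanup with spaCy. It assumes the small English model en_core_web_sm has been installed, which is an assumption beyond the setup listed earlier.

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

summary = "A retired hitman returns for one last job to avenge his friend."
doc = nlp(summary)

# Keep lemmatized, alphabetic, non-stop-word tokens.
tokens = [token.lemma_.lower() for token in doc
          if token.is_alpha and not token.is_stop]
print(tokens)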
3. Transformer Models:
• Hugging Face Transformers: A library that provides pre-trained transformer-based models, including BERT. You can use this library to easily obtain contextual embeddings for words and sentences.
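The sketch below shows one way to obtain a contextual sentence embedding with the Transformers library; mean-pooling over BERT's token embeddings is an illustrative choice, not necessarily the approach used in this project.

import torch
from transformers import AutoTokenizer, AutoModel

# Load a pre-trained BERT model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

summary = "A detective investigates a series of murders in a rainy city."
inputs = tokenizer(summary, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into a single vector for the summary.
embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(embedding.shape)  # torch.Size([768])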
4. Python Modules:
We use standard Python modules such as pandas, Matplotlib, and Seaborn for plotting the data in the form of graphs, making the results easier for users to interpret.
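As one possible visualization (an assumption about how results might be presented, not the project's actual figures), the distribution of movies across clusters can be plotted with pandas and Seaborn:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical cluster labels produced by K-Means for ten movies.
labels = [0, 1, 0, 2, 1, 0, 2, 2, 1, 0]
df = pd.DataFrame({"cluster": labels})

# Bar chart showing how many movies fall into each cluster.
sns.countplot(data=df, x="cluster")
plt.title("Number of movies per cluster")
plt.show()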
5. Count Vectorizers:
Count Vectorizer converts a collection of text documents into a matrix where the rows represent the
documents, and the columns represent the tokens (words or n-grams).
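A small, self-contained illustration of CountVectorizer on hypothetical summaries follows; the vocabulary and parameters are illustrative only.

from sklearn.feature_extraction.text import CountVectorizer

plots = [
    "a hobbit journeys to destroy a powerful ring",
    "a wizard and a hobbit fight a dark lord",
]

# Each row is a document, each column a token; values are raw counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(plots)

print(vectorizer.get_feature_names_out())
print(counts.toarray())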
4. DEPLOYMENT AND RESULTS
4.1 Source Code
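The original source-code listings are not reproduced here. The sketch below is a minimal reconstruction of the core similarity computation described in this report, combining TF-IDF vectorization with cosine similarity; the movie titles, summaries, and overall structure are hypothetical stand-ins for the real dataset and code.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical movie titles and plot summaries standing in for the dataset.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "plot": [
        "A retired hitman returns for one last job to avenge his friend.",
        "An aging boxer trains a young fighter for a championship bout.",
        "A former assassin comes out of retirement to settle an old score.",
    ],
})

# Vectorize the plots and compute pairwise cosine similarity.
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["plot"])
similarity = cosine_similarity(tfidf_matrix)

# For each movie, report its most similar counterpart (excluding itself).
for i, title in enumerate(movies["title"]):
    scores = similarity[i].copy()
    scores[i] = -1  # ignore self-similarity
    print(title, "is most similar to", movies["title"][scores.argmax()])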
4.2. Final Results:
5. CONCLUSION

5.1 Project Conclusion:
From this project we can conclude that finding movie similarity from plot summaries involves combining traditional and modern methods to understand and measure how similar movie stories are. We start by cleaning and organizing the plot information, then use different techniques to represent and compare movies. These include traditional methods such as counting words and more advanced methods that use artificial intelligence to understand the meaning of words in context.
The combination of these methods, including advanced neural networks and graph-based techniques, helps us create a holistic view of movie similarity. We also use an ensemble approach, combining insights from different sources, to make our final judgment about how similar movies are.
5.2 Future Scope:
In future work, this project can be extended with diverse combinations of machine learning techniques to further improve the quality of similarity predictions.