Department of Computer Engineering


TSSM BSCOER
Project Report On

Movie Recommendation System


By

Karan Devdas Parge


Roll No:12

GUIDED BY:
Prof. A. D. Gujar
Movie Recommendation System

PROBLEM STATEMENT
Develop a movie recommendation model using the scikit-learn library in Python.

OBJECTIVE
The objective of this recommendation system is to provide satisfactory movie
recommendations while keeping the system user-friendly, i.e. by requiring minimal input
from users. It recommends movies based on the metadata of the movies and on past user
ratings.

TECHNOLOGY USED

Machine Learning Library:

• pandas
• numpy
• difflib
• AST
• scikit-learn

Requirements:

• Python 3.6

THEORY

1. What is scikit-learn?

Scikit-learn is a free machine learning library for Python. It supports both supervised and
unsupervised machine learning, providing diverse algorithms for classification, regression,
clustering, and dimensionality reduction. It is licensed under a permissive simplified BSD
license and is distributed with many Linux distributions, encouraging academic and
commercial use.

The library is built upon SciPy (Scientific Python), which must be installed before you can
use scikit-learn. This stack includes:

• NumPy: Base n-dimensional array package
• SciPy: Fundamental library for scientific computing
• Matplotlib: Comprehensive 2D/3D plotting
• IPython: Enhanced interactive console
• SymPy: Symbolic mathematics
• Pandas: Data structures and analysis

Extensions or modules for SciPy are conventionally named SciKits. As such, the module
provides learning algorithms and is named scikit-learn.
The vision for the library is a level of robustness and support required for use in production
systems. This means a deep focus on concerns such as ease of use, code quality,
collaboration, documentation and performance.

Although the interface is Python, C libraries are leveraged for performance, such as NumPy
for array and matrix operations.
It was originally called scikits.learn and was initially developed by David Cournapeau as a
Google Summer of Code project in 2007. Later, in 2010, Fabian Pedregosa, Gael Varoquaux,
Alexandre Gramfort, and Vincent Michel, from INRIA (French Institute for Research in
Computer Science and Automation), took this project to another level and made the first
public release (v0.1 beta) on 1st Feb. 2010.

1.1 FEATURES:
The library is focused on modelling data. It is not focused on loading, manipulating and
summarizing data; for these features, refer to NumPy and Pandas. Some popular groups of
models provided by scikit-learn include:

• Clustering: for grouping unlabelled data, for example with K-Means.
• Cross Validation: for estimating the performance of supervised models on unseen data.
• Datasets: for test datasets and for generating datasets with specific properties for
investigating model behaviour.
• Dimensionality Reduction: for reducing the number of attributes in data for
summarization, visualization and feature selection such as Principal component
analysis.
• Ensemble methods: for combining the predictions of multiple supervised models.
• Feature extraction: for defining attributes in image and text data.
• Feature selection: for identifying meaningful attributes from which to create
supervised models.
• Parameter Tuning: for getting the most out of supervised models.
• Manifold Learning: for summarizing and depicting complex multi-dimensional data.
• Supervised Models: a vast array not limited to generalized linear models,
discriminant analysis, naive Bayes, lazy methods, neural networks, support vector
machines and decision trees.
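
As a rough illustration of three of the groups listed above (datasets, clustering and
cross validation), the sketch below generates synthetic data, clusters it with K-Means and
cross-validates a simple classifier. The data, the choice of LogisticRegression and all
parameter values are assumptions made for this example only; they are not part of the
project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Datasets: generate 150 unlabelled 2-D points scattered around 3 centres.
points, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Clustering: group the unlabelled points with K-Means.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(points)

# Cross validation: estimate how a supervised model performs on unseen data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("points per cluster:", [int((cluster_labels == k).sum()) for k in range(3)])
print("mean cross-validated accuracy:", scores.mean())
```

Every estimator follows the same fit/predict pattern, which is why swapping one model for
another usually needs only a one-line change.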

2. What is a Recommendation System?

Simply put, a Recommendation System is a filtration program whose prime goal is to predict
the “rating” or “preference” of a user towards a domain-specific item or items. In our case,
this domain-specific item is a movie. The main focus of our recommendation system is
therefore to filter and predict only those movies which a user would prefer, given some data
about that user.

2.1. Recommendation System Mechanism:

The engine of the recommendation system filters the data via different machine learning
algorithms, and based on that filtering it predicts the most relevant entities to be
recommended. After studying the previous behaviour of the users, it recommends
products/services that the user may be interested in.

The working of a recommendation engine can be broken into the following steps:

2.1.1. Data Collection

The techniques that can be used to collect data are:

1. Explicit, where data are provided intentionally by the user (e.g. user input such as
movie ratings).

2. Implicit, where data are not provided intentionally but are gathered from available data
streams (e.g. search history, clicks, order history, etc.).

2.1.2. Data Storage

The data can be stored in an SQL database, a NoSQL database, or some other kind of object
storage, for example in the cloud. The right choice depends on the type and amount of
data. The more data the storage can hold for the model, the better the recommendation
system can be.

3. What are the different filtration strategies?


3.1. Content-based Filtering:

This filtration strategy is based on the data provided about the items. The algorithm
recommends products that are similar to the ones that a user has liked in the past. This
similarity (generally cosine similarity) is computed from the data we have about the items as
well as the user’s past preferences.

For example, if a user likes a movie such as ‘The Prestige’, then we can recommend other
movies starring ‘Christian Bale’, movies in the genre ‘Thriller’, or maybe even movies
directed by ‘Christopher Nolan’. What happens here is that the recommendation system checks
the past preferences of the user and finds the film ‘The Prestige’. It then looks for similar
movies using the information available in the database, such as the lead actors, the
director, the genre of the film, the production house, etc.
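
To make the similarity measure concrete, here is a minimal sketch of cosine similarity
computed by hand with NumPy. The four binary features and the candidate movies
(candidate_same_team, candidate_thriller, candidate_comedy) are hypothetical and exist only
for this illustration; cosine_similarity_score is a helper name introduced here.

```python
import numpy as np

def cosine_similarity_score(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        return 0.0
    return float(np.dot(a, b) / norm_product)

# Hypothetical binary features:
# [stars Christian Bale, directed by Christopher Nolan, genre Thriller, genre Comedy]
liked_movie = [1, 1, 1, 0]           # stands in for 'The Prestige'
candidate_same_team = [1, 1, 1, 0]   # same actor, director and genre
candidate_thriller = [0, 0, 1, 0]    # shares only the genre
candidate_comedy = [0, 0, 0, 1]      # shares nothing

for name, vector in [("same team", candidate_same_team),
                     ("thriller only", candidate_thriller),
                     ("comedy", candidate_comedy)]:
    print(name, round(cosine_similarity_score(liked_movie, vector), 3))
```

The more metadata two movies share, the closer the score is to 1; movies with nothing in
common score 0.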

Disadvantages:

1. Different products do not get much exposure to the user.

2. Businesses cannot be expanded as the user does not try different types of products.

3.2. Collaborative Filtering:

This filtration strategy combines the user’s behaviour with the behaviour of other users in
the database, comparing and contrasting the two. The history of all users plays an
important role in this algorithm. The main difference between content-based filtering and
collaborative filtering is that in the latter, the interactions of all users with the items
influence the recommendation algorithm, while in content-based filtering only the concerned
user’s data is taken into account. There are multiple ways to implement collaborative
filtering, but the main concept to grasp is that the data of multiple users influences the
outcome of the recommendation; the model does not depend on only one user’s data.

There are 2 types of collaborative filtering algorithms:

3.2.1. User-based Collaborative filtering:

The basic idea here is to find users whose past preference patterns are similar to those of
user ‘A’, and then to recommend to ‘A’ items that those similar users liked but that ‘A’ has
not encountered yet. This is achieved by building a matrix of the items each user has
rated/viewed/liked/clicked (depending on the task at hand), computing the similarity scores
between users, and finally recommending items that the concerned user is not aware of but
that similar users know and liked. For example, if user ‘A’ likes ‘Batman Begins’, ‘Justice
League’ and ‘The Avengers’ while user ‘B’ likes ‘Batman Begins’, ‘Justice League’ and ‘Thor’,
then they have similar interests, since these movies all belong to the superhero genre. So
there is a high probability that user ‘A’ would like ‘Thor’ and user ‘B’ would like ‘The
Avengers’.
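
The example with users ‘A’ and ‘B’ can be sketched as a small user-item matrix. This is only
an illustrative sketch: user ‘C’, the placeholder title "Some Drama", the 0/1 encoding of
likes and the helper recommend_for are assumptions introduced here, not part of the project
code.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

movies = ["Batman Begins", "Justice League", "The Avengers", "Thor", "Some Drama"]
users = ["A", "B", "C"]

# Rows are users, columns are movies; 1 = liked, 0 = not liked or not seen.
likes = np.array([
    [1, 1, 1, 0, 0],  # A
    [1, 1, 0, 1, 0],  # B
    [0, 0, 0, 0, 1],  # C (hypothetical user with different taste)
])

# Similarity between every pair of users, based on their rows.
user_similarity = cosine_similarity(likes)

def recommend_for(user, top_n=1):
    """Rank unseen movies by the similarity-weighted likes of the other users."""
    u = users.index(user)
    weights = user_similarity[u].copy()
    weights[u] = 0.0                   # a user should not vote for themselves
    scores = weights @ likes           # one score per movie
    scores[likes[u] == 1] = -1.0       # hide movies the user already likes
    best = np.argsort(-scores)[:top_n]
    return [movies[i] for i in best]

print(recommend_for("A"))
print(recommend_for("B"))
```

Because ‘A’ and ‘B’ share two of three liked movies, each receives the one movie that only
the other has liked, matching the reasoning in the paragraph above.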

Disadvantages:

1. People are fickle-minded, i.e. their tastes change from time to time. As this algorithm
is based on user similarity, it may pick up initial similarity patterns between two users
who after a while have completely different preferences.
2. There are usually many more users than items, so the similarity matrices become very
large, are difficult to maintain and need to be recomputed regularly.
3. This algorithm is very susceptible to shilling attacks, where fake user profiles
consisting of biased preference patterns are used to manipulate key decisions.

3.2.2. Item-based Collaborative Filtering:

The concept in this case is to find similar movies instead of similar users, and then to
recommend movies similar to those that ‘A’ has liked in the past. This is done by finding
every pair of items that were rated/viewed/liked/clicked by the same user, then measuring
the similarity of those ratings/views/likes/clicks across all users who
rated/viewed/liked/clicked both, and finally recommending items based on the similarity
scores.

Here, for example, we take two movies ‘A’ and ‘B’ and compare their ratings from all users
who have rated both movies. If most of these common users have rated ‘A’ and ‘B’ similarly,
it is highly probable that ‘A’ and ‘B’ are similar. Therefore, if someone has watched and
liked ‘A’, they should be recommended ‘B’, and vice versa.
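
A minimal sketch of this item-to-item comparison is given below. The ratings matrix is made
up for the illustration, 0 is assumed to mean "not rated", and item_similarity is a
hypothetical helper; it uses plain cosine similarity on raw ratings, whereas practical
systems often use mean-centred variants.

```python
import numpy as np

# Rows are users, columns are movies A, B and C; 0 means "not rated" (made-up data).
ratings = np.array([
    [5, 5, 1],
    [4, 4, 0],
    [1, 2, 5],
    [0, 1, 4],
], dtype=float)

def item_similarity(ratings, i, j):
    """Cosine similarity of two movies over the users who rated both of them."""
    rated_both = (ratings[:, i] > 0) & (ratings[:, j] > 0)
    if not rated_both.any():
        return 0.0
    a = ratings[rated_both, i]
    b = ratings[rated_both, j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_ab = item_similarity(ratings, 0, 1)  # users tend to agree on A and B
sim_ac = item_similarity(ratings, 0, 2)  # users tend to disagree on A and C
print("A~B:", round(sim_ab, 3), "A~C:", round(sim_ac, 3))
```

Since users rate ‘A’ and ‘B’ alike but rate ‘A’ and ‘C’ very differently, a user who liked
‘A’ would be pointed to ‘B’ before ‘C’.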

Advantages over User-based Collaborative Filtering:

1. Unlike people’s taste, movies don’t change.
2. There are usually a lot fewer items than people, so the matrices are easier to maintain
and compute.
3. Shilling attacks are much harder because items cannot be faked.

4. Data Description:

A data set (or dataset) is a collection of data. In the case of tabular data, a data set
corresponds to one or more database tables, where every column of a table represents a
particular variable, and each row corresponds to a given record of the data set in question.
The data set lists values for each of the variables, such as the height and weight of an
object, for each member of the data set. Data sets can also consist of a collection of
documents or files. In the open data discipline, the data set is the unit used to measure the
information released in a public open data repository. The European Open Data portal
aggregates more than half a million data sets.[2] Some other issues (real-time data
sources,[3] non-relational data sets, etc.) increase the difficulty of reaching a consensus
about it.
This dataset contains 26 million ratings from 270,000 users for all 45,000 movies listed in
the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017.
Data points include cast, crew, plot keywords, budget, revenue, posters, release dates,
languages, production companies, countries, TMDB vote counts and vote averages.

5. Building a Movie Recommendation System:

The approach to build the movie recommendation engine consists of the following:

1. Perform Exploratory Data Analysis (EDA) on the data.
2. Build the recommendation system.
3. Get recommendations.

• After downloading the dataset, we import all the required libraries and then read the
CSV file using the read_csv() method.
• If you inspect the dataset, you will see that it holds a lot of extra information about
each movie. We don’t need all of it, so we choose the keywords, cast, genres and director
columns as our feature set (the so-called “content” of the movie).
• Next, we define a function, combine_features(), that joins the values of these columns
into a single string.
• We need to call this function on each row of our dataframe. Before doing that, we clean
and preprocess the data: we fill all the NaN values in these columns with a blank string.
• Now that we have obtained the combined strings, we feed them to a CountVectorizer()
object to get the count matrix.
• At this point, about 60% of the work is done. Next, we obtain the cosine similarity
matrix from the count matrix.
• Now, we define two helper functions to get the movie title from the movie index and
vice versa.
• Our next step is to get the title of the movie that the user currently likes. Then we
find the index of that movie.
• After that, we access the row corresponding to this movie in the similarity matrix.
This gives us the similarity scores of all other movies with respect to the current movie.
Then we enumerate through the similarity scores of that movie to make tuples of movie
index and similarity score.
• This converts a row of similarity scores like [1 0.5 0.2 0.9] into [(0, 1) (1, 0.5)
(2, 0.2) (3, 0.9)]. Here, each item has the form (movie index, similarity score). Now
comes the most vital point.
• We sort the list similar_movies by similarity score in descending order. Since the most
similar movie to a given movie is the movie itself, we discard the first element after
sorting.
• Finally, we run a loop to print the first 5 entries of the sorted_similar_movies list.
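
Before applying these steps to the real dataset, they can be tried on a toy example. The
three combined strings below are invented for illustration (including the tokens director_a
and director_b); only the CountVectorizer, cosine_similarity, enumerate and sorted calls
mirror the steps described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented "combined features" strings, one per movie (indices 0, 1 and 2).
combined = [
    "space alien action director_a",
    "space alien horror director_b",
    "romance drama ship director_a",
]

# Count matrix: one row per movie, one column per distinct word.
count_matrix = CountVectorizer().fit_transform(combined)

# Pairwise cosine similarity between all movies.
cosine_sim = cosine_similarity(count_matrix)

# Turn the row for movie 0 into (movie index, similarity score) tuples.
similar_movies = list(enumerate(cosine_sim[0]))

# Sort by score, highest first, and drop the first entry (the movie itself).
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:]

for index, score in sorted_similar_movies:
    print(index, round(float(score), 2))
```

Movie 1 shares two of its four words with movie 0 and movie 2 shares only one, so movie 1
ranks first.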

INPUT
Here we use the movie_dataset.csv file.
The code goes as follows:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Read the dataset and choose the columns that describe a movie's content.
df = pd.read_csv("movie_dataset.csv")
features = ['keywords', 'cast', 'genres', 'director']

def combine_features(row):
    # Join the selected columns of one row into a single string.
    return row['keywords'] + " " + row['cast'] + " " + row["genres"] + " " + row["director"]

# Replace missing values with empty strings so the concatenation does not fail.
for feature in features:
    df[feature] = df[feature].fillna('')
df["combined_features"] = df.apply(combine_features, axis=1)

# Build the count matrix and the pairwise cosine similarity matrix.
cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combined_features"])
cosine_sim = cosine_similarity(count_matrix)

def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]

def get_index_from_title(title):
    return df[df.title == title]["index"].values[0]

movie_user_likes = "Avatar"
movie_index = get_index_from_title(movie_user_likes)
similar_movies = list(enumerate(cosine_sim[movie_index]))
# Sort by similarity, highest first, and drop the movie itself.
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:]

i = 0
print("Top 5 similar movies to " + movie_user_likes + " are:\n")
for element in sorted_similar_movies:
    print(get_title_from_index(element[0]))
    i = i + 1
    if i >= 5:
        break
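
The lookup by title in the code above needs an exact match. Since difflib is listed among
the technologies used, one plausible use is to map a misspelled title to the closest known
one before the lookup. The sketch below is an assumption about how that could be done, not
the project's actual code; find_closest_title and the short titles list are hypothetical.

```python
import difflib

def find_closest_title(query, titles, cutoff=0.6):
    """Return the known title most similar to query, or None if nothing is close enough."""
    matches = difflib.get_close_matches(query, titles, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# In the project, this list would come from the "title" column of the dataframe.
titles = ["Avatar", "Aliens", "Star Trek Beyond"]

print(find_closest_title("Avatr", titles))
print(find_closest_title("zzzz", titles))
```

Note that get_close_matches is case-sensitive, so lowercasing both the query and the titles
first may give better matches.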

OUTPUT
Top 5 similar movies to Avatar are:

Guardians of the Galaxy
Aliens
Star Wars: Clone Wars: Volume 1
Star Trek Into Darkness
Star Trek Beyond

CONCLUSION

Recommendation systems have become an important part of everyone’s lives. With the
enormous number of movies released worldwide every year, people often miss out on some
amazing works of art for lack of the right suggestion. Putting machine-learning-based
recommendation systems to work is thus very important for getting the right recommendations.
We looked at content-based recommendation systems, which may not seem very effective on
their own but which, when combined with collaborative techniques, can solve the cold-start
problem that collaborative filtering methods face when run independently.
