AIML Mod4 Loki
Module-4
1. Explain support, confidence and Lift for Association rule
Support indicates the frequency of items appearing together in baskets with respect to all possible baskets considered (or in a sample). For example, the support for (beer, diaper) will be 2/4 (based on the data shown in Figure 9.1), that is, 50%, as the pair appears together in 2 out of 4 baskets.
Assume that X and Y are items being considered. Let
1. N be the total number of baskets.
2. N_XY represent the number of baskets in which X and Y appear together.
3. N_X represent the number of baskets in which X appears.
4. N_Y represent the number of baskets in which Y appears.
Then the support between X and Y is given by
Support(X, Y) = N_XY / N
Confidence measures the proportion of the transactions that contain X which also contain Y. X is called the antecedent and Y is called the consequent. Confidence can be calculated using the following formula:
Confidence(X → Y) = P(Y|X) = N_XY / N_X
Lift can be interpreted as the degree of association between two items. A lift value of 1 indicates that the items are independent (no association), a lift value of less than 1 implies that the products are substitutes (purchasing one product will decrease the probability of purchasing the other), and a lift value of greater than 1 indicates that purchasing Product X will increase the probability of purchasing Product Y. It is computed as
Lift(X → Y) = Support(X, Y) / (Support(X) × Support(Y))
A lift value of greater than 1 is a necessary condition for generating association rules.
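A minimal numeric sketch of the three measures, using the counts from the beer/diaper example above (N = 4 and N_XY = 2 come from the text; N_X and N_Y are assumed here purely for illustration):
N = 4            # total baskets (from the example)
N_XY = 2         # baskets containing both X (beer) and Y (diaper)
N_X, N_Y = 3, 3  # illustrative individual counts, not from the text

support = N_XY / N                            # 0.5
confidence = N_XY / N_X                       # P(Y|X) ~ 0.67
lift = support / ((N_X / N) * (N_Y / N))      # ~ 0.89, i.e. < 1 here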
The dataset is read using Python's open() method to load transaction data from a CSV file (e.g., groceries.csv). The steps include reading the file line by line and splitting each line on commas to obtain the items of each transaction.
Example code:
all_txns = []
# Open the file and read all transactions, one per line
with open('groceries.csv') as f:
    content = f.readlines()

# Split each line on commas to get the items of that transaction
for each_txn in content:
    all_txns.append(each_txn.strip().split(','))
The resulting all_txns variable contains a list of transactions, with each transaction represented as a list of items.
Encoding Transactions:
To apply association rule mining, transactions must be converted into a tabular (matrix) format where each row represents a transaction, each column represents an item, and each cell indicates whether that item is present in the transaction.
Example code:
import pandas as pd
from mlxtend.preprocessing import OnehotTransactions

one_hot_encoding = OnehotTransactions()
one_hot_txns = one_hot_encoding.fit(all_txns).transform(all_txns)
# Convert to a DataFrame with one boolean column per item
one_hot_txns_df = pd.DataFrame(one_hot_txns, columns=one_hot_encoding.columns_)
The Apriori algorithm is used to generate frequent itemsets and association rules. It proceeds in the following steps:
1. Generate Frequent Itemsets: Use the apriori function with a minimum support threshold.
Example code:
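(A minimal sketch assuming mlxtend's apriori function; the 0.02 support threshold is illustrative.)
from mlxtend.frequent_patterns import apriori

# Keep itemsets that appear in at least 2% of all transactions
frequent_itemsets = apriori(one_hot_txns_df, min_support=0.02, use_colnames=True)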
2. Generate Rules: Use the association_rules function with parameters such as metric (the measure used to filter rules, e.g., confidence) and min_threshold (the minimum value of that measure).
Example code:
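(A sketch with illustrative parameter values.)
from mlxtend.frequent_patterns import association_rules

# Keep only rules whose confidence is at least 0.3
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)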
To retrieve the top 10 rules, sort the rules DataFrame by confidence in descending order. This approach prioritizes rules with the highest reliability.
Example Code:
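(A sketch assuming the rules DataFrame generated above.)
# The ten rules with the highest confidence
top_10_rules = rules.sort_values('confidence', ascending=False).head(10)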
4. Explain loading the dataset and Calculating Cosine Similarity between Users for user-based similarity
The process of loading the dataset involves using the MovieLens dataset, which contains user ratings for movies. The dataset includes the following attributes: userId, movieId, rating, and timestamp.
1. The ratings file is read into a DataFrame:
rating_df = pd.read_csv("ml-latest-small/ratings.csv")
2. The timestamp column, which is not required for this analysis, is dropped.
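(A sketch of the drop, assuming the standard MovieLens column name timestamp.)
# Remove the timestamp column in place
rating_df.drop('timestamp', axis=1, inplace=True)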
A pivot table is then created to represent users as rows and movies as columns. The values in the matrix correspond to the ratings given by users to movies:
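(A minimal sketch assuming pandas pivot; movies a user has not rated become NaN and are filled with 0 before computing similarity.)
rating_mat = rating_df.pivot(index='userId', columns='movieId', values='rating')
rating_mat.fillna(0, inplace=True)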
Cosine similarity measures the angle between two vectors in multi-dimensional space, indicating the similarity in their direction. It is computed for users based on their ratings.
import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances

# Cosine similarity = 1 - cosine distance between user rating vectors
user_sim = 1 - pairwise_distances(rating_mat.values, metric='cosine')

# Set self-similarity to 0 so a user is not reported as their own nearest neighbour
np.fill_diagonal(user_sim, 0)

user_sim_df = pd.DataFrame(user_sim)
user_sim_df.index = rating_df.userId.unique()
user_sim_df.columns = rating_df.userId.unique()
The resulting similarity matrix can be used to identify the most similar users to any given user. Cosine similarity values close to 1 indicate high similarity, while values close to 0 indicate dissimilarity.
5. Explain filtering of similar users, loading the movies dataset, and Finding Common Movies of Similar Users for user-based similarity
1. To find the most similar users, the idxmax() function is applied along each row of the cosine similarity matrix, which contains the similarities between all pairs of users.
2. Example Code:
user_sim_df.idxmax(axis=1)[0:5]
● This code retrieves the most similar user for each of the first 5 users. For instance, user 325 is most similar to user 1, and so on for the remaining users.
1. Dataset Information:
○ The movies.csv file contains movie information with columns:
■ movieId: Unique identifier for movies.
■ title: Name of the movie.
■ genres: Movie genres.
2. Steps to Load the Dataset:
○ Use pandas to read the dataset:
movies_df = pd.read_csv("ml-latest-small/movies.csv")
movies_df.head()
This dataset will be joined with user ratings later to identify common movies.
1. Steps:
○ Define a method get_user_similar_movies() to find common movies watched by two users and their respective ratings:
def get_user_similar_movies(user1, user2):
    # Inner join keeps only the movies rated by both users
    common_movies = rating_df[rating_df.userId == user1].merge(
        rating_df[rating_df.userId == user2],
        on="movieId",
        how="inner")
    # Attach movie titles for readability
    return common_movies.merge(movies_df, on="movieId")
○ Example usage:
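(A sketch using the pair of users discussed below, users 2 and 338.)
get_user_similar_movies(2, 338)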
2. Insights:
This process identifies shared movie preferences and similar ratings between users. For example, users 2 and 338 may have rated "Apollo 13" and "Schindler's List" highly, indicating similar tastes.
6. Explain Calculating Cosine Similarity between Movies and finding most similar movies for item-based similarity
If two movies, movie A and movie B, have been watched by several users and rated very similarly, then movie A and movie B can be similar in taste. In other words, if a user watches movie A, then he or she is very likely to watch movie B, and vice versa.
In item-based similarity, the cosine similarity between movies is calculated based on user ratings to identify
similar movies. Here's the process:
○ Create a movie-wise rating matrix (movies as rows, users as columns) and fill missing values (NaN) with 0, since not all movies are rated by all users:
rating_mat = rating_df.pivot(index='movieId', columns='userId', values='rating')
rating_mat.fillna(0, inplace=True)
○ Compute the cosine similarity between movies and store it in a DataFrame:
movie_sim = 1 - pairwise_distances(rating_mat.values, metric='cosine')
movie_sim_df = pd.DataFrame(movie_sim)
This results in a similarity matrix where each entry represents the similarity between two movies.
● Define a method get_similar_movies() that ranks movies by their similarity to a given movie:
def get_similar_movies(movieid, topN=5):
    # Position of the movie's row in the similarity matrix
    movieidx = movies_df[movies_df.movieId == movieid].index[0]
    movies_df['similarity'] = movie_sim_df.iloc[movieidx]
    # Movies ranked by similarity to the given movie
    top_n = movies_df.sort_values(['similarity'], ascending=False)[0:topN]
    return top_n
● For a given movieId (e.g., "The Godfather" with movieId=858), call the function:
similar_movies = get_similar_movies(858)
● This returns the top N movies similar to the specified movie, ranked by similarity.
This approach allows recommendations for movies based on their inherent similarities derived from user
rating patterns.
The Surprise library offers an efficient way to implement user-based collaborative filtering for building
recommendation systems. The steps for implementing this approach are as follows:
● The Dataset class from Surprise is used to load the dataset into a suitable format for building models. A
Reader object specifies the rating scale (e.g., 1 to 5).
● Code example:
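(A minimal sketch assuming the ratings are already in rating_df with columns userId, movieId, and rating.)
from surprise import Dataset, Reader

# The Reader tells Surprise the scale on which ratings are given
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(rating_df[['userId', 'movieId', 'rating']], reader)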
● The KNNBasic algorithm from Surprise is used for user-based collaborative filtering. This method finds
similar users based on shared movie ratings and makes recommendations based on the ratings of the
nearest neighbors.
● Key parameters of KNNBasic:
○ k: Number of nearest neighbors (e.g., 20).
○ min_k: Minimum number of neighbors considered.
○ sim_options: Defines similarity measure. Set 'name' to 'pearson' or 'cosine' and
'user_based' to True for user-based similarity.
● Example code:
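(A sketch with illustrative parameter values.)
from surprise import KNNBasic

# Pearson similarity between users; k=20 and min_k=5 are illustrative
sim_options = {'name': 'pearson', 'user_based': True}
algo = KNNBasic(k=20, min_k=5, sim_options=sim_options)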
● Use cross_validate to perform K-fold cross-validation and evaluate the model's performance using
metrics like RMSE.
● Example:
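(A sketch assuming the data object loaded above.)
from surprise.model_selection import cross_validate

# 5-fold cross-validation, reporting RMSE for each fold
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)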
● To identify the most similar users, the similarity matrix is computed using cosine similarity or Pearson
correlation. The pairwise_distances function from Scikit-learn can calculate the distances between
users. Example code:
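(A sketch assuming a user-movie rating matrix such as the rating_mat built in the user-based section.)
from sklearn.metrics import pairwise_distances

# Convert cosine distances into similarities
user_sim = 1 - pairwise_distances(rating_mat.values, metric='cosine')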
● GridSearchCV can be used to search for the best parameters, including the number of neighbors and
similarity measures. Example:
from surprise.model_selection import GridSearchCV

# Search over neighbourhood sizes and similarity configurations
param_grid = {'k': [10, 20],
              'sim_options': {'name': ['cosine', 'pearson'],
                              'user_based': [True, False]}}
grid_cv = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=5, refit=True)
grid_cv.fit(data)
print(grid_cv.best_score['rmse'])
This method helps refine the model by selecting the optimal parameters for the best performance.
Matrix factorization is a technique used to decompose a large user-item rating matrix into smaller
matrices, which capture latent features that explain observed ratings. The main idea is to
approximate the original matrix as the product of two lower-dimensional matrices. This process is
useful in recommendation systems, where the goal is to predict ratings for unrated items based on
patterns identified in the observed ratings.
For instance, a 3x5 matrix (3 users and 5 movies) can be decomposed into a 3xk user-factor matrix and a kx5 movie-factor matrix, where k is the number of latent factors. The multiplication of these two matrices approximates the original user-item matrix.
Latent Factors:
The "latent factors" refer to hidden variables that influence how a user rates a movie. These could be related to aspects like movie genre, director, or actors, but they are not directly observable.
Once the model is trained using SVD, it is evaluated using cross-validation to measure its
performance, typically by using RMSE (Root Mean Squared Error).
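(A minimal sketch assuming the Surprise library's SVD implementation and the data object from earlier; n_factors=10 is illustrative.)
from surprise import SVD
from surprise.model_selection import cross_validate

# Matrix factorization with 10 latent factors, evaluated by 5-fold CV on RMSE
svd = SVD(n_factors=10)
cross_validate(svd, data, measures=['RMSE'], cv=5, verbose=True)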
In summary, matrix factorization reduces the complexity of user-item rating matrices and
uncovers hidden patterns that can help predict ratings for unseen items, making it effective for
building recommendation systems.