
Advanced AIML

Module-4
1. Explain support, confidence, and lift for association rules
Support indicates the frequency with which items appear together in baskets, relative to all baskets considered (or a sample of them). For example, the support for (beer, diaper) is 2/4 (based on the data shown in Figure 9.1), that is, 50%, since the pair appears together in 2 out of 4 baskets.
Assume that X and Y are items being considered. Let
1. 𝑁 be the total number of baskets.
2. 𝑁𝑋𝑌 represent the number of baskets in which X and Y appear together.
3. 𝑁𝑋 represent the number of baskets in which X appears.
4. 𝑁𝑌 represent the number of baskets in which Y appears.
Then the support between X and Y, Support(X, Y), is given by
Support(X, Y) = N_XY / N

Confidence measures the proportion of the transactions containing X that also contain Y. X is called the antecedent and Y is called the consequent. Confidence can be calculated using the following formula:

Confidence(X → Y) = P(Y|X) = N_XY / N_X

where P(Y|X) is the conditional probability of Y given X.

Lift is calculated using the following formula:


Lift(X, Y) = Support(X, Y) / (Support(X) × Support(Y)) = (N × N_XY) / (N_X × N_Y)

Lift can be interpreted as the degree of association between two items. A lift value of 1 indicates that the items are independent (no association), a lift value of less than 1 implies that the products are substitutes (purchasing one product decreases the probability of purchasing the other), and a lift value of greater than 1 indicates that the purchase of product X increases the probability of purchase of product Y. A lift value of greater than 1 is a necessary condition for generating association rules.
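For illustration, here is a small Python sketch computing the three measures. The total of 4 baskets and the joint count of 2 follow the example above, while the individual beer and diaper counts are assumed only for this sketch:

N, N_beer, N_diaper, N_both = 4, 2, 3, 2  # hypothetical basket counts

support = N_both / N                        # 2/4 = 0.50
confidence = N_both / N_beer                # 2/2 = 1.00 for the rule {beer} -> {diaper}
lift = (N * N_both) / (N_beer * N_diaper)   # (4*2)/(2*3) ≈ 1.33, i.e., positive association

print(support, confidence, lift)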

2. Explain loading dataset and encoding of transactions

Loading the Dataset:

The dataset is read using Python's open() method to load transaction data from a CSV file (e.g., groceries.csv).
The steps include:

1. Open the file.


2. Read all lines from the file using readlines().
3. Remove leading and trailing whitespace using strip().
4. Split each line into items based on commas to create a list of transactions.

Example code:

all_txns = []

# Open the file and read all lines
with open('groceries.csv') as f:
    content = f.readlines()

# Remove leading and trailing whitespace from each line
txns = [x.strip() for x in content]

# Split each line on commas to create a list of transactions
for each_txn in txns:
    all_txns.append(each_txn.split(','))

The resulting all_txns variable contains a list of transactions, with each transaction represented as a list of items.

Encoding Transactions:

To apply association rule mining, transactions must be converted into a tabular or matrix format where:

● Rows represent transactions.


● Columns represent unique items.

Each entry in the matrix is one-hot encoded:

● 1: Indicates the item exists in the transaction.


● 0: Indicates the item does not exist in the transaction.

Example code:

from mlxtend.preprocessing import OnehotTransactions
import pandas as pd

# Initialize the encoder
# (Note: in recent mlxtend versions this class has been renamed TransactionEncoder.)
one_hot_encoding = OnehotTransactions()

# Learn the unique items and transform the transactions into a one-hot matrix
one_hot_txns = one_hot_encoding.fit(all_txns).transform(all_txns)

# Convert to a DataFrame with item names as columns
one_hot_txns_df = pd.DataFrame(one_hot_txns, columns=one_hot_encoding.columns_)


This DataFrame is sparse, with dimensions indicating the number of transactions and the number of unique items. It is ready for algorithms like Apriori to generate association rules.
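To sanity-check the encoding, the shape and the first few rows of the DataFrame can be inspected (a minimal sketch; the exact dimensions depend on the groceries file):

one_hot_txns_df.shape   # (number of transactions, number of unique items)
one_hot_txns_df.head()  # first few one-hot encoded transactions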

3. Explain Generating Association Rules, Top Ten Rules

Generating Association Rules:

The Apriori algorithm is used to generate frequent itemsets and association rules. It processes the following steps:

1. Generate Frequent Itemsets: Use the Apriori algorithm with parameters:


○ df: A one-hot-encoded DataFrame representing transactions.
○ min_support: The minimum threshold for support (e.g., 0.02 indicates itemsets present in at least 2% of all
transactions).
○ use_colnames: Set to True to use DataFrame column names in the output.

Example code:

from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(one_hot_txns_df, min_support=0.02, use_colnames=True)
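The resulting frequent_itemsets DataFrame contains support and itemsets columns; as a quick check, it can be sorted to view the most frequent itemsets (illustrative sketch):

frequent_itemsets.sort_values('support', ascending=False).head()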

2. Generate Rules: Use the association_rules function with the following parameters:

○ frequent_itemsets: The DataFrame of frequent itemsets.


○ metric: Choose between metrics such as confidence or lift to evaluate rules.
○ min_threshold: Minimum threshold for the selected metric (e.g., lift > 1.0 for positive association).

Example code:

from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

Top Ten Rules:

To retrieve the top 10 rules, sort the rules DataFrame by confidence in descending order. This approach
prioritizes rules with the highest reliability.

Example Code:

top_rules = rules.sort_values('confidence', ascending=False).head(10)
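The key columns of the sorted rules can then be displayed for inspection (a sketch based on the columns produced by mlxtend's association_rules):

top_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]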

Key Parameters Explained:

1. Support: Frequency of the itemset in the dataset.


2. Confidence: Probability of the consequent given the antecedent.
3. Lift: Degree to which the occurrence of the antecedent boosts the likelihood of the consequent.
By evaluating these parameters, the generated rules identify significant associations, such as {A} → {B}, indicating that transactions containing A often also include B.

4. Explain loading dataset and Calculating Cosine Similarity between Users for user-based similarity

Loading the Dataset:

The process of loading the dataset involves using the MovieLens dataset, which contains user ratings for movies. The dataset includes the following attributes:

● userId: A unique identifier for each user.


● movieId: A unique identifier for each movie.
● rating: The rating a user has given to a movie, on a scale of 1 to 5.
● timestamp: The time when the rating was given.

Steps to load the dataset:

1. The dataset is read into a DataFrame using the pandas library.

rating_df = pd.read_csv("ml-latest-small/ratings.csv")

2. The timestamp column, which is not required for this analysis, is dropped.

rating_df.drop("timestamp", axis=1, inplace=True)

3. Count the unique users and movies in the dataset:

len(rating_df.userId.unique()) # Number of unique users

len(rating_df.movieId.unique()) # Number of unique movies

A pivot table is then created to represent users as rows and movies as columns. The values in the matrix correspond to the ratings given by users to movies:

user_movies_df = rating_df.pivot(index="userId", columns="movieId", values="rating")

Movies not rated by a user are represented as NaN values.

Calculating Cosine Similarity Between Users:

Cosine similarity measures the angle between two vectors in multi-dimensional space, indicating how similar their directions are. It is computed for users based on their ratings.
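For reference, the cosine similarity between two users' rating vectors u and v is

cosine_similarity(u, v) = (u · v) / (||u|| × ||v||)

where u · v is the dot product of the two rating vectors and ||u|| denotes the Euclidean norm.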

Steps to calculate user-based similarity:

1. Replace the missing ratings (NaN) with 0, since pairwise_distances cannot handle NaN values, and then compute pairwise cosine similarity using the pairwise_distances function:

from sklearn.metrics import pairwise_distances

# Fill missing ratings with 0 before computing distances
user_movies_df.fillna(0, inplace=True)

user_sim = 1 - pairwise_distances(user_movies_df.values, metric="cosine")

2. Store the results in a DataFrame for better readability:

import pandas as pd

user_sim_df = pd.DataFrame(user_sim)

user_sim_df.index = rating_df.userId.unique()

user_sim_df.columns = rating_df.userId.unique()

3. Set diagonal values to 0 to exclude self-similarity:

import numpy as np

np.fill_diagonal(user_sim, 0)

The resulting similarity matrix can be used to identify the most similar users to any given user. For instance:

user_sim_df.idxmax(axis=1) # Finds the most similar user for each user

Cosine similarity values close to 1 indicate high similarity, while values close to 0 indicate dissimilarity.
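Beyond the single most similar user, the top few most similar users for a given user can also be read off the similarity DataFrame (a minimal sketch; user 2 is used only as an example label):

user_sim_df.loc[2].sort_values(ascending=False)[0:5]  # 5 users most similar to user 2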

5. Explain filtering of similar users, loading the movies dataset, and Finding Common Movies of Similar Users for user-based similarity

Filtering Similar Users:

1. To find the most similar users, the idxmax() function is applied row-wise (axis=1) to the cosine similarity matrix, which contains the similarities between all users.
2. Example Code:

user_sim_df.idxmax(axis=1)[0:5]

● This code retrieves the most similar user for each of the first 5 users. For instance, user 325 is most similar to user 1, user 338 to user 2, and so on.

Loading the Movies Dataset:

1. Dataset Information:
○ The movies.csv file contains movie information with columns:
■ movieId: Unique identifier for movies.
■ title: Name of the movie.
■ genres: Movie genres.
2. Steps to Load the Dataset:
○ Use pandas to read the dataset:

movies_df = pd.read_csv("ml-latest-small/movies.csv")

○ Drop the genres column as it is not needed for similarity analysis:

movies_df.drop('genres', axis=1, inplace=True)

○ View the first few records:

movies_df.head()

This dataset will be joined with user ratings later to identify common movies.

Finding Common Movies of Similar Users:

1. Steps:
○ Define a method get_user_similar_movies() to find common movies watched by two users and their
respective ratings:

def get_user_similar_movies(user1, user2):
    # Inner-join the two users' ratings on movieId
    common_movies = rating_df[rating_df.userId == user1].merge(
        rating_df[rating_df.userId == user2],
        on="movieId",
        how="inner")
    # Add movie titles by joining with the movies DataFrame
    return common_movies.merge(movies_df, on="movieId")

○ Example usage:

common_movies = get_user_similar_movies(2, 338)

○ Filter movies both users rated 4 or higher to limit the output:

common_movies[(common_movies.rating_x >= 4.0) & (common_movies.rating_y >= 4.0)]

2. Insights:
This process identifies shared movie preferences and similar ratings between users. For example, users 2 and 338 may
have rated "Apollo 13" and "Schindler's List" highly, indicating similar tastes.

6. Explain Calculating Cosine Similarity between Movies and finding most similar movies for item-based similarity
If two movies, movie A and movie B, have been watched by several users and rated very similarly, then movie A and
movie B can be considered similar in taste. In other words, if a user watches movie A, then he or she is very likely to watch movie B, and
vice versa.

Calculating Cosine Similarity Between Movies:

In item-based similarity, the cosine similarity between movies is calculated based on user ratings to identify
similar movies. Here's the process:

1. Create a Pivot Table:


○ The rows represent movies (movieId), columns represent users (userId), and the matrix values
represent user ratings.
○ The pivot() method is used to reshape the DataFrame:

rating_mat = rating_df.pivot(index='movieId', columns='userId',
                             values="rating").reset_index(drop=True)

○ Fill missing values (NaN) with 0 since not all movies are rated by all users:

rating_mat.fillna(0, inplace=True)

2. Calculate Similarity Matrix:


● Use pairwise_distances with the correlation metric to compute similarities between movies:
from sklearn.metrics import pairwise_distances

movie_sim = 1 - pairwise_distances(rating_mat.values, metric="correlation")

○ Convert the similarity matrix into a DataFrame for easy access:

movie_sim_df = pd.DataFrame(movie_sim)

3. Set Diagonal to Zero:


○ Since a movie is most similar to itself, set diagonal values to 0 to exclude self-similarity:
np.fill_diagonal(movie_sim, 0)

This results in a similarity matrix where each entry represents the similarity between two movies.

Finding Most Similar Movies:

To identify movies most similar to a given movie:


1. Define a Function:
○ Create a function get_similar_movies() that takes a movieId and the number of top similar
movies (topN) as inputs:

def get_similar_movies(movieid, topN=5):
    # Get the index of the movie in the DataFrame
    movieidx = movies_df[movies_df.movieId == movieid].index[0]
    # Add a similarity column to the DataFrame
    movies_df['similarity'] = movie_sim_df.iloc[movieidx]
    # Sort movies by similarity in descending order
    top_n = movies_df.sort_values(["similarity"], ascending=False)[0:topN]
    return top_n

2. Find Similar Movies:

● For a given movieId (e.g., "The Godfather" with movieId=858), call the function:

similar_movies = get_similar_movies(858)

● This returns the top N movies similar to the specified movie, ranked by similarity.

This approach allows recommendations for movies based on their inherent similarities derived from user
rating patterns.

7. Explain user-based similarity by using the Surprise library

The Surprise library offers an efficient way to implement user-based collaborative filtering for building
recommendation systems. The steps for implementing this approach are as follows:

1. Loading the Dataset:

● The Dataset class from Surprise is used to load the dataset into a suitable format for building models. A
Reader object specifies the rating scale (e.g., 1 to 5).
● Code example:

from surprise import Dataset, Reader

reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(rating_df[['userId', 'movieId', 'rating']], reader=reader)


2. User-Based Similarity Algorithm:

● The KNNBasic algorithm from Surprise is used for user-based collaborative filtering. This method finds
similar users based on shared movie ratings and makes recommendations based on the ratings of the
nearest neighbors.
● Key parameters of KNNBasic:
○ k: Number of nearest neighbors (e.g., 20).
○ min_k: The minimum number of neighbors required to make a prediction; if fewer are available, the global mean rating is used.
○ sim_options: Defines similarity measure. Set 'name' to 'pearson' or 'cosine' and
'user_based' to True for user-based similarity.
● Example code:

from surprise import KNNBasic

sim_options = {'name': 'pearson', 'user_based': True}

knn = KNNBasic(k=20, min_k=5, sim_options=sim_options)
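Once the parameters are chosen, the model can be fit on the full training set and used to predict an individual rating (a minimal sketch; userId 2 and movieId 858 are purely illustrative):

trainset = data.build_full_trainset()
knn.fit(trainset)
pred = knn.predict(uid=2, iid=858)  # predicted rating of movie 858 by user 2
print(pred.est)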

3. Cross-Validation and Model Evaluation:

● Use cross_validate to perform K-fold cross-validation and evaluate the model's performance using
metrics like RMSE.
● Example:

from surprise.model_selection import cross_validate

cv_results = cross_validate(knn, data, measures=['RMSE'], cv=5, verbose=False)
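The mean RMSE across the folds can then be checked from the returned dictionary (a sketch; cross_validate stores per-fold scores under the 'test_rmse' key):

cv_results['test_rmse'].mean()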

4. Finding Most Similar Users:

● To identify the most similar users, the similarity matrix is computed using cosine similarity or Pearson
correlation. The pairwise_distances function from Scikit-learn can calculate the distances between
users. Example code:

from sklearn.metrics import pairwise_distances

user_sim = 1 - pairwise_distances(user_movies_df.values, metric="cosine")

5. Grid Search for Hyperparameter Tuning:

● GridSearchCV can be used to search for the best parameters, including the number of neighbors and
similarity measures. Example:

from surprise.model_selection import GridSearchCV

param_grid = {'k': [10, 20], 'sim_options': {'name': ['cosine', 'pearson'], 'user_based': [True, False]}}
grid_cv = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=5, refit=True)

grid_cv.fit(data)

print(grid_cv.best_score['rmse'])
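The corresponding best parameter combination can also be inspected (a sketch; best_params is keyed by the metric name):

print(grid_cv.best_params['rmse'])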

This method helps to refine the model by selecting the optimal parameters for the best performance.

8. Explain MATRIX FACTORIZATION

Matrix factorization is a technique used to decompose a large user-item rating matrix into smaller
matrices, which capture latent features that explain observed ratings. The main idea is to
approximate the original matrix as the product of two lower-dimensional matrices. This process is
useful in recommendation systems, where the goal is to predict ratings for unrated items based on
patterns identified in the observed ratings.

Steps of Matrix Factorization:

1. Decomposition of User-Item Matrix:


A user-item matrix (with users as rows, items as columns, and ratings as values) is factorized into
two lower-dimensional matrices:
○ Users-Factors Matrix: Represents how users relate to latent factors (such as preferences for
movie genres, actors, etc.).
○ Factors-Movies Matrix: Represents how movies relate to these latent factors.

For instance, in a 3x5 matrix (3 users and 5 movies), the matrix can be decomposed into:

○ A 3x3 Users-Factors matrix.


○ A 3x5 Factors-Movies matrix.

The multiplication of these two matrices approximates the original user-item matrix.

Latent Factors: The "latent factors" refer to hidden variables that influence how a user rates a movie. These could
be related to aspects like movie genre, director, or actors, but they are not directly observable.

2. Singular Value Decomposition (SVD):


One popular matrix factorization technique is Singular Value Decomposition (SVD), which
decomposes a matrix into three matrices. In this context, it helps find latent features that explain
the ratings observed in the user-item matrix.
Example Code (using Surprise library):

from surprise import SVD

svd = SVD(n_factors=5) # Using 5 latent factors


3. Cross-validation for Model Performance:

Once the model is trained using SVD, it is evaluated using cross-validation to measure its
performance, typically by using RMSE (Root Mean Squared Error).

from surprise.model_selection import cross_validate

cv_results = cross_validate(svd, data, measures=['RMSE'], cv=5, verbose=True)
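After evaluation, the SVD model can be fit on the full training set and used to predict a rating for a specific user-movie pair (a minimal sketch, assuming the data object created earlier with the Surprise Dataset class; the ids below are illustrative):

trainset = data.build_full_trainset()
svd.fit(trainset)
pred = svd.predict(uid=1, iid=858)  # predicted rating of movie 858 by user 1
print(pred.est)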

In summary, matrix factorization reduces the complexity of user-item rating matrices and
uncovers hidden patterns that can help predict ratings for unseen items, making it effective for
building recommendation systems.
