AIML Mod4 Loki
Module-4
1. Explain support, confidence and Lift for Association rule
Support indicates the frequency of items appearing together in baskets with respect to all possible baskets considered (or in a sample). For example, the support for (beer, diaper) will be 2/4 (based on the data shown in Figure 9.1), that is, 50%, as the pair appears together in 2 out of 4 baskets.
Assume that X and Y are items being considered. Let
1. N be the total number of baskets.
2. N_XY represent the number of baskets in which X and Y appear together.
3. N_X represent the number of baskets in which X appears.
4. N_Y represent the number of baskets in which Y appears.
Then the support between X and Y is given by
Support(X, Y) = N_XY / N
Confidence measures the proportion of the transactions that contain X which also contain Y. X is called the antecedent and Y is called the consequent. Confidence can be calculated using the following formula:
Confidence(X → Y) = P(Y|X) = N_XY / N_X
Lift can be interpreted as the degree of association between two items. A lift value of 1 indicates that the items are independent (no association), a lift value of less than 1 implies that the products are substitutes (purchasing one product will decrease the probability of purchasing the other), and a lift value of greater than 1 indicates that purchasing Product X will increase the probability of purchasing Product Y. It is computed as
Lift(X → Y) = Support(X, Y) / (Support(X) × Support(Y))
A lift value of greater than 1 is a necessary condition for generating association rules.
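A minimal numeric sketch of the three measures, using the counts from the beer/diaper example above (N = 4 and N_XY = 2 come from the text; N_X and N_Y are assumed here purely for illustration):
N = 4            # total baskets (from the example)
N_XY = 2         # baskets containing both X (beer) and Y (diaper)
N_X, N_Y = 3, 3  # illustrative individual counts, not from the text

support = N_XY / N                            # 0.5
confidence = N_XY / N_X                       # P(Y|X) ~ 0.67
lift = support / ((N_X / N) * (N_Y / N))      # ~ 0.89, i.e. < 1 here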
The dataset is read using Python's open() method to load transaction data from a CSV file (e.g., groceries.csv). The steps include reading the file line by line and splitting each line on commas to obtain the items of each transaction.
Example code:
all_txns = []
# Open the file and read all transactions, one per line
with open('groceries.csv') as f:
    content = f.readlines()

# Split each line on commas to get the items of that transaction
for each_txn in content:
    all_txns.append(each_txn.strip().split(','))
The resulting all_txns variable contains a list of transactions, with each transaction represented as a list of items.
Encoding Transactions:
To apply association rule mining, transactions must be converted into a tabular (matrix) format where each row represents a transaction, each column represents an item, and each cell indicates whether that item is present in the transaction.
Example code:
import pandas as pd
from mlxtend.preprocessing import OnehotTransactions

one_hot_encoding = OnehotTransactions()
one_hot_txns = one_hot_encoding.fit(all_txns).transform(all_txns)
# Convert to a DataFrame with one boolean column per item
one_hot_txns_df = pd.DataFrame(one_hot_txns, columns=one_hot_encoding.columns_)
The Apriori algorithm is used to generate frequent itemsets and association rules. It proceeds in the following steps:
1. Generate Frequent Itemsets: Use the apriori function with a minimum support threshold.
Example code:
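(A minimal sketch assuming mlxtend's apriori function; the 0.02 support threshold is illustrative.)
from mlxtend.frequent_patterns import apriori

# Keep itemsets that appear in at least 2% of all transactions
frequent_itemsets = apriori(one_hot_txns_df, min_support=0.02, use_colnames=True)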
2. Generate Rules: Use the association_rules function with parameters such as metric (the measure used to filter rules, e.g., confidence) and min_threshold (the minimum value of that measure).
Example code:
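(A sketch with illustrative parameter values.)
from mlxtend.frequent_patterns import association_rules

# Keep only rules whose confidence is at least 0.3
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)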
To retrieve the top 10 rules, sort the rules DataFrame by confidence in descending order. This approach prioritizes rules with the highest reliability.
Example Code:
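(A sketch assuming the rules DataFrame generated above.)
# The ten rules with the highest confidence
top_10_rules = rules.sort_values('confidence', ascending=False).head(10)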
4. Explain loading the dataset and Calculating Cosine Similarity between Users for user-based similarity
The process of loading the dataset involves using the MovieLens dataset, which contains user ratings for movies. The dataset includes the following attributes: userId, movieId, rating, and timestamp.
1. The ratings file is read into a DataFrame:
rating_df = pd.read_csv("ml-latest-small/ratings.csv")
2. The timestamp column, which is not required for this analysis, is dropped.
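(A sketch of the drop, assuming the standard MovieLens column name timestamp.)
# Remove the timestamp column in place
rating_df.drop('timestamp', axis=1, inplace=True)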
A pivot table is then created to represent users as rows and movies as columns. The values in the matrix correspond to the ratings given by users to movies:
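(A minimal sketch assuming pandas pivot; movies a user has not rated become NaN and are filled with 0 before computing similarity.)
rating_mat = rating_df.pivot(index='userId', columns='movieId', values='rating')
rating_mat.fillna(0, inplace=True)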
Cosine similarity measures the angle between two vectors in multi-dimensional space, indicating the similarity in their direction. It is computed for users based on their ratings.
import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances

# Cosine similarity = 1 - cosine distance between user rating vectors
user_sim = 1 - pairwise_distances(rating_mat.values, metric='cosine')

# Set self-similarity to 0 so a user is not reported as their own nearest neighbour
np.fill_diagonal(user_sim, 0)

user_sim_df = pd.DataFrame(user_sim)
user_sim_df.index = rating_df.userId.unique()
user_sim_df.columns = rating_df.userId.unique()
The resulting similarity matrix can be used to identify the most similar users to any given user. Cosine similarity values close to 1 indicate high similarity, while values close to 0 indicate dissimilarity.
5. Explain filtering of similar users, loading the movies dataset, and Finding Common Movies of Similar Users for user-based similarity
1. To find the most similar users, the idxmax() function is applied along each row of the cosine similarity matrix, which contains the similarities between all pairs of users.
2. Example Code:
user_sim_df.idxmax(axis=1)[0:5]
● This code retrieves the most similar user for each of the first 5 users. For instance, user 325 is most similar to user 1, and so on for the remaining users.
1. Dataset Information:
○ The movies.csv file contains movie information with columns:
■ movieId: Unique identifier for movies.
■ title: Name of the movie.
■ genres: Movie genres.
2. Steps to Load the Dataset:
○ Use pandas to read the dataset:
movies_df = pd.read_csv("ml-latest-small/movies.csv")
movies_df.head()
This dataset will be joined with user ratings later to identify common movies.
1. Steps:
○ Define a method get_user_similar_movies() to find common movies watched by two users and their respective ratings:
def get_user_similar_movies(user1, user2):
    # Inner join keeps only the movies rated by both users
    common_movies = rating_df[rating_df.userId == user1].merge(
        rating_df[rating_df.userId == user2],
        on="movieId",
        how="inner")
    # Attach movie titles for readability
    return common_movies.merge(movies_df, on="movieId")
○ Example usage:
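(A sketch using the pair of users discussed below, users 2 and 338.)
get_user_similar_movies(2, 338)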
2. Insights:
This process identifies shared movie preferences and similar ratings between users. For example, users 2 and 338 may have rated "Apollo 13" and "Schindler's List" highly, indicating similar tastes.
6. Explain Calculating Cosine Similarity between Movies and finding most similar movies for item-based similarity
If two movies, movie A and movie B, have been watched by several users and rated very similarly, then movie A and movie B can be similar in taste. In other words, if a user watches movie A, then he or she is very likely to watch movie B, and vice versa.
In item-based similarity, the cosine similarity between movies is calculated based on user ratings to identify
similar movies. Here's the process:
○ Create a movie-wise rating matrix (movies as rows, users as columns) and fill missing values (NaN) with 0, since not all movies are rated by all users:
rating_mat = rating_df.pivot(index='movieId', columns='userId', values='rating')
rating_mat.fillna(0, inplace=True)
○ Compute the cosine similarity between movies and store it in a DataFrame:
movie_sim = 1 - pairwise_distances(rating_mat.values, metric='cosine')
movie_sim_df = pd.DataFrame(movie_sim)
This results in a similarity matrix where each entry represents the similarity between two movies.
● Define a method get_similar_movies() that ranks movies by their similarity to a given movie:
def get_similar_movies(movieid, topN=5):
    # Position of the movie's row in the similarity matrix
    movieidx = movies_df[movies_df.movieId == movieid].index[0]
    movies_df['similarity'] = movie_sim_df.iloc[movieidx]
    # Movies ranked by similarity to the given movie
    top_n = movies_df.sort_values(['similarity'], ascending=False)[0:topN]
    return top_n
● For a given movieId (e.g., "The Godfather" with movieId=858), call the function:
similar_movies = get_similar_movies(858)
● This returns the top N movies similar to the specified movie, ranked by similarity.
This approach allows recommendations for movies based on their inherent similarities derived from user
rating patterns.
The Surprise library offers an efficient way to implement user-based collaborative filtering for building
recommendation systems. The steps for implementing this approach are as follows:
● The Dataset class from Surprise is used to load the dataset into a suitable format for building models. A
Reader object specifies the rating scale (e.g., 1 to 5).
● Code example:
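(A minimal sketch assuming the ratings are already in rating_df with columns userId, movieId, and rating.)
from surprise import Dataset, Reader

# The Reader tells Surprise the scale on which ratings are given
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(rating_df[['userId', 'movieId', 'rating']], reader)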
● The KNNBasic algorithm from Surprise is used for user-based collaborative filtering. This method finds
similar users based on shared movie ratings and makes recommendations based on the ratings of the
nearest neighbors.
● Key parameters of KNNBasic:
○ k: Number of nearest neighbors (e.g., 20).
○ min_k: Minimum number of neighbors considered.
○ sim_options: Defines similarity measure. Set 'name' to 'pearson' or 'cosine' and
'user_based' to True for user-based similarity.
● Example code:
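(A sketch with illustrative parameter values.)
from surprise import KNNBasic

# Pearson similarity between users; k=20 and min_k=5 are illustrative
sim_options = {'name': 'pearson', 'user_based': True}
algo = KNNBasic(k=20, min_k=5, sim_options=sim_options)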
● Use cross_validate to perform K-fold cross-validation and evaluate the model's performance using
metrics like RMSE.
● Example:
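(A sketch assuming the data object loaded above.)
from surprise.model_selection import cross_validate

# 5-fold cross-validation, reporting RMSE for each fold
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)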
● To identify the most similar users, the similarity matrix is computed using cosine similarity or Pearson
correlation. The pairwise_distances function from Scikit-learn can calculate the distances between
users. Example code:
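(A sketch assuming a user-movie rating matrix such as the rating_mat built in the user-based section.)
from sklearn.metrics import pairwise_distances

# Convert cosine distances into similarities
user_sim = 1 - pairwise_distances(rating_mat.values, metric='cosine')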
● GridSearchCV can be used to search for the best parameters, including the number of neighbors and
similarity measures. Example:
from surprise.model_selection import GridSearchCV

# Search over neighbourhood sizes and similarity configurations
param_grid = {'k': [10, 20],
              'sim_options': {'name': ['cosine', 'pearson'],
                              'user_based': [True, False]}}
grid_cv = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=5, refit=True)
grid_cv.fit(data)
print(grid_cv.best_score['rmse'])
This method helps refine the model by selecting the optimal parameters for the best performance.
Matrix factorization is a technique used to decompose a large user-item rating matrix into smaller
matrices, which capture latent features that explain observed ratings. The main idea is to
approximate the original matrix as the product of two lower-dimensional matrices. This process is
useful in recommendation systems, where the goal is to predict ratings for unrated items based on
patterns identified in the observed ratings.
For instance, a 3x5 matrix (3 users and 5 movies) can be decomposed into a 3xk user-factor matrix and a kx5 movie-factor matrix, where k is the number of latent factors. The multiplication of these two matrices approximates the original user-item matrix.
Latent Factors:
The "latent factors" refer to hidden variables that influence how a user rates a movie. These could be related to aspects like movie genre, director, or actors, but they are not directly observable.
Once the model is trained using SVD, it is evaluated using cross-validation to measure its
performance, typically by using RMSE (Root Mean Squared Error).
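(A minimal sketch assuming the Surprise library's SVD implementation and the data object from earlier; n_factors=10 is illustrative.)
from surprise import SVD
from surprise.model_selection import cross_validate

# Matrix factorization with 10 latent factors, evaluated by 5-fold CV on RMSE
svd = SVD(n_factors=10)
cross_validate(svd, data, measures=['RMSE'], cv=5, verbose=True)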
In summary, matrix factorization reduces the complexity of user-item rating matrices and
uncovers hidden patterns that can help predict ratings for unseen items, making it effective for
building recommendation systems.