
Evaluating Recommender Systems
EVALUATION METRICS

• 1. Mean Average Precision at K
• It measures how relevant the list of recommended items is. Here, Precision at K is the share of the top-K
recommended items that are relevant.
• 2. Coverage
• It is the percentage of items in the training data that the model is able to recommend in the test sets. In other
words, the share of the catalog the system can actually recommend.
• 3. Personalization
• It measures how much the recommended items differ between users, i.e., the dissimilarity between users'
recommendation lists.
• 4. Intra-list Similarity
• It is the average cosine similarity between all pairs of items in a list of recommendations.
• Common Metrics Used
• Predictive accuracy metrics, classification accuracy metrics, rank accuracy metrics, and non-accuracy
measurements are the four major types of evaluation metrics for recommender systems.

• Predictive Accuracy Metrics
• Predictive accuracy or rating prediction metrics address how close a recommender’s estimated ratings are to
the genuine user ratings. This sort of measure is widely used for evaluating non-binary ratings.

• It is best suited for usage scenarios in which accurate prediction of ratings for all products is critical. Mean
Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Normalized
Mean Absolute Error (NMAE) are the most important measures for this purpose.
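• As a rough sketch (not from the original slides), these error metrics can be computed as follows, assuming y_true and y_pred are parallel lists of actual and predicted ratings on a 1–5 scale:

```python
import math

def rating_errors(y_true, y_pred, rating_min=1.0, rating_max=5.0):
    """MAE, MSE, RMSE and NMAE for predicted vs. actual ratings."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)
    nmae = mae / (rating_max - rating_min)  # MAE normalized by the rating range
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "NMAE": nmae}

# Hypothetical actual vs. predicted ratings
print(rating_errors([4, 3, 5, 2], [3.5, 3.0, 4.0, 2.5]))
```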
Offline recommender system
methodology
Cosine similarity
• To compute the similarity between a purchased item and
the new item for an item-centered system, we simply
take the cosine between the two vectors representing those
items.
• Cosine similarity is a good fit when there are many
high-dimensional features, especially in text mining.
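• A minimal sketch of this computation, assuming each item is represented by a plain list of numeric feature values (the vectors below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two item feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors of a purchased item and a new item
print(cosine_similarity([1, 0, 2, 3], [0, 1, 2, 2]))  # ≈ 0.89
```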
Jaccard similarity

• Jaccard similarity is the size of the intersection divided
by the size of the union of two sets of items.
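• A minimal sketch, assuming each item is described by a set of attributes (the tag sets below are illustrative):

```python
def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

print(jaccard_similarity({"action", "thriller", "crime"},
                         {"action", "crime", "drama"}))  # 2/4 = 0.5
```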
Top-K parameter

• The K parameter is the evaluation cutoff point. It
represents the number of top-ranked items to evaluate.
For example, you can focus on the quality of the top-10
recommendations.
• To evaluate a recommendation or ranking system, you need:
• The model predictions. They include the ranked list of
user-item pairs. The complete dataset also contains features
that describe users or items. You’ll need them for some of
the metrics.
• The ground truth. You need to know the actual user-item
relevance to evaluate the quality of predictions. This might
be a binary or graded relevance score. It is often based on
the user interactions, such as clicks and conversions.
• The K. You need to pick the number of top
recommendations to consider. This puts a constraint on
evaluation: you will disregard anything that happens after
this cutoff point.
• 1. Predictive metrics. They reflect the “correctness” of
recommendations and show how well the system finds
relevant items.
• 2. Ranking metrics. They reflect the ranking quality:
how well the system can sort the items from more
relevant to less relevant.
• 3. Behavioral metrics. These metrics reflect specific
properties of the system, such as how diverse or novel
the recommendations are.
Precision at K
• Precision at K shows how many of the recommended
items in the top K are relevant. It gives an assessment
of prediction “correctness.” It is intuitive and easy to
understand: Precision in ranking works the same as its
counterpart in classification quality evaluation.
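• A minimal sketch, assuming recommended is one user's ranked list of item IDs and relevant is the set of ground-truth relevant items (both hypothetical):

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

print(precision_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=4))  # 2/4 = 0.5
```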
Recall at K
• Recall at K measures the coverage of relevant items in
the top K.
• Recall at K shows how many relevant items, out of their
total number, you can successfully retrieve within the
top K recommendations.
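• A corresponding sketch for Recall at K, with the same hypothetical inputs:

```python
def recall_at_k(recommended, relevant, k):
    """Share of all relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

print(recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=4))  # 2/3 ≈ 0.67
```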
F-score
• The F Beta score is a metric that balances Precision and
Recall.
• The F Beta score at K combines Precision and Recall
metrics into a single value to provide a balanced
assessment. The Beta parameter allows adjusting the
importance of Recall relative to Precision.
• If you set the Beta to 1, you will get the standard F1
score, a harmonic mean of Precision and Recall.
• The F Beta score is a good metric when you care about
both properties: correctness of predictions and ability to
cover as many relevant items as possible with the top-K.
The Beta parameter allows you to customize the
priorities.
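• A minimal sketch of the F-Beta score at K, computed directly from the hit count (Beta = 1 gives the standard F1 score; the inputs are hypothetical):

```python
def f_beta_at_k(recommended, relevant, k, beta=1.0):
    """F-Beta score at K, combining Precision@K and Recall@K."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

print(f_beta_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=4, beta=1.0))  # ≈ 0.57
```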
• Precision and Recall depend heavily on the total number
of relevant items. Because of this, it might be
challenging to compare the performance across
different lists.
• In addition, metrics like Precision and Recall are not
rank-aware. They are indifferent to the position of
relevant items inside the top K.
• Consider two lists that both have 5 out of 10 matches.
In the first list, the relevant items are at the very top. In
the second, they are at the very bottom. The Precision at
10 will be the same (50%) in both cases, because Precision
only counts relevant items and ignores their positions.
Ranking quality metrics

• Ranking metrics help assess the ability to order the
items based on their relevance to the user or query. In
an ideal scenario, all the relevant items should appear
ahead of the less relevant ones. Ranking metrics help
measure how far you are from this.
• MRR
• MRR calculates the average of the reciprocal ranks of
the first relevant item.
• MRR (Mean Reciprocal Rank) shows how soon you
can find the first relevant item.
• To calculate MRR, you take the reciprocal of the rank of
the first relevant item and average this value across all
queries or users.
• For example, if the first relevant item appears in the
second position, this list's RR (Reciprocal Rank) is 1/2. If
the first relevant item takes the third place, then the RR
equals 1/3, and so on.
• Once you compute the RRs for all lists, you can average
them to get the resulting MRR for all users or queries.
• MRR is an easy-to-understand and intuitive metric. It is
beneficial when the top-ranked item matters: for
example, you expect the search engine to return a
relevant first result.
• However, the limitation is that MRR solely focuses on
the first relevant item and disregards all the rest. In
case you care about overall ranking, you might need
additional metrics.
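• A minimal sketch, assuming recommendations maps each user to a ranked list and relevant maps each user to a set of ground-truth items (all names hypothetical):

```python
def mean_reciprocal_rank(recommendations, relevant):
    """Average of 1/rank of the first relevant item across all users."""
    rr_values = []
    for user, ranked_items in recommendations.items():
        rr = 0.0
        for rank, item in enumerate(ranked_items, start=1):
            if item in relevant.get(user, set()):
                rr = 1.0 / rank
                break
        rr_values.append(rr)
    return sum(rr_values) / len(rr_values)

recs = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"]}
truth = {"u1": {"b"}, "u2": {"z"}}
print(mean_reciprocal_rank(recs, truth))  # (1/2 + 1/3) / 2 ≈ 0.42
```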
MAP
• MAP measures the average Precision across different
Recall levels for a ranked list.
• Mean Average Precision (MAP) at K evaluates the
average Precision at all relevant ranks within the list of
top K recommendations. This helps get a comprehensive
measure of recommendation system performance,
accounting for the quality of the ranking.
• To compute MAP, you first need to calculate the Average
Precision (AP) for each list: an average of the Precision
values at every position within the top K that holds a
relevant recommendation.
• Once you compute the AP for every list, you can average
it across all users. Here is the complete formula:
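• One common formulation of the formula (where rel(k) is 1 if the item at position k is relevant and 0 otherwise, Rel_u is the set of relevant items for user u, and U is the set of users) is:

```latex
\mathrm{AP@K}(u) = \frac{1}{\min\bigl(K,\ |\mathrm{Rel}_u|\bigr)} \sum_{k=1}^{K} \mathrm{Precision@}k \cdot \mathrm{rel}(k)
\qquad
\mathrm{MAP@K} = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP@K}(u)
```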
• MAP helps address the limitations of “classic” Precision
and Recall: it evaluates both the correctness of
recommendations and how well the system can sort the
relevant items inside the list.
• Due to the underlying formula, MAP heavily rewards
correct recommendations at the top of the list: an error
at the top is factored into every consecutive Precision
computation.
• MAP is a valuable metric when it is important to get the
top predictions right, like in information retrieval. As a
downside, this metric might be hard to communicate and
does not have an immediate intuitive explanation.
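• A minimal sketch of MAP at K under that formulation, with the same hypothetical data structures as in the MRR example:

```python
def average_precision_at_k(ranked_items, relevant_items, k):
    """AP@K: average of Precision@k over the relevant positions in the top k."""
    hits, precisions = 0, []
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant_items:
            hits += 1
            precisions.append(hits / rank)
    denom = min(k, len(relevant_items))
    return sum(precisions) / denom if denom else 0.0

def map_at_k(recommendations, relevant, k):
    """MAP@K: AP@K averaged across all users."""
    scores = [average_precision_at_k(items, relevant.get(user, set()), k)
              for user, items in recommendations.items()]
    return sum(scores) / len(scores)

recs = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"]}
truth = {"u1": {"a", "c"}, "u2": {"y"}}
print(map_at_k(recs, truth, k=3))  # (0.83 + 0.5) / 2 ≈ 0.67
```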
Hit rate

• Hit Rate measures the share of users that get at least
one relevant recommendation.
• Hit Rate at K calculates the share of users for which at
least one relevant item is present in the top K. This
metric is very intuitive.
• You can get a binary score for each user: “1” if there is
at least a single relevant item in top K or “0” otherwise.
Then, you can compute the average hit rate across all
users.
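• A minimal sketch, again with hypothetical per-user recommendation lists and ground-truth sets:

```python
def hit_rate_at_k(recommendations, relevant, k):
    """Share of users with at least one relevant item in their top-k list."""
    hits = sum(
        1 for user, ranked_items in recommendations.items()
        if any(item in relevant.get(user, set()) for item in ranked_items[:k])
    )
    return hits / len(recommendations)

recs = {"u1": ["a", "b"], "u2": ["x", "y"], "u3": ["m", "n"]}
truth = {"u1": {"b"}, "u2": {"q"}, "u3": {"m"}}
print(hit_rate_at_k(recs, truth, k=2))  # 2/3 ≈ 0.67
```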
Behavioral metrics

• Behavioral metrics help go “beyond accuracy” and
evaluate important qualities of a recommender system,
like the diversity and novelty of recommendations.
• 1. Diversity
• Recommendation diversity assesses how varied the
recommended items are for each user. It reflects the
breadth of item types or categories to which each user
is exposed.
• To compute this metric, you can measure the intra-list
diversity by evaluating the average Cosine Distance
between pairs of items inside the list. Then, you can
average it across all users.
• Diversity is helpful if you expect users to have a better
experience when they receive recommendations that
span a diverse range of topics, genres, or
characteristics.
• However, while diversity helps check if a system can
show a varied mix of items, it does not consider
relevance. You can use this metric with ranking or
predictive metrics to get a complete picture.
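• A minimal sketch of the intra-list computation described above, assuming each recommended item has a numeric feature vector (the vectors are illustrative):

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def intra_list_diversity(recommended_items, item_vectors):
    """Average pairwise Cosine Distance between items in one recommendation list."""
    pairs = list(combinations(recommended_items, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(item_vectors[i], item_vectors[j])
               for i, j in pairs) / len(pairs)

vectors = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [1, 1, 0]}
print(intra_list_diversity(["a", "b", "c"], vectors))  # ≈ 0.53
```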
Novelty

• Novelty assesses how unique or unusual the recommended items
are. It measures the degree to which the suggested items differ
from popular ones.
• You can compute novelty as the negative logarithm (base 2) of
the probability of encountering a given item in a training set. High
novelty corresponds to long-tail items that few users interacted
with, and low novelty corresponds to popular items. Then, you can
average the novelty inside the list and across users.
• Novelty reflects the system's ability to recommend items that are
not well-known in the dataset. It is helpful for scenarios when you
expect users to get new and unusual recommendations to stay
engaged.
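• A minimal sketch, assuming the item probability is the share of training interactions involving that item (one possible choice; the data below is made up):

```python
import math
from collections import Counter

def novelty_at_k(recommendations, train_interactions, k):
    """Mean -log2(p(item)) over recommended items, averaged across users."""
    counts = Counter(train_interactions)
    total = len(train_interactions)
    per_user = []
    for ranked_items in recommendations.values():
        scores = [-math.log2(counts[item] / total)
                  for item in ranked_items[:k] if counts[item] > 0]
        if scores:
            per_user.append(sum(scores) / len(scores))
    return sum(per_user) / len(per_user)

train = ["a", "a", "a", "b", "b", "c"]       # hypothetical interaction log
recs = {"u1": ["a", "c"], "u2": ["b", "c"]}
print(novelty_at_k(recs, train, k=2))        # higher = more long-tail items
```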
Serendipity
• Serendipity measures the unexpectedness or pleasant surprise in
recommendations. Serendipity evaluates the system's ability to
suggest items beyond the user's typical preferences or
expectations.
• Serendipity is challenging to quantify precisely, but one way to
approach it is by considering the dissimilarity (measured via Cosine
Distance) between successfully recommended items and a user's
historical preferences. Then, you can average it across users.
• Serendipity reflects the ability of the system to venture beyond the
predictable and offer new recommendations that users enjoy. It
promotes exploring diverse and unexpected content, adding an
element of delight and discovery.
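• One possible sketch of the dissimilarity-based approach above, for a single user with hypothetical item vectors:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def user_serendipity(successful_recs, history, item_vectors):
    """Mean Cosine Distance between each successfully recommended item and
    the user's historical items; higher means more unexpected."""
    scores = []
    for rec in successful_recs:
        dists = [cosine_distance(item_vectors[rec], item_vectors[h]) for h in history]
        scores.append(sum(dists) / len(dists))
    return sum(scores) / len(scores) if scores else 0.0

vectors = {"r1": [1, 0], "h1": [0, 1], "h2": [1, 1]}
print(user_serendipity(["r1"], ["h1", "h2"], vectors))  # ≈ 0.65
```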
Popularity bias

• Popularity bias refers to a phenomenon where the
recommender favors popular items over more diverse or niche
ones. It can lead to a lack of personalization, causing users to see
the same widely popular items repeatedly. This bias may result in a
less diverse and engaging user experience.
• There are different ways to evaluate the popularity of
recommendations, for example:
• Coverage: the share of all items in the catalog present in
recommendations.
• Average recommendation popularity (ARP).
• Average overlap between the items in the lists.
• Gini index.
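• As an example, catalog coverage can be sketched as follows (the catalog and recommendation lists are hypothetical):

```python
def catalog_coverage(recommendations, catalog):
    """Share of all catalog items that appear in at least one recommendation list."""
    recommended = set()
    for ranked_items in recommendations.values():
        recommended.update(ranked_items)
    return len(recommended & set(catalog)) / len(catalog)

recs = {"u1": ["a", "b"], "u2": ["a", "c"]}
print(catalog_coverage(recs, catalog=["a", "b", "c", "d", "e"]))  # 3/5 = 0.6
```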
What are the steps to validate data from
recommendation systems?

1. Define the objectives
2. Assess the data sources
3. Clean and transform the data
4. Split the data into subsets
5. Apply the recommendation models
6. Evaluate the recommendations
• Data validation is the process of checking the quality,
completeness, and consistency of the data before using it for
analysis or modeling.

Define the objectives
• The first step to validate data from recommendation systems is to
define the objectives and criteria of the validation. What are you
trying to achieve with the recommendations? What are the key
performance indicators (KPIs) that measure the success of the
recommendations? How do you define the quality and relevance of
the data? These questions will help you set the scope and
standards of the validation and align them with the business goals
and user needs.
Assess the data sources

• The next step is to assess the data sources that provide
the input for the recommendation systems. Data
sources can include user profiles, preferences, ratings,
feedback, browsing history, transactions, social media,
etc. You need to evaluate the reliability, availability, and
accessibility of these sources and identify any potential
issues or gaps. For example, you might want to check if
the data is updated regularly, if it covers enough users
and items, if it has enough diversity and variety, and if
it is compatible with the data formats and platforms you
use.
Clean and transform the data
• After assessing the data sources, you need to clean and
transform the data to make it suitable for the
recommendation systems. Cleaning involves removing
or correcting any errors, outliers, duplicates, missing
values, or inconsistencies in the data. Transforming
involves applying any operations or functions that
change the structure, format, or values of the data. For
example, you might want to normalize, standardize,
aggregate, or categorize the data to make it more
uniform and comparable.
Split the data into subsets
• The next step is to split the data into subsets for
different purposes. Typically, you will need three
subsets: training, validation, and testing. Training data
is used to build and train the recommendation models.
Validation data is used to tune and optimize the model
parameters and select the best model. Testing data is
used to evaluate and compare the performance and
accuracy of the models. You need to ensure that the
subsets are representative, balanced, and independent
of each other.
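• As a rough illustration, one simple way to produce the three subsets is a random split of the interaction records; the 80/10/10 proportions and the (user, item, rating) record format are assumptions, not prescribed above:

```python
import random

def split_interactions(interactions, train_frac=0.8, val_frac=0.1, seed=42):
    """Randomly split (user, item, rating) records into train/validation/test."""
    rng = random.Random(seed)
    shuffled = interactions[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test

data = [("u1", "a", 5), ("u1", "b", 3), ("u2", "a", 4), ("u2", "c", 2), ("u3", "b", 1)]
train, validation, test = split_interactions(data)
print(len(train), len(validation), len(test))  # 4 0 1 for this tiny example
```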
Apply the recommendation
models
• The next step is to apply the recommendation models
to the data subsets and generate the recommendations.
Recommendation models can be based on different
techniques, such as collaborative filtering, content-
based filtering, hybrid filtering, or deep learning. You
need to choose the appropriate models for your
objectives and data characteristics and test them on
different scenarios and settings. You also need to
monitor and record the results and outputs of the
models for further analysis.
Evaluate the recommendations

• The final step is to evaluate the recommendations and
validate their quality and effectiveness. You can use
different methods and metrics to measure the
performance of the recommendations, such as
accuracy, precision, recall, coverage, diversity, novelty,
serendipity, etc. You can also use feedback from users,
such as ratings, reviews, clicks, conversions, retention,
etc. to assess the satisfaction and engagement of the
recommendations. You need to compare the results with
the objectives and criteria you defined in the first step
and identify any strengths, weaknesses, or areas for
improvement.
