user profiling in text-based recommender
systems based on distributed word
representations
Anton Alekseev and Sergey I. Nikolenko
Steklov Institute of Mathematics at St. Petersburg
National Research University Higher School of Economics, St. Petersburg
Kazan (Volga Region) Federal University, Kazan, Russia
Deloitte Analytics Institute, Moscow, Russia
April 7, 2016
intro: word embeddings
overview
• Very brief overview of the paper:
• we want to recommend full-text items to users;
• in the input data, users like full-text items, and we’d like to
construct thematic user profiles based on this;
• to do so, we cluster the word embeddings of keywords;
• then we propose a way to down-weight meaningless clusters of
common words.
word embeddings
• In this work, we construct user profiles based on texts.
• To do so, we used distributed word representations (word
embeddings).
• Distributed word representations map each word occurring in
the dictionary to a Euclidean space, attempting to capture
semantic relationships between the words as geometric
relationships in the Euclidean space.
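To make "semantic relationships as geometric relationships" concrete, here is a minimal sketch that compares word vectors by cosine similarity. The three-dimensional vectors are invented for illustration and are not taken from any trained model.

```python
import math

# Toy 3-dimensional embeddings; the values are made up for illustration.
embeddings = {
    "hour": [0.9, 0.1, 0.0],
    "minute": [0.8, 0.2, 0.1],
    "submarine": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Related words should end up closer (higher cosine) than unrelated ones.
sim_close = cosine(embeddings["hour"], embeddings["minute"])
sim_far = cosine(embeddings["hour"], embeddings["submarine"])
print(f"hour~minute: {sim_close:.3f}, hour~submarine: {sim_far:.3f}")
```

With real embeddings the same function would be applied to vectors of a few hundred dimensions.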
word embeddings
• Started back in (Bengio et al., 2003), exploded after the works of
Bengio et al. and Mikolov et al. (2009–2011), now used
everywhere.
• Basic idea:
• shallow neural networks trained to reconstruct contexts from words
or words from contexts;
• skip-gram: predict the context words from the word;
• CBOW: predict the word from its context;
• GloVe: train a decomposition of the word co-occurrence matrix.
• Word embeddings serve as building blocks for neural network
approaches to NLP.
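The difference between the skip-gram and CBOW objectives is easiest to see in the training examples each one extracts from a sentence. The sketch below only builds those examples and does not train a network. The function name `training_pairs` and the `window` parameter are illustrative choices, not names from the talk.

```python
def training_pairs(tokens, window=2):
    """Return (skip_gram, cbow) training examples for one sentence.

    skip_gram: list of (center_word, context_word) pairs.
    cbow: list of (context_words, center_word) pairs.
    """
    skip_gram, cbow = [], []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # skip-gram: the center word predicts each context word separately.
        skip_gram.extend((center, c) for c in context)
        # CBOW: the whole context predicts the center word.
        cbow.append((context, center))
    return skip_gram, cbow

sg, cb = training_pairs(["users", "like", "full-text", "items"], window=1)
for pair in sg:
    print("skip-gram:", pair)
for example in cb:
    print("CBOW:", example)
```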
word embeddings
• Two main architectures: CBOW and skip-gram.
• We use CBOW embeddings trained on a very large Russian
dataset (thanks to Nikolay Arefyev and Alexander Panchenko!).
methods
tf-idf document profiles
• We begin with baseline approaches.
• Using distributed representations trained on a huge Russian
corpus, we:
• clustered the word vectors, resulting in semantic clustering;
• represented each document as a weighted sum (with tf-idf
weights) of the vectors of its words;
• stored baseline user profiles as simple weighted sums of the
liked documents in this representation;
• trained baseline recommender algorithms that use these profiles:
ranking by cosine similarity, user-based and item-based
collaborative filtering.
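The cosine baseline described above can be sketched in a few lines. Everything here is a toy assumption: the two-dimensional word vectors, the four documents, the plain `tf * log(N / df)` weighting, and the unweighted sum over liked documents. The talk does not specify the exact tf-idf variant or profile weights.

```python
import math
from collections import Counter

# Hypothetical toy data: 2-d word vectors and tokenized documents.
word_vecs = {"goal": [1.0, 0.0], "match": [0.9, 0.1],
             "missile": [0.0, 1.0], "fleet": [0.1, 0.9]}
docs = {"d1": ["goal", "match", "goal"],
        "d2": ["missile", "fleet"],
        "d3": ["match", "goal"],
        "d4": ["fleet", "missile", "fleet"]}

def idf(word):
    """Inverse document frequency over the toy corpus."""
    df = sum(word in tokens for tokens in docs.values())
    return math.log(len(docs) / df)

def doc_vector(tokens):
    """tf-idf weighted sum of the word vectors of a document."""
    vec = [0.0, 0.0]
    for word, tf in Counter(tokens).items():
        weight = tf * idf(word)
        vec = [v + weight * x for v, x in zip(vec, word_vecs[word])]
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

doc_vecs = {name: doc_vector(tokens) for name, tokens in docs.items()}

# Baseline user profile: sum of the vectors of the liked documents.
liked = ["d1"]
profile = [sum(doc_vecs[d][k] for d in liked) for k in range(2)]

# Rank the documents the user has not liked by cosine similarity.
ranking = sorted((d for d in docs if d not in liked),
                 key=lambda d: cosine(profile, doc_vecs[d]), reverse=True)
print(ranking)
```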
new ideas and results
• Main problem:
• we have a clustering in the word vector space, which can also
be applied to documents represented as vectors in the same space;
• we also have a users × documents matrix;
• how do we best compress it into individual user profiles?
• We have tried decomposing this matrix with SVD and pLSA, but
with no good results. Two problems:
• there are only likes in the dataset, no dislikes;
• “junk” clusters of common words always filled the user profiles,
whatever we tried.
new ideas and results
• We can use the following natural idea:
• represent a document as a vector of cluster likelihoods;
• treat each user independently;
• for every user, construct a logistic regression problem that models
the probability of like with weights corresponding to clusters;
• train logistic regression; its weights constitute the user profile.
• But it also seems to suffer from the same problems: where do
we get negative examples for the regression, and what do we do
with “junk” clusters?
new ideas and results
• We solve both problems with one stroke:
• train several (hundred) balanced logistic regressions, choosing
negative examples uniformly at random among not-liked items;
• then use the weights statistics (e.g., mean and variance) as user
profile;
• this way, logistic regression is always balanced;
• also, junk clusters of common words will now often appear in
negative examples too, so they will have significantly higher
variance than informative clusters!
• Having constructed these profiles, how do we make
recommendations?
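The profile construction above can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the authors' implementation. The names `fit_logreg` and `user_profile` are hypothetical, and so are the three-cluster toy data, the plain gradient ascent without bias or regularization, and the choice of 20 models. Cluster 1 stands in for a "junk" cluster that appears in every item, but this toy does not try to reproduce the variance effect described above.

```python
import math
import random
from statistics import mean, pvariance

def fit_logreg(X, y, steps=200, lr=0.5):
    """Logistic regression by batch gradient ascent (no bias, no regularization)."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, label in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for k, xk in enumerate(x):
                grad[k] += (label - p) * xk
        w = [wi + lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def user_profile(liked, not_liked, n_models=20, seed=0):
    """Mean and variance of weights over several balanced regressions."""
    rng = random.Random(seed)
    weights = []
    for _ in range(n_models):
        # Balanced problem: as many random negatives as there are likes.
        negatives = rng.sample(not_liked, len(liked))
        X = liked + negatives
        y = [1] * len(liked) + [0] * len(negatives)
        weights.append(fit_logreg(X, y))
    columns = list(zip(*weights))  # one column of weights per cluster
    return [mean(c) for c in columns], [pvariance(c) for c in columns]

# Toy items as vectors over 3 clusters: [topic A, common words, topic B].
data_rng = random.Random(1)
liked = [[data_rng.uniform(0.6, 1.0), data_rng.uniform(0.3, 0.7),
          data_rng.uniform(0.0, 0.2)] for _ in range(5)]
not_liked = [[data_rng.uniform(0.0, 0.2), data_rng.uniform(0.3, 0.7),
              data_rng.uniform(0.6, 1.0)] for _ in range(30)]

profile_mean, profile_var = user_profile(liked, not_liked)
print("mean:", profile_mean)
print("variance:", profile_var)
```

Each regression sees the same likes but a different random draw of negatives, so the spread of a weight across the ensemble reflects how much that cluster's weight depends on the draw.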
new ideas and results
• Recommender algorithm:
• from the posterior distribution of weights (we used normal
distribution with posterior mean and variance), sample several
(hundred) different weight combinations;
• predict the probabilities of likes for all these combinations;
• rank according to mean predicted like probability.
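A minimal sketch of this sampling-based ranking follows. It assumes an independent normal distribution per cluster weight (a diagonal covariance), which is one reading of "normal distribution with posterior mean and variance". The function name `recommend`, the profile values, and the three items are all hypothetical.

```python
import math
import random

def recommend(profile_mean, profile_var, items, n_samples=200, seed=0):
    """Rank items by like probability averaged over sampled weight vectors."""
    rng = random.Random(seed)
    # Sample weight vectors, one independent normal per cluster (an assumption).
    samples = [[rng.gauss(m, math.sqrt(v))
                for m, v in zip(profile_mean, profile_var)]
               for _ in range(n_samples)]

    def mean_prob(x):
        probs = [1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(ws, x))))
                 for ws in samples]
        return sum(probs) / len(probs)

    scores = {name: mean_prob(x) for name, x in items.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Hypothetical profile: likes cluster 0, dislikes cluster 2,
# and has an uncertain (high-variance) weight on cluster 1.
profile_mean = [2.0, 0.3, -2.0]
profile_var = [0.05, 1.0, 0.05]
items = {"sports": [0.9, 0.5, 0.1],
         "navy": [0.1, 0.5, 0.9],
         "mixed": [0.5, 0.5, 0.5]}

ranking, scores = recommend(profile_mean, profile_var, items)
print(ranking)
```

Averaging probabilities over samples, rather than scoring once with the mean weights, lets the variance of a weight affect the final score.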
sample user profile
# Words
867 0.772 0.165 hours two-hour break minute half-hour five-minute two-hour ten-hour...
424 0.833 0.202 kissing call cry silent scream laughing nod dare restrain angry slam...
837 0.399 0.010 youtube blog net mail facebook player online yandex user tor ado...
366 0.396 0.042 associate attitude seems quite horoscope ideal religious face era...
413 0.406 0.080 feel glad remember worrying offended jealous inhale pity envy suffer autumn...
427 0.385 0.073 hijack bombing raid to steal loot bomb
798 0.385 0.080 uro missile air defense mine RL submarine Vaenga Red Banner Pacific Fleet...
experimental evaluation
algorithms
• So far we are comparing three baseline algorithms and our
regression-based algorithm:
(1) cosine: find nearest documents to a linear user profile with
respect to cosine proximity;
(2) user-based collaborative filtering: find nearest neighbors for a
user and recommend documents according to their likes;
(3) item-based collaborative filtering: find nearest neighbors for a
document and recommend documents similar to the ones a user
liked;
(4) regression-based algorithm: sample weights according to the
posterior distribution, recommend according to average results.
evaluation: metrics
• In experimental evaluation, the regression-based recommender
achieves the best score on every reported metric.
#  Algorithm       AUC    NDCG   Top1   Top5   Top10
0  Cosine          0.514  0.779  0.511  2.471  4.757
1  User-based CF   0.456  0.686  0.101  1.418  3.851
2  Item-based CF   0.495  0.780  0.523  2.493  4.813
3  Regression      0.530  0.796  0.562  2.667  5.153
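For readers unfamiliar with the first two columns, here is a sketch of AUC and NDCG for a single user with binary relevance (liked or not). These are the textbook definitions. The talk does not state the exact variants used in the experiments, and the Top1/Top5/Top10 columns are not defined there, so they are not reproduced here.

```python
import math

def auc(scores, labels):
    """Probability that a random liked item outranks a random non-liked one."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ndcg(scores, labels):
    """Normalized discounted cumulative gain with a log2 rank discount."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sum(l / math.log2(rank + 2)
                for rank, l in enumerate(sorted(labels, reverse=True)))
    return dcg / ideal

# Toy predictions for one user: two liked items (label 1), two not liked.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(f"AUC = {auc(scores, labels):.3f}, NDCG = {ndcg(scores, labels):.3f}")
```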
• Demo...
thank you!
Thank you for your attention!
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Medical Dataset including visualizations
Medical Dataset including visualizationsMedical Dataset including visualizations
Medical Dataset including visualizations
vishrut8750588758
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...
Pixellion
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
GenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.aiGenAI for Quant Analytics: survey-analytics.ai
GenAI for Quant Analytics: survey-analytics.ai
Inspirient
 
chapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptxchapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptx
justinebandajbn
 
Data Analytics Overview and its applications
Data Analytics Overview and its applicationsData Analytics Overview and its applications
Data Analytics Overview and its applications
JanmejayaMishra7
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
chapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.pptchapter3 Central Tendency statistics.ppt
chapter3 Central Tendency statistics.ppt
justinebandajbn
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 

Sergey Nikolenko and Anton Alekseev: User Profiling in Text-Based Recommender Systems Based on Distributed Word Representations

  • 1. user profiling in text-based recommender systems based on distributed word representations
    Anton Alekseev and Sergey I. Nikolenko
    Steklov Institute of Mathematics at St. Petersburg
    National Research University Higher School of Economics, St. Petersburg
    Kazan (Volga Region) Federal University, Kazan, Russia
    Deloitte Analytics Institute, Moscow, Russia
    April 7, 2016
  • 3. overview
    • Very brief overview of the paper:
      • we want to recommend full-text items to users;
      • in the input data, users like full-text items, and we’d like to construct thematic user profiles based on these likes;
      • to do so, we cluster the word embeddings of keywords;
      • then we propose a conceptual way to down-weight meaningless clusters of common words.
  • 4. word embeddings
    • In this work, we construct user profiles based on texts.
    • To do so, we use distributed word representations (word embeddings).
    • Distributed word representations map each word in the dictionary to a point in a Euclidean space, attempting to capture semantic relationships between words as geometric relationships in that space.
  • 5. word embeddings
    • Started back in (Bengio et al., 2003), exploded after the works of Bengio et al. and Mikolov et al. (2009–2011), now used everywhere.
    • Basic idea:
      • shallow neural networks trained to reconstruct contexts from words or words from contexts;
      • skip-gram: predict context words from the word;
      • CBOW: predict the word from its context;
      • GloVe: train a decomposition of the matrix of co-occurrences.
    • Word embeddings serve as building blocks for neural network approaches to NLP.
  • 6. word embeddings
    • Two main architectures: CBOW and skip-gram.
    • We use CBOW embeddings trained on a very large Russian dataset (thanks to Nikolay Arefyev and Alexander Panchenko!).
  • 8. tf-idf document profiles
    • We begin with baseline approaches.
    • Using distributed representations trained on a huge Russian corpus, we:
      • clustered the word vectors, resulting in semantic clustering;
      • represented each document as a weighted sum (with tf-idf weights) of the vectors of its words;
      • stored baseline user profiles based on simple weighted sums of their likes in this document representation;
      • trained baseline recommender algorithms that use these profiles: ranking by cosine similarity, user-based and item-based collaborative filtering.
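The tf-idf baseline on slide 8 can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the 2-dimensional embeddings, the idf values, the document names and the helper names (`tfidf_doc_vector`, `user_profile`, `cosine`) are all made up for the example, and the profile is an unweighted sum of liked documents.

```python
import math
from collections import Counter

def tfidf_doc_vector(tokens, embeddings, idf):
    """Document vector: sum of word vectors weighted by tf-idf."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    counts = Counter(t for t in tokens if t in embeddings)
    total = sum(counts.values())
    for word, count in counts.items():
        weight = (count / total) * idf.get(word, 0.0)  # tf * idf
        for i, x in enumerate(embeddings[word]):
            vec[i] += weight * x
    return vec

def user_profile(liked_vectors):
    """Baseline profile: plain sum of the liked document vectors (assumption)."""
    return [sum(column) for column in zip(*liked_vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy data, invented for illustration only.
embeddings = {"missile": [1.0, 0.0], "submarine": [0.9, 0.1],
              "kiss": [0.0, 1.0], "cry": [0.1, 0.9]}
idf = {"missile": 2.0, "submarine": 2.0, "kiss": 1.5, "cry": 1.5}
docs = {
    "navy": ["missile", "submarine", "submarine"],
    "drama": ["kiss", "cry"],
    "mixed": ["missile", "cry"],
}

doc_vecs = {name: tfidf_doc_vector(toks, embeddings, idf) for name, toks in docs.items()}
profile = user_profile([doc_vecs["navy"]])  # the user liked only "navy"
ranking = sorted(doc_vecs, key=lambda name: cosine(profile, doc_vecs[name]), reverse=True)
print(ranking)
```

A real recommender would also exclude documents the user has already liked before ranking; the sketch keeps them to stay short.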
  • 9. new ideas and results
    • Main problem:
      • we have a clustering in the word vector space, which can also be applied to documents represented as vectors in the same space;
      • we also have a users × documents matrix;
      • how do we better compress it to individual user profiles?
    • We have tried decomposing this matrix with SVD and pLSA, but with no good results. Two problems:
      • there are only likes in the dataset, no dislikes;
      • “junk” clusters with common words always fill user profiles, whatever we tried.
  • 10. new ideas and results
    • We can use the following natural idea:
      • represent a document as a vector of cluster likelihoods;
      • treat each user independently;
      • for every user, construct a logistic regression problem that models the probability of a like, with weights corresponding to clusters;
      • train logistic regression; its weights constitute the user profile.
    • But it also seems to suffer from the same problems: where do we get negative examples for the regression, and what do we do with “junk” clusters?
  • 11. new ideas and results
    • We solve both problems with one stroke:
      • train several (hundred) balanced logistic regressions, choosing negative examples uniformly at random among not-liked items;
      • then use the weight statistics (e.g., mean and variance) as the user profile;
      • this way, logistic regression is always balanced;
      • also, junk clusters with common words will now often appear in negative examples too, so they will have significantly higher variance than informative clusters!
    • Having constructed these profiles, how do we make recommendations?
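The resampling idea on slide 11 might look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the tiny gradient-descent `fit_logreg`, its hyperparameters, the absence of a bias term, the number of runs and the synthetic three-cluster data (cluster 0 on-topic, cluster 1 off-topic, cluster 2 "junk") are all invented here.

```python
import math
import random
from statistics import mean, pvariance

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-z))

def fit_logreg(X, y, epochs=200, lr=0.5, l2=0.01):
    """L2-regularised logistic regression by batch gradient descent (no bias term)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [l2 * wj for wj in w]
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj / len(X)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def profile_from_resampling(liked, not_liked, runs=50, seed=0):
    """Fit many balanced regressions; return per-cluster mean and variance of the weights."""
    rng = random.Random(seed)
    weights = []
    for _ in range(runs):
        # Balanced problem: as many random negatives as there are likes.
        negatives = rng.sample(not_liked, len(liked))
        X = liked + negatives
        y = [1] * len(liked) + [0] * len(negatives)
        weights.append(fit_logreg(X, y))
    columns = list(zip(*weights))
    return [mean(c) for c in columns], [pvariance(c) for c in columns]

# Synthetic cluster-likelihood vectors, invented for illustration.
data_rng = random.Random(1)
liked = [[0.7 + 0.1 * data_rng.random(), 0.1 * data_rng.random(), data_rng.random()]
         for _ in range(5)]
not_liked = [[0.1 * data_rng.random(), 0.7 + 0.1 * data_rng.random(), data_rng.random()]
             for _ in range(30)]

mean_w, var_w = profile_from_resampling(liked, not_liked)
for cluster, (m, v) in enumerate(zip(mean_w, var_w)):
    print(cluster, round(m, 3), round(v, 4))
```

The pair `(mean_w, var_w)` plays the role of the user profile. Whether the junk cluster really ends up with the largest variance depends on the data; the toy run is only meant to show the mechanics.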
  • 12. new ideas and results
    • Recommender algorithm:
      • from the posterior distribution of weights (we used normal distribution with posterior mean and variance), sample several (hundred) different weight combinations;
      • predict the probabilities of likes for all these combinations;
      • rank according to mean predicted like probability.
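A minimal sketch of the sampling-based ranking on slide 12, assuming an independent normal distribution per cluster weight (the slides only say "normal distribution with posterior mean and variance"). The profile numbers, item vectors and the name `score_items` are hypothetical.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-z))

def score_items(items, mean_w, var_w, samples=200, seed=0):
    """Mean predicted like probability over weight vectors sampled from the profile."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in items}
    for _ in range(samples):
        # One weight combination: each cluster weight drawn independently (assumption).
        w = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean_w, var_w)]
        for name, x in items.items():
            totals[name] += sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
    return {name: total / samples for name, total in totals.items()}

# Hypothetical profile: clusters 0 and 1 are informative, cluster 2 is a
# high-variance "junk" cluster.
mean_w = [2.0, -2.0, 0.3]
var_w = [0.05, 0.05, 1.5]
items = {
    "on_topic": [0.8, 0.0, 0.2],
    "off_topic": [0.0, 0.8, 0.2],
    "junk_only": [0.0, 0.0, 0.9],
}

scores = score_items(items, mean_w, var_w)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the junk weight is sampled with high variance, its contribution averages out towards an uninformative prediction, so a document made only of common words lands between the clearly on-topic and clearly off-topic ones.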
  • 13. sample user profile
    (the two numeric columns are unlabeled in the transcript)
    #   |       |       | Words
    867 | 0.772 | 0.165 | hours two-hour break minute half-hour five-minute two-hour ten-hour...
    424 | 0.833 | 0.202 | kissing call cry silent scream laughing nod dare restrain angry slam...
    837 | 0.399 | 0.010 | youtube blog net mail facebook player online yandex user tor ado...
    366 | 0.396 | 0.042 | associate attitude seems quite horoscope ideal religious face era...
    413 | 0.406 | 0.080 | feel glad remember worrying offended jealous inhale pity envy suffer autumn...
    427 | 0.385 | 0.073 | hijack bombing raid to steal loot bomb
    798 | 0.385 | 0.080 | uro missile air defense mine RL submarine Vaenga Red Banner Pacific Fleet...
  • 15. algorithms
    • So far we compare three baseline algorithms and our regression-based algorithm:
      (1) cosine: find the documents nearest to a linear user profile with respect to cosine proximity;
      (2) user-based collaborative filtering: find nearest neighbors for a user and recommend documents according to their likes;
      (3) item-based collaborative filtering: find nearest neighbors for a document and recommend documents similar to the ones a user liked;
      (4) regression-based algorithm: sample weights according to the posterior distribution, recommend according to average results.
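The two collaborative filtering baselines (2) and (3) can be illustrated on a toy like matrix. The slides do not say which similarity or neighborhood size was used; cosine similarity between binary like sets and `k=2` neighbors are assumptions made for this sketch, as are the user and document names.

```python
import math
from collections import Counter

# Toy like data: user -> set of liked documents (invented for illustration).
likes = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"e", "f"},
    "u4": {"b", "c", "d"},
}

def similarity(s, t):
    """Cosine similarity between two binary sets."""
    return len(s & t) / math.sqrt(len(s) * len(t)) if s and t else 0.0

def user_based(user, likes, k=2):
    """Recommend what the k most similar users liked, weighted by similarity."""
    neighbours = sorted((u for u in likes if u != user),
                        key=lambda u: similarity(likes[user], likes[u]),
                        reverse=True)[:k]
    votes = Counter()
    for u in neighbours:
        for item in likes[u] - likes[user]:
            votes[item] += similarity(likes[user], likes[u])
    return [item for item, _ in votes.most_common()]

def item_based(user, likes):
    """Recommend documents whose audiences resemble those of the user's liked documents."""
    item_users = {}
    for u, liked_items in likes.items():
        for item in liked_items:
            item_users.setdefault(item, set()).add(u)
    scores = {}
    for candidate, audience in item_users.items():
        if candidate in likes[user]:
            continue
        scores[candidate] = sum(similarity(audience, item_users[i]) for i in likes[user])
    return sorted(scores, key=scores.get, reverse=True)

print(user_based("u1", likes))
print(item_based("u1", likes)[0])
```

For user `u1` both variants put document `d` first, since it is liked by the two users whose tastes overlap with `u1`.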
  • 16. evaluation: metrics
    • In experimental evaluation, the regression-based recommender outperforms all three baselines on every reported metric.
    Algorithm     | AUC   | NDCG  | Top1  | Top5  | Top10
    Cosine        | 0.514 | 0.779 | 0.511 | 2.471 | 4.757
    User-based CF | 0.456 | 0.686 | 0.101 | 1.418 | 3.851
    Item-based CF | 0.495 | 0.780 | 0.523 | 2.493 | 4.813
    Regression    | 0.530 | 0.796 | 0.562 | 2.667 | 5.153
    • Demo...
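For reference, AUC and NDCG for a single user can be computed as in the sketch below, using binary relevance (liked or not). The slides do not define Top1/Top5/Top10; `hits_at_k`, which counts liked documents among the top k, is only one possible reading and is an assumption here. The scores and labels are toy values.

```python
import math

def auc(scores, labels):
    """Probability that a random liked item is scored above a random non-liked one."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ndcg(scores, labels):
    """Normalised discounted cumulative gain with binary relevance."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sum(l / math.log2(rank + 2)
                for rank, l in enumerate(sorted(labels, reverse=True)))
    return dcg / ideal if ideal else 0.0

def hits_at_k(scores, labels, k):
    """Number of liked items among the k highest-scored ones (assumed Top-k reading)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k])

# Toy predictions for four documents; label 1 means the user liked the document.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]

print(auc(scores, labels))           # 0.75
print(round(ndcg(scores, labels), 4))
print(hits_at_k(scores, labels, 3))  # 2
```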
  • 17. thank you!
    Thank you for your attention!