Context-based movie search for user questions that ask the title of the movie
장진규, 박정인
School of Business Administration
April 18, 2018
Contents
I. Introduction
II. Preprocessing
III. Analysis
IV. Results
Contents
I. Introduction
II. Preprocessing
III. Analysis
IV. Results
Text Mining Term Project 4
Introduction
People sometimes have a craving to find a movie they once glimpsed. In that situation, they ask for the movie's title on Q&A sites and get an answer. The answerers often seem like 'gods of movies', so we want to imitate their prophecy.
Question Examples
Text Mining Term Project 5
Data Gathering
We chose one expert in this field and gathered his answers.
Q&A site: https://kin.naver.com
Expert ID: xedz****
Question & answer pairs: 39,758
Period: December 2012 ~ March 2018
Unique movies: 5,900
Gathered Data Information
Text Mining Term Project 6
2 Types of Text Representation
There are two kinds of text representation: sparse and dense.
Sparse: One-Hot Encoding Dense: Word Embedding
Dimension: a sparse representation has as many dimensions as there are unique words; a dense representation's dimensionality is set freely, usually 20~200.
Information: sparse vectors are mostly zeros and carry little information; in dense vectors every element has a value, so they carry abundant information.
Comparison of Text Representations
source: https://dreamgonfly.github.io/machine/learning,/natural/language/processing/2017/08/16/word2vec_explained.html
Text Mining Term Project 7
Main Idea of Word2Vec
Word2Vec is one of the word embedding methods.
Its main idea is “You shall know a word by the company it keeps.”
Every word has friends around it
Text Mining Term Project 8
Algorithms of Word2Vec
Word2vec has two model architectures: continuous bag-of-words (CBOW) and skip-gram.
Diagrams of CBOW and Skip-gram
source: https://aws.amazon.com/ko/blogs/korea/amazon-sagemaker-blazingtext-parallelizing-word2vec-on-multiple-cpus-or-gpus/
Text Mining Term Project 9
Algorithms of Doc2Vec
Doc2vec has two model architectures: the distributed memory model (PV-DM) and the distributed bag-of-words model (PV-DBOW).
Diagrams of PV-DM and PV-DBOW
source: Distributed Representations of Sentences and Documents
PV-DM: the concatenation or average of the paragraph vector with a context of three words is used to predict the fourth word. The paragraph vector represents the information missing from the current context.
PV-DBOW: the context words are ignored in the input, and the model is forced to predict words randomly sampled from the paragraph in the output, similar to the skip-gram model.
Contents
I. Introduction
II. Preprocessing
III. Analysis
IV. Results
Text Mining Term Project 11
Preprocessing
We preprocessed the data for better performance in two stages, working first on the whole raw text and then on the tokenized text (Raw → Preprocessing → Tokenizing). A minimal tokenization sketch follows the lists below.
Raw text preprocessing
▪ Remove unnecessary words
• URLs, special characters (!, ?, *, @, <, >), emoticons (ㅋㅋ, ㅠㅠ), multiple spaces
▪ Manually normalize words that the dictionary cannot correct (abbreviations and slang)
• (남주 → 남자주인공), (페북 → 페이스북), (영환 → 영화인데), (여자애 → 여자)
▪ Delete unnecessary phrases in questions
• e.g. "좀 옛날 영화인데 ~", "페북에서 봤는데", "~ 장면이 있었는데 기억이 안나네요" (generic openings and closings such as "it's a bit of an old movie", "I saw it on Facebook", "there was a scene but I can't remember")
▪ Delete questions shorter than 30 characters
Tokenizing
▪ Tokenize with KoNLPy
• using the Twitter package
▪ POS-tagging
• keep only nouns, verbs, and adjectives
▪ Remove tokens that have only one character
▪ Remove stop-words
▪ Delete questions with fewer than 10 tokens
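As a rough illustration of the tokenizing stage (not the authors' exact script), a sketch using KoNLPy is shown below. The Twitter analyzer has been renamed Okt in recent KoNLPy releases; the stop-word list, regular expressions, and thresholds here are placeholders.

```python
import re
from konlpy.tag import Okt  # called `Twitter` in older KoNLPy versions

okt = Okt()
STOPWORDS = {"영화", "제목"}  # placeholder stop-word list (assumption)

def tokenize_question(text):
    # Raw-text cleanup: drop URLs, special characters, emoticons, extra spaces
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[!?*@<>]", " ", text)
    text = re.sub(r"[ㅋㅎㅠㅜ]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()

    # POS-tag and keep only nouns, verbs, and adjectives
    tokens = [word for word, pos in okt.pos(text, stem=True)
              if pos in ("Noun", "Verb", "Adjective")]

    # Drop single-character tokens and stop-words
    return [t for t in tokens if len(t) > 1 and t not in STOPWORDS]

# Questions with fewer than 10 remaining tokens would be discarded afterwards.
```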
Text Mining Term Project 12
Select Movies and Split dataset
The dataset contains 5,900 movies, but many of them have only a few questions, so we removed movies whose question count falls below a cutoff value. We then split the dataset into train and test sets at an 8:2 ratio to evaluate the model (a minimal sketch of this step follows the tables below).
Movie Count
스파이더위크가의 비밀 259
캐빈 인 더 우즈 222
비밀의 숲 테라비시아 179
Cutoff
무서운 영화 2 1
전우 1
전우치 1
Number of questions per movie
Movie Train Test
스파이더위크가의 비밀 207 52
캐빈 인 더 우즈 177 45
비밀의 숲 테라비시아 143 36
레모니 스니켓의 위험한 대결 142 36
플립 141 36
… … …
Split Train and Test
*Basic cutoff = 3
*Using stratified method
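A minimal sketch of the cutoff filter and the stratified 8:2 split, assuming the questions sit in a pandas DataFrame with placeholder columns `tokens` and `movie` (the exact cutoff inequality is our assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

CUTOFF = 3  # movies with fewer than this many questions are dropped (assumption)

def filter_and_split(df: pd.DataFrame):
    # Keep only movies that have at least CUTOFF questions
    counts = df["movie"].value_counts()
    kept = counts[counts >= CUTOFF].index
    df = df[df["movie"].isin(kept)]

    # 8:2 split, stratified so every movie keeps the same train/test ratio
    train_df, test_df = train_test_split(
        df, test_size=0.2, stratify=df["movie"], random_state=42
    )
    return train_df, test_df
```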
Contents
I. Introduction
II. Preprocessing
III. Analysis
IV. Results
Text Mining Term Project 14
Modeling – Word2vec
To train the word2vec model, we insert the answer (label) among the tokenized words of each question. Using this corpus, we trained the word2vec model.
*a label is put after every 5 words
Q: question, A: answer (label), W: word
Number of unique labels in the train/test data: 2,021
Train set: 22,620 questions, test set: 5,655 questions
▪ Skip-gram architecture
▪ Dimensionality of the feature vectors: 300
▪ Window size: 10
▪ Hierarchical softmax
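A sketch of how such a corpus could be built and trained with gensim (4.x API assumed). The helper name, the `train_df` DataFrame from the split sketch above, and its `tokens`/`movie` columns are our placeholders, not the original code.

```python
from gensim.models import Word2Vec

def interleave_label(tokens, label, every=5):
    # Insert the answer label after every `every` question tokens
    out = []
    for i, tok in enumerate(tokens, start=1):
        out.append(tok)
        if i % every == 0:
            out.append(label)
    return out

sentences = [interleave_label(row.tokens, row.movie) for row in train_df.itertuples()]

w2v = Word2Vec(
    sentences,
    sg=1,            # skip-gram
    vector_size=300,
    window=10,
    hs=1,            # hierarchical softmax
    negative=0,
    min_count=1,
)
```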
Text Mining Term Project 15
Modeling – Word2vec
Text Mining Term Project 16
Modeling – Word2vec
Each word in a test question is looked up in the model to obtain its word vector.
All the word vectors of a question are then combined into one vector per question (the document vector).
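A sketch of this step; whether the vectors were summed or averaged is not stated, so we average and skip out-of-vocabulary tokens (`w2v` and `test_df` come from the sketches above).

```python
import numpy as np

def document_vector(model, tokens):
    # Average the vectors of tokens that exist in the word2vec vocabulary
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

doc_vectors = np.vstack(
    [document_vector(w2v, row.tokens) for row in test_df.itertuples()]
)
```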
Text Mining Term Project 17
Modeling – Word2vec
We also look up each unique answer (label) in the model to obtain its label vector. We then calculate the pairwise cosine similarity between the label vectors (V_A1, …, V_An) and the document vectors (V′_1, …, V′_K), where
K: the number of test questions
n: the number of unique labels
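For example, with scikit-learn (a sketch; `labels` and `label_vectors` are our names, and the label tokens have vectors because they were inserted into the training corpus):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

labels = sorted(train_df["movie"].unique())
label_vectors = np.vstack([w2v.wv[label] for label in labels])  # shape (n, 300)

# similarity[k, i] = cosine similarity between test question k and label i
similarity = cosine_similarity(doc_vectors, label_vectors)      # shape (K, n)
```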
Text Mining Term Project 18
Modeling – Word2vec
Finally, we normalize the cosine similarity scores of each document vector, binarize the answers (labels), and evaluate the performance of the model.
Test set example:
V′_k = [0.05, 0.001, 0.003, …, 0.002] (normalized similarity of question k to every label)
A_k = [1, 0, 0, …, 0] (one-hot vector of the true label)
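A sketch of the normalization and binarization step; normalizing each row to sum to one is one possible reading of "normalize", and `label_binarize` builds the one-hot ground truth.

```python
from sklearn.preprocessing import label_binarize

# Row-normalize the similarity scores of each test question
scores = similarity / similarity.sum(axis=1, keepdims=True)

# One-hot encode the true movie title of each test question
y_true = label_binarize(test_df["movie"], classes=labels)  # shape (K, n)
```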
Text Mining Term Project 19
Modeling – Doc2vec
In the doc2vec model we do not need to insert the correct answer into the question as in the word2vec model, because the answer (label) is learned as a document tag.
▪ PV-DM (distributed memory) architecture
▪ Dimensionality of the feature vectors: 300
▪ Window size: 3
▪ Hierarchical softmax
▪ The sum of the context word vectors is used
The paragraph vectors (label vectors) are trained on the task of predicting the next word in the sentence: every paragraph is mapped to a unique vector, and the paragraph vector and the word vectors are combined to predict the next word in a context.
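A sketch with gensim's Doc2Vec (4.x API assumed), using the movie title as the document tag so that each label gets its own vector; `train_df` and its columns are the same placeholders as above.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each tokenized question with its movie title
tagged = [TaggedDocument(words=row.tokens, tags=[row.movie])
          for row in train_df.itertuples()]

d2v = Doc2Vec(
    tagged,
    dm=1,            # PV-DM (distributed memory)
    vector_size=300,
    window=3,
    hs=1,            # hierarchical softmax
    negative=0,
    dm_mean=0,       # use the sum of the context word vectors
    min_count=1,
)
```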
Text Mining Term Project 20
Modeling – Doc2vec
*Cosine similarity is computed between a simple mean of the projection weight vectors of the given documents.
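A sketch of how a new question could be scored against the label vectors with the trained doc2vec model; `infer_vector` and `dv.most_similar` are gensim 4.x calls, while the overall flow is our reconstruction rather than the authors' code.

```python
# raw_question: an unseen user question; tokenize_question is the earlier sketch
new_question = tokenize_question(raw_question)
inferred = d2v.infer_vector(new_question)

# Top-5 most similar label vectors = top-5 candidate movie titles
candidates = d2v.dv.most_similar([inferred], topn=5)
for title, score in candidates:
    print(title, round(score, 3))
```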
Contents
I. Introduction
II. Preprocessing
III. Analysis
IV. Results
Text Mining Term Project 22
Model evaluation – ROC curve
The ROC curves over all labels are summarized with two averaging methods: micro-averaging and macro-averaging.
micro-averaging – treats each element of the label indicator matrix as an individual binary prediction.
macro-averaging – gives equal weight to the classification of each label.
word2vec: AUC 0.78 (micro), 0.82 (macro)
doc2vec: AUC 0.97 (micro), 0.97 (macro)
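A sketch of the micro- and macro-averaged AUC computation with scikit-learn, reusing `y_true` and `scores` from the word2vec evaluation sketch above (the doc2vec scores would be built the same way):

```python
from sklearn.metrics import roc_auc_score

micro_auc = roc_auc_score(y_true, scores, average="micro")
macro_auc = roc_auc_score(y_true, scores, average="macro")
print(f"micro AUC: {micro_auc:.2f}, macro AUC: {macro_auc:.2f}")
```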
Text Mining Term Project 23
Model evaluation – Top-n Accuracy approach
We also evaluate top-n accuracy for each label (n = 1~10). For example, top-5 accuracy means that the expected answer must appear among the model's 5 highest-scoring answers.
*Accuracy: top-1 and top-10 accuracy
word2vec: 0.08 (top-1) to 0.20 (top-10)
doc2vec: 0.49 (top-1) to 0.73 (top-10)
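A sketch of a top-n accuracy computation over the score matrix (our helper, not the original code, reusing `scores` and `y_true` from above):

```python
import numpy as np

def top_n_accuracy(scores, y_true, n=5):
    # For each question, check whether the true label is among the n highest scores
    top_n_idx = np.argsort(scores, axis=1)[:, -n:]
    true_idx = y_true.argmax(axis=1)
    hits = [true_idx[k] in top_n_idx[k] for k in range(len(true_idx))]
    return float(np.mean(hits))

for n in (1, 5, 10):
    print(n, top_n_accuracy(scores, y_true, n))
```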
Text Mining Term Project 24
Discussion
• Conclusion
✓ Overall, doc2vec shows better performance than the word2vec model
✓ A service could be built by presenting a list of n (at least 5) candidate answers for each new question
✓ The approach could be applied to a speech-recognition-based movie recommendation service
• Further study
✓ Questions about movies that were not in the training data remain a problem
- this could be mitigated by also learning the synopses of the movies
✓ A method for dealing with the imbalanced movie data is needed
Text Mining Term Project 25
Thank you
Text Mining Term Project 26
APPENDIX
We drew graphs to see which movies and genres are asked about most often. We found that people mostly wanted to find mysterious and thrilling movies.
Asked Movie Ranking (number of questions)
스파이더위크가의 비밀 259
캐빈 인 더 우즈 222
비밀의 숲 테라비시아 179
레모니 스니켓의 위험한 대결 178
플립 177
트루먼 쇼 166
다이버전트 151
스플라이스 147
아바타 143
업사이드 다운 131
Asked Movie Genre (number of questions)
Horror, Thriller (공포, 스릴러) 7,447
SF, Fantasy (SF, 판타지) 5,001
Romance, Melodrama (로맨스, 멜로) 4,585
Action, Martial arts (액션, 무협) 2,560
Comedy (코미디) 784
Drama (드라마) 706
Animation (애니메이션) 501
Text Mining Term Project 27
APPENDIX
Deleting unnecessary phrases from questions
At the beginning and at the end of a question, if a word from the check-word list appears within the first or last 20% of the question length, remove every phrase before (at the beginning) or after (at the end) that word.
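A rough sketch of this heuristic as we understand it; the check-word list and the exact boundary rule are assumptions.

```python
CHECK_WORDS = ["영화인데", "봤는데", "기억이"]  # placeholder check-word list (assumption)

def trim_boilerplate(text, check_words=CHECK_WORDS, margin=0.2):
    # Drop the leading phrase if a check word appears within the first 20% of the question
    for w in check_words:
        pos = text.find(w)
        if 0 <= pos < len(text) * margin:
            text = text[pos + len(w):]
            break
    # Drop the trailing phrase if a check word appears within the last 20% of the question
    for w in check_words:
        pos = text.rfind(w)
        if pos >= len(text) * (1 - margin):
            text = text[:pos]
            break
    return text.strip()
```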