Transformation Functions for Text Classification: A case study with StackOverflow
The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
Natural Language Processing, Dublin Meetup
28th Sept, 2016
Piyush Arora, Debasis Ganguly, Gareth J.F. Jones
ADAPT Centre, School of Computing,
Dublin City University
{parora@computing.dcu.ie, piyusharora07@gmail.com}
https://computing.dcu.ie/~parora/
www.adaptcentre.ie
Overview of the Talk
❏ Informal Overview of the problem.
❏ StackOverflow data characteristics.
❏ A more technical introduction to the problem.
❏ Text based Classification.
❏ Vector Embedding based Classification.
❏ Conclusions
Overview of the Problem
❏ Parametric approach: Draws a ‘decision boundary’ (a vector in the
parameter space) based on labelled samples.
❏ Consider the role of additional (unlabelled) samples.
Overview of the Problem
❏ Apply a transformation function, which transforms a labelled sample to
another point, depending on its neighbourhood(4).
❏ Retrain a standard parametric classifier on the transformed samples.
❏ Hypothesis: The classification effectiveness after the ‘transformation’ will
improve.
[Figure: two panels, labelled ‘Transformed point’ and ‘Transformed and retrained’]
StackOverflow Question Quality Prediction
❏ Motivation:
❏ Rapid increase in the number of questions posted on CQA forums.
❏ Need for automated methods of question quality moderation to
improve user experience and forum effectiveness.
StackOverflow Question Quality Prediction
❏ Problem Statement:
❏ How can a new question be classified as suitable or unsuitable without
community feedback such as votes, comments or answers, which previous
work relies on (2)?
❏ Addressing “Cold Start Problem”(1)
❏ Proposed solution:
❏ Approach based on “Nearest Neighbour based Transformation
Functions”.
❏ Application:
❏ Automatic moderation for online forums: saving cost, resources and
improving user experience.
❏ General transformation model: can be adapted for any dataset.
SO Data Selection (1)
SO Question Classification
❏ To predict if a question is good (net votes +ve) or bad (net votes –ve).
Category               All         Views > 1000
Bad (-ve net score)    380,800     30,163
Good (+ve net score)   3,780,301   1,315,731
Vocab Overlap          59.5%       34.6%
Questions Distribution
SO Question Classification (M1)
❏ Imbalanced class distribution.
❏ High vocabulary overlap between the classes.
❏ Relatively short documents (avg. length of 69 words).
❏ Lack of informative and discriminative content for classification.
❏ Training with ‘all’ labelled samples: Creates a biased classifier which
outputs almost every question as ‘good’.
❏ Problem: High accuracy but low recall and precision for the –ve class.
Question Text   Accuracy   F-measure
Titles only     0.9707     0.503 (almost random)
Title + Body    0.9735     0.503 (almost random)
Raw classification results
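The gap between accuracy and F-measure can be reproduced with simple arithmetic. The sketch below scores a hypothetical classifier that labels every question ‘good’, using the ‘Views > 1000’ counts from the Questions Distribution table. That the reported F-measure is a macro-average over both classes is an assumption; the slides do not say how it was averaged.

```python
# Counts from the "Views > 1000" column of the Questions Distribution table.
N_BAD = 30_163
N_GOOD = 1_315_731


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def majority_baseline(n_bad: int, n_good: int) -> tuple[float, float]:
    """Accuracy and macro-averaged F1 of a classifier that always says 'good'."""
    total = n_bad + n_good
    accuracy = n_good / total
    # 'good' class: every good question is recalled; precision equals accuracy.
    f_good = f1(precision=n_good / total, recall=1.0)
    # 'bad' class: never predicted, so precision and recall are both zero.
    f_bad = f1(precision=0.0, recall=0.0)
    return accuracy, (f_good + f_bad) / 2


accuracy, macro_f = majority_baseline(N_BAD, N_GOOD)
print(f"accuracy={accuracy:.4f} macro-F={macro_f:.4f}")
```

Under these assumptions the always-‘good’ classifier reaches an accuracy of roughly 0.98 with a macro F-measure just under 0.5, which is the same pattern as the raw results above.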
SO Question Classification (M2)
❏ K-NN classification (a non-parametric approach).
❏ Why not just K-NN?
❏ Because its performance is not good.
❏ Because it is solely non-parametric.
❏ No use of rich textual information for classification.
K   Accuracy   F-measure
1   0.4668     0.4766
3   0.4594     0.4548
5   0.4599     0.4523
K-NN results
SO Question Classification
❏ Use other ‘similar’ questions previously asked in the forum to
❏ Transform every question Q to Q’ (3)
❏ Retrain classification model on the Q’ instances.
❏ Combines parametric with non-parametric approach.
Transformation Function
❏ The transformation function φ operating on a vector x depends on the
neighbourhood of x.
❏ φ(x) = Φ(x, N(x)), where N(x) = {xi: d(x, xi) <= r}
❏ There are various choices for defining Φ, e.g. weighted centroid etc.
❏ Mainly depends on the type of x, i.e. categorical (text) or real (vectors).
❏ Experiments on both categorical x (i.e. term space representation of
documents) and real valued x (embedded vectors of documents(5)).
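The definition φ(x) = Φ(x, N(x)) can be read as a two-step procedure: collect the neighbourhood, then combine it with x. Below is a minimal sketch for real-valued vectors. The Euclidean distance, the unweighted centroid used as Φ, and all function names are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Distance d(x, x_i); Euclidean is an illustrative choice."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def neighbourhood(x: Vector, pool: Sequence[Vector], r: float) -> list[Vector]:
    """N(x) = {x_i : d(x, x_i) <= r}, drawn from a pool of other samples."""
    return [xi for xi in pool if euclidean(x, xi) <= r]


def combine_centroid(x: Vector, neighbours: Sequence[Vector]) -> list[float]:
    """One possible Phi: the unweighted mean of x and its neighbours."""
    points = [x, *neighbours]
    return [sum(col) / len(points) for col in zip(*points)]


def transform(x: Vector, pool: Sequence[Vector], r: float,
              combine: Callable = combine_centroid) -> list[float]:
    """phi(x) = Phi(x, N(x)), with Phi passed in as `combine`."""
    return combine(x, neighbourhood(x, pool, r))


pool = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
moved = transform([0.0, 0.0], pool, r=1.5)
print(moved)  # the far-away point [5.0, 5.0] does not influence the result
```

Passing Φ as a parameter mirrors the slide's point that the combiner depends on the type of x: a concatenation operator for text, a centroid for vectors.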
Question Fields
Text based Classification (M3)
❏ Baseline: Multinomial Naïve Bayes (MNB) on the text (title+body) of
each question.
❏ Obtain neighbourhood of each document by:
❏ Treat the title of each question as a query.
❏ Use this query to retrieve top K similar documents by BM25 (k=1.2,
b=0.75)(6)
❏ Choose the Φ(x, N(x)) function as the ‘concatenation’ operator.
K         Accuracy   F-measure
0 (MNB)   0.713      0.704
1         0.718      0.710
3         0.719      0.713
5         0.715      0.710
9         0.715      0.711
Text Expansion Results
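The retrieve-then-concatenate step can be sketched in a few lines. The code below is a toy illustration and not the authors' retrieval setup: the whitespace tokenizer, the non-negative idf variant, the in-memory `index` and the function names are all assumptions; only the BM25 parameters (1.2 and 0.75) come from the slide.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # BM25 parameters quoted on the slide


def tokenize(text: str) -> list[str]:
    """Naive whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str]) -> list[float]:
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            # Non-negative idf variant; which variant was used is not stated.
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(tokens) / avg_len)
            score += idf * tf[term] * (K1 + 1) / norm
        scores.append(score)
    return scores


def expand(title: str, body: str, index: list[str], k: int) -> str:
    """Phi as concatenation: append the top-k retrieved documents."""
    scores = bm25_scores(title, index)
    ranked = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    top = [index[i] for i in ranked[:k] if scores[i] > 0]
    return " ".join([title, body, *top])


index = [
    "how to parse json in python",
    "segmentation fault in c pointer arithmetic",
    "python json decode error on empty string",
]
expanded = expand("parse json python", "my script fails", index, k=2)
print(expanded)
```

The expanded text Q’ is then what the MNB classifier is retrained on. In a real setting the question itself would have to be excluded from its own result list, which this sketch does not handle.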
Text based Classification
❏ Query: Title field
❏ For retrieval: BM25F with different weights for title and body.
❏ Two step grid search for finding optimal parameter settings.
❏ Best results obtained with w(T), w(B) = 1, 3.
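One way to picture the field weighting: in the simple BM25F formulation, per-field term counts are merged into a single weighted pseudo-frequency before the usual BM25 saturation is applied. The sketch below shows only that merging step, with w(T), w(B) = 1, 3. The dictionary layout, tokenization and function names are assumptions, and the exact BM25F variant used in the talk is not specified.

```python
from collections import Counter

FIELD_WEIGHTS = {"title": 1.0, "body": 3.0}  # w(T), w(B) = 1, 3 from the slide


def weighted_term_freqs(doc: dict[str, str]) -> Counter:
    """Combine per-field term counts into one weighted pseudo-frequency table."""
    combined: Counter = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        counts = Counter(doc.get(field, "").lower().split())
        for term, count in counts.items():
            combined[term] += weight * count
    return combined


def weighted_length(doc: dict[str, str]) -> float:
    """Document length computed under the same field weights."""
    return sum(w * len(doc.get(f, "").split()) for f, w in FIELD_WEIGHTS.items())


question = {
    "title": "json parse error",
    "body": "json decode fails on empty json string",
}
tf = weighted_term_freqs(question)
print(tf["json"], tf["parse"], weighted_length(question))
```

These weighted frequencies and lengths would replace the raw term frequency and document length in a BM25 scorer, so a match in the body counts three times as much as a match in the title.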
Text based Classification (M4)
❏ Obtain neighbourhood of each document by:
❏ Treat the title of each question as a query.
❏ Use this query to retrieve top K similar documents by BM25F (k=1.2,
b=0.75) with w(T), w(B) = 1, 3.
❏ Optimized search
❏ Choose the Φ(x, N(x)) function as the ‘concatenation’ operator.
K           Accuracy   F-measure
0 (MNB)     0.713      0.704
3 (BM25)    0.719      0.713
3 (BM25F)   0.738      0.733
Textual Space results
Doc2vec embeddings
Embedding based Classification
❏ Motivation: Document embedding captures the semantic similarity
between questions.
❏ Embed the text (title + body) of each SO question by doc2vec.
❏ Components of each vector in [-1, 1].
❏ Use SVM for classifying these vectors.
❏ Best results obtained when #dimensions set to 200.
❏ Transformation function: Weighted centroid.
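A weighted centroid over document vectors might look like the sketch below. The slides do not say how the weights are defined; weighting each neighbour by its cosine similarity to x, giving x itself weight 1.0, and selecting the top K by cosine similarity are all assumptions made for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def weighted_centroid(x: list[float], pool: list[list[float]], k: int) -> list[float]:
    """Average x with its k most similar vectors, weighted by cosine similarity."""
    ranked = sorted(pool, key=lambda v: cosine(x, v), reverse=True)[:k]
    # x itself gets weight 1.0 (its similarity to itself); an illustrative choice.
    # Negative similarities are clipped to zero so they cannot flip the sign.
    weighted = [(1.0, x)] + [(max(cosine(x, v), 0.0), v) for v in ranked]
    total = sum(w for w, _ in weighted)
    return [sum(w * v[i] for w, v in weighted) / total for i in range(len(x))]


pool = [[0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
shifted = weighted_centroid([1.0, 0.0], pool, k=2)
print(shifted)
```

The transformed vectors, rather than the raw doc2vec vectors, are then passed to the SVM for retraining.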
Embedding based Classification (M5)
❏ For document embedded vector based experiments, we use an SVM
classifier (Gaussian kernel with default parameters).
❏ SVM classification with the dbow document vectors is more effective
than with the dmm model.
K                    Accuracy   F-measure
0 (MNB)²             0.713      0.704
3 (BM25F)²           0.738      0.733
0 (SVM Baseline)¹    0.743      0.743
1 (SVM)¹             0.740      0.739
3 (SVM)¹             0.747      0.746
5 (SVM)¹             0.750      0.749
9 (SVM)¹             0.769      0.768
11 (SVM)¹            0.765      0.764
¹ indicates embedding space results and ² indicates textual space results,
included for comparison
Summary
❏ A general framework for applying a non-parametric based transformation
function.
❏ Empirical investigation on StackOverflow questions to predict question
quality.
❏ Two domains investigated: text and real vectors.
❏ Two neighbourhood functions:
❏ Text: Concatenation
❏ Docvecs: Weighted centroid
❏ Interpretation of the transformation function φ operating on a vector x
❏ φ(x) = Φ(x, N(x)), where N(x) = {xi: d(x, xi) <= r}
Conclusions
❏ BM25F with more weight on the ‘body’ field of a question improves
F-measure by 4.1% relative to the MNB baseline (0.704 to 0.733).
❏ For docvecs, F-measure improves by 3.4% relative to the SVM baseline
(0.743 to 0.768 at K = 9).
❏ Consistent trends in improvements of classification results for both text
and document vectors.
❏ Future work: explore alternative transformation functions, and different
ways of combining the neighbourhood and transformation functions of the
textual and document vector spaces.
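The relative improvements quoted above can be checked against the results tables. The sketch below assumes the percentages refer to F-measure, which is inferred from the tables because that reading reproduces both figures; the function name is illustrative.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100 * (improved - baseline) / baseline


# F-measure values copied from the results tables.
text_gain = relative_gain(baseline=0.704, improved=0.733)       # MNB vs BM25F, K=3
embedding_gain = relative_gain(baseline=0.743, improved=0.768)  # SVM vs K=9
print(f"text: {text_gain:.1f}%  embeddings: {embedding_gain:.1f}%")
```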
References
1. S. Ravi, B. Pang, V. Rastogi, and R. Kumar. Great Question! Question Quality in
Community Q&A. In Proceedings of ICWSM ’14, 2014.
2. D. Correa and A. Sureka. Chaff from the wheat: characterization and modeling of
deleted questions on stack overflow. In Proceedings of WWW ’14, pages 631–642,
2014.
3. M. Efron, P. Organisciak, and K. Fenlon. Improving retrieval of short texts through
document expansion. In Proceedings of SIGIR ’12, pages 911–920, 2012.
4. Q. V. Le and T. Mikolov. Distributed representations of sentences and documents.
In Proceedings of ICML ’14, pages 1188–1196, 2014.
5. K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from distributions
via support measure machines. In Proceedings of NIPS ’12, 2012.
6. S. E. Robertson, H. Zaragoza, and M. J. Taylor. Simple BM25 extension to multiple
weighted fields. In Proceedings of CIKM ’04, pages 42–49, 2004.
7. P. Arora, D. Ganguly, and G. J. F. Jones. The good, the bad and their kins:
Identifying questions with negative scores in StackOverflow. In Proceedings of
ASONAM ’15, pages 1232–1239, 2015.
8. P. Arora, D. Ganguly, and G. J. F. Jones. Nearest Neighbour based Transformation
Functions for Text Classification: A Case Study with StackOverflow. In Proceedings
of ICTIR ’16, pages 299–302, 2016.
Q & A
