Joseph Orilogbon
Luis Lasierra
Bin Shen
5/12/14, Semantic Technologies in IBM Watson
Discovering why Topics are Trending on Twitter
*We set out to explain why topics are trending on Twitter
*The main approach used to achieve this was summarization
*News breaks on Twitter
*Twitter is a prominent way of expressing opinions on the Internet
*We want to know why people are talking about a particular topic in a given location
*Commercial interest
*Summarization of trending topics on Twitter
*Categorization of Topics; and
*Named-Entity Extraction for Trending topics
http://whytrend.intelworx.com
*Speech Act Guided Summarization
*Phrase Ranking using MLE
*Phrase Extraction using POS filtering
*Salience Score of Extracted Phrases
*Summary generation using templates
*Speech acts include: Statement [sta], Question [que], Comment [com], Suggestion [sug] and Miscellaneous [mis]
*Speech act classification is a multiclass problem
*A k-nearest neighbors approach was used for classification
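The slides do not say which features or distance measure the classifier used, so the following is only a minimal sketch of k-nearest-neighbors speech act classification. It assumes bag-of-words vectors and cosine similarity; the function names and the toy training tweets are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    # Assumption: plain lowercase bag-of-words features.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(text, training, k=3):
    """Return the majority speech act label among the k most similar training tweets."""
    query = vectorize(text)
    neighbors = sorted(training, key=lambda ex: cosine(query, vectorize(ex[0])), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy training set using the labels from the slide.
TRAINING = [
    ("what time is the game ?", "que"),
    ("who won the game ?", "que"),
    ("you should watch the game", "sug"),
    ("the game starts at nine", "sta"),
    ("i love this team", "com"),
]
```

Calling `knn_classify("what time is the final ?", TRAINING)` lets the three closest training tweets vote on the label.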
*Extracted phrases were ranked using the following equation
* $\mathrm{Score}(P) = \log \dfrac{L(\text{words in } P \text{ are independent})}{L(\text{words in } P \text{ are dependent})}$
*Dependence and independence were measured against a background Twitter corpus built from 550,000 tweets
*For lengths 1 to L, we extract the top 50 phrases.
*L is a model parameter for maximum phrase length
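One way to read the ranking equation is sketched below. It assumes the independent likelihood is the product of unigram probabilities and the dependent likelihood is the probability of the whole N-gram, both estimated from background-corpus counts with add-one smoothing. The smoothing and all function names are assumptions, not details from the slides.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_counts(corpus, max_len):
    """Count all N-grams of length 1..max_len in the background corpus."""
    counts = {n: Counter() for n in range(1, max_len + 1)}
    for tweet in corpus:
        tokens = tweet.lower().split()
        for n in range(1, max_len + 1):
            counts[n].update(ngrams(tokens, n))
    return counts

def score(phrase, counts):
    """log( L(words independent) / L(words dependent) ) for a tuple of words."""
    n = len(phrase)
    total_unigrams = sum(counts[1].values())
    vocab = len(counts[1])
    # Independent: product of smoothed unigram probabilities (assumption).
    log_indep = sum(
        math.log((counts[1][(w,)] + 1) / (total_unigrams + vocab)) for w in phrase
    )
    # Dependent: smoothed probability of the phrase as a single unit (assumption).
    total_ngrams = sum(counts[n].values())
    log_dep = math.log((counts[n][phrase] + 1) / (total_ngrams + len(counts[n]) + 1))
    return log_indep - log_dep

# Hypothetical stand-in for the 550,000-tweet background corpus.
BACKGROUND = ["new york is big", "new york is cold", "i love new york", "the sky is big"]
COUNTS = build_counts(BACKGROUND, max_len=2)
```

Under this reading, a lower score means the words co-occur more often than independence would predict, so `("new", "york")` scores lower than an unseen pair such as `("big", "new")`.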
*Extracted N-grams are only useful if they are:
  *Nouns or noun phrases
  *Verbs or verb-centered phrases
*After extracting N-grams, those not matching the required patterns were filtered out using regular expressions on their POS tag patterns
*Tagging was done before extracting N-grams to give the tagger the proper context
*Different patterns are suitable for different speech acts
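The filtering step could look roughly like the sketch below. It assumes the single-character tags of the Twitter POS tagset from Gimpel et al. (`N` common noun, `^` proper noun, `A` adjective, `V` verb, `D` determiner). The two patterns are hypothetical examples, since the slides do not list the actual patterns.

```python
import re

# Hypothetical patterns over space-joined POS tags.
NOUN_PHRASE = re.compile(r"(?:A )*[N^](?: [N^])*")       # optional adjectives, then nouns
VERB_PHRASE = re.compile(r"V(?: D)?(?: A)*(?: [N^])*")   # a verb, optionally followed by an object

def keep_ngram(tagged_ngram):
    """tagged_ngram is a list of (word, tag) pairs produced before N-gram extraction."""
    tag_pattern = " ".join(tag for _, tag in tagged_ngram)
    return bool(NOUN_PHRASE.fullmatch(tag_pattern) or VERB_PHRASE.fullmatch(tag_pattern))

def filter_ngrams(tagged_ngrams):
    return [ng for ng in tagged_ngrams if keep_ngram(ng)]
```

Matching on the tag sequence rather than on the words keeps the patterns short and lets a different pattern set be swapped in for each speech act.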
*This is another round of phrase ranking, based on how "salient" the phrases are within the given topic
*Salience score is given as $SS(Ng_i) = GS(Ng_i) \times N_i$
* $N_i$ is the length of N-gram $Ng_i$
* $GS(Ng_i)$ is a graph score obtained by iterating over a graph $G = (V, E)$, where V is the set of N-grams and E is a set of edges weighted by the number of times the N-grams co-occur.
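The slides do not give the update rule used when iterating over the graph. The sketch below assumes a weighted PageRank-style iteration with a damping factor of 0.85, which is a common choice and not necessarily the one used here. N-grams are represented as tuples of words, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(tweets_ngrams):
    """Edge weight = number of tweets in which two N-grams co-occur."""
    weights = defaultdict(float)
    for ngrams_in_tweet in tweets_ngrams:
        for a, b in combinations(sorted(set(ngrams_in_tweet)), 2):
            weights[(a, b)] += 1
            weights[(b, a)] += 1
    return weights

def graph_scores(nodes, weights, damping=0.85, iterations=50):
    # Assumption: weighted PageRank-style update.
    scores = {v: 1.0 for v in nodes}
    out_weight = defaultdict(float)
    for (a, _), w in weights.items():
        out_weight[a] += w
    for _ in range(iterations):
        updated = {}
        for v in nodes:
            incoming = sum(
                scores[u] * weights[(u, v)] / out_weight[u]
                for u in nodes if (u, v) in weights
            )
            updated[v] = (1 - damping) + damping * incoming
        scores = updated
    return scores

def salience(nodes, weights):
    """Salience = graph score multiplied by N-gram length, as in the formula above."""
    gs = graph_scores(nodes, weights)
    return {v: gs[v] * len(v) for v in nodes}
```

Multiplying by length favors longer N-grams, which tend to be more informative in a summary than single words.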
*A greedy strategy was used to select the most salient phrases
*The selected phrases were used to fill templates
*Speech acts were used to describe how people are talking about the salient phrases
*Redundant phrases were detected using a Jaccard coefficient threshold of 0.275
*Hashtags were split into words using an existing application
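A minimal sketch of the greedy selection with the Jaccard redundancy check follows. It assumes phrases are tuples of words, that a candidate is redundant when its Jaccard coefficient with an already selected phrase is at least 0.275, and that a fixed number of phrases is kept. The comparison direction and the `max_phrases` parameter are assumptions.

```python
def jaccard(a, b):
    """Jaccard coefficient between the word sets of two phrases."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)

def select_phrases(salience_scores, max_phrases=3, threshold=0.275):
    """Greedily take the most salient phrases, skipping near-duplicates."""
    selected = []
    ranked = sorted(salience_scores.items(), key=lambda kv: kv[1], reverse=True)
    for phrase, _ in ranked:
        if all(jaccard(phrase, chosen) < threshold for chosen in selected):
            selected.append(phrase)
        if len(selected) == max_phrases:
            break
    return selected
```

With this threshold, `("world", "cup", "final")` is dropped once `("world", "cup")` has been selected, because the two phrases share two of their three distinct words.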
*The main reference is Zhang et al., 2013
*Speech acts were not used for filtering out tweets
*Two rounds of POS filtering were done, as opposed to one in the original paper
*A greedy strategy was used, as opposed to the round-robin strategy used in the original paper
*Representative tweets were also presented to give the user some sense of context
*Speech Act Training Data Set (Liu et al.), for speech act classification
*Sentiment140 dataset, for the background corpus
*TweetMotif dataset (O'Connor et al., 2010), for the background corpus
*Twitter NLP (Gimpel et al., 2011), for POS tagging
*Tweets collected via the Twitter API, for testing the summarization model (see examples on the site)
*Entity Extraction
  *Preprocessing, proper noun extraction
  *Google Knowledge Graph: Freebase
*Categorization
  *uClassify API
  *Extract highest-ranking category
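Two of the local steps above can be illustrated without calling any external service. The sketch assumes proper nouns carry the `^` tag of the Twitter POS tagset and that classification results arrive as a plain mapping from category to score. That mapping is a hypothetical stand-in, not the actual uClassify response format.

```python
def proper_noun_candidates(tagged_tokens):
    """Group consecutive proper-noun tokens (tag '^') into candidate entity names."""
    candidates, current = [], []
    for word, tag in tagged_tokens:
        if tag == "^":
            current.append(word)
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates

def top_category(category_scores):
    """Pick the highest-ranking category from a {category: score} mapping."""
    return max(category_scores, key=category_scores.get)
```

The candidate names would then be looked up in the knowledge base, and only the top category would be shown for the topic.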
*Front end
  *Auto-detection or manual selection of location
  *Displays trending topics
  *Sends requests to the server to analyze topics
*Back end
  *Tweet retrieval
  *Analysis using the summarization model
  *Sends results to the Freebase and uClassify APIs
  *Caches results
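The caching step on the back end might be sketched as a small time-based cache keyed by topic and location. The class name, the 15-minute expiry and the injected clock are all assumptions made for illustration. The actual back end is described as Java/Play with MySQL, so this only shows the idea.

```python
import time

class TopicCache:
    """Keeps an analysis result for a limited time so repeated requests skip re-analysis."""

    def __init__(self, ttl_seconds=900, clock=time.time):
        self.ttl = ttl_seconds      # assumption: results expire after 15 minutes
        self.clock = clock          # injectable for testing
        self.store = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self.store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = compute()           # e.g. retrieve tweets and run the summarizer
        self.store[key] = (now, value)
        return value
```

A request handler would call `get_or_compute((topic, location), analyze)` so that only the first request for a trending topic triggers tweet retrieval and analysis.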
*Front end: HTML5, JavaScript, Google Maps API, AngularJS, jQuery
*Back end: Java / Play Framework and MySQL database
*Hosted on AWS
*Asked users to provide feedback on results
*Questions covered all 3 parts of the project
*Received 19 responses as of the time of making this slide
[Charts of user feedback ratings; averages shown: 3.89, 4.00, 4.21, 3.84, 4.16]
* Liu, Fei, Yang Liu, and Fuliang Weng. "Why is SXSW trending?: Exploring multiple text sources for Twitter topic summarization." 2011. 66--75.
* O'Connor, Brendan, Michel Krieger, and David Ahn. "TweetMotif: Exploratory Search and Topic Summarization for Twitter." 2010.
* Zhang, Renxian, Wenjie Li, Dehong Gao, and You Ouyang. "Automatic Twitter Topic Summarization With Speech Acts." IEEE Transactions on Audio, Speech, and Language Processing 21 (2013): 649--658.
* Gimpel, Kevin, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. "Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments." In Proceedings of ACL, 2011.
* Abeel, Thomas, Yves Van de Peer, and Yvan Saeys. "Java-ML: A Machine Learning Library." Journal of Machine Learning Research 10 (2009): 931--934.
*Tweets under a topic are loosely grouped together, sometimes sharing little in common
*Low performance of speech act classification
*Detection of the main entity
*Normalization of tweets could at times produce odd results
*Twitter API rate limit: 180 search queries per user per application per 15 minutes
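A client can stay under the search quota with a simple limiter such as the sketch below. It uses a sliding window of 180 calls per 900 seconds, which is a conservative assumption. The class and method names are hypothetical, and the sketch does not reflect how the API itself counts requests.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allows at most max_calls within any window_seconds interval."""

    def __init__(self, max_calls=180, window_seconds=900, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock          # injectable for testing
        self.calls = deque()

    def try_acquire(self):
        now = self.clock()
        # Forget calls that have left the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False
```

The back end would check `try_acquire()` before each search request and fall back to cached results when it returns `False`.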
*Real-time indexing of tweets before they start trending, using Lucene/Elasticsearch or other full-text engines
*Detection of sentence overlap in the selected phrases
*Detecting redundancies semantically
*Different templates for various topic categories
