SENTIMENT
CLASSIFICATION
Practical Machine Learning and Rails Part2
TRAINING DATA:
- tweets
- positive/negative
  - use emoticons from twitter
  :-) or :-(
BUILDING TRAINING DATA:
  NEGATIVE
  is upset that he cant update his Facebook by texting it... and might cry as a
  result School today also. Blah!
  I couldnt bear to watch it. And I thought the UA loss was embarrassing
  I hate when I have to call and wake people up


  POSITIVE
  Just woke up. Having no school is the best feeling ever
  Im enjoying a beautiful morning here in Phoenix
  dropping molly off getting ice cream with Aaron
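
A rough sketch of how the emoticon labelling above might look in Ruby. The method name `label_tweet` and the emoticon lists are assumptions made for illustration, not code from the talk; tweets with no emoticon, or with both kinds, are simply skipped.

```ruby
# Hypothetical emoticon lists; the slides only mention :-) and :-(
POSITIVE_EMOTICONS = [":-)", ":)"].freeze
NEGATIVE_EMOTICONS = [":-(", ":("].freeze

# Returns [label, text_without_emoticons], or nil when the tweet is ambiguous
def label_tweet(text)
  positive = POSITIVE_EMOTICONS.any? { |e| text.include?(e) }
  negative = NEGATIVE_EMOTICONS.any? { |e| text.include?(e) }
  return nil if positive == negative # neither or both: no usable label

  # Strip the emoticons so the classifier cannot simply memorise them
  cleaned = (POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS)
            .reduce(text) { |t, e| t.gsub(e, "") }
            .squeeze(" ").strip
  [positive ? :positive : :negative, cleaned]
end

tweets = [
  "Having no school is the best feeling ever :-)",
  "I hate when I have to call and wake people up :-(",
  "dropping molly off"
]
training_data = tweets.map { |t| label_tweet(t) }.compact
```

The third tweet has no emoticon, so it is left out of `training_data`.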
FEATURES:
 BAG OF WORDS MODEL
 split the text into words, create a dictionary,
 and replace text with word counts
BAG OF WORDS
tweets:                   word vectors:
I ran fast                [1 1 1 0 0 0]
Bob ran far               [0 1 0 1 1 0]
I ran to Bob              [1 1 0 1 0 1]

   dictionary = %w{I ran fast Bob far to}
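
The word vectors above can be reproduced with a few lines of plain Ruby. This is a minimal sketch: the helper name `word_vector` is an assumption, and splitting on whitespace stands in for whatever tokenizer is actually used.

```ruby
tweets = ["I ran fast", "Bob ran far", "I ran to Bob"]

# Dictionary: every distinct word, in order of first appearance
dictionary = tweets.flat_map(&:split).uniq

# Replace a tweet with the count of each dictionary word it contains
def word_vector(text, dictionary)
  counts = text.split.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
  dictionary.map { |word| counts[word] }
end

word_vectors = tweets.map { |t| word_vector(t, dictionary) }
```

Each vector has one slot per dictionary word, so every tweet becomes a fixed-length row of numbers regardless of how long it is.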
CLASSIFIER:
  training examples:
 word vector -> labels


classification algorithm


        model
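
To make the flow above concrete (labelled word vectors go into an algorithm, a model comes out), here is a small multinomial Naive Bayes in plain Ruby. It is a stand-in chosen for brevity, not the algorithm used in the talk, which trains through Weka. The method names and the toy data are assumptions.

```ruby
# Training: turn labelled word vectors into a "model"
# (a hash of log-probabilities per label)
def train_naive_bayes(vectors, labels)
  size = vectors.first.length
  model = {}
  labels.uniq.each do |label|
    rows = vectors.each_index.select { |i| labels[i] == label }.map { |i| vectors[i] }
    word_counts = rows.transpose.map(&:sum)
    total = word_counts.sum
    model[label] = {
      log_prior: Math.log(rows.length.to_f / vectors.length),
      # Laplace (add-one) smoothing so unseen words do not produce log(0)
      log_likelihoods: word_counts.map { |c| Math.log((c + 1.0) / (total + size)) }
    }
  end
  model
end

# Prediction: pick the label with the highest log-probability score
def classify(model, vector)
  best = model.max_by do |_label, stats|
    stats[:log_prior] +
      vector.each_with_index.sum { |count, i| count * stats[:log_likelihoods][i] }
  end
  best.first
end

# Hypothetical toy data for the dictionary %w{good great bad awful}
vectors = [[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
labels  = [:positive, :positive, :negative, :negative]

model = train_naive_bayes(vectors, labels)
prediction = classify(model, [1, 0, 0, 0])
```

The model here is just a hash, which plays the same role as the serialized model file loaded later in the querying slides.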
WEKA
• open source java app
• contains common ML algorithms
• gui interface
• can access it from jruby
• helps with:
    • converting words into vectors
    • training/test, cross-validation,
      metrics
ARFF FILE
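
For reference, an ARFF file is plain text: a header that declares the attributes, then one data row per example. The sketch below writes one by hand in Ruby rather than through a library; the relation name, attribute names and helper names are assumptions.

```ruby
# Quote a string for ARFF: wrap in single quotes, escape backslashes and quotes
def arff_string(text)
  "'" + text.gsub("\\") { "\\\\" }.gsub("'") { "\\'" } + "'"
end

# Build the ARFF text for a list of [text, label] pairs
def to_arff(relation, examples)
  lines = [
    "@relation #{relation}",
    "",
    "@attribute text string",
    "@attribute sentiment {negative,positive}",
    "",
    "@data"
  ]
  examples.each { |text, label| lines << "#{arff_string(text)},#{label}" }
  lines.join("\n") + "\n"
end

examples = [
  ["Im enjoying a beautiful morning here in Phoenix", "positive"],
  ["I hate when I have to call and wake people up", "negative"]
]
arff = to_arff("sentiment", examples)
```

The class attribute is declared last, which matches the common convention of treating the final attribute as the class.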
TRAINING IN
   WEKA

[SHOW EXAMPLE HERE]
EVALUATION
• correctly classified
• mean squared error
EVALUATION

false negatives/positives
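
The metrics named on these slides can be computed by hand. This is a minimal sketch for a two-class problem, assuming labels of 1 (positive) and 0 (negative) and a classifier that outputs the probability of the positive class; the method name and the 0.5 threshold are assumptions.

```ruby
# actual: true labels (1 = positive, 0 = negative)
# predicted: the classifier's probability that each example is positive
def evaluate(actual, predicted, threshold = 0.5)
  counts = { true_positive: 0, true_negative: 0, false_positive: 0, false_negative: 0 }
  actual.zip(predicted).each do |truth, probability|
    guess = probability >= threshold ? 1 : 0
    key = if guess == 1
            truth == 1 ? :true_positive : :false_positive
          else
            truth == 0 ? :true_negative : :false_negative
          end
    counts[key] += 1
  end
  correct = counts[:true_positive] + counts[:true_negative]
  squared_errors = actual.zip(predicted).map { |truth, prob| (truth - prob)**2 }
  counts.merge(
    accuracy: correct.to_f / actual.length,
    mean_squared_error: squared_errors.sum / actual.length
  )
end

actual    = [1, 1, 0, 0]
predicted = [0.9, 0.4, 0.2, 0.7]
report = evaluate(actual, predicted)
```

Accuracy alone hides which kind of mistake is being made, which is why the false positive and false negative counts are reported separately.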
SENTIMENT
   CLASSIFICATION
      EXAMPLE
https://github.com/ryanstout/mlexample
QUERYING
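# JRuby: import the Java classes used below (assumes weka.jar is on the classpath)
java_import "java.io.FileReader"
java_import "weka.core.Instances"
java_import "weka.core.SerializationHelper"

# Open the ARFF file that defines the attributes
arff_path = Rails.root.join("data/sentiment.arff").to_s
arff = FileReader.new(arff_path)

# Load the trained model (a serialized Java object)
model_path = Rails.root.join("models/sentiment.model").to_s
classifier = SerializationHelper.read(model_path)

# Load the dataset from the ARFF; if no class attribute is set, use the last one
data = Instances.new(arff, 1)
data.set_class_index(data.num_attributes - 1) if data.class_index == -1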
QUERYING

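# Continues from the previous slide: uses `data` and `classifier`
java_import "weka.core.SparseInstance"

# Build a single instance that holds the submitted message
instance = SparseInstance.new(data.num_attributes)
instance.set_dataset(data)
instance.set_value(data.attribute(0), params[:sentiment][:message])

# The distribution holds one predicted value per class; this code takes
# the first entry and inverts it to get the positive share
result = classifier.distribution_for_instance(instance).first
percent_positive = 1 - result.to_f

@message = "The text is #{(percent_positive * 100.0).round}% positive"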
HOW DO WE
      IMPROVE?

•bigger dictionary
•bi-grams/tri-grams
•part of speech tagging
•more data
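
Bi-grams and tri-grams are a sliding window over the words of the text. A minimal Ruby sketch, with the helper name `ngrams` being an assumption:

```ruby
# Sliding window of n consecutive words
def ngrams(text, n)
  text.split.each_cons(n).map { |words| words.join(" ") }
end

bigrams  = ngrams("the cat ran out the door", 2)
# => ["the cat", "cat ran", "ran out", "out the", "the door"]
trigrams = ngrams("the cat ran out the door", 3)
```

Each n-gram can then be added to the dictionary like a single word, so the classifier sees some word order at the cost of a much larger dictionary.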
Feature Generation

 think about what information is
 valuable to an expert
 remove data that isn't useful
 (attribute selection)
ATTRIBUTE
     SELECTION


[SHOW ATTRIBUTE SELECTION
EXAMPLE]
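
As a stand-in for the live demo, here is one very simple form of attribute selection in Ruby: dropping dictionary words that appear in too few documents. This is only a sketch of the idea, not the method shown in the talk; the method name and the `min_docs` parameter are assumptions.

```ruby
# Keep only the dictionary words that appear in at least min_docs documents,
# and drop the matching columns from every word vector
def select_attributes(dictionary, vectors, min_docs: 2)
  keep = dictionary.each_index.select do |col|
    vectors.count { |v| v[col] > 0 } >= min_docs
  end
  [keep.map { |col| dictionary[col] }, vectors.map { |v| v.values_at(*keep) }]
end

dictionary = %w[I ran fast Bob far to]
vectors = [[1, 1, 1, 0, 0, 0], [0, 1, 0, 1, 1, 0], [1, 1, 0, 1, 0, 1]]
small_dictionary, small_vectors = select_attributes(dictionary, vectors)
```

Words seen in only one document give the classifier almost nothing to generalize from, so removing them shrinks the vectors with little loss.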
DOMAIN PRICE
    PREDICTION

• predict how much a domain would
 sell for
TRAINING DATA

• domains
• historical sale prices for domains
FEATURES
• split domain by words
• generate features for each word
   • how common the word is
   • number of google results for each
      word
   • cpc for the word
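
A sketch of how the features above might be generated in Ruby, assuming at most three words per domain and zero padding when there are fewer. The word statistics are made-up placeholder values, and the greedy longest-match split is a simplification chosen for brevity; all names are assumptions.

```ruby
# Hypothetical per-word statistics; in practice these would come from a word
# frequency list, search result counts and a cost-per-click source
WORD_STATS = {
  "fast" => { frequency: 0.8, results: 9_000_000, cpc: 1.20 },
  "car"  => { frequency: 0.9, results: 12_000_000, cpc: 2.50 },
  "cars" => { frequency: 0.7, results: 10_000_000, cpc: 2.75 }
}.freeze
MAX_WORDS = 3

# Greedy longest-match split of a domain name into known words
def split_domain(domain)
  name = domain.split(".").first.downcase
  words = []
  until name.empty?
    match = WORD_STATS.keys.select { |w| name.start_with?(w) }.max_by(&:length)
    return [] unless match # give up when part of the name is not a known word
    words << match
    name = name[match.length..-1]
  end
  words
end

# Three features per word slot, padded with zeros when there are fewer words
def domain_features(domain)
  words = split_domain(domain).first(MAX_WORDS)
  (0...MAX_WORDS).flat_map do |slot|
    stats = WORD_STATS[words[slot]]
    stats ? stats.values_at(:frequency, :results, :cpc) : [0, 0, 0]
  end
end

features = domain_features("fastcars.com")
```

Padding to a fixed number of word slots keeps every domain's feature vector the same length, which the regression algorithm requires.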
ALGORITHM

support vector regression
   functions > SMOreg in weka
WHAT WE DIDN’T
    COVER

• collaborative filtering
• clustering
• theorem proving (classical AI)
ADDITIONAL
    RESOURCES

stanford machine learning class
    ml-class.org
TOOLS
• weka
• libsvm, liblinear
• vowpal wabbit (big dictionaries)
• recommendify
   •   https://github.com/paulasmuth/recommendify
QUESTIONS

contact us on twitter at
@tectonic and @ryanstout


Editor's Notes

  • #2: having an example makes it easier to understand the process
  • #3-#7: could also use movie/product review data
  • #9-#11: bag of words - a way of generating features from text that only looks at which words occur in the text; it doesn't look at word order, syntax, grammar, punctuation, etc.
  • #12-#15: words in the dictionary array are replaced with their counts in the text
  • #16-#20: word vectors/labels
  • #28: generated using RARFF
  • #33: load the arff; load the model (a serialized java object); load a dataset
  • #34: create a sparse instance, set the dataset; get the distribution (predicted values for each class)
  • #35-#38: the cat ran out the door -> [the cat] [cat ran] [ran out]...
  • #46-#50: assume a max of three words; each feature of three words, 0s if fewer words
  • #52-#54: clustering - similar documents, related terms
  • #56: vowpal - good for large datasets, contains different algorithms (matrix factorization, collab filtering, lda, etc.)
  • #57: hopefully this helped you know the tools and techniques; you can teach yourself; feel free to contact us