Practical Machine Learning and Rails Part2

The document discusses sentiment classification using training data from tweets labeled as positive or negative, focusing on the bag-of-words model and on training classifiers with the Weka toolkit. It outlines feature generation and attribute selection, and lists improvements such as a larger dictionary and bi-grams. It also briefly covers domain price prediction and names topics it does not cover, such as collaborative filtering and clustering.

TRAINING DATA:
- tweets
- positive/negative
- use emoticons from twitter
:-) or :-(
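
A minimal sketch of how emoticons could turn raw tweets into labeled training data. The emoticon lists and the `label_tweet` helper are illustrative assumptions; the slides only name `:-)` and `:-(`.

```ruby
# Emoticon lists are assumptions for illustration
POSITIVE_EMOTICONS = [":-)", ":)"].freeze
NEGATIVE_EMOTICONS = [":-(", ":("].freeze

# Returns [label, cleaned_text], or nil when the tweet has no emoticon
# or mixes both kinds (ambiguous, so it is skipped).
def label_tweet(text)
  has_pos = POSITIVE_EMOTICONS.any? { |e| text.include?(e) }
  has_neg = NEGATIVE_EMOTICONS.any? { |e| text.include?(e) }
  return nil if has_pos == has_neg

  # Strip the emoticons so the classifier cannot simply memorize them
  cleaned = (POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS)
            .reduce(text) { |t, e| t.gsub(e, "") }
            .squeeze(" ")
            .strip
  [has_pos ? :positive : :negative, cleaned]
end

label_tweet("Having no school is the best feeling ever :-)")
# => [:positive, "Having no school is the best feeling ever"]
```

Tweets without an emoticon are dropped rather than guessed at, which keeps the labels cheap but somewhat noisy.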

BUILDING TRAINING DATA:
NEGATIVE
is upset that he cant update his Facebook by texting it... and might cry as a
result School today also. Blah!
I couldnt bear to watch it. And I thought the UA loss was embarrassing
I hate when I have to call and wake people up

POSITIVE
Just woke up. Having no school is the best feeling ever
Im enjoying a beautiful morning here in Phoenix
dropping molly off getting ice cream with Aaron

FEATURES:
BAG OF WORDS MODEL
split the text into words, create a dictionary,
and replace text with word counts

BAG OF WORDS
tweets:          word vectors:
I ran fast       [1 1 1 0 0 0]
Bob ran far      [0 1 0 1 1 0]
I ran to Bob     [1 1 0 1 0 1]

dictionary = %w{I ran fast Bob far to}
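
The bag-of-words step can be sketched in plain Ruby as follows. The `word_vector` helper is a hypothetical name, and splitting on whitespace with case-sensitive matching is a simplifying assumption.

```ruby
tweets = ["I ran fast", "Bob ran far", "I ran to Bob"]

# Dictionary: every distinct word, in order of first appearance
dictionary = tweets.flat_map(&:split).uniq

# Replace a text with the count of each dictionary word it contains
def word_vector(text, dictionary)
  counts = text.split.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
  dictionary.map { |word| counts[word] }
end

vectors = tweets.map { |tweet| word_vector(tweet, dictionary) }
# => [[1, 1, 1, 0, 0, 0], [0, 1, 0, 1, 1, 0], [1, 1, 0, 1, 0, 1]]
```

Words that are not in the dictionary are simply ignored, so a new tweet always maps to a vector of the same length.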

CLASSIFIER:
training examples:
word vector -> labels

classification algorithm

model
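
To make the flow from training examples to model concrete, here is a small sketch that uses multinomial naive Bayes with add-one smoothing. The slides do not name a specific algorithm at this point, so naive Bayes is an assumption chosen for brevity, and the toy dictionary and helper names are hypothetical.

```ruby
# Train: turn labeled word vectors into a model (per-class log priors and
# per-word log likelihoods with add-one smoothing).
def train_naive_bayes(vectors, labels)
  vocab_size = vectors.first.size
  model = {}
  labels.uniq.each do |label|
    rows = vectors.each_index.select { |i| labels[i] == label }.map { |i| vectors[i] }
    word_totals = rows.transpose.map(&:sum)
    total = word_totals.sum + vocab_size
    model[label] = {
      log_prior: Math.log(rows.size.to_f / vectors.size),
      log_likelihood: word_totals.map { |c| Math.log((c + 1).to_f / total) }
    }
  end
  model
end

# Classify: pick the label with the highest log-probability for the vector
def classify(model, vector)
  best = model.max_by do |_label, params|
    params[:log_prior] +
      vector.each_with_index.sum { |count, i| count * params[:log_likelihood][i] }
  end
  best.first
end

# Toy data, assuming dictionary = %w{best enjoying hate upset}
vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
labels  = [:positive, :positive, :negative, :negative]

model = train_naive_bayes(vectors, labels)
classify(model, [1, 0, 0, 0]) # => :positive
```

The `model` hash plays the role of the serialized model in the pipeline: it is built once from training examples and then reused for every query.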

WEKA
• open source java app
• contains common ML algorithms
• gui interface
• can access it from jruby
• helps with:
  • converting words into vectors
  • training/test, cross-validation, metrics

EVALUATION
• correctly classified
• mean squared error
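
Both metrics can be computed by hand, which helps when reading evaluation output. This is a sketch under assumptions: labels are encoded as 1.0 (positive) and 0.0 (negative), predictions are probabilities of the positive class, and the 0.5 threshold and function names are illustrative.

```ruby
# Fraction of examples whose thresholded prediction matches the label
def correctly_classified(actual, predicted, threshold = 0.5)
  hits = actual.zip(predicted).count { |a, pr| (pr >= threshold) == (a == 1.0) }
  hits.to_f / actual.size
end

# Average squared distance between predicted probability and true label
def mean_squared_error(actual, predicted)
  actual.zip(predicted).sum { |a, pr| (a - pr)**2 } / actual.size
end

actual    = [1.0, 0.0, 1.0, 0.0]
predicted = [0.9, 0.2, 0.4, 0.6]

accuracy = correctly_classified(actual, predicted) # 2 of 4 correct
mse      = mean_squared_error(actual, predicted)
```

Accuracy only counts which side of the threshold a prediction lands on, while mean squared error also penalizes confident mistakes more than hesitant ones.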

SENTIMENT
CLASSIFICATION
EXAMPLE
https://github.com/ryanstout/mlexample

QUERYING
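# JRuby: import the Java classes used below (Weka must be on the classpath)
java_import 'java.io.FileReader'
java_import 'weka.core.Instances'
java_import 'weka.core.SerializationHelper'

# load the ARFF file that describes the attributes
arff_path = Rails.root.join("data/sentiment.arff").to_s
arff = FileReader.new(arff_path)

# load the trained model (a serialized Java object)
model_path = Rails.root.join("models/sentiment.model").to_s
classifier = SerializationHelper.read(model_path)

# build a dataset from the ARFF; default the class to the last attribute
data = Instances.new(arff, 1).tap do |instances|
  if instances.class_index == -1
    instances.set_class_index(instances.num_attributes - 1)
  end
end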

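QUERYING
java_import 'weka.core.SparseInstance'

# create a sparse instance and attach it to the dataset
instance = SparseInstance.new(data.num_attributes)
instance.set_dataset(data)
instance.set_value(data.attribute(0), params[:sentiment][:message])

# the distribution holds one predicted value per class;
# the first entry is treated as the negative class
result = classifier.distribution_for_instance(instance).first
percent_positive = 1 - result.to_f
@message = "The text is #{(percent_positive*100.0).round}% positive"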

HOW DO WE
IMPROVE?

• bigger dictionary
• bi-grams/tri-grams
• part of speech tagging
• more data
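
Bi-grams and tri-grams are short to generate in Ruby. This sketch uses the sentence from the speaker notes; the `ngrams` helper name is an assumption.

```ruby
# Every run of n consecutive words, joined back into a string
def ngrams(text, n)
  text.split.each_cons(n).map { |words| words.join(" ") }
end

ngrams("the cat ran out the door", 2)
# => ["the cat", "cat ran", "ran out", "out the", "the door"]
```

Each n-gram can then be added to the dictionary like a single word, at the cost of a much larger dictionary.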

Feature Generation

think about what information is
valuable to an expert
remove data that isn't useful
(attribute selection)

ATTRIBUTE
SELECTION

[SHOW ATTRIBUTE SELECTION
EXAMPLE]
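
The slide's own example is not included in the text, so the following is only a simple stand-in for the idea: score each dictionary word by how differently it appears in positive and negative tweets, then keep the top `k` columns. The scoring rule, the toy data and the `select_attributes` name are all assumptions, and the sketch expects both classes to be present.

```ruby
# Returns the indices of the k columns that best separate the two classes
def select_attributes(vectors, labels, k)
  pos = vectors.each_index.select { |i| labels[i] == :positive }
  neg = vectors.each_index.select { |i| labels[i] == :negative }

  scores = (0...vectors.first.size).map do |col|
    pos_rate = pos.count { |i| vectors[i][col] > 0 }.to_f / pos.size
    neg_rate = neg.count { |i| vectors[i][col] > 0 }.to_f / neg.size
    (pos_rate - neg_rate).abs
  end

  # Indices of the k best columns, returned in their original order
  scores.each_with_index.max_by(k) { |score, _index| score }.map(&:last).sort
end

dictionary = %w{the best hate is}
vectors = [[1, 1, 0, 1], [1, 1, 0, 0], [1, 0, 1, 1], [1, 0, 1, 0]]
labels  = [:positive, :positive, :negative, :negative]

kept = select_attributes(vectors, labels, 2)
kept_words = kept.map { |i| dictionary[i] }
# => ["best", "hate"]
```

Words such as "the" and "is" occur equally often in both classes, so they score zero and are dropped.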

DOMAIN PRICE
PREDICTION

• predict how much a domain would
sell for

TRAINING DATA

• domains
• historical sale prices for domains

FEATURES
• split domain by words
• generate features for each word
  • how common the word is
  • number of google results for each word
  • cpc for the word
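
A sketch of the feature generation above, following the speaker notes (at most three words per domain, zeros when there are fewer). The `WORD_STATS` table, its numbers, and the greedy longest-match split are hypothetical; real values would come from a word-frequency list, a search engine and an ads data source.

```ruby
# Hypothetical lookup: word => [commonness, search results, cost per click]
WORD_STATS = {
  "cheap"   => [0.8, 2_000_000, 1.50],
  "flights" => [0.6, 900_000, 3.20],
  "car"     => [0.9, 5_000_000, 2.10]
}.freeze
MAX_WORDS = 3
FEATURES_PER_WORD = 3

# Greedy longest-match split of the domain name into known words
def split_domain(domain)
  name = domain.split(".").first.downcase
  words = []
  until name.empty?
    match = WORD_STATS.keys.select { |w| name.start_with?(w) }.max_by(&:length)
    return [] unless match # give up if part of the name is not a known word
    words << match
    name = name[match.length..-1]
  end
  words
end

# One fixed-length vector per domain: three numbers per word,
# padded with zeros when the domain has fewer than MAX_WORDS words
def domain_features(domain)
  words = split_domain(domain).first(MAX_WORDS)
  features = words.flat_map { |w| WORD_STATS[w] }
  features + [0] * (MAX_WORDS * FEATURES_PER_WORD - features.size)
end

domain_features("cheapflights.com")
# => [0.8, 2000000, 1.5, 0.6, 900000, 3.2, 0, 0, 0]
```

Padding keeps every domain's vector the same length, which a regression algorithm needs regardless of how many words the domain contains.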

ALGORITHM

support vector regression
functions > SMOreg in weka

WHAT WE DIDN’T
COVER

• collaborative filtering
• clustering
• theorem proving (classical AI)

ADDITIONAL
RESOURCES

stanford machine learning class
ml-class.org

TOOLS
• weka
• libsvm, liblinear
• vowpal wabbit (big dictionaries)
• recommendify: https://github.com/paulasmuth/recommendify

QUESTIONS

contact us on twitter at
@tectonic and @ryanstout

Practical Machine Learning and Rails Part2

1. SENTIMENT CLASSIFICATION

32. ARFF FILE

33. TRAINING IN WEKA [SHOW EXAMPLE HERE]

35. EVALUATION false negative/positives

Editor's Notes

#2: having an example makes it easier to understand the process
#3-#7: could also use movie/product review data
#9-#11: bag of words - a way of generating features from text that only looks at which words occur in the text; it doesn't look at word order, syntax, grammar, punctuation, etc.
#12-#15: words in the dictionary array are replaced with their counts in the text
#16-#20: word vectors/labels
#28: generated using RARFF
#33: load the arff; load the model (a serialized java object); load a dataset
#34: create a sparse instance, set the dataset; get the distribution (predicted values for each class)
#35-#38: the cat ran out the door -> [the cat] [cat ran] [ran out]...
#46-#50: assume a max of three words; each feature covers three words, 0's if fewer words
#52-#54: clustering - similar documents, related terms
#56: vowpal - good for large datasets, contains different algorithms (matrix factorization, collab filtering, lda, etc.)
#57: hopefully this helped you know the tools and techniques; you can teach yourself; feel free to contact us
