Practical Data Analysis in Python
Hilary Mason
@hmason
www.hilarymason.com
hilary@path101.com
Data is ubiquitous.
The ability and tools to use it are not.
(Focused) Data == Intelligence
Data Analysis on the Web
Data items change rapidly.
Data items are not independent.
There’s a lot of semi-structured data around.
There’s a LOT of data around.
==
Too many problems, few tools, and few experts.
Entity Disambiguation
This is important.
ME
UGLY HAG
Entity Disambiguation
This is important.
Company disambiguation is a very common
problem – Are “Microsoft”, “Microsoft
Corporation”, and “MS” the same company?
This is a hard problem.
SPAM sucks
Classification
Document classification.
Image recognition.
Topic recognition.
Text Parsing
Recommendation Systems
Product recommendations.
Disease predictions.
Behavior analysis.
IEEE Tag Clustering
immunity
ultrasound
medical imaging
medical devices
thermoelectric devices
fault-tolerant circuits
low power devices
Python for Data Analysis
import why_python_is_awesome
Python is readable.
Easy to transition from Matlab or R.
Numerical computing support.
Growing set of machine learning libraries.
Libraries
NLTK (Natural Language Toolkit) – www.nltk.org
mlpy (Machine Learning PY) – mlpy.fbk.eu
numpy & scipy – scipy.org
An EC2 AMI provisioned with all of the toys you
need:
http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/
MachetEC2
Practical Data Analysis in Python
Supervised Classification
[Diagram: two pipelines]
Training: Training Data -> Feature Extractor -> Trained Classifier
Classification: Text -> Feature Extractor -> Trained Classifier -> Spam / Not Spam
Data: Tweets
Hand-classified. For example, some spam:
| don't disrespect me. I just wanted yall to get a head start so don't feel bad when I have more followers in two days. http://xyyx.eu/a1ha |
| oh yay more new followers..hiii...if u want go to http://xyyx.eu/a1hb |
| My friend made this new tool to get more twitter followers, http://xyyx.eu/a1ht |
| Yes, Twitter is doing some Follower/Following count corrections. Get it back at: http://xyyx.eu/a1h8 |
| man if i see one more person cry about losing followers!!! http://xyyx.eu/a1h4 |
Features
def document_features(self, document):
    document_words = set(document)
    features = {}
    for word in self.word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
Break tweets into lists of relevant words.
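A minimal sketch of the "break tweets into lists of relevant words" step, written as standalone functions rather than methods. The regular expression, the `tokenize` and `top_words` helpers, and the choice to keep URLs and punctuation as tokens are all assumptions made for illustration; the slides do not show how the word list was actually built.

```python
import re
from collections import Counter

def tokenize(tweet):
    """Lowercase a tweet and split it into URLs, words, and single punctuation marks."""
    return re.findall(r"https?://\S+|\w+|[^\w\s]", tweet.lower())

def top_words(tokenized_tweets, n):
    """Return the n most frequent tokens across all tweets (hypothetical vocabulary choice)."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return [word for word, _ in counts.most_common(n)]

def document_features(document, word_features):
    """Map each vocabulary word to whether it occurs in the tokenized document."""
    document_words = set(document)
    return {"contains(%s)" % word: (word in document_words) for word in word_features}

tokens = tokenize("Get more followers! http://xyyx.eu/a1ha")
features = document_features(tokens, ["followers", "python"])
print(tokens)
print(features)
```

Keeping punctuation as separate tokens is a deliberate choice here, since single characters can turn out to be informative features.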
Naïve Bayesian Classifier
P(A|B) = the conditional probability of A given B
http://yudkowsky.net/rational/bayes
http://blog.oscarbonilla.com/2009/05/visualizing-bayes-theorem/
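Bayes' theorem relates the two conditional probabilities: P(A|B) = P(B|A) P(A) / P(B). A small worked example for a single word follows. All the numbers are made up for illustration and do not come from the tweet data.

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic.
p_spam = 0.2                 # P(spam): prior share of spam tweets
p_word_given_spam = 0.5      # P(word | spam)
p_word_given_ham = 0.05      # P(word | not spam)

# P(word), by the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))
```

A naïve Bayes classifier applies the same rule to every feature at once, treating the features as independent given the label.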
classifier = nltk.NaiveBayesClassifier.train(train_set)
Classifier Accuracy
Use a hand-classified test set to see the accuracy
of the classifier:
nltk.classify.accuracy(classifier, test_set)
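The two NLTK calls above can be put together in an end-to-end sketch. It assumes `nltk` is installed. The six training tweets, the two test tweets, the four-word vocabulary, and the `not_spam` label are toy data invented for illustration, not the hand-classified set from the talk.

```python
import nltk

# Hypothetical hand-classified tweets, already split into tokens.
train_tweets = [
    (["get", "more", "followers", "now"], "spam"),
    (["more", "followers", "with", "this", "tool"], "spam"),
    (["free", "followers", "here"], "spam"),
    (["learning", "python", "today"], "not_spam"),
    (["python", "data", "analysis", "talk"], "not_spam"),
    (["reading", "about", "data"], "not_spam"),
]
test_tweets = [
    (["new", "tool", "for", "followers"], "spam"),
    (["python", "data", "slides"], "not_spam"),
]

# Assumed vocabulary; in practice it would be derived from the training data.
word_features = ["followers", "more", "python", "data"]

def document_features(document):
    """Map each vocabulary word to whether it occurs in the tokenized tweet."""
    document_words = set(document)
    return {"contains(%s)" % w: (w in document_words) for w in word_features}

train_set = [(document_features(toks), label) for toks, label in train_tweets]
test_set = [(document_features(toks), label) for toks, label in test_tweets]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, test_set)
prediction = classifier.classify(document_features(["new", "tool", "for", "followers"]))
print(accuracy, prediction)
```

Keeping the test tweets separate from the training tweets matters: accuracy measured on the training data itself would overstate how well the classifier generalizes.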
Feature Relevance
contains(') = True not_s : spam = 53.6 : 1.4
contains(") = True not_s : spam = 32.2 : 1.1
contains(#) = True not_s : spam = 22.0 : 1.0
contains(!) = True not_s : spam = 10.8 : 1.0
contains(*) = True spam : not_s = 7.4 : 1.0
contains(=) = True not_s : spam = 5.5 : 1.0
contains(i) = False spam : not_s = 5.2 : 1.0
contains(?) = True not_s : spam = 2.4 : 1.0
contains(:) = True spam : not_s = 2.3 : 1.0
contains(&) = True not_s : spam = 1.8 : 1.0
contains(;) = True not_s : spam = 1.6 : 1.0
contains($) = True spam : not_s = 1.5 : 1.0
contains(u) = True spam : not_s = 1.5 : 1.0
contains(2.0) = False not_s : spam = 1.4 : 1.0
contains(saw) = False not_s : spam = 1.4 : 1.0
contains(noble) = False not_s : spam = 1.4 : 1.0
contains(sound) = False not_s : spam = 1.3 : 1.0
contains(approach) = False not_s : spam = 1.3 : 1.0
contains(finally) = False not_s : spam = 1.3 : 1.0
contains(more) = False spam : not_s = 1.3 : 1.0
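The table above has the shape of the output of NLTK's `classifier.show_most_informative_features(20)`. Each row compares how likely a feature value is under one label versus the other. A rough sketch of how such a ratio can be computed from raw counts is below. The counts are hypothetical, and the add-0.5 smoothing is an assumption, not something stated in the slides.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, bins=2, gamma=0.5):
    """Ratio P(feature=value | label a) / P(feature=value | label b).

    Uses additive smoothing (gamma per bin) so a zero count never produces
    a zero probability. The gamma and bins defaults are assumptions.
    """
    p_a = (count_a + gamma) / (total_a + bins * gamma)
    p_b = (count_b + gamma) / (total_b + bins * gamma)
    return p_a / p_b

# Hypothetical counts: the feature is True in 30 of 100 tweets with label a
# and in 2 of 100 tweets with label b.
ratio = likelihood_ratio(30, 100, 2, 100)
print("a : b = %.1f : 1.0" % ratio)
```

A ratio far from 1.0 means the feature discriminates strongly between the labels; rows near 1.0 carry little information.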
Kitchen Sink
wash, rinse, repeat
Results
90% accuracy on spam tweets – not bad!
Other possibilities:
categorization – what do you tweet about?
human vs bot?
which celebrity tweeter are you?
<3 Data
Thank you!

Editor's Notes

  • #4: 1) Access to the data, and 2) CPU power/algorithms that are robust enough to analyze it
  • #15: NLTK – in development since 2001