What Is Data Science (Slides)
http://upload.wikimedia.org/wikipedia/commons/f/f0/DARPA_Big_Data.jpg
http://fc01.deviantart.net/fs71/i/2012/326/3/4/cute_dog_by_thomasmeadows345-d5lsah9.jpg
https://encryptedtbn2.gstatic.com/images?q=tbn:ANd9GcS9dKu3_Tzi-sWWyAqee5y0EhuvoIZNSya_rAKnuBBd0JYxPX7pw
http://www.freefoto.com/images/1351/06/1351_06_2---Books-Shakespeare-and-Company-Bookstore--The-Latin-Quarter-Paris_web.jpg
http://upload.wikimedia.org/wikipedia/commons/9/96/Bill_Nye,_Barack_Obama_and_Neil_deGrasse_Tyson_selfie_2014.jpg
http://upload.wikimedia.org/wikipedia/commons/e/e4/Green_Bank_100m_diameter_Radio_Telescope.jpg
https://c2.staticflickr.com/4/3273/3017878633_65beb1c7d6.jpg
https://c1.staticflickr.com/1/2/1349370_0703fce74c.jpg
It would take about 15 years to watch every video uploaded in one day.
570 new websites spring into existence every minute of every day.
http://smartdatacollective.com/bernardmarr/277731/big-data-25-facts-everyone-needs-know
http://pixabay.com/static/uploads/photo/2014/03/13/01/12/datacenter-286386_640.jpg
https://c2.staticflickr.com/2/1296/533233247_b6baa30fdb_z.jpg?zz=1
Video clip:
http://youtu.be/PBx7rgqeGG8?t=2m
https://c1.staticflickr.com/3/2300/2596366618_2d6cb01735.jpg
http://upload.wikimedia.org/wikipedia/commons/9/90/Kencf0618FacebookNetwork.jpg
http://upload.wikimedia.org/wikipedia/commons/b/bf/USDA_Hardiness_zone_map.jpg
http://upload.wikimedia.org/wikipedia/commons/1/1c/CMS_Higgs-event.jpg
What is a database?
Database
[dey-tuh-beys]
noun
A comprehensive collection of related data
organized for convenient access, generally in
a computer.
-dictionary.com
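To make "organized for convenient access" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
# Table name, column names, and rows are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, item TEXT, price REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("ana", "book", 12.50), ("ben", "lamp", 30.00), ("ana", "pen", 2.25)],
)

# "Convenient access": ask for the total spent per customer.
rows = conn.execute(
    "SELECT customer, SUM(price) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(rows)
```

The same question asked of a pile of unorganized receipts would mean reading every one by hand; the database answers it with one query.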
Types of Databases
http://www.oaddo.org
Online Shopping
Course Registration/Canvas
Travel
Etc., etc., etc.
Email
Posting status updates
Attending events
Etc., etc., etc.
https://www.google.com/maps/@38.8905569,-77.1721577,13z/data=!5m1!1e1
http://upload.wikimedia.org/wikipedia/commons/6/69/Netflix_logo.svg
https://c2.staticflickr.com/4/3324/3507973704_563846fe14_z.jpg?zz=1
How is data
collected about you
used to help you?
Data Scientist
Computer Scientist
Data collection systems
Machine Learning
Algorithms
Interface Design
Design/Manage/Query
Databases
Data Aggregation
Data Mining
Mathematician
Statistical Models
Evaluation Metrics
Predictive Analytics
Data Visualizations
Business Person
Domain Expertise
Knowing what
questions to ask
Interpreting results for
business decisions
Presenting outcomes
http://semanticommunity.info/@api/deki/files/27057/Figure14.png?size=bestfit&width=484&height=541&revision=1
Statistician
Pythonista
Financial Analyst
Biostatistician
Recommendation System
Information Architect
Spatial/GIS Analyst
Artificial Intelligence
Natural Language
Programmer
Computational Physicist
Engineer
Researcher
Neuroscientist
Extraction of Knowledge
Data Mining
Business Understanding
Data Understanding
Data Preparation
Modeling
Clustering
Classification
Regression
Evaluation
https://www.kaggle.com/c/seizure-detection
in unnecessary stimulation
Data provided
Sampling frequency
Channels (electrodes)
With real-life data, you won't know if a seizure will hit or how long until it does;
that's what you're trying to predict.
Correlation Coefficient r
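As a rough illustration of the correlation coefficient r, the sketch below computes Pearson's r between two lists of readings in plain Python. The function name pearson_r and the sample channel values are assumptions made for this example; they are not taken from the contest data or the winning code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings from two channels (electrodes).
channel_a = [1.0, 2.0, 3.0, 4.0]
channel_b = [2.0, 4.0, 6.0, 8.0]   # rises and falls with channel_a
channel_c = [8.0, 6.0, 4.0, 2.0]   # moves opposite to channel_a

print(pearson_r(channel_a, channel_b))
print(pearson_r(channel_a, channel_c))
```

Values near 1 mean two channels move together, values near -1 mean they move in opposite directions, and values near 0 mean little linear relationship.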
Eigenvalues: you can think of these as
scaling factors
Put all these values into a
Random Forest classifier
http://en.wikipedia.org/wiki/Fast_Fourier_transform
RandomForestClassifier(n_estimators=3000, min_samples_split=1,
bootstrap=False, random_state=0)
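A minimal sketch of how a classifier configured like the one above could be trained and scored with scikit-learn. This is not the winning pipeline: the data is synthetic (make_classification stands in for the extracted features), n_estimators is reduced to 200 so the example runs quickly, and min_samples_split is set to 2 because recent scikit-learn releases reject an integer value of 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-clip features; label 1 plays the role of "seizure".
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Same style of configuration as on the slide, with the assumptions noted above.
clf = RandomForestClassifier(
    n_estimators=200, min_samples_split=2, bootstrap=False, random_state=0
)
clf.fit(X_train, y_train)

# Probability of the positive class for each held-out clip.
seizure_prob = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, seizure_prob)
print(f"AUC on held-out data: {auc:.3f}")
```

Scoring on held-out data matters here: a forest this flexible can fit its training clips almost perfectly, so only unseen clips give an honest estimate.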
Judged on the mean area under the ROC curve (AUC) of two predictions.
Receiver Operating Characteristic = true positive rate vs. false positive rate.
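One way to build intuition for AUC: it equals the probability that a randomly chosen positive clip receives a higher score than a randomly chosen negative clip, with ties counted as half. The sketch below computes it that way in plain Python. The function name roc_auc and the four sample scores are assumptions for illustration, not contest data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = seizure) and model scores for four clips.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of the 4 pairs are ordered correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The pairwise loop is fine for small examples; library implementations use sorting for large data sets.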
His model will label 963 of every 1000 true seizure clips as seizures
He won $5000 (much less than UPenn/Mayo would have had to pay a
Data Scientist to develop this as an employee or consultant!)
A similar contest is currently posted with a $25,000 prize.
Other Examples
http://labs.strava.com/heatmap/#12/-78.90549/38.44669/blue/bike
http://xkcd.com/1425/
Programming
Math
Calculus
Linear Algebra
Statistics (2 levels)
Advanced: Optimization /
Linear Programming
Business Analytics
Data Mining
Others
Business / Communication
Graphic Design
http://www.ted.com/search?q=data
Explore
http://101.datascience.community/2014/10/17/data-sources-for-cool-datascience-projects-part-1-guest-post/
https://www.opensciencedatacloud.org/publicdata/
http://catalog.data.gov/dataset
https://archive.ics.uci.edu/ml/datasets.html?format=&task=clu&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table
Questions?
Renee T.
[contact me via twitter or blog for email]
@becomingdatasci
http://www.becomingadatascientist.com