Data Strategy
From Last Class…
“In pioneer days they used oxen for heavy pulling, when one couldn’t budge a log
they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but
for more systems of computers.” ~Grace Hopper
We have witnessed an explosion of algorithmic solutions
What you cannot achieve with a better algorithm can often be achieved with more data
Big data, if analyzed correctly, can give you better answers
e.g., traditional flu prediction vs. flu prediction from search-query data [1]
[1] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant, "Detecting influenza epidemics using search engine query data," Nature 457, 1012–1014 (2009)
Data Strategy
In this era of big data, what is your data strategy?
Essentially, how are you going to plan for the data challenge?
● It is not only about big data, but data in all sizes and forms
● Collecting data from customers used to be an elaborate task
○ e.g., surveys
● Nowadays data is available in abundance
○ thanks to technological advances and social networks
● Data is also generated by many of your own business processes and
applications
Components of a Data Strategy
● Data integration
● Metadata
● Data modeling
● Organizational roles and responsibilities
● Performance and metrics
● Security and privacy
● Structured data management
● Unstructured data management
● Business intelligence
● Data analysis and visualization
● Tapping into social data
Data Strategy at a high level
● How will you collect data? Aggregate data? What are your sources?
(e.g., social media)
● How will you store your data? And where?
● How will you use the data? Analytics? Data mining? Pattern
recognition?
● How will you present or report the data to the stakeholders and
decision makers? Visualization?
Example 1 with Exam Grades
Question 1..5, total, mean, median, mode; mean ver1, mean ver2
Example 2 with Same Grades
Example 3 with Same Grades
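The grade examples refer to per-question scores summarized by total, mean, median, and mode. A minimal sketch of that summary is below; the scores are made up for illustration and are not the grades from the slides.

```python
from statistics import mean, median, multimode

# Hypothetical scores: one row per student, one column per question (Q1..Q5).
scores = [
    [8, 9, 7, 10, 6],
    [5, 6, 7, 4, 8],
    [10, 9, 9, 10, 10],
    [8, 9, 7, 10, 6],
]

# Per-student total across the five questions.
totals = [sum(row) for row in scores]

summary = {
    "mean": mean(totals),
    "median": median(totals),
    "mode": multimode(totals),  # a list, because several values can tie
}

# Per-question means hint at which questions were hardest.
question_means = [mean(col) for col in zip(*scores)]

print(totals, summary, question_means)
```

The same grades can tell different stories depending on which statistic is reported, which is why it helps to look at several of them side by side.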
Steps to Consider
1. Frame the problem: Understand the use case
2. Understand the data: Exploratory data analysis
3. Extract features: What parts of the data are important to you…
4. Model the data and analyze: How do we plan to get meaning from
the data
5. Design, code and experiment: Use tools to clean, extract, plot, view
6. Present and test results: Two types of clients: humans and
systems
7. Iterate: Go back to any of the steps based on the insights!
1. Frame the Problem
● Have a standard use case format (What, why, how, stakeholders, data
in, info out, challenges, limitations, scope etc.)
● Refer to your software engineering course
● Statement of work (SOW): clearly state what you will accomplish
2. Understand the Data
● Data represents the traces of real-world processes
○ What traces we collect depends on the sampling methods
○ You build models to understand the data and extract meaning and
information from the data: statistical inference
● Two sources of randomness and uncertainty
○ The process that generates data is random
○ The sampling process itself is random
● Your mindset should be “statistical thinking in the age of big data”
○ Combine a statistical approach with big data
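The two sources of randomness can be seen in a small simulation. This is a sketch under assumptions: the normal distribution, its parameters, and the sample sizes are arbitrary choices for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Source 1: the process that generates the data is random
# (here, a hypothetical noisy process centered on 50).
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Source 2: the sampling is random, so each sample of 100 records
# produces a slightly different estimate of the same quantity.
sample_means = [mean(rng.sample(population, 100)) for _ in range(5)]

print(round(mean(population), 2), [round(m, 2) for m in sample_means])
```

Each sample mean lands near the population mean but none of them match it exactly, which is the spread that statistical inference tries to quantify.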
Questions to ask
● How big is the data?
● Any outliers?
● Missing data? How to address it? (Clean our data…)
● Sparse or dense?
● Collision of identifiers in different sets of data
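Several of these questions can be answered with a few lines of exploratory code. The sketch below uses hypothetical `users` and `orders` records; the field names and the outlier rule (median absolute deviation with a cutoff of 5) are assumptions, one option among many.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
users = [
    {"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 29},
    {"id": 4, "age": 31}, {"id": 5, "age": 240},
]
orders = [{"id": 3, "total": 20.0}, {"id": 9, "total": 5.5}]

# How big is the data? How much of it is missing?
ages = [u["age"] for u in users if u["age"] is not None]
n_rows = len(users)
n_missing = n_rows - len(ages)

# Any outliers? Flag values far from the median, measured in units of
# the median absolute deviation (MAD).
med = median(ages)
mad = median(abs(a - med) for a in ages)
outliers = [a for a in ages if mad and abs(a - med) / mad > 5]

# Collision of identifiers: "id" is a user id in one set and an order id
# in the other, so equal values do not refer to the same thing.
colliding_ids = {u["id"] for u in users} & {o["id"] for o in orders}

print(n_rows, n_missing, outliers, colliding_ids)
```

Joining the two sets on `id` would silently pair user 3 with order 3, which is why identifier collisions are worth checking before any merge.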
New Kinds of Data
● Traditional: numerical, categorical, or binary
● Text: emails, tweets, NY Times articles
● Records: user-level data, time-stamped event data, JSON-formatted log
files
● Geo-based location data
● Network data (How do you sample and preserve network structure?)
● Sensor data
● Images
Uncertainty and Randomness
● A mathematical model for uncertainty and randomness is offered by probability
theory.
● A world/process is defined by one or more variables. The model of the world is
defined by a function:
○ Model = f(w) or f(x,y,z) (A multivariate function)
○ The function is unknown, model is unclear, at least initially. Typically our
task is to come up with the model, given the data.
● Uncertainty: due to a lack of knowledge (consider predicting the weather)
● Randomness: due to a lack of predictability (consider a die roll)
● Both can be expressed by probability theory
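The die roll makes a convenient illustration of how probability describes randomness: a single roll cannot be predicted, but the relative frequencies over many rolls settle near 1/6. A minimal sketch, assuming a fair six-sided die and an arbitrary number of rolls:

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulate many rolls of a fair six-sided die.
rolls = [rng.randint(1, 6) for _ in range(60_000)]

# Relative frequency of each face; probability theory says each should
# approach 1/6 as the number of rolls grows.
counts = Counter(rolls)
freq = {face: counts[face] / len(rolls) for face in range(1, 7)}

for face, p in freq.items():
    print(face, round(p, 4))
```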
Statistical Inference
● From the world -> collect data
● From the data -> capture the meaning through models or functions
● From the meaning -> devise statistical estimators for predicting things
about the world
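The world -> data -> model -> estimator chain can be sketched end to end. Everything specific here is an assumption for illustration: the "world" is a made-up linear process with noise, the model is a straight line, and the fit uses ordinary least squares.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the run is repeatable

# World -> data: a hypothetical process y = 2x + 1 plus noise.
# In practice the analyst does not know this function.
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

# Data -> meaning: assume the model y = a*x + b and estimate a and b
# by ordinary least squares.
x_bar, y_bar = mean(xs), mean(ys)
a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
b = y_bar - a * x_bar

# Meaning -> estimator: use the fitted model to predict new inputs.
def predict(x):
    return a * x + b

print(round(a, 3), round(b, 3), round(predict(5), 3))
```

The estimates come out close to, but not exactly, the true slope and intercept, because the fit only ever sees a noisy sample.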
Example from Doing Data Science (Ch. 2)
[Figure panels: Anon vs. Signed In; All Signed In Users; Signed In Users Under 18]
Looking at all users, there is no significant difference between
male and female click-through rate…
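A click-through rate comparison of this kind can be computed by grouping impressions and clicks per gender. The rows, field order, and the choice to restrict to signed-in users are assumptions for illustration, not the book's dataset.

```python
from collections import defaultdict

# Hypothetical rows: (gender, signed_in, impressions, clicks).
rows = [
    ("F", 1, 5, 1), ("M", 1, 4, 0), ("F", 1, 5, 0),
    ("M", 1, 6, 1), ("F", 0, 3, 0), ("M", 1, 10, 1),
]

# gender -> [total impressions, total clicks]
totals = defaultdict(lambda: [0, 0])
for gender, signed_in, impressions, clicks in rows:
    if signed_in:  # assumption: only signed-in users are compared
        totals[gender][0] += impressions
        totals[gender][1] += clicks

# Click-through rate = clicks / impressions, skipping empty groups.
ctr = {g: clicks / imps for g, (imps, clicks) in totals.items() if imps}

print(ctr)
```

Equal rates in a sample this small say nothing about significance; a claim like the one above needs a statistical test on the full data.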