
High Computing Intelligence

Data Strategy
From Last Class…
“In pioneer days they used oxen for heavy pulling; when one couldn’t budge a log,
they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but
for more systems of computers.” ~Grace Hopper
We have witnessed an explosion in algorithmic solutions
What you cannot achieve by an algorithm can often be achieved by more data
Big data, if analyzed right, gives you better answers
e.g., traditional prediction of flu vs. prediction of flu through “search” data [1]

[1] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant,
“Detecting influenza epidemics using search engine query data,” Nature, 2009.
Data Strategy
In this era of big data, what is your data strategy?
Essentially, how are you going to plan for the data challenge?
● It is not only about big data, but about data of all sizes and forms
● Collecting data from customers used to be an elaborate task
○ e.g., surveys
● Nowadays data is available in abundance
○ thanks to technological advances and social networks
● Data is also generated by many of your own business processes and
applications
Components of a Data Strategy
● Data integration
● Metadata
● Data modeling
● Organizational roles and responsibilities
● Performance and metrics
● Security and privacy
● Structured data management
● Unstructured data management
● Business intelligence
● Data analysis and visualization
● Tapping into social data
Data Strategy at a high level
● How will you collect data? Aggregate data? What are your sources?
(e.g., social media)
● How will you store your data? And where?
● How will you use the data? Analytics? Data mining? Pattern
recognition?
● How will you present or report the data to the stakeholders and
decision makers? Visualization?
Examples 1–3 with Exam Grades

[Spreadsheet slides, not reproduced here: scores for Questions 1–5 with per-student
totals, and the mean, median, and mode of each question, including two versions of
the mean.]
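The grade tables themselves are not reproduced here, but a minimal Python sketch (with made-up scores for one question) of the summary statistics those slides walk through:

```python
# Summary statistics for one exam question; grades are made up.
from statistics import mean, median, mode

grades = [55, 70, 70, 80, 85, 85, 85, 90, 95, 100]

print("total:", sum(grades))      # 815
print("mean:", mean(grades))      # 81.5
print("median:", median(grades))  # 85.0 (middle of the sorted scores)
print("mode:", mode(grades))      # 85  (most frequent score)
```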
Steps to Consider
1. Frame the problem: Understand the use case
2. Understand the data: Exploratory data analysis
3. Extract features: What parts of the data are important to you…
4. Model the data and analyze: How do we plan to get meaning from
the data
5. Design, code and experiment: Use tools to clean, extract, plot, view
6. Present and test results: two types of clients, humans and systems
7. Iterate: Go back to any of the steps based on the insights!
1. Frame the Problem
Frame the Problem
● Have a standard use case format (what, why, how, stakeholders, data
in, information out, challenges, limitations, scope, etc.)
● Refer to your software engineering course
● Statement of work (SOW): clearly state what you will accomplish
2. Understand the Data
Understand the Data
● Data represents the traces of real-world processes
○ Which traces we collect depends on the sampling method
○ You build models to understand the data and extract meaning and
information from the data: statistical inference
● Two sources of randomness and uncertainty
○ The process that generates data is random
○ The sampling process itself is random
● Your mind-set should be “statistical thinking in the age of big data”
○ Combine the statistical approach with big data
Questions to ask
● How big is the data?
● Any outliers?
● Missing data? How to address it? (Clean our data…)
● Sparse or dense?
● Do identifiers collide across different sets of data?
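A minimal pandas sketch of how these questions might be answered in practice; the file name and the `value` column are hypothetical:

```python
# Size, missing values, and a simple outlier check on a hypothetical dataset.
import pandas as pd

df = pd.read_csv("data.csv")     # hypothetical input file

print(df.shape)                  # how big is the data?
print(df.isna().sum())           # missing values per column
df = df.dropna()                 # one way to address them

# Flag outliers in a numeric column via the 1.5 * IQR rule
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers")
```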
New Kinds of Data
● Traditional: numerical, categorical, or binary
● Text: emails, tweets, NY Times articles
● Records: user-level data, time-stamped event data, JSON-formatted log
files
● Geo-based location data
● Network data (How do you sample and preserve network structure?)
● Sensor data
● Images
Uncertainty and Randomness
● A mathematical model for uncertainty and randomness is offered by probability
theory.
● A world/process is defined by one or more variables. The model of the world is
defined by a function:
○ Model = f(w) or f(x,y,z) (A multivariate function)
○ The function is unknown and the model is unclear, at least initially.
Typically our task is to come up with the model, given the data.
● Uncertainty is due to lack of knowledge: consider predicting the weather
● Randomness is due to lack of predictability: consider a die roll
● Both can be expressed by probability theory
Statistical Inference
● From the world -> collect data
● From the data -> capture the meaning through models or functions
● From the meaning -> devise statistical estimators for predicting things
about the world

Statistical inference: the development of procedures, methods, and theorems
that allow us to extract meaning and information from data that has been
generated by stochastic (random) processes
Population and Sample
● Population is complete set of traces/data points
○ e.g., US population: 314 million; world population: 7 billion
○ All voters, all things
● Sample is a subset of the complete set (or population): how we select
the sample introduces biases into the data
● See an example at http://www.sca.isr.umich.edu/
○ Here, out of the 314 million US population, 250,000 households form the
sample
● Population -> mathematical model -> sample
Population and Sample
● Example: Emails sent by people in the CSE dept. in a year.
○ Method 1: 1/10 of all emails over the year randomly chosen
○ Method 2: 1/10 of people randomly chosen; all their email over the year
● Both are reasonable sample selection methods for analysis.
● However, the estimates (probability distribution functions) of the emails
sent by a person will be different for the two samples.
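A small simulation (not from the slides) that illustrates the difference, assuming per-person yearly email counts follow a skewed lognormal distribution:

```python
# Contrast the two sampling methods on simulated email counts.
import random
from collections import Counter

random.seed(0)
# Hypothetical yearly email counts per person; a few people send far more
emails_per_person = [int(random.lognormvariate(5, 1)) + 1 for _ in range(1000)]

# Method 2: pick 1/10 of the people, keep all their email
method2 = random.sample(emails_per_person, 100)

# Method 1: pick 1/10 of all individual emails; heavy senders are
# overrepresented among the senders we observe
all_emails = [pid for pid, n in enumerate(emails_per_person) for _ in range(n)]
picked = random.sample(all_emails, len(all_emails) // 10)
per_sender = Counter(picked)
method1 = [10 * c for c in per_sender.values()]  # scale observed counts back up

print("mean emails/person, method 2:", sum(method2) / len(method2))
print("mean emails/person, method 1:", sum(method1) / len(method1))
# Method 1's estimate is biased upward: sampling emails rather than people
# weights each person by how much they send
```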
Big Data vs Statistical Inference
● Sample size N
○ For statistical inference N < All
○ For big data N == All
○ For some atypical big data analysis N == 1
■ World model through the eyes of a prolific Twitter user
■ Followers of Ashton Kutcher: if you analyze the Twitter data you may
get a world view from his point of view
Big Data Context
● Sampling is still a valid solution; it depends on your needs
○ For quick analysis, or inference purposes you don’t need all the data
○ At Google (the originator of many big data algorithms) people sample all the time.
● However, if you want to serve and render information in a UI, you cannot sample.
● For some DNA-based searches, you cannot sample.
● Just because you have an entire population does not mean there is no bias
○ Even taking, for example, the entirety of the data on Twitter, conclusions
cannot be extended beyond the population that uses Twitter
○ Another example is the tweets pre- and post-Hurricane Sandy
○ Think about Yelp
Exploratory Data Analysis (EDA)
● By doing EDA, you achieve two things to get you started:
○ You get an intuitive feel for the data
○ You can start to make a list of hypotheses
● EDA is the prototype phase of ML and other sophisticated approaches
● Basic tools of EDA are plots, graphs, and summary stats (a lot of histograms)...
● It is a method for “systematically” going through data: plotting distributions,
plotting time series, looking at pairwise relationships using scatter plots,
generating summary stats (e.g., mean, min, max, upper and lower quartiles),
and identifying outliers.
● EDA is done to understand big data before using expensive big data
methodology.
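A minimal EDA pass in pandas and matplotlib, assuming a hypothetical events.csv with timestamp and value columns:

```python
# Basic EDA: summary stats, a histogram, a time series, and pairwise plots.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("events.csv", parse_dates=["timestamp"])  # hypothetical file

print(df.describe())             # mean, min, max, quartiles per column

df["value"].hist(bins=50)        # distribution of one variable
plt.show()

df.set_index("timestamp")["value"].resample("D").mean().plot()  # daily time series
plt.show()

pd.plotting.scatter_matrix(df.select_dtypes("number"))  # pairwise relationships
plt.show()
```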
Example from Doing Data Science (Ch. 2)
Plotting the click-through rate for anonymous visitors to the site vs. those that are
signed in shows a higher average click-through rate for anonymous users.
[Plot: click-through rate, Anon vs. Signed In]
Example from Doing Data Science (Ch. 2)
Looking at all users, there is no significant difference between male and female
click-through rates…
…but restricting the population to those under 18 changes these assumptions.
[Plots: click-through rate by gender, for All Signed In Users and for Signed In
Users Under 18]
Example from Doing Data Science (Ch. 2)
● Previous plots were just from a single day
○ How does the data look across the whole month?
○ What if we look at the change across days?
● What are the sample sizes of each grouping?
● What are the outliers?
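One way these follow-up questions might be explored, sketched in pandas with hypothetical file and column names:

```python
# Extend the single-day view across a month: group by day and compare rates.
import pandas as pd

df = pd.read_csv("nyt_clicks.csv", parse_dates=["date"])  # hypothetical file

daily = df.groupby(df["date"].dt.date).agg(
    impressions=("impressions", "sum"),
    clicks=("clicks", "sum"),
    users=("user_id", "nunique"),   # sample size of each grouping
)
daily["ctr"] = daily["clicks"] / daily["impressions"]
print(daily.sort_values("ctr"))     # the extremes hint at outlier days
```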
Example from Data Science from Scratch (Ch. 5)
● Consider dividing up the data scientists you know into bins based on
which coast they are from, and finding the average number of friends
for each bin:

Coast        # of members   Avg. # of friends
West Coast   101            8.2
East Coast   103            6.5

It would appear that the West Coast scientists are, by this metric,
“friendlier”… however
Example from Data Science from Scratch (Ch. 5)
● If we also bin by degree, we see a different story. The East Coast has a
higher percentage of PhD members, but each East Coast bin is “friendlier”:

Coast        Degree   # of members   Avg. # of friends
West Coast   PhD      35             3.1
East Coast   PhD      70             3.2
West Coast   No PhD   66             10.9
East Coast   No PhD   33             13.4
Example from Data Science from Scratch (Ch. 5)
● From this we can conclude that degree may also be a relevant factor in
“friendliness”, and we can form new hypotheses.
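The tables above can be reproduced with a short pandas groupby, which shows the reversal (an instance of Simpson's paradox) directly:

```python
# Recompute the per-coast averages from the degree-binned rows on the slide.
import pandas as pd

rows = [
    ("West Coast", "PhD", 35, 3.1),
    ("East Coast", "PhD", 70, 3.2),
    ("West Coast", "No PhD", 66, 10.9),
    ("East Coast", "No PhD", 33, 13.4),
]
df = pd.DataFrame(rows, columns=["coast", "degree", "members", "avg_friends"])

# The overall per-coast average is the member-weighted average of the bins
df["total_friends"] = df["members"] * df["avg_friends"]
g = df.groupby("coast")[["total_friends", "members"]].sum()
print(g["total_friends"] / g["members"])
# East Coast ~6.5, West Coast ~8.2: the aggregate ranking reverses, even
# though the East Coast wins within every degree bin
```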
3. Extracting Features
Extract Features
● The data is clean and we’ve done EDA; now we need to extract what is useful
● Filter out only the important fields or features, say from a JSON file
● Often defined by the problem analysis (EDA) and the use case
○ Example: location and temperature are the only important data in a tweet
for a particular analysis
○ Consider the example from Doing Data Science (Ch. 2)
■ Depending on what you are trying to get, do you need information
about age? Gender? Neither?
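A minimal sketch of this kind of filtering, assuming a hypothetical tweet-like JSON record that carries a location and a temperature reading:

```python
# Keep only the fields the use case needs; drop everything else.
import json

# Hypothetical record; real tweet JSON has many more fields
raw = '{"id": 1, "text": "so hot out", "geo": {"lat": 40.7, "lon": -74.0}, "temp_f": 96}'
record = json.loads(raw)

features = {
    "location": (record["geo"]["lat"], record["geo"]["lon"]),
    "temperature": record["temp_f"],
}
print(features)
```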
4. Modeling
Modeling
● An abstraction of a real-world process
● Let’s say we have a data set with two columns, x and y, where y
depends on x. We could write (for example):
○ y = β1 + β2x (a linear relationship)
○ We do not know β1 or β2… we must find them
● How to build a model?
○ Unfortunately…that’s not always straightforward
● Probability distribution functions (pdfs) are the building blocks of
statistical models
Probability Distributions
● Normal, uniform, Cauchy, t-, F-, Chi-square, exponential, Weibull,
lognormal, ...
● These are known as continuous probability density functions
● Any random variable, x, can be assumed to have a probability
distribution p(x), which maps x to a non-negative real number
● To be a probability density function, p(x) must integrate to 1
○ This allows us to determine the probability of a certain outcome by
integrating the area under the curve
Probability Distributions
● We can also combine functions to serve as joint distributions or
conditional distributions:
○ p(x,y) is a multivariate function that gives the probability of both x
and y occurring (its integral over the whole plane must also be 1)
○ p(x|y) is a conditional distribution: the probability of x occurring, given
a particular value of y
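A small scipy.stats sketch of both properties, using a standard normal distribution:

```python
# A pdf integrates to 1, and probabilities are areas under the curve.
from scipy import stats
from scipy.integrate import quad

p = stats.norm(loc=0, scale=1)   # a standard normal pdf p(x)

area, _ = quad(p.pdf, -10, 10)   # total area under the pdf
print(area)                      # ~1.0

print(p.cdf(1) - p.cdf(-1))      # P(-1 < x < 1), ~0.683
```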
Fitting a Model
● Estimating the parameters of the model: what distribution to use,
what are the values of min, max, mean, stddev, coefficients for the
distribution, etc.
● This functionality is readily provided in tools like R and Python libraries
● It involves algorithms such as maximum likelihood estimation (MLE)
and optimization methods…
○ Example: y = β1 + β2x -> y = 7.2 + 4.5x
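A minimal sketch of recovering the example coefficients from simulated data; numpy's least-squares polyfit stands in here for the MLE/optimization machinery the slides mention:

```python
# Fit y = b1 + b2*x to noisy data generated with known coefficients.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 7.2 + 4.5 * x + rng.normal(0, 1, 100)   # true b1=7.2, b2=4.5, plus noise

beta2, beta1 = np.polyfit(x, y, deg=1)      # returns slope, then intercept
print(f"y = {beta1:.1f} + {beta2:.1f}*x")   # ~ y = 7.2 + 4.5*x
```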
5. Design, Code, Deploy
Design, code, deploy
● Design before you code: an important principle
● Code using best practices and software engineering principles
● Choose the right language and development environment
● Document within the code and outside it
● Clearly state the steps in deploying the code
● Provide troubleshooting tips
6. Present the Results
Present the Results
● Good annotated graphs and visuals are important for explaining the results
● Annotate using text, markup and markdown
● Extras: provide ability to interact with plots and assess what-if
conditions
● Explore
○ d3.js: https://d3js.org/
○ Tableau: https://www.tableau.com/academic
○ Python viz libraries (see Data Science from Scratch)
● And a lot of creativity! Do not underestimate this…you are the best
person to figure out how to present your results effectively.
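As a starting point, a small matplotlib sketch of an annotated plot over made-up data; interactive what-if views would come from tools like d3.js or Tableau:

```python
# An annotated time-series plot; the rates below are made up.
import matplotlib.pyplot as plt

days = list(range(1, 8))
ctr = [0.031, 0.029, 0.035, 0.052, 0.033, 0.030, 0.028]

fig, ax = plt.subplots()
ax.plot(days, ctr, marker="o")
ax.set_xlabel("Day")
ax.set_ylabel("Click-through rate")
ax.set_title("Daily click-through rate (simulated)")
ax.annotate("spike worth investigating", xy=(4, 0.052),
            xytext=(4.5, 0.045), arrowprops={"arrowstyle": "->"})
plt.show()
```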
7. Iterate
ITERATE!
● Iterate through any of the steps as warranted by the feedback and the
results
● The data science process is an iterative process
