Data Science and Data Analytics: Part B
DATA ANALYTICS
• Gartner projects that by 2015, 85% of Fortune 500
organizations will be unable to exploit big data for competitive
advantage. About 4.4 million jobs will be created around big
data (Baesens et al., 2003).
(Baesens, 2014)
DATA ANALYTICS
• Types of data sources and data elements
• Data collection
• Populations and samples of big data
• Data munging/wrangling
• Data pre-processing
• Visual data and exploratory statistical analysis
• Data storage and management of big data
TYPES OF DATA SOURCES AND DATA ELEMENTS
• Data are the key ingredient of analytics: garbage in, garbage out (GIGO)
TYPES OF DATA SOURCES
• Transaction data
• Structured
• Low-level
• Details of key characteristics
• Qualitative, expert-based data
• Common sense
• Business experience
TYPES OF DATA ELEMENTS
interval
METHODS OF DATA COLLECTION
SAMPLING OF BIG DATA
Big data is a term that describes the large volume of data – both structured
and unstructured – that inundates a business on a day-to-day basis. But it’s not
the amount of data that’s important. It’s what organizations do with the data
that matters. Big data can be analyzed for insights that lead to better decisions
and strategic business moves.
WHY SAMPLING OF BIG DATA
DATA PROCESSING
DATA MUNGING/WRANGLING
DEALING WITH MISSING VALUES (EXAMPLE)
Suggest a way to deal with the missing values of Records 1, 6 and 10.
DEALING WITH OUTLIERS
Valid observations Invalid Observations
? ?
? ?
Treatment
USING A HISTOGRAM FOR OUTLIERS DETECTION
USING A BOXPLOT FOR OUTLIERS DETECTION
A box plot represents three key quartiles of the data: the first quartile (25%
of the observations have a lower value), the median (50% of the observations
have a lower value), and the third quartile (75% of the observations have a
lower value).
The minimum and maximum values are then also added unless they are too
far away from the edges of the box.
Too far away is then quantified as more than 1.5 × the interquartile range
(IQR = Q3 − Q1) from the edges of the box.
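The 1.5 × IQR rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `iqr_outliers`, the parameter `k` and the sample ages are assumptions made for the example, and quartile conventions differ slightly between tools (here the default of `statistics.quantiles` is used).

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the edges of the box."""
    # quantiles(n=4) returns the three cut points: Q1, median, Q3
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up ages, for illustration only
ages = [22, 25, 27, 29, 31, 33, 35, 38, 41, 120]
print(iqr_outliers(ages))  # prints [120]
```

Whether a flagged value is a valid or an invalid observation still has to be decided before choosing a treatment.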
USING Z-SCORES FOR OUTLIERS DETECTION
• A z-score measures how many standard deviations, σ, an observation lies away
from the mean, μ.
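As a rough sketch, the z-score of an observation x is z = (x − μ) / σ. The code below computes it for every observation and flags those with |z| above a threshold. The threshold of 3 is a common rule of thumb and an assumption here (the slide does not state one), as are the function names and the sample values.

```python
import statistics

def z_scores(values):
    """Return (x - mu) / sigma for every observation x."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def z_outliers(values, threshold=3.0):
    """Return the observations whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Made-up values, for illustration only
values = [40, 42, 45, 47, 48, 50, 50, 51, 52, 53,
          54, 55, 55, 56, 58, 60, 61, 63, 65, 250]
print(z_outliers(values))  # prints [250]
```

Note that an extreme value inflates both μ and σ, so in very small samples no observation can reach a z-score of 3.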
DESCRIPTIVE ANALYTICS
• Statistical inference
• Association rules
• Sequence rules
• Segmentation
DESCRIPTIVE AND INFERENTIAL ANALYSIS
DATA LEAKAGE
• When the data you're using to train contains information about
what you're trying to predict.
• Introducing information about the target during training that
would not legitimately be available during actual use.
• Obvious examples:
• Including the label to be predicted as a feature
• Including test data with training data
• If your model performance is too good to be true, it probably is,
and the cause is likely "giveaway" features.
EXAMPLES OF DATA LEAKAGE
• Leakage in training data:
• Performing data preprocessing using parameters or results from
analyzing the entire dataset: normalizing and rescaling, detecting
and removing outliers, estimating missing values, feature selection.
• Time-series datasets: using records from the future when
computing features for the current prediction.
• Errors in data values/gathering or missing variable indicators (e.g.
the special value 999) can encode information about missing data
that reveals information
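To make the preprocessing case concrete, the sketch below contrasts a leaky and a leak-free way of standardising a feature. It is an illustration under stated assumptions: the helper names `fit_scaler` and `apply_scaler`, the 8/2 split and the numbers are all hypothetical. The point is only that scaling parameters should be estimated on the training split and then reused, unchanged, on the held-out data.

```python
import statistics

def fit_scaler(train_values):
    """Estimate scaling parameters (mean, standard deviation) from the given values."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def apply_scaler(values, mu, sigma):
    """Standardise values with previously estimated parameters."""
    return [(v - mu) / sigma for v in values]

# Hypothetical time-ordered feature: first 8 records for training, last 2 held out
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0, 40.0, 42.0]
train, test = series[:8], series[8:]

# Leaky: parameters come from the entire dataset, so the held-out
# records influence the features the model is trained on
leaky_mu, leaky_sigma = fit_scaler(series)

# Leak-free: parameters come from the training split only
mu, sigma = fit_scaler(train)
train_scaled = apply_scaler(train, mu, sigma)
test_scaled = apply_scaler(test, mu, sigma)

print(mu, round(leaky_mu, 2))  # the two means differ because of the held-out records
```

The same pattern applies to outlier removal, imputation of missing values and feature selection.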
https://www.kdnuggets.com/2017/07/4-types-data-analytics.html
1. Descriptive: What is happening?
• This is the most common form of analytics. In business, it gives the analyst a view of
key metrics and measures within the business.
• An example is a monthly profit and loss statement. Similarly, an
analyst could have data on a large population of customers. Understanding
demographic information about those customers (e.g. 30% of our customers are
self-employed) would be categorised as “descriptive analytics”. Effective
visualisation tools enhance the message of descriptive analytics.
2. Diagnostic: Why is it happening?
3. Predictive: What is likely to happen?
• Predictive analytics is all about forecasting. Whether it’s the likelihood of an event happening in
the future, forecasting a quantifiable amount or estimating a point in time at which something might
happen – these are all done through predictive models.
• Predictive models typically use a variety of variables to make the prediction. The
variability of the component data will have a relationship with what it is likely to predict (e.g. the
older a person, the more susceptible they are to a heart attack – we would say that age has a
linear correlation with heart-attack risk). These data are then compiled together into a score or
prediction.
• In a world of great uncertainty, being able to predict allows one to make better decisions.
Predictive models are among the most important models used across a number of fields.
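A very small sketch of how a variable is "compiled into a score": a straight line fitted by least squares, relating age to a risk score. Everything here is an assumption for illustration: the data points are made up, the risk scale is arbitrary, and a real predictive model would combine many variables.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data: age in years and an arbitrary risk score
ages = [30, 40, 50, 60, 70]
risk = [2.0, 4.0, 6.0, 8.0, 10.0]

slope, intercept = fit_line(ages, risk)

def predict(age):
    """Score a new person with the fitted line."""
    return intercept + slope * age

print(predict(55))  # predicted risk score for a 55-year-old
```

A positive slope corresponds to the statement that risk rises with age.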
4. Prescriptive: What do I need to do?
• The next step up in terms of value and complexity is the prescriptive model. The prescriptive
model uses an understanding of what has happened, why it has happened and a variety of
“what-might-happen” analyses to help the user determine the best course of action to take.
Prescriptive analysis typically involves not just one individual action but a host of related
actions.
• A good example of this is a traffic application helping you choose the best route home and taking
into account the distance of each route, the speed at which one can travel on each road and,
crucially, the current traffic constraints.
• Another example might be producing an exam time-table such that no students have clashing
schedules.
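The traffic example can be sketched as choosing the route with the lowest estimated travel time. This is a toy illustration: the `Route` fields, the way the traffic constraint is modelled (a fixed delay in minutes) and the three routes are all assumptions, not how any real navigation application works.

```python
from dataclasses import dataclass

@dataclass
class Route:
    name: str
    distance_km: float  # length of the route
    speed_kmh: float    # speed at which one can travel on it
    delay_min: float    # current traffic delay (hypothetical model)

def travel_minutes(route):
    """Estimated travel time: driving time plus the current traffic delay."""
    return route.distance_km / route.speed_kmh * 60 + route.delay_min

def best_route(routes):
    """Prescribe the route with the shortest estimated travel time."""
    return min(routes, key=travel_minutes)

# Made-up routes home
routes = [
    Route("motorway", 30, 100, 25),
    Route("ring road", 24, 60, 5),
    Route("city centre", 18, 30, 10),
]
print(best_route(routes).name)  # prints ring road
```

The shortest route (city centre) and the fastest road (motorway) both lose here once the traffic delay is taken into account.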
How big data analytics impacts your business’s success
• A well-known Forbes article by Louis Columbus reported that “51% of enterprises intend to invest more in
big data”. Visionary leaders know how important information and data will be in the digital age. It is
essential to understand and analyze every customer in order to predict what the next
user needs will be. Predicting what consumers want before they even know it delights them
and helps you provide an unforgettable service as well as build your brand name. Big data
analytics also contributes significantly to helping companies improve their business
performance.
• “Without big data, you are blind and deaf and in the middle of a freeway” – Geoffrey Moore
• 43 trillion gigabytes of data will be created by 2020, a 300-fold increase from 2005
• 6 billion people have cell phones, out of a world population of 7 billion
• More than 4 billion hours of video are watched on YouTube each month
• 400 million tweets are sent per day by 200+ million monthly active users
• Poor data quality costs the US economy around $3.1 trillion a year
• Big data analytics is projected to reach an annual growth rate of 40% in 2020
BIG DATA ANALYTICS CASE STUDIES
Walmart
• Walmart is one of the leading retailers in the world, with more than 245 million
customers visiting 10,900 stores and a workforce that is larger than the customer base of some
other retailers.
• Walmart uses data to provide product recommendations to users based on
which products were bought before or which products were bought together. The chief information
officer, Linda Dillman, examined sales data after Hurricane Charley to determine what would be
needed following the forecasted Hurricane Frances. As well as the predictable increase in sales of
flashlights and emergency equipment, the period saw an unexpected increase in demand for beer
and strawberry Pop-Tarts. This data was used to inform stocking decisions, and led to strong sales.
Sales of strawberry Pop-Tarts increased sevenfold ahead of a hurricane.
• Walmart collected valuable information about each individual consumer (what was bought, where
they live and what they are interested in) to then predict the customer’s behavior.
Apple
Apple’s products make people happy, and everyone is curious
about the company’s secrets: what processes lie behind all
those excellent product designs? Unsurprisingly, Apple uses big data
analytics to analyse its users’ behavior. Apple is interested in user
experience, and always thinks about how to produce tech
products that feel as logical and comfortable as possible to
their users. Big data analytics helps Apple with the “how, when & why”
it needs to build products and to determine what new features
should be added.
eBay & Amazon
• When you shop online on eBay or Amazon, you almost always see
some suggested products. For example, while I am buying a dress,
I will get suggestions to buy a handbag, a scarf, some
jewelry or even a pair of shoes to complete the outfit.
It’s a sort of inspiration for me and, who knows, I might buy more products from
them.
• In this case, they study customers’ behavior by analyzing data on
the hobbies and habits customers showed when they shopped at their stores
previously. This enables them to make a sensible suggestion to a specific person,
based on past behavior.
LinkedIn
• LinkedIn has more than 400 million members in over 200 countries and territories.
There are 3 million company pages, 2 new members join LinkedIn per second and
there are more than 100 million unique monthly visitors. Total revenue
grew by 35% year-on-year to $861 million in Q1 2016.
• Big data analytics is a key tool that helps LinkedIn create great features on its
platform. LinkedIn offers useful suggestions: people you
may know, skill endorsements, jobs you may be interested in, news feed updates,
who has viewed your LinkedIn profile, etc.
• These features increase the value of the platform for its users, thereby increasing
their satisfaction and interest. This has helped LinkedIn build its success.