
CHAPTER 2

DATA SCIENCE AND DATA ANALYTICS
PART B

1
DATA ANALYTICS
• Gartner projects that by 2015, 85% of Fortune 500 organizations will be unable to exploit big data for competitive advantage. About 4.4 million jobs will be created around big data (Baesens et al., 2003).

• A main obstacle to fully harnessing the power of big data using analytics is the lack of skilled resources and the "data scientist" talent required to exploit big data.

2
DATA ANALYTICS

Analytics is a term that is often used interchangeably with data science, data mining, knowledge discovery, and others.

3
DATA ANALYTICS

4
(Baesens, 2014)
DATA ANALYTICS

5
DATA ANALYTICS
• Types of data sources and data elements
• Data collection
• Populations and samples of big data
• Data munging/wrangling
• Data pre-processing
• Visual data and exploratory statistical analysis
• Data storage and management of big data

6
TYPES OF DATA SOURCES AND DATA ELEMENTS

• Data is the key ingredient of analytics
• Garbage in, garbage out (GIGO)
• Data is typically messy
• The more data, the better

7
TYPES OF DATA SOURCES

• Transaction data: structured, low-level, details of key characteristics
• Public data sources: macroeconomic data, social media data
• Unstructured data: text documents, multimedia content; require extensive pre-processing
• Qualitative, expert-based data: common sense, business experience

8
TYPES OF DATA ELEMENTS

• Continuous: defined on an interval, with or without limits (e.g. sales)
• Categorical:
  • Nominal: ordering is not meaningful (e.g. job title)
  • Ordinal: ordering is meaningful (e.g. age group)
  • Binary: 2 values only (e.g. gender)

9
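The element types above can be made explicit in code. The following pandas sketch is illustrative only; the column names and values are hypothetical and not from the slides.

```python
import pandas as pd

# A hypothetical customer table with one column per element type.
df = pd.DataFrame({
    "sales": [120.5, 87.0, 240.3],                 # continuous
    "job_title": ["clerk", "manager", "analyst"],  # nominal: ordering is not meaningful
    "age_group": ["18-29", "40-49", "30-39"],      # ordinal: ordering is meaningful
    "gender": ["F", "M", "F"],                     # binary: 2 values only
})

# Nominal and binary columns become unordered categoricals.
df["job_title"] = df["job_title"].astype("category")
df["gender"] = df["gender"].astype("category")

# The ordinal column becomes an ordered categorical so comparisons respect the order.
df["age_group"] = pd.Categorical(
    df["age_group"], categories=["18-29", "30-39", "40-49"], ordered=True
)

print(df.dtypes)
```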
DATA COLLECTION

The activity of collecting information that can be used to find out about a particular subject.
(Cambridge Dictionary)

Data collection is the process of gathering and measuring information on targeted variables in an established systematic fashion, which then enables one to answer relevant questions and evaluate outcomes.
(Wikipedia)

10
METHODS OF DATA COLLECTION

11
SAMPLING OF BIG DATA

Big data is a term that describes the large volume of data – both structured
and unstructured – that inundates a business on a day-to-day basis. But it’s not
the amount of data that’s important. It’s what organizations do with the data
that matters. Big data can be analyzed for insights that lead to better decisions
and strategic business moves.

12
WHY SAMPLING OF BIG DATA

Storing the full data may not be feasible
• Your application may not keep everything

Working with the data in full is inconvenient
• Is there really a need to analyze the full data?

Working with a compact summary is faster
• Would you rather explore data on a PC than on a supercomputer/cluster? (See the sampling sketch below.)
13
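As an illustration of working with a compact summary rather than the full data, the following is a minimal reservoir-sampling sketch in Python. It is not part of the slides; the stream, sample size and seed are made up for the example.

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of size k from an arbitrarily long stream."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # replace an element with probability k / (i + 1)
            if j < k:
                sample[j] = item
    return sample

# Example: a 10-record sample from a million-record "stream".
print(reservoir_sample(range(1_000_000), k=10))
```

Because each record is inspected once and only k records are kept in memory, this works even when storing the full data is not feasible.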
NEXT

DATA PROCESSING

14
DATA MUNGING/WRANGLING

RAW DATA (messy / noisy data) => CLEANED DATA (data that can be analyzed)

The process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data.

15
DEALING WITH MISSING VALUES

Why are values missing?

• Nonapplicable information: no information exists, e.g. for students who withdrew.
• Undisclosed information: private data, such as salary, may not be disclosed.
• Error: a human factor (a question was skipped by the respondent, a typo) or a technical issue.
• Outlier values: values that have to be treated as missing, e.g. extremely low or extremely high values.

16


DEALING WITH MISSING VALUES

Replace (Impute)
• Replace the missing values with a known value (e.g. the mean, median, or mode)

Delete
• Delete observations or variables with many missing values, as the data may not be meaningful

Keep
• Keep missing values when they are meaningful; treat "missing" as a separate category (a sketch follows below)

17
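The three strategies above can be illustrated with a minimal pandas sketch. This is not from the slides; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [2500, np.nan, 4100, 3300, np.nan],
    "age": [34, 29, np.nan, 51, 45],
    "marital_status": ["single", None, "married", "single", None],
})

# Replace (impute): fill numeric gaps with a known value such as the median.
df["income"] = df["income"].fillna(df["income"].median())

# Delete: drop observations that still have too many missing values.
df = df.dropna(thresh=2)                      # keep rows with at least 2 non-missing values

# Keep: treat "missing" as its own category when the missingness is meaningful.
df["marital_status"] = df["marital_status"].fillna("unknown")

print(df)
```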
CHOOSING THE RIGHT WAY TO DEAL WITH MISSING VALUES

Statistical test
• Test whether the missing information is related to the target variable (a sketch follows below)
• If yes, then choose Keep

Observe the number of available observations
• If the number of available observations is high, then consider Delete
• Otherwise, consider Impute
18
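One way to carry out the statistical test mentioned above is a chi-square test of independence between a missing-value indicator and a categorical target. The sketch below uses SciPy and made-up data; the column names are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: is "income is missing" related to loan default (the target)?
df = pd.DataFrame({
    "income":  [2500, None, 4100, None, 3300, None, 2800, 3900],
    "default": ["no", "yes", "no", "yes", "no", "yes", "no", "no"],
})

missing_flag = df["income"].isna()
table = pd.crosstab(missing_flag, df["default"])     # 2x2 contingency table

chi2, p_value, dof, expected = chi2_contingency(table)
# A small p-value suggests the missingness is related to the target,
# which points towards the "keep as a separate category" strategy.
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
```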
DEALING WITH MISSING VALUES (EXAMPLE)

Suggest a way to deal with the missing values of Records 1, 6 and 10.

19
DEALING WITH OUTLIERS

Valid observations, e.g. the salary of a CEO is $1 million. Other examples: ?
Invalid observations, e.g. age = 300 years. Other examples: ?

Provide some examples of each type of outlier.

20
MULTIVARIATE OUTLIERS

Multivariate outliers are observations that are outlying in multiple dimensions (e.g. age and income).

21
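A common way to detect such observations is the Mahalanobis distance, which measures how far a point lies from the centre of the data while accounting for the correlation between dimensions. The sketch below is illustrative only and uses simulated data; it is not taken from the slides.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # hypothetical standardized age/income data
X = np.vstack([X, [[8.0, 9.0]]])           # one point that is outlying in both dimensions

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
# Squared Mahalanobis distance of every observation from the centre of the data.
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Under approximate normality, d2 follows a chi-square distribution with 2 degrees
# of freedom, so observations beyond an extreme percentile can be flagged.
outliers = d2 > chi2.ppf(0.999, df=2)
print("outlier indices:", np.where(outliers)[0])
```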
STEPS TO DEAL WITH OUTLIERS

Detection
• For univariate outliers: calculate the min and max; use visual tools, e.g. histograms and boxplots; use z-scores
• For multivariate outliers: regression

Treatment

22
USING A HISTOGRAM FOR OUTLIER DETECTION

23
USING A BOXPLOT FOR OUTLIER DETECTION

A box plot represents three key quartiles of the data: the first quartile (25% of the observations have a lower value), the median (50% of the observations have a lower value), and the third quartile (75% of the observations have a lower value).

The minimum and maximum values are then also added, unless they are too far away from the edges of the box. "Too far away" is quantified as more than 1.5 × IQR beyond the box edges, where IQR = Q3 − Q1 is the interquartile range; in other words, observations below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are drawn as individual outlier points.

24
USING Z-SCORES FOR OUTLIER DETECTION

• The z-score measures how many standard deviations, σ, an observation lies away from the mean, μ: z = (x − μ) / σ.

• A practical rule of thumb then defines an observation as an outlier when the absolute value of its z-score, |z|, is bigger than 3. Note that the z-score relies on the normal distribution.

25
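A minimal NumPy sketch of the |z| > 3 rule of thumb follows. The income values are made up for the example and are not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical incomes: 20 ordinary values plus one suspiciously large entry.
income = np.append(rng.normal(3000, 200, size=20), 25000.0)

z = (income - income.mean()) / income.std()   # standard deviations away from the mean
outliers = np.abs(z) > 3                      # practical rule of thumb from the slide

print("outlier indices:", np.where(outliers)[0])   # flags the last observation
```

The same data could also be screened with the boxplot rule from the previous slide, flagging values outside Q1 − 1.5 × IQR and Q3 + 1.5 × IQR.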
DESCRIPTIVE ANALYTICS

• Statistical inference
• Association rules
• Sequence rules
• Segmentation

26
DESCRIPTIVE AND INFERENTIAL ANALYSIS

Descriptive analysis
• Descriptive statistical analysis limits generalization to the particular group of individuals observed. That is:
• No conclusions are extended beyond this group.
• Any similarity to those outside the group cannot be assumed.
• The data describe one group and that group only.

Inferential analysis
• Inferential analysis selects a small group (sample) out of a larger group (population), and the findings are applied to the larger group. It is used to estimate a parameter, the corresponding value in the population from which the sample is selected.
• It is necessary to carefully select the sample, or the inferences may not apply to the population.

27
DATA LEAKAGE
• When the data you're using to train contains information about
what you're trying to predict.
• Introducing information about the target during training that
would not legitimately be available during actual use.

• Obvious examples:
• Including the label to be predicted as a feature
• Including test data with training data
• If your model's performance is too good to be true, it probably is, and is likely due to "giveaway" features.

28
EXAMPLES OF DATA LEAKAGE

• Leakage in training data:
  • Performing data preprocessing using parameters or results from analyzing the entire dataset: normalizing and rescaling, detecting and removing outliers, estimating missing values, feature selection.
  • Time-series datasets: using records from the future when computing features for the current prediction.
  • Errors in data values/gathering, or missing-variable indicators (e.g. the special value 999), can encode information about missing data that reveals information about the future.

29


AVOIDING DATA LEAKAGE

• Perform data preparation within each cross-validation fold separately (a sketch follows this slide)
  • Scale/normalize data, perform feature selection, etc. within each fold separately, not using the entire dataset.
  • For any such parameters estimated on the training data, you must use those same parameters to prepare data on the corresponding held-out test fold.

• With time-series data, use a timestamp cutoff
  • The cutoff value is set to the specific time point where prediction is to occur, using current and past records.
  • Using a cutoff time makes sure you aren't accessing any data records that were gathered after the prediction time, i.e. in the future.
30
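The first point can be illustrated with a scikit-learn Pipeline: because scaling is a step inside the pipeline, cross-validation fits the scaler on each training fold only and reuses those parameters on the corresponding held-out fold. This is a hedged sketch on synthetic data, not the course's own code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Scaling happens inside the pipeline, so no statistics from the held-out fold
# leak into the preprocessing of the training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print("mean CV accuracy:", np.round(scores.mean(), 3))
```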
TYPES OF DATA ANALYTICS

31

https://www.kdnuggets.com/2017/07/4-types-data-analytics.html
1. Descriptive: What is happening?

• This is the most common of all forms. In business, it provides the analyst with a view of key metrics and measures within the business.
• An example of this could be a monthly profit and loss statement. Similarly, an analyst could have data on a large population of customers. Understanding demographic information on their customers (e.g. 30% of our customers are self-employed) would be categorised as "descriptive analytics". Utilising effective visualisation tools enhances the message of descriptive analytics.
32
2. Diagnostic: Why is it happening?

• Diagnostic analytics is the next step up in complexity from descriptive analytics. On assessment of the descriptive data, diagnostic analytical tools will empower an analyst to drill down and, in so doing, isolate the root cause of a problem.
• Well-designed business information (BI) dashboards incorporating reading of time-series data (i.e. data over multiple successive points in time) and featuring filters and drill-down capability allow for such analysis.

33
3. Predictive: What is likely to happen?

• Predictive analytics is all about forecasting. Whether it’s the likelihood of an event happening in
future, forecasting a quantifiable amount or estimating a point in time at which something might
happen – these are all done through predictive models.
• Predictive models typically utilise a variety of variable data to make the prediction. The variability of the component data will have a relationship with what it is likely to predict (e.g. the older a person, the more susceptible they are to a heart attack – we would say that age has a linear correlation with heart-attack risk). These data are then compiled together into a score or prediction.
• In a world of great uncertainty, being able to predict allows one to make better decisions.
Predictive models are some of the most important utilised across a number of fields.
34
4. Prescriptive: What do I need to do?

• The next step up in terms of value and complexity is the prescriptive model. The prescriptive model utilises an understanding of what has happened, why it has happened and a variety of "what-might-happen" analysis to help the user determine the best course of action to take. Prescriptive analysis typically involves not just one individual action, but in fact a host of related actions.
• A good example of this is a traffic application helping you choose the best route home and taking
into account the distance of each route, the speed at which one can travel on each road and,
crucially, the current traffic constraints.
• Another example might be producing an exam time-table such that no students have clashing
schedules.
35
How big data analytics impacts your business' success

• A famous article on Forbes by Louis Columbus said that "51% of enterprises intend to invest more in big data". Visionary leaders know how important information/data will be in the digital age. It is essential to approach and analyze every customer in order to predict their next needs. Predicting what consumers want before they even know it makes them quite excited and helps you provide an unforgettable service as well as build your brand name. Big data analytics also contributes significantly to helping companies improve their business performance.

• "Without big data, you are blind and deaf and in the middle of a freeway" – Geoffrey Moore

06/07/2016, in Technology Industry Trends, by Dieu Le

36


https://apiumhub.com/tech-blog-barcelona/impact-big-data-analytics/
Key statistics about big data that you shouldn't ignore

• 91% of marketers believe that successful brands use customer data to decide marketing strategies
• 49% increase in revenue growth for companies that invested in analytics versus those that did not
• 35% of marketers say that data has improved their customer engagement through personalization
• Retailers who leverage the full power of big data analytics could increase their operating margins by as much as 60%
• 60% of the professionals asked feel that data is generating revenue within their organizations
• 83% of the professionals say that data analysis makes existing services and products more profitable. Asia is leading the way, where 63% said they are routinely generating value from data. In the US, the figure was 58%, and in Europe, 56%

37
Impressive statistics about the 4 V's of big data analytics

• Volume, variety, velocity & veracity

• 43 trillion gigabytes of data will be created by 2020, an increase of 300 times from 2005
• 6 billion people have cell phones, out of a world population of 7 billion
• More than 4 billion hours of video are watched on YouTube each month
• 400 million tweets are sent per day by 200+ million monthly active users
• Poor data quality costs the US economy around $3.1 trillion a year
• The big data analytics market is growing at an annual rate of around 40% through 2020

38
BIG DATA ANALYTICS CASE STUDIES

Walmart
• As we all know, Walmart is one of the best retailers in the world, with more than 245 million customers visiting 10,900 stores and an employee count that is actually larger than some retailers' customer counts.
• Walmart uses data to provide product recommendations to users based on which products were bought before or which products were bought together. The chief information officer, Linda Dillman, examined sales data after Hurricane Charley to determine what would be needed following the forecasted Hurricane Frances. As well as the predictable increase in sales of flashlights and emergency equipment, the period saw an unexpected increase in demand for beer and strawberry Pop-Tarts. This data was used to inform stocking decisions and led to strong sales: strawberry Pop-Tarts sales increased by 7 times before a hurricane.
• Walmart collected valuable information from each individual consumer, such as what was bought, where they live and what they are interested in, to then predict the customer's behavior.

39
Apple
Apple's products make people happy; it's a fact. Everyone is curious about their secrets and wants to know the processes behind all those excellent product designs. It is inevitable that Apple uses big data analytics to analyse its users' behaviors. Apple is interested in user experience, and they always think about how they can produce tech products that provide the most logical and comfortable feeling to their users. Big data analytics helps Apple with the "how, when & why" they need to build products and determine what new features should be added.

40
eBay & Amazon
• When we shop online on eBay or on Amazon, we can almost always see some suggested products. For example, when I want to buy a dress, while I am purchasing it, I will get suggestions to buy a handbag, a scarf, some jewelry or even a pair of shoes that will help me be more satisfied with the outfit. It's a sort of inspiration for me and, who knows, I might buy more products from them.
• In this case, they studied the customer's behavior by researching data and analyzing the hobbies and habits of customers when they shopped at their stores in the past. This enables them to give a rational suggestion to a specific person, based on past behavior.

41
LinkedIn
• LinkedIn has more than 400 million members in over 200 countries and territories. There are 3 million company pages, 2 new members join LinkedIn per second, and there are more than 100 million unique monthly visitors. Total revenue advanced by 35% year-on-year to $861M in Q1 of 2016.
• Big data analytics is a key tool that helps LinkedIn create great features on its platform. Let's see how wise it is! LinkedIn suggests amazing things: people you may know, skill endorsements, jobs you may be interested in, news feed updates, who has viewed my LinkedIn profile, etc.
• These features increase the value of the platform for users, increasing their satisfaction and interest, and have helped LinkedIn build its success.

42
