Lecture 3
Business Intelligence (IS401)
Prepared By:
Dr. Heba Askr
First Term 2023-2024
Business Intelligence, Analytics, Data
Science, and AI
Fifth Edition
Chapter 3
Descriptive Analytics I
Copyright © 2024, 2018, 2014, 2011 Pearson Education, Inc. All Rights Reserved
Learning Objectives (1 of 2)
3.1 Understand the nature of data as it relates to business
intelligence (BI) and analytics
3.2 Learn the methods used to make real-world data
analytics ready
3.3 Learn what Big Data is and how it is changing the world
of analytics
3.4 Understand the motivation for and business drivers of Big
Data analytics
3.5 Become familiar with the wide range of enabling
technologies for Big Data analytics
Learning Objectives (2 of 2)
3.6 Learn about Hadoop, Spark, MapReduce, and NoSQL
as they relate to Big Data analytics
3.7 Become familiar with the Data for Good concept
3.8 Understand the need for and appreciate the capabilities
of stream analytics
3.9 Learn about the applications of stream analytics
3.10 Describe statistical modeling and its relationship to
business analytics
3.11 Learn about descriptive and inferential statistics
The Nature of Data (1 of 2)
• Data: a collection of facts
– usually obtained as the result of experiences,
observations, or experiments
• Data may consist of numbers, words, images, …
• Data is the lowest level of abstraction (from which
information and knowledge are derived)
• Data is the source for information and knowledge
• Data quality and data integrity → critical to analytics
The Nature of Data (2 of 2)
Figure 3.1 A Data to Knowledge Continuum
Metrics for Analytics Ready Data
• Data source reliability
• Data content accuracy
• Data accessibility
• Data security and data privacy
• Data richness
• Data consistency
• Data currency/data timeliness
• Data granularity (data precision)
• Data validity and data relevancy
A Simple Taxonomy of Data (1 of 2)
• Data (datum—singular form of data) = facts
• Structured data
– Targeted for computers to process
– Numeric versus nominal
• Unstructured/textual data
– Targeted for humans to process/digest
• Semi-structured data?
– XML, HTML, log files, etc.
• Data taxonomy …
A Simple Taxonomy of Data (2 of 2)
Figure 3.2 A Simple Taxonomy of Data
The Art and Science of Data
Preprocessing (1 of 2)
• Real-world data is dirty, misaligned, overly complex,
and inaccurate
– Not ready for analytics!
• Readying the data for analytics is needed
– Data preprocessing
▪ Data consolidation (data unification)
▪ Data cleaning
▪ Data transformation
▪ Data reduction
• Art – it develops and improves with experience
The Art and Science of Data
Preprocessing (2 of 2)
Figure 3.3 Data Preprocessing Steps
• The process is usually iterative, with
many feedback loops and rework
• Data reduction
1. Variables
▪ Dimensional reduction
▪ Variable selection
2. Cases/samples
▪ Sampling
▪ Balancing / stratification
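Reducing the number of cases by stratified sampling can be sketched as follows. This is a minimal illustration, not a procedure from the textbook: the function name `stratified_sample`, the 10% fraction, the fixed seed, and the toy `records` list are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Draw the same fraction from every class so class proportions are preserved."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    strata = defaultdict(list)
    for record in records:
        strata[label_of(record)].append(record)
    sample = []
    for rows in strata.values():
        k = max(1, round(len(rows) * fraction))  # keep at least one record per class
        sample.extend(rng.sample(rows, k))
    return sample

# Hypothetical data set: 80 "no" cases and 20 "yes" cases.
records = [(i, "no") for i in range(80)] + [(i, "yes") for i in range(80, 100)]
sample = stratified_sample(records, label_of=lambda r: r[1], fraction=0.1)
sample_counts = Counter(label for _, label in sample)
```

The sample keeps the original 80/20 class ratio, which a plain random sample of ten records would not guarantee.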
Data Preprocessing Tasks and
Methods (1 of 3)
Table 3.1 A Summary of Data Preprocessing Tasks and
Potential Methods
Main Task | Subtasks | Popular Methods
Data consolidation | Access and collect the data | SQL queries, software agents, Web services.
Data consolidation | Select and filter the data | Domain expertise, SQL queries, statistical tests.
Data consolidation | Integrate and unify the data | SQL queries, domain expertise, ontology-driven data mapping.
Data cleaning | Handle missing values in the data | Fill in missing values (imputations) with the most appropriate values (mean, median, min/max, mode, etc.); recode the missing values with a constant such as “ML”; remove the record of the missing value; do nothing.
Data cleaning | Identify and reduce noise in the data | Identify the outliers in data with simple statistical techniques (such as averages and standard deviations) or with cluster analysis; once identified, either remove the outliers or smooth them by using binning, regression, or simple averages.
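The two data cleaning rows can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `ages` list is made up, `None` is assumed to mark a missing value, and the cutoff of 2 standard deviations is one common rule of thumb rather than a value given in the table.

```python
from statistics import mean, median, stdev

# Hypothetical sample: None marks a missing value, 95 is a suspicious entry.
ages = [23, 25, None, 27, 24, 26, 95, 25, None, 24]

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, k=2.0):
    """Return the values lying more than k standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]

filled = impute_median(ages)                        # imputation step
outliers = flag_outliers(filled)                    # noise identification step
cleaned = [v for v in filled if v not in outliers]  # removal (smoothing is the alternative)
```

The median is used for imputation here because, unlike the mean, it is not pulled upward by the extreme value.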
Data Preprocessing Tasks and
Methods (2 of 3)
Table 3.1 A Summary of Data Preprocessing Tasks and
Potential Methods
Main Task | Subtasks | Popular Methods
Data cleaning | Find and eliminate erroneous data | Identify the erroneous values in data (other than outliers), such as odd values, inconsistent class labels, odd distributions; once identified, use domain expertise to correct the values or remove the records holding the erroneous values.
Data transformation | Normalize the data | Reduce the range of values in each numerically valued variable to a standard range (e.g., 0 to 1 or -1 to +1) by using a variety of normalization or scaling techniques.
Data transformation | Discretize or aggregate the data | If needed, convert the numeric variables into discrete representations using range- or frequency-based binning techniques; for categorical variables, reduce the number of values by applying proper concept hierarchies.
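Normalization and range-based binning can be illustrated with a short sketch. Min-max scaling and equal-width bins are only one choice among the "variety of techniques" the table mentions; the function names and the `incomes` values are assumptions made for the example.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale values linearly so the smallest maps to lo and the largest to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]  # constant column: nothing to spread out
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using equal-width ranges."""
    vmin, vmax = min(values), max(values)
    width = (vmax - vmin) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so it is clamped into the last bin.
    return [min(int((v - vmin) / width), n_bins - 1) for v in values]

incomes = [20, 35, 50, 65, 80]  # hypothetical values, in thousands
scaled = min_max_scale(incomes)
bins = equal_width_bins(incomes, 3)
```

Passing `lo=-1.0, hi=1.0` gives the -1 to +1 range mentioned in the table.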
Data Preprocessing Tasks and
Methods (3 of 3)
Table 3.1 A Summary of Data Preprocessing Tasks and
Potential Methods
Main Task | Subtasks | Popular Methods
Data transformation | Construct new attributes | Derive new and more informative variables from the existing ones using a wide range of mathematical functions (as simple as addition and multiplication or as complex as a hybrid combination of log transformations).
Data reduction | Reduce number of attributes | Principal component analysis, independent component analysis, chi-square testing, correlation analysis, and decision tree induction.
Data reduction | Reduce number of records | Random sampling, stratified sampling, expert-knowledge-driven purposeful sampling.
Data reduction | Balance skewed data | Oversample the less represented or undersample the more represented classes.
Big Data - Definition and Concepts (1 of 2)
• Big Data means different things to people with different
backgrounds and interests
• Traditionally, “Big Data” = massive volumes of data
– For example, the volume of data at CERN, NASA, Google, …
• Where does the Big Data come from?
– Everywhere! Web logs, RFID, GPS systems, sensor
networks, social networks, Internet-based text
documents, Internet search indexes, call detail
records, astronomy, atmospheric science, biology,
genomics, nuclear physics, biochemical experiments,
medical records, scientific research, military
surveillance, multimedia archives, …
Big Data - Definition and Concepts (2 of 2)
• Big Data is a misnomer!
• Big Data is more than just “big”
• The Vs that define Big Data
– Volume
– Variety
– Velocity
– Veracity (reliability)
– Variability
– Value
– …
Fundamentals of Big Data Analytics
• Big Data by itself, regardless of the size, type, or speed, is
worthless
• Big Data + “big” analytics = value
• With the value proposition, Big Data also brought about
big challenges
– Effectively and efficiently capturing, storing, and
analyzing Big Data
– New breed of technologies needed (developed or
purchased or hired or outsourced …)
Critical Success Factors for Big Data
Analytics (1 of 2)
• A clear business need (alignment with the vision and the
strategy)
• Strong, committed sponsorship (executive champion)
• Alignment between the business and IT strategy
• A fact-based decision-making culture
• A strong data infrastructure
• The right analytics tools
• The right people with the right skills
Critical Success Factors for Big Data
Analytics (2 of 2)
Enablers of Big Data Analytics
• In-memory analytics
– Storing and processing the complete data set in RAM
• In-database analytics
– Placing analytic procedures close to where data is stored
• Grid computing & MPP
– Use of many machines and processors in parallel (MPP -
massively parallel processing)
• Appliances (devices)
– Combining hardware, software, and storage in a single unit
for performance and scalability
Challenges of Big Data Analytics
• Data volume
– The ability to capture, store, and process the huge
volume of data in a timely manner
• Data integration
– The ability to combine data quickly and at reasonable
cost
• Processing capabilities
– The ability to process the data quickly, as it is captured
(i.e., stream analytics)
• Data governance (… security, privacy, access)
• Skill availability (… data scientist)
Business Problems Addressed by Big
Data Analytics (stop here)
• Process efficiency and cost reduction
• Brand management
• Revenue maximization, cross-selling/up-selling
• Enhanced customer experience
• Churn identification, customer recruiting
• Improved customer service
• Identifying new products and market opportunities
• Risk management
• …
Big Data Technologies
• MapReduce …
• Hadoop …
• Hive
• Pig
• HBase
• Flume
• Oozie
• Ambari
• …
Big Data Technologies--Hadoop (1 of 3)
Big Data Technologies--Hadoop (2 of 3)
• How Does Hadoop Work?
– Access unstructured and semi-structured data (for example,
log files, social media feeds, other data sources)
– Break the data up into “parts,” which are then loaded into a
file system made up of multiple nodes running on
commodity hardware using HDFS
– Each “part” is replicated multiple times and loaded into the
file system for replication and failsafe processing
– A node acts as the Facilitator and another as Job Tracker
– Jobs are distributed to the clients, and once completed the
results are collected and aggregated using MapReduce
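The "break into parts and replicate" steps can be pictured with a toy simulation. This is not the HDFS API and not how a real cluster decides placement; the part size, the node names, the round-robin placement rule, and the replication factor of 3 are all assumptions chosen to keep the sketch small.

```python
def split_into_parts(data, part_size):
    """Break the input into consecutive fixed-size parts (the last one may be shorter)."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

def place_replicas(n_parts, nodes, replication=3):
    """Assign every part to `replication` distinct nodes, round-robin (toy rule)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        part: [nodes[(part + r) % len(nodes)] for r in range(replication)]
        for part in range(n_parts)
    }

log_data = "0123456789" * 5  # 50 characters standing in for a large log file
nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical commodity machines
parts = split_into_parts(log_data, 16)
placement = place_replicas(len(parts), nodes)
```

Because every part lives on three different nodes, losing any single node still leaves two copies of each part available for processing.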
Big Data Technologies--Hadoop (3 of 3)
• Hadoop Technical Components
– Hadoop Distributed File System (HDFS)
– Name Node (primary facilitator)
– Secondary Node (backup to Name Node)
– Job Tracker
– Slave Nodes (the grunts of any Hadoop cluster)
– Additionally, the Hadoop ecosystem is made up of a
number of complementary sub-projects: NoSQL
(Cassandra, HBase), DW (Hive), …
▪ NoSQL = not only SQL
Big Data Technologies--MapReduce (1 of 2)
• MapReduce distributes the processing of very large multi-
structured data files across a large cluster of ordinary
machines/processors
• Goal - achieving high performance with “simple”
computers
• Developed and popularized by Google
• Good at processing and analyzing large volumes of multi-
structured data in a timely manner
• Example tasks: indexing the Web for search, graph
analysis, text analysis, machine learning, …
Big Data Technologies--MapReduce (2 of 2)
• How does MapReduce work?
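One way to picture the answer is the classic word-count example, simulated here in a single Python process. A real MapReduce job runs the map and reduce steps on many machines in parallel; the function names and the three input splits below are assumptions for illustration and do not correspond to any Hadoop API.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

# Each string stands for one split that would be handled by a different node.
splits = ["big data big analytics", "data science", "big value"]
mapped = [pair for split in splits for pair in map_phase(split)]
word_counts = reduce_phase(shuffle_phase(mapped))
```

The map step needs no knowledge of other splits, which is what allows it to be distributed across ordinary machines.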
Technology Insights 3.2
A Few Demystifying Facts about Hadoop
1. Hadoop consists of multiple products
2. Hadoop is open source but available from vendors, too
3. Hadoop is an ecosystem, not a single product
4. HDFS is a file system, not a DBMS
5. Hive resembles SQL but is not standard SQL
6. Hadoop and MapReduce are related but not the same
7. MapReduce provides control for analytics, not analytics
8. Hadoop is about data diversity, not just data volume
9. Hadoop complements a DW; it’s rarely a replacement
10. Hadoop enables many types of analytics, not just Web
analytics
Spark Versus Hadoop (1 of 2)
• Both of these open-source frameworks are developed by the
Apache Software Foundation (in 2004 and 2009)
• Hadoop is for large volumes and varied types of data
• Spark is for in-memory processing for speed/efficiency
1. Order-of-magnitude faster processing of big data
2. A unified engine that supports highly efficient SQL queries,
streaming data, machine learning, and graph processing
3. Revamped APIs designed for ease of use, especially for
processing of unstructured and semi-structured data
• Comparison dimensions: Performance, Cost, Parallel
processing, Scalability, Security, and Analytics
• NoSQL – Not only SQL!
Spark Versus Hadoop (2 of 2)
• Use Hadoop when …
– Processing big data sets in environments where data size
exceeds available memory
– Batch processing with tasks that exploit disk read and write
operations
– Building data analysis infrastructure with a limited budget
– Completing jobs that are not time-sensitive
– Historical and archive data analysis
• Use Spark when …
– Dealing with parallel operations of iterative algorithms
– Achieving quick results with in-memory computations
– Analyzing stream data in real time
– Graph-parallel processing to model data
– All ML applications
Stream Analytics Applications
• e-Commerce
• Telecommunication
• Law Enforcement and Cyber Security
• Power Industry
• Financial Services
• Health Services
• Government
Statistical Modeling for Business
Analytics (1 of 2)
Statistical Modeling for Business
Analytics (2 of 2)
• Statistics
– A collection of mathematical techniques to
characterize and interpret data
• Descriptive Statistics
– Describing the data (as it is)
• Inferential statistics
– Drawing inferences about the population based on
sample data
• Descriptive statistics for descriptive analytics
Descriptive Statistics Measures of
Central Tendency
• Arithmetic mean
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
• Median
– The number in the middle
• Mode
– The most frequent observation
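The three measures can be computed directly with Python's standard `statistics` module. The `scores` list is a made-up sample used only for illustration.

```python
from statistics import mean, median, mode

scores = [70, 85, 85, 90, 100]  # hypothetical exam scores

average = mean(scores)      # sum of the values divided by their count
middle = median(scores)     # middle value of the sorted list
most_common = mode(scores)  # most frequent observation
```

Here the mean (86) sits slightly above the median and mode (both 85) because the single high score of 100 pulls it upward.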
Descriptive Statistics Measures of
Dispersion (1 of 2)
• Dispersion
– Degree of variation in a given
variable
• Range
– Max - Min
• Variance
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
• Standard Deviation
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
• Mean Absolute Deviation (MAD)
– Average absolute deviation from the mean
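A short sketch of the dispersion measures on a made-up sample follows. `variance` and `stdev` from the standard `statistics` module use the sample (n - 1) denominator, matching the formulas on this slide; the MAD is computed by hand and is assumed to divide by n, as "average absolute deviation" suggests.

```python
from statistics import mean, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

value_range = max(data) - min(data)  # Max - Min
sample_variance = variance(data)     # sum of squared deviations / (n - 1)
sample_std = stdev(data)             # square root of the sample variance
center = mean(data)
mad = sum(abs(x - center) for x in data) / len(data)  # mean absolute deviation
```

For this sample the squared deviations add up to 32, so the variance is 32 / 7 (about 4.57), while the MAD is 12 / 8 = 1.5.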
Descriptive Statistics Measures of
Dispersion (2 of 2)
Figure 3.12 Understanding the Specifics about Box-and-Whiskers Plots
• Box plot, a.k.a. box-and-whiskers plot
• Versatile / informative
• Quartiles
• Median, mean, outliers
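The numbers behind a box plot can be sketched as below. The 1.5 x IQR rule for flagging outliers is a widely used convention and is an assumption here, as are the function name and the sample data; `statistics.quantiles` is called with its default "exclusive" method, and other quartile methods can give slightly different cut points.

```python
from statistics import quantiles

def box_plot_summary(values, whisker=1.5):
    """Return the quartiles, IQR, and outliers that a box plot would display."""
    q1, q2, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1                        # height of the box
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr, "outliers": outliers}

delivery_days = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]  # hypothetical sample
summary = box_plot_summary(delivery_days)
```

With these values the box spans 3 to 9, the median is 6, and only the value 30 falls beyond the upper fence of 18.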
Descriptive Statistics - Shape of a
Distribution (1 of 2)
• Histogram – frequency chart
• Skewness
– Measure of asymmetry
\[ \text{Skewness} = S = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\, s^3} \]
• Kurtosis
– Peak/tall/skinny nature of the distribution
\[ \text{Kurtosis} = K = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n\, s^4} - 3 \]
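The two shape measures can be coded directly from their definitions on this slide, with s taken as the sample standard deviation. The function names and the two sample lists are assumptions for illustration; statistical packages often apply slightly different small-sample corrections, so their results may not match exactly.

```python
from statistics import mean, stdev

def skewness(values):
    """Sum of cubed deviations divided by (n - 1) * s^3."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return sum((x - m) ** 3 for x in values) / ((n - 1) * s ** 3)

def kurtosis(values):
    """Sum of fourth-power deviations divided by n * s^4, minus 3."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return sum((x - m) ** 4 for x in values) / (n * s ** 4) - 3

symmetric = [1, 2, 3, 4, 5]        # hypothetical, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical, long right tail

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)
kurt_symmetric = kurtosis(symmetric)
```

A symmetric sample gives a skewness of zero, a long right tail gives a positive value, and a flat sample such as `symmetric` gives a negative kurtosis.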
Descriptive Statistics - Shape of a
Distribution (2 of 2)
Figure 3.13 Relationship
between Dispersion and
Shape Properties
• Skewness – positive
versus negative
• Kurtosis – tall versus
short
Technology Insight 3.3 (1 of 2)
Descriptive Statistics in Excel
Technology Insight 3.3 (2 of 2)
Descriptive Statistics in Excel Creating box-plot in
Microsoft Excel
Regression Modeling for Inferential
Statistics
• Regression
– A part of inferential statistics
– The most widely known and used analytics technique
in statistics
– Used to characterize relationship between explanatory
(input) and response (output) variable
• It can be used for
– Hypothesis testing (explanation)
– Forecasting (prediction)
Regression Modeling (1 of 3)
• Correlation versus Regression
– What is the difference (or relationship)?
• Simple Regression versus Multiple Regression
– Based on the number of input variables
• How do we develop linear regression models?
– Scatter plots (visualization—for simple regression)
– Ordinary least squares method
▪ A line that minimizes the sum of the squared errors
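Written out for simple regression, the ordinary least squares criterion picks the intercept and slope that minimize the sum of squared errors. The hat and bar symbols below are standard notation assumed here rather than taken from the slide:

```latex
\[ \min_{\beta_0,\, \beta_1} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 \]

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
   \qquad
   \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
```

The second line follows from setting the partial derivatives of the sum with respect to the two coefficients to zero.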
Regression Modeling (2 of 3)
Regression Modeling (3 of 3)
• x: input, y: output
• Simple Linear Regression
\[ y = \beta_0 + \beta_1 x \]
• Multiple Linear Regression
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n \]
• The meaning of Beta (β) coefficients
– Sign (+ or -) and magnitude
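A small sketch shows how the two coefficients of a simple linear regression are estimated and read. It uses the closed-form least squares solution; the function name and the `ad_spend` and `sales` figures are hypothetical.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Estimate the intercept (beta0) and slope (beta1) by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx             # slope: change in y per one-unit change in x
    beta0 = y_bar - beta1 * x_bar   # intercept: predicted y when x is 0
    return beta0, beta1

ad_spend = [1, 2, 3, 4, 5]           # hypothetical input (x)
sales = [2.9, 5.1, 7.0, 9.2, 10.8]   # hypothetical output (y)
beta0, beta1 = fit_simple_ols(ad_spend, sales)
predicted_at_6 = beta0 + beta1 * 6
```

The positive sign of the slope says that sales rise with ad spend, and its magnitude (about 1.99) says by how much per unit of spend.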
Process of Developing a Regression
Model
• How do we know if the
model is good enough?
– R² (R-squared)
– p Values
– Error measures (for
prediction problems)
▪ MSE, MAD, RMSE
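The fit and error measures can be computed from a list of actual values and a list of model predictions, as in this minimal sketch. The function name and both lists are made up for illustration, and MAD is taken here to mean the mean absolute deviation of the prediction errors.

```python
from math import sqrt
from statistics import mean

def regression_metrics(actual, predicted):
    """Return R-squared, MSE, RMSE, and MAD for a set of predictions."""
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e ** 2 for e in errors)                # unexplained variation
    y_bar = mean(actual)
    sst = sum((a - y_bar) ** 2 for a in actual)      # total variation
    mse = sse / len(errors)
    return {
        "r2": 1 - sse / sst,   # share of variation explained by the model
        "mse": mse,
        "rmse": sqrt(mse),     # same units as the output variable
        "mad": sum(abs(e) for e in errors) / len(errors),
    }

actual = [3, 5, 7, 9]             # hypothetical observed values
predicted = [2.5, 5.5, 7.0, 9.0]  # hypothetical model output
metrics = regression_metrics(actual, predicted)
```

An R² close to 1 together with small error measures indicates a good fit; p-values need a statistical test and are not covered by this sketch.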
Logistic Regression Modeling (1 of 2)
• A very popular statistics-based classification algorithm
• Employs supervised learning
• Developed in the 1940s
• The difference between Linear Regression and Logistic
Regression
– In Logistic Regression, the output/target variable is a
binomial (binary classification) variable (as opposed to
a numeric variable)
Logistic Regression Modeling (2 of 2)
\[ f(y) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]
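The logistic function maps any value of the linear term to a number between 0 and 1, which can then be turned into a class label. The coefficient values and the 0.5 cutoff below are assumptions chosen for illustration, not estimates from any data set.

```python
import math

def logistic(x, beta0, beta1):
    """Return 1 / (1 + e^-(beta0 + beta1 * x)), a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def classify(x, beta0, beta1, threshold=0.5):
    """Label the case 1 when the logistic output reaches the threshold, else 0."""
    return 1 if logistic(x, beta0, beta1) >= threshold else 0

beta0, beta1 = -4.0, 2.0                 # hypothetical coefficients
p_low = logistic(0.0, beta0, beta1)      # well below 0.5
p_mid = logistic(2.0, beta0, beta1)      # linear term is 0, so exactly 0.5
p_high = logistic(5.0, beta0, beta1)     # close to 1
labels = [classify(x, beta0, beta1) for x in (0.0, 2.0, 5.0)]
```

The output crosses 0.5 exactly where the linear term equals zero, which is x = 2 for these coefficients.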
Analytics In Action 3.3 (1 of 6)
Predicting NCAA Bowl Game Outcomes
Analytics In Action 3.3 (2 of 6)
• The analytics process to develop prediction models (both
regression and classification type) for NCAA Bowl Game
outcomes
Analytics In Action 3.3 (3 of 6)
• Prediction Results
1. Classification (directly predicting “Win” versus “Loss”)
▪ Simple binary classification
2. Regression (predicting the score difference and then
converting the results into “Win” versus “Loss”)
▪ Regression-based classification
– Which one would be more accurate?
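The second approach, regression-based classification, can be sketched as follows: a predicted score difference is converted into a label by its sign, and the labels are then compared with what actually happened. All numbers are hypothetical and are not results from the case study; the function names are assumptions as well.

```python
def outcome_from_margin(predicted_margin):
    """Turn a predicted score difference into a "Win" or "Loss" label."""
    return "Win" if predicted_margin > 0 else "Loss"

def accuracy(predicted, actual):
    """Share of cases where the predicted label matches the actual label."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

# Hypothetical regression output and actual game outcomes, for illustration only.
predicted_margins = [7.5, -3.0, 1.2, -10.4]
actual_outcomes = ["Win", "Loss", "Loss", "Loss"]

predicted_outcomes = [outcome_from_margin(m) for m in predicted_margins]
hit_rate = accuracy(predicted_outcomes, actual_outcomes)
```

The same `accuracy` function can score a direct classifier's labels, which makes the two approaches comparable on equal terms.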
Analytics In Action 3.3 (4 of 6)
Table 3.7 Prediction Results for the Direct Classification
Methodology
Analytics In Action 3.3 (5 of 6)
Table 3.8 Prediction Results for the Regression-Based
Classification Methodology