
Introduction

Data Mining
Outline
What is Data Mining?

How Does Data Mining Process Work?

Data Mining Techniques

Software Tools for Data Mining

Advantages of Data Mining


What is Data Mining?
Data mining is defined as a process used to extract usable data from a larger set of raw data.
What is (not) Data Mining?

What is not Data Mining?
• Look up a phone number in a phone directory.
• Query a Web search engine for information about “Amazon”.

What is Data Mining?
• Performing market analysis to identify new product bundles.
• Discover groups of similar documents on the Web.
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
• Web data, e-commerce
• purchases at department/grocery stores
• Bank/Credit Card transactions
Largest databases: Examples
• Google: 8.5 billion searches
• Segmentation of Google searches involves categorizing search queries into distinct groups based on common characteristics.
• This helps businesses and researchers better understand user behaviour, preferences, and trends.
What you can identify when analyzing Google searches
• Demographic Segmentation
• Geographic Segmentation
• Behavioral Segmentation
• Product or Service Segmentation
• Keyword Clusters
• Time-Based Segmentation
• Seasonal and Event-Based Segmentation
• Mobile vs. Desktop Segmentation
Knowledge Discovery
Steps of a KDD Process
• Learning the application domain
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Why Data Preprocessing?

Data in the real world is dirty


• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results!


• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
• Required for both OLAP and Data Mining!
Why can Data be Incomplete?
• Attributes of interest are not available (e.g., customer information for sales transaction data)
• Data were not considered important at the time of transactions, so they were not recorded!
• Data not recorded because of misunderstanding or malfunctions
• Data may have been recorded and later deleted!
• Missing/unknown values for some data
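A minimal sketch of common remedies for incomplete data, assuming pandas is available; the column names and values below are hypothetical illustrations.

# Handling incomplete data before mining (hypothetical toy data).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 33, np.nan],
    "income": [52000, 61000, np.nan, 45000, 58000],
    "city":   ["Colombo", "Kandy", None, "Galle", "Colombo"],
})

df["age"] = df["age"].fillna(df["age"].mean())             # impute numeric with the mean
df["income"] = df["income"].fillna(df["income"].median())  # or the median (robust to outliers)
df["city"] = df["city"].fillna(df["city"].mode()[0])       # impute categorical with the mode
complete_rows = df.dropna()                                # or simply drop incomplete records

print(df)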


Knowledge Discovery in Databases (KDD) has a broad range of
application areas

• Healthcare and Medical Research


• Finance and Banking
• Retail and E-commerce
• Telecommunications
• Manufacturing and Supply Chain
• Social Media and Web Analytics
• Government and Public Services
• Environmental Monitoring
• Education and Learning Analytics
• Transportation and Logistics
• Etc….
Related Terms
Data Mining in Business Intelligence

How Data Mining is used
1. Identify the problem
2. Use data mining techniques to transform the data into information
3. Act on the information
4. Measure the results
Data Mining Tasks

Data mining tasks are generally divided into two major categories:

Predictive tasks [Use some attributes to predict unknown or future values of other attributes.]
• Classification
• Regression
• Deviation Detection

Descriptive tasks [Find human-interpretable patterns that describe the data.]
• Association Discovery
• Clustering
Data Mining Tasks

• Predictive tasks. The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.

• Descriptive tasks. Here, the objective is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require post-processing techniques to validate and explain the results.
Data mining techniques
Classification
• Classification in data mining is a common technique that separates data points into different classes.
• It allows you to organize data sets of all sorts, including complex and large datasets as well as small and simple ones.
Classifiers in Machine Learning
Classification is a highly popular aspect of data mining. As a result, machine learning has many classifiers:
1. Logistic regression
2. Linear regression
3. Decision trees
4. Random forest
5. K-nearest neighbours
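A minimal classification sketch, assuming scikit-learn is installed; it trains two of the classifiers listed above on the classic Iris dataset rather than any dataset from these slides.

# Train and compare two classifiers on labeled data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=3)):
    model.fit(X_train, y_train)      # learn from labeled training data
    pred = model.predict(X_test)     # assign classes to unseen records
    print(type(model).__name__, accuracy_score(y_test, pred))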
Clustering
• Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
Example: Document Clustering
Suppose you're a researcher working with a large collection of news articles. You
want to organize these articles into meaningful groups based on their content.
Clustering can help you achieve this:
• Cluster 1 (Technology News): This cluster might contain articles about the
latest tech gadgets, software updates, and innovations in the tech industry.
• Cluster 2 (Political News): This cluster could include articles about
government policies, elections, and international relations.
• Cluster 3 (Health and Wellness News): This cluster might consist of articles
about medical breakthroughs, fitness tips, and healthy lifestyle trends.
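A minimal document-clustering sketch in the spirit of this example, assuming scikit-learn is available; the toy “articles” are invented, and any TF-IDF plus k-means pipeline would do.

# Group a handful of toy news snippets into clusters by content.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "New smartphone chips promise faster AI software updates",
    "Parliament debates the new election policy and reforms",
    "Doctors report a breakthrough in heart disease treatment",
    "Tech startup unveils innovative gadget at industry expo",
    "Government announces international relations strategy",
    "Fitness experts share tips for a healthy lifestyle",
]

X = TfidfVectorizer(stop_words="english").fit_transform(articles)  # text -> vectors
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)        # group similar docs

for label, text in zip(km.labels_, articles):
    print(label, text[:50])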
Classification Vs Clustering

Aspect | Classification | Clustering
Objective | Assign data points to pre-defined classes or labels | Group similar data points based on similarities
Supervision | Requires labeled training data | Does not require labeled training data
Goal | Predict the class label of new data points | Discover inherent structures in the data
Outcome | Each data point is assigned to a specific class | Data points are grouped into clusters
Algorithm types | Decision Trees, Naive Bayes, SVM, Neural Networks | K-Means, Hierarchical Clustering, DBSCAN, etc.
Training | Requires training on labeled data for each class | Unsupervised learning, no specific training
Interpretability | Results can be interpreted in terms of class labels | Interpretation is based on cluster properties
Use case | Spam detection, medical diagnosis, image recognition | Customer segmentation, document clustering, etc.
Example | Identifying whether an email is spam or not | Grouping news articles by topic
Association Rules
• Association rule mining finds interesting associations and relationships among large sets of data items.
• An association rule shows how frequently an itemset occurs in a transaction.
• Market Basket Analysis is one of the key techniques used by large retailers to show associations between items.
• It allows retailers to identify relationships between the items that people buy together frequently.
Example: Market Basket Analysis

Association Rule 1: {Bread} -> {Butter} This rule indicates that customers who
buy bread are likely to also buy butter.

Association Rule 2: {Milk, Eggs} -> {Cheese} This rule suggests that customers
who purchase both milk and eggs are likely to buy cheese as well.
Transaction ID Items Purchased
1 Bread, Milk, Butter
2 Bread, Butter, Soap
3 Milk, Bread, Eggs
4 Bread, Milk, Eggs
5 Bread, Milk, Butter, Eggs
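A minimal sketch computing support and confidence directly from the five transactions in the table, with no external library. Note that the toy table contains no Cheese, so Rule 2’s confidence comes out to 0 on this particular data.

# Support and confidence for the two rules above.
transactions = [
    {"Bread", "Milk", "Butter"},
    {"Bread", "Butter", "Soap"},
    {"Milk", "Bread", "Eggs"},
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Milk", "Butter", "Eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

print(confidence({"Bread"}, {"Butter"}))         # {Bread} -> {Butter}: 3/5 = 0.6
print(confidence({"Milk", "Eggs"}, {"Cheese"}))  # no Cheese in this toy data: 0.0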
Software
• R: R is a programming language and environment built for statistical computing and graphics. It has a wide range of packages for data mining, machine learning, and statistical analysis.
• Weka: A widely used open-source platform for data mining and machine learning. It provides a graphical user interface for various data preprocessing, modeling, and evaluation tasks.
• RapidMiner: This platform offers a user-friendly interface for data preparation, machine learning, and predictive modeling. It supports a wide range of data mining and machine learning techniques.
Data Mining and Visualization

Approaches
• Visualization to display results of data mining
– Helps the analyst to better understand the results of the data mining tool
• Visualization to aid the data mining process
– Interactive control over the data exploration process
– Interactive steering of analytic approaches (“grand tour”)

Interactive data mining issues
• Relationships between the analyst, the data mining tool and the visualization tool
Data Mining and Visualization
[Diagram: the analyst works with the data mining tool, which returns a visualized result.]
Advantages of Data Mining
• Marketing and sales are more effective
• Improved customer service
• Supply chain management
• Enhanced production uptime
• Better risk management
Q&A
Basics of Data Warehouse and Data Integration

Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process (William H. Inmon).
It is a large store of data and a set of processes collected into a database for the primary purpose of helping a business analyze data to make decisions.

 Subject-oriented - A data warehouse typically provides information on a topic (such as sales inventory, customers, suppliers or the supply chain) rather than company operations.

Data Warehouse
 Integrated - A data warehouse combines data from various sources. These may include a cloud, relational databases, flat files, structured and semi-structured data, metadata, and master data. The sources are combined in a manner that’s consistent, relatable, and ideally certifiable, providing a business with confidence in the data’s quality.
 Time Variant - The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from the historical point of view.
 Non-volatile - Non-volatile means the previous data is not erased when new data is added to it. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse. Prior data isn’t deleted when new data is added; historical data is preserved for comparisons, trends, and analytics.
Data Mart
 A data mart is similar to a data warehouse, but it holds data only for a specific department or line of business, such as sales, finance, or human resources.
 A data warehouse can feed data to a data mart, or a data mart can feed a data warehouse.
 The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department.

Based on their relation to the data warehouse and the data sources used to create the system, there are three types of data marts:
 Dependent - Created from an existing enterprise data warehouse.
 Independent - A stand-alone system, created without the use of a data warehouse, that focuses on one subject area or business function.
 Hybrid - Combines data from an existing data warehouse and other operational source systems.

Operational Data Store (ODS)
 An Operational Data Store (ODS), also known as OLTP (On-Line Transaction Processing), is a database management system where data is stored and processed in real time.
 Operational data stores are data repositories that store a snapshot of an organization's current data. This is a highly volatile data repository that is ideally suited for real-time analysis.

Ralph Kimball’s Approach vs W. H. Inmon’s Approach
There are two schools of thought when it comes to building a DW:
 According to Kimball, a data warehouse is made up of all the data marts in an enterprise. This is a bottom-up approach.
 According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management decisions. This is a top-down approach.
Comparison: Kimball vs Inmon
 How to decide between Kimball's and Inmon's architectures? It is all in the tradeoffs between the comparative advantages and disadvantages.
 Kimball is the better choice if you want to see results faster, have a small team of engineers, and foresee few changes in the business requirements. Otherwise, the data redundancy could cause anomalies and maintenance costs down the line.
 Inmon is the go-to for huge enterprises that wish to see a complete picture of their enterprise data, even if the deployment of the data warehouse is going to cost them more and take longer than Kimball's counterpart.

Goals of Data Warehousing (DW)
 Information accessibility - Data in a DW must be easy to comprehend, by both business users and developers alike. The business user should be allowed to slice and dice the data in every possible way.
 Information credibility - The data in a DW should be credible, complete and of the desired quality.
 Flexible to change - The DW must be adaptable to change.
 Support for more fact-based decision making
 Support for data security
 Information consistency

Data Warehouse - Advantages and Limitations

Advantages
 Integration at the lowest level, eliminating the need for integration queries.
 Runtime schematic cleaning is not needed - it is performed in the data staging environment.
 Independent of the original data source.
 Query optimization is possible.

Limitations
 The process takes a considerable amount of time and effort.
 Requires an understanding of the domain.
 More scalable only when accompanied by a metadata repository - increased load.
 Tightly coupled architecture.
Extract, Transform, Load (ETL)
• ETL is a process that extracts the data from different source systems, then transforms the data (applying calculations, concatenations, etc.) and finally loads the data into the data warehouse system. ETL provides the foundation for data analytics and machine learning workstreams. ETL is often used by an organization to:
• Extract data from legacy systems
• Cleanse the data to improve data quality and establish consistency
• Load data into a target database, usually a DW
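A minimal ETL sketch, assuming pandas is available; the file name, column names and target table are hypothetical illustrations of the three steps, with SQLite standing in for a real data warehouse.

# Extract: read raw records from a source system (here, a CSV export).
import sqlite3
import pandas as pd

raw = pd.read_csv("sales_export.csv")           # hypothetical legacy extract

# Transform: cleanse and derive values to establish consistency.
raw["customer"] = raw["customer"].str.strip().str.title()
raw = raw.dropna(subset=["amount"])             # drop records missing the fact value
raw["total"] = raw["amount"] * raw["quantity"]  # a simple calculated column

# Load: append the cleaned data into the warehouse table.
with sqlite3.connect("warehouse.db") as conn:   # stand-in for a real DW
    raw.to_sql("fact_sales", conn, if_exists="append", index=False)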

Data Mapping
 The process of creating data element mappings between two distinct data models.
 It is used as the first step towards a wide variety of data integration tasks, which include:
 Data transformation between data sources and data destinations
 Identification of data relationships
 Discovery of hidden sensitive data
 Consolidation of multiple databases into a single database

Data Mapping Techniques
There are three main data mapping techniques:
 Manual Data Mapping: Requires IT professionals to hand-code or manually map the data source to the target schema.
 Schema Mapping: A semi-automated strategy. A data mapping solution establishes a relationship between a data source and the target schema. IT professionals check the connections made by the schema mapping tool and make any required adjustments.
 Fully-Automated Data Mapping: The most convenient, simple, and efficient data mapping technique uses a code-free, drag-and-drop data mapping UI. Even non-technical users can carry out mapping tasks in just a few clicks.
Data Staging
A data staging area can be defined as an intermediate storage area that falls between the operational/transactional sources of data and the DW or data mart. A staging area can be used to:
 Gather data from different sources, ready to be processed at different times
 Quickly load information from the operational database
 Find changes against current DB/DM values
 Cleanse data and recalculate aggregates

Data Extraction
Extraction is the operation of extracting data from the source system for further use in a data warehouse environment. This is the first step in the ETL process.
Designing this process means making decisions about the following main aspects:
 Which extraction method do I choose?
 How do I provide the extracted data for further processing?

Data Extraction
The data has to be extracted both logically and physically.
The logical extraction methods:
 Full extraction
 Incremental extraction
The physical extraction methods:
 Online extraction
 Offline extraction

Data Transformation
Transformation is the most complex and, in terms of production, the most costly part of the ETL process.
Transformations can range from simple data conversions to extreme data scrubbing techniques.
From an architectural perspective, transformations can be performed in two ways:
 Multistage data transformation - Extracted data is moved to a staging area where transformations occur prior to loading the data into the warehouse.
 In-warehouse data transformation - Data is extracted and loaded into the analytics warehouse, and transformations are done there. This is sometimes referred to as Extract, Load, Transform (ELT).
Data Loading
 The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely.
 The timing and scope to replace or append into the DW are strategic design choices dependent on the time available and the business needs.
 More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW.

What Is Data Integration?
Data integration (DI) is a process of coherent merging of data from various data sources, presenting a cohesive/consolidated view to the user.
 Involves combining data residing at different sources and providing users with a unified view of the data.
 Significant in a variety of situations, both:
 Commercial (e.g., two similar companies trying to merge their databases)
 Scientific (e.g., combining research results from different bioinformatics research repositories)

Need for Data Integration
 Enables quick access to information based on a key variable, along with querying against existing data for meaningful insights.
 Helps reduce costs, overlaps, and redundancies, and the business will be less exposed to risks and losses.
 Helps in better monitoring of key variables' trending patterns, which alleviates the need to conduct more studies and surveys and brings down R&D expenditure.

Knowledge required for Data Integration
Concepts and skills required:
 Development challenges
o Translation of relational databases to object-oriented applications
o Consistent and inconsistent metadata
o Handling redundant and missing data
o Normalization of data from different sources
 Technological challenges
o Various formats of data
o Structured and unstructured data
o Huge volumes of data
 Organizational challenges
o Unavailability of data
o Manual integration risk, failure
Common Data Integration Approaches
 Federated database (virtual database): A type of meta-database management system which transparently integrates multiple autonomous databases into a single federated database. The constituent databases are interconnected via a computer network and may be geographically decentralized. The federated database is the fully integrated, logical composite of all constituent databases in a federated database management system.
 Data warehousing
 Memory-mapped data structure - Memory mapping is a process or command in computer programming that requests that files, code, or objects be brought into system memory. It allows files or data to be processed temporarily as main memory by a central processing unit.

Data Integration Approaches
 Memory-mapped data structure (continued):
 Useful when you need to do in-memory data manipulation and the data structure is large. It is mainly used on the .NET platform and is always performed with C# or VB.NET.
 It is a much faster way of accessing the data than using a memory stream.

Difference between a federated database and a DW
Federated | DW
Preferred when the databases are present across various locations over a large area | Preferred when the source of information can be taken from one location
Data would be present in various servers | The entire DW would be present in one server
Requires high-speed network connections | Requires no network connections
Easier to create as compared to a DW | Its creation is not as easy as that of a federated database
Requires no creation of a new database | The DW must be created from scratch
Requires a network expert to set up the network connections | Requires database experts such as data stewards
Data Integration Technologies
The technologies that are used for data integration include:
 Data interchange - A structured transmission of organizational data between two or more organizations through electronic means. Used for the transfer of electronic documents from one computer to another.
 Object brokering - An object request broker (ORB) is middleware software. It gives programmers the freedom to make calls from one computer to another via a computer network.
 Modeling techniques
 Entity-Relationship Modeling - An entity-relationship model (ER model) describes the structure of a database with the help of a diagram, known as an Entity Relationship Diagram (ER Diagram).

Data Integration Technologies
 Modeling techniques
 Dimensional Modeling - Dimensional Modeling (DM) is a data structure technique optimized for data storage in a data warehouse. The purpose of dimensional modeling is to optimize the database for faster retrieval of data.

Difference between ER modelling and Dimensional modelling
ER modelling | Dimensional modelling
Optimised for transactional data | Optimised for queryability and performance
Eliminates redundant data | Does not eliminate redundant data where appropriate
Highly normalised | Aggregates most of the attributes and hierarchies of a dimension into a single entity
A complex maze of hundreds of entities linked with each other | A logically grouped set of star schemas
Useful for transactional systems | Useful for analytical systems
Split as per the entities | Split as per the dimensions and facts
Advantages of Using Data Integration
 Of benefit to decision-makers, who have access to important information from past studies
 Reduces cost, overlaps and redundancies; reduces exposure to risks
 Helps to monitor key variables like trends and consumer behaviour, etc.

Challenges in Data Integration
 Development challenges
o Translation of relational databases to object-oriented applications
o Consistent and inconsistent metadata
o Handling redundant and missing data
o Normalization of data from different sources
 Technological challenges
o Various formats of data
o Structured and unstructured data
o Huge volumes of data
 Organizational challenges
o Unavailability of data
o Manual integration risk, failure

Data Quality
 Consistency: When one piece of data is stored in multiple locations, do they have the same values?
 Accuracy: Does the data accurately describe the properties of the object it is meant to model?
 Relevance: Is the data appropriate to support the objective?
 Existence: Does the organization have the right data?
 Integrity: How accurate are the relationships between data elements and data sets?
 Validity: Are the values acceptable?

Improving data quality involves:
 Correcting, standardizing and validating the information
 Creating business rules to correct, standardize and validate your data
 High-quality data is essential to successful business operations
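A minimal sketch of rule-based data quality checks, assuming pandas is available; the rules, column names and records below are hypothetical examples of the consistency and validity dimensions above.

# Evaluate a few simple data quality rules over a toy customer table.
import pandas as pd

customers = pd.DataFrame({
    "id":    [1, 2, 2, 4],
    "email": ["a@x.com", "b@x.com", "b@x.com", "not-an-email"],
    "age":   [34, -5, 41, 29],
})

checks = {
    "consistency: no duplicate ids":     customers["id"].is_unique,
    "validity: ages in accepted range":  customers["age"].between(0, 120).all(),
    "validity: emails look well-formed": customers["email"].str.contains(r"^\S+@\S+\.\S+$").all(),
}

for rule, passed in checks.items():
    print(("PASS" if passed else "FAIL"), rule)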
Data Quality
Data quality helps you to:
 Plan and prioritize data
 Parse data
 Standardize, correct and normalize data
 Verify and validate data accuracy
 Apply business rules
 Increase the quality of information

Data Quality in Data Integration
An effective data integration strategy can lower costs and improve productivity by ensuring the consistency, accuracy and reliability of data.
Data integration enables you to:
 Match, link and consolidate multiple data sources
 Gain access to the right data sources at the right time
 Deliver high-quality information

Data Quality in Data Integration
 Understand corporate information anywhere in the enterprise.
 Data integration involves combining processes and technology to ensure that effective use of the data can be made.
Data integration can include:
 Data movement
 Data linking and matching
 Data householding

ETL Tools
ETL tools can be grouped into four categories based on their infrastructure and supporting organization or vendor: enterprise-grade, open-source, cloud-based, and custom ETL tools.
 Enterprise Software ETL Tools - Developed and supported by commercial organizations. These solutions tend to be the most robust and mature in the marketplace.
 Open-Source ETL Tools
 Cloud-Based ETL Tools
 Custom ETL Tools - Companies with development resources may produce proprietary ETL tools using general programming languages.
ETL Tools
ETL processes can also be created using programming languages.

Some open-source ETL framework tools:
 Hevo Data
 Apache Camel
 Airbyte
 Apache Kafka
 Logstash
 Pentaho Kettle
 Talend Open Studio
 Singer
 KETL, Apache NiFi and CloverDX

Some popular ETL tools:
 IBM DataStage
 Oracle Data Integrator
 Informatica PowerCenter
 SAS Data Management
 Talend Open Studio
 Pentaho Data Integration
 Singer
 Hadoop
 Dataddo
 AWS Glue, Azure Data Factory, Google Cloud Dataflow
Introduction to Business Analytics

Business Analytics

(Business) Analytics is the use of:


o data,
o information technology,
o statistical analysis,
o quantitative methods, and
o mathematical or computer-based models
to help managers gain improved insight about their
business operations and make better, fact-based
decisions.

Examples of Applications
• Pricing - setting prices for consumer and industrial goods, government contracts, and maintenance contracts
• Customer segmentation - identifying and targeting key customer groups in retail, insurance, and credit card industries
• Merchandising - determining brands to buy, quantities, and allocations
• Location - finding the best location for bank branches and ATMs, or where to service industrial equipment
• Supply chain design - determining the best sourcing and transportation options and finding the best delivery routes
• Staffing - ensuring appropriate staffing levels and capabilities, and hiring the right people
• Health care - scheduling operating rooms to improve utilization, improving patient flow and waiting times, purchasing supplies, and predicting health risk factors
Impacts of Analytics
• Benefits - reduced costs, better risk management, faster decisions, better productivity and enhanced bottom-line performance such as profitability and customer satisfaction.
• Challenges - lack of understanding of how to use analytics, competing business priorities, insufficient analytical skills, difficulty in getting good data and sharing information, and not understanding the benefits versus perceived costs of analytics studies.

Importance of Business Analytics
 There is a strong relationship of BA with:
- profitability of businesses
- revenue of businesses
- shareholder return
 BA enhances understanding of data
 BA is vital for businesses to remain competitive
 BA enables creation of informative reports

Evolution of Business Analytics
• Analytic Foundations
– Business Intelligence (BI)
– Information Systems (IS)
– Statistics
– Operations Research/Management Science (OR/MS)
• Modern Business Analytics
– Data mining
– Simulation and risk analysis
– Decision Support Systems (DSS)
– Visualization
[Figure: A Visual Perspective of Business Analytics]

Software Support and Spreadsheet Technology
• Commercial software
– IBM Cognos Express
– SAS Analytics
– Tableau
• Spreadsheets
– Widely used
– Effective for manipulating data and developing and solving models
– Support powerful commercial add-ons
– Facilitate communication of results

Business Analytics in Practice
 Financial Analytics
 HR Analytics
 Marketing Analytics
 Health Care Analytics
 Web Analytics
 Supply Chain Analytics
 Analytics for Government and Nonprofits
 Sports Analytics

Descriptive, Predictive, and Prescriptive Analytics
• Descriptive analytics: the use of data to understand past and current business performance and make informed decisions
• Predictive analytics: predict the future by examining historical data, detecting patterns or relationships in these data, and then extrapolating these relationships forward in time
• Prescriptive analytics: identify the best alternatives to minimize or maximize some objective
Data for Business Analytics
• Data: numbers or textual data that are collected through some type of measurement process
• Information: result of analyzing data; that is, extracting meaning from data to support evaluation and decision making

Examples of Data Sources and Uses
• Annual reports
• Accounting audits
• Financial profitability analysis
• Economic trends
• Marketing research
• Operations management performance
• Human resource measurements
• Web behavior - page views, visitor's country, time of view, length of time, origin and destination paths, products they searched for and viewed, products purchased, what reviews they read, and many others

Data for Business Analytics
 Four types of data based on measurement scale:
 Categorical (nominal) data
 Ordinal data
 Interval data
 Ratio data

Categorical (Nominal) Data
 Data placed in categories according to a specified characteristic
 Categories bear no quantitative relationship to one another
 Examples: customer's location (America, Europe, Asia); employee classification (manager, supervisor, associate)

Ordinal Data
 Data that is ranked or ordered according to some relationship with one another
 No fixed units of measurement
 Examples: college football rankings; survey responses (poor, average, good, very good, excellent)

Interval Data
 Ordinal data but with constant differences between observations
 No true zero point
 Ratios are not meaningful
 Examples: temperature readings; SAT scores

Data for Business Analytics
Ratio Data
 Continuous values with a natural zero point
 Ratios are meaningful
 Examples: monthly sales; delivery times

Big Data
• Big data refers to massive amounts of business data (volume) from a wide variety of sources (variety), much of which is available in real time (velocity), and much of which is uncertain or unpredictable (veracity).
• “The effective use of big data has the potential to transform economies, delivering a new wave of productivity growth and consumer surplus. Using big data will become a key basis of competition for existing companies, and will create new competitors who are able to attract employees that have the critical skills for a big data world.” - McKinsey Global Institute, 2011
Big Data
 Big data? Not just big!
 Volume
 Variety
 Velocity
 Veracity

Data Reliability and Validity
• Reliability - data are accurate and consistent.
• Validity - data correctly measures what it is supposed to measure.
• Example: The number of calls to a customer service desk might be counted correctly each day (and thus is a reliable measure) but not valid if it is used to assess customer dissatisfaction, as many calls may be simple queries.

Models in Business Analytics
• Model - an abstraction or representation of a real system, idea, or object.
– Captures the most important features
– Can be a written or verbal description, a visual representation, a mathematical formula, or a spreadsheet

Example - Three Forms of a Model
The sales of a new product, such as a first-generation iPad or 3D television, often follow a common pattern.
1. Verbal description: The rate of sales starts small as early adopters begin to evaluate a new product and then begins to grow at an increasing rate over time as positive customer feedback spreads. Eventually, the market begins to become saturated and the rate of sales begins to decrease.
Example - Three Forms of a Model
2. Visual model: a sketch of sales as an S-shaped curve over time
3. Mathematical model: [formula not reproduced] where S is sales, t is time, e is the base of natural logarithms, and a, b and c are constants
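The equation itself did not survive the slide export. One standard S-shaped form consistent with the constants named above (a, b, c and base e) is the Gompertz growth curve, S = a*e^(-b*e^(-c*t)); the sketch below treats that form, and its constant values, as assumptions rather than as the slide's own formula.

# Evaluate an assumed Gompertz-style S-curve at a few points in time.
import math

a, b, c = 10000, 5.0, 0.5   # illustrative constants: a is the saturation level

for t in range(0, 13, 3):   # sales level at a few points in time
    S = a * math.exp(-b * math.exp(-c * t))   # S = a*e^(-b*e^(-c*t)), an assumption
    print(f"t={t:2d}  S={S:8.1f}")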

Decision Models
• Decision model - a logical or mathematical representation of a problem or business situation that can be used to understand, analyze, or facilitate making a decision
• Inputs:
– Data - assumed to be constant
– Uncontrollable inputs - quantities that can change but cannot be controlled
– Decision options - controllable and selected at the discretion of the decision maker

[Figure: Nature of Decision Models]
Descriptive Models
• Descriptive models explain behavior and allow users to evaluate potential decisions by asking “what-if?” questions.

Predictive Models
• Predictive models focus on what will happen in the future.
• Many predictive models are developed by analyzing historical data and assuming that the past is representative of the future.

Prescriptive Models
• Prescriptive models help decision makers identify the best solution to a decision problem.
• Optimization - finding values of decision variables that minimize (or maximize) something such as cost (or profit)
– Objective function - the equation that minimizes (or maximizes) the quantity of interest
– Optimal solution - values of the decision variables at the minimum (or maximum) point

Model Assumptions
• Assumptions are made to
– simplify a model and make it more tractable; that is, able to be easily analyzed or solved
– better characterize historical data or past observations
• The task of the modeler is to select or build an appropriate model that best represents the behavior of the real situation.
Example - Linear Demand Prediction Model
As price increases, demand falls. A simple model is:
D = a - bP
where D is the demand, P is the unit price, a is a constant that estimates the demand when the price is zero, and b is the slope of the demand function.

Example - A Nonlinear Demand Prediction Model
Assumes price elasticity is constant (a constant ratio of % change in demand to % change in price):
D = cP^(-d)
where c is the demand when the price is 1 and d > 0 is the price elasticity.
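A small sketch evaluating both demand functions side by side; the constant values are invented purely for illustration.

# Compare the linear and constant-elasticity demand models at a few prices.
linear    = lambda P, a=1000.0, b=20.0: a - b * P   # D = a - bP
nonlinear = lambda P, c=1000.0, d=1.5:  c * P**-d   # D = cP^(-d)

for P in (5, 10, 20, 40):
    print(f"P={P:3d}  linear D={linear(P):7.1f}  nonlinear D={nonlinear(P):7.1f}")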

Uncertainty and Risk
• Uncertainty is imperfect knowledge of what will happen in the future.
• Risk is associated with the consequences of what actually happens.
• “To try to eliminate risk in business enterprise is futile. Risk is inherent in the commitment of present resources to future expectations. Indeed, economic progress can be defined as the ability to take greater risks. The attempt to eliminate risks, even the attempt to minimize them, can only make them irrational and unbearable. It can only result in the greatest risk of all: rigidity.” - Peter Drucker

Problem Solving with Analytics
1. Recognizing a problem
2. Defining the problem
3. Structuring the problem
4. Analyzing the problem
5. Interpreting results and making a decision
6. Implementing the solution
Recognizing a Problem
• Problems exist when there is a gap between what is happening and what we think should be happening.
• For example, costs are too high compared with competitors.

Defining the Problem
• Clearly defining the problem is not a trivial task.
• Complexity increases when the following occur:
– large number of courses of action
– the problem belongs to a group and not an individual
– competing objectives
– external groups are affected
– problem owner and problem solver are not the same person
– time limitations exist

Structuring the Problem
 Stating goals and objectives
 Characterizing the possible decisions
 Identifying any constraints or restrictions

Analyzing the Problem
 Analytics plays a major role.
 Analysis involves some sort of experimentation or solution process, such as evaluating different scenarios, analyzing risks associated with various decision alternatives, finding a solution that meets certain goals, or determining an optimal solution.
Interpreting Results and Making a Decision
 Models cannot capture every detail of the real problem.
 Managers must understand the limitations of models and their underlying assumptions and often incorporate judgment into making a decision.

Implementing the Solution
 Translate the results of the model back to the real world.
 Requires providing adequate resources, motivating employees, eliminating resistance to change, modifying organizational policies, and developing trust.

Basics of Enterprise Reporting

Reporting Perspectives Common to all Levels of Enterprise
 Many enterprises have headquarters and several regional centers. Several geographic location-focused operations may aggregate to the nearest regional center.
 A geographic location may have “revenue generating - customer facing” units and “support” units.
 There could be regional or corporate level support functions.
 IT-enabled reporting may occur at local, regional, or corporate levels.

Reporting Perspectives Common to all Levels of Enterprise
 Function level: Reports generated at the function level may be consumed by users within a department, geographic location or region, or by decision makers at the corporate level.
 Internal/external: Sometimes the consumers of reports may be external to the enterprise.
 Role-based: Provide a standard format of report to similar roles across the enterprise, as they are likely to make similar decisions.
 Strategic/operational: Reports could also be classified based on the nature of the purpose they serve. Strategic reports inform alignment with the goals, whereas operational reports present transaction facts.
 Summary/detail: Summary reports do not provide transaction-level information, whereas detailed reports list atomic facts.
 Standard/ad hoc: A company may generate periodic reports, say weekly, monthly, or quarterly, in standard formats. Executives many times need ad hoc or on-demand reports for critical business decision making.
Reporting Perspectives Common to all Levels of Enterprise
 Purpose: Enterprises classify reports as statutory, which focus on business transparency and need to be shared with regulatory bodies, and analytical, which look into a particular area of operation, representing large data interpretations in the form of graphs.
 Technology platform-centric: Reporting in today's context need not use paper at all. Dashboards could be delivered on smartphones and tablets. Reports could be published in un-editable (secure) form with watermarks. Reports could be protected to be used by a specific person, during specific hours, from a specific device.

Report Standardization and Presentation Practices
Enterprises tend to standardize reporting from several perspectives. Some report standardization perspectives are as follows:
 Data standardization
 Content standardization
 Presentation standardization
 Metrics standardization
 Reporting tools' standardization

Common Report Layout Types
 Tabular reports - Similar to a basic spreadsheet. Each row is one record, with numerous columns of data that come from that record. The complete list can be totaled at the bottom by record count or any other summarizable value.
 Summary reports - Similar to tabular reports, summary reports include rows of data where each row equates to one record and fields from the record populate each column in the corresponding row.
 List reports - A listing report displays data from a table. Using the listing report task, you can display all the data from a table, or a portion of the data, based on criteria that you specify.
 Matrix reports - The matrix report is pretty similar to the summary report, but with the additional capability to group by both rows and columns. Similar to the tabular and summary report formats, you can add summarizable fields in matrix reports. By default, the record count is added into the report matrix.
 Chart reports - A chart is a graphic that displays numeric data in a compact, visual layout and that reveals essential data relationships. You can add a chart to a form/report to visualize your data and make informed decisions.
 Gauge reports - Also known as dial charts or speedometer charts, these use needles to show information as a reading on a dial. On a gauge chart, the value for each needle is read against the colored data range or chart axis. This chart type is often used in executive dashboard reports to show key business indicators.
Report Delivery Formats
 Printed reports
 Secure soft copy
 Email attachment
 FTP (File Transfer Protocol) - rules that govern how computers transfer files from one system to another over the internet
 Link to reports
 Worksheet, PowerPoint presentation, text
 eBook

Enterprise Reporting Characteristics in the OLAP World
 Single version of truth - Providing the same “fact value”, irrespective of the path the user has taken to reach the data, is of paramount importance in reporting.
 Role-based delivery - This feature is critical to avoid information overload.
 Anywhere/anytime/any-device access - Users have their own preferences, and therefore flexibility needs to be built in to ensure that users come to the same source of information again and again and don't find alternative ways to decision making.

Enterprise Reporting Characteristics in the OLAP World
 Personalization - Users' choices of delivery method, format (PDF/worksheet/CSV), bookmarking, customization (in terms of language), etc. will need to be addressed.
 Security - Enterprises have huge concerns over unauthorized access to business-critical information, hacking by malicious sources, inadvertent leakage of business data, etc. The security framework needs to be thoroughly examined before implementing reporting.
 Alerts - Decision makers need immediate notification of threshold breaches of critical business KPIs. These alerts need to be delivered to various devices such as laptops and mobile devices, in different forms like email, sound, voice message, SMS text, pop-up, etc., depending on user preferences.
Corporations and Strategy
 Traditional strategy deployment and communication follows the functional hierarchy.
 Strategy formulation involves a little planning and lots of budgeting.
 Only 5% of the workforce understands the strategy, 85% of executives spend less than 1 hour per month discussing strategy, and only 25% of managers have incentives linked to strategy (Fortune Magazine survey).
 35% of stakeholder valuation decisions are based on non-financial data:
 Strategy execution
 Management credibility
 Innovation
 Ability to attract talent

Traditional Balanced Scorecard (Source: Kaplan & Norton, 1992)

The Balanced Scorecard
 Financial perspective: The financial perspective addresses the question of how shareholders view the firm and which financial goals are desired from the shareholder's perspective.
 Customer perspective: The customer perspective addresses the question of how the firm is viewed by its customers and whether the firm will be able to fulfil customers' expectations.
The Balanced Scorecard
 Internal business process perspective: The business process perspective identifies the processes in which the organization must excel to satisfy its shareholders' expectations of good financial returns and also keep its customers happy and loyal.
 Learning and growth perspective: The learning and growth perspective identifies the competencies that the employees of the organization must acquire for long-term improvement, sustainability and growth.

Strategy Map
 The balanced scorecard strategy map describes how the company intends to create value for shareholders and customers.

[Figure slides: The Measurement System; Strategy Management Framework]


Dashboards
A dashboard is a graphical user interface (GUI) that organizes and presents information in a way that is easy to read. Dashboards have the following attributes:
 They display the data relevant to their own objectives.
 They throw light on the key performance indicators and metrics used to measure and monitor the organization's performance.
 Since dashboards are designed to serve a specific purpose, they inherently contain pre-defined conditions that help end users analyze their own performance.

Corporate/Enterprise Dashboards
Corporate/enterprise dashboards are changing the way we look at information and the way we analyze our business. A well-constructed corporate dashboard answers four basic questions:
 Where
 What
 How
 Why
Instead of wading through pages of disparate operational data, dashboards portray critical operating and strategic information about an organization using a collection of powerful graphical elements.

Corporate/Enterprise Dashboards
Enterprise dashboards may include:
 Bar charts
 Maps
 Trend lines
 Speedos
 Correlation
One quick glance at the dashboard tells users the key performance indicators and metrics used to measure and monitor the company's performance. This contributes to:
 Better analysis
 Better tracking
 Proactive alerting
Why Corporate Dashboards?
 When KPIs are exceeded, visual and email alerts can draw attention to the right area. Further drill-down to root cause analysis instantly signals where the exception happened and what triggered it.
 Trend analysis uses historical results to predict future outcomes. This is achieved by tracking variances in cost and schedule performance.

Types of Corporate Dashboards
Enterprise performance dashboards provide an overall view of the entire enterprise, rather than specific business functions.
Typical portlets in an enterprise performance dashboard include:
 Corporate financials
 Sales revenue
 Business unit KPIs (Key Performance Indicators)
 Supply chain information
 Compliance or regulatory data
 Balanced scorecard information
Customer Support Dashboards
Organizations provide such a dashboard to their customers as a value-added service. It provides customers with their personal account information pertaining to the business relationship, such as:
 Online trading
 Utility services
 Entertainment
 B2B service level agreement (SLA) monitoring

Divisional Dashboards
Divisional dashboards are one of the most popular dashboards, used to provide at-a-glance actionable information to division heads, operational managers and department managers. Each division has its own set of KPIs which can be visually displayed on the enterprise dashboard. Typical divisional dashboards include:
 Purchasing Dashboards
 Supply Chain Dashboards
 Operations Dashboards
 Manufacturing Dashboards
 Quality Control Dashboards
 Marketing Dashboards
 Sales Dashboards
 Finance Dashboards
 Human Resources Dashboards
Benefits of Enterprise Dashboards
Dashboards have the following benefits:
 Place all critical information on just one screen rather than flipping through pages
 Improved decision making
 Rapid problem detection
 Better analysis of performance
 Identify trends and corrective actions to improve the performance of the organization

Steps in Creating a Dashboard
 Identify the data that will go into the enterprise dashboard. Enterprise dashboards can contain either or both of the following kinds of data:
 Quantitative data
 Non-quantitative data
 Decide on the timeframe. The various timeframes can be:
 This month to date
 This quarter to date
 This year to date
 Today so far

Steps in Creating a Dashboard (continued)
 Decide on the comparative measures. For example, the comparative measures can be:
 The same measure at the same point in time in the past
 The same measure at some other point in time in the past
 Decide on the evaluation mechanisms. For example, the evaluation can be performed:
 Using visual objects, e.g. traffic lights
 Using visual attributes, e.g. red color for the measure to alert a serious condition

Summary - Corporate Dashboards
Corporate dashboards are changing the way we look at information and the way we analyze our business. Corporate dashboards contribute to:
 Better analysis
 Better tracking
 Proactive alerting
 Corporate dashboards improve accountability and transparency across the organization.
 Corporate dashboards work effectively with balanced scorecard implementations.
Balanced Scorecard vs. Dashboards
OLTP and OLAP

Online Transaction Processing (OLTP)
 OLTP systems refer to a class of systems that manage transaction-oriented applications.
 They are mainly concerned with the entry, storage, and retrieval of data.
 They are designed to cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, payroll, accounting, etc.


Online Transaction Processing (OLTP)
 OLTP systems are characterized by a large number of short on-line transactions such as:
 INSERT (a record of a final purchase by a customer is added to the database),
 UPDATE (the price of a product is raised from Rs. 10 to Rs. 10.50), and
 DELETE (a product has gone out of demand and therefore the store removes it from the shelf as well as from its database).
 Almost all industries today use OLTP systems to record transactional data.
 The data captured by OLTP systems is usually stored in commercial relational databases. For example, the database of a supermarket store consists of data about its transactions, products, employees and inventory supplies, in tables like Transactions, Product Master, Employee Details, Inventory Supplies, Suppliers, etc.
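A minimal sketch of the three short transactions described above, using Python's built-in sqlite3 module as a stand-in for a commercial relational database; the table and values are invented.

# INSERT, UPDATE and DELETE as short OLTP-style transactions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

conn.execute("INSERT INTO product (name, price) VALUES (?, ?)", ("Soap", 10.0))  # INSERT
conn.execute("UPDATE product SET price = 10.5 WHERE name = ?", ("Soap",))        # UPDATE
conn.execute("DELETE FROM product WHERE name = ?", ("Soap",))                    # DELETE
conn.commit()   # each short transaction is committed as soon as it completes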
Online Transaction Processing System
 Used for transaction-oriented applications
 Used by lower-level employees
 Quick updates and retrievals
 Many users accessing the same data
 Users are not technical persons
 Response rate is very fast
 Single transaction (one application) at a time
 Stores routine data
 Follows the client-server model
 Applications: banks, retail stores, airline reservation

OLTP Segmentation
OLTP systems can be segmented into:
– Real-time transaction processing
– Batch processing

Real-time transaction processing
 Multiple users can get the information
 Very fast response rate
 Transactions processed immediately
 Everything is processed in real time

Batch processing
 Where information is required in batches
 Offline access to information
 Presorting (sequencing) is applied
 Takes time to process information
Characteristics of the OLTP Model
 Online connectivity - LAN, WAN
 Availability - available 24 hours a day
 Response rate
– Rapid response rate
– Load balancing by prioritizing the transactions
 Cost
– Cost of transactions is low
 Update facility
– Short lock periods
– Instant updates
– Uses the full potential of hardware and software

Limitations of Relational Models
 Create and maintain a large number of tables for the voluminous data
 For new functionalities, new tables are added
 Unstructured data cannot be stored in relational databases
 Very difficult to manage the data with a common denominator (keys)

Queries that an OLTP System can Process
 Search for a particular customer's record.
 Retrieve the product description and unit price of a particular product.
 Filter all products with a unit price equal to or above Rs. 25.
 Filter all products supplied by a particular supplier.
 Search and display the record of a particular supplier.

Advantages of an OLTP System
 Simplicity - It is designed typically for use by clerks, cashiers, clients, etc.
 Efficiency - It allows its users to read, write and delete data quickly.
 Fast query processing - It responds to user actions immediately and also supports transaction processing on demand.
Challenges of an OLTP System
 Security - An OLTP system requires concurrency control (locking) and recovery mechanisms (logging).
 OLTP system data content is not suitable for decision making - A typical OLTP system manages the current data within an enterprise/organization. This current data is far too detailed to be easily used for decision making.

The Queries that OLTP Cannot Answer
 The supermarket store is deciding on introducing a new product. The key questions they are debating are: “Which product should they introduce?” and “Should it be specific to a few customer segments?”
 The supermarket store is looking at offering a discount on their year-end sale. The questions here are: “How much discount should they offer?” and “Should there be different discounts for different customer segments?”
 All the queries stated above have more to do with analysis than simple reporting. Ideally, these queries are not meant to be solved by an OLTP system.

Online Analytical Processing (OLAP)
 OLAP differs from traditional databases in the way data is conceptualized and stored.
 In OLAP, data is held in dimensional form rather than relational form.
 OLAP's life blood is multi-dimensional data.
 OLAP tools are based on the multi-dimensional data model. The multidimensional data model views data in the form of a data cube.
 OLAP is a technology that is used to organize large business databases and support business intelligence.
 OLAP databases are divided into one or more cubes. The cubes are designed in such a way that creating and viewing reports becomes easy. Each cube is organized and designed by a cube administrator to fit the way that you retrieve and analyze data, so that it is easier to create and use the PivotTable and PivotChart reports that you need.
Online Analytical Processing (OLAP)
 OLAP is a category of software that allows users to analyze information from multiple database systems at the same time. It is a technology that enables analysts to extract and view business data from different points of view.
 Analysts frequently need to group, aggregate and join data. These operations in relational databases are resource intensive. With OLAP, data can be pre-calculated and pre-aggregated, making analysis faster.
 Provides a multidimensional view of data
 Used for analysis of data
 Data can be viewed from different perspectives
 Determine why data appears the way it does
 A drill-down approach is used to dig deeper into the data (a small sketch follows the characteristics list below)

Characteristics of OLAP
 Multidimensional analysis
 Support for complex queries
 Advanced database support
– Support large databases
– Access different data sources
– Access aggregated data and detailed data
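To make the multidimensional, drill-down style of analysis above concrete, here is a minimal sketch using a pandas pivot table as a stand-in for an OLAP cube (an assumption; real OLAP servers pre-aggregate these views). The sales data is invented.

# Aggregate an "amount" measure along two dimensions at once, then drill down.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["West", "West", "East", "East", "West", "East"],
    "product": ["Tea", "Coffee", "Tea", "Coffee", "Tea", "Tea"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "amount":  [120, 90, 150, 80, 130, 170],
})

cube = sales.pivot_table(values="amount", index="region",
                         columns="quarter", aggfunc="sum", margins=True)
print(cube)

# "Drill down": from the regional total to product level within one region.
print(sales[sales.region == "East"].groupby("product")["amount"].sum())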

Characteristics of OLAP
 Easy-to-use end-user interface
– Easy-to-use graphical interfaces
– Familiar interfaces from previous data analysis tools
 Client-server architecture
– Provides flexibility
– Can be used on different computers
– More machines can be added

Advantages of an OLAP System
 Multi-dimensional data representation
 Consistency of information
 “What if” analysis
 Provides a single platform for all information and business needs - planning, budgeting, forecasting, reporting and analysis
 Fast and interactive ad hoc exploration
OLTP vs. OLAP

Aspect | Online Transaction Processing | Online Analytical Processing
Focus | Data in | Data out
Source of data | Operational/transactional data | Data extracted from various operational data sources, transformed and loaded into the data warehouse
Purpose of data | Manage (control and execute) basic business tasks | Assists in planning, budgeting, forecasting and decision making
Data contents | Current data. Far too detailed - not suitable for decision making | Historical data. Has support for summarization and aggregation. Stores and manages data at various levels of granularity, thereby suitable for decision making
Queries | Simple queries, often returning fewer records | Often complex queries involving aggregations
Processing speed | Usually returns fast | Queries usually take a long time (several hours) to execute and return
Access | Field-level access | Typically aggregated access to data of business interest
Database design | Typically normalized tables; OLTP systems adopt the ER (Entity Relationship) model | Typically de-normalized tables; uses star or snowflake schema
Operations | Read/write | Mostly read
Inserts and updates | Very frequent updates and inserts | Periodic updates to refresh the data warehouse

OLTP vs. OLAP (continued)

Aspect | Online Transaction Processing | Online Analytical Processing
Backup and recovery | Regular backups of operational data are mandatory. Requires concurrency control (locking) and recovery mechanisms (logging) | Instead of regular backups, the data warehouse is refreshed periodically using data from operational data sources
Joins | Many | Few
Derived data and aggregates | Rare | Common
Data structures | Complex | Multi-dimensional

Few Sample Queries

OLTP:
 Search & locate student(s)
 Print student scores
 Filter students above 90% marks

OLAP:
 Which courses have productivity impact on-the-job?
 How much training is needed on future technologies for non-linear growth in BI?
 Why consider investing in a DSS experience lab?
