combinepdf-1
Data Mining
Outline
What is Data Mining?
Demographic Segmentation
Geographic Segmentation
Behavioral Segmentation
Product or Service Segmentation
Attributes of interest are not available (e.g., customer information for sales transaction data)
Data were not considered important at the time of transactions, so they were not recorded!
1. Identify the problem
2. Use data mining techniques to transform the data into information
3. Act on the information
4. Measure the results
Data Mining Tasks
Data mining tasks are generally divided into two major categories:
Aspect | Classification | Clustering
Algorithm Types | Decision Trees, Naive Bayes, SVM, Neural Networks | K-Means, Hierarchical Clustering, DBSCAN, etc.
Training | Requires training on labeled data for each class | Unsupervised learning, no specific training
Interpretability | Results can be interpreted in terms of class labels | Interpretation is based on cluster properties
Use Case | Spam detection, medical diagnosis, image recognition | Customer segmentation, document clustering, etc.
Example | Identifying whether an email is spam or not | Grouping news articles by topic
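The contrast between training on labeled data and grouping unlabeled data can be sketched in a few lines of plain Python. This is only an illustration: the one-dimensional data, the nearest-centroid classifier, and the deterministic k-means initialization are all assumptions made for the example and are not taken from the slides.

```python
# Hypothetical 1-D feature, e.g. number of suspicious words per email.
labeled = [(1.0, "ham"), (2.0, "ham"), (8.0, "spam"), (9.0, "spam")]


def train_centroids(samples):
    """Classification (supervised): learn one mean value per class label."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(value, centroids):
    """Assign the label whose learned centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def kmeans_1d(values, k=2, iterations=10):
    """Clustering (unsupervised): group values without using any labels.

    Assumes k >= 2; initial centers are spread across the sorted values.
    """
    ordered = sorted(values)
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters


centroids = train_centroids(labeled)
unlabeled = [value for value, _ in labeled]
print(classify(7.5, centroids))
print(kmeans_1d(unlabeled))
```

The classifier returns a class label because it was given labels to learn from. The clustering function returns only groups, which still have to be interpreted from their properties.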
Association Rules
• Association rule mining finds interesting associations and relationships among large sets of data items.
• An association rule shows how frequently an itemset occurs across a set of transactions.
• Market Basket Analysis is one of the key techniques used by large retailers to show associations between items.
• It allows retailers to identify relationships between the items that people frequently buy together.
Example: Market Basket Analysis
Association Rule 1: {Bread} -> {Butter} This rule indicates that customers who
buy bread are likely to also buy butter.
Association Rule 2: {Milk, Eggs} -> {Cheese} This rule suggests that customers
who purchase both milk and eggs are likely to buy cheese as well.
Transaction ID Items Purchased
1 Bread, Milk, Butter
2 Bread, Butter, Soap
3 Milk, Bread, Eggs
4 Bread, Milk, Eggs
5 Bread, Milk, Butter, Eggs
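The two example rules can be checked against the five transactions above. The slides do not define how rule strength is measured, so the sketch below assumes the standard support and confidence measures, and the function names are chosen for this example only. Cheese does not appear in any of the five listed transactions, so Rule 2 gets a confidence of 0 on this sample.

```python
# The five transactions from the table above, one set of items per transaction.
transactions = [
    {"Bread", "Milk", "Butter"},
    {"Bread", "Butter", "Soap"},
    {"Milk", "Bread", "Eggs"},
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Milk", "Butter", "Eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the share that also
    contain the consequent: support(A and C) / support(A)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    combined = set(antecedent) | set(consequent)
    return support(combined, transactions) / base


print(support({"Bread", "Butter"}, transactions))
print(confidence({"Bread"}, {"Butter"}, transactions))
print(confidence({"Milk", "Eggs"}, {"Cheese"}, transactions))
```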
[Figure: Analyst, Data Mining Tool (software), Visualized result]
Advantages of Data Mining
• Marketing and sales are more effective
• Improved customer service
• Supply Chain Management
• Enhanced production uptime
• Better Risk Management
Basics of Data Warehouse and Data Integration
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process (William H. Inmon).
It is a large store of data and a set of processes collected into a database for the primary purpose of helping a business analyze data to make decisions.
Data Warehouse – Advantages and Limitations
Advantages
• Integration at the lowest level, eliminating need for integration queries.
• Runtime schematic cleaning is not needed – performed at the data staging environment.
• Independent of original data source.
• Query optimization is possible.
Limitations
• Process would take a considerable amount of time and effort.
• Requires an understanding of the domain.
• More scalable when accompanied with a metadata repository – increased load.
• Tightly coupled architecture.
Data Warehousing (DW)
Extract, Transform, Load (ETL)
• ETL is a process that extracts data from different source systems, transforms it (for example by applying calculations or concatenations), and finally loads it into the data warehouse system. ETL provides the foundation for data analytics and machine learning workstreams. ETL is often used by an organization to:
  – Extract data from legacy systems
  – Cleanse the data to improve data quality and establish consistency
  – Load data into a target database, usually a DW
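The three ETL stages can be sketched with Python's standard `csv` and `sqlite3` modules. This is a minimal illustration under stated assumptions: the inline CSV text stands in for a legacy-system export, an in-memory SQLite database stands in for the warehouse, and the `sales` table and the cleansing rules are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical export from a legacy system (inline so the sketch runs as-is).
LEGACY_CSV = """customer,amount
 alice ,10.50
BOB,4
alice,
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize customer names and drop rows with no amount."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip().title()
        amount = row["amount"].strip()
        if not amount:
            continue  # cleansing rule assumed for this example
        cleaned.append((name, float(amount)))
    return cleaned


def load(rows, conn):
    """Load: write the cleansed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(transform(extract(LEGACY_CSV)), conn)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping each stage in its own function mirrors the order described above. Cleansing rules can then change without touching the extraction or loading code.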
… manipulation and data structure is large. It is mainly used on the .NET platform and is always performed with C# or VB.NET. It is a much faster way of accessing the data than using Memory Stream.

Federated database | Data warehouse
Data would be present in various servers | The entire DW would be present in one server
Requires high-speed network connections | Requires no network connections
It is easier to create as compared to a DW | Its creation is not as easy as that of a federated database
Requires no creation of a new database | DW must be created from scratch
Requires a network expert to set up the network connection | Requires database experts such as a data steward
Data Integration Technologies
The technologies that are used for data integration include:
• Data interchange – a structured transmission of organizational data between two or more organizations through electronic means. Used for the transfer of electronic documents from one computer to another.
• Object brokering – an object request broker (ORB) is middleware software. It gives programmers the freedom to make calls from one computer to another over a computer network.
• Modeling techniques: Entity-Relational Modeling – an entity–relationship model (ER model) describes the structure of a database with the help of a diagram, which is known as an Entity Relationship Diagram (ER Diagram).
• Standardize, correct and normalize data
• Verify and validate data accuracy
• Apply business rules
• Deliver high-quality information
• Increase the quality of information
• Match, link and consolidate multiple data sources
• Gain access to the right data sources at the right time
• Financial Analytics
• HR Analytics
• Marketing Analytics
• Health Care Analytics
• Web Analytics
• Supply Chain Analytics
• Analytics for Government and Nonprofits
• Sports Analytics

• Descriptive analytics: the use of data to understand past and current business performance and make informed decisions
• Predictive analytics: predict the future by examining historical data, detecting patterns or relationships in these data, and then extrapolating these relationships forward in time.
• Prescriptive analytics: identify the best alternatives to minimize or maximize some objective
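The three kinds of analytics can be contrasted on one small data set. Everything in this sketch is an assumption made for illustration: the monthly sales figures, the straight-line trend used for prediction, and the linear demand curve and candidate prices used for the prescriptive step.

```python
sales = [100.0, 110.0, 120.0, 130.0]  # hypothetical units sold in months 1-4

# Descriptive: summarize past performance.
average = sum(sales) / len(sales)

# Predictive: fit a least-squares line to the history and extrapolate one month.
n = len(sales)
months = list(range(1, n + 1))
month_mean = sum(months) / n
slope = sum((m - month_mean) * (s - average) for m, s in zip(months, sales)) / sum(
    (m - month_mean) ** 2 for m in months
)
intercept = average - slope * month_mean
forecast = intercept + slope * (n + 1)


# Prescriptive: pick the alternative that maximizes an objective (profit).
def profit(price, unit_cost=4.0):
    demand = 200.0 - 10.0 * price  # assumed linear demand, purely illustrative
    return (price - unit_cost) * demand


candidate_prices = [6.0, 8.0, 10.0, 12.0, 14.0]
best_price = max(candidate_prices, key=profit)

print(average, forecast, best_price)
```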
Data for Business Analytics
• Data: numbers or textual data that are collected through some type of measurement process
• Information: result of analyzing data; that is, extracting meaning from data to support evaluation and decision making
Examples of Data Sources and Uses
• Annual reports
• Accounting audits
• Financial profitability analysis
• Economic trends
• Marketing research
• Operations management performance
• Human resource measurements
• Web behavior
  – page views, visitor’s country, time of view, length of time, origin and destination paths, products they searched for and viewed, products purchased, what reviews they read, and many others
• Model - an abstraction or representation of a real system, idea, or object.
  – Captures the most important features
  – Can be a written or verbal description, a visual representation, a mathematical formula, or a spreadsheet.
Example - Three Forms of a Model
The sales of a new product, such as a first-generation iPad or 3D television, often follow a common pattern.
1. Verbal description: The rate of sales starts small as early adopters begin to evaluate a new product and then begins to grow at an increasing rate over time as positive customer feedback spreads. Eventually, the market begins to become saturated and the rate of sales begins to decrease.
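The verbal description of the sales pattern can also be written as a mathematical model. The formula used in the original slides did not survive extraction, so the sketch below substitutes a logistic S-curve as an assumed stand-in. The parameter names and values (`market_size`, `growth`, `midpoint`) are hypothetical.

```python
import math


def cumulative_sales(t, market_size=1000.0, growth=1.0, midpoint=5.5):
    """Logistic S-curve: cumulative units sold by time t (assumed form)."""
    return market_size / (1.0 + math.exp(-growth * (t - midpoint)))


# Sales per period = change in cumulative sales between consecutive periods.
rates = [cumulative_sales(t + 1) - cumulative_sales(t) for t in range(10)]
peak_period = rates.index(max(rates))

for period, rate in enumerate(rates):
    print(period, round(rate, 1))
```

The per-period sales start small, rise to a peak around the midpoint, and then decline as the curve approaches the assumed market size. This matches the three phases in the verbal description.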
Models cannot capture every detail of the real problem. Managers must understand the limitations of models and their underlying assumptions and often incorporate judgment into making a decision.
Translate the results of the model back to the real world. Requires providing adequate resources, motivating employees, eliminating resistance to change, modifying organizational policies, and developing trust.
Reporting Perspectives Common to all Levels of Enterprise
• Function level: Reports generated at the function level may be consumed by users within a department, geographic location or region, or by decision makers at the corporate level.
• Internal/external: Sometimes the consumers of reports may be external to the enterprise.
• Role-based: Provide a standard report format to similar roles across the enterprise, as they are likely to make similar decisions.
• Strategic/operational: Reports can also be classified by the purpose they serve. Strategic reports inform the alignment with goals, whereas operational reports present transaction facts.
• Summary/detail: Summary reports do not provide transaction-level information, whereas detailed reports list atomic facts.
• Standard/ad hoc: A company may generate periodic reports, say, weekly, monthly, or quarterly, in standard formats. Executives often need ad hoc or on-demand reports for critical business decision making.
• Purpose: Enterprises classify reports as statutory when they focus on business transparency and need to be shared with regulatory bodies. Analytical reports look into a particular area of operation, representing interpretations of large data in the form of graphs.
• Technology platform-centric: Reporting in today’s context need not use paper at all. Dashboards could be delivered on smartphones and tablets. Reports could be published in un-editable (secure) form with watermarks. Reports could be protected so that they can be used only by a specific person, during specific hours, from a specific device.
Report Standardization and Presentation Practices
Enterprises tend to standardize reporting from several perspectives. Some report standardization perspectives are as follows:
• Data standardization
• Content standardization
• Presentation standardization
• Metrics standardization
• Reporting tools’ standardization
Enterprise Reporting Characteristics in OLAP World
• Traditional strategy deployment and communication follows the functional hierarchy.
• Strategy formulation involves a little planning and lots of budgeting.
• Only 5% of the workforce understands the strategy, 85% of executives spend less than 1 hour per month discussing strategy, and only 25% of managers have incentives linked to strategy (Fortune Magazine Survey).
• 35% of stakeholder valuation decisions are based on non-financial data:
  – Strategy Execution
  – Management Credibility
  – Innovation
  – Ability to Attract Talent
Traditional Balanced Scorecard (Source: Kaplan & Norton, 1992)
The Balanced Scorecard
Purchasing Dashboards
Supply Chain Dashboards
Operations Dashboards
Manufacturing Dashboards
Quality Control Dashboards
Marketing Dashboards
Sales Dashboards
Finance Dashboards
Human Resources Dashboard
Benefits of Enterprise Dashboard
Dashboards have the following benefits:
• Places all critical information in just one screen rather than flipping through the pages
• Improved decision making
Steps in Creating a Dashboard
Identify the data that will go into an Enterprise Dashboard
Enterprise Dashboards can contain either/both of the below mentioned data:
• Quantitative data
Batch Processing
Where information is required in batch
Offline access to information
Presorting (sequence) is applied
Takes time to process information
Characteristics of OLTP Model
• Online connectivity - LAN, WAN
• Availability - Available 24 hours a day
• Response rate
  – Rapid response rate
  – Load balancing by prioritizing the transactions
• Cost
  – Cost of transactions is less
• Update facility
  – Less lock periods
  – Instant updates
  – Use the full potential of hardware and software
Limitations of Relational Models
• Create and maintain large number of tables for the voluminous data
• For new functionalities, new tables are added
• Unstructured data cannot be stored in relational databases
• Very difficult to manage the data with common denominator (keys)
Queries that an OLTP System can Process
• Retrieve the product description and unit price of a particular product.
• Filter all products with a unit price equal to or above Rs. 25.
• Filter all products supplied by a particular supplier.
• Search and display the record of a particular supplier.
Advantages and Challenges of an OLTP System
• Simplicity – It is designed typically for use by clerks, cashiers, clients, etc.
• Efficiency – It allows its users to read, write and delete data quickly.
• Fast query processing – It responds to user actions immediately and also supports transaction processing on demand.
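The four example queries that an OLTP system can process map directly onto simple SQL lookups. The sketch below runs them against an in-memory SQLite database using Python's standard `sqlite3` module. The `products` and `suppliers` tables, their columns, and the sample rows are hypothetical and chosen only to make the queries concrete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE suppliers (supplier_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    description TEXT,
    unit_price REAL,
    supplier_id INTEGER REFERENCES suppliers(supplier_id)
);
INSERT INTO suppliers VALUES (1, 'Acme Foods'), (2, 'Fresh Farms');
INSERT INTO products VALUES
    (101, 'Bread', 20.0, 1),
    (102, 'Butter', 45.0, 2),
    (103, 'Milk', 25.0, 2);
""")

# Retrieve the description and unit price of a particular product.
one_product = conn.execute(
    "SELECT description, unit_price FROM products WHERE product_id = ?", (102,)
).fetchone()

# Filter all products with a unit price equal to or above Rs. 25.
at_least_25 = conn.execute(
    "SELECT description FROM products WHERE unit_price >= 25 ORDER BY description"
).fetchall()

# Filter all products supplied by a particular supplier.
from_supplier = conn.execute(
    "SELECT description FROM products WHERE supplier_id = ? ORDER BY description",
    (2,),
).fetchall()

# Search and display the record of a particular supplier.
supplier = conn.execute(
    "SELECT * FROM suppliers WHERE name = ?", ("Acme Foods",)
).fetchone()

print(one_product, at_least_25, from_supplier, supplier)
```

Each query touches a few rows identified by a key or a simple filter, which is the access pattern OLTP systems are built for.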
Challenges of an OLTP System
• Security – An OLTP system requires concurrency control (locking) and recovery mechanisms (logging).
• OLTP system data content not suitable for decision making – A typical OLTP system manages the current data within an enterprise/organization. This current data is far too detailed to be easily used for decision making.
The Queries that OLTP Cannot Answer
• The supermarket is deciding on introducing a new product. The key questions they are debating are: “Which product should they introduce?” and “Should it be specific to a few customer segments?”
• The supermarket is looking at offering some discount on its year-end sale. The questions here are: “How much discount should they offer?” and “Should there be different discounts for different customer segments?”
• All the queries stated above have more to do with analysis than simple reporting.
Ideally these queries are not meant to be solved by an OLTP system.
Aspect | OLTP | OLAP (data warehouse)
Focus | Data in | Data out
Source of data | Operational/Transactional Data | Data extracted from various operational data sources, transformed and loaded into the data warehouse
Purpose of data | Manage (control and execute) basic business tasks | Assists in planning, budgeting, forecasting and decision making
Data contents | Current data. Far too detailed – not suitable for decision making | Historical data. Has support for summarization and aggregation. Stores and manages data at various levels of granularity, thereby suitable for decision making
Queries | Simple queries, often returning fewer records | Often complex queries involving aggregations
Processing speed | Usually returns fast | Queries usually take a long time (several hours) to execute and return
Access | Field level access | Typically aggregated access to data of business interest
Database Design | Typically normalized tables. OLTP system adopts ER (Entity Relationship) model | Typically de-normalized tables; uses star or snowflake schema
Operations | Read/Write | Mostly read
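The analytical side of the comparison, aggregated access over a star-style schema, can be sketched with an in-memory SQLite database. The fact and dimension tables (`fact_sales`, `dim_customer`), their columns, and the sample figures are hypothetical. A real warehouse would hold far more history and more dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: one row per customer, with a descriptive attribute.
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, segment TEXT);
-- Fact table: one row per sale, keyed to the dimension.
CREATE TABLE fact_sales (customer_id INTEGER, year INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'Retail'), (2, 'Retail'), (3, 'Wholesale');
INSERT INTO fact_sales VALUES
    (1, 2023, 100.0), (2, 2023, 50.0), (3, 2023, 400.0),
    (1, 2024, 120.0), (3, 2024, 380.0);
""")

# Aggregate historical sales by customer segment and year (mostly-read access).
by_segment = conn.execute("""
    SELECT d.segment, f.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS d ON d.customer_id = f.customer_id
    GROUP BY d.segment, f.year
    ORDER BY d.segment, f.year
""").fetchall()

for row in by_segment:
    print(row)
```

Unlike the field-level lookups typical of OLTP, this query scans the whole fact table and returns summaries per customer segment. Summaries of that kind are what segment-specific discount questions need.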