introduction to DataWarehouse and DataMining

IIIRD CSE SEM I DWDM NOTES

Uploaded by P.Padmini Rani

DATAWAREHOUSE INTRODUCTION

What is Data and Information?


Data is an individual unit of raw material that does not carry any specific meaning on its own.
Information is a group of data that collectively carries a logical meaning.
Data does not depend on information; information depends on data.
Data is measured in bits and bytes; information is measured in meaningful units such as time, quantity, etc.

Data Warehouse:
A data warehouse is like a relational database, but designed for analytical needs. It functions on the
basis of OLAP (Online Analytical Processing). It is a central location where consolidated data
from multiple sources (databases) is stored.

What is Data warehousing?


Data warehousing is the act of organizing and storing data so as to make its retrieval
efficient and insightful. It can also be described as the process of transforming data into information.
Fig: Data warehousing Process

Data Warehouse Characteristics:


A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision-making process.
Subject-oriented:
A Data warehouse can be used to analyze a particular subject area
Ex: “Sales” can be particular subject
Integrated:
A Data warehouse integrates data from multiple data sources.
Time Variant:
Historical data is kept in a data warehouse.
Ex: one can retrieve data from 3 months, 6 months, 12 months, or even older from a
data warehouse. This contrasts with a transaction system, where often only the most recent
data is kept.
Non-Volatile:
Once data is in the data warehouse, it will not change. So historical data in a data warehouse
should never be altered.
Data warehouse Architecture:

Fig: Data ware housing Architecture


Data warehouses often adopt a three-tier architecture
 The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities feed data into the bottom tier from
operational databases or other external sources. These tools and utilities perform data
extraction, cleaning and transformation (e.g., merging similar data from different
sources into a unified format), as well as load and refresh functions to update the data
warehouse. The data are extracted using application program interfaces known as
gateways. A gateway is supported by the underlying DBMS and allows client programs
to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object
Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
 The middle tier is an OLAP server that is typically implemented using either
(a) a relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations, or
(b) a multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly
implements multidimensional data and operations.
 The top tier is a front-end client layer, which contains query and reporting tools, analysis tools
and data mining tools (e.g., trend analysis, prediction).

Multi-dimensional Data Model:


 A multidimensional model views data in the form of a data cube.
 Data grouped or combined into multidimensional matrices is called a data cube.
 A data cube enables data to be modeled and viewed in multiple dimensions.
It is defined by dimensions and facts.
 A multidimensional data model is organized around a central theme, for example, sales.
This theme is represented by a fact table. Facts are numerical measures.
 The fact table contains the names of the facts, or measures, of the related dimension
tables.
FACT VS DIMENSION
Fact/Measure (what you want to analyze is your fact)
Ex: What are my sales? What is my profit? What are my customers' preferences?
Dimension (by which you want to analyze is your dimension)
Ex: Sales by location/product/period
Total profit by location/product/period
 These Dimensions allow the store to keep track of things like monthly sales of items and
branches and locations at which the items were sold.
 Each dimension may have a table associated with it called a dimension table, which
further describes the dimension.
Fig: Multidimensional Representation
 Consider the data of a shop for items sold per quarter in the city of Delhi. The data is
shown in the table.
 In this 2D representation, the sales for Delhi are shown for the time dimension
(organized in quarters) and the item dimension (classified according to the types of
items sold).
 The fact or measure displayed is rupees_sold (in thousands).

Now, suppose we want to view the sales data with a third dimension. For example, the data
according to time and item, as well as location, is considered for the cities Chennai, Kolkata,
Mumbai, and Delhi. These 3D data are shown in the table, where the 3D data are
represented as a series of 2D tables.
Conceptually, it may also be represented by the same data in the form of a 3D data cube, as
shown in fig:
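The cube described above can be sketched as a mapping from (location, time, item) coordinates to the measure; the cities and figures below are invented for illustration, not taken from the tables in the text.

```python
# A 3D data cube sketched as a dict: (location, time, item) -> rupees_sold
# (in thousands). All numbers are made up for illustration.
cube = {
    ("Delhi",   "Q1", "Mobile"): 605, ("Delhi",   "Q1", "Modem"): 825,
    ("Delhi",   "Q2", "Mobile"): 680, ("Delhi",   "Q2", "Modem"): 952,
    ("Chennai", "Q1", "Mobile"): 818, ("Chennai", "Q1", "Modem"): 746,
    ("Chennai", "Q2", "Mobile"): 894, ("Chennai", "Q2", "Modem"): 769,
}

# Fixing one dimension value (location = "Delhi") recovers the 2D table of
# time x item for that city -- the "series of 2D tables" view from the text.
delhi_2d = {(t, i): v for (loc, t, i), v in cube.items() if loc == "Delhi"}
print(delhi_2d[("Q1", "Modem")])   # 825
```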

What is Schema?
 Schema is a logical description of the entire database.
 It includes the name and description of records of all record types including all
associated data-Items and aggregates.
 Much like a database, a data warehouse also requires a schema to be maintained.
 A database uses the relational model, while a data warehouse uses the Star, Snowflake, and
Fact Constellation schemas.
 Modeling data warehouses: dimensions & measures
 Star schema: A fact table in the middle connected to a set of dimension tables
 Snowflake schema: A refinement of star schema where some dimensional hierarchy is
normalized into a set of smaller dimension tables, forming a shape similar to snowflake
 Fact constellations: Multiple fact tables share dimension tables, viewed as a collection
of stars, therefore called galaxy schema or fact constellation

Star Schema:
 A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions.
 A fact is an event that is counted or measured, such as a sale or a log-in. A dimension
contains reference data about the fact, such as date, item, or customer.
 A star schema is a relational schema whose design represents a multidimensional data
model.
 The star schema is the simplest data warehouse schema. It is known as a star
schema because the entity-relationship diagram of the schema resembles a star, with
points diverging from a central table.
 The center of the schema consists of a large fact table, and the points of the star are the
dimension tables.

Fig: Star Schema Representation


Star Schema:
 Each dimension in a star schema is represented with only one dimension table.
 This dimension table contains the set of attributes.
 The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.

 There is a fact table at the center. It contains the keys to each of the four dimensions.
 The fact table also contains the attributes, namely dollars sold and units sold.
 Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key,
street, city, province_or_state,country}. This constraint may cause data redundancy. For
example, "Vancouver" and "Victoria" both the cities are in the Canadian province of
British Columbia. The entries for such cities may cause data redundancy along the
attributes province_or_state and country.
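A minimal sketch of the star schema above using SQLite: one fact table with foreign keys into denormalized dimension tables. Table and column names follow the text (with the location dimension trimmed for brevity); all data values are invented.

```python
import sqlite3

# Star schema sketch: fact_sales at the center, dimension tables at the points.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY,
                           city TEXT, province_or_state TEXT, country TEXT);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE fact_sales   (location_key INTEGER, item_key INTEGER,
                           dollars_sold REAL, units_sold INTEGER);
""")
# Note the redundancy the text warns about: province and country repeat.
con.executemany("INSERT INTO dim_location VALUES (?,?,?,?)",
                [(1, "Vancouver", "British Columbia", "Canada"),
                 (2, "Victoria",  "British Columbia", "Canada")])
con.executemany("INSERT INTO dim_item VALUES (?,?)",
                [(10, "Mobile"), (11, "Modem")])
con.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(1, 10, 500.0, 5), (1, 11, 300.0, 3), (2, 10, 200.0, 2)])

# A typical star join: fact joined to one dimension, aggregated by city.
rows = con.execute("""
SELECT l.city, SUM(f.dollars_sold)
FROM fact_sales f JOIN dim_location l ON f.location_key = l.location_key
GROUP BY l.city ORDER BY l.city
""").fetchall()
print(rows)   # [('Vancouver', 800.0), ('Victoria', 200.0)]
```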

Characteristics of Star Schema:


 Every dimension in a star schema is represented with only one dimension table.
 The dimension table contains the set of attributes.
 The dimension table is joined to the fact table using a foreign key.
 The dimension tables are not joined to each other.
 The fact table contains keys and measures.
 The star schema is easy to understand and provides optimal disk usage.
 The dimension tables are not normalized. For instance, in the above figure, Country_ID
does not have a Country lookup table, as an OLTP design would have.
 The schema is widely supported by BI Tools.
 Advantages:
 (i) Simplest and Easiest
 (ii) It optimizes navigation through database
 (iii) Most suitable for Query Processing

Snowflake Schema:
 Some dimension tables in the Snowflake schema are normalized.
 The normalization splits up the data into additional tables.
 Unlike the star schema, the dimension tables in a snowflake schema are normalized.
 For example, the item dimension table in star schema is normalized and split into two
dimension tables, namely item and supplier table.

 Now the item dimension table contains the attributes item_key, item_name, type,
brand, and supplier-key.
 The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.
Note: Due to normalization in the snowflake schema, redundancy is reduced; therefore,
it becomes easier to maintain and saves storage space.
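The item/supplier split described above can be sketched in SQLite: supplier attributes are normalized out of the item dimension into their own table, at the cost of one extra join. Attribute names follow the text; the data values are invented.

```python
import sqlite3

# Snowflake refinement of the item dimension: dim_supplier is split out of
# dim_item, so supplier_type is stored once instead of once per item.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT,
                           type TEXT, brand TEXT,
                           supplier_key INTEGER REFERENCES dim_supplier);
""")
con.execute("INSERT INTO dim_supplier VALUES (1, 'wholesale')")
con.executemany("INSERT INTO dim_item VALUES (?,?,?,?,?)",
                [(10, 'Mobile', 'electronics', 'BrandA', 1),
                 (11, 'Modem',  'electronics', 'BrandB', 1)])

# Reaching supplier_type from an item now costs one extra join
# compared with a denormalized star dimension.
row = con.execute("""
SELECT i.item_name, s.supplier_type
FROM dim_item i JOIN dim_supplier s ON i.supplier_key = s.supplier_key
WHERE i.item_key = 10
""").fetchone()
print(row)   # ('Mobile', 'wholesale')
```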

Fig: Snowflake image


A snowflake schema can have any number of dimensions, and each dimension can have any
number of levels.
The following diagram shows a snowflake schema with two dimensions, each having three
levels.

Advantages:
(i) Less redundancy due to normalized dimension tables.
(ii) Dimension tables are easier to update.
Disadvantages:
It is a more complex schema compared to the star schema.

Fact Constellation Schema:


 A Fact constellation means two or more fact tables sharing one or more dimensions. It is
also called Galaxy schema.
 Fact Constellation Schema describes a logical structure of a data warehouse or data mart.
It can be designed with a collection of de-normalized fact, shared, and conformed
dimension tables.
A fact constellation schema is shown in the figure below.

 This schema defines two fact tables, sales, and shipping. Sales are treated along four
dimensions, namely, time, item, branch, and location.
 The schema contains a fact table for sales that includes keys to each of the four
dimensions, along with two measures: Rupee_sold and units_sold.
 The shipping table has five dimensions, or keys: item_key, time_key, shipper_key,
from_location, and to_location, and two measures: Rupee_cost and units_shipped.
 It is also possible to share dimension tables between fact tables. For example, time,
item, and location dimension tables are shared between the sales and shipping fact
table.

Disadvantages:
(i) Complex due to multiple fact tables
(ii) It is difficult to manage
(iii) Dimension Tables are very large.

OLAP OPERATIONS:
 In the multidimensional model, the records are organized into various dimensions, and
each dimension includes multiple levels of abstraction described by concept hierarchies.
 This organization provides users with the flexibility to view data from various
perspectives.
 A number of OLAP data cube operations exist to materialize these different views,
allowing interactive querying and analysis of the records at hand. Hence, OLAP supports a
user-friendly environment for interactive data analysis.
 Consider the OLAP operations to be performed on multidimensional data.
 Take the data cube for sales of a shop. The cube contains the dimensions location, time
and item, where location is aggregated with regard to city values, time is
aggregated with respect to quarters, and item is aggregated with respect to item
types.
OLAP has five different operations:

(i) Roll-up
(ii) Drill-down
(iii) Slice
(iv) Dice
(v) Pivot

Roll-up:
 The roll-up operation performs aggregation on a data cube, either by climbing up a
concept hierarchy or by dimension reduction. Roll-up is like zooming out on the data cube.
 It is also known as the drill-up or aggregation operation.
 The figure shows the result of a roll-up operation performed on the dimension location. The
hierarchy for location is defined as the order street < city < province or state < country.
 The roll-up operation aggregates the data by ascending the location hierarchy from the
level of city to the level of country.
 When a roll-up is performed by dimension reduction, one or more dimensions are
removed from the cube.
 For example, consider a sales data cube having two dimensions, location and time. Roll-
up may be performed by removing the time dimension, resulting in an aggregation of
total sales by location, rather than by location and by time.
Fig: Roll-up operation on Data Cube
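The city-to-country roll-up can be sketched in a few lines: city-level sales climb one step of the location hierarchy. The cities and figures below are invented.

```python
# Roll-up sketch: aggregate city-level sales up the location hierarchy
# (city -> country). All figures are invented.
sales_by_city = {"Toronto": 395, "Vancouver": 605, "Chennai": 818, "Delhi": 825}
country_of = {"Toronto": "Canada", "Vancouver": "Canada",
              "Chennai": "India", "Delhi": "India"}

rolled_up = {}
for city, amount in sales_by_city.items():
    country = country_of[city]                     # ascend the hierarchy
    rolled_up[country] = rolled_up.get(country, 0) + amount

print(rolled_up)   # {'Canada': 1000, 'India': 1643}
```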

Drill-Down
 The drill-down operation is the reverse operation of roll-up.
 It is also called roll-down operation.
 Drill-down is like zooming-in on the data cube.
 It navigates from less detailed data to more detailed data. Drill-down can be
performed by either stepping down a concept hierarchy for a dimension or adding
additional dimensions.
 Figure shows a drill-down operation performed on the dimension time by stepping
down a concept hierarchy which is defined as day, month, quarter, and year.
 Drill-down occurs by descending the time hierarchy from the level of quarter to the
more detailed level of month.
 Because a drill-down adds more details to the given data, it can also be performed by
adding a new dimension to a cube.
Fig: Drill-down operation
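A sketch of the quarter-to-month drill-down, assuming the warehouse keeps month-level detail; the month names and figures are invented.

```python
# Drill-down sketch: the warehouse stores monthly detail, so a quarterly view
# can step back down the time hierarchy to months. Figures invented.
monthly = {"Jan": 150, "Feb": 100, "Mar": 150, "Apr": 200, "May": 180, "Jun": 220}
quarter_of = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
              "Apr": "Q2", "May": "Q2", "Jun": "Q2"}

# The coarser, quarter-level (rolled-up) view:
quarterly = {}
for month, amount in monthly.items():
    q = quarter_of[month]
    quarterly[q] = quarterly.get(q, 0) + amount

# Drilling down on Q1 recovers the finer, month-level records:
q1_detail = {m: v for m, v in monthly.items() if quarter_of[m] == "Q1"}
print(quarterly["Q1"], q1_detail)   # 400 {'Jan': 150, 'Feb': 100, 'Mar': 150}
```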

Slice:
 A slice is a subset of the cube corresponding to a single value for one or more members
of a dimension.
 The slice operation produces a new sub-cube by selecting on one particular dimension
of a given cube.
 For example, a slice operation is executed when the user wants a selection on one
dimension of a three-dimensional cube, resulting in a two-dimensional sub-cube. So, the
slice operation performs a selection on one dimension of the given cube, thus resulting
in a sub-cube.
 Here, slice is performed on the dimension "time" using the criterion time = "Q1".
Fig: Slice operation
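The time = "Q1" slice can be sketched as a filter that fixes one dimension, leaving a 2D sub-cube over (location, item); figures are invented.

```python
# Slice sketch: select a single value on one dimension (time = "Q1").
# The result is a 2D sub-cube over (location, item). Figures invented.
cube = {
    ("Delhi", "Q1", "Mobile"): 605, ("Delhi", "Q2", "Mobile"): 680,
    ("Delhi", "Q1", "Modem"):  825, ("Delhi", "Q2", "Modem"):  952,
}
slice_q1 = {(loc, item): v for (loc, t, item), v in cube.items() if t == "Q1"}
print(slice_q1)   # {('Delhi', 'Mobile'): 605, ('Delhi', 'Modem'): 825}
```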

Dice:
 The dice operation defines a sub-cube by performing a selection on two or more
dimensions.
Fig: Dice operation

 The dice operation on the cube is based on the following selection criteria, which
involve three dimensions:
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item =" Mobile" or "Modem")
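The three-dimension selection above can be sketched as one filter over the cube; the figures are invented, and only cells matching all three criteria survive.

```python
# Dice sketch: select on three dimensions at once, using the criteria
# from the text. Figures invented.
cube = {
    ("Toronto",   "Q1", "Mobile"): 395, ("Toronto",   "Q3", "Mobile"): 450,
    ("Vancouver", "Q2", "Modem"):  512, ("Delhi",     "Q1", "Mobile"): 605,
}
diced = {k: v for k, v in cube.items()
         if k[0] in ("Toronto", "Vancouver")    # location criterion
         and k[1] in ("Q1", "Q2")               # time criterion
         and k[2] in ("Mobile", "Modem")}       # item criterion
print(diced)   # {('Toronto', 'Q1', 'Mobile'): 395, ('Vancouver', 'Q2', 'Modem'): 512}
```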

Pivot:
 The pivot operation is also called rotation.
 Pivot is a visualization operation that rotates the data axes in view to provide an
alternative presentation of the data.
 It may involve swapping the rows and columns, or moving one of the row dimensions
into the column dimensions.

Fig: Pivot Operation
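The row/column swap described above can be sketched as a transpose of a 2D view; figures are invented.

```python
# Pivot (rotation) sketch: swap the row and column dimensions of a 2D view,
# here time-by-item becomes item-by-time. Figures invented.
table = {"Q1": {"Mobile": 605, "Modem": 825},
         "Q2": {"Mobile": 680, "Modem": 952}}

pivoted = {}
for row, cols in table.items():
    for col, value in cols.items():
        pivoted.setdefault(col, {})[row] = value

print(pivoted["Mobile"])   # {'Q1': 605, 'Q2': 680}
```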


Parallel DBMS Vendors:

What is a DBMS vendor?

A DBMS (Database Management System) vendor supplies software that controls the organization,
storage, retrieval, security and integrity of data in a database.

The major DBMS vendors are Oracle, IBM, Microsoft and Sybase (see Oracle Database, DB2, SQL Server
and ASE).

DBMS                               Vendor                            Type   Primary Market
Access (Jet, MSDE)                 Microsoft                         R      Desktop
Adabas D                           Software AG                       R      Enterprise
Adaptive Server Anywhere           Sybase                            R      Mobile/Embedded
Adaptive Server Enterprise         Sybase                            R      Enterprise
Advantage Database Server          Extended Systems                  R      Mobile/Enterprise
Datacom                            Computer Associates               R      Enterprise
DB2 Everyplace                     IBM                               R      Mobile
Filemaker                          FileMaker Inc.                    R      Desktop
IDMS                               Computer Associates               R      Enterprise
Ingres ii                          Computer Associates               R      Enterprise
Interbase                          Inprise (Borland)                 R      Open Source
MySQL                              Freeware                          R      Open Source
NonStop SQL                        Tandem                            R      Enterprise
Pervasive.SQL 2000 (Btrieve)       Pervasive Software                R      Embedded
Pervasive.SQL Workgroup            Pervasive Software                R      Enterprise (Windows 32)
Progress                           Progress Software                 R      Mobile/Embedded
Quadbase SQL Server                Quadbase Systems, Inc.            R      Enterprise
R:Base                             R:Base Technologies               R      Enterprise
Rdb                                Oracle                            R      Enterprise
Red Brick                          Informix (Red Brick)              R      Enterprise (Data Warehousing)
SQL Server                         Microsoft                         R      Enterprise
SQLBase                            Centura Software                  R      Mobile/Embedded
SUPRA                              Cincom                            R      Enterprise
Teradata                           NCR                               R      VLDB (Data Warehousing)
YARD-SQL                           YARD Software Ltd.                R      Enterprise
TimesTen                           TimesTen Performance Software     R      In-Memory
Adabas                             Software AG                       XR     Enterprise
Model 204                          Computer Corporation of America   XR     VLDB
UniData                            Informix (Ardent)                 XR     Enterprise
UniVerse                           Informix (Ardent)                 XR     Enterprise
Cache'                             InterSystems                      OR     Enterprise
Cloudscape                         Informix                          OR     Mobile/Embedded
DB2                                IBM                               OR     Enterprise/VLDB
Informix Dynamic Server 2000       Informix                          OR     Enterprise
Informix Extended Parallel Server  Informix                          OR     VLDB (Data Warehousing)
Oracle Lite                        Oracle                            OR     Mobile
Oracle 8i                          Oracle                            OR     Enterprise
PointBase Embedded                 PointBase                         OR     Embedded
PointBase Mobile                   PointBase                         OR     Mobile
PointBase Network Server           PointBase                         OR     Enterprise
PostgreSQL                         Freeware                          OR     Open Source
UniSQL                             Cincom                            OR     Enterprise
Jasmine ii                         Computer Associates               OO     Enterprise
Object Store                       Exceleron                         OO     Enterprise
Objectivity DB                     Objectivity                       OO     VLDB (Scientific)
POET Object Server Suite           Poet Software                     OO     Enterprise
Versant                            Versant Corporation               OO     Enterprise
Raima Database Manager             Centura Software                  RN     Mobile/Embedded
Velocis                            Centura Software                  RN     Enterprise/Embedded
Db.linux                           Centura Software                  RNH    Open Source/Mobile/Embedded
Db.star                            Centura Software                  RNH    Open Source/Mobile/Embedded
Types of Data Warehouse:
There are three main types of DWH. Each has its specific role in data management operations.

1. Enterprise Data Warehouse


Enterprise data warehouse (EDW) serves as a central or main database to facilitate decision-
making throughout the enterprise. Key benefits of having an EDW include access to cross-
organizational information, the ability to run complex queries, and the enablement of enriched,
far-sighted insights for data-driven decisions and early risk assessment.

2. ODS (Operational Data Store)


In ODS, the DWH refreshes in real time. Therefore, organizations often use it for routine
enterprise activities, such as storing records of employees. Business processes also use ODS
as a source for providing data to the EDW.

3. Data Mart
It is a subset of a DWH that supports a particular department, region, or business unit. Consider
this: you have multiple departments, including sales, marketing, product development, etc.
Each department has a central repository where it stores its data. This repository is called a
data mart. Data from the data marts is staged in the ODS on a daily/weekly (or as-configured)
basis; the ODS acts as a staging area for data integration. It then sends the data to
the EDW, which stores it and uses it for BI purposes.

DATAWAREHOUSE COMPONENTS:
The data warehouse is based on an RDBMS server, which is a central information repository
surrounded by some key components that make the entire environment functional,
manageable and accessible.
There are mainly five components of Data Warehouse:

DATA WAREHOUSE DATABASE:


The central database is the foundation of the data warehousing environment. This database is
implemented on RDBMS technology. However, this kind of implementation is constrained
by the fact that a traditional RDBMS is optimized for transactional database processing
and not for data warehousing. For instance, ad-hoc queries, multi-table joins, and aggregates
are resource-intensive and slow down performance.
Hence, alternative approaches to Database are used as listed below-
In a data warehouse, relational databases are deployed in parallel to allow for scalability.
Parallel relational databases also allow shared memory or shared nothing model on various
multiprocessor configurations or massively parallel processors.
New index structures are used to bypass relational table scan and improve speed.
Use of multidimensional database (MDDBs) to overcome any limitations which are placed
because of the relational data model. Example: Essbase from Oracle.

SOURCING, ACQUISITION, CLEAN-UP AND TRANSFORMATION TOOLS


(ETL):
The data sourcing, transformation, and migration tools perform all the conversions,
summarizations, and changes needed to transform data into a unified format in the
data warehouse. They are also called Extract, Transform and Load (ETL) tools.
These Extract, Transform, and Load tools may generate cron jobs, background jobs, COBOL
programs, shell scripts, etc. that regularly update data in the data warehouse. These tools are
also helpful in maintaining the metadata.
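The extract-transform-load flow described above can be sketched as three small functions; the source layouts, field names, and exchange rate below are invented for illustration.

```python
# Minimal ETL sketch: pull records from two heterogeneous sources,
# transform them into one unified format, and load the result.
source_a = [{"cust": "Ravi", "amt_rs": 1500}]          # amounts in rupees
source_b = [{"customer_name": "Meena", "amount": 20}]  # amounts in dollars

def extract():
    """Extract: read raw rows from each source."""
    return list(source_a), list(source_b)

def transform(rows_a, rows_b, usd_to_rs=80):
    """Transform: merge differing layouts and currencies into one format."""
    unified = []
    for r in rows_a:
        unified.append({"customer": r["cust"], "amount_rs": r["amt_rs"]})
    for r in rows_b:
        unified.append({"customer": r["customer_name"],
                        "amount_rs": r["amount"] * usd_to_rs})
    return unified

def load(rows, warehouse):
    """Load: append the unified rows into the warehouse store."""
    warehouse.extend(rows)

warehouse = []
load(transform(*extract()), warehouse)
print(warehouse)
# [{'customer': 'Ravi', 'amount_rs': 1500}, {'customer': 'Meena', 'amount_rs': 1600}]
```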
These ETL Tools have to deal with challenges of Database & Data heterogeneity.

METADATA:
The name metadata suggests some high-level technological concept. However, it is quite
simple. Metadata is data about data, which defines the data warehouse. It is used for building,
maintaining and managing the data warehouse.
In the Data Warehouse Architecture, meta-data plays an important role as it specifies the
source, usage, values, and features of data warehouse data. It also defines how data can be
changed and processed. It is closely connected to the data warehouse.

QUERY TOOLS:
One of the primary objectives of data warehousing is to provide information to businesses for
making strategic decisions. Query tools allow users to interact with the data warehouse system.
These tools fall into four different categories:
Query and reporting tools
Application Development tools
Data mining tools
OLAP tools

Characteristics of OLAP:
The main characteristics of OLAP are as follows:
Multidimensional conceptual view: OLAP systems let business users have a dimensional and
logical view of the data in the data warehouse. It helps in carrying out slice and dice operations.
Multi-User Support: Since OLAP systems are shared, the OLAP operation should provide
normal database operations, including retrieval, update, concurrency control, integrity, and
security.
Accessibility: OLAP acts as a mediator between data warehouses and front-end. The OLAP
operations should be sitting between data sources (e.g., data warehouses) and an OLAP front-
end.
Storing OLAP results: OLAP results are kept separate from data sources.
Uniform reporting performance: Increasing the number of dimensions or the database size
should not significantly degrade the reporting performance of the OLAP system.
OLAP provides for distinguishing between zero values and missing values so that aggregates are
computed correctly.
OLAP system should ignore all missing values and compute correct aggregate values.
OLAP facilitate interactive query and complex analysis for the users.
OLAP allows users to drill down for greater details or roll up for aggregations of metrics along a
single business dimension or across multiple dimension.
OLAP provides the ability to perform intricate calculations and comparisons.
OLAP presents results in a number of meaningful ways, including charts and graphs.
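The zero-versus-missing point above can be sketched in a few lines; representing a missing value as None is an assumption of this sketch. The aggregate ignores the missing value while a genuine zero still participates.

```python
# Zero vs. missing: None marks a missing measurement and is ignored by the
# aggregate; the genuine 0 still counts toward the total and the average.
values = [10, 0, None, 30]

known = [v for v in values if v is not None]   # drop only missing values
total = sum(known)
average = total / len(known)                   # divide by 3, not 4

print(total, round(average, 2))   # 40 13.33
```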
OLAP Types:
Three types of OLAP servers are:-
1. Relational OLAP (ROLAP)
2. Multidimensional OLAP (MOLAP)
3. Hybrid OLAP (HOLAP)

1. Relational OLAP (ROLAP):


Relational On-Line Analytical Processing (ROLAP) works mainly on data that resides in a
relational database, where the base data and dimension tables are stored as relational tables.
ROLAP servers are placed between the relational back-end server and client front-end tools.
ROLAP servers use an RDBMS to store and manage warehouse data, and OLAP middleware to
support missing pieces.

Advantages of ROLAP:
1. ROLAP can handle large amounts of data.
2. Can be used with data warehouse and OLTP systems.

Disadvantages of ROLAP:
1. Limited by SQL functionalities.
2. Hard to maintain aggregate tables.

2. Multidimensional OLAP (MOLAP):


Multidimensional On-Line Analytical Processing (MOLAP) supports multidimensional views of
data through array-based multidimensional storage engines. With multidimensional data
stores, storage utilization may be low if the data set is sparse.

Advantages of MOLAP
1. Optimal for slice and dice operations.
2. Performs better than ROLAP when data is dense.
3. Can perform complex calculations.
Disadvantages of MOLAP
1. Difficult to change dimension without re-aggregation.
2. MOLAP can handle limited amount of data.

3. Hybrid OLAP (HOLAP):


Hybrid On-Line Analytical Processing (HOLAP) is a combination of ROLAP and MOLAP. HOLAP
provides the greater scalability of ROLAP and the faster computation of MOLAP.

Advantages of HOLAP
1. HOLAP provide advantages of both MOLAP and ROLAP.
2. Provide fast access at all levels of aggregation.

Disadvantages of HOLAP
1. HOLAP architecture is very complex because it supports both MOLAP and ROLAP servers.
CH.YADAVENDRA KUMAR Asst.Prof Sri Mittapalli Institute of Technology for Women

Why do we need Data Mining?


The volume of information we can collect from business transactions, scientific data,
sensor data, pictures, videos, etc. is increasing every day. So, we need a system capable of
extracting the essence of the available information and automatically generating reports,
views, or summaries of the data for better decision-making.

Why Data Mining is used in Business?


Data mining is used in business to make better managerial decisions by:
Automatic summarization of data
Extracting essence of information stored.
Discovering patterns in raw data.

What is Data Mining?


Data Mining is defined as extracting information from huge sets of data. In other words,
we can say that data mining is the procedure of mining knowledge from data. The extracted
information can be used in applications such as:

Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration

Knowledge Discovery from Data (KDD) is essential to data mining, and some view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps
involved in KDD:

Data Cleaning In this step, noise and inconsistent data are removed.
Data Integration In this step, data from multiple sources are combined.
Data Selection In this step, data relevant to the analysis task are retrieved from the
database.
Data Transformation In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining In this step, intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation In this step, the truly interesting patterns representing knowledge
are identified.
Knowledge Presentation In this step, the mined knowledge is represented to the user.


What kinds of data can be mined?


1. Flat Files
2. Relational Databases
3. DataWarehouse
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time Series Databases
8. World Wide Web(WWW)
1. Flat Files
Flat files are defined as data files in text form or binary form with a structure that
can be easily extracted by data mining algorithms.
Data stored in flat files has no relationship or path among itself; if a
relational database is stored in flat files, there will be no relations
between the tables.
Flat files are represented by data dictionary. Eg: CSV file.
Application: Used in Data Warehousing to store data, Used in carrying data to
and from server, etc.
2. Relational Databases
A Relational database is defined as the collection of data organized in tables with
rows and columns.
Physical schema in Relational databases is a schema which defines the structure of
tables.
Logical schema in Relational databases is a schema which defines the relationship
among tables.
Standard API of relational database is SQL.
Application: Data Mining, ROLAP model, etc.

3. DataWarehouse
A data warehouse is defined as a collection of data integrated from multiple
sources that supports querying and decision making.
There are three types of datawarehouse: Enterprise datawarehouse, Data
Mart and Virtual Warehouse.
Two approaches can be used to update data in DataWarehouse: Query-
driven Approach and Update-driven Approach.
Application: Business decision making, Data mining, etc.
4. Transactional Databases
Transactional databases are a collection of data organized by time stamps, date,
etc to represent transaction in databases.
This type of database has the capability to roll back or undo its operation when a
transaction is not completed or committed.
Highly flexible system where users can modify information without changing any
sensitive information.
Follows ACID property of DBMS.
Application: Banking, Distributed systems, Object databases, etc.
5. Multimedia Databases
Multimedia databases consist of audio, video, image and text media.
They can be stored on Object-Oriented Databases.
They are used to store complex information in pre-specified formats.
Application: Digital libraries, video-on demand, news-on demand, musical
database, etc.
6. Spatial Database
Store geographical information.
Stores data in the form of coordinates, topology, lines, polygons, etc.
Application: Maps, Global positioning, etc.
7. Time-series Databases
Time series databases contain stock exchange data and user logged activities.
Handles array of numbers indexed by time, date, etc.
It requires real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
WWW refers to World wide web is a collection of documents and resources like
audio, video, text, etc which are identified by Uniform Resource Locators (URLs)
through web browsers, linked by HTML pages, and accessible via the Internet
network.
It is the most heterogeneous repository as it collects data from multiple resources.
It is dynamic in nature as Volume of data is continuously increasing and changing.
Application: Online shopping, Job search, Research, studying, etc.

What kinds of Patterns can be mined?


On the basis of the kind of data to be mined, there are two categories of functions involved
in Data Mining:
a) Descriptive
b) Classification and Prediction


a) Descriptive Function
The descriptive function deals with the general properties of data in the database.

1. Class/Concept Description
2. Mining of Frequent Patterns
3. Mining of Associations
4. Mining of Correlations
5. Mining of Clusters

1. Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For
example, in a company, the classes of items for sales include computer and printers, and
concepts of customers include big spenders and budget spenders. Such descriptions of a class
or a concept are called class/concept descriptions. These descriptions can be derived by the
fol
Data Characterization
class under study is called as Target Class.
Data Discrimination
predefined group or class.

2. Mining of Frequent Patterns


Frequent patterns are those patterns that occur frequently in transactional data. Here is
the list of kinds of frequent patterns:
Frequent Item Set It refers to a set of items that frequently appear together, for
example, milk and bread.
Frequent Subsequence A sequence of patterns that occur frequently, such as
purchasing a camera being followed by purchasing a memory card.
Frequent Sub Structure Substructure refers to different structural forms, such as
graphs, trees, or lattices, which may be combined with item sets or subsequences.
3. Mining of Association
Associations are used in retail sales to identify patterns that are frequently purchased
together. This process refers to the process of uncovering the relationship among data and
determining association rules.
For example, a retailer might generate an association rule showing that 70% of the time milk
is sold with bread, and only 30% of the time biscuits are sold with bread.
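Figures like the 70% above come from counting transactions. A sketch of how the support and confidence of a rule such as bread -> milk are computed, over a handful of invented transactions:

```python
# Support/confidence sketch for the rule bread -> milk.
# Transactions are invented for illustration.
transactions = [
    {"bread", "milk"}, {"bread", "milk"}, {"bread", "biscuits"},
    {"bread", "milk"}, {"milk"},
]

n = len(transactions)
both  = sum(1 for t in transactions if {"bread", "milk"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support    = both / n        # fraction of ALL transactions with bread AND milk
confidence = both / bread    # of the bread transactions, how many also had milk

print(support, confidence)   # 0.6 0.75
```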

4. Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical
correlations between associated attribute-value pairs or between two item sets, to analyze
whether they have a positive, negative, or no effect on each other.
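A common correlation measure for two item sets is lift: lift greater than 1 suggests a positive correlation, lift equal to 1 suggests independence, and lift below 1 suggests a negative correlation. A small sketch on made-up baskets:

```python
# Made-up basket data for illustration.
transactions = [
    {"milk", "bread"},
    {"milk", "bread"},
    {"milk"},
    {"tea"},
]

def support(itemset, txns):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in txns if itemset <= t) / len(txns)

def lift(a, b, txns):
    """lift(A, B) = support(A and B) / (support(A) * support(B))."""
    return support(a | b, txns) / (support(a, txns) * support(b, txns))

l = lift({"milk"}, {"bread"}, transactions)
# 0.5 / (0.75 * 0.5) = 4/3 > 1, so milk and bread are positively correlated here
```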

5. Mining of Clusters
Cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but are highly different from the objects
in other clusters.
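Cluster analysis can be illustrated with a minimal one-dimensional k-means sketch (toy numbers and naive initialisation; real implementations must also handle empty clusters and multiple restarts):

```python
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids = [points[0], points[3]]  # naive initialisation: one seed per cluster

for _ in range(10):  # a few iterations are enough for this toy data
    clusters = [[], []]
    for p in points:
        # assign each point to its nearest centroid
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # recompute each centroid as the mean of its assigned points
    centroids = [sum(c) / len(c) for c in clusters]

# centroids settle near 1.0 and 8.07: two well-separated groups emerge
```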


b) Classification and Prediction


Classification is the process of finding a model that describes the data classes or
concepts. The purpose is to be able to use this model to predict the class of objects whose
class label is unknown. This derived model is based on the analysis of sets of training data.
The derived model can be presented in forms such as classification rules, decision trees,
mathematical formulae, or neural networks. The functions involved in classification and
prediction include:
1. Classification (IF-THEN) Rules


2. Prediction
3. Decision Trees
4. Mathematical Formulae
5. Neural Networks
6. Outlier Analysis
7. Evolution Analysis

These functions are described as follows:


1. Classification: It predicts the class of objects whose class label is unknown. Its
objective is to find a derived model that describes and distinguishes data classes or
concepts. The derived model is based on the analysis of a set of training data, i.e., data
objects whose class label is well known.

2. Prediction: It is used to predict missing or unavailable numerical data values rather
than class labels. Regression Analysis is generally used for prediction. Prediction can
also be used for identification of distribution trends based on available data.
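Prediction of a numeric value can be sketched with simple least-squares regression (the numbers below are made up and roughly follow y = 2x):

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]  # noisy observations of roughly y = 2x

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

predicted = slope * 5.0 + intercept  # predict a value for the unseen input x = 5
```

Unlike classification, the output here is a number on a continuous scale, not a class label.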

3. Decision Trees: A decision tree is a flowchart-like tree structure with a root node,
branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes
the outcome of a test, and each leaf node holds a class label.
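IF-THEN rules (item 1) and decision trees (item 3) encode the same kind of model: each rule corresponds to one root-to-leaf path of a tree. A minimal sketch with hypothetical customer attributes (the attribute names and thresholds are invented for illustration):

```python
def classify(customer):
    """Each IF-THEN rule below mirrors one root-to-leaf path of a decision tree."""
    if customer["age"] <= 30 and customer["student"]:
        return "buys_computer"
    if customer["age"] > 30 and customer["credit"] == "excellent":
        return "buys_computer"
    return "does_not_buy"

label = classify({"age": 25, "student": True, "credit": "fair"})
# the first rule fires, so label == "buys_computer"
```

In practice such rules are derived automatically from training data rather than written by hand.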

4. Mathematical Formulae: Data can be mined by using some mathematical formulae.

5. Neural Networks: Neural networks are computational models inspired by the way
biological nervous systems process information. These models are biologically inspired
rather than an exact replica of how the brain actually functions. Neural networks have
been shown to be very promising systems in many forecasting and business
classification applications due to their ability to learn from data.

6. Outlier Analysis: Outliers may be defined as the data objects that do not comply
with the general behavior or model of the data available.
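A common way to flag objects that do not comply with the general behavior of the data is a z-score test: points more than a few standard deviations from the mean are suspect. A sketch on made-up readings:

```python
data = [10, 11, 9, 10, 12, 11, 10, 95]  # 95 clearly deviates from the rest

n = len(data)
mean = sum(data) / n
std = (sum((x - mean) ** 2 for x in data) / n) ** 0.5  # population std dev

# flag values more than 2 standard deviations away from the mean
outliers = [x for x in data if abs(x - mean) > 2 * std]
# outliers == [95]
```

The 2-sigma threshold is a convention, not a law; real outlier detectors tune it to the data.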

7. Evolution Analysis: Evolution analysis refers to the description and modeling of
regularities or trends for objects whose behavior changes over time.


Data Mining Task Primitives


We can specify a data mining task in the form of a data mining query.
This query is input to the system.
A data mining query is defined in terms of data mining task primitives.

Note: these primitives allow the user to communicate with the data mining system in an
interactive manner. The task primitives are:

Set of task-relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in the discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.

Which Technologies are used in data mining?

1. Statistics:
It uses mathematical analysis to express representations, models, and summaries of
empirical data or real-world observations.
Statistical analysis involves a collection of methods, applicable to large amounts of
data, used to draw conclusions and report trends.
2. Machine learning
Arthur Samuel defined machine learning as a field of study that gives computers the
ability to learn without being explicitly programmed.
When new data is entered into the computer, machine-learning algorithms allow the
program to adapt and improve.
In machine learning, an algorithm is constructed to predict new outcomes from the
available database (predictive analysis).
It is related to computational statistics.


The four types of machine learning are:


a. Supervised learning
It is based on classification.
It is also called inductive learning. In this method, the desired outputs are
included in the training dataset.
b. Unsupervised learning
Unsupervised learning is based on clustering. Clusters are formed on the basis of
similarity measures and desired outputs are not included in the training dataset.
c. Semi-supervised learning
Semi-supervised learning adds some desired outputs to the training dataset to
generate the appropriate functions. This method generally avoids the need for a large
number of labeled examples (i.e. desired outputs).
d. Active learning
Active learning is a powerful approach for analyzing data efficiently.
The algorithm is designed so that it chooses the data points it wants labeled, and
the user (acting as a teacher) supplies the desired output for each query.
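Supervised learning in miniature: the training data carries the desired outputs (labels), and a 1-nearest-neighbour rule predicts the label of a new point (toy data, illustrative only):

```python
# (value, label) pairs: the labels are the "desired outputs" in the training set
train = [(1.0, "small"), (1.2, "small"), (7.8, "large"), (8.1, "large")]

def predict(x):
    """Return the label of the training example nearest to x (1-NN)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

label = predict(7.5)  # nearest training example is (7.8, "large")
```

Unsupervised learning, by contrast, would receive only the values and have to group them by similarity, as in the clustering sketch earlier.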
3. Information retrieval
Information retrieval deals with uncertain representations of the semantics of objects (text,
images).
For example: Finding relevant information from a large document.

4. Database systems and data warehouse


Databases are used for the purpose of recording data as well as for data warehousing.
Online Transaction Processing (OLTP) uses databases for day-to-day transaction
purposes.
Data warehouses are used to store historical data, which helps in making strategic
decisions for a business.
A data warehouse is used for Online Analytical Processing (OLAP), which helps to analyze the data.
5. Pattern Recognition:
Pattern recognition is the automated recognition of patterns and regularities in data.
Pattern recognition is closely related to artificial intelligence and machine learning, together
with applications such as data mining and knowledge discovery in databases (KDD), and is
often used interchangeably with these terms.
6. Visualization:
It is the process of extracting and presenting the data in a clear and
understandable way, without requiring the reading of raw records, by displaying the results
in the form of pie charts, bar graphs, statistical representations, and other graphical forms.
7. Algorithms:
To perform data mining efficiently, well-designed algorithms are required.
8. High Performance Computing:
High-Performance Computing generally refers to the practice of aggregating
computing power in a way that delivers much higher performance than a
typical desktop computer or workstation, in order to solve large problems in science,
engineering, or business.


Are all patterns interesting?


Typically the answer is no: only a small fraction of the patterns potentially generated would
actually be of interest to a given user.
What makes patterns interesting?
A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data
with some degree of certainty, (3) potentially useful, and (4) novel.
A pattern is also interesting if it validates a hypothesis that the user sought to confirm.
Data Mining Applications

Financial Data Analysis


Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection
Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining. Examples include:
Loan payment prediction and customer credit policy analysis.
Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because the industry collects large
amounts of data on sales, customer purchasing history, goods transportation,
consumption, and services. It is natural that the quantity of data collected will continue to
expand rapidly because of the increasing ease, availability, and popularity of the web.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging industries,
providing various services such as fax, pager, cellular phone, internet messenger, images,
e-mail, web data transmission, etc. Due to the development of new computer and
communication technologies, the telecommunication industry is rapidly expanding. This is
why data mining has become very important in helping to understand the business.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as
genomics, proteomics, functional genomics, and biomedical research. Biological data mining
is a very important part of bioinformatics.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous
data sets, for which statistical techniques are appropriate. In contrast, huge amounts of data
have been collected from scientific domains such as geosciences, astronomy, etc.
Large data sets are also being generated by fast numerical
simulations in fields such as climate and ecosystem modeling, chemical engineering,
fluid dynamics, etc.


Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a
major issue. The increased usage of the internet and the availability of tools and tricks for
intruding into and attacking networks have made intrusion detection a critical component
of network administration.
Major Issues in data mining:
Data mining is a dynamic and fast-expanding field with great strengths. The major issues
can be divided into five groups:
a) Mining Methodology
b) User Interaction
c) Efficiency and Scalability
d) Diverse Data Types Issues
e) Data Mining and Society
a) Mining Methodology:

Mining different kinds of knowledge in databases: Different users may be
interested in different kinds of knowledge. Therefore it is necessary for data mining
to cover a broad range of knowledge discovery tasks.
Mining knowledge in multidimensional space: When searching for knowledge in
large datasets, we can explore the data in multidimensional space.
Handling noisy or incomplete data: Data cleaning methods are required to
handle noise and incomplete objects while mining the data regularities. If data
cleaning methods are not used, the accuracy of the discovered patterns will be
poor.
Pattern evaluation: The patterns discovered should be interesting; many discovered
patterns are not, because they represent common knowledge or lack novelty.
b) User Interaction:
Interactive mining of knowledge at multiple levels of abstraction: The data
mining process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.
Incorporation of background knowledge: To guide the discovery process and to
express the discovered patterns, background knowledge can be used. Background
knowledge may be used to express the discovered patterns not only in concise terms
but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining: A data mining query
language that allows the user to describe ad hoc mining tasks should be integrated
with a data warehouse query language and optimized for efficient and flexible data
mining.
Presentation and visualization of data mining results: Once the patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.


c) Efficiency and scalability


There can be performance-related issues such as the following:
Efficiency and scalability of data mining algorithms: In order to effectively
extract information from huge amounts of data in databases, data mining
algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms: Factors such as the
huge size of databases, the wide distribution of data, and the complexity of data mining
methods motivate the development of parallel and distributed data mining
algorithms. These algorithms divide the data into partitions, which are further
processed in parallel. Then the results from the partitions are merged. Incremental
algorithms update the mined results as the database changes, without mining the data again from scratch.
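The partition-then-merge idea behind parallel mining can be sketched with per-partition frequency counts that are combined at the end (in a real system each partition would be processed on a separate worker):

```python
from collections import Counter

# three data partitions, e.g. held on different nodes
partitions = [
    ["milk", "bread", "milk"],
    ["bread", "tea"],
    ["milk", "tea", "tea"],
]

# step 1: count item frequencies within each partition (parallelisable)
partial_counts = [Counter(p) for p in partitions]

# step 2: merge the partial results into global counts
merged = sum(partial_counts, Counter())
# merged counts: milk 3, tea 3, bread 2
```

Because counting is associative, the merge step gives the same answer as counting over all the data at once, which is exactly what makes the work divisible.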
d) Diverse Data Types Issues
Handling of relational and complex types of data: The database may contain
complex data objects, multimedia data objects, spatial data, temporal data, etc. It is
not possible for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information
systems: The data is available at different data sources on a LAN or WAN. These
data sources may be structured, semi-structured, or unstructured. Therefore, mining
knowledge from them adds challenges to data mining.
e) Data Mining and Society
Social impacts of data mining: With data mining penetrating our everyday lives, it
is important to study the impact of data mining on society.
Privacy-preserving data mining: Data mining will help scientific discovery,
business management, economic recovery, and security protection, but it also risks
disclosing individuals' private information, so privacy-preserving methods are needed.
Invisible data mining: We cannot expect everyone in society to learn and master
data mining techniques. More and more systems should have data mining functions
built in, so that people can perform data mining or use data mining results simply
by clicking a mouse, without any knowledge of data mining algorithms.
Data Objects and Attribute Types:
Data Object:
A data object is a real-world entity about which data is collected and stored, e.g., a customer, an item, or a sales transaction.
Attribute:
It can be seen as a data field that represents characteristics or features of a data object.
For a customer object, attributes can be customer ID, address, etc. The attribute types can
be represented as follows:
1. Nominal Attributes (related to names): The values of a nominal attribute are names
of things, some kind of symbols. Values of nominal attributes represent some category
or state, which is why nominal attributes are also referred to as categorical
attributes.
Example:
Attribute    Values
Colors       Black, Green, Brown, Red
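Because nominal values carry no order, averages are meaningless for them; the mode (most frequent value) is the natural summary statistic. A quick sketch:

```python
from collections import Counter

colors = ["Black", "Green", "Brown", "Red", "Green", "Green"]

# the mode is the only sensible "centre" for a nominal attribute
mode, freq = Counter(colors).most_common(1)[0]
# mode is "Green", which occurs 3 times
```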
