
SDC2DS07: Data Mining Techniques

Module I:

Data warehouse – definition – operational database systems Vs data warehouses –multidimensional model – from
tables and spreadsheets to Data Cubes – schemas for multidimensional databases – measures – concept hierarchies
- OLAP operations in the multidimensional data model – data warehouse architecture.

DATA MINING TECHNIQUES


Data mining involves the use of sophisticated data analysis tools to find previously unknown, valid patterns and
relationships in huge data sets. These tools can incorporate statistical models, machine learning techniques, and
mathematical algorithms, such as neural networks or decision trees. Thus, data mining encompasses both analysis
and prediction.

Drawing on methods and technologies from the intersection of machine learning, database management, and
statistics, professionals in data mining have devoted their careers to understanding how to process and draw
conclusions from huge amounts of data. But what are the methods they use to make it happen?

In recent data mining projects, various major data mining techniques have been developed and used, including
association, classification, clustering, prediction, sequential patterns, and regression.

DATA WAREHOUSE

A data warehouse is a centralized repository for storing and managing large amounts of data from various sources
for analysis and reporting. It is optimized for fast querying and analysis, enabling organizations to make informed
decisions by providing a single source of truth for data. Data warehousing typically involves transforming and
integrating data from multiple sources into a unified, organized, and consistent format.

Major characteristics of data warehouse:

Subject-oriented – A data warehouse is subject-oriented, as it delivers information about a theme rather than
about the organization's ongoing operations. Warehousing is organized around specific, well-defined subjects,
such as sales, distribution, or marketing.

A data warehouse never puts emphasis only on current operations. Instead, it focuses on the modelling and
analysis of data for decision-making, and it delivers a simple and concise view of the particular subject by
excluding data that is not useful for those decisions.

Integrated – A data warehouse is built by integrating data from heterogeneous sources, such as a mainframe and
a relational database, into a consistent, reliable format. It must use consistent naming conventions, formats, and
encodings: consistency in naming conventions, attribute measures, encoding structures, and so on should be
ensured across the data drawn from the different source databases. This integration makes effective analysis of
the data possible.

Time-Variant – Data in the warehouse is maintained over different intervals of time, such as weekly, monthly, or
annually. The time horizon of a data warehouse is far longer than that of operational (OLTP) systems. Data
residing in the warehouse is associated with a specific interval of time and delivers information from a historical
perspective; every record comprises an element of time, explicitly or implicitly. Another aspect of time-variance
is that once data is stored in the data warehouse, it is not modified, altered, or updated. Data is stored with a time
dimension, allowing for analysis of data over time.

Non-Volatile – As the name suggests, the data residing in a data warehouse is permanent: it is not erased or
overwritten when new data is inserted. Once stored, data is not updated, so that the historical record is preserved.
Data is read-only and refreshed at particular intervals, which is beneficial for analysing historical data and for
understanding how the business functions. A warehouse does not need transaction processing, recovery, or
concurrency-control mechanisms; operations such as delete, update, and insert that are performed in an
operational application are absent in a data warehouse environment. The two types of data operations performed
in a data warehouse are:

• Data Loading
• Data Access

Functions of Data warehouse:


A data warehouse serves as an organized collection of data, drawn from high-transaction operational tables and
maintained so that it can be retrieved and analysed. The major functions involved in data warehousing are
described below:

Data Consolidation: The process of combining multiple data sources into a single data repository in a data
warehouse. This ensures a consistent and accurate view of the data.

Data Cleaning: The process of identifying and removing errors, inconsistencies, and irrelevant data from the data
sources before they are integrated into the data warehouse. This helps ensure the data is accurate and trustworthy.

Data Integration: The process of combining data from multiple sources into a single, unified data repository in
a data warehouse. This involves transforming the data into a consistent format and resolving any conflicts or
discrepancies between the data sources. Data integration is an essential step in the data warehousing process to
ensure that the data is accurate and usable for analysis. Data from multiple sources can be integrated into a single
data repository for analysis.

Data Storage: A data warehouse can store large amounts of historical data and make it easily accessible for
analysis.

Data Transformation: Data can be transformed and cleaned to remove inconsistencies, duplicate data, or
irrelevant information.

Data Analysis: Data can be analysed and visualized in various ways to gain insights and make informed decisions.

Data Reporting: A data warehouse can provide various reports and dashboards for different departments and
stakeholders.

Data Mining: Data can be mined for patterns and trends to support decision-making and strategic planning.

Performance Optimization: Data warehouse systems are optimized for fast querying and analysis, providing
quick access to data.

Operational Database and Data Warehouse

The Operational Database is the source of information for the data warehouse. It includes detailed information
used to run the day-to-day operations of the business. The data frequently changes as updates are made and reflect
the current value of the last transactions. Operational database management systems, also called OLTP (Online
Transaction Processing) databases, are used to manage dynamic data in real-time.

Data warehouse systems serve users or knowledge workers for the purpose of data analysis and decision-making.
Such systems can organize and present information in specific formats to accommodate the diverse needs of
various users. These systems are called Online Analytical Processing (OLAP) systems. The data warehouse and
the OLTP database are both relational databases; however, the goals of the two are different.

The differences can be summarized as follows:

• Operational systems are designed to support high-volume transaction processing, whereas data warehousing
systems are designed to support high-volume analytical processing (OLAP).
• Operational systems are usually concerned with current data, whereas data warehousing systems are concerned
with historical data.
• Data within operational systems is updated regularly as needed, whereas warehouse data is non-volatile: new
data may be added regularly, but once added it is rarely changed.
• An operational database is designed for real-time business transactions and processes, whereas a warehouse is
designed for analysis of business measures by subject area, category, and attribute.
• An operational database is optimized for a simple set of transactions, generally adding or retrieving a single
row at a time per table, whereas a warehouse is optimized for bulk loads and large, complex, unpredictable
queries that access many rows per table.
• An operational database is optimized for validation of incoming data during transactions and uses validation
tables, whereas a warehouse is loaded with consistent, valid data and requires no real-time validation.
• An operational database supports thousands of concurrent users, whereas a warehouse supports relatively few
concurrent users.
• Operational systems are largely process-oriented, whereas data warehousing systems are subject-oriented.
• Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of
data, whereas warehousing systems are optimized to perform fast retrievals of relatively large volumes of data.
• An operational database is oriented toward data in; a warehouse is oriented toward data out.
• Operational queries access a small number of records; warehouse queries access large numbers of records.
• Relational databases are created for online transaction processing (OLTP); data warehouses are designed for
online analytical processing (OLAP).

Multi-Dimensional Data Model

A multidimensional model views data in the form of a data-cube. A data cube enables data to be modelled and
viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities concerning which an organization keeps records. For example, a
shop may create a sales data warehouse to keep records of the store's sales for the dimension time, item, and
location. These dimensions allow the save to keep track of things, for example, monthly sales of items and the
locations at which the items were sold. Each dimension has a table related to it, called a dimensional table, which
describes the dimension further. For example, a dimensional table for an item may contain the attributes item
name, brand, and type.

A multidimensional data model is organized around a central theme, for example, sales. This theme is represented
by a fact table. Facts are numerical measures. The fact table contains the names of the facts or measures of the
related dimensional tables.

Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table. In this
2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the item
dimension (classified according to the types of items sold). The fact or measure displayed is rupees_sold (in
thousands).

Now suppose we want to view the sales data with a third dimension. For example, in addition to time and item,
the location dimension is considered for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are
shown in the table, where the 3D data are represented as a series of 2D tables.

Conceptually, the same data may also be represented in the form of a 3D data cube, as shown in the figure.
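A data cube of this kind can be sketched in code as a mapping from dimension values to a measure. The following Python sketch is illustrative only: the cell values and the `total` helper are assumptions, not data or code from the text, but they show how a measure can be aggregated along any subset of dimensions.

```python
# Cells of a 3D sales cube: (quarter, item, city) -> rupees_sold (in thousands).
# The figures are illustrative, not taken from the text.
cube = {
    ("Q1", "Mobile", "Delhi"): 605, ("Q1", "Modem", "Delhi"): 825,
    ("Q2", "Mobile", "Delhi"): 680, ("Q2", "Modem", "Delhi"): 952,
    ("Q1", "Mobile", "Chennai"): 340, ("Q1", "Modem", "Chennai"): 459,
}

def total(quarter=None, item=None, city=None):
    """Sum the measure over all cells matching the given dimension values."""
    return sum(
        v for (q, i, c), v in cube.items()
        if (quarter is None or q == quarter)
        and (item is None or i == item)
        and (city is None or c == city)
    )

print(total(city="Delhi"))    # all sales in Delhi: 3062
print(total(quarter="Q1"))    # Q1 sales across all items and cities: 2229
```

Fixing one dimension (as in `total(city="Delhi")`) corresponds to summing one 2D slice of the cube; leaving all dimensions unset sums the entire cube.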
Schemas
Schema is a logical description of the entire database. It includes the name and description of all record types,
including all associated data items and aggregates. Much like a database, a data warehouse also needs to
maintain a schema. A database uses the relational model, while a data warehouse uses the Star, Snowflake, or
Fact Constellation schema.
Star Schema
The star schema is a modelling paradigm in which the data warehouse contains
(1) a large central table (fact table), and
(2) a set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern
around the central fact table
• Each dimension in a star schema is represented with only one-dimension table.
• This dimension table contains the set of attributes.
• The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
• There is a fact table at the centre. It contains the keys to each of four dimensions.
• The fact table also contains the attributes, namely dollars sold and units sold.
Snowflake Schema
The snowflake schema is a variant of the star schema model, where some dimension tables are
normalized, thereby further splitting the data into additional tables. The resulting schema graph
forms a shape similar to a snowflake. The major difference between the snowflake and star
schema models is that the dimension tables of the snowflake model may be kept in normalized
form. Such normalized tables are easy to maintain and save storage space, because a dimension table can grow
extremely large when the entire dimensional structure is included as columns.
• Some dimension tables in the Snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item
dimension table of the star schema is split into two dimension tables, namely the item and supplier
tables.
• Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier-key.
• The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.
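The normalization just described can be sketched in SQLite. The attribute names (item_key, supplier_key, supplier_type) follow the text; the rows are illustrative assumptions. Note the cost of the design: reaching a supplier attribute from the fact table now takes one extra join compared with a star schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE dim_item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT, brand TEXT,
    supplier_key INTEGER REFERENCES dim_supplier(supplier_key)
);
CREATE TABLE fact_sales (item_key INTEGER, units_sold INTEGER);
INSERT INTO dim_supplier VALUES (1, 'wholesale'), (2, 'retail');
INSERT INTO dim_item VALUES (10, 'Mobile', 'electronics', 'BrandA', 1),
                            (11, 'Modem',  'electronics', 'BrandB', 2);
INSERT INTO fact_sales VALUES (10, 60), (11, 80), (10, 70);
""")

# Fact -> item -> supplier: the extra hop is the price of normalization.
rows = con.execute("""
    SELECT s.supplier_type, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_item i ON f.item_key = i.item_key
    JOIN dim_supplier s ON i.supplier_key = s.supplier_key
    GROUP BY s.supplier_type ORDER BY s.supplier_type
""").fetchall()
print(rows)   # [('retail', 80), ('wholesale', 130)]
```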
Fact Constellation Schema
Sophisticated applications may require multiple fact tables to share dimension tables. This kind
of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact
constellation.
• A fact constellation has multiple fact tables. It is also known as galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
• The sales fact table is same as that in the star schema.
• The shipping fact table has the five dimensions, namely item_key, time_key,
shipper_key, from_location, to_location.
• The shipping fact table also contains two measures, namely dollars sold and units sold.
• It is also possible to share dimension tables between fact tables. For example, time,
item, and location dimension tables are shared between the sales and shipping fact table.
OLAP
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows
managers and analysts to gain insight into information through fast, consistent, and interactive access to it. This
section covers the types of OLAP servers, OLAP operations, and the differences between OLAP and OLTP.

Types of OLAP Servers


We have four types of OLAP servers −
• Relational OLAP (ROLAP)
• Multidimensional OLAP (MOLAP)
• Hybrid OLAP (HOLAP)
• Specialized SQL Servers
Relational OLAP
ROLAP servers are placed between relational back-end server and client front-end tools. To
store and manage warehouse data, ROLAP uses relational or extended-relational DBMS.
ROLAP includes the following −
• Implementation of aggregation navigation logic.
• Optimization for each DBMS back end.
• Additional tools and services.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of
data. With multidimensional data stores, the storage utilization may be low if the data set is
sparse. Therefore, many MOLAP server use two levels of data storage representation to handle
dense and sparse data sets.
Hybrid OLAP
Hybrid OLAP is a combination of ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster
computation of MOLAP. HOLAP servers allow large volumes of detailed data to be stored, while the
aggregations are kept separately in a MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for
SQL queries over star and snowflake schemas in a read-only environment.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations −
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
The following diagram illustrates how roll-up works.

• Roll-up is performed by climbing up a concept hierarchy for the dimension location.


• Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the
level of city to the level of country
• The data is grouped into countries rather than cities.
• When roll-up is performed, one or more dimensions from the data cube are
removed.
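The climb up the location hierarchy can be sketched in a few lines of Python. The city-to-country mapping and the sales figures are illustrative assumptions; the point is that roll-up replaces the city level of each cell with its parent country and sums the measures that collide.

```python
# One level of the concept hierarchy: city -> country.
city_to_country = {
    "Delhi": "India", "Chennai": "India",
    "Toronto": "Canada", "Vancouver": "Canada",
}

# City-level cells of the cube (city -> sales, illustrative figures).
city_sales = {"Delhi": 3062, "Chennai": 799, "Toronto": 1200, "Vancouver": 900}

# Roll-up: aggregate each city's measure into its parent country.
country_sales = {}
for city, sales in city_sales.items():
    country = city_to_country[city]
    country_sales[country] = country_sales.get(country, 0) + sales

print(country_sales)   # {'India': 3861, 'Canada': 2100}
```

Drill-down would run in the opposite direction, replacing each country-level cell with its more detailed city-level cells.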
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways

• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension.
The following diagram illustrates how drill-down works –
• Drill-down is performed by stepping down a concept hierarchy for the
dimension time.
• Initially the concept hierarchy was "day < month < quarter < year."
• On drilling down, the time dimension is descended from the level of quarter to
the level of month.
• When drill-down is performed, one or more dimensions are added to the data cube.
• It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation performs a selection on one particular dimension of a given cube, producing a new sub-
cube. Consider the following diagram that shows how slice works.
• Here slice is performed for the dimension "time" using the criterion time = "Q1".
• It forms a new sub-cube by selecting a single value on one dimension.

Dice
The dice operation performs a selection on two or more dimensions of a given cube, producing a new sub-cube. Consider
the following diagram that shows the dice operation.

The dice operation on the cube based on the following selection criteria involves three
dimensions.

• (location = "Toronto" or "Vancouver")


• (time = "Q1" or "Q2")
• (item =" Mobile" or "Modem")
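Both operations can be sketched over cube cells stored as tuples. The cell values below are illustrative assumptions; the dice criteria mirror the three selection conditions listed above.

```python
# Cube cells: (location, time, item) -> units_sold (illustrative figures).
cube = {
    ("Toronto", "Q1", "Mobile"): 100, ("Toronto", "Q2", "Modem"): 150,
    ("Vancouver", "Q1", "Modem"): 120, ("Vancouver", "Q3", "Mobile"): 90,
    ("Delhi", "Q1", "Mobile"): 200,
}

# Slice: fix a single value on one dimension (time = "Q1").
slice_q1 = {k: v for k, v in cube.items() if k[1] == "Q1"}

# Dice: select on several dimensions at once, as in the criteria above.
dice = {
    k: v for k, v in cube.items()
    if k[0] in ("Toronto", "Vancouver")
    and k[1] in ("Q1", "Q2")
    and k[2] in ("Mobile", "Modem")
}

print(len(slice_q1), len(dice))   # 3 3
```

Note that the Vancouver Q3 cell survives the slice criterion test for other quarters but is excluded by the dice, since Q3 is outside the selected time values.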
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide
an alternative presentation of data. Consider the following diagram that shows the pivot
operation.

OLAP vs OLTP

Data Warehouse Architecture

Data Warehouse Architecture is complex, as it is an information system that contains historical and cumulative
data from multiple sources. There are 3 approaches for constructing Data Warehouse layers: Single Tier, Two tier
and Three tier.

Single-tier architecture

The objective of a single layer is to minimize the amount of data stored; the goal is to remove data redundancy.
This architecture is not frequently used in practice.

Two-tier architecture

Two-tier architecture physically separates the available data sources from the data warehouse itself. This
architecture is not expandable and does not support a large number of end-users. It also has connectivity
problems because of network limitations.

Three-Tier Data Warehouse Architecture

This is the most widely used Architecture of Data Warehouse.

It consists of the Top, Middle and Bottom Tier.

Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a relational database
system. Data is cleansed, transformed, and loaded into this layer using back-end tools.

Middle Tier: The middle tier in Data warehouse is an OLAP server which is implemented using either ROLAP or
MOLAP model. For a user, this application tier presents an abstracted view of the database. This layer also acts
as a mediator between the end-user and the database.
Top-Tier: The top tier is a front-end client layer. It holds the tools and APIs used to connect to the data
warehouse and get data out of it, such as query tools, reporting tools, managed query tools, analysis tools, and
data mining tools.

Datawarehouse Components

We will learn about the Datawarehouse Components and Architecture of Data Warehouse with Diagram as shown
below:

The Data Warehouse is based on an RDBMS server which is a central information repository that is surrounded

by some key Data Warehousing components to make the entire environment functional, manageable and
accessible.

There are mainly five Data Warehouse Components:

Data Warehouse Database

The central database is the foundation of the data warehousing environment. This database is implemented on the
RDBMS technology. However, this kind of implementation is constrained by the fact that a traditional RDBMS
is optimized for transactional database processing and not for data warehousing. For instance, ad-hoc queries,
multi-table joins, and aggregates are resource-intensive and slow down performance.

Hence, alternative approaches are used, as listed below:

In a datawarehouse, relational databases are deployed in parallel to allow for scalability. Parallel relational
databases also allow shared memory or shared nothing model on various multiprocessor configurations or
massively parallel processors.

New index structures are used to bypass relational table scan and improve speed.

Use of multidimensional database (MDDBs) to overcome any limitations which are placed because of the
relational Data Warehouse Models. Example: Essbase from Oracle.

Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)

The data sourcing, transformation, and migration tools are used for performing all the conversions,
summarizations, and all the changes needed to transform data into a unified format in the datawarehouse. They
are also called Extract, Transform and Load (ETL) Tools

Their functionality includes:

• Anonymizing data as per regulatory stipulations.
• Eliminating unwanted data from operational databases before loading into the data warehouse.
• Searching for and replacing common names and definitions for data arriving from different sources.
• Calculating summaries and derived data.
• Populating missing data with defaults.
• De-duplicating repeated data arriving from multiple data sources.

These ETL tools may generate cron jobs, background jobs, COBOL programs, shell scripts, etc. that regularly
update the data in the warehouse. They are also helpful in maintaining the metadata, and they have to deal with
the challenges of database and data heterogeneity.
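A few of the transformation functions listed above can be sketched in plain Python: search-and-replace of common names, filling missing values with defaults, and de-duplication. The field names, synonym table, and source records here are hypothetical, chosen only to make the three steps concrete.

```python
# Hypothetical raw records extracted from two operational sources.
raw_records = [
    {"cust": "Acme Corp.", "country": "USA",  "amount": 100.0},
    {"cust": "Acme Corp.", "country": "U.S.", "amount": 100.0},  # duplicate once normalized
    {"cust": "Globex",     "country": None,   "amount": 50.0},   # missing country
]

COUNTRY_SYNONYMS = {"U.S.": "USA", "United States": "USA"}  # common-name replacement
DEFAULT_COUNTRY = "Unknown"                                  # default for missing data

def transform(rec):
    """Normalize names and fill defaults (the T of ETL)."""
    country = rec["country"] or DEFAULT_COUNTRY
    country = COUNTRY_SYNONYMS.get(country, country)
    return {"cust": rec["cust"], "country": country, "amount": rec["amount"]}

# Load step: de-duplicate on all fields after transformation.
seen, warehouse = set(), []
for rec in map(transform, raw_records):
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        warehouse.append(rec)

print(warehouse)   # two records: the normalized Acme row once, plus Globex
```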

Metadata

The name Meta Data suggests some high-level technological Data Warehousing Concepts. However, it is quite
simple. Metadata is data about data which defines the data warehouse. It is used for building, maintaining and
managing the data warehouse.

In the Data Warehouse Architecture, meta-data plays an important role as it specifies the source, usage, values,
and features of data warehouse data. It also defines how data can be changed and processed. It is closely connected
to the data warehouse.

For example, a line in a sales database may contain:

4030 KJ732 299.90

This is meaningless data until we consult the metadata, which tells us it represents:

• Model number: 4030


• Sales Agent ID: KJ732
• Total sales amount of $299.90

Therefore, Meta Data are essential ingredients in the transformation of data into knowledge.
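The example above can be sketched in code: the same raw line becomes meaningful once interpreted through a field catalog. The catalog structure (field name plus type per column) is an illustrative assumption, not a real metadata format.

```python
# The raw record from the text.
raw_line = "4030 KJ732 299.90"

# Metadata: the name and type of each positional field in the record.
catalog = [
    ("model_number",   str),
    ("sales_agent_id", str),
    ("total_sales",    float),
]

# Interpret the line through the catalog: pair each value with its metadata.
record = {
    name: cast(value)
    for (name, cast), value in zip(catalog, raw_line.split())
}

print(record)   # {'model_number': '4030', 'sales_agent_id': 'KJ732', 'total_sales': 299.9}
```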

Metadata helps to answer questions such as:

• What tables, attributes, and keys does the data warehouse contain?
• Where did the data come from?
• How many times does the data get reloaded?
• What transformations were applied during cleansing?

Metadata can be classified into the following categories:

• Technical Metadata: contains information about the warehouse that is used by data warehouse designers and
administrators.
• Business Metadata: contains detail that gives end-users an easy way to understand the information stored in
the data warehouse.

Query Tools

One of the primary objectives of data warehousing is to provide information that businesses can use to make
strategic decisions. Query tools allow users to interact with the data warehouse system.

These tools fall into four different categories:

• Query and reporting tools


• Application Development tools
• Data mining tools
• OLAP tools

1-Query and reporting tools:

Query and reporting tools can be further divided into:

• Reporting tools
• Managed query tools

Reporting tools:

Reporting tools can be further divided into report writers and production reporting tools.

Report writers: These are reporting tools designed for end-users to carry out their own analysis.

Production reporting: These tools allow organizations to generate regular operational reports and support high-
volume batch jobs such as printing and calculating. Some popular reporting tools are Brio, Business Objects,
Oracle, PowerSoft, and SAS Institute.

Managed query tools:

These access tools shield end-users from the complexities of SQL and the database structure by inserting a
meta-layer between the users and the database.

2. Application development tools:


Sometimes built-in graphical and analytical tools do not satisfy the analytical needs of an organization. In such
cases, custom reports are developed using Application development tools.

3. Data mining tools:

Data mining is the process of discovering meaningful new correlations, patterns, and trends by mining large
amounts of data. Data mining tools are used to automate this process.

4. OLAP tools:

These tools are based on the concepts of a multidimensional database. They allow users to analyse the data
using elaborate and complex multidimensional views.

Data warehouse Bus Architecture

Data warehouse Bus determines the flow of data in your warehouse. The data flow in a data warehouse can be
categorized as Inflow, Upflow, Downflow, Outflow and Meta flow.

While designing a data bus, one needs to consider the shared dimensions and facts across the data marts.

Data Marts

A data mart is an access layer used to get data out to the users. It is presented as an alternative to a large data
warehouse, as it takes less time and money to build. However, there is no standard definition of a data mart; it
differs from person to person. In simple words, a data mart is a subsidiary of a data warehouse that holds a
partition of the data created for a specific group of users. Data marts can be created in the same database as the
data warehouse or in a physically separate database.
