
Unit 2 Data Warehouse

Data Warehouse: Introduction, Definition, Multidimensional Data Model, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.

It is a centralized repository for data drawn from multiple sources.

A data warehouse is kept separate from the operational DBMS; it stores a huge amount of data that is typically collected from multiple heterogeneous sources such as files, operational databases, etc.

Data Warehouse Architecture

Operational System: In data warehousing, an operational system refers to a system that processes the day-to-day transactions of an organization.

Flat Files: A Flat file system is a system of files in which transactional data is stored, and every file in
the system must have a different name.



Summarized Data: This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.

Meta Data: A set of data that defines and gives information about other data.
• Metadata is used in a data warehouse for a variety of purposes:
• It summarizes necessary information about data, for example the author, the build date, the date of last change, and the file size.
• Metadata is used to direct a query to the most appropriate data source.

End-User Access Tools: The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-user access tools.
Examples of end-user access tools include:
 Reporting and Query Tools
 Application Development Tools
 Executive Information Systems Tools
 Online Analytical Processing Tools
 Data Mining Tools

Characteristics of Data Warehouse

Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier,
product, and sales instead of the ongoing operations.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources,
such as relational databases, flat files, and online transaction records.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5–10 years).
Non-volatile: A data warehouse is always a physically separate store of data. Because of this separation, a data warehouse does not require transaction processing, recovery, or concurrency control mechanisms.
It usually requires only two operations in data accessing: the initial loading of data and the access of data.



Benefits of Data Warehouse

• Data warehouses are designed to perform well with enormous amounts of data.


• The structure of data warehouses is more accessible for end-users to navigate, understand, and
query.
• Queries that would be complex in many normalized databases could be easier to build and
maintain in data warehouses.
• Data warehousing is an efficient method to manage demand for lots of information from lots of
users.
• Data warehousing provides the capability to analyze a large amount of historical data.

Differences between OLTP and OLAP

Feature              OLTP                                      OLAP
users                clerk, IT professional                    knowledge worker
function             day-to-day operations                     decision support
DB design            application-oriented                      subject-oriented
data                 current, up-to-date; detailed,            historical; summarized,
                     flat relational; isolated                 multidimensional; integrated, consolidated
usage                repetitive                                ad hoc
access               read/write; index/hash on primary key     lots of scans
unit of work         short, simple transaction                 complex query
# records accessed   tens                                      millions
# users              thousands                                 hundreds
DB size              100 MB to GB                              100 GB to TB
metric               transaction throughput                    query throughput, response time



Dimensional Modelling

 Dimensional Modelling is a data structure technique optimized for data storage in a Data
warehouse.
 The purpose of dimensional modeling is to optimize the database for faster retrieval of data.
 A dimensional model in a data warehouse is designed to read, summarize, and analyze numeric
information such as values, balances, counts, and weights. In contrast, relational models are
optimized for the addition, updating, and deletion of data in a real-time Online Transaction
Processing (OLTP) system.
 These dimensional and relational models each have their own way of storing data, with specific
advantages. For instance, in the relational model, normalization and ER modelling reduce
redundancy in the data. On the contrary, the dimensional model in a data warehouse arranges data in
such a way that it is easier to retrieve information and generate reports.
 Hence, dimensional models are used in data warehouse systems and are not a good fit for
transactional relational systems.

Multidimensional Data Model

 The multidimensional data model is a method of organizing data in the database so that the
contents can be arranged and viewed along multiple dimensions.
 It represents data in the form of data cubes.
 Data cubes allow the data to be modelled and viewed from many dimensions and perspectives.
 It is defined by dimensions and facts and is represented by a fact table.
 Fact
o Facts are the measurements from the business process.
o For a Sales business process, a measurement would be quarterly sales number
 Dimension
o A dimension provides the context surrounding a business process event. In simple terms,
dimensions give the who, what, and where of a fact. In the Sales business process, for the fact
quarterly sales number, the dimensions would be
o Who – Customer Names
o Where – Location
o What – Product Name
 Fact Table
o A fact table is the primary table in dimensional modelling; it holds the measurements (facts) of the business process.
 Dimension Table
o A dimension table contains dimensions of a fact.

(Figure: 2D representation of a data cube)

(Figure: 3D representation of a data cube)
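
As an illustration (not part of the original notes), the minimal Python sketch below builds a tiny sales fact table with pandas and pivots it into a 2D slice of a data cube; the column names and figures are hypothetical.

import pandas as pd

# Hypothetical fact table: dimensions (who/where/what) plus the fact (quarterly_sales).
sales = pd.DataFrame({
    "customer": ["Asha", "Ravi", "Asha", "Ravi"],
    "location": ["Mysore", "Mysore", "Bangalore", "Bangalore"],
    "product":  ["Pen", "Pen", "Book", "Book"],
    "quarterly_sales": [120, 80, 200, 150],
})

# A pivot over two dimensions gives a small 2D slice of the cube.
cube_2d = pd.pivot_table(sales, values="quarterly_sales",
                         index="location", columns="product",
                         aggfunc="sum")
print(cube_2d)
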
Advantages:
• easy to handle.
• easy to maintain.
• performance is better than that of normal databases (e.g., relational databases).
• The representation of data is better than in traditional databases, because multidimensional
databases support multiple views of the data and capture several different factors.
Disadvantage:
• it requires trained professionals to organize and examine the data in the database.

N V Vidyalakshmi, Assistant Professor in CS, NIE FGC, Mysore Page 5


Data Preprocessing Techniques
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation
 Data Discretization

Data Cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.

Steps for Cleaning Data

1. Remove duplicate or irrelevant observations


 Remove duplicate and irrelevant observations from the dataset (a minimal sketch follows below).
 The majority of duplicate observations arise during data gathering.
 Duplicate data can be produced when data sets are merged from several sources, or when data is
received from clients or other departments.
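
A minimal sketch of step 1 using pandas; the DataFrame, column names, and the choice of which column is "irrelevant" are hypothetical, not part of the original notes.

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "city": ["Mysore", "Mysore", "Bangalore", "Mysore"],
    "amount": [100, 100, 250, 175],
})

# Drop exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates()

# Drop a column assumed to be irrelevant to the analysis.
df = df.drop(columns=["city"])
print(df)
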



2. Fix structural errors
 Structural errors are odd naming conventions, typos, or inconsistent capitalization introduced
when data is measured or transferred. These inconsistencies may result in mislabelled categories
or classes.
 For instance, "N/A" and "Not Applicable" might both appear in a given sheet, but they ought
to be analyzed under the same heading.

3. Filter unwanted outliers


 There will frequently be isolated observations that, at first glance, do not seem to fit the data
being analyzed.
 Removing an outlier when there is a good reason to, such as an incorrect data entry, will improve
the quality of the data being worked with.

4. Handle missing data


 There are a few options for handling missing data; a minimal sketch follows below.
 Observations with missing values can be removed, but doing so results in a loss of
information, so proceed with caution.
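
A minimal sketch of the two common options using pandas; the data is hypothetical.

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "salary": [30000, 42000, np.nan, 51000]})

# Option 1: drop rows with any missing value (loses information).
dropped = df.dropna()

# Option 2: fill missing values with the column mean.
filled = df.fillna(df.mean(numeric_only=True))
print(dropped, filled, sep="\n")
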

Data Cleaning Techniques

1. Ignore the tuple: This approach is not very practical, because it is only suitable when a tuple has
several attributes with missing values.

2. Fill in the missing value: Filling in the missing value manually is the most common method, but it is
time-consuming and not very practical for large datasets; other options include filling in the attribute
mean or the most likely value.

3. Binning method: This strategy is fairly easy to understand. The sorted data is split into several
equal-sized bins, and each value is then smoothed using the neighbouring values in its bin, for example
by replacing it with the bin mean or the bin boundaries (see the sketch below).
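
A minimal sketch of smoothing by bin means over equal-frequency bins, using NumPy; the values and the bin count are hypothetical.

import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3

# Sort, split into equal-sized bins, and replace each value with its bin mean.
sorted_vals = np.sort(values)
bins = np.array_split(sorted_vals, n_bins)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
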

4. Regression: The data is smoothed by fitting it to a regression function. The regression may be linear
or multiple: multiple regression uses more than one independent variable, whereas linear regression
uses only one.



5. Clustering: This technique groups similar data values into clusters; values that fall outside every
cluster are treated as outliers.

Data Integration

Data integration is the process of merging heterogeneous data from several sources.
It is a strategy that integrates data from several sources to make it available to users in a single
uniform view.

Ex: Integrating data from the patient records of many clinics and systems into a single view of useful
information assists clinicians in identifying medical disorders and diseases.

Issues in Data Integration

1. Entity Identification Problem

When records are obtained from heterogeneous sources, how can the real-world entities in the
different sources be matched up?

For example, customer data may be supplied by two different data sources: one source identifies an
entity by a customer ID, while the other uses a customer number. Analyzing the metadata of each
source helps prevent errors during schema integration.

2. Redundancy and Correlation Analysis

One of the major issues in data integration is redundancy. Redundant data is unimportant data that is
no longer required. Redundancy can also appear when an attribute can be derived from another
attribute or set of attributes in the data set.

Inconsistencies in attribute naming further increase the level of redundancy. Correlation analysis can
be used to detect redundant attributes (see the sketch below).
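
A minimal sketch of detecting a redundant attribute via correlation analysis with pandas; the column names, synthetic data, and the 0.95 threshold are assumptions for illustration.

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 100)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,      # derivable from height_cm, hence redundant
    "weight_kg": rng.normal(65, 8, 100),
})

corr = df.corr()                         # Pearson correlation matrix
# Flag attribute pairs whose absolute correlation exceeds 0.95.
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and abs(corr.loc[a, b]) > 0.95]
print(redundant)
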



3. Tuple Duplication

In addition to redundancy, data integration must also handle duplicate tuples. Duplicate tuples may
appear in the integrated data, for example when a denormalized table is used as a source for data
integration.

4. Data Value Conflict Detection and Resolution

When records are combined from several sources, attribute values for the same real-world entity may
conflict. The disparity may arise because the values are represented, scaled, or encoded differently in
the different sources.

For example, the price of a hotel room may be expressed in different currencies in different cities.
This type of conflict is detected and resolved during the data integration process.

Data Reduction

Data reduction is a technique used to reduce the size of a dataset while still preserving the most
important information.

This can be beneficial in situations where the dataset is too large to be processed efficiently, or
where the dataset contains a large amount of irrelevant or redundant information.

Data Reduction Techniques

1. Data Sampling: This technique involves selecting a subset of the data to work with, rather than
using the entire dataset. This can be useful for reducing the size of a dataset while still preserving the
overall trends and patterns in the data (see the sketch below).
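
A minimal sketch of simple random sampling with pandas; the sample fraction and random_state are arbitrary choices.

import pandas as pd

df = pd.DataFrame({"value": range(1000)})

# Keep a 10% random sample of the rows.
sample = df.sample(frac=0.1, random_state=42)
print(len(sample))   # 100
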

2. Dimensionality Reduction: This technique involves reducing the number of features in the dataset,
either by removing features that are not relevant or by combining multiple features into a single
feature.
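
A minimal sketch of dimensionality reduction using PCA from scikit-learn; the synthetic data and the choice of 2 components are assumptions for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # 200 samples, 10 features

pca = PCA(n_components=2)           # combine the 10 features into 2 components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)               # (200, 2)
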

3. Data Compression: This technique involves using techniques such as lossy or lossless compression
to reduce the size of a dataset.



4. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.

5. Feature Selection: This technique involves selecting a subset of features from the dataset that are
most relevant to the task at hand.

Data Transformation

Data transformation is a technique used to convert the raw data into a suitable format that efficiently
eases data mining and retrieves strategic information.

Data transformation can involve data cleaning techniques (such as smoothing) and data reduction
techniques (such as aggregation) to convert the data into the appropriate form.

Data Transformation Techniques

1. Data Smoothing
• Data smoothing is a process that is used to remove noise from the dataset using some
algorithms.
• The concept behind data smoothing is that it will be able to identify simple changes to help
predict different trends and patterns.
• Binning: This method splits the sorted data into the number of bins and smoothens the
data values in each bin considering the neighborhood values around it.
• Regression: This method identifies the relation between two attributes so that, given one
attribute, the other can be predicted.
• Clustering: This method groups similar data values and form a cluster. The values that
lie outside a cluster are known as outliers.

2. Attribute Construction
• New attributes are constructed from the existing attributes to ease data mining.
• For example, suppose we have a data set containing measurements of different plots, i.e., we
may have the height and width of each plot. We can construct a new attribute 'area' from the
attributes 'height' and 'width'. This also helps in understanding the relations among the attributes
in a data set.



3. Data Aggregation
• Data collection or aggregation is the method of storing and presenting data in a summary
format.
• For example, we have a data set of sales reports of an enterprise that has quarterly sales of each
year. We can aggregate the data to get the enterprise's annual sales report.
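
A minimal sketch of aggregating quarterly sales into annual sales with a pandas groupby; the figures are hypothetical.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [100, 120, 90, 150, 110, 130, 95, 160],
})

# Sum the quarterly figures per year to get the annual sales report.
annual = sales.groupby("year", as_index=False)["sales"].sum()
print(annual)
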

4. Data Normalization
• Data normalization involves converting all data variables into a given range.
• This involves transforming the data to fall within a smaller or common range such as [-1, 1] or
[0.0, 1.0].
• Normalizing the data attempts to give all attributes an equal weight.
• For example, changing the unit of height from metres to inches leads to different results,
because the attribute then has a larger range.
• To help avoid this dependence on the choice of units, the data should be normalized (see the
sketch below).
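
A minimal sketch of min-max normalization to the range [0.0, 1.0], i.e. v' = (v - min) / (max - min); the column values are hypothetical.

import pandas as pd

df = pd.DataFrame({"height_m": [1.50, 1.62, 1.75, 1.90]})

# Rescale the attribute so its values fall between 0.0 and 1.0.
col = df["height_m"]
df["height_norm"] = (col - col.min()) / (col.max() - col.min())
print(df)
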

5. Data Discretization
• This is the process of converting continuous data into a set of data intervals. Continuous attribute
values are replaced by small interval labels, which makes the data easier to study and analyze.
• For example, numeric values can be grouped into intervals (1-10, 11-20, ...) and ages into
categories (young, middle-aged, senior).

6. Data Generalization
• It converts low-level data attributes to high-level data attributes.
• This conversion from a lower level to a higher conceptual level is useful to get a clearer picture
of the data.
• For example, age data may appear as numeric values (20, 30) in a dataset; it can be transformed
to a higher conceptual level as a categorical value (young, old).

Data Discretization

Data discretization refers to a method of converting a huge number of data values into smaller ones
so that the evaluation and management of data become easy.

In other words, data discretization is a method of converting attribute values of continuous data into a
finite set of intervals with minimum data loss.

Supervised discretization refers to a method in which the class data is used.

Unsupervised discretization refers to a method in which class information is not used.

Data Discretization Techniques

1. Discretization by binning: This is an unsupervised method of partitioning the data into
equal-width or equal-frequency bins (see the sketch below).
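
A minimal sketch of equal-width and equal-frequency discretization with pandas; the ages, bin count, and labels are hypothetical.

import pandas as pd

ages = pd.Series([3, 7, 12, 19, 24, 33, 41, 47, 55, 62, 70, 85])

# Equal-width bins (pd.cut) vs. equal-frequency bins (pd.qcut).
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
equal_freq  = pd.qcut(ages, q=3, labels=["young", "middle", "senior"])
print(equal_width.tolist())
print(equal_freq.tolist())
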

2. Discretization by clustering: Clustering can be applied to discretize numeric attributes. It partitions
the values into different clusters or groups by following a top-down or bottom-up strategy.

3. Discretization by decision tree: This employs a top-down splitting strategy. It is a supervised
technique that uses class information.

4. Discretization by correlation analysis: This follows a bottom-up approach, finding the best
neighbouring intervals and then merging them to form larger intervals.

5. Discretization by histogram: Like binning, histogram analysis is an unsupervised technique because
it does not use any class information. Various partitioning rules are used to define the histograms.
