FoDS Notes - Unit 2
A data warehouse is kept separate from the operational DBMS; it stores a huge amount of data that is typically collected
from multiple heterogeneous sources such as files, DBMSs, etc.
Flat Files: A Flat file system is a system of files in which transactional data is stored, and every file in
the system must have a different name.
Meta Data: A set of data that defines and gives information about other data.
• Metadata is used in a data warehouse for a variety of purposes:
• It summarizes necessary information about data, for example the
author, the date the data was built or last changed, and the file size.
• Metadata is used to direct a query to the most appropriate data
source.
End-User Access Tools: The principal purpose of a data warehouse is to provide information to
business managers for strategic decision-making. These users interact with the warehouse using
end-user access tools.
Some examples of end-user access tools are:
Reporting and Query Tools
Application Development Tools
Executive Information Systems Tools
Online Analytical Processing Tools
Data Mining Tools
Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier,
product, and sales instead of the ongoing operations.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources,
such as relational databases, flat files, and online transaction records.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5–10
years).
Non-volatile: A data warehouse is always a physically separate data store. Due to this separation, a
data warehouse does not require transaction processing, recovery, or concurrency control
mechanisms.
It usually requires only two operations in data accessing: initial loading of data and access of data.
[Comparison table: OLTP vs. OLAP]
Dimensional Modelling is a data structure technique optimized for data storage in a Data
warehouse.
The purpose of dimensional modeling is to optimize the database for faster retrieval of data.
A dimensional model in a data warehouse is designed to read, summarize, and analyze numeric
information like values, balances, counts, weights, etc. In contrast, relational models are
optimized for addition, updating, and deletion of data in a real-time Online Transaction
Processing (OLTP) system.
The dimensional and relational models each have their own way of storing data, with specific
advantages. For instance, in the relational model, normalization and ER modelling reduce
redundancy in the data. In contrast, the dimensional model in a data warehouse arranges data in
such a way that it is easier to retrieve information and generate reports.
Hence, dimensional models are used in data warehouse systems and are not a good fit for
transactional (OLTP) systems.
The multi-dimensional data model is a method for organizing data in the database, with proper
arrangement and assembly of the database contents.
It represents data in the form of data cubes.
Data cubes allow data to be modelled and viewed from many dimensions and perspectives.
It is defined by dimensions and facts and is represented by a fact table.
Fact
o Facts are the measurements from the business process.
o For a Sales business process, a measurement would be quarterly sales number
Dimension
o Dimensions provide the context surrounding a business process event. In simple terms,
they give the who, what, and where of a fact. In the Sales business process, for the fact
'quarterly sales number', the dimensions would be
o Who – Customer Names
o Where – Location
o What – Product Name
Fact Table
o A fact table is the primary table in dimensional modelling.
Dimension Table
o A dimension table contains dimensions of a fact.
[Figures: 2D and 3D representations of a data cube]
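As a small illustration of facts, dimensions, and the data cube, the following sketch (Python with pandas; all table names, column names, and values are made up for illustration) builds a tiny fact table with two dimension tables and then pivots the joined data into a 2-D cube of sales by product and location.

    import pandas as pd

    # Hypothetical dimension tables (the who / what context of a sale)
    customers = pd.DataFrame({"cust_id": [1, 2], "cust_name": ["Asha", "Ravi"]})
    products = pd.DataFrame({"prod_id": [10, 20], "prod_name": ["Pen", "Book"]})

    # Hypothetical fact table: one row per sale, with a measure and foreign keys
    sales = pd.DataFrame({
        "cust_id": [1, 1, 2, 2],
        "prod_id": [10, 20, 10, 20],
        "location": ["Mysore", "Mysore", "Bengaluru", "Bengaluru"],
        "amount": [100, 250, 120, 300],
    })

    # Join the fact table with its dimensions (star-schema style),
    # then build a 2-D cube slice: total sales by product and location
    joined = sales.merge(customers, on="cust_id").merge(products, on="prod_id")
    cube_2d = joined.pivot_table(values="amount", index="prod_name",
                                 columns="location", aggfunc="sum")
    print(cube_2d)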
Advantages:
• It is easy to handle.
• It is easy to maintain.
• Its performance is better than that of normal databases (e.g., relational databases).
• The representation of data is better than in traditional databases, because multi-dimensional
databases are multi-viewed and carry different types of factors.
Disadvantage:
• It requires trained professionals to recognize and examine the data in the database.
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.
1. Ignore the tuples: This approach is not very effective unless the tuple contains several attributes
with missing values; otherwise, useful data in the remaining attributes is discarded.
2. Fill in the missing value: This strategy can be time-consuming and is not always practical. The
missing value can be filled in manually, or by using the attribute mean or the most probable value
(see the sketch after this list).
3. Binning method: This strategy is fairly easy to understand. The data is first sorted and then split
into several equal-sized bins; the values in each bin are then smoothed using their neighbouring
values, for example by bin means, bin medians, or bin boundaries.
4. Regression: The data is smoothed out by fitting it to a regression function. The regression may be
linear (a single independent variable) or multiple (more than one independent variable).
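A minimal sketch of points 2 to 4 above, in Python with numpy/pandas (the column names, values, and the number of bins are assumptions made for illustration): it fills a missing value with the attribute mean, smooths sorted values by equal-frequency bin means, and smooths one attribute with a simple linear regression on another.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [23, 25, np.nan, 30, 31, 45],
                       "income": [20, 24, 26, 31, 33, 48]})

    # 2. Fill in the missing value with the attribute mean
    df["age"] = df["age"].fillna(df["age"].mean())

    # 3. Binning: sort, split into equal-sized bins, replace values by bin means
    values = np.sort(df["income"].to_numpy())
    bins = np.array_split(values, 3)              # 3 bins, an assumed choice
    smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
    print("income smoothed by bin means:", smoothed)

    # 4. Regression: smooth income using a linear fit against age
    slope, intercept = np.polyfit(df["age"], df["income"], deg=1)
    df["income_smoothed"] = intercept + slope * df["age"]
    print(df)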
Data Integration
Data integration is the process of merging heterogeneous data from several sources.
It is a strategy that integrates data from several sources and makes it available to users in a single,
uniform view.
Ex: Integrating data from various patient records and clinics assists clinicians in identifying medical
disorders and diseases, by combining data from many systems into a single view of useful
information.
Because the records are obtained from heterogeneous sources, one issue is how to match the
real-world entities across the data (entity identification). For example, given customer data from
different data sources, one source may identify a customer by a customer identity (e.g., customer_id)
while another uses a customer number (e.g., cust_number). Analyzing the available metadata helps
prevent errors during schema integration.
One of the major issues during data integration is redundancy. Redundant data is data that is
unimportant or no longer required; it may also arise when an attribute can be derived from another
attribute in the data set.
Inconsistencies in attribute or dimension naming further increase redundancy. Correlation analysis
can be used to detect such redundancy.
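Correlation analysis can be sketched as follows in Python with pandas (the attribute names and values are hypothetical): a correlation coefficient close to +1 or -1 between two numeric attributes suggests that one of them may be redundant.

    import pandas as pd

    df = pd.DataFrame({
        "height_cm": [150, 160, 170, 180, 190],
        "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # derived from height_cm
        "weight_kg": [55, 60, 72, 80, 95],
    })

    # Pearson correlation matrix between all numeric attributes
    print(df.corr(method="pearson"))
    # height_cm and height_in correlate almost perfectly, so one is redundant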
In addition to redundancy, data integration must also handle duplicate tuples. Duplicate tuples may
appear in the integrated data, for example when denormalized tables are used as sources for
data integration.
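After integration, duplicate tuples can be detected and removed, for example with pandas (a sketch; the column names are assumed):

    import pandas as pd

    merged = pd.DataFrame({
        "cust_id": [1, 1, 2],
        "city": ["Mysore", "Mysore", "Bengaluru"],
    })

    # Keep only one copy of each identical tuple
    deduplicated = merged.drop_duplicates()
    print(deduplicated)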
Another issue is the detection and resolution of data value conflicts: when combining records from
several sources, attribute values for the same real-world entity may differ because they are
represented differently (for example, in different units or scales) in the different data sets.
For example, in different cities the price of a hotel room might be expressed in a different
currency. This type of issue is recognized and fixed during the data integration process.
Data Reduction
Data reduction is a technique used to reduce the size of a dataset while still preserving the most
important information.
This can be beneficial in situations where the dataset is too large to be processed efficiently, or
where the dataset contains a large amount of irrelevant or redundant information.
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather than
using the entire dataset. This can be useful for reducing the size of a dataset while still preserving the
overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the dataset,
either by removing features that are not relevant or by combining multiple features into a single
feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless compression
to reduce the size of a dataset.
4. Feature Selection: This technique involves selecting a subset of features from the dataset that are
most relevant to the task at hand (a combined sketch of these reduction techniques follows).
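The sketch below (Python with pandas and scikit-learn; the dataset, target labels, and all parameter choices are assumptions made for illustration) walks through the four reduction techniques above: random sampling, PCA for dimensionality reduction, lossless compression of the serialized data, and univariate feature selection.

    import zlib
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("ABCDE"))
    labels = (df["A"] + df["B"] > 0).astype(int)  # assumed target for selection

    # 1. Data sampling: keep a 10% random subset of the rows
    sample = df.sample(frac=0.1, random_state=0)

    # 2. Dimensionality reduction: project 5 features onto 2 principal components
    reduced = PCA(n_components=2).fit_transform(df)

    # 3. Data compression: lossless compression of the serialized dataset
    raw = df.to_csv(index=False).encode()
    compressed = zlib.compress(raw)
    print(len(raw), "->", len(compressed), "bytes")

    # 4. Feature selection: keep the 2 features most associated with the labels
    selected = SelectKBest(score_func=f_classif, k=2).fit_transform(df, labels)
    print(sample.shape, reduced.shape, selected.shape)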
Data Transformation
Data transformation is a technique used to convert the raw data into a suitable format that efficiently
eases data mining and retrieves strategic information.
Data transformation includes data cleaning techniques and a data reduction technique to convert the
data into the appropriate form.
1. Data Smoothing
• Data smoothing is a process that is used to remove noise from the dataset using some
algorithms.
• The concept behind data smoothing is that it will be able to identify simple changes to help
predict different trends and patterns.
• Binning: This method splits the sorted data into a number of bins and smooths the
data values in each bin using the neighbouring values around them.
• Regression: This method identifies the relation between two attributes so that,
given one attribute, it can be used to predict the other attribute.
• Clustering: This method groups similar data values into clusters. Values that
lie outside the clusters are known as outliers (see the sketch below).
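A minimal sketch of the clustering idea (Python with scikit-learn; the values, the number of clusters, and the cut-off are assumptions): values that lie unusually far from their cluster centre are flagged as potential outliers.

    import numpy as np
    from sklearn.cluster import KMeans

    values = np.array([10, 11, 12, 13, 14, 50, 51, 52, 53, 54, 95]).reshape(-1, 1)

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)

    # Distance of each value to the centre of its own cluster
    centres = kmeans.cluster_centers_[kmeans.labels_].ravel()
    distances = np.abs(values.ravel() - centres)

    threshold = 3 * distances.mean()        # assumed outlier cut-off
    print("potential outliers:", values.ravel()[distances > threshold])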
2. Attribute Construction
• New attributes are constructed from the existing attributes to produce a data set that eases data
mining.
• For example, suppose we have a data set with measurements of different plots, i.e., the height
and width of each plot. We can construct a new attribute 'area' from the attributes 'height' and
'width' (see the sketch below). This also helps in understanding the relations among the attributes in a
data set.
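A tiny sketch of attribute construction in pandas (the plot measurements are made up): the new attribute 'area' is derived from the existing 'height' and 'width' attributes.

    import pandas as pd

    plots = pd.DataFrame({"height": [10, 12, 8], "width": [4, 5, 6]})

    # Construct a new attribute from the existing ones
    plots["area"] = plots["height"] * plots["width"]
    print(plots)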
3. Data Normalization
• Data normalization involves converting all data variables into a given range.
• This involves transforming the data to fall within a smaller or common range such as Range =
[-1,1], [0.0,1.0].
• Normalizing the data attempts to give all attributes an equal weight.
• For example, changing the unit of height from meters to inches can lead to different results,
because the attribute then has a much larger range.
• To help avoid this dependence on the choice of units, the data should be normalized (see the
sketch below).
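A minimal sketch of min-max normalization to the range [0.0, 1.0] in pandas (the attribute names and values are assumed):

    import pandas as pd

    df = pd.DataFrame({"height_m": [1.5, 1.6, 1.75, 1.9],
                       "income": [20000, 35000, 50000, 80000]})

    # Min-max normalization: every attribute is rescaled to [0.0, 1.0]
    normalized = (df - df.min()) / (df.max() - df.min())
    print(normalized)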
4. Data Discretization
• This is a process of converting continuous data into a set of data intervals. Continuous attribute
values are substituted by small interval labels. This makes the data easier to study and analyze.
• For example, ages can be grouped into intervals such as (1–10, 11–20, ...) or into labels such as
(young, middle age, senior) (see the sketch below).
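A short sketch of discretization with pandas (the interval boundaries and labels are assumed): continuous age values are replaced by interval labels.

    import pandas as pd

    ages = pd.Series([5, 17, 23, 34, 46, 61, 72])

    # Replace continuous ages with labelled intervals
    age_groups = pd.cut(ages, bins=[0, 20, 40, 60, 100],
                        labels=["young", "adult", "middle age", "senior"])
    print(age_groups)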
5. Data Generalization
• It converts low-level data attributes to high-level data attributes.
• This conversion from a lower level to a higher conceptual level is useful to get a clearer picture
of the data.
• For example, age data in a dataset may appear as exact values such as 20 or 30. It can be
transformed to a higher conceptual level as a categorical value (young, old), as sketched below.
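Generalization can be sketched as mapping exact values to a higher conceptual level (pandas; the age cut-off of 40 is an assumption made for illustration):

    import pandas as pd

    ages = pd.Series([20, 30, 45, 62, 71])

    # Generalize exact ages to a higher conceptual level (young / old)
    generalized = ages.apply(lambda a: "young" if a < 40 else "old")
    print(generalized)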
Data Discretization
Data discretization refers to a method of converting a huge number of data values into smaller ones
so that the evaluation and management of data become easy.
In other words, data discretization is a method of converting the attribute values of continuous data
into a finite set of intervals.
Discretization can be supervised (it uses class information) or unsupervised (it does not), and it can
proceed top-down (by splitting) or bottom-up (by merging), depending on the direction in which the
operation proceeds.
Discretization by decision tree analysis: it employs a top-down splitting strategy. It is a supervised
technique that uses class information (a sketch follows).
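A sketch of decision-tree-based discretization with scikit-learn (the data, class labels, and tree depth are assumptions): a shallow tree is fitted on the attribute against the class labels, and the split thresholds it learns during top-down splitting become the interval boundaries.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    ages = np.array([18, 22, 25, 33, 38, 45, 52, 60, 67, 71]).reshape(-1, 1)
    labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])  # assumed class information

    # Top-down splitting: the tree chooses cut points that best separate classes
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(ages, labels)

    # Internal (non-leaf) node thresholds serve as interval boundaries
    cut_points = sorted(t for t in tree.tree_.threshold if t != -2)
    print("cut points:", cut_points)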