
UNIT IV

UNIT IV: Data warehousing: introduction – characteristics of a data warehouse – data marts – other
aspects of data marts. Online analytical processing: introduction – OLTP & OLAP systems – data
modelling – star schema for multidimensional view – multi-fact star schema or snowflake schema –
OLAP tools – state of the market – OLAP tools and the internet.

One-mark questions
1. What is the primary purpose of a data warehouse?
• Answer: To store and analyze large volumes of data for decision-
making purposes.
2. Name one characteristic of a data warehouse.
• Answer: Subject-oriented.
3. Define OLAP.
• Answer: Online Analytical Processing, used for complex data analysis
and reporting.
4. What does OLTP stand for?
• Answer: Online Transactional Processing.
5. How does OLTP differ from OLAP?
• Answer: OLTP focuses on transactional data processing, while OLAP
focuses on analytical processing for decision support.
6. What is a data mart?
• Answer: A subset of a data warehouse focused on a specific business
area or department.
7. Name one aspect of a data mart.
• Answer: Data marts are typically tailored to meet the needs of specific
user groups.
8. What is the star schema used for?
• Answer: It provides a multidimensional view of data for efficient
querying and analysis.
9. In a star schema, what does the central fact table represent?
• Answer: It represents the primary focus of the analysis, typically
containing quantitative data.
10. What is a snowflake schema?
• Answer: A data modeling technique where dimension tables are
normalized, forming a snowflake-like structure.
11. Give an example of an OLAP tool.
• Answer: Microsoft Excel with its pivot tables feature.
12. What does OLAP enable users to do?
• Answer: OLAP allows users to analyze multidimensional data
interactively from multiple perspectives.
13. How does the state of the market for OLAP tools impact businesses?
• Answer: It influences the availability of features, pricing, and support for
analytical capabilities.
14. Name one benefit of using OLAP tools for data analysis.
• Answer: Faster decision-making based on real-time or near-real-time
data analysis.
15. How does the internet enhance OLAP tools?
• Answer: It enables access to distributed data sources and facilitates
collaboration among users.
16. Why are data warehouses crucial for business intelligence?
• Answer: They provide a centralized repository of integrated data for
analysis and reporting.
17. What role does data modeling play in data warehousing?
• Answer: It helps structure data for efficient storage, retrieval, and
analysis.
18. Name one challenge associated with implementing data warehousing
solutions.
• Answer: Data integration issues due to disparate data sources and
formats.
19. How do data marts differ from data warehouses?
• Answer: Data marts are smaller, specialized subsets of data warehouses,
catering to specific user needs.
20. Why is it essential for organizations to invest in OLAP tools?
• Answer: OLAP tools enable organizations to gain valuable insights from
their data, leading to informed decision-making and competitive
advantage.

10-mark and 5-mark questions


1. Discuss the main characteristics of a data warehouse and explain how they
differentiate it from operational databases.
2. Define OLAP and OLTP systems, highlighting their respective roles in data
management and analysis.
3. How does the star schema facilitate a multidimensional view of data in a data
warehousing environment?
4. Compare and contrast the multi-fact star schema and the snowflake schema in
the context of data modeling for OLAP systems.
5. What are data marts, and what advantages do they offer in comparison to a
centralized data warehouse?
6. Describe the key features of OLAP tools and their significance in supporting
decision-making processes.
7. Assess the current state of the market for OLAP tools, considering factors such
as key players, emerging trends, and market dynamics.
8. Explain the role of the internet in enhancing the capabilities of OLAP tools for
data analysis and visualization.
9. How do OLAP systems contribute to improving business intelligence and
decision-making within organizations?
10. Discuss some of the challenges and considerations in designing and
implementing effective data warehousing and OLAP solutions.

Data Warehousing: Introduction

A Database Management System (DBMS) stores data in the form of tables, uses an ER
model, and aims to preserve the ACID properties. For example, a college DBMS has tables for
students, faculty, etc.
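Below is a minimal sketch, in Python, of the ACID-style transactional behaviour a DBMS provides. The students table, its columns, and the values are hypothetical illustrations; SQLite is used only because it ships with Python.

    import sqlite3

    # Hypothetical college table; SQLite's default mode wraps the statements
    # below in an implicit transaction.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, credits INTEGER)")
    conn.execute("INSERT INTO students VALUES (1, 'Abi', 12)")
    conn.commit()

    try:
        # Atomicity: both statements succeed together or not at all.
        conn.execute("UPDATE students SET credits = credits + 4 WHERE id = 1")
        conn.execute("INSERT INTO students VALUES (1, 'Duplicate', 0)")  # violates the primary key
        conn.commit()
    except sqlite3.IntegrityError:
        conn.rollback()  # the update is undone along with the failed insert

    print(conn.execute("SELECT credits FROM students WHERE id = 1").fetchone())  # (12,)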

A Data Warehouse is separate from DBMS, it stores a huge amount of data, which is
typically collected from multiple heterogeneous sources like files, DBMS, etc. The goal is to
produce statistical results that may help in decision-making. For example, a college might
want to see quick different results, like how the placement of CS students has improved over
the last 10 years, in terms of salaries, counts, etc.
Issues that Occur while Building the Warehouse
• When and how to gather data: In a source-driven architecture for gathering
data, the data sources transmit new information, either continually (as transaction
processing takes place) or periodically (nightly, for example). In a destination-
driven architecture, the data warehouse periodically sends requests for new data
to the sources. Unless updates at the sources are replicated at the warehouse via
two-phase commit, the warehouse will never be quite up to date with the sources.
Two-phase commit is usually far too expensive to be an option, so data
warehouses typically have slightly out-of-date data. That, however, is usually not
a problem for decision-support systems.
• What schema to use: Data sources that have been constructed independently
are likely to have different schemas. In fact, they may even use different data
models. Part of the task of a warehouse is to perform schema integration, and to
convert data to the integrated schema before they are stored. As a result, the data
stored in the warehouse are not just a copy of the data at the sources. Instead, they
can be thought of as a materialized view of the data at the sources.
• Data transformation and cleansing: The task of correcting and preprocessing
data is called data cleansing. Data sources often deliver data with numerous minor
inconsistencies, which can be corrected. For example, names are often misspelled,
and addresses may have street, area, or city names misspelled, or postal codes
entered incorrectly. These can be corrected to a reasonable extent by consulting a
database of street names and postal codes in each city. The approximate matching
of data required for this task is referred to as fuzzy lookup (see the sketch after this list).
• How to propagate update: Updates on relations at the data sources must be
propagated to the data warehouse. If the relations at the data warehouse are exactly
the same as those at the data source, the propagation is straightforward. If they are
not, the problem of propagating updates is basically the view-maintenance
problem.
• What data to summarize: The raw data generated by a transaction-processing
system may be too large to store online. However, we can answer many queries
by maintaining just summary data obtained by aggregation on a relation, rather
than maintaining the entire relation. For example, instead of storing data about
every sale of clothing, we can store total sales of clothing by item name and
category (also illustrated in the sketch after this list).
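The following Python sketch illustrates the last two points above: a fuzzy lookup that cleanses misspelled city names, followed by aggregation into summary data. The city list, sales records, and column names are hypothetical, and the standard-library difflib stands in for a production fuzzy matcher.

    import difflib
    import pandas as pd

    # Hypothetical reference data and raw sales records.
    known_cities = ["Chennai", "Coimbatore", "Madurai"]
    sales = pd.DataFrame({
        "city":     ["Chenai", "Coimbatore", "Maduraii", "Chennai"],
        "item":     ["shirt", "shirt", "saree", "saree"],
        "category": ["clothing"] * 4,
        "amount":   [500, 700, 1200, 900],
    })

    # Cleansing via fuzzy lookup: correct misspelled city names by
    # approximate matching against the reference list.
    def clean_city(name):
        match = difflib.get_close_matches(name, known_cities, n=1, cutoff=0.8)
        return match[0] if match else name

    sales["city"] = sales["city"].map(clean_city)

    # Summarization: store totals by item and category instead of every sale.
    summary = sales.groupby(["category", "item"], as_index=False)["amount"].sum()
    print(summary)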
Need for Data Warehouse
An ordinary database can store MBs to GBs of data, and that too for a specific purpose. For
storing data of TB size, storage shifts to a data warehouse. Besides this, a
transactional database doesn't lend itself to analytics. To effectively perform analytics, an
organization keeps a central data warehouse to closely study its business by organizing,
understanding, and using its historical data for making strategic decisions and analyzing
trends.
Benefits of Data Warehouse
• Better business analytics: A data warehouse plays an important role in every
business by storing and enabling analysis of all the past data and records of the
company, which further improves the company's understanding and analysis of its data.
• Faster queries: The data warehouse is designed to handle large queries, which is
why it runs queries faster than an operational database.
• Improved data quality: In the data warehouse, the data gathered from different
sources is stored and analyzed without being altered or added to, so the quality
of the data is maintained; if any data-quality issue arises, the data warehouse
team resolves it.
• Historical insight: The warehouse stores all your historical data, which contains
details about the business, so that one can analyze it at any time and extract insights
from it.

Data Warehouse vs DBMS

• Processing: A common database is based on operational or transactional processing,
where each operation is an indivisible transaction. A data warehouse is based on
analytical processing.
• Data: Generally, a database stores current, up-to-date data that is used for daily
operations. A data warehouse maintains historical data over time; historical data is
kept over years and can be used for trend analysis, making future predictions, and
decision support.
• Scope: A database is generally application-specific. Example – a database stores
related data, such as the student details in a school. A data warehouse is integrated,
generally at the organization level, by combining data from different databases.
Example – a data warehouse integrates the data from one or more databases, so that
analysis can be done to get results, such as the best-performing school in a city.
• Cost: Constructing a database is not very expensive, while constructing a data
warehouse can be expensive.
Example Applications of Data Warehousing
Data Warehousing can be applied anywhere where we have a huge amount of data and we
want to see statistical results that help in decision making.
• Social Media Websites: Social networking websites like Facebook, Twitter,
LinkedIn, etc. are based on analyzing large data sets. These sites gather data
related to members, groups, locations, etc., and store it in a single central
repository. Because the amount of data is large, a data warehouse is needed to
implement this.
• Banking: Most banks these days use warehouses to see the spending
patterns of account/card holders. They use this to provide them with special offers,
deals, etc.
• Government: Governments use data warehouses to store and analyze tax
payments, which are used to detect tax evasion.
Features of Data Warehousing
Data warehousing is essential for modern data management, providing a strong foundation
for organizations to consolidate and analyze data strategically. Its distinguishing features
empower businesses with the tools to make informed decisions and extract valuable insights
from their data.
• Centralized Data Repository: Data warehousing provides a centralized
repository for all enterprise data from various sources, such as transactional
databases, operational systems, and external sources. This enables organizations
to have a comprehensive view of their data, which can help in making informed
business decisions.
• Data Integration: Data warehousing integrates data from different sources into
a single, unified view, which can help in eliminating data silos and reducing data
inconsistencies.
• Historical Data Storage: Data warehousing stores historical data, which
enables organizations to analyze data trends over time. This can help in
identifying patterns and anomalies in the data, which can be used to improve
business performance.
• Query and Analysis: Data warehousing provides powerful query and analysis
capabilities that enable users to explore and analyze data in different ways. This
can help in identifying patterns and trends, and can also help in making informed
business decisions.
• Data Transformation: Data warehousing includes a process of data
transformation, which involves cleaning, filtering, and formatting data from
various sources to make it consistent and usable. This can help in improving data
quality and reducing data inconsistencies.
• Data Mining: Data warehousing provides data mining capabilities, which
enable organizations to discover hidden patterns and relationships in their data.
This can help in identifying new opportunities, predicting future trends, and
mitigating risks.
• Data Security: Data warehousing provides robust data security features, such as
access controls, data encryption, and data backups, which ensure that the data is
secure and protected from unauthorized access.
Advantages of Data Warehousing
• Intelligent Decision-Making: With centralized data in warehouses, decisions
may be made more quickly and intelligently.
• Business Intelligence: Provides strong operational insights through business
intelligence.
• Historical Analysis: Predictions and trend analysis are made easier by storing
past data.
• Data Quality: Guarantees data quality and consistency for trustworthy reporting.
• Scalability: Capable of managing massive data volumes and expanding to meet
changing requirements.
• Effective Queries: Fast and effective data retrieval is made possible by an
optimized structure.
• Cost reductions: Data warehousing can result in cost savings over time by
streamlining data management procedures and increasing overall efficiency, even
though there are initial setup costs.
• Data security: Data warehouses employ security protocols to safeguard
confidential information, guaranteeing that only authorized personnel are granted
access to certain data.
Disadvantages of Data Warehousing
• Cost: Building a data warehouse can be expensive, requiring significant
investments in hardware, software, and personnel.
• Complexity: Data warehousing can be complex, and businesses may need to
hire specialized personnel to manage the system.
• Time-consuming: Building a data warehouse can take a significant amount of
time, requiring businesses to be patient and committed to the process.
• Data integration challenges: Data from different sources can be challenging to
integrate, requiring significant effort to ensure consistency and accuracy.
• Data security: Data warehousing can pose data security risks, and businesses
must take measures to protect sensitive data from unauthorized access or breaches.
There can be many more applications in different sectors like E-Commerce,
telecommunications, Transportation Services, Marketing and Distribution, Healthcare, and
Retail.
Conclusion:
Data warehousing in database management systems (DBMS) enables integrated data
management, providing scalable solutions for enhanced business intelligence and decision-
making within businesses. Its advantages in data quality, historical analysis, and scalability
highlight its critical role in deriving important insights for a competitive edge, even in the
face of implementation challenges.
Refer:
https://www.javatpoint.com/data-warehouse

https://www.geeksforgeeks.org/data-warehousing/

2. Characteristics of Data Warehouse


Subject-Oriented

A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view around a
particular subject, such as customer, product, or sales, instead of the global organization's
ongoing operations. This is done by excluding data that are not useful concerning the subject
and including all data needed by the users to understand the subject.

Integrated

A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attributes types, etc., among
different data sources.
Time-Variant

Historical information is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, 12 months, or even earlier from a data warehouse. This contrasts
with a transaction system, where often only the most current data is kept.

Non-Volatile

The data warehouse is a physically separate data store, into which data is transformed from the
source operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed. It usually requires only two procedures
in data accessing: initial loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for substantial
speedup of data retrieval. Non-volatile means that, once entered into the warehouse, data
should not change.

DATA MARTS
What is Data Mart?
A data mart is a subset of an organizational data store, generally oriented to a specific
purpose or primary data subject, which may be distributed to support business needs. Data
marts are analytical record stores designed to focus on particular business functions for a
specific community within an organization. Data marts are derived from subsets of data in a
data warehouse, though in the bottom-up data warehouse design methodology, the data
warehouse is created from the union of organizational data marts.

The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to
gather, store, access, and analyze records. A data mart can be used by smaller businesses to
utilize the data they have accumulated, since it is less expensive than implementing a data
warehouse.
Reasons for creating a data mart

o Provides collective data for a group of users
o Easy access to frequently needed data
o Ease of creation
o Improves end-user response time
o Lower cost than implementing a complete data warehouse
o Potential clients are more clearly defined than in a comprehensive data warehouse
o It contains only essential business data and is less cluttered

Types of Data Marts

There are mainly two approaches to designing data marts. These approaches are

o Dependent Data Marts


o Independent Data Marts

Dependent Data Marts

A dependent data mart is a logical subset or a physical subset of a larger data warehouse.
In this technique, the data marts are treated as subsets of a data warehouse: first a data
warehouse is created, from which various data marts can then be created. These data marts
depend on the data warehouse and extract the essential records from it. Because the data
warehouse creates the data mart, there is no need for data mart integration. This is also known
as a top-down approach.

Independent Data Marts

The second approach is independent data marts (IDM). Here, independent data marts are
created first, and then a data warehouse is designed using these multiple independent data
marts. In this approach, since all the data marts are designed independently, the integration of
data marts is required. It is also termed a bottom-up approach, as the data marts are
integrated to develop the data warehouse.

Other than these two categories, one more type exists that is called "Hybrid Data Marts."

Hybrid Data Marts


It allows us to combine input from sources other than a data warehouse. This can be helpful
in many situations, especially when ad hoc integrations are needed, such as after a new group
or product is added to the organization.

Steps in Implementing a Data Mart

The significant steps in implementing a data mart are to design the schema, construct the
physical storage, populate the data mart with data from source systems, access it to make
informed decisions, and manage it over time. So, the steps are:

Designing

The design step is the first in the data mart process. This phase covers all of the functions from
initiating the request for a data mart through gathering data about the requirements and
developing the logical and physical design of the data mart.

It involves the following tasks:

1. Gathering the business and technical requirements


2. Identifying data sources
3. Selecting the appropriate subset of data
4. Designing the logical and physical architecture of the data mart.

Constructing

This step contains creating the physical database and logical structures associated with the data
mart to provide fast and efficient access to the data.

It involves the following tasks:


1. Creating the physical database and logical structures, such as tablespaces, associated
with the data mart.
2. Creating the schema objects, such as tables and indexes, described in the design step.
3. Determining how best to set up the tables and access structures.

Populating

This step includes all of the tasks related to getting data from the source, cleaning it up,
modifying it to the right format and level of detail, and moving it into the data mart (a
minimal sketch follows the task list).

It involves the following tasks:

1. Mapping data sources to target data structures
2. Extracting data
3. Cleansing and transforming the data
4. Loading data into the data mart
5. Creating and storing metadata
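A minimal extract-transform-load sketch of this populating step in Python follows. The source database, mart database, table names, and columns are all hypothetical.

    import sqlite3
    import pandas as pd

    source = sqlite3.connect("operational.db")  # assumed operational source system
    mart = sqlite3.connect("sales_mart.db")     # assumed target data mart

    # Map and extract: pull the relevant subset from the source table.
    orders = pd.read_sql_query(
        "SELECT order_id, customer, amount, order_date FROM orders", source
    )

    # Cleanse and transform: drop incomplete rows, normalize the date format.
    orders = orders.dropna(subset=["amount"])
    orders["order_date"] = pd.to_datetime(orders["order_date"]).dt.date.astype(str)

    # Load into the mart's fact table.
    orders.to_sql("fact_orders", mart, if_exists="replace", index=False)

    # Create and store simple metadata about the load.
    mart.execute("CREATE TABLE IF NOT EXISTS load_metadata (loaded_at TEXT, row_count INTEGER)")
    mart.execute("INSERT INTO load_metadata VALUES (datetime('now'), ?)", (len(orders),))
    mart.commit()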

Accessing

This step involves putting the data to use: querying the data, analyzing it, creating reports,
charts and graphs and publishing them.

It involves the following tasks:

1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer
translates database operations and object names into business terms so that the
end-clients can interact with the data mart using words that relate to the business
functions.
2. Set up and manage database structures, like summarized tables, that help queries
submitted through the front-end tools execute rapidly and efficiently (a short sketch
follows this list).
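The sketch below builds such a pre-summarized table, continuing the hypothetical mart from the populating step; table and column names are assumptions.

    import sqlite3

    mart = sqlite3.connect("sales_mart.db")

    # Pre-summarize the fact table by month so front-end queries scan a small
    # aggregate table instead of the full fact table.
    mart.execute("DROP TABLE IF EXISTS summary_monthly_sales")
    mart.execute("""
        CREATE TABLE summary_monthly_sales AS
        SELECT substr(order_date, 1, 7) AS month,
               COUNT(*)                 AS orders,
               SUM(amount)              AS total_amount
        FROM fact_orders
        GROUP BY substr(order_date, 1, 7)
    """)
    mart.commit()

    for row in mart.execute("SELECT * FROM summary_monthly_sales ORDER BY month"):
        print(row)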

Managing

This step involves managing the data mart over its lifetime. In this step, management functions
are performed, such as:
1. Providing secure access to the data.
2. Managing the growth of the data.
3. Optimizing the system for better performance.
4. Ensuring the availability of data even with system failures.

Difference between Data Warehouse and Data Mart

• Scope: A data warehouse is a vast repository of information collected from various
organizations or departments within a corporation. A data mart is only a subtype of a
data warehouse, architected to meet the requirements of a specific user group.
• Subject areas: A data warehouse may hold multiple subject areas. A data mart holds
only one subject area, for example, Finance or Sales.
• Detail: A data warehouse holds very detailed information, while a data mart may hold
more summarized data.
• Integration: A data warehouse works to integrate all data sources, while a data mart
concentrates on integrating data from a given subject area or set of source systems.
• Schema: In data warehousing, a fact constellation schema is used. In a data mart,
star and snowflake schemas are used.
• Architecture: A data warehouse is a centralized system, while a data mart is a
decentralized system.
• Orientation: Data warehousing is data-oriented, while a data mart is project-oriented.

OTHER ASPECTS OF DATA MART

Data Mart vs. Data Warehouse vs. Data Lake

Data marts and data warehouses are highly structured data repositories, while a data lake
holds raw data; they differ in the scope of data stored and serve different purposes within an
organization.

The data warehouse serves as the central repository of data for the entire organization, while
a data mart focuses on the data important to and needed by a specific division or line of
business. A warehouse aggregates data from different sources to support data mining, artificial
intelligence, and machine learning, which results in improved analytics and business intelligence.
Since the data warehouse stores all data generated by an organization, access to the warehouse
should be strictly controlled. It can be extremely difficult to query data needed for a particular
purpose from the enormous pool of data contained in a data warehouse. That is where the data
mart is helpful: the main purpose of a data mart is to partition or separate a subset of the entire
dataset to provide easy access to data for end-users.
Both data warehouse and data mart are relational databases built to store transactional data
(e.g., numerical order, time value, object reference) in tabular form for ease of organizing and
access.

A single data mart can be created from an existing data warehouse in the top-down
development approach or from other sources like internal operational systems or external data.
The designing process involves several tools and technologies to construct a physical database,
populate it with data and implement stringent access and management rules. It is a complex
process, but the mart enables a business to get more focused insights in less time than working
with a broader dataset in a warehouse.

A data lake is also a data repository; it provides massive storage for raw or unstructured data
from various sources. Since a data lake stores raw data that is not processed or prepared for
analysis, it is more accessible and cost-effective than a data warehouse. The data does not
require cleanup or processing before being loaded.


Benefits of a Data Mart

Data Marts are built to enable business users to access the most relevant data in the shortest
time. With its small size and focused design, data mart offers several benefits to the end-user,
including:

• Contains data that is valuable to specific groups within an organization.

• More cost-effective to build than a data warehouse.

• Allows simplified data access. Data marts contain a small subset of data, so users
can retrieve data more easily than by sifting through the broader data set of a data
warehouse.
• Quick access to data insights. Insights gained from a data mart impact decisions at
the department level. Teams can use these focused insights with specific goals in
mind, resulting in faster business processes and higher productivity.

• A data mart needs less implementation time than a data warehouse because you
only need to focus on a small subset of data, so implementation tends to be more
efficient and less time-consuming.

• It contains historical data, which helps data analysts predict data trends.

Structure of a Data Mart

A data mart and a data warehouse can be organized using a star, vault, snowflake, or other
schema as a blueprint.

Usually, a star schema is used that consists of one or more fact tables referencing dimension
tables in a relational database. In a star schema, fewer joins are required for writing queries.

In the snowflake schema, the dimension tables are normalized, so data redundancy is reduced
and data integrity is protected. The structure is more complicated and difficult to maintain,
though it takes less space to store the dimension tables.
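The contrast can be made concrete with a small sketch. Below, a hypothetical product dimension is shown twice in Python/SQLite: denormalized for a star schema, then normalized into two tables for a snowflake schema.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Star style: one denormalized dimension; the category text repeats per row.
    conn.execute("""
        CREATE TABLE dim_product_star (
            product_id   INTEGER PRIMARY KEY,
            product_name TEXT,
            category     TEXT,   -- redundant across products
            unit_price   REAL
        )
    """)

    # Snowflake style: the category is normalized into its own table, reducing
    # redundancy at the cost of an extra join.
    conn.execute("""
        CREATE TABLE dim_category (
            category_id   INTEGER PRIMARY KEY,
            category_name TEXT
        )
    """)
    conn.execute("""
        CREATE TABLE dim_product_snowflake (
            product_id   INTEGER PRIMARY KEY,
            product_name TEXT,
            category_id  INTEGER REFERENCES dim_category(category_id),
            unit_price   REAL
        )
    """)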

Data Mart and Cloud Architecture

Businesses are increasingly moving to cloud-based data marts and data warehouses instead of
traditional on-premises setups. Business and IT teams are striving to become more agile and
data-driven to improve regular decision-making. The benefits of cloud architecture include:

• Decreases need to purchase physical hardware

• Decreases need for manual intervention

• Faster and cheaper to set up and implement cloud data marts

• The cloud-based architecture uses massively parallel processing; hence, data marts

can perform complex analytical queries much faster.


The Future of Data Marts Is in the Cloud

Leading cloud service providers offer a shared cloud-based platform to create and store data
and to access and analyze it efficiently. Business teams can quickly provision transient data
clusters for short-term analysis or long-lived clusters for sustained work. With modern
technologies, data storage can be easily separated from computing, allowing for extensive
scalability when querying data.

Key advantages of cloud-based data marts are:

• Flexible architecture

• A single depository housing all data marts

• On-demand consumption of resources

• Real-time access to information

• Higher efficiency

• Interactive analytics in real time

• Lower-cost consolidation of resources


What is OLAP (Online Analytical Processing)?

OLAP stands for On-Line Analytical Processing. OLAP is a category of software
technology that enables analysts, managers, and executives to gain insight into
information through fast, consistent, interactive access to a wide variety of possible views of
data that has been transformed from raw information to reflect the real dimensionality of the
enterprise as understood by the clients.

OLAP implements the multidimensional analysis of business information and supports the
capability for complex calculations, trend analysis, and sophisticated data modeling. It is rapidly
becoming the essential foundation for intelligent solutions covering business performance
management, planning, budgeting, forecasting, financial reporting, analysis,
simulation models, knowledge discovery, and data warehouse reporting. OLAP enables
end-clients to perform ad hoc analysis of data in multiple dimensions, providing the insight
and understanding they require for better decision-making.
Who uses OLAP and Why?

OLAP applications are used by a variety of functions within an organization.

Finance and accounting:

o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling

Sales and Marketing

o Sales analysis and forecasting


o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation

Production

o Production planning
o Defect analysis

OLAP cubes have two main purposes. The first is to provide business users with a data model
more intuitive to them than a tabular model. This model is called a Dimensional Model.

The second purpose is to enable fast query response that is usually difficult to achieve using
tabular models.

OLAP Guidelines (Dr.E.F.Codd Rule)

Dr E.F. Codd, the "father" of the relational model, has formulated a list of 12 guidelines and
requirements as the basis for selecting OLAP systems:
1) Multidimensional Conceptual View: This is the central feature of an OLAP system. By
requiring a multidimensional view, it is possible to carry out operations like slice and dice.

2) Transparency: Make the technology, underlying information repository, computing


operations, and the dissimilar nature of source data totally transparent to users. Such
transparency helps to improve the efficiency and productivity of the users.

3) Accessibility: It provides access only to the data that is actually required to perform the
particular analysis, presenting a single, coherent, and consistent view to the clients. The OLAP
system must map its own logical schema to the heterogeneous physical data stores and perform
any necessary transformations. The OLAP operations should sit between data sources
(e.g., data warehouses) and an OLAP front-end.

4) Consistent Reporting Performance: To make sure that users do not feel any significant
degradation in reporting performance as the number of dimensions or the size of the
database increases. That is, the performance of OLAP should not suffer as the number of
dimensions is increased. Users must observe consistent run time, response time, or machine
utilization every time a given query is run.

5) Client/Server Architecture: Make the server component of OLAP tools sufficiently
intelligent that various clients can be attached with a minimum of effort and integration
programming. The server should be capable of mapping and consolidating data between
dissimilar databases.
6) Generic Dimensionality: An OLAP method should treat each dimension as equivalent in
both its structure and operational capabilities. Additional operational capabilities may be
granted to selected dimensions, but such additional functions should be grantable to any dimension.

7) Dynamic Sparse Matrix Handling: To adapt the physical schema to the specific analytical
model being created and loaded in a way that optimizes sparse matrix handling. When
encountering a sparse matrix, the system must be able to dynamically deduce the distribution
of the data and adjust the storage and access paths to obtain and maintain a consistent level of
performance.

8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and
access security.

9) Unrestricted Cross-dimensional Operations: The system should be able to recognize
dimensional hierarchies and automatically perform roll-up and drill-down operations within a
dimension or across dimensions.

10) Intuitive Data Manipulation: Data manipulation fundamental to the consolidation path,
such as reorientation (pivoting), drill-down, roll-up, and other manipulations, should be
accomplished naturally and precisely via point-and-click and drag-and-drop actions on the
cells of the analytical model, avoiding the use of menus or multiple trips to a user interface.

11) Flexible Reporting: It gives business clients the ability to organize columns,
rows, and cells in a manner that facilitates simple manipulation, analysis, and synthesis of data.

12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should
be unlimited. Each of these common dimensions must allow a practically unlimited number of
user-defined aggregation levels within any given consolidation path.

How OLAP systems work

To facilitate this kind of analysis, data is collected from multiple sources and stored in data
warehouses, then cleansed and organized into data cubes. Each OLAP cube contains data
categorized by dimensions (such as customers, geographic sales region, and time period)
derived from dimension tables in the data warehouse. Dimensions are then populated by
members (such as customer names, countries, and months) that are organized hierarchically.
OLAP cubes are often pre-summarized across dimensions to drastically improve query time
over relational databases.

Analysts can then perform five types of OLAP analytical operations against
these multidimensional databases (illustrated in the sketch after this list):

• Roll-up. Also known as consolidation or drill-up, this operation summarizes the
data along a dimension.
• Drill-down. This allows analysts to navigate deeper among the dimensions of data.
For example, drilling down from "time period" to "years" and "months" to chart
sales growth for a product.

• Slice. This enables an analyst to take one level of information for display, such as
"sales in 2017."

• Dice. This allows an analyst to select data from multiple dimensions to analyze,
such as "sales of blue beach balls in Iowa in 2017."

• Pivot. Analysts can gain a new view of data by rotating the data axes of the cube.
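A minimal Python sketch of the five operations, using pandas on a hypothetical sales cube with product, state, and year dimensions:

    import pandas as pd

    cube = pd.DataFrame({
        "product": ["beach ball", "beach ball", "kite", "kite"],
        "state":   ["Iowa", "Ohio", "Iowa", "Ohio"],
        "year":    [2017, 2017, 2017, 2018],
        "sales":   [100, 150, 80, 120],
    })

    # Roll-up: summarize along a dimension (states collapse into year totals).
    rollup = cube.groupby("year")["sales"].sum()

    # Drill-down: navigate to finer detail (year totals broken out by state).
    drilldown = cube.groupby(["year", "state"])["sales"].sum()

    # Slice: fix one dimension at a single value, e.g. "sales in 2017".
    slice_2017 = cube[cube["year"] == 2017]

    # Dice: select on several dimensions at once, e.g.
    # "sales of beach balls in Iowa in 2017".
    dice = cube[(cube["product"] == "beach ball")
                & (cube["state"] == "Iowa")
                & (cube["year"] == 2017)]

    # Pivot: rotate the axes to view states as columns against products as rows.
    pivot = cube.pivot_table(index="product", columns="state",
                             values="sales", aggfunc="sum")
    print(pivot)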

OLAP software locates the intersection of dimensions, such as all products sold in the Eastern
region above a certain price during a certain time period, and displays them. The result is the
measure; each OLAP cube has at least one to perhaps hundreds of measures, which derive from
information stored in fact tables in the data warehouse.

Types of OLAP systems

OLAP systems typically fall into one of three types:


• Multidimensional OLAP (MOLAP) is OLAP that indexes directly into a
multidimensional database.

• Relational OLAP (ROLAP) is OLAP that performs dynamic multidimensional


analysis of data stored in a relational database.

• Hybrid OLAP (HOLAP) is a combination of ROLAP and MOLAP. HOLAP


combines the greater data capacity of ROLAP with the superior processing
capability of MOLAP.
Uses of OLAP

OLAP can be used for data mining or the discovery of previously undiscerned relationships
between data items. An OLAP database does not need to be as large as a data warehouse, since
not all transactional data is needed for trend analysis. Using Open Database Connectivity, data
can be imported from existing relational databases to create a multidimensional database for
OLAP.
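As a hedged sketch of that import path: assuming an ODBC data source named SalesDB and a hypothetical orders table, data could be pulled into Python with the pyodbc package and reshaped for multidimensional analysis.

    import pyodbc      # assumes pyodbc and an ODBC driver/DSN are installed
    import pandas as pd

    # The DSN, credentials, table, and columns here are hypothetical.
    conn = pyodbc.connect("DSN=SalesDB;UID=analyst;PWD=secret")

    facts = pd.read_sql(
        "SELECT region, product, order_year, SUM(amount) AS sales "
        "FROM orders GROUP BY region, product, order_year",
        conn,
    )

    # Reshape the relational result into a simple multidimensional view.
    cube = facts.pivot_table(index="product", columns="region",
                             values="sales", aggfunc="sum")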

OLAP products include IBM Cognos, Microsoft Power BI, Oracle OLAP and Tableau. OLAP
features are also included in tools such as Microsoft Excel and Microsoft SQL Server's
Analysis Services. OLAP products are typically designed for multiple-user environments, with
the cost of the software based on the number of users.

OLTP vs. OLAP


OLAP focuses on data analysis to generate business insights, whereas online transactional
processing (OLTP) focuses on real-time processing of online transactions. OLTP is used for
executing online database transactions that frontline workers such as cashiers and bank tellers
generate. Customer self-service applications such as online banking, travel, and e-commerce
also generate database transactions and tie into OLTP systems. OLTP can be a data source for
OLAP systems.

Characteristics of OLAP
The term FASMI is derived from the first letters of the following characteristics of OLAP
methods:

Fast

It means the system is targeted to deliver most responses to users within about five
seconds, with the simplest analyses taking no more than one second and very few taking
more than 20 seconds.

Analysis

It means the system can cope with any business logic and statistical analysis that is
relevant for the application and the user, while keeping it easy enough for the target user.
Although some preprogramming may be needed, it is not acceptable if all application
definitions have to be fixed in advance; the system must allow the user to define new ad hoc
calculations as part of the analysis and to report on the data in any desired way, without
having to program, so products (like Oracle Discoverer) that do not allow adequate
end-user-oriented calculation flexibility are excluded.
Share

It means the system implements all the security requirements for confidentiality and, if
multiple write access is needed, concurrent update locking at an appropriate level. Not all
applications need users to write data back, but for the increasing number that do, the
system should be able to manage multiple updates in a timely, secure manner.

Multidimensional

This is the basic requirement. An OLAP system must provide a multidimensional conceptual view
of the data, including full support for hierarchies, as this is certainly the most logical method
to analyze businesses and organizations.

Information

The system should be able to hold all the data needed by the applications. Data sparsity should
be handled in an efficient manner.

The main characteristics of OLAP are as follows:

1. Multidimensional conceptual view: OLAP systems let business users have a


dimensional and logical view of the data in the data warehouse. It helps in carrying slice
and dice operations.
2. Multi-User Support: Since OLAP systems are shared, they should provide normal
database operations, including retrieval, update, concurrency control, integrity,
and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-end. The
OLAP operations should be sitting between data sources (e.g., data warehouses) and an
OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
5. Uniform reporting performance: Increasing the number of dimensions or the
database size should not significantly degrade the reporting performance of the OLAP
system.
6. OLAP provides for distinguishing between zero values and missing values so that
aggregates are computed correctly (see the sketch after this list).
7. OLAP system should ignore all missing values and compute correct aggregate values.
8. OLAP facilitates interactive querying and complex analysis for users.
9. OLAP allows users to drill down for greater detail or roll up for aggregations of metrics
along a single business dimension or across multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.
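The zero-versus-missing point (items 6 and 7) can be seen in a tiny pandas sketch; the values are hypothetical.

    import numpy as np
    import pandas as pd

    # A missing measurement (NaN) must not be treated as a real zero sale.
    sales = pd.Series([100.0, 0.0, np.nan, 200.0])

    print(sales.sum())             # 300.0 -- NaN is ignored, the zero is counted
    print(sales.mean())            # 100.0 -- average over the 3 observed values
    print(sales.fillna(0).mean())  # 75.0  -- wrong if NaN meant "not recorded"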

Benefits of OLAP

OLAP holds several benefits for businesses:

1. OLAP helps managers in decision-making through the multidimensional views of data
that it efficiently provides, thus increasing their productivity.
2. OLAP functions are self-sufficient owing to the inherent flexibility they provide to
organized databases.
3. It facilitates the simulation of business models and problems through extensive
management of analysis capabilities.
4. In conjunction with a data warehouse, OLAP can be used to support a reduction in the
application backlog, faster data retrieval, and a reduction in query drag.

Criteria-wise comparison of OLAP and OLTP:

• Purpose: OLAP helps you analyze large volumes of data to support decision-making.
OLTP helps you manage and process real-time transactions.
• Data source: OLAP uses historical and aggregated data from multiple sources. OLTP
uses real-time and transactional data from a single source.
• Data structure: OLAP uses multidimensional (cube) or relational databases. OLTP uses
relational databases.
• Data model: OLAP uses star schema, snowflake schema, or other analytical models.
OLTP uses normalized or denormalized models.
• Volume of data: OLAP has large storage requirements; think terabytes (TB) and
petabytes (PB). OLTP has comparatively smaller storage requirements; think
gigabytes (GB).
• Response time: OLAP has longer response times, typically in seconds or minutes.
OLTP has shorter response times, typically in milliseconds.
• Example applications: OLAP is good for analyzing trends, predicting customer
behavior, and identifying profitability. OLTP is good for processing payments,
customer data management, and order processing.

Difference between OLAP and OLTP in DBMS

OLAP stands for Online Analytical Processing. OLAP systems have the capability to analyze
database information from multiple systems at the same time. The primary goal of an OLAP
service is data analysis, not data processing.
OLTP stands for Online Transaction Processing. OLTP's job is to administer the day-to-
day transactions of an organization. The main goal of OLTP is data processing, not data
analysis.

Online Analytical Processing (OLAP)


Online Analytical Processing (OLAP) refers to a type of software tool used for data
analysis to support business decisions. OLAP provides an environment to get insights from
data retrieved from multiple database systems at one time.

OLAP Examples

Any type of data warehouse system is an OLAP system. Uses of OLAP systems are
described below.
• Spotify analyzes the songs its users listen to in order to build a personalized homepage
of songs and playlists.
• Netflix's movie recommendation system.
Benefits of OLAP Services

• OLAP services help in keeping calculations consistent.


• We can store planning, analysis, and budgeting for business analytics within one
platform.
• OLAP services help in handling large volumes of data, which helps in enterprise-
level business applications.
• OLAP services help in applying security restrictions for data protection.
• OLAP services provide a multidimensional view of data, which helps in applying
operations on data in various ways.

Drawbacks of OLAP Services

• OLAP Services requires professionals to handle the data because of its complex
modeling procedure.
• OLAP services are expensive to implement and maintain in cases when datasets
are large.
• In OLAP, we can perform analysis of the data only after its extraction and
transformation, which delays the system.
• OLAP services are less suited to real-time decision-making, as the data is updated
only periodically.
Online Transaction Processing (OLTP)
Online transaction processing provides transaction-oriented applications in a 3-tier
architecture. OLTP administers the day-to-day transactions of an organization.
OLTP Examples

An example of an OLTP system is an ATM center: the person who authenticates first
receives the amount first, on the condition that the amount to be withdrawn is present in
the ATM. Uses of the OLTP system are described below.
• An ATM center is an OLTP application.
• OLTP handles the ACID properties during data transactions via the application.
• It is also used for online banking, online airline ticket booking, sending a text
message, and adding a book to a shopping cart.

Benefits of OLTP Services

• OLTP services allow users to read, write and delete data operations quickly.
• OLTP services help in increasing users and transactions which helps in real-time
access to data.
• OLTP services help to provide better security by applying multiple security
features.
• OLTP services help in better decision-making by providing accurate, current
data.
• OLTP Services provide Data Integrity, Consistency, and High Availability to the
data.

Drawbacks of OLTP Services

• OLTP has limited analysis capability, as it is not designed for complex analysis
or reporting.
• OLTP has high maintenance costs because of frequent maintenance, backups, and
recovery.
• OLTP services are hampered whenever there is a hardware failure, which leads
to the failure of online transactions.
• OLTP services often experience issues such as duplicate or inconsistent
data.
Difference between OLAP and OLTP
• Definition: OLAP is well-known as an online database query management system.
OLTP is well-known as an online database modifying system.
• Data source: OLAP consists of historical data from various databases. OLTP consists
of only operational, current data.
• Method used: OLAP makes use of a data warehouse. OLTP makes use of a standard
database management system (DBMS).
• Application: OLAP is subject-oriented, used for data mining, analytics, decision-
making, etc. OLTP is application-oriented, used for business tasks.
• Normalization: In an OLAP database, tables are not normalized. In an OLTP
database, tables are normalized (3NF).
• Usage of data: OLAP data is used in planning, problem-solving, and decision-making.
OLTP data is used to perform day-to-day fundamental operations.
• Task: OLAP provides a multidimensional view of different business tasks. OLTP
reveals a snapshot of present business tasks.
• Purpose: OLAP serves the purpose of extracting information for analysis and
decision-making. OLTP serves the purpose of inserting, updating, and deleting
information from the database.
• Volume of data: In OLAP, a large amount of data is stored, typically in TB or PB. In
OLTP, the size of the data is relatively small, as historical data is archived, in MB
and GB.
• Queries: OLAP queries are relatively slow because of the large amount of data
involved; queries may take hours. OLTP queries are very fast, as they operate on
roughly 5% of the data.
• Update: The OLAP database is not often updated; as a result, data integrity is
unaffected. In an OLTP database, the data integrity constraints must be maintained.
• Backup and recovery: OLAP needs only occasional backups compared to OLTP. In
OLTP, the backup and recovery process is maintained rigorously.
• Processing time: In OLAP, the processing of complex queries can take a long time.
OLTP is comparatively fast in processing because of simple and straightforward
queries.
• Types of users: OLAP data is generally managed by CEOs, MDs, and GMs. OLTP
data is managed by clerks and managers.
• Operations: OLAP involves mostly read and rarely write operations. OLTP involves
both read and write operations.
• Updates: In OLAP, data is refreshed on a regular basis with lengthy, scheduled batch
operations. In OLTP, the user initiates data updates, which are brief and quick.
• Nature of audience: The OLAP process is focused on the customer. The OLTP process
is focused on the market.
• Database design: OLAP uses a design with a focus on the subject. OLTP uses a design
focused on the application.
• Productivity: OLAP improves the efficiency of business analysts. OLTP enhances the
user's productivity.

Data Warehouse Modeling


Data warehouse modeling is the process of designing the schemas of the detailed and
summarized information of the data warehouse. The goal of data warehouse modeling is to
develop a schema describing the reality, or at least a part of it, which the data warehouse
is needed to support.

Data warehouse modeling is an essential stage of building a data warehouse for two main
reasons. Firstly, through the schema, data warehouse clients can visualize the relationships
among the warehouse data, to use them with greater ease. Secondly, a well-designed schema
allows an effective data warehouse structure to emerge, to help decrease the cost of
implementing the warehouse and improve the efficiency of using it.

Data modeling in data warehouses is different from data modeling in operational database
systems. The primary function of data warehouses is to support DSS processes. Thus, the
objective of data warehouse modeling is to make the data warehouse efficiently support
complex queries on long term information.

In contrast, data modeling in operational database systems targets efficiently supporting simple
transactions in the database, such as retrieving, inserting, deleting, and changing data.
Moreover, data warehouses are designed for customers with general knowledge of the
enterprise's information, whereas operational database systems are more oriented toward use by
software specialists for creating distinct applications.

The data within the warehouse itself has a particular architecture, with an emphasis on
various levels of summarization.

The current detail record is central in importance as it:

o Reflects the most current happenings, which are commonly the most interesting.
o Is voluminous, as it is saved at the lowest level of granularity.
o Is almost always saved on disk storage, which is fast to access but expensive and
difficult to manage.

Older detail data is stored in some form of mass storage, and it is infrequently accessed and
kept at a level detail consistent with current detailed data.

Lightly summarized data is data extracted from the low level of detail found at the current,
detailed level and is usually stored on disk storage. When building the data warehouse, one
has to consider what unit of time the summarization is done over, and also what components
or attributes the summarized data will contain.

Highly summarized data is compact and directly available and can even be found outside the
warehouse.

Metadata is the final element of the data warehouse. It is not the same as data drawn from
the operational environment; rather, it is used as:

o A directory to help the DSS investigator locate the items of the data warehouse.
o A guide to the mapping of records as the data is changed from the operational
environment to the data warehouse environment.
o A guide to the methods used for summarization between the current, accurate data and
the lightly summarized information, and between the lightly summarized information
and the highly summarized data, etc.

Data Modeling Life Cycle

In this section, we define a data modeling life cycle. It is a straightforward process of
transforming the business requirements to fulfill the goals for storing, maintaining, and
accessing the data within IT systems. The result is a logical and physical data model for an
enterprise data warehouse.

The objective of the data modeling life cycle is primarily the creation of a storage area for
business information. That area comes from the logical and physical data modeling stages.
Conceptual Data Model

A conceptual data model recognizes the highest-level relationships between the different
entities.

Characteristics of the conceptual data model

o It contains the essential entities and the relationships among them.


o No attribute is specified.
o No primary key is specified.

We can see that the only data shown via the conceptual data model is the entities that define
the data and the relationships between those entities; no other detail is shown.
Logical Data Model

A logical data model defines the information in as much structure as possible, without
regard to how it will be physically implemented in the database. The primary objective of
logical data modeling is to document the business data structures, processes, rules, and
relationships in a single view - the logical data model.

Features of a logical data model

o It involves all entities and relationships among them.


o All attributes for each entity are specified.
o The primary key for each entity is stated.
o Referential Integrity is specified (FK Relation).

The steps for designing the logical data model are as follows:

o Specify primary keys for all entities.


o List the relationships between different entities.
o List all attributes for each entity.
o Normalization.
o No data types are listed.
Physical Data Model

A physical data model describes how the model will be represented in the database. A physical
database model shows all table structures, column names, data types, constraints,
primary keys, foreign keys, and relationships between tables. The purpose of physical data
modeling is the mapping of the logical data model to the physical structures of the RDBMS
system hosting the data warehouse. This includes defining physical RDBMS structures, such
as tables and the data types to use when storing the information. It may also include the
definition of new data structures for enhancing query performance.

Characteristics of a physical data model

o Specification of all tables and columns.


o Foreign keys are used to identify relationships between tables.

The steps for physical data model design are as follows (a short sketch follows the list):

o Convert entities to tables.


o Convert relationships to foreign keys.
o Convert attributes to columns.
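A minimal sketch of these three conversions in Python/SQLite; the CUSTOMER and ORDER entities and their attributes are hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")

    # Entity CUSTOMER -> table; attributes -> typed columns; key -> PRIMARY KEY.
    conn.execute("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            city        TEXT
        )
    """)

    # Relationship "customer places order" -> a foreign key column.
    conn.execute("""
        CREATE TABLE sales_order (
            order_id    INTEGER PRIMARY KEY,
            order_date  TEXT,
            customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
        )
    """)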

Types of Data Warehouse Models

Enterprise Warehouse

An enterprise warehouse collects all of the information about subjects spanning the entire
organization. It supports corporate-wide data integration, usually from one or more operational
systems or external data providers, and it is cross-functional in scope. It generally contains
detailed information as well as summarized information and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond.

An enterprise data warehouse may be implemented on traditional mainframes, UNIX super-

servers, or parallel architecture platforms. It requires extensive business modeling and may
take years to develop and build.

Data Mart

A data mart includes a subset of corporate-wide data that is of value to a specific collection of
users. The scope is confined to particular selected subjects. For example, a marketing data mart
may restrict its subjects to the customer, items, and sales. The data contained in the data marts
tend to be summarized.

Data Marts is divided into two parts:

Independent Data Mart: An independent data mart is sourced from data captured from one or
more operational systems or external data providers, or from data generated locally within a
particular department or geographic area.

Dependent Data Mart: Dependent data marts are sourced directly from enterprise data
warehouses.

Virtual Warehouses

A virtual warehouse is a set of views over operational databases. For efficient
query processing, only some of the possible summary views may be materialized. A virtual
warehouse is easy to build but requires excess capacity on operational database servers.

Star Schema in Data Warehouse modelling


A star schema is a type of data modeling technique used in data warehousing to
represent data in a structured and intuitive way. In a star schema, data is organized into a central
fact table that contains the measures of interest, surrounded by dimension tables that describe
the attributes of the measures.

The fact table in a star schema contains the measures or metrics that are of interest to
the user or organization. For example, in a sales data warehouse, the fact table might contain
sales revenue, units sold, and profit margins. Each record in the fact table represents a specific
event or transaction, such as a sale or order.

The dimension tables in a star schema contain the descriptive attributes of the measures
in the fact table. These attributes are used to slice and dice the data in the fact table, allowing
users to analyze the data from different perspectives. For example, in a sales data warehouse,
the dimension tables might include product, customer, time, and location.

In a star schema, each dimension table is joined to the fact table through a foreign key
relationship. This allows users to query the data in the fact table using attributes from the
dimension tables. For example, a user might want to see sales revenue by product category, or
by region and time period.

The star schema is a popular data modeling technique in data warehousing because it
is easy to understand and query. The simple structure of the star schema allows for fast query
response times and efficient use of database resources. Additionally, the star schema can be
easily extended by adding new dimension tables or measures to the fact table, making it a
scalable and flexible solution for data warehousing.

The star schema is the simplest and most fundamental of the data mart schemas. It is widely
used to develop or build data warehouses and dimensional data marts. It includes one or more
fact tables indexing any number of dimension tables. The star schema is the basis of the
snowflake schema. It is also efficient for handling basic queries.

It is called a star schema because its physical model resembles a star shape, with a fact table
at its center and the dimension tables at its periphery representing the star's points.
Below is an example to demonstrate the star schema:

In this example, SALES is a fact table with the attributes (Product ID, Order ID, Customer ID,
Employee ID, Total, Quantity, Discount), which reference the dimension tables.
The Employee dimension table contains the attributes: Emp ID, Emp Name, Title, Department,
and Region.
The Product dimension table contains the attributes: Product ID, Product Name, Product
Category, and Unit Price.
The Customer dimension table contains the attributes: Customer ID, Customer Name, Address,
City, and Zip.
The Time dimension table contains the attributes: Order ID, Order Date, Year, Quarter, and
Month. A short sketch of a star join over these tables follows.
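
To make the star join concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It builds simplified versions of the SALES fact table and the Product dimension from the example above (only a few of the attributes, with invented sample rows) and runs a typical one-hop star-join query; it is an illustration, not a full implementation of the schema.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified Product dimension and SALES fact table (sample data is invented).
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT, product_category TEXT, unit_price REAL
);
CREATE TABLE sales (
    order_id INTEGER, quantity INTEGER, total REAL,
    product_id INTEGER REFERENCES product(product_id)
);
INSERT INTO product VALUES (1, 'Football', 'Sports', 20.0),
                           (2, 'Notebook', 'Stationery', 2.0);
INSERT INTO sales VALUES (101, 3, 60.0, 1), (102, 10, 20.0, 2), (103, 1, 20.0, 1);
""")

# Revenue by product category: one join from the fact table to the dimension.
for category, revenue in conn.execute("""
    SELECT p.product_category, SUM(s.total)
    FROM sales AS s
    JOIN product AS p ON s.product_id = p.product_id
    GROUP BY p.product_category
"""):
    print(category, revenue)   # Sports 80.0, Stationery 20.0 (order may vary)
conn.close()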

Model of Star Schema:

In a star schema, business process data that holds the quantitative facts about a business is
held in fact tables, and the descriptive characteristics related to the fact data are held in
dimension tables. Sales price, sale quantity, distance, speed, and weight measurements are a
few examples of fact data in a star schema.
Often, a star schema having multiple dimensions is termed a centipede schema. A star schema
whose dimension tables each have only a few attributes is easy to handle.

Advantages of Star Schema:


1. Simpler Queries:
   The join logic of a star schema is quite simple compared with the join logic needed
   to fetch data from a highly normalized transactional schema.
2. Simplified Business Reporting Logic:
   Compared with a highly normalized transactional schema, the star schema simplifies
   common business reporting logic, such as period-over-period reporting.
3. Feeding Cubes:
   The star schema is widely used by OLAP systems to design OLAP cubes efficiently.
   In fact, major OLAP systems deliver a ROLAP mode of operation that can use a star
   schema as a source without designing a cube structure.

Disadvantages of Star Schema:
1. Data integrity is not enforced well, since the schema is in a highly denormalized state.
2. It is not as flexible in terms of analytical needs as a normalized data model.
3. Star schemas do not support many-to-many relationships between business entities,
   at least not easily.

Features:

Central fact table: The star schema revolves around a central fact table that contains the
numerical data being analyzed. This table contains foreign keys to link to dimension tables.
Dimension tables: Dimension tables are tables that contain descriptive attributes about the
data being analyzed. These attributes provide context to the numerical data in the fact table.
Each dimension table is linked to the fact table through a foreign key.
Denormalized structure: A star schema is denormalized, which means that redundancy is
allowed in the schema design to improve query performance. This is because it is easier and
faster to join a small number of tables than a large number of tables.
Simple queries: Star schema is designed to make queries simple and fast. Queries can be
written in a straightforward manner by joining the fact table with the appropriate dimension
tables.
Aggregated data: The numerical data in the fact table is usually aggregated at different
levels of granularity, such as daily, weekly, or monthly. This allows for analysis at different
levels of detail.
Fast performance: Star schema is designed for fast query performance. This is because the
schema is denormalized and data is pre-aggregated, making queries faster and more efficient.
Easy to understand: The star schema is easy to understand and interpret, even for non-
technical users. This is because the schema is designed to provide context to the numerical
data through the use of dimension tables.
MULTI FACT STAR SCHEMA OR SNOWFLAKE SCHEMA
The snowflake schema is a variant of the star schema. Here, the centralized fact table is
connected to multiple dimensions. In the snowflake schema, dimensions are present in
a normalized form in multiple related tables. The snowflake structure materializes when the
dimensions of a star schema are detailed and highly structured, having several levels of
relationship, and the child tables have multiple parent tables. The snowflake effect affects only
the dimension tables and does not affect the fact tables.
A snowflake schema is a type of data modeling technique used in data warehousing to represent
data in a structured way that is optimized for querying large amounts of data efficiently. In a
snowflake schema, the dimension tables are normalized into multiple related tables, creating a
hierarchical or “snowflake” structure.
In a snowflake schema, the fact table is still located at the center of the schema, surrounded by
the dimension tables. However, each dimension table is further broken down into multiple
related tables, creating a hierarchical structure that resembles a snowflake.
For Example, in a sales data warehouse, the product dimension table might be normalized into
multiple related tables, such as product category, product subcategory, and product details.
Each of these tables would be related to the product dimension table through a foreign
key relationship.

Example:
Snowflake Schema

The Employee dimension table now contains the attributes: EmployeeID, EmployeeName,
DepartmentID, Region, and Territory. The DepartmentID attribute links the Employee
dimension table with the Department dimension table. The Department dimension table
provides detail about each department, such as the name and location of the department.
The Customer dimension table now contains the attributes: CustomerID, CustomerName,
Address, and CityID. The CityID attribute links the Customer dimension table with
the City dimension table. The City dimension table has details about each city, such as city
name, Zipcode, State, and Country.
What is Snowflaking?
The snowflake design is the result of further expansion and normalization of the dimension
table. In other words, a dimension table is said to be snowflaked if the low-cardinality
attributes of the dimension have been divided into separate normalized tables. These tables are
then joined to the original dimension table with referential constraints (foreign key
constraints).
Generally, snowflaking is not recommended in the dimension table, as it hampers the
understandability and performance of the dimensional model: more tables must be joined to
satisfy the queries. A minimal sketch of a snowflaked dimension follows.
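
Here is a minimal, hypothetical sketch of snowflaking in Python with the standard-library sqlite3 module: the low-cardinality product category is split out of the Product dimension into its own normalized table, so the same revenue-by-category query now needs one extra join. All table names and columns are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked Product dimension: the category is normalized into its own table.
CREATE TABLE product_category (
    category_id INTEGER PRIMARY KEY, category_name TEXT
);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY, product_name TEXT,
    category_id INTEGER REFERENCES product_category(category_id)
);
CREATE TABLE sales (
    order_id INTEGER, total REAL,
    product_id INTEGER REFERENCES product(product_id)
);
""")

# Compared with the star schema, the query needs one extra join per snowflaked level.
query = """
    SELECT c.category_name, SUM(s.total)
    FROM sales AS s
    JOIN product AS p ON s.product_id = p.product_id
    JOIN product_category AS c ON p.category_id = c.category_id
    GROUP BY c.category_name
"""
print(list(conn.execute(query)))   # [] here, since no sample rows were inserted
conn.close()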
Difference between Star Schema and
Snowflake Schema

Star Schema: A star schema is a type of multidimensional model used for data warehouses.
A star schema contains fact tables and dimension tables. This schema uses fewer foreign-key
joins and forms a star shape with a fact table and its dimension tables.

Snowflake Schema: A snowflake schema is also a type of multidimensional model used for
data warehouses. A snowflake schema contains fact tables, dimension tables, and
sub-dimension tables, and it forms a snowflake shape with these tables.
Let’s see the difference between Star and Snowflake Schema:
1. In the star schema, fact tables and dimension tables are contained. In the snowflake
   schema, fact tables, dimension tables, and sub-dimension tables are contained.
2. The star schema is a top-down model. The snowflake schema is a bottom-up model.
3. The star schema uses more space. The snowflake schema uses less space.
4. The star schema takes less time for the execution of queries. The snowflake schema
   takes more time than the star schema for the execution of queries.
5. In the star schema, normalization is not used. In the snowflake schema, both
   normalization and denormalization are used.
6. The design of the star schema is very simple. The design of the snowflake schema is
   complex.
7. The query complexity of the star schema is low. The query complexity of the snowflake
   schema is higher than that of the star schema.
8. The star schema is very simple to understand. The snowflake schema is difficult to
   understand.
9. The star schema has fewer foreign keys. The snowflake schema has more foreign keys.
10. The star schema has high data redundancy. The snowflake schema has low data
    redundancy.

Characteristics of Snowflake Schema


• The snowflake schema uses small disk space.
• It is easy to implement the dimension that is added to the schema.
• There are multiple tables, so performance is reduced.
• The dimension table consists of two or more sets of attributes that define
information at different grains.
• The sets of attributes of the same dimension table are populated by different source
systems.
Features of the Snowflake Schema
• Normalization: The snowflake schema is a normalized design, which means that
data is organized into multiple related tables. This reduces data redundancy and
improves data consistency.
• Hierarchical Structure: The snowflake schema has a hierarchical structure that
is organized around a central fact table. The fact table contains the measures or
metrics of interest, and the dimension tables contain the attributes that provide
context to the measures.
• Multiple Levels: The snowflake schema can have multiple levels of dimension
tables, each related to the central fact table. This allows for more granular analysis
of data and enables users to drill down into specific subsets of data.
• Joins: The snowflake schema typically requires more complex SQL queries that
  involve multiple table joins. This can impact performance, especially when dealing
  with large data sets.
• Scalability: The snowflake schema is scalable and can handle large volumes of
data. However, the complexity of the schema can make it difficult to manage and
maintain.
Advantages of Snowflake Schema
• It provides structured data which reduces the problem of data integrity.
• It uses small disk space because data are highly structured.
Disadvantages of Snowflake Schema
• Snowflaking reduces the space consumed by dimension tables, but compared with the
  entire data warehouse the saving is usually insignificant.
• Avoid snowflaking or normalization of a dimension table unless it is required and
  appropriate.
• Do not snowflake hierarchies of a dimension table into separate tables. Hierarchies
  should belong to the dimension table only and should never be snowflaked.
• Multiple hierarchies that belong to the same dimension should be designed at the
  lowest possible detail.

Fact Constellation in Data Warehouse modelling


Fact constellation is a schema for representing a multidimensional model. It is a collection of
multiple fact tables having some common dimension tables. It can be viewed as a collection of
several star schemas and hence is also known as a galaxy schema. It is one of the widely used
schemas for data warehouse design, and it is much more complex than the star and snowflake
schemas. For complex systems, we require fact constellations.
Figure – General structure of Fact Constellation

Here, the pink coloured Dimension tables are the common ones among both the star schemas.
Green coloured fact tables are the fact tables of their respective star schemas.
Example:

In above demonstration:

Placement is a fact table having attributes: (Stud_roll, Company_id, TPO_id) with facts:
(Number of students eligible, Number of students placed).

• Workshop is a fact table having attributes: (Stud_roll, Institute_id, TPO_id) with


facts: (Number of students selected, Number of students attended the workshop).
• Company is a dimension table having attributes: (Company_id, Name,
Offer_package).
• Student is a dimension table having attributes: (Student_roll, Name, CGPA).
• TPO is a dimension table having attributes: (TPO_id, Name, Age).
• Training Institute is a dimension table having attributes: (Institute_id, Name,
Full_course_fee).
So, there are two fact tables, namely Placement and Workshop, which are part of two different
star schemas: the star schema with fact table Placement has dimension tables Company,
Student, and TPO, while the star schema with fact table Workshop has dimension tables
Training Institute, Student, and TPO. Both star schemas have two dimension tables in
common, hence forming a fact constellation or galaxy schema. A minimal sketch of this
structure follows.
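
Below is a minimal sketch of this fact constellation in Python with the standard-library sqlite3 module. Only the attributes listed above are used, and the DDL is illustrative rather than a tuned physical design.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension tables.
CREATE TABLE student (stud_roll INTEGER PRIMARY KEY, name TEXT, cgpa REAL);
CREATE TABLE tpo     (tpo_id    INTEGER PRIMARY KEY, name TEXT, age  INTEGER);

-- Two fact tables referencing the same dimensions: a fact constellation.
CREATE TABLE placement (
    stud_roll  INTEGER REFERENCES student(stud_roll),
    company_id INTEGER,
    tpo_id     INTEGER REFERENCES tpo(tpo_id),
    n_eligible INTEGER, n_placed INTEGER
);
CREATE TABLE workshop (
    stud_roll    INTEGER REFERENCES student(stud_roll),
    institute_id INTEGER,
    tpo_id       INTEGER REFERENCES tpo(tpo_id),
    n_selected INTEGER, n_attended INTEGER
);
""")
conn.close()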

Advantage: Provides a flexible schema.

Disadvantage: It is much more complex and hence, hard to implement and maintain.

Difference between Snowflake Schema and


Fact Constellation Schema

Snowflake Schema: A snowflake schema is a type of multidimensional model used for data
warehouses. A snowflake schema contains the fact table, dimension tables, and one or more
additional tables for each dimension table. The snowflake schema is a normalized form of the
star schema, which reduces redundancy and saves significant storage. Compared with the fact
constellation schema, it is easier to operate because it has fewer joins between the tables, and
simpler, less complex queries are used for accessing data from the database.
Advantages:

Reduced data redundancy: The snowflake schema reduces data redundancy by normalizing
dimensions into multiple tables, resulting in a more efficient use of storage space.
Improved performance: The snowflake schema can improve query performance, as it
requires fewer joins to retrieve data from the fact table.
Scalability: The snowflake schema is scalable, making it suitable for large data warehousing
projects with complex hierarchies.

Disadvantages:

Increased complexity: The snowflake schema can be more complex to implement and
maintain due to the additional tables needed for the normalized dimensions.
Reduced query performance: The increased complexity of the snowflake schema can result
in reduced query performance, particularly for queries that require data from multiple
dimensions.
Data integrity: It can be more difficult to maintain data integrity in the snowflake schema due
to the additional relationships between tables.
Fact Constellation Schema: The fact constellation schema is also a type of
multidimensional model. It consists of dimension tables that are shared by several fact
tables, and it comprises more than one star schema at a time. Unlike the snowflake schema,
the fact constellation schema is not easy to operate, as it has multiple joins between tables,
and it uses heavily complex queries to access data from the database.

Advantages:
Simple to understand: The fact constellation schema is easy to understand and maintain, as
it consists of multiple fact tables and multiple dimension tables.
Improved query performance: The fact constellation schema can improve query
performance by reducing the number of joins required to retrieve data from the fact table.
Flexibility: The fact constellation schema is flexible, allowing for the addition of new
dimensions without affecting the existing schema.
Disadvantages:
Increased data redundancy: The fact constellation schema can result in increased data
redundancy due to repeated dimension data across multiple fact tables.
Storage space: The fact constellation schema may require more storage space than the
snowflake schema due to the denormalized dimensions.
Limited scalability: The fact constellation schema may not be as scalable as the snowflake
schema for large data warehousing projects with complex hierarchies.
Let’s see the difference between Snowflake Schema and Fact Constellation Schema:

1. The snowflake schema contains a large central fact table, dimension tables, and
   sub-dimension tables. In the fact constellation schema, dimension tables are shared by
   many fact tables.
2. The snowflake schema saves significant storage. The fact constellation schema does
   not save storage.
3. The snowflake schema consists of one star schema at a time. The fact constellation
   schema consists of more than one star schema at a time.
4. In the snowflake schema, tables can be maintained easily. In the fact constellation
   schema, the tables are tough to maintain.
5. The snowflake schema is a normalized form of the star schema. The fact constellation
   schema is a normalized form of the snowflake schema and the star schema.
6. The snowflake schema is easier to operate than the fact constellation schema, as it has
   fewer joins between the tables. The fact constellation schema is not easy to operate, as
   it has multiple joins between the tables.
7. In the snowflake schema, simple and less complex queries are used to access data from
   the database. In the fact constellation schema, heavier and more complex queries are
   used to access data from the database.
OLAP TOOLS
Introduction to OLAP

OLAP means On-Line Analytical Processing; it can extract or retrieve data selectively so that
the data can be analyzed from different viewpoints. It is of great value in business intelligence
and can be used in sales forecasting and financial-reporting analysis.

Example:

A user can ask for data and an analysis of the number of footballs sold in a particular region,
say Jharkhand, in September, and compare it with the number sold in May. The user may also
see a comparison of other sports goods sold in Jharkhand for the same months.

How does OLAP Work?


OLAP works by extracting data from multiple sources, cleansing it, storing it in data
warehouses, and then loading it into OLAP cubes. Users retrieve data from OLAP cubes in
response to the queries they run. The new term here is the OLAP cube. In a cube, data derived
from the data warehouse is organized along dimensions such as geographical region and time
period. Members, such as names, IDs, and other relevant information, fill these dimensions.
A small illustration of a cube view follows.
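
As a rough illustration of a cube view, here is a sketch in Python using the pandas library (an assumption; a real OLAP engine would do this work at scale). The sales figures and region names are invented; the pivot builds a small two-dimensional region-by-month view and then slices it, much like the football question above.

import pandas as pd

# Invented fact rows: units of footballs sold per region and month.
sales = pd.DataFrame({
    "region": ["Jharkhand", "Jharkhand", "Kerala", "Kerala"],
    "month":  ["May", "September", "May", "September"],
    "units":  [120, 150, 90, 110],
})

# A tiny two-dimensional "cube": regions down the side, months across the top.
cube = sales.pivot_table(values="units", index="region",
                         columns="month", aggfunc="sum")
print(cube)

# Slicing the cube: Jharkhand's May vs September sales.
print(cube.loc["Jharkhand", ["May", "September"]])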

Classification of OLAP Tools

OLAP tools help us analyze data multidimensionally in data mining. OLAP tools can be
classified as follows:

1. MOLAP

It stands for Multidimensional Online Analytical Processing. It stores data in multidimensional


arrays and requires pre-computation and storage of information in the cube.

Some of the tools for that are:

• IBM Cognos: It provides tools for reporting, analyzing, and monitoring events and
metrics.
• SAP NetWeaver BW: It is known as SAP NetWeaver Business Warehouse. Just like
IBM Cognos, It also delivers reporting, analysis, and interpretation of business data. It
runs on Industry-standard RDBMS and SAP’s HANA in-memory DBMS.
• Microsoft Analysis Services: Organizations use Microsoft Analysis Services to make
  sense of data spread across multiple databases or stored in discrete forms.
• MicroStrategy Intelligence Server: MicroStrategy Intelligence Server helps
  businesses standardize on a single open platform, which reduces their maintenance
  and operating costs.
• Mondrian OLAP server: It is an open-source OLAP server whose USP is that it is
  written in Java. Another feature of this tool is that it supports XML (XML for
  Analysis), SQL, and other data sources.
• icCube: Like the above OLAP tool, this is also written in Java and is an in-memory
  multidimensional OLAP tool.
• Infor BI OLAP Server: It is a real-time, in-memory OLAP database for
multidimensional analysis, planning, and modeling. It is also used for financial,
operational planning, and reporting.
• Jedox OLAP Server: It is a cell-oriented, multidimensional and, most importantly,
  in-memory OLAP server.
• Oracle Database OLAP option: As the name suggests, this OLAP tool is specifically
designed to introduce OLAP functionality within the database environment of Oracle.
The main objective that it serves is that it can direct the SQL queries to OLAP cubes
which in return will speed up the process.
• SAS OLAP Server: Like the icCube OLAP server, it provides multidimensional data
  storage. This server can also be used to get quick access to pre-summarized data.
• IBM TM1: This OLAP server provides multidimensional data storage, represented in
  OLAP cubes, and performs real-time computations.

2. ROLAP

The ‘R’ in ROLAP stands for Relational. So, the full form of ROLAP becomes Relational
Online Analytical Processing. The salient feature of ROLAP is that it stores the data in
relational databases.

Some of the top ROLAP are as follows:

• IBM Cognos
• SAP NetWeaver BW
• Microsoft Analysis Services
• Essbase
• Jedox OLAP Server
• SAS OLAP Server
• MicroStrategy Intelligence Server
• Oracle Database OLAP option

3. HOLAP

It stands for Hybrid Online Analytical Processing. HOLAP bridges the shortcomings of both
MOLAP and ROLAP by combining their capabilities. How does it combine them? It divides
the data between relational and specialized (multidimensional) storage.

Some of the top HOLAP are as follows:

• IBM Cognos
• SAP NetWeaver BW
• Mondrian OLAP server
• Microsoft Analysis Services
• Essbase
• Jedox OLAP Server
• SAS OLAP Server
• MicroStrategy Intelligence Server
• Oracle Database OLAP option

Now let’s go through the advantages that OLAP tools have in the domain of Business
Intelligence.

Advantages of OLAP Tools

• It helps us analyze and modify reports much faster since the data is from in-
memory data cubes rather than the data warehouse.
• MicroStrategy and other OLAP tools incorporate intelligent and secure Cube data
sharing capabilities, ensuring the secure sharing of data.
• Another benefit is the consistency of information and calculations: reports remain
  consistently accurate in OLAP servers, and the speed at which data is shared does not
  impede the process.
• The multidimensional presentation offered by OLAP tools helps users better
  understand relationships that were not apparent previously.
• Another popular capability is the "what if" scenario analysis of OLAP software, whose
  potential is greatly enhanced by the multidimensional processing of OLAP tools.
• We can apply security restrictions on users and objects using OLAP tools.
• It creates a single platform for planning, forecasting, reporting, and analysis.

Drawbacks of Traditional OLAP

Traditional OLAP had its drawbacks. A couple of them are as follows:

• Pre-modeling is a must in traditional OLAP tools, which is a time-consuming process.
• Great dependence on IT: the user is typically a business person, yet must have good
  IT knowledge, and traditional OLAP tools require the heavy involvement of IT
  technicians alongside people with good business expertise.

Market Basket Analysis in Data Mining


A data mining technique that is used to uncover purchase patterns in any retail setting is known
as market basket analysis. In simple terms, market basket analysis analyzes the combinations
of products that are bought together.
It is a technique that involves the careful study of the purchases made by a customer in a
supermarket. The concept identifies patterns of items frequently purchased together by
customers. This analysis can help companies promote deals, offers, and sales, and data mining
techniques help to achieve this analysis task. Examples:

• Data mining concepts are in use for Sales and marketing to provide better
customer service, to improve cross-selling opportunities, to increase direct mail
response rates.
• Customer Retention in the form of pattern identification and prediction of likely
defections is possible by Data mining.
• Risk Assessment and Fraud area also use the data-mining concept for identifying
inappropriate or unusual behavior etc.
Market basket analysis mainly works with the ASSOCIATION RULE {IF} -> {THEN}.
• IF means antecedent: an antecedent is an item found within the data.
• THEN means consequent: a consequent is an item found in combination with
  the antecedent.

Let's see how the ASSOCIATION RULE {IF} -> {THEN} is used in market basket analysis
in data mining. For example, customers buying a domain are likely to also need extra
plugins/extensions to make it easier for their users.

As stated above, the antecedent is the item set found in the data; in terms of the rule it is the
{IF} component, which in this example is the domain. Likewise, the consequent is the item
found in combination with the antecedent; in terms of the rule it is the {THEN} component,
which in this example is the extra plugins/extensions.
With the help of these, we are able to predict customer behavioral patterns. From this, we are
able to offer certain product combinations that customers will probably buy, which
automatically increases the sales and revenue of the company.

With the help of the Apriori algorithm, we can further classify and simplify the item sets that
are frequently bought by consumers.
There are three components (measures) in the APRIORI ALGORITHM:

• SUPPORT
• CONFIDENCE
• LIFT
Now take an example: suppose 5000 transactions have been made through a popular
eCommerce website, and we want to calculate the support, confidence, and lift for two
products, say a pen and a notebook. Out of the 5000 transactions, 500 contain a pen, 700
contain a notebook, and 100 contain both. (The transactions containing both items must be a
subset of those containing each individual item.)

SUPPORT: It is calculated as the number of transactions containing the item divided by the
total number of transactions made.

support(pen) = transactions containing a pen / total transactions

i.e. support(pen) -> 500/5000 = 10 percent

CONFIDENCE: It measures whether the products are popular together rather than only
individually. It is calculated as the combined transactions divided by the antecedent's
individual transactions.

confidence(pen -> notebook) = combined transactions / transactions containing a pen

i.e. confidence -> 100/500 = 20 percent

LIFT: Lift gives the ratio of the rule's confidence to the baseline popularity (support) of the
consequent.

lift(pen -> notebook) = confidence / support(notebook) = 20 percent / 14 percent ≈ 1.43
(where support(notebook) = 700/5000 = 14 percent)

When the lift value is below 1, the combination is not frequently bought together by
consumers. Here the lift is above 1, which shows that the probability of buying both items
together is higher than would be expected if the two purchases were independent.
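
The same arithmetic can be written as a short Python sketch. The counts below are the illustrative numbers from the example above, not real transaction data.

# Illustrative counts from the example above.
total    = 5000   # all transactions
pen      = 500    # transactions containing a pen
notebook = 700    # transactions containing a notebook
both     = 100    # transactions containing both items

support_pen      = pen / total        # 0.10 -> 10 percent
support_notebook = notebook / total   # 0.14 -> 14 percent

# Confidence of the rule {pen} -> {notebook}.
confidence = both / pen               # 0.20 -> 20 percent

# Lift: rule confidence relative to the consequent's baseline support.
lift = confidence / support_notebook  # about 1.43 (> 1: bought together more than chance)

print(f"support(pen)={support_pen:.0%}, confidence={confidence:.0%}, lift={lift:.2f}")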

With this, we have an overall view of market basket analysis in data mining and of how to
calculate these measures for combinations of products.

Types of Market Basket Analysis

There are three types of Market Basket Analysis. They are as follow:

1. Descriptive market basket analysis: This sort of analysis looks for patterns and
   connections that exist between the components of a market basket. It is mostly used
   to understand consumer behavior, including which products are purchased in
   combination and what the most typical item combinations are. Descriptive market
   basket analysis helps retailers place products in their stores more profitably by
   revealing which products are frequently bought together.
2. Predictive Market Basket Analysis: Market basket analysis that predicts future
purchases based on past purchasing patterns is known as predictive market basket
analysis. Large volumes of data are analyzed using machine learning algorithms in
this sort of analysis in order to create predictions about which products are most
likely to be bought together in the future. Retailers may make data-driven decisions
about which products to carry, how to price them, and how to optimize shop layouts
with the use of predictive market basket research.
3. Differential Market Basket Analysis: Differential market basket analysis
analyses two sets of market basket data to identify variations between them.
Comparing the behavior of various client segments or the behavior of customers
over time is a common usage for this kind of study. Retailers can respond to shifting
consumer behavior by modifying their marketing and sales tactics with the help of
differential market basket analysis.

Benefits of Market Basket Analysis

1. Enhanced Customer Understanding: Market basket research offers insights into


customer behavior, including what products they buy together and which products
they buy the most frequently. Retailers can use this information to better understand
their customers and make informed decisions.
2. Improved Inventory Management: By examining market basket data, retailers
can determine which products are sluggish sellers and which ones are commonly
bought together. Retailers can use this information to make well-informed choices
about what products to stock and how to manage their inventory most effectively.
3. Better Pricing Strategies: A better understanding of the connection between
product prices and consumer behavior might help merchants develop better pricing
strategies. Using this knowledge, pricing plans that boost sales and profitability can
be created.
4. Sales Growth: Market basket analysis can assist businesses in determining which
products are most frequently bought together and where they should be positioned
in the store to grow sales. Retailers may boost revenue and enhance customer
shopping experiences by improving store layouts and product positioning.
Applications of Market Basket Analysis

1. Retail: Market basket research is frequently used in the retail sector to examine
consumer buying patterns and inform decisions about product placement, inventory
management, and pricing tactics. Retailers can utilize market basket research to
identify which items are sluggish sellers and which ones are commonly bought
together, and then modify their inventory management strategy accordingly.
2. E-commerce: Market basket analysis can help online merchants better understand
   customers' buying habits and make data-driven decisions about product
   recommendations and targeted advertising campaigns. The behaviour of visitors to
   a website can also be examined using market basket analysis to pinpoint problem areas.
3. Finance: Market basket analysis can be used to evaluate investor behaviour and
forecast the types of investment items that investors will likely buy in the future.
The performance of investment portfolios can be enhanced by using this
information to create tailored investment strategies.
4. Telecommunications: To evaluate consumer behaviour and make data-driven
decisions about which goods and services to provide, the telecommunications
business might employ market basket analysis. The usage of this data can enhance
client happiness and the shopping experience.
5. Manufacturing: To evaluate consumer behaviour and make data-driven decisions
about which products to produce and which materials to employ in the production
process, the manufacturing sector might use market basket analysis. Utilizing this
knowledge will increase effectiveness and cut costs.

OLAP Tools And the Internet

Two of the most comprehensive developments in computing have been the internet and data
warehousing, so the integration of these two giant technologies is a necessity. The advantages
of using the Web for access are undeniable.
These advantages are:
· The internet provides connectivity between countries, acting as a free resource.
· The Web eases the administrative tasks of managing scattered locations.
· The Web allows users to store and manage data and applications on servers that
  can be managed, maintained, and updated centrally.
These reasons indicate the importance of the Web for data storage and manipulation.
Web-enabled data access has many significant features, such as:

· The first-generation Web sites
· The second-generation Web sites
· The emerging third-generation Web sites
· HTML publishing
· Helper applications
· Plug-ins
· Server-centric components
· Java and ActiveX applications

The key factor in the decision-making process is the amount of data collected and how well
this data is interpreted. Nowadays, managers are not satisfied with getting direct answers to
their direct questions; due to market growth and the increase in clients, their questions have
become more complicated. A question such as "How much profit did we make from selling
our products at our different centers per month?" is not simple to answer directly; it needs
analysis along three fields (product, center, and month) in order to obtain an answer.

The Decision-Making Process

1. Identify information about the problem.
2. Analyze the problem.
3. Collect and evaluate the information, and define alternative solutions.

The decision-making process exists at different levels of an organization. The speed and
simplicity of gathering data and the ability to convert this data into information are the main
elements in the decision process. That is why the term business intelligence has evolved.
Business Intelligence

As mentioned earlier, business intelligence is concerned with gathering data and converting
this data into information, so as to make better decisions. How well the data is gathered and
how well it is interpreted as information are among the most important elements of a
successful business.

Elements of Business Intelligence

There are three main components in business intelligence:

Data Warehouse: a collection of data to support management decision making. It revolves
around the major subjects of the business in order to support management.
OLAP: used to generate complex queries against the multidimensional collection of data in
the data warehouse.
Data Mining: consists of various techniques that explore and bring out complex relationships
in very large data sets.

Figure: the relation between the three components of business intelligence.
