Unit IV Data Mining
UNIT IV: Data warehousing: introduction – characteristics of a data warehouse – data marts – other
aspects of data marts. Online analytical processing: introduction – OLTP & OLAP systems – data
modelling – star schema for multidimensional view – data modelling – multi-fact star schema or
snowflake schema – OLAP tools – state of the market – OLAP tools and the internet.
One Mark Questions
1. What is the primary purpose of a data warehouse?
• Answer: To store and analyze large volumes of data for decision-
making purposes.
2. Name one characteristic of a data warehouse.
• Answer: Subject-oriented.
3. Define OLAP.
• Answer: Online Analytical Processing, used for complex data analysis
and reporting.
4. What does OLTP stand for?
• Answer: Online Transactional Processing.
5. How does OLTP differ from OLAP?
• Answer: OLTP focuses on transactional data processing, while OLAP
focuses on analytical processing for decision support.
6. What is a data mart?
• Answer: A subset of a data warehouse focused on a specific business
area or department.
7. Name one aspect of a data mart.
• Answer: Data marts are typically tailored to meet the needs of specific
user groups.
8. What is the star schema used for?
• Answer: It provides a multidimensional view of data for efficient
querying and analysis.
9. In a star schema, what does the central fact table represent?
• Answer: It represents the primary focus of the analysis, typically
containing quantitative data.
10. What is a snowflake schema?
• Answer: A data modeling technique where dimension tables are
normalized, forming a snowflake-like structure.
11. Give an example of an OLAP tool.
• Answer: Microsoft Excel with its pivot tables feature.
12. What does OLAP enable users to do?
• Answer: OLAP allows users to analyze multidimensional data
interactively from multiple perspectives.
13. How does the state of the market for OLAP tools impact businesses?
• Answer: It influences the availability of features, pricing, and support for
analytical capabilities.
14. Name one benefit of using OLAP tools for data analysis.
• Answer: Faster decision-making based on real-time or near-real-time
data analysis.
15. How does the internet enhance OLAP tools?
• Answer: It enables access to distributed data sources and facilitates
collaboration among users.
16. Why are data warehouses crucial for business intelligence?
• Answer: They provide a centralized repository of integrated data for
analysis and reporting.
17. What role does data modeling play in data warehousing?
• Answer: It helps structure data for efficient storage, retrieval, and
analysis.
18. Name one challenge associated with implementing data warehousing
solutions.
• Answer: Data integration issues due to disparate data sources and
formats.
19. How do data marts differ from data warehouses?
• Answer: Data marts are smaller, specialized subsets of data warehouses,
catering to specific user needs.
20. Why is it essential for organizations to invest in OLAP tools?
• Answer: OLAP tools enable organizations to gain valuable insights from
their data, leading to informed decision-making and competitive
advantage.
A Database Management System (DBMS) stores data in the form of tables, is typically designed
using an ER model, and aims to guarantee the ACID properties. For example, the DBMS of a
college has tables for students, faculty, etc.
A data warehouse is separate from the operational DBMS. It stores a huge amount of data,
typically collected from multiple heterogeneous sources such as files, DBMSs, etc. The goal is to
produce statistical results that can help in decision-making. For example, a college might want
quick answers to questions such as how the placement of CS students has improved over the last
10 years, in terms of salaries, counts, etc.
Issues Occur while Building the Warehouse
• When and how to gather data: In a source-driven architecture for gathering
data, the data sources transmit new information, either continually (as transaction
processing takes place), or periodically (nightly, for example). In a destination-
driven architecture, the data warehouse periodically sends requests for new data
to the sources. Unless updates at the sources are replicated at the warehouse via
two-phase commit, the warehouse will never be quite up-to-date with the sources.
Two-phase commit is usually far too expensive to be an option, so data
warehouses typically have slightly out-of-date data. That, however, is usually not
a problem for decision-support systems.
• What schema to use: Data sources that have been constructed independently
are likely to have different schemas. In fact, they may even use different data
models. Part of the task of a warehouse is to perform schema integration, and to
convert data to the integrated schema before they are stored. As a result, the data
stored in the warehouse are not just a copy of the data at the sources. Instead, they
can be thought of as a materialized view of the data at the sources.
• Data transformation and cleansing: The task of correcting and preprocessing
data is called data cleansing. Data sources often deliver data with numerous minor
inconsistencies, which can be corrected. For example, names are often misspelled,
and addresses may have street, area, or city names misspelled, or postal codes
entered incorrectly. These can be corrected to a reasonable extent by consulting a
database of street names and postal codes in each city. The approximate matching
of data required for this task is referred to as fuzzy lookup.
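As a rough illustration of the fuzzy lookup idea, the sketch below uses Python's standard difflib module to match misspelled city names against a reference list. The city names and the similarity cutoff are made-up assumptions for illustration, not part of any particular cleansing product.

```python
import difflib

# Hypothetical reference list of valid city names (assumed for illustration).
valid_cities = ["Chennai", "Coimbatore", "Madurai", "Tiruchirappalli", "Salem"]

def fuzzy_lookup(raw_name, candidates, cutoff=0.8):
    """Return the closest valid name for a possibly misspelled input,
    or None if nothing is similar enough."""
    matches = difflib.get_close_matches(raw_name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Misspelled names as they might arrive from an operational source.
for raw in ["Chenai", "Coimbatur", "Maduraii", "Unknownville"]:
    print(raw, "->", fuzzy_lookup(raw, valid_cities))
```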
• How to propagate update: Updates on relations at the data sources must be
propagated to the data warehouse. If the relations at the data warehouse are exactly
the same as those at the data source, the propagation is straightforward. If they are
not, the problem of propagating updates is basically the view-maintenance
problem.
• What data to summarize: The raw data generated by a transaction-processing
system may be too large to store online. However, we can answer many queries
by maintaining just summary data obtained by aggregation on a relation, rather
than maintaining the entire relation. For example, instead of storing data about
every sale of clothing, we can store total sales of clothing by item name and
category.
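To make the summarization idea concrete, here is a minimal Python sketch using the standard sqlite3 module that replaces row-level clothing sales with totals by item name and category. The table and column names are illustrative assumptions only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical raw, transaction-level clothing sales (assumed schema for illustration).
cur.execute("CREATE TABLE sales (item_name TEXT, category TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("T-Shirt", "Clothing", 250.0), ("T-Shirt", "Clothing", 300.0),
     ("Jeans", "Clothing", 1200.0), ("Jeans", "Clothing", 1100.0)],
)

# Summary data kept in the warehouse instead of every individual sale.
cur.execute("""
    CREATE TABLE sales_summary AS
    SELECT item_name, category, SUM(amount) AS total_sales, COUNT(*) AS num_sales
    FROM sales
    GROUP BY item_name, category
""")

for row in cur.execute("SELECT * FROM sales_summary ORDER BY item_name"):
    print(row)
```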
Need for Data Warehouse
An ordinary database can store MBs to GBs of data, and that too for a specific purpose. For data
of terabyte size, storage shifts to the data warehouse. Besides this, a transactional database does
not lend itself to analytics. To perform analytics effectively, an organization keeps a central data
warehouse to study its business closely by organizing, understanding, and using its historical data
for making strategic decisions and analyzing trends.
Benefits of Data Warehouse
• Better business analytics: A data warehouse plays an important role in every business by
storing and enabling analysis of all the company's past data and records, which further
improves the company's understanding of its data.
• Faster queries: The data warehouse is designed to handle large queries, so it runs
analytical queries faster than an operational database.
• Improved data quality: The data gathered from different sources is stored and analyzed
in the warehouse without the warehouse altering it or adding data by itself, so data quality
is maintained; if a data quality issue is found, the data warehouse team resolves it.
• Historical insight: The warehouse stores all your historical data about the business, so it
can be analyzed at any time to extract insights.
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data
warehouses typically provide a concise and straightforward view of a particular subject, such as
customer, product, or sales, instead of the organization's global ongoing operations. This is done
by excluding data that are not useful to the subject and including all data needed by the users to
understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attribute types, etc., among the
different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3
months, 6 months, 12 months ago, or even older data from a data warehouse. This contrasts with
a transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed. It usually requires only two procedures
in data accessing: Initial loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for substantial
speedup of data retrieval. Non-volatile means that, once entered into the warehouse, data should
not change.
DATA MARTS
What is Data Mart?
A data mart is a subset of an organizational information store, generally oriented to a specific
purpose or primary data subject, which may be distributed to serve business needs. Data marts
are analytical record stores designed to focus on particular business functions for a specific
community within an organization. Data marts are usually derived from subsets of data in a data
warehouse, though in the bottom-up data warehouse design methodology the data warehouse is
created from the union of organizational data marts.
The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to
gather, store, access, and analyze records. A data mart can be used by smaller businesses to
utilize the data they have accumulated, since it is less expensive than implementing a full data
warehouse.
Reasons for creating a data mart
There are mainly two approaches to designing data marts: dependent data marts and independent data marts.
A dependent data mart is a logical subset of a physical subset of a higher data warehouse.
According to this technique, the data marts are treated as the subsets of a data warehouse. In
this technique, firstly a data warehouse is created from which further various data marts can be
created. These data marts are dependent on the data warehouse and extract the essential records
from it. In this technique, as the data warehouse creates the data mart, there is no need for data
mart integration. It is also known as the top-down approach.
The second approach is independent data marts (IDM). Here, independent data marts are created
first, and then a data warehouse is designed using these multiple independent data marts. In this
approach, as all the data marts are designed independently, the integration of data marts is
required. It is also termed the bottom-up approach, as the data marts are integrated to develop
the data warehouse.
Other than these two categories, one more type exists that is called "Hybrid Data Marts."
The significant steps in implementing a data mart are to design the schema, construct the
physical storage, populate the data mart with data from source systems, access it to make
informed decisions and manage it over time. So, the steps are:
Designing
The design step is the first in the data mart process. This phase covers all of the functions from
initiating the request for a data mart through gathering data about the requirements and
developing the logical and physical design of the data mart.
Constructing
This step contains creating the physical database and logical structures associated with the data
mart to provide fast and efficient access to the data.
Populating
This step includes all of the tasks related to getting data from the source, cleaning it up,
modifying it to the right format and level of detail, and moving it into the data mart.
Accessing
This step involves putting the data to use: querying the data, analyzing it, creating reports,
charts and graphs and publishing them.
1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer
translates database operations and object names into business terms so that end users
can interact with the data mart using words that relate to the business functions.
2. Set up and manage database structures, such as summarized tables, which help queries
submitted through the front-end tools execute rapidly and efficiently.
Managing
This step contains managing the data mart over its lifetime. In this step, management functions
are performed as:
1. Providing secure access to the data.
2. Managing the growth of the data.
3. Optimizing the system for better performance.
4. Ensuring the availability of data even in the event of system failures.
A data warehouse may hold multiple subject areas, whereas a data mart holds only one subject
area, for example Finance or Sales. In data warehousing, the fact constellation schema is used;
in a data mart, the star schema and snowflake schema are used.
Data marts, data warehouses, and data lakes are all data repositories, but they differ in the
structure and scope of the data stored and serve different purposes within an organization.
The data warehouse serves as the central repository of data for the entire organization, while a
data mart focuses on data important to and needed by a specific division or line of business. The
warehouse aggregates data from different sources to support data mining, artificial intelligence,
and machine learning, which results in improved analytics and business intelligence.
Since a data warehouse stores all data generated by an organization, access to the warehouse
should be strictly controlled. It can be extremely difficult to query data needed for a particular
purpose from the enormous pool of data contained in a data warehouse. That is where the data
mart is helpful. The main purpose of a data mart is to partition or separate a subset of the entire
dataset to provide easy access to data to end-users.
Both data warehouse and data mart are relational databases built to store transactional data
(e.g., numerical order, time value, object reference) in tabular form for ease of organizing and
access.
A single data mart can be created from an existing data warehouse in the top-down
development approach or from other sources like internal operational systems or external data.
The designing process involves several tools and technologies to construct a physical database,
populate it with data and implement stringent access and management rules. It is a complex
process, but the mart enables a business to get more focused insights in less time than working
with a broader dataset in a warehouse.
A Data Lake is also a data repository that provides massive storage for raw or unstructured data
from various sources. Since a data lake stores raw data that is not processed or prepared for
analysis, it is more accessible and cost-effective than a data warehouse. The data does not
require cleanup or processing before being loaded into the lake.
Data Marts are built to enable business users to access the most relevant data in the shortest
time. With its small size and focused design, data mart offers several benefits to the end-user,
including:
• Simplified data access. Data marts contain a small subset of data, so users can easily
retrieve the data they need, compared to sifting through the broader data set of a data
warehouse.
• Quick access to data insights. Insights gained from a data mart impact decisions at the
department level. Teams can use these focused insights with specific goals in mind,
resulting in faster business processes and higher productivity.
• Shorter implementation time. A data mart needs less implementation time than a data
warehouse because it focuses on only a small subset of data, so implementation tends to
be more efficient and less time-consuming.
• Historical data. It contains historical data, which helps data analysts to predict trends.
A data mart and a data warehouse can be organized using a star, vault, snowflake, or other
schema as a blueprint.
Usually, a star schema is used. It consists of one or more fact tables referencing dimension
tables in a relational database. In a star schema, fewer joins are required when writing queries.
In the snowflake schema, the dimensions are normalized, so data redundancy is reduced and data
integrity is protected. The structure is more complicated and difficult to maintain, though it takes
less space to store the dimension tables.
Businesses are increasingly moving to cloud-based data marts and data warehouses instead of
traditional on-premises setups. Business and IT teams are striving to become more agile and
data-driven to improve regular decision-making. The benefits of cloud architecture include:
• The cloud-based architecture uses massively parallel processing, which helps data marts
handle large query workloads efficiently.
• Flexible architecture.
• Higher efficiency.
Leading cloud service providers offer a shared cloud-based platform to create, store, access, and
analyze data efficiently. Business teams can quickly combine transient data clusters for
short-term analysis or long-lived clusters for sustained work. With the use of modern
technologies, data storage can easily be separated from computing, allowing for extensive
scalability when querying data.
OLAP implements the multidimensional analysis of business information and provides the
capability for complex calculations, trend analysis, and sophisticated data modeling. It is rapidly
becoming the essential foundation for intelligent solutions including Business Performance
Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation
Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables end users to
perform ad hoc analysis of data in multiple dimensions, providing the insight and understanding
they require for better decision making.
Who uses OLAP and Why?
Finance
o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling
Production
o Production planning
o Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data model
more intuitive to them than a tabular model. This model is called a Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to achieve using
tabular models.
Dr E.F. Codd, the "father" of the relational model, has formulated a list of 12 guidelines and
requirements as the basis for selecting OLAP systems:
1) Multidimensional Conceptual View: This is the central feature of an OLAP system. By
requiring a multidimensional view, it is possible to carry out operations like slice and dice.
3) Accessibility: It provides access only to the data that is actually required to perform the
particular analysis, presenting a single, coherent, and consistent view to the clients. The OLAP
system must map its own logical schema to the heterogeneous physical data stores and perform
any necessary transformations. The OLAP operations should be sitting between data sources
(e.g., data warehouses) and an OLAP front-end.
4) Consistent Reporting Performance: To make sure that users do not feel any significant
degradation in reporting performance as the number of dimensions or the size of the
database increases. That is, the performance of OLAP should not suffer as the number of
dimensions is increased. Users must observe consistent run time, response time, or machine
utilization every time a given query is run.
7) Dynamic Sparse Matrix Handling: The physical schema should adapt to the specific
analytical model being created and loaded so that sparse matrix handling is optimized. When
encountering a sparse matrix, the system must be able to dynamically deduce the distribution of
the data and adjust storage and access to obtain and maintain a consistent level of performance.
8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and
access security.
10) Intuitive Data Manipulation: Data manipulation fundamental to the consolidation path,
such as reorientation (pivoting), drill-down and roll-up, and other manipulations, should be
accomplished naturally and precisely via point-and-click and drag-and-drop actions on the cells
of the analytical model, avoiding the use of menus or multiple trips to the user interface.
11) Flexible Reporting: It gives business users the ability to organize columns, rows, and cells
in a manner that facilitates simple manipulation, analysis, and synthesis of data.
12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should
be unlimited. Each of these common dimensions must allow a practically unlimited number of
customer-defined aggregation levels within any given consolidation path.
To facilitate this kind of analysis, data is collected from multiple sources and stored in data
warehouses, then cleansed and organized into data cubes. Each OLAP cube contains data
categorized by dimensions (such as customers, geographic sales region and time period)
derived from dimension tables in the data warehouse. Dimensions are then populated by
members (such as customer names, countries and months) that are organized hierarchically.
OLAP cubes are often pre-summarized across dimensions to drastically improve query time
over relational databases.
Analysts can then perform five types of OLAP analytical operations against
these multidimensional databases (a short pandas sketch of these operations follows this list):
• Slice. This enables an analyst to take one level of information for display, such as
"sales in 2017."
• Dice. This allows an analyst to select data from multiple dimensions to analyze,
such as "sales of blue beach balls in Iowa in 2017."
• Drill down. This enables an analyst to navigate from a summary level to more detailed
levels of data, such as from yearly sales down to monthly sales.
• Roll up. The opposite of drill down; data is aggregated and summarized along a
dimension hierarchy.
• Pivot. Analysts can gain a new view of data by rotating the data axes of the cube.
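A small pandas sketch of these operations on a made-up sales table (the column names and values are assumptions chosen only to illustrate the ideas):

```python
import pandas as pd

# Hypothetical sales records (illustrative data only).
sales = pd.DataFrame({
    "year":    [2016, 2017, 2017, 2017, 2017],
    "state":   ["Iowa", "Iowa", "Iowa", "Ohio", "Ohio"],
    "product": ["Beach Ball", "Beach Ball", "Kite", "Beach Ball", "Kite"],
    "revenue": [120.0, 150.0, 80.0, 90.0, 60.0],
})

# Slice: fix one dimension, e.g. "sales in 2017".
sales_2017 = sales[sales["year"] == 2017]

# Dice: select on several dimensions, e.g. beach balls sold in Iowa in 2017.
diced = sales[(sales["year"] == 2017)
              & (sales["state"] == "Iowa")
              & (sales["product"] == "Beach Ball")]

# Roll up: aggregate to a coarser level, e.g. total revenue per year.
rolled_up = sales.groupby("year")["revenue"].sum()

# Drill down: move to a finer level, e.g. revenue per year and state.
drilled_down = sales.groupby(["year", "state"])["revenue"].sum()

# Pivot: rotate the axes so states become columns and years stay as rows.
pivoted = sales.pivot_table(values="revenue", index="year",
                            columns="state", aggfunc="sum")

print(sales_2017, diced, rolled_up, drilled_down, pivoted, sep="\n\n")
```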
OLAP software locates the intersection of dimensions, such as all products sold in the Eastern
region above a certain price during a certain time period, and displays them. The result is the
measure; each OLAP cube has at least one to perhaps hundreds of measures, which derive from
information stored in fact tables in the data warehouse.
OLAP can be used for data mining or the discovery of previously undiscerned relationships
between data items. An OLAP database does not need to be as large as a data warehouse, since
not all transactional data is needed for trend analysis. Using Open Database Connectivity, data
can be imported from existing relational databases to create a multidimensional database for
OLAP.
OLAP products include IBM Cognos, Microsoft Power BI, Oracle OLAP and Tableau. OLAP
features are also included in tools such as Microsoft Excel and Microsoft SQL Server's
Analysis Services. OLAP products are typically designed for multiple-user environments, with
the cost of the software based on the number of users.
Characteristics of OLAP
In the FASMI characterization of OLAP methods, the term is derived from the first letters of the
five characteristics:
Fast
The system is targeted to deliver most responses to the user within about five seconds, with the
simplest analyses taking no more than one second and very few taking more than 20 seconds.
Analysis
The system should be able to cope with any business logic and statistical analysis that is relevant
for the application and the user, while keeping it easy enough for the target user. Although some
pre-programming may be needed, it is not acceptable if all application definitions have to be
pre-programmed; the user must be able to define new ad hoc calculations as part of the analysis
and to report on the data in any desired way, without having to program. Products (like Oracle
Discoverer) that do not allow adequate end-user-oriented calculation flexibility are therefore
excluded.
Share
The system should implement all the security requirements for confidentiality and, if multiple
write access is needed, concurrent update locking at an appropriate level. Not all applications
need users to write data back, but for the increasing number that do, the system should be able
to handle multiple updates in a timely, secure manner.
Multidimensional
This is the basic requirement. OLAP system must provide a multidimensional conceptual view
of the data, including full support for hierarchies, as this is certainly the most logical method
to analyze business and organizations.
Information
The system should be able to hold all the data needed by the applications. Data sparsity should
be handled in an efficient manner.
OLAP vs. OLTP at a glance
Purpose: OLAP helps you analyze large volumes of data to support decision-making, whereas
OLTP helps you manage and process real-time transactions.
Response time: OLAP has longer response times, typically in seconds or minutes, whereas
OLTP has shorter response times, typically in milliseconds.
OLAP stands for Online Analytical Processing. OLAP systems have the capability to analyze
database information of multiple systems at the current time. The primary goal of OLAP
Service is data analysis and not data processing.
OLTP stands for Online Transaction Processing. OLTP administers the day-to-day transactions
in an organization. The main goal of OLTP is data processing, not data analysis.
OLAP Examples
Any type of Data Warehouse System is an OLAP system. The uses of the OLAP System are
described below.
• Spotify analyzes the songs played by users to come up with a personalized homepage of
songs and playlists.
• Netflix movie recommendation system.
Drawbacks of OLAP Services
• OLAP services require professionals to handle the data because of the complex
modeling procedure.
• OLAP services are expensive to implement and maintain in cases when datasets
are large.
• In OLAP, data can be analyzed only after it has been extracted and transformed, which
delays the availability of results.
• OLAP services are less suited to real-time decision-making, as the data is updated only
on a periodic basis.
Online Transaction Processing (OLTP)
Online transaction processing provides transaction-oriented applications in a 3-tier
architecture. OLTP administers the day-to-day transactions of an organization.
OLTP Examples
An example of an OLTP system is an ATM center: the person who authenticates first receives
the amount first, provided the amount to be withdrawn is available in the ATM (a small Python
sketch of such a transaction follows the lists below). The uses of an OLTP system are described
below.
• ATM center is an OLTP application.
• OLTP handles the ACID properties during data transactions via the application.
• It is also used for online banking, online airline ticket booking, sending a text
message, or adding a book to a shopping cart.
• OLTP services allow users to perform read, write and delete operations quickly.
• OLTP services support growing numbers of users and transactions and help provide
real-time access to data.
• OLTP services help to provide better security by applying multiple security features.
• OLTP services help in making better decisions by providing accurate, current data.
• OLTP services provide data integrity, consistency, and high availability to the data.
Drawbacks of OLTP Services
• OLTP has limited analysis capability, as it is not designed for complex analysis or
reporting.
• OLTP has high maintenance costs because of frequent maintenance, backups, and
recovery.
• OLTP services are hampered whenever there is a hardware failure, which leads to the
failure of online transactions.
• OLTP Services many times experience issues such as duplicate or inconsistent
data.
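A minimal Python sketch, using the standard sqlite3 module, of the kind of atomic debit transaction an ATM-style OLTP application performs; the account table, balances, and amounts are made-up assumptions, not a real banking implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES (1, 5000.0)")
conn.commit()

def withdraw(conn, account_id, amount):
    """Debit an account inside a single transaction; roll back on any failure
    so the data stays consistent."""
    try:
        row = conn.execute("SELECT balance FROM account WHERE account_id = ?",
                           (account_id,)).fetchone()
        if row is None or row[0] < amount:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE account SET balance = balance - ? WHERE account_id = ?",
                     (amount, account_id))
        conn.commit()          # the whole debit becomes durable atomically
        return True
    except Exception:
        conn.rollback()        # undo any partial change
        return False

print(withdraw(conn, 1, 2000.0))   # True  -> balance is now 3000.0
print(withdraw(conn, 1, 9000.0))   # False -> rolled back, balance unchanged
```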
Difference between OLAP and OLTP
Definition: OLAP (Online Analytical Processing) is well known as an online database query
management system, while OLTP (Online Transaction Processing) is well known as an online
database modifying system.
Application: OLAP is subject-oriented and is used for data mining, analytics, decision making,
etc., while OLTP is application-oriented and is used for day-to-day business tasks.
Backup and Recovery: OLAP only needs backup from time to time, whereas in OLTP the
backup and recovery process is maintained rigorously.
DATA MODELLING
Data warehouse modeling is an essential stage of building a data warehouse for two main
reasons. Firstly, through the schema, data warehouse clients can visualize the relationships
among the warehouse data, to use them with greater ease. Secondly, a well-designed schema
allows an effective data warehouse structure to emerge, to help decrease the cost of
implementing the warehouse and improve the efficiency of using it.
Data modeling in data warehouses is different from data modeling in operational database
systems. The primary function of data warehouses is to support DSS processes. Thus, the
objective of data warehouse modeling is to make the data warehouse efficiently support
complex queries on long term information.
In contrast, data modeling in operational database systems targets efficiently supporting simple
transactions in the database such as retrieving, inserting, deleting, and changing data.
Moreover, data warehouses are designed for the customer with general information knowledge
about the enterprise, whereas operational database systems are more oriented toward use by
software specialists for creating distinct applications.
Current detail data:
o Reflects the most recent happenings, which are commonly the most interesting.
o Is voluminous, as it is saved at the lowest level of granularity.
o Is almost always saved on disk storage, which is fast to access but expensive and
difficult to manage.
Older detail data is stored in some form of mass storage; it is infrequently accessed and kept at
a level of detail consistent with the current detail data.
Lightly summarized data is data extracted from the low level of detail found at the current
detailed level and is usually stored on disk storage. When building the data warehouse, one has
to decide what unit of time the summarization is done over and what components or attributes
the summarized data will contain.
Highly summarized data is compact and directly available and can even be found outside the
warehouse.
Metadata is the final component of the data warehouse. It is of a different nature in that it is not
the same as data drawn from the operational environment; it is used as:
o A directory to help the DSS investigator locate the items of the data warehouse.
o A guide to the mapping of data as it is transformed from the operational environment to
the data warehouse environment.
o A guide to the method used for summarization between the current detailed data and
the lightly summarized data and the highly summarized data, etc.
In this section, we define a data modeling life cycle. It is a straightforward process of
transforming the business requirements to fulfill the goals for storing, maintaining, and
accessing the data within IT systems. The result is a logical and physical data model for an
enterprise data warehouse.
The objective of the data modeling life cycle is primarily the creation of a storage area for
business information. That area comes from the logical and physical data modeling stages, as
shown in Figure:
Conceptual Data Model
A conceptual data model recognizes the highest-level relationships between the different
entities.
We can see that the only data shown via the conceptual data model are the entities that define
the data and the relationships between those entities. No other detail is shown in the conceptual
data model.
Logical Data Model
A logical data model defines the information in as much structure as possible, without
observing how they will be physically achieved in the database. The primary objective of
logical data modeling is to document the business data structures, processes, rules, and
relationships by a single view - the logical data model.
The phases for designing the logical data model are as follows:
Physical Data Model
A physical data model describes how the model will be implemented in the database. A physical
database model demonstrates all table structures, column names, data types, constraints,
primary key, foreign key, and relationships between tables. The purpose of physical data
modeling is the mapping of the logical data model to the physical structures of the RDBMS
system hosting the data warehouse. This contains defining physical RDBMS structures, such
as tables and data types to use when storing the information. It may also include the definition
of new data structures for enhancing query performance.
The steps for physical data model design are as follows:
Enterprise Warehouse
An Enterprise warehouse collects all of the records about subjects spanning the entire
organization. It supports corporate-wide data integration, usually from one or more operational
systems or external data providers, and it's cross-functional in scope. It generally contains
detailed information as well as summarized information and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond.
Data Mart
A data mart includes a subset of corporate-wide data that is of value to a specific collection of
users. The scope is confined to particular selected subjects. For example, a marketing data mart
may restrict its subjects to the customer, items, and sales. The data contained in the data marts
tend to be summarized.
Independent Data Mart: An independent data mart is sourced from data captured from one or
more operational systems or external data providers, or from data generated locally within a
particular department or geographic area.
Dependent Data Mart: Dependent data marts are sourced directly from enterprise data
warehouses.
Virtual Warehouses
A virtual data warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized. A virtual warehouse
is easy to build but requires excess capacity on the operational database servers.
STAR SCHEMA FOR MULTIDIMENSIONAL VIEW
The fact table in a star schema contains the measures or metrics that are of interest to
the user or organization. For example, in a sales data warehouse, the fact table might contain
sales revenue, units sold, and profit margins. Each record in the fact table represents a specific
event or transaction, such as a sale or order.
The dimension tables in a star schema contain the descriptive attributes of the measures
in the fact table. These attributes are used to slice and dice the data in the fact table, allowing
users to analyze the data from different perspectives. For example, in a sales data warehouse,
the dimension tables might include product, customer, time, and location.
In a star schema, each dimension table is joined to the fact table through a foreign key
relationship. This allows users to query the data in the fact table using attributes from the
dimension tables. For example, a user might want to see sales revenue by product category, or
by region and time period.
The star schema is a popular data modeling technique in data warehousing because it
is easy to understand and query. The simple structure of the star schema allows for fast query
response times and efficient use of database resources. Additionally, the star schema can be
easily extended by adding new dimension tables or measures to the fact table, making it a
scalable and flexible solution for data warehousing.
The star schema is the simplest and most fundamental of the data mart schemas. This schema is
widely used to develop or build a data warehouse and dimensional data marts. It includes one or
more fact tables indexing any number of dimension tables. The star schema is the basis of the
snowflake schema. It is also efficient for handling basic queries.
It is said to be a star because its physical model resembles a star shape, having a fact table at its
center and the dimension tables at its periphery representing the star's points.
Consider the following example of a star schema:
In this example, SALES is a fact table having the attributes (Product ID, Order ID, Customer ID,
Employer ID, Total, Quantity, Discount), which reference the dimension tables. The Employee
dimension table contains the attributes: Emp ID, Emp Name, Title,
Department and Region.
Product dimension table contains the attributes: Product ID, Product Name, Product Category,
Unit Price.
Customer dimension table contains the attributes: Customer ID, Customer Name, Address,
City, Zip.
Time dimension table contains the attributes: Order ID, Order Date, Year, Quarter, Month.
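A minimal Python/sqlite3 sketch of a cut-down version of this star schema (only a few of the attributes listed above are kept, and the sample rows are made-up assumptions). It shows the kind of one-join-per-dimension query, such as revenue by product category, that the star layout makes straightforward.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables (only a subset of the attributes listed above).
cur.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, "
            "product_name TEXT, product_category TEXT)")
cur.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, "
            "customer_name TEXT, city TEXT)")

# Central fact table referencing each dimension through a foreign key.
cur.execute("""
    CREATE TABLE sales (
        order_id    INTEGER PRIMARY KEY,
        product_id  INTEGER REFERENCES product(product_id),
        customer_id INTEGER REFERENCES customer(customer_id),
        quantity    INTEGER,
        total       REAL
    )
""")

cur.executemany("INSERT INTO product VALUES (?, ?, ?)",
                [(1, "Football", "Sports"), (2, "Notebook", "Stationery")])
cur.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, "Asha", "Chennai"), (2, "Ravi", "Madurai")])
cur.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
                [(101, 1, 1, 2, 900.0), (102, 2, 2, 5, 250.0), (103, 1, 2, 1, 450.0)])

# Typical star-schema query: revenue by product category, one join per dimension used.
for row in cur.execute("""
        SELECT p.product_category, SUM(s.total) AS revenue
        FROM sales s
        JOIN product p ON p.product_id = s.product_id
        GROUP BY p.product_category
    """):
    print(row)
```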
Features:
Central fact table: The star schema revolves around a central fact table that contains the
numerical data being analyzed. This table contains foreign keys to link to dimension tables.
Dimension tables: Dimension tables are tables that contain descriptive attributes about the
data being analyzed. These attributes provide context to the numerical data in the fact table.
Each dimension table is linked to the fact table through a foreign key.
Denormalized structure: A star schema is denormalized, which means that redundancy is
allowed in the schema design to improve query performance. This is because it is easier and
faster to join a small number of tables than a large number of tables.
Simple queries: Star schema is designed to make queries simple and fast. Queries can be
written in a straightforward manner by joining the fact table with the appropriate dimension
tables.
Aggregated data: The numerical data in the fact table is usually aggregated at different
levels of granularity, such as daily, weekly, or monthly. This allows for analysis at different
levels of detail.
Fast performance: Star schema is designed for fast query performance. This is because the
schema is denormalized and data is pre-aggregated, making queries faster and more efficient.
Easy to understand: The star schema is easy to understand and interpret, even for non-
technical users. This is because the schema is designed to provide context to the numerical
data through the use of dimension tables.
MULTI FACT STAR SCHEMA OR SNOW FLAKE SCHEMA
The snowflake schema is a variant of the star schema. Here, the centralized fact table is
connected to multiple dimensions. In the snowflake schema, dimensions are present in
a normalized form in multiple related tables. The snowflake structure materializes when the
dimensions of a star schema are detailed and highly structured, having several levels of
relationship, and the child tables have multiple parent tables. The snowflake effect affects only
the dimension tables and does not affect the fact tables.
A snowflake schema is a type of data modeling technique used in data warehousing to represent
data in a structured way that is optimized for querying large amounts of data efficiently. In a
snowflake schema, the dimension tables are normalized into multiple related tables, creating a
hierarchical or “snowflake” structure.
In a snowflake schema, the fact table is still located at the center of the schema, surrounded by
the dimension tables. However, each dimension table is further broken down into multiple
related tables, creating a hierarchical structure that resembles a snowflake.
For Example, in a sales data warehouse, the product dimension table might be normalized into
multiple related tables, such as product category, product subcategory, and product details.
Each of these tables would be related to the product dimension table through a foreign
key relationship.
Example:
Snowflake Schema
The Employee dimension table now contains the attributes: EmployeeID, EmployeeName,
DepartmentID, Region, and Territory. The DepartmentID attribute links the Employee table
with the Department dimension table. The Department dimension is
used to provide detail about each department, such as the Name and Location of the department.
The Customer dimension table now contains the attributes: CustomerID, CustomerName,
Address, and CityID. The CityID attributes link the Customer dimension table with
the City dimension table. The City dimension table has details about each city such as city
name, Zipcode, State, and Country.
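For contrast, a minimal Python/sqlite3 sketch of a snowflaked Customer dimension in which, as described above, city details are normalized into a separate City table; the table contents are made-up assumptions. The same revenue-by-city question now needs one extra join compared with a pure star layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Snowflaked dimension: city details are normalized out of the Customer table.
cur.execute("CREATE TABLE city (city_id INTEGER PRIMARY KEY, "
            "city_name TEXT, state TEXT, country TEXT)")
cur.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, "
            "customer_name TEXT, city_id INTEGER REFERENCES city(city_id))")
cur.execute("CREATE TABLE sales (order_id INTEGER PRIMARY KEY, "
            "customer_id INTEGER REFERENCES customer(customer_id), total REAL)")

cur.executemany("INSERT INTO city VALUES (?, ?, ?, ?)",
                [(1, "Chennai", "Tamil Nadu", "India"),
                 (2, "Madurai", "Tamil Nadu", "India")])
cur.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, "Asha", 1), (2, "Ravi", 2)])
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(101, 1, 900.0), (102, 2, 250.0), (103, 2, 450.0)])

# Revenue by city now traverses fact -> customer -> city: one more join than
# the equivalent star-schema query, which is the usual cost of snowflaking.
for row in cur.execute("""
        SELECT ci.city_name, SUM(s.total) AS revenue
        FROM sales s
        JOIN customer cu ON cu.customer_id = s.customer_id
        JOIN city ci     ON ci.city_id = cu.city_id
        GROUP BY ci.city_name
    """):
    print(row)
```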
What is Snowflaking?
The snowflake design is the result of further expansion and normalization of the dimension
table. In other words, a dimension table is said to be snowflaked if the low-cardinality attribute
of the dimensions has been divided into separate normalized tables. These tables are then joined
to the original dimension table with referential constraints (foreign key constraints).
Generally, snowflaking is not recommended in the dimension table, as it hampers the
understandability and performance of the dimension model as more tables would be required
to be joined to satisfy the queries.
Difference between Star Schema and
Snowflake Schema
Star Schema: The star schema is a type of multidimensional model used for data warehouses. It
contains fact tables and dimension tables, uses fewer foreign-key joins, and forms a star with the
fact table at the center and the dimension tables around it.
Snowflake Schema: The snowflake schema is also a type of multidimensional model used for
data warehouses. It contains fact tables, dimension tables as well as sub-dimension tables, and
therefore uses more foreign-key joins.
• In the star schema, fact tables and dimension tables are contained, while in the snowflake
schema, fact tables, dimension tables as well as sub-dimension tables are contained.
• The star schema takes less time for the execution of queries, while the snowflake schema
takes more time than the star schema for the execution of queries.
• The star schema has high data redundancy, while the snowflake schema has low data
redundancy.
(The figure that accompanied this section showed a fact constellation: the dimension tables are
common to both star schemas, while each star schema has its own fact table.)
Example:
In that example, Placement is a fact table having the attributes (Stud_roll, Company_id, TPO_id)
with the facts (Number of students eligible, Number of students placed).
Disadvantage: it is much more complex and hence hard to implement and maintain.
Advantages of the snowflake schema:
Reduced data redundancy: The snowflake schema reduces data redundancy by normalizing
dimensions into multiple tables, resulting in a more efficient use of storage space.
Improved performance for some queries: The normalized structure can reduce the amount of
data that needs to be scanned when a query touches only part of a dimension.
Scalability: The snowflake schema is scalable, making it suitable for large data warehousing
projects with complex hierarchies.
Disadvantages of the snowflake schema:
Increased complexity: The snowflake schema can be more complex to implement and maintain
due to the additional tables needed for the normalized dimensions.
Reduced query performance: The increased number of joins in the snowflake schema can
reduce query performance, particularly for queries that need data from multiple dimensions.
Data integrity: It can be harder to maintain data integrity due to the additional relationships
between tables.
Fact Constellation Schema: The fact constellation schema is also a type of multidimensional
model. It consists of dimension tables that are shared by several fact tables; in other words, it
contains more than one star schema at a time. Unlike the snowflake schema, the fact
constellation (or galaxy) schema is not easy to operate, as it has multiple relationships between
tables, and it typically requires heavily complex queries to access data from the database.
Snowflake schema vs. fact constellation schema:
The snowflake schema saves significant storage, while the fact constellation schema does not
save storage.
In the snowflake schema, the tables can be maintained easily, while in the fact constellation
schema the tables are tough to maintain.
OLAP TOOLS
OLAP means On-Line Analytical Processing, which can extract or retrieve data selectively to
be analyzed from different viewpoints. It is of great value in Business Intelligence and can be
used in sales forecasting and financial reporting analysis.
Example:
A user can ask for data and analysis of the number of footballs sold in a particular region, say
Jharkhand, in September and compare that with the number sold in May. They may also see a
comparison of other sports goods sold in Jharkhand for the same months.
So OLAP tools help us to analyze data multi-dimensionally in data mining. OLAP tools can
be classified as the following :
1. MOLAP
MOLAP stands for Multidimensional Online Analytical Processing; MOLAP tools store data in
optimized multidimensional (cube) storage rather than in a relational database. Tools in this
category include:
• IBM Cognos: It provides tools for reporting, analyzing, and monitoring events and
metrics.
• SAP NetWeaver BW: It is known as SAP NetWeaver Business Warehouse. Just like
IBM Cognos, It also delivers reporting, analysis, and interpretation of business data. It
runs on Industry-standard RDBMS and SAP’s HANA in-memory DBMS.
• Microsoft Analysis Services: Organizations use Microsoft Analysis Services to make
sense of data in multiple databases or a discrete form.
• MicroStrategy Intelligence Server: MicroStrategy Intelligence Server helps businesses
standardize on a single open platform, which reduces their maintenance and operating
costs.
• Mondrian OLAP server: It is an open-source OLAP tool whose USP is that it is written
in Java. Another feature of this tool is that it supports XML, SQL, and other data
sources.
• Ic Cube: Like the above OLAP tool, this is also written in Java and is an in-memory
multidimensional OLAP tool.
• Infor BI OLAP Server: It is a real-time, in-memory OLAP database for
multidimensional analysis, planning, and modeling. It is also used for financial,
operational planning, and reporting.
• Jedox OLAP Server: It is a cell-oriented, multi-dimensional and, most importantly,
in-memory OLAP server.
• Oracle Database OLAP option: As the name suggests, this OLAP tool is specifically
designed to introduce OLAP functionality within the database environment of Oracle.
The main objective that it serves is that it can direct the SQL queries to OLAP cubes
which in return will speed up the process.
• SAS OLAP Server: Like the IcCube OLAP server, It provides a multidimensional data
storage facility. This server can also be used to get quick access to pre-summarized data.
• IBM TM1: This OLAP server provides multidimensional data storage, represented in
OLAP cubes, and performs real-time computations.
2. ROLAP
The ‘R’ in ROLAP stands for Relational. So, the full form of ROLAP becomes Relational
Online Analytical Processing. The salient feature of ROLAP is that it stores the data in
relational databases.
• IBM Cognos
• SAP NetWeaver BW
• Microsoft Analysis Services
• Essbase
• Jedox OLAP Server
• SAS OLAP Server
• MicroStrategy Intelligence Server
• Oracle Database OLAP option
3. HOLAP
It stands for Hybrid Online Analytical Processing. So, HOLAP bridges the shortcomings of
both MOLAP and ROLAP by combining their capabilities. Now how does it combine? It
combines data by dividing data of databases between relational and specialized storage.
• IBM Cognos
• SAP NetWeaver BW
• Mondrian OLAP server
• Microsoft Analysis Services
• Essbase
• Jedox OLAP Server
• SAS OLAP Server
• MicroStrategy Intelligence Server
• Oracle Database OLAP option
Now let’s go through the advantages that OLAP tools have in the domain of Business
Intelligence.
• It helps us analyze and modify reports much faster since the data is from in-
memory data cubes rather than the data warehouse.
• MicroStrategy and other OLAP tools incorporate intelligent and secure Cube data
sharing capabilities, ensuring the secure sharing of data.
• Another benefit is the consistency of information and calculations. Reporting remains
consistently accurate in OLAP servers as the speed of data sharing does not impede the
process.
• The multidimensional presentation using OLAP tools helps users better understand
relationships that were not apparent previously.
• Another popular scenario is the “What if” scenario of OLAP software. The
multidimensional processing of OLAP tools greatly enhances their potential.
• We can apply security restrictions on users and objects using OLAP tools.
• It creates a single platform for planning, forecasting, reporting, and analysis.
• Data mining concepts are in use for Sales and marketing to provide better
customer service, to improve cross-selling opportunities, to increase direct mail
response rates.
• Customer Retention in the form of pattern identification and prediction of likely
defections is possible by Data mining.
• The risk assessment and fraud detection areas also use data mining concepts for
identifying inappropriate or unusual behavior, etc.
Market basket analysis mainly works with the ASSOCIATION RULE {IF} -> {THEN}.
• IF means Antecedent: An antecedent is an item found within the data
• THEN means Consequent: A consequent is an item found in combination with
the antecedent.
Let's see how the ASSOCIATION RULE {IF} -> {THEN} is used in market basket analysis in
data mining. For example, a customer buying a domain will very likely also need extra
plugins/extensions to make it easier to use.
As said above, the antecedent is the item set that is found in the data; in terms of the rule it is the
{IF} component, and in this example it is the domain.
Likewise, the consequent is the item found in combination with the antecedent; in terms of the
rule it is the {THEN} component, and in this example it is the extra plugins/extensions.
With the help of these rules, we are able to predict customer behavioural patterns and to build
combinations and offers around products that customers will probably buy together, which
automatically increases the sales and revenue of the company.
With the help of the Apriori Algorithm, we can further classify and simplify the item sets which
are frequently bought by the consumer.
There are three components in APRIORI ALGORITHM:
• SUPPORT
• CONFIDENCE
• LIFT
Now take an example. Suppose 5,000 transactions have been made through a popular
e-commerce website, and we want to calculate the support, confidence, and lift for two products,
say a pen and a notebook. Out of the 5,000 transactions, 500 involve only a pen, 700 involve
only a notebook, and 1,000 involve both (reading the figures this way keeps them consistent,
since transactions containing both items are a subset of those containing either item).
SUPPORT: It is calculated as the number of transactions containing the item (or item set)
divided by the total number of transactions made:
support(pen) = transactions containing pen / total transactions
CONFIDENCE: It indicates whether the product sells mainly on its own or through combined
sales. For the rule {pen} -> {notebook} it is the number of combined transactions divided by the
number of transactions containing the antecedent:
confidence(pen -> notebook) = transactions containing both / transactions containing pen
LIFT: Lift is the ratio of the rule's confidence to the expected support of the consequent:
lift(pen -> notebook) = confidence(pen -> notebook) / support(notebook)
With the figures above, the lift works out to about 2 (a sketch of this calculation follows below).
When the lift value is below 1, the combination is not bought together more often than chance
would suggest; here, a lift of about 2 shows that the probability of buying both items together is
high compared with the items being bought individually.
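A small Python sketch of these calculations, using the counts from the example above under the stated interpretation (500 transactions containing only a pen, 700 containing only a notebook, and 1,000 containing both):

```python
# Counts from the example above (interpreted so that the figures are consistent).
total_transactions = 5000
only_pen      = 500    # transactions containing a pen but no notebook
only_notebook = 700    # transactions containing a notebook but no pen
both          = 1000   # transactions containing both a pen and a notebook

pen_total      = only_pen + both        # all transactions containing a pen
notebook_total = only_notebook + both   # all transactions containing a notebook

# Support: fraction of all transactions containing the item (or item set).
support_pen      = pen_total / total_transactions
support_notebook = notebook_total / total_transactions
support_both     = both / total_transactions

# Confidence of {pen} -> {notebook}: of the transactions with a pen,
# the fraction that also contain a notebook.
confidence = both / pen_total

# Lift: confidence divided by the expected support of the consequent.
lift = confidence / support_notebook

print(f"support(pen)      = {support_pen:.2f}")
print(f"support(notebook) = {support_notebook:.2f}")
print(f"support(both)     = {support_both:.2f}")
print(f"confidence(pen -> notebook) = {confidence:.2f}")
print(f"lift(pen -> notebook)       = {lift:.2f}")  # roughly 2, as quoted above
```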
With this, we come to an overall view of the Market Basket Analysis in Data Mining and how
to calculate the sales for combination products.
There are three types of Market Basket Analysis. They are as follow:
1. Descriptive market basket analysis: This sort of analysis looks for patterns and
connections in the data that exist between the components of a market basket. This
kind of study is mostly used to understand consumer behavior, including what
products are purchased in combination and what the most typical item
combinations. Retailers can place products in their stores more profitably by
understanding which products are frequently bought together with the aid of
descriptive market basket analysis.
2. Predictive Market Basket Analysis: Market basket analysis that predicts future
purchases based on past purchasing patterns is known as predictive market basket
analysis. Large volumes of data are analyzed using machine learning algorithms in
this sort of analysis in order to create predictions about which products are most
likely to be bought together in the future. Retailers may make data-driven decisions
about which products to carry, how to price them, and how to optimize shop layouts
with the use of predictive market basket research.
3. Differential Market Basket Analysis: Differential market basket analysis
analyses two sets of market basket data to identify variations between them.
Comparing the behavior of various client segments or the behavior of customers
over time is a common usage for this kind of study. Retailers can respond to shifting
consumer behavior by modifying their marketing and sales tactics with the help of
differential market basket analysis.
1. Retail: Market basket research is frequently used in the retail sector to examine
consumer buying patterns and inform decisions about product placement, inventory
management, and pricing tactics. Retailers can utilize market basket research to
identify which items are sluggish sellers and which ones are commonly bought
together, and then modify their inventory management strategy accordingly.
2. E-commerce: Market basket analysis can help online merchants better understand
the customer buying habits and make data-driven decisions about product
recommendations and targeted advertising campaigns. The behaviour of visitors to
a website can be examined using market basket analysis to pinpoint problem areas.
3. Finance: Market basket analysis can be used to evaluate investor behaviour and
forecast the types of investment items that investors will likely buy in the future.
The performance of investment portfolios can be enhanced by using this
information to create tailored investment strategies.
4. Telecommunications: To evaluate consumer behaviour and make data-driven
decisions about which goods and services to provide, the telecommunications
business might employ market basket analysis. The usage of this data can enhance
client happiness and the shopping experience.
5. Manufacturing: To evaluate consumer behaviour and make data-driven decisions
about which products to produce and which materials to employ in the production
process, the manufacturing sector might use market basket analysis. Utilizing this
knowledge will increase effectiveness and cut costs.
OLAP TOOLS AND THE INTERNET
The two most comprehensive developments in computing have been the internet and data
warehousing, so the integration of these two giant technologies is a necessity. The advantages of
using the Web for access are compelling.
These advantages are:
· The internet provides connectivity between countries, acting as a freely available resource.
· The Web eases the administrative tasks of managing scattered locations.
· The Web allows users to store and manage data and applications on servers that
can be managed, maintained and updated centrally.
These reasons indicate the importance of the Web in data storage and manipulation. The Web-
enabled data access has many significant features, such as:
· The first
· The second
· The emerging third
· HTML publishing
· Helper applications
· Plug-ins
· Server-centric components
· Java and ActiveX applications
The primary key to the decision-making process is the amount of data collected and how well
this data is interpreted. Nowadays, managers are not satisfied with getting direct answers to their
direct questions; instead, due to market growth and the increase in clients, their questions have
become more complicated. A question might be: "How much profit did we make from selling
our products at our different centers per month?" A complicated question like this cannot be
answered directly; it needs analysis across three fields (product, center, and month) in order to
obtain an answer.
The decision making process exists in the different levels of an organization. The speed and
the
simplicity of gathering data and the ability to convert this data into information is the main
element in the decision process. That’s why the term Business Intelligence has evolved.
Business Intelligence
As mentioned earlier, business intelligence is concerned with gathering data and converting this
data into information, so as to make better decisions. How well the data is gathered and how
well it is interpreted as information is one of the most important elements in a successful
business.