UNIT 5
Data visualization is critical to market research, where both numerical and
categorical data can be visualized. This increases the impact of the insights and
reduces the risk of analysis paralysis. Data visualization can be grouped into the
following categories:
variety of possible views of data that has been transformed from raw information
to reflect the real dimensionality of the enterprise as understood by the clients.
OLAP operations: There are five basic analytical operations that can be
performed on an OLAP cube:
1. Drill down: In the drill-down operation, less detailed data is converted into
more detailed data. It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed
by moving down the concept hierarchy of the Time dimension (Quarter ->
Month).
2. Roll up: It is the opposite of drill-down; data is aggregated by climbing up a
concept hierarchy or by reducing the number of dimensions. In the cube given in
the overview section, the roll-up operation is performed by climbing up the
concept hierarchy of the Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more
dimensions. In the cube given in the overview section, a sub-cube is selected
using the following dimensions and criteria:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”
4. Slice: It selects a single dimension from the OLAP cube, which results in the
creation of a new sub-cube. In the cube given in the overview section, Slice is
performed on the dimension Time = “Q1”.
5. Pivot (rotation): It rotates the current view of the cube to present the same
data from a different perspective.
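These operations can be sketched on a toy fact table. The following is a minimal,
illustrative pandas example; the values are made up and only the dimension names
come from the cube described above.

import pandas as pd

# Hypothetical fact table with the dimensions used in the example above.
sales = pd.DataFrame({
    "Quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "Month":    ["Jan", "Feb", "Apr", "May", "Mar", "Jun"],
    "Location": ["Delhi", "Kolkata", "Delhi", "Kolkata", "Delhi", "Kolkata"],
    "Item":     ["Car", "Bus", "Car", "Bus", "Bus", "Car"],
    "Sales":    [100, 80, 120, 90, 60, 70],
})

# Roll up: aggregate to the coarser Quarter level of the Time hierarchy.
roll_up = sales.groupby(["Quarter", "Location"])["Sales"].sum()

# Drill down: move to the finer Month level.
drill_down = sales.groupby(["Quarter", "Month", "Location"])["Sales"].sum()

# Slice: fix a single dimension value (Time = "Q1").
slice_q1 = sales[sales["Quarter"] == "Q1"]

# Dice: select a sub-cube on two or more dimensions.
dice = sales[sales["Location"].isin(["Delhi", "Kolkata"])
             & sales["Quarter"].isin(["Q1", "Q2"])
             & sales["Item"].isin(["Car", "Bus"])]

print(roll_up, drill_down, slice_q1, dice, sep="\n\n")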
3) Accessibility: The OLAP system should provide access only to the data that is
actually required to perform the particular analysis and present a single, coherent,
and consistent view to the clients. The OLAP system must map its own logical
schema to the heterogeneous physical data stores and perform any necessary
transformations. The OLAP layer should sit between the data sources (e.g., data
warehouses) and the OLAP front-end.
4) Consistent Reporting Performance: Users should not experience any
significant degradation in reporting performance as the number of dimensions or
the size of the database increases. That is, the performance of OLAP should not
suffer as the number of dimensions is increased. Users must observe consistent
run time, response time, and machine utilization every time a given query is run.
7) Dynamic Sparse Matrix Handling: The OLAP system should adapt its physical
schema to the specific analytical model being created and loaded, so that sparse
matrices are handled efficiently. When it encounters a sparse matrix, the system
must be able to dynamically infer the distribution of the data and adjust storage
and access methods to obtain and maintain a consistent level of performance.
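As a rough illustration of the idea (not a description of any particular OLAP
product), a sparse cube can be stored so that only non-empty cells occupy space,
for example as a dictionary keyed by cell coordinates:

# Minimal illustration of sparse cube storage: only non-empty cells are kept,
# keyed by their (Time, Location, Item) coordinates. Values are made up.
sparse_cube = {
    ("Q1", "Delhi", "Car"): 100,
    ("Q2", "Kolkata", "Bus"): 90,
}

def cell(time, location, item):
    # Missing combinations cost no storage and default to 0 on access.
    return sparse_cube.get((time, location, item), 0)

print(cell("Q1", "Delhi", "Car"))    # 100
print(cell("Q1", "Kolkata", "Car"))  # 0 (cell never stored)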
8) Multiuser Support: OLAP tools must provide concurrent data access, data
integrity, and access security.
2. Multidimensional Online Analytical Processing (MOLAP): MOLAP uses
array-based multidimensional storage engines to provide multidimensional views
of data. MOLAP stores data on disk in a specialized multidimensional array
structure, and the OLAP operations exploit the random-access capability of these
arrays.
MOLAP has three components:
Database Server
MOLAP server
Front-end tool.
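A minimal sketch of the array-based storage idea, using a small numpy array
whose axes play the role of dimensions (the dimension members and values are
made up for illustration):

import numpy as np

# Dimension members (illustrative values only).
quarters  = ["Q1", "Q2", "Q3", "Q4"]
locations = ["Delhi", "Kolkata"]
items     = ["Car", "Bus"]

# MOLAP-style storage: a dense multidimensional array, one axis per dimension.
cube = np.zeros((len(quarters), len(locations), len(items)))

# Loading a measure value is direct array indexing (random access).
cube[quarters.index("Q1"), locations.index("Delhi"), items.index("Car")] = 100

# Aggregating over the Time axis gives a Location x Item summary.
total_by_location_item = cube.sum(axis=0)
print(total_by_location_item)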
Advantages of MOLAP –
Suitable for slicing and dicing operations.
Outperforms ROLAP when data is dense.
Capable of performing complex calculations.
Disadvantages of MOLAP –
MOLAP cannot handle large amounts of data.
MOLAP requires additional investment.
Without re-aggregation, it is difficult to change dimensions.
Advantages of HOLAP –
HOLAP provides the functionalities of both MOLAP and ROLAP.
HOLAP provides fast access at all levels of aggregation.
Disadvantages of HOLAP –
Data Mining Interface
The data mining interface provides the medium through which users communicate
with data mining processes. A data mining query language is difficult to use
directly, so a graphical user interface (GUI) is typically provided for
communicating with the data mining system. The query language can serve as a
core language on top of which GUIs can easily be designed.
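The following is a small, hypothetical sketch of that layered design (the function
and field names are invented for illustration): the GUI only has to assemble a
simple request, and a thin layer translates it into a call on an underlying mining
routine, here represented by a trivial frequent-item stub.

from collections import Counter

def mine_frequent_items(transactions, min_support):
    # Trivial stand-in for a real mining routine: items whose relative
    # frequency reaches the requested support threshold.
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item for item, c in counts.items() if c / n >= min_support}

def run_request(request, transactions):
    # The GUI builds a dictionary like the one below; the query details
    # stay hidden from the user.
    if request["task"] == "frequent_items":
        return mine_frequent_items(transactions, request.get("min_support", 0.5))
    raise ValueError("unsupported task: " + request["task"])

print(run_request({"task": "frequent_items", "min_support": 0.5},
                  [["bread", "butter"], ["bread", "milk"], ["butter", "bread"]]))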
A data mining architecture consists of several components: data sources, a
database or data warehouse server, a data mining engine, a pattern evaluation
module, a graphical user interface, and a knowledge base.
a. Data Sources
The actual sources of data include databases, data warehouses, and the World
Wide Web (WWW). Sometimes data may reside in plain text files or spreadsheets.
The World Wide Web, or the Internet, is another big source of data.
b. Database or Data Warehouse Server
The database server contains the actual data that is ready to be processed. Hence,
the server handles the retrieval of the relevant data, based on the user's data
mining request.
c. Data Mining Engine
The data mining engine is the core component of the data mining system. It
consists of a number of modules used to perform data mining tasks such as
association, classification, characterization, clustering, and prediction.
d. Pattern Evaluation Modules
This module is mainly responsible for measuring the interestingness of the
discovered patterns, typically against a user-supplied threshold value. It interacts
with the data mining engine so that the search is focused towards interesting
patterns.
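An illustrative (made-up) example of such a threshold check, filtering association
rules by minimum support and confidence:

# Illustrative pattern-evaluation step: keep only association rules whose
# interestingness measures clear user-supplied thresholds (values made up).
rules = [
    {"rule": "bread -> butter", "support": 0.30, "confidence": 0.80},
    {"rule": "milk -> eggs",    "support": 0.05, "confidence": 0.40},
]

MIN_SUPPORT, MIN_CONFIDENCE = 0.10, 0.60

interesting = [r for r in rules
               if r["support"] >= MIN_SUPPORT and r["confidence"] >= MIN_CONFIDENCE]
print(interesting)   # only "bread -> butter" survives the thresholds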
e. Graphical User Interface
This interface is used for communication between the user and the data mining
system. It helps the user use the system easily and efficiently without needing to
know the real complexity of the process. When the user specifies a query, this
module interacts with the data mining system and displays the result in an easily
understandable manner.
f. Knowledge Base
The knowledge base is useful throughout the data mining process. It is used to
guide the search for the resulting patterns. The knowledge base may also contain
user beliefs and data derived from user experience that can be useful in the
process of data mining.
Data backup is a form of disaster recovery and should be part of any disaster
recovery plan.
1. Backup: Backup refers to storing a copy of the original data that can be used in
case of data loss. Backup is considered one of the approaches to data protection.
An organization's important data needs to be backed up efficiently to protect
valuable information. A backup can be made by storing a copy of the original
data separately, or in a database on storage devices. Various types of backup are
available, such as full backup, incremental backup, local backup, and mirror
backup. An example of a backup tool is SnapManager, which makes a backup of
everything in the database.
2. Recovery: Recovery refers to restoring lost data by following defined
processes. Even data that was backed up can still be lost, in which case it can be
recovered by applying recovery techniques. When a database fails for any reason,
there is a chance of data loss, so the recovery process helps improve the
reliability of the database. An example of a recovery tool is SnapManager, which
recovers the data up to the last transaction.
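As a minimal, file-level sketch of the two ideas (the paths are placeholders; real
tools such as SnapManager do far more):

import shutil
from pathlib import Path

# Hypothetical locations of the original data and its backup copy.
SOURCE = Path("data/orders.db")
BACKUP = Path("backup/orders.db.bak")

def backup():
    # Store a separate copy of the original data.
    BACKUP.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(SOURCE, BACKUP)        # copy2 also preserves timestamps

def recover():
    # Restore the last good copy after the original is lost or corrupted.
    shutil.copy2(BACKUP, SOURCE)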
Difference between Backup and Recovery:
1. Backup refers to storing a copy of the original data separately, whereas
recovery refers to restoring the lost data in case of failure.
2. Backup is a copy of data used to restore the original data after a data loss or
damage occurs, whereas recovery is the process of retrieving lost, corrupted, or
damaged data to its original state.
3. In simple terms, backup is the replication of data, whereas recovery is the
process of restoring the database.
4. The primary goal of backup is to keep an extra copy to refer to in case the
original data is lost, whereas the primary goal of recovery is to retrieve the
original data after a failure.
5. Backup helps in improving data protection, whereas recovery helps in
improving the reliability of the database.
6. Backup makes the recovery process easier, whereas recovery has no role in
data backup.
7. The cost of backup is affordable, whereas the cost of recovery is expensive.
8. Backup is very common in production usage, whereas recovery is used very
rarely.
9. A backup is not created automatically, whereas restore points can be generated
automatically by your computer.
10. A backup stores copies of the files in a location external to the original,
whereas a restore is carried out internally on your computer.
11. Backup requires extra storage space, whereas restore is internal and does not
require extra external storage space.
12. Backup offers a means of recovery, whereas recovery aims to guarantee the
atomicity of the transaction and data.
A data warehouse is dynamic; it never remains constant.
It is very difficult to predict what query the user is going to post in the
future.
Business requirements change with time.
Users and their profiles keep changing.
A user can switch from one group to another.
A data warehouse keeps evolving, and it is unpredictable what query the user is
going to post in the future. Therefore, it becomes more difficult to tune a data
warehouse system.
There are three basic levels of testing performed on a data warehouse, which are
as follows:
1. Unit Testing –
This type of testing is performed at the developer's end. In unit testing, each
unit/component of the modules is tested separately. Each module of the whole
data warehouse, i.e. program, SQL script, procedure, or Unix shell script, is
validated and tested (a small example of such a test is sketched after this list).
2. Integration Testing –
In this type of testing, the various individual units/modules of the application
are brought together or combined and then tested against a number of inputs.
It is performed to detect faults in the integrated modules and to test whether the
various components perform well after integration.
3. System Testing –
System testing is the form of testing that validates and tests the whole data
warehouse application. This type of testing is performed by a technical testing
team. It is conducted after the developer's team has performed unit testing, and
its main purpose is to check whether the entire system works correctly as a
whole.
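A minimal sketch of a unit test for an ETL-style transformation, using Python's
unittest module (the transformation and its rules are invented for illustration):

import unittest

def transform(rows):
    # Example ETL transformation under test: normalize names and drop null keys.
    return [{"id": r["id"], "name": r["name"].strip().upper()}
            for r in rows if r.get("id") is not None]

class TransformUnitTest(unittest.TestCase):
    def test_drops_rows_without_id(self):
        self.assertEqual(transform([{"id": None, "name": "x"}]), [])

    def test_normalizes_names(self):
        out = transform([{"id": 1, "name": "  delhi "}])
        self.assertEqual(out[0]["name"], "DELHI")

if __name__ == "__main__":
    unittest.main()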
Challenges of data warehouse testing are:
Data selection from multiple sources and the analysis that follows pose a great
challenge.
Because of the volume and complexity of the data, certain testing strategies are
time consuming.
ETL testing requires Hive and SQL skills, so it poses a challenge for testers who
have limited SQL skills.
Redundant data in the data warehouse.
Inconsistent and inaccurate reports.
ETL testing is performed in five stages :
Identifying data sources and requirements.
Data acquisition.
Implementing business logic and dimensional modeling.
Build and populate data.
Build reports.
Recent Trends: Types of Warehousing Applications
Healthcare
Another critical use of data warehouses is in the Healthcare sector. All the clinical,
financial, and employee data are stored in the warehouse, and analysis is run to
derive valuable insights to strategize resources in the best way possible.
Manufacturing & Distribution
With an effective data warehousing solution, organizations in the manufacturing &
distribution sector can organize all their data under one roof and predict market
changes, analyze the latest trends, view development areas, and finally make
result-driven decisions.
Retailing
Retailers are the mediators between wholesalers and end customers, and that’s why
it is necessary for them to maintain the records of both parties. Data warehousing
helps them store this data in an organized manner.
Insurance
In the insurance sector, data warehousing is required to maintain existing
customers’ records and to analyze them in order to spot client trends and attract
more business.
Services
In the services sector, data warehousing is used for maintaining customer details,
financial records, and resources to analyze patterns and boost decision-making for
positive outcomes.
Web Mining
Web mining is the process of applying data mining techniques to extract useful
patterns, trends, and information from web-based resources such as web
documents and services, server logs, and hyperlinks. The main goal of web
mining is to find patterns in web data by collecting and analyzing it in order to
gain important insights.
Web mining can broadly be viewed as the application of adapted data mining
methods to the web, whereas data mining is the application of algorithms to
discover patterns in mostly structured data, embedded in a knowledge discovery
process.
Web mining has the distinctive property of supporting a collection of multiple
data types. The web has several aspects that yield multiple approaches to the
mining process: web pages contain text, web pages are connected via hyperlinks,
and user activity can be monitored via web server logs.
Applications of Web Mining:
1. Web mining helps to improve the power of web search engines by classifying
web documents and identifying web pages.
2. It is used for web searching (e.g., Google, Yahoo) and vertical searching
(e.g., FatLens, Become).
3. Web mining is used to predict user behavior.
4. Web mining is very useful for optimizing a particular website or e-service,
e.g., landing page optimization.
Web mining can be broadly divided into three different types of mining
techniques:
In web usage mining, user access data on the web is collected in the form of
logs. So, web usage mining is also called log mining.
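A tiny, illustrative example of log mining (the log lines and their format are made
up): counting how often each page is requested in a set of server log entries.

from collections import Counter

# Minimal log-mining sketch: count page hits from simplified web server log lines.
log_lines = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /products.html 200",
    "10.0.0.1 GET /index.html 200",
]

page_hits = Counter(line.split()[2] for line in log_lines)
print(page_hits.most_common())   # most frequently accessed pages first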
Spatial Mining:
Spatial data mining is the process of discovering interesting and previously
unknown, but potentially useful, patterns from spatial databases. In spatial data
mining, analysts use geographical or spatial information to produce business
intelligence or other results. Challenges involved in spatial data mining include
identifying patterns or finding objects that are relevant to the research project.
Spatial means space, whereas temporal means time.
Spatial data mining refers to the extraction of knowledge, spatial relationships, or
other interesting patterns not explicitly stored in spatial databases. Such mining
demands the unification of data mining with spatial database technologies. It can
be used for understanding spatial records, discovering spatial relationships and
relationships between spatial and nonspatial records, constructing spatial
knowledge bases, reorganizing spatial databases, and optimizing spatial queries.
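As one concrete illustration (the technique and the coordinates are chosen here
for the example and are not named in the text), nearby spatial points can be
grouped into clusters with DBSCAN from scikit-learn:

import numpy as np
from sklearn.cluster import DBSCAN   # assumes scikit-learn is installed

# Illustrative spatial pattern discovery: group nearby (x, y) locations into
# clusters; the coordinates are made up.
points = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
                   [5.0, 5.0], [5.1, 4.9]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)   # points sharing a label form one spatial cluster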
Temporal Mining
Temporal data mining refers to the extraction of implicit, non-trivial, and
potentially useful abstract information from large collections of temporal data. It
is concerned with the analysis of temporal data and with finding temporal
patterns and regularities in sets of temporal data. The tasks of temporal data
mining are (a small trend-analysis sketch follows this list):
Data Characterization and Comparison
Cluster Analysis
Classification
Association rules
Prediction and Trend Analysis
Pattern Analysis
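A minimal illustration of the prediction and trend-analysis task on a made-up
monthly series: a rolling mean smooths the series and exposes the underlying
trend.

import pandas as pd

# Made-up monthly sales figures used only for illustration.
sales = pd.Series([10, 12, 11, 15, 18, 17, 21, 24],
                  index=pd.period_range("2023-01", periods=8, freq="M"))

# A 3-month rolling mean acts as a simple trend estimate.
trend = sales.rolling(window=3).mean()
print(trend)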