
KCA012: Data Warehousing & Data Mining

UNIT-5

Data Visualization and Overall Perspective: Aggregation, Historical information, Query Facility, OLAP function and Tools. OLAP Servers, ROLAP, MOLAP, HOLAP, Data Mining interface, Security, Backup and Recovery, Tuning Data Warehouse, Testing Data Warehouse.
Warehousing applications and Recent Trends: Types of Warehousing Applications, Web Mining, Spatial Mining and Temporal Mining.
----------------------------------------------------------------------------------------------------
Data Visualization
Data visualization is the graphical representation of data points and information, making them easy and quick for the user to understand. A good visualization has a clear meaning and purpose and is easy to interpret without requiring additional context. Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data using visual elements such as charts, graphs, and maps.

Categories of Data Visualization :

Data visualization is critical to market research, where both numerical and categorical data can be visualized; this increases the impact of insights and reduces the risk of analysis paralysis. Data visualization is categorized into the following categories:

Figure – Categories of Data Visualization


1. Numerical Data :
Numerical data is also known as quantitative data. Numerical data is any data that represents amounts, such as the height, weight, or age of a person. Numerical data visualization is the easiest way to visualize data. It is generally used to help others digest large data sets and raw numbers in a way that makes them easier to turn into action. Numerical data is divided into two categories:
 Continuous Data –
It can take any value within a range (Example: Height measurements).
 Discrete Data –
This type of data is not “continuous” (Example: Number of cars or children a household has).
The visualization techniques used to represent numerical data are charts and numerical values. Examples are pie charts, bar charts, averages, scorecards, etc.
2. Categorical Data :
Categorical data is also known as qualitative data. Categorical data is any data that represents groups. It consists of categorical variables that represent characteristics such as a person’s ranking, a person’s gender, etc. Categorical data visualization is all about depicting key themes, establishing connections, and lending context. Categorical data is classified into three categories:
 Binary Data –
In this, classification is based on opposing positions (Example: Agree or Disagree).
 Nominal Data –
In this, classification is based on attributes (Example: Male or Female).
 Ordinal Data –
In this, classification is based on the ordering of information (Example: Timelines or processes).
The visualization techniques used to represent categorical data are graphics, diagrams, and flowcharts. Examples are word clouds, sentiment mapping, Venn diagrams, etc.
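As a small illustration of the two categories, the sketch below (plain Python; the survey fields and values are invented for illustration) computes an average for numerical data and group frequencies for categorical data:

```python
from collections import Counter
from statistics import mean

# Hypothetical survey data: field names and values are illustrative only.
heights_cm = [160, 172, 168, 181, 175]          # numerical (continuous)
responses = ["Agree", "Disagree", "Agree", "Agree", "Disagree"]  # binary/categorical

# Numerical data is typically summarized with averages or scorecards.
avg_height = mean(heights_cm)
print(f"Average height: {avg_height} cm")       # scorecard-style numeric summary

# Categorical data is typically summarized by group frequencies,
# which feed charts such as bar charts or word clouds.
counts = Counter(responses)
print(counts["Agree"], counts["Disagree"])      # 3 2
```

The same frequency counts would be the input to a bar chart or word cloud in a real visualization tool.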
What is OLAP (Online Analytical Processing)?
OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology that enables analysts, managers, and executives to gain insight into information through fast, consistent, interactive access to a wide variety of possible views of data that has been transformed from raw information to reflect the real dimensionality of the enterprise as understood by the clients.

OLAP implements multidimensional analysis of business information and supports complex calculations, trend analysis, and sophisticated data modeling. It is rapidly becoming the essential foundation for intelligent solutions including business performance management, planning, budgeting, forecasting, financial reporting, analysis, simulation models, knowledge discovery, and data warehouse reporting. OLAP enables end clients to perform ad hoc analysis of data in multiple dimensions, providing the insight and understanding they require for better decision making.

Difference between OLAP and OLTP (AKTU Question)


Definition: OLAP is well-known as an online database query management system; OLTP is well-known as an online database modifying system.
Method used: OLAP makes use of a data warehouse; OLTP makes use of a standard database management system (DBMS).
Application: OLAP is subject-oriented, used for data mining, analytics, decision making, etc.; OLTP is application-oriented, used for business tasks.
Normalization: In an OLAP database, tables are not normalized; in an OLTP database, tables are normalized (3NF).
Usage of data: OLAP data is used in planning, problem-solving, and decision-making; OLTP data is used to perform day-to-day fundamental operations.
Volume of data: OLAP stores a large amount of data, typically in TB or PB; in OLTP the size of the data is relatively small (MB, GB) as historical data is archived.
Queries: OLAP queries are relatively slow, since the amount of data involved is large, and may take hours; OLTP queries are very fast as they operate on about 5% of the data.
Update: The OLAP database is not often updated, so data integrity is unaffected; in an OLTP database, the data integrity constraint must be maintained.
Backup and Recovery: OLAP only needs backup from time to time; in OLTP, the backup and recovery process is maintained rigorously.
Processing time: In OLAP, processing complex queries can take a long time; OLTP is comparatively fast because of simple and straightforward queries.
Types of users: OLAP data is generally managed by CEOs, MDs, and GMs; OLTP data is managed by clerks and managers.

OLAP Operations on Data Cube


OLAP stands for Online Analytical Processing Server. It is a software technology that allows users to analyze information from multiple database systems at the same time. It is based on the multidimensional data model and allows the user to query multi-dimensional data (e.g., Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes, and these cubes are known as hyper-cubes.

OLAP operations: There are five basic analytical operations that can be
performed on an OLAP cube:
1. Drill down: In the drill-down operation, less detailed data is converted into highly detailed data. It can be done by:
 Moving down in the concept hierarchy
 Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by moving down in the concept hierarchy of the Time dimension (Quarter -> Month).

2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation on the OLAP cube. It can be done by:
 Climbing up in the concept hierarchy
 Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing up in the concept hierarchy of the Location dimension (City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more
dimensions. In the cube given in the overview section, a sub-cube is selected
by selecting following dimensions with criteria:
 Location = “Delhi” or “Kolkata”
 Time = “Q1” or “Q2”
 Item = “Car” or “Bus”

4. Slice: It selects a single dimension from the OLAP cube, which results in the creation of a new sub-cube. In the cube given in the overview section, Slice is performed on the dimension Time = “Q1”.

5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new view of the representation. In the sub-cube obtained after the slice operation, performing the pivot operation gives a new view of it.
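The operations above can be sketched in plain Python on a tiny cube held in a dictionary; the cities, quarters, items, and sales figures below are invented for illustration and are not taken from the figure in the overview section:

```python
# A tiny OLAP cube: (location, time, item) -> sales. Values are illustrative.
cube = {
    ("Delhi",   "Q1", "Car"): 100, ("Delhi",   "Q2", "Car"): 150,
    ("Kolkata", "Q1", "Car"): 80,  ("Kolkata", "Q1", "Bus"): 60,
    ("Mumbai",  "Q1", "Car"): 120,
}

def slice_cube(cube, time):
    """Slice: fix a single dimension (Time), yielding a new sub-cube."""
    return {(loc, item): v for (loc, t, item), v in cube.items() if t == time}

def dice_cube(cube, locs, times, items):
    """Dice: select a sub-cube using criteria on two or more dimensions."""
    return {k: v for k, v in cube.items()
            if k[0] in locs and k[1] in times and k[2] in items}

def roll_up(cube):
    """Roll up: aggregate away the Item dimension (climb the hierarchy)."""
    out = {}
    for (loc, t, item), v in cube.items():
        out[(loc, t)] = out.get((loc, t), 0) + v
    return out

print(slice_cube(cube, "Q1"))
print(roll_up(cube)[("Kolkata", "Q1")])   # 80 + 60 = 140
```

Drill down would be the inverse of `roll_up`, re-introducing the aggregated-away dimension from more detailed records.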

OLAP Guidelines (Dr.E.F.Codd Rule)


Dr E.F. Codd, the "father" of the relational model, has formulated a list of 12
guidelines and requirements as the basis for selecting OLAP systems:

1) Multidimensional Conceptual View: This is the central feature of an OLAP system. By requiring a multidimensional view, it is possible to carry out methods like slice and dice.

2) Transparency: Make the technology, underlying information repository, computing operations, and the dissimilar nature of source data totally transparent to users. Such transparency helps to improve the efficiency and productivity of the users.

3) Accessibility: Provide access only to the data that is actually required to perform the particular analysis, presenting a single, coherent, and consistent view to the clients. The OLAP system must map its own logical schema to the heterogeneous physical data stores and perform any necessary transformations. The OLAP operations should sit between data sources (e.g., data warehouses) and an OLAP front-end.

4) Consistent Reporting Performance: Ensure that users do not experience any significant degradation in reporting performance as the number of dimensions or the size of the database increases. That is, the performance of OLAP should not suffer as the number of dimensions increases. Users must observe consistent run time, response time, and machine utilization every time a given query is run.

5) Client/Server Architecture: Make the server component of OLAP tools sufficiently intelligent that various clients can be attached with a minimum of effort and integration programming. The server should be capable of mapping and consolidating data between dissimilar databases.

6) Generic Dimensionality: An OLAP method should treat each dimension as equivalent in both its structure and operational capabilities. Additional operational capabilities may be granted to selected dimensions, but such additional capabilities should be grantable to any dimension.

7) Dynamic Sparse Matrix Handling: Adapt the physical schema to the specific analytical model being created and loaded so that sparse matrix handling is optimized. When encountering a sparse matrix, the system must be able to dynamically infer the distribution of the information and adjust storage and access to obtain and maintain a consistent level of performance.

8) Multiuser Support: OLAP tools must provide concurrent data access, data
integrity, and access security.

9) Unrestricted Cross-dimensional Operations: Provide the ability to recognize dimensional order and to perform roll-up and drill-down operations within a dimension or across dimensions.

10) Intuitive Data Manipulation: Fundamental data manipulations, such as reorientation (pivoting), drill-down and roll-up, and other manipulations, should be accomplished naturally and precisely via point-and-click and drag-and-drop actions on the cells of the analytical model, avoiding the use of a menu or multiple trips to a user interface.

11) Flexible Reporting: Give business clients the ability to organize columns, rows, and cells in a manner that facilitates simple manipulation, analysis, and synthesis of data.

12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should be unlimited. Each of these common dimensions must allow a practically unlimited number of customer-defined aggregation levels within any given consolidation path.

Types of OLAP Servers


OR
Difference between ROLAP, MOLAP and HOLAP
What is OLAP?
Online Analytical Processing Server (OLAP) is based on the multidimensional
data model. It allows managers and analysts to get an insight of the information
through fast, consistent, and interactive access to information.
Types of OLAP Servers:
1. Relational Online Analytical Processing (ROLAP) :
Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a relational database, where both the base data and the dimension tables are stored as relational tables. ROLAP servers are placed between the relational back-end server and the client’s front-end tools, bridging the gap between them. ROLAP uses a relational or extended-relational DBMS to store and manage warehouse data.
ROLAP has three main components:
 Database Server
 ROLAP server
 Front-end tool.
Advantages of ROLAP –
 ROLAP can handle large amounts of data.
 ROLAP tools don’t use pre-calculated data cubes.
 Data can be stored efficiently.
 ROLAP can leverage functionalities inherent in the relational database.
Disadvantages of ROLAP –
 Performance of ROLAP can be slow.
 In ROLAP, aggregate tables are difficult to maintain.
 ROLAP is limited by SQL functionalities.

Relational OLAP Architecture


ROLAP Architecture includes the following components
o Database server.
o ROLAP server.
o Front-end tool.

Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the market. This method allows multiple multidimensional views of two-dimensional relational tables to be created, avoiding the need to structure records around the desired view.
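As a rough sketch of the ROLAP idea, the example below (using Python's built-in sqlite3; the table and column names are invented) stores base data in a relational table and computes a summary aggregation with a GROUP BY, the kind of SQL a ROLAP server generates on behalf of the front-end tool:

```python
import sqlite3

# ROLAP sketch: base data lives in a relational table, and summary
# aggregations are computed with SQL rather than pre-built cubes.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (city TEXT, quarter TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Delhi", "Q1", 100.0), ("Delhi", "Q2", 150.0),
    ("Kolkata", "Q1", 80.0), ("Kolkata", "Q2", 70.0),
])

# A roll-up over the quarter dimension, expressed as a GROUP BY:
rows = con.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Delhi', 250.0), ('Kolkata', 150.0)]
```

Because no cube is pre-calculated, every aggregation is recomputed by the relational engine, which is why ROLAP scales well but can respond slowly.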

2. Multidimensional Online Analytical Processing (MOLAP) : MOLAP uses array-based multidimensional storage engines to provide multidimensional views of data. MOLAP stores data on disk in the form of a specialized multidimensional array structure. It is used for OLAP that relies on the array’s random-access capability.
MOLAP has 3 components :
 Database Server
 MOLAP server
 Front-end tool.

Advantages of MOLAP –
 Suitable for slicing and dicing operations.
 Outperforms ROLAP when data is dense.
 Capable of performing complex calculations.

Disadvantages of MOLAP –
 MOLAP can’t handle large amounts of data.
 MOLAP requires additional investment.
 It is difficult to change dimensions without re-aggregation.
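A minimal sketch of the array-based storage idea, using a plain Python list of lists in place of a real multidimensional storage engine (the dimension orderings and figures are invented):

```python
# MOLAP sketch: a dense multidimensional array indexed by position.
cities   = ["Delhi", "Kolkata"]
quarters = ["Q1", "Q2"]

# A 2-D array gives O(1) random access by coordinates, which is the
# property array-based MOLAP storage engines exploit.
sales = [[100, 150],   # Delhi:   Q1, Q2
         [80,  70]]    # Kolkata: Q1, Q2

def lookup(city, quarter):
    return sales[cities.index(city)][quarters.index(quarter)]

print(lookup("Kolkata", "Q2"))  # 70
```

Note that adding a new dimension means rebuilding the whole array, which mirrors the re-aggregation disadvantage listed above.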

3. Hybrid Online Analytical Processing (HOLAP) : HOLAP combines ROLAP and MOLAP, offering greater scalability than MOLAP and faster computation than ROLAP. HOLAP servers are capable of storing large amounts of detailed data. On the one hand, HOLAP benefits from the greater scalability of ROLAP; on the other hand, it makes use of cube technology for faster performance on summary-type information. Because detailed data is stored in a relational database, the cubes are smaller than in MOLAP.

Advantages of HOLAP –
 HOLAP provides the functionalities of both MOLAP and ROLAP.
 HOLAP provides fast access at all levels of aggregation.

Disadvantages of HOLAP –
 HOLAP architecture is very complex to understand because it supports both MOLAP and ROLAP.

Other types of OLAP include:


 Web OLAP (WOLAP): WOLAP refers to an OLAP application that can be accessed through a web browser. In contrast to traditional client/server OLAP applications, WOLAP has a three-tiered architecture consisting of a client, middleware, and a database server.
 Desktop OLAP (DOLAP): DOLAP stands for desktop online analytical processing. The user downloads the data from the source and works with it on their desktop or laptop. Compared to other OLAP applications, its functionality is limited, but it is less expensive.
 Mobile OLAP (MOLAP): Mobile OLAP refers to OLAP functionality on wireless or mobile devices. The user works with and accesses the data through mobile devices.
 Spatial OLAP (SOLAP): SOLAP combines the capabilities of Geographic Information Systems (GIS) and OLAP into a single user interface. SOLAP was created because the data can be alphanumeric, image, or vector. It allows quick and easy exploration of data stored in a spatial database.

Difference between ROLAP, MOLAP and HOLAP :


Storage location for summary aggregation: ROLAP uses a relational database; MOLAP uses a multidimensional database; HOLAP uses a multidimensional database.
Processing time: ROLAP is very slow; MOLAP is fast; HOLAP is fast.
Storage space requirement: ROLAP has a large storage space requirement compared to MOLAP and HOLAP; MOLAP a medium requirement; HOLAP a small requirement.
Storage location for detail data: ROLAP uses a relational database; MOLAP uses a multidimensional database; HOLAP uses a relational database.
Latency: ROLAP has low latency compared to MOLAP and HOLAP; MOLAP high latency; HOLAP medium latency.
Query response time: ROLAP has a slow query response time compared to MOLAP and HOLAP; MOLAP a fast response time; HOLAP a medium response time.

Data Mining interface
The data mining interface provides the medium that allows users to communicate with data mining processes. A data mining query language is difficult for end users to use directly, so a graphical user interface is used to communicate with data mining systems. A data mining query language can serve as a core language on top of which GUIs can easily be designed.

The data mining architecture contains several components: data sources, a database or data warehouse server, a data mining engine, a pattern evaluation module, a graphical user interface, and a knowledge base.

a. Data Sources
Many kinds of data sources exist: databases, data warehouses, and the World Wide Web (WWW). These are the actual sources of data. Sometimes data may reside in plain text files or spreadsheets. The World Wide Web, or the Internet, is another big source of data.
b. Database or Data Warehouse Server
The database server contains the actual data that is ready to be processed. The server handles the retrieval of the relevant data, based on the user’s data mining request.

c. Data Mining Engine
The data mining engine is the core component of a data mining system. It consists of a number of modules used to perform data mining tasks, including association, classification, characterization, clustering, prediction, etc.
d. Pattern Evaluation Module
This module is mainly responsible for measuring the interestingness of patterns, using a threshold value. It interacts with the data mining engine, and its main focus is to steer the search toward interesting patterns.
e. Graphical User Interface
This interface enables communication between the user and the data mining system and helps the user work with the system easily and efficiently, without needing to know the real complexity of the process. When the user specifies a query, this module interacts with the data mining system and displays the result in an easily understandable manner.
f. Knowledge Base
The knowledge base is beneficial throughout the data mining process. It is used to guide the search for result patterns. The knowledge base may even contain user beliefs and data from user experience that can be useful in the data mining process.

Security, Backup and Recovery


Backup and recovery refers to the process of backing up data in case of a loss and setting up systems that allow that data to be recovered afterwards. Backing up data requires copying and archiving computer data so that it is accessible in case of data deletion or corruption. Data from an earlier time can only be recovered if it has been backed up.

Data backup is a form of disaster recovery and should be part of any disaster
recovery plan.

1. Backup: Backup refers to storing a copy of the original data that can be used in case of data loss. Backup is considered one of the approaches to data protection. Important organizational data needs to be backed up efficiently to protect valuable data. Backup can be achieved by storing a copy of the original data separately, or in a database, on storage devices. Various types of backups are available, such as full backup, incremental backup, local backup, mirror backup, etc. An example of a backup tool is SnapManager, which makes a backup of everything in the database.
2. Recovery: Recovery refers to restoring lost data by following some process. Even data that was backed up can be lost, in which case it can be recovered by implementing recovery techniques. When a database fails for any reason there is a chance of data loss, so in that case the recovery process helps improve the reliability of the database. An example of a recovery tool is SnapManager, which recovers the data from the last transaction.

There are various types of backup which are as follows −


 Complete backup − The entire database is backed up at once. This includes all data files, control files, and journal files.
 Cold backup − A backup that is taken while the database is completely shut down.
 Hot backup − A backup taken while the database engine is up and running; the database is open and potentially in use while the backup is made. The DBMS needs special facilities to ensure that the data in the backup is consistent.
Recovery is the phase of reconstructing a database after some element of the database has been lost or damaged. The recovery model of a new database is inherited from the model database when the new database is created; the model for a database can be changed after the database has been created.
 Full recovery model − It provides the most flexibility for recovering the
database to an earlier point of time.
 Bulk-logged recovery model − Bulk-logged recovery provides higher
performance and lowers log space consumption for certain large-scale
operations.
 Simple recovery model − Simple recovery provides the highest
performance and lower log space consumption but with significant exposure
to data loss in the event of a system failure. The amount of exposure to data
loss varies with the model chosen. Each recovery model addresses a different
need.
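A minimal sketch of full backup and recovery for a single file, using Python's standard library; the file names and contents are invented for illustration:

```python
import shutil
import tempfile
from pathlib import Path

# Full (cold) backup sketch: a plain copy taken while nothing is writing.
workdir = Path(tempfile.mkdtemp())
original = workdir / "warehouse.db"
backup = workdir / "warehouse.db.bak"

original.write_text("fact table rows v1")
shutil.copy2(original, backup)           # backup: copy data (and metadata)

original.write_text("corrupted!")        # simulate data loss/corruption
shutil.copy2(backup, original)           # recovery: restore from the backup

print(original.read_text())  # fact table rows v1
```

A real DBMS backup additionally copies control and journal files so the restored state is transactionally consistent.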
Elements of Warehouse Strategy:
1. Walkthrough and observations of the operation
2. Data Gathering of necessary information
3. Interviews with key staff members.
4. Report Analysis
5. External Benchmarking to look for areas of potential improvement.

Difference between Backup and Recovery:
 Backup refers to storing a copy of the original data separately; recovery refers to restoring the lost data in case of failure.
 Backup is a copy of data used to restore the original data after a loss or damage occurs; recovery is the process of retrieving lost, corrupted, or damaged data to its original state.
 In simple terms, backup is the replication of data; recovery is the process of restoring the database.
 The primary goal of backup is to keep an extra copy to refer to in case of original data loss; the primary goal of recovery is to retrieve the original data after a failure.
 Backup helps improve data protection; recovery helps improve the reliability of the database.
 Backup makes the recovery process easier; recovery has no role in data backup.
 The cost of backup is affordable; the cost of recovery is expensive.
 Backup is very common in production use; recovery is used rarely in production.
 Backup is not created automatically; restore points are generated automatically by the computer.
 A backup stores copies of files in an external location; a restore is carried out internally on the computer.
 Backup requires extra storage space; restore is internal, so it does not require extra external storage space.
 Backup offers a means of recovery; recovery aims to guarantee the atomicity of the transaction and data.

Data Warehouse Tuning


Applying different strategies to the various operations of a data warehouse so that its performance measures improve is called data warehouse tuning. For this, it is very important to have complete knowledge of the data warehouse. We can tune different aspects of a data warehouse, such as performance, data load, queries, etc.

Difficulties in Data Warehouse Tuning

A data warehouse keeps evolving, and it is unpredictable what query the user is going to post in the future; therefore, it becomes difficult to tune a data warehouse system. Tuning a data warehouse is a difficult procedure due to the following reasons −
 The data warehouse is dynamic; it never remains constant.
 It is very difficult to predict what query the user is going to post in the future.
 Business requirements change with time.
 Users and their profiles keep changing.
 The user can switch from one group to another.
 The data load on the warehouse also changes with time.
Performance Assessment
Here is a list of objective measures of performance −
 Average query response time
 Scan rates
 Time used per query per day
 Memory usage per process
 I/O throughput rates
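A small sketch of how such measures might be computed from per-query log records; the record layout, timings, and row counts below are invented for illustration:

```python
# Hypothetical per-query log records collected over one day.
query_log = [
    {"query": "Q1", "response_s": 2.0, "rows_scanned": 10_000},
    {"query": "Q2", "response_s": 4.0, "rows_scanned": 50_000},
    {"query": "Q3", "response_s": 3.0, "rows_scanned": 30_000},
]

# Average query response time (seconds).
avg_response = sum(q["response_s"] for q in query_log) / len(query_log)

# Scan rate: rows scanned per second of query time, aggregated over the day.
scan_rate = sum(q["rows_scanned"] for q in query_log) / sum(
    q["response_s"] for q in query_log
)

print(avg_response)  # 3.0
print(scan_rate)     # 10000.0
```

Tracking these figures over time shows whether tuning changes actually help, since consistent response time is itself one of the OLAP requirements.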

Testing in Data warehouse


A data warehouse stores huge amounts of data, typically collected from multiple heterogeneous sources such as files, DBMSs, etc., to produce statistical results that help in decision making.
Testing is very important for data warehouse systems: it validates the data and ensures the system works correctly and efficiently.

18
There are three basic levels of testing performed on a data warehouse, which are as follows :
1. Unit Testing –
This type of testing is performed at the developer’s end. In unit testing, each unit/component of the modules is tested separately. Each module of the whole data warehouse (program, SQL script, procedure, Unix shell script) is validated and tested.
2. Integration Testing –
In this type of testing, the various individual units/modules of the application are brought together, or combined, and then tested against a number of inputs. It is performed to detect faults in the integrated modules and to test whether the various components perform well after integration.
3. System Testing –
System testing is the form of testing that validates and tests the whole data warehouse application. This type of testing is performed by the technical testing team. This test is conducted after the developer’s team performs unit testing, and its main purpose is to check whether the entire system works correctly as a whole.
Challenges of data warehouse testing are :
 Data selection from multiple sources, and the analysis that follows, pose a great challenge.
 Due to the volume and complexity of the data, certain testing strategies are time-consuming.
 ETL testing requires Hive/SQL skills, so it poses challenges for testers who have limited SQL skills.
 Redundant data in a data warehouse.
 Inconsistent and inaccurate reports.
ETL testing is performed in five stages :
 Identifying data sources and requirements.
 Data acquisition.
 Implementing business logic and dimensional modeling.
 Building and populating data.
 Building reports.
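Unit testing in an ETL context often means testing one transformation in isolation. The sketch below is illustrative: the transform and its rules (trim names, uppercase country codes, reject records without an id) are invented, not taken from any particular warehouse:

```python
# One ETL transformation, small enough to unit-test on its own.
def transform(record):
    if record.get("id") is None:
        return None                      # reject records without a key
    return {
        "id": record["id"],
        "name": record["name"].strip(),
        "country": record["country"].upper(),
    }

# Assertions play the role of unit-test cases for this module.
assert transform({"id": 1, "name": " Ana ", "country": "in"}) == \
       {"id": 1, "name": "Ana", "country": "IN"}
assert transform({"id": None, "name": "x", "country": "in"}) is None
print("all ETL unit tests passed")
```

Integration testing would then chain this transform with the extract and load steps and check the combined result against the target schema.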

Recent Trends: Types of Warehousing Applications

A data warehouse is simply an application system that supports an organization's decision-making process. Data warehousing (DW) is the process of collecting and managing data from varied sources to provide meaningful business insights. A data warehouse is typically used to connect and analyze business data from heterogeneous sources. The data warehouse is the core of the BI system, which is built for data analysis and reporting.
Applications of Data Warehousing
Banking
With the perfect Data Warehousing solution, bankers can manage all their
available resources more effectively. They can better analyze their consumer data,
government regulations, and market trends to facilitate better decision-making.
Finance
The application of data warehousing in the financial industry is the same as in the
banking sector. The right solution helps the financing industry analyze customer
expenses that enable them to outline better strategies to maximize profits at both
ends.
Education
The educational sector requires data warehousing to have a comprehensive view of
their students’ and faculty data. It provides educational institutions access to real-
time data feeds to make valued and informed decisions.

Healthcare
Another critical use of data warehouses is in the Healthcare sector. All the clinical,
financial, and employee data are stored in the warehouse, and analysis is run to
derive valuable insights to strategize resources in the best way possible.
Manufacturing & Distribution
With an effective data warehousing solution, organizations in the manufacturing &
distribution sector can organize all their data under one roof and predict market
changes, analyze the latest trends, view development areas, and finally can make
result-driven decisions.
Retailing
Retailers are the mediators between wholesalers and end customers, and that’s why
it is necessary for them to maintain the records of both parties. For helping them
store data in an organized manner, the application of data warehousing comes into
the frame.
Insurance
In the insurance sector, data warehousing is required to maintain existing customers’ records and analyze them to spot client trends and bring more business.
Services
In the services sector, data warehousing is used for maintaining customer details,
financial records, and resources to analyze patterns and boost decision-making for
positive outcomes.

Web Mining
Web mining is the process of using data mining techniques to extract useful patterns, trends, and information from the web, drawing on web-based documents and services, server logs, and hyperlinks. The main goal of web mining is to find patterns in web data by collecting and analyzing data to get important insights.
Web mining can broadly be viewed as the application of adapted data mining methods to the web, whereas data mining is the application of algorithms to find patterns in mostly structured data as part of a knowledge discovery process.
Web mining has the distinctive property of supporting a collection of multiple data types. The web has several aspects that yield multiple approaches for the mining process: web pages contain text, web pages are connected via hyperlinks, and user activity can be monitored via web server logs.

Applications of Web Mining:
1. Web mining helps to improve the power of web search engines by classifying web documents and identifying web pages.
2. It is used for web searching (e.g., Google, Yahoo) and vertical searching (e.g., FatLens, Become).
3. Web mining is used to predict user behavior.
4. Web mining is very useful for improving a particular website and e-services, e.g., landing page optimization.
Web mining can be broadly divided into three different types of mining techniques:

1. Web Content Mining: Web content mining is the process of extracting useful information from the content of web documents. Web content consists of several types of data: text, images, audio, video, etc. Content data is the set of facts from which a web page is built, and it can provide effective and interesting patterns about user needs. Because text documents are handled with text mining, machine learning, and natural language processing techniques, this type of mining is also known as text mining. It scans and mines the text, images, and groups of web pages according to the content of the input.
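As an illustrative sketch of the text side of content mining (the function name, toy page text, and tiny stopword list below are assumptions made up for this example, not part of the notes), a page's most frequent content words can hint at what it is about:

```python
import re
from collections import Counter

def top_terms(page_text, k=3, stopwords=frozenset({"the", "a", "is", "of", "and", "to"})):
    """Tokenize a page's text and return its k most frequent content words."""
    words = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return [term for term, _ in counts.most_common(k)]

page = "Data mining is the analysis of data. Mining of web data extracts patterns."
print(top_terms(page))  # ['data', 'mining', 'analysis']
```

Real content-mining systems replace this frequency count with proper text-mining or NLP models, but the pipeline shape (tokenize, filter, score) is the same.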
2. Web Structure Mining: Web structure mining is the process of discovering structure information from the web. The web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Structure mining essentially produces a structured summary of a particular website and identifies relationships between web pages linked by information or by direct link connections. It can be very useful, for example, for determining the connection between two commercial websites.
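The web-graph idea can be sketched with a simplified PageRank-style score, in which pages that receive many incoming links rank highest; the `pagerank` function and the toy three-page graph below are illustrative assumptions, not material from the notes:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages by their incoming hyperlinks (simplified PageRank)."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Each page keeps a base score and shares the rest along its out-links.
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                new[target] += damping * rank[page] / len(outgoing)
        rank = new
    return rank

# Toy web graph: page -> pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C, the page with the most incoming weight
```

Note this sketch assumes every page has at least one outgoing link; production implementations also handle dangling pages.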
3. Web Usage Mining: Web usage mining is the process of identifying or discovering interesting usage patterns in large sets of web access data; these patterns help us understand user behavior. In web usage mining, users' access data on the web is collected in the form of server logs, so web usage mining is also called log mining.
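A minimal sketch of log mining (the toy log entries and variable names are invented for illustration): group log entries into per-user sessions, then count page-to-page transitions to reveal common navigation paths:

```python
from collections import Counter

# Each entry: (user_id, requested_page), as might be parsed from a server log
log = [
    ("u1", "/home"), ("u1", "/products"), ("u1", "/cart"),
    ("u2", "/home"), ("u2", "/products"),
    ("u3", "/home"), ("u3", "/cart"),
]

# Rebuild each user's session as an ordered list of visited pages
sessions = {}
for user, page in log:
    sessions.setdefault(user, []).append(page)

# Count consecutive page pairs across all sessions
transitions = Counter()
for pages in sessions.values():
    transitions.update(zip(pages, pages[1:]))

print(transitions.most_common(1))  # [(('/home', '/products'), 2)]
```

The most frequent transition ("/home" to "/products" here) is exactly the kind of pattern landing-page optimization builds on.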

Spatial Mining:
Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from spatial databases. In spatial data mining, analysts use geographical or spatial information to produce business intelligence or other results. Challenges in spatial data mining include identifying patterns and finding objects that are relevant to a research project. Spatial refers to space, whereas temporal refers to time.
Spatial data mining refers to the extraction of knowledge, spatial relationships, or
other interesting patterns not specifically stored in spatial databases. Such mining
demands the unification of data mining with spatial database technologies. It can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and non-spatial data, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries.
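One common spatial-mining task is hotspot detection: finding locations where events cluster. A minimal sketch (the `hotspots` helper, grid-binning approach, and toy event coordinates are illustrative assumptions) bins event points into grid cells and flags the dense ones:

```python
from collections import Counter

def hotspots(points, cell=1.0, min_count=3):
    """Bin (x, y) event locations into grid cells; dense cells are hotspots."""
    grid = Counter((int(x // cell), int(y // cell)) for x, y in points)
    return [c for c, n in grid.items() if n >= min_count]

events = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3),   # three events in cell (0, 0)
          (5.1, 5.2), (9.9, 1.1)]                # isolated events elsewhere
print(hotspots(events))  # [(0, 0)]
```

Real systems use spatial indexes and density-based clustering rather than a fixed grid, but the idea of aggregating by location is the same.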

Temporal Mining:
Temporal data mining refers to the extraction of implicit, non-trivial, and potentially useful abstract information from large collections of temporal data. It is concerned with analyzing temporal data and finding temporal patterns and regularities in sets of temporal data. The tasks of temporal data mining include:
 Data Characterization and Comparison
 Cluster Analysis
 Classification
 Association rules
 Prediction and Trend Analysis
 Pattern Analysis
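As a sketch of the trend-analysis task listed above (the function name and monthly figures are invented for illustration), a moving average smooths a time-ordered series so the underlying trend becomes visible:

```python
def moving_average(series, window=3):
    """Smooth a time-ordered series to expose its trend (a basic temporal-mining step)."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

monthly_sales = [10, 12, 11, 15, 18, 17, 21]
trend = moving_average(monthly_sales)
print(trend)  # steadily rising values indicate an upward trend
```

The first smoothed value averages months 1-3 ((10 + 12 + 11) / 3 = 11.0), and each later value is higher, which is the regularity a trend-analysis step would report.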

S.No. | Spatial data mining | Temporal data mining
----- | ------------------- | --------------------
1. | It requires space. | It requires time.
2. | Spatial mining is the extraction of knowledge, spatial relationships, and interesting measures that are not explicitly stored in the spatial database. | Temporal mining is the extraction of knowledge about the occurrence of events, e.g., whether they follow cyclic, random, or seasonal variations.
3. | It deals with spatial (location, geo-referenced) data. | It deals with implicit or explicit temporal content from large quantities of data.
4. | Spatial databases store spatial objects represented by spatial data types and the spatial associations among such objects. | Temporal data mining covers the subject itself as well as its use across changing fields.
5. | It includes finding characteristic rules, discriminant rules, association rules, evaluation rules, etc. | It aims at mining new and unknown knowledge that takes the temporal aspects of data into account.
6. | It is the method of identifying unusual and unexplored but useful patterns from spatial databases. | It deals with extracting useful knowledge from temporal data.
7. | Examples: determining hotspots, unusual locations. | Example: the association rule "any person who buys a car also buys a steering lock" becomes, with the temporal aspect, "any person who buys a car also buys a steering lock after that."
