DMWH M1
Data mining
We live in a world where vast amounts of data are collected daily, and analyzing such
data is an important need. “We are living in the information age” is a popular saying;
however, we are actually living in the data age. Terabytes or petabytes of data pour into
our computer networks, the World Wide Web (WWW), and various data storage devices
every day from business, society, science and engineering, medicine, and almost every
other aspect of daily life.
Many people treat data mining as a synonym for another popularly used term,
knowledge discovery from data, or KDD, while others view data mining as merely an
essential step in the process of knowledge discovery. Although the terms knowledge
discovery in databases (KDD) and data mining are often used interchangeably, over the
last few years KDD has come to refer to a process consisting of many steps, of which
data mining is only one.
Knowledge discovery in databases (KDD) is the process of finding useful
information and patterns in data. Data mining is the use of algorithms to extract the
information and patterns derived by the KDD process. The knowledge discovery
process consists of an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are
prepared for mining. The data mining step may interact with the user or a knowledge
base. The interesting patterns are presented to the user and may be stored as new
knowledge in the knowledge base. The preceding view shows data mining as one step
in the knowledge discovery process, albeit an essential one because it uncovers hidden
patterns for evaluation.
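As a concrete illustration, the following is a minimal sketch of these steps in Python on toy transaction data; the record layout, the pair-counting “mining” step, and the support threshold are all illustrative assumptions, not a prescribed implementation.

```python
# A minimal sketch of the KDD steps on toy transaction data.
from collections import Counter
from itertools import combinations

# Raw transactions: (customer, item, amount); None marks a dirty record.
raw = [
    ("c1", "milk", 3.0), ("c1", "bread", 2.0), ("c2", "milk", None),
    ("c2", "milk", 3.0), ("c2", "beer", 6.0), ("c3", "bread", 2.0),
]

# 1-2. Data cleaning / integration: drop records with missing values.
clean = [r for r in raw if None not in r]

# 3. Data selection: keep only the fields relevant to the analysis task.
baskets = {}
for cust, item, _ in clean:
    baskets.setdefault(cust, set()).add(item)

# 4. Data transformation: consolidate into per-customer item sets.
itemsets = list(baskets.values())

# 5. Data mining: count co-occurring item pairs (a toy frequent-pattern step).
pair_counts = Counter()
for s in itemsets:
    for pair in combinations(sorted(s), 2):
        pair_counts[pair] += 1

# 6. Pattern evaluation: keep pairs meeting a minimum support threshold.
min_support = 1
patterns = {p: c for p, c in pair_counts.items() if c >= min_support}

# 7. Knowledge presentation: report the surviving patterns.
for pair, count in patterns.items():
    print(pair, "support =", count)
```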
The architecture of a typical data mining system has the following major components:
1. A database, data warehouse, World Wide Web, or other information repository, which
holds the data to be mined; data cleaning and integration techniques may be performed
on these data.
2. A database or data warehouse server, which fetches the relevant data based on users’
data mining requests.
3. A knowledge base that contains the domain knowledge used to guide the search or to
evaluate the interestingness of resulting patterns. For example, the knowledge base may
contain metadata which describes data from multiple heterogeneous sources.
4. A data mining engine, which consists of a set of functional modules for tasks such as
characterization, association, classification, cluster analysis, and evolution and deviation
analysis.
5. A pattern evaluation module that works in tandem with the data mining modules by
employing interestingness measures to help focus the search towards interesting
patterns.
6. A graphical user interface that allows the user to interact with the data mining
system.
Data mining models
1. Data characterization is a summarization of the general characteristics or features of
a target class of data. For example, to study the characteristics of software products
with sales that increased by 10% in the previous year, the data related to such products
can be collected by executing an SQL query on the sales database.
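As a hedged sketch of that collection step, the snippet below runs such a query against a hypothetical sales(product, year, growth_pct) table in an in-memory SQLite database; the table and column names are assumptions for illustration only.

```python
# A minimal sketch of collecting task-relevant data with an SQL query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, year INTEGER, growth_pct REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("editorX", 2023, 12.5), ("dbPro", 2023, 4.0), ("mailFast", 2023, 11.0)],
)

# Retrieve software products whose sales grew by at least 10% last year.
rows = conn.execute(
    "SELECT product, growth_pct FROM sales WHERE year = 2023 AND growth_pct >= 10"
).fetchall()
print(rows)  # [('editorX', 12.5), ('mailFast', 11.0)]
```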
3. Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts. The model is derived based on the analysis of a
set of training data (i.e., data objects for which the class labels are known). The model is
used to predict the class label of objects for which the class label is unknown.
A decision tree is a flowchart-like tree structure, where each node denotes a test
on an attribute value, each branch represents an outcome of the test, and tree leaves
represent classes or class distributions.
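A minimal sketch of learning a decision-tree classifier from labeled training data, using scikit-learn (an assumed library choice; the text does not prescribe one); the attributes age and income and the class labels are illustrative.

```python
# Learn a small decision tree and use it to label an unseen object.
from sklearn.tree import DecisionTreeClassifier, export_text

# Training data: [age, income] -> class label ("buys" / "ignores").
X_train = [[25, 30], [40, 80], [35, 60], [22, 20], [50, 90]]
y_train = ["ignores", "buys", "buys", "ignores", "buys"]

tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# Each internal node tests an attribute; leaves carry class labels.
print(export_text(tree, feature_names=["age", "income"]))

# Predict the class label of an object whose label is unknown.
print(tree.predict([[30, 70]]))
```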
4. Clustering analyzes data objects without consulting class labels. In many cases,
class-labeled data may simply not exist at the beginning. Clustering can be used to
generate class labels for a group of data. The objects are clustered or grouped based on
the principle of maximizing the intraclass similarity and minimizing the interclass
similarity. That is, clusters of objects are formed so that objects within a cluster have
high similarity in comparison to one another, but are rather dissimilar to objects in other
clusters. Each cluster so formed can be viewed as a class of objects, from which rules can
be derived. Clustering can also facilitate taxonomy formation, that is, the organization
of observations into a hierarchy of classes that group similar events together.
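A minimal sketch of clustering unlabeled objects, here with k-means from scikit-learn (an assumed algorithm and library); the points and k = 2 are illustrative.

```python
# Group unlabeled 2-D objects so that intra-cluster similarity is high.
from sklearn.cluster import KMeans

X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The generated cluster labels can serve as class labels for later steps.
print(km.labels_)           # e.g., [0 0 0 1 1 1]
print(km.cluster_centers_)  # one centroid per cluster
```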
6. Outlier Analysis
A data set may contain objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Many data mining methods discard
outliers as noise or exceptions. However, in some applications (e.g., fraud detection) the
rare events can be more interesting than the more regularly occurring ones. The analysis
of outlier data is referred to as outlier analysis or anomaly mining.
Outlier analysis may uncover fraudulent usage of credit cards by detecting
purchases of unusually large amounts for a given account number in comparison to
regular charges incurred by the same account. Outlier values may also be detected with
respect to the locations and types of purchase, or the purchase frequency.
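A minimal sketch of this idea: flag a new charge as an outlier when it lies far above the account's regular charges. The 3-sigma rule and the amounts are illustrative assumptions, not a production fraud model.

```python
# Flag charges that deviate strongly from an account's usual behavior.
from statistics import mean, stdev

# Historical regular charges for the account (assumed baseline).
regular = [23.5, 41.0, 18.9, 35.2, 27.8, 30.1, 25.4]
mu, sigma = mean(regular), stdev(regular)

# New purchases: flag any amount far above the regular charges.
new_charges = [29.0, 950.0]
outliers = [c for c in new_charges if c > mu + 3 * sigma]
print(outliers)  # [950.0]
```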
Statistics: Studies the collection, analysis, interpretation, and presentation of data.
Statistical models are widely used to model data and data classes, and data mining has
an inherent connection with statistics.
Machine learning: Investigates how computers can learn (or improve their
performance) based on data. A main research area is for computer programs to
automatically learn to recognize complex patterns and make intelligent decisions based
on data. For example, a typical machine learning problem is to program a computer so
that it can automatically recognize handwritten postal codes on mail after learning from
a set of examples. Machine learning is a fast-growing discipline. Here, we illustrate
classic problems in machine learning that are highly related to data mining.
Active learning is a machine learning approach that lets users play an active role in the
learning process. An active learning approach can ask a user (e.g., a domain expert) to
label an example, which may be from a set of unlabeled examples or synthesized by the
learning program. The goal is to optimize the model quality by actively acquiring
knowledge from human users, given a constraint on how many examples they can be
asked to label.
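A minimal sketch of pool-based active learning with uncertainty sampling, using scikit-learn (an assumption); the synthetic dataset, logistic-regression learner, and query budget are illustrative, and the known labels stand in for the human expert's answers.

```python
# Repeatedly query the example the current model is least certain about.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
labeled = list(range(10))    # indices the "domain expert" has labeled
pool = list(range(10, 200))  # unlabeled pool
budget = 20                  # constraint on how many labels may be requested

for _ in range(budget):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    # Ask the user to label the example with the lowest predicted confidence.
    proba = model.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)
    query = pool.pop(int(np.argmax(uncertainty)))
    labeled.append(query)    # y[query] stands in for the expert's answer

print("accuracy on remaining pool:", model.score(X[pool], y[pool]))
```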
3. Performance issues
Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data
mining algorithms must be efficient and scalable. The running time of a data mining
algorithm must be predictable and acceptable in large databases.
Parallel, distributed, and incremental mining algorithms:
The huge size of many databases, the wide distribution of data, and the
computational complexity of some data mining methods are factors motivating the
development of parallel and distributed data mining algorithms. Such algorithms divide
the data into partitions, which are processed in parallel. The results from the partitions
are then merged.
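A minimal sketch of this partition-and-merge pattern using Python's multiprocessing; the "mining" step is reduced to simple item-frequency counting, and the two-way partitioning is an illustrative assumption.

```python
# Mine data partitions in parallel processes, then merge partial results.
from collections import Counter
from multiprocessing import Pool

def mine_partition(partition):
    """Mine one data partition independently (here: item frequency counts)."""
    return Counter(partition)

if __name__ == "__main__":
    data = ["milk", "bread", "milk", "beer", "bread", "milk", "beer", "milk"]
    partitions = [data[0:4], data[4:8]]  # divide the data into partitions

    with Pool(processes=2) as pool:
        partials = pool.map(mine_partition, partitions)  # process in parallel

    merged = sum(partials, Counter())    # merge the partition results
    print(merged.most_common())
```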
4. Issues relating to the diversity of database types:
Handling of relational and complex types of data:
Because relational databases and data warehouses are widely used, the development of
efficient and effective data mining systems for such data is important. However, other
databases may contain complex data objects, hypertext and multimedia data, spatial
data, temporal data, or transaction data. Specific data mining systems should be
constructed for mining specific kinds of data.
Mining information from heterogeneous databases and global information
systems:
Local- and wide-area computer networks (such as the Internet) connect many
sources of data, forming huge, distributed, and heterogeneous databases. The
discovery of knowledge from different sources of structured, semi-structured, or
unstructured data with diverse data semantics poses great challenges to data
mining.
Data Warehouse
Data warehousing provides architectures and tools for business executives to
systematically organize, understand, and use their data to make strategic decisions.
A data warehouse refers to a database that is maintained separately from an
organization’s operational databases. “A data warehouse is a subject-oriented,
integrated, time-variant, and non-volatile collection of data in support of management’s
decision-making process.”
A data warehouse focuses on the modelling and analysis of data for decision
makers (not on day-to-day transactions). It provides a simple and concise view around
particular subject issues by excluding data that are not useful in the decision support
process.
Tier-1:
The bottom tier is a warehouse database server that is almost always a relational
database system.
Tier-2:
The middle tier is an OLAP server that is typically implemented using either a
relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations
on multidimensional data to standard relational operations) or a multidimensional OLAP
(MOLAP) model (i.e., a special-purpose server that directly implements multidimensional
data and operations).
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the
entire organization. It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is cross-functional in scope. It
typically contains detailed data as well as summarized data, and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may
be implemented on traditional mainframes, computer super servers, or parallel architecture
platforms. It requires extensive business modeling and may take years to design and build.
2. Data mart:
A data mart contains a subset of corporate-wide data that is of value to a specific
group of users. The scope is confined to specific selected subjects. For example, a marketing
data mart may confine its subjects to customer, item, and sales. The data contained in data
marts tend to be summarized. Data marts are usually implemented on low-cost departmental
servers that are UNIX/LINUX- or Windows-based. The implementation cycle of a data mart
is more likely to be measured in weeks rather than months or years. However, it may involve
complex integration in the long run if its design and planning were not enterprise-wide.
3. Virtual warehouse: A virtual warehouse is a set of views over operational databases.
For efficient query processing, only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database
servers.
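A minimal sketch of a virtual warehouse as views over an operational database, using SQLite; the orders table is an illustrative assumption. Note that the view copies no data, which is why heavy analytical queries demand excess capacity on the operational server.

```python
# Define a summary view directly over operational data (no data copied).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("east", 10.0), ("east", 5.0), ("west", 7.5)])

# Queries against the view run on (and load) the operational server.
conn.execute("""
CREATE VIEW sales_by_region AS
SELECT region, SUM(amount) AS total FROM orders GROUP BY region
""")
print(conn.execute("SELECT * FROM sales_by_region").fetchall())
```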
Consolidation involves the aggregation of data that can be accumulated and computed
in one or more dimensions.
For example, all sales offices are rolled up to the sales department or sales division to
anticipate sales trends.
The drill-down is a technique that allows users to navigate through the details. For
instance, users can view the sales by individual products that make up a region’s sales.
Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the
OLAP cube and view (dicing) the slices from different viewpoints.
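A minimal sketch of these three operations on a tiny sales table using pandas (an assumed tool); the division/office/product columns are illustrative.

```python
# Roll-up, drill-down, and slice/dice on a toy sales cube.
import pandas as pd

sales = pd.DataFrame({
    "division": ["North", "North", "South", "South"],
    "office":   ["N1", "N2", "S1", "S2"],
    "product":  ["tv", "tv", "tv", "radio"],
    "amount":   [100, 150, 120, 80],
})

# Roll-up (consolidation): aggregate offices up to the division level.
print(sales.groupby("division")["amount"].sum())

# Drill-down: navigate to the finer office-by-product detail.
print(sales.groupby(["division", "office", "product"])["amount"].sum())

# Slice: fix one dimension (product == "tv"); dice: view it by division.
tv_slice = sales[sales["product"] == "tv"]
print(tv_slice.pivot_table(index="division", values="amount", aggfunc="sum"))
```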
Types of OLAP:
OLAP servers may be implemented as relational OLAP (ROLAP), multidimensional
OLAP (MOLAP), or hybrid OLAP (HOLAP), which combines both approaches.
For example, AllElectronics shop may create a sales data warehouse in order to
keep records of the store’s sales with respect to the dimensions time, item, branch, and
location. These dimensions allow the store to keep track of things like monthly sales of
items and the branches and locations at which the items were sold. Each dimension may
have a table associated with it, called a dimension table, which further describes the
dimension. For example, a dimension table for item may contain the attributes item
name, brand, and type. Dimension tables can be specified by users or experts, or
automatically generated and adjusted based on data distributions.
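A minimal sketch of such a dimensional design in SQLite, with one fact table referencing a dimension table per dimension (a star-style schema, as discussed next); the exact columns are illustrative assumptions.

```python
# One fact table plus denormalized dimension tables (star-style schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- The fact table holds measures plus one foreign key per dimension.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES dim_time,
    item_key     INTEGER REFERENCES dim_item,
    branch_key   INTEGER REFERENCES dim_branch,
    location_key INTEGER REFERENCES dim_location,
    units_sold   INTEGER,
    dollars_sold REAL
);
""")
print("star schema created")
```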
Snowflake schema: The snowflake schema is a variant of the star schema model, where
some dimension tables are normalized, thereby further splitting the data into additional
tables. The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancies. Such a table is easy to maintain and saves storage space. However,
this saving of space is negligible in comparison to the typical magnitude of the fact
table. Furthermore, the snowflake structure can reduce the effectiveness of browsing,
since more joins will be needed to execute a query. Consequently, the system
performance may be adversely impacted. Hence, although the snowflake schema
reduces redundancy, it is not as popular as the star schema in data warehouse design.
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence
is called a galaxy schema or a fact constellation.
1. The data warehouse market supports such diverse industries as manufacturing,
retail, telecommunications, and health care. Think of a personnel database for a
company that is continually modified as personnel are added and deleted. If
management wishes to determine whether there is a problem with too many employees
quitting, they would need to know which employees have left, when they left,
why they left, and other information about their employment. For management
to make these types of high-level business analyses, more historical data, not just
the current snapshot, are required.
12. Designing the Data Warehouse – People generally don’t want to “waste” their
time defining the requirements necessary for proper data warehouse design.
Usually, there is a high-level perception of what they want out of a data
warehouse. However, they don’t fully understand all the implications of these
perceptions and, therefore, have a difficult time adequately defining them. This
results in miscommunication between the business users and the technicians
building the data warehouse. The typical end result is a data warehouse that
does not deliver the results expected by the user. Since the data warehouse is
inadequate for the end user, there is a need for fixes and improvements
immediately after initial delivery.
Applications of DWH
Banking Industry
Finance Industry
Healthcare
All of their financial, clinical, and employee records are fed to warehouses,
which helps them to strategize and predict outcomes, track and analyze
their service feedback, generate patient reports, and share data with tie-in
insurance companies, medical aid services, etc.
Hospitality Industry
Insurance
They also use warehouses for product shipment records and records of
product portfolios, to identify profitable product lines, and to analyze
previous data and customer feedback to evaluate weaker product lines and
eliminate them.
The Retailers
They also analyze sales to determine fast-selling and slow-selling product
lines and determine their shelf space through a process of elimination.
Telephone Industry
The telephone industry operates over both offline and online data, which
burdens it with a lot of historical data that must be consolidated and
integrated.
Transportation Industry