DM&DW SEE Module 1
6. Discuss the importance of data preprocessing in the context of the data mining process.
1.9.1 Data Integration:
It combines data from multiple sources into a coherent data store, as in data
warehousing.
These sources may include multiple databases, data cubes, or flat files.
A data integration system is formally defined as a triple <G, S, M>, where
G: The global schema
S: The heterogeneous set of source schemas
M: The mapping between queries over the source schemas and the global schema
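The triple above can be illustrated with a minimal sketch. The source names (`crm`, `billing`), the attribute names, and the `integrate` helper are all hypothetical, and the mapping M is simplified here to a plain renaming of source attributes into global attributes rather than a mapping between queries.

```python
# S: hypothetical heterogeneous sources, each with its own attribute names.
SOURCES = {
    "crm": [{"customer_id": 1, "full_name": "Asha"}],
    "billing": [{"cust_number": 1, "annual_revenue": 1200.0}],
}

# G: hypothetical global schema.
GLOBAL_SCHEMA = ("customer_id", "name", "annual_revenue")

# M: simplified mapping, source attribute -> global attribute, per source.
MAPPING = {
    "crm": {"customer_id": "customer_id", "full_name": "name"},
    "billing": {"cust_number": "customer_id", "annual_revenue": "annual_revenue"},
}


def integrate(sources, mapping, global_schema, key="customer_id"):
    """Merge records from all sources into one global-schema record per key."""
    merged = {}
    for source_name, records in sources.items():
        for record in records:
            # Rename the source attributes to their global names.
            translated = {
                mapping[source_name][attr]: value
                for attr, value in record.items()
                if attr in mapping[source_name]
            }
            # Start from an empty global-schema row, then fill in known values.
            row = merged.setdefault(translated[key], dict.fromkeys(global_schema))
            row.update(translated)
    return list(merged.values())


integrated = integrate(SOURCES, MAPPING, GLOBAL_SCHEMA)
print(integrated)
```

Here the two source records describe the same customer under different attribute names, and the mapping lets them be combined into a single row of the global schema.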
1.9.2 Issues in Data Integration:
1. Schema integration and object matching: How can the data analyst or the computer
be sure that customer id in one database and customer number in another refer to the
same attribute?
2. Redundancy: An attribute (such as annual revenue, for instance) may be redundant if it
can be derived from another attribute or set of attributes. Inconsistencies in attribute or
dimension naming can also cause redundancies in the resulting data set.
3. Detection and resolution of data value conflicts: For the same real-world entity,
attribute values from different sources may differ.
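One common way to flag a possibly redundant numeric attribute is to measure how strongly it correlates with another attribute. The sketch below assumes Pearson correlation as the check; the notes above do not prescribe a specific method. The sample data, the `pearson` helper, and the 0.95 threshold are illustrative assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: annual revenue is fully derivable from monthly revenue.
monthly_revenue = [10.0, 20.0, 30.0, 40.0]
annual_revenue = [12 * m for m in monthly_revenue]

r = pearson(monthly_revenue, annual_revenue)
is_redundant = abs(r) > 0.95  # illustrative threshold, not a standard value
print(r, is_redundant)
```

A coefficient close to +1 or -1 suggests that one attribute carries little information beyond the other, so one of them is a candidate for removal.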
1.9.3 Data Transformation:
Smoothing, which works to remove noise from the data. Such techniques include
binning, regression, and clustering.
Aggregation, where summary or aggregation operations are applied to the data. This step is typically used in
constructing a data cube for analysis of the data at multiple granularities.
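Both transformations can be sketched in a few lines. The example assumes equal-frequency bins smoothed by bin means for the smoothing step and a simple roll-up from quarters to years for the aggregation step. The function name and the sample values are illustrative.

```python
from collections import defaultdict


def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into equal-frequency bins, and replace
    every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]

# Aggregation: roll hypothetical quarterly sales up to annual totals.
quarterly_sales = [(2023, "Q1", 100), (2023, "Q2", 150),
                   (2024, "Q1", 120), (2024, "Q2", 80)]
annual_sales = defaultdict(int)
for year, _quarter, amount in quarterly_sales:
    annual_sales[year] += amount
print(dict(annual_sales))  # {2023: 250, 2024: 200}
```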
1.9.4 Data Reduction:
Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume, yet closely maintains the integrity of the original data.
Data cube aggregation, where aggregation operations are applied to the data in
the construction of a data cube.
Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
Dimensionality reduction, where encoding mechanisms are used to reduce the data set
size.
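Attribute subset selection can be illustrated with a deliberately simple heuristic. The sketch assumes that an attribute is irrelevant if it is constant across all rows and redundant if it is an exact copy of an attribute already kept. Real selection methods are more sophisticated, and the `select_attributes` helper and sample rows are hypothetical.

```python
def select_attributes(rows):
    """Keep attributes that vary across rows and are not exact copies
    of an attribute that has already been kept."""
    kept = []
    seen_columns = set()
    for attr in rows[0]:
        column = tuple(row[attr] for row in rows)
        if len(set(column)) <= 1:
            continue  # constant attribute: carries no information
        if column in seen_columns:
            continue  # duplicate of an attribute already kept
        seen_columns.add(column)
        kept.append(attr)
    return [{attr: row[attr] for attr in kept} for row in rows]


rows = [
    {"age": 25, "country": "IN", "age_copy": 25, "income": 30},
    {"age": 40, "country": "IN", "age_copy": 40, "income": 55},
    {"age": 31, "country": "IN", "age_copy": 31, "income": 42},
]
reduced = select_attributes(rows)
print(list(reduced[0]))  # ['age', 'income']
```

The reduced data set keeps the same number of rows but drops the constant `country` attribute and the duplicated `age_copy` attribute.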
7. What are the four main problems of data mining functionality? Explain each one of them.
Data mining functionalities are used to specify the type of patterns that have to be
discovered in data mining tasks. In general, data mining tasks can be classified into two
types: descriptive and predictive.
Association Analysis − It analyses the set of items that generally occur together in a
transactional dataset.
Prediction − It predicts unavailable data values or pending trends. An object's value
can be anticipated based on the attribute values of the object and the attribute
values of the classes.
Clustering − It is similar to classification, but the classes are not predefined. The classes
are represented by data attributes. It is unsupervised learning.
Outlier analysis − Outliers are data elements that cannot be grouped in a given class.
These are the data objects whose behaviour differs from the general behaviour of
other data objects.
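Two of the functionalities above, association analysis and outlier analysis, can be sketched briefly. The example assumes simple pair counting with a minimum support count for association analysis and a z-score rule for outliers. Both are illustrative choices rather than the only methods, and the function names, sample data, and thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def frequent_pairs(transactions, min_support):
    """Count how often each pair of items occurs together and keep the
    pairs that appear in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations
    away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "milk", "eggs"],
]
pairs = frequent_pairs(transactions, min_support=3)
print(pairs)  # {('bread', 'milk'): 3}

outliers = zscore_outliers([10, 11, 9, 10, 12, 10, 11, 50])
print(outliers)  # [50]
```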
8. Identify common challenges associated with integrating a data mining system with a data
warehouse.
A Data Mining system can be integrated with a Database or Data Warehouse system using one of the following coupling schemes:
• No Coupling
• Loose Coupling
• Semi-Tight Coupling
• Tight Coupling
a) No Coupling
No coupling means that a DM system will not utilize any function of a DB or DW system. It may
fetch data from a particular source (such as a file system), process data using some data mining
algorithms, and then store the mining results in another file.
b) Loose Coupling
Loose coupling means that a Data Mining system will use some facilities of a Database or
Data Warehouse system, fetching data from a data repository managed by these systems.
Loose coupling is better than no coupling because it can fetch any portion of the data stored
in Databases or Data Warehouses by using query processing.
c) Semi-Tight Coupling
Semi-tight coupling means that, besides linking a Data Mining system to a Database/Data
Warehouse system, efficient implementations of a few essential data mining primitives are
provided in the Database/Data Warehouse system.
These primitives can include sorting, indexing, aggregation, histogram analysis, multi-way join,
and pre-computation of some essential statistical measures, such as sum, count, max, min, and
standard deviation.
d) Tight coupling
Tight coupling means that a Data Mining system is smoothly integrated into the
Database/Data Warehouse system.
The data mining subsystem is treated as one functional component of the information
system.
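The difference between loose and semi-tight coupling can be sketched with Python's built-in `sqlite3` module standing in for the Database/Data Warehouse system. The `sales` table and its contents are hypothetical. In the loose variant the database only selects the data and the mining code does the computation, while in the semi-tight variant the statistical primitives (count, sum, min, max) are computed inside the database.

```python
import sqlite3

# An in-memory SQLite database stands in for the DB/DW system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("north", 300.0), ("south", 50.0)],
)

# Loose coupling: use query processing to fetch a portion of the data,
# then compute everything on the data mining side.
north_amounts = [
    row[0]
    for row in conn.execute(
        "SELECT amount FROM sales WHERE region = ?", ("north",)
    )
]
loose_mean = sum(north_amounts) / len(north_amounts)

# Semi-tight coupling: let the database compute the statistical
# primitives and return only the summary values.
count, total, smallest, largest = conn.execute(
    "SELECT COUNT(*), SUM(amount), MIN(amount), MAX(amount) "
    "FROM sales WHERE region = ?",
    ("north",),
).fetchone()
semi_tight_mean = total / count

conn.close()
print(loose_mean, semi_tight_mean, smallest, largest)
```

Both variants give the same mean, but the semi-tight variant transfers only four summary values instead of every matching row.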
Disease Prediction and Prevention: Analyzing patient data to predict diseases, allowing for
early intervention and prevention strategies.
Drug Discovery: Analyzing biological data to identify potential drug compounds and accelerate
drug discovery processes.
Healthcare Fraud Detection: Identifying fraudulent claims and activities in healthcare
insurance and billing.