DM&DW SEE Module 1


4. Explain the major issues in data mining

i. Mining Methodology and User Interaction Issues
These include the following kinds of issues:
• Mining different kinds of knowledge in databases − Different users may be interested in
different kinds of knowledge. Therefore, data mining should cover a broad range of
knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing
and refining data mining requests based on the returned results.
• Incorporation of background knowledge − Background knowledge can be used to guide the
discovery process and to express the discovered patterns concisely and at multiple levels
of abstraction.
• Data mining query languages and ad hoc data mining − A data mining query language that
allows the user to describe ad hoc mining tasks should be integrated with a data warehouse
query language.
• Presentation and visualization of data mining results − Once patterns are discovered, they
need to be expressed in high-level languages and visual representations. These representations
should be easily understandable.
• Handling noisy or incomplete data − Data cleaning methods are required to handle noise
and incomplete objects while mining data regularities. Without such methods, the accuracy
of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered should be interesting; patterns that merely
represent common knowledge or lack novelty are of little value.
ii. Performance Issues –
• Efficiency and scalability of data mining algorithms − To effectively extract
information from the huge amounts of data in databases, data mining algorithms must be
efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate
the development of such algorithms. They divide the data into partitions, which are processed
in parallel, and the results from the partitions are then merged.
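As a rough illustration of the partition-then-merge idea above (a sketch for illustration, not a specific algorithm from the text), the snippet below counts item frequencies per partition and merges the partial counts:

```python
from collections import Counter

def count_items(partition):
    # Mine one partition independently; here the "pattern" is item frequency.
    return Counter(item for transaction in partition for item in transaction)

def partitioned_count(transactions, n_partitions=2):
    # Divide the data into partitions, process each one (in a real system,
    # in parallel on separate nodes), then merge the partial results.
    size = max(1, len(transactions) // n_partitions)
    partitions = [transactions[i:i + size]
                  for i in range(0, len(transactions), size)]
    merged = Counter()
    for partial in map(count_items, partitions):
        merged += partial
    return merged
```

Because the partial counts merge by simple addition, new data can also be mined incrementally and folded into the existing result.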
iii. Diverse Data Types Issues
• Handling of relational and complex types of data − The database may contain complex data
objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one
system to mine all these kinds of data.
• Mining information from heterogeneous databases − Data is available at different data
sources on a LAN or WAN. These data sources may be structured, semi-structured, or
unstructured.
5. Define structured and unstructured data in the context of data mining
i. Structured Data:
 Structured data refers to data that is organized and formatted in a predefined
manner, typically residing in databases or structured files such as spreadsheets.
Characteristics of structured data include:
 Organized into rows and columns, with each column representing a specific
attribute or variable.
 Conforms to a fixed schema, specifying the data types and relationships between
different attributes.
 Examples include relational databases, CSV files, Excel spreadsheets, and
structured XML or JSON documents.
 Structured data is well-suited for traditional data mining techniques and relational
database systems.
ii. Unstructured Data:
 Unstructured data refers to data that lacks a predefined structure or organization,
making it more challenging to analyze using traditional methods.
 Lack of a fixed schema, with data often stored in formats such as text documents,
emails, images, videos, audio recordings, social media posts, and web pages.
 May contain a wide variety of information, including text, multimedia content, and
semi-structured data.
 Often contains valuable insights and hidden patterns but requires specialized
techniques to extract and analyze.
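The contrast can be seen in miniature below: the structured CSV rows parse directly into a fixed schema, while the free text must first be tokenized before any analysis (the data values are invented for illustration):

```python
import csv
import io
import re

# Structured: fixed schema, every row has the same named columns.
structured = io.StringIO("id,name,age\n1,Alice,30\n2,Bob,25\n")
rows = list(csv.DictReader(structured))

# Unstructured: free text with no schema; structure must be extracted,
# here by a crude word tokenization.
unstructured = "Alice, aged 30, wrote to Bob about the quarterly report."
tokens = re.findall(r"[A-Za-z0-9]+", unstructured)
```

The structured rows can be queried by attribute name immediately (`rows[0]["name"]`), whereas the token list still needs specialized techniques to recover entities and relationships.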

6.Discuss the importance of data preprocessing in the context of the data mining process.
1.9.1 Data Integration:
 It combines data from multiple sources into a coherent data store, as in data
warehousing.
 These sources may include multiple databases, data cubes, or flat files.
 Data integration systems are formally defined as a triple <G, S, M>, where
G: the global schema
S: the heterogeneous source schemas
M: the mapping between queries over the source and global schemas
1.9.2 Issues in Data Integration:
1. Schema integration and object matching: How can the data analyst or the computer
be sure that customer id in one database and customer number in another refer to the
same attribute?
2. Redundancy: An attribute (such as annual revenue, for instance) may be redundant if it
can be derived from another attribute or set of attributes. Inconsistencies in attribute or
dimension naming can also cause redundancies in the resulting data set.
3. Detection and resolution of data value conflicts: For the same real-world entity,
attribute values from different sources may differ.
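Redundancy between numeric attributes (issue 2 above) is commonly detected with correlation analysis. The sketch below computes the Pearson correlation coefficient, using invented revenue figures for illustration:

```python
def pearson_corr(xs, ys):
    # Pearson correlation between two numeric attributes: values near
    # +1 or -1 suggest one attribute may be derivable from the other.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# A derivable attribute (annual revenue = 12 * monthly revenue) is
# perfectly correlated with its source and therefore redundant.
monthly = [10.0, 20.0, 30.0, 40.0]
annual = [12 * m for m in monthly]
```

A correlation close to 1 (as here) flags the attribute pair for review; the analyst can then drop the derivable attribute from the integrated data set.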
1.9.3 Data Transformation:
Smoothing, which works to remove noise from the data. Such techniques include
binning, regression, and clustering.
Aggregation, where summary or aggregation operations are applied to the data. This step is
typically used in constructing a data cube for analysis of the data at multiple granularities.
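Smoothing by bin means can be sketched as follows (a minimal illustration with invented values, assuming equal-frequency bins):

```python
def smooth_by_bin_means(values, bin_size):
    # Equal-frequency binning: sort the values, group them into bins of
    # bin_size, and replace every value in a bin by the bin's mean.
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_ = ordered[i:i + bin_size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed
```

Replacing each value by its bin mean dampens small random fluctuations (noise) while preserving the overall distribution of the attribute.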
1.9.4 Data Reduction:
Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume, yet closely maintains the integrity of the original data.
Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
Dimensionality reduction, where encoding mechanisms are used to reduce the data set
size.
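One crude heuristic for attribute subset selection (a sketch for illustration, not the method prescribed by the text) is to flag attributes with near-zero variance, since a nearly constant column cannot discriminate between objects. The example assumes a table stored as a dict of numeric columns:

```python
def low_variance_attributes(table, threshold=0.0):
    # Flag attributes whose variance is at or below the threshold as
    # candidates for removal: a constant column carries no information.
    flagged = []
    for name, column in table.items():
        n = len(column)
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / n
        if variance <= threshold:
            flagged.append(name)
    return flagged
```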

7. What are the four main problems of data mining functionality? Explain each one of them
Data mining functionalities are used to represent the type of patterns that have to be
discovered in data mining tasks. In general, data mining tasks can be classified into two
categories: descriptive and predictive.

Data characterization − It summarizes the general characteristics of a target class of data.
The data corresponding to the user-specified class is typically collected by a database
query, and the output can be presented in multiple forms.

Data discrimination − It is a comparison of the general characteristics of target-class data
objects with the general characteristics of objects from one or a set of contrasting
classes.

Association Analysis − It analyzes the sets of items that frequently occur together in a
transactional dataset.
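A minimal sketch of the idea (support counting over item pairs, with an invented market-basket dataset; full algorithms such as Apriori generalize this to larger itemsets):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    # Count how often each pair of items co-occurs in a transaction and
    # keep the pairs whose support (count) meets the threshold.
    counts = Counter()
    for transaction in transactions:
        for pair in combinations(sorted(set(transaction)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}
```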

Prediction − It predicts missing or unavailable data values or pending trends. An object's
value can be anticipated based on the attribute values of the object and the attribute
values of the classes.

Clustering − It is similar to classification, but the classes are not predefined; they are
derived from the data attributes. It is a form of unsupervised learning.
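A tiny one-dimensional k-means sketch illustrates how clusters emerge from the data rather than from predefined classes (the initial centers and values are invented for illustration):

```python
def kmeans_1d(values, centers, iterations=10):
    # One-dimensional k-means: repeatedly assign each point to its
    # nearest center, then move each center to the mean of its points.
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```

No class labels are supplied: the two groups below are discovered purely from the attribute values, which is what makes the method unsupervised.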

Outlier analysis − Outliers are data elements that cannot be grouped into a given class.
These are data objects whose behaviour deviates from the general behaviour of the other
data objects.
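A simple statistical approach to outlier analysis flags values far from the mean in standard-deviation units (z-score); the threshold and the data below are invented for illustration:

```python
def zscore_outliers(values, threshold=2.0):
    # Flag values whose distance from the mean exceeds `threshold`
    # standard deviations; such objects deviate from general behaviour.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]
```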

8. Identify common challenges associated with integrating a data mining system with a data
warehouse.
A data mining system can be integrated with a database or data warehouse system using one
of the following coupling schemes:
• No Coupling
• Loose Coupling
• Semi-Tight Coupling
• Tight Coupling

a) No Coupling
No coupling means that a DM system will not utilize any function of a DB or DW system. It may
fetch data from a particular source (such as a file system), process data using some data mining
algorithms, and then store the mining results in another file.
b) Loose Coupling
 Loose coupling means that a Data Mining system will use some facilities of a Database or
Data Warehouse system, fetching data from a data repository managed by these systems.
 Loose coupling is better than no coupling because it can fetch any portion of data stored
in Databases or Data Warehouses by using query processing.
c) Semi-Tight Coupling
 Semi-tight coupling means that, besides linking a Data Mining system to a Database/Data
Warehouse system, efficient implementations of a few essential data mining primitives are
provided in the DB/DW system.
 These primitives can include sorting, indexing, aggregation, histogram analysis, multi-way
join, and pre-computation of some essential statistical measures, such as sum, count, max,
min, and standard deviation.
d) Tight coupling
 Tight coupling means that a Data Mining system is smoothly integrated into the
Database/Data Warehouse system.
 The data mining subsystem is treated as one functional component of the information
system.

9.Discuss how data mining is applied in healthcare settings.

 Disease Prediction and Prevention: Analyzing patient data to predict diseases, allowing for
early intervention and prevention strategies.
 Drug Discovery: Analyzing biological data to identify potential drug compounds and accelerate
drug discovery processes.
 Healthcare Fraud Detection: Identifying fraudulent claims and activities in healthcare
insurance and billing.

10.Describe the role of data mining in detecting financial fraud.


 Fraud Detection: Identifying unusual patterns in transactions to detect credit card fraud, identity
theft, etc.
 Credit Scoring: Assessing the creditworthiness of applicants based on historical financial data.
 Algorithmic Trading: Analyzing historical data to develop trading strategies and predict market
trends.
