DM Sem U-1

UNIT-1

1Q-Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.

2Q----Data Mining Techniques:


1. Classification:
This analysis is used to retrieve important and relevant information about data and metadata. This data mining method helps to classify data into different classes.

2. Clustering:
Clustering analysis is a data mining technique used to identify data items that are similar to each other. This process helps to understand the differences and similarities between the data.

3. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship between variables. It is used to predict the likely value of one variable, given the values of the other variables.

4. Association Rules:
This data mining technique helps to find the association between two or more items. It discovers hidden patterns in the data set.

5. Outlier detection:
This type of data mining technique refers to the observation of data items in the dataset which do not match an expected pattern or expected behavior. This technique can be used in a variety of domains, such as intrusion detection, fraud detection or fault detection, etc. Outlier detection is also called Outlier Analysis or Outlier Mining.

6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in transaction data over a certain period.

7. Prediction:
Prediction uses a combination of the other data mining techniques such as trend analysis, sequential patterns, clustering and classification. It analyzes past events or instances in the right sequence to predict a future event.

8. Decision tree
Decision tree is one of the most commonly used data mining techniques because its model is easy for users to understand. In a decision tree you start with a simple question which has two or more answers. Each answer leads to a further two or more questions which help us to make a final decision. The root node of the decision tree is a simple question.

3Q----Challenges of Implementing Data Mining:

-Skilled Experts are needed to formulate the data mining queries.

-Overfitting: Due to a small training database, a model may not fit future states.

-Data mining needs large databases which sometimes are difficult to manage

-Business practices may need to be modified to make use of the information uncovered.

-If the data set is not diverse, data mining results may not be accurate.

-Integrating information from heterogeneous databases and global information systems can be complex.

-Poor quality of data collection is one of the most common challenges in data mining.

-Proliferation of security and privacy concerns.

4Q----Data Mining Implementation Process

-Business understanding:

In this phase, business and data-mining goals are established.


First, you need to understand the business and client objectives. You need to define what your client wants (which many times even they do not know themselves). Take stock of the current data mining scenario. Factor resources, assumptions, constraints, and other significant factors into your assessment. Using the business objectives and the current scenario, define your data mining goals. A good data mining plan is very detailed and should be developed to accomplish both business and data mining goals.

-Data understanding:

In this phase, a sanity check on the data is performed to check whether it is appropriate for the data mining goals. First, data is collected from the multiple data sources available in the organization. These data sources may include multiple databases, flat files or data cubes. Issues like object matching and schema integration can arise during the data integration process. It is a quite complex and tricky process, as data from various sources is unlikely to match easily. For example, table A contains an entity named cust_no whereas another table B contains an entity named cust-id. Therefore, it is quite difficult to ensure whether both of these objects refer to the same value or not. Here, metadata should be used to reduce errors in the data integration process. The next step is to explore the properties of the acquired data. A good way to explore the data is to answer the data mining questions (decided in the business phase) using query, reporting, and visualization tools. Based on the results of the queries, the data quality should be ascertained. Missing data, if any, should be acquired.

-Data preparation:

In this phase, data is made production ready. The data preparation process consumes about 90% of the time of the project. The data from different sources should be selected, cleaned, transformed, formatted, anonymized, and constructed (if required). Data cleaning is a process to "clean" the data by smoothing noisy data and filling in missing values.
For example, for a customer demographics profile, age data may be missing. The data is incomplete and should be filled in. In some cases, there could be data outliers; for instance, age has a value of 300. Data could also be inconsistent; for instance, the name of the customer is different in different tables.

-Data transformation:

Data transformation operations change the data to make it useful in data mining and contribute toward the success of the mining process. The following transformations can be applied:

Smoothing: It helps to remove noise from the data.

Aggregation: Summary or aggregation operations are applied to the data. For example, the weekly sales data is aggregated to calculate the monthly and yearly totals.

Generalization: In this step, low-level data is replaced by higher-level concepts with the help of concept hierarchies. For example, the city is replaced by the country.

Normalization: Normalization is performed when the attribute data are scaled up or scaled down. Example: data should fall in the range -2.0 to 2.0 post-normalization.

Attribute construction: New attributes are constructed and added to the given set of attributes to help data mining. The result of this process is a final data set that can be used in modeling.

-Modelling:

In this phase, mathematical models are used to determine data patterns. Based on the business objectives, suitable modeling techniques should be selected for the prepared dataset. Create a scenario to test the quality and validity of the model. Run the model on the prepared dataset. Results should be assessed by all stakeholders to make sure that the model can meet the data mining objectives.

-Evaluation:

In this phase, the patterns identified are evaluated against the business objectives. Results generated by the data mining model should be evaluated against the business objectives. Gaining business understanding is an iterative process; in fact, while understanding, new business requirements may be raised because of data mining.
A go or no-go decision is taken to move the model into the deployment phase.
-Deployment:

In the deployment phase, you ship your data mining discoveries to everyday business operations. The knowledge or information discovered during the data mining process should be made easy to understand for non-technical stakeholders. A detailed deployment plan for shipping, maintenance, and monitoring of the data mining discoveries is created. A final project report is created with lessons learned and key experiences during the project. This helps to improve the organization's business policy.

5Q----Data Mining Functionalities/Tasks:

Data mining functionalities are used to specify the kind of patterns to be found in
data mining tasks.Data mining tasks can be classified into two categories:
descriptive and predictive.

-Descriptive mining tasks characterize the general properties of the data in the
database.

-Predictive mining tasks perform inference on the current data in order to make
predictions.

-Concept/Class Description: Characterization and Discrimination

Data can be associated with classes or concepts. For example, in the Electronics
store, classes of items for sale include computers and printers, and concepts of
customers include bigSpenders and budgetSpenders.

-Data characterization:

Data characterization is a summarization of the general characteristics or features of a target class of data.

-Data discrimination:

Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
-Mining Frequent Patterns, Associations, and Correlations:

Frequent patterns are patterns that occur frequently in data. There are many kinds of frequent patterns, including itemsets, subsequences, and substructures.

-Association analysis:

Suppose, as a marketing manager, you would like to determine which items are
frequently purchased together within the same transactions.

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
where X is a variable representing a customer. Confidence = 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.

Support = 1% means that 1% of all of the transactions under analysis showed that computer and software were purchased together.
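As a rough illustrative sketch (the transactions and items below are invented, not taken from these notes), support and confidence for such a rule can be computed directly from a list of market-basket transactions in Python:

# Toy illustration of support and confidence for the rule {computer} -> {software}.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"software"},
    {"computer", "software", "printer"},
    {"printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n              # fraction of all transactions containing both items
confidence = both / antecedent  # of those buying a computer, fraction also buying software
print(f"support = {support:.2f}, confidence = {confidence:.2f}")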

-Classification and Prediction:

Classification is the process of finding a model that describes and distinguishes data
classes for the purpose of being able to use the model to predict the class of objects
whose class label is unknown.

“How is the derived model presented?” The derived model may be represented in
various forms, such as classification (IF-THEN) rules, decision trees, mathematical
formulae, or neural networks.

-Decision tree:

A decision tree is a flow-chart-like tree structure, where each node denotes a test on
an attribute value, each branch represents an outcome of the test, and tree leaves
represent classes or class distributions.

-Neural Network:

A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units.
-Cluster Analysis:

Classification and prediction analyze class-labeled data objects, whereas clustering analyzes data objects without consulting a known class label.
The objects are grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity. That is, clusters of objects are
formed so that objects within a cluster have high similarity in comparison to one
another, but are very dissimilar to objects in other clusters.

-Outlier Analysis:

A database may contain data objects that do not comply with the general behavior
or model of the data. These data objects are outliers. Most data mining methods
discard outliers as noise or exceptions.The analysis of outlier data is referred to as
outlier mining.

6Q-------Data Mining Applications:

Data mining refers to the extraction of information from large amounts of data. Extracting important knowledge from a very large amount of data can be crucial to organizations for the process of decision-making.

-Data Mining Applications are:


1 Data mining applications in Marketing:

The data mining process extracts information from various data sources, which is very useful in planning, organising, managing and launching new products in a cost-effective way. Data mining techniques help us to understand the purchase behaviour of a buyer, such as how frequently a customer purchases an item, the total value of all purchases and when the last purchase was made. With data mining you can understand the needs of buyers and build products and services according to the buyers' requirements.
Database marketing is one of the most popular applications of data mining.

2 Data mining applications in HealthCare:


Data mining can be very useful to improve the healthcare system. With data mining you can predict the number of patients, which helps to make sure that every patient receives proper care at the right time and in the right place.
Data mining can help all parties involved in the healthcare industry. For example, data mining can help healthcare insurers detect fraud and abuse, healthcare organizations can improve their decision making by using knowledge provided by data mining, and patients can receive better and more affordable healthcare services.

3 Data mining applications in Education:


Educational data mining (EDM) is a new emerging field which is used to address students' challenges and helps us to understand how students learn by creating student models. The main goal of educational data mining is to predict students' future learning behaviour so that necessary steps can be taken before a student fails or drops out. Data mining is also used to predict the results of the student.

4 Data mining applications in Retail Industry:


The retail industry collects large amounts of data on sales and customer shopping history. Retail data mining helps in analyzing client behavior, customer buying patterns and trends, and leads to better customer service, higher customer satisfaction and lower business costs.

5 Data mining applications in Banking:


The banking industry has hugely benefited from the advancements in digital technology. Data mining is becoming a strategically important area for many business organizations, including the banking sector.
Data mining is used in the financial and banking sector for credit analysis, detecting fraudulent transactions, cash management and predicting payments.

7Q-----Data Mining Architecture

Data mining architecture has many elements, such as the Data Warehouse, Data Mining Engine, Pattern Evaluation module, User Interface and Knowledge Base.

-Data Warehouse:
A data warehouse is a place which stores information collected from multiple sources under a unified schema. Information stored in a data warehouse is critical to organizations for the process of decision-making.

-Data Mining Engine:


Data Mining Engine is the core component of data mining process which consists of
various modules that are used to perform various tasks like clustering,
classification, prediction and correlation analysis.

-Pattern Evaluation:
Pattern Evaluation is responsible for finding various patterns with the help of Data
Mining Engine.

-User Interface:
The User Interface provides communication between the user and the data mining system. It allows the user to use the system easily even if the user does not have detailed knowledge of the system.

-Knowledge Base:
Knowledge Base consists of data that is very important in the process of data
mining.Knowledge Base provides input to the data mining engine which guides
data mining engine in the process of pattern search.

8Q--Data Mining Issues and Challenges:

Data mining systems face a lot of challenges and issues in today's world; some of them are:

1 Mining methodology and user interaction issues

2 Performance issues

3 Issues relating to the diversity of database types

1. Mining methodology and user interaction issues:


Mining different kinds of knowledge in databases:
Different users - different knowledge - different ways. That means different clients want different kinds of information, so it becomes difficult to cover the vast range of data that can meet each client's requirements.

Interactive mining of knowledge at multiple levels of abstraction:


Interactive mining allows users to focus the search for patterns from different
angles.The data mining process should be interactive because it is difficult to know
what can be discovered within a database.
Incorporation of background knowledge:
Background knowledge is used to guide discovery process and to express the
discovered patterns.

Query languages and ad hoc mining:


Relational query languages (such as SQL) allow users to pose ad-hoc queries for data retrieval. A data mining query language should be well matched with the query language of the data warehouse.

Handling noisy or incomplete data:


In a large database, many of the attribute values will be incorrect. This may be due to human error or instrument failure. Data cleaning methods and data analysis methods are used to handle noisy data.

2. Performance issues
Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data
mining algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms:


The huge size of many databases, the wide distribution of data, and complexity of
some data mining methods are factors motivating the development of parallel and
distributed data mining algorithms. Such algorithms divide the data into partitions,
which are processed in parallel.

3.Issues relating to the diversity of database types:


Handling of relational and complex types of data:
There are many kinds of data stored in databases and data warehouses. It is not possible for one system to mine all these kinds of data. So different data mining systems should be constructed for different kinds of data.

Mining information from heterogeneous databases and global information systems:


Data is fetched from different data sources on local area networks (LAN) and wide area networks (WAN). The discovery of knowledge from such different sources of data is a great challenge to data mining.

9Q------Knowledge Discovery Process (KDP)

Data mining is the core part of the knowledge discovery process.


The KDP is a process of finding knowledge in data; it does this by using data mining methods (algorithms) in order to extract the desired knowledge from large amounts of data.

Knowledge Discovery Process may consist of the following steps :-

1 Data cleaning -
First step in the Knowledge Discovery Process is Data cleaning in which noise and
inconsistent data is removed.

2 Data Integration -
Second step is Data Integration in which multiple data sources are combined.

3 Data Selection -
Next step is Data Selection in which data relevant to the analysis task are retrieved
from the database.

4 Data Transformation -
In Data Transformation, data are transformed into forms appropriate for mining by
performing summary or aggregation operations.

5 Data Mining -
In Data Mining, data mining methods (algorithms) are applied in order to extract
data patterns.

6 Pattern Evaluation -
In Pattern Evaluation, data patterns are identified based on interestingness measures.

7 Knowledge Presentation -
In Knowledge Presentation, knowledge is represented to user using many
knowledge representation techniques.

10Q----Data Preprocessing in Data Mining


Data preprocessing is a data mining technique which is used to transform the raw data into a useful and efficient format.
Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.

(a). Missing Data:


This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:

Ignore the tuples:

This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.

Fill the Missing values:


There are various ways to do this task. You can choose to fill the missing values
manually, by attribute mean or the most probable value
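A minimal pandas sketch of these two strategies (the small customer table below is invented for the illustration):

import pandas as pd

# Fill missing values by the attribute mean (numeric column)
# and by the most probable, i.e. most frequent, value (categorical column).
df = pd.DataFrame({"age": [25, None, 40, 35, None],
                   "city": ["Delhi", "Delhi", None, "Mumbai", "Delhi"]})

df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)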

(b). Noisy Data:


Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors etc. It can be handled in the following ways:

Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task.
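A brief NumPy sketch of smoothing by bin means and by bin boundaries (the sorted values and the bin size of 3 are arbitrary choices for the illustration):

import numpy as np

data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = data.reshape(-1, 3)                      # three equal-size bins

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed_by_mean = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: every value is replaced by the closer boundary.
low, high = bins[:, [0]], bins[:, [-1]]
smoothed_by_boundaries = np.where(np.abs(bins - low) <= np.abs(bins - high), low, high).ravel()

print(smoothed_by_mean)        # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(smoothed_by_boundaries)  # [ 4  4 15 21 21 24 25 25 34]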

Regression:
Here data can be made smooth by fitting it to a regression function.The regression
used may be linear (having one independent variable) or multiple (having multiple
independent variables).

Clustering:
This approach groups similar data into clusters. The outliers may go undetected, or they will fall outside the clusters.

2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for
mining process. This involves following ways:

Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to
1.0)
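As a small illustration of min-max normalization into such a range (the age values below are invented):

import numpy as np

ages = np.array([18, 25, 30, 45, 60], dtype=float)
new_min, new_max = 0.0, 1.0

# Min-max normalization: rescale the values into [new_min, new_max].
normalized = (ages - ages.min()) / (ages.max() - ages.min()) * (new_max - new_min) + new_min
print(normalized)  # approximately [0.    0.167 0.286 0.643 1.   ]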

Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.

Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.

Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".

3. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. While working with huge volumes of data, analysis becomes harder. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.

The various steps to data reduction are:

-Data Cube Aggregation:
Aggregation operations are applied to the data for the construction of the data cube.

-Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: an attribute having a p-value greater than the significance level can be discarded.

-Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression models.

-Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, such reduction is called lossless reduction; otherwise it is called lossy reduction. The two effective methods of dimensionality reduction are:
Wavelet transforms and PCA (Principal Component Analysis).

11Q------Feature selection :

"Feature selection is the process of selecting a subset of relevant features for use in model construction", or, in other words, the selection of the most important features. Feature selection simply selects and excludes given features without changing them.

Feature Selection techniques :

Remove features with missing values
Remove features with low variance
Remove highly correlated features
Univariate feature selection
Recursive feature elimination
Feature selection using SelectFromModel (a brief sketch of some of these follows below)
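A brief scikit-learn sketch of a few of the techniques above, assuming scikit-learn is installed; the dataset is randomly generated and the thresholds are arbitrary:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold, RFE, SelectFromModel

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Remove features with low variance.
X_var = VarianceThreshold(threshold=0.1).fit_transform(X)

# Recursive feature elimination down to 5 features.
rfe = RFE(RandomForestClassifier(random_state=0), n_features_to_select=5).fit(X, y)

# Feature selection using SelectFromModel (keeps features with high importance).
sfm = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)

print(X_var.shape)
print(rfe.support_)        # True for the features RFE keeps
print(sfm.get_support())   # True for the features SelectFromModel keeps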

12Q-----Dimensionality reduction :

This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, such reduction is called lossless reduction; otherwise it is called lossy reduction. The two effective methods of dimensionality reduction are:
Wavelet transforms and PCA (Principal Component Analysis).

----What is Dimensionality Reduction?


→ The term dimensionality reduction is often reserved for those techniques that
reduce the dimensionality of a data set by creating new attributes that are a
combination of the old attributes.
Purpose:
→ Avoid curse of dimensionality.
→ Reduce amount of time and memory required by data mining algorithms.
→ Allow data to be more easily visualised.
→ May help to eliminate irrelevant features or reduce noise.
Techniques:
→ Principal Component Analysis (PCA)
→ Linear Discriminant Analysis (LDA)
→ Generalized Discriminant Analysis (GDA)

-PCA:
PCA (Principal Component Analysis) is a dimensionality reduction technique that projects the data into a lower-dimensional space.
While there are many effective dimensionality reduction techniques, PCA is the only example we will explore here.
PCA can be useful in many situations, but especially in cases with excessive multicollinearity or when explanation of the predictors is not a priority.
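A minimal scikit-learn sketch of PCA as dimensionality reduction (the 4-dimensional data is random and the choice of 2 components is arbitrary):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))           # 100 objects, 4 original attributes

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # new attributes = combinations of the old ones

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # fraction of variance kept by each component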

13Q--What is Discretization and Binarization?


-Discretization

-Top-down discretization
If the process starts by first finding one or a few points (called split points or cut
points) to split the entire attribute range, and then repeats this recursively on the
resulting intervals, then it is called top-down discretization or splitting.

-Bottom-up discretization
If the process starts by considering all of the continuous values as potential split-
points, removes some by merging neighborhood values to form intervals, then it is
called bottom-up discretization or merging.
→ Discretization is the process of converting a continuous attribute into an ordinal
attribute.
→ A potentially infinite number of values are mapped into a small number of
categories.
→ Discretization is commonly used in classification.
→ Many classification algorithms work best if both the independent and dependent
variables have only a few values.

-Binarization
→ Binarization maps a continuous or categorical attribute into one or more binary
variables
→ Typically used for association analysis
→ Often convert a continuous attribute to a categorical attribute and then convert a
categorical attribute to a set of binary attributes
→ Association analysis needs asymmetric binary attributes
→ Examples: eye colour and height measured as {low, medium, high}
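A short pandas sketch of both ideas, using invented height values and arbitrary interval boundaries:

import pandas as pd

# Discretization: map a continuous attribute (height in cm) into low/medium/high.
heights = pd.Series([150, 162, 171, 183, 195])
height_cat = pd.cut(heights, bins=[0, 160, 180, 250], labels=["low", "medium", "high"])

# Binarization: convert the categorical attribute into a set of binary attributes.
height_binary = pd.get_dummies(height_cat, prefix="height")

print(height_cat.tolist())
print(height_binary)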

15Q--Data Transformation In Data Mining


In the data transformation process, data are transformed from one format to another format that is more appropriate for data mining.

Some Data Transformation Strategies:-


1 Smoothing:
Smoothing is a process of removing noise from the data.

2 Aggregation:
Aggregation is a process where summary or aggregation operations are applied to
the data.

3 Generalization:
In generalization low-level data are replaced with high-level data by using concept
hierarchies climbing.

4 Normalization:
Normalization scales attribute data so that it falls within a small specified range, such as 0.0 to 1.0.

5 Attribute Construction:
In Attribute construction, new attributes are constructed from the given set of
attributes.

16Q------Concept hierarchies:

Discretization can be performed rapidly on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.

Concept hierarchies can be used to reduce the data by collecting and replacing low-
level concepts with higher-level concepts.

In the multidimensional model, data are organized into multiple dimensions, and
each dimension contains multiple levels of abstraction defined by concept
hierarchies. This organization provides users with the flexibility to view data from
different perspectives.

Data mining on a reduced data set means fewer input/output operations and is more
efficient than mining on a larger data set.

Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining, rather than during mining.

-Discretization and Concept Hierarchy Generation for Numerical Data - typical methods:
1 Binning:
Binning is a top-down splitting technique based on a specified number of
bins.Binning is an unsupervised discretization technique.

2 Histogram Analysis:
Because histogram analysis does not use class information, it is an unsupervised discretization technique. Histograms partition the values of an attribute into disjoint ranges called buckets.

3 Cluster Analysis:
Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups.
Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy.

-Concept hierarchy generation for categorical data is as follows:

Specification of a set of attributes, but not of their partial ordering

Automatically generate the attribute ordering based on the observation that an attribute defining a high-level concept has a smaller number of distinct values than an attribute defining a lower-level concept.
Example : country (15), state_or_province (365), city (3567), street (674,339)
Specification of only a partial set of attributes

Try and parse database schema to determine complete hierarchy.

17Q------Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as clustering, nearest neighbor classification and anomaly detection. The term proximity is used to refer to either similarity or dissimilarity.
The similarity between two objects is a numerical measure of the degree to which the two objects are alike. Consequently, similarities are higher for pairs of objects that are more alike. Similarities are usually non-negative and are often between 0 (no similarity) and 1 (complete similarity). The dissimilarity between two objects is a numerical measure of the degree to which the two objects are different. Dissimilarity is lower for more similar pairs of objects. Frequently, the term distance is used as a synonym for dissimilarity. Dissimilarities sometimes fall in the interval [0,1], but it is also common for them to range from 0 to infinity.

The term proximity between two objects is a function of the proximity between the
corresponding attributes of the two objects. Proximity measures refer to the
Measures of Similarity and Dissimilarity.

Transformation Function:
It is a function used to convert similarity to dissimilarity and vice versa, or to
transform a proximity measure to fall into a particular range. For instance:

s’ = (s - min(s)) / (max(s) - min(s))
where,
s’ = new transformed proximity measure value,
s = current proximity measure value,
min(s) = minimum of proximity measure values,
max(s) = maximum of proximity measure values
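A tiny NumPy illustration of this transformation, with invented proximity values; converting the rescaled similarities to dissimilarities via d = 1 - s is one common convention:

import numpy as np

s = np.array([2.0, 5.0, 9.0, 14.0])          # current proximity (similarity) values

s_new = (s - s.min()) / (s.max() - s.min())  # transformed into the range [0, 1]
d = 1.0 - s_new                              # corresponding dissimilarities

print(s_new)  # approximately [0.    0.25  0.583 1.   ]
print(d)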

UNIT-5

Introduction to Web Mining

Web mining is an application of data mining techniques to find information patterns from web data.
Web mining helps to improve the power of web search engine by identifying the
web pages and classifying the web documents.
Web mining is very useful to e-commerce websites and e-services.

There are three types of web mining:

1. Web Content Mining

Web content mining is defined as the process of converting raw data into useful information using the content of the web pages of a specified web site.
The process starts with the extraction of structured data or information from web pages and then identifying similar data with integration. Various types of web content include text, audio, video etc. When the content mined is text, this process is called text mining.

Text mining uses natural language processing and information retrieval techniques for a specific mining process.

2.Web Structure Mining:

Web graphs have a typical structure which consists of web pages as nodes and hyperlinks treated as edges connecting the web pages. Web structure mining is the process of discovering this structure information from the web.

This category of mining can be performed either at the document level or at the hyperlink level. The research activity which involves the hyperlink level is called hyperlink analysis.

3. Web Usage Mining

- Web usage mining is used for mining the web log records (access information of web pages) and helps to discover the user access patterns of web pages.
- The web server registers a web log entry for every web page.
- Analysis of similarities in web log records can be useful to identify the potential customers for e-commerce companies.
- Some of the techniques to discover and analyze the web usage pattern are:

i) Session and visitor analysis:
The analysis of preprocessed data can be performed in session analysis, which includes the records of visitors, days, sessions etc. This information can be used to analyze the behavior of visitors.

ii) OLAP (Online Analytical Processing):
- OLAP performs multidimensional analysis of complex data.
- OLAP can be performed on different parts of log-related data in a certain interval of time.
- The OLAP tool can be used to derive important business intelligence metrics.

What is Text Mining?

“Text mining, also referred to as text data mining, roughly equivalent to text
analytics, is the process of deriving high-quality information from text.” Text
mining deals with natural language texts either stored in semi-structured or
unstructured formats.
The five fundamental steps involved in text mining are:

- Gathering unstructured data from multiple data sources like plain text, web pages, pdf files, emails, and blogs, to name a few.
- Detecting and removing anomalies from the data by conducting pre-processing and cleansing operations. Data cleansing allows you to extract and retain the valuable information hidden within the data and helps to identify the roots of specific words. For this, you get a number of text mining tools and text mining applications.
- Converting all the relevant information extracted from unstructured data into structured formats.
- Analyzing the patterns within the data via the Management Information System (MIS).
- Storing all the valuable information in a secure database to drive trend analysis and enhance the decision-making process of the organization.

Text Mining Techniques:


1. Information Extraction
This is the most famous text mining technique. Information extraction refers to the process of extracting meaningful information from vast chunks of textual data. This text mining technique focuses on identifying and extracting entities, attributes, and their relationships from semi-structured or unstructured texts.
2. Information Retrieval
Information Retrieval (IR) refers to the process of extracting relevant and associated
patterns based on a specific set of words or phrases. In this text mining technique,
IR systems make use of different algorithms.
3. Categorization
This is one of those text mining techniques that is a form of “supervised” learning
wherein normal language texts are assigned to a predefined set of topics depending
upon their content. Thus, categorization, supported by Natural Language Processing (NLP), is a process of gathering text documents and then processing and analyzing each of them.
4. Clustering
Clustering is one of the most crucial text mining techniques. It seeks to identify
intrinsic structures in textual information and organize them into relevant subgroups
or ‘clusters’ for further analysis. Cluster analysis is a standard text mining tool that
assists in data distribution or acts as a pre-processing step for other text mining
algorithms running on detected clusters.
5. Summarisation
Text summarisation refers to the process of automatically generating a compressed
version of a specific text that holds valuable information for the end-user. Text
summarisation integrates and combines the various methods like decision trees,
neural networks, regression models, and swarm intelligence.
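As an illustrative sketch of the kind of preprocessing that sits behind several of these techniques, the documents below (invented) are turned into TF-IDF vectors and clustered with scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "data mining extracts patterns from data",
    "text mining analyses unstructured text",
    "pizza delivery near me",
    "order pizza online tonight",
]

# Convert unstructured text into a structured TF-IDF matrix, then cluster it.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # the mining documents and the pizza documents should separate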

Text mining applications:


1 – Risk management
2 – Knowledge management
3 – Cybercrime prevention
4 – Customer care service
5 – Fraud detection through claims investigation
6 – Contextual advertising
7 – Business intelligence
8 – Content enrichment

Unstructured data, text clustering and hierarchy of categories: refer to the textbook.
Unit-4

Cluster analysis:

Cluster analysis groups data objects based only on information found in the data
that describes the objects and their relationships. The goal is that the objects within
a group be similar (or related) to one another and different from (or unrelated to)
the objects in other groups. The greater the similarity (or homogeneity) within a
group and the greater the difference between groups,the better or more distinct the
clustering.

1Q-There are different types of clustering methods, including:


Partitioning method:
A partitional clustering is simply a division of the set of data objects into non-
overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical clustering:
Hierarchical clustering, which is a set of nested clusters that are organized as a tree.
Each node (cluster) in the tree (except for the leaf nodes) is the union of its children
(subclusters), and the root of the tree is the cluster containing all the objects.
Fuzzy clustering:
In a fuzzy clustering, every object belongs to every cluster with a membership
weight that is between 0 (absolutely doesn’t belong) and 1 (absolutely belongs). In
other words, clusters are treated as fuzzy sets.
Density-based clustering:
DBSCAN is a density-based clustering algorithm that divides a dataset into subgroups of high-density regions. There are two parameters required for DBSCAN: epsilon (ε) and the minimum number of points required to form a cluster (minPts). ε is a distance parameter that defines the radius to search for nearby neighbors.
Model-based clustering:
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outlier or noise into account. It therefore yields
robust clustering methods.

2Q-Different Types of Clusters


Well-Separated:
A cluster is a set of objects in which each object is closer to every other object in
the cluster than to any object not in the cluster. The distance between any two points
in different groups is larger than the distance between any two points within a
group. Well-separated clusters do not need to be globular, but can have any shape.
Prototype-Based :
A cluster is a set of objects in which each object is closer to the prototype that
defines the cluster than to the prototype of any other cluster, such clusters tend to be
globular.
Graph-Based:
A group of objects that are connected to one another, but that have no connection to objects outside the group.
Shared-Property (Conceptual Clusters):
More generally, we can define a cluster as a set of objects that share some property. The shared-property approach also includes new types of clusters; a clustering algorithm would need a very specific concept of a cluster to successfully detect such clusters. The process of finding such clusters is called conceptual clustering.

3Q-Kmeans Algorithm:

The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group.
Clustering :
Clustering is dividing data points into homogeneous classes or clusters:
Points in the same group are as similar as possible.
Points in different group are as dissimilar as possible.
When a collection of objects is given, we put objects into group based on similarity.

K-means Clustering :
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms
that solve the well-known clustering problem. K-means clustering is a method of
vector quantization.
Algorithmic steps for k-means clustering

Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, ..., vc} be the set of cluster centers.

1) Randomly select ‘c’ cluster centers.

2) Calculate the distance between each data point and cluster centers.

3) Assign each data point to the cluster center whose distance from that data point is the minimum among all the cluster centers.

4) Recalculate the new cluster center using vi = (1/ci) * Σ xj, where the sum runs over the data points xj assigned to the ith cluster and 'ci' represents the number of data points in the ith cluster.

5) Recalculate the distance between each data point and new obtained cluster
centers.

6) If no data point was reassigned then stop, otherwise repeat from step 3.
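A minimal scikit-learn sketch of these steps (the 2-D points and the choice k = 2 are invented for the illustration):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# KMeans repeats the assign/recalculate steps internally until convergence.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment of each data point
print(kmeans.cluster_centers_)  # the final recalculated cluster centers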

K-means Clustering – Example 1:

A pizza chain wants to open its delivery centres across a city. What do you think
would be the possible challenges?

-They need to analyse the areas from where the pizza is being ordered frequently.
-They need to understand how many pizza stores have to be opened to cover delivery in the area.
-They need to figure out the locations for the pizza stores within all these areas in
order to keep the distance between the store and delivery points minimum.
-Resolving these challenges involves a lot of analysis and mathematics. We will now learn how clustering can provide a meaningful and easy method of sorting out such real-life challenges.
Before that, let's see what clustering is.

K-means Clustering Method:

If k is given, the K-means algorithm can be executed in the following steps:

1. Partition the objects into k non-empty subsets.
2. Identify the cluster centroids (mean points) of the current partition.
3. Assign each point to a specific cluster.
4. Compute the distances from each point and allot points to the cluster where the distance from the centroid is minimum.
5. After re-allotting the points, find the centroid of the new cluster formed.

4Q-K-means: Additional Issues

Handling Empty Clusters:


One of the problems with the basic K-means algorithm given earlier is that empty
clusters can be obtained if no points are allocated to a cluster during the assignment
step.

Outliers:
Outliers are generally defined as samples that are exceptionally far from the
mainstream of data.

Reducing the SSE with Postprocessing:


An obvious way to reduce the SSE is to find more clusters, i.e., to use a larger K.
However, in many cases, we would like to improve the SSE, but don’t want to
increase the number of clusters. This is often possible because K-means typically
converges to a local minimum. Various techniques are used to “fix up” the resulting
clusters in order to produce a clustering that has lower SSE. The strategy is to focus
on individual clusters since the total SSE is simply the sum of the SSE contributed
by each cluster. (We will use the terminology total SSE and cluster SSE,
respectively, to avoid any potential confusion.) We can change the total SSE by
performing various operations on the clusters, such as splitting or merging clusters.
One commonly used approach is to use alternate cluster splitting and merging
phases. During a splitting phase, clusters are divided, while during a merging phase,
clusters are combined. In this way, it is often possible to escape local SSE minima
and still produce a clustering solution with the desired number of clusters. The
following are some techniques used in the splitting and merging phases. Two
strategies that decrease the total SSE by increasing the number of clusters are the
following:

Split a cluster:
The cluster with the largest SSE is usually chosen, but we could also split the
cluster with the largest standard deviation for one particular attribute.

Introduce a new cluster centroid:


Often the point that is farthest from any cluster center is chosen. We can easily
determine this if we keep
track of the SSE contributed by each point. Another approach is to choose randomly
from all points or from the points with the highest SSE.

Two strategies that decrease the number of clusters, while trying to minimize the
increase in total SSE, are the following:

Disperse a cluster:
This is accomplished by removing the centroid that corresponds to the cluster and reassigning the points to other clusters. Ideally, the cluster that is dispersed should be the one that increases the total SSE the least.

Merge two clusters:


The clusters with the closest centroids are typically chosen, although another,
perhaps better, approach is to merge the two clusters that result in the smallest
increase in total SSE. These two merging strategies are the same ones that are used
in the hierarchical clustering techniques known as the centroid method and Ward’s
method, respectively.

5Q-Evaluation of Clustering
In general, cluster evaluation assesses the feasibility of clustering analysis on a data
set and the quality of the results generated
by a clustering method. The major tasks of clustering evaluation include the
following:
--Assessing clustering tendency :

In this task, for a given data set, we assess whether a non random structure exists in
the data. Blindly applying a clustering method on a
data set will return clusters; however, the clusters mined may be misleading.
Clustering analysis on a data set is meaningful only when there is a nonrandom
structure in the data.
--Determining the number of clusters in a data set :

A few algorithms, such as k-means, require the number of clusters in a data set as
the parameter. Moreover, the number of clusters can be regarded as an interesting
and important summary statistic of a
data set. Therefore, it is desirable to estimate this number even before a clustering
algorithm is used to derive detailed clusters.
--Measuring clustering quality :

After applying a clustering method on a data set, we want to assess how good the
resulting clusters are. A number of measures can be used.
Some methods measure how well the clusters fit the data set, while others measure
how well the clusters match the ground truth, if such truth is available. There are
also measures that score clusterings and thus can compare two sets of clustering
results on the same data set.

6Q-PAM ALGORITHM:

This algorithm is very similar to K-means, mostly because both are partitional algorithms; in other words, both break the dataset into groups (clusters), and both work by trying to minimize the error, but PAM works with medoids while K-means works with centroids. The PAM algorithm partitions the dataset of n objects into k clusters, where both the dataset and the number k are inputs of the algorithm. It works with the matrix of dissimilarity, and its goal is to minimize the overall dissimilarity between the representative (medoid) of each cluster and its members.
The algorithm uses the following model to solve the problem:

F(x) is the main objective function to minimize, d(i,j) is the dissimilarity measurement between the entities, and zij is a binary variable indicating whether object j is assigned to the cluster whose medoid is object i (so that, in essence, F(x) = Σi Σj d(i,j)·zij).

In general, the algorithm proceeds this way:

Build phase:
1. Choose k entities to become the medoids, or in case these entities were provided
use them as the medoids;
2. Calculate the dissimilarity matrix if it was not informed;
3. Assign every entity to its closest medoid;

Swap phase:
4. For each cluster search if any of the entities of the cluster lower the average
dissimilarity coefficient, if it does select the entity that lowers this coefficient the
most as the medoid for this cluster;
5. If at least one medoid has changed go to (3), else end the algorithm.

7Q- Hierarchical Clustering


Hierarchical clustering involves creating clusters that have a predetermined
ordering from top to bottom. For example, all files and folders on the hard disk are
organized in a hierarchy. There are two types of hierarchical
clustering, Divisive and Agglomerative.

Divisive method
In the divisive or top-down clustering method we assign all of the observations to a single cluster and then partition the cluster into the two least similar clusters. Finally, we proceed recursively on each cluster until there is one cluster for each observation. There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances, but they are conceptually more complex.

Agglomerative Clustering: Also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). It produces a structure that is more informative than the unstructured set of clusters returned by flat clustering. This clustering algorithm does not require us to prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
Single Linkage
In single linkage hierarchical clustering, the distance between two clusters is defined as
the shortest distance between two points in each cluster. For example, the distance
between clusters “r” and “s” to the left is equal to the length of the arrow between their
two closest points.

Complete Linkage
In complete linkage hierarchical clustering, the distance between two clusters is defined
as the longest distance between two points in each cluster. For example, the distance
between clusters “r” and “s” to the left is equal to the length of the arrow between their
two furthest points.

Average Linkage
In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster. For example, the distance between clusters "r" and "s" is equal to the average length of the arrows connecting the points of one cluster to the points of the other.
Hierarchical Agglomerative vs Divisive clustering –
- Divisive clustering is more complex as compared to agglomerative clustering, because in divisive clustering we need a flat clustering method as a "subroutine" to split each cluster until every data point has its own singleton cluster.
- Divisive clustering is more efficient if we do not generate a complete hierarchy all the way down to individual data leaves. The time complexity of a naive agglomerative clustering is O(n^3) because we exhaustively scan the N x N matrix dist_mat for the lowest distance in each of the N-1 iterations. Using a priority queue data structure we can reduce this complexity to O(n^2 log n). By using some more optimizations it can be brought down to O(n^2). For divisive clustering, given a fixed number of top levels and using an efficient flat algorithm like K-means, divisive algorithms are linear in the number of patterns and clusters.
- The divisive algorithm is also more accurate. Agglomerative clustering makes decisions by considering the local patterns or neighbour points without initially taking into account the global distribution of data. These early decisions cannot be undone, whereas divisive clustering takes into consideration the global distribution of data when making top-level partitioning decisions.
8Q- HIERARCHICAL AGGLOMERATIVE ALGORITHM:

Given a dataset (d1, d2, d3, ..., dN) of size N:

# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about the primary diagonal,
    # we compute only the lower part of the primary diagonal
    for j = 1 to i:
        dist_mat[i][j] = distance(di, dj)

# initially, each data point is a singleton cluster
repeat
    merge the two clusters having the minimum distance
    update the distance matrix
until only a single cluster remains
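A runnable SciPy sketch of the same idea (the 2-D points, the single-linkage choice and the cut into 3 clusters are arbitrary for the illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 9.0]])

Z = linkage(X, method="single")                  # each row of Z records one merge step
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the hierarchy into 3 clusters

print(Z)
print(labels)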

9Q-Key Issues in Hierarchical Clustering


Lack of a Global Objective Function: agglomerative hierarchical clustering
techniques perform clustering on a local level and as such there is no global
objective function like in the K-Means algorithm. This is actually an advantage of
this technique because the time and space complexity of global functions tends to
be very expensive.

Ability to Handle Different cluster Sizes: we have to decide how to treat clusters
of various sizes that are merged together.

Merging Decisions Are Final: one downside of this technique is that once two
clusters have been merged they cannot be split up at a later time for a more
favorable union.

10Q-outliers and methods for outlier detection ?

Outliers are generally defined as samples that are exceptionally far from the
mainstream of data.
Outlier detection may be defined as the process of detecting and subsequently excluding outliers from a given set of data. It is a branch of data mining that has many applications in data stream analysis.

Models for Outlier Detection Analysis


There are several approaches to detecting Outliers. Outlier detection models may be
classified into the following groups:

1. Extreme Value Analysis:


Extreme Value Analysis is the most basic form of outlier detection and is great for one-dimensional data. It is assumed that values which are too large or too small are outliers.

2. Linear Models:
In this approach, the data is modelled into a lower-dimensional sub-space with the
use of linear correlations.
PCA (Principal Component Analysis) is an example of linear models for anomaly
detection.

3. Probabilistic and Statistical Models:


In this approach, Probabilistic and Statistical Models assume specific distributions
for data. They make use of the expectation-maximization (EM) methods to estimate
the parameters of the model. Finally, they calculate the probability of each data point. The points with a low probability of membership are marked as outliers.

4. Proximity-based Models:
In this method, outliers are modelled as points isolated from the rest of the
observations. Cluster analysis, density-based analysis, and nearest neighborhood are
the principal approaches of this kind.

5. Information-Theoretic Models:
In this method, the outliers increase the minimum code length to describe a data set.

11Q-There are four Outlier Detection techniques in general.


1. Numeric Outlier

Numeric Outlier is the simplest, nonparametric outlier detection technique in a one-dimensional feature space. The outliers are calculated by means of the IQR (InterQuartile Range). This technique can easily be implemented in KNIME Analytics Platform using the Numeric Outliers node.
2. Z-Score

The Z-score technique assumes a Gaussian distribution of the data. The outliers are the data points that are in the tails of the distribution and therefore far from the mean.
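A small NumPy sketch of this idea; the data is invented, and a threshold of 2 is used only because the sample is tiny (3 is the more common cut-off on larger data):

import numpy as np

x = np.array([10.2, 9.8, 10.1, 9.9, 10.0, 25.0])

z = (x - x.mean()) / x.std()   # z-score of each point
print(x[np.abs(z) > 2])        # points far from the mean, here [25.]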

3. DBSCAN

This outlier detection technique is based on the DBSCAN clustering method. DBSCAN is a nonparametric, density-based outlier detection method in a one- or multi-dimensional feature space. Here, all data points are defined either as Core Points, Border Points or Noise Points.
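A brief scikit-learn sketch (the points, eps and min_samples below are invented/tuned by hand for the illustration); points labelled -1 are the Noise Points, i.e. the outliers:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [20.0, 20.0]])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)           # e.g. [ 0  0  0  1  1 -1]
print(X[labels == -1])  # the detected outlier(s)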
4.Isolation Forest

This nonparametric method is ideal for large datasets in a one- or multi-dimensional feature space. The isolation number is of paramount importance in this outlier detection technique. The isolation number is the number of splits needed to isolate a data point.
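A brief scikit-learn sketch, with random inliers plus one planted outlier; the contamination value (the expected outlier fraction) is a guess:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)    # -1 for outliers (isolated with few splits), +1 for inliers
print(X[pred == -1])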

12Q-APPLICATIONS OF OUTLIER ANALYSIS:

- Quality control applications
- Financial applications
- Web log analytics
- Intrusion detection applications
- Medical applications
- Text and social media applications
- Earth science applications

UNIT-3
1Q-General Approach to Solving a Classification Problem
Classification: Classification is the process of finding a model that describes the data
classes or concepts.

Classification is considered a challenging field and contains much scope for research. It is considered challenging because of the following reasons:

- Information overload – The information explosion era is overloaded with information, and finding the required information is prohibitively expensive.

- Size and Dimension – The amount of information stored is very high, which in turn increases the size of the database to be analyzed. Moreover, the databases have a very high number of "dimensions" or "features", which again pose challenges during classification.

A classification technique (or classifier) is a systematic approach to building classification models from an input data set. Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label of the input data. The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records. First, a training set consisting of records whose class labels are known must be provided. The training set is used to build a classification model, which is subsequently applied to the test set, which consists of records with unknown class labels.

2Q-Evaluation of classifier:

1. Jaccard index:
The Jaccard Index, also known as the Jaccard similarity coefficient, is a statistic used in understanding the similarities between sample sets. The mathematical representation of the index is J(A, B) = |A ∩ B| / |A ∪ B|, i.e. the size of the intersection of the two sets divided by the size of their union.

2. Confusion Matrix:
The confusion matrix is used to describe the performance of a classification model
on a set of test data for which true values are known.


From the confusion matrix the following information can be extracted :

1. True Positive (TP): the model correctly predicted Positive cases as Positive, e.g. an illness is diagnosed as present and truly is present.
2. False Positive (FP): the model incorrectly predicted Negative cases as Positive, e.g. an illness is diagnosed as present but is actually absent (Type I error).
3. False Negative (FN): the model incorrectly predicted Positive cases as Negative, e.g. an illness is diagnosed as absent but is actually present (Type II error).
4. True Negative (TN): the model correctly predicted Negative cases as Negative, e.g. an illness is diagnosed as absent and truly is absent.

3. F1 Score:
This comes from the confusion matrix. The F1 score is calculated from the precision and recall of each class: it is the harmonic mean of the precision and recall scores. The F1 score reaches its perfect value at 1 and its worst value at 0.

Precision score: the fraction of predicted positive cases that are actually positive, Precision = TP / (TP + FP).

Recall score (Sensitivity): the true positive rate, i.e. the fraction of actual positive cases that the model correctly identifies, Recall = TP / (TP + FN).

F1 = 2 · (Precision · Recall) / (Precision + Recall).
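These scores can be read straight off the confusion matrix. A minimal sketch with scikit-learn, reusing the same illustrative labels as above:

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1]

print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3 / 4 = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean of the two = 0.75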

4. Log loss:

Log loss measures the performance of a model whose predicted outcome is a probability value between 0 and 1. For a binary problem with true label y ∈ {0, 1} and predicted probability p, the log loss of a row is −(y·log(p) + (1 − y)·log(1 − p)); the overall score is the average over all rows in the data set, and lower values are better.
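A minimal sketch with scikit-learn's log_loss, where y_prob holds the predicted probability of class 1 for each row; the numbers are illustrative:

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.7, 0.6]    # predicted probability of class 1 for each row

print(log_loss(y_true, y_prob))  # average row-wise log loss; lower is better, 0 is perfect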
3Q-Decision Tree
Decision Tree : A decision tree is one of the most powerful and popular tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

Construction of Decision Tree :

A tree can be “learned” by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when all records in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions. The construction of a decision tree classifier does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery. Decision trees can handle high-dimensional data, and in general a decision tree classifier has good accuracy. Decision tree induction is a typical inductive approach to learning classification knowledge.
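A minimal sketch of learning and inspecting a decision tree with scikit-learn; the iris data and the depth limit are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Fit a small tree: each internal node tests one attribute, each branch is an
# outcome of that test, and each leaf holds a class label.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=load_iris().feature_names))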

Advantages:
1. Compared to other algorithms, decision trees require less effort for data preparation during pre-processing.
2. A decision tree does not require normalization of data.
3. A decision tree does not require scaling of data either.
4. Missing values in the data also do not significantly affect the process of building the tree.

Disadvantages:
1. A small change in the data can cause a large change in the structure of the decision tree, causing instability.
2. For a decision tree, calculations can sometimes become far more complex compared to other algorithms.
3. Decision trees often take more time to train the model.
4. Decision tree training is relatively expensive, as the complexity and time taken are greater.

4Q-Methods for Expressing Attribute Test Conditions

Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types.
-Binary Attributes:
The test condition for a binary attribute generates two potential outcomes,
as shown in Figure
-Nominal Attributes:
Since a nominal attribute can have many values, its test condition can be
expressed in two ways, as shown in Figure 4.9. For a multiway
split (Figure 4.9(a)), the number of outcomes depends on the number of
distinct values for the corresponding attribute.

-Ordinal Attributes :
Ordinal attributes can also produce binary or multiway splits. Ordinal
attribute values can be grouped as long as the grouping does not violate
the order property of the attribute values.

-Continuous Attributes:
A continuous attribute has real numbers as its attribute values. Continuous attributes are typically represented as floating-point variables.

5Q-BEST SPLIT:
Information Gain
Information gain (IG) measures how much “information” a feature gives us about the class. The information gain is based on the decrease in entropy after a dataset is split on an attribute. It is the main criterion used to construct a Decision Tree: the attribute with the highest information gain is tested/split first.

Information gain = entropy before the split - weighted entropy of the subsets after the split

Entropy is the measure of randomness or unpredictability in the dataset. In other terms, it controls how a decision tree decides to split the data. For a two-class problem its value ranges from 0 to 1.

Gain ratio:
● a modification of the information gain that reduces its bias towards multi-valued attributes
● takes the number and size of branches into account when choosing an attribute

GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
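A minimal sketch of these quantities in plain Python, assuming the class labels of the parent node and of the subsets produced by a candidate split are given as lists; the toy labels are illustrative:

from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum(p * log2(p)) over the class proportions p in the subset.
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, children):
    # Gain = entropy before the split minus the weighted entropy after it.
    n = len(parent)
    return entropy(parent) - sum(len(child) / n * entropy(child) for child in children)

parent = ['yes', 'yes', 'yes', 'no', 'no', 'no', 'yes', 'no']
split = [['yes', 'yes', 'yes', 'no'], ['no', 'no', 'yes', 'no']]
print(information_gain(parent, split))   # about 0.19 bits for this illustrative split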

6Q-K-Nearest Neighbors
The KNN algorithm assumes that similar things exist in close proximity. In
other words, similar things are near to each other.

The KNN Algorithm

1. Load the data
2. Initialize K to your chosen number of neighbors
3. For each example in the data
   3.1 Calculate the distance between the query example and the current example from the data.
   3.2 Add the distance and the index of the example to an ordered collection
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances
5. Pick the first K entries from the sorted collection
6. Get the labels of the selected K entries
7. If regression, return the mean of the K labels
8. If classification, return the mode of the K labels
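A minimal sketch of these steps for classification in plain Python, assuming numeric feature vectors, Euclidean distance and a tiny illustrative dataset:

from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

def knn_classify(training_data, query, k):
    # Steps 3-4: compute the distance to every training example and sort ascending.
    neighbours = sorted(training_data, key=lambda example: dist(example[0], query))
    # Steps 5-6: keep the labels of the K closest examples.
    labels = [label for _, label in neighbours[:k]]
    # Step 8: for classification, return the most common (mode) label.
    return Counter(labels).most_common(1)[0][0]

training_data = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'),
                 ((6.0, 6.2), 'B'), ((5.8, 6.1), 'B')]
print(knn_classify(training_data, query=(1.1, 0.9), k=3))   # 'A'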


Choosing the right value for K
To select the K that is right for your data, run the KNN algorithm several times with different values of K and choose the K that reduces the number of errors encountered while maintaining the algorithm's ability to make accurate predictions on data it has not seen before.
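One common way to do this is sketched below with scikit-learn's cross-validation; the iris data, the 5 folds and the range of odd K values are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score several candidate values of K on held-out folds and keep the best one.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16, 2)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])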

Advantages
1. The algorithm is simple and easy to implement.
2. There’s no need to build a model, tune several parameters, or make
additional assumptions.
3. The algorithm is versatile. It can be used for classification, regression,
and search (as we will see in the next section).

Disadvantages
1. The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.

7Q-The KNN algorithm has the following features:

 KNN is a Supervised Learning algorithm that uses labeled input data set to
predict the output of the data points.
 It is one of the simplest Machine Learning algorithms and it can be easily implemented for a varied set of problems.
 It is mainly based on feature similarity: KNN checks how similar a data point is to its neighbors and classifies the data point into the class it is most similar to.

8Q-Naive Bayes

DEF: Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering, recommendation systems, etc. They are fast and easy to implement, but their biggest disadvantage is the requirement that the predictors be independent.

What is a classifier?
A classifier is a machine learning model that is used to discriminate between different objects based on certain features.

Principle of Naive Bayes Classifier:

A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The crux of the classifier is based on the Bayes theorem.

Bayes Theorem:

P(A | B) = P(B | A) · P(A) / P(B)

Using Bayes theorem, we can find the probability of A happening, given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent, that is, the presence of one particular feature does not affect the others. Hence it is called naive.

Types of Naive Bayes Classifier:

Multinomial Naive Bayes:
This is mostly used for document classification problems, i.e. whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.

Bernoulli Naive Bayes:
This is similar to the multinomial naive Bayes, but the predictors are boolean variables. The parameters that we use to predict the class variable take up only the values yes or no, for example whether a word occurs in the text or not.

Gaussian Naive Bayes:

When the predictors take continuous values and are not discrete, we assume that these values are sampled from a Gaussian distribution (normal distribution). Since the way the values are present in the dataset changes, the formula for the conditional probability becomes

P(x_i | y) = (1 / sqrt(2·π·σ_y²)) · exp(−(x_i − μ_y)² / (2·σ_y²))

where μ_y and σ_y² are the mean and variance of feature x_i computed from the training examples of class y.
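A minimal sketch of a Gaussian Naive Bayes classifier with scikit-learn; the iris data (continuous features) and the 70/30 split are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Each continuous feature is modelled with one Gaussian per class (mean and variance
# estimated from the training data), as in the formula above.
model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the held-out test set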

9Q-Classification techniques
10Q-Decision tree induction algorithm
