DM Sem U-1
2. Clustering:
Clustering analysis is a data mining technique for identifying data items that are
similar to each other. This process helps to understand the differences and
similarities between the data.
3. Regression:
Regression analysis is the data mining method of identifying and analyzing the
relationship between variables. It is used to identify the likelihood of a specific
variable, given the presence of other variables.
4. Association Rules:
This data mining technique helps to find the association between two or more
items. It discovers hidden patterns in the data set.
5. Outlier detection:
This type of data mining technique refers to the observation of data items in the
dataset which do not match an expected pattern or expected behavior. This
technique can be used in a variety of domains, such as intrusion detection, fraud
detection, or fault detection. Outlier detection is also called Outlier Analysis or
Outlier mining.
6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in
transaction data over a certain period.
7. Prediction:
Prediction uses a combination of the other data mining techniques such as trend
analysis, sequential patterns, clustering, and classification. It analyzes past events or
instances in the right sequence to predict a future event.
8. Decision tree
A decision tree is one of the most commonly used data mining techniques because
its model is easy for users to understand. In a decision tree you start with a simple
question which has two or more answers. Each answer leads to further questions
which help us make a final decision. The root node of the decision tree is that
simple starting question.
-Overfitting: Due to a small training database, a model may not fit future states.
-Data mining needs large databases which are sometimes difficult to manage.
-If the data set is not diverse, data mining results may not be accurate.
-Poor quality of data collection is one of the best-known challenges in data mining.
-Business understanding:
-Data understanding:
In this phase, a sanity check on the data is performed to check whether it is
appropriate for the data mining goals. First, data is collected from the multiple data
sources available in the organization. These data sources may include multiple
databases, flat files or data cubes. Issues like object matching and schema
integration can arise during the data integration process. It is a quite complex and
tricky process, as data from various sources is unlikely to match easily. For example,
table A contains an entity named cust_no whereas another table B contains an entity
named cust-id. Therefore, it is quite difficult to ensure whether both of these objects
refer to the same value or not. Metadata should be used to reduce errors in the data
integration process. The next step is to explore the properties of the acquired data. A
good way to explore the data is to answer the data mining questions (decided in the
business phase) using query, reporting, and visualization tools. Based on the results
of these queries, the data quality should be ascertained. Missing data, if any, should
be acquired.
-Data preparation:
In this phase, data is made production ready. The data preparation process consumes
about 90% of the time of the project. The data from different sources should be
selected, cleaned, transformed, formatted, anonymized, and constructed (if
required). Data cleaning is a process to "clean" the data by smoothing noisy data and
filling in missing values.
For example, for a customer demographics profile, age data is missing. The data is
incomplete and should be filled. In some cases, there could be data outliers. For
instance, age has a value 300. Data could be inconsistent. For instance, name of the
customer is different in different tables.
-Data transformation:
Data transformation operations change the data to make it useful in data mining.
The following transformations can be applied; these data transformation operations
contribute toward the success of the mining process.
Aggregation: Summary or aggregation operations are applied to the data. For
example, weekly sales data is aggregated to calculate the monthly and yearly totals
(see the sketch after this list).
Attribute construction: new attributes are constructed from and added to the given
set of attributes to help the data mining process. The result of this process is a final
data set that can be used in modelling.
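A minimal sketch of aggregation and attribute construction with pandas, on made-up
weekly sales data (the column names are illustrative, not from the notes):

# Aggregation and attribute construction on assumed weekly sales data.
import pandas as pd

weekly = pd.DataFrame({
    "week_start": pd.date_range("2023-01-02", periods=8, freq="W-MON"),
    "units_sold": [120, 95, 130, 110, 140, 150, 90, 105],
    "revenue":    [2400, 1900, 2600, 2200, 2800, 3000, 1800, 2100],
})

# Aggregation: roll weekly figures up to monthly and yearly totals.
monthly = weekly.groupby(weekly["week_start"].dt.to_period("M"))[["units_sold", "revenue"]].sum()
yearly = weekly.groupby(weekly["week_start"].dt.year)[["units_sold", "revenue"]].sum()

# Attribute construction: derive a new attribute from the existing ones.
weekly["avg_price"] = weekly["revenue"] / weekly["units_sold"]

print(monthly)
print(yearly)
print(weekly.head())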
-Modelling:
-Evaluation:
-Deployment:
In the deployment phase, you ship your data mining discoveries to everyday
business operations. The knowledge or information discovered during the data
mining process should be made easy to understand for non-technical stakeholders.
A detailed deployment plan, for shipping, maintenance, and monitoring of data
mining discoveries, is created. A final project report is created with lessons learned
and key experiences during the project. This helps to improve the organization's
business policy.
Data mining functionalities are used to specify the kind of patterns to be found in
data mining tasks. Data mining tasks can be classified into two categories:
descriptive and predictive.
-Descriptive mining tasks characterize the general properties of the data in the
database.
-Predictive mining tasks perform inference on the current data in order to make
predictions.
Data can be associated with classes or concepts. For example, in the Electronics
store, classes of items for sale include computers and printers, and concepts of
customers include bigSpenders and budgetSpenders.
-Data characterization:
-Data discrimination:
Frequent patterns are patterns that occur frequently in data. There are many kinds
of frequent patterns, including itemsets, subsequences, and substructures.
-Association analysis:
Suppose, as a marketing manager, you would like to determine which items are
frequently purchased together within the same transactions.
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
where X is a variable representing a customer. Confidence = 50% means that if a
customer buys a computer, there is a 50% chance that she will buy software as well.
Support = 1% means that 1% of all of the transactions under analysis showed that
computer and software were purchased together.
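A small sketch showing how the support and confidence of this rule would be counted
over a made-up transaction list:

# Support and confidence of buys(X, "computer") => buys(X, "software").
transactions = [
    {"computer", "software", "mouse"},
    {"computer", "printer"},
    {"software"},
    {"computer", "software"},
    {"printer", "paper"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n              # fraction of all transactions containing both items
confidence = both / antecedent  # fraction of "computer" transactions that also contain "software"
print(f"support = {support:.2f}, confidence = {confidence:.2f}")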
Classification is the process of finding a model that describes and distinguishes data
classes for the purpose of being able to use the model to predict the class of objects
whose class label is unknown.
“How is the derived model presented?” The derived model may be represented in
various forms, such as classification (IF-THEN) rules, decision trees, mathematical
formulae, or neural networks.
-Decision tree:
A decision tree is a flow-chart-like tree structure, where each node denotes a test on
an attribute value, each branch represents an outcome of the test, and tree leaves
represent classes or class distributions.
-Neural Network:
-Outlier Analysis:
A database may contain data objects that do not comply with the general behavior
or model of the data. These data objects are outliers. Most data mining methods
discard outliers as noise or exceptions. The analysis of outlier data is referred to as
outlier mining.
The data mining process extracts information from various data sources, which is
very useful in the process of planning, organising, managing and launching a new
product in a cost-effective way. Data mining techniques help us to understand the
purchase behaviour of a buyer, such as how frequently a customer purchases an item,
the total value of all purchases, and when the last purchase was made. With data
mining you can understand the needs of buyers and make products and services
according to the buyers' requirements.
Database marketing is one of the most popular applications of data mining.
Data mining architecture has many elements like the Data Warehouse, Data Mining
Engine, Pattern Evaluation, User Interface and Knowledge Base.
-Data Warehouse:
A data warehouse is a place which stores information collected from multiple
sources under a unified schema. Information stored in a data warehouse is critical to
organizations for the process of decision-making.
-Pattern Evaluation:
Pattern Evaluation is responsible for finding various patterns with the help of Data
Mining Engine.
-User Interface:
The User Interface provides communication between the user and the data mining
system. It allows the user to use the system easily even if the user doesn't have
proper knowledge of the system.
-Knowledge Base:
The Knowledge Base consists of data that is very important in the process of data
mining. The Knowledge Base provides input to the data mining engine, which guides
the data mining engine in the process of pattern search.
2. Performance issues
Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data
mining algorithms must be efficient and scalable.
1 Data cleaning -
First step in the Knowledge Discovery Process is Data cleaning in which noise and
inconsistent data is removed.
2 Data Integration -
Second step is Data Integration in which multiple data sources are combined.
3 Data Selection -
Next step is Data Selection in which data relevant to the analysis task are retrieved
from the database.
4 Data Transformation -
In Data Transformation, data are transformed into forms appropriate for mining by
performing summary or aggregation operations.
5 Data Mining -
In Data Mining, data mining methods (algorithms) are applied in order to extract
data patterns.
6 Pattern Evaluation -
In Pattern Evaluation, the truly interesting data patterns are identified based on
interestingness measures.
7 Knowledge Presentation -
In Knowledge Presentation, the mined knowledge is presented to the user using
various knowledge representation techniques.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
Ignore the tuples: this approach is suitable only when the dataset we have is quite
large and multiple values are missing within a tuple.
Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided
into segments of equal size and then various methods are performed to complete the
task. Each segment is handled separately. One can replace all data in a segment by
its mean, or boundary values can be used to complete the task.
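A minimal sketch of equal-frequency binning with smoothing by bin means and by bin
boundaries, on a small made-up sample:

# Equal-frequency binning with smoothing by bin means and bin boundaries.
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float))
bins = np.array_split(data, 4)  # split the sorted data into 4 equal-size segments

# Smoothing by bin means: every value in a segment is replaced by the segment mean.
smoothed_by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: each value is replaced by the closer of its bin's min/max.
smoothed_by_bounds = np.concatenate([
    np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins
])

print(smoothed_by_means)
print(smoothed_by_bounds)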
Regression:
Here data can be made smooth by fitting it to a regression function. The regression
used may be linear (having one independent variable) or multiple (having multiple
independent variables).
Clustering:
This approach groups similar data into clusters. The outliers may go undetected, or
they will fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the
mining process. This involves the following ways:
Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0
to 1.0).
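A minimal sketch of min-max normalization on made-up ages:

# Min-max normalization of a numeric attribute into the range 0.0 to 1.0.
import numpy as np

ages = np.array([18, 22, 25, 30, 45, 60], dtype=float)
normalized = (ages - ages.min()) / (ages.max() - ages.min())
print(normalized)           # values now lie between 0.0 and 1.0

# To scale into -1.0 to 1.0 instead, stretch and shift the 0-1 result.
scaled = normalized * 2 - 1
print(scaled)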
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.
Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and while working
with such volumes of data, analysis becomes harder. To deal with this, we use data
reduction techniques. Data reduction aims to increase the storage efficiency and
reduce data storage and analysis costs.
-Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example
regression models.
-Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or lossless. If,
after reconstruction from the compressed data, the original data can be retrieved,
such reduction is called lossless reduction; otherwise it is called lossy reduction.
Two effective methods of dimensionality reduction are:
Wavelet transforms and PCA (Principal Component Analysis).
11Q------Feature selection :
“feature selection is the process of selecting a subset of relevant features for use in
model construction” or in other words, the selection of the most important
features. Feature selection is simply selecting and excluding given features without
changing them.
12Q-----Dimensionality reduction :
This reduces the size of data by encoding mechanisms. It can be lossy or lossless. If,
after reconstruction from the compressed data, the original data can be retrieved,
such reduction is called lossless reduction; otherwise it is called lossy reduction.
Two effective methods of dimensionality reduction are:
Wavelet transforms and PCA (Principal Component Analysis).
-PCA:
PCA (Principal Component Analysis) is a dimensionality reduction technique that
projects the data into a lower-dimensional space.
While there are many effective dimensionality reduction techniques, PCA is the
only example we will explore here.
PCA can be useful in many situations, but especially in cases with excessive
multicollinearity or where explanation of the predictors is not a priority.
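A minimal sketch of PCA with scikit-learn on made-up data, keeping the first two
principal components (the data and number of components are illustrative):

# Project standardized toy data onto its first two principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                              # 100 samples, 5 features
X[:, 3] = X[:, 0] * 2 + rng.normal(scale=0.1, size=100)    # inject multicollinearity

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                  # (100, 2): lower-dimensional representation
print(pca.explained_variance_ratio_)    # share of variance kept by each component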
-Top-down discretization
If the process starts by first finding one or a few points (called split points or cut
points) to split the entire attribute range, and then repeats this recursively on the
resulting intervals, then it is called top-down discretization or splitting.
-Bottom-up discretization
If the process starts by considering all of the continuous values as potential split-
points and removes some by merging neighbouring values to form intervals, then it
is called bottom-up discretization or merging.
→ Discretization is the process of converting a continuous attribute into an ordinal
attribute.
→ A potentially infinite number of values are mapped into a small number of
categories.
→ Discretization is commonly used in classification.
→ Many classification algorithms work best if both the independent and dependent
variables have only a few values.
-Binarization
→ Binarization maps a continuous or categorical attribute into one or more binary
variables
→ Typically used for association analysis
→ Often convert a continuous attribute to a categorical attribute and then convert a
categorical attribute to a set of binary attributes
→ Association analysis needs asymmetric binary attributes
→ Examples: eye colour and height measured as {low, medium, high}
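A minimal sketch of the continuous → categorical → binary conversion described
above, using pandas (the height thresholds are illustrative assumptions):

# Convert a continuous attribute to {low, medium, high}, then to binary attributes.
import pandas as pd

df = pd.DataFrame({"height_cm": [150, 162, 175, 181, 169]})

# Continuous -> categorical
df["height_cat"] = pd.cut(df["height_cm"],
                          bins=[0, 160, 175, 250],
                          labels=["low", "medium", "high"])

# Categorical -> set of binary (one-hot) attributes
binary = pd.get_dummies(df["height_cat"], prefix="height")
print(pd.concat([df, binary], axis=1))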
2 Aggregation:
Aggregation is a process where summary or aggregation operations are applied to
the data.
3 Generalization:
In generalization, low-level data are replaced with high-level data by climbing
concept hierarchies.
4 Normalization:
Normalization scales attribute data so that it falls within a small specified range,
such as 0.0 to 1.0.
5 Attribute Construction:
In Attribute construction, new attributes are constructed from the given set of
attributes.
16Q------Concept hierarchies:
Concept hierarchies can be used to reduce the data by collecting and replacing low-
level concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple dimensions, and
each dimension contains multiple levels of abstraction defined by concept
hierarchies. This organization provides users with the flexibility to view data from
different perspectives.
Data mining on a reduced data set means fewer input/output operations and is more
efficient than mining on a larger data set.
2 Histogram Analysis:
Because histogram analysis does not use class information, it is an unsupervised
discretization technique. Histograms partition the values of an attribute into disjoint
ranges called buckets.
3 Cluster Analysis:
Cluster analysis is a popular data discretization method. A clustering algorithm can
be applied to discretize a numerical attribute A by partitioning the values of A into
clusters or groups.
Each initial cluster or partition may be further decomposed into several subclusters,
forming a lower level of the hierarchy.
Automatically generate the attribute ordering based upon the observation that an
attribute defining a high-level concept has a smaller number of distinct values than
an attribute defining a lower-level concept.
Example : country (15), state_or_province (365), city (3567), street (674,339)
Specification of only a partial set of attributes
The term proximity between two objects is a function of the proximity between the
corresponding attributes of the two objects. Proximity measures refer to the
Measures of Similarity and Dissimilarity.
Transformation Function:
It is a function used to convert similarity to dissimilarity and vice versa, or to
transform a proximity measure to fall into a particular range. For instance:
s’ = (s - min(s)) / (max(s) - min(s))
where,
s’ = new transformed proximity measure value,
s = current proximity measure value,
min(s) = minimum of proximity measure values,
max(s) = maximum of proximity measure values
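A small worked sketch of this transformation on made-up similarity values:

# Rescale similarity values into [0, 1] using s' = (s - min(s)) / (max(s) - min(s)),
# then convert them to dissimilarities.
import numpy as np

s = np.array([2.0, 5.0, 7.5, 10.0])             # raw proximity (similarity) values
s_new = (s - s.min()) / (s.max() - s.min())     # transformed into the range [0, 1]
d = 1 - s_new                                   # similarity converted to dissimilarity
print(s_new, d)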
UNIT-5
Web content mining is defined as the process of converting raw data into useful
information using the content of the web pages of a specified web site.
The process starts with the extraction of structured data or information from web
pages and then identifying similar data for integration. Various types of web
content include text, audio, video etc. This process is also called text mining.
Web graphs have a typical structure which consists of web pages as nodes and
hyperlinks treated as edges connecting the web pages. Web structure mining is the
process of discovering this structure information from the web.
Web usage mining is used for mining the web log records (access information of
web pages) and helps to discover the user access patterns of web pages.
Web server registers a web log entry for every web page.
Analysis of similarities in web log records can be useful to identify the potential
customers for e-commerce companies.
Some of the techniques to discover and analyze the web usage pattern are:
i) Session and visitor analysis:
The analysis of preprocessed data can be performed in session analysis, which
includes the record of visitors, days, sessions etc. This information can be used to
analyze the behavior of visitors.
ii) OLAP (Online Analytical Processing)
OLAP performs Multidimensional analysis of complex data.
OLAP can be performed on different parts of log related data in a certain
interval of time.
The OLAP tool can be used to derive the important business intelligence
metrics.
“Text mining, also referred to as text data mining, roughly equivalent to text
analytics, is the process of deriving high-quality information from text.” Text
mining deals with natural language texts either stored in semi-structured or
unstructured formats.
The five fundamental steps involved in text mining are:
Gathering unstructured data from multiple data sources like plain text, web pages,
pdf files, emails, and blogs, to name a few.
Detect and remove anomalies from the data by conducting pre-processing and
cleansing operations. Data cleansing allows you to extract and retain the valuable
information hidden within the data and helps identify the roots of specific words.
A number of text mining tools and text mining applications are available for this.
Convert all the relevant information extracted from unstructured data into structured
formats.
Analyze the patterns within the data via the Management Information System
(MIS).
Store all the valuable information into a secure database to drive trend analysis and
enhance the decision-making process of the organization.
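A minimal sketch of the "convert unstructured data into structured formats" step
above, using scikit-learn's CountVectorizer (the documents are made up):

# Turn unstructured text into a structured term-count matrix.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Data mining extracts patterns from data",
    "Text mining derives information from text",
    "Web mining analyses web pages and logs",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)      # rows = documents, columns = terms

print(vectorizer.get_feature_names_out())
print(X.toarray())                      # structured representation ready for analysis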
[Figure (from textbook): text clustering turns unstructured data into a hierarchy of
categories]
Unit-4
Cluster analysis:
Cluster analysis groups data objects based only on information found in the data
that describes the objects and their relationships. The goal is that the objects within
a group be similar (or related) to one another and different from (or unrelated to)
the objects in other groups. The greater the similarity (or homogeneity) within a
group and the greater the difference between groups, the better or more distinct the
clustering.
3Q-Kmeans Algorithm:
K-means Clustering :
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms
that solve the well-known clustering problem. K-means clustering is a method of
vector quantization.
Algorithmic steps for k-means clustering
1) Randomly select k points as the initial cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from the data point is
the minimum of all the cluster centers.
4) Recalculate the new cluster centers as the mean of the data points assigned to
each cluster.
5) Recalculate the distance between each data point and the newly obtained cluster
centers.
6) If no data point was reassigned then stop, otherwise repeat from step 4.
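A minimal sketch of these steps using scikit-learn's KMeans on made-up 2-D data
(the three groups and k=3 are illustrative choices):

# K-means clustering on toy data; inertia_ is the total within-cluster SSE.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=(0, 0), size=(50, 2)),
               rng.normal(loc=(5, 5), size=(50, 2)),
               rng.normal(loc=(0, 5), size=(50, 2))])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # final cluster centers
print(kmeans.labels_[:10])       # cluster assignment for the first 10 points
print(kmeans.inertia_)           # total within-cluster sum of squared errors (SSE)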
A pizza chain wants to open its delivery centres across a city. What do you think
would be the possible challenges?
-They need to analyse the areas from where the pizza is being ordered frequently.
-They need to understand how many pizza stores have to be opened to cover
delivery in the area.
-They need to figure out the locations for the pizza stores within all these areas in
order to keep the distance between the store and delivery points minimum.
-Resolving these challenges involves a lot of analysis and mathematics.
-We will now learn how clustering can provide a meaningful and easy method of
sorting out such real-life challenges. Before that, let's see what clustering is.
Outliers:
Outliers are generally defined as samples that are exceptionally far from the
mainstream of data.
Split a cluster:
The cluster with the largest SSE is usually chosen, but we could also split the
cluster with the largest standard deviation for one particular attribute.
Two strategies that decrease the number of clusters, while trying to minimize the
increase in total SSE, are the following:
Disperse a cluster:
This is accomplished by removing the centroid that corresponds to the cluster and
reassigning the points to other clusters. Ideally, the cluster that is dispersed should
be the one that increases the total SSE the least.
5Q-Evaluation of Clustering
In general, cluster evaluation assesses the feasibility of clustering analysis on a data
set and the quality of the results generated
by a clustering method. The major tasks of clustering evaluation include the
following:
--Assessing clustering tendency :
In this task, for a given data set, we assess whether a non random structure exists in
the data. Blindly applying a clustering method on a
data set will return clusters; however, the clusters mined may be misleading.
Clustering analysis on a data set is meaningful only when there is a nonrandom
structure in the data.
--Determining the number of clusters in a data set :
A few algorithms, such as k-means, require the number of clusters in a data set as
the parameter. Moreover, the number of clusters can be regarded as an interesting
and important summary statistic of a
data set. Therefore, it is desirable to estimate this number even before a clustering
algorithm is used to derive detailed clusters.
--Measuring clustering quality :
After applying a clustering method on a data set, we want to assess how good the
resulting clusters are. A number of measures can be used.
Some methods measure how well the clusters fit the data set, while others measure
how well the clusters match the ground truth, if such truth is available. There are
also measures that score clusterings and thus can compare two sets of clustering
results on the same data set.
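As one example of such measures, a minimal sketch using scikit-learn's silhouette
score, an internal measure that needs no ground truth (the data and candidate k
values are made up):

# Compare clustering quality for several values of k via the silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(6, 1, size=(50, 2))])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # closer to 1 = better-separated clusters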
6Q-PAM ALGORITHM:
This algorithm is very similar to K-means, mostly because both are partitional
algorithms; in other words, both break the dataset into groups (clusters), and both
work by trying to minimize the error, but PAM works with medoids while K-means
works with centroids. The PAM algorithm partitions the dataset of n objects into k
clusters, where both the dataset and the number k are inputs of the algorithm. It
works with a dissimilarity matrix and its goal is to minimize the overall
dissimilarity between each cluster and its members.
The algorithm uses the following model to solve the problem:
Build phase:
1. Choose k entities to become the medoids, or in case these entities were provided,
use them as the medoids;
2. Calculate the dissimilarity matrix if it was not provided;
3. Assign every entity to its closest medoid;
Swap phase:
4. For each cluster, search whether any of the entities of the cluster lowers the
average dissimilarity coefficient; if it does, select the entity that lowers this
coefficient the most as the medoid for this cluster;
5. If at least one medoid has changed go to (3), else end the algorithm.
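A minimal sketch of the build/swap idea above, written as a simplified k-medoids
loop in NumPy rather than a full PAM implementation (the data and k are made up):

# Simplified k-medoids: pick k medoids, assign points, then swap each cluster's
# medoid to the member with the lowest total dissimilarity, until nothing changes.
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Dissimilarity matrix: Euclidean distance between every pair of objects.
    diss = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)      # build phase
    labels = np.argmin(diss[:, medoids], axis=1)
    for _ in range(max_iter):
        labels = np.argmin(diss[:, medoids], axis=1)         # assign to closest medoid
        new_medoids = medoids.copy()
        for c in range(k):                                   # swap phase per cluster
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            costs = diss[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
medoids, labels = k_medoids(X, k=2)
print(X[medoids])   # representative objects (medoids) of each cluster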
Divisive method
In the divisive or top-down clustering method we assign all of the observations to a
single cluster and then partition the cluster into the two least similar clusters.
Finally, we proceed recursively on each cluster until there is one cluster for each
observation. There is evidence that divisive algorithms produce more accurate
hierarchies than agglomerative algorithms in some circumstances, but divisive
clustering is conceptually more complex.
Complete Linkage
In complete linkage hierarchical clustering, the distance between two clusters is defined
as the longest distance between two points in each cluster. For example, the distance
between clusters “r” and “s” to the left is equal to the length of the arrow between their
two furthest points.
Average Linkage
In average linkage hierarchical clustering, the distance between two clusters is defined as
the average distance between each point in one cluster to every point in the other cluster.
For example, the distance between clusters “r” and “s” to the left is equal to the
average length of each arrow connecting the points of one cluster to the other.
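A minimal sketch of complete and average linkage using SciPy's hierarchical
clustering (the toy points and the choice of 3 flat clusters are illustrative):

# Agglomerative clustering with complete and average linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 1], [9.2, 1.3]])

for method in ("complete", "average"):
    Z = linkage(X, method=method)             # merge history (the dendrogram's data)
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, labels)                     # flat assignment into 3 clusters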
Hierarchical Agglomerative vs Divisive clustering –
Divisive clustering is more complex as compared to agglomerative clustering,
as in case of divisive clustering we need a flat clustering method as
“subroutine” to split each cluster until each data point has its own
singleton cluster.
Divisive clustering is more efficient if we do not generate a complete hierarchy
all the way down to individual data leaves. Time complexity of a naive
agglomerative clustering is O(n^3) because we exhaustively scan the N x N
matrix dist_mat for the lowest distance in each of N-1 iterations. Using a priority
queue data structure we can reduce this complexity to O(n^2 log n). By using
some more optimizations it can be brought down to O(n^2). Whereas for divisive
clustering given a fixed number of top levels, using an efficient flat algorithm
like K-Means, divisive algorithms are linear in the number of patterns and
clusters.
Divisive algorithm is also more accurate. Agglomerative clustering makes
decisions by considering the local patterns or neighbouring points without
initially taking into account the global distribution of data. These early
decisions cannot be undone. Whereas divisive clustering takes into
consideration the global distribution of data when making top-level partitioning
decisions.
8Q-HIERARCHICAL AGGLOMERATIVE ALGORITHM:
Ability to Handle Different cluster Sizes: we have to decide how to treat clusters
of various sizes that are merged together.
Merging Decisions Are Final: one downside of this technique is that once two
clusters have been merged they cannot be split up at a later time for a more
favorable union.
Outliers are generally defined as samples that are exceptionally far from the
mainstream of data.
Outlier detection may be defined as the process of detecting and subsequently
excluding outliers from a given set of data. It is a branch of data mining that has
many applications in data stream analysis.
2. Linear Models:
In this approach, the data is modelled into a lower-dimensional sub-space with the
use of linear correlations.
PCA (Principal Component Analysis) is an example of linear models for anomaly
detection.
4. Proximity-based Models:
In this method, outliers are modelled as points isolated from the rest of the
observations. Cluster analysis, density-based analysis, and nearest neighborhood are
the principal approaches of this kind.
5. Information-Theoretic Models:
In this method, the outliers increase the minimum code length to describe a data set.
Z-Score
3. DBSCAN
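The list above names Z-Score; a minimal sketch of z-score based outlier flagging on
made-up values (the threshold of 2 is an illustrative choice for this tiny sample):

# Flag values whose z-score magnitude exceeds a chosen threshold.
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 300], dtype=float)   # 300 is the oddball
z = (values - values.mean()) / values.std()
print(values[np.abs(z) > 2])    # points more than 2 standard deviations from the mean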
UNIT-3
1Q-General Approach to Solving a Classification Problem
Classification: Classification is the process of finding a model that describes the data
classes or concepts.
Size and Dimension – The information stored is very high, which in turn, increases
the size of the database to be analyzed. Moreover, the databases have very high
number of “dimensions” or “features”, which again pose challenges during
classification.
2Q-Evaluation of classifier:
1. Jaccard index:
The Jaccard Index, also known as the Jaccard similarity coefficient, is a
statistic used in understanding the similarities between sample sets. The
mathematical representation of the index is written as:
J(A, B) = |A ∩ B| / |A ∪ B|
2. Confusion Matrix:
The confusion matrix is used to describe the performance of a classification model
on a set of test data for which true values are known.
[Figure: confusion matrix]
1. True Positive (TP): This shows that a model correctly predicted Positive cases
as Positive, e.g. an illness is diagnosed as present and truly is present.
2. False Positive (FP): This shows that a model incorrectly predicted Negative
cases as Positive, e.g. an illness is diagnosed as present but is actually absent.
(Type I error)
3. False Negative (FN): This shows that a model incorrectly predicted Positive
cases as Negative, e.g. an illness is diagnosed as absent but is actually present.
(Type II error)
4. True Negative (TN): This shows that a model correctly predicted Negative cases
as Negative, e.g. an illness is diagnosed as absent and truly is absent.
3. F-1 Score:
This comes from the confusion matrix. The F1 score is calculated based on the
precision and recall of each class. It is the harmonic mean of the precision and
recall scores. The F1 score reaches its perfect value at one and its worst at 0.
Precision score: this is the measure of the accuracy, provided that a class label
has been predicted.
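A minimal sketch computing these measures with scikit-learn's metrics module on
made-up true and predicted labels:

# Confusion matrix, precision, recall and F1 for a small binary example.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # 1 = illness present, 0 = absent
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN =", tp, fp, fn, tn)
print("precision =", precision_score(y_true, y_pred))
print("recall    =", recall_score(y_true, y_pred))
print("F1        =", f1_score(y_true, y_pred))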
4.Log loss:
Log loss measures the performance of a model where the predicted outcome is a
probability value between 0 and 1. Log loss can be calculated for each row in the
data set using the Log loss equation.
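The notes refer to the log loss equation without showing it; for a binary problem the
standard form is -(1/N) * sum[ y*log(p) + (1-y)*log(1-p) ]. A minimal sketch with
scikit-learn on made-up probabilities:

# Binary log loss on assumed predicted probabilities.
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.6, 0.1]   # predicted probability of class 1 for each row
print(log_loss(y_true, y_prob))      # lower is better; 0 would be a perfect model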
3Q-Decision Tree
Decision Tree : Decision tree is the most powerful and popular tool for
classification and prediction. A Decision tree is a flowchart like tree structure,
where each internal node denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node (terminal node) holds a class label.
Advantages:
1. Compared to other algorithms decision trees requires less effort for data
preparation during pre-processing.
2. A decision tree does not require normalization of data.
3. A decision tree does not require scaling of data as well.
4. Missing values in the data also do NOT affect the process of building a
decision tree to any considerable extent.
Disadvantage:
1. A small change in the data can cause a large change in the structure of the
decision tree causing instability.
2. For a Decision tree sometimes calculation can go far more complex
compared to other algorithms.
3. Decision tree often involves higher time to train the model.
4. Decision tree training is relatively expensive as complexity and time taken
is more.
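A minimal sketch of training and reading a small decision tree with scikit-learn on
its built-in iris dataset (the dataset and depth limit are illustrative choices):

# Fit a shallow decision tree and print its IF-THEN style structure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(export_text(tree))          # each internal node tests one attribute
print(tree.predict(X[:5]))        # class labels predicted for the first five rows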
-Ordinal Attributes :
Ordinal attributes can also produce binary or multiway splits. Ordinal
attribute values can be grouped as long as the grouping does not violate
the order property of the attribute values.
-Continuous attribute:
Has real numbers as attribute values. Continuous attributes are typically represented
as floating-point variables.
5Q-BEST SPLIT:
Information Gain
Information gain (IG) measures how much “information” a feature
gives us about the class. The information gain is based on the decrease in
entropy after a dataset is split on an attribute. It is the main parameter used
to construct a Decision Tree. An attribute with the highest
Information gain will be tested/split first.
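A minimal sketch computing entropy and the information gain of one made-up binary
split (the class counts are illustrative):

# Entropy of a label set, and the information gain of splitting it into two branches.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

parent = np.array(["yes"] * 9 + ["no"] * 5)    # 9 positive, 5 negative examples
left   = np.array(["yes"] * 6 + ["no"] * 2)    # one branch after the split
right  = np.array(["yes"] * 3 + ["no"] * 3)    # the other branch

weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted_child
print(round(entropy(parent), 3), round(info_gain, 3))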
6Q-K-Nearest Neighbors
The KNN algorithm assumes that similar things exist in close proximity. In
other words, similar things are near to each other.
3.1 Calculate the distance between the query example and the current
example from the data.
3.2 Add the distance and the index of the example to an ordered collection
Advantages
1. The algorithm is simple and easy to implement.
2. There’s no need to build a model, tune several parameters, or make
additional assumptions.
3. The algorithm is versatile. It can be used for classification, regression,
and search (as we will see in the next section).
Disadvantages
1. The algorithm gets significantly slower as the number of examples and/or
predictors/independent variables increase.
KNN is a Supervised Learning algorithm that uses labeled input data set to
predict the output of the data points.
It is one of the most simple Machine learning algorithms and it can be easily
implemented for a varied set of problems.
It is mainly based on feature similarity. KNN checks how similar a data point
is to its neighbor and classifies the data point into the class it is most similar to.
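A minimal sketch of KNN classification with scikit-learn on made-up 2-D points
(k=3 is an illustrative choice):

# Each query point gets the majority class of its 3 nearest labelled neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])             # two labelled classes

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2], [7, 9]]))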
DEF: Naive Bayes algorithms are mostly used in sentiment analysis, spam
filtering, recommendation systems etc. They are fast and easy to implement,
but their biggest disadvantage is the requirement that the predictors be
independent.
What is a classifier?
A classifier is a machine learning model that is used to discriminate different
objects based on certain features.
Bayes Theorem:
Since the way the values are present in the dataset changes, the formula for
conditional probability changes to the naive Bayes form:
P(y | x1, ..., xn) = P(y) * P(x1 | y) * ... * P(xn | y) / P(x1, ..., xn)
where y is the class and x1, ..., xn are the predictor values, assumed independent
given the class.
9Q-Classification techniques
10Q-Decision tree induction algorithm