

Department of Information Technology

III B. Tech II Semester


Subject: Data Mining Lab

Subject Code: C1213


Mr. Joel Krupakar
Assistant Professor,
IT Dept.,

Academic Year 2024-25


Regulations: MR22

Malla Reddy Engineering College


(An UGC Autonomous Institution), Approved by AICTE, New Delhi & Affiliated to
JNTUH, Hyderabad, Accredited by NAAC with ‘A++’ Grade (3rd Cycle), Maisammaguda
(H), Medchal-Malkajgiri, Secunderabad Telangana–500100 www.mrec.ac.in
Department of Information Technology

VISION

To be a premier center of professional education and research, offering quality programs in a


socio-economic and ethical ambience.

MISSION

 To impart knowledge of advanced technologies using state-of-the-art infrastructural


facilities.
 To inculcate innovation and best practices in education, training and research.
 To meet changing socio-economic needs in an ethical ambience.

Department Vision
To attain global standards in Teaching, Training, and Research in the IT industry while maintaining
a balance between the evolving needs of the sector and the socio-economic and ethical needs of
society.
Department Mission
 To impart quality education and research to undergraduate and postgraduate students in Information
Technology (IT).
 To train students in advanced technologies using state-of-the-art facilities.
 To develop knowledge, skills and aptitude to function in the IT domain based on ethical values and
social relevance.

Programme Educational Objectives (PEOs)


PEO 1:
To excel in a professional career with sound problem-solving ability, providing IT solutions through
proper planning, analysis, design, implementation and validation.

PEO 2:
To pursue training, advanced study and research using a scientific, technical and communication base
to cope with evolving technology.

PEO 3: To utilize the acquired technical skills and knowledge for the benefit of society.

PROGRAMME OUTCOMES (POs)


PO 1: Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an
engineering specialization to the solution of complex engineering problems.
PO 2: Problem Analysis: Identify, formulate, review research literature and analyze complex engineering problems
reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PO 3: Design / Development of Solutions: Design solutions for complex engineering problems and design system
components or processes that meet the specified needs with appropriate consideration for the public health and safety, and
the cultural, societal, and environmental considerations.
PO 4: Conduct Investigations of Complex Problems: Use research-based knowledge and research methods including
design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
PO 5: Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT
tools including prediction and modeling to complex engineering activities with an understanding of the limitations.
PO 6: The Engineer and Society: Apply reasoning informed by the contextual knowledge to assess societal, health,
safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
PO 7: Environment and Sustainability: Understand the impact of the professional engineering solutions in societal and
environmental contexts, and demonstrate the knowledge of, and need for sustainable development.
PO 8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.
PO 9: Individual and Team Work: Function effectively as an individual and as a member or leader in diverse teams,
and in multi disciplinary settings.
PO 10: Communication: Communicate effectively on complex engineering activities with the engineering community
and with society at large, such as, being able to comprehend and write effective reports and design documentation, make
effective presentations, and give and receive clear instructions.
PO 11: Project Management and Finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one's own work, as a member and leader in a team, to manage projects in
multidisciplinary environments.
PO 12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent and
life-long learning in the broadest context of technological change.

PROGRAMME SPECIFIC OUTCOMES (PSOs)

PSO 1:
Identify the mathematical abstractions and algorithm design techniques, together with emerging
software tools, to solve complex problems through efficient programming.

PSO 2:
Apply the core concepts of current technologies in the hardware and software domains to deliver
IT-enabled services that meet societal needs.

PSO 3:
Practice modern computing techniques through a continual learning process, with ethical concerns,
in establishing an innovative career path.

Data Mining Lab (MR22) – Academic Year 2024-25

Malla Reddy Engineering College (Autonomous)
B.Tech. VI Semester (MR-22), applicable from 2022-23 onwards

Code: C1213    Data Mining Lab [Professional Elective-III]    L: -   T: -   P: 4   Credits: 1

Prerequisites
 A course on "Database Management Systems"
Course Objectives:
 The course is intended to provide hands-on experience using data mining software.
 It is intended to provide practical exposure to the concepts and algorithms of data mining.
LIST OF EXPERIMENTS: Experiments using Weka / Pentaho / Python
1. Data Processing Techniques:
(i) Data cleaning (ii) Data transformation – Normalization (iii) Data integration
2. Partitioning - Horizontal, Vertical, Round Robin, Hash based
3. Data Warehouse schemas – star, snowflake, fact constellation
4. Data cube construction – OLAP operations
5. Data Extraction, Transformations & Loading operations
6. Implementation of Attribute oriented induction algorithm
7. Implementation of apriori algorithm
8. Implementation of FP – Growth algorithm
9. Implementation of Decision Tree Induction
10. Calculating Information gain measures
11. Classification of data using Bayesian approach
12. Classification of data using K – nearest neighbour approach
13. Implementation of K – means algorithm
14. Implementation of BIRCH algorithm
15. Implementation of PAM algorithm
16. Implementation of DBSCAN algorithm
TEXT BOOKS:
1. Data Mining – Concepts and Techniques, Jiawei Han & Micheline Kamber, Elsevier.
2. Data Warehousing, Data Mining & OLAP, Alex Berson and Stephen J. Smith, Tata McGraw-Hill
Edition, Tenth reprint 2007.
REFERENCE BOOK:
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Anuj Karpatne,
Introduction to Data Mining, Pearson Education.
Course Outcomes: At the end of the course, students will be able to:

 Apply preprocessing statistical methods for any given raw data


 Gain practical experience of constructing a data warehouse
 Implement various algorithms for data mining in order to discover interesting patterns
from large amounts of data.
 Apply OLAP operations on data cube construction

Bloom’s Taxonomy Action Verbs:

Bloom’s Taxonomy Triangle:



1. Data Processing Techniques: (i) Data cleaning (ii) Data transformation – Normalization

(iii) Data integration

Data Cleaning:

Data in the real world is frequently incomplete, noisy, and inconsistent. Many parts of the data may be
irrelevant or missing. Data cleaning is carried out to handle this aspect. Data cleaning methods aim to
fill in missing values, smooth out noise while identifying outliers, and fix data discrepancies. Unclean
data can confuse the model, so running the data through various data cleaning/cleansing methods is an
important data preprocessing step.

(a) Missing Data:

It is fairly common for a dataset to contain missing values. They may arise during data collection or as
a result of a data validation rule, but they must be handled either way.

1. Dropping rows/columns: If a complete row contains only NaN values, it carries no information and
should be dropped. Likewise, if a large share of a row/column is missing (say, more than 65%), it can
be dropped.

2. Checking for duplicates: If the same row or column is repeated, it can be dropped, keeping only the
first instance, so that the repeated data object does not gain undue weight or bias when machine
learning algorithms are run.

3. Estimating missing values: If only a small percentage of the values are missing, basic interpolation
methods can be used to fill in the gaps. The most common approach, however, is to fill missing values
with the feature's mean, median, or mode. A minimal pandas sketch of these steps is shown below.
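
The following is a minimal, illustrative sketch of these cleaning steps using pandas; the file name
raw_data.csv and the 65% threshold are assumptions used only for demonstration.

import pandas as pd

# Load a raw dataset (file name is illustrative)
df = pd.read_csv("raw_data.csv")

# 1. Drop rows that are entirely NaN, and columns that are more than 65% missing
df = df.dropna(how="all")
df = df.loc[:, df.isna().mean() <= 0.65]

# 2. Drop duplicate rows, keeping the first occurrence
df = df.drop_duplicates(keep="first")

# 3. Estimate remaining missing values: numeric columns with the mean,
#    categorical columns with the mode
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df.info())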

(b) Noisy Data:

Noisy data is meaningless data that machines cannot interpret. It can be caused by poor data collection,
data entry problems, and so on. It can be dealt with in the following ways:

1. Binning Method: This method smooths data that has been sorted. The data is divided into
equal-sized segments, and each segment is handled independently. All values in a segment can be
replaced by the segment mean, or boundary values can be used to complete the task.

2. Clustering: In this method, related data is grouped into clusters. Outliers may go unnoticed, or they
may fall outside the clusters.

3. Regression: Data can be smoothed by fitting it to a regression function. The regression model
employed may be linear (with a single independent variable) or multiple (with several independent
variables).

A short sketch of smoothing by bin means is given below.
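
As an illustration of smoothing by bin means, here is a small pandas sketch; the sample values and the
choice of three equal-frequency bins are assumptions for illustration.

import pandas as pd

# Sorted, noisy attribute values (illustrative sample)
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Split the sorted values into 3 equal-frequency bins
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value with the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(pd.DataFrame({"original": prices, "bin": bins, "smoothed": smoothed}))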



Data Integration

Data integration is part of a data analysis task that combines data from multiple sources into a coherent
data store. These sources may include multiple databases. How can the data be matched up? A data
analyst may find Customer_ID in one database and cust_id in another; how can he be sure that the two
refer to the same entity? Databases and data warehouses have metadata (data about data), which helps
in avoiding such errors.

Data Normalization

Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or
[0.0, 1.0]. There are different methods to normalize the data, as discussed below.

Consider a numeric attribute A with n observed values v1, v2, v3, ..., vn.

o Min-max normalization: This method implements a linear transformation on the original data.
Let minA and maxA be the minimum and maximum values observed for attribute A, and let vi be the
value of A that has to be normalized. Min-max normalization maps vi to v'i in a new, smaller range
[new_minA, new_maxA]:

   v'i = ((vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

For example, suppose $12,000 and $98,000 are the minimum and maximum values for the attribute
income, and [0.0, 1.0] is the range to which we have to map the value $73,600. The value $73,600 is
transformed using min-max normalization as follows:

   v' = (73,600 - 12,000) / (98,000 - 12,000) = 0.716

o Z-score normalization: This method normalizes the value of attribute A using the mean and
standard deviation:

   v'i = (vi - Ā) / σA

Here Ā and σA are the mean and standard deviation of attribute A, respectively. For example, if the
mean and standard deviation of attribute A are $54,000 and $16,000, the value $73,600 is normalized
using z-score normalization as:

   v' = (73,600 - 54,000) / 16,000 = 1.225

o Decimal scaling: This method normalizes the value of attribute A by moving the decimal point. The
movement of the decimal point depends on the maximum absolute value of A:

   v'i = vi / 10^j

Here j is the smallest integer such that max(|v'i|) < 1.

For example, the observed values of attribute A range from -986 to 917, so the maximum absolute
value of A is 986. To normalize each value of A using decimal scaling, we divide each value by 1000,
i.e., j = 3. So the value -986 is normalized to -0.986, and 917 is normalized to 0.917.

The normalization parameters, such as the mean, standard deviation, or maximum absolute value, must
be preserved so that future data can be normalized uniformly.
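
A minimal Python sketch of the three normalization methods; the sample income values are
assumptions for illustration, and the z-score statistics here are computed from the sample itself rather
than given in advance.

import numpy as np

values = np.array([12000.0, 54000.0, 73600.0, 98000.0])  # sample income values

# Min-max normalization to [0.0, 1.0]
min_a, max_a = values.min(), values.max()
minmax = (values - min_a) / (max_a - min_a)

# Z-score normalization (mean and std taken from the sample)
zscore = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j, with j the smallest integer giving max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(values).max() + 1)))
decimal_scaled = values / (10 ** j)

print("min-max:", np.round(minmax, 3))
print("z-score:", np.round(zscore, 3))
print("decimal:", np.round(decimal_scaled, 3))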

2. Partitioning - Horizontal, Vertical, Round Robin, Hash based

Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also
helps in balancing the various requirements of the system. It optimizes the hardware performance and
simplifies the management of data warehouse by partitioning each fact table into multiple separate
partitions.
Horizontal Partitioning

There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to
keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments

In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each time
period represents a significant retention period within the business. For example, if the user queries
for month to date data then it is appropriate to partition the data into monthly segments. We can
reuse the partitioned tables by removing the data in them.

Partition by Time into Different-sized Segments

This kind of partition is done where the aged data is accessed infrequently. It is implemented as a set of
small partitions for relatively current data, larger partition for inactive data.

Vertical Partition

Vertical partitioning splits the data vertically, i.e., by columns. Vertical partitioning can be performed
in the following two ways −

• Normalization
• Row Splitting

Normalization

Normalization is the standard relational method of database organization. In this method, the rows are
collapsed into a single row, hence reducing space.

Row Splitting

Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is
to speed up access to a large table by reducing its size.

Hash Partitioning

Hash partitioning maps data to partitions based on a hashing algorithm that Oracle applies to a
partitioning key that you identify. The hashing algorithm evenly distributes rows among partitions,
giving partitions approximately the same size. Hash partitioning is the ideal method for distributing
data evenly across devices. Hash partitioning is also an easy-to-use alternative to range partitioning,
especially when the data to be partitioned is not historical.

Oracle Database uses a linear hashing algorithm; to prevent data from clustering within specific
partitions, you should define the number of partitions as a power of two (for example, 2, 4, 8).

The following statement creates a table sales_hash, which is hash partitioned on the
salesman_id field:

CREATE TABLE sales_hash (
    salesman_id   NUMBER(5),
    salesman_name VARCHAR2(30),
    sales_amount  NUMBER(10),
    week_no       NUMBER(2))
PARTITION BY HASH (salesman_id)
PARTITIONS 4;

Round-robin partitioning: the simplest strategy, it ensures uniform data distribution. With n
partitions, the ith tuple in insertion order is assigned to partition (i mod n). This strategy enables
sequential access to a relation to be performed in parallel. However, direct access to individual tuples,
based on a predicate, requires accessing the entire relation. A small sketch of round-robin and
hash-based assignment is given below.
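
The following is a minimal sketch, not tied to any particular database, showing how round-robin and
hash-based partitioning assign rows to partitions; the sample rows and the choice of 4 partitions are
assumptions.

# Assigning rows to partitions: round-robin vs. hash-based (illustrative sketch)
rows = [(101, "Anil"), (102, "Bhavya"), (103, "Chandra"), (104, "Divya"), (105, "Eshwar")]
n_partitions = 4

round_robin = {p: [] for p in range(n_partitions)}
hash_based = {p: [] for p in range(n_partitions)}

for i, row in enumerate(rows):
    # Round robin: the i-th inserted tuple goes to partition (i mod n)
    round_robin[i % n_partitions].append(row)
    # Hash based: partition chosen by hashing the partitioning key (the id here)
    key = row[0]
    hash_based[hash(key) % n_partitions].append(row)

print("Round robin:", round_robin)
print("Hash based: ", hash_based)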

3. Data Warehouse schemas – star, snowflake, fact constellation

Star Schema
• Each dimension in a star schema is represented with only one-dimension table.
• This dimension table contains the set of attributes.
• The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.

• There is a fact table at the center. It contains the keys to each of four dimensions.
• The fact table also contains the attributes, namely dollars sold and units sold.
• Star Schema Definition
• The star schema that we have discussed can be defined using Data Mining Query
Language (DMQL) as follows −
• define cube sales star [time, item, branch, location]:

• dollars sold = sum(sales in dollars), units sold = count(*)

• define dimension time as (time key, day, day of week, month, quarter, year)
• define dimension item as (item key, item name, brand, type, supplier type)
• define dimension branch as (branch key, branch name, branch type)
• define dimension location as (location key, street, city, province or state, country)
Snowflake Schema
• Some dimension tables in the Snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike Star schema, the dimensions table in a snowflake schema are normalized. For example,
the item dimension table in star schema is normalized and split into two dimension tables,
namely item and supplier table.

• Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier-key.
• The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.
• Snowflake Schema Definition
• Snowflake schema can be defined using DMQL as follows −
• define cube sales snowflake [time, item, branch, location]:

• dollars sold = sum(sales in dollars), units sold = count(*)



• define dimension time as (time key, day, day of week, month, quarter, year)
• define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier
type))
• define dimension branch as (branch key, branch name, branch type)
• define dimension location as (location key, street, city (city key, city, province or state,
country))
Fact Constellation Schema
• A fact constellation has multiple fact tables. It is also known as galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.

• The sales fact table is same as that in the star schema.


• The shipping fact table has the five dimensions, namely item_key, time_key,
shipper_key, from_location, to_location.
• The shipping fact table also contains two measures, namely dollars sold and units sold.
• It is also possible to share dimension tables between fact tables. For example, time, item, and
location dimension tables are shared between the sales and shipping fact table.
• Fact Constellation Schema Definition
• Fact constellation schema can be defined using DMQL as follows −
• define cube sales [time, item, branch, location]:

• dollars sold = sum(sales in dollars), units sold = count(*)

• define dimension time as (time key, day, day of week, month, quarter, year)
• define dimension item as (item key, item name, brand, type, supplier type)
• define dimension branch as (branch key, branch name, branch type)
• define dimension location as (location key, street, city, province or state,country)

• define cube shipping [time, item, shipper, from location, to location]:

• dollars cost = sum(cost in dollars), units shipped = count(*)

• define dimension time as time in cube sales


• define dimension item as item in cube sales
• define dimension shipper as (shipper key, shipper name, location as location in cube sales,
shipper type)
• define dimension from location as location in cube sales
• define dimension to location as location in cube sales.

4. Data cube construction – OLAP operations

An OLAP cube is a term that typically refers to a multi-dimensional array of data. OLAP is an
acronym for online analytical processing, which is a computer-based technique of analyzing data to
look for insights. The term cube here refers to a multi-dimensional dataset, which is also sometimes
called a hypercube if the number of dimensions is greater than 3.

Operations:

1. Slice is the act of picking a rectangular subset of a cube by choosing a single value for one of its
dimensions, creating a new cube with one fewer dimension. For example, the sales figures of all sales
regions and all product categories of the company in the years 2005 and 2006 can be "sliced" out of
the data cube.

2. Dice: The dice operation produces a subcube by allowing the analyst to pick specific values of
multiple dimensions. For example, the new cube may show the sales figures of a limited number of
product categories, while the time and region dimensions cover the same range as before.

3. Drill Down/Up allows the user to navigate among levels of data ranging from the most summarized
(up) to the most detailed (down). For example, the analyst may move from the summary category
"Outdoor protective equipment" to see the sales figures for the individual products.

4. Roll-up: A roll-up involves summarizing the data along a dimension. The summarization rule might
be computing totals along a hierarchy or applying a set of formulas such as "profit = sales - expenses".

5. Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could
be arranged vertically and products horizontally while viewing data for a particular quarter. Pivoting
could replace products with time periods to see data across time for a single product. A small pandas
sketch of these operations follows.
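
A minimal sketch of these OLAP-style operations using a pandas pivot table; the column names and
sample data are assumptions for illustration.

import pandas as pd

# A tiny fact table: sales by year, region and product category (illustrative data)
sales = pd.DataFrame({
    "year":     [2005, 2005, 2005, 2006, 2006, 2006],
    "region":   ["North", "South", "North", "South", "North", "South"],
    "category": ["Outdoor", "Outdoor", "Clothing", "Clothing", "Outdoor", "Clothing"],
    "amount":   [100, 150, 80, 120, 130, 90],
})

# Build a small "cube" as a pivot table: dimensions on the axes, measure aggregated
cube = sales.pivot_table(index="region", columns=["year", "category"],
                         values="amount", aggfunc="sum", fill_value=0)

# Slice: fix one dimension to a single value (year = 2005)
slice_2005 = cube.xs(2005, axis=1, level="year")

# Dice: pick specific values of multiple dimensions
dice = sales[(sales["year"] == 2006) & (sales["category"] == "Outdoor")]

# Roll-up: summarize along a dimension (total sales per year)
rollup = sales.groupby("year")["amount"].sum()

print(slice_2005, dice, rollup, sep="\n\n")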

6. Implementation of Attribute oriented induction algorithm

Aim: To perform the implementation of the attribute-oriented induction algorithm.

Resources: Weka

Theory: AOI stands for Attribute-Oriented Induction. The attribute-oriented induction approach to
concept description was first proposed in 1989, a few years before the introduction of the data cube
approach. The data cube approach is essentially based on materialized views of the data, which
typically have been pre-computed in a data warehouse.
In general, it implements off-line aggregation before an OLAP or data mining query is submitted for
processing. In other words, the attribute-oriented induction approach is generally a query-oriented,
generalization-based, on-line data analysis method. The general idea of attribute-oriented induction is
to first collect the task-relevant data using a database query and then perform generalization based on
the examination of the number of distinct values of each attribute in the relevant collection of data.
The generalization is implemented by attribute removal or attribute generalization. Aggregation is
implemented by combining identical generalized tuples and accumulating their specific counts. This
decreases the size of the generalized data set. The resulting generalized association can be mapped
into several forms for presentation to the user, including charts or rules.
Algorithm: The process of attribute-oriented induction is as follows −

• First, data focusing must be performed before attribute-oriented induction. This step corresponds to
the specification of the task-relevant data (i.e., the data for analysis). The data are collected based on
the information supplied in the data mining query.

• Because a data mining query is usually relevant to only a portion of the database, selecting the
relevant set of data not only makes mining more efficient, but also yields more significant results than
mining the whole database.

• Specifying the set of relevant attributes (i.e., the attributes for mining, as indicated in DMQL with the
"in relevance to" clause) may be difficult for the user. A user may choose only a few attributes that
seem important, while missing others that could also play a role in the description.

• For example, suppose that the dimension birth_place is defined by the attributes city, province_or_state,
and country. To allow generalization on the birth_place dimension, the other attributes defining this
dimension should also be included.

• In other words, having the system automatically include province_or_state and country as relevant
attributes allows city to be generalized to these higher conceptual levels during the induction phase.

• At the other extreme, the user may have introduced too many attributes by specifying all of the
possible attributes with the clause "in relevance to *". In this case, all of the attributes in the relation
specified by the from clause would be included in the analysis.

• Some attributes are unlikely to contribute to an interesting description. A correlation-based or
entropy-based analysis method can be used to perform attribute relevance analysis and filter out
statistically irrelevant or weakly relevant attributes from the descriptive mining process.

Procedure:
Step 1: Open the Weka Explorer.
Step 2: Load the data set.
Step 3: Choose the "Select attributes" tab, choose the CfsSubsetEval attribute evaluator and a search
method, and click the Start button.

Output: === Run information ===
Evaluator: weka.attributeSelection.CfsSubsetEval -P 1 -E 1
Search: weka.attributeSelection.GreedyStepwise -T -1.7976931348623157E308 -N -1 -num-slots 1
Relation: breast-cancer
Instances: 286

7. Implementation of apriori algorithm

This experiment demonstrates the basic elements of association rule mining using WEKA. The sample
dataset used for this example is contactlenses.arff.

Step 1: Open the data file in Weka Explorer. It is presumed that the required data fields have been
discretized. In this example it is the age attribute.

Step 2: Clicking on the Associate tab will bring up the interface for the association rule algorithms.

Step 3: We will use the Apriori algorithm. This is the default algorithm.

Step 4: In order to change the parameters for the run (e.g., support, confidence), we click on the text
box immediately to the right of the Choose button.

Dataset: contactlenses.arff

The following screenshot shows the association rules that were generated when apriori
algorithm is applied on the given dataset.

13. Implementation of K – means algorithm

This experiment demonstrates simple k-means clustering with the Weka Explorer. The sample data set
used for this example is based on the iris data available in ARFF format. This document assumes that
appropriate preprocessing has been performed. The iris dataset includes 150 instances.

Steps involved in this experiment:

Step 1: Run the Weka Explorer and load the data file iris.arff in the Preprocess interface.

Step 2: In order to perform clustering, select the Cluster tab in the Explorer and click on the Choose
button. This step results in a dropdown list of available clustering algorithms.

Step 3: In this case we select SimpleKMeans.

Step 4: Next, click the text box to the right of the Choose button to get the popup window shown in
the screenshots. In this window we enter six as the number of clusters and leave the seed value as it is.
The seed value is used in generating a random number, which is used for making the initial
assignment of instances to clusters.

Step 5: Once the options have been specified, we run the clustering algorithm. We must make sure
that "Use training set" is selected in the Cluster mode panel, and then we click the Start button. This
process and the resulting window are shown in the following screenshots.

Step 6: The result window shows the centroid of each cluster as well as statistics on the number and
percentage of instances assigned to the different clusters. Cluster centroids are the mean vectors of
each cluster, and they can be used to characterize the clusters. For example, the centroid of cluster 1
shows the class Iris-versicolor with a mean sepal length of 5.4706, sepal width of 2.4765, petal length
of 3.7941, and petal width of 1.1294.

Step 7: Another way of understanding the characteristics of each cluster is through visualization. To
do this, right-click the result set in the Result list panel and select "Visualize cluster assignments".

The following screenshot shows the clustering that was generated when the SimpleKMeans algorithm
was applied to the given dataset.
7. Implementation of apriori algorithm

The Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean association
rules. It uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step
known as candidate generation), and groups of candidates are tested against the data.

❖ Problem:

TID   ITEMS
100   1,3,4
200   2,3,5
300   1,2,3,5
400   2,5

To find frequent item sets for the above transactions with a minimum support of 2 and a confidence
measure of 70% (i.e., 0.7).

Procedure:
Step 1:
Count the number of transactions in which each item occurs.

ITEM   NO. OF TRANSACTIONS
1      2
2      3
3      3
4      1
5      3

Step 2:
Eliminate all items whose transaction count is less than the minimum support (2 in this case).

ITEM   NO. OF TRANSACTIONS
1      2
2      3
3      3
5      3

These are the single items that are bought frequently. Now let's say we want to find pairs of items that
are bought frequently. We continue from the above table (table in step 2).

Step 3:
We start making pairs from the first item, like 1,2; 1,3; 1,5 and then from the second item, like 2,3;
2,5. We do not form 2,1 because we already did 1,2 when we were making pairs with 1, and buying 1
and 2 together is the same as buying 2 and 1 together. After making all the pairs we get:

ITEM PAIRS
1,2
1,3
1,5
2,3
2,5
3,5

Step 4:
Now, we count how many times each pair is bought together.

ITEM PAIRS   NO. OF TRANSACTIONS
1,2          1
1,3          2
1,5          1
2,3          2
2,5          3
3,5          2
Step 5:
Again remove all item pairs having a number of transactions less than 2.

ITEM PAIRS   NO. OF TRANSACTIONS
1,3          2
2,3          2
2,5          3
3,5          2

These pairs of items are bought frequently together. Now, let's say we want to find a set of three items
that are bought together. We use the above table (of step 5) and make a set of three items.

Step 6:
To make the set of three items we need one more rule (termed self-join): from the item pairs in the
above table, we find two pairs with the same first item, so we get (2,3) and (2,5), which gives (2,3,5).
Then we find how many times (2,3,5) is bought together in the original table and we get the following:

ITEM SET   NO. OF TRANSACTIONS
(2,3,5)    2

Thus, the set of three items that are bought together from this data is (2, 3, 5).

Confidence:
We can take our frequent item set knowledge even further by finding association rules using the
frequent item sets. In simple words, we know (2, 3, 5) are bought together frequently, but what is the
association between them? To do this, we create a list of all subsets of the frequently bought items
(2, 3, 5); in our case we get the following subsets:
▪ {2}
▪ {3}
▪ {5}
▪ {2,3}
▪ {3,5}
▪ {2,5}

Now, we find the association among all the subsets.

{2} => {3,5}: (If '2' is bought, what is the probability that '3' and '5' would be bought in the same
transaction?)
Confidence = P(2∩3∩5)/P(2) = 2/3 = 67%
{3} => {2,5} = P(2∩3∩5)/P(3) = 2/3 = 67%
{5} => {2,3} = P(2∩3∩5)/P(5) = 2/3 = 67%
{2,3} => {5} = P(2∩3∩5)/P(2∩3) = 2/2 = 100%
{3,5} => {2} = P(2∩3∩5)/P(3∩5) = 2/2 = 100%
{2,5} => {3} = P(2∩3∩5)/P(2∩5) = 2/3 = 67%
Also, considering the remaining 2-item sets, we would get the following associations −
{1} => {3} = P(1∩3)/P(1) = 2/2 = 100%
{3} => {1} = P(1∩3)/P(3) = 2/3 = 67%
{2} => {3} = P(2∩3)/P(2) = 2/3 = 67%
{3} => {2} = P(2∩3)/P(3) = 2/3 = 67%
{2} => {5} = P(2∩5)/P(2) = 3/3 = 100%
{5} => {2} = P(2∩5)/P(5) = 3/3 = 100%
{3} => {5} = P(3∩5)/P(3) = 2/3 = 67%
{5} => {3} = P(3∩5)/P(5) = 2/3 = 67%
Eliminate all those having confidence less than 70%. Hence, the rules would be −
{2,3} => {5}, {3,5} => {2}, {1} => {3}, {2} => {5}, {5} => {2}.

➢ Now these manual results should be checked against the rules generated in WEKA. A small Python
sketch that reproduces this calculation is given below.

So first create a CSV file for the above problem; it should contain the same rows and columns as the
transaction table above. This file can be prepared in a spreadsheet and saved as CSV.
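
The following is a minimal, self-contained sketch (plain Python, no Weka) that reproduces the frequent
item sets and rules derived above for the four-transaction example; the support threshold of 2
transactions and 70% confidence follow the problem statement.

from itertools import combinations

# The four transactions from the problem statement
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_support, min_confidence = 2, 0.7

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Generate all candidate itemsets up to size 3 and keep the frequent ones
items = sorted(set().union(*transactions))
frequent = {}
for k in (1, 2, 3):
    for cand in combinations(items, k):
        s = support(set(cand))
        if s >= min_support:
            frequent[frozenset(cand)] = s

# Derive association rules X => Y with confidence = support(X u Y) / support(X)
for itemset, s in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            conf = s / frequent[lhs]
            if conf >= min_confidence:
                print(f"{set(lhs)} => {set(itemset - lhs)}  confidence={conf:.0%}")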

8. Implementation of FP – Growth algorithm

PROBLEM:
To find all frequent item sets in the following dataset using the FP-growth algorithm, with minimum
support = 2 and confidence = 70%.

TID   ITEMS
100   1,3,4
200   2,3,5
300   1,2,3,5
400   2,5

Solution:
As in the Apriori algorithm, find the frequency of occurrence of each item in the dataset and then
prioritize the items in descending order of frequency. Eliminating the items with a count less than the
minimum support and assigning the priorities, we obtain the following table.

ITEM   NO. OF TRANSACTIONS   PRIORITY
1      2                     4
2      3                     1
3      3                     2
5      3                     3

Re-arranging the original table (items listed in priority order), we obtain:

TID   ITEMS
100   3,1
200   2,3,5
300   2,3,5,1
400   2,5

Construction of the tree:
Note that every FP-tree has a 'null' node as the root. So, draw the root node first and attach the items
of row 1 one by one, writing their occurrence counts next to them. The tree is further expanded by
adding nodes according to the prefixes formed, incrementing the counts every time an item occurs
again, and hence the tree is built.

Conditional pattern bases (prefixes):

▪ 1 -> {3}:1, {2,3,5}:1
▪ 5 -> {2,3}:2, {2}:1
▪ 3 -> {2}:2

Frequent item sets:

▪ 1 -> 3:2  /* 2 and 5 are eliminated because their counts are less than the minimum support, and the
  count of 3 is obtained by adding its occurrences in both prefix instances */
▪ Similarly, 5 -> 2,3:2 ; 2:3 ; 3:2
▪ 3 -> 2:2

Therefore, the frequent item sets are {3,1}, {2,3,5}, {2,5}, {2,3}, {3,5}.

The tree is constructed as below (node:count):

null
├── 2:3
│   ├── 3:2
│   │   └── 5:2
│   │       └── 1:1
│   └── 5:1
└── 3:1
    └── 1:1

Generating the association rules from the above tree and calculating the confidence measures, we get −
▪ {3} => {1} = 2/3 = 67%
▪ {1} => {3} = 2/2 = 100%
▪ {2} => {3,5} = 2/3 = 67%
▪ {2,5} => {3} = 2/3 = 67%
▪ {3,5} => {2} = 2/2 = 100%
▪ {2,3} => {5} = 2/2 = 100%
▪ {3} => {2,5} = 2/3 = 67%
▪ {5} => {2,3} = 2/3 = 67%
▪ {2} => {5} = 3/3 = 100%
▪ {5} => {2} = 3/3 = 100%
▪ {2} => {3} = 2/3 = 67%
▪ {3} => {2} = 2/3 = 67%

Thus, eliminating all the rules having confidence less than 70%, we obtain the following conclusions:
{1} => {3}, {3,5} => {2}, {2,3} => {5}, {2} => {5}, {5} => {2}.

As we see, there are 5 rules generated manually, and these are to be checked against the results in
WEKA. In order to check the results in the tool we need to follow a procedure similar to Apriori: first
create a CSV file for the above problem, containing the same rows and columns as the transaction
table above; this file can be prepared in a spreadsheet and saved as CSV. A small FP-growth sketch in
Python is also given below.
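
As an alternative to Weka, here is a minimal sketch using the mlxtend library's fpgrowth and
association_rules functions; it assumes mlxtend is installed, and the items are represented as strings
purely for readability.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# The four transactions from the example
transactions = [["1", "3", "4"], ["2", "3", "5"], ["1", "2", "3", "5"], ["2", "5"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Minimum support of 2 transactions out of 4 = 0.5
frequent = fpgrowth(onehot, min_support=0.5, use_colnames=True)

# Keep rules with confidence >= 70%
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])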

9. Implementation of Decision Tree Induction

Decision tree learning is one of the most widely used and practical methods for inductive inference
over supervised data. It represents a procedure for classifying categorical data based on its attributes.
This representation of acquired knowledge in tree form is intuitive and easy for humans to assimilate.

ILLUSTRATION:
Build a decision tree for the following data.

AGE          INCOME   STUDENT   CREDIT_RATING   BUYS_COMPUTER
Youth        High     No        Fair            No
Youth        High     No        Excellent       No
Middle aged  High     No        Fair            Yes
Senior       Medium   No        Fair            Yes
Senior       Low      Yes       Fair            Yes
Senior       Low      Yes       Excellent       No
Middle aged  Low      Yes       Excellent       Yes
Youth        Medium   No        Fair            No
Youth        Low      Yes       Fair            Yes
Senior       Medium   Yes       Fair            Yes
Youth        Medium   Yes       Excellent       Yes
Middle aged  Medium   No        Excellent       Yes
Middle aged  High     Yes       Fair            Yes
Senior       Medium   No        Excellent       No

Entropy is a measure of the uncertainty associated with a random variable. As uncertainty increases,
so does entropy; for a two-class problem the values range from 0 to 1.

Entropy(D) = -Σi pi log2(pi)

Information gain is used as the attribute selection measure; we pick the attribute having the highest
information gain. The gain is calculated by:

Gain(D, A) = Entropy(D) - Σj (|Dj| / |D|) Entropy(Dj)

where D is a given data partition, A is an attribute, and v is the number of distinct values of A. If we
partition the tuples in D on attribute A, D is split into v partitions or subsets (D1, D2, ..., Dv), where Dj
contains those tuples in D that have outcome aj of A.

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

Entropy(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Compute the expected information requirement for each attribute, starting with the attribute age:

Gain(age, D) = Entropy(D) - Σj (|Dj| / |D|) Entropy(Dj)
             = Entropy(D) - (5/14) Entropy(S_youth) - (4/14) Entropy(S_middle-aged) - (5/14) Entropy(S_senior)
             = 0.940 - 0.694
             = 0.246

Similarly, for the other attributes:
Gain(Income, D) = 0.029
Gain(Student, D) = 0.151
Gain(credit_rating, D) = 0.048

The attribute age has the highest information gain and therefore becomes the splitting attribute at the
root node of the decision tree. Branches are grown for each outcome of age, and the tuples are
partitioned accordingly. The partition for age = "youth" is:

INCOME   STUDENT   CREDIT_RATING   CLASS
High     No        Fair            No
High     No        Excellent       No
Medium   No        Fair            No
Low      Yes       Fair            Yes
Medium   Yes       Excellent       Yes

Now, calculating the information gain for this subtable (age = "youth"):

Income = "high":   S11 = 0, S12 = 2, I = 0
Income = "medium": S21 = 1, S22 = 1, I(S21, S22) = 1
Income = "low":    S31 = 1, S32 = 0, I = 0

Entropy for income:
E(income) = (2/5)(0) + (2/5)(1) + (1/5)(0) = 0.4
Gain(income) = 0.971 - 0.4 = 0.571

Similarly, Gain(student) = 0.971 and Gain(credit_rating) = 0.0208.
Gain(student) is the highest, so student becomes the splitting attribute for this branch.

A decision tree for the concept buys_computer, indicating whether a customer at AllElectronics is
likely to purchase a computer. Each internal (non-leaf) node represents a test on an attribute. Each leaf
node represents a class (either buys_computer = "yes" or buys_computer = "no").

To check this in Weka, first create a CSV file for the above problem containing the same rows and
columns as the training table; the file can be prepared in a spreadsheet and saved as CSV. A small
scikit-learn sketch is also given below.
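
A minimal sketch (scikit-learn rather than Weka) that fits a decision tree on the same table, using
entropy as the splitting criterion; the one-hot encoding step is an assumption made so that the
categorical attributes can be fed to the classifier.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The buys_computer training data from the illustration above
data = pd.DataFrame({
    "age": ["Youth", "Youth", "Middle aged", "Senior", "Senior", "Senior", "Middle aged",
            "Youth", "Youth", "Senior", "Youth", "Middle aged", "Middle aged", "Senior"],
    "income": ["High", "High", "High", "Medium", "Low", "Low", "Low",
               "Medium", "Low", "Medium", "Medium", "Medium", "High", "Medium"],
    "student": ["No", "No", "No", "No", "Yes", "Yes", "Yes",
                "No", "Yes", "Yes", "Yes", "No", "Yes", "No"],
    "credit_rating": ["Fair", "Excellent", "Fair", "Fair", "Fair", "Excellent", "Excellent",
                      "Fair", "Fair", "Fair", "Excellent", "Excellent", "Fair", "Excellent"],
    "buys_computer": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                      "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# One-hot encode the categorical attributes so the tree can split on them
X = pd.get_dummies(data.drop(columns="buys_computer"))
y = data["buys_computer"]

# criterion="entropy" makes the splits follow the information gain measure used above
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))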

Procedure for running the classifier in Weka:

Step 1:
Open the Weka Explorer, open the file, and select all the attributes. The figure gives a better
understanding of how to do that.

Step 2:
Now select the Classify tab in the tool and click the Start button; we can then see the result of the
problem as below.

Step3:

Check the main result which we got manually and the result in weka by right clicking on
the result and visualizing the tree.

The visualized tree in weka is as shown below:

10. Calculating Information gain measures

Information gain (IG) measures how much "information" a feature gives us about the class: features
that perfectly partition the data should give maximal information, while unrelated features should give
no information. It measures the reduction in entropy. CfsSubsetEval aims to identify a subset of
attributes that are highly correlated with the target while not being strongly correlated with one
another. It searches through the space of possible attribute subsets for the "best" one using the
BestFirst search method by default, although other methods can be chosen. To use a wrapper method
rather than a filter method such as CfsSubsetEval, first select WrapperSubsetEval and then configure it
by choosing a learning algorithm to apply and setting the number of cross-validation folds to use when
evaluating it on each attribute subset. A small sketch that computes information gain directly is given
after the steps below.

Steps:

▪ Open WEKA Tool.


▪ Click on WEKA Explorer.
▪ Click on Preprocessing tab button.
▪ Click on open file button.
▪ Select and Click on data option button.
▪ Choose a data set and open file.
▪ Click on select attribute tab and Choose attribute evaluator, search method algorithm

▪ Click on start button.
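
A minimal Python sketch of the information gain calculation itself, applied to the age attribute of the
buys_computer table from experiment 9; the helper function names are illustrative.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Gain(D, A) = Entropy(D) - sum_j |Dj|/|D| * Entropy(Dj)."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# age attribute and class labels from the buys_computer table of experiment 9
age = ["Youth", "Youth", "Middle aged", "Senior", "Senior", "Senior", "Middle aged",
       "Youth", "Youth", "Senior", "Youth", "Middle aged", "Middle aged", "Senior"]
buys = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

print("Gain(age) =", round(information_gain(age, buys), 3))  # approximately 0.246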

11. Classification of data using Bayesian approach

Description:
In machine learning, naïve Bayes classifiers are a family of simple probabilistic classifiers based on
applying Bayes' theorem with strong (naïve) independence assumptions between the features.

Example:

AGE     INCOME   STUDENT   CREDIT_RATING   BUYS_COMPUTER
<=30    High     No        Fair            No
<=30    High     No        Excellent       No
31-40   High     No        Fair            Yes
>40     Medium   No        Fair            Yes
>40     Low      Yes       Fair            Yes
>40     Low      Yes       Excellent       No
31-40   Low      Yes       Excellent       Yes
<=30    Medium   No        Fair            No
<=30    Low      Yes       Fair            Yes
>40     Medium   Yes       Fair            Yes
<=30    Medium   Yes       Excellent       Yes
31-40   Medium   No        Excellent       Yes
31-40   High     Yes       Fair            Yes
>40     Medium   No        Excellent       No

CLASSES:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

DATA TO BE CLASSIFIED:

X = (age <= 30, income = Medium, student = Yes, credit_rating = Fair)

P(C1) = P(buys_computer = "yes") = 9/14 = 0.643
P(C2) = P(buys_computer = "no") = 5/14 = 0.357

Computing P(X|C1) and P(X|C2), we get:

1. P(age = "<=30" | buys_computer = "yes") = 2/9
2. P(age = "<=30" | buys_computer = "no") = 3/5
3. P(income = "medium" | buys_computer = "yes") = 4/9
4. P(income = "medium" | buys_computer = "no") = 2/5
5. P(student = "yes" | buys_computer = "yes") = 6/9
6. P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
7. P(credit_rating = "fair" | buys_computer = "yes") = 6/9
8. P(credit_rating = "fair" | buys_computer = "no") = 2/5

For X = (age <= 30, income = medium, student = yes, credit_rating = fair):

P(X|C1) = P(X | buys_computer = "yes") = 2/9 * 4/9 * 6/9 * 6/9 = 288/6561 ≈ 0.044

P(X|C2) = P(X | buys_computer = "no") = 3/5 * 2/5 * 1/5 * 2/5 = 12/625 ≈ 0.019

P(C1|X) ∝ P(X|C1) * P(C1) = P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.044 * 0.643 ≈ 0.028

P(C2|X) ∝ P(X|C2) * P(C2) = P(X | buys_computer = "no") * P(buys_computer = "no") = 0.019 * 0.357 ≈ 0.007

Therefore, the conclusion is that the given data belongs to C1, since P(C1|X) > P(C2|X). A short
Python check of this calculation is given below.
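
A minimal sketch that reproduces the hand calculation above in plain Python; the counts are taken
directly from the table (9 "yes" and 5 "no" tuples).

# Prior probabilities of the two classes
p_yes, p_no = 9 / 14, 5 / 14

# P(attribute value | class) for X = (age<=30, income=medium, student=yes, credit=fair)
likelihood_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)
likelihood_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)

# Unnormalized posteriors (Bayes' theorem without the common denominator P(X))
score_yes = likelihood_yes * p_yes
score_no = likelihood_no * p_no

print(f"P(X|yes)*P(yes) = {score_yes:.3f}")   # about 0.028
print(f"P(X|no)*P(no)   = {score_no:.3f}")    # about 0.007
print("Predicted class:", "buys_computer = yes" if score_yes > score_no else "buys_computer = no")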



Checking the result in the WEKA tool:

In order to check the result in the tool we need to follow a procedure.

Step 1:
Create a CSV file with the table considered in the example above; the file (converted to ARFF by
Weka) will look as shown below.

Step 2:
Now open the Weka Explorer and select all the attributes in the table.

Step 3:

Select the Classify tab in the tool, choose the bayes folder and then the NaiveBayes classifier to see
the result as shown below.

12. Classification of data using K – nearest neighbour approach

KNN as Classifier

First, start with importing necessary python packages −

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Next, download the iris dataset from its web link as follows −

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, we need to assign column names to the dataset as follows −

headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

Now, we need to read the dataset into a pandas dataframe as follows −

dataset = pd.read_csv(path, names = headernames)
dataset.head()
sepal-length sepal-width petal-length petal-width Class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Data preprocessing will be done with the help of the following script lines.

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Next, we will divide the data into train and test splits. The following code will split the dataset into
60% training data and 40% testing data −

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.40)

Next, data scaling will be done as follows –


from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, train the model with the help of the KNeighborsClassifier class of sklearn as follows −

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 8)
classifier.fit(X_train, y_train)

At last we need to make predictions. It can be done with the help of the following script −

y_pred = classifier.predict(X_test)

Next, print the results as follows −

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)
Output

Confusion Matrix:
[[21 0 0]
[ 0 16 0]
[ 0 7 16]]
Classification Report:
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 21
Iris-versicolor 0.70 1.00 0.82 16
Iris-virginica 1.00 0.70 0.82 23
micro avg 0.88 0.88 0.88 60
macro avg 0.90 0.90 0.88 60
weighted avg 0.92 0.88 0.88 60

Accuracy: 0.8833333333333333

13. Implementation of K – means algorithm

K-means algorithm aims to partition n observations into “k clusters” in which each observation
belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in
partitioning of the data into Voronoi cells.

ILLUSTRATION:

As a simple illustration of the k-means algorithm, consider the following data set consisting of the
scores of two variables for each of five individuals.

I X1 X2

A 1 1

B 1 0

C 0 2

D 2 4

E 3 5

This data set is to be grouped into two clusters. As a first step in finding a sensible partition, let two
individuals, A and C, define the initial cluster means (centroids), giving:

Cluster     Individual   Mean Vector (Centroid)
Cluster 1   A            (1, 1)
Cluster 2   C            (0, 2)

The remaining individuals are now examined in sequence and allocated to the cluster to which they
are closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each
time a new member is added. This leads to the following series of steps (distances to the seeds A and C):

Individual   Distance to A (1,1)   Distance to C (0,2)
A            0                     1.4
B            1                     2.5
C            1.4                   0
D            3.2                   2.82
E            4.5                   4.2

The initial partition has now changed, and the two clusters at this stage have the following
characteristics:

Cluster     Individuals   Mean Vector (Centroid)
Cluster 1   A, B          (1, 0.5)
Cluster 2   C, D, E       (1.7, 3.7)

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare
each individual's distance to its own cluster mean and to that of the opposite cluster, and we find:

Individual   Distance to mean of Cluster 1   Distance to mean of Cluster 2
A            0.5                             2.7
B            0.5                             3.7
C            1.8                             2.4
D            3.6                             0.5
E            4.9                             1.9

Individual C is now relocated to Cluster 1 because it is closer to that cluster's mean. This results in the
new partition:

Cluster     Individuals   Mean Vector (Centroid)
Cluster 1   A, B, C       (0.7, 1)
Cluster 2   D, E          (2.5, 4.5)

The iterative relocation would now continue from this new partition until no more relocations occur.
However, in this example each individual is now nearer to its own cluster mean than to that of the
other cluster and the iteration stops, giving the latest partitioning as the final cluster solution. It is also
possible that the k-means algorithm won't converge to a final solution; in that case, it is better to stop
the algorithm after a pre-chosen maximum number of iterations. A short scikit-learn sketch of this
example is given below.
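
A minimal sketch that runs the same five-point example through scikit-learn's KMeans; the explicit
initial centroids mirror the manual choice of A and C above.

import numpy as np
from sklearn.cluster import KMeans

# The five individuals A-E with their two scores
X = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]])

# Start from the same seeds used in the manual example (A and C)
init_centroids = np.array([[1.0, 1.0], [0.0, 2.0]])
kmeans = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)

print("Cluster labels:", kmeans.labels_)          # which cluster each of A-E falls into
print("Final centroids:\n", kmeans.cluster_centers_)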
Checking the solution in weka:

In order to check the result in the tool we need to follow a


procedure. Step 1:
Create a csv file with the above table considered in the example. the csv file will look as shown
below:

Step 2:
Now open weka explorer and then select all the attributes in the table.

Step 3:
Select the Cluster tab in the tool and choose the SimpleKMeans technique to see the result as shown
below.

14. Implementation of BIRCH algorithm

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining
algorithm that performs hierarchical clustering over large data sets. With modifications, it can also be
used to accelerate k-means clustering and Gaussian mixture modeling with the expectation-
maximization algorithm. An advantage of BIRCH is its ability to incrementally and dynamically cluster
incoming, multi-dimensional metric data points to produce the best quality clustering for a given set of
resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the
database.

Algorithm

The tree structure of the given data is built by the BIRCH algorithm called the Clustering feature tree
(CF tree). This algorithm is based on the CF (clustering features) tree. In addition, this algorithm uses a
tree-structured summary to create clusters.

In the context of the CF tree, the algorithm compresses the data into sets of CF nodes. Nodes that have
several sub-clusters can be called CF subclusters. These CF subclusters are situated in non-terminal
CF nodes.

The CF tree is a height-balanced tree that gathers and manages clustering features and holds the
necessary information about the given data for further hierarchical clustering. This prevents the need
to work with the whole data given as input. Each clustering feature (CF) of a set of data points is
represented by three numbers (N, LS, SS).

o N = number of items in subclusters


o LS = vector sum of the data points
o SS = sum of the squared data points

There are mainly four phases which are followed by the algorithm of BIRCH.

o Scanning data into memory.


o Condense data (resize data).

o Global clustering.
o Refining clusters.

Two of these phases (condensing and refining) are optional; they come into the process when more
clarity is required. Scanning the data is just like loading data into a model: after loading, the algorithm
scans the whole data set and fits it into the CF tree.

In condensing, it resets and resizes the data for a better fit into the CF tree. In global clustering, it
sends the CF tree for clustering using an existing clustering algorithm. Finally, refining fixes the
problem of CF trees where points with the same values are assigned to different leaf nodes. A small
scikit-learn sketch of BIRCH is given below.
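
A minimal sketch using scikit-learn's Birch implementation on synthetic data; the threshold, branching
factor and number of clusters are illustrative choices.

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three natural groups (illustrative)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# threshold and branching_factor control how the CF tree is built;
# n_clusters sets the final (global) clustering step
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print("Number of CF subclusters:", len(model.subcluster_centers_))
print("First ten cluster labels:", labels[:10])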

15. Implementation of PAM algorithm

PAM is the most powerful of the three k-medoids algorithms (PAM, CLARA and CLARANS), but it
has the disadvantage of higher time complexity. The following k-medoids clustering is performed
using PAM.

Algorithm:

Given the value of k and unlabelled data:

1. Choose k number of random points from the data and assign these k points to k number of
clusters. These are the initial medoids.
2. For all the remaining data points, calculate the distance from each medoid and assign it to
the cluster with the nearest medoid.
3. Calculate the total cost (Sum of all the distances from all the data points to the medoids)
4. Select a random point as the new medoid and swap it with the previous medoid. Repeat 2 and 3 steps.
5. If the total cost of the new medoid is less than that of the previous medoid, make the new
medoid permanent and repeat step 4.
6. If the total cost of the new medoid is greater than the cost of the previous medoid, undo the swap and
repeat step 4.
7. The Repetitions have to continue until no change is encountered with new medoids to classify data
points.

Here is an example to make the theory clear.

Data set:

      x   y
0     5   4
1     7   7
2     1   3
3     8   6
4     4   9

Scatter plot:

If k is given as 2, we need to break down the data points into 2 clusters.

1. Initial medoids: M1(1, 3) and M2(4, 9)

2. Calculation of distances

Manhattan distance: |x1 - x2| + |y1 - y2|

      x   y   From M1(1, 3)   From M2(4, 9)
0     5   4   5               6
1     7   7   10              5
2     1   3   -               -
3     8   6   10              7
4     4   9   -               -

Cluster 1: 0
Cluster 2: 1, 3

Calculation of total cost: (5) + (5 + 7) = 17

New random medoid: (5, 4)

M1(5, 4) and M2(4, 9):


      x   y   From M1(5, 4)   From M2(4, 9)
0     5   4   -               -
1     7   7   5               5
2     1   3   5               9
3     8   6   5               7
4     4   9   -               -

Cluster 1: 2, 3
Cluster 2: 1

Calculation of total cost: (5 + 5) + 5 = 15
This is less than the previous cost, so the new medoid (5, 4) is kept.

Next random medoid: (7, 7)

M1(5, 4) and M2(7, 7)

      x   y   From M1(5, 4)   From M2(7, 7)
0     5   4   -               -
1     7   7   -               -
2     1   3   5               10
3     8   6   5               2
4     4   9   6               5

Cluster 1: 2
Cluster 2: 3, 4

Calculation of total cost: (5) + (2 + 5) = 12
This is less than the previous cost, so the new medoid (7, 7) is kept.

Next random medoid: (8, 6)

M1(7, 7) and M2(8, 6)

      x   y   From M1(7, 7)   From M2(8, 6)
0     5   4   5               5
1     7   7   -               -
2     1   3   10              10
3     8   6   -               -
4     4   9   5               7

Cluster 1: 4
Cluster 2: 0, 2

Calculation of total cost: (5) + (5 + 10) = 20
This is greater than the previous cost, so the swap is undone.

Hence, the final medoids are M1(5, 4) and M2(7, 7), with
Cluster 1: 2
Cluster 2: 3, 4
Total cost: 12

Time complexity: O(k * (n - k)^2)

A small Python sketch of this procedure is given below.
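
A minimal plain-Python sketch of the PAM swap procedure on the same five points, using Manhattan
distance; the greedy "try every swap" loop is a simplification of the randomized description above.

from itertools import product

points = [(5, 4), (7, 7), (1, 3), (8, 6), (4, 9)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(medoids):
    # Each point contributes its distance to the nearest medoid
    return sum(min(manhattan(p, points[m]) for m in medoids) for p in points)

# Start from the same initial medoids as the worked example: points 2 (1,3) and 4 (4,9)
medoids = [2, 4]
improved = True
while improved:
    improved = False
    # Try swapping every current medoid with every non-medoid point
    for m, candidate in product(list(medoids), range(len(points))):
        if candidate in medoids:
            continue
        trial = [candidate if x == m else x for x in medoids]
        if total_cost(trial) < total_cost(medoids):
            medoids, improved = trial, True

clusters = {m: [i for i, p in enumerate(points)
                if min(medoids, key=lambda mm: manhattan(p, points[mm])) == m]
            for m in medoids}
print("Final medoids:", [points[m] for m in medoids], "cost:", total_cost(medoids))
print("Clusters:", clusters)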



16. Implementation of DBSCAN algorithm

Implementation steps for the DBSCAN algorithm:

Now, we will perform the implementation of the DBSCAN algorithm in Python. We will do this in
steps, as mentioned earlier, so that the implementation does not become complex and can be
understood easily. We have to follow the steps below in order to implement the DBSCAN algorithm
and its logic inside a Python program.

Step 1: Importing all the required libraries:

First and foremost, we have to import all the required libraries that we installed in the prerequisites
part, so that we can use their functions while implementing the DBSCAN algorithm.

Here, we first import all the required libraries or modules inside the program:

# Importing numpy library as nmp
import numpy as nmp
# Importing pandas library as pds
import pandas as pds
# Importing matplotlib library as pplt
import matplotlib.pyplot as pplt
# Importing DBSCAN from cluster module of Sklearn library
from sklearn.cluster import DBSCAN
# Importing StandardScaler and normalize from preprocessing module of Sklearn library
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
# Importing PCA from decomposition module of Sklearn
from sklearn.decomposition import PCA

Step 2: Loading the Data:

In this step, we have to load that data, and we can do this by importing or loading the dataset
(that is required in the DBSCAN algorithm to work on it) inside the program. To load the dataset
inside the program, we will use the read.csv() function of the panda's library and print the
information from the dataset as we have done below:

# Loading the data inside an initialized variable
M = pds.read_csv('sampleDataset.csv')  # Path of dataset file
# Dropping the CUST_ID column from the dataset with drop() function
M = M.drop('CUST_ID', axis = 1)
# Using fillna() function to handle missing values
M.fillna(method ='ffill', inplace = True)
# Printing dataset head in output
print(M.head())

Output:

   BALANCE      BALANCE_FREQUENCY  ...  PRC_FULL_PAYMENT  TENURE
0    40.900749           0.818182  ...          0.000000      12
1  3202.467416           0.909091  ...          0.222222      12
2  2495.148862           1.000000  ...          0.000000      12
3  1666.670542           0.636364  ...          0.000000      12
4   817.714335           1.000000  ...          0.000000      12

[5 rows x 17 columns]

The data as given in the output above will be printed when we run the program, and we will work on
this data from the dataset file we loaded.

Step 3: Preprocessing the data:

Now, we will start preprocessing the data of the dataset in this step by using the functions of
preprocessing module of the Sklearn library. We have to use the following technique while
preprocessing the data with Sklearn library functions:

# Initializing a variable with the StandardScaler() function
scalerFD = StandardScaler()
# Transforming the data of the dataset with the scaler
M_scaled = scalerFD.fit_transform(M)
# To make sure that data will follow a Gaussian distribution,
# we will normalize the scaled data with the normalize() function
M_normalized = normalize(M_scaled)
# Now we will convert the numpy array into a pandas dataframe
M_normalized = pds.DataFrame(M_normalized)

Step 4: Reduce the dimensionality of the data:

In this step, we will be reducing the dimensionality of the scaled and normalized data so that the
data can be visualized easily inside the program. We have to use the PCA function in the following
way in order to transform the data and reduce its dimensionality:

# Initializing a variable with the PCA() function
pcaFD = PCA(n_components = 2)  # components of data
# Transforming the normalized data with PCA
M_principal = pcaFD.fit_transform(M_normalized)
# Making dataframes from the transformed data
M_principal = pds.DataFrame(M_principal)
# Creating two columns in the transformed data
M_principal.columns = ['C1', 'C2']
# Printing the head of the transformed data
print(M_principal.head())

Output:

C1 C2
0 -0.489949 -0.679976
1 -0.519099 0.544828
2 0.330633 0.268877
3 -0.481656 -0.097610
4 -0.563512 -0.482506
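
The walkthrough stops after the dimensionality-reduction step. A minimal sketch of the remaining
step, building and visualizing the DBSCAN clustering on M_principal, is given below; the eps and
min_samples values are assumptions and would need tuning for a real dataset.

# Step 5 (sketch): Building the clustering model with DBSCAN
dbscanFD = DBSCAN(eps=0.0375, min_samples=3)
labels = dbscanFD.fit_predict(M_principal)

# Visualizing the clusters found in the two principal components C1 and C2
pplt.scatter(M_principal['C1'], M_principal['C2'], c=labels, cmap='viridis', s=10)
pplt.title('DBSCAN clusters (noise points labelled -1)')
pplt.show()

# Number of clusters found, ignoring noise points labelled -1
print('Clusters found:', len(set(labels) - {-1}))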
