DWDM Lab Manual Exercises

This manual outlines the process of creating a data warehouse using tools such as MySQL, SQLyog, and Microsoft Visual Studio, detailing the steps for building tables, designing multi-dimensional models, and implementing ETL processes. It also covers OLAP operations using Microsoft Excel and introduces the WEKA toolkit for data mining and machine learning. Key components include data extraction, cleaning, transformation, and loading, along with the exploration of machine learning techniques.


1. Creation of a Data Warehouse.


 Build Data Warehouse/Data Mart (using open-source tools like Pentaho Data Integration Tool and Pentaho Business Analytics, or other data warehouse tools like Microsoft SSIS, Informatica, Business Objects, etc.)
ANS:
Identify source tables and populate sample data
In this task, we use the MySQL Administrator and SQLyog Enterprise tools to build and identify tables in a database and to populate (fill) those tables with sample data. A data warehouse is constructed by
integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making. We build a data warehouse by integrating all the tables in the database and analysing that data.

The figure below shows the MySQL Administrator connection establishment.

After a successful login, a new window opens as shown below.



There are different options available in MySQL Administrator. After the connection is established through MySQL Administrator, we use another tool, SQLyog Enterprise, for building and identifying tables in a database. Below we can see the SQLyog Enterprise window.

In the left-side navigation we can see the different databases and their related tables. Now we build tables and populate them with data through SQL queries. These tables can then be used for building the data warehouse.

In the above two windows, we created a database named "sample" and, in that database, two tables named "user_details" and "hockey" through SQL queries.
Now we populate the two created tables with sample data through SQL queries, as represented in the windows below and in the sketch that follows.
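For reference, the following is a minimal JDBC sketch of the kind of DDL and INSERT statements SQLyog runs for this step. It assumes the MySQL Connector/J driver is on the classpath, and the column names and sample rows are illustrative assumptions, not the exact ones shown in the screenshots.

// Sketch: creating and populating the "sample" database tables through JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PopulateSample {
    public static void main(String[] args) throws Exception {
        // Adjust host, user and password to your local MySQL installation.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/", "root", "password");
             Statement st = con.createStatement()) {
            st.executeUpdate("CREATE DATABASE IF NOT EXISTS sample");
            st.executeUpdate("USE sample");
            st.executeUpdate("CREATE TABLE IF NOT EXISTS user_details ("
                    + "user_id INT PRIMARY KEY, user_name VARCHAR(50), city VARCHAR(50))");
            st.executeUpdate("CREATE TABLE IF NOT EXISTS hockey ("
                    + "player_id INT PRIMARY KEY, player_name VARCHAR(50), goals INT)");
            // Populate a couple of sample rows.
            st.executeUpdate("INSERT INTO user_details VALUES (1, 'Ravi', 'Hyderabad')");
            st.executeUpdate("INSERT INTO hockey VALUES (1, 'Dhyan', 12)");
            System.out.println("sample database created and populated");
        }
    }
}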

Through MySQL administrator & SQLyog, we can import databases from other
sources (.XLS, .CSV, .sql) & also we can export our databases as backup for
further processing. We can connect MySQL to other applications for data analysis
& reporting.

 Design multi-dimensional data models, namely Star, Snowflake and Fact Constellation schemas, for any one enterprise (e.g. Banking, Insurance, Finance, Healthcare, Manufacturing, Automobiles, Sales, etc.)
ANS:
The multi-dimensional model was developed for implementing data warehouses; it provides both a mechanism to store data and a way to carry out business analysis. The primary components of a dimensional model are dimensions and facts. There are different types of multi-dimensional data models. They are:
1. Star Schema Model
2. Snow Flake Schema Model
3. Fact Constellation Model.
Now, we are going to design these multi-dimensional models for the
Marketing enterprise.
First, we need to build the tables in a database through SQLyog as shown below.

In the above window, the left-side navigation bar shows a database named "sales_dw" in which six different tables (dimcustdetails, dimcustomer, dimproduct, dimsalesperson, dimstores, factproductsales) have been created.
After creating the tables in the database, we use a tool called "Microsoft Visual Studio 2012 for Business Intelligence" to build the multi-dimensional models.

The above window shows Microsoft Visual Studio before a project is created; the right-side navigation bar contains options such as Data Sources, Data Source Views, Cubes and Dimensions.
Through Data Sources, we can connect to our MySQL database named "sales_dw". All the tables in that database are then retrieved into the tool for creating multi-dimensional models.
With data source views and cubes, we can see the retrieved tables arranged as multi-dimensional models. We also need to add dimensions through the Dimensions option. In general, multi-dimensional models consist of dimension tables and fact tables.

Star Schema Model:


A Star schema model is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. It is the simplest style of data warehouse schema.
The entity relationship diagram of this schema resembles a star, with points radiating from a central table, as seen in the window implemented in Visual Studio below and in the DDL sketch that follows.
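As a textual counterpart to the diagram, the sketch below creates the same kind of star join over JDBC: every dimension table is joined to factproductsales through a primary key to foreign key relationship, and the dimensions are not joined to each other. The column definitions are assumptions chosen for illustration, not the exact columns of the sales_dw tables.

// Sketch of the star join in sales_dw, executed through JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StarSchemaDdl {
    public static void main(String[] args) throws Exception {
        String[] ddl = {
            "CREATE TABLE dimproduct (product_id INT PRIMARY KEY, product_name VARCHAR(50))",
            "CREATE TABLE dimstores (store_id INT PRIMARY KEY, city VARCHAR(50))",
            "CREATE TABLE dimsalesperson (salesperson_id INT PRIMARY KEY, name VARCHAR(50))",
            "CREATE TABLE dimcustomer (customer_id INT PRIMARY KEY, name VARCHAR(50))",
            // The fact table references every dimension through a foreign key.
            "CREATE TABLE factproductsales (sales_id INT PRIMARY KEY,"
                + " product_id INT, store_id INT, salesperson_id INT, customer_id INT,"
                + " quantity INT, sales_amount DECIMAL(10,2),"
                + " FOREIGN KEY (product_id) REFERENCES dimproduct(product_id),"
                + " FOREIGN KEY (store_id) REFERENCES dimstores(store_id),"
                + " FOREIGN KEY (salesperson_id) REFERENCES dimsalesperson(salesperson_id),"
                + " FOREIGN KEY (customer_id) REFERENCES dimcustomer(customer_id))"
        };
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/sales_dw", "root", "password");
             Statement st = con.createStatement()) {
            for (String stmt : ddl) {
                st.executeUpdate(stmt);  // dimensions first, then the fact table
            }
        }
    }
}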

Snow Flake Schema:


It is slightly different from the star schema in that the dimension tables of a star schema are organized into a hierarchy by normalizing them.
A snowflake schema is represented by a centralized fact table connected to multiple dimension tables. Snowflaking affects only the dimension tables, not the fact tables. We developed a snowflake schema for the sales_dw database with the Visual Studio tool as shown below.

Fact Constellation Schema:


Fact Constellation is a set of fact tables that share some dimension tables. In this
schema there are two or more fact tables. We developed fact constellation in
visual studio as shown below. Fact tables are labelled in yellow color.

 Write ETL scripts and implement using data warehouse tools.


ANS:
ETL (Extract-Transform-Load):
ETL comes from data warehousing and stands for Extract-Transform-Load. ETL covers the process of how data is loaded from the source system into the data warehouse. Currently, ETL includes cleaning as a separate step, so the sequence becomes Extract-Clean-Transform-Load. Let us briefly describe each step of the ETL process.

Process Extract:
The Extract step covers the data extraction from the source system and makes
it accessible for further processing. The main objective of the extract step is to
retrieve all the required data from the source system with as little resources as
possible. The extract step should be designed in a way that it does not
negatively affect the source system in terms of performance, response time or
any kind of locking.
There are several ways to perform the extract:
 Update notification - if the source system is able to provide a
notification that a record has been changed and describe the change,
this is the easiest way to get the data.
 Incremental extract - some systems may not be able to provide
notification that an update has occurred, but they are able to identify
which records have been modified and provide an extract of such
records. During further ETL steps, the system needs to identify the
changes and propagate them down. Note that by using a daily extract, we
may not be able to handle deleted records properly.
 Full extract - some systems are not able to identify which data has
been changed at all, so a full extract is the only way one can get the
data out of the system. The full extract requires keeping a copy of the
last extract in the same format in order to be able to identify changes.
Full extract handles deletions as well.
 When using incremental or full extracts, the extract frequency is
extremely important, particularly for full extracts, where the data volumes
can be in the tens of gigabytes.

Clean:
The cleaning step is one of the most important as it ensures the quality of the
data in the data warehouse.
Cleaning should perform basic data unification rules, such as:
 Making identifiers unique (sex categories Male/Female/Unknown,
M/F/null, Man/Woman/Not Available are translated to standard
Male/Female/Unknown)
 Convert null values into standardized Not Available/Not Provided value
 Convert phone numbers, ZIP codes to a standardized form
 Validate address fields, convert them into proper naming, e.g.
Street/St/St./Str./Str
 Validate address fields against each other (State/Country, City/State,
City/ZIP code, City/Street).

Transform:
The transform step applies a set of rules to transform the data from the source
to the target. This includes converting any measured data to the same
dimension (i.e. conformed dimension) using the same units so that they can
later be joined. The transformation step also requires joining data from
several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.
Load:
During the load step, it is necessary to ensure that the load is performed
correctly and with as little resources as possible. The target of the Load
process is often a database. In order to make the load process efficient, it is
helpful to disable any constraints and indexes before the load and enable them
back only after the load completes. Referential integrity needs to be
maintained by the ETL tool to ensure consistency.
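To make the four steps concrete, here is a minimal Extract-Clean-Transform-Load sketch in Java over JDBC. The source and target table names (src.customers, sales_dw.dimcustdetails) and the cleaning rules are assumptions for illustration only.

// Minimal ETL sketch: extract rows over JDBC, clean/unify them, load into the warehouse.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class MiniEtl {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/", "root", "password");
             Statement extract = con.createStatement();
             PreparedStatement load = con.prepareStatement(
                 "INSERT INTO sales_dw.dimcustdetails (cust_id, cust_name, gender) VALUES (?, ?, ?)")) {
            // Extract: pull the required rows from the source system.
            ResultSet rs = extract.executeQuery("SELECT id, name, sex FROM src.customers");
            while (rs.next()) {
                // Clean: unify identifiers and replace nulls with a standard value.
                String name = rs.getString("name");
                String sex = rs.getString("sex");
                String gender = (sex == null) ? "Unknown"
                        : sex.trim().toUpperCase().startsWith("M") ? "Male" : "Female";
                // Transform + Load: write the conformed row into the warehouse table.
                load.setInt(1, rs.getInt("id"));
                load.setString(2, name == null ? "Not Available" : name.trim());
                load.setString(3, gender);
                load.executeUpdate();
            }
        }
    }
}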

Managing ETL Process:


The ETL process seems quite straightforward. As with every application,
there is a possibility that the ETL process fails. This can be caused by missing
extracts from one of the systems, missing values in one of the reference tables,
or simply a connection or power outage. Therefore, it is necessary to design
the ETL process keeping fail-recovery in mind.
Staging:
It should be possible to restart, at least, some of the phases independently from
the others. For example, if the transformation step fails, it should not be
necessary to restart the Extract step. We can ensure this by implementing
proper staging. Staging means that the data is simply dumped to the location
(called the Staging Area) so that it can then be read by the next processing
phase. The staging area is also used during the ETL process to store intermediate
results of processing. However, the staging area should be accessed by the ETL
process only; it should never be available to anyone else, particularly not to end
users, since it is not intended for data presentation and may contain incomplete
or in-the-middle-of-processing data.

ETL Tool Implementation:


When you are about to use an ETL tool, there is a fundamental decision to be
made: will the company build its own data transformation tool or will it use an
existing tool?
Building your own data transformation tool (usually a set of shell scripts) is
the preferred approach for a small number of data sources which reside in
storage of the same type. The reason is that the effort to implement the necessary
transformations is small, thanks to similar data structures and a common system
architecture. Also, this approach saves licensing costs and there is no need to train
staff in a new tool. This approach, however, is risky from the TCO (total cost of
ownership) point of view: if the transformations become more sophisticated over
time, or other systems need to be integrated, the complexity of such an ETL system
grows while its manageability drops significantly. Building your own tool also often
amounts to re-inventing the wheel.
There are many ready-to-use ETL tools on the market. The main benefit of
using off-the-shelf ETL tools is the fact that they are optimized for the ETL
process by providing connectors to common data sources like databases, flat
files, mainframe systems, xml, etc. They provide a means to implement data
transformations easily and consistently across various data sources. This
includes filtering, reformatting, sorting, joining, merging, aggregation, and
other operations ready to use. The tools also support transformation
scheduling, version control, monitoring, and unified metadata management.
Some of the ETL tools are even integrated with BI tools.

Some of the Well Known ETL Tools:


The most well-known commercial tools are Ab Initio, IBM InfoSphere
DataStage, Informatica, Oracle Data Integrator, and SAP Data Integrator.
Several open-source ETL tools are also available, such as OpenRefine, Apatar, CloverETL, Pentaho and Talend.

Of the above tools, we use the OpenRefine 2.8 ETL tool on different sample datasets for extraction, data cleaning, transformation and loading.

 Perform various OLAP operations such as slice, dice, roll-up, drill-down and pivot.
ANS:
OLAP Operations are being implemented practically using Microsoft Excel.
Procedure for OLAP Operations:
1. Open Microsoft Excel, go to the Data tab at the top and click on "Existing Connections".
2. The Existing Connections window opens; click the "Browse for more" option to import a .cub extension file for performing OLAP operations. As a sample, the music.cub file is used.

3. As shown in the above window, select "PivotTable Report" and click "OK".


4. We now have all the music.cub data for analysing the different OLAP operations. First, the drill-down operation is performed as shown below.
In the above window, we selected the year '2008' in the 'Electronic' category; the Drill-Down option is then automatically enabled in the top navigation options. When we click the 'Drill-Down' option, the window below is displayed.

5. Now we perform the roll-up (drill-up) operation: in the above window, the January month is selected and the Drill-up option is automatically enabled at the top. When we click the Drill-up option, the window below is displayed.

6. The next OLAP operation, slicing, is performed by inserting a slicer from the top navigation options.

While inserting slicers for the slicing operation, we select two dimensions (e.g. CategoryName and Year) with only one measure (e.g. Sum of Sales). After inserting a slicer and adding a filter (CategoryName: AVANT ROCK and BIG BAND; Year: 2009 and 2010), we get the table shown below.

7. The dicing operation is similar to slicing. Here we select three dimensions (CategoryName, Year, RegionCode) and two measures (Sum of Quantity, Sum of Sales) through the "Insert Slicer" option, and then add a filter for CategoryName, Year and RegionCode as shown below.

8. Finally, the Pivot (rotate) OLAP operation is performed by swapping rows (Order
Date- Year) & columns (Values-Sum of Quantity & Sum of Sales) through right side
bottom navigation bar as shown below.

After swapping (rotating), we get the result represented below, with a pie chart for the Classical category and year-wise data.
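For comparison only, the same OLAP operations can be expressed as relational queries over a star-schema fact table. The sketch below is an analogy to the Excel pivot operations above, not part of the lab procedure, and the table and column names (factsales, category, year, region) are assumptions.

// Relational analogues of roll-up, slice and dice, run over JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OlapQueries {
    public static void main(String[] args) throws Exception {
        // Roll-up: aggregate sales from month level up to year level.
        String rollUp = "SELECT category, year, SUM(sales) FROM factsales GROUP BY category, year";
        // Slice: fix one dimension (year = 2009) and keep the others.
        String slice = "SELECT category, SUM(sales) FROM factsales WHERE year = 2009 GROUP BY category";
        // Dice: restrict several dimensions at once (category, year and region).
        String dice = "SELECT category, region, SUM(quantity), SUM(sales) FROM factsales "
                + "WHERE category IN ('AVANT ROCK','BIG BAND') AND year IN (2009, 2010) "
                + "GROUP BY category, region";
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/sales_dw", "root", "password");
             Statement st = con.createStatement()) {
            for (String q : new String[] {rollUp, slice, dice}) {
                try (ResultSet rs = st.executeQuery(q)) {
                    System.out.println("Ran: " + q);
                }
            }
        }
    }
}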

2. Explore machine learning tool “WEKA”


 Explore WEKA Data Mining/Machine Learning Toolkit.
ANS:
WEKA (Waikato Environment for Knowledge Analysis) is open-source software that provides tools for data preprocessing, implementations of several machine learning algorithms, and visualization tools, so that we can develop machine learning techniques and apply them to real-world data mining problems.
Features of WEKA -

a. Preprocessor – Most data is raw; the preprocessor is therefore used to clean noisy data.
b. Classify – After preprocessing the data, we assign classes or categories to items.
c. Cluster – In Clustering, a dataset is arranged in different groups/clusters based
on some similarities.
d. Associate – Association rules highlight all the associations and correlations
between items of a dataset.
e. Select Attributes – Every dataset contains a lot of attributes; only significantly
valuable attributes are selected for building a good model.
f. Visualize – In Visualization, different plot matrices and graphs are available to
show the trends and errors identified by the model.

 Downloading and/or installation of WEKA data mining toolkit.


ANS:
1. Go to the Weka website, http://www.cs.waikato.ac.nz/ml/weka/, and
download the software.
2. Select the appropriate link corresponding to the version of the software based
on your operating system and whether or not you already have Java VM
running on your machine.
3. The link will forward you to a site where you can download the software from
a mirror site. Save the self-extracting executable to disk and then double click
on it to install Weka. Answer yes or next to the questions during the
installation.
4. Click yes to accept the Java agreement if necessary. After you install the
program Weka should appear on your start menu under Programs (if you are
using Windows).
5. To run Weka, from the Start menu select Programs, then Weka. You will see
the Weka GUI Chooser. Select Explorer; the Weka Explorer will then launch.

 Understand the features of WEKA toolkit such as Explorer, Knowledge Flow


interface, Experimenter, command-line interface.
ANS:
The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting
point for launching Weka's main GUI applications and supporting tools. If one
prefers an MDI ("Multiple Document Interface") appearance, then this is
provided by an alternative launcher called "Main" (class weka.gui.Main).

The GUI Chooser application allows you to run five different types of
applications -
 The Explorer is the central panel where most data mining tasks are
performed.
 The Experimenter panel is used to run experiments and conduct statistical
tests between learning schemes.
 The KnowledgeFlow panel is used to provide an interface to drag and
drop components, connect them to form a knowledge flow and analyze the
data and results.
 The WorkBench panel combines all the other panels in a single, unified
interface.
 The Simple CLI panel provides the command-line interface powers to run
WEKA.
The Explorer - When you click on the Explorer button in the Applications selector,
it opens the following screen.

The Weka Explorer is designed to investigate your machine learning dataset. It


is useful when you are thinking about different data transforms and modeling
algorithms that you could investigate with a controlled experiment later. It is excellent
for getting ideas and playing what-if scenarios.
The interface is divided into 6 tabs, each with a specific function:
1. The preprocess tab is for loading your dataset and applying filters to transform
the data into a form that better exposes the structure of the problem to the
modelling processes. Also provides some summary statistics about loaded
data.
2. The classify tab is for training and evaluating the performance of different
machine learning algorithms on your classification or regression problem.
Algorithms are divided up into groups, results are kept in a result list and
summarized in the main Classifier output.
3. The cluster tab is for training and evaluating the performance of different
unsupervised clustering algorithms on your unlabelled dataset. Like the
Classify tab, algorithms are divided into groups, results are kept in a result list
and summarized in the main Clustered output.
4. The associate tab is for automatically finding associations in a dataset. The
techniques are often used for market basket analysis type data mining
problems and require data where all attributes are categorical.
5. The select attributes tab is for performing feature selection on the loaded
dataset and identifying those features that are most likely to be relevant in
developing a predictive model.
6. The visualize tab is for reviewing pairwise scatterplot matrix of each attribute
plotted against every other attribute in the loaded dataset. It is useful to get an
idea of the shape and relationship of attributes that may aid in data filtering,
transformation, and modelling.
The Experimenter - When you click on the Experimenter button in
the Applications selector, it opens the following screen.

The Weka Experiment Environment is for designing controlled experiments,


running them, then analyzing the results collected.
The interface is split into 3 tabs.
1. The setup tab is for designing an experiment. This includes the file where
results are written, the test setup in terms of how algorithms are evaluated, the
datasets to model and the algorithms to model them. The specifics of an
experiment can be saved for later use and modification.
 Click the “New” button to create a new Experiment.

 Click the “Add New” button in the Datasets pane and select the
required dataset (ARFF format files).
 Click the “Add New” button in the “Algorithms” pane and click “OK”
to add the required algorithm.
2. The run tab is for running your designed experiments. Experiments can be
started and stopped. There is not a lot to it.
 Click the “Start” button to run the small experiment you designed.
3. The analyze tab is for analyzing the results collected from an experiment.
Results can be loaded from a file, from the database or from an experiment
just completed in the tool. A no. of performance measures are collected from a
given experiment which can be compared between algorithms using tools like
statistical significance.
 Click the “Experiment” button in the “Source” pane to load the results
from the experiment you just ran.
 Click the “Perform Test” button to summarize the classification
accuracy results for the single algorithm in the experiment.
The KnowledgeFlow – When you click on the KnowledgeFlow button in the
Applications selector, it opens the following screen.

The Weka KnowledgeFlow Environment is a graphical workflow tool for


designing a machine learning pipeline from data source to results summary, and much
more. Once designed, the pipeline can be executed and evaluated within the tool.
Features of the KnowledgeFlow:
 Intuitive data flow style layout.
 Process data in batches or incrementally.
 Process multiple batches or streams in parallel!
 chain filters together.
 View models produced by classifiers for each fold in a cross validation.
 Visualize performance of incremental classifiers during processing.
The WorkBench – When you click on the WorkBench button in the Applications
selector, it opens the following screen.

The Weka Workbench is an environment that combines all the GUI interfaces
into a single interface. It is useful if you find yourself jumping a lot between two or
more different interfaces, such as between the Explorer and the Experiment
Environment. This can happen if you try out a lot of what if’s in the Explorer and
quickly take what you learn and put it into controlled experiments.
The Simple CLI – When you click on the Simple CLI button in the Applications
selector, it opens the following screen.

Weka can be used from a simple Command Line Interface (CLI). This is
powerful because you can write shell scripts to use the full API from command line
calls with parameters, allowing you to build models, run experiments and make
predictions without a graphical user interface.
The Simple CLI provides an environment where you can quickly and easily
experiment with the Weka command line interface commands.

 Navigate the options available in the WEKA (ex. Select attributes panel,
Preprocess panel, Classify panel, Cluster panel, Associate panel and Visualize
panel)
ANS:
EXPLORER PANEL
Preprocessor Panel
1. A variety of dataset formats can be loaded: WEKA's ARFF format (.arff
extension), CSV format (.csv extension), C4.5 format (.data & .names
extension), or serialized Instances format (.bsi extension).
2. Load a standard dataset in the data/ directory of your Weka installation,
specifically data/breast-cancer.arff.

Classify Panel
Test Options
1. The result of applying the chosen classifier will be tested according to the
options that are set by clicking in the Test options box.
2. There are four test modes:
 Use training set: The classifier is evaluated on how well it predicts the
class of the instances it was trained on.
 Supplied test set: The classifier is evaluated on how well it predicts
the class of a set of instances loaded from a file. Clicking the Set...
button brings up a dialog allowing you to choose the file to test on.
 Cross-validation: The classifier is evaluated by cross-validation, using
the number of folds that are entered in the Folds text field.
 Percentage split: The classifier is evaluated on how well it predicts a
certain percentage of the data which is held out for testing. The amount
of data held out depends on the value entered in the % field.
3. Click the “Start” button to run the ZeroR classifier on the dataset and
summarize the results.

Cluster Panel
1. Click the “Start” button to run the EM clustering algorithm on the dataset and
summarize the results.

Associate Panel
1. Click the “Start” button to run the Apriori association algorithm on the dataset
and summarize the results.

Select Attributes Panel


1. Click the “Start” button to run the CfsSubsetEval algorithm with
a BestFirst search on the dataset and summarize the results.

Visualize Panel
1. Increase the point size and the jitter and click the “Update” button to get an
improved plot of the categorical attributes of the loaded dataset.

EXPERIMENTER
Setup Panel
1. Click the “New” button to create a new Experiment.
2. Click the “Add New” button in the Datasets pane and select
the data/diabetes.arff dataset.
3. Click the “Add New” button in the “Algorithms” pane and click “OK” to add
the ZeroR algorithm.

Run Panel
1. Click the “Start” button to run the small experiment you designed.

Analyse Panel
1. Click the “Experiment” button in the “Source” pane to load the results from the
experiment you just ran.
2. Click the “Perform Test” button to summarize the classification accuracy results
for the single algorithm in the experiment.

 Study the ARFF file format. Explore the available data sets in WEKA. Load a
data set (e.g. Weather dataset, Iris dataset, etc.)
ANS:
1. An ARFF (Attribute-Relation File Format) file is an ASCII text file that
describes a list of instances sharing a set of attributes.
2. ARFF files have two distinct sections – The Header & the Data.
 The Header describes the name of the relation, a list of the attributes,
and their types.
 The Data section contains a comma separated list of data.
The ARFF Header Section
The ARFF Header section of the file contains the relation declaration and attribute
declarations.
The @relation Declaration
The relation name is defined as the first line in the ARFF file. The format
is: @relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes
spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of @attribute
statements. Each attribute in the data set has its own @attribute statement which
uniquely defines the name of that attribute and its data type. The order the
attributes are declared indicates the column position in the data section of the file.
The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If spaces are
to be included in the name, then the entire name must be quoted.
The <datatype> can be any of the four types:
1. numeric
2. <nominal-specification>
3. string
4. date [<date-format>]
ARFF Data Section
The ARFF Data section of the file contains the data declaration line and the actual
instance lines.
The @data Declaration
The @data declaration is a single line denoting the start of the data segment in the
file. The format is:
@data
The instance data:
Each instance is represented on a single line, with carriage returns denoting the
end of the instance.
Attribute values for each instance are delimited by commas. They must appear in
the order that they were declared in the header section (i.e. the data corresponding
to the nth @attribute declaration is always the nth field of the attribute).
Missing values are represented by a single question mark, as in:
@data
4.4, ?, 1.5, ?, Iris-setosa

An example header on the standard IRIS dataset looks like this:


% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%[email protected])
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}

The Data of the ARFF file looks like the following:


@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
4.6, 3.1, 1.5, 0.2, Iris-setosa
5.0, 3.6, 1.4, 0.2, Iris-setosa
5.4, 3.9, 1.7, 0.4, Iris-setosa
4.6, 3.4, 1.4, 0.3, Iris-setosa
5.0, 3.4, 1.5, 0.2, Iris-setosa
4.4, 2.9, 1.4, 0.2, Iris-setosa
4.9, 3.1, 1.5, 0.1, Iris-setosa

NOTE: Lines that begin with a % are comments. The @RELATION,
@ATTRIBUTE and @DATA declarations are case insensitive.

Sparse ARFF files


Sparse ARFF files are very similar to ARFF files, but data with value 0 are not
explicitly represented.
Sparse ARFF files have the same header (i.e @relation and @attribute tags) but
the data section is different. Instead of representing each value in order, like this:
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
the non-zero attributes are explicitly identified by attribute number and their value
stated, like this:
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
Each instance is surrounded by curly braces, and the format for each entry is:
<index> <space> <value> where index is the attribute index (starting from 0).

Available Datasets in WEKA:


There are 25 different datasets available in WEKA (C:\Program Files\Weka-3-
8-6\data) by default for testing purposes. All the datasets are available in .arff
format. Those datasets are listed below.

 Load each dataset and observe the following:


1. List the attribute names and their types.
2. Number of records in each dataset
3. Identify the class attribute (if any)
4. Plot Histogram
5. Determine the number of records for each class.
6. Visualize the data in various dimensions
ANS:
Procedure:
1) Open the WEKA tool and Select the Explorer option.
2) A new window will be opened which consists of six tabs – Preprocess,
Classify, Cluster, Associate, Select Attributes and Visualize.
3) In the Preprocess tab, Click the “Open file” option.
4) Go to C:\Program Files\Weka-3-8-6\data for finding different existing .arff
datasets.
5) Click on any of the dataset for loading the data and then the data will be
displayed as shown.
6) Here the Weather.arff dataset is chosen as the sample for all the observations; a scripted equivalent of these checks is sketched below.
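The observations listed below can also be reproduced programmatically. The sketch uses the Weka Java API and assumes weka.jar is on the classpath and that weather.nominal.arff from the default data directory is the Weather dataset referred to here.

// Sketch: load an ARFF file and print attribute names/types, record count and class counts.
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectDataset {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8-6/data/weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);   // treat the last attribute (play) as class
        System.out.println("Relation: " + data.relationName());
        System.out.println("Records:  " + data.numInstances());
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute a = data.attribute(i);
            System.out.println(a.name() + " : " + Attribute.typeToString(a));
        }
        // Number of records for each class value.
        int[] counts = data.attributeStats(data.classIndex()).nominalCounts;
        for (int i = 0; i < counts.length; i++) {
            System.out.println(data.classAttribute().value(i) + " = " + counts[i]);
        }
    }
}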

1. List the attribute names and their types.

There are 5 attributes in the loaded dataset Weather.arff, with the following data types.
S.NO. ATTRIBUTE NAME DATA TYPE
1 Outlook Nominal
2 Temperature Nominal
3 Humidity Nominal
4 Windy Nominal
5 Play Nominal

2. Number of records in each dataset


There are 14 records (instances) in total in the loaded dataset Weather.arff.

3. Identify the class attribute (if any)


No class attribute in the loaded dataset Weather.arff.

4. Plot Histogram

5. Determine the number of records for each class.


S.NO. ATTRIBUTE NAME RECORDS (INSTANCES)
1 Outlook 14
2 Temperature 14
3 Humidity 14
4 Windy 14
5 Play 14

6. Visualize the data in various dimensions


Plot Matrix for the loaded dataset Weather.arff.

3. Perform data preprocessing tasks and demonstrate performing association rule


mining on data sets.
 Explore various options available in Weka for preprocessing data and apply
Unsupervised filters like Discretization, Resample filter, etc. on each dataset.
ANS:
1. Select a dataset from the available datasets for preprocessing.
2. Once the dataset is loaded, apply the various unsupervised filters available in WEKA, such as Discretize and Resample, on the selected dataset.
Applying the Discretization filter on the selected dataset yields the following results.

Applying the Resample filter on the selected dataset yields the following results; a scripted equivalent of both filters is sketched after it.
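Both filters can also be applied outside the GUI. The sketch below uses the Weka Java API, assuming weka.jar is on the classpath; the bin count and sample percentage are chosen purely for illustration.

// Sketch: apply the unsupervised Discretize and Resample filters programmatically.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;
import weka.filters.unsupervised.instance.Resample;

public class ApplyFilters {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8-6/data/iris.arff");

        // Unsupervised discretization: bin every numeric attribute into 10 intervals.
        Discretize disc = new Discretize();
        disc.setBins(10);
        disc.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, disc);

        // Resample: draw a random subsample (here 50%) of the discretized data.
        Resample res = new Resample();
        res.setSampleSizePercent(50.0);
        res.setInputFormat(discretized);
        Instances sampled = Filter.useFilter(discretized, res);

        System.out.println("Original: " + data.numInstances()
                + ", after resample: " + sampled.numInstances());
    }
}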

 Load the weather.nominal, Iris and Glass datasets into Weka and run the Apriori
algorithm with different support and confidence values.
ANS:
Loading WEATHER.NOMINAL dataset
1. Select WEATHER.NOMINAL dataset from the available datasets in the
preprocessing tab.
2. Apply Apriori algorithm by selecting it from the Associate tab and click start
button.
3. The Associator output displays the following result.
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S
-1.0 -c -1
Relation: weather.symbolic
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:


Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 47
Size of set of large itemsets L(3): 39
Size of set of large itemsets L(4): 6

Best rules found:


1. outlook=overcast 4 ==> play=yes 4 <conf:(1)> lift:(1.56) lev:(0.1) [1]
conv:(1.43)
2. temperature=cool 4 ==> humidity=normal 4 <conf:(1)> lift:(2) lev:(0.14) [2]
conv:(2)
3. humidity=normal windy=FALSE 4 ==> play=yes 4 <conf:(1)> lift:(1.56)
lev:(0.1) [1] conv:(1.43)
4. outlook=sunny play=no 3 ==> humidity=high 3 <conf:(1)> lift:(2) lev:(0.11)
[1] conv:(1.5)
5. outlook=sunny humidity=high 3 ==> play=no 3 <conf:(1)> lift:(2.8)
lev:(0.14) [1] conv:(1.93)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 <conf:(1)> lift:(1.75)
lev:(0.09) [1] conv:(1.29)

7. outlook=rainy windy=FALSE 3 ==> play=yes 3 <conf:(1)> lift:(1.56)


lev:(0.08) [1] conv:(1.07)
8. temperature=cool play=yes 3 ==> humidity=normal 3 <conf:(1)> lift:(2)
lev:(0.11) [1] conv:(1.5)
9. outlook=sunny temperature=hot 2 ==> humidity=high 2 <conf:(1)> lift:(2)
lev:(0.07) [1] conv:(1)
10. temperature=hot play=no 2 ==> outlook=sunny 2 <conf:(1)> lift:(2.8)
lev:(0.09) [1] conv:(1.29)

Loading GLASS dataset


1. Select GLASS dataset from the available datasets in the preprocessing tab.
2. Apply Apriori algorithm by selecting it from the Associate tab and click start
button.
3. The Associator output displays the following result.
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S
-1.0 -c -1
Relation: Glass-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-
last-precision6
Instances: 214
Attributes: 10
RI
Na
Mg
Al
Si
K
Ca
Ba
Fe
Type
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.3 (64 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 14

Generated sets of large itemsets:


Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 24
Size of set of large itemsets L(3): 9

Best rules found:


1. Na='(12.725-13.39]' 93 ==> Ba='(-inf-0.315]' 93 <conf:(1)> lift:(1.16)
lev:(0.06) [12] conv:(12.6)
2. Na='(12.725-13.39]' K='(-inf-0.621]' 69 ==> Ba='(-inf-0.315]' 69 <conf:(1)>
lift:(1.16) lev:(0.04) [9] conv:(9.35)
3. RI='(1.515706-1.517984]' Mg='(3.143-3.592]' 65 ==> Ba='(-inf-0.315]' 65
<conf:(1)> lift:(1.16) lev:(0.04) [8] conv:(8.81)
4. Na='(12.725-13.39]' Ca='(7.582-8.658]' 64 ==> Ba='(-inf-0.315]' 64
<conf:(1)> lift:(1.16) lev:(0.04) [8] conv:(8.67)
5. Type=build wind non-float 76 ==> Ba='(-inf-0.315]' 75 <conf:(0.99)>
lift:(1.14) lev:(0.04) [9] conv:(5.15)
6. Type=build wind float 70 ==> Ba='(-inf-0.315]' 69 <conf:(0.99)> lift:(1.14)
lev:(0.04) [8] conv:(4.74)
7. Al='(1.253-1.574]' 81 ==> Ba='(-inf-0.315]' 79 <conf:(0.98)> lift:(1.13)
lev:(0.04) [8] conv:(3.66)
8. Al='(1.253-1.574]' K='(-inf-0.621]' 67 ==> Ba='(-inf-0.315]' 65 <conf:(0.97)>
lift:(1.12) lev:(0.03) [7] conv:(3.03)
9. Mg='(3.143-3.592]' 86 ==> Ba='(-inf-0.315]' 83 <conf:(0.97)> lift:(1.12)
lev:(0.04) [8] conv:(2.91)
10. Type=build wind float 70 ==> K='(-inf-0.621]' 64 <conf:(0.91)> lift:(1.14)
lev:(0.04) [7] conv:(1.96)

Loading IRIS dataset


1. Select IRIS dataset from the available datasets in the preprocessing tab.
2. Apply Apriori algorithm by selecting it from the Associate tab and click start
button.
3. The Associator output displays the following result.
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S
-1.0 -c -1
Relation: iris-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-
last-precision6
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.1 (15 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 18

Generated sets of large itemsets:


Size of set of large itemsets L(1): 20

Size of set of large itemsets L(2): 15


Size of set of large itemsets L(3): 3

Best rules found:


1. petalwidth='(-inf-0.34]' 41 ==> class=Iris-setosa 41 <conf:(1)> lift:(3)
lev:(0.18) [27] conv:(27.33)
2. petallength='(-inf-1.59]' 37 ==> class=Iris-setosa 37 <conf:(1)> lift:(3)
lev:(0.16) [24] conv:(24.67)
3. petallength='(-inf-1.59]' petalwidth='(-inf-0.34]' 33 ==> class=Iris-setosa 33
<conf:(1)> lift:(3) lev:(0.15) [22] conv:(22)
4. petalwidth='(1.06-1.3]' 21 ==> class=Iris-versicolor 21 <conf:(1)> lift:(3)
lev:(0.09) [14] conv:(14)
5. petallength='(5.13-5.72]' 18 ==> class=Iris-virginica 18 <conf:(1)> lift:(3)
lev:(0.08) [12] conv:(12)
6. sepallength='(4.66-5.02]' petalwidth='(-inf-0.34]' 17 ==> class=Iris-setosa 17
<conf:(1)> lift:(3) lev:(0.08) [11] conv:(11.33)
7. sepalwidth='(2.96-3.2]' class=Iris-setosa 16 ==> petalwidth='(-inf-0.34]' 16
<conf:(1)> lift:(3.66) lev:(0.08) [11] conv:(11.63)
8. sepalwidth='(2.96-3.2]' petalwidth='(-inf-0.34]' 16 ==> class=Iris-setosa 16
<conf:(1)> lift:(3) lev:(0.07) [10] conv:(10.67)
9. petallength='(3.95-4.54]' 26 ==> class=Iris-versicolor 25 <conf:(0.96)>
lift:(2.88) lev:(0.11) [16] conv:(8.67)
10. petalwidth='(1.78-2.02]' 23 ==> class=Iris-virginica 22 <conf:(0.96)>
lift:(2.87) lev:(0.1) [14] conv:(7.67)
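Rather than editing the support and confidence values one run at a time in the GUI, the runs above can be scripted. A sketch using the Weka Java API, assuming weka.jar is on the classpath:

// Sketch: rerun Apriori over a grid of minimum support and confidence values.
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriRuns {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8-6/data/weather.nominal.arff");
        double[] supports = {0.1, 0.2, 0.3};
        double[] confidences = {0.7, 0.9};
        for (double sup : supports) {
            for (double conf : confidences) {
                Apriori apriori = new Apriori();
                apriori.setLowerBoundMinSupport(sup);   // lower bound for minimum support (-M)
                apriori.setMinMetric(conf);             // minimum confidence (-C)
                apriori.setNumRules(10);                // keep the 10 best rules (-N)
                apriori.buildAssociations(data);
                System.out.println("support=" + sup + " confidence=" + conf);
                System.out.println(apriori);            // prints the generated rules
            }
        }
    }
}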

 Apply different discretization filters on numerical attributes and run the


Apriori association rule algorithm. Study the rules generated.
 Derive interesting insights and observe the effect of discretization in the rule
generation process.
ANS:
Loading WEATHER.NUMERIC dataset
1. Select the WEATHER.NUMERIC dataset from the available datasets in the
Preprocess tab.
2. Apply the unsupervised Discretize filter so that the numeric attributes (temperature,
humidity) become nominal intervals; Apriori requires nominal attributes, and the
Relation line below confirms the filter was applied.
3. Apply the Apriori algorithm by selecting it from the Associate tab and click the Start
button.
4. The Associator output displays the following result.
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S
-1.0 -c -1
Relation: weather-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-
Rfirst-last-precision6
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:


Size of set of large itemsets L(1): 17
Size of set of large itemsets L(2): 34
Size of set of large itemsets L(3): 13
Size of set of large itemsets L(4): 1

Best rules found:


1. outlook=overcast 4 ==> play=yes 4 <conf:(1)> lift:(1.56) lev:(0.1) [1]
conv:(1.43)
2. humidity='(89.8-92.9]' 3 ==> windy=TRUE 3 <conf:(1)> lift:(2.33) lev:(0.12)
[1] conv:(1.71)
3. outlook=rainy play=yes 3 ==> windy=FALSE 3 <conf:(1)> lift:(1.75)
lev:(0.09) [1] conv:(1.29)
4. outlook=rainy windy=FALSE 3 ==> play=yes 3 <conf:(1)> lift:(1.56)
lev:(0.08) [1] conv:(1.07)

5. humidity='(77.4-80.5]' 2 ==> outlook=rainy 2 <conf:(1)> lift:(2.8) lev:(0.09)


[1] conv:(1.29)
6. temperature='(-inf-66.1]' 2 ==> windy=TRUE 2 <conf:(1)> lift:(2.33)
lev:(0.08) [1] conv:(1.14)
7. temperature='(68.2-70.3]' 2 ==> windy=FALSE 2 <conf:(1)> lift:(1.75)
lev:(0.06) [0] conv:(0.86)
8. temperature='(68.2-70.3]' 2 ==> play=yes 2 <conf:(1)> lift:(1.56) lev:(0.05)
[0] conv:(0.71)
9. temperature='(74.5-76.6]' 2 ==> play=yes 2 <conf:(1)> lift:(1.56) lev:(0.05)
[0] conv:(0.71)
10. humidity='(83.6-86.7]' 2 ==> temperature='(82.9-inf)' 2 <conf:(1)> lift:(7)
lev:(0.12) [1] conv:(1.71)

4. Demonstrate performing classification on data sets


 Load each dataset into Weka and run the ID3 and J48 classification algorithms. Study
the classifier output. Compute entropy values and the Kappa statistic.
ANS:
Loading CONTACT-LENSES dataset and Run ID3 algorithm.
1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply ID3 algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available
in Test Options.
5. Now click Start button available.
6. The Classifier output displays the following result.
=== Run information ===
Scheme: weka.classifiers.trees.Id3
Relation: contact-lenses
Instances: 24
Attributes: 5
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Id3
tear-prod-rate = reduced: none
tear-prod-rate = normal
| astigmatism = no
| | age = young: soft
| | age = pre-presbyopic: soft
| | age = presbyopic
| | | spectacle-prescrip = myope: none
| | | spectacle-prescrip = hypermetrope: soft
| astigmatism = yes
| | spectacle-prescrip = myope: hard
| | spectacle-prescrip = hypermetrope
| | | age = young: hard
| | | age = pre-presbyopic: none
| | | age = presbyopic: none
Time taken to build model: 0 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0 seconds
=== Summary ===

Correctly Classified Instances 24 100 %


Incorrectly Classified Instances 0 0%
Kappa statistic 1
K&B Relative Info Score 100 %
K&B Information Score 31.9048 bits 1.3294 bits/instance
Class complexity | order 0 31.9048 bits 1.3294 bits/instance
Class complexity | scheme 0 bits 0 bits/instance
Complexity improvement (Sf) 31.9048 bits 1.3294 bits/instance
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0%
Root relative squared error 0%
Total Number of Instances 24
=== Detailed Accuracy By Class ===
                TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
                1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000     soft
                1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000     hard
                1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000     none
Weighted Avg.   1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000

=== Confusion Matrix ===


a b c <-- classified as
5 0 0 | a = soft
0 4 0 | b = hard
0 0 15 | c = none

Loading CONTACT-LENSES dataset and Run J48 algorithm.


1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply J48 algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available
in Test Options.
5. Now click Start button available.
6. The Classifier output displays the following result.
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: contact-lenses
Instances: 24
Attributes: 5

age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
=== Classifier model (full training set) ===
J48 pruned tree

tear-prod-rate = reduced: none (12.0)
tear-prod-rate = normal
| astigmatism = no: soft (6.0/1.0)
| astigmatism = yes
| | spectacle-prescrip = myope: hard (3.0)
| | spectacle-prescrip = hypermetrope: none (3.0/1.0)
Number of Leaves: 4
Size of the tree: 7
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0 seconds
=== Summary ===
Correctly Classified Instances 22 91.6667 %
Incorrectly Classified Instances 2 8.3333 %
Kappa statistic 0.8447
K&B Relative Info Score 81.6411 %
K&B Information Score 26.0474 bits 1.0853 bits/instance
Class complexity | order 0 31.9048 bits 1.3294 bits/instance
Class complexity | scheme 6.655 bits 0.2773 bits/instance
Complexity improvement (Sf) 25.2498 bits 1.0521 bits/instance
Mean absolute error 0.0833
Root mean squared error 0.2041
Relative absolute error 22.6257 %
Root relative squared error 48.1223 %
Total Number of Instances 24
=== Detailed Accuracy By Class ===
                TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
                1.000    0.053    0.833      1.000   0.909      0.889  0.974     0.833     soft
                0.750    0.000    1.000      0.750   0.857      0.845  0.988     0.917     hard
                0.933    0.111    0.933      0.933   0.933      0.822  0.967     0.972     none
Weighted Avg.   0.917    0.080    0.924      0.917   0.916      0.840  0.972     0.934

=== Confusion Matrix ===


a b c <-- classified as
5 0 0 | a = soft
0 3 1 | b = hard
1 0 14 | c = none

 Extract if-then rules from the decision tree generated by the classifier, and observe
the confusion matrix.
ANS:
Loading CONTACT-LENSES dataset and Run JRip algorithm.
1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply JRip algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available
in Test Options.
5. Now click Start button available.
6. The Classifier output displays the following result.
=== Run information ===
Scheme: weka.classifiers.rules.JRip -F 3 -N 2.0 -O 2 -S 1
Relation: contact-lenses
Instances: 24
Attributes: 5
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data

=== Classifier model (full training set) ===

JRIP rules:
===========
(tear-prod-rate = normal) and (astigmatism = yes) => contact-lenses=hard (6.0/2.0)
(tear-prod-rate = normal) => contact-lenses=soft (6.0/1.0)
=> contact-lenses=none (12.0/0.0)
Number of Rules : 3
Time taken to build model: 0 seconds

=== Evaluation on training set ===


Time taken to test model on training data: 0 seconds

=== Summary ===


Correctly Classified Instances 21 87.5 %
Incorrectly Classified Instances 3 12.5 %
Kappa statistic 0.7895
K&B Relative Info Score 73.756 %
K&B Information Score 23.5317 bits 0.9805 bits/instance
Class complexity | order 0 31.9048 bits 1.3294 bits/instance
Class complexity | scheme 9.4099 bits 0.3921 bits/instance
Complexity improvement (Sf) 22.4949 bits 0.9373 bits/instance
Mean absolute error 0.1204
Root mean squared error 0.2453
Relative absolute error 32.6816 %
Root relative squared error 57.8358 %
Total Number of Instances 24

=== Detailed Accuracy By Class ===


                TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
                1.000    0.053    0.833      1.000   0.909      0.889  0.974     0.833     soft
                1.000    0.100    0.667      1.000   0.800      0.775  0.950     0.667     hard
                0.800    0.000    1.000      0.800   0.889      0.775  0.922     0.945     none
Weighted Avg.   0.875    0.028    0.910      0.875   0.878      0.798  0.938     0.876

=== Confusion Matrix ===

a b c <-- classified as
5 0 0 | a = soft
0 4 0 | b = hard
1 2 12 | c = none

 Load each dataset into Weka and perform Naïve-bayes classification and k-
Nearest Neighbour classification. Interpret the results obtained.
ANS:
Loading CONTACT-LENSES dataset and Run Naïve-Bayes algorithm.
1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply Naïve-Bayes algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available
in Test Options.
5. Now click Start button available.
6. The Classifier output displays the following result.
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: contact-lenses
Instances: 24
Attributes: 5
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute soft hard none
(0.22) (0.19) (0.59)
==========================================
age
young 3.0 3.0 5.0
pre-presbyopic 3.0 2.0 6.0
presbyopic 2.0 2.0 7.0
[total] 8.0 7.0 18.0
spectacle-prescrip
myope 3.0 4.0 8.0
hypermetrope 4.0 2.0 9.0
[total] 7.0 6.0 17.0

astigmatism
no 6.0 1.0 8.0
yes 1.0 5.0 9.0
[total] 7.0 6.0 17.0
tear-prod-rate
reduced 1.0 1.0 13.0
normal 6.0 5.0 4.0
[total] 7.0 6.0 17.0
Time taken to build model: 0 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0 seconds
=== Summary ===
Correctly Classified Instances 23 95.8333 %
Incorrectly Classified Instances 1 4.1667 %
Kappa statistic 0.925
K&B Relative Info Score 62.2646 %
K&B Information Score 19.8654 bits 0.8277 bits/instance
Class complexity | order 0 31.9048 bits 1.3294 bits/instance
Class complexity | scheme 12.066 bits 0.5028 bits/instance
Complexity improvement (Sf) 19.8387 bits 0.8266 bits/instance
Mean absolute error 0.1809
Root mean squared error 0.2357
Relative absolute error 49.1098 %
Root relative squared error 55.5663 %
Total Number of Instances 24
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1.000 0.053 0.833 1.000 0.909 0.889 1.000 1.000 soft
1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 hard
0.933 0.000 1.000 0.933 0.966 0.917 1.000 1.000 none
Weighted Avg. 0.958 0.011 0.965 0.958 0.960 0.925 1.000 1.000
=== Confusion Matrix ===
a b c <-- classified as
5 0 0 | a = soft
0 4 0 | b = hard
1 0 14 | c = none

Loading CONTACT-LENSES dataset and Run k-Nearest Neighbour (IBk) algorithm.
1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply k-Nearest Neighbour (IBk) algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available
in Test Options.
5. Now click Start button available.
6. The Classifier output displays the following result.

=== Run information ===


Scheme: weka.classifiers.lazy.IBk -K 1 -W 0 -A
"weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R
first-last\""
Relation: contact-lenses
Instances: 24
Attributes: 5
age
spectacle-prescrip
astigmatism
tear-prod-rate
contact-lenses
Test mode: evaluate on training data
=== Classifier model (full training set) ===
IB1 instance-based classifier
using 1 nearest neighbour(s) for classification
Time taken to build model: 0 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0 seconds
=== Summary ===
Correctly Classified Instances 24 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
K&B Relative Info Score 91.6478 %
K&B Information Score 29.24 bits 1.2183 bits/instance
Class complexity | order 0 31.9048 bits 1.3294 bits/instance
Class complexity | scheme 2.6648 bits 0.111 bits/instance
Complexity improvement (Sf) 29.24 bits 1.2183 bits/instance
Mean absolute error 0.0494
Root mean squared error 0.0524
Relative absolute error 13.4078 %
Root relative squared error 12.3482 %
Total Number of Instances 24
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 soft
1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 hard
1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 none
Weighted Avg. 1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000

=== Confusion Matrix ===


a b c <-- classified as
5 0 0 | a = soft
0 4 0 | b = hard
0 0 15 | c = none

 Plot ROC curves.
ANS:
1. Select CONTACT-LENSES dataset from the available datasets in the
preprocessing tab.
2. Apply Naïve-Bayes algorithm by selecting it from the Classify tab.
3. Now select one of the Test Options available (Training Set/ Supplied Test Set/
Cross-Validation/ Percentage Split).
4. Also select Output Entropy Evaluation Measures from the More options available
in Test Options.
5. Now click Start button available.
6. The Classifier output displays the result.
7. For plotting ROC curves, right-click on the bayes.NaiveBayes entry in the result
list and select the Visualize Threshold Curve option, in which we can select any
of the available classes (soft, hard, none).
8. After selecting a class, RoC curve plot will be displayed with False Positive Rate
as X-axis and True Positive Rate as Y-axis.
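The same ROC points and the area under the curve can also be computed with the ThresholdCurve class. A sketch assuming weka.jar is on the classpath; class index 0 (soft) is chosen only as an example.

// Sketch: cross-validate Naive Bayes and extract the ROC curve data for one class.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.evaluation.ThresholdCurve;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RocCurveDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8-6/data/contact-lenses.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // Build the threshold (ROC) curve for class index 0, i.e. "soft".
        ThresholdCurve tc = new ThresholdCurve();
        Instances curve = tc.getCurve(eval.predictions(), 0);
        System.out.println("AUC for class 'soft': " + ThresholdCurve.getROCArea(curve));
        // The curve Instances contain "False Positive Rate" and "True Positive Rate"
        // columns that can be plotted, as the Explorer does.
    }
}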

 Compare the classification results of the ID3, J48, Naïve-Bayes and k-NN classifiers
for each dataset, deduce which classifier performs best and which performs worst for
each dataset, and justify.
ANS:
By observing the classification results of ID3, k-NN, J48 and Naïve Bayes on the
training-set evaluation above:
ID3 and k-NN give the best accuracy and performance (100% correct, Kappa 1), with Naïve Bayes close behind (95.83%).
J48 shows the poorest accuracy and performance of these four (91.67% correct, Kappa 0.84).
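A fairer comparison evaluates the classifiers with cross-validation instead of on the training set. A sketch using the Weka Java API, assuming weka.jar is on the classpath; Id3 lives in the optional simpleEducationalLearningSchemes package in Weka 3.8, so only classifiers bundled with the core distribution are compared here.

// Sketch: 10-fold cross-validation of several classifiers, printing accuracy and Kappa.
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8-6/data/contact-lenses.arff");
        data.setClassIndex(data.numAttributes() - 1);
        Classifier[] models = {new J48(), new NaiveBayes(), new IBk(1), new ZeroR()};
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%-12s accuracy=%.2f%% kappa=%.3f%n",
                    model.getClass().getSimpleName(), eval.pctCorrect(), eval.kappa());
        }
    }
}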

5. Demonstrate performing clustering of data sets


 Load each dataset into Weka and run simple k-means clustering algorithm with
different values of k (number of desired clusters).
 Study the clusters formed. Observe the sum of squared errors and centroids, and
derive insights.
ANS:
Loading IRIS dataset and Run Simple K-Means clustering algorithm
1. Select and Load IRIS dataset from the available datasets in the preprocessing tab.
2. Apply Simple K-Means algorithm by selecting it from the Cluster tab.
3. Now select one of the cluster modes available (Training Set/ Supplied Test Set/
Percentage Split/ Classes to Clusters Evaluation) – Training Set option was selected.
4. Now click Start button.
5. The Clusterer output displays the following result.

=== Run information ===


Scheme: weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-
pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 2 -A
"weka.core.EuclideanDistance -R first-last"-I 500 -num-slots 1 -S 10
Relation: iris
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
Test mode: evaluate on training data

=== Clustering model (full training set) ===


kMeans
======
Number of iterations: 7
Within cluster sum of squared errors: 62.1436882815797

Initial starting points (random):


Cluster 0: 6.1, 2.9, 4.7, 1.4, Iris-versicolor
Cluster 1: 6.2, 2.9, 4.3, 1.3, Iris-versicolor

Missing values globally replaced with mean/mode

Final cluster centroids:


Cluster#
Attribute Full Data 0 1
(150.0) (100.0) (50.0)
==================================================================
sepallength 5.8433 6.262 5.006
sepalwidth 3.054 2.872 3.418
petallength 3.7587 4.906 1.464
petalwidth 1.1987 1.676 0.244
class Iris-setosa Iris-versicolor Iris-setosa
Time taken to build model (full training data) : 0 seconds

=== Model and evaluation on training set ===


Clustered Instances
0 100 (67%)
1 50 (33%)
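To study how the within-cluster sum of squared errors changes with k, the clusterer can be re-run for several k values from the Java API. A sketch assuming weka.jar is on the classpath; removing the class attribute first mirrors the GUI option to ignore attributes and is an illustrative choice.

// Sketch: run SimpleKMeans for k = 2..5 and compare the sum of squared errors.
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansRuns {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8-6/data/iris.arff");
        // Drop the class attribute so only the numeric measurements are clustered.
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        for (int k = 2; k <= 5; k++) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(k);
            km.setSeed(10);
            km.buildClusterer(noClass);
            System.out.println("k=" + k + "  within-cluster SSE=" + km.getSquaredError());
        }
    }
}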

 Explore other clustering techniques available in Weka.
ANS:
1. A clustering algorithm finds groups of similar instances in the entire dataset.
2. WEKA supports several clustering algorithms such as EM, Filtered Clusterer,
Hierarchical Clusterer and so on.
a) The EM algorithm is an iterative approach that cycles between two modes. The
first mode attempts to estimate the missing or latent variables, called the
estimation- step or E-step. The second mode attempts to optimize the parameters
of the model to best explain the data, called the maximization-step or M-step.
b) The FilteredClusterer runs an arbitrary clustering algorithm on data that has
first been passed through an arbitrary filter, with the filter's structure based
exclusively on the training data.
c) The Hierarchical Clusterer algorithm works via grouping data into a tree of
clusters. Hierarchical clustering begins by treating every data point as a separate
cluster. Then, it repeatedly executes the subsequent steps:
 Identify the 2 clusters which can be closest together, and
 Merge the 2 maximum comparable clusters. We need to continue these steps
until all the clusters are merged together.
For example, applying the HierarchicalClusterer algorithm on the IRIS dataset (a scripted sketch follows).
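A minimal sketch of running the EM and Hierarchical clusterers from the Weka Java API, assuming weka.jar is on the classpath; dropping the class attribute and cutting the hierarchy into three clusters are illustrative choices, not requirements.

// Sketch: build a HierarchicalClusterer and an EM clusterer on iris and print the evaluation.
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.clusterers.HierarchicalClusterer;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OtherClusterers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("C:/Program Files/Weka-3-8-6/data/iris.arff");
        data.deleteAttributeAt(data.numAttributes() - 1);  // drop the class attribute

        HierarchicalClusterer hc = new HierarchicalClusterer();
        hc.setNumClusters(3);          // cut the tree into three clusters
        hc.buildClusterer(data);

        EM em = new EM();              // EM chooses the number of clusters by cross-validation
        em.buildClusterer(data);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(hc);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}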

 Explore visualization features of Weka to visualize the clusters. Derive interesting insights and explain.
ANS:
1. As in the case of classification, distinction between the correctly and incorrectly
identified instances can be observed.
2. By changing the X and Y axes we can analyze the results. Use jittering to find out the
concentration of correctly identified instances.
3. The operations in the visualization plot for clustering are similar to those in the
case of classification.
6. Demonstrate knowledge flow application on data sets


 Develop a knowledge flow layout for finding strong association rules by using
Apriori, FP Growth algorithms.
ANS:
Knowledge flow layout for finding strong association rules by using Apriori
Algorithm

RESULT
Knowledge flow layout for finding strong association rules by using FP Growth algorithm

RESULT
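Both knowledge flows wire an ArffLoader into the corresponding associator and a TextViewer for the rules. The same strong rules can be obtained programmatically; the sketch below is only an approximate equivalent, assuming weka.jar is on the classpath and that the sample files weather.nominal.arff and supermarket.arff from Weka's data folder are available locally.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.associations.Apriori;
import weka.associations.FPGrowth;

public class StrongRules {
    public static void main(String[] args) throws Exception {
        // Apriori works on nominal data (the file names below are assumptions)
        Instances nominal = DataSource.read("weather.nominal.arff");

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);    // report the 10 strongest rules
        apriori.setMinMetric(0.9);  // minimum confidence of 0.9
        apriori.buildAssociations(nominal);
        System.out.println(apriori);

        // FP-Growth expects market-basket style (binary) data such as supermarket.arff
        Instances basket = DataSource.read("supermarket.arff");
        FPGrowth fpGrowth = new FPGrowth();
        fpGrowth.setNumRulesToFind(10);
        fpGrowth.buildAssociations(basket);
        System.out.println(fpGrowth);
    }
}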
 Set up the knowledge flow to load an ARFF (batch mode) and perform a
cross validation using J48 algorithm.
ANS:
Knowledge flow to load an ARFF (batch mode) and perform a cross
validation using J48 algorithm.

RESULT
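As a cross-check of the knowledge flow, the same batch evaluation can be scripted with the Weka Java API. This is only a sketch; the ARFF file name is an assumption and should be replaced with your own dataset.

import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;

public class J48CrossValidation {
    public static void main(String[] args) throws Exception {
        // Load an ARFF file in batch mode (file name is an assumption)
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class

        J48 tree = new J48();                           // C4.5 decision tree learner
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));  // 10-fold cross validation

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}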
 Demonstrate plotting multiple ROC curves in the same plot window by using
J48 and Random Forest tree.

Plotting multiple ROC curves in the same plot window by using J48 and
Random Forest tree.

RESULT
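The knowledge flow feeds the cross-validated predictions of both classifiers into a Model Performance Chart. Programmatically, the underlying ROC data can be produced with Weka's ThresholdCurve class; the sketch below (assuming the Weka 3.8 Java API and a local iris.arff) computes the ROC area for J48 and Random Forest from cross-validated predictions, which is the same information the plot window visualizes.

import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.evaluation.ThresholdCurve;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;

public class RocComparison {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");   // file name is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new J48(), new RandomForest() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));

            // Build the ROC curve data for the first class value (index 0)
            ThresholdCurve tc = new ThresholdCurve();
            Instances curve = tc.getCurve(eval.predictions(), 0);
            System.out.println(model.getClass().getSimpleName()
                    + "  ROC area = " + ThresholdCurve.getROCArea(curve));
        }
    }
}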
7. Demonstrate ZeroR technique on Iris dataset (by using necessary preprocessing technique(s)) and share your observations.
ANS:
1. Select IRIS dataset from the pool of datasets available for preprocessing.
2. Once the dataset is loaded, apply various unsupervised filters, if necessary.
3. Now apply the ZeroR technique on the dataset by selecting it from the Classify tab.
4. Check the results with the test options –
   i. Cross-Validation with 10 folds and
   ii. Percentage Split of 66%
Loading IRIS Dataset for Preprocessing

ZeroR technique with test option – Cross-Validation with 10 folds


The Classifier output displays the following result.
=== Run information ===
Scheme: weka.classifiers.rules.ZeroR
Relation: iris
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
ZeroR predicts class value: Iris-setosa
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 50 33.3333 %


Incorrectly Classified Instances 100 66.6667 %
Kappa statistic 0
Mean absolute error 0.4444
Root mean squared error 0.4714
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 150
=== Detailed Accuracy by Class ===
              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC  ROC Area  PRC Area  Class
              1.000    1.000    0.333      1.000   0.500      ?    0.500     0.333     Iris-setosa
              0.000    0.000    ?          0.000   ?          ?    0.500     0.333     Iris-versicolor
              0.000    0.000    ?          0.000   ?          ?    0.500     0.333     Iris-virginica
Weighted Avg. 0.333    0.333    ?          0.333   ?          ?    0.500     0.333

=== Confusion Matrix ===


a b c <-- classified as
50 0 0 | a = Iris-setosa
50 0 0 | b = Iris-versicolor
50 0 0 | c = Iris-virginica

ZeroR technique with test option – Percentage Split of 66%


The Classifier output displays the following result.
=== Run information ===
Scheme: weka.classifiers.rules.ZeroR
Relation: iris
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
Test mode: split 66.0% train, remainder test
=== Classifier model (full training set) ===
ZeroR predicts class value: Iris-setosa
Time taken to build model: 0 seconds
=== Evaluation on test split ===


Time taken to test model on test split: 0 seconds

=== Summary ===


Correctly Classified Instances 15 29.4118 %
Incorrectly Classified Instances 36 70.5882 %
Kappa statistic 0
Mean absolute error 0.4455
Root mean squared error 0.4728
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 51
=== Detailed Accuracy by Class ===
              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC  ROC Area  PRC Area  Class
              1.000    1.000    0.294      1.000   0.455      ?    0.500     0.294     Iris-setosa
              0.000    0.000    ?          0.000   ?          ?    0.500     0.373     Iris-versicolor
              0.000    0.000    ?          0.000   ?          ?    0.500     0.333     Iris-virginica
Weighted Avg. 0.294    0.294    ?          0.294   ?          ?    0.500     0.336

=== Confusion Matrix ===


a b c <-- classified as
15 0 0 | a = Iris-setosa
19 0 0 | b = Iris-versicolor
17 0 0 | c = Iris-virginica
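Observation: ZeroR ignores all attributes and always predicts a single class value learned from the training data (here Iris-setosa), so it is only useful as a baseline. Under 10-fold cross-validation it is correct for exactly the 50 setosa instances (50/150 ≈ 33.3 %), and under the 66 % split the test set happens to contain 15 setosa instances out of 51 (15/51 ≈ 29.4 %). The sketch below is a hedged Java equivalent of both evaluations (the ARFF file name is an assumption).

import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;

public class ZeroRBaseline {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");   // file name is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        ZeroR zeroR = new ZeroR();

        // 10-fold cross-validation
        Evaluation cv = new Evaluation(data);
        cv.crossValidateModel(zeroR, data, 10, new Random(1));
        System.out.println("Cross-validation:\n" + cv.toSummaryString());

        // 66% train / 34% test percentage split
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);
        zeroR.buildClassifier(train);
        Evaluation split = new Evaluation(train);
        split.evaluateModel(zeroR, test);
        System.out.println("Percentage split:\n" + split.toSummaryString());
    }
}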
8. Write a java program to prepare a simulated data set with unique instances.
PROGRAM:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class UniqueDataSet


{
public static void main(String[] args)
{
// Create a random number generator.
Random random = new Random();

// Create a list to store the data.


List<Integer> data = new ArrayList<>();

// Generate 20 unique integers in the range [0, 100).

while (data.size() < 20)
{
    int number = random.nextInt(100);
    // Add the number only if it has not been generated before,
    // so every instance in the simulated data set is unique.
    if (!data.contains(number))
    {
        data.add(number);
    }
}

// Print the data.


for (int i = 0; i < data.size(); i++)
{
System.out.println(data.get(i));
}
}
}
OUTPUT:
12
37
1
38
84
28
24
61
88
65
66
72
85
75
64
91
27
47
42
9. Write a Python program to generate frequent item sets / association rules using
Apriori algorithm.
PROGRAM:
# installing the apyori package
!pip install apyori

# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# importing the dataset


Data = pd.read_csv('/content/drive/MyDrive/Market_Basket_Optimisation.csv',
header = None)

# Initializing the list


transacts = []
# populating a list of transactions
for i in range(0, 7501):
    transacts.append([str(Data.values[i, j]) for j in range(0, 20)])

# trains our apriori model


from apyori import apriori
rule = apriori (transactions = transacts, min_support = 0.003, min_confidence = 0.2,
min_lift = 3, min_length = 2, max_length = 2)

# Visualising the results


output = list(rule) # returns a non-tabular output

# putting output into a pandas data frame


def inspect(output):
    lhs        = [tuple(result[2][0][0])[0] for result in output]
    rhs        = [tuple(result[2][0][1])[0] for result in output]
    support    = [result[1] for result in output]
    confidence = [result[2][0][2] for result in output]
    lift       = [result[2][0][3] for result in output]
    return list(zip(lhs, rhs, support, confidence, lift))

output_DataFrame = pd.DataFrame(inspect(output), columns = ['Left_Hand_Side',
    'Right_Hand_Side', 'Support', 'Confidence', 'Lift'])
OUTPUT:
# Displaying the results non-sorted
output_DataFrame

# Displaying the results sorted by descending order of Lift column


output_DataFrame.nlargest(n = 10, columns = 'Lift')
10. Write a program to calculate chi-square value using Python. Report your
observation.
PROGRAM:
# importing libraries
import numpy as np
from scipy.stats import chi2_contingency

# Create a contingency table


observed = np.array([[10, 20], [30, 40]])

# Calculate the chi-square statistic


chi2_stat = chi2_contingency(observed)[0]

# Calculate the p-value


p_value = chi2_contingency(observed)[1]

# Print the results


print ("Chi-square statistic:", chi2_stat)
print ("P-value:", p_value)

OUTPUT:
Chi-square statistic: 0.4464 (approx., with Yates' continuity correction)
P-value: 0.504 (approx.)

OBSERVATION:
The chi-square statistic measures the discrepancy between the observed and
expected frequencies in a contingency table; for this table the expected counts are
12, 18, 28 and 42. The p-value is the probability of obtaining a chi-square statistic
at least as large as the one observed, assuming the null hypothesis that there is no
association between the two variables is true. Since the p-value (about 0.50) is
greater than the significance level of 0.05, we fail to reject the null hypothesis and
conclude that the data do not show a significant association between the two variables.
11. Write a program of Naive Bayesian classification using Python programming language.
PROGRAM:
# Importing library
import math
import random
import csv

# the categorical class names are changed to numeric data


# eg: yes and no encoded to 1 and 0
def encode_class(mydata):
classes = []
for i in range(len(mydata)):
if mydata[i][-1] not in classes:
classes.append(mydata[i][-1])
for i in range(len(classes)):
for j in range(len(mydata)):
if mydata[j][-1] == classes[i]:
mydata[j][-1] = i
return mydata

# Splitting the data


def splitting(mydata, ratio):
train_num = int(len(mydata) * ratio)
train = []
# initially testset will have all the dataset
test = list(mydata)
while len(train) < train_num:
# index generated randomly from range 0
# to length of testset
index = random.randrange(len(test))
# from testset, pop data rows and put it in train
train.append(test.pop(index))
return train, test

# Group the data rows under each class (yes or no) in a
# dictionary, eg: dict[yes] and dict[no]
def groupUnderClass(mydata):
    dict = {}
    for i in range(len(mydata)):
        if (mydata[i][-1] not in dict):
            dict[mydata[i][-1]] = []
        dict[mydata[i][-1]].append(mydata[i])
    return dict

# Calculating Mean
def mean(numbers):
return sum(numbers) / float(len(numbers))

# Calculating Standard Deviation


def std_dev(numbers):
avg = mean(numbers)
variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
return math.sqrt(variance)

def MeanAndStdDev(mydata):
    info = [(mean(attribute), std_dev(attribute)) for attribute in zip(*mydata)]
    # eg: list = [[a, b, c], [m, n, o], [x, y, z]]
    # here mean of 1st attribute = (a + m + x)/3, mean of 2nd attribute = (b + n + y)/3
    # delete the summary of the last attribute (the class label)
    del info[-1]
    return info

# find Mean and Standard Deviation under each class


def MeanAndStdDevForClass(mydata):
    info = {}
    dict = groupUnderClass(mydata)
    for classValue, instances in dict.items():
        info[classValue] = MeanAndStdDev(instances)
    return info

# Calculate Gaussian Probability Density Function


def calculateGaussianProbability(x, mean, stdev):
    expo = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * expo

# Calculate Class Probabilities


def calculateClassProbabilities(info, test):
    probabilities = {}
    for classValue, classSummaries in info.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, std_dev = classSummaries[i]
            x = test[i]
            probabilities[classValue] *= calculateGaussianProbability(x, mean, std_dev)
    return probabilities

# Make prediction - highest probability is the prediction


def predict(info, test):
    probabilities = calculateClassProbabilities(info, test)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

# returns predictions for a set of examples


def getPredictions(info, test):
predictions = []
for i in range(len(test)):
result = predict(info, test[i])
predictions.append(result)
return predictions

# Accuracy score
def accuracy_rate(test, predictions):
correct = 0
for i in range(len(test)):
if test[i][-1] == predictions[i]:
correct += 1
return (correct / float(len(test))) * 100.0

# driver code
# add the data path in your system
filename = r'E:\user\MACHINE LEARNING\machine learning algos\Naive bayes\filedata.csv'

# load the file and store it in the mydata list
mydata = csv.reader(open(filename, "rt"))
mydata = list(mydata)
mydata = encode_class(mydata)
for i in range(len(mydata)):
    mydata[i] = [float(x) for x in mydata[i]]

# split ratio = 0.7


# 70% of data is training data and 30% is test data used for testing
ratio = 0.7
train_data, test_data = splitting(mydata, ratio)

print('Total number of examples are: ', len(mydata))
print('Out of these, training examples are: ', len(train_data))
print("Test examples are: ", len(test_data))

# prepare model
info = MeanAndStdDevForClass(train_data)

# test model
predictions = getPredictions(info, test_data)
accuracy = accuracy_rate(test_data, predictions)
print("Accuracy of your model is: ", accuracy)

OUTPUT:
Total number of examples are: 200
Out of these, training examples are: 140
Test examples are: 60
Accuracy of your model is: 71.2376788
12. Implement a Java program to perform Apriori algorithm.


PROGRAM:
//Implementation of Apriori algorithm in Java.
import java.io.*;
class Main
{
public static void main(String []arg)throws IOException
{
int i, j, m=0; //initalize variables
int t1=0;
BufferedReader b=new BufferedReader(new InputStreamReader(System.in));
//Java BufferedReader class is used to read the text from a character-based input
stream.
//It can be used to read data line by line by readLine() method.
System.out.println("Enter the number of transaction:");
//here n is the number of transactions
int n=Integer.parseInt(b.readLine());
System.out.println("items :1--Milk 2--Bread 3--Coffee 4--Juice 5--Cookies 6--Jam 7-
-Tea 8--Butter 9--Sugar 10--Water");
//cresting array of 10 items.
int item[][]=new int[n][10];
//loop generating for number of transactions
for(i=0;i<n;i++)
//loop generating for items array
for(j=0;j<10;j++)
//initializing unique items with their frequency as 0.
item[i][j]=0;
String[] itemlist = {"MILK", "BREAD", "COFFEE", "JUICE", "COOKIES", "JAM",
"TEA", "BUTTER", "SUGAR", "WATER"};
//getting 10 items into array called itemlist.
int nt[]=new int[10];
int q[]=new int[10];
for(i=0;i<n;i++)
{

//incrementing for each items in 'n' transactions.


System.out.println("Transaction "+(i+1)+" :");
for(j=0;j<10;j++)
{
//System.out.println(itemlist[j]);
System.out.println("Is Item "+itemlist[j]+" present in this transaction(1/0)? :");
//checking whether items from itemlistis present in transaction or not where
0- not present,1-present.
item[i][j]=Integer.parseInt(b.readLine());
//reading for each item from item list in n transaction.


}
}
for(j=0;j<10;j++)
{
for(i=0;i<n;i++)
{
//checking whether atleast there would be multiple items repeated at each n
transaction.
if(item[i][j]==1)
//if condition is satisfied then we increment for all n transaction of items.
nt[j]=nt[j]+1;
}
System.out.println("Number of Item "+itemlist[j]+" :"+nt[j]);
//generating number of multiple items repeated at their transaction with
frequency number.
}
for(j=0;j<10;j++)
{
//calculating items with their threshold values.
if(((nt[j]/(float)n)*100)>=50)
//segregating present items left after removal of items which is below threshold
into array
q[j]=1;
else
//segregating not present items removed as items are below the threshold values
q[j]=0;
if(q[j]==1)
{
t1++; //getting the count of repetitions of same items
System.out.println("Item "+itemlist[j]+" is selected ");
//generating particular item which is selected after threshold calculating.
}
}
for(j=0;j<10;j++)
{
for(i=0;i<n;i++)
{
if(q[j]==0)
{
item[i][j]=0;
}
}
}
//creating array for 2-frequency itemset
int nt1[][]=new int[10][10];
for(j=0;j<10;j++)
{
//generating unique items for 2-frequency itemlist
for(m=j+1;m<10;m++)
{
for(i=0;i<n;i++)
{
if(item[i][j]==1 &&item[i][m]==1)
//checking there would atleast 1 itemset in 1-frequency itemset and 2-
frequency itemlist.
{
nt1[j][m]=nt1[j][m]+1;
//incrementing for each items with all other items in 2-frequency itemset
}
}
if(nt1[j][m]!=0) //if 2-frequency itemlist is present
System.out.println("Number of Items of “+itemlist[j]+"& "+itemlist[m]+"
:"+nt1[j][m]);
//printing number of items of each items with other items with their
frequency.
}
}
for(j=0;j<10;j++)
{
for(m=j+1;m<10;m++)
{
if(((nt1[j][m]/(float)n)*100)>=50)
q[j]=1;
else
q[j]=0;
if(q[j]==1)
{
System.out.println("Item "+itemlist[j]+"& "+itemlist[m]+" is selected ");
}
}
}
}
}
OUTPUT:
Enter the number of transaction:
3
items :1--Milk 2--Bread 3--Coffee 4--Juice 5--Cookies 6--Jam 7--Tea 8--Butter 9--
Sugar 10--Water
Transaction 1:
Is Item MILK present in this transaction(1/0)? :
1
Is Item BREAD present in this transaction(1/0)? :
1
Is Item COFFEE present in this transaction(1/0)? :
1
Is Item JUICE present in this transaction(1/0)? :
1
Is Item COOKIES present in this transaction(1/0)? :
1
Is Item JAM present in this transaction(1/0)? :
1
Is Item TEA present in this transaction(1/0)?
:1
Is Item BUTTER present in this transaction(1/0)? :
1
Is Item SUGAR present in this transaction(1/0)? :
1
Is Item WATER present in this transaction(1/0)?
:1
Transaction 2:
Is Item MILK present in this transaction(1/0)? :
1
Is Item BREAD present in this transaction(1/0)? :
1
Is Item COFFEE present in this transaction(1/0)? :
0
Is Item JUICE present in this transaction(1/0)? :
0
Is Item COOKIES present in this transaction(1/0)? :
1
Is Item JAM present in this transaction(1/0)? :
1
Is Item TEA present in this transaction(1/0)?
:0
Is Item BUTTER present in this transaction(1/0)? :


0
Is Item SUGAR present in this transaction(1/0)? :
1
Is Item WATER present in this transaction(1/0)?
:1
Transaction 3:
Is Item MILK present in this transaction(1/0)? :
0
Is Item BREAD present in this transaction(1/0)? :
0
Is Item COFFEE present in this transaction(1/0)? :
1
Is Item JUICE present in this transaction(1/0)? :
1
Is Item COOKIES present in this transaction(1/0)? :
0
Is Item JAM present in this transaction(1/0)? :
0
Is Item TEA present in this transaction(1/0)?
:1
Is Item BUTTER present in this transaction(1/0)? :
1
Is Item SUGAR present in this transaction(1/0)? :
0
Is Item WATER present in this transaction(1/0)?
:0
Number of Item MILK :2
Number of Item BREAD :2
Number of Item COFFEE :2
Number of Item JUICE :2
Number of Item COOKIES :2
Number of Item JAM :2
Number of Item TEA :2
Number of Item BUTTER :2
Number of Item SUGAR :2
Number of Item WATER :2
Item MILK is selected
Item BREAD is selected
Item COFFEE is selected
Item JUICE is selected
Item COOKIES is selected
Item JAM is selected
Item TEA is selected


Item BUTTER is selected
Item SUGAR is selected
Item WATER is selected
Number of Items of MILK & BREAD :2
Number of Items of MILK & COFFEE :1
Number of Items of MILK & JUICE :1
Number of Items of MILK & COOKIES :2
Number of Items of MILK & JAM :2
Number of Items of MILK & TEA :1
Number of Items of MILK & BUTTER :1
Number of Items of MILK & SUGAR :2
Number of Items of MILK & WATER :2
Number of Items of BREAD & COFFEE :1
Number of Items of BREAD & JUICE :1
Number of Items of BREAD & COOKIES :2
Number of Items of BREAD & JAM :2
Number of Items of BREAD & TEA :1
Number of Items of BREAD & BUTTER :1
Number of Items of BREAD & SUGAR :2
Number of Items of BREAD & WATER :2
Number of Items of COFFEE & JUICE :2
Number of Items of COFFEE & COOKIES :1
Number of Items of COFFEE & JAM :1
Number of Items of COFFEE & TEA :2
Number of Items of COFFEE & BUTTER :2
Number of Items of COFFEE & SUGAR :1
Number of Items of COFFEE & WATER :1
Number of Items of JUICE & COOKIES :1
Number of Items of JUICE & JAM :1
Number of Items of JUICE & TEA :2
Number of Items of JUICE & BUTTER :2
Number of Items of JUICE & SUGAR :1
Number of Items of JUICE & WATER :1
Number of Items of COOKIES & JAM :2
Number of Items of COOKIES & TEA :1
Number of Items of COOKIES & BUTTER :1
Number of Items of COOKIES & SUGAR :2
Number of Items of COOKIES & WATER :2
Number of Items of JAM & TEA :1
Number of Items of JAM & BUTTER :1
Number of Items of JAM & SUGAR :2
Number of Items of JAM & WATER :2
Number of Items of TEA & BUTTER :2


Number of Items of TEA & SUGAR :1
Number of Items of TEA & WATER :1
Number of Items of BUTTER & SUGAR :1
Number of Items of BUTTER & WATER :1
Number of Items of SUGAR & WATER :2
Item MILK& BREAD is selected
Item MILK& COOKIES is selected
Item MILK& JAM is selected
Item MILK& SUGAR is selected
Item MILK& WATER is selected
Item BREAD& COOKIES is selected
Item BREAD& JAM is selected
Item BREAD& SUGAR is selected
Item BREAD& WATER is selected
Item COFFEE& JUICE is selected
Item COFFEE& TEA is selected
Item COFFEE& BUTTER is selected
Item JUICE& TEA is selected
Item JUICE& BUTTER is selected
Item COOKIES& JAM is selected
Item COOKIES& SUGAR is selected
Item COOKIES& WATER is selected
Item JAM& SUGAR is selected
Item JAM& WATER is selected
Item TEA& BUTTER is selected
Item SUGAR& WATER is selected
13. Write a program to cluster your choice of data using simple k-means algorithm
using JDK.
PROGRAM:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KMeansClustering {


private static final int NUM_CLUSTERS = 3;
public static void main(String[] args) {
// Create a list of data points
List<Point> dataPoints = new ArrayList<>();
dataPoints.add(new Point(1, 2));
dataPoints.add(new Point(3, 4));
dataPoints.add(new Point(5, 6));
dataPoints.add(new Point(7, 8));
dataPoints.add(new Point(9, 10));

// Initialize the cluster centroids


List<Point> clusterCentroids = new ArrayList<>();
for (int i = 0; i < NUM_CLUSTERS; i++)
{
clusterCentroids.add(new Point(Math.random() * 10, Math.random() * 10));
}

// Assign each data point to the closest cluster centroid


for (Point dataPoint : dataPoints) {
    int closestClusterIndex = 0;
    double closestDistance = Double.MAX_VALUE;
    for (int i = 0; i < NUM_CLUSTERS; i++) {
        double distance = dataPoint.distanceTo(clusterCentroids.get(i));
        if (distance < closestDistance) {
            closestClusterIndex = i;
            closestDistance = distance;
        }
    }
    dataPoint.setClusterIndex(closestClusterIndex);
}

// Update the cluster centroids by averaging the points assigned to each cluster
for (int i = 0; i < NUM_CLUSTERS; i++) {
    Point clusterCentroid = new Point(0, 0);
    int count = 0;
    for (Point dataPoint : dataPoints) {
        if (dataPoint.getClusterIndex() == i) {
            clusterCentroid.add(dataPoint);
            count++;
        }
    }
    if (count > 0) {
        // divide by the number of points in this cluster, not by the whole dataset
        clusterCentroid.divide(count);
        clusterCentroids.set(i, clusterCentroid);
    }
}

// Print the final cluster centroids


for (Point clusterCentroid : clusterCentroids) {
System.out.println(clusterCentroid);
}
}

private static class Point {


private double x;
private double y;
private int clusterIndex;

public Point(double x, double y) {


this.x = x;
this.y = y;
this.clusterIndex = -1;
}

public double distanceTo(Point otherPoint) {


return Math.sqrt((x - otherPoint.x) * (x - otherPoint.x) + (y - otherPoint.y) * (y
- otherPoint.y));
}

public void add(Point otherPoint) {


this.x += otherPoint.x;
this.y += otherPoint.y;
}

public void divide(int divisor) {


this.x /= divisor;
this.y /= divisor;
}

public int getClusterIndex() {


return clusterIndex;
}
public void setClusterIndex(int clusterIndex) {


this.clusterIndex = clusterIndex;
}

@Override
public String toString() {
return "(" + x + ", " + y + ")";
}
}
}

OUTPUT:
(0.8, 1.2)
(3.2, 3.6)
(1.0, 1.2)
14. Write a program of cluster analysis using simple k-means algorithm in Python
programming language.
PROGRAM:
# importing libraries
import numpy as np
import matplotlib.pyplot as plt

# Generate some data


data = np.random.rand(100, 2)

# Choose the number of clusters


k=3

# Randomly select the centroids


centroids = data[np.random.choice(len(data), k, replace=False)]

# Assign each data point to a cluster


clusters = np.zeros(len(data))
for i in range(len(data)):
distances = np.linalg.norm(data[i] - centroids, axis=1)
clusters[i] = np.argmin(distances)

# Update the centroids


for i in range(k):
centroids[i] = np.mean(data[clusters == i], axis=0)

# Plot the data


plt.scatter(data[:, 0], data[:, 1], c=clusters)
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='x')
plt.show()
OUTPUT:

OBSERVATION:
This program generates 100 random data points and clusters them with k = 3 using a
single assignment-and-update pass of the k-means algorithm (the full algorithm repeats
these two steps until the centroids stop changing). It then plots the data points,
coloured by cluster, together with the centroids marked with an 'x'.
15. Write a program to compute/display dissimilarity matrix (for your own dataset
containing at least four instances with two attributes) using Python.
PROGRAM:
# importing library
import numpy as np

# Create the dataset


dataset = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Compute the dissimilarity matrix


dissimilarity_matrix = np.zeros((dataset.shape[0], dataset.shape[0]))
for i in range(dataset.shape[0]):
for j in range(dataset.shape[0]):
dissimilarity_matrix[i, j] = np.sqrt(np.sum((dataset[i] - dataset[j])**2))

# Display the dissimilarity matrix


print(dissimilarity_matrix)

OUTPUT:
[[0. 2.82842712 5.65685425 8.48528137]
[2.82842712 0. 2.82842712 5.65685425]
[5.65685425 2.82842712 0. 2.82842712]
[8.48528137 5.65685425 2.82842712 0. ]]

OBSERVATION:
This program first creates a dataset containing four instances with two attributes.
Then, it computes the dissimilarity matrix using the Euclidean distance formula.
Finally, it displays the dissimilarity matrix.
16. Visualize the datasets using matplotlib in python. (Histogram, Box plot, Bar
chart, Pie chart etc.,)
PROGRAM:
LINE CHART
import matplotlib.pyplot as plt

# initializing the data


x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# plotting the data


plt.plot(x, y)

# Adding title to the plot


plt.title("Line Chart")

# Adding label on the y-axis


plt.ylabel('Y-Axis')

# Adding label on the x-axis


plt.xlabel('X-Axis')

# Show the plot


plt.show()

OUTPUT:
HISTOGRAM
import matplotlib.pyplot as plt
import numpy as np

# Create a random dataset


data = np.random.normal(100, 25, 200)

# Create a histogram
plt.hist(data)

# Show the plot


plt.show()

OUTPUT:
BOXPLOT
import matplotlib.pyplot as plt
import numpy as np

# Create a random dataset


data = np.random.normal(100, 25, 200)

# Create a box plot


plt.boxplot(data)

# Show the plot


plt.show()

OUTPUT:
BAR CHART
import matplotlib.pyplot as plt
import numpy as np

# Create a random dataset


data = np.random.randint(0, 100, 10)

# Create a bar chart


plt.bar(range(len(data)), data)

# Show the plot


plt.show()

OUTPUT:
PIE CHART
import matplotlib.pyplot as plt
import numpy as np

# Create a random dataset


data = np.random.randint(0, 100, 10)

# Create a pie chart


plt.pie(data)

# Show the plot


plt.show()

OUTPUT: