ADL LAB Manual
AIM: To study concept of clustering using partitioning techniques like K-means/ K-medoids algorithm.
PROBLEM STATEMENT: Write a java/ Python program to implement K-means algorithm to cluster the
given data.
PREREQUISITES:
1. Knowledge of java/ python programming and knowledge of clustering
COURSE OBJECTIVE:
1. To understand clustering concept of data mining.
COURSE OUTCOMES:
1. To solve problems of clustering: the given data set is organized into a given number of clusters.
THEORY:
The K-means clustering algorithm is a simple method for estimating the mean (vectors) of a set of
K-groups. In short, it is an algorithm to classify or to group your objects based on attributes /
features into K number of groups. K is positive integer number. The grouping is done by
minimizing the sum of squares of distances between data and the corresponding cluster centroid.
Thus, the purpose of K-means is to classify the data.
As a simple illustration of a k-means algorithm, consider the following data set consisting of the
scores of two variables on each of seven individuals:
Subject    A      B
1          1.0    1.0
2          1.5    2.0
3          3.0    4.0
4          5.0    7.0
5          3.5    5.0
6          4.5    5.0
7          3.5    4.5
This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let the A & B values of the two individuals furthest apart (using the Euclidean distance measure) define the initial cluster means, giving:
           Individual    Mean Vector (centroid)
Group 1    1             (1.0, 1.0)
Group 2    4             (5.0, 7.0)
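For instance, the Euclidean distance between individuals 1 and 4 is sqrt((5.0 - 1.0)^2 + (7.0 - 1.0)^2) = sqrt(52) ≈ 7.2, which is the largest pairwise distance in the data set, so these two individuals seed the initial centroids.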
The remaining individuals are now examined in sequence and allocated to the cluster to which they are
closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a
new member is added. This leads to the following series of steps:
Step    Cluster 1    Mean Vector (centroid)    Cluster 2      Mean Vector (centroid)
5       1, 2, 3      (1.8, 2.3)                4, 5, 6        (4.3, 5.7)
6       1, 2, 3      (1.8, 2.3)                4, 5, 6, 7     (4.1, 5.4)
Now the initial partition has changed, and the two clusters at this stage have the following characteristics:
             Individual     Mean Vector (centroid)
Cluster 1    1, 2, 3        (1.8, 2.3)
Cluster 2    4, 5, 6, 7     (4.1, 5.4)
But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare
each individual’s distance to its own cluster mean and to that of the opposite cluster. And we find:
Individual    Distance to mean (centroid) of Cluster 1    Distance to mean (centroid) of Cluster 2
1             1.5                                         5.4
2             0.4                                         4.3
3             2.1                                         1.8
4             5.7                                         1.8
5             3.2                                         0.7
6             3.8                                         0.6
7             2.8                                         1.1
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than its own (Cluster 1). In other
words, each individual's distance to its own cluster mean should be smaller than the distance to the other
cluster's mean (which is not the case with individual 3). Thus, individual 3 is relocated to Cluster 2
resulting in the new partition:
             Individual       Mean Vector (centroid)
Cluster 1    1, 2             (1.3, 1.5)
Cluster 2    3, 4, 5, 6, 7    (3.9, 5.1)
The iterative relocation would now continue from this new partition until no more relocations occur.
However, in this example each individual is now nearer its own cluster mean than that of the other cluster
and the iteration stops, choosing the latest partitioning as the final cluster solution. Also, it is possible that
the k-means algorithm won't find a final solution. In this case it would be a good idea to stop the algorithm after a pre-chosen maximum number of iterations.
IMPLEMENTATION:
MATHEMATICAL MODEL
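As a brief sketch of the objective described in the theory above, K-means partitions the data into K clusters C_1, ..., C_K with centroids \mu_k so as to minimize the within-cluster sum of squared distances:

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

Each iteration alternates between assigning every point to its nearest centroid and recomputing each centroid as the mean of the points assigned to it; J never increases, which is why the algorithm converges to a local optimum.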
Program/Code/queries:
Here's a step-by-step implementation of Data Clustering using K-means Algorithm in
Python: # Step 1: Import necessary libraries
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
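The manual lists only the import step; the remaining steps are not shown. A minimal sketch completing the example, assuming the seven-point data set from the theory section and K = 2 clusters, could look like this:

# Step 2: Prepare the data (the seven subjects from the theory section)
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

# Step 3: Fit K-means with K = 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Step 4: Inspect the centroids and cluster assignments
print("Centroids:", kmeans.cluster_centers_)
print("Labels:", labels)

# Step 5: Visualize the clusters and centroids
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x')
plt.xlabel('A')
plt.ylabel('B')
plt.show()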
CONCLUSION:
The given data set is clustered into the given number of K clusters. The algorithm is guaranteed to converge, but only to a local optimum, not necessarily the global optimum.
Experiment No 9
PREREQUISITES:
1. Knowledge of java/ python programming and knowledge of clustering
COURSE OBJECTIVE:
1. To understand clustering concept of data mining.
COURSE OUTCOMES:
1. To implement and understand the concept of data classification using the K-NN approach.
THEORY:
The k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and
regression. In both cases, the input consists of the k closest training examples in the feature space. The output
depends on whether k-NN is used for classification or regression:
In k-NN classification, the output is a class membership. An object is classified by a majority vote of its
neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a
positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest
neighbor.
In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
When to use the KNN algorithm? KNN can be used for both classification and regression predictive problems. However, it is more widely used for classification problems in industry. To evaluate any technique we generally look at three important aspects:
1. Ease to interpret output
2. Calculation time
3. Predictive Power
How does the KNN algorithm work?
Let’s take a simple case to understand this algorithm. Following is a spread of red circles (RC) and green
squares (GS) :
You intend to find out the class of the blue star (BS). BS can be either RC or GS and nothing else. The "K" in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. Let's say K = 3. Hence, we now draw a circle with BS as its center, just big enough to enclose only three data points on the plane. Refer to the following diagram for more details:
MITADT University, Pune
The three closest points to BS are all RC. Hence, with a good confidence level we can say that BS should belong to the class RC. Here, the choice is very obvious, as all three votes from the closest neighbors went to RC. The choice of the parameter K is crucial in this algorithm. Next we will understand the factors to be considered in order to choose the best K.
How do we choose the ‘K’ factor?
First, let us try to understand what exactly K influences in the algorithm. In the last example, given that all six training observations remain constant, a given K value lets us draw boundaries for each class. These boundaries segregate RC from GS. In the same way, let's see the effect of the value of K on the class boundaries. The following are the different boundaries separating the two classes for different values of K.
If you watch carefully, you can see that the boundary becomes smoother with an increasing value of K. As K increases to infinity, the prediction finally becomes all blue or all red depending on the overall majority. The training error rate and the validation error rate are the two quantities we need to assess for different values of K. The following is the curve of the training error rate for varying values of K:
As you can see, the error rate at K=1 is always zero for the training sample. This is because the closest point to any training data point is itself, hence the prediction is always accurate with K=1. Had the validation error curve been similar, our choice of K would have been 1. The following is the validation error curve for varying values of K:
This makes the story clearer. At K=1, we were overfitting the boundaries. Hence, the error rate initially decreases and reaches a minimum; after the minimum point, it increases with increasing K. To get the optimal value of K, segregate the initial dataset into training and validation sets, then plot the validation error curve to find the optimal value of K. This value of K should be used for all predictions.
Code :
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load dataset
data = load_iris()
X, y = data.data, data.target
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Feature scaling (optional but recommended)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Create KNN model
k = 5 # You can tune this value
knn = KNeighborsClassifier(n_neighbors=k)
# Train the model
knn.fit(X_train, y_train)
# Make predictions
y_pred = knn.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
CONCLUSION:
Thus it is understood that the KNN algorithm is one of the simplest classification algorithms. Even with such simplicity, it can give highly competitive results. The KNN algorithm can also be used for regression problems; the only difference from the methodology discussed is that it uses the average of the nearest neighbors rather than a vote of the nearest neighbors. KNN can be coded in a single line in R.
Experiment No 10 A
Title: Decision Tree Induction in data mining
Aim: Diabetes Diagnosis using Decision Tree Induction.
Software required: Python's Scikit-learn library and a sample dataset.
Theory:
Decision tree induction is a data mining technique used for supervised learning tasks, primarily classification
and regression. It involves constructing a tree-like structure (known as a decision tree) from the given dataset,
where each internal node represents a feature (or attribute), each branch represents a decision rule, and each
leaf node represents an outcome (class label or numerical value).
Key Concepts in Decision Tree Induction:
1. Root Node: The topmost node in the decision tree, representing the most significant feature that best
splits the dataset based on a specific criterion (e.g., information gain, Gini impurity).
2. Internal Nodes: Nodes in the decision tree that represent features or attributes. Internal nodes are used
to partition the dataset into subsets based on different attribute values.
3. Branches: The edges connecting nodes in the decision tree, representing decision rules or conditions
that guide the traversal from the root node to leaf nodes based on attribute tests.
4. Leaf Nodes: Terminal nodes in the decision tree that represent the final outcomes or class labels. Each
leaf node corresponds to a specific class label or numerical value, indicating the predicted outcome for
instances that satisfy the conditions along the path from the root node to the leaf node.
5. Decision Rule: Criteria used to determine the attribute and value for splitting the dataset at each internal
node, such as maximizing information gain, minimizing impurity, or other optimization criteria.
Steps in Decision Tree Induction:
1. Attribute Selection: Identify the most informative attributes (features) for partitioning the dataset
based on criteria like information gain, gain ratio, Gini impurity, or entropy.
2. Tree Construction: Recursively partition the dataset into subsets based on the selected attributes and
values. Create internal nodes for each attribute and leaf nodes for each class label or outcome.
3. Tree Pruning: Optimize the decision tree by pruning unnecessary branches or nodes to improve
generalization, reduce overfitting, and enhance predictive accuracy on unseen data.
4. Tree Evaluation: Evaluate the decision tree's performance using metrics like accuracy, precision,
recall, F1-score, or confusion matrix on a validation or test dataset to assess its effectiveness in
classifying instances and generalizing patterns.
Applications of Decision Tree Induction:
1. Classification: Predicting categorical class labels based on input features, such as identifying customer
segments, classifying email as spam or non-spam, or diagnosing medical conditions.
2. Regression: Estimating numerical values or predicting continuous outcomes, such as forecasting sales,
predicting house prices, or evaluating risk factors.
3. Feature Selection: Identifying relevant features or attributes that contribute most to the target variable
and simplifying complex datasets by focusing on essential predictors.
4. Pattern Recognition: Discovering meaningful patterns, relationships, or rules within datasets to
support decision-making, insights generation, and knowledge discovery.
Decision tree induction is a fundamental data mining technique that facilitates the creation of interpretable,
rule-based models for classification and regression tasks. By partitioning datasets, selecting informative
attributes, and constructing hierarchical tree structures, decision trees provide a transparent, intuitive approach
to analysing data, making predictions, and extracting valuable insights from diverse domains, including
business, healthcare, finance, and engineering.
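For reference, the attribute-selection criteria mentioned above are normally computed with the following standard formulas, where p_i is the proportion of instances in data set D that belong to class i, and D_v is the subset of D for which attribute A takes value v:

Entropy(D) = -\sum_i p_i \log_2 p_i
Gini(D) = 1 - \sum_i p_i^2
Gain(D, A) = Entropy(D) - \sum_v \frac{|D_v|}{|D|} \, Entropy(D_v)

The attribute with the highest information gain (or the lowest resulting impurity) is chosen for the split at each node.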
Program/Code/Queries:
Implementing a Diabetes Diagnosis system using Decision Tree Induction involves building a predictive
model to classify patients as diabetic or non-diabetic based on relevant features or attributes such as glucose
level, blood pressure, BMI, age, etc.
Here's a step-by-step guide to creating a simple diabetes diagnosis model using Python's Scikit-learn library
and a sample dataset:
Step 1: Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
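The remaining steps are not reproduced in the manual. A minimal sketch completing the guide is given below; it assumes a hypothetical diabetes.csv file laid out like the Pima Indians Diabetes dataset, with feature columns (glucose level, blood pressure, BMI, age, etc.) and an Outcome column where 1 = diabetic and 0 = non-diabetic:

# Step 2: Load the dataset (assumed file name and column layout)
df = pd.read_csv('diabetes.csv')
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Step 3: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Train the decision tree classifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=42)
clf.fit(X_train, y_train)

# Step 5: Evaluate the model on the test set
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))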
Result/Conclusion:
By following the above steps, we can implement a Diabetes Diagnosis system using Decision Tree Induction
in Python. Ensure that we have a relevant dataset with features such as glucose level, blood pressure, BMI,
age, etc., to train and evaluate the model effectively. Additionally, consider optimizing the model, performing
feature selection, and incorporating domain knowledge or additional preprocessing techniques to enhance the
model's performance, interpretability, and reliability for diagnosing diabetes based on patient data.
Experiment No. 10 B
Outcomes:
1.Scalable and distributed storage for high availability.
2.Low-latency queries for product information, customer details, and order history.
Theory:
Designing a database schema for an application using Cassandra Query Language (CQL) involves considering
the characteristics of Cassandra, which is a NoSQL database designed for horizontal scalability, high
availability, and fault tolerance.
Here are some key principles and considerations when creating a database schema for a Cassandra-based
application:
∙ 1. Denormalization:
∙ Cassandra favours denormalization: data is duplicated across tables so that each query can be served from a single table, since joins are not supported.
∙ Cassandra uses a process called compaction to merge and organize data on disk. It is important to understand the compaction strategy and tune it according to your application's needs.
∙ 7. Materialized Views:
∙ Cassandra supports materialized views, which allow you to create alternative views of your data. This can be useful for handling different query patterns without requiring complex query logic.
∙ 8. Time-Series Data:
∙ If your application involves time-series data, consider using time-based strategies such as time window compaction or time bucketing to optimize data storage and retrieval.
∙ Use appropriate data types for your columns. Cassandra supports various data types, including collections (lists, sets, maps), which can be useful for modeling certain types of data.
Here are key theoretical aspects and principles associated with Cassandra and CQL:
∙ Distributed Architecture:
∙ Cassandra is designed to be distributed, allowing it to scale horizontally by adding more nodes to the cluster. This distributed architecture provides fault tolerance and high availability.
∙ Peer-to-Peer Model:
∙ Cassandra follows a peer-to-peer model where all nodes in the cluster have equal status. Each node can accept read and write requests, and there is no single point of failure.
∙ No Single Point of Failure:
∙ Cassandra is built to ensure high availability and fault tolerance. Data is replicated across multiple nodes, and if one node fails, the system can continue to operate with the remaining nodes.
∙ CAP Theorem:
∙ Cassandra adheres to the CAP theorem, which states that in a distributed system it is impossible to simultaneously provide all three of the following guarantees: Consistency, Availability, and Partition tolerance. Cassandra is designed to provide high Availability and Partition tolerance, making it an AP system.
∙ Eventual Consistency:
∙ Cassandra provides eventual consistency, meaning that given enough time and assuming no further updates, all replicas of a piece of data will converge to the same value. This model allows for high availability and performance in distributed environments.
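The manual does not include a concrete schema, so the following is only a possible sketch, assuming a locally running Cassandra node, the DataStax cassandra-driver package for Python, and hypothetical keyspace/table names chosen to match the stated outcomes (product information, customer details, order history). The table is denormalized and partitioned by customer so that the order-history query hits a single partition:

import uuid
from cassandra.cluster import Cluster

# Connect to a local Cassandra node (hypothetical address)
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# Keyspace with simple replication (single data centre, for illustration only)
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Denormalized table: order history partitioned by customer, newest order first
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.orders_by_customer (
        customer_id uuid,
        order_date timestamp,
        order_id uuid,
        product_name text,
        total decimal,
        PRIMARY KEY (customer_id, order_date)
    ) WITH CLUSTERING ORDER BY (order_date DESC)
""")

# Low-latency query: all orders for one (hypothetical) customer, newest first
customer_id = uuid.uuid4()
rows = session.execute(
    "SELECT order_id, order_date, product_name, total "
    "FROM shop.orders_by_customer WHERE customer_id = %s",
    [customer_id]
)
for row in rows:
    print(row.order_id, row.product_name, row.total)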
Conclusion:
Hence, we have successfully studied the design of a database schema for an application using Cassandra (CQL).
Experiment No. 10 C
Title: Implementation of DynamoDB queries
Theory:
∙ Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that require
consistent single-digit millisecond latency at any scale.
∙ It is a fully managed database that supports both document and key-value data models.
∙ Its flexible data model and performance make it a great fit for mobile, web, gaming, ad-tech, IoT, and many other applications.
∙ DynamoDB allows users to create databases capable of storing and retrieving any amount of data, and
serving any amount of traffic.
∙ It automatically distributes data and traffic over servers to dynamically manage each customer's
requests, and also maintains fast performance.
∙ The DynamoDB Environment only consists of using your Amazon Web Services account to access the
DynamoDB GUI console, however, you can also perform a local install.
∙ The AWS (Amazon Web Service) provides a version of DynamoDB for local installations.
∙ It also reduces provisioned throughput, data storage, and transfer fees by allowing a local database.
Working Environment -
You can use a JavaScript shell, a GUI console, and multiple languages to work with DynamoDB. The
languages available include Ruby, Java, Python, C#, Erlang, PHP, and Perl.
∙ Tables
∙ Each table in Amazon DynamoDB contains one or more items. Items are made up of a group of attributes that are uniquely identifiable.
∙ Attributes
∙ Attributes in AWS DynamoDB are fundamental data elements or values that reside in an item, equivalent to the data values that reside in a particular cell of a table in a relational database.
∙ Using DynamoDB, developers can easily develop scalable cloud-based applications.
∙ AWS can easily achieve data retrieval in single-digit milliseconds.
∙ DevOps need not worry about managing the high availability and durability of data because DynamoDB automatically replicates it synchronously across multiple AWS Availability Zones (AZs).
∙ DynamoDB can be provisioned according to the number of write units and read units allocated.
∙ The user's database table always remains available based on provisioned throughput requirements like read-write units per second.
∙ DynamoDB utilizes JSON as a transport protocol.
∙ Control Plane (It is responsible for creating and managing DynamoDB tables)
∙ Create Table
∙ Describe Table
∙ List Table
∙ Delete Table
∙ Data Plane (It consists of ‘CRUD’ operation, i.e. Create, Read, Update & Delete)
∙ Creating Data
∙ PutItem
∙ BatchWriteItem
∙ Reading Data
∙ GetItem
∙ BatchGetItem
∙ Query
∙ Scan
∙ Updating Data
∙ UpdateItem
∙ Deleting Data
∙ DeleteItem
∙ BatchWriteItem
∙ DynamoDB Stream
∙ ListStream
∙ DescribeStream
∙ GetShardIterator
∙ GetRecords
When you create a table, in addition to the table name, you have to specify the primary key of the table. The
primary key uniquely identifies each item in the table, so that no two items can have the same key.
∙ Partition key – A simple primary key, composed of one attribute known as the partition key. DynamoDB uses the partition key's value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored. An important rule for a partition key is that in a table that has only a partition key, no two items can have the same partition key value. The People table described in Tables, Items, and Attributes is an example of a table with a simple primary key (PersonID). You can access any item in the People table directly by providing the PersonID value for that item.
∙ Partition key and sort key – Referred to as a composite primary key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key.
DynamoDB uses the partition key value as input to an internal hash function. The output from the hash
function determines the partition (physical storage internal to DynamoDB) in which the item will be
stored. All items with the same partition key value are stored together, in sorted order by sort key
value.
A DynamoDB table must have a primary key. There are two possible types to choose from:
1. Partition Key — Single Attribute —which will just be a field in your data source that uniquely
represents the row (e.g., an auto-generated, unique product ID).
2. Partition Key & Sort Key — Composite Key — which will be a combo of two attributes that will
uniquely identify the row, and how the data should naturally be sorted (e.g., Unique product ID and
purchase date timestamp)
∙ Your DynamoDB partition key should be unique and sparse, as this key is hashed internally and used to distribute the data for storage.
∙ This is a similar technique to Redshift and HBase that prevents hot-spotting of data.
∙ If using a composite key, then two items can have the same Partition Key, but the Sort Key must be
unique.
∙ This will mean all items with the same Partition key will be stored together but sorted in ascending
order using the Sort Key.
∙ Amazon DynamoDB is a NoSQL managed database service provided by Amazon that stores semi-structured data like key-value pairs.
∙ A DynamoDB table consists of items. Each item consists of one partition key and one or more attributes. An example of an item is given below:
Example:
{
    "MovieID": 101,
    "Name": "The Shawshank Redemption",
    "Rating": 9.2,
    "Year": 1994
}
In the above example, MovieID is the partition key.
∙ A partition key is used to differentiate between items. A query operation in DynamoDB finds items
based on primary key values.
∙ The name of the partition key attribute and a single value for that attribute must be provided. The
query returns all items searched against that partition key value.
Step 3: You can view your table being created. Click on “Overview” to understand your table, click on
“Items” to edit, insert and query on the table. There are many more options you can use to understand your
table better.
How to insert items into a DynamoDB table?
Step 1: Navigate to “Items” and click on “Create item.“
Step 2: It will open a JSON file where you can add different items. Click on the “+” symbol and select
“Append” and select what type of data you want to enter.
Step 3: This is what it looks like after adding multiple columns to your table. Click on “Save“.
Step 4: Since it is a NoSQL architecture, you can play around with the columns you add to the table. E.g.,
“Position.“
Step 5: This is how your table will look once you have inserted the data.
Source Code –
Let's assume we're building a simple application for managing books and authors. The application needs to
support queries for retrieving books by title, author, and publication year.
# Define the table for books
Books (
    # Primary Key
    BookID UUID,            # Partition key - unique identifier for each book
    # Attributes
    Title STRING,
    AuthorID UUID,
    PublicationYear INT,
    Genre STRING,
    Summary STRING,
    # Secondary Indexes
    GSI1 (
        # Global Secondary Index for querying by AuthorID
        AuthorID UUID
    ),
    GSI2 (
        # Global Secondary Index for querying by Title
        Title STRING
    ),
    GSI3 (
        # Global Secondary Index for querying by PublicationYear
        PublicationYear INT
    )
);

# Define the table for authors
Authors (
    # Primary Key
    AuthorID UUID,          # Partition key - unique identifier for each author
    # Attributes
    FirstName STRING,
    LastName STRING,
    BirthYear INT,
    # Secondary Index
    GSI (
        # Global Secondary Index for querying by LastName
        LastName STRING
    )
);
Explanation:
1. Book Table:
∙ Primary Key: BookID (UUID) - Unique identifier for each book.
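Since DynamoDB tables are queried through the AWS API rather than SQL, a possible sketch of the application's queries is shown below using Python and boto3. It assumes AWS credentials are already configured, that the table and index names above (Books, GSI1, GSI3) exist, and that the key values and region are placeholders:

import boto3
from boto3.dynamodb.conditions import Key

# Connect to DynamoDB (region name is an assumption)
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
books = dynamodb.Table('Books')

# 1. Retrieve a single book by its partition key (placeholder id)
response = books.get_item(Key={'BookID': 'b1f2c3d4-0000-0000-0000-000000000001'})
print(response.get('Item'))

# 2. Retrieve all books by a given author via the AuthorID global secondary index
response = books.query(
    IndexName='GSI1',
    KeyConditionExpression=Key('AuthorID').eq('a1b2c3d4-0000-0000-0000-000000000002')
)
for item in response['Items']:
    print(item['Title'], item['PublicationYear'])

# 3. Retrieve all books published in a given year via the PublicationYear index
response = books.query(
    IndexName='GSI3',
    KeyConditionExpression=Key('PublicationYear').eq(1994)
)
print(response['Items'])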
Conclusion –
We have implemented DynamoDB queries on a single database schema.
Practical No. 10 D
Title: Data Mining Tools
Aim: Study of Data Mining tools using WEKA / ORANGE
Theory :-
Data Mining is the set of techniques that utilize specific algorithms, statistical analysis, artificial intelligence, and database systems to analyze data from different dimensions and perspectives.
Data Mining tools have the objective of discovering patterns/trends/groupings among large sets of data and
transforming data into more refined information.
Orange :-
∙ Orange is a framework for data visualization, machine learning, and data mining with a front-end for
visual programming.
∙ It has been around since 1996 and is free software. The analysis is achieved by connecting widgets that
perform various functions, such as reading files, displaying statistics on features, constructing models,
evaluating, etc.
∙ Moreover, if you intend to dig deeper into finer tuning, Orange is also available as a Python library. For programmers, analysts, and data mining experts, Orange supports a versatile domain through Python, a modern scripting language and programming environment in which data mining scripts can be simple yet efficient.
∙ For easy implementation, Orange uses a component-based approach: much like assembling building blocks, or even reusing an existing algorithm, we can apply our own research technique.
∙ Orange is a great software package for machine learning and data mining.
Advantages :
1. Open-source software is cost-effective.
2. Constant improvements are a hallmark of open-source software.
3. Visual Programming
4. Interactive Data Visualization
5. Add-ons Extended Functionality
Disadvantages :
1. Open-source software might not stick around.
2. Manual Troubleshooting
3. Advance analysis is not so easy
4. Support isn’t always reliable.
5. Security becomes a major issue.
∙ Orange scripting:
If we want to access Orange objects, we need to write our own components and design our test schemes and machine learning applications through scripts. Orange interfaces to Python, a simple-to-use scripting language with clear and powerful syntax and a broad set of additional libraries. As with any scripting language, Python can be used to test a few ideas interactively or to develop more detailed scripts and programs.
import orange
data1 = orange.ExampleTable('voting.tab')
print('Instances:', len(data1))
print('Attributes:', len(data1.domain.attributes))
If we store this script in script.py and run it with the shell command "python script.py" (making sure that the data file is in the same directory), we get:
Instances: 435
Attributes: 16
Let us extend the script to use the same data to build a naïve Bayesian classifier and print the classification of the first five instances:
model = orange.BayesLearner(data1)
for i in range(5):
print(model(data1[i]))
It is easy to produce the classification model: we called Orange's BayesLearner object and gave it the data set. It returned another object (a naïve Bayesian classifier) which, when given an instance, returns the label of the most probable class.
∙ WEKA :-
Weka contains a collection of visualization tools and algorithms for data analysis and predictive
modelling, together with graphical user interfaces for easy access to these functions. The original non-Java
version of Weka was a Tcl/Tk front-end to (mostly third-party) modelling algorithms implemented in
other programming languages, plus data preprocessing utilities in C and a makefile-based system for
running machine learning experiments.
Weka supports several standard data mining tasks, specifically data preprocessing, clustering, classification, regression, visualization, and feature selection. Input to Weka is expected to be formatted according to the Attribute-Relation File Format, in a file with the .arff extension.
∙ Features of Weka
1. Preprocess
The preprocessing of data is a crucial task in data mining. Because most of the data is raw, there are chances
that it may contain empty or duplicate values, have garbage values, outliers, extra columns, or have a different
naming convention. All these things degrade the results.
2. Classify
Classification is one of the essential functions in machine learning, where we assign classes or categories to
items. The classic examples of classification are: declaring a brain tumour as "malignant" or "benign" or
assigning an email to a "spam" or "not_spam" class.
After selecting the desired classifier, we select test options for the training set. Some of the options are:
∙ Use training set: the classifier is evaluated on the same training set.
∙ Supplied test set: the classifier is evaluated on a separate, supplied test set.
∙ Cross-validation folds: the classifier is evaluated by cross-validation, using the number of folds provided.
3. Cluster
In clustering, a dataset is arranged into different groups/clusters based on some similarities. In this case, the items within the same cluster are similar to each other but different from those in other clusters. Examples of clustering include
identifying customers with similar behaviours and organizing the regions according to homogenous land use.
4. Associate
Association rules highlight all the associations and correlations between items of a dataset. In short, an association rule is an if-then statement that depicts the probability of relationships between data items. A classic example of association is the connection between the sale of milk and bread.
5. Select Attributes
Every dataset contains a lot of attributes, but several of them may not be significantly valuable. Therefore,
removing the unnecessary and keeping the relevant details are very important for building a good model.
Many attribute evaluators and search methods are available, including BestFirst, GreedyStepwise, and Ranker.
6. Visualize
In the visualize tab, different plot matrices and graphs are available to show the trends and errors identified by
the model.
As shown in the screenshot above, five options are available in the Applications category.
∙ The Explorer is the central panel where most data mining tasks are performed. We will further explore this panel in upcoming sections.
∙ The tool provides an Experimenter panel, in which we can design and run experiments.
∙ WEKA provides the KnowledgeFlow panel. It provides an interface to drag and drop components, connect them to form a knowledge flow, and analyze the data and results.
∙ The Simple CLI panel provides command line access to WEKA. For example, to fire up the ZeroR classifier on the arff data, we run from the command line:
java weka.classifiers.rules.ZeroR -t iris.arff
Numeric (integer and real), string, date, and relational are the only four datatypes provided by WEKA.
By default, WEKA supports the ARFF format. ARFF, the attribute-relation file format, is an ASCII format that describes a list of instances sharing a set of attributes. Every ARFF file has two sections: header and data.
∙ The header section contains the relation name and the declarations of the attributes.
∙ The data section contains a comma-separated list of values for those attributes.
WEKA provides many algorithms for machine learning tasks. Because of their core nature, all the
algorithms are divided into several groups. These are available under the Explorer tab of the WEKA. Let's
look at those groups and their core nature:
∙ Bayes: consists of algorithms based on Bayes theorem like Naive Bayes
∙ functions: comprises the algorithms that estimate a function, including Linear Regression
∙ lazy: covers all algorithms that use lazy learning similar to KStar, LWL
∙ meta: consists of those algorithms that use or integrate multiple algorithms for their work like Stacking,
Bagging
∙ misc: miscellaneous algorithms that do not fit any of the given categories
Conclusion :-
The data mining tools Orange and WEKA have been studied.