ADL LAB Manual
AIM: To study concept of clustering using partitioning techniques like K-means/ K-medoids algorithm.
PROBLEM STATEMENT: Write a java/ Python program to implement K-means algorithm to cluster the
given data.
PREREQUISITES:
1. Knowledge of java/ python programming and knowledge of clustering
COURSE OBJECTIVE:
1. To understand clustering concept of data mining.
COURSE OUTCOMES:
1. To solve problems of clustering: the given data set is organized into a given number of clusters.
THEORY:
The K-means clustering algorithm is a simple method for estimating the mean (vectors) of a set of
K-groups. In short, it is an algorithm to classify or to group your objects based on attributes /
features into K number of groups. K is positive integer number. The grouping is done by
minimizing the sum of squares of distances between data and the corresponding cluster centroid.
Thus, the purpose of K-means is to classify the data.
As a simple illustration of a k-means algorithm, consider the following data set consisting of the
scores of two variables on each of seven individuals:
Subject    A      B
1          1.0    1.0
2          1.5    2.0
3          3.0    4.0
4          5.0    7.0
5          3.5    5.0
6          4.5    5.0
7          3.5    4.5
This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let the A & B values of the two individuals furthest apart (using the Euclidean distance measure) define the initial cluster means, giving:
           Individual    Mean Vector (centroid)
Group 1    1             (1.0, 1.0)
Group 2    4             (5.0, 7.0)
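For instance, the Euclidean distance between individuals 1 and 4 is sqrt((5.0 - 1.0)^2 + (7.0 - 1.0)^2) = sqrt(52) ≈ 7.2, which is the largest pairwise distance in the data set, so these two individuals seed the initial centroids.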
The remaining individuals are now examined in sequence and allocated to the cluster to which they are
closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a
new member is added. This leads to the following series of steps:
Step    Cluster 1    Mean Vector (centroid)    Cluster 2      Mean Vector (centroid)
5       1, 2, 3      (1.8, 2.3)                4, 5, 6        (4.3, 5.7)
6       1, 2, 3      (1.8, 2.3)                4, 5, 6, 7     (4.1, 5.4)
Now the initial partition has changed, and the two clusters at this stage have the following characteristics:
             Individual     Mean Vector (centroid)
Cluster 1    1, 2, 3        (1.8, 2.3)
Cluster 2    4, 5, 6, 7     (4.1, 5.4)
But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare
each individual’s distance to its own cluster mean and to that of the opposite cluster. And we find:
Individual    Distance to mean (centroid) of Cluster 1    Distance to mean (centroid) of Cluster 2
1             1.5                                         5.4
2             0.4                                         4.3
3             2.1                                         1.8
4             5.7                                         1.8
5             3.2                                         0.7
6             3.8                                         0.6
7             2.8                                         1.1
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than its own (Cluster 1). In other
words, each individual's distance to its own cluster mean should be smaller than the distance to the other
cluster's mean (which is not the case with individual 3). Thus, individual 3 is relocated to Cluster 2
resulting in the new partition:
             Individual       Mean Vector (centroid)
Cluster 1    1, 2             (1.3, 1.5)
Cluster 2    3, 4, 5, 6, 7    (3.9, 5.1)
The iterative relocation would now continue from this new partition until no more relocations occur.
However, in this example each individual is now nearer its own cluster mean than that of the other cluster
and the iteration stops, choosing the latest partitioning as the final cluster solution. Also, it is possible that
the k-means algorithm won't find a final solution. In this case it would be a good idea to stop the algorithm after a pre-chosen maximum number of iterations.
IMPLEMENTATION:
MATHEMATICAL MODEL
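As a brief sketch of the objective described in the theory above, K-means partitions the data into K clusters C_1, ..., C_K with centroids \mu_k so as to minimize the within-cluster sum of squared distances:

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

Each iteration alternates between assigning every point to its nearest centroid and recomputing each centroid as the mean of the points assigned to it; J never increases, which is why the algorithm converges to a local optimum.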
Program/Code/queries:
Here's a step-by-step implementation of Data Clustering using K-means Algorithm in
Python: # Step 1: Import necessary libraries
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
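The manual lists only the import step; the remaining steps are not shown. A minimal sketch completing the example, assuming the seven-point data set from the theory section and K = 2 clusters, could look like this:

# Step 2: Prepare the data (the seven subjects from the theory section)
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

# Step 3: Fit K-means with K = 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Step 4: Inspect the centroids and cluster assignments
print("Centroids:", kmeans.cluster_centers_)
print("Labels:", labels)

# Step 5: Visualize the clusters and centroids
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x')
plt.xlabel('A')
plt.ylabel('B')
plt.show()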
CONCLUSION:
The given data set is clustered into the given number of K clusters. The algorithm is guaranteed to converge, but only to a local optimum, not necessarily the global optimum.
Experiment No 9
PREREQUISITES:
1. Knowledge of java/ python programming and knowledge of clustering
COURSE OBJECTIVE:
1. To understand clustering concept of data mining.
COURSE OUTCOMES:
1. To implement and understand the concept of data classification using the K-NN approach.
THEORY:
The k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and
regression. In both cases, the input consists of the k closest training examples in the feature space. The output
depends on whether k-NN is used for classification or regression:
In k-NN classification, the output is a class membership. An object is classified by a majority vote of its
neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a
positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest
neighbor.
In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
When to use the KNN algorithm? KNN can be used for both classification and regression predictive problems. However, it is more widely used for classification problems in industry. To evaluate any technique we generally look at three important aspects:
1. Ease to interpret output
2. Calculation time
3. Predictive Power
How does the KNN algorithm work?
Let’s take a simple case to understand this algorithm. Following is a spread of red circles (RC) and green
squares (GS) :
You intend to find out the class of the blue star (BS). BS can be either RC or GS and nothing else. The "K" in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. Let's say K = 3. Hence, we now draw a circle with BS as its center, just big enough to enclose only three data points on the plane. Refer to the following diagram for more details:
MITADT University, Pune
The three closest points to BS are all RC. Hence, with a good confidence level we can say that BS should belong to the class RC. Here, the choice is very obvious, as all three votes from the closest neighbors went to RC. The choice of the parameter K is crucial in this algorithm. Next we will understand the factors to be considered in order to choose the best K.
How do we choose the ‘K’ factor?
First, let us try to understand what exactly K influences in the algorithm. In the last example, given that all six training observations remain constant, a given K value lets us draw boundaries for each class. These boundaries segregate RC from GS. In the same way, let's see the effect of the value of K on the class boundaries. The following are the different boundaries separating the two classes for different values of K.
If you watch carefully, you can see that the boundary becomes smoother with an increasing value of K. As K increases to infinity, the prediction finally becomes all blue or all red depending on the overall majority. The training error rate and the validation error rate are the two quantities we need to assess for different values of K. The following is the curve of the training error rate for varying values of K:
As you can see, the error rate at K=1 is always zero for the training sample. This is because the closest point to any training data point is itself, hence the prediction is always accurate with K=1. Had the validation error curve been similar, our choice of K would have been 1. The following is the validation error curve for varying values of K:
This makes the story clearer. At K=1, we were overfitting the boundaries. Hence, the error rate initially decreases and reaches a minimum; after the minimum point, it increases with increasing K. To get the optimal value of K, segregate the initial dataset into training and validation sets, then plot the validation error curve to find the optimal value of K. This value of K should be used for all predictions.
Code :
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load dataset
data = load_iris()
X, y = data.data, data.target
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Feature scaling (optional but recommended)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Create KNN model
k = 5 # You can tune this value
knn = KNeighborsClassifier(n_neighbors=k)
# Train the model
knn.fit(X_train, y_train)
# Make predictions
y_pred = knn.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
CONCLUSION:
Thus it is understood that the KNN algorithm is one of the simplest classification algorithms. Even with such simplicity, it can give highly competitive results. The KNN algorithm can also be used for regression problems; the only difference from the methodology discussed is that it uses the average of the nearest neighbors rather than a vote of the nearest neighbors. KNN can be coded in a single line in R.
Experiment No 10 A
Title: Decision Tree Induction in data mining
Aim: Diabetes Diagnosis using Decision Tree Induction.
Software required: Python's Scikit-learn library and a sample dataset.
Theory:
Decision tree induction is a data mining technique used for supervised learning tasks, primarily classification
and regression. It involves constructing a tree-like structure (known as a decision tree) from the given dataset,
where each internal node represents a feature (or attribute), each branch represents a decision rule, and each
leaf node represents an outcome (class label or numerical value).
Key Concepts in Decision Tree Induction:
1. Root Node: The topmost node in the decision tree, representing the most significant feature that best
splits the dataset based on a specific criterion (e.g., information gain, Gini impurity).
2. Internal Nodes: Nodes in the decision tree that represent features or attributes. Internal nodes are used
to partition the dataset into subsets based on different attribute values.
3. Branches: The edges connecting nodes in the decision tree, representing decision rules or conditions
that guide the traversal from the root node to leaf nodes based on attribute tests.
4. Leaf Nodes: Terminal nodes in the decision tree that represent the final outcomes or class labels. Each
leaf node corresponds to a specific class label or numerical value, indicating the predicted outcome for
instances that satisfy the conditions along the path from the root node to the leaf node.
5. Decision Rule: Criteria used to determine the attribute and value for splitting the dataset at each internal
node, such as maximizing information gain, minimizing impurity, or other optimization criteria.
Steps in Decision Tree Induction:
1. Attribute Selection: Identify the most informative attributes (features) for partitioning the dataset
based on criteria like information gain, gain ratio, Gini impurity, or entropy.
2. Tree Construction: Recursively partition the dataset into subsets based on the selected attributes and
values. Create internal nodes for each attribute and leaf nodes for each class label or outcome.
3. Tree Pruning: Optimize the decision tree by pruning unnecessary branches or nodes to improve
generalization, reduce overfitting, and enhance predictive accuracy on unseen data.
4. Tree Evaluation: Evaluate the decision tree's performance using metrics like accuracy, precision,
recall, F1-score, or confusion matrix on a validation or test dataset to assess its effectiveness in
classifying instances and generalizing patterns.
Applications of Decision Tree Induction:
1. Classification: Predicting categorical class labels based on input features, such as identifying customer
segments, classifying email as spam or non-spam, or diagnosing medical conditions.
2. Regression: Estimating numerical values or predicting continuous outcomes, such as forecasting sales,
predicting house prices, or evaluating risk factors.
3. Feature Selection: Identifying relevant features or attributes that contribute most to the target variable
and simplifying complex datasets by focusing on essential predictors.
4. Pattern Recognition: Discovering meaningful patterns, relationships, or rules within datasets to
support decision-making, insights generation, and knowledge discovery.
Decision tree induction is a fundamental data mining technique that facilitates the creation of interpretable,
rule-based models for classification and regression tasks. By partitioning datasets, selecting informative
attributes, and constructing hierarchical tree structures, decision trees provide a transparent, intuitive approach
to analysing data, making predictions, and extracting valuable insights from diverse domains, including
business, healthcare, finance, and engineering.
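For reference, the attribute-selection criteria mentioned above are normally computed with the following standard formulas, where p_i is the proportion of instances in data set D that belong to class i, and D_v is the subset of D for which attribute A takes value v:

Entropy(D) = -\sum_i p_i \log_2 p_i
Gini(D) = 1 - \sum_i p_i^2
Gain(D, A) = Entropy(D) - \sum_v \frac{|D_v|}{|D|} \, Entropy(D_v)

The attribute with the highest information gain (or the lowest resulting impurity) is chosen for the split at each node.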
Program/Code/Queries:
Implementing a Diabetes Diagnosis system using Decision Tree Induction involves building a predictive
model to classify patients as diabetic or non-diabetic based on relevant features or attributes such as glucose
level, blood pressure, BMI, age, etc.
Here's a step-by-step guide to creating a simple diabetes diagnosis model using Python's Scikit-learn library
and a sample dataset:
Step 1: Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
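The remaining steps are not reproduced in the manual. A minimal sketch completing the guide is given below; it assumes a hypothetical diabetes.csv file laid out like the Pima Indians Diabetes dataset, with feature columns (glucose level, blood pressure, BMI, age, etc.) and an Outcome column where 1 = diabetic and 0 = non-diabetic:

# Step 2: Load the dataset (assumed file name and column layout)
df = pd.read_csv('diabetes.csv')
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Step 3: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Train the decision tree classifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=42)
clf.fit(X_train, y_train)

# Step 5: Evaluate the model on the test set
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))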
Result/Conclusion:
By following the above steps, we can implement a Diabetes Diagnosis system using Decision Tree Induction
in Python. Ensure that we have a relevant dataset with features such as glucose level, blood pressure, BMI,
age, etc., to train and evaluate the model effectively. Additionally, consider optimizing the model, performing
feature selection, and incorporating domain knowledge or additional preprocessing techniques to enhance the
model's performance, interpretability, and reliability for diagnosing diabetes based on patient data.
Experiment No. 10 B
Outcomes:
1.Scalable and distributed storage for high availability.
2.Low-latency queries for product information, customer details, and order history.
Theory:
Designing a database schema for an application using Cassandra Query Language (CQL) involves considering
the characteristics of Cassandra, which is a NoSQL database designed for horizontal scalability, high
availability, and fault tolerance.
Here are some key principles and considerations when creating a database schema for a Cassandra-based
application:
∙ 1. Denormalization:
∙ Cassandra favours denormalization: data is duplicated across tables so that each query can be served from a single table, since joins are not supported.
∙ Cassandra uses a process called compaction to merge and organize data on disk. It is important to understand the compaction strategy and tune it according to your application's needs.
∙ 7. Materialized Views:
∙ Cassandra supports materialized views, which allow you to create alternative views of your data. This can be useful for handling different query patterns without requiring complex query logic.
∙ 8. Time-Series Data:
∙ If your application involves time-series data, consider using time-based strategies such as time window compaction or time bucketing to optimize data storage and retrieval.
∙ Use appropriate data types for your columns. Cassandra supports various data types, including collections (lists, sets, maps), which can be useful for modeling certain types of data.
Here are key theoretical aspects and principles associated with Cassandra and CQL:
∙ Distributed Architecture:
∙ Cassandra is designed to be distributed, allowing it to scale horizontally by adding more nodes to the cluster. This distributed architecture provides fault tolerance and high availability.
∙ Peer-to-Peer Model:
∙ Cassandra follows a peer-to-peer model where all nodes in the cluster have equal status. Each node can accept read and write requests, and there is no single point of failure.
∙ No Single Point of Failure:
∙ Cassandra is built to ensure high availability and fault tolerance. Data is replicated across multiple nodes, and if one node fails, the system can continue to operate with the remaining nodes.
∙ CAP Theorem:
∙ Cassandra adheres to the CAP theorem, which states that in a distributed system it is impossible to simultaneously provide all three of the following guarantees: Consistency, Availability, and Partition tolerance. Cassandra is designed to provide high Availability and Partition tolerance, making it an AP system.
∙ Eventual Consistency:
∙ Cassandra provides eventual consistency, meaning that given enough time and assuming no further updates, all replicas of a piece of data will converge to the same value. This model allows for high availability and performance in distributed environments.
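The manual does not include a concrete schema, so the following is only a possible sketch, assuming a locally running Cassandra node, the DataStax cassandra-driver package for Python, and hypothetical keyspace/table names chosen to match the stated outcomes (product information, customer details, order history). The table is denormalized and partitioned by customer so that the order-history query hits a single partition:

import uuid
from cassandra.cluster import Cluster

# Connect to a local Cassandra node (hypothetical address)
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# Keyspace with simple replication (single data centre, for illustration only)
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Denormalized table: order history partitioned by customer, newest order first
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.orders_by_customer (
        customer_id uuid,
        order_date timestamp,
        order_id uuid,
        product_name text,
        total decimal,
        PRIMARY KEY (customer_id, order_date)
    ) WITH CLUSTERING ORDER BY (order_date DESC)
""")

# Low-latency query: all orders for one (hypothetical) customer, newest first
customer_id = uuid.uuid4()
rows = session.execute(
    "SELECT order_id, order_date, product_name, total "
    "FROM shop.orders_by_customer WHERE customer_id = %s",
    [customer_id]
)
for row in rows:
    print(row.order_id, row.product_name, row.total)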
Conclusion:
Hence, we have successfully studied the design of a database schema for an application using Cassandra (CQL).
Experiment No. 10 C
Title: Implementation of DynamoDB queries
Theory:
∙ Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that require
consistent single-digit millisecond latency at any scale.
∙ It is a fully managed database that supports both document and key-value data models.
∙ Its flexible data model and performance make it a great fit for mobile, web, gaming, ad-tech, IoT, and many other applications.
∙ DynamoDB allows users to create databases capable of storing and retrieving any amount of data, and
serving any amount of traffic.
∙ It automatically distributes data and traffic over servers to dynamically manage each customer's
requests, and also maintains fast performance.
∙ The DynamoDB Environment only consists of using your Amazon Web Services account to access the
DynamoDB GUI console, however, you can also perform a local install.
∙ The AWS (Amazon Web Service) provides a version of DynamoDB for local installations.
∙ It also reduces provisioned throughput, data storage, and transfer fees by allowing a local database.
Working Environment -
You can use a JavaScript shell, a GUI console, and multiple languages to work with DynamoDB. The
languages available include Ruby, Java, Python, C#, Erlang, PHP, and Perl.
∙ Tables
∙ Each table in Amazon DynamoDB contains one or more items. Items are made up of a group of attributes that are uniquely identifiable.
∙ Attributes
∙ Attributes in AWS DynamoDB are fundamental data elements or values that reside in an item, equivalent to the data values that reside in a particular cell of a table in a relational database.
∙ Using DynamoDB, developers can easily develop scalable cloud-based applications.
∙ AWS can easily achieve data retrieval in single-digit milliseconds.
∙ DevOps need not worry about managing the high availability and durability of data because DynamoDB automatically replicates it synchronously across multiple AWS Availability Zones (AZs).
∙ DynamoDB can be provisioned according to the number of write units and read units allocated.
∙ The user's database table always remains available based on provisioned throughput requirements like read-write units per second.
∙ DynamoDB utilizes JSON as a transport protocol.
∙ Control Plane (It is responsible for creating and managing DynamoDB tables)
∙ Create Table
∙ Describe Table
∙ List Table
∙ Delete Table
∙ Data Plane (It consists of ‘CRUD’ operation, i.e. Create, Read, Update & Delete)
∙ Creating Data
∙ PutItem
∙ BatchWriteItem
∙ Reading Data
∙ GetItem
∙ BatchGetItem
∙ Query
∙ Scan
∙ Updating Data
∙ UpdateItem
∙ Deleting Data
∙ DeleteItem
∙ BatchWriteItem
∙ DynamoDB Stream
∙ ListStream
∙ DescribeStream
∙ GetShardIterator
∙ GetRecords
When you create a table, in addition to the table name, you have to specify the primary key of the table. The
primary key uniquely identifies each item in the table, so that no two items can have the same key.
∙ Partition key – A simple primary key, composed of one attribute known as the partition key. DynamoDB uses the partition key's value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored. An important rule for a partition key is that in a table that has only a partition key, no two items can have the same partition key value. The People table described in Tables, Items, and Attributes is an example of a table with a simple primary key (PersonID). You can access any item in the People table directly by providing the PersonID value for that item.
∙ Partition key and sort key – Referred to as a composite primary key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key.
DynamoDB uses the partition key value as input to an internal hash function. The output from the hash
function determines the partition (physical storage internal to DynamoDB) in which the item will be
stored. All items with the same partition key value are stored together, in sorted order by sort key
value.
A DynamoDB table must have a primary key. There are two possible types to choose from:
1. Partition Key — Single Attribute —which will just be a field in your data source that uniquely
represents the row (e.g., an auto-generated, unique product ID).
2. Partition Key & Sort Key — Composite Key — which will be a combo of two attributes that will
uniquely identify the row, and how the data should naturally be sorted (e.g., Unique product ID and
purchase date timestamp)
∙ Your DynamoDB partition key should be unique and sparse, as this key is hashed internally and used to distribute the data for storage.
∙ This is a similar technique to Redshift and HBase that prevents hot-spotting of data.
∙ If using a composite key, then two items can have the same Partition Key, but the Sort Key must be
unique.
∙ This will mean all items with the same Partition key will be stored together but sorted in ascending
order using the Sort Key.
∙ Amazon DynamoDB is a NoSQL managed database service provided by Amazon that stores semi-structured data like key-value pairs.
∙ A DynamoDB table consists of items. Each item consists of one partition key and one or more attributes. An example of an item is given below:
Example:
{
    "MovieID": 101,
    "Name": "The Shawshank Redemption",
    "Rating": 9.2,
    "Year": 1994
}
In the above example, MovieID is the partition key.
∙ A partition key is used to differentiate between items. A query operation in DynamoDB finds items
based on primary key values.
∙ The name of the partition key attribute and a single value for that attribute must be provided. The
query returns all items searched against that partition key value.
Step 3: You can view your table being created. Click on “Overview” to understand your table, click on
“Items” to edit, insert and query on the table. There are many more options you can use to understand your
table better.
How to insert items into a DynamoDB table?
Step 1: Navigate to “Items” and click on “Create item.“
Step 2: It will open a JSON file where you can add different items. Click on the “+” symbol and select
“Append” and select what type of data you want to enter.
Step 3: This is what it looks like after adding multiple columns to your table. Click on “Save“.
Step 4: Since it is a NoSQL architecture, you can play around with the columns you add to the table. E.g.,
“Position.“
Step 5: This is how your table will look once you have inserted the data.
Source Code –
Let's assume we're building a simple application for managing books and authors. The application needs to
support queries for retrieving books by title, author, and publication year.
# Define the table for books
Books (
    # Primary Key
    BookID UUID,            # Partition key - unique identifier for each book
    # Attributes
    Title STRING,
    AuthorID UUID,
    PublicationYear INT,
    Genre STRING,
    Summary STRING,
    # Secondary Indexes
    GSI1 (
        # Global Secondary Index for querying by AuthorID
        AuthorID UUID
    ),
    GSI2 (
        # Global Secondary Index for querying by Title
        Title STRING
    ),
    GSI3 (
        # Global Secondary Index for querying by PublicationYear
        PublicationYear INT
    )
);

# Define the table for authors
Authors (
    # Primary Key
    AuthorID UUID,          # Partition key - unique identifier for each author
    # Attributes
    FirstName STRING,
    LastName STRING,
    BirthYear INT,
    # Secondary Index
    GSI (
        # Global Secondary Index for querying by LastName
        LastName STRING
    )
);
Explanation:
1. Book Table:
∙ Primary Key: BookID (UUID) - Unique identifier for each book.
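Since DynamoDB tables are queried through the AWS API rather than SQL, a possible sketch of the application's queries is shown below using Python and boto3. It assumes AWS credentials are already configured, that the table and index names above (Books, GSI1, GSI3) exist, and that the key values and region are placeholders:

import boto3
from boto3.dynamodb.conditions import Key

# Connect to DynamoDB (region name is an assumption)
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
books = dynamodb.Table('Books')

# 1. Retrieve a single book by its partition key (placeholder id)
response = books.get_item(Key={'BookID': 'b1f2c3d4-0000-0000-0000-000000000001'})
print(response.get('Item'))

# 2. Retrieve all books by a given author via the AuthorID global secondary index
response = books.query(
    IndexName='GSI1',
    KeyConditionExpression=Key('AuthorID').eq('a1b2c3d4-0000-0000-0000-000000000002')
)
for item in response['Items']:
    print(item['Title'], item['PublicationYear'])

# 3. Retrieve all books published in a given year via the PublicationYear index
response = books.query(
    IndexName='GSI3',
    KeyConditionExpression=Key('PublicationYear').eq(1994)
)
print(response['Items'])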
Conclusion –
We have implemented DynamoDB queries on a single database schema.
Practical No. 10 D
Title: Data Mining Tools
Aim: Study of Data Mining tools using WEKA / ORANGE
Theory :-
Data Mining is the set of techniques that utilize specific algorithms, statistical analysis, artificial intelligence, and database systems to analyze data from different dimensions and perspectives.
Data Mining tools have the objective of discovering patterns/trends/groupings among large sets of data and
transforming data into more refined information.
Orange :-
∙ Orange is a framework for data visualization, machine learning, and data mining with a front-end for
visual programming.
∙ It has been around since 1996 and is free software. The analysis is achieved by connecting widgets that
perform various functions, such as reading files, displaying statistics on features, constructing models,
evaluating, etc.
∙ Moreover, if you intend to dig deeper into finer tuning, Orange is also available as a Python library. For programmers, analysts, and data mining experts, Orange supports a versatile domain through Python, a modern scripting language and programming environment in which data mining scripts can be simple yet efficient.
∙ For easy implementation, Orange uses a component-based approach: much like assembling building blocks, or even reusing an existing algorithm, we can apply our own research technique.
∙ Orange is a great software package for machine learning and data mining.
Advantages :
1. Open-source software is cost-effective.
2. Constant improvements are a hallmark of open-source software.
3. Visual Programming
4. Interactive Data Visualization
5. Add-ons Extended Functionality
Disadvantages :
1. Open-source software might not stick around.
2. Manual Troubleshooting
3. Advance analysis is not so easy
4. Support isn’t always reliable.
5. Security becomes a major issue.
∙ Orange scripting:
If we want to access Orange objects, we need to write our own components and design our test schemes and machine learning applications through scripts. Orange interfaces to Python, a simple-to-use scripting language with clear and powerful syntax and a broad set of additional libraries. As with any scripting language, Python can be used to test a few ideas interactively or to develop more detailed scripts and programs.
import orange
data1 = orange.ExampleTable('voting.tab')
print('Instances:', len(data1))
print('Attributes:', len(data1.domain.attributes))
If we store this script in script.py and run it with the shell command "python script.py" (making sure that the data file is in the same directory), we get:
Instances: 435
Attributes: 16
Let us extend the script to use the same data to build a naïve Bayesian classifier and print the classification of the first five instances:
model = orange.BayesLearner(data1)
for i in range(5):
print(model(data1[i]))
It is easy to produce the classification model: we called Orange's BayesLearner object and gave it the data set. It returned another object (a naïve Bayesian classifier) which, when given an instance, returns the label of the most probable class.
∙ WEKA :-
Weka contains a collection of visualization tools and algorithms for data analysis and predictive
modelling, together with graphical user interfaces for easy access to these functions. The original non-Java
version of Weka was a Tcl/Tk front-end to (mostly third-party) modelling algorithms implemented in
other programming languages, plus data preprocessing utilities in C and a makefile-based system for
running machine learning experiments.
Weka supports several standard data mining tasks, specifically data preprocessing, clustering, classification, regression, visualization, and feature selection. Input to Weka is expected to be formatted according to the Attribute-Relation File Format, in a file with the .arff extension.
∙ Features of Weka
1. Preprocess
The preprocessing of data is a crucial task in data mining. Because most of the data is raw, there are chances
that it may contain empty or duplicate values, have garbage values, outliers, extra columns, or have a different
naming convention. All these things degrade the results.
2. Classify
Classification is one of the essential functions in machine learning, where we assign classes or categories to
items. The classic examples of classification are: declaring a brain tumour as "malignant" or "benign" or
assigning an email to a "spam" or "not_spam" class.
After selecting the desired classifier, we select test options for the training set. Some of the options are:
∙ Use training set: the classifier is evaluated on the same training set.
∙ Supplied test set: the classifier is evaluated on a separate, supplied test set.
∙ Cross-validation folds: the classifier is evaluated by cross-validation, using the number of folds provided.
3. Cluster
In clustering, a dataset is arranged into different groups/clusters based on some similarities. In this case, the items within the same cluster are similar to each other but different from those in other clusters. Examples of clustering include
identifying customers with similar behaviours and organizing the regions according to homogenous land use.
4. Associate
Association rules highlight all the associations and correlations between items of a dataset. In short, an association rule is an if-then statement that depicts the probability of relationships between data items. A classic example of association is the connection between the sale of milk and bread.
5. Select Attributes
Every dataset contains a lot of attributes, but several of them may not be significantly valuable. Therefore,
removing the unnecessary and keeping the relevant details are very important for building a good model.
Many attribute evaluators and search methods are available, including BestFirst, GreedyStepwise, and Ranker.
6. Visualize
In the visualize tab, different plot matrices and graphs are available to show the trends and errors identified by
the model.
As shown in the screenshot above, five options are available in the Applications category.
∙ The Explorer is the central panel where most data mining tasks are performed. We will further explore this panel in upcoming sections.
∙ The tool provides an Experimenter panel, in which we can design and run experiments.
∙ WEKA provides the KnowledgeFlow panel. It provides an interface to drag and drop components, connect them to form a knowledge flow, and analyze the data and results.
∙ The Simple CLI panel provides command line access to WEKA. For example, to fire up the ZeroR classifier on the arff data, we run from the command line:
java weka.classifiers.rules.ZeroR -t iris.arff
Numeric (integer and real), string, date, and relational are the only four datatypes provided by WEKA.
By default, WEKA supports the ARFF format. ARFF, the attribute-relation file format, is an ASCII format that describes a list of instances sharing a set of attributes. Every ARFF file has two sections: header and data.
∙ The header section contains the relation name and the declarations of the attributes.
∙ The data section contains a comma-separated list of values for those attributes.
WEKA provides many algorithms for machine learning tasks. Because of their core nature, all the
algorithms are divided into several groups. These are available under the Explorer tab of the WEKA. Let's
look at those groups and their core nature:
∙ Bayes: consists of algorithms based on Bayes theorem like Naive Bayes
∙ functions: comprises the algorithms that estimate a function, including Linear Regression
∙ lazy: covers all algorithms that use lazy learning similar to KStar, LWL
∙ meta: consists of those algorithms that use or integrate multiple algorithms for their work like Stacking,
Bagging
∙ misc: miscellaneous algorithms that do not fit any of the given categories
Conclusion :-
The data mining tools Orange and WEKA have been studied.