Data Science and Its Applications (21AD62) Lab Manual
DEPARTMENT OF
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Semester: Batch:
CYCLE OF EXPERIMENTS
LAB CODE: 21AD62
Module 1
1. Installation of the Python/R language and the Visual Studio Code editor will be demonstrated,
along with the use of Kaggle datasets.
2. Write programs in Python/R and execute them in Visual Studio Code, PyCharm Community
Edition, or any other suitable environment.
3. A study was conducted to understand the effect of the number of hours students spent
studying on their performance in the final exams. Write code to plot a line chart with the
number of hours spent studying on the x-axis and the score in the final exam on the y-axis.
Use a red ‘*’ as the point character, label the axes, and give the plot a title.
4. For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram
to check the frequency distribution of the variable ‘mpg’ (miles per gallon).
Module 2
Module 3
1. Train a regularized logistic regression classifier on the iris dataset
(https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris
dataset) using sklearn. Train the model with the hyperparameter C = 1e4
and report the best classification accuracy.
2. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and
the associated hyperparameters. Train the model with the following set of
hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature
normalization. Also try C = 0.01, 1, 10. For this set of hyperparameters, find the
best classification accuracy along with the total number of support vectors on the
test data.
Cycle II
Module 4
1. Consider the following dataset. Write a program to demonstrate the working of the
decision tree based ID3 algorithm.
2. Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the
dataset correspond to the coordinates of each data point; the third column corresponds
to the actual cluster label. Compute the Rand index for the following methods:
• K-means clustering
• Single-link hierarchical clustering
• Complete-link hierarchical clustering
• Also visualize the dataset and determine which algorithm is able to recover the true
clusters.
Module 5
The weightage of Continuous Internal Evaluation (CIE) is 50% and that of the Semester End Exam (SEE) is 50%.
The minimum passing mark for the CIE is 40% of the maximum marks (20 marks). A student shall be deemed to have
satisfied the academic requirements and earned the credits allotted to each subject/course if the student secures not
less than 35% (18 marks out of 50) in the semester-end examination (SEE) and a minimum of 40% (40 marks out
of 100) in the sum total of the CIE (Continuous Internal Evaluation) and SEE (Semester End Examination) taken
together.
• Rubrics for each Experiment taken average for all Lab components – 15 Marks.
• Viva-Voce– 5 Marks (more emphasized on demonstration topics)
The sum of three tests, two assignments, and practical sessions will be out of 100 marks and will be
scaled down to 50 marks. (To keep the CIE less stressful, the syllabus portion should not be common/repeated
across the CIE methods; each method of CIE should cover a different portion of the course syllabus.)
CIE methods /question paper has to be designed to attain the different levels of Bloom’s taxonomy
as per the outcome defined for the course.
Theory SEE will be conducted by the University as per the scheduled timetable, with a common question
paper for the subject (duration: 03 hours).
1. The question paper will have ten questions. Each question is set for 20 marks.
2. There will be 2 questions from each module. Each of the two questions under a module (with a maximum of
3 sub-questions), should have a mix of topics under that module.
3. The students have to answer 5 full questions, selecting one full question from each module.
4. Marks scored shall be proportionally reduced to 50 marks
DATA SCIENCE AND ITS APPLICATION LABORATORY (21AD62)
Introduction to Python
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its
high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very
attractive for Rapid Application Development, as well as for use as a scripting language to connect
existing components together. Python's simple, easy-to-learn syntax emphasizes readability and
therefore reduces the cost of program maintenance. Python supports modules and packages, which
encourages program modularity and code reuse. The Python interpreter and the extensive standard
library are available in source or binary form without charge for all major platforms, and can be freely
distributed.
Python Installation
Download Python Interpreter
Go to the Python downloads page and select the version for your operating system (Windows, Mac, or
Linux).
Introduction to PyCharm
PyCharm is an integrated development environment (IDE) used for programming in Python. It provides
code analysis, a graphical debugger, an integrated unit tester, integration with version control systems,
and supports web development with Django. PyCharm is developed by the Czech company JetBrains.
● First, using File Explorer (Windows) or Finder (Mac), create a directory for your projects,
e.g., PyCharm_Projects
● Name your project, e.g., INLS560
● Specify the Python interpreter you will use for your projects
● Ensure that the Interpreter field refers to the Python interpreter that you just
installed. Click OK.
● Select HelloWorld.py and select Run from the context menu; or, select the Run icon
● Output is displayed in the Run window in the bottom pane
(Screenshot: the PyCharm window layout, showing the Editor, the Project view, and the tool windows.)
EXPERIMENT -1
Aim: (3) A study was conducted to understand the effect of the number of hours students spent studying on their
performance in the final exams. Write code to plot a line chart with the number of hours spent studying on the
x-axis and the score in the final exam on the y-axis. Use a red ‘*’ as the point character, label the axes, and give
the plot a title.
hours = [10,9,2,15,10,16,11,16]
score = [95,80,10,50,45,98,38,93]
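A minimal plotting sketch using the two lists above; sorting the pairs by hours and the output
file name study_hours_plot.png are illustrative choices, not part of the original listing:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt

hours = [10, 9, 2, 15, 10, 16, 11, 16]
score = [95, 80, 10, 50, 45, 98, 38, 93]

# Sort the (hours, score) pairs so the line reads left to right
pairs = sorted(zip(hours, score))
xs, ys = zip(*pairs)

plt.plot(xs, ys, 'r*-')  # red '*' markers joined by a line
plt.xlabel('Number of hours spent studying')
plt.ylabel('Score in final exam')
plt.title('Study Hours vs Final Exam Score')
plt.savefig('study_hours_plot.png')
```

The format string 'r*-' combines the red colour, the '*' marker, and a solid line in one argument.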
Output
The program above demonstrates a clear trend: generally, the more hours students study, the better they
perform on the final exam. However, there are some cases where this relationship isn’t quite as
straightforward, yielding slightly different outcomes.
Aim: (4) For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram to check the
frequency distribution of the variable ‘mpg’ (Miles per gallon)
import pandas as pd
import matplotlib.pyplot as plt
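The fragment above only imports the libraries. A fuller sketch, assuming mtcars.csv has been
downloaded from Kaggle into the working directory; the fallback sample mpg values are illustrative
so the script still runs without the file:

```python
import os
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt

# Load mtcars.csv if present; otherwise fall back to a few sample mpg values
if os.path.exists("mtcars.csv"):
    mpg = pd.read_csv("mtcars.csv")["mpg"]
else:
    mpg = pd.Series([21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 30.4, 33.9, 15.2])

# Histogram of the frequency distribution of mpg
plt.hist(mpg, bins=8, edgecolor="black")
plt.xlabel("Miles per gallon (mpg)")
plt.ylabel("Frequency")
plt.title("Frequency distribution of mpg")
plt.savefig("mpg_hist.png")
```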
Output
EXPERIMENT -2
Aim: Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle
(https://www.kaggle.com/adeyoyintemidayo/publication-of-books) which contains information about
books. Write a program to demonstrate the following.
• Import the data into a DataFrame
• Find and drop the columns which are irrelevant for the book information.
• Change the Index of the DataFrame
• Tidy up fields in the data, such as the date of publication, with the help of a simple regular expression.
• Combine str methods with NumPy to clean columns
import pandas as pd
import numpy as np

# Import the data into a DataFrame
df = pd.read_csv('BL-Flickr-Images-Book.csv')

# Find and drop the columns which are irrelevant for the book information
irrelevant_columns = ['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'Former owner',
                      'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
df.drop(columns=irrelevant_columns, inplace=True)

# Change the index of the DataFrame
df.set_index('Identifier', inplace=True)

# Tidy up fields such as the date of publication with a simple regular expression
df['Date of Publication'] = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)

# Combine str methods with NumPy to clean the Place of Publication column
london = df['Place of Publication'].str.contains('London', na=False)
df['Place of Publication'] = np.where(london, 'London', df['Place of Publication'].str.replace('-', ' '))
Output
Original DataFrame:
Identifier Edition Statement Place of Publication \
0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London
Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.
Flickr URL \
0 http://www.flickr.com/photos/britishlibrary/ta...
1 http://www.flickr.com/photos/britishlibrary/ta...
2 http://www.flickr.com/photos/britishlibrary/ta...
3 http://www.flickr.com/photos/britishlibrary/ta...
4 http://www.flickr.com/photos/britishlibrary/ta...
Shelfmarks
0 British Library HMNTS 12641.b.30.
1 British Library HMNTS 12626.cc.2.
2 British Library HMNTS 12625.dd.1.
3 British Library HMNTS 10369.bbb.15.
4 British Library HMNTS 9007.d.28.
Cleaned DataFrame:
Place of Publication Date of Publication Publisher \
Identifier
206 London 1879 S. Tinsley & Co.
216 London 1868 Virtue & Co.
218 London 1869 Bradbury, Evans & Co.
472 London 1851 James Darling
480 London 1857 Wertheim & Macintosh
Title Author \
Identifier
206 Walter Forbes. [A novel.] By A. A A. A.
216 All for Greed. [A novel. The dedication signed... A., A. A.
218 Love the Avenger. By the author of “All for Gr... A., A. A.
472 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
480 [The World in which I live, and my place in it... A., E. S.
Flickr URL
Identifier
206 http://www.flickr.com/photos/britishlibrary/ta...
216 http://www.flickr.com/photos/britishlibrary/ta...
218 http://www.flickr.com/photos/britishlibrary/ta...
472 http://www.flickr.com/photos/britishlibrary/ta...
480 http://www.flickr.com/photos/britishlibrary/ta...
EXPERIMENT -3
Aim: (1) Train a regularized logistic regression classifier on the iris dataset (https://archive.ics.uci.edu/ml/machine-
learning-databases/iris/ or the inbuilt iris dataset) using sklearn. Train the model with the hyperparameter
C = 1e4 and report the best classification accuracy.
Logistic Regression
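The program listing is omitted above; a minimal sketch using sklearn's built-in iris dataset
(the 80/20 split and random_state=42 are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the iris dataset and hold out 20% for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# C = 1e4 weakens the L2 regularization (C is the inverse of regularization strength)
clf = LogisticRegression(C=1e4, max_iter=1000)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Classification accuracy: {accuracy:.4f}")
```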
Output
Aim: (2) Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated
hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest
classifier, no feature normalization. Also try C = 0.01, 1, 10. For this set of hyperparameters, find
the best classification accuracy along with the total number of support vectors on the test data.
SVM classifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the iris dataset; no feature normalization, per the aim
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# RBF kernel, gamma = 0.5, one-vs-rest, with C = 0.01, 1, 10
hyperparameters = [{'kernel': 'rbf', 'gamma': 0.5, 'C': c} for c in (0.01, 1, 10)]

best_accuracy = 0
best_model = None
best_support_vectors = None

# Train SVM models with different hyperparameters and find the best accuracy
for params in hyperparameters:
    model = SVC(kernel=params['kernel'], gamma=params['gamma'], C=params['C'],
                decision_function_shape='ovr')
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    support_vectors = model.n_support_.sum()
    print(f"For hyperparameters: {params}, Accuracy: {accuracy}, Total Support Vectors: {support_vectors}")
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = model
        best_support_vectors = support_vectors

print(f"Best accuracy: {best_accuracy}, Total Support Vectors: {best_support_vectors}")
Output
For hyperparameters: {'kernel': 'rbf', 'gamma': 0.5, 'C': 0.01}, Accuracy: 0.3, Total Support Vectors: 120
For hyperparameters: {'kernel': 'rbf', 'gamma': 0.5, 'C': 1}, Accuracy: 1.0, Total Support Vectors: 39
For hyperparameters: {'kernel': 'rbf', 'gamma': 0.5, 'C': 10}, Accuracy: 1.0, Total Support Vectors: 31
EXPERIMENT - 4
Aim: (1) Consider the following dataset. Write a program to demonstrate the working of the decision tree based ID3
algorithm.
df = pd.DataFrame(data)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Image(graph.create_png())
Output
Accuracy: 0.6666666666666666
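Only fragments of the program appear above. A self-contained sketch, using a stand-in
"play tennis" style table (the manual's dataset is not reproduced here, so these values are
illustrative) and sklearn's DecisionTreeClassifier with criterion='entropy', which splits on
information gain as ID3 does:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in categorical dataset (illustrative, not the manual's)
data = {
    'Outlook':  ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind':     ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
                 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'Play':     ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
}
df = pd.DataFrame(data)

# Encode the categorical columns as integers for sklearn
encoded = df.apply(LabelEncoder().fit_transform)
X = encoded.drop(columns='Play')
y = encoded['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# criterion='entropy' makes the tree split on information gain, as in ID3
clf = DecisionTreeClassifier(criterion='entropy', random_state=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)
```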
Aim: (2) Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the dataset correspond
to the coordinates of each data point; the third column corresponds to the actual cluster label. Compute the Rand
index for the following methods:
• K-means clustering
• Single-link hierarchical clustering
• Complete-link hierarchical clustering
• Also visualize the dataset and determine which algorithm is able to recover the true clusters.
Clustering
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Load spiral.txt: the first two columns are the point coordinates, the third the true label
data = np.loadtxt('spiral.txt')
X, true_labels = data[:, :2], data[:, 2]

# K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_clusters = kmeans.fit_predict(X)

# Single-link and complete-link hierarchical clustering
single_clusters = AgglomerativeClustering(n_clusters=3, linkage='single').fit_predict(X)
complete_clusters = AgglomerativeClustering(n_clusters=3, linkage='complete').fit_predict(X)

# The (adjusted) Rand index reaches 1 when the clustering agrees perfectly with the true
# clusters; the method with the higher score better recovers the true clusters.
print("K-means Rand index:", adjusted_rand_score(true_labels, kmeans_clusters))
print("Single-link Rand index:", adjusted_rand_score(true_labels, single_clusters))
print("Complete-link Rand index:", adjusted_rand_score(true_labels, complete_clusters))
Output
EXPERIMENT -5
Aim: Mini Project – Simple web scraping of social media
Mini Project
import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace with the social-media page to be scraped
response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
print(response.status_code)
Output
200