Data Science and Its Applications (21AD62) Lab Manual
DEPARTMENT OF
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
Semester: Batch:
CYCLE OF EXPERIMENTS
LAB CODE: 21AD62
Module 1
1. Installation of the Python/R language and the Visual Studio Code editor will be demonstrated,
along with the use of Kaggle datasets.
2. Write programs in Python/R and execute them in Visual Studio Code, PyCharm Community
Edition, or any other suitable environment.
3. A study was conducted to understand the effect of the number of hours students spent
studying on their performance in the final exams. Write code to plot a line chart with the
number of hours spent studying on the x-axis and the score in the final exam on the y-axis.
Use a red ‘*’ as the point character, label the axes, and give the plot a title.
4. For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram
to check the frequency distribution of the variable ‘mpg’ (miles per gallon).
Module 2
Module 3
1. Train a regularized logistic regression classifier on the iris dataset
(https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris
dataset) using sklearn. Train the model with the hyperparameter C = 1e4
and report the best classification accuracy.
2. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and
the associated hyperparameters. Train the model with the following set of
hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature
normalization. Also try C = 0.01, 1, 10. For this set of hyperparameters, find the
best classification accuracy along with the total number of support vectors on the
test data.
Cycle II
Module 4
1. Consider the following dataset. Write a program to demonstrate the working of the
decision tree based ID3 algorithm.
2. Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the
dataset correspond to the coordinates of each data point; the third column corresponds
to the actual cluster label. Compute the Rand index for the following methods:
• K-means clustering
• Single-link hierarchical clustering
• Complete-link hierarchical clustering
• Also visualize the dataset and determine which algorithm is able to recover the true
clusters.
Module 5
The weightage of Continuous Internal Evaluation (CIE) is 50% and that of the Semester End Exam (SEE) is 50%.
The minimum passing mark for the CIE is 40% of the maximum marks (20 marks). A student shall be deemed to have
satisfied the academic requirements and earned the credits allotted to each subject/course if the student secures not
less than 35% (18 marks out of 50) in the semester-end examination (SEE) and a minimum of 40% (40 marks out
of 100) in the sum total of the CIE (Continuous Internal Evaluation) and SEE (Semester End Examination) taken
together.
• Rubrics for each Experiment taken average for all Lab components – 15 Marks.
• Viva-Voce– 5 Marks (more emphasized on demonstration topics)
The sum of three tests, two assignments, and practical sessions will be out of 100 marks and will be
scaled down to 50 marks. (To keep the CIE less stressful, the syllabus portion should not be common/repeated
across the CIE methods; each method of CIE should cover a different portion of the course syllabus.)
CIE methods /question paper has to be designed to attain the different levels of Bloom’s taxonomy
as per the outcome defined for the course.
Theory SEE will be conducted by the University as per the scheduled timetable, with a common question
paper for the subject (duration: 03 hours).
1. The question paper will have ten questions. Each question is set for 20 marks.
2. There will be 2 questions from each module. Each of the two questions under a module (with a maximum of
3 sub-questions), should have a mix of topics under that module.
3. The students have to answer 5 full questions, selecting one full question from each module.
4. Marks scored shall be proportionally reduced to 50 marks
DATA SCIENCE AND ITS APPLICATION LABORATORY (21AD62)
Introduction to Python
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its
high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very
attractive for Rapid Application Development, as well as for use as a scripting language to connect
existing components together. Python's simple, easy-to-learn syntax emphasizes readability and
therefore reduces the cost of program maintenance. Python supports modules and packages, which
encourages program modularity and code reuse. The Python interpreter and the extensive standard
library are available in source or binary form without charge for all major platforms, and can be freely
distributed.
Python Installation
Download Python Interpreter
Go to the Python downloads page and select the version for your operating system (Windows, Mac, or
Linux).
Introduction to PyCharm
PyCharm is an integrated development environment (IDE) used for programming in Python. It provides
code analysis, a graphical debugger, an integrated unit tester, integration with version control systems,
and supports web development with Django. PyCharm is developed by the Czech company JetBrains.
● First, using File Explorer (Windows) or Finder (Mac), create a directory for your projects,
e.g., PyCharm_Projects
● Name your project, e.g., INLS560
● Specify the Python interpreter you will use for your projects
● Ensure that the Interpreter field refers to the Python interpreter that you just
installed. Click OK.
● Select HelloWorld.py and select Run from the context menu; or, select the Run icon
● Output is displayed in the Run window in the bottom pane
(Screenshot: the PyCharm window layout, showing the Editor, the Project view, and the tool windows.)
EXPERIMENT -1
Aim: (3) A study was conducted to understand the effect of the number of hours students spent studying on their
performance in the final exams. Write code to plot a line chart with the number of hours spent studying on the
x-axis and the score in the final exam on the y-axis. Use a red ‘*’ as the point character, label the axes, and give
the plot a title.
hours = [10,9,2,15,10,16,11,16]
score = [95,80,10,50,45,98,38,93]
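A minimal plotting sketch using the two lists above; sorting the pairs by hours and the output
file name study_hours_plot.png are illustrative choices, not part of the original listing:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt

hours = [10, 9, 2, 15, 10, 16, 11, 16]
score = [95, 80, 10, 50, 45, 98, 38, 93]

# Sort the (hours, score) pairs so the line reads left to right
pairs = sorted(zip(hours, score))
xs, ys = zip(*pairs)

plt.plot(xs, ys, 'r*-')  # red '*' markers joined by a line
plt.xlabel('Number of hours spent studying')
plt.ylabel('Score in final exam')
plt.title('Study Hours vs Final Exam Score')
plt.savefig('study_hours_plot.png')
```

The format string 'r*-' combines the red colour, the '*' marker, and a solid line in one argument.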
Output
The program above demonstrates a clear trend: generally, the more hours students study, the better they
perform on the final exam. However, there are some cases where this relationship isn’t quite as
straightforward, yielding slightly different outcomes.
Aim: (4) For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram to check the
frequency distribution of the variable ‘mpg’ (Miles per gallon)
import pandas as pd
import matplotlib.pyplot as plt
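The fragment above only imports the libraries. A fuller sketch, assuming mtcars.csv has been
downloaded from Kaggle into the working directory; the fallback sample mpg values are illustrative
so the script still runs without the file:

```python
import os
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt

# Load mtcars.csv if present; otherwise fall back to a few sample mpg values
if os.path.exists("mtcars.csv"):
    mpg = pd.read_csv("mtcars.csv")["mpg"]
else:
    mpg = pd.Series([21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 30.4, 33.9, 15.2])

# Histogram of the frequency distribution of mpg
plt.hist(mpg, bins=8, edgecolor="black")
plt.xlabel("Miles per gallon (mpg)")
plt.ylabel("Frequency")
plt.title("Frequency distribution of mpg")
plt.savefig("mpg_hist.png")
```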
Output
EXPERIMENT -2
Aim: Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle
(https://www.kaggle.com/adeyoyintemidayo/publication-of-books) which contains information about
books. Write a program to demonstrate the following.
• Import the data into a DataFrame
• Find and drop the columns which are irrelevant for the book information.
• Change the Index of the DataFrame
• Tidy up fields in the data, such as the date of publication, with the help of a simple regular expression.
• Combine str methods with NumPy to clean columns
import pandas as pd
import numpy as np

# Import the data into a DataFrame
df = pd.read_csv('BL-Flickr-Images-Book.csv')

# Find and drop the columns which are irrelevant for the book information
irrelevant_columns = ['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'Former owner',
                      'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
df.drop(columns=irrelevant_columns, inplace=True)

# Change the index of the DataFrame
df.set_index('Identifier', inplace=True)

# Tidy up fields such as the date of publication with a simple regular expression
df['Date of Publication'] = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)

# Combine str methods with NumPy to clean the Place of Publication column
london = df['Place of Publication'].str.contains('London', na=False)
df['Place of Publication'] = np.where(london, 'London', df['Place of Publication'].str.replace('-', ' '))
Output
Original DataFrame:
Identifier Edition Statement Place of Publication \
0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London
Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.
Flickr URL \
0 http://www.flickr.com/photos/britishlibrary/ta...
1 http://www.flickr.com/photos/britishlibrary/ta...
2 http://www.flickr.com/photos/britishlibrary/ta...
3 http://www.flickr.com/photos/britishlibrary/ta...
4 http://www.flickr.com/photos/britishlibrary/ta...
Shelfmarks
0 British Library HMNTS 12641.b.30.
1 British Library HMNTS 12626.cc.2.
2 British Library HMNTS 12625.dd.1.
3 British Library HMNTS 10369.bbb.15.
4 British Library HMNTS 9007.d.28.
Cleaned DataFrame:
Place of Publication Date of Publication Publisher \
Identifier
206 London 1879 S. Tinsley & Co.
216 London 1868 Virtue & Co.
218 London 1869 Bradbury, Evans & Co.
472 London 1851 James Darling
480 London 1857 Wertheim & Macintosh
Title Author \
Identifier
206 Walter Forbes. [A novel.] By A. A A. A.
216 All for Greed. [A novel. The dedication signed... A., A. A.
218 Love the Avenger. By the author of “All for Gr... A., A. A.
472 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
480 [The World in which I live, and my place in it... A., E. S.
Flickr URL
Identifier
206 http://www.flickr.com/photos/britishlibrary/ta...
216 http://www.flickr.com/photos/britishlibrary/ta...
218 http://www.flickr.com/photos/britishlibrary/ta...
472 http://www.flickr.com/photos/britishlibrary/ta...
480 http://www.flickr.com/photos/britishlibrary/ta...
EXPERIMENT -3
Aim: (1) Train a regularized logistic regression classifier on the iris dataset (https://archive.ics.uci.edu/ml/machine-
learning-databases/iris/ or the inbuilt iris dataset) using sklearn. Train the model with the hyperparameter
C = 1e4 and report the best classification accuracy.
Logistic Regression
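The program listing is omitted above; a minimal sketch using sklearn's built-in iris dataset
(the 80/20 split and random_state=42 are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the iris dataset and hold out 20% for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# C = 1e4 weakens the L2 regularization (C is the inverse of regularization strength)
clf = LogisticRegression(C=1e4, max_iter=1000)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Classification accuracy: {accuracy:.4f}")
```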
Output
Aim: (2) Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated
hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest
classifier, no feature normalization. Also try C = 0.01, 1, 10. For this set of hyperparameters, find
the best classification accuracy along with the total number of support vectors on the test data.
SVM classifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the iris dataset; no feature normalization, per the aim
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# RBF kernel, gamma = 0.5, one-vs-rest, with C = 0.01, 1, 10
hyperparameters = [{'kernel': 'rbf', 'gamma': 0.5, 'C': c} for c in (0.01, 1, 10)]

best_accuracy = 0
best_model = None
best_support_vectors = None

# Train SVM models with different hyperparameters and find the best accuracy
for params in hyperparameters:
    model = SVC(kernel=params['kernel'], gamma=params['gamma'], C=params['C'],
                decision_function_shape='ovr')
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    support_vectors = model.n_support_.sum()
    print(f"For hyperparameters: {params}, Accuracy: {accuracy}, Total Support Vectors: {support_vectors}")
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = model
        best_support_vectors = support_vectors

print(f"Best accuracy: {best_accuracy}, Total Support Vectors: {best_support_vectors}")
Output
For hyperparameters: {'kernel': 'rbf', 'gamma': 0.5, 'C': 0.01}, Accuracy: 0.3, Total Support Vectors: 120
For hyperparameters: {'kernel': 'rbf', 'gamma': 0.5, 'C': 1}, Accuracy: 1.0, Total Support Vectors: 39
For hyperparameters: {'kernel': 'rbf', 'gamma': 0.5, 'C': 10}, Accuracy: 1.0, Total Support Vectors: 31
EXPERIMENT - 4
Aim: (1) Consider the following dataset. Write a program to demonstrate the working of the decision tree based ID3
algorithm.
df = pd.DataFrame(data)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Image(graph.create_png())
Output
Accuracy: 0.6666666666666666
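Only fragments of the program appear above. A self-contained sketch, using a stand-in
"play tennis" style table (the manual's dataset is not reproduced here, so these values are
illustrative) and sklearn's DecisionTreeClassifier with criterion='entropy', which splits on
information gain as ID3 does:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in categorical dataset (illustrative, not the manual's)
data = {
    'Outlook':  ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind':     ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
                 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'Play':     ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
}
df = pd.DataFrame(data)

# Encode the categorical columns as integers for sklearn
encoded = df.apply(LabelEncoder().fit_transform)
X = encoded.drop(columns='Play')
y = encoded['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# criterion='entropy' makes the tree split on information gain, as in ID3
clf = DecisionTreeClassifier(criterion='entropy', random_state=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)
```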
Aim: (2) Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the dataset correspond
to the coordinates of each data point; the third column corresponds to the actual cluster label. Compute the Rand
index for the following methods:
• K-means clustering
• Single-link hierarchical clustering
• Complete-link hierarchical clustering
• Also visualize the dataset and determine which algorithm is able to recover the true clusters.
Clustering
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Load spiral.txt: the first two columns are the point coordinates, the third the true label
data = np.loadtxt('spiral.txt')
X, true_labels = data[:, :2], data[:, 2]

# K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_clusters = kmeans.fit_predict(X)

# Single-link and complete-link hierarchical clustering
single_clusters = AgglomerativeClustering(n_clusters=3, linkage='single').fit_predict(X)
complete_clusters = AgglomerativeClustering(n_clusters=3, linkage='complete').fit_predict(X)

# The (adjusted) Rand index reaches 1 when the clustering agrees perfectly with the true
# clusters; the method with the higher score better recovers the true clusters.
print("K-means Rand index:", adjusted_rand_score(true_labels, kmeans_clusters))
print("Single-link Rand index:", adjusted_rand_score(true_labels, single_clusters))
print("Complete-link Rand index:", adjusted_rand_score(true_labels, complete_clusters))
Output
EXPERIMENT -5
Aim: Mini Project – Simple web scraping of social media
Mini Project
import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace with the social-media page to be scraped
response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
print(response.status_code)
Output
200