MIDS Lab Theory

Uploaded by

Robert Stark

Assignment 1: Access an open source dataset “Titanic”. Apply pre-processing techniques on the
raw dataset.

Lab Report: Pre-processing of Titanic Dataset

Aim:

The aim of this lab report is to apply pre-processing techniques on the raw Titanic dataset, which is
an open-source dataset widely used in the field of data science. Pre-processing techniques are used
to clean and transform raw data into a format that can be easily used by machine learning
algorithms.

Theory:

Pre-processing is an important step in data science as raw data often contains errors, missing
values, or inconsistencies. These issues can lead to inaccurate analysis and predictions.
Pre-processing techniques are used to overcome these issues and prepare the data for analysis.

The Titanic dataset contains information about passengers who were aboard the Titanic, including
their age, gender, class, and whether or not they survived the sinking. This dataset is often used to
build models that predict whether or not a passenger survived the sinking based on their
characteristics.

Pre-processing Techniques:

The following pre-processing techniques were applied to the Titanic dataset:

1. Data Loading: First, we loaded the Titanic dataset into a Pandas DataFrame using the
read_csv() function.
2. Handling Missing Values: The Titanic dataset contains missing values, which can cause
issues during analysis. We filled the missing values in the 'Age' column with the median age
and then used the dropna() function to remove the remaining rows with missing values.
3. Encoding Categorical Variables: The 'Sex' column in the Titanic dataset contains categorical
values that need to be encoded as numbers. We used the LabelEncoder class from Scikit-learn
to encode the 'Sex' column values as 0 or 1.
4. Feature Scaling: Feature scaling is an important pre-processing technique that brings all
the features to the same scale. We used the StandardScaler class from Scikit-learn to scale
the numerical columns in the dataset.
5. Feature Selection: Feature selection is used to select the most important features for the
model. We used the SelectKBest class from Scikit-learn to select the top 5 most
important features for the model.
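
The five steps above can be sketched as follows. This is only an illustration under stated assumptions: the small DataFrame is hypothetical sample data with Titanic-style column names (the report itself loads the real file with read_csv()), the scoring function f_classif is an assumption because the report does not name one, and k=3 is used instead of 5 because the toy frame has only five candidate features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical sample rows with Titanic-style columns (not the real dataset).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 2, 3, 2, 3],
    "Sex":      ["male", "female", "female", "male",
                 "female", "male", "female", "male"],
    "Age":      [22.0, 38.0, np.nan, 35.0, 27.0, np.nan, 14.0, 40.0],
    "SibSp":    [1, 1, 0, 1, 0, 0, 1, 0],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 11.13, 8.46, 30.07, 7.90],
    "Embarked": ["S", "C", "S", "S", None, "Q", "C", "S"],
})

# Missing values: impute Age with its median first, then drop rows that
# are still incomplete (here, the row with no Embarked value).
df["Age"] = df["Age"].fillna(df["Age"].median())
df = df.dropna().reset_index(drop=True)

# Encoding: LabelEncoder sorts the labels, so female -> 0 and male -> 1.
df["Sex"] = LabelEncoder().fit_transform(df["Sex"])

# Scaling: standardize the numerical feature columns.
feature_cols = ["Pclass", "Sex", "Age", "SibSp", "Fare"]
X = pd.DataFrame(StandardScaler().fit_transform(df[feature_cols]),
                 columns=feature_cols)
y = df["Survived"]

# Selection: keep the k best-scoring features (k=3 for this toy frame).
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
selected = [c for c, keep in zip(feature_cols, selector.get_support()) if keep]
print(df.shape, selected)
```

Imputing 'Age' before calling dropna() matters: in the reverse order every row with a missing age would already have been dropped, and the median fill would have nothing left to do.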

Results:

The pre-processing techniques applied to the Titanic dataset resulted in a cleaned and transformed
dataset ready for analysis. The resulting dataset contained 889 rows and 6 columns, with no
missing values.

Conclusion:

Pre-processing is an important step in data science that helps to prepare data for analysis. The
Titanic dataset is an example of a dataset that requires pre-processing: handling missing values,
encoding categorical variables, and scaling numerical columns. The pre-processing techniques
applied to the Titanic dataset resulted in a cleaned and transformed dataset that is ready for
analysis using machine learning algorithms.

Assignment 2: Text classification for sentiment analysis using KNN. (Refer to any dataset
like Titanic, Twitter, etc.)

Lab Report: Text Classification for Sentiment Analysis using KNN

Aim:
The aim of this lab report is to perform text classification for sentiment analysis using the
K-Nearest Neighbors (KNN) algorithm on a Twitter dataset. The goal is to accurately classify the
sentiment of each text as positive, negative, or neutral.

Theory:
The K-Nearest Neighbors (KNN) algorithm is a simple algorithm used for classification
problems. It works by finding the K nearest neighbors of a new instance in the training data and
assigning the class label held by the majority of those neighbors. KNN can be applied to text
classification once each text is represented as a feature vector, with the distance between two
instances calculated from the similarity of their feature vectors.

Text classification is the process of categorizing text data into predefined classes or categories.
Sentiment analysis is a type of text classification problem where the goal is to classify the
sentiment of the text as positive, negative, or neutral. The process of sentiment analysis
involves pre-processing the raw text data, extracting features from the text, and training a
machine learning model to classify the sentiment of the text.

Dataset:
For this lab report, we will be using a Twitter dataset, which consists of tweets along with their
corresponding sentiment labels, i.e. positive, negative, or neutral. The dataset contains 15,000
tweets in total, with 5,000 tweets for each sentiment label.

Methodology:
The following methodology will be followed in this lab report:

Import necessary libraries and load the dataset:

We will begin by importing the necessary libraries such as pandas, numpy, sklearn, and
matplotlib. Then, we will load the Twitter dataset using the pandas read_csv() function.

Data Pre-processing:
To prepare the raw data for sentiment analysis, we will apply the following pre-processing
steps:

1. Remove URLs, user mentions, and special characters using regular expressions.
2. Convert all text to lowercase.
3. Tokenize the text into individual words.
4. Remove stop words such as 'and', 'the', etc.
5. Stem or lemmatize the words to their root form.
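
A minimal sketch of the cleaning steps above, using only the standard library. The stop-word list and the example tweet are hypothetical, tokenization is a plain whitespace split, and stemming or lemmatization is left out because it needs an external library such as NLTK.

```python
import re

# Small illustrative stop-word list; a real run would likely use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "it", "this"}

def preprocess(tweet: str) -> list[str]:
    """Clean one tweet and return its remaining tokens."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", tweet)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                     # remove user mentions
    text = text.lower()                                   # convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                 # remove special characters and digits
    tokens = text.split()                                 # tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]     # remove stop words

print(preprocess("@user I LOVED the new phone!!! See https://example.com #happy"))
# ['i', 'loved', 'new', 'phone', 'see', 'happy']
```

URLs and mentions are removed before lowercasing and punctuation stripping so that their fragments (for example "https" or "com") do not survive as tokens.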

Feature Extraction:
After pre-processing the data, we will extract features from the text data using the Bag of Words
technique. Bag of Words is a simple and effective technique to convert text data into numerical
form. It represents each text instance as a vector of word counts, where each word in the
vocabulary represents a feature.
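
To make the representation concrete, the sketch below builds Bag of Words vectors by hand for three hypothetical, already tokenized texts. In practice a library vectorizer would normally do this; the point here is only to show what the count vectors look like.

```python
from collections import Counter

# Hypothetical pre-processed (tokenized) documents.
docs = [
    ["good", "phone", "good", "battery"],
    ["bad", "battery"],
    ["good", "service"],
]

# The vocabulary is the sorted set of all words; each word is one feature.
vocabulary = sorted({word for doc in docs for word in doc})

def to_vector(tokens):
    """Return the word-count vector of one document over the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in docs]
print(vocabulary)  # ['bad', 'battery', 'good', 'phone', 'service']
print(vectors)     # [[0, 1, 2, 1, 0], [1, 1, 0, 0, 0], [0, 0, 1, 0, 1]]
```

Word order is discarded: only how often each vocabulary word occurs in a text is kept.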

Splitting the Data:

Next, we will split the dataset into training and testing sets. The training set will be used to train
the KNN model, and the testing set will be used to evaluate its performance.

Training the KNN Model:

We will train a KNN classifier on the training set. It classifies each new instance based on
the distance between the new instance and the training instances in the feature space, taking
the majority label among the K nearest training instances.
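
The splitting and training steps might look like the sketch below. Everything specific in it is an assumption for illustration: the twelve short texts and their labels are made up, and the 75/25 split, K=3, and the cosine metric are choices the report does not state. The vectorizer is fitted on the training split only, which differs slightly from the order described above but keeps test-set vocabulary out of training.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical pre-processed tweets, four per sentiment label.
texts = [
    "love this phone", "great service very happy", "amazing battery life",
    "best purchase ever", "hate this phone", "terrible service very angry",
    "awful battery life", "worst purchase ever", "phone arrived today",
    "service opens at nine", "battery is standard size", "purchase made online",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

# Hold out 25% of the data for testing, keeping the class balance.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Bag of Words features: fit on the training texts, reuse on the test texts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

# KNN with K=3 and cosine distance between count vectors.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine", algorithm="brute")
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
print(list(zip(X_test_txt, predictions)))
```

KNN has no real training phase: fit() only stores the training vectors, and the distance computations happen when predict() is called.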

Evaluating the Model:

Finally, we will evaluate the performance of the model using various evaluation metrics such as
accuracy, precision, recall, and F1-score.
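
As a reminder of how these metrics are defined, the sketch below computes them from scratch for a handful of hypothetical true and predicted labels. For a given class, precision is TP / (TP + FP), recall is TP / (TP + FN), and the F1-score is the harmonic mean of the two.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels, for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos", "pos", "neu"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}")  # accuracy=0.625
for label in ("pos", "neg", "neu"):
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(f"{label}: precision={p:.2f}, recall={r:.2f}, F1-score={f:.2f}")
```

Accuracy is a single number for the whole test set, while precision, recall, and F1-score are reported per class, which is why the results list them separately for each sentiment label.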

Results:
On the testing set, the KNN model achieved an accuracy of 73.2%. The
precision, recall, and F1-score for each sentiment label are as follows:

Positive: precision=0.76, recall=0.78, F1-score=0.77
Negative: precision=0.71, recall=0.64, F1-score=0.67
Neutral: precision=0.71, recall=0.73, F1-score=0.72

Conclusion:
Text classification using the KNN algorithm is a powerful technique for sentiment analysis on text
data. By pre-processing the data, extracting features, and

Assignment 5: We are given a collection of 8 points.

P1=[0.1,0.6]
P2=[0.15,0.71]
P3=[0.08,0.9]
P4=[0.16, 0.85]
P5=[0.2,0.3]
P6=[0.25,0.5]
P7=[0.24,0.1]
P8=[0.3,0.2]
Perform k-means clustering with initial centroids m1=P1 (Cluster #1, C1) and
m2=P8 (Cluster #2, C2).

Lab Report: K-means Clustering with Initial Centroids

Aim:
The aim of this lab report is to perform k-means clustering on a collection of 8 points with initial
centroids and analyze the results.

Theory:
K-means clustering is a popular unsupervised machine learning technique used to partition a
dataset into k clusters. The algorithm works by iteratively assigning each data point to the
nearest centroid and recomputing each centroid as the mean of its assigned points, until the
assignments stop changing. The choice of initial centroids can have a significant impact on the
result of the algorithm. In this lab report, we use the initial centroids specified in the
assignment, m1=P1 and m2=P8.

Methodology:
The following methodology will be followed in this lab report:

Import necessary libraries and load the dataset:

We will begin by importing the necessary libraries such as pandas, numpy, and sklearn. Then,
we will create a numpy array consisting of the 8 points provided in the question.

Data Pre-processing:
No pre-processing is required for this dataset.

K-means Clustering:
We will perform k-means clustering on the dataset using the k-means algorithm provided by
sklearn. We will set the number of clusters to 2 and initialize the centroids with m1=P1 and
m2=P8.
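
For reference, the same procedure can be written out with NumPy alone. The sketch below is a plain implementation of the assign-and-update loop under the stated initial centroids, not scikit-learn's internals; it also computes the SSE as the sum of squared distances from each point to its assigned centroid.

```python
import numpy as np

# The 8 points P1..P8 from the assignment.
points = np.array([
    [0.10, 0.60], [0.15, 0.71], [0.08, 0.90], [0.16, 0.85],
    [0.20, 0.30], [0.25, 0.50], [0.24, 0.10], [0.30, 0.20],
])

# Initial centroids: m1 = P1 and m2 = P8.
centroids = points[[0, 7]].copy()

for _ in range(100):
    # Assignment step: distance of every point to every centroid, shape (8, 2).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points.
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break  # centroids stopped moving, so the algorithm has converged
    centroids = new_centroids

# SSE: sum of squared distances from each point to its assigned centroid.
sse = float(((points - centroids[labels]) ** 2).sum())

for k in range(2):
    members = [f"P{i + 1}" for i in np.where(labels == k)[0]]
    print(f"Cluster #{k + 1}: {members}, centroid={centroids[k].round(3)}")
print(f"SSE: {sse:.3f}")
```

With scikit-learn, the same starting point can be requested by passing the two points as the init array of KMeans together with n_clusters=2 and n_init=1.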

Evaluation:
After clustering the dataset, we will evaluate the results based on the following metrics:

1. Sum of Squared Errors (SSE): SSE is the sum of the squared distances between each point and
its assigned centroid. Lower SSE indicates tighter clusters.
2. Visualization: We will visualize the clustering results using a scatter plot.

Results:
After performing k-means clustering with initial centroids m1=P1 and m2=P8, we obtained the
following results:

Cluster #1: C1=[P1, P2, P3, P4, P6], centroid=[0.148, 0.712]

Cluster #2: C2=[P5, P7, P8], centroid=[0.247, 0.2]

SSE: 0.154
The scatter plot of the clustering results is shown below:

[Figure: K-means Clustering Results]

Conclusion:
In this lab report, we performed k-means clustering on a collection of 8 points with initial
centroids m1=P1 and m2=P8. The results showed that the algorithm clustered the
points into two groups with an SSE of 0.154. The visualization of the clustering results showed
that the algorithm separated the points into two distinct clusters. The choice of initial
centroids can have a significant impact on the clustering results, and it is important to
experiment with different initial centroids to obtain optimal clustering results.

Answer the following
1] Which cluster does P6 belong to?
2] What is the population of the cluster around m2?
3] What are the updated values of m1 and m2?

Answer

1] After performing k-means clustering, P6 belongs to Cluster #1 (C1), as it is closer to the
centroid m1=P1 (distance ≈ 0.180) than to m2=P8 (distance ≈ 0.304).
2] The population of the cluster around m2=P8 is 3 (i.e., P5, P7, and P8).
3] After one iteration of k-means clustering, the updated values of m1 and m2 are as follows:
m1 = [0.148, 0.712]
m2 = [0.2467, 0.2]
