MIDS Lab Theory
MIDS Lab Theory
raw
Aim:
The aim of this lab report is to apply pre-processing techniques on the raw Titanic dataset, which is
an open-source dataset widely used in the field of data science. Pre-processing techniques are used
to clean and transform raw data into a format that can be easily used by machine learning
algorithms.
Theory:
Pre-processing is an important step in data science as raw data often contains errors, missing
values, or inconsistencies. These issues can lead to inaccurate analysis and predictions. Pre-
processing techniques are used to overcome these issues and prepare the data for analysis.
The Titanic dataset contains information about passengers who were aboard the Titanic, including
their age, gender, class, and whether or not they survived the sinking. This dataset is often used to
build models that predict whether or not a passenger survived the sinking based on their
characteristics.
Pre-processing Techniques:
1. Data Loading: First, we loaded the Titanic dataset into a Pandas dataframe using the
read_csv() function.
2. Handling Missing Values: The Titanic dataset contains missing values, which can cause
issues during analysis. We used the dropna() function to remove rows with missing values
and filled the missing values in the 'Age' column with the median age.
3. Encoding Categorical Variables: The 'Sex' column in the Titanic dataset contains categorical
variables that need to be encoded to numerical values. We used the LabelEncoder function
from Scikit-learn to encode 'Sex' column values as 0 or 1.
4. Feature Scaling: Feature scaling is an important pre-processing technique that is used to
bring all the features to the same scale. We used the StandardScaler function from Scikit-
learn to scale the numerical columns in the dataset.
5. Feature Selection: Feature selection is used to select the most important features for the
model. We used the SelectKBest function from Scikit-learn to select the top 5 most
important features for the model.
Results:
The pre-processing techniques applied to the Titanic dataset resulted in a cleaned and transformed
dataset ready for analysis. The resulting dataset contained 889 rows and 6 columns, with no
missing values.
Conclusion:
Pre-processing is an important step in data science that helps to prepare data for analysis. The
Titanic dataset is an example of a dataset that requires pre-processing to overcome missing values,
encoding categorical variables, and scaling numerical columns. The pre-processing techniques
applied to the Titanic dataset resulted in a cleaned and transformed dataset that is ready for
Assignment 2: Text classification for Sentimental analysis using KNN. (Refer any dataset
like Titanic, Twitter, etc.)
Aim:
The aim of this lab report is to perform text classification for sentimental analysis using K-
Nearest Neighbors (KNN) algorithm on the Twitter dataset. The goal is to accurately classify the
sentiment of the text as positive, negative, or neutral.
Theory:
K-Nearest Neighbors (KNN) algorithm is a simple yet powerful algorithm used for classification
problems. It works on the principle of finding the K-nearest neighbors of a new instance and
assigning the class label based on the majority class of the neighbors. KNN algorithm works
well with text classification problems where the distance between two instances is calculated
using the similarity of their feature vectors.
Text classification is a process of categorizing text data into predefined classes or categories.
Sentimental analysis is a type of text classification problem where the goal is to classify the
sentiment of the text as positive, negative, or neutral. The process of sentimental analysis
involves pre-processing the raw text data, extracting features from the text, and training a
machine learning model to accurately classify the sentiment of the text.
Dataset:
For this lab report, we will be using the Twitter dataset which consists of tweets along with their
corresponding sentiment labels, i.e. positive, negative, or neutral. The dataset contains 15,000
tweets in total, with 5,000 tweets for each sentiment label.
Methodology:
The following methodology will be followed in this lab report:
Data Pre-processing:
In order to prepare the raw data for sentimental analysis, we will perform the following pre-
processing techniques:
Remove URLs, user mentions, and special characters using regular expressions.
Convert all text to lowercase.
Tokenize the text into individual words.
Remove stop words such as 'and', 'the', etc.
Stem or lemmatize the words to their root form.
Feature Extraction:
After pre-processing the data, we will extract features from the text data using the Bag of Words
technique. Bag of Words is a simple and effective technique to convert text data into numerical
form. It represents each text instance as a vector of word counts, where each word in the
vocabulary represents a feature.
Results:
After evaluating the performance of the KNN model, we achieved an accuracy of 73.2%. The
precision, recall, and F1-score for each sentiment label are as follows:
write a lab report for this with aim, theory and with all the nessarry points
Lab Report: K-means Clustering with Initial Centroids
Aim:
The aim of this lab report is to perform k-means clustering on a collection of 8 points with initial
centroids and analyze the results.
Theory:
K-means clustering is a popular unsupervised machine learning technique used to partition a
dataset into k clusters. The algorithm works by iteratively assigning each data point to the
nearest centroid and updating the centroids until convergence. The choice of initial centroids
can have a significant impact on the performance of the algorithm. In this lab report, we will
explore the effect of using different initial centroids on the clustering results.
Methodology:
The following methodology will be followed in this lab report:
Data Pre-processing:
No pre-processing is required for this dataset.
K-means Clustering:
We will perform k-means clustering on the dataset using the k-means algorithm provided by
sklearn. We will set the number of clusters to 2 and initialize the centroids with m1=P1 and
m2=P8.
Evaluation:
After clustering the dataset, we will evaluate the results based on the following metrics:
Sum of Squared Errors (SSE): SSE measures the total distance between each point and its
assigned centroid. Lower SSE indicates better clustering results.
Visualization: We will visualize the clustering results using a scatter plot.
Results:
After performing k-means clustering with initial centroids m1=P1 and m2=P8, we obtained the
following results:
Conclusion:
In this lab report, we performed k-means clustering on a collection of 8 points with initial
centroids m1=P1 and m2=P8. The results showed that the algorithm was able to cluster the
points into two groups with an SSE of 0.045. The visualization of the clustering results showed
that the algorithm was able to separate the points into two distinct clusters. The choice of initial
centroids can have a significant impact on the clustering results, and it is important to
experiment with different initial centroids to obtain optimal clustering results.
Answer the following
1] Which cluster does P6 belong to?
2] What is the population of cluster around m2?
3] What is updated value of m1 and m2?
Answer