Question 1
Task 1: Dealing with Missing Values
The first step is dropping the unnecessary columns, as the Cabin feature has a high number of missing data.
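A minimal sketch of this step. The Cabin drop is from the report; the fill strategies for Age and Embarked are assumptions, since the slide does not state them:

```python
import pandas as pd

# Load the Titanic training data (path is illustrative)
dataset = pd.read_csv("train.csv")

# Cabin has a high number of missing values, so drop it entirely
dataset = dataset.drop(columns=["Cabin"])

# Fill the remaining gaps (strategy assumed): median Age, modal Embarked
dataset["Age"] = dataset["Age"].fillna(dataset["Age"].median())
dataset["Embarked"] = dataset["Embarked"].fillna(dataset["Embarked"].mode()[0])
```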
Outliers
Here we have used box plots to identify the outliers for all the important features.
As we can see from the box plots:
• Age: The plots show some data values above 60. This is indicative of elderly people traveling on the ship rather than outliers; hence, they are not treated as outliers.
• In case of Pclass there is no logic of checking for outliers, as it only contains three values: 1, 2, and 3.
• The Fare feature seems to have a few outliers, but we can still ignore them as they are small in number.
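A minimal sketch of the box-plot check for the features discussed above, assuming the cleaned DataFrame from the previous step:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One box plot per important numeric feature
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, col in zip(axes, ["Age", "Pclass", "Fare"]):
    sns.boxplot(y=dataset[col], ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```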
Categorical Encoding
As we can see, there are some features in object format; these features must be converted to numeric format using some character-encoding logic in order to use them for training and testing the model.
• Using "dataset.info()" we found that Sex and Embarked are the features which have object data types.
• encode_embarked() is a function which is used to convert S, C, Q to 0, 1, 2 respectively.
• Similarly, Male and Female are converted to binary 0 and 1 respectively (a sketch follows below).
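A minimal sketch of this encoding; the body of encode_embarked() is an assumption matching the mapping described above:

```python
def encode_embarked(value):
    # Map the Embarked port letters to integers: S -> 0, C -> 1, Q -> 2
    return {"S": 0, "C": 1, "Q": 2}[value]

dataset["Embarked"] = dataset["Embarked"].apply(encode_embarked)

# Sex: male -> 0, female -> 1 (binary encoding)
dataset["Sex"] = dataset["Sex"].map({"male": 0, "female": 1})
```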
Visualizing Feature-Target Dependence
Strip plots are used to analyze the distribution of each feature with respect to the target variable.
• Here, we can observe that the data in the features Pclass, Sex, and Embarked is not evenly distributed; rather, it is concentrated at certain values. However, there is some level of distribution observed in the case of Age and Fare (a plot sketch follows below).
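A minimal sketch of these strip plots, assuming the encoded DataFrame from above and "Survived" as the target column:

```python
import seaborn as sns
import matplotlib.pyplot as plt

features = ["Pclass", "Sex", "Embarked", "Age", "Fare"]
fig, axes = plt.subplots(1, len(features), figsize=(18, 4))
for ax, col in zip(axes, features):
    # x = target class, y = feature value; jitter spreads overlapping points
    sns.stripplot(x=dataset["Survived"], y=dataset[col], jitter=True, ax=ax)
plt.tight_layout()
plt.show()
```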
• For the train-test split, the scikit-learn library is used; the resulting distribution is given as follows (a minimal sketch of the split is shown below).
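A minimal sketch of the split, treating the validation and test sizes mentioned later in the report (10 and 20) as absolute counts; this interpretation is an assumption, as is the random_state:

```python
from sklearn.model_selection import train_test_split

# Assumes non-numeric columns have already been dropped or encoded
X = dataset.drop(columns=["Survived"]).values
Y = dataset["Survived"].values

# Hold out a test set, then carve a small validation set out of the rest
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=20, random_state=42)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=10, random_state=42)
```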
Decision Tree Implementation
Classes
Node Class:
* Purpose: Represents a node in the decision tree,
storing information about decision and leaf nodes.
Attributes:
* featureIndex: Index of the feature used for splitting.
* threshold: Threshold value for the feature.
* left: Left child node.
* right: Right child node.
* infoGain: Information gain at the node.
* value: Value for leaf nodes.
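A minimal sketch of the Node class with the attributes listed above:

```python
class Node:
    """A single node of the decision tree (decision node or leaf)."""

    def __init__(self, featureIndex=None, threshold=None, left=None,
                 right=None, infoGain=None, value=None):
        # Decision-node attributes
        self.featureIndex = featureIndex  # index of the feature used for splitting
        self.threshold = threshold        # threshold value for the feature
        self.left = left                  # left child node
        self.right = right                # right child node
        self.infoGain = infoGain          # information gain at this node
        # Leaf-node attribute
        self.value = value                # predicted class for leaf nodes
```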
DecisionTreeClassifier Class:
Purpose: Implements a decision tree classifier.
Attributes:
* root: The root node of the decision tree.
* maxDepth: Maximum depth of the tree (stopping
condition).
Methods:
* buildTree(dataset, currDepth)
* contoCat(dataset, numSamples, numFeatures)
* split(dataset, featureIndex, threshold)
* informationGain(parent, leftChild, rightChild)
* entropy(y)
* calculateLeafValue(Y)
* printTree(tree, indent, featureNames)
* fit(X, Y)
* infer(x)
* makePrediction(x, tree)
* calculate_accuracy(actual_labels, predicted_labels)
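A minimal sketch of how these methods could fit together, reusing the Node class above. Interpreting contoCat as the continuous-to-categorical best-split search is an assumption based on its name and signature; printTree and calculate_accuracy are omitted for brevity:

```python
import numpy as np

class DecisionTreeClassifier:
    def __init__(self, maxDepth=2):
        self.root = None          # root node of the fitted tree
        self.maxDepth = maxDepth  # stopping condition

    def fit(self, X, Y):
        # X and Y are NumPy arrays; labels are appended as the last column
        dataset = np.concatenate((X, Y.reshape(-1, 1)), axis=1)
        self.root = self.buildTree(dataset, currDepth=0)

    def buildTree(self, dataset, currDepth):
        Y = dataset[:, -1]
        numSamples, numFeatures = dataset.shape[0], dataset.shape[1] - 1
        if currDepth < self.maxDepth and len(np.unique(Y)) > 1:
            best = self.contoCat(dataset, numSamples, numFeatures)
            if best is not None and best["infoGain"] > 0:
                left = self.buildTree(best["left"], currDepth + 1)
                right = self.buildTree(best["right"], currDepth + 1)
                return Node(best["featureIndex"], best["threshold"],
                            left, right, best["infoGain"])
        return Node(value=self.calculateLeafValue(Y))  # leaf node

    def contoCat(self, dataset, numSamples, numFeatures):
        # Try every feature/threshold pair, keep the split with the best gain
        best, maxGain = None, -float("inf")
        for featureIndex in range(numFeatures):
            for threshold in np.unique(dataset[:, featureIndex]):
                left, right = self.split(dataset, featureIndex, threshold)
                if len(left) == 0 or len(right) == 0:
                    continue
                gain = self.informationGain(dataset[:, -1], left[:, -1], right[:, -1])
                if gain > maxGain:
                    best = {"featureIndex": featureIndex, "threshold": threshold,
                            "left": left, "right": right, "infoGain": gain}
                    maxGain = gain
        return best

    def split(self, dataset, featureIndex, threshold):
        left = dataset[dataset[:, featureIndex] <= threshold]
        right = dataset[dataset[:, featureIndex] > threshold]
        return left, right

    def informationGain(self, parent, leftChild, rightChild):
        wl = len(leftChild) / len(parent)
        wr = len(rightChild) / len(parent)
        return self.entropy(parent) - (wl * self.entropy(leftChild)
                                       + wr * self.entropy(rightChild))

    def entropy(self, y):
        # H(y) = -sum(p * log2(p)) over the class probabilities
        probs = np.bincount(y.astype(int)) / len(y)
        return -np.sum([p * np.log2(p) for p in probs if p > 0])

    def calculateLeafValue(self, Y):
        values = list(Y)
        return max(values, key=values.count)  # majority class

    def infer(self, X):
        return [self.makePrediction(x, self.root) for x in X]

    def makePrediction(self, x, tree):
        if tree.value is not None:  # reached a leaf
            return tree.value
        branch = tree.left if x[tree.featureIndex] <= tree.threshold else tree.right
        return self.makePrediction(x, branch)
```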
Important Functions
• contoCat: the continuous-to-categorical best-split search (code shown on the slide; sketched in the class above).
• entropy: the entropy function (code shown on the slide; sketched in the class above).
Model Training-Testing-Validation
• The best value for the hyperparameter is calculated using the validation dataset of size 10, where each iteration changes the max depth and is used to calculate the accuracy (the slide's console output lists pairs such as "MaxDepth: 2, Accuracy: 0.7471..." up to accuracies around 0.8262 and 0.8033). A sketch of this loop follows below.
• Finally, the model is tested on the test dataset of size 20.
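A minimal sketch of the validation loop, reusing the classifier and split variables from the earlier sketches; the depth range 1 to 10 is an assumption:

```python
import numpy as np

bestDepth, bestAccuracy = None, 0.0
for maxDepth in range(1, 11):
    clf = DecisionTreeClassifier(maxDepth=maxDepth)
    clf.fit(X_train, Y_train)
    accuracy = np.mean(np.array(clf.infer(X_val)) == Y_val)
    print(f"MaxDepth: {maxDepth}, Accuracy: {accuracy}")
    if accuracy > bestAccuracy:
        bestDepth, bestAccuracy = maxDepth, accuracy

# Retrain with the best depth and evaluate once on the held-out test set
clf = DecisionTreeClassifier(maxDepth=bestDepth)
clf.fit(X_train, Y_train)
testAccuracy = np.mean(np.array(clf.infer(X_test)) == Y_test)
```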
Confusion Matrix
The confusion_matrix function calculates the confusion matrix, a table used in classification to evaluate the performance of a classification algorithm. Here's a brief explanation of the logic:
Input Parameters:
• y_true: True labels of the test set.
• y_pred: Predicted labels by the classifier.
Initialization:
• Unique classes are extracted from the concatenation of true and predicted labels (unique_classes).
• The confusion matrix (conf_matrix) is initialized with zeros, where rows and columns correspond to the unique classes.
Filling the Confusion Matrix:
• Iterate through each pair of true and predicted labels.
• Identify the indices of the true and predicted labels in the unique_classes array.
• Increment the corresponding cell in the confusion matrix.
Output:
• Returns the filled confusion matrix (an example matrix is shown on the slide).
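A minimal sketch of the logic just described:

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    # Rows = true classes, columns = predicted classes
    unique_classes = np.unique(np.concatenate((y_true, y_pred)))
    conf_matrix = np.zeros((len(unique_classes), len(unique_classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        i = np.where(unique_classes == t)[0][0]  # row index of the true label
        j = np.where(unique_classes == p)[0][0]  # column index of the prediction
        conf_matrix[i, j] += 1
    return conf_matrix
```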
Precision, Recall, and F1 Score
Precision, recall, and F1 score are metrics commonly used to evaluate the performance of a classification model, especially in binary classification tasks.
Precision:
Precision is a measure of the accuracy of positive predictions made by the model.
Formula: Precision = True Positives / (True Positives + False Positives)
Reported output: Precision: 0.8387096774193549
Recall (Sensitivity or True Positive Rate):
Recall measures the model's ability to find all the positive instances.
Formula: Recall = True Positives / (True Positives + False Negatives)
F1 score:
The F1 score is a balanced metric that combines precision and recall into a single value.
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0 indicates poor performance.
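A minimal sketch of these three metrics computed from raw label arrays; treating class 1 as the positive class is an assumption:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    # Count the confusion-matrix cells for the positive class
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```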
Question 2
Dataset Exploration
• The given dataset "tvmarketing.csv" has two columns with features "tv" and "sales".
• We observe that there are no null values and the datatype of each column is numeric.
• All the parameters for each column are as follows:
Outliers
• Here we have plotted distribution plots to see whether the features are normally distributed or skewed in nature.
• We can clearly see that the sales feature is normally distributed; however, the TV feature is skewed in nature, which means the data has to be normalized before it can be brought into use.
Feature-Target Dependence
• We have used a scatter plot representation to observe the relation between the target feature Sales and TV.
• We then compute the mean and standard deviation of the dataset for each feature.
Normalization of the TV marketing budget and sales columns
• Each feature is normalized using the given formula: X_norm = (X - μ) / σ, where μ and σ are the mean and standard deviation computed above.
• We can clearly see that now the data has been confined to a common normalized range (a sketch follows below).
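A minimal sketch of this normalization, assuming the column names "TV" and "Sales" and the z-score formula above:

```python
import pandas as pd

df = pd.read_csv("tvmarketing.csv")

# Normalize each column using its own mean and standard deviation
for col in ["TV", "Sales"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()
```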
Train-Test Split
• The scikit-learn library is used to split the data into test and train segments with ratio 80:20.
Task-3: Linear Regression Implementation
Hypothesis function
Predicts output (y_pred) using the linear equation with given weight (w1) and bias (w0).
Cost function
Computes the mean squared error (MSE) between predicted values and actual output, providing a measure of how well the model is performing.
Gradient descent function
• Initializes random weights (w1 and w0) and performs iterative updates to minimize the cost.
• In each iteration:
a. Calculates the gradients (w1_grad and w0_grad) by summing the product of the prediction errors and input features.
b. Updates weights using the learning rate (alpha) and the average of gradients.
c. Appends the current cost to the cost list.
Input Data:
• X_train_array: Normalized TV feature values.
• y_train_array: Normalized Sales values.
Hyperparameters:
• alpha: Learning rate, determining the step size for each update.
• iterations: Number of iterations to update weights.
A sketch of this loop follows below.
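A minimal sketch of the hypothesis, cost, and gradient-descent functions described above; the default hyperparameter values are assumptions:

```python
import numpy as np

def hypothesis(x, w1, w0):
    # Linear prediction: y_pred = w1 * x + w0
    return w1 * x + w0

def cost_function(y_pred, y_actual):
    # Mean squared error between predictions and actual values
    return np.mean((y_pred - y_actual) ** 2)

def gradient_descent(X_train_array, y_train_array, alpha=0.01, iterations=1000):
    rng = np.random.default_rng(0)
    w1, w0 = rng.random(), rng.random()  # random initial weights
    n = len(X_train_array)
    cost_list = []
    for _ in range(iterations):
        y_pred = hypothesis(X_train_array, w1, w0)
        error = y_pred - y_train_array
        # Average gradients of the MSE cost with respect to w1 and w0
        w1_grad = (2 / n) * np.sum(error * X_train_array)
        w0_grad = (2 / n) * np.sum(error)
        w1 -= alpha * w1_grad
        w0 -= alpha * w0_grad
        cost_list.append(cost_function(y_pred, y_train_array))
    return w1, w0, cost_list
```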
• Mean Squared Error and Mean Absolute Error are then calculated on the test set as per the formulas MSE = mean((y_pred - y_actual)^2) and MAE = mean(|y_pred - y_actual|), with the outputs shown below.
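A minimal sketch of that evaluation, assuming test arrays named X_test_array and y_test_array and the weights returned by the sketch above:

```python
y_test_pred = hypothesis(X_test_array, w1, w0)
mse = np.mean((y_test_pred - y_test_array) ** 2)   # Mean Squared Error
mae = np.mean(np.abs(y_test_pred - y_test_array))  # Mean Absolute Error
```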
Question 3
Dataset Exploration
• The given dataset "boston.csv" has 14 columns in total.
• We observe that there are no null values and the datatype of each column is numeric.
Outliers
• We have plotted a correlation-matrix heat map to study the features (a sketch follows below).
• RAD and TAX are closely correlated to each other, as visible in the heat map, and the CHAS column's correlation with MEDV is very low; hence it can be removed in the pre-processing step.
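A minimal sketch of this study, using the file and column names given in the report:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

boston = pd.read_csv("boston.csv")

# Correlation matrix across all 14 columns, rendered as a heat map
sns.heatmap(boston.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()

# CHAS correlates very weakly with the target MEDV, so drop it
boston = boston.drop(columns=["CHAS"])
```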
Feature-Target Dependence
• We have used scatter plot representations to observe the relation between the target feature MEDV and the other 13 features.
• We then compute the mean and standard deviation of the dataset for each feature.
Normalization of the Columns
• Here we plot distribution plots to see whether the features follow a normal distribution or are skewed in nature.
• We can see that the majority of the columns are skewed in nature, which means that they need to be normalized before we can actually use them.
• Each feature is normalized using the same formula as before: X_norm = (X - μ) / σ.
• The normalized features look as shown.
• We can clearly see that now the data has been confined to a common normalized range.
Task-3: Linear Regression Implementation
Difference Between the Simple Linear Regression and Multivariable Regression Implementation
Hypothesis function
• The y predicted earlier was Y = A + B·X, but now it has become Y = w1·X1 + w2·X2 + … + wn·Xn + w0.
• Updated hypothesis function to handle multiple features: "np.dot(X, parameters)".
mean_squared_error
• Calculates the error for all the samples simultaneously (vectorized): "np.mean((y_pred - y_actual)**2)".
Gradient descent function
• The gradient descent function will update multiple coefficients (weights), one corresponding to each feature; the gradients for all the parameters are calculated across the multiple features.
mean_absolute_error
• Calculates the error for all the samples simultaneously (vectorized): "np.mean(np.abs(y_pred - y_actual))".
A vectorized sketch of this follows below.
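A minimal sketch of the vectorized multivariable gradient descent; folding the bias w0 into the parameter vector via a column of ones is a design choice assumed here, not stated on the slide:

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, iterations=1000):
    # Prepend a column of ones so the bias w0 becomes parameters[0]
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    parameters = np.zeros(Xb.shape[1])
    cost_list = []
    for _ in range(iterations):
        y_pred = np.dot(Xb, parameters)                 # hypothesis, all samples at once
        error = y_pred - y
        gradients = (2 / len(y)) * np.dot(Xb.T, error)  # one gradient per weight
        parameters -= alpha * gradients
        cost_list.append(np.mean(error ** 2))           # mean_squared_error per iteration
    return parameters, cost_list
```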
Error vs. Iteration Plot
• We have plotted the Mean Squared Error vs. iteration plot ("Training Loss Over Iterations") to see how the model improves in successive iterations.
• Finally, the Mean Squared Error and Mean Absolute Error on the test set have been calculated.