Question 1

Task 1: Dealing with Missing Values
* I start by dropping the unnecessary columns; in particular, the Cabin feature has a very high number of missing values.

Outliers
Here I have used box plots to identify the outliers for all the important features. As we can see from the box plots:
* The Age plot shows some data values above 60. This is indicative of elderly people travelling on the ship rather than outliers.
* In the case of Pclass there is no logic in checking for outliers, as it only contains the three values 1, 2 and 3.
* The Fare feature seems to have a few outliers, but we can still ignore them as they are small in number.

Categorical Encoding
As we can see, there are some features that are in object format; these features must be converted to number format using some logic of character encoding in order to use them for training and testing the model.
* Using dataset.info() we found out that Sex and Embarked are the features which have object data types.
* encode_embarked() is a function used to convert S, C, Q to 0, 1, 2 respectively.
* Similarly, Male and Female are converted to binary 0 and 1 respectively. (A sketch of this step is given at the end of this question.)

Visualizing Feature-Target Dependence
A strip plot is used to analyze the distribution of each feature with respect to the target variable.
* Here we can observe that the data in the features Pclass, Sex, and Embarked is not evenly distributed; rather, it is concentrated at certain values. However, there is some level of spread observed in the case of Age and Fare.
* For the train-test split the scikit-learn library is used; the resulting distribution is shown in the report's plot.

Decision Tree Implementation

Classes

Node class:
* Purpose: represents a node in the decision tree, storing information about decision and leaf nodes.
* Attributes:
  * featureIndex: index of the feature used for splitting.
  * threshold: threshold value for the feature.
  * left: left child node.
  * right: right child node.
  * infoGain: information gain at the node.
  * value: value for leaf nodes.

DecisionTreeClassifier class:
* Purpose: implements a decision tree classifier.
* Attributes:
  * root: the root node of the decision tree.
  * maxDepth: maximum depth of the tree (stopping condition).
* Methods:
  * buildTree(dataset, currDepth)
  * contoCat(dataset, numSamples, numFeatures)
  * split(dataset, featureIndex, threshold)
  * informationGain(parent, leftChild, rightChild)
  * entropy(y)
  * calculateLeafValue(Y)
  * printTree(tree, indent, featureNames)
  * fit(X, Y)
  * infer(x)
  * makePrediction(x, tree)
  * calculate_accuracy(actual_labels, predicted_labels)

Important Functions
The contoCat and entropy functions appear in the original report as code screenshots. (A sketch of the entropy and information-gain logic is given at the end of this question.)

Model Training-Testing-Validation
* The best value for the maxDepth hyperparameter is found using the validation dataset of size 10: each iteration changes the max depth and the corresponding validation accuracy is calculated.
* Finally, the model is tested on the test dataset of size 20.
(Screenshot in the original: a per-depth accuracy log, e.g. "MaxDepth: 2, Accuracy: 0.747...", with other accuracies of about 0.826 and 0.803.)

Confusion Matrix
The confusion_matrix function calculates the confusion matrix, a table used in classification to evaluate the performance of a classification algorithm. Here is a brief explanation of the logic:

Input parameters:
* y_true: true labels of the test set.
* y_pred: predicted labels by the classifier.

Initialization:
* Unique classes are extracted from the concatenation of the true and predicted labels (unique_classes).
* The confusion matrix (conf_matrix) is initialized with zeros, where rows and columns correspond to the unique classes.

Filling the confusion matrix:
* Iterate through each pair of true and predicted labels.
* Identify the indices of the true and predicted labels in the unique_classes array.
* Increment the corresponding cell in the confusion matrix.

Output:
* Returns the filled confusion matrix. (The resulting matrix is shown as a screenshot in the original report.)

Precision, Recall, and F1 Score
Precision, recall, and F1 score are metrics commonly used to evaluate the performance of a classification model, especially in binary classification tasks.
* Precision is a measure of the accuracy of the positive predictions made by the model. Formula: Precision = True Positives / (True Positives + False Positives).
* Recall (sensitivity or true positive rate). Formula: Recall = True Positives / (True Positives + False Negatives).
* F1 score: a metric that combines precision and recall into a single value, F1 = 2 * Precision * Recall / (Precision + Recall). The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0 indicates poor performance.
(Output in the original report: Precision: 0.8387..., followed by the recall and F1 values.)
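A minimal sketch of the categorical-encoding step described above, assuming the data is loaded in a pandas DataFrame named dataset with Sex and Embarked columns; the helper name encode_embarked follows the report, while the lowercase "male"/"female" spellings are an assumption:

```python
import pandas as pd

def encode_embarked(dataset: pd.DataFrame) -> pd.DataFrame:
    # Map the object-typed columns to numeric codes: S, C, Q -> 0, 1, 2
    # and male, female -> 0, 1, as described in the text.
    dataset = dataset.copy()
    dataset["Embarked"] = dataset["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    dataset["Sex"] = dataset["Sex"].map({"male": 0, "female": 1})
    return dataset
```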
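A compact sketch of the decision-tree logic described above. The names Node, DecisionTreeClassifier, maxDepth, buildTree, informationGain, entropy, fit, makePrediction and calculate_accuracy follow the report; bestSplit is a stand-in for the report's contoCat/split helpers, whose exact code is only visible as screenshots:

```python
import numpy as np

class Node:
    def __init__(self, featureIndex=None, threshold=None, left=None,
                 right=None, infoGain=None, value=None):
        self.featureIndex = featureIndex  # index of the feature used for splitting
        self.threshold = threshold        # threshold value for the feature
        self.left = left                  # left child (feature <= threshold)
        self.right = right                # right child (feature > threshold)
        self.infoGain = infoGain          # information gain at the node
        self.value = value                # majority class, set only for leaves

def entropy(y):
    # Shannon entropy of a 1-D label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def informationGain(parent, leftChild, rightChild):
    # reduction in entropy achieved by splitting parent into the two children
    wLeft = len(leftChild) / len(parent)
    wRight = len(rightChild) / len(parent)
    return entropy(parent) - wLeft * entropy(leftChild) - wRight * entropy(rightChild)

def calculate_accuracy(actual_labels, predicted_labels):
    return np.mean(np.asarray(actual_labels) == np.asarray(predicted_labels))

class DecisionTreeClassifier:
    def __init__(self, maxDepth=2):
        self.root = None
        self.maxDepth = maxDepth  # stopping condition

    def fit(self, X, Y):
        self.root = self.buildTree(X, Y, currDepth=0)

    def buildTree(self, X, Y, currDepth):
        # recurse until the node is pure or the depth limit is reached
        if currDepth < self.maxDepth and len(np.unique(Y)) > 1:
            feat, thr, gain = self.bestSplit(X, Y)
            if gain > 0:
                mask = X[:, feat] <= thr
                left = self.buildTree(X[mask], Y[mask], currDepth + 1)
                right = self.buildTree(X[~mask], Y[~mask], currDepth + 1)
                return Node(feat, thr, left, right, gain)
        # leaf: predict the majority class
        values, counts = np.unique(Y, return_counts=True)
        return Node(value=values[np.argmax(counts)])

    def bestSplit(self, X, Y):
        # exhaustive search over every feature and every observed threshold
        bestFeat, bestThr, bestGain = None, None, 0.0
        for feat in range(X.shape[1]):
            for thr in np.unique(X[:, feat]):
                mask = X[:, feat] <= thr
                if mask.all() or (~mask).all():
                    continue  # degenerate split, skip
                gain = informationGain(Y, Y[mask], Y[~mask])
                if gain > bestGain:
                    bestFeat, bestThr, bestGain = feat, thr, gain
        return bestFeat, bestThr, bestGain

    def makePrediction(self, x, tree):
        # walk from the root to a leaf, following the learned thresholds
        if tree.value is not None:
            return tree.value
        if x[tree.featureIndex] <= tree.threshold:
            return self.makePrediction(x, tree.left)
        return self.makePrediction(x, tree.right)

    def infer(self, X):
        return np.array([self.makePrediction(x, self.root) for x in X])
```

The hyperparameter search described above then amounts to training one such tree per candidate maxDepth and keeping the depth with the best validation accuracy.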
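And a sketch of the evaluation routines: confusion_matrix follows the step-by-step logic given above, while the precision/recall/F1 computation assumes the positive class is labelled 1, as in binary survival prediction:

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    # rows = true classes, columns = predicted classes
    unique_classes = np.unique(np.concatenate((y_true, y_pred)))
    conf_matrix = np.zeros((len(unique_classes), len(unique_classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        i = np.where(unique_classes == t)[0][0]  # row index of the true label
        j = np.where(unique_classes == p)[0][0]  # column index of the prediction
        conf_matrix[i, j] += 1
    return conf_matrix

def precision_recall_f1(y_true, y_pred, positive=1):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```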
Question 2

Dataset Exploration
* The given dataset "tvmarketing.csv" has two columns, with features "TV" and "Sales".
* We observe that there are no null values and that the datatype of each column is numeric.
* The summary statistics for each column are shown in the report.

Outliers
* Here we have plotted distribution plots to see whether the features are normally distributed or skewed in nature.
* We can clearly see that the Sales feature is normally distributed; however, the TV feature is skewed in nature, which means the data has to be normalised before it can be brought into use.

Feature-Target Dependence
* We have used a scatter plot representation to observe the relation between the target feature Sales and TV.
* We then compute the mean and standard deviation of the dataset for each feature.

Normalization of the TV Marketing Budget and Sales Columns
* Each feature is normalized using the given formula, X_norm = (X - X_min) / (X_max - X_min).
* We can clearly see that the data has now been confined to the range [0, 1]. (A sketch of this step is given at the end of this question.)

Train-Test Split
The scikit-learn library is used to split the data into train and test segments with ratio 80:20.

Task-3: Linear Regression Implementation

Hypothesis function
Predicts the output (y_pred) using the linear equation with the given weight (w1) and bias (w0): y_pred = w1 * x + w0.

Cost function
Computes the mean squared error (MSE) between the predicted values and the actual output, providing a measure of how well the model is performing.

Gradient descent function
* Initializes random weights (w1 and w0) and performs iterative updates to minimize the cost.
* In each iteration it:
  a. calculates the gradients (w1_grad and w0_grad) by summing the product of the prediction errors and the input features;
  b. updates the weights using the learning rate (alpha) and the average of the gradients;
  c. appends the current cost to the cost_list.

Input data:
* X_train_array: normalized TV feature values.
* y_train_array: normalized Sales values.

Hyperparameters:
* alpha: learning rate, determining the step size of each update.
* iterations: number of iterations for which the weights are updated.

The Mean Squared Error and Mean Absolute Error are then calculated on the test set as per the formulas shown in the report, with the outputs displayed there. (A sketch of the full training loop follows the normalization sketch below.)
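A one-liner for the normalization step described above (min-max rescaling, applied to one column at a time):

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    # Rescale linearly so the minimum maps to 0 and the maximum to 1.
    return (x - x.min()) / (x.max() - x.min())
```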
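And a sketch of the training loop, with the names w1, w0, alpha, cost_list, X_train_array and y_train_array following the report. The random initialization and the averaged gradients match the description above; note that the factor of 2 from the MSE derivative is folded into alpha, a common simplification that is an assumption here:

```python
import numpy as np

def hypothesis(x, w1, w0):
    # linear prediction y_pred = w1 * x + w0
    return w1 * x + w0

def cost_function(x, y, w1, w0):
    # mean squared error between predictions and actual outputs
    return np.mean((hypothesis(x, w1, w0) - y) ** 2)

def gradient_descent(x, y, alpha=0.01, iterations=1000):
    rng = np.random.default_rng(0)
    w1, w0 = rng.standard_normal(2)  # random initial weights
    cost_list = []
    n = len(x)
    for _ in range(iterations):
        error = hypothesis(x, w1, w0) - y
        w1_grad = np.sum(error * x) / n  # average of error * input
        w0_grad = np.sum(error) / n      # average of error
        w1 -= alpha * w1_grad
        w0 -= alpha * w0_grad
        cost_list.append(cost_function(x, y, w1, w0))
    return w1, w0, cost_list

# usage, given the normalized training arrays from the report:
# w1, w0, cost_list = gradient_descent(X_train_array, y_train_array)
```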
Question 3

Dataset Exploration
* The given dataset "boston.csv" has 14 columns in total.
* We observe that there are no null values and that the datatype of each column is numeric.

Outliers
* We have plotted a correlation-matrix heat map to study the features.
* RAD and TAX are closely correlated to each other, as visible in the heat map, and the CHAS column's correlation with MEDV is very low; hence it can be removed in the pre-processing step.

Feature-Target Dependence
* We have used scatter plot representations to observe the relation between the target feature MEDV and the other 13 features.
* We then compute the mean and standard deviation of the dataset for each feature.

Normalization
* Here we plot distribution plots to see whether the features follow a normal distribution or are skewed in nature.
* We can see that the majority of the columns are skewed in nature, which means that they need to be normalised before we can actually use them.
* Each feature is normalized using the given formula (the same rescaling as in Question 2); the normalized features are shown in the report. We can clearly see that the data has now been confined to the same range across all columns.

Task-3: Linear Regression Implementation

Difference between the simple linear regression and the multivariable regression implementation:

Hypothesis function
The y predicted earlier was Y = A + B*x, but now it has become Y = w1*X1 + w2*X2 + ... + wn*Xn + w0.
* The hypothesis function is updated to handle multiple features: np.dot(X, parameters).

mean_squared_error
* This will calculate the error for all the samples simultaneously: np.mean((y_pred - y_actual)**2).

Gradient descent function
* The gradient descent function will have multiple coefficients (weights), one corresponding to each feature. The gradients of all the parameters are calculated for the multiple features at once.

mean_absolute_error
* This will calculate the error for all the samples simultaneously: np.mean(np.abs(y_pred - y_actual)).

Error vs Iteration Plot
* We have plotted the Mean Squared Error vs iteration plot ("Training Loss Over Iterations") to see how the model improves in successive iterations.
* Finally, the Mean Squared Error and Mean Absolute Error on the test set have been calculated. (A sketch of the multivariable version follows.)
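A sketch of the multivariable version. It assumes the normalized features sit in an (m, n) numpy matrix X to which a column of ones has been appended, so that the single parameters vector carries w1..wn together with the bias w0, matching np.dot(X, parameters); the error functions follow the np.mean formulas quoted above:

```python
import numpy as np

def hypothesis(X, parameters):
    # Y = w1*X1 + ... + wn*Xn + w0, computed for all samples at once
    return np.dot(X, parameters)

def mean_squared_error(y_pred, y_actual):
    return np.mean((y_pred - y_actual) ** 2)

def mean_absolute_error(y_pred, y_actual):
    return np.mean(np.abs(y_pred - y_actual))

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    m, n = X.shape
    parameters = np.zeros(n)  # one coefficient per feature, including the bias column
    cost_list = []
    for _ in range(iterations):
        error = hypothesis(X, parameters) - y
        gradients = np.dot(X.T, error) / m  # all feature gradients in one step
        parameters -= alpha * gradients
        cost_list.append(mean_squared_error(hypothesis(X, parameters), y))
    return parameters, cost_list
```

Plotting cost_list against the iteration index reproduces the "Training Loss Over Iterations" curve described above, and using a single TV column plus the bias recovers the simple regression from Question 2.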
