Problem Statement

This document gives instructions for two questions on data analysis and machine learning modeling. Question 1 asks you to import a movie review dataset, perform 10-fold cross-validation, extract TF-IDF features, train GaussianNB, BernoulliNB and MultinomialNB classifiers, and output the accuracy, confusion matrices and predictions. Question 2 asks you to import a diabetes dataset, handle missing values, visualize the data, perform 10-fold cross-validation, train a logistic regression model, report the coefficients and decision boundary, and compute the accuracy and confusion matrix.

Uploaded by

Brianearl

Instructions

1. Follow the instructions in each question carefully.
2. A Jupyter notebook, with the output of each cell included, is expected.
3. Assignments submitted using other Python IDEs will not be considered for
grading.
4. Use appropriate labels for all visualizations.
5. Upload the output.csv file along with the notebook when required.
6. If a dataset link has expired, search for the same dataset in any online
repository and use it.

Question 1

1. Import the dataset from http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz .
2. Split the data into training and testing sets. Use 10-fold cross-validation.
3. Extract features using TF-IDF and display the features.
4. Model the classifier using GaussianNB, BernoulliNB and MultinomialNB and
train the classifiers.
5. Compute the accuracy and confusion matrix for each model.
6. Create an output .csv file containing the actual test set values of Y (column
name: Actual) and the predictions of Y (column name: Predicted).
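
The steps of Question 1 can be sketched roughly as follows. This is only a minimal illustration, not a reference solution: the toy corpus, the `evaluate` helper, the random seed, and the choice to write the MultinomialNB predictions to output.csv are all assumptions made for the example. In a real notebook the texts and labels would be read from the extracted review_polarity archive. The sketch assumes scikit-learn and pandas are installed.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Toy stand-in corpus (assumption): replace with the review texts loaded
# from the extracted archive; 1 = positive, 0 = negative.
positive = [f"a wonderful moving film with great acting take {i}" for i in range(10)]
negative = [f"a dull boring film with terrible acting take {i}" for i in range(10)]
texts = positive + negative
labels = [1] * 10 + [0] * 10

# Step 3: fit TF-IDF on the whole corpus only to display the feature names.
tfidf = TfidfVectorizer().fit(texts)
print(sorted(tfidf.vocabulary_))


def evaluate(model, texts, labels, n_splits=10):
    """Hypothetical helper: run stratified k-fold CV and collect predictions."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    actual, predicted = [], []
    for train_idx, test_idx in skf.split(texts, labels):
        # Fit the vectorizer on the training fold only to avoid leakage.
        vec = TfidfVectorizer()
        X_train = vec.fit_transform([texts[i] for i in train_idx]).toarray()
        X_test = vec.transform([texts[i] for i in test_idx]).toarray()
        # GaussianNB needs dense input, hence .toarray() above.
        model.fit(X_train, [labels[i] for i in train_idx])
        predicted.extend(model.predict(X_test).tolist())
        actual.extend(labels[i] for i in test_idx)
    return actual, predicted


# Steps 4-5: train each classifier and compute accuracy and confusion matrix.
results = {}
for model in (GaussianNB(), BernoulliNB(), MultinomialNB()):
    actual, predicted = evaluate(model, texts, labels)
    results[type(model).__name__] = {
        "accuracy": accuracy_score(actual, predicted),
        "confusion_matrix": confusion_matrix(actual, predicted),
        "actual": actual,
        "predicted": predicted,
    }

# Step 6: write the Actual/Predicted columns (MultinomialNB chosen arbitrarily).
chosen = results["MultinomialNB"]
output = pd.DataFrame({"Actual": chosen["actual"], "Predicted": chosen["predicted"]})
output.to_csv("output.csv", index=False)
```

Because every sample lands in exactly one test fold, the pooled predictions cover the whole dataset, so each confusion matrix sums to the number of reviews.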

Question 2

The diabetes data (diabetes.csv) has a response variable indicating whether a
person has diabetes; a value of 1 means the person has diabetes.

1. Import the dataset from https://www.kaggle.com/uciml/pima-indians-diabetes-database.
2. Identify the columns with missing values (1 point). Fill the missing values
with the mean for numerical attributes and the mode for categorical attributes.
3. Extract X as all columns except the last column and Y as the last column.
4. Visualize the dataset.
5. Split the data into a training set and a testing set. Perform 10-fold
cross-validation.
6. Train a logistic regression model on the dataset.
7. Display the coefficients and form the logistic regression equation.
8. Compute the accuracy and confusion matrix.
9. Plot the decision boundary.
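
A rough sketch of steps 2, 3 and 5 to 9 of Question 2 is given below; step 4 (visualization) is left out. It is an illustration under stated assumptions, not a reference solution: the synthetic two-feature frame stands in for diabetes.csv, missing values are injected as NaN for demonstration (in the real file, check how missing values are actually encoded), and the 80/20 split, seed and `max_iter` value are arbitrary choices. Two features are used so that the decision boundary is a straight line. The sketch assumes scikit-learn, pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for diabetes.csv (assumption): two features and a label.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "BMI": rng.normal(32, 6, n),
})
score = 0.04 * (df["Glucose"] - 120) + 0.1 * (df["BMI"] - 32) + rng.normal(0, 1, n)
df["Outcome"] = (score > 0).astype(int)
df.loc[::25, "Glucose"] = np.nan  # inject missing values for the demo

# Step 2: find columns with missing values; fill with mean or mode.
missing = df.columns[df.isna().any()].tolist()
for col in missing:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

# Step 3: X is every column except the last, Y is the last column.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Step 5: hold out a test set, then 10-fold cross-validate on the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=10)

# Steps 6-8: fit, then report coefficients, accuracy and confusion matrix.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

terms = " + ".join(f"{c:.4f}*{name}" for c, name in zip(model.coef_[0], X.columns))
equation = f"logit(p) = {model.intercept_[0]:.4f} + {terms}"
print(equation)

# Step 9: the boundary is where logit(p) = 0, a line in the Glucose-BMI plane.
w_glucose, w_bmi = model.coef_[0]
glucose_grid = np.linspace(X["Glucose"].min(), X["Glucose"].max(), 50)
bmi_boundary = -(model.intercept_[0] + w_glucose * glucose_grid) / w_bmi
```

The arrays `glucose_grid` and `bmi_boundary` can be passed to a plotting library (for example matplotlib's `plot`) and drawn over a scatter plot of the two features to show the decision boundary. With more than two features the boundary is a hyperplane, so a plot would need a chosen pair of features or a projection.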