Make Up Assignment - Data Science

Uploaded by

lorrainencube175

Assignment Rules

Please ensure that you strictly adhere to the following rules while completing this assignment. Any violation of
these guidelines will result in significant penalties or disqualification.

1. Plagiarism and Cheating

• Plagiarism Check: This assignment will be submitted through Turnitin, and any submission that
shows a similarity index of more than 50% will automatically receive a zero mark. Ensure that you
submit your own work.

• Originality: You are required to produce original work. Copying or paraphrasing substantial portions
of code or content from online sources or other students without appropriate acknowledgment will be
classified as plagiarism.

• Consequences: Plagiarism or cheating in any form, including the sharing or collaboration on this
individual assignment, will result in an automatic zero for the assignment. Further disciplinary action
may follow in accordance with the academic integrity policies of the institution.

2. Submission Guidelines

• No Code Submissions: You must not submit any code as part of your submission. This assignment is
based on your analysis and understanding of the concepts, so I will only be marking your content and
explanations. Focus on demonstrating a clear and deep understanding of the material, with well-
articulated answers.

• Clarity of Submission: Ensure that your work is well-organized, with answers clearly labelled
according to the question number.

3. Referencing

• Harvard Referencing Style: You must use the Harvard referencing style for all citations, including
inline citations. Failure to do so will result in up to 30% off your final mark. All external sources,
including datasets, academic papers, or online resources, must be cited properly.

• Incorrect Referencing: Improper or missing citations will result in up to a 30% deduction from your
overall grade, depending on the severity of the issues.

4. Assessment Criteria

• Content Evaluation: I will be marking you based on the content and depth of your answers. Make
sure your responses demonstrate critical thinking, the ability to apply theoretical concepts to practical
problems, and a comprehensive understanding of the dataset and the questions posed.

5. Code of Conduct

• Deadlines: Late submissions will not be accepted unless prior approval for an extension is granted
based on valid circumstances.

• Honesty and Integrity: Academic integrity is of paramount importance. Ensure that your work reflects
your individual effort and understanding.

By submitting this assignment, you confirm that you have read and understood these rules and agree to comply
with them. Non-compliance will result in academic penalties as specified.

Introduction to Data Science Assignment (100 Marks)


Dataset: Titanic: Machine Learning from Disaster
Time: 36 hours. Due at 12pm on Friday the 18th. No late submissions will be accepted.
Instructions: Use the Titanic dataset from Kaggle to complete the following tasks. Submit a
zipped folder containing your code (in Python), dataset, and a report with answers,
visualizations, and interpretations.
Dataset Link: Titanic - Machine Learning from Disaster
(https://www.kaggle.com/c/titanic/overview)

Question 1: k-Nearest Neighbours (k-NN) (12 Marks)


You are tasked with using the k-NN algorithm to predict whether a passenger survived the
Titanic disaster based on features like age, fare, and class.
1.1 Explain the k-NN algorithm and how it can be used to classify passengers. (2)
1.2 How does class imbalance between the number of passengers who survived and those
who did not affect the performance of the k-NN model? (3)
1.3 Explain how varying the value of k could impact the effect of class imbalance. (3)
1.4 Propose a method to address the class imbalance in k-NN. Explain your choice. (4)
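
As a starting point for thinking about questions 1.1 and 1.3, the following is a minimal k-NN sketch in plain Python. The rows in TRAIN are invented for illustration and are not taken from the Kaggle data; the function name knn_predict and the unscaled Euclidean distance are assumptions of this sketch, not a prescribed method.

```python
import math
from collections import Counter

# Invented toy rows (NOT from the Kaggle file): (age, fare, pclass) -> survived.
# The "did not survive" class (0) is deliberately the majority.
TRAIN = [
    ((20.0, 10.0, 3), 0),
    ((30.0, 12.0, 3), 0),
    ((40.0, 9.0, 3), 0),
    ((25.0, 80.0, 1), 1),
    ((35.0, 90.0, 1), 1),
]

def knn_predict(train, query, k):
    """Return the majority label among the k training rows closest to query."""
    # Sort training rows by Euclidean distance to the query (features unscaled).
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

query = (28.0, 85.0, 1)  # hypothetical first-class passenger
small_k = knn_predict(TRAIN, query, k=3)
large_k = knn_predict(TRAIN, query, k=5)
print(small_k, large_k)
```

With k=3 the two nearby minority-class rows win the vote; with k=5 every training row votes and the majority class wins, which is one way class imbalance and the choice of k interact.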

Question 2: Decision Trees (12 Marks)


A decision tree will be used to predict passenger survival using the Titanic dataset.
2.1 Explain how a decision tree works and how it can be used to predict survival. (2)
2.2 Give an example where the depth of the decision tree is less important in decision-
making. (2)
2.3 Provide a scenario where the error rate is less important than the simplicity or
interpretability of the model. (2)
2.4 Discuss the characteristics of the Titanic dataset that make decision trees a suitable model.
(4)
2.5 What is one major challenge in decision tree modelling, and how can it be addressed? (2)

Question 3: Ensemble Learning (12 Marks)


You will explore how ensemble learning techniques, like Random Forests, can improve
predictions on the Titanic dataset.
3.1 Can bagging and feature selection be applied to k-NN classifiers? Discuss any challenges.
(4)
3.2 Discuss the interpretability of Random Forest models. Can decision rules be extracted
from them? (3)
3.3 How is majority voting typically used in Random Forests, and how can it be adapted for
regression tasks? (3)
3.4 If some trees in a random forest are less accurate than others, how can you ensure that
majority voting remains fair? (2)

Question 4: Neural Networks and Perceptrons (14 Marks)


Explore how neural networks can be applied to predict passenger survival.
4.1 What is a perceptron, and how could it be used to classify passengers? (3)
4.2 Compare step functions with smooth activation functions like sigmoid. What are the
advantages of smooth activation functions? (3)
4.3 What is the purpose of hidden layers in a neural network? (2)
4.4 Neural networks are often referred to as “black boxes.” Why is that? Does this apply to
perceptrons? (3)
4.5 Compare the perceptron and Support Vector Machine (SVM) in terms of classification
tasks. (3)

Question 5: Regression Analysis (20 Marks)


Perform regression analysis using the Titanic dataset to predict the fare a passenger paid.
5.1 What is regression analysis, and how does it differ from classification? (3)
5.2 Perform a linear regression to predict the fare using features like age, class, and
embarkation point. Interpret the results. (6)
5.3 Explain the concept of a correlation matrix. What relationships can you observe between
variables in the dataset? (3)
5.4 Interpret the regression output (coefficients, R²). What does it tell you about the model's
effectiveness? (4)
5.5 Discuss the problem of overfitting in regression models and how it can be avoided. (4)
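
To make the terms in 5.2 and 5.4 concrete, here is a minimal sketch of a one-feature least-squares fit with its R² value, written in plain Python. The PCLASS and FARE values are invented for illustration, the helper fit_simple_linear is hypothetical, and treating class as a numeric feature is a simplification; a full answer would use the real dataset and more than one feature.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Invented values (NOT from the Kaggle file): passenger class and fare paid.
PCLASS = [1, 1, 2, 2, 3, 3]
FARE = [80.0, 90.0, 30.0, 20.0, 10.0, 10.0]

slope, intercept, r2 = fit_simple_linear(PCLASS, FARE)
print(f"slope={slope:.2f} intercept={intercept:.2f} R2={r2:.3f}")
```

On this toy data the negative slope says that the predicted fare drops as the class number rises, and R² reports the share of fare variance the line accounts for.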

Question 6: Clustering (20 Marks)


Clustering will be used to group passengers based on similar characteristics.
6.1 Explain k-means clustering and how you could group passengers based on features like
age, class, and fare. (3)
6.2 Perform k-means clustering on the Titanic dataset. Determine the number of clusters and
describe them. (5)
6.3 Visualize the centroids of the clusters and explain what patterns you observe. (4)
6.4 How would you evaluate the quality of your clusters? Discuss methods for cluster
validation. (4)
6.5 Provide a real-world application of clustering in the context of the Titanic dataset, and
explain its potential use. (4)
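
For orientation on 6.1 to 6.3, the following is a minimal sketch of the k-means loop (assign each point to its nearest centroid, then recompute centroids) in plain Python. The POINTS values are invented, the function name kmeans is hypothetical, and initialising from the first k points is a simplifying assumption; library implementations typically use better initialisation and scaled features.

```python
import math

def kmeans(points, k, iters=20):
    """Basic Lloyd's algorithm; returns (centroids, labels)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points (assumption)
    for _ in range(iters):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
    return centroids, labels

# Invented (age, fare) pairs, NOT taken from the Kaggle file.
POINTS = [(20, 10), (60, 90), (22, 12), (24, 8), (58, 85), (62, 95)]
centroids, labels = kmeans(POINTS, k=2)
print(centroids, labels)
```

The returned centroids are the per-cluster feature means, which is what question 6.3 asks to visualise and interpret.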

Question 7: Model Comparison and Overfitting (10 Marks)


Compare the models you've used in this assignment in terms of performance and overfitting.
7.1 Compare the strengths and weaknesses of k-NN, Decision Trees, and Neural Networks
for the Titanic dataset. (5)
7.2 Discuss how each model can be prone to overfitting. What techniques would you use to
address overfitting for each model? (5)

Good luck
