ICT583 Data Science Applications - Final Assignment - Individual - UPDATED!!! - Explanation

This assignment requires students to complete a health care data science project using the Mammographic Mass Data Set. Students must ask two interesting questions about the data, clean and explore the data, build predictive models using at least three machine learning methods, analyze the results, and document their findings in an R code file and report. The report should include an overview, data description, data cleaning steps, exploratory analysis, predictive modeling details and results, final analysis, and conclusion.

Uploaded by

Hammadiqbal12

This assignment specification was updated on 5th May.

Please pay attention to the BLUE highlight!!

ICT 583 Data Science Applications


School of Engineering and Information Technology
Murdoch University
Semester 1, 2020

Assignment: Data Science Project


Due date: 5th June 11:55 pm

Unit Coordinator: Dr Guanjin Wang

Instructions
1. Individual assignment
2. The assignment accounts for 35% of the whole unit.
3. Submit the assignment from ICT583 LMS site using the Assignment unit tool.
4. Late work may attract a penalty of 10% (of the mark for that piece of assessment) per day late, up
to and including 10 days late. Work submitted more than 10 days late might not be marked.
5. You must keep a copy of the final version of your assignment as submitted and be prepared to
provide it on request.
6. The University treats plagiarism, collusion, theft of other students’ work and other forms of
dishonesty in assessment seriously. Any instances of dishonesty in this assessment will be forwarded
immediately to the Faculty Dean. For guidelines on honesty in assessment including avoiding
plagiarism, see: http://our.murdoch.edu.au/Educationaltechnologies/Academic-integrity/

Assignment overview:
In recent years, advances in machine learning have opened the door to intelligent health care data
prediction and decision-making. A variety of machine learning algorithms can be used to learn
iteratively from data, uncover hidden patterns, and predict future events. Successful
applications such as individualized diagnosis and prognosis, hospital readmission prediction, and
personalized medicine can lead to improvements in medical practice and health care experiences. Your
final assignment is a health care data science project. The goal of this project is to follow the
data science analysis pipeline: ask interesting questions of your own choosing, acquire the data,
perform data manipulations, design your visualizations, build your predictive models, run statistical
analysis, and present the results in a report format.
What does the data science analysis pipeline look like (p. 26, Topic 6)?

The dataset is given; you need to complete the rest of the steps - ask interesting questions, explore the
data, model the data, communicate and visualize the results.
Step 1: Get your dataset: You will use one health care dataset in this project, the Mammographic
Mass Data Set (retrieve it from http://archive.ics.uci.edu/ml/datasets/mammographic+mass).
* understand your dataset first
6 Attributes in total (1 goal field, 1 non-predictive, 4 predictive attributes)
1. BI-RADS assessment: 1 to 5 (ordinal, non-predictive!)
2. Age: patient's age in years (integer)
3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
6. Severity: benign=0 or malignant=1 (binominal, goal field!)
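
A minimal sketch of reading the six attributes listed above, assuming the downloaded file is comma-separated, has no header row, keeps the columns in the listed order, and marks missing values with `?` (check all of this against the actual download). The submission itself must be in R; Python is used here only to illustrate the logic. The column names, the `parse_rows` helper, and the sample rows are hypothetical.

```python
import csv
import io

# Column order assumed to follow the attribute list above (hypothetical names).
COLUMNS = ["bi_rads", "age", "shape", "margin", "density", "severity"]

def parse_rows(text):
    """Parse comma-separated rows into dicts; a '?' field becomes None (missing)."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        values = [None if f.strip() == "?" else int(f) for f in fields]
        rows.append(dict(zip(COLUMNS, values)))
    return rows

# Made-up rows for illustration only; they are not taken from the real dataset.
sample = "4,50,2,1,3,0\n5,70,4,5,?,1\n3,?,1,1,3,0\n"
rows = parse_rows(sample)

# Count missing values per column as a first look at data quality.
missing_per_column = {c: sum(r[c] is None for r in rows) for c in COLUMNS}
print(missing_per_column)
```

Counting missing values per column at load time gives a direct starting point for the cleaning decisions in Step 3.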

Step 2: You will raise TWO interesting questions on the dataset and prepare to answer them
in your following analysis via data manipulation, visualization, predictive modeling, etc. You can
refer to the examples in lecture 1 and exercise 1, where several good questions have been raised based
on the given datasets.
* refer to lecture 1, exercise 1
- Can we predict the probability that a patient will have a malignant mammographic mass lesion
given the BI-RADS attributes and the age? (not recommended)
- What is the age distribution of the benign and malignant targets?

Think of your own questions!

Step 3: Data manipulation and cleaning: Observe your dataset, pre-process the data if necessary, and
justify your choices.
*refer to lecture 3
Is there any missing data? How should you deal with it? What strategies are available? How do you apply
them to the dataset?
Is any feature selection needed? Is there a non-predictive attribute?
Are there any outliers?
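
Two common strategies for missing data, sketched under assumptions: either drop incomplete rows, or impute (here the median for age and the most frequent value for the coded attributes, which is one possible choice rather than the required one). The sketch also drops the non-predictive BI-RADS column. Python is used only for illustration since the submission must be in R; the column names, helper names, and rows are hypothetical.

```python
from statistics import median, mode

def drop_incomplete(rows):
    """Strategy 1: keep only rows with no missing (None) values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute(rows, numeric=("age",), coded=("shape", "margin", "density")):
    """Strategy 2: fill numeric columns with the median, coded columns with the mode."""
    filled = [dict(r) for r in rows]  # copy so the input is left untouched
    for col in numeric + coded:
        observed = [r[col] for r in rows if r[col] is not None]
        fill = median(observed) if col in numeric else mode(observed)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled

# Made-up rows for illustration only.
rows = [
    {"bi_rads": 4, "age": 50, "shape": 2, "margin": 1, "density": 3, "severity": 0},
    {"bi_rads": 5, "age": 70, "shape": 4, "margin": 5, "density": None, "severity": 1},
    {"bi_rads": 3, "age": None, "shape": 1, "margin": 1, "density": 3, "severity": 0},
    {"bi_rads": 4, "age": 60, "shape": 4, "margin": 1, "density": 2, "severity": 1},
]

complete = drop_incomplete(rows)
filled = impute(rows)

# Feature selection: remove the non-predictive BI-RADS attribute.
predictors = [{k: v for k, v in r.items() if k != "bi_rads"} for r in filled]
```

Dropping rows is simpler but shrinks the dataset; imputation keeps every sample at the cost of introducing estimated values. Whichever is chosen should be justified in the report.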

Step 4: Exploratory data analysis: perform initial investigations on the data using summary statistics and
visualizations.
Descriptive statistics:
Central tendency, variability
Visualizations:
Remember the box-and-whisker plot (by groups)? Histogram? Cumulative distribution function?

What about the categorical variables? - Frequency tables, bar charts (stacked bar charts)
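
The numbers behind a grouped box plot and a stacked bar chart can be sketched as follows: central tendency and variability of age per severity group, and a frequency table of shape by severity. This is an illustrative Python sketch (the submission must be in R); the helper names and the six rows are hypothetical.

```python
from collections import Counter
from statistics import mean, median, stdev

def describe_by_group(rows, value="age", group="severity"):
    """Central tendency and variability of `value` within each `group` level."""
    groups = {}
    for r in rows:
        groups.setdefault(r[group], []).append(r[value])
    return {
        g: {"n": len(v), "mean": mean(v), "median": median(v), "sd": stdev(v)}
        for g, v in groups.items()
    }

def cross_tab(rows, row_var="shape", col_var="severity"):
    """Frequency table for two categorical variables, keyed by (row, column) level."""
    return Counter((r[row_var], r[col_var]) for r in rows)

# Made-up rows for illustration only.
rows = [
    {"age": 40, "shape": 1, "severity": 0},
    {"age": 50, "shape": 2, "severity": 0},
    {"age": 60, "shape": 1, "severity": 0},
    {"age": 60, "shape": 4, "severity": 1},
    {"age": 70, "shape": 4, "severity": 1},
    {"age": 80, "shape": 3, "severity": 1},
]

age_summary = describe_by_group(rows)
shape_by_severity = cross_tab(rows)
```

The per-group summary answers questions such as the age distribution of benign versus malignant cases, and the cross-tabulation holds the counts a stacked bar chart would display.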

Step 5: You will select at least THREE machine learning methods and apply them to the dataset
for predictive modeling. The performance of the different models should be evaluated.

Select at least THREE machine learning methods: you already know logistic regression, which could
be a possible option for this binary (0 - benign, 1 - malignant) classification task; the other machine
learning methods (neural networks, support vector machines, k-nearest neighbours, etc.) will be
introduced in the following two lectures.
Apply them to the dataset for predictive modeling:
Dataset partitioning - split the data into a training subset (to build your predictive model) and a
testing subset (to evaluate the performance of the constructed predictive model)
Training subset: 70% or 80%, to build the prediction model
Testing subset: 30% or 20%, to validate your constructed prediction model
For example, if you only have 10 samples in this dataset as shown in the table below, you randomly
select 20% of the dataset to be the testing subset (orange), and the remainder to be the training subset.

[Table: 10 example samples with columns Number, Patient age, Shape, Margin, Density, Outcome; the testing rows are highlighted in orange in the original.]

Repeat this random partitioning process 10 times so that you can calculate the mean expected
performance.
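
The repeated random partitioning can be sketched as below, assuming an 80/20 split and 10 repeats. A majority-class baseline stands in for a real model purely to keep the example short; it is not one of the methods the assignment asks for. Python is used only for illustration (the submission must be in R), and every function name and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_split(rows, test_fraction=0.2, rng=random):
    """Randomly partition rows into (train, test) with no overlap."""
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

def fit_majority(train):
    """Placeholder 'model': remember the most common class in the training subset."""
    return Counter(r["severity"] for r in train).most_common(1)[0][0]

def predict_majority(model, row):
    """Placeholder prediction: always return the remembered class."""
    return model

def repeated_accuracy(rows, fit, predict, repeats=10, seed=0):
    """Accuracy on the testing subset for each of `repeats` random partitions."""
    rng = random.Random(seed)  # fixed seed so the partitions are reproducible
    scores = []
    for _ in range(repeats):
        train, test = random_split(rows, 0.2, rng)
        model = fit(train)
        correct = sum(predict(model, r) == r["severity"] for r in test)
        scores.append(correct / len(test))
    return scores

# Ten made-up samples: six benign (0) and four malignant (1).
toy = [{"id": i, "severity": int(i >= 6)} for i in range(10)]
scores = repeated_accuracy(toy, fit_majority, predict_majority)
```

Fixing the seed makes the 10 partitions reproducible, so the code regenerates the same figures that appear in the report.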

The performance of the different models should be evaluated: Which performance metric do you choose to
compare the results?
Mean accuracy ± SD
.........
If the mean expected performance of two models differs, how do you know
that the difference is statistically significant? (statistical test)
t-test
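
A minimal sketch of reporting mean accuracy ± SD and testing the difference between two models. It assumes both models were scored on the same 10 random partitions, so a paired t-test is used; with independent partitions an unpaired test would be the appropriate variant. The accuracy values are invented for illustration. Python is used only to show the arithmetic (the submission must be in R), and the helper name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i], with len(a) - 1 degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Invented test accuracies over the same 10 partitions (illustration only).
model_a = [0.80, 0.82, 0.78, 0.85, 0.81, 0.79, 0.83, 0.80, 0.84, 0.82]
model_b = [0.76, 0.80, 0.77, 0.80, 0.79, 0.78, 0.80, 0.77, 0.81, 0.79]

for name, scores in (("model A", model_a), ("model B", model_b)):
    print(f"{name}: {mean(scores):.3f} ± {stdev(scores):.3f}")

t = paired_t_statistic(model_a, model_b)

# Two-sided critical value of the t distribution for 9 degrees of freedom at alpha = 0.05.
T_CRITICAL_DF9 = 2.262
significant = abs(t) > T_CRITICAL_DF9
print(f"t = {t:.2f}, significant at the 5% level: {significant}")
```

Comparing against a tabulated critical value avoids needing a t-distribution routine; a statistics package would report the p-value directly.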

Step 6: Analyze the results


Step 7: Document all your findings
What you need to submit:
R file
An essential part of your project is your R coding. Your R file should record the steps in developing
your solutions and obtaining the final data analysis results. Make sure your code matches the findings
you put in the report. For example, if there are three separate plots in the report, your code should
produce exactly the same three separate plots.
Report
You also need to submit an in-depth report. The following components and discussions might be
considered in your report:
Overview of the project: Provide an overview of the project, the goals, and the motivation for it.
Consider that this will be read by people who first see your project.
Dataset: Describe the background of the dataset and provide the summary statistics.
Interesting questions: What questions are you trying to answer? Do any questions evolve throughout the project?
Are there any new questions you consider in the course of your analysis? ...
Data manipulation and cleaning: Are there any data pre-processing steps performed, and why? Are there
any questions that can be answered via data manipulation? ...
Exploratory data analysis: What visualizations did you use to look at your data in different ways? Are
there any detected outliers? ...
Predictive modeling: What are the various machine learning methods you considered? Justify the
decisions you made. What are the main ideas of the selected methods? How do you build the models?
Are there any concerns when designing your model? ...
Final analysis: What did you learn about the data? Which method statistically outperformed the rest?
Have you found the answers to the raised questions? How can you justify your answers? ... Present
your results engagingly using text and visualizations.
Conclusion: Are there any limitations of your study? What is your future work?
