ICT583 Data Science Applications - Final Assignment - Individual - UPDATED!!! - Explanation
Please pay attention to the BLUE highlight!!
Instructions
1. Individual assignment
2. The assignment accounts for 35% of the whole unit.
3. Submit the assignment from ICT583 LMS site using the Assignment unit tool.
4. Late work may attract a penalty of 10% (of the mark for that piece of assessment) per day late, up
to and including 10 days late. Work submitted more than 10 days late might not be marked.
5. You must keep a copy of the final version of your assignment as submitted and be prepared to
provide it on request.
6. The University treats plagiarism, collusion, theft of other students’ work and other forms of
dishonesty in assessment seriously. Any instances of dishonesty in this assessment will be forwarded
immediately to the Faculty Dean. For guidelines on honesty in assessment including avoiding
plagiarism, see: http://our.murdoch.edu.au/Educationaltechnologies/Academic-integrity/
Assignment overview:
In recent years, advances in machine learning have opened the door to intelligent health care data
prediction and decision-making. A variety of machine learning algorithms can be used to learn
iteratively from data, uncover hidden patterns, and predict future events. Successful
applications such as individualized diagnosis and prognosis, hospital readmission prediction, and
personalized medicine can lead to improvements in medical practices and health care experiences. Your
final assignment is a health care data science project. The goal of this project is to follow the
data science analysis pipeline: answer interesting questions of your own choosing, acquire the data,
perform data manipulations, design your visualizations, build your predictive models, run statistical
analysis, and present the results in a report format.
What does the data science analysis pipeline look like (p. 26, Topic 6):
The dataset is given; you need to complete the rest of the steps - ask interesting questions, explore the
data, model the data, communicate and visualize the results.
Step 1: Get your dataset: You will use one health care dataset in this project, the Mammographic
Mass Data Set (retrieve it from http://archive.ics.uci.edu/ml/datasets/mammographic+mass).
* understand your dataset first
6 Attributes in total (1 goal field, 1 non-predictive, 4 predictive attributes)
1. BI-RADS assessment: 1 to 5 (ordinal, non-predictive!)
2. Age: patient's age in years (integer)
3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5
(nominal)
5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
6. Severity: benign=0 or malignant=1 (binominal, goal field!)
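A minimal sketch of reading the data into Python, using only the standard library. It assumes the downloaded file is comma-separated, has no header row, lists the six attributes in the order above, and marks missing values with "?". The column names and the sample rows are made up for illustration.

```python
import csv
import io

# Column order follows the attribute list above; the names are our own choice.
COLUMNS = ["bi_rads", "age", "shape", "margin", "density", "severity"]

# Made-up rows in the assumed file format (comma-separated, "?" = missing).
SAMPLE = """5,67,3,5,3,1
4,43,1,1,?,1
4,28,1,1,3,0
5,?,4,5,3,1
"""

def parse_rows(text):
    """Turn raw text into a list of dicts, mapping "?" to None and the rest to int."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:  # skip blank lines
            continue
        values = [None if f.strip() == "?" else int(f) for f in fields]
        rows.append(dict(zip(COLUMNS, values)))
    return rows

rows = parse_rows(SAMPLE)
print(len(rows), "rows loaded")
print(rows[1])
```

For the real file, the text passed to parse_rows would come from reading the downloaded file instead of the SAMPLE string.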
Step 2: You will raise TWO interesting questions on the dataset and prepare to answer them
in your following analysis via data manipulation, visualization or predictive modeling, etc. You can
refer to the examples in lecture 1, and exercise 1, where several good questions have been raised based
on the given datasets.
* refer to lecture 1, exercise 1
Can we predict the probability that a patient will have a malignant mammographic mass lesion
given the BI-RADS attributes and the age? (not recommended)
What is the age distribution of the benign and malignant groups?
Step 3: Data manipulation and cleaning: Inspect your dataset, pre-process the data if necessary, and
justify your choices.
* refer to lecture 3
Is there any missing data? How should it be handled? What strategies are available, and how do you
apply them to the dataset?
Is any feature selection needed? What about the non-predictive attribute?
Are there any outliers?
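The questions above can be sketched in code as follows. This is one possible approach, not the required one: it counts missing values, shows two common strategies (dropping incomplete rows, or imputing with the median or mode), drops the non-predictive attribute, and flags outliers with the 1.5 x IQR rule. The records and helper names are hypothetical.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"bi_rads": 5, "age": 67, "shape": 3, "severity": 1},
    {"bi_rads": 4, "age": 43, "shape": None, "severity": 1},
    {"bi_rads": 4, "age": None, "shape": 1, "severity": 0},
    {"bi_rads": 4, "age": 28, "shape": 1, "severity": 0},
    {"bi_rads": 5, "age": 58, "shape": 4, "severity": 1},
]

def count_missing(rows):
    """Number of missing values per column."""
    return {col: sum(r[col] is None for r in rows) for col in rows[0]}

def drop_incomplete(rows):
    """Strategy 1: listwise deletion, keep only fully observed rows."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute(rows, col, how):
    """Strategy 2: fill missing values of one column with its median or mode."""
    observed = [r[col] for r in rows if r[col] is not None]
    fill = statistics.median(observed) if how == "median" else statistics.mode(observed)
    return [{**r, col: fill if r[col] is None else r[col]} for r in rows]

def drop_column(rows, col):
    """Feature selection: remove one attribute (e.g. a non-predictive one)."""
    return [{k: v for k, v in r.items() if k != col} for r in rows]

def iqr_outliers(values):
    """Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [v for v in values if v < low or v > high]

missing = count_missing(records)
complete = drop_incomplete(records)
imputed = impute(impute(records, "age", "median"), "shape", "mode")
predictors_only = drop_column(imputed, "bi_rads")

print("missing per column:", missing)
print("rows kept after deletion:", len(complete))
print("possible age outliers:", iqr_outliers([40, 45, 50, 55, 60, 65, 70, 200]))
```

Whichever strategy you pick, the report should state how many rows or values were affected and why that choice is reasonable for this dataset.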
Step 4: Exploratory data analysis: perform initial investigations on the data using summary statistics
and visualizations.
Descriptive statistics:
Central tendency, variability
Visualizations:
Remember box-and-whisker plot (by groups)? Histogram? Cumulative distribution function?
What about the categorical variables? - Frequency tables, bar chart (stacked bar chart)
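As a rough illustration of the numbers behind those plots, the sketch below computes descriptive statistics of age by severity group, an empirical cumulative distribution function, and a frequency table for a categorical variable. The records are invented, and plotting itself is left to whichever library you use.

```python
import statistics
from collections import Counter

# Hypothetical cleaned records: (age in years, shape code, severity 0/1).
records = [
    (45, 1, 0), (52, 2, 0), (38, 1, 0), (60, 4, 0),
    (67, 4, 1), (71, 4, 1), (58, 3, 1), (76, 4, 1),
]

def describe(values):
    """Central tendency and variability for a numeric variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def ecdf(values):
    """Empirical CDF as (value, proportion of observations <= value) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

# Group ages by the goal field, the data behind a grouped box plot.
ages_by_severity = {}
for age, _shape, severity in records:
    ages_by_severity.setdefault(severity, []).append(age)

summary = {sev: describe(ages) for sev, ages in ages_by_severity.items()}
for sev in sorted(summary):
    print("severity", sev, summary[sev])

# Frequency table of shape by severity, the data behind a stacked bar chart.
shape_by_severity = Counter((shape, sev) for _age, shape, sev in records)
for (shape, sev), count in sorted(shape_by_severity.items()):
    print(f"shape={shape} severity={sev}: {count}")
```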
Step 5: You will select at least TWO machine learning methods and apply them to the dataset
for predictive modeling. The performances of different models should be evaluated.
select at least TWO machine learning methods: you already know logistic regression, which is a
possible option for this binary (0 = benign, 1 = malignant) classification task; other machine
learning methods (neural networks, support vector machines, k-nearest neighbours, etc.) will be
introduced in the following two lectures.
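To make the logistic regression option concrete, here is a minimal from-scratch sketch with a single predictor (age), fitted by batch gradient descent. It is for illustration only: the data, learning rate and number of epochs are arbitrary assumptions, and in practice you would normally use a library implementation rather than this loop.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical training data: patient age and severity (0 = benign, 1 = malignant).
ages = [35, 42, 48, 50, 55, 61, 66, 72]
severity = [0, 0, 0, 1, 0, 1, 1, 1]

# Standardise age so that gradient descent behaves well.
mean_age = sum(ages) / len(ages)
sd_age = (sum((a - mean_age) ** 2 for a in ages) / len(ages)) ** 0.5
xs = [(a - mean_age) / sd_age for a in ages]

w, b = fit_logistic(xs, severity)

def predict_proba(age):
    """Estimated probability of malignancy for a given age."""
    return sigmoid(w * (age - mean_age) / sd_age + b)

for age in (30, 55, 75):
    print(age, round(predict_proba(age), 3))
```

A predicted probability is turned into a class label by thresholding, commonly at 0.5.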
apply them to the dataset for predictive modeling:
Dataset partitioning - a training subset (to build your predictive model) and a testing subset (to
evaluate the performance of the constructed model)
training subset: 70% or 80%, to build the prediction model
testing subset: 30% or 20%, to validate your constructed prediction model
For example, if you only have 10 samples in this dataset as shown in the table below, you randomly
select 20% of the dataset to be the testing subset (orange), and the remaining to be the training subset.
[Table: 10 samples, with the testing subset highlighted in orange]
Repeat this random partitioning process 10 times so that you can calculate the mean expected
performance.
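The repeated random partitioning described above might look like the sketch below for the 10-sample example: each run shuffles the sample indices and holds out 20% for testing. The function name and the seed are arbitrary choices for illustration.

```python
import random

def train_test_split_indices(n_samples, test_fraction, rng):
    """Randomly split the indices 0..n_samples-1 into (train, test) lists."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

rng = random.Random(583)  # fixed seed (arbitrary) so the splits are reproducible

# Ten independent random partitions of a 10-sample dataset, 20% held out each time.
splits = [train_test_split_indices(10, 0.2, rng) for _ in range(10)]

for run, (train, test) in enumerate(splits, start=1):
    print(f"run {run:2d}: test={test} train={train}")
```

In each run, the model is fitted on the training indices only and scored on the testing indices, which gives 10 performance values per model.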
The performances of different models should be evaluated: What performance metric do you choose to
compare the results?
Accuracy: mean accuracy ± SD
.........
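A short sketch of how accuracy and its mean ± SD over the repeated runs could be computed. The ten accuracy values are invented; they stand in for the results of one model over 10 random partitions.

```python
import statistics

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Accuracy on one (hypothetical) testing subset.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))

# Hypothetical accuracies of one model over 10 random train/test partitions.
run_accuracies = [0.80, 0.82, 0.78, 0.81, 0.79, 0.83, 0.80, 0.77, 0.82, 0.78]

mean_acc = statistics.mean(run_accuracies)
sd_acc = statistics.stdev(run_accuracies)  # sample SD (n - 1 in the denominator)
print(f"accuracy = {mean_acc:.3f} ± {sd_acc:.3f}")
```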
If the mean expected performance of two models differs, how do you know that the difference is
statistically significant? (statistical test)
T-test
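One way to apply a t-test here, sketched under assumptions: if both models are evaluated on the same 10 random partitions, the accuracies are matched pairs, so a paired t-test is a natural choice. The code below computes the t statistic by hand and compares it with the two-sided critical value for 9 degrees of freedom at the 5% level (2.262). The accuracy values are invented. Because the random partitions overlap, the result should be read as approximate.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom of a paired t-test on matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the differences (must be non-zero)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Hypothetical accuracies of two models on the same 10 random partitions.
model_a = [0.80, 0.82, 0.78, 0.81, 0.79, 0.83, 0.80, 0.77, 0.82, 0.78]
model_b = [0.76, 0.79, 0.77, 0.78, 0.74, 0.80, 0.78, 0.75, 0.77, 0.76]

t_stat, df = paired_t_statistic(model_a, model_b)

# Two-sided critical value of Student's t for df = 9 at alpha = 0.05.
T_CRITICAL_DF9 = 2.262
significant = abs(t_stat) > T_CRITICAL_DF9

print(f"t = {t_stat:.3f}, df = {df}, significant at 5%: {significant}")
```

If SciPy is available, scipy.stats.ttest_rel performs the same paired test and also returns a p-value.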