Machine learning
1. INTRODUCTION
An accurate estimate of the uncertainty in machine learning predictions is
paramount to building reliable models. It makes it possible to take better-informed
decisions, to identify outliers, to detect anomalies in the data, and to
interpret the results more easily. A major challenge for the deployment of
these systems in critical applications (such as medical diagnostics, self-driving
vehicles, etc.) is to identify when, and to what extent, the system may get a
prediction wrong. After all, an evaluation of uncertainty is built into our own
behaviour: a human driver, for example, will slow down when faced with a
significant amount of uncertainty.
When we turn to regression problems, in certain families of models the task
can be easily accomplished through theoretical results. The most obvious
example in this respect is linear regression. In general, however, models
are much more complicated, and the complex interactions within the
algorithms make it almost hopeless, even in regression problems, to derive a
reasonable theoretical analysis. In classification problems, the output of
an algorithm is often a probability distribution over all possible classes, assessing
the likelihood of each class. The problem appears to be completely solved, but
in fact the uncertainty has merely been shifted onto the values of the output
probabilities themselves.
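As an illustration of the last point, the sketch below turns a vector of class probabilities into two common summary scores: the (normalised) Shannon entropy and the margin between the two most likely classes. The function names and the example vectors are assumptions made for illustration only. Both scores only summarise the probabilities reported by the classifier, so they are exactly as trustworthy as those probabilities are.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def normalized_entropy(probs):
    """Entropy rescaled to [0, 1] by dividing by log(number of classes)."""
    return predictive_entropy(probs) / math.log(len(probs))

def margin(probs):
    """Gap between the two largest probabilities; a small gap means an ambiguous prediction."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

# Hypothetical outputs of a 4-class classifier (illustrative values only).
confident = [0.97, 0.01, 0.01, 0.01]
ambiguous = [0.40, 0.35, 0.15, 0.10]

for name, p in [("confident", confident), ("ambiguous", ambiguous)]:
    print(name, round(normalized_entropy(p), 3), round(margin(p), 3))
```

A high entropy or a small margin flags a prediction as ambiguous, but neither score says whether the probabilities themselves are reliable.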
2. GOALS
Your task is to explore possible ways of producing a measure or an estimate
of uncertainty in classification predictions. You are not restricted to using
a single classification method, and you can, in principle, develop different
assessments of uncertainty for different algorithms. When developing your
method(s), try to consider:
- the computational complexity of your method;
- the precision in different regions of the feature space;
- that the results should be as informative as possible.
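One possible approach, given here only as a minimal sketch and not as the expected solution, is to refit a classifier on bootstrap resamples of the training set and to look at how the resulting models vote at a given point. The nearest-centroid classifier, the function names and the synthetic two-class data are all assumptions chosen to keep the example self-contained. The sketch touches on the considerations listed above: the cost grows linearly with the number of resamples, and the vote split depends on where the query point lies in the feature space.

```python
import random
from collections import Counter

def fit_centroids(X, y):
    """Nearest-centroid 'training': mean feature vector per class."""
    sums, counts = {}, Counter(y)
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def predict(centroids, x):
    """Label of the closest centroid (squared Euclidean distance)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def bootstrap_votes(X, y, x_new, n_boot=200, seed=0):
    """Fraction of bootstrap-trained models voting for each class at x_new."""
    rng = random.Random(seed)
    n = len(X)
    votes = Counter()
    for _ in range(n_boot):
        # Resample the training set with replacement and refit the model.
        idx = [rng.randrange(n) for _ in range(n)]
        centroids = fit_centroids([X[i] for i in idx], [y[i] for i in idx])
        votes[predict(centroids, x_new)] += 1
    return {c: votes[c] / n_boot for c in sorted(votes)}

# Synthetic data (assumption): two Gaussian classes centred at (0, 0) and (3, 3).
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)] + \
    [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(30)]
y = [0] * 30 + [1] * 30

print(bootstrap_votes(X, y, [0.0, 0.0]))  # query deep inside class 0
print(bootstrap_votes(X, y, [1.5, 1.5]))  # query close to the class boundary
```

Vote fractions far from 0 or 1 indicate that the prediction is sensitive to the particular training sample, which is one way of making uncertainty visible region by region.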
3. DATA
The data are of medical interest, since this is the kind of setting
in which it is most important to have a clear and reliable assessment of
the uncertainty in the prediction.
FINAL TASK FOR ANALISI DEI DATI
Date: June 15, 2023.
The first dataset is purely tabular, and it has been taken from:
https://www.kaggle.com/datasets/fedesoriano/cirrhosis-prediction-dataset
It contains data about 418 patients with biliary cirrhosis of the liver. The
preliminary goal is to analyse extensively methods that provide accurate
predictions of the histologic stage of the disease. The main step is then to develop, for
one or more of the methods analysed, an assessment of the uncertainty of the
prediction.
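One way to check whether predicted probabilities can be trusted as a measure of uncertainty is to compare confidence with observed accuracy. The sketch below computes the expected calibration error (ECE) with equal-width confidence bins. It is a generic illustration, not a required method: the function name, the number of bins and the toy inputs are assumptions, and `confidences` stands for the probability assigned to the predicted class on held-out data.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        k = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[k].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            ece += len(b) / n * abs(avg_conf - acc)
    return ece

# Toy inputs (assumption): ten predictions, all made with confidence 0.9.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(well_calibrated, overconfident)
```

A value close to zero means that, within each bin, the stated confidence matches the fraction of correct predictions; a large value means the probabilities overstate or understate the actual reliability.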
A second dataset contains images, and as such is more computationally
demanding. It can be found here:
https://www.kaggle.com/datasets/pranavraikokte/covid19-image-dataset
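Because images are more expensive to process, reducing their resolution is one way of keeping the computation manageable. The sketch below shows the idea on a grayscale image stored as a list of rows, by averaging non-overlapping blocks of pixels. The function name and the tiny example image are assumptions; in practice an image-processing library would normally be used for this step.

```python
def downscale(image, factor):
    """Shrink a grayscale image (list of rows) by averaging factor x factor blocks.

    Trailing rows or columns that do not fill a whole block are dropped.
    """
    h = len(image) // factor
    w = len(image[0]) // factor
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            block = [image[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

# Tiny 4x4 example image (illustrative values only).
image = [
    [0, 0, 4, 4],
    [0, 0, 4, 4],
    [8, 8, 2, 2],
    [8, 8, 2, 2],
]
small = downscale(image, 2)
print(small)
```

Downscaling by a factor of 2 reduces the number of pixels by a factor of 4, at the price of losing fine detail that may matter for the prediction.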
4. SUBMISSION
The results of your analysis and your ideas must be summarised in a report
that explains how you planned to tackle the problem and which
strategies you tried in order to solve it. The emphasis is not on the
performance of the final method(s) proposed, but on the way you have dealt
with the problem.
You are not only allowed but actually encouraged to read up on the subject.
For completeness and fairness, you are required to cite all sources of research
material you have used (books, scientific papers, etc.).
This final assignment is a personal piece of work and must not be done in
groups. Discussions with colleagues or experts, although discouraged, should
be reported for fairness.
Your report can be uploaded to the e-learning website. The deadline is
August 5, 2023.
At the end of your report, you should add the link to a script (R or Python)
containing the implementation of the final method(s) proposed, based on the
analysis developed. The script must be shared via a notebook on Google Colab
and must run without errors.
Footnote 1 (image dataset): if needed, you may consider downscaling the pictures.
It is not necessary (and in fact not useful) for the script to contain the entire
analysis. It is recommended that the output of your script be a
detailed account of your conclusions. Numbers without any explanation
of their meaning are not really helpful.