Coursework Assessment MFKhan v1.4

The document outlines the coursework requirements for the Applications of Data Science module, focusing on developing a data science solution for a complex dataset. Students must complete a mid-term presentation and a final report, analyzing a selected real-world problem such as diabetic retinopathy, signature forgery detection, or wildfire detection. Key tasks include dataset refinement, feature engineering, statistical analysis, and the application of machine learning algorithms, with strict guidelines against using neural networks.

School of Arts, Humanities and Social Science

Module title and code: Applications of Data Science (CMP020L014)

Title of coursework: Coursework (Portfolio)

Learning outcomes:
LO1: Demonstrate a comprehensive understanding of current developments in data science.
LO2: Systematically and critically analyse and evaluate diverse sources of data to solve a problem.
LO3: Propose and develop a data science solution for a complex dataset.

Assessment weighting: Mid-term presentation report: 25%, Final report: 75%

Maximum mark: 100

Submission details (e.g. submission link):
Part 1: Submit the mid-term presentation report in PDF format:
https://moodle.roehampton.ac.uk/mod/assign/view.php?id=2097536
Part 2: Submit the final report in PDF format and MATLAB code in .m/.mlx format:
https://moodle.roehampton.ac.uk/mod/assign/view.php?id=2097537

Word limit (if applicable):
2500-word limit for a mid-term presentation report.
4000-word limit for a final report.

Dataset:
Diabetic retinopathy: https://zenodo.org/records/4647952#.YGNjXVUzbIU
Wet signature: https://www.kaggle.com/datasets/akashgundu/signature-verification-dataset
Wildfire: https://datasets.omdena.com/dataset/onfire-dataset

Deadline:
Mid-term presentation report: 30/03/2025
Final report: 14/04/2025

Feedback and marks:
For feedback, refer to the rubric at the end of this coursework.
Marks will be released within 4 weeks of the final submission deadline.

Assessment setter’s name: Dr Mohammad F Khan

1) ASSESSMENT OVERVIEW:

This assignment evaluates your skills in proposing and developing a data science solution for a complex
dataset and presenting your findings in a report. You are required to complete a set of tasks by selecting
one problem from the options in Task 1 (or choosing a similar problem as per Appendix 2), applying
techniques such as feature engineering, visualisation, and statistical analysis to enhance machine
learning algorithms, and comprehensively analysing and reporting your results.

Important: Do not use neural networks or deep learning in this coursework. If you choose a research
article that uses neural networks, replace that section with an alternative explainable AI algorithm.

2) ACADEMIC MISCONDUCT:

“Academic integrity and honesty are fundamental to the academic work you produce at the University
of Roehampton. You are expected to complete coursework which is your own and which is referenced
appropriately. The university has in place measures to detect academic dishonesty in all its forms. If
you are found to be cheating or attempting to gain an unfair advantage over other students in any way,
this is considered academic misconduct, and you will be penalised accordingly.” Further details about
“Student Code of Conduct” and “Disciplinary Regulations” can be found at:
https://www.roehampton.ac.uk/corporate-information/policies/

3) TASKS:

• Task 1: Select one of the following real-world problems to investigate and solve. Alternatively,
you may identify a similar problem in a different domain (see Appendix 2).

➢ Problem 1: Classification of diabetic retinopathy


Diabetic retinopathy is a leading cause of vision impairment and blindness among diabetic
patients. Early detection through retinal image analysis is crucial for timely intervention.
However, manual diagnosis by ophthalmologists is time-consuming, subjective, and prone to
variability. Automated classification of diabetic retinopathy using machine learning can
improve accuracy and efficiency, but challenges such as imbalanced datasets, feature
extraction, and differentiation between severity levels persist. Developing a robust
classification model that effectively distinguishes diabetic retinopathy stages from retinal
fundus images is essential for enhancing diagnostic reliability and patient outcomes.

Resources:
o Research article: https://www.mdpi.com/2075-4418/12/9/2262
o Dataset: https://zenodo.org/records/4647952#.YGNjXVUzbIU

➢ Problem 2: Detecting wet signature forgery through images


Detecting forgery in wet signatures through images is essential to safeguard the authenticity of
legal and financial documents. Wet signatures are widely used for verifying identity and
consent; forgery can lead to fraud, data breaches, and unauthorised transactions. Image-based
detection allows for detailed analysis of signature characteristics, such as stroke patterns,
pressure points, and ink flow, which are difficult to replicate perfectly. Advanced techniques
like AI and machine learning enhance accuracy, identifying subtle inconsistencies that might
escape human observation. This ensures the integrity of documents, protects individuals and
organisations from potential fraud, and upholds trust in sensitive transactions.

Resources:

o Research article: https://www.mdpi.com/2076-3417/10/11/3716
o Dataset: https://www.kaggle.com/datasets/akashgundu/signature-verification-dataset

➢ Problem 3: Detecting wildfire using surveillance data


Wildfires pose a significant threat to ecosystems, human lives, and infrastructure. Early
detection is crucial for effective mitigation, but traditional methods relying on satellite imagery
and ground-based sensors often suffer from delays and limited coverage. Surveillance video
data offers a real-time alternative for wildfire detection. Still, challenges such as varying
lighting conditions, smoke interference, and false alarms due to similar visual patterns make
accurate identification difficult. Hence it is necessary to develop an advanced machine learning-
based model to detect wildfires in surveillance videos, enhancing early warning systems and
enabling rapid response to minimise environmental and economic damage.

Resources:
o Research article: https://ieeexplore.ieee.org/abstract/document/9690875
o Dataset: https://datasets.omdena.com/dataset/onfire-dataset

Note: Download the research articles above on a university PC, or refer to Appendix 1 to download
them at home by using a university library login. If required, you may instead choose a similar
problem from a different application domain with a different dataset, provided it follows
Tasks 2-7 given below. Any new problem must be agreed after a detailed discussion with the module
tutor. For more information, refer to the guidelines given in Appendix 2 for deciding on a new
problem.

• Task 2: Refine the dataset by choosing at least 100 images or one video sample (a 3-second video
recorded at ≥30 fps) for each class.

Helpful tip: Choosing the image/video dataset for binary classification involves several key
considerations. First, ensure the dataset has balanced classes to avoid bias in model training. The
images should be relevant to the classification task and of sufficient quality, with clear features
distinguishing the two categories. Consider the size of the dataset; larger datasets provide more
training examples, improving model generalisation. Additionally, check for labelled data to
facilitate supervised learning. Dataset diversity, including variations in lighting, angles, and
backgrounds, is crucial for building a robust model.

• Task 3: Formulate a mathematical concept for feature engineering for the image/video data.

Helpful tip: Formulating a mathematical concept for feature engineering involves designing and
defining transformations that extract meaningful patterns from the data. To reveal patterns that
are not visible in the spatial domain, you could start by representing images as matrices of pixel
intensities. From there, you could apply pre-processing techniques such as mathematical filters to
detect edge- and texture-based features. You could reduce the dimensionality of the data through
principal component analysis (PCA), projecting it into a lower-dimensional space while retaining
its key characteristics. You could also apply advanced feature engineering methods such as the
histogram of oriented gradients (HOG), the Fourier transform, or the wavelet transform to analyse
orientation and frequency components. The goal of this task is to encode complex information into
compact, informative features that can enhance model performance and enable accurate
classification.
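
As a rough illustration of the filter-based idea in the tip above, the sketch below treats an image as a matrix of pixel intensities, applies 3x3 Sobel filters, and summarises the gradient magnitude as a two-value feature vector. It is written in Python purely to show the logic (the coursework itself requires MATLAB), and the function names, the toy image, and the choice of mean and maximum edge strength as features are assumptions made for this example, not part of the brief.

```python
import math

# 3x3 Sobel kernels: horizontal and vertical intensity derivatives.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(image, kernel):
    """Slide a 3x3 kernel over the interior pixels (no padding)."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * (cols - 2) for _ in range(rows - 2)]
    for r in range(rows - 2):
        for c in range(cols - 2):
            out[r][c] = sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(3)
                for j in range(3)
            )
    return out


def gradient_magnitude(image):
    """Combine both Sobel responses into a per-pixel edge strength."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return [
        [math.hypot(a, b) for a, b in zip(row_x, row_y)]
        for row_x, row_y in zip(gx, gy)
    ]


def edge_features(image):
    """Compact feature vector: mean and maximum edge strength."""
    mag = [v for row in gradient_magnitude(image) for v in row]
    return [sum(mag) / len(mag), max(mag)]


# Hypothetical 4x4 greyscale image with a vertical edge in the middle.
toy_image = [[0, 0, 10, 10] for _ in range(4)]
features = edge_features(toy_image)
```

Writing the filtering loop by hand, rather than calling a library routine, mirrors the Task 7 requirement to avoid built-in functions; the same structure carries over to other 3x3 kernels.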

• Task 4: Apply descriptive and inferential statistical tools to analyse the data and test the model
performance.

Helpful tip: Descriptive statistics summarise image data by calculating metrics like mean, median,
kurtosis, skewness, standard deviation, and pixel intensity distributions to understand patterns and
variability. They can help in identifying data imbalances or anomalies before model training. For
inferential statistics, techniques like hypothesis testing and confidence intervals evaluate
relationships in image features, such as comparing pixel intensity distributions across classes. In
machine learning classification, these statistics assess feature relevance, help refine pre-processing
steps, and validate model assumptions. Inferential statistics also help test the significance of model
performance metrics, ensuring robust conclusions about the classifier’s effectiveness on unseen
image data. Together, they may enhance the data-driven decision-making process.
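
The descriptive and inferential steps in the tip above can be sketched as follows. This is a minimal Python illustration of the arithmetic only (the coursework itself requires MATLAB): `describe` computes the mean, sample standard deviation, skewness, and kurtosis of a list of values, and `welch_t` computes Welch's t statistic for comparing a feature across two classes. The function names and the sample values are hypothetical, and Welch's test is just one possible choice of hypothesis test; turning the statistic into a p-value additionally needs the t distribution, which is not covered here.

```python
import math


def describe(values):
    """Descriptive statistics from central moments (population moments
    for skewness and kurtosis, n - 1 divisor for the standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "std": math.sqrt(m2 * n / (n - 1)),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,  # equals 3 for a normal distribution
    }


def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)


# Hypothetical values of one feature (e.g. mean pixel intensity) per class.
class_a = [1, 2, 3, 4, 5]
class_b = [3, 4, 5, 6, 7]

summary_a = describe(class_a)
t_stat = welch_t(class_a, class_b)
```

A large absolute t statistic suggests that the feature separates the two classes, which is one way to judge feature relevance before training a classifier.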

• Task 5: Create appropriate visualisations, e.g. 2D/3D plots of the complex data and simulation
results.

Helpful tip: Use colour, size, transparency, and marker shape to encode additional variables
without overcrowding the plot. Where features have very different ranges, apply a suitable scaling
to each one and show each scale within the same plot (for example, on a labelled axis or colour
bar) so that the data remain clear and interpretable.

• Task 6: Incorporate machine learning algorithms (excluding neural networks) in your final
solution.

Helpful tip: To apply machine learning algorithms to features extracted from image data for binary
classification, start with the data obtained from the feature engineering step and ensure it is
normalised for optimal machine learning performance. Feed these features into the machine learning
model. Use hyperparameter optimisation to help the model handle data that are not linearly
separable. Evaluate the model using metrics such as accuracy, sensitivity (recall), specificity,
F1-score, and AUC to ensure robust classification performance.
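
The normalisation and evaluation steps in the tip above can be sketched as follows. This is a minimal Python illustration of the arithmetic only (the coursework itself requires MATLAB): min-max scaling is assumed as the normalisation method, labels are assumed to be 1 for the positive class and 0 for the negative class, and the function names and example labels are hypothetical. AUC is omitted because it needs classifier scores rather than hard predictions.

```python
def min_max_normalise(column):
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for labels coded as 1 (positive) / 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # also called recall
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (
        2 * precision * sensitivity / (precision + sensitivity)
        if precision + sensitivity
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical ground-truth labels and classifier predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

scaled = min_max_normalise([2, 4, 6])
metrics = binary_metrics(y_true, y_pred)
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are imbalanced, since accuracy alone can look high for a classifier that rarely predicts the minority class.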

• Task 7: Develop clear, well-commented MATLAB code without relying on built-in functions. The
code must be executable on a university machine that can load the cloud dataset and produce the
reported output without issues.

Helpful tip: Good comments make MATLAB code more readable and help others understand its logic.
Begin with a comment at the start of each script or function explaining its purpose, inputs, and
outputs. Use inline comments to clarify complex or non-obvious code, explaining the reasoning
behind key steps or formulas. Keep comments concise but informative: avoid redundancy, and focus on
what the code does rather than how. Group related blocks of code under section comments for better
organisation. Keep comments up to date as the code evolves; this improves maintainability and makes
it easier for collaborators, or your future self, to understand the code’s functionality.

4) DELIVERABLES (WHAT YOU WILL NEED TO SUBMIT):

Submit your work in two parts:

• Part 1: Mid-term Presentation and Report (25 marks)


Present your progress in a 10-minute seminar talk with 10-15 slides, followed by a 5-minute Q&A.
The presentation slides and 2500-word mid-term report (in PDF format) should include the
following information:
➢ The literature review related to the problem you have opted for from Task 1.
➢ Explanation of the dataset by using tools from data visualisation.
➢ Descriptive statistical analysis of the dataset you have opted for.

➢ Discussion of the part of the solution you have implemented to solve the problem, as described
in Task 3.
➢ A vision of how you are going to apply machine learning to the problem selected in Task 1.

Helpful tip: Complete the following tutorial to develop effective presentation skills:
https://roehampton.libwizard.com/f/presentations

• Part 2: Final Report (75 marks)


➢ Submit a 4000-word final report in PDF format using the provided template.
➢ Include your MATLAB (.m/.mlx) file with the report.

5) ASSESSMENT EXPECTATIONS AND RUBRIC:

Criteria, expectation and maximum marks (total: 100)

Mid-term presentation and report submission

• Presentation (in class) & mid-term brief report (maximum marks: 25)
➢ The literature review related to the problem you have opted for from Task 1.
➢ Explaining the dataset by using tools from data visualisation.
➢ Present the descriptive statistical analysis of the dataset you have opted for.
➢ Discuss the part of the solution you have implemented to solve that problem mentioned in Task 3.
➢ Present a vision of how you are going to apply machine learning to that problem opted in Task 1.

Final report submission

• Abstract, conclusion and format of demonstration (maximum marks: 10)
A brief 200-300 words glance at the problem statement and its possible solution along with results.
Demonstrating report by using appropriate language, clear formatting, and correct referencing.

• Introduction/Literature review appropriateness (maximum marks: 10)
A detailed survey of related work that covers relevant literature review on the chosen problem.
Discuss the detailed modifications you have conducted to the reference research article.

• Mathematical understanding and feature engineering (maximum marks: 20)
Detailed mathematical formulation is provided with an appropriate explanation of the algorithmic
equations used in the study. Also, showing the ability to define a part of a problem in the scope
of algebra, calculus, probability, approximation theory and/or numerical analysis.

• Statistical analysis (maximum marks: 10)
Appropriate and detailed inferential and descriptive statistical analyses are conducted to refine
the possible solution.

• Data visualisation (maximum marks: 5)
Various types of graphical representation attempted to visualise the dataset as well as simulation
results. Also showing the ability to efficiently visualise overlapping complex data
distribution/simulation results in single plots.

• Application of machine learning algorithm (maximum marks: 10)
Multiple algorithms have been used for comparison purposes, and a comprehensive analysis has been
conducted with detailed reasoning.

• Programming language used/Statistical software used (maximum marks: 10)
A clear well-commented MATLAB code has been developed without using built-in functions.

Rubric: distribution of marks

• Abstract, conclusion and format demonstration
➢ 100-80%: A brief 200-300 words glance at the problem statement and its possible solution along
with results. Demonstrating report by using appropriate language, clear formatting and correct
referencing.
➢ 79-70%: A brief 200-300 words glance at the problem statement and its possible solution along
with results. Demonstrating report by using appropriate language, clear formatting and correct
referencing.
➢ 69-60%: A brief 200-300 words glance at the problem statement and its possible solution along
with results. Demonstrating report by using understandable language, good formatting and correct
referencing.
➢ 59-50%: A general 200-300 words glance at the problem statement. Demonstrating report by using
understandable language, reasonable formatting and referencing.
➢ 49-00% (Fail): The report failed to use understandable language, reasonable formatting and
referencing.

• Introduction/Literature review appropriateness
➢ 100-80%: Demonstrates outstandingly broad and in-depth independent reading from appropriate
sources, including the most current ones in the field. The choice of sources highly enhances the
fulfilment of the assignment objectives. Clear, accurate, systematic application of material with
highly developed and/or integrated critical appraisal.
➢ 79-70%: Demonstrates very broad and in-depth independent reading from appropriate sources,
including the most current ones in the field. The choice of sources clearly enhances the fulfilment
of the assignment objectives. Clear, accurate, systematic application of material with
well-developed and/or integrated critical appraisal.
➢ 69-60%: Demonstrates in-depth independent reading from appropriate sources, including the most
current ones in the field. The choice of sources clearly enhances the fulfilment of the assignment
objectives. Clear, accurate, systematic application of material with developed and/or integrated
critical appraisal.
➢ 59-50%: Evidence of independent reading from a wide range of appropriate sources, including
current ones. Clear, accurate, systematic application of the material. Shows an ability to appraise
material critically.
➢ 49-00% (Fail): Limited evidence of independent reading. The application of literature is too
descriptive overall.

• Mathematical understanding and feature engineering
➢ 100-80%: Application of knowledge and understanding of mathematical concepts is outstanding and
shows mastery of the discipline and professional practice. Appreciation of the limits of theory is
demonstrated throughout the work. The approach to the assessment task is informed by the most
up-to-date theories, concepts and practices in the discipline and own research.
➢ 79-70%: Demonstrates a very detailed, accurate, systematic mathematical understanding.
Appropriately selected theoretical knowledge is integrated into the overall assessment task,
including up-to-date theories, concepts and practices of the subject area and own research.
➢ 69-60%: Shows a systematic and accurate understanding of key mathematical theories, including the
most up-to-date ones, which are appropriately applied, along with own research, within the context
of the assessment task.
➢ 59-50%: Effective application of knowledge of key mathematical theories and conclusions resulting
from own research.
➢ 49-00% (Fail): Knowledge of mathematical theory is inaccurate/incomplete. The choice of
mathematical theory is inappropriate/incomplete. Application and/or understanding of concepts are
very limited.

• Statistical analysis
➢ 100-80%: Appropriate and detailed inferential and descriptive statistical analyses are conducted
on the dataset to refine the possible solution.
➢ 79-70%: Appropriate and detailed inferential and descriptive statistical analyses are conducted
on the dataset to refine the possible solution.
➢ 69-60%: Appropriate and detailed inferential and descriptive statistical analyses are conducted
on the dataset but have limited evidence of refinement of the possible solution.
➢ 59-50%: Appropriate inferential and/or descriptive statistical analyses are conducted on the
dataset but no evidence is provided of refinement of the possible solution.
➢ 49-00% (Fail): Basic descriptive statistical analysis is conducted on the dataset with limited or
incorrect or no explanation.

• Data visualisation
➢ 100-80%: Various types of graphical representation attempted to visualise the dataset as well as
simulation results. Demonstrated the ability to efficiently visualise overlapping complex data
distribution/simulation results in single 3D plots.
➢ 79-70%: Various types of graphical representation attempted to visualise the dataset as well as
simulation results. Demonstrated the ability to efficiently visualise overlapping complex data
distribution/simulation results in single 2D plots.
➢ 69-60%: Various types of graphical representation attempted to visualise the dataset as well as
simulation results.
➢ 59-50%: A basic but relevant type of graphical representation attempted to visualise the dataset
as well as the simulation results.
➢ 49-00% (Fail): The graphical representation is limited and/or inappropriate for the defined
problem.

• Application of machine learning algorithm
➢ 100-80%: Multiple algorithms have been used for comparison purposes, and a comprehensive analysis
has been conducted with detailed reasoning. Demonstrates a good understanding of the application of
knowledge in optimising the solution.
➢ 79-70%: Multiple algorithms have been used for comparison purposes, and a comprehensive analysis
has been conducted with detailed reasoning. Demonstrates a good understanding of the application of
knowledge in optimising the solution.
➢ 69-60%: Multiple algorithms have been used with detailed reasoning. Demonstrates a reasonable
understanding of the application of knowledge in optimising the solution.
➢ 59-50%: A single algorithm has been used with limited reasoning. Demonstrates reasonable
understanding of the application of knowledge.
➢ 49-00% (Fail): A single algorithm has been used with limited reasoning. Demonstrates incorrect or
no understanding of the application of knowledge.

• Programming language used/Statistical software used
➢ 100-80%: A clear well-commented MATLAB code has been developed without using built-in functions.
➢ 79-70%: A clear well-commented MATLAB code has been developed using minimal built-in functions.
➢ 69-60%: A clear well-commented MATLAB code has been developed using a reasonable number of
built-in functions.
➢ 59-50%: A clear well-commented MATLAB code has been developed by using lots of built-in functions.
➢ 49-00% (Fail): A vague-commented or uncommented MATLAB code has been developed by using lots of
built-in functions.

ADDITIONAL INFORMATION (IF REQUIRED):

APPENDIX 1: DOWNLOADING RESTRICTED RESEARCH PAPER:

• For the ScienceDirect paper given in Problem 1, visit:
https://library.roehampton.ac.uk/sciencedirect, and for the IEEE paper given in Problem 3, visit:
https://library.roehampton.ac.uk/iel
• Use your university credentials to log in, search for the paper title, and download the paper.

APPENDIX 2: HOW TO DECIDE A NEW PROBLEM:

To decide on a new problem, you must first contact your module tutor. The reference article for the
new problem must be published in a journal indexed in the Science Citation Index Expanded (SCIE)
database and must use images/videos as its dataset. Follow these steps to decide on a new problem:
• Search for the keywords of the problem in Google Scholar (https://scholar.google.com/).
• Filter the search results so that the research paper is no more than 5 years old. For example, if
you are taking the Applications of Data Science module in 2022, the research article you choose
should have been published in the range 2018-2022.
• Discuss the modification you are planning to make to the reference article with your module tutor.
• Confirm with the tutor that the article of your choice falls in the SCIE journal category by
searching for the journal name in MJL (https://mjl.clarivate.com/search-results) and checking that
it appears in the Web of Science Core Collection: Science Citation Index Expanded list.
• Refer to the screenshot below from MJL illustrating the example paper mentioned in Problem 3,
which belongs to the IEEE Access journal:
