CM3060 NLP Coursework

The coursework assignment for the CM3060 Natural Language Processing module involves developing a text classifier for a specific domain using both statistical and embedding-based models. Students are required to implement data preprocessing, establish baseline performance, and conduct a comparative analysis of model effectiveness, with a focus on metrics such as accuracy and precision. The final submission should include a performance analysis, project reflections, and suggestions for future research directions.

Uploaded by

tastybites.sa1


BSc in Computer Science

Module: CM3060 Natural Language Processing

Coursework description

Coursework assignment: comparative text classification using statistical and embedding-based models

Contents

I. Introduction
- Domain-specific area
- Objectives
- Dataset Description
- Evaluation methodology

II. Implementation
- Data Preprocessing
- Baseline Performance
- Comparative Classification methodology
- Programming style

III. Conclusions
- Performance Analysis & Comparative Discussion
- Project Summary and Reflections

I. Introduction
This coursework requires you to develop a text classifier and apply it to a
specific domain or challenge, e.g. fake news detection, sentiment analysis,
spam detection, document tagging, etc., using both statistical and embedding-
based language models. You will need to identify a suitable problem area with
an associated data set. Additionally, your work must be situated within
existing literature, with proper citations and references to high-quality sources.
The core technical exercise will involve comparing the effectiveness of a
traditional statistical model and a modern deep learning model in addressing
the chosen problem.

1. Domain-specific area
The first step of the coursework is to identify and describe the problem
or challenge. This is an area of industry or science where text
classification methods can contribute. Include relevant literature to
support the significance of the chosen area.
2. Objectives
Outline the goals of exploring both statistical and embedding-based
models to understand their effectiveness and applicability in text
classification tasks. State any contribution which the results may make
to the challenge addressed, supported by relevant literature.

3. Dataset Description
The next step is to identify a suitable dataset which is representative of
the challenge and will require attention to all the steps outlined in this
assignment. Provide a description of the dataset, including its size, data types,
and the way the data were acquired. Clearly state the source of the dataset.
Large technology companies, such as Microsoft, Google and Amazon,
provide a wide variety of datasets.
Example: ‘Fake and real news’ dataset available from the Kaggle
official website.

4. Evaluation methodology
Describe the metrics (e.g., Accuracy, Precision, Recall, F1-Score) for
assessing model performance and discuss how to apply these metrics
to compare the two methodologies.
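
As an illustration of how these metrics relate to each other, the following is a minimal sketch in plain Python. The function name `binary_metrics` and the toy labels are assumptions made for this example only; in practice you may prefer the equivalent functions from a library of your choice.

```python
def binary_metrics(y_true, y_pred, positive_label):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    false_pos = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    false_neg = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard every division so an empty or one-sided prediction set yields 0.0.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented purely to show the call; not real results.
toy_true = ["fake", "fake", "real", "real"]
toy_pred = ["fake", "real", "real", "fake"]
toy_scores = binary_metrics(toy_true, toy_pred, positive_label="fake")
```

Applying the same function to the predictions of both models on the same held-out test set gives directly comparable figures.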

II. Implementation
This part of the coursework is the implementation of the project. It includes
preprocessing the data, building and testing your classifier and obtaining
results. The project is expected to be developed using the Python language
and Jupyter notebook. Provide well-commented Python code, accompanied
by a document describing the following steps:

5. Data Preprocessing
Convert/store the dataset locally and preprocess the data. Describe the
text representation (e.g., bag of words, word embedding, etc.) and any
preprocessing steps you have applied and why they were needed (e.g.,
tokenization, lemmatization). Address differences in data preparation
for statistical models (e.g., frequency tables) versus embedding models
(e.g., word vectors).
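
The difference between the two kinds of data preparation can be sketched as follows. This is a simplified illustration, not a required approach: the stop-word list, the reserved ids and all function names are assumptions chosen for the example, and lemmatization is omitted.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to"}  # illustrative list only
PAD_ID = 0      # id reserved for padding short sequences
UNKNOWN_ID = 1  # id reserved for tokens not seen when building the vocabulary


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text):
    """Statistical-model input: token frequencies with stop words removed."""
    return Counter(tok for tok in tokenize(text) if tok not in STOP_WORDS)


def build_vocabulary(texts):
    """Map each distinct token to an integer id, starting after the reserved ids."""
    vocabulary = {}
    for text in texts:
        for tok in tokenize(text):
            if tok not in vocabulary:
                vocabulary[tok] = len(vocabulary) + 2
    return vocabulary


def encode(text, vocabulary, max_length):
    """Embedding-model input: a fixed-length sequence of token ids."""
    ids = [vocabulary.get(tok, UNKNOWN_ID) for tok in tokenize(text)][:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))
```

The bag-of-words output discards word order, whereas the encoded sequence preserves it so that an embedding layer can look up one vector per position.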

6. Baseline performance
Describe and justify the baseline against which you are going to
compare the performance of your chosen approach. This can be an
already published baseline (e.g. cited in the literature) or the results of
a basic algorithm that you implement yourself. The baseline should
represent a meaningful benchmark for comparison.
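
One very simple baseline you could implement yourself is a majority-class predictor, which always outputs the most frequent training label. The sketch below assumes this choice purely for illustration; the function names are hypothetical, and a published baseline or a stronger basic algorithm may be more appropriate for your dataset.

```python
from collections import Counter


def fit_majority_baseline(train_labels):
    """Return the label that occurs most often in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by predicting the majority training label for every test item."""
    majority_label = fit_majority_baseline(train_labels)
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)
```

Any model that does not clearly exceed this figure has learned little beyond the class distribution.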

7. Comparative Classification methodology

Implement both a traditional statistical model and a modern deep
learning model. Build a classifier using the appropriate Python library.
Detail the architecture, training, and optimization processes for each,
emphasizing their strengths and weaknesses in the context of the
chosen dataset. Clearly compare the performance of the two models.
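
For the statistical side of the comparison, a multinomial Naive Bayes classifier is one common choice. The sketch below is a from-scratch illustration under that assumption; the class name, the smoothing default and the tokenizer are all hypothetical, and the deep learning counterpart would normally be built with a dedicated library rather than by hand.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-k (Laplace) smoothing."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocabulary.add(tok)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                if tok not in self.vocabulary:
                    continue  # skip tokens never seen during training
                count = self.word_counts[label][tok]
                score += math.log(
                    (count + self.smoothing) / (total_words + self.smoothing * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.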

8. Programming style
Ensure all code is clear and well-commented. Documentation should
include detailed explanations of the rationale behind model choices,
parameter settings, and any specific libraries or tools used.
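
As a small illustration of this style, the sketch below replaces unexplained literal values with named constants and documents each step. The constant values and the function name are assumptions made for the example, not required settings.

```python
import random

TEST_FRACTION = 0.2  # share of the data held out for testing
RANDOM_SEED = 42     # fixed seed so that the split is reproducible


def split_train_test(examples, test_fraction=TEST_FRACTION, seed=RANDOM_SEED):
    """Shuffle the examples reproducibly and split them into train and test lists."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]
```

Naming the split ratio and the seed makes the parameter settings visible in one place and easy to justify in the accompanying documentation.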

III. Conclusions

9. Performance Analysis & Comparative Discussion

Present and analyze the results for both models. Use visualizations to
compare performance across different classes and discuss any
significant findings. Critically evaluate the advantages and
disadvantages of statistical and embedding-based models based on
the results. Discuss scenarios where one might be preferred over the
other and hypothesize reasons for observed performance disparities.
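
A per-class comparison can be prepared before any plotting. The following sketch counts confusion pairs and derives per-class recall for each model; the function names and the plain-text table layout are assumptions for illustration, and the resulting numbers could equally be passed to a plotting library of your choice.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_recall(y_true, y_pred):
    """Return the fraction of each true class that was predicted correctly."""
    counts = confusion_counts(y_true, y_pred)
    class_totals = Counter(y_true)
    return {label: counts[(label, label)] / total for label, total in class_totals.items()}


def format_comparison(y_true, predictions_by_model):
    """Build a plain-text table of per-class recall with one column per model."""
    model_names = list(predictions_by_model)
    recalls = {
        name: per_class_recall(y_true, preds)
        for name, preds in predictions_by_model.items()
    }
    lines = ["class".ljust(10) + "".join(name.ljust(14) for name in model_names)]
    for label in sorted(set(y_true)):
        cells = "".join(f"{recalls[name][label]:.2f}".ljust(14) for name in model_names)
        lines.append(label.ljust(10) + cells)
    return "\n".join(lines)
```

Classes on which the two models differ most in recall are natural starting points for the discussion of performance disparities.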

10. Project Summary and Reflections

Reflect on the learning experience, the practicality of each model type,
and their potential applications in real-world scenarios. Describe your project's
contributions to the problem area and discuss the extent to which your
solution is transferable to other domain-specific areas. Suggest
improvements and future research directions.

Rubric: marks are shown in square brackets.

I. Introduction

1. Introduction to the domain-specific area (200-500 words)

[0] Missing or incorrect.
[2] Briefly discussed.
[3] Adequately discussed.
[4] Domain-specific area clearly stated, informative description, fully
referenced work.
[5] Exceptional work which includes the above but goes beyond
what would be expected from a student at this level.

2. Objectives of the project (200-500 words)

[0] Missing or incorrect.
[2] Briefly described.
[3] Adequately described.
[4] Objectives clearly stated with sufficient detail and potential
contributions.
[5] Exceptional work which includes the above but goes beyond
what would be expected from a student at this level.

3. Description of the selected dataset (200-500 words)

[0] Missing or incorrect.
[2] Briefly described.
[3] Adequately described.
[4] Described in sufficient detail, including origin, size, structure,
data types.
[5] Exceptional work which includes the above but goes beyond
what would be expected from a student at this level.

4. Evaluation methodology (200-500 words)

[0] Missing or incorrect.
[2] Briefly described.
[3] Adequately described.
[4] Methods and metrics clearly described with convincing rationale.
[5] Exceptional work which includes the above but goes beyond
what would be expected from a student at this level.

II. Implementation

5. Data Preprocessing
[0] Missing or incorrect.
[2] Briefly described.
[3] Working code fragments with some preprocessing steps.
[4] All appropriate preprocessing steps undertaken, clearly
described with convincing rationale.
[5] Exceptional work which includes the above but goes beyond
what would be expected from a student at this level.

6. Baseline performance
[0] Missing or incorrect.
[2] Briefly described.
[3] Adequately described, some justification provided.
[4] Baseline appropriately chosen, clearly described/implemented
with convincing rationale and preliminary comparison with
advanced models.
[5] Exceptional work which includes the above but goes beyond
what would be expected from a student at this level.

7. Comparative Classification methodology

[0] Missing or incorrect.
[2] Briefly described.
[3] Working solution with unconfirmed results.
[4] Working solution with confirmed results generated and
presented appropriately, including comparative insights.
[5] Exceptional work which includes the above but goes beyond
what would be expected from a student at this level.

8. Programming style
[0] Non-meaningful names in code, no comments, use of 'magic
numbers'.
[2] A minimal attempt at readability.
[3] The source code is readable with some comments.
[4] The source code is of high quality and follows general coding
conventions.
[5] Exceptional work which includes the above but goes beyond
what would be expected from a student at this level.

III. Conclusions

9. Performance Analysis & Comparative Discussion (200-500 words)

[0] No evaluation or incorrect evaluation provided.
[2] Basic description of the results, minimal comparison between
the two models.
[3] Results discussed but not convincingly evaluated.
[4] Results convincingly evaluated with clear quantitative and
qualitative comparison.
[5] Exceptional work which includes the above but goes beyond
what would be expected from a student at this level.

10. Project Summary and Reflections (200-400 words)

[0] Missing or incorrect.
[2] Briefly described.
[3] Some conclusions without fully evaluating the comparative
effectiveness of the classifiers and the results.
[4] Detailed project evaluation including a comparative analysis
(classifier, pre-processing, results, reproducibility).
[5] Exceptional work which includes the above but goes beyond
what would be expected from a student at this level.

[END OF COURSEWORK ASSIGNMENT]
