
Machine and Deep Learning

Data Challenge: Sub-Event Detection

November 2024

1 Introduction
The goal of this data challenge is to study and apply machine learning/artificial intelligence
techniques to a real-world classification problem. In this binary classification problem, your
mission is to build a model that can accurately predict the occurrence of notable events
within specified one-minute intervals of a football match. You are provided with a Twitter dataset [1] centered on the 2010 and 2014 FIFA World Cup tournaments, organized into one-minute time periods. Each period is annotated with a binary label: 0 if no notable event occurred, or 1 if a significant event (such as a goal, half-time, kick-off, full-time, penalty, red card, yellow card, or own goal) occurred within that period. For an interval to be labeled as containing an event, it must align closely with the actual event time, without excessive delay.

This setting presents several challenges, from data preprocessing and feature extraction to
model selection and hyperparameter tuning. You can employ both traditional machine learning
and deep learning techniques to tackle this binary classification task. The primary evaluation
metric will be accuracy, measuring the percentage of correctly predicted labels. This data
challenge is hosted on Kaggle as an in-class competition. To participate, you must have a
Kaggle account; if you don't have one, you can create one for free. The URL to register for the
competition and gain access to all relevant materials is the following:

https://www.kaggle.com/t/946c29c13d024ffcad34ab0b40e85688

2 Dataset Description
As part of the challenge, you are given the following files:

[1] Meladianos, P., Xypolopoulos, C., Nikolentzos, G., & Vazirgiannis, M. (2018). An optimization approach for sub-event detection and summarization in Twitter. In Advances in Information Retrieval: 40th European Conference on IR Research, ECIR 2018, Grenoble, France, March 26-29, 2018, Proceedings (pp. 481-493). Springer International Publishing.

- train_tweets/*.json: This directory contains all annotated tweets split by their corresponding
match. Each file contains tweet data divided into time periods, with each entry labeled as 0 or 1
based on the presence of sub-events.
- eval_tweets/*.json: This directory contains all the tweets that need to be annotated.
- baseline.py: This script contains two simple baselines: a Logistic Regression classifier and a Dummy Classifier that always predicts the most frequent class of the training set. You can use the code provided here as a starting point for reading and processing the data.
- logistic_predictions.csv and dummy_predictions.csv: sample submission files in the
correct format for the two provided baseline classifiers.
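
To make the data layout concrete, here is a minimal loading sketch (not the provided baseline.py). It assumes each JSON file holds a list of tweet records with "MatchID", "PeriodID" and "EventType" fields (named after the columns described in Section 7) plus a tweet-text field, called "Tweet" here purely as a placeholder; check the actual files for the real field names.

import glob
import pandas as pd

# Read every match file and stack the records into one DataFrame.
frames = [pd.read_json(path) for path in glob.glob("train_tweets/*.json")]
train = pd.concat(frames, ignore_index=True)

# One example per one-minute period: concatenate the period's tweets
# and keep its binary label ("Tweet" is an assumed field name).
periods = (
    train.groupby(["MatchID", "PeriodID"])
         .agg(text=("Tweet", " ".join), label=("EventType", "first"))
         .reset_index()
)
print(periods.head())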

3 Task and Evaluation


For each time period in the test set, your model should predict whether a specific sub-event
occurred based on the provided tweet data. The evaluation metric for this competition is
accuracy. Accuracy measures the proportion of correct predictions your model makes,
calculated by dividing the number of correct predictions by the total number of predictions. For
this binary classification task, the accuracy metric is defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative predictions, respectively.
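
As a quick sanity check of the metric, scikit-learn's accuracy_score computes exactly this ratio of correct predictions over a pair of aligned label arrays; the toy labels below are made up:

from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]  # ground-truth period labels
y_pred = [0, 1, 0, 0, 1]  # a model's predictions
print(accuracy_score(y_true, y_pred))  # 4 correct out of 5 -> 0.8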

4 Provided Source Code


You are given two baseline classifiers, written in Python, to help you get started with the challenge. The first, a Dummy Classifier, constantly predicts the most frequent label that appears in the training set; this baseline achieves an accuracy of 0.65234 on the public Kaggle leaderboard. The second is a Logistic Regression classifier that uses only the text of the tweets: it first performs basic preprocessing and then transforms the text using GloVe vectors trained on a Twitter dataset. The accuracy score of this baseline is approximately 0.62890.

Both baselines ignore the order of the tweets and treat each time period as unrelated to the football match it belongs to.
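
For illustration, here is a minimal sketch of the most-frequent-class idea behind the dummy baseline (not the distributed baseline.py itself); the toy labels are made up, and a DummyClassifier ignores the features entirely:

import numpy as np
from sklearn.dummy import DummyClassifier

y_train = np.array([0, 0, 1, 0, 1, 0])  # toy period labels (majority: 0)
X_train = np.zeros((len(y_train), 1))   # placeholder features, ignored

clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(clf.predict(np.zeros((3, 1))))    # always the majority class: [0 0 0]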

5 Useful Python Libraries


In this section, we briefly discuss some packages that can be useful in the challenge:
● A very powerful machine learning library in Python is scikit-learn (http://scikit-learn.org/). It can be used both in the preprocessing step (e.g., for feature selection) and in the classification task (several classification algorithms are implemented in scikit-learn).
● A very popular deep learning library in Python is PyTorch (https://pytorch.org/). The library provides a simple and user-friendly interface for building and training deep learning models.
● Since you will deal with textual data, the Natural Language Toolkit (NLTK, http://www.nltk.org/) may also prove useful.
● Gensim (https://radimrehurek.com/gensim/) is a Python library for unsupervised topic modeling and natural language processing, using modern statistical machine learning. The library provides all the necessary tools for learning word and document embeddings. An alternative to it is FastText (https://fasttext.cc/).
● If you do not want to spend a lot of time producing the word embeddings, you can use transfer learning: download a set of word embeddings pre-trained on GoogleNews (https://code.google.com/archive/p/word2vec/), GloVe (https://nlp.stanford.edu/projects/glove/), or Numberbatch (https://github.com/commonsense/conceptnet-numberbatch); a minimal sketch follows this list.
● In case you want to work with the data represented as a graph, a library for managing and analyzing graphs may prove useful. An example is the NetworkX library (https://networkx.github.io/), which allows you to create, manipulate, and study the structure and several other features of a graph.
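
As referenced in the transfer-learning bullet above, here is a minimal sketch of turning a period's text into a fixed-size feature vector by averaging pre-trained GloVe Twitter embeddings. "glove-twitter-25" is one of Gensim's downloadable models (fetched on first use); the whitespace tokenizer is a simplification, and NLTK offers tokenizers better suited to tweets:

import numpy as np
import gensim.downloader as api

glove = api.load("glove-twitter-25")  # 25-dim vectors trained on tweets

def embed(text):
    # Mean of the GloVe vectors of the in-vocabulary tokens;
    # zero vector if no token is known.
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)

print(embed("goal for spain in the world cup final").shape)  # (25,)

The resulting vectors can then be fed to any scikit-learn classifier, such as the Logistic Regression baseline of Section 4.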

6 Grading Scheme
Each team has to be composed of 2 or 3 students. No larger or smaller teams will be accepted.
Group choices need to be submitted to us by Friday 22nd November at 17:00 at the following link:

https://forms.gle/G3YQVj8AkX7GHL8p8



Grading will be out of 100 points in total. Each team should deliver:

A submission on the Kaggle competition webpage (20 points). Points will be allocated based on raw performance only, provided that the results are reproducible. That is, using only your code, the data provided on the competition page, and any additional resources you are able to reference and demonstrate understanding of, the jury should be able to train your final model and use it to generate the predictions you submitted for scoring.

A zipped folder (30 points) including:

1. A folder named "code" containing all the scripts needed to reproduce your submission.
2. A README file with brief instructions on how to run your code and where it expects the
original data files.
3. A report (.pdf file) of at most 3 pages, excluding the cover page and references. In addition to your self-contained 3-page report, you may use up to 3 extra pages of appendix (for extra explanations, algorithms, figures, tables, etc.). Please ensure that both your real name(s) and your Kaggle team name appear on the cover page.

The 3-page report should include the following sections (in that order):

● Section 1: Data Preprocessing and Feature Selection/Extraction (10 points).
Independent of the prediction performance achieved, the jury will reward the research effort done here. You are expected to:
1. Describe the steps taken to clean and preprocess the data, including tokenization, lowercasing, and handling any potential duplicates.
2. Explain the motivation and intuition behind each feature. How did you come up with the feature (e.g., are you following the recommendation of a research paper)? What is it intended to capture?
3. Rigorously report your experiments on the impact of various combinations of features on predictive performance and, depending on the classifier, how you tackled the task of feature selection.
● Section 2: Model Choice, Tuning and Comparison (15 points). Best submissions will:
1. Compare your model against different classification models (e.g., Neural
Network, Random Forest, Graph-based event detection, LLMs, ...).
2. For each classifier, explain the procedure that was followed to tackle parameter
tuning and prevent overfitting.
● Report and code completeness, organization and readability will be worth 5 points. Best
submissions will (1) clearly deliver the solution, providing detailed explanations of each
step, (2) provide clear, well-organized and commented code, and (3) refer to research
papers. You are free to search for relevant papers and articles and try to incorporate
their ideas and approaches into your code and report as long as (a) it is clearly cited

within the report, (b) it is not a direct copy of code and (c) you are able to demonstrate
understanding of the content you incorporated.

An oral presentation of your project and the achieved results (50 points). Oral presentations will be scheduled in the week of the 16th of December. A schedule on which
you can choose presentation slots will be released after the team submissions have been made.
We ask you to prepare 10-minute presentations, which will then be followed by questions from
the examiners. It is required that all group members are present and take a speaking role during
the presentation.

Please note that the inclusion of any external data sources that directly contain the labels will result in you getting 0 points on the Kaggle submission and being penalised on the report and presentation.

7 Submission Process
Submission files should be in .csv format, and contain two columns respectively named "ID"
and "EventType". The "ID" column, which is a concatenation of the “MatchID” and “PeriodID”
columns, is used to identify each data sample and therefore needs to appear as-is in the results
file. The "EventType" column takes 0 or 1 values and is the output of your classifier. Note that two sample submission files are available for download (logistic_predictions.csv and dummy_predictions.csv); you can use them to check that everything works as expected.
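
A minimal sketch of producing such a file is shown below. It assumes per-period 0/1 predictions and builds "ID" by joining "MatchID" and "PeriodID" with an underscore, which is an assumption; compare against the provided sample files to confirm the exact concatenation:

import pandas as pd

preds = pd.DataFrame({
    "MatchID": [0, 0, 1],
    "PeriodID": [0, 1, 0],
    "EventType": [0, 1, 0],  # your classifier's 0/1 outputs
})
# Assumed ID format "<MatchID>_<PeriodID>"; verify with the sample files.
preds["ID"] = preds["MatchID"].astype(str) + "_" + preds["PeriodID"].astype(str)
preds[["ID", "EventType"]].to_csv("my_predictions.csv", index=False)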

The competition ends on Thursday 12th December at 17:00. This is the deadline for you to
submit a compressed file containing your source code and final report, explaining your solutions
and discussing the scores you have achieved. Until then, you can submit your solution to
Kaggle and get a score at most 5 times per day. There must be one final submission per team.
