
Trial Assignment Machine Learning - NLP

The document outlines a task for a machine learning project focused on Natural Language Processing (NLP) modeling, specifically for data collection and analysis of projects and tenders. Participants are instructed to harvest data from various sources, implement models for keyword extraction and Named Entity Recognition, and present their findings in a report with visualizations. Evaluation criteria include dataset extensiveness, code quality, data visualization, and the ability to explain results effectively.

Uploaded by

tinaniraj7
Copyright
© All Rights Reserved

Read the document carefully and follow as instructed.

The behavioral survey (link at the end of the document) is to be filled in once the task is completed.

Machine Learning Roles (NLP Modeling)

About the Task:


Important Points:
(1) Use separate .py files whose ONLY job is to harvest the data, whether from an API or by scraping websites.
(2) First, collect data from the large and diverse datasets listed below.
(3) Use a combination of models of your choice and benchmark their accuracy on tagging for
Keyword Extraction or Named Entity Recognition (NER). Candidates include Hugging Face models,
YAKE, BERT-derived models, spaCy, and other popular language models. You are free to use other
models.
(4) Provide a report with succinct visualizations of the results, all of your .py scripts (class-based,
object-oriented, following good scripting practices), and the final Python notebook.
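
Points (1) and (4) ask for harvesting to live in its own class-based .py files. One possible layout is sketched below. The names `BaseHarvester`, `JsonApiHarvester`, and `http_get` are hypothetical, and the sketch assumes a source that returns a JSON array of records; real endpoints and response shapes have to come from each source's own documentation.

```python
"""harvester.py: hypothetical layout for a harvest-only module."""
import json
import logging
from abc import ABC, abstractmethod
from typing import Callable, Iterator
from urllib.request import urlopen

logger = logging.getLogger(__name__)


def http_get(url: str) -> str:
    """Default fetcher: download a URL and return the body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


class BaseHarvester(ABC):
    """Common interface so every source is harvested the same way."""

    def __init__(self, source_name: str, fetch: Callable[[str], str] = http_get):
        self.source_name = source_name
        self.fetch = fetch  # injectable, so unit tests need no network

    @abstractmethod
    def harvest(self) -> Iterator[dict]:
        """Yield raw records; no cleaning or modeling happens here."""


class JsonApiHarvester(BaseHarvester):
    """Harvests an endpoint assumed to return a JSON array of records."""

    def __init__(self, source_name: str, url: str,
                 fetch: Callable[[str], str] = http_get):
        super().__init__(source_name, fetch)
        self.url = url

    def harvest(self) -> Iterator[dict]:
        logger.info("Harvesting %s from %s", self.source_name, self.url)
        for record in json.loads(self.fetch(self.url)):
            # Tag each record with its origin so sources can be combined later.
            yield {"source": self.source_name, **record}
```

Passing the fetch function in as a parameter keeps the class testable without network access, which also helps with the unit-test criterion listed under the evaluation parameters.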

Please submit a PDF of Google Slides or a document presenting your findings. Upload the
PDF to your Google Drive and share the link as follows:
Set general access on the Google Drive link to 'Anyone with the link' and the role to 'Editor'.
Share the link in the Behavioral Survey.
Evaluation is partly technical and partly based on your ability to explain meaningful results
in a presentable manner.

Natural Language Understanding Trial Task


● Data Sources: Combine several data sources from the list below, and use a Python
harvester to scrape or download the data from each. Evaluation will also
be based on the diversity of the data chosen.

World Bank Projects
SAM.gov
Open Contracting
Tenders Electronic Daily
Open Tenders
GI Hub Pipelines
California Tenders
Florida Tenders
Texas Tenders
FDOT
TxDOT
Caltrans Tenders
Asian Development Bank (ADB) Projects
African Development Bank (AfDB) Projects
Asian Infrastructure Investment Bank (AIIB) Projects

● Learn more about the standard data structure for Projects & Tenders:
○ https://developer.taiyo.ai/api-doc/ProjectsandTenders/
○ https://www.open-contracting.org/data-standard/
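
Because the sources differ in shape, it can help to map every harvested record onto one flat schema early on. The sketch below flattens an Open Contracting style release. `ProjectRecord` is a hypothetical schema, and the field names (`ocid`, `date`, `buyer.name`, `tender.title`, `tender.description`, `tender.value.amount`, `tender.value.currency`) are assumed from the Open Contracting Data Standard release format and should be checked against the standard before use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProjectRecord:
    """Hypothetical flat schema shared by all harvested sources."""
    record_id: str
    title: str
    description: str
    buyer: Optional[str]
    amount: Optional[float]
    currency: Optional[str]
    date: Optional[str]


def from_ocds_release(release: dict) -> ProjectRecord:
    """Flatten an OCDS-style release; missing fields become None or ''."""
    tender = release.get("tender") or {}
    value = tender.get("value") or {}
    buyer = release.get("buyer") or {}
    return ProjectRecord(
        record_id=release.get("ocid", ""),
        title=tender.get("title", ""),
        description=tender.get("description", ""),
        buyer=buyer.get("name"),
        amount=value.get("amount"),
        currency=value.get("currency"),
        date=release.get("date"),
    )
```

Each additional source would get its own small mapping function onto the same `ProjectRecord`, so the modeling code only ever sees one structure.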

Modeling and Report


Use open-source natural language models with the above data sources to synthesize your
findings, addressing the points below.
1. Extract entities. Use Named Entity Recognition (NER) to identify and extract sector,
sub-sector, location, or entities like Government Agency, Company Name, Contractors,
Investor, or unit measurements such as cost per square kilometer. Ideally, use both the
project / tender description and the original PDF document.
2. Similar projects. Use Word2vec and / or cosine similarity (semantic and syntactic) to
identify similar projects. For example: for a given project, identify all similar projects
from the past 10 years that lie within 500 miles.
3. Trends. Show data visualizations for aggregated time series, bar chart / line chart,
where the X-axis is Time and Y-axis is ‘Total number of records’ and/or ‘Total
budget/cost’. These charts can be segmented to show different countries or sectors
within a country over time.
4. Customize a GPT-3 chatbot. Build a Q&A chatbot over a custom dataset:
https://openai.com/blog/introducing-chatgpt-and-whisper-apis
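
The similar-projects example in point 2 combines three filters: text similarity, a time window, and a distance limit. A minimal sketch follows, under stated assumptions: it uses plain bag-of-words counts for the cosine similarity as a stand-in for Word2vec embeddings, it assumes each project is a dict with hypothetical keys `id`, `description`, `lat`, `lon`, and `date`, and the `min_score` threshold of 0.2 is an arbitrary illustration.

```python
import math
import re
from collections import Counter
from datetime import date

EARTH_RADIUS_MILES = 3958.8


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    a = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    b = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def similar_projects(query: dict, candidates: list, today: date,
                     max_years: int = 10, max_miles: float = 500.0,
                     min_score: float = 0.2) -> list:
    """Return (score, id) pairs for candidates passing all filters, best first."""
    cutoff = (today.year - max_years, today.month, today.day)
    results = []
    for c in candidates:
        d = c["date"]
        if (d.year, d.month, d.day) < cutoff:
            continue  # older than the time window
        if haversine_miles(query["lat"], query["lon"], c["lat"], c["lon"]) > max_miles:
            continue  # too far from the query project
        score = cosine_similarity(query["description"], c["description"])
        if score >= min_score:
            results.append((score, c["id"]))
    return sorted(results, reverse=True)
```

Swapping in embeddings only changes `cosine_similarity`: the same formula would be applied to averaged word vectors instead of term counts, while the date and distance filters stay as they are.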
Bonus question: Find data sources for road projects and tenders in the state of California.
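
For the trend charts in point 3, the aggregation behind each chart can be sketched as follows. This is a minimal sketch that assumes records are dicts with an ISO 8601 `date` string, an optional numeric `budget`, and a segment field such as `country`; all three key names are hypothetical. The result can be passed to any plotting library with year on the X-axis.

```python
from collections import defaultdict


def aggregate_by_year(records, segment_key="country"):
    """Return {segment: {year: {"count": n, "total_budget": x}}}."""
    out = defaultdict(lambda: defaultdict(lambda: {"count": 0, "total_budget": 0.0}))
    for rec in records:
        year = int(rec["date"][:4])  # assumes ISO 8601 dates such as "2021-03-15"
        cell = out[rec.get(segment_key, "unknown")][year]
        cell["count"] += 1
        cell["total_budget"] += rec.get("budget") or 0.0  # missing budget counts as 0
    # Convert to plain dicts with years sorted, ready for plotting.
    return {seg: dict(sorted(years.items())) for seg, years in out.items()}
```

Calling `aggregate_by_year(records, segment_key="sector")` on the records of a single country gives the per-sector view mentioned in point 3.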

Evaluation is based on the following parameters:

Stage   Evaluation Criteria

1       Extensiveness of the dataset and understanding of projects and tenders data structure

2       Modular, DRY Code

3       Config Params, Unit Tests & Logging Standards

4       Data Visualization presentation of results and understanding of the problem statement

5       Explainability, the ability to experiment, do creative relevant problem solving, and structure your analytical thinking with documentation
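
Stage 3 names config params, unit tests, and logging standards. One way these might fit together is sketched below. The `Settings` class, its fields, and the `HARVEST_*` environment variable names are all hypothetical, chosen only to illustrate keeping parameters out of the harvesting and modeling code.

```python
import logging
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical run-time parameters, kept in one place."""
    output_dir: str = "data"
    request_timeout_s: int = 30
    log_level: str = "INFO"


def load_settings(env=os.environ) -> Settings:
    """Override defaults from environment variables (names are assumptions)."""
    return Settings(
        output_dir=env.get("HARVEST_OUTPUT_DIR", Settings.output_dir),
        request_timeout_s=int(env.get("HARVEST_TIMEOUT_S", Settings.request_timeout_s)),
        log_level=env.get("HARVEST_LOG_LEVEL", Settings.log_level).upper(),
    )


def configure_logging(settings: Settings) -> None:
    """Apply one log format and level for every script in the project."""
    logging.basicConfig(
        level=getattr(logging, settings.log_level, logging.INFO),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def test_load_settings_reads_overrides():
    """Example unit test; runnable with pytest or by calling it directly."""
    settings = load_settings({"HARVEST_TIMEOUT_S": "5", "HARVEST_LOG_LEVEL": "debug"})
    assert settings.request_timeout_s == 5
    assert settings.log_level == "DEBUG"
    assert settings.output_dir == "data"
```

Accepting the environment as a plain mapping lets the test pass its own dict instead of mutating `os.environ`.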

Behavioral Survey
