Chapter 4 - Data Curation
Motivation
In a typical machine learning process, what is the most important part?
The importance of data
- In the past, only a small amount of data was available → people focused on
traditional machine learning algorithms (e.g. KNN, K-means, Linear Regression,
Decision Trees, …) to make use of that data.
- Once Big Data became available, modern deep learning algorithms were developed
to take advantage of large amounts of data (and of greater computational
power).
- Now that deep learning algorithms have reached some limitations,
people are turning their attention back to data, i.e. collecting higher-quality data.
Examples
How does OpenAI prepare data for ChatGPT?
Content
4.1. Data collection
4.2. The Role of a Data Scientist
4.3. Data Governance
4.4. Pitfalls
4.1. Data collection
Data can come from your own organization, but it can also be licensed
from a third-party data collection agency or consumer service, or created from
scratch.
4.1.1. Internal Data Collection: Digital
Most organizations have their own information systems, e.g. Sales,
Manufacturing, CRM, HRM, …
These information systems generate data and keep it in many forms (from structured
databases to unstructured log files).
Organizations may save all the data for future use, or provide it to data scientists to
analyze and build intelligent systems continuously.
Types of data storage in an organization
Structured storage:
- Excel spreadsheets
- Database Management Systems (DBMS): SQL Server, MySQL, Oracle, …
- Data warehouse
Unstructured or semi-structured storage:
- NoSQL databases, e.g. MongoDB
- Data lake
Internal Data Collection: Digital
What if you want to build an AI system but have not yet made data collection a
priority? Even in this scenario, you may already have more data than you think
→ do an internal data exploration and come up with a data collection strategy
- The first part of data exploration consists of identifying and listing all existing
digital systems used in your organization
  - This can be data explicitly stored by the system (e.g., customer records) or just system
  usage data that is saved in log files
- Ask whether this data can be easily accessed or exported so that it can be used within
other systems (e.g., the AI system you are building)
Some possible access methods:
Application Programming Interface (API)
Best option: the existing system provides a
well-documented API to access the data
How do you use the API to obtain the data you need?
Don’t worry, you don’t need to learn the programming languages behind the API; you just
need to know which variables store the data you need
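As a minimal sketch of what "knowing which variables store your data" can look like: many APIs return JSON, and you keep only the fields you need. The sample payload and the field names (`customers`, `id`, `city`, `total_spent`, `internal_flag`) are hypothetical and stand in for whatever the system's API documentation actually lists.

```python
import json

# Hypothetical JSON body, standing in for what an internal CRM API might return.
SAMPLE_RESPONSE = """
{
  "customers": [
    {"id": 1, "name": "An", "city": "Hanoi", "total_spent": 120.5, "internal_flag": "x"},
    {"id": 2, "name": "Binh", "city": "Da Nang", "total_spent": 80.0, "internal_flag": "y"}
  ]
}
"""

def extract_fields(response_text, record_key, wanted_fields):
    """Keep only the wanted fields from each record stored under record_key."""
    payload = json.loads(response_text)
    return [
        {field: record.get(field) for field in wanted_fields}
        for record in payload[record_key]
    ]

# Select just the variables needed for the AI system and drop the rest.
rows = extract_fields(SAMPLE_RESPONSE, "customers", ["id", "city", "total_spent"])
```

In a real project the response text would come from an HTTP request to the documented endpoint; only the field selection step is shown here.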
File Export
If a system does not have a convenient API for exporting data, it might have a file
export capability. This capability is likely in the system's user interface and allows
an end user to export data in a standardized format such as a comma-separated
values (CSV) file.
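Once a CSV export exists, loading it takes only a few lines. The sketch below uses Python's standard `csv` module; the column names (`order_id`, `customer`, `amount`) and the inline sample are hypothetical and stand in for a file saved from the system's user interface.

```python
import csv
import io

# Hypothetical export; in practice this would be a file saved from the system's UI.
EXPORTED_CSV = """order_id,customer,amount
1001,An,250000
1002,Binh,99000
"""

def load_export(csv_text):
    """Parse exported CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(reader)

orders = load_export(EXPORTED_CSV)

# CSV values arrive as strings, so convert them before doing arithmetic.
total_amount = sum(int(row["amount"]) for row in orders)
```

For a file on disk, the same `csv.DictReader` can be given a file object opened with `open(path, newline="")` instead of the `io.StringIO` wrapper.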
Direct Database Connection
If the system does not provide any supported data exporting capabilities and it is
infeasible or not cost-effective to add one, you could instead connect directly
to the system's internal database, if one exists.
● This involves setting up a secure connection to the database that the system
uses in order to directly access its tables
● An important point to keep in mind is that you should access this data in a
read-only fashion so you don't inadvertently affect the application.
Practice
https://www.analyticsvidhya.com/blog/2021/05/how-to-fetch-data-using-api-and-sql-databases/
https://www.mapbox.com/
4.1.2. Internal Data Collection: Physical
Identify any existing manual processes → data in physical form → how to digitize it?
Data in its physical form must be digitized before it can be used to create an AI
system, but digitization is not a trivial task → need to consider
4.1.3. Data Collection via Licensing
If you have not been collecting data, or you require data that you are unable to
collect internally → data licensing
- Companies whose business model is selling data
(https://onesignal.com/pricing)
- Free datasets (may not fit your purpose completely)
- https://www.kaggle.com/
- https://project-awesome.org/awesomedata/awesome-public-datasets
- https://www.data.gov/
- Data licensing companies (YCharts, Thomson Reuters)
- Other sources depending on your domain (geospatial, agriculture,
transportation, …) → may require some discussion or negotiation
Data Collection via Licensing
Notes:
- Data licensing pricing, especially in larger deals, should be negotiable based on how
you are planning to use the data. Scale-based pricing might be advantageous to you
and also allow the licensing company greater upside based on your success.
- Avoid being beholden to the licensing company for its data
- The company may suddenly stop its services
- They use the licensed data to bootstrap your system
- E.g.: If you were the CTO of Grab, would you use the Google Maps API or develop your own?
→ Use the licensed data to build the initial system, but start collecting your own data
from your AI system for later use
Data Collection via Crowdsourcing
Crowdsourcing platforms consist of two different types of users:
Leveraging the Power of Existing Systems
There are already a number of intelligent systems available that can be used to
generate a dataset, e.g. Google, Flickr
Your tasks
Try your best to find as much data as possible for your project from:
- Licensing
- Kaggle-like websites
- Data licensing companies
- Other sources based on your domain
- Crowdsourcing
- Find this kind of information in Vietnam
- Existing intelligent systems
How to know your data is high-quality
Data that is used to build AI systems is typically referred to as ground truth—that
is, the truth that underpins the knowledge in an AI system.
Supervised machine learning models are trained on labeled data that are
considered “ground truth” for the model to identify patterns that predict those
labels on new data.
→ A high-quality dataset has labels that are close to the ground truth of the real
scenario
Ground truth examples
YouTube wants to improve its speech-to-text system to automatically generate
subtitles for Vietnamese videos
Watch and listen carefully to this video, and try to write subtitles for the song
in the video
- What is your own subtitle? → Label
- What is generated by the current speech-to-text system? → Predicted
- What is the correct answer (the lyrics provided by the author)? → Ground truth
Does your label exactly match the ground truth?
- Yes → Your dataset is high quality
- No → You have to find another annotator
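The label-versus-ground-truth check can be sketched as a simple match rate. Everything below is illustrative: the `normalize` rule (lowercasing and collapsing whitespace) is an assumption about which differences count as trivial, and the subtitle lines are made up rather than taken from the video.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())

def match_rate(labels, ground_truth):
    """Fraction of lines whose label equals the ground truth after normalization."""
    if len(labels) != len(ground_truth):
        raise ValueError("labels and ground_truth must have the same length")
    if not ground_truth:
        return 0.0
    matches = sum(normalize(a) == normalize(b) for a, b in zip(labels, ground_truth))
    return matches / len(ground_truth)

# Hypothetical subtitle lines: the annotator's labels versus the author's lyrics.
ground_truth = ["hello my friend", "see you again", "good night"]
annotator_labels = ["Hello  my friend", "see you a gain", "good night"]

# Two of the three lines match once case and spacing are normalized.
rate = match_rate(annotator_labels, ground_truth)
```

A rate of 1.0 corresponds to the "Yes" branch above; anything lower shows how far the annotator's labels are from the ground truth.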
Ground truth
Good ground truth typically comes from or is produced by organizational systems
already in use.
If no existing data is available, subject matter experts (SMEs) can manually create
this ground truth
There are typically two key methods to build the distribution of your ground truth
when building an AI model:
4.2. The Role of a Data Scientist
Data scientists are pioneers who work with business leaders to solve problems by
understanding, preparing, and analyzing data to predict emerging trends
Job: take raw data and turn it into actionable business insights or predictive models
Just remember: the more your data volume grows, the more data scientists you will need
Feedback Loops
The quality of the output generated by an AI system depends on the dataset used
to train it. Data scientists are responsible for ensuring that data quality and integrity are
maintained with each feedback loop.
Each loop is a sprint toward the stated objectives, and at the end of each sprint,
feedback should be given by either end users or SMEs to ensure maximum
benefit from the adoption of Agile loops. At the end of every sprint, the system
should be tested thoroughly and suitable course corrections should be made.
SMEs are central to this process. They will help the engineers find gaps and
inaccurate predictions by the AI
Making Data Accessible
After the data is collected, it is typically stored in an organization's data
warehouse, which is a system that colocates data in a central location to be
conveniently accessed for analysis and training
Having data and being able to use it to train an AI system are two different things.
Data in an organization can sometimes be siloed, meaning that each department
maintains its own data. This division makes it hard to have a holistic view →
build a data platform that collects and compiles all siloed data into a central
location.
Data platform: data warehouse, data lake, data mart, Hadoop, Spark, Amazon
Web Services (AWS), Google Cloud Platform, …
4.3. Data Governance
Governance is about ensuring that processes follow the highest standards of
ethics while following legal provisions in spirit
Data Collection Policies
● Users should be made aware of what data is being collected and for how long
it will be stored.
● Dark design patterns that imply consent rather than ask the user explicitly
should not be implemented.
● If data will be sent to third parties for processing/storage, the user should also
be informed of this upfront.
Encryption
Encrypt sensitive information such as credit card details, and protect the keys and
passwords used to encrypt the data as well
Access Control Systems
All data should be classified based on an assessment of factors such as its
importance to the user and the company and whether it contains personal data of
users or company secrets.
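One possible way to express such a classification in code is sketched below. The three levels, the rules in `classify`, and the clearance check are all hypothetical; a real scheme would come from your organization's own data governance policy.

```python
from enum import IntEnum

class Sensitivity(IntEnum):
    """Hypothetical classification levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2

def classify(has_personal_data, has_company_secrets, is_business_critical):
    """Assign a level from the factors named above (the rules are illustrative)."""
    if has_personal_data or has_company_secrets:
        return Sensitivity.RESTRICTED
    if is_business_critical:
        return Sensitivity.INTERNAL
    return Sensitivity.PUBLIC

def can_access(user_clearance, data_level):
    """Allow access only when the user's clearance covers the data's level."""
    return user_clearance >= data_level

# Example: a customer table with personal data versus a public press release.
customer_table = classify(has_personal_data=True, has_company_secrets=False,
                          is_business_critical=True)
press_release = classify(False, False, False)
analyst_clearance = Sensitivity.INTERNAL
```

Here an analyst with `INTERNAL` clearance could read the press release but not the customer table.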
Creating a Data Governance Board
The board will develop the organization's data governance policies by looking at
best practices and regulations from across the globe, such as the General Data Protection
Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA)
Your task:
Are You Data Ready?
It is important to take stock of all your digital and manual systems to see what data
is being generated.
4.4. Pitfalls
Pitfall 1: Insufficient Data Licensing
Action Checklist
___ Determine the possible internal and external datasets available to train your system.
___ Have a data scientist perform a data consolidation exercise if data is not currently easy to access.
___ Understand the data protection laws applicable to your organization and implement them.
___ Appoint a data governance board to oversee activities relating to data governance in order to ensure
your organization stays on the right track.
___ Put together a data governance plan for your organization's data activities.
___ Create and then release a data privacy policy for how your organization uses the data it accesses.
___ Establish some data security protections such as using data encryption, providing employee security
training, and building relationships with white-hat security firms.