Chapter 4 - Data Curation

This section discusses the importance of collecting high-quality data and different methods for data collection, including internal digital and physical collection, licensing data from third parties, crowdsourcing, and leveraging existing intelligent systems. It also provides examples of ensuring data quality by comparing labels to ground truth.

Chapter 4

Data curation and governance


Lecturer: Vu Trong Sinh

1
Motivation
In a typical machine learning process, which part is the most important?

2
The importance of data
- In the past, only a small amount of data was available → people focused on traditional machine learning algorithms (e.g. KNN, K-means, Linear Regression, Decision Trees, …) to make the most of that data.
- When Big Data became available, modern deep learning algorithms were developed to take advantage of large amounts of data (and of the available computational power).
- Now that deep learning algorithms have reached some limitations, people are turning their attention back to data, i.e. collecting higher-quality data.

→ How do we collect high-quality data?

3
Examples
How does OpenAI prepare data for ChatGPT?
Content
4.1. Data collection

4.2. The role of a Data Scientist

4.3. Data Governance

4.4. Pitfalls

5
4.1. Data collection
Data can come not only from your own organization; it can also be licensed from a third-party data collection agency or consumer service, or created from scratch.

6
4.1.1. Internal Data Collection: Digital
Most organizations have their own information systems, e.g. Sales, Manufacturing, CRM, HRM, …
These information systems generate data and keep it in many forms (from structured databases to unstructured log files).

Organizations may save all the data for future use, or provide it to data scientists to analyze and to build intelligent systems continuously.

7
Types of data storage in an organization
Structured databases:

- Excel files
- Database Management Systems (DBMS): SQL Server, MySQL, Oracle, …

Unstructured databases:

- NoSQL databases (e.g. MongoDB)
- Data warehouses, data lakes
Internal Data Collection: Digital
If you want to build an AI system but have not yet made data collection a priority, you may already have more data than you think.
→ do an internal data exploration and come up with a data collection strategy
- The first part of data exploration consists of identifying and listing all existing digital systems used in your organization
  - this can be data explicitly stored by the system (e.g., customer records) or just system usage data saved in log files
- Ask whether this data can easily be accessed or exported so that it can be used within other systems (e.g., the AI system you are building)

Some possible access methods:

9
Application Programming Interface (API)
Best option: the existing system provides a well-documented API to access the data

- APIs are preferable since they are secure, easily and programmatically accessed, and provide real-time access to data
- APIs can provide convenient capabilities on top of the raw data being stored, such as roll-up statistics or other derived data

E.g. Google API, Facebook API, …

10
Application Programming Interface (API)
How do you use the API to obtain the data you need?

→ Ask the developer or the administrator of the information systems in your organization. They will provide you with the corresponding API, i.e. sample code in their programming language (Java, C++, PHP, ...).

Don’t worry, you don’t need to learn those programming languages; you just need to know which variables store the data you need.
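
A minimal sketch of what such an API call might look like in Python, assuming a hypothetical internal endpoint (https://erp.example.com/api/v1/sales/records) and an API key provided by the system administrator:

```python
import requests

# Hypothetical internal endpoint and API key -- replace with whatever the
# administrator of your information system gives you.
BASE_URL = "https://erp.example.com/api/v1"
API_KEY = "YOUR_API_KEY"

def fetch_sales_records(page=1, page_size=100):
    """Fetch one page of sales records from the (hypothetical) internal API."""
    response = requests.get(
        f"{BASE_URL}/sales/records",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"page": page, "per_page": page_size},
        timeout=30,
    )
    response.raise_for_status()   # fail loudly on HTTP errors
    return response.json()        # these are the variables that hold your data

records = fetch_sales_records()
print(f"Fetched {len(records)} records")
```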

11
File Export
If a system does not have a convenient API for exporting data, it might have a file export capability. This capability usually lives in the system's user interface and allows an end user to export data in a standardized format such as a comma-separated values (CSV) file.

- Only certain data may be exportable
- Data with a lot of internal structure may be harder to export in a single file
- Exports may have to be done manually and periodically
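
Once you have such an export, a quick sanity check in Python (assuming a hypothetical file name crm_export_2024_01.csv) helps you judge whether the exported data is usable:

```python
import pandas as pd

# Hypothetical CSV export produced through the system's user interface.
df = pd.read_csv("crm_export_2024_01.csv")

print(df.shape)          # number of rows and columns
print(df.dtypes)         # inferred column types
print(df.isna().sum())   # missing values per column
```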

12
Direct Database Connection
If the system does not provide any supported data exporting capabilities and it is infeasible or not cost-effective to add one, you could instead connect directly to the system's internal database, if one exists.

● This involves setting up a secure connection to the database that the system uses in order to directly access its tables
● An important point to keep in mind is that you should access this data in a read-only fashion so you don't inadvertently affect the application.
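
A hedged sketch of such a connection in Python with SQLAlchemy and pandas; the connection string, table, and column names are hypothetical, and read-only access is best enforced by giving the analysis job a database account with SELECT-only privileges:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string -- use a dedicated account that only has
# SELECT privileges so the application's data cannot be modified.
engine = create_engine(
    "postgresql://readonly_user:secret@db.example.com:5432/crm"
)

# Copy the rows you need into a DataFrame for analysis and training.
customers = pd.read_sql(
    "SELECT id, name, created_at FROM customers", engine
)
print(customers.head())
```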

13
Practice
https://www.analyticsvidhya.com/blog/2021/05/how-to-fetch-data-using-api-and-sql-databases/

https://www.mapbox.com/
4.1.2. Internal Data Collection: Physical
Identify any existing manual processes → data in physical form → how do you digitize it?

Data in its physical form must be digitized before it can be used to create an AI system, but digitization is not a trivial task → you need to consider:

- The amount of time and cost required to digitize the data
- Whether it is sufficient to simply start collecting digital data from now on

If the physical data is valuable
→ you should start replacing the manual process with a digital system

15
4.1.3. Data Collection via Licensing
If you have not been collecting data, or you require data that you are unable to collect internally → data licensing
- Companies whose business model is selling data (https://onesignal.com/pricing)
- Free datasets (which may not fit your purpose completely)
  - https://www.kaggle.com/
  - https://project-awesome.org/awesomedata/awesome-public-datasets
  - https://www.data.gov/
- Data licensing companies (YCharts, Thomson Reuters)
- Other sources depending on your domain (geospatial, agriculture, transportation, …) → may require some dialogue and discussion

16
Data Collection via Licensing
Notes:

- Data licensing pricing, especially in larger deals, should be negotiable based on how you are planning to use the data. Scale-based pricing might be advantageous to you and also allow the licensing company greater upside based on your success.
- Avoid becoming beholden to the licensing company for their data:
  - the company may suddenly stop its service
  - your system ends up bootstrapped on data that you do not own
  - E.g.: if you are Grab's CTO, do you use the Google Maps API, or develop your own maps?

→ Use the licensed data to build the initial system, but start collecting your own data from your AI system for later use

17
Data Collection via Crowdsourcing
Crowdsourcing platforms consist of two different types of users:

- Users who have questions that need to be answered
  - post your unlabelled data to a crowdsourcing market
- Users who answer these questions
  - monetarily incentivized to answer questions quickly and with high accuracy
  - typically, the same question is asked to multiple people for consistency. If there are discrepancies for a single question, perhaps the image is ambiguous. If one particular user has many discrepancies, this might mean the user is answering randomly and should be removed from the job, or that the user did not understand the prompt (see the consistency-check sketch below)

Some crowdsourcing platforms: Figure Eight, Mechanical Turk, Microworkers, …
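
A minimal sketch of such a consistency check, assuming hypothetical image-labelling answers of the form (item, worker, label): take the majority vote per item and count how often each worker disagrees with it.

```python
from collections import Counter, defaultdict

# Hypothetical crowdsourced answers: (item_id, worker_id, label)
answers = [
    ("img1", "w1", "cat"), ("img1", "w2", "cat"), ("img1", "w3", "dog"),
    ("img2", "w1", "dog"), ("img2", "w2", "dog"), ("img2", "w3", "dog"),
]

# Group the answers by item.
by_item = defaultdict(list)
for item, worker, label in answers:
    by_item[item].append((worker, label))

# Majority vote per item (the consensus label).
majority = {
    item: Counter(label for _, label in votes).most_common(1)[0][0]
    for item, votes in by_item.items()
}

# Count how often each worker disagrees with the consensus.
disagreements = Counter()
for item, votes in by_item.items():
    for worker, label in votes:
        if label != majority[item]:
            disagreements[worker] += 1

print(majority)        # consensus label per item
print(disagreements)   # workers to review or remove from the job
```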

18
Leveraging the Power of Existing Systems
There are already a number of intelligent systems available that can be used to generate a dataset, e.g. Google, Flickr

Case study: use Google Images with a search of “daytime pictures” and then use a browser extension to download all the images on that page

19
Your tasks
Try your best to find as much data as possible for your project from:

- Licensing
  - Kaggle-like websites
  - Data licensing companies
  - Other sources based on your domain
- Crowdsourcing
  - Find this kind of service in Vietnam
- Existing intelligent systems

20
How do you know your data is high quality?
Data that is used to build AI systems is typically referred to as ground truth—that is, the truth that underpins the knowledge in an AI system.

Supervised machine learning models are trained on labeled data that is considered “ground truth”, so that the model can identify patterns that predict those labels on new data.

→ A high-quality dataset has labels that are close to the ground truth of the real scenario

21
Ground truth examples
YouTube wants to improve its speech-to-text system to automatically generate subtitles for Vietnamese videos.
Watch and listen carefully to this video, and try to write a subtitle along with the song in the video.
- What is your own subtitle? → Label
- What is generated by the current speech-to-text system? → Predicted
- What is the correct answer (the lyrics provided by the author)? → Ground truth
Does your label match the ground truth exactly?
- Yes → your dataset is high quality
- No → you have to find another annotator
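
A hedged sketch of how this check could be quantified in Python, assuming both the annotator's subtitle and the author's lyrics are available as plain strings (the example strings below are made up):

```python
import difflib

ground_truth = "this is the correct line of the lyrics"   # provided by the author
your_label   = "this is the corect line of the lyrics"    # annotator's subtitle

# Word-level similarity between the label and the ground truth.
ratio = difflib.SequenceMatcher(
    None, ground_truth.split(), your_label.split()
).ratio()

print(f"Word agreement: {ratio:.0%}")   # close to 100% -> high-quality label
```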
Ground truth
Good ground truth typically comes from, or is produced by, organizational systems already in use.

If no existing data is available, subject matter experts (SMEs) can manually create this ground truth.

There are typically two key methods for building the distribution of your ground truth when building an AI model (see the sketch below):

- a balanced number of examples for each class you want to recognize
- a distribution proportionately representative of how your system will be used
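
Whichever of the two you choose, it is worth checking the actual class distribution of your labels; a minimal sketch with made-up labels:

```python
from collections import Counter

# Hypothetical training labels.
labels = ["fraud", "normal", "normal", "normal", "fraud", "normal"]

counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    print(f"{cls}: {n} examples ({n / total:.0%})")
# Compare this distribution either with a balanced target or with the
# proportions you expect to see once the system is in use.
```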

23
4.2. The Role of a Data Scientist
Data scientists are pioneers who work with business leaders to solve problems by understanding, preparing, and analyzing data to predict emerging trends.
Their job: take raw data and turn it into actionable business insights or predictive models.

Just remember that as your data volume grows, you will need more data scientists.

24
Feedback Loops
The quality of the output generated by an AI system is based on the dataset used to train it. Data scientists are responsible for ensuring that data quality and integrity are maintained with each feedback loop.

Each loop is a sprint toward the stated objectives, and at the end of each sprint, feedback should be given by either end users or SMEs to ensure maximum benefit from the adoption of Agile loops. At the end of every sprint, the system should be tested thoroughly and suitable course corrections should be made.

SMEs are central to this process. They help the engineers find gaps and inaccurate predictions made by the AI.

25
Making Data Accessible
After the data is collected, it is typically stored in an organization's data warehouse, a system that colocates data in a central location so it can be conveniently accessed for analysis and training.
Having data and being able to use it to train an AI system are two different things. Data in an organization can sometimes be siloed, meaning that each department maintains its own data. This division makes it hard to have a holistic view → build a data platform that collects and compiles all siloed data into a central location.
Data platform technologies: data warehouses, data lakes, data marts, Hadoop, Spark, Amazon AWS, Google Cloud Platform, …

26
4.3. Data Governance
Governance is about ensuring that processes follow the highest standards of ethics while following legal provisions in spirit.

Most of the data belongs to customers → security matters

● Data should not be obtained without the consent of the individuals featured in it
● Data that is collected without the express consent of the user should not be used, and should not have been collected in the first place.

Some data policies?

27
Data Collection Policies
● Users should be made aware of what data is being collected and for how long
it will be stored.
● Dark design patterns that imply consent rather than ask the user explicitly
should not be implemented.
● If data will be sent to third parties for processing/storage, the user should also
be informed of this upfront.

28
Encryption
Encrypt sensitive information such as credit card data, and protect the keys and passwords used to encrypt that data as well.
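
A minimal sketch of symmetric encryption in Python using the cryptography package's Fernet recipe (the card number is made up; in practice the key would live in a secrets manager, not next to the data):

```python
from cryptography.fernet import Fernet

# Generate a key once and store it securely (e.g., in a secrets manager);
# whoever holds the key can decrypt the data.
key = Fernet.generate_key()
fernet = Fernet(key)

token = fernet.encrypt(b"4111 1111 1111 1111")   # encrypted card number
print(token)
print(fernet.decrypt(token))                     # readable only with the key
```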

29
Access Control Systems
All data should be classified based on an assessment of factors such as its importance to the user and the company, and whether it contains personal data of users or company secrets.

E.g.: “public,” “internal,” “restricted,” or “top secret.”

Access to data must be controlled, and only approved users should be granted access.

30
Creating a Data Governance Board
The board will develop the organization's data governance policies by looking at best practices from across the globe, such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA).

Your task:

- Summarize the main points of the GDPR and HIPAA

31
Are You Data Ready?
It is important to take stock of all your digital and manual systems to see what data is being generated.

● Is this data sufficient for your system's needs?
● Do you need to start looking at data licensing or starting your own crowdsourcing jobs?
● Do you have the necessary talent (such as data scientists) to make this happen?
● Have you established your data governance model?

32
4.4. Pitfalls
Pitfall 1: Insufficient Data Licensing

Pitfall 2: Not Having Representative Ground Truth

Pitfall 3: Insufficient Data Security

Pitfall 4: Ignoring User Privacy

33
Action Checklist
___ Determine the possible internal and external datasets available to train your system.

___ Have a data scientist perform a data consolidation exercise if data is not currently easily accessed.

___ Understand the data protection laws applicable to your organization and implement them.

___ Appoint a data governance board to oversee activities relating to data governance in order to ensure
your organization stays on the right track.

___ Put together a data governance plan for your organization's data activities.

___ Create and then release a data privacy policy for how your organization uses the data it accesses.

___ Establish some data security protections such as using data encryption, providing employee security
training, and building relationships with white-hat security firms.

34
