Chapter 4 - Data Curation

This section discusses the importance of collecting high quality data and different methods for data collection, including internal digital and physical collection, licensing data from third parties, crowdsourcing, and leveraging existing intelligent systems. It also provides examples of ensuring data quality by comparing labels to ground truth.

Uploaded by

duongthutrang1902


Chapter 4

Data curation and governance

Lecturer: Vu Trong Sinh

Motivation
In a typical machine learning process, what is the most important part?

The importance of data
- In the past, only a small amount of data was available → people focused on
traditional machine learning algorithms (e.g. KNN, K-means, Linear Regression,
Decision Trees, …) to make use of that data.
- When Big Data became available, modern deep learning algorithms were developed
to take advantage of the large amount of data (and of the available
computational power).
- At this moment, deep learning algorithms have reached some limitations, so
people are turning their attention back to data, i.e. to collecting higher-quality data.

→ How do we collect high-quality data?

Examples
How does OpenAI prepare data for ChatGPT?
Content
4.1. Data collection

4.2. The role of a Data Scientist

4.3. Data Governance

4.4. Pitfalls

4.1. Data collection
Data can come not only from your own organization; it can also be licensed
from a third-party data collection agency or consumer service, or created from
scratch.

4.1.1. Internal Data Collection: Digital
Most organizations have their own information systems, e.g. Sales,
Manufacturing, CRM, HRM, …
These information systems generate data and keep it in many forms (from structured
databases to unstructured log files).

Organizations may save all the data for future use, or provide it to data scientists to
analyze and build intelligent systems continuously.

Types of data storage in an organization
Structured storage:

- Excel spreadsheets
- Database Management Systems (DBMS): SQL Server, MySQL, Oracle, …
- Data warehouses

Semi-structured and unstructured storage:

- NoSQL databases, e.g. MongoDB
- Data lakes
Internal Data Collection: Digital
What if you want to build an AI system but have not yet made data collection a
priority? Even in this scenario you may already have more data than you think
→ do an internal data exploration and come up with a data collection strategy.
- The first part of data exploration consists of identifying and listing all existing
digital systems used in your organization.
  - This can be data explicitly stored by the system (e.g., customer records) or just system
    usage data saved in log files.
- Ask whether this data can easily be accessed or exported so that it can be used within
other systems (e.g., the AI system you are building).
Some possible access methods:

Application Programming Interface (API)
Best option: the existing system provides a
well-documented API to access the data.

- APIs are preferable since they are secure,
are easily and programmatically accessed,
and provide real-time access to data
- APIs can provide convenient capabilities
on top of the raw data being stored, such
as roll-up statistics or other derived data

E.g. Google API, Facebook API, …
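
As a minimal sketch of the API route, the Python snippet below builds an authenticated request and decodes a JSON response using only the standard library. The base URL, the `customers` resource, the bearer token, and the `limit` parameter are hypothetical placeholders; the real endpoint names and authentication scheme come from the API documentation of your own system.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(base_url, resource, token, params=None):
    """Build a GET request for a (hypothetical) JSON API endpoint."""
    query = f"?{urlencode(params)}" if params else ""
    return Request(
        f"{base_url}/{resource}{query}",
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )


def fetch_records(request, opener=urlopen):
    """Send the request and decode the JSON body into Python objects.

    `opener` defaults to urllib's urlopen; passing another callable makes
    the function easy to try out without network access.
    """
    with opener(request) as response:
        return json.load(response)


# Example call (assumed URL and token, not a real service):
# request = build_request("https://crm.example.internal/api", "customers",
#                         token="MY_TOKEN", params={"limit": 100})
# customers = fetch_records(request)
```

Keeping request construction separate from the network call is a design choice made here so that the same code can be exercised against a stub before it is pointed at a live system.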

Application Programming Interface (API)
How do you use the API to obtain the data you need?

→ Ask the developer or the administrator of the information systems in your
organization. They will provide you with the corresponding API, typically together with sample code
in their programming language (Java, C++, PHP, ...).

Don’t worry, you don’t need to learn those programming languages; you just
need to know which variables store the data you need.

File Export
If a system does not have a convenient API for exporting data, it might have a file
export capability. This capability is usually found in the system's user interface and allows
an end user to export data in a standardized format such as a comma-separated
values (CSV) file. Limitations:

- Only certain data may be exportable
- Data with a lot of internal structure may be harder to export
in a single file
- The data has to be exported manually and periodically
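
Once a CSV file has been exported, loading it takes only a few lines. The sketch below uses Python's standard `csv` module; the column names (`customer_id`, `region`, `total_spend`) and the inline sample data are invented for illustration and stand in for a real export.

```python
import csv
import io


def load_export(csv_file):
    """Read an exported CSV file object into a list of dict rows."""
    return list(csv.DictReader(csv_file))


# Stand-in for a file exported from a system's user interface.
exported = io.StringIO(
    "customer_id,region,total_spend\n"
    "1001,North,250.50\n"
    "1002,South,99.00\n"
)
rows = load_export(exported)

# CSV values arrive as text, so numeric columns need explicit conversion.
total = sum(float(row["total_spend"]) for row in rows)
print(len(rows), total)
```

With a real export, the `io.StringIO` object would be replaced by something like `open("export.csv", newline="", encoding="utf-8")`, where the file name is whatever the system produced.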

Direct Database Connection
If the system does not provide any supported data export capability and it is
infeasible or not cost-effective to add one, you could instead connect directly
to the system's internal database, if one exists.

● This involves setting up a secure connection to the database that the system
uses in order to directly access database tables.
● An important point to keep in mind is that you should access this data in a
read-only fashion so you don't inadvertently affect the application.

Practice
https://www.analyticsvidhya.com/blog/2021/05/how-to-fetch-data-using-api-and-sql-databases/

https://www.mapbox.com/
4.1.2. Internal Data Collection: Physical
Identify any existing manual processes → data in physical form → how to digitize it?

Data in physical form must be digitized before it can be used to create an AI
system, but digitization is not a trivial task → you need to consider:

- The amount of time and cost required to digitize the data
- Whether it is sufficient to start collecting data digitally from now on

If the physical data is valuable
→ you should start replacing the manual process with a digital system.

4.1.3. Data Collection via Licensing
If you have not been collecting data, or you require data that you are unable to
collect internally → data licensing
- Companies with the business model of selling data
(https://onesignal.com/pricing)
- Free datasets (may not fit your purpose completely)
  - https://www.kaggle.com/
  - https://project-awesome.org/awesomedata/awesome-public-datasets
  - https://www.data.gov/
- Data licensing companies (YCharts, Thomson Reuters)
- Other sources depending on your domain (geospatial, agriculture,
transportation, …) → may require some dialogue or discussion

Data Collection via Licensing
Notes:

- Data licensing pricing, especially in larger deals, should be negotiable based on how
you are planning to use the data. Scale-based pricing might be advantageous to you
and also allow the licensing company greater upside based on your success.
- Avoid becoming beholden to the licensing company for its data
  - The company may suddenly stop its service
  - Use the licensed data to bootstrap your system, not as a permanent dependency
  - E.g.: if you were the CTO of Grab, would you use the Google Maps API or develop your own?

→ Use the licensed data to build the initial system, but start collecting your own data
from your AI system for later use

Data Collection via Crowdsourcing
Crowdsourcing platforms have two different types of users:

- Users who have questions that need to be answered
  - They post their unlabelled data to a crowdsourcing marketplace
- Users who answer these questions
  - They are monetarily incentivized to answer questions quickly and with high accuracy
  - Typically, the same question is asked to multiple people for consistency. If there are
    discrepancies for a single question, perhaps the item (e.g. an image) is ambiguous. If one particular user
    has many discrepancies, this might mean that the user is answering randomly and should be
    removed from the job, or that the user did not understand the prompt

Some crowdsourcing platforms: Figure Eight, Amazon Mechanical Turk, Microworkers, …
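
The consistency check described above can be sketched as a majority vote over redundant answers. This is one simple aggregation scheme chosen for illustration, not a method prescribed by any particular platform; the item IDs, worker IDs, and labels are made up.

```python
from collections import Counter, defaultdict


def aggregate_labels(answers):
    """Aggregate (item_id, worker_id, label) triples by majority vote.

    Returns the consensus label per item, the share of votes that agreed
    with it, and each worker's rate of disagreement with the consensus.
    """
    votes_by_item = defaultdict(list)
    for item, worker, label in answers:
        votes_by_item[item].append((worker, label))

    consensus, agreement = {}, {}
    disagreements, totals = Counter(), Counter()
    for item, votes in votes_by_item.items():
        top_label, top_count = Counter(l for _, l in votes).most_common(1)[0]
        consensus[item] = top_label
        agreement[item] = top_count / len(votes)  # low value: ambiguous item
        for worker, label in votes:
            totals[worker] += 1
            if label != top_label:
                disagreements[worker] += 1

    worker_disagreement = {w: disagreements[w] / totals[w] for w in totals}
    return consensus, agreement, worker_disagreement


answers = [
    ("img1", "a", "day"), ("img1", "b", "day"), ("img1", "c", "night"),
    ("img2", "a", "night"), ("img2", "b", "night"), ("img2", "c", "day"),
]
consensus, agreement, worker_disagreement = aggregate_labels(answers)
print(consensus, worker_disagreement)
```

In this toy data, worker `c` disagrees with the majority on every item, which is the pattern that would prompt a closer look at that worker's answers.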

Leveraging the Power of Existing Systems
A number of intelligent systems are already available that can
be used to generate a dataset, e.g. Google, Flickr.

Case study: search Google Images for “daytime pictures” and then
use a browser extension to download all the images on that page.

Your tasks
Try your best to find as much data as possible for your project from:

- Licensing
  - Kaggle-like websites
  - Data licensing companies
  - Other sources based on your domain
- Crowdsourcing
  - Find this kind of information in Vietnam
- Existing intelligent systems

How to know your data is high-quality
Data that is used to build AI systems is typically referred to as ground truth, that
is, the truth that underpins the knowledge in an AI system.

Supervised machine learning models are trained on labeled data that is
considered “ground truth”, so that the model can identify patterns that predict those
labels on new data.

→ A high-quality dataset has labels that are close to the ground truth of the real
scenario

Ground truth examples
YouTube wants to improve its speech-to-text system to automatically generate
subtitles for Vietnamese videos.
Watch and listen carefully to this video, and try to write subtitles for the song
in the video.
- What is your own subtitle? → Label
- What is generated by the current speech-to-text system? → Predicted
- What is the correct answer (the lyrics provided by the author)? → Ground truth
Does your label exactly match the ground truth?
- Yes → Your dataset is high quality
- No → You have to find another annotator
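
An exact match is a strict test. A common softer measure for speech-to-text is the word error rate (WER): the number of word insertions, deletions, and substitutions needed to turn a transcript into the ground truth, divided by the number of ground-truth words. The sketch below is an added illustration rather than the lecture's own procedure, and the sample sentences are invented.

```python
def word_error_rate(ground_truth, transcript):
    """Word-level edit distance divided by the number of reference words."""
    ref = ground_truth.lower().split()
    hyp = transcript.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0

    # prev[j] = edits needed to turn the first j transcript words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


ground_truth = "the quick brown fox"   # lyrics provided by the author
label = "the quick brown fox"          # a human annotator's subtitle
predicted = "the quick brown box"      # current system output

label_wer = word_error_rate(ground_truth, label)
predicted_wer = word_error_rate(ground_truth, predicted)
print(label_wer, predicted_wer)
```

A label with a WER of 0 matches the ground truth exactly; a small but non-zero WER may still be acceptable, depending on how the dataset will be used.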
Ground truth
Good ground truth typically comes from or is produced by organizational systems
already in use.

If no existing data is available, subject matter experts (SMEs) can manually create
this ground truth

There are typically two key methods to build the distribution of your ground truth
when building an AI model:

- a balanced number of examples for each class you want to recognize

- a distribution that is proportionately representative of how your system will be used
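
The two distributions can be sketched as two sampling functions over a labeled dataset. The 90/10 split between `ok` and `spam` messages below is a made-up example, and the function names are illustrative rather than taken from any library.

```python
import random
from collections import Counter, defaultdict


def group_by_label(examples):
    """Group (text, label) pairs by their label."""
    groups = defaultdict(list)
    for example in examples:
        groups[example[1]].append(example)
    return groups


def balanced_sample(examples, per_class, seed=0):
    """Draw the same number of examples from every class."""
    rng = random.Random(seed)
    sample = []
    for items in group_by_label(examples).values():
        sample.extend(rng.sample(items, min(per_class, len(items))))
    return sample


def proportional_sample(examples, total, seed=0):
    """Draw from each class in proportion to its share of the data.

    Per-class counts are rounded, so the size may differ slightly from `total`.
    """
    rng = random.Random(seed)
    sample = []
    for items in group_by_label(examples).values():
        share = len(items) / len(examples)
        sample.extend(rng.sample(items, round(total * share)))
    return sample


# Hypothetical dataset: 90 "ok" messages and 10 "spam" messages.
examples = [(f"msg{i}", "ok") for i in range(90)]
examples += [(f"msg{i}", "spam") for i in range(90, 100)]

balanced = Counter(label for _, label in balanced_sample(examples, per_class=10))
proportional = Counter(label for _, label in proportional_sample(examples, total=20))
print(balanced, proportional)
```

The balanced sample gives the rare class as many examples as the common one, while the proportional sample keeps the 9:1 ratio that the system would see in use.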

4.2. The Role of a Data Scientist
Data scientists are pioneers who work with business leaders to solve problems by
understanding, preparing, and analyzing data to predict emerging trends.
Job: take raw data and turn it into actionable business insights or predictive models.

Just remember that the more your data volume grows, the more data scientists you will need.

Feedback Loops
The quality of the output generated by an AI system depends on the dataset used
to train it. Data scientists are responsible for ensuring that data quality and integrity are
maintained with each feedback loop.

Each loop is a sprint toward the stated objectives, and at the end of each sprint,
feedback should be given by either end users or SMEs to ensure maximum
benefit from the adoption of Agile loops. At the end of every sprint, the system
should be tested thoroughly and suitable course corrections should be made.

SMEs are central to this process. They help the engineers find gaps and
inaccurate predictions by the AI.

Making Data Accessible
After the data is collected, it is typically stored in an organization's data
warehouse, which is a system that colocates data in a central location so that it can be
conveniently accessed for analysis and training.
Having data and being able to use it to train an AI system are two different things.
Data in an organization can sometimes be siloed, meaning that each department
maintains its own data. This division makes it hard to get a holistic view →
build a data platform that collects and compiles all siloed data into a central
location.
Data platforms: data warehouse, data lake, data mart, Hadoop, Spark, Amazon
Web Services (AWS), Google Cloud Platform, …
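
At a very small scale, consolidating siloed data means joining each department's records on a shared key. The sketch below does this with plain Python dictionaries; the department names, the `customer_id` key, and the fields are hypothetical, and a real data platform would do the same thing with far more tooling around it.

```python
from collections import defaultdict


def consolidate(silos, key="customer_id"):
    """Merge records from several departmental datasets by a shared key."""
    merged = defaultdict(dict)
    for department, records in silos.items():
        for record in records:
            for field, value in record.items():
                if field != key:
                    # Prefix with the department to keep the origin visible
                    # and avoid clashes between same-named fields.
                    merged[record[key]][f"{department}.{field}"] = value
    return dict(merged)


silos = {
    "sales": [{"customer_id": 1, "total_spend": 250.5}],
    "support": [
        {"customer_id": 1, "open_tickets": 2},
        {"customer_id": 2, "open_tickets": 0},
    ],
}
central = consolidate(silos)
print(central)
```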

4.3. Data Governance
Governance is about ensuring that processes follow the highest standards of
ethics while following legal provisions in spirit.

Most of the data is customer data → security matters

● Data should not be obtained without the
consent of the individuals featured in it.
● Data that is collected without the express
consent of the user should not be used
and should not have been collected in the
first place.
What are some data policies?

Data Collection Policies
● Users should be made aware of what data is being collected and for how long
it will be stored.
● Dark design patterns that imply consent rather than ask the user explicitly
should not be implemented.
● If data will be sent to third parties for processing/storage, the user should also
be informed of this upfront.

Encryption
Encrypt sensitive information such as credit card information, and protect the keys and
passwords used to encrypt that data as well.

Access Control Systems
All data should be classified based on an assessment of factors such as its
importance to the user and the company and whether it contains personal data of
users or company secrets.

E.g.: “public,” “internal,” “restricted,” or “top secret.”

Access to data must be controlled, and only approved users should be
granted access.
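
One simple way to model these classification levels is an ordered enumeration, where a user may read data at or below their own clearance. This clearance-based rule is an assumption made for illustration; real access control systems usually also involve per-user approvals, roles, and audit logging.

```python
from enum import IntEnum


class Classification(IntEnum):
    """Data classification levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2
    TOP_SECRET = 3


def can_access(user_clearance, data_classification):
    """Allow access only when the user's clearance covers the data's level."""
    return user_clearance >= data_classification


analyst = Classification.INTERNAL
print(can_access(analyst, Classification.PUBLIC))      # allowed
print(can_access(analyst, Classification.RESTRICTED))  # denied
```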

Creating a Data Governance Board
The board will develop the organization's data governance policies by looking at
best practices across the globe, such as the General Data Protection Regulation (GDPR)
and the provisions of the Health Insurance Portability and Accountability Act (HIPAA).

Your task:

- Summarize the main points of GDPR and HIPAA

Are You Data Ready?
It is important to take stock of all your digital and manual systems to see what data
is being generated.

● Is this data sufficient for your system's needs?

● Do you need to start looking at data licensing or starting your own
crowdsourcing jobs?
● Do you have the necessary talent (such as data scientists) to make this
happen?
● Have you established your data governance model?

4.4. Pitfalls
Pitfall 1: Insufficient Data Licensing

Pitfall 2: Not Having Representative Ground Truth

Pitfall 3: Insufficient Data Security

Pitfall 4: Ignoring User Privacy

Action Checklist
___ Determine the possible internal and external datasets available to train your system.

___ Have a data scientist perform a data consolidation exercise if data is not currently easily accessed.

___ Understand the data protection laws applicable to your organization and implement them.

___ Appoint a data governance board to oversee activities relating to data governance in order to ensure
your organization stays on the right track.

___ Put together a data governance plan for your organization's data activities.

___ Create and then release a data privacy policy for how your organization uses the data it accesses.

___ Establish some data security protections such as using data encryption, providing employee security
training, and building relationships with white-hat security firms.
