Working With Data - Annotated
Introductory
Artificial Intelligence
College of Engineering
Patient 1 120/80 90 No
Numerical Data: Any form of measurable data such as your height, weight, or the cost
of your phone bill. You can determine if a set of data is numerical by attempting to
average out the numbers or sort them in ascending or descending order.
Ordinal Data: A mix of numerical and categorical data. The data fall into categories, but the
numbers placed on the categories have meaning. For example, rating a restaurant on a scale
from 0 (lowest) to 4 (highest) stars gives ordinal data. Order is important.
Time Series Data: Consists of data points that are indexed at specific points in time.
More often than not, this data is collected at consistent intervals.
Textual Data: Words, sentences, or paragraphs that can provide some level of insight to
your machine learning models. Often grouped together or analyzed using various
methods such as word frequency, text classification, or sentiment analysis.
Explore Your Data
Know what you’re working with
Contains far fewer errors
Cheaper
Gather everything you can find
Takes more time to collect
More expensive in general.
Collecting Data
2. Someone has already collected it for you
Missing categorical data
• The best way to handle missing data for categorical features is to simply label them as 'Missing'!
• You're essentially adding a new class for the feature.
• This tells the algorithm that the value was missing.
• This also gets around the technical requirement for no missing values.
Missing numeric data
• For missing numeric data, you should flag and fill the values.
• Flag the observation with an indicator variable of missingness.
• Then, fill the original missing value with 0 just to meet the technical requirement of no missing values.
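Both strategies above can be sketched in plain Python. This is a minimal illustration that assumes missing entries are stored as None; the function names `label_missing_categories` and `flag_and_fill` and the sample columns are hypothetical and do not come from the slides.

```python
def label_missing_categories(values):
    """Replace missing categorical values with the explicit class 'Missing'."""
    return ["Missing" if v is None else v for v in values]


def flag_and_fill(values, fill_value=0):
    """Return (filled_values, missing_flags) for a numeric column.

    missing_flags[i] is 1 where the original value was missing, else 0.
    """
    flags = [1 if v is None else 0 for v in values]
    filled = [fill_value if v is None else v for v in values]
    return filled, flags


# Hypothetical sample columns, with None marking a missing entry.
positions = ["Staff", None, "Manager"]
salaries = [8, None, 20]

print(label_missing_categories(positions))  # ['Staff', 'Missing', 'Manager']
filled, flags = flag_and_fill(salaries)
print(filled)  # [8, 0, 20]
print(flags)   # [0, 1, 0]
```

The flag column lets a model tell a filled-in 0 apart from a real 0, which is the point of flagging before filling.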
Data Interpolation
Missing Data Interpolation
• Interpolation is a mathematical method that fits a function to your data and uses this function to
estimate the missing data.
• The simplest type of interpolation is linear interpolation, which draws a straight line between the
value before the missing data and the value after, and reads the missing value off that line.
$$ y = \frac{y_2 - y_1}{x_2 - x_1}\,(x - x_1) + y_1 $$
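The formula can be tried out with a short sketch. The function name `linear_interpolate` is illustrative; the numbers are the known salary points on either side of the two missing salaries in the breakout table later in this deck (years of experience as x, salary in k AED as y).

```python
def linear_interpolate(x, x1, y1, x2, y2):
    """Estimate y at x on the straight line through (x1, y1) and (x2, y2)."""
    if x2 == x1:
        raise ValueError("x1 and x2 must differ")
    return (y2 - y1) / (x2 - x1) * (x - x1) + y1


# Missing salary at 3 years, between (2 years, 11) and (4 years, 17).
print(linear_interpolate(3, 2, 11, 4, 17))  # 14.0

# Missing salary at 6 years, between (4 years, 17) and (7 years, 26).
print(linear_interpolate(6, 4, 17, 7, 26))  # 23.0
```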
Data Cleaning
Better data beats fancier algorithms.
Data Augmentation
Common Augmentation Methods
1. Mirroring
2. Random Cropping
3. Rotation
4. Shearing
5. Color Shifting
6. Brightness
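A few of these methods can be sketched without any image library by treating a grayscale image as a list of pixel rows with values from 0 to 255. This is only an illustration under that assumption: the function names are hypothetical, rotation is limited to 90 degrees, and shearing and color shifting are left out because they need interpolation or color channels.

```python
import random


def mirror(image):
    """Flip the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def adjust_brightness(image, delta):
    """Add delta to every pixel, keeping values inside 0..255."""
    return [[max(0, min(255, p + delta)) for p in row] for row in image]


def random_crop(image, height, width, rng=random):
    """Cut a height x width window at a random position."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]


# A tiny made-up 2 x 3 grayscale image.
image = [
    [10, 20, 30],
    [40, 50, 60],
]

print(mirror(image))                  # [[30, 20, 10], [60, 50, 40]]
print(rotate_90_clockwise(image))     # [[40, 10], [50, 20], [60, 30]]
print(adjust_brightness(image, 200))  # [[210, 220, 230], [240, 250, 255]]
print(random_crop(image, 1, 2))       # a random 1 x 2 window
```

Each function returns a new image and leaves the original untouched, so several augmented copies can be generated from one source image.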
Data Distribution/Histogram
Categorical
Numerical
Ordinal
Data Transformation
Data Normalization
Definition
• Transform data features to be on a similar scale, which improves the performance and training stability of the machine learning
model.
• Normalization is useful when your data has varying scales and the algorithm you are using does not make assumptions about the
distribution of your data.
Normalization Techniques
Log Scaling
Feature Clipping
Z-Score
Linear Scaling vs. Z-Score
Data Normalization
Linear Scaling
Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0-4 (4 credit cards)

Age   Income   #Credit Cards   Buy Insurance
35    15,000   1               No
45    32,000   3               No
23    7,000    4               No
55    45,000   3               Yes
65    12,000   0               Yes
27    20,000   2               No
33    25,000   2               Yes
Linear Scaling
Age min = 23    Age max = 65    Age max - Age min = 42

Age   Age - Age min   (Age - Age min) / (Age max - Age min)   Age'
35    35 - 23 = 12    12 / 42 = 0.29                          0.29

Normalization Completed !!
Age'   Income'   #Credit Cards'   Buy Insurance
0.76   1.00      0.75             Yes
1.00   0.13      0.00             Yes
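The same calculation can be applied to every column with a small helper. This is a sketch rather than a reference implementation: `linear_scale` is an illustrative name, and the three lists hold the seven customer rows from the insurance table in this deck.

```python
def linear_scale(values):
    """Rescale values to the 0..1 range using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


ages = [35, 45, 23, 55, 65, 27, 33]
incomes = [15000, 32000, 7000, 45000, 12000, 20000, 25000]
credit_cards = [1, 3, 4, 3, 0, 2, 2]

columns = [("Age'", ages), ("Income'", incomes), ("#Credit Cards'", credit_cards)]
for name, column in columns:
    # Round only for display; keep full precision for training.
    print(name, [round(v, 2) for v in linear_scale(column)])
```

After scaling, the smallest value in each column maps to 0 and the largest to 1, so age, income, and number of credit cards all share the same range.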
Data Normalization
Clipping
Room Temperature = 16-27 degrees
People Inside = 0-50
Humidity = 60-85

Room Temperature (C)   People Inside   Humidity (%)   Cooling Needed
23.2                   30              100            High
24.8                   150             65             Medium
22.1                   20              67             Low
22.5                   20              69             Low
14                     -10             73             Low
Data Normalization
Clipping
Room Temperature = 16-27 degrees
People Inside = 0-50
Humidity = 60-85

Room Temperature' (C)   People Inside'   Humidity' (%)   Cooling Needed
23.2                    30               85 (was 100)    High
24.8                    50 (was 150)     65              Medium
22.1                    20               67              Low
22.5                    20               69              Low
16.0                    0                73              Low
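Clipping amounts to forcing each value into its allowed range. The sketch below applies that to the room readings from the table; the helper name `clip` and the tuple layout are choices made for this illustration.

```python
def clip(value, lower, upper):
    """Force value into the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


# (room temperature in C, people inside, humidity in %)
rooms = [
    (23.2, 30, 100),
    (24.8, 150, 65),
    (22.1, 20, 67),
    (22.5, 20, 69),
    (14, -10, 73),
]
limits = [(16, 27), (0, 50), (60, 85)]  # allowed range for each column

clipped = [
    tuple(clip(v, lo, hi) for v, (lo, hi) in zip(row, limits))
    for row in rooms
]
for row in clipped:
    print(row)
```

Values already inside their range pass through unchanged; only the humidity of 100, the count of 150 people, and the last row's temperature and people count are altered.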
Data Normalization
Z-Score
Age Range = 23 to 65 (42 years)
Income Range = 7k to 45k (AED 38,000)
#Credit Cards Range = 0-4 (4 credit cards)
Normalization Needed !!

Age   Income   #Credit Cards   Buy Insurance
35    15,000   1               No
45    32,000   3               No
23    7,000    4               No
55    45,000   3               Yes
65    12,000   0               Yes
27    20,000   2               No
33    25,000   2               Yes
Z-Score
Age mean = 40.43    Age std dev = 15.31 (sample standard deviation)
Age' mean = 0       Age' std dev = 1

Age   Age - Age mean   (Age - Age mean) / Age std dev   Age'
35    -5.43            -0.35                            -0.35
45    4.57             0.30                             0.30
23    -17.43           -1.14                            -1.14
55    14.57            0.95                             0.95
65    24.57            1.61                             1.61
27    -13.43           -0.88                            -0.88
33    -7.43            -0.49                            -0.49
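The Age' column can be reproduced with Python's `statistics` module. One detail is worth checking when doing so: the Age' values in the table correspond to the sample standard deviation (`statistics.stdev`, about 15.31 for these ages), whereas the population standard deviation (`statistics.pstdev`) is about 14.17 and would give slightly larger magnitudes. The function name `z_scores` is illustrative.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values as (x - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mu) / sigma for v in values]


ages = [35, 45, 23, 55, 65, 27, 33]

print(round(mean(ages), 2))   # 40.43
print(round(stdev(ages), 2))  # 15.31
print([round(z, 2) for z in z_scores(ages)])
# [-0.35, 0.3, -1.14, 0.95, 1.61, -0.88, -0.49]
```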
Breakout Session
Clean your Data!
Activity
1. Fill the missing Data using Interpolation
2. Remove Duplicate Observations
3. Fix Structural Errors
4. Remove Outliers
5. Apply One Hot Encoding

Years Of Experience   Position     Salary (k AED)
1                     Staff        8
2                     Staff        11
3                     Staff        _
4                     Staff        17
3                     Staff        14
6                     Staff        _
7                     Staff        26
7                     Manager      20
8                     Supervisor   30
9                     Supervisor   33
Clean your Data!
Activity
1. Fill the missing Data using Interpolation
2. Remove Duplicate Observations
3. Fix Structural Errors: Supervisr, staff
4. Remove Outliers
5. Apply One Hot Encoding

Years Of Experience   Staff   Supervisor   Manager   Salary (k AED)
1                     1       0            0         8
2                     1       0            0         11
3                     1       0            0         14
4                     1       0            0         17
3                     1       0            0         14
6                     1       0            0         23
7                     1       0            0         26
7                     0       0            1         20
8                     0       1            0         30
9                     0       1            0         33
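The one hot encoding step can be sketched as follows. The helper name `one_hot` is illustrative, and the category order is chosen to match the Staff, Supervisor, Manager columns of the table.

```python
def one_hot(values, categories):
    """Encode each value as a list with a single 1 in its category's slot."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # raises KeyError for an unknown category
        encoded.append(row)
    return encoded


categories = ["Staff", "Supervisor", "Manager"]
positions = ["Staff", "Manager", "Supervisor"]

for position, row in zip(positions, one_hot(positions, categories)):
    print(position, row)
# Staff [1, 0, 0]
# Manager [0, 0, 1]
# Supervisor [0, 1, 0]
```

Because an unknown category raises an error, structural errors such as "Supervisr" or "staff" need to be fixed before encoding, which is why that step comes earlier in the activity.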
Breakout Session
Data Augmentation
Class Activity
Data Augmentation   Configuration Required   Images
Translate           Right
                    Left
Scale               Smaller
                    Bigger
Rotate              45 clockwise
                    45 counterclockwise
Crop                From top
                    From bottom
                    From side
Data Augmentation
Class Activity - Solution
Data Augmentation   Configuration Required   Images
Translate           Right
                    Left
Scale               Smaller
                    Bigger
Rotate              45 clockwise
                    45 counterclockwise
Crop                From top
                    From bottom
                    From side
Breakout Session
23 7,000 4 No
27 20,000 2 No
33 25,000 2 Yes