
Case Study 1

Fundamentals of Machine Learning

Dr.rer.nat. Anda-Ramona Tănasie

Institut für Informatik, FHWN

March 1, 2024

Dr.rer.nat. Anda-Ramona Tănasie (FHWN Informatik) Case Study 1 March 1, 2024 1 / 22


Disclaimer

This is just a quick overview and does not replace a Fundamentals of
Machine Learning course!
I give a very brief introduction to:
Different flavours of AI
How AI is built
How to build AI
How to use AI to enhance my work



History of AI

dates back to the 1950s
"AI Winter"
increased data availability and increased computational power over the last two
decades
Neural networks (1950s)
even older: e.g. Bayesian statistics, Markov chains...
often, simple "older" ideas are put to new use!
AI is a goldmine today, with generative AI in focus
still many issues with AI: black boxes, data privacy...



Data Science Ingredients

Data
Model (with parameters) that fits our data = A function
A notion of "Goodness" we want to achieve
Note that it is very important to set, from the beginning, the benchmark for the
performance measure we want to capture



The data

D (data), raw: time, coordinates, color
I (information), meaning: traffic light, location
K (knowledge), context: driving towards the traffic light
W (wisdom), applied: stop the car



The data
Available? If not, gather the data.
Files, DB, ...
Types of data and data types
Exploratory Data Analysis:
get familiar with your data
Types of Variables:
Numerical vs categorical variables,
formatting, transformations
Data Cleaning: Missing values, errors, outliers...
Tools:
▶ R and/or Python...
▶ statistics: descriptive (uni- and multivariate) and inferential, hypothesis
testing...
▶ linear algebra, calculus, ...
Let’s clarify a few of these in more detail
Structured data

Data and Datasets = Collection of Samples (observations, data points)


Features describe properties of the data points
▶ categorical: predefined values without order
Example: male and female
▶ ordinal: predefined values with an intrinsic order
Examples: disease stage, grades...
▶ numerical (e.g., real values)
Each feature = one dimension of the feature space
Its value for a particular data point places the point along that dimension
Time dependency sometimes plays a central role, e.g. in weather forecasts.
In ML, including time as a distinguished continuous variable remains challenging:
methods are limited to discretizing the time dimension, directly or indirectly.
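A minimal sketch of how the three feature types might become coordinates in a feature space: the categorical feature is one-hot encoded, the ordinal feature is mapped to its rank, and the numerical feature is used as is. The feature names (`sex`, `stage`, `age`) and the helper `encode` are hypothetical and only illustrate the idea.

```python
# Hypothetical value lists; the order matters only for the ordinal feature.
SEXES = ["female", "male"]            # categorical: no order
STAGES = ["I", "II", "III", "IV"]     # ordinal: intrinsic order

def encode(sample):
    """Map one sample (a dict) to a point in the feature space (a list)."""
    one_hot = [1.0 if sample["sex"] == s else 0.0 for s in SEXES]
    rank = float(STAGES.index(sample["stage"]))  # 0, 1, 2, 3 keeps the order
    return one_hot + [rank, float(sample["age"])]

# Each entry of the resulting list is one dimension of the feature space.
point = encode({"sex": "male", "stage": "III", "age": 54})
```

One-hot encoding avoids imposing an artificial order on the categorical values, while the rank encoding deliberately keeps the order of the ordinal ones.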



Properties of Data

Descriptive Univariate analysis


measures of central tendency
Mean - Arithmetic average of all the values
Median - Middle value in a sorted list of values
Mode - the value that appears most frequently in the dataset
measures of dispersion

Variance - Measure of the spread of data points around the mean ($\mathrm{sd} = \sqrt{\mathrm{Var}}$)
Range of values: Min, Max
Percentiles (values that divide the sorted dataset into 100 equal parts) and
IQR = interquartile range, i.e. the range between the 25th and 75th percentiles (Q1 and Q3)
Distribution Shape:
Skewness = degree of deviation from a symmetric distribution (most values
shifted to one side)
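The measures above can be sketched with Python's standard `statistics` module. The function name `describe` and the sample values are made up for illustration. Skewness is computed here as the third standardized moment, which is one common definition among several.

```python
from statistics import mean, median, mode, pvariance, pstdev, quantiles

def describe(xs):
    """Return the univariate summary statistics listed above as a dict."""
    mu, sd = mean(xs), pstdev(xs)
    # Quartile cut points Q1, Q2, Q3 (inclusive interpolation method).
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    # Skewness as the third standardized moment (population version).
    skew = sum(((x - mu) / sd) ** 3 for x in xs) / len(xs)
    return {
        "mean": mu, "median": median(xs), "mode": mode(xs),
        "variance": pvariance(xs), "sd": sd,
        "min": min(xs), "max": max(xs),
        "iqr": q3 - q1, "skewness": skew,
    }

# Hypothetical right-skewed sample: one large value pulls the mean above the median.
summary = describe([1, 2, 2, 3, 4, 5, 20])
```

For a right-skewed sample like this one, the mean lies above the median and the skewness is positive.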



Examples

Real life examples:

Skewness
income distribution is right-skewed: most people have moderate incomes,
but a few individuals have extremely high ones
2020 corona vaccine priority to the elderly: left-skewed
also seen in the summary stats (mean vs. median)

Small Percentiles
Real life example: child weight



Properties of Data

Inferential univariate Analysis


one-sample z-test or t-test
▶ both test hypotheses about the mean of a population
▶ difference: is the population standard deviation known (z-test) or estimated from the sample (t-test)?
▶ assumptions: normally distributed data, or a large sample (central limit theorem)
confidence intervals
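A minimal sketch of the one-sample z-test and the matching confidence interval, assuming the population standard deviation `sigma` is known. The function names and sample values are hypothetical. A t-test would replace `sigma` with the sample standard deviation and the normal distribution with Student's t distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu0, sigma):
    """Two-sided z-test of H0: population mean == mu0, with known sigma."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def z_confidence_interval(sample, sigma, level=0.95):
    """Confidence interval for the population mean, with known sigma."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * sigma / sqrt(len(sample))
    return mean(sample) - half_width, mean(sample) + half_width

# Hypothetical measurements; is the population mean 5.0?
sample = [5.1, 4.9, 5.3, 5.2, 5.0]
z, p = one_sample_z_test(sample, mu0=5.0, sigma=0.2)
low, high = z_confidence_interval(sample, sigma=0.2)
```

The two views agree: the null value lies inside the 95% interval exactly when the two-sided p-value is above 0.05.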



Properties of Data

Bivariate analysis
- empirical relationship between two variables
- statistical significance
- depends on the types of the features involved

Pearson correlation coefficient
▶ two continuous variables
▶ measures linear correlation (values in [−1, 1])
Z-test, t-test: comparing the means of two groups
ANOVA: comparing the means of ≥ 3 groups (one continuous and one
categorical variable).
Cross tabulation: displaying the frequencies (counts) or percentages in
a matrix format for two categorical variables
χ² statistic: test for independence
Correlation does not imply causality!
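Two of the measures above, sketched in plain Python: the Pearson correlation coefficient for two continuous variables, and the χ² statistic computed from a cross tabulation of two categorical variables. The function names are chosen here for illustration. The sketch returns only the statistic, not the p-value of the independence test.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def chi_square_statistic(table):
    """Chi-square statistic of a cross tabulation (list of rows of counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat
```

A perfectly linear increasing relationship gives a coefficient of 1, and a table whose counts match the independence expectation exactly gives a statistic of 0.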
Data Cleaning

Missing Values
Are the values missing completely at random or not?
delete observations, if a large number of observations is still available
Note that if a feature has lots of missing values, you might consider deleting the
feature rather than lots of rows
imputation methods
▶ mean/mode for continuous/categorical variables
▶ predictions

Detecting and treating Outliers

Example: they can change the slope of a regression line
delete, e.g. incorrect measurements
capping, e.g. replace all values above the 95th percentile with the 95th percentile itself
use predictions
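Mean imputation and capping can be sketched as follows. This assumes missing values are represented as `None` and uses the "inclusive" percentile definition from the `statistics` module. The function names are hypothetical, and other percentile conventions give slightly different thresholds.

```python
from statistics import mean, quantiles

def impute_mean(values):
    """Replace missing entries (None) by the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def cap_upper(values, pct=95):
    """Replace all values above the pct-th percentile by that percentile."""
    # quantiles(..., n=100) returns the 99 percentile cut points.
    threshold = quantiles(values, n=100, method="inclusive")[pct - 1]
    return [min(v, threshold) for v in values], threshold

filled = impute_mean([1.0, None, 3.0])
# Hypothetical data with one extreme value at the end.
capped, threshold = cap_upper(list(range(1, 101)) + [1000])
```

Mean imputation leaves the mean unchanged but shrinks the variance, which is one reason to check first whether the values are missing completely at random.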



Data transformations

are an important preprocessing step


can have a profound effect on the model performance

Common data transformations


Normalization (Min-Max Scaling): re-scales to a fixed range, typically [0, 1]
$$x_{\mathrm{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Standardization (Z-score Normalization):
$$z = \frac{x - \mu}{\sigma}$$
resulting in a distribution with a mean of 0 and a standard deviation of 1.
Log Transformation:
can help linearize relationships, stabilize the variance, and bring the data closer to
a normal distribution; reduces skewness in positively skewed data
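The three transformations above as a minimal Python sketch. It assumes the values are not all equal (otherwise the denominators are zero) and, for the log transformation, strictly positive. The function names are chosen here for illustration.

```python
from math import log
from statistics import mean, pstdev

def min_max_scale(xs):
    """Min-max scaling to [0, 1]: (x - x_min) / (x_max - x_min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Z-score: (x - mu) / sigma, using the population standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def log_transform(xs):
    """Natural logarithm of each (strictly positive) value."""
    return [log(x) for x in xs]

scaled = min_max_scale([2, 4, 6])
z_scores = standardize([2, 4, 6])
```

In a modelling workflow, the scaling parameters (min, max, mean, standard deviation) would typically be estimated on the training data only and then reused on new data.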



Feature engineering

Feature engineering
using domain knowledge to extract new variables from raw data
Aims:
make machine learning algorithms work more effectively and/or
provide deeper insights.

Products, or ratios of features,


linear combinations of features
Polynomial Features
Binning: Transforming continuous variables into categorical variables.
Aggregation: Creating summary statistics for data grouped by key.
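A few of the techniques listed above, sketched in Python. All feature names, cut-offs and helper names are hypothetical examples, not taken from a real dataset.

```python
from collections import defaultdict
from statistics import mean

def ratio_feature(weight_kg, height_m):
    """Ratio of features: body-mass index from weight and height."""
    return weight_kg / height_m ** 2

def polynomial_features(x, degree=3):
    """Polynomial features x, x^2, ..., x^degree."""
    return [x ** d for d in range(1, degree + 1)]

def bin_age(age):
    """Binning: continuous age -> categorical age group (made-up cut-offs)."""
    if age < 18:
        return "child"
    return "adult" if age < 65 else "senior"

def mean_by_key(rows, key, value):
    """Aggregation: mean of `value` over the rows grouped by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: mean(v) for k, v in groups.items()}
```

Each derived feature adds a dimension to the feature space, so generating many of them without domain reasoning increases the risk of overfitting.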

Challenges and Considerations


Overfitting, Dimensionality, Interpretability
Data quality

Missing Data, Duplicate Records


Inconsistent data: merging data from different sources, human error
outliers: points that differ significantly from the majority (local or
global)
data relevance, data granularity
Data imbalance
▶ Carefully chosen ML methods
▶ undersampling the majority class / oversampling the minority class
▶ tweaking the misclassification cost in the objective function.
Bias: will the model generalize?
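Random oversampling of the minority class, as one simple way to handle data imbalance, might look like the sketch below. The function name and the toy labels are hypothetical. It duplicates randomly chosen minority samples until every class is as large as the largest one.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate random samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y

# Toy data: 8 samples of class 0 and 2 samples of class 1.
x = list(range(10))
y = [0] * 8 + [1] * 2
x_bal, y_bal = random_oversample(x, y)
```

Oversampling should be applied to the training data only; duplicating samples before splitting off a test set would leak copies of the same point into both sets.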



Explain keywords

Data trends
Multimodality
Stratified sampling



Always integrate problem-specific knowledge (e.g. a concentration cannot be negative)

Pitfalls
Is the data really representative? Do the results generalize?

Simulations to study properties of your methods



Conclusions

Data Literacy
is becoming more and more important across various jobs
Asking the right questions is as important as answering them.

Exercise: Berkeley Admissions dataset

Structured or unstructured data?
Extract the size of the data. Which features do we have?
Calculate admission rates for men and women
Answer:
34.5% of female applicants and 44.2% of male applicants were admitted.
Gender bias!
Or is it?



Berkeley Admissions Dataset: Lessons
Simpson’s Paradox: a pattern at the aggregate level is reversed in the subgroups
Example: gender bias at the University of California, Berkeley.

34.5% of female and 44.2% of male applicants were admitted
Trend given by departments with lots of applicants
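The reversal can be reproduced with a small sketch. The counts below are made up for illustration and are NOT the Berkeley data; the department names and helper functions are hypothetical as well.

```python
# Made-up counts (admitted, applicants) per department and gender.
DATA = {
    "easy department": {"men": (80, 100), "women": (18, 20)},
    "hard department": {"men": (4, 20), "women": (30, 100)},
}

def rate(admitted, applicants):
    """Admission rate as a fraction."""
    return admitted / applicants

def overall_rate(gender):
    """Admission rate of one gender aggregated over all departments."""
    admitted = sum(dept[gender][0] for dept in DATA.values())
    applicants = sum(dept[gender][1] for dept in DATA.values())
    return rate(admitted, applicants)

# Admission rate per department and gender.
per_department = {
    name: {gender: rate(*counts) for gender, counts in dept.items()}
    for name, dept in DATA.items()
}
```

In this toy example women have the higher admission rate in each department, yet the lower rate overall, because most women applied to the department that admits few applicants.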



Correlation vs Causation

Confounding variables
Example
eating ice cream and getting sunburned
There’s a correlation between eating ice cream and getting sunburned, but
neither event actually causes the other. Instead, both events are caused by
something else: sunny weather.
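A small simulation can illustrate the confounder. All probabilities below are made up: sunny weather raises the chance of both eating ice cream and getting sunburned, while the two events are otherwise generated independently of each other.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def simulate(n=20000, seed=1):
    """Draw (sunny, ice_cream, sunburn) triples; the probabilities are made up."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        sunny = rng.random() < 0.5
        ice_cream = rng.random() < (0.8 if sunny else 0.2)
        sunburn = rng.random() < (0.6 if sunny else 0.05)
        rows.append((int(sunny), int(ice_cream), int(sunburn)))
    return rows

rows = simulate()
# Correlation over all days: clearly positive.
r_all = pearson_r([r[1] for r in rows], [r[2] for r in rows])
# Correlation on sunny days only: close to zero once the confounder is fixed.
sunny_rows = [r for r in rows if r[0] == 1]
r_sunny = pearson_r([r[1] for r in sunny_rows], [r[2] for r in sunny_rows])
```

Conditioning on the confounder (looking at sunny days only) removes the association, which is what one would expect if neither event causes the other.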



Bias-Variance Tradeoff

Wikipedia : Bias-Variance Tradeoff
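For reference, the standard decomposition behind the tradeoff can be written as below. It assumes squared-error loss and data generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$; the notation ($f$, $\hat f$, $\varepsilon$) is chosen here and does not come from the slides.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Simple models tend to have high bias and low variance, flexible models the opposite; the expectation is taken over training sets and noise.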



Precision vs Accuracy

Accuracy: “the degree of closeness of measurements to true value”
Precision: “the degree to which repeated measurements under
unchanged conditions show the same results”.
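One possible way to put numbers on the two notions, as a sketch: accuracy as the distance of the average measurement from the true value, precision as the spread of repeated measurements. The function names, the reference value and the measurement series are hypothetical.

```python
from statistics import mean, pstdev

def accuracy_error(measurements, true_value):
    """Distance of the average measurement from the true value (smaller = more accurate)."""
    return abs(mean(measurements) - true_value)

def precision_spread(measurements):
    """Standard deviation of repeated measurements (smaller = more precise)."""
    return pstdev(measurements)

TRUE_VALUE = 10.0  # hypothetical reference value
precise_but_biased = [12.0, 12.1, 11.9, 12.0]  # tight cluster, off target
accurate_but_noisy = [8.0, 12.0, 9.0, 11.0]    # centred on target, wide spread
```

The first series is precise but not accurate, the second accurate but not precise.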
