2024 Wk5 Exploratory Data Analysis - 1
- Mid-term exam 30% (8th week, 15:00~16:00): online quiz (on the eclass system)
- Perform a full-cycle analysis, from data cleansing to statistical analysis, using the topic and data you are researching or plan to research, and report the results.
- May be done individually, but teams of up to two members are allowed (1 or 2 members per team).
- Attendance 20%
Next week
- Individual presentation on your project topic and plan
- This presentation is not included in the final grade, but you cannot proceed with the project without presenting in the 6th week.
- 15 minutes per presentation
- You can detect and correct missing values, outliers, and data imbalances by checking the distribution by variable.
- You can determine whether variable normalization and renormalization are necessary.
• Preprocessing of structured data using samples of health checkup data from the National Health Insurance Corporation
• Example codebook entries: a unique identifier (primary key); HME_YYYYMM: examination year and month; ...
• The number of rows and columns of the data: how many are there?
• You can find out the unit (unit of measure) of each variable.
• If a codebook doesn't exist, request one from the data collectors.
Understanding Data: Unique Keys
• Check that each case has a correctly assigned unique key (e.g., no duplicate key values).
Understanding Your Data: Data Size
• Open the received data table in your development environment.
• Check with the data collector whether the number of rows and columns is as expected (see the sketch below).
• The number of rows and columns may be wrong due to encoding errors, etc.
• Looking at the last row and last column can also help you visually catch encoding errors, etc.
• Depending on the number of rows and columns, the choice of AI model architecture or the train/test/validation split strategy may change.
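As a minimal sketch of this check in Python with pandas (the file name and encoding below are hypothetical):

```python
import pandas as pd

# Load the received table; the file name and encoding are illustrative assumptions.
df = pd.read_csv("health_checkup_sample.csv", encoding="utf-8")

# Number of rows and columns -- compare against the figures the data collector reports.
print(df.shape)

# Looking at the first and last rows helps visually catch encoding or truncation issues.
print(df.head())
print(df.tail())
```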
• Inclusion criteria
• In clinical studies, inclusion criteria refer to the criteria for selecting study participants.
• In medical data science, it refers to the criteria for which data are included in research and model training.
• Example: “Visited the emergency room for **** between January 1, 2010 and December 31, 2022”
• Exclusion criteria
• Criteria for excluding data from the study
• You should receive descriptive material detailing inclusion/exclusion criteria along with the data and codebook.
• Example: “Targets the elderly” -> “Targets those aged 65 or older as of the date of examination” -> (variable name) ≥ 65
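A sketch of turning such a criterion into a row filter with pandas; the toy table and the column name "age" are hypothetical stand-ins for (variable name):

```python
import pandas as pd

# Toy table standing in for the real dataset; "age" is a hypothetical column name.
df = pd.DataFrame({"patient_id": [1, 2, 3], "age": [70, 45, 66]})

# Inclusion criterion: "aged 65 or older as of the date of examination".
included = df[df["age"] >= 65]
print(included)
```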
Understanding the Data: Inclusion/Exclusion Criteria
• In clinical studies, inclusion/exclusion criteria determine whether each row (case) is included or excluded.
• In medical data science, you can decide whether to include/exclude not only rows but also columns (variables) in research and learning.
Understanding the Data: Inclusion/Exclusion Criteria
• A hypothetical conversation you can expect when an AI model is developed and trained without clearly set inclusion/exclusion criteria:
"The external validation performance is not good. Looking at it again, I should have included only adult patients, but data of all ages were included. Please exclude patients under 18 and retrain."
"All right."
(A few weeks later)
"Professor, I have finished the retraining."
"The external validation performance is still not good. Looking at it again, I should have excluded cases of **** and included only cases of ****, but I didn't. Please exclude those patients and retrain."
(The cycle repeats.)
Understanding Data: Data Types
• Categorical variable
• Ordinal variable
• Similar to a categorical variable, but a variable that has an order or rank between the categories.
• Example: Pain level (none, slight, moderate, severe), rank (1, 2, 3, ....)
• Continuous variable
• Discrete variable
• Data that can take only specific values, with nothing in between; often expressed as whole numbers (e.g., counts)
• Date/time variable
• Example: Date of hospital visit, date of emergency room visit, date of death, etc. (2024-01-01 00:00)
• Rather than being used directly in the model, it is used to calculate elapsed time, etc. or to set inclusion/exclusion criteria.
• Text variable
• Be careful: data may be read differently than the data collector intended during processes such as encoding conversion.
• Most AI models that process structured data take categorical and continuous variables as input.
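A small sketch of checking and fixing how these variable types were read in with pandas; the column names and values are invented for illustration:

```python
import pandas as pd

# Toy data mixing the variable types above; names and values are illustrative.
df = pd.DataFrame({
    "sex": ["M", "F", "F"],                                    # categorical
    "pain_level": ["none", "moderate", "severe"],              # ordinal
    "weight_kg": [70.2, 55.1, 80.4],                           # continuous
    "visit_count": [1, 3, 2],                                  # discrete
    "visit_date": ["2024-01-01", "2024-02-15", "2024-03-03"],  # date/time
})

# See how each column was actually read in; encoding problems often surface here.
print(df.dtypes)

# Convert columns to the intended types before modeling.
df["sex"] = df["sex"].astype("category")
df["pain_level"] = pd.Categorical(
    df["pain_level"],
    categories=["none", "slight", "moderate", "severe"],
    ordered=True,
)
df["visit_date"] = pd.to_datetime(df["visit_date"])
print(df.dtypes)
```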
• “Models should be as simple as possible, but not too simple.” – Albert Einstein
• Multi-collinearity
• There are pairs or clusters of variables that are highly correlated with each other.
• Examples: Weight and body mass index, raw and percentage scores of certain test results
• Problem: It only increases the dimensionality and does little to improve model performance.
• Problem: It feeds data that violate the independence assumption to AI models that operate under the assumption that input variables are independent of each other.
• Curse of dimensionality: when the dimensionality becomes too high, problems arise across AI models that rely on distance-based similarity.
• Problems that can occur: training does not converge, or adding information actually degrades model performance.
• Rule of thumb: the number of dimensions (columns) < the number of cases (rows)
• Feature selection
• feature = variable = column
• Need for domain knowledge (Medicine): Need to understand related medical principles and context of examination, etc.
• Feature extraction
• A way to condense multiple variables into one (see the sketch below)
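One common feature-extraction technique is principal component analysis (PCA); the sketch below, on synthetic data, condenses three correlated columns into one extracted feature. It is only an illustration, not a method prescribed by the lecture:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic matrix of three highly correlated measurements for 100 cases.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base + rng.normal(scale=0.1, size=(100, 1)) for _ in range(3)])

# Condense the three correlated columns into a single extracted feature.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```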
Dimension reduction
Text tokenization
• When developing real AI solutions in the healthcare sector, the time spent on EDA is much longer and more important than the time spent on model development and training.
• Garbage in, garbage out: Data quality determines model quality, and managing data quality is EDA.
• If you start developing an AI model without sufficient EDA, you will encounter an infinite loop of data debugging.
EDA
Distribution by variable
• It is necessary to determine which of the following each variable belongs to (based on knowledge of relevant medical fields):
• Independent variable: In research that builds a prediction model, the independent variables serve as predictors of the dependent variable.
• Dependent Variable: In the medical field, this is called “Outcome” and is the final result or management objective in the research design.
• Auxiliary variables: Variables that are not predictors but are needed to explain the case, such as unique key, test date, etc.
• For EDA it does not matter whether you use spreadsheets, R, Python, SAS, Stata, or SPSS; just be careful about automatic encoding conversion.
• The EDA and preprocessing steps must be documented in the order they were performed to ensure reproducibility.
Missing values
• Numbers that appear to be continuous numbers, such as 999 or -999, but actually represent missing values.
• Strings indicating a missed or failed measurement, such as “missing”, “missed”, “-”, “null”, “none”, or “NaN”
• 0: Check whether it is a “real zero” (a measurement was taken and its value is 0) or a 0 that means missing.
• You can count the number of missing values (NA) using functions such as isna(), sum() in Python or is.na(), sum() in R.
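A minimal sketch of this count in Python; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy table with missing values; column names are illustrative.
df = pd.DataFrame({
    "fbs": [95.0, np.nan, 110.0, np.nan],
    "urine_protein": [np.nan, np.nan, np.nan, 1.0],
})

# Number of missing values per variable (column), and the per-column missing rate.
print(df.isna().sum())
print(df.isna().mean())
```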
Missing values
• Check with the data collector to see if the frequency or pattern of missingness is within a normally expected range.
• Identify if there is a pattern in which missing values are concentrated in certain dates, locations, patient characteristics (gender,
age, etc.), and discuss any suspicious patterns with the data collector.
• Measures the proportion of missing values relative to the number of data cases (number of rows) for each variable.
• If the missing rate is too high, consider removing variables (removing columns).
• Example) Out of 1,000 patients, 900 patients had missing urine protein test results -> urine protein test is excluded from the variables.
• If the missing rate is negligibly low and there are sufficiently many cases (rows) without missing, remove missing cases (row
removal).
• Example) Out of 1,000 patients, fasting blood sugar test results are missing for 10 patients -> 10 patient cases are removed.
• There is no absolute threshold for missingness rate or number of missing cases. It varies depending on the clinical context.
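A sketch of the missing-rate check and the column/row removal described above; the data and the 0.8 threshold are purely illustrative, since (as noted) there is no absolute cutoff:

```python
import numpy as np
import pandas as pd

# Toy data; the threshold below is illustrative, not a clinical recommendation.
df = pd.DataFrame({
    "fbs": [95.0, np.nan, 110.0, 102.0, 98.0],
    "urine_protein": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

missing_rate = df.isna().mean()
print(missing_rate)

# Remove variables (columns) whose missing rate is too high (here, above 0.8).
df = df.drop(columns=missing_rate[missing_rate > 0.8].index)

# Remove the few remaining cases (rows) that still contain missing values.
df = df.dropna()
print(df)
```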
Missing values
• Cases where the number of cases (rows) is small and every case matters, such as a rare disease or a test with a high measurement cost
• Cases where some variables have a substantial missing rate but are important in terms of medical mechanism and must be included in the model
• Replace with representative value: Replace with representative value such as mean, median, or mode depending on the data distribution type.
• Replace with a special value: a realistic value the variable can take given its clinical context, such as its minimum, maximum, or a typical normal value
• Example 1) Since the average of all cases where weight was measured is 65 kg, all cases with missing weight are replaced with 65 kg <-
Mean replacement
• Example 2) Replace all cases with 36.5°C for missing body temperature <- Replace special value
• Example 3) For cases where smoking status is missing, all are considered non-smokers <- Special value substitution
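A sketch of the three imputation examples above with pandas; the values and column names are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are illustrative.
df = pd.DataFrame({
    "weight_kg": [60.0, np.nan, 72.5, np.nan],
    "body_temp_c": [37.1, np.nan, 36.8, np.nan],
    "smoker": ["yes", np.nan, "no", np.nan],
})

# Example 1) mean imputation for weight.
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].mean())

# Example 2) special-value imputation: assume a normal body temperature of 36.5 C.
df["body_temp_c"] = df["body_temp_c"].fillna(36.5)

# Example 3) special-value imputation: treat unknown smoking status as non-smoker.
df["smoker"] = df["smoker"].fillna("no")
print(df)
```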
Missing values
• Nearest-neighbor imputation: replace the missing value with the value from the case(s) whose other measured variables are most similar.
• Model-based imputation: build a model that estimates the variable to be imputed from the measured data, and fill in its estimate.
• Example 1) For cases missing the residential-area variable, randomly pick one of the cases where it is measured and use its value.
• Example 2) For cases missing the GFR value, find the case with the most similar age and sex and use that case's glomerular filtration rate (GFR) value.
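A sketch of nearest-neighbor imputation for Example 2 using scikit-learn's KNNImputer; the ages, sex codes, and GFR values are invented:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data: GFR is missing for one case; age and sex (coded 0/1) are observed.
df = pd.DataFrame({
    "age": [67, 70, 45, 68],
    "sex": [1, 1, 0, 1],
    "gfr": [55.0, 60.0, 90.0, np.nan],
})

# Fill the missing GFR from the cases most similar on the other variables.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```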
• It is assumed that the data types of variables are pre-organized into two types: continuous and categorical.
• Most of them are caused by input errors or encoding mismatch during data format conversion.
• Example 1) Gender was coded as 1 for male and 2 for female, but a case with gender 12 was detected.
[Figure: frequency plots of the gender codes (1, 2, 12) and of weight (kg), used to spot outliers]
Outlier detection and handling
• Example) Cases with gender coded as 12 are hard to resolve to male or female, and it is difficult to call the patient back to confirm, so they are treated as missing values.
• Example) Cases with weight entered as 0 kg are handled in the same way as missing values.
Normalization
• Most AI models work well when the distribution is unimodal and symmetric, similar to a normal distribution.
• The process of transforming a distribution into something closer to a normal distribution is called normalization.
• Models for which normalization is essential: parametric models such as linear regression and logistic regression (models that assume normally distributed variables)
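A sketch of normalization on a synthetic right-skewed variable, using a simple log transform and scikit-learn's PowerTransformer; the data are invented and the choice of transform depends on the actual distribution:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed variable (e.g., a lab value); purely illustrative.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# A log transform often makes a right-skewed distribution more symmetric.
x_log = np.log1p(x)

# PowerTransformer (Yeo-Johnson) estimates a transform toward a normal-like shape.
x_norm = PowerTransformer(method="yeo-johnson").fit_transform(x)
print(x.mean(), x_log.mean(), x_norm.mean())
```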
Rescaling (renormalization)
• Most AI models work well when the distribution width of multiple variables, i.e. the scale, is nearly constant.
• For example, if one variable's distribution width (scale) is as small as 0.000001 while another's is as large as 1.12×10^15, variables with small scales risk being relatively underestimated.
• Example) On the same dataset, the performance of a prediction model can differ depending on whether or not the variables are rescaled.
• Method 1) Minmax scaling: Transform the distribution of variables so that the minimum value is 0 and the maximum value is 1.
• Method 2) Standardization (standard scaling): transform the variable distribution so that the mean is 0 and the standard deviation is 1.
• Can be processed with MinMaxScaler and StandardScaler classes of Python sklearn.preprocessing library
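A minimal sketch of both rescaling methods with the classes named above; the two-column array is an invented example of variables on very different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two hypothetical variables on very different scales.
X = np.array([
    [0.000001, 1.1e15],
    [0.000002, 0.9e15],
    [0.000003, 1.3e15],
])

# Method 1) Min-max scaling: each column is mapped to the range [0, 1].
print(MinMaxScaler().fit_transform(X))

# Method 2) Standardization: each column is rescaled to mean 0, standard deviation 1.
print(StandardScaler().fit_transform(X))
```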
Correlation exploration
• Most AI models are based on the assumption that all input variables for prediction are independent of each other.
• In reality, if you use medical data as is, pairs of variables that are not independent at all and are highly correlated are common.
• When highly correlated variables are used, the computational load increases without helping the model prediction performance.
• Highly correlated pairs of variables should be selected based on clinical context and the rest should be eliminated.
Correlation exploration
• Correlation exploration method
• In Python, use corr() from the pandas library and heatmap() from the seaborn library.
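A sketch of this exploration with corr() and heatmap(); the toy data deliberately contain a correlated pair (weight and BMI) and the column names are illustrative:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy data with an obviously correlated pair (weight and BMI).
rng = np.random.default_rng(0)
weight = rng.normal(65, 10, 200)
height = rng.normal(1.68, 0.08, 200)
df = pd.DataFrame({"weight": weight, "height": height, "bmi": weight / height**2})

# Pairwise correlation matrix and its heatmap.
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```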
Time series data preprocessing
• Structured data (data that can be expressed in a spreadsheet) + the added attribute of time flow
• Electroencephalography (EEG)
• Electrocardiogram (ECG)
• sleep record
Time series data preprocessing
• Trend Component
• Depending on the clinical context of data collection, it may be necessary to remove trend components.
• Examples: Long-term rising or falling trends in electrocardiogram data, long-term falling trends in birth data, etc.
• Seasonal Component
• Example: Season and day of week effects in emergency room visit data
• Irregular Component
• For example, in the case of data on the number of confirmed COVID-19 cases, missing data occurs if testing is not conducted on public
holidays, and missing data of this nature are periodic or follow certain rules.
• For time series data measured by sensors, missing data may occur due to sensor errors or network issues, and missing
data of this nature have the characteristic of occurring continuously over a certain period of time.
• For structured data, missing values can be removed by considering the missing ratio or interpolated with a representative value.
• However, because time series data changes over time, another missing value interpolation method is needed.
• Replace missing values with the value observed immediately before the missing value
• Example: Replace the missing number of confirmed COVID-19 cases on May 5, a public holiday, with the number of confirmed cases on May 4.
• Replace missing values with values observed immediately after the missing value
• Example: Replace the missing number of confirmed COVID-19 cases on May 5, a public holiday, with the number of confirmed cases on May 6.
• Replace missing values with the mean or median of a given time interval immediately preceding the missing value.
• How to estimate missing data from non-missing data using linear or spline models
• Use an artificial intelligence generative model instead of a linear function or spline function as an interpolation model.
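A sketch of these interpolation options on a tiny daily series with one missing value; the dates and counts are invented:

```python
import numpy as np
import pandas as pd

# Daily counts with a missing value on the May 5 holiday; numbers are illustrative.
idx = pd.to_datetime(["2021-05-03", "2021-05-04", "2021-05-05", "2021-05-06"])
s = pd.Series([120.0, 135.0, np.nan, 140.0], index=idx)

print(s.ffill())                       # value observed immediately before the gap
print(s.bfill())                       # value observed immediately after the gap
print(s.fillna(s.rolling(2, min_periods=1).mean()))  # mean of the preceding interval
print(s.interpolate(method="linear"))  # linear interpolation between neighbors
```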
Time series data preprocessing
• Noise refers to anything that causes unintended distortion of the data.
• Example: the sensor value staying at 0, or stuck at its minimum or maximum, for a certain period of time
• How to deal with it: Remove noise intervals or consider them as missing values and then apply interpolation
• Sensors that measure vital signs, such as for electrocardiograms, often use piezoelectric elements, and noise can easily occur due to body movement or external vibration.
• When recording breathing sounds with an electronic stethoscope, noise such as surrounding noise or vibration is also recorded.
Time series data preprocessing
• If there is a rule to the noise and a model can be built that mimics the noise component according to that rule, the noise model can be used to subtract the noise from the signal.
• Smoothing
• By applying a moving average over an arbitrary time interval, small fluctuations are removed.
• In noisy environments, this is not a good method because the noise itself changes the moving average.
• Filtering
• When the noise rules are unknown and a lot of noise is generated
• Assuming the noise follows a known distribution such as a Gaussian, it is removed through filtering (e.g., a Kalman filter).
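A sketch of the smoothing approach only (a moving average over an arbitrary window) on a synthetic noisy signal; a Kalman or other filter would need a separate, more involved example:

```python
import numpy as np
import pandas as pd

# Synthetic noisy sensor signal: a slow sine wave plus random noise.
rng = np.random.default_rng(0)
t = np.arange(500)
signal = np.sin(2 * np.pi * t / 100) + rng.normal(scale=0.3, size=t.size)

# Smoothing: a centered moving average removes small, fast fluctuations.
smoothed = pd.Series(signal).rolling(window=11, center=True).mean()
print(smoothed.dropna().head())
```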
Time series data preprocessing
• Fourier transform
• The Fourier transform is a mathematical operation that converts a signal sampled in time into the same signal represented in the frequency domain.
• In signal processing, the Fourier transform reveals important characteristics of a signal, namely its frequency components.
• The Fourier transform is expressed as follows for a vector x with n uniformly sampled points:
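In the usual convention, the discrete Fourier transform of a vector x = (x_0, ..., x_{n-1}) with n uniformly sampled points is

Y_k = \sum_{j=0}^{n-1} x_j \, e^{-2\pi i \, jk/n}, \qquad k = 0, 1, \ldots, n-1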
Time series data preprocessing
• Fourier transform
• Time series data like the one on the left can be converted into frequency components like the one on the right.
• We also find frequency peaks and convert them into structured data to apply artificial intelligence models.
• The Fourier transformed graph itself is also used as an image to apply the artificial intelligence model.
https://kr.mathworks.com/help/matlab/math/fourier-transforms.html
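A sketch of extracting a dominant frequency component with NumPy's FFT; the sampling rate, signal, and 5 Hz component are invented for illustration:

```python
import numpy as np

# Synthetic signal sampled at 100 Hz with a dominant 5 Hz component plus noise.
fs = 100.0
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.normal(size=t.size)

# Magnitude spectrum of the real-valued signal and its frequency axis.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

# The peak frequency can then be used as a structured (tabular) feature.
print("peak frequency:", freqs[np.argmax(spectrum[1:]) + 1], "Hz")
```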
Time series data preprocessing
• Extract peak frequency components through the Fourier transform, measure their maximum, minimum, median, variance, etc., and convert them into structured (tabular) features.
• Alternatively, convert the time series into images that clearly show the features of interest and apply image-based AI models.
https://en.wikipedia.org/wiki/Spirometry
https://www.medikro.com/understanding-your-spirometry-test-results/2024/02/21/
Summary
- Codebook, unique key, data size, inclusion/exclusion criteria, data types
- Dimension reduction
- Reduce dimensionality and avoid the curse of dimensionality through variable selection or variable extraction.
- Text tokenization
- Convert data into a type that is easy for computers to understand, using methods such as one-hot encoding.
- Visualize distributions by variable, detect outliers and missing values, interpolate, normalize, and rescale.
- Correlation exploration
- Visualize the correlations between all variables and, from each highly correlated pair, keep only one.
- Identify and remove the main components of time series data (trend, seasonal, and irregular components); transformations such as the Fourier transform can also be used.
NHISS (Health Insurance Corporation) data
-https://nhiss.nhis.or.kr/bd/ay/bdaya001iv.do
NHISS (Health Insurance Corporation)
- Demo data: 4 tables