2024 Wk5 Explorative Data Analysis-1.Ko.en

The document outlines the evaluation criteria for a Healthcare Data Analysis course, including a mid-term exam, project report, and presentation, with specific weightings for each component. It emphasizes the importance of understanding healthcare data, including preprocessing, exploratory data analysis (EDA), and handling missing values. Additionally, it discusses the significance of inclusion/exclusion criteria and data types in the context of AI model development in healthcare.


Healthcare Data Analysis Evaluation

- Mid-term exam 30% (8th-week, 15:00~16:00): On-line Quiz (on eclass system)

- May be taken in the classroom or at any other location

- Assesses the level of theoretical understanding of the lecture

- Project Report 20%: Data analysis project (uploading on eclass system)

- Project presentation 30%

- Evaluation of the presentation of the data analysis project results

- Projects are conducted individually, but teams of up to two members are allowed.

- Using the topic and data you are researching or plan to research, perform a full-cycle analysis, from data cleansing to statistical analysis, and report the results.

- Attendance 20%
Next week

- Individual presentation on

- your research interest -> mandatory

- Your data of interest -> mandatory

- Your preferred analysis method -> optional

- Your thesis -> optional

- This presentation will not be included in the final grade, but you cannot proceed with the
project without presenting in week 6.

- Language: Korean or English

- For 15 minutes

- The final presentation will be in week 14


Explorative Data
Analysis (EDA)
wk7
Learning Objectives

- Understand common problems found in healthcare data.

- Explain which aspects of healthcare data require preprocessing.

- Explain the preprocessing measures for each problem situation.

- Explain what Exploratory Data Analysis (EDA) is.

- Detect and correct missing values, outliers, and data imbalances by checking the distribution of each variable.

- Determine whether variable normalization and rescaling are necessary.

- Identify and process pairs of highly correlated variables.

- Explain the preprocessing methods for time series data.


Understanding the data

• Preprocessing of structured data using samples of health checkup data from the National Health Insurance Corporation

• Table name: NSC2_G1E_0208.csv


Health Checkup Table

• EXMD_BZ_YYYY: Examination year

• RN_INDI: Personal identification number (primary key)

• HME_YYYYMM: Examination year and month

• G1E_HGHT: Height (cm)

• G1E_WGHT: Body weight (kg)

• G1E_BMI: Body mass index

• ....

• The above information is specified in the "Code Book"
Understanding the data

• The first things to check once you have the data:

• Was explanatory material, such as a codebook attached to the data, provided?

• If the data consists of multiple tables, what is the primary key?

• How many rows and columns does the data have?

• Is there a standard for rows or columns that will not be used for analysis or learning (inclusion/exclusion criteria)?

• What is the data type of each column?
Understanding the Data: Codebook

• Codebook: A list or table that describes each variable in a data table.


Understanding the Data: Codebook

• What if I don't have a codebook?

• What are the variables in each column?

I don't know what it means

• The data type of each variable is unknown

• You can find out the unit (unit of measure) of each variable.

doesn't exist

• Unable to proceed with processes that must precede

artificial intelligence learning, such as preprocessing

and exploratory data analysis

• What if I ask for a codebook and they don't have one?

• It must be created by interviewing the

data collectors.
Understanding Data: Unique Keys

• The key required to link to data in other tables

• If you have multiple tables and you don't know what the unique key is, you'll need to interview your data collectors to find out.

• If you request a unique key and it doesn't exist, you will need to create one.

• Check whether anonymization has been done correctly.
Understanding Your Data: Data Size
• View the received data table in my development environment

• In Python, use functions such as read_csv or read_excel (library: pandas, openpyxl)

• In R, use functions such as read.csv or read_excel (package: base, readxl)

• Check with the data collector to see if the number of data rows and columns is correct.

• The number of rows and columns may be incorrect due to encoding errors, etc.

• Looking at the last column and last row can also help you visually catch encoding errors, etc.

• Depending on the number of rows and columns, the choice of AI model architecture and the split strategy for the train, test, and validation sets will vary.
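The size check above can be sketched in pandas. The CSV content below is simulated data for illustration only; in practice you would read the actual file path, e.g. NSC2_G1E_0208.csv.

```python
import io
import pandas as pd

# Simulated stand-in for the health checkup CSV (not real data).
csv_text = """EXMD_BZ_YYYY,RN_INDI,G1E_HGHT,G1E_WGHT,G1E_BMI
2009,1001,170.2,65.0,22.4
2009,1002,158.5,52.3,20.8
2010,1003,175.0,80.1,26.2
"""
df = pd.read_csv(io.StringIO(csv_text))  # with a real file: pd.read_csv("NSC2_G1E_0208.csv")

# Check the number of rows and columns against what the data collector reports.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# Inspect the first and last rows to visually catch encoding errors.
print(df.head())
print(df.tail(1))
```

Comparing `df.shape` with the collector's reported counts is the quickest way to catch truncated or mis-encoded files.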


Understanding the Data: Inclusion/Exclusion Criteria

• Inclusion criteria
• In clinical studies, inclusion criteria refer to the criteria for selecting study participants.

• Example: “Patients with **** disease aged 18 years or older”

• In medical data science, it is used to mean the standard of data to be included in research and learning.

• Example: “Visited the emergency room for **** between January 1, 2010 and December 31, 2022”

• Exclusion criteria
• Criteria for excluding data from the study

• Example: “Excluding those with cardiovascular disease”

• You should receive descriptive material detailing inclusion/exclusion criteria along with the data and codebook.

• Inclusion/exclusion criteria should be stated as objectively and unambiguously as possible.

• Example: “Targets the elderly” -> “Targets those aged 65 or older as of the date of examination” -> (variable name) ≥ 65
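As a sketch, the unambiguous criterion "(variable name) ≥ 65" translates directly into a pandas filter. The AGE column name and the data here are hypothetical placeholders for illustration.

```python
import pandas as pd

# Hypothetical example data; the AGE column name is an assumption.
df = pd.DataFrame({
    "RN_INDI": [1, 2, 3, 4],
    "AGE":     [70, 45, 65, 17],
})

# Inclusion criterion stated objectively: AGE >= 65 as of the examination date.
included = df[df["AGE"] >= 65]
print(len(included))

# Exclusion can also apply to columns: drop the unique key before training.
features = included.drop(columns=["RN_INDI"])
```

Writing the criterion as a boolean expression keeps it objective and reproducible, exactly as the slide recommends.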
Understanding the Data: Inclusion/Exclusion Criteria

• In clinical studies, inclusion/exclusion criteria determine whether each row is included or excluded.

• In medical data science, you can decide whether to include/exclude not only rows but also columns (variables).

• The RN_INDI column is a unique key, so it is not suitable as a variable to be included in learning.

• The EXMD_BZ_YYYY and HME_YYYYMM columns represent the year and month of the examination; since they are date data, they are not suitable as variables to be used in learning.
Understanding the Data: Inclusion/Exclusion Criteria

• The expected hypothetical conversation when developing and training an AI model without clearly setting inclusion/

exclusion criteria

"Professor, I have completed the training."

(A few weeks later)

"The external validation performance is not good. Looking again, I should have included only adult
patients, but I included all data regardless of age. Please exclude data for patients under 18 and retrain."

"All right."

(A few weeks later)

"Professor, I have completed the training."

(A few weeks later)

"The external validation performance is still not good. Looking again, I should have excluded the cases of
**** and included only the cases of ****, but I didn't. Please exclude those patients and retrain."

(... and so the loop repeats)
Understanding Data: Data Types

• Categorical variable

• Data that can be divided into two or more categories

• Example: Gender (male, female), Smoking (yes, no)

• Ordinal variable

• Similar to a categorical variable, but a variable that has an order or rank between the categories.

• Example: Pain level (none, slight, moderate, severe), rank (1, 2, 3, ....)

• Continuous variable

• A variable that has continuous numeric values, usually real numbers.

• Weight, height, blood pressure, fasting blood sugar level, etc.

• Pay attention to the number of significant figures.


Understanding Data: Data Types

• Discrete variable

• Data that takes only specific values, with nothing in between, often expressed as a non-negative integer

• Example: Number of hospital visits, length of stay, etc. (0, 1, 2, ….)

• Need to check if it is okay to consider it as a continuous variable

• Date/time variable

• Example: Date of hospital visit, date of emergency room visit, date of death, etc. (2024-01-01 00:00)

• Rather than being used directly in the model, it is used to calculate elapsed time, etc. or to set inclusion/exclusion criteria.

• Must not be entered as a categorical or continuous variable

• Text variable

• Natural language such as dictation, notes, readings, opinions, etc.

• It is appropriate to use it by converting it to categorical data, etc.


Understanding Data: Data Types

• Be careful because data may be read differently than intended by the data collector during processes such as data encoding

and format conversion.

• Most AI models that process structured data take categorical and continuous variables as input.

• Need to convert other data type variables to categorical or continuous


Dimension reduction

• “Models should be as simple as possible, but not too simple.” – Albert Einstein

• Multi-collinearity
• There are pairs or clusters of variables that are highly correlated with each other.

• Examples: Weight and body mass index, raw and percentage scores of certain test results

• Problem: It only increases the dimensionality and does little to improve model performance.

• Problem: It provides data that violates the assumption that variables are independent of each other to AI models that operate

under that assumption.


Dimension Reduction

• The computational complexity increases exponentially with the number of variables.

• Curse of dimensionality

• When the dimensionality becomes too high, problems arise across artificial intelligence models
that utilize distance-based similarity.

• Issues can occur: training fails to converge, or adding information decreases model performance.

• Rule of thumb: the number of dimensions (number of columns) < the number of cases (number of rows)

• Data points become too sparsely distributed in the high-dimensional space.

Dimension Reduction

• Feature selection

• feature = variable = column

• Requires domain knowledge (medicine): you need to understand the relevant medical principles, the context of each examination, etc.

• Feature extraction

• A method of condensing multiple variables into one

(Figure: feature selection vs. feature extraction)


Dimension reduction

(Figure: dimension reduction applied to the given data)
Text Tokenization

• Convert text to a data type that is easy for the computer to process

• One-hot encoding: "only one is hot (1)"

• Example: the 'Inpatient Ward' variable in its original form is categorical text with four distinct categories. One-hot encoding increases the number of variables to 4 and changes the variable type to logical (True/False).
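A minimal sketch of one-hot encoding with pandas; the 'ward' variable and its categories below are hypothetical examples standing in for the inpatient ward variable.

```python
import pandas as pd

# Hypothetical categorical text variable with four distinct categories.
df = pd.DataFrame({"ward": ["ICU", "general", "isolation", "ER", "ICU"]})

# pd.get_dummies() creates one boolean column per category;
# exactly one column is "hot" (True / 1) in each row.
encoded = pd.get_dummies(df["ward"])
print(encoded.shape)                 # 4 new logical variables
print(encoded.sum(axis=1).tolist())  # each row sums to 1
```

The result replaces one text column with four logical (True/False) columns, matching the slide's description.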
EDA
• EDA: Explorative Data Analysis

• Why do you do it?

• Understand the characteristics and problems of data

• To know what preprocessing is needed

• To select an appropriate artificial intelligence model

• To choose a learning and validation strategy

• Preprocessing and EDA are inseparable processes that occur simultaneously.

• When developing real AI solutions in the healthcare sector, the time spent on EDA is much longer and more important than the time

spent on modifying models or performing training.

• Garbage in, garbage out: Data quality determines model quality, and managing data quality is EDA.

• If you start developing an AI model without sufficient EDA, you will encounter an infinite loop of data debugging.
EDA: Distribution by Variable

• In structured data, each column can be viewed as a variable.

• It is necessary to determine which of the following each variable belongs to (based on knowledge of relevant medical fields):

• Independent variable: In research on creating a prediction model, the independent variable becomes a predictor, and the

data is entered to predict the dependent variable.

• Dependent variable: In the medical field, this is called the "outcome"; it is the final result or management objective in the research design, which we want to predict using the independent variables.

• Auxiliary variables: Variables that are not predictors but are needed to explain the case, such as unique key, test date, etc.

• Only independent variables should be included in learning.

• For EDA, it doesn't matter whether you use spreadsheets, R, Python, SAS, Stata, or SPSS; just be careful about automatic encoding conversion.

• EDA and preprocessing processes must be documented in their order and progress to ensure reproducibility.
Missing values

• Need to know what missing values are coded as

• Blank space with no input value (=NA)

• The text "NA" (Not Applicable)

• Numbers such as 999 or -999 that appear to be continuous values but actually represent missing values

• A string indicating a measurement failure, such as "missing", "missed", "-", "null", "none", or "NaN"

• 0: Need to check if it is a “real zero” meaning there is a measurement and that measurement is 0, or if it is 0 meaning missing

• You can count the number of missing values (NA) using functions such as isna(), sum() in Python or is.na(), sum() in R.
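A short sketch combining both points: first recode the known missing codes to real NA, then count with isna() and sum(). The data and the 999 code are hypothetical examples.

```python
import numpy as np
import pandas as pd

# Hypothetical data where missingness is coded several different ways.
df = pd.DataFrame({
    "sbp":  [120, 999, 135, np.nan],       # 999 secretly means "missing"
    "note": ["ok", "NA", "missing", "ok"], # text codes for missingness
})

# Recode the known missing codes to real NA before counting.
df["sbp"] = df["sbp"].replace(999, np.nan)
df["note"] = df["note"].replace(["NA", "missing"], np.nan)

# Count missing values per variable with isna() + sum().
print(df.isna().sum())
```

Counting only after recoding matters: a naive `isna().sum()` on the raw data would miss the 999 and "NA" cases entirely.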
Missing values

• Identify the cause of the missing data

• Check with the data collector to see if the frequency or pattern of missingness is within a normally expected range.

• Identify if there is a pattern in which missing values are concentrated in certain dates, locations, patient characteristics (gender,

age, etc.), and discuss any suspicious patterns with the data collector.

• Measures the proportion of missing values relative to the number of data cases (number of rows) for each variable.

• If the missing rate is too high, consider removing variables (removing columns).

• Example) Out of 1,000 patients, 900 patients had missing urine protein test results -> urine protein test is excluded from the variables.

• If the missing rate is negligibly low and there are sufficiently many cases (rows) without missing, remove missing cases (row

removal).

• Example) Out of 1,000 patients, fasting blood sugar test results are missing for 10 patients -> 10 patient cases are removed.

• Apply missing value imputation when column/row removal is difficult

• There is no absolute threshold for missingness rate or number of missing cases. It varies depending on the clinical context.
Missing values

• When removing rows/columns is difficult

• When the number of cases (number of rows) is small and each case is important because it is a rare disease or a test with high measurement cost.

• Cases where some variables have a significant missing rate but are important in terms of medical mechanisms and must be included in the model

• Missing value imputation methods

• Single value imputation: replace all missing values with a single value.

• Replace with a representative value: replace with the mean, median, or mode, depending on the shape of the data distribution.

• Replace with a special value: depending on the clinical context of the variable, replace all missing values in bulk with a realistic maximum or minimum value the variable can take, a diagnostic criterion boundary value, etc.

• Example 1) The average of all cases where weight was measured is 65 kg, so all cases with missing weight are replaced with 65 kg <- mean imputation

• Example 2) All cases with missing body temperature are replaced with 36.5°C <- special value imputation

• Example 3) Cases with missing smoking status are all treated as non-smokers <- special value imputation
Missing values

• Missing value imputation methods

• Multivariate imputation: replace each missing value with a different value.

• Random imputation: replace with a randomly selected measured value of the same variable

• Nearest neighbor imputation: replace with the value of the most similar case (nearest neighbor) based on the other measured variables

• Model-based imputation: build a model that estimates the variable to be imputed from the measured data

• Example 1) For cases where the residential area variable is missing, replace it with the residential area value of a randomly selected case where it was measured

• Example 2) For cases where the glomerular filtration rate (GFR) value is missing, find another case with the most similar age and gender and replace it with that case's GFR value

• Python: can be done using pandas fillna(), scikit-learn SimpleImputer(), etc.

• Imputation should be done conservatively and in as simple a manner as possible.
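A minimal sketch of single value imputation with pandas fillna(); the variables and values are hypothetical. scikit-learn's SimpleImputer(strategy="mean") applies the same mean strategy to arrays.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing weight and one missing smoking status.
df = pd.DataFrame({
    "weight":  [60.0, np.nan, 70.0, 80.0],
    "smoking": ["yes", "no", np.nan, "no"],
})

# Mean imputation for a continuous variable (single value imputation).
df["weight"] = df["weight"].fillna(df["weight"].mean())

# Special value imputation: treat missing smoking status as non-smoker.
df["smoking"] = df["smoking"].fillna("no")

print(df)
```

Both fills are deliberately simple, in line with the slide's advice to impute conservatively.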


Distribution by variable

• It is assumed that the data types of variables are pre-organized into two types: continuous and categorical.

• Categorical variable: bar graph

• Continuous variables: histogram

(Figure: bar graph of gender counts, 430 male and 240 female; histogram of age)
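The two plot types reduce to simple tabulations; here is a sketch with hypothetical data (the matplotlib calls that actually draw them are noted in a comment).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one categorical and one continuous variable.
df = pd.DataFrame({
    "sex": ["M", "F", "M", "M", "F"],
    "age": [34, 58, 41, 67, 52],
})

# Categorical variable: tabulate category counts (a bar graph plots these).
sex_counts = df["sex"].value_counts()
print(sex_counts.to_dict())

# Continuous variable: bin the values (a histogram plots these bins).
counts, edges = np.histogram(df["age"], bins=3)
print(counts.tolist())

# With matplotlib: sex_counts.plot(kind="bar"); df["age"].hist()
```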


Outlier Detection

• Outlier: Also called outlier or anomaly

• Medically impossible or unreasonable values

• Most of them are caused by input errors or encoding mismatch during data format conversion.

• Example 1) Gender was coded as 1 for male and 2 for female, but a case with gender 12 was detected.

• Example 2) Minimum weight 0kg, maximum 550kg

(Figure: gender codes 1 and 2 with an outlier coded 12; weight distribution ranging from 0 kg to 550 kg)
Outlier Correction and Imputation

• Correction and imputation of detected outliers

• 1) Provide a list of outliers to the clinical data collector and discuss

• 2) Correct through remeasurement, estimation, etc., where possible

• Example) A weight of 550 kg was judged to be an input error for 55 kg -> replace with 55 kg

• Example) If the patient's weight can be remeasured, remeasure and correct

• 3) Cases that cannot be corrected are treated as missing values

• Example) Cases where gender was coded as 12 are difficult to estimate as male or female, and it is difficult to call the patient back to confirm, so they are processed in the same way as missing values

• Example) Cases where weight was entered as 0 kg are processed in the same way as missing values
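A sketch of detecting impossible values with explicit range checks and treating uncorrectable ones as missing. The data is hypothetical, and the 300 kg upper bound is an assumed plausibility limit for illustration, not a clinical standard.

```python
import pandas as pd

# Hypothetical data: 1 = male, 2 = female; 12 and the weights 0/550 are outliers.
df = pd.DataFrame({
    "sex":    [1, 2, 12, 1],
    "weight": [65.0, 0.0, 550.0, 72.0],
})

# Detect medically impossible values with explicit range checks.
bad_sex = ~df["sex"].isin([1, 2])
bad_weight = (df["weight"] <= 0) | (df["weight"] > 300)  # assumed limit
print(df[bad_sex | bad_weight])  # the list to discuss with the data collector

# Uncorrectable outliers are treated as missing values.
df["sex"] = df["sex"].mask(bad_sex)
df["weight"] = df["weight"].mask(bad_weight)
```

Printing the flagged rows first mirrors step 1 of the slide: discuss the list with the clinical data collector before deciding how to handle each case.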
Normalization (normalization)

• Most AI models work well when the distribution is unimodal and symmetric, similar to a normal distribution.

• The process of transforming a distribution into something closer to a normal distribution is called normalization.

• Models that absolutely require normalization: parametric models such as linear regression and logistic regression (models that assume normally distributed variables)
Rescaling

• Most AI models work well when the distribution width of multiple variables, i.e. the scale, is nearly constant.

• For example, if some variables have a distribution width (scale) as small as 0.000001 while others have a scale as large as 1.12×10^15, variables with small scales risk being relatively underestimated.

• Example) In the same dataset, the performance of the prediction model may differ depending on whether the

patient's weight is entered in grams or tons.

• Method 1) Min-max scaling: transform the variable distribution so that the minimum value is 0 and the maximum is 1.

• Method 2) Standardization (standard scaling): transform the variable distribution so that the mean is 0 and the standard deviation is 1.

• Can be processed with the MinMaxScaler and StandardScaler classes of Python's sklearn.preprocessing library
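Both transforms are one-liners; this sketch implements them directly in pandas on hypothetical weight values (sklearn's MinMaxScaler and StandardScaler compute the same transforms).

```python
import pandas as pd

s = pd.Series([50.0, 60.0, 70.0, 80.0])  # hypothetical weights in kg

# Min-max scaling: minimum -> 0, maximum -> 1
minmax = (s - s.min()) / (s.max() - s.min())

# Standardization: mean -> 0, standard deviation -> 1
standard = (s - s.mean()) / s.std(ddof=0)

print(minmax.tolist())
print(round(standard.mean(), 10), round(standard.std(ddof=0), 10))
```

Applying the same transform to every continuous variable puts them on a comparable scale, so the grams-vs-tons problem from the slide disappears.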
Correlation Exploration
• Most AI models are based on the assumption that all input variables for prediction are independent of each other.

• In reality, if you use medical data as is, pairs of variables that are not independent at all and are highly correlated

will be included in the model.

• When highly correlated variables are used, the computational load increases without helping the model prediction performance.

• In linear regression models, the problem of multicollinearity is particularly serious.

• Highly correlated pairs of variables should be selected based on clinical context and the rest should be eliminated.
Correlation Exploration
• Correlation exploration method

• In R, use the chart.Correlation() function from the PerformanceAnalytics package.

• In Python, use corr() from the pandas library and heatmap() from the seaborn library.
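A sketch of correlation exploration with pandas corr(). The data is synthetic, with weight deliberately constructed to track height; seaborn.heatmap(corr) would visualize the resulting matrix.

```python
import numpy as np
import pandas as pd

# Synthetic data: weight is constructed to be highly correlated with height.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 100)
df = pd.DataFrame({
    "height":  height,
    "weight":  0.9 * height + rng.normal(0, 2, 100),  # tied to height
    "glucose": rng.normal(95, 10, 100),               # independent
})

# Pairwise correlation matrix (seaborn.heatmap(corr) draws it).
corr = df.corr()
print(corr.round(2))

# Flag highly correlated pairs for clinically informed removal.
high_pairs = corr.abs() > 0.9
print(high_pairs.loc["height", "weight"])
```

Once a pair is flagged, the slide's rule applies: keep the variable that makes clinical sense and eliminate the other.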
Time series data preprocessing

• Time series data

• Data where observations change over time

• Structured data (data that can be expressed in a spreadsheet) + the added attribute of time flow

• Examples of time series data in the medical field

• Number of emergency room visitors per hour

• Daily number of confirmed COVID-19 cases

• Electroencephalography (EEG)

• Electrocardiogram (ECG)

• Heart rate data per minute

• Vital signs such as body temperature and blood pressure

• Blood sugar records

• Sleep records

(Figure: a measurement plotted against time t, starting from the origin)
Time series data preprocessing

• Time series data consists of components with three main properties:

• Trend Component

• Components that increase or decrease over time in the long term

• Depending on the clinical context of data collection, it may be necessary to remove trend components.

• Examples: Long-term rising or falling trends in electrocardiogram data, long-term falling trends in birth data, etc.

• Removing the trend component: smoothing toward the mean (detrending)

• Seasonal Component

• Components that change periodically

• Example: Season and day of week effects in emergency room visit data

• Irregular Component

• Irregular components including random noise from the sensor


Time series data preprocessing

Source: Samsung Health App.


Time series data preprocessing

• Missing data can “normally” exist in time series data

• For example, in the case of data on the number of confirmed COVID-19 cases, missing data occurs if testing is not conducted on public

holidays, and missing data of this nature are periodic or follow certain rules.

• For time series data measured by sensors, missing data may occur due to sensor errors or network issues, and missing

data of this nature have the characteristic of occurring continuously over a certain period of time.

• For structured data, missing values can be removed by considering the missing ratio or interpolated with a representative value.

• However, because time series data changes over time, another missing value interpolation method is needed.

• LOCF: Last Observation Carried Forward

• NOCB: Next Observation Carried Backward

• Moving Average or Median

• Linear Interpolation, Spline Interpolation

• Generative model interpolation


Time series data preprocessing

• LOCF: Last Observation Carried Forward

• Replace missing values with the value observed immediately before the missing value

• Example: Replace the missing number of confirmed COVID-19 cases on May 5, a public holiday, with the number of confirmed cases on May 4.

• NOCB: Next Observation Carried Backward

• Replace missing values with values observed immediately after the missing value

• Example: Replace the missing number of confirmed COVID-19 cases on May 5, a public holiday, with the number of confirmed cases on May 6.

• Moving Average or Median

• Replace missing values with the mean or median of a given time interval immediately preceding the missing value.

• Linear Interpolation, Spline Interpolation

• How to estimate missing data from non-missing data using linear or spline models

• Generative model interpolation

• Use an artificial intelligence generative model instead of a linear function or spline function as an interpolation model.
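The simple methods above map directly onto pandas; here is a sketch using the holiday example, with May 5 missing in a synthetic daily count series.

```python
import numpy as np
import pandas as pd

# Synthetic daily counts with a missing value on the May 5 holiday.
idx = pd.date_range("2024-05-03", periods=5, freq="D")
s = pd.Series([100.0, 110.0, np.nan, 130.0, 120.0], index=idx)

locf = s.ffill()          # LOCF: carry May 4's value forward
nocb = s.bfill()          # NOCB: carry May 6's value backward
linear = s.interpolate()  # linear interpolation between neighbors

print(locf["2024-05-05"], nocb["2024-05-05"], linear["2024-05-05"])

# Moving average of the preceding window is another option:
roll = s.fillna(s.rolling(2, min_periods=1).mean().shift(1))
```

Each method fills the same gap differently (110, 130, or 120 here), which is why the choice should follow the clinical meaning of the series.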
Time series data preprocessing

• Noise refers to anything that causes unintended data distortion.

• Noise due to sensor and network anomalies

• Example: the sensor value is stuck at 0, or at its minimum or maximum, for a certain period of time

• How to deal with it: Remove noise intervals or consider them as missing values and then apply interpolation

• Noise caused by unintended activity

• Sensors that measure vital signs, such as electrocardiograms, often use piezoelectric elements, but noise can easily occur due

to changes in the patient's position, contact with surgical tools, or vibration.

• When recording breathing sounds with an electronic stethoscope, noise such as surrounding noise or vibration is also recorded.
Time series data preprocessing

• If the noise follows a rule and a model can be built that mimics the noise component according to that rule, the modeled noise can be subtracted from the original time series data.

• Smoothing

• When the noise is not regular and occurs only intermittently

• Apply a moving average over an arbitrary time interval to remove small fluctuations (noise removal)

• In heavily noisy environments this is not a good method, because the noise itself shifts the moving average.

• Filtering

• When the rules of the noise are unknown and a lot of noise is generated

• Assuming the noise follows a known distribution, such as a Gaussian, it is removed through filtering (e.g., Kalman filtering).
Time series data preprocessing

• Fourier transform

• The Fourier transform is a mathematical formula that converts a signal sampled in time into the same signal sampled in frequency.

• In signal processing, the Fourier transform reveals important characteristics of a signal, namely its frequency components.

• For a vector x with n uniformly sampled points, the discrete Fourier transform is defined as y(k) = sum_{j=0}^{n-1} w^(j*k) x(j), where w = e^(-2*pi*i/n) is one of the n complex roots of unity.
Time series data preprocessing

• Fourier transform

• Time series data like the one on the left can be converted into frequency components like the one on the right.

• We also find frequency peaks and convert them into structured data to apply artificial intelligence models.

• The Fourier transformed graph itself is also used as an image to apply the artificial intelligence model.

https://kr.mathworks.com/help/matlab/math/fourier-transforms.html
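The frequency-peak idea above can be sketched with NumPy's FFT, using a synthetic noisy 5 Hz sine wave in place of real sensor data.

```python
import numpy as np

# Synthetic signal: a 5 Hz sine wave plus Gaussian noise.
fs = 100                      # sampling rate (Hz)
t = np.arange(0, 2, 1 / fs)   # 2 seconds of samples
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * 5 * t) + 0.2 * rng.normal(size=t.size)

# FFT magnitude spectrum and the corresponding frequency axis.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

# The peak (ignoring the DC term) sits at the dominant frequency component,
# which can then be used as a feature in structured data.
peak_freq = freqs[np.argmax(spectrum[1:]) + 1]
print(peak_freq)
```

The recovered peak frequency (5 Hz here) is exactly the kind of feature the next slide converts into structured data.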
Time series data preprocessing

• Conversion into structured data through feature selection

• Extract peak frequency components through Fourier transform, measure maximum, minimum, median, variance, etc., and convert

them into structured data as features.

• Convert to image data

• Convert time series data into images that clearly show the features you want to reveal, then apply an image-based artificial intelligence model

• Example: Displaying the results of a pulmonary function test as a flow-volume graph

• Example: Heart rate variability represented as a Poincaré plot (Lorenz plot)


Time series data preprocessing

• Convert to image data

• Example: Displaying the results of a pulmonary function test as a flow-volume graph

https://en.wikipedia.org/wiki/Spirometry

https://www.medikro.com/understanding-your-spirometry-test-results/2024/02/21/
Time series data preprocessing

• Convert to image data

• Example: Heart rate variability represented as a Poincaré plot


Key Summary

- To preprocess structured data, you need to figure out the following:

-Codebook, unique key, data size, inclusion/exclusion criteria, data type

- Dimension reduction

-Reduce dimensionality and avoid the curse of dimensionality through variable selection or variable extraction.

- Text tokenization

-It converts data into a data type that is easy for computers to understand through methods such as one-hot encoding.

- Detect and interpolate missing and outlier values

-Visualize the distribution of each variable, detect outliers and missing values, impute them, and normalize and rescale.

- Correlation exploration

-Visualize the correlations between all variables; for each highly correlated pair, keep one variable and eliminate the other.

- Time series data

-The main components of time series data (trend, seasonal, and irregular components) must be identified and, where necessary, removed; transformations such as the Fourier transform can also be utilized.
NHISS (Health Insurance Corporation) data

-https://nhiss.nhis.or.kr/bd/ay/bdaya001iv.do
NHISS (Health Insurance Corporation)

-Demo data

- Practice data available for download without authorization

- Composed of 36 tables; the claims review (HIRA) data consists of 4 tables

- Since it is fake data rather than real patient data, it can be distributed without review

- While administrative procedures such as data review are in progress, you can write and debug your analysis script in advance using the demo data, then save time by bringing the script to the analysis center, modifying it, and using it


Next

- Week 6: Present your project plan

- Week 7: Billing Data

- Week 8: Midterm Exam (face-to-face, written exam)

- Week 9: Open Source Data Science Language R Practice
