COVID-19 Clinical Trials EDA Pandas
Dataset: The dataset is available at the given link. You can download it at your convenience.
About Dataset
Dataset Description
ClinicalTrials.gov is a database of privately and publicly funded clinical studies conducted around the world. It is
maintained by the National Institutes of Health. All data is publicly available, and the site provides a direct download
feature that makes it easy to use relevant data for analysis.
This dataset consists of clinical trials related to COVID-19 studies presented on the site.
The dataset consists of XML files where each XML file corresponds to one study. The filename is the NCT number
which is a unique identifier of a study in the ClinicalTrials repository. Additionally, a CSV file has also been provided,
which might not have as much information as contained in the XML file, but does give sufficient information.
IMPORTANT:
Listing a study does not mean it has been evaluated by the U.S. Federal Government. Read our disclaimer for
details.
Before participating in a study, talk to your health care provider and learn about the risks and potential benefits.
NOTE:
1. This project is only for your guidance; you do not have to create exactly the same thing. It shows the
steps you can follow and what your project might look like. Some projects are very advanced (because they are
built with Flask, NLP, advanced AI, advanced DL, and other advanced techniques), and you may not understand them yet.
2. You can build or analyze your project on your own, with your own ideas. Make it more creative, so that it
yields information and helps us understand the business. Make sure you understand everything you have
created very well.
Example
Steps you should follow
Project Title:
1. Objective
The objective is to explore the dataset to gain insights into the characteristics of
COVID-19 clinical trials, such as their status, phases, study designs, and
demographics.
First, you'll need to import the necessary libraries and load your dataset.
import pandas as pd
5. Univariate Analysis
print(df['Status'].value_counts())
df['Status'].value_counts().plot(kind='bar', title='Status of Clinical Trials')
print(df['Phases'].value_counts())
df['Phases'].value_counts().plot(kind='bar', title='Distribution of Phases')
print(df['Age'].value_counts())
df['Age'].value_counts().plot(kind='bar', title='Age Group Distribution')
6. Bivariate Analysis
● Status vs. Phases: Explore how trial phases are distributed across different
statuses.
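One way to look at Status against Phases is a cross-tabulation. The sketch below is only an illustration: the column names Status and Phases come from the dataset, but the rows are made up, and filling missing phases with the label 'Missing' is an assumption.

```python
import pandas as pd

# Tiny made-up sample; the real df is loaded from the CSV file.
sample = pd.DataFrame({
    'Status': ['Recruiting', 'Recruiting', 'Completed', 'Completed', 'Recruiting'],
    'Phases': ['Phase 2', 'Phase 3', 'Phase 2', None, 'Phase 2'],
})

# Count trials for each (Status, Phases) pair.
# Missing phases get their own label so those trials are not silently dropped.
status_vs_phases = pd.crosstab(sample['Status'], sample['Phases'].fillna('Missing'))
print(status_vs_phases)
```

Calling status_vs_phases.plot(kind='bar', stacked=True) would draw the same table as a stacked bar chart.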
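# Collect the outcome measures reported for each condition, skipping missing values
conditions_outcomes = df.groupby('Conditions')['Outcome Measures'].apply(lambda x: ', '.join(x.dropna().astype(str))).reset_index()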
print(conditions_outcomes)
Analyze trends over time, such as the number of trials started each month.
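A monthly trend can be built by parsing the start dates and counting trials per calendar month. This is a minimal sketch on made-up values; 'Start Date' is the column name in the CSV, and the date format shown here is assumed from the sample rows. With errors='coerce', values that cannot be parsed become NaT instead of raising an error.

```python
import pandas as pd

# Made-up values for illustration; the real column is df['Start Date'].
sample = pd.DataFrame({
    'Start Date': ['March 15, 2020', 'March 30, 2020', 'April 2, 2020', None],
})

# Parse the date strings; missing or unparseable values become NaT.
start = pd.to_datetime(sample['Start Date'], errors='coerce')

# Count trials per calendar month, in chronological order (NaT is ignored).
trials_per_month = start.dt.to_period('M').value_counts().sort_index()
print(trials_per_month)
```

Plotting trials_per_month with kind='line' gives the time series chart for this step.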
8. Conclusion
9. Saving Results
You can save the processed data or specific analysis results for further use.
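Saving can be as simple as writing CSV files. In this sketch the file names are hypothetical and the frame is a made-up stand-in for the processed df. It writes to a temporary folder so that it runs anywhere; in your own project you would choose a real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the processed DataFrame.
processed = pd.DataFrame({'Status': ['Recruiting', 'Completed', 'Recruiting'],
                          'Enrollment': [120, 45, 300]})

# Hypothetical output folder (temporary here).
out_dir = Path(tempfile.mkdtemp())

# Save the processed data without the row index.
processed.to_csv(out_dir / 'covid_trials_processed.csv', index=False)

# Save one analysis result: the number of trials per status.
status_counts = (processed['Status'].value_counts()
                 .rename_axis('Status').reset_index(name='count'))
status_counts.to_csv(out_dir / 'status_counts.csv', index=False)
```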
● Bar charts showing the distribution of trial statuses, phases, and age groups.
● A time series plot illustrating the trend of trials over time.
This project will provide a solid foundation in EDA using Pandas, with practical
insights into the clinical trials landscape for COVID-19.
Sample code
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
In [2]:
df = pd.read_csv('../input/covid19-clinical-trials-dataset/COVID clinical trials.csv', index_col=0)
In [3]:
df.head()
Out[3]:
Rank  NCT Number   Status                  Study Results
1     NCT04785898  Active, not recruiting  No Results Available
2     NCT04595136  Not yet recruiting      No Results Available
3     NCT04395482  Recruiting              No Results Available
4     NCT04416061  Active, not recruiting  No Results Available
5     NCT04395924  Recruiting              No Results Available
(Abridged. The displayed columns also include Title, Acronym, Conditions, Interventions, Outcome Measures,
Sponsor/Collaborators, Gender, ..., Other IDs, Start Date, Primary Completion Date, Completion Date,
First Posted, Results First Posted, Last Update Posted, Locations, Study Documents, and URL.)
5 rows × 26 columns
In [4]:
df.shape
Out[4]:
(5783, 26)
In [5]:
Out[5]:
dtype='object')
In [6]:
# Categorical Features
df.select_dtypes(include = 'object').columns
Out[6]:
dtype='object')
In [7]:
# Numerical Features
df.select_dtypes(exclude = 'object').columns
Out[7]:
Index(['Enrollment'], dtype='object')
In [8]:
Out[8]:
In [9]:
# set labels
plt.ylabel(ylabel)
plt.show()
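Only the last lines of this cell are shown. A helper matching the later call visualize_data(gender, caption=..., ylabel=...) could look like the sketch below. The body is an assumption and not the original function; the counts are made up.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def visualize_data(data, caption='', ylabel='Count'):
    """Draw a bar chart from a Series of counts (hypothetical reconstruction)."""
    fig, ax = plt.subplots(figsize=(8, 5))
    data.plot(kind='bar', ax=ax)
    # set labels
    ax.set_title(caption)
    ax.set_ylabel(ylabel)
    fig.tight_layout()
    return ax


# Made-up counts for illustration.
counts = pd.Series({'All': 3, 'Female': 1, 'Male': 1})
ax = visualize_data(counts, caption='Gender Distribution', ylabel='Count')
```

In a notebook, calling plt.show() after the function displays the chart.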
In [10]:
In [11]:
In [12]:
# Columns in the dataset after dropping Study Documents and Results First Posted
df.columns
Out[12]:
dtype='object')
In [13]:
In [14]:
In [15]:
df.isnull().mean() * 100
Out[15]:
dtype: float64
In [16]:
# We can extract a new feature from Locations: the country where the study is held.
# The country is the text after the last comma; strip() removes the leading space.
countries = [str(loc).split(',')[-1].strip() for loc in df.Locations]
df['Country'] = countries
In [17]:
df.columns
Out[17]:
dtype='object')
In [18]:
df.Country.value_counts()[:35]
Out[18]:
In [19]:
Out[19]:
COVID-19 47
PROTECT 7
CORONA 6
RECOVER 5
SCOPE 5
..
ASD 1
VICO 1
LICORNE 1
LOSVID 1
MindMyMindFU 1
In [20]:
Out[20]:
Country
Iraq 100.000000
Belarus 100.000000
Rwanda 100.000000
South Sudan 100.000000
Cambodia 100.000000
Bulgaria 100.000000
Cyprus 100.000000
Bosnia and Herzegovina 100.000000
Guinea-Bissau 100.000000
Dominican Republic 100.000000
Ecuador 100.000000
North Macedonia 100.000000
Bahrain 100.000000
Azerbaijan 100.000000
Uruguay 100.000000
Uzbekistan 100.000000
Kyrgyzstan 100.000000
Cape Verde 100.000000
Republic of 96.774194
Taiwan 93.750000
Singapore 93.103448
Japan 88.888889
Kuwait 87.500000
China 87.134503
Turkey 86.757991
Ukraine 85.714286
Malaysia 84.615385
Egypt 83.854167
Hungary 83.333333
Hong Kong 80.555556
Bangladesh 80.000000
India 80.000000
Kazakhstan 80.000000
Saudi Arabia 77.777778
Puerto Rico 76.470588
Israel 75.000000
Zimbabwe 75.000000
Jordan 72.727273
Poland 72.413793
Indonesia 71.428571
United States 69.376480
Romania 69.230769
Kenya 66.666667
Nepal 66.666667
New Zealand 66.666667
Ethiopia 66.666667
Slovakia 66.666667
Thailand 66.666667
Lebanon 66.666667
nan 66.324786
Islamic Republic of 65.217391
Russian Federation 65.217391
Chile 64.705882
Austria 62.068966
Pakistan 60.377358
Brazil 59.124088
Mexico 57.954545
Sweden 57.894737
Argentina 57.446809
Canada 55.940594
● After inspecting the relationship between missing values in Acronym and Country, we can conclude that the two
features are related: the share of missing acronyms varies by country. So we can say that the data is
Missing At Random (MAR).
● So we can impute with a dedicated "Missing" category.
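The two steps above can be sketched as follows: first the percentage of missing acronyms per country, then the imputation. The rows are made up for illustration, and the placeholder label 'Missing' is an assumption; any label that cannot be confused with a real acronym would do.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'Acronym' and 'Country' are the column names used above.
sample = pd.DataFrame({
    'Acronym': ['PROTECT', np.nan, np.nan, 'SCOPE', np.nan],
    'Country': ['France', 'France', 'Iraq', 'Japan', 'Iraq'],
})

# Percentage of missing acronyms within each country, highest first.
missing_pct = (sample['Acronym'].isnull()
               .groupby(sample['Country']).mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Impute with an explicit category instead of dropping the rows.
sample['Acronym'] = sample['Acronym'].fillna('Missing')
```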
In [21]:
In [22]:
Out[22]:
dtype: float64
We can do the same for the other categorical features, such as Interventions, Phases, and Locations.
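A short loop is enough to repeat the imputation. This sketch uses made-up rows and assumes the same 'Missing' placeholder for every listed column; the numeric Enrollment column is deliberately left untouched.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration.
sample = pd.DataFrame({
    'Interventions': ['Drug: X', np.nan],
    'Phases': [np.nan, 'Phase 2'],
    'Locations': [np.nan, 'Paris, France'],
    'Enrollment': [100.0, np.nan],
})

# Fill each listed categorical column with the placeholder category.
for col in ['Interventions', 'Phases', 'Locations']:
    sample[col] = sample[col].fillna('Missing')

print(sample.isnull().sum())
```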
In [23]:
In [24]:
Out[24]:
dtype: float64
Now it is time to handle the missing data in Enrollment.
In [25]:
Out[25]:
34.06593382031148
The skewness is about 34, which means this feature is heavily right-skewed and far from normally distributed.
In [26]:
Out[26]:
<AxesSubplot:ylabel='Density'>
So we will impute with the median, which is robust to such extreme values.
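To see why the median is the safer choice for a skewed feature, compare it with the mean on a small made-up series with one very large trial. The numbers below are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up enrollments with one extreme value, mimicking a long right tail.
enrollment = pd.Series([20, 50, 80, 100, 120, np.nan, 20000], name='Enrollment')

print(enrollment.skew())    # positive: the tail is on the right
print(enrollment.mean())    # pulled upward by the single huge trial
print(enrollment.median())  # barely affected by it

# Impute the missing value with the median.
enrollment_filled = enrollment.fillna(enrollment.median())
```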
In [27]:
min_Value = df.Enrollment.min()
max_Value = df.Enrollment.max()
mean_Value = df.Enrollment.mean()
median_Value = df.Enrollment.median()
std_Value = df.Enrollment.std()
In [29]:
Out[29]:
dtype: float64
In [30]:
df.head()
Out[30]:
Rank  NCT Number   Status                  Country
1     NCT04785898  Active, not recruiting  France
2     NCT04595136  Not yet recruiting      Colombia
3     NCT04395482  Recruiting              San Marino
4     NCT04416061  Active, not recruiting  Hong Kong
5     NCT04395924  Recruiting              France
(Abridged. The displayed columns also include Title, Acronym, Study Results, Conditions, Interventions,
Outcome Measures, Sponsor/Collaborators, ..., Study Designs, Other IDs, Start Date, Primary Completion Date,
Completion Date, First Posted, Last Update Posted, Locations, and URL.)
5 rows × 25 columns
Data Visualizations
In [31]:
In [32]:
# Gender Visualization
gender = df.Gender.value_counts()
visualize_data(gender , caption = 'Gender Distribution' , ylabel = 'Density')
In [34]:
start_month_Distribution = start_month.value_counts()
So, if we are going to apply a machine learning (ML) model, we can drop NCT Number and URL, because Rank
already serves as an index. This reduces the number of categorical features, which matters especially because
they would need to be encoded in order to be used in an ML model.
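As a rough sketch of that preparation step, the identifier columns can be dropped and a remaining categorical column one-hot encoded. The rows and NCT numbers below are made up, and one-hot encoding with pd.get_dummies is just one possible encoding, chosen here for illustration.

```python
import pandas as pd

# Made-up rows; column names follow the dataset, and Rank is the index.
sample = pd.DataFrame({
    'NCT Number': ['NCT00000001', 'NCT00000002', 'NCT00000003'],
    'URL': ['u1', 'u2', 'u3'],
    'Gender': ['All', 'Female', 'All'],
    'Enrollment': [100, 40, 250],
}, index=pd.Index([1, 2, 3], name='Rank'))

# Drop identifier-like columns: they are unique per row and carry no signal.
features = sample.drop(columns=['NCT Number', 'URL'])

# One-hot encode the remaining categorical column.
encoded = pd.get_dummies(features, columns=['Gender'])
print(encoded.columns.tolist())
```

High-cardinality text columns such as Title would need a different treatment, since one-hot encoding them creates one column per distinct value.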