COVID-19 Clinical Trials EDA Pandas

Project Title: COVID-19 Clinical Trials EDA Pandas

Tools: Python, ML, SQL, Excel

Technologies: Data Analyst & Data Scientist

Project Difficulty Level: Intermediate

Dataset : Dataset is available in the given link. You can download it at your convenience.

Click here to download data set

About Dataset
Dataset Description
ClinicalTrials.gov is a database of privately and publicly funded clinical studies conducted around the world. It is
maintained by the National Institutes of Health. All data is publicly available, and the site provides a direct download
feature that makes it easy to use relevant data for analysis.

This dataset consists of clinical trials related to COVID-19 studies presented on the site.

The dataset consists of XML files where each XML file corresponds to one study. The filename is the NCT number
which is a unique identifier of a study in the ClinicalTrials repository. Additionally, a CSV file has also been provided,
which might not have as much information as contained in the XML file, but does give sufficient information.
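
Because each study ships as its own XML file, a few fields can be pulled out with the standard library alone. The snippet below is only a sketch: the sample record and its tag names (`clinical_study`, `nct_id`, `brief_title`, `overall_status`, `phase`) are assumptions made for illustration and should be checked against a real file from the dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed study record. Real files carry many more
# fields, and the tag names used here are assumed, not taken from the dataset.
SAMPLE_XML = """
<clinical_study>
  <id_info><nct_id>NCT04785898</nct_id></id_info>
  <brief_title>Example diagnostic study</brief_title>
  <overall_status>Recruiting</overall_status>
</clinical_study>
"""

def parse_study(xml_text):
    """Return a flat dict with a few fields; tags that are absent become None."""
    root = ET.fromstring(xml_text.strip())
    return {
        "nct_id": root.findtext("id_info/nct_id"),
        "title": root.findtext("brief_title"),
        "status": root.findtext("overall_status"),
        "phase": root.findtext("phase"),  # not present in the sample -> None
    }

record = parse_study(SAMPLE_XML)
print(record)
```

Collecting one such dict per file into a list and passing it to `pd.DataFrame` would give a table comparable to the CSV.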

Please refer to this notebook for details on the dataset:

https://www.kaggle.com/parulpandey/eda-on-covid-19-clinical-trials
Acknowledgements
ClinicalTrials.gov is a resource provided by the U.S. National Library of Medicine.

IMPORTANT:
Listing a study does not mean it has been evaluated by the U.S. Federal Government. Read our disclaimer for
details.
Before participating in a study, talk to your health care provider and learn about the risks and potential benefits.

NOTE:
1. This project is only for your guidance; you do not have to create exactly the same thing. It shows the steps you
can follow and how your project might look. Some projects are very advanced (built with Flask, NLP, advanced AI,
advanced DL and similar tools) and may be hard to follow at first.
2. You can build and analyze the project yourself, with your own ideas. Make it more creative so that it yields
information and an understanding of the business. Make sure you understand everything you have created.

Example
Steps to follow

Here's a step-by-step guide for performing Exploratory Data Analysis (EDA) on a
COVID-19 Clinical Trials dataset using Pandas, tailored for beginners.

Project Title:

Exploratory Data Analysis of COVID-19 Clinical Trials

1. Objective

The objective is to explore the dataset to gain insights into the characteristics of
COVID-19 clinical trials, such as their status, phases, study designs, and
demographics.

2. Importing Libraries and Loading Data

First, you'll need to import the necessary libraries and load your dataset.

import pandas as pd

# Load the dataset (replace with your dataset's path)
df = pd.read_csv('covid_clinical_trials.csv')

3. Initial Data Exploration

Start by exploring the basic structure and content of the dataset.

# View the first few rows of the dataset
print(df.head())

# Check the columns and data types (info() prints its own report)
df.info()

# Summary statistics for numerical columns
print(df.describe())

# Summary statistics for categorical columns
print(df.describe(include='object'))

4. Handling Missing Data

Check for missing values and decide how to handle them.

# Check for missing values
print(df.isnull().sum())

# Drop columns with a high percentage of missing values, or fill them
df = df.drop(columns=['Acronym', 'Study Documents'])  # Example of dropping columns
df['Results First Posted'] = df['Results First Posted'].fillna('Unknown')  # Example of filling missing data
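
Instead of naming the columns to drop by hand, the decision can be driven by the share of missing values. The sketch below uses a made-up frame and an assumed cutoff of 90%; both the toy data and the `MAX_MISSING` value are illustrative choices, not part of the original project.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the trials table (values are made up).
toy = pd.DataFrame({
    "Status": ["Recruiting", "Completed", "Recruiting", "Completed"],
    "Acronym": [np.nan, "ABC", np.nan, np.nan],
    "Study Documents": [np.nan, np.nan, np.nan, np.nan],
})

MAX_MISSING = 0.9  # assumed cutoff: drop columns that are more than 90% empty

# Fraction of missing values per column
missing_share = toy.isnull().mean()

# Columns above the cutoff are dropped, the rest are kept for imputation
to_drop = missing_share[missing_share > MAX_MISSING].index.tolist()
cleaned = toy.drop(columns=to_drop)

print(missing_share)
print("Dropped:", to_drop)
```

Columns that stay below the cutoff, such as `Acronym` here, are then candidates for filling rather than dropping.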

5. Univariate Analysis

Analyze each column individually to understand the distribution and key characteristics.

● Status Distribution: Analyze the status of clinical trials (e.g., Completed, Ongoing).

print(df['Status'].value_counts())
df['Status'].value_counts().plot(kind='bar', title='Status of Clinical Trials')

● Phase Distribution: Understand the distribution of trial phases.

print(df['Phases'].value_counts())
df['Phases'].value_counts().plot(kind='bar', title='Distribution of Phases')

● Age Group Analysis: Analyze the distribution of age groups.

print(df['Age'].value_counts())
df['Age'].value_counts().plot(kind='bar', title='Age Group Distribution')

6. Bivariate Analysis

Explore relationships between different variables.

● Status vs. Phases: Explore how trial phases are distributed across different statuses.

status_phase = pd.crosstab(df['Status'], df['Phases'])
print(status_phase)
status_phase.plot(kind='bar', stacked=True, title='Status vs. Phases')

● Conditions vs. Outcome Measures: Understand the common outcome measures for different conditions.

# dropna() avoids a TypeError when joining missing outcome measures
conditions_outcomes = (
    df.groupby('Conditions')['Outcome Measures']
    .apply(lambda x: ', '.join(x.dropna()))
    .reset_index()
)
print(conditions_outcomes)
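
In this dataset a single `Conditions` cell can hold several conditions separated by `|` (for example `Maternal Fetal Infection Transmission|COVID-19...`), so grouping on the raw string treats every combination as its own condition. One possible refinement, sketched here on made-up rows with hypothetical NCT numbers, is to split the cell and count each condition separately.

```python
import pandas as pd

# Toy rows; the IDs and condition lists are invented for illustration.
toy = pd.DataFrame({
    "NCT Number": ["NCT001", "NCT002", "NCT003"],
    "Conditions": ["COVID-19|Pneumonia", "COVID-19", "Pneumonia|ARDS"],
})

# Split each cell on '|' and give every condition its own row
exploded = (
    toy.assign(Condition=toy["Conditions"].str.split("|"))
       .explode("Condition")
)

# Count how many trials mention each individual condition
condition_counts = exploded["Condition"].value_counts()
print(condition_counts)
```

The same split-and-explode step would apply to other multi-valued columns such as `Outcome Measures`, assuming they use the same separator.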

7. Time Series Analysis

Analyze the trends over time, such as the number of trials started over the months.

# Convert date columns to datetime
df['Start Date'] = pd.to_datetime(df['Start Date'], errors='coerce')
df['Primary Completion Date'] = pd.to_datetime(df['Primary Completion Date'], errors='coerce')

# Plot the number of trials started over time
df['Start Date'].dt.to_period('M').value_counts().sort_index().plot(kind='line', title='Trials Started Over Time')
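
With `errors='coerce'`, any value that cannot be parsed silently becomes `NaT`, so it is worth counting how many rows were lost before trusting the trend. The following sketch uses a handful of invented date strings to show the monthly counting step without plotting.

```python
import pandas as pd

# Invented start dates, including one unparsable value and one missing value
starts = pd.Series(["2020-03-15", "2020-03-20", "2020-04-01", "not a date", None])

# Unparsable or missing entries become NaT instead of raising an error
parsed = pd.to_datetime(starts, errors="coerce")
n_unparsed = int(parsed.isna().sum())

# Number of trials started per calendar month, in chronological order
monthly = parsed.dropna().dt.to_period("M").value_counts().sort_index()

print(f"Rows without a usable date: {n_unparsed}")
print(monthly)
```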

8. Conclusion

Summarize the findings from your EDA. For example (illustrative statements, to be checked against your own results):

● The majority of trials have the "Completed" status.
● Most trials target adult populations.
● There's a steady increase in the number of trials over time.

9. Saving Results

You can save the processed data or specific analysis results for further use.

# Save the cleaned data


df.to_csv('cleaned_covid_clinical_trials.csv', index=False)

10. Output and Visuals


After running the code, you should observe:

● Bar charts showing the distribution of trial statuses, phases, and age groups.
● A time series plot illustrating the trend of trials over time.

This project will provide a solid foundation in EDA using Pandas, with practical
insights into the clinical trials landscape for COVID-19.

Sample code

Import Required Libraries

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Load The DataSet

In [2]:

df = pd.read_csv('../input/covid19-clinical-trials-dataset/COVID clinical trials.csv' , index_col = 0)

Exploratory Data Analysis

In [3]:

# print the first 5 rows in the dataset


df.head(n = 5)
Out[3]:

(Wide table; only a selection of the columns is shown here.)

Rank  NCT Number   Status                  Study Results         URL
1     NCT04785898  Active, not recruiting  No Results Available  https://ClinicalTrials.gov/show/NCT04785898
2     NCT04595136  Not yet recruiting      No Results Available  https://ClinicalTrials.gov/show/NCT04595136
3     NCT04395482  Recruiting              No Results Available  https://ClinicalTrials.gov/show/NCT04395482
4     NCT04416061  Active, not recruiting  No Results Available  https://ClinicalTrials.gov/show/NCT04416061
5     NCT04395924  Recruiting              No Results Available  https://ClinicalTrials.gov/show/NCT04395924

5 rows × 26 columns
In [4]:

# Shape of the DataSet


df.shape

Out[4]:

(5783, 26)

In [5]:

# Columns in the dataset


df.columns

Out[5]:

Index(['NCT Number', 'Title', 'Acronym', 'Status', 'Study Results',


'Conditions', 'Interventions', 'Outcome Measures',
'Sponsor/Collaborators', 'Gender', 'Age', 'Phases', 'Enrollment',
'Funded Bys', 'Study Type', 'Study Designs', 'Other IDs', 'Start Date',
'Primary Completion Date', 'Completion Date', 'First Posted',
'Results First Posted', 'Last Update Posted', 'Locations',
'Study Documents', 'URL'],

dtype='object')

In [6]:

# Categorical Features
df.select_dtypes(include = 'object').columns

Out[6]:

Index(['NCT Number', 'Title', 'Acronym', 'Status', 'Study Results',


'Conditions', 'Interventions', 'Outcome Measures',
'Sponsor/Collaborators', 'Gender', 'Age', 'Phases', 'Funded Bys',
'Study Type', 'Study Designs', 'Other IDs', 'Start Date',
'Primary Completion Date', 'Completion Date', 'First Posted',
'Results First Posted', 'Last Update Posted', 'Locations',
'Study Documents', 'URL'],

dtype='object')
In [7]:

# Numerical Features
df.select_dtypes(exclude = 'object').columns

Out[7]:

Index(['Enrollment'], dtype='object')

In [8]:

# Detecting (Percentage) Missing Data


missing_data = df.isnull().mean() * 100
missing_data

Out[8]:

NCT Number 0.000000


Title 0.000000
Acronym 57.115684
Status 0.000000
Study Results 0.000000
Conditions 0.000000
Interventions 15.320768
Outcome Measures 0.605222
Sponsor/Collaborators 0.000000
Gender 0.172921
Age 0.000000
Phases 42.555767
Enrollment 0.587930
Funded Bys 0.000000
Study Type 0.000000
Study Designs 0.605222
Other IDs 0.017292
Start Date 0.587930
Primary Completion Date 0.622514
Completion Date 0.622514
First Posted 0.000000
Results First Posted 99.377486
Last Update Posted 0.000000
Locations 10.115857
Study Documents 96.852845
URL 0.000000
dtype: float64

In [9]:

# Plot a Series as a bar chart (at most its first 40 entries)
def visualize_data(data , caption = '' , ylabel = 'Percentage of Missing Data'):
    # set figure size
    sns.set(rc={'figure.figsize':(15,8.27)})

    # make ticks vertical
    plt.xticks(rotation=90)

    # plot at most the first 40 entries and set the title
    n = min(40 , len(data))
    sns.barplot(x = data.keys()[:n].tolist() , y = data.values[:n].tolist()) \
        .set_title(caption)

    # set labels
    plt.ylabel(ylabel)

    plt.show()

In [10]:

visualize_data(missing_data , 'Percentage of missing data in each feature')


As shown, the share of missing data is 99.4% in Results First Posted and 96.9% in Study Documents, so
these columns cannot be imputed meaningfully and are dropped instead.

In [11]:

# Drop Study Documents and Results First Posted


df.drop(['Results First Posted' , 'Study Documents'] , inplace = True , axis = 1 )

In [12]:

# Columns in the dataset after dropping Study Documents and Results First Posted
df.columns

Out[12]:

Index(['NCT Number', 'Title', 'Acronym', 'Status', 'Study Results',


'Conditions', 'Interventions', 'Outcome Measures',
'Sponsor/Collaborators', 'Gender', 'Age', 'Phases', 'Enrollment',
'Funded Bys', 'Study Type', 'Study Designs', 'Other IDs', 'Start Date',
'Primary Completion Date', 'Completion Date', 'First Posted',
'Last Update Posted', 'Locations', 'URL'],

dtype='object')

In [13]:

# Drop Duplicate Rows


print(f"Shape before dropping duplicates data {df.shape}")
df.drop_duplicates(inplace = True)
print(f"Shape after dropping duplicates data {df.shape}")

Shape before dropping duplicates data (5783, 24)


Shape after dropping duplicates data (5783, 24)

There are no duplicate rows in the dataset.

In [14]:

# Drop rows that have fewer than 10 non-null values
# (thresh is used on its own; recent pandas versions reject how= combined with thresh=)
print(f"Shape before dropping Null rows {df.shape}")
df.dropna(axis = 0 , thresh = 10 , inplace = True)
print(f"Shape after dropping Null rows {df.shape}")

Shape before dropping Null rows (5783, 24)


Shape after dropping Null rows (5783, 24)

There are no rows with fewer than 10 non-null values.
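
The `thresh` argument of `dropna` is the minimum number of non-null values a row must have in order to be kept. A small made-up frame makes the rule easier to see; the column names and values below are purely illustrative.

```python
import numpy as np
import pandas as pd

# Row 0 has 3 non-null values, row 1 has 2, row 2 has none
toy = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 2, np.nan],
    "c": [3, 3, np.nan],
})

# Keep only rows with at least 2 non-null values
kept = toy.dropna(axis=0, thresh=2)
print(kept)
```

In the notebook the threshold of 10 removes nothing because every trial has at least 10 of its 24 columns filled in.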

In [15]:

df.isnull().mean() * 100

Out[15]:

NCT Number 0.000000


Title 0.000000
Acronym 57.115684
Status 0.000000
Study Results 0.000000
Conditions 0.000000
Interventions 15.320768
Outcome Measures 0.605222
Sponsor/Collaborators 0.000000
Gender 0.172921
Age 0.000000
Phases 42.555767
Enrollment 0.587930
Funded Bys 0.000000
Study Type 0.000000
Study Designs 0.605222
Other IDs 0.017292
Start Date 0.587930
Primary Completion Date 0.622514
Completion Date 0.622514
First Posted 0.000000
Last Update Posted 0.000000
Locations 10.115857
URL 0.000000

dtype: float64

In [16]:

# Extract a new feature from Locations: the country where the study is held
# (the text after the last comma of the location string)
countries = [ str(df.Locations.iloc[i]).split(',')[-1].strip() for i in range(df.shape[0])]
df['Country'] = countries

In [17]:

df.columns

Out[17]:

Index(['NCT Number', 'Title', 'Acronym', 'Status', 'Study Results',


'Conditions', 'Interventions', 'Outcome Measures',
'Sponsor/Collaborators', 'Gender', 'Age', 'Phases', 'Enrollment',
'Funded Bys', 'Study Type', 'Study Designs', 'Other IDs', 'Start Date',
'Primary Completion Date', 'Completion Date', 'First Posted',
'Last Update Posted', 'Locations', 'URL', 'Country'],

dtype='object')
In [18]:

df.Country.value_counts()[:35]

Out[18]:

United States 1267


France 647
nan 585
United Kingdom 306
Italy 235
Spain 234
Turkey 219
Canada 202
Egypt 192
China 171
Brazil 137
Germany 128
Belgium 91
Mexico 88
Switzerland 76
Russian Federation 69
Sweden 57
Denmark 56
Israel 56
India 55
Pakistan 53
Argentina 47
Netherlands 46
Norway 38
Hong Kong 36
Colombia 33
Republic of 31
Austria 29
Poland 29
Singapore 29
Saudi Arabia 27
Australia 26
Greece 26
Islamic Republic of 23
South Africa 22

Name: Country, dtype: int64
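
The counts above contain entries such as `Republic of` and `Islamic Republic of`, which most likely come from country names written with a comma (for example "Korea, Republic of") being cut at the last comma. Missing locations also show up as the string `nan`. The helper below is a sketch of one way to handle both cases; the function name, the suffix list and the sample location strings are assumptions made for illustration.

```python
import pandas as pd

# Assumed suffixes that belong to the preceding comma-separated token
SUFFIXES = {"Republic of", "Islamic Republic of"}

def extract_country(location):
    """Return the country of the last listed site, or None if the location is missing."""
    if pd.isna(location):
        return None
    # Several sites may be listed, separated by '|'; use the last one
    last_site = str(location).split("|")[-1]
    parts = [p.strip() for p in last_site.split(",")]
    # Re-join names such as "Korea, Republic of" that contain a comma
    if len(parts) >= 2 and parts[-1] in SUFFIXES:
        return f"{parts[-2]}, {parts[-1]}"
    return parts[-1]

# Made-up location strings covering the cases discussed above
locations = pd.Series([
    "Hospital A, Paris, France",
    "Clinic B, Seoul, Korea, Republic of",
    "Site C, Rome, Italy|Site D, Madrid, Spain",
    None,
])
countries = locations.map(extract_country)
print(countries)
```

Like the original one-liner, this keeps only the country of the last listed site, so multi-country trials are still counted once.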

Now we need to classify the missing data into one of these categories:

1) Missing Completely At Random (MCAR)

2) Missing At Random (MAR)

3) Not Missing At Random (NMAR)

In [19]:

# Let's start with Acronym

print(f"Number of unique values is {df.Acronym.nunique()} \n")


df.Acronym.value_counts()

Number of unique values is 2338

Out[19]:

COVID-19 47
PROTECT 7
CORONA 6
RECOVER 5
SCOPE 5
..
ASD 1
VICO 1
LICORNE 1
LOSVID 1
MindMyMindFU 1

Name: Acronym, Length: 2338, dtype: int64

In [20]:

# Find the relation between null values in Acronym and Countries


(df.Acronym.isnull().groupby(df.Country).mean().sort_values(ascending = False) *
100)[:60]

Out[20]:

Country
Iraq 100.000000
Belarus 100.000000
Rwanda 100.000000
South Sudan 100.000000
Cambodia 100.000000
Bulgaria 100.000000
Cyprus 100.000000
Bosnia and Herzegovina 100.000000
Guinea-Bissau 100.000000
Dominican Republic 100.000000
Ecuador 100.000000
North Macedonia 100.000000
Bahrain 100.000000
Azerbaijan 100.000000
Uruguay 100.000000
Uzbekistan 100.000000
Kyrgyzstan 100.000000
Cape Verde 100.000000
Republic of 96.774194
Taiwan 93.750000
Singapore 93.103448
Japan 88.888889
Kuwait 87.500000
China 87.134503
Turkey 86.757991
Ukraine 85.714286
Malaysia 84.615385
Egypt 83.854167
Hungary 83.333333
Hong Kong 80.555556
Bangladesh 80.000000
India 80.000000
Kazakhstan 80.000000
Saudi Arabia 77.777778
Puerto Rico 76.470588
Israel 75.000000
Zimbabwe 75.000000
Jordan 72.727273
Poland 72.413793
Indonesia 71.428571
United States 69.376480
Romania 69.230769
Kenya 66.666667
Nepal 66.666667
New Zealand 66.666667
Ethiopia 66.666667
Slovakia 66.666667
Thailand 66.666667
Lebanon 66.666667
nan 66.324786
Islamic Republic of 65.217391
Russian Federation 65.217391
Chile 64.705882
Austria 62.068966
Pakistan 60.377358
Brazil 59.124088
Mexico 57.954545
Sweden 57.894737
Argentina 57.446809
Canada 55.940594

Name: Acronym, dtype: float64

● After inspecting the share of missing Acronym values per Country, we can see that it varies strongly between
countries. Because the missingness appears to depend on another observed feature, we treat the data as Missing At
Random (MAR).
● So we can impute with a dedicated "missing" category.
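
Eyeballing the per-country percentages can be backed up with a chi-square statistic on the contingency table of country versus "Acronym is missing". The sketch below computes the statistic by hand on made-up data with two invented countries, `A` and `B`; it is an illustration of the idea, not a result from the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: country A rarely lacks an acronym, country B usually does
toy = pd.DataFrame({
    "Country": ["A"] * 10 + ["B"] * 10,
    "Acronym": ["X"] * 8 + [None] * 2 + ["Y"] * 2 + [None] * 8,
})

# Contingency table: rows are countries, columns are missing False/True
observed = pd.crosstab(toy["Country"], toy["Acronym"].isnull())
obs = observed.to_numpy()

# Expected counts if missingness were independent of country
row_tot = obs.sum(axis=1)[:, None]
col_tot = obs.sum(axis=0)[None, :]
expected = row_tot * col_tot / obs.sum()

# Pearson chi-square statistic and its degrees of freedom
chi2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)

print(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}")
```

A statistic that is large relative to the chi-square distribution with `dof` degrees of freedom suggests that missingness depends on the country. On the real data many countries have only a few trials, so expected counts can be small and the statistic should be read with caution.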

In [21]:

# impute by a missing Indicator


df.Acronym = df.Acronym.fillna("Missing Acronym")

In [22]:

# Detecting (Percentage) Missing Data


df.isnull().mean() * 100

Out[22]:

NCT Number 0.000000


Title 0.000000
Acronym 0.000000
Status 0.000000
Study Results 0.000000
Conditions 0.000000
Interventions 15.320768
Outcome Measures 0.605222
Sponsor/Collaborators 0.000000
Gender 0.172921
Age 0.000000
Phases 42.555767
Enrollment 0.587930
Funded Bys 0.000000
Study Type 0.000000
Study Designs 0.605222
Other IDs 0.017292
Start Date 0.587930
Primary Completion Date 0.622514
Completion Date 0.622514
First Posted 0.000000
Last Update Posted 0.000000
Locations 10.115857
URL 0.000000
Country 0.000000

dtype: float64

We can do the same for the other categorical features with missing values, such as Interventions, Phases and
Locations.

In [23]:

# Impute Interventions , Phases , Locations by Missing Category

categorical_features = df.select_dtypes(include = object).columns

features = categorical_features[df[categorical_features].isnull().mean() > 0]

for feature in features:
    df[feature] = df[feature].fillna(f"Missing {feature}")

In [24]:

# Detecting (Percentage) Missing Data


df.isnull().mean() * 100

Out[24]:

NCT Number 0.00000


Title 0.00000
Acronym 0.00000
Status 0.00000
Study Results 0.00000
Conditions 0.00000
Interventions 0.00000
Outcome Measures 0.00000
Sponsor/Collaborators 0.00000
Gender 0.00000
Age 0.00000
Phases 0.00000
Enrollment 0.58793
Funded Bys 0.00000
Study Type 0.00000
Study Designs 0.00000
Other IDs 0.00000
Start Date 0.00000
Primary Completion Date 0.00000
Completion Date 0.00000
First Posted 0.00000
Last Update Posted 0.00000
Locations 0.00000
URL 0.00000
Country 0.00000

dtype: float64

Now it is time to handle the missing data in Enrollment.

In [25]:

# Check the skewness


df.Enrollment.skew()

Out[25]:

34.06593382031148

The skewness is about 34, which means this feature is heavily right-skewed and far from normally distributed.

In [26]:

# Plotting the distribution of the enrollment


df.Enrollment.plot(kind = 'kde')

Out[26]:

<AxesSubplot:ylabel='Density'>

Because the distribution is so skewed, we will impute with the median, which is robust to extreme values.

In [27]:

# Some statistical values for the Enrollment column

min_Value = df.Enrollment.min()
max_Value = df.Enrollment.max()
mean_Value = df.Enrollment.mean()
median_Value = df.Enrollment.median()
std_Value = df.Enrollment.std()

print(f"The min value is {min_Value}\n"
      f"The max value is {max_Value}\n"
      f"The mean is {mean_Value}\n"
      f"The median is {median_Value}\n"
      f"The standard deviation is {std_Value}")

The min value is 0.0
The max value is 20000000.0
The mean is 18319.48860671421
The median is 170.0
The standard deviation is 404543.7287841079
In [28]:

# Using Median to impute Missing Values


df.Enrollment = df.Enrollment.fillna(median_Value)
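
To see why the median is the safer fill value for a column this skewed, it helps to compare both options on a tiny example. The numbers below are invented, apart from the extreme value of 20,000,000 that mirrors the maximum reported above.

```python
import numpy as np
import pandas as pd

# Invented enrollment figures with one extreme outlier and one missing value
enrollment = pd.Series([20, 50, 100, 170, 300, 1000, 20_000_000, np.nan])

mean_value = enrollment.mean()      # pulled far upwards by the outlier
median_value = enrollment.median()  # unaffected by the outlier

mean_fill = enrollment.fillna(mean_value)
median_fill = enrollment.fillna(median_value)

print(f"Filled with mean:   {mean_fill.iloc[-1]:,.0f}")
print(f"Filled with median: {median_fill.iloc[-1]:,.0f}")
```

The mean-based fill would assign the missing trial an enrollment in the millions, while the median-based fill stays at a typical trial size.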

In [29]:

# Detecting (Percentage) Missing Data


df.isnull().mean() * 100

Out[29]:

NCT Number 0.0


Title 0.0
Acronym 0.0
Status 0.0
Study Results 0.0
Conditions 0.0
Interventions 0.0
Outcome Measures 0.0
Sponsor/Collaborators 0.0
Gender 0.0
Age 0.0
Phases 0.0
Enrollment 0.0
Funded Bys 0.0
Study Type 0.0
Study Designs 0.0
Other IDs 0.0
Start Date 0.0
Primary Completion Date 0.0
Completion Date 0.0
First Posted 0.0
Last Update Posted 0.0
Locations 0.0
URL 0.0
Country 0.0

dtype: float64

In [30]:

df.head()
Out[30]:

(Wide table; only a selection of the columns is shown here.)

Rank  NCT Number   Status                  Study Results         Country
1     NCT04785898  Active, not recruiting  No Results Available  France
2     NCT04595136  Not yet recruiting      No Results Available  Colombia
3     NCT04395482  Recruiting              No Results Available  San Marino
4     NCT04416061  Active, not recruiting  No Results Available  Hong Kong
5     NCT04395924  Recruiting              No Results Available  France

5 rows × 25 columns

Data Visualizations

In [31]:

# Get countries with the highest contributions
# (trials without a location appear here as the string 'nan')
top_10_countries = df.Country.value_counts()[:10]
visualize_data(top_10_countries , caption = 'Top 10 Countries' , ylabel = 'Contributions')

In [32]:

# Status of the Application
status = df.Status.value_counts()

visualize_data(status , caption = 'Status of The Application' , ylabel = 'Number of Trials')


In [33]:

# Gender Visualizations
gender = df.Gender.value_counts()
visualize_data(gender , caption = 'Gender Distribution' , ylabel = 'Number of Trials')
In [34]:

# Which month has the highest number of trial starts
# (the month is the first word of the Start Date string)
start_month = pd.Series([ str(df['Start Date'].iloc[i]).split(' ')[0] for i in range(df.shape[0])])

start_month_Distribution = start_month.value_counts()

visualize_data(start_month_Distribution , caption = 'Start Month Distribution' , ylabel = 'Number of Trials')
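
`value_counts` orders the months by frequency, and after the earlier imputation the placeholder `Missing Start Date` contributes a bar labelled `Missing`. If a calendar-ordered view is preferred, the counts can be reindexed by month name. The sketch below assumes start dates of the form "Month D, YYYY" and uses invented values.

```python
import calendar
import pandas as pd

# Invented start dates plus the placeholder used for imputed values
start_dates = pd.Series([
    "March 15, 2020",
    "June 15, 2020",
    "March 1, 2021",
    "January 5, 2021",
    "Missing Start Date",
])

# The month name is the first word of each string
first_token = start_dates.str.split(" ").str[0]

# English month names in calendar order (calendar.month_name[0] is empty)
month_order = list(calendar.month_name)[1:]

# Keep real month names only, then count and order them January..December
by_month = (
    first_token[first_token.isin(month_order)]
    .value_counts()
    .reindex(month_order, fill_value=0)
)
print(by_month)
```

Passing `by_month` to `visualize_data` would then draw the bars from January to December, with months that have no trials shown as zero.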
In [35]:

print(f"The shape of data frame is {df.shape}")


print(f"Nunique in NCT Number is {df['NCT Number'].nunique()}")
print(f"Nunique in URL is {df.URL.nunique()}")

The shape of data frame is (5783, 25)


Nunique in NCT Number is 5783
Nunique in URL is 5783

So if we are going to apply a machine learning (ML) model, we can drop NCT Number and URL: both are unique for
every row, and Rank already serves as the index. This reduces the number of categorical features, which matters
because they need to be encoded before they can be used in an ML model.
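
As a rough illustration of that last step, the identifier columns can be dropped and a low-cardinality categorical column one-hot encoded with `pd.get_dummies`. The frame below is made up, and choosing `Gender` as the column to encode is an assumption for the example; high-cardinality text columns such as Title would need a different treatment.

```python
import pandas as pd

# Made-up rows with hypothetical identifiers and URLs
toy = pd.DataFrame({
    "NCT Number": ["NCT001", "NCT002", "NCT003"],
    "URL": ["u1", "u2", "u3"],
    "Gender": ["All", "Female", "All"],
    "Enrollment": [100.0, 50.0, 170.0],
})

# Identifier-like columns carry no predictive signal, so drop them
features = toy.drop(columns=["NCT Number", "URL"])

# One-hot encode the remaining categorical column
encoded = pd.get_dummies(features, columns=["Gender"])
print(encoded)
```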
