COVID-19 Clinical Trials EDA Pandas

Uploaded by

Vamshi Krishna reddy

Project Title: COVID-19 Clinical Trials EDA Pandas

Tools: Python, ML, SQL, Excel

Technologies: Data Analyst & Data Scientist

Project Difficulty Level: Intermediate

Dataset: The dataset is available at the given link. You can download it at your convenience.

Click here to download data set

About Dataset
Dataset Description
ClinicalTrials.gov is a database of privately and publicly funded clinical studies conducted around the world. It is
maintained by the National Institutes of Health. All data is publicly available, and the site provides a direct download
feature that makes it easy to use relevant data for analysis.

This dataset consists of clinical trials related to COVID 19 studies presented on the site.

The dataset consists of XML files where each XML file corresponds to one study. The filename is the NCT number
which is a unique identifier of a study in the ClinicalTrials repository. Additionally, a CSV file has also been provided,
which might not have as much information as contained in the XML file, but does give sufficient information.
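
As a rough illustration of the one-file-per-study layout described above, the sketch below parses a single study record with Python's standard `xml.etree.ElementTree`. The XML snippet is a hand-written stand-in with a made-up NCT number, and the element names (`clinical_study`, `id_info/nct_id`, `brief_title`, `overall_status`) are assumptions about the file layout; check them against a real file from the dataset before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written stand-in for one study file.
# The element names are assumptions; inspect a real file to confirm them.
SAMPLE_XML = """
<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title>Example study title</brief_title>
  <overall_status>Recruiting</overall_status>
</clinical_study>
"""


def parse_study(xml_text):
    """Return a flat dict with a few fields from one study record."""
    root = ET.fromstring(xml_text.strip())
    return {
        # findtext returns None when the element is absent
        "nct_id": root.findtext("id_info/nct_id"),
        "title": root.findtext("brief_title"),
        "status": root.findtext("overall_status"),
    }


record = parse_study(SAMPLE_XML)
print(record)
```

A list of such dicts, one per file, can be passed to `pd.DataFrame(...)` to get a table comparable to the CSV.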

Please refer to this notebook for details on the dataset :

https://www.kaggle.com/parulpandey/eda-on-covid-19-clinical-trials
Acknowledgements
ClinicalTrials.gov is a resource provided by the U.S. National Library of Medicine.

IMPORTANT:
Listing a study does not mean it has been evaluated by the U.S. Federal Government. Read our disclaimer for
details.
Before participating in a study, talk to your health care provider and learn about the risks and potential benefits.

NOTE:
1. This project is for guidance only; you do not have to reproduce it exactly. It shows the steps you can follow
and what a finished project can look like. Some projects are very advanced (they use Flask, NLP, advanced AI,
advanced DL and similar techniques) and may be hard to follow at first.
2. You can build and analyze the project yourself, with your own ideas. Make it more creative so that it yields
information and insight about the business. Make sure you understand everything you have created.

Example
Steps you should follow

Here's a step-by-step guide for performing Exploratory Data Analysis (EDA) on a
COVID-19 Clinical Trials dataset using Pandas, tailored for beginners.

Project Title:

Exploratory Data Analysis of COVID-19 Clinical Trials

1. Objective

The objective is to explore the dataset to gain insights into the characteristics of
COVID-19 clinical trials, such as their status, phases, study designs, and
demographics.

2. Importing Libraries and Loading Data

First, you'll need to import the necessary libraries and load your dataset.

import pandas as pd

# Load the dataset (replace with your dataset's path)

df = pd.read_csv('covid_clinical_trials.csv')

3. Initial Data Exploration

Start by exploring the basic structure and content of the dataset.

# View the first few rows of the dataset

print(df.head())

# Check the columns and data types (info() prints its report itself and returns None)

df.info()

# Summary statistics for numerical columns

print(df.describe())

# Summary statistics for categorical columns

print(df.describe(include='object'))

4. Handling Missing Data

Check for missing values and decide how to handle them.

# Check for missing values

print(df.isnull().sum())

# Drop columns with a high percentage of missing values, or fill them

df = df.drop(columns=['Acronym', 'Study Documents'])  # Example of dropping columns
df['Results First Posted'] = df['Results First Posted'].fillna('Unknown')  # Example of filling missing data

5. Univariate Analysis

Analyze each column individually to understand the distribution and key

characteristics.

● Status Distribution: Analyze the status of clinical trials (e.g., Completed, Recruiting).

print(df['Status'].value_counts())
df['Status'].value_counts().plot(kind='bar', title='Status of Clinical Trials')

● Phase Distribution: Understand the distribution of trial phases.

print(df['Phases'].value_counts())
df['Phases'].value_counts().plot(kind='bar', title='Distribution of Phases')

● Age Group Analysis: Analyze the distribution of age groups.

print(df['Age'].value_counts())
df['Age'].value_counts().plot(kind='bar', title='Age Group Distribution')

6. Bivariate Analysis

Explore relationships between different variables.

● Status vs. Phases: Explore how trial phases are distributed across different statuses.

status_phase = pd.crosstab(df['Status'], df['Phases'])

print(status_phase)
status_phase.plot(kind='bar', stacked=True, title='Status vs. Phases')
● Conditions vs. Outcome Measures: Understand the common outcome measures for different conditions.

# Drop missing outcome measures first; joining strings fails on NaN values
outcomes = df.dropna(subset=['Outcome Measures'])
conditions_outcomes = outcomes.groupby('Conditions')['Outcome Measures'].apply(', '.join).reset_index()
print(conditions_outcomes)
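
Fields such as Conditions can hold several values in one cell, separated by `|` (the sample output in this document shows `Maternal Fetal Infection Transmission|COVID-19...`). Grouping on the raw string then treats each combination as its own condition. The sketch below, on made-up toy data, shows one possible way to split such cells into one row per condition before counting; the `|` separator is an assumption based on that sample row.

```python
import pandas as pd

# Toy data standing in for the real columns; values are made up for illustration.
toy = pd.DataFrame({
    "Conditions": ["COVID-19|Pneumonia", "COVID-19", "Pneumonia|ARDS"],
    "Outcome Measures": ["Mortality", "Viral load", None],
})

# One row per (trial, condition): split on '|' and explode the resulting lists.
exploded = toy.assign(Conditions=toy["Conditions"].str.split("|")).explode("Conditions")

# Count trials per individual condition.
condition_counts = exploded["Conditions"].value_counts()
print(condition_counts)
```

The same pattern should apply to other multi-valued columns, provided they use the same separator.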

7. Time Series Analysis

Analyze the trends over time, such as the number of trials started over the months.

# Convert date columns to datetime

df['Start Date'] = pd.to_datetime(df['Start Date'], errors='coerce')
df['Primary Completion Date'] = pd.to_datetime(df['Primary Completion Date'], errors='coerce')

# Plot the number of trials started over time

df['Start Date'].dt.to_period('M').value_counts().sort_index().plot(kind='line', title='Trials Started Over Time')
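
To make the monthly counting step concrete, here is a minimal sketch on toy data. The date strings and the explicit `"%B %d, %Y"` format are assumptions for illustration (the real column may mix several formats); the point is that `errors='coerce'` turns unparseable values into `NaT`, which are then excluded from the counts.

```python
import pandas as pd

# Toy start dates; the 'not a date' entry shows what errors='coerce' does.
toy = pd.DataFrame({
    "Start Date": ["March 8, 2021", "March 20, 2021", "April 1, 2021", "not a date"],
})

# An explicit format avoids relying on pandas' format inference (assumed format).
start = pd.to_datetime(toy["Start Date"], format="%B %d, %Y", errors="coerce")

# Unparseable values become NaT and are dropped before counting per month.
monthly = start.dropna().dt.to_period("M").value_counts().sort_index()
print(monthly)
```

Checking `start.isna().sum()` after the conversion shows how many rows were lost to parsing, which is worth reporting alongside the plot.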

8. Conclusion

Summarize the findings from your EDA. For example:

● The majority of trials have the "Completed" status.

● Most trials target adult populations.
● There's a steady increase in the number of trials over time.

9. Saving Results

You can save the processed data or specific analysis results for further use.

# Save the cleaned data

df.to_csv('cleaned_covid_clinical_trials.csv', index=False)

10. Output and Visuals

After running the code, you should observe:

● Bar charts showing the distribution of trial statuses, phases, and age groups.
● A time series plot illustrating the trend of trials over time.

This project will provide a solid foundation in EDA using Pandas, with practical
insights into the clinical trials landscape for COVID-19.

Sample code

Import Required Libraries

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Load The DataSet

In [2]:

df = pd.read_csv('../input/covid19-clinical-trials-dataset/COVID clinical trials.csv' , index_col = 0)

Exploratory Data Analysis

In [3]:

# print the first 5 rows in the dataset

df.head(n = 5)
Out[3]:

Rank  NCT Number   Title                                              Status                  Conditions
1     NCT04785898  Diagnostic Performance of the ID Now™ COVID-19...  Active, not recruiting  Covid19
2     NCT04595136  Study to Evaluate the Efficacy of COVID19-0001...  Not yet recruiting      SARS-CoV-2 Infection
3     NCT04395482  Lung CT Scan Analysis of SARS-CoV2 Induced Lun...  Recruiting              covid19
4     NCT04416061  The Role of a Private Hospital in Hong Kong Am...  Active, not recruiting  COVID
5     NCT04395924  Maternal-foetal Transmission of SARS-Cov-2         Recruiting              Maternal Fetal Infection Transmission|COVID-19...

[Remaining columns not reproduced here]
5 rows × 26 columns
In [4]:

# Shape of the DataSet

df.shape

Out[4]:

(5783, 26)

In [5]:

# Columns in the dataset

df.columns

Out[5]:

Index(['NCT Number', 'Title', 'Acronym', 'Status', 'Study Results',

'Conditions', 'Interventions', 'Outcome Measures',
'Sponsor/Collaborators', 'Gender', 'Age', 'Phases', 'Enrollment',
'Funded Bys', 'Study Type', 'Study Designs', 'Other IDs', 'Start Date',
'Primary Completion Date', 'Completion Date', 'First Posted',
'Results First Posted', 'Last Update Posted', 'Locations',
'Study Documents', 'URL'],

dtype='object')

In [6]:

# Categorical Features
df.select_dtypes(include = 'object').columns

Out[6]:

Index(['NCT Number', 'Title', 'Acronym', 'Status', 'Study Results',

'Conditions', 'Interventions', 'Outcome Measures',
'Sponsor/Collaborators', 'Gender', 'Age', 'Phases', 'Funded Bys',
'Study Type', 'Study Designs', 'Other IDs', 'Start Date',
'Primary Completion Date', 'Completion Date', 'First Posted',
'Results First Posted', 'Last Update Posted', 'Locations',
'Study Documents', 'URL'],

dtype='object')
In [7]:

# Numerical Features
df.select_dtypes(exclude = 'object').columns

Out[7]:

Index(['Enrollment'], dtype='object')

In [8]:

# Detecting (Percentage) Missing Data

missing_data = df.isnull().mean() * 100
missing_data

Out[8]:

NCT Number 0.000000

Title 0.000000
Acronym 57.115684
Status 0.000000
Study Results 0.000000
Conditions 0.000000
Interventions 15.320768
Outcome Measures 0.605222
Sponsor/Collaborators 0.000000
Gender 0.172921
Age 0.000000
Phases 42.555767
Enrollment 0.587930
Funded Bys 0.000000
Study Type 0.000000
Study Designs 0.605222
Other IDs 0.017292
Start Date 0.587930
Primary Completion Date 0.622514
Completion Date 0.622514
First Posted 0.000000
Results First Posted 99.377486
Last Update Posted 0.000000
Locations 10.115857
Study Documents 96.852845
URL 0.000000
dtype: float64

In [9]:

# Visualize precomputed data (no calculation happens here)

def visualize_data(data , caption = '' , ylabel = 'Percentage of Missing Data'):

    # set figure size
    sns.set(rc={'figure.figsize':(15,8.27)})

    # plot at most the first 40 entries and set the title
    n = min(40 , len(data))
    sns.barplot(x = data.keys()[:n].tolist() , y = data.values[:n].tolist()).set_title(caption)

    # make ticks vertical (after plotting, so the rotation applies to the drawn labels)
    plt.xticks(rotation=90)

    # set labels
    plt.ylabel(ylabel)

    plt.show()

In [10]:

visualize_data(missing_data , 'Percentage of missing data in each feature')

As shown, the percentage of missing data is 99.4% in Results First Posted and 96.9% in Study Documents, so
it is not feasible to impute them without distorting our dataset.
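
The decision above can also be expressed as a rule rather than a hand-picked list. The following is a minimal sketch on toy data; the 90% cutoff (`MAX_MISSING_PCT`) is a hypothetical value chosen for illustration, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame: one mostly complete column, one half-missing, one fully missing.
toy = pd.DataFrame({
    "Status": ["Recruiting", "Completed", "Recruiting", "Completed"],
    "Acronym": ["ABC", np.nan, np.nan, "XYZ"],             # 50% missing
    "Study Documents": [np.nan, np.nan, np.nan, np.nan],   # 100% missing
})

MAX_MISSING_PCT = 90.0  # hypothetical cutoff, tune it for your analysis

# Percentage of missing values per column, as in the cell above.
missing_pct = toy.isnull().mean() * 100

# Columns above the cutoff are dropped; the rest are kept for imputation.
to_drop = missing_pct[missing_pct > MAX_MISSING_PCT].index.tolist()
cleaned = toy.drop(columns=to_drop)
print(to_drop)
```

With the percentages reported above, a 90% cutoff would select the same two columns that the next cell drops by name.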

In [11]:

# Drop Study Documents and Results First Posted

df.drop(['Results First Posted' , 'Study Documents'] , inplace = True , axis = 1 )

In [12]:

# Columns in the dataset after dropping Study Documents and Results First Posted
df.columns

Out[12]:

Index(['NCT Number', 'Title', 'Acronym', 'Status', 'Study Results',

dtype='object')

In [13]:

# Drop Duplicate Rows

print(f"Shape before dropping duplicates data {df.shape}")
df.drop_duplicates(inplace = True)
print(f"Shape after dropping duplicates data {df.shape}")

Shape before dropping duplicates data (5783, 24)

Shape after dropping duplicates data (5783, 24)

There are no duplicate rows in the dataset.

In [14]:

# Drop rows that have fewer than 10 non-null values
# (thresh alone is enough; recent pandas versions reject how= combined with thresh=)

print(f"Shape before dropping Null rows {df.shape}")
df.dropna(axis = 0 , thresh = 10 , inplace = True)
print(f"Shape after dropping Null rows {df.shape}")

Shape before dropping Null rows (5783, 24)

Shape after dropping Null rows (5783, 24)

There are no rows with fewer than 10 non-null values.
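
Since `thresh` is easy to misread, here is a small sketch of its meaning on toy data: `thresh=k` keeps a row only if it has at least `k` non-null values. The column names and values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 2, np.nan],
    "c": [3, 3, 3],
})

# Number of non-null values in each row: 3, 2 and 1.
non_null_per_row = toy.notnull().sum(axis=1)

# thresh=2 keeps rows with at least 2 non-null values, so the last row is dropped.
kept = toy.dropna(axis=0, thresh=2)

print(non_null_per_row.tolist())
print(kept.index.tolist())
```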

In [15]:

df.isnull().mean() * 100

Out[15]:

NCT Number 0.000000

Title 0.000000
Acronym 57.115684
Status 0.000000
Study Results 0.000000
Conditions 0.000000
Interventions 15.320768
Outcome Measures 0.605222
Sponsor/Collaborators 0.000000
Gender 0.172921
Age 0.000000
Phases 42.555767
Enrollment 0.587930
Funded Bys 0.000000
Study Type 0.000000
Study Designs 0.605222
Other IDs 0.017292
Start Date 0.587930
Primary Completion Date 0.622514
Completion Date 0.622514
First Posted 0.000000
Last Update Posted 0.000000
Locations 10.115857
URL 0.000000

dtype: float64

In [16]:

# We can extract a new feature from Locations: the country where the study is held
countries = [ str(df.Locations.iloc[i]).split(',')[-1] for i in range(df.shape[0])]
df['Country'] = countries

In [17]:

df.columns

Out[17]:

Index(['NCT Number', 'Title', 'Acronym', 'Status', 'Study Results',

'Conditions', 'Interventions', 'Outcome Measures',
'Sponsor/Collaborators', 'Gender', 'Age', 'Phases', 'Enrollment',
'Funded Bys', 'Study Type', 'Study Designs', 'Other IDs', 'Start Date',
'Primary Completion Date', 'Completion Date', 'First Posted',
'Last Update Posted', 'Locations', 'URL', 'Country'],

dtype='object')
In [18]:

df.Country.value_counts()[:35]

Out[18]:

United States 1267

France 647
nan 585
United Kingdom 306
Italy 235
Spain 234
Turkey 219
Canada 202
Egypt 192
China 171
Brazil 137
Germany 128
Belgium 91
Mexico 88
Switzerland 76
Russian Federation 69
Sweden 57
Denmark 56
Israel 56
India 55
Pakistan 53
Argentina 47
Netherlands 46
Norway 38
Hong Kong 36
Colombia 33
Republic of 31
Austria 29
Poland 29
Singapore 29
Saudi Arabia 27
Australia 26
Greece 26
Islamic Republic of 23
South Africa 22

Name: Country, dtype: int64
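
The counts above include labels such as `Republic of` and `Islamic Republic of`. These look like fragments of country names that contain a comma themselves (for instance `Korea, Republic of`), which splitting on the last comma cuts in half; the `nan` label comes from missing Locations being converted to a string. The sketch below is one possible way to handle both cases. The list `COMMA_COUNTRIES` is an assumed, incomplete list, and the function name `extract_country` is hypothetical.

```python
# Country names that themselves contain a comma (an assumed, incomplete list).
COMMA_COUNTRIES = ("Korea, Republic of", "Iran, Islamic Republic of")


def extract_country(location):
    """Return the country of the last listed site, or None if it is missing."""
    if not isinstance(location, str) or not location.strip():
        return None
    # Several sites may be joined with '|'; keep the last one, as the original split does.
    last_site = location.split("|")[-1].strip()
    for name in COMMA_COUNTRIES:
        if last_site.endswith(name):
            return name
    return last_site.split(",")[-1].strip()


examples = [
    "Some Hospital, Paris, France",
    "Some Clinic, Seoul, Korea, Republic of",
    "Site A, Rome, Italy|Site B, Tehran, Iran, Islamic Republic of",
    float("nan"),
]
countries = [extract_country(loc) for loc in examples]
print(countries)
```

Applied to the data, this would be `df['Country'] = df['Locations'].map(extract_country)`. Missing locations then stay missing instead of becoming the string `nan`.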

Now we need to classify the missing data into one of these categories:

1) Missing Completely At Random (MCAR)

2) Missing At Random (MAR)

3) Not Missing At Random (NMAR)

In [19]:

# Let's start with Acronym

print(f"Number of unique values is {df.Acronym.nunique()} \n")

df.Acronym.value_counts()

Number of unique values is 2338

Out[19]:

COVID-19 47
PROTECT 7
CORONA 6
RECOVER 5
SCOPE 5
..
ASD 1
VICO 1
LICORNE 1
LOSVID 1
MindMyMindFU 1

Name: Acronym, Length: 2338, dtype: int64

In [20]:

# Find the relation between null values in Acronym and Country

(df.Acronym.isnull().groupby(df.Country).mean().sort_values(ascending = False) *
100)[:60]

Out[20]:

Country
Iraq 100.000000
Belarus 100.000000
Rwanda 100.000000
South Sudan 100.000000
Cambodia 100.000000
Bulgaria 100.000000
Cyprus 100.000000
Bosnia and Herzegovina 100.000000
Guinea-Bissau 100.000000
Dominican Republic 100.000000
Ecuador 100.000000
North Macedonia 100.000000
Bahrain 100.000000
Azerbaijan 100.000000
Uruguay 100.000000
Uzbekistan 100.000000
Kyrgyzstan 100.000000
Cape Verde 100.000000
Republic of 96.774194
Taiwan 93.750000
Singapore 93.103448
Japan 88.888889
Kuwait 87.500000
China 87.134503
Turkey 86.757991
Ukraine 85.714286
Malaysia 84.615385
Egypt 83.854167
Hungary 83.333333
Hong Kong 80.555556
Bangladesh 80.000000
India 80.000000
Kazakhstan 80.000000
Saudi Arabia 77.777778
Puerto Rico 76.470588
Israel 75.000000
Zimbabwe 75.000000
Jordan 72.727273
Poland 72.413793
Indonesia 71.428571
United States 69.376480
Romania 69.230769
Kenya 66.666667
Nepal 66.666667
New Zealand 66.666667
Ethiopia 66.666667
Slovakia 66.666667
Thailand 66.666667
Lebanon 66.666667
nan 66.324786
Islamic Republic of 65.217391
Russian Federation 65.217391
Chile 64.705882
Austria 62.068966
Pakistan 60.377358
Brazil 59.124088
Mexico 57.954545
Sweden 57.894737
Argentina 57.446809
Canada 55.940594

Name: Acronym, dtype: float64

● After inspecting the relation between the missing values in Acronym and Country, we can conclude that
the rate of missing values depends on the country, an observed feature, so we can treat the data as
Missing At Random (MAR).
● So we can impute with a "missing" category.
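
A caution when reading per-country missing rates: a rate of 100% may rest on a single trial, and the output above does not show group sizes. The sketch below, on made-up toy data, reports the missing percentage together with the number of trials per group so the two can be read side by side.

```python
import numpy as np
import pandas as pd

# Toy data: country A has 4 trials, B has 1, C has 2.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A", "B", "C", "C"],
    "Acronym": ["X", np.nan, "Y", "Z", np.nan, np.nan, "W"],
})

# Missing rate and group size per country.
is_missing = toy["Acronym"].isnull()
summary = (
    is_missing.groupby(toy["Country"])
    .agg(["mean", "size"])
    .rename(columns={"mean": "missing_pct", "size": "n_trials"})
)
summary["missing_pct"] = summary["missing_pct"] * 100

print(summary.sort_values("missing_pct", ascending=False))
```

In this toy example, country B shows 100% missing but contributes only one trial, which is weak evidence of a real country effect.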

In [21]:

# impute by a missing Indicator

df.Acronym = df.Acronym.fillna("Missing Acronym")

In [22]:

# Detecting (Percentage) Missing Data

df.isnull().mean() * 100

Out[22]:

NCT Number 0.000000

Title 0.000000
Acronym 0.000000
Status 0.000000
Study Results 0.000000
Conditions 0.000000
Interventions 15.320768
Outcome Measures 0.605222
Sponsor/Collaborators 0.000000
Gender 0.172921
Age 0.000000
Phases 42.555767
Enrollment 0.587930
Funded Bys 0.000000
Study Type 0.000000
Study Designs 0.605222
Other IDs 0.017292
Start Date 0.587930
Primary Completion Date 0.622514
Completion Date 0.622514
First Posted 0.000000
Last Update Posted 0.000000
Locations 10.115857
URL 0.000000
Country 0.000000

dtype: float64

We can do the same for the other categorical features with missing values, such as Interventions, Phases and
Locations.

In [23]:

# Impute Interventions , Phases , Locations by Missing Category

categorical_features = df.select_dtypes(include = object).columns

# keep only the categorical columns that contain missing values
features = categorical_features[df[categorical_features].isnull().mean() > 0]

for feature in features:
    df[feature] = df[feature].fillna(f"Missing {feature}")

In [24]:

# Detecting (Percentage) Missing Data

df.isnull().mean() * 100

Out[24]:

NCT Number 0.00000

Title 0.00000
Acronym 0.00000
Status 0.00000
Study Results 0.00000
Conditions 0.00000
Interventions 0.00000
Outcome Measures 0.00000
Sponsor/Collaborators 0.00000
Gender 0.00000
Age 0.00000
Phases 0.00000
Enrollment 0.58793
Funded Bys 0.00000
Study Type 0.00000
Study Designs 0.00000
Other IDs 0.00000
Start Date 0.00000
Primary Completion Date 0.00000
Completion Date 0.00000
First Posted 0.00000
Last Update Posted 0.00000
Locations 0.00000
URL 0.00000
Country 0.00000

dtype: float64

Now it is time to handle the missing data in Enrollment.

In [25]:

# Check the skewness

df.Enrollment.skew()

Out[25]:

34.06593382031148

The skewness is about 34, which means that this feature is heavily right-skewed rather than normally distributed.

In [26]:

# Plotting the distribution of the enrollment

df.Enrollment.plot(kind = 'kde')

Out[26]:

<AxesSubplot:ylabel='Density'>
So we will impute with the median, which is robust to the extreme values that inflate the mean.
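
To see why the median is the safer fill value for a heavily skewed column, consider this minimal sketch on made-up enrollment numbers with one extreme value.

```python
import numpy as np
import pandas as pd

# Made-up enrollments: mostly modest sizes, one extreme value, one missing.
enrollment = pd.Series([50, 100, 170, 200, 300, 20_000_000, np.nan])

mean_value = enrollment.mean()      # pulled far up by the single extreme value
median_value = enrollment.median()  # middle of the observed values, barely affected

# Fill the missing entry with the median.
filled = enrollment.fillna(median_value)

print(f"mean = {mean_value:.1f}, median = {median_value}")
```

Here the mean lands in the millions although five of the six observed trials enroll at most 300 participants, so filling with the mean would insert an unrepresentative value.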

In [27]:

# Some statistical values for the Enrollment column

min_Value = df.Enrollment.min()
max_Value = df.Enrollment.max()
mean_Value = df.Enrollment.mean()
median_Value = df.Enrollment.median()
std_Value = df.Enrollment.std()

print(f"The min value is {min_Value}\n"
      f"The max value is {max_Value}\n"
      f"The mean is {mean_Value}\n"
      f"The median is {median_Value}\n"
      f"Standard deviation is {std_Value}")

The min value is 0.0
The max value is 20000000.0
The mean is 18319.48860671421
The median is 170.0
Standard deviation is 404543.7287841079
In [28]:

# Using Median to impute Missing Values

df.Enrollment = df.Enrollment.fillna(median_Value)

In [29]:

# Detecting (Percentage) Missing Data

df.isnull().mean() * 100

Out[29]:

NCT Number 0.0

Title 0.0
Acronym 0.0
Status 0.0
Study Results 0.0
Conditions 0.0
Interventions 0.0
Outcome Measures 0.0
Sponsor/Collaborators 0.0
Gender 0.0
Age 0.0
Phases 0.0
Enrollment 0.0
Funded Bys 0.0
Study Type 0.0
Study Designs 0.0
Other IDs 0.0
Start Date 0.0
Primary Completion Date 0.0
Completion Date 0.0
First Posted 0.0
Last Update Posted 0.0
Locations 0.0
URL 0.0
Country 0.0

dtype: float64

In [30]:

df.head()
Out[30]:

Rank  NCT Number   Title                                              Status                  Conditions                                         Country
1     NCT04785898  Diagnostic Performance of the ID Now™ COVID-19...  Active, not recruiting  Covid19                                            France
2     NCT04595136  Study to Evaluate the Efficacy of COVID19-0001...  Not yet recruiting      SARS-CoV-2 Infection                               Colombia
3     NCT04395482  Lung CT Scan Analysis of SARS-CoV2 Induced Lun...  Recruiting              covid19                                            San Marino
4     NCT04416061  The Role of a Private Hospital in Hong Kong Am...  Active, not recruiting  COVID                                              Hong Kong
5     NCT04395924  Maternal-foetal Transmission of SARS-Cov-2         Recruiting              Maternal Fetal Infection Transmission|COVID-19...  France

[Remaining columns not reproduced here]
5 rows × 25 columns

Data Visualizations

In [31]:

# Get the countries with the highest contributions

top_10_countries = df.Country.value_counts()[:10]
visualize_data(top_10_countries , caption = 'Top 10 Countries' , ylabel = 'Contributions')

In [32]:

# Status of the Application

status = df.Status.value_counts()

visualize_data(status , caption = 'Status of The Application' , ylabel = 'Count')

In [33]:

# Gender Visualizations
gender = df.Gender.value_counts()
visualize_data(gender , caption = 'Gender Distribution' , ylabel = 'Count')
In [34]:

# Which month has the most trial starts

start_month = pd.Series([ str(df['Start Date'].iloc[i]).split(' ')[0] for i in range(df.shape[0])])

start_month_Distribution = start_month.value_counts()

visualize_data(start_month_Distribution , caption = 'Start Month Distribution' , ylabel = 'Count')
In [35]:

print(f"The shape of data frame is {df.shape}")

print(f"Nunique in NCT Number is {df['NCT Number'].nunique()}")
print(f"Nunique in URL is {df.URL.nunique()}")

The shape of data frame is (5783, 25)

Nunique in NCT Number is 5783
Nunique in URL is 5783

So if we are going to apply a Machine Learning (ML) model, we can drop NCT Number and URL: every value is unique,
and the Rank index already identifies each row. This reduces the number of categorical features, which matters
because they need to be encoded in order to be used in an ML model.
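
As a minimal sketch of that preparation step, the code below drops identifier-like columns and one-hot encodes a remaining categorical column on toy data. The values are made up, and one-hot encoding via `pd.get_dummies` is only one of several possible encodings; the notebook itself does not prescribe one.

```python
import pandas as pd

toy = pd.DataFrame({
    "NCT Number": ["NCT00000001", "NCT00000002", "NCT00000003"],
    "URL": ["u1", "u2", "u3"],
    "Status": ["Recruiting", "Completed", "Recruiting"],
    "Enrollment": [100.0, 250.0, 80.0],
})

# Non-numeric columns where every value is unique behave like identifiers.
id_like = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col]) and toy[col].nunique() == len(toy)
]

features = toy.drop(columns=id_like)

# One-hot encode the remaining categorical column.
encoded = pd.get_dummies(features, columns=["Status"])
print(encoded.columns.tolist())
```

On the real data, free-text columns such as Title would likely also be flagged as identifier-like by this rule, so the resulting list should be reviewed before dropping anything.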

