University of Geneva GSEM

Statistics I Fall 2017


Prof. Eva Cantoni Practical 4

Review of Practicals 1-3 and a full data analysis

Goals: The objective of this practical is to review the descriptive statistics, plots and basic
R programming we have learned in the first three practicals. In addition, a new data
analysis is performed to consolidate what has been learned so far, while learning about a few
extra possibilities of R.

1 Revision
Revisit Practicals 1, 2 and 3 (the files Practical1.R, Practical2.R and Practical3.R are
provided in the corresponding folders in Chamilo). Make sure you understand what each
command does in R.

2 A Full Data Analysis


Import the dataset Cities.csv into R and store it as a data frame called Cities. This
dataset contains information on the economic conditions in 48 cities around the world in
1991. The variables contained in the dataset are the following:

• “City” : City name

• “Work” : Weighted average of the number of working hours in 12 occupations

• “Price” : Index of the cost of 112 goods and services excluding rent (Zurich = 100)

• “Salary” : Index of hourly earnings in 12 occupations after deductions (Zurich = 100)
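
As a minimal sketch of the import step, the code below reads a CSV file with read.csv(). Since Cities.csv itself is not reproduced here, it first writes a tiny stand-in file with made-up rows (hypothetical city names and values, not the real data) to a temporary folder. It also assumes a comma-separated file with a header row; if the real file uses another separator, the sep argument has to be adapted.

```r
# Toy stand-in for Cities.csv: made-up rows, NOT the real dataset.
toy_path <- file.path(tempdir(), "Cities_toy.csv")
writeLines(c(
  "City,Work,Price,Salary",
  "CityA,1800,65.5,40.2",
  "CityB,2100,45.0,NA",
  "CityC,NA,100.0,100.0"
), toy_path)

# For the real data, replace toy_path with the path to Cities.csv.
# Empty cells and the string "NA" are read as missing values by default.
Cities <- read.csv(toy_path)

# Quick check of the dimensions and of the type of each variable.
str(Cities)
```

With the real file in the working directory, the call would reduce to Cities <- read.csv("Cities.csv").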

Since each summary statistic and plot we have learned so far is suited to a particular type
of variable, it is necessary to know what kind of data you have in your file before
summarizing or visualizing it. You can check the type of variables in the Cities dataset with:
str(Cities)
or with:
class(Cities$...) # ... has to be replaced by a variable name
to get the type of each variable. Use
summary(Cities)

to get a basic description of the dataset.

Look at the entire dataset by typing Cities in R. What do you observe?


The dataset contains some missing values, coded NA. Some functions (e.g. vioplot)
cannot handle missing values and need special treatment; see step 4 below.
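
One possible way to locate the missing values is sketched below. It uses a small hypothetical data frame with made-up values (not the real Cities data), so that it runs on its own; the same calls can be applied to Cities.

```r
# Hypothetical toy data frame with made-up values (NOT the real dataset).
Cities <- data.frame(
  City   = c("CityA", "CityB", "CityC", "CityD"),
  Work   = c(1800, NA, 2000, 1950),
  Price  = c(65.5, 45.0, NA, 100),
  Salary = c(40.2, NA, 55, 100)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(Cities))
na_per_column

# Rows that contain at least one missing value.
incomplete_rows <- Cities[!complete.cases(Cities), ]
incomplete_rows

# Many summary functions skip missing values when na.rm = TRUE is set.
mean(Cities$Work, na.rm = TRUE)
```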

To perform your data analysis, consider the following steps:

1. Provide summary statistics for the variables in the dataset that are of a suitable type.
When appropriate, draw a kernel density plot to check whether their distributions are
symmetric or not.
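
As a hedged sketch of this step, the code below computes summary statistics and draws a kernel density plot for one numerical variable. It uses simulated values in a hypothetical vector x (not the Cities data); x can be replaced by a column such as Cities$Work.

```r
# Simulated stand-in for one numerical column (NOT the real Cities data).
set.seed(123)
x <- c(rlnorm(47, meanlog = 4, sdlog = 0.5), NA)  # includes one missing value

# Summary statistics; na.rm = TRUE drops the missing value.
summary(x)
x_mean   <- mean(x, na.rm = TRUE)
x_median <- median(x, na.rm = TRUE)
x_sd     <- sd(x, na.rm = TRUE)

# Kernel density estimate; density() also accepts na.rm = TRUE.
dens <- density(x, na.rm = TRUE)
plot(dens, main = "Kernel density of x")
abline(v = c(x_mean, x_median), lty = c(1, 2))  # solid: mean, dashed: median
```

For a roughly symmetric distribution the mean and the median lie close together; a mean clearly above the median often points to a longer right tail.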

2. Draw boxplots of all the numerical (continuous) variables in a single graphical window.
You can use the par() function with the option mfrow=c(nrows, ncols)
to create a matrix of nrows by ncols plots that is filled in by row. For example, if
you need the plots to be arranged horizontally, set nrows=1.
par(mfrow = c(1, 3)) # 3 figures arranged in a row
boxplot(Cities$Work, col = "lightsalmon1")
# with the default color changed to lightsalmon1
boxplot(Cities$Price, col = "mediumseagreen")
boxplot(Cities$Salary, col = "goldenrod2")
par(mfrow = c(1, 1)) # back to the default setting

What can you say about the distribution of each variable by looking only at the
boxplots?

3. Draw histograms of all the numerical (continuous) variables in a single graphical
window. Here as well, use the col parameter to change the default color.
Describe the distribution of the variables with this new information.
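
A possible layout for this step is sketched below, following the same par(mfrow = ...) pattern as for the boxplots. The data frame toy and its values are simulated for illustration only (not the Cities data); with the real data, toy would be replaced by Cities.

```r
# Simulated stand-ins for the three numerical columns (NOT the real data).
set.seed(42)
toy <- data.frame(
  Work   = c(rnorm(47, mean = 1900, sd = 170), NA),  # one missing value
  Price  = rnorm(48, mean = 70, sd = 20),
  Salary = rlnorm(48, meanlog = 3.5, sdlog = 0.6)
)

old_par <- par(mfrow = c(1, 3))  # 3 figures in a row; remember old settings
# hist() drops missing values itself, so na.omit() is not needed here.
h_work <- hist(toy$Work, col = "lightsalmon1", main = "Work", xlab = "Work")
hist(toy$Price, col = "mediumseagreen", main = "Price", xlab = "Price")
hist(toy$Salary, col = "goldenrod2", main = "Salary", xlab = "Salary")
par(old_par)  # restore the previous settings
```

Saving the value returned by par() and passing it back afterwards is an alternative to resetting mfrow to c(1, 1) by hand.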

4. Draw violin plots of all the numerical (continuous) variables in a single graphical
window. You have to use the function na.omit() here to eliminate the missing values.
library(vioplot) # the vioplot package must be installed and loaded
par(mfrow = c(1, 3)) # 3 figures arranged in a row
vioplot(na.omit(Cities$Work))
vioplot(na.omit(Cities$Price))
vioplot(na.omit(Cities$Salary))
par(mfrow = c(1, 1)) # back to the default setting

Describe the distribution of the variables with this new information. Try changing the
bandwidth manually with the parameter h. What do you observe?
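
One way to explore the effect of h is sketched below, assuming the vioplot package is installed (the plots are skipped otherwise). The vector x is simulated for illustration (not the Cities data), and the three values of h are arbitrary choices on the scale of x.

```r
# Simulated stand-in for one numerical column (NOT the real Cities data).
set.seed(1)
x <- c(rnorm(40, mean = 50, sd = 10), NA)  # includes one missing value
x_clean <- na.omit(x)                      # remove the missing value

if (requireNamespace("vioplot", quietly = TRUE)) {
  old_par <- par(mfrow = c(1, 3))  # 3 figures arranged in a row
  for (h in c(1, 4, 12)) {         # small, medium and large bandwidth
    vioplot::vioplot(x_clean, h = h, col = "goldenrod2")
    title(main = paste("h =", h))
  }
  par(old_par)                     # restore the previous settings
}
```

A small h follows the individual observations closely and gives a bumpy outline, while a large h smooths most of the detail away.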

5. Using QQ-plots, compare the empirical distribution of each of the variables Work, Price
and Salary with the Gaussian distribution, and draw a reference line. Does the Gaussian
distribution fit well?
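
A minimal sketch of a normal QQ-plot with a reference line, using the base R functions qqnorm() and qqline(), is given below. The vector x is simulated (not the Cities data) and stands for one of the three variables.

```r
# Simulated stand-in for one numerical column (NOT the real Cities data).
set.seed(7)
x <- rnorm(48, mean = 70, sd = 20)

# Sample quantiles of x against theoretical Gaussian quantiles.
qq <- qqnorm(x, main = "Normal QQ-plot of x")

# Reference line through the first and third quartiles.
qqline(x, col = "red")
```

Points that stay close to the line are compatible with a Gaussian distribution; systematic curvature at the ends indicates skewness or heavier tails.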
