STAT 7000: Lab 1.

Descriptive Statistics, Distributions


Yasin Fatemi, JooChul Lee

8/28/2024

1. Working with Data Frames


Let us begin with a simple data set. In 1609 Galileo proved mathematically that the trajectory of a body
falling with a horizontal velocity component is a parabola. His search for an experimental setting in which
horizontal motion was not affected appreciably (to study inertia) led him to construct a certain apparatus.
The data come from one of his experiments.

Location <- c("A", "A", "A", "B", "B", "B", "C")


Height <- c(100,200,300,450,600,800,1000)
Distance <- c(253,337,395,451,495,534,573)

Create a data frame called Galileo with the three variables

Galileo <- data.frame(Location, Height, Distance)

Check contents of the data frame

Galileo # display the content of the dataframe


head(Galileo) # display the first 6 rows
tail(Galileo) # display the last 6 rows
str(Galileo) # display the structure
dim(Galileo) # the number of rows and columns
names(Galileo) # the names of the variables

Index the data frame

Galileo$Height # output a vector


Galileo[[2]] # output a vector (same as previous)
Galileo[2] # output a data frame
Galileo["Height"] # output a data frame (same as previous)
Galileo[c(1,3)] # output a data frame
Galileo[-2] # output a data frame (same as previous)

Galileo[1,2] # value in row 1, column 2


Galileo[ ,2] # all values in column 2
Galileo[1, ] # all values in row 1
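
Rows can also be selected with a logical condition, which the later parts of this lab rely on. The following is a minimal sketch that rebuilds the data frame so it runs on its own; the names tall, b_only, and col_df are illustrative and are not part of the lab.

```r
# Rebuild the lab's data frame so this sketch is self-contained
Location <- c("A", "A", "A", "B", "B", "B", "C")
Height <- c(100, 200, 300, 450, 600, 800, 1000)
Distance <- c(253, 337, 395, 451, 495, 534, 573)
Galileo <- data.frame(Location, Height, Distance)

# Keep the rows whose Height exceeds 500 (all columns)
tall <- Galileo[Galileo$Height > 500, ]

# Keep the rows for Location "B", restricted to two named columns
b_only <- Galileo[Galileo$Location == "B", c("Height", "Distance")]

# Galileo[ , 2] drops to a vector; drop = FALSE keeps a one-column data frame
col_df <- Galileo[ , 2, drop = FALSE]
```

The condition before the comma picks rows and the entry after the comma picks columns; leaving either side empty keeps everything along that dimension.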

Summary statistics

summary(Galileo) # summary statistics for a data frame


summary(Galileo$Distance) # summary statistics for a variable

length(Galileo$Distance) # count the number of components


mean(Galileo$Distance) # mean
sd(Galileo$Distance) # standard deviation
var(Galileo$Distance) # variance
min(Galileo$Distance) # minimum
max(Galileo$Distance) # maximum
median(Galileo$Distance) # median
IQR(Galileo$Distance) # IQR
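
As a quick check on how these statistics relate to one another, the sketch below recomputes two of them by hand and adds a per-group mean. It assumes the default settings of quantile() and IQR(); the object names (sd_matches_var, iqr_by_hand, mean_by_loc) are illustrative only.

```r
Location <- c("A", "A", "A", "B", "B", "B", "C")
Distance <- c(253, 337, 395, 451, 495, 534, 573)

# The standard deviation is the square root of the variance
sd_matches_var <- isTRUE(all.equal(sd(Distance), sqrt(var(Distance))))

# With default settings, IQR() is the gap between the 75th and 25th
# percentiles returned by quantile()
q <- quantile(Distance, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])
iqr_matches <- isTRUE(all.equal(IQR(Distance), iqr_by_hand))

# Mean Distance within each Location
mean_by_loc <- tapply(Distance, Location, mean)
```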

Create and add variables to data frames

Create a variable for the estimated distance, $\widehat{D} = 200 + 0.708\,\mathrm{Height} - 0.000344\,\mathrm{Height}^2$,
and add it to the data frame Galileo as D.Hat.

Galileo$D.Hat <- 200 + .708*Galileo$Height - .000344*Galileo$Height^2

Create a new variable LO that takes a value of TRUE when the estimated distance is lower than the measured
distance (D.Hat < Distance) and a value of FALSE otherwise and add it to the data frame Galileo. Use
this to get a subset of the Galileo data frame removing the observations for which the estimated distance
is lower than the measured distance.

# Create the variable LO


Galileo$LO <- Galileo$D.Hat < Galileo$Distance

# Remove cases whose estimated distance is lower than the measured distance
Galileo[!Galileo$LO, ]
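
The same subset can be written with subset(), which some readers find easier to scan. This is a sketch of an alternative and not the lab's prescribed method; kept, same_rows, and lo_counts are illustrative names.

```r
# Rebuild the data frame with the fitted values and the LO flag
Galileo <- data.frame(
  Location = c("A", "A", "A", "B", "B", "B", "C"),
  Height = c(100, 200, 300, 450, 600, 800, 1000),
  Distance = c(253, 337, 395, 451, 495, 534, 573)
)
Galileo$D.Hat <- 200 + 0.708 * Galileo$Height - 0.000344 * Galileo$Height^2
Galileo$LO <- Galileo$D.Hat < Galileo$Distance

# subset() evaluates the condition inside the data frame,
# so the Galileo$ prefix is not needed
kept <- subset(Galileo, !LO)

# Check that this selects the same rows as the bracket form
same_rows <- identical(rownames(kept), rownames(Galileo[!Galileo$LO, ]))

# Count how many observations fall on each side of the comparison
lo_counts <- table(Galileo$LO)
```

Neither form modifies Galileo itself; assign the result to a name, as done with kept here, to keep the reduced data frame.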

2. Motivation and Creativity


For Case Study 1: Motivation and Creativity from the textbook, the following questions are posed: Do
grading systems promote creativity in students? Do ranking systems and incentive awards increase productivity
among employees? Do rewards and praise stimulate children to learn?
The data come from an experiment concerning the effects of intrinsic and extrinsic motivation on creativity.
Subjects with considerable experience in creative writing were randomly assigned to one of two treatment
groups (page 2 of the textbook).
Install the package associated with the textbook data. You only need to do this once.

install.packages("Sleuth3")

Load the library and look at the summary of the data.

library(Sleuth3)
summary(case0101)

Obtain summary statistics of the scores for the two treatment groups.

# save scores of intrinsic


int.score <- case0101$Score[case0101$Treatment == "Intrinsic"]

# save scores of extrinsic


ext.score <- case0101$Score[case0101$Treatment == "Extrinsic"]

# get summary statistics of the two


summary(int.score)
summary(ext.score)

Plot side-by-side histograms of scores for the two treatments.

par(mfrow=c(1,2))
hist(int.score)
hist(ext.score)

Obtain stem-and-leaf plots of scores for the two treatments.

stem(int.score)
stem(ext.score)

Find the average score difference between the two treatment groups.

mean(int.score) - mean(ext.score)

Compare the variances of the scores in the two treatments.

var(int.score)
var(ext.score)
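
The group-wise means, their difference, and the variances can also be computed without first splitting the scores into separate vectors. The sketch below uses tapply() on a small made-up data frame called toy, whose column names mirror case0101 but whose values are hypothetical and are not the textbook data.

```r
# Hypothetical toy scores, NOT the case0101 data
toy <- data.frame(
  Score = c(10, 12, 14, 20, 22, 24),
  Treatment = factor(rep(c("Extrinsic", "Intrinsic"), each = 3))
)

# One statistic per treatment group
group_means <- tapply(toy$Score, toy$Treatment, mean)
group_vars <- tapply(toy$Score, toy$Treatment, var)

# Difference in average score, intrinsic minus extrinsic
mean_diff <- group_means[["Intrinsic"]] - group_means[["Extrinsic"]]

# Ratio of variances as a rough comparison of spread
var_ratio <- group_vars[["Intrinsic"]] / group_vars[["Extrinsic"]]
```

With Sleuth3 loaded, replacing toy with case0101 in the tapply() calls should give the corresponding quantities for the real data.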

Draw a comparison boxplot of the scores for the two treatments.

boxplot(Score ~ Treatment, data = case0101)

3. Gross Domestic Product (GDP) per Capita


The data file ex0116 contains the gross domestic product per capita for 228 countries in 2010 on the
following 3 variables: Rank: rank order of country from highest to lowest GDP; Country: name of country;
PerCapitaGDP: per capita GDP in $US.
Obtain the summary statistics for the data.

summary(ex0116)

Draw a histogram of per capita GDPs with a bin width of $5,000.

hist(ex0116$PerCapitaGDP, breaks = seq(0, max(ex0116$PerCapitaGDP)+5000, 5000))
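
The breaks argument has to span the whole range of the data, which is why 5000 is added to the maximum before building the sequence. The sketch below shows the resulting bins on a few hypothetical values (not taken from ex0116), using plot = FALSE so that hist() returns the counts without drawing.

```r
# Hypothetical values, NOT taken from ex0116
x <- c(500, 4200, 12000, 47000)

# Break points every 5000 starting at 0; the extra 5000 guarantees
# that the last break is at or above max(x)
breaks <- seq(0, max(x) + 5000, by = 5000)

# plot = FALSE returns the histogram object without drawing it
h <- hist(x, breaks = breaks, plot = FALSE)

# Number of observations in each 5000-wide bin
h$counts
```

If the breaks stopped below the maximum, hist() would stop with an error because some values would not fall into any bin.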

Make a boxplot of per capita GDPs.

boxplot(ex0116$PerCapitaGDP)

Flag potential outliers.

# Compute the inner fences


LIF <- quantile(ex0116$PerCapitaGDP, .25) - 1.5*IQR(ex0116$PerCapitaGDP)
UIF <- quantile(ex0116$PerCapitaGDP, .75) + 1.5*IQR(ex0116$PerCapitaGDP)

# Get a list of countries with GDP < LIF or GDP > UIF
ex0116[ex0116$PerCapitaGDP < LIF | ex0116$PerCapitaGDP > UIF, ]
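
The fence rule can be wrapped in a small reusable function. The helper flag_outliers below is hypothetical (it is not part of base R or Sleuth3), and the input values are made up for illustration. The sketch also calls boxplot.stats(), which reports the points that boxplot() draws individually.

```r
# Hypothetical helper: TRUE for values outside the 1.5 * IQR inner fences
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]
  x < q[1] - 1.5 * spread | x > q[2] + 1.5 * spread
}

# Hypothetical values, NOT taken from ex0116
x <- c(1, 2, 3, 4, 5, 100)
flagged <- x[flag_outliers(x)]

# Points that boxplot() would draw individually, for comparison
box_out <- boxplot.stats(x)$out
```

boxplot.stats() builds its fences from the hinges returned by fivenum() instead of from quantile(), so on some data sets the two approaches can flag slightly different points.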

4. Exercise
1. Download the baseball data set baseball.csv given on the Canvas module for this lab. It contains
data from the back side of 59 baseball cards. The file has 59 observations on the following 6 variables:
height: Height in inches; weight: Weight in pounds; bat: a factor with levels L R S; throw: a factor
with levels L R; field: a factor with levels 0 1; average: ERA if the player is a pitcher and his batting
average if the player is a fielder.
2. Create a data frame.

3. Calculate the mean and standard deviation of the ERA of pitchers.


4. Calculate the mean and standard deviation of the batting average of fielders.
5. Define a new variable BMI by
$$\mathrm{BMI} = \frac{\mathrm{weight} \times 703}{\mathrm{height}^2}$$
and add it to the data frame.

6. Sort the observations in increasing BMI order.


7. Draw a comparison boxplot of the BMIs of pitchers (field = 0) and fielders (field = 1).
8. Calculate the mean and standard deviation of the heights, weights, and BMIs of fielders.

9. Calculate the difference between the mean ERA of pitchers who are classified as overweight (BMI ≥ 25)
and the mean ERA of pitchers with BMI < 25.
10. Create a new data frame owbb that contains baseball players classified as overweight according to their
BMI.
