
Chapter 1


PROBABILITY AND STATISTICS

Lesson 1: Overview and Descriptive Statistics


(Assoc. Prof. Dr. Nguyen Chanh Tu, DUT)
(Extracted from lectures given in the Maple interactive environment)

Introduction to the course


Syllabus (on LMS and Teams)

Using Materials
• In LMS
• In Teams/Files

Using R and the Devore7 package (see the video clips and pay attention
to the following):
1. Why do we need R?
2. Install R.
3. Install RStudio.
And then:
4. Run RStudio and open the script file GettingStart.
5. Install the packages Devore7, prob, Maples, ... and try out R in discussion.

1.1. Populations, Samples and Processes


Roles of Statistics:
1. Provides methods for organizing and summarizing data.
2. Provides methods for drawing conclusions based on the information contained in the data.

Basic concepts
1. Population, Sample, Variable
• Population: the set of all objects under investigation
• Sample: a subset of the population
• Variable: a characteristic whose value may change from one object to another in the
population
Examples:
x = brand of calculator owned by a student
y = number of visits to a particular website during a specified period
z = braking distance of an automobile under specified conditions
2. Observation Data: Univariate, Bivariate, Multivariate
• Univariate: a data set consisting of observations made on a single variable
• Bivariate: observations made on two variables
• Multivariate: observations made on more than one variable

Branches of Statistics
Descriptive Statistics and Inferential Statistics
• Descriptive statistics provides methods to summarize and describe important features of the
data. The main descriptive methods consist of a) graphical tools such as histograms, boxplots,
scatter plots, ... and b) numerical summary measures: mean, median, mode, ...
• Inferential statistics provides techniques for using sample information to draw some type of
conclusion about the population.

Statistics and Probability


• Probability and statistics both deal with questions involving populations and samples, but do so
in an “inverse manner” to one another.
• In a probability problem, properties of the population under study are assumed known, and
questions regarding a sample taken from the population are posed and answered.
• In a statistics problem, characteristics of a sample are available to the experimenter, and this
information enables the experimenter to draw conclusions about the population.

Collecting Data
Remarks on Collecting Data
• Statistics deals not only with the organization and analysis of data once it has been collected
but also with the development of techniques for collecting the data.
• If data is not properly collected, an investigator may not be able to answer the questions
under consideration with a reasonable degree of confidence.
Common Problem
• One common problem is that the target population—the one about which conclusions are
to be drawn—may be different from the population actually sampled.
Example
• Advertisers would like various kinds of information about the television-viewing habits
of potential customers. The most systematic information of this sort comes from placing
monitoring devices in a small number of homes across the United States.
• It has been conjectured that placement of such devices in and of itself alters viewing
behavior, so that characteristics of the sample may be different from those of the target
population.
Simple Random Sample
Any particular subset of the specified size (e.g., a sample of size 100) has the same chance of
being selected.
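A minimal sketch of drawing a simple random sample in R. The population here is a hypothetical list of 1000 ID numbers, invented purely for illustration; `sample()` without replacement gives every subset of the requested size the same chance of being chosen.

```r
# Hypothetical population: ID numbers 1..1000 (illustrative only)
population <- 1:1000

set.seed(1)                             # fix the seed so the draw is reproducible
srs <- sample(population, size = 100)   # sampling without replacement is the default

length(srs)          # sample size
anyDuplicated(srs)   # 0 means no unit was selected twice
```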
Sampling methods (self-study)
• Focus on stratified sampling

1.2. Pictorial and Tabular Methods in Descriptive Statistics


Stem-and-Leaf Displays

Note
• A display based on between 5 and 20 stems is recommended.
Examples 1.1 and 1.5
Example 1.5

What does a stem-and-leaf display tell you?


A stem-and-leaf display tells you about the following aspects of the data:
• identification of a typical or representative value
• extent of spread about the typical value
• presence of any gaps in the data
• extent of symmetry in the distribution of values
• number and location of peaks
• presence of any outlying values
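As a rough illustration, base R's `stem()` prints a stem-and-leaf display in the console. The values below are made up for this sketch and are not the data of Examples 1.1 or 1.5; the `scale` argument stretches the display when the default gives too few stems.

```r
# Illustrative values only (not the textbook data)
x <- c(12, 15, 21, 23, 23, 27, 31, 34, 35, 38, 42, 46, 58)

stem(x)              # default display
stem(x, scale = 2)   # roughly doubles the length of the display
```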

Dotplots

When to use:
• when the data set is small, or
• when there are relatively few distinct data values
How:
• Each observation is represented by a dot above the corresponding location on a horizontal
measurement scale.
• When a value occurs more than once, there is a dot for each occurrence, and these dots are
stacked vertically.
What does it tell you?
As with a stem-and-leaf display, a dotplot gives information about location, spread, extremes,
and gaps.
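One possible way to draw a dotplot in base R is `stripchart()` with `method = "stack"`, which stacks repeated values vertically as described above. The data are invented for illustration.

```r
# Illustrative values only
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 7, 10)

# One dot per observation; repeated values are stacked
stripchart(x, method = "stack", pch = 19, xlab = "value")

counts <- table(x)   # height of each stack of dots
counts
```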
Example 1.7

Histograms
Number of classes

Discrete Numerical Variables
Def

Frequency and Relative frequency


• The frequency of any particular x value is the number of times that value occurs in the
data set.
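• The relative frequency of a value is the fraction or proportion of times the value occurs:

$$\text{relative frequency of a value} = \frac{\text{number of times the value occurs}}{\text{number of observations in the data set}}$$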
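A small sketch of how frequencies and relative frequencies can be tabulated in R. The vector `x` is a hypothetical discrete data set chosen for illustration.

```r
# Hypothetical discrete data (illustrative only)
x <- c(0, 1, 1, 2, 2, 2, 3, 1, 0, 2)

freq <- table(x)                 # frequency of each distinct value
rel_freq <- freq / length(x)     # relative frequency = frequency / n
rel_freq2 <- prop.table(freq)    # the same result with a built-in helper

freq
rel_freq
sum(rel_freq)                    # relative frequencies add up to 1
```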
Construction

Example 1.8
Continuous Numerical Variables
Number of classes

Equal Class Widths

Example 1.9
Unequal Class Widths
Example 1.10

Histogram Shapes
Qualitative Data

How to get a histogram


• Study how to draw a histogram (bar graph) of qualitative data with >barplot, as in the
example below
> library(MASS) # load the MASS package
> school = painters$School # the painter schools
> school.freq = table(school) # apply the table function
Then we apply the barplot function to produce its bar graph.
> barplot(school.freq) # apply the barplot function
Multivariate Data
Histograms: Practice with R
All students should study the arguments of the histogram commands and try to reproduce the
figures of Examples 1.8, 1.9, 1.10 and 1.11
• Look up the help with >?hist, >?dotplot, ...
• Use the help pages in R
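As a starting point for this practice, here is a hedged sketch of the `breaks`, `right` and `freq` arguments of `hist()`. The data are invented and do not reproduce Examples 1.8 to 1.11; with unequal class widths the density scale should be used so that bar areas, not heights, represent relative frequencies.

```r
# Illustrative continuous data (not the textbook examples)
x <- c(2.1, 2.7, 3.4, 3.8, 4.0, 4.4, 4.9, 5.2, 5.5,
       6.1, 6.8, 7.3, 8.9, 11.5, 14.2)

# Equal class widths: classes [2,4), [4,6), ..., [14,16]
h_equal <- hist(x, breaks = seq(2, 16, by = 2), right = FALSE,
                main = "Equal class widths")

# Unequal class widths: plot densities so that areas are comparable
h_unequal <- hist(x, breaks = c(2, 4, 6, 8, 12, 16), right = FALSE,
                  freq = FALSE, main = "Unequal class widths")

h_equal$counts                                    # class frequencies
sum(h_unequal$density * diff(h_unequal$breaks))   # total area is 1
```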

Exercise:
Work out examples 25 and 26 in Sec. 1.2 (7th edition) with R

1.3. Measures of Location


• The most important characteristic of a set of data is its location, and in particular its center
• Our primary concern will be with numerical data; some comments regarding categorical data
appear at the end of the section.

The Sample-Population Mean


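Definition: The sample mean $\bar{x}$ of observations $x_1, x_2, \ldots, x_n$ is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$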

Example 1.12
Dotplot

• Compute the mean with a calculator
• Draw the dotplot and histogram with R
Remarks
• The average of all values in the population is called the population mean and is denoted by $\mu$.
• In statistical inference, we will present methods based on the sample mean for drawing
conclusions about a population mean.
• The mean can be greatly affected by the presence of even a single outlier (an unusually large or
small observation).
• However, although the mean does have this potential defect, it is still the most widely used measure.
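The sensitivity of the mean to a single outlier can be seen in a short sketch. The numbers are made up for illustration; the mean is computed directly from `sum()` and `length()` and then checked against `mean()`.

```r
# Illustrative sample (invented values)
x <- c(10, 12, 13, 15, 15, 16, 18)

xbar <- sum(x) / length(x)   # sample mean from the definition
xbar
mean(x)                      # built-in check, same value

# Add one unusually large observation and recompute
x_out <- c(x, 150)
mean(x_out)                  # the mean is pulled strongly toward the outlier
```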

The Sample-Population Median


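Definition: The sample median $\tilde{x}$ is obtained by ordering the n observations from smallest
to largest (repeated values included). It is the single middle value if n is odd and the
average of the two middle values if n is even.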
Example. Look at the following data table, compute the sample median, and
compare it with the sample mean.

Example 1.13. Compute the sample mean and median with R and display the data with graphical
tools (dotplot, stem-and-leaf display, bar chart, histogram, ...).
Remarks
• The sample median is not sensitive to outliers.
• The middle value in the population, the population median, is denoted by $\tilde{\mu}$.
• Both the mean and the median describe where the data are centered, but they will not in
general be equal.
• The population mean and median will not generally be identical.
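A sketch of the median computed by hand (order the data, then take the middle value or the average of the two middle values) and checked against `median()`. The helper name `sample_median` and the data are hypothetical, introduced only for this illustration.

```r
# Hypothetical helper: median from the ordered observations
sample_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # n odd: single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # n even: average of the two middle values
  }
}

x <- c(10, 12, 13, 15, 15, 16, 18)    # invented values, n = 7
x_out <- c(x, 150)                    # same sample plus one outlier, n = 8

sample_median(x)       # agrees with median(x)
sample_median(x_out)   # unchanged by the outlier in this sample
mean(x); mean(x_out)   # the mean, in contrast, shifts a lot
```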

Other Measures of Location: Quartiles, Percentiles and Trimmed Means

1. The median (population or sample) divides the data set into two parts of equal size.
2. Quartiles divide the data set into four equal parts, the first quartile separating the lower
quarter, the second quartile being identical to the median, and the observations above the
third quartile constituting the upper quarter of the data set.

3. Similarly, a data set (sample or population) can be even more finely divided using
percentiles; the 99th percentile separates the highest 1% from the bottom 99%, and so on.

4. A trimmed mean is a compromise between mean and median. A 10% trimmed mean, for
example, would be computed by eliminating the smallest 10% and the largest 10% of the
sample and then averaging what is left over.
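These quantities can be sketched in R with `quantile()` and the `trim` argument of `mean()`. The data are invented; note that `quantile()` interpolates between observations by default, so its quartiles may differ slightly from values obtained with a textbook rule.

```r
# Illustrative sample with one large value (n = 10)
x <- c(3, 5, 7, 8, 9, 11, 13, 14, 16, 40)

quantile(x, probs = c(0.25, 0.50, 0.75))   # quartiles (default interpolation rule)
quantile(x, probs = 0.99)                  # 99th percentile

mean(x)                   # ordinary mean
median(x)                 # median
tm <- mean(x, trim = 0.10)  # 10% trimmed mean: drops the smallest and largest value here
tm
```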
Example 1.14

Categorical Data and Sample Proportions


• When the data is categorical, a frequency distribution or relative frequency distribution
provides an effective tabular summary of the data.
• More generally, focus attention on a particular category and code the sample results so that a 1
is recorded for an observation in the category and a 0 for an observation not in the category.
Then the sample proportion of observations in the category is the sample mean of the sequence
of 1s and 0s, denoted by $x/n$, where $x$ is the number of 1s and $n$ is the sample size. Thus a
sample mean can be used to summarize the results of a categorical sample.
• Analogous to the sample proportion $x/n$ of individuals or objects falling in a particular
category, let p represent the proportion of those in the entire population falling in the
category. In particular, we will subsequently use $x/n$ to make inferences about p.
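The 0/1 coding can be sketched in R as follows. The vector `brand` is a hypothetical categorical sample and the category "A" is chosen arbitrarily; the point is only that the sample proportion equals the mean of the coded values.

```r
# Hypothetical categorical sample (illustrative only)
brand <- c("A", "B", "A", "C", "A", "B", "A", "C", "B", "A")

in_cat <- as.numeric(brand == "A")   # 1 if the observation is in the category, else 0
x_count <- sum(in_cat)               # number of observations in the category
n <- length(in_cat)                  # sample size

x_count / n      # sample proportion
mean(in_cat)     # the same value, as the mean of the 0/1 sequence
```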
1.4. Measures of Variability
Idea

Our primary measures of variability involve the deviations from the mean:
$x_1 - \bar{x},\ x_2 - \bar{x},\ \ldots,\ x_n - \bar{x}$

Measures of Variability of Sample Data


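Definition: The sample variance is $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{S_{xx}}{n-1}$,
and the sample standard deviation is $s = \sqrt{s^2}$.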

Example 1.15
Using R
• Load the data of Example 1.15
• Compute the mean directly with the commands >sum and >length. Check with the command
>mean()
• Compute Sxx, the variance and the standard deviation directly from the formulas. Check with
the commands >var and >sd
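The steps above can be sketched as follows. The data are invented (not those of Example 1.15), and the sketch assumes the usual definitions $S_{xx} = \sum (x_i - \bar{x})^2$ and $s^2 = S_{xx}/(n-1)$.

```r
# Illustrative sample (not the data of Example 1.15)
x <- c(4, 7, 9, 10, 15)

n <- length(x)
xbar <- sum(x) / n                      # mean from sum and length

Sxx <- sum((x - xbar)^2)                # sum of squared deviations from the mean
Sxx_short <- sum(x^2) - sum(x)^2 / n    # computing (shortcut) formula, same value

s2 <- Sxx / (n - 1)                     # sample variance
s <- sqrt(s2)                           # sample standard deviation

c(xbar, mean(x))   # direct result and built-in check
c(s2, var(x))
c(s, sd(x))
```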
Example 1.16
Using R
• Load the data of Example 1.16.
• Compute Sxx, the variance and the standard deviation directly from the formulas and check
with the commands >var and >sd

Boxplots
• Stem-and-leaf displays, dotplots and histograms convey rather general impressions about a data
set
• The mean, median or standard deviation each focus on just a single aspect of the data
• A pictorial summary called a boxplot has been used successfully to describe several of a data
set’s most prominent features.
What does a boxplot tell you?
• These features include
(1) center,
(2) spread,
(3) the extent and nature of any departure from symmetry, and
(4) identification of “outliers,” observations that lie unusually far from the main body of the
data.
• It uses the median and a measure of variability called the fourth spread.
Pictures of Boxplot and Definition
Example 1.17
Boxplots that show outliers

Definition

Examples 1.18 and 1.19
Using R
• Use bwplot(C1) or boxplot(C1,horizontal=TRUE)
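A minimal sketch with invented data, standing in for the vector `C1` mentioned above. It assumes that the hinges returned by `fivenum()` play the role of the lower and upper fourths, so that their difference is the fourth spread; `boxplot()` flags points more than 1.5 such spreads beyond the nearest hinge by default.

```r
# Illustrative sample with one unusually large value
x <- c(5, 7, 8, 9, 10, 11, 12, 13, 14, 30)

five <- fivenum(x)        # min, lower hinge, median, upper hinge, max
fs <- five[4] - five[2]   # spread between the hinges (fourth spread under the assumption above)
five
fs

b <- boxplot(x, horizontal = TRUE)   # draws the boxplot and returns its statistics
b$out                                # observations flagged as outliers
```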

Example 1.19
Example (Exercise 1.15)

Use R on the data of Exercise 1.15
Homework (use the 7th edition)
Sec 1.2.
I .10-12; II. 13-27 al 2, 29-30 al. 1;
Sec 1.3.
I. 33-35; II.36-39 al. 2; III 40-43
Sec 1.4
I. 44-46; II. 48-55 al. 4; III. 56-61 al. 4

R practice
1. After installing R and RStudio (read the file GettingStart), open the file lab1_DesStat.R (in Teams/Files).
2. Choose a data set from the internet (such as titanic), from the Devore7 or MASS libraries, or from
any real-life source, with at least 100 observations. Use the tools of descriptive statistics to
analyse and present your data.
Note: You may find the following video clips useful:
- https://www.youtube.com/watch?v=49fADBfcDD4
- https://www.youtube.com/watch?v=xYXif1UCs-g
