Introduction To Statistics
The subject of statistics has been defined in different ways at different times. One definition: statistics are numerical statements of facts capable of analysis and interpretation, and the science of statistics is the study of the principles and methods applied in:
Collecting, presenting, analyzing and interpreting numerical data.
What is Statistics
Definition of Statistics
Statistics is the study of how to collect, organize, analyze, and interpret numerical information from data.
Descriptive statistics involves methods of organizing, picturing and summarizing information from data.
Inferential statistics involves methods of using information from a sample to draw conclusions about the population.
Keep in mind:
* Statistical inferences are no more accurate than the data they are based on.
* Statistical results should be interpreted by someone who understands the methods used as well as the subject matter.
What is Data?
DATA
Data is a collection of facts, such as values or measurements. It can be numbers, words, measurements, observations or even just descriptions of things.
Qualitative data vs Quantitative data
Data can be qualitative or quantitative. Qualitative data is descriptive information (it describes something), whereas quantitative data is numerical information (numbers).
Quantitative data can also be discrete or continuous:
Discrete data can only take certain values (like whole numbers).
Continuous data can take any value (within a range).
Put simply: discrete data is counted, continuous data is measured.
Types of Data
Data
Quantitative (Numeric)
e.g., the height of a person in inches.
Variable
A quantity that varies from one individual to another.
Levels of Measurement
1. Nominal Level (in name only): Qualities with no ranking/ordering; no numerical or quantitative value. Data consists of names, labels and categories, also called categorical data.
E.g., Car colors for a certain model are: red, silver, blue and black; or 0 = concentrations below reporting limit, 1 = above reporting limit but below a health standard, 2 = above health standard.
2. Ordinal Level: Can be arranged in some numerical order, but the differences between the data values are meaningless.
E.g., Of 17 fishing reels rated: 6 were rated good quality, 4 were rated better quality, and 7 were rated best quality.
3. Interval Level: Data values can be ranked and the differences between data values are meaningful. However, there is no intrinsic zero, or starting point, and ratios of data values are meaningless.
E.g., The years in which Democrats won presidential elections, or temperature in degrees Celsius or Fahrenheit.
4. Ratio Level: Similar to interval, except there is an inherent zero, or starting point, and the ratios of data values have meaning.
E.g., Length of trout in the North River.
EXAMPLES OF DATA
Data are collected in many aspects of everyday life. Statements given to a police officer, physician or psychologist during an interview are data. So are the correct and incorrect answers given by a student on a final examination. Almost any athletic event produces data: the time required by a runner to complete a marathon, the number of errors committed by a baseball team in nine innings of play, and so on.
Symmetric, bell-shaped
Asymmetric or skewed, not bell-shaped
Bimodal: Two prominent peaks (modes)
Skewed Right: On the number line, values are clumped at the left end and extend to the right.
Skewed Left: On the number line, values are clumped at the right end and extend to the left.
Bell-shaped example
Bimodal Example
Old Faithful Geyser, time between eruptions, histogram
Times between eruptions of the Old Faithful geyser: the shape is bimodal, with two clusters, one around 50 min and the other around 80 min.
Bell-shaped distribution
Standard deviation measures variability by summarizing how far individual data values are from the mean. Think of the standard deviation as roughly the average distance values fall from the mean.
Data sets usually represent a sample from a larger population. If the data set includes measurements for an entire population, the notations for the mean and standard deviation are different, and the formula for the standard deviation is also slightly different. A population mean is represented by the Greek letter μ (mu), and the population standard deviation is represented by the Greek letter σ (sigma, lower case):
$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
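The difference between the two formulas can be sketched in Python. This is a minimal illustration, assuming that the "slightly different" sample formula refers to the usual n - 1 divisor; the function names and the height values are hypothetical and not taken from the slides.

```python
import math
import statistics


def population_variance(values):
    """sigma^2 = sum((x_i - mu)^2) / n, treating `values` as the whole population."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """s^2 = sum((x_i - x_bar)^2) / (n - 1), treating `values` as a sample."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


# Hypothetical heights in inches (illustrative only).
heights = [62, 64, 66, 68, 70]

population_sd = math.sqrt(population_variance(heights))
sample_sd = math.sqrt(sample_variance(heights))

print(f"population SD = {population_sd:.3f}")
print(f"sample SD     = {sample_sd:.3f}")
```

The standard library functions `statistics.pstdev` and `statistics.stdev` compute the same two quantities, so they can be used to cross-check the hand-written versions.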
Data Representation
Tabulation, simple bar chart, component bar chart, multiple bar chart, pie chart
Qualitative
  Univariate Frequency Table
    Percentages
      Pie Chart
      Bar Chart
  Bivariate Frequency Table
    Component Bar Chart
    Multiple Bar Chart
Quantitative
  Discrete
    Frequency Distribution
      Bar Chart
      Line Chart
  Continuous
    Frequency Distribution
      Histogram
      Frequency Polygon
      Frequency Curve
Example
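Suppose that we are carrying out a survey of the first-year students of a co-educational college, and that there are 1200 first-year students in this college in all. We wish to determine what proportion of these students come from Urdu medium schools and what proportion from English medium schools.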
Interview Results
We will have an array of observations as follows: U, U, E, U, E, E, E, U, ..., where U = Urdu medium and E = English medium.
The results can be shown in the following table:
Medium of Institution    No. of Students (f)
Urdu                     719
English                  481
Total                    1200
Important: The technical term for the numbers given in the second column of this table is frequency. It means how frequently something happens. Out of the 1200 students, 719 stated that they had come from Urdu medium schools and the remaining 481 from English medium schools.
Dividing the cell frequencies by the total frequency and multiplying by 100, we obtain the following:
Medium of Institution    f       %
Urdu                     719     59.9
English                  481     40.1
Total                    1200    100.0
[Pie chart: Urdu 60%, English 40%]
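The frequency and percentage calculation for this example can be sketched in Python. This is only an illustration: the list of responses is rebuilt from the totals quoted in the example (719 Urdu medium, 481 English medium), and the variable names are hypothetical.

```python
from collections import Counter

# Rebuild the 1200 survey responses from the totals quoted in the example.
responses = ["U"] * 719 + ["E"] * 481

# Frequency: how often each category occurs.
frequency = Counter(responses)
total = sum(frequency.values())

# Percentage: cell frequency divided by total frequency, multiplied by 100.
percentages = {
    medium: round(100 * f / total, 1) for medium, f in frequency.items()
}

for medium, f in frequency.items():
    print(f"{medium}: f = {f}, {percentages[medium]}%")
print(f"Total: f = {total}")
```

Rounded to whole numbers, these percentages give the 60% and 40% slices of the pie chart.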
In order to represent the above information in the form of a bar chart, all we have to do is to take the year along the x-axis and construct a scale for turnover along the y-axis.
[Figure: empty axes, with a turnover scale from 0 to 50,000 on the y-axis]
Next, against each year, we draw vertical bars of equal width and of different heights in accordance with the turnover figures in our table.
[Figure: simple bar chart of turnover (0 to 50,000) by year, 1965 to 1969]
When our values do not relate to time, they should be arranged in ascending or descending order before charting.
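As a small sketch of this ordering step, the Python snippet below sorts non-time categories in descending order before drawing a rough text bar chart. The category names and values are hypothetical, chosen only for illustration, and do not come from the slides.

```python
# Hypothetical category totals (not from the source).
sales_by_region = {"North": 30, "South": 45, "East": 20, "West": 38}

# Arrange the categories in descending order of value before charting.
ordered = sorted(sales_by_region.items(), key=lambda item: item[1], reverse=True)

# Draw one bar per category; each '#' stands for 5 units.
for region, value in ordered:
    bar = "#" * (value // 5)
    print(f"{region:<6} {bar} {value}")
```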
Student No.    Medium    Gender
1              U         F
2              U         M
3              E         M
4              U         F
5              E         M
6              E         F
7              U         M
8              E         M
:              :         :
Now this is a bivariate situation; we have two variables, medium of schooling and sex of the student.
[Table outline: column headings Male, Female and Total; the row of column headings is called the box head]
Next, we will count the number of students falling in each of the following four categories:
Male students coming from an Urdu medium school.
Female students coming from an Urdu medium school.
Male students coming from an English medium school.
Female students coming from an English medium school.
[Table: bivariate frequency table with Medium as rows (Urdu, English, Total) and Sex as columns (Male, Female, Total)]
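The counting step for the four categories can be sketched in Python. This sketch uses only the eight student records listed in the example (the remaining records are not shown in the slides), so the counts it produces are for those eight students, not for the full survey.

```python
from collections import Counter

# The first eight student records shown in the example.
medium = ["U", "U", "E", "U", "E", "E", "U", "E"]
gender = ["F", "M", "M", "F", "M", "F", "M", "M"]

# Count each (medium, gender) combination: the cells of the bivariate table.
cell_counts = Counter(zip(medium, gender))
row_totals = Counter(medium)   # totals per medium
col_totals = Counter(gender)   # totals per sex

print(f"{'Medium':<8}{'Male':>6}{'Female':>8}{'Total':>7}")
for code, label in [("U", "Urdu"), ("E", "English")]:
    male = cell_counts[(code, "M")]
    female = cell_counts[(code, "F")]
    print(f"{label:<8}{male:>6}{female:>8}{row_totals[code]:>7}")
print(f"{'Total':<8}{col_totals['M']:>6}{col_totals['F']:>8}{len(medium):>7}")
```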
COMPONENT BAR CHART
This can be accomplished by constructing a component bar chart. The component bar chart is also known as the subdivided bar chart.
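[Figure: component bar chart with one bar for male and one for female students, each subdivided into Urdu and English medium]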
In the above figure, each bar has been divided into two parts. The first bar represents the total number of male students, whereas the second bar represents the total number of female students. As far as the medium of schooling is concerned, the lower part of each bar represents the students coming from English medium schools, whereas the upper part represents the students coming from Urdu medium schools. The advantage of this kind of diagram is that we can ascertain the situation of both variables at a glance. We can compare the number of male students in the college with the number of female students, and at the same time compare the number of English medium students among the males with the number of English medium students among the females.
[Table: imports and exports of Pakistan by year, 1970-71 to 1974-75]
A multiple bar chart is a very useful and effective way of presenting this kind of information. This kind of chart consists of a set of grouped bars, the lengths of which are proportionate to the values of our variables, and each of which is shaded or colored differently in order to aid identification. With reference to the above example, we obtain the following multiple bar chart:
[Figure: Multiple Bar Chart representing Imports & Exports of Pakistan (1970-71 to 1974-75); y-axis from 0 to 2500; one Imports bar and one Exports bar for each year]
For example:
Total number of male students, i.e. English medium and Urdu medium combined.
Statistical Inference
Sample
Population
Statistical inference is the process by which we acquire information about populations from samples. Two types of estimates for making inferences:
Point estimate: a single value used to estimate a population parameter.
Interval estimate: a range of values within which a population parameter is said to lie. For example, a < μ < b is an interval estimate of the population mean μ. It indicates that the population mean is greater than a but less than b.
1/2/2014
Parameter Vs Statistic
Parameter: Any statistical characteristic of a population.
Statistic: Any statistical characteristic of a sample.
Statistical Issue: To describe the distribution of a population, either through a census or by making inferences about the population distribution/population parameter using the sample distribution/statistic. E.g., the sample mean is an estimate of the population mean.
Decision             H0 true / HA false          H0 false / HA true
Do not reject H0     OK (p = 1 - α)              Type II error (β) (p = β)
Reject H0            Type I error (α) (p = α)    OK (p = 1 - β)
1 - β: power of the test
α: level of significance
With α = 0.05, there are only 5 chances in 100 that a result termed "significant" could occur by chance alone.
The probability of making a Type I error (α) can be decreased by altering the level of significance. However, when α is decreased, the power of the test decreases and the risk of a Type II error increases.
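The trade-off between the two error types can be illustrated numerically. The sketch below assumes a one-sided z-test with a known standard deviation, which is a simplification chosen for illustration and is not specified in the slides; the function name and all parameter values are hypothetical.

```python
from statistics import NormalDist


def power_one_sided_z(alpha, effect, sigma, n):
    """Power of a one-sided z-test of H0: mu = mu0 against HA: mu = mu0 + effect.

    Assumes a known population standard deviation `sigma` and sample size `n`.
    """
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)   # reject H0 when z > critical_z
    shift = effect / (sigma / n ** 0.5)          # mean of z when HA is true
    beta = std_normal.cdf(critical_z - shift)    # P(do not reject H0 | HA true)
    return 1 - beta


# Hypothetical setting: true effect of 2 units, sigma = 10, n = 50.
for alpha in (0.10, 0.05, 0.01):
    power = power_one_sided_z(alpha, effect=2.0, sigma=10.0, n=50)
    print(f"alpha = {alpha:.2f}  power = {power:.3f}  beta = {1 - power:.3f}")
```

With the effect size and sample size held fixed, each smaller value of α gives a lower power and a higher β.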
Decision:
Point Estimation
A point estimate draws inference about a population by estimating the value of an unknown parameter using a single value or a point.
[Figure: population distribution with an unknown parameter (?)]
Interval Estimate
An interval estimator draws inferences about a population by estimating the value of an unknown parameter using an interval.
[Figure: population distribution with an interval around the parameter]
E.g., a 95% CI implies that if one repeats a study 100 times and computes a CI each time, about 95 of the 100 intervals are expected to contain the true measure of association.
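The repeated-study reading of a confidence interval can be checked with a small simulation. This is a sketch under stated assumptions: it uses a normal-approximation interval for a mean (not a measure of association), a hypothetical normal population with mean 100 and standard deviation 15, and a fixed random seed so the run is repeatable.

```python
import random
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    x_bar = mean(sample)
    standard_error = stdev(sample) / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    return x_bar - z * standard_error, x_bar + z * standard_error


rng = random.Random(0)               # fixed seed for a repeatable run
true_mean, true_sd = 100.0, 15.0     # hypothetical population values
repeats = 1000

hits = 0
for _ in range(repeats):
    # One "study": draw a sample of 50 and compute its interval.
    sample = [rng.gauss(true_mean, true_sd) for _ in range(50)]
    low, high = confidence_interval(sample)
    if low < true_mean < high:
        hits += 1

coverage = hits / repeats
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The observed share should be close to 95%, though not exactly equal to it, because the simulation itself is subject to sampling variation.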
Bell-shaped distributions
Measurements that have a bell shape are so common in nature that they are said to have a normal distribution. Knowing the mean and standard deviation completely determines where all of the values fall for a normal distribution, assuming an infinite population! In practice we don't have an infinite population (or sample), but if we have a large sample, we can get good approximations of where values fall.
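As an illustration of how the mean and standard deviation determine where values fall, the sketch below computes the share of a normal distribution lying within 1, 2 and 3 standard deviations of the mean. The helper name `share_within` is hypothetical; `NormalDist` is part of the Python standard library.

```python
from statistics import NormalDist


def share_within(k, mean=0.0, sd=1.0):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    dist = NormalDist(mean, sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```

The shares do not depend on the particular mean and standard deviation chosen, which is why the same rule of thumb (roughly 68%, 95% and 99.7%) applies to any normal distribution.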