USFDA Basic Statistics and Data Presentation (v02) PDF
OFFICE OF REGULATORY AFFAIRS
ORA Laboratory Manual Volume III Section 4
Document Number: III-04
Revision #: 02
Revision Date: 08/13/2019
Title: Basic Statistics and Data Presentation
1. Introduction
2. General Considerations
   2.1. Accuracy, Precision, and Uncertainty
   2.2. Error and Deviation; Mean and Standard Deviation
   2.3. Random and Determinate Error
   2.4. The Normal Distribution
   2.5. Confidence Intervals
   2.6. Populations and Samples: Student’s t Distribution
   2.7. References
3. Data Handling and Presentation
   3.1. Rounding of Reported Data
   3.2. Significant Figures
      3.2.1. Definitions and Rules for Significant Figures
      3.2.2. Significant Figures in Calculated Results
4. Linear Curve Fitting
5. Development and Validation of Spreadsheets for Calculation of Data
   5.1. Introduction
   5.2. Development of Spreadsheets
   5.3. Validation of Spreadsheets
6. Control Charts
   6.1. Definitions
   6.2. Discussion
   6.3. Quality Control Sample Example
   6.4. References
7. Statistics Applied to Drug Analysis
   7.1. Introduction
   7.2. USP Guidance on Significant Figures and Rounding
   7.3. Additional Guidance in the USP
   7.4. References
8. Statistics Applied to Radioactivity
   8.1. Introduction
   8.2. Sample Counting
1. Introduction
Statistics may be used in the ORS laboratory to describe and summarize the
results of sample analysis in a concise and mathematically meaningful way.
Statistics may also be used to predict properties (ingredients, acidity, quantity,
dissolution, height, weight, etc.) of a contaminant or of a regulated product
based on measurements made on a subset, or sample, of the contaminant or
product. All statistical concepts are ultimately based on mathematically
derived laws of probability. Understanding statistical concepts will allow the
ORS analyst to better convey analytical results with the maximum accuracy
and precision.
Proper application of statistics gives analysts the ability to report accurate
results, while allowing for the fact that there is inherent error (both random and
determinate) in virtually every laboratory measurement made.
This section is meant to be a general guide for situations commonly
encountered in the ORS laboratory. The section also gives guidance on
various aspects of data presentation and verification.
2. General Considerations
Statistical procedures used to describe measurements of samples in the ORS
laboratory allow regulatory decisions to be made in as unbiased a manner as
possible. The following are numerically descriptive measures commonly used
in ORS laboratories.
where
X = mean of set of N measurements,
xi = ith measurement, and
N = Number of Measurements.
Note: this is the arithmetic mean of a set of observations. There are
other types of mean which can be calculated, such as the geometric
mean (see the section on “Application of Statistics to Microbiology”
below), which may be more accurate in special situations.
E. Then, the deviation, di, for each measurement is defined by:
$$ d_i = x_i - X $$
$$ s = \sqrt{\frac{\sum_{i=1}^{N} d_i^2}{N-1}} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - X)^2}{N-1}} $$
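As a minimal sketch, the mean, the deviations, and the standard deviation defined above can be calculated as follows. The replicate values are hypothetical and the function name is illustrative only; it is not part of any ORS procedure.

```python
import math


def mean_and_standard_deviation(measurements):
    """Return (mean, deviations, standard deviation) for N >= 2 measurements."""
    n = len(measurements)
    if n < 2:
        raise ValueError("at least two measurements are needed")
    mean = sum(measurements) / n                     # X: arithmetic mean of the N values
    deviations = [x - mean for x in measurements]    # d_i = x_i - X
    # s = sqrt( sum(d_i^2) / (N - 1) )
    s = math.sqrt(sum(d * d for d in deviations) / (n - 1))
    return mean, deviations, s


# Hypothetical replicate results, for illustration only
replicates = [10.1, 10.3, 9.9, 10.2, 10.0]
mean, deviations, s = mean_and_standard_deviation(replicates)
print(f"mean = {mean:.2f}, s = {s:.3f}")
```

The deviations sum to zero (within floating-point error), which is a quick check that the mean was calculated correctly.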
the values measured (i.e. variables) may vary continuously rather than
take on discrete values (the Poisson distribution, applicable to
radioactive decay is an example of a discrete probability distribution
function; see discussion under “Statistics Applied to Radioactivity”).
C. The normal distribution should be at least somewhat familiar to most
analysts as the “bell curve” or Gaussian curve. The curve can be
defined with just two statistical parameters that have been discussed:
the true value of the measured quantity, μ, and the true standard
deviation, σ. It is of the form:
$$ Y = e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} $$
Where Y= frequency of occurrence of a measurement (a value between
0 and 1),
x = the magnitude of the measurement,
μ = the true value of the measurement,
σ = true standard deviation of the population, and
e = base of natural logarithms (2.718…)
D. An example of two normal curves with the same true value, μ, but two
different values of σ is shown below (this was calculated using an
Excel® spreadsheet, using the formula above and an array of x values):
[Figure: “Normal Distribution.” Normalized frequency (y-axis) plotted against measurement (x-axis, mean = 0.5) for two curves, one with standard deviation = 0.1 and one with standard deviation = 0.05.]
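The two curves in this example can be recalculated with a short script instead of a spreadsheet. The sketch below assumes an array of 101 x values from 0 to 1 in steps of 0.01 (the original spacing is not stated); the function name is illustrative.

```python
import math


def normalized_frequency(x, mu, sigma):
    """Y = exp(-1/2 * ((x - mu) / sigma)^2); equals 1 at x = mu."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


mu = 0.5
xs = [i / 100 for i in range(101)]   # assumed array of x values, 0.00 to 1.00
curves = {
    sigma: [normalized_frequency(x, mu, sigma) for x in xs]
    for sigma in (0.1, 0.05)
}

# Print a few points around the mean for each curve
for sigma, ys in curves.items():
    for x, y in list(zip(xs, ys))[40:61:5]:
        print(f"sigma = {sigma}: x = {x:.2f}, Y = {y:.4f}")
```

Both curves peak at Y = 1 at the mean; the curve with the smaller standard deviation falls off more quickly on either side.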
In a sense, this is the most general way to express “how well” a number is
known. The correct use of significant figures is important in today’s world,
where spreadsheets, handheld calculators, and instrumental digital readouts
are capable of generating numbers to almost any degree of apparent
precision, which may be quite different from the actual precision associated
with a measurement. A few simple rules will allow us to express results with
the correct number of significant figures or digits. The aim of these rules is to
ensure that the final result never contains any more significant figures
than the least precise data used to calculate it. This makes intuitive as well as
scientific sense: a result is only as good as the data that is used to calculate it
(or more popularly, “garbage in, garbage out”).
3.2.1. Definitions and Rules for Significant Figures
A. All non-zero digits are significant.
B. The most significant digit in a reported result is the left-most non-zero
digit: 359.741 (3 is the most significant digit).
C. If there is a decimal point, the least significant digit in a reported result is
the right-most digit (whether zero or not): 359.741 (1 is the least
significant digit). If there is no decimal point present, the right-most non-
zero digit is the least significant digit.
D. The number of digits between and including the most and least
significant digit is the number of significant digits in the result: 359.741
(there are six significant digits).
E. The following table gives examples of these definitions:
     Number        Sig. Digits
A    1.2345 g      5
B    12.3456 g     6
C    012.3 mg      3
D    12.3 mg       3
E    12.30 mg      4
F    12.030 mg     5
G    99.97 %       4
H    100.02 %      5
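Rules A through D above can be expressed as a short function. This is a sketch only: it assumes the number is supplied as text exactly as reported (so that trailing zeros are preserved) and without units, and the function name is illustrative.

```python
def count_significant_digits(reported: str) -> int:
    """Count significant digits in a reported number given as text."""
    text = reported.strip().lstrip("+-")
    has_decimal_point = "." in text
    digits = text.replace(".", "", 1)
    if not digits.isdigit():
        raise ValueError(f"not a plain decimal number: {reported!r}")
    # Rule B: the most significant digit is the left-most non-zero digit.
    digits = digits.lstrip("0")
    # Rule C: with no decimal point, the least significant digit is the
    # right-most non-zero digit; with a decimal point, it is the right-most digit.
    if not has_decimal_point:
        digits = digits.rstrip("0")
    # Rule D: count the digits between and including those two positions.
    # (A value of zero has no non-zero digit; this sketch returns 0 for it.)
    return len(digits)


for number in ["1.2345", "12.3456", "012.3", "12.3", "12.30", "12.030", "99.97", "100.02"]:
    print(number, count_significant_digits(number))
```

Applied to the numbers in the table, the function reproduces the counts listed there.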
reported with a precision near that of the least precise numerical measurement
used to generate the number. Some guidelines and examples follow.
3.2.2.1. Addition and Subtraction
The general guideline when adding and subtracting numbers is that the
answer should have the same number of decimal places as the component with
the fewest decimal places:
  21.1
   2.037
   6.13
________
  29.267, reported as 29.3, since the component 21.1 has the fewest decimal places
3.2.2.2. Multiplication and Division
The general guideline is that the answer has the same number of significant
figures as the number with the fewest significant figures:
$$ \frac{56 \times 0.003462 \times 43.72}{1.684} $$
A calculator yields an answer of 5.033303943, which is reported as 5.0, since one of the
measurements has only two significant figures.
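The two guidelines above can be sketched with Python's `decimal` module, which keeps trailing zeros when numbers are supplied as text. The function names are illustrative, and round-half-up is an assumption made for this sketch; the laboratory's own rounding rule should be applied in practice.

```python
from decimal import Decimal, ROUND_HALF_UP


def decimal_places(text: str) -> int:
    """Number of digits after the decimal point in a number given as text."""
    return max(0, -Decimal(text).as_tuple().exponent)


def round_sum(terms):
    """Add the terms and round to the fewest decimal places among them."""
    places = min(decimal_places(t) for t in terms)
    total = sum(Decimal(t) for t in terms)
    return total.quantize(Decimal(1).scaleb(-places), rounding=ROUND_HALF_UP)


def round_to_sig_figs(value: Decimal, sig_figs: int) -> Decimal:
    """Round a Decimal to the requested number of significant figures."""
    if value == 0:
        return value
    exponent = value.adjusted() - (sig_figs - 1)   # position of the last kept digit
    return value.quantize(Decimal(1).scaleb(exponent), rounding=ROUND_HALF_UP)


# Addition example: 21.1 has the fewest decimal places (one)
total = round_sum(["21.1", "2.037", "6.13"])

# Multiplication and division example: 56 has the fewest significant figures (two)
quotient = Decimal("56") * Decimal("0.003462") * Decimal("43.72") / Decimal("1.684")
reported = round_to_sig_figs(quotient, 2)

print(total, reported)
```

The number of significant figures is passed in explicitly here because it depends on how each input was reported, not only on its numerical value.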
4. Linear Curve Fitting
This section deals with fitting of experimental data to a mathematical function.
This need arises in a variety of situations in the ORS laboratory,
particularly with calibration curves. In most cases, the relationship between
the variables is linear, and therefore a linear function is needed:
y = f(x) = mx + b
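$$ m = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \quad \text{(slope)} $$
$$ b = \frac{\sum_{i=1}^{n} y_i - m\sum_{i=1}^{n} x_i}{n} = Y - mX \quad \text{(intercept)} $$
where X and Y are the means of the x_i and y_i values, respectively.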
An additional parameter, which is an indicator of the “goodness of fit” of the
line to the data points, is the coefficient of determination. This coefficient
denotes the strength of the linear association between x and y. The coefficient,
r², uses information on means and deviations of each data set to express
variation numerically. If the two data sets correspond perfectly or exhibit no
variation, a coefficient of 1 will be calculated. A coefficient of 0 indicates there
is no relationship or no explanation of variation between the two data sets.
Typically, for analytical work performed in the ORS laboratory, the coefficient
should be very close to 1 (for example 0.999). The formula for the coefficient of
determination is:
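$$ r^2 = \left[ \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n\, s_x s_y} \right]^2 $$
where \bar{x} and \bar{y} are the means of the x and y values, and s_x and s_y are their standard deviations. With n in the denominator as written, s_x and s_y must be calculated with n (rather than n − 1) as the divisor for a perfect fit to give r² = 1.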
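For verification purposes, the slope, intercept, and coefficient of determination can be calculated directly from the least-squares formulas. The sketch below uses hypothetical calibration data; the function name is illustrative, and `statistics.pstdev` (standard deviation with n as the divisor) is used so that a perfect fit gives r² = 1.

```python
import statistics


def linear_fit(xs, ys):
    """Least-squares slope m, intercept b, and r^2 for y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical")
    m = (n * sum_xy - sum_x * sum_y) / denominator   # slope
    b = (sum_y - m * sum_x) / n                      # intercept
    mean_x, mean_y = sum_x / n, sum_y / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_x, s_y = statistics.pstdev(xs), statistics.pstdev(ys)
    if s_y == 0:
        raise ValueError("all y values are identical; r^2 is undefined")
    r_squared = (cross / (n * s_x * s_y)) ** 2
    return m, b, r_squared


# Hypothetical calibration data: concentration versus instrument response
concentrations = [0.0, 1.0, 2.0, 4.0, 8.0]
responses = [0.02, 1.01, 2.05, 3.98, 8.03]
m, b, r_squared = linear_fit(concentrations, responses)
print(f"m = {m:.4f}, b = {b:.4f}, r^2 = {r_squared:.5f}")
```

Results from a spreadsheet's built-in slope, intercept, and r² functions can be compared against values obtained this way.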
5.1. Introduction
Although the formulas given above for calculation of statistical parameters may
seem complicated, matters are simplified by the ready availability of
spreadsheets and calculators which provide these values transparently. This
makes calculation of statistical parameters much more straightforward than in
the past, when direct application of these formulas was used. It is still useful to
have some familiarity with these formulas to understand how statistical
parameters are derived. In addition, there may be a need to verify the results
of statistical data generated by a spreadsheet or calculator; data can be
plugged directly into the formulas above to verify these results.
5.2. Development of Spreadsheets
Excel® and other spreadsheets incorporate all of the statistical parameters
discussed, as well as many others. Although individual spreadsheet functions
can be considered reliable, it is important to make sure that data is
presented to the spreadsheet with the proper syntax. Also, when spreadsheets
are used for multiple numerical calculations in the form of in-house developed
templates, it is important to protect the spreadsheet from inadvertent changes,
to verify the reliability of the spreadsheet by comparison with known results
from known data, and to ensure that the spreadsheet can handle unforeseen
data input needs. Spreadsheets developed in the ORS laboratory should be
looked upon as in-house developed software that should be qualified before
use, just as instruments are qualified before use.
5.3. Validation of Spreadsheets
General guidance for design and validation of in-house spreadsheets and
other numerical calculation programs includes the following considerations:
A. Lock all cells of a spreadsheet, except those needed by the user to
input data.
B. Make spreadsheets read-only, with password protection, so that only
authorized users can alter the spreadsheet.
C. Design the spreadsheet so that data outside acceptable conditions is
rejected (for example, reject non-numerical inputs).
D. Manually verify spreadsheet calculations by entering data at extreme
values, as well as at expected values, to assess the ruggedness of the
spreadsheet.
E. Test the spreadsheet by entering nonsensical data (for example
alphabetical inputs, <CTRL> sequences, etc.).
F. Keep a permanent record of all cell formulas when the spreadsheet has
been developed. Document all changes made to the spreadsheet and
control using a system of version numbers with documentation.
G. Periodically re-validate spreadsheets. This should include verification of
cell formulas and a manual reverification of spreadsheet calculations.
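The same considerations apply to other in-house calculation programs. As a rough sketch only, the input rejection in item C and the comparison with known results in item D might look like the following in a script; the function names, the acceptance range, and the tolerance are all hypothetical choices made for this illustration.

```python
import math
import statistics


def parse_entries(raw_entries, lower, upper):
    """Convert user entries to numbers, rejecting anything unacceptable."""
    values = []
    for raw in raw_entries:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            raise ValueError(f"non-numerical input rejected: {raw!r}") from None
        if not math.isfinite(value) or not (lower <= value <= upper):
            raise ValueError(f"input outside acceptable range: {raw!r}")
        values.append(value)
    return values


def matches_known_result(computed, expected, rel_tol=1e-9):
    """Compare a calculated value with an independently known result."""
    return math.isclose(computed, expected, rel_tol=rel_tol)


# Verification with known data: the standard deviation of 1, 2, 3 is exactly 1
known_data = parse_entries(["1.0", "2.0", "3.0"], lower=0.0, upper=100.0)
print(matches_known_result(statistics.stdev(known_data), 1.0))

# Nonsensical input should be rejected rather than silently accepted
try:
    parse_entries(["abc"], lower=0.0, upper=100.0)
except ValueError as error:
    print(error)
```

Checks of this kind are worth repeating at extreme as well as expected input values and recording as part of the validation documentation.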
6. Control Charts
A control chart is a graph of test results with established limits within which the test
results are expected to fall when the instrument or analytical procedure is in a
state of “statistical control.” A procedure is under statistical control when
results consistently fall within established control limits. Control charts have a variety of
uses other than identifying results that are out of control. A
chart will disclose trends and cycles, which allows real-time analysis of data
and provides information for deciding on corrective action before an entire analytical
system goes out of control. The use of control charts is strongly encouraged in
regulatory science.
6.1. Definitions
A. Central line: mean value of earlier determinations, usually a minimum
of twenty results
B. Inner control limit: the mean value ± 2 standard deviations
C. Outer control limit: the mean value ± 3 standard deviations
6.2. Discussion
A. Control charts are frequently used for quality control purposes in the
laboratory. Control charts serve as a tool for determining whether results
obtained on a routine basis (e.g. from quality control samples) are
acceptable for the intended purposes of the data.
B. The mean control chart consists of a horizontal central line and two
pairs of horizontal control limit lines. The central line marks the mean
value; the two pairs of lines mark the inner control limits (mean ± 2
standard deviations) and the outer control limits (mean ± 3 standard
deviations). Results are plotted on the
y-axis against the x-axis variable (e.g. date, batch number).
C. When a procedure is in statistical control, results are expected to fall
within the inner control limits approximately 95% of the time. Results
falling outside the inner control limit serve as a warning that the results
may be biased. Results falling outside the outer control limit indicate
the results are biased and corrective action should be taken.
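The definitions above can be illustrated with a short sketch that derives the central line and control limits from earlier determinations and then classifies a new result. The historical values are hypothetical, the function names are illustrative, and the requirement of twenty results reflects the usual minimum given in the definitions.

```python
import statistics


def control_limits(history, minimum_results=20):
    """Central line, inner (mean +/- 2 SD), and outer (mean +/- 3 SD) limits."""
    if len(history) < minimum_results:
        raise ValueError(f"at least {minimum_results} earlier results are needed")
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return {
        "central": mean,
        "inner": (mean - 2 * sd, mean + 2 * sd),
        "outer": (mean - 3 * sd, mean + 3 * sd),
    }


def classify(result, limits):
    """Classify a new result against the inner and outer control limits."""
    outer_low, outer_high = limits["outer"]
    inner_low, inner_high = limits["inner"]
    if result < outer_low or result > outer_high:
        return "out of control"
    if result < inner_low or result > inner_high:
        return "warning"
    return "in control"


# Hypothetical quality control results (percent recovery), for illustration only
history = [99.0, 101.0] * 10
limits = control_limits(history)
for new_result in (100.5, 102.5, 104.0):
    print(new_result, classify(new_result, limits))
```

Plotting each new result against the date or batch number, with these five horizontal lines drawn, gives the mean control chart described in item B.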
Where:
$s_{R_g}$ = standard deviation of the gross sample counting rate, and
$s_{R_b}$ = standard deviation of the background counting rate.
C. The sample rate plus or minus one standard deviation is reported as
$$ R_s \pm s_{R_s} = R_s \pm \sqrt{\frac{R_g}{t_g} + \frac{R_b}{t_b}} $$
Example. A sample counted for 100 seconds yields 2300 gross counts. The
background measured under identical conditions yields 100 counts in 10
seconds. Calculate the sample counting rate (counts per second) and the
standard deviation of the sample counting rate. Report the results at the 95%
confidence level.
$$ R_s = \frac{2300\ \text{counts}}{100\ \text{s}} - \frac{100\ \text{counts}}{10\ \text{s}} = 23\ \text{cps} - 10\ \text{cps} = 13\ \text{cps} $$
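$$ s_{R_s} = \sqrt{\frac{23}{100} + \frac{10}{10}} = 1.1\ \text{cps} $$
At two standard deviations, the result is reported as:
$$ R_s = 13\ \text{cps} \pm 2.2\ \text{cps} $$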
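The calculation in this example can be checked with a few lines of code. The sketch below follows the formula in item C, with the gross rate R_g and the background rate R_b each divided by its own counting time; the function and parameter names are illustrative.

```python
import math


def net_count_rate(gross_counts, gross_time_s, background_counts, background_time_s):
    """Return (R_s, s_Rs): net counting rate and its standard deviation, in cps."""
    r_g = gross_counts / gross_time_s             # gross counting rate
    r_b = background_counts / background_time_s   # background counting rate
    r_s = r_g - r_b                               # net sample counting rate
    s_rs = math.sqrt(r_g / gross_time_s + r_b / background_time_s)
    return r_s, s_rs


# Values from the example: 2300 counts in 100 s, background 100 counts in 10 s
r_s, s_rs = net_count_rate(2300, 100, 100, 10)
print(f"R_s = {r_s:.0f} cps, one standard deviation = {s_rs:.2f} cps")
print(f"R_s = {r_s:.0f} cps +/- {2 * s_rs:.1f} cps (two standard deviations)")
```

Reporting plus or minus two standard deviations, as in the last line, corresponds approximately to a 95% confidence level.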
where xi are the individual counts, and ∏ indicates that the product of
$$ x_g = \operatorname{antilog}\left( \frac{\sum_{i=1}^{n} \log x_i}{n} \right) $$
This formula is much easier for calculation purposes, particularly when
a large number of observations are involved.
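As a brief sketch, the logarithmic form can be compared with the direct product form (the n-th root of the product of the counts). Base-10 logarithms are assumed here, since the base is not stated; the counts are hypothetical and the function names are illustrative.

```python
import math


def geometric_mean_by_logs(counts):
    """Antilog of the mean of the base-10 logarithms of the counts."""
    if not counts or any(c <= 0 for c in counts):
        raise ValueError("all counts must be positive")
    mean_log = sum(math.log10(c) for c in counts) / len(counts)
    return 10 ** mean_log


def geometric_mean_by_product(counts):
    """n-th root of the product of the counts (can overflow for long lists)."""
    return math.prod(counts) ** (1 / len(counts))


# Hypothetical plate counts, for illustration only
counts = [120, 85, 240, 1500, 310]
print(f"by logarithms: {geometric_mean_by_logs(counts):.1f}")
print(f"by product:    {geometric_mean_by_product(counts):.1f}")
```

The two forms agree, but the logarithmic form avoids forming a very large product when many observations are involved.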
10.3. Most Probable Number
Another statistical concept unique to microbiological observations is that of
Most Probable Number (MPN). The Most Probable Number is a statistically
derived estimate of the presence of microorganisms based on the presence or
absence of growth in serially diluted samples. After an initial dilution, serial
dilutions of the sample are made (for example, 1:10, 1:100, and 1:1000) with a
number of replicate tubes (for example, 3 or 5) at each dilution. After
incubation, the presence or absence of growth in each tube is tabulated. The
resulting code (number of positive tubes) is compared with published tables to
give the most probable number of microorganisms per unit of original,
undiluted sample. Most probable number tables are published for various
numbers of tubes at a number of dilutions. The statistical derivation is beyond
the scope of this discussion, but is based on Poisson counting statistics.
Tables are published in the Bacteriological Analytical Manual (BAM), the
AOAC Official Methods of Analysis, and General Chapter <61> of the USP.
10.4. References
A. (Current Ed.). “Microbiological Examination of Nonsterile Products
<61>,” U. S. Pharmacopeia and national formulary. Rockville, MD:
United States Pharmacopeial Convention, Inc.
B. Tomlinson, L. (Ed.). (1998). Bacteriological analytical manual (8th ed.,
Rev. A, in hardcopy). Washington, DC: R. I. Merker, Ph.D., Office of
Special Research Skills, Center for Food Safety and Applied Nutrition,
U.S. Food & Drug Administration. The current version of the
Bacteriological Analytical Manual (BAM) is found online.
11. Statistics Applied to pH in Canned Foods
A. pH is a logarithmic measure for the acidity of an aqueous solution.
Since pH represents the negative logarithm of a number, it is not
mathematically correct to calculate simple averages or other summary
statistics. Instead, the values should be converted to hydrogen ion
concentrations, averaged, and re-converted to pH values.
B. The following guidance is provided:
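As an illustration of the conversion described in item A (not a statement of the guidance referred to in item B), the sketch below converts each pH value to a hydrogen ion concentration, averages the concentrations, and converts the average back to pH. The pH values are hypothetical and the function name is illustrative.

```python
import math


def mean_ph(ph_values):
    """Average pH values by way of their hydrogen ion concentrations."""
    if not ph_values:
        raise ValueError("at least one pH value is needed")
    concentrations = [10 ** (-ph) for ph in ph_values]   # [H+] = 10^(-pH)
    mean_concentration = sum(concentrations) / len(concentrations)
    return -math.log10(mean_concentration)               # back to pH


# Hypothetical pH readings, for illustration only
readings = [3.0, 5.0]
print(f"simple average of pH values: {sum(readings) / len(readings):.2f}")
print(f"pH of the mean concentration: {mean_ph(readings):.2f}")
```

The two numbers differ because the more acidic reading dominates the mean hydrogen ion concentration.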
15. Attachments
None