SCIENCES
STATISTICS
A statistic or datum is a measured or counted fact or piece of information stated as a single figure, such as a height, an age, a weight, the birth of a baby, etc. Statistics or data are the same facts stated in more than one figure, such as the heights of 10 persons, the blood pressures of 15 patients, dental caries in 5 children, etc. Statistics is the science of figures, and it can be applied in various fields/areas such as agriculture, economics, pharmacy, engineering, etc.
1. Physiology: To find out the correlation between two variables such as height and weight – whether weight increases or decreases as height increases or decreases.
2. Anatomy: To find out differences between means and proportions of normal measurements at two or more different places.
3. Pharmacology: To find out the action of a drug, to compare the actions of two drugs, to find out the relative potency of a new drug with respect to a standard drug, etc.
4. Medicine: To find out the efficacy of a particular drug, operation or line of treatment (i.e. comparison of cases and controls), to find out an association between two attributes such as cancer and smoking or malaria and social class, and to identify the signs and symptoms of a disease or syndrome.
5. Dentistry: To find out dental caries in a community or among school-going children.
6. Community Medicine: To test the usefulness of sera and vaccines in the field/community, i.e. the percentage of attacks or deaths among vaccinated subjects is compared with that among the unvaccinated. In epidemiological studies the role of causative factors is statistically tested; for instance, deficiency of iodine as an important cause of goitre in a community is confirmed by comparing the incidence of goitre before and after giving iodized salt. To find out different rates/ratios, and the prevalence and incidence rates of a disease in a community.
ROLE OF BIOSTATISTICS IN MEDICINE
WHAT IS STATISTICS?
The term “statistics” is used in two ways. First, it refers to the everyday use of the word: data – numerical observations, quantitative information.
Examples
1. Number of trained medical personnel in Maharashtra (district-wise). 2. Birth weights of babies born in a hospital/community. 3. Ages of patients seen at an orthopaedic clinic in a hospital. 4. Prevalence of oral cancers, per 1000 population, in Ahmednagar district. 5. Prevalence of physical disability in children < 14 years in a community. 6. Amount of creatinine in mg per litre in a 24-hour urine specimen.
Second, statistics refers to the discipline comprising statistical methods: the study of scientific methods of collecting, processing, reducing, presenting, analyzing and interpreting data, and of making inferences and drawing conclusions from numerical data.
WHAT IS BIOSTATISTICS?
The term “Biostatistics” can be understood as (1) statistics arising out of the biological sciences, including the fields of medicine and public health; and (2) the methods and principles used in dealing with statistics in the biological sciences, including medicine and public health, and in planning, conducting and analyzing investigations in these branches.
MAIN USES OF STATISTICAL METHODS
Three main uses of statistical methods are:
a) To collect data in the best possible and scientific way. This includes methods of: designing forms for data collection; organizing the collection procedure; designing and executing experiments/clinical trials; and conducting surveys in a population.
Examples
1. Collection of data from mothers about their breast feeding practices. 2. Systematic collection of data on births
and deaths. 3. Collection of data to compare the relative effects of ergometrine+oxytocin and ergometrine alone
in the third-stage management of obstetric labour. 4. Collection of data on industrial workers of a given
geographical area.
b) To describe the characteristics of a group or a situation: This is accomplished mainly by data reduction, data summary, and data presentation (classification, tabulation and graphs/diagrams). c) To analyze data and to draw conclusions from such analyses: This involves the use of various analytical techniques and the use of probability concepts in drawing conclusions.
The application of statistics is also useful in developing a critical thinking faculty, in order to be able to: think scientifically, logically and critically about medical problems; properly assess the available evidence for decision-making; be aware of possible risks associated with medical decisions; and identify decisions and conclusions that lack a scientific and logical basis.
Statistical principles and concepts are applied in various areas in medicine. Some examples are given below.
A) Handling of variation
Variation in a characteristic occurs when its value changes from subject to subject, or from time to time within
the same subject. Nearly all characteristics encountered in health care delivery, whether physiological,
biochemical or immunological, exhibit variation. The extent of this variability, biological or otherwise, is learnt by defining normal values and fixing normal limits.
Examples: Age, weight, height, blood pressure, cholesterol level, bilirubin, albumin, immunoglobulin levels,
platelet count, glucose level etc.
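Normal limits are commonly fixed as the mean plus or minus two standard deviations of the observed values. A sketch of this calculation, assuming a hypothetical set of systolic blood pressure readings:

```python
# Normal limits taken as mean ± 2 SD of hypothetical
# systolic blood pressure readings (mmHg).
from statistics import mean, stdev

bp = [118, 122, 120, 125, 119, 121, 124, 117, 123, 120]
m, s = mean(bp), stdev(bp)
lower, upper = m - 2 * s, m + 2 * s
print(f"mean = {m:.1f}, SD = {s:.2f}")
print(f"normal limits: {lower:.1f} to {upper:.1f} mmHg")
```

Values falling outside these limits would be flagged as outside the normal range for this (hypothetical) group.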
B) Diagnosis of patients’ ailments and community health
Diagnosis is the process whereby the health status of an individual, or group of individuals, and factors
producing it, is identified. The various disease categories, one distinct from the other, based on clustering of
signs, symptoms and magnitude of biochemical values, have often been established by procedures employing
implicit statistical methods.
In placing an individual or a community’s health status in one of these categories there is always some
uncertainty. It may happen that the stated signs and symptoms are not exactly the same as those listed for, and
defining, that category. Conversely, more than one category may have the same set of signs and symptoms
ascribed to it.
Statistical reasoning is often unconsciously employed when a doctor selects a disease category with the best
chance of being correct.
experience with similar patients or communities that had received the intervention. - Reports in the literature of
clinical trials or experiments to assess the relative efficacy of different drugs and other methods of treatment. -
Objective assessment of the health worker’s previous experiences.
The design, execution and analysis of medical experiments and intervention programs must employ sound
statistical principles and methods if the findings and conclusions are to be valid.
E) Public health, health administration, and planning
The major application here is in the use of data relating to illness in the population in order to make community
diagnosis. This requires knowledge of: characteristics such as size and age structure of the population; - the
health profile of the population, in terms of disease or risk factor distribution; - influence of environmental
factors; - use of vital statistics (data on births and deaths) In health administration and planning, use is also made
of data on the distribution of all levels of health care resources (need, availability, utilization etc.)
LIMITATIONS OF STATISTICS: Statistics, with its wide applications in almost every sphere of human
activity, is not without limitations. The following are some of its important limitations:
i) Statistics is not suited to the study of qualitative phenomena: Statistics, being a science dealing with sets of numerical data, is applicable only to those subjects of inquiry which are capable of quantitative measurement. As such, qualitative phenomena like honesty, poverty, intelligence, culture, status of health, etc., which cannot be expressed numerically, are not capable of direct statistical analysis. However, statistical techniques may be applied indirectly by first reducing the qualitative expressions to precise quantitative terms.
For example, the intelligence of a group of candidates can be studied on the basis of their scores in a certain test.
ii) Statistical laws are not exact: Unlike the laws of the physical and natural sciences, statistical laws are only approximations and not exact. On the basis of statistical analysis we can talk only in terms of probability and chance, and not in terms of certainty. Statistical conclusions are not universally true; they are true only on an average. For example, consider the statement: “It has been found that 20% of certain surgical operations by a particular doctor are successful.” The statement does not imply that if the doctor operates on 5 persons on any day and four of the operations prove fatal, the fifth must be a success. The fifth patient may also die of the operation, or of the five operations on any day 2 or 3 or even more may be successful. By the statement we mean that, as the number of operations becomes larger and larger, we should expect, on the average, 20% of the operations to be successful.
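This long-run behaviour can be illustrated with a small simulation. The sketch below assumes a hypothetical 20% success probability and shows that the observed proportion only settles near 0.20 as the number of operations grows:

```python
# Simulating operations with a 20% success probability: the observed
# success proportion approaches 0.20 only as n grows large.
import random

random.seed(1)  # fixed seed so the run is reproducible
results = {}
for n in (5, 100, 10_000):
    successes = sum(random.random() < 0.20 for _ in range(n))
    results[n] = successes / n
    print(n, results[n])
```

For small n (e.g. 5 operations) the observed proportion can be far from 0.20; for large n it is close to it, which is exactly the sense in which statistical laws hold "on the average".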
iii) Statistics does not study individuals: Statistics deals with aggregates of objects and does not give any specific recognition to the individual items of a series. Individual items, taken separately, do not constitute statistical data and are meaningless for any statistical inquiry. For example, selected data of a patient have limited value in statistics unless they are compared either with previous data of the same patient or with concurrent data of other patients. Thus, statistical analysis is better suited to problems where group characteristics are to be studied.
iv) Statistics is liable to be misused: Perhaps the most important limitation of statistics is that only experts should use it. As the saying goes, “Statistical methods are the most dangerous tools in the hands of the inexpert.” The use of statistical tools by inexperienced and untrained persons may lead to very fallacious conclusions. One of the greatest shortcomings of statistics is that figures do not bear on their face the label of their quality, and as such can be moulded and manipulated in any manner to support one’s line of argument.
COLLECTION OF DATA
Introduction: The success of any statistical investigation depends upon the availability of accurate and reliable data. Collection of data is a very basic activity in decision making. The following points should be considered by an investigator before starting data collection:
(1) Purpose, (2) Scope, (3) Limitations and (4) Degree of accuracy.
Primary and Secondary data: Data used in different studies are termed either ‘primary’ or ‘secondary’, depending upon whether they were collected specifically for the study in question or for some other purpose.
Primary data: data collected under the control and direct supervision of the investigator (the investigator collects the data himself) are called primary or direct data. (Direct or primary method)
Secondary data: data not collected by the investigator but derived from other sources are called secondary or indirect data. (Indirect or secondary method)
Sources of Primary data: Survey
Sources of Secondary data: Published and Unpublished
Published sources: National and international organizations which collect statistical data and publish their findings as statistical reports.
National Organizations: Census, Sample Registration System (SRS), National Sample Survey Organizations
(NSSO), National Family Planning Association (NFPA), Ministry of health, Magazines, Journals, Institutional
reports etc.
International Organizations: World Health Organization (WHO), United Nations Organizations (UNO),
UNICEF, UNFPA, World Bank etc.
Unpublished sources: Records maintained by various Govt. and private offices, studies made by research institutes, schools etc. These data are based on internal records; they provide authentic statistical data and are much cheaper than primary data.
Questionnaire method: A written list of questions, the answers to which are recorded by the respondent. There is no one to explain the meaning of the questions to respondents, so the questions must be clear and easy to understand, and the questionnaire should be developed in an interactive style. In this method the investigator draws up a questionnaire containing all the relevant questions which he wants to ask the respondents, and the answers are recorded accordingly.
Criteria for selecting the questionnaire method:
1.The nature of investigations 2. The geographical distribution of the study population
3. The type of study population
Questions may be formulated as: Open ended and Closed ended
In open-ended questions the possible responses are not given; respondents write the answers in their own words. In closed-ended questions the possible answers are set out in the questionnaire or schedule, and the respondent or investigator ticks the category that best describes the respondent's answer.
Examples:
Open - ended questions:
What is your current age? _____ Years
How would you describe your marital status? ________
What is your average annual income? _____
In your opinion, what are the qualities of a good administrator?
1. _____________
2. ______________
3. _______________
4. ________________
Examples: Closed - ended questions:
a. Indicate your age by placing a tick mark
1. under 15 years
2. 15-29 years
3. 30-44 years
c. What is your average annual income?
1. under Rs. 10000/-
2. Rs. 10000-19999/-
3. Rs. 20000-39999/-
4. Rs. 40000/- and above
In closed-ended questions the categories are developed in advance and no change is allowed later, so the investigator should be very certain of them. In open-ended questions there is a chance to develop the categories at the time of analysis. Closed-ended questions are extremely useful for eliciting factual information, and open-ended questions for seeking opinions, attitudes and perceptions. The choice between open-ended and closed-ended questions should be made according to the purpose for which a piece of information is to be used, the type of study population, the methods of communicating the findings and the relationship.
Open-ended questions provide in-depth information if used in an interview by an experienced interviewer. In a questionnaire, open-ended questions give respondents the opportunity to express themselves freely, resulting in a greater variety of information, and they eliminate the possibility of investigator bias. In closed-ended questions the possible responses are already categorized, so they are easy to analyze; however, there is a greater possibility of investigator bias, since the researcher may list only certain response patterns, and the ease of answering from a ready-made list of responses may create a tendency among some respondents and interviewers to tick a category without thinking through the issue.
Considerations in formulating questions: always use simple and everyday language; do not use ambiguous questions; do not ask double-barrelled questions; do not ask leading questions; do not ask questions that are based on presumptions.
Mail Questionnaire:
A list of questions (a questionnaire) is prepared and mailed to the respondents, who are expected to fill in the questionnaire and send it back to the investigator. This method can easily be adopted where the field of investigation is very vast and the respondents are spread over a wide geographical area. It can be adopted only where the respondents are literate and can understand written questions and answer them.
CLASSIFICATION OF DATA
The process of arranging data in different groups according to similarities. The process of classification can be compared with the process of sorting letters in a post office.
SIGNIFICANCE
Classification is fundamental to the quantitative study of any phenomenon. It is recognized as the basis
of all scientific generalization and is therefore an essential element in statistical methodology. Uniform definitions and uniform systems of classification are prerequisites for the advancement of scientific knowledge.
WHAT IS CLASSIFICATION?
Classification is a process of arranging a huge mass of heterogeneous data into homogeneous groups to
know the salient features of the data.
WHY CLASSIFICATION?
It facilitates comparison of data within and between classes. It renders the data more reliable, because homogeneous figures are separated from heterogeneous figures. It helps in proper analysis and interpretation of the data.
Objectives
1. To condense the mass of data in such a way that salient features can be readily noticed. 2. To compare two variables. 3. To prepare data that can be presented in tabular form. 4. To highlight the significant features of the data at a glance. 5. To reveal patterns. 6. To give prominence to important figures. 7. To enable analysis of the data. 8. To help in drafting a report.
CLASSIFICATION OF DATA
Common types of classifications are: 1. Geographical i.e. according to area or region 2. Chronological i.e.
according to occurrence of an event in time 3. Quantitative i.e. according to magnitude 4. Qualitative i.e.
according to attributes
1. Geographical:
In this type of classification, data are classified according to area or region, for example the state-wise distribution of sex ratio in India. The listing of individual entries is generally done in alphabetical order, or according to size to emphasize the importance of a particular area or region.
2. Chronological: When the data is classified according to the time of its occurrence, it is known as chronological
classification.
For example: Distribution of deaths over the last five years
-------------------------------------------------------
Year No. of deaths
-------------------------------------------------------
2001 241
2002 348
2003 412
2004 548
2005 698
3. Quantitative data: When the data are classified according to some characteristic that can be measured; continuous data can take all values of the variable.
Definition: any statistical data described both by measurement and by counting are called quantitative data. For example: height, weight, pulse rate, BP, BSL, age, RR, income, etc.
4. Qualitative data: When the data are classified according to some attributes (distinct categories) which are not capable of measurement. An attribute is divided into two classes, one possessing the attribute and the other not possessing it.
Definition: any statistical data described only by counting, not by measurement, are called qualitative data. For example: sex, blood group, births, deaths, number of patients suffering from a disease, socio-economic classification such as lower, middle and upper, numbers vaccinated and not vaccinated, etc.
-Technical terms for quantitative classification:
a. Variable: a quantity which changes its value is called a variable, e.g. age, height, weight, etc. Continuous variables: age, height, weight, etc. Discrete variables: population of a city, production of a machine, spare parts, etc.
b. Class limits: the lowest and highest values of a class are called its class limits.
c. Open-ended and closed-ended classes:

Open ended (Exclusive method)    Closed ended (Inclusive method)
0 - 10                           0 - 9
10 - 20                          10 - 19
20 - 30                          20 - 29
30 - 40                          30 - 39
40 - 50                          40 - 49
50 - 60                          50 - 59
d. Class frequency: the number of items belonging to the same class
e. Class magnitude or class interval: the length of class i.e. the difference between the upper limit and lower
limit of the class.
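Under the exclusive method shown above, a value equal to an upper class limit belongs to the next class. A small sketch of this rule in Python (class width of 10 assumed):

```python
# Assign a value to its exclusive-method class of width 10:
# classes are 0-10, 10-20, ... and a value of 10 falls in
# the class 10-20, not 0-10.
def exclusive_class(value, width=10):
    lower = (value // width) * width
    return f"{lower}-{lower + width}"

print(exclusive_class(10))  # 10-20
print(exclusive_class(39))  # 30-40
```

Under the inclusive method (0-9, 10-19, ...) the upper limit would instead belong to its own class.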
Frequency distribution: The way in which the items are spread out or distributed into the various classes is called the frequency distribution.
Two types of frequency distributions: Continuous and discrete
Formation of discrete frequency distribution table: By tally bars (tally marks), count how many times a particular value is repeated; this number is the frequency of that value.
Example: given below are the marks obtained by 20 students in a term-ending theory examination. Form the discrete frequency distribution.
10, 15, 10, 10, 15, 20, 15, 15, 20, 10, 20, 15, 20, 25, 15, 15, 15, 20, 15, 25

Marks   Tally bars   Frequency
10      ||||             4
15      |||| ||||        9
20      ||||             5
25      ||               2
Total                   20
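The tally count above can be reproduced directly with Python's `collections.Counter`:

```python
# Discrete frequency distribution of the 20 marks by direct counting.
from collections import Counter

marks = [10, 15, 10, 10, 15, 20, 15, 15, 20, 10,
         20, 15, 20, 25, 15, 15, 15, 20, 15, 25]
freq = Counter(marks)
for value in sorted(freq):
    print(value, freq[value])   # 10 4, 15 9, 20 5, 25 2
print("Total", sum(freq.values()))
```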
Formation of continuous frequency distribution table: By counting how many values fall into each class.
Example: given below are the weights in kg of 25 students. Form the continuous frequency distribution.
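The same counting idea extends to continuous data. Since the original list of 25 weights is not reproduced here, the sketch below uses ten made-up weights and exclusive classes of width 10:

```python
# Continuous frequency distribution: group hypothetical weights (kg)
# into exclusive classes 40-50, 50-60, 60-70.
from collections import Counter

weights = [42, 55, 48, 61, 53, 47, 58, 66, 50, 44]
freq = Counter((w // 10) * 10 for w in weights)
for lower in sorted(freq):
    print(f"{lower}-{lower + 10}: {freq[lower]}")
```

Each value is reduced to the lower limit of its class before counting, so a weight of exactly 50 kg falls in the class 50-60.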
PRESENTATION OF DATA
Objectives:
a) To arrange the data in such a way as to arouse the interest of a reader.
b) To make the data sufficiently concise without losing important details.
c) To present data in a simple form that enables the reader to form quick impressions and to draw conclusions, directly or indirectly.
d) To facilitate further statistical analysis.
Types of presentation:
i) Ordered array, ii) Tabulation, and iii) Drawings
Ordered array: When the data are simple and few, they can be presented by arranging them in an orderly manner. The order may be ascending or descending in magnitude if the data are quantitative; it may be alphabetical, or follow any other acceptable norm, if the data are qualitative.
Table method (Tabulation) A table is a systematic arrangement of statistical data in columns and rows. The
purpose of a table is to simplify the presentation and to facilitate comparison.
Role of tabulation: The significance of tabulation will be clear from the following points: It simplifies complex
data, It facilitates comparison, It gives identity to the data
PARTS OF A TABLE
(1) Table Number (2) Title of a table (3) Head Note (4) Caption (5) Stub (6) Body of a table (7) Foot note and
(8) Source note
Example:
Age in Years   Males   Females   Total
50-60            27      13        40
60-70            23       7        30
70-80            18       2        20
>80              10       0        10
Total            90      30       120
Cross tabulation (Two-way table): A frequency table involving at least two variables that have been
cross classified. This table furnishes information about two interrelated characteristics for a particular
phenomenon.
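A two-way table can be built by counting pairs of values. A minimal sketch with hypothetical patient records (sex cross-classified with treatment outcome; the records are made up for illustration):

```python
# Cross tabulation (two-way table) of sex against outcome,
# using made-up records.
from collections import Counter

records = [("M", "cured"), ("F", "cured"), ("M", "not cured"),
           ("M", "cured"), ("F", "not cured"), ("F", "cured")]
table = Counter(records)
for sex in ("M", "F"):
    counts = {out: table[(sex, out)] for out in ("cured", "not cured")}
    print(sex, counts)
```

Each cell of the two-way table is simply the count of one (sex, outcome) combination.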
1. The advantage, as well as the drawback, of graphs is that they give a quick impression. Graphs facilitate the understanding and comparison of strengths, correlations and trends.
2. Interpretation is done by rough translation of the points into actual figures. A change in the scale will
give a different pattern.
3. Graphs should be adjuncts to respective tables and not their substitutes.
4. While comparing, note the difference in scales.
5. They must be drawn following certain basic rules, which depend partly on convention, partly on mathematical considerations and partly on personal preference.
Types of Drawings:
For Quantitative Data: The following graphs are normally drawn for quantitative data.
a) Histogram, b) Frequency polygon, c) Frequency curve, d) Cumulative frequency curve (ogive), e) Line chart, and f) Scatter diagram
For Qualitative Data: The following diagrams are normally drawn for qualitative data.
(a) Bar diagram (simple, multiple & proportional), (b) Pie diagram, (c) Pictogram (picture diagram), (d) Contour map
Other types: Age-sex pyramid, Epidemic curve etc.
* Bar diagram: This is the most commonly used device for presenting categorical data. A bar diagram consists of a group of equidistant rectangles, one for each group of the data, in which the values are represented by the length or height of the rectangles. The bars should have uniform width and a common base line, and may be drawn vertically or horizontally.
Types of Bar diagrams:
I) Simple bar diagram: It is the simplest and most frequently used diagram for the comparison of two or more
items or values of a single variable or category of data. For example: The data relating to births of a region
during 91 - 95.
II) Multiple bar diagram: If two or more sets of inter-related variables are to be presented graphically, multiple
bar diagrams are used. For example: The data relating to births and deaths of a region during 91 - 95.
III) Proportionate (Component) bar diagram: Proportionate bar diagrams are used if the total magnitude of the given variable is to be divided into various parts or components. First a bar representing the total is drawn; it is then divided into various segments, each segment representing a given component of the total. A key index is given along with the diagram to explain the segments. Thus it is useful not only for presenting the several components of the variable but also enables us to make a comparative study.
IV) Pie diagram: A circle divided into various sectors or segments, representing certain proportions or percentages of the total, is known as a pie diagram. For example, with the help of a pie diagram we can exhibit information relating to the causes of death in children in a community (diarrhoea & enteritis, prematurity & atrophy, bronchitis & pneumonia). While laying out the sectors, it is common practice to begin with the largest component sector at the 12 o'clock position on the circle. The other component sectors are placed in clockwise succession in descending order of magnitude.
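Constructing the sectors of a pie diagram is simple arithmetic: each component's angle is its share of the total times 360 degrees. A sketch using hypothetical counts of child deaths by cause:

```python
# Pie diagram sector angles: (component / total) * 360 degrees,
# listed in descending order of magnitude (hypothetical counts).
deaths = {"Diarrhoea & Enteritis": 90,
          "Bronchitis & Pneumonia": 60,
          "Prematurity & Atrophy": 30}
total = sum(deaths.values())
for cause, n in sorted(deaths.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cause}: {n / total * 360:.0f} degrees")
```

The angles always sum to 360 degrees, and sorting in descending order reproduces the clockwise layout convention described above.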
V) Pictogram: A pictogram is a technique of presenting statistical data through appropriate pictures. It is popularly used when the facts are to be presented to laymen and less educated audiences.
Other charts:
Epidemic curve: This gives the chronological distribution of the number of cases of a disease i.e., the
distribution in time.
Population Pyramid: Two histograms showing age distribution of a population separately for the males and
females, are put base-to-base.
quantitative data. As social sciences advance from qualitative stages, statistical methods will gain wide currency.
The broad aspects of data analysis are as follows:
Data Editing
Data classification or establishment of categories
Data coding
Data tabulation
Statistical analysis of data
Drawing inferences.
The discussion on statistical analysis of data is purposefully kept simple and preliminary, as it is considered quite essential for any student of research methodology to be well versed in these fundamental concepts. For an advanced treatment of any topic, students may refer to a specialized text on statistics.
DATA EDITING
This is the first step in processing the data gathered through the data collection instrument – the questionnaire. Editing refers to the process of examining the data for any obvious errors, omissions, inconsistencies and illegible recording, with the aim of rectifying them at an early stage. It calls for a careful scrutiny of the questionnaires to assess completeness, accuracy and uniformity. The quality of editing influences the convenience and speed of the later stages of data analysis. Data can be edited at two stages – field editing and central editing.
Field editing: Very often the interviewer records the responses, or his observations, during the course of administering the questionnaire in abbreviated form, or at times as an illegible scribble. It is therefore prudent that after each interview is over he should review the questionnaire to complete abbreviated or shorthand responses, rewrite illegible scribbles and correct any omissions. This type of editing is essential, as it is often difficult for the staff undertaking central editing to decipher every field investigator's notes exactly. It is preferable that the interviewer does this editing on the same day, or the next day at the most, so that the information is fresh and he can recall it with ease. If required, the interviewer may again contact the respondent and ask for another appointment. The interviewer should be clearly briefed not to engage in any guesswork.
Central editing: At times, questionnaires from all places of fieldwork are mailed to the central office in batches, or after the entire fieldwork is completed. If the number of questionnaires (the sample size) is small, a single qualified editor can do the task of editing; otherwise a team of persons may be employed. Inappropriate answers are struck off, answers recorded in wrong units are rectified, non-responses are segregated, etc. Such editing is generally done in red ink, to help distinguish between the recorded information and the editor's remarks. The persons doing the editing work must be conversant with the research study, the questionnaire, the interviewer instructions, the coding pattern to be followed, etc. The editors are usually instructed to put their signatures and dates at appropriate places. In any case, no original data must be erased. At the end of the editing process, any questionnaires that do not meet the criteria of completeness, accuracy and consistency are discarded.
DATA CODING
Coding refers to the process of assigning numerals or other symbols to the answers or responses. A coding
scheme or coding frame needs to be designed for every question, such that the responses fall in specific
categories. The categories, as said earlier, must be mutually exclusive and collectively exhaustive. Care is taken
to define every class along a single dimension or concept. Codification and classification are largely intertwined.
It helps the further process of data tabulation. Many times the questionnaire is pre-coded i.e. the responses are
already put into specific categories. The respondent himself may be asked to assign appropriate codes to his
responses. Coding can be done by the interviewer during the course of interview itself. This is usually the case
for dichotomous and multiple-choice questions. In case of open-ended questions, usually the coding is done by
an experienced coder, at the central office. An experienced researcher will give adequate thought to the various
aspects of coding, right at the time of designing the questionnaire. Proper coding helps tabulation and computer
entry of data.
Coding errors should always be kept to a minimum, if not completely eliminated. Training of the interviewers and coders helps in reducing inaccuracies. The rules of coding must be explained to the coders with appropriate examples, and they may also be given dummy data for practice. Any revisions, if required, are carried out in the
codes. As always, responses to open-ended questions or qualitative data are hard to code. It is important that a
given type of answer is assigned to a given category, appropriately and consistently.
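A coding frame as described above can be sketched as a simple mapping, with categories that are mutually exclusive and collectively exhaustive. The scheme, category names and code numbers below are hypothetical:

```python
# A hypothetical coding frame for marital status; unrecognized
# answers fall into the residual "other" category (code 9).
MARITAL_CODES = {"single": 1, "married": 2, "widowed": 3,
                 "divorced": 4, "other": 9}

def code_response(answer):
    # Normalize the raw answer before looking up its code,
    # so "Married" and " married " receive the same numeral.
    return MARITAL_CODES.get(answer.strip().lower(), MARITAL_CODES["other"])

print(code_response("Married"))   # 2
print(code_response("unknown"))   # 9
```

The residual category guarantees the frame stays collectively exhaustive, so every answer, however unexpected, is assigned some code consistently.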
DATA TABULATION
Tabulation refers to summarizing the assembled heap of data in the form of a matrix in a concise and logical
fashion. It reduces data to a compact form. Tabulation builds on the earlier processes of data editing,
categorization and coding. If the researcher is familiar with his study and its various dimensions, it is possible
for the researcher to draw tentative tabulations at the point of designing the questionnaire. He may use dummy
data for this purpose or may use secondary data from earlier studies. Also, pre-testing the questionnaire by
means of a pilot study presents some hints regarding the kind of appropriate tabulation required. Tabulation may
be done manually or by mechanical methods. These days, computerized tabulation is the norm. Manual or hand tabulation is practical when the length of the questionnaire (i.e. the number of questions) and the sample size are small. For large-scale commercial surveys, computerized tabulation is preferred.
Tabulation can be one-dimensional or bi-dimensional, i.e. simple or complex. Simple tabulation is uni-dimensional, i.e. it gives information about one or more groups of independent questions. Complex tabulation is bi-dimensional and gives information about one or more inter-related questions.
It is a good practice to give a short, clear and relevant title to every table. The table is given a unique number and
the table number and its title being placed above the table. These two help in easy and specific identification of
the table. The row and column headings are bold type faced and short and concise, along with their appropriate
units of measure. The source of data is acknowledged on the same page by way of a footnote. Gridlines make the
table more presentable and legible. Columns are numbered to facilitate quick and easy reference. Comparative
data is placed in adjacent columns (or rows). One category of data may be separated from the other by means of
thick lines or by leaving a column/row blank. Negative figures may have a 'minus' sign prefixed to them or may
be mentioned in parenthesis. Presentation of data should be logical and orderly, flowing from the more
significant one to the lesser one. The prime objective of tabulation must be to facilitate easy, fast and accurate processing and analysis.
Descriptive analysis describes the variation dimensions of the object under study. It gives a vivid picture of the
subject matter as regards size, composition, preferences, attitudes, etc. It may use one or more dimensions of
analysis. If the analysis is based on one variable it is called univariate analysis. Bivariate analysis deals with two variables and multivariate analysis with more than two variables. Multivariate analysis includes multiple regression analysis, multiple discriminant analysis, canonical analysis, multivariate analysis of variance and factor analysis. Inferential analysis is concerned with drawing inferences and conclusions from the
gathered mass of data. Inferential analysis consists of two areas – statistical estimation and hypothesis testing.
Statistical estimation involves estimating the population parameters from the sample statistics and is an inherent
aspect of any sample survey. This forms the subject matter of an independent chapter. Hypothesis testing refers
to the application of statistical techniques to accept or reject the proposed hypothesis at specific levels of
significance under assumed population parameters. This is discussed in detail in a later chapter.
MEASURES OF CENTRAL TENDENCY (Centering Constants)
Objectives:
(1) To find out one representative value. (2) To locate and summarize the entire set of varying values.
(3) To make decision concerning the entire set. (4) To compare different distributions.
Significance: Condensing the mass of data into one single value enables us to get an idea of the entire data. It also enables us to compare two or more sets of data. A measure of central tendency represents the whole distribution by a single, unique value.
Characteristics of good measure of central tendency:
It should be easy to understand. It should be simple to calculate. It should be based on all observations. It should
be uniquely defined. It should be capable of further algebraic treatment. It should not be unduly affected by
extreme values.
Important measures of central tendency or centering constants which are commonly used in medical science, are:
1. Mean (Average)
2. Median
3. Mode
4. Geometric mean
5. Harmonic mean
1. Mean:
The ratio of the sum of all the values to the total number of observations in a series of data is called the Mean or Average.
General Formula:
If X1,X2,X3,X4,………..XN be the ‘N’ number of observations in a data then,
Mean = (X1 + X2 + X3 + X4 + ……… + XN) / N, i.e. Mean = X̄ (X bar) = ΣX / N
Merits of mean: It is simplest to understand and easy to compute. It is affected by the value of every item in
the series. It is the centre of gravity, balancing the values on either side of it. It is a calculated value, not based on position in the series.
2. MEDIAN:
The centre-most value in a series of data is called the Median. The median is the 50th percentile value, below which 50% of the values in the sample fall. It divides the whole distribution into two equal parts.
General formula:
Median = Size of the ((N+1)/2)th observation in a series of data arranged in ascending or descending order.
Merits of median: It is based only on the position of the values in the series, not on their magnitudes. Extreme values do not affect the median as strongly as the mean. It is the most appropriate average when dealing with qualitative data. The value of the median can be determined graphically, whereas the value of the mean cannot.
3. Mode The most commonly or frequently occurring observation in a series of data is called as Mode.
Relationship between Mean, Median & Mode: Mode = 3 Median – 2 Mean
Relation in their size: in a negatively skewed distribution, Mean < Median < Mode (the order reverses in a positively skewed distribution, and all three coincide in a symmetrical one).
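As a quick numerical check, the three averages can be computed with Python's standard statistics module; the ten DBP readings below are the same illustrative values used in the SD example later in this chapter.

```python
from statistics import mean, median, mode

# DBP (mm Hg) for 10 patients -- illustrative values
dbp = [70, 80, 94, 70, 58, 66, 78, 67, 82, 60]

print(mean(dbp))    # 72.5
print(median(dbp))  # average of the two middle values of the sorted list: 70.0
print(mode(dbp))    # the most frequent reading: 70
```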
MEASURES OF CENTRAL TENDENCY FOR CONTINUOUS FREQUENCY DISTRIBUTION
In case of continuous frequency distribution following formula can be used:
Mean = X̄ = A + (Σfd′ / N) × C
Where, A = assumed mean, d′ = (m − A) / C, m = mid points of the classes, C = size of the equal class interval
For example: Following data shows the distribution of Pulse rate / min for 210 cases. Find out Mean pulse
rate /min.
Pulse rate (X) No. of cases (f)
-------------------------------------------------------------------------------
65-70 15
70-75 41
75-80 58
80-85 47
85-90 32
90-95 17
Solution:
Pulse Rate (X) No. of Cases (f) Mid Points (m) d = m−A d' = (m−A)/C fd'
65-70 15 67.5 -10 -2 -30
70-75 41 72.5 -5 -1 -41
75-80 58 77.5=A 0 0 0
80-85 47 82.5 +5 +1 47
85-90 32 87.5 +10 +2 64
90-95 17 92.5 +15 +3 51
Total N=∑f=210 ∑fd'=91
Mean = X̄ = A + (Σfd′ / N) × C = 77.5 + (91 / 210) × 5 = 77.5 + 0.4333 × 5 = 77.5 + 2.167
Thus, Mean Pulse rate = 79.67 / minute
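The step-deviation calculation can be verified with a short Python sketch (class mid-points and frequencies taken from the table; A = 77.5 and C = 5 as assumed there; any difference from the worked figure is rounding only):

```python
mids  = [67.5, 72.5, 77.5, 82.5, 87.5, 92.5]   # class mid points (m)
freqs = [15, 41, 58, 47, 32, 17]               # number of cases (f)
A, C = 77.5, 5                                 # assumed mean and class interval
N = sum(freqs)                                 # 210
sum_fd = sum(f * (m - A) / C for f, m in zip(freqs, mids))  # Σfd' = 91
mean_pulse = A + sum_fd / N * C
print(round(mean_pulse, 2))  # 79.67
```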
MEDIAN
Me = L1 + {(N/2 − c.f.) / f} × C
Where, L1 = lower limit of the median class; the median class is the class whose cumulative frequency is the first to exceed N/2; c.f. = cumulative frequency of the class preceding the median class; N = Σf; f = frequency of the median class; and C = class interval.
For example: Following data shows the distribution of Weight in Kgs of 210 TB patients in a hospital. Find out the Median weight.
Solution:
Weight (X) No. of Pts (f) Cumulative freq (c.f.)
60-70 15 15
70-75 41 56
75-80 58 114
80-85 47 161
85-90 32 193
90-95 17 210
Total N=∑f=210
Me = L1 + {(N/2 − c.f.) / f} × C
Now, N/2 = 210/2 = 105. The c.f. just greater than 105 is 114, so the median class = 75-80.
Then, Me = 75 + {(105 − 56) / 58} × 5 = 75 + (49/58) × 5 = 75 + 4.22 = 79.22 Kg
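The same median-class logic can be expressed in a few lines of Python (class limits and frequencies from the table above):

```python
# (lower limit, upper limit, frequency) for each weight class
classes = [(60, 70, 15), (70, 75, 41), (75, 80, 58),
           (80, 85, 47), (85, 90, 32), (90, 95, 17)]
N = sum(f for _, _, f in classes)   # 210
half, cf = N / 2, 0                 # N/2 = 105
for lo, hi, f in classes:
    if cf + f >= half:              # first class whose cumulative frequency reaches N/2
        median_wt = lo + (half - cf) / f * (hi - lo)
        break
    cf += f
print(round(median_wt, 2))  # 79.22
```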
MODE
Mode = L1 + {(f1 − f0) / (2f1 − f0 − f2)} × C
Where, L1 = lower limit of the modal class, Modal class= Class having highest frequency f0= frequency of the
class preceding the modal class, and f1 = frequency of the modal class and f2 = frequency of the class succeeding
the modal class. And C = class interval
For example: Following data shows the distribution of Hb (gm%) of 400 anaemia cases in a village. Find out the Mode.
Hb 5.0-5.5 5.5-6.0 6.0-6.5 6.5-7.0 7.0-7.5 7.5-8.0
No. of cases 48 89 102 80 57 24
Mode = L1 + {(f1 − f0) / (2f1 − f0 − f2)} × C. The modal class is the class with the highest frequency (102), i.e. 6.0-6.5. Then, Mode = 6.0 + {(102 − 89) / (2 × 102 − 89 − 80)} × 0.5 = 6.0 + (13/35) × 0.5 = 6.0 + 0.186 = 6.19 gm%
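A sketch of the same mode formula, assuming the modal class is an interior class (so that both neighbouring frequencies exist):

```python
freqs  = [48, 89, 102, 80, 57, 24]         # cases per Hb class
lowers = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5]    # lower class limits
C = 0.5                                    # class interval
i = freqs.index(max(freqs))                # modal class index (the 6.0-6.5 class)
f1, f0, f2 = freqs[i], freqs[i - 1], freqs[i + 1]
mode_hb = lowers[i] + (f1 - f0) / (2 * f1 - f0 - f2) * C
print(round(mode_hb, 2))  # 6.19
```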
MEASURES OF VARIATION (DISPERSION)
Objective: The main objective of the measures of variation / variability / dispersion is to describe how the individual observations are dispersed around the mean.
Significance:
It determines the reliability of an average. It helps to determine the nature and cause of variation in order to control the variability. It permits comparison of two or more distributions with regard to their variability. It is of great importance in advanced statistical analysis, and it quantifies the variation in a distribution.
1. Range: This is a crude measure of variation since it uses only two extreme values.
Definition: It is defined as the difference between highest and lowest value in a set of data.
Symbolically, Range can be given as Range = X max. - X min Range is useful in quality control of drug,
maximum and minimum temperature in a case of enteric fever etc.
2. Interquartile Range: The difference between the third and first quartiles. Symbolically, Q = Q3 − Q1, where Q1 = first quartile and Q3 = third quartile. The interquartile range is superior to the range as it is not based on the two extreme values but on the middle 50% of the observations.
3. Mean Deviation:
The ratio of the sum of the absolute deviations of the individual observations from the mean to the number of observations, i.e. Mean Deviation = Σ|X − X̄| / N. Although the mean deviation is a good measure of variability, its use is limited. It is used to measure and compare variability among several sets of data.
4. Standard Deviation (SD):
Definition: SD is the root mean square deviation, i.e. it is the square root of the mean of the squared deviations of the individual observations from the mean. It is generally denoted by σ (sigma). The greater/smaller the value of the SD, the greater/smaller the variation among the data.
6. Take the square root of the result of step 5. Thus, SD = √{Σ(X − X̄)² / (N − 1)}
If the data represent a small sample of size N from a population, then it can be shown that the sum of the squared differences should be divided by (N − 1) instead of by N. However, for large sample sizes there is very little difference between using N − 1 and N in computing the SD. The SD is directly proportional to the variation in the data, i.e. if the value of the SD is more/less, the variation is more/less. A larger number of observations gives a more stable estimate of the SD, so it is better that the investigator takes a sufficiently large number of observations in any research study.
Example: Find out value of SD for the following data showing DBP (mm of Hg) for 10 NIDDM patients:
70, 80, 94, 70, 58, 66, 78, 67, 82, 60.
SN X (X-X) (X-X)²
-----------------------------------------------------------------------------------------------------------------------------------
1. 70 - 2.5 6.25
2. 80 + 7.5 56.25
3. 94 + 21.5 462.25
4. 70 - 2.5 6.25
5. 58 - 14.5 210.25
6. 66 - 6.5 42.25
7. 78 + 5.5 30.25
8. 67 -5.5 30.25
9. 82 +9.5 90.25
10. 60 -12.5 156.25
----------------------------------------------------------------------------------
Σ X = 725 Σ(X-X)² = 1090.5
X = Σ X / N =725/10 = 72.5
SD = √{Σ(X − X̄)² / (N − 1)} = √(1090.5 / 9) = √121.17 = 11.01
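The same sample SD (with the N − 1 divisor) is what Python's statistics.stdev computes:

```python
from statistics import mean, stdev

# DBP (mm Hg) for the 10 NIDDM patients in the worked example
dbp = [70, 80, 94, 70, 58, 66, 78, 67, 82, 60]
print(mean(dbp))             # 72.5
print(round(stdev(dbp), 2))  # 11.01 (stdev divides by N-1, as above)
```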
Coefficient of Variation (CV):
Definition: It is the ratio of the standard deviation (SD) to the mean, expressed as a percentage (%).
i.e. CV = SD / Mean x 100
Uses of CV: 1. It is applied to know the variation in the data. 2. To find out consistency and reliability of data. 3.
To compare two different variables.
Example: In a distribution mean weight is 76.4 kg with a SD of 7.7 and Mean DBP is 98.8 mm of Hg with SD as
10.5. Which variable is more consistent?
Solution: CV for weight = SD/Mean × 100 = 7.7/76.4 × 100 = 10.08%. CV for DBP = SD/Mean × 100 = 10.5/98.8 × 100 = 10.63%. Since the CV for DBP is greater than the CV for weight (10.63% > 10.08%), weight shows less variation than DBP. Thus, weight is a more consistent variable than DBP.
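The comparison can be scripted directly from the definition CV = SD/Mean × 100:

```python
def cv(sd, mean):
    """Coefficient of variation, as a percentage."""
    return sd / mean * 100

cv_wt  = cv(7.7, 76.4)    # weight
cv_dbp = cv(10.5, 98.8)   # DBP
print(round(cv_wt, 2), round(cv_dbp, 2))       # 10.08 10.63
# the variable with the smaller CV is the more consistent one
print("weight" if cv_wt < cv_dbp else "DBP")   # weight
```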
Example: Find the SD and C.V. for the earlier pulse-rate distribution of 210 cases.
Solution:
Pulse Rate (X) No. of Cases (f) Mid Point (m) d = m−A d' = (m−A)/C fd' fd'²
65-70 15 67.5 -10 -2 -30 60
70-75 41 72.5 -5 -1 -41 41
75-80 58 77.5=A 0 0 0 0
80-85 47 82.5 +5 +1 47 47
85-90 32 87.5 +10 +2 64 128
90-95 17 92.5 +15 +3 51 153
Total N = ∑f = 210, ∑fd' = 91, ∑fd'² = 429
Mean = X̄ = A + (Σfd′ / N) × C = 77.5 + (91/210) × 5 = 77.5 + 2.167, i.e. Mean pulse rate = 79.67 / minute
S.D. = √{[Σfd′² − (Σfd′)²/N] / (N − 1)} × C = √{[429 − (91)²/210] / 209} × 5 = √{(429 − 8281/210) / 209} × 5
= √{(429 − 39.43) / 209} × 5 = √(389.57 / 209) × 5 = √1.8640 × 5 = 1.365 × 5. Thus, SD of pulse rate = 6.83 / minute
Coefficient of Variation = SD / Mean × 100 = 6.83 / 79.67 × 100, C.V. = 8.57%
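The whole step-deviation computation of the mean, SD and CV can be checked in one short script (table values as above; any difference from the worked figures is rounding only):

```python
from math import sqrt

mids  = [67.5, 72.5, 77.5, 82.5, 87.5, 92.5]
freqs = [15, 41, 58, 47, 32, 17]
A, C = 77.5, 5
N = sum(freqs)
d   = [(m - A) / C for m in mids]                # step deviations d'
fd  = sum(f * x for f, x in zip(freqs, d))       # Σfd'  = 91
fd2 = sum(f * x * x for f, x in zip(freqs, d))   # Σfd'² = 429
mean_pr = A + fd / N * C
sd_pr = sqrt((fd2 - fd ** 2 / N) / (N - 1)) * C
print(round(mean_pr, 2), round(sd_pr, 2))        # 79.67 6.83
print(round(sd_pr / mean_pr * 100, 2))           # CV: 8.57
```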
Normal Curve:
If a histogram (area diagram) of such a normally distributed variable is constructed and smoothed, the resulting curve is called the Normal curve.
Characteristics of Normal Curve:
(1) It is bell shaped. (2) It is symmetrical around the mean. (3) Mean, Median and Mode coincide. (4) It has two points of inflection. (5) The area under the curve is always equal to one. (6) It never touches the base line.
[Figure: the Normal Curve]
SAMPLING
Introduction: Population: the complete set of individuals under study (census or complete enumeration). Sample: a subset, part or group of a population. The process of inferring something about a large group of elements by studying only a part of it is called Sampling. Sampling refers to selecting a part of the population so that some inference about the population can be made by studying the sample. Sampling is most frequently used in surveys; the purpose of a sample survey is to obtain information about the population by selecting and observing a limited number of units.
Sampling and Representativeness
Developing a Sampling Plan, Define the Population of Interest, Identify a Sampling Frame (if possible)
Select a Sampling Method, Determine Sample Size, Execute the Sampling Plan
The population of interest is entirely dependent on the Management Problem, Research Problem, and Research Design. Some bases for defining a population: geographical area, demographic profile, lifestyle, and awareness.
Sampling unit : Subject under observation on which information is collected Example: Children <5 years,
hospital discharges, health events, etc.
Sampling fraction: Ratio between the sample size and the population size Example: 100 out of 2000 (5%)
Sampling frame: Any list of all the sampling units in the population, e.g. a list of households, health care units, etc.
Sampling scheme: Method of selecting sampling units from the sampling frame, e.g. randomly, convenience sample, etc.
OBJECTIVES
There are two important objectives of the sampling which are :
1. To estimate the population “Parameter” from sample “Statistic” and 2. Testing of hypothesis
Parameter: any characteristic or value obtained from a population is referred to as a Parameter. For example: population mean, median, mode, proportion, etc.
Statistic: any characteristic or value obtained from a sample is referred to as a Statistic. For example: sample mean, median, mode, proportion, etc.
Why Sampling? The following are various reasons which make sampling desirable:
1. Time taken for the study: Results from a sample can be obtained much faster than from a complete enumeration.
2. Cost involved in the study: Sampling also brings substantial cost reductions compared with studying the whole population.
3. Physical impossibility of complete enumeration: In many situations the elements being studied get destroyed while being tested.
4. Practical infeasibility of complete enumeration: Quite often it is practically infeasible to do a complete enumeration due to many practical difficulties.
5. Adequate reliability of inference based on sampling: In many cases sampling provides adequate information, so that little additional reliability would be gained with complete enumeration in spite of the additional money and time.
6. Quality of data collected: For a large population, complete enumeration also suffers from the possibility of unreliable data collected by the investigator.
Types of Sampling
There are two basic types depending on who or what is allowed for selection of the sample which are:
(1) Probability sampling (2) Non-probability sampling
1. Probability Sampling: In this type the decision whether a particular element is included in the sample or not is made by chance alone. Each element in the population has some known, non-zero probability of being included in the sample. It can be time consuming, but it is possible to quantify the magnitude of the likely error in the inferences made.
2. Non-Probability Sampling: This method does not ensure a non-zero probability of inclusion for each element of the population. Samples may be picked based on the judgment or convenience of the investigator, which can introduce biases into the study. Such a sampling design belongs to the non-probability category.
Probability Sampling Methods:
1. Simple random sampling (2). Systematic random sampling (3) Stratified random sampling (4) Multi-stage
random sampling (5) Cluster random sampling (6) Multiphase random sampling
1. Simple random sampling
Principle: This is a process which ensures that each of the sample size ‘n’ has an equal probability of being
selected as the chosen sample.
Procedure: Randomly draw units. This method requires a list of all members of the population; a table of random numbers may also be used to select the sample.
Advantages: Simple, and the sampling error is easily measured.
Disadvantages: Needs a complete list of units, does not always achieve the best representativeness, and units may be scattered.
Example: To evaluate the prevalence of tooth decay among the 1200 children attending a school: list the children attending the school, number them from 1 to 1200, fix the sample size at 100 children, and randomly sample 100 numbers between 1 and 1200.
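The tooth-decay example can be simulated with the standard library's random.sample, which draws without replacement so every child has an equal chance of selection (the seed is fixed only to make the sketch reproducible):

```python
import random

random.seed(42)                         # reproducible draw for illustration
children = range(1, 1201)               # roll numbers of the 1200 children
sample = random.sample(children, 100)   # simple random sample of 100
print(len(sample), len(set(sample)))    # 100 distinct children
print(min(sample) >= 1 and max(sample) <= 1200)  # True
```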
2. Systematic random sampling: A list of the population must be available. Units are selected in a systematic way, i.e. every 5th, 7th, 10th or 15th house forms the sample.
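A systematic sample can be sketched the same way: pick a random start within the first interval, then take every k-th unit, where k = N/n:

```python
import random

random.seed(7)                    # reproducible start for illustration
N, n = 1200, 100                  # population and sample sizes (hypothetical)
k = N // n                        # sampling interval = 12
start = random.randint(1, k)      # random start within the first interval
sample = list(range(start, N + 1, k))
print(k, len(sample))             # 12 100
```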
3. Stratified random sampling: The population is divided into homogeneous strata (groups) in proportion to the size of the population; simple random sampling is then applied within each stratum.
Advantages: More precise if variable associated with strata, all subgroups represented, allowing separate
conclusions about each of them
Disadvantages: Sampling error difficult to measure, Loss of precision if very small numbers
Example: Determine vaccination coverage in a country: One sample drawn in each region Estimates calculated
for each stratum Each stratum weighted to obtain estimate for country (average)
4. Multi-stage random sampling: The sample is selected in multiple stages, e.g. state – districts – blocks – villages, etc.
Example: sampling unit = household; 1st stage: drawing areas or blocks; 2nd stage: drawing buildings or houses; 3rd stage: drawing households.
5. Cluster random sampling
A list of population must be available, Cumulative population can be determined, Useful in evaluation of
Universal Immunization Programme (UIP)
Principle: Random sample of groups (“clusters”) of units, In selected clusters, all units or proportion (sample) of
units included
Advantages: Simple as complete list of sampling units within population not required, Less travel/resources
required
Disadvantages : Imprecise if clusters homogeneous and therefore sample variation greater than population
variation (large design effect), Sampling error difficult to measure
6. Multiphase random sampling: Can be used in hospital based set up studies to diagnose a disease.
Example: In case of TB: Sputum test +ve / -ve, Chest X-ray +ve / -ve etc.
Place of sampling in descriptive surveys
(1). Define objectives (2).Define resources available (3).Identify study population (4). Identify variables to
study (5). Define precision required (6).Establish plan of analysis (questionnaire) (7). Create sampling frame
(8).Select sample (9). Pilot data collection (10). Collect data (11). Analyse data (12). Communicate results
(13). Use results
Conclusion: Probability samples are the best
SAMPLE SIZE
Example: If the hookworm prevalence rate was 30%, taking the allowable error L as 10% of p (L = 3), the sample size will be:
n = 4 × p × q / L²
n = 4 × 30 × 70 / (3 × 3)
n = 8400 / 9
n = 933
If the hookworm prevalence rate was 16% (L = 1.6), the sample size will be:
n = 4 × p × q / L²
n = 4 × 16 × 84 / (1.6 × 1.6)
n = 5376 / 2.56
n = 2100
Thus, if the prevalence rate is small, the required sample size will be much larger. Therefore, before deciding the sample size it is necessary to know the prevalence rate of the disease in the hospital or community set-up, since no statistical method can compensate for a badly planned experiment or an inadequate sample size.
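The n = 4pq/L² rule of thumb above (with p and q in percent and L the allowable error) is easy to wrap in a small helper:

```python
def sample_size(p, L):
    """n = 4pq / L^2, with p and q = 100 - p expressed in percent."""
    q = 100 - p
    return 4 * p * q / L ** 2

print(round(sample_size(30, 3)))     # 933
print(round(sample_size(16, 1.6)))   # 2100
```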
MEASURES OF CHANCE VARIATION
{Sampling Error or Standard Error (S.E.)}
The difference or deviation between the value of the 'Statistic' of a particular sample and the corresponding population 'Parameter' is known as a measure of chance variation, i.e. the Standard Error or Sampling Error (S.E.). It is used to find out the variation among samples. The main objective of the measures of chance variation is to estimate the population parameter from the sample statistic.
Factors controlling S.E.: the sample size (n) and the standard deviation (SD) of the observations.
Measures of chance variation for quantitative data: It is called the Standard Error of the Sample Mean (i.e. S.E.x̄). The SE of the sample mean is defined as the ratio of the SD to the square root of the sample size, i.e. S.E.x̄ = SD / √n
Confidence Limits (Confidence Interval): To estimate the population mean, the SE of the sample mean can be used.
The 95% and 99% confidence limits for the population mean are as follows:
95%CI = Sample mean ± 2 SEx
99% CI = Sample mean ± 3 SEx
Measures of chance variation for qualitative data: It is called the Standard Error of the Sample Proportion, i.e. S.E.p. The S.E. of a sample proportion is defined as the square root of the product of the positive proportion (p) and the alternative proportion (q) divided by the sample size (n),
i.e. S.E.p = √(p × q / n)
Confidence Limits (Confidence Interval) :
To estimate the population proportion, the SE of the sample proportion can be used. The 95% and 99% confidence limits for the population proportion are as follows:
95%CI=Sample proportion±2 SEp
99%CI=Sample proportion±3 SEp
FOR QUANTITATIVE DATA (EXAMPLE):
1. In a random sample of 136 college students the mean Weight in Kg was observed to be 45.5 with SD 8.9. Find
out 95% and 99% confidence limits for population mean weight.
Solution:
Given that mean Wt.= 45.5kg and SD = 8.9kg , n=136. Now to estimate population mean weight first find out
SE of mean as follows: SEx = SD/√n= 8.9/√136 = 8.9/11.66 = 0.76
Then, 95% Confidence limits for population mean weight is as follows:
95%CI = Sample mean ± 2 SEx = 45.5± 2* 0.76 = 45.5± 1.52 = 43.98kg to 47.02 kg
Thus, the population mean weight will lie in the range 43.98 to 47.02 kg in 95% of all the cases.
Now, 99% Confidence limits for population mean weight is as follows:
99%CI = Sample mean ± 3 SEx = 45.5± 3* 0.76 = 45.5± 2.28 = 43.22kg to 47.78 kg
Thus, population mean weight will lie in the range 43.22 to 47.78 kg in 99% of all the cases.
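The weight example can be reproduced numerically (the ±2 and ±3 multipliers are the rounded normal values used in this chapter; the tiny differences from the worked limits come only from rounding the SE to 0.76):

```python
from math import sqrt

n, xbar, sd = 136, 45.5, 8.9
se = sd / sqrt(n)                          # ≈ 0.76
lo95, hi95 = xbar - 2 * se, xbar + 2 * se
lo99, hi99 = xbar - 3 * se, xbar + 3 * se
print(round(se, 2))                        # 0.76
print(round(lo95, 2), round(hi95, 2))      # ≈ 43.97 47.03
print(round(lo99, 2), round(hi99, 2))      # ≈ 43.21 47.79
```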
Qualitative data (Example):
In a random sample of 150 school going children in an area, 20% were suffering from malaria. Find out the 95% and 99% confidence limits for the population proportion of malaria.
Given that n = 150, p = proportion of malaria = 20%, and q = 100 − 20 = 80%.
Then, S.E.p = √(p × q / n) = √(20 × 80 / 150) = √10.67 = 3.27
95% CI = p ± 2 S.E.p = 20 ± 2 × 3.27 = 20 ± 6.53, i.e. 13.47% to 26.53%
99% CI = p ± 3 S.E.p = 20 ± 3 × 3.27 = 20 ± 9.80, i.e. 10.20% to 29.80%
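The malaria example follows the same pattern with S.E.p = √(pq/n):

```python
from math import sqrt

n, p = 150, 20
q = 100 - p
se = sqrt(p * q / n)                               # ≈ 3.27
print(round(se, 2))
print(round(p - 2 * se, 2), round(p + 2 * se, 2))  # 95% CI ≈ 13.47 to 26.53
print(round(p - 3 * se, 2), round(p + 3 * se, 2))  # 99% CI ≈ 10.2 to 29.8
```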
TESTING OF HYPOTHESIS AND TESTS OF SIGNIFICANCE
HYPOTHESIS TESTING
INTRODUCTION
Hypothesis testing also referred to as 'Statistical decision-making' is an important aspect of research. Quite often,
in real life situations, we need to take decision about the population based on the information about the sample.
Very simply, hypothesis testing enables us to make probability statements about population parameter(s). A
hypothesis may not be proved absolutely, but it is accepted if it stands the test of critical objective analysis.
WHAT IS A HYPOTHESIS?
A hypothesis, in plain terms, is a tentative solution or answer to the research problem, which the researcher has
to test based on the available body of knowledge, or on knowledge that can be known. It is merely an
assumption or some supposition to be proved or disproved.
A hypothesis may be defined as a proposition or a set of propositions set forth as an explanation for the
occurrence of some specified group of phenomena either asserted merely as a provisional conjecture to guide
some investigation or accepted as highly probable in the light of established facts. Webster's New International
Dictionary of English Language defines the terms as 'a proposition, condition or principle which is assumed,
perhaps without belief, in order to draw out its logical consequences and by this method to test its accord with
facts which are known or may be determined'. Quite often a research hypothesis is a predictive statement that can be tested scientifically and relates an independent variable to some dependent variable.
USES OF A HYPOTHESIS
Hypothesis is a principal instrument in research. Its primary function is to suggest new experiments and
observations. Many experiments have hypothesis testing as their objective. Quite clearly hypothesis is a useful
aid to every researcher. If hypothesis is not formulated, even implicitly, the researcher cannot effectively
proceed with problem investigation. In the absence of such hypothesis, the researcher has little clue about what
to look for and in what specific order, during the data collection phase. In the light of a well-defined hypothesis,
the researcher can assess the relevance and usability of any data that he comes across. Lundberg brings forth the
value of hypothesis in the following words – 'The only difference between gathering data without a hypothesis
and gathering the data with a hypothesis is that in the latter case, we deliberately recognize the limitations of our
senses and attempt to reduce their fallibility by limiting our field of investigation so as to prevent greater
concentration of attention of particular aspects which past experience leads us to believe are insignificant for our
purpose'.
Thus hypothesis enables collecting relevant data and organizing them effectively. It prevents a blind search and
indiscriminate gathering of data which may later prove irrelevant to the problem under study.
Research can begin with a well formulated hypothesis or it may come out with a hypothesis as its end product.
Hypothesis is not given readymade to the researcher, it has to be formulated. What then are the essential
characteristics of a good hypothesis?
A sound hypothesis is generally a simple one. Simple, however, does not mean obvious. The more insight the researcher has about the problem, the simpler will be his hypothesis. This simplicity is termed 'Occam's Razor', after the English philosopher William of Occam, who remarked, '...... neither more, nor more onerous, causes are to be assumed than are necessary to account for the phenomena'.
A hypothesis must be clear and precise to allow reliable inferences.
A hypothesis must be capable of being tested. Science does not admit anything as valid knowledge until a satisfactory test of its validity has been completed. Very exacting proof and measurement are demanded, often by two or more persons, or by retest, of a hypothesis. A hypothesis is testable if other deductions can be made from it which, in turn, can be confirmed or disproved by observation.
Hypothesis should be focused in scope and be specific. Narrower hypothesis is more amenable to
testing.
Hypothesis must be in line with a substantial body of established facts.
Hypothesis must explain the facts that gave rise to the need for explanation. It must actually explain
what it claims to explain, it should have empirical reference.
TYPES OF HYPOTHESIS
Hypothesis can be classified in many ways, but classification based on the basis of their level of abstraction is
considered useful.
Good and Hatt classify hypothesis based on the levels of abstraction, into three categories:
At the lowest level of abstraction are hypotheses that state the existence of certain empirical uniformities, e.g. experienced graduates are likely to be better managers than freshers after completing their M.B.A. This type of hypothesis seems to invite scientific verification of common sense propositions.
Hypotheses that deal with 'complex ideal types' are next in the hierarchy. They go beyond the level of anticipating a simple empirical uniformity to purposeful distortions of empirical exactness. They aim at testing whether logically derived relationships between empirical uniformities hold. Their function is to create analytical tools, and they are termed 'ideal types' because they are removed from empirical reality.
Hypotheses at the highest level of abstraction are concerned with the formulation of a relation amongst analytic variables. They state possible variations or changes in a dependent variable when the independent variable varies in a certain fashion. They explain how one variable affects another.
Students must not have the misconception that any type of hypothesis is superior to the others, as each hypothesis has its own importance and utility. The higher level hypotheses are built on the lower level ones.
SOURCES OF HYPOTHESIS
Testing of Hypothesis:
Definition: A statement or pre-assumption about a population parameter or population distribution is called a hypothesis. If the population is large, there is no way of analyzing the population or testing the hypothesis directly. Instead, the hypothesis is tested on the basis of a random sample.
Types of hypothesis: Null hypothesis (H0) and Alternative hypothesis (H1)
Null hypothesis (H0): A hypothesis that there is no significant difference between population values and sample values, or between values from sample to sample, is called the Null hypothesis (H0).
Alternative hypothesis (H1): A complementary statement to the null hypothesis, i.e. a hypothesis that there is a significant difference between population values and sample values, or between values from sample to sample, is called the Alternative hypothesis (H1).
Type I and type II error: The conclusions of any research study depend on the evidence provided by a sample. Variation from one sample to another can never be eliminated unless the sample is as large as the population itself. It is therefore possible that the conclusions drawn are incorrect, which leads to error. There can be two types of error, as follows:
If we wrongly reject H0, when in reality H0 is true - the error is called as Type I error. Similarly, when we
wrongly accept H0 when H0 is false- the error is called as Type II error. Both these errors are bad and should
be reduced to minimum. However, they can be completely eliminated only when the full population is
examined- in this case there would be no practical utility of the testing procedure. In all testing of hypothesis
procedures, it is assumed that a type I error is more severe than a type II error and so needs to be controlled.
THE SIGNIFICANCE LEVEL: In all testing of hypothesis procedures it is assumed that a type I error is more serious, so the probability of a type I error needs to be explicitly controlled. This is done by specifying a significance level at which the test is conducted. The significance level thus sets a limit on the probability of a type I error, and test procedures are designed so as to get the lowest probability of a type II error. Two significance levels are commonly used, 5% and 1%: 5% is called the critical level of significance and 1% the higher level of significance. A test of hypothesis is designed for a significance level, and at the end of the test we can reject the null hypothesis at the 5% or 1% level of significance; this is expressed through the 'p value'. The 'p' value of a test is the probability of observing a sample statistic as extreme as the one observed if the null hypothesis is true.
TESTS OF SIGNIFICANCE: The term statistical significance is often encountered in scientific literature, and yet its meaning is still widely misunderstood. Determination of statistical significance is made by the application of a procedure called a test of significance. Tests of significance are useful for interpreting comparison results. For example, suppose that a clinician finds that in a small series of patients the mean response to treatment is greater for drug A than for drug B. Obviously the clinician would like to know whether the observed difference in this small series of patients will hold up for a population of such patients. In other words, he wants to know whether the observed difference is more than merely 'sampling error'. This assessment can be made with a statistical test of significance. To decide whether the null hypothesis is to be accepted or rejected, a test statistic is computed and compared with a critical value obtained from a set of statistical tables. When the test statistic exceeds the critical value, the null hypothesis is rejected and the difference is declared statistically significant. Any decision to reject the null hypothesis carries with it a certain risk of being wrong. This risk is called the significance level of the test.
1. Z-test for difference between two sample means:
Application: To find out the Standard Error of the difference between two sample means, i.e. S.E.(X̄1 − X̄2), e.g. to find out a significant difference between two different variables/groups, such as the efficacy of two drugs, the difference between two groups, etc.
Criteria:
Data must be quantitative. Data must be large (i.e. n > 30), and random samples must be selected from a normal population.
Steps involving in Test:
1. State the null hypothesis i.e. H0 and its alternative hypothesis i.e. H1
2. Find out the values of test statistic i.e. value of 'Z' as follows:
Z = (X̄1 − X̄2) / SE(X̄1 − X̄2), where SE(X̄1 − X̄2) = √[(SD1)²/n1 + (SD2)²/n2]
3. Determine the probability, i.e. the 'p' value, as follows:
If the calculated value of Z < 1.96 (the table value of Z at the 5% level of significance), then accept the null hypothesis H0 (i.e. p > 0.05). If the calculated value of Z > 1.96 (or 2.58), the table value of Z at the 5% (1%) level of significance, then reject the null hypothesis H0 (i.e. p < 0.05 or p < 0.01).
4. Thus, there is either no significant difference (p > 0.05), a significant difference (p < 0.05) or a highly significant difference (p < 0.01) between the two groups under study.
The required data for Z test will be
n1 n2
X1 X2
SD1 SD2
Example: In one area, a random sample of 500 persons had a mean Hb (gm%) of 9.8 with an SD of 1.5. In
another area, a random sample of 400 persons had a mean Hb of 8.6 with an SD of 1.9. Test whether there is any
significant difference between the mean Hb levels of the two groups.
Solution:
Given that, n1= 500 n2 = 400 X1 = 9.8 X2 = 8.6 SD1 = 1.5 SD2=1.9
The data are quantitative and large, so the Z test applies. H0: There is no significant difference
between the mean Hb levels of the two groups.
SE(X1 - X2) = √((SD1)²/n1 + (SD2)²/n2) = √((1.5)²/500 + (1.9)²/400) = √(2.25/500 + 3.61/400)
= √(0.0045 + 0.009025) = √0.013525 = 0.116. Then Z = (9.8 - 8.6)/0.116 = 10.345
Since the calculated value of Z > 2.58, the table value at the 1% level of significance, reject H0.
Thus, there is a highly significant difference between the mean Hb levels of the two groups (p < 0.01).
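As a check, the steps above can be sketched in plain Python (the helper name `z_test_two_means` is ours):

```python
import math

def z_test_two_means(x1, x2, sd1, sd2, n1, n2):
    """Z test for the difference between two large-sample means."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # SE(X1 - X2)
    return (x1 - x2) / se

# haemoglobin example from the text
z = z_test_two_means(9.8, 8.6, 1.5, 1.9, 500, 400)
print(round(z, 2))  # about 10.32; > 2.58, so reject H0 at the 1% level
```

The text gets 10.345 because it rounds the SE to 0.116 before dividing; the conclusion is the same either way.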
2. Z-test for difference between two sample proportions
Application: To find out a significant difference between two sample proportions, using SE(p1 - p2).
Criteria: Data must be qualitative, data must be large (i.e. n > 30), and the samples must be random samples from a normal population.
Required data: p = positive proportion, q = alternative proportion (100 - p)
n1 n2
p1 p2
q1 q2
Steps involved in the test:
1. State the null hypothesis H0 and its alternative hypothesis H1.
2. Apply the Z test as follows: Z = (p1 - p2) / SE(p1 - p2), where SE(p1 - p2) = √(p1q1/n1 + p2q2/n2)
3. Determine the probability, i.e. the 'p' value, as follows: If the calculated value of Z < 1.96 (the table
value of Z at the 5% level of significance), accept the null hypothesis H0 (i.e. p > 0.05). If the calculated value
of Z > 1.96 (2.58), the table value of Z at the 5% (1%) level of significance, reject the null hypothesis H0 (i.e.
p < 0.05 or p < 0.01). Conclude accordingly: the difference between the two groups under study is not significant
(p > 0.05), significant (p < 0.05), or highly significant (p < 0.01).
Example: In School A, out of 900 students, 3% showed vitamin A deficiency. In school B, out of 700 students,
5% showed vitamin A deficiency. Test the significance by applying suitable test
Solution:
given that n1 = 900, n2=700
p1 = 3% p2= 5%
q1= 97% q2 = 95%
Null hypothesis (H0): There is no significant difference between the proportions of vitamin 'A' deficiency in the
two schools.
Z = (p1 - p2) / SE(p1 - p2), where SE(p1 - p2) = √(p1q1/n1 + p2q2/n2) = √(3×97/900 + 5×95/700)
= √(0.3233 + 0.6786) = 1.00
Now Z = (5 - 3)/1 = 2. Since Z > 1.96, the table value at the 5% level of significance, reject the null hypothesis.
Thus there is a significant difference between the proportions of vitamin A deficiency in the two schools (p < 0.05).
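The same arithmetic can be sketched in Python (the function name is ours; proportions are kept in percent, as in the text):

```python
import math

def z_test_two_proportions(p1, p2, n1, n2):
    """Z test for the difference between two sample proportions given in percent."""
    q1, q2 = 100 - p1, 100 - p2
    se = math.sqrt(p1 * q1 / n1 + p2 * q2 / n2)  # SE(p1 - p2), in percentage points
    return abs(p1 - p2) / se

# vitamin A deficiency example: 3% of 900 students vs 5% of 700 students
z = z_test_two_proportions(3, 5, 900, 700)
print(round(z, 2))  # about 2.0; > 1.96, so significant at the 5% level
```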
3. Paired 't' test:
Application: To test the significance of the difference between paired (before and after) observations on the same
subjects, e.g. the effect of a drug or treatment.
Criteria: Data must be quantitative and small (n < 30), and the observations must be paired (related) samples.
Steps involved in the test:
1. State the null hypothesis H0 and its alternative hypothesis H1.
2. Find the value of the test statistic 't' as: t = x / S.E.x
Where x = mean of the differences between before and after observations, and S.E.x = Standard Error of that mean
difference, determined as S.E.x = SD/√n,
where SD is the Standard Deviation of the differences between before and after observations and n = number of pairs
of observations.
The required data for the paired t test are: n, X1 = before data, X2 = after data.
3. To determine the probability or 'p' value, first find the degrees of freedom (d.f.): d.f. = n - 1.
Find the table value of 't' at the 5% and 1% levels of significance at n - 1 d.f. for acceptance or rejection of the
null hypothesis:
If the calculated value of 't' < the table value of 't' at the 5% or 1% level of significance at n - 1 d.f., accept
the null hypothesis (H0) {reject H1}, i.e. not significant (p > 0.05). If the calculated value of 't' > the table
value of 't' at the 5% or 1% level of significance at n - 1 d.f., reject the null hypothesis (H0) {accept H1}, i.e.
significant (p < 0.05) or highly significant (p < 0.01).
Example:
The following data show the effect of a drug on the weight (kg) of 7 TB patients:
Calculation table for paired 't' test:
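The original calculation table did not survive here, so as an illustration the paired t steps can be sketched with
hypothetical before/after weights (the data and the function name below are ours, not the original table):

```python
import math

def paired_t(before, after):
    """Paired t test: t = mean(d) / (SD(d)/sqrt(n)), where d = after - before."""
    n = len(before)
    d = [a - b for a, b in zip(after, before)]
    mean_d = sum(d) / n
    # sample SD of the differences (n - 1 in the denominator)
    sd = math.sqrt(sum((x - mean_d) ** 2 for x in d) / (n - 1))
    t = mean_d / (sd / math.sqrt(n))
    return t, n - 1  # t statistic and degrees of freedom

# hypothetical before/after weights (kg) of 7 patients
before = [40, 42, 38, 45, 41, 39, 43]
after  = [42, 45, 39, 48, 44, 40, 46]
t, df = paired_t(before, after)
# t is about 6.36 with 6 d.f.; the table t at 5% for 6 d.f. is 2.447, so this change would be significant
```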
4. Unpaired 't' test: To compare the means of two small, independent samples, using
t = (X1 - X2) / S.E.(X1 - X2), where S.E.(X1 - X2) = √(SD1²/n1 + SD2²/n2)
and SD1 and SD2 are the Standard Deviations of the two different samples of sizes n1 and n2 respectively.
Example: The data are quantitative and small, and two different groups are given, so apply the unpaired 't' test as
follows:
Here, n1 = 15, n2 = 17
X1 = 4.2, X2 = 2.3
SD1 = 0.8, SD2 = 0.5
Null hypothesis (H0): There is no significant difference between the mean Apgar scores of newborns of normal
mothers and of high-risk mothers. Applying the unpaired 't' test:
S.E.(X1 - X2) = √(SD1²/n1 + SD2²/n2) = √((0.8)²/15 + (0.5)²/17) = √(0.64/15 + 0.25/17)
= √(0.0427 + 0.0147) = √0.0574 = 0.239
t = (X1 - X2) / S.E.(X1 - X2) = (4.2 - 2.3)/0.239 = 1.9/0.239 = 7.95
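A sketch of the same calculation (the function name is ours; the simple SE formula given here is used, though
textbooks often prefer a pooled-variance SE for small samples):

```python
import math

def unpaired_t(x1, x2, sd1, sd2, n1, n2):
    """Unpaired t test from summary statistics, with the simple SE formula used in the text."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # S.E.(X1 - X2)
    return (x1 - x2) / se, n1 + n2 - 2             # t statistic and degrees of freedom

# Apgar score example: normal vs high-risk mothers
t, df = unpaired_t(4.2, 2.3, 0.8, 0.5, 15, 17)
# t is about 7.93 with the unrounded SE, far above the 5% table value of about 2.04 at 30 d.f.
```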
5. Chi-square (χ²) test:
Criteria: Data must be qualitative, data must be large (n > 30), the expected frequency in any cell should not be
less than 5, and the samples must be random samples selected from a normal population.
Here R1, R2 and C1, C2 are the rows and columns respectively. Their corresponding totals are R1T, R2T, C1T,
C2T, and GT = grand total.
O1, O2, O3 and O4 are the observed values (frequencies) or actual values, and E1, E2, E3 and E4 are the expected
values (frequencies).
Steps involved in the test:
1. State the null hypothesis (H0) and its alternative hypothesis (H1).
2. Find the value of the test statistic χ² as follows:
χ² = Σ{(O-E)²/E} = (O1-E1)²/E1 + (O2-E2)²/E2 + (O3-E3)²/E3 + (O4-E4)²/E4
Where E1, E2, E3 and E4 are the expected values for the observed values O1, O2, O3 and O4 in each cell.
The values of E1, E2, E3 and E4 are calculated as follows:
E1 = R1T × C1T / GT
E2 = R1T × C2T / GT
E3 = R2T × C1T / GT
E4 = R2T × C2T / GT
3. To determine the probability values of chi-square at the 5% and 1% levels of significance, first find the degrees
of freedom (d.f.) as follows:
d.f. = (R - 1) × (C - 1), where R = number of rows and C = number of columns.
4. Then, to accept or reject the null hypothesis, find the table values of chi-square at the 5% and 1% levels of
significance and compare them with the calculated value of chi-square.
Example: A study of vaccination against measles was conducted in a village. Out of 500 vaccinated persons, 14
showed attacks of measles, and out of 400 not vaccinated, 27 showed attacks of measles. Apply a suitable test to
find out whether there is any significant association between vaccination and the attack rate of measles.
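The worked solution for this example is not reproduced above, so as a sketch, the 2×2 chi-square computation (with
cells taken as attacked/not attacked by vaccination status, and a helper name of our choosing) is:

```python
def chi_square_2x2(table):
    """Chi-square statistic for a 2x2 contingency table [[O1, O2], [O3, O4]]."""
    (o1, o2), (o3, o4) = table
    r1, r2 = o1 + o2, o3 + o4          # row totals R1T, R2T
    c1, c2 = o1 + o3, o2 + o4          # column totals C1T, C2T
    gt = r1 + r2                       # grand total GT
    chi2 = 0.0
    for o, rt, ct in [(o1, r1, c1), (o2, r1, c2), (o3, r2, c1), (o4, r2, c2)]:
        e = rt * ct / gt               # expected frequency E = RT x CT / GT
        chi2 += (o - e) ** 2 / e
    return chi2

# measles example: rows = vaccinated / not vaccinated, columns = attacked / not attacked
chi2 = chi_square_2x2([[14, 486], [27, 373]])
print(round(chi2, 2))  # about 7.97; > 6.63 at 1 d.f., so the association is significant at the 1% level
```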
NON-PARAMETRIC TESTS IN MEDICAL RESEARCH
Non-parametric tests, or distribution-free methods, are applicable to all types of data: qualitative data (nominal
scaling), data in rank form (ordinal scaling), and data that have been measured more precisely (interval or
ratio scaling). Many non-parametric tests make it possible to work with very small samples. This is particularly
helpful to the medical researcher collecting pilot study data, especially when working with a rare disease. A large
number of non-parametric tests exist, but only a few of the better known and more widely used ones are discussed
here, with their applications in medical research and examples.
1. Sign Test: (for two related samples)
Example: A new drug 'A' was provided to 10 villages. The numbers of deaths due to malaria in these villages before
and after the provision of drug 'A' were observed as follows:
Village 1 2 3 4 5 6 7 8 9 10
-------------------------------------------------------------------------------------------------------------
Deaths prior to provide drug A 13 15 12 13 13 13 11 14 13 10
Deaths after drug A 11 12 10 12 10 9 8 12 15 14
Question: Can it be said that drug A has reduced significantly the deaths due to malaria in 10 villages?
Solution: In this non parametric test, we are interested in finding out the sign of the difference that occur in
number of deaths due to malaria before and after drug A.
By using data given, we can see that the difference can be assigned signs as:
13-11=2(+), 15-12=3(+), 12-10=2(+), 13-12=1(+), 13-10=3(+), 13-9=4(+), 11-8=3(+), 14-12=2(+),
13-15=-2(-), 10-14=-4(-)
We thus obtained 8 plus signs and 2 minus signs. If providing drug A had no effect, we should expect 5 plus and
5 minus signs, i.e. p = 0.5 and q = 0.5 in the Binomial distribution.
Now we must answer the question whether getting 8 plus signs out of 10 could occur by chance at the 0.05 level of
significance. For n = 10 and p = 0.5, the probability of getting 8 plus signs (reduction in
deaths) is
P = {n! / r!(n-r)!} × p^r × q^(n-r)  (Binomial distribution)
Where n = 10, r = 8, p = 0.5 and q = 0.5. Thus, P = {10! / 8! 2!} × (0.5)^8 × (0.5)^2 = 45 × (0.5)^10 = 45 × 0.0009766
= 0.044
This probability is less than 0.05, and hence we can say that after providing drug 'A' there was a significant
reduction in the deaths due to malaria in the 10 villages.
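The binomial calculation can be checked in a couple of lines (the function name is ours; note that a stricter sign
test would use the tail probability P(8 or more plus signs) rather than the point probability used here):

```python
from math import comb

def sign_test_point_p(n_plus, n):
    """Probability of exactly n_plus plus signs in n trials with p = q = 0.5."""
    return comb(n, n_plus) * 0.5 ** n

p = sign_test_point_p(8, 10)
print(round(p, 3))  # 45 * (0.5)**10 = 0.044, as in the text
```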
2. Wilcoxon Signed Rank Test:(Ordinal level of measurement for two related samples)
Example: A child psychologist wished to test whether nursery school attendance has any effect on children's
social perceptiveness. He obtained 8 pairs of identical twins. At random, one twin from each pair was assigned to
attend nursery school, while the other twin in each pair remained out of school.
Question: Can we comment on the difference in social perceptiveness between home and nursery school
children?
Solution: First we calculate the difference in social perceptiveness for each pair, and next we rank these
differences from 1 to 8, giving the highest rank to the largest absolute difference. Then we attach the sign of
each difference to its rank, as follows:
Pair Twin in school Twin at home Difference Rank of diff. Rank with less frequent sign
---------------------------------------------------------------------------------------------------------------------
1. 82 63 10 7
2. 69 42 27 8
3. 73 74 -1 -1 1
4. 43 37 6 4
5. 58 51 7 5
6. 56 43 13 6
7. 76 80 -4 -3 3
8. 85 82 3 2
T=4
----------------------------------------------------------------------------------------------------------------------
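The statistic T (the sum of the ranks with the less frequent sign) can be recomputed directly; the function below is
our own sketch and does not handle tied absolute differences (none occur in these data):

```python
def wilcoxon_T(x, y):
    """Wilcoxon signed-rank statistic: rank |d|, then T = smaller of the signed rank sums."""
    d = [a - b for a, b in zip(x, y) if a != b]     # zero differences are dropped
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0] * len(d)
    for r, i in enumerate(order, start=1):          # rank 1 = smallest |d|
        ranks[i] = r
    pos = sum(r for r, v in zip(ranks, d) if v > 0)
    neg = sum(r for r, v in zip(ranks, d) if v < 0)
    return min(pos, neg)

school = [82, 69, 73, 43, 58, 56, 76, 85]
home   = [63, 42, 74, 37, 51, 43, 80, 82]
print(wilcoxon_T(school, home))  # T = 4, matching the table
```

T is then compared with the critical value of T for n = 8 in a Wilcoxon table to reach a decision.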
3. Mann-Whitney U test:
(Ordinal level of measurement for independent samples)
The Mann-Whitney U test may be used to test whether two independent groups have been drawn from the same
population. It is one of the most powerful of the non-parametric tests, and a useful alternative to the 't' test
when the investigator wishes to avoid the t test's assumptions, or when the measurement in the research is weaker
than interval scaling.
Example: David and Jackson studied whether rats would generalize learned imitation when placed under a new
drive and in a new situation. Five rats were trained to imitate leader rats in a T-maze: they were trained to
follow the leaders when hungry in order to obtain a food incentive. The 5 rats were then transferred to a
shock-avoidance situation, where imitation of the leader rats would enable them to avoid an electric shock. Their
performance was compared with that of 4 controls that had no previous training to follow a leader. The
comparison is in terms of how many trials each rat took to reach a criterion of 10 correct responses in 10 trials.
The numbers of trials to criterion required by the Experimental (E) and Control(C) rats are as follows:
E rats: 78 64 75 45 82
C rats: 110 70 53 51
Solution: We arrange these scores in the order of their size, retaining the identity of each:
45 51 53 64 70 75 78 82 110
E C C E C E E E C
Then obtain U by the following formula
U = n1n2 + {n1 (n1+1) / 2} –R1
Where, n1 and n2 are sample sizes and R1 is the sum of the ranks assigned to the values of the first sample.
i.e. R1 = 26
Thus U = 9. From the table of probabilities for U with n1 = 5 and n2 = 4, the probability associated with U = 9 is
0.243. Since 0.243 > 0.05, the null hypothesis cannot be rejected: there is no significant evidence that rats
previously trained to follow a leader to a food incentive reach the criterion faster in the shock-avoidance
situation.
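The U computation above can be sketched as follows (our own helper; it assumes no tied scores, which holds for
these data):

```python
def mann_whitney_U(sample1, sample2):
    """U = n1*n2 + n1(n1+1)/2 - R1, where R1 = rank sum of sample1 in the pooled ordering."""
    pooled = sorted(sample1 + sample2)
    # rank = 1-based position in the pooled ordering (no ties in this example)
    r1 = sum(pooled.index(v) + 1 for v in sample1)
    n1, n2 = len(sample1), len(sample2)
    return n1 * n2 + n1 * (n1 + 1) // 2 - r1

E = [78, 64, 75, 45, 82]   # experimental rats
C = [110, 70, 53, 51]      # control rats
print(mann_whitney_U(E, C))  # U = 9, as in the text
```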
4. One-Sample Run Test: (Ordinal level of measurement for the one-sample case)
This test is based on the order or sequence in which the individual scores or observations were originally
obtained. The technique presented here is based on the number of runs which a sample exhibits, a run
being defined as a succession of identical symbols.
Example: In a study of the dynamics of aggression in young children, the investigator observed pairs of
children in a controlled play situation. 12 children who played together daily were observed. The median of
this set of scores, taken in the order in which they occurred, is 24.5. The following aggression scores,
expressed as pluses and minuses about the median, were observed:
Child: 1 2 3 4 5 6 7 8 9 10 11 12
Score: 31 23 21 43 51 22 12 26 43 75 2 3
Position: + - - + + - - + + + - -
(score w.r.t median)
All scores falling below the median are designated minus, and all above it plus. Thus r = 6 runs occurred in this
series. Reference to the table of critical values of 'r' in the run test, with r = 6, n1 = 6 and n2 = 6, shows
that r does not fall in the region of rejection; the decision, therefore, is that the sample scores occurred in
random order.
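Counting the runs can be automated with a short sketch (the helper name is ours; scores equal to the median are
dropped, as is usual for this test):

```python
def count_runs(scores, median):
    """Count runs of +/- signs of scores about the median; returns (runs, n_plus, n_minus)."""
    signs = ['+' if s > median else '-' for s in scores if s != median]
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    return runs, signs.count('+'), signs.count('-')

# aggression scores of the 12 children, in the order observed; median = 24.5
scores = [31, 23, 21, 43, 51, 22, 12, 26, 43, 75, 2, 3]
r, n1, n2 = count_runs(scores, 24.5)
print(r, n1, n2)  # 6 6 6, matching the text
```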
5. Spearman Rank Correlation: (Ordinal level of measurement)
This test is applied for rank correlation. It is a measure of association which requires that both variables
be measured on at least an ordinal scale, so that the objects or individuals under study may be ranked in two
ordered series.
Example:
Patient No. : 1 2 3 4 5 6 7
Weight (Kg) (X) : 54 68 57 49 52 65 74
Systolic Blood Pressure (mm of Hg) (Y): 120 124 128 122 130 134 140
Rank of X: 3 6 4 1 2 5 7
Rank of Y: 1 3 4 2 5 6 7
Difference of ranks (d) : +2 +3 0 -1 -3 -1 0
Squares of diff. (d²) : 4 9 0 1 9 1 0
Then the value of Spearman's Rank Correlation coefficient can be calculated as follows:
rs = 1 - {6Σd²/(n³ - n)} = 1 - {6×24/(7×7×7 - 7)} = 1 - {144/336} = 0.571
With d.f. = n - 2 = 5, the table value of t at the 5% level is 2.015.
The calculated value of t is less than the table value at the 5% level of significance, so there is no significant
correlation between weight and systolic blood pressure for these 7 patients (p > 0.05).
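Recomputing from the ranks is a useful check: the first rank difference is 3 - 1 = +2, so Σd² = 24 and rs is about
0.57; the conclusion is unchanged, since the corresponding t is still below the 5% table value of 2.015 at 5 d.f.
A sketch (function name is ours):

```python
import math

def spearman_rs(rank_x, rank_y):
    """Spearman's rank correlation: rs = 1 - 6*sum(d^2) / (n^3 - n)."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

rx = [3, 6, 4, 1, 2, 5, 7]  # ranks of weight
ry = [1, 3, 4, 2, 5, 6, 7]  # ranks of systolic BP
rs = spearman_rs(rx, ry)                            # 1 - 6*24/336, about 0.571
t = rs * math.sqrt((len(rx) - 2) / (1 - rs ** 2))   # t statistic for rs, d.f. = n - 2
```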
The tests above can be summarized by level of measurement and type of sample:
Level of      One-sample case       Two related samples        Two independent samples     Correlation
measurement
------------------------------------------------------------------------------------------------------
Ordinal       One-sample run test   Sign test;                 Mann-Whitney U test;        Spearman rank
                                    Wilcoxon matched-pairs     Median test;                correlation
                                    signed-ranks test          Kolmogorov-Smirnov test
Interval      ---------             Walsh test                 ---------                   ---------
CORRELATION AND REGRESSION ANALYSIS
Introduction: Correlation measures the degree of relationship between variables. When the relationship is of a
quantitative nature, the appropriate statistical tool for discovering and measuring it, and expressing it in a
brief formula, is known as correlation. Correlation is a statistical device which helps us analyze the covariance
of two or more variables.
TYPES OF CORRELATION: positive or negative; simple, partial and multiple; linear and non-linear.
Whether correlation is positive (direct) or negative (inverse) depends on the direction of change of the
variables. If both variables vary in the same direction, i.e. if one variable increases (decreases) the other also
increases (decreases), the correlation is positive. If one variable increases (decreases) while the other
decreases (increases), the correlation is negative.
METHODS OF CORRELATION
1. SCATTER DIAGRAM (DOT DIAGRAM)
2. KARL PEARSON'S COEFFICIENT OF CORRELATION
3. SPEARMAN'S RANK CORRELATION COEFFICIENT
Solution: Given that n = 7, X = weight and Y = Hb. The formula for Karl Pearson's correlation coefficient is
r = (ΣXY - N X Y) / [√{ΣX² - N(X)²} √{ΣY² - N(Y)²}], where X and Y denote the means of X and Y.
-----------------------------------------------------------------------------------------------------------
Weight (kg) (X) Hb (gm%) (Y) X² Y² XY
----------------------------------------------------------------------------------------------------------
53 12 2809 144 636
49 10 2401 100 490
54 11 2916 121 594
43 9 1849 81 387
45 12 2025 144 540
55 13 3025 169 715
44 10 1936 100 440
------------------------------------------------------------------------------------------------------------
ΣX= 343 ΣY=77 ΣX² =16961 ΣY² = 859 ΣXY=3802
X = ΣX/N =49 Y = ΣY/N= 11
Putting the values in the formula as follows:
r = Σ XY – N X Y / √ {ΣX²-N(X)²} √ {ΣY²-N(Y)²}
r = (3802 - 7×49×11) / [√{16961 - 7×(49)²} √{859 - 7×(11)²}] = (3802 - 3773) / [√(16961 - 16807) √(859 - 847)]
r = 29 / (√154 × √12) = 29 / √1848 = 29 / 42.99 = 0.67
Thus the correlation between weight and Hb level is positive.
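The same result follows from the raw data with a short sketch (the helper name is ours):

```python
import math

def pearson_r(x, y):
    """Karl Pearson's r: (SXY - N*mx*my) / sqrt((SXX - N*mx^2) * (SYY - N*my^2))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum(a * b for a, b in zip(x, y)) - n * mx * my
    den = math.sqrt((sum(a * a for a in x) - n * mx * mx) *
                    (sum(b * b for b in y) - n * my * my))
    return num / den

weight = [53, 49, 54, 43, 45, 55, 44]   # X
hb     = [12, 10, 11,  9, 12, 13, 10]   # Y
print(round(pearson_r(weight, hb), 2))  # about 0.67, as in the text
```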
Solution: X-Marks in PSM Y-Marks in ENT N = 10
S.N. X Y Rx Ry D = Rx-Ry D²
---------------------------------------------------------------------------------
1. 33 28 4 5 -1 1
2. 22 17 7 9 -2 4
3. 20 29 8 4 4 16
4. 14 31 10 3 7 49
5. 29 38 5 1 4 16
6. 41 26 1 6 -5 25
7. 37 36 2 2 0 0
8. 25 21 6 8 -2 4
9. 18 14 9 10 -1 1
10. 34 24 3 7 -4 16
-------------------------------------------------------------------------------
∑D² = 132
Applying the formula for Spearman's Rank Correlation coefficient:
R = 1 - 6∑D²/(N³ - N) = 1 - {6×132 / (10³ - 10)} = 1 - {792 / (1000 - 10)} = 1 - {792/990} = 1 - 0.8 = 0.2
REGRESSION
Regression analysis reveals the average relationship between two variables and makes possible
estimation or prediction. The literal meaning of the term regression is the act of returning or going back.
Regression analysis is a statistical device with the help of which we can estimate (predict)
the unknown values of one variable from known values of another variable. The variable used to
predict the variable of interest is called the 'independent variable' and the variable we are trying to predict is
called the 'dependent variable'. The independent variable is denoted by 'X' and the dependent variable by 'Y'.
The analysis used here is called simple linear regression analysis; 'linear' means that the relationship takes the
form of a straight-line equation, Y = a + bX.
Line of regression
There are two lines of regression for analysis of two variables under study and to estimate the unknown value.
1. The line of regression of X on Y is given as:
(X - X̄) = bxy (Y - Ȳ)
Where X̄ and Ȳ are the means of X and Y respectively, and bxy = the regression coefficient of X on Y,
calculated as bxy = r σx / σy, where r is the correlation coefficient between X and Y, and σx and σy are the
SDs of X and Y respectively.
2. The line of regression of Y on X is given as:
(Y - Ȳ) = byx (X - X̄)
Where X̄ and Ȳ are the means of X and Y respectively, and byx = the regression coefficient of Y on X, calculated
as byx = r σy / σx, where r is the correlation coefficient between X and Y, and σx and σy are the SDs of X and Y
respectively.
To estimate X when Y is known the line of regression of X on Y can be used.To estimate Y when X is known
the line of regression of Y on X can be used.
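The two lines can be built from summary figures alone; in the sketch below the means, SDs and r are hypothetical
values chosen for illustration, and the function name is ours:

```python
def regression_lines(mean_x, mean_y, sd_x, sd_y, r):
    """Return (intercept, slope) for X on Y (X = a + b*Y) and for Y on X (Y = a + b*X)."""
    bxy = r * sd_x / sd_y              # regression coefficient of X on Y
    byx = r * sd_y / sd_x              # regression coefficient of Y on X
    x_on_y = (mean_x - bxy * mean_y, bxy)
    y_on_x = (mean_y - byx * mean_x, byx)
    return x_on_y, y_on_x

# hypothetical summary figures, for illustration only
(x_a, x_b), (y_a, y_b) = regression_lines(50, 85, 4.0, 10.0, 0.8)
x_est = x_a + x_b * 85                 # estimate X when Y = 85
```

As a sanity check, estimating X at Y equal to the mean of Y returns exactly the mean of X (here 50), which both
regression lines must satisfy since each passes through the point of means.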
Example:
1. Find the correlation coefficient, construct the two lines of regression, and estimate X when Y = 85 for the
following data:
1. 52 88
2. 57 92
3. 48 78
4. 45 72
5. 49 74
6. 51 98
7. 53 94
Solution:
Given that, n=7, X= weight and Y=DBP
---------------------------------------------------------------------------------------------
X Y X² Y² XY
---------------------------------------------------------------------------------------------
52 88 2704 7744 4576
57 92 3249 8464 5244
48 78 2304 6084 3744
45 72 2025 5184 3240
49 74 2401 5476 3626
51 98 2601 9604 4998
53 94 2809 8836 4982
------------------------------------------------------------------------------------------
ΣX=355 ΣY= 596 ΣX² = 18093 ΣY² = 51392 Σ XY=30410
(Y - 85.71) = -0.092X + 4.66, i.e. Y = -0.092X + 4.66 + 85.71. Thus, Y = -0.092X + 90.37
Now, to estimate X (weight) when Y (DBP) is given as 85 use Line of regression of X (Weight) on Y (DBP) as follows:
X = -0.46Y +90.14 X estimate = - 0.46 * 85 +90.14 = - 39.10 + 90.14 Thus, X estimate = 51.04
Thus, the estimated weight=51.04 when DBP =85
HEALTH STATISTICS
I. DEFINITION
“Health Statistics" is a specialized branch of Statistics that relates to the application of numerical methods to
all matters that have direct or indirect influence upon or relationship with health and are required for health
planning, services and reporting. In other words, it includes all statistical information required for the
administration of a health agency like health care providers, recipients of health care and health seeking
behaviors, other infrastructure facilities like hospitals, Clinics (MCH, STD etc), Blood banks, health expenditure
etc, and also the statistics required to assess the health status of people like vital statistics, demography,
morbidity statistics, hospital statistics, and the socio-economic, political, spiritual and environmental factors
which influence health. Thus, health statistics are often described as the "eyes and ears" of public health.
The statistics can be used to answer the following questions, which every Public Health Personnel would
encounter while delivering the services -
How many people suffer from particular diseases, how often and for how long.
What demands these diseases place on the medical and public health resources; and what financial loss they
cause;
How fatal the different diseases are;
To what extent these diseases prevent people from carrying out their normal activities.
To what extent diseases are concentrated in particular groups of the population e.g., according to age, sex,
ethnic group, occupation or place of residence;
How far the above factors vary from time to time.
What is the effect of medical care and health services on the control of disease incidence.
Health status of persons and population in a given area, providing us with indices of vitality and health;
The physical, environmental and other conditions and factors having a more or less direct bearing on the
health status of the population - indices of social and environmental factors.
Health services and activities directed at the improvement of health conditions - indices of health activity
and facilities.
Systems organized on a national scale to obtain information continuously from each household or institution
that is census, registration of vital events, notification of diseases, disease surveillance registry (National
Cancer Registry, National Tuberculosis Registry), national population surveys (National Sample Surveys),
MIS of national health programmes.
Records of medical & health institutions providing service to the community.
Surveys or investigations conducted in response to the need for more detailed information. ex. Nutritional
surveys, epidemiological investigations, field trials of vaccines.
Miscellaneous eg. Physician case records, police record on accidents/ injuries/ suicides/ homicides etc,
meteorological data - temp, rainfall, humidity, air quality etc, morbidity records of industrial units, schools,
records of statutory bodies (DMER, MCI, DCI, FDI etc.) and information on social, economic or
occupational factors affecting health, health budget and expenditure etc.
V. DETAILS OF SOME IMPORTANT SOURCES:
Census: According to the United Nations, the census is defined as "a process of collecting, compiling and
publishing demographic, economic and social data pertaining to all persons in a country at a specified time."
The purpose of census is to provide required information for planning and administering developmental
activities, including health. The indices normally calculated for planning health services from census
include - birth rates, death rates, sickness rates, literacy rates, age at marriage, expectation of life, age and
sex composition, urban and rural distribution of population, language, place of birth and nationality, amount
of disability, fertility data (number of children born and remaining alive), distribution of population by
occupation, housing etc, and rate of increase of population.
Earlier, the responsibility for registration of these vital events rested with the village police Patil/revenue
official. After the Panchayat Raj Act of 1961, the responsibility shifted to the village secretary (gramsevak).
However, the completeness of registration has not improved. Subsequently, CHGs and Health Assistants (HAs) at the
peripheral level, the in-charge M.O.s of PHCs at the intermediate level, and DHOs at the district level have been
involved in collecting this information, though the event is registered by the gramsevak. In urban areas the
responsibility for registration lies with Municipal Councils or Corporations, as the case may be, while the CMO of
the municipal hospital also collects the information and forwards it to the concerned state authorities.
Usually, sickness rates are measured in terms of "persons", "illnesses" or "spells":
a) Incidence rate (spells): {Total no. of new spells of illness during a defined period / population exposed to
risk in the same period} × 1000
b) Incidence rate (persons): {Total no. of new persons who become ill at least once in a defined period /
population exposed to risk in the same period} × 1000
c) Period prevalence rate: {Total no. of new and old cases found during a specified period / population exposed to
risk in the same period} × 1000
d) Point prevalence rate: {Total no. of new and old cases found at a particular point of time / population exposed
to risk at the same point of time} × 1000
e) Fatality ratio: Total no. of deaths from a disease / No. of new cases of that disease
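All of the "per 1000" rates above share the same shape, which a one-line helper makes explicit; the counts below
are hypothetical figures for illustration only:

```python
def rate_per_1000(events, population_at_risk):
    """Generic morbidity rate: (events / population exposed to risk) x 1000."""
    return events / population_at_risk * 1000

# hypothetical figures, for illustration only
incidence_spells = rate_per_1000(45, 15000)    # 45 new spells of illness in a defined period
point_prevalence = rate_per_1000(120, 15000)   # 120 new + old cases at one point of time
print(incidence_spells, point_prevalence)      # rates per 1000 population at risk
```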
Often, routinely collected data from health service records and from other sources do not provide a complete
description of the population suitable for use in health service planning. On such occasions carefully planned
health surveys may be used to collect additional information.
A Health Survey
A health survey is a planned study to investigate the health characteristics of a population. It is used to
measure the total amount of illness in the population,
measure the amount of illness caused by a specified disease,
study the nutritional status of the population,
examine the utilization of existing health care facilities and demand for new ones,
measure the distribution in the population of a particular characteristic e.g, Hb level, practices of
brushing the teeth, habits etc.
examine the role and relationship of one or more factors in the etiology of a disease.
"Vital statistics" denotes facts systematically collected and compiled in numerical form relating to, or derived
from, records of vital events, namely live birth, death, foetal death, marriage, divorce, adoption, legitimation,
recognition, annulment, or legal separation. In essence, vital statistics are derived from legally registrable
events, without including population data or morbidity data.
1) To describe the level of community health. 2) To diagnose community ills and determine the met and unmet
health needs. 3) To disseminate reliable information on the health situation and health programmes. 4) To direct
or maintain control during the execution of a programme. 5) To develop procedures, definitions and techniques such
as recording systems, sampling schemes etc. 6) To undertake overall evaluation of health programmes and public
health work.
For instance, carefully compiled causes of death in a city can answer:
1) What is the leading cause of death in the city (malaria, TB etc.)? 2) At what age is mortality highest, and from
what disease?
3) Which sections of the city (women, children, or individuals following certain occupations) are the most
unhealthy, and what is the outstanding cause of death there? 4) How do cities compare in relation to their health
status and the health facilities available to cope with the problem?
5) Total Fertility Rate (TFR): The average number of children that would be born alive to a woman during her
lifetime if she were subjected to the age-specific fertility rates of a given year.
6) Gross Reproduction Rate (GRR): The average number of female live births that would be born to a woman during
her lifetime if she were subjected to the age-specific fertility rates of a given year.
7) Net Reproduction Rate (NRR): The average number of female live births that would be born to a woman during her
lifetime if she were subjected to the age-specific fertility and mortality rates of a given year.
B) Mortality Rates :
a) Crude death rate (CDR): To measure the decrease of population due to death, the rate commonly used is
the CDR.
CDR = {No. of deaths in a given area and period / Mid-year population} × 1000
b) Specific death rates: Specific death rates include age-specific (infant, neonatal, geriatric), sex-specific,
vulnerable-group-specific (maternal cases), disease-specific etc.
i) Infant Mortality Rate: This is one of the most sensitive indices of the health conditions of the general
population. It is a sensitive measure because a baby in its extrauterine life is suddenly exposed to a multitude of
new environmental factors, and its reactions to them are reflected in this rate. Under ideal conditions of social
welfare no normal baby should die.
IMR = (No. of deaths under 1 year of age / No. of live births) × 1000
Perinatal Mortality Rate: {Late foetal deaths (20 weeks or more) + deaths under one week / live
births + late foetal deaths} X 1000
vi) Maternal Mortality Rate: The risk of dying from causes associated with childbirth is measured by the
maternal mortality rate. MMR = {No. of deaths due to delivery, childbirth and the puerperium / No. of live
births} × 1000
vii ) Cause-of-death rate : This rate is calculated to understand, which cause/disease is commonly responsible
for mortality in the community/population.
i) Life Table: William Farr called the life table the "biometer" of the population. A life table is
composed of several sets of values showing how a group of infants, all supposed to be born at the same time and
experiencing specified mortality conditions, would gradually die out. Such tables can be constructed
separately for males/females, occupational groups, population segments, or geographical subdivisions of a country.
The table is constructed showing the survival and deaths occurring in a generation of 100,000 babies. On the basis
of the mortality rate operating at the time under study, the number of babies who would be alive at the first
birthday is estimated. By applying the mortality rate of the second year of life to the number surviving at the
end of the first year, we estimate the number surviving at the end of the second year, and similarly for later
ages. From these values we can also calculate the average lifetime a person can expect to live after any age.
ii) Physical Quality of Life Index (PQLI): This is the average of the Infant Mortality Rate, the literacy rate
and life expectancy at birth.
iii) Human Development Index (HDI): The HDI is based on three indicators: longevity, as measured by life
expectancy at birth; educational attainment, as measured by a combination of adult literacy (two-thirds weight)
and combined primary, secondary and tertiary enrolment ratios (one-third weight); and standard of living, as
measured by real GDP per capita (purchasing power parity, PPP).
COMPUTERS IN MEDICINE
INTRODUCTION:
The dictionary calls a computer an electronic device that stores, retrieves, and processes information. Thus, a
computer is a machine that can store large volumes of data and manipulate them using arithmetic and logical
methods.
Salient features
HISTORY :
The earliest version was developed by Blaise Pascal in 1642. The machine worked on wheels and gears and could
perform only additions.
In 1694 G. W. Leibniz devised a machine for other mathematical operations. Charles Babbage's (1833) analytical
engine remained on paper, but the concept was accepted after his death.
In 1946, the first electronic computer was developed by J. Presper Eckert and John Mauchly. It weighed 30 tons,
occupied 15,000 sq. ft, and could perform 300 multiplications per second.
Computer Generations
First generation - 1946; bulky, could carry out 5,000 basic arithmetic operations per second. Used vacuum tubes as
the main logical units.
Second generation - 1959; transistors replaced vacuum tubes, so computers occupied less space, consumed less
power, and were faster and more accurate.
The advent of the silicon chip, which could accommodate hundreds of transistors, made computers still smaller and
faster. Supercomputers carry out 500 lakh (50 million) instructions per second, but they still work too slowly to
approximate the higher forms of human thought involving rapid association and analysis of ideas.
CPU
Hardware is the physical components of the computer, the things you can see and touch.
Computer Languages :
Application in Medicine
In health care, computers were tested as early as the 1960s, primarily in biomedical research.
The last decade has seen the computer move out of the laboratory into the routine clinical environment.
Medical informatics comprises the theoretical and practical aspects of information processing and
communication, based on knowledge and experience derived from processes in medicine and health care.
Applications include:
- development and use of diagnostic models using truth tables, decision trees, multivariate statistics (Bayes'
theorem) and expert systems;
- recognition of objects and patterns in images and signals, as in X-ray and ECG interpretation and cell,
chromosome or cervical smear recognition;
- models of cardiovascular physiology in terms of mechanical (flows, pressures, volumes) and electrical
(depolarisation and repolarisation) parameters;
- epidemiology;
- expert systems using AI.