Understanding Statistics: Scope & Types
These methods describe the various aspects of a data set. Descriptive statistical methods have
their beginning in the inventories kept by early civilizations, such as the Babylonians,
Egyptians, and the Chinese. For example, the Old Testament of the Bible refers to the
numbering or counting of the people of Israel and to the casting of lots for selection by chance,
and the Romans kept careful counts of people, possessions, and wealth in the territories they
conquered. Similarly, the Domesday Book of the late eleventh century enumerated the lands
and wealth of England. The Middle Ages also saw the growth of governments and religious
institutions and their recording of births, deaths, and marriages. These early methods were
primarily lists and counts kept for purposes of taxation and military conscription.
Inferential statistics consist of methods that permit one to reach conclusions and make
estimates about populations based upon information from a sample.
A parameter is any measurement that describes an entire population. Usually, the parameter
value is unknown since we rarely can observe the entire population. Parameters are often (but
not always) denoted by Greek letters, such as µ, θ and σ.
Medicine
An experimental drug to treat asthma is given to 75 patients, of whom 24 get better. A placebo
is given to a control group of 75 volunteers, of whom 12 get better. Is the new drug better
than the placebo, or is the difference within the realm of chance?
Forecasting
A large company carries 50 000 different products. To manage this vast inventory, it needs a
weekly order forecasting system that can respond to developing patterns in consumer demand.
Is there a way to predict weekly demand and place orders with suppliers for every item,
without an unreasonable commitment of staff time?
Product warranty
A major automaker wants to know the average dollar cost of engine warranty claims on a new
hybrid engine. It has collected warranty cost data on 4 300 warranty claims during the first 6
months after the engines are introduced. Using these warranty claims as an estimate of future
costs, what is the margin of error associated with this estimate?
variable. A continuous variable can assume any value within a specific relevant interval of
values assumed by the variable. Notice that age is continuous since an individual does not
age in discrete jumps.
Continuous variables are those whose numerical values can be broken down or sub-divided
into finer units almost indefinitely. Age qualifies as an example of a continuous variable in
that it could be broken down into years, months, days, and beyond. Other examples are
weight, height, time, income, educational attainment, and so on. Many other continuous
variables are formed by taking ratios or rates. The homicide rate (homicides per 100,000
residents) is a continuous variable because it divides homicides by population, and the
calculation could be carried out to any number of decimal places. One hallmark of continuous
variables is that, in addition to the researcher's ability to break them down into finer gradations,
they can take decimal values.
Categorical variables
A variable is called categorical when the measurement scale is a set of categories. For
example, marital status, with categories (single, married, widowed), is categorical. For
Ghanaians, the region of residence is categorical, with categories Greater Accra, Eastern, and
so on. Other categorical variables are employment status (yes, no), religious affiliation
(Protestant, Catholic, Jewish, Muslim, others, none), political party preference, favorite
type of music (classical, country, folk, jazz, rock), place of birth, nationality, colour, colour
of hair, gender, blood group, smoking habit, surname, and rank in the military. Categorical variables
are often called qualitative. Categorical variables can neither be measured
nor counted.
Nominal scale
This scale of measure applies to qualitative variables only. On the nominal scale, no order is
required. For example, gender is nominal, blood group is nominal, and marital status is also
nominal. On the nominal scale, categories are mutually exclusive. Thus an item must belong
to exactly one category. Notice that we cannot perform arithmetic operations on data
measured on the nominal scale.
Ordinal scale
This scale also applies to qualitative data. On the ordinal scale, order is necessary. This means
that one category is lower than the next one or vice versa. For example, in the Army, the rank
of private is lower than the rank of captain, which is lower than the rank of major, and so on.
Thus, the rank of an army officer is measured on the ordinal scale. In universities, the rank of
an academic staff member is measured on the ordinal scale. Grades are also ordinal, as excellent is
higher than very good, which in turn is higher than good, and so on.
It should be noted that, on the ordinal scale, differences between category values have no
meaning. For example, although Professor is higher than Lecturer, the difference between
these two ranks cannot be expressed numerically. Similarly, if 4 denotes “excellent”, 3 denotes
“very good”, 2 denotes “good” and 1 denotes “fair”, it does not mean that a candidate who is
rated “excellent” is twice as competent as a candidate who is rated “good”, just because
“excellent” is denoted by 4 and “good” is denoted by 2.
Interval scale
This scale of measurement applies to quantitative data only. In this scale, the zero point does
not indicate a total absence of the quantity being measured. An example of such a scale is
temperature on the Celsius or Fahrenheit scale. Suppose the minimum temperatures of 3
cities, A, B and C, on a particular day were 0°C, 20°C and 10°C, respectively. It is clear that
we can find the differences between these temperatures. For example, city B is 20°C hotter
than city A. However, we cannot say that city A has no temperature. Note that city A has a
temperature equivalent to 32°F. Moreover, we cannot say that city B is twice as hot as city C,
just because city B is 20°C and city C is 10°C. The reason is that, on the interval scale, the
ratio between two numbers is not meaningful.
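The point about ratios can be checked numerically. The short Python sketch below (the function name celsius_to_fahrenheit is our own, chosen for illustration) converts the temperatures of cities B and C to Fahrenheit and compares the two ratios.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


city_b_c, city_c_c = 20, 10  # temperatures of cities B and C in degrees Celsius

city_b_f = celsius_to_fahrenheit(city_b_c)
city_c_f = celsius_to_fahrenheit(city_c_c)

# The same two temperatures give different ratios on the two scales,
# so "twice as hot" depends on the unit and carries no physical meaning.
ratio_celsius = city_b_c / city_c_c
ratio_fahrenheit = city_b_f / city_c_f

print(ratio_celsius, ratio_fahrenheit)
```

The ratio is 2 in Celsius but only 1.36 in Fahrenheit, which is why ratios are not used with interval-scale data.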
Ratio scale
This scale of measurement also applies to quantitative data only and has all the properties of
the interval scale. In addition to these properties, the ratio scale has a meaningful zero starting
point and a meaningful ratio between 2 numbers.
An example of a variable measured on the ratio scale is weight. A weighing scale that reads
0 kg gives an indication that there is absolutely no weight on it. So, the zero starting point is
meaningful. If Yaw weighs 40 kg and Akosua weighs 20 kg, then Yaw weighs twice as much as
Akosua. Another example of a variable measured on the ratio scale is temperature measured
on the Kelvin scale. This has a true zero point.
Methods of data collection
Introduction
Most research techniques and many statistical process-control techniques involve the use of
sampling. A sample is selected, evaluated and studied in an effort to gain information about
the larger population from which the sample was drawn. Above, we learned that a sample is
defined as a subset or part of a population. Although, by definition, samples will be smaller
than the population from which they are drawn, samples can be very small or very large. A
single student can be considered a sample of students from a given university, while a very large
sample consisting of millions of households can be selected to respond to a lengthy
questionnaire that is part of a census.
A sample represents a population, and information obtained from a sample is generalized to
be true for the entire population from which it was drawn. The validity or accuracy of
generalizations from samples to populations depends on how well a sample represents its
population. A well-selected sample can provide information comparable to that obtained by
a census.
Advantages of sampling
Studying a sample instead of a population can have the following advantages.
1. Cost – Samples can be studied at much lower cost. The smaller number of units or
individuals involved in a sample requires less time and money to evaluate. Samples can
provide affordable, accurate, and useful information in cases where a census would cost more
than the value of the information obtained.
2. Time – Samples can be evaluated more quickly than a population. If a decision had to wait
for the results of a census, a critical advantage might be missed, or the information might be
made obsolete by events or changes that took place while the data were being collected and
analyzed.
3. Accuracy – Any time data are collected, there is a chance for errors to occur. Errors of
measurement, incorrect recording of data, transposition of digits, recording of information in
the wrong area of a form, and errors in entering data into a computer can all influence the
accuracy of results. In general, the larger the data set, the more opportunity there is for errors
to occur. A sample can provide a data set that is small enough to monitor carefully and can
permit careful training and supervision of data gatherers and handlers.
4. Feasibility – In some research situations, the population of interest is not available for
study. A substantial portion of the population might not yet exist or might no longer be
available for evaluation. In other cases, evaluation of an item requires its destruction. For
example, a manufacturer interested in how much pressure could be applied to a part before it
cracked, could not perform a census without destroying the entire production run.
5. Scope of information – A sample survey can cover a greater variety of information than
is practicable in a complete census, because of constraints such
as the limited number of trained personnel and equipment. When evaluating a smaller group, it
is sometimes possible to gather more extensive information on each unit evaluated.
Sample designs
There are two categories of sample designs, namely, probability (or random) sampling and
non-probability sampling.
1. Probability Sampling
In this sub-section, we introduce important sampling methods which incorporate
randomization, which means that the selection is not consciously influenced by human
choice. The major principle of these designs is to avoid bias in the selection procedure and to
achieve the maximum precision for a given outlay of resources. The main types of probability
sampling designs are: simple random sampling, systematic sampling, stratified sampling,
cluster sampling and multi-stage sampling.
(i) Simple random sample
Subjects of a population to be sampled could be families, schools, cities, hospitals, records of
reported crimes, and so on. Simple random sampling is a method of sampling for which every
possible sample has an equal chance of selection. Let n denote the number of subjects in the
sample. This number is called the sample size.
A simple random sample of n subjects from a population is one in which each possible sample
of that size has the same probability (chance) of being selected.
A simple random sample is often just called a random sample. The adjective simple is used
to distinguish this type of sampling from more complex sampling schemes.
Why is it a good idea to use random sampling? Because everyone has the same chance of
inclusion in the sample, it provides fairness. This reduces the chance that the sample is
seriously biased in some way, leading to inaccurate inferences about the population. Most
inferential statistical methods assume randomization of the sort provided by random
sampling.
How to select a simple random sample
One way of obtaining a simple random sample is to use the ‘lottery system’.
The lottery system
The lottery system consists of writing the name of each item in the sample frame on a slip of
paper or a card and then drawing them from a container one after the other. To ensure a
bias-free selection, shuffle the cards or the slips of paper before each draw.
Advantages of the lottery system
It eliminates selection bias.
Disadvantages of the lottery system
It is time-consuming and cumbersome when the population is large.
Example 1.3
Suppose you want to select a simple random sample of 10 students from a class of 20 students.
The sampling frame is a directory of these students. You can select the students by using two-
digit random numbers to identify them, as follows:
(1) Assign the numbers 01 to 20 to the students in the directory, using 01 for the first student
in the list, 02 for the second student, and so on.
(2) Starting at any point in Table 1.2, choose successive two-digit numbers until you obtain
10 distinct numbers between 01 and 20.
(3) Include in the sample the students with the assigned numbers equal to the random numbers
selected.
For example, using the first row of Table 1.2, the first 5 two-digit random numbers between
01 and 20 are 10, 15, 01, 02 and 14. Notice that we skipped the numbers which are greater than 20 since no
student in the directory has an assigned number greater than 20.
After using the first row of Table 1.2, move to the next row of numbers and continue. The
column (or row) from which you begin selecting the number does not matter, since the
numbers have no set pattern. Most statistical software can do all of this for you.
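As an illustration of how software can carry out steps (1) to (3), here is a minimal Python sketch using the standard random module in place of a printed random number table. The function name simple_random_sample and the seed value are our own choices, and the numbers drawn will generally differ from those read off Table 1.2.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct items from the sampling frame.

    random.Random.sample selects without replacement, so every
    subset of size n is equally likely to be chosen.
    """
    rng = random.Random(seed)
    return rng.sample(frame, n)


# Step (1): assign the numbers 01 to 20 to the students in the directory.
directory = [f"{i:02d}" for i in range(1, 21)]

# Steps (2) and (3): draw 10 distinct student numbers.
sample = simple_random_sample(directory, 10, seed=1)
print(sorted(sample))
```

Fixing the seed only makes the illustration repeatable; in a real study the seed would normally be left unset.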
(ii) Systematic random sample
Another method of random sampling is to choose every kth item from the list, starting from a
randomly chosen entry among the first k items on the list. This is called systematic sampling.
The number k is called the skip number. Fig. 1.2 shows how to sample every fourth item,
starting from item 2, resulting in a sample of size n =20 items from a list of N =78 items.
A systematic sample of n items from a population of N items requires that the skip number
be approximately N/n. When sampling from a sampling frame, it is simpler to select a systematic
random sample than a simple random sample because it uses only one random number.
Fig. 1.2: Systematic sampling
An attraction of systematic sampling is that it can be used with unlistable or infinite
populations, such as production processes (e.g., testing every 5000th light bulb) or political
polling (e.g., surveying every tenth voter who emerges from the polling place). Systematic
sampling is also well-suited to linearly organized physical populations (e.g., pulling every
tenth patient folder from alphabetized filing drawers in a veterinary clinic).
Example 1.4
Suppose we want a systematic random sample of 100 students from a population of
30000 students listed in a campus directory. Here, n = 100 and N = 30000, and so k = 30000/100
= 300. The population size is 300 times the sample size. Therefore, we have to select one of
every 300 students. We select one student at random from the first 300 students in the
directory, and then select every 300th student after the one
selected randomly. This produces a sample of size 100. The first three digits in Table 1.2 are
104, which falls between 001 and 300, so we first select the student numbered 104. The
numbers of the other students selected are 104 + 300 = 404, 404 + 300 = 704, 704 + 300 =
1004, 1004 + 300 = 1304, and so on. The 100th student selected is listed in the last 300 names
in the directory.
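The selection in Example 1.4 can be sketched in Python as follows. The function systematic_sample is a hypothetical helper of our own; it assumes that N is an exact multiple of n, so that the skip number k = N/n is a whole number.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the 1-based positions chosen by systematic sampling.

    Assumes population_size is a multiple of sample_size, so the
    skip number k is an integer. If no start is supplied, one is
    drawn at random from the first k positions.
    """
    k = population_size // sample_size
    if start is None:
        start = random.Random(seed).randint(1, k)
    return [start + i * k for i in range(sample_size)]


# Example 1.4: N = 30000, n = 100, random start 104.
positions = systematic_sample(30000, 100, start=104)
print(positions[:5])   # first five students selected
print(positions[-1])   # the 100th student selected
```

The last position is 104 + 99 × 300 = 29804, which lies among the last 300 names (29701 to 30000), as stated in the example.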
(iii) Stratified random sample
Another probability sampling method, useful in social science research for studies comparing
groups, is stratified random sampling.
A stratified random sample divides the population into subgroups called strata, and then
selects a simple random sample from each stratum.
Stratified random sampling is called proportional if the sampled strata proportions are the
same as those in the entire population. For example, if 90% of the population of interest are
men and 10% are women, then the sampling is proportional if the sample size for men is nine
times the sample size for women.
Stratified random sampling is called disproportional if the sampled strata proportions differ
from the population proportions. This is useful when the population size for a stratum is
relatively small. A group that comprises a small part of the population may not have enough
representation in a simple random sample to allow precise inferences.
Example 1.5
Suppose we want to estimate the smallpox vaccination rate among employees in a university, and
we know that our target population (those individuals we are trying to study) is 55% male and
45% female. Suppose our budget only allows a sample of size 200. To ensure the correct
gender balance, we could sample 110 males and 90 females.
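The allocation in Example 1.5 is a simple proportion calculation. The sketch below shows it in Python; proportional_allocation is a hypothetical helper, and the stratum shares are given in percent.

```python
def proportional_allocation(total_sample_size, stratum_percentages):
    """Split a total sample size across strata in proportion to
    each stratum's share (in percent) of the population."""
    return {
        name: round(total_sample_size * pct / 100)
        for name, pct in stratum_percentages.items()
    }


# Example 1.5: a sample of 200 from a population that is 55% male, 45% female.
allocation = proportional_allocation(200, {"male": 55, "female": 45})
print(allocation)
```

With other shares the rounded stratum sizes may not add up exactly to the total, so a small manual adjustment may be needed. A simple random sample of the allocated size is then drawn within each stratum.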
(iv) Cluster random sampling
Simple, systematic, and stratified random sampling are often difficult to implement, because
they require a complete sampling frame. Such lists are easy to obtain when sampling cities or
hospitals for example, but more difficult to obtain when sampling individuals or families.
Cluster samples are essentially strata consisting of geographical regions. We divide a region
(say a city) into sub-regions (say, blocks, sub-divisions, or schools). In a one-stage cluster
sampling, our sample consists of all elements in each of k randomly chosen sub-regions (or
clusters). In a two-stage cluster sampling, we first randomly select k sub-regions (clusters)
and then choose a random sample of elements within each cluster. Fig. 1.3 illustrates how
four elements could be sampled from each of five randomly chosen clusters, using a two-
stage cluster sampling.
Cluster sampling is useful when:
Although cluster sampling is cheap and quick, it is often reasonably accurate because people
in the same neighbourhood tend to be similar in income, ethnicity, educational background,
and so on. Cluster sampling is useful in political polling, surveys of gasoline pump prices,
crime victimization surveys, and studies of lead contamination in soil. A hospital may contain
clusters (floors) of similar patients. A warehouse may have clusters (pallets) of inventory parts.
Forest sections may be viewed as clusters to be sampled for disease or timber growth rates.
A study might plan to sample about 1% of the families in a city, using city blocks as clusters.
Using a map to identify city blocks, it could select a simple random sample of 1% of the
blocks and then sample every family on each block. A study of patient care in mental hospitals
in Ghana could first randomly sample mental hospitals (the clusters) and then collect data for
patients within these hospitals.
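A two-stage cluster sample can be sketched in Python as shown below. The city, its block names and its family labels are entirely hypothetical, and two_stage_cluster_sample is a helper of our own. It selects five clusters and then four elements within each, as in the description of Fig. 1.3.

```python
import random


def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: choose n_clusters clusters at random.
    Stage 2: choose n_per_cluster elements at random within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }


# Hypothetical city of 20 blocks, each containing 10 families.
city = {
    f"block_{b:02d}": [f"family_{b:02d}_{f}" for f in range(10)]
    for b in range(20)
}

sample = two_stage_cluster_sample(city, n_clusters=5, n_per_cluster=4, seed=42)
for block, families in sample.items():
    print(block, families)
```

Setting n_per_cluster to the full cluster size gives a one-stage cluster sample, in which every element of each chosen cluster is included.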
Example 1.7
What is the difference between a stratified sample and a cluster sample?
Solution
A stratified sample uses every stratum. The strata are usually groups we want to compare. By
contrast, a cluster sample uses a sample of the clusters, rather than all of them. In cluster
sampling, clusters are merely ways of easily identifying groups of subjects. The goal is not to
compare the clusters but to use them to obtain a sample. Most clusters are not represented in
the eventual sample.
(v) Multi-stage Sampling
A random sample of a population of interest often incurs considerable expense in collecting
the data from a wide area. A cheaper solution is to use multi-stage sampling which starts by
dividing the country into a number of regions. Some of these are selected at random and
subdivided further, e.g. into rural, suburban and inner city areas. Again, some of these are
selected at random and subdivided again, e.g. into parliamentary wards and a further random
selection made. The process can be repeated until individual households or companies or units
of interest are identified.
The Family Expenditure Survey makes use of multi-stage sampling. The Survey uses the
Small Users File of the Postcode Address File and the primary sampling unit is postal sectors.
The benefit of this approach is that the resulting samples are concentrated in relatively few
geographical areas which reduces the cost of data collection.
2. Non-probability sampling
Non-probability sampling designs select samples with features not embodying randomness.
The selection of the elements in the sample rests solely on personal judgement. The chance of
selecting an element cannot be determined. For this reason, there is no means of measuring
the risk of making erroneous conclusions derived from non-probability samples. Thus the
reliability of results (i.e. sampling errors) cannot be assessed, and the results cannot be used to make valid
conclusions about the population. The main methods of non-probability sampling are
convenience, judgement and quota sampling.
(i) Convenience sample
The sole virtue of convenience sampling is that it is quick. The idea is to grab whatever
sample is handy. The convenience sample is simply one that happens to come your way. An
accounting professor who wants to know how many MBA students would take a summer
elective in international accounting can just survey the class she is currently teaching. The
students polled may not be representative of all MBA students, but an answer (although
imperfect) will be available immediately.
A newspaper reporter doing a story on perceived airport security might interview co-workers
who travel frequently. An executive might ask department heads if they think non-business
Web surfing is widespread.
You might think that convenience sampling is rarely used or, when it is, that the results are
used with caution. However, this does not appear to be the case. Since convenience samples
often sound the first alarm on timely issues, their results have a way of attracting attention and
have probably influenced quite a few business decisions. The mathematical properties of
convenience samples are unknowable, but they do serve a purpose and their influence cannot
be ignored.
(ii) Judgment sample
Judgment sampling is a non-probability sampling method that relies on the expertise of the
sampler to choose items that are representative of the population. The sample obtained by this
method is based on personal judgment and some pre-knowledge of the population. For
example, to estimate the corporate spending on research and development (R&D) in the
medical equipment industry, we might ask an industry expert to select several “typical” firms.
Unfortunately, subconscious biases can affect experts, too. In this context, “bias” does not
mean prejudice, but rather non-randomness in the choice. Judgment samples may be the best
alternative in some cases, but we can’t be sure whether the sample was random.
(iii) Quota Sampling
Quota sampling is a special kind of judgment sampling, in which the interviewer chooses a
certain number of people in each category (e.g., men/women). Quota sampling involves first
classifying the population into non-overlapping sub-populations, called strata. The
sample is then obtained by selecting the individual elements from each stratum based on a
specified quota. In quota sampling the selection of the sample is made by the interviewer,
who has been given quotas to fill from specified sub-groups of the population. For example,
an interviewer may be told to sample 50 females between the ages of 45 and 60.
Since the selection of the sample is non-random, the enumerator is allowed to use his/her
own judgement to meet the various quotas. This introduces a large degree of bias. The
lack of randomness is, however, compensated for by lower cost and administrative
convenience.
Sampling with or without replacement
Consider the lottery system described above. If a selected item is put back in the box before
another item is taken, we are sampling with replacement. Using the box analogy, if we throw each
item back in the box and stir the contents before the next draw, an item can be chosen again.
If selected items are not returned to the box, we are sampling without replacement.
Duplicates are unlikely when the sample size n is much smaller than the population size N.
People instinctively prefer sampling without replacement because drawing the same item
more than once seems to add nothing to our knowledge. However, using the same sample
item more than once does not introduce any bias (i.e. no systematic tendency to over or
underestimate whatever parameter we are trying to measure).
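The two schemes correspond to two different functions in Python's standard random module. The sketch below uses a hypothetical population of 10 numbered slips; the seed is fixed only to make the illustration repeatable.

```python
import random

rng = random.Random(0)
slips = list(range(1, 11))  # hypothetical box of 10 numbered slips

# With replacement: each slip is returned before the next draw,
# so the same slip may be drawn more than once.
with_replacement = rng.choices(slips, k=5)

# Without replacement: a drawn slip stays out of the box,
# so all drawn slips are distinct.
without_replacement = rng.sample(slips, k=5)

print(with_replacement)
print(without_replacement)
```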
Computers and statistical analysis
The recent widespread use of computers has had a tremendous impact on statistical analysis.
Computers can perform more calculations faster and far more accurately than can human
technicians. The use of computers makes it possible for investigators to devote more time to
the improvement of the quality of raw data and the interpretation of the results.
The current prevalence of microcomputers and the abundance of statistical software packages
have further revolutionized statistical computing. The researcher in search of a statistical
software package will find the book by Woodward et al. (1987) extremely helpful. This book
describes approximately 140 packages. Among the most prominent ones are: Statistical
Package for the Social Sciences (SPSS), S-plus, MINITAB, SAS and GENSTAT. The
spreadsheet, Excel, also has facilities for statistical analysis.
MODULE TWO
DESCRIPTIVE STATISTICS
We have seen that statistical methods are descriptive or inferential. The purpose of descriptive
statistics is to summarize data to make it easier to assimilate the information. In this module,
we present basic methods of descriptive statistics.
Frequency distribution
Table 2.1 gives the number of children per family for 54 families selected from Obo, a town
in Ghana. Data presented in the form in which they were collected are called raw data.
From Table 2.1, it can be seen that the minimum and the maximum numbers of children per
family are 0 and 4, respectively. Apart from these numbers, it is impossible, without further
careful study, to extract any exact information from the data. By breaking down the data into
the form of Table 2.2, however, certain features of the data become apparent. For instance,
from Table 2.2, it can easily be seen that, most of the 54 families selected have two children.
This information cannot easily be obtained from the raw data in Table 2.1.
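Turning raw data into a frequency distribution is a matter of counting how often each value occurs. The Python sketch below does this with collections.Counter. The list children is hypothetical illustrative data for 12 families, not the 54 observations of Table 2.1.

```python
from collections import Counter

# Hypothetical raw data: number of children in each of 12 families.
children = [2, 0, 3, 2, 1, 4, 2, 1, 2, 3, 0, 2]

# Frequency distribution: value -> number of families with that value.
frequency = Counter(children)
for value in sorted(frequency):
    print(value, frequency[value])

# The value that occurs most often, and how often it occurs.
most_common_value, count = frequency.most_common(1)[0]
print(most_common_value, count)
```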
Table 2.3 gives the body masses of 22 patients, measured to the nearest kilogram.
It can be seen that the minimum and the maximum body masses are 42 kg and 83 kg,
respectively. A frequency distribution giving every body mass between 42 kg and 83 kg
would be very long and would not be very informative. The problem is overcome by grouping
the data into classes. If we choose the classes 41 – 49, 50 – 58, 59 – 67, 68 – 76 and 77 – 85,
we obtain the frequency distribution given in Table 2.4.
These are, of course, not the only classes which could be chosen. Table 2.4 gives the
frequency of each group or class; it is therefore called a grouped frequency table or a grouped
frequency distribution. Using this grouped frequency distribution, it is easier to obtain
information about the data than using the raw data in Table 2.3. For instance, it can be seen
from Table 2.4, that 17 of the 22 patients have body masses between 50 kg and 76 kg (both
inclusive). This information cannot easily be obtained from the raw data in Table 2.3.
It should be noted that, even though Table 2.4 is concise, some information is lost. For
example, the grouped frequency distribution does not give us the exact body masses of the
patients. Thus, the individual body masses of the patients are lost in our effort to obtain an
overall picture. However, Table 2.4 is far more comprehensible, and its contents are easier to
grasp than Table 2.3.
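Grouping can be sketched in Python as follows. The list masses is hypothetical data, chosen only to agree with the facts quoted in the text (22 patients, minimum 42 kg, maximum 83 kg, and 17 patients between 50 kg and 76 kg); it is not the actual content of Table 2.3. The helper grouped_frequencies is our own.

```python
# Hypothetical body masses (kg) of 22 patients.
masses = [42, 45, 48, 50, 52, 53, 55, 57, 58, 60, 61, 63, 64, 66,
          68, 70, 71, 73, 75, 76, 80, 83]

# Class limits used in the text.
classes = [(41, 49), (50, 58), (59, 67), (68, 76), (77, 85)]


def grouped_frequencies(values, class_limits):
    """Count the values lying between each pair of (lower, upper) class limits,
    both limits inclusive."""
    return {
        limits: sum(limits[0] <= v <= limits[1] for v in values)
        for limits in class_limits
    }


table = grouped_frequencies(masses, classes)
for (lower, upper), freq in table.items():
    print(f"{lower} - {upper}: {freq}")
```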
We now define the terms that are used in grouped frequency tables.
(i) Class limits
The intervals into which the observations are put are called class intervals. The end points of
the class intervals are called class limits. For example, the class interval 41 – 49, has lower
class limit 41 and upper class limit 49.
(ii) Class boundaries
The raw data in Table 2.3 were recorded to the nearest kilogram. Thus, a body mass of 49.5
kg would have been recorded as 50 kg, a body mass of 58.4 kg would have been recorded as
58 kg, while a body mass of 58.5 kg would have been recorded as 59 kg. It can therefore be
seen that the class interval 50 – 58 consists of measurements greater than or equal to 49.5
kg and less than 58.5 kg. The numbers 49.5 and 58.5 are called the lower and upper class
boundaries of the class interval 50 – 58. The class boundaries of the other class intervals are
given in Table 2.5.
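For data recorded to the nearest unit, each class boundary lies half a unit beyond the corresponding class limit. A minimal Python sketch of this rule, using a hypothetical helper class_boundaries, is:

```python
def class_boundaries(lower_limit, upper_limit, unit=1):
    """Return (lower, upper) class boundaries for data recorded
    to the nearest `unit`: half a unit beyond each class limit."""
    half = unit / 2
    return lower_limit - half, upper_limit + half


classes = [(41, 49), (50, 58), (59, 67), (68, 76), (77, 85)]
boundaries = [class_boundaries(lo, hi) for lo, hi in classes]

for (lo, hi), (lb, ub) in zip(classes, boundaries):
    print(f"{lo} - {hi}: {lb} - {ub}")
```

Note that the upper boundary of one class coincides with the lower boundary of the next, so the classes cover the measurement scale without gaps.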
Relative frequency
It is sometimes useful to know the proportion, rather than the number, of values falling within
a particular class interval. We obtain this information by dividing the frequency of the
particular class interval by the total number of observations. We refer to the proportion of
values falling within a class interval as the relative frequency of the class interval. In Table
2.13, the relative frequency of the first class interval is 687/3088 = 0.2225,
since the class frequency is 687 and the sum of the frequencies is 3088. Note that relative
frequencies must add up to 1, allowing for rounding errors.
Cumulative frequency
In many situations, we are not interested in the number of observations in a given class
interval, but in the number of observations which are less than (or greater than) a specified
value. For example, in Table 2.5, it can be seen that 3 patients have body masses
less than 49.5 kg and 9 patients (i.e. 3 + 6) have body masses less than 58.5 kg. These
frequencies are called cumulative frequencies. A table of such cumulative frequencies is
called a cumulative frequency table or cumulative frequency distribution.
Table 2.14 shows the data in Table 2.5 along with the cumulative frequencies and the relative
frequencies. Notice that the last cumulative frequency is equal to the sum of all the
frequencies.
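Relative and cumulative frequencies are easily computed from the class frequencies. In the Python sketch below, the first two frequencies (3 and 6) and the total of 22 are taken from the text; the remaining three values are assumed for illustration only.

```python
from itertools import accumulate

# Class frequencies: 3 and 6 come from the text; 5, 6 and 2 are assumed.
frequencies = [3, 6, 5, 6, 2]
total = sum(frequencies)

# Relative frequency: class frequency divided by the total number of observations.
relative = [f / total for f in frequencies]

# Cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequencies))

for f, r, c in zip(frequencies, relative, cumulative):
    print(f, round(r, 4), c)
```

The last cumulative frequency equals the total number of observations, and the relative frequencies add up to 1 apart from rounding.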
Example 2.2
Table 2.15 gives the ages of a sample of patients who attended Hope Medical Hospital.
(a) Find the sample size. (b) Complete the blank cells.
Solution
(a) If the sample size is n, then the relative frequency of the second class interval is 8 ÷ n.
Hence, n is a root of the equation
8/n = 0.16 ⇒ n = 8/0.16 = 50
The sample size is 50.
(b) Table 2.16 gives the completed blank cells.
MODULE THREE
GRAPHICAL REPRESENTATION OF DATA
In the last module, we found that information given in a frequency distribution is easier to
interpret than raw data. Information given in a frequency distribution in a tabular form is
easier to grasp if presented graphically. Many types of diagrams are used in statistics,
depending on the nature of the data and the purpose for which the diagram is intended. In this
module, we discuss how statistical data can be presented by histograms, cumulative
frequency curves, and so on.
Histogram
A histogram consists of rectangles with:
(i) bases on a horizontal axis, centres at the class marks, and lengths equal to the class widths,
(ii) areas proportional to class frequencies.
If the class intervals are of equal size, then the heights of the rectangles are proportional to
the class frequencies, and it is then customary to take the heights of the rectangles numerically
equal to the class frequencies.
If the class intervals are of different widths, then the heights of the rectangles are proportional
to (class frequency) / (class width). This ratio is called frequency density.
Example 2.3
Table 2.17 shows the distribution of the heights of 40 students selected from St. Paul High
School. Draw a histogram to represent the data.
Table 2.17: Heights of students
Solution
Since the class intervals have different sizes, the heights of the rectangles of the histogram
are proportional to the frequency densities of the class intervals. The calculations of the
heights of the rectangles can be set up as shown in Table 2.18. If d_i is the frequency density
of class interval i, then the height of the rectangle representing this class interval is c × d_i, where
c is any positive number (see Table 2.18, column 5). Fig. 2.1 shows a histogram for the data.
It was drawn by taking c = 5. Notice that the centres of the bases of the rectangles of the
histogram are at the class marks. If preferred, a histogram may be drawn showing class
boundaries instead of class marks.
Table 2.18: Work table for computing the heights of rectangles of a histogram
Fig. 2.1: Histogram of the data in Table 2.17
Example 2.4
Table 2.19 shows the distribution of ages of 168 diabetic patients selected from Progress
Hospital. A histogram is drawn to represent the data. If the height of the rectangle representing
the second class interval is 3 cm, find the height of the rectangle which represents the third
class interval.
Table 2.19: Ages of diabetic patients
Solution
The calculations of the heights of the rectangles of the histogram can be set up as shown in
Table 2.20. Notice that the heights of the rectangles are proportional to the frequency densities
of the class intervals (see Table 2.20, column 5).
Table 2.20: Work table for computing the heights of rectangles of a histogram
If the height of the rectangle representing the second class interval is 3 cm, then
\[ 2.4c = 3 \;\Rightarrow\; c = \frac{3}{2.4} = 1.25 \]
The height of the rectangle which represents the third class interval is
4.8c cm = 4.8 × 1.25 cm = 6 cm.
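The same arithmetic can be checked with a short Python sketch. It uses the frequency densities 2.4 and 4.8 quoted in the solution above; the function name rectangle_height is a hypothetical helper introduced only for illustration.

```python
import math


def rectangle_height(known_density, known_height, target_density):
    """Height of a rectangle, given one known (density, height) pair.

    Heights are proportional to frequency densities: height = c * density.
    """
    c = known_height / known_density  # constant of proportionality
    return c * target_density


# Second class: density 2.4, height 3 cm. Third class: density 4.8.
height_third = rectangle_height(2.4, 3.0, 4.8)
print(round(height_third, 2))  # 6.0
```

Because the third class has twice the frequency density of the second, its rectangle is twice as tall, whatever scale is chosen.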
Drawing a histogram
1. When drawing a histogram, suitable scales must be chosen for both the vertical and
horizontal axes. Scales like “2 cm to 5 units” or “2 cm to 10 units” are the best. Avoid using
scales like “2 cm to 3 units” or “2 cm to 7 units”.
2. Label the axes.
3. Give your graph a title.
Solution
(a) Table 2.22 gives the cumulative frequency distribution of the data in Table 2.21.
Notice that a class with frequency zero is added before the first class. It can be seen that the
last cumulative frequency is equal to the total number of observations, a check on the accuracy
of our calculation. The corresponding cumulative frequency curve is shown in Fig. 2.2. The
curve is obtained by plotting the cumulative frequencies (vertical axis) against the upper class
boundaries (horizontal axis) and joining the points with a smooth curve.
Fig. 2.2: Cumulative frequency curve of the data in Table 2.21
(b) (i) Since the body masses of the patients are recorded to the nearest integer, body masses
less than 65 kg consist of all body masses less than 64.5 kg. Therefore, to estimate the number
of patients whose body masses are less than 65 kg, we obtain the cumulative frequency which
corresponds to the point 64.5 kg on the horizontal axis. From Fig. 2.2, we find that 33 patients
have body masses less than 65 kg.
(ii) To estimate the number of patients whose body masses are at least 75 kg, we first estimate
the number of patients whose body masses are less than 75 kg. Now, the upper boundary of
the interval “less than 75” is 74.5. From Fig. 2.2, the cumulative frequency which corresponds
to the point 74.5 kg on the horizontal axis, is 44. It follows that 44 patients have body masses
less than 75 kg. Thus, the number of patients whose body masses are at least 75 kg is (50 –
44) = 6.
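The construction of a cumulative frequency distribution, and the subtraction used for "at least" questions, can be sketched in Python as follows. The class boundaries and frequencies are hypothetical illustration values (they are not the data of Table 2.21), and the sketch only answers questions at exact class boundaries, whereas the curve in Fig. 2.2 also allows readings between boundaries.

```python
from itertools import accumulate

# Hypothetical grouped data (illustration only, not Table 2.21).
upper_boundaries = [54.5, 59.5, 64.5, 69.5, 74.5, 79.5]
frequencies = [3, 8, 15, 14, 7, 3]

# Running totals give the "less than" cumulative frequencies.
cumulative = list(accumulate(frequencies))
total = sum(frequencies)
assert cumulative[-1] == total  # accuracy check described in the text

# Points to plot: a zero-frequency class is added before the first class.
points = [(49.5, 0)] + list(zip(upper_boundaries, cumulative))


def at_least(boundary):
    """Number of observations at or above an exact class boundary."""
    less_than = dict(points)[boundary]
    return total - less_than


print(cumulative)      # [3, 11, 26, 40, 47, 50]
print(at_least(74.5))  # 3
```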
Frequency polygon
A grouped frequency table can also be represented by a frequency polygon, which is a special
kind of line graph. To construct a frequency polygon, we plot a graph of class frequencies
against the corresponding class mid-points and join successive points with straight lines. Fig.
2.3, on the next page, shows the frequency polygon for the data in Table 2.16, on page 33.
Fig. 2.3: Frequency polygon of the data in Table 2.16
Notice that the polygon is brought down to the horizontal axis at both ends, at the points that
would be the mid-points of additional class intervals (with zero frequency) at each end of the
corresponding histogram. This makes the area under a frequency polygon equal to the area
under the corresponding histogram.
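The following Python sketch illustrates how the polygon is closed at both ends and checks the equal-area property numerically. It assumes equal class widths; the function name polygon_points and the three frequencies are hypothetical illustration values, not the data of Table 2.16.

```python
def polygon_points(lower_boundary, class_width, frequencies):
    """(mid-point, frequency) pairs for equal-width classes.

    One zero-frequency class is added at each end so that the
    polygon is brought down to the horizontal axis.
    """
    padded = [0] + list(frequencies) + [0]
    points = []
    for i, freq in enumerate(padded, start=-1):
        mid_point = lower_boundary + (i + 0.5) * class_width
        points.append((mid_point, freq))
    return points


# Hypothetical data: three classes of width 10 starting at 10.
example_points = polygon_points(10, 10, [4, 9, 5])
print(example_points)
# [(5.0, 0), (15.0, 4), (25.0, 9), (35.0, 5), (45.0, 0)]

# Area under the polygon (sum of trapezia) versus area of the histogram.
polygon_area = sum(
    (x2 - x1) * (y1 + y2) / 2
    for (x1, y1), (x2, y2) in zip(example_points, example_points[1:])
)
histogram_area = 10 * sum([4, 9, 5])
print(polygon_area, histogram_area)  # 180.0 180
```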
Fig. 2.4 shows the frequency polygon of Fig. 2.3 superimposed on the corresponding
histogram. This figure allows us to see, for the same set of data, the relationship between the
two graphic forms.
Fig. 2.4: Histogram and frequency polygon of the data in Table 2.16
Stem-and-leaf plot
A stem-and-leaf plot is a graphical device that is useful for representing a relatively small set
of data which takes numerical values. To construct a stem-and-leaf plot, we partition each
measurement into two parts. The first part is called the stem, and the second part is called the
leaf. The stem consists of one or more of the initial digits of the measurement, and the leaf
consists of one or more of the remaining digits. The stems
form an ordered column with the smallest stem at the top and the largest at the bottom. The
stems are separated from their leaves by a vertical line. We include in the stem column all
stems within the range of the data even when a measurement with that stem is not in the data
set. The rows of a stem-and-leaf plot contain the leaves, ordered and listed to the right of their
respective stems. When leaves consist of more than one digit, all digits after the first may be
omitted. Decimals, when present in the original data, are omitted in a stem-and-leaf plot.
A stem-and-leaf plot conveys information similar to that of a histogram. Turned on its side, it
has the same shape as the histogram. In fact, since the stem-and-leaf plot shows each observation,
it displays information that is lost in a histogram. A properly constructed stem-and-leaf plot,
like a histogram, provides information regarding the range of the data set, shows the location
of the highest concentration of measurements, and reveals the presence or absence of
symmetry. An advantage of a stem-and-leaf plot over a histogram is that it preserves
the information contained in the individual measurements. Such information is lost when we
construct a grouped frequency table. Another advantage of a stem-and-leaf plot is that it can
be constructed during the tallying process, so the intermediate step of preparing an ordered
array is eliminated.
Stem-and-leaf plots are useful for quick portrayal of a small data set. As the sample size
increases, you can accommodate the increase in leaves by splitting the stems. For instance,
you can list each stem twice, putting leaves of 0 to 4 on one line and leaves of 5 to 9 on
another. When a number has several digits, it is simplest for graphical portrayal to drop the
last digit or two. For instance, for a stem-and-leaf plot of annual income in thousands of
Ghana cedis, a value of GH¢27.1 thousand has a stem of 2 and a leaf of 7, and a value of
GH¢106.4 thousand has a stem of 10 and a leaf of 6.
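The rule of dropping the trailing digit can be sketched in Python with the two income values quoted above. The function name stem_and_leaf_of is a hypothetical helper, and the sketch assumes non-negative values.

```python
def stem_and_leaf_of(value):
    """Split a value into (stem, leaf) after dropping the decimal part."""
    whole = int(value)        # 27.1 -> 27, 106.4 -> 106
    return divmod(whole, 10)  # (stem, leaf)


print(stem_and_leaf_of(27.1))   # (2, 7)
print(stem_and_leaf_of(106.4))  # (10, 6)
```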
Example 2.6
The following are the marks scored by 30 candidates in an English test. Construct a stem-
and-leaf plot for the data.
Solution
Since all the measurements are two-digit numbers, we will have one-digit stems and one-digit
leaves. For example, the mark 85 has a stem of 8 and a leaf of 5. Fig. 2.5 is the required stem-
and-leaf plot. The four numbers in the first row represent 52, 53, 56 and 56.
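A stem-and-leaf plot of two-digit marks can be built with a few lines of Python. This is only a sketch: the function name stem_and_leaf is hypothetical, and the marks below are made-up illustration values, not the 30 marks of Example 2.6.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot for non-negative integers."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    rows = []
    # Include every stem within the range of the data, even one without leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + " ".join(str(leaf) for leaf in leaves[stem])
        rows.append(row.rstrip())
    return rows


# Hypothetical marks (illustration only).
marks = [85, 52, 56, 71, 53, 56, 90, 78, 74]
plot_rows = stem_and_leaf(marks)
print("\n".join(plot_rows))
# 5 | 2 3 6 6
# 6 |
# 7 | 1 4 8
# 8 | 5
# 9 | 0
```

The stem 6 is listed even though no mark in the sixties occurs, in line with the rule that all stems within the range of the data are shown.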
Bar chart
A bar chart is a diagram consisting of a series of horizontal or vertical bars of equal width.
The bars represent various categories of the data. There are three types of bar charts, and these
are simple bar charts, component bar charts and grouped bar charts.
(i) Simple bar chart
In a simple bar chart, the height (or length) of each bar is equal to the frequency it represents.
Example 2.7
Table 2.23 gives the production of timber in five districts of Ghana in a certain year. Draw a
bar chart to illustrate the data. The bars are separated to emphasize that the variable is
qualitative rather than quantitative.
Fig. 2.6: A simple bar chart for the data in Table 2.23
(ii) Component bar chart
In a component bar chart, the bar for each category is subdivided into component parts; hence
its name. Component bar charts are therefore used to show the division of items into
components. This is illustrated in the following example.
Example 2.8
Table 2.24 shows the distribution of sales of agricultural produce from Asiedu Farm in 1995,
1996 and 1997. Illustrate the information with a component bar chart.
Solution
Fig. 2.7, on the next page, shows a component bar chart for the data. The sales of agricultural
produce consist of three components: the sales of coffee, cocoa, and palm oil. The component
bar chart shows the changes of each component over the years as well as the comparison of
the total sales between different years.
Fig. 2.8: A grouped bar chart of the data in Table 2.24
Pie Charts
A pie chart is a circular graph divided into sectors, each sector representing a different value
or category. The angle of each sector of a pie chart is proportional to the value of the part of
the data it represents. The bar chart is more precise than the pie chart for visual comparison
of categories with similar relative frequencies.
The following are the steps for constructing a pie chart:
(1) Find the sum of the category values.
(2) Calculate the angle of the sector for each category, using the following result:
\[ \text{angle of the sector for category A} = \frac{\text{value of category A}}{\text{sum of category values}} \times 360^\circ \]
(3) Construct a circle and mark the centre.
(4) Use a protractor to divide the circle into sectors, using the angles obtained in step 2.
(5) Label each sector clearly.
Example 2.10
A housewife spent the following sums of money on buying ingredients for a family Christmas
cake in 2020.
Flour .................................... GH¢24
Margarine ............................ GH¢96
Sugar .................................... GH¢18
Eggs ..................................... GH¢60
Baking powder..................... GH¢12
Miscellaneous ...................... GH¢30
Represent the above information on a pie chart.
Solution
The angles of the sectors are calculated as shown in Table 2.25. Fig. 2.9, on the next page,
shows the required pie chart.
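The sector angles for this example can be checked with a short Python sketch that applies the formula in step (2) to the amounts listed above. The function name sector_angles is a hypothetical helper.

```python
def sector_angles(values):
    """Angle (in degrees) of the pie-chart sector for each category."""
    total = sum(values.values())
    return {category: value * 360 / total for category, value in values.items()}


# Amounts in GH¢ from Example 2.10.
cake_costs = {
    "Flour": 24,
    "Margarine": 96,
    "Sugar": 18,
    "Eggs": 60,
    "Baking powder": 12,
    "Miscellaneous": 30,
}

angles = sector_angles(cake_costs)
for category, angle in angles.items():
    print(f"{category}: {angle:.0f} degrees")
print("Sum of angles:", sum(angles.values()))  # Sum of angles: 360.0
```

The total spent is GH¢240, so each cedi corresponds to 1.5 degrees; the angles are 36, 144, 27, 90, 18 and 45 degrees, which add up to 360 degrees.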
Table 2.25: Work table for computing the angles of the sectors of a pie chart
Fig. 2.9: A pie chart of the data in Example 2.10
Applications of Statistics
Statistics and Sociology
Sociology is one of the social sciences aiming to discover the basic structure of human
society, to identify the main forces that hold groups together or weaken them and to learn the
conditions that transform social life. It highlights and illuminates aspects of social life that
otherwise might be only obscurely recognized and understood. The sociologist may be called
upon for help with a special problem such as social conflict, urban plight, or the war on poverty
or crime. His practical contribution lies in the ability to clarify the underlying nature of
social problems, to estimate more exactly their dimensions, and to identify the aspects that seem
most amenable to remedy with the knowledge and skills at hand. He naturally turns to
sociological research, which is the purposeful effort to learn more about society than one can
in the ordinary course of living. Keeping the problem in view, he sets out his objectives,
collects materials or data, and uses statistical techniques, together with the knowledge and
theory already established on similar topics, to achieve those objectives. Statistical data and
statistical methods are therefore indispensable for sociological research. There has recently
been a growing emphasis on social survey methods and research methodology in all faculties of arts.
Sociologists use statistical tools to study cultural change in society, family patterns,
prostitution, crime, marriage systems, etc. They also study statistically the relation
between prostitution and poverty, crime and poverty, drunkenness and crime, illiteracy and
crime, etc. Thus, statistics is of immense use in various sociological studies.
Statistics and Government
The functions of a government are varied and complex. Various departments in the state are
required to collect and record statistical data in a systematic manner for effective
administration. Data on population, natural resources, agricultural and industrial production,
finance, trade, exports and imports, prices, labour, transport and communication, health,
education, defence, crime, etc. are the most fundamental requirements of the state for its
administration. It is only on the basis of such data that the government decides on priority
areas, gives them more attention through target-oriented programmes, and studies the impact
of those programmes to guide future policy.
fluctuations and the underlying causes. Thus, statistics is indispensable in economic
analysis.
DATA CLASSIFICATION
(a) Primary vs. Secondary Data
Data can be collected in two different ways. One way is to collect data directly from the
respondent. The person who answers the questions of the investigator is called the respondent.
Statistical information collected in this way is called primary data, and the source of such
information is called a primary source. These data are original because they are collected for
the first time by the investigator himself. For example, if the investigator collects information
about the salaries of National Institute of Open Schooling employees by approaching them, then it is
primary data for him. The other way is to use data already collected by someone else.
The investigator only adopts the data. Statistical information obtained in this way is called
secondary data, and the source of such information is called a secondary source. For example, if the
investigator collects information about the salaries of employees of the National Institute of
Open Schooling from the salary register maintained by its accounts branch, then it is
secondary data for him.
(c) Sources of secondary data: As already discussed, secondary data are not collected by the
investigator himself but are obtained by him from another source. Broadly, there are two
sources:
(a) Published data and
(b) Unpublished data.
I. Published Sources: There are certain agencies which collect the data and publish them in
the form of either regular journals or reports. These agencies/sources are known as published
sources of data.
Sources of secondary macro-economic data in Nigeria include:
(i) Governmental Agencies: National Bureau of Statistics (NBS), Central Bank of
Nigeria (CBN), National Agricultural Extension and Research Liaison Services
(NAERLS) etc.;
(ii) International Agencies: The World Bank, IMF, FAO, IFDC etc.; and several
non-governmental organizations.
II. Unpublished Sources: Secondary data are also available from unpublished sources,
because not all statistical data are published. For example, information recorded in
various government and private offices, studies made by research scholars, etc. can be
important sources of secondary data.
The above data are unorganized. To refine this data for comparison and analysis it should be
arranged in an orderly sequence or into groups on the basis of some similarity. This whole
process of arranging and grouping the data into some meaningful arrangement is a first step
towards analysis of data. Data can be arranged in two forms:
(a) Arrays and
(b) Frequency distributions.
(a) Arrays
One method of presenting an individual series is a simple array of data. An orderly arrangement
of raw data is called an ‘array’. Arrays are of two types:
(i) Simple array, and
(ii) Frequency array.
(i) Simple Array: A simple array is an arrangement of data in ascending or descending order.
Let us construct the simple arrays of the data about the marks of 40 students. The data in table
6.1 is arranged in ascending order and in table 6.2 in descending order.
The above arrays reveal two points clearly. First, the highest mark obtained by any student
is 58. Second, the lowest mark obtained by any student is 20.
Organising the data in the form of a simple array is convenient if the number of items is small.
As the number of items increases, the series becomes too long and unmanageable, so there
is a need to condense the data. Making a frequency array is one method of condensing data.
(ii) Frequency Array: A frequency array is a series formed on the basis of the frequency with
which each item is repeated in the series. The main steps in constructing a frequency array are:
1. Prepare a table with three columns: the first for the values of the items, the second for the
tally marks and the third for the corresponding frequencies. Frequency means the number of
times a value appears in a series. For example, in table 6.1 the mark 43 appears five times,
so the frequency of 43 is 5.
2. Put the items in the first column in ascending order, in such a way that each item is recorded
once only.
3. Prepare the tally sheet in the second column, marking one bar for each item. Make blocks of
five tally bars to avoid mistakes in counting. Note that every fifth bar is drawn across the
previous four bars (////).
4. Count the tally bars and record the total number in the third column. This column gives
the frequencies of the corresponding items.
Let us now construct the frequency array of the marks obtained by 40 students. In table 6.3
the marks are arranged in ascending order in the first column. This not only helps to find the
maximum and minimum values but also makes it easy to draw the tally bars.
Now, for each occurrence of a mark, make one bar (/) in the second column and cross the item
off the data.
Table 6.3 Frequency array of marks obtained by 40 students.
35 2
36 1
37 1
38 2
39 1
40 3
41 2
42 2
43 5
45 1
46 2
47 2
48 1
49 1
50 1
51 1
53 1
54 1
56 1
58 1
Total Frequency =40
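The tallying procedure above can be sketched in Python with collections.Counter. The function name frequency_array is a hypothetical helper, and the marks are a small made-up list for illustration, not the 40 marks of table 6.1. For simplicity the tally is printed as plain bars rather than in blocks of five.

```python
from collections import Counter


def frequency_array(items):
    """Return (value, tally, frequency) rows in ascending order of value."""
    counts = Counter(items)
    return [(value, "/" * freq, freq) for value, freq in sorted(counts.items())]


# Hypothetical marks (illustration only).
sample_marks = [43, 40, 43, 35, 40, 43, 38, 35, 40, 43, 43]

rows = frequency_array(sample_marks)
for value, tally, freq in rows:
    print(f"{value:>5}  {tally:<6} {freq}")
print("Total frequency =", sum(freq for _, _, freq in rows))
```

As in the table above, the total of the frequency column must equal the number of items, which serves as a check on the tally.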
The main limitation of a frequency array is that it does not give an idea of the characteristics
of a group. For example, it does not tell us how many students obtained marks
between 40 and 45. It is therefore not possible to compare the characteristics of different groups.
This limitation is removed by a frequency distribution.
FREQUENCY DISTRIBUTION
Data in a frequency array is ungrouped data. To group the data, we need to make a ‘frequency
distribution’. A frequency distribution classifies the data into groups. For example, it tells us
how many students have secured marks between 40 and 45.
1. Class: A class is a group of magnitudes bounded by two ends called class limits. For example,
20-25, 25-30, etc. or 20-24, 25-29, etc., as the case may be, each represent a class.
2. Class Limits: Every class has two boundaries or limits, called the lower limit (L1) and the
upper limit (L2). For example, in the class (20-30), L1 = 20 and L2 = 30.
3. Class Interval: The difference between the two limits of a class is called the class interval.
It is equal to the upper limit minus the lower limit, and is also called the class width.
Class interval = L2 – L1. For the class (20-30), the class interval is 30 – 20 = 10.
4. Class Frequency: The total number of items falling in a class, that is, having values between
L1 and L2, is the class frequency. For example, in table 6.4 the class frequency of the class
(40-45) is 10. Similarly, the frequency of the class (50-55) is 4.
5. Mid-Point/Mid-Value (M.V.): The mid-value of a class, also called the mid-point, is obtained
by dividing the sum of the lower limit and the upper limit of the class by 2. It is the average
of the two limits of a class and falls exactly in the middle of the class:
\[ \text{M.V.} = \frac{L_1 + L_2}{2} \]
For example, the mid-value of the class (20-30) is \( \frac{20 + 30}{2} = 25 \).
An item with the value 25 is counted in the next class (25-30), as is clear from the
following example. Using the same data as for the frequency array and taking a
class interval of 5, a frequency distribution of the exclusive type is as follows:
(b) Inclusive Series: In this type, the lower limit of the next class is one more than the
upper limit of the previous class. Items with values equal to either the lower or the upper limit
of a class are counted in that same class. That is why such a frequency distribution is
called the inclusive type. For example, in the class (20-24) both 20 and 24 are included in the
same class. Similarly, in the class (40-44) both 40 and 44 are included. The following table
has been formed from the same data as the exclusive type.
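The difference between the exclusive and inclusive types can be sketched in Python as a rule for deciding which class a mark falls in. The function names exclusive_class and inclusive_class are hypothetical, and the sketch assumes whole-number marks, classes starting at 20 and a class interval of 5.

```python
def exclusive_class(value, start=20, width=5):
    """Class (lower, upper) of an exclusive series: lower <= value < upper."""
    lower = start + ((value - start) // width) * width
    return (lower, lower + width)


def inclusive_class(value, start=20, width=5):
    """Class (lower, upper) of an inclusive series: lower <= value <= upper."""
    lower = start + ((value - start) // width) * width
    return (lower, lower + width - 1)


print(exclusive_class(25))  # (25, 30): 25 goes to the next class, not (20, 25)
print(inclusive_class(24))  # (20, 24): the upper limit belongs to its own class
print(inclusive_class(40), inclusive_class(44))  # (40, 44) (40, 44)
```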
(c) Open-end Classes: An open-end frequency distribution is one that has at least one of its
ends open. In such a series, either the lower limit of the first class, or the upper limit of
the last class, or both, are not given. In table 6.6 the first class and the last class, i.e.
below 25 and 55 and above, are open-end classes.
(d) Unequal Classes: In a frequency distribution with unequal classes, the widths of the
classes (i.e. L2 – L1) need not be the same. In table 6.7, the class (30-40) has width 10 while
the class (40-55) has width 15.
Table 6.7: Unequal Classes Frequency Distribution
(ii) From below, such as 2, 6 (i.e. 2 + 4), 14 (i.e. 6 + 8), 24 (i.e. 14 + 10) and so on. Such a
distribution is called a ‘More-than’ cumulative frequency distribution. It shows the total
number of observations (frequencies) having more than a particular value of the variable (here
marks). For example, there are 6 (i.e. 2 + 4) students who got marks more than 50, 14 (i.e. 2
+ 4 + 8) students who got marks more than 45, etc. See table 6.9.
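The "from below" accumulation can be sketched in Python using the four frequencies quoted above (10, 8, 4 and 2, read from the lowest of these classes to the highest). The lower limits 40, 45, 50 and 55 are an assumption inferred from the text ("more than 50" and "more than 45"); they are not copied from table 6.9.

```python
from itertools import accumulate

# Assumed lower limits and the class frequencies quoted in the text.
lower_limits = [40, 45, 50, 55]
frequencies = [10, 8, 4, 2]

# Accumulate from the last class upwards, then restore the original order.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for limit, count in zip(lower_limits, more_than):
    print(f"More than {limit}: {count}")
# More than 40: 24
# More than 45: 14
# More than 50: 6
# More than 55: 2
```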
SELF REVIEW QUESTIONS
1. For each of the following variables, state whether it is quantitative or qualitative and specify
the measurement scale that is employed when taking measurements on each.
(a) gender of babies born in a hospital, (b) marital status,
(c) temperature measured on the Kelvin scale, (d) nationality,
(e) masses of babies in kg, (f) temperature in °C,
(g) prices of items in a shop, (h) position in an exam.
(i) the rank of an academic staff in a university.
2. For each of the following situations, answer questions (a) through (d):
(a) What is the variable in the study? (b) What is the population?
(c) What is the sample size? (d) What measurement scale was used?
A. A study of 150 students from St. Ann School showed that 10% of the students had blood
group A.
B. A study of 100 patients admitted to St. Paul’s Hospital showed that 25 patients lived 8 km
from the hospital.
C. A study of 50 teachers in Town A showed that 5% of the teachers earn N8000.00 per
month.
3. A team of ornithologists is doing field research, using a mist net to capture migrating
birds. They collect the following information:
(a) Species, (b) Weight, (c) Wing span, (d) Condition (poor, fair, good, or excellent),
(e) Band ID number, (f) Approximate age.
Indicate whether each of these is an attribute measure or a variable measure.
4. Explain what is meant by inferential statistics.
5. Define the following terms:
(a) population, (b) qualitative variable,
(c) discrete variable, (d) sample,
(e) continuous variable, (f) quantitative variable.
6. For each of the following, indicate whether it is a discrete or a continuous variable.
(a) The number of minutes it takes to read a page in this text.
(b) The number of chapters in the text.
(c) The weight of the text.
(d) The number of problems in the text.
(e) The number of times the letter e appears on a page.
(f) The length of a page in inches.
7. Suppose that the following information is obtained from Ms Ofosu on her application for
a home mortgage. For each response, indicate whether it is a continuous variable and which
type of measurement scale it represents.
(a) Place of residence: in Accra.
(b) Type of residence: Single family home.
(c) Date of birth: August 13, 1966.
(d) Projected monthly payments: GH¢2 479.
(e) Occupation: Director of Food and Drug Board.
(f) Employer: Methodist University.
(g) Number of years at Job: 10.
(h) Annual income: GH¢140 000.
(i) Amount of mortgage requested: GH¢220 000.
8. Which scale of measurement (nominal, ordinal, or interval) is most appropriate for
(a) Attitude toward legalization of marijuana (favour, neutral, oppose).
(b) Gender (male, female).
(c) Number of children in a family (0, 1, 2, …).
(d) Political party affiliation (APP, PDP, CPP).
(e) Religious affiliation (Catholic, Jewish, Protestant, Muslim, Others).
(f) Political philosophy (very liberal, somewhat liberal, moderate, somewhat conservative,
very conservative).
(g) Years of school completed (0, 1, 2, 3, …).
(h) Highest degree attained (none, high school, bachelor’s, master’s, doctorate).
(i) Employment status (employed full time, employed part time, unemployed).
9. Give two reasons why it is sometimes necessary to take a sample from a population.
10. State two ways of obtaining primary data.
11. State two sources of secondary data.
12. State two advantages and two disadvantages of the lottery system for taking a simple
random sample from a population.
13. State two disadvantages and one advantage of telephone interview, as a means of
collecting data.
14. Briefly describe the difference between descriptive statistics and inferential statistics.
15. A doctor examined a patient to determine the cause of a disease. He took a drop of blood
and used it to determine the state of health of the patient. What aspect of statistics is the doctor
employing in order to form a judgement?
16. In your own words, explain and give an example of each of the following statistical terms:
(a) population, (b) sample.
17. Mrs. Akrong wants to check whether the pot of soup she is cooking has the right taste and
quantity of salt. She did this by tasting a small portion of the soup scooped in a ladle. What
aspect of statistics is she employing in order to form a judgement? Briefly explain why she
decided to use this particular method.
18. Explain the difference between qualitative and quantitative data. Give examples of
qualitative and quantitative data.
19. List the four levels of measurement and give examples.
20. Explain the difference between:
(a) nominal and ordinal data, (b) a census and a sample survey.
21. Clusters versus strata
(a) With a cluster random sample, do you take a sample of (i) the clusters? (ii) the subjects
within every cluster?
(b) With a stratified random sample, do you take a sample of (i) the strata? (ii) the subjects
within every stratum?
(c) Summarize the main differences between cluster sampling and stratified sampling in terms
of whether you sample the groups or sample from within the groups that form the clusters or
strata.
22. A class has 50 students. Use the column of the first two digits in the random number table
(Table 1.2) to select a simple random sample of three students. If the students are numbered
01 to 50, what are the numbers of the three students selected?
23. In a cluster random sample with equal-sized clusters, every subject has the same chance of
selection. However, the sample is not a simple random sample. Explain why not.
24. The following are the ages of 30 patients seen in the emergency room of a hospital on a
Monday night. Construct a stem-and-leaf plot for the data.
25. The following table gives the ages (in years) of 60 cancer patients.
A histogram is drawn to represent this data. If the height of the rectangle representing the fifth
class interval is 2 cm, find the heights of the rectangles representing the first, second and the
third class intervals. Construct a histogram to represent the data.
26. The following table gives the distribution of the heights of 100 children, to the nearest
centimetre.
Draw a cumulative frequency curve for the data and use it to estimate:
(a) the number of children whose heights are between 142 cm and 152 cm (inclusive),
(b) the number of children whose heights are greater than 156 cm.
27. The following table gives the distribution of the marks scored by 40 students in an
examination.
Draw a cumulative frequency curve for the data and use it to estimate:
(a) the number of students who scored between 42% and 62% (inclusive),
(b) the least mark a student must score if he/she is to be placed in the top 25% of the class.
28. The following table gives the enrolments in primary and secondary schools in Kenya.
Construct a stem-and-leaf plot of the data (for a boy whose height is 162 cm, record this as
a stem of 16 and a leaf of 2).
30. The following table shows the amount of rainfall in Asarekrom during the first five
months of 2006. Construct a bar chart to illustrate the data.
31. The following table gives the frequency distribution of the results of an examination taken
by students from two schools M and N. Construct a grouped bar chart to represent this
information.
32. The following information gives the proportion in which Yaro spends his annual salary.