Principles of Statistics for Business
Principles of Statistics
Faculty of Commerce
South Valley University
Department of Quantitative Methods
2022 / 2023
Preface
Chapter 2
Organizing and Graphing Data
Organizing and Graphing Qualitative Data; Frequency
Distribution of Qualitative Data; Bar Chart; Pie Chart;
Organizing and Graphing Quantitative Data; Cumulative
Distribution; Graphic Presentation of Quantitative Data;
Bar Chart; Histogram; Polygon (Line Chart); Exercises
Chapter 3
Measures of Central Tendency
Mean; Mean of Ungrouped Data; Mean of Grouped Data;
Median; Median of Ungrouped Data; Median of Grouped
data; Quartiles, Deciles and Percentiles; Mode; Mode of
Ungrouped Data; Mode of Grouped Data; Relationship
Between Mean, Median and Mode; Characteristics of the
Mean; Characteristics of the Median; Characteristics of
the Mode; Effect of Shifting and Scaling; Exercises
Chapter 4
Measures of Dispersion
Range; Range of Ungrouped Data; Range of Grouped
Data; Characteristics of the Range; Interquartile Range;
Quartile Deviation; Mean Deviation; Standard Deviation
and Variance; Standard Deviation and Variance of
Ungrouped Data; Standard Deviation and Variance of
Grouped Data; Measures of Relative Dispersion;
Coefficient of Variation; Quartile Coefficient of Variation;
Standardized Value; Coefficient of Skewness; Exercises
Chapter 5
Correlation Analysis
Scatterplots (Scatter Diagram); Pearson’s Correlation
Coefficient; Characteristics of Pearson’s Correlation
Coefficient; Coefficient of Determination; Spearman’s
Correlation Coefficient; Characteristics of Spearman’s
Correlation Coefficient; Yule’s Coefficient of Association;
Contingency Coefficient (Cramer’s Coefficient);
Exercises
References
Chapter (1)
Introduction
Population:
Consists of all elements, individuals, items or objects whose
characteristics are being studied.
Sample:
A portion of the population, or a few elements selected from
a population.
Random Sample:
A sample drawn in such a way that each element of the population has some
chance of being selected.
1.3 Basic Terms:
It is very important to understand the meaning of some basic
terms that will be used frequently in this text. This section explains
the meaning of an element (or member), a variable, a constant,
an observation, and a data set.
Element:
Is a specific subject, or object about which the information is
collected.
Variable:
Is a characteristic under study that assumes different values for
different elements.
Constant:
A characteristic whose value is fixed, that is, the same for all elements.
Observation:
The value of a variable for an element is called an observation
or measurement.
Data Set:
Is a collection of observations on one or more variables.
A- Quantitative Variables:
Quantitative Variable:
A variable that can be measured numerically.
Income, height, gross sales, price of a home, number of cars
owned, and number of accidents are examples of quantitative
variables because each of them can be expressed numerically.
Such quantitative variables may be classified as either discrete
variables or continuous variables.
For example, the number of cars sold on any given day at a car
dealership is a discrete variable because the number of cars sold
must be 0, 1, 2, 3, . . . and we can count it. The number of cars
sold cannot be between 0 and 1, or between 1 and 2. Other
examples of discrete variables are the number of people visiting
a bank on any day, the number of cars in a parking lot, the number
of cattle owned by a farmer, and the number of students in a class.
(ii) Continuous Variable:
Continuous Variable:
A variable that can assume any numerical value over a certain
interval.
Qualitative data:
The data collected on a qualitative variable are called
qualitative data.
Types of variables:
Variable
   Qualitative (or Categorical)
   Quantitative
      Discrete
      Continuous
Problems
1- Briefly describe the two meanings of the word statistics.
2- Briefly explain the types of statistics.
3- Briefly explain the terms: population, sample, representative
sample.
4- The following table lists the number of deaths by cause
Cause of Death Number of Deaths
Heart disease 611,105
Cancer 584,881
Accidents 130,557
Stroke 128,978
Alzheimer’s disease 84,767
Diabetes 75,578
Influenza and Pneumonia 56,979
Suicide 41,149
Briefly explain the meaning of an element, a variable,
an observation, and a data set with reference to the information
in this table.
5- Explain the meaning of the following terms.
a. Quantitative variable
b. Qualitative variable
c. Discrete variable
d. Continuous variable
6- Indicate which of the following variables are quantitative and
which are qualitative.
a. The amount of time a student spent studying for an exam.
b. The amount of rain last year in 30 cities.
c. The arrival status of an airline flight (early, on time, late,
canceled) at an airport.
d. A person’s blood type.
e. The amount of gasoline put into a car at a gas station.
f. The number of customers in the line waiting for service at
a bank at a given time
7- A survey of families living in a certain city was conducted to
collect information on the following variables: age of the oldest
person in the family, number of family members, number of
males in the family, number of females in the family, whether
or not they own a house, income of the family, whether or not
the family took vacations during the past one year, whether or
not they are happy with their financial situation, and the amount
of their monthly mortgage or rent.
a. Which of these variables are qualitative variables?
b. Which of these variables are quantitative variables?
c. Which of the quantitative variables of Part b are discrete
variables?
d. Which of the quantitative variables of Part b are continuous
variables?
Chapter (2)
Organizing and Graphing Data
In addition to hundreds of private organizations and individuals,
a large number of government agencies conduct hundreds of
surveys every year. The data collected from each of these surveys
fill hundreds of thousands of pages. In their original form, these
data sets may be so large that they do not make sense to most of
us. Descriptive statistics, however, supplies the techniques that
help to condense large data sets by using tables, graphs, and
summary measures. At a glance, these tabular and graphical
displays present information on every aspect of life.
Consequently, descriptive statistics is of immense importance
because it provides efficient and effective methods for
summarizing and analyzing information.
2.1.1 Raw Data:
Definition:
Raw Data: Data recorded in the sequence in which they are
collected and before they are processed or ranked are called
“raw data”.
Table 2.2
Status of 50 students
J F SO SE J J SE J J J
F F J F F F SE SO SE J
J F SE SO SO F J F SE SE
SO SE J SO SO J J SO F SO
SE SE F SE J SO F J SO SO
Table 2.3
Worries about not having
enough money to pay normal
monthly bills
Response Adults
Very worried 162
Moderately worried 203
Not too worried 305
Not worried at all 325
Others 20
Total 1015
The percentage for a category is obtained by multiplying the
relative frequency of that category by 100. A percentage
distribution lists the percentages for all categories.
Calculating Percentage:
Percentage = (Relative frequency) × 100%
Table 2.4
Relative frequency and percentage distributions of
worries about not having enough money to pay
normal monthly bills
Response              Relative Frequency     Percentage
Very worried          162/1015 = 0.16        0.16(100) = 16
Moderately worried    203/1015 = 0.20        0.20(100) = 20
Not too worried       305/1015 = 0.30        0.30(100) = 30
Not worried at all    325/1015 = 0.32        0.32(100) = 32
Others                20/1015 = 0.02         0.02(100) = 2
Total                 1                      100%
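The relative frequency and percentage calculations can be checked with a few lines of Python. This is a minimal sketch: the frequencies are copied from Table 2.3, while the names `frequencies` and `relative_frequencies` are illustrative and do not come from the text.

```python
# Frequencies copied from Table 2.3 (number of adults per response).
frequencies = {
    "Very worried": 162,
    "Moderately worried": 203,
    "Not too worried": 305,
    "Not worried at all": 325,
    "Others": 20,
}


def relative_frequencies(freqs):
    """Return {category: frequency / total} for a frequency table."""
    total = sum(freqs.values())
    return {category: f / total for category, f in freqs.items()}


rel = relative_frequencies(frequencies)
# Percentage = (relative frequency) x 100
percentages = {category: 100 * r for category, r in rel.items()}

for category in frequencies:
    print(f"{category:<20} {rel[category]:.2f} {percentages[category]:.0f}%")
```

Rounding to two decimals reproduces the entries of Table 2.4; the unrounded relative frequencies sum to exactly 1 up to floating-point error.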
Bar Chart:
To construct a bar chart, we mark the various categories on the
horizontal axis as in Figure 2.1. Note that all categories are
represented by intervals of the same width. We mark the
frequencies on the vertical axis. Then we draw one bar for each
category such that the height of the bar represents the frequency
of the corresponding category. We leave a gap between adjacent
bars. Figure 2.1 gives the bar chart for the frequency distribution
of Table 2.3.
Figure 2.1
Bar Chart for the Frequency Distribution of Table 2.3.
Definition:
Bar Chart: A graph made of bars whose heights represent the
frequencies of respective categories is called a bar chart.
Definition:
Pie Chart: A circle divided into portions that represent the
relative frequencies or percentages of a population or a sample
belonging to different categories is called a pie chart.
Figure 2.2 shows the pie chart for the percentage distribution of
Table 2.3.
[Pie chart: Worries about not having enough money to pay normal monthly bills. Very worried 16%, moderately worried 20%, not too worried 30%, not worried at all 32%, others 2%.]
Figure 2.2
Pie Chart for the Percentage Distribution of Table 2.3
Table 2.5
Family size of 20 families
3 1 4 6 3 1 2 5 3 2
4 3 4 3 5 2 4 3 3 5
Table 2.7
Weekly Earnings of 100
Employees of a Company
Weekly Earnings (Classes)    Number of Employees (f)
800 – 4
1000 – 11
1200 – 39
1400 – 24
1600 – 16
1800 – 2000 6
Total 100
the classes always represent a variable. As we can observe, the
classes are non-overlapping; that is, each value for earnings
belongs to one and only one class and there are no gaps between
any two successive intervals. The second column in the table lists
the number of employees who have earnings within each class.
For example, 4 employees of this company earn 800 to less than
1000 per week. The numbers listed in the second column are
called the frequencies, which give the number of data values that
belong to different classes. The frequencies are denoted by
fi (the frequency of the ith class), where i = 1, 2, 3, …, c and c is
the number of classes.
For quantitative data, the frequency of a class represents the
number of values in the data set that fall in that class. Table 2.7
contains six classes. Each class has a lower limit and an upper
limit. The values 800, 1000, 1200, 1400, 1600, and 1800 give the
lower limits, and implicitly the upper limits are the lower limits of
the next classes.
Frequency Distribution for Quantitative Data:
Definition:
Frequency Distribution for Quantitative Data:
A frequency distribution for quantitative data lists all the classes
and the number of values that belong to each class. Data
presented in the form of a frequency distribution are called
grouped data.
Definition:
Class Midpoint:
$\text{Class Midpoint (or Mark)} = \dfrac{\text{Lower limit} + \text{Upper limit}}{2}$
Thus, the midpoint of the first class in Table 2.7 is
calculated as follows:
$\text{Midpoint of the first class} = \dfrac{800 + 1000}{2} = 900$
The class midpoints for the frequency distribution of Table 2.7 are
listed in the third column of Table 2.8.
Table 2.8
Class Widths and Class Midpoints for Table 2.7
Class Limits Class Width Class Midpoint
800 – 200 900
1000 – 200 1100
1200 – 200 1300
1400 – 200 1500
1600 – 200 1700
1800 – 2000 200 1900
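The widths and midpoints in Table 2.8 follow directly from the class limits. The sketch below assumes, as the text states, that the upper limit of each class is the lower limit of the next one; the variable names are illustrative.

```python
# Lower class limits from Table 2.7; the last class ends at 2000.
lower_limits = [800, 1000, 1200, 1400, 1600, 1800]
last_upper_limit = 2000

# The upper limit of each class is the lower limit of the next class.
upper_limits = lower_limits[1:] + [last_upper_limit]

# Class width = upper limit - lower limit
widths = [upper - lower for lower, upper in zip(lower_limits, upper_limits)]
# Class midpoint = (lower limit + upper limit) / 2
midpoints = [(lower + upper) / 2 for lower, upper in zip(lower_limits, upper_limits)]

for lower, width, midpoint in zip(lower_limits, widths, midpoints):
    print(lower, width, midpoint)
```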
Table 2.9
Weights of Students of a Class
Class Interval    Tally Mark    Frequency
38 – ||| 3
50 – |||| || 7
62 – |||| | 6
74 – |||| |||| | 11
86 – | 1
98 – 110 |||| 4
Total - 32
Example (2):
The data given below give the number of years that each of 50
workers of a small factory has worked.
Table 2.11
Years of experience of 50 workers
1.4 2.4 0.6 5.1 4.1 4.8 10.9 3.9 11.6 0.9
11.0 8.6 4.4 0.8 5.7 2.3 1.3 7.6 9.3 14.4
5.4 6.9 8.6 3.2 10.6 6.8 7.1 8.4 2.1 11.3
0.4 4.9 8.2 10.8 15.0 9.3 2.3 0.7 3.9 6.2
2.2 5.7 13.8 10.1 0.7 3.2 4.6 9.8 3.9 2.7
Table 2.12
Years of Experience of 50 Workers
Class Tally Mark Frequency
0- |||| ||| 8
2- |||| |||| | 11
4- |||| |||| 9
6- |||| 5
8- |||| || 7
10 - |||| || 7
12 - | 1
14-16 || 2
Total - 50
Table 2.13
Frequency Distribution
Of Years of Experience of 50 Workers
Class Frequency
0- 8
2- 11
4- 9
6- 5
8- 7
10- 7
12- 1
14-16 2
Total 50
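The tallying that leads from Table 2.11 to Table 2.13 can be sketched in Python. The data are copied from Table 2.11; the class width of 2 and the eight classes starting at 0 are read off Table 2.13, and each class is treated as containing its lower limit but not its upper limit. The names used are illustrative.

```python
# Years of experience of 50 workers, copied from Table 2.11.
years = [
    1.4, 2.4, 0.6, 5.1, 4.1, 4.8, 10.9, 3.9, 11.6, 0.9,
    11.0, 8.6, 4.4, 0.8, 5.7, 2.3, 1.3, 7.6, 9.3, 14.4,
    5.4, 6.9, 8.6, 3.2, 10.6, 6.8, 7.1, 8.4, 2.1, 11.3,
    0.4, 4.9, 8.2, 10.8, 15.0, 9.3, 2.3, 0.7, 3.9, 6.2,
    2.2, 5.7, 13.8, 10.1, 0.7, 3.2, 4.6, 9.8, 3.9, 2.7,
]

class_width = 2        # classes 0-, 2-, 4-, ..., 14-16
number_of_classes = 8

frequencies = [0] * number_of_classes
for value in years:
    # Floor division gives the index of the class [lower, lower + width).
    index = int(value // class_width)
    frequencies[index] += 1

for i, f in enumerate(frequencies):
    print(f"{i * class_width:>2} - {(i + 1) * class_width:<2} {f}")
```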
Example (3):
Calculate the relative frequencies and percentages for Table 2.9.
Solution:
The relative frequencies and percentages for the data in Table 2.9
are calculated and listed in the second and third columns,
respectively, of Table 2.14.
Table 2.14
Relative Frequency and Percentage
Distributions of the weights
of students of a class
Class    Relative Frequency    Percentage (%)
38 – 3/32 = 0.09375 9.375
50 – 7/32 = 0.21875 21.875
62 – 6/32 = 0.1875 18.75
74 – 11/32 = 0.34375 34.375
86 – 1/32 = 0.03125 3.125
98 – 110 4/32 = 0.125 12.5
Total 1 100 %
1- Less-Than (Ascending) Cumulative Frequency Table:
It gives the total number of values that fall below the upper limit of
each class. Cumulative frequency (CF) for a given class is
obtained by adding the frequency of this class to the frequencies
of all classes that come before it. Table 2.15 represents a less-
than cumulative frequency distribution of table 2.13.
Table 2.15
Less-Than Cumulative Frequency Distribution
for Years of Experience of 50 Workers
Class    Less-Than Cumulative Frequency    More Effective Way
Less than 0 0 0
Less than 2 0+8=8 0+8=8
Less than 4 0 + 8 + 11 = 19 8 + 11 = 19
Less than 6 0 + 8 + 11 + 9 = 28 19 + 9 = 28
Less than 8 0 + 8 + 11 + 9 + 5 = 33 28 + 5 = 33
Less than 10 0 + 8 + 11 + 9 + 5 + 7 = 40 33 + 7 = 40
Less than 12 0 + 8 + 11 + 9 + 5 + 7 + 7 = 47 40 + 7 = 47
Less than 14 0 + 8 + 11 + 9 + 5 + 7 + 7 + 1 = 48 47 + 1 = 48
Less than 16 0 + 8 + 11 + 9 + 5 + 7 + 7 + 1 + 2 = 50 48 + 2 = 50
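The "more effective way" column is a running total, which the standard library's `itertools.accumulate` computes directly. A minimal sketch using the frequencies of Table 2.13 (variable names are illustrative):

```python
from itertools import accumulate

# Class boundaries and frequencies from Table 2.13.
boundaries = [0, 2, 4, 6, 8, 10, 12, 14, 16]
frequencies = [8, 11, 9, 5, 7, 7, 1, 2]

# Nothing lies below the first boundary, so the column starts at 0;
# each later entry adds the frequency of the class that ends there.
less_than_cf = [0] + list(accumulate(frequencies))

for boundary, cf in zip(boundaries, less_than_cf):
    print(f"Less than {boundary:<3} {cf}")
```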
The following table summarizes the “less than” and “more than”
tables for the data given in Table 2.13:
2.2.5 Graphical Presentation of Quantitative Data:
Quantitative data can be displayed in a histogram or a polygon or
frequency curve. This section describes how to construct such
graphs.
Bar Chart:
A bar chart was explained in the graphical presentation of qualitative
variables, but it can also be used to display a frequency distribution
for a discrete quantitative variable that takes only a few values. Figure 2.3 represents a bar
chart for Table 2.6.
[Bar chart: Family Size (1 to 6) on the horizontal axis; frequency (0 to 6) on the vertical axis.]
Figure 2.3
Bar Chart for the Frequency
Distribution of Table 2.6.
Histogram:
A histogram can be drawn for a frequency distribution, a relative
frequency distribution, or a percentage distribution. To draw a
histogram, we first mark classes on the horizontal axis and
frequencies (or relative frequencies or percentages) on the
vertical axis. Next, we draw a bar for each class so that its height
represents the frequency of that class. The bars in a histogram
are drawn adjacent to each other with no gap between them.
A histogram is called a frequency histogram, a relative frequency
histogram, or a percentage histogram depending on whether
frequencies, relative frequencies, or percentages are marked on
the vertical axis. Figures 2.4 and 2.5 show the frequency and the
percentage histograms, respectively, for the data of Table 2.9.
[Histogram: Weight (kg) on the horizontal axis, with bars centred at 44, 56, 68, 80, 92 and 104; frequency on the vertical axis.]
Figure 2.4
Histogram for the frequency distribution
of Table 2.9
[Histogram: Weight (kg) on the horizontal axis, with bars centred at 44, 56, 68, 80, 92 and 104; percentage on the vertical axis.]
Figure 2.5
Histogram for the percentage distribution of Table 2.14
Frequency Polygon (Line graph):
A frequency polygon (Line Graph) is another device that can be
used to represent quantitative data in graphic form. To draw
a frequency polygon, we first mark a dot above the midpoint of
each class at a height equal to the frequency of that class. This is
the same as marking the midpoint at the top of each bar in
a histogram. Next, we join the adjacent dots with straight lines.
The resulting line graph is called a frequency polygon
or simply a polygon.
[Line graph: Years of experience (classes 1 to 8) on the horizontal axis; Frequency on the vertical axis.]
Figure 2.6
A Frequency Polygon (Line Graph)
for the Frequency Distribution of Table 2.13
[Figure: Less-than Ogive. Years of Experience (0 to 18) on the horizontal axis; Cumulative Frequency (0 to 60) on the vertical axis.]
[Figure: More-than Ogive. Horizontal axis labelled Years of Experience (38 to 118); Cumulative Frequency (0 to 35) on the vertical axis.]
Exercises
1- The following data give the results of a sample survey. The
letters Y, N, and D represent the three categories.
D N N Y Y Y N Y D Y
Y Y Y Y N Y Y N N Y
N Y Y N D N Y Y Y Y
Y Y N N Y Y N N D Y
a. Prepare a frequency distribution table.
b. Calculate the relative frequencies and percentages for all
categories.
c. What percentage of the elements in this sample belong to
category Y?
d. What percentage of the elements in this sample belong to
category N or D?
e. Draw a pie chart for the percentage distribution.
2- A market research company asked residents of Qena to name
their favorite pizza topping. The possible responses included
the following choices: meats (M); seafood, for example, tuna,
or crab (S); vegetables and fruits (V); poultry (PO); and cheese
(C). The following data represent the responses of a random
sample of 36 people.
V M M M V PO S M V S V S
M S V V V M S S V M C V
V V C V S PO V M S M PO M
a. Prepare a frequency distribution table.
b. Calculate the relative frequencies and percentages for all
categories.
c. What percentage of the respondents mentioned vegetables
and fruits, poultry, or cheese?
3- The following data show the method of payment by 16
customers in a supermarket checkout line. Here, C refers to
cash, CK to check, CC to credit card, D to debit card, and O
stands for other.
C CK CK C CC D O C
CK CC D CC C CK CK CC
Gallons of Gas    Number of Customers
0- 31
4- 78
8- 49
12 - 81
16 - 117
20 - 24 13
a. Construct a frequency distribution table using the class
width = 10.
b. Calculate the relative frequency and percentage for
each class.
c. Construct a histogram for the percentage distribution made
in Part b.
d. What percentage of the workers in this sample commutes
for 30 minutes or more?
e. Prepare the less-than and more-than cumulative frequency,
cumulative relative frequency, and cumulative percentage
distributions using the table of Part a.
Chapter (3)
Measures of Central Tendency
The measure of central tendency is a single value that attempts
to describe a set of data by identifying the central position within
that set of data. As such, measures of central tendency are
sometimes called measures of central location. They are also
classified as summary statistics. The mean (often called the
average) is most likely the measure of central tendency
that you are most familiar with, but there are others, such as the
median and the mode.
The mean, median and mode are all valid measures of central
tendency, but under different conditions, some measures of
central tendency become more appropriate to use than others. In
the following sections, we will look at the mean, mode and
median, and learn how to calculate them and under what
conditions they are most appropriate to be used.
3.1 Mean:
This section discusses how to find the arithmetic mean (Mean or
Average) of ungrouped and grouped data using 3 different
methods which are:
(1) Direct method.
(2) Assumed mean method.
(3) Step-deviation method.
The arithmetic mean is normally abbreviated to just the ‘mean’.
The main advantage of the mean is that it uses all the values in
the data set, while its main disadvantage is that it is severely
affected by outliers.
3.1.1 The Mean of Ungrouped Data:
The arithmetic mean of a set of values is defined as ‘the
sum of all the values’ divided by ‘the number of values’,
that is
$\text{Mean} = \dfrac{\text{Sum of All Values}}{\text{Number of Values}}$
Example 1:
(a) If a firm received orders worth (in thousands of L.E.)
30, 10, and 56
for three consecutive months, the mean value of orders per
month would be calculated as:
$\dfrac{30 + 10 + 56}{3} = \dfrac{96}{3} = 32$
$x_1 + x_2 + x_3 + \dots + x_n$ is written as $\sum x$.
∑ is the summation sign. It is the Greek capital letter sigma, which corresponds to
‘S’ (for Sum), and ∑ x can be simply translated as ‘add up
all the x-values under consideration’.
For the sales example, we have:
∑ x = 4 + 5 + 12 + 8 + 2 = 31
Example 2:
To calculate the mean for the set:
43 , 75 , 50 , 51 , 51 , 47 , 50 , 47 , 40 , 48
Dividing both sides by n, we get $\bar{x} = a + \bar{d}$
Example 3:
If the hourly wage of 12 workers was as follows:
32 , 26 , 27 , 41 , 36 , 29 , 45 , 40 , 32 , 28 , 33 , 38
Find the mean hourly wage using the assumed mean method.
Solution:
Let, a = the center of the data set = (min + max)/2
= (45 + 26) / 2 = 35.5 ≅ 35
Then, subtracting a from each value in the data set, we get the
following deviations (di):
Hence: di’s = -3 , -9 , -8 , 6 , 1 , -6 , 10 , 5 , -3 , -7 , -2 , 3
Then, $\bar{d} = \dfrac{\sum d_i}{n} = \dfrac{-13}{12} = -1.08$
Therefore,
x̅ = a + d̅ = 35 + (−1.08) = 33.92
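The assumed mean calculation of Example 3 can be verified with a short Python sketch. The wages and the choice a = 35 come from the example; the function name `assumed_mean` is illustrative.

```python
# Hourly wages of the 12 workers in Example 3.
wages = [32, 26, 27, 41, 36, 29, 45, 40, 32, 28, 33, 38]


def assumed_mean(values, a):
    """Mean via the assumed mean method: a + (sum of deviations from a) / n."""
    deviations = [x - a for x in values]
    return a + sum(deviations) / len(values)


mean_assumed = assumed_mean(wages, a=35)
mean_direct = sum(wages) / len(wages)

print(round(mean_assumed, 2), round(mean_direct, 2))
```

Both routes give the same value, which is the point of the method: shifting by a only makes the arithmetic smaller, it does not change the mean.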
Note: It is not necessary for the value of the constant a to be
the average of the minimum and maximum values. You
can choose any convenient constant.
Now, let us find the relation between $\bar{D}$ and $\bar{x}$:
Since $D_i = \dfrac{x_i - a}{b}$, then
$\sum D_i = \dfrac{\sum (x_i - a)}{b}$
Therefore, $\sum (x_i - a) = b \sum D_i$
and $\sum x_i = na + b \sum D_i$
Dividing both sides by n, we get $\bar{x} = a + b\bar{D}$
Then, subtracting a from each value in the data set, then we get
the following deviations (di):
5 , -10 , -25 , 0 , 25 , 10 , -15
It is obvious that the common factor of (di) is (b = 5)
and dividing (di) by (5), then we get (Di) values
1 , -2 , -5 , 0 , 5 , 2 , -3
$\bar{D} = \dfrac{\sum D_i}{n} = \dfrac{-2}{7}$
Therefore,
$\bar{x} = a + b\bar{D} = 170 + 5\left(\dfrac{-2}{7}\right) = 170 + (-1.43) = 168.57$
The total of all the values = the sub total of the 10’s (= 10 × 2= 20)
+ the sub total of the 12’s (= 12 × 8 = 96)
+ the sub total of the 13’s (= 13 × 17 =221)
+ the sub total of the 14’s (= 14 × 5 = 70)
+ the sub total of the 16’s (= 16 × 1 = 16)
+ the sub total of the 19’s (= 19 × 1 = 19) = 442
Notice that in order to get the sub-totals 20, 96, 221, …, etc., xi is
being multiplied by fi each time. In other words, the total is just
$\sum x_i f_i$. Also, since there are 2 “10’s”, 8 “12’s”, 17 “13’s”, …, etc.,
the number of values included in the distribution is 2 + 8 + 17 + 5
+ 1 + 1 = 34; therefore, n = $\sum f_i$ = 34.
Example 5:
For the following discrete frequency distribution
Number of Vehicles Serviceable (xi)    0   1   2    3   4   5
Number of Days (fi)                    2   5   11   4   4   1
Solution:
The normal layout for calculations is:
x f xf
0 2 0
1 5 5
2 11 22
3 4 12
4 4 16
5 1 5
Total 27 60
Thus: $\bar{x} = \dfrac{\sum x_i f_i}{\sum f_i} = \dfrac{60}{27} = 2.22$
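The tabular layout of Example 5 maps directly onto a few lines of Python. This is a sketch with illustrative names; the x and f columns are those of the example.

```python
# Discrete frequency distribution from Example 5.
vehicles = [0, 1, 2, 3, 4, 5]    # x: number of vehicles serviceable
days = [2, 5, 11, 4, 4, 1]       # f: number of days


def grouped_mean(values, frequencies):
    """Direct method: sum of x*f divided by sum of f."""
    total = sum(x * f for x, f in zip(values, frequencies))
    return total / sum(frequencies)


mean_vehicles = grouped_mean(vehicles, days)
print(round(mean_vehicles, 2))
```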
$\bar{x} = a + \bar{d} = a + \dfrac{\sum d_i f_i}{\sum f_i}$
Mean of Grouped Quantitative Discrete Data Using the
Assumed Mean Method:
$\bar{x} = a + \dfrac{\sum d_i f_i}{\sum f_i}$
Example 6:
Consider the following frequency distribution of family size:
Family size (xi) 1 2 3 4 5 6 Total
Number of families (fi) 2 10 12 13 9 4 50
Find the mean family size using the assumed mean method.
Solution:
• a = Center of xi’s = (min + max)/2 = (1 + 6) / 2 = 3.5
For purposes of simplifying calculations, let a = 3
• Subtracting (a = 3) from each (xi) we get the column:
(di = xi – 3)
• Multiplying each (di) by its corresponding (fi) to get the
column (di fi).
x f d=x-3 df
1 2 -2 -4
2 10 -1 -10
3 12 0 0
4 13 1 13
5 9 2 18
6 4 3 12
Total 50 - 29
Therefore, $\bar{x} = a + \dfrac{\sum d_i f_i}{\sum f_i} = 3 + \dfrac{29}{50} = 3.58$
(c) Method of Step – Deviation:
If all the values of di are divisible by a common factor b (i.e.,
without remainder), calculations can be simplified further as follows:
Let $D_i = \dfrac{d_i}{b} = \dfrac{x_i - a}{b}$, where a is the assumed mean and b is the
common factor of di.
Multiplying (Di) by (fi), the mean can be written as
$\bar{x} = a + b\bar{D} = a + b\left(\dfrac{\sum D_i f_i}{\sum f_i}\right)$
Mean of Grouped Quantitative Discrete Data Using the
Step-Deviation Method:
$\bar{x} = a + b\left(\dfrac{\sum D_i f_i}{\sum f_i}\right)$
Example 7:
Consider the following apartment rent frequency distribution
xi 1100 1150 1200 1250 1300 1350 1400 1450 1500 Total
fi 6 10 5 12 22 16 13 11 5 100
Solution:
• a = Center of xi’s = (min + max)/2 = (1100 + 1500) / 2 = 1300
• Subtracting (a = 1300) from each (xi), then we get the column
(di = xi – 1300).
• Since all di’s are divisible by (b = 50), then we can define the
column (Di = di /50).
• Multiplying each (Di) by its corresponding (fi) to get the
column (Difi).
x f d = x - 1300 D = d / 50 Df
1100 6 -200 -4 -24
1150 10 -150 -3 -30
1200 5 -100 -2 -10
1250 12 -50 -1 -12
1300 22 0 0 0
1350 16 50 1 16
1400 13 100 2 26
1450 11 150 3 33
1500 5 200 4 20
Total 100 - - 19
Then, $\bar{x} = a + b\left(\dfrac{\sum D_i f_i}{\sum f_i}\right) = 1300 + 50\left(\dfrac{19}{100}\right) = 1309.5$
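The step-deviation columns of Example 7 can be reproduced as follows. This is a minimal sketch: the rents, frequencies, a = 1300 and b = 50 are taken from the example, and the function name is illustrative.

```python
# Apartment rent distribution from Example 7.
rents = [1100, 1150, 1200, 1250, 1300, 1350, 1400, 1450, 1500]
counts = [6, 10, 5, 12, 22, 16, 13, 11, 5]


def step_deviation_mean(values, frequencies, a, b):
    """Mean via step deviations D = (x - a) / b: a + b * sum(D*f) / sum(f)."""
    scaled = [(x - a) / b for x in values]
    weighted = sum(D * f for D, f in zip(scaled, frequencies))
    return a + b * weighted / sum(frequencies)


mean_rent = step_deviation_mean(rents, counts, a=1300, b=50)
# Direct method on the same data, for comparison.
direct = sum(x * f for x, f in zip(rents, counts)) / sum(counts)

print(mean_rent, direct)
```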
(a) Direct Method:
Mean of Grouped Quantitative Continuous Data Using the
Direct Method:
$\bar{x} = \dfrac{\sum x_i f_i}{\sum f_i}$
where $x_i$ are the class midpoints and
$f_i$ are the corresponding frequencies.
Example 8:
For the following hourly wage frequency distribution:
Wage                   30-  40-  50-  60-  70-  80-  90-100  Total
Number of Employees    1    4    7    14   11   8    5       50
Find the mean hourly wage using the direct method.
Solution:
• Finding class midpoints xi = (lower limit + upper limit)/2
• Multiplying each (xi) by its corresponding (fi) to get the
column (xifi).
We can summarize these calculations as shown in the
following table:
Wage    f    Class Midpoint (x)    xf
30- 1 35 35
40- 4 45 180
50- 7 55 385
60- 14 65 910
70- 11 75 825
80- 8 85 680
90-100 5 95 475
Total 50 - 3490
$\bar{x} = \dfrac{\sum x_i f_i}{\sum f_i} = \dfrac{3490}{50} = 69.8$
(b) Method of Assumed Mean:
Choosing the value (a) as an assumed mean and subtracting this
value from each class midpoint (i.e. di = xi – a), then:
$\bar{x} = a + \dfrac{\sum d_i f_i}{\sum f_i}$
Example 9:
For the hourly wage frequency distribution in Example (8), find the
mean using the assumed mean method
Solution:
• Let the assumed mean a be the midpoint of the class (60 – 70), that is
a = (60 + 70)/2 = 65
• Subtracting (a = 65) from each (xi), then we get the column
(di = xi – 65)
• Multiplying each (di) by its corresponding (fi) to get the
column (difi)
We can summarize these calculations as shown in the
following table:
Wage    f    Class Midpoint (x)    d = x - 65    df
30- 1 35 -30 -30
40- 4 45 -20 -80
50- 7 55 -10 -70
60- 14 65 0 0
70- 11 75 10 110
80- 8 85 20 160
90-100 5 95 30 150
Total 50 - - 240
Then: $\bar{x} = a + \dfrac{\sum d_i f_i}{\sum f_i} = 65 + \dfrac{240}{50} = 69.8$
Note: You can consider any other value for the assumed mean.
However, choosing one of the midpoints may simplify
computations.
(c) Method of Step-Deviation:
If all the values of (di) can be divided by a common factor (b), then,
by defining (Di = di / b), we get:
$\bar{x} = a + b\left(\dfrac{\sum D_i f_i}{\sum f_i}\right)$
Solution:
• Let a = 65 , and b = 10
• Then, we find the columns of (Di = di /10) and Difi.
The following table presents calculations required:
Wage Class    f    Class Midpoint (x)    d = x - 65    D = d/10    Df
30- 1 35 -30 -3 -3
40- 4 45 -20 -2 -8
50- 7 55 -10 -1 -7
60- 14 65 0 0 0
70- 11 75 10 1 11
80- 8 85 20 2 16
90-100 5 95 30 3 15
Total 50 - - 24
Then: $\bar{x} = a + b\left(\dfrac{\sum D_i f_i}{\sum f_i}\right) = 65 + 10\left(\dfrac{24}{50}\right) = 69.8$
This is the same result as that obtained by the other two methods (Examples 8
and 9).
Note: The three methods used for obtaining the value of the
mean must lead to the same results.
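A quick numerical check of this note, using the wage distribution of Example 8: the sketch below computes the mean by the direct method and then by the step-deviation formula for several choices of the assumed mean a (with b = 1 the step-deviation formula reduces to the assumed mean method). The extra values a = 55 and a = 95 are arbitrary illustrations, not values used in the text.

```python
# Wage classes from Example 8: lower limits, common class width, frequencies.
lower_limits = [30, 40, 50, 60, 70, 80, 90]
class_width = 10
frequencies = [1, 4, 7, 14, 11, 8, 5]

# Class midpoint = lower limit + half the class width.
midpoints = [lower + class_width / 2 for lower in lower_limits]
n = sum(frequencies)

# (a) Direct method.
direct = sum(x * f for x, f in zip(midpoints, frequencies)) / n


def shifted_scaled_mean(a, b):
    """a + b * sum(D*f) / sum(f) with D = (x - a) / b."""
    weighted = sum(((x - a) / b) * f for x, f in zip(midpoints, frequencies))
    return a + b * weighted / n


# (b) Assumed mean method is the case b = 1; (c) step-deviation uses b = 10.
results = {
    (a, b): shifted_scaled_mean(a, b)
    for a in (65, 55, 95)
    for b in (1, 10)
}

print(direct)
for key, value in results.items():
    print(key, round(value, 2))
```

Every combination agrees with the direct method, so the choice of a and b only affects how convenient the hand calculation is.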
3.2 Median:
The median is generally considered as an alternative central
tendency measure to the mean. This section defines the median
and shows how to find its value for ungrouped and grouped data.
Definition:
Median:
The median is the middle value of a set of data after arranging
data in an ascending order (or descending order).
Example 11:
Given the hourly wage of 7 workers as follows
75 , 47 , 48 , 50 , 51 , 51 , 43
Find the median hourly wage.
Solution:
• Arranging data in an ascending order, we get:
43 , 47 , 48 , 50 , 51 , 51 , 75
• Since n = 7 is an odd number of items, the order of the median
is $\left(\dfrac{7+1}{2}\right)^{\text{th}}$ = the 4th item.
• Therefore, the median = 50
(b) Even Number of Values:
When an ordered set of data contains an even number of values,
there are two middle values. The order of the first value is $\left(\dfrac{n}{2}\right)$ and
the order of the second value is $\left(\dfrac{n}{2} + 1\right)$. The convention in this
situation is to use the mean of these two middle values as
the median.
Example 12:
Given the weight of 10 students as follows
88 , 82 , 91 , 72 , 65 , 85 , 80 , 84 , 73 , 90
Find the median weight
Solution:
• By arranging values in a descending order, we get:
91 , 90 , 88 , 85 , 84 , 82 , 80 , 73 , 72 , 65
• Since n = 10 is an even number of items, the order of the first middle
item = $\dfrac{n}{2} = \dfrac{10}{2} = 5$ and the order of the second middle
item = $\dfrac{n}{2} + 1 = \dfrac{10}{2} + 1 = 6$.
• So, the median = $\dfrac{84 + 82}{2} = 83$
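The odd and even cases can be combined in one small function. This is a sketch with an illustrative name; it sorts in ascending order, which gives the same middle values as the descending order used in Example 12.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th item, which has 0-based index n // 2.
        return ordered[middle]
    # Even n: the (n / 2)-th and (n / 2 + 1)-th items.
    return (ordered[middle - 1] + ordered[middle]) / 2


hourly_wages = [75, 47, 48, 50, 51, 51, 43]            # Example 11
weights = [88, 82, 91, 72, 65, 85, 80, 84, 73, 90]     # Example 12

print(median(hourly_wages), median(weights))
```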
3.2.2 The Median of Grouped Data:
As we mentioned in the previous section, the penalty paid by
grouping values is the loss of their individual identities and thus
there is no way that a median can be calculated exactly in this
situation. However, we can use interpolation for estimating the
median. Interpolation in this context is a simple mathematical
technique which estimates an unknown value by utilizing
immediately surrounding known values.
To be able to find the median, we need to find its order
(median point):
The order of the median (Median Point) = $\dfrac{\sum f_i}{2}$
Then, we can use either the less-than cumulative frequency
distribution or the more-than cumulative frequency distribution to
find the value of the median as follows:
Example 13:
For the hourly wage frequency distribution in Example (8), find the
median for the hourly wages.
Solution:
• The order of the median (median point) = $\dfrac{\sum f_i}{2} = \dfrac{50}{2} = 25$
• Prepare a less-than cumulative frequency distribution.
Wage Classes      Less-Than CF
Less than 30      0
Less than 40      1
Less than 50      5
Less than 60      12
Less than 70      26
Less than 80      37
Less than 90      45
Less than 100     50
The order of the median (25) lies between the cumulative frequencies 12 and 26, so the value of the median (m) lies between 60 and 70.
The main idea of interpolation is that the ratio of (25 – 12) to
(26 – 12) is equal to the ratio of (m - 60) to (70 – 60), that is
$\dfrac{m - 60}{70 - 60} = \dfrac{25 - 12}{26 - 12}$ gives $\dfrac{m - 60}{10} = \dfrac{13}{14}$
$m = 60 + 10\left(\dfrac{13}{14}\right) = \text{L.E. } 69.29$
Note that the value of the median (m = L.E. 69.29) must lie
between the lower and upper limits of the median class, i.e.,
between 60 and 70.
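The interpolation step can be written as a small function that locates the class containing a given order and interpolates linearly inside it. This is a sketch under the same assumption the text makes (values spread evenly within a class); the function and variable names are illustrative.

```python
# Less-than cumulative frequency table for the wages of Example 8.
boundaries = [30, 40, 50, 60, 70, 80, 90, 100]
less_than_cf = [0, 1, 5, 12, 26, 37, 45, 50]


def value_at_order(boundaries, less_than_cf, order):
    """Linear interpolation of the value whose cumulative frequency equals `order`."""
    for i in range(1, len(boundaries)):
        cf_lower, cf_upper = less_than_cf[i - 1], less_than_cf[i]
        # Skip empty classes so the division below is always defined.
        if cf_lower <= order <= cf_upper and cf_upper > cf_lower:
            width = boundaries[i] - boundaries[i - 1]
            return boundaries[i - 1] + width * (order - cf_lower) / (cf_upper - cf_lower)
    raise ValueError("order must lie between 0 and the total frequency")


median_order = less_than_cf[-1] / 2          # sum of f divided by 2 = 25
median_wage = value_at_order(boundaries, less_than_cf, median_order)
print(round(median_wage, 2))
```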
Wage Fi
60 38
m 25
70 24
$\dfrac{m - 60}{70 - 60} = \dfrac{25 - 38}{24 - 38}$ gives $\dfrac{m - 60}{10} = \dfrac{-13}{-14}$
$m = 60 + 10\left(\dfrac{13}{14}\right) = \text{L.E. } 69.29$
3.3 Quartiles, Deciles and Percentiles:
Quartiles are the summary measures that divide a ranked data set
into four equal parts. Three measures will divide any data set into
four equal parts. These three measures are the first quartile
(denoted by Q1), the second quartile (denoted by Q2), which
is the same as the median of a data set, and the third
quartile (denoted by Q3).
[Diagram: a ranked data set divided into four parts of 25% each by Q1, Q2 (≡ Median) and Q3.]
Each of these partitions contains 25% of the observations of
a data set arranged in increasing order. Approximately 25% of the
values in a ranked data set are less than Q1 and about 75% are
greater than Q1. The second quartile, Q2, divides a ranked data
set into two equal parts; hence, the second quartile and the
median are the same. Approximately 75% of the data values are
less than Q3 and about 25% are greater than Q3.
In the same way, deciles are the summary measures that divide
a ranked data set into ten equal parts. Nine measures will divide
any data set into ten equal parts. These nine measures are the
first decile (denoted by d1), the second decile (denoted by d2), the
third decile (denoted by d3), and so on. Notice that the fifth decile
(denoted by d5) is the same as the median of a data set.
[Diagram: a ranked data set divided into ten parts of 10% each by d1, d2, …, d9, where d5 ≡ Median.]
So, approximately 1% of the values in a ranked data set are less
than (P1) and about 99% are greater than (P1), …, approximately
19% of the data values are less than (P19) and about 81% are
greater than (P19), and so on.
Example 14:
The following frequency distribution illustrates the weekly
income (L.E) for 300 families:
Income Classes        750-  1000-  1250-  1500-  1750-  2000-  2250-2500  Total
Number of Families    14    30     45     68     62     49     32         300
Using a less -than cumulative frequency distribution, find:
(a) The median. (b) The 1st and the 3rd quartiles.
(c) If 60% of families get less than L.E. x, determine the
value of x.
(d) The percentage of families whose income is less than L.E. 1650.
(e) The number of families whose incomes are between L.E. 1350
and L.E. 1800.
(f) The number of families whose incomes are at least L.E. 1850.
Solution:
Prepare the less-than cumulative frequency distribution.
Class CF
less than 750 0
less than 1000 14
less than 1250 44
less than 1500 89
less than 1750 157
less than 2000 219
less than 2250 268
less than 2500 300
$\dfrac{m - 1500}{1750 - 1500} = \dfrac{150 - 89}{157 - 89}$ gives $\dfrac{m - 1500}{250} = \dfrac{61}{68}$
$m = 1500 + 250\left(\dfrac{61}{68}\right) = \text{L.E. } 1724.26$
So, the first quartile for the weekly income is L.E. 1422.22
• Similarly, since $\dfrac{3}{4}$ of the data is less than the 3rd quartile, the
order of the 3rd quartile = $\dfrac{3}{4}\sum f_i = \dfrac{3}{4}(300) = 225$
Income FA
L.T. 2000 219
L.T. Q3 225
L.T. 2250 268
\[ \frac{Q_3 - 2000}{2250 - 2000} = \frac{225 - 219}{268 - 219} \quad\text{gives}\quad \frac{Q_3 - 2000}{250} = \frac{6}{49} \]
\[ Q_3 = 2000 + 250\left(\frac{6}{49}\right) = \text{L.E. } 2030.61 \]
Therefore, the third quartile for the weekly income is L.E. 2030.61
(c) Since 60% of families get less than x, x has the same
definition as the 6th decile or the 60th percentile.
Since 6/10 (or 60/100) of the data is less than the 6th decile (or the
60th percentile), the order of the 6th decile is
\[ \frac{6}{10}\sum f_i = \frac{6}{10}(300) = 180 \]
Income FA
L.T. 1750 157
L.T. d6 180
L.T. 2000 219
Income FA
L.T. 1500 89
L.T. 1650 k
L.T. 1750 157
\[ \frac{1650 - 1500}{1750 - 1500} = \frac{k - 89}{157 - 89} \quad\text{gives}\quad \frac{150}{250} = \frac{k - 89}{68} \]
\[ k = 89 + 68\left(\frac{150}{250}\right) = 129.8 \cong 130 \]
Income FA
L.T. 1250 44
L.T. 1350 a
L.T. 1500 89
\[ \frac{1350 - 1250}{1500 - 1250} = \frac{a - 44}{89 - 44} \quad\text{gives}\quad \frac{100}{250} = \frac{a - 44}{45} \]
\[ a = 44 + 45\left(\frac{100}{250}\right) = 62 \text{, and} \]
Income FA
L.T. 1750 157
L.T. 1800 b
L.T. 2000 219
Income FA
L.T. 1750 157
L.T. 1850 k
L.T. 2000 219
So the number of families whose weekly incomes are more than
L.E 1850 is:
300 – k = 300 – 182 = 118 families
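The interpolation steps of Example 14 can be checked numerically. The following Python sketch is illustrative only: the helper names `value_at_order` and `count_below` are hypothetical (they do not come from the text), and the code assumes, as linear interpolation does, that incomes are spread evenly within each class.

```python
# Less-than cumulative frequency distribution from Example 14.
BOUNDS = [750, 1000, 1250, 1500, 1750, 2000, 2250, 2500]
CUM_FREQ = [0, 14, 44, 89, 157, 219, 268, 300]


def value_at_order(order, bounds=BOUNDS, cum_freq=CUM_FREQ):
    """Income below which `order` families fall (linear interpolation)."""
    for i in range(len(bounds) - 1):
        lo, hi = cum_freq[i], cum_freq[i + 1]
        if lo <= order <= hi and hi > lo:
            width = bounds[i + 1] - bounds[i]
            return bounds[i] + width * (order - lo) / (hi - lo)
    raise ValueError("order is outside the cumulative frequencies")


def count_below(value, bounds=BOUNDS, cum_freq=CUM_FREQ):
    """Estimated number of families with income less than `value`."""
    for i in range(len(bounds) - 1):
        if bounds[i] <= value <= bounds[i + 1]:
            width = bounds[i + 1] - bounds[i]
            step = cum_freq[i + 1] - cum_freq[i]
            return cum_freq[i] + step * (value - bounds[i]) / width
    raise ValueError("value is outside the class limits")


n = CUM_FREQ[-1]
median = value_at_order(n / 2)       # order 150
q1 = value_at_order(n / 4)           # order 75
q3 = value_at_order(3 * n / 4)       # order 225
d6 = value_at_order(6 * n / 10)      # order 180
print(round(median, 2), round(q1, 2), round(q3, 2), round(d6, 2))

# Parts (d) to (f): counts obtained by interpolating in the other direction.
print(count_below(1650), count_below(1800) - count_below(1350), n - count_below(1850))
```

The same two helpers cover every part of the example: parts (a) to (c) go from an order to a value, while parts (d) to (f) go from a value to a cumulative count.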
(ii) Using More-Than Cumulative Frequency Distribution
(Descending Cumulative Frequency Distribution):
• Form a more-than cumulative frequency distribution.
• Find the order of the measure (R∑fi),
where R is the proportion of data which are more than the value
of the measure.
• Identify the location of the order of the measure in the
column of cumulative frequencies (Fi), and the location
of the unknown value of the measure (say, m) in the column
of classes.
• Determine the value of the measure using interpolation.
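The steps above can be sketched in Python as follows. This is a minimal illustration, not the text's own procedure: the function name `value_from_more_than` is hypothetical, and the class frequencies are those of the weekly income distribution in Example 14.

```python
# Class frequencies and class limits of the income distribution (Example 14).
FREQ = [14, 30, 45, 68, 62, 49, 32]
BOUNDS = [750, 1000, 1250, 1500, 1750, 2000, 2250, 2500]

# More-than cumulative frequency at each class limit (ends with 0).
MORE_CF = [sum(FREQ[i:]) for i in range(len(FREQ) + 1)]


def value_from_more_than(r, bounds=BOUNDS, more_cf=MORE_CF):
    """Value of the measure such that a proportion `r` of the data exceeds it."""
    order = r * more_cf[0]                      # R * sum(f_i)
    for i in range(len(bounds) - 1):
        hi, lo = more_cf[i], more_cf[i + 1]     # cumulative frequencies decrease
        if lo <= order <= hi and hi > lo:
            width = bounds[i + 1] - bounds[i]
            return bounds[i] + width * (hi - order) / (hi - lo)
    raise ValueError("order is outside the cumulative frequencies")


median = value_from_more_than(1 / 2)   # half of the data is above the median
q1 = value_from_more_than(3 / 4)       # three quarters of the data is above Q1
q3 = value_from_more_than(1 / 4)       # one quarter of the data is above Q3
print(MORE_CF)
print(round(median, 2), round(q1, 2), round(q3, 2))
```

Note that R is the proportion above the measure, so the first quartile uses R = 3/4 and the third quartile uses R = 1/4.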
Example 15:
Solve the previous example using a more-than cumulative
frequency distribution
Solution:
Prepare the more-than cumulative frequency distribution
More Than FD
Income Classes
more than 750 300
more than 1000 286
more than 1250 256
more than 1500 211
more than 1750 143
more than 2000 81
more than 2250 32
more than 2500 0
(a) In order to find the median, we have the following:
Since 1/2 of the data is more than the median, the order of
the median is
\[ \frac{1}{2}\sum f_i = \frac{1}{2}(300) = 150 \text{, i.e., Median Point} = 150 \]
Income Fi
M.T. 1500 211
M.T. m 150
M.T. 1750 143
Therefore, the 1st quartile for the weekly income Q1 = L.E. 1422.22
(the same result as before).
Similarly, since 1/4 of the data is more than the 3rd quartile, the
order of the 3rd quartile is
\[ \frac{1}{4}\sum f_i = \frac{1}{4}(300) = 75 \]
Income FD
M.T. 2000 81
M.T. Q3 75
M.T. 2250 32
\[ \frac{Q_3 - 2000}{2250 - 2000} = \frac{75 - 81}{32 - 81} \quad\text{gives}\quad \frac{Q_3 - 2000}{250} = \frac{-6}{-49} \]
\[ Q_3 = 2000 + 250\left(\frac{6}{49}\right) = \text{L.E. } 2030.61 \]
Therefore, the 3rd quartile for the weekly income Q3 = L.E. 2030.61
(the same result obtained before).
(c) If 60% of families get less than x, then to find the value of x,
note that x has the same definition as the 6th decile or the 60th
percentile, so that we have the following steps:
Since 4/10 (or 40/100) of the data is more than the 6th decile (or
the 60th percentile), the order of the 6th decile is
\[ \frac{4}{10}\sum f_i = \frac{4}{10}(300) = 120 \]
Income FD
M.T. 1750 143
M.T. d6 120
M.T. 2000 81
\[ \frac{d_6 - 1750}{2000 - 1750} = \frac{120 - 143}{81 - 143} \quad\text{gives}\quad \frac{d_6 - 1750}{250} = \frac{-23}{-62} \]
\[ d_6 = 1750 + 250\left(\frac{23}{62}\right) = \text{L.E. } 1842.74 \]
So, 60% of families get less than x = L.E. 1842.74 (the same
preceding result).
(e) To find the number of families with incomes between 1350
and L.E. 1800, we have the following steps:
• Let the number of families whose incomes are more than
L.E. 1350 = a
• Let the number of families with incomes more than
L.E. 1800 = b
The number of families whose incomes are between 1350
and L.E. 1800 = (a – b)
Therefore, we have to find each of ‘a’ and ‘b’ as follows:
Income FD
M.T. 1250 256
M.T. 1350 a
M.T. 1500 211
\[ \frac{1350 - 1250}{1500 - 1250} = \frac{a - 256}{211 - 256} \quad\text{leads to}\quad \frac{100}{250} = \frac{a - 256}{-45} \]
\[ a = 256 - 45\left(\frac{100}{250}\right) = 238 \text{ families} \]
Income FD
M.T. 1750 143
M.T. 1800 b
M.T. 2000 81
Income Fi
1750 143
1850 k
2000 81
3.4 Mode:
Sometimes a set of data is obtained where it is appropriate to
measure a representative (central tendency) value in terms of
‘popularity’. For example, if a shop sells television sets, the
question ‘what price does the average television set sell at?’
is probably best answered by the price of the best-selling television.
This value is the mode. For this type of question, the mode is
more representative of the data than, for instance, the mean
or median.
3.4.1 Mode of Ungrouped Data:
Definition:
Mode of Ungrouped Data:
The mode of a data set is the value which occurs most often
(frequently) or is the most typical value.
Example 16:
If the scores of 10 students were 76, 82, 73, 95, 82, 91, 96, 65,
73, 82. Then the mode would be 82 since this value occurred
most often.
In case of ungrouped data, the mode may not exist and
sometimes there are more than one mode for the data set. The
following examples explain these cases.
Example 17:
Given the temperature of 7 successive days as follows:
35 , 34 , 36 , 33 , 38 , 39 , 37
In such case the mode doesn’t exist since each value happened
only once.
Example 18:
Given the working experience of 9 employees:
7 , 12 , 10 , 7 , 16 , 21 , 15, 27, 16
The data set has 2 modes, which are 7 and 16, since they have
equal frequencies and are repeated more often than the other values.
estimated. However, we can use the modal class to find out the
value of the mode.
Definition:
Modal Class:
The modal class is that class which has the largest frequency.
Number of 1 4 7 14 11 8 5 50
Employees
Find the mode using the method of the midpoint of the
modal class.
Solution:
• The modal class is (60 – 70), therefore Lm = 60 and Um = 70
• The mode is
\[ \frac{L_m + U_m}{2} = \frac{60 + 70}{2} = \text{L.E. } 65 \]
(2) Method of The Moments:
\[ \text{Mode} = L_m + \frac{f_s}{f_p + f_s} \times W_m \]
Where
Lm: Lower limit for the modal class
fp: Frequency of the class preceding the modal class
fs: Frequency for the class succeeding the modal class
Wm: Width of the modal class
Example 20:
Solve the previous example using moments method.
Solution:
• The modal class is (60 – 70), therefore:
50- 60- 70-
7 14 11
Lm = 60 , fs = 11 , fp = 7 , Wm = 10
• The mode is
\[ \text{Mode} = L_m + \left(\frac{f_s}{f_p + f_s}\right) \times W_m = 60 + \left(\frac{11}{7 + 11}\right) \times 10 = \text{L.E. } 66.11 \]
(3) Method of Differences:
\[ \text{Mode} = L_m + \frac{f_m - f_p}{(f_m - f_p) + (f_m - f_s)} \times W_m \]
Where
Lm: lower limit of the modal class.
fm: frequency of the modal class.
fp: frequency of the class preceding the modal class.
fs: frequency of the class succeeding the modal class.
Wm: width of the modal class.
Note that this method considers the frequency of the modal class
in addition to the preceding and succeeding frequencies.
Therefore, it is considered the most accurate method.
Example 21:
Solve Example 19 using the method of differences.
Solution:
• The modal class is (60 – 70), therefore
50- 60- 70-
7 14 11
Lm = 60 , fm = 14 , fs = 11 , fp = 7 , Wm = 10
• The mode is
\[ \text{Mode} = L_m + \frac{f_m - f_p}{(f_m - f_p) + (f_m - f_s)} \times W_m = 60 + \frac{14 - 7}{(14 - 7) + (14 - 11)} \times 10 \]
\[ = 60 + 10\left(\frac{7}{7 + 3}\right) = 67 \]
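The three grouped-data mode formulas can be compared side by side. The Python sketch below is only an illustration; the function names are hypothetical, and the inputs are the modal-class values used in Examples 19 to 21 (Lm = 60, Um = 70, fp = 7, fm = 14, fs = 11, Wm = 10).

```python
def mode_midpoint(lm, um):
    """Midpoint of the modal class."""
    return (lm + um) / 2


def mode_moments(lm, fp, fs, wm):
    """Method of moments: Lm + fs / (fp + fs) * Wm."""
    return lm + wm * fs / (fp + fs)


def mode_differences(lm, fm, fp, fs, wm):
    """Method of differences: Lm + (fm - fp) / ((fm - fp) + (fm - fs)) * Wm."""
    return lm + wm * (fm - fp) / ((fm - fp) + (fm - fs))


lm, um, wm = 60, 70, 10
fp, fm, fs = 7, 14, 11

print(mode_midpoint(lm, um))                       # midpoint method
print(round(mode_moments(lm, fp, fs, wm), 2))      # moments method
print(mode_differences(lm, fm, fp, fs, wm))        # differences method
```

Only the method of differences uses the frequency of the modal class itself, which is why the three estimates differ.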
3.5 Relationship Between Mean, Median and Mode:
Frequency curves of distributions may be relatively symmetric, but
more often are skewed to some extent. Typical examples of this
are distributions of wages, company turnover or times to
component failure and it is of some interest to know the
approximate relative positions of the three main central tendency
measures, the mean, median and mode. Figure 3.1 shows these
positions for moderately left skewed, symmetric and moderately
right skewed distributions.
[Figure (3-1): Graphical Position of Mean, Median and Mode; panel (a) Symmetric]
A useful aid in remembering the positions of the three central
tendency measures in Figure 3.1 is to use the following
characteristics of the three measures.
▪ The mode is the item that occurs most frequently and so it must
lie under the main ‘hump’.
▪ The mean is the measure that is most affected by extremes
and so it must lie towards the ‘tail’ of the distribution (except, of
course, for a symmetric distribution).
▪ The median is the middle item and it also lies in the middle of
the other two central tendency measures (but slightly closer to
the mean by a factor of 2 to 1 approximately).
Example 22:
Given the frequency distribution of statistics grades for 65
students
Solution:
• Mean:
\[ \bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{3380}{65} = 52 \]
See the table given below.
Grade Class   f   Midpoint (x)   xf
10- 4 16 64
22- 8 28 224
34- 13 40 520
46- 15 52 780
58- 13 64 832
70- 8 76 608
82-94 4 88 352
Total 65 - 3380
• Median:
Grade CF
less than 10 0
less than 22 4
less than 34 12
less than 46 25
less than 58 40
less than 70 53
less than 82 61
less than 94 65
\[ \text{Median Point} = \frac{1}{2}\sum f_i = \frac{1}{2}(65) = 32.5 \]
Grade Fi
L.T. 46 25
L.T . m 32.5
L.T. 58 40
\[ \frac{m - 46}{58 - 46} = \frac{32.5 - 25}{40 - 25} \quad\text{gives}\quad \frac{m - 46}{12} = \frac{7.5}{15} \]
\[ m = 46 + 12\left(\frac{1}{2}\right) = 52 \]
• Mode:
The modal class is (46 – 58)
Lm = 46 , fm = 15 , fs = 13 , fp = 13 , Wm = 12
\[ \text{Mode} = L_m + \frac{f_m - f_p}{(f_m - f_p) + (f_m - f_s)} \times W_m = 46 + \frac{15 - 13}{(15 - 13) + (15 - 13)} \times 12 \]
\[ = 46 + \frac{2}{2 + 2} \times 12 = 52 \]
We can conclude that:
Mean = Median = Mode = 52
This is because the frequency distribution is symmetric.
Example 23:
For the frequency distribution given in Example 8, Example 13
and Example 21 where median = 69.29 and mode = 67.
• Find the mean as a formula of median and mode.
• Is the frequency distribution negatively skewed? Why?
Solution:
• The mean is
\[ \text{Mean} = \frac{3(\text{Median}) - \text{Mode}}{2} = \frac{3(69.29) - 67}{2} = 70.435 \]
• No. Since (mean > median > mode), the distribution is
positively skewed.
3.6 Characteristics of the Measures of Central
Tendency:
This section illustrates the advantages and disadvantages for
central tendency measures.
3.6.1 Characteristics of The Mean:
Advantages:
1- It is considered as the most important and the most applicable
measure of central tendency. It is widely used in applications.
2- In contrast to the median and mode, all the values are
considered in computing the mean.
Disadvantages:
1- The mean is sensitive to outliers (extreme values). For
example, consider the salaries of 5 employees: L.E. 3500, 5000,
4200, 4300 and 60000. The mean is L.E. 15400, which gives an
unrealistic picture of the level of salaries. Note that the
salary of L.E. 60000 is an outlier.
2- The mean can’t be calculated for frequency distributions with
open-ended classes. We can handle this problem by:
• Getting rid of open classes and calculating the mean for the
rest of the data, but we will lose information especially when
these classes have large frequencies.
• Closing the open classes by supposing a value for the lower
limit of the lowest class or a value for the upper limit of the
highest class which is subjective and requires experience to
deal with.
• In symmetrical or nearly symmetrical frequency
distributions, we can use the relationship between mean,
median and mode to estimate the mean by using Pearson’s
relationship which is as follows:
\[ \text{Mean} = \frac{3 \times \text{Median} - \text{Mode}}{2} \]
3- It can’t be used for qualitative data.
4-
3.6.2 Characteristics of The Median:
Advantages:
1- It is the middle value of the data after arranging in ascending
(or descending) order, so it is not affected by extreme values
like the mean.
2- It can be calculated for frequency distributions with open-ended
classes.
Disadvantages:
1- Only a part of the data set is used for computing the value of
the median. Therefore, it is less accurate than the mean.
2- It isn’t amenable to algebraic calculations.
3.7 Effect of Shifting and Scaling:
Sometimes the entire data set can be changed either by shifting
(adding or subtracting the same value) or scaling (multiplying or
dividing by the same value). In such cases, it will be useful to know
how the measures of central tendency are affected by these
changes without repeating any calculations.
The measures of central tendency for the new data set change
by the same value we add to the entire data set or we subtract
from it.
Let the value we add or subtract ‘a’, then the measures of
central tendency for the new data set can be written as:
• The mean of the new data set
= 𝐱̅ + 𝐚 (In case of addition)
= 𝐱̅ − 𝐚 (In case of subtraction)
• The median of the new data set
= 𝐦 + 𝐚 (In case of addition)
= 𝐦 − 𝐚 (In case of subtraction)
• The mode of the new data set
= 𝐦𝐨𝐝𝐞 + 𝐚 (In case of addition)
= 𝐦𝐨𝐝𝐞 − 𝐚 (In case of subtraction)
Example 24:
For the frequency distribution given in Example 8, Example 13
and Example 21 where mean = 69.8, median = 69.29 and
mode = 67. Find the measures of central tendency for the new
data set after:
(a) Adding L.E. 10 to the wages of all employees.
(b) Subtracting L.E. 5 from the wages of all employees.
Solution:
(a) In case of adding 10 L.E to the wage for all employees, then:
• The mean of the new data set = x̄ + a = 69.8 + 10 = L.E. 79.8
• The median of the new data set = m + a = 69.29 + 10 = L.E. 79.29
• The mode of the new data set = mode + a = 67 + 10 = L.E. 77
Example 25:
For the frequency distribution given in Example 8, Example 13
and Example 21, where the mean of wages = L.E. 69.8, the
median of wages = L.E. 69.29 and the mode of wages = L.E. 67,
find the measures of central tendency for the new data set after:
(a) Increasing the wage by 10% for all employees.
(b) Decreasing the wage by 5% for all employees.
Solution:
(a) In case of increasing the wage by 10% for all employees, then:
new wage = old wage + 10% old wage
= old wage + 0.1(old wage) = (1 + 0.1) old wage
= 1.1 × old wage
Therefore, for a = 1.1, then:
• The median for the new data set = m × a = 69.29 × 0.95
= L. E. 65.83
• The mode for the new data set = mode × a = 67 × 0.95
= L. E. 63.65
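The shifting and scaling rules of this section can be verified on any small data set. The Python sketch below uses a hypothetical list of wages (not taken from the text) and the standard-library `statistics` module; the helper name `central_measures` is also an assumption made for illustration.

```python
import statistics


def central_measures(data):
    """Return (mean, median, mode) of a list of values."""
    return (statistics.mean(data), statistics.median(data), statistics.mode(data))


wages = [60, 65, 65, 70, 80]     # hypothetical wages, for illustration only
a = 10                           # shift: add L.E. 10 to every wage
b = 1.1                          # scale: raise every wage by 10%

original = central_measures(wages)
shifted = central_measures([w + a for w in wages])
scaled = central_measures([w * b for w in wages])

# Each measure moves by the same shift, or is multiplied by the same factor.
print(original, shifted, scaled)
```

Applied to the summary values of Example 24 and Example 25, the same rules give 69.8 + 10 = 79.8 for the shifted mean and 67 × 0.95 = 63.65 for the mode after a 5% decrease.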
Exercises
1- The following data give the ages for 6 persons:
65 82 92 86 5 90
a. Find the mean and the median for these data.
b. Using the results of part (a), how can you tell whether this
data set contains outliers? Explain.
c. If you dropped the outlier and the values of the mean and
median were recalculated, which of the two measures is
expected to change by a larger amount? No calculations
required.
d. Which of the mean or median is considered a better measure
for these data? Explain?
2- The following data give the daily wages (L.E) earned by
a sample of 30 workers as shown below.
36 28 22 44 30 26 49 24 33 34
25 31 39 33 28 37 42 27 23 27
32 25 34 29 43 32 26 20 28 35
a. On the basis of inspection only (without performing any
calculations), what might you expect for the value of the
mean compared with that of the median? Justify your
answer.
b. Find the price value so that 50% of houses have prices
less than this value.
c. Find the mean of prices.
d. What connection do you see between your answers in
Parts (a), (b), and (c)?
4- The following frequency distribution reports the electricity cost
(in dollars) for a sample of 25 two-bedroom apartments in a city.
a. On the basis of inspection only (without performing any
calculations), do you agree with the claim that the distribution
of the monthly bonus is positively skewed? Explain?
b. Find the mean, median and mode of the monthly bonus.
c. Find the bonus value so that 75% of employees have
bonuses less than this value.
7- The following data represents the ages for a sample of internet
users as follows:
39 15 31 25 24 23 21 22 22 18
19 16 23 27 34 24 19 20 29 17
9- The following table represents the distribution of the hourly
wages for 20 workers in a factory.
Hourly Wage      20-25   Less than 30   Less than 35   Less than 40   Less than 45
No. of Workers     2          A              14             18             20
a. If the value of the median is 32.5, find the value of:
(1) A (2) The mean (3) The mode
b. How would the measures of central tendency be affected if
the hourly wages were:
(1) Increased by 20%. (2) Decreased by 10%.
(3) Increased by 10 L.E. (4) Decreased by 5 L.E.
Chapter (4)
Measures of Dispersion and Skewness
In the previous chapter, we have studied measures of central
tendency such as mean, mode, median of ungrouped and
grouped data. The averages are representatives of a frequency
distribution. But they fail to give a complete picture of the
distribution because they do not tell anything about how the
observations are scattered within the distribution. For example,
consider the following data about the wages of two groups of
employees as follows:
Group1
145 , 160 , 191 , 172 , 184 , 195 , 179 , 176 , 155 , 163
Group2
155 , 184 , 132 , 176 , 162 , 148 , 215 , 170 , 232 , 146
If we want to compare between both groups based on average,
we may say that level of wages in both groups are equal because
they have the same mean value = L.E 172. But comparing both
groups based on average only is not accurate, because in group1
wages ranged between L.E. 145 and L.E. 195 (i.e., Range = 50),
whereas in group2 wages ranged between L.E. 132 and L.E. 232
(i.e., Range = 100), which indicates that values in group1 are
closer to the mean and therefore are more homogenous, whereas
values in group 2 are less concentrated around the mean.
Measures of dispersion describe how spread out or scattered
a set or distribution of numeric data is.
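The comparison of the two groups can be reproduced with a few lines of Python. This is a sketch under one stated assumption: the seventh wage of Group 2 is taken as 215, the value implied by the stated mean of L.E. 172 and by the stated range from L.E. 132 to L.E. 232.

```python
import statistics

group1 = [145, 160, 191, 172, 184, 195, 179, 176, 155, 163]
# Assumption: the seventh value is 215, consistent with the stated mean and range.
group2 = [155, 184, 132, 176, 162, 148, 215, 170, 232, 146]

for name, wages in (("Group 1", group1), ("Group 2", group2)):
    mean = statistics.mean(wages)
    wage_range = max(wages) - min(wages)
    print(name, "mean =", mean, "range =", wage_range)
```

Both groups share the same mean, yet the range of the second group is twice that of the first, which is exactly the information an average alone cannot convey.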
4.1.1 The Range:
The range is defined as the numerical difference between the
lowest and largest values of the items in a data set or
distribution.
Example 1:
The following data represents ages of 10 persons (years):
31 , 18 , 27 , 41 , 53 , 32 , 56 , 43 , 17 , 22
Solution:
• Arrange the data in an ascending order
17 , 18 , 22 , 27 , 31 , 32 , 41 , 43 , 53 , 56
Example 2:
The following data give the hours worked last week by 5
employees of a company:
42 34 40 85 36
Find the range.
Solution:
• Arrange the data in an ascending order
34 36 40 42 85
Example 3:
Given the following frequency distribution table:
Electricity 30- 40- 50- 60- 70- 80-90 Total
Cost
No. of 3 8 12 16 7 4 50
Apartments
Example 4:
The following is a frequency distribution for the ages of a sample
of 20 employees at a company.
Age 20- 30- 40- 50- 60-70 Total
No. of employees 4 5 6 3 2 20
Example 5:
For the frequency distribution given in Example 4, find the inter-
quartile range.
Solution:
Class CF
Less than 20 0
Less than 30 4
Less than 40 9
Less than 50 15
Less than 60 18
Less than 70 20
\[ Q_1 \text{ Point} = 20\left(\frac{1}{4}\right) = 5 \]
Age Fi
L.T. 30 4
L.T. Q1 5
L.T. 40 9
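\[ \frac{Q_1 - 30}{40 - 30} = \frac{5 - 4}{9 - 4} \quad\text{gives}\quad \frac{Q_1 - 30}{10} = \frac{1}{5} \]
\[ Q_1 = 30 + 10\left(\frac{1}{5}\right) = 32 \text{ years} \]
\[ Q_3 \text{ Point} = 20\left(\frac{3}{4}\right) = 15 \]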
Age Fi
L.T. 50 15
Q3 = 50 years
IQR = Q3 – Q1 = 50 – 32 = 18 years
Example 6:
For the frequency distribution given in Example 3, find the
quartile deviation.
Solution:
Class CF
Less than 30 0
Less than 40 3
Less than 50 11
Less than 60 23
Less than 70 39
Less than 80 45
Less than 90 50
\[ Q_1 \text{ Point} = 50\left(\frac{1}{4}\right) = 12.5 \]
Cost Fi
L.T. 50 11
L.T. Q1 12.5
L.T. 60 23
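\[ \frac{Q_1 - 50}{60 - 50} = \frac{12.5 - 11}{23 - 11} \quad\text{gives}\quad \frac{Q_1 - 50}{10} = \frac{1.5}{12} \]
\[ Q_1 = 50 + 10\left(\frac{1.5}{12}\right) = 51.25 \]
\[ Q_3 \text{ Point} = 50\left(\frac{3}{4}\right) = 37.5 \]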
Cost Fi
L.T. 60 23
L.T. Q3 37.5
L.T. 70 39
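\[ \frac{Q_3 - 60}{70 - 60} = \frac{37.5 - 23}{39 - 23} \quad\text{gives}\quad \frac{Q_3 - 60}{10} = \frac{14.5}{16} \]
\[ Q_3 = 60 + 10\left(\frac{14.5}{16}\right) = 69.0625 \]
\[ QD = \frac{Q_3 - Q_1}{2} = \frac{69.0625 - 51.25}{2} = 8.906 \]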
Example 7:
The time taken for the weekly maintenance of a group of
machines in a workshop over the past 25 weeks is shown in the
following table.
Maintenance Time 0- 2- 4- 6- 8-10 Total
No. of Weeks 2 6 10 5 2 25
\[ Q_1 \text{ Point} = 25\left(\frac{1}{4}\right) = 6.25 \]
Time Fi
L.T. 2 2
L.T. Q1 6.25
L.T. 4 8
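\[ \frac{Q_1 - 2}{4 - 2} = \frac{6.25 - 2}{8 - 2} \quad\text{gives}\quad \frac{Q_1 - 2}{2} = \frac{4.25}{6} \]
\[ Q_1 = 2 + 2\left(\frac{4.25}{6}\right) = 3.417 \text{ hours} \]
\[ Q_3 \text{ Point} = 25\left(\frac{3}{4}\right) = 18.75 \]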
Time Fi
L.T. 6 18
L.T. Q3 18.75
L.T. 8 23
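\[ \frac{Q_3 - 6}{8 - 6} = \frac{18.75 - 18}{23 - 18} \quad\text{gives}\quad \frac{Q_3 - 6}{2} = \frac{0.75}{5} \]
\[ Q_3 = 6 + 2\left(\frac{0.75}{5}\right) = 6.3 \text{ hours} \]
\[ QD = \frac{Q_3 - Q_1}{2} = \frac{6.3 - 3.417}{2} = 1.4415 \text{ hours} \]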
4.1.4 Mean Deviation (MD):
The mean deviation is a measure of dispersion that gives the
average absolute difference (i.e. ignoring ‘minus’ signs) between
each item and the mean.
1- Mean Deviation for Ungrouped Data:
Mean Deviation (MD) of a Set of Values:
\[ MD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \]
Example 8:
Calculate the mean deviation of 43 , 75 , 48 , 39 , 51 , 47 , 50 , 47.
Solution:
First determine the mean as
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{400}{8} = 50 \]
and then find the absolute deviations as follows:
𝐱𝐢 𝐱 𝐢 − 𝐱̅ |𝐱 𝐢 − 𝐱̅|
43 -7 7
75 25 25
48 -2 2
39 -11 11
51 1 1
47 -3 3
50 0 0
47 -3 3
400 0 52
\[ MD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} = \frac{52}{8} = 6.5 \]
In other words, each value in the set is, on average, 6.5 units away
from the common mean.
2- Mean Deviation for Grouped Data:
Mean Deviation of Grouped Data:
\[ MD = \frac{\sum_{i=1}^{n} f_i |x_i - \bar{x}|}{\sum_{i=1}^{n} f_i} \]
Example 9:
For the frequency distribution given in Example 7, find the mean
deviation.
Maintenance Time   No. of Weeks (fi)   Midpoint (xi)   xi fi   |xi − x̄| (x̄ = 4.92)   |xi − x̄| fi
0- 2 1 2 3.92 7.84
2- 6 3 18 1.92 11.52
4- 10 5 50 0.08 0.8
6- 5 7 35 2.08 10.4
8-10 2 9 18 4.08 8.16
Total 25 - 123 - 38.72
Note: For a grouped frequency distribution, xi represents the
class midpoint for the ith class.
Mean:
\[ \bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{123}{25} = 4.92 \]
\[ MD = \frac{\sum_{i=1}^{n} f_i |x_i - \bar{x}|}{\sum_{i=1}^{n} f_i} = \frac{38.72}{25} = 1.5488 \]
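Both mean-deviation formulas (ungrouped and grouped) reduce to one weighted computation. The Python sketch below is illustrative; the function name `mean_deviation` is hypothetical, and treating ungrouped data as a frequency table in which every frequency is 1 is a convenience of this sketch, not a step from the text.

```python
def mean_deviation(values, freqs=None):
    """Mean absolute deviation about the mean; `freqs` defaults to all ones."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    return sum(f * abs(x - mean) for x, f in zip(values, freqs)) / n


# Example 8: ungrouped data.
md_ungrouped = mean_deviation([43, 75, 48, 39, 51, 47, 50, 47])

# Example 9: class midpoints and frequencies of the maintenance times.
md_grouped = mean_deviation([1, 3, 5, 7, 9], [2, 6, 10, 5, 2])

print(md_ungrouped, round(md_grouped, 4))
```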
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]
Solution:
𝐱𝐢 𝐱 𝐢𝟐 𝐱𝐢 𝐱 𝐢𝟐
43 1849 47 2209
75 5625 50 2500
48 2304 47 2209
51 2601 40 1600
51 2601 48 2304
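\[ s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n} - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2} = \sqrt{\frac{25802}{10} - \left(\frac{500}{10}\right)^2} = \sqrt{80.2} = 8.96 \]
Variance = s² = 80.2 (squaring the rounded value 8.96 gives 80.2816)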
Example 11:
For data in Example 2, find the standard deviation and the
variance.
Solution:
𝐱𝐢 𝐱 𝐢𝟐
42 1764
34 1156
40 1600
85 7225
36 1296
237 13041
\[ s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n} - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2} = \sqrt{\frac{13041}{5} - \left(\frac{237}{5}\right)^2} = 19.01 \]
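The shortcut formula used in Examples 10 and 11 can be checked against the definition. In the Python sketch below, `sd_shortcut` is a hypothetical helper name, and `statistics.pstdev` from the standard library is used as the reference because it divides by n, as the formulas in this chapter do.

```python
import math
import statistics


def sd_shortcut(values):
    """Standard deviation as sqrt(sum(x^2)/n - (sum(x)/n)^2)."""
    n = len(values)
    return math.sqrt(sum(x * x for x in values) / n - (sum(values) / n) ** 2)


hours = [42, 34, 40, 85, 36]          # data of Example 2 / Example 11
s = sd_shortcut(hours)
variance = s ** 2

print(round(s, 2), round(variance, 2))
print(round(statistics.pstdev(hours), 2))   # same value from the definition
```

Squaring the unrounded standard deviation avoids the small rounding error that appears when a rounded value is squared.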
Example 12:
The following table presents the distribution of the monthly income
(in thousands of Egyptian pounds). Calculate the mean, standard
deviation and variance of monthly income using the
direct method.
Monthly Income 25- 30- 35- 40- 45-50 Total
Number of Households 2 3 5 22 18 50
Solution:
The standard layout and calculations are shown in the
following table:
Monthly Income   Number of Households (f)   Midpoint (x)   xf   x²f
25- 2 27.5 55 1512.5
30- 3 32.5 97.5 3168.75
35- 5 37.5 187.5 7031.25
40- 22 42.5 935 39737.5
45-50 18 47.5 855 40612.5
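Total 50 - 2130 92062.5
Mean:
\[ \bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{2130}{50} = 42.6 \]
Standard Deviation:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 f_i}{\sum_{i=1}^{n} f_i} - \left(\frac{\sum_{i=1}^{n} x_i f_i}{\sum_{i=1}^{n} f_i}\right)^2} = \sqrt{\frac{92062.5}{50} - \left(\frac{2130}{50}\right)^2} = 5.147 \]
Variance = s² = (5.147)² = 26.49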
Example 13:
For the frequency distribution table given in Example 7, find the
mean, standard deviation and variance.
Solution:
The standard layout and calculations are shown in the
following table:
Maintenance Time   No. of Weeks (f)   Midpoint (x)   xf   x²f
0- 2 1 2 2
2- 6 3 18 54
4- 10 5 50 250
6- 5 7 35 245
8-10 2 9 18 162
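Total 25 - 123 713
Mean:
\[ \bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{123}{25} = 4.92 \]
Standard deviation:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 f_i}{\sum_{i=1}^{n} f_i} - \left(\frac{\sum_{i=1}^{n} x_i f_i}{\sum_{i=1}^{n} f_i}\right)^2} = \sqrt{\frac{713}{25} - \left(\frac{123}{25}\right)^2} = 2.077 \]
Variance = s² = (2.077)² = 4.314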
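The direct method for grouped data follows the same pattern in every example: build the xf and x²f columns, then apply the two formulas. A minimal Python sketch, with the hypothetical helper name `grouped_mean_sd`, is:

```python
import math


def grouped_mean_sd(midpoints, freqs):
    """Mean and standard deviation of a frequency distribution (direct method)."""
    n = sum(freqs)
    sum_xf = sum(x * f for x, f in zip(midpoints, freqs))
    sum_x2f = sum(x * x * f for x, f in zip(midpoints, freqs))
    mean = sum_xf / n
    sd = math.sqrt(sum_x2f / n - mean ** 2)
    return mean, sd


# Example 12: monthly income (class midpoints and numbers of households).
mean_income, sd_income = grouped_mean_sd([27.5, 32.5, 37.5, 42.5, 47.5], [2, 3, 5, 22, 18])

# Example 13: maintenance time (class midpoints and numbers of weeks).
mean_time, sd_time = grouped_mean_sd([1, 3, 5, 7, 9], [2, 6, 10, 5, 2])

print(round(mean_income, 2), round(sd_income, 3), round(sd_income ** 2, 2))
print(round(mean_time, 2), round(sd_time, 3), round(sd_time ** 2, 3))
```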
Example 14:
For the monthly income frequency distribution in Example 12,
find the mean, standard deviation and variance using the method
of step-deviations.
Solution:
Number of
Midpoint
Income Households d = x – 37.5 df d 2f
(f) (x)
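Standard Deviation:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} d_i^2 f_i}{\sum_{i=1}^{n} f_i} - \left(\frac{\sum_{i=1}^{n} d_i f_i}{\sum_{i=1}^{n} f_i}\right)^2} = \sqrt{\frac{2625}{50} - \left(\frac{255}{50}\right)^2} = 5.147 \]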
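Mean:
\[ \bar{x} = a + b\left(\frac{\sum D_i f_i}{\sum f_i}\right) = 37.5 + 5\left(\frac{51}{50}\right) = 42.6 \]
Since
\[ SD_D = \sqrt{\frac{\sum_{i=1}^{n} D_i^2 f_i}{\sum_{i=1}^{n} f_i} - \left(\frac{\sum_{i=1}^{n} D_i f_i}{\sum_{i=1}^{n} f_i}\right)^2} = \sqrt{\frac{105}{50} - \left(\frac{51}{50}\right)^2} = 1.0293 \]
the standard deviation in the original units is s = b × SD_D = 5 × 1.0293 ≈ 5.147.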
Example 16:
In a medium sized city there are 50 houses for sale of similar size.
The frequency distribution of prices is as follows.
Price ($10,000) 10- 20- 30- 40- 50-60 Total
No. of houses 14 18 12 4 2 50
Find the mean, standard deviation and variance for the prices
using the method of step-deviations.
Solution:
Price ($10,000)   No. of Houses (f)   Midpoint (x)   d = x – 35   D = d/10   Df   D²f
10- 14 15 -20 -2 -28 56
20- 18 25 -10 -1 -18 18
30- 12 35 0 0 0 0
40- 4 45 10 1 4 4
50-60 2 55 20 2 4 8
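Total 50 - - - -38 86
Mean:
\[ \bar{x} = a + b\left(\frac{\sum D_i f_i}{\sum f_i}\right) = 35 + 10\left(\frac{-38}{50}\right) = 27.4 \]
Since
\[ SD_D = \sqrt{\frac{\sum_{i=1}^{n} D_i^2 f_i}{\sum_{i=1}^{n} f_i} - \left(\frac{\sum_{i=1}^{n} D_i f_i}{\sum_{i=1}^{n} f_i}\right)^2} = \sqrt{\frac{86}{50} - \left(\frac{-38}{50}\right)^2} = 1.069 \]
the standard deviation in the original units is s = b × SD_D = 10 × 1.069 = 10.69.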
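The method of step-deviations can be sketched in Python as below. The helper name `step_deviation_mean_sd` is hypothetical; `a` is the assumed mean and `b` the common class width, as in Example 16. The final multiplication of the standard deviation of D by `b` converts the result back to the original units, since scaling the data by a constant scales the standard deviation by the same constant.

```python
import math


def step_deviation_mean_sd(midpoints, freqs, a, b):
    """Mean and standard deviation via D = (x - a) / b."""
    n = sum(freqs)
    steps = [(x - a) / b for x in midpoints]
    sum_df = sum(d * f for d, f in zip(steps, freqs))
    sum_d2f = sum(d * d * f for d, f in zip(steps, freqs))
    mean = a + b * sum_df / n
    sd_steps = math.sqrt(sum_d2f / n - (sum_df / n) ** 2)
    return mean, b * sd_steps


# Example 16: house prices in units of $10,000.
mean_price, sd_price = step_deviation_mean_sd(
    midpoints=[15, 25, 35, 45, 55], freqs=[14, 18, 12, 4, 2], a=35, b=10
)
print(round(mean_price, 2), round(sd_price, 2), round(sd_price ** 2, 2))
```

The choice of `a` does not affect the result; it only keeps the intermediate numbers small when working by hand.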
(or from) each value, but is changed by multiplying (or
dividing) each value by a constant, say b. That is,
\[ S_{x \pm a} = S_x \quad\text{and}\quad S_{bx} = |b| \, S_x \]
Characteristics of the Standard Deviation:
1- The standard deviation is the natural partner of the arithmetic
mean in the following respects:
• ‘By definition’. The standard deviation is defined in terms of
the mean.
• In further statistical analysis, there is a need to deal with one
of the most commonly occurring natural distributions, called
the Normal distribution, which can only be specified in terms
of both the mean and standard deviation.
2- It can be regarded as truly representative of the data, since all
data values are taken into account in its calculations.
3- For distributions that are not too skewed:
• Virtually, all of the items should lie within three standard
deviations of the mean. i.e., range = 6 × standard deviation
(approximately).
• About 95% of the items should lie within two standard
deviations of the mean.
4- The standard deviation is affected by extreme values (outliers).
comparable (not widely different). Except in these cases, the
measures of relative dispersion are more appropriate.
Now, some measures of relative dispersion are to be considered.
Example 17:
Over a period of three months the daily number of components
produced by two comparable machines was measured, giving the
following statistics.
Machine A: Mean = 242.8 , sd = 20.5
sd stands for “standard deviation”
Example 18:
The following table represents the rates of return over the past 6
years for two mutual funds
Fund A (x) 8.3 -6 18.9 - 5.7 23.6 20
Fund B (y) 12 - 4.8 6.4 10.2 25.3 1.4
Which fund has the higher risk? (High variability implies high risk.)
Solution:
xi   xi²   yi   yi²
8.3 68.89 12 144
-6 36 -4.8 23.04
18.9 357.21 6.4 40.96
- 5.7 32.49 10.2 104.04
23.6 556.96 25.3 640.09
20 400 1.4 1.96
59.1 1451.55 50.5 954.09
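Fund A:
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{59.1}{6} = 9.85 \]
\[ s_x = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n} - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2} = \sqrt{\frac{1451.55}{6} - \left(\frac{59.1}{6}\right)^2} = 12.04 \]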
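Fund B:
\[ \bar{y} = \frac{\sum y_i}{n} = \frac{50.5}{6} = 8.42 \]
\[ s_y = \sqrt{\frac{\sum_{i=1}^{n} y_i^2}{n} - \left(\frac{\sum_{i=1}^{n} y_i}{n}\right)^2} = \sqrt{\frac{954.09}{6} - \left(\frac{50.5}{6}\right)^2} = 9.39 \]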
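Coefficient of variation:
\[ \text{For fund A} = \frac{s_x}{\bar{x}} \times 100\% = \frac{12.04}{9.85} \times 100\% = 122.23\% \]
\[ \text{For fund B} = \frac{s_y}{\bar{y}} \times 100\% = \frac{9.39}{8.42} \times 100\% = 111.52\% \]
Fund A has the larger coefficient of variation, so it has the higher risk.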
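The comparison of the two funds can be reproduced with the short Python sketch below. The helper name `coefficient_of_variation` is hypothetical. Because the code does not round the mean and standard deviation before dividing, its percentages may differ slightly in the second decimal place from the hand calculation.

```python
import math


def coefficient_of_variation(values):
    """Standard deviation (divisor n) as a percentage of the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum(x * x for x in values) / n - mean ** 2)
    return sd / mean * 100


fund_a = [8.3, -6, 18.9, -5.7, 23.6, 20]
fund_b = [12, -4.8, 6.4, 10.2, 25.3, 1.4]

cv_a = coefficient_of_variation(fund_a)
cv_b = coefficient_of_variation(fund_b)

# The fund with the larger relative variability is the riskier one.
riskier = "Fund A" if cv_a > cv_b else "Fund B"
print(round(cv_a, 1), round(cv_b, 1), riskier)
```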
Example 19:
For the frequency distribution given in Example 3, find the quartile
coefficient of variation.
Solution:
The values of Q1 and Q3 have already been obtained in Example 6.
Q1 = 51.25 and Q 3 = 69.0625
The value of the median can be found as follows:
Median Point = 50/2 = 25
A "less than" cumulative frequency table for this distribution is
as follows:
Cumulative Distribution
Class CF
Less than 30 0
Less than 40 3
Less than 50 11
Less than 60 23
Less than 70 39
Less than 80 45
Less than 90 50
\[ \text{Median Point} = 50\left(\frac{1}{2}\right) = 25 \]
Cost Fi
L.T. 60 23
L.T. m 25
L.T. 70 39
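\[ \frac{m - 60}{70 - 60} = \frac{25 - 23}{39 - 23} \quad\text{gives}\quad \frac{m - 60}{10} = \frac{2}{16} \]
\[ \text{Median} = 60 + 10\left(\frac{2}{16}\right) = 61.25 \]
\[ QCV = \frac{Q_3 - Q_1}{2 \times \text{Median}} \times 100\% = \frac{69.0625 - 51.25}{2 \times 61.25} \times 100\% = 14.54\% \]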
Example 20:
For the houses’ prices distribution in Example 16, find the quartile
coefficient of variation.
Solution:
Class CF
Less than 10 0
Less than 20 14
Less than 30 32
Less than 40 44
Less than 50 48
Less than 60 50
\[ Q_1 \text{ Point} = 50\left(\frac{1}{4}\right) = 12.5 \]
Time Fi
L.T. 10 0
L.T. Q1 12.5
L.T. 20 14
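\[ \frac{Q_1 - 10}{20 - 10} = \frac{12.5 - 0}{14 - 0} \quad\text{gives}\quad \frac{Q_1 - 10}{10} = \frac{12.5}{14} \]
\[ Q_1 = 10 + 10\left(\frac{12.5}{14}\right) = 18.93 \]
\[ \text{Median Point} = 50\left(\frac{1}{2}\right) = 25 \]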
Time Fi
L.T. 20 14
L.T. m 25
L.T. 30 32
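\[ \frac{m - 20}{30 - 20} = \frac{25 - 14}{32 - 14} \quad\text{gives}\quad \frac{m - 20}{10} = \frac{11}{18} \]
\[ \text{Median} = 20 + 10\left(\frac{11}{18}\right) = 26.11 \]
\[ Q_3 \text{ Point} = 50\left(\frac{3}{4}\right) = 37.5 \]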
Time Fi
L.T. 30 32
L.T. Q3 37.5
L.T. 40 44
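\[ \frac{Q_3 - 30}{40 - 30} = \frac{37.5 - 32}{44 - 32} \quad\text{gives}\quad \frac{Q_3 - 30}{10} = \frac{5.5}{12} \]
\[ Q_3 = 30 + 10\left(\frac{5.5}{12}\right) = 34.58 \]
\[ QCV = \frac{Q_3 - Q_1}{2 \times \text{Median}} \times 100\% = \frac{34.58 - 18.93}{2 \times 26.11} \times 100\% = 29.97\% \]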
group who took the same test. We can’t compare their scores to
people who were not in the group that we tested. Fortunately, we
can transform raw scores to standard scores. When we
standardize scores, we can compare scores for different groups
of people, and we can compare scores on different tests.
The foundational standard score in measurement is the Z-score.
A Z-score is based on the normal, bell-shaped curve and is
formed from deviation scores. A deviation score is the difference
between any one score and the mean (xi − x̄). If this deviation
score is divided by the standard deviation (SD) for that group of
scores, we have transformed the raw score into a Z-score. The
formula of the Z-score is given as follows.
Definition:
Standardized Value (Z-Score):
A Z-score is the deviation of a score from the mean
expressed in standard deviation units:
\[ Z_i = \frac{x_i - \bar{x}}{s} \]
Solution:
\[ Z_{Stat} = \frac{x_{Stat} - \bar{x}_1}{s_1} = \frac{75 - 67}{4} = 2 \]
Z_Stat indicates that the student’s score lies 2 standard deviations
above the mean.
\[ Z_{Math} = \frac{x_{Math} - \bar{x}_2}{s_2} = \frac{80 - 71}{6} = 1.5 \]
Z_Math indicates that the student’s score lies 1.5 standard
deviations above the mean.
Although the student’s Math score is higher than his Statistics
score, the Z-scores indicate that he has better relative
performance in Statistics than in Mathematics.
Example 22:
You take an Economics test that has a mean of 80 with a standard
deviation of 6. What grade did you earn if your Z-score was 1.5?
Solution:
\[ Z_{Eco.} = \frac{x_{Eco.} - \bar{x}}{s} \]
\[ 1.5 = \frac{x_{Eco.} - 80}{6} \quad\text{gives}\quad x_{Eco.} - 80 = 1.5 \times 6 \]
\[ x_{Eco.} = 80 + (1.5 \times 6) = 89 \]
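The Z-score and its inverse are one-line computations. The Python sketch below is illustrative only; the function names `z_score` and `raw_score` are hypothetical, and the inputs are the scores, means and standard deviations used in the two examples above.

```python
def z_score(x, mean, sd):
    """Deviation of x from the mean, in standard deviation units."""
    return (x - mean) / sd


def raw_score(z, mean, sd):
    """Inverse transformation: recover the raw score from a Z-score."""
    return mean + z * sd


z_stat = z_score(75, 67, 4)      # Statistics score
z_math = z_score(80, 71, 6)      # Mathematics score
economics = raw_score(1.5, 80, 6)

print(z_stat, z_math, economics)
```

Comparing `z_stat` with `z_math`, rather than the raw scores, is what allows performance on two different tests to be ranked.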
4.4 Coefficient of Skewness:
Skewness was described in the previous chapter and it is shown
that the degree of skewness could be measured by the difference
between the mean and median or the difference between mean
and mode. However, for most practical purposes, it is usual to
require a measure of skewness to be unit-free (i.e., a coefficient)
and the following expression, known as Pearson’s measure of
skewness (P) is of this type.
Pearson’s Measures of Skewness
\[ P_1 = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}} \]
\[ P_2 = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} \]
Note that:
P < 0 shows there is left or negative skewness.
P = 0 signifies no skewness (symmetric distribution).
P > 0 means there is right or positive skewness.
The higher the absolute value of the coefficient of skewness, the
more asymmetric the distribution is. Pearson’s coefficients
of skewness lie between -3 and +3 (−3 ≤ P1 ≤ +3)
and (−3 ≤ P2 ≤ +3). The distribution is almost symmetric
if |P1| and |P2| are less than or equal to 0.5.
Example 23:
For the frequency distribution given in Example 12, Find
Pearson’s coefficients of skewness.
Solution:
The following results were obtained:
Mean = 42.6, and standard deviation = 5.147
Mode: The modal class is (40 – 45), therefore
35- 40- 45-50
5 22 18
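Lm = 40 , fm = 22 , fp = 5 , fs = 18 , Wm = 5
\[ \text{Mode} = L_m + \frac{f_m - f_p}{(f_m - f_p) + (f_m - f_s)} \times W_m = 40 + \frac{22 - 5}{(22 - 5) + (22 - 18)} \times 5 \]
\[ = 40 + \left(\frac{17}{17 + 4}\right) \times 5 = 44.05 \]
\[ \text{Median Point} = 50\left(\frac{1}{2}\right) = 25 \]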
Class CF
Less than 25 0
Less than 30 2
Less than 35 5
Less than 40 10
Less than 45 32
Less than 50 50
Time Fi
L.T. 40 10
L.T. m 25
L.T. 45 32
\[ \frac{m - 40}{45 - 40} = \frac{25 - 10}{32 - 10} \quad\text{gives}\quad \frac{m - 40}{5} = \frac{15}{22} \]
\[ \text{Median} = 40 + 5\left(\frac{15}{22}\right) = 43.41 \]
Pearson’s coefficients of skewness:
\[ P_1 = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}} = \frac{3(42.6 - 43.41)}{5.147} = -0.472 \]
\[ P_2 = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} = \frac{42.6 - 44.05}{5.147} = -0.282 \]
So, the frequency distribution is left (negatively) skewed.
The distribution can be considered almost symmetric
because the absolute value of each coefficient is less than 0.5.
Example 24:
Calculate Pearson’ coefficients of skewness from the following
frequency distribution
Payment of
10- 12- 14- 16- 18- 20- 22-24 Total
Commission*
No. of
7 15 18 20 25 10 5 100
salesmen
* Payments of commission are given in hundreds of dollars.
Solution:
Commission f x d = x – 17 D = d /2 Df D 2f
10- 7 11 -6 -3 -21 63
12- 15 13 -4 -2 -30 60
14- 18 15 -2 -1 -18 18
16- 20 17 0 0 0 0
18- 25 19 2 1 25 25
20- 10 21 4 2 20 40
22-24 5 23 6 3 15 45
Total 100 - - - -9 251
Mean:
\[ \bar{x} = a + b\left(\frac{\sum D_i f_i}{\sum f_i}\right) = 17 + 2\left(\frac{-9}{100}\right) = 16.82 \]
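Since
\[ SD_D = \sqrt{\frac{\sum_{i=1}^{n} D_i^2 f_i}{\sum_{i=1}^{n} f_i} - \left(\frac{\sum_{i=1}^{n} D_i f_i}{\sum_{i=1}^{n} f_i}\right)^2} = \sqrt{\frac{251}{100} - \left(\frac{-9}{100}\right)^2} = 1.582 \]
the standard deviation in the original units is s = b × SD_D = 2 × 1.582 = 3.164.
Mode: The modal class is (18 – 20), therefore
Lm = 18 , fm = 25 , fs = 10 , fp = 20 , Wm = 2
\[ \text{Mode} = L_m + \frac{f_m - f_p}{(f_m - f_p) + (f_m - f_s)} \times W_m = 18 + \frac{25 - 20}{(25 - 20) + (25 - 10)} \times 2 \]
\[ = 18 + \left(\frac{5}{5 + 15}\right) \times 2 = 18.5 \]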
Median:
Class CF
Less than 10 0
Less than 12 7
Less than 14 22
Less than 16 40
Less than 18 60
Less than 20 85
Less than 22 95
Less than 24 100
\[ \text{Median Point} = 100\left(\frac{1}{2}\right) = 50 \]
Time Fi
L.T. 16 40
L.T. m 50
L.T. 18 60
\[ \frac{m - 16}{18 - 16} = \frac{50 - 40}{60 - 40} \quad\text{gives}\quad \frac{m - 16}{2} = \frac{10}{20} \]
\[ \text{Median} = 16 + 2\left(\frac{10}{20}\right) = 17 \]
Pearson’ skewness coefficients:
3(Mean − Median) 3(16.82 − 17)
P1 = = = −0.17
Standard Deviation 3.164
The distribution is almost symmetric.
Mean − Mode 16.82 − 18.50
P2 = = = -0.53
Standard Deviation 3.164
Weak negative skewness.
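The steps of Example 24 can be sketched in a few lines of Python. This is only an illustrative check, not part of the original solution: the variable names are assumptions, and the mean and standard deviation are computed directly from class midpoints rather than by the coded (step-deviation) method, so the results agree with the worked values only up to rounding.

```python
import math

# Example 24 data: class lower limits (hundreds of dollars), class width, frequencies.
lowers = [10, 12, 14, 16, 18, 20, 22]
width = 2
freqs = [7, 15, 18, 20, 25, 10, 5]

n = sum(freqs)
mids = [lo + width / 2 for lo in lowers]

# Mean and standard deviation from the class midpoints.
mean = sum(f * x for f, x in zip(freqs, mids)) / n
sd = math.sqrt(sum(f * x * x for f, x in zip(freqs, mids)) / n - mean ** 2)

# Mode: interpolate inside the class with the largest frequency.
m = freqs.index(max(freqs))
f_prev = freqs[m - 1] if m > 0 else 0
f_next = freqs[m + 1] if m < len(freqs) - 1 else 0
mode = lowers[m] + (freqs[m] - f_prev) / ((freqs[m] - f_prev) + (freqs[m] - f_next)) * width

# Median: interpolate inside the class that contains the (n/2)-th observation.
cum = 0
for lo, f in zip(lowers, freqs):
    if cum + f >= n / 2:
        median = lo + (n / 2 - cum) / f * width
        break
    cum += f

p1 = 3 * (mean - median) / sd   # median-based coefficient
p2 = (mean - mode) / sd         # mode-based coefficient
print(round(mean, 2), round(sd, 3), mode, median)
print(round(p1, 2), round(p2, 2))
```

Both coefficients come out negative, which matches the conclusion of weak negative skewness.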
For a symmetric distribution, the median lies exactly halfway
between the other two quartiles. If a distribution is skewed to the
right (positively skewed), the median is pulled closer to 𝐐𝟏 (or
pulled closer to 𝐐𝟑 for negative skewness) and this relationship
enables the following coefficient to be derived for measuring
skewness.
Quartile Coefficient of Skewness (Bowley’s Coefficient):

$$QCS = \frac{(Q_3 - \text{median}) - (\text{median} - Q_1)}{Q_3 - Q_1}$$
Note that:
• QCS < 0 shows there is left or negative skewness.
• QCS = 0 signifies no skewness (symmetric distribution).
• QCS > 0 means there is right or positive skewness.
Example 25:
For the frequency distribution given in Example 3, find the quartile
coefficient of skewness.
Solution:
The values of Q1, median and Q3 were found to be:
Q1 = 51.25 (Example 6) , Median = 61.25 (Example 19) , Q3 = 69.0625 (Example 6)

$$QCS = \frac{(Q_3 - \text{median}) - (\text{median} - Q_1)}{Q_3 - Q_1} = \frac{(69.0625 - 61.25) - (61.25 - 51.25)}{69.0625 - 51.25} = -0.12$$
So, the frequency distribution is left (negatively) skewed. It can be
considered as an almost symmetric distribution.
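As a quick check of Example 25, the quartile coefficient of skewness can be written as a one-line function. The function name `bowley_skewness` is an illustrative assumption, not notation from the text.

```python
def bowley_skewness(q1, median, q3):
    """Quartile (Bowley) coefficient of skewness."""
    return ((q3 - median) - (median - q1)) / (q3 - q1)

# Values quoted in Example 25.
qcs = bowley_skewness(q1=51.25, median=61.25, q3=69.0625)
print(round(qcs, 2))
```

Because the coefficient depends only on the three quartiles, it is not affected by extreme values in the tails.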
Example 26:
Calculate the quartile coefficient of skewness from the following
frequency distribution:
Monthly Salary+    10-  12-  14-  16-  18-  20-  22-  24-  26-28  Total
No. of Employees     5   14   23   50   52   25   22    7      2    200
+ Monthly salary in hundreds of L.E.
Solution:
To find the values of Q1, Q3, and median, a "less-than"
cumulative frequency distribution was found to be as follows.
Class CF
Less than 10 0
Less than 12 5
Less than 14 19
Less than 16 42
Less than 18 92
Less than 20 144
Less than 22 169
Less than 24 191
Less than 26 198
Less than 28 200
$$Q_1 \text{ point} = 200 \left( \frac{1}{4} \right) = 50$$
Salary Fi
L.T. 16 42
L.T. Q1 50
L.T. 18 92
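$$\frac{Q_1 - 16}{18 - 16} = \frac{50 - 42}{92 - 42} \quad \text{gives} \quad \frac{Q_1 - 16}{2} = \frac{8}{50}$$

$$Q_1 = 16 + 2 \left( \frac{8}{50} \right) = 16.32$$

$$\text{Median point} = 200 \left( \frac{1}{2} \right) = 100$$

Salary    F
L.T. 18   92
L.T. m    100
L.T. 20   144

$$\frac{m - 18}{20 - 18} = \frac{100 - 92}{144 - 92} \quad \text{gives} \quad \frac{m - 18}{2} = \frac{8}{52}$$

$$\text{Median} = 18 + 2 \left( \frac{8}{52} \right) = 18.3077$$

$$Q_3 \text{ point} = 200 \left( \frac{3}{4} \right) = 150$$

Salary     F
L.T. 20    144
L.T. Q3    150
L.T. 22    169

$$\frac{Q_3 - 20}{22 - 20} = \frac{150 - 144}{169 - 144} \quad \text{gives} \quad \frac{Q_3 - 20}{2} = \frac{6}{25}$$

$$Q_3 = 20 + 2 \left( \frac{6}{25} \right) = 20.48$$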
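The interpolation used for Q1, the median and Q3 follows one pattern, so it can be sketched as a single helper. The function name `grouped_quantile` and its argument layout are assumptions made for illustration; the data are those of Example 26.

```python
def grouped_quantile(lowers, width, freqs, p):
    """Interpolate the value below which a fraction p of the observations fall."""
    target = p * sum(freqs)          # e.g. n/4 for Q1, n/2 for the median
    cum = 0                          # "less-than" cumulative frequency so far
    for lo, f in zip(lowers, freqs):
        if f > 0 and cum + f >= target:
            return lo + width * (target - cum) / f
        cum += f
    raise ValueError("p must lie between 0 and 1")

# Example 26: monthly salary classes (hundreds of L.E.) and numbers of employees.
lowers = [10, 12, 14, 16, 18, 20, 22, 24, 26]
freqs = [5, 14, 23, 50, 52, 25, 22, 7, 2]

q1 = grouped_quantile(lowers, 2, freqs, 0.25)
median = grouped_quantile(lowers, 2, freqs, 0.50)
q3 = grouped_quantile(lowers, 2, freqs, 0.75)
print(round(q1, 4), round(median, 4), round(q3, 4))
```

The three values can then be substituted into the quartile coefficient of skewness formula given earlier.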
Exercises
1- Given below are 14 statements. Indicate in each case whether
the statement is True or False:
(a) The difference between the largest and the smallest
observations is called the quartile range.
(b) The dispersion in a series indicates the reliability of the
measure of central tendency.
(c) The interquartile range is based on only two values
contained in a series.
(d) The square root of the variance gives the standard
deviation.
(e) The coefficient of variation is not a relative measure of
dispersion.
(f) In measuring dispersion, the standard deviation is more
frequently used than the mean deviation.
(g) A major limitation of the range is that it ignores the large
number of observations in a series.
(h) Even for an open-ended distribution, it is possible to
measure the range.
(i) While calculating variance, every observation in a series is
considered.
(j) The coefficient of variation is measured in the same units
as the observations in a series.
(k) In a skewed distribution, the mean, median and mode do
not have the same value.
(l) In a frequency distribution, if a curve has a longer tail to the
right, then it is negatively skewed.
(m) In a positively skewed curve, mean < median < mode.
(n) The median does not always lie between the mean and
the mode in a skewed distribution.
2- Multiple Choice Questions:
2.1 Which of the following statements is not correct in respect
of the range as a measure of dispersion?
(a) It is difficult to calculate.
(b) Only two points in the data set determine it.
(c) It is affected by extreme values.
(d) There may be considerable change in it from one
sample to another.
2.2 If the first and third quartiles are 20.58 and 60.38,
respectively, then the quartile deviation is
(a) 39.8 (b) 30.3 (c) 19.9 (d) None of the above
2.3 A series has its mean as 15 and its coefficient of variation
as 20, its standard deviation is
(a) 5 (b) 10 (c) 3 (d) 7
2.4 If the first and third quartiles in a series are 15 and 35, then
the semi-inter-quartile range is
(a) 30 (b) 20 (c) 10 (d) None of the above
2.5 Which of the following is a measure of relative dispersion?
(a) Variance (b) Coefficient of variation
(c) Standard deviation (d) All of these
2.6 The square of the variance of a distribution is the
(a) Absolute deviation (b) Mean
(c) Standard deviation (d) None of these
2.7 Which of the following is not true in respect of mean
deviation?
(a) It is simple to understand
(b) It considers each and every item in a series.
(c) It is capable of further algebraic treatment.
(d) The extreme items have less effect on its magnitude.
2.8 Which one is the formula for relative skewness?
(a) Mean = Mode
(b) $(Q_3 - Q_2) - (Q_2 - Q_1)$
(c) $\frac{(Q_3 - \text{median}) - (\text{median} - Q_1)}{Q_3 - Q_1}$
(d) None of the above
2.9 Which of the following relationship is valid in a symmetrical
distribution?
(a) (Median − Q1 ) < (Q 3 − Median)
(b) (Median − Q1 ) > (Q 3 − Median)
(c) (Median − Q1 ) = (Q 3 − Median)
(d) None of these
(b) Which factory shows greater variability in the distribution
of weekly wages?
11- Given below are the daily wages paid to workers in two
factories X and Y.
Daily Number of Workers
Wages Factory X Factory Y
20- 15 25
30- 30 40
40- 44 60
50- 60 35
60- 60 20
70- 14 15
80-90 7 5
Using arithmetic mean and standard deviation, answer the
following questions:
(a) Which factory pays higher average wages? By how
much?
(b) In which factory are wages more variable?
Chapter (5)
Correlation Analysis
Correlation is a technique used to measure the strength of the
relationship between two variables. This chapter describes the
general nature and purpose of correlation and gives techniques
for measuring correlation. Correlation is a statistical measure that
indicates the extent to which two or more variables fluctuate
together. A positive correlation indicates the extent to which those
variables increase or decrease in parallel; a negative correlation
indicates the extent to which one variable increases as the other
decreases. When the fluctuation of one variable reliably predicts
a similar fluctuation in another variable, it is tempting to
conclude that the change in one causes the change in
the other. However, correlation does not imply causation. There
may be an unknown factor that influences both variables similarly.
Table (5.1)
GPA and Achievement Motivation
For A Group of 14 Students
ID GPA Motivation
1 2.0 50
2 2.0 48
3 2.0 100
4 2.0 12
5 2.3 34
6 2.6 30
7 2.6 78
8 3.0 87
9 3.1 84
10 3.2 75
11 3.6 83
12 3.8 90
13 3.8 90
14 4.0 98
Figure 5.1
Scatterplot For Table 5.1
In this example, the relationship between students’ achievement
motivation and their GPA is being investigated. (GPA stands for
“Grade Point Average”, a number that summarizes a student’s
overall academic performance.) Table 5.1 includes a small group of
individuals for whom GPA and scores on a motivation scale have
been recorded. GPAs can range from 0 to 4 and motivation scores
in this example range from 0 to 100. Individuals in this table were
ordered based on their GPA. Simply looking at the table shows
that, in general, as GPA increases, motivation scores also
increase. However, with a real set of data, which may have
hundreds or even thousands of individuals, a pattern cannot be
detected by simply looking at the numbers. Therefore, a very
useful strategy is to represent the two variables graphically to
illustrate the relationship between them. Figure 5.1
is an example of a scatterplot and displays the data from
Table 5.1. GPA scores are displayed on the horizontal axis and
motivational scores are displayed on the vertical axis. Each dot
on the scatterplot represents one individual from the data set. The
location of each point on the graph depends on both the GPA and
motivation scores. Individuals with higher GPAs are located
further to the right and individuals with higher motivation scores
are located higher up on the graph. The purpose of a scatterplot
is to provide a general illustration of the relationship between the
two variables. In this example, in general, as GPA increases so
does an individual’s motivation score.
Interpreting Scatterplots:
As in any graph of data, look for the overall pattern and for striking
departures from that pattern. The overall pattern of a scatterplot
can be described by direction, form, and strength of the
relationship. An important kind of departure is an outlier, an
individual value that falls outside the overall pattern of the
relationship.
One important component to a scatterplot is the direction of the
relationship between two variables.
• Two variables have a positive association when above-average
values of one tend to accompany above-average values of the
other, and when below-average values also tend to occur
together.
• Two variables have a negative association when above-average
values of one tend to accompany below-average values of
the other.
Figure 5.2 compares students’ achievement motivation and their
GPA. These two variables have a positive association because
as GPA increases, so does motivation.
Figure 5.2
Achievement Motivation and GPA for Students
Figure 5.3
GPA and Number of Absences for Students
Figure 5.3 compares students’ GPA and their number of
absences. These two variables have a negative association
because, in general, as a student’s number of absences
decreases, their GPA increases.
Another important component to a scatterplot is the form (type) of
the relationship between the two variables.
Linear Relationship:
Figure 5.2 illustrates a linear relationship. This means that the
points on the scatterplot closely resemble a straight line.
A relationship is linear if one variable changes by approximately
the same amount each time the other variable increases by one unit.
Curvilinear Relationship:
Figure 5.4 illustrates a relationship that has the form of a curve,
rather than a straight line. This is due to the fact that one variable
does not increase at a constant rate and may even start
decreasing after a certain point.
Figure 5.4
The Relationship Between
Age and Working Memory
This example describes a curvilinear relationship between the
variable “age” and the variable “working memory”. In this example,
working memory increases through childhood, remains steady in
adulthood, and begins decreasing around age 50.
Pearson’s Correlation Coefficient Formula:
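$$r = \frac{\left( \frac{\sum_{i=1}^{n} x_i y_i}{n} \right) - \left( \frac{\sum_{i=1}^{n} x_i}{n} \right) \left( \frac{\sum_{i=1}^{n} y_i}{n} \right)}{\sqrt{\left[ \frac{\sum_{i=1}^{n} x_i^2}{n} - \left( \frac{\sum_{i=1}^{n} x_i}{n} \right)^2 \right] \times \left[ \frac{\sum_{i=1}^{n} y_i^2}{n} - \left( \frac{\sum_{i=1}^{n} y_i}{n} \right)^2 \right]}}$$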
Note that:
▪ r is always a number between -1 and 1.
▪ r > 0 indicates a positive (direct) relationship.
▪ r < 0 indicates a negative (inverse) relationship.
▪ The strength of the linear relationship increases as r moves
away from 0 toward -1 or 1.
▪ The extreme values r = -1 and r = 1 occur only in the case of
perfect linear relationship.
• It provides evidence of association, not causation.
• It is affected by outliers.
• It doesn’t change when the units of measurement of x, y, or both
are changed (i.e. it is not affected by adding a constant to the
values of a variable or multiplying them by a positive constant).
Coefficient of Determination ($r^2$):
Definition:
The coefficient of determination gives the proportion of the
variation in the y-values that is explained by the variation
in the x-values.
Example 1:
The following table relates the weekly maintenance cost ($) to the
age (in months) of ten machines of similar type in a manufacturing
company.
Machine 1 2 3 4 5 6 7 8 9 10
Age (x) 5 10 15 20 30 30 30 50 50 60
Cost (y) 190 240 250 300 310 335 300 300 350 395
Calculate the Pearson’s correlation coefficient and the coefficient
of determination.
Solution:
The following table presents the calculations
required for determining the values of the correlation
coefficient and then the coefficient of determination.
From this table, the following results were obtained:

$$n = 10 , \quad \sum_{i=1}^{n} x_i = 300 , \quad \sum_{i=1}^{n} y_i = 2970 , \quad \sum_{i=1}^{n} x_i y_i = 97650 , \quad \sum_{i=1}^{n} x_i^2 = 12050 , \quad \sum_{i=1}^{n} y_i^2 = 913050$$
Machine x y xy x² y²
1 5 190 950 25 36100
2 10 240 2400 100 57600
3 15 250 3750 225 62500
4 20 300 6000 400 90000
5 30 310 9300 900 96100
6 30 335 10050 900 112225
7 30 300 9000 900 90000
8 50 300 15000 2500 90000
9 50 350 17500 2500 122500
10 60 395 23700 3600 156025
Total 300 2970 97650 12050 913050
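$$r = \frac{\frac{97650}{10} - \left( \frac{300}{10} \right) \left( \frac{2970}{10} \right)}{\sqrt{\left[ \frac{12050}{10} - \left( \frac{300}{10} \right)^2 \right] \times \left[ \frac{913050}{10} - \left( \frac{2970}{10} \right)^2 \right]}} = \frac{855}{\sqrt{305 \times 3096}} = 0.88$$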
There is a very strong positive relationship. This means as the age
of the machine increases the cost of its maintenance increases,
and vice versa.
Coefficient of determination: $r^2 = (0.88)^2 = 0.7744$
This means that 77.44% of the variation in maintenance cost is
explained by the age of machines, while 22.56% is due to other
factors. This is according to the linear relationship between the
cost of maintenance and the age of the machine.
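The calculation in Example 1 can be reproduced with a short Python sketch that applies the same formula (averages of products minus products of averages). The function name `pearson_r` is an assumption for illustration. Note that squaring the unrounded r gives a coefficient of determination of about 0.774, slightly below the 0.7744 obtained from the rounded value 0.88.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient from paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    return cov / math.sqrt(var_x * var_y)

# Example 1: machine age (months) and weekly maintenance cost ($).
age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]

r = pearson_r(age, cost)
print(round(r, 2), round(r ** 2, 3))
```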
Example 2:
The following data, obtained from claims drawn on life assurance
policies, relates age at official retirement to age at death.
ID                  1   2   3   4   5   6   7   8   9
Age at Retirement  57  62  60  57  65  60  58  62  56
Age at Death       71  70  66  70  69  67  69  63  70
Solution:
ID x y xy x² y²
1 57 71 4047 3249 5041
2 62 70 4340 3844 4900
3 60 66 3960 3600 4356
4 57 70 3990 3249 4900
5 65 69 4485 4225 4761
6 60 67 4020 3600 4489
7 58 69 4002 3364 4761
8 62 63 3906 3844 3969
9 56 70 3920 3136 4900
Total 537 615 36670 32111 42077
$$n = 9 , \quad \sum_{i=1}^{n} x_i = 537 , \quad \sum_{i=1}^{n} y_i = 615 , \quad \sum_{i=1}^{n} x_i y_i = 36670$$

$$r = \frac{-2.78}{\sqrt{7.78 \times 5.78}} = -0.415$$
Example 3:
A sample of eight employees is taken from the production
department of a light engineering factory. The data which follow
relate to the number of weeks experience in the wiring of
components, and the number of components which were rejected
as unsatisfactory last week.
Employee A B C D E F G H
Weeks of Experience 4 5 7 9 10 11 12 14
Number of Rejects 21 22 15 18 14 14 11 13
Calculate the Pearson’s correlation coefficient and coefficient of
determination for these data and interpret their values.
Solution:
Employee x y xy x² y²
A 4 21 84 16 441
B 5 22 110 25 484
C 7 15 105 49 225
D 9 18 162 81 324
E 10 14 140 100 196
F 11 14 154 121 196
G 12 11 132 144 121
H 14 13 182 196 169
Total 72 128 1069 732 2156
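$$r = \frac{\frac{1069}{8} - \left( \frac{72}{8} \right) \left( \frac{128}{8} \right)}{\sqrt{\left[ \frac{732}{8} - \left( \frac{72}{8} \right)^2 \right] \times \left[ \frac{2156}{8} - \left( \frac{128}{8} \right)^2 \right]}} = \frac{-10.375}{\sqrt{10.5 \times 13.5}} = -0.87$$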
There is a very strong negative relationship. That is, as the
duration of experience increases the number of rejects decreases
and vice versa.
Coefficient of determination: $r^2 = (-0.87)^2 = 0.7569$
This means that 75.69% of the variation in number of rejects is
explained by variation in duration of experience, and 24.31% is
due to factors other than duration of experience. This is
according to the linear relationship between the two variables.
Total 127
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 127}{12(12^2 - 1)} = 0.56$$

There is a moderate positive relationship. That is, as the number
of vehicles increases, the number of road deaths increases.
Example 5:
The following data give information of the ages (in years) and the
number of breakdowns during the past month for 5 machines in a
small company.
Machine No. 1 2 3 4 5
Age (year) 7 2 4 8 9
No. of Breakdowns 5 1 2 5 7
Calculate Spearman’s correlation coefficient.
Solution:
x_i y_i r_x r_y d_i d_i²
7 5 3 3.5 -0.5 0.25
2 1 1 1 0 0
4 2 2 2 0 0
8 5 4 3.5 0.5 0.25
9 7 5 5 0 0
Total 0.5
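The ranking step, including the averaging of tied ranks, can be sketched as follows. The helper names `average_ranks` and `spearman_rs` are illustrative assumptions. The sketch uses the same shortcut formula as the text, $r_s = 1 - 6\sum d_i^2 / (n(n^2 - 1))$, which is commonly treated as an approximation when ties are present.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share the mean of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1     # first position occupied by v
        count = ordered.count(v)         # number of tied observations
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

def spearman_rs(xs, ys):
    """Return (r_s, sum of squared rank differences)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2

# Example 5: machine age (years) and number of breakdowns.
ages = [7, 2, 4, 8, 9]
breakdowns = [5, 1, 2, 5, 7]

ranks_y = average_ranks(breakdowns)
rs, sum_d2 = spearman_rs(ages, breakdowns)
print(ranks_y, sum_d2, rs)
```

The two machines with 5 breakdowns share rank 3.5, and the sum of squared differences reproduces the table total of 0.5.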
Day x y xy x² y² r_x r_y d_i d_i²
1 5 12 60 25 144 1 1 0 0
2 7 14 98 49 196 4.5 4 0.5 0.25
3 6 13 78 36 169 2.5 2.5 0 0
4 8 17 136 64 289 6.5 7 -0.5 0.25
5 9 19 171 81 361 8 8 0 0
6 7 16 112 49 256 4.5 5.5 -1 1
7 8 16 128 64 256 6.5 5.5 1 1
8 6 13 78 36 169 2.5 2.5 0 0
Total 56 120 861 404 1840 - - 0 2.5
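$$r = \frac{\frac{861}{8} - \left( \frac{56}{8} \right) \left( \frac{120}{8} \right)}{\sqrt{\left[ \frac{404}{8} - \left( \frac{56}{8} \right)^2 \right] \times \left[ \frac{1840}{8} - \left( \frac{120}{8} \right)^2 \right]}} = \frac{2.625}{\sqrt{1.5 \times 5}} = 0.96$$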
The value of rs is very close to the value of r which indicates that
data of x and y do not contain outliers. There is a very strong
positive relationship. That is, as the number of workers increases,
the daily production increases.
a b
c d
Then:
Yule’s Coefficient of Association:
$$Q = \frac{ad - bc}{ad + bc}$$
Example 7:
An insurance company is interested in determining whether there
is a relationship between automobile accident frequency and
cigarette smoking. It randomly sampled 36 policyholders and
came up with the following data:
Cigarette Smoking   Number of Accidents   Total
                      0         1
Smokers               8         6          14
Nonsmokers           12        10          22
Total                20        16          36
Solution:
$$Q = \frac{ad - bc}{ad + bc} = \frac{(10 \times 25) - (5 \times 10)}{(10 \times 25) + (5 \times 10)} = 0.67$$

There is a strong relationship between the time of purchase and
purchase size. From the table, 50% of those who buy in the
evening spent more than $20 on the purchase, while the
percentage reaches 83.33% for purchases made at night. This
means that buyers spend more on their purchases during the
night than they spend in the evening.
Example 9:
Two samples, one of 200 students from urban secondary schools
and another of 250 students from rural secondary schools, were
taken. These students were asked if they have ever smoked. The
following table lists the summary of the results.
Type of Car   Requiring Repair   Not Requiring Repair   Total
A                   11                   89              100
B                   13                   57               70
Total               24                  146              170
Find the value of Yule’s coefficient of association and interpret
your result.
Solution:
$$Q = \frac{ad - bc}{ad + bc} = \frac{(11 \times 57) - (89 \times 13)}{(11 \times 57) + (89 \times 13)} = -0.297$$

There is a weak relationship between the type of car and the need for
repair. The need for repair depends only weakly on
whether the car is of type A or B. It is clear from the table that type
B requires repair at a higher rate than type A
(18.6% and 11%, respectively).
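Yule's coefficient needs only the four cell counts of a 2 × 2 table, so the car-repair calculation can be checked with a minimal sketch. The function name `yules_q` is an assumption for illustration; the cells are labeled a, b (first row) and c, d (second row), as in the formula.

```python
def yules_q(a, b, c, d):
    """Yule's coefficient of association for a 2 x 2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / (a * d + b * c)

# Car-repair table: rows are car types A and B,
# columns are "requiring repair" and "not requiring repair".
q = yules_q(a=11, b=89, c=13, d=57)
print(round(q, 3))

# Row percentages behind the interpretation.
share_a = 11 / 100
share_b = 13 / 70
print(round(100 * share_a, 1), round(100 * share_b, 1))
```

The sign of Q depends on how the rows and columns are ordered, so the strength of the association is read from its absolute value.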
χ² Formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Computational Formula for χ²:

$$\chi^2 = \sum \frac{O^2}{E} - N$$

Where
E: Expected value

$$E = \frac{\text{Row Marginal} \times \text{Column Marginal}}{N}$$

O: Observed value
N: Table total
Example 10:
A sample of 200 units produced by a machine were classified as
good or defective and by the shift on which they were produced.
The results are reported in the following table.
Quality      Shift                       Total
             First   Second   Third
Good           76      64       40        180
Defective       4       6       10         20
Total          80      70       50        200
Is there evidence of a relationship between the quality of the units
and the shift in which they were produced? If yes, to what extent?
Solution:
Expected values:

$$E = \frac{\text{Row Marginal} \times \text{Column Marginal}}{N}$$

$$E_{11} = \frac{180 \times 80}{200} = 72 , \quad E_{12} = \frac{180 \times 70}{200} = 63 , \quad E_{13} = \frac{180 \times 50}{200} = 45$$

$$E_{21} = \frac{20 \times 80}{200} = 8 , \quad E_{22} = \frac{20 \times 70}{200} = 7 , \quad E_{23} = \frac{20 \times 50}{200} = 5$$

Chi-Square:

$$\chi^2 = \sum \frac{O^2}{E} - N = \frac{76^2}{72} + \frac{64^2}{63} + \frac{40^2}{45} + \frac{4^2}{8} + \frac{6^2}{7} + \frac{10^2}{5} - 200 = 7.937$$

Cramer’s Coefficient:

$$V = \sqrt{\frac{\chi^2}{N(s - 1)}} = \sqrt{\frac{7.937}{200(2 - 1)}} = 0.199$$

There is a weak relationship between the quality of the product and
the shift of production. This means that the quality of the product
does not depend, to a large extent, on the timing of the shift in
which it was produced.
Note: Sum of Expected Frequencies and Sum of Observed
Frequencies are equal for any row or column.
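The expected values, chi-square and Cramer's coefficient of Example 10 can be computed together in one pass over the table. The sketch below is an illustration under two assumptions: the function name `chi_square_and_cramers_v` is hypothetical, and s is taken as the smaller of the number of rows and the number of columns, which agrees with the values s = 2 and s = 3 used in the worked examples.

```python
import math

def chi_square_and_cramers_v(table):
    """Return (chi-square, Cramer's V) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    s = min(len(table), len(table[0]))   # assumed meaning of s
    return chi2, math.sqrt(chi2 / (n * (s - 1)))

# Example 10: rows are Good / Defective, columns are the three shifts.
quality_by_shift = [[76, 64, 40],
                    [4, 6, 10]]
chi2, v = chi_square_and_cramers_v(quality_by_shift)
print(round(chi2, 3), round(v, 3))
```

The definitional form $\sum (O - E)^2 / E$ is used here; it gives the same result as the computational form $\sum O^2 / E - N$.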
Example 11:
A study regarding the relationship between age and the amount
of pressure sales personnel feel in relation to their jobs revealed
the following sample information.
Age (years)     Degree of Job Pressure     Total
                Low    Medium    High
Less than 25     70      20       10        100
25 - 40          70      50       80        200
40 - 60          10      30      160        200
Total           150     100      250        500
Solution:
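Expected values:

$$E = \frac{\text{Row Marginal} \times \text{Column Marginal}}{N}$$

$$E_{11} = \frac{100 \times 150}{500} = 30 , \quad E_{12} = \frac{100 \times 100}{500} = 20 , \quad E_{13} = \frac{100 \times 250}{500} = 50$$

$$E_{21} = \frac{200 \times 150}{500} = 60 , \quad E_{22} = \frac{200 \times 100}{500} = 40 , \quad E_{23} = \frac{200 \times 250}{500} = 100$$

$$E_{31} = \frac{200 \times 150}{500} = 60 , \quad E_{32} = \frac{200 \times 100}{500} = 40 , \quad E_{33} = \frac{200 \times 250}{500} = 100$$

Chi-square:

$$\chi^2 = \sum \frac{O^2}{E} - N = \left( \frac{70^2}{30} + \frac{20^2}{20} + \frac{10^2}{50} + \frac{70^2}{60} + \frac{50^2}{40} + \frac{80^2}{100} + \frac{10^2}{60} + \frac{30^2}{40} + \frac{160^2}{100} \right) - 500 = 173.67$$

Cramer’s Coefficient:

$$V = \sqrt{\frac{\chi^2}{N(s - 1)}} = \sqrt{\frac{173.67}{500(3 - 1)}} = 0.417$$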
Solution:
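Expected values:

$$E = \frac{\text{Row Marginal} \times \text{Column Marginal}}{N}$$

$$E_{11} = \frac{120 \times 76}{200} = 45.6 , \quad E_{12} = \frac{120 \times 36}{200} = 21.6 , \quad E_{13} = \frac{120 \times 88}{200} = 52.8$$

$$E_{21} = \frac{80 \times 76}{200} = 30.4 , \quad E_{22} = \frac{80 \times 36}{200} = 14.4 , \quad E_{23} = \frac{80 \times 88}{200} = 35.2$$

Chi-square:

$$\chi^2 = \sum \frac{O^2}{E} - N = \left( \frac{12^2}{45.6} + \frac{24^2}{21.6} + \frac{84^2}{52.8} + \frac{64^2}{30.4} + \frac{12^2}{14.4} + \frac{4^2}{35.2} \right) - 200 = 108.65$$

Cramer’s Coefficient:

$$V = \sqrt{\frac{\chi^2}{N(s - 1)}} = \sqrt{\frac{108.65}{200(2 - 1)}} = 0.737$$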
Exercises
Multiple Choice Questions (1- 6)
1- The unit of correlation coefficient between height in feet and
weight in kgs is
(i) kg/feet
(ii) percentage
(iii) non-existent
2- The range of simple correlation coefficient is
(i) 0 to infinity
(ii) minus one to plus one
(iii) minus infinity to infinity
3- If r is positive the relation between X and Y is of the type
(i) When Y increases X increases
(ii) When Y decreases X increases
(iii) When Y increases X does not change
4- If r = 0 the variable X and Y are
(i) linearly related
(ii) not linearly related
(iii) independent
5- Of the following three measures which one can measure any
type of relationship
(i) Karl Pearson’s coefficient of correlation
(ii) Spearman’s rank correlation
(iii) Scatter diagram
(iv) None of these
6- If precisely measured data are available, the simple correlation
coefficient is
(i) more accurate than rank correlation coefficient.
(ii) less accurate than rank correlation coefficient.
(iii) as accurate as the rank correlation coefficient.
7- Can r lie outside the –1 and 1 range depending on the type
of data?
8- Does correlation imply causation?
9- When is rank correlation more precise than simple correlation
coefficient?
10- Does zero correlation mean independence?
11- Can simple correlation coefficient measure any type of
relationship?
12- Interpret the values of r as 1, –1 and 0.
13- Why does rank correlation coefficient differ from Pearson’s
correlation coefficient?
14- Calculate Pearson’s correlation coefficient between the
heights of fathers in inches (X) and the heights of their
sons (Y).
x 65 66 57 67 68 69 70 72
y 67 56 65 68 72 72 69 71
17- The table below shows the number of absences, x, in
a Statistics course and the final exam grade, y, for 7 students.
Find the correlation coefficient and interpret your result.
x 1 0 2 6 4 3 3
y 95 90 90 55 70 80 85
23- Of a group of patients who complained that they did not sleep
well, some were given sleeping pills while others were given
sugar pills (although they all thought they were getting
sleeping pills). They were later asked whether the pills helped
them or not.
Pills            Mode of Sleep                     Total
                 Slept Well   Did Not Sleep Well
Sleeping Pills       44              10              54
Sugar Pills          81              35             116
Total               125              45             170
Does the sample provide sufficient information to conclude
that there is a relationship between the type of pill and
mode of sleep?
three categories are in favor of policy (F), against the
policy (A), and indifferent toward the policy (I).
Occupation       Reaction          Total
                 F     A     I
Doctors 80 30 10 120
Advocates 70 40 40 150
Univ. Teachers 50 50 30 130
Total 200 120 80 400
Can you conclude that there is a relationship between
occupation and respondents’ reaction toward the policy?
Explain
Chapter (6)
Regression Analysis
Regression analysis concerns the study of relationships
between variables with the object of identifying, estimating and
validating the relationship. The estimated relationship can then be
used to predict one variable from the value of the other variable(s).
A regression problem involving a single predictor (also called
simple regression) arises when we wish to study the relation
between two variables x and y and use it to predict y from x. The
variable x acts as an independent variable. The variable y
depends on x and is also subjected to unaccountable variations
or errors.
Example 1:
Airline Cost
Can the cost of flying a commercial airliner be predicted using
regression analysis? If so, what variables are related to
such cost?
A few of the many variables that can potentially contribute are the type of
plane, distance, number of passengers, amount of luggage,
weather conditions, direction of destination, and perhaps even
pilot skill. Suppose a study is conducted using only Boeing 737s
traveling 500 miles on comparable routes during the same season
of the year. Can the number of passengers predict the cost of
flying such routes? It seems logical that more passengers result in more
weight and more baggage, which could, in turn, result in increased
fuel consumption and other costs. Suppose the data displayed in
Table 6.1 are the costs and associated number of passengers for
twelve 500-mile commercial airline flights using Boeing 737s
during the same season of the year. We will use these data to
develop a regression model to predict cost by number
of passengers.
Table (6.1)
Airline Cost Data
Number of Passengers (x)   Cost (y)
61 4280
63 4080
67 4420
69 4170
70 4480
74 4300
76 4820
81 4700
86 5110
91 5130
95 5640
97 5560
Figure (6.2)
Graph of Straight line 𝐲 = 𝐚 + 𝐛𝐱
General Form of Linear Regression Equation:

$$\hat{y}_i = a + b x_i$$

Where:
• $\hat{y}_i$, read "y hat", is the predicted value of the y variable for
a selected x value.
• a is the y-intercept. It is the value of y where the regression
line crosses the y-axis, that is, when x = 0.
• b is the slope of the line, or the average change in ŷ for each
change of one unit (either increase or decrease) in the
independent variable x. It is also called "regression
coefficient".
• xi is any value of the independent variable that is selected.
Note that:
• The regression coefficient (b) and correlation coefficient are of
the same sign.
• If b > 0, there is a positive relationship between y and x.
• If b < 0, there is a negative relationship between y and x.
The Principle of Least Squares:
The difference between each value $y_i$ of the dependent variable
and its predicted value $\hat{y}_i$ is called an error. We choose the
estimates a and b to be the values that minimize the distances of
the data points to the fitted line. So, we would like to minimize the
sum of the squared distances of each observed response to its
fitted value. That is, we want to minimize the error sum
of squares:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Example 2:
A researcher is interested in the relationship between the total
cost ($1000) and the output (1000 units) of a certain product
over a period of 5 weeks, yielding the following data.
Output (x)       4   5   3   6   8
Total cost (y)  40  55  35  70  75
a- Determine the regression equation.
b- Find the predicted cost for producing 7 (thousand) units.
Solution:
Week x y xy x² y²
1 4 40 160 16 1600
2 5 55 275 25 3025
3 3 35 105 9 1225
4 6 70 420 36 4900
5 8 75 600 64 5625
Total 26 275 1560 150 16375
(a) Slope of the regression line (Regression Coefficient):

$$b = \frac{\left( \frac{1560}{5} \right) - \left( \frac{26}{5} \right) \left( \frac{275}{5} \right)}{\frac{150}{5} - \left( \frac{26}{5} \right)^2} = \frac{26}{2.96} = 8.784$$

So, the average change in cost per one-unit change in output is 8.784.
y-intercept:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{26}{5} = 5.2 , \quad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{275}{5} = 55$$

$$a = \bar{y} - b \bar{x} = 55 - 8.784(5.2) = 9.323$$

So, the cost when there is no output (i.e. fixed costs) is 9.323
($1000).
Regression equation:

$$\hat{y}_i = 9.323 + 8.784 x_i$$

(b) Predicted cost for producing 7 (thousand) units:

$$\hat{y}_i = 9.323 + 8.784(7) = 70.811 \ (\$1000)$$
Figure 6.3
Scatterplot of Relationship Between Output and Cost
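The least-squares estimates of Example 2 can be reproduced with a short sketch that uses the same formulas for b and a. The function name `fit_line` is an assumption for illustration. Because the sketch keeps b unrounded, the intercept comes out as about 9.324 rather than 9.323; the prediction at x = 7 still rounds to 70.811.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y) / (
        sum(x * x for x in xs) / n - mean_x ** 2
    )
    a = mean_y - b * mean_x
    return a, b

# Example 2: output (1000 units) and total cost ($1000) over five weeks.
output = [4, 5, 3, 6, 8]
cost = [40, 55, 35, 70, 75]

a, b = fit_line(output, cost)
predicted = a + b * 7   # predicted cost when output is 7 (thousand units)
print(round(a, 3), round(b, 3), round(predicted, 3))
```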
Example 3:
Using data of Example 2 in the previous chapter:
(a) Determine the regression of age at death on age of retirement.
(b) Predict the age of death for an age of retirement of 65 years.
Solution:
Calculations required were already obtained in
Example 2 in the previous chapter as follows:
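$$n = 9 , \quad \sum_{i=1}^{n} x_i = 537 , \quad \sum_{i=1}^{n} y_i = 615 , \quad \sum_{i=1}^{n} x_i y_i = 36670 , \quad \sum_{i=1}^{n} x_i^2 = 32111$$

(a)

$$b = \frac{\left( \frac{\sum_{i=1}^{n} x_i y_i}{n} \right) - \left( \frac{\sum_{i=1}^{n} x_i}{n} \right) \left( \frac{\sum_{i=1}^{n} y_i}{n} \right)}{\frac{\sum_{i=1}^{n} x_i^2}{n} - \left( \frac{\sum_{i=1}^{n} x_i}{n} \right)^2} = \frac{-2.78}{7.78} = -0.36$$

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{537}{9} = 59.67 , \quad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{615}{9} = 68.33$$

$$a = \bar{y} - b \bar{x} = 68.33 - (-0.36)(59.67) = 89.81$$

Regression equation:

$$\hat{y}_i = 89.81 - 0.36 x_i$$

(b) Predicted age at death for a retirement age of 65 years:

$$\hat{y}_i = 89.81 - 0.36(65) = 66.41 \text{ years}$$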
Example 4:
For data given in Example 3 in the previous chapter:
(a) Determine the regression of number of rejects on weeks of experience.
(b) Predict the number of rejects for Employee E.
Solution:
Calculations required were already obtained in Example 3 in the
previous chapter as follows:
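$$n = 8 , \quad \sum_{i=1}^{n} x_i = 72 , \quad \sum_{i=1}^{n} y_i = 128 , \quad \sum_{i=1}^{n} x_i y_i = 1069 , \quad \sum_{i=1}^{n} x_i^2 = 732$$

(a)

$$b = \frac{\left( \frac{\sum_{i=1}^{n} x_i y_i}{n} \right) - \left( \frac{\sum_{i=1}^{n} x_i}{n} \right) \left( \frac{\sum_{i=1}^{n} y_i}{n} \right)}{\frac{\sum_{i=1}^{n} x_i^2}{n} - \left( \frac{\sum_{i=1}^{n} x_i}{n} \right)^2} = \frac{\left( \frac{1069}{8} \right) - \left( \frac{72}{8} \right) \left( \frac{128}{8} \right)}{\frac{732}{8} - \left( \frac{72}{8} \right)^2} = \frac{-10.375}{10.5} = -0.99$$

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{72}{8} = 9 , \quad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{128}{8} = 16$$

$$a = \bar{y} - b \bar{x} = 16 - (-0.99)(9) = 24.91$$

Regression equation:

$$\hat{y}_i = 24.91 - 0.99 x_i$$

(b) Predicted number of rejects for Employee E, that is,
for x = 10:

$$\hat{y}_i = 24.91 - 0.99(10) = 15.01 \text{ rejects}$$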
The Standard Error of Estimate:
Note in the preceding scatterplot (Figure 6.3) that not all of the points
lie exactly on the regression line. If they all were on the
line, there would be no error in estimating the total cost. To put it
another way, if all points were on the regression line, total cost
could be predicted with 100% accuracy. Thus, there would be no
error in predicting the y variable based on the x variable.
Perfect prediction in economics and business is practically
impossible. What is needed, then, is a measure that describes
how precise the prediction of y is based on x or, conversely, how
inaccurate the estimate might be. This measure is called the
standard error of estimate. The standard error of estimate
measures the dispersion about the regression line.
Definition:
Standard Error of Estimate:
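A measure of the dispersion, or scatter, of the observed values
around the regression line.

$$S_e = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$$

Computational Formula:

$$S_e = \sqrt{\frac{\sum_{i=1}^{n} y_i^2 - a \left( \sum_{i=1}^{n} y_i \right) - b \left( \sum_{i=1}^{n} x_i y_i \right)}{n - 2}}$$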
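The two formulas for the standard error of estimate should give the same value, which can be checked numerically. The sketch below reuses the output and cost data of Example 2 purely as an illustration; the variable names are assumptions, and a and b are kept unrounded so that the two results agree.

```python
import math

output = [4, 5, 3, 6, 8]       # x, thousands of units
cost = [40, 55, 35, 70, 75]    # y, thousands of dollars
n = len(output)

# Least-squares slope b and intercept a.
mean_x = sum(output) / n
mean_y = sum(cost) / n
b = (sum(x * y for x, y in zip(output, cost)) / n - mean_x * mean_y) / (
    sum(x * x for x in output) / n - mean_x ** 2
)
a = mean_y - b * mean_x

# Definition: squared residuals around the fitted line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(output, cost))
se_definition = math.sqrt(sse / (n - 2))

# Computational formula: uses only the column totals.
sum_y = sum(cost)
sum_y2 = sum(y * y for y in cost)
sum_xy = sum(x * y for x, y in zip(output, cost))
se_computational = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

print(round(se_definition, 3), round(se_computational, 3))
```

With rounded values of a and b, the computational formula can drift noticeably, because it subtracts large totals that are close to one another.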
Coefficient of Determination:
The coefficient of determination and its meaning were introduced
in the previous chapter. It is a widely used measure of fit for
regression models and has an easily interpreted meaning.
The coefficient of determination ($r^2$) is computed by squaring
Pearson’s correlation coefficient.
Definition:
Coefficient of Determination:
The proportion of the total variation in the dependent variable y
that is explained by the variation in the independent variable x.
Solution:
Standard error of estimate
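$$r = \frac{\left( \frac{1560}{5} \right) - \left( \frac{26}{5} \right) \left( \frac{275}{5} \right)}{\sqrt{\left[ \frac{150}{5} - \left( \frac{26}{5} \right)^2 \right] \times \left[ \frac{16375}{5} - \left( \frac{275}{5} \right)^2 \right]}} = \frac{26}{\sqrt{2.96 \times 250}} = 0.956$$

$$r^2 = (0.956)^2 = 0.914$$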
So, 91.4% of variation in cost is explained by the variation in the
output according to the linear relationship between the two
variables, while 8.6% of variation in cost can be explained by
some factors other than output.
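The reading of $r^2$ as a share of explained variation can be verified numerically: for a least-squares line, the squared correlation equals one minus the ratio of unexplained variation to total variation. The sketch below is an illustrative check on the output and cost data; the variable names are assumptions.

```python
output = [4, 5, 3, 6, 8]       # x, thousands of units
cost = [40, 55, 35, 70, 75]    # y, thousands of dollars
n = len(output)
mean_x = sum(output) / n
mean_y = sum(cost) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(output, cost))
s_xx = sum((x - mean_x) ** 2 for x in output)
s_yy = sum((y - mean_y) ** 2 for y in cost)   # total variation in y

# Least-squares line and the variation it leaves unexplained.
b = s_xy / s_xx
a = mean_y - b * mean_x
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(output, cost))

r_squared = s_xy ** 2 / (s_xx * s_yy)         # square of Pearson's r
explained_share = 1 - unexplained / s_yy      # share of variation explained
print(round(r_squared, 3), round(explained_share, 3))
```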
Example 6:
The following data represents horsepower of motors and monthly
cost of electricity.
Horsepower of Motor (x) 4 6 8 10 12
Monthly Cost of Electricity (y) 100 150 200 250 300
a- Determine the regression equation.
b- Find the coefficient of determination.
c- Calculate the standard error of estimate.
Solution:
x y xy x² y²
4 100 400 16 10000
6 150 900 36 22500
8 200 1600 64 40000
10 250 2500 100 62500
12 300 3600 144 90000
40 1000 9000 360 225000
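Regression Equation:

$$\hat{y}_i = 25 x_i$$

(b) Coefficient of Determination ($r^2$):

$$r = \frac{\left( \frac{\sum_{i=1}^{n} x_i y_i}{n} \right) - \left( \frac{\sum_{i=1}^{n} x_i}{n} \right) \left( \frac{\sum_{i=1}^{n} y_i}{n} \right)}{\sqrt{\left[ \frac{\sum_{i=1}^{n} x_i^2}{n} - \left( \frac{\sum_{i=1}^{n} x_i}{n} \right)^2 \right] \times \left[ \frac{\sum_{i=1}^{n} y_i^2}{n} - \left( \frac{\sum_{i=1}^{n} y_i}{n} \right)^2 \right]}}$$

$$= \frac{\left( \frac{9000}{5} \right) - \left( \frac{40}{5} \right) \left( \frac{1000}{5} \right)}{\sqrt{\left[ \frac{360}{5} - \left( \frac{40}{5} \right)^2 \right] \times \left[ \frac{225000}{5} - \left( \frac{1000}{5} \right)^2 \right]}} = \frac{200}{\sqrt{8 \times 5000}} = 1$$

$$r^2 = (1)^2 = 1$$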
So, 100% of variation in monthly cost of electricity is explained by
the variation in the horsepower of motor.
Exercises
1- Sketch scatterplots (scatter diagram) from the following data
and determine the equation of the regression line.
x 12 21 28 8 20
y 17 15 22 19 24
5- Is it possible to predict the annual number of business
bankruptcies by the number of business starts? The following
data are pairs of the number of business bankruptcies (1000s)
and the number of business starts (10,000s) for a six-year
period. Use the data to develop the equation of the regression
model to predict the number of business bankruptcies by the
number of business starts. Discuss the meaning of the slope.
Business Bankruptcies (1000s)   34.3  35.0  38.5  40.1  35.5  37.9
Business Starts (10,000s)       58.1  55.4  57.0  58.5  57.4  58.0
7- Determine the equation of the regression line for the following
data and compute the coefficient of determination.
x 15 8 19 12 5
y 47 36 56 44 21
Question (4):
The following data give the experience (in years) and daily
salaries (in L.E.) of ten randomly selected secretaries:
Experience (x) 4 7 10 8 6 5 9 12 3 6
Daily Salary (y) 6 9 12 8 7 5 10 11 5 7
(1) What would be the advantage of plotting these data using the
scatter diagram? (Do not plot the scatter diagram).
(2) Calculate the coefficient of determination. Explain what it
indicates in the context of this problem.
(3) Find the regression line with experience as an independent
variable and daily salary as a dependent variable.
(4) Should we use this regression equation for predictive
purposes? Why?
(5) (a) Predict the daily salary for a secretary with an experience
of 4 years.
(b) How can you interpret the difference between the actual
and predicted daily salary for this secretary?
(6) Find the standard score for x = 7 and explain what it means.
Question (5):
Patients in hospital were asked questions on smoking habits with
the following results:
Cigarettes Smoked
What would you say about the association between smoking and
lung cancer?
South Valley University Date: January 1998
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 5 Questions, 4 Pages
Question (1):
(1) The following data give the temperature (in Fahrenheit)
observed during eight wintry days in a city (hypothetical data):
20 24 19 26 17 28 19 23
Describe the meaning of variable, element, observation, and
data set with reference to these data. Is the variable discrete
or continuous?
(2) Briefly explain the following terms:
A. Random sample B. Population
(3) Write short notes on:
A. Types of statistics B. Types of variables
Question (2):
The following data set gives the number of years for which 24
workers have been with their current employers:
15 12 9 10 5 12 3 7 16 13 11 14
11 8 7 14 11 8 4 13 2 18 6 19
(1) Construct a frequency distribution table for these data, using 1
as the lower limit of the first class and 4 as the width of
each class.
(2) Draw a histogram corresponding to the frequency distribution
prepared in Part (1).
(3) Calculate the relative frequency and percentage for
all classes.
(4) Using the appropriate cumulative distribution, what
percentage of the employees have been with their current
employers for 9 years or more?
Question (3):
(1) Briefly explain the meaning of an outlier. Is the mean or the
median a better measure of central tendency for a data set
that contains an outlier? Illustrate with the help of an example.
(2) Explain the relationship between the mean, median and mode
for symmetric and skewed distributions. Illustrate these
relationships with graphs.
(3) The following data give the ages of 10 persons:
22 19 25 30 29 27 32 98 18 26
A. Find the mean and median for these data.
B. Using the results obtained in Part (A), how can you know
that these data contain an outlier?
C. Drop the outlier and recalculate the mean and median.
Which of the two measures changed by a larger amount
when you drop the outlier?
D. Is the mean or the median a better measure for these data?
Why?
(4) The following table gives information on the amount (in L.E.)
of electric bills for November 1997 for a sample of 50 families:
Amount of Bill 0- 5- 10- 15- 20- 25-30 Total
No. of Families 7 8 10 12 8 5 50
A. Calculate the mean and variance.
B. What proportion of families have bills between 12
and 25?
C. Calculate the coefficient of skewness. Comment on the
skewness of these data.
Question (4):
(1) Consider the following two data sets:
Data Set (X): 5 9 4 11 6
Data Set (Y): 3 7 2 9 4
Note that each value of the second data set (Y) is obtained by
subtracting 2 from the corresponding value of the first data
set (X).
Without calculations, comment on the relationship between the
two data sets in regard to:
- The mean - The standard deviation
- The correlation coefficient - The regression line of Y on X
(2) A statistics teacher believes that a knowledge of mathematics
is essential for a student to do well in a statistics course. At
the start of the semester, the teacher administers
a standardized test of general mathematics. Later, he
compares these scores to the scores of a statistical test. The
data are shown in the accompanying table:
Student A B C D E F G H I J
Math Score 90 85 80 75 70 69 72 60 56 53
Stat Score 94 92 81 78 74 73 75 67 54 52
A. Draw a scatter diagram for these data and describe the
relationship between the two tests.
B. Calculate the correlation coefficient between the two sets of
scores. Does the correlation coefficient support your
description of the scatter diagram?
C. Using the coefficient of determination, describe the
relationship between the two tests.
D. Would it be fair to state that knowledge of a student’s
performance on the mathematics test allows you to predict
the student’s performance in the statistics course?
E. A student obtained a score of 87 on the mathematics test.
What is his/her predicted score in the statistics test?
F. Predict the statistics score for the student B. How can you
interpret the difference between the actual and predicted
statistics score for this student?
G. Find the standardized score for the mathematics score of
the student E, and explain what it means.
Question (5):
The personnel department of a large corporation recorded the
weekdays during which individuals in a sample of 360 absentees
were away over the past several months. The personnel
department categorized absentees according to the shift on which
they worked, as shown in the accompanying table.
Shift Sat. Sun. Mon. Tue. Wed. Thurs.
Day 65 40 15 19 16 30
Evening 25 30 20 16 14 70
(1) Is there evidence of a relationship between the days on which
employees were absent and the shift on which they worked?
(2) Determine the nature and strength of this relationship (if any).
South Valley University Date: January 1999
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 4 Questions, 3 Pages
Question (1): 14 Points
(1) Briefly explain the meaning of each of the following:
A. Inferential statistics. B. Simple random sample
C. Element D. Discrete Variable
Question (2):
(1) The following are the scores of 25 students on a statistics
examination.
38 10 23 33 46 59 49 35 28 12 30 43 55
19 36 50 26 44 27 52 29 32 48 32 21
A. Calculate the mean score.
B. Display these data in a frequency distribution table using 10
as a width of each class, then find the mean.
C. Explain why the mean of the grouped data in Part (B) would
be different from the mean obtained for the ungrouped data
in Part (A).
D. Prepare a "less than" cumulative frequency distribution and
then find
i. The median
ii. What proportion of students would fail if the pass mark
is 30?
E. Under what conditions is the median a better measure of
central tendency than the mean?
F. Based on your results of Parts (B) and (D-i), do you think
that the distribution of scores is symmetric or skewed?
Explain.
(2) The following table gives the weekly wages of 20 workers in
a certain factory.
Weekly Wage 35- 45- 55- 65 and more Total
No. of workers 2 6 8 4 20
Given that the mean wage is 58, find the highest possible wage.
(3) For a symmetric distribution, if:
Mean = 50 and First quartile (Q1) = 30
Find the semi - interquartile range for this distribution.
Question (3):
(1) For the two variables X and Y, given that:
• σx = 0.5σy
• yi = ŷi for all values of i.
• X and Y are positively correlated
• For x = 1, ŷ = 2.
A. Show that the two variables X and Y have the same
coefficient of variation.
B. What is the regression coefficient of Y on X?
I. With the advertising expenditure as an independent variable
and the product sales as a dependent variable, what do you
expect for the sign of the regression coefficient (b) to be?
Explain.
J. Find the predicted sales for month 4. How can you
interpret the difference between the actual and predicted
sales for this month?
K. Find the predicted sales for a month with advertising
expenditure of L.E.10,000. Comment on the accuracy of
this prediction.
L. What would be the effect on sales if advertising expenditure
were reduced by L.E.1000 per month?
Question (4):
Suppose that a random sample of men and women indicated their
view on a proposal of public importance as follows:
View
Gender Total
Opposed In Favor Undecided
Men 60 30 10 100
Women 90 40 20 150
Total 150 70 30 250
Can we say that the opinion on this issue is the same for men and
women? Explain your conclusion.
South Valley University Date: January 2000
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 5 Questions, 3 Pages
Question (1):
Briefly explain the following:
(1) Population (2) Random sample
(3) Types of variables (give examples)
Question (2):
The following data represent 25 ages for a group of persons:
21 8 17 22 19 18 19 14 17 11 6 21 25
19 9 12 16 16 10 29 24 5 21 20 25
(1) Construct a frequency distribution table. Take 5 as the width
of each class.
(2) Calculate the relative frequencies for all classes.
(3) For what percentage of persons in this group was the age less
than 15 years?
Question (3):
The following table gives the frequency distribution of heights (in
inches) for 100 persons:
Height 50- 55- 60- 65- 70-75 Total
No. of Persons 20 25 30 15 10 100
(1) Calculate the mean and standard deviation of the heights.
(2) Find the median of the heights.
(3) From your answers to Parts (1) and (2), what might you
conclude about the type and strength of the skewness of the
heights?
(4) What proportion of persons have heights of less than 65
inches?
(5) What is the height of a person with a standard score (z) of 1.5?
Explain your results.
Question (4):
The following table gives the experience (in years) and the
number of items which were rejected as unsatisfactory last week
for the employees at a small factory.
Employee                 A   B   C   D   E   F   G   H
Years of experience (X)  4   5   7   9   10  11  12  14
No. of rejects (Y)       21  22  15  18  14  14  11  13
Compare X and Y with regard to variation.
(1) Calculate the standard score of X and Y for the employee G.
Comment on your results.
(2) Find the correlation coefficient between X and Y.
(3) What proportion of the variability in the number of rejects is
explained by the variability in the years of experience?
(4) Determine the regression line of Y on X. Comment on the
value you have found for the regression coefficient.
(5) Find the predicted number of rejects for the employee E. How
can you interpret the difference between the actual and
predicted number of rejects for this employee?
(6) Compute the standard deviation of errors. Explain what it
means.
Question (5):
The following table shows the number of good and defective
parts on each of the three work shifts at a manufacturing plant
during randomly sampled periods.
Shift Day Evening Night Total
Defectives 10 20 20 50
Non-defectives 50 70 80 200
Total 60 90 100 250
Can we say that the shift is independent of whether or not the part
produced is defective or non-defective? Explain.
South Valley University Date: 10/1/2001
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 5 Questions, 3 Pages
Question (1):
Briefly explain the following:
(a) Simple random sample (b) Types of statistics
(c) Continuous variable
Question (2):
The following are the scores made on an intelligence test by
a group of children who participated in an experiment:
94 129 100 136 80 98 114 113 124 96
154 109 120 122 139 112 108 127 132 110
(1) Display these data in a frequency distribution with 5 classes of
equal width.
(2) Based on your results obtained in Part (1), construct the
cumulative frequency distribution.
(3) Using:
(a) Raw data (Ungrouped data).
(b) Cumulative frequency distribution obtained in Part (2) What
proportion of scores are
(i) less than 110? (ii) greater than 125?
Comment on your results.
Question (3):
(1) The following table gives the distribution of bonus payments
(L.E.) paid to 50 employees in a company.
Monthly Bonus 0- 10- 20- 30- 40-50 Total
No. of Employees 8 12 15 9 6 50
A. Find the mean, median, and standard deviation.
B. Based on your results obtained in Part (1), determine the
type and degree of skewness.
C. Find the semi- interquartile range (quartile deviation).
(2) For the two variables X and Y, given that: Y = 2X
A. What is the rank correlation coefficient between X
and Y?
B. Compare between X and Y in regard to:
(i) Mean (ii) Variance (iii) Coefficient of variation
Question (4):
Ten randomly selected life-insurance salesmen were surveyed in
a company to determine the number of weekly sales calls they
made and the number of policy sales they concluded. The data
shown in the accompanying table were collected:
Salesman No. 1 2 3 4 5 6 7 8 9 10
Weekly Calls (x) 7 4 3 5 6 8 4 2 6 5
Weekly Sales (y) 12 9 7 11 14 13 6 5 13 10
(1) Calculate the correlation coefficient and coefficient of
determination. What do these values tell you about the
relationship between the two variables?
(2) Find the regression line of y on x; ŷ = a + bx.
(3) Give a brief interpretation of the values of a and b found in
Part (2).
(4) Predict the number of sales concluded by a salesman who
makes 9 calls.
(5) What is the predicted value of weekly sales for the salesman
number 6? Find the error of estimation and comment on
your results.
Question (5):
Indiscipline and violence have become a major problem in schools
in Egypt. A random sample of 100 adults were selected and they
were asked if they favor giving more freedom to school teachers
to punish students for indiscipline and violence. The two-way
classification of responses of these adults is presented in the
following table.
                  Opinion
Gender   In favor   Against   No Opinion   Total
Men      16         11        3            30
Women    12         6         2            20
Total    28         17        5            50
To what extent does the opinion on this issue depend upon gender?
Explain your conclusion.
South Valley University Date: 23/1/2002
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 4 Questions, 3 Pages
Question (1): Briefly explain each of the following:
(1) Population and sample.
(2) Inferential statistics.
(3) Types of quantitative variables, give an example for each type.
Question (2):
(1) The following figures represent the time (in minutes) lost per
day through mechanical failure of machinery collected over
a period of 40 consecutive working days.
37 14 9 27 54 30 15 23 42 29
24 32 26 18 11 33 25 10 34 7
50 5 26 17 7 32 28 12 38 29
33 18 22 16 31 28 24 19 23 17
A. Prepare a frequency distribution table for these data using
five classes of equal widths.
B. Construct a cumulative frequency distribution for the time
lost.
C. What proportion of the days had a lost time of:
(i) 35 minutes at least (ii) less than 20 minutes
D. Using the frequency distribution of Part (1), find the mean,
median, standard deviation, and coefficient of skewness for
these data.
(2) The following data give the hours worked last week by
5 employees of a company:
42 34 40 85 36
A. Find the mean, median, and mode for these data.
B. Does this data set contain any outlier? If yes, drop this value
and recalculate the mean and median. Which of the two
measures changes by a larger amount when you drop
this value?
C. Is the mean or the median a better measure for these data?
Explain.
Question (3):
(1) For the two variables x and y, given:
• The two variables have the same coefficient of
variation.
• ŷi = yi for all values of i.
• x and y are positively correlated.
• x = 0.2y.
Find the predicted value of y for x = 4. Comment on this
prediction.
(2) The following data give information on the ages (in years)
and the number of breakdowns during the past month for
a sample of 8 machines in a large company.
Machine No.           1   2   3   4   5   6   7   8
Age                   12  7   2   8   13  9   4   9
Number of Breakdowns  9   6   1   5   11  7   2   7
A. Find the correlation coefficient between age and number of
breakdowns. Comment on your results.
B. Without doing any further calculations show the effect, if
any, on the correlation coefficient if each age is increased
by 10%.
C. What proportion of the variability in the number of
breakdowns is not explained by the variability in age?
D. Determine the regression line with number of breakdowns
as a dependent variable and age of machine as an
independent variable.
E. Why is the sign of the regression coefficient always the
same as that of the correlation coefficient? Explain the
meaning of the regression coefficient.
F. Should we use the regression line obtained in Part (D) for
purposes of prediction? Explain.
G. What will be the predicted number of breakdowns for
a machine of:
(i) 8 years age (ii) 10 years age
H. Comment on your predictions in Parts (G-i) and (G-ii).
I. Find the standardized value for the number of breakdowns
of the 5th machine. What does this value mean?
J. Compare between age and number of breakdowns with
regard to relative variation.
Question (4):
(1) Consider the following two variables:
x: 2 3 1 6 5
y: 7 9 5 15 13
Note: Each value of the variable y is obtained by multiplying 2
by the corresponding value of the first variable x and then
adding 3.
That is, y = 2 x + 3
Without calculations, comment on the relationship between
the two variables in regard to the
(a) mean (b) variance
(c) rank correlation coefficient (d) regression line of y on x
(2) Two samples, one of 200 students from urban secondary
schools and another of 250 students from rural secondary
schools, were taken. These students were asked if they have
ever smoked. The following table lists the summary of the
results.
Smoking Urban Rural Total
Have Never Smoked 120 160 280
Have Smoked 80 90 170
Total 200 250 450
What would you say about the association between the two
attributes?
Discuss your results in the context of these data.
South Valley University Date: 16/1/2003
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 4 Questions, 3 Pages
Question (1):
(1) Briefly define each of the following:
A. Inferential statistics B. Primary data
(2) Distinguish between a discrete variable and a continuous
variable, and give examples.
(3) Describe the application of statistics in the following fields:
A. Production B. Marketing
Question (2):
(1) Before admission into a college, the students have to take
Basic Skills Test in fundamentals of mathematics. In one such
exam, 40 students appeared in the test. Their scores are
recorded below out of a total maximum of 30 points.
15 12 15 22 28 30 19 25 24 28
10 15 16 20 26 22 18 20 27 14
12 19 21 18 19 30 13 10 21 24
15 20 22 18 20 12 23 29 22 24
A. Construct a frequency distribution for the above data with 5
classes and a suitable width of each class.
B. Based on your frequency distribution constructed in Part (1):
i. What proportion of students would fail if the pass mark
is 15 ?
ii. Find the percentage of students whose scores are at
least 22.
(2) The following data represent the distribution of the annual
incomes (in thousands of dollars) of 50 households.
Annual Income      Number of Households
5 and under 15     15
15 and under 25    20
25 and under 35    8
35 and under 45    5
45 and under 55    2
Question (4):
Four machines, A , B , C , and D are used to manufacture certain
machine parts which are classified as first grade and second
grade. The quality control engineer wants to test whether the
quality of the product from the four machines is consistent. The
data collected from 100 parts taken from the four machines are
classified and tabulated as follows:
                Machines
Grades    A    B    C    D    Total
First     20   15   14   15   64
Second    10   5    11   10   36
Total     30   20   25   25   100
Can we conclude that the quality of the two grades produced by
all machines is the same?
South Valley University Date: 22/12/2002
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 4 Questions, 3 Pages
Question (1):
(1) Briefly define each of the following:
A. Descriptive statistics B. Population and sample
(2) A dealer of Toyota cars sold 20,000 Toyota cars last year.
He is interested in knowing whether his customers are satisfied with
their purchases. 3000 questionnaires were mailed at random
to the purchasers. 1600 responses were received. 1440 of
these responses indicated satisfaction.
A. What is the population of interest?
B. What is the sample?
C. Is the percentage of satisfied customers a parameter or
a statistic?
(3) Describe the application of statistics in the following fields:
A. Purchasing B. Production
Question (2):
(1) The following set of data represents the marks obtained by 30
students in Economics in the final exam.
66 61 65 59 87 61 58 70 77 94
45 80 58 78 49 72 75 92 84 82
79 75 68 88 75 90 78 85 69 57
A. Construct a frequency distribution for these data
with 5 classes starting at 45 as the lower limit of the
first class.
B. Using the frequency distribution of Part (A), what
percentage of marks are:
- less than 70? - between 65 and 85?
(2) The department of Labour in a company wants to determine
the pattern of daily wages paid to 100 workers. The data were
individually collected and combined into a frequency
distribution as follows:
Weekly Wages (L.E.)    Number of Workers
10 and less than 15    1
15 and less than 20    4
20 and less than 25    10
25 and less than 30    18
30 and less than 35    28
35 and less than 40    26
40 and less than 45    13
Total                  100
A. Without calculations, is this distribution skewed? If so, then
in which direction? Explain.
B. Compute for these data:
(i) Mean (ii) Median (iii) Coefficient of variation
C. Find the coefficient of skewness for this distribution. Is the
value of this coefficient consistent with your results in Parts
(A) and (B)?
Question (3):
In economics, the demand for an item is often related to the
price of the item. An Electronic company has come up with a new
electronic toy for children and is trying to estimate the demand
function for this new toy at various prices. In a pilot study, eight
retail stores in different cities were selected and the following data
were collected for a 30-day period.
Price Per Unit (in dollars) : 12 18 13 16 20 14 17 10
Demand (100s of units): 9 2 6 5 1 5 3 9
(1) Find the least square regression line of demand on price.
(2) Interpret the meaning of the regression coefficient obtained
in Part (1).
(3) How does the quantity demanded change with lowering of
price by one dollar each?
(4) Calculate the coefficient of correlation and interpret the nature
of the value as calculated.
(5) What proportion of the variation in demand can be explained
by its relationship to price?
Question (4):
A behavioural scientist is conducting a survey to determine if the
financial benefits in terms of the total salary influences the level of
satisfaction of the employees or whether there are other factors
such as the work environment, which are more important than
money. A random sample of 100 employees of an organization
are given a test to determine their level of satisfaction. Each
employee’s total salary is also recorded. The information is
tabulated as follows:
                        Annual Salary
Level of Satisfaction   Under L.E. 3000   L.E. 3000-6000   Over L.E. 6000
High                    10                7                3
Medium                  26                16               8
Low                     14                10               6
Can you conclude that there is a relationship between salary and
job satisfaction? Explain your results in the context of this issue.
South Valley University Date: 19/1/2004
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 5 Questions, 4 Pages
Question (1):
(1) Briefly define each of the following:
(a) Simple random sample. (b) Inferential statistics.
(c) Continuous variable, give examples.
(2) The following data give the number of new cars sold at
a dealership during a 5-day period.
8 5 10 6 9
Describe the meaning of:
(a) Variable (b) Element (c) Observation (d) Data set
with reference to these data.
Is the variable, in this case, discrete or continuous?
Question (2):
(1) The manager of a famous restaurant has received complaints
that customers have to wait too long in the lounge
after they arrive at the restaurant and before they are actually
served dinner. He selected a random sample of 30 customers
and kept track of their waiting times. The following data were
recorded, with waiting time measured to the nearest minute.
29 40 25 31 60 68 39 42 60 43
28 52 30 32 48 15 40 21 31 30
51 72 22 29 19 43 43 36 50 32
(a) Develop a frequency distribution using 6 classes.
(b) Compute the average waiting time for the grouped data
and compare it with that computed for the raw data.
Comment.
(2) A group of 25 students were surveyed as to how much cash
they carried with them on a given day. The data collected are
presented as follows:
Amount (L.E.) 10- 12- 14- 16- 18-20 Total
Number of Students 2 6 9 6 2 25
A. For the frequency distribution, find:
(i) Mean (ii) Median (iii) Standard deviation
(iv) Coefficient of skewness (v) First and third quartiles
B. What connection do you see between your answers to A.(ii)
and A.(v) and the answer obtained in A.(iv). Explain.
Question (3):
(1) Given that the mean and standard deviation of a set of figures
are and respectively, write down the new values of the
mean and standard deviation when:
(a) each figure is increased by a constant c.
(b) each figure is multiplied by a constant k.
(2) A group of students sat two examinations, one in statistics (x)
and one in mathematics (y). In order to compare the results, the
statistics marks were scaled linearly (that is, a mark of x
became a mark of ax + b where a and b are constants) so
that the means and standard deviations of the marks in both
examinations became the same. The original means and
standard deviations are shown as follows:
Statistics (x) Mathematics (y)
Mean 48 62
Standard deviation 12 10
(a) Find the values of a and b.
(b) The original marks of a particular student are 36 in statistics and
48 in mathematics. In what sense has he done better in
statistics than in mathematics?
(3) Answer the following:
A. Which measure of central tendency is affected by some
extremely large values or extremely small values so that it
ceases to become the representative measure?
B. Which measure of central tendency is defined as the value
which appears most often in the data.
C. Under what circumstances would the median be the most
appropriate measure of central tendency?
D. Most women use a shoe size of 39. Which measure of
central tendency would most appropriately represent the
average shoe size of women?
E. In a symmetric distribution, which measure of central
tendency has the largest value, if any?
F. In a negatively skewed frequency distribution, which
measure of central tendency is the largest?
G. In comparing the dispersion of two or more distributions,
what is the most appropriate measure if they are in different
units?
Question (4):
The following table reports heights (in inches) for 10
married couples.
Couple 1 2 3 4 5 6 7 8 9 10
Husband 76 75 72 68 67 62 70 78 77 65
Wife 71 70 67 64 63 61 66 74 72 62
(1) Find the coefficient of correlation between husbands’ and
wives’ heights. What would you say about the strength of the
linear relationship between husbands’ and wives’ heights?
(2) What might you expect for the value of the rank correlation
coefficient when compared with that obtained in Part (1)?
Note: No calculations required.
(3) Find the regression line of wife’s height on husband’s height.
(4) What does the regression line tell you about the relationship
between husbands’ and wives’ heights?
(5) What percentage of the variation in wives’ heights is
unexplained by the variation in husbands' heights?
(6) Predict the height of the wife of a man who is:
(a) 74 in tall (b) 78 in tall
(7) Comment on these predictions.
Question (5):
Four brands of light bulbs are being considered for use in a large
manufacturing plant. The director of purchasing asked for
samples of 100 from each manufacturer. The numbers of
acceptable and unacceptable bulbs from each manufacturer are
shown below:
                  Manufacturer
               A    B    C    D
Acceptable     12   8    5    11
Unacceptable   88   92   95   89
South Valley University Date: 10/1/2005
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 4 Questions, 3 Pages
Question (1):
(1) Briefly explain the meaning of each of the following terms:
A. Random sample.
B. Quantitative variables, give examples.
C. Descriptive statistics.
(2) A sample of midterm grades for 5 statistics students showed
the following results:
Student No. : 1 2 3 4 5
Grade : 72 65 82 90 76
A. What variable has been measured in this case? Is the
variable quantitative or qualitative?
B. How many elements does this data set contain? What
are they?
C. How many observations in this data set?
Question (2):
(1) A psychologist developed a new test of adult intelligence. The
test was administered to 20 individuals, and the following data
were obtained.
112 123 116 128 105 129 118 134 107 126
138 118 100 121 110 117 136 148 109 119
A. Organize the data into a frequency distribution using 5 classes
of equal width.
B. Prepare a "Less than" cumulative frequency distribution for
these data.
C. Use the cumulative frequency distribution constructed in
Part (B) to determine the proportion of individuals who had
a grade of:
(i) at least 112. (ii) between 130 and 145.
(2) The gross hourly earnings (in dollars) of 50 workers in a large
industrial concern were organized into the following frequency
distribution.
Hourly Earnings 10- 15- 20- 25- 30-35 Total
Number of Workers 18 12 8 7 5 50
A. Find the mean and variance of hourly earnings.
B. On the basis of inspection only (without performing
any calculations), do you agree with the claim that the
distribution of hourly earnings has a positive skewness?
Explain.
C. On the basis of your answer to Part (B), would you expect
the median to be greater than the mean? Justify
your answer.
Note: No calculations required.
D. Check your answer to Part (C) by computing the median.
Comment on the degree to which the distribution of hourly
earnings is skewed.
Question (3):
(1) In order to determine a realistic price for a new product that
a company wishes to market, the company's research
department selected 10 sites thought to have essentially
identical sales potential and offered the product in each at
a different price. The resulting sales are recorded in the
accompanying table.
Location 1 2 3 4 5 6 7 8 9 10
Price 14 15 16 17 18 20 21 22 23 24
Sales ($1,000s) 14 13 15 9 11 10 8 9 6 5
A. Find the correlation coefficient. Does it reflect a strong or
weak relationship between prices and sales? Is it a direct or
inverse relationship?
B. What percentage of variation in sales can be explained
by prices?
C. Develop the least squares regression line of sales
on prices.
D. What is the change in sales for each dollar increase
in price?
E. Find the predicted sales for the 5th location. How can you
interpret the difference between the actual and
predicted sales?
F. Predict the company sales if the price is 19 in one location.
Would you feel comfortable about this prediction? Why or
why not?
G. Determine the standard error of estimate.
(2) For the two variables x and y, given that:
A. x and y have the same variance.
B. The correlation coefficient between x and y equals 0.8.
C. The mean of y is 10.
D. The predicted value for y is 6 if x equals 2.
Compare the two variables x and y with regard to the
coefficient of variation.
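For part (1) of this question, the correlation coefficient, the least squares line of sales on price, and the standard error of estimate can be sketched as follows. The divisor n − 2 in the standard error is an assumption (some texts divide by n), and the helper name `fit_line` is hypothetical.

```python
from math import sqrt

# Data copied from the price/sales table in part (1).
price = [14, 15, 16, 17, 18, 20, 21, 22, 23, 24]
sales = [14, 13, 15, 9, 11, 10, 8, 9, 6, 5]

def fit_line(x, y):
    """Return r, slope, intercept, and standard error of estimate for y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares; assumption: divide by n - 2 degrees of freedom.
    sse = syy - slope * sxy
    se = sqrt(sse / (n - 2))
    return r, slope, intercept, se

r, slope, intercept, se = fit_line(price, sales)
predicted_at_19 = intercept + slope * 19
print(r, r ** 2, slope, intercept, se, predicted_at_19)
```

The squared correlation `r ** 2` is the share of variation in sales explained by price, and the sign of `slope` gives the direction of the relationship.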
Question (4):
Data on the marital status of men and women aged 25 to 35
were obtained as a part of a national survey. The results from
a sample of 300 men and 200 women follow.
                   Marital Status
Gender    Never Married   Married   Divorced
Men       240             50        10
Women     60              120       20
Does there appear to be a relationship between marital status and
gender? If yes, to what extent?
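One way to approach this question is through the chi-square statistic and a coefficient of association derived from it. The sketch below assumes the contingency coefficient C = sqrt(χ² / (χ² + n)) and, for comparison, Cramér's V = sqrt(χ² / (n(k − 1))) with k the smaller of the number of rows and columns. Which of these the course intends is not stated in the question, so both are shown.

```python
from math import sqrt

# Observed counts: rows are Men, Women; columns are Never Married, Married, Divorced.
observed = [[240, 50, 10],
            [60, 120, 20]]

def chi_square(table):
    """Return the chi-square statistic and the grand total of a two-way table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2, n

chi2, n = chi_square(observed)
contingency_c = sqrt(chi2 / (chi2 + n))          # assumed form of the contingency coefficient
k = min(len(observed), len(observed[0]))
cramers_v = sqrt(chi2 / (n * (k - 1)))           # Cramér's V, shown for comparison
print(chi2, contingency_c, cramers_v)
```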
South Valley University Date: 18/1/2006
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 3 Questions, 3 Pages
Question (1):
(1) Briefly explain the meaning of each of the following terms:
A. Census and sample survey B. Simple random sample
C. Inferential statistics D. Continuous variable
(2) The data below give the ages for 5 persons:
18 22 19 24 17
Describe the meanings of a variable, an element, an
observation, and a data set with reference to these data.
Question (2):
(1) The following data show the sales (in thousands of pounds) in
a big firm during the last 20 months:
61 76 95 50 80 75 84 72 99 71
56 74 88 65 63 79 82 75 78 86
A. Group the above data into 5 classes of equal width.
B. Prepare a “Less than” cumulative frequency distribution for
these data.
C. Based on your cumulative frequency distribution
constructed in Part (B), find the number of months with
sales of at least L.E. 72,500.
(2) The following is a frequency distribution for the ages of
a sample of 20 employees at a company.
Age 20 - 30 - 40 - 50 - 60 - 70
Number of Employees 4 5 6 3 2
A. Find the mean, median, and standard deviation of ages.
B. Based on your results of Part (A), what would you expect
for the type and degree of skewness for this distribution?
C. Calculate the coefficient of skewness. Is the value of the
coefficient of skewness consistent with what you expected
in Part (B)?
Question (3):
The following table gives the monthly output (in thousands
of tons) and labor cost (in thousands of dollars) of a factory for
a given 8-month period.
Month 1 2 3 4 5 6 7 8
Monthly Output (1,000 tons) 66 74 78 70 82 90 87 85
Labor Cost ($1,000s) 50 57 59 52 64 85 77 68
(1) Calculate the correlation coefficient (r) for the two variables
shown. Interpret the nature of the value as calculated.
(2) How would your answer of Part (1) be affected if output were
measured in tonnes instead of tons, where 1 ton = 1.016 tonnes?
Justify your answer.
(3) Compute the rank correlation coefficient (rs) between
monthly output and labor cost.
(4) Compare between the values of r and rs as obtained in Parts
(1) and (3), respectively. Comment on your results.
(5) Would it be fair to state that knowledge of output for a given
month allows you to predict the labour cost for this month?
Explain.
(6) Predict the labor cost you would expect for the following
months:
A. The 4th month B. A month in which output is 80,000 tons
Comment on your results of Parts (A) and (B).
(7) Find the standard error of estimate.
(8) Find the standard score for the output of the 5th month, and
explain what it means.
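Parts (3) and (4) of this question can be checked with the sketch below. It assumes the no-ties formula rs = 1 − 6Σd² / (n(n² − 1)), which applies here because neither variable contains repeated values. The helper names `ranks` and `spearman` are illustrative.

```python
# Data copied from the monthly output / labor cost table.
output = [66, 74, 78, 70, 82, 90, 87, 85]
cost = [50, 57, 59, 52, 64, 85, 77, 68]

def ranks(values):
    """Rank values from smallest (1) to largest; assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman rank correlation using the sum of squared rank differences."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

rs = spearman(output, cost)
print(ranks(output), ranks(cost), rs)
```

Both variables produce the same ranking, so rs equals 1, while r can still fall below 1 because the points do not lie exactly on a straight line.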
Question (4):
(1) Explain the relationship between the mean, median, and
mode for symmetric and skewed distributions. Illustrate
with graphs.
(2) The following data give the weights (in kilograms) for 10
persons:
8 7 2 5 3 84 7 8 2 4
A. Find the mean and median for these data.
B. Using the results of Part (A), how can you tell whether
these data contain outliers?
C. If you dropped the outlier and recalculated the mean and
median, which of the two measures would you expect to
change by a larger amount? No calculations required.
D. Is the mean or the median a better measure for these
data? Why?
(3) A garage sells three types of new cars (A, B, and C). The
following data show, for each type of car sold, the number
requiring repair during the first 12 months.
South Valley University Date: 21/1/2007
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 4 Questions, 4 Pages
Question (1):
(1) Briefly explain the meaning of each of the following terms:
A. Descriptive statistics B. Simple random sample
C. Quantitative variable, give examples
(2) A bank is concerned about its level of customer service and
conducts a survey into the time which elapses from the
moment a customer enters the bank to the moment they finish
their transaction. The survey results (to the nearest minute) are
as follows.
0 4 1 7 6 0 5 10 8 2
7 5 9 12 4 13 3 2 9 6
6 10 5 2 8 7 9 11 13 11
A. Develop a frequency distribution using 7 classes.
B. Comment on the shape of the distribution.
C. Based on your results in Part (A), what percentage of
customers have waiting times of less than 7 minutes?
Question (2):
(1) In a medium sized city there are 100 houses for sale of
a similar size. The frequency distribution of the asking prices is
Price ($1000)    40-  50-  60-  70-  80-  90-100  Total
No. of Houses    6    8    10   20   30   26      100
A. On the basis of inspection only (without performing any
calculations), what might you expect for the value of the
mean compared with that of the median? Justify
your answer.
B. Find the price value so that 50% of houses have prices less
than this value.
C. Find the mean and coefficient of variation.
D. Check your answers to Parts (A), (B), and (C) by calculating
the coefficient of skewness for prices.
(2) Given that the mean and variance of a set of figures are µ and
σ² respectively, write down the new values of the mean and
variance, justifying your answer, when:
A. each figure is decreased by a constant A.
B. each figure is divided by a constant K and then added to
a constant m.
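The effect of shifting and scaling asked about in part (2) can be checked numerically. The sketch below uses hypothetical figures and hypothetical constants A, K, and m (none of them come from the exam); it relies on the standard results that subtracting A lowers the mean by A and leaves the variance unchanged, while x/K + m gives mean µ/K + m and variance σ²/K².

```python
from statistics import mean, pvariance

figures = [4, 8, 6, 10, 12]   # hypothetical data, for illustration only
A, K, m = 3, 2, 5             # hypothetical constants

mu = mean(figures)
var = pvariance(figures)      # population variance (divisor n)

# Case A: each figure is decreased by a constant A.
shifted = [x - A for x in figures]
# Case B: each figure is divided by K and then added to m.
scaled = [x / K + m for x in figures]

print(mean(shifted), pvariance(shifted))   # expected: mu - A, var
print(mean(scaled), pvariance(scaled))     # expected: mu / K + m, var / K**2
```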
Question (3):
The following data shows the age in years and the second-hand
price of a sample of 10 cars advertised in a local paper.
Car A B C D E F G H I J
Age (Years) 5 7 6 6 5 4 7 6 5 5
Price ($100) 76 45 58 55 70 88 43 56 69 70
(1) Determine which one is the X variable and which one is the
Y variable. Explain.
(2) Calculate the correlation coefficient. Explain what it
indicates in the context of this problem.
(3) Develop a regression equation that could be used to
predict the price of a car given its age.
(4) Do you believe the regression equation developed in Part
(3) would provide a good prediction of the price of car?
Explain.
(5) Interpret the meaning of the slope of the equation obtained
in Part (3).
(6) Based on the regression equation in Part (3), what would
be the effect on prices if ages were reduced by two years
per car?
(7) Predict the price for the following cars:
(a) A car that is 8 years old (b) Car F
What would you say about your results in Parts (a) and (b)
as far as the reliability of predictions is concerned?
(8) What percentage of variation in prices can be explained
by ages?
(9) Find the standard error of estimate. Explain.
Question (4):
(1) The scatter diagrams of two data sets are shown in Figures
(1) and (2). What conclusions can you draw from these scatter
diagrams as far as the simple linear correlation and simple
linear regression are concerned?
Figure (1)
Figure (2)
South Valley University Date: 15/1/2008
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 4 Questions, 3 Pages
Question (1):
(1) Briefly explain the meaning of each of the following terms:
A. Population and sample. B. Random sample.
C. Discrete variable, give examples.
D. Inferential statistics.
(2) The following data show the time (rounded to the nearest day)
required to complete year-end audits for a sample of 20 clients
of a small public accounting firm.
10 14 19 18 15 15 18 17 20 27
22 23 13 21 34 28 14 18 16 13
A. Display the data into a frequency distribution, using five
classes and 10 as the lower limit of the first class.
B. On the basis of inspection only (without doing
calculations), is the distribution obtained in Part (A)
symmetric or skewed? Explain.
C. Prepare a "less than" cumulative frequency distribution.
Then, find the proportion of clients with a time of at most
20 days.
Question (2):
(1) An auto insurance company reported the following information
regarding the age of a driver and the number of accidents
reported last year.
Age of Driver 20- 30- 40- 50- 60-70 Total
No. of Accidents 8 10 16 9 7 50
A. Find the mean, median, and standard deviation for ages.
B. Give the reasons why the values of the mean and the
median should be approximately equal for this distribution.
C. Find the coefficient of skewness for the above distribution.
Is the value obtained consistent with your answer in
Part (B)?
D. Determine the standard score for the age 60. Explain what
it means.
(2) The accompanying table is the frequency distribution of the
statistics scores of a group of students.
Score 0- 2- 4- 6- 8-10
No. of Students 1 2 k 2 1
A. For k ≥ 3, explain how the value of k affects the
values of
(i) the mean? (ii) the median? (iii) the mode?
B. Is there a value for k such that the value of the variance of
scores is zero? If yes, determine this value.
C. Given the variance of scores is 4.8, find the value of k.
Question (3):
A company wishes to investigate whether the amount it spends
on advertising prior to the launch of a new product is related to the
sales volume of the product in the first month. Data from the last
8 product launches is shown below.
Product Number 1 2 3 4 5 6 7 8
Advertising ($10,000) 50 25 20 65 30 17 25 40
Sales (1000 units) 18 14 13 20 14 12 13 16
(1) Find the correlation coefficient between advertising
and sales. What does the value obtained suggest about the
direction and strength of the relationship between the
two variables?
(2) What does the value of the correlation coefficient obtained in
Part (1) tell you about how useful advertising is for
predicting sales? Explain.
(3) What proportion of the variability in sales is not explained by
the variability in advertising expenditure?
(4) Using your results in Part (1), find the regression equation for
sales based on advertising.
(5) Based on the regression equation obtained in Part (4), what is
the marginal effect on volume of sales if the amount spent on
advertising is decreased by $10,000?
(6) Predict the level of sales when $400,000 is spent on
advertising. What is the proportion of error in this case? How
can you interpret the occurrence of this error?
(7) Find the standard error of estimate. Explain.
(8) For what reasons may the standard deviation be inappropriate
for comparing the dispersion of advertising and sales? Suggest
a better alternative measure, and find it for each variable.
Question (4):
(1) Given are data for two variables, x and y.
X: 4 2 8 c 10
Y: 6 1 d 26 21
If you know that the two variables x and y are perfectly related.
A. Determine the values of c and d.
B. Without performing any calculations, what is the rank
correlation coefficient between x and y?
(2) A manufacturer of preassembled windows produced 50
windows yesterday. This morning the quality assurance
inspector reviewed each window for all quality aspects. Each
was classified as acceptable or unacceptable and by the shift
on which it was produced. Thus, he reported two variables on
a single item. The two variables are shift and quality. The
results are reported in the following table.
                      Shift
Quality      Day   Afternoon   Night   Total
Defective    10    6           24      40
Acceptable   80    54          26      160
Total        90    60          50      200
A. Is there evidence of a relationship between the quality of
windows and the shift on which they were produced?
B. Discuss your results in the context of this issue.
C. Excluding the night shift, repeat Parts (A) and (B).
South Valley University Date: 20/1/2009
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 5 Questions, 4 Pages
Question (1): 16 Points
(1) Briefly explain the meaning of each of the following terms:
A. Applied statistics B. Continuous variable.
C. Simple random sample D. Representative sample
(2) A machine produces the following number of rejects in each
successive period of five minutes.
16 21 26 24 11 24 17 25 26 13
27 24 26 05 27 23 24 15 22 22
12 22 29 21 18 22 28 25 07 17
22 28 19 23 23 22 06 19 13 31
23 28 24 09 20 34 30 23 20 12
A. Prepare a frequency distribution for these data using six
classes of equal width. Hence,
B. Construct a cumulative frequency distribution.
C. Determine the proportion of periods that had at least
20 rejects.
D. Find the number of rejects so that 30% of periods have
less than this number.
Question (2): 32 Points
(1) The values of 100 properties handled by a property dealer over
a six-month period are shown as follows:
Value of Property ($1000s) 15- 20- 25- 30- 35-40
No. of Properties 17 18 30 20 15
A. Find the mean and median for values of properties.
B. What value of property occurred most frequently?
C. Determine the variance for values of properties.
D. Compute the semi-interquartile range (quartile
deviation) for this distribution and discuss why it is often
preferred as a measure of variation over the range.
E. Based on your results in Parts (A) and (C), what would you
say about the symmetry of this distribution?
F. Confirm or contradict your answer in Part (E) by
finding the value of the coefficient of skewness for the
above distribution.
(2) The following table represents the distribution of the annual
incomes (in thousands of dollars) of 50 households.
Annual Income Less Than 15 15- 25- 35- 45-55
No. of Households 15 20 8 5 2
From the scatter diagram shown above
A. What does the scatter diagram indicate about the
relationship between age and cost?
B. Do the data contain outliers? If so, identify.
(3) Find the correlation coefficient (r) between age and cost,
and describe what it tells you.
(4) How would your answer of Part (3) be affected if the age of
each machine was increased by only one month. Explain.
(5) Find the rank correlation coefficient (rs) between age of
machine and maintenance cost.
(6) Compare between the values of r and rs obtained in (3) and
(5), respectively. Comment on your result.
(7) Determine the proportion of variation in cost that would be
explained by its relationship to age.
(8) Based on your results in Part (7), would you feel comfortable
using the age of a machine for predicting its maintenance
cost? Explain.
(9) Develop a linear regression equation in which
maintenance cost is to be predicted by age of machine.
Then,
(10) What are the fixed costs of the company?
(11) Provide an interpretation for the slope of the estimated
equation obtained in (9).
(12) Predict the maintenance cost for the 4th machine. Discuss
the difference between the actual and predicted cost.
(13) Find the standard error of estimate. What does it tell you
about the regression equation?
Question (4): 16 Points
(1) Given two variables, x and Y, where
Y = (x − 4) / 2
Comment on the relationship between x and Y in regard to the:
A. Mean B. Median C. Variance
D. Rank correlation coefficient
E. Regression line of Y on X.
(2) A management behavior analyst has been studying the
relationship between male/female supervisory structures in
the workplace and the level of employees' job satisfaction.
The results of a recent survey are shown here in the
following table.
Level of             Boss/Employee
Satisfaction    F/M   F/F   M/M   M/F
Satisfied       20    10    50    120
Dissatisfied    70    100   10    20
F = Female, M = Male
Can you conclude that the level of job satisfaction depends on the
boss/employee gender relationship? Discuss your results in the
context of this issue.
South Valley University Date: 6/1/2013
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 3 Questions, 4 Pages
Question (1): 25 Points
(1) Briefly explain the meaning of each of the following:
A. Simple random sample. B. Descriptive statistics.
C. Types of quantitative variables, give examples.
(2) The following data give the statistics score for 6 students.
76 82 95 68 43 57
Describe the meaning of each of the following, with reference to
these data.
(1) Element (2) Observation
(3) Variable (4) Data set
(3) The following data give the daily wages (in Egyptian pounds)
earned by a sample of 30 workers as shown below.
36 28 22 44 30 26 49 24 33 34 25 31 39 33 28
37 42 27 23 27 32 25 34 29 43 32 26 20 28 35
A. Prepare a frequency distribution for these data.
Hint: Use Sturges' rule to determine the number
of classes.
B. Construct a cumulative frequency distribution.
C. On the basis of the cumulative distribution obtained in
Part (B), find the percentage of workers with a daily wage
of at least L.E. 27.
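Part (3) of this question can be sketched as below. The sketch assumes Sturges' rule in the form k = 1 + 3.322 log10(n), that both the number of classes and the class width are rounded up to whole numbers, and that classes start at the smallest observation and include their lower limit only. These rounding choices are assumptions and may differ from the convention taught in the course.

```python
from itertools import accumulate
from math import ceil, log10

# Daily wages copied from the question.
wages = [36, 28, 22, 44, 30, 26, 49, 24, 33, 34, 25, 31, 39, 33, 28,
         37, 42, 27, 23, 27, 32, 25, 34, 29, 43, 32, 26, 20, 28, 35]

n = len(wages)
k = ceil(1 + 3.322 * log10(n))                  # Sturges' rule, rounded up (assumption)
width = ceil((max(wages) - min(wages)) / k)     # class width, rounded up (assumption)
start = min(wages)

# Count observations per class [start + i*width, start + (i+1)*width).
counts = [0] * k
for w in wages:
    counts[(w - start) // width] += 1

# "Less than" cumulative frequencies at each upper class limit.
cumulative = list(accumulate(counts))

# Share of workers earning at least L.E. 27, counted from the raw data as a check.
share_at_least_27 = sum(w >= 27 for w in wages) / n * 100

for i, (c, cum) in enumerate(zip(counts, cumulative)):
    lo = start + i * width
    print(f"{lo}-{lo + width}: frequency {c}, less than {lo + width}: {cum}")
print(share_at_least_27)
```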
Question (2): 35 Points
(1) The following data give the ages for 6 persons.
65 82 92 86 5 90
A. Find the mean and the median for these data.
B. Using the results of Part (A), how can you tell whether
these data contain outliers? Explain.
C. If you dropped the outlier and the values of the mean and
median were recalculated, which of the two measures is
expected to be changed by a larger amount? No
calculations required.
D. Is the mean or the median a better measure for these data?
Explain.
(2) Given that the mean and variance of a set of figures are µ and
σ², respectively.
What are the new values of the mean and variance when:
A. Each figure is decreased by a constant A.
B. Each figure is multiplied by a constant B.
Justify your answers.
(3) The following table presents the distribution of the monthly
incomes (in thousands of Egyptian pounds) of 50 households.
Monthly Income (L.E. 1000)   25-  30-  35-  40-  45-50  Total
Number of households         2    3    5    22   18     50
A. On the basis of inspection only (without performing any
calculations), do you agree with the claim that the
distribution of monthly incomes is positively skewed?
Justify your answer.
B. Calculate the values of the mean and standard deviation
for the monthly incomes.
C. Find the value of the monthly income such that 50% of the
households have incomes less than this value.
D. Find the coefficient of skewness for this distribution.
Comment.
E. Is the value of the coefficient of skewness obtained in Part
(D) consistent with your answer in Part (A)?
Question (3): 40 Points
(1) The table given below shows the age (in years) and the
second-hand price (in thousands of Egyptian pounds) of 8
cars advertised in a local paper.
Car No. 1 2 3 4 5 6 7 8 Total
Age 6 5 4 7 5 8 9 4 48
Price 58 75 85 55 66 50 44 95 528
A. For what reasons may the standard deviation be
inappropriate for comparing the dispersion of the two
variables? Suggest a better alternative measure.
Note: No calculations required.
B. Calculate the value of the correlation coefficient for the
two variables. Explain your result in the context of
this question.
C. Using the least squares method, find the equation of the
regression line of price on age. Would it be fair to state
that the knowledge of the age of a given car allows you to
predict the price for this car? Explain.
D. Predict the price you would expect for the 6th car.
Comment.
E. Find the standardized value for the price of the 4th car.
Explain what it means.
(2) The scatter diagram of the two variables X and Y is shown
as follows.
regression, and the existence of outliers are concerned? If
outliers exist, identify.
(3) Indiscipline and violence have become a major problem in
schools in Egypt. A random sample of 200 adults were selected
and they were asked if they favor giving more freedom to
school teachers to punish students for indiscipline and
violence. The two-way classification of responses of these
adults is presented in the following table.
                       Opinion
Gender    In Favor   Against   No Opinion   Total
Men       96         18        6            120
Women     32         44        4            80
Total     128        62        10           200
Can you conclude that there is a relationship between gender and
opinion on this issue? If the answer is 'yes', to what extent does
the opinion on this issue depend upon whether the adult is a man or
a woman? Explain in the context of this issue.
South Valley University Date: 5/1/2014
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 3 Questions, 4 Pages
Question (1): 15 Points
(1) Briefly explain the meaning of each of the following terms:
A. Inferential statistics. B. Random sample.
C. Qualitative variable, give examples.
(2) A bank is concerned about its level of customer service and
conducts a survey into the time which elapses from the
moment a customer enters the bank to the moment they finish
their transaction. The survey results (to the nearest minute) for
20 clients are as follows.
5 12 11 9 10 15 6 14 10 12
17 7 11 9 13 5 12 19 10 8
A. Develop a frequency distribution using 5 equal classes.
B. Based on your results in Part (A), what proportion of
customers have waiting times of less than 10 minutes?
Question (2): 40 Points
(1) In a medium sized city there are 50 houses for sale of a similar
size. The frequency distribution of prices is as follows.
Price ($10,000)   10-20  20-30  30-40  40-50  50-60  Total
No. of Houses     14     18     12     4      2      50
A. On the basis of inspection only (without performing any
calculations), what might you expect for the value of the
mean compared with that of the median? Justify your answer.
B. Find the price value so that 50% of houses have prices less
than this value.
C. Find the mean and coefficient of skewness of prices.
D. What connection do you see between your answers to
Parts (A), (B), and (C)?
(2) Given that the mean and variance of a set of figures are µ
and σ² respectively, write down the new values of the mean
and variance, justifying your answer, when:
A. Each figure is increased by a constant B.
B. Each figure is multiplied by a constant K and then
subtracted from a constant m.
(3) Given the following frequency distribution.
E. How would your answers of Parts (B) and (C) be affected if
costs were measured in Egyptian pound (L.E.) instead of
tens of Egyptian pounds (L.E. 10s)?
F. Based on the regression equation in Part (C), What would
be the effect on costs if ages were reduced by one year
per car?
G. Predict the cost you would expect for the following
machines:
(i) A machine that is 4 years old.
(ii) Machine number 2.
What would you say about your results in Parts (i) and (ii)
as far as the reliability of predictions is concerned?
H. What percentage of variation in costs can be explained
by ages?
(2) Given the two variables x and Y. If:
• 64% of the variation in X is explained by the variation
in Y.
• For each one-unit increase in X, Y increases by 4.
• The mean of X is equal to 10.
• The two variables have the same coefficient of
variation.
Find the regression equation of Y on X.
(3) A convenience store is open 24 hours a day. The owner of the
store is interested to find out if there is any relationship
between the time a purchase is made and the amount of
money spent on that purchase. 100 purchases are selected
randomly from the store records. The following summarized
data represent the number of purchases made as well as their
time of purchase and amount of money spent on
each purchase:
Time of            Size of Purchase (in dollars)
Purchase    Less than 10   From 10 to 20   Over 20   Total
Morning     35             10              5         50
Evening     28             8               4         40
Night       7              2               1         10
Total       70             20              10        100
A. Can you conclude that the amount of money spent on
purchase depends upon the time of purchase? If yes, to
what extent?
B. Repeat Part (A) after excluding data about the two
categories "Over $20" and "Night".
C. Compare between your results in Parts (A) and (B),
justifying your answer.
South Valley University Date: 4/1/2015
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 3 Questions, 3 Pages
Question (1): 40 Points
(1) Management of a restaurant is concerned with the time
a patron must wait before being seated for dinner. Listed below
is the wait time in minutes, for the 25 tables seated last
Friday night.
28 39 21 21 37 32 56 40 66 50 51 45 44
68 43 55 46 41 34 44 48 65 44 35 53
A. Construct a frequency distribution for these data using 5
equal classes
B. Prepare a "Less than" cumulative frequency distribution.
Then, find the number of patrons with wait time of at least
48 minutes.
(2) The following frequency distribution reports the electricity cost
(in dollars) of a sample of 50 two-bedroom apartments in a city.
Electricity Cost       30-  40-  50-  60-  70-  80-  Total
Number of Apartments   3    8    12   16   7    4    50
A. Find the mean, median, and standard deviation of cost.
B. Find the coefficient of skewness for this distribution.
Comment.
(3) Consider the following frequency distribution:
Class 5-15 15-25 25-35 35-45 45-55
Frequency 3 A 6 B 2
Given that: Mean = 29.5 , Median = 30 , Median Class is 25-35
Determine the values of A and B.
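The two conditions in part (3) can be checked by a small search, as in the sketch below. It assumes class midpoints 10, 20, 30, 40, 50 for the mean and the usual interpolation formula for the grouped median; the search range 0 to 50 for A and B is an arbitrary choice made for illustration.

```python
solutions = []
for a in range(0, 51):
    for b in range(0, 51):
        n = 3 + a + 6 + b + 2
        total = 3 * 10 + a * 20 + 6 * 30 + b * 40 + 2 * 50
        # Mean = 29.5, written in exact integer arithmetic: 2 * total = 59 * n.
        mean_ok = 2 * total == 59 * n
        # Median = 25 + (n/2 - (3 + a)) / 6 * 10 = 30, i.e. n/2 - (3 + a) = 3.
        median_ok = n == 2 * (3 + a) + 6
        if mean_ok and median_ok:
            solutions.append((a, b))

print(solutions)
```

Under these assumptions the search returns a single pair (A, B), which can then be substituted back into the table to confirm the stated mean and median.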
Question (2): 45 Points
(1) A random sample of 10 homes currently listed for sale
provided the following information on size (hundreds of square
feet) and price (thousands of dollars).
Home No. 1 2 3 4 5 6 7 8 9 10
Size 27 28 34 30 29 35 32 40 23 22
Price ($000) 29 31 32 33 36 41 49 55 25 19
A. Find the correlation coefficient between size and price.
What would you say about the relationship between the
two variables?
B. How would your answer of Part (A) be affected if price is
measured in hundreds of dollars instead of thousands of
dollars?
C. What proportion of the variability in price is not
explained by size?
D. What do you conclude from the result in Part (A) as far as
the reliability of predictions is concerned?
E. Find the predicted price for each of the following:
(i) Home number 4. (ii) A home of size 2500 square feet.
F. Comment on these predictions.
G. Determine the standard error of estimate.
H. Find the standard score for the price of the 8th home.
Explain what it means.
(2) For the two variables x and y, given that:
• yi = ŷi for all values of i.
• For each unit increase in x, y is decreased by 2.
• The mean and variance of x are 8 and 4 respectively.
• For x = 5, y = 10.
A. Compare between x and y in regard to:
(i) Mean (ii) Variance (iii) Coefficient of variation
(iv) Rank correlation Coefficient
B. Predict the value of y for x = 2.
Question (3): 15 Points
A study regarding the relationship between age and the amount
of pressure sales personnel feel in relation to their jobs revealed
the following sample information.
                   Degree of Job Pressure
Age (years)     Low   Medium   High   Total
Less than 25    70    20       10     100
25 - 40         70    50       80     200
40 - 60         10    30       160    200
Total           150   100      250    500
(1) Does the sample information provide evidence to conclude
that the degree of job pressure depends upon age? Explain
your results in the context of this issue.
(2) Repeat Part (1) excluding the category of "medium" for the
degree of job pressure and the category "25 up to 40" for age.
Comment on your answer.
South Valley University Date: 17/1/2016
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 3 Questions, 3 Pages
Question (1): 15 Points
(1) Briefly explain the meaning of each of the following:
A. Qualitative variable, give examples. B. Descriptive statistics.
(2) Suppose the following data are the ages of Internet users for
a sample.
39 15 31 25 24 23 21 22 22 18
19 16 23 27 34 24 19 20 29 17
A. Organize these data into a frequency distribution
using 5 as a width for each class.
B. Based upon the frequency distribution obtained in Part (A):
(i) Prepare a "Less than" cumulative frequency
distribution.
(ii) Find the percentage of Internet users with age of at
least 32 years.
Question (2): 35 Points
(1) The following frequency table gives the distribution of bonus
payments (in hundreds of L.E.) made to 100 employees in
a company.
Monthly Bonus (L.E. 00)   30-  40-  50-  60-  70-80  Total
Number of Employees       12   20   36   20   12     100
A. On the basis of inspection only (no calculations
required), do you agree with the claim that the distribution
of monthly bonus is positively skewed? Explain.
B. Find the mean and standard deviation of monthly bonus.
C. Find the bonus value so that 75% of employees have
bonuses less than this value.
D. Based on your results of Parts (A), (B), and (C), determine
the bonus value so that 75% of employees have bonuses
greater than this value.
(2) Consider the following frequency distribution:
Score                Less than 10   10 - 20   20 - 30   30 - 40
Number of students   4              B         8         2
If: Mean = 19.6 and Variance = 68.64,
determine the following:
a. The value of B. b. The lowest score.
Question (3): 50 Points
(1) The following table presents the salaries (in hundreds of L.E.)
and years of experience for a sample of 8 workers.
G. Find the rank correlation coefficient (rs) between years of
experience and salary.
H. Compare between the values of r and rs. Comment.
I. Find the predicted salary for each of the following:
(i) The worker number 6. Comment.
(ii) A worker with experience of 3 years. Would you feel
comfortable about this prediction? Justify your answer.
J. Interpret the value of the regression coefficient obtained
in Part (I).
(2) For the two variables x and y, given:
− The two variables have the same coefficient of variation.
− yi = ŷi for all values of i.
− The two variables are positively related.
− σy = 2σx .
A. Compare between x and y in regard to:
(i) Mean (ii) Variance
(iii) Rank correlation coefficient
B. Predict the value of y for x = 4.
(3) Data on the social class and the number of children in
a family were obtained as a part of national survey. The results
from a sample of 200 families follow.
Number of            Social Class
Children       Lower   Middle   Upper   Total
1 or 2         12      24       84      120
More than 2    64      12       4       80
Total          76      36       88      200
A. To what extent can you say that the number of children
in a family depends upon the social class of this family?
Explain your result.
B. Repeat Part (A) excluding the category of "middle" for
the social class.
C. Compare between your results in Part (A) and Part (B).
Comment.
South Valley University Date: 15/1/2017
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 3 Questions, 4 Pages
Question (1): 20 Points
(1) Briefly explain the meaning of each of the following:
A. Simple random sample. B. Inferential statistics.
(2) The data below give the ages, to the nearest year, for 5
children:
8 2 5 4 6
Describe the meaning of:
(a) Variable (b) Element (c) Observation
With reference to these data. Is the variable, in this case,
discrete or continuous?
(3) The following data (in thousands of dollars) represent the net
annual income for a sample of 20 taxpayers:
41 34 20 39 29 38 42 30 37 44
26 32 30 43 33 39 32 35 33 31
A. Construct a frequency distribution for these data having 5
classes.
B. Based upon the frequency distribution obtained in
Part (A):
(i) Prepare a 'Less than' cumulative frequency distribution.
(ii) Determine the value of the annual income so that
40% of taxpayers will earn at least this value.
Question (2): 40 Points
(1) The income distribution of the middle management in a large
organization is tabulated below:
Income ($ hundreds)   20-  25-  30-  35-  40-  45-50  Total
Number of Managers    4    5    6    15   12   8      50
A. On the basis of inspection only (without performing any
calculations), compare and contrast the values of the
mean and median you would expect for the income
distribution.
Justify your answer.
B. Find the mean and standard deviation of income.
C. Find the income value so that 50% of managers have
incomes less than this value.
D. Determine the type and degree of skewness of the income
distribution.
E. Is your answer to Part (D) consistent with what you
expected in Part (A)? Explain.
(2) The accompanying table is the frequency distribution of the
statistics scores of a group of students.
Score 4-6 6-8 8-10 10-12 12-14
No. of Students 2 5 k 5 2
A. For k > 5, explain how the value of k affects the
values of the:
(i) Mean (ii) Median (iii) Standard deviation
B. Is there a value for k such that the value of the variance of
scores is zero? If yes, determine this value.
C. For k > 5, if the value of the third quartile (Q3) is equal to
10.8, use this value to determine the value of the first
quartile (Q1). Explain.
D. Given the variance of scores is 5.2, find the value of k.
Question (3): 40 Points
(1) A new-car dealer is interested in the relationship between the
number of salespeople working in a month and the number
of cars sold. Data were gathered in 5 consecutive months:
Month 1 2 3 4 5 Total
Number of Salespeople 5 6 4 2 8 25
Number of Cars sold 22 20 14 9 25 90
A. What would be the advantage of plotting these data using
the scatter diagram? Hint: Do not draw the scatter
diagram.
B. Calculate the values of the correlation coefficient (r) and
the coefficient of determination. What do these values tell
you about the relationship between the two variables?
C. How much of the variation in the number of cars sold is
not explained by the number of salespeople?
D. Compute the rank correlation coefficient (rs) between the
number of salespeople and the number of cars sold.
E. Compare between the values of r and rs as obtained in
Parts (B) and (D), respectively. Comment on your results.
F. Would it be fair to state that knowledge of the number of
salespeople for a given month allows you to predict the
number of cars sold for this month? Justify your answer.
G. Predict the number of cars sold you would expect for the
following months:
(i) The 2nd month. (ii) A month with 7 salespeople.
Comment on your predictions of Parts (i) and (ii).
H. If the number of salespeople was reduced by 2 per month,
what would be the effect on:
(i) Value of the correlation coefficient?
(ii) Number of cars sold?
I. Find the standard score of the number of cars sold for
the 5th month, and explain what it means.
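The two coefficients compared in Parts B, D and E can be sketched as follows. The helper names (`pearson_r`, `ranks`, `spearman_rs`) are illustrative only, and the rank formula used here assumes there are no tied values, which holds for this data set.

```python
import math

salespeople = [5, 6, 4, 2, 8]
cars_sold = [22, 20, 14, 9, 25]


def pearson_r(x, y):
    """Pearson correlation from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest). Assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


r = pearson_r(salespeople, cars_sold)
rs = spearman_rs(salespeople, cars_sold)
print(round(r, 3), round(r ** 2, 3), round(rs, 3))
```

The share of variation not explained by the number of salespeople (Part C) is then `1 - r ** 2`.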
(2) For the two variables x and y, given:
• The correlation coefficient between x and y equals 0.6.
• For x = 4, the predicted value of y is 5.
• The mean of x is 8.
• Sx = 0.8 Sy.
A. Determine the regression line of y on x.
B. Find the mean for the variable y.
(3) An insurance company is interested in determining whether
there is a relationship between automobile accident
frequency and cigarette smoking. It randomly sampled 50
policyholders and came up with the following data:
Cigarette Smoking    Number of Accidents          Total
                     0      1      2 or more
Smokers              8      6      4               18
Nonsmokers          12     10     10               32
Total               20     16     14               50
A. Does the sample provide sufficient information to conclude
that there is a relationship between automobile accident
frequency and cigarette smoking? If yes, to what extent?
Discuss your results in the context of this issue.
B. Does it make a difference to your results of Part (A) if the
category '2 or more' is excluded? Explain.
South Valley University Date: 26/12/2017
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
Answer the Following Questions: 3 Questions , 3 Pages
Question (1): 14 Points
(1) Briefly explain the meaning of each of the following:
A. Descriptive statistics. B. Discrete variable, give examples.
(2) Before admission into a college, the students have to take
Basic Skills Test in fundamentals of mathematics. The scores
of 25 students are recorded below out of a total maximum of
40 points.
Hint: Pass Mark = 60% of Full Mark.
15 12 15 22 28 30 19 25 24 28 10 23 16
20 26 22 18 20 27 14 12 19 21 24 32
A. Prepare a frequency distribution for these data using 5 as
width for each class.
B. Using your result in Part (A), prepare a 'Less than'
cumulative frequency distribution. Then, find the
proportion of students who passed the exam.
Question (2): 32 Points
(1) The following table presents the distribution of the hourly
wages of 20 workers in a certain factory.
Hourly Wage       20 - 25   Less than 30   Less than 35   Less than 40   Less than 45
No. of Workers       2           A              14             18             20
If the value of the median is 32.5, find the value of:
(i) A (ii) The Mode
(2) The time taken for the weekly maintenance of a group of
machines in a workshop over the past 25 weeks is shown in
the following table.
Maintenance Time (hours)   0-2   2-4   4-6   6-8   8-10   Total
Number of Weeks             2     6    10     5     2      25
A. Find the mean, variance and coefficient of variation for
maintenance time.
B. Determine the maintenance time below which 10 weeks
will lie.
C. On the basis of inspection only (without any
computations), can you say that this distribution is
extremely skewed to the right? Explain.
Question (3): 54 Points
(1) The following table represents the daily production (units) and
the number of workers assigned for each of the 8 days.
Day                  1    2    3    4    5    6    7    8   Total
Number of Workers    5    7    6    8    9    7    8    6     56
Production          12   14   13   17   19   16   16   13    120
A. For what reasons may the standard deviation be
inappropriate for comparing the dispersion of the two
variables? Suggest a better alternative measure and use it for
such a comparison.
B. Find the correlation coefficient (r) between the two
variables. Explain what it indicates in the context of
this problem.
C. How would your answer of Part (B) be affected if the
number of workers for each day is:
(i) decreased by 2. (ii) multiplied by 2.
D. Find the coefficient of determination. Explain its
meaning.
E. The value of the rank correlation coefficient (rs) was
found to be 0.97. What would you say about this value when
compared to that of the correlation coefficient (r)
obtained in Part (B)?
F. Predict the number of units produced you would
expect for:
i. Day no. 4. How can you interpret the difference between
the actual and predicted number of units produced?
ii. A day in which 10 workers were assigned. Comment.
G. Interpret the value of the regression coefficient obtained
in Part (F).
H. Find the standard error of the estimate. Comment.
(2) Given are data for two variables x and y.
x    4   A   2   7
y    4   2   0   B
Suppose that the two variables are perfectly related.
A. Determine the values of A and B.
B. Compare between the two variables in regard to the
coefficient of variation.
C. Without any computations, what is the value of the rank
correlation coefficient (rs)?
(3) A sample of 200 units produced by a machine were classified
as good or defective and by the shift on which they were
produced. The results are reported in the following table.
Quality        Shift                       Total
               First   Second   Third
Good            76      64       40         180
Defective        4       6       10          20
Total           80      70       50         200
A. Is there evidence of a relationship between the quality of
the units produced and the shift on which they were
produced? If yes, to what extent?
Explain your conclusion in the context of this issue.
B. Excluding the second shift, repeat Part (A).
C. Compare your results of Parts (A) and (B).
South Valley University Date: 9/1/2019
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
You have 25 questions. Please mark all your answers on the
answer sheet provided to you. You have to submit both the
question papers and the answer sheet.
• Make sure that the answer sheet form matches the
questions form.
• Choose the best answer for each of the following
questions.
Hint: Five points are assigned for each of the following six
questions (Q1-Q6).
Q1. The following data show the time (rounded to the nearest day)
required to complete year-end audits for a sample of 20 clients
of a small accounting firm.
10 24 19 18 25 15 18 17 20 27
22 32 13 21 34 24 14 26 16 13
The frequency distribution for these data using 5 equal
classes is:
(A)
Time (day) 10-15 15-20 20-25 25-30 30-35
Number of Clients 3 8 5 3 1
(B)
Time (day) 10-15 15-20 20-25 25-30 30-35
Number of Clients 4 7 4 3 2
(C)
Time (day) 10-15 15-20 20-25 25-30 30-35
Number of Clients 4 6 5 3 2
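One way to check a tally like the ones offered above is to count the observations class by class. The sketch below assumes that each class includes its lower limit and excludes its upper limit (so 15 belongs to 15-20), which is an assumption about the convention intended here. The function name `frequency_distribution` is illustrative only.

```python
times = [10, 24, 19, 18, 25, 15, 18, 17, 20, 27,
         22, 32, 13, 21, 34, 24, 14, 26, 16, 13]


def frequency_distribution(data, start, width, k):
    """Count values in k equal classes [start, start + width), ..."""
    counts = [0] * k
    for v in data:
        idx = int((v - start) // width)
        if not 0 <= idx < k:
            raise ValueError(f"{v} falls outside the classes")
        counts[idx] += 1
    return counts


counts = frequency_distribution(times, start=10, width=5, k=5)

# 'Less than' cumulative frequencies at each upper class limit.
cumulative = []
running = 0
for c in counts:
    running += c
    cumulative.append(running)

print(counts, cumulative)
```

The last cumulative frequency should equal the sample size, which is a quick sanity check on any hand tally.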
South Valley University Date: 9/1/2020
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
You have 30 questions. Please mark all your answers on the
answer sheet provided to you. You have to submit both the
question papers and the answer sheet.
Please:
• Make sure that the answer sheet form matches the
questions form.
• Choose the best answer for each of the following
questions.
Hint: Two points are assigned for each of the following
10 questions (Q1- Q10).
Q1. In a symmetric distribution, if Q3 - Q1 = 20 and median = 15,
then Q3 is equal to
(A) 20 (B) 25 (C) 10 (D) 15
Q2. The variance is zero if observations are the
(A) Different (B) Squares (C) Square roots (D) Same
The mean and variance of a set of numbers are 20 and 4,
respectively. Each number is multiplied by 2 and then
increased by 5.
Answer the following two questions (Q3 - Q4):
Q3. The mean of new numbers is
(A) 30 (B) 36 (C) 45 (D) 40
Q4. The standard deviation of new numbers is
(A) 10 (B) 4 (C) 6 (D) 5
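Q3 and Q4 rest on the effect of shifting and scaling: if every value x is replaced by a*x + b, the mean becomes a*mean + b and the standard deviation becomes |a| times the old one. The sketch below applies that rule and confirms it on a small made-up data set. The data set and the helper names are hypothetical, and the variance is assumed to divide by n.

```python
import math


def transform_summary(mean, variance, a, b):
    """Return (mean, standard deviation) of a*x + b from those of x."""
    return a * mean + b, abs(a) * math.sqrt(variance)


def pop_mean(xs):
    return sum(xs) / len(xs)


def pop_sd(xs):
    """Standard deviation dividing by n (assumed convention)."""
    m = pop_mean(xs)
    return math.sqrt(sum((v - m) ** 2 for v in xs) / len(xs))


# Hypothetical data with mean 20 and variance 4, used only as a check.
data = [18, 22, 18, 22]
new_data = [2 * v + 5 for v in data]

print(transform_summary(20, 4, 2, 5))
print(pop_mean(new_data), pop_sd(new_data))
```

Adding the constant moves the mean but leaves the spread unchanged, so only the multiplier affects the standard deviation.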
Q5. In a set of observations, the variance is 50. All the
observations are increased by 100%. The variance of the
increased observations will become
(A) 200 (B) 240 (C) 180 (D) No change
Q6. If the correlation coefficient between x and y equals 0.75,
then the correlation coefficient between u = 2x and v = 2y is
(A) 0 (B) - 0.75 (C) 1.5 (D) 0.75
Q7. The signs of the regression coefficient and correlation
coefficient are always
(A) Different (B) Positive (C) Same (D) Negative
Q8. If the percent of total variation of the dependent variable y
unexplained by the independent variable x is 24%, then the
correlation coefficient (r) between x and y is
(A) 0.49 (B) 0.85 (C) 0.76 (D) 0.87
Q9. If the value of any regression coefficient is zero, then the two
variables are
(A) Correlated (B) Independent
(C) Dependent (D) Quantitative
Q10. If the coefficient of determination is a positive value, then
a regression equation
(A) Must have a positive slope
(B) Must have a negative slope
(C) Could have either a positive or a negative slope
(D) Must have a positive y- intercept
Hint: Three points are assigned for each of the following
10 questions (Q11- Q20).
Q11. The following data show sales (in thousands of dollars) in
a firm during the last 20 months.
61 76 95 50 80 75 84 72 99 71
56 74 88 65 63 79 82 75 78 86
The frequency distribution for these data using 5 equal
classes is
Sales                 50-60   60-70   70-80   80-90   90-100   Total
Number of Months:
(A)                     2       3       8       5       2        20
(B)                     2       4       7       4       3        20
(C)                     2       3       9       4       2        20
(D) None of the above is correct
Q12. The mean of wages is
(A) 18.24 (B) 17.64 (C) 16.85 (D) 17.75
Q13. What is the variance of wages?
(A) 35.3725 (B) 36.1875 (C) 38.1284 (D) 37.6825
Q14. Determine the wage below which 50% of workers will be.
(A) 17.5 (B) 16.8 (C) 17.2 (D) 16.4
Q15. What is the standard value for the wage 18?
(A) 0.12 (B) 0.05 (C) 0.04 (D) 0.08
Q16. Find the number of workers who have daily wages of at
least $12.
(A) 12 (B) 14 (C) 18 (D) 16
Q17. What is the coefficient of skewness for a frequency
distribution that has: Mean = 10, Median = 8, Coefficient of
variation = 25%?
(A) 2.4 (B) 2.6 (C) 3.2 (D) 2.8
For the two variables X and Y, given that:
- The two variables have the same variance and the same
coefficient of variation.
- The correlation coefficient between X and Y is equal to 0.5.
- The mean of Y is 10.
- The predicted value of Y when X = 6 equals 10.
Answer the following two questions (Q18-Q19):
Q18. The regression coefficient of the regression line of Y
on X is
(A) 0.4 (B) 0.5 (C) 0.3 (D) 0.6
Q19. The mean of X is
(A) 6 (B) 8 (C) 12 (D) 10
Q20. If y_i = ŷ_i for all values of i, then the two variables have
(A) Perfect positive relationship
(B) Perfect negative relationship
(C) Could have either a perfect positive or a perfect
negative relationship.
(D) Weak relationship
Hint: Five points are assigned for each of the following 10
questions (Q21-Q30).
A cost accountant has derived the total cost (in thousands of
dollars) against the output (in thousands of units) of a certain
product over a period of 5 weeks, yielding the following data:
Output 4 2 5 8 6
Total Cost 25 20 30 40 35
Answer the following six questions (Q21- Q26):
Q21. The value of the correlation coefficient (r) between the two
variables is
(A) 0.86 (B) 0.94 (C) 0.99 (D) 0.88
Q22. The value of the rank correlation coefficient (rs) is
(A) 1 (B) 0.95 (C) 0.97 (D) 0.98
Q23. The value of the coefficient of determination is
(A) 0.74 (B) 0.88 (C) 0.98 (D) 0.77
Q24. The regression coefficient of the regression line of cost
on output is
(A) 2.65 (B) 3.28 (C) 2.80 (D) 3.50
Q25. What is the predicted cost for producing 8 units?
(A) 39.8 (B) 40.5 (C) 42.4 (D) 38.6
Q26. Find the standard error of estimate.
(A) 1.29 (B) 1.32 (C) 2.16 (D) 2.24
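The regression quantities asked for in Q24 to Q26 can be sketched from the raw data as follows. The sketch assumes the usual least-squares line of cost on output and a standard error of estimate that divides the sum of squared residuals by n - 2. Both are assumptions about the conventions intended here, and `fit_line` is an illustrative name.

```python
import math

output = [4, 2, 5, 8, 6]        # thousands of units
cost = [25, 20, 30, 40, 35]     # thousands of dollars


def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


intercept, slope = fit_line(output, cost)
predicted = [intercept + slope * x for x in output]

# Sum of squared residuals about the fitted line.
sse = sum((y - p) ** 2 for y, p in zip(cost, predicted))
# Assumption: n - 2 in the denominator.
std_error = math.sqrt(sse / (len(output) - 2))

print(intercept, slope, intercept + slope * 8, round(std_error, 2))
```

Because the output values are recorded in thousands of units, the prediction uses x = 8 directly.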
South Valley University Date: 9/3/2021
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
You have 25 questions. Please mark all your answers on the
answer sheet provided to you. You have to submit both the question
papers and the answer sheet. Please:
- Make sure that the answer sheet form matches the questions form.
- Choose the best answer for each of the following questions
Hint: Two points are assigned for each of the following 10 questions
(Q1- Q10).
Q1. Which of the following measures of central tendency can have
more than one value in a single data set?
(A) Mean (B) Median (C) Mode (D) First quartile
Mark                  40-50   50-60   60-70   70-80   80-90   Total
Number of Students:
(A)                     1       3       4       4       3       15
(B)                     1       2       6       3       3       15
(C)                     1       4       5       3       2       15
(D)                     1       3       5       4       2       15
Q19. A student obtained the mean and standard deviation of 100
observations as 40 and 5.1 respectively. It was later discovered that
he had wrongly copied down an observation as 50 instead of 40. The
correct standard deviation is
(A) 5 (B) 6 (C) 3 (D) 7
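The correction in Q19 can be reasoned through the totals: recover the sum and the sum of squares from the reported mean and standard deviation, swap the miscopied value for the right one, and recompute. The sketch below does this under the assumption that the standard deviation divides by n. The function name `corrected_summary` is illustrative only.

```python
import math


def corrected_summary(n, mean, sd, wrong, right):
    """Mean and sd after replacing one miscopied observation.

    Assumes sd was computed with n in the denominator.
    """
    total = n * mean
    total_sq = n * (sd ** 2 + mean ** 2)
    # Swap the wrong value for the right one in both totals.
    total += right - wrong
    total_sq += right ** 2 - wrong ** 2
    new_mean = total / n
    new_var = total_sq / n - new_mean ** 2
    return new_mean, math.sqrt(new_var)


new_mean, new_sd = corrected_summary(100, 40, 5.1, wrong=50, right=40)
print(round(new_mean, 2), round(new_sd, 2))
```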
Q20. Following is the cumulative frequency distribution of the
electricity cost (in dollars) of 20 two-bedroom apartments.
Electricity Cost        Less than 40   Less than 50   Less than 60   Less than 70   Less than 80
Number of Apartments         3              7              13             17             20
If the lowest electricity cost is $30, what is the value of the mean?
(A) 62 (B) 65 (C) 52 (D) 55
Hint: Six points are assigned for each of the following 5 questions
(Q21-Q25).
The following table presents advertising expenditures ($1000s) and
revenue ($1000s) for 5 companies.
Company No. 1 2 3 4 5 Total
Advertising (x) 8 6 5 9 2 30
Revenue (y) 45 40 35 50 10 180
Computations provided the following:
∑xy = 1245 , ∑x² = 210 , ∑y² = 7450
Answer the following four questions (Q21- Q24):
Q21. Find the value of the correlation coefficient (r) between the
two variables.
(A) 0.94 (B) 0.97 (C) 0.96 (D) 0.95
Q22. Which of the following is the most likely value for the rank
correlation coefficient (rs)? Hint: You do not need to make any
calculations.
(A) 0.75 (B) 0.65 (C) 1 (D) 0.82
Q23. What is the error of predicting the revenue of the second
company?
(A) 6 (B) 5 (C) 3 (D) 4
Q24. Find the standard error of the estimate.
(A) 4.6 (B) 3.8 (C) 3.6 (D) 4.8
Q25. Data on the social class and the number of children in a family
were obtained as a part of national survey. The results from a sample
of 200 families follow.
Number of Children    Social Class                Total
                      Lower   Middle   Upper
1 or 2                 12      24       84         120
More than 2             8      16       56          80
Total                  20      40      140         200
Based on the value of the Cramer's Coefficient (V), which one of the
following statements is correct?
(A) The number of children in a family depends to a large extent on
its social class.
(B) The relationship between the number of children in a family
and its social class is neither strong nor weak.
(C) The number of children in a family depends to some extent on
its social class.
(D) The number of children in a family does not depend at all on its
social class.
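A minimal sketch of the calculation behind Q25 follows. It assumes the common definition V = sqrt(chi-square / (n * min(r - 1, c - 1))), with expected frequencies taken as row total times column total divided by n. The formula given in Chapter 5 may be stated differently, so treat this as an illustration rather than the book's method. The function name `cramers_v` is illustrative.

```python
import math


def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Rows: '1 or 2' and 'More than 2'; columns: Lower, Middle, Upper.
children_by_class = [[12, 24, 84], [8, 16, 56]]
v = cramers_v(children_by_class)
print(v)
```

In this table every column splits 60% to 40% between the two rows, so each observed frequency equals its expected frequency.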
South Valley University Date: 2/2/2022
Faculty of Commerce Time Allowed: 3 hours
Principles of Statistics
English Teaching Section
You have 30 questions. Please mark all your answers on the answer sheet
provided to you. You have to submit both the question papers and
the answer sheet.
Please:
- Make sure that the answer sheet form matches the questions form.
- Choose the best answer for each of the following questions
Hint: Two points are assigned for each of the following 10 questions
(Q1- Q10): (Time: 20 Minutes)
Q1. Which of the following measures can have more than one value for
a set of data?
(A) Mode (B) Median (C) Mean (D) None of these
Q2. If the coefficient of determination is a positive value, then the
regression equation ……
(A) Must have a positive slope (B) Must have a negative slope
(C) Could have either a positive or a negative slope
(D) Must have a positive y - intercept
The mean and standard deviation of a set of numbers are 4 and 2,
respectively. Each number is multiplied by 2 and the result is
subtracted from 10.
Answer the following two questions (Q3 - Q4):
Q3. The mean of new numbers is ……
(A) 5 (B) 2 (C) 4 (D) 6
Q4. The standard deviation of new numbers is ……
(A) 6 (B) 2 (C) 3 (D) 4
Q5. Let the correlation coefficient between X and Y be 0.6. Random
variables W and Z are defined as Z = X + 4 and W = Y/2. What is the
correlation coefficient between W and Z?
(A) 0.4 (B) 0.6 (C) 0.8 (D) None of these
Q6. If a test was generally very easy, except for a few students who had
very low scores, then the distribution of scores would be ……
(A) Positively skewed (B) Negatively skewed
(C) Not skewed at all (D) Normal
Q7. The goal of ……. is to focus on using sample results to make decisions
about the entire population from which the sample was drawn.
(A) Inferential statistics (B) Descriptive statistics
(C) None of the above (D) All of the above
Q8. If the percent of total variation of the dependent variable Y
unexplained by the independent variable X is 36%, then the correlation
coefficient (r) between X and Y is ……
(A) 0.49 (B) 0.36 (C) 0.80 (D) 0.90
Q9. If the value of any regression coefficient is one, then the two
variables are ……
(A) Independent (B) Moderately related
(C) Perfectly related (D) It is impossible to tell
Q10. The statistics score of a student is 2 standard deviations above the
mean. If the mean and standard deviation of scores of all students are 70
and 5 respectively, what is the statistics score of this student?
(A) 80 (B) 82 (C) 60 (D) 86
Hint: Four points are assigned for each of the following 10 questions
(Q11- Q20): (Time: 85 Minutes)
Q11. The following data show the time (rounded to the nearest day)
required to complete year-end audits for a sample of 20 clients of a small
accounting firm.
20 34 29 28 35 25 28 27 30 37
32 42 23 31 44 34 24 36 26 23
The frequency distribution for these data using 5 equal classes is ……
Time (day)            20-25   25-30   30-35   35-40   40-45   Total
Number of Clients:
(A)                     3       8       5       3       1       20
(B)                     4       7       4       3       2       20
(C)                     4       5       6       3       2       20
(D) None of the above is correct
In a city, there are 25 houses for sale of similar size. The frequency of
prices (in $1,000) is as follows:
Price ($1,000)      10-   20-   30-   40-   50-60   Total
Number of Houses     1     3     8     7      6       25
The following table gives the experience (in years) and the number of
items which were rejected as unsatisfactory last week for 8 workers
at a small factory.
Worker A B C D E F G H
Years of Experience (x) 8 5 10 3 7 2 9 12
Number of Rejects (y) 9 15 5 19 11 21 7 1
Computations provided the following:
∑x = 56 , ∑y = 88 , ∑xy = 448 , ∑x² = 476 , ∑y² = 1304
Answer the following four questions (Q17- Q20):
Q17. What is the value of the correlation coefficient (r) between the
two variables?
(A) – 0.985 (B) – 0.912 (C) – 1 (D) 1
Q18. The value of the rank correlation coefficient (rs) is ……
(A) – 1 (B) – 0.95 (C) – 0.97 (D) 1
Q19. The regression coefficient of the regression line of number of
rejects on years of experience is ……
(A) – 2.4 (B) – 3.0 (C) 2.4 (D) – 2.0
Q20. Predict the number of rejects for a worker with 4 years of
experience.
(A) 15 (B) 16 (C) 17 (D) 18
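Since the question supplies the summary sums, the correlation and regression quantities in Q17, Q19 and Q20 can be obtained without going back to the raw data. The sketch below uses the computational forms n∑xy - ∑x∑y, n∑x² - (∑x)² and n∑y² - (∑y)². The variable names and the `predict` helper are illustrative only.

```python
import math

# Summary sums as given in the question (n = 8 workers).
n = 8
sum_x, sum_y = 56, 88
sum_xy, sum_x2, sum_y2 = 448, 476, 1304

sxy = n * sum_xy - sum_x * sum_y
sxx = n * sum_x2 - sum_x ** 2
syy = n * sum_y2 - sum_y ** 2

r = sxy / math.sqrt(sxx * syy)
slope = sxy / sxx                          # regression of y on x
intercept = sum_y / n - slope * sum_x / n


def predict(x):
    """Predicted number of rejects for x years of experience."""
    return intercept + slope * x


print(r, slope, intercept, predict(4))
```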
Q23. When computing the value of Cramer’s coefficient, suppose Oij = Eij
for all values of i and j (the observed and expected frequencies are
equal for each cell of the contingency table). The relationship between
the two attributes is ……
(A) Fairly strong (B) Moderate (C) Weak (D) No relationship
Q24. If ŷ_i = y_i for all values of i, then the two variables have ……
(A) Perfect positive relationship
(B) Perfect negative relationship
(C) Could have either a perfect positive or a perfect negative
relationship.
(D) Weak relationship
Given are data for two variables X and Y:
X: 4 2 k 3
Y: 2 6 10 4
Suppose that the two variables are perfectly related.
Answer the following two Questions (Q25 - Q26):
Q25. Determine the value of k.
(A) 1 (B) 6 (C) 5 (D) 0
Q26. The value of the rank correlation coefficient (rs) is ……
(A) 0.96 (B) -1 (C) 1 (D) Difficult to know
Data on the marital status of men and women aged 25 to 35 were
obtained as a part of a national survey. The results from a sample
of 120 men and 80 women follow.
Gender     Marital Status                          Total
           Never Married   Married   Divorced
Men             80            30        10          120
Women           10            50        20           80
Total           90            80        30          200
Answer the following two questions (Q27 - Q28):
Q27. Based on the value of the Cramer’s coefficient (V), which of
the following statements is correct?
(A) The marital status depends to a large extent on gender.
(B) The relationship between the marital status and gender is
weak.
(C) The marital status moderately depends on gender.
(D) The marital status does not depend at all on gender.
Q28. If Q27 is repeated after excluding the category ‘Divorced’, the
degree of association between the two attributes becomes ……
(A) Higher (B) Lower (C) Same (D) Difficult to tell
Given the following frequency distribution:
Age 2 - 4 4 - 6 6 - 8 8 – 10 10 and over
Number of Persons 1 2 5 K 1
The values of the mean and standard deviation of this distribution
are 6.9 and 2.3, respectively.
Answer the following two questions (Q29 - Q30):
Q29. The value of K is ……
(A) 2 (B) 3 (C) 1 (D) 4
Q30. The highest age is ……
(A) 12 (B) 14 (C) 16 (D) 15
References