
Statistics Fundamentals for Students

This document outlines a course on data management, specifically focusing on statistics as a science of data. It covers fundamental concepts such as descriptive statistics, measures of central tendency (mean, median, mode), and measures of dispersion, along with practical applications and exercises. The learning outcomes emphasize the ability to accurately process numerical data and solve application problems using appropriate statistical tools.

Uploaded by

Isachar Job

Republic of the Philippines

Mountain Province State Polytechnic College


Bontoc, Mountain Province

DATA MANAGEMENT

Mathematics in the Modern World

Teacher Education Department


1st Semester, School year 2020-2021
INTRODUCTION

The purpose of this course is to introduce you to the subject of statistics as a
science of data. Data abound in this information age; extracting useful knowledge
and gaining a sound understanding of complex data sets has become more of a
challenge. In this course, we will focus on the fundamentals of statistics, which may be
broadly described as the techniques to collect, clarify, summarize, organize, analyze,
and interpret numerical information.

In today's technologically advanced world, we have access to large volumes of
data. The first step of data analysis is to accurately summarize all of this data, both
graphically and numerically, so that we can understand what the data reveals. To be
able to use and interpret the data correctly is essential to making informed decisions.
For instance, when you see a survey of opinion about a certain TV program, you may
be interested in the proportion of those people who indeed like the program.

In this module, you will learn about descriptive statistics, which are used to
summarize and display data. After completing this module, you will know how to present
your findings once you have collected data.

This module will begin with a brief overview of the discipline of statistics and will
then quickly focus on descriptive statistics, that is, summary measures such as central
tendency, position, and variation of data. These measures serve as the foundation for
statistical inference. On the side of inference, we will focus on both estimation and
hypothesis testing issues. We will also examine the techniques to study the relationship
between two or more variables; this is known as regression and correlation.

All "It's Your Turn" activities serve as your class standing, and the post-assessment
serves as your unit test; all of these are required to be submitted. Use long bond paper
for all your outputs or answers.

LEARNING OUTCOMES

The learning outcomes list the module’s overall learning outcomes. The objectives
will be written under each lesson.
At the end of the module, you should be able to:
1. use appropriate statistical tools to process and manage numerical data
accurately; and
2. solve application problems correctly

PRE-TEST
Let us see how much you already know about statistics. Answer each question
below. Take note of the items that you do not yet know. Write the letter of the
correct answer on the blank provided before the number.
___1. Which of the following is not a measure of central location?
a. Mean b. median
c. variance d. mode
___2. In terms of quartiles, the median as a measure of central tendency lies in the
a. first quartile b. second quartile
c. third quartile d. fourth quartile
___3. If the arithmetic mean is 12 and the number of observations is 20, then the sum of
all values is
a. 8 b. 32 c. 240 d. 1.667
___4. The method used to compute the average or central value of collected data is
considered as
a. measures of positive variation
b. measures of central tendency
c. measures of negative skewness
d. measures of negative variation
___5. Mean or average used to measure central tendency is called
a. sample mean
b. arithmetic mean
c. negative mean
d. population mean
___6. For values that lie close to the mean, the standard deviation is
a. Big b. Small c. Moderate d. None
___7. Which of the following is not a measure of dispersion?
a. Mean b. Standard deviation c. Variance d. range
___8. Which of the following is not a measure of position?
a. Decile b. quartile c. percentile d. range
___9. If most repeated observations recorded are outliers of data then mode is
considered as
a. intended measure
b. percentage measure
c. best measure
d. poor measure
___10. The mean of a sample is
a. always equal to the mean of the population
b. always smaller than the mean of the population
c. computed by summing the data values and dividing the sum by (n - 1)
d. computed by summing all the data values and dividing the sum by the
number of items
___11. In a five number summary, which of the following is not used for data
summarization?
a. the smallest value
b. the median
c. the 25th percentile
d. the mean
___12. Since mode is the most frequently occurring data value, it
a. can never be larger than the mean
b. is always larger than the median
c. must have a value of at least two
d. None of the above answers is correct.
A researcher has collected the following sample data.
5 12 6 8 5
6 7 5 12 4
___13. The median is
a. 5 b. 6 c. 7 d. 8
___14. The mode is
a. 5 b. 6 c. 7 d. 8
___15. The mean is
a. 5 b. 6 c. 7 d. 8

LESSON 1: SUMMARY MEASURES

Objectives:

At the end of the lesson, you should be able to:

1. calculate and interpret correctly the mean, median, and mode;
2. generate and interpret correctly the range, variance, and standard
deviation;
3. calculate the decile, quartile, and percentile of a set of data accurately;
and
4. solve problems applying summary measure concepts correctly.

Let’s Engage!

In descriptive statistics, summary statistics are used to summarize a set
of observations, in order to communicate the largest amount of information as simply
as possible. Summary statistics summarize and provide information about
your sample data. They tell you something about the values in your data set. This
includes where the average lies and whether your data is skewed. Summary statistics
fall into three main categories: measures of location (also called central tendency),
measures of spread, and graphs/charts.

LET’S TALK ABOUT IT!


A. Measures of Central Tendency

A measure of central tendency is a single figure that is representative of the
general level of magnitudes or values of the items in the data set. It is called a measure
of central tendency because when the data points are arranged according to
magnitude, it tends to lie centrally within the set.
Any measure indicating the center of a set of data arranged in an increasing or
decreasing order of magnitude is called a measure of central location or a measure of
central tendency.

A. Mean
1. Arithmetic mean (or average). This is the most widely used measure of
location. It is calculated by adding the values of the observations and dividing
by the total number of observations.

Population mean: If a set of data $x_1, x_2, \ldots, x_N$ represents a finite population of
size $N$, then the population mean is

$$\mu = \frac{\sum x_i}{N}$$

Sample mean: If a set of data $x_1, x_2, \ldots, x_n$ represents a finite sample of size $n$,
then the sample mean is

$$\bar{x} = \frac{\sum x_i}{n}$$

2. Weighted mean

$$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$$

Where:
$x_i$ = ith data point
$w_i$ = weight of the ith data point
$\sum w_i x_i$ = sum of the products of the data points and their corresponding weights.

Example: Renan has the following grades. Determine his GPA (grade point average).

Subject         Units (wᵢ)   Grade (xᵢ)   wᵢxᵢ
Filipino 2          3            87         261
English             3            84         252
Math 7              3            85         255
P.E                 1            95          95
Chem 1 (lec)        3            82         246
Chem 2 (lab)        1            82          82
Philo 1             3            85         255
Σ                  17                      1,446

Thus, the weighted mean is:

$$\bar{x}_w = \frac{1{,}446}{17} = 85.06$$
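The weighted mean computation can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_mean` and the variable names are assumptions made here, not part of the module. The data are the units and grades from Renan's example.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Grades (x_i) and units (w_i) from the GPA example
grades = [87, 84, 85, 95, 82, 82, 85]
units = [3, 3, 3, 1, 3, 1, 3]

gpa = weighted_mean(grades, units)
print(round(gpa, 2))  # 85.06
```

Subjects with more units pull the average toward their grade, which is why the result differs from the plain average of the seven grades.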

B. Median. The median, $\tilde{x}$, of a set of observations arranged in an increasing or
decreasing order of magnitude is the middle value when the number of
observations is odd, or the arithmetic mean of the two middle values when the number
of observations is even. The formula is given by

$$\tilde{x} = x_{\frac{n+1}{2}} \text{ if } n \text{ is odd}, \qquad \tilde{x} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} \text{ if } n \text{ is even}$$

a) Find the median of the scores 7, 2, 3, 7, 6, 9, 10, 8, 9, 9, 10.
Solution: arrange the scores in increasing magnitude or ascending order:
2, 3, 6, 7, 7, 8, 9, 9, 9, 10, 10.
With these eleven scores, the number 8 is located in the middle, so 8 is the median.
b) Find the median of the scores 7, 2, 3, 7, 6, 9, 10, 8, 9, 9.
Solution: again arrange the scores:
2, 3, 6, 7, 7, 8, 9, 9, 9, 10.
The two center scores are 7 and 8, so we find the mean of these two scores:

$$\frac{7+8}{2} = 7.5$$

Thus, 7.5 is the median of the given scores.
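The odd and even cases of the median can be sketched as follows. The function name `median_of` is an assumption for illustration; the first data set is the eleven scores of example (a), and the second is the ten ordered scores listed in the solution of example (b).

```python
def median_of(scores):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions n/2 and n/2 + 1


print(median_of([7, 2, 3, 7, 6, 9, 10, 8, 9, 9, 10]))  # 8
print(median_of([2, 3, 6, 7, 7, 8, 9, 9, 9, 10]))      # 7.5
```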

C. Mode. The mode of a set of observations is the value which occurs most often or
with the highest frequency. It is the least used measure of central tendency.

Examples:
a. The scores 1, 2, 3, 2, 4, 7, 9, 2 have a mode of 2.
b. The scores 2, 3, 6, 7, 8, 9 have no mode since no score is repeated.
c. The scores 1, 2, 2, 3, 4, 5, 2, 5, 6, 6, 7, 9, 6 have the modes 2 and 6 since they both
occur with the same highest frequency (we refer to such data as bimodal).
d. The scores 3, 4, 5, 1, 3, 2, 4, 5, 7, 10 have the modes 3, 4, and 5.

Advantages of using the mode as the measure of central tendency:

1. It requires no calculation.
2. It can be used for quantitative as well as qualitative data.

Note: the mode does not always exist. For some sets of data, there may be several values
occurring with the greatest frequency, in which case there is more than one mode. If
there are two modes, the distribution is said to be bimodal; if there are more than two
modes, it is multimodal.
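A small sketch of finding the mode or modes is shown below. The function name `modes_of` is hypothetical, and the rule that a data set with no repeated value has no mode follows example (b) above rather than any library convention.

```python
from collections import Counter


def modes_of(scores):
    """Return the list of modes; an empty list means there is no mode."""
    if not scores:
        return []
    counts = Counter(scores)
    highest = max(counts.values())
    if highest == 1:
        # No score is repeated, so the data set has no mode.
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes_of([1, 2, 3, 2, 4, 7, 9, 2]))                 # [2]
print(modes_of([2, 3, 6, 7, 8, 9]))                       # []
print(modes_of([1, 2, 2, 3, 4, 5, 2, 5, 6, 6, 7, 9, 6]))  # [2, 6]
print(modes_of([3, 4, 5, 1, 3, 2, 4, 5, 7, 10]))          # [3, 4, 5]
```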

****Midrange
It is defined as the mean of the largest and the smallest values in a data set.
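As a quick illustration, the midrange can be computed directly from the minimum and maximum. The function name `midrange` is an assumption; the scores are those of mode example (a).

```python
def midrange(data):
    """Mean of the smallest and the largest values in the data set."""
    if not data:
        raise ValueError("midrange of an empty data set is undefined")
    return (min(data) + max(data)) / 2


print(midrange([1, 2, 3, 2, 4, 7, 9, 2]))  # 5.0
```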

IT’S YOUR TURN!

Activity 1

Answer the following and use a separate paper (long bond paper) for your answers.

1. The following data are ages of infants (in months) at which they walked alone. These
sample data were obtained from two populations, A and B.

Sample A 9.8 11 9.5 9.8 10 13 10 14 10 9 10 9.5
Sample B 12 9.5 9.5 14 12 14 13 12 12 14 14 13
Compare the two groups as to their mean, median, mode, and midrange.

Central tendency A B _
Mean ___________ ___________
median ___________ ___________
mode(s) ___________ ___________
midrange ___________ ___________

2. A study was conducted to determine the level of awareness of the residents of a
certain municipality on the causes of hypertension. The accompanying table shows
the results with respect to the gender of the respondents.

Indicator                                        Male      Female    Weighted
                                                 (n=32)    (n=48)    Mean
                                                 x̄1        x̄2        x̄w
1. Intake of food high in salt                   2.8       2.8
2. Intake of food high in saturated fats         2.4       3.0
3. Caffeine consumed in more than 4 cups         2.5       2.4
   of coffee a day
4. Drinking too much alcohol                     2.9       3.0
5. Smoking                                       3.2       3.2
6. Intake of drugs or medications that may       2.8       2.8
   increase blood pressure
7. Lack of physical exercise                     3.0       3.1
8. Stress                                        2.8       3.0
9. Family history of hypertension                2.6       2.7
10. Present medical conditions like kidney       2.5       2.5
   disorder, diabetes mellitus and others
Average

a) Find the average for the male and for the female.
b) Find the weighted mean, x̄w, for each indicator, using n as the weight.
3. The numbers of incorrect answers on a true or false competency test for a random
sample of 15 students were recorded as follows: 2, 1, 3, 0, 1, 3, 6, 0, 3, 3, 5, 2, 1, 4,
and 2. Find the following: (a) Mean, (b) Median, (c) Mode, and (d) Midrange.
4. An employee had accumulated 20 competency credits with the grade of A, 25 credits
with B’s, 10 credits with C’s, and 2 credits with D’s. The company uses the competency
evaluation scale in which A=4 grade points, B=3, C=2 and D=1. Determine the grade
point competency average.
5. A student was taking 6 subjects in college during the first semester. Find his
average grade if his final grade were as follows:

Subject Grade Units
Math 1.75 3
Physics 2.50 5
English 2.25 3
Speech 1.50 2
Statistics 3.00 4
6. An economist studying trends in gasoline prices within a city takes a sample of 30 of
the city’s gas stations, determining for each station the price per liter (in pesos) of
unleaded regular gasoline. The results are given below. Find the mean and the median
prices.
Price (in peso) frequency
54.60 1
55.64 3
56.16 1
57.20 12
57.72 8
58.24 5

B. Measures of Dispersion
Measures of Variability
A measure of variability of a set of data is a number that conveys the idea of
spread for the data set. The measures of dispersion report on how far the
values of the distribution are from the center.

A. Range
The range is the simplest measure of variability. It measures the distance between
the largest and the smallest values and, as such, gives an idea of the spread of the data
set. However, the range does not use the concept of deviation. It is affected by outliers
and does not consider all values in the data set. Thus it is not a very useful measure
of variability.
Range (R) = highest value - lowest value

Example:
Consider these three sets of quiz scores: Find the range of each quiz scores
(i) Section A: 5 5 5 5 5 5 5 5 5 5
(ii) Section B: 0 0 0 0 0 10 10 10 10 10
(iii) Section C: 4 4 4 5 5 5 5 6 6 6

Solution:
(i) For section A, the range is 0 since both the maximum and minimum are 5 and
5 – 5 = 0
(ii) For section B, the range is 10 since 10 – 0 = 10
(iii) For section C, the range is 2 since 6 – 4 = 2
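The three ranges above can be checked with a short sketch. The function name `value_range` is an assumption for illustration (chosen to avoid shadowing Python's built-in `range`).

```python
def value_range(data):
    """Range (R) = highest value - lowest value."""
    return max(data) - min(data)


section_a = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
section_b = [0, 0, 0, 0, 0, 10, 10, 10, 10, 10]
section_c = [4, 4, 4, 5, 5, 5, 5, 6, 6, 6]

print(value_range(section_a))  # 0
print(value_range(section_b))  # 10
print(value_range(section_c))  # 2
```

All three sections have the same mean of 5, so the range is what distinguishes them here.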

B. Mean Absolute Deviation


The mean absolute deviation (MAD) uses the deviations of the data values from the mean
in its computation; it is the arithmetic mean of the absolute values of
the deviations from the mean, computed using the formula

Parameter: $$MAD = \frac{\sum |x_i - \mu|}{N}$$

Statistic: $$MAD = \frac{\sum |x_i - \bar{x}|}{n-1}$$
If the data set A has a greater MAD than the data set B, then it is reasonable to believe
that the values in data set A are more spread out (variable) than the values in set B.

Example:
Section D, with scores 0 5 5 5 5 5 5 5 5 10.

We could compute for each data value the difference between the data value and the
mean:

data value   deviation: data value - mean   absolute deviation
0            0 - 5 = -5                     |-5| = 5
5            5 - 5 = 0                      |0| = 0
5            5 - 5 = 0                      |0| = 0
5            5 - 5 = 0                      |0| = 0
5            5 - 5 = 0                      |0| = 0
5            5 - 5 = 0                      |0| = 0
5            5 - 5 = 0                      |0| = 0
5            5 - 5 = 0                      |0| = 0
5            5 - 5 = 0                      |0| = 0
10           10 - 5 = 5                     |5| = 5

We would like to get an idea of the “average” deviation from the mean, but if we
find the average of the values in the second column the negative and positive values
cancel each other out (this will always happen), so to prevent this we take the absolute
value of every entry in the second column.

We then add the absolute deviations up to get 5 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 +
5 = 10. Ordinarily we would then divide by the number of scores, n (in this case,
10), to find the mean of the deviations. But we only do this if the data set represents
a population:

Parameter: $$MAD = \frac{10}{10} = 1$$

If the data set represents a sample (as it almost always does), we instead divide
by n – 1 (in this case, 10 – 1 = 9).

Statistic: $$MAD = \frac{10}{10-1} = 1.11$$

These values (1 and 1.11) are called, respectively, the population mean of
deviations and the sample mean of deviations for section D.

***If we are unsure whether the data set is a sample or a population, we will usually
assume it is a sample, and we will round answers to one more decimal place than the
original data.
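The MAD formulas given earlier (absolute deviations divided by N for a population, or by n - 1 for a sample, as this module defines the statistic) can be sketched as follows. The function name and the `sample` flag are assumptions for illustration; the data are the Section C quiz scores from the range example.

```python
def mean_absolute_deviation(data, sample=True):
    """Average absolute deviation from the mean.

    Divides by n - 1 when sample=True (the convention used in this module
    for the statistic) and by N when sample=False (the parameter).
    """
    n = len(data)
    if n < 2 and sample:
        raise ValueError("a sample needs at least two values")
    mean = sum(data) / n
    total = sum(abs(x - mean) for x in data)
    return total / (n - 1 if sample else n)


section_c = [4, 4, 4, 5, 5, 5, 5, 6, 6, 6]

print(round(mean_absolute_deviation(section_c, sample=False), 2))  # 0.6
print(round(mean_absolute_deviation(section_c, sample=True), 2))   # 0.67
```

Note that many textbooks divide by n in both cases; the n - 1 divisor is kept here only to match the formula stated in this lesson.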

C. Variance and Standard Deviation


The variance and the standard deviation are the most common and useful
measures of variability. These two measures provide information about how the data
vary about the mean. The variance is the measure of variation which considers the
position of each observation relative to the mean of the set. It is an approximate
average of the squared deviations from the sample mean. The standard deviation is
the square root of the variance.
A few important characteristics:
• Standard deviation is never negative. It is zero if all the data values are equal,
and gets larger as the data spread out.
• Standard deviation has the same units as the original data.
• Standard deviation, like the mean, can be highly influenced by outliers.

Population variance.

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

Sample variance.

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$

Population Standard Deviation.

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$

Sample Standard Deviation.

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$$

Where:
$\sigma^2$ = population variance
$s^2$ = sample variance
$\sigma$ = population standard deviation
$s$ = sample standard deviation
$\bar{x}$ = sample mean
$n$ = sample size
$x_i$ = ith observation
$\mu$ = population mean
$N$ = population size

If the data are clustered around the mean, then the variance and the standard
deviation will be relatively small. If, however, the data are widely scattered about the
mean, the variance and the standard deviation will be relatively large.

Notes:
1. We divide by the quantity n-1 in order to make the sample variance an
unbiased estimator of the population variance. (an estimator is unbiased if its
average value is equal to the parameter it is estimating.)
2. The sample variance uses the squares of the deviations from the mean, as this
will eliminate the effects of the signs (as was also the case when we used the
absolute value of the deviation in computing the MAD.)
3. The unit of the standard deviation is the same as that of the raw data, so it is
preferable to use the standard deviation as a measure of variability instead of
the variance.

To Compute Variance and Standard Deviation


1. Find the deviation of each data from the mean. In other words, subtract the
mean from the data value.
2. Square each deviation.
3. Add the squared deviations.
4. Divide by n, the number of data values, if the data represents a whole
population; divide by n – 1 if the data is from a sample.
5. To compute the standard deviation, take the square root of the result.

Example
Computing the standard deviation for Section B = 0 0 0 0 0 10 10 10 10 10, we
first calculate that the mean is 5. Using a table can help keep track of your
computations for the variance and standard deviation:

data value   deviation: data value - mean   deviation squared
0            0 - 5 = -5                     (-5)² = 25
0            0 - 5 = -5                     (-5)² = 25
0            0 - 5 = -5                     (-5)² = 25
0            0 - 5 = -5                     (-5)² = 25
0            0 - 5 = -5                     (-5)² = 25
10           10 - 5 = 5                     (5)² = 25
10           10 - 5 = 5                     (5)² = 25
10           10 - 5 = 5                     (5)² = 25
10           10 - 5 = 5                     (5)² = 25
10           10 - 5 = 5                     (5)² = 25

▪ Assuming this data represents a population, we will add the squared
deviations, divide by 10, the number of data values, and compute:
Population Variance

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{250}{10} = 25$$

Population Standard Deviation

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} = \sqrt{25} = 5$$

▪ These values (25 and 5) are called, respectively, the population variance and
the population standard deviation for section B.

▪ Assuming this data represents a sample, we will add the squared deviations, divide
by 9, one less than the number of data values, and compute:
Sample Variance

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{250}{9} = 27.78$$

Sample Standard Deviation

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} = \sqrt{27.78} = 5.27$$

These values (27.78 and 5.27) are called, respectively, the sample variance and
the sample standard deviation for section B.
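The five computation steps can be followed literally in code. The sketch below is illustrative: the function name `variance_and_sd` and its `sample` flag are assumptions, and the results are cross-checked against Python's standard `statistics` module. The data are the Section B scores.

```python
import math
import statistics


def variance_and_sd(data, sample=True):
    """Follow the steps: deviations, squares, sum, divide, square root."""
    n = len(data)
    mean = sum(data) / n
    deviations = [x - mean for x in data]        # step 1
    squared = [d ** 2 for d in deviations]       # step 2
    total = sum(squared)                         # step 3
    variance = total / (n - 1 if sample else n)  # step 4
    return variance, math.sqrt(variance)         # step 5


section_b = [0, 0, 0, 0, 0, 10, 10, 10, 10, 10]

pop_var, pop_sd = variance_and_sd(section_b, sample=False)
samp_var, samp_sd = variance_and_sd(section_b, sample=True)

print(pop_var, pop_sd)                          # 25.0 5.0
print(round(samp_var, 2), round(samp_sd, 2))    # 27.78 5.27

# Cross-check with the standard library
print(statistics.pvariance(section_b), statistics.variance(section_b))
```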

***If we are unsure whether the data set is a sample or a population, we will usually
assume it is a sample, and we will round answers to one more decimal place than the
original data.

IT’S YOUR TURN!

Activity 2

Answer the following and use a separate paper (long bond paper) for your answers.

1. A study of the effects of smoking on sleep patterns is conducted. The measure observed
is the time, in minutes, that it takes to fall asleep. These data are obtained:
Smokers: 69.3, 56.0, 22.1, 47.6, 53.2, 48.1, 52.7, 34.4, 60.2, 43.8, 23.2,
13.8
Nonsmokers: 28.6, 25.1, 26.4, 34.9, 29.8, 28.4, 38.5, 30.2, 30.6, 31.8, 41.6, 21.1,
36.0, 37.9, 13.9
a. Find the sample mean, mode and median for each group.
b. Find the range, mean absolute deviation, variance, and standard deviation for each
group.
c. Comment on what kind of impact smoking appears to have on the time required
to fall asleep.
2. Faculty salaries for a random sample of teachers in the public school system of a
certain town were coded by dividing each salary by 1000. Find the MAD and the
standard deviation of these salaries if the coded observations are 18, 15, 21, 19, 13,
15, 14, 23, 18 and 16 pesos.

C. Measures of Position (Quantiles)


Measures of position or quantiles are different techniques that divide a set of
data into equal groups. To determine the measurement of position, the data must
be sorted from lowest to highest.

1. Percentiles
Percentiles divide the data set into one hundred equal parts. Each data set
has 99 percentiles, denoted by P1, P2, …, P99.
Note:
A percentile is a value in the data set.
A percentile rank of a given value is a percent that indicates the percentage of
the data that is smaller than the value.

2. Deciles
The deciles divide the data set into ten equal parts. Each data set
has 9 deciles, denoted by D1, D2, …, D9.
Note: The first decile and tenth percentile are the same, D1=P10.
Similarly D2=P20, D3=P30, …, D9=P90

3. Quartiles
The quartiles divide the data set into four equal parts. Each data set has
3 quartiles, denoted by Q1, Q2, and Q3.
The first quartile Q1 is a value in the data set such that 25% of the values fall below
Q1 and 75% of all the values fall above Q1.
The second quartile Q2 is a value in the data set such that 50% of the values fall
below Q2 and 50% of all the values fall above Q2.
The third quartile Q3 is a value in the data set such that 75% of the values fall below
Q3 and 25% of all the values fall above Q3.

Note:
Q1=P25, Q2=P50, Q3=P75
Median: the 50th percentile, 5th decile and the second quartile of the distribution
are equal to the same value and are referred to as the median. That is

Median=Q2=D5 =P50

Computation of Measures of Position (Quantile)

The starting point for finding the quantile value for ungrouped data is to
arrange the data set in order, then locate the position of the quantile, and then pick
the value from the arranged data set that corresponds to that position. The
position k is computed as follows:

For the quartile Qp, we pick the item at position k = p(N+1)/4

For the decile Dp, we pick the item at position k = p(N+1)/10

For the percentile Pp, we pick the item at position k = p(N+1)/100

Where:
p = 1, 2, or 3 for quartiles, 1 to 9 for deciles, and 1 to 99 for percentiles.
N = number of items or values

Example:

Given the ungrouped data below, find the 90th percentile, 5th decile, and 3rd quartile.

{95, 83, 67, 86, 93, 82, 71, 86, 97, 55, 75, 88, 70, 40, 89, 79, 90, 46, 75, 66, 81, 75,
49, 50, 55, 68, 55, 70, 75, 92}

First we rank or arrange them in numerical order:

{40, 46, 49, 50, 55, 55, 55, 66, 67, 68, 70, 70, 71, 75, 75, 75, 75, 79, 81, 82, 83, 86,
86, 88, 89, 90, 92, 93, 95, 97}

90th percentile: k = p(N+1)/100
k = 90(30+1)/100
k = 27.9
There is no 27.9th number in the sequence, so we round to the nearest
whole number to get the 28th term.
P90 = 28th term = 93

5th decile: k = p(N+1)/10
k = 5(30+1)/10
k = 15.5
The 15th and 16th terms are both 75, so D5 = 75

3rd quartile: k = p(N+1)/4
k = 3(30+1)/4
k = 23.25, which rounds to the 23rd term
Q3 = 23rd term = 86
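The position rule used in this example can be sketched as a small function. The name `quantile_value` and its parameters are assumptions for illustration. Rounding a fractional position to the nearest whole position (with halves rounded up) follows the worked example; it is not the interpolation rule that statistical software typically uses.

```python
import math


def quantile_value(data, p, parts):
    """Return (k, value) where k = p(N + 1) / parts.

    parts is 4 for quartiles, 10 for deciles, and 100 for percentiles.
    """
    ordered = sorted(data)
    n = len(ordered)
    k = p * (n + 1) / parts
    position = math.floor(k + 0.5)       # nearest whole position
    position = min(max(position, 1), n)  # keep the position within 1..N
    return k, ordered[position - 1]


scores = [95, 83, 67, 86, 93, 82, 71, 86, 97, 55, 75, 88, 70, 40, 89,
          79, 90, 46, 75, 66, 81, 75, 49, 50, 55, 68, 55, 70, 75, 92]

print(quantile_value(scores, 90, 100))  # (27.9, 93)
print(quantile_value(scores, 5, 10))    # (15.5, 75)
print(quantile_value(scores, 3, 4))     # (23.25, 86)
```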

IT’S YOUR TURN!
Activity 3

Answer the following and use a separate paper (long bond paper) for your answers.

1. For the data set below, which value is in the 75th percentile, 4th
decile and 1st quartile? {1, 3, 3, 4, 6, 7, 7, 7, 8, 9, 9, 10, 12, 15, 16,
17}
2. Which of the following data values is the 50th percentile, 8th decile
and 3rd quartile?{1.52, 5.36, 6.79, 5.21, 0.28, 6.36, 8.47, 5.52, 6.26,
5.97}

LESSON 2. NORMAL DISTRIBUTIONS

Objectives:

At the end of the lesson, you should be able to:


1. find areas under the standard normal curve accurately

LET’S ENGAGE!

The normal distribution is the most important probability distribution
in statistics because it fits many natural phenomena. For example, heights, blood
pressure, measurement error, and IQ scores approximately follow the normal
distribution. It is also known as the Gaussian distribution and the bell curve.
The normal distribution is a probability function that describes how the values
of a variable are distributed. It is a symmetric distribution where most of the
observations cluster around the central peak and the probabilities for values further
away from the mean taper off equally in both directions. Extreme values in both tails of
the distribution are similarly unlikely.
In this lesson, you’ll learn how to use the normal distribution, its parameters,
and how to calculate Z-scores to standardize your data and find probabilities.

LET’S TALK ABOUT IT!

A. The Normal Distribution


One of the most important statistical distributions of data is known as the normal
distribution. This distribution occurs in a variety of applications. Types of data that
may demonstrate a normal distribution include the lengths of leaves on a tree, the
weights of newborns in a hospital, the length of time of a student’s trip from school to
home over a period of months, the SAT scores of a large group of students, and the
life spans of light bulbs.

A normal distribution forms a bell-shaped curve that is symmetric about a
vertical line through the mean of the data. A graph of a normal distribution with a mean
of 5 is shown at the left.

Properties of a Normal Distribution

Every normal distribution has the following properties.

▪ The graph is symmetric about a vertical line through the mean of a normal
distribution.
▪ The mean, median, and mode are equal.
▪ The y-value of each point on the curve is the percent (expressed as a decimal) of
the data at the corresponding x-value.
▪ Areas under the curve that are symmetric about the mean are equal.
▪ The total area under the curve is equal to one.

[Figure: normal curve centered at μ on the x-axis; total area under the curve = 1]
• A normal distribution can have any mean and any positive standard deviation.
• The mean gives the location of the line of symmetry.
• The standard deviation describes the spread of the data.

[Figure: three normal curves with (μ = 3.5, σ = 1.5), (μ = 3.5, σ = 0.7), and (μ = 1.5, σ = 0.7)]

Empirical Rule for a Normal Distribution

In a normal distribution, approximately

▪ 68% of the data lie within 1 standard deviation of the mean.
▪ 95% of the data lie within 2 standard deviations of the mean.
▪ 99.7% of the data lie within 3 standard deviations of the mean.
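The three percentages of the empirical rule can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `area_within` is an assumption for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, f"{area_within(k):.4f}")
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The rule's 68%, 95%, and 99.7% are these values rounded, which is why it is stated as approximate.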

The standard Normal Distribution


A process of transformation which alters a given raw score into a standard
normal score is called the standardization procedure. The resulting normal random
variable has mean 0 and standard deviation 1 and is denoted by Z, called the standard
normal variable.
The transformation process: if X is a normal random variable with mean µ and
standard deviation σ, then

$$Z = \frac{x - \bar{x}}{s} \text{ for a sample, or } Z = \frac{x - \mu}{\sigma} \text{ for a population}$$

This standardized random variable Z has a normal distribution which is known
as the standard normal distribution. The standard score, or z-score, is the number of
standard deviations that a given value x is above or below the mean.
Table of Areas Under the Normal Curve
(Adapted from [Link]
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359

0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753

0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141

0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517

0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879

0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224

0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549

0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852

0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133

0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389

1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621

1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830

1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015

1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177

1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319

1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441

1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545

1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633

1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706

1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767

2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817

2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857

2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890

2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916

2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936

2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952

2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964

2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974

2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981

2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986

3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990

Standardizing also allows you to compare numbers from different distributions.
For example, suppose Bob scores 80 on both his math exam (which has a mean of 70
and standard deviation of 10) and his English exam (which has a mean of 85 and
standard deviation of 5). On which exam did Bob do better, in terms of his relative
standing in the class?
Bob’s math exam score of 80 standardizes to a z-value of

$$Z = \frac{80 - 70}{10} = 1$$

That tells us his math score is one standard deviation above the class average. His
English exam score of 80 standardizes to a z-value of

$$Z = \frac{80 - 85}{5} = -1$$

putting him one standard deviation below the class average. Even though Bob scored
80 on both exams, he actually did better on the math exam than the English exam,
relatively speaking.
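Bob's comparison can be reproduced with a one-line standardization function. The name `z_score` and its parameter names are assumptions for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above or below the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


math_z = z_score(80, mean=70, sd=10)
english_z = z_score(80, mean=85, sd=5)

print(math_z)     # 1.0
print(english_z)  # -1.0
print("math" if math_z > english_z else "english")  # math
```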

To interpret a standard score, you don’t need to know the original score, the
mean, or the standard deviation. The standard score gives you the relative standing of
a value, which in most cases is what matters most. In fact, on most national
achievement tests, they won’t even tell you what the mean and standard deviation were
when they report your results; they just tell you where you stand on the distribution by
giving you your z-score.

Finding Areas under the Curve of a Normal Distribution

Steps and example

1. Sketch the standard normal curve and shade the appropriate area
under the curve.
2. Find the area by following the directions for each case shown.
a. To find the area to the left of z, find the area that
corresponds to z in the Standard Normal Table.

• Use the table to fin the area for the Z score of 1.23
• The area to the left of Z = 1.23 or 0.8907 (shaded)
b. To find the area to the right of z, use the Standard Normal
Table to find the area that corresponds to z. Then subtract
the area from 1.

• Use the table to find the area of the z score.


• The area to the left of Z = 1.23 is 0.8907 (shaded).
• Subtract to find the area to the right of Z = 1.23: 1 − 0.8907 = 0.1093
(unshaded).
c. To find the area between two z-scores, find the area
corresponding to each z-score in the Standard Normal Table.
Then subtract the smaller area from the larger area.

• Use the table to find the area of the z score.


• The area to the left of Z = 1.23 is 0.8907 (shaded).
• The area to the left of Z = -0.75 is 0.2266
• Subtract to find the area between the two z scores: 0.8907 – 0.2266 =
0.6641
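
The three cases above can also be checked without a printed table. The sketch below assumes the standard identity that the area to the left of z equals 0.5(1 + erf(z/√2)); the function name `area_left` is an illustrative choice. Computed values may differ from the table in the last decimal place, because the table entries are rounded before they are subtracted.

```python
import math

def area_left(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Case a: area to the left of z = 1.23
left = area_left(1.23)

# Case b: area to the right of z = 1.23 (subtract the left area from 1)
right = 1.0 - left

# Case c: area between z = -0.75 and z = 1.23 (larger area minus smaller area)
between = area_left(1.23) - area_left(-0.75)

print(round(left, 4), round(right, 4), round(between, 4))
```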

IT’S YOUR TURN!


Activity 4
Answer the following and use a separate paper (long bond paper) for your answers.

1. A normal distribution of scores has a standard deviation of 10. Find the
z-scores corresponding to each of the following values:
a. A score of 60, where the mean score of the sample data values is 40.
b. A score that is 30 points below the mean.
c. A score of 80, where the mean score of the sample data values is 30.
d. A score of 20, where the mean score of the sample data values is 50.
2. A normal distribution of scores has a standard deviation of 10. Find the z-
scores corresponding to each of the following values:
a. A score that is 20 points above the mean.
b. A score that is 10 points below the mean.
c. A score that is 15 points above the mean.
d. A score that is 30 points below the mean.
3. Given a standard normal distribution, sketch the standard normal curve and
find the area under the curve that lies
a. Between z= -0.46 to z= 2.21
b. To the left of z= -0.6
c. To the right of z=1.96
4. What z-scores correspond to the following areas under the normal curve?
a. Area of .01 to the right of +z
b. Area of .01 to the left of -z
c. Area of .05 to the right of +z
d. Area of .05 to the left of -z
e. Area of .90 between ±z
f. Area of .99 between ±z

LESSON 3. LINEAR REGRESSION AND CORRELATION

OBJECTIVES:

At the end of this lesson you will be able to:


1. solve and interpret correctly the linear regression model
2. solve and interpret correctly the linear correlation using Pearson correlation
coefficient.

LET’S ENGAGE!

The most commonly used techniques for investigating the relationship between
two quantitative variables are correlation and linear regression. Correlation quantifies
the strength of the linear relationship between a pair of variables, whereas regression
expresses the relationship in the form of an equation.

This lesson introduces methods of analyzing the relationship between two
quantitative variables. The calculation and interpretation of the sample
product-moment correlation coefficient and the linear regression equation are
discussed and illustrated.

LET’S TALK ABOUT IT!

Linear Regression

Linear regression attempts to model the relationship between two variables by
fitting a linear equation to observed data. One variable is considered to be an
explanatory variable, and the other is considered to be a dependent variable.

A linear regression line has an equation of the form Y = a + bX, where X is the
explanatory variable or independent variable and Y is the dependent variable. The slope
of the line is b, and a is the intercept (the value of y when x = 0).

You might recognize this equation as the slope-intercept form of a line: Y is
plotted on the vertical axis, X is plotted on the horizontal axis, b is the slope
of the line, and a is the y-intercept. The intercept and slope are computed from
the data as follows:

$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$

$b = \frac{n\sum xy - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$

The first step in finding a linear regression equation is to determine if there is a
relationship between the two variables. This is often a judgment call for the
researcher. You’ll also need a list of your data in x-y format (i.e. two columns of
data, one for the independent variable and one for the dependent variable).
Warnings:
▪ Just because two variables are related, it does not mean that one causes the
other. For example, although there is a relationship between high GWA scores
and better performance in grad school, it doesn’t mean that high GWA scores
cause good grad school performance.
▪ If you try to find a linear regression equation for a set of data (especially
through an automated program like Excel), you will find one, but that does not
necessarily mean the equation is a good fit for your data. One technique is to
make a scatter plot first, to see if the data roughly fit a line before you
try to find a linear regression equation.

How to Find a Linear Regression Equation:

Find the linear correlation coefficient and its significant interpretation


Step 1: Make a chart of your data, filling in the columns in the same way as you would
fill in the chart if you were finding the Pearson’s Correlation Coefficient.
Subject   Age (x)   Glucose level (y)   xy      x²      y²
1         43        99                  4257    1849    9801
2         21        65                  1365    441     4225
3         25        79                  1975    625     6241
4         42        75                  3150    1764    5625
5         57        87                  4959    3249    7569
6         59        81                  4779    3481    6561
Σ         247       486                 20485   11409   40022

From the above table, Σx = 247, Σy = 486, Σxy = 20485, Σx² = 11409, Σy² = 40022.
n is the sample size (6, in our case).

Step 2: Use the following equations to find a and b.


$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} = 65.1416$

$b = \frac{n\sum xy - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} = 0.385225$

Find a:
a = ((486 × 11,409) − (247 × 20,485)) / (6(11,409) − 247²)
  = (5,544,774 − 5,059,795) / (68,454 − 61,009)
  = 484,979 / 7,445
  = 65.14

Find b:
b = (6(20,485) − (247 × 486)) / (6(11,409) − 247²)
  = (122,910 − 120,042) / (68,454 − 61,009)
  = 2,868 / 7,445
  = 0.385225

Step 3: Insert the values into the equation.


y’ = a + bx
y’ = 65.14 + .385225x
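
The three steps can be reproduced in Python as a check on the hand computation. This is a minimal sketch using the age and glucose data from Step 1; the function name `fit_line` and the variable names are illustrative choices, not part of the module. It should give values that agree with a ≈ 65.14 and b ≈ 0.385.

```python
ages = [43, 21, 25, 42, 57, 59]      # x values from the table in Step 1
glucose = [99, 65, 79, 75, 87, 81]   # y values from the table in Step 1

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y' = a + b*x using the sum formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2          # shared denominator of a and b
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b

a, b = fit_line(ages, glucose)
print(f"y' = {a:.4f} + {b:.6f}x")

# Using the model: predicted glucose level for a 50-year-old subject
predicted = a + b * 50
print(round(predicted, 2))
```

Predictions are most trustworthy for x values inside the range of the observed data (ages 21 to 59 here).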

Linear Correlation
Correlation: the degree of relationship between the variables under consideration
is measured through correlation analysis.

To determine the strength of a linear relationship between two variables,
statisticians use a statistic called the linear correlation coefficient, which is
denoted by the variable r and is defined as follows.

Linear Correlation Coefficient

For the n ordered pairs (x₁, y₁), (x₂, y₂), (x₃, y₃), …, (xₙ, yₙ), the linear
correlation coefficient r is given by

$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2} \cdot \sqrt{n\sum y^2 - (\sum y)^2}}$

If the linear correlation coefficient r is positive, the relationship between the
variables has a positive correlation. In this case, if one variable increases, the other
variable also tends to increase. If r is negative, the linear relationship between the
variables has a negative correlation. In this case, if one variable increases, the other
variable tends to decrease.

Properties of the Linear Correlation Coefficient

1. The linear correlation coefficient r is always a real number between -1 and 1,
inclusive. In the case in which
▪ all of the ordered pairs lie on a line with positive slope, r is 1.
▪ all of the ordered pairs lie on a line with negative slope, r is -1.
2. For any set of ordered pairs, the linear correlation coefficient r and the slope of
the least-squares line both have the same sign.
3. Interchanging the variables in the ordered pairs does not change the value of r.
Thus the value of r for the ordered pairs (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) is the same as
the value of r for the ordered pairs (y₁, x₁), (y₂, x₂), …, (yₙ, xₙ).
4. The value of r does not depend on the units used. You can change the units of a
variable from, for example, feet to inches, and the value of r will remain the same.
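
The properties above can be illustrated numerically. The sketch below is only a demonstration: the height and weight values are made up for illustration and do not come from the module, and the function name `pearson_r` is an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r computed with the sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Hypothetical data (assumed for illustration only)
heights_ft = [5.0, 5.4, 5.8, 6.0, 6.3]
weights_kg = [52, 60, 63, 71, 80]

r_original = pearson_r(heights_ft, weights_kg)

# Property 3: interchanging the variables does not change r
r_swapped = pearson_r(weights_kg, heights_ft)

# Property 4: changing units (feet to inches) does not change r
heights_in = [12 * h for h in heights_ft]
r_inches = pearson_r(heights_in, weights_kg)

# Property 1: points lying exactly on a line give r = 1 or r = -1
r_up = pearson_r([1, 2, 3], [2, 4, 6])
r_down = pearson_r([1, 2, 3], [6, 4, 2])

print(r_original, r_swapped, r_inches, r_up, r_down)
```

Because of floating-point rounding, the printed values may agree only to many decimal places rather than exactly.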

Example:

Find the linear correlation coefficient for stride length versus speed of an adult man.
Round your results to the nearest hundredth.

Stride length (m)   2.5   3.0   3.3   3.5   3.8   4.0   4.2   4.5
Speed (m/s)         3.4   4.9   5.5   6.6   7.0   7.7   8.3   8.7

Solution

Complete the table to be used in the formula

Subject   Stride length (m), x   Speed (m/s), y   x²       y²       xy
1         2.5                    3.4              6.25     11.56    8.5
2         3.0                    4.9              9        24.01    14.7
3         3.3                    5.5              10.89    30.25    18.15
4         3.5                    6.6              12.25    43.56    23.1
5         3.8                    7.0              14.44    49       26.6
6         4.0                    7.7              16       59.29    30.8
7         4.2                    8.3              17.64    68.89    34.86
8         4.5                    8.7              20.25    75.69    39.15
Σ         28.8                   52.1             106.72   362.25   195.86

$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2} \cdot \sqrt{n\sum y^2 - (\sum y)^2}}$

$= \frac{8(195.86) - (28.8)(52.1)}{\sqrt{8(106.72) - (28.8)^2} \cdot \sqrt{8(362.25) - (52.1)^2}}$

$= 0.993715$

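Rounded to the nearest hundredth, r ≈ 0.99. Because r is positive and close to 1,
it indicates a strong positive correlation between a man’s stride length and his
speed. That is, as a man’s stride length increases, his speed also tends to increase.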
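
The same result can be checked in Python. This is a minimal sketch using the stride-length and speed data from the example; the function name `pearson_r` and the variable names are illustrative choices.

```python
import math

stride_m = [2.5, 3.0, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5]    # x: stride length (m)
speed_ms = [3.4, 4.9, 5.5, 6.6, 7.0, 7.7, 8.3, 8.7]    # y: speed (m/s)

def pearson_r(xs, ys):
    """Linear correlation coefficient r computed with the sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

r = pearson_r(stride_m, speed_ms)
print(round(r, 2))
```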

IT’S YOUR TURN!

Activity 5

Answer the following and use a separate paper (long bond paper) for your answers.

Consider the IQ scores and the mathematics grades of freshmen in a certain university.
The data are given below.

Student   IQ (x)   Math Grade (y)   x²   y²   xy
1         65       85
2         50       74
3         55       76
4         65       90
5         55       85
6         70       87
7         65       94
8         70       98
9         55       81
10        70       91
11        50       76
12        55       74
Compute the slope and the intercept of the regression line and the linear correlation
coefficient, and state whether the relationship is significant.

POST ASSESSMENT

In the following multiple-choice questions, choose the correct answer. Write your
answers on a separate sheet of paper to be submitted.

___1. If a data set has an even number of observations, the median
a. cannot be determined
b. is the average value of the two middle items
c. must be equal to the mean
d. is the average value of the two middle items when all items are arranged
in ascending order
___2. The sum of deviations of the individual data elements from their mean is
a. always greater than zero
b. always less than zero
c. sometimes greater than and sometimes less than zero, depending on the
data elements
d. always equal to zero
___3. In a sample of 800 students in a university, 160, or 20%, are Business majors.
Based on the above information, the school's paper reported that "20% of all the
students at the university are Business majors." This report is an example of
a. a sample
b. a population
c. statistical inference
d. descriptive statistics
___4. A statistics professor asked students in a class their ages. On the basis of this
information, the professor states that the average age of all the students in the
university is 21 years. This is an example of
a. a census
b. descriptive statistics
c. an experiment
d. statistical inference
A researcher has collected the following sample data.
15 12 6 8 15
6 7 15 12 4
___5. The median is
a. 5 b. 6 c. 7 d. 8
A researcher has collected the following sample data. The mean of the sample is 5.
3 5 12 3 2
___6. The variance is
a. 80 b. 4.062 c. 13.2 d. 16.5
___7. If the variance of a data set is correctly computed with the formula using n - 1
in the denominator, which of the following is true?
a. the data set is a sample
b. the data set is a population
c. the data set could be either a sample or a population
d. the data set is from a census
___8. The measure of dispersion that is influenced most by extreme values is
a. the variance
b. the standard deviation
c. the range
d. the interquartile range
___9. When should measures of location and dispersion be computed from grouped
data rather than from individual data values?

a. as much as possible since computations are easier
b. only when individual data values are unavailable
c. whenever computer packages for descriptive statistics are unavailable
d. only when the data are from a population
___10. The descriptive measure of dispersion that is based on the concept of a
deviation about the mean is
a. the range
b. the interquartile range
c. both a and b
d. the standard deviation
___11. The standard normal distribution has a mean
a. Less than zero
b. Equal to zero
c. Greater than zero
d. Equal to 1
___12. The total area of the distribution is
a. exactly equal to one
b. Less than one
c. Equal to zero
d. Greater than zero
___13. Which of the following is not a property of a normal distribution?
a. The graph is symmetric about a vertical line through the mean of a
normal distribution.
b. The mean, median, and mode are not equal.
c. The y-value of each point on the curve is the percent (expressed as a
decimal) of the data at the corresponding x-value.
d. Areas under the curve that are symmetric about the mean are equal.
e. The total area under the curve is equal to one.
___14. A survey shows the LEC performance of students and their GWA during their
undergraduate studies. Determine whether there is a significant relationship
between the two variables. (Correlation Analysis)
GWA       80 85 89 76 83 89 84 86 95 81 76 83 89 84 86
Lec Grd   75 76 87 74 78 89 80 87 96 79 74 78 89 80 87

___15. The data below give the problem-solving performance (PSP) of first-year
college students and their senior high school mathematics grades (SHMG).
Using regression analysis, determine whether there is a significant
relationship between the two variables and find the model.
SHMG   80 85 89 76 83 89 84 86 95 81 80 85
PSP    75 76 87 74 78 89 80 87 96 79 75 76

REFERENCES

1. Boston University School of Public Health (2013). Multivariable Methods.
[Link]
704_Multivariable5.html
2. Cengage Learning Asia (2018). Mathematics in the Modern World. Rex Bookstore Inc.
3. Kiernan, D., (2014) Natural Resources Biometrics. Correlation and Simple Linear
Regression, MLE Library, [Link]
resources-biometrics/chapter/chapter-7-correlation-and-simple-linear-
regression/