Test Validity and Reliability Methods
UNDERSTAND
Desired Significant Learning Outcomes
In this lesson, you are expected to:
use procedures and statistical analysis to establish test validity and
reliability;
decide whether a test is valid or reliable; and
decide which test items are easy and difficult.
PREPARE
In order to establish the validity and reliability of an assessment tool, you
need to know the different ways of establishing test validity and reliability. You
are expected to read this before you can analyze your items.
3. Split-Half

Procedure: Administer a test to a group of examinees. The items need to be
split into halves, usually using the odd-even technique. In this technique, get
the sum of the points in the odd-numbered items and correlate it with the sum
of points of the even-numbered items. Each examinee will have two scores coming
from the same test. The scores on each set should be close or consistent.
Split-half is applicable when the test has a large number of items.

Statistical analysis: Correlate the two sets of scores using Pearson r. To
adjust the correlation, use another formula called the Spearman-Brown
Coefficient. The correlation coefficient obtained using Pearson r and
Spearman-Brown should be significant and positive to mean that the test has
internal consistency reliability.
4. Test of Internal Consistency Using Kuder-Richardson and Cronbach's Alpha
Method

Procedure: This procedure involves determining if the scores for each item are
consistently answered by the examinees. After administering the test to a group
of examinees, it is necessary to determine and record the scores for each item.
The idea here is to see if the responses per item are consistent with each
other. This technique will work well when the assessment tool has a large
number of items. It is also applicable for scales and inventories (e.g., a
Likert scale from "strongly agree" to "strongly disagree").

Statistical analysis: A statistical analysis called Cronbach's alpha or the
Kuder-Richardson is used to determine the internal consistency of the items. A
Cronbach's alpha value of 0.60 and above indicates that the test items have
internal consistency.
5. Inter-rater Reliability

Procedure: This procedure is used to determine the consistency of multiple
raters when using rating scales and rubrics to judge performance. The
reliability here refers to the similar or consistent ratings provided by more
than one rater or judge when they use an assessment tool. Inter-rater
reliability is applicable when the assessment requires the use of multiple
raters.

Statistical analysis: A statistical analysis called Kendall's w coefficient of
concordance is used to determine if the ratings provided by multiple raters
agree with each other. A significant Kendall's w value indicates that the
raters concur or agree with each other in their ratings.
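The split-half technique described in the table can be sketched in code. The responses below are hypothetical; the script only illustrates the odd-even split, the Pearson r correlation, and the Spearman-Brown correction named above.

```python
# Split-half reliability sketch. Hypothetical data: each row is one
# examinee's scores on six items (1 = correct, 0 = wrong).
from statistics import mean

responses = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
]

# Odd-even technique: total the odd-numbered and even-numbered items separately.
odd = [sum(r[0::2]) for r in responses]    # items 1, 3, 5
even = [sum(r[1::2]) for r in responses]   # items 2, 4, 6

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

r_half = pearson_r(odd, even)              # correlation of the two half-tests
# Spearman-Brown steps the half-test correlation up to the full test length.
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))  # 0.71 0.83
```

Note that the Spearman-Brown value is always at least as high as the raw half-test correlation, because each half contains only half of the items.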
You will notice in the table that statistical analysis is required to determine
the reliability of a measure. The very basis of statistical analysis to determine
reliability is the use of linear regression.
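For dichotomous (right/wrong) items, the Kuder-Richardson approach mentioned in the table can be sketched as follows. The responses are hypothetical, and the variances use the population form (dividing by n), one common convention for the KR-20 formula.

```python
# Kuder-Richardson Formula 20 (KR-20) sketch for dichotomous (1/0) items.
# Hypothetical data: each row is one examinee, each column one item.
responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]
n = len(responses)       # number of examinees
k = len(responses[0])    # number of items

# p = proportion correct per item; q = 1 - p.
p = [sum(r[i] for r in responses) / n for i in range(k)]
sum_pq = sum(pi * (1 - pi) for pi in p)

# Variance of the total scores (population form: divide by n).
totals = [sum(r) for r in responses]
mean_t = sum(totals) / n
var_t = sum((t - mean_t) ** 2 for t in totals) / n

kr20 = (k / (k - 1)) * (1 - sum_pq / var_t)
print(round(kr20, 2))  # 0.45
```

A value this low would suggest the hypothetical items are not answered consistently; with real data, values of 0.60 and above are the usual target.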
1. Linear regression
Linear regression is demonstrated when you have two variables that are
measured, such as two sets of scores in a test taken at two different times by
the same participants. When the two sets of scores are plotted in a graph (with
an X- and a Y-axis), they tend to form a straight line. The straight line formed
by the two sets of scores can produce a linear regression. When a straight line
is formed, we can say that there is a correlation between the two sets of
scores. This correlation is shown in the graph below, called a scatterplot.
Each point in the scatterplot is a respondent with two scores (one for each
test).
[Scatterplot of the first set of test scores (X) against the second set (Y)]
Monday Test (X)   Tuesday Test (Y)   X²     Y²     XY
6                 12                 36     144    72
7                 10                 49     100    70
8                 13                 64     169    104
(The first three of the ten examinees are shown; the totals substituted below
cover all ten.)

X² – square the value of the X scores (Monday scores)
Y² – square the value of the Y scores (Tuesday scores)
ΣXY – add all the products of X and Y

Substitute the values in the formula:

r = [N(ΣXY) − (ΣX)(ΣY)] / √{[N(ΣX²) − (ΣX)²][N(ΣY²) − (ΣY)²]}

r = [10(1328) − (87)(139)] / √{[10(871) − (87)²][10(2125) − (139)²]}

r = 0.80
The value of a correlation coefficient does not exceed 1.00 or −1.00. A value
of 1.00 or −1.00 indicates a perfect correlation. In tests of reliability,
though, we aim for a high positive correlation to mean that there is
consistency in the way the students answered the tests taken.
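The computation above can be checked with a short script that substitutes the same summary values from the worked example (N = 10, ΣX = 87, ΣY = 139, ΣX² = 871, ΣY² = 2125, ΣXY = 1328) into the Pearson formula:

```python
# Pearson r from the summary values in the worked example above
# (N = 10 examinees, X = Monday test, Y = Tuesday test).
n, sum_x, sum_y = 10, 87, 139
sum_x2, sum_y2, sum_xy = 871, 2125, 1328

numerator = n * sum_xy - sum_x * sum_y
denominator = ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
r = numerator / denominator
print(round(r, 2))  # 0.8
```

The result agrees with the r = 0.80 obtained by hand, a high positive correlation indicating consistent scores across the two administrations.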
3. Difference between a positive and a negative correlation
When the value of the correlation coefficient is positive, it means that the
higher the scores in X, the higher the scores in Y. This is called a positive
correlation. In the case of the two spelling scores, a positive correlation is
obtained. When the value of the correlation coefficient is negative, it means
that the higher the scores in X, the lower the scores in Y, and vice versa.
This is called a negative correlation. When the same test is administered to
the same group of participants, a positive correlation usually indicates
reliability or consistency of the scores.
4. Determining the strength of a correlation

The strength of the correlation also indicates the strength of the reliability
of the test. This is indicated by the value of the correlation coefficient.
The closer the value is to 1.00 or −1.00, the stronger the correlation.
Below is an example of computing internal consistency using Cronbach's alpha.
Five students (A to E) answered a five-item attitude toward teaching scale.
Each student's item scores are totaled, and each total's deviation from the
mean total (16.2) is squared:

Student   Total score   Total − Mean   (Total − Mean)²
A         19            2.8            7.84
B         15            −1.2           1.44
C         16            −0.2           0.04
D         13            −3.2           10.24
E         18            1.8            3.24
Sum       81                           22.8

Item totals (ΣX): 14, 21, 16, 17, 13
Sum of the squared scores per item (ΣX²): 48, 91, 54, 59, 39

Mean of the total scores = 81/5 = 16.2
Variance of the total scores = Σ(Total − Mean)² / (n − 1) = 22.8/4 = 5.7

The variance of each item is obtained in the same way, using
σ² = [ΣX² − (ΣX)²/n] / (n − 1):
Item variances: 2.2, 0.7, 0.7, 0.3, 1.3
Sum of the item variances = 5.2

Cronbach's α = [k/(k − 1)] × [1 − (Σ item variances / variance of totals)]
Cronbach's α = [5/(5 − 1)] × [1 − (5.2/5.7)]
Cronbach's α = 0.10
The internal consistency of the responses in the attitude toward teaching
scale is 0.10, indicating low internal consistency.
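The alpha computation above can be verified in code from the item variances and the total-score variance of the example. Because the item variances were rounded before summing, the exact result is approximately 0.11, which the text reports as 0.10.

```python
# Cronbach's alpha using the values from the worked example above:
# k = 5 items, item variances summing to 5.2, total-score variance 5.7.
k = 5
item_variances = [2.2, 0.7, 0.7, 0.3, 1.3]
total_variance = 5.7

alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(round(alpha, 1))  # 0.1, the low internal consistency reported above
```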
The consistency of ratings can also be obtained using a coefficient of
concordance. The Kendall's w coefficient of concordance is used to test
agreement among raters.
Below is a performance task demonstrated by five students and rated by three
raters. The rubric used a scale of 1 to 4, where 4 is the highest and 1 is the
lowest.
Student   Rater 1   Rater 2   Rater 3   Sum of ratings   Sum − Mean   (Sum − Mean)²
A         (individual ratings not shown)  11             2.6          6.76
B         3         2         3          8               −0.4         0.16
C         3         4         4          11              2.6          6.76
D         3         3         2          8               −0.4         0.16
E         1         1         2          4               −4.4         19.36
Mean of the sums = 42/5 = 8.4                            ΣD² = 33.2

W = 12ΣD² / [m²n(n² − 1)]
W = 12(33.2) / [(3²)(5)(5² − 1)]
W = 398.4 / 1080
W = 0.37
A Kendall's w coefficient value of 0.37 indicates the degree of agreement of
the three raters on the five demonstrations. There is moderate concordance
among the three raters because the value is far from 1.00.
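The computation above can be replicated in a short script. Note that many references compute Kendall's W on rank sums; the sketch below follows the chapter's computation on the raw rating sums. Student A's individual ratings were not shown, so only A's sum (11) is used.

```python
# Kendall's W for the rating example above: m = 3 raters, n = 5 students.
# Each row lists one student's ratings from the three raters.
ratings = {
    "B": [3, 2, 3],
    "C": [3, 4, 4],
    "D": [3, 3, 2],
    "E": [1, 1, 2],
}
# Student A's ratings were not listed individually; the sum 11 is taken
# from the table above.
sums = [11] + [sum(v) for v in ratings.values()]

m, n = 3, 5
mean_sum = sum(sums) / n
ssd = sum((s - mean_sum) ** 2 for s in sums)  # sum of squared deviations

w = 12 * ssd / (m ** 2 * n * (n ** 2 - 1))
print(round(ssd, 1), round(w, 2))  # 33.2 0.37
```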
Convergent Validity: When the components or factors of a test are hypothesized
to have a positive correlation. Correlation is done for the factors of the
test.

Divergent Validity: When the components or factors of a test are hypothesized
to have a negative correlation. An example is to correlate the scores in a
test on intrinsic and extrinsic motivation. Correlation is done for the
factors of the test.
Cases are provided for each type of validity to illustrate how it is
conducted. After reading the cases and references about the different kinds of
validity, partner with a seatmate and answer the following questions. Discuss
your answers. You may use other references and browse the internet.
1. Content Validity
A coordinator in science is checking the science test paper for grade 4. She
asked the grade 4 science teacher to submit the table of specifications containing
the objectives of the lesson and the corresponding items. The coordinator checked
whether each item is aligned with the objectives.
How are the objectives used when creating test items?
How is content validity determined when given the objectives and the items
in a test?
What should be present in a test table of specifications when determining
content validity?
Who checks the content validity of items?
2. Face Validity
The assistant principal browsed the test paper made by the math teacher.
She checked if the contents of the items are about mathematics. She examined
if the instructions are clear. She browsed through the items to check if the
grammar is correct and if the vocabulary is within the students' level of
understanding.
What can be done in order to ensure that the assessment appears effective?
What practices are done in conducting face validity?
Why is face validity the weakest form of validity?
3. Predictive Validity
The school admissions office developed an entrance examination. The school
officials wanted to determine if the results of the entrance examination are
accurate in identifying good students. They took the grades of the students
accepted for the first quarter. They correlated the entrance exam results and
the first quarter grades. They found significant and positive correlations
between the entrance examination scores and grades. The entrance examination
results predicted the grades of students after the first quarter. Thus, there
was predictive validity.
Why are two measures needed in predictive validity?
What is the assumed connection between these two measures?
How can we determine if a measure has predictive validity?
What statistical analysis is done to determine predictive validity?
How are the test results of predictive validity interpreted?
4. Concurrent Validity
A school guidance counselor administered a math achievement test to grade
6 students. She also has a copy of the students' grades in math. She wanted to
verify if the math grades of the students are measuring the same competencies
as the math achievement test. The school counselor correlated the math
achievement scores and math grades to determine if they are measuring the
same competencies.
What needs to be available when conducting concurrent validity?
At least how many tests are needed for conducting concurrent validity?
What statistical analysis can be used to establish concurrent validity?
How are the results of a correlation coefficient interpreted for concurrent
validity?
5. Construct Validity
A science test was made by a grade 10 teacher composed of four domains:
matter, living things, force and motion, and earth and space. There are 10
items under each domain. The teacher wanted to determine if the 10 items made
under each domain really belonged to that domain. The teacher consulted an
expert in test measurement. They conducted a procedure called factor analysis.
Factor analysis is a statistical procedure done to determine if the items
written will load under the domain they belong to.
What type of test requires construct validity?
What should the test have in order to verify its constructs?
What are constructs and factors in a test?
How are these factors verified if they are appropriate for the test?
What results come out in construct validity?
How are the results in construct validity interpreted?
The construct validity of a measure is reported in journal articles. The
following are guide questions used when searching for the construct validity
of a measure from reports:
What was the purpose of construct validity?
What type of test was used?
What are the dimensions or factors that were studied using construct validity?
What procedure was used to establish the construct validity?
What statistics was used for the construct validity?
What were the results of the test's construct validity?
6. Convergent Validity
A math teacher developed a test to be administered at the end of the school
year, which measures number sense, patterns and algebra, measurement,
geometry, and statistics. It is assumed by the math teacher that students'
competencies in number sense improve their capacity to learn patterns and
algebra and other concepts. After administering the test, the scores were
separated for each area, and these five domains were intercorrelated using
Pearson r. The positive correlation between number sense and patterns and
algebra indicates that, when number sense scores increase, the patterns and
algebra scores also increase. This shows that students' learning of number
sense scaffolds patterns and algebra competencies.
What should a test have in order to conduct convergent validity?
What is done with the domains in a test on convergent validity?
What analysis is used to determine convergent validity?
How are the results in convergent validity interpreted?
7. Divergent Validity
An English teacher taught metacognitive awareness strategy to comprehend
a paragraph for grade 11 students. She wanted to determine if the performance
of her students in reading comprehension would reflect well in the reading
comprehension test. She administered the same reading comprehension test to
another class which was not taught the metacognitive awareness strategy. She
compared the results using a t-test for independent samples and found that the
class that was taught metacognitive awareness strategy performed significantly
better than the other group. The test has divergent validity.
What conditions are needed to conduct divergent validity?
What assumption is being proved in divergent validity?
What statistical analysis can be used to establish divergent validity?
How are the results of divergent validity interpreted?
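The comparison described in the divergent-validity case uses a t-test for independent samples. Below is a sketch with hypothetical reading scores, using the pooled-variance form (which assumes roughly equal group variances); the class names and data are illustrative only.

```python
# Independent-samples t statistic sketch for the divergent-validity case:
# comparing a class taught the metacognitive strategy with one that was not.
# All scores below are hypothetical.
from statistics import mean, variance

taught = [18, 20, 17, 22, 19, 21]       # class taught the strategy
not_taught = [14, 16, 13, 17, 15, 15]   # comparison class

n1, n2 = len(taught), len(not_taught)
# Pooled variance combines the two sample variances, weighted by df.
sp2 = ((n1 - 1) * variance(taught) + (n2 - 1) * variance(not_taught)) / (n1 + n2 - 2)
t = (mean(taught) - mean(not_taught)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5
df = n1 + n2 - 2
print(round(t, 2), df)  # 4.7 10
```

The obtained t is compared against a critical value for df = 10; a significant difference in the expected direction supports the claim that the groups differ.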
How to determine if an item is easy or difficult?
An item is difficult if the majority of students are unable to provide the
correct answer. The item is easy if the majority of the students are able to
answer correctly. An item can discriminate if the examinees who score high in
the test can answer more items correctly than examinees who got low scores.
Below is a dataset of five items on the addition and subtraction of integers.
Follow the procedure to determine the difficulty and discrimination of each item.
1. Get the total score of each student and arrange the scores from highest to
lowest. For the ten students, the totals arranged from highest to lowest are:

Student 2 – 4
Student 5 – 4
Student 9 – 4
Student 1 – 3
Student 6 – 3
Student 10 – 3
Student 3 – 2
Student 7 – 2
Student 8 – 2
Student 4 – 1

2. Take the upper 27% and the lower 27% of the group. Here, the upper 27%
group consists of the three highest scorers (Students 2, 5, and 9), and the
lower 27% group consists of the three lowest scorers (Students 7, 8, and 4).
3. Obtain the proportion correct for each item. This is computed for the upper
27% group and the lower 27% group. This is done by summing the correct
answers per item and dividing the sum by the number of students in the group.
Upper 27% group (Students 2, 5, and 9):

                           Item 1   Item 2   Item 3   Item 4   Item 5
Number correct               2        2        3        2        3
Proportion of the
high group (PH)            0.67     0.67     1.00     0.67     1.00

Lower 27% group (Students 7, 8, and 4):

                           Item 1   Item 2   Item 3   Item 4   Item 5
Number correct               0        1        2        1        1
Proportion of the
low group (PL)             0.00     0.33     0.67     0.33     0.33
4. The item difficulty is obtained using the following formula:

Item difficulty = (PH + PL) / 2
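The difficulty formula above, together with the commonly used discrimination index D = PH − PL (the index itself is a standard one, not stated explicitly in this chapter), can be applied to the proportions computed in step 3:

```python
# Item difficulty and a common discrimination index (D = PH - PL) for the
# five items, using the proportions from the worked example above.
ph = [0.67, 0.67, 1.00, 0.67, 1.00]   # proportion correct, upper 27% group
pl = [0.00, 0.33, 0.67, 0.33, 0.33]   # proportion correct, lower 27% group

difficulty = [(h + l) / 2 for h, l in zip(ph, pl)]
discrimination = [h - l for h, l in zip(ph, pl)]

for i, (dif, dis) in enumerate(zip(difficulty, discrimination), start=1):
    print(f"Item {i}: difficulty = {dif:.2f}, discrimination = {dis:.2f}")
```

Higher difficulty values mean the item is easier (more examinees got it right), while higher discrimination values mean high scorers answered the item correctly more often than low scorers.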
The difficulty index is then interpreted using an interpretation guide for
difficulty values.
DEVELOP
A. Indicate the type of reliability applicable for each case. Write the type of
reliability on the space before the number.