Test Validity and Reliability Methods

Lesson 6 focuses on establishing the validity and reliability of tests through statistical analysis and procedures. Participants will learn to determine if tests and their items are valid and reliable, and will engage in specific performance tasks to demonstrate their understanding. Key concepts include methods of reliability testing, factors affecting reliability, and the use of statistical measures such as Pearson's r and Cronbach's alpha.

Lesson 6

Establishing Test Validity and Reliability

Suggested Timeframe: 6 hours


How do we establish the validity and reliability of tests?

UNDERSTAND
Desired Significant Learning Outcomes
In this lesson, you are expected to:
use procedures and statistical analysis to establish test validity and
reliability;
decide whether a test is valid or reliable; and
decide which test items are easy and difficult.

Significant Culminating Performance Task and Success Indicators


At the end of the lesson, you should be able to demonstrate your knowledge and skills in determining whether the test and its items are valid and reliable.
You are considered successful in this culminating performance task if you have satisfied at least the following indicators of success:

Specific Performance Tasks and Success Indicators

1. Use appropriate procedures in determining test validity and reliability.
   Success indicator: Provided the detailed steps, decision, and rationale in the use of appropriate validity and reliability measures.
2. Show the procedure on how to establish test validity and reliability.
   Success indicator: Provided the detailed procedure from the preparation of the instrument, pretesting, and analysis in determining the test's validity and reliability.
3. Provide accurate results in the analysis of item difficulty and reliability.
   Success indicator: Made the appropriate computation, use of software, reporting of results, and interpretation of the results for the tests of validity and reliability.
Prerequisite of This Lesson
To be able to successfully perform this culminating performance task, you should have prepared a test following the proper procedure, with clear learning targets (objectives), a table of specifications, and pretest data per item. In the previous lesson, you were provided with guidelines in constructing tests following different formats. You have also learned that assessment becomes valid when the test items represent a good set of objectives, and this should be reflected in the table of specifications. The learning targets will help you construct appropriate test items.

PREPARE
In order to establish the validity and reliability of an assessment tool, you need to know the different ways of establishing test validity and reliability. You are expected to read this before you can analyze your items.

What is test reliability?


Reliability is the consistency of responses to a measure under three conditions: (1) when retested on the same person; (2) when retested on the same measure; and (3) when responses are similar across items that measure the same characteristic. In the first condition, a consistent response is expected when the test is given to the same participants. In the second condition, reliability is attained if the responses to a test are consistent with those on the same test, its equivalent, or another test that measures the same characteristic when administered at a different time. In the third condition, there is reliability when the person responds in the same way or consistently across items that measure the same characteristic.
There are different factors that affect the reliability of a measure. The reliability of a measure can be high or low, depending on the following factors:
1. The number of items in a test - The more items a test has, the higher the likelihood of reliability. The probability of obtaining consistent scores is high because of the large pool of items.
2. Individual differences of participants - Every participant possesses characteristics that affect their performance in a test, such as fatigue, concentration, innate ability, perseverance, and motivation. These individual factors change over time and affect the consistency of the answers in a test.
3. External environment - The external environment may include room temperature, noise level, depth of instruction, exposure to materials, and quality of instruction, which could affect changes in the responses of examinees in a test.

What are the different ways to establish test reliability?


There are different ways of determining the reliability of a test. The specific kind of reliability will depend on the (1) variable you are measuring, (2) type of test, and (3) number of versions of the test.
The different types of reliability and how each is done are indicated below. Notice that for each method, statistical analysis is needed to determine the test reliability.
1. Test-retest
How it is done: You have a test, and you need to administer it at one time to a group of examinees. Administer it again at another time to the "same group" of examinees. There is a time interval of not more than 6 months between the first and second administration of tests that measure stable characteristics, such as standardized aptitude tests. The posttest can be given with a minimum time interval of 30 minutes. The responses in the test should more or less be the same across the two points in time. Test-retest is applicable for tests that measure stable variables, such as aptitude and psychomotor measures (e.g., typing test, tasks in physical education).
What statistics is used: Correlate the test scores from the first and the second administration. A significant and positive correlation indicates that the test has temporal stability over time. Correlation refers to a statistical procedure where a linear relationship is expected between two variables. You may use the Pearson Product Moment Correlation or Pearson r because test data are usually on an interval scale (refer to a statistics book for Pearson).

2. Parallel Forms
How it is done: There are two versions of a test. The items need to measure exactly the same skill. Each test version is called a "form." Administer one form at one time and the other form at another time to the "same" group of participants. The responses on the two forms should be more or less the same. Parallel forms are applicable if there are two versions of the test. This is usually done when the test is repeatedly used for different groups, such as entrance examinations and licensure examinations. Different versions of the test are given to different groups of examinees.
What statistics is used: Correlate the test results for the first form and the second form. A significant and positive correlation coefficient is expected, indicating that the responses in the two forms are the same or consistent. Pearson r is usually used for this analysis.

3. Split-Half
How it is done: Administer a test to a group of examinees. The items need to be split into halves, usually using the odd-even technique. In this technique, get the sum of the points in the odd-numbered items and correlate it with the sum of points in the even-numbered items. Each examinee will have two scores coming from the same test. The scores on each set should be close or consistent. Split-half is applicable when the test has a large number of items.
What statistics is used: Correlate the two sets of scores using Pearson r. After the correlation, use another formula called the Spearman-Brown coefficient. The correlation coefficients obtained using Pearson r and Spearman-Brown should be significant and positive to mean that the test has internal consistency reliability.

4. Test of Internal Consistency Using Kuder-Richardson and Cronbach's Alpha
How it is done: This procedure involves determining if the scores for each item are consistently answered by the examinees. After administering the test to a group of examinees, it is necessary to determine and record the scores for each item. The idea here is to see if the responses per item are consistent with each other. This technique works well when the assessment tool has a large number of items. It is also applicable for scales and inventories (e.g., a Likert scale from "strongly agree" to "strongly disagree").
What statistics is used: A statistical analysis called Cronbach's alpha or the Kuder-Richardson is used to determine the internal consistency of the items. A Cronbach's alpha value of 0.60 and above indicates that the test items have internal consistency.

5. Inter-rater Reliability
How it is done: This procedure is used to determine the consistency of multiple raters when using rating scales and rubrics to judge performance. The reliability here refers to the similar or consistent ratings provided by more than one rater or judge when they use an assessment tool. Inter-rater reliability is applicable when the assessment requires the use of multiple raters.
What statistics is used: A statistical analysis called the Kendall's w coefficient of concordance is used to determine if the ratings provided by multiple raters agree with each other. A significant Kendall's w value indicates that the raters concur or agree with each other in their ratings.
You will notice in the descriptions above that statistical analysis is required to determine the reliability of a measure. The very basis of the statistical analysis used to determine reliability is linear regression.
1. Linear regression
Linear regression is demonstrated when you have two variables that are measured, such as two sets of scores on a test taken at two different times by the same participants. When the two sets of scores are plotted in a graph (with an X- and a Y-axis), they tend to form a straight line. The straight line formed by the two sets of scores can produce a linear regression. When a straight line is formed, we can say that there is a correlation between the two sets of scores. This correlation is shown in the scatterplot below. Each point in the scatterplot is a respondent with two scores (one for each test).

[Figure: Scatterplot of Score 1 (x-axis) against Score 2 (y-axis), with a fitted regression line Score 2 = 4.8403 + 1.0403 * Score 1]

2. Computation of the Pearson r correlation

The index of the linear regression is called a correlation coefficient. When the points in a scatterplot tend to fall along the linear line, the correlation is said to be strong. When the direction of the scatterplot is directly proportional, the correlation coefficient will have a positive value. If the line is inverse, the correlation coefficient will have a negative value. The statistical analysis used to determine the correlation coefficient is called the Pearson r. How the Pearson r is obtained is illustrated below.
Suppose that a teacher gave a spelling test of two-syllable words with 20 items on Monday and Tuesday. The teacher wanted to determine the reliability of the two sets of scores by computing the Pearson r.
Formula:

r = [N(ΣXY) - (ΣX)(ΣY)] / √{[N(ΣX²) - (ΣX)²][N(ΣY²) - (ΣY)²]}

Monday Test (X) | Tuesday Test (Y) | X² | Y² | XY
10 | 20 | 100 | 400 | 200
9 | 15 | 81 | 225 | 135
6 | 12 | 36 | 144 | 72
10 | 18 | 100 | 324 | 180
12 | 19 | 144 | 361 | 228
4 | 8 | 16 | 64 | 32
5 | 7 | 25 | 49 | 35
7 | 10 | 49 | 100 | 70
16 | 17 | 256 | 289 | 272
8 | 13 | 64 | 169 | 104
ΣX = 87 | ΣY = 139 | ΣX² = 871 | ΣY² = 2125 | ΣXY = 1328

ΣX - Add all the X scores (Monday scores)
ΣY - Add all the Y scores (Tuesday scores)
X² - Square the value of each X score (Monday scores)
Y² - Square the value of each Y score (Tuesday scores)
XY - Multiply each X score by its Y score
ΣX² - Add all the squared values of X
ΣY² - Add all the squared values of Y
ΣXY - Add all the products of X and Y
Substitute the values in the formula:

r = [10(1328) - (87)(139)] / √{[10(871) - (87)²][10(2125) - (139)²]}
r = 0.80
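The same substitution can be checked with a short script. This sketch recomputes r from the Monday and Tuesday scores in the table above:

```python
# Sketch: recomputing Pearson r for the 10 pairs of spelling scores above.
x = [10, 9, 6, 10, 12, 4, 5, 7, 16, 8]      # Monday scores (X)
y = [20, 15, 12, 18, 19, 8, 7, 10, 17, 13]  # Tuesday scores (Y)
n = len(x)
sum_x, sum_y = sum(x), sum(y)                # 87, 139
sum_x2 = sum(v * v for v in x)               # 871
sum_y2 = sum(v * v for v in y)               # 2125
sum_xy = sum(a * b for a, b in zip(x, y))    # 1328
r = (n * sum_xy - sum_x * sum_y) / (
    ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
)
print(round(r, 2))  # 0.8
```
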

The value of a correlation coefficient does not exceed 1.00 or -1.00. A value of 1.00 or -1.00 indicates a perfect correlation. In tests of reliability, though, we aim for a high positive correlation, which means that there is consistency in the way the students answered the tests taken.
3. Difference between a positive and a negative correlation
When the value of the correlation coefficient is positive, it means that the higher the scores in X, the higher the scores in Y. This is called a positive correlation. In the case of the two spelling scores, a positive correlation was obtained. When the value of the correlation coefficient is negative, it means that the higher the scores in X, the lower the scores in Y, and vice versa. This is called a negative correlation. When the same test is administered to the same group of participants, a positive correlation usually indicates reliability or consistency of the scores.
4. Determining the strength of a correlation
The strength of the correlation also indicates the strength of the reliability of the test. This is indicated by the value of the correlation coefficient. The closer the value is to 1.00 or -1.00, the stronger the correlation. Below is the guide:

0.80-1.00  Very strong relationship
0.60-0.79  Strong relationship
0.40-0.59  Substantial/marked relationship
0.20-0.39  Weak relationship
0.00-0.19  Negligible relationship
5. Determining the significance of the correlation
The correlation obtained between two variables could be due to chance. In order to determine if the correlation is free of certain errors, it is tested for significance. When a correlation is significant, it means that the probability of the two variables being related is free of certain errors.

In order to determine if a correlation coefficient value is significant, it is compared with an expected probability of correlation coefficient values called a critical value. When the computed value is greater than the critical value, it means that the information obtained has more than a 95% chance of being correlated and is significant.
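As an illustrative sketch (not part of the original worked example), the significance of the r = 0.80 obtained earlier can be checked by converting it to a t-statistic and comparing against a tabled critical value:

```python
import math

# Sketch: significance test for a correlation via the t-statistic
# t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2.
r, n = 0.80, 10
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
critical_t = 2.306  # two-tailed critical t for df = 8 at the .05 level (from a t-table)
print(round(t, 2), t > critical_t)  # 3.77 True
```

Since the computed t exceeds the critical value, the correlation of 0.80 from 10 examinees is significant at the .05 level.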
Another statistical analysis mentioned to determine the internal consistency of a test is Cronbach's alpha. Follow the procedure to determine the internal consistency.
Suppose that five students answered a checklist about their hygiene using a scale of 1 to 5, wherein the following are the corresponding scores:

5 - always, 4 - often, 3 - sometimes, 2 - rarely, 1 - never

The checklist has five items. The teacher wanted to determine if the items have internal consistency.

Student | Total for each case (X) | X - Mean | (X - Mean)²
A | 19 | 2.8 | 7.84
B | 15 | -1.2 | 1.44
C | 16 | -0.2 | 0.04
D | 13 | -3.2 | 10.24
E | 18 | 1.8 | 3.24
Mean of X = 16.2; Σ(X - Mean)² = 22.8

Total for each item (ΣX per item): 14, 21, 16, 17, 13
Variance of each item (σ²): 2.2, 0.7, 0.7, 0.3, 1.3; Σσᵢ² = 5.2
Variance of the totals: σₜ² = 22.8 / (5 - 1) = 5.7

Cronbach's α = [k / (k - 1)] × [1 - (Σσᵢ² / σₜ²)]
Cronbach's α = [5 / (5 - 1)] × [1 - (5.2 / 5.7)]
Cronbach's α = 0.10
The internal consistency of the responses to the hygiene checklist is 0.10, indicating low internal consistency.
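The alpha above can be reproduced from the summary values in the worked example. This sketch uses the item variances and the variance of the student totals; the unrounded result is about 0.11, which the text reports as 0.10:

```python
# Sketch: Cronbach's alpha from the hygiene-checklist summary values.
k = 5                                        # number of items
item_variances = [2.2, 0.7, 0.7, 0.3, 1.3]   # per-item variances (n - 1 denominator)
total_variance = 22.8 / (5 - 1)              # variance of the five student totals = 5.7
alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(round(alpha, 2))
```
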
The consistency of ratings can also be obtained using a coefficient of concordance. The Kendall's w coefficient of concordance is used to test agreement among raters.
Below is a performance task demonstrated by five students and rated by three raters. The rubric used a scale of 1 to 4, wherein 4 is the highest and 1 is the lowest.

Demonstration | Rater 1 | Rater 2 | Rater 3 | Sum of Ratings | D | D²
A | 4 | 4 | 3 | 11 | 2.6 | 6.76
B | 3 | 2 | 3 | 8 | -0.4 | 0.16
C | 3 | 4 | 4 | 11 | 2.6 | 6.76
D | 3 | 3 | 2 | 8 | -0.4 | 0.16
E | 1 | 1 | 2 | 4 | -4.4 | 19.36
Mean of the Sum of Ratings = 8.4; ΣD² = 33.2
The scores given by the three raters are first summed for each demonstration. The mean of the Sums of Ratings is then obtained (8.4). The mean is subtracted from each Sum of Ratings (D). Each difference is squared (D²), then the sum of the squared differences is computed (ΣD² = 33.2). The mean and the sum of squared differences are substituted in the Kendall's w formula. In the formula, m is the number of raters and N is the number of demonstrations.

W = 12ΣD² / [m²N(N² - 1)]
W = 12(33.2) / [3²(5)(5² - 1)]
W = 0.37

A Kendall's w coefficient value of 0.37 indicates the degree of agreement of the three raters on the five demonstrations. There is moderate concordance among the three raters because the value is far from 1.00.

What is test validity?


A measure is valid when it measures what it is supposed to measure. If a
quarterly exam is valid, then the contents should directly measure the objectives
of the curriculum. If a scale that measures personality is composed of five factors,
then the scores on the five factors should have items that are highly correlated.
If an entrance exam is valid, it should predict students' grades after the first
semester.

What are the different ways to establish test validity?


There are different ways to establish test validity.

Content Validity
Definition: When the items represent the domain being measured.
Procedure: The items are compared with the objectives of the program. The items need to measure directly the objectives (for achievement) or the definition (for scales). A reviewer conducts the checking.

Face Validity
Definition: When the test is presented well, free of errors, and administered well.
Procedure: The test items and layout are reviewed and tried out on a small group of respondents. A manual for administration can be made as a guide for the test administrator.

Predictive Validity
Definition: A measure should predict a future criterion. An example is an entrance exam predicting the grades of the students after the first semester.
Procedure: A correlation coefficient is obtained where the X-variable is used as the predictor and the Y-variable as the criterion.

Construct Validity
Definition: The components or factors of the test should contain items that are strongly correlated.
Procedure: The Pearson r can be used to correlate the items for each factor. However, there is a technique called factor analysis to determine which items are highly correlated to form a factor.

Concurrent Validity
Definition: When two or more measures that assess the same characteristic are present for each examinee.
Procedure: The scores on the measures should be correlated.

Convergent Validity
Definition: When the components or factors of a test are hypothesized to have a positive correlation.
Procedure: Correlation is done for the factors of the test.

Divergent Validity
Definition: When the components or factors of a test are hypothesized to have a negative correlation. An example is to correlate the scores in a test on intrinsic and extrinsic motivation.
Procedure: Correlation is done for the factors of the test.
Cases are provided for each type of validity to illustrate how it is conducted. After reading the cases and references about the different kinds of validity, partner with a seatmate and answer the following questions. Discuss your answers. You may use other references and browse the internet.

1. Content Validity
A coordinator in science is checking the science test paper for grade 4. She
asked the grade 4 science teacher to submit the table of specifications containing
the objectives of the lesson and the corresponding items. The coordinator checked
whether each item is aligned with the objectives.
How are the objectives used when creating test items?
How is content validity determined when given the objectives and the items
in a test?
What should be present in a test table of specifications when determining
content validity?
Who checks the content validity of items?

2. Face Validity
The assistant principal browsed the test paper made by the math teacher. She checked if the contents of the items are about mathematics. She examined if the instructions are clear. She browsed through the items to see if the grammar is correct and if the vocabulary is within the students' level of understanding.
What can be done in order to ensure that the assessment appears effective?
What practices are done in conducting face validity?
Why is face validity the weakest form of validity?
3. Predictive Validity
The school admissions office developed an entrance examination. The officials wanted to determine if the results of the entrance examination are accurate in identifying good students. They took the grades of the students accepted for the first quarter. They correlated the entrance exam results and the first quarter grades. They found significant and positive correlations between the entrance examination scores and grades. The entrance examination results predicted the grades of students after the first quarter. Thus, there was predictive validity.
Why are two measures needed in predictive validity?
What is the assumed connection between these two measures?
How can we determine if a measure has predictive validity?
What statistical analysis is done to determine predictive validity?
How are the test results of predictive validity interpreted?
4. Concurrent Validity
A school guidance counselor administered a math achievement test to grade 6 students. She also has a copy of the students' grades in math. She wanted to verify if the math grades of the students measure the same competencies as the math achievement test. The school counselor correlated the math achievement scores and math grades to determine if they are measuring the same competencies.
What needs to be available when conducting concurrent validity?
At least how many tests are needed for conducting concurrent validity?
What statistical analysis can be used to establish concurrent validity?
How are the results of a correlation coefficient interpreted for concurrent validity?
5. Construct Validity
A science test made by a grade 10 teacher is composed of four domains: matter, living things, force and motion, and earth and space. There are 10 items under each domain. The teacher wanted to determine if the 10 items made under each domain really belonged to that domain. The teacher consulted an expert in test measurement. They conducted a procedure called factor analysis. Factor analysis is a statistical procedure done to determine if the items written will load under the domain they belong to.
What type of test requires construct validity?
What should the test have in order to verify its constructs?
What are constructs and factors in a test?
How are these factors verified if they are appropriate for the test?
What results come out in construct validity?

How are the results in construct validity interpreted?
The construct validity of a measure is reported in journal articles. The following are guide questions used when searching for the construct validity of a measure from reports:
What was the purpose of construct validity?
What type of test was used?
What are the dimensions or factors that were studied using construct validity?
What procedure was used to establish the construct validity?
What statistics was used for the construct validity?
What were the results of the test's construct validity?

6. Convergent Validity
A math teacher developed a test to be administered at the end of the school year, which measures number sense, patterns and algebra, measurement, geometry, and statistics. It is assumed by the math teacher that students' competencies in number sense improve their capacity to learn patterns and algebra and other concepts. After administering the test, the scores were separated for each area, and these five domains were intercorrelated using Pearson r. The positive correlation between number sense and patterns and algebra indicates that, when number sense scores increase, the patterns and algebra scores also increase. This shows that student learning of number sense scaffolds the patterns and algebra competencies.
What should a test have in order to conduct convergent validity?
What is done with the domains in a test on convergent validity?
What analysis is used to determine convergent validity?
How are the results in convergent validity interpreted?

7. Divergent Validity
An English teacher taught metacognitive awareness strategy to comprehend
a paragraph for grade 11 students. She wanted to determine if the performance
of her students in reading comprehension would reflect well in the reading
comprehension test. She administered the same reading comprehension test to
another class which was not taught the metacognitive awareness strategy. She
compared the results using a t-test for independent samples and found that the
class that was taught metacognitive awareness strategy performed significantly
better than the other group. The test has divergent validity.
What conditions are needed to conduct divergent validity?
What assumption is being proved in divergent validity?
What statistical analysis can be used to establish divergent validity?
How are the results of divergent validity interpreted?

How to determine if an item is easy or difficult?
An item is difficult if the majority of students are unable to provide the correct answer. The item is easy if the majority of the students are able to answer it correctly. An item can discriminate if the examinees who score high in the test can answer more items correctly than the examinees who got low scores.
Below is a dataset of five items on the addition and subtraction of integers. Follow the procedure to determine the difficulty and discrimination of each item.

1. Get the total score of each student and arrange the scores from highest to lowest.

Student | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Total score
Student 1 | 0 | 0 | 1 | 1 | 1 | 3
Student 2 | 1 | 1 | 1 | 0 | 1 | 4
Student 3 | 0 | 0 | 1 | 1 | 0 | 2
Student 4 | 0 | 0 | 0 | 1 | 0 | 1
Student 5 | 0 | 1 | 1 | 1 | 1 | 4
Student 6 | 1 | 1 | 0 | - | - | 3
Student 7 | 0 | 0 | 1 | 0 | 1 | 2
Student 8 | 0 | 1 | 1 | 0 | 0 | 2
Student 9 | 1 | 0 | 1 | 1 | 1 | 4
Student 10 | 1 | 0 | 1 | - | - | 3

2. Obtain the upper and lower 27% of the group.

Multiply 0.27 by the total number of students, and you will get a value of 2.7. The rounded whole number value is 3.0. Get the top three students and the bottom three students based on their total scores. The top three students are Students 2, 5, and 9. The bottom three students are Students 7, 8, and 4. The rest of the students are not included in the item analysis.
Student | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Total score
Student 2 | 1 | 1 | 1 | 0 | 1 | 4
Student 5 | 0 | 1 | 1 | 1 | 1 | 4
Student 9 | 1 | 0 | 1 | 1 | 1 | 4
Student 1 | 0 | 0 | 1 | 1 | 1 | 3
Student 6 | 1 | 1 | 0 | - | - | 3
Student 10 | 1 | 0 | 1 | - | - | 3
Student 3 | 0 | 0 | 1 | 1 | 0 | 2
Student 7 | 0 | 0 | 1 | 0 | 1 | 2
Student 8 | 0 | 1 | 1 | 0 | 0 | 2
Student 4 | 0 | 0 | 0 | 1 | 0 | 1
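The selection in step 2 can be sketched as follows (ties at the cutoff score are broken arbitrarily here, just as the text keeps Students 7, 8, and 4 over Student 3):

```python
# Sketch: picking the upper and lower 27% groups from the ten total scores.
totals = {1: 3, 2: 4, 3: 2, 4: 1, 5: 4, 6: 3, 7: 2, 8: 2, 9: 4, 10: 3}
k = round(0.27 * len(totals))                        # 0.27 * 10 = 2.7 -> 3 per group
ranked = sorted(totals, key=totals.get, reverse=True)
upper, lower = ranked[:k], ranked[-k:]
print(sorted(upper), sorted(lower))  # [2, 5, 9] [4, 7, 8]
```
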

3. Obtain the proportion correct for each item. This is computed for the upper 27% group and the lower 27% group. This is done by summing the correct answers per item and dividing the sum by the number of students in the group.

Student | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Total score
Student 2 | 1 | 1 | 1 | 0 | 1 | 4
Student 5 | 0 | 1 | 1 | 1 | 1 | 4
Student 9 | 1 | 0 | 1 | 1 | 1 | 4
Total | 2 | 2 | 3 | 2 | 3 |
Proportion of the high group (PH) | 0.67 | 0.67 | 1.00 | 0.67 | 1.00 |

Student 7 | 0 | 0 | 1 | 0 | 1 | 2
Student 8 | 0 | 1 | 1 | 0 | 0 | 2
Student 4 | 0 | 0 | 0 | 1 | 0 | 1
Total | 0 | 1 | 2 | 1 | 1 |
Proportion of the low group (PL) | 0.00 | 0.33 | 0.67 | 0.33 | 0.33 |

4. The item difficulty is obtained using the following formula:

Item difficulty = (PH + PL) / 2

The difficulty is interpreted using the table:

Difficulty Index | Remark
0.76 or higher | Easy item
0.25 to 0.75 | Average item
0.24 or lower | Difficult item

Computation:

 | Item 1 | Item 2 | Item 3 | Item 4 | Item 5
Index of difficulty | (0.67 + 0.00)/2 = 0.33 | (0.67 + 0.33)/2 = 0.50 | (1.00 + 0.67)/2 = 0.83 | (0.67 + 0.33)/2 = 0.50 | (1.00 + 0.33)/2 = 0.67
Item difficulty | Average | Average | Easy | Average | Average
5. The index of discrimination is obtained using the formula:

Item discrimination = PH - PL

The value is interpreted using the table:

Index of discrimination | Remark
0.40 and above | Very good item
0.30-0.39 | Good item
0.20-0.29 | Reasonably good item
0.10-0.19 | Marginal item
Below 0.10 | Poor item

 | Item 1 | Item 2 | Item 3 | Item 4 | Item 5
Discrimination index | 0.67 - 0.00 = 0.67 | 0.67 - 0.33 = 0.33 | 1.00 - 0.67 = 0.33 | 0.67 - 0.33 = 0.33 | 1.00 - 0.33 = 0.67
Discrimination | Very good item | Good item | Good item | Good item | Very good item
When developing a teacher-made test, it is good to have items that are easy, average, and difficult, with positive discrimination indices. If you are developing a standardized test, the rule is more stringent, as it aims for average items (neither too easy nor too difficult) whose discrimination index is at least 0.30.
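The whole item analysis above can be sketched in a few lines. The exact thirds (2/3, 1/3) stand in for the rounded 0.67 and 0.33, and the labels follow the difficulty interpretation table:

```python
# Sketch: item difficulty and discrimination from the upper/lower 27% proportions.
ph = [2/3, 2/3, 3/3, 2/3, 3/3]  # proportion correct, upper group (0.67, 0.67, 1.00, ...)
pl = [0/3, 1/3, 2/3, 1/3, 1/3]  # proportion correct, lower group

def difficulty_label(p):
    # 0.76 or higher: easy; 0.25-0.75: average; 0.24 or lower: difficult
    return "Easy" if p >= 0.76 else ("Average" if p >= 0.25 else "Difficult")

for i, (h, l) in enumerate(zip(ph, pl), start=1):
    diff = (h + l) / 2   # item difficulty
    disc = h - l         # item discrimination
    print(f"Item {i}: difficulty={diff:.2f} ({difficulty_label(diff)}), "
          f"discrimination={disc:.2f}")
```
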

DEVELOP
A. Indicate the type of reliability applicable for each case. Write the type of reliability on the space before the number.

_____ 1. Mr. Perez conducted a survey of his students to determine their study habits. Each item is answered using a five-point scale (always, often, sometimes, rarely, never). He wanted to determine if the responses for each item are consistent. What reliability technique is recommended?
_____ 2. A teacher administered a spelling test to her students. After a day, another spelling test was given with the same length and stress of words. What reliability can be used for the two spelling tests?
_____ 3. A PE teacher requested two judges to rate the dance performance of her students in physical education. What reliability can be used to determine the reliability of the judgments?
_____ 4. An English teacher administered a 20-item test to determine students' use of a verb given a subject. The scores were divided into items 1 to 10 and items 11 to 20. The teacher correlated the two sets of scores that form the same test. What reliability is done here?
_____ 5. A computer teacher gave a set of typing tests on Wednesday and gave the same set the following week. The teacher wanted to know if the students' typing skills are consistent. What reliability can be used?
