Testing & Evaluation
What to Discuss
• Definition of a test
• Types of tests
• Characteristics of a good test (Validity,
Reliability & Practicality)
Test, Assessment & Evaluation
(Diagram: three nested circles showing TEST within ASSESSMENT, within EVALUATION.)
Definition of a test
A test is a systematic procedure for measuring an
individual’s behavior (Brown, 1991). This
definition implies that a test has to be developed
following specific guidelines. A test is a formal and
systematic way of gathering information about
learners’ behavior, usually through a paper-and-pencil
procedure (Airasian, 1989).
Definition of a test
“A method of measuring a person’s ability,
knowledge or performance in a given domain.”
(Brown, 2004)
“Language tests are simply instruments or
procedures for gathering particular kinds of
information, typically information having to do
with students’ language abilities.”
(Norris, 2003)
Assessment
Assessment is the process of identifying, gathering and
interpreting information about students’ learning. The
central purpose of assessment is to provide information on
student achievement and progress and to set the direction for
ongoing teaching and learning.
It seeks to improve the quality of student learning, not to
provide evidence for evaluating or grading students. It
provides faculty with feedback about their effectiveness as
teachers, and it gives students a measure of their progress
as learners. It is created, administered, and analysed by
teachers in the classroom.
Evaluation
Evaluation is decision making about student performance and
about appropriate teaching strategies (Woolfolk,
2005, p. 504).
Differences between testing and assessment
Testing
• Tests are prepared administrative procedures that occur at identifiable times in a curriculum.
• When tested, learners know that their performance is being measured and evaluated.
• When tested, learners muster all their faculties to offer peak performance.
• Tests are a subset of assessment: they are only one among many procedures and tasks that teachers can use to assess students.
• Tests are usually time-constrained (usually spanning a class period or at most several hours) and draw on a limited sample of behaviour.

Assessment
• Assessment is an ongoing process that encompasses a much wider domain.
• A good teacher never ceases to assess students, whether those assessments are incidental or intended.
• Whenever a student responds to a question, offers a comment, or tries out a new word or structure, the teacher subconsciously makes an assessment of the student’s performance.
• Assessment includes testing: it is more extended and includes many more components.
Types of tests
Numerous types of tests are used in school. There
are different ways of categorizing tests, namely:
ease of quantification of response, mode of
preparation, mode of administration, test
constructor, mode of interpreting results, and
nature of response (Manarang & Manarang, 1993;
Louisell & Descamps, 1992).
Mode of response-based tests
Oral test: a test in which the test taker gives his or
her answers orally.
Written test: a test in which answers to the questions
are written by the test taker.
Performance test: a test in which the test taker
creates an answer or a product that demonstrates his or
her knowledge or skill, as in cooking and baking.
Ease of quantification of response-based tests
Objective test: a paper-and-pencil test in which
students’ answers can be compared and quantified to
yield a numerical score.
Subjective test: a paper-and-pencil test which is not
easily quantified, as students are given the freedom to
write their answers to a question, as in an essay test.
Mode of administration-based tests
Individual test: a test administered to one student at
a time.
Group test: a test administered to a group of students
simultaneously.
Constructor-based tests
Standardized test: a test prepared by an expert or
specialist and administered to representative
populations of similar individuals to obtain normative
data. TOEFL is an example of a standardized test.
Un-standardized test: a test prepared by teachers for
use in the classroom, with no established norms for
scoring and interpretation of results. It is not
accompanied by normative data.
Mode of interpreting results-based tests
Norm-referenced test: a test that evaluates a student’s
performance by comparing it to the performance of a group
of students on the same test.
Criterion-referenced test: a test that compares all the
testees to a predetermined criterion. In such a test,
everybody whose achievement comes up to the pre-set criterion
will receive a pass mark, while those under it will fail. The
criteria are often set in terms of tasks that students have to
be able to perform (e.g. to interact with an interlocutor with
ease; to ask for information and understand instructions). The
sketch below contrasts the two modes of interpretation.
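Here is a minimal Python sketch of the contrast; the names, scores, and cut score are invented for illustration:

    # Hypothetical scores and a predetermined criterion (cut score).
    scores = {"Samir": 15, "Imane": 12, "Rania": 14, "Adil": 9}
    cut_score = 12

    # Criterion-referenced interpretation: each testee is judged
    # against the pre-set criterion alone.
    for name, s in scores.items():
        print(name, "passes" if s >= cut_score else "fails")

    # Norm-referenced interpretation: the same scores are judged
    # against the performance of the group.
    mean = sum(scores.values()) / len(scores)
    for name, s in scores.items():
        print(name, "above the group mean" if s > mean else "at or below the group mean")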
Nature of answer-based test
Personality test: a test designed to assess some
aspects of an individual’s personality.
Intelligence test: a test that measures the mental
ability of an individual.
Aptitude test: a test designed to predict the
likelihood of an individual’s success in a learning
area or field of endeavor.
Achievement test: a test given to students to determine
what they have learned from formal instruction in
school.
Summative test: a test given at the end of instruction
to determine students’ learning and assign grades.
Diagnostic test: a test administered to students to
identify their specific strengths and weaknesses in
past and present learning.
Formative test: a test given to improve teaching and
learning while it is going on.
Nature of the question-based tests
Direct test: candidates are required to perform the very
skill the test intends to measure.
Indirect test: a test that measures the skills that
underlie performance in a particular task.
Discrete-point test: every item focuses on one clear-cut
segment of the target language without involving the
others. Typical format: a written multiple-choice test.
Integrative test: candidates need to use a number of
language elements at the same time in completing the
test tasks, for example essay writing, dictation, or a
cloze test.
Factors to consider in testing
Validity
A test is said to be valid if it measures
accurately what it is intended to measure.
Content validity
A test is said to have content validity if its content
constitutes a representative sample of the language
skills, structures, etc., with which it is meant to be
concerned.
In order to judge whether or not a test has content
validity, we need a specification of the skills or
structures, etc., that it is meant to cover. Such a
specification should be made at a very early stage
in test construction.
It isn’t to be expected that everything in the
specification will always appear in a single test,
but the specification provides the test constructor
with the basis for making a principled selection of
elements for inclusion in the test. A comparison between
the test specification and the test content is the basis
for judgments as to content validity.
Criterion-related validity
Also referred to as instrumental validity,
criterion-related validity requires that the criteria
be clearly defined by the teacher in advance. It has to
take into account other teachers’ criteria to be
standardized, and it also needs to demonstrate the
accuracy of a measure or procedure compared to another
measure or procedure which has already been
demonstrated to be valid.
There are essentially two kinds of criterion-related
validity: concurrent validity and predictive validity.
Concurrent validity
Concurrent validity is a statistical method using correlation.
Examinees who are known to be either masters or non-
masters on the content measured by the test are identified
before the test is administered. Once the tests have been
scored, the relationship between the examinees’ status as
either masters or non-masters and their performance (i.e.,
pass or fail) is estimated based on the test. This type of
validity provides evidence that the test is classifying
examinees correctly. The stronger the correlation is, the
greater the concurrent validity of the test is.
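The estimation itself is just a correlation. A minimal sketch, assuming Python 3.10+ and invented 1/0 labels (the Pearson correlation of two dichotomous variables is the phi coefficient):

    from statistics import correlation  # available in Python 3.10+

    # Hypothetical data: 1 = master / pass, 0 = non-master / fail.
    status = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]  # known examinee status
    result = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]  # pass/fail decision on the test

    # Pearson correlation of two 0/1 variables = the phi coefficient.
    phi = correlation(status, result)
    print(f"concurrent validity estimate (phi) = {phi:.2f}")

The closer the coefficient is to 1, the more correctly the test classifies examinees.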
Predictive validity
Predictive validity concerns the degree to which a
test can predict candidates’ future performances
as masters or non-masters. An example would be
how a proficiency test could predict a student’s
ability to cope with a graduate course at an
American university.
Construct validity
A test, part of a test, or a testing technique is said
to have construct validity if it can be demonstrated
that it measures just the ability which it is supposed
to measure. The word ‘construct’ refers to an
underlying ability (or trait) which is hypothesised
in a theory of language ability.
Face validity
A test is said to have face validity if it looks as if it measures
what it is supposed to measure. For example, a test which
claimed to measure writing ability but which didn’t require
the examinees to write might be thought to lack face validity.
Face validity is hardly scientific, yet it is important. It is not
investigated through formal procedures and is not
determined by subject experts. Instead, anyone who looks
over the test, including testees and other stakeholders, may
develop an informal opinion as to whether or not the test is
measuring what it claims to measure.
Reliability
Reliability is the extent to which an experiment, test, or
any measuring procedure shows the same result on
repeated trials. It implies the extent to which a test is
repeatable and yields consistent scores.
A test is said to be reliable if we get almost the same
results repeatedly.
A test is said to be unreliable if one’s scores fluctuate
from one administration to another; that is, one’s scores
on various administrations will be inconsistent.
Student | Obtained score | Score that would have been obtained the following day
Samir | 15 | 15.5
Imane | 12 | 11
Rania | 14 | 14
Adil | 9 | 10
Hicham | 15 | 14
Marwa | 10 | 10
Sami | 8 | 9
Rajaa | 13 | 13.5
Ayman | 16 | 16
Hajar | 14.5 | 15.5
Manar | 8 | 8
Ahmed | 10 | 11
Reda | 11 | 10.5
Salah | 12 | 12
(Scores barely change from one administration to the next: the test is reliable.)
Student | Obtained score | Score that would have been obtained the following day
Samir | 15 | 11
Imane | 12 | 7
Rania | 14 | 17
Adil | 9 | 15
Hicham | 15 | 6.5
Marwa | 10 | 17
Sami | 8 | 14
Rajaa | 13 | 9
Ayman | 16 | 16
Hajar | 14.5 | 10
Manar | 8 | 13
Ahmed | 10 | 18
Reda | 11 | 6
Salah | 12 | 16
(Scores fluctuate unpredictably from one administration to the next: the test is unreliable.)
Student | Obtained score | Score that would have been obtained the following day
Samir | 15 | 16.5
Imane | 12 | 12.5
Rania | 14 | 16
Adil | 9 | 12
Hicham | 15 | 16
Marwa | 10 | 13
Sami | 8 | 10
Rajaa | 13 | 14
Ayman | 16 | 16.5
Hajar | 14.5 | 15
Manar | 8 | 10
Ahmed | 10 | 12
Reda | 11 | 11
Salah | 12 | 13
(Scores rise steadily from one administration to the next: a predictable, systematic variation.)
The notion of the consistency of one’s scores with respect to
one’s average score over repeated administrations is the
central concern of the concept of reliability.
Some change in one’s score is inevitable. Some of the
changes might represent a steady increase in one’s score.
The increase would most likely be due to some sort of
learning. This kind of change, which would be predictable,
is called systematic variation.
Systematic variation contributes to the reliability of a
test, while unsystematic variation, which is called error
variation, contributes to its unreliability.
True Score
Let’s assume that someone takes a test. Since all
measurement devices are subject to error, the score one
gets on a test cannot be a true manifestation of one’s ability
in that particular trait. In other words, the score contains
one’s true ability along with some error. If this error part
could be eliminated, the resulting score would represent an
errorless measure of that ability. By definition, this
errorless score is called a “true score”.
Observed score
The true score is almost always different from the score
one gets, which is called the “observed score”. Since the
observed score includes the measurement error, i.e., the
error score, it can be greater than, equal to, or smaller than
the true score. If there is absolutely no error of
measurement, the observed score will equal the true score.
However, when there is a measurement error, which is
often the case, it can lead to an overestimation or an
underestimation of the true score.
Therefore, if the observed score is represented by X, the
true score by T, and the error score by E, then X = T + E,
and the relationship between the observed and the true
score is one of the following:
X = T (when E = 0), or
X > T (when E > 0), or
X < T (when E < 0).
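A small simulation makes the three cases concrete. This is only a sketch: the true score is invented, and the error term is assumed to be normally distributed:

    import random

    random.seed(1)
    T = 14.0  # the (unobservable) true score, invented for illustration

    for administration in range(5):
        E = random.gauss(0, 1.5)  # assumed measurement error with mean 0
        X = T + E                 # observed score: X = T + E
        case = "X > T" if E > 0 else ("X < T" if E < 0 else "X = T")
        print(f"X = {X:5.2f}  ({case})")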
Methods of Estimating
Reliability
Test-Retest Method
In this method, reliability is obtained by administering
a given test to a particular group twice and calculating
the correlation between the two sets of scores obtained
from the two administrations.
Since there has to be a reasonable amount of time
between the two administrations, this kind of reliability is
referred to as reliability, or consistency, over time.
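Using the first two score tables above as the two administrations, a minimal sketch of this calculation (assuming Python 3.10+ for statistics.correlation) would be:

    from statistics import correlation  # available in Python 3.10+

    first_administration = [15, 12, 14, 9, 15, 10, 8, 13, 16, 14.5, 8, 10, 11, 12]
    # Retest scores from the first (consistent) table:
    retest_consistent = [15.5, 11, 14, 10, 14, 10, 9, 13.5, 16, 15.5, 8, 11, 10.5, 12]
    # Retest scores from the second (fluctuating) table:
    retest_fluctuating = [11, 7, 17, 15, 6.5, 17, 14, 9, 16, 10, 13, 18, 6, 16]

    print(f"reliable test:   r = {correlation(first_administration, retest_consistent):.2f}")
    print(f"unreliable test: r = {correlation(first_administration, retest_fluctuating):.2f}")

The reliable test yields a coefficient close to 1; the unreliable one yields a low (here even negative) coefficient.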
Parallel-form Method
In the parallel-form method, two similar, or parallel, forms of the
same test are administered to a group of examinees just once.
The two forms of the test should be the same: all the
elements upon which the test items are constructed should be the
same in both forms. For example, if one form measures a particular
element of grammar, the other form should also contain the same
number of items on the same elements of grammar.
The subtests should also be the same, i.e., if one form of the test
has three subsections of grammar, vocabulary, and reading
comprehension, the other form should also have the same
subsections in the same proportions.
Split-Half Method
In the split-half method, the items comprising a test are
homogeneous. That is, all the items in the test attempt to
measure elements of a particular trait, e.g., tenses,
prepositions, other grammatical points, vocabulary,
reading and listening comprehension, which are all
subparts of the trait called language ability.
In this method, a single test with homogeneous items is
administered to a group of examinees, and the test is then
split, or divided, into two equal halves. The correlation
between the two halves is an estimate of the reliability of
the test scores, as sketched below.
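A minimal sketch with invented 0/1 item scores: each examinee’s items are split into odd and even halves, the half totals are correlated, and (a standard extra step not mentioned above) the half-test correlation is stepped up to full length with the Spearman-Brown formula:

    from statistics import correlation  # available in Python 3.10+

    # Hypothetical item scores: rows = examinees, columns = items (1 = correct).
    items = [
        [1, 1, 0, 1, 1, 0, 1, 1],
        [0, 1, 0, 0, 1, 0, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0, 0, 0, 1],
        [1, 0, 1, 1, 1, 0, 1, 1],
        [0, 1, 0, 0, 0, 1, 0, 0],
    ]

    # Split each examinee's test into two equal halves (odd vs. even items).
    odd_half = [sum(row[0::2]) for row in items]
    even_half = [sum(row[1::2]) for row in items]

    r_half = correlation(odd_half, even_half)  # reliability of a half-length test
    r_full = 2 * r_half / (1 + r_half)         # Spearman-Brown correction
    print(f"half-test r = {r_half:.2f}, full-test reliability = {r_full:.2f}")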
Which method should we use?
It depends on the function of the test.
The test-retest method is appropriate when the consistency of
scores over a particular time interval (the stability of test
scores over time) is important.
The parallel-forms method is desirable when the
consistency of scores over different forms is of importance.
When the go-togetherness of the items of a test is of
significance, i.e., when all test items should measure the same
construct (internal consistency), the split-half method will be
the most appropriate.
Factors Influencing Reliability
The Effect of Testees: Since human beings are dynamic creatures, the
attributes related to human beings are also dynamic. The implication is that
the performance of human beings will, by its very nature, fluctuate from
time to time and from place to place. Factors such as students
misunderstanding or misreading test directions, noise level, distractions,
and sickness can cause test scores to vary.
The Effect of Test Factors:
1) Test length. Generally, the longer a test is, the more reliable it is;
however, this holds only up to a point.
2) Item difficulty. When there is little variability among test scores,
reliability will be low. Thus, reliability will be low if a test is so easy that
every student gets most or all of the items correct, or so difficult that every
student gets most or all of the items wrong.
The Effect of Administration Factors: Poor or unclear
directions given during administration or inaccurate scoring can affect
reliability.
Practicality
It refers to the economy of time, effort and money in testing.
In other words, a test should be…
Easy to design
Easy to administer
Easy to invigilate
Easy to score
Easy to interpret (the results)
THANK YOU!