Key Principles of Language Testing
3.1 Reliability
Reliability in language testing refers to the consistency and dependability of test scores
(Bachman & Palmer, 1996; Liu et al., 2020). A reliable test produces consistent results
across different administrations, raters, or test items. In other words, a reliable test
minimizes random error or fluctuations in scores that are not due to true differences in
language proficiency. Ensuring reliability is crucial for making valid inferences about
test-takers' language abilities based on their test scores.
Types of reliability
Test-retest reliability: This method involves administering the same test to the
same group of test-takers on two separate occasions and then correlating the two
sets of scores. A high correlation coefficient indicates that the test produces
consistent results over time. However, this method can be affected by factors such
as practice effects and memory.
Procedure: A group of applicants takes the test. After a suitable interval (e.g., two
weeks), the same group of applicants takes the exact same test again under similar
conditions. The scores from both administrations are compared for each
individual.
High reliability: If the scores from the first and second administrations are highly
correlated (e.g., a correlation coefficient close to 1.0), it suggests that the test
produces consistent results over time. This indicates that the test is reliable
because it is measuring the same underlying language proficiency consistently,
regardless of when it is administered.
Low reliability: If the scores show little or no correlation between the two
administrations, it suggests that the test is unreliable. This could be due to factors
such as practice effects, memory of specific items, test-taker fatigue, or
inconsistent testing conditions.
Overall, test-retest reliability assesses the consistency of test scores over time. A reliable
test should produce similar results when administered to the same group of test-takers
under comparable conditions.
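The correlation analysis described above can be sketched in code. The scores below are invented for illustration, and a real analysis would normally use a statistics package (e.g., SciPy), but a minimal Pearson correlation looks like this:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two aligned score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for five test-takers on two administrations
first = [72, 85, 60, 90, 78]
second = [70, 88, 62, 91, 75]

print(round(pearson_r(first, second), 3))  # close to 1.0 -> consistent scores
```

A coefficient near 1.0, as here, would indicate high test-retest reliability; a value near zero would signal that scores fluctuate for reasons unrelated to proficiency.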
Parallel forms reliability: This method requires the creation of two equivalent
versions of the same test, which are then administered to the same group of test-
takers. The scores on the two versions are then correlated. A high correlation
coefficient suggests that the two versions of the test are equivalent and measure
the same construct consistently.
Scenario:
Parallel Forms: Two equivalent versions of the test (Form A and Form B) are
created.
o Equivalence: Both forms cover the same range of English language skills
(reading, grammar, vocabulary) with a similar level of difficulty.
Example:
Reading Comprehension: Form A might include a passage
about historical figures, while Form B includes a passage
about scientific discoveries, but both passages assess similar
reading comprehension skills (main idea, supporting details,
inference).
Grammar: Both forms might include questions on verb
tenses and subject-verb agreement, but with different
sentence structures and contexts.
Vocabulary: Both forms might include vocabulary items
related to common academic topics, but with different word
choices and question formats (e.g., multiple choice, fill-in-
the-blank).
Procedure: Both forms are administered to the same group of test-takers, and the
two sets of scores are then correlated for each individual.
High reliability: If the scores from Form A and Form B are highly correlated
(e.g., a correlation coefficient close to 1.0), it suggests that the two forms are
equivalent and measure the same underlying English language proficiency
consistently. This indicates that the test is reliable because it produces similar
results regardless of which version is administered.
Low reliability: If the scores show little or no correlation between the two forms,
it suggests that the two forms are not equivalent and may not be measuring the
same construct. This could be due to differences in difficulty, content coverage, or
question format between the two forms.
In short, parallel forms reliability in this context ensures that the English achievement test
consistently measures students' English language proficiency, regardless of which version
they take. This is crucial for making fair and accurate assessments of student learning and
for identifying areas where students may need additional support.
Internal consistency reliability: This method examines whether the items within a
test (or a test section) measure the same underlying construct (Tang et al., 2014).
It is typically estimated from a single test administration using statistics such as
Cronbach's alpha, where a high coefficient indicates that the items hang together
consistently.
Scenario:
Test: A standardized English proficiency test for university applicants, designed to
assess reading comprehension.
Example:
Internal consistency analysis: Cronbach's alpha is calculated for the items in the
reading comprehension section; a high value suggests that the items all tap the
same reading skill.
Overall, internal consistency reliability ensures that the items within a test section are
measuring the same underlying construct and are not measuring unrelated or inconsistent
skills. In this example, a high Cronbach's alpha for the reading comprehension section
would suggest that the test is reliably measuring general reading comprehension ability.
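As a rough sketch of the calculation behind Cronbach's alpha (in practice a statistics package would be used), the right/wrong item scores below are invented for illustration:

```python
def variance(scores):
    """Population variance of a list of scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)

def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores holds one list per item, aligned by test-taker."""
    k = len(item_scores)
    totals = [sum(per_person) for per_person in zip(*item_scores)]
    sum_item_var = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - sum_item_var / variance(totals))

# Hypothetical right/wrong (1/0) scores on four reading items for six test-takers
items = [
    [1, 1, 0, 1, 1, 0],  # item 1
    [1, 1, 0, 1, 0, 0],  # item 2
    [1, 0, 0, 1, 1, 0],  # item 3
    [1, 1, 1, 1, 1, 0],  # item 4
]
print(round(cronbach_alpha(items), 3))  # about 0.82 for these made-up data
```

Values of alpha above roughly 0.7-0.8 are conventionally taken as acceptable internal consistency, though the appropriate threshold depends on the stakes of the test.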
Scenario:
Test: An English speaking exam for university students
Inter-rater reliability consideration: Since the assessment involves subjective
judgment by human raters (interviewers), it is crucial to ensure consistent scoring
across different raters.
Example:
Inter-rater reliability analysis: The scores given by the two raters for each
applicant are compared. Statistical methods like Cohen's kappa or intraclass
correlation coefficients are used to calculate the level of agreement between the
raters.
Overall, inter-rater reliability ensures that the scores assigned to test-takers are consistent
across different raters. In the case of oral proficiency interviews, high inter-rater
reliability is essential to ensure that the assessment is fair and objective.
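A minimal sketch of Cohen's kappa, the chance-corrected agreement statistic mentioned above, using invented ratings (real studies would use a statistics library and larger samples):

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Observed proportion of exact agreement
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal distribution
    p_expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                     for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical band scores (1-5) awarded by two raters to eight speakers
rater_a = [3, 4, 2, 5, 3, 4, 2, 3]
rater_b = [3, 4, 2, 4, 3, 4, 3, 3]
print(round(cohen_kappa(rater_a, rater_b), 3))  # about 0.64: moderate agreement
```

Unlike simple percent agreement, kappa discounts the agreement two raters would reach by chance alone, which is why it is preferred for rater-reliability studies.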
Improving reliability
Test developers can improve reliability in several ways: writing clear, unambiguous
instructions and items; including enough items so that no single item carries undue
weight; standardizing administration conditions and scoring procedures; training raters
and using detailed rubrics for subjectively scored tasks; and pilot testing items to identify
and revise those that perform inconsistently.
3.2 Validity
In the realm of language assessment, the concept of validity holds paramount importance.
Validity, in its essence, refers to the extent to which a language test accurately measures
what it purports to measure (Chapelle & Lee, 2021). It is a multifaceted concept that
encompasses various aspects, including content, construct, criterion, and consequential
validity. This section delves into the different types of language test validity, exploring
their significance in ensuring the accuracy, fairness, and meaningfulness of language
assessments.
Content validity
Content validity focuses on the degree to which a test comprehensively covers the
relevant content domain (Brown & Abeywickrama, 2019; Hughes, 2020). In language
testing, this entails ensuring that the test tasks and items adequately represent the
knowledge, skills, and abilities that are considered essential for the target language use
domain (Dinh, 2019; Siddiek, 2010). For instance, a test designed to assess academic
writing proficiency should include tasks that reflect the types of writing required in
academic settings, such as essays, research papers, and literature reviews. The following
scenario illustrates content validity:
Scenario:
Test purpose: An English test aims to assess the reading comprehension skills of
university students in an academic setting.
Content Validity Consideration: To ensure content validity, the test developers
would need to carefully select reading materials that are representative of the types
of texts students will encounter in their university studies.
Example:
Instead of using simplified or general-interest texts, the test should include excerpts
from academic textbooks, journal articles, and other scholarly materials of the kind
students encounter in university courses.
By including these types of materials, the test directly measures the skills students need
to succeed in their academic reading. It avoids irrelevant content (like casual
conversations or simple narratives) that wouldn't accurately reflect their ability to handle
university-level texts.
Overall, content validity is about ensuring the test's content aligns with the specific skills
or knowledge it intends to measure. In this case, the test content is valid because it
reflects the reading demands of a university environment.
Construct validity
Construct validity examines whether a test effectively measures the underlying construct
or attribute it is intended to assess (Hill & McNamara, 2015). Constructs in language
testing can include language proficiency, communicative competence, or specific
language skills like reading comprehension or listening ability. To establish construct
validity, researchers often employ various methods, such as correlating test scores with
other established measures of the same construct or examining the internal structure of
the test through factor analysis. The following scenario illustrates construct validity:
Scenario:
Test purpose: An English test aims to assess the writing ability of students,
specifically their ability to construct well-formed sentences.
Construct validity consideration: To ensure construct validity, the test needs to
accurately measure the underlying construct of "sentence construction ability."
This means it should effectively differentiate between students who have a good
grasp of sentence structure and those who struggle with it.
Example:
The test includes various tasks designed to elicit different aspects of sentence
construction, such as sentence-combining exercises, error-correction items, and
short guided-writing prompts.
By using a variety of tasks that target different facets of sentence construction, the test
provides a more comprehensive assessment of this underlying construct. It avoids relying
on just one type of task, which might only measure a narrow aspect of sentence
construction ability.
In general, construct validity is about ensuring the test accurately measures the theoretical
construct it intends to measure. In this case, the test is designed to measure the construct
of "sentence construction ability" by using tasks that effectively tap into different aspects
of this skill.
Criterion-related validity
Criterion-related validity examines the degree to which test scores correlate with an
external criterion, such as later performance that the test is intended to predict. The
following scenario illustrates criterion-related validity:
Scenario:
Test purpose: A university uses an English language proficiency test to predict
applicants' readiness for academic study.
Example:
Test: The university administers its own English language proficiency test to
applicants.
Criterion: The university tracks the academic performance (e.g., GPA) of the
admitted students in their first year of study.
Analysis: The university then correlates the test scores with the students' academic
performance.
If the test scores show a strong positive correlation with the students' academic
performance (i.e., higher test scores predict better academic outcomes), it demonstrates
that the test has good criterion-related validity. It means the test is effective at predicting
how well students will perform in their academic studies, which is the intended purpose
of the test.
In short, criterion-related validity focuses on how well a test's scores correlate with an
external criterion (in this case, academic performance). A strong correlation indicates that
the test is useful for predicting future performance on that criterion.
Consequential validity
Consequential validity considers the social and personal consequences of test use. It
examines the impact of a test on individuals, groups, and society as a whole. This
includes evaluating the potential benefits and drawbacks of test use, such as its effect on
learning, teaching, and educational policy. For instance, a test that is used for high-stakes
decisions, such as university admission or job placement, should be carefully scrutinized
for its potential impact on test takers' lives and opportunities. The following scenario
illustrates consequential validity:
Scenario:
Test: An English placement test used to assign students to classes at different
proficiency levels.
Example:
Positive consequence: The test accurately identifies students with strong English
skills, allowing them to be placed in challenging classes where they can thrive and
learn at a faster pace. This can lead to increased motivation, higher academic
achievement, and greater opportunities for these students.
Negative consequence: The test may unfairly disadvantage certain groups of
students, such as those from low-income backgrounds or those who are English
language learners. If the test is culturally biased or does not adequately account for
their unique learning experiences, it could lead to misplacement, frustration, and
lower self-esteem. This could also perpetuate existing educational inequalities.
To maximize positive consequences and minimize negative ones, test developers should:
Conduct a thorough needs assessment: Identify the specific needs and learning
goals of the target population.
Minimize potential bias: Ensure the test is culturally fair and does not
disadvantage any particular group of students.
Monitor the impact of test use: Track the long-term consequences of test-based
placement decisions on student learning and well-being.
Make adjustments as needed: Based on the monitoring data, make adjustments
to the test or the placement process to mitigate any negative consequences.
The significance of language test validity lies in its ability to ensure that language
assessments are accurate, fair, and meaningful. A valid language test provides reliable
information about test takers' language abilities, enabling stakeholders to make informed
decisions based on the test results. For test takers, a valid test ensures that their language
skills are assessed fairly and accurately, providing them with a true reflection of their
abilities. For educators, a valid test helps to identify students' strengths and weaknesses,
guiding instructional practices and curriculum development. For policymakers, a valid
test can inform decisions about language education programs and policies, ensuring that
they are effective and equitable.
Overall, language test validity is a complex and multifaceted concept that plays a crucial
role in ensuring the quality and meaningfulness of language assessments. By
understanding the different types of validity and their significance, language testers and
educators can strive to develop and utilize assessments that are accurate, fair, and
beneficial for all stakeholders.
3.3 Practicality
Practicality concerns the extent to which a test is feasible to develop, administer, and
score within the available time, budget, and resources. Key considerations include the
following:
Time constraints: Time is a valuable resource, both for test-takers and test
administrators. A practical test should be administered within a reasonable
timeframe. Test-takers should not feel rushed or pressured, but the test should also
not be excessively long, as this can lead to fatigue and reduced performance. Test
administrators should also consider the time required for test preparation,
administration, scoring, and reporting.
Cost-effectiveness: The cost of developing, administering, and scoring a language
test can vary significantly. Test development requires time, expertise, and
resources, such as item writing, pilot testing, and statistical analysis.
Administration costs include venue rental, equipment, and personnel. Scoring
costs can be substantial, especially for subjective assessments that require human
raters. A practical test should be cost-effective, considering the available budget
and the value derived from the test results.
Logistical feasibility: Logistical considerations include the availability of
appropriate testing venues, equipment, and personnel. Test administrators need to
ensure that the testing environment is conducive to optimal test performance, with
adequate space, lighting, and ventilation. They also need to ensure that sufficient
personnel are available to administer the test, monitor test-takers, and provide any
necessary assistance.
Ease of administration and scoring: A practical test should be easy to administer
and score. Clear and concise instructions should be provided to test-takers, and the
testing procedures should be straightforward and easy to follow. Scoring
procedures should be well-defined and easy to apply, reducing the potential for
subjectivity and inconsistency. The use of automated scoring systems can
significantly improve the efficiency and objectivity of the scoring process. Below
is an example:
Scenario:
Test: A high-stakes English test administered to a large number of students.
Example:
Multiple-choice format: The test consists of multiple-choice items that can be
scored quickly and objectively with an answer key or an automated scoring
system.
Overall, the use of a multiple-choice format enhances the practicality of the test by
making it easy to administer and score. This is especially important for high-stakes tests
that need to be administered to a large number of students efficiently.
Accessibility: A practical test should be accessible to all test-takers, regardless of
their physical or cognitive abilities. Reasonable accommodations should be made
for test-takers with disabilities, such as providing extra time, using assistive
technology, or modifying test formats. The test should also be culturally sensitive
and appropriate for test-takers from diverse backgrounds.
Ease of interpretation and reporting: Test results should be easily interpretable
and reported in a clear and concise manner. Test reports should provide
meaningful information about test-takers' language proficiency, such as their
strengths and weaknesses. They should also be easy to understand and use by
test-takers, educators, and other stakeholders.
Scenario:
Test: A standardized English proficiency test used for university admissions.
Example:
Standardized Scoring Scale: The test uses a standardized scoring scale, such as a
score range from 1 to 100 or a band score system (e.g., 1-9).
Benefits:
o Easy interpretation: Standardized scores are easy to understand and
compare. Applicants and universities can easily see a student's overall
English proficiency level.
o Clear reporting: Test reports can be generated quickly and easily,
providing clear and concise information about the applicant's performance
in different areas of English language proficiency (reading, writing,
listening, speaking).
o Efficient decision-making: Universities can easily use the standardized
scores to make admissions decisions, compare applicants, and identify
students who may need additional language support.
Using a standardized scoring scale enhances the practicality of the test by making it easy
to interpret and report results. This facilitates efficient decision-making for both
universities and applicants.
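As an illustration of how a band system might work in practice, the following sketch maps raw percentage scores onto a 1-9 band using an assumed cut-score table; the cut scores are invented for illustration and are not taken from any real test:

```python
# Hypothetical cut scores: (minimum raw percentage, band awarded)
CUT_SCORES = [(90, 9), (80, 8), (70, 7), (60, 6),
              (50, 5), (40, 4), (30, 3), (20, 2)]

def raw_to_band(raw):
    """Return the band for a raw percentage score (0-100)."""
    for cutoff, band in CUT_SCORES:
        if raw >= cutoff:
            return band
    return 1  # scores below the lowest cutoff fall into band 1

for raw in (95, 73, 41, 12):
    print(f"raw {raw} -> band {raw_to_band(raw)}")
```

Real testing programs set cut scores through standard-setting studies rather than equal-width intervals, but the principle (a fixed, transparent mapping from raw performance to a reportable scale) is the same.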
3.4 Ethical considerations
Ethical considerations are paramount in language testing. Fairness, equity, and respect for
test-takers are fundamental principles that must guide all aspects of test development and
administration. This section explores key ethical considerations in English language
testing, including test bias, cultural sensitivity, test security, and the responsible use of
test results.
Test bias
Test bias occurs when a test systematically favors or disadvantages certain groups of test-
takers based on factors such as gender, ethnicity, socioeconomic status, or cultural
background. One common type of bias is cultural bias, where test items or tasks are
culturally inappropriate or unfamiliar to test-takers from certain cultural backgrounds.
For example, a reading comprehension passage that references cultural concepts or events
that are unfamiliar to test-takers from a different cultural background may disadvantage
them.
To minimize test bias, test developers should:
Use culturally diverse materials: Include reading passages, listening materials, and
test items that are relevant and accessible to test-takers from diverse cultural
backgrounds.
Avoid culturally loaded language: Use language that is clear, concise, and free
from cultural idioms or slang that may be unfamiliar to certain groups.
Conduct pilot testing: Pilot test the test with diverse groups of test-takers to
identify and address any potential cultural biases.
Test security
Maintaining test security is crucial to ensure the integrity of the testing process. Test
materials should be kept confidential and should not be accessible to unauthorized
individuals. Measures should be taken to prevent cheating, such as proctoring exams
carefully, using secure testing environments, and implementing measures to detect and
prevent cheating.
Responsible use of test results
Test results should be used responsibly and ethically. Test scores should not be used to
make high-stakes decisions without careful consideration of their limitations. It is
important to remember that test scores are just one piece of information about a test-
taker's language ability, and they should not be the sole determinant of important
decisions such as university admissions or job placement. Test results should be
interpreted in conjunction with other relevant information, such as academic records,
recommendations, and interviews.
Test-taker rights
Test-takers have certain rights that must be respected. These rights include:
Right to clear and concise instructions: Test-takers should be provided with clear
and unambiguous instructions for all test tasks.
Right to a fair and equitable testing environment: Test-takers should be provided
with a comfortable and supportive testing environment that is free from
distractions and disruptions.
Right to privacy and confidentiality: Test-takers have the right to expect that their
test scores and personal information will be kept confidential and used only for the
intended purposes.
Right to appeal test scores: Test-takers should have the right to appeal their test
scores if they believe there has been an error in scoring or administration.
In general, ethical considerations are paramount in all aspects of language testing. Test
developers and administrators have a responsibility to ensure that tests are fair, equitable,
and unbiased. By adhering to ethical principles and best practices, we can ensure that
language tests are used responsibly and that test-takers are treated with respect and
dignity.
Instruction: Carefully read each statement in the table below. For each statement, rate
your understanding using the following scale:
Objective: To have students analyze scenarios to identify possible causes for unreliability
of an English test.
Procedure: Divide students into small groups. Provide each group with a handout
containing 3-4 scenarios related to different types of reliability in language testing.
o Example Scenarios:
Scenario 1 (Test-retest): "A student takes an online English
proficiency test. A week later, they retake the same test. Their scores
differ significantly. What factors might have contributed to this
difference?"
Scenario 2 (Parallel Forms): "A school uses two different versions
of a reading comprehension test. Students who perform well on one
version consistently score poorly on the other. What could be the
potential issues with the test?"
Scenario 3 (Internal Consistency): "A vocabulary test includes
items that seem unrelated to each other. How might this affect the
internal consistency of the test?"
Scenario 4 (Inter-rater): "Two teachers score student essays. Their
scores for the same essays differ significantly. What steps can be
taken to improve inter-rater reliability?"
Each group analyzes the scenarios and discusses possible causes for unreliability.
Objective: To have students analyze a sample English test for high school students in
Vietnam and identify potential threats to its reliability.
Material: A sample English test (e.g., a reading comprehension test with multiple-choice
questions)
Procedure:
Test presentation: Present students with a sample English test (or students search
from the Internet). Briefly discuss the purpose of the test and the target language
skills it aims to assess.
Group analysis: Divide students into small groups. Provide each group with a
handout containing guiding questions about potential threats to the test's
reliability.
Objective: To have students examine an English test given by the teacher or one on
the Internet to analyze the validity of an English test for high school students in
Vietnam.
o Content: Does the test adequately cover the range of language skills taught
in the English curriculum?
o Construct: Does the test accurately measure the intended language skills
(e.g., reading comprehension, writing, speaking)?
Objective: To have students apply the principles of practicality (time, cost, logistics, ease
of administration/scoring, accessibility, interpretation/reporting) in the design of a short
language test.
Procedure: Present students with a scenario requiring the creation of a language test. For
example, "Design a short English test for incoming exchange students to assess their
basic conversational skills." Each group designs a short language test (e.g., 2-3 tasks)
based on the given scenario. Students then discuss in groups how their test design
addresses each of the practicality principles listed in the objective.
Objective: To have students analyze ethical dilemmas in language testing and develop
strategies for ensuring fair and equitable assessment practices.
Procedure: Divide students into small groups. Provide each group with a handout
containing 3-4 ethical dilemmas related to language testing, such as the following.
Scenario 1 (Test bias): "A reading comprehension test
includes a passage about a specific cultural event that would
be unfamiliar to students from certain cultural backgrounds.
How does this create bias, and how can it be addressed?"
Scenario 2 (Test security): "A student discovers a copy of
the upcoming English proficiency test online. What are the
ethical implications of this situation, and what steps should be
taken to address it?"
Scenario 3 (Responsible use of test results): "A university
uses an English proficiency test as the sole criterion for
admission to its English program. What are the potential
ethical concerns with this approach?"
Scenario 4 (Test-taker rights): "A student with a learning
disability requests extra time for an English test. How should
the school handle this request to ensure fairness and equity
for all students?"
Each group analyzes the scenarios, discussing the ethical implications and potential
solutions.
References
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford University
Press.
Hill, K., & McNamara, T. (2015). Validity inferences under high-stakes conditions: A
response from language testing. Measurement: Interdisciplinary Research &
Perspectives, 13(1), 39-43.
Liu, Z., Li, T., & Diao, H. (2020). Analysis on the reliability and validity of teachers'
self-designed English listening test. Journal of Language Teaching and Research,
11(5), 801-808.
Siddiek, A. G. (2010). The impact of test content validity on language teaching and
learning. Online Submission, 6(12), 133-143.
Tang, W., Cui, Y., & Babenko, O. (2014). Internal consistency: Do we really know what
it is and how to assess it? Journal of Psychology and Behavioral Science, 2(2),
205-220.