Key Principles of Language Testing

Chapter 3 discusses the essential qualities of a good language test, focusing on reliability, validity, practicality, and ethical considerations. It elaborates on various types of reliability, including test-retest, parallel forms, internal consistency, and inter-rater reliability, emphasizing their importance in ensuring fair and accurate assessments. Additionally, the chapter covers validity types such as content, construct, and criterion-related validity, highlighting their roles in measuring the intended language skills effectively.

Chapter 3: Qualities of a Good Language Test

Language assessment plays a pivotal role in various educational and professional
contexts, from classroom settings to high-stakes examinations like university admissions
or professional certifications. To ensure that language tests are fair, accurate, and
meaningful, it is crucial to adhere to a set of fundamental principles. This chapter will
explore four key principles of language assessment: reliability, validity, practicality, and
ethical considerations. By understanding and applying these principles, language testers
can develop and implement assessments that are fair, accurate, meaningful, and ethically
sound, ultimately leading to more effective and equitable language learning and teaching
practices.

3.1 Reliability

Reliability in language testing refers to the consistency and dependability of test scores
(Bachman & Palmer, 1996; Liu et al., 2020). A reliable test produces consistent results
across different administrations, raters, or test items. In other words, a reliable test
minimizes random error or fluctuations in scores that are not due to true differences in
language proficiency. Ensuring reliability is crucial for making valid inferences about
test-takers' language abilities based on their test scores.

Types of reliability

 Test-retest reliability: This method involves administering the same test to the
same group of test-takers on two separate occasions and then correlating the two
sets of scores. A high correlation coefficient indicates that the test produces
consistent results over time. However, this method can be affected by factors such
as practice effects and memory.

For example: A standardized English proficiency test for university admissions.

Procedure: A group of applicants takes the test. After a suitable interval (e.g., two
weeks), the same group of applicants takes the exact same test again under similar
conditions. The scores from both administrations are compared for each
individual.

Demonstrating test-retest reliability:

High reliability: If the scores from the first and second administrations are highly
correlated (e.g., a correlation coefficient close to 1.0), it suggests that the test
produces consistent results over time. This indicates that the test is reliable
because it is measuring the same underlying language proficiency consistently,
regardless of when it is administered.

Low reliability: If the scores show little or no correlation between the two
administrations, it suggests that the test is unreliable. This could be due to factors
like:

o Test-takers' memory: They might remember answers from the first
administration, artificially inflating their scores on the second attempt.
o Practice effects: Test-takers may have become familiar with the test format,
or genuinely improved their language skills between the two
administrations, leading to higher scores on the second attempt.
o Uncontrolled factors: Differences in testing conditions, time of day, or the
emotional state of the test-takers could also influence scores.

Overall, test-retest reliability assesses the consistency of test scores over time. A reliable
test should produce similar results when administered to the same group of test-takers
under comparable conditions.
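The correlation check described above can be sketched in a few lines of Python. The scores are hypothetical, and the Pearson coefficient is computed from its standard formula:

```python
# Sketch: test-retest reliability as the Pearson correlation between two
# administrations of the same test to the same group. Scores are hypothetical.
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

first_admin  = [62, 75, 81, 58, 90, 70]   # administration 1
second_admin = [65, 73, 84, 60, 88, 72]   # same test-takers, two weeks later

r = pearson_r(first_admin, second_admin)
print(f"test-retest reliability r = {r:.2f}")  # values near 1.0 indicate consistency
```

A coefficient near 1.0 would support the "high reliability" interpretation above; a value near 0 would suggest the scores fluctuate for reasons unrelated to proficiency.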

 Parallel forms reliability: This method requires the creation of two equivalent
versions of the same test, which are then administered to the same group of test-
takers. The scores on the two versions are then correlated. A high correlation
coefficient suggests that the two versions of the test are equivalent and measure
the same construct consistently.

Scenario:

Test: A standardized English achievement test for 8th-grade students, designed to
assess reading comprehension, grammar, and vocabulary.

Purpose: To measure students' overall English language proficiency at the end of
the school year.

Parallel Forms: Two equivalent versions of the test (Form A and Form B) are
created.

o Equivalence: Both forms cover the same range of English language skills
(reading, grammar, vocabulary) with a similar level of difficulty.
 Example:
 Reading Comprehension: Form A might include a passage
about historical figures, while Form B includes a passage
about scientific discoveries, but both passages assess similar
reading comprehension skills (main idea, supporting details,
inference).
 Grammar: Both forms might include questions on verb
tenses and subject-verb agreement, but with different
sentence structures and contexts.
 Vocabulary: Both forms might include vocabulary items
related to common academic topics, but with different word
choices and question formats (e.g., multiple choice, fill-in-
the-blank).

Procedure:

o A group of 8th-grade students takes Form A of the English achievement
test.
o After a suitable interval (e.g., a few weeks), the same group of students
takes Form B of the test.
o The scores from both forms are compared for each student.

Demonstrating parallel forms reliability:

High reliability: If the scores from Form A and Form B are highly correlated
(e.g., a correlation coefficient close to 1.0), it suggests that the two forms are
equivalent and measure the same underlying English language proficiency
consistently. This indicates that the test is reliable because it produces similar
results regardless of which version is administered.

Low reliability: If the scores show little or no correlation between the two forms,
it suggests that the two forms are not equivalent and may not be measuring the
same construct. This could be due to differences in difficulty, content coverage, or
question format between the two forms.

In short, parallel forms reliability in this context ensures that the English achievement test
consistently measures students' English language proficiency, regardless of which version
they take. This is crucial for making fair and accurate assessments of student learning and
for identifying areas where students may need additional support.
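A parallel-forms check can be sketched the same way. Alongside the correlation, it is worth comparing the two forms' mean scores, since forms can correlate highly yet still differ in difficulty. All scores below are hypothetical:

```python
# Sketch: checking parallel forms by (1) correlating Form A and Form B scores
# and (2) comparing mean difficulty. Data are hypothetical.
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

form_a = [70, 55, 88, 62, 79, 91]   # each student's score on Form A
form_b = [68, 58, 85, 65, 77, 93]   # same students on Form B, weeks later

print(f"correlation between forms: {pearson_r(form_a, form_b):.2f}")
print(f"mean difficulty: Form A {mean(form_a):.1f} vs Form B {mean(form_b):.1f}")
```

High correlation plus similar means is the pattern expected of genuinely equivalent forms.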

 Internal consistency reliability: This method assesses the consistency of items
within a single test. One common measure of internal consistency is Cronbach's
alpha, which estimates the average correlation between all possible pairs of items
on the test. A high alpha coefficient indicates that the items on the test are
measuring the same underlying construct.

Scenario:
Test: A standardized English proficiency test for university applicants, designed to
assess reading comprehension.

Internal consistency consideration: The test aims to measure a single underlying
construct, such as general reading comprehension ability. Therefore, the items
within the reading comprehension section should be measuring the same thing.

Example:

Test content: The reading comprehension section consists of several passages
with multiple-choice questions.

Internal consistency check: To assess internal consistency, we can calculate
Cronbach's alpha, a common statistical measure. Cronbach's alpha estimates the
average correlation between all possible pairs of items on the test.

o High internal consistency: If Cronbach's alpha is high (typically above
0.70), it suggests that the items within the reading comprehension section
are measuring the same construct (i.e., general reading comprehension
ability) and are internally consistent (Tang et al., 2014). This means that
students who perform well on one item are likely to perform well on other
items as well.
o Low internal consistency: If Cronbach's alpha is low, it suggests that the
items are not measuring the same construct. This could be due to factors
such as:
 Item ambiguity or difficulty: Some items may be poorly written or
too difficult, leading to inconsistent responses.
 Heterogeneity of content: The items may be measuring different
aspects of reading comprehension (e.g., vocabulary, main idea,
inference) rather than a single, unified construct.

Overall, internal consistency reliability ensures that the items within a test section are
measuring the same underlying construct and are not measuring unrelated or inconsistent
skills. In this example, a high Cronbach's alpha for the reading comprehension section
would suggest that the test is reliably measuring general reading comprehension ability.
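The alpha computation described above follows the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). A minimal sketch, with a hypothetical 0/1 response matrix:

```python
# Sketch: Cronbach's alpha for a 4-item reading section.
# rows = test-takers, columns = items (1 = correct, 0 = incorrect); data hypothetical
from statistics import variance

responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
]

def cronbach_alpha(rows):
    k = len(rows[0])                                # number of items
    items = list(zip(*rows))                        # per-item score vectors
    totals = [sum(r) for r in rows]                 # total score per test-taker
    item_var = sum(variance(col) for col in items)  # sum of item variances
    return k / (k - 1) * (1 - item_var / variance(totals))

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")  # values above ~0.70 usually taken as adequate
```

For this matrix alpha comes out near 0.8, consistent with items that mostly rise and fall together.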

 Inter-rater reliability: This method is particularly important for subjective
assessments, such as writing or speaking tests. It measures the degree of
agreement between two or more raters on the same set of responses. Inter-rater
reliability can be estimated using statistical methods such as Cohen's kappa or
intraclass correlation coefficients.

Scenario:
 Test: An English speaking exam for university students.
 Inter-rater reliability consideration: Since the assessment involves subjective
judgment by human raters (interviewers), it is crucial to ensure consistent scoring
across different raters.

Example:

Interview procedure: A group of applicants is interviewed by two different,
trained raters.

Scoring: Each rater independently scores the applicants' performance on criteria
such as fluency, grammar, vocabulary, pronunciation, and overall communication
effectiveness.

Inter-rater reliability analysis: The scores given by the two raters for each
applicant are compared. Statistical methods like Cohen's kappa or intraclass
correlation coefficients are used to calculate the level of agreement between the
raters.

o High inter-rater reliability: If the correlation between the scores of the
two raters is high, it indicates that they are consistently evaluating the
applicants' performance in a similar manner. This suggests that the scoring
process is reliable and not significantly influenced by the subjective
judgment of individual raters.
o Low inter-rater reliability: If the correlation between the scores is low, it
suggests that the raters are not consistently evaluating the applicants. This
could be due to factors such as:
 Vague scoring rubrics: The criteria for scoring are not clearly
defined or consistently applied.
 Lack of rater training: Raters may not have received adequate
training on how to apply the scoring rubrics consistently.
 Rater bias: Personal biases or preferences of individual raters may
be influencing their scores.

Overall, inter-rater reliability ensures that the scores assigned to test-takers are consistent
across different raters. In the case of oral proficiency interviews, high inter-rater
reliability is essential to ensure that the assessment is fair and objective.
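Cohen's kappa, mentioned above, corrects raw rater agreement for the agreement expected by chance alone. A minimal sketch with hypothetical band ratings:

```python
# Sketch: Cohen's kappa for two raters assigning band labels to the same
# speaking performances. Ratings are hypothetical.
from collections import Counter

rater_1 = ["B2", "C1", "B1", "B2", "C1", "B2", "B1", "C1"]
rater_2 = ["B2", "C1", "B1", "B2", "B2", "B2", "B1", "C1"]

def cohens_kappa(a, b):
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n   # raw agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement: probability both raters pick the same label independently
    p_chance = sum((ca[lab] / n) * (cb[lab] / n) for lab in set(a) | set(b))
    return (p_observed - p_chance) / (1 - p_chance)

kappa = cohens_kappa(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")
```

Unlike a raw percent-agreement figure, kappa stays low when two raters agree only as often as guessing would predict.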

Factors affecting reliability

Several factors can affect the reliability of language tests:


 Test length: Longer tests tend to be more reliable than shorter tests, as they
provide more opportunities to sample the test-taker's language ability.
 Test difficulty: Tests that are too easy or too difficult may not discriminate
effectively between test-takers, leading to lower reliability.
 Test administration: Factors such as testing conditions, time constraints, and
instructions can affect test-taker performance and, consequently, test reliability.
 Rater factors: Rater subjectivity, fatigue, and inconsistency can all impact the
reliability of subjective assessments.

Improving reliability

Several strategies can be used to improve the reliability of language tests:

 Clear instructions: Providing clear and unambiguous instructions to test-takers can
help to minimize confusion and ensure that everyone understands the task
requirements.
 Standardized procedures: Establishing standardized testing procedures can help to
control for extraneous variables and ensure that all test-takers are assessed under
similar conditions.
 Rater training: Providing raters with clear scoring rubrics and training on how to
apply them consistently can improve inter-rater reliability.
 Item analysis: Conducting item analysis can help to identify and eliminate
problematic items that do not contribute to the overall reliability of the test.
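The item-analysis step in the last bullet can be sketched as a pass over a (hypothetical) response matrix that flags items that are too easy, too hard, or that high scorers miss more often than low scorers:

```python
# Sketch: simple item analysis using item difficulty (proportion correct) and
# an upper-lower discrimination index. Response matrix is hypothetical;
# rows = test-takers, 1 = correct answer.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
]

ranked = sorted(responses, key=sum, reverse=True)   # best total scores first
half = len(ranked) // 2
upper, lower = ranked[:half], ranked[half:]

results = []
for i in range(len(responses[0])):
    p = sum(row[i] for row in responses) / len(responses)                      # difficulty
    d = (sum(row[i] for row in upper) - sum(row[i] for row in lower)) / half   # discrimination
    results.append((round(p, 2), round(d, 2)))
    flag = " <- review" if p < 0.2 or p > 0.9 or d < 0.2 else ""
    print(f"item {i + 1}: difficulty={p:.2f} discrimination={d:+.2f}{flag}")
```

In this toy data, item 3 discriminates negatively (high scorers miss it more often than low scorers), which is exactly the kind of item this analysis would flag for revision or removal.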

In general, reliability is a fundamental aspect of language test quality. By ensuring that
tests are reliable, we can be more confident that the scores they produce accurately reflect
test-takers' language abilities. This is essential for making valid inferences about
language proficiency and for using test scores to make important decisions, such as
admissions to educational programs or job placement.

3.2 Validity

In the realm of language assessment, the concept of validity holds paramount importance.
Validity, in its essence, refers to the extent to which a language test accurately measures
what it purports to measure (Chapelle & Lee, 2021). It is a multifaceted concept that
encompasses various aspects, including content, construct, criterion, and consequential
validity. This section delves into the different types of language test validity, exploring
their significance in ensuring the accuracy, fairness, and meaningfulness of language
assessments.

Content validity

Content validity focuses on the degree to which a test comprehensively covers the
relevant content domain (Brown & Abeywickrama, 2019; Hughes, 2020). In language
testing, this entails ensuring that the test tasks and items adequately represent the
knowledge, skills, and abilities that are considered essential for the target language use
domain (Dinh, 2019; Siddiek, 2010). For instance, a test designed to assess academic
writing proficiency should include tasks that reflect the types of writing required in
academic settings, such as essays, research papers, and literature reviews. The following
scenario illustrates content validity:

Scenario:

 Test purpose: An English test aims to assess the reading comprehension skills of
university students in an academic setting.
 Content Validity Consideration: To ensure content validity, the test developers
would need to carefully select reading materials that are representative of the types
of texts students will encounter in their university studies.

Example:

Instead of using simplified or general interest texts, the test should include:

 Excerpts from academic textbooks: Covering various disciplines like science,
history, or literature.
 Scholarly articles: With complex sentence structures and discipline-specific
vocabulary.
 Graphs and charts: Requiring interpretation of visual data commonly found in
academic materials.

By including these types of materials, the test directly measures the skills students need
to succeed in their academic reading. It avoids irrelevant content (like casual
conversations or simple narratives) that wouldn't accurately reflect their ability to handle
university-level texts.

Overall, content validity is about ensuring the test's content aligns with the specific skills
or knowledge it intends to measure. In this case, the test content is valid because it
reflects the reading demands of a university environment.

Construct validity

Construct validity examines whether a test effectively measures the underlying construct
or attribute it is intended to assess (Hill & McNamara, 2015). Constructs in language
testing can include language proficiency, communicative competence, or specific
language skills like reading comprehension or listening ability. To establish construct
validity, researchers often employ various methods, such as correlating test scores with
other established measures of the same construct or examining the internal structure of
the test through factor analysis. The following scenario illustrates construct validity:

Scenario:

 Test purpose: An English test aims to assess the writing ability of students,
specifically their ability to construct well-formed sentences.
 Construct validity consideration: To ensure construct validity, the test needs to
accurately measure the underlying construct of "sentence construction ability."
This means it should effectively differentiate between students who have a good
grasp of sentence structure and those who struggle with it.

Example:

The test includes various tasks designed to elicit different aspects of sentence
construction:

 Sentence completion: Students fill in missing words or phrases in sentences,
requiring them to understand grammatical relationships and sentence structure.
 Sentence combining: Students combine two or more simple sentences into a more
complex one, testing their ability to use conjunctions, relative clauses, etc.
 Error identification: Students identify grammatical errors in sentences, assessing
their knowledge of correct sentence structure.

Why this demonstrates construct validity:

By using a variety of tasks that target different facets of sentence construction, the test
provides a more comprehensive assessment of this underlying construct. It avoids relying
on just one type of task, which might only measure a narrow aspect of sentence
construction ability.

In general, construct validity is about ensuring the test accurately measures the theoretical
construct it intends to measure. In this case, the test is designed to measure the construct
of "sentence construction ability" by using tasks that effectively tap into different aspects
of this skill.

Criterion-related validity

Criterion-related validity explores the relationship between test scores and an external
criterion or measure of the same construct. This type of validity is typically divided into
two categories: concurrent validity and predictive validity. Concurrent validity assesses
the relationship between test scores and a criterion measure collected at the same time,
while predictive validity examines the relationship between test scores and a criterion
measure collected at a later point in time. For example, a test designed to predict
academic success in a language learning program would exhibit high predictive validity if
its scores strongly correlate with students' actual academic performance in the program.
Another aspect of criterion-related validity involves examining how well the test scores
correlate with external criteria, such as grades in English classes or performance on
other standardized English assessments. This helps validate the test's effectiveness in
predicting students' actual English language abilities. Below is an example of
criterion-related validity:

Scenario:

 Test purpose: A university wants to assess the English language proficiency of
international students applying for admission.
 Criterion-related validity consideration: The university wants to ensure that the
English test scores accurately predict how well the students will perform in their
academic coursework.

Example:

 Test: The university administers its own English language proficiency test to
applicants.
 Criterion: The university tracks the academic performance (e.g., GPA) of the
admitted students in their first year of study.
 Analysis: The university then correlates the test scores with the students' academic
performance.

Demonstrating criterion-related validity:

If the test scores show a strong positive correlation with the students' academic
performance (i.e., higher test scores predict better academic outcomes), it demonstrates
that the test has good criterion-related validity. It means the test is effective at predicting
how well students will perform in their academic studies, which is the intended purpose
of the test.

In short, criterion-related validity focuses on how well a test's scores correlate with an
external criterion (in this case, academic performance). A strong correlation indicates that
the test is useful for predicting future performance on that criterion.
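The predictive-validity analysis in the scenario reduces to correlating admission test scores with the later criterion measure. The scores and GPAs below are hypothetical:

```python
# Sketch: predictive validity as the correlation between admission test scores
# and first-year GPA collected a year later. All numbers are hypothetical.
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

test_scores    = [72, 85, 60, 90, 78, 66]        # proficiency test at admission
first_year_gpa = [2.9, 3.5, 2.4, 3.8, 3.1, 2.7]  # criterion, gathered later

r = pearson_r(test_scores, first_year_gpa)
print(f"predictive validity coefficient r = {r:.2f}")
```

A strong positive coefficient here would support using the test scores for admissions decisions; a weak one would argue against it.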

Consequential validity

Consequential validity considers the social and personal consequences of test use. It
examines the impact of a test on individuals, groups, and society as a whole. This
includes evaluating the potential benefits and drawbacks of test use, such as its effect on
learning, teaching, and educational policy. For instance, a test that is used for high-stakes
decisions, such as university admission or job placement, should be carefully scrutinized
for its potential impact on test takers' lives and opportunities. The following scenario
illustrates consequential validity:

Scenario:

 Test purpose: A standardized English proficiency test is used to determine which
students are eligible for placement in advanced English classes.
 Consequential validity consideration: The test developers and administrators
need to consider the potential consequences of using this test for placement
decisions.

Example:

 Positive consequence: The test accurately identifies students with strong English
skills, allowing them to be placed in challenging classes where they can thrive and
learn at a faster pace. This can lead to increased motivation, higher academic
achievement, and greater opportunities for these students.
 Negative consequence: The test may unfairly disadvantage certain groups of
students, such as those from low-income backgrounds or those who are English
language learners. If the test is culturally biased or does not adequately account for
their unique learning experiences, it could lead to misplacement, frustration, and
lower self-esteem. This could also perpetuate existing educational inequalities.

Demonstrating consequential validity:

To ensure consequential validity, the test developers and administrators should:

 Conduct a thorough needs assessment: Identify the specific needs and learning
goals of the target population.
 Minimize potential bias: Ensure the test is culturally fair and does not
disadvantage any particular group of students.
 Monitor the impact of test use: Track the long-term consequences of test-based
placement decisions on student learning and well-being.
 Make adjustments as needed: Based on the monitoring data, make adjustments
to the test or the placement process to mitigate any negative consequences.

In summary, consequential validity considers the broader social and personal
implications of test use. It's about ensuring that the test has a positive impact on students
and does not create unintended negative consequences.

Significance of language test validity

The significance of language test validity lies in its ability to ensure that language
assessments are accurate, fair, and meaningful. A valid language test provides reliable
information about test takers' language abilities, enabling stakeholders to make informed
decisions based on the test results. For test takers, a valid test ensures that their language
skills are assessed fairly and accurately, providing them with a true reflection of their
abilities. For educators, a valid test helps to identify students' strengths and weaknesses,
guiding instructional practices and curriculum development. For policymakers, a valid
test can inform decisions about language education programs and policies, ensuring that
they are effective and equitable.

Overall, language test validity is a complex and multifaceted concept that plays a crucial
role in ensuring the quality and meaningfulness of language assessments. By
understanding the different types of validity and their significance, language testers and
educators can strive to develop and utilize assessments that are accurate, fair, and
beneficial for all stakeholders.

3.3 Practicality

In addition to validity and reliability, practicality is a crucial consideration in language
test development and administration. A practical test is one that is feasible and efficient
in terms of time, cost, and resources. It should be easy to administer, score, and interpret,
and it should be logistically manageable (Bachman & Palmer, 1996; Brown &
Abeywickrama, 2019).

Factors related to practicality

 Time constraints: Time is a valuable resource, both for test-takers and test
administrators. A practical test should be administered within a reasonable
timeframe. Test-takers should not feel rushed or pressured, but the test should also
not be excessively long, as this can lead to fatigue and reduced performance. Test
administrators should also consider the time required for test preparation,
administration, scoring, and reporting.
 Cost-effectiveness: The cost of developing, administering, and scoring a language
test can vary significantly. Test development requires time, expertise, and
resources, such as item writing, pilot testing, and statistical analysis.
Administration costs include venue rental, equipment, and personnel. Scoring
costs can be substantial, especially for subjective assessments that require human
raters. A practical test should be cost-effective, considering the available budget
and the value derived from the test results.
 Logistical feasibility: Logistical considerations include the availability of
appropriate testing venues, equipment, and personnel. Test administrators need to
ensure that the testing environment is conducive to optimal test performance, with
adequate space, lighting, and ventilation. They also need to ensure that sufficient
personnel are available to administer the test, monitor test-takers, and provide any
necessary assistance.
 Ease of administration and scoring: A practical test should be easy to administer
and score. Clear and concise instructions should be provided to test-takers, and the
testing procedures should be straightforward and easy to follow. Scoring
procedures should be well-defined and easy to apply, reducing the potential for
subjectivity and inconsistency. The use of automated scoring systems can
significantly improve the efficiency and objectivity of the scoring process. Below
is an example:

Scenario:

 Test: A standardized English proficiency test for high school students.
 Practicality consideration: The test needs to be easy to administer and score,
especially for teachers with limited time and resources.

Example:

 Multiple-choice format: The test primarily uses a multiple-choice format for
questions on grammar, vocabulary, and reading comprehension.
 Benefits:
o Easy administration: Multiple-choice questions are straightforward for
teachers to administer. They can be easily printed or displayed on a screen,
and students can mark their answers on a separate answer sheet.
o Objective scoring: Scoring is objective and efficient. Answer keys can be
used to quickly and accurately score the tests, minimizing the potential for
human error.
o Automated scoring: In many cases, answer sheets can be scanned and
scored electronically, further increasing efficiency and reducing the time
and effort required for scoring.

Overall, the use of a multiple-choice format enhances the practicality of the test by
making it easy to administer and score. This is especially important for high-stakes tests
that need to be administered to a large number of students efficiently.

 Accessibility: A practical test should be accessible to all test-takers, regardless of
their physical or cognitive abilities. Reasonable accommodations should be made
for test-takers with disabilities, such as providing extra time, using assistive
technology, or modifying test formats. The test should also be culturally sensitive
and appropriate for test-takers from diverse backgrounds.

 Interpretation and reporting

Test results should be easily interpretable and reported in a clear and concise manner.
Test reports should provide meaningful information about test-takers' language
proficiency, such as their strengths and weaknesses. They should also be easy to
understand and use by test-takers, educators, and other stakeholders.

Scenario:

 Test: A standardized English proficiency test for university admissions.
 Practicality consideration: The test results need to be easily interpreted and
reported in a clear and concise manner for both the university and the applicants.

Example:

 Standardized Scoring Scale: The test uses a standardized scoring scale, such as a
score range from 1 to 100 or a band score system (e.g., 1-9).
 Benefits:
o Easy interpretation: Standardized scores are easy to understand and
compare. Applicants and universities can easily see a student's overall
English proficiency level.
o Clear reporting: Test reports can be generated quickly and easily,
providing clear and concise information about the applicant's performance
in different areas of English language proficiency (reading, writing,
listening, speaking).
o Efficient decision-making: Universities can easily use the standardized
scores to make admissions decisions, compare applicants, and identify
students who may need additional language support.

Using a standardized scoring scale enhances the practicality of the test by making it easy
to interpret and report results. This facilitates efficient decision-making for both
universities and applicants.
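A band-score system like the one described can be sketched as a lookup of a raw score against a table of cut points. The cut scores below are invented for illustration and are not those of any real exam:

```python
# Sketch: converting raw scores (out of 100) to a hypothetical 1-9 band scale.
import bisect

# Hypothetical raw-score thresholds: reaching CUT_SCORES[i] earns band i + 2
CUT_SCORES = [10, 20, 30, 40, 55, 70, 82, 92]

def to_band(raw_score):
    """Map a raw score out of 100 onto a 1-9 reporting band."""
    return bisect.bisect_right(CUT_SCORES, raw_score) + 1

for raw in (8, 35, 71, 95):
    print(f"raw {raw:3d} -> band {to_band(raw)}")
```

Reporting bands rather than raw scores makes results comparable across test forms and easier for admissions staff to interpret at a glance.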

In summary, practicality is a crucial consideration in language test development and
administration. A practical test is feasible and cost-effective, and can be
administered, scored, and interpreted efficiently. By carefully considering the practical
aspects of test development and administration, we can ensure that language tests are fair,
reliable, and meaningful for all stakeholders.

3.4 Ethical considerations in English language testing

Ethical considerations are paramount in language testing. Fairness, equity, and respect for
test-takers are fundamental principles that must guide all aspects of test development and
administration. This chapter will explore key ethical considerations in English language
testing, including test bias, cultural sensitivity, test security, and the responsible use of
test results.

Test bias

Test bias occurs when a test systematically favors or disadvantages certain groups of test-
takers based on factors such as gender, ethnicity, socioeconomic status, or cultural
background. One common type of bias is cultural bias, where test items or tasks are
culturally inappropriate or unfamiliar to test-takers from certain cultural backgrounds.
For example, a reading comprehension passage that references cultural concepts or events
that are unfamiliar to test-takers from a different cultural background may disadvantage
them.

To minimize cultural bias, test developers should:

 Use culturally diverse materials: Include reading passages, listening materials, and
test items that are relevant and accessible to test-takers from diverse cultural
backgrounds.
 Avoid culturally loaded language: Use language that is clear, concise, and free
from cultural idioms or slang that may be unfamiliar to certain groups.
 Conduct pilot testing: Pilot test the test with diverse groups of test-takers to
identify and address any potential cultural biases.

Test security

Maintaining test security is crucial to ensure the integrity of the testing process. Test
materials should be kept confidential and inaccessible to unauthorized individuals.
Administrators should also proctor exams carefully, use secure testing environments, and
implement procedures to detect and deter cheating.

Responsible use of test results

Test results should be used responsibly and ethically. Test scores should not be used to
make high-stakes decisions without careful consideration of their limitations. It is
important to remember that test scores are just one piece of information about a test-
taker's language ability, and they should not be the sole determinant of important
decisions such as university admissions or job placement. Test results should be
interpreted in conjunction with other relevant information, such as academic records,
recommendations, and interviews.

Test-taker rights

Test-takers have certain rights that must be respected. These rights include:

 Right to clear and concise instructions: Test-takers should be provided with clear
and unambiguous instructions for all test tasks.
 Right to a fair and equitable testing environment: Test-takers should be provided
with a comfortable and supportive testing environment that is free from
distractions and disruptions.
 Right to privacy and confidentiality: Test-takers have the right to expect that their
test scores and personal information will be kept confidential and used only for the
intended purposes.
 Right to appeal test scores: Test-takers should have the right to appeal their test
scores if they believe there has been an error in scoring or administration.

In general, ethical considerations are paramount in all aspects of language testing. Test
developers and administrators have a responsibility to ensure that tests are fair, equitable,
and unbiased. By adhering to ethical principles and best practices, we can ensure that
language tests are used responsibly and that test-takers are treated with respect and
dignity.

3.5 Student self-assessment

Instruction: Carefully read each statement in the table below. For each statement, rate
your understanding using the following scale:

1: I do not understand this at all.
2: I understand this a little, but I need more help.
3: I understand this fairly well, but I have some questions.
4: I understand this very well and can explain it to others.

For each Learning Objective/Concept below, give a Self-Assessment Rating (1-4), record Evidence/Notes (explain your rating), and write an Action Plan (what will you do to improve?).
 Key Qualities: I can identify and explain the four key qualities of a good language test (reliability, validity, practicality, ethical considerations).
 Factors Affecting Reliability: I can identify factors that can affect the reliability of a language test (e.g., test conditions, test-taker characteristics, rater bias).
 Validity: I can define validity and explain its different types (e.g., content validity, construct validity, criterion-related validity).
 Practicality: I can discuss the concept of practicality in language testing and explain its components (e.g., cost, time constraints, ease of administration).
 Ethical Considerations: I can identify and discuss ethical considerations related to language testing (e.g., fairness, bias, transparency, confidentiality).
 Overall Understanding: I feel confident in my understanding of the qualities of a good language test and their importance in language assessment.

3.6 Consolidation activities

Activity 1: Scenario analysis to identify causes for unreliability

Objective: To have students analyze scenarios to identify possible causes of unreliability
in an English test.

Procedure: Divide students into small groups. Provide each group with a handout
containing 3-4 scenarios related to different types of reliability in language testing.

o Example Scenarios:
 Scenario 1 (Test-retest): "A student takes an online English
proficiency test. A week later, they retake the same test. Their scores
differ significantly. What factors might have contributed to this
difference?"
 Scenario 2 (Parallel Forms): "A school uses two different versions
of a reading comprehension test. Students who perform well on one
version consistently score poorly on the other. What could be the
potential issues with the test?"
 Scenario 3 (Internal Consistency): "A vocabulary test includes
items that seem unrelated to each other. How might this affect the
internal consistency of the test?"
 Scenario 4 (Inter-rater): "Two teachers score student essays. Their
scores for the same essays differ significantly. What steps can be
taken to improve inter-rater reliability?"

Each group analyzes the scenarios and discusses possible causes for unreliability.
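
Where the scenarios invite quantification, two classic statistics can be computed by hand: a Pearson correlation between the two administrations (Scenario 1) and Cohen's kappa for agreement between two raters (Scenario 4). The sketch below is a minimal illustration; all scores and rating bands are invented for the example.

```python
from math import sqrt
from collections import Counter

def pearson_r(x, y):
    """Pearson correlation between two score lists: the usual
    test-retest (and parallel-forms) reliability estimate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cohen_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters' categorical scores."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[cat] * c2[cat] for cat in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Scenario 1: the same ten students tested twice, a week apart (invented data)
week1 = [62, 71, 55, 80, 68, 74, 59, 90, 66, 77]
week2 = [60, 75, 50, 82, 70, 70, 62, 88, 64, 79]
print(f"test-retest r = {pearson_r(week1, week2):.2f}")   # test-retest r = 0.96

# Scenario 4: two teachers banding the same eight essays A-C (invented data)
teacher_a = ["B", "A", "C", "B", "B", "A", "C", "B"]
teacher_b = ["B", "A", "B", "B", "C", "A", "C", "B"]
print(f"inter-rater kappa = {cohen_kappa(teacher_a, teacher_b):.2f}")  # inter-rater kappa = 0.60
```

With these invented numbers both statistics are high; in the scenarios as written, the large score discrepancies would instead push r and kappa toward zero, signaling unreliability.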

Activity 2: Unpacking the reliability of a sample test

Objective: To have students analyze a sample English test for high school students in
Vietnam and identify potential threats to its reliability.

Material: A sample English test (e.g., a reading comprehension test with multiple-choice
questions)

Procedure:

Test presentation: Present students with a sample English test (or have students find
one on the Internet). Briefly discuss the purpose of the test and the target language
skills it aims to assess.

Group analysis: Divide students into small groups. Provide each group with a
handout containing the following questions:

 Test-retest: "What factors might affect the consistency of scores if
the same students took this test again in a few weeks?"
 Parallel forms: "If a parallel version of this test were created, what
would be important considerations to ensure the two versions are
truly equivalent?"
 Internal consistency: "How could you assess the internal
consistency of this test? What specific measures or analyses would
you use?"
 Inter-rater reliability: "If this test included a writing section, how
could you ensure consistent scoring across different raters?"
 Overall reliability: "Based on your analysis, what are the potential
threats to the reliability of this test?"
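
For the internal-consistency question, the statistic most often reported is Cronbach's alpha, which compares the sum of the individual item variances with the variance of students' total scores. Below is a minimal sketch; the five students and four right/wrong items are invented for illustration.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix (rows = test-takers, columns = items):
    k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(scores[0])                               # number of items
    item_vars = [pvariance(col) for col in zip(*scores)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Invented right/wrong (1/0) scores: five students, four items
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(cronbach_alpha(scores), 2))  # prints 0.7
```

Values around .70 or above are commonly treated as acceptable for classroom tests; a set of unrelated items, as in Scenario 3 of Activity 1, drives alpha down.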
Activity 3: Analysis of language test validity

Objective: To have students examine an English test (provided by the teacher or found
on the Internet) and analyze its validity for high school students in Vietnam.

Procedure: Students work in pairs to discuss the following questions:

o Content: Does the test adequately cover the range of language skills taught
in the English curriculum?
o Construct: Does the test accurately measure the intended language skills
(e.g., reading comprehension, writing, speaking)?

Activity 4: The practicality puzzle: Designing a feasible test

Objective: To have students apply the principles of practicality (time, cost, logistics, ease
of administration/scoring, accessibility, interpretation/reporting) in the design of a short
language test.

Procedure: Present students with a scenario requiring the creation of a language test. For
example, "Design a short English test for incoming exchange students to assess their
basic conversational skills." Each group designs a short language test (e.g., 2-3 tasks)
based on the given scenario. Students discuss in groups to consider the following:

 Test format: Multiple-choice, short answer, role-play, etc.
 Test length: How long will the test take to administer?
 Scoring procedures: How will the test be scored? Are there clear
scoring rubrics?
 Logistics: Where will the test be administered? What resources are
needed?
 Accessibility: How will the test be made accessible to students with
disabilities?
 Reporting: How will test results be communicated to students and
other stakeholders?

Activity 5: The ethics of assessment: Navigating dilemmas in language testing

Objective: To have students analyze ethical dilemmas in language testing and develop
strategies for ensuring fair and equitable assessment practices.

Procedure: Divide students into small groups. Provide each group with a handout
containing 3-4 ethical dilemmas related to language testing.
 Scenario 1 (Test bias): "A reading comprehension test
includes a passage about a specific cultural event that would
be unfamiliar to students from certain cultural backgrounds.
How does this create bias, and how can it be addressed?"
 Scenario 2 (Test security): "A student discovers a copy of
the upcoming English proficiency test online. What are the
ethical implications of this situation, and what steps should be
taken to address it?"
 Scenario 3 (Responsible use of test results): "A university
uses an English proficiency test as the sole criterion for
admission to its English program. What are the potential
ethical concerns with this approach?"
 Scenario 4 (Test-taker rights): "A student with a learning
disability requests extra time for an English test. How should
the school handle this request to ensure fairness and equity
for all students?"

Each group analyzes the scenarios, discussing the ethical implications and potential
solutions.

References

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford University
Press.

Brown, H. D., & Abeywickrama, P. (2019). Language assessment: Principles and
classroom practices. Pearson.

Chapelle, C. A., & Lee, H. W. (2021). Conceptions of validity. In The Routledge
handbook of language testing (pp. 17-31). Routledge.

Dinh, M. T. (2019). A review on validating language tests. VNU Journal of Foreign
Studies, 35(1).

Hill, K., & McNamara, T. (2015). Validity inferences under high-stakes conditions: A
response from language testing. Measurement: Interdisciplinary Research &
Perspectives, 13(1), 39-43.

Hughes, A. (2020). Testing for language teachers. Cambridge University Press.

Liu, Z., Li, T., & Diao, H. (2020). Analysis on the reliability and validity of teachers'
self-designed English listening test. Journal of Language Teaching and
Research, 11(5), 801-808.
Siddiek, A. G. (2010). The impact of test content validity on language teaching and
learning. Online Submission, 6(12), 133-143.

Tang, W., Cui, Y., & Babenko, O. (2014). Internal consistency: Do we really know what
it is and how to assess it? Journal of Psychology and Behavioral Science, 2(2), 205-
220.
