What Is Test Validity?
- What is test validity and test validation?
- What are the different types of validity evidence?
- Building a case for validity
- What statistical concepts are used in validity studies?
Tests themselves are not valid or invalid. Instead, we validate the use of a test score.
Tests are pervasive in our world. Tests can take the form of written responses to a series of questions, such as the paper-and-pencil SAT™, or of judgments by experts about behavior, such as those for gymnastic trials or for a work performance appraisal. The form of test results also varies, from pass/fail, to holistic judgments, to a complex series of numbers meant to convey minute differences in behavior.
Regardless of the form a test takes, its most important aspect is how the results are used and the way those results impact individual persons and society as a whole. Tests used for admission to schools or programs or for educational diagnosis not only affect individuals, but also assign value to the content being tested. A test that is perfectly appropriate and useful in one situation may be inappropriate or insufficient in another. For example, a test that may be sufficient for use in educational diagnosis may be completely insufficient for use in determining graduation from high school.
Test validity, or the validation of a test, explicitly means validating the use of a test in a specific context, such as college admission or placement into a course. Therefore, when determining the validity of a test, it is important to study the test results in the setting in which they are used. In the previous example, in order to use the same test for educational diagnosis as for high school graduation, each use would need to be validated separately, even though the same test is used for both purposes.
Validity is a matter of degree, not all or none.
Samuel Messick, a renowned psychometrician, defines validity as "...an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationale support the adequacy and appropriateness of inferences and actions based on test scores and other modes of assessment." Messick points out that validity is a matter of degree: the use of a test is never absolutely valid or absolutely invalid. He maintains that validity evidence continues to accumulate over time, either enhancing or contradicting previous findings.
Tests sample behavior; they don't measure it directly.
Most, but not all, tests are designed to measure skills, abilities, or traits that are not directly observable. For example, scores on the SAT measure developed critical reading, writing, and mathematical ability. The score that an examinee obtains on the SAT is not a direct measure of critical reading ability in the way that degrees centigrade are a direct measure of the heat of an object. The amount of an examinee's developed critical reading ability must be inferred from the examinee's SAT critical reading score.
The process of using a test score as a sample of behavior in order to draw conclusions about a larger domain of behaviors is characteristic of most educational and psychological tests. Responsible test developers and publishers must be able to demonstrate that it is possible to use the sample of behaviors measured by a test to make valid inferences about an examinee's ability to perform tasks that represent the larger domain of interest.
Reliability is not enough; a test must also be valid for its use.
If test scores are to be used to make accurate inferences about an examinee's ability, they must be both reliable and valid. Reliability is a prerequisite for validity and refers to the ability of a test to measure a particular trait or skill consistently. However, tests can be highly reliable and still not be valid for a particular purpose. Crocker and Algina (1986, page 217) demonstrate the difference between reliability and validity with the following analogy.
Consider the analogy of a car's fuel gauge which systematically registers one-quarter higher than the actual level of fuel in the gas tank. If repeated readings are taken under the same conditions, the gauge will yield consistent (reliable) measurements, but the inference about the amount of fuel in the tank is faulty.
This analogy makes it clear that determining the reliability of a test is an important first step, but not the defining step, in determining the validity of a test.
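The fuel gauge analogy can be put into numbers. The following is a minimal sketch, not part of Crocker and Algina's text: the readings and the true fuel level are hypothetical values chosen for illustration, and "one-quarter higher" is assumed to mean one quarter of a tank. A small spread across repeated readings indicates consistency (reliability), while the gap between the average reading and the true level indicates that the inference drawn from the gauge is faulty.

```python
from statistics import mean, pstdev

# Hypothetical actual fuel level, as a fraction of a full tank.
true_level = 0.50

# Hypothetical repeated gauge readings taken under the same conditions:
# tightly clustered, but about one quarter of a tank too high.
readings = [0.75, 0.76, 0.74, 0.75, 0.75]

# Small spread across readings: the gauge is consistent (reliable).
spread = pstdev(readings)

# Large systematic error: the inference about the fuel level is faulty.
bias = mean(readings) - true_level

print(f"spread of readings: {spread:.3f}")
print(f"systematic error:   {bias:+.3f}")
```

A reliability check alone looks only at the spread, so it would not reveal the systematic error. Detecting it requires comparing the readings with an independent measure of the fuel level, which is what a validity study does.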
There are many different methods that can be used to establish the validity of a test's use.
Crocker and Algina (1986) point to three major types of validity studies: content validity, criterion-related validity, and construct validity. More recently, consequential validity has increasingly been discussed as a fourth major type of validity.
These four types of validity studies include, and sometimes employ, additional concepts of validity. For the content validity of a test, both a face validity and curricular validity study should be completed. To establish criterion-related validity, either a predictive validity or a concurrent validity study can be used. To establish construct validity, convergent validity and/or discriminant validity studies are used. Evidence from content and criterion-related validity studies can also be used to establish construct validity. Consequential validity requires an inquiry into the social consequences of the test use which are unrelated to the construct being tested, but which impact one or more groups.
Several types of evidence should be used to build a case for valid test use. For example, in building a case for the use of the CLEP® exams for college course placement, the college may want to:
- Compare the test specifications with course requirements to see if there is sufficient overlap to be comfortable using evidence from the test in place of completion of a course.
- Complete a concurrent criterion-related validity study to determine the relationship between course grades and test scores.
- Compare results from the CLEP exams with results from classroom tests of the same topics to establish convergent validity evidence.
- Follow up with surveys of the students enrolled in subsequent classes, who tested out of prerequisite classes using CLEP, to determine whether they felt their preparation to be adequate.
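The relationship examined in a concurrent criterion-related validity study is commonly summarized with a correlation coefficient, often called a validity coefficient. The sketch below computes a Pearson correlation between exam scores and course grades. The data are hypothetical and the function name `pearson_r` is introduced here for illustration only; an actual study would use real student records and a much larger sample.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: one exam score and one course grade per student.
exam_scores = [48, 52, 55, 60, 63, 67, 70, 74]
course_grades = [2.0, 2.3, 2.7, 2.7, 3.0, 3.3, 3.7, 3.7]

r = pearson_r(exam_scores, course_grades)
print(f"validity coefficient r = {r:.2f}")
```

A coefficient near 1 would suggest that students who score higher on the exam also tend to earn higher course grades. How large a coefficient is large enough is a judgment that depends on the decision being made.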
Test scores provide specific information about test-takers that can be used to make decisions about college admission, course placement, promotion, services, etc., but the use of tests for these specific purposes first requires validation of the test.
Just as an attorney builds a legal case with different types of evidence, the degree of validity for the use of a test is established through various types of evidence including logical, empirical, judgmental, and procedural evidence. A validation study seeks to establish:
Why are test scores used to make decisions about test-takers?
The responsible use of test scores in any decision-making process requires test users to justify why a given test is preferred over another measure. Consider the following nonacademic example:
Why should we supplement driving exams with a written component when we can so readily observe a person driving? Wouldn't observation be a more informative method of assessment, and subsequently, a more solid platform for basing a decision about someone's ability to drive? What is the purpose of giving a written test to assess driving skills?
These are not innocuous questions. They raise issues of adequacy, practicality, and fairness.
It would be impractical to simulate every possible scenario of driving conditions (uphill parking, the four-way stop dilemma, illegal or dangerous maneuvers, etc.). It would also be difficult to standardize conditions or to simulate certain conditions. For example, geography within a given state will vary from one test site to another. One part of the state might be mountainous (and possibly snowy), while another part might be quite flat (and perhaps prone to flooding). Weather conditions will change with the seasons or even from one testing session to another. All these factors make it more difficult to generalize from a limited sample of behavior to a more general set of skills.
In this example, the potential challenges involved in maintaining consistent testing conditions are obvious, but how does this relate to academic assessment? There should be a solid rationale for the inclusion of test scores in any decision-making process. On tests that are standardized, every examinee receives an equivalent test under similar conditions. Great care is taken to ensure that one candidate does not have an unfair advantage over another because of the test date, form, or location of the testing site.
One reason to consider test scores when making academic decisions is that large-scale tests are administered under standardized conditions. Well-developed, well-constructed, relevant tests are reliable measures of targeted constructs.
Though no test can perfectly measure what a person knows or fails to know, good tests give snapshots of the ability level (or state of mastery) at a given point in time. When used with other relevant pieces of information, test scores provide a strong foundation for well-informed decisions.
What are the test scores or other measures that will produce the best results?
In the course of establishing the validity of a test in a particular context, several key questions arise that have no hard and fast answers. Careful judgment and attention to the consequential validity of the test use are your best guides in answering these questions. The first key decision is which test (and which test scores) you will attempt to validate. It is usually best to include several choices. Remember that validity is a matter of degree, rather than all or none.
For example, a college that is validating the use of a test for admission will want to simultaneously consider the results if no test is used, if a combination of measures is used, and if a test is used by itself. Comparing the results will give the best indication of whether the test is valid and useful.
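One way to sketch such a comparison is to correlate each candidate predictor with a later outcome. In the example below, all data are hypothetical, high school GPA stands in for the "no test" option, first-year GPA stands in for the outcome, and the combination is an equally weighted sum of standardized scores. These are all assumptions made for illustration; an actual study might instead use regression with weights estimated from institutional data.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def pearson_r(x, y):
    """Pearson correlation, computed as the mean product of z-scores."""
    zx, zy = zscores(x), zscores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


# Hypothetical records for eight admitted students.
hs_gpa = [2.8, 3.0, 3.1, 3.4, 3.5, 3.6, 3.8, 3.9]
test_score = [480, 560, 500, 540, 620, 580, 600, 700]
first_year_gpa = [2.1, 2.6, 2.4, 2.9, 3.2, 3.0, 3.3, 3.8]

# Equal-weight composite of the two standardized predictors (an assumption).
composite = [a + b for a, b in zip(zscores(hs_gpa), zscores(test_score))]

results = {
    "no test (high school GPA only)": pearson_r(hs_gpa, first_year_gpa),
    "test score only": pearson_r(test_score, first_year_gpa),
    "combination of measures": pearson_r(composite, first_year_gpa),
}

for label, r in results.items():
    print(f"{label}: r = {r:.2f}")
```

Placing the three coefficients side by side shows how much each option contributes. A statistical comparison of this kind addresses only predictive evidence and does not replace the content and consequential evidence discussed earlier.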
Several important questions must be answered when designing the validation studies, each of which has several potential answers depending on factors such as politics, expertise, availability, and budget:
- Who will determine whether a test has sufficient content validity? An independent outside panel? The department chair? A panel of teachers?
- What are the measures that will be used to determine convergent or discriminant validity? Which measures should produce similar results? Which should produce different results?
- What will be the basis for a curricular study? Will course outlines be used? Department regulations? Classroom tests? State standards?
- What will be the criterion in a criterion-related validity study? How will success be determined in a course or program?
- What are the social consequences that will be explored to determine consequential validity? How will unintended effects be monitored?
If inferences or actions have been based on the use of test scores, when were the data collected?
Current data are usually the best data. If any major change occurred in institutional standards, policies concerning admission or course placement, institutional marketing strategies, or student body composition between the time the data were gathered and when the data were submitted, then extreme caution should be exercised when interpreting outcomes. More current data may produce different results in cases where decisions are data-dependent. When conditions change, a new validity study is the best option.
How will the results of the validity study be used?
You may be interested in conducting a study that will help you select an entering class of students for your institution. You may wish to gather evidence that the admission decisions you are making at your institution are fair or that the variables, such as test scores, are valid and legally defensible. You may be interested in making decisions about placing students into remedial or advanced classes at your institution. How you will ultimately use the results determines how you should design the study and its components.
Before conducting a study, it is important to be clear about what your inquiry will actually be measuring; will the responses indeed answer your specific questions? To assist you in this process, the Admitted Class Evaluation Service™ (ACES™) system allows you to download sample reports.