Validating a Test
Rudner, ERIC Clearinghouse on Assessment and Evaluation

The "Standards for Educational and Psychological Testing," established by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, are intended to provide a comprehensive basis for evaluating tests. This article identifies the key standards applicable to most test evaluation situations. Sample questions are presented to help in your evaluations.

TEST COVERAGE AND USE

There must be a clear statement of recommended uses and a description of the population for which the test is intended. The principal question to ask when evaluating a test is whether it is appropriate for your intended purposes as well as for your students. The use intended by the test developer must be justified by the publisher on technical grounds. You then need to evaluate your intended use against the publisher's intended use.

Questions to ask: What interpretations does the publisher feel are appropriate? What is the basis for considering whether the test applies to your students?

APPROPRIATE SAMPLES FOR TEST VALIDATION AND NORMING

The samples used for test validation and norming must be of adequate size and must be sufficiently representative to substantiate validity statements, to establish appropriate norms, and to support conclusions regarding the use of the instrument for the intended purpose. The individuals in the norming and validation samples should represent the group for which the test is intended in terms of age, experience, and background.

Questions to ask: How were the samples used in pilot testing, validation, and norming chosen? How is this sample related to your student population? Was the sample size large enough to develop stable estimates with minimal fluctuation due to sampling errors? Where statements are made concerning subgroups, are there enough test-takers in each subgroup? Do the difficulty levels of the test and criterion measures (if any) provide an adequate basis for validating and norming the instrument?

RELIABILITY

The test is sufficiently reliable to permit stable estimates of the ability levels of individuals in the target group. Fundamental to the evaluation of any instrument is the degree to which test scores are free from measurement error and are consistent from one occasion to another when the test is used with the target group. Sources of measurement error, which include fatigue, nervousness, content sampling, answering mistakes, misinterpreting instructions, and guessing, contribute to an individual's score and lower a test's reliability. Different types of reliability estimates should be used to estimate the contributions of different sources of measurement error. Of primary interest are estimates of internal consistency, which account for error due to content sampling, usually the largest single component of measurement error. Alternate-form reliability coefficients provide estimates of the extent to which individuals can be expected to rank the same on alternate forms of a test. Inter-rater reliability coefficients provide estimates of errors due to inconsistencies in judgment between raters. (Note that split-half reliability coefficients should not be used with speeded tests, as they will produce artificially high estimates.)

Questions to ask: Is the reliability sufficiently high to warrant using the test as a basis for decisions concerning individual students? To what extent are the groups used to provide reliability estimates similar to the groups the test will be used with? What are the reliabilities of the test for different groups of test-takers?

CRITERION VALIDITY

The test adequately predicts academic performance. In terms of an achievement test, criterion validity refers to the extent to which a test can be used to draw inferences regarding achievement. Empirical evidence in support of criterion validity must include a comparison of performance on the validated test against performance on outside criteria.
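As a concrete illustration of one of the reliability coefficients discussed above, the following sketch computes a split-half (internal consistency) estimate: item scores are split into odd and even halves, the half-scores are correlated, and the Spearman-Brown correction projects the correlation to full test length. The function names and the small 0/1 response matrix are hypothetical, made up for illustration; they do not come from the Standards or the article.

```python
# Illustrative sketch only: a split-half reliability estimate with the
# Spearman-Brown correction. The data and helper names are hypothetical.

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(responses):
    """Correlate odd-item and even-item half scores, then apply the
    Spearman-Brown correction to estimate full-length reliability."""
    odd = [sum(row[0::2]) for row in responses]
    even = [sum(row[1::2]) for row in responses]
    r_half = pearson_r(odd, even)
    return (2 * r_half) / (1 + r_half)  # Spearman-Brown correction

# Rows are examinees, columns are items scored 0/1 (hypothetical data).
responses = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]
print(split_half_reliability(responses))
```

Note that this is exactly the estimate the caution above warns about: on a speeded test, both halves mostly reflect how far the examinee got, so the two half-scores correlate almost perfectly and the coefficient is artificially inflated.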