Reliability, in the context of a single local assessment, is a measure of internal consistency. “How reliable is this test?” is another way of asking, “How well does each item on this test measure the same thing?”
Note that “reliability” is often confused with, or used interchangeably with, “validity.” In the context of local assessment, validity answers the question, “Is this the correct assessment for our purposes?” Validity is about drawing conclusions and assessing with a purpose.
Lean and Shape (Skewness and Kurtosis)
These two metrics are more closely related to score distribution than to reliability. If all of the results from a test are graphed together, the graph will resemble a bell. In a normal distribution of data, the bell is symmetric, with matching left and right sides. For most (likely all) local assessments in Aware, the bell will actually lean to the right or left (skewness) and will be misshapen into a flatter or more peaked shape (kurtosis).
Most Aware users will have no need for these numbers, and they can be ignored. Look at the Raw Score Distribution graph for a more accessible use of this information in graphical form.
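For readers who do want to see where these two numbers come from, here is a minimal sketch using the common moment-based definitions. The source does not say which formulas Aware uses, so the function name, the use of population moments, and the sample scores below are all assumptions for illustration.

```python
from statistics import fmean


def skewness_and_kurtosis(raw_scores):
    """Return (skewness, excess kurtosis) from population moments.

    Assumption: moment-based definitions; Aware may use a different variant.
    """
    n = len(raw_scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = fmean(raw_scores)
    # Second, third, and fourth central moments of the score distribution.
    m2 = sum((x - mean) ** 2 for x in raw_scores) / n
    m3 = sum((x - mean) ** 3 for x in raw_scores) / n
    m4 = sum((x - mean) ** 4 for x in raw_scores) / n
    if m2 == 0:
        raise ValueError("all scores are identical")
    skew = m3 / m2 ** 1.5            # 0 for a symmetric distribution
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skew, excess_kurtosis


# Hypothetical raw scores, not real Aware data.
example_scores = [1, 1, 1, 2, 10]
skew, kurt = skewness_and_kurtosis(example_scores)
print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

Under these definitions, a skewness near 0 means the bell is roughly symmetric, and a positive value means a longer tail toward higher scores. An excess kurtosis below 0 suggests a flatter shape than a normal curve, and a value above 0 suggests a more peaked one.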
We use Cronbach’s Alpha (α) as a measure of internal consistency. This number reflects how performance on each test item relates to performance on the other items, to the total number of test items, and to the total score. Basically, does every item do its part to measure the same “thing”?
A “thing” could be “Fractions, Decimals, and Percents,” “Significant Events of the Civil War,” “Moon Phases,” or any other topic that the test author wanted to draw conclusions about.
The closer α is to 1 (it will not reach 1), the more reliable the test. Acceptable ranges are harder to pin down for tests with fewer items, but values above 0.8 are generally considered good. For local assessments, especially when test authors deliberately include scaffolded items that assess outside the primary scope of the test, values as low as 0.7 or 0.6 are fine.
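As a rough illustration of how α combines item performance, the number of items, and the total score, here is a minimal sketch of the standard textbook formula, α = k/(k − 1) × (1 − sum of item variances / variance of total scores). The function name and the response data are hypothetical, and the source does not confirm that Aware computes α exactly this way.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of student rows of item scores.

    Assumption: textbook formula with sample variances throughout.
    """
    k = len(responses[0])  # number of test items
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two students")
    # Variance of each item (column) across students.
    item_variances = [
        variance([row[i] for row in responses]) for i in range(k)
    ]
    # Variance of each student's total score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("all total scores are identical")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 5 students, 4 items scored 1 (correct) or 0 (incorrect).
example_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(example_responses), 3))  # prints 0.8
```

When the items rise and fall together, the variance of the total scores is large compared with the sum of the item variances, which pushes α toward 1.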
Standard Error of Measurement (SEM)
Since every assessment is flawed, there is no way to know a “true score” for student learning. SEM estimates how much a student’s results might vary if they were to take the same test multiple times.
SEM is a function of reliability, so more-reliable assessments will have smaller SEMs.
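The relationship between SEM and reliability can be sketched with the usual textbook formula, SEM = SD × √(1 − α), where SD is the standard deviation of the total scores. This is an illustrative assumption, since the source does not state the exact formula Aware uses; the function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import stdev


def standard_error_of_measurement(total_scores, alpha):
    """SEM from total scores and a reliability estimate (assumed formula)."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha is expected to fall between 0 and 1 here")
    # Higher reliability shrinks the sqrt term, and with it the SEM.
    return stdev(total_scores) * sqrt(1 - alpha)


# Hypothetical total scores for five students.
example_totals = [4, 3, 2, 1, 0]
for alpha in (0.6, 0.8, 0.9):
    sem = standard_error_of_measurement(example_totals, alpha)
    print(f"alpha={alpha}: SEM={sem:.2f}")
```

Running the loop shows the SEM getting smaller as α increases, which matches the statement above that more-reliable assessments have smaller SEMs.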