Technical Manual
Chapter 2: Test Development
In this chapter, we describe the process used to develop the items included in VALLSS. In doing so, we provide evidence in support of the generalization inferences outlined in Table 3. Although the processing, code-based, and language-based constructs differ substantially, the item development process was nearly identical regardless of subtest. We therefore describe development for all items within this chapter. All score reporting will be covered in a forthcoming version of this report.
Table 3.
Overview of Inferences, Assumptions, and Supporting Analyses
Inference | Assumption | Analysis
1) VALLSS adequately measures students’ language and code-based literacy | 1.a) Theory supports the items included in VALLSS | Literature review and expert feedback
| 1.b) Items align with the established elements of early literacy and language | Expert feedback
2) The response processes are appropriate for each item | 2.a) Response processes are developmentally appropriate for students at each grade level, allowing students to demonstrate the skills and understandings being measured | Expert feedback
| 2.b) Students respond to the items as intended | Reports on implementation from early piloting
Inference 1: VALLSS adequately measures students’ language and code-based literacy.
Assumption 1.a: Theory supports the items included in VALLSS.
The evidence base on reading in combination with the SVR (Gough & Tunmer, 1986) theoretical framework contributed to early decisions about what VALLSS would measure. Reports from the NRP (2000), the National Literacy Panel on Language Minority Children and Youth (August & Shanahan, 2017), and the National Early Literacy Panel (2008) identified a number of foundational skills that are associated with literacy development (i.e., phonological awareness, the alphabetic principle, vocabulary, and fluency). The skills represented in these reports have remained relevant indicators in research on the science of reading, and summaries for each subtest are available via Front Matter files on the VLP website (click the Screener+ option from the menu). VALLSS’s measurement of language comprehension, code-based literacy, and processing is therefore theoretically and empirically supported.
Assumption 1.b: Items align with the established elements of early literacy and language.
VLP gathered information from subject matter experts (SMEs), reviewed evidence-based information on reading instruction, consulted scopes and sequences of foundational reading skills listed in reading instructional program manuals, and reviewed Virginia’s “Standards of Learning” (SOLs), which describe what is to be taught in each grade. Assessments and subtests that are widely used for tapping each of the target constructs were identified. The VLP team gathered information on how to accurately assess language comprehension and foundational literacy skills from journals, reports, and existing measures of language acquisition and reading development. Because standardized, published screeners, assessments, and scopes and sequences were more readily available for code-based literacy than for language comprehension, the team also read journal articles about language development. Faculty researchers and educators were involved in reviewing all of these resources to ensure the content for each of the VALLSS subtests would align with established elements of early literacy and language. The following describes the scope of content for each construct in more detail.
Code-Based Literacy
Unlike the rest of the subtests, the scope of content for alphabet and letter-sound knowledge was comprehensive. All letter sounds are included as items because comprehensive knowledge of letter sounds is necessary for long-term literacy development. All letter names are also included as items; educators are accustomed to teaching the names of letters, and the team determined that keeping all letters might facilitate uptake. For the rest of the subtests, researchers, educators, and SMEs were asked to create items with the relevant standards and instructional sequences in mind. Items were developed in consideration of the skill and subtest, with item characteristics manipulated to target each assessment term within each grade. The item characteristics of relevance therefore varied.
In the absence of a range of content that could be manipulated for alphabet and letter-sound knowledge, the team addressed the presentation of each item. Letter Sounds items present paired lowercase and uppercase forms, ensuring each item represents whichever letter form a student might be familiar with. Letter Naming items present lowercase and uppercase forms separately and comprehensively, ensuring students see a single grapheme at a time while allowing credit regardless of the form a student knows. The appendices provide outlines with more detail about the development process for phonological, decoding, and oral reading fluency items and passages, respectively.
When creating the decoding and encoding items, writers referenced existing inventories to create databases of items. Inventories provided examples of words and features, from which VLP documented target word patterns. Item writers avoided semantic priming, homophones, and other words that may have different meanings, or be similar to words with different meanings, in other languages. Real Word Reading and Pseudoword Reading items measure word-reading ability with real and nonsense words, respectively. Pseudoword items were written to follow real-word spelling patterns. Scoring for word-level reading items relies on the accuracy and smoothness with which a student responds to a single word; we posit that this combination of content and scoring requirements ensures VALLSS assesses whether a student has mastered specific phonics skills and can decode automatically. The Encoding items also assess specific phonics skills and are relatively more demanding because they require that the student produce written words and spell them in an orthographically accurate way rather than read them as a stimulus. Scoring relies on accuracy in producing the grapheme (letter) that is matched to a phoneme (sound). Encoding items were developed because they are an expressive demonstration of code-based literacy.
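As an illustration only, the sketch below shows one way pseudowords that follow real-word spelling patterns could be assembled and checked against an exclusion list of real words. The onsets, rimes, and toy exclusion list are hypothetical and are not the materials used for VALLSS; in practice the exclusion list would be a full lexicon.

# Hypothetical sketch: generating pseudoword candidates from real-word spelling patterns.
# The onsets, rimes, and tiny exclusion list below are illustrative, not VALLSS materials.
ONSETS = ["b", "d", "fl", "gr", "sn"]
RIMES = ["ap", "ig", "ot", "ump", "ash"]  # common short-vowel rimes

EXCLUDE = {"bap", "dig", "dot", "grip", "snap", "bash", "dash", "bump"}  # toy stand-in lexicon

def candidate_pseudowords():
    """Combine onsets and rimes, dropping any combination found in the exclusion list."""
    return sorted(
        onset + rime
        for onset in ONSETS
        for rime in RIMES
        if onset + rime not in EXCLUDE
    )

print(candidate_pseudowords())  # combinations not found in the (toy) exclusion list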
Passages for ORF were created using several criteria. Five key elements guided development of the ORF passages: word count, decodability, and Lexile level, all of which varied by grade, were considered alongside topics covering a wide variety of concepts, activities, and events with which students are familiar, as well as the balance of fiction and nonfiction writing. A balance of fiction and nonfiction passages is included in every grade. For more specific information on word counts, Lexile ranges, and decodability percentages for each grade, see Appendix 2.
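As an illustration of how such per-grade criteria can be screened during drafting, the sketch below checks a draft passage against word-count, decodability, and Lexile targets. The grade-level thresholds and the passage_meets_criteria function are hypothetical placeholders, not the actual VALLSS values (those appear in Appendix 2), and Lexile scores would come from an external analyzer rather than being computed locally.

# Hypothetical sketch: screening a draft ORF passage against per-grade targets.
# All threshold values below are illustrative placeholders, not VALLSS criteria.
from dataclasses import dataclass

@dataclass
class PassageCriteria:
    min_words: int
    max_words: int
    min_decodable_pct: float  # share of words matching taught phonics patterns
    lexile_range: tuple       # (low, high) Lexile band for the grade

# Placeholder targets keyed by grade; the actual VALLSS values differ.
CRITERIA = {
    1: PassageCriteria(min_words=60,  max_words=120, min_decodable_pct=0.80, lexile_range=(200, 400)),
    2: PassageCriteria(min_words=100, max_words=180, min_decodable_pct=0.70, lexile_range=(400, 650)),
    3: PassageCriteria(min_words=150, max_words=250, min_decodable_pct=0.60, lexile_range=(600, 850)),
}

def passage_meets_criteria(grade, word_count, decodable_pct, lexile):
    """Return the list of criteria a draft passage fails (an empty list means it passes)."""
    c = CRITERIA[grade]
    problems = []
    if not (c.min_words <= word_count <= c.max_words):
        problems.append("word count outside target range")
    if decodable_pct < c.min_decodable_pct:
        problems.append("decodability below target")
    if not (c.lexile_range[0] <= lexile <= c.lexile_range[1]):
        problems.append("Lexile outside target band")
    return problems

# Example: a Grade 2 draft with 150 words, 75% decodable words, and a Lexile of 520.
print(passage_meets_criteria(2, 150, 0.75, 520))  # -> [] (meets all placeholder targets)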
Language Comprehension
See Appendix 3 for the entire process used to develop the narrative passages and items included in the language comprehension subtests, Passage Retell and Passage Comprehension Questions.
VLP followed several criteria when creating language comprehension passages for the retell and expository comprehension questions. VLP relied on the previous grade’s SOLs (e.g., Grade 3 passages were created using topics from the Grade 2 SOLs). When choosing topics for the passages, both expository and narrative passages were written for each grade level. All passages were drafted considering age- and grade-appropriate content, vocabulary, and grammatical structures. Lexile levels were also calculated for each passage as an indicator of readability. Because Lexile levels help inform the selection of texts at an appropriate level for students to read independently, whereas these passages were developed to be read aloud to students, our target Lexile levels were about two grade levels above the assessment grade (e.g., a fourth-grade level for second-grade students). Questions about passages were developed to rely on evidence provided by the passage itself and to avoid relying on students’ background knowledge. Again, for more details on development, see Appendix 3.
Vocabulary lists were developed for items that measure word-level expressive vocabulary naming accuracy and understanding of relational vocabulary.
In creating the vocabulary lists for Expressive Vocabulary, all words had to be picturable; no more than 20% of words could be Spanish-English cognates; no word could appear in any image of a VALLSS subtest within the same grade in English; words with additional referents were avoided; and words were selected from different semantic categories, with no more than 3-4 words taken from any one category. Additionally, SOLs were consulted to understand what English vocabulary appears in each grade level. Once lists were created that followed the above standards, each list was internally reviewed using the Child Language Data Exchange System (CHILDES; MacWhinney, 2000), a database of child language data sets that provided data on word frequencies. The words in the lists were arranged from highest to lowest frequency within each grade, as well as across grades, to facilitate measurement of students’ progress.
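A minimal sketch of how such screening rules and frequency ordering could be applied is shown below. The candidate words, their attributes, the frequency counts, and the screen_and_order function are all illustrative assumptions; in practice, frequencies came from the CHILDES database and the selections were reviewed by the team rather than produced automatically.

# Hypothetical sketch: screening and frequency-ordering Expressive Vocabulary candidates.
# Candidate words, attributes, and frequencies are illustrative, not VALLSS data.
candidates = [
    {"word": "ladder",  "picturable": True,  "cognate": False, "ambiguous": False, "category": "tools",   "freq": 312},
    {"word": "animal",  "picturable": True,  "cognate": True,  "ambiguous": False, "category": "animals", "freq": 951},
    {"word": "bat",     "picturable": True,  "cognate": False, "ambiguous": True,  "category": "animals", "freq": 408},
    {"word": "freedom", "picturable": False, "cognate": False, "ambiguous": False, "category": "ideas",   "freq": 120},
]

MAX_PER_CATEGORY = 4      # no more than 3-4 words per semantic category
MAX_COGNATE_SHARE = 0.20  # no more than 20% Spanish-English cognates

def screen_and_order(words):
    """Keep words that satisfy the screening rules, considered in descending frequency order."""
    kept, per_category, cognates = [], {}, 0
    for w in sorted(words, key=lambda w: w["freq"], reverse=True):
        if not w["picturable"] or w["ambiguous"]:
            continue  # must be picturable and have a single clear referent
        if per_category.get(w["category"], 0) >= MAX_PER_CATEGORY:
            continue  # spread selections across semantic categories
        if w["cognate"] and (cognates + 1) > MAX_COGNATE_SHARE * (len(kept) + 1):
            continue  # cap the share of cognates in the running list
        kept.append(w["word"])
        per_category[w["category"]] = per_category.get(w["category"], 0) + 1
        cognates += int(w["cognate"])
    return kept

print(screen_and_order(candidates))  # -> ['ladder'] (the cognate cap excludes 'animal' from this tiny list)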
To write Nonsense Sentences, a word and phrase bank of grade-level-appropriate articles, subjects, adjectives, verbs, prepositions, nouns, and predicates was created first. Sentences with nonsensical meaning but grammatically and syntactically correct structure were then created using various combinations from the word bank. A variety of combinations of sentence structures and vocabulary was used.
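The sketch below illustrates one way sentences could be assembled from such banks. The word banks, the single sentence frame, and the nonsense_sentence function are hypothetical examples, not the actual VALLSS banks; the real items were written and reviewed by people rather than generated automatically.

# Hypothetical sketch: assembling grammatically correct but semantically nonsensical
# sentences from word/phrase banks. The banks below are illustrative, not the VALLSS banks.
import random

ARTICLES     = ["The", "A"]
SUBJECTS     = ["pencil", "cloud", "turtle"]
VERBS        = ["sings", "paints", "whispers"]  # third-person singular to agree with the subjects
PREPOSITIONS = ["under", "beside", "through"]
OBJECTS      = ["the quiet ladder", "a purple breakfast", "the sleepy window"]

def nonsense_sentence(rng):
    """Combine one entry from each bank into a syntactically well-formed sentence."""
    return " ".join([
        rng.choice(ARTICLES),
        rng.choice(SUBJECTS),
        rng.choice(VERBS),
        rng.choice(PREPOSITIONS),
        rng.choice(OBJECTS),
    ]) + "."

rng = random.Random(7)
print(nonsense_sentence(rng))  # prints a grammatical but nonsensical sentence, e.g. "A cloud whispers beside the sleepy window."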
To write the sentences for Relational Vocabulary, a list of common vocabulary words used to illustrate the relationship between two or more things (e.g., above/below, more/fewer) was created. Relational terms were selected that could be easily depicted using only the two objects (box, ball) introduced at the beginning of the subtest (and relational terms that were difficult to illustrate were excluded from the subtest).
Processing
Fluent reading requires students to quickly recognize letters and their associated sounds and then blend them together. VLP referenced item types from published instruments and ultimately developed RAN using the Rapid Automatized Naming and Rapid Alternating Stimulus Tests (Wolf & Denckla, 2003) as a guide. Existing PALS data were also referenced to identify the letters most easily recognized by students in Virginia: o, c, z, a, m. These were the only letters used, repeated throughout RAN, in order to reduce the letter-name knowledge burden. See Appendix 4 for more details on the development of the RAN subtest.
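As a simple illustration, the sketch below lays out a RAN array by repeating those five letters. The grid dimensions and the rule against immediate repeats are assumptions for the example, not the actual VALLSS layout, which is described in Appendix 4.

# Hypothetical sketch: laying out a RAN array from the five most easily recognized letters.
# The grid size and the no-adjacent-repeat rule are assumptions, not the VALLSS layout.
import random

RAN_LETTERS = ["o", "c", "z", "a", "m"]

def build_ran_grid(rows=5, cols=10, seed=0):
    """Return a rows x cols grid of RAN letters with no letter repeated back-to-back."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(rows * cols):
        choices = [x for x in RAN_LETTERS if not sequence or x != sequence[-1]]
        sequence.append(rng.choice(choices))
    return [sequence[r * cols:(r + 1) * cols] for r in range(rows)]

for row in build_ran_grid():
    print(" ".join(row))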
Inference 2: The response processes are appropriate for each item.
Assumption 2.a: Response processes are developmentally appropriate for students at each grade level, allowing students to demonstrate the skills and understandings being measured.
VLP selected response processes for measuring language comprehension, code-based literacy, and processing that educators can administer. Abilities are therefore measured through formats common in schools and via items that are transparent about what VALLSS is measuring. For example, cues and responses to phonological awareness items are interpretable as information about literacy ability and are more practical to measure within schools than, for instance, neurological imaging of brain activity in response to certain stimuli.
Items were also written so that the information from a response could be attributed to a unit of language or literacy that is developmentally informative. Phonological items include a range of relatively easier (e.g., repeating the initial sound of a word) and more difficult (e.g., segmenting the phonemes in a word) cues and expected responses. Common characteristics for measuring phonological abilities that could have been manipulated differently include the length of the word and number of phonemes, the position of the sounds, and whether a student must manipulate any sounds (e.g., deletion). Nonetheless, the types of cues and responses elicited remain similar for all items within a subtest, regardless of the construct an item measures. For example, items that require students to demonstrate understanding of grapheme-phoneme correspondence were developed for either Letter Sounds or for word-level decoding. Had items been written to probe or be scored in a way that combined word reading with letter-sound abilities, VALLSS subtests would provide less information to educators about whether phonics instruction should target sub-word or word-level decoding. These relatively fine-grained units of information are easier to use in creating a lesson plan or selecting curricular materials. Conversely, VALLSS items from across subtests, in combination, require responses that ensure enough of the established elements of language and literacy are measured.
After development, analyses were conducted to further ensure that response processes are developmentally appropriate for students at each grade level. Our database of items was refined multiple times, including during the Soft Launch on which we are currently reporting. Data collected during the Soft Launch of VALLSS were used to determine which items should be retained for measuring code-based literacy and language comprehension, for each term and within each grade. For example, analyses supported the use of RAN items in the fall for Grades 1-3, but at mid-year rather than fall for kindergarten. Other analyses, reported in detail in Chapter 3, were used to identify items that serve as relatively weak indicators of the intended construct. Analyses were also used to identify items that were technically redundant so they could be pruned, thus reducing the demands of each form. Prompted by model fit results when Phoneme Segmenting items were included with the rest of the code-based items, the team discussed whether to exclude these items from the vertical scale. It was appropriate from a content standpoint to consider excluding Phoneme Segmenting items because, unlike the rest of the code-based items, they do not rely on the written word. However, it was determined that a literacy assessment without phonological items would be inappropriate given the intended purpose of supporting instructional decisions. Keeping the Phoneme Segmenting items may also increase the face validity of VALLSS. Therefore, Phoneme Segmenting items were retained for VALLSS: K-3 but were excluded from BOR due to patterns of relatively poorer statistical cohesion with the rest of the items that contribute to BOR. The result is that Instructional Indicators are available to educators for Phoneme Segmenting, and the subtest is available as an optional tool when not required.
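For readers unfamiliar with item-screening statistics, the sketch below shows one common approach, the corrected item-total correlation, which flags items that cohere weakly with the rest of a scale. This is offered only as a generic illustration under made-up data; Chapter 3 reports the actual analyses used for VALLSS, which may differ from this statistic.

# Illustrative sketch of one common item-screening statistic (corrected item-total
# correlation). The responses and the 0.20 flag threshold are made up for illustration.
import statistics

def corrected_item_total(scores, item_index, flag_below=0.20):
    """Correlate an item with the total of the remaining items; low values flag weak indicators."""
    item = [resp[item_index] for resp in scores]
    rest = [sum(resp) - resp[item_index] for resp in scores]
    mi, mr = statistics.mean(item), statistics.mean(rest)
    cov = sum((a - mi) * (b - mr) for a, b in zip(item, rest)) / (len(item) - 1)
    r = cov / (statistics.stdev(item) * statistics.stdev(rest))
    return r, r < flag_below

# Example with made-up 0/1 responses from six students on four items.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]
for i in range(4):
    r, weak = corrected_item_total(responses, i)
    print(f"item {i}: r = {r:.2f}{'  <- review' if weak else ''}")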
Assumption 2.b: Students respond to items as intended.
Support for divisions participating in the Soft Launch included access to a VALLSS Hotline and recurring VALLSS Office Hours. Both of these resources provided avenues for VLP to identify systematic challenges in test administration. VLP data collectors also observed student responses firsthand during the Pilot. Students responded to the vast majority of items as intended. That is, given subtest instructions and item-specific stimuli, students generally provided the type of response that could be scored and generally demonstrated understanding of the task at hand. In the rare instances in which there was statistical evidence for doing so, or when assessors reported student responses suggesting unexpected and possibly systematic error, the suspect item was further scrutinized. The VLP team also documented student responses and used these to examine types of errors; suspect patterns that emerged from this documentation led to closer scrutiny of additional items. Scrutiny involved a combination of revisiting the contents of the item, examining descriptive statistics for the item, questioning data collectors, and determining whether the item needed to be excluded from VALLSS.
DRAFT V1 (February 2025)