As the artificial intelligence revolution unfolds around us, many education researchers and practitioners believe that AI will soon lead to highly personalized interventions such as intelligent tutors. In theory, these tools should improve educational progress by responding more accurately to students' needs and engaging them with more relevant learning materials. However, developing AI applications depends on large, high-quality datasets, a standard that is too often not met. That’s because generative AI models are mostly trained on publicly available data, which is opaque, poorly documented, and likely to be biased.
The Institute of Education Sciences (IES), an independent scientific agency within the U.S. Department of Education that I led until March of this year, has a wealth of data that can and should be used to improve our understanding of student learning. This is especially the case for the National Center for Education Statistics (NCES), the statistical unit of the IES that administers the National Assessment of Educational Progress (NAEP).
Through its assessments, NAEP has amassed a vast amount of high-quality data about what students know and can do. (Approximately 500,000 students take the 4th and 8th grade reading and math assessments every other year; tests in other grades and subjects occur less frequently.) Because NAEP assessments are all nationally representative, the data does not reflect only a limited portion of the population, which makes it especially useful for AI in education. Additionally, the data is “labeled.” This means that student responses have already been scored by experienced raters and often include detailed information about the concepts being tested. Over the past five years, well over $700 million in federal funding from U.S. taxpayers (more than $100 million just for questionnaire development) has been used to generate this treasure trove of data. It contains hundreds of thousands of student essays, math practice problems, and civics test answers. These large datasets can help researchers, policymakers, parents, and teachers use the power of AI to improve student learning and performance.
However, this potential is not being realized at the desired speed. Accessing data for research purposes through NCES is currently too difficult. Cumbersome application procedures, bureaucratic hurdles, and slow processes plague researchers and organizations alike. For example, a team of accomplished researchers at Vanderbilt University sought access to three NAEP mathematics datasets for nearly a year. Simply to submit their application for the data, they faced frustrating administrative inefficiencies, including the loss of documents they had been required to mail to multiple people and a refusal to accept electronic signatures.
These problems stem from legacy security policies designed to protect paper records and data stored on CDs (remember those?). This is not the world we live in today.
Many government agencies, including IES, now provide secure remote access to confidential datasets. The Administrative Data Research Facility (ADRF), created by the Coleridge Initiative, is a secure research platform that provides easy access to sensitive and confidential microdata. It offers a model for how data can be protected while still taking advantage of modern cloud infrastructure, which brings improved collaboration, shared computing resources, and other benefits. Researchers can now securely access NAEP and other student data from IES through this virtual enclave. State education agencies, workforce agencies, higher education institutions, and non-profit organizations also use the facility.
Despite these innovations, remote access to IES data remains a bottleneck. Applicants must fill out an outdated form that references “anti-virus software,” locked file cabinets, computers disconnected from the Internet, and other items that originated in a very different era of data storage and research. Now is the time to remove these long-standing barriers and make NCES and NAEP data available faster, in order to accelerate the development of AI for educational purposes.
The first task is a concerted effort to modernize the current secure data application process so that researchers and developers can more easily get the data they need for their projects. A new application system is needed that processes requests online rather than relying on documents submitted by mail. Digital submissions make it possible to automate more of the review, so that low-level errors such as missing signatures are found and corrected early. This frees the trained, well-paid staff who review applications to focus on more substantive issues. It can also accommodate the growth of remote work and multi-agency collaboration by enabling electronic signatures and online collaboration rather than requiring a shared physical space.
In the long term, expanding access means allowing a wider range of acceptable uses of NCES data. NCES appropriately focuses on high-quality data products that support statistical uses of student data while avoiding enforcement, surveillance, or marketing uses. More broadly, IES and its centers primarily collaborate with universities and non-profit organizations. Yet many organizations in the private sector, especially technology companies, are interested in using the data for AI-related purposes. Current systems rarely allow this, in part because of privacy concerns about student data but equally because of a bureaucratic culture that is suspicious of commercial enterprises. Statistical uses can nonetheless be aligned with educational and analytical uses, and there are many privacy-enhancing technologies being developed and deployed by IES and other institutions that NCES could learn from.
Obviously, these changes must be consistent with the Family Educational Rights and Privacy Act (FERPA), which regulates access to education data. Similarly, IES and NCES must protect the privacy of student data as applicable laws require. However, none of the proposed updates to data access processes would weaken the protections that the existing (outdated) systems provide. More broadly, FERPA has too often served as a brake on needed changes to how valuable data can be used. IES should lead efforts to better balance concerns about student privacy with the reality that the nation needs breakthroughs that can come only from access to unbiased and representative data of the kind generated by NAEP.