Use Case Evaluations for AI Adoption in the Department of War (DOW): Humans vs. Large Language Models (LLMs): MITRE, April 7, 2026
From the report: “As Artificial Intelligence Enabled Systems (AIES) become more integrated into the modern workforce, government organizations are also integrating AI capabilities into their workflows. The Department of War (DOW) will require a well-structured and efficient evaluation process to identify and prioritize their use of AI models. For the most efficient application of AI to DOW systems, the evaluation process should start with the examination of candidate AI use cases articulating mission impact and technical risk to inform down selection and investment decisions. Traditionally, use case evaluations are conducted by human subject-matter experts (SMEs), but this evaluation process can be labor and time intensive and variation in the SME’s evaluation can potentially lead to unintended bias. Recent advances and availability of large language models (LLMs) present an opportunity to augment the expert evaluation process to improve speed and repeatability, but LLMs pose their own challenges and their alignment with human assessments in the DOW context has not been examined. This paper conducts an empirical study to compare human experts and LLM-augmented evaluations using a curated set of use cases for AI adoption in DOW. The team provided human SMEs and prompted LLMs with specific guidance on rating each use case on their implementation feasibility and mission impact criteria. The results of the human and LLM-assisted applications were analyzed to study evaluation patterns between the two groups and outliers. Results show that although LLM based evaluation scores differ from human expert evaluators, the LLM produces the same overall prioritization order of the AI use cases as human evaluators. A weak but statistically significant positive association is also observed between human and LLM based evaluations. 
Findings of this study offer insights into the potential uses and limitations of LLMs for structured evaluation tasks and help with defining the best practices for integrating AI-assisted evaluations into DOW pipelines seeking to frame AI use cases in effective and repeatable manner.”
Authors - Islam, Muhammad F.; Yetto, Matt; Pomales, Carol; Schwartz, Peter