Initial Measurement of Data Quality: MITRE, April 1, 2025
From the abstract: “The quality of a dataset is extremely hard to gauge in the age of big data because of the overwhelming amount of data needed to train deep learning models. To estimate the applicability of our data to a deep learning solution, we often must fit our model. This is a time and resource intensive process. The method we present here is a quick triage if the data may not be worth the time or if it may deserve more thorough vetting before fitting a solution. We propose a data quality score that is closely associated with the amount of separability within the data. Our target application is for a large amount of unstructured data such as images and text. We use pre-trained models to do feature generation and use an approximate nearest neighbor solution for speed in understanding the local neighborhoods of data points. In simulation and through examples on well-known toy datasets our method performs as expected and is able to identify when there may be problems when training a classifier.”
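The abstract does not give the score's formula, so the following is only an illustrative sketch of the general idea, not the author's method. It assumes the feature vectors have already been produced by some pre-trained model (here replaced by synthetic 2-D points), uses an exact brute-force neighbor search as a stand-in for an approximate nearest neighbor index, and uses "fraction of neighbors sharing the point's label" as a hypothetical separability proxy. The names `knn_label_agreement` and `two_blobs` are invented for this example.

```python
import math
import random


def knn_label_agreement(features, labels, k=5):
    """Mean fraction of each point's k nearest neighbours that share its label.

    Hypothetical separability proxy, not the score defined in the paper.
    """
    n = len(features)
    if n <= k:
        raise ValueError("need more than k points")
    total = 0.0
    for i in range(n):
        # Exact brute-force search; at scale an approximate
        # nearest-neighbour index would replace this step.
        dists = sorted(
            (math.dist(features[i], features[j]), j) for j in range(n) if j != i
        )
        same = sum(labels[j] == labels[i] for _, j in dists[:k])
        total += same / k
    return total / n


def two_blobs(n_per_class, gap, rng):
    """Synthetic stand-in for pre-trained embeddings: two 2-D Gaussian classes
    whose centres are `gap` apart along the first axis."""
    feats, labs = [], []
    for label, centre in enumerate((0.0, gap)):
        for _ in range(n_per_class):
            feats.append((rng.gauss(centre, 1.0), rng.gauss(0.0, 1.0)))
            labs.append(label)
    return feats, labs


rng = random.Random(0)
# Classes far apart: neighbourhoods should be almost pure.
separated = knn_label_agreement(*two_blobs(100, 10.0, rng))
# Classes on top of each other: neighbourhoods should be mixed.
overlapping = knn_label_agreement(*two_blobs(100, 0.0, rng))
print(f"separated={separated:.2f} overlapping={overlapping:.2f}")
```

Under these assumptions, a score close to 1 suggests that local neighborhoods are dominated by a single class, while a score near chance level (about 0.5 for two balanced classes) suggests heavy class overlap and may flag data worth vetting before a classifier is trained.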
Authors - Kinney, Mitchell J.