Documenting Machine Learning Datasets to Increase Accountability and Inclusivity

13.08.23 11:21 AM · By Ines Almeida

Machine learning models rely heavily on their training datasets and inherit those datasets' biases and limitations. Yet unlike fields such as electronics, where datasheets meticulously detail a component's operating parameters, no standard exists for documenting machine learning datasets. This research proposes "datasheets for datasets" to fill that gap, increasing transparency and mitigating risk.


Datasets exert a significant yet often opaque influence on models. Biased data produces biased AI, as in résumé-screening tools that disadvantaged women. Undocumented datasets also limit reproducibility. The World Economic Forum accordingly recommends documenting datasets to avoid discrimination.

The authors devised an initial set of questions prompting dataset creators to reflect on a dataset's motivation, composition, collection process, preprocessing, uses, distribution, and maintenance. For example: What subpopulations does the data represent? Could it directly or indirectly identify individuals? Might it result in unfair treatment of certain groups?
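To make that structure concrete, here is a minimal, purely illustrative sketch in Python of a datasheet template. The section names follow the paper; the specific questions are a small paraphrased subset, and representing them as a dictionary is a convenience for this post, not a format the authors prescribe.

    # Hypothetical sketch: a datasheet template as a plain Python dict.
    # Section names follow the paper; the questions are a paraphrased subset.
    datasheet_template = {
        "motivation":    ["For what purpose was the dataset created?",
                          "Who created it, and who funded its creation?"],
        "composition":   ["What do the instances represent?",
                          "Does the dataset identify any subpopulations?",
                          "Could individuals be identified, directly or indirectly?"],
        "collection":    ["How was the data acquired?",
                          "Were the people involved aware of the collection?"],
        "preprocessing": ["What cleaning or labelling was applied?",
                          "Is the raw data still available?"],
        "uses":          ["What are the intended uses?",
                          "Are there tasks the dataset should not be used for?"],
        "distribution":  ["How will the dataset be shared, and under what license?"],
        "maintenance":   ["Who maintains the dataset, and how are errata handled?"],
    }

    # A creator fills in answers; a consumer reads them before training on the data.
    answers = {section: dict.fromkeys(questions, "TODO")
               for section, questions in datasheet_template.items()}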


They refined the questions by drafting sample datasheets and gathering feedback from company pilots. A legal review led to removing explicit regulatory-compliance questions. The goal is to encourage thoughtful dataset creation rather than check-box documentation.


Published datasheets have already enabled better model development. At Microsoft, reviewing a datasheet surfaced gender-imbalanced data; adding synthetic, balanced data then produced more equitable models. Datasheets have also helped researchers select appropriate datasets.
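As a purely illustrative sketch, this is the kind of quick composition check that a datasheet review might prompt. The file name, column name, and threshold below are assumptions for illustration, not details from the study.

    import pandas as pd

    # Hypothetical check: how much of the data does each self-reported gender group contribute?
    df = pd.read_csv("resumes.csv")  # assumed file with a "gender" column
    shares = df["gender"].value_counts(normalize=True)
    print(shares)

    # Flag groups falling below an arbitrary share of the data.
    underrepresented = shares[shares < 0.30]
    if not underrepresented.empty:
        print("Underrepresented groups:", list(underrepresented.index))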


Challenges remain in adapting datasheets to other domains and to dynamic, frequently updated datasets. Creating them imposes overhead on dataset creators, and datasheets alone cannot identify every bias or misuse. When data involves people, the authors recommend combining datasheets with outside expertise, such as that of social scientists.


Still, datasheets facilitate communication between dataset creators and users, increasing intentionality in dataset development. Documenting datasets transparently also aids accountability and quality. For businesses, documenting training data provides reputational benefits and helps ensure models reflect corporate values. Proactively auditing algorithms and data is crucial as AI proliferates in customer-facing systems.


More broadly, the paper argues datasheets exemplify responsible data science in service of justice. Much like nutrition labels detail food ingredients for healthier eating, datasheets elucidate datasets' impacts. The authors urge centering inclusion and examining unintended consequences in design.

Responsible innovation requires transcending technical fixes to ask philosophical questions. What societal outcomes do we want these technologies to help achieve? How can AI empower communities historically excluded by “neutral” systems? Keeping diverse perspectives at the heart of development is key.


While datasheets provide a practical tool for dataset creators and consumers, their significance is as much symbolic as practical. They represent taking an intentional stance on transparency and equity in an opaque, high-stakes field. For businesses navigating AI's rise, this study offers inspiration to wield these technologies thoughtfully for human progress.


Sources:

Datasheets for Datasets, by Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford
