<?xml version="1.0" encoding="UTF-8" ?><!-- generator=Zoho Sites --><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><atom:link href="https://www.nownextlater.ai/Insights/tag/dataset-development/feed" rel="self" type="application/rss+xml"/><title>Now Next Later AI - Blog #Dataset development</title><description>Now Next Later AI - Blog #Dataset development</description><link>https://www.nownextlater.ai/Insights/tag/dataset-development</link><lastBuildDate>Wed, 26 Nov 2025 21:38:21 +1100</lastBuildDate><generator>http://zoho.com/sites/</generator><item><title><![CDATA[The Emerging Task of Measuring AI Training Data]]></title><link>https://www.nownextlater.ai/Insights/post/the-emerging-task-of-measuring-ai-training-data</link><description><![CDATA[A new perspective paper argues for "measuring data" as a critical task to advance responsible AI development. Just as physical objects can be measured, data used to train AI systems should also be quantitatively analyzed to understand its composition.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_ZvHxJFMoSq-98yCwxYPHgA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_tZt48zXQTsKXT1f1oGLSlw" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_MoNhZsgtRQqy0iGLR8-9XQ" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_EL9akuHOsBh9OiTwHe18IA" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_EL9akuHOsBh9OiTwHe18IA"] .zpimage-container figure img { width: 500px ; height: 399.67px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_EL9akuHOsBh9OiTwHe18IA"] .zpimage-container figure img { width:500px ; height:399.67px ; } } @media (max-width: 767px) { [data-element-id="elm_EL9akuHOsBh9OiTwHe18IA"] .zpimage-container figure img { width:500px ; height:399.67px ; } } [data-element-id="elm_EL9akuHOsBh9OiTwHe18IA"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-13%20at%208.11.45%20pm.png" width="500" height="399.67" loading="lazy" size="medium" alt="Examples of different data measurements proposed in image- and language-based data science and machine learning, alongside analogs in the physical sciences." data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_3kksYDX2SDSTKy-adgtOWQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_3kksYDX2SDSTKy-adgtOWQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>A new perspective paper argues for &quot;measuring data&quot; as a critical task to advance responsible AI development. Just as physical objects can be measured, data used to train AI systems should also be quantitatively analyzed to understand its composition.</p><p><br></p><p>The authors propose formalizing data measurement as a core research area. They categorize different measurement types like distance, density, and diversity. Distance metrics like Euclidean distance capture similarities in the data. Density reflects how densely certain concepts are represented. <span style="font-family:&quot;Questrial&quot;, sans-serif;">Diversity measures heterogeneity.</span></p><p><br></p><p>The paper summarizes measurements proposed across computer vision, natural language processing, and general data science. For instance, perplexity scores indicate how predictable text data is. Image density models visual patterns. Word mover's distance compares document similarity based on word embeddings.</p><p><br></p><p>The authors argue precise data measurements will let practitioners better curate datasets and understand what models learn. Characterizing training data is crucial for responsible AI. Metrics can reveal underrepresentation of groups or embed problematic biases.</p><p><br></p><p>For business leaders deploying AI, this paper provides a framework to evaluate data rigorously before training models. Combining multiple measurements will enable deeply understanding datasets. Just as standardized physical measurements enabled advances in science, quantifying data properties will drive progress in AI. Insisting models be trained on measured data is vital for ethical AI.</p><p><br></p><p>Sources:</p><div style="color:inherit;"><div><div><div><div><p><a href="https://www.researchgate.net/publication/366212996_Measuring_Data" title="Measuring Data " rel="">Measuring Data </a></p><p></p></div>
</div></div></div></div><p></p></div><p></p></div></div></div></div></div></div></div>
 ]]></content:encoded><pubDate>Sun, 13 Aug 2023 20:14:16 +1000</pubDate></item><item><title><![CDATA[Making Data Work More Visible Through Documentation]]></title><link>https://www.nownextlater.ai/Insights/post/making-data-work-more-visible-through-documentation</link><description><![CDATA[A new study provides insights into the complex processes and people behind ML data work.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_TwxavckZQA29ypYTg48xKQ" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_7iuWKqXDRi-wZLEV3mpYEA" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_ri8OmA1yRMSzjyala3dkJA" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_JLlVHh2c5kXu79e-4LkG7Q" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_JLlVHh2c5kXu79e-4LkG7Q"] .zpimage-container figure img { width: 800px ; height: 492.50px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_JLlVHh2c5kXu79e-4LkG7Q"] .zpimage-container figure img { width:500px ; height:307.81px ; } } @media (max-width: 767px) { [data-element-id="elm_JLlVHh2c5kXu79e-4LkG7Q"] .zpimage-container figure img { width:500px ; height:307.81px ; } } [data-element-id="elm_JLlVHh2c5kXu79e-4LkG7Q"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-large zpimage-tablet-fallback-large zpimage-mobile-fallback-large hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-13%20at%2012.29.34%20pm.png" width="500" height="307.81" loading="lazy" size="large" alt="Documenting Data Production Processes: A Participatory Approach for Data Work" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_c4Hd3kcRQaWRmKFCcKoDSA" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_c4Hd3kcRQaWRmKFCcKoDSA"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>Machine learning (ML) systems are becoming increasingly prevalent, but few business leaders understand how the data used to train these systems is produced. A new study provides insights into the complex processes and people behind ML data work.</p><p><br></p><p>The study was conducted by researchers who partnered with two business process outsourcing (BPO) companies that employ human workers to collect and label data used for training machine learning systems. Through interviews and design workshops with data workers, managers, and clients, the researchers aimed to understand how documentation practices could be improved to capture the context and processes involved in producing ML training data.</p><p><br></p><p>Key findings:</p><ul><li>Data workers have limited visibility into the overall goals and uses of the projects they work on. Better documentation that explains the client, project timeline, and intended uses of the final dataset could improve coordination and workers’ sense of meaning.</li><li>Data work involves collaboration between many geographically dispersed actors, but communication tends to be top-down. Enabling feedback loops in documentation could foster more participatory, iterative processes between clients, managers, and data workers.</li><li>Data work is dynamic, but documentation practices are often static. Tools like revision histories and timeline tracking could better reflect the evolving nature of projects and datasets.</li><li>Worker needs are not fully represented in current documentation. Including details like task instructions, payment terms, and ethical considerations could make documentation more worker-centric.</li></ul><p><br></p><p>The researchers conclude that &quot;data production documentation should travel across multiple stakeholders and facilitate their communication.&quot; Viewing documentation as a &quot;boundary object&quot; that adapts to diverse needs while maintaining integrity could enable more participatory and equitable data production processes.</p><p><br></p><p>Key implications for business leaders:</p><ul><li>Seek more visibility into how your ML training data is produced and provide input to vendors on documentation needs.</li><li>Recognize and engage data workers as key collaborators, not just low-level laborers. Their feedback can improve dataset quality.</li><li>Set documentation policies that make values like transparency, participation, and fairness explicit. This also mitigates business risks.</li><li>Documentation supports coordination and accountability. Invest in tools and practices that reflect the dynamic reality of data work.</li></ul><div><br><span style="color:inherit;"></span></div><div><div><span style="color:inherit;font-size:14px;"><span style="font-weight:400;text-indent:0px;">Machine Learning's pervasive growth underscores the importance of understanding its foundational data. This study underscores that the intricate human processes behind data collection and labeling are often underappreciated and misunderstood. 
As businesses increasingly rely on ML systems, there's a compelling need for enriched documentation that facilitates comprehensive communication across stakeholders, recognizes the invaluable contribution of data workers, and remains dynamic to the evolving nature of data projects. By viewing documentation as an adaptable &quot;boundary object,&quot; businesses can champion a more inclusive, transparent, and effective ML data production process, ensuring both operational excellence and ethical integrity.</span></span></div></div><div><span style="color:inherit;"><span style="font-size:16px;font-weight:400;text-indent:0px;"><br></span></span></div><div>Sources:</div><div><div><span style="color:inherit;"><a href="https://arxiv.org/abs/2207.04958" title="Documenting Data Production Processes: A Participatory Approach for Data Work" rel="">Documenting Data Production Processes: A Participatory Approach for Data Work</a><br>Milagros Miceli, Tianling Yang, Adriana Alvarado Garcia, Julian Posada, Sonja Mei Wang, Marc Pohl, Alex Hanna</span></div></div></div><p></p></div>
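<p>As a sketch of what a more participatory, living document could look like, the following Python dataclasses model a data production record with a revision history. The field names echo the study's findings; the structure itself is a hypothetical illustration, not an artifact from the paper.</p><pre>
# Hypothetical sketch of a data-production document as a versioned
# record. Fields mirror the study's findings (client, timeline,
# intended uses, instructions, payment terms, feedback, revisions).
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Revision:
    author: str         # who changed the document: worker, manager, or client
    timestamp: datetime
    summary: str        # what changed and why

@dataclass
class DataProductionDoc:
    client: str                   # who commissioned the dataset
    intended_uses: list[str]      # what the final dataset is for
    timeline: str                 # project schedule, visible to workers
    task_instructions: str        # collection and labeling instructions
    payment_terms: str            # worker-facing compensation details
    feedback: list[str] = field(default_factory=list)       # bottom-up notes
    revisions: list[Revision] = field(default_factory=list)

    def record_change(self, author, summary):
        """Append to the revision history instead of overwriting."""
        self.revisions.append(Revision(author, datetime.now(), summary))

# Example use: the document evolves as the project does.
doc = DataProductionDoc(
    client="Acme Corp",
    intended_uses=["training a sentiment classifier"],
    timeline="Q3 labeling, Q4 delivery",
    task_instructions="Label each review as positive, negative, or neutral.",
    payment_terms="Per-task rate, reviewed quarterly",
)
doc.feedback.append("Instructions are ambiguous for sarcastic reviews.")
doc.record_change("annotation lead", "Clarified guidance on sarcasm.")
</pre>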
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Sun, 13 Aug 2023 12:31:31 +1000</pubDate></item><item><title><![CDATA[Examining How AI Training Datasets Are Built: A Framework for More Responsible Practices]]></title><link>https://www.nownextlater.ai/Insights/post/responsible-ai-starts-with-understanding-dataset-development</link><description><![CDATA[In a recent paper, researchers Mehtab Khan and Alex Hanna highlight the need for greater scrutiny, transparency, and accountability in how massive datasets for machine learning models are created.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_BrJ1XTJ-QCynLbFRRN4wjg" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_ICxg0lAsTzqAGytnGKaiPA" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_U_414SkTTsmJ7snzakDfiw" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_ubaW648SWMcPGPAG6CbEAQ" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_ubaW648SWMcPGPAG6CbEAQ"] .zpimage-container figure img { width: 500px ; height: 370.48px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_ubaW648SWMcPGPAG6CbEAQ"] .zpimage-container figure img { width:500px ; height:370.48px ; } } @media (max-width: 767px) { [data-element-id="elm_ubaW648SWMcPGPAG6CbEAQ"] .zpimage-container figure img { width:500px ; height:370.48px ; } } [data-element-id="elm_ubaW648SWMcPGPAG6CbEAQ"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-13%20at%2011.48.34%20am.png" width="500" height="370.48" loading="lazy" size="medium" alt="A breakdown of which group is susceptible to harm at every stage of the AI dataset and model development stage." data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_JnFD1VaJToWpwjPIS27WwA" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_JnFD1VaJToWpwjPIS27WwA"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><div style="color:inherit;"><div style="color:inherit;"><p>As artificial intelligence (AI) and machine learning technologies become more widely adopted across sectors, businesses must prioritize responsible and ethical implementation. A critical yet often overlooked area is how AI training datasets are constructed behind the scenes.</p><p><br></p><p>In a recent paper, researchers Mehtab Khan and Alex Hanna highlight the need for greater scrutiny, transparency, and accountability in how massive datasets for machine learning models are created. Their analysis proposes breaking the process into distinct stages and identifying affected individuals.</p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;">Legally Gray Areas Around Data Collection</span></p><p><br></p><p>The paper notes that current copyright law is ambiguous regarding reproducing large volumes of text, images, and other data for assembling AI training datasets through scraping websites and other sources. While fair use exemptions may technically apply in some cases, the legal boundaries are highly complex. This gray area allows widespread copying to build datasets, but concerning practices could still raise issues like copyright infringement.</p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;">Emerging Individual and Collective Privacy Concerns</span></p><p><br></p><p>Collecting data en masse, even from publicly available websites, risks violating the privacy of individuals included without their consent. But the paper emphasizes privacy issues extend beyond individual data collection. Biases and unfair representations affecting certain populations also clearly emerge when examining dataset design choices.</p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;">A Framework for the Entire Development Pipeline</span></p><p><br></p><p>To illuminate interconnected problems throughout the process, the researchers propose systematically analyzing each stage of assembling training datasets. This includes initial problem definition, data collection and cleaning, annotating examples, model building and evaluation, and distribution. Examining how different stakeholders are impacted at each phase can uncover ethical issues.</p><p><br></p><p>The paper argues this framework highlights where greater transparency, oversight, and accountability are most needed. It can also inform policies and self-regulation. Documenting datasets is one step, but comprehensive governance of the full development lifecycle is required for responsible AI.</p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;">Proactive Policies Needed</span></p><p><br></p><p>Understanding training data origins is no simple task but crucial for mitigating risks like biases, discrimination, and improper data usage. While technical teams focus on accuracy, leaders must invest in responsible data sourcing and stewardship. 
<p><br></p><p>Sources:</p><p><span style="color:inherit;"><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4217148" title="The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability" rel="">The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability</a></span></p><p></p><div style="color:inherit;"><div><div><div><p>Mehtab Khan, Alex Hanna</p></div>
</div></div></div><p></p></div></div></div><p></p></div></div></div></div></div></div>
</div> ]]></content:encoded><pubDate>Sun, 13 Aug 2023 11:51:01 +1000</pubDate></item><item><title><![CDATA[Documenting Machine Learning Datasets to Increase Accountability and Inclusivity]]></title><link>https://www.nownextlater.ai/Insights/post/documenting-machine-learning-datasets-to-increase-accountability-and-inclusivity</link><description><![CDATA[Machine learning models rely heavily on their training datasets, inheriting their biases and limitations. This research proposes "datasheets for datasets" to increase transparency and mitigate risks.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_hdGjxQLpRvyyJP2Az_eb2g" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_tjAkd_ubTX2QsxUNfgdPwQ" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_XhYkB4p6SlKLQ7Yu6XRaWQ" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_yOi6ysW-SeK00iw4ZjhFuw" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_yOi6ysW-SeK00iw4ZjhFuw"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>Machine learning models rely heavily on their training datasets, inheriting their biases and limitations. Yet unlike fields such as electronics, where datasheets meticulously detail components' operating parameters, no standards exist for documenting machine learning datasets. This research proposes &quot;datasheets for datasets&quot; to fill that gap, increasing transparency and mitigating risks.</p><p><br></p><p>Datasets impart significant yet often opaque influences on models. Biased data produces biased AI, like résumé screening tools that disadvantaged women. Undocumented datasets also limit reproducibility. The World Economic Forum thus recommends documenting datasets to avoid discrimination.</p><p>The authors devised an initial set of questions for dataset creators to reflect on a dataset's motivation, composition, collection, processing, uses, distribution, and maintenance. For example: What subpopulations does the data represent? Could it directly or indirectly identify individuals? Might it result in unfair treatment of certain groups?</p><p><br></p><p>They refined the questions based on creating sample datasheets and on feedback from pilots at several companies. A legal review led to removing explicit regulatory compliance questions. The goal is to encourage thoughtful dataset creation rather than check-box documentation.</p><p><br></p><p>Published datasheets have already enabled better model development. At Microsoft, reviewing datasheets surfaced imbalanced gender data, and adding synthetic, balanced data produced more equitable models. Datasheets have also helped researchers select appropriate datasets.</p><p><br></p><p>Challenges remain in adapting datasheets to other domains and to dynamic datasets. Implementation imposes overhead on creators. And datasheets alone cannot identify every dataset bias or misuse; pairing them with outside expertise, such as that of social scientists, is recommended when data involves people.</p><p><br></p><p>Still, datasheets facilitate communication between creators and users, increasing intentionality in dataset development.</p>
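<p>As a concrete illustration, the questions above can be organized into a simple template keyed by the seven sections the paper proposes. The Python sketch below uses paraphrased example questions and is a minimal outline, not the paper's full instrument.</p><pre>
# Minimal datasheet template keyed by the paper's seven sections.
# Example questions are paraphrased; see the paper for the full set.
DATASHEET_SECTIONS = {
    "motivation":    ["For what purpose was the dataset created?"],
    "composition":   ["What subpopulations does the data represent?",
                      "Could it directly or indirectly identify individuals?"],
    "collection":    ["How was the data acquired, and with what consent?"],
    "preprocessing": ["What cleaning or labeling was applied?"],
    "uses":          ["Might any use result in unfair treatment of certain groups?"],
    "distribution":  ["Under what license will the dataset be shared?"],
    "maintenance":   ["Who will support and update the dataset?"],
}

def blank_datasheet():
    """Return an empty datasheet keyed by section and question."""
    return {section: {question: None for question in questions}
            for section, questions in DATASHEET_SECTIONS.items()}

# Example use: dataset creators fill in answers section by section.
sheet = blank_datasheet()
sheet["motivation"]["For what purpose was the dataset created?"] = (
    "Training a sentiment classifier for product reviews."
)
</pre>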
<p>Transparency of this kind also helps distinguish datasets, supporting accountability and quality. For businesses, documenting training data provides reputational benefits and helps ensure models reflect corporate values. Proactively auditing algorithms and data is crucial as AI proliferates in customer-facing systems.</p><p><br></p><p>More broadly, the paper argues datasheets exemplify responsible data science in service of justice. Much like nutrition labels detail food ingredients for healthier eating, datasheets elucidate datasets' impacts. The authors urge centering inclusion and examining unintended consequences in design.</p><p>Responsible innovation requires transcending technical fixes to ask philosophical questions: What societal outcomes do we want these technologies to help achieve? How can AI empower communities historically excluded by “neutral” systems? Keeping diverse perspectives at the heart of development is key.</p><p><br></p><p>While datasheets give dataset creators and consumers a practical tool, their significance is as much symbolic as practical. They represent an intentional stance on transparency and equity in an opaque, high-stakes field. For businesses navigating AI's rise, this study offers inspiration to wield these technologies thoughtfully for human progress.</p><p><br></p><p>Sources:</p><p><span style="color:inherit;"><a href="https://dl.acm.org/doi/10.1145/3458723" title="Datasheets for datasets" rel="">Datasheets for datasets</a>, Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, Kate Crawford</span></p><p></p><p></p></div><p></p></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Sun, 13 Aug 2023 11:21:07 +1000</pubDate></item></channel></rss>