“Everyone wants to do the model work, not the data work”: 
Data Cascades in High-Stakes AI
Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, Lora Aroyo
[nithyasamba,kapania,hhighfill,dakrong,pkp,loraa]@google.com
Google Research
Mountain View, CA
ABSTRACT

AI models are increasingly applied in high-stakes domains like health and conservation. Data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact on predictions like cancer detection, wildlife poaching, and loan allocations. Paradoxically, data is the most under-valued and de-glamorized aspect of AI. In this paper, we report on data practices in high-stakes AI, from interviews with 53 AI practitioners in India, East and West African countries, and the USA. We define, identify, and present empirical evidence on Data Cascades—compounding events causing negative, downstream effects from data issues—triggered by conventional AI/ML practices that undervalue data quality. Data cascades are pervasive (92% prevalence), invisible, delayed, but often avoidable. We discuss HCI opportunities in designing and incentivizing data excellence as a first-class citizen of AI, resulting in safer and more robust systems for all.
 
1 INTRODUCTION

Data is the critical infrastructure necessary to build Artificial Intelligence (AI) systems. Data largely determines the performance, fairness, robustness, safety, and scalability of AI systems. Paradoxically, for AI researchers and developers, data is often the least incentivized aspect of the work, viewed as ‘operational’ relative to the lionized work of building novel models and algorithms. Intuitively, AI developers understand that data quality matters, and often spend inordinate amounts of time on data tasks. In practice, however, most organizations fail to create or meet any data quality standards, a consequence of under-valuing data work vis-à-vis model development. The under-valuing of data work is common to all of AI development. We pay particular attention to the under-valuing of data in high-stakes domains that have safety impacts on living beings, for several reasons. One, developers are increasingly deploying AI models in complex, humanitarian domains, e.g., in maternal health, road safety, and climate change. Two, poor data quality in high-stakes domains can have outsized effects on vulnerable communities and contexts. As Hiatt et al. argue, high-stakes efforts are distinct from serving customers; these projects work with and for populations at risk of a litany of horrors. As examples, poor data practices reduced accuracy in IBM’s cancer treatment AI and led Google Flu Trends to over-estimate the flu peak by 140%. Three, high-stakes AI systems are typically deployed in low-resource contexts with a pronounced lack of readily available, high-quality datasets. Applications extend into communities that live outside of modern data infrastructure, or where everyday functions are not yet consistently tracked, e.g., walking distances to gather water in rural areas—in contrast to, say, click data.
 
Finally, high-stakes AI is more often created at the intersection of two or more disciplines, for example, AI and diabetic retinopathy, leading to greater collaboration challenges among stakeholders across organizations and domains. Despite the above factors, data quality issues in AI are currently addressed with the wrong tools, created for and fitted to other technology problems: they are approached as a database problem, a legal compliance issue, or a licensing deal. HCI and CSCW scholarship has long examined the practices of collaboration, problem formulation, and sensemaking among the humans behind datasets, including data collectors and scientists, and has designed computational artefacts for dataset development. Our research extends this scholarship by empirically examining the data practices and challenges of high-stakes AI practitioners whose systems impact vulnerable groups. We report results from a qualitative study of practices and structural factors among 53 AI practitioners in India, the US, and East and West African countries, applying AI to high-stakes domains including landslide detection, suicide prevention, and cancer detection. Our research aimed to understand how practitioners conceptualized and navigated the end-to-end AI data life cycle.

In this paper, we define and identify Data Cascades: compounding events causing negative, downstream effects from data issues, resulting in technical debt over time. In our study, data cascades were widely prevalent: 92% of AI practitioners reported experiencing one or more, and 45.3% reported two or more cascades in a given project. Data cascades often resulted from applying conventional AI practices that undervalue data quality. For example, eye disease detection models, trained on noise-free data to maximize model performance, failed to predict the disease in production when images contained small specks of dust. Data cascades were opaque and delayed, with poor indicators and metrics. Cascades compounded downstream of the models into major negative impacts, such as costly iterations, discarded projects, and harm to communities. Cascades were largely avoidable through intentional practices. The high prevalence of fairly severe data cascades points to a larger problem of broken data practices, methodologies, and incentives in the field of AI. Although the AI/ML practitioners in our study were attuned to the importance of data quality and displayed deep moral commitment to vulnerable groups, data cascades were disturbingly prevalent even in the high-stakes domains we studied.
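As a concrete illustration of this failure mode, consider the following minimal sketch in Python. It is purely illustrative and not drawn from any participant's system: synthetic feature vectors stand in for retinal images, and the add_specks function is a hypothetical stand-in for dust artifacts. A classifier trained only on clean data is scored once on a clean test set and once on a "speckled" copy, exposing the accuracy drop that would otherwise surface only in production.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for images: 64-dimensional feature vectors, two classes.
n_samples, n_features = 2000, 64
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = (X @ true_w > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model is trained exclusively on clean, artifact-free data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def add_specks(X, frac=0.05, magnitude=5.0):
    # Corrupt a random fraction of feature values, mimicking dust on a lens.
    X = X.copy()
    mask = rng.random(X.shape) < frac
    X[mask] += magnitude * rng.normal(size=int(mask.sum()))
    return X

print(f"clean test accuracy:     {model.score(X_test, y_test):.2f}")
print(f"corrupted test accuracy: {model.score(add_specks(X_test), y_test):.2f}")

Evaluating on deliberately perturbed copies of the test set, as in the final line, is one lightweight way such a cascade could be surfaced before deployment rather than discovered on live patients.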
 
Additionally, our results point to serious gaps in what AI practitioners were trained and equipped to handle, in the form of tensions in working with field partners and application-domain experts, and in understanding the human impacts of models—a serious problem as AI developers seek to deploy in domains where governments, civil society, and policy makers have historically struggled to respond. The prevalence of data cascades points to the contours of a larger problem: residual conventions and perceptions in AI/ML drawn from worlds of ‘big data’—of abundant, expendable digital resources and worlds in which one user has one account; of model valorization; of moving fast to proof-of-concept; and of viewing data as grunt work in ML workflows. Taken together, our research underscores the need for data excellence in building AI systems: a shift to proactively considering care, sanctity, and diligence in data as valuable contributions in the AI ecosystem. Any solution needs to take into account the social, technical, and structural aspects of the AI ecosystem, which we discuss in this paper.
 
Our paper makes three main contributions:
(1) Conceptualizing and documenting data cascades, their characteristics, and impact on the end-to-end AI lifecycle, drawn from an empirical study of data practices of international AI practitioners in high-stakes domains.
(2) Empirical evidence of the urgent need for structural change in AI research and development to incentivize care in data excellence, drawn from our case study of high-stakes AI.
(3) Implications for HCI: we highlight an under-explored but significant new research path for the field in creating interfaces, processes, and policy for data excellence in AI.