Dublin, July 13, 2020: A study by a researcher from Lero, the SFI Research Centre for Software, and UCD’s Complex Software Lab has resulted in the withdrawal of an 80-million-image library that has until now been used to train Artificial Intelligence (AI) and Machine Learning (ML) systems. The research by PhD candidate Abeba Birhane found that images in academic datasets used to develop AI systems and applications were contaminated with racist, misogynistic and other offensive labels and slurs.

Already, MIT has deleted its much-cited '80 Million Tiny Images' dataset, asking researchers and developers to cease using the library to train AI and ML systems. MIT’s decision came as a direct result of the research carried out by Abeba Birhane and Vinay Prabhu, Chief Scientist at UnifyID (a privacy start-up in Silicon Valley).

In the course of the work, Ms Birhane found the MIT database contained thousands of images labelled with racist and misogynistic insults and derogatory terms. Ms Birhane said linking images to slurs and offensive language infuses prejudice and bias into AI and ML models, perpetuating stereotypes and prejudices and inflicting unprecedented and incalculable harm on those already on the margins of society.

“Not only is it unacceptable to label people’s images with offensive terms without their awareness and consent, training and validating AI systems with such dataset raises grave problems in the age of ubiquitous AI. Face recognition systems built on such dataset embed harmful stereotypes and prejudices. When such systems are deployed into the real-world – in security, hiring, or policing systems – the consequences are dire, resulting in individuals being denied opportunities or labelled as a criminal. More fundamentally, the practice of labelling a person based on their appearance risks reviving the long discredited pseudoscientific practice of physiognomy,” Ms Birhane said.

While the 80 Million Tiny Images dataset is one of the Large Scale Vision Datasets (LSVD), there are many others in use around the world. “Lack of scrutiny has played a role in the creation of monstrous and secretive datasets without much resistance, prompting further questions such as: what other secretive datasets currently exist hidden and guarded under the guise of proprietary assets,” she said.

The researchers also found that all of the images used to populate the datasets examined were 'non-consensual' images, among them images of children, scraped from seven image search engines, including Google. “In the age of Big Data, the fundamentals of informed consent, privacy, or agency of the individual have gradually been eroded. Institutions, academia, and industry alike, amass millions of images of people without consent and often for unstated purposes under the guise of anonymisation, a claim that is both ephemeral and vacuous,” the research team argues.

The researchers said their goal is to raise awareness in the AI and ML community of the severity of the threats posed by ill-considered datasets, and of the direct and indirect impact of the community's work on society, especially on vulnerable groups. “From the questionable ways images were sourced, to troublesome labelling of people in images, to the downstream effects of training AI models using such images, large scale vision datasets may do more harm than good. We believe radical ethics that challenge deeply ingrained traditions need to be incentivised and rewarded in order to bring about a shift in culture that centres justice and the welfare of disproportionately impacted communities. I would urge the machine learning community to pay close attention to the direct and indirect impact of our work on society, especially on vulnerable groups,” Ms Birhane concluded.

The full paper (under review) 'Large image datasets: A pyrrhic win for computer vision?' is available to view here.