Motherboard has an article up called "AI Is Probably Using Your Images and It's Not Easy to Opt Out." It does a good job of laying out which images end up referenced in the datasets used to train algorithms, especially for text-to-image generators like Stable Diffusion. I think for a lot of folks who read this newsletter, it won't be a surprise that if an image is available publicly on the internet, it may be scraped and used in a dataset.
And we should be clear about what that means. Here's how it works for LAION, the Large-scale Artificial Intelligence Open Network, a non-profit organization making "large-scale machine learning models, datasets and related code available to the general public."
First of all, the image files themselves are not stored in the dataset. LAION accesses the images in order to compute a similarity score between each picture and its alt text, but it doesn't keep the images. So the dataset includes links, descriptions, and scores, but no image files.
When an organization uses the LAION dataset, it must re-download the images if it needs more than the links, descriptions, and scores. If an image is no longer at its link, it simply can't be retrieved.
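To make that concrete, here's a minimal sketch in Python of what working with a dataset like this looks like. The field names and URL are illustrative, not LAION's exact schema; the point is that each entry is a link plus metadata, and anyone who wants the actual pixels has to fetch them fresh.

```python
import requests

# Illustrative only: field names and values are made up, not LAION's
# actual schema. The dataset row holds a link, a description, and a
# score, but never the image bytes themselves.
row = {
    "url": "https://example.com/photos/cat.jpg",
    "text": "a tabby cat sleeping on a windowsill",  # the alt text
    "similarity": 0.37,  # image/text similarity score
}

# A downstream user who needs the actual image must re-download it.
# If the image has since been removed from the source, the row yields
# nothing usable.
response = requests.get(row["url"], timeout=10)
image_bytes = response.content if response.ok else None
```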
But there are things on the internet that you may not realize are available. Motherboard notes a couple of examples of medical images referenced in the dataset where the patients, and in some cases the doctors, were unaware the images had been posted publicly.
One possible avenue for such publication is medical journals, where doctors may not have considered that the images would be posted publicly online as well as on paper. Other possibilities include improperly protected medical websites.
This all points to a growing controversy over the ethics and legality of using publicly available data for AI. As we've discussed before, it's arguably fair use, since the photos are not stored and the use made by the algorithm is clearly transformative. But the public perception may be that it's unfair: when someone posted a picture on their blog, they didn't intend for it to be used to train an algorithm. And going further, when a doctor took a picture of someone's rash, neither party may have expected it to be publicly available. But is there harm in these uses? If so, what is it? And if there is harm, who should be on the hook for it? The websites, the government, the algorithm makers, the algorithm users?
Tom’s Thoughts
I wouldn't say there isn't harm, but I'd like to know more about what harm can result that wouldn't result anyway. The Motherboard article cites biased algorithms, but arguably bias might be made worse by restricting datasets. It also references disturbing images as a problem. While I don't want to see images of beheadings, an algorithm training on disturbing images is not the same problem as exposing them to people. The concern could be that the algorithm outputs disturbing images if it is trained on them. But if properly developed, those disturbing images could be used to train the algorithm not to output such images. So it may be necessary to include them in datasets.
The objection I feel is valid is that I didn't give my permission for the image to be included. The counter to this argument is that if you made it public, you can't be mad when someone publicly accesses it. And let's leave aside the examples of images you didn't know would be made public. Even if you knew you were making an image public, were you aware of the possibility of it being used to train algorithms? Would you make a different decision if you knew that might happen? I don't think we need to approve every possible use of something we make public, but I do think this is a big enough concern for people that maybe we need to treat it differently than other mundane uses.
I am a believer in people being in control of their data. And I'm a believer in systems that put the control and the responsibility on the provider of data. I've argued, for example, that a robots.txt file telling a search engine whether it is allowed to index a website is enough. So I am not very sympathetic to arguments that search engines should not index websites; the website owner can control that if they want. Perhaps, then, what we need is a robots.txt for personal images: a standard that lets users choose to make their images available for training datasets and defaults to not allowing them. Creative Commons offers a path. In fact, LAION states that it chooses Creative Commons licensed works to scrape. But even then, neither the license nor the person using it may have contemplated this use. So to be safe, an extension to CC, or some complementary system, could be used to provide this permission.
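Here's a short Python sketch of how that default-deny permission layer might work. The /ai-training.txt file name and its one-prefix-per-line format are invented for illustration; nothing like it is a standard today. The key design choice is the default: no file published means no permission.

```python
import urllib.request

# Hypothetical: "/ai-training.txt" and its format are invented for this
# sketch. A site owner opts image paths IN by listing path prefixes,
# one per line; absence of the file means images may not be used.
def may_use_for_training(site: str, image_path: str) -> bool:
    try:
        with urllib.request.urlopen(f"{site}/ai-training.txt", timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return False  # no opt-in file published: default is "do not use"
    allowed = [
        line.strip()
        for line in body.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    return any(image_path.startswith(prefix) for prefix in allowed)

# e.g. may_use_for_training("https://example.com", "/photos/") is False
# unless example.com has published an opt-in file covering that path.
```

Note this inverts robots.txt's default, where an absent file means everything is allowed; for personal images, the safer default runs the other way.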
That way the responsibility can be placed where I think it is best placed: on the owner of an image. And if the owner of the image does nothing, then their image won't be used, so it's not placing a burden on image owners. This would reduce the available data, but it would put control in the hands of users, not organizations, and encourage an informed decision before an image is contributed to a dataset.
Then the problem of images making their way into the public when they should not could also be addressed. I think it may be a rarer occurrence than the Motherboard article implies, but it's worth cracking down on nonetheless.
There are certainly holes in my idea. But I'd ask that you not *just* poke holes in it, but provide patches as well if you do. That's how we eventually build solutions, not just complaints. If your only intention is to prove me wrong, then let me preemptively say "You're right. I'm wrong" and we can both move along satisfied, without wasting time. If your intention is to improve on the idea and help discover a working solution, I look forward to it.