
How AI Companies Are Scraping Your Art Right Now

Somewhere on a server farm in the American Midwest, a machine is looking at your painting. It is not looking the way a person looks at a painting, with curiosity or admiration or even indifference. It is looking at your painting the way a photocopier looks at a document. It is extracting information.

The machine does not know your name. It does not know that the colour palette took you three years to develop, or that the compositional instinct behind it was shaped by a decade of failed canvases. It does not know that the painting represents something to you. It knows only that the painting exists, that it has pixels, and that those pixels contain patterns worth learning.

This is not speculative. This is the current state of the internet for visual artists. And the pipeline that makes it possible is more systematic, more thorough, and more invisible than most artists realise.

Five point eight billion

That is the number of images in LAION-5B, the largest openly documented image-text dataset in the world. Roughly three images for every four people alive. Released in 2022 by the Large-Scale Artificial Intelligence Open Network, it became the foundation upon which the first generation of commercially available diffusion models was built.

LAION-5B is not, technically, a collection of images. It is a collection of URLs. Billions of links point to images hosted on other people's servers. Your ArtStation portfolio. Your Instagram. Your personal website. That blog post from 2019 with your concept art. All indexed. All catalogued. All linked to text descriptions that tell the model what it is looking at.

If you are a visual artist who has posted work online in the last ten years, the odds are good that you are in this dataset. Not by choice. Not by agreement. By default.

The crawler

The pipeline begins with Common Crawl, a non-profit web crawler that archives vast portions of the public internet. Think of it as an automated process that visits websites, downloads everything it can access, and stores it.

Common Crawl itself is legitimate infrastructure. Researchers use it. Language models need training data. The internet benefits from public archives. The problem is not that Common Crawl exists. The problem is what happens to the data once it has been collected.

The crawler is not selective. It does not distinguish between a small photography business and a stock photo library. It does not check whether the person who uploaded an image intended it for public consumption or for a specific audience. It simply visits URLs, downloads content, and moves on. Every image, every page, every embedded resource is captured.

This process runs continuously. If you posted a new piece this morning, by this evening a crawler may have found it. You receive no notification. There is no inbox alert that says "your image has been indexed for potential use in AI training." The extraction is silent by design.

Technically, a robots.txt file on your server can ask crawlers not to fetch your content. In practice, almost no artists know it exists. And even those who do cannot guarantee compliance. The convention is a request, not a lock.
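To make that concrete, here is what opting out looks like, and how a compliant crawler interprets it, using Python's standard library. The rules block Common Crawl's published user agent, CCBot; the domain and paths are placeholders:

```python
import urllib.robotparser

# A robots.txt that asks Common Crawl's CCBot to stay away entirely,
# and all other crawlers to skip the /art/ directory.
# This is a request, not an enforcement mechanism.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /art/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A *compliant* crawler checks before fetching:
print(parser.can_fetch("CCBot", "https://example.com/art/piece.png"))   # False
print(parser.can_fetch("SomeBrowser", "https://example.com/about"))     # True
```

Nothing in the protocol stops a crawler that simply never performs this check.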

The dataset

LAION takes the raw Common Crawl data and organises it. For each image, it preserves the URL and associated text such as the alt text or surrounding HTML, and uses CLIP, a model that scores how well an image matches a text description, to filter out mismatched pairs. The result is a text-image pair: a description linked to a visual.
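Conceptually, each entry in such a dataset is just a small record pairing a link with nearby text. A sketch of the shape (field names and the threshold are illustrative, not LAION's exact schema):

```python
# One entry in an image-text dataset: no pixels, just a pointer plus
# whatever text the crawler found near the image.
record = {
    "url": "https://example.com/portfolio/nocturne-03.jpg",
    "text": "Nocturne No. 3, oil on canvas",  # scraped alt text
    "width": 1920,
    "height": 1080,
    "similarity": 0.34,  # CLIP's score: does the text match the image?
}

# Dataset builders keep only pairs where the text plausibly describes
# the image. The cut-off here is an assumed value for illustration.
KEEP_THRESHOLD = 0.28
keep = record["similarity"] >= KEEP_THRESHOLD
print(keep)  # True
```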

This pairing is what makes the dataset valuable to AI companies. Neural networks learn by association. Show a network a million images labelled "sunset" and it learns which visual patterns correlate with that word. With billions of examples, it learns with extraordinary nuance. It does not just learn "sunset." It learns the difference between a watercolour sunset and a photographic one, between a Turner and a mobile phone snap.

The dataset preserves artists' names when they appear in metadata. This is the detail that should concern you most. If your ArtStation portfolio titles an image "Sci-Fi Character by [Your Name]," that pairing survives into the dataset. It means that when someone later types "character in the style of [Your Name]," the model knows precisely which visual patterns to activate.

Your name becomes a prompt. Your life's work becomes a style filter.

The training

By 2022, LAION-5B was publicly available on AWS. Any organisation with sufficient computing infrastructure could download it. Stability AI did. OpenAI built on similar but proprietary datasets. Midjourney, Adobe, and Microsoft trained models on datasets derived from LAION or assembled through comparable pipelines.

A diffusion model like Stable Diffusion does not store your images. It does something more subtle and, in many ways, more invasive. It learns the statistical relationships between text descriptions and visual patterns across billions of examples. During training, it sees each image and decomposes it into a high-dimensional mathematical representation. It identifies the features that matter. These include edges, textures, colour relationships, and compositional structures. It then builds a probabilistic map of how these features relate to language.
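The training loop behind that description can be sketched in a few lines. This toy version is a NumPy illustration of the denoising objective only: the data is random, the "model" is a single linear layer, and there is one fixed noise level, whereas real diffusion models use deep networks, many noise levels, and latent spaces. What it shows is the shape of the process: add noise to an image, then learn to predict that noise from the noisy image plus a text signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 64 flattened "images" and 64 matching text embeddings.
images = rng.normal(size=(64, 64))
text_emb = rng.normal(size=(64, 16))

# Toy "model": one weight matrix mapping (noisy image, text) -> noise guess.
W = np.zeros((64 + 16, 64))
alpha = 0.5  # a single fixed noise level, for simplicity

for step in range(200):
    noise = rng.normal(size=images.shape)
    noisy = np.sqrt(alpha) * images + np.sqrt(1 - alpha) * noise
    inputs = np.concatenate([noisy, text_emb], axis=1)
    pred = inputs @ W                                # predicted noise
    grad = inputs.T @ (pred - noise) / len(images)   # gradient of MSE loss
    W -= 0.01 * grad                                 # gradient step

# W now encodes statistical relationships between text, images, and noise.
# No individual image is stored anywhere in it. Loss on the last batch:
loss = float(np.mean((inputs @ W - noise) ** 2))
print(round(loss, 3))
```

Note what the loop keeps: not pixels, only weights. That is the sense in which the model "loses the painting but keeps the patterns."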

Your two-million-pixel painting is compressed into floating-point weights shared across the entire model. The model does not remember your painting. It remembers the patterns your painting contained, and those patterns are now permanently encoded in its parameters.

This compression is lossy, but not in the way you might hope. The model loses the painting. But it keeps the style. It keeps the distinctive features of your brushwork, your colour choices, your compositional instincts. These become part of the model's internal representation, available for activation whenever someone requests them.

The replication

This is where the pipeline reaches its conclusion, and where the abstraction becomes personal.

A trained diffusion model can decompose artistic style into mathematical weights. When a user types "painting in the style of [artist name]," the model activates the weights most strongly associated with that name during training. If the training dataset included a thousand images by one artist, all sharing similar visual patterns, the model learns to map the name to those patterns. It can then generate new images matching those patterns without ever accessing the originals again.
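A toy sketch of that name-to-pattern mapping, with a plain dictionary of "style vectors" standing in for what is really distributed across billions of weights. Everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for what training distils: names mapped to style statistics.
style_weights = {
    "jane doe": rng.normal(size=8),
    "generic landscape": rng.normal(size=8),
}

def prompt_to_style(prompt: str) -> np.ndarray:
    # Activate whichever learned styles the prompt names; blend if several.
    active = [v for name, v in style_weights.items() if name in prompt.lower()]
    return np.mean(active, axis=0) if active else np.zeros(8)

v = prompt_to_style("castle in the style of Jane Doe")
print(np.array_equal(v, style_weights["jane doe"]))  # True
```

The originals are never consulted: the name alone retrieves the extracted patterns.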

This is not inspiration. It is not influence. It is extraction.

When an artist says "I was influenced by Caravaggio," they mean they spent years studying chiaroscuro, made conscious decisions about what to absorb and what to reject, and eventually developed their own relationship with light and shadow. The process takes a lifetime. It involves failure, rejection, and the slow accumulation of aesthetic judgement.

When a diffusion model learns from your work, it performs no such act. It extracts everything indiscriminately. Your colour relationships, your compositional tendencies, your texture handling, your line weight preferences, your signature techniques. It compresses all of it into weights. And it makes those weights available to anyone with a subscription or an open-source download.

A person in any city on earth can now type your name and generate work that captures your aesthetic. In four seconds. For pennies. Without knowing who you are, without crediting you, without compensating you. The model has absorbed your creative identity and commodified it.

The legal vacuum

The standard response from the technology industry is that publicly available work is fair game. "It's on the internet. Anyone can see it. How is AI training different from a person looking at it?"

It is different in scale, in purpose, and in consequence. A person looking at your painting learns from it over months and years, filtered through their own experience and aesthetic judgement. An AI model processes your painting in milliseconds, alongside billions of others, and extracts the mathematical essence of your style for commercial redistribution.

The lawsuits have begun. Andersen v. Stability AI. Getty Images v. Stability AI. Class actions from illustrators, photographers, and concept artists. These cases are establishing that "publicly available" does not mean "available for unrestricted commercial extraction." But litigation is slow. The scraping is fast.

By the time the courts rule definitively, several more years of training data will have been collected. The legal system operates on a timeline measured in years. The extraction pipeline operates on a timeline measured in hours.

The shutdown that changes nothing

In 2024, facing mounting legal and ethical pressure, LAION announced it would limit distribution of its datasets. This was presented as a significant step. In practice, it changes very little.

Every model trained on LAION-5B before the shutdown still exists. Stability AI's models, Midjourney's models, every derivative and fine-tuned model built on that foundation still contain the statistical patterns extracted from your work. Weights do not disappear because the source dataset becomes unavailable. The shutdown closes a door after everything of value has already been carried through it.

What you can do about it

If you cannot prevent extraction, you can make it useless.

Adversarial perturbation applies carefully calculated pixel modifications to your images. These changes are too small for the human eye to detect but devastating to the mathematical processes neural networks use to learn. When a diffusion model trains on a perturbed image, it learns incorrect associations. The corrupted patterns degrade the model's ability to replicate your style. The image looks exactly the same to you, to your clients, to anyone viewing it online. But to the machine, it is poison.
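The core trick, in its simplest published form (the fast gradient sign method), fits in a few lines. This is a classic NumPy illustration of the principle against a toy linear "style detector", not any production protection tool's actual method:

```python
import numpy as np

rng = np.random.default_rng(1)

image = rng.uniform(0.0, 1.0, size=(32, 32))   # stand-in for a painting
w = rng.normal(size=(32, 32))                  # toy model's "style detector"

def style_score(x):
    # Toy stand-in for a network's internal style feature.
    return float(np.sum(w * x))

# Fast-gradient-sign perturbation: for this linear score the gradient
# with respect to the image is just w, so nudge each pixel against it.
eps = 2.0 / 255.0  # below typical human visibility
adversarial = np.clip(image - eps * np.sign(w), 0.0, 1.0)

# Pixel changes stay tiny, but the feature the model relies on degrades.
print(float(np.abs(image - adversarial).max()))
print(style_score(image) > style_score(adversarial))  # True
```

Real protection tools layer more sophisticated perturbations than this single gradient step, but the asymmetry is the same: imperceptible to people, disruptive to the model's learned features.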

Art Vault applies three layers of this perturbation simultaneously. Edge perturbation attacks the model's structural understanding. Texture-band perturbation targets mid-frequency patterns, the frequencies where style and brushwork live. Spectral camouflage operates in the frequency domain itself. Together, they ensure that training on a protected image is worse than training on nothing at all.

This is not a permanent solution. Adversarial techniques evolve. AI companies develop filtering. The arms race continues. Which is why protection alone is not enough.

The receipt

Art Vault also embeds C2PA provenance into every protected image. This is a cryptographically signed certificate that records who created the work, when it was created, and that it was protected from that moment forward. The certificate is signed with ES256, an elliptic-curve signature scheme (ECDSA with the P-256 curve and SHA-256) from the same family of cryptography that secures web traffic and financial transactions. It cannot be forged without the signing key. If anyone modifies the certificate, the cryptographic signature breaks and verification fails.
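The property that matters is tamper-evidence. ES256 itself requires a cryptography library, so this standard-library sketch substitutes an HMAC to demonstrate the same behaviour: change one byte of the claim and verification fails. The key and claim values are invented:

```python
import hashlib
import hmac
import json

key = b"demo-signing-key"  # stands in for a private key; real C2PA
                           # manifests use ECDSA (ES256), not an HMAC

claim = {"creator": "Jane Doe", "created": "2026-03-01", "protected": True}

def sign(claim: dict) -> str:
    # Canonicalise the claim, then sign its bytes.
    payload = json.dumps(claim, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify(claim: dict, signature: str) -> bool:
    return hmac.compare_digest(sign(claim), signature)

signature = sign(claim)
print(verify(claim, signature))                # True

tampered = {**claim, "created": "2020-01-01"}  # try to backdate the claim
print(verify(tampered, signature))             # False
```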

Provenance does not prevent extraction. It proves extraction happened. An artist with C2PA provenance can say: "I created this work on this date. I protected it immediately. Any AI-generated work replicating my style after this date was produced by training on my protected images." That is a statement with cryptographic proof behind it.

In a future where galleries, marketplaces, and legal proceedings require verifiable authorship, the artists who established provenance early will have evidence. That future is arriving faster than most people expect. The ones who did not establish provenance will have nothing but their word.

The window

Protection works today. Adversarial perturbation is genuinely effective at degrading AI models' ability to learn from your images. But the window is not infinite. Models are improving. Filtering techniques are advancing. The cost of protection will not always be this low relative to the sophistication of the threat.

Provenance established today, March 2026, carries more weight than provenance established in 2028 because it demonstrates that you acted when the tools first became available. Not after the legal precedents were set. Not after the industry shifted. Now.

The pipeline that scrapes your art is mechanical, comprehensive, and indifferent. It does not care about your intentions when you posted. It does not respect the years of craft encoded in your colour choices. It does not distinguish between work you shared for connection and work you shared for commerce.

But you are not obligated to make its job easy. Poison the data. Sign your work. Establish the record.

The machine is looking at your painting. Make sure it learns nothing useful.

Protect Your First Image Free

Adversarial perturbation and C2PA provenance. Your art looks identical. AI models learn nothing.

Get Protected