Last Friday, legal eagles from the Joseph Saveri Law Firm sprang into action, slapping Meta and OpenAI with US federal class-action lawsuits. The plaintiffs? None other than laugh-riot Sarah Silverman and fellow scribes, Christopher Golden and Richard Kadrey.
They say imitation is the sincerest form of flattery, but these authors would beg to differ. According to the lawsuit, OpenAI and Meta’s AI language models, notably ChatGPT and LLaMA, have been playing a bit too fast and loose with copyrighted content. And by ‘too fast and loose’, we mean allegedly guzzling copyrighted works like a college student on an energy drink binge during finals.
The legal drama has been brewing for a while now, with authors Paul Tremblay and Mona Awad having already fired their salvos in a lawsuit on June 28. Claims thrown into the mix include violations of the Digital Millennium Copyright Act, unfair competition laws, and the lawyers’ personal favorite – negligence.
When it comes to filing headline-grabbing lawsuits against AI, the Joseph Saveri Law Firm is no rookie. In November 2022, they had a beef with GitHub Copilot for purported copyright violations. By January 2023, they took a similar stand against Stability AI, Midjourney, and DeviantArt over AI image generators. The GitHub lawsuit is steaming ahead towards trial, while the rest are still simmering on the back burner.
In a press release last month, the law firm described ChatGPT and LLaMA as the literary equivalent of kleptomaniacs, accusing them of violating the rights of book authors. Since March 2023, authors have been ringing up the law firm, alarmed by these AI tools’ eerie talent for spinning text that sounds suspiciously familiar.
The latest batch of lawsuits was filed in a US district court in San Francisco. The authors want a jury trial and are gunning for an injunction that could force Meta and OpenAI to tinker with their AI tools. When asked to comment, Meta played coy and OpenAI played hard to get.
In a statement to Ars Technica, a spokesperson for the Saveri Law Firm warned of a dystopian future where AI models, powered by ill-gotten works, could elbow out the authors they’re competing against.
One of the biggest bones of contention is the shadowy nature of the data sets used to train LLaMA and ChatGPT. OpenAI and Meta have been playing their cards close to the chest, but the plaintiffs believe they’ve sussed out the likely sources of the data sets, accusing both companies of feasting on copyrighted content without consent.
They assert that ChatGPT was likely trained on a whopping 294,000 books possibly nabbed from ‘shadow library’ sites. Meta, on the other hand, admitted to training LLaMA on part of a data set named The Pile, which, according to the lawsuit, is virtually synonymous with “every book on Bibliotik,” totaling an impressive 196,640 titles.
In a twist worthy of a thriller, OpenAI is also accused of utilizing a “controversial data set” known as BookCorpus. Comprising self-published novels from Smashwords, the set was allegedly compiled without the knowledge or consent of the authors.
The plaintiffs insist that the use of these “blatantly illegal” data sets has led to copyright infringement of specific titles. Their contention hinges on the AI tools’ spooky accuracy in summarizing copyrighted books. They allege that this can only mean that the AI models retain specific information from these works.
To add insult to injury, the authors claim that the copyright-management information (CMI) was “deliberately omitted,” allowing OpenAI to profit handsomely from a product founded on unacknowledged reproductions of purloined writings.
One prickly question stands out from the legal quagmire: are ChatGPT and LLaMA themselves infringing derivative works based on thousands of authors’ creations? Authors are peeved that companies seem to be making a mint from their copyrighted works. They demand restitution of lost profits, adding that the gravy train is only going to get more lucrative, especially with Meta’s plans to commercialize the next version of LLaMA.
In their press release, Saveri and Butterick stated, “Much of the material in the training datasets used by OpenAI and Meta comes from copyrighted works—including books written by plaintiffs—that were copied by OpenAI and Meta without consent, without credit, and without compensation.”
So, it seems like the case of AI vs authors is just warming up, promising to deliver some blockbuster courtroom drama. Now, who’s got the popcorn?