Some of the most notorious so-called shadow libraries are increasingly facing legal pressure to stop book piracy or risk being shut down or driven to the dark web. Among the biggest targets are Z-Library, which the U.S. Department of Justice charged with criminal copyright infringement, and Library Genesis (Libgen), which was sued by a textbook publisher last fall for allegedly distributing digital copies of copyrighted works on a large scale, in what the suit described as an intentional violation of copyright law.
But now these shadow libraries and others accused of flouting copyright appear to have found an unexpected defender in AI chipmaker Nvidia, which is among the companies benefiting most from the recent AI boom.
Nvidia appeared to defend shadow libraries as a valid source of online information when responding to a lawsuit from book authors over a list of data repositories scraped to create the Books3 dataset used to train Nvidia's AI platform NeMo.
The list includes some of the most “notorious” shadow libraries, such as Bibliotik, Z-Library (Z-Lib), Libgen, Sci-Hub, and Anna's Archive, the authors claimed. But Nvidia hopes to partially invalidate the authors' copyright claims by denying that these controversial websites should be considered shadow libraries.
“NVIDIA denies classifying any of the listed data repositories as ‘shadow libraries’ and denies that hosting data on or distributing data from the data repositories necessarily violates U.S. copyright law,” Nvidia said in a court filing.
The chipmaker did not elaborate on what it considers a shadow library or on what might absolve the controversial sites of the major copyright concerns raised by various ongoing lawsuits. Instead, Nvidia kept its response brief, bluntly contesting the authors' petition for class action status and defending its AI training method as fair use.
“Nvidia denies that it improperly used or copied the claimed work,” the court papers state, arguing that “training is a highly transformative process that may include adjustments to numerical parameters, including ‘weights,’” and that the output of an LLM may be based, at least in part, on those weights.
Nvidia's argument will likely hinge on the court's agreement that it is fair use for an AI model to collect published works and translate those works into weights that govern the AI output. However, the authors claimed that “these weights were derived entirely and uniquely from a protected representation of the training dataset, copied without obtaining the authors' consent or providing compensation to them.”
Some companies, such as OpenAI, have already begun licensing content from publishers, perhaps to avoid these copyright issues entirely. Lawyers for The New York Times, one of the publishers suing OpenAI, have already argued that OpenAI's content licensing agreement with News Corp undercuts the fair use defense, MediaPost reported.
Until this issue is resolved by the courts or lawmakers, companies that train AI on the Books3 dataset will likely continue to face lawsuits from rights holders, especially those who see AI models as an extension of the harm caused by these illegal shadow libraries. Matthew Oppenheim, a lawyer for the textbook publisher suing Libgen, previously told Ars that Libgen is a “den of thieves” of illegal books and that there is “no question” that Libgen's actions were “illegal on a large scale.”
“These shadow libraries have long been of interest to the AI training community because they host and distribute vast amounts of unlicensed, copyrighted material,” the authors of the Nvidia lawsuit claim, going a step further to link the chipmaker to shadow libraries. “Shadow libraries also violate U.S. copyright law.”
Nvidia appears to be preparing to defend against a copyright lawsuit by debating what a shadow library is, but at least one website at the heart of the lawsuit seems to have less of an issue with the label. Anna, the pseudonymous creator of Anna's Archive, uses the term liberally, describing the site as “the world's largest shadow library” and offering to educate other so-called pirate archivists.
But in some ways, it's not surprising that Nvidia appears to be siding with shadow libraries when it comes to fighting off copyright claims.
In 2022, when the federal government began cracking down on pirated e-book sites, Anna told Vice that shadow libraries like hers operate with the mentality that “we want information to be free.” AI companies are highly motivated to want the same.
Nvidia recently announced a record $26 billion in revenue in the first quarter of 2024 alone. For Nvidia and other AI companies hoping to maximize profits and dominate the AI market early on, there is likely still no better price for AI training data than free, and thus few better sources of training data than sites that offer vast amounts of information for free.