The New York Times's lawsuit against OpenAI, alleging copyright infringement through the use of Times content to train its models, has everyone involved in AI wondering about the consequences. How will the suit proceed? And more important, how will the outcome affect the way we train and use large language models?
This suit has two components. First, the Times was able to get ChatGPT to reproduce some of its articles nearly verbatim. That looks like fairly clear copyright infringement, though important questions remain that could affect the outcome of the case. Reproducing New York Times articles is clearly not what ChatGPT was designed to do, and OpenAI appears to have modified ChatGPT's guardrails to make generating infringing content more difficult, though probably not impossible. Is that enough to limit the damages? It's also unclear whether anyone has actually used ChatGPT to avoid paying for a Times subscription. Second, the examples in a case like this are always cherry-picked. The Times can clearly show that OpenAI can reproduce some of its articles, but how much of the Times's archive can it reproduce? Can you get ChatGPT to produce an article from page 37 of the September 18, 1947 issue? Or, for that matter, an article from the Chicago Tribune or the Boston Globe? Is the entire corpus available (I doubt it), or only certain random articles? I don't know, and given that OpenAI has modified GPT to reduce the likelihood of infringement, it is almost certainly too late to run that experiment. The courts will have to decide whether inadvertent, incidental, or unintentional copying meets the legal definition of copyright infringement.
The more important claim is that training a model on copyrighted content is itself infringement, whether or not the model can reproduce that training data in its output. An inept and sloppy version of this argument was filed by Sarah Silverman and others, and was dismissed. The Authors Guild has its own version of this lawsuit, and it is working on a licensing model that would allow its members to opt in to a single licensing agreement. The outcome of this case could have many side effects, since it would essentially allow publishers to charge not only for the texts they produce but for how those texts are used.
It's easy to guess at the outcome, but hard to predict it. Here's my guess: OpenAI will settle with the New York Times out of court, so we won't get a verdict. The settlement will have important consequences: it will set a de facto price on training data, and that price will undoubtedly be high. Perhaps not as high as the Times wants (there are rumors that OpenAI has offered something in the range of $1 million to $5 million), but high enough to deter OpenAI's competitors.
$1 million is not, in itself, a terribly high price, and the Times reportedly thinks it's far too low. But consider that OpenAI would have to pay a similar amount to almost every major newspaper publisher worldwide, in addition to organizations like the Authors Guild, technical journal publishers, magazine publishers, and many other content owners. The total bill would probably approach $1 billion, and at least part of it would be a recurring cost, since models need to be updated. I suspect OpenAI would have difficulty going much higher, whatever you think of Microsoft's investment and its strategy. OpenAI has to consider the total cost: I suspect it is barely profitable, if at all. It appears to be executing an Uber-like business plan, spending huge amounts of money to buy the market without regard for running a sustainable business. But even with that business model, billions of dollars in expenses would be a hard sell to partners like Microsoft.
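To see how the bill balloons, here's a back-of-envelope sketch in Python. The per-licensor fee and the number of content owners are purely illustrative assumptions, not figures from the case; they're chosen only to show how a few million dollars per publisher aggregates toward a billion.

```python
# Back-of-envelope estimate of aggregate licensing costs.
# Both inputs are illustrative assumptions, not reported figures.
per_licensor_fee = 5_000_000    # high end of the rumored $1M-$5M range
num_content_owners = 200        # newspapers, magazines, journals, guilds, ...

one_time_total = per_licensor_fee * num_content_owners
print(f"One-time licensing bill: ${one_time_total:,}")      # $1,000,000,000

# If even a quarter of that recurs annually as models are retrained:
recurring_annual = one_time_total * 0.25
print(f"Recurring annual cost:   ${recurring_annual:,.0f}")  # $250,000,000
```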
The Times, on the other hand, seems to be making a mistake that's common among publishers: overvaluing its data. Yes, it has a large archive, but what is the value of old news? Furthermore, in almost any application, and especially in AI, the value of data isn't in the data itself; it's in the correlations between data sets. The Times doesn't own those correlations any more than I own the correlations between my search data and Tim O'Reilly's. But those correlations are exactly what's needed to build OpenAI's models and other data-driven products.
Setting the price of copyrighted training data at roughly $1 billion would force other model developers to pay similar amounts to license theirs: Google, Microsoft (for whatever models it develops independently), Facebook, Amazon, and Apple. Those companies can afford it. Smaller startups (including companies like Anthropic and Cohere) would be priced out, along with every open source effort. By settling, OpenAI would eliminate much of its competition. And the good news for OpenAI is that it could still win even if it doesn't settle and loses in court. It would probably pay more, but the effect on competition would be the same. What's more, the Times and the other publishers would bear the burden of enforcing this "agreement": they would be responsible for negotiating with every other group that wants to use their content and for suing those they can't come to terms with. OpenAI keeps its hands clean, and its legal budget stays unspent. It can win by losing. Does it have any real incentive to win?
Unfortunately, OpenAI is right that a good model can't be trained without copyrighted data (even though Sam Altman, OpenAI's CEO, has said the opposite). Yes, we have substantial libraries of public domain literature, plus Wikipedia and papers on ArXiv. But a language model trained on that data would produce text that sounds like a cross between a 19th-century novel and a scientific paper, and that's not a pleasant thought. Nor is the problem limited to text generation: would a language model whose training data is restricted to out-of-copyright sources require prompts to be written in an early-20th- or 19th-century style? Newspapers and other copyrighted material are an excellent source of well-edited, grammatically correct modern language. It's unreasonable to believe that a good model of modern language can be built from sources that have fallen out of copyright.
Requiring model-building organizations to purchase rights to their training data would inevitably leave generative AI in the hands of a small number of unassailable monopolies. (I won't get into what you can or can't do with copyrighted material, but I will note that copyright law says nothing about where the material comes from: you can legally buy it, borrow it from a friend, steal it, or find it in the trash. None of these has anything to do with copyright infringement.) One of the participants at the WEF roundtable The Expanding Universe of Generative Models reported that Altman has said he doesn't see the need for more foundation models. That's not unexpected, given a strategy built around minimizing competition rather than around any single foundation model. But it's chilling. If every AI application has to go through one of a small number of monopolists, can we trust those monopolists to deal honestly with issues of bias? AI developers have said a lot about "alignment," but discussions of alignment always seem to sidestep more immediate issues like race- and gender-based bias. Would it be possible to develop specialized applications (such as O'Reilly Answers) that need to be trained on a specific data set? I'm sure the monopolists would say, "Of course, you can build that by fine-tuning our foundation model." But do we know that's the best way to build such an application? Or whether smaller companies could afford to build it, once the monopolists have succeeded in buying the market? Remember: Uber used to be cheap.
If model development is limited to a few wealthy companies, its future will be bleak. The consequences of copyright lawsuits won't be limited to the current generation of Transformer-based models: they will apply to any model that needs training data. Limiting model building to a small number of companies would eliminate most academic research. It would certainly be possible for most research universities to build training corpora from legally acquired content; any good library will have the Times and other newspapers on microfilm, which can be converted to text with OCR. But if the law specifies how copyrighted material can be used, research applications based on material a university has legitimately purchased may not be possible. And it won't be possible to develop open source models like Mistral and Mixtral: the funding to acquire training data won't be there. Without them, there won't be compact models that don't require large server farms full of power-hungry GPUs. Many of these compact models can run on a modern laptop, which makes them ideal platforms for developing AI-based applications. Will that be possible in the future, or will innovation only be possible through the entrenched monopolies?
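To make "runs on a modern laptop" concrete, here is a minimal sketch of loading a compact open-weight model with Hugging Face's transformers library. The specific model name is just one example of the class of models discussed above, and you'd need enough memory (or a quantized build) for a 7B-parameter model; treat this as an illustration, not a recommendation.

```python
# Minimal sketch: running a compact open-weight model locally with
# Hugging Face transformers. Assumes `pip install transformers torch`
# and enough memory for a 7B-parameter model (quantized builds help).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example open-weight model
    device_map="auto",  # use a local GPU if available, otherwise CPU
)

prompt = "Explain, in two sentences, why training data provenance matters."
result = generator(prompt, max_new_tokens=100, do_sample=False)
print(result[0]["generated_text"])
```

The point isn't this particular model or library; it's that the entire workflow stays on hardware an individual developer or a university lab already owns, with no per-transaction cloud billing.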
Open source AI has been the victim of a lot of fear-mongering lately. But the idea that open source AI will be used irresponsibly to develop hostile applications inimical to human well-being gets the problem precisely wrong. Yes, open source will be used irresponsibly, as has every tool ever invented. But we know that hostile applications will be developed regardless, and are already being developed: in military labs, in government labs, and at any number of companies. Open source gives us a chance to see what is going on behind those locked doors: to understand AI's capabilities, to anticipate possible abuses, and even to prepare defenses. Handicapping open source AI doesn't "protect" us from anything; it prevents us from becoming aware of threats and from developing countermeasures.
Transparency matters, and proprietary models will always lag open source models where transparency is concerned. Open source has always been about source code rather than data, but that is changing. OpenAI's GPT-4 scores surprisingly well on Stanford's Foundation Model Transparency Index, but it still lags behind the leading open source models (Meta's LLaMA and BigScience's BLOOM). What matters, though, isn't the total score: it's the "upstream" score, which includes the sources of training data, and there the proprietary models aren't even close. Without data transparency, how can we understand the biases built into our models? Understanding those biases is crucial to addressing the harms that current models are doing now, rather than the hypothetical harms of a science-fiction superintelligence. Restricting AI development to a few wealthy players who sign private agreements with publishers guarantees that training data will never be open.
What will AI look like in the future? Will there be a proliferation of models? Will AI users, both corporate and individual, be able to build tools that serve them? Or will we be stuck with a small number of AI models running in the cloud, billed by the transaction, with no real insight into what the models are doing or what data they were trained on? That's what the endgame of the legal battle between OpenAI and the Times is really about.