OpenAI say it would be 'impossible' to train AI without pinching copyrighted works

By Liam Dawe - 9 January 2024 at 12:47 pm UTC | Views: 41,254

I really fear for the internet and what it will become in even just another year, with the rise of AI writing and AI art being used in place of real people. And now OpenAI openly state they need to use copyrighted works for training material.

As reported by The Guardian, the New York Times sued OpenAI and Microsoft over copyright infringement and just recently OpenAI sent a submission to the UK House of Lords Communications and Digital Select Committee where OpenAI said pretty clearly:

Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

Worth noting OpenAI put up their own news post "OpenAI and journalism" on January 8th.

Why am I writing about this here? Well, the reasoning is pretty simple. AI writing is (on top of other things) speeding up the race to the bottom of content for clicks. Search engines have quickly become a mess when you're trying to find what you actually want, and it's only going to keep getting worse thanks to all these SEO (Search Engine Optimisation) bait content farms, with more popping up all the time, and we've already seen some bigger websites trial AI writing. The internet is a mess.

As time goes on, and as more people use AI to pinch content and write entire articles, we're going to hand off profitable writing to only a select few big names who can weather the storm. A lot of smaller scale websites are just going to die off. Any time you search for something, it will be those big names sprinkled in between the vast AI website farms, all with very similar robotic plain writing styles.

Many (most?) websites make content for search engines, not for people. The Verge recently did a rather fascinating piece on this showing how websites are designed around Google, and it really is something worth scrolling through and reading.

One thing you can count on: my perfectly imperfect writing full of terrible grammar continuing without the use of AI. At least it's natural right? I write as I speak, for better or worse. By humans, for humans — a tagline I plan to stick with until AI truly takes over and I have to go find a job flipping burgers or something. But then again, there will be robots for that too. I think I need to learn how to fish…

Article taken from GamingOnLinux.com.

Tags: Editorial, Misc

26 Likes

About the author - Liam Dawe

I am the owner of GamingOnLinux. After discovering Linux back in the days of Mandrake in 2003, I constantly came back to check on the progress of Linux until Ubuntu appeared on the scene and it helped me to really love it. You can reach me easily by emailing GamingOnLinux directly. Find me on Mastodon.

68 comments

NathanaelKStottlemyer Jan 9

Quoting: Purple Library Guy
Quoting: NathanaelKStottlemyer
P.S. According to LanguageTool, three commas were needed in the article.
Ehhh, IMO commas are kind of a "soft" punctuation mark--there are stylistic differences in how people use them. There are many situations where it's not really technically "wrong" either to use one or not to use one, and others where it is wrong by some technical standards to do it a particular way, but doing it that "wrong" way still works given the flow of the sentence and the way people talk. Periods, for instance, are a lot clearer--if you're at the end of a sentence you should be using one, period. Well, unless you have a reason to use a question mark or exclamation point instead. But commas are comparatively mushy, and I don't trust computerized guidance about how to use them.

All the places where LanguageTool said a comma was needed, I wouldn't care either way. However, I personally err on the side of using the commas, because they save lives after all.

4 Likes, Who?

scaine Jan 9

Contributing Editor
Mega Supporter

Quoting: NathanaelKStottlemyer
Quoting: Purple Library Guy
Quoting: NathanaelKStottlemyer
P.S. According to LanguageTool, three commas were needed in the article.
Ehhh, IMO commas are kind of a "soft" punctuation mark--there are stylistic differences in how people use them. There are many situations where it's not really technically "wrong" either to use one or not to use one, and others where it is wrong by some technical standards to do it a particular way, but doing it that "wrong" way still works given the flow of the sentence and the way people talk. Periods, for instance, are a lot clearer--if you're at the end of a sentence you should be using one, period. Well, unless you have a reason to use a question mark or exclamation point instead. But commas are comparatively mushy, and I don't trust computerized guidance about how to use them.

All the places where LanguageTool said a comma was needed, I wouldn't care either way. However, I personally err on the side of using the commas, because they save lives after all.

This joke?

A comma is the difference between:
- Let's eat, Grandma!
and
- Let's eat Grandma!

7 Likes, Who?

EagleDelta Jan 9

I get really annoyed by this idea that LLMs are "stealing" data. It's literally the automation of what people manually do..... what we have always done in tech. My job is built around automating monotonous tasks to improve stability and reliability.

LLMs aren't going around storing articles, code, pictures, art, etc. in their models. They are simply learning from those... with all the benefits AND drawbacks that come with that. That means bad data is getting into many of the LLMs too. I've been using Copilot to write code for a while now. It is absolutely useful, but it also gets things wrong on a regular basis.

1 Likes, Who?

pleasereadthemanual Jan 9

I agree with the sentiment that our public domain is not as valuable as it should be. As ever, OpenAI representatives write with the assumption that they are entitled to do whatever they want, regardless of the laws. Why do they feel the need to phrase it like that?

Quoting: EagleDelta
LLMs aren't going around storing articles, code, pictures, art, etc in its model. It is simply learning from those.... and all the benefits AND drawbacks that come with that.

Sure, but that doesn't mean OpenAI employees are now allowed to download millions of copyrighted works that have been distributed on trackers/DDL sites without permission from the copyright holder. If ChatGPT were only using Common Crawl, that's one thing, but we know they're not.

Supposedly ChatGPT's training content is carefully curated, FWIW.

Last edited by pleasereadthemanual on 9 January 2024 at 11:03 pm UTC

2 Likes, Who?

Kithop Jan 9

Supporter Plus

Quoting: scaine
I'd love to know where the money is being made with this shit.

Venture capitalists pouring money into it in the hopes that there'll be a bump in all this 'interest' in LLMs, so they can dump it when it peaks. It's just another dot-com bubble, or 2008 bubble all over again.

I wouldn't be surprised to learn all these AI companies are just burning cash in the hopes that they're the ones 'on top' when it all comes crashing down. As to how to 'monetize' it and actually make money off of all that wasted electricity and questionable results? That's the next guy's problem.

3 Likes, Who?

Purple Library Guy Jan 10

Quoting: NathanaelKStottlemyer
Quoting: Purple Library Guy
Quoting: NathanaelKStottlemyer
P.S. According to LanguageTool, three commas were needed in the article.
Ehhh, IMO commas are kind of a "soft" punctuation mark--there are stylistic differences in how people use them. There are many situations where it's not really technically "wrong" either to use one or not to use one, and others where it is wrong by some technical standards to do it a particular way, but doing it that "wrong" way still works given the flow of the sentence and the way people talk. Periods, for instance, are a lot clearer--if you're at the end of a sentence you should be using one, period. Well, unless you have a reason to use a question mark or exclamation point instead. But commas are comparatively mushy, and I don't trust computerized guidance about how to use them.

All the places where LanguageTool said a comma was needed, I wouldn't care either way. However, I personally err on the side of using the commas, because they save lives after all.

But they turn pandas homicidal!
("Eats shoots and leaves" --> "Eats, shoots and leaves")

2 Likes, Who?

Purple Library Guy Jan 10

Quoting: EagleDelta
I get really annoyed by this idea that LLMs are "stealing" data. It's literally the automation of what people manually do.....

That is not as persuasive a statement as you think it is. I want to keep on doing some things manually, thanks very much. And I very much hope my wife agrees.

But in any case, that's not all it is. The fact is that these AIs essentially end up restating things that actual people said . . . which is fine in and of itself. But they are being used to redistribute revenue from the people who initially said the things, to the people who made the AI programs, by using the things the people said as input. That is not benign--and when the people saying the things go out of business and the AIs are reduced to restating each other's statements plus the one major source of statements on the internet that needs no revenue--propaganda--the results ain't gonna be pretty.

Last edited by Purple Library Guy on 10 January 2024 at 1:10 am UTC

9 Likes, Who?

Nod Jan 10

Here is a counter opinion that might give you a bit of hope.

Quote:
But there will be silver linings to The Great Robot Spam Flood of 2024. It will drive us into healthier online communities. It will spotlight and boost the value of authored creativity. And it may help give birth to a new generation of independent media.

Robots will make the internet more human.

Essentially he argues that AI content will turbo charge the already dire enshittification of content on the internet such that the experience is so bad that it drives people towards sites just like this one. Ones that prioritize content "by humans, for humans".

2 Likes, Who?

junibegood Jan 10

Quoting: Nod
Here is a counter opinion that might give you a bit of hope.

Quote:
But there will be silver linings to The Great Robot Spam Flood of 2024. It will drive us into healthier online communities. It will spotlight and boost the value of authored creativity. And it may help give birth to a new generation of independent media.

Robots will make the internet more human.

Essentially he argues that AI content will turbo charge the already dire enshittification of content on the internet such that the experience is so bad that it drives people towards sites just like this one. Ones that prioritize content "by humans, for humans".

I wish it were true but I don't believe it.

When both smartphones and social networks appeared, the internet was suddenly flooded with pictures (and later videos) shot with cheap cameras by people who had almost never taken a picture before. That was human work, sure, but I think it's comparable to the rise of AI because we saw a brutal increase in quantity and decrease in quality. Did that raise interest in quality pictures by photographers? Maybe for a very small fraction of humanity, yes, but the rest of us take, post and watch even more crap pictures and videos than we did 15 years ago...

Last edited by junibegood on 10 January 2024 at 9:15 am UTC

1 Likes, Who?

LoudTechie Jan 10

Quoting: pleasereadthemanual
I agree with the sentiment that our public domain is not as valuable as it should be. As ever, OpenAI representatives write with the assumption that they are entitled to do whatever they want, regardless of the laws. Why do they feel the need to phrase it like that?

Quoting: EagleDelta
LLMs aren't going around storing articles, code, pictures, art, etc in its model. It is simply learning from those.... and all the benefits AND drawbacks that come with that.
Sure, but that doesn't mean OpenAI employees are now allowed to download millions of copyrighted works that have been distributed on trackers/DDL sites without permission from the copyright holder. If ChatGPT were only using Common Crawl, that's one thing, but we know they're not.

Supposedly ChatGPT's training content is carefully curated, FWIW.

A. Totally agree. We have to play by the rules or run. They have to play by the rules or run.
B. I would actually go further than that and call them unauthorized hosts of, at the copyright holder's choice, either the copyrighted content itself or a derivative work of all the copyrighted content in their training set.
i. Storing unauthorized copyrighted material is illegal regardless of whether or not you distribute it. This is how pirates were hunted at first, until it proved too inefficient.
Also, copyright is format independent, and it has been shown multiple times that training data can be partly or fully recovered from LLMs. The same has been successfully said of jpg, png and other data formats. It's just easier there.
ii. It's a derivative work, because it has been made with the copyrighted data, would have been different without it, and has been made to mimic properties of the copyrighted data. (The drawback of this argument is that it is the same one used against fan fiction, but those arguments have held up in court, and most fan fiction organizations tend to accept that they exist by the grace of their often pretty graceful authors.)
C. This is actually the main difference between the development methods of AI whose data is actually available (often FOSS, not always) and proprietary AI like Bard and OpenAI. Source-available AI projects aggressively curate their data, because it gives a great training speed advantage and requires less data. Proprietary AI tends to use lots of training layers with lots of parameters, due to the low development cost.

1 Likes, Who?
