
I really fear for what the internet will become in even just another year, with the rise of AI writing and AI art being used in place of real people. And now OpenAI openly state they need to use copyrighted works as training material.

As reported by The Guardian, the New York Times has sued OpenAI and Microsoft over copyright infringement, and just recently OpenAI sent a submission to the UK House of Lords Communications and Digital Select Committee in which it said pretty clearly:

Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.

Worth noting that OpenAI put up their own news post, "OpenAI and journalism", on January 8th.


Why am I writing about this here? Well, the reasoning is pretty simple. AI writing is (on top of other things) accelerating the race to the bottom of content for clicks. Search engines have quickly become a mess when you're trying to find what you actually want, and it's only going to get worse thanks to all these SEO (Search Engine Optimisation) bait content farms, with more popping up all the time. We've already seen some bigger websites trial AI writing. The internet is a mess.

As time goes on, and as more people use AI to pinch content and write entire articles, profitable writing will be handed off to a select few big names who can weather the storm. A lot of smaller-scale websites are just going to die off. Any time you search for something, it will be those big names sprinkled in between vast AI website farms, all with very similar robotic, plain writing styles.

Many (most?) websites make content for search engines, not for people. The Verge recently did a rather fascinating piece on this, showing how websites are designed around Google, and it really is worth scrolling through and reading.

One thing you can count on: my perfectly imperfect writing, full of terrible grammar, continuing without the use of AI. At least it's natural, right? I write as I speak, for better or worse. By humans, for humans — a tagline I plan to stick with until AI truly takes over and I have to go find a job flipping burgers or something. But then again, there will be robots for that too. I think I need to learn how to fish…

Article taken from GamingOnLinux.com.
About the author -
I am the owner of GamingOnLinux. After discovering Linux back in the days of Mandrake in 2003, I constantly came back to check on the progress of Linux until Ubuntu appeared on the scene and it helped me to really love it. You can reach me easily by emailing GamingOnLinux directly. Find me on Mastodon.

damarrin Jan 9
Well, AI may be bad right now, but I'll eat my hat if it doesn't become much less bad very quickly.
Personally, I don't care about copyright per se. Current copyright laws suck. But what's really at issue here is that the current ChatGPT-type "AI" thingies (which are not AI) are being used mostly by outfits like Google, who are creating this gateway where their "AI" restates the internet to you so you don't have to go to any actual websites, and all the ad revenue stays with Google. Left unchecked, this will strangle the internet and all the creators on it. Copyright is maybe the only plausible weapon right now to block this, so fine, I'll back copyright this one time.
Salvatos Jan 9
Working in translation, another worrisome trend I’ve noticed is content farms not just using AI to write articles, but to translate them, so we’re seeing tons of poorly translated content filling up search engine results in content farms with country-specific domains (e.g. "english-sounding-name.fr"). Logically, the AI is going to continue to get trained on junk content that it (or another AI) wrote or translated itself and perpetuate, if not amplify, the errors in it by seeing them as commonly used. A race to the bottom indeed.

It would be bad enough if it only meant more trash on the Internet, but you just know this widespread corruption of language is also going to influence people, especially language learners, who will be much more frequently exposed to mistranslations and unnatural expressions and adopt them naively. The effect on English speakers will likely be lesser or slower to manifest (English being the "native" tongue of most AI and being a simple language to begin with), but I weep for just about every other language.


Quoting: NathanaelKStottlemyer: ...it's not hard to write anybody can do it, and it's the pennical of laziness to...
*Pinnacle ;)
Liam Dawe Jan 9
Quoting: Salvatos: Working in translation, another worrisome trend I’ve noticed is content farms not just using AI to write articles, but to translate them, so we’re seeing tons of poorly translated content filling up search engine results in content farms with country-specific domains (e.g. "english-sounding-name.fr"). Logically, the AI is going to continue to get trained on junk content that it (or another AI) wrote or translated itself and perpetuate, if not amplify, the errors in it by seeing them as commonly used. A race to the bottom indeed.
Duolingo recently did this: translations are now done by AI, with only a few people checking them over. Expect more of this sort of thing over time.
scaine Jan 9
I'd love to know where the money is being made with this shit. AI is not cheap to run, and while it's not proof-of-work coin-mining bad, it's still pretty bad for the environment overall, given that all the compute is running on hot-ass tensor cores guzzling electricity and cooling before they melt. Microsoft stuffed over $10B (yep, billion) into OpenAI, with another billion coming from multiple rounds of fund-raising, and OpenAI is also apparently wooing the Middle East for another $8B-$12B. Meanwhile, Meta is pushing Llama 2, Google is pushing Bard and Gemini, while Amazon, Google and others are all in (also to the tune of around $6B) on Anthropic.

And for what? LLMs are just complex guessers. Sure, they guess with context, but they're still just guessing based on all the billions of documents they consumed during their (extremely intensive) training. You can't use them for research because they make shit up... because they're just guessing. It's a mess.

I'm hoping that 2024 might see some of this novelty wear off as consumers realise how bland and uninspiring AI generated content generally is, but I suspect that real, lasting damage will have been done by then.

It has a use in enterprise settings, properly controlled, with targeted outcomes. As it stands? Total shit show.
Quoting: damarrin: Well, AI may be bad right now, but I'll eat my hat if it doesn't become much less bad very quickly.
I've noticed that in this current degenerate age, most people don't even have hats. How do I know you will really eat it? You have no credibility, sir!

More seriously, I'm not sure it will improve that much that fast. This seems like a new technology because of the way it burst on the scene, but the research into this basic schtick has been going on for decades, staying quiet until they got the whole thing looking promising enough that someone was willing to sink in the cash to scale it up to really big data sets. And with these things, the size of the data set is key. So while it looks new, it may actually already be a fairly mature technology, not subject to the kind of rapid improvement you might expect from something genuinely new.


Last edited by Purple Library Guy on 9 January 2024 at 5:38 pm UTC
Quoting: NathanaelKStottlemyer: P.S. According to LanguageTool, three commas were needed in the article.
Ehhh, IMO commas are kind of a "soft" punctuation mark; there are stylistic differences in how people use them. There are many situations where it's not really technically "wrong" either to use one or not to use one, and others where it is wrong by some technical standards to do it a particular way, but doing it that "wrong" way still works given the flow of the sentence and the way people talk. Periods, for instance, are a lot clearer: if you're at the end of a sentence you should be using one, period. Well, unless you have a reason to use a question mark or exclamation point instead. But commas are comparatively mushy, and I don't trust computerized guidance about how to use them.
Talon1024 Jan 9
Quoting: Salvatos: Working in translation, another worrisome trend I’ve noticed is content farms not just using AI to write articles, but to translate them

Given the choice between a corrupt human translation and an AI translation, which one will you choose?

Canonical recently had to take down the Ubuntu 23.10 release because a corrupt translator vandalized the Ukrainian translation. Although it's perfectly understandable why the translator did it, it is no less inappropriate and disrespectful to the authors of the original text.

The anime industry has recently come under fire for that sort of localization vandalism, too. Apparently it's gotten so bad that people will celebrate when a human translator is fired from the anime industry and replaced with an AI.
MadWolf Jan 9
hi
if you are going to let AI systems use copyrighted content, then it should also be OK for the ReactOS and Wine teams to use leaked Windows source code. If they did that, Microsoft would DMCA strike the projects faster than you can say Microsoft.

The problems with GitHub Copilot are: 1. AI models getting trained on source code that is source-available but not open source, for example the Windows Research Kernel.

2. Having a project on GitHub without the option to stop AI models training on it. And who gets the final decision? The project lead, or is it like trying to change a project's license, where you need most of the contributors to agree?
Quoting: Salvatos: Quoting: NathanaelKStottlemyer: ...it's not hard to write anybody can do it, and it's the pennical of laziness to...

*Pinnacle ;)

Clearly I'm not a robot.