Web Archives Under Siege: The Battle Over AI’s Use of Copyrighted Content

Kate Knibbs

Danish News Organizations Challenge Common Crawl Over Use of AI Training Material

Media groups in Denmark have called on Common Crawl, the nonprofit web archive, to delete their articles from its past collections and to stop crawling their sites immediately. The demand comes amid growing frustration with how companies such as OpenAI use copyrighted content.

Common Crawl plans to comply with the request, which was first made on Monday. According to Executive Director Rich Skrenta, the organization lacks the resources to fight media companies and publishers in court.

The Danish Rights Alliance (DRA), a group that represents copyright owners in Denmark, led the initiative. It submitted the request on behalf of four media companies, among them Berlingske Media and the daily Jyllands-Posten. The New York Times had made a similar request to Common Crawl before suing OpenAI for unauthorized use of its content. In its legal filing, the Times stressed that Common Crawl's data was the most highly weighted data set in GPT-3.

Thomas Heldrup, who leads the DRA's content protection and enforcement division, says the initiative was inspired by the Times. "What sets Common Crawl apart is its popularity among major AI firms for their data needs," Heldrup states. He sees its vast archive as an obstacle for media organizations trying to negotiate with AI giants.

Although it has played a crucial role in the development of many text-based AI tools, Common Crawl wasn't created with artificial intelligence in mind. Founded in 2007 and based in San Francisco, the organization was best known as a research resource before the AI boom. "Common Crawl finds itself at the center of the ongoing debate around copyright issues and the use of generative AI," notes Stefan Baack, a data analyst at the Mozilla Foundation who recently produced an analysis of how Common Crawl contributes to the training of AI systems. "For a long time, it remained a relatively obscure project that few people were aware of."

Before 2023, Common Crawl had never been asked to remove data. Now, in addition to the requests from The New York Times and the group of Danish publishers, it is handling a growing number of redaction requests that have not been made public.

As removal requests have surged, Common Crawl's web crawler, CCBot, is also facing more obstacles when it tries to gather new content from publishers. Originality AI, a startup that specializes in detecting AI-generated content, reports that more than 44 percent of leading global news and media websites block CCBot. BuzzFeed began blocking it in 2018, but many other prominent publishers, including Reuters, the Washington Post, and the CBC, have started doing so only within the past year. According to Baack, such blocking is becoming more common.
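For readers curious about the mechanics: sites typically turn crawlers away through a robots.txt file, which names a crawler by its user-agent string and lists the paths it may not fetch. The article does not describe how any particular publisher blocks CCBot, so the following is only a minimal sketch under stated assumptions. The robots.txt content, the domain news.example, and the helper is_blocked are all hypothetical; only Python's standard urllib.robotparser module is real.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher; not taken from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Illustrative URL on a reserved example domain.
STORY_URL = "https://news.example/story"

ccbot_blocked = is_blocked(SAMPLE_ROBOTS_TXT, "CCBot", STORY_URL)
other_blocked = is_blocked(SAMPLE_ROBOTS_TXT, "SomeOtherBot", STORY_URL)

print("CCBot blocked:", ccbot_blocked)
print("Other crawler blocked:", other_blocked)
```

A rule like this only works if the crawler chooses to honor it, and it affects future crawls only. It does nothing about copies already stored in past collections, which is why publishers are also sending removal requests.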

Common Crawl's quick compliance reflects the realities of running a small nonprofit. Complying with these requests, however, does not mean agreeing with the reasoning behind them. Skrenta sees the push to delete archival content from repositories like Common Crawl as a direct attack on the internet as we know it. "This represents a fundamental danger," he asserts. "It's a threat that could destroy the concept of an open web."

He's not the only one worried. "The attempts to wipe out internet history, particularly news stories, deeply disturb me," states Jeff Jarvis, a journalism professor and fervent advocate for Common Crawl. "It's referenced in over 10,000 scholarly articles. It's an immensely useful tool." Common Crawl collects recent examples of research that has used its data; among the latest are an analysis of web censorship in Turkmenistan and a study aimed at improving the detection of online fraud.

Common Crawl's transformation from a niche resource, cherished by tech enthusiasts and overlooked by nearly everyone else, into a controversial supplier of AI training data reflects a broader dispute over copyright and the open internet. A growing number of publishers, along with artists, authors, and other creatives, oppose web crawling and scraping, even when the work is not for profit, as with Common Crawl's ongoing project. Any initiative that might serve as a source of data for artificial intelligence is now under close scrutiny.

Alongside the many lawsuits accusing leading generative AI companies of copyright violations, copyright advocates are pushing for laws that would restrict the use of training data and require AI firms to pay for the data they use. Increased scrutiny of Common Crawl and other widely used data sets such as LAION-5B has revealed that, in sweeping up so much of the web, these collections have unintentionally gathered content from some of its most objectionable corners. (In December 2023, LAION-5B was temporarily taken offline after a Stanford University investigation found child sexual abuse material in the data set.)

The Danish Rights Alliance takes an assertive stance on AI and copyright infringement. Earlier this year, it led a campaign to file DMCA takedown requests, which alert companies to potentially infringing content on their platforms, after book publishers' materials were shared on OpenAI's GPT Store without authorization. Last year, it led the effort to remove a widely used generative AI data set named Books3 from the web. Danish media are notably well coordinated in opposing the unauthorized use of their content to train AI. A group of leading newspapers and television networks recently moved to take legal action against OpenAI, demanding compensation for the use of their content in AI training data sets.

If a substantial number of publishers and media organizations opt out of Common Crawl, the repercussions could be broad, affecting scholarly work across many fields. Baack suggests it could also have unintended consequences. He believes that an end to Common Crawl would disproportionately hurt newcomers, smaller projects, and academics, while entrenching the current major players. "In the event that Common Crawl becomes too compromised to serve as an effective source of training data, it would likely result in a scenario where OpenAI and other top AI firms are further empowered," he asserts. "After all, these entities possess the means to conduct their own web crawls."

© 2024 Condé Nast. All rights reserved. Purchases made through our site may generate a commission for WIRED as part of our Affiliate Partnerships with retail companies. Any content from this site cannot be copied, shared, transmitted, or used in any form without explicit written consent from Condé Nast. Ad Choices
