Web Archives Under Siege: The Battle Over AI’s Use of Copyrighted Content

Published June 14, 2024

Kate Knibbs

Danish News Organizations Challenge Common Crawl Over Use of AI Training Material

Media groups in Denmark have called on the nonprofit internet archive, Common Crawl, to delete their articles from previous collections and to cease gathering content from their sites at once. This action comes in response to increasing frustration with how entities such as OpenAI utilize copyrighted content.

Common Crawl plans to comply with the request, which was first made on Monday. According to Executive Director Rich Skrenta, the organization lacks the resources to fight legal battles with media companies and publishers.

The Danish Rights Alliance (DRA), a group that represents copyright holders in Denmark, led the initiative. It submitted the request on behalf of four media companies, among them Berlingske Media and the daily Jyllands-Posten. The New York Times had previously made a similar request to Common Crawl, before suing OpenAI over unauthorized use of its content. In its legal filing, the Times emphasized that Common Crawl's data was the most heavily weighted data set in GPT-3.

Thomas Heldrup, who leads the DRA's content protection and enforcement division, says the initiative was inspired by the Times. "What sets Common Crawl apart is its popularity among major AI firms for their data needs," Heldrup says. He sees its vast database as an obstacle for media organizations trying to negotiate with AI giants.

Despite playing a crucial role in the development of many text-based AI tools, Common Crawl was not originally created with artificial intelligence in mind. Established in 2007 and based in San Francisco, the organization was best known as a research resource before the AI boom. "Common Crawl finds itself at the center of the ongoing debate around copyright issues and the use of generative AI," notes Stefan Baack, a data analyst at the Mozilla Foundation who recently published an analysis of how Common Crawl contributes to the training of AI systems. "For a long time, it remained a relatively obscure project that few people were aware of."

Before 2023, Common Crawl had never been asked to remove data. Recently, alongside the requests from the New York Times and the group of Danish publishers, it has begun handling a growing number of redaction requests that have not been made public.

As removal requests have surged, Common Crawl's web crawler, CCBot, is also facing more obstacles in gathering new material from publishers. Originality AI, a startup specializing in AI detection, reports that over 44 percent of leading global news and media websites block CCBot. Although BuzzFeed began blocking it in 2018, many other notable publishers, such as Reuters, the Washington Post, and the CBC, have started to do so only over the past year. According to Baack, such blocking is becoming more common.

Common Crawl's swift response to such demands is influenced by the challenges of maintaining a modest-sized nonprofit organization. However, adherence to these requests does not imply concurrence with the underlying principles. Skrenta views the pressure to delete historical content from databases like Common Crawl as a direct attack on the current state of the internet. "This represents a fundamental danger," he asserts. "It's a threat that could destroy the concept of an open web."

He's not the only one worried. "The attempts to wipe out internet history, particularly news stories, deeply disturb me," says Jeff Jarvis, a journalism professor and fervent advocate for Common Crawl. "It's referenced in over 10,000 scholarly articles. It's an immensely useful tool." Common Crawl collects recent examples of studies that have used its data; among the latest are an analysis of web censorship in Turkmenistan and a study aimed at improving the detection of online fraud.

Common Crawl's transformation from a niche resource, cherished by tech enthusiasts and overlooked by nearly everyone else, into a controversial supplier of data for AI projects reflects a broader dispute over copyright and the open internet. A growing number of publishers, along with artists, authors, and other creatives, are opposing web crawling and scraping, even when the work is not for profit, as with Common Crawl's ongoing project. Any initiative that might serve as a source of data for artificial intelligence is being closely examined.

Amid numerous lawsuits accusing leading generative AI companies of copyright violations, copyright advocates are also pushing for laws that would restrict the use of training data and oblige AI firms to pay for the data they use. Increased scrutiny of Common Crawl and other widely used data sets such as LAION-5B has revealed that, in sweeping up so much of the web, these collections have unintentionally gathered content from some of the internet's most objectionable corners. (In December 2023, LAION-5B was temporarily taken offline following a Stanford University investigation, which found child sexual abuse material in the data set.)

The Danish Rights Alliance takes an assertive stance on AI and copyright infringement. At the beginning of the year, it launched a campaign of DMCA takedown requests, which notify companies of potentially infringing content on their platforms, targeting cases where book publishers' material was shared on OpenAI's GPT Store without authorization. In the previous year, it was at the forefront of efforts to remove a widely used generative AI data set named Books3 from the web. Danish media's collective action against the unauthorized use of their content for AI training is notably coordinated. A group comprising leading newspapers and television networks has recently moved to take legal action against OpenAI, demanding financial compensation for the use of their content in AI training data sets.

Should a substantial number of publishers and media organizations decide against participating in Common Crawl, the repercussions could extend broadly, affecting scholarly work across various fields. Additionally, Baack suggests, such a move may lead to unintended outcomes. He believes that discontinuing Common Crawl could disproportionately affect newcomers and smaller initiatives, as well as the academic community, solidifying the positions of current major players and making the landscape more rigid. "In the event that Common Crawl becomes too compromised to serve as an effective source of training data, it would likely result in a scenario where OpenAI and other top AI firms are further empowered," he asserts. "After all, these entities possess the means to conduct their own web crawls."
