Web Archives Under Siege: The Battle Over AI’s Use of Copyrighted Content
Kate Knibbs
Danish News Organizations Challenge Common Crawl Over Use of AI Training Material
Danish media groups have asked the nonprofit web archive Common Crawl to remove their articles from its past data sets and to stop crawling their sites immediately. The request comes amid growing frustration over how companies such as OpenAI use copyrighted material.
Common Crawl plans to comply with the request, which was first made on Monday. Executive Director Rich Skrenta says the organization lacks the resources to fight media companies and publishers in court.
The Danish Rights Alliance (DRA), a group that represents copyright holders in Denmark, led the effort. It submitted the request on behalf of four media companies, including Berlingske Media and the daily newspaper Jyllands-Posten. The New York Times previously made a similar request of Common Crawl, before suing OpenAI for using its work without permission. In its complaint, the Times highlighted that Common Crawl's data was the most highly weighted data set in GPT-3.
Thomas Heldrup, who heads the DRA's content protection and enforcement division, says the effort was inspired by the Times. "What sets Common Crawl apart is its popularity among major AI firms for their data needs," Heldrup says. He sees its vast database as an obstacle for media organizations trying to negotiate with AI giants.
Although it has played a crucial role in the development of many text-based AI tools, Common Crawl was not created with artificial intelligence in mind. Founded in 2007 and based in San Francisco, the organization was best known as a research resource before the AI boom. "Common Crawl finds itself at the center of the ongoing debate around copyright issues and the use of generative AI," says Stefan Baack, a data analyst at the Mozilla Foundation who recently published an analysis of Common Crawl's role in training AI systems. "For a long time, it remained a relatively obscure project that few people were aware of."
Before 2023, Common Crawl had never received a request to remove data. Recently, in addition to the requests from the New York Times and the group of Danish publishers, it has been handling a growing number of redaction requests that have not been made public.
As removal requests have surged, Common Crawl's web crawler, CCBot, is also increasingly being blocked from gathering new data from publishers. According to Originality AI, a startup that specializes in AI detection, more than 44 percent of the top global news and media sites block CCBot. BuzzFeed began blocking it in 2018, but many other prominent publishers, including Reuters, the Washington Post, and the CBC, started doing so only in the past year. Baack says the blocks are becoming more frequent.
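One common way for a site to turn a crawler away is a robots.txt rule that names the crawler's user agent. The article does not say which mechanism these publishers use, so the following is only an illustrative sketch. The robots.txt contents, the example.com URL, and the helper name is_allowed are assumptions made for illustration; urllib.robotparser is part of Python's standard library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to turn away CCBot only.
# This is an assumed example, not any real publisher's file.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL, used only to exercise the rules above.
STORY_URL = "https://example.com/news/story"

ccbot_ok = is_allowed(ROBOTS_TXT, "CCBot", STORY_URL)
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", STORY_URL)

print("CCBot allowed:", ccbot_ok)
print("Other crawler allowed:", other_ok)
```

A rule like this depends on the crawler choosing to honor it, and it only affects future crawls; it does nothing to remove pages already collected in past archives, which is why removal requests are a separate matter.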
Common Crawl's quick compliance with such requests reflects the realities of running a small nonprofit. Compliance, however, does not mean agreement with the reasoning behind them. Skrenta sees the push to delete archival content from repositories like Common Crawl as a direct attack on the internet as it exists today. "This represents a fundamental danger," he says. "It's a threat that could destroy the concept of an open web."
He's not the only one worried. "The attempts to wipe out internet history, particularly news stories, deeply disturb me," says Jeff Jarvis, a journalism professor and fervent advocate for Common Crawl. "It's referenced in over 10,000 scholarly articles. It's an immensely useful tool." Common Crawl collects recent examples of studies that have used its data; among the latest are an analysis of web censorship in Turkmenistan and a study aimed at improving the detection of online fraud.
Common Crawl's shift from a niche resource, cherished by tech enthusiasts but largely unknown to the public, into a controversial helper for AI projects reflects a broader dispute over copyright and the open internet. A growing number of publishers, along with artists, writers, and other creatives, are pushing back against web crawling and scraping, even when the work is noncommercial, as Common Crawl's ongoing project is. Any project that might serve as a source of data for artificial intelligence is now under scrutiny.
Alongside the many lawsuits accusing leading generative AI companies of copyright infringement, copyright advocates are pushing for legislation that would restrict the use of training data and require AI companies to pay for the data they use. Closer scrutiny of Common Crawl and other widely used data sets such as LAION-5B has also revealed that, in collecting so much of the web, these data sets have inadvertently gathered content from some of the most objectionable corners of the internet. (In December 2023, LAION-5B was temporarily taken down after a Stanford University investigation found child sexual abuse material in the data set.)
The Danish Rights Alliance takes an assertive stance on AI and copyright. Earlier this year, it led a campaign to send DMCA takedown notices, which alert companies to potentially illegal content hosted on their platforms, over book publishers' material that had been shared without authorization on OpenAI's GPT Store. Last year, it led the effort to remove Books3, a widely used generative AI data set, from the web. Danish media are notably coordinated in fighting the unauthorized use of their content to train AI. A group of leading newspapers and television networks recently moved toward legal action against OpenAI, demanding financial compensation for the use of their content in AI training data sets.
If enough publishers and media organizations opt out of Common Crawl, the effects could be far-reaching for scholarly work across many fields. Such a move could also have unintended consequences, Baack says. He believes that shutting down Common Crawl would disproportionately hurt newcomers, smaller projects, and academics, while entrenching today's major players and making the field more rigid. "In the event that Common Crawl becomes too compromised to serve as an effective source of training data, it would likely result in a scenario where OpenAI and other top AI firms are further empowered," he says. "After all, these entities possess the means to conduct their own web crawls."
© 2024 Condé Nast. All rights reserved.