Silicon Valley Giants Accused of Using Swiped YouTube Content to Train AI: An In-Depth Investigation Reveals Thousands of Videos Harvested Without Permission

Apple, Nvidia, and Anthropic Utilized Thousands of Illegally Obtained YouTube Clips for AI Training

This article is a joint publication with Proof News.

Technology corporations are adopting contentious strategies to satisfy their artificial intelligence systems' voracious appetite for data, indiscriminately collecting information from books, websites, images, and social media content, frequently without the knowledge or consent of the original creators.

Investigations conducted by Proof News have revealed that many of the world’s richest AI firms have been discreetly utilizing content from thousands of YouTube videos to develop their artificial intelligence technologies. This practice persists even though it directly contravenes YouTube's policies regarding the unauthorized extraction of content from its site.

Our inquiry revealed that transcriptions from 173,536 YouTube clips, gathered across over 48,000 channels, were utilized by major Silicon Valley firms such as Anthropic, Nvidia, Apple, and Salesforce.

The collection, named YouTube Subtitles, comprises transcripts from videos belonging to educational and e-learning platforms such as Khan Academy, MIT, and Harvard. Additionally, content from The Wall Street Journal, NPR, and the BBC was utilized in AI training, along with material from The Late Show With Stephen Colbert, Last Week Tonight With John Oliver, and Jimmy Kimmel Live.

Investigations by Proof News uncovered content from top YouTube influencers, like MrBeast (boasting 289 million followers, with two videos used in training), Marques Brownlee (with a follower count of 19 million, contributing seven videos), Jacksepticeye (approaching 31 million subscribers, with 377 videos used), and PewDiePie (who has 111 million subscribers, with 337 of his videos utilized). Additionally, some of the content leveraged for AI training endorsed conspiracy theories, including the notion that the earth is flat.

Proof News developed a utility that allows for the discovery of creators within the YouTube AI training dataset.
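
How such a lookup might work can be illustrated with a minimal sketch. This is not Proof News's actual tool: the record layout, the sample data, and the function names `count_videos_by_channel` and `search_creators` are all hypothetical, chosen only to show the idea of counting a creator's videos in a list of dataset records.

```python
from collections import Counter

# Hypothetical records: (channel name, video ID) pairs standing in for
# dataset metadata. These are invented placeholders, not real entries.
SAMPLE_RECORDS = [
    ("Example Science Channel", "vid001"),
    ("Example Science Channel", "vid002"),
    ("Example News Show", "vid003"),
    ("Another Creator", "vid004"),
    ("Example Science Channel", "vid001"),  # duplicate row, counted once
]


def count_videos_by_channel(records):
    """Return a Counter mapping each channel to its number of distinct videos."""
    seen = set()
    counts = Counter()
    for channel, video_id in records:
        if (channel, video_id) in seen:
            continue  # skip repeated rows for the same video
        seen.add((channel, video_id))
        counts[channel] += 1
    return counts


def search_creators(records, query):
    """Case-insensitive substring search over channel names.

    Returns (channel, video count) pairs, most videos first, ties by name.
    """
    counts = count_videos_by_channel(records)
    needle = query.strip().lower()
    matches = [(name, n) for name, n in counts.items() if needle in name.lower()]
    return sorted(matches, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for channel, n in search_creators(SAMPLE_RECORDS, "example"):
        print(f"{channel}: {n} video(s)")
```

A real tool would load the dataset's metadata rather than a hard-coded list, but the counting and matching steps would likely look similar.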

David Pakman, the presenter of The David Pakman Show, a progressive political outlet boasting over 2 million followers and surpassing 2 billion views, noted, "Nobody approached me to ask for my permission to use this." He revealed that about 160 of his program's episodes were incorporated into the YouTube Subtitles training collection.

Pakman's business operates with a full-time staff of four, producing several videos daily along with a podcast, TikTok content, and material for various other channels. Pakman argues that if AI firms are being paid, he too deserves remuneration for the use of his data. He highlighted that certain media organizations have successfully negotiated deals to receive compensation for their content being used in AI training.

Pakman emphasized that producing this content is his source of income, one that requires an investment of time, finances, resources, and his team's effort. He pointed out that there is always plenty of work to be done.

Dave Wiskus, the chief executive officer of Nebula, a streaming platform co-owned by its contributors, described the situation as "theft." Some of the platform's creators have had their YouTube content used without permission to train artificial intelligence.

Wiskus said that it's "disrespectful" for creators' work to be used without their permission, particularly because studios might employ "generative AI to cut down on the number of artists involved."

Wiskus said he firmly believes this will be used to take advantage of and cause damage to artists.

Officials from EleutherAI, who developed the dataset, remained silent when asked to comment on Proof's discoveries, which included claims of unauthorized video usage. The organization's online platform articulates its primary aim as making AI development more accessible to individuals beyond the elite circles of major technology corporations, highlighting its track record of democratizing access to advanced AI technologies through the training and distribution of models.

YouTube Subtitles contains only the written text of videos' dialogue or commentary, frequently accompanied by translations into languages such as Japanese, German, and Arabic. It does not include any visual elements from the videos.

According to a research paper published by EleutherAI, the dataset is one part of a larger collection called the Pile, which the nonprofit organization has released. The Pile draws not only on YouTube but also on content from the European Parliament, the English version of Wikipedia, and a vast collection of emails from Enron Corporation workers, which became public during a governmental probe into the company.

The majority of the datasets from the Pile are available and can be accessed by anyone online who has sufficient storage and processing capabilities. Researchers and developers not affiliated with major tech companies also utilized the dataset, but they were not the sole users.

Tech giants such as Apple, Nvidia, and Salesforce, which boast market valuations reaching into the trillions, have detailed in various studies and articles their utilization of the Pile for AI training purposes. Additionally, evidence suggests Apple employed the Pile in developing OpenELM, a significant model they introduced in April, just prior to announcing enhanced AI features for its iPhone and MacBook lines. Moreover, both Bloomberg and Databricks have indicated in their own reports that they have used the Pile for training their models.

Anthropic, a prominent AI developer that has attracted a $4 billion investment from Amazon and emphasizes its commitment to "AI safety," also used the Pile.

Jennifer Martinez, a spokesperson for Anthropic, acknowledged in a formal statement that the company's AI assistant, Claude, makes use of the Pile, which contains a small selection of YouTube subtitle content. Martinez emphasized that using the Pile dataset is a separate matter from directly using YouTube's platform, which is governed by its own set of terms. Asked about possible breaches of YouTube's terms of service, Martinez suggested reaching out to the creators of the Pile for clarification.

Salesforce acknowledged employing the Pile for the development of an AI model aimed at "academic and research purposes." Caiming Xiong, who serves as the vice president of AI research at Salesforce, highlighted in a remark that the dataset was accessible to the public.

In 2022, Salesforce made the AI model publicly available, and it has been downloaded over 86,000 times, as per the information on its Hugging Face page. The Salesforce team, in their study, indicated that the Pile included not only profanity but also biases related to gender and particular religious communities, highlighting potential risks and safety issues. Proof News identified numerous instances of profanity in YouTube subtitles, along with examples of racial and gender derogatory remarks. When approached about these safety issues, a spokesperson from Salesforce did not offer any comments.

A spokesperson for Nvidia chose not to provide any comments. Meanwhile, representatives from Apple, Databricks, and Bloomberg remained silent, not replying to requests for a statement.

The Treasure Trove of YouTube Data

The battle among AI firms partly hinges on obtaining superior data, noted Jai Vipra, a specialist in AI policy and a CyberBRICS scholar at the Fundação Getulio Vargas Law School in Rio de Janeiro, Brazil. This is a key factor behind companies' tendency to guard their data sources jealously.

In the early months of this year, The New York Times revealed that Google, the parent company of YouTube, utilized text from videos on the platform to improve its algorithms. A representative responded to the publication by stating that this practice was allowed under the contracts with content creators on YouTube.

An investigation by The Times uncovered that OpenAI utilized YouTube videos without permission. When approached, representatives from the company did not verify or refute the newspaper's discoveries.

Top officials at OpenAI have consistently refrained from addressing inquiries in public forums regarding their use of YouTube content for the training of their AI tool, Sora, capable of generating videos based on textual instructions. A journalist from The Wall Street Journal earlier approached Mira Murati, the Chief Technology Officer of OpenAI, with this query.

"Actually, I'm uncertain," responded Murati.

Vipra emphasized that YouTube subtitles and other forms of speech-to-text data could serve as a valuable resource, helping to train models to mimic human conversation and speech patterns.

"The core issue remains," stated Dave Farina, creator of Professor Dave Explains, a YouTube channel with 3 million followers that offers chemistry and various science lessons. The channel had 140 of its videos included in YouTube Subtitles.

He stated, "If you're making money from efforts I've contributed to creating a product that could lead to unemployment for myself or others in my position, then we need to discuss potential compensation or regulatory measures."

Launched in 2020, YouTube Subtitles includes captions from over 12,000 videos that have since been removed from YouTube. In one instance, a creator deleted their entire online presence, yet their work has been incorporated into an unknown number of AI models.

Proof News made efforts to contact the proprietors of the channels mentioned in this report. Several did not reply to inquiries for their input. Among the creators who did engage in conversation, none had knowledge that their data had been accessed or the purposes for which it was utilized.

Among those caught off guard were the creators behind Crash Course (boasting close to 16 million followers and 871 videos created) and SciShow (with 8 million subscribers and 228 videos produced), key components of the educational video network established by siblings Hank and John Green.

"It's disheartening to discover that our carefully crafted educational materials have been utilized in such a manner without our permission," stated Julie Walsh Smith, CEO of the production firm Complexly, in a formal announcement.

YouTube Subtitles is not the first AI training dataset to cause concern within the creative industries.

Alex Reisner, a journalist who writes for Proof News, obtained a copy of Books3, another dataset within the Pile, and revealed in an article for The Atlantic last year that over 180,000 books, including works by authors such as Margaret Atwood, Michael Pollan, and Zadie Smith, had been copied without permission. Following this revelation, numerous writers have initiated lawsuits against AI firms for exploiting their creations without permission, citing breaches of copyright law. Similar legal actions have since multiplied, and Books3 has been removed from its hosting site.

In reaction to the legal challenges, companies like Meta, OpenAI, and Bloomberg have defended their behavior as falling under the umbrella of fair use. The lawsuit targeting EleutherAI, known for initially extracting content from books and sharing it publicly, was willingly dropped by those who brought the case forward.

Legal proceedings in the outstanding cases are still in the preliminary phases, leaving the issues of authorization and compensation unanswered. The Pile has been taken down from its original download platform, yet it continues to be accessible on file-sharing networks.

"Technology firms have bulldozed their way through," stated Amy Keller, a consumer rights lawyer and partner at DiCello Levitt, who has filed lawsuits on behalf of artists claiming their work was taken by AI companies without permission.

Keller said that the main issue is the lack of options available to individuals. "The heart of the problem lies in the absence of choice," she remarked.

Echoing a Parrot

Numerous creators experience doubt regarding their future direction.

Dedicated YouTubers are constantly on the lookout for illicit usage of their content, frequently submitting requests to remove such materials. There's a growing concern among them that it might not be long before artificial intelligence is capable of creating content that resembles theirs, or even directly replicates it.

While browsing TikTok, David Pakman, the host of The David Pakman Show, experienced the surprising capabilities of artificial intelligence firsthand. He stumbled upon what was presented as a Tucker Carlson video. However, upon viewing it, Pakman was shocked to find that although it sounded like Carlson, the content was an exact replication of his own words from his YouTube show, including the delivery style. What disturbed him further was that only a single commenter out of many seemed to notice that the clip was fabricated, with Carlson's voice being used to mimic Pakman's original dialogue.

Pakman expressed concerns about the counterfeit in a YouTube video. "This will pose a challenge," he said, emphasizing that this could be done to anyone.

Sid Black, a cofounder of EleutherAI, shared on GitHub that he developed YouTube Subtitles with a script. This script pulls subtitles directly from YouTube's API, mimicking the process a viewer's browser undergoes to display subtitles during video playback. GitHub's documentation reveals that Black employed 495 keywords to filter videos for this project. These keywords span a diverse range of topics, including "funny vloggers," "Einstein," and "black protestant," alongside others like "Protective Social Services," "infowars," "quantum chromodynamics," "Ben Shapiro," "Uighurs," "fruitarian," "cake recipe," "Nazca lines," and "flat earth."
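
The general shape of a keyword-driven collection script like the one described can be sketched as follows. This is an illustration only, not Black's code: `collect_transcripts`, `fake_search`, `fake_fetch`, and the sample data are hypothetical, and the search and download steps are replaced with offline stand-ins so that nothing here contacts any real service.

```python
from typing import Callable, Dict, Iterable, List


def collect_transcripts(
    keywords: Iterable[str],
    search: Callable[[str], List[str]],
    fetch_transcript: Callable[[str], str],
) -> Dict[str, str]:
    """Map each distinct video ID found via the keywords to its transcript."""
    transcripts: Dict[str, str] = {}
    for keyword in keywords:
        for video_id in search(keyword):
            if video_id in transcripts:
                continue  # the same video can match several keywords
            transcripts[video_id] = fetch_transcript(video_id)
    return transcripts


# Invented stand-in data so the sketch runs offline.
FAKE_INDEX = {
    "cake recipe": ["v1", "v2"],
    "flat earth": ["v2", "v3"],
}
FAKE_CAPTIONS = {
    "v1": "Preheat the oven.",
    "v2": "Welcome back to the channel.",
    "v3": "Look at the horizon.",
}


def fake_search(keyword: str) -> List[str]:
    """Hypothetical search step: return video IDs matching a keyword."""
    return FAKE_INDEX.get(keyword, [])


def fake_fetch(video_id: str) -> str:
    """Hypothetical download step: return the caption text for a video."""
    return FAKE_CAPTIONS[video_id]


if __name__ == "__main__":
    result = collect_transcripts(["cake recipe", "flat earth"], fake_search, fake_fetch)
    for video_id, text in result.items():
        print(video_id, "->", text)
```

The deduplication step matters because a single video can surface under more than one search term, as "v2" does in the stand-in data.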

Despite YouTube's policy banning the use of automated methods to access its content, over 2,000 GitHub members have shown support for or saved the code.

Machine learning engineer Jonas Depoix wrote in a GitHub discussion that YouTube has numerous methods at its disposal to stop this module from functioning if it wishes to do so, referring to the code Black used to retrieve YouTube captions. "Up to this point, such action has not been taken," he stated.

Depoix told Proof News via email that he has not used the code since he wrote it for a project as a university student. He expressed his astonishment that others found value in it. He declined to respond to inquiries regarding YouTube's regulations.

In response to an inquiry for feedback, Jack Malon, a representative for Google, stated via email that the tech giant has implemented measures over time to combat the unauthorized and harmful extraction of data. He did not address inquiries regarding how other firms might be utilizing the content for training purposes.

In the collection of videos leveraged by AI firms, there are 146 clips from Einstein Parrot, a popular channel that boasts close to 150,000 followers. Marcia, the person who looks after the African grey parrot and prefers not to disclose her surname to protect the well-known bird, initially found it amusing that AI algorithms had absorbed the speech patterns of a parrot known for its mimicry.

"Why would anyone choose to mimic a parrot's voice?" Marcia pondered. "However, I'm aware that his articulation is excellent. He imitates my voice precisely. So, he's echoing me, and subsequently, the AI is imitating the echo of the parrot."

After AI has absorbed information, it's impossible to erase that knowledge. Marcia found herself concerned about the myriad potential uses of her parrot's data, among them the creation of a virtual clone of her bird and the possibility of it being programmed to use profanity.

"We're venturing into unknown areas," Marcia noted.

© 2024 Condé Nast. All rights reserved.