Amazon Probes Perplexity AI Amid Allegations of Web Scraping Abuse and Ignoring Robots Exclusion Protocol
Andrew Couts and Dhruv Mehrotra
Amazon Probes Claims Against Perplexity for Alleged Scraping Misconduct
The cloud computing arm of Amazon is conducting a probe into Perplexity AI, following allegations that the AI-focused startup may have breached Amazon Web Services' policies by harvesting data from websites that have sought to block such actions, according to a report by WIRED.
A spokesperson from AWS, speaking to WIRED anonymously, verified the company's probe into Perplexity. Earlier, WIRED uncovered that the startup, supported financially by Jeff Bezos' family fund and Nvidia and currently valued at $3 billion, seems to use material from websites it wasn't supposed to access, as they were protected by the Robots Exclusion Protocol, a widely recognized standard on the web. Although the protocol itself doesn't carry legal weight, the terms of service typically do.
The Robots Exclusion Protocol, a longstanding internet norm, involves placing a plain text file (such as wired.com/robots.txt) on a website to specify which pages are off-limits to automated bots and search engine crawlers. Although companies that use scrapers can choose to ignore the standard, most have traditionally honored it. An Amazon representative told WIRED that AWS customers who run web crawlers are required to comply with robots.txt directives.
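To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard `urllib.robotparser` module. The rules, the `ExampleBot` name, and the example.com URLs are hypothetical and chosen only for illustration; they are not the contents of any real publisher's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/story/some-article"
    # The named bot is disallowed everywhere by the first rule group.
    print(is_allowed(ROBOTS_TXT, "PerplexityBot", article))
    # Other bots fall through to the wildcard group, which only blocks /private/.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", article))
```

Nothing in the protocol enforces the answer. A crawler that never performs this check, or ignores the result, can still request the page.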
A spokesperson stated that AWS's policies strictly forbid clients from engaging in unlawful activities using its services, and that clients are responsible for adhering to these policies as well as all relevant legislation.
Investigations into Perplexity's operations were sparked by a report from Forbes on June 11, which alleged that the startup had appropriated at least one article from them. Further inquiries by WIRED corroborated these allegations, uncovering more instances of content scraping and plagiarism linked to Perplexity’s AI-driven search chatbot. To prevent Perplexity from accessing its content, Condé Nast, the parent company of WIRED, implemented a block on Perplexity’s web crawler on all its sites through a robots.txt file. Nonetheless, WIRED discovered that the startup was still able to access its sites through an unpublished IP address—44.221.181.252—having visited Condé Nast digital properties potentially hundreds of times over the last three months, presumably to continue scraping content from these sites.
The device linked to Perplexity seems to be extensively scanning news platforms that prohibit robots from retrieving their information. Representatives from The Guardian, Forbes, and The New York Times have also noticed the IP address frequently accessing their systems.
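Site operators typically notice this kind of traffic in their server access logs. The article does not describe how any of the publishers reviewed theirs. As a rough sketch, assuming a log format where the client IP is the first whitespace-separated field, requests per address could be tallied as follows. The sample lines use reserved documentation addresses and are invented for illustration; they are not drawn from any real logs.

```python
from collections import Counter


def count_hits_by_ip(log_lines):
    """Count requests per client IP, assuming the IP is the first field on each line."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


# Invented sample entries using documentation-only IP ranges.
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Jan/2024:00:00:01 +0000] "GET /story/a HTTP/1.1" 200 512',
    '198.51.100.23 - - [01/Jan/2024:00:00:02 +0000] "GET /story/b HTTP/1.1" 200 734',
    '203.0.113.7 - - [01/Jan/2024:00:00:05 +0000] "GET /story/c HTTP/1.1" 200 611',
]

if __name__ == "__main__":
    for ip, hits in count_hits_by_ip(SAMPLE_LOG).most_common():
        print(ip, hits)
```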
WIRED traced the IP address to an Elastic Compute Cloud (EC2) instance running on AWS. Amazon began its investigation after WIRED asked whether using AWS infrastructure to harvest data from websites that prohibit such actions breached the company's service agreement.
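The article does not say how WIRED traced the address. One common approach is to check whether an IP falls inside the address blocks a cloud provider publishes for its services (AWS publishes its ranges as a JSON file). The sketch below assumes the prefixes have already been obtained as a list of CIDR strings. The prefixes and address shown are reserved documentation ranges used as placeholders, not real AWS ranges.

```python
import ipaddress


def find_matching_prefix(ip, prefixes):
    """Return the first CIDR prefix that contains ip, or None if none match."""
    addr = ipaddress.ip_address(ip)
    for prefix in prefixes:
        if addr in ipaddress.ip_network(prefix):
            return prefix
    return None


# Placeholder prefixes (documentation ranges), standing in for a provider's published list.
EXAMPLE_PREFIXES = ["198.51.100.0/24", "203.0.113.0/24"]

if __name__ == "__main__":
    print(find_matching_prefix("203.0.113.42", EXAMPLE_PREFIXES))
    print(find_matching_prefix("192.0.2.1", EXAMPLE_PREFIXES))
```

A match only shows that the address belongs to the provider's network. It does not identify which customer was using it.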
Perplexity's chief executive, Aravind Srinivas, initially responded to WIRED's inquiries by saying the questions reflected a profound misunderstanding of how both Perplexity and the internet work. Later, in a conversation with Fast Company, Srinivas said that the unpublished IP address seen scraping content from Condé Nast's websites and a dummy site set up by WIRED was operated by an external service provider specializing in web crawling and indexing. He declined to name the provider, citing a confidentiality agreement. Asked whether he could instruct the third party to stop scraping WIRED's content, Srinivas replied, “It’s complicated.”
Sara Platnick, a spokesperson for Perplexity, told WIRED that the company had responded to Amazon's questions by Wednesday, and described the investigation as a normal process. Platnick said Perplexity did not alter its operations following Amazon's queries.
Platnick says that PerplexityBot, which runs on AWS, adheres to the guidelines set by robots.txt, and that the operations Perplexity controls do not breach any AWS service agreements. She notes, however, that there are rare instances where PerplexityBot will not follow robots.txt, specifically when a user directly inputs a URL into the system.
Platnick explains that entering a particular URL does not initiate web crawling. Instead, the agent acts on the user's behalf and fetches the URL directly, which she describes as equivalent to the user manually visiting the website, copying the article's content, and pasting it into the platform themselves.
This explanation of how Perplexity operates supports WIRED's discovery that its chatbot occasionally disregards robots.txt rules.
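The distinction Platnick draws can be illustrated with a short sketch. This is not Perplexity's code. It is a hypothetical decision function, with invented names and an invented robots.txt, showing a policy in which robots.txt is consulted for crawling but skipped when the URL came directly from a user.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the named bot from the whole site.
ROBOTS_TXT = "User-agent: PerplexityBot\nDisallow: /\n"


def may_fetch(url, user_agent, robots_txt, user_supplied_url):
    """Decide whether to fetch url under the policy described in the article.

    user_supplied_url is True when a person typed or pasted the URL,
    in which case robots.txt is not consulted at all.
    """
    if user_supplied_url:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/story/some-article"
    print(may_fetch(url, "PerplexityBot", ROBOTS_TXT, user_supplied_url=False))
    print(may_fetch(url, "PerplexityBot", ROBOTS_TXT, user_supplied_url=True))
```

Under such a policy, the same blocked page is skipped during crawling but retrieved on a user's request, which is the behavior publishers object to.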
Digital Content Next is a trade association for the digital content industry whose members include The New York Times, The Washington Post, and Condé Nast. Last year, the group proposed preliminary guidelines for managing generative AI technologies to safeguard against possible copyright infringement. Jason Kint, its CEO, told WIRED that if the accusations against Perplexity are true, the firm is breaching several of these guidelines.
Kint believes that AI firms ought to operate under the principle that they are not entitled to repurpose publishers' material without consent. He further notes that if Perplexity is bypassing the terms of service or robots.txt directives, this should serve as a major warning sign indicating potentially unauthorized activities.
© 2024 Condé Nast. All rights reserved. Purchases made through our website may result in WIRED receiving a share of the sale, as part of our Affiliate Agreements with retail partners. Reproduction, distribution, transmission, caching, or any other form of utilization of the material found on this website is strictly prohibited without explicit prior written consent from Condé Nast. Advertising Choices.