Perplexity AI Is Lying about Their User Agent rknight.me/blog/perplexity-ai-

Shocking absolutely no one, an AI company, Perplexity AI, isn't sending the user agent string they say they will, and they completely ignore robots.txt.

@robb I’d assume the perplexity bot agent is what they use to do mass crawling for training.

And then user actions are sent as coming from the user directly (which given ‘user agent’ literally refers to the software acting on the user’s behalf, seems right to me).

@iKyle But the "software" is Perplexity, so it should have the user agent they say it should have. And that same software should be respecting robots.txt.

OpenAI does this correctly.

@robb I agree with the user agent string but not about the robots.txt.

robots.txt is for automated crawling like a search engine.

Your web browser ignores that file because it’s not a robot. It’s acting for you, the user, at your direct request, so it loads those pages fine.

So the automated crawling should be blocked. But if I tell the software, "look at this URL and tell me what it says", it should ignore the robots.txt and do what I say, like my web browser does.

@iKyle I think we disagree fundamentally on what Perplexity is. It’s a search engine first and foremost and it's ignoring robots.txt. I wouldn't expect my site to show up on Google if I'd blocked it just because someone asked about it.

@robb @iKyle I came here to make the same point Kyle did. I have no idea if these folks are being genuine or not, but if they are then you could still observe this same behavior. An explicit request to summarize a site is different than indexing it. And the part about asking the AI how this supposed violation was possible is silly because we know how LLMs work and it wasn’t explaining anything about how it works but just stringing words together that often show up in the same place.

@DavidAnson Fair enough about the asking it about it.

The software isn’t identifying itself as it should. It doesn’t matter if it’s for indexing, AI scraping, or a user request. They say they do it but they don’t. It’s as simple as that.

@DavidAnson And they still indexed my site when robots.txt said they couldn’t.

@robb How do you know they indexed that page? All I saw was a request for information about it which would not require it to have been previously indexed. And answering the question posed would require bypassing robots.txt, which is exactly what a person trying to answer that question would have done.

@DavidAnson I didn’t say they indexed that page but when I searched my name they had loads of my other pages.

It doesn’t “require bypassing robots.txt”. They should be sending the correct user agent, get denied access, and tell the user that.

A person viewing my site as intended is completely different to an AI scraping it and summarising it for their own gain.
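
The flow described above (identify yourself honestly, get denied, tell the user) can be sketched with Python's standard `urllib.robotparser`. This is only an illustration, not how Perplexity or any other service actually works: the robots.txt contents, the `PerplexityBot` token, the `example.com` URL, and the `check_access` helper are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named bot, allow everyone else.
# The "PerplexityBot" token is an assumption for this sketch.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def check_access(robots_txt: str, user_agent: str, url: str) -> str:
    """Return a message saying whether user_agent may fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text so the sketch needs no network access.
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(user_agent, url):
        return f"allowed: {user_agent} may fetch {url}"
    # A well-behaved client would stop here and report this to the user.
    return f"denied: robots.txt disallows {user_agent} from fetching {url}"


if __name__ == "__main__":
    url = "https://example.com/blog/"
    # Honest identification: the rule applies and the fetch is refused.
    print(check_access(ROBOTS_TXT, "PerplexityBot", url))
    # A generic browser-like string falls through to the wildcard rule.
    print(check_access(ROBOTS_TXT, "Mozilla/5.0", url))
```

The second call shows why the user agent string matters to this argument: robots.txt rules are matched by name, so a client that presents a generic browser-like string is never matched by the rule written for it.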


@robb @DavidAnson

An interesting similar (but different in likely significant ways!) situation is a browser with an adblocker that omits any reference to the adblocker in its identification.
