SIDs in URLs?

Found an issue with the phpBB system here at NESdev? Use this forum to report problems.

Moderator: Moderators

Dwedit
Posts: 4236
Joined: Fri Nov 19, 2004 7:35 pm

SIDs in URLs?

Post by Dwedit » Sat May 19, 2018 1:49 pm

A reality of the internet is that bots crawl message boards, but having phpBB SIDs in URLs messes with their ability to crawl the board. Is there any way to get rid of those in phpBB?
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!

koitsu
Posts: 4216
Joined: Sun Sep 19, 2004 9:28 pm
Location: A world gone mad

Re: SIDs in URLs?

Post by koitsu » Sat May 19, 2018 4:47 pm

Not sure if this is still the case (probably is), but phpBB used to only use sid=XXX as a URL query parameter if using cookies failed. In other words: it prefers cookies, but falls back to putting the session ID in the URL if it can't use them. There are several posts on the phpBB support forum describing this mechanism.

If some random spider/bot is picking up sid=XXX in URLs, then it's because it's not allowing or using cookies.
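
To make the fallback concrete, here is a rough sketch of the idea in Python. This is not phpBB's actual code; the function name `build_link` and the `client_sent_cookie` flag are made up for illustration, and the real board does this in PHP.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def build_link(url: str, session_id: str, client_sent_cookie: bool) -> str:
    """Hypothetical helper: return a board link, adding sid only when needed.

    If the client sent the session cookie back, the link stays clean.
    Otherwise the session ID is carried in the query string instead.
    """
    if client_sent_cookie:
        return url

    parts = urlsplit(url)
    # Drop any stale sid already in the link, then append the current one.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "sid"]
    query.append(("sid", session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # A browser that accepts cookies gets a clean link.
    print(build_link("viewtopic.php?f=2&t=100", "abc123", client_sent_cookie=True))
    # A bot that ignores cookies gets the session ID in the URL.
    print(build_link("viewtopic.php?f=2&t=100", "abc123", client_sent_cookie=False))
```

A crawler that never returns cookies always takes the second path, so every page it fetches hands it links with a sid attached.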

And yes, this is one of many problems when it comes to bots crawling phpBB forums. The other is that they often get stuck in an infinite loop downloading everything. Generally speaking, blocking bots from phpBB through robots.txt is the more common approach.
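
For reference, this is roughly how a well-behaved crawler reads such a rule, using Python's standard `urllib.robotparser`. The `/forums/` path and the example.com URLs are assumptions for the sake of the sketch, not this board's actual layout, and robots.txt only helps with bots that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers out of the forum directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forums/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("ExampleBot", "https://example.com/forums/viewtopic.php?t=1"))
    print(allowed("ExampleBot", "https://example.com/index.html"))
```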

tepples
Posts: 21752
Joined: Sun Sep 19, 2004 11:12 pm
Location: NE Indiana, USA (NTSC)

Re: SIDs in URLs?

Post by tepples » Sat May 19, 2018 6:10 pm

And for those bots deemed too useful to exclude, such as Google, Bing, Internet Archive, and whatever feeds into DuckDuckGo, try these in no particular order:

1. Make sure the board software is issuing a proper absolute URL in <link rel="canonical">.
2. Hardcode their user agents into the board software as not eligible to begin a session.
3. Try turning off session.use_trans_sid.
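
A rough Python sketch of what items 1 and 2 amount to, in case it helps. Everything here is illustrative: `BOARD_BASE`, the function names, and the crawler token list are assumptions, the token list is nowhere near exhaustive, and a real phpBB change would be made in the board's PHP code.

```python
import html
from urllib.parse import urljoin, urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical board root, used to turn relative links into absolute ones.
BOARD_BASE = "https://forums.example.com/"

# Illustrative substrings of crawler user agents; not a complete list.
CRAWLER_TOKENS = ("googlebot", "bingbot", "archive.org_bot", "duckduckbot")


def canonical_url(request_url: str) -> str:
    """Return an absolute URL with the sid parameter and fragment removed."""
    parts = urlsplit(urljoin(BOARD_BASE, request_url))
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "sid"]
    return urlunsplit(parts._replace(query=urlencode(query), fragment=""))


def canonical_link_tag(request_url: str) -> str:
    """Build the <link rel="canonical"> element for a requested page."""
    href = html.escape(canonical_url(request_url), quote=True)
    return '<link rel="canonical" href="%s">' % href


def may_start_session(user_agent: str) -> bool:
    """Return False for user agents matching a known crawler token."""
    ua = user_agent.lower()
    return not any(token in ua for token in CRAWLER_TOKENS)


if __name__ == "__main__":
    print(canonical_link_tag("viewtopic.php?f=2&t=100&sid=abc123"))
    print(may_start_session("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
```

With the canonical link in place, a crawler that still picks up sid URLs is at least told which sid-free URL to index.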
