SIDs in URLs?

Posted: Sat May 19, 2018 1:49 pm
by Dwedit
A reality of the internet is that bots crawl message boards, but having phpBB SIDs in URLs messes with their ability to crawl the board. Is there any way to get rid of those from phpBB?

Re: SIDs in URLs?

Posted: Sat May 19, 2018 4:47 pm
by koitsu
Not sure if this is still the case (probably is), but phpBB used to only use sid=XXX as an HTTP parameter if using cookies failed. In other words: it prefers cookies, but falls back to using a session ID in the URL if it can't. There are several posts on the phpBB support forum describing this mechanism.

If some random spider/bot is picking up sid=XXX in URLs, then it's because it's not allowing or using cookies.
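
A rough sketch of that cookie-first, URL-fallback behaviour, in Python rather than phpBB's actual PHP. The function name add_sid_if_no_cookie and the client_has_cookie flag are made up for illustration; this is not phpBB's code, just the general idea as I understand it:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def add_sid_if_no_cookie(url, sid, client_has_cookie):
    """Hypothetical helper: only put the session ID in the URL when cookies failed."""
    if client_has_cookie:
        # The session travels in the cookie, so links stay clean.
        return url
    parts = urlsplit(url)
    # Drop any stale sid first, then append the current one.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "sid"]
    query.append(("sid", sid))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

A bot that never sends the cookie back takes the second branch on every request, which is why it sees sid=XXX on every link.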

And yes, this is one of many problems when it comes to bots crawling phpBB forums. The other is that they often get stuck in an infinite loop downloading everything. Generally speaking, keeping bots away from phpBB through robots.txt is the more commonplace approach.
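
For the robots.txt route, here is a minimal sketch that checks what a blanket rule does, using Python's standard urllib.robotparser. The rules, the bot name, and the forum.example URL are placeholders, not anything taken from a real board, and of course only well-behaved bots honour robots.txt at all:

```python
from urllib.robotparser import RobotFileParser

# Example rules only: tell every crawler to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def may_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

blocked = not may_crawl(ROBOTS_TXT, "SomeBot", "https://forum.example/viewtopic.php?t=1")
```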

Re: SIDs in URLs?

Posted: Sat May 19, 2018 6:10 pm
by tepples
And for those bots deemed too useful to exclude, such as Google, Bing, Internet Archive, and whatever feeds into DuckDuckGo, try these in no particular order:

1. Make sure the board software is issuing a proper absolute URL in <link rel="canonical">.
2. Hardcode their user agents into the board software as not eligible to begin a session.
3. Try turning off session.use_trans_sid.
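
Items 1 and 2 could look roughly like the following. This is a Python sketch under my own assumptions, not board code: canonical_url, may_start_session, and the CRAWLER_TOKENS list are hypothetical names, and the token list is only a sample that would need to be maintained by hand.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Sample user-agent substrings for crawlers worth letting in (illustrative, not exhaustive).
CRAWLER_TOKENS = ("googlebot", "bingbot", "archive.org_bot", "duckduckbot")

def canonical_url(url):
    """Return url with the sid parameter removed, for use in <link rel="canonical">."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "sid"]
    return urlunsplit(parts._replace(query=urlencode(query)))

def may_start_session(user_agent):
    """Return False for known crawlers, so they are never handed a session ID."""
    ua = user_agent.lower()
    return not any(token in ua for token in CRAWLER_TOKENS)
```

Matching on the User-Agent string is easy to spoof, so this only keeps honest crawlers out of the session machinery; it is not any kind of access control.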