PostPosted: Sat Mar 29, 2008 4:37 pm 

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
tepples wrote:
koitsu wrote:
There's a **lot** of people who have done this over the years; it's why Memblers put the "Do not download full copies of the site through the webserver. Use the FTP mirror" note on the main page. That ZIP file is updated weekly, and automatically.

As far as I can tell, the forum is the most interesting part of the site, especially because the front page is years out of date. But I just downloaded all 70 MB of the site's archive over FTP five minutes ago, and a static copy of /bbs/ isn't in there.


Correct, it's not there, because there's no easy way to archive it in a user-friendly form, nor an easy way to automate the process. The web board isn't stored in flat files; it's all SQL-based. What exactly do you want me to do about it?

If I remove robots.txt, are you willing to pay the 95th-percentile overusage fees from my co-location provider? Because honestly that's the only way this is going to work in a way that makes you happy.
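
(For anyone not familiar with it, 95th-percentile billing works roughly like this -- the numbers below are made up purely for illustration:)

Code:
# Rough sketch of 95th-percentile ("burstable") billing, as co-location
# providers typically compute it: traffic is sampled every 5 minutes for the
# month, the samples are sorted, the top 5% are thrown away, and the highest
# remaining sample is what you pay for.
samples_mbps = [3.2, 4.1, 2.8, 55.0, 3.9, 4.4, 3.1, 3.7, 60.2, 3.3]
# a real month is ~8640 samples; ten made-up ones are enough to show the idea
samples_mbps.sort()
cutoff = int(len(samples_mbps) * 0.95)      # drop the top 5% of samples
billable_mbps = samples_mbps[cutoff - 1]    # highest remaining sample is billed
print(billable_mbps)                        # -> 55.0: one short spike is "free",
                                            #    a sustained crawl is not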


PostPosted: Sun Mar 30, 2008 5:08 am 

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19100
Location: NE Indiana, USA (NTSC)
There are ways to automate turning the SQL into static data, and I know some boards do this for some sort of "lo-fi version" of each thread that gets sent to the search engines.

But at this point, never mind.


PostPosted: Sun Mar 30, 2008 3:35 pm 

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
tepples wrote:
There are ways to automate turning the SQL into static data, and I know some boards do this for some sort of "lo-fi version" of each thread that gets sent to the search engines.

But at this point, never mind.


Yep, I'm well aware that there are ways to do what you've described, but 1) I doubt the existing board software here has native capability for such (or, if it does, in a way that can be run automatically on a nightly basis), and 2) I'm unaware of any software that can scrape the SQL tables of an old phpBB forum and create static files from them. If you know of such software, awesome -- or if you feel like writing it, equally awesome -- it sounds like a good project for you to take on. Pass the idea by Memblers; he runs the site, I just run the server / network. :)
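
(Just to sketch the shape of it, something like the script below is roughly what that scraper would have to do: walk the topic and post tables and spit out one flat "lo-fi" HTML file per thread. The table names are the phpBB 2.x defaults as far as I know, and the credentials, paths, and driver (pymysql) are placeholders, not how this server is actually set up.)

Code:
#!/usr/bin/env python
# Sketch of a phpBB 2.x -> static HTML dumper.  Table/column names follow
# the phpBB 2.x defaults; credentials and output paths are placeholders.
import html
import os
import pymysql  # any MySQL driver would do

OUT_DIR = "bbs-static"
db = pymysql.connect(host="localhost", user="phpbb", password="secret",
                     database="phpbb", charset="utf8mb4")
os.makedirs(OUT_DIR, exist_ok=True)

with db.cursor() as cur:
    cur.execute("SELECT topic_id, topic_title FROM phpbb_topics")
    for topic_id, title in cur.fetchall():
        # every post in the topic, oldest first, with its author
        cur.execute(
            "SELECT u.username, p.post_time, t.post_text "
            "FROM phpbb_posts p "
            "JOIN phpbb_posts_text t ON t.post_id = p.post_id "
            "JOIN phpbb_users u ON u.user_id = p.poster_id "
            "WHERE p.topic_id = %s ORDER BY p.post_time", (topic_id,))
        rows = cur.fetchall()
        # one plain HTML file per topic
        with open(os.path.join(OUT_DIR, "topic%d.html" % topic_id),
                  "w", encoding="utf-8") as f:
            f.write("<html><body><h1>%s</h1>\n" % html.escape(title or ""))
            for username, post_time, text in rows:
                f.write("<p><b>%s</b> (%s):<br>%s</p>\n"
                        % (html.escape(username), post_time,
                           html.escape(text or "")))
            f.write("</body></html>\n")
db.close()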

I would be more than happy to incorporate the data into the cronjob we have that makes the weekly ZIP (or even offer a separate ZIP file for just the board).
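
(Something along these lines is all it would take to hook in -- the schedule, paths, and script name below are placeholders, not the actual cronjob:)

Code:
# hypothetical crontab entry -- schedule, paths, and script name are placeholders
# regenerate the static copy of the board, then rebuild a board-only ZIP
30 3 * * 0  cd /var/www && python /home/nesdev/bin/dump_bbs_static.py && zip -qr mirror/bbs-static.zip bbs-static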

The only reason I'm offering the weekly ZIP is because Memblers and I both felt it might keep people from downloading the entire site's content on a regular basis. It has helped -- there's even someone who mirrors that ZIP file weekly (they fetch it once a week, a few hours after we automatically update it). But sadly it doesn't stop people who run leech software (not to mention the ones which forge their User-Agent!) against http://nesdev.com/ and then don't bother to look at what the client is doing for 24+ hours (since they all get stuck in infinite loops once they hit the web boards).

EDIT: I thought about this a little bit more, and I'm going to try an experiment for you, tepples: I'll remove the robots.txt for the next 3-4 weeks and watch our bandwidth usage closely. I put the robots.txt in place back in February of 2007, so it's possible Google and others have fixed their crawling software within the past year. If things get bad I'll obviously put it back, but I'm willing to give it a shot for now.

Let me know how things look in a few days (I forget how often Google runs their scrapes).

P.S. -- Looks like they finally stopped: http://jdc.parodius.com/lj/china_incident/dropped_packets_0330_1523.png


PostPosted: Tue May 06, 2008 7:40 pm 

Joined: Fri Feb 29, 2008 10:35 am
Posts: 85
fwiw, you could block certain pages that are known to eat bandwidth, like viewtopic.php?p=* (and other things), and only let it archive forums and threads.
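
(A sketch of what that selective robots.txt could look like -- viewtopic.php?p= is from the suggestion above and /bbs/ is where the board lives; the other script names are guesses at what's worth blocking, not a tested config:)

Code:
# hypothetical selective robots.txt
User-agent: *
# per-post permalinks duplicate whole threads, so keep crawlers off them
Disallow: /bbs/viewtopic.php?p=
# pages that only burn bandwidth for a crawler
Disallow: /bbs/posting.php
Disallow: /bbs/search.php
Disallow: /bbs/profile.php
Disallow: /bbs/memberlist.php
# viewforum.php and viewtopic.php?t= stay crawlable, so forums and threads
# still get indexed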


PostPosted: Mon May 12, 2008 1:55 am 

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
Xkeeper wrote:
fwiw, you could block certain pages that are known to eat bandwidth, like viewtopic.php?p=* (and other things), and only let it archive forums and threads.


That doesn't work in the long term. The way things are currently set up works fine, and Google should be caching the forums (tepples can verify the robots.txt is gone). The amount of bandwidth this site gets is pretty astounding considering how "niche" it is. People don't understand how expensive bandwidth is.

