All times are UTC - 7 hours

 Post subject: Why do we block Google?
PostPosted: Sun Mar 23, 2008 1:25 pm 

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19326
Location: NE Indiana, USA (NTSC)
The robots.txt file for this domain is currently configured to exclude almost every page, including the entire forum. Why is this? The current policy appears to hurt 1. the visibility of nesdev and 2. members' ability to search the board efficiently.


 Post subject:
PostPosted: Sun Mar 23, 2008 1:41 pm

Joined: Mon Sep 27, 2004 8:33 am
Posts: 3715
Location: Central Texas, USA
Maybe bandwidth? I notice a lot of boards have a "low-bandwidth" version that search engines index. Aside from that, every page here has about 7 KB of inline style sheet, which is surely a waste of bandwidth. The style sheet even has a comment about this:
Quote:
NOTE: These CSS definitions are stored within the main page body so that you can use the phpBB2 theme administration centre. When you have finalised your style you could cut the final CSS code and place it in an external file, deleting this section to save bandwidth.


 Post subject:
PostPosted: Sun Mar 23, 2008 7:43 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19326
Location: NE Indiana, USA (NTSC)
This board is less active than Pocket Heaven or even tetrisconcept.com, yet those boards get indexed. If you're worried about server load, use Crawl-delay: in robots.txt.
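For reference, a minimal robots.txt sketch of that suggestion (the 10-second value is just an example; note that Googlebot ignores Crawl-delay, though Yahoo's and MSN's crawlers honoured it):

Code:
# ask well-behaved crawlers to wait 10 seconds between requests
User-agent: *
Crawl-delay: 10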


 Post subject:
PostPosted: Mon Mar 24, 2008 12:18 am

Joined: Fri Nov 19, 2004 7:35 pm
Posts: 3967
If you care about server load, use a better caching system and/or accelerated PHP.

Or maybe use PunBB.

_________________
Here come the fortune cookies! Here come the fortune cookies! They're wearing paper hats!


 Post subject:
PostPosted: Tue Mar 25, 2008 10:03 am

Joined: Fri Nov 12, 2004 2:49 pm
Posts: 7312
Location: Chexbres, VD, Switzerland
Well, the main Nesdev page, which links to this forum, can be found in Google, but unfortunately it hasn't been updated for almost 3 years or so, and 75% of the links on the page are broken by now.


 Post subject:
PostPosted: Tue Mar 25, 2008 12:03 pm

Joined: Thu Jun 29, 2006 7:44 pm
Posts: 524
Location: lolz!
It would be nice if Google was able to find this place.


 Post subject:
PostPosted: Sat Mar 29, 2008 4:30 am

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
It isn't specific to this site; it's specific to all the sites on the Parodius Network. It's a global robots.txt.

I did this because archival bots -- including Google's -- get stuck in infinite loops when fetching content from message boards, and they have awful problems dealing with sites that have multiple Host entries pointing to the same content (the same pages reachable under more than one hostname). Bandwidth usage is the main concern.

We just had this happen tonight with a bunch of Chinese IPs. For the past 10 hours they've been pounding the nesdev site, and we've been spitting out up to 2 mbit/sec worth of traffic the whole time. I would have noticed sooner except I was sleeping. Did they honour robots.txt? Nope. It was a distributed leech session across multiple IPs within China, which means it was probably compromised machines being used to download content.

This is very likely going to cost me hundreds of dollars in 95th-percentile overage charges with the co-location provider.

EDIT: I've uploaded pictures of the incident tonight, to give you some idea what happens to the server, to the network, and to the firewall state tables when leeching or webcrawler bot software encounters message boards. As you can see, I had to firewall off portions of China to alleviate the problem (using deny statements in Apache doesn't help -- they don't honour anything other than HTTP 404, so the only way to stop them is to block their packets).

http://jdc.parodius.com/lj/china_incident/
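
To illustrate the difference between the two approaches, a rough sketch (the address range, path, and the choice of pf as the firewall are placeholders for the example, not the actual setup):

Code:
# httpd.conf (Apache 2.2): returns 403 to the offending range, but every
# request still reaches Apache and consumes sockets, CPU and bandwidth
<Directory "/path/to/nesdev">
    Order Allow,Deny
    Allow from all
    Deny from 203.0.113.0/24
</Directory>

# pf.conf: drops the packets before the webserver ever sees them
block drop in quick on em0 proto tcp from 203.0.113.0/24 to any port 80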


 Post subject:
PostPosted: Sat Mar 29, 2008 7:11 am

Joined: Mon Sep 27, 2004 8:33 am
Posts: 3715
Location: Central Texas, USA
Too bad there's no way to have something watch for excessive usage like this and simply shut down the entire site until a human can figure out what to block. Makes me angry hearing about it.


 Post subject:
PostPosted: Sat Mar 29, 2008 8:29 am

Joined: Thu Jun 29, 2006 7:44 pm
Posts: 524
Location: lolz!
I feel every site on earth should always be on Google, because everyone has the right to knowledge and information.


 Post subject:
PostPosted: Sat Mar 29, 2008 8:31 am
Formerly Fx3

Joined: Fri Nov 12, 2004 4:59 pm
Posts: 3076
Location: Brazil
NotTheCommonDose wrote:
I feel every site on earth should always be on Google, because everyone has the right to knowledge and information.


...and downloads? And leeching?

_________________
Zepper
RockNES developer


 Post subject:
PostPosted: Sat Mar 29, 2008 9:27 am

Joined: Thu Jun 29, 2006 7:44 pm
Posts: 524
Location: lolz!
Yes.


 Post subject:
PostPosted: Sat Mar 29, 2008 9:47 am

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
blargg wrote:
Too bad there's no way to have something watch for excessive usage like this and simply shut down the entire site until a human can figure out what to block. Makes me angry hearing about it.


The part I'm still trying to figure out is how they managed to get that amount of network I/O out of us.

The nesdev site is rate-limited to ~50 KBytes/sec (shared across all visitors -- yes, that's why the site seems slow sometimes), which means technically it shouldn't have exceeded 384kbit/sec. I'm thinking there's a bug in the bandwidth limiting module we use, and if that's the case, I have another I can try -- or I'll just end up sticking the site on its own IP and using ALTQ in the network stack to do the rate-limiting.
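
As a very rough sketch, the ALTQ approach could look something like this in pf.conf (the interface name, IP address and uplink size are made-up placeholders):

Code:
# cap outbound HTTP for the site's dedicated IP to ~384 kbit/s
altq on em0 cbq bandwidth 10Mb queue { std_q, nesdev_q }
queue std_q bandwidth 90% cbq(default)
queue nesdev_q bandwidth 384Kb
pass out on em0 proto tcp from 192.0.2.10 port 80 to any queue nesdev_q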

Alternatives I've come up with, none of which are very user-friendly:

1) Use a module which limits the number of requests per second submitted per IP address; if they exceed the limit, they're blocked for something like 5-10 minutes. The problem with that method is that it can sometimes go awry (and I've seen it happen on sites I've visited), especially if someone loads different pages of the site in multiple tabs or windows.

It also doesn't solve issues like what happened this morning, because the requests being made by the leechers still come in and hit the webserver, and it still has to spit back some brief HTML saying they've been blocked temporarily. This doesn't stop the requests.
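
One module along these lines -- named here purely as an example, not necessarily the one that would be used -- is mod_evasive for Apache. A minimal sketch of its configuration:

Code:
# temporarily blacklist an IP that requests the same page more than 5 times
# in 1 second, or more than 100 objects site-wide in 1 second; blocked IPs
# get 403 responses for the next 300 seconds
<IfModule mod_evasive20.c>
    DOSHashTableSize 3097
    DOSPageCount 5
    DOSPageInterval 1
    DOSSiteCount 100
    DOSSiteInterval 1
    DOSBlockingPeriod 300
</IfModule>

Note that it still answers each blocked request with a 403, which is exactly the "requests still come in and hit the webserver" problem described above.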

2) Use a module which limits the total site bandwidth to X number of kilobytes per minute/hour/day/week/month. If this number is exceeded, the site essentially shuts down hard until the limit is reset (by me). You might've seen this on some web pages out there, where you get a brief HTML message saying "Bandwidth Exceeded".

The problem with this is that all it takes is some prick downloading the entire site (which happens regularly) with wget or some *zilla downloader, and then the site goes offline for everyone until I get around to noticing or someone contacts me to reset the limit.

There's really no decent solution to this problem, folks, at least not one that's ultimately user-friendly, while still being resource-friendly and won't financially screw me into oblivion.

P.S. -- http://jdc.parodius.com/lj/china_incident/dropped_packets_0329_0950.png shows that the leechers *still* have not shut off their leeching programs.

EDIT: I figured out how the leechers managed to get past the bandwidth limit. The bandwidth limiting module we were using was setting the total amount of bandwidth per user to 384kbit, not for the entire site. Thus, multiple simultaneous connections could indeed reach 2mbit. For those who are technical, the module I was using was mod_bw. The documentation for this module is badly written; once I went back and re-read the docs for the directive, I realised "Oh, so THAT'S what they mean... ugh."

I've addressed this by switching to mod_cband, which lets you set a maximum bandwidth limit for a site as a total, not per-client.
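
For the curious, roughly what that difference looks like in configuration terms (directive syntax quoted from memory, so treat this as a sketch rather than working config):

Code:
# mod_bw: the limit applies to each matching client, so N clients
# can together pull roughly N x 384 kbit/s
BandWidthModule On
# ~48 KB/s (384 kbit/s) per client
BandWidth all 49152

# mod_cband: the limit is a total for the whole virtual host
<VirtualHost *:80>
    ServerName nesdev.com
    # max speed in kbps, requests/s, open connections
    CBandSpeed 384 10 30
</VirtualHost>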

Also, looks like some of the leechers have finally noticed and stopped fetching data: http://jdc.parodius.com/lj/china_incident/dropped_packets_0329_1054.png. Now all I'm left wondering is if they're just going to find other machines to do this from...


 Post subject:
PostPosted: Sat Mar 29, 2008 1:08 pm

Joined: Wed Dec 06, 2006 8:18 pm
Posts: 2806
What's the point of these people leeching? They just want to download the entire site for some reason? Just greedy/selfish people?


 Post subject:
PostPosted: Sat Mar 29, 2008 1:28 pm

Joined: Sun Sep 19, 2004 9:28 pm
Posts: 3192
Location: Mountain View, CA, USA
MottZilla wrote:
What's the point of these people leeching? They just want to download the entire site for some reason? Just greedy/selfish people?


I don't know, you'd have to ask them.

There's a **lot** of people who have done this over the years; it's why Memblers put the "Do not download full copies of the site through the webserver. Use the FTP mirror" note on the main page. That ZIP file is updated weekly, and automatically. I did it solely so people would stop leeching the site, but I guess that's wishful thinking on my part.


 Post subject:
PostPosted: Sat Mar 29, 2008 1:36 pm

Joined: Sun Sep 19, 2004 11:12 pm
Posts: 19326
Location: NE Indiana, USA (NTSC)
koitsu wrote:
There's a **lot** of people who have done this over the years; it's why Memblers put the "Do not download full copies of the site through the webserver. Use the FTP mirror" note on the main page. That ZIP file is updated weekly, and automatically.

As far as I can tell, the forum is the most interesting part of the site, especially because the front page is years out of date. But I just downloaded all 70 MB of the site's archive over FTP five minutes ago, and a static copy of /bbs/ isn't in there.

