More Googlebot Flailing
Filed under: Uncategorized | Tags: Web | July 9th, 2004
Now I’m seeing the Googlebot request /about/ pages relative to known blogs that don’t have any links to any /about/ URI. The last time the Googlebot flailed around like this it was fun to watch for a little bit and wonder what they had cooking in the labs, but then it got annoying. I don’t know if there are rules of bot etiquette, but requesting imagined unlinked resources while spidering can’t be a best practice.
There is of course one blog vendor who consistently has about pages at /about/ URIs, and that’s Typepad. Now my question, Google, is: what should the rest of us do if we want our about pages indexed by this new system? Mine happens to be in the about subdirectory of my blog, but what about people who have about.html or about-me.php? Should I set up a permanent redirect on every blog I have, pointing to the real about page?
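As a rough illustration of the permanent-redirect idea, here is a minimal Python sketch. The mapping, the paths in it, and the function name `permanent_redirect` are all hypothetical; it only shows the lookup that would sit behind a 301 response, not any particular server's configuration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping from the guessed /about/ path to where each blog's
# about page really lives. These paths are illustrative assumptions.
REAL_ABOUT_PAGES: Dict[str, str] = {
    "/about/": "/about.html",
    "/journal/about/": "/journal/about-me.php",
}


def permanent_redirect(path: str) -> Optional[Tuple[int, str]]:
    """Return (301, target) if the requested path has a known real about
    page, or None if the request should be handled normally (e.g. a 404)."""
    target = REAL_ABOUT_PAGES.get(path)
    if target is None:
        return None
    # 301 is the HTTP status code for "Moved Permanently".
    return (301, target)
```

A request for an unlisted path such as /stats/ returns None, so the server would still answer it with its usual 404.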
(Note: That’s faux indignation. I don’t have any juicy conspiracy theories, and I’m not really that peeved; mostly I’m just curious what they’re up to. However, juicy conspiracy theories are welcome in the comments. [As long as they don't make fun of me for noticing these things.])
UPDATE: It just requested a non-existent non-linked /contact/ URI.
UPDATE: It just requested a non-existent non-linked /stats/ URI.
DEVELOPING . . .

Rob Mientjes | July 9th, 2004 @ 1:49 am
Well, to kick in the obvious conspiracy theory: Maybe Google likes TypePad more than normal…?
GoogleBot can be quite irritating. Having it around on your website 24/7 with about 10 instances of it is quite odd, but at least you can find my website by the oddest queries now.
Sushubh | July 9th, 2004 @ 2:41 am
my vote is that they are planning something biggie related to blogs. they search atom.xml and other similar stuff that is quite popular on other Blog tools and now typepad specific folders… gotta be something funky. Google lacks tools for blogs and rss feeds. something is definitely cooking
Mark J | July 9th, 2004 @ 3:52 am
Google is like the Boy Scouts. Its motto is: be prepared. They’re information whores. They might not yet even know what they plan to do with all this blog-related data that is being indexed.
Randy Peterman | July 9th, 2004 @ 8:24 am
GoogleBot has been requesting non-existent RSS files for some time now. However, I changed my Apache redirect to send them to my WordPress .php files to handle all of the 404 errors being reported in my stats app.
Ryan Mack | July 9th, 2004 @ 8:48 am
“typepad specific folders”? I think that’s a bit of a stretch. /about/ and /contact/ are hardly specific to Typepad. They’re everywhere. And not much of an indication that Google’s up to something. Now, if over the next few days you start to see requests for other non-existent “about” and “contact” URIs like about.html, about.php, contact-us.html, etc., it’d be a bit safer to say that something’s going on.
waylman | July 9th, 2004 @ 9:54 am
While they may be annoying, maybe we should be looking at them as SEO clues. If that is the filename/dir where Google expects such info, is it possible that they’ll like it when we put it there and, who knows, maybe even give us a higher page rank? Not that I obsess about such things, but it could potentially make us easier to find.
It’s just a thought. But, then again, maybe I’m way off base?!
andrew | July 9th, 2004 @ 11:58 am
Ok. ‘About’, maybe. ‘Contact’, maybe. But ‘Stats’? That to me is a completely unnecessary and invasive practice. There is no need whatsoever for Google to crawl people’s unprotected stats packages.
random | July 9th, 2004 @ 1:13 pm
andrew:
Stats pages normally contain lists of referrers and popular pages, and more, so it’s not like Google don’t get anything out of it. I think it’s quite a good idea.
If people don’t want something crawled, they should disallow access in robots.txt and/or password protect it. It’s not as though this is hard.
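For anyone who wants to check what a given robots.txt actually tells a well-behaved crawler, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the helper name `is_allowed` are made up for illustration, and this only shows what the file asks for; it does nothing to stop a robot that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks all robots to stay out of /stats/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /stats/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Report whether a robot honoring robots_txt may fetch the given path."""
    parser = RobotFileParser()
    # parse() accepts the file's contents as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

With this file, `is_allowed(ROBOTS_TXT, "Googlebot", "/stats/")` is False while the same call for "/about/" is True.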
Ryan Mack | July 9th, 2004 @ 1:56 pm
I have to agree with Andrew on the /stats/ URI. /about/ and /contact/ are common URIs for publicly available information, but /stats/ is not. That’s a bit invasive. Google expects webmasters to be “polite” by not serving up special content to googlebot. They too should be “polite” and not go poking where they haven’t been invited.
Matt | July 10th, 2004 @ 6:38 pm
“There is of course one blog vendor who consistently has about pages at /about/ URIs, and that’s Typepad.”
False. My Typepad sites all have the about section at ~/about.html, not ~/about/.
Matt | July 10th, 2004 @ 6:40 pm
Isn’t /about/ the default? If not, I stand corrected and I’ll happily remove that from the post. It’s not terribly relevant anyway.
Matt | July 10th, 2004 @ 6:56 pm
The other Matt is correct and I’ve updated the post (with correct markup and datetime attribute) to reflect this.
random | July 10th, 2004 @ 11:56 pm
It’s invasive to go to any unlinked page or directory, but visiting /stats/ isn’t more invasive than /about/, assuming neither are publicly available. It’s not hard to imagine a private (security through obscurity) directory with contact information. If I had one, I’d much rather it stay private than my essentially anonymous statistics.
Matt | July 11th, 2004 @ 9:17 am
Also remember that putting something as forbidden in your robots.txt file is the best way to get bad robots to visit it. Many robot honeypots work this way.
Pingback: Petroglyphs
Pingback: geek ramblings » Semantic Web
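The robots.txt honeypot idea mentioned in the comments can be sketched roughly as follows. This is an assumption-laden toy: the parser below only collects Disallow prefixes and ignores User-agent groups, and the function names, the /trap/ path, and the client addresses are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def disallowed_prefixes(robots_txt: str) -> List[str]:
    """Collect every non-empty Disallow value from a robots.txt body.
    Simplification: User-agent groups are ignored entirely."""
    prefixes = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            prefixes.append(value.strip())
    return prefixes


def bad_robots(requests: Iterable[Tuple[str, str]],
               prefixes: List[str]) -> Set[str]:
    """Return the clients that requested any path under a disallowed prefix.
    Each request is a (client, path) pair, e.g. taken from an access log."""
    return {
        client
        for client, path in requests
        if any(path.startswith(prefix) for prefix in prefixes)
    }


# Hypothetical trap directory that no page links to.
TRAP_ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"
```

Any client that shows up in the result reached the trap directory despite being told to stay out, which is the signal a honeypot of this kind looks for.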
Jonathan Dingman | July 31st, 2008 @ 5:28 am
Yeah, I know this is an uber-old post, but whatever happened as the end result?
Did you ever figure out what was causing googlebot to randomly crawl non-existent pages?
Matt | July 31st, 2008 @ 7:54 am
Nope.