Limiting robots to top-level page only (via robots.txt)?

Chuck Doucette (doucette@tinkerbell.macsyma.com)
Tue, 26 Mar 1996 00:49:04 -0500


Since I found out that Alta Vista (probably among others) indexes every page
on our site, and not just the top-level page, I've been trying to find out
how to prevent that. The sub-pages should certainly remain accessible to anyone
who can read the top-level page; however, a visitor who lands on a sub-page
directly may lack the context our top-level page provides.

So, if indeed I wanted to prevent a robot from indexing any page other
than the default one for the top-level (http://www.macsyma.com/), how
could I do that?
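
The closest workaround I can see under the current exclusion standard is to
enumerate each sub-tree in its own Disallow line and leave the root itself
unlisted; the path names below are only placeholders for whatever actually
lives under our root:

    User-agent: *
    Disallow: /products
    Disallow: /support
    Disallow: /staff

The drawback is obvious: every new sub-tree has to be added by hand, and any
file sitting directly in the root slips through unless it gets its own line.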

It's my understanding that a Disallow value is treated as a path prefix
relative to the top-level URL (http://www.macsyma.com), and that it matches
any URL whose path begins with that prefix (such as /). This isn't stated
clearly in the robot exclusion documents I've read. With that syntax, I see
no way of allowing "http://www.macsyma.com/" while preventing
"http://www.macsyma.com/*.html": regular expressions aren't allowed, and
multiple Disallow fields (which a record may contain) only add more prefixes.
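
If I've read the documents right, the test a robot applies boils down to a
simple prefix match; here is a rough sketch in Python (my own illustration,
not code from any actual robot, and the rule lists are hypothetical):

    # Sketch of the prefix matching described in the robot exclusion
    # standard; the rules and paths here are made-up examples.
    def is_disallowed(path, disallow_prefixes):
        """True if any Disallow value is a prefix of the request path."""
        return any(path.startswith(p) for p in disallow_prefixes)

    # "Disallow: /" matches every path, including "/" itself, so the
    # top-level page gets blocked along with everything beneath it:
    print(is_disallowed("/", ["/"]))           # True
    print(is_disallowed("/demo.html", ["/"]))  # True

Since every rule is a prefix, any rule broad enough to catch all the
sub-pages necessarily catches "/" as well.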

Chuck

-- 
Chuck Doucette				e-mail:	doucette@macsyma.com
Macsyma, Inc.				phone:	(617) 646-4550
20 Academy St., Suite 201		fax:	(617) 646-3161
Arlington MA 02174-6436 / U.S.A.	URL:	http://www.macsyma.com