>
> I've been thinking about the proposed extensions to robots.txt. It seems
> that, given how hard it already is to get web sites to use the existing
> proposed standard, there is little hope of getting a second file with a
> (slightly) different format adopted.
>
> I'd therefore like to propose a slight modification to what I proposed
> earlier.
>
> The current format of robots.txt is:
>
> # comment - can appear almost anywhere
> User-agent: agentname # comment
> Disallow: simplepattern
>
> User-agent: * # match all agents not explicitly mentioned
> Disallow: /
>
> The simple pattern is basically equivalent to the following regular
> expression:
>
> simplepattern.*
>
> i.e. a literal match followed by anything.
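>
> In code, the check a robot makes against a Disallow value could look
> roughly like this (a Python sketch, purely illustrative; the function
> name is made up and not part of any spec):
>
> def is_disallowed(url_path, disallow_patterns):
>     # Each Disallow value is a literal prefix; any path starting with it matches.
>     # An empty Disallow value matches nothing (i.e. allows everything).
>     return any(url_path.startswith(p) for p in disallow_patterns if p)
>
> # is_disallowed("/tmp/foo.html", ["/tmp", "/cgi-bin"])  ->  True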
>
[stuff deleted]
>
> Now, for some headers that might be useful, but can wait:
>
> Visit-time:
> Format:
> Visit-time: <time> <time> # 2.0.0
>
> Comment:
> The robot is requested to visit the site only between the
> given times. If the robot is run outside of these times,
> it should notify its author/user that the site only
> wants to be visited between the times specified.
>
> This can only appear once per rule set.
>
> The format for <time> has to be worked out. At the
> least, use GMT (or UTC, as it is now called).
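>
> As a rough illustration, and assuming only for the sake of the example
> that the eventual <time> format is HHMM in GMT/UTC, a robot's check
> might look like this (a Python sketch, names invented):
>
> from datetime import datetime, timezone
>
> def within_visit_time(start="0100", end="0500"):
>     # Assumes <time> is HHMM in UTC; the real format is still to be decided.
>     now = datetime.now(timezone.utc).strftime("%H%M")
>     if start <= end:
>         return start <= now <= end
>     return now >= start or now <= end   # window wraps past midnight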
>
> Request-rate:
> Format:
> Request-rate: <rate> # 2.0.0
> Request-rate: <rate> <time> <time> # 2.0.0
>
> Comment:
> <rate> is defined as <numdocuments> '/' <timeperiod>, where
> <timeperiod> is a number with an optional unit suffix. The
> default time unit is the second.
> <time> format is to be worked out.
>
> An example of some rates:
>
> 10/60 - no more than 10 documents per 60 secs
> 10/10m - no more than 10 documents per 10 mins
> 20/1h - no more than 20 documents per hour
>
> If the times are given, then the robot is to use the given
> rate (and no faster) whenever the current time falls between
> the two times given.
>
> If more than one 'Request-rate:' header is given without a
> time range, use the one that requests the fewest documents
> per unit of time.
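>
> In code, parsing <rate> into documents per second might look roughly
> like this (a Python sketch; the names are invented for illustration):
>
> def parse_rate(rate):
>     # "10/60" -> 10 docs / 60 s,  "10/10m" -> 10 docs / 600 s,  "20/1h" -> 20 docs / 3600 s
>     docs, period = rate.split("/")
>     seconds_per_unit = {"s": 1, "m": 60, "h": 3600}
>     unit = period[-1] if period[-1] in seconds_per_unit else "s"
>     count = period.rstrip("smh") or "1"
>     return int(docs) / (int(count) * seconds_per_unit[unit])
>
> # With several untimed rates, the most restrictive is min(map(parse_rate, rates)).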
>
> If no 'Request-rate:' is given, then the robot is encouraged
> to use the following rule of thumb for time between
> requests (which seems reasonable to me):
>
> twice the amount of time it took to retrieve
> the document
>
> 10 seconds
>
> whichever is slower.
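>
> Expressed in code, that rule of thumb is simply (again, just a sketch):
>
> def default_delay(fetch_seconds):
>     # Wait twice the last retrieval time or 10 seconds, whichever is longer.
>     return max(2 * fetch_seconds, 10)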
>
These shouldn't wait ... IMO, it's important to develop a
standard for these parameters in the next robots.txt
rev.
Cheers
******************************************************
Brent Boghosian | Email: BrentB@OpenText.com
|
Open Text Corp. | Phone: (519)888-7111 Ext.279
180 Columbia St. West | FAX: (519)888-0677
Waterloo, ON N2L 3L3 | http://www.opentext.com
******************************************************