Re: An extended version of the robot exclusion standard

Hrvoje Niksic (hniksic@srce.hr)
09 Nov 1996 17:01:15 +0100


Captain Napalm (spc@armigeron.com) wrote:
> Anyway, take a look at http://www.armigeron.com/people/spc/robots2.html
> and give me feedback. Also, if anyone has implemented the extensions

> <comment> - general comment line format
>
> Any text following a "#" up to the end of a line is to be
> ignored. The "#" character can appear at any portion of a
> line. Some examples:

I don't like this -- I prefer the old definition, under which a blank
must precede the comment character. That makes sense in sh, and it
makes sense in /robots.txt.
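
To make the difference concrete, here is a minimal Python sketch of the
two comment rules. The function names and the sample lines are
hypothetical, and the second function encodes the "blank must precede"
rule only as it is described in this message, not as a quotation from
either version of the standard.

```python
import re

def strip_comment_anywhere(line):
    """Proposed rule: '#' starts a comment wherever it appears."""
    return line.split("#", 1)[0].rstrip()

def strip_comment_after_blank(line):
    """sh-like rule: '#' starts a comment only at the start of the
    line or when it is preceded by whitespace."""
    return re.sub(r"(^|\s)#.*$", "", line).rstrip()

# A path that legitimately contains '#' (hypothetical example).
line = "Disallow: /cgi-bin/view#top"
print(strip_comment_anywhere(line))     # Disallow: /cgi-bin/view
print(strip_comment_after_blank(line))  # Disallow: /cgi-bin/view#top

# An ordinary trailing comment is removed under both rules.
print(strip_comment_after_blank("Disallow: /tmp # scratch"))  # Disallow: /tmp
```

Under the "anywhere" rule a "#" inside a path silently truncates the
path; under the sh-like rule it is kept.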

> <general> - Version 1.0.0 general string match format.
>
> The general match is included for compatibility with Version 1.0.0 of
> the robots exclusion standard. General matches do not contain regular
> expression characters, but are treated as if they contain the
> character "*", which is used to match zero or more characters, at the
> end of the string. An example would be: /helpme, which is to be
> treated as: /helpme*.

No, as a regexp this should be `^/helpme.*$'. In any case, you don't
need the trailing .* in Perl unless you add the `$' explicitly. Please
do not confuse shell-style wildcards with grep/perl-style regular
expressions.
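
The anchoring point can be checked with a short sketch. It uses
Python's re module rather than Perl, on the assumption that the two
behave the same for patterns this simple; the path list is made up for
illustration.

```python
import re

# Hypothetical request paths to test against the "/helpme" rule.
paths = ["/helpme", "/helpme/index.html", "/help", "/x/helpme"]

# Version 1.0.0 semantics: a plain prefix match.
prefix = [p for p in paths if p.startswith("/helpme")]

# Equivalent regexps: anchoring the start is enough; the trailing
# ".*" is only needed once "$" is written explicitly.
anchored_start = [p for p in paths if re.search(r"^/helpme", p)]
anchored_both = [p for p in paths if re.search(r"^/helpme.*$", p)]

# Not equivalent: no anchor at all, or "$" without ".*".
unanchored = [p for p in paths if re.search(r"/helpme", p)]
strict = [p for p in paths if re.search(r"^/helpme$", p)]

print(prefix)      # ['/helpme', '/helpme/index.html']
print(unanchored)  # ['/helpme', '/helpme/index.html', '/x/helpme']
print(strict)      # ['/helpme']
```

Both anchored forms select the same paths as the prefix match; the
unanchored form also matches in the middle of a path, and `^/helpme$'
matches only the exact string.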

> <regex> - Version 2.0.0 regular expression string match format.
>
> This is a regular expression that is compatible with that used in
> Perl, a popular language used on the web that contains support for
> regular expression matching.

Now, seeing the examples below, this is plain *wrong*. Perl regexps
are *not* the same as shell wildcard matching. What you use in the
specification is shell-style globbing.
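
As an illustration of how far apart the two notations are, the sketch
below reads one and the same pattern string first as a shell-style glob
and then as a regular expression. It is written in Python, with fnmatch
standing in for shell globbing and re for Perl-style matching; the
pattern and paths are hypothetical, not taken from the specification.

```python
import fnmatch
import re

PATTERN = "/helpme*"  # hypothetical pattern, not from the spec

def as_glob(path):
    """Glob reading: '*' matches any run of characters."""
    return fnmatch.fnmatchcase(path, PATTERN)

def as_regex(path):
    """Regexp reading: '*' means zero or more of the preceding 'e'."""
    return re.search("^" + PATTERN + "$", path) is not None

for path in ["/helpm", "/helpmeee", "/helpme/index.html"]:
    print(path, as_glob(path), as_regex(path))
# /helpm False True
# /helpmeee True True
# /helpme/index.html True False
```

The two readings agree only by accident; a specification has to pick
one notation and name it correctly.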

-- 
Hrvoje Niksic <hniksic@srce.hr> | Student at FER Zagreb, Croatia
--------------------------------+--------------------------------
Contrary to popular belief, Unix is user friendly.  
It just happens to be selective about who it makes friends with.