Re: An extended version of the Robots...

Hrvoje Niksic (hniksic@srce.hr)
09 Nov 1996 17:14:13 +0100

Next message: Otis Gospodnetic: "robot algorithm ?"
Previous message: Hrvoje Niksic: "Re: An extended verion of the robot exclusion standard"
In reply to: Captain Napalm: "Re: An extended version of the Robots..."

Captain Napalm (spc@armigeron.com) wrote:
> > This makes implementation outside Perl a near impossibility.
> > Normal v7 regexps are far easier to deal with.
> What would that be? What /bin/sh uses? Tcsh? grep? Perl? Do you use
> '?' or '.' to match a single character? What about one or more characters?
> This may be impossible to do correctly, given the different number of
> regexprs and meta-characters that are possible (I like the set used in

I suppose v7 regexps are the ones used by grep (with far fewer
constructs than Perl's). What sh uses is a simple globbing syntax that
is easy to cover, e.g.:
`*' - matches 0 or more characters
`?' - matches exactly 1 character
`[...]' - introduces a character set or range, regexp-style
`\' - escapes the next character

So, the string "*" would match any string (even empty ones), "?*"
would match strings with 1 or more characters, "*abc" would match
strings ending with "abc", whereas "abc*" would match strings
beginning with "abc". "[a-z]*" matches a string beginning with a
lower-case letter, and "\**" matches a string beginning with a literal
asterisk. This is quite logical, and not too hard to implement. You
use this style in your examples.
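
To back up the "not too hard to implement" claim, here is a minimal
sketch of such a matcher in Python. The names glob_match, _match_one
and _in_set are made up for this illustration. It assumes the pattern
must match the whole string, that a `[' with no closing `]' is taken
literally, and it does not handle negated sets; none of that is
settled anywhere in this thread.

```python
def _in_set(body: str, ch: str) -> bool:
    """Check ch against the inside of a [...] set, e.g. 'a-z0-9_'."""
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # A range such as a-z.
            if body[i] <= ch <= body[i + 2]:
                return True
            i += 3
        else:
            if body[i] == ch:
                return True
            i += 1
    return False


def _match_one(pattern: str, p: int, ch: str):
    """Match one non-`*' pattern element starting at pattern[p] against ch.

    Returns (matched, index of the next pattern element).
    """
    c = pattern[p]
    if c == "?":
        return True, p + 1
    if c == "\\" and p + 1 < len(pattern):
        # Backslash escapes the next character.
        return pattern[p + 1] == ch, p + 2
    if c == "[":
        end = pattern.find("]", p + 1)
        if end != -1:
            return _in_set(pattern[p + 1:end], ch), end + 1
        # Assumption: an unterminated `[' is an ordinary character.
    return c == ch, p + 1


def glob_match(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches the shell-style pattern."""
    p = t = 0
    star_p = star_t = -1  # last `*' seen, and the text position it was tried at
    while t < len(text):
        if p < len(pattern):
            if pattern[p] == "*":
                # First try letting `*' match nothing; remember where to retry.
                star_p, star_t = p, t
                p += 1
                continue
            matched, next_p = _match_one(pattern, p, text[t])
            if matched:
                p, t = next_p, t + 1
                continue
        if star_p == -1:
            return False
        # Mismatch: let the last `*' swallow one more character and retry.
        star_t += 1
        p, t = star_p + 1, star_t
    # Text is used up; only trailing `*'s may remain in the pattern.
    while p < len(pattern) and pattern[p] == "*":
        p += 1
    return p == len(pattern)


if __name__ == "__main__":
    assert glob_match("*", "")
    assert glob_match("*abc", "xyzabc") and not glob_match("*abc", "abcd")
    assert glob_match("[a-z]*", "hello") and not glob_match("[a-z]*", "Hello")
    assert glob_match("\\**", "*foo") and not glob_match("\\**", "foo")
```

The whole thing is one loop with a single remembered backtrack point,
which is the main reason this syntax is cheap to support in any
language.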

The grep-style regexps are more powerful, but more complex, and take
more fuss to implement and (for the uninitiated) to use. This is a
partial specification (strictly speaking, `+', `?' and the unescaped
parentheses come from egrep rather than plain grep):
`.' - matches any single character
`*' - matches 0 or more occurrences of the preceding character
`+' - matches 1 or more occurrences of the preceding character
`?' - matches 0 or 1 occurrences of the preceding character
`^' - matches the beginning of line
`$' - matches the end of line
`(' and `)' - group a subexpression and save it in a register
...etc.

So "abc" would match any string containing "abc" anywhere, just like
"^.*abc.*$" (the first form is much faster too). "^abc" matches a
string beginning with "abc", whereas "abc$" matches a string ending
with "abc". "^(abc)+" matches a string beginning with 1 or more
occurrences of "abc". Etc.
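
The examples above can be tried out with a short sketch. It uses
Python's re module as a stand-in, on the assumption that it treats
these particular constructs the same way egrep does (its full syntax
is closer to Perl's). The helper name matches is made up for this
illustration.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """True if pattern is found anywhere in text (unanchored search)."""
    return re.search(pattern, text) is not None


if __name__ == "__main__":
    # An unanchored pattern matches anywhere, same as the padded form.
    assert matches("abc", "xxabcxx")
    assert matches("^.*abc.*$", "xxabcxx")

    # `^' and `$' anchor the match to the beginning and the end.
    assert matches("^abc", "abcdef") and not matches("^abc", "xabc")
    assert matches("abc$", "xyzabc") and not matches("abc$", "abcx")

    # Parentheses group, so `+' applies to the whole of "abc".
    assert re.search("^(abc)+", "abcabcx").group(0) == "abcabc"
```

Note that the anchoring is explicit here, whereas the globbing syntax
always matches against the whole string.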

Perl has an even more powerful regexp syntax than this.

I would like robots.txt to use the normal shell-style globbing syntax,
since it is much simpler and faster to use.

-- 
Hrvoje Niksic <hniksic@srce.hr> | Student at FER Zagreb, Croatia
--------------------------------+--------------------------------
Contrary to popular belief, Unix is user friendly.  
It just happens to be selective about who it makes friends with.
