Matching the user-agent in /robots.txt

Hrvoje Niksic (hniksic@srce.hr)
07 Nov 1996 20:58:38 +0100


It seems to me that a lot was left unsaid about matching the
user-agent in /robots.txt. The specification says that the robot
should be liberal in matching it, and that a case-insensitive
substring match without version info is recommended.

With this definition,

    user-agent: Wget/*

would not match Wget/1.4.0, which is not nice. The same goes for
wget* or similar. In the same sense,

    user-agent: Wget/1.4.0

would not match Wget/1.4.0, which is ridiculous (the match would fail
because of substring search *without* version info; Wget/1.4.0 is not
a substring of Wget).
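
To make the failure concrete, here is a minimal sketch of one literal
reading of the recommendation: the User-agent value is searched for
inside the robot's name with the version stripped. The helper name
substring_match is hypothetical, and both strings are assumed to be
lowercase already.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if `pattern` (the User-agent value
   from /robots.txt) occurs inside `robot_name` (the robot's name with
   the version removed), 0 otherwise.  Both are assumed lowercase. */
int substring_match(const char *pattern, const char *robot_name)
{
    /* strstr() treats '*' as an ordinary character, so wildcard
       patterns can only match a name that literally contains '*'. */
    return strstr(robot_name, pattern) != NULL;
}
```

Under this reading, substring_match("wget", "wget") succeeds, while
both "wget/*" and "wget/1.4.0" fail against the name "wget".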

How about defining a simple and clear heuristic for matching the
user-agent, something like this:

    If string = "*"
        match always
    Else
        If string contains wildcards
            If string contains '/'
                match with fnmatch(string, full_version)
            Else
                match with fnmatch(string, base_version)
            Endif
        Else
            match with strstr(full_version, string)
        Endif
    Endif

fnmatch() is a function that matches a string against a shell-style
pattern; the name comes from the function in GNU bash, which I have
borrowed. I think this would be a useful addition to the standard
(even if only in an appendix), since it is fairly simple to implement
and allows closer matches on robot versions (like wget/2.*).
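
As an illustration of the kind of matching meant here, the sketch
below uses the POSIX fnmatch() from <fnmatch.h>, on the assumption
that it behaves like the bash-derived function for these simple
patterns. The wrapper name agent_glob_match is hypothetical.

```c
#include <fnmatch.h>

/* Hypothetical wrapper: returns 1 when `name` matches the shell-style
   `pattern`, 0 otherwise.  POSIX fnmatch() returns 0 on a match.
   With flags set to 0, '*' is allowed to match '/' as well. */
int agent_glob_match(const char *pattern, const char *name)
{
    return fnmatch(pattern, name, 0) == 0;
}
```

With this wrapper, the pattern "wget/2.*" accepts "wget/2.1" but
rejects "wget/1.4.0", and "wget*" accepts the bare name "wget".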

A sample C implementation follows:

    if (!strcasecmp(cmd, "User-agent"))
      {
        int match = 0;
        /* Lowercase the agent string.  version and base_version are
           assumed to be stored in lowercase already. */
        for (i = 0; str[i]; i++)
          str[i] = tolower((unsigned char) str[i]);
        /* If the string is '*', it matches. */
        if (*str == '*' && !*(str + 1))
          match = 1;
        else
          {
            /* If the string contains wildcards, we'll run it through
               fnmatch(). */
            if (has_wildcards(str))
              {
                /* If the string contains '/', compare with the full
                   version.  Else, compare it to base_version. */
                if (strchr(str, '/'))
                  match = !fnmatch(str, version, 0);
                else
                  match = !fnmatch(str, base_version, 0);
              }
            else /* Substring search */
              {
                if (strstr(version, str))
                  match = 1;
                else
                  match = 0;
              }
          }
      }
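
The fragment above depends on surrounding parser state (cmd, str,
version, base_version) and on a has_wildcards() helper that is not
shown. For anyone who wants to try the heuristic in isolation, here is
a self-contained sketch of the same logic. The function name
user_agent_matches, the has_wildcards() body (it looks for '*', '?'
and '['), and the use of POSIX fnmatch() are assumptions made for
illustration. version and base_version are assumed to be lowercase,
e.g. "wget/1.4.0" and "wget".

```c
#include <ctype.h>
#include <fnmatch.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true if s contains a shell wildcard character. */
static int has_wildcards(const char *s)
{
    return strpbrk(s, "*?[") != NULL;
}

/* Returns 1 if the User-agent value `field` applies to this robot,
   0 otherwise (also 0 if memory cannot be allocated). */
int user_agent_matches(const char *field, const char *version,
                       const char *base_version)
{
    size_t i, len = strlen(field);
    char *str = malloc(len + 1);
    int match;

    if (!str)
        return 0;
    /* Work on a lowercased copy, including the terminating NUL. */
    for (i = 0; i <= len; i++)
        str[i] = (char) tolower((unsigned char) field[i]);

    if (strcmp(str, "*") == 0)
        match = 1;                      /* "*" matches every robot */
    else if (has_wildcards(str))
        /* Patterns with '/' are compared against the full version,
           others against the base name only. */
        match = fnmatch(str, strchr(str, '/') ? version : base_version,
                        0) == 0;
    else
        match = strstr(version, str) != NULL;   /* substring search */

    free(str);
    return match;
}
```

For a robot with version "wget/1.4.0" and base_version "wget", the
values "*", "Wget/*", "Wget/1.4.0" and "wget*" all match under this
sketch, while "wget/2.*" and "Lynx" do not.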

-- 
Hrvoje Niksic <hniksic@srce.hr> | Student at FER Zagreb, Croatia
--------------------------------+--------------------------------
Contrary to popular belief, Unix is user friendly.  
It just happens to be selective about who it makes friends with.