> I have a web site (www.livepage.com) which is serving up pages from
> an SQL database rather than directly from the file system. All of
> the URLs in the site have the syntax of a CGI program (although I am
> really using NSAPI so this syntax was just for convenience). In all
> other ways it is a well-behaved web site.
We who run robots do not care about the details of how pages are
generated. We don't want to be trapped in infinite loops, and we want
to be good friends with webmasters all over the place.
>
> My problem is that none of the major web search robots will seem to
> index the site except for the home page. I assume that this is
> because the URLs look like CGI programs. Is this the case? I haven't
The way web robotics people reason about cgi-bin URLs varies widely,
but we are all careful not to be trapped in "virtually infinite URL
spaces". Some people are more careful than others, and I have myself
been overly anxious.
> been able to find any information on what URLs are typically ignored
> other than that Lycos mentions that it doesn't look at any URLs that
> contain "?". Is there any way to convince the robots that it is OK
> to go ahead and traverse these URLs? Does anyone have any
Ignoring URLs that contain "?" is one way to exclude searches and pages
that change from day to day. To get around this problem I would do the
following:
I wouldn't pass parameters to your API through the QUERY_STRING, but
rather through the path. Using my YE OLDE test-cgi shell script I get
the default output (with no arguments):
http://localhost/cgi-bin/test-cgi
CGI/1.0 test script report:
...
SERVER_PORT = 80
REQUEST_METHOD = GET
PATH_INFO =
PATH_TRANSLATED =
SCRIPT_NAME = /cgi-bin/test-cgi
QUERY_STRING =
Here is what most sensible robots would recognize as a search (or
something similar) and exclude from indexing:
http://localhost/cgi-bin/test-cgi?a=fdsafdsa&b=fdsafdsa
REQUEST_METHOD = GET
PATH_INFO =
PATH_TRANSLATED =
SCRIPT_NAME = /cgi-bin/test-cgi
QUERY_STRING = a=fdsafdsa&b=fdsafdsa
REMOTE_HOST = localhost
REMOTE_ADDR = 127.0.0.1
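To make that heuristic concrete, here is a minimal Python sketch of the
kind of check a cautious robot might apply. The function name
looks_dynamic and the single rule it uses (a non-empty query string means
"skip") are assumptions for illustration only, not the behaviour of any
particular robot.

```python
from urllib.parse import urlsplit


def looks_dynamic(url):
    """Guess whether a URL is a search or other generated page.

    Hypothetical heuristic: any non-empty query string is treated as a
    sign of a possibly unbounded URL space, so the URL is skipped.
    """
    return bool(urlsplit(url).query)


for url in (
    "http://localhost/cgi-bin/test-cgi?a=fdsafdsa&b=fdsafdsa",
    "http://localhost/cgi-bin/test-cgi/a_fdsafdsa/b_fdsafdsa",
):
    # Report what a robot using this rule would do with each URL.
    print(url, "->", "skip" if looks_dynamic(url) else "index")
```

A robot using a rule like this never looks at the path itself, which is
why moving the arguments out of the query string matters.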
A first step to hide the arguments would be to avoid using the QUERY_STRING
(or whatever the equivalent is in your API):
http://localhost/cgi-bin/test-cgi/a_fdsafdsa/b_fdsafdsa
The environment variables passed using ordinary CGI would be:
SERVER_PORT = 80
REQUEST_METHOD = GET
PATH_INFO = /a_fdsafdsa/b_fdsafdsa
PATH_TRANSLATED = /local_home/WWW/a_fdsafdsa/b_fdsafdsa
SCRIPT_NAME = /cgi-bin/test-cgi
QUERY_STRING =
REMOTE_HOST = localhost
REMOTE_ADDR = 127.0.0.1
The variables would be found in PATH_INFO rather than in QUERY_STRING.
If I had used index.cgi as the index file in my htdocs root, the URL
above would have been
http://localhost/a_fdsafdsa/b_fdsafdsa
and no one (and certainly not a robot) could tell that everything was
delivered by an RDBMS!
Your API must give you the option to parse something equivalent to the
PATH_INFO environment variable.
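
As a rough illustration, the PATH_INFO from the example could be decoded
along these lines. This is a Python sketch, not my test-cgi script; the
function name parse_path_info is hypothetical, and the name_value
convention with an underscore separator is an assumption taken from the
example URL above.

```python
import os


def parse_path_info(path_info):
    """Turn '/a_fdsafdsa/b_fdsafdsa' into {'a': 'fdsafdsa', 'b': 'fdsafdsa'}."""
    params = {}
    for segment in path_info.split("/"):
        if not segment:
            continue  # skip empty pieces from leading or doubled slashes
        # Split on the first underscore only, so values may contain "_".
        name, sep, value = segment.partition("_")
        if sep:  # ignore segments that lack the separator
            params[name] = value
    return params


if __name__ == "__main__":
    # Under CGI the server puts the extra path in PATH_INFO.
    print(parse_path_info(os.environ.get("PATH_INFO", "")))
```

With this convention parameter names cannot contain an underscore; a
different separator would work just as well if yours do.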
Cheers,
Sigfrid
_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html