Re: Web pages being served from an SQL database

Sigfrid Lundberg (siglun@gungner.ub2.lu.se)
Wed, 18 Dec 1996 10:03:54 +0100 (MET)


On Tue, 17 Dec 1996, Eric Mackie wrote:

> I have a web site (www.livepage.com) which is serving up pages from
> an SQL database rather than directly from the file system. All of
> the URLs in the site have the syntax of a CGI program (although I am
> really using NSAPI so this syntax was just for convenience). In all
> other ways it is a well-behaved web site.

We who run robots do not care about the details of how pages are
generated. We don't want to be trapped in infinite loops, and we want to
be good friends with webmasters all over the place.

>
> My problem is that none of the major web search robots will seem to
> index the site except for the home page. I assume that this is
> because the URLs look like CGI programs. Is this the case? I haven't

Web robot people differ widely in how they reason about cgi-bin URLs.
But we are all careful not to be trapped in "virtually infinite URL
spaces". Some people are more careful than others, and I have myself
been overly anxious.

> been able to find any information on what URLs are typically ignored
> other than that Lycos mentions that it doesn't look at any URLs that
> contain "?". Is there any way to convince the robots that it is OK
> to go ahead and traverse these URLs? Does anyone have any

Skipping URLs that contain "?" is one method to exclude searches and
pages that are different from day to day. To find a way around this
problem I would do the following:

I wouldn't pass parameters to your API through the QUERY_STRING, but
rather through the path. Using my YE OLDE test-cgi shell script I get
the following default output (with no arguments):

http://localhost/cgi-bin/test-cgi

CGI/1.0 test script report:

...

SERVER_PORT = 80
REQUEST_METHOD = GET
PATH_INFO =
PATH_TRANSLATED =
SCRIPT_NAME = /cgi-bin/test-cgi
QUERY_STRING =

Here is what most sensible robots would recognize as a search (or
whatever) and exclude from indexing:

http://localhost/cgi-bin/test-cgi?a=fdsafdsa&b=fdsafdsa

REQUEST_METHOD = GET
PATH_INFO =
PATH_TRANSLATED =
SCRIPT_NAME = /cgi-bin/test-cgi
QUERY_STRING = a=fdsafdsa&b=fdsafdsa
REMOTE_HOST = localhost
REMOTE_ADDR = 127.0.0.1
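
As a rough illustration of the kind of check a cautious robot might
apply, here is a minimal sketch in Python. The function name
looks_dynamic and the exact rules are my assumptions for illustration;
real robots differ, and I am not describing any particular one.

```python
from urllib.parse import urlsplit


def looks_dynamic(url):
    """Guess whether a cautious robot would skip this URL (hypothetical rule)."""
    parts = urlsplit(url)
    # A non-empty query string usually signals a search or a generated page.
    if parts.query:
        return True
    # Some robots are stricter and avoid anything under a cgi-bin directory.
    return "/cgi-bin/" in parts.path
```

With these rules the test-cgi URL above is skipped both with and
without the query string, which is why moving the script out of
cgi-bin matters as well as dropping the "?".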

A first step to hide the arguments would be to avoid using the QUERY_STRING
(or whatever the equivalent is in your API):

http://localhost/cgi-bin/test-cgi/a_fdsafdsa/b_fdsafdsa

The environment variables passed using ordinary CGI would be:

SERVER_PORT = 80
REQUEST_METHOD = GET
PATH_INFO = /a_fdsafdsa/b_fdsafdsa
PATH_TRANSLATED = /local_home/WWW/a_fdsafdsa/b_fdsafdsa
SCRIPT_NAME = /cgi-bin/test-cgi
QUERY_STRING =
REMOTE_HOST = localhost
REMOTE_ADDR = 127.0.0.1

The variables would be found in PATH_INFO rather than in QUERY_STRING.
If I had used index.cgi as the index file in my htdocs directory, the
URL above would have been

http://localhost/a_fdsafdsa/b_fdsafdsa

and no one (and certainly not a robot) could tell that everything was
delivered by an RDBMS!
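
If the links on your pages are generated too, they would have to be
written in the new form. A small sketch of that rewriting, assuming
the name_value segment convention from my example URLs; the helper
name query_to_path is hypothetical.

```python
from urllib.parse import parse_qsl, quote, urlsplit, urlunsplit


def query_to_path(url):
    """Rewrite ?a=x&b=y into /a_x/b_y path segments (illustrative convention)."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if not pairs:
        return url  # nothing to move; leave the URL untouched
    # Join each name and value with "_" and escape anything unsafe in a path.
    # Assumption: parameter names contain no "_", or the split is ambiguous.
    segments = [quote(f"{name}_{value}", safe="") for name, value in pairs]
    path = parts.path.rstrip("/") + "/" + "/".join(segments)
    return urlunsplit((parts.scheme, parts.netloc, path, "", parts.fragment))
```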

Your API should give you a way to read something equivalent to the
PATH_INFO environment variable.
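
On the receiving side, getting the parameters back out of the path is
simple. A minimal sketch, again assuming the name_value convention
used above; parse_path_info is a made-up name and nothing here is
specific to NSAPI.

```python
def parse_path_info(path_info):
    """Turn '/a_x/b_y' into {'a': 'x', 'b': 'y'} (illustrative convention)."""
    params = {}
    for segment in path_info.split("/"):
        if not segment:
            continue  # skip the empty piece before the leading slash
        # Split on the first "_" only, so values may contain underscores.
        name, sep, value = segment.partition("_")
        if sep:
            params[name] = value
    return params
```

Under ordinary CGI the input would come from the PATH_INFO environment
variable, for instance os.environ.get("PATH_INFO", "") in Python.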

Cheers,

Sigfrid

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send mail
to robots-request@webcrawler.com with the word "unsubscribe" in the body.
For more info see http://info.webcrawler.com/mak/projects/robots/robots.html