Re: nastygram from xxx.lanl.gov

Michael Schlindwein (m_schlin@informatik.uni-kl.de)
Wed, 10 Jul 1996 22:51:11 +0200 (MET DST)


Hi

>
>
> Yes, but the specific point about a browser following a link is that
> it ALWAYS has user interaction and monitoring. If it turns out that
> the browser is downloading 5000MB of garbage (this could be a server
> error in the HEAD request, so don't claim it's not applicable in this
> case) then the user will stop it somewhere between 50k and 1MB. The

How realistic is such a case? How often will it happen?

If a user has checked out a document personally (=> download no problem, fine page)
and then lets his agent monitor the site... tell me how probable it is that
such a disaster will happen.

> robot won't. Just as long as there is an agent with net access and no
> human there is potential for disaster.

Sure, you are absolutely right!
You could also say: just as long as there are complex software systems running,
there is a potential for disaster...
And although people are mostly not able to manage big software projects,
they now aim at unleashing autonomous agents.
I think this is a piece of human nature... ;-)

>
> The test should be
> do you select each link to be traversed before it is

What exactly do you mean here?
Selection by the user directly before each document is retrieved?

Or would you agree with me that if the user HAS personally checked a document,
this would count as "select each link" and his agent can download the page
automatically from this moment on?

> do you examine the output of that link as it downloads or at
> least monitor the size / time it's taking

We surely will do this.
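
For example, the guard could look roughly like the sketch below (in today's
Python; the limits and the guarded_fetch name are made up, it is only meant to
show the kind of monitoring a human would otherwise do by hand):

    import time
    import urllib.request

    MAX_BYTES = 1_000_000     # made-up limit: abort after ~1 MB
    MAX_SECONDS = 30          # made-up limit: abort after 30 seconds

    def guarded_fetch(url):
        # Download url, but give up if the response grows too large
        # or the transfer takes too long.
        start = time.time()
        data = b""
        with urllib.request.urlopen(url, timeout=MAX_SECONDS) as resp:
            while True:
                chunk = resp.read(8192)
                if not chunk:
                    break
                data += chunk
                if len(data) > MAX_BYTES:
                    raise RuntimeError("aborted: more than %d bytes" % MAX_BYTES)
                if time.time() - start > MAX_SECONDS:
                    raise RuntimeError("aborted: more than %d seconds" % MAX_SECONDS)
        return data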

> If not, it's a robot.

>
...
>
> But, there are many things which could do robot style damage
> (automatic uncontrolled downloads) to my server without me being able
> to tell. For example, things which check for updates on pages can do
> this; one just needs enough of them. If I say no robots, that's what I mean.

This is right if many people use such functionality, and especially for
popular sites! (In my last mail (just on the way at the moment) I referred to
this problem.)

>
> If I need an exception to the robots exclusion protocol, what I'll do
> is go and ask the site admin. Is this so difficult? Generally it's
> the first email address on the top page of the whole site.

I don't think that's the point here.
Are you an admin?
Let's assume tools to monitor pages are available to every user...

(If this is not reality yet, I strongly believe that it will come; the
benefit for the user is evident, and the more people go online, the stronger
the commercial interests become (also in selling funny-funky user tools)...
So why should this not happen?)

... I think you can imagine the result... ;-)

>
...
>
> I do think we should get a better robot exclusion protocol which
> distinguishes
>
> head requests / body requests
>
> time of day (for their free time)
>
> maximum rate of requests (after this many k, wait this long?)
>
> how often we update our robots.txt
>
> reason for robot (link checking / indexing / finding junk
> email lists )

checking changes ;-)
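
Just to make the idea concrete: such an extended robots.txt could perhaps look
like the sketch below. Only User-agent and Disallow are real fields from the
current exclusion standard; every other field name here is invented purely to
illustrate the proposal above.

    # real fields from the current robots exclusion standard
    User-agent: *
    Disallow: /tmp/

    # invented fields, only to illustrate the proposal
    Head-only: /archive/          # only HEAD requests in this subtree
    Visit-time: 0100-0500         # off-peak hours, server local time
    Request-rate: 100k/60s        # after this many kB, wait this long
    Robots-txt-refresh: 7d        # how often to re-read this file
    Allowed-purpose: link-checking, change-monitoring
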
I read about a method supported by HTTP by which one can download a document
only when it was "LAST-MODIFIED AFTER xxx" (the If-Modified-Since request
header, if I understand it correctly).
This would be THE method to avoid downloading the whole page just to check for
changes.
I also read that this often is not implemented... (in the servers, I think).
Can anyone tell me more about this?
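
If it is the header I think it is, such a conditional download would look
roughly like this in today's Python (URL and date are placeholders; the server
answers 304 Not Modified and sends no body if the page is unchanged):

    import urllib.request
    from urllib.error import HTTPError

    def fetch_if_changed(url, last_seen):
        # last_seen is an HTTP date string, e.g. "Wed, 10 Jul 1996 20:00:00 GMT"
        req = urllib.request.Request(url)
        req.add_header("If-Modified-Since", last_seen)
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()      # 200 OK: the page changed, body returned
        except HTTPError as err:
            if err.code == 304:
                return None             # 304 Not Modified: nothing to download
            raise

Whether this actually helps depends on the server: it has to understand the
header and answer with 304, which (as I read) many servers do not do yet.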

Mike