Single tar (Re: Inktomi & large scale spidering)

Jaakko Hyvätti (Jaakko.Hyvatti@iki.fi)
Mon, 27 Jan 1997 10:30:49 +0200 (EET)


On 26 Jan 1997, Erik Selberg wrote:
> because you're already stuck with this format. Another option is that
> webmasters just ship a big tarball of raw text; it would seem to me
> that would be the best option for all parties. But who knows if anyone
> will implement it!

I think this would be the only really universal interface. On simple
servers it would be just a simple tar. If I get to do some robot work
later this year I'll consider implementing this on some scale here in
Finland. If robot admins see archives appearing on servers, maybe
they'll support them too; and if some robot supports them, maybe more
servers will provide them. If the idea survives, it was a good one.

Pros:

- No crawling, just grabbing a single file with a conditional GET: one
hit instead of thousands (a sketch follows this list).
- Compressed format saves bandwidth.
- Someone using session tracking cookies embedded in a URL would
be able to present a corrected URL space to robots.
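
A minimal sketch of the conditional fetch mentioned above, assuming wget
and an illustrative host name (neither is part of the proposal): the -N
flag makes the robot re-fetch the archive only when the server's copy is
newer than its local one, so an unchanged site costs one small request.

wget -N http://www.example.com/publicdocs.tar.gz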

Cons:

- Cheating by serving robots different pages with different keywords;
but this problem is present with crawling too.
- Fighting incompetency: should one verify all pages with HEAD requests?
- A single massive transfer might eat all the server's bandwidth for a
long time. A robot could limit this by read()ing slowly or by keeping
the TCP window small; maybe something could be done with TOS as well
(see the sketch after this list).
- It makes it too easy to copy information: copyrighted work stolen in
one neat packet. Then again, is that really so difficult with a robot
anyway?
- A site with fancy dynamic content would have to crawl itself to
create the archive. Oh well, such a site can simply skip this and let
the robot crawl it as before.
- If one single resource is updated, the archive file has to be
transmitted as a whole.
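
On the throttling point above: with current tools a robot can simply cap
its own transfer rate, which amounts to the slow read()ing suggested in
the list. A sketch, assuming wget and an illustrative host; the 20k
figure is arbitrary:

wget -N --limit-rate=20k http://www.example.com/publicdocs.tar.gz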

In <URL:http://info.webcrawler.com/mailing-lists/robots/0252.html>
I wrote about /robots.txt:

> I actually have a new idea too:
>
> Textarchive: /allpages.zip
>
> or
>
> Textarchive: /publicdocs.tar.gz
>
> (or with any other compressed archive format) ..instructs robots to
> fetch all there is in a compressed format. Is this a simple enough
> interface for everyone to accept? Too simple?
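
A robot that finds such a field might act on it roughly like this (a
sketch only; the host name is illustrative, and the Textarchive: field
is just the proposal above, nothing servers implement today):

robotstxt=`wget -q -O - http://www.example.com/robots.txt`
archive=`echo "$robotstxt" | sed -n 's/^Textarchive: *//p' | head -1`
test -n "$archive" && wget -N "http://www.example.com$archive"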

A robot should accept at least .tar.gz and .zip. In the simple case,
with no security taken into consideration, the archive file could be
updated like this (note that the per-user path rewrite below relies on
GNU tar and its --transform option):

# Collect pages under the server's document root (stored as ./path).
cd /../htdocs
find . -type f -name \*.htm\* -print > /tmp/files.lis.$$
# Add each user's public_html pages with their real filesystem paths.
for i in /home/users/*/public_html
do
    find $i -type f -name \*.htm\* -print
done >> /tmp/files.lis.$$
# Archive and compress; --transform (GNU tar) stores the per-user files
# under the /~user/ names of the server's URL space.  Build into a
# temporary file and rename, so a half-written archive is never served.
tar -c -f - -T /tmp/files.lis.$$ \
    --transform='s,^/*home/users/\([^/]*\)/public_html/,~\1/,' \
    | gzip -9 > publicdocs.tar.gz.tmp.$$
mv publicdocs.tar.gz.tmp.$$ publicdocs.tar.gz
rm -f /tmp/files.lis.$$
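
Such a script would be run from cron, for example nightly; the path of
the script here is of course just an illustration:

0 4 * * * /usr/local/sbin/update-publicdocs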

Or with zip, which is also how one would do it on DOS/NT/whatever, give
or take a few flags:

cd \htdocs
zip -ruo9 allpages.zip . -i *.htm*
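
On the robot side, unpacking a fetched archive for indexing could look
roughly like this, depending on the archive's format (a sketch; the
spool directory and host name are illustrative):

mkdir -p spool/www.example.com
gzip -dc publicdocs.tar.gz | tar -x -f - -C spool/www.example.com
# or, for the zip form:
unzip -oq allpages.zip -d spool/www.example.com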

-- 
# Jaakko.Hyvatti@iki.fi       http://www.iki.fi/~hyvatti/       +358 40 5011222
echo 'movl $36,%eax;int $128;movl $0,%ebx;movl $1,%eax;int $128'|as -o/bin/sync
