RE: avoiding infinite regress for robots

David Eagles (eaglesd@planets.com.au)
Tue, 9 Jan 1996 08:03:08 +-1100



Benjamin Franz wrote:

>Many sites use
>symbolic links from lower to upper levels. If you try to suck
>'everything', you will end up in an infinite recursion. You need a depth
>limit (no more than X '/' elements in the URL), and probably a total
>pages limit (no more than Y pages total) to prevent any obscure cases
>from sucking it down an unexpected rat hole.
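
The two limits described in the quote can be sketched roughly as follows. This is only an illustration, not any particular spider's code: `crawl`, `url_depth`, `max_depth`, `max_pages`, and the caller-supplied `get_links` function are all names assumed here.

```python
from collections import deque
from urllib.parse import urlsplit


def url_depth(url):
    """Count the non-empty '/'-separated elements in the URL path."""
    return len([part for part in urlsplit(url).path.split("/") if part])


def crawl(start_url, get_links, max_depth=8, max_pages=1000):
    """Breadth-first walk bounded by a path-depth limit and a page cap.

    get_links is a hypothetical caller-supplied function that returns
    the list of URLs linked from a given page.
    """
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url_depth(url) > max_depth:
            continue  # too many '/' elements: possibly a symlink loop
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Neither limit detects a loop as such; they only bound how far the walk can go before it gives up.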

I'm surprised that no spider seems to use the page content to guess whether or
not two document trees are equal. For example, one heuristic would be to keep
a checksum for every visited page, and to decide that two subtrees are probably
equal if their root nodes and their children have identical checksums.

Do spiders use the content to cut off walks, and if not, is it because
alternative techniques are sufficient? Since my own spiders are rather
simple-minded (and not widely used), I'd be interested in seeing a more
informed opinion on the usefulness of comparing content.

Yep. This is one of the ways FunnelWeb (the latest version, which I haven't
quite released yet) checks for looping.
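
A minimal sketch of the checksum heuristic described above, for illustration only. It is not FunnelWeb's code; `LoopDetector`, `subtree_signature`, and the caller-supplied `fetch` function are assumed names, and SHA-256 is just one possible choice of checksum.

```python
import hashlib


def checksum(body):
    """Hex digest of a page body given as bytes."""
    return hashlib.sha256(body).hexdigest()


def subtree_signature(url, fetch):
    """Checksum of the page at url plus the checksums of its children.

    fetch is a hypothetical caller-supplied function that returns
    (body_bytes, list_of_child_urls) for a URL.
    """
    body, children = fetch(url)
    child_sums = tuple(checksum(fetch(child)[0]) for child in children)
    return (checksum(body), child_sums)


class LoopDetector:
    """Remembers subtree signatures and flags roots that repeat one."""

    def __init__(self):
        self.first_seen = {}  # signature -> first URL that produced it

    def probably_same_subtree(self, url, fetch):
        sig = subtree_signature(url, fetch)
        first = self.first_seen.setdefault(sig, url)
        return first != url
```

A spider could call `probably_same_subtree` before descending into a directory and skip the walk when it returns True. The match is only probable: two genuinely different trees whose roots and children happen to be identical would also be cut off.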

What would be REALLY nice, however, would be if the HTTP spec was
extended to include a Filename: field sent by the server for every
request. The field would specify the exact filename after all links
were resolved and would therefore eliminate a lot of the guesswork,
parsing, etc. required by clients, spiders, etc.
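
If servers did send such a field, a spider's duplicate check could shrink to something like the sketch below. The `Filename` field is the proposal made above, not something servers are known to send, and `is_new_document` and `seen_files` are names assumed for the example.

```python
def is_new_document(headers, seen_files):
    """Return True the first time a resolved filename is seen.

    headers is a dict of response fields. 'Filename' is the proposed,
    hypothetical field carrying the path after all links are resolved.
    seen_files is a set that this function updates in place.
    """
    resolved = headers.get("Filename")
    if resolved is None:
        return True  # field absent: nothing to deduplicate on
    if resolved in seen_files:
        return False  # same underlying file reached through another URL
    seen_files.add(resolved)
    return True
```

Two URLs that resolve to the same file through a symbolic link would then report the same value, and the second one could be skipped without comparing content.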

Hope everyone had a great New Year.

Regards,
David
