Re: escaped vs unescaped urls

Jaakko Hyvätti (Jaakko.Hyvatti@iki.fi)
Sat, 18 Jan 1997 14:11:22 +0200 (EET)


Dan Gildor wrote:
> what about urls with %xx characters escaped? That is, would robots and
> search engines index the following two urls as different urls or the same?
>
> http://whatever.com/~someone/index.html
> http://whatever.com/%7esomeone/index.html

These are equivalent. Because ~ is not a reserved character,
encoding it does not change its semantics. ";", "/", "?", ":", "@",
"=" and "&" are the characters which may be reserved for special
meaning within a scheme.
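
As a quick check of the claim above, here is a minimal sketch in Python
(my choice of language, the question did not mention one). It only shows
that decoding the %7e escape gives back the "~" spelling; it says nothing
about what any particular robot actually does.

```python
from urllib.parse import unquote

a = "http://whatever.com/~someone/index.html"
b = "http://whatever.com/%7esomeone/index.html"

# "~" is not reserved, so decoding %7e does not change the meaning;
# after decoding, the two spellings are character-for-character equal.
same = unquote(a) == unquote(b)
print(same)
```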

Characters "<", ">", """, "#" and "%" are defined as unsafe because
they usually have special meanings, and "{", "}", "|", "\", "^", "~",
"[", "]", and "`" are defined as unsafe because they sometimes get
corrupted in mail etc. All unsafe characters must always be encoded
within a URL, says rfc1738. I think this wording is a bit too strong.
For readability, I would keep the latter group unencoded unless I were
going to mail them to someone using EBCDIC or something :-). In fact, for
readability, I would even display characters 0xa1-0xff in ISO-8859-1.

You should read rfc1738 section "2.2. URL Character Encoding Issues"
very carefully. See <URL:http://www.funet.fi/pub/doc/rfc/rfc1738.txt>
or your nearest rfc archive.

To normalize a URL, in my opinion, one should do the following:

First, strip off any anchor, that is, remove the first "#" character
and anything that follows it. It is not a part of the URL. Also, if the
URL is picked up from some free text, strip leading and trailing "<",
">" and """ characters or the text "<URL:".
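
The stripping step could be sketched roughly like this in Python. The
function name strip_wrappers is made up for illustration, and matching
the "URL:" prefix case-insensitively is my assumption, not something
rfc1738 requires.

```python
def strip_wrappers(text: str) -> str:
    """Remove free-text wrappers and the anchor from a URL candidate."""
    url = text.strip()
    # Drop leading "<", ">" and '"' characters.
    url = url.lstrip('<>"')
    # Drop the "URL:" part of a "<URL:" wrapper (case-insensitive: assumption).
    if url[:4].upper() == "URL:":
        url = url[4:]
    # Drop trailing "<", ">" and '"' characters.
    url = url.rstrip('<>"')
    # Remove the first "#" and everything after it.
    return url.split("#", 1)[0]
```

For example, the wrapped form <URL:http://www.funet.fi/pub/doc/rfc/rfc1738.txt>
comes out as the bare URL.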

FOR every character in a URL:
    IF it is encoded (%xx)
    THEN
        IF it is in the range of 0x00-0x20 or 0x7f-0xff
           or it is one of "<", ">", """, "#", "%",
           ";", "/", "?", ":", "@", "=" or "&"
        THEN preserve its encoding
        ELSE decode it
    ELSE (not encoded)
        IF it is in the range of 0x00-0x20 or 0x7f-0xff
           or it is one of "<", ">" or """
        THEN encode it
        ELSE preserve it unencoded
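
The loop above might look roughly like this in Python. This is only a
sketch: the names normalize_escapes, KEEP_ENCODED and MUST_ENCODE are
invented here, writing escapes with uppercase hex digits is my addition
(the rule above only says "preserve"), and falling back to UTF-8 for
characters above 0xff is an assumption outside the 8-bit range discussed.

```python
import re

ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")
KEEP_ENCODED = set('<>"#%;/?:@=&')  # stay encoded if they arrive encoded
MUST_ENCODE = set('<>"')            # get encoded if they arrive raw


def outside_printable(code: int) -> bool:
    """True for 0x00-0x20 and for 0x7f and above."""
    return code <= 0x20 or code >= 0x7F


def normalize_escapes(url: str) -> str:
    out = []
    i = 0
    while i < len(url):
        m = ESCAPE.match(url, i)
        if m:  # an encoded character (%xx)
            code = int(m.group(1), 16)
            if outside_printable(code) or chr(code) in KEEP_ENCODED:
                # Preserve the encoding; uppercase hex is an assumption.
                out.append("%%%02X" % code)
            else:
                out.append(chr(code))  # decode it
            i = m.end()
        else:  # a character that is not encoded
            ch = url[i]
            if outside_printable(ord(ch)) or ch in MUST_ENCODE:
                # ISO-8859-1 byte for 0x00-0xff; UTF-8 above that (assumption).
                data = ch.encode("latin-1") if ord(ch) <= 0xFF else ch.encode("utf-8")
                out.append("".join("%%%02X" % b for b in data))
            else:
                out.append(ch)  # preserve it unencoded
            i += 1
    return "".join(out)
```

With this, the two URLs from the question normalize to the same string,
while an encoded "/" (%2f) stays encoded and so remains distinct from a
real path separator.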

Also, lowercase the scheme name. After that, do scheme-specific
things. With http, that means lowercasing the host name, reducing it
to a unique form, and stripping an explicit ":80".
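
A rough Python sketch of the http part, using the standard urllib.parse
module. The name normalize_http is made up for illustration. Making the
host unique beyond lowercasing (for example, picking one of several
aliases) is not attempted here, and IPv6 address literals are not handled.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_http(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc
    if scheme == "http":
        # Keep any "user:password@" prefix untouched.
        userinfo, at, hostport = netloc.rpartition("@")
        host, _, port = hostport.partition(":")
        host = host.lower()
        # Port 80 is the http default, so an explicit ":80" is dropped.
        if port and port != "80":
            host = host + ":" + port
        netloc = userinfo + at + host
    return urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))
```

So "HTTP://WWW.IKI.FI:80/~hyvatti/" and "http://www.iki.fi/~hyvatti/"
end up as the same string, while a non-default port such as ":8080" is
kept.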

I would also display characters in 0xa1-0xff range in ISO-8859-1,
but in a request send them to a server in encoded form.
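
The display/request split could be sketched like this in Python. The
function names for_display and for_request are hypothetical, and the
sketch assumes the URL is held as a string whose characters 0xa1-0xff
are ISO-8859-1 code points.

```python
import re

ANY_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def for_display(url: str) -> str:
    """Show escapes in the 0xa1-0xff range as ISO-8859-1 characters."""
    def show(m):
        code = int(m.group(1), 16)
        return chr(code) if 0xA1 <= code <= 0xFF else m.group(0)
    return ANY_ESCAPE.sub(show, url)


def for_request(url: str) -> str:
    """Send characters in the 0xa1-0xff range in encoded form."""
    return "".join(
        "%%%02X" % ord(c) if 0xA1 <= ord(c) <= 0xFF else c for c in url
    )
```

All other escapes, such as %20 or %2F, are left alone by both functions.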

-- 
# Jaakko.Hyvatti@iki.fi       http://www.iki.fi/~hyvatti/       +358 40 5011222
echo 'movl $36,%eax;int $128;movl $0,%ebx;movl $1,%eax;int $128'|as -o/bin/sync

_________________________________________________
This message was sent by the robots mailing list. To unsubscribe, send
mail to robots-request@webcrawler.com with the word "unsubscribe" in the
body. For more info see
http://info.webcrawler.com/mak/projects/robots/robots.html