These are equivalent. Because "~" is not a reserved character,
encoding it does not change its semantics. ";", "/", "?", ":", "@",
"=" and "&" are the characters that may be reserved for special
meaning within a scheme.
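
As a quick check of the equivalence claim, here is a minimal Python sketch. It assumes the standard urllib.parse module; the RESERVED name is made up for illustration, and the example URL is simply the home page address from the signature.

```python
from urllib.parse import unquote

# Characters RFC 1738 lists as possibly reserved within a scheme.
RESERVED = set(";/?:@=&")

plain = "http://www.iki.fi/~hyvatti/"
encoded = "http://www.iki.fi/%7Ehyvatti/"

# "~" is not reserved, so decoding %7E does not change what the URL means.
print("~" in RESERVED)
print(unquote(encoded) == plain)
```

Decoding a reserved character such as %2F is a different matter, since "/" has a meaning of its own in the path.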

Characters "<", ">", """, "#" and "%" are defined as unsafe because
they usually have special meanings, and "{", "}", "|", "\", "^", "~",
"[", "]", and "`" are defined as unsafe because they sometimes get
corrupted in mail etc. All unsafe characters must always be encoded
within a URL, says RFC 1738. I think this wording is a bit too strong.
For readability, I would keep the latter group unencoded unless I were
going to mail them to someone using EBCDIC or something :-). In fact,
for readability, I would even display characters 0xa1-0xff in
ISO-8859-1.
You should read RFC 1738 section "2.2. URL Character Encoding Issues"
very carefully. See <URL:http://www.funet.fi/pub/doc/rfc/rfc1738.txt>
or your nearest RFC archive.

To normalize a URL, in my opinion, one should do the following:

First, strip off any anchor, that is, remove the first "#" character
and anything that follows it. It is not part of the URL. Also, if the
URL is picked up from some free text, strip leading and trailing "<",
">" and """ characters or the text "<URL:".

  FOR every character in a URL:
    IF it is encoded (%xx)
    THEN
      IF it is in the range of 0x00-0x20 or 0x7f-0xff
         or it is one of "<", ">", """, "#", "%",
         ";", "/", "?", ":", "@", "=" or "&"
      THEN preserve its encoding
      ELSE decode it
    ELSE (not encoded)
      IF it is in the range of 0x00-0x20 or 0x7f-0xff
         or it is one of "<", ">" or """
      THEN encode it
      ELSE preserve it unencoded
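
The stripping step and the per-character rule above could be sketched in Python roughly as follows. This is only one possible reading of the procedure, not a reference implementation. The function and constant names are made up for illustration. It assumes the URL arrives as a string whose characters all fit in ISO-8859-1, that surrounding whitespace may be dropped, and that newly encoded characters get uppercase hex digits (the procedure does not say which case to use).

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")
KEEP_ENCODED = set('<>"#%;/?:@=&')  # stay encoded if they arrive encoded
MUST_ENCODE = set('<>"')            # get encoded if they arrive raw


def strip_wrapping(text):
    """Drop free-text delimiters, a "URL:" prefix, and any "#anchor"."""
    url = text.strip().strip('<>"')
    if url[:4].upper() == "URL:":
        url = url[4:]
    return url.split("#", 1)[0]


def _outside_printable(code):
    """True for 0x00-0x20 and 0x7f-0xff."""
    return code <= 0x20 or code >= 0x7F


def normalize_chars(url):
    """Apply the encode/decode/preserve rule to every character."""
    out = []
    i = 0
    while i < len(url):
        m = _ESCAPE.match(url, i)
        if m:
            code = int(m.group(1), 16)
            if _outside_printable(code) or chr(code) in KEEP_ENCODED:
                out.append(m.group(0))   # preserve the encoding as written
            else:
                out.append(chr(code))    # decode it
            i = m.end()
            continue
        code = ord(url[i])
        if code > 0xFF:
            # Assumption: input is limited to ISO-8859-1.
            raise ValueError("character outside ISO-8859-1: %r" % url[i])
        if _outside_printable(code) or url[i] in MUST_ENCODE:
            out.append("%%%02X" % code)  # encode it
        else:
            out.append(url[i])           # preserve it unencoded
        i += 1
    return "".join(out)


print(normalize_chars(strip_wrapping("<URL:http://www.iki.fi/%7Ehyvatti/#top>")))
```

A "%" that is not followed by two hex digits is treated here as an ordinary unencoded character and passed through, which is what the rule implies but may not be what you want for malformed input.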

Also, lowercase the scheme name. After that, do scheme-specific
things. With http, that means lowercasing the host name, reducing it
to a single canonical form, and stripping an explicit ":80".
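
For the http case, a minimal sketch using Python's urllib.parse might look like the one below. The function name normalize_http is hypothetical. It only lowercases the scheme and host and drops an explicit ":80"; mapping several names of the same server onto one unique host name needs knowledge the URL itself does not carry, so that part is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_http(url):
    """Lowercase scheme and host; strip an explicit ":80" from http URLs."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc
    if scheme == "http":
        # Leave any "user:password@" part alone; only the host is
        # case-insensitive.
        userinfo, at, hostport = netloc.rpartition("@")
        hostport = hostport.lower()
        if hostport.endswith(":80"):
            hostport = hostport[:-3]
        netloc = userinfo + at + hostport
    return urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))


print(normalize_http("HTTP://WWW.IKI.FI:80/~hyvatti/"))
```

The path is left as it is, because case is significant there on many servers.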

I would also display characters in the 0xa1-0xff range in ISO-8859-1,
but in a request send them to the server in encoded form.
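
The display/request split could be sketched like this in Python. Both function names and the example path are hypothetical. The sketch relies on the fact that Unicode code points 0xa1-0xff coincide with ISO-8859-1, and it assumes uppercase hex digits when encoding for the request.

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def for_display(url):
    """Show %A1-%FF escapes as ISO-8859-1 characters; keep other escapes."""
    def show(m):
        code = int(m.group(1), 16)
        return chr(code) if 0xA1 <= code <= 0xFF else m.group(0)
    return _ESCAPE.sub(show, url)


def for_request(url):
    """Encode raw 0xa1-0xff characters before sending the URL to a server."""
    return "".join(
        "%%%02X" % ord(c) if 0xA1 <= ord(c) <= 0xFF else c
        for c in url
    )


shown = for_display("http://www.iki.fi/p%E4iv%E4")  # hypothetical path
print(shown)
print(for_request(shown))
```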

--
# Jaakko.Hyvatti@iki.fi http://www.iki.fi/~hyvatti/ +358 40 5011222
echo 'movl $36,%eax;int $128;movl $0,%ebx;movl $1,%eax;int $128'|as -o/bin/sync