rfc2396
Network Working Group                                     T. Berners-Lee
Request for Comments: 2396                                       MIT/LCS
Updates: 1808, 1738                                          R. Fielding
Category: Standards Track                                    U.C. Irvine
                                                             L. Masinter
                                                       Xerox Corporation
                                                             August 1998
Copyright Notice
IESG Note
Abstract
1. Introduction
All significant changes from the prior RFCs are noted in Appendix G.
Uniform
Uniformity provides several benefits: it allows different types
of resource identifiers to be used in the same context, even
when the mechanisms used to access those resources may differ;
it allows uniform semantic interpretation of common syntactic
conventions across different types of resource identifiers; it
allows introduction of new types of resource identifiers
without interfering with the way that existing identifiers are
used; and, it allows the identifiers to be reused in many
different contexts, thus permitting new applications or
protocols to leverage a pre-existing, large, and widely-used
set of resource identifiers.
Resource
A resource can be anything that has identity. Familiar
examples include an electronic document, an image, a service
(e.g., "today's weather report for Los Angeles"), and a
collection of other resources. Not all resources are network
"retrievable"; e.g., human beings, corporations, and bound
books in a library can also be considered resources.
Identifier
An identifier is an object that can act as a reference to
something that has identity. In the case of URI, the object is
a sequence of characters with a restricted syntax.
The URI scheme (Section 3.1) defines the namespace of the URI, and
thus may further restrict the syntax and semantics of identifiers
using that scheme. This specification defines those elements of the
URI syntax that are either required of all URI schemes or are common
to many URI schemes. It thus defines the syntax and semantics that
are needed to implement a scheme-independent parsing mechanism for
URI references, such that the scheme-dependent handling of a URI can
be postponed until the scheme-dependent semantics are needed. We use
the term URL below when describing syntax or semantics that only
apply to locators.
Although many URL schemes are named after protocols, this does not
imply that the only way to access the URL's resource is via the named
protocol. Gateways, proxies, caches, and name resolution services
might be used to access some resources, independent of the protocol
of their origin, and the resolution of some URL may require the use
of more than one protocol (e.g., both DNS and HTTP are typically used
to access an "http" URL's resource when it can't be found in a local
cache).
ftp://ftp.is.co.za/rfc/rfc1808.txt
-- ftp scheme for File Transfer Protocol services
gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles
-- gopher scheme for Gopher and Gopher+ Protocol services
http://www.math.uio.no/faq/compression-faq/part1.html
-- http scheme for Hypertext Transfer Protocol services
mailto:[email protected]
-- mailto scheme for electronic mail addresses
news:comp.infosystems.www.servers.unix
-- news scheme for USENET news groups and articles
telnet://melvyl.ucop.edu/
-- telnet scheme for interactive services via the TELNET Protocol
and with improving technology, users might benefit from being able to
use a wider range of characters; such use is not defined in this
document.
This document uses two conventions to describe and define the syntax
for URI. The first, called the layout form, is a general description
of the order of components and component separators, as in
<first>/<second>;<third>?<fourth>
lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
"j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
"s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
"J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
"S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
The "reserved" syntax class above refers to those characters that are
allowed within a URI, but which may not be allowed within a
particular component of the generic URI syntax; they are used as
delimiters of the components described in Section 3.
Data characters that are allowed in a URI but do not have a reserved
purpose are called unreserved. These include upper and lower case
letters, decimal digits, and a limited set of punctuation marks and
symbols.
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
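As a rough illustration, the unreserved set can be written down directly from the description above. The sketch below is in Python; the names MARK, UNRESERVED, and is_unreserved are illustrative only and are not defined by this specification.

```python
import string

# "mark" characters, copied from the rule above.
MARK = "-_.!~*'()"

# Upper and lower case letters, decimal digits, and the mark characters.
UNRESERVED = frozenset(string.ascii_letters + string.digits + MARK)


def is_unreserved(ch: str) -> bool:
    """Return True if the single character ch is in the unreserved set."""
    return len(ch) == 1 and ch in UNRESERVED
```

A character for which is_unreserved returns False is either reserved, excluded, or must be escaped before it can appear as data.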
Because the percent "%" character always has the reserved purpose of
being the escape indicator, it must be escaped as "%25" in order to
be used as data within a URI. Implementers should be careful not to
escape or unescape the same string more than once, since unescaping
an already unescaped string might lead to misinterpreting a percent
data character as another escaped character, or vice versa in the
case of escaping an already escaped string.
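The hazard is easy to reproduce. The following sketch uses quote and unquote from Python's urllib.parse module as stand-ins for an escaping and an unescaping step; the sample string is made up for illustration.

```python
from urllib.parse import quote, unquote

data = "100%41"  # literal data that happens to contain a percent character

once = quote(data, safe="")   # the "%" is escaped as "%25"
twice = quote(once, safe="")  # escaping again escapes the escape itself

decoded_once = unquote(once)           # recovers the original data
decoded_twice = unquote(decoded_once)  # "%41" is now misread as "A"
```

After one round trip the data is intact; after the second unescape the percent data character has been misinterpreted as the start of another escaped character.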
Although they are disallowed within the URI syntax, we include here a
description of those US-ASCII characters that have been excluded and
the reasons for their exclusion.
The control characters in the US-ASCII coded character set are not
used within a URI, both because they are non-printable and because
they are likely to be misinterpreted by some control mechanisms.
The angle-bracket "<" and ">" and double-quote (") characters are
excluded because they are often used as the delimiters around URI in
text documents and protocol fields. The character "#" is excluded
because it is used to delimit a URI from a fragment identifier in URI
references (Section 4). The percent character "%" is excluded because
it is used for the encoding of escaped characters.
<scheme>:<scheme-specific-part>
An absolute URI contains the name of the scheme being used (<scheme>)
followed by a colon (":") and then a string (the <scheme-specific-
part>) whose interpretation depends on the scheme.
The URI syntax does not require that the scheme-specific-part have
any general structure or set of semantics which is common among all
URI. However, a subset of URI do share a common syntax for
representing hierarchical relationships within the namespace. This
"generic URI" syntax consists of a sequence of four main components:
<scheme>://<authority><path>?<query>
URI that are hierarchical in nature use the slash "/" character for
separating hierarchical components. For some file systems, a "/"
character (used to denote the hierarchical structure of a URI) is the
delimiter used to construct a file name hierarchy, and thus the URI
path will look similar to a file pathname. This does NOT imply that
the resource is a file or that the URI maps to an actual filesystem
pathname.
URI that do not make use of the slash "/" character for separating
hierarchical components are considered opaque by the generic URI
parser.
<userinfo>@<host>:<port>
The port is the network port number for the server. Most schemes
designate protocols that have a default port number. Another port
number may optionally be supplied, in decimal, separated from the
host by a colon. If the port is omitted, the default port number is
assumed.
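A minimal sketch of this rule, in Python, might look as follows. It assumes a server-based authority in the layout form shown above; the names Authority, split_authority, effective_port, and the DEFAULT_PORTS table are hypothetical, and the defaults for a real scheme come from that scheme's own definition.

```python
from typing import NamedTuple, Optional


class Authority(NamedTuple):
    userinfo: Optional[str]
    host: str
    port: Optional[int]


# Hypothetical default-port table, for illustration only.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "telnet": 23}


def split_authority(authority: str) -> Authority:
    """Split an authority of the form <userinfo>@<host>:<port>."""
    userinfo, at, hostport = authority.rpartition("@")
    host, _, port_text = hostport.partition(":")
    # An empty port (e.g. "host:") is treated the same as an omitted port.
    port = int(port_text) if port_text else None
    return Authority(userinfo if at else None, host, port)


def effective_port(scheme: str, authority: str) -> Optional[int]:
    """Return the explicit port, else the scheme's default if one is known."""
    port = split_authority(authority).port
    return port if port is not None else DEFAULT_PORTS.get(scheme)
```

The sketch does not validate that the port consists only of decimal digits; int() raises ValueError on anything else.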
The path component contains data, specific to the authority (or the
scheme if there is no authority component), identifying the resource
within the scope of that scheme and authority.
query = *uric
Within a query component, the characters ";", "/", "?", ":", "@",
"&", "=", "+", ",", and "$" are reserved.
4. URI References
The syntax for relative URI is a shortened form of that for absolute
URI, where some prefix of the URI is missing and certain path
components ("." and "..") have a special meaning when, and only when,
interpreting a relative path. The relative URI syntax is defined in
Section 5.
fragment = *uric
The syntax for relative URI takes advantage of the <hier_part> syntax
of <absoluteURI> (Section 3) in order to express a reference that is
relative to the namespace of another hierarchical URI.
The term "relative URI" implies that there exists some absolute "base
URI" against which the relative reference is applied. Indeed, the
base URI is necessary to define the semantics of any relative URI
reference; without it, a relative reference is meaningless. In order
for relative URI to be usable within a document, the base URI of that
document must be known to the parser.
.----------------------------------------------------------.
|  .----------------------------------------------------.  |
|  |  .----------------------------------------------.  |  |
|  |  |  .----------------------------------------.  |  |  |
|  |  |  |  .----------------------------------.  |  |  |  |
|  |  |  |  |       <relative_reference>       |  |  |  |  |
|  |  |  |  `----------------------------------'  |  |  |  |
|  |  |  | (5.1.1) Base URI embedded in the       |  |  |  |
|  |  |  |         document's content             |  |  |  |
|  |  |  `----------------------------------------'  |  |  |
|  |  | (5.1.2) Base URI of the encapsulating entity |  |  |
|  |  |         (message, document, or none).        |  |  |
|  |  `----------------------------------------------'  |  |
|  | (5.1.3) URI used to retrieve the entity            |  |
|  `----------------------------------------------------'  |
| (5.1.4) Default Base URI is application-dependent        |
`----------------------------------------------------------'
Within certain document media types, the base URI of the document can
be embedded within the content itself such that it can be readily
obtained by a parser. This can be useful for descriptive documents,
such as tables of content, which may be transmitted to others through
protocols other than their usual retrieval context (e.g., E-Mail or
USENET news).
A mechanism for embedding the base URI within MIME container types
(e.g., the message and multipart types) is defined by MHTML
[RFC2110]. Protocols that do not use the MIME message header syntax,
but which do allow some form of tagged metainformation to be included
within messages, may define their own syntax for defining the base
URI as part of a message.
to define the base URI using one of the other methods may result in
the same content being interpreted differently by different types of
application.
The base URI is established according to the rules of Section 5.1 and
parsed into the four main components as described in Section 3. Note
that only the scheme component is required to be present in the base
URI; the other components may be empty or undefined. A component is
undefined if its preceding separator does not appear in the URI
reference; the path component is never undefined, though it may be
empty. The base URI's query component is not used by the resolution
algorithm and may be discarded.
For each URI reference, the following steps are performed in order:
1) The URI reference is parsed into the potential four components and
fragment identifier, as described in Section 4.3.
can then continue with the steps below for the remainder of the
reference components. Validating parsers should mark such a
misformed relative reference as an error.
a) All but the last segment of the base URI's path component is
copied to the buffer. In other words, any characters after the
last (right-most) slash character, if any, are excluded.
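Step a) can be sketched in a single Python expression. The function name base_path_prefix is illustrative and is not part of this specification.

```python
def base_path_prefix(base_path: str) -> str:
    """Return everything up to and including the last slash of base_path."""
    # rfind returns -1 when there is no slash, so the slice is then empty.
    return base_path[: base_path.rfind("/") + 1]
```

For a base path of "/b/c/d;p" this yields "/b/c/", to which the reference's path is then appended.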
result = ""
return result
7. Security Considerations
A URI does not in itself pose a security threat. Users should beware
that there is no general guarantee that a URL, which at one time
located a given resource, will continue to do so. Nor is there any
guarantee that a URL will not locate a different resource at some
later point in time, due to the lack of any constraint on how a given
authority apportions its namespace. Such a guarantee can only be
obtained from the person(s) controlling that namespace and the
resource in question. A specific URI scheme may include additional
semantics, such as name persistence, if those semantics are required
of all naming authorities for that scheme.
Caution should be used when using any URL that specifies a port
number other than the default for the protocol, especially when it is
a number within the reserved space.
8. Acknowledgements
This document was derived from RFC 1738 [RFC1738] and RFC 1808
[RFC1808]; the acknowledgements in those specifications still apply.
In addition, contributions by Gisle Aas, Martin Beet, Martin Duerst,
Jim Gettys, Martijn Koster, Dave Kristol, Daniel LaLiberte, Foteos
Macrides, James Marshall, Ryan Moats, Keith Moore, and Lauren Wood
are gratefully acknowledged.
9. References
[RFC822] Crocker, D., "Standard for the Format of ARPA Internet Text
Messages", STD 11, RFC 822, August 1982.
Tim Berners-Lee
World Wide Web Consortium
MIT Laboratory for Computer Science, NE43-356
545 Technology Square
Cambridge, MA 02139
Fax: +1(617)258-8682
EMail: [email protected]
Roy T. Fielding
Department of Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425
Fax: +1(949)824-1715
EMail: [email protected]
Larry Masinter
Xerox PARC
3333 Coyote Hill Road
Palo Alto, CA 94034
Fax: +1(415)812-4333
EMail: [email protected]
query = *uric
fragment = *uric
lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
"j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
"s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
"J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
"S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
"8" | "9"
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis). We refer to the value matched for subexpression
<n> as $<n>. For example, matching the above expression to
http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
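The same expression can be tried out directly. The sketch below uses Python's re module; the names URI_RE and split_uri_reference are illustrative. A subexpression that did not take part in the match is reported by Python as None, which corresponds to <undefined> above.

```python
import re

# The expression given above, unchanged.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri_reference(ref: str) -> dict:
    """Break a URI reference into its five components.

    Components whose subexpression did not match are reported as None.
    """
    m = URI_RE.match(ref)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


parts = split_uri_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related")
```

Note that the expression only separates the components; it does not check that each component is valid under the grammar.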
http://a/b/c/d;p?q
g:h           =  g:h
g             =  http://a/b/c/g
g/            =  http://a/b/c/g/
/g            =  http://a/g
//g           =  http://g
?y            =  http://a/b/c/?y
g?y           =  http://a/b/c/g?y
#s            =  (current document)#s
g#s           =  http://a/b/c/g#s
g?y#s         =  http://a/b/c/g?y#s
;x            =  http://a/b/c/;x
g;x           =  http://a/b/c/g;x
g;x?y#s       =  http://a/b/c/g;x?y#s
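Several of these cases can be checked against an existing resolver. The sketch below uses urljoin from Python's urllib.parse module, which is an independent implementation and not part of this specification. It follows later resolution rules that give a different answer for a few references (for example "?y"), so only cases on which it agrees with the list above are included.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# (relative reference, expected result) pairs taken from the list above.
CASES = [
    ("g:h", "g:h"),
    ("g", "http://a/b/c/g"),
    ("g/", "http://a/b/c/g/"),
    ("/g", "http://a/g"),
    ("//g", "http://g"),
    ("g?y", "http://a/b/c/g?y"),
    ("g#s", "http://a/b/c/g#s"),
    ("g?y#s", "http://a/b/c/g?y#s"),
    ("g;x", "http://a/b/c/g;x"),
    ("g;x?y#s", "http://a/b/c/g;x?y#s"),
]

# Resolve each reference against the base URI.
results = {ref: urljoin(BASE, ref) for ref, _ in CASES}
```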
Parsers must be careful in handling the case where there are more
relative path ".." segments than there are hierarchical levels in the
base URI's path. Note that the ".." syntax cannot be used to change
the authority component of a URI.
Similarly, parsers must avoid treating "." and ".." as special when
they are not complete components of a relative path.
/./g          =  http://a/./g
/../g         =  http://a/../g
g.            =  http://a/b/c/g.
g..           =  http://a/b/c/g..
Less likely are cases where the relative URI uses unnecessary or
nonsensical forms of the "." and ".." complete path segments.
g/./h         =  http://a/b/c/g/h
g/../h        =  http://a/b/c/h
g;x=1/./y     =  http://a/b/c/g;x=1/y
g;x=1/../y    =  http://a/b/c/y
All client applications remove the query component from the base URI
before resolving relative URI. However, some applications fail to
separate the reference's query and/or fragment components from a
relative path before merging it with the base path. This error is
rarely noticed, since typical usage of a fragment never includes the
hierarchy ("/") character, and the query component is not normally
used within relative references.
g?y/./x       =  http://a/b/c/g?y/./x
g?y/../x      =  http://a/b/c/g?y/../x
g#s/./x       =  http://a/b/c/g#s/./x
g#s/../x      =  http://a/b/c/g#s/../x
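One way to avoid this error is to split the reference before any path merging takes place, so that "." and ".." inside the query or fragment are never seen by the path logic. The following Python sketch shows only that splitting step; the name split_reference is illustrative.

```python
from typing import Optional, Tuple


def split_reference(ref: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split a relative reference into (path, query, fragment)."""
    # The fragment starts at the first "#"; remove it before looking for "?".
    rest, hash_sign, fragment = ref.partition("#")
    path, question, query = rest.partition("?")
    return (
        path,
        query if question else None,
        fragment if hash_sign else None,
    )
```

Only the first element of the result is merged with the base path; the query and fragment are reattached unchanged afterwards.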
<http://www.ics.uci.edu/Test/a/x>
URI are often transmitted through formats that do not provide a clear
context for their interpretation. For example, there are many
occasions when URI are included in plain text; examples include text
sent in electronic mail, USENET news messages, and, most importantly,
printed on paper. In such cases, it is important to be able to
delimit the URI from the rest of the text, and in particular from
punctuation marks that might be mistaken for part of the URI.
http://test.com/
ietf/uri/historical.html#WARNING>.
http://www.w3.org/Addressing/
ftp://ds.internic.net/rfc/
http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING
F. Abbreviated URLs
www.w3.org/Addressing/
or simply the DNS hostname on its own. Such references are primarily
intended for human interpretation rather than machine, with the
assumption that context-based heuristics are sufficient to complete
the URL (e.g., most hostnames beginning with "www" are likely to have
a URL prefix of "http://"). Although there is no standard set of
heuristics for disambiguating abbreviated URL references, many client
implementations allow them to be entered by the user and
heuristically resolved. It should be noted that such heuristics may
change over time, particularly when new URL schemes are introduced.
Since an abbreviated URL has the same syntax as a relative URL path,
abbreviated URL references cannot be used in contexts where relative
URLs are expected. This limits the use of abbreviated URLs to places
where there is no defined base URL, such as dialog boxes and off-line
advertisements.
G.1. Additions
Both RFC 1738 and RFC 1808 refer to the "reserved" set of characters
as if URI-interpreting software were limited to a single set of
characters with a reserved purpose (i.e., as meaning something other
than the data to which the characters correspond), and that this set
was fixed by the URI scheme. However, this has not been true in
practice; any character that is interpreted differently when it is
escaped is, in effect, reserved. Furthermore, the interpreting
engine on an HTTP server is often dependent on the resource, not just
the URI scheme. The description of reserved characters has been
changed accordingly.
The plus "+", dollar "$", and comma "," characters have been added to
those in the "reserved" set, since they are treated as reserved
within the query component.
The tilde "~" character was added to those in the "unreserved" set,
since it is extensively used on the Internet in spite of the
difficulty of transcribing it on some keyboards.
The syntax for URI scheme has been changed to require that all
schemes begin with an alpha character.
The question-mark "?" character was removed from the set of allowed
characters for the userinfo in the authority component, since testing
showed that many applications treat it as reserved for separating the
query component from the rest of the URI.
RFC 1738 specified that the path was separated from the authority
portion of a URI by a slash. RFC 1808 followed suit, but with a
fudge of carrying around the separator as a "prefix" in order to
describe the parsing algorithm. RFC 1630 never had this problem,
since it considered the slash to be part of the path. In writing
this specification, it was found to be impossible to accurately
describe and retain the difference between the two URI
<foo:/bar> and <foo:bar>
without either considering the slash to be part of the path (as
corresponds to actual practice) or creating a separate component just
to hold that slash. We chose the former.
The URL port is now *digit instead of 1*digit, since systems are
expected to handle the case where the ":" separator between host and
port is supplied without a port.
The description of the mythical Base header field has been replaced
with a reference to the Content-Location header field defined by
MHTML [RFC2110].
RFC 1808 described various schemes as either having or not having the
properties of the generic URI syntax. However, the only requirement
is that the particular document containing the relative references
have a base URI that abides by the generic URI syntax, regardless of
the URI scheme, so the associated description has been updated to
reflect that.
The BNF term <net_loc> has been replaced with <authority>, since the
latter more accurately describes its use and purpose. Likewise, the
authority is no longer restricted to the IP server syntax.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.