Is array syntax using square brackets in URL query strings valid?

Timo Huovinen picture Timo Huovinen · Jul 15, 2012 · Viewed 16.8k times · Source

Is it actually safe/valid to use multidimensional array synthax in the URL query string?

http://example.com?abc[]=123&abc[]=456

It seems to work in every browser and I always thought it was OK to use, but accodring to a comment in this article it is not: http://www.456bereastreet.com/archive/201008/what_characters_are_allowed_unencoded_in_query_strings/#comment4

I would like to hear a second opinion.

Answer

David N. Jafferian picture David N. Jafferian · Sep 2, 2016

The answer is not simple.

The following is extracted from section 3.2.2 of RFC 3986 :

A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax.

This seems to answer the question by flatly stating that square brackets are not allowed anywhere else in the URI. But there is a difference between a square bracket character and a percent encoded square bracket character.

The following is extracted from the beginning of section 3 of RFC 3986 :

  1. Syntax Components

    The generic URI syntax consists of a hierarchical sequence of
    components referred to as the scheme, authority, path, query, and
    fragment.

    URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

So the "query" is a component of the "URI".

The following is extracted from section 2.2 of RFC 3986 :

2.2. Reserved Characters

URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must
be percent-encoded before the URI is formed.

  reserved    = gen-delims / sub-delims

  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

So square brackets may appear in a query string, but only if they are percent encoded. Unless they aren't, to be explained further down in section 2.2 :

URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.

So because square brackets are only allowed in the "host" subcomponent, they "should" be percent encoded in other components and subcomponents, and in this case in the "query" component, unless RFC 3986 explicitly allows unencoded square brackets to represent data in the query component, which is does not.

However, if a "URI producing application" fails to do what it "should" do, by leaving square brackets unencoded in the query, then readers of the URI are not to reject the URI outright. Instead, the square brackets are to be considered as belonging to the data of the query component, since they are not used as delimiters in that component.

This is why, for example, it is not a violation of RFC 3986 when PHP accepts both unencoded and percent encoded square brackets as valid characters in a query string, and even assigns to them a special purpose. However, it would appear that authors who try to take advantage of this loophole by not percent encoding square brackets are in violation of RFC 3986.