I've been reading up on the RFC 4627 specification, and I've come to the following interpretation:
When advertising a payload with the application/json MIME type,
there must be no BOMs at the beginning of properly encoded JSON streams (based on section "3. Encoding"), and
a media type with parameters, such as application/json; charset=utf-8, does not conform to RFC 4627 (based on section "6. IANA Considerations").
Are these correct deductions? Will I run into problems when implementing web services or web clients that adhere to this interpretation? Should I file bugs against web browsers that violate the two properties above?
Implementations MUST NOT add a byte order mark to the beginning of a JSON text.
This is put as clearly as it can be. This is the only "MUST NOT" in the entire RFC.
The MIME media type for JSON text is application/json.
Type name: application
Subtype name: json
Required parameters: n/a
Optional parameters: n/a
[...]
Note: No "charset" parameter is defined for this registration.
The only valid encodings of JSON are UTF-8, UTF-16 and UTF-32. Since the first character (or the first two, if there is more than one character) always has a Unicode value below 128 (no valid JSON text can have higher values in its first two characters), it is always possible to tell which of the valid encodings and which endianness was used just by looking at the byte stream.
The JSON RFC says that the first two characters will always be below 128 and you should check the first 4 bytes.
I would put it differently: since a string "1" is also valid JSON there is no guarantee that you have two characters at all - let alone 4 bytes.
My recommendation of determining the JSON encoding would be slightly different:
Fast method:
With fewer than 4 bytes (short texts such as {}, [] or "", or "x", [1] etc.) the RFC's 4-byte patterns cannot be applied as written.
With 4 bytes or more, the patterns from the RFC work:
00 00 00 xx - it's UTF-32BE
00 xx 00 xx - it's UTF-16BE
xx 00 00 00 - it's UTF-32LE
xx 00 xx 00 - it's UTF-16LE
xx xx xx xx - it's UTF-8
However, this only works if the input is indeed a valid string in one of those encodings, which it may not be. Moreover, even if you have a valid string in one of the 5 valid encodings, it may still not be valid JSON.
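The 4-byte patterns above can be sketched roughly as follows. This is only an illustration in Python: the function name detect_json_encoding is made up for this example, and it assumes at least 4 bytes of input and no BOM.

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding of a JSON text from the NUL pattern of its first 4 bytes.

    Assumption for this sketch: at least 4 bytes are available and no BOM
    is present. It does NOT check that the bytes are validly encoded or
    that they form valid JSON.
    """
    if len(data) < 4:
        raise ValueError("need at least 4 bytes for the 4-byte patterns")

    # True where the byte is NUL, e.g. (True, True, True, False) for 00 00 00 xx
    nul = tuple(b == 0 for b in data[:4])

    patterns = {
        (True, True, True, False): "utf-32-be",    # 00 00 00 xx
        (True, False, True, False): "utf-16-be",   # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",    # xx 00 00 00
        (False, True, False, True): "utf-16-le",   # xx 00 xx 00
        (False, False, False, False): "utf-8",     # xx xx xx xx
    }
    if nul not in patterns:
        raise ValueError("byte pattern matches none of the five encodings")
    return patterns[nul]
```

The returned name is only a guess based on where the NUL bytes sit; the input still has to be decoded and parsed before it can be trusted.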
My recommendation would be to have a slightly more rigid verification than the one included in the RFC, to verify that you have a valid encoding of UTF-8, UTF-16 or UTF-32 (LE or BE) and a valid JSON text.
Looking only for NUL bytes is not enough.
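A more rigid check could look something like the sketch below: reject a BOM, decode strictly in each of the five encodings, and accept only a decoding that also parses as JSON. The name parse_json_strict is hypothetical, and note that Python's json.loads is itself slightly more lenient than the RFC (it accepts NaN and Infinity, for example), so treat this as an approximation.

```python
import codecs
import json

CANDIDATES = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
)


def parse_json_strict(data: bytes):
    """Return (encoding, parsed_value), or raise ValueError.

    The input must be validly encoded in one of the five encodings AND
    be parseable as JSON in that same encoding.
    """
    if data.startswith(BOMS):
        raise ValueError("JSON text must not start with a BOM")

    for enc in CANDIDATES:
        try:
            text = data.decode(enc, errors="strict")
        except UnicodeDecodeError:
            continue  # not a valid byte sequence in this encoding
        try:
            return enc, json.loads(text)
        except json.JSONDecodeError:
            continue  # decodes, but is not JSON in this encoding
    raise ValueError("not valid JSON in UTF-8, UTF-16 or UTF-32")
```

Trying every encoding is slower than looking at NUL positions, but it does not rely on the input being well-formed in the first place.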
That having been said, at no point do you need any BOM characters to determine the encoding, nor do you need a MIME charset - both are unnecessary and not valid in JSON.
You only have to use the binary content-transfer-encoding when using UTF-16 and UTF-32 because those may contain NUL bytes. UTF-8 doesn't have that problem and 8bit content-transfer-encoding is fine as it doesn't contain NUL in the string (though it still contains bytes >= 128 so 7-bit transfer will not work - there is UTF-7 that would work for such a transfer but it wouldn't be valid JSON, as it is not one of the only valid JSON encodings).
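To make the point about NUL bytes concrete, here is a small check in Python. The sample text and the helper names has_nul and has_high_bytes are just for illustration.

```python
# A JSON text containing one non-ASCII character (U+00EB).
text = '{"name": "Zo\u00eb"}'

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")


def has_nul(data: bytes) -> bool:
    """True if the byte string contains a NUL byte."""
    return b"\x00" in data


def has_high_bytes(data: bytes) -> bool:
    """True if any byte is >= 128, i.e. the data is not 7-bit clean."""
    return any(b >= 0x80 for b in data)


summary = {
    "utf-8": (has_nul(utf8), has_high_bytes(utf8)),
    "utf-16-le": (has_nul(utf16), has_high_bytes(utf16)),
    "utf-32-le": (has_nul(utf32), has_high_bytes(utf32)),
}
```

Each entry in summary is a pair (contains NUL, contains bytes >= 128): the UTF-8 form has no NUL bytes but is not 7-bit clean, while the UTF-16 and UTF-32 forms contain NUL bytes.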
See also this answer for more details.
Are these correct deductions?
Yes.
Will I run into problems when implementing web services or web clients that adhere to this interpretation?
Possibly, if you interact with incorrect implementations. Your implementation MAY ignore the BOM for the sake of interoperability with incorrect implementations - see RFC 7159, Section 8.1:
In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
Also, ignoring the MIME charset is the expected behavior of compliant JSON implementations - see RFC 7159, Section 11:
Note: No "charset" parameter is defined for this registration. Adding one really has no effect on compliant recipients.
I am not personally convinced that silently accepting incorrect JSON streams is always desirable. If you decide to accept input with a BOM and/or a MIME charset, you will have to answer some hard questions.
Having the encoding defined in three independent places (in the JSON text itself, in the BOM and in the MIME charset) makes one question inevitable: what to do if they disagree. Unless you reject such input, there is no one obvious answer.
For example, if you have code that verifies a JSON string to see whether it is safe to eval in JavaScript, it might be misled by the MIME charset or the BOM, treat the input as a different encoding than it actually is, and miss strings that it would have caught with the correct encoding. (A similar problem with HTML has led to XSS attacks in the past.)
You have to be prepared for all of those possibilities whenever you decide to accept incorrect JSON strings with multiple and possibly conflicting encoding indicators. It's not to say that you should never do that because you may need to consume input generated by incorrect implementations. I'm just saying that you need to thoroughly consider the implications.
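One way to make such a policy explicit is sketched below. This is only one possible policy, not something the RFC prescribes: the names split_bom and resolve_indicators are hypothetical, strict mode rejects any BOM or charset outright, and lenient mode tolerates them only as long as they do not contradict each other.

```python
import codecs

# Longer BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def split_bom(body: bytes):
    """Return (encoding named by the BOM or None, body without the BOM)."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name, body[len(bom):]
    return None, body


def resolve_indicators(body: bytes, charset=None, strict=True):
    """Return (encoding_hint_or_None, payload), or raise ValueError.

    strict=True : any BOM or charset parameter is rejected.
    strict=False: both are tolerated, but a disagreement is rejected.
    The hint is never proof; the payload must still be verified itself.
    """
    bom_enc, payload = split_bom(body)
    if strict:
        if bom_enc is not None:
            raise ValueError("BOM present")
        if charset is not None:
            raise ValueError("charset parameter present")
        return None, body

    # Normalise the declared name, e.g. "UTF-8" -> "utf-8".
    declared = codecs.lookup(charset).name if charset else None
    # A declared "utf-16" is treated as compatible with either endianness.
    if bom_enc and declared and not bom_enc.startswith(declared):
        raise ValueError(f"BOM says {bom_enc}, charset says {declared}")
    return bom_enc or declared, payload
```

Whatever hint comes back in lenient mode can still disagree with the actual bytes of the payload, so it should be checked against the content rather than trusted.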
Should I file bugs against web browsers that violate the two properties above?
Certainly - if they call it JSON and the implementation doesn't conform to the JSON RFC then it is a bug and should be reported as such.
Have you found any specific implementations that don't conform to the JSON specification and yet claim to?