How to replace/escape U+2028 or U+2029 characters in PHP to stop my JSONP API breaking

zuallauz · Jan 6, 2013 · Viewed 16k times

OK, I am running a public JSONP API whose data is served from my PHP server. I just read this article:

Basically, if my JSON strings contain a U+2028 character (Unicode line separator) or a U+2029 character (Unicode paragraph separator), this is still perfectly valid JSON. However, when using JSONP the JSON gets executed as JavaScript, and no string literal in JavaScript can contain a literal U+2028 or U+2029, as it will break the JavaScript. Apparently this is usually not a problem as long as you use a proper JSON parser, but in the case of JSONP the browser is the JSON parser.

Essentially, if these characters were inside strings in the JSONP data being sent to the client, they would throw a line or paragraph break into the string, which would break the JavaScript and stop it executing. This is a real possibility, as the API sends back some client-entered data. Someone could enter a U+2028 or a U+2029 into the database, so when I send that back as JSONP it will break any implementation using my API.

So my question is, in PHP how can I sanitise/output escape the JSON data to remove or escape the U+2028 and U+2029 characters before sending it to the client?

Currently my process does a json_encode on an array of data and sends the result down to the client. Should I escape the data by looping through the array and filtering it, or escape the whole JSON-encoded string at once?

The other thing is I'm not sure how to escape the U+2028 and U+2029 characters in PHP anyway. Can I just do a str_replace? I'm not sure if str_replace is multibyte safe, and there's no mb_str_replace function unless I use some custom-made one. So how do you remove/escape those Unicode characters?

Thanks very much.

Answer

Dietrich Epp · Jan 6, 2013

You can replace U+2028, U+2029 with "\u2028", "\u2029" either on the PHP side or the JavaScript side, or both, it doesn't matter as long as it happens at least once (it's idempotent).

You can just use ordinary string replacement functions. They don't need to be "multibyte safe", and you can do it just as easily in any Unicode encoding (UTF-8, UTF-16, UTF-32 are all equally fine). PHP didn't have Unicode escape sequences last time I checked, which is just one more reason why PHP is a joke, but you can use the \x escape with UTF-8...

(In short, the reason there's no multibyte string replace function is that it would be redundant -- it would be exactly the same as a non-multibyte string replace function.)

// JavaScript (a regex with the g flag is needed; a string pattern only replaces the first match)
data = data.replace(/\u2028/g, "\\u2028").replace(/\u2029/g, "\\u2029");

// PHP
$data = str_replace("\xe2\x80\xa8", '\\u2028', $data);
$data = str_replace("\xe2\x80\xa9", '\\u2029', $data);
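Putting the PHP replacement together with json_encode(), a JSONP response could be built roughly as follows. This is only a sketch: the helper name jsonp_encode(), its $flags parameter and the callback name handleData are made up for illustration, and the callback name is assumed to have been validated elsewhere. It escapes the encoded string once rather than looping through the array.

```php
<?php
// Hypothetical helper (not a built-in): wraps JSON-encoded data in a JSONP callback call.
// Assumes $callback has already been validated as a safe JavaScript identifier.
function jsonp_encode($callback, $data, $flags = 0)
{
    $json = json_encode($data, $flags);
    if ($json === false) {
        throw new RuntimeException('json_encode failed');
    }

    // Swap the raw UTF-8 bytes of U+2028 / U+2029 for their JSON escapes.
    // If json_encode() already escaped them, nothing matches and the string is unchanged.
    $json = str_replace(
        array("\xe2\x80\xa8", "\xe2\x80\xa9"),
        array('\u2028', '\u2029'),
        $json
    );

    return $callback . '(' . $json . ');';
}

// JSON_UNESCAPED_UNICODE keeps the "ë" readable; the line separator is still escaped.
echo jsonp_encode(
    'handleData',
    array('name' => "Zo\xc3\xab", 'comment' => "one\xe2\x80\xa8two"),
    JSON_UNESCAPED_UNICODE
);
// prints: handleData({"name":"Zoë","comment":"one\u2028two"});
```

Because the replacement runs on the finished JSON string, it covers every nested value in one pass and leaves all other characters untouched.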

Or you could just do nothing at all, since PHP escapes non-ASCII characters by default in json_encode():

// Safe
echo json_encode("\xe2\x80\xa9");
--> "\u2029"

// Correct JSON, but invalid Javascript...
// (Well, technically, JSON root must be array or object)
echo json_encode("\xe2\x80\xa9", JSON_UNESCAPED_UNICODE);
--> "<U+2029>"   (a raw, unescaped paragraph separator between the quotes)