Modify HTML Response (Not Headers)

Charlie · Sep 13, 2014 · Viewed 38.7k times

Hoping someone can help me out or point me in the right direction.

I've been asked to find out how to make Akamai (or any other CDN, or NGINX) modify the actual response body.

Why?

I've been told to make the CDN rewrite every "http://" URL in the response to "https://", instead of modifying the app code to use protocol-relative "//" URLs for external resource requests.

Is this possible?

Anyone know?

Answer

Michael - sqlbot · Sep 14, 2014

This appears to be possible via a number of different approaches, but that's not to say how advisable it might actually be.

It seems potentially problematic (example: what if you rewrite something that shouldn't have been rewritten?) and machine-resource-intensive (a lot of CPU cycles to parse and munge response bodies, repeatedly).

Here's what I found:

Nginx has the http_sub_module that appears to accomplish this in a fairly straightforward way, assuming what you want to replace is simple and you only need to match one pattern per page, like replacing <a href="http://example.com/... with <a href="https://example.com/..., one or more times. This kind of content-mungery seems sketchy but depending on the situation you're in (which may be one of limited control of the application) it might get you there.

It looks like there's also something called http_substitutions_filter, possibly unofficial or at least not part of the core Nginx distribution, that can do more powerful filter-based rewriting of response bodies.

Varnish seems to have a similar capability (possibly a plugin), but HAProxy doesn't, since it only deals in headers and leaves bodies alone except when doing gzip offloading. Other reverse-proxy-capable software that you'd place in front of your application server, like Apache or Squid, might also offer something useful.

My initial impression, in any event, is that simple string replacing may not quite get you there, and even regex-based replacing isn't really sufficient, without significant sophistication in the regexes, because you always run the risk of rewriting something that you shouldn't.
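To make that risk concrete, here is a minimal Python sketch comparing a naive string replacement with a regex scoped to href/src attributes. The sample page, the hostnames in it, and the names `PAGE` and `ATTR_RE` are made up for illustration; this is not how any particular CDN or proxy implements rewriting.

```python
import re

# Hypothetical sample page: one real link, one code sample shown as text,
# one XML namespace URI, and one URL mentioned in prose.
PAGE = (
    '<a href="http://example.com/page">link</a>\n'
    '<pre>&lt;a href="http://insecure.example/"&gt;</pre>\n'
    '<svg xmlns="http://www.w3.org/2000/svg"></svg>\n'
    '<p>Type http://localhost:8080/ into your browser.</p>\n'
)

# Naive fixed-string replacement: touches every occurrence, including the
# namespace URI (an identifier, not a fetchable link) and the prose.
naive = PAGE.replace("http://", "https://")

# Regex restricted to href/src attribute values: leaves the namespace and
# the prose alone, but still rewrites the code sample inside <pre>,
# because a regex cannot tell markup from text that merely looks like markup.
ATTR_RE = re.compile(r'''(\b(?:href|src)\s*=\s*["'])http://''', re.IGNORECASE)
scoped = ATTR_RE.sub(r"\1https://", PAGE)

if __name__ == "__main__":
    print(naive)
    print(scoped)
```

The scoped regex is an improvement, but the `<pre>` line shows the kind of false positive that is hard to rule out without actually understanding the document structure.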

What I would suggest "really needs to happen" in order to accomplish this purpose in the most correct way, would be to actually interpret the generated HTML with a DOM parsing library, traverse the tree, and modify the relevant elements in-place, before handing the revised document to the requester. This way, the document gets modified based on a contextual understanding of its contents.
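As a rough illustration of structure-aware rewriting, the sketch below uses Python's standard-library `html.parser` to touch only URL-bearing attributes on real elements. Note the assumptions: it is an event-based tokenizer rather than a full DOM tree, the `URL_ATTRS` list and the names `SchemeRewriter` and `rewrite_html` are hypothetical, and edge cases such as CDATA sections are ignored. It is not the code from my proxy.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: the (tag, attribute) pairs whose values should be rewritten.
URL_ATTRS = {("a", "href"), ("link", "href"), ("img", "src"), ("script", "src")}


class SchemeRewriter(HTMLParser):
    """Re-serializes HTML, changing http:// to https:// in selected attributes."""

    def __init__(self):
        # Keep character references as-is so text is passed through untouched.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closer):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                parts.append(name)
                continue
            if (tag, name) in URL_ATTRS and value.lower().startswith("http://"):
                value = "https://" + value[len("http://"):]
            # The parser unescapes attribute values, so escape them again.
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite_html(html_text: str) -> str:
    parser = SchemeRewriter()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


if __name__ == "__main__":
    sample = (
        '<a href="http://example.com/page">link</a>'
        '<pre>&lt;a href="http://insecure.example/"&gt;</pre>'
        '<svg xmlns="http://www.w3.org/2000/svg"></svg>'
    )
    print(rewrite_html(sample))
```

Because the rewrite only fires on attributes of parsed elements, the escaped markup inside `<pre>` and the `xmlns` namespace URI are left alone, which is the contextual understanding a string or regex replacement lacks.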

It sounds complicated, in my opinion, because it is, so I would again suggest you reconsider your planned approach unless this is outside your control.

Final thought: Curiosity got the best of me, so I took this question and retrofitted the http reverse proxy I wrote (for a different purpose) so that, based on the content-type, it could actually parse and walk the HTML structure as a proper entity, modifying it in place (as described above), before returning the response body to the requester.

This turns out, as I expected, to be fairly processor-intensive. My test content was 29K of real-world HTML from a live site, containing 56 <a href ...> and 6 <link rel ...> elements. The rewrite operation required 128 ms on a 1 GHz Opteron 1218 and 43 ms on a 2.4 GHz Xeon E5620. These benchmarks are strictly for the additional operations, excluding the (smaller amount of) time required for the actual "proxy" functionality itself. This time cost is not insurmountable, but it could add up to a lot of CPU time. It is far longer than a regular-expression-based content rewrite would take, but the approach is far more precise and unlikely to break the pages it touches.
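To put those timings in perspective, here is a back-of-the-envelope calculation in Python. The per-page costs are the two measurements above; the request rate is a purely hypothetical load, and the estimate assumes every response is rewritten on the fly with no caching of the rewritten output.

```python
# Rewrite cost per page in milliseconds, from the measurements above.
REWRITE_MS = {"Opteron 1218": 128, "Xeon E5620": 43}

# Hypothetical load, chosen only for illustration.
ASSUMED_REQUESTS_PER_SECOND = 100


def pages_per_core_second(ms_per_page: float) -> float:
    """Upper bound on pages one core can rewrite per second."""
    return 1000.0 / ms_per_page


def cores_busy(ms_per_page: float, requests_per_second: float) -> float:
    """Average number of cores kept fully busy by rewriting alone."""
    return ms_per_page / 1000.0 * requests_per_second


if __name__ == "__main__":
    for cpu, ms in REWRITE_MS.items():
        rate = pages_per_core_second(ms)
        busy = cores_busy(ms, ASSUMED_REQUESTS_PER_SECOND)
        print(f"{cpu}: {rate:.1f} pages/s per core, "
              f"{busy:.1f} cores at {ASSUMED_REQUESTS_PER_SECOND} req/s")
```

On the faster machine that works out to roughly 23 pages per second per core, so caching the rewritten output would matter a great deal at any real volume.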