Nginx location match regex for special characters and encoded url characters

MitchellK · Aug 8, 2018 · Viewed 9.7k times

I've been trying so many things today and I am just not winning. One file on my site was accidentally created with a special character in its name. As a result, Googlebot has not crawled the site for 3 weeks now, and Webmaster Tools / Search Console keeps notifying me and asking to retest the URL.

All I want is to configure Nginx to match the following requests and redirect them to the correct location, but regex has me stumped on this one.

The unencoded URL string is:

/historical-rainfall-trends-south-africa-1921–2015.pdf

The encoded URL string is:

/historical-rainfall-trends-south-africa-1921%C3%A2%E2%82%AC%E2%80%9C2015.pdf
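
For reference, the percent-encoded sequence does not look like a direct encoding of the dash. A direct UTF-8 encoding of an en dash would be %E2%80%93; the sequence above looks like those three bytes misread as Windows-1252 and then UTF-8 encoded a second time. A minimal Python sketch to check this, assuming the dash in the filename is an en dash (U+2013); the variable names are illustrative only:

```python
from urllib.parse import quote, unquote

encoded = "/historical-rainfall-trends-south-africa-1921%C3%A2%E2%82%AC%E2%80%9C2015.pdf"

# Percent-decode and read the bytes as UTF-8: the dash shows up as three characters.
decoded = unquote(encoded)

# Undo one layer of mojibake: map the characters back to Windows-1252 bytes,
# then read those bytes as UTF-8.
repaired = decoded.encode("cp1252").decode("utf-8")

# What a single, direct UTF-8 percent-encoding of an en dash looks like.
direct = quote("\u2013")

print(repr(decoded))
print(repr(repaired))
print(direct)
```

If the assumption holds, `repaired` is the unencoded URL shown earlier, which means any rule has to cope with both the single dash and the three-character garbled form.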

How can I get a location match for these?

UPDATE:

Still losing my mind, nothing I have tried is working. I get a match with this regex here - https://regex101.com/r/3Lk2zr/3

but then using this

location ~ /.*[^\x00-\x7F]+.* { return 444; }

still gives me a 404 and not a 444

Likewise, I get a match with this - https://regex101.com/r/80KWJ8/1 - but then

location ~ /.*([^?]*)\%(.*)$ { return 444; }

Gives 404 and not 444 😭

I also tried this, but it still doesn't work. Sourced from: https://serverfault.com/questions/656096/rewriting-ascii-percent-encoded-locations-to-their-utf-8-encoded-equivalent

location ~* (*UTF8).*([^?]*)\%(.*)$ { return 444; }

location ~* (*UTF8).*[^\x00-\x7F]+.* { return 444; }

Temporary Solution

Thanks to @funilrys and also to this question: "How do I redirect all requests that contains a certain string to 404 in nginx?"

This works now 100%

location /resources {
    expires 3h;
    add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=10800';

    location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
        expires 3h;
        add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=10800';
    }

    location ~* \.(pdf)$ {
        expires 30d;
        add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=2592000';

        # Redirect any PDF request whose raw URI contains a percent sign
        if ($request_uri ~ .*%.*) {
            return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
        }

        # Redirect any PDF request whose raw URI contains a non-ASCII byte
        if ($request_uri ~ .*[^\x00-\x7F]+.*) {
            return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
        }
    }
}
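
A likely reason the `if` checks work where the earlier `location` patterns containing `\%` did not: per the nginx documentation, `location` matching runs against the normalized URI, after `%XX` sequences have been decoded, while `$request_uri` holds the original, undecoded request URI. The Python sketch below only imitates that decoding step with `urllib.parse`; it is not how nginx is implemented, and the variable names are made up for illustration:

```python
import re
from urllib.parse import unquote_to_bytes

# Roughly what $request_uri would hold: the URI exactly as the client sent it.
raw = b"/historical-rainfall-trends-south-africa-1921%C3%A2%E2%82%AC%E2%80%9C2015.pdf"

# Roughly what a location regex is tested against: the percent-decoded bytes.
normalized = unquote_to_bytes(raw)

percent = re.compile(rb"%")
non_ascii = re.compile(rb"[^\x00-\x7F]+")

raw_has_percent = percent.search(raw) is not None
normalized_has_percent = percent.search(normalized) is not None
normalized_has_non_ascii = non_ascii.search(normalized) is not None

print(raw_has_percent, normalized_has_percent, normalized_has_non_ascii)
```

In this imitation the literal `%` exists only in the raw form, so a pattern that looks for `%` can succeed against `$request_uri` but has nothing to match once the URI is decoded. It does not explain why the `[^\x00-\x7F]` location returned a 404; that may depend on other locations in the config that are not shown here.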

Answer

miknik · Aug 9, 2018

Your solution is terrible, let me tell you why.

Every single request which matches this location block now has to be evaluated against two if conditions before being served.

Any request which matches then gets redirected to the correct URL, which also matches this location block, so your server now has to evaluate those two if conditions all over again.

Just for fun you are also making Nginx evaluate requests for image, css and js files against your if conditions too. None of them will match as you are worried about a pdf, but you are still adding an extra 200% overhead to the request processing.

A much more Nginx friendly solution is actually very simple.

Nginx checks regex locations in the order they are listed in your config and uses the first one that matches, so if this file's URL also matches any of your other regex locations, you need to place this block above them:

location ~* /historical-rainfall-trends-south-africa-1921([^_])*?2015\.pdf$ {
    return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
}

Just tested it on one of my servers running Nginx 1.15.1, works a charm.
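
As a quick sanity check of the pattern above outside Nginx, here is a small Python sketch. It assumes Python's `re` treats this particular pattern the same way as the PCRE engine Nginx uses, approximates Nginx's URI decoding with `unquote_to_bytes`, and assumes the dash in the filename is an en dash; the `matches` helper is hypothetical:

```python
import re
from urllib.parse import unquote_to_bytes

# Same pattern as in the location block; re.IGNORECASE stands in for "~*".
pattern = re.compile(
    rb"/historical-rainfall-trends-south-africa-1921([^_])*?2015\.pdf$",
    re.IGNORECASE,
)

def matches(path: bytes) -> bool:
    """Percent-decode the path first, then test it against the pattern."""
    return pattern.search(unquote_to_bytes(path)) is not None

literal = "/historical-rainfall-trends-south-africa-1921\u20132015.pdf".encode("utf-8")
encoded = b"/historical-rainfall-trends-south-africa-1921%C3%A2%E2%82%AC%E2%80%9C2015.pdf"
target = b"/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf"

for path in (literal, encoded, target):
    print(path, matches(path))
```

Both broken forms match, while the redirect target does not, because it has `_1921` where the pattern requires `-1921`. That is what keeps the 301 from looping back into the same block.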