I've been trying so many things today and I am just not winning. One file on my site was created by accident with a special character in its name. As a result, Googlebot stopped crawling the site 3 weeks ago, and Webmaster Tools / Search Console keeps notifying me and asking to retest the URL.
All I want is to configure Nginx to match the following requests and redirect them to the correct location, but the regex has me stumped on this one.
The unencoded URL string is:
/historical-rainfall-trends-south-africa-1921–2015.pdf
The encoded URL string is:
/historical-rainfall-trends-south-africa-1921%C3%A2%E2%82%AC%E2%80%9C2015.pdf
How can I get a location match for these?
UPDATE:
Still losing my mind, nothing I have tried is working. I get a match with this regex here - https://regex101.com/r/3Lk2zr/3
but then using this
location ~ /.*[^\x00-\x7F]+.* {
return 444;
}
still gives me a 404 and not a 444
Likewise I get a match with this - https://regex101.com/r/80KWJ8/1 But then
location ~ /.*([^?]*)\%(.*)$ {
return 444;
}
Gives 404 and not 444 😭
Also tried this, but it still doesn't work. Sourced from: https://serverfault.com/questions/656096/rewriting-ascii-percent-encoded-locations-to-their-utf-8-encoded-equivalent
location ~* (*UTF8).*([^?]*)\%(.*)$ {
return 444;
}
location ~* (*UTF8).*[^\x00-\x7F]+.* {
return 444;
}
Temporary Solution
Thanks to @funilrys and also this question: "How do I redirect all requests that contains a certain string to 404 in nginx?"
This now works 100%:
location /resources {
    expires 3h;
    add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=10800';
    location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
        expires 3h;
        add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=10800';
    }
    location ~* \.(pdf)$ {
        expires 30d;
        add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=2592000';
        # Redirect if the raw request URI contains a percent sign
        if ($request_uri ~ .*%.*) { return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf; }
        # Redirect if the raw request URI contains a non-ASCII byte
        if ($request_uri ~ .*[^\x00-\x7F]+.*) { return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf; }
    }
}
Your solution is terrible, let me tell you why.
Every single request which matches this location block now has to be evaluated against two if conditions before being served.
Any request that matches is then redirected to the correct URL, which also matches this location block, so your server evaluates both if conditions a second time.
Just for fun, wherever those if conditions sit in a block that also serves image, CSS and JS files, Nginx evaluates those requests against them too. None of them will match, as you are only worried about one PDF, but every such request still pays for two extra regex checks.
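For context on why the if conditions matched when the earlier `\%` location attempts did not: Nginx tests location patterns against the normalized URI, after decoding `%XX` sequences, while `$request_uri` keeps the original, still-encoded request URI. The fragment below is only a sketch to show that difference, not a recommendation. The `server` block and `server_name` are assumptions for illustration and are not part of the asker's config.

```nginx
# Sketch only: hypothetical server block, not the asker's full config.
server {
    server_name example.com;

    # Location regexes see the decoded path, so this only matches a path
    # that contains a literal "%" (sent by the client as "%25").
    location ~ % {
        return 444;
    }

    # $request_uri is the original request URI, still percent-encoded,
    # so the raw "%C3%A2..." sequence is visible to this check.
    if ($request_uri ~* "%C3%A2%E2%82%AC%E2%80%9C") {
        return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
    }
}
```

This is why a location pattern for this file has to describe the decoded bytes between `1921` and `2015` rather than the percent signs.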
A much more Nginx friendly solution is actually very simple.
Nginx checks regex locations in the order they are listed in your config and uses the first one that matches. So if this file's URL would also match any of your other regex locations, you need to place this block above them:
location ~* /historical-rainfall-trends-south-africa-1921([^_])*?2015\.pdf$ {
    return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
}
Just tested it on one of my servers running Nginx 1.15.1, works a charm.
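To make the ordering point concrete, here is a minimal sketch of how the block could sit relative to a generic PDF rule. It assumes both regex locations are declared at the same level of one `server` block; the `server` wrapper and `server_name` are illustrative and not taken from the original config.

```nginx
# Sketch only: assumes both regex locations are siblings in the same server block.
server {
    server_name example.com;

    # Specific redirect listed first, so it is tested before the generic PDF rule.
    # The redirect target uses "_1921_2015", which this pattern does not match,
    # so the redirected request is not caught again.
    location ~* /historical-rainfall-trends-south-africa-1921([^_])*?2015\.pdf$ {
        return 301 https://example.com/resources/weather-documents/historical-rainfall-trends-south-africa_1921_2015.pdf;
    }

    # Generic PDF caching rule, reached only when the block above does not match.
    location ~* \.(pdf)$ {
        expires 30d;
        add_header Cache-Control 'must-revalidate, proxy-revalidate, max-age=2592000';
    }
}
```

With this layout, ordinary PDF requests are tested against one extra regex and nothing else, and no if conditions are involved.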