I have a column eventDate
which contains trailing spaces. I am trying to remove them with the PostgreSQL function TRIM()
. More specifically, I am running:
SELECT TRIM(both ' ' from eventDate)
FROM EventDates;
However, the trailing spaces don't go away. Furthermore, when I try and trim another character from the date (such as a number), it doesn't trim either. If I'm reading the manual correctly this should work. Any thoughts?
There are many different invisible characters. Many of them have the property WSpace=Y
("whitespace") in Unicode. But some special characters are not considered "whitespace" and still have no visible representation. The excellent Wikipedia articles about space (punctuation) and whitespace characters should give you an idea.
<rant>Unicode sucks in this regard: introducing lots of exotic characters that mainly serve to confuse people.</rant>
The standard SQL trim()
function by default only trims the basic Latin space character (Unicode: U+0020 / ASCII 32). Same with the rtrim()
and ltrim()
variants. Your call also only targets that particular character.
Use regular expressions with regexp_replace()
instead.
To remove all trailing white space (but not white space inside the string):
SELECT regexp_replace(eventdate, '\s+$', '') FROM eventdates;
The regular expression explained:
\s
.. regular expression class shorthand for [[:space:]]
- which is the set of white-space characters - see limitations below
+
.. 1 or more consecutive matches
$
.. end of string
Demo:
SELECT regexp_replace('inner white ', '\s+$', '') || '|'
Returns:
inner white|
Yes, that's a single backslash (\
). Details in this related answer.
To remove all leading white space (but not white space inside the string):
regexp_replace(eventdate, '^\s+', '')
^
.. start of string
To remove both, you can chain above function calls:
regexp_replace(regexp_replace(eventdate, '^\s+', ''), '\s+$', '')
Or you can combine both in a single call with two branches.
Add 'g'
as 4th parameter to replace all matches, not just the first:
regexp_replace(eventdate, '^\s+|\s+$', '', 'g')
But that should typically be faster with substring()
:
substring(eventdate, '\S(?:.*\S)*')
\S
.. everything but white space
(?:
re
)
Non-capturing set of parentheses
.*
.. any string of 0-n characters
Or one of these:
substring(eventdate, '^\s*(.*\S)')
substring(eventdate, '(\S.*\S)')
(
re
)
.. Capturing set of parentheses
Effectively takes the first non-whitespace character and everything up to the last non-whitespace character if available.
There are a few more related characters which are not classified as "whitespace" in Unicode - so not contained in the character class [[:space:]]
.
These print as invisible glyphs in pgAdmin for me: "mongolian vowel", "zero width space", "zero width non-joiner", "zero width joiner":
SELECT E'\u180e', E'\u200B', E'\u200C', E'\u200D';
'' | '' | '' | ''
Two more, printing as visible glyphs in pgAdmin, but invisible in my browser: "word joiner", "zero width non-breaking space":
SELECT E'\u2060', E'\uFEFF';
'' | ''
Ultimately, whether characters are rendered invisible or not also depends on the font used for display.
To remove all of these as well, replace '\s'
with '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]'
or '[\s]'
(note trailing invisible characters!).
Example, instead of:
regexp_replace(eventdate, '\s+$', '')
use:
regexp_replace(eventdate, '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]+$', '')
or:
regexp_replace(eventdate, '[\s]+$', '') -- note invisible characters
There is also the Posix character class [[:graph:]]
supposed to represent "visible characters". Example:
substring(eventdate, '([[:graph:]].*[[:graph:]])')
It works reliably for ASCII characters in every setup (where it boils down to [\x21-\x7E]
), but beyond that you currently (incl. pg 10) depend on information provided by the underlying OS (to define ctype
) and possibly locale settings.
Strictly speaking, that's the case for every reference to a character class, but there seems to be more disagreement with the less commonly used ones like graph. But you may have to add more characters to the character class [[:space:]]
(shorthand \s
) to catch all whitespace characters. Like: \u2007
, \u202f
and \u00a0
seem to also be missing for @XiCoN JFS.
Within a bracket expression, the name of a character class enclosed in
[:
and:]
stands for the list of all characters belonging to that class. Standard character class names are:alnum
,alpha
,blank
,cntrl
,digit
,graph
,lower
,punct
,space
,upper
,xdigit
. These stand for the character classes defined in ctype. A locale can provide others.
Bold emphasis mine.
Also note this limitation that was fixed with Postgres 10:
Fix regular expressions' character class handling for large character codes, particularly Unicode characters above
U+7FF
(Tom Lane)Previously, such characters were never recognized as belonging to locale-dependent character classes such as
[[:alpha:]]
.