Sorry for this stupid question. I searched, but I'm not confident that I found the right answer: is the default separator for awk only a space?
Here's a pragmatic summary that applies to all major Awk implementations:
* GNU Awk (`gawk`) - the default `awk` in some Linux distros
* Mawk (`mawk`) - the default `awk` in some Linux distros (e.g., earlier versions of Ubuntu; crysman reports that version 19.04 now comes with GNU Awk - see his comment below)
* BSD Awk - `awk` on BSD-like platforms, including OSX
On Linux, `awk -W version` will tell you which implementation the default `awk` is.
BSD Awk only understands `awk --version` (which GNU Awk understands in addition to `awk -W version`).
Recent versions of all these implementations follow the POSIX standard with respect to field separators[1] (but not record separators).
Glossary:
* `RS` is the input-record separator, which describes how the input is broken into records:
  * The default value is a newline, referred to as `\n` below; that is, input is broken into lines by default.
  * On `awk`'s command line, `RS` can be specified as `-v RS=<sep>`.
  * POSIX restricts `RS` to a literal, single-character value, but GNU Awk and Mawk support multi-character values that may be extended regular expressions (BSD Awk does not support that).
* `FS` is the input-field separator, which describes how each record is split into fields; it may be an extended regular expression.
  * On `awk`'s command line, `FS` can be specified as `-F <sep>` (or `-v FS=<sep>`).
  * The default value is a single space (`0x20`); however, that space is not interpreted literally as the (only) separator. It has special meaning; see below.
By default:
* any run of spaces and/or tabs and/or newlines is treated as a field separator,
* with leading and trailing runs ignored.
The POSIX spec uses the abstraction `<blank>` for spaces and tabs. Both belong to it in every locale, but specific locales could add further characters - I don't know if any such locales exist.
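As a quick check of the default behavior, the following sketch feeds one line with mixed spaces and tabs through `awk` without setting `FS`. The sample input and the variable name `default_fields` are made up for illustration; any POSIX-like `awk` should report three fields.

```shell
# Default FS: a run of spaces and/or tabs counts as ONE separator,
# and leading/trailing runs are ignored.
# (The sample line and the variable name are illustrative only.)
default_fields=$(printf '  a \t  b   c  \n' |
  awk '{ printf "%d:<%s><%s><%s>", NF, $1, $2, $3 }')
echo "$default_fields"   # expected: 3:<a><b><c>
```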
Note that with the default input-record separator (`RS`), `\n`, newlines typically do not enter the picture as field separators, because no record itself contains `\n` in that case.
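To see why, here is a small sketch with made-up input that prints the record number and field count of each record. With the default `RS`, every line becomes its own record, so the newline is consumed as the record boundary before any field splitting happens. The variable name `per_record` is hypothetical.

```shell
# Default RS: each input line is one record, so no record contains \n.
# NR is the record number, NF the number of fields in that record.
per_record=$(printf 'a b\nc d e\n' |
  awk '{ printf "%s%d:%d", (NR > 1 ? " " : ""), NR, NF }')
echo "$per_record"   # expected: 1:2 2:3
```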
Newlines as field separators do come into play, however:
* When `RS` is set to a value that results in records themselves containing `\n` instances (such as when `RS` is set to the empty string; see below).
* When the `split()` function is used to split a string into array elements without an explicit field-separator argument.
  * Even though input records won't contain `\n` instances when the default `RS` is in effect, the `split()` function, when invoked without an explicit field-separator argument on a multi-line string from a different source (e.g., a variable passed via the `-v` option or as a pseudo-filename), always treats `\n` as a field separator.
Important NON-default considerations:
* Assigning the empty string to `RS` has special meaning: it reads the input in paragraph mode, meaning that each run of non-empty lines forms one record, with runs of empty lines acting as record separators and leading and trailing runs of empty lines ignored.
* When you assign anything other than a literal space to `FS`, the interpretation of `FS` changes fundamentally:
  * A single character, or each character matched by a bracket expression, is recognized individually as a field separator, not in runs as with the default.
    * For instance, setting `FS` to `[ ]` - even though it effectively amounts to a single space - causes every individual space instance in each record to be treated as a field separator.
    * To recognize runs, the regex quantifier `+` must be used; e.g., `[\t]+` would recognize runs of tabs as a single separator.
  * Setting `FS` to the empty string means that each character of a record is its own field.
* When `RS` is set to the empty string (paragraph mode), newlines (`\n`) are also considered field separators, irrespective of the value of `FS`.
[1] Unfortunately, GNU Awk up to at least version 4.1.3 complies with an obsolete POSIX standard with respect to field separators when you use the option to enforce POSIX compliance, `-P` (`--posix`): with that option in effect and `RS` set to a non-empty value, newlines (`\n` instances) are NOT recognized as field separators. The GNU Awk manual spells out the obsolete behavior (but neglects to mention that it doesn't apply when `RS` is set to the empty string). The POSIX standard changed in 2008 (see comments) to also consider newlines field separators when `FS` has its default value - as GNU Awk has always done without `-P` (`--posix`).
Here are two commands that verify the behavior described above:
* With `-P` in effect and `RS` set to the empty string, `\n` is still treated as a field separator:
gawk -P -F' ' -v RS='' '{ printf "<%s>, <%s>\n", $1, $2 }' <<< $'a\nb'
* With `-P` in effect and a non-empty `RS`, `\n` is NOT treated as a field separator - this is the obsolete behavior:
gawk -P -F' ' -v RS='|' '{ printf "<%s>, <%s>\n", $1, $2 }' <<< $'a\nb'
A fix is coming, according to the GNU Awk maintainers; expect it in version 4.2 (no time frame given).
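For comparison, the sketch below runs the second command without `-P`, where the newline inside the record is expected to count as a field separator again. It uses `printf` and a pipe instead of the Bash-specific here-string so that it also runs in a plain POSIX shell, and it omits `-F' '` because that is the default anyway. The variable name `without_posix_mode` is made up for illustration.

```shell
# Non-empty RS ('|') and default FS, WITHOUT -P: the whole input "a\nb\n"
# is one record, and the embedded newline acts as a field separator.
without_posix_mode=$(printf 'a\nb\n' |
  awk -v RS='|' '{ printf "<%s>, <%s>\n", $1, $2 }')
echo "$without_posix_mode"   # expected: <a>, <b>
```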
(Tip of the hat to @JohnKugelman and @EdMorton for their help.)
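As a closing illustration of the non-default rules and the `split()` rule described earlier, here is a minimal sketch. The sample data and all variable names (`single`, `runs`, `para`, `parts`, `s`, `arr`) are made up; the expected values in the comments assume a POSIX-compliant `awk` and are not taken from the answer itself.

```shell
# FS='[ ]': every single space separates, so "a  b" (two spaces)
# yields an empty field between a and b.
single=$(printf 'a  b\n' | awk -F'[ ]' '{ print NF }')

# FS='[ ]+': the + quantifier makes a run of spaces one separator.
runs=$(printf 'a  b\n' | awk -F'[ ]+' '{ print NF }')

# RS='' (paragraph mode): runs of empty lines separate records, and
# leading/trailing empty lines are ignored. With the default FS, the
# newline inside the first record also separates fields.
para=$(printf '\n\na b\nc\n\n\nd\n\n' |
  awk -v RS='' '{ printf "%s%d:%d", (NR > 1 ? " " : ""), NR, NF }')

# split() without a separator argument uses FS; with the default FS,
# the newline that -v creates from the \n escape separates elements.
parts=$(awk -v s='a b\nc' 'BEGIN { print split(s, arr) }')

echo "single=$single runs=$runs para=$para parts=$parts"
# expected: single=3 runs=2 para=1:3 2:1 parts=3
```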