Sort a text file by line length including spaces

gnarbarian picture gnarbarian · May 7, 2011 · Viewed 84.3k times · Source

I have a CSV file that looks like this

AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st.                                        110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56

I need to sort it by line length including spaces. The following command doesn't include spaces, is there a way to modify it so it will work for me?

cat $@ | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'

Answer

neillb picture neillb · May 7, 2011

Answer

cat testfile | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-

Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:

cat testfile | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-

In both cases, we have solved your stated problem by moving away from awk for your final cut.

Lines of matching length - what to do in the case of a tie:

The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.

(Those who want more control of sorting these ties might look at sort's --key option.)

Why the question's attempted solution fails (awk line-rebuilding):

It is interesting to note the difference between:

echo "hello   awk   world" | awk '{print}'
echo "hello   awk   world" | awk '{$1="hello"; print}'

They yield respectively

hello   awk   world
hello awk world

The relevant section of (gawk's) manual only mentions as an aside that awk is going to rebuild the whole of $0 (based on the separator, etc) when you change one field. I guess it's not crazy behaviour. It has this:

"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"

 $1 = $1   # force record to be reconstituted
 print $0  # or whatever else with $0

"This forces awk to rebuild the record."

Test input including some lines of equal length:

aa A line   with     MORE    spaces
bb The very longest line in the file
ccb
9   dd equal len.  Orig pos = 1
500 dd equal len.  Orig pos = 2
ccz
cca
ee A line with  some       spaces
1   dd equal len.  Orig pos = 3
ff
5   dd equal len.  Orig pos = 4
g