How to use regex with cut at the command line?

makansij · Apr 9, 2017 · Viewed 8k times

I have some output like this from ls -alth:

drwxr-xr-x    5 root    admin   170B Aug  3  2016 ..
drwxr-xr-x    5 root    admin    70B Aug  3  2016 ..
drwxr-xr-x    5 root    admin     3B Aug  3  2016 ..
drwxr-xr-x    5 root    admin     9M Aug  3  2016 ..

Now, I want to parse out the 170B part, which is obviously the size in human readable format. I wanted to do this using cut or sed, because I don't want to use tools that are any more complicated/difficult to use than necessary.

Ideally I want it to be robust enough to handle the B, M or K suffix that comes with the size, and multiply by 1, 1000000 or 1000, respectively. I haven't found a good way to do that, though.

I've tried a few things without really knowing the best approach:

ls -alth | cut -f 5 -d \s+

I was hoping that would work because I'd be able to just delimit it on one or more spaces.

But that doesn't work. How do I supply cut with a regex delimiter? Or is there an easier way to extract only the size of the file from ls -alth?

I'm using CentOS 6.4.

Answer

mklement0 · Apr 10, 2017

This answer tackles the question as asked, but consider George Vasiliou's helpful find solution as a potentially superior alternative.

  • cut only supports a single, literal character as the delimiter (-d), so it isn't the right tool to use.

  • For extracting tokens (fields) that are separated with a variable amount of whitespace per line, awk is the best tool, so the solution proposed by George Vasiliou is the simplest one:
    ls -alth | awk '{print $5}'
    extracts the 5th whitespace-separated field ($5), which is the size.

  • Rather than using -h first and then converting the human-readable suffixes (such as B, M, and G) back to byte counts (incidentally, the multipliers are powers of 1024, not 1000), simply omit -h from the ls command; ls -l outputs raw byte counts by default:
    ls -alt | awk '{print $5}'
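
If the -h output is all you have, the suffix conversion the question asks about could be sketched roughly as follows. This is only an illustration, not part of the original answer: to_bytes is a hypothetical helper name, it assumes one size per input line, and it assumes the suffixes B, K, M and G stand for powers of 1024. A size with no suffix is passed through as a plain byte count.

```shell
# Hypothetical helper (not from the original answer): convert sizes as
# printed by ls -h (e.g. 170B, 9M, 1.5K) to byte counts.
# Assumption: suffixes are powers of 1024; one size per input line.
to_bytes() {
  awk '
    BEGIN {
      mult["B"] = 1
      mult["K"] = 1024
      mult["M"] = 1024 * 1024
      mult["G"] = 1024 * 1024 * 1024
    }
    NF == 0 { next }                          # skip empty lines
    {
      suffix = substr($1, length($1))         # last character of the size
      if (suffix in mult)
        printf "%.0f\n", substr($1, 1, length($1) - 1) * mult[suffix]
      else
        printf "%.0f\n", $1                   # no known suffix: already bytes
    }
  '
}

# Demo on a sample line taken from the question: field 5 is the size.
printf '%s\n' 'drwxr-xr-x    5 root    admin     9M Aug  3  2016 ..' |
  awk '{ print $5 }' | to_bytes               # prints 9437184
```

Against real output, something like ls -alh | awk 'NR > 1 {print $5}' | to_bytes should work, assuming the first line is the "total" summary that ls -l prints for a directory. Omitting -h, as described above, remains the simpler route.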