Sequence length of FASTA file

cucurbit picture cucurbit · Jun 2, 2014 · Viewed 18.5k times · Source

I have the following FASTA file:

>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT

My desired output:

>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.

This is my code:

awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa

The output I get with this code is:

>header1
60
57
>header2
3
>header3
7

I need a small modification in order to deal with multiple sequence lines.

I also need a way to have the total sequences and total length. Any suggestion will be welcome... In bash or awk, please. I know that is easy to do it in Perl/BioPerl and actually, I have a script to do it in those ways.

Answer

Juan Diego Godoy Robles picture Juan Diego Godoy Robles · Jun 2, 2014

An awk / gawk solution can be composed by three stages:

  1. Every time header is found these actions should be performed:

    • Print previous seqlen if exists.
    • Print tag.
    • Initialize seqlen.
  2. For the sequence lines we just need to accumulate totals.
  3. Finally at the END stage we print the remnant seqlen.

Commented code:

awk '/^>/ { # header pattern detected
        if (seqlen){
         # print previous seqlen if exists 
         print seqlen
         }

         # pring the tag 
         print

         # initialize sequence
         seqlen = 0

         # skip further processing
         next
      }

# accumulate sequence length
{
seqlen += length($0)
}
# remnant seqlen if exists
END{if(seqlen){print seqlen}}' file.fa

A oneliner:

awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa

For the totals:

awk '/^>/ { if (seqlen) {
              print seqlen
              }
            print

            seqtotal+=seqlen
            seqlen=0
            seq+=1
            next
            }
    {
    seqlen += length($0)
    }     
    END{print seqlen
        print seq" sequences, total length " seqtotal+seqlen
    }' file.fa