bioinformatics Linearize a FASTA sequence with AWK


Reading line by line

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < input.fa

One can read this AWK script as follows:

  • If the current line ($0) starts like a FASTA header (^>), we print a newline if this is not the first sequence (N>0?"\n":""), followed by the line itself ($0) and a tab (\t). We then increment the sequence counter (N++) and skip to the next input line (next;).
  • If the current line ($0) does not start like a FASTA header, it reaches the second block, which has no pattern and therefore matches every line. We print the whole line without a trailing newline, so consecutive sequence lines are joined together.
  • At the end (END), we print a single newline to terminate the last sequence.
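
To see the effect on a concrete input, here is a minimal sketch that runs the same AWK program on a small two-record file. The temporary file and its contents (seq1, seq2) are made up for illustration; only the AWK program comes from the command above, split across lines for readability.

```shell
# Hypothetical sample input: two records, each wrapped over two lines.
tmp=$(mktemp)
printf '>seq1 sample\nACGT\nTTGA\n>seq2\nGGCC\nAA\n' > "$tmp"

# Same AWK program as the one-liner, one rule per line.
result=$(awk '
  /^>/ { printf("%s%s\t", (N > 0 ? "\n" : ""), $0); N++; next }  # header: start a new output line
       { printf("%s", $0) }                                      # sequence line: append, no newline
  END  { printf("\n") }                                          # terminate the last record
' < "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"

# Output (<TAB> stands for a literal tab character):
# >seq1 sample<TAB>ACGTTTGA
# >seq2<TAB>GGCCAA
```

Each record ends up on a single line, with the header and the joined sequence separated by a tab, which makes the result easy to feed to line-oriented tools such as sort, grep or cut.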