Comparing FASTA and FASTQ Sequence Formats

FASTA and FASTQ are text-based formats for storing nucleotide sequences; FASTA is also used for amino-acid (i.e., protein) sequences. The two formats differ in why and how they are used.

FASTA Format

FASTA files are used for pretty much any type of nucleotide or protein sequence including:

FASTA format structure

A FASTA file contains one or more FASTA entries, each of which has two parts:

The next time a line starts with a >, that indicates the start of a new sequence.
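
The rule above (a line starting with > opens a new entry, and everything up to the next > belongs to it) can be sketched in a few lines of Python. This is a minimal illustration rather than a replacement for an established parser. The function names read_fasta and _make_entry are hypothetical, and the sketch assumes the identifier is everything on the header line up to the first whitespace.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA entry."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new '>' line closes the previous entry, if there was one.
            if header is not None:
                yield _make_entry(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if header is not None:
        yield _make_entry(header, chunks)


def _make_entry(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # Assumption: identifier = text before the first whitespace.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


example = """>Sequence1 potential description after the space
ATGCCGTTAAGC
GATCGGCTAGCT
>Sequence2 Forward primer for actin
AGTATTACGATGCTAGCTG
"""

for identifier, description, sequence in read_fasta(example.splitlines()):
    print(identifier, len(sequence), description)
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the in-memory example.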

Example FASTA file

Here’s an example of a FASTA file with three different types of sequences that normally aren’t found together:

>Sequence1 potential description after the space [species or other info]
ATGCCGTTAAGCGGCGTACGTGCCCGATAGAGAGCTTACG
GATCGGCTAGCTAGCTGACTGACTGACTGACGGCACAGAT
ACAGTGTACACGATGCCCGATCTAGTCAACCGGACTACGA
>Sequence2 Forward primer for actin
AGTATTACGATGCTAGCTG
>scaffold_1 length=80 depth=30.5x
GGGCCGTTAGCCGATCGATCGATCGGCTAGCTAGCTGACTGACTGACTGACGGCTAGCTACGGCTAGCTACGGCTAGCT

When are FASTA files used?

FASTA is the standard input format for a variety of applications, including:

FASTA files contain (almost) no quality information

A FASTA entry typically contains no information about the quality of the sequence, but some quality-related information may still be present:

FASTQ Format

FASTQ format is typically used for raw sequence reads from high-throughput sequencing technologies like Illumina. Each entry stores a read together with a quality score for every base.

Each FASTQ entry has four lines:

  1. The sequence identifier, starting with @.
    • This identifier typically includes information about the sequencing machine, run, lane, whether it’s from the first or second read, the position of the read on the flow cell, and the tag/index that identifies the sample. For example, the identifier: @HWI-ST1276:73:C0J4DACXX:1:1101:1204:1931 1:N:0:CGATGT is run number “73” from sequencer “HWI-ST1276”, using flowcell “C0J4DACXX”. This particular read comes from lane 1 on the flowcell, tile 1101 on that flowcell, and the read comes from the cluster at coordinate 1204 x 1931 within that tile. It is read 1 (potentially out of a pair of reads) and comes from the sequencing library with the sample tag CGATGT.
  2. The sequence itself.
  3. A line starting with + that is often otherwise empty, but sometimes repeats the sequence identifier.
  4. One letter or symbol for each nucleotide in line 2. These are quality scores: each character encodes a different quality score on the “Phred scale” (see below).
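
The identifier layout described in item 1 can be unpacked with a short Python sketch. The function name parse_illumina_id and the dictionary keys are hypothetical. The sketch assumes the Illumina-style layout shown above (seven colon-separated fields, a space, then four more), and identifiers from other platforms or older pipelines will not match it. The second and third fields after the space are not explained in the text above; they are labelled here following the usual Illumina convention (filter flag and control number), which should be treated as an assumption.

```python
def parse_illumina_id(line: str) -> dict:
    """Split an Illumina-style FASTQ identifier line into named fields."""
    if not line.startswith("@"):
        raise ValueError("a FASTQ identifier line must start with '@'")
    name, _, comment = line[1:].strip().partition(" ")
    fields = name.split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 colon-separated fields, got {len(fields)}")
    instrument, run, flowcell, lane, tile, x, y = fields
    parsed = {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
    }
    extra = comment.split(":", 3) if comment else []
    if len(extra) == 4:
        read, filtered, control, index = extra
        parsed.update(
            read=int(read),                  # 1 or 2 for paired reads
            is_filtered=(filtered == "Y"),   # assumed meaning: filter flag
            control=control,                 # assumed meaning: control number
            index=index,                     # sample tag / index sequence
        )
    return parsed


info = parse_illumina_id("@HWI-ST1276:73:C0J4DACXX:1:1101:1204:1931 1:N:0:CGATGT")
for key, value in info.items():
    print(f"{key}: {value}")
```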

Example of a FASTQ file with two 38-nucleotide reads:

@NS500784:901:HWH5GBGXL:1:11104:2976:10099 1:N:0:2
TCCCAAAGTATAATCAAAATACGATGTGAATGAATATA
+
AA6A/E6//6A/AAE/EEEE/<EE///</E//EEAEEE
@NS500784:901:HWH5GBGXL:1:11204:4829:14134 1:N:0:2
GTAAATTATACATCACCCATATAATTACTTAAAATCAT
+
A//AA/EE6EE/////E//AE/EEEE6//A/EA/A//E
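
To make the four-line layout concrete, here is a minimal Python sketch that reads records like the ones above and turns the quality characters into numbers. The names read_fastq and phred_scores are hypothetical. The sketch assumes each record occupies exactly four lines, and that quality characters use the Phred+33 encoding (score = ASCII code minus 33), which is common for current Illumina data but is an assumption not stated in this post.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    it = (line.rstrip("\r\n") for line in lines)
    while True:
        header = next(it, None)
        if header is None:
            return  # end of input
        if not header:
            continue  # skip blank lines between records
        sequence = next(it, None)
        plus = next(it, None)
        quality = next(it, None)
        if quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


def phred_scores(quality: str) -> List[int]:
    """Convert a quality string to a list of integer Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality]


example = """@NS500784:901:HWH5GBGXL:1:11104:2976:10099 1:N:0:2
TCCCAAAGTATAATCAAAATACGATGTGAATGAATATA
+
AA6A/E6//6A/AAE/EEEE/<EE///</E//EEAEEE
"""

for name, sequence, quality in read_fastq(example.splitlines()):
    scores = phred_scores(quality)
    print(name)
    print("length:", len(sequence))
    print("lowest score:", min(scores), "highest score:", max(scores))
```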

FASTQ sequence files typically:

FASTQ files from an Illumina (or other short-read sequencer):

FASTQ files from PacBio or Oxford Nanopore long-read single-molecule sequencers:

What types of analyses use FASTQ format files as input?

FASTQ is the standard input format for a variety of applications, including:

Note: You wouldn’t normally run BLAST using FASTQ files, but if necessary, you can: SequenceServer automatically detects and converts FASTQ files to FASTA format to avoid slowing you down.
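
For readers who want to do this conversion by hand, the idea is simple: keep the identifier and sequence lines, and drop the + line and the quality line. The sketch below is a generic illustration in Python and is not SequenceServer's own implementation. The function name fastq_to_fasta is hypothetical, and the sketch assumes every FASTQ record occupies exactly four lines.

```python
from typing import Iterable, Iterator, List


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines for four-line FASTQ records, dropping quality data."""
    record: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line and not record:
            continue  # blank line between records
        record.append(line)
        if len(record) == 4:
            header, sequence, plus, _quality = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError("malformed FASTQ record")
            yield ">" + header[1:]  # '@' becomes '>'
            yield sequence
            record = []
    if record:
        raise ValueError("truncated FASTQ record")


fastq_lines = [
    "@read1 example",
    "ACGTACGT",
    "+",
    "IIIIHHHH",
]
print("\n".join(fastq_to_fasta(fastq_lines)))
```

Note that the conversion is one-way: once the quality line is discarded, the FASTQ file cannot be reconstructed from the FASTA output.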

Why is it useful for FASTQ files to include quality scores?

During sequencing, some nucleotide bases are more likely to be incorrect than others. This can be because of molecular biology issues or calibration issues: sometimes there is some dust or a bubble in the sequencing run, somebody bumps into the machine, or mistakes accumulate during the sequencing-by-synthesis reaction. These problems lead to a lower signal-to-noise ratio, and thus the sequencer has lower confidence in the base call. Lower-confidence regions also occur in single-molecule long-read sequencing, where certain patterns are more difficult to identify correctly (e.g., homopolymer repeats such as GGGGGGGG), or where DNA modifications change the shape, charge, or other properties of the DNA.

Furthermore, some applications, such as genotyping to understand genetic variation, rest on an understanding of the accuracy of the raw data. Genotyping algorithms can use the quality scores to decide whether an observed difference is more likely to be a technical sequencing error or something biologically meaningful.
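
To see how a quality score translates into a statement about accuracy, recall the standard Phred definition: a score Q corresponds to an estimated error probability of 10^(-Q/10), so Q20 means roughly 1 error in 100 base calls and Q30 roughly 1 in 1,000. The Python sketch below applies that definition. The function names are hypothetical, and the Phred+33 offset used to decode quality characters is an assumption.

```python
def error_probability(q: float) -> float:
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities for one read.

    Assumes the quality string uses the given ASCII offset (Phred+33 here).
    """
    return sum(error_probability(ord(char) - offset) for char in quality)


for q in (10, 20, 30, 40):
    print(f"Q{q}: about 1 error in {round(1 / error_probability(q))} base calls")

quality = "AA6A/E6//6A/AAE/EEEE/<EE///</E//EEAEEE"
print(f"expected errors in this read: {expected_errors(quality):.2f}")
```

A variant caller works with far more information than this, but the same conversion is what lets it weigh a mismatch supported by Q36 bases differently from one supported by Q14 bases.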

In many cases, we may want to remove low-quality sequences: either entire reads or the lower-quality sections of reads. This is because of “crap in - crap out”: lower-quality information can lead to incorrect conclusions.
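
Both strategies, dropping whole reads and trimming low-quality sections, can be illustrated with a deliberately simple Python sketch. It is not the algorithm used by any particular trimming tool. The function names and the threshold of 20 are hypothetical choices for illustration, and quality characters are assumed to be Phred+33 encoded.

```python
from typing import Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def trim_3prime(sequence: str, quality: str, min_q: int = 20) -> Tuple[str, str]:
    """Remove bases from the 3' end until one reaches min_q."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - PHRED_OFFSET < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def passes_mean_quality(quality: str, min_mean: float = 20.0) -> bool:
    """Return True if the read's mean Phred score is at least min_mean."""
    if not quality:
        return False  # an empty read carries no usable information
    scores = [ord(char) - PHRED_OFFSET for char in quality]
    return sum(scores) / len(scores) >= min_mean


# 'I' encodes Q40 and '/' encodes Q14 under Phred+33.
sequence, quality = "ACGTAC", "IIII//"
trimmed_seq, trimmed_qual = trim_3prime(sequence, quality)
print(trimmed_seq, trimmed_qual)
print("keep read:", passes_mean_quality(trimmed_qual))
```

In practice, trimming is usually applied first and the read-level filter second, so that a read with a good start and a poor tail is shortened rather than discarded.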

Summary table comparing FASTA and FASTQ
