Comparing FASTA and FASTQ Sequence Formats
FASTA and FASTQ are text-based file formats for storing nucleotide sequences; FASTA is also used for amino-acid (i.e., protein) sequences. The two formats differ in why and how they are used.
FASTA Format
FASTA files are used for pretty much any type of nucleotide or protein sequence, including:
- reference genomes
- transcriptome assemblies
- PCR primers
- predicted cDNA or mRNA sequences
- predicted coding sequences
- CRISPR “guide” gRNA sequences
- Sanger sequences of individual clones
- plasmid or vector sequences
- and more
FASTA format structure
A FASTA file contains one or more FASTA entries, each of which has two parts:
- The identifier line, which is made up as follows:
  - It starts with a > symbol.
  - Then it has the identifier (sometimes multiple identifiers are stuck together using | symbols, such as gi|63148399|ref|NP_003997.2).
  - Then after the first space optionally comes extra information. This can be information about gene function, its location in the genome, the species/taxonomy, or anything else such as statistics about its length or coverage.
- The subsequent line(s) contain the sequence itself. The entire sequence can be on one (potentially) very long line, but often it is split into multiple lines of 60-100 characters each.
The next time a line starts with a >, that indicates the start of a new sequence.
Example FASTA file
Here’s an example of a FASTA file with three different types of sequences that normally aren’t found together:
>Sequence1 potential description after the space [species or other info]
ATGCCGTTAAGCGGCGTACGTGCCCGATAGAGAGCTTACG
GATCGGCTAGCTAGCTGACTGACTGACTGACGGCACAGAT
ACAGTGTACACGATGCCCGATCTAGTCAACCGGACTACGA
>Sequence2 Forward primer for actin
AGTATTACGATGCTAGCTG
>scaffold_1 length=80 depth=30.5x
GGGCCGTTAGCCGATCGATCGATCGGCTAGCTAGCTGACTGACTGACTGACGGCTAGCTACGGCTAGCTACGGCTAGCT
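The structure described above (a > line holding an identifier and an optional description, followed by one or more sequence lines) can be read with a few lines of Python. This is only a minimal sketch: the function names read_fasta and _split_header are made up for illustration, and the code assumes a well-formed file that fits the description above.

```python
import io
from typing import Iterator, TextIO, Tuple


def _split_header(header: str) -> Tuple[str, str]:
    """Split a header (without '>') into identifier and optional description."""
    # The identifier is everything before the first whitespace.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA entry."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            # A new '>' line closes the previous entry, if there was one.
            if header is not None:
                yield _split_header(header) + ("".join(chunks),)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are joined, so wrapped and unwrapped files behave the same.
            chunks.append(line)
    if header is not None:
        yield _split_header(header) + ("".join(chunks),)


# Hypothetical two-entry input, the second one wrapped over two lines.
example = io.StringIO(
    ">Sequence2 Forward primer for actin\n"
    "AGTATTACGATGCTAGCTG\n"
    ">wrapped_entry\n"
    "ACGT\n"
    "TTGA\n"
)
for identifier, description, sequence in read_fasta(example):
    print(identifier, "|", description, "|", len(sequence))
```

For real work, an established parser (for example from Biopython) is usually a safer choice than a hand-written one.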
When are FASTA files used?
FASTA is the standard input format for a variety of applications, including:
- Sequence alignment with tools such as Clustal or MUSCLE.
- Similarity search with tools such as BLAST or HMMER. BLAST builds its search database from a FASTA file, and you similarly provide your query sequence in FASTA format.
- As reference genomes or transcriptome assemblies for mapping reads with tools such as Bowtie or BWA, or pseudo-aligning reads with tools such as Salmon or Kallisto (the reads that you map or pseudo-align are typically in FASTQ format).
FASTA files contain (almost) no quality information
A FASTA entry typically contains no information about the quality of the sequence, but some quality-related information may still be present:
- Some assemblers use lower-case letters to indicate lower-confidence bases. Some bioinformatics tools can decide to ignore these bases – this is described as “soft-masking”.
- Some tools use the > line to indicate the quality of the entire sequence, such as >Sequence1 [quality=high]. This is not part of the FASTA format but is a convention used by some tools. Similarly, information about coverage can help to decide which sequences to use: surprisingly, some assemblers create contigs that have 0 coverage (!), and conversely, some sequences may have much higher coverage than expected, potentially indicating contamination or mis-assembly.
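As a small illustration of the lower-case convention mentioned above, the following sketch reports how much of a sequence is written in lower case. The function name soft_masked_fraction is hypothetical, and whether lower case means "low confidence", "repeat", or something else depends entirely on the tool that produced the file.

```python
def soft_masked_fraction(sequence: str) -> float:
    """Return the fraction of characters in the sequence that are lower-case."""
    if not sequence:
        return 0.0
    lower_case = sum(1 for base in sequence if base.islower())
    return lower_case / len(sequence)


# Hypothetical sequence with a lower-case stretch in the middle.
masked = "ACGTacgtACGT"
print(soft_masked_fraction(masked))

# Tools that do not care about the masking can simply upper-case the sequence.
print(masked.upper())
```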
FASTQ Format
FASTQ format is typically used for raw sequence reads from high-throughput sequencing technologies like Illumina. It stores each read together with a quality score for every base.
Each FASTQ entry has four lines:
- The sequence identifier, starting with @.
  - This identifier typically includes information about the sequencing machine, run, lane, whether it’s from the first or second read, the position of the read on the flow cell, and the tag/index that identifies the sample. For example, the identifier:
    @HWI-ST1276:73:C0J4DACXX:1:1101:1204:1931 1:N:0:CGATGT
    is run number “73” from sequencer “HWI-ST1276”, using flowcell “C0J4DACXX”. This particular read comes from lane 1 on the flowcell, tile 1101 on that flowcell, and the read comes from the cluster at coordinate 1204 x 1931 within that tile. It is read 1 (potentially out of a pair of reads) and comes from the sequencing library with the sample tag CGATGT.
- The sequence itself.
- A line starting with + that is often otherwise empty, but sometimes contains the sequence identifier again.
- A letter or symbol for each nucleotide in line 2. These are quality scores, where each character represents a different quality score on the “Phred scale” (see below).
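The identifier layout described above can be split into named fields as sketched below. The function name parse_illumina_identifier and the dictionary keys are made up for illustration. The two fields between the read number and the sample tag are not discussed in the text; in Illumina's convention they are a filter flag and a control number. Identifiers from other instruments or older pipelines may not match this layout, in which case the sketch raises an error.

```python
from typing import Dict


def parse_illumina_identifier(line: str) -> Dict[str, str]:
    """Split an Illumina-style FASTQ identifier line into named fields."""
    if not line.startswith("@"):
        raise ValueError("a FASTQ identifier line must start with '@'")
    # The part before the first space locates the read; the part after describes it.
    location, _, read_info = line[1:].strip().partition(" ")
    location_fields = location.split(":")
    read_fields = read_info.split(":")
    if len(location_fields) != 7 or len(read_fields) != 4:
        raise ValueError("identifier does not match the assumed layout")
    instrument, run, flowcell, lane, tile, x, y = location_fields
    read_number, filter_flag, control_number, sample_tag = read_fields
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": lane,
        "tile": tile,
        "x": x,
        "y": y,
        "read_number": read_number,
        "filter_flag": filter_flag,        # not described in the text above
        "control_number": control_number,  # not described in the text above
        "sample_tag": sample_tag,
    }


fields = parse_illumina_identifier(
    "@HWI-ST1276:73:C0J4DACXX:1:1101:1204:1931 1:N:0:CGATGT"
)
for name, value in fields.items():
    print(name, "=", value)
```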
Example of a FASTQ file with two 38-nucleotide reads:
@NS500784:901:HWH5GBGXL:1:11104:2976:10099 1:N:0:2
TCCCAAAGTATAATCAAAATACGATGTGAATGAATATA
+
AA6A/E6//6A/AAE/EEEE/<EE///</E//EEAEEE
@NS500784:901:HWH5GBGXL:1:11204:4829:14134 1:N:0:2
GTAAATTATACATCACCCATATAATTACTTAAAATCAT
+
A//AA/EE6EE/////E//AE/EEEE6//A/EA/A//E
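A four-line entry like the ones above can be read and its quality characters turned into numbers as sketched below. The sketch assumes the common "Phred+33" encoding (quality = character code minus 33) and strictly four lines per entry; both are assumptions about the file, not something the file itself declares. The names read_fastq, phred_scores and error_probability are hypothetical.

```python
import io
from typing import Iterator, List, TextIO, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality string) for each four-line entry."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between entries
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("entry does not follow the four-line layout")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        yield header[1:], sequence, quality


def phred_scores(quality: str) -> List[int]:
    """Convert a quality string into a list of Phred scores."""
    return [ord(character) - PHRED_OFFSET for character in quality]


def error_probability(score: int) -> float:
    """Phred scale: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# First read of the example above.
example = io.StringIO(
    "@NS500784:901:HWH5GBGXL:1:11104:2976:10099 1:N:0:2\n"
    "TCCCAAAGTATAATCAAAATACGATGTGAATGAATATA\n"
    "+\n"
    "AA6A/E6//6A/AAE/EEEE/<EE///</E//EEAEEE\n"
)
for identifier, sequence, quality in read_fastq(example):
    scores = phred_scores(quality)
    print(identifier)
    print(len(sequence), "bases, lowest score", min(scores), "highest score", max(scores))
```

On this scale a score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000.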
FASTQ sequence files typically:
- are compressed (e.g., with gzip). The file extension is typically .fastq.gz or .fq.gz
- are very large (e.g., 10s of GBs or more) because they contain raw sequence reads. To obtain a new genome assembly, we want to sequence the genome of that individual to at least 30-fold coverage. Similarly, to estimate gene expression levels, we may want 30,000,000 pairs of 150-nucleotide reads per sample. In both cases, there is a lot of redundancy that downstream processing will remove.
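Because the files are gzip-compressed text, they can be read line by line without decompressing them to disk first. The sketch below counts reads and bases this way and also works out the number of bases implied by the gene expression example above. The function name count_reads_and_bases and the demo file are hypothetical, and the code assumes exactly four lines per entry.

```python
import gzip
import os
import tempfile


def count_reads_and_bases(path):
    """Return (number of reads, total bases) for a gzip-compressed FASTQ file."""
    reads = 0
    bases = 0
    # Mode "rt" decompresses on the fly and yields text lines.
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each four-line entry
                reads += 1
                bases += len(line.strip())
    return reads, bases


# Size of the gene expression example: 30,000,000 pairs of 150-nucleotide reads.
pairs = 30_000_000
read_length = 150
print(pairs * 2 * read_length)  # 9000000000 bases per sample

# Small demonstration on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.fastq.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write("@read1\nACGTAC\n+\nEEEEEE\n@read2\nGGCA\n+\nEEEE\n")
    print(count_reads_and_bases(demo_path))  # (2, 10)
```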
FASTQ files from an Illumina (or other short-read sequencer):
- typically come in pairs, where each pair is a forward and a reverse read from the two ends of the same initial DNA fragment (e.g., 350 nucleotides long). The first read of each fragment is stored in one file (e.g., mysample_1.fastq.gz), and the read from the other end of the fragment is stored in a second file (e.g., mysample_2.fastq.gz). These files are often called “read 1” and “read 2” files, or “R1” and “R2” files.
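Since the two files list the fragments in the same order, paired reads can be processed by walking through both files in lockstep. The sketch below does that and checks that the names match. It assumes identifiers where the part before the first space is identical for both reads of a pair, as in the Illumina example earlier; older naming schemes differ. The function names are hypothetical, and an incomplete final entry would be silently ignored.

```python
import io
from itertools import zip_longest


def four_line_records(handle):
    """Yield (identifier, sequence, quality) from a four-lines-per-entry FASTQ handle."""
    lines = (line.rstrip("\r\n") for line in handle)
    # zip over the same iterator four times consumes the lines in groups of four.
    for header, sequence, _plus, quality in zip(lines, lines, lines, lines):
        yield header[1:], sequence, quality


def read_pairs(handle_1, handle_2):
    """Yield (read 1, read 2) record pairs, checking that they belong together."""
    records_1 = four_line_records(handle_1)
    records_2 = four_line_records(handle_2)
    for first, second in zip_longest(records_1, records_2):
        if first is None or second is None:
            raise ValueError("the two files contain different numbers of reads")
        # Compare only the part of the identifier before the first space.
        if first[0].partition(" ")[0] != second[0].partition(" ")[0]:
            raise ValueError("read names differ: %s vs %s" % (first[0], second[0]))
        yield first, second


# Made-up content standing in for mysample_1.fastq.gz and mysample_2.fastq.gz.
r1 = io.StringIO("@frag1 1:N:0:2\nACGT\n+\nEEEE\n")
r2 = io.StringIO("@frag1 2:N:0:2\nTTGA\n+\nEEEE\n")
for first, second in read_pairs(r1, r2):
    print(first[1], second[1])
```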
FASTQ files from PacBio or Oxford Nanopore long-read single-molecule sequencers:
- contain only one read per fragment, and thus only one file is produced per sample.
What types of analyses use FASTQ format files as input?
FASTQ is the standard input format for a variety of applications, including:
- Mapping reads with tools such as Bowtie or BWA
- Variant calling with tools such as GATK or FreeBayes
- Identifying copy number variation with tools such as CNVnator or CNVkit
- Identifying structural variants with tools such as Manta or Delly
- Creating Hi-C contact maps with tools such as HiC-Pro or HiCUP
- ChIP-seq analysis with tools such as MACS2 or SICER
- Transcript quantification with tools such as Salmon or Kallisto
- Assembly with tools such as SPAdes or Velvet. The input typically includes a FASTQ file, and the output is typically a FASTA file.
- Metagenomic analysis with tools such as MetaPhlAn or MetaBAT
- and more
- FASTQ files are also used for storing raw reads in public databases such as the Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA).
Note: You wouldn’t normally run BLAST using FASTQ files, but if necessary, you can: SequenceServer automatically detects and converts FASTQ files to FASTA format to avoid slowing you down.
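Converting FASTQ to FASTA amounts to keeping the identifier and sequence lines and dropping the + and quality lines. The following is a minimal sketch of that idea, not a description of how SequenceServer does it internally. The function name fastq_to_fasta and the line_width parameter are hypothetical, and four lines per entry are assumed.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle, line_width=60):
    """Write every FASTQ entry as a FASTA entry; return the number of entries written."""
    lines = (line.rstrip("\r\n") for line in fastq_handle)
    count = 0
    # Consume the input four lines at a time.
    for header, sequence, plus, _quality in zip(lines, lines, lines, lines):
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("entry does not follow the four-line layout")
        # '@' becomes '>', and the quality line is dropped.
        fasta_handle.write(">" + header[1:] + "\n")
        for start in range(0, len(sequence), line_width):
            fasta_handle.write(sequence[start:start + line_width] + "\n")
        count += 1
    return count


# Made-up single read.
fastq = io.StringIO("@read1 1:N:0:2\nACGTACGTAC\n+\nEEEEEEEEEE\n")
fasta = io.StringIO()
fastq_to_fasta(fastq, fasta)
print(fasta.getvalue(), end="")
```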
Why is it useful for FASTQ files to include quality scores?
During sequencing, some nucleotide bases are more likely to be incorrect than others. This can be because of molecular biology issues or calibration issues: sometimes there is some dust or a bubble in the sequencing run, somebody bumps into the machine, or mistakes accumulate during the sequencing-by-synthesis reaction. These lead to a lower signal-to-noise ratio, and thus the sequencer has lower confidence in the base call. Lower-confidence regions also occur in single-molecule long-read sequencing technologies, where certain patterns are more difficult to identify correctly (e.g., homopolymer repeats such as GGGGGGGG), or where DNA modifications change the shape, charge, or other properties of the DNA.
Furthermore, some specific applications, such as genotyping to understand genetic variation, rest on an understanding of the accuracy of the raw data. Genotyping algorithms can use the quality scores to decide whether an observed difference is more likely to be a technical sequencing error or something that is biologically meaningful.
In many cases, we may want to remove low-quality sequences: either entire reads or the lower-quality sections of reads. This is because of “crap in, crap out”: lower-quality information can lead to incorrect conclusions.
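A very simple form of such quality filtering is sketched below: low-quality bases are cut from the end of a read, and reads that become too short are dropped. This is deliberately naive and is not the algorithm of any particular trimming tool. The function names, the threshold of 20, the minimum length of 20, and the Phred+33 offset are all assumptions chosen for illustration.

```python
def trim_low_quality_end(sequence, quality, min_quality=20, offset=33):
    """Remove trailing bases whose Phred score is below min_quality."""
    end = len(sequence)
    # Walk back from the last base until a base of acceptable quality is found.
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def keep_read(sequence, min_length=20):
    """Decide whether a (trimmed) read is still long enough to be useful."""
    return len(sequence) >= min_length


# Made-up read: 'E' encodes score 36 and '/' encodes score 14 under Phred+33.
trimmed_sequence, trimmed_quality = trim_low_quality_end("ACGTACGT", "EEEEE///")
print(trimmed_sequence, trimmed_quality)
print(keep_read(trimmed_sequence, min_length=4))
```

Dedicated trimming tools usually use more robust rules (for example, windows of bases rather than single bases), so this sketch is best read as an explanation of the idea.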
Summary table comparing FASTA and FASTQ
- Information Content: Both have sequence data; FASTQ additionally includes quality scores for each base.
- Proteins vs Nucleotides: Only FASTA is used for protein sequences.
- File Size: FASTQ files are larger due to additional quality information, and because they are focused on raw data that is straight out of the sequencer.
- Quality Information: FASTQ includes quality information, which enables removing, trimming, or masking lower-quality information.
- Sequence Length: Both formats suit varying lengths, from short to long reads.