BLASTing Illumina reads in FASTQ format

Sequences typically come in FASTA or in FASTQ format, or in compressed variants of these (e.g. with an additional .gz or .bz2 extension).

BLAST uses FASTA format for queries and for database creation, so the BLAST tools do not directly understand FASTQ format. This is because:

BLAST was created long before the FASTQ format was created,
and because FASTQ files are typically inappropriate for BLAST analysis.

FASTQ files typically result from Illumina or Nanopore sequencing. They are typically huge files containing tens to hundreds of millions of reads, with many reads coming from the same subset of the genome or transcriptome, or from a particular amplicon. Such information is highly redundant. When this is the case:

BLAST analysis will be slow because the algorithm needs to search through a much larger set of sequences than if redundancy had been removed.
If results are found, they are likely to include a lot of redundancy (many similar reads obtaining high scores). This makes interpretation difficult.
Having a particularly large set of redundant sequences to search through also reduces BLAST’s ability to identify sequence similarities. This is because BLAST’s detection power depends on the size of the query and database. Indeed, the E-value of a particular alignment is higher (that is, less significant) if your database or query is larger (E-value is roughly equivalent to “the number of times I would find this match by chance if the database were made up of random nucleotides” - but see here for a detailed explanation of E-value).
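
The dependence of the E-value on search size can be made explicit with the standard Karlin-Altschul approximation. This is a simplified sketch: BLAST additionally applies edge-effect corrections to the effective lengths, which are omitted here.

```latex
E = K \, m \, n \, e^{-\lambda S}
% m: query length
% n: total length of the database (sum of all sequence lengths)
% S: raw alignment score
% K, \lambda: statistical parameters determined by the scoring system
```

For a fixed alignment score S, the E-value grows in proportion to n. Under this approximation, a database inflated tenfold by redundant reads makes the same alignment look ten times less significant.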

Most of the time if you want to BLAST a FASTQ file, you’re probably not using the best approach

It is likely that you want to first reduce redundancy in your dataset. The most biologically relevant way is often to perform whole genome or transcriptome assembly of your raw reads prior to BLASTing them. Sometimes, simple deduplication or collapsing is sufficient.
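
As a rough illustration of what simple collapsing means, the sketch below merges exactly identical reads and records how often each was seen. The helper name collapse_reads and the file name dup.fq are made up for this example. It assumes strict 4-line FASTQ records and does not merge reverse complements or near-identical reads, so dedicated tools remain the safer choice for real data.

```shell
# Hypothetical helper: collapse exactly identical reads from a 4-line FASTQ file.
# Steps: keep sequence lines only, count identical sequences, sort by abundance
# (ties broken alphabetically), then print each unique sequence as a FASTA record.
collapse_reads() {
  awk 'NR % 4 == 2' "$1" \
    | sort \
    | uniq -c \
    | sort -k1,1nr -k2,2 \
    | awk '{ printf ">seq%d_count=%d\n%s\n", NR, $1, $2 }'
}

# Tiny made-up input: three copies of one read and one copy of another.
cat > dup.fq <<'EOF'
@r1
ACGT
+
IIII
@r2
TTTT
+
IIII
@r3
ACGT
+
IIII
@r4
ACGT
+
IIII
EOF

collapse_reads dup.fq
```

The four input reads collapse into two FASTA records, with the copy number kept in each header so that abundance information is not lost.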

If you do want to work with the raw FASTQ reads, BLAST often isn’t the best way to perform analysis.

But what if I really do need to run BLAST on FASTQ files?

While it is often inappropriate to BLAST raw reads, gaining biological insight sometimes does depend on it.

SequenceServer automatically detects and converts FASTQ to FASTA format. Just paste the FASTQ reads into the search box. SequenceServer will instantly convert to FASTA for BLASTing.
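
To make the conversion concrete, here is a minimal sketch of what turning FASTQ into FASTA amounts to: the "@" header becomes a ">" header, the sequence line is kept, and the "+" and quality lines are dropped. This is for illustration only and is not how SequenceServer implements it. The helper name fastq_to_fasta and the file names are made up, and the sketch assumes strict 4-line records.

```shell
# Hypothetical helper: convert a 4-line-per-record FASTQ file to FASTA.
# Line 1 of each record: header, leading "@" replaced by ">".
# Line 2 of each record: sequence, printed unchanged.
# Lines 3 and 4 ("+" separator and quality string) are skipped.
fastq_to_fasta() {
  awk 'NR % 4 == 1 { print ">" substr($0, 2) }
       NR % 4 == 2 { print }' "$1"
}

# Tiny made-up example with two reads.
cat > sample.fq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGGCCAA
+
IIIIIIII
EOF

fastq_to_fasta sample.fq > sample.fasta
cat sample.fasta
```

Counting lines by position rather than matching on "@" matters here, because a quality string can itself begin with "@".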

Command-line batch conversion of FASTQ to FASTA

If you have huge numbers of reads, you’ll want to use a more automated approach to convert the FASTQ file to a FASTA file. Using a tried and tested tool is less risky than creating your own custom script by creatively using grep, sed, Python, Perl or ChatGPT. The following seqtk command is one easy way:

seqtk seq -A input.fq > output.fasta
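
After a conversion like this, it can be worth checking that no reads were lost. The sketch below compares the number of FASTQ records with the number of FASTA headers. The helper names are made up for this example, and it assumes an uncompressed FASTQ file with strict 4-line records.

```shell
# Hypothetical helpers for a quick sanity check after conversion.

# Number of reads in a 4-line-per-record FASTQ file.
count_fastq_reads() {
  echo $(( $(wc -l < "$1") / 4 ))
}

# Number of records in a FASTA file (one ">" header per record).
count_fasta_records() {
  grep -c '^>' "$1"
}

# Prints a short verdict for an input/output pair.
check_conversion() {
  if [ "$(count_fastq_reads "$1")" -eq "$(count_fasta_records "$2")" ]; then
    echo "read counts match"
  else
    echo "read counts differ"
  fi
}

# Usage with the files from the seqtk command:
#   check_conversion input.fq output.fasta
```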

Before using huge numbers of reads for database creation or as queries, it’s often a good idea to remove redundancy. You can directly reduce redundancy with a tool like CD-HIT, but it’s often best to run a quick assembly (e.g. with SPAdes or MEGAHIT).

If you need a transcriptome, metagenome or genome assembly done on your raw data, we can help you with that. Contact support with your details and we’ll get back to you. We offer cheap and fast transcriptome, genome, and metagenome assembly services.

