BLASTing Illumina reads in FASTQ format
Sequences typically come in FASTA or in FASTQ format, or in their compressed variations (i.e., with an additional .gz or .bz2 extension).
BLAST uses FASTA format for queries and for database creation, so it does not directly understand FASTQ format. This is because:
- BLAST was created long before the FASTQ format existed,
- FASTQ files are typically inappropriate for BLAST analysis.
FASTQ files typically result from Illumina or Nanopore sequencing. They are usually huge files containing tens to hundreds of millions of reads, many of which come from the same subset of the genome or transcriptome, or from a particular amplicon. Such information is highly redundant. When this is the case:
- BLAST analysis will be slow because the algorithm needs to search through a much larger set of sequences than if redundancy had been removed.
- If results are found, they are likely to include a lot of redundancy (many similar reads obtaining high scores). This makes interpretation difficult.
- Having a particularly large set of redundant sequences to search through also reduces BLAST’s ability to identify sequence similarities. This is because BLAST’s detection power depends on the size of the query and database. Indeed, the E-value of a particular alignment is higher (that is, less significant) if your database or query is larger, so a genuine match can end up above your significance threshold (E-value is grossly equivalent to “the number of times I would find this match by chance if the database were made up of random nucleotides” - but see here for a detailed explanation of E-value).
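To make the last point concrete, here is a minimal sketch of how the expected number of chance hits scales with database size. It assumes the simplified textbook relation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total database length. BLAST itself applies further corrections (effective lengths), so the absolute values are illustrative only, and the function name and numbers below are hypothetical.

```python
def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Simplified E-value: E = m * n * 2^(-S').

    bit_score -- normalised (bit) score S' of the alignment
    query_len -- query length m, in nucleotides
    db_len    -- total database length n, in nucleotides
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical example: the same 150 nt read and the same alignment (40 bits),
# searched against a small database and one that is 100 times larger.
small = expected_chance_hits(40.0, 150, 1_000_000)
large = expected_chance_hits(40.0, 150, 100_000_000)

print(f"E-value, small database: {small:.2e}")
print(f"E-value, large database: {large:.2e}")
print(f"ratio: {large / small:.0f}x")
```

The alignment itself is unchanged, yet its E-value grows in proportion to the database size. Padding a database with redundant reads therefore makes every hit look less significant.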
Most of the time if you want to BLAST a FASTQ file, you’re probably not using the best approach
It is likely that you want to first reduce redundancy in your dataset. The most biologically relevant way is often to perform whole genome or transcriptome assembly of your raw reads prior to BLASTing them. Sometimes, simple deduplication or collapsing is sufficient.
If you do want to work with the raw FASTQ reads, BLAST often isn’t the best way to perform analysis.
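As an illustration of the simple deduplication mentioned above, the sketch below collapses exactly identical reads and records how often each one occurred. It is an assumption-laden toy: the function name and the `readN_xCOUNT` naming scheme are made up for this example, all sequences are held in memory, and reverse complements or near-identical reads are not merged (clustering tools or an assembly handle those cases).

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def collapse_reads(sequences: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs, one per distinct sequence.

    Sequences are compared case-insensitively. The most abundant
    sequence comes first, and each name carries the number of
    identical reads that were collapsed into it.
    """
    counts = Counter(seq.upper() for seq in sequences)
    for index, (seq, count) in enumerate(counts.most_common(), start=1):
        yield f"read{index}_x{count}", seq


# Hypothetical input: four reads, three of which are identical.
reads = ["ACGT", "acgt", "TTGA", "ACGT"]
for name, seq in collapse_reads(reads):
    print(f">{name}\n{seq}")
```

Keeping the count in the sequence name means abundance information is not lost when the collapsed reads are later used as queries or as a database.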
But what if I really do need to run BLAST on FASTQ files?
While it is often inappropriate to BLAST raw reads, gaining biological insight sometimes does depend on it.
SequenceServer automatically detects FASTQ input and converts it to FASTA format. Just paste the FASTQ reads into the search box and they will instantly be converted to FASTA for BLASTing.
Command-line batch conversion of FASTQ to FASTA
If you have huge numbers of reads, you’ll want to use a more automated approach to convert the FASTQ format file to a FASTA format file. Using a tried and tested tool is less risky than creating your own custom script by creatively using grep, sed, python, perl or ChatGPT. The following seqtk command is one easy way:
seqtk seq -A input.fq > output.fasta
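To show what this conversion involves, and why ad hoc scripts are risky, here is a minimal Python sketch. It is an illustration only, not a replacement for seqtk: it assumes strict four-line FASTQ records (header, sequence, `+` line, qualities), and the function names are hypothetical. Note that a quality line can legitimately begin with `@`, so a script that looks for lines starting with `@` would mistake such lines for headers; reading records four lines at a time avoids that trap.

```python
import io
from typing import Iterator, TextIO, Tuple


def fastq_records(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) from a strictly four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip()
        plus = handle.readline()
        handle.readline()  # quality line: read and discard
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header.strip()!r}")
        yield header[1:].rstrip(), sequence


def fastq_to_fasta(src: TextIO, dst: TextIO) -> int:
    """Write each FASTQ record as FASTA; return the number of records written."""
    written = 0
    for name, sequence in fastq_records(src):
        dst.write(f">{name}\n{sequence}\n")
        written += 1
    return written


# Tiny in-memory demo; the second record's quality line starts with '@'.
demo_in = io.StringIO("@r1 desc\nACGT\n+\nIIII\n@r2\nTTGA\n+\n@III\n")
demo_out = io.StringIO()
print(fastq_to_fasta(demo_in, demo_out), "records converted")
print(demo_out.getvalue(), end="")
```

Real files can break the assumptions made here (truncated records, unusual line wrapping), which is why a tested tool is the safer choice for production use.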
Before using huge numbers of reads for database creation or as queries, it’s often a good idea to remove redundancy. You can directly reduce redundancy with a tool like cd-hit, but it’s often best to run a quick assembly (e.g. with spades or megahit).
If you need a transcriptome, metagenome or genome assembly done on your raw data, we can help you with that. Contact support with your details and we’ll get back to you. We offer cheap and fast transcriptome, genome, and metagenome assembly services.