UCSC Genome Bioinformatics: FAQ

- - - - - - -

Frequently Asked Questions: Blat

Blat vs. Blast
Blat use restrictions
Downloading Blat source and documentation
Replicating web-based Blat parameters in command-line version
Using the -ooc flag
Replicating web-based Blat percent identity and score calculations
Replicating web-based Blat "I'm feeling lucky" search results
Using Blat for short sequences with maximum sensitivity

Blat vs. Blast

Question:
"What are the differences between Blat and Blast?"

Response:
Blat is an alignment tool like BLAST, but it is structured differently. On DNA, Blat works by keeping an index of an entire genome in memory. Thus, the target database of BLAT is not a set of GenBank sequences, but instead an index derived from the assembly of the entire genome. The index -- which uses less than a gigabyte of RAM -- consists of all non-overlapping 11-mers except for those heavily involved in repeats. This smaller size means that Blat is far more easily mirrored. Blat of DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or short sequence alignments.

On proteins, Blat uses 4-mers rather than 11-mers, finding protein sequences of 80% and greater similarity to the query of length 20+ amino acids. The protein index requires slightly more than 2 gigabytes of RAM. In practice -- due to sequence divergence rates over evolutionary time -- DNA Blat works well within humans and primates, while protein Blat continues to find good matches within terrestrial vertebrates and even earlier organisms for conserved proteins. Within humans, protein Blat gives a much better picture of gene families (paralogs) than DNA Blat. However, BLAST and psi-BLAST at NCBI can find much more remote matches.

From a practical standpoint, Blat has several advantages over BLAST:

speed (no queues, response in seconds) at the price of lesser homology depth
the ability to submit a long list of simultaneous queries in fasta format
five convenient output sort options
a direct link into the UCSC browser
alignment block details in natural genomic order
an option to launch the alignment later as part of a custom track

Blat is commonly used to look up the location of a sequence in the genome or determine the exon structure of an mRNA, but expert users can run large batch jobs and make internal parameter sensitivity changes by installing command line Blat on their own Linux server.

Blat use restrictions

Question:
"I received a high-volume traffic warning from your Blat server informing me that I had exceeded the server use limitations. Can you give me information on the UCSC Blat server use parameters?"

Response:
Due to the high demand on our Blat servers, we restrict service for users who programmatically query Blat or do large batch queries. Program-driven use of Blat is limited to a maximum of one hit every 15 seconds and no more than 5,000 hits per day. Please limit batch queries to 25 sequences or fewer.
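A script that queries the server can enforce these limits on its own side. The following is a minimal sketch, not UCSC code: the struct and function names are hypothetical, and it assumes that "per day" means a 24-hour window starting at the first recorded request.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define MIN_INTERVAL_SEC 15       /* at most one hit every 15 seconds */
#define MAX_HITS_PER_DAY 5000     /* at most 5,000 hits per day */
#define DAY_SEC (24 * 60 * 60)

/* Hypothetical client-side bookkeeping; not part of any UCSC tool. */
struct blatThrottle
    {
    int started;          /* nonzero once any request has been recorded */
    time_t windowStart;   /* start of the current 24-hour window (assumption) */
    time_t lastHit;       /* time of the most recent request */
    int hitsInWindow;     /* requests recorded in the current window */
    };

long blatThrottleWait(struct blatThrottle *t, time_t now)
/* Return the number of seconds to wait before the next request is allowed.
 * Returns 0, and records the request, when it may be sent immediately. */
{
assert(t != NULL);
if (t->started && now - t->lastHit < MIN_INTERVAL_SEC)
    return (long)(MIN_INTERVAL_SEC - (now - t->lastHit));
if (!t->started || now - t->windowStart >= DAY_SEC)
    {
    t->windowStart = now;     /* begin a new 24-hour window */
    t->hitsInWindow = 0;
    }
if (t->hitsInWindow >= MAX_HITS_PER_DAY)
    return (long)(t->windowStart + DAY_SEC - now);
t->started = 1;
t->lastHit = now;
t->hitsInWindow++;
return 0;
}
```

A caller would pass time(NULL) as now, sleep for the returned number of seconds when it is nonzero, and then ask again before sending the query.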

For users with high-volume Blat demands, we recommend downloading Blat for local use. For more information, see Downloading Blat source and documentation.

Downloading Blat source and documentation

Question:
"Is the Blat source available for download? Is there documentation available?"

Response:
Blat source and executables are freely available for academic, nonprofit and personal use. Commercial licensing information is available on the Kent Informatics website.

Blat source may be downloaded from http://www.soe.ucsc.edu/~kent (look for the blatSrc* zip file with the most recent date). For Blat executables, go to http://hgdownload.cse.ucsc.edu/admin/exe/ and choose your machine type.

Documentation on Blat program specifications is available here.

Replicating web-based Blat parameters in command-line version

Question:
"I'm setting up my own Blat server and would like to use the same parameter values that the UCSC web-based Blat server uses."

Response:
Use the following settings to replicate the search results of the UCSC Blat server. Note that you may still observe some slight differences between command line results and web-based results, depending on the search being performed.

faToTwoBit:

Use soft masking.

gfServer (this is how the UCSC web-based blat servers are configured):

blat server (capable of PCR):
gfServer start blatMachine portX -stepSize=5 -log=untrans.log database.2bit
translated blat server:
gfServer start blatMachine portY -trans -mask -log=trans.log database.2bit

For enabling DNA/DNA and DNA/RNA matches, only the host, port and twoBit files are needed. The same port is used for both untranslated blat (gfClient) and PCR (webPcr). You'll need a separate blat server on a separate port to enable translated blat (protein searches or translated searches in protein-space).

gfClient:

Set -minScore=0 and -minIdentity=0. This will result in some low-scoring, generally spurious hits, but for interactive use it's sufficiently easy to ignore them (because results are sorted by score) and sometimes the low-scoring hits come in handy.

standalone blat:

blat search:
blat -stepSize=5 -repMatch=2253 -minScore=0 -minIdentity=0 database.2bit query.fa output.psl

Notes on repMatch:
The default setting for gfServer DNA matches is: repMatch = 1024 * (tileSize/stepSize).
The default setting for blat DNA matches is: repMatch = 1024 (if tileSize=11).
To get command-line results that are equivalent to web-based results, repMatch must be specified when using blat.
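The value -repMatch=2253 in the blat command above is what this formula gives for tileSize=11 and stepSize=5 (1024 * 11 / 5 = 2252.8). A small sketch of the arithmetic follows. The function name is hypothetical, and rounding to the nearest integer is an assumption chosen because it reproduces 2253; how gfServer rounds internally is not stated here.

```c
#include <assert.h>

int scaledRepMatch(int baseRepMatch, int tileSize, int stepSize)
/* Scale a base repMatch by tileSize/stepSize, rounding to the nearest
 * integer (the rounding mode is an assumption, see the text above). */
{
assert(baseRepMatch >= 0);
assert(stepSize >= 1 && stepSize <= tileSize);
return (baseRepMatch * tileSize + stepSize / 2) / stepSize;
}
```

With these assumptions, scaledRepMatch(1024, 11, 5) returns 2253, and scaledRepMatch(1024, 11, 11) returns 1024, the standalone blat default for tileSize=11.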

For more information on the parameters available for blat, gfServer, and gfClient, see the blat specifications.

Using the -ooc flag

Question:
"What does the -ooc flag do?"

Response:
Using any -ooc option in blat, such as -ooc=11.ooc, speeds up searches in a way similar to repeat-masking the sequence. The 11.ooc file contains sequences determined to be over-represented in the genome sequence. To speed up searches, these sequences are not used when seeding an alignment against the genome. For reasonably-sized sequences, this will not create a problem and will significantly reduce processing time.

By not using the 11.ooc file, you will increase alignment time, but will also slightly increase sensitivity. This may be important if you are aligning shorter sequences or sequences of poor quality. For example, if a particular sequence consists primarily of sequences in the 11.ooc file, it will never be seeded correctly for an alignment if the -ooc flag is used.

In summary, if you are not finding certain sequences and can afford the extra processing time, you may want to run blat without the 11.ooc file.

Replicating web-based Blat percent identity and score calculations

Question:
"Using my own command-line Blat server, how can I replicate the percent identity and score calculations produced by web-based Blat?"

Response:
There isn't an option to command-line Blat that gives you the percent ID and the score. Instead, you will have to write your own program to produce the calculations, incorporating some of the functions from the Genome Browser source code.

To calculate the percent ID, incorporate the following code and function into a program that processes your Blat PSL output. The parameter isMrna should be set to TRUE, regardless of whether the input sequence is mRNA or protein.

The percent identity score is calculated like this:

100.0 - pslCalcMilliBad(psl, TRUE) * 0.1

Here is the source for pslCalcMilliBad:

int pslCalcMilliBad(struct psl *psl, boolean isMrna)
/* Calculate badness in parts per thousand. */
{
int sizeMul = pslIsProtein(psl) ? 3 : 1;
int qAliSize, tAliSize, aliSize;
int milliBad = 0;
int sizeDif;
int insertFactor;
int total;

qAliSize = sizeMul * (psl->qEnd - psl->qStart);
tAliSize = psl->tEnd - psl->tStart;
aliSize = min(qAliSize, tAliSize);
if (aliSize <= 0)
    return 0;
sizeDif = qAliSize - tAliSize;
if (sizeDif < 0)
    {
    if (isMrna)
        sizeDif = 0;
    else
        sizeDif = -sizeDif;
    }
insertFactor = psl->qNumInsert;
if (!isMrna)
    insertFactor += psl->tNumInsert;

total = (sizeMul * (psl->match + psl->repMatch + psl->misMatch));
if (total != 0)
    milliBad = (1000 * (psl->misMatch*sizeMul + insertFactor + 
	round(3*log(1+sizeDif)))) / total;
return milliBad;
}

The complexity in milliBad arises primarily from how it handles inserts. Ignoring the inserts, the calculation is simply mismatches expressed as parts per thousand. However, the algorithm also factors in insertion penalties, which are relatively weak compared to, say, those of BLAST, but still present. When huge inserts are allowed (which is necessary to accommodate introns), it is typically necessary to resort to logarithms, as this calculation does.

The pslIsProtein function called by pslCalcMilliBad is:

boolean pslIsProtein(const struct psl *psl)
/* is psl a protein psl (are its blockSizes and scores in protein space) */
{
int lastBlock = psl->blockCount - 1;

return  (((psl->strand[1] == '+' ) &&
     (psl->tEnd == psl->tStarts[lastBlock] + 3*psl->blockSizes[lastBlock])) ||
    ((psl->strand[1] == '-') &&
     (psl->tStart == (psl->tSize-(psl->tStarts[lastBlock] + 3*psl->blockSizes[lastBlock])))));
}

This function automatically determines whether or not the PSL output file contains alignment information for a protein query. Alternatively, you could write the program such that the user specifies if the query is a protein or not.
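The test works because a protein alignment has a two-character strand and block sizes counted in amino acids, so the last block reaches the end of the target span only when its size is multiplied by 3. The sketch below shows this on two small records. The struct miniPslBlocks and the function miniIsProtein are hypothetical stand-ins holding only the fields the test reads, not part of the Genome Browser source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced record: only the PSL fields that pslIsProtein reads. */
struct miniPslBlocks
    {
    char strand[3];              /* "+" or "-", or two characters when translated */
    unsigned tSize;              /* target sequence size */
    unsigned tStart, tEnd;       /* alignment span in target */
    unsigned blockCount;         /* number of alignment blocks */
    const unsigned *blockSizes;  /* size of each block */
    const unsigned *tStarts;     /* start of each block in target */
    };

int miniIsProtein(const struct miniPslBlocks *psl)
/* Same test as pslIsProtein: true when the last block ends at the edge of
 * the target span only if its size is counted in amino acids. */
{
unsigned lastBlock, lastEnd;

assert(psl != NULL && psl->blockCount > 0);
lastBlock = psl->blockCount - 1;
lastEnd = psl->tStarts[lastBlock] + 3 * psl->blockSizes[lastBlock];
return (psl->strand[1] == '+' && psl->tEnd == lastEnd) ||
       (psl->strand[1] == '-' && psl->tStart == psl->tSize - lastEnd);
}
```

A record with strand "++", tEnd 560 and a last block of 20 starting at 500 is reported as protein (500 + 3 * 20 = 560). A DNA record with strand "+" is not, because its strand has no second character.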

The score calculation is generated by the following function:

int pslScore(const struct psl *psl)
/* Return score for psl. */
{
int sizeMul = pslIsProtein(psl) ? 3 : 1;

return sizeMul * (psl->match + ( psl->repMatch>>1)) -
         sizeMul * psl->misMatch - psl->qNumInsert - psl->tNumInsert;
}
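As a worked example, the same arithmetic can be written as a standalone function in which the caller states whether the query is a protein, as suggested above. The name miniPslScore and its parameter list are hypothetical; they mirror the PSL fields used by pslScore.

```c
#include <assert.h>

int miniPslScore(int match, int repMatch, int misMatch,
                 int qNumInsert, int tNumInsert, int isProtein)
/* Same arithmetic as pslScore, with the protein test supplied by the caller. */
{
int sizeMul = isProtein ? 3 : 1;

assert(match >= 0 && repMatch >= 0 && misMatch >= 0);
/* Repeat matches count half; each mismatch and each insert costs one. */
return sizeMul * (match + (repMatch >> 1)) -
       sizeMul * misMatch - qNumInsert - tNumInsert;
}
```

For a DNA alignment with 980 matches, 10 repeat matches, 15 mismatches, 2 query inserts and 3 target inserts, this gives 980 + 5 - 15 - 2 - 3 = 965. For a protein alignment with 100 matches, 5 mismatches and 1 query insert, it gives 300 - 15 - 1 = 284.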

For help with creating a C program to perform these calculations, you may want to use the libraries from the Genome Browser source code. See our FAQ on source code licensing and downloads for information on obtaining the source. The file kent/src/lib/psl.c contains the pslCalcMilliBad, pslIsProtein and pslScore functions and also a useful function called pslLoadAll that loads the psl file into a linked list structure. The definition of the psl struct can be found in kent/src/inc/psl.h.

Replicating web-based Blat "I'm feeling lucky" search results

Question:
"How do I generate the same search results as web-based Blat's "I'm feeling lucky" option using command-line blat?"

Response:
The code for the "I'm feeling lucky" Blat search orders the results based on the sort output option that you selected on the query page. It then returns the highest-scoring alignment of the first query sequence.

If you are sorting results by "query, start" or "chrom, start", generating the "I'm feeling lucky" result is straightforward: sort the output file by these columns, then select the top result.

To replicate any of the sort options involving score, you first must calculate the score for each result in your PSL output file, then sort the results by score or other combination (e.g. "query, score" and "chrom, score"). See the section on Replicating web-based Blat percent identity and score calculations for information on calculating the score.
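For the score-based options, the end result described above is the highest-scoring alignment of the first query sequence. The sketch below selects that hit from rows whose scores have already been computed. The struct scoredHit and the function luckyHit are hypothetical, the caller is assumed to know the name of the first query, and keeping the earliest row on a tied score is an assumption rather than documented behavior of the web tool.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: one PSL row reduced to the fields used here. */
struct scoredHit
    {
    const char *qName;   /* query name */
    const char *tName;   /* target (chromosome) name */
    int tStart;          /* start position in target */
    int score;           /* score computed beforehand from the PSL row */
    };

const struct scoredHit *luckyHit(const struct scoredHit *hits, size_t count,
                                 const char *firstQuery)
/* Return the highest-scoring hit whose query is firstQuery, or NULL if
 * there is none. On a tie the earliest row is kept (an assumption). */
{
const struct scoredHit *best = NULL;
size_t i;

assert(hits != NULL || count == 0);
assert(firstQuery != NULL);
for (i = 0; i < count; i++)
    {
    if (strcmp(hits[i].qName, firstQuery) != 0)
        continue;                       /* row belongs to a later query */
    if (best == NULL || hits[i].score > best->score)
        best = &hits[i];
    }
return best;
}
```

A single pass is enough here because only the top row is needed; a full sort is required only to reproduce the complete ordered listing.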

Alternatively, you can try filtering your Blat PSL output using either the pslReps or pslCDnaFilter program available in the Genome Browser source code. For information on obtaining the source code, see our FAQ on source code licensing and downloads.

Using Blat for short sequences with maximum sensitivity

Question:
"How do I configure blat for short sequences with maximum sensitivity?"

Response:
Here are some guidelines for configuring standalone blat and gfServer/gfClient for these conditions:

The formula to find the shortest query size that will guarantee a match (if matching tiles are not marked as overused) is: 2 * stepSize + tileSize - 1
For example, with stepSize set to 5 and tileSize set to 11, matches of query size 2 * 5 + 11 - 1 = 20 bp will be found if the query matches the target exactly.
The stepSize parameter can range from 1 to tileSize.
The tileSize parameter can range from 6 to 15. For protein, the range starts lower.
For minMatch=1 (e.g., protein), the minimum guaranteed match length is: 1 * stepSize + tileSize - 1
Try using -fine.
Use a large value for repMatch (e.g. -repMatch=1000000) to reduce the chance of a tile being marked as over-used.
Do not use an .ooc file.
Do not use -fastMap.
Do not use masking command-line options.

The above changes will make BLAT more sensitive, but will also reduce speed and increase memory usage. It may be necessary to process one chromosome at a time to reduce the memory requirements.
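The two length formulas in the guidelines above differ only in the number of tile hits required, so they can be written as one helper. This is a sketch: the function name is hypothetical, and extending the formula to values of minMatch other than 1 and 2 is an assumption that goes beyond the two cases stated above.

```c
#include <assert.h>

int minGuaranteedQuerySize(int minMatch, int stepSize, int tileSize)
/* Shortest exact-match query guaranteed to be found, provided no matching
 * tile is marked as overused: minMatch * stepSize + tileSize - 1. */
{
assert(minMatch >= 1);
assert(stepSize >= 1 && stepSize <= tileSize);
return minMatch * stepSize + tileSize - 1;
}
```

With stepSize=5 and tileSize=11, minGuaranteedQuerySize(2, 5, 11) returns 20, matching the example above, and minGuaranteedQuerySize(1, 5, 11) returns 15. Lowering stepSize to 1 with two required hits brings the guaranteed length down to 12.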

A note on filtering output: increasing the -minScore parameter value beyond one-half of the query size has no further effect. Therefore, use either the pslReps or pslCDnaFilter program available in the Genome Browser source code to filter for the size, score, coverage, or quality desired. For information on obtaining the source code, see our FAQ on source code licensing and downloads.