Frequently Asked Questions: Data and Downloads
|
|
|
Downloading sequence and annotation data |
|
|
|
|
Question:
"How do I obtain the sequence and/or annotation data for a release?"
Response:
Sequence and annotation data downloads are usually made available within
the first week of the release of a new assembly. The download directories
are automatically updated nightly to incorporate additions and modifications
to the data.
We recommend that you download data via our FTP site at
ftp://hgdownload.cse.ucsc.edu/,
particularly if you plan to download multiple files or files of large size.
To do so:
ftp hgdownload.cse.ucsc.edu
user name: anonymous
password: your email address
go to the goldenPath directory, pick an assembly directory, then a data directory
To download multiple files from the UNIX ftp command
line, use the "mget" command. You may want to use the
"prompt" command to toggle the interactive
mode if you do not want to be prompted for each file
that you download.
mget [filename1] [filename2] ...
- or -
mget -a (to download all the files in the directory)
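For scripted downloads, the ftp session above can be approximated with Python's standard-library ftplib module. This is a minimal sketch, not an official client: the host name and the goldenPath layout come from the text above, while the helper names and the arguments in the usage note are illustrative assumptions.

```python
from ftplib import FTP
import os

HOST = "hgdownload.cse.ucsc.edu"  # FTP host named in the text above


def remote_dir(assembly, data_dir):
    """Build the goldenPath path for one assembly's data directory."""
    return "/".join(["goldenPath", assembly, data_dir])


def download_all(assembly, data_dir, dest, email):
    """Log in anonymously and fetch every file in the directory (like mget).

    Assumes the remote directory contains only plain files; a
    subdirectory in the listing would make the RETR command fail.
    """
    os.makedirs(dest, exist_ok=True)
    with FTP(HOST) as ftp:
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(remote_dir(assembly, data_dir))
        for name in ftp.nlst():
            with open(os.path.join(dest, name), "wb") as out:
                ftp.retrbinary("RETR " + name, out.write)
```

A call such as download_all("hg17", "chromosomes", "hg17_chroms", "you@example.org") would copy that directory into a local hg17_chroms folder; the email address is a placeholder for your own.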
You can also download data from our
Downloads
page or our DAS
server. To download a specific subset of the data or to configure the output
format of the data, use the Table
Browser. For information on extracting a large set
of sequences from an assembly, see
Extracting sequence in batch from
an assembly.
For more information on using the UCSC DAS server, see
Downloading data from the UCSC DAS server.
| |
|
|
Extracting sequence in batch from an assembly |
|
|
|
|
Question:
"I have a lot of coordinates for an assembly and want to
extract the corresponding sequences. What is the best
way to proceed?"
Response:
There are two ways to extract genomic sequence in batch
from an assembly:
A. Download the appropriate fasta files from our
ftp server
and extract sequence data using your own tools or the
tools from our source tree. This is the recommended
method when you have very large sequence datasets or
will be extracting data frequently.
Sequence data for most assemblies is located in the
assembly's "chromosomes" subdirectory on the
downloads server. For example, the sequence for human
assembly hg17 can be found in
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg17/chromosomes/.
You'll find instructions for obtaining our source
programs and utilities
here. Some programs
that you may find useful are nibFrag and twoBitToFa,
as well as other fa* programs. To obtain
usage information about most of these programs, run the program
without arguments.
B. Use the Table Browser to extract sequence. This is a
convenient way to obtain small amounts of sequence.
- Create a
custom
track of the genomic coordinates in
BED format and upload
into the Genome Browser.
- Select the custom track in the Table Browser, then
select the "sequence" output format to
retrieve data. We recommend that you save the file
locally in gzip-compressed form.
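For method A, "your own tools" can be quite small. The following Python sketch reads a FASTA file and pulls out the sequence for each BED interval. It is an illustration under stated assumptions: the function names are hypothetical, and the whole FASTA file is held in memory, so for full chromosomes the source-tree utilities mentioned above are likely to be a better fit.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a {name: sequence} dict."""
    seqs, name, parts = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(parts)
            # Use the first word of the header line as the sequence name.
            name, parts = line[1:].split()[0], []
        elif line:
            parts.append(line)
    if name is not None:
        seqs[name] = "".join(parts)
    return seqs


def extract_bed(seqs, bed_lines):
    """Yield (chrom, start, end, sequence) for each BED line.

    BED starts are 0-based and ends are exclusive, which matches
    Python slicing directly.
    """
    for line in bed_lines:
        fields = line.split()
        # Skip blank lines, comments, and track/browser header lines.
        if (not fields or fields[0] in ("track", "browser")
                or fields[0].startswith("#")):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        yield chrom, start, end, seqs[chrom][start:end]
```

Both functions accept any iterable of lines, so an open file handle can be passed directly, for example read_fasta(open("chr22.fa")).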
| |
|
|
Downloading data from the UCSC DAS server |
|
|
|
|
Question:
"How do I download data using the UCSC DAS server?"
Response:
The UCSC DAS server provides access to genome annotation data for all current assemblies
featured in the Genome Browser. To view a list of the assemblies available from the
DAS server and their base URLs, see
http://genome.ucsc.edu/cgi-bin/das/dsn.
To construct a DAS query, combine an assembly's base URL with the
sequence entry point and type specifiers available for that assembly. The entry point
specifies chromosome position, and the type indicates the annotation table
requested. You can view the lists of entry points and types available for an assembly
with requests of the form:
http://genome.ucsc.edu/cgi-bin/das/[db_name]/entry_points
http://genome.ucsc.edu/cgi-bin/das/[db_name]/types
where [db_name] is the UCSC name for the assembly, e.g. hg16, mm4.
For example, here is a query that returns all the records in the refGene table for the
chromosome position chr1:1-100000 on the hg16 assembly:
http://genome.ucsc.edu/cgi-bin/das/hg16/features?segment=1:1,100000;type=refGene
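A query of this form can be assembled programmatically. The sketch below only builds the URL string and follows the pattern of the example above; the function name is hypothetical, and stripping the "chr" prefix is an assumption based on that one example, so check the entry_points listing for the assembly you are using.

```python
DAS_BASE = "http://genome.ucsc.edu/cgi-bin/das"


def das_features_url(db_name, chrom, start, end, table):
    """Build a DAS features query for one segment and one annotation type."""
    # The example in the text writes the segment as "1", not "chr1",
    # so a leading "chr" is stripped here (an assumption of this sketch).
    segment = chrom[3:] if chrom.startswith("chr") else chrom
    return "{}/{}/features?segment={}:{},{};type={}".format(
        DAS_BASE, db_name, segment, start, end, table)
```

Calling das_features_url("hg16", "chr1", 1, 100000, "refGene") reproduces the example query shown above.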
For more information on DAS, see the
Biodas website and the
DAS specification.
| |
|
|
Downloading the UCSC Genome Browser source |
|
|
|
|
Question:
"Where can I download the Genome Browser source code and
executables?"
Response:
The Genome Browser source code and executables are freely
available for academic, nonprofit, and personal
use (see Licensing the Genome Browser
or Blat for commercial licensing requirements).
The latest version of the source code may be downloaded
here.
See Downloading Blat source
and documentation for information on Blat downloads.
| |
|
|
Download restrictions |
|
|
|
|
Question:
"Do you have restrictions on the amount of downloads one can do?"
Response:
Generally, we'd prefer that you not hit our interactive site with programs,
unless they are themselves front ends for interactive sites. We can handle
the traffic from all the clicks that biologists are likely to generate,
but not from programs. Program-driven use is limited to a maximum of one
hit every 15 seconds and no more than 5,000 hits per day.
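A program that must query the site can enforce these limits on itself. The following Python sketch tracks request times and reports how long to wait before the next hit. The class and method names are hypothetical, and treating "per day" as a rolling 24-hour window is an assumption of this sketch.

```python
import time

MIN_INTERVAL = 15.0   # seconds between hits, from the limit above
MAX_PER_DAY = 5000    # hits per day, from the limit above
DAY = 86400           # seconds in the rolling window (assumed 24 hours)


class Throttle:
    """Track request times and report how long to wait before the next hit."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.hits = []  # timestamps of hits within the last 24 hours

    def wait_time(self):
        """Seconds to wait so that the next hit stays within both limits."""
        now = self.clock()
        self.hits = [t for t in self.hits if now - t < DAY]
        wait = 0.0
        if self.hits:
            wait = max(wait, self.hits[-1] + MIN_INTERVAL - now)
        if len(self.hits) >= MAX_PER_DAY:
            wait = max(wait, self.hits[0] + DAY - now)
        return wait

    def record_hit(self):
        """Call immediately after each request is sent."""
        self.hits.append(self.clock())
```

In use, a client would call time.sleep(throttle.wait_time()) before each request and throttle.record_hit() after it. Both limits matter: one hit every 15 seconds would allow 5,760 hits in 24 hours, so the daily cap can still apply.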
If you need to run batch Blat jobs, see
Downloading Blat source
and documentation for a copy of Blat you can run
locally.
| |
|
|
Opening .fa files |
|
|
|
|
Question:
"I am trying to look at the final decoding of the human genome. How can I
open the *.fa files?"
Response:
Microsoft Word or any program that can handle large text files will do.
Some of the chromosomes begin with long blocks of N's. You may want
to search for an A to get past them.
Unless you have a particular need
to view or use the raw data files, you might find it more interesting to
look at the data using the Genome Browser. Type the name of a gene in which
you're interested into the position box (or use the default position),
then click the submit button. In the resulting Genome Browser
display, click the DNA link on the menu bar at the top of the page.
Select the Extended case/color options button at the bottom of the
next page. Now you can color the DNA sequence to display which portions are
repeats, known genes, genetic markers, etc.
| |
|
|
Data differences between downloaded data and browser display |
|
|
|
|
Question:
"I downloaded the genome annotations from your MySQL database tables, but the
mRNA locations didn't match what was showing in the Genome Browser. Shouldn't they
be in synch?"
Response:
Yes. The Genome Browser and Table Browser are both driven by the same
underlying MySQL database. Check that your downloaded tables are from
the same assembly version as the one you are viewing in the Genome Browser. If the
assembly
dates don't match, the coordinates of the data within the tables may differ.
In a very rare instance, you could also be affected
by the brief lag time between the update of the live databases underlying the Genome
Browser and the time it takes for text dumps of these databases to become available
in the downloads directory.
| |
|
|
Strange characters in FASTA file |
|
|
|
|
Question:
"I noticed several characters other than A, C, G,
T, and N in my fasta file, for example y, k,
s, etc. Is the file corrupted or are these characters valid?"
Response:
The characters most commonly seen in sequence are A, C, G,
T, and N, but there are
several other valid characters that are used in clones to indicate
ambiguity about the identity of certain bases in the sequence. It's not uncommon to
see these "wobble" codes at polymorphic positions in DNA sequences. The
following chart (IUPAC-IUB Symbols for Nucleotide Nomenclature: Cornish-Bowden
(1985). Nucl. Acids Res. 13:3021-3030) lists nucleotide symbols, including those
used for ambiguity:
--------------------------------------
Symbol Meaning Nucleic Acid
--------------------------------------
A A Adenine
C C Cytosine
G G Guanine
T T Thymine
U U Uracil
M A or C
R A or G Purine
W A or T
S C or G
Y C or T Pyrimidine
K G or T
V A or C or G
H A or C or T
D A or G or T
B C or G or T
X G or A or T or C
N G or A or T or C
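To check a FASTA sequence against this chart programmatically, the table can be transcribed into a lookup. The Python sketch below does so; the dictionary mirrors the chart above, and the function names are hypothetical.

```python
# Symbol -> possible bases, transcribed from the chart above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def unexpected_symbols(sequence):
    """Return the set of characters that are not IUPAC nucleotide symbols."""
    return {ch for ch in sequence if ch.upper() not in IUPAC}


def ambiguity_positions(sequence):
    """Return (index, symbol, possible bases) for each ambiguity code."""
    return [(i, ch, IUPAC[ch.upper()])
            for i, ch in enumerate(sequence)
            if len(IUPAC.get(ch.upper(), "")) > 1]
```

An empty result from unexpected_symbols suggests the file is not corrupted, at least with respect to its alphabet; ambiguity_positions shows where the "wobble" codes occur.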
| |
|
|
Selection of GenBank ESTs |
|
|
|
|
Question:
"I am interested in ESTs. How do you select which ones from GenBank to display in the
Genome Browser?"
Response:
All ESTs in GenBank on the date of the track data freeze for the given organism are
used - none are discarded. When two ESTs have identical sequences, both are
retained because this can be significant corroboration of a splice site.
ESTs are aligned against the genome using the Blat program. When a single EST aligns
in multiple places, the alignment having the highest base identity is found. Only
alignments that have a base identity level within a selected percentage of
the best are kept. Alignments
must also have a minimum base identity to be kept. For more information on the
selection criteria specific to each organism, consult the description page accompanying
the EST track for that organism.
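The two filtering rules above can be sketched in a few lines of Python. This is only an illustration of the logic as described, not the actual UCSC pipeline: the function name and both thresholds are hypothetical, and reading "within a selected percentage of the best" as a difference in percentage points is an assumption.

```python
def filter_alignments(identities, within_pct, min_identity):
    """Keep alignments near the best hit and above an absolute floor.

    identities   -- base identity (percent) of each alignment of one EST
    within_pct   -- keep alignments within this many percentage points
                    of the best one (hypothetical parameter)
    min_identity -- minimum base identity required (hypothetical parameter)
    """
    if not identities:
        return []
    best = max(identities)
    return [x for x in identities
            if x >= best - within_pct and x >= min_identity]
```

The actual cutoffs vary by organism and are listed on each EST track's description page.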
The maximum intron length allowed by Blat is 500,000 bases, which may
eliminate some ESTs with very long introns that might otherwise align. If an EST
aligns non-contiguously (i.e. an intron has been spliced out), it is also a candidate
for the Spliced EST track, provided it meets various quality controls for intron and
exon length and match quality. Start and stop coordinates of each alignment block are
available from the appropriate table within the
Table Browser.
Note that only 250 EST tracks can be viewed at a time within the browser. If more
than 250 tracks exist for the selected region, the display defaults to a denser display
mode to
prevent the user's web browser from being overloaded. You can restore the EST track
display to a fuller display mode by zooming in on the chromosomal range or by using
the EST track filter to restrict the number of tracks displayed.
For tracks such as Non[Organism] ESTs and Non[Organism] mRNAs, some selection is done
on the full set at GenBank. If a sequence is too divergent from the organism's genome
to generate a significant Blat hit, it is not included in the track.
| |
|
|
EST strand direction |
|
|
|
|
Question:
"Could you help me with my interpretation of EST data? If the EST is taken
from the minus (-) strand, does this always mean that the transcript is generated
on the minus strand? Are two corresponding ESTs that are assigned
- and + always complementary?
I want to confirm the strand assignment for two
human ESTs:
- BQ016549 (chr22:22,310,674-22,332,143 on hg18): + strand in text and - strand in graphical
display
- AA928010 (chr22:20,345,264-20,354,528 on hg18): - strand in text and + strand in graphical
display.
The graphical display goes with the orientation of the gene in that location."
Response:
From the examples above, it can be seen that the strand to which an EST aligns is not
necessarily reflected in the direction of transcription shown by the arrows in the
display. When UCSC downloads mRNAs and ESTs from GenBank and aligns them to a genome
assembly using Blat, each EST aligns to the + or - strand (forward or reverse direction)
of the genome, which we record as + or - in the strand field of the corresponding database
table, e.g. all_ests or chrN_est. The strand information (+/-) therefore
indicates the direction of the match between the EST and the matching genomic
sequence. It bears no relationship to the direction of transcription of the RNA with
which it might be associated. Determining the direction of transcription for ESTs is
not an easy task so we do some calculations to make the best guess for the
transcription direction.
ESTs are sequenced from either the 5' or the 3' end. When sequenced from the 5' end, the resulting
sequence is the same as that of the mRNA which it represents. With a 3' end read, the resulting
sequence matches the opposite strand of the cDNA clone. Therefore, it is the reverse complement of
the actual mRNA sequence. A problem occurs if the EST contributor reverse-complements
the 3'-read sequence before depositing it into GenBank, with the idea that people will want
the mRNA (transcription-direction) sequence. It is not always possible to determine if this has
been done. Therefore, we do some calculations to try to determine the correct direction of
transcription for the EST sequence.
If an EST alignment produces canonical introns (with gt-ag splice-site pairs), this is used
to determine the transcription direction. For example when an EST is aligned to the genome, a
canonical intron would look like this:
NNNNexonNNNNgtnnnnintronnnnnnnnagNNNNexon
Here, the two nucleotides on either end of the intron show the canonical gt-ag splice site pairs.
To find transcription direction, we use a method that relies on finding gt-ag canonical pairs in one
direction more often than in the opposite direction. The calculation is:
gt/ag introns minus ct/ac introns = intronOrientation
The sign of this calculated intronOrientation field (stored in the estOrientInfo table) shows the
orientation of the transcript relative to the EST. Therefore, if intronOrientation is positive,
then the EST appears in the display with the arrows pointing in the same direction as the EST
alignment. If intronOrientation is negative, then the arrows point in the opposite direction. If
no introns exist or all of the introns are non-canonical, then intronOrientation is set to zero.
In both BQ016549 and AA928010 (in the example above), the intronOrientation is negative; therefore,
the arrows on the Genome Browser display point in the opposite direction to that indicated by the
alignment on the EST details page. Note: A low intronOrientation number can cause an incorrect
assignment of transcription direction when calculated in this way.
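The calculation described above can be sketched in Python as follows. This illustrates the stated formula and is not the code UCSC runs: the function names and the input format (a list of intron sequences read on the strand to which the EST aligned) are assumptions. The ct/ac test works because the reverse complement of a gt...ag intron begins with ct and ends with ac.

```python
def intron_orientation(introns):
    """gt/ag introns minus ct/ac introns, as in the formula above.

    introns -- genomic intron sequences read on the strand to which
               the EST aligned (input format assumed for this sketch)
    """
    score = 0
    for seq in introns:
        s = seq.lower()
        if s.startswith("gt") and s.endswith("ag"):
            score += 1   # canonical in the aligned direction
        elif s.startswith("ct") and s.endswith("ac"):
            score -= 1   # canonical on the opposite strand
    return score


def arrow_direction(aligned_strand, orientation):
    """Strand indicated by the display arrows.

    Returns None when orientation is zero, since the text above does
    not say which direction is drawn in that case.
    """
    if orientation == 0:
        return None
    if orientation > 0:
        return aligned_strand
    return "-" if aligned_strand == "+" else "+"
```

For BQ016549, which aligns to the + strand and has a negative intronOrientation, arrow_direction("+", -1) gives "-", matching the graphical display described in the question.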
The alignment details pages and the Table Browser do not take the intron orientation
into account. They show only the alignment of the
GenBank sequence (as given) to the genome. If the alignment is used to
retrieve DNA sequence from the genome, the DNA sequence will look
similar to the GenBank sequence (not its complement).
| |
|
|
Missing RefSeq ID |
|
|
|
|
Question:
"Why isn't my refseq ID in your database?"
Response:
It may have been added after we last downloaded data from GenBank, or it may have
been replaced or removed. You can check the submission date and status of an accession
on the
NCBI
Entrez Nucleotide site.
| |
|
|
Finished vs. draft segments |
|
|
|
|
Question:
"Do chrN.fa tables contain both finished and draft segments? If so,
how do you determine which segments are finished?"
Response:
Yes, these tables contain both finished and draft segments. Use the
corresponding chrN_gold table to look them up. The quality of the draft
varies. In
general, the larger the contig it is in, the better the quality. The
quality of the last 500 bases on either end of a contig tends to be
lower than the rest of the contig.
How do you determine the accuracy? The
base-calling program Phred analyzes
the traces from the sequencing machines
and assigns a quality score to these. These quality scores are used by the
Phrap assembly program, which gives
quality scores for the bases on the assembly as well.
| |
|
|
chrN_random tables |
|
|
|
|
Question:
"What are the chrN_random_[table] files in the human assembly? Why are they
called random? Is there something biologically random about the sequence in
these tables or are they just not placed within their given chromosomes?"
Response:
In the past, these tables contained data related to sequence that is
known to be in a particular chromosome, but could not be reliably ordered
within the current sequence.
Starting with the April 2003 human assembly, these tables also include data for
sequence that is not in a finished state, but whose location in the chromosome is
known, in addition to the unordered sequence.
Because this sequence is not quite finished, it could not be included in the
main "finished" ordered and oriented section of the chromosome.
Also, in
a very few cases in the April 2003 assembly, the random files contain data related to sequence for alternative
haplotypes.
This is present primarily in chr6, where we have included two alternative
versions of the MHC region in chr6_random. There are a few clones in
other chromosomes that also correspond to a different haplotype. Because the
primary reference sequence can only display a single haplotype, these
alternatives were included in random files. In subsequent assemblies,
these regions have been moved into separate files (e.g. chr6_hla_hap1).
| |
|
|
Chromosome Un |
|
|
|
|
Question:
"What is ChrUn?"
Response:
ChrUn contains clone contigs that can't be confidently placed on a
specific chromosome. For the chrN_random and chrUn_random files, we
essentially just concatenate together all the contigs into short
pseudo-chromosomes. The coordinates of these are fairly arbitrary,
although the relative positions of the coordinates
are good within a contig. You can find more information about the data organization
and format on the Data
Organization and Format page.
| |
|
|
Chromosome M |
|
|
|
|
Question:
"What is chromosome M (chrM)?"
Response:
Mitochondrial DNA.
| |
|
|
N characters at beginning of human chr22 |
|
|
|
|
Question:
"When I download human chr22 from your web site, the unzipped file contains only
N's."
Response:
There is a large block of N's at the beginning and end of chr22. Search
for an A to bypass the initial group of N's.
| |
|
|
Erroneous duplicated chrY_random region on Mouse Build 34 (mm6) |
|
|
|
|
Question:
"On the mm6 assembly, I've found duplicate contigs
that are placed on both chrY and chrY_random. Is this
intentional?"
Response:
On the mm6 assembly, chrY_random erroneously contains
a region duplicated from chrY. Because NCBI
discovered this assembly problem after the UCSC
Genome Browser was processed, we were not able to
remove it from mm6 prior to the browser's release.
The duplicated section occupies chrY:1-696,521 and
chrY_random:29,615,053-30,311,573 (the end of the
chromosome) and includes the following repeated
fragments:
- AC139318.5
- AC134433.3
- AC145392.2
- AC148319.2
- AC145571.3
- AC145393.4
The fragments
are assembled into the contig NT_111995 for
chrY_random and also appear (under different names)
as regions on contigs MmY_110865_34, MmY_78990_34
and NT_078925.
| |
|
|
Problems with Mouse Build 32 (mm4) |
|
|
|
|
Question:
"I have heard that the Build 32 mouse assembly isn't
as good as the Build 30 assembly. Can you clarify?"
Response:
Unfortunately, there appear to be some problems with
the Build 32 assembly. Ensembl has conducted an analysis
of the assembly and has attributed the
problems to incorrect mapping information that led to
the generation of artificial duplications and some
incorrect flips in orientation. You can read more
information about the problems Ensembl identified and
review a list of the chromosomes and genes most likely
to be affected by these issues on the Ensembl
Mus musculus web page.
| |
|
|
Mapping chimp chromosome numbers to human chromosome numbers |
|
|
|
|
Question:
"How do the chimp and human chromosome numbering
schemes compare?"
Response:
The following table shows the mapping of chromosomes in
the chimp draft assemblies to human chromosomes.
Starting with the panTro2 assembly, the numbering scheme
has been changed to reflect a new standard that
preserves orthology with human chromosomes. Initially
proposed by E.H. McConkey in 2004, the new numbering
convention was subsequently endorsed by the
International Chimpanzee Sequencing and Analysis
Consortium. This standard assigns the identifiers
"2a" and "2b" to the two chimp chromosomes that fused in
the human genome to form chromosome 2 and renumbers the
other chromosomes to more closely match their human
counterparts. As a result, chromosomes 2 and
23 (present in the panTro1 assembly) do not exist in
later versions.
Human Chr | Chimp Chr (panTro1) | Chimp Chr (panTro2) |
1 | 1 | 1 |
2 (part) | 12 | 2a |
2 (part) | 13 | 2b |
3 | 2 | 3 |
4 | 3 | 4 |
5 | 4 | 5 |
6 | 5 | 6 |
7 | 6 | 7 |
8 | 7 | 8 |
9 | 11 | 9 |
10 | 8 | 10 |
11 | 9 | 11 |
12 | 10 | 12 |
13 | 14 | 13 |
14 | 15 | 14 |
15 | 16 | 15 |
16 | 18 | 16 |
17 | 19 | 17 |
18 | 17 | 18 |
19 | 20 | 19 |
20 | 21 | 20 |
21 | 22 | 21 |
22 | 23 | 22 |
X | X | X |
Y | Y | Y |
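When chromosome names in panTro1-based files need to be relabeled, the table above can be transcribed into a lookup. The Python sketch below does this; the function names are hypothetical. It only renames chromosomes: coordinates differ between assemblies and must be converted separately.

```python
# panTro1 chromosome -> panTro2 chromosome, transcribed from the table above.
PANTRO1_TO_PANTRO2 = {
    "1": "1", "12": "2a", "13": "2b", "2": "3", "3": "4", "4": "5",
    "5": "6", "6": "7", "7": "8", "11": "9", "8": "10", "9": "11",
    "10": "12", "14": "13", "15": "14", "16": "15", "18": "16",
    "19": "17", "17": "18", "20": "19", "21": "20", "22": "21",
    "23": "22", "X": "X", "Y": "Y",
}


def pantro2_to_human(chrom):
    """panTro2 names match human, except 2a and 2b, which are both human 2."""
    return "2" if chrom in ("2a", "2b") else chrom


def rename_pantro1(chrom):
    """Translate a panTro1 name such as 'chr12' to its panTro2 name."""
    prefix = "chr" if chrom.startswith("chr") else ""
    return prefix + PANTRO1_TO_PANTRO2[chrom[len(prefix):]]
```

For example, rename_pantro1("chr12") returns "chr2a", and pantro2_to_human("2a") returns "2".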
| |
|
|
Converting genome coordinates between assemblies |
|
|
|
|
Question:
"I've been researching a specific area of the human genome on the current assembly,
and now you've just released a new version. Is there an easy way to locate
my area of interest on the new assembly?"
Response:
You can migrate data from one assembly to another by using the
blat alignment tool
or by converting assembly coordinates. There are two conversion
tools available on the Genome Browser web site: the
Convert utility and the LiftOver tool.
The Convert utility,
which is accessed from the menu on the Genome Browser
annotation tracks page, supports forward, reverse, and
cross-species conversions, but does not accept batch
input.
The LiftOver tool,
accessed via the Utilities link on the Genome Browser
home page, also supports forward, reverse, and
cross-species conversions, as well as batch conversions.
If you wish to update a large number of coordinates
to a different assembly and have access to a Linux
platform, you may find it useful to try the command-line
version of the LiftOver tool. The executable file for
this utility can be downloaded
here.
LiftOver requires a UCSC-generated over.chain
file as input. Pre-generated files are available for
selected assemblies from the
Downloads page.
If the desired file is not available, send a request to
the
genome mailing list
and we may be able to provide you with one.
| |
|
|
Linking gene name with accession number |
|
|
|
|
Question:
"I have the accession number for a gene and would like to
link it to the gene name. Is there a table that shows both
pieces of information?"
Response:
If you are looking at the RefSeq Genes, the
refFlat
table contains both the gene name (usually a
HUGO Gene Nomenclature Committee ID) and its accession
number. For the Known Genes,
use the kgAlias table.
| |
|
|
Obtaining a list of Known Genes |
|
|
|
|
Question:
"How can I obtain a complete list of all the genes in
the UCSC Known Genes table for a particular organism?"
Response:
To obtain a complete copy of the entire Known Genes
data set for an organism, open the Genome Browser
Downloads page,
jump to the section specific to the organism, click the
Annotation database link in that section, then click the
link for the knownGene.txt.gz table.
Data for a specific region or chromosome may be
obtained from the Table Browser by selecting the
"Genes and Gene Prediction Tracks" group, the
"Known Genes" track and the
"knownGene" table. Set the position to the
region of interest, then click the "get
output" button.
| |
|
|
Repeat-masking data |
|
|
|
|
Question:
"What version of RepeatMasker do you use on your data?
Which flags do you use?"
Response:
UCSC uses the latest versions of RepeatMasker and
repeat libraries available on the date when the
assembly data is processed. RepeatMasker version
information can usually be found in the README text for
the assembly's bigZips
downloads directory.
Masking is done using the RepeatMasker -s
flag. For mouse repeats, we also use -m.
In addition to RepeatMasker, we use the Tandem Repeat
Finder (trf) program, masking out repeats of period 12
or less. The repeats are just "soft" masked.
Alignments are allowed to extend through repeats, but
not initiate in them.
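Soft masking, as opposed to hard masking with N's, marks repeats by writing them in lowercase while leaving the bases readable. Assuming that convention, the following Python sketch recovers the masked intervals from a sequence string; the function name is hypothetical.

```python
import re


def masked_intervals(sequence):
    """Return 0-based, half-open (start, end) runs of lowercase bases."""
    return [m.span() for m in re.finditer(r"[a-z]+", sequence)]
```

To ignore the masking entirely, convert the sequence to uppercase before use.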
| |
|
|
Availability of repeat-masked data |
|
|
|
|
Question:
"Are the repeat annotation files available for every chromosome?"
Response:
Yes, you can obtain the repeat-masked files via the Table Browser or from the
organism's annotation database downloads directory. The RepeatMasker annotation
tables are named
chrN_rmsk (where N represents the chromosome number) and the
Tandem Repeat Finder (TRF) tables are named simpleRepeat.
| |
|
|
RepeatMasker version differences - UCSC vs. RepeatMasker website |
|
|
|
|
Question:
"When I run RepeatMasker independently from the
RepeatMasker web server, my results vary from those of
UCSC. What's the cause?"
Response:
UCSC occasionally uses updated versions of the
RepeatMasker software and repeat libraries that are not
yet available on the RepeatMasker website (see
Repeat-masking data for more
information).
| |
|
|
Obtaining promoter sequence |
|
|
|
|
Question:
"How can I fetch promoter sequence upstream of a gene?"
Response:
The UCSC Genome Browser offers several ways to obtain this information,
depending on your requirements.
The Genome Browser downloads
site provides prepackaged downloads of 1000 bp, 2000 bp, and 5000 bp upstream
sequence for RefSeq genes that have a coding portion and annotated 5' and 3' UTRs. You
can obtain these from the bigZips downloads directory for the assembly of interest.
To fetch the upstream sequence for a specific gene, use the
Table Browser.
Enter the genome, assembly, and select the knownGene table. Paste the gene name
or accession number in the identifier field. Choose sequence for the output format
type, then click the get output button. On the next page, select genomic. On the
final page, you will have the opportunity to configure the amount of upstream
promoter sequence to fetch, along with several other options. Click Get Sequence
when you've finished configuring the output.
You can also use the Genome Browser to obtain sequence for a specific gene.
Open the Genome Browser window to display the gene in which you're
interested. Click the entry for the gene in the RefSeq or Known Genes track, then
click the Genomic Sequence link. Alternatively, you can click the DNA link in
the top menu bar of the Genome Browser tracks window to access options for displaying
the sequence.
The Stanford Human Promoters track on the
UCSC
Custom Annotation Tracks page shows promoters for some of the human assemblies.
| |
|
|
Data from Evolutionary Conservation Score tracks |
|
|
|
|
Question:
"Where can I download the conservation score data from the Human/Mouse
Evolutionary Conservation Score track?"
Response:
The conservation score data are stored in a group of tables in the annotation
database downloads directory.
The naming conventions of the tables vary among releases. In earlier
assemblies, table names are of the form chrN_humMusL, chrN_zoom1_humMusL, or
chrN_zoom2500_humMusL. In later releases, the tables are named using
specific release numbers, such as chrN_hg16Mm3. The tables within a given set
differ by the number of bases/score interval and are used to generate the browser
displays at different zooming levels.
| |
|
|
Minus strand coordinates - axtNet |
|
|
|
|
Question:
"I downloaded the axtNet alignments between the latest human and mouse assemblies.
I found that some of the alignments listed in the axtNet
files do not agree with what is shown in the browser."
Response:
Is this alignment on the minus strand? Minus strand coordinates in axt files
are handled differently from how they are handled in the Genome Browser. To convert
axt minus strand coordinates to Genome Browser coordinates, use:
start = chromSize + 1 - axtEnd
end = chromSize + 1 - axtStart
See an explanation of coordinate transforms in the genomeWiki.
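The conversion above translates directly into code. In this Python sketch the function name is hypothetical, and chrom_size must be the length of the chromosome on which the minus-strand alignment lies.

```python
def axt_minus_to_browser(chrom_size, axt_start, axt_end):
    """Convert minus-strand axt coordinates to Genome Browser coordinates."""
    start = chrom_size + 1 - axt_end
    end = chrom_size + 1 - axt_start
    return start, end
```

The transform is its own inverse, so applying it to browser coordinates gives back the axt minus-strand coordinates. Plus-strand alignments need no conversion.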
| |
|
|
Mapping UCSC STS marker IDs to those of other groups |
|
|
|
|
Question:
"How do I map the STS genetic marker IDs in the genome browser to the
IDs assigned by other groups? "
Response:
We assign our own IDs to each of the STS markers, but we also track
the UniSTS IDs for each marker in the downloadable stsInfo2 table.
To determine the location of a specific marker, look up the marker's name
in the stsAlias table to determine the UCSC ID assigned to the
marker, and then use this ID to look it up in the stsMap table where the marker
is located. For example, D10S249 has UCSC ID 2880 and is located at chr10:240791-241019.
| |
|
|
deCODE map data |
|
|
|
|
Question:
"Where can I get more information about the deCODE map?"
Response:
You can obtain this information from the combination of a couple of tables.
The stsMap table contains the physical position of all STS markers,
including those on the deCODE map. This file also contains information about
the position on the genome-wide maps, including the deCODE map. A second file,
stsInfo2, contains additional information about each marker, including aliases,
primer sequence information, etc. This table is related to the first table by an
ID (the identNo field in both files).
| |
|
|
Direct MySQL access to data |
|
|
|
|
Question:
"Is it possible to run SQL queries directly on the
database rather than using the Table Browser interface?"
Response:
In response to requests from Genome Browser users, we have set up a MySQL
database for public access at genome-mysql.cse.ucsc.edu. This new server
allows MySQL access to the same set of data currently available on our
public Genome Browser site. The data are synchronized weekly with the main
databases on our public site.
During this synchronization period, the MySQL server can be
intermittently out of sync with the main website for a short period.
The weekly synchronization takes place on Monday mornings
from 4:00 am to 9:00 am Pacific Time.
To connect to the database, you must use a computer on which the MySQL
client libraries have been installed. We recommend you use the most current
version of v5.0 MySQL clients, which may be downloaded from
http://dev.mysql.com/downloads/mysql/5.0.html.
Connect to the MySQL server
using the command:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
The -A flag is optional but is recommended for speed.
Once connected to the database, you may use a wide range of MySQL commands
to query the database.
As a courtesy to others, please observe the following
guidelines when using the database:
-
Avoid excessive or heavy queries that may impact the server performance.
Inappropriate query use will result in a restriction of access. If you plan
to execute a query that you think may be excessive, contact UCSC first to
avoid the possibility of having your access blocked.
-
Bot access and excessive program-driven use are not permitted.
-
Attachments by local mirror sites are prohibited.
The MySQL database can also be used by the numerous utilities
in the kent source tree. Add the following
specifications to your $HOME/.hg.conf file (remember to chmod your .hg.conf file to 600 permissions):
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
If you prefer a more structured graphical interface to the UCSC database
tables, use the
Table Browser.
System problems should be reported to
genome-www@soe.ucsc.edu.
Send questions regarding the database contents or queries to
genome@soe.ucsc.edu.
Messages sent to this address will be posted to the
moderated genome mailing list, which is archived on a public
Web-accessible pipermail archive. This archive may be
indexed by non-UCSC sites such as Google.
| |
|
|
Name of fourth column in BED output |
|
|
|
|
Question:
"When using the Table Browser to extract exons from a Gene track, what does the 'Name' column (fourth BED column) refer to?"
Response:
The fourth column of the BED output contains a lot of information separated by underscores. For example:
uc009vjk.2_cds_1_0_chr1_324343_f
This information is represented as follows:
ucscId_sequenceType_sequenceTypeNumber_basesAdded_chromosome_positionOfFirstBaseOfItem_strand
-
UCSC ID: our identification for the transcripts in the UCSC Genes track.
-
Sequence Type: exons, introns, cds, utr5, etc.
-
Sequence Type Number: each transcript yields one row per item of the
requested sequence type (for example, one row per cds block or intron), and
this number identifies which item the row represents, counting from 0.
So, if you requested exons, and a particular
transcript has 10 exons, you will see a row for each one and in this position
they will be numbered 0-9.
-
Bases Added: number of bases added to the regions requested.
-
Chromosome: chromosome number the item is on.
-
Position of First Base of Item: if you have specified bases added to the
requested features (for example, exons plus 10 bases on each end), then
columns 2 and 3 of the output are not the exact coordinates of the exon;
they start 10 bases before and end 10 bases after it. This field is an
easy way to see where the actual feature starts as displayed in the
browser. The qualifier "as displayed in the browser" matters because the
coordinates in our tables almost always have 0-based starts (as they do
in columns 2 and 3 of this output) but display as 1-based in the browser
(for more info see this FAQ).
The start position in this field of the fourth column is 1-based:
it is the exact coordinate on which the feature starts as displayed in the browser.
-
Strand: forward (f) or reverse (r) strand.
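The name field can be split back into its parts in code. The Python sketch below follows the layout described above. The function name and dictionary keys are hypothetical, and it assumes that the item ID itself contains no underscore, as is the case for IDs such as uc009vjk.2. Chromosome names that contain underscores (e.g. chr6_hla_hap1) are handled by splitting the last two fields from the right.

```python
def parse_bed_name(name):
    """Split a Table Browser BED name field into its seven components."""
    # The first four fields are taken from the left.
    item_id, seq_type, number, bases_added, rest = name.split("_", 4)
    # The last two fields are taken from the right, so a chromosome
    # name with underscores stays in one piece.
    chrom, first_base, strand = rest.rsplit("_", 2)
    return {
        "id": item_id,
        "sequence_type": seq_type,
        "sequence_type_number": int(number),
        "bases_added": int(bases_added),
        "chromosome": chrom,
        "first_base": int(first_base),          # 1-based, as displayed
        "start_0_based": int(first_base) - 1,   # table-style start
        "strand": strand,
    }
```

For the example above, parse_bed_name("uc009vjk.2_cds_1_0_chr1_324343_f") yields chromosome "chr1", first_base 324343, and strand "f".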