Biological databases and database searching
As part of the exercises you will use a few of the hundreds of available biological databases. These are often freely available to the academic community and can be accessed via the internet or downloaded to your own computer. Since these databases are interconnected, they provide a highly useful resource for retrieving and analyzing biological information. These exercises will introduce you to several biological databases and several methods for searching them.
Useful addresses for the exercises:
Exercise 1: NAR database issue
The 2011 Database Issue of Nucleic Acids Research is the 18th in a series dedicated to factual biological databases. Such databases are an essential resource for working biologists. This compilation describes the most important of these databases and introduces newly compiled databases that provide specialist information in the biological area.
NAR Online contains hotlinks to all of the databases in the compilation as well as brief summaries of their content.
Go to the NAR 2011 database issue (http://www3.oup.co.uk/nar/database/c/)
- Find information on, e.g., GenBank (Nucleotide Sequence Databases), GeneCards (General Human Genetic Databases), KEGG (Metabolic Pathways) and other databases that interest you. Try to find out how many nucleotides, sequences and organisms are currently in GenBank. What is the size of GenBank (how many kilobytes or megabytes)? How can you download this database to your own computer?
Exercise 2: The GeneCards database
- If you know the name of your gene, the GeneCards database is a very useful resource for retrieving information about it. Go to the GeneCards database (http://www.genecards.org/) and retrieve the information for the HBA2 gene. GeneCards provides links to many biological and biomedical databases. Follow some of the links to get an impression of the kind of information you can retrieve about the HBA2 gene.
- What are the gene names of the two neighbors of HBA2?
- View the three-dimensional protein structure of an HBA2 molecule with Proteopedia (what is this?).
- What is the molecular function of HBA2 according to the Gene Ontology (GO)? View the (graphical) GO tree in the AmiGO browser.
Exercise 3: Sequence formats
One problem with handling biological data is the existence of many different formats. Software may require a certain format as input, so one often has to convert data from one format to another. As an example, consider the large number of data formats used to represent and store DNA and protein sequences.
- Explore the different Sequence formats
- What is the Fasta format?
- Open the Readseq website. The file NM_000517.txt contains a sequence in GCG format (right-click > open in new window).
- Copy and paste the sequence, select "Pearson|Fasta|fa" as the output format, select "View in browser" and press the submit button.
- Change the output format and look at the different file formats
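To make the Fasta format concrete: a record consists of a header line starting with ">" followed by one or more lines of sequence. The block below is a minimal Python sketch of reading and writing that layout. It is not how Readseq works internally; the function names are our own, and the record "toy_seq" is a made-up placeholder, not a real database entry.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Format one record as FASTA, wrapping the sequence at `width` characters."""
    lines = [">" + header]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Made-up example record (hypothetical, not taken from any database).
example = ">toy_seq hypothetical example\nACGTACGTAC\nGGTTAACC\n"
records = parse_fasta(example)
print(records)
print(format_fasta("toy_seq", "ACGTACGTACGGTTAACC", width=10), end="")
```

Comparing this with the other Readseq output formats shows that most of them carry the same sequence and differ mainly in the amount of annotation and in line layout.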
Exercise 4: Searching GenBank with Entrez
- Go to the GenBank database (http://www.ncbi.nlm.nih.gov/Entrez/) and retrieve the information about the mRNA sequence with accession code NM_000517. Which gene is this?
- What kind of information is present in the retrieved database record? What is the sequence format? Convert the sequence to Fasta format (do not use the conversion tool from exercise 3)
- Select the 'Graphic' view to get an overview of the location of the features in this mRNA sequence.
- GenBank is also linked to many databases. To retrieve additional information you can use the 'Links' and then go to the Gene database. This provides you further information and links to other databases. Follow the links to PubMed, OMIM and the Protein database to retrieve additional information.
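The exercise itself is meant to be done in the web interface. As an optional aside, the same record can also be requested programmatically through the NCBI E-utilities. The sketch below only builds the request URL for the `efetch` service and leaves the actual download commented out, since that needs internet access; the helper names `build_efetch_url` and `fetch_record` are our own, not part of any NCBI library.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="fasta"):
    """Build an efetch URL for a nucleotide record (rettype 'fasta' or 'gb')."""
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_record(accession, rettype="fasta"):
    """Download the record as text (requires internet access)."""
    with urlopen(build_efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")


url = build_efetch_url("NM_000517")
print(url)
# Uncomment to download the record:
# print(fetch_record("NM_000517"))
```

Passing `rettype="gb"` instead requests the full GenBank flat file, which is the format you see by default in the Entrez record view.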
Exercise 5: Searching SwissProt with SRS
- Go to the Sequence Retrieval System (http://srs.ebi.ac.uk) and retrieve the record of the human HBA (HBA1 and HBA2) protein. What is the accession code of this protein? Have a look at the information that is provided about the protein and the links given to external databases.
- Go one page back and select 'InterProScan', 'Charge' or any other analysis to retrieve additional information about the protein domains, structure, etc.
Exercise 6: Blast a sequence
You have obtained a sequence from your sequencing experiment:
Q2.seq (right-click > open link in new window/tab).
- BLAST this sequence against the nucleotide database (nucleotide collection (nr/nt)) with blastn. Can you identify the gene that corresponds to this sequence? What is the source of the first 280 bases in the sequence?
Exercise 7: Genome browser
1. Go to the UCSC genome browser for this exercise and select the hg19 version. Locate the gene corresponding to NM_000517.
2. The genome browser is able to display many different 'tracks' with information about the region of the genome you selected. In the graphical view of NM_000517, scroll down to turn tracks on or off.
Exercise 8: The Human Gene Mutation Database
In the living cell DNA undergoes frequent chemical change, especially when it is being replicated. Most of these changes are quickly repaired. Those that are not result in a mutation. The Human Gene Mutation Database (HGMD; http://www.hgmd.org/; registration is required, so skip this exercise if you do not want to register) contains information about these and other mutations in DNA.
- In sickle-cell disease, a replacement of A by T at the 17th nucleotide of the gene for the beta chain of hemoglobin changes the codon GAG (glutamic acid) to GTG (valine). Thus the 6th amino acid in the chain becomes valine instead of glutamic acid. Go to the HGMD and retrieve information about this mutation.
(Alternative: go to MutDB (http://mutdb.org/), search for the HBB gene, select DBSNP:rs334 (amino acid position 7, mutant type amino acid V), look at the annotation info and follow the link to dbSNP Source: rs334.)
- In patients with cystic fibrosis more than 200 mutations have been found in the CFTR gene. Each of these mutations occurs in a huge gene that encodes a protein (1480 amino acids) responsible for transporting chloride ions out of cells. The mature transcript of the gene encompasses over 6000 nucleotides, spread over 27 exons on chromosome 7. Defects in the protein cause the various symptoms of the disease. Unlike sickle-cell disease, no single mutation is responsible for all cases of cystic fibrosis. People with cystic fibrosis inherit two mutant genes, but the mutations need not be the same.
What is the effect of the mutation 'TTA-->TGA' in codon 1059? (First retrieve the HGMD entry for the CFTR gene and go to the nucleotide substitutions. The entry for codon 1059 lists the following information: CM940269 1059 TTA-TGA Leu-Term Cystic fibrosis. Thus, Leu is replaced with a stop codon, giving a shorter protein.)
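The reasoning behind both mutations in this exercise can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the codon table holds only the four codons mentioned in the text (a full table has 64 entries), the helper names are our own, and nucleotide 17 is numbered as in the sickle-cell description above, i.e. counting from the first codon of the mature beta chain. Databases that also count the initiator methionine report the same residue as position 7.

```python
# Only the codons needed for the two examples; a full table has 64 entries.
CODON_TABLE = {
    "GAG": "Glu",
    "GTG": "Val",
    "TTA": "Leu",
    "TGA": "Stop",
}


def codon_position(nucleotide_number):
    """Map a 1-based coding nucleotide number to (codon number, position in codon)."""
    codon_number = (nucleotide_number - 1) // 3 + 1
    position_in_codon = (nucleotide_number - 1) % 3 + 1
    return codon_number, position_in_codon


def substitute(codon, position_in_codon, new_base):
    """Return the codon with the base at the 1-based position replaced."""
    index = position_in_codon - 1
    return codon[:index] + new_base + codon[index + 1:]


def describe(wild_type, mutant):
    """Describe the amino acid change caused by a codon substitution."""
    return f"{wild_type}>{mutant}: {CODON_TABLE[wild_type]} -> {CODON_TABLE[mutant]}"


# Sickle-cell example: nucleotide 17 falls in codon 6, second position.
codon_number, position = codon_position(17)
sickle_codon = substitute("GAG", position, "T")
print(codon_number, position, sickle_codon)
print(describe("GAG", sickle_codon))

# CFTR example: codon 1059, TTA to TGA, introduces a premature stop.
print(describe("TTA", "TGA"))
```

The first case is a missense mutation (one amino acid replaced by another), the second a nonsense mutation (an amino acid codon replaced by a stop codon), which is why HGMD lists it as Leu-Term.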
Exercise 9: Using KEGG to identify known metabolic disorders
Hemolytic anemia can be caused by glucosephosphate isomerase (GPI) deficiency. The KEGG database provides the tools to identify the pathway in which this enzyme is active.
- Go to the KEGG database (http://www.genome.ad.jp/kegg/) and then go to the GENES database. Locate the gene GPI (in human, hsa). To which enzyme does this gene correspond?
- Next click on the enzyme (in the Definition field) to proceed to the enzyme database. In which pathways is this enzyme active?
- What reaction does this enzyme catalyse?
- Proceed to the Glycolysis pathway map, which now indicates the enzymes (red outlined boxes) and the corresponding reaction steps.