DNA technology course

Notes DNA technology

Query a public database

Exercise 1

Examination of a GenBank entry

  • Go to the GenBank website
  • Search for “DMD AND Homo sapiens” and select CoreNucleotide (GenBank)
  • Select the RefSeq tab
  • Select transcript variant Dp427m (NM_004006)
  • Explore the information, e.g. gene, coding sequence (CDS), PubMed
  • Find the variation in region 3713


Retrieve a sequence from the database and store it locally

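The FASTA format is a simple text format for sequences that many programs can read: each record starts with a header line beginning with ">", followed by one or more lines of sequence.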

  • Search in the GenBank database for 'human mRNA HBA2' (NM_000517)
  • Set Display (top left of the page) to FASTA
  • Send to File, save as NM_000517.txt (this only shows how to save information from GenBank; we will not need this sequence again)
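
As an illustration of the format, the following Python sketch reads FASTA text into (header, sequence) pairs. It assumes the usual layout, a header line starting with ">" followed by sequence lines. The function name parse_fasta and the toy record are made up for this example and are not part of any tool used in the course.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy record; the header and the bases are invented for illustration.
example = """>toy_sequence example record
ACGTACGTAC
GGTTAA
"""
records = parse_fasta(example)
for name, sequence in records:
    print(name, len(sequence))
```

A file saved from GenBank in FASTA format could be read the same way by passing its contents to parse_fasta.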

Data conversion/translation

One inconvenience of having a number of different DNA sequence analysis packages available is that they use different formats for storing more-or-less the same information. Moreover, most packages refuse to accept even the simplest files from one another. Finally, the DNA sequence databases (EMBL, GenBank and DDBJ) each have their own sequence file format.

Readseq is a program for sequence conversion. Originally written around 1989 as a component of a sequence analysis program, it took on a life of its own as a conversion program for bioinformatics. Readseq is particularly useful because it automatically detects many sequence formats and interconverts among them. Most sequence file formats allow more than one sequence per file, but some do not.

Sequence formats that allow one or more sequences:

  • GenBank/GB, GenBank flatfile format
  • EMBL, EMBL flatfile format
  • DNAStrider, for a common Mac program
  • Pearson/Fasta, a common format used by Fasta programs and others
  • Phylip3.2, sequential format for Phylip programs

Sequence formats that only allow one sequence:

  • GCG, single sequence format of GCG software
  • Plain/Raw, sequence data only (no name, document, numbering)
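
The difference between multi-sequence and single-sequence formats can be sketched in Python. The hypothetical helper below converts FASTA text to Plain/Raw and refuses input with more than one record, because a format without names cannot keep several sequences apart. This is only an illustration and says nothing about how Readseq itself is implemented.

```python
def fasta_to_raw(fasta_text):
    """Convert FASTA text holding exactly one record to a bare sequence."""
    lines = [line.strip() for line in fasta_text.splitlines() if line.strip()]
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) != 1:
        # Plain/Raw stores no names, so several records cannot be separated.
        raise ValueError(f"expected 1 record, found {len(headers)}")
    return "".join(line for line in lines if not line.startswith(">"))


# Toy input, invented for this example.
single = ">toy_1\nACGTAC\nGGTT\n"
raw = fasta_to_raw(single)
print(raw)
```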

Exercise 2

  • Have a look at Sequence formats
  • Now open the Readseq website
  • The file NM_004014.txt (Right-click > open in new window) contains a sequence in GCG format (Dystrophin transcript variant Dp116).
  • Copy and paste the sequence, select "Pearson|Fasta|fa" as the output format, View in browser and press the submit button.
  • Change the output format and look at the different file formats
  • The file Sequences.txt (Right-click > open in new window) contains more Dystrophin protein sequences in FASTA format.
  • Translate to GCG format (View in browser)
  • Now see what happens if you (copy-paste) translate the file to FASTA format

Design primers

Exercise 3

  • Read more about the Sickle cell anemia gene on this website

The bl2seq program (BLAST) is used to align two sequences against each other instead of against the entire database. The sequences can be pasted into the boxes, or their accession codes (unique identifiers for sequences) can be entered as input.

  • Go to the BLAST website. Determine the position of the HBS mutation with the program "Align two sequences (bl2seq)" (In the "Specialized Blast" box)
  • Type "NM_000518" in the box for sequence1, this is the mRNA sequence of normal HBB.
  • Type "M25113" (mRNA of HBS) in the second box.
  • Click on "Blast" and examine the output.
  • At the top of the page you can alter the 'formatting options'. Tick the "CDS feature" box to see the protein translations of the sequences.

The sequences are retrieved from the nucleotide database (GenBank) and placed one above the other. This is called an alignment. Matching nucleotides are indicated by a pipe sign (|); mismatching nucleotides have no pipe sign. Insertions/deletions are shown as a dash (-) in one of the sequences. When the annotation of one of the sequences is known, the protein sequence is placed below the alignment. In this case the identifiers were given as input, so the bl2seq program could retrieve this information from GenBank.

  • Identify the mutation which causes the change in amino acid.
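
Once two coding sequences are aligned without gaps, the codon that carries a substitution can be located programmatically. The sketch below is a minimal Python illustration on invented toy sequences (not the HBB or HBS entries). The function name codon_changes is hypothetical, and the code assumes that both sequences start in frame and contain only the bases A, C, G and T.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_changes(ref, alt):
    """List (codon_number, ref_codon, alt_codon, ref_aa, alt_aa) for codons that differ."""
    changes = []
    for i in range(0, min(len(ref), len(alt)) - 2, 3):
        ref_codon, alt_codon = ref[i:i + 3], alt[i:i + 3]
        if ref_codon != alt_codon:
            changes.append((i // 3 + 1, ref_codon, alt_codon,
                            CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]))
    return changes


# Toy sequences that differ in a single base of the third codon.
reference = "ATGAAACGTTGG"
variant = "ATGAAACATTGG"
changes = codon_changes(reference, variant)
print(changes)
```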

Exercise 4

  • Go to the GeneCards database and retrieve the GeneCard for the HBB gene. GeneCards is a cross-reference database with links to major information sources. Examine the HBB card. What kind of information is available?

The UCSC genome browser contains the DNA sequence of more than ten organisms, including human and mouse. Besides the sequence it contains several annotation tracks, like repetitive sequences, genes and conservation. It is also possible to send a custom made track to the browser.

  • Follow the link from the GeneCards database to the "Genomic location: UCSC golden path with GeneCards custom track". The "UCSC genes" track is a curated track where information of UniProt, RefSeq and mRNA sequences from GenBank are combined. Click on this track to expand it. Compare the GeneCards annotation to the "Known genes" annotation. Is the GeneCards annotation of HBB correct?
  • Search for the HBB gene in the genome browser if the GeneCards annotation is not correct
  • Select HBB in the "UCSC genes" track. This also leads to a summary about the HBB gene. Check which information is available here. Follow the "Genome browser" link (in the box: Sequence and Links to Tools and Databases). The boundaries of the browser are now reset to the boundaries of the HBB gene according to the Known genes track. The thick blocks indicate exons of the particular gene, the thin lines indicate introns. The arrows give information about the orientation: >>> means that the gene is located on the positive strand (read from the p to the q arm of the chromosome), <<< means that the gene is located on the opposite (reverse complement) strand.
  • To pick primers for HBB, the DNA sequence is needed as input for the Primer3 program. This sequence can be downloaded from the UCSC database. Select "DNA" in the top menu. Add 100 bp extra up- and downstream of the gene to broaden your search for primer sequences. Make sure you get the "reverse complement". Leave the other options unchanged and click on "get DNA". Save the sequence on the desktop. You can copy/paste the sequence into a new text file or choose File->Save as... in the menu of the web browser.
  • Before the sequence can be analysed by Primer3, repetitive sequences have to be masked. This can be done with the program RepeatMasker. Upload the DNA sequence (from the desktop) to the RepeatMasker program. Select "html" as "return format" and leave the other options unchanged. Which repeats are present in the sequence and how much is masked by RepeatMasker?
  • Copy/paste the Masked Sequence into Primer3. Keep the default settings and choose "Pick primers". Does the primer set cover the region of interest (i.e. the region with the mutation)?
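
Two of the steps above, taking the reverse complement and measuring how much of a sequence is masked, are easy to reproduce by hand. The following Python sketch assumes that masked bases are written as N. The helper names are hypothetical and the sequence is a toy example, not the HBB region.

```python
# Translation table for complementing bases; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def masked_fraction(seq):
    """Return the fraction of positions that are masked as N."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


toy = "ATGCNNNNGG"
rc = reverse_complement(toy)
fraction = masked_fraction(toy)
print(rc, fraction)
```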

Exercise 5

Pick primers for the first exon of HBB / HBS

  • Go back to the UCSC genome browser and search for HBB. Select HBB from the Known Genes track and click on HBB in the browser itself. This time retrieve the DNA sequence by selecting "Genomic Sequence (chr11:5,246,696-5,248,301)" from the "Sequence and Links to Tools and Databases" box. Uncheck the "Introns" box and select "One fasta record per region" including 100 bp up- and downstream. Select "Mask repeats" to N. This option does the same as the RepeatMasker program.
  • Copy the sequence of the first exon into Primer3. Does the resulting primer set include the region with the mutation?
  • Check if the primer set is unique. Go to the UCSC website and start the "In-silico PCR". Choose the most recent human genome assembly and copy/paste the forward and reverse primer from the previous exercise. Go to the graphical result to see if the primer set covers the correct region.
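
The idea behind an in-silico PCR can be sketched in a few lines of Python. This is a strong simplification and not how the UCSC tool works internally: it looks only for exact primer matches on a short template, whereas a real search covers a whole genome. The function names, the template and the primers below are all invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def in_silico_pcr(template, forward, reverse):
    """Return (start, end, product) for exact primer matches, or None."""
    template = template.upper()
    forward = forward.upper()
    # The reverse primer binds the opposite strand, so on the template
    # strand we look for its reverse complement.
    reverse_site = reverse_complement(reverse)
    start = template.find(forward)
    if start == -1:
        return None
    site = template.find(reverse_site, start + len(forward))
    if site == -1:
        return None
    end = site + len(reverse_site)
    return start, end, template[start:end]


def is_unique(template, primer):
    """True if the primer has exactly one exact, non-overlapping hit on either strand."""
    template, primer = template.upper(), primer.upper()
    hits = template.count(primer) + template.count(reverse_complement(primer))
    return hits == 1


# Toy template: flank + forward site + insert + reverse site + flank.
template = "TTT" + "ACGGATCC" + "AAAAGGGCC" + "CTTTGAAT" + "TCGGG"
result = in_silico_pcr(template, "ACGGATCC", "ATTCAAAG")
print(result)
```

The returned start and end are 0-based positions on the template, with end exclusive, so the product length is end minus start.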

Sequence Alignment

Similarity search

When considering a new DNA sequence, the first question will probably be: "What do I have in my sequence?", which relates to the question "Is my sequence similar to previously known sequences?" To find out, you can compare your sequence with the public databases, using one or more computer programs written for searching the databases for similarities to a given query sequence. The fastest way to search for similar sequences in the databases is BLAST (Basic Local Alignment Search Tool). BLAST is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits. BLAST has several variants:

Sequence type      nucleotide database   amino acid database
nucleotide query   blastn/tblastx        blastx
amino acid query   tblastn               blastp

  • blastn is good for finding nucleotide sequences similar to yours.
  • blastp is good for finding amino acid sequences similar to yours.
  • blastx is good for finding amino acid sequences similar to any translation of your nucleotide sequence - e.g. if you could not recognise an ORF.
  • tblastn is good for finding nucleotide sequences that can be translated into something similar to your amino acid sequence - e.g. unannotated pseudogenes.
  • tblastx is good for keeping computers busy (or for very specific applications).
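
The choice of BLAST variant described above is essentially a lookup on query type and database type. A small Python sketch of that lookup follows. The function names are hypothetical, and the type check is a crude heuristic assumed only for this example: a protein made up solely of the letters A, C, G, T, U and N would be misread as nucleotide.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")

# (query type, database type) -> suitable BLAST programs
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
    ("protein", "protein"): ["blastp"],
}


def sequence_type(seq):
    """Guess whether a bare sequence is nucleotide or protein."""
    return "nucleotide" if set(seq.upper()) <= NUCLEOTIDE_LETTERS else "protein"


def blast_programs(query, database_type):
    """Return the BLAST programs that fit the query and database type."""
    return BLAST_PROGRAMS[(sequence_type(query), database_type)]


print(blast_programs("ATGGCCAAT", "protein"))
print(blast_programs("MKWVTF", "nucleotide"))
```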

Exercise 6

The BLAST 'Search' box accepts a number of different types of input and automatically determines the format. Accepted input types are: FASTA file, bare sequence or database identifiers (e.g. GenBank ID M18533, Swiss-Prot ID P11532)

  • Open BLAST
  • Select Blastp, copy-paste Virtual.prot (Right-click > open in new window; this sequence is translated from an mRNA sequence) into the sequence input field, leave the other options unchanged and press BLAST

A graphical overview of the database sequences aligned to the query sequence is shown. The score of each alignment is indicated by one of five different colors, with the most similar hits uppermost and appearing in red. Pink, green, blue and black bars follow, representing proteins in decreasing order of similarity. Hatched areas indicate a gap in similarity, i.e. two or more distinct regions of similarity were found within the same sequence hit. Detached bars on the same line correspond to unrelated hits. Mousing over a hit sequence causes the definition and score to be shown in the window at the top; clicking on a hit sequence takes the user to the associated alignments.

  • On the BLAST result page click on Taxonomy reports and look at the hits for different organisms
  • Go back to the BLAST result page. You can access different databases by clicking on the accession code. Explore some links.
  • Find the original protein, transcript variant Dp116 (NP_004005). From which gene does this protein originate? Find the RefSeq accession code for this sequence (NM_004014).
  • On the Entrez page for NM_004014, click links and select Linkout and Genome browser, explore DMD. You can switch annotation tracks on and off below. Select "SNP (130)" in the "Variation and repeats" section to explore the known SNPs.

Pairwise alignment

Pairwise sequence alignment methods are concerned, in contrast to BLAST, with finding the best-matching piecewise (local) or global alignments of protein or DNA sequences.

Local alignment (Smith-Waterman algorithm)

Local alignment methods find related regions within sequences. In other words, the alignment can consist of a subset of the characters within each sequence (e.g. positions 20-40 of sequence A might align with positions 50-70 of sequence B). This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins (which is known as domain shuffling) can be identified as being related.
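
A minimal Python sketch of the Smith-Waterman idea is given here. It is a teaching illustration under simplifying assumptions: fixed match and mismatch scores and a linear gap penalty, all chosen arbitrarily for this example. Alignment programs typically use a substitution matrix and separate gap opening and extension penalties instead.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can start anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences that share only the short region CGT.
score, top, bottom = smith_waterman("AAAACGTAAAA", "TTTCGTTTT")
print(score, top, bottom)
```

Only the shared region is reported; the unrelated flanks of both sequences are left out of the alignment.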

Global alignment (Needleman-Wunsch algorithm)

A global alignment between two sequences is an alignment in which all of the characters in both sequences participate in the alignment. Global alignments are useful mostly for finding closely-related sequences.
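
A minimal Python sketch of the Needleman-Wunsch idea follows. As an assumption made only for this illustration, it uses fixed match and mismatch scores and a linear gap penalty; real tools usually rely on a substitution matrix and on gap opening and extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""

    def score(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalised: every character must take part.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(top)
print(bottom)
```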

EMBOSS-Align

The EMBOSS-Align tool contains two programs each using a different algorithm:

  • When you want an alignment that covers the whole length of both sequences, use the needle program.
  • When you are trying to find the best region of similarity between two sequences, use the water program.

Exercise 7

Open EMBOSS

  • Compare the protein sequences DMD_HUMAN.txt and DMD_DROME.txt (fruitfly) with both the programs and look at the different output formats

Note that matching amino acids are connected with a "|" symbol. Mismatches would be connected with a space. A gap would be represented with a "-" symbol. Similar amino acids (e.g. threonine vs methionine) are connected via a "." symbol. Thus a sequence alignment can be represented in the format...

DMFCNTEGQGIAMM  
 |    ||||  ..      
TMG--NEGQGSETT
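
The middle line of such a representation can be generated from the two aligned rows. The Python sketch below reproduces the example above. Alignment programs decide which residues count as similar from a substitution matrix; here a hand-written set containing only the methionine/threonine pair stands in for that, which is an assumption made purely for this illustration.

```python
def match_line(top, bottom, similar=frozenset()):
    """Build the marker line between two aligned sequences of equal length."""
    markers = []
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            markers.append(" ")   # gap
        elif x == y:
            markers.append("|")   # identical residues
        elif frozenset((x, y)) in similar:
            markers.append(".")   # similar residues
        else:
            markers.append(" ")   # mismatch
    return "".join(markers)


# Only the pair M/T is treated as similar in this toy example.
SIMILAR = {frozenset("MT")}

top = "DMFCNTEGQGIAMM"
bottom = "TMG--NEGQGSETT"
middle = match_line(top, bottom, SIMILAR)
print(top)
print(middle)
print(bottom)
```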

Gene Finding

In agreement with evolutionary principles, scientific research to date has shown that all genes share common elements. For many genetic elements, it has been possible to construct consensus sequences, those sequences best representing the norm for a given class of organisms (e.g. bacteria, eukaryotes). Common genetic elements include promoters, enhancers, polyadenylation signal sequences and protein binding sites.

An mRNA can be divided into three parts: a 5′ untranslated region (5′ UTR), the polypeptide coding region, sometimes called the open reading frame (ORF), and the 3′ untranslated region (3′ UTR).

The first codon of the coding region, the start codon, is almost always AUG. While this reduces the number of candidate start positions, the reading frame of the sequence must also be taken into consideration.

There are six reading frames possible for a given DNA sequence, three on each strand, and all of them must be considered unless further information is available. Since genes are transcribed away from their promoters, knowing the location of the promoter can reduce the number of possible frames to three. The appropriate start codon lies in a frame that contains no premature stop codons. Incorrect reading frames usually predict relatively short peptide sequences. Therefore, it might seem deceptively simple to ascertain the correct frame. In bacteria, such is frequently the case. However, eukaryotes add a new obstacle to this process: introns.
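
The six reading frames can be made concrete with a short Python sketch. It translates a toy DNA string (invented for this example) in all six frames using the standard genetic code and reports the longest stretch that runs from a methionine to a stop codon. The function names are hypothetical, and introns are ignored, so this only mirrors the bacterial case described above.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translation(dna):
    """Translate a DNA string in all six frames; '*' marks a stop codon."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", dna.translate(COMPLEMENT)[::-1])):
        for offset in range(3):
            codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
            # Codons with unexpected characters are shown as 'X'.
            frames[f"{strand}{offset + 1}"] = "".join(
                CODON_TABLE.get(codon, "X") for codon in codons
            )
    return frames


def longest_orf(frames):
    """Return (frame, peptide) for the longest M...stop stretch found."""
    best = ("", "")
    for name, protein in frames.items():
        # Drop the last piece: it is not terminated by a stop codon.
        for piece in protein.split("*")[:-1]:
            start = piece.find("M")
            if start != -1 and len(piece) - start > len(best[1]):
                best = (name, piece[start:])
    return best


frames = six_frame_translation("CCATGAAATGGTAGCC")
for name, protein in frames.items():
    print(name, protein)
best = longest_orf(frames)
print(best)
```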

Intron/exon splice sites can be predicted on the basis of their common features. Most introns begin with the nucleotides GT and end with the nucleotides AG.

Image:exons-introns.gif

Exercise 8

  • Have a look at the organisation of introns/exons on the DMD homepage

GeneCards is a database of human genes, their products and their involvement in diseases. There are several ways to search it:

  • Go to the GeneCards website
    • Select disease genes and select the GeneCard for gene DMD.
      • This card shows all information available on the gene, with links to corresponding databases, e.g.
    • Disorders and Mutations:
      • Symbol Name DMD, explore the information
      • Search OMIM for DMD

The ORF Finder (Open Reading Frame Finder) is a graphical analysis tool which finds all open reading frames of a selectable minimum size in a user's sequence or in a sequence already in the database.

Exercise 9

  • Open ORF finder
  • Enter the accession number "NM_004006" (Dystrophin transcript variant Dp427m) into the appropriate box.

You will see the full cDNA sequence translated in all 6 reading frames, with the results shown in diagrammatic format. Remember that DNA is double stranded, so there are 3 reading frames for the top strand and 3 more for the bottom strand. All the ORFs are indicated by colored boxes, with the rest of the cDNA left uncolored.

Image:orf-list.gif Open Reading Frame from the list
Image:orf-selected.gif Selected Open Reading Frame
Image:orf-accepted.gif Accepted Open Reading Frame

What you are generally looking for is the largest ORF. The software has already done this for you by ranking the ORFs from largest to smallest. It also tells you which frame it came from, for example +3 means it is the third frame (3) on the top strand (+).

  • In the list, click on the colored box next to the +2 of the largest ORF. You will see the color of the ORF box change and the deduced amino acid sequence appear.
  • "Accept" the proposed ORF, view the GenBank entry and find the gene name
  • Copy/paste the NM_004006_point.txt sequence into ORF finder and look at the difference in the output (GenBank S43366, point mutation G/T causing a translational stop, region 3713).
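
The effect of a G to T point mutation that creates a stop codon can be sketched in Python on a toy coding sequence. The sequence, the position and the helper names below are invented for illustration and are not taken from NM_004006 or S43366.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate an in-frame coding sequence up to the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def substitute(seq, position, ref, alt):
    """Replace base `ref` at the 1-based `position` with `alt`."""
    if seq[position - 1] != ref:
        raise ValueError(f"expected {ref} at position {position}")
    return seq[:position - 1] + alt + seq[position:]


normal = "ATGGCTGAAGGTTGGTAA"
mutant = substitute(normal, 7, "G", "T")   # GAA (Glu) becomes TAA (stop)
print(translate(normal))
print(translate(mutant))
```

The mutant peptide ends early, which is the kind of shortened ORF to look for when comparing the two sequences in ORF finder.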

Links to databases and software

General

Overview of databases and online software

Reminders Figures and Tables Disease

DNA/mRNA/protein sequences for exercises

Programs

Database searching

Data Conversion/Translation Sequence Alignment Gene finding Primer design

Databases

Integrated databases

Human genome NCBI Other
Topic revision: r4 - 2010-03-19 - 13:50:26 - BarberaVanSchaik