D. Eric Walters
Ph.D., Professor


GIGP-510, Computer Applications in Biomedical Research


 

Working with Sequences--Part 3

 

In this session, we will do multiple sequence alignment. As in previous sessions, there is more than one way to do the task. We will look at network servers, a multiple sequence alignment database, and the latest version of ClustalW, which you can download and run on your own computer. Our examples use protein sequences, but most of the servers and programs will do nucleic acid sequences.

Selected network servers for multiple sequence alignment

  • Multiple sequence alignment at Baylor College of Medicine, http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html. Several methods to choose from!
  • The Protein Information Resource (PIR) server lets you run ClustalW, T-Coffee, or MUSCLE, and all you need is the UniProtKB identifiers of the proteins you want to run. Or you can cut/paste your own sequences.
  • The European Bioinformatics Institute (EMBL-EBI) also has a server with several methods to choose from.
  • Florence Corpet has written a program called MultAlin which is available on a server at INRA in Toulouse, France. You have several options for retrieving your output in useful graphical formats.
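
If you would rather cut and paste your own sequences into one of these servers, you first need them in FASTA format. The sketch below is one possible way to collect them with Python. The URL pattern for UniProt's REST service is an assumption on my part (it is not given on any of the servers above), and the function names are hypothetical, so check the UniProt documentation before relying on it.

```python
from urllib.request import urlopen

# Assumed URL pattern for UniProt's REST service; verify before use.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fasta_url(accession):
    """Build the (assumed) download URL for one UniProtKB accession."""
    return UNIPROT_FASTA_URL.format(accession=accession.strip().upper())


def fetch_fasta(accession, timeout=30):
    """Download one FASTA record as text. Needs network access."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


def build_multi_fasta(records):
    """Join FASTA records into one block that can be pasted into a server form."""
    return "\n".join(record.strip() for record in records) + "\n"
```

With network access, something like `build_multi_fasta(fetch_fasta(a) for a in ["P08148", "P43150"])` should give a block of text ready to paste into the sequence box of an alignment server.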

Doing a sequence alignment

This example uses the PIR site to locate some sequences and then align them.

  • First I went to the PIR site and, under the Search/Analysis menu, I selected Text Search. I entered the search term "leishmanolysin." Leishmanolysin is a metalloprotease used by Leishmania parasites.
  • This produced a list of almost 500 hits. Not all of them are leishmanolysin! The text search catches entries that include the comment "related to leishmanolysin," for example. But the top entries are leishmanolysin precursors from several Leishmania species. I was most familiar with Leishmania major, so I selected that one. Since the protein sequence databases merged, each entry has several identifiers--this one has P08148, GP63_LEIMA, and PIRSF001204. I clicked on the PIRSF001204 link.
  • This leads to a page containing a great deal of information about this protein! It's worthwhile to explore the kinds of information you can derive; for now, I clicked on the link for Alignment and Tree. This brings up the Multiple Alignment form, with 7 protein ID codes already filled in, and the ClustalW button selected. I clicked on the Submit button to start the alignment.
  • The result page includes three kinds of output. At the bottom is the multiple sequence alignment (a portion is shown below):

  • Just above the alignment there is a tree, in which branch lengths are proportional to evolutionary distance:

This indicates that alignment took place as follows:
1: P43150 was aligned with Q8MNZ1; these two were most similar.
2: Q27673 was added to alignment #1.
3: Q25286 was aligned with Q00689.
4: Q06031 was added to alignment #1.
5: Q25289 and alignment #4 were added to alignment #1.
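
To see why a tree dictates an order of alignment steps like the one listed above, here is a toy Python sketch. It is a simplification, not ClustalW's own procedure: it scores pairs by the fraction of mismatched positions and joins the closest clusters first using average linkage. The four sequences and the names A to D are invented for illustration.

```python
from itertools import combinations


def percent_difference(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def merge_order(sequences):
    """Return the order in which clusters are joined, closest pair first.

    `sequences` maps a name to a sequence. The distance between two clusters
    is the average of all member-to-member distances (average linkage).
    """
    dist = {
        frozenset((p, q)): percent_difference(sequences[p], sequences[q])
        for p, q in combinations(sequences, 2)
    }

    def linkage(pair):
        left, right = pair
        total = sum(dist[frozenset((p, q))] for p in left for q in right)
        return total / (len(left) * len(right))

    clusters = [(name,) for name in sequences]
    steps = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=linkage)
        steps.append((left, right))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return steps


# Hypothetical toy sequences, invented for illustration only.
toy = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIAKQK",
    "C": "MKTGYLAKQR",
    "D": "MRSGWLPKHR",
}
for number, (left, right) in enumerate(merge_order(toy), start=1):
    print(f"{number}: {'+'.join(left)} joined with {'+'.join(right)}")
```

The two most similar sequences are joined first, and each later step adds a sequence (or a group) to what has already been built, which is the same pattern as the numbered steps above.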

  • At the top of the result page is a button to start a Java applet that lets you look at the alignment in several different ways, interactively:
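
If you save the alignment text from the result page, you can also examine it with a few lines of Python instead of the applet. This is a minimal sketch that assumes the usual CLUSTAL text layout (a header line, then blocks of "name sequence" lines, each followed by a conservation line that starts with spaces). The sample alignment and the names seqA and seqB are invented.

```python
def parse_clustal(text):
    """Parse CLUSTAL-format text into {name: aligned_sequence}."""
    alignment = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and conservation lines
        # (conservation lines begin with whitespace).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, chunk = parts[0], parts[1]
        alignment[name] = alignment.get(name, "") + chunk
    return alignment


def conserved_columns(alignment):
    """Return 0-based columns where all sequences share one residue (no gaps)."""
    rows = list(alignment.values())
    return [
        i
        for i, column in enumerate(zip(*rows))
        if len(set(column)) == 1 and column[0] != "-"
    ]


# Invented two-sequence alignment, split into two blocks as a server would.
sample = "\n".join([
    "CLUSTAL W (1.83) multiple sequence alignment",
    "",
    "seqA    MKT-AYIA",
    "seqB    MKTGAYLA",
    "        *** **:*",
    "",
    "seqA    KQR",
    "seqB    KQK",
    "        **:",
])
alignment = parse_clustal(sample)
print(alignment)
print(conserved_columns(alignment))
```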

PC Software

The favorite choice among molecular biologists I have talked to is ClustalX. You can download your own copy of the latest version from the authors in Strasbourg, France. They have versions for Windows, Mac, and various flavors of Unix. ClustalW is the same program, with a command-line interface.
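
Because ClustalW has a command-line interface, you can drive it from a script. Here is a minimal Python sketch, assuming the program is installed under the name clustalw2 (the name varies by version and platform) and that your sequences are in a FASTA file. The file names are hypothetical, and the options shown should be checked against the help output of your own copy.

```python
import shutil
import subprocess


def clustalw_command(infile, outfile, executable="clustalw2"):
    """Build the argument list for a ClustalW protein alignment."""
    return [
        executable,
        f"-INFILE={infile}",    # input sequences
        "-ALIGN",               # do a full multiple alignment
        "-TYPE=PROTEIN",        # treat the input as protein
        f"-OUTFILE={outfile}",  # where to write the alignment
    ]


def run_clustalw(infile, outfile, executable="clustalw2"):
    """Run ClustalW if it is installed; raise a clear error otherwise."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} was not found on your PATH")
    subprocess.run(clustalw_command(infile, outfile, executable), check=True)
    return outfile
```

For example, `run_clustalw("leishmanolysin.fasta", "leishmanolysin.aln")` would align the sequences in the first file and write the result to the second.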

Databases of Multiple Sequence Alignments

Multiple sequence alignments can be used in many ways, and there are several very useful databases containing multiple sequence alignments. Here are a few:

  • Pfam is a large collection of multiple sequence alignments containing many common protein domains.
  • The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
  • PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family.
  • The ProDom protein domain database consists of an automatic compilation of homologous domains.
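
To get a feel for what "multiply aligned ungapped segments" means, here is a toy Python sketch that scans an alignment for runs of columns in which no sequence has a gap. It is only an illustration of the idea, not the method any of these databases actually uses to build its entries, and the three aligned sequences are invented.

```python
def ungapped_blocks(rows, min_width=3):
    """Return (start, end) column ranges, end exclusive, with no gap in any row."""
    if len(set(map(len, rows))) > 1:
        raise ValueError("all aligned sequences must have the same length")
    width = len(rows[0]) if rows else 0
    blocks = []
    start = None
    # Go one past the last column so a block at the very end is closed too.
    for i in range(width + 1):
        gap_free = i < width and all(row[i] != "-" for row in rows)
        if gap_free and start is None:
            start = i
        elif not gap_free and start is not None:
            if i - start >= min_width:
                blocks.append((start, i))
            start = None
    return blocks


# Invented aligned sequences, for illustration only.
rows = [
    "MKT-AYIAKQ--RLV",
    "MKTGAYLAKQDERLV",
    "MRT-AYLA-QDERLI",
]
for start, end in ungapped_blocks(rows):
    print(start, end, [row[start:end] for row in rows])
```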

 
                        Rosalind Franklin University of Medicine and Science - 3333 Green Bay Rd, North Chicago, IL 60064    (847) 578-3000