GMAP: A Multi-Purpose Computer Program to Aid Synthetic Gene
      Design, Cassette Mutagenesis and the Introduction of
        Potential Restriction Sites into DNA Sequences


               G P S Raghava and Girish Sahni

Computer Center and Section of Molecular and Chemical Biology
            Institute of Microbial Technology,
              Sector 39A, Post box No. 1304,
                Chandigarh 160 014, INDIA.



Address for Correspondence: G.P.S. Raghava, Scientist, IMTECH,
Post Box No. 1304, Chandigarh 160 014, India. Phone +91 (172)
44252, Email: raghava@imtech.ernet.in





                        ABSTRACT


A comprehensive computer program, GMAP, is described which is
developed for i) mapping the potential restriction endonuclease
(R.E.) sites in non-ambiguous DNA sequence, such as that of
natural genes, that can be introduced in the DNA sequence with or
without altering the amino acid sequence i.e. through non-silent
or silent mutations; ii) predicting the number and type of
mutations required to introduce unique R.E. sites in the non-
ambiguous DNA sequences after a limited number (1,2 or 3 bp per
R.E. site) of translationally silent/non-silent mutations; iii)
searching all R.E. sites in ambiguous DNA sequence obtained by
reverse translation of a given amino acid sequence; iv) searching
R.E. sites in DNA sequence obtained from reverse translation of
amino acid sequence employing user-defined codon usage.  These
facilities allow the program to be employed for the design of
synthetic genes as well as the modular redesign after introducing
limited base-pair mismatches in natural (wild type) genes in
order to adapt them for `cassette' mutagenesis.  The GMAP program
uses an algorithm based on set theory which reduces the degree of
complexity from an exponential to linear function of sequence
length.  As a consequence the speed of searching for potential
R.E. sites in reverse-translated gene sequences and the
prediction of new R.E. sites in natural genes by silent/non-
silent mutation is moderately fast.  The program can also cope
with non-palindromic recognition sites.  Examples are
given to show the actual and potential sites determined by this
program in a synthetic Ribonuclease A gene and the natural DNA
sequence of Streptokinase.  The DOS and VMS versions of GMAP are
freely  available from the EMBL netserver via e-mail.


INTRODUCTION


An important application of computers in biochemistry is
pattern recognition in biological sequences. The  need for
mapping the restriction enzyme (R.E.) sites is usually fulfilled
by using computers.  Finding translationally silent R.E. sites in
DNA sequences has become particularly important for biologists,
especially those dedicated to the investigation of protein
structure/function relationships.  The ability to predict
potential R.E. sites that are resident in an ambiguous DNA
sequence, such as those obtained by reverse translation of
protein amino acid sequences, allows one to construct synthetic
genes with appropriately placed sites for cutting and joining DNA
segments; similarly, the ability to introduce translationally
silent R.E. sites by limited mutagenesis into a non-ambiguous DNA
sequence (e.g., the open-reading-frames of natural genes) or in a
translationally non-silent manner elsewhere in genes (such as
promoters and  other control elements that are not expressed into
proteins) permits the modular redesign of genes for `cassette'
mutagenesis.

The identification of pre-existing R.E. sites in DNA sequences is
possible with many available programs (4). Handling non-ambiguous
DNA sequences by available programs is fairly successful because
little computational complexity is involved. Several specialized
programs are reported to be able to manipulate ambiguous DNA
sequences or even protein-coding sequences (1, 5, 9, 11).
However, these programs fail to handle protein-coding sequences
properly when several ambiguous amino acids are clustered.
Presnell and Benner  (13) use LISP to represent the protein-
coding DNA sequences, but their program is unable to handle long
sequences due to the immense complexity involved. For example,
since the peptide sequence `Ala Ser Ile' can be represented  as
GCNAGYATH or GCNTCNATH, these two sequences must be examined
separately to determine the placement of all R.E. sites. Thus the
ambiguous amino acids (e.g., arginine, leucine and serine), which
have six codons, increase the complexity by a factor of 2^x (where
x is the number of such amino acids in the amino acid sequence)
(13, 20).  An alternative approach was described (16, 17, 20) for
reducing the complexity of searching for the presence of
potential R.E. sites in target amino acid sequences. In this
approach, each R.E. recognition sequence is translated into all 
possible peptides (from three reading frames) instead of reverse-
translating protein sequences. These peptides are then searched in 
the given amino acid sequence and if any match is found for a R.E. 
specificity it indicates that R.E. site is allowed.  Although this 
approach reduces  the overall complexity of searching  R.E. sites in 
DNA sequences, it is still quite high because each R.E. recognition 
sequence, having dozens of unique peptide patterns, must be 
separately matched against the target peptide (16, 20).  Thus, these
programs are applicable only when the R.E. recognition sequences
are short and free of degeneracy and the target DNA sequences are
relatively small (17).  To reduce the complexity further, Jiang
et al. (7) described a generic algorithm based on set theory, which
reduces the complexity from an exponential to a linear function of
sequence length.
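
The growth in the number of sequences to be examined can be
illustrated with a short Python sketch. It is only an illustration
of the `Ala Ser Ile' example above, not code from any of the cited
programs; the table FAMILIES and the function reverse_translations
are hypothetical names, and only a few amino acids are listed.

```python
from itertools import product

# IUPAC-coded codon families (standard genetic code) for a few amino acids.
# Six-codon amino acids cannot be written as one ambiguous codon, so they
# contribute two families each.
FAMILIES = {
    "Ala": ["GCN"],
    "Ile": ["ATH"],
    "Ser": ["TCN", "AGY"],
    "Leu": ["CTN", "TTR"],
    "Arg": ["CGN", "AGR"],
}

def reverse_translations(peptide):
    """Return every ambiguous DNA string that must be examined for a peptide."""
    choices = [FAMILIES[aa] for aa in peptide]
    return ["".join(combo) for combo in product(*choices)]

print(reverse_translations(["Ala", "Ser", "Ile"]))
# ['GCNTCNATH', 'GCNAGYATH']

# Four six-codon residues give 2**4 = 16 sequences to scan.
print(len(reverse_translations(["Ser", "Leu", "Arg", "Ser"])))
```

Each additional six-codon residue doubles the number of strings, which
is the exponential behaviour the set-based algorithm is meant to avoid.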

The generic algorithm  (7) was limited in scope to the search of
R.E. sites in ambiguous DNA sequences and/or protein-coded DNA
sequences.  We have further extended the generic algorithm to
incorporate the ability to predict the R.E. sites, which can be
introduced in DNA sequence by a limited number (1, 2 or 3 bp per
site) of silent mutations.  This is a novel capability that is
not available in most earlier programs on potential R.E. site
prediction. Our algorithm is an extension of the generic
algorithm of Jiang et al. (7) and is hence termed e-generic
(`extended' generic algorithm). A comprehensive program,
GMAP, has been developed based on this algorithm for searching
i) the potential R.E. sites in ambiguous DNA sequence and/or
protein-coded DNA sequences and ii) the R.E. sites in non-
ambiguous DNA sequence, and those that can be introduced by one,
two or three base-pair, site-directed, translationally silent or
non-silent mutations.  These facets make the program
particularly useful both for the redesign of natural genes to
incorporate conveniently placed R.E. sites for `cassette'
mutagenesis as well as in synthetic gene design. The program is
available  free of cost from the EMBL netserver via e-mail in
both DOS and VMS versions.


ALGORITHM


We have extended the generic algorithm described earlier (7)
for searching R.E. sites in ambiguous DNA sequences and/or
protein-coded DNA sequences so that it may also search R.E. sites
in non-ambiguous DNA sequence that can be introduced by limited
site-directed silent (or non-silent) mutations (1, 2 or 3 allowed
mismatches per R.E. site).  Basic elements of DNA sequences i.e.
nucleotides (bases) can be represented as sets, and DNA
sequences, whether non-ambiguous or ambiguous, can be represented
as sequence of these sets.  The DNA sequence or recognition
sequence of restriction enzyme can be expressed as x = x1, x2,
..., xr , where set xi is A, B, C, D, G, H, K, M, N, R, S, T, V,
W or Y , and set A = {A}, B = {G, C, T}, C = {C}, D = {A, G, T},
G = {G}, H = {A, C, T}, K = {T, G}, M = {C, A}, N = {A, G, C, T},
R = {A, G}, S = {C, G}, T = {T}, V = {A, G, C}, W = {A, T},Y =
{C, T} (italic capitals stand for sets and normal capitals for
nucleotides) (ref. 12).  Let $x_1$ and $x_2$ be two sets; their
intersection $x_1 \cap x_2$ is the set that contains their common
elements.  If there exists no common element, the result of the
intersection is the empty set $\phi = \{\}$. In other words, $x_1$
and $x_2$ are said to be matched if $x_1 \cap x_2 \neq \phi$, and
mismatched if $x_1 \cap x_2 = \phi$. The potential restriction sites
which have one, two or three mismatches in a given DNA sequence can
be searched by implementing a set of intersection operations as
described below.
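
The set representation above maps directly onto a small lookup
table. The following Python sketch is an illustration only (GMAP
itself was written in Pascal); the names IUPAC and is_match are
hypothetical.

```python
# Each symbol stands for the set of nucleotides it can represent (ref. 12).
IUPAC = {
    "A": {"A"}, "B": {"G", "C", "T"}, "C": {"C"}, "D": {"A", "G", "T"},
    "G": {"G"}, "H": {"A", "C", "T"}, "K": {"T", "G"}, "M": {"C", "A"},
    "N": {"A", "G", "C", "T"}, "R": {"A", "G"}, "S": {"C", "G"},
    "T": {"T"}, "V": {"A", "G", "C"}, "W": {"A", "T"}, "Y": {"C", "T"},
}

def is_match(x, y):
    """Two symbols match when the intersection of their sets is non-empty."""
    return bool(IUPAC[x] & IUPAC[y])

print(is_match("R", "G"))  # True:  {A, G} and {G} share G
print(is_match("Y", "R"))  # False: {C, T} and {A, G} share nothing
print(is_match("N", "T"))  # True:  N contains every nucleotide
```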

Let $x = x_1, x_2, \ldots, x_r$ be the recognition sequence of a
restriction enzyme and $y = y_n, y_{n+1}, \ldots, y_{n+r-1}$ be the
segment of the natural DNA sequence of length $r$ (equal to the
length of the R.E. recognition sequence) from $n$ to $n+r-1$. Let
$TS_i$ denote the mismatch index between $x_i$ and $y_{n+i-1}$. Then

$$TS_i = \begin{cases} 1 & \text{if } x_i \cap y_{n+i-1} = \phi \\ 0 & \text{otherwise} \end{cases}$$

The total number of mismatches for a segment of length $r$ is then

$$TS = \sum_{i=1}^{r} TS_i$$
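
The two definitions above can be sketched in Python as a single
pass over the DNA sequence. This is a minimal illustration, not the
authors' Pascal implementation: the function names and the parameter
max_mismatches are assumptions, and positions are reported 0-based.

```python
# IUPAC symbol -> set of nucleotides (ref. 12).
IUPAC = {
    "A": {"A"}, "B": {"G", "C", "T"}, "C": {"C"}, "D": {"A", "G", "T"},
    "G": {"G"}, "H": {"A", "C", "T"}, "K": {"T", "G"}, "M": {"C", "A"},
    "N": {"A", "G", "C", "T"}, "R": {"A", "G"}, "S": {"C", "G"},
    "T": {"T"}, "V": {"A", "G", "C"}, "W": {"A", "T"}, "Y": {"C", "T"},
}

def total_mismatches(site, segment):
    """TS: count the positions whose base sets have an empty intersection."""
    return sum(1 for x, y in zip(site, segment) if not (IUPAC[x] & IUPAC[y]))

def potential_sites(dna, site, max_mismatches=3):
    """Return (position, TS) for every window with TS <= max_mismatches."""
    r = len(site)
    hits = []
    for n in range(len(dna) - r + 1):
        ts = total_mismatches(site, dna[n:n + r])
        if ts <= max_mismatches:
            hits.append((n, ts))
    return hits

# EcoRI (GAATTC): one existing site and one site reachable by a single change.
print(potential_sites("AAGAATTCTTGAGTTCAA", "GAATTC", max_mismatches=1))
# [(2, 0), (10, 1)]
```

Each window costs r set intersections, so the scan is linear in the
sequence length for a fixed recognition sequence.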


TS represents the total number of bp mismatches between the R.E.
recognition sequence and the DNA segment. There is a potential R.E.
site at position `n' in the DNA sequence if TS is zero, one, two or
three (the allowed mismatches); such a site is either already
present (TS = 0, a natural site) or can be introduced by one, two or
three bp mutations, respectively. To determine whether the site
is translationally silent or non-silent, the target DNA sequence
is translated to its amino acid sequence, which is then reverse
translated to an ambiguous DNA sequence. TS is then calculated
between the R.E. sequence and the segment of the ambiguous DNA
sequence from `n' to `n+r-1'. A TS value of zero represents a
silent potential site; a TS value of 1 or higher signifies a non-
silent potential site. Further details of the e-generic algorithm
can be obtained by request from the authors.
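
The silent/non-silent test can be sketched as follows. This is not
the e-generic implementation: instead of building one reverse-
translated ambiguous string, the sketch asks, codon by codon,
whether some synonymous codon is compatible with the recognition
sequence, which also covers six-codon amino acids. It assumes the
standard genetic code, a reading frame starting at index 0, and
0-based positions; all names are hypothetical.

```python
from itertools import product

IUPAC = {
    "A": {"A"}, "B": {"G", "C", "T"}, "C": {"C"}, "D": {"A", "G", "T"},
    "G": {"G"}, "H": {"A", "C", "T"}, "K": {"T", "G"}, "M": {"C", "A"},
    "N": {"A", "G", "C", "T"}, "R": {"A", "G"}, "S": {"C", "G"},
    "T": {"T"}, "V": {"A", "G", "C"}, "W": {"A", "T"}, "Y": {"C", "T"},
}

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Amino acid -> list of its synonymous codons.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)

def is_silent(dna, site, n):
    """True if `site` can occupy dna[n:n+r] without changing any amino acid."""
    r = len(site)
    if len(dna) % 3 or n < 0 or n + r > len(dna):
        raise ValueError("need whole codons and a window inside the sequence")
    for c in range(n - n % 3, n + r, 3):  # every codon overlapping the window
        fits = any(
            all(syn[k] in IUPAC[site[c + k - n]]
                for k in range(3) if n <= c + k < n + r)
            for syn in SYNONYMS[CODON_TABLE[dna[c:c + 3]]]
        )
        if not fits:
            return False
    return True

orf = "ATGGAGTTTAAA"                  # Met Glu Phe Lys
print(is_silent(orf, "GAATTC", 3))    # True:  GAG->GAA and TTT->TTC are synonymous
print(is_silent(orf, "AAGCTT", 3))    # False: the first codon would become Lys
```

In this example both sites lie two mismatches away from the window
GAGTTT, but only the EcoRI site (GAATTC) can be introduced without
altering the encoded peptide.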


IMPLEMENTATION


GMAP was developed on a MicroVAX II under the VMS (ver. 4.6)
operating system.  It was written in standard Pascal and compiled
with the VAX Pascal (3.5) compiler.  The code was also compiled
under DOS (4.01) on an IBM compatible PC/AT. It requires no
special hardware or graphics and runs under VMS
or DOS. The program is interactive, fully menu driven and allows
input or output of the data using files.  When a synthetic RNase A
gene (Fig. 1) was analysed by GMAP on an IBM compatible PC/AT-386,
the search for all the potential sites of 188 type II restriction
enzymes (15) took 63 seconds of CPU time.


OPERATION OF THE PROGRAM


The program GMAP is fully menu driven. Its options and sub-
options are shown in Table 1. The option `Input amino acid
sequence' allows the user to input the amino acid sequence (in
single or three letter code) using keyboard or text file, and
also allows one to create and update the amino acid sequence
file. The sequence data obtained  from PIR or NBRF can also be
directly used to create the input amino acid sequence file. The
option `Input DNA sequence file' allows one to create and update
the DNA sequence file.  The data can be entered from the keyboard
or from a text (ASCII) file, so that sequence data extracted
from GenBank or EMBL can be used directly to create a DNA
sequence file. This option also allows one to convert an amino acid
sequence into DNA sequence by using a user-defined codon
preference table. The option `Input restriction enzyme sequence'
allows the user to create and update the restriction enzyme data
file.  The prototype restriction endonuclease recognition
sequences of type II enzymes (15) are already stored in a file.
The `Input Codon Usage Table' allows one to create and update the
codon preference table.  A file containing the codons preferred
by  E. coli (19) is included with the program.

The `Search R.E. sites in amino acid sequence' option allows the
user to i) search for all the R.E. sites in the fully ambiguous
DNA sequence obtained from reverse translation of the amino acid
sequence; ii) search the sites for a specific restriction enzyme
in the reverse-translated ambiguous DNA sequence; iii) reverse
translate a given amino acid sequence into a fully or partially
ambiguous DNA sequence or into a completely non-ambiguous DNA
sequence using user-defined codon preference; iv) search all R.E.
sites in the partially ambiguous (or non-ambiguous) DNA sequence
obtained from reverse translation of the amino acid sequence
employing a user-defined codon preference table; and v) search the
sites for a user-specified enzyme in the partially ambiguous or
completely non-ambiguous DNA sequence obtained from reverse
translation of the amino acid sequence with user-defined codon usage.

The `Search R.E. sites in DNA sequences' option allows the user
to i) search all the potential R.E. sites which can be
introduced in the DNA sequence by limited site-directed silent/non-
silent mutagenesis, and report the number of mutations required to
introduce each site; ii) search the potential sites for a specific
restriction enzyme which can be introduced in the DNA sequence by
site-directed silent/non-silent mutagenesis, and report the number of
mutations required to introduce each site; iii) translate the DNA
sequence into an amino acid sequence; iv) search the preexisting
sites of all R.E.s in the DNA sequence; and v) search the existing
sites of a specific R.E. in the DNA sequence.

The `Output DNA/Amino acid/R.E./Codon usage table' option allows
the display (or printing, or saving to a file) of the amino acid
sequence, DNA sequence, restriction enzyme data and codon
usage table. Besides the main options and sub-options
there are other options which allow the user to output the
results in the desired format (Table 1).  The program has the
ability to cope with both palindromic and non-palindromic
recognition sites.
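
How GMAP treats non-palindromic sites internally is not described
here. One common way to handle them, shown below as an assumption
rather than as the program's method, is to search for both the
recognition sequence and its reverse complement. The names
COMPLEMENT, reverse_complement and strands_to_search are
hypothetical.

```python
# Complement of each IUPAC symbol (the complement of the set it stands for).
COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "R": "Y", "Y": "R", "K": "M", "M": "K",
    "B": "V", "V": "B", "D": "H", "H": "D",
    "S": "S", "W": "W", "N": "N",
}

def reverse_complement(site):
    """Recognition sequence as read on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(site))

def strands_to_search(site):
    """One pattern for a palindromic site, two for a non-palindromic one."""
    rc = reverse_complement(site)
    return [site] if rc == site else [site, rc]

print(strands_to_search("GAATTC"))  # EcoRI, palindromic: ['GAATTC']
print(strands_to_search("GGATG"))   # FokI, non-palindromic: ['GGATG', 'CATCC']
```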


RESULTS AND DISCUSSION


Protein engineering by genetic means is currently one of the
foremost techniques of studying the relationship between
structure and function of a protein.  Recombinant DNA technology,
particularly the use of the relatively straightforward PCR
methods, allows the facile incorporation of site-specific
alterations in amino acid sequence by modifying the target DNA
(6, 18). This approach is greatly facilitated if a given (wild-
type) gene is so altered (by prior, limited mutagenesis) as to
allow the ready replacement of a DNA segment (`cassette') of the
gene with another (synthetic or PCR-generated) segment that codes
for the desired alteration in the protein sequence. The
alternative to this semi-synthetic approach is the complete
redesign of genes by DNA synthesis methods.  Here, suitable
cassettes can be predesigned at will by introducing appropriately
placed translationally silent R.E. sites into the sequence (8,
13, 20). In this case, the `sequence space' can be further
limited by restricting the usage of the codons for different
amino acids based on the relative frequencies of their use in the
host organism employed  for the expression of the designed gene
(19). Thus, the prediction of `silent' R.E. sites in target
sequences is of  great importance in protein engineering
projects.

The program GMAP is suitable both for the design of totally
synthetic genes based on protein sequence (ambiguous DNA
sequence), as well as the redesign of natural genes, with non-
ambiguous DNA sequences, by limited site-specific mutagenesis in
order to obtain a modular `cassette' arrangement. The program has
been successfully implemented for achieving both the objectives
using two examples, namely that of RNase A and Streptokinase.
One particular advantage that `custom-tailored', synthesis-based
gene design enjoys over the manipulation of natural genes is the
possibility of altering the codon preferences (19) for different
amino acids in  the protein in consonance with the requirements
of various host-cell protein synthesis machineries.  In this case
too, GMAP offers the option of restricting codon preferences to
user-defined dictates. Thus, instead of a totally non-
preferential usage of the respective codons for different amino
acids, a partially ambiguous DNA sequence is generated by reverse
translation of the amino acid sequence according to specified
codon usage, which is then analysed for mapping unique potential
R.E. sites that are translationally silent.  The user could then
incorporate any or all of these sites into the final DNA sequence
chosen for synthesis.
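
The idea of a partially ambiguous sequence restricted by codon usage
can be sketched as below. The preference table PREFERRED is made up
for illustration and is not the E. coli table distributed with GMAP;
the function name is likewise hypothetical. Each codon position is
written as the IUPAC symbol for the bases found there among the
allowed codons.

```python
IUPAC = {
    "A": {"A"}, "B": {"G", "C", "T"}, "C": {"C"}, "D": {"A", "G", "T"},
    "G": {"G"}, "H": {"A", "C", "T"}, "K": {"T", "G"}, "M": {"C", "A"},
    "N": {"A", "G", "C", "T"}, "R": {"A", "G"}, "S": {"C", "G"},
    "T": {"T"}, "V": {"A", "G", "C"}, "W": {"A", "T"}, "Y": {"C", "T"},
}
# Reverse lookup: set of bases -> IUPAC symbol.
SET_TO_CODE = {frozenset(bases): code for code, bases in IUPAC.items()}

def reverse_translate(protein, allowed_codons):
    """Write each codon position as the symbol covering the allowed bases."""
    out = []
    for aa in protein:
        codons = allowed_codons[aa]
        for pos in range(3):
            out.append(SET_TO_CODE[frozenset(c[pos] for c in codons)])
    return "".join(out)

# Hypothetical preference table (one-letter amino acid -> allowed codons).
PREFERRED = {
    "M": ["ATG"],
    "E": ["GAA"],
    "F": ["TTC", "TTT"],
    "K": ["AAA", "AAG"],
    "A": ["GCT", "GCG"],
}

print(reverse_translate("MEFKA", PREFERRED))  # ATGGAATTYAARGCK
```

The per-position symbol is exact only when the allowed codons of an
amino acid differ at a single position. If, say, both TCT and AGC
were allowed for serine, the result WSY would also admit TGT
(cysteine), so such cases need separate sequences, as in the
`Ala Ser Ile' example of the Introduction.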

The applicability of GMAP to  de novo gene design has been
tested using the example of RNase A (Figs. 1 and 2).  The amino
acid sequence of RNase A is well known, and this protein has
served as a favorite model system for numerous protein structure-
function studies over the years (2, 3).  The amino acid sequence
of RNase A was reverse-translated by GMAP into DNA sequences with
varying degrees of ambiguity due to different codon usage (Fig.
1).  The maps of unique R.E. sites either in the totally non-
ambiguous DNA sequence (choosing only a single, most frequently
used, E. coli codon for each amino acid), or partially ambiguous
DNA sequence (using only the relatively more highly used codons
of  E. coli), or totally ambiguous DNA sequence (choosing all
degenerate  codons) were then determined (Fig. 2).  The example
of RNase A clearly illustrates the successful applicability of
GMAP for predicting the useful R.E. sites to be designed in the
different region of a gene with different codon preference
constraints during its  de novo synthesis.  Despite a relative
limitation on the degree of freedom of sequence choice due to
restriction on codon usage, an adequate number of unique sites
is still available in RNase A in the case of the partially and
fully degenerate DNA sequences, permitting the design of a gene
with a cassette arrangement useful for its (future) mutagenic
manipulation. In cases where this may not be possible, one has
the option of exploring double-, triple- and higher-order
cutter R.E. sites for possible manipulation.  In such cases,
potential site/s resident in the area of interest can be retained
while sites elsewhere for the same enzyme can be simply abolished
(in a translationally silent manner) if  feasible on the basis of
sequence degeneracy.

Although gene design by total de novo synthesis is a powerful
tool for protein engineering, a convenient, albeit somewhat less
powerful, approach is to investigate natural genes for the
possibility of introducing new R.E. sites into the non-ambiguous
DNA sequence through a limited number of base-pair mismatches.
Since most of the cloned genes are of natural origin rather than
chemically synthesized  in vitro, the ability to manipulate wild-
type genes for protein engineering purposes is of special
interest to molecular biologists. Apart from conferring
advantages during modular mutagenesis, the placement of unique
R.E. site/s (RFLPs) near or within a target sequence to mark that
gene or mutagenesis event also provides a useful tool in genetic
experiments.  This feature is clearly illustrated in the example
of Streptokinase, an important thrombolytic protein (Fig. 3). The
known DNA sequence of Streptokinase (10) was first analyzed by
GMAP for naturally present, single-cutter (i.e., unique) R.E.
sites (column 1 in Fig. 3).  Although several sites are placed
uniformly throughout most of the gene (a key consideration in
obtaining a modular arrangement of cassettes in genes), several
segments (indicated by boxes, marked A, B and C in column 1) lack
such sites.  However, upon limited mismatching (i.e., 1, 2 or 3 bp
per R.E. site), several new, unique sites are seen to appear (see
corresponding boxes in column 2 in Fig. 3).  The choice is even
further extended when new double- and triple-cutter enzyme sites
(3rd and 4th columns, respectively) are explored.  In this case,
the user has the option of scrutinizing the potential double-,
triple- or quadruple-cutter sites permissible in the sequence
after limited mismatching, and then selecting only those sites that
permit the placement of a unique site in the area of interest
while leaving other sites intact. It may be mentioned here that
only those sites that are generated upon one-bp mismatching have
been shown in Fig. 3 (because they turned out to be adequate for
the example chosen).  In cases where this is not adequate, the
ensemble of potential R.E. sites permissible through 2, 3 or a
higher order of bp mismatching will likely be sufficiently vast
to permit the introduction of useful sites virtually in any
region of a gene. Alternatively, if one wishes to incorporate
potential R.E. sites into DNA that can be generated only by
altering the encoded amino acid sequence, GMAP offers this option
to the user. This scenario is particularly useful when the
constraint of `translation silence' is not needed for the
mutagenesis of regions other than the open-reading-frames, such
as the control elements of genes.  A pertinent example of this
type of application is enhancing the expression of whole
genes by cassette mutagenesis, wherein one desires to cut just
outside of a coding region in order to fuse it to a stronger
promoter.


Availability


GMAP is freely available either by request from the authors or from
the EMBL (source code and compiled programs for VAX/VMS and DOS
computers) via electronic mail (e-mail) and anonymous file
transfer (ftp). E-mail can be sent to the internet address of
EMBL (14): `NETSERV@EMBL-HEIDELBERG.DE', with the following
commands (only one per line) in the body of the message:

HELP DOS_SOFTWARE

GET DOS_SOFTWARE:GMAP.UUE              OR

HELP VAX_SOFTWARE

GET VAX_SOFTWARE:GMAP.UUE

This provides the programs in uuencoded form, which can be
processed according to the information given in the HELP files.
Alternatively, fully functional programs can be obtained via file
transfer (ftp) from `FTP.EMBL-HEIDELBERG.DE' (192.54.41.33) by
giving the username ANONYMOUS and one's full e-mail address as
password. The DOS program GMAP can be obtained by changing the
directory (command: CD PUB/SOFTWARE/DOS), listing the available
files (command: DIR), setting the transfer mode to binary
(command: BINARY), and retrieving the desired program (command: GET
GMAP$.EXE).  After transfer, the ftp session is terminated
(command: QUIT).  GMAP$.EXE is a self-extracting DOS application,
providing the interested user with the program, all necessary
files and a read-me document.


ACKNOWLEDGEMENTS


This is communication no. 019/93 from the Institute of Microbial
Technology, Chandigarh, India and was supported by grants from
the Council of Scientific and Industrial Research & the
Department of Biotechnology, Government of India.  The authors
are grateful for the suggestions offered by the anonymous
referees. The authors are also thankful to Dr. Grish C. Varshney
and Mr. Mahavir Yadav for their help in preparing the manuscript.


REFERENCES


1.  Arentzen, R. and W.C. Ripka. 1984. Introduction of
restriction enzyme sites in protein-coding DNA sequences by site-
specific mutagenesis not affecting the amino acid sequence: a
computer program. Nucleic Acids Res. 12:777-787.

2.  Beintema, J.J., C. Schueller, M. Irie and A.
Carsajna. 1988. Molecular evolution of the ribonuclease
superfamily. Prog. Biophys. Molec. Biol.  51:165-192.

3.  Blackburn, P. and S. Moore. 1982. Pancreatic Ribonuclease,
p. 317-329. In P.D. Boyer (Ed.), The Enzymes, Vol 15, Academic
Press, New York.

4.  Cannon G. 1990. Nucleic acid sequence analysis software for
microcomputers. Anal. Biochem.  190:147-153.

5.  de Boer, J.G. 1991. MARS: A program to find potential
restriction sites. Comput. Applic. Biosci.  7:267-267.

6.  Higuchi, R., B. Krummel and R.K. Saiki. 1988. A general
method of in vitro preparation and specific mutagenesis of DNA
fragments: study of protein and DNA interactions. Nucleic Acids
Res.  16:7351-7367.

7.  Jiang, K., J. Zheng and S.B. Higgins. 1991. A generic
algorithm for finding restriction sites within DNA sequences.
Comput. Applic. Biosci.  7:249-256.

8.  Libertini, G. and A. Di Donato. 1992. Computer-aided gene
design. Protein Eng.  5:821-825.

9.  Little, J.W. and D.W. Mount. 1984. Creating new restriction
sites by silent changes in coding sequences. Gene 32:67-73.

10.  Malke, H., B. Roe and J.J. Ferretti. 1985. Nucleotide
sequence of the streptokinase gene from  Streptococcus
equisimilis H 46A. Gene  34:357-362.

11.  Mount, D.W. and B. Conrad. 1986. Improved programs for DNA
and protein sequence analysis on the IBM personal computer and
other standard computer systems. Nucleic Acids Res. 14:443-454.

12.  NC-IUB recommendations (Nomenclature Committee of the
International Union of Biochemistry) 1985. Nomenclature for
incompletely specified bases in nucleic acid sequences. Eur. J.
Biochem.  150:1-5.

13.  Presnell, S.R. and S.A. Benner. 1988. The design of
synthetic genes. Nucleic Acids Res.  16:1693-1702.

14.  Rice, C.M., R. Fuchs, D.G. Higgins, P.J. Stoehr and G.N.
Cameron. 1993. The EMBL data library. Nucleic Acids Res.
21:2967-2971.

15.  Roberts, J.R. and D. Macelis. 1993. REBASE-Restriction
enzymes and methylases. Nucleic Acids Res.  21:3125-3137.

16.  Shankarappa, B., D.A. Sirko and G.D. Ehrlich. 1992. A
general method for the identification of regions suitable for
site-directed silent mutagenesis. BioTechniques  12:382-384.

17.  Shankarappa, B., K. Vijayananda and G.D. Ehrlich.
1992. SILMUT: A computer program for the identification  of
regions suitable for silent mutagenesis to introduce restriction
enzyme recognition sequences.  BioTechniques  12:882-884.

18.  Vallette, F., E. Mege, A. Reiss and M. Adesnik.
1989. Construction of mutant and chimeric genes using the
polymerase chain reaction.  Nucleic Acids Res.  17:723-733.

19.  Wada, K., Y. Wada, F. Ishibashi, T. Gojobori and T.
Ikemura. 1992. Codon usage tabulated from the GenBank genetic
sequence data.  Nucleic Acids Res.  20:2111-2118.

20. Weiner, M.P., and H.A. Scheraga. 1989. A set of Macintosh
computer programs for the design and analysis of synthetic genes.
Comput. Applic. Biosci.  5:191-198.

