Implementing REBASE using Perl
The Restriction Enzyme Database (REBASE, for Restriction Enzyme dataBASE) is a collection of information
about restriction enzymes and related proteins. It contains published and unpublished references,
recognition and cleavage sites, isoschizomers, commercial availability, methylation sensitivity, crystal,
genome, and sequence data. DNA methyltransferases, homing endonucleases, nicking enzymes,
specificity subunits and control proteins are also included. Putative DNA methyltransferases and
restriction enzymes, as predicted from analysis of genomic sequences, are also listed. REBASE is
updated daily and is constantly expanding.
Background
Restriction enzymes are proteins that cut DNA at short, specific sequences; for example, the popular
restriction enzymes EcoRI and HindIII are widely used in the lab. EcoRI cuts where it finds GAATTC,
between the G and the A. In fact, it cuts both complementary strands, leaving an overhang on each end.
These “sticky ends” of a few single-stranded bases let the fragments rejoin, which makes it possible
to insert DNA into vectors for cloning and sequencing, for instance. HindIII cuts at AAGCTT, between
the two As. Some restriction enzymes cut in the middle of the site and leave “blunt ends” with no
overhang. About 1,000 restriction enzymes are known.
If you look at the reverse complement of the EcoRI recognition site, you see it’s GAATTC, the same
sequence. This is a biological version of a palindrome, a word that reads the same in reverse. Many
restriction sites are palindromes.
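The palindrome property is easy to check in code. The following is a minimal sketch, not part of the original program; the subroutine names revcom and is_palindromic_site are made up for this illustration. It reverse-complements a site and compares the result with the original, using GAATTC and AAGCTT from the text and the AarI site CACCTGC from the bionet listing as a non-palindromic example.

```perl
use strict;
use warnings;

# Return the reverse complement of a DNA sequence (A<->T, C<->G).
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;            # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;     # complement each base
    return $rc;
}

# A site is a biological palindrome if it equals its own reverse complement.
sub is_palindromic_site {
    my ($site) = @_;
    return uc($site) eq uc( revcom($site) ) ? 1 : 0;
}

for my $site (qw(GAATTC AAGCTT CACCTGC)) {
    printf "%-8s %-8s %s\n", $site, revcom($site),
        is_palindromic_site($site) ? 'palindrome' : 'not a palindrome';
}
```

The reverse complement of CACCTGC is GCAGGTG, which is the sequence that appears in the second AarI line of the bionet listing.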
Computing restriction maps is a common and practical bioinformatics calculation in the laboratory.
Restriction maps are computed to plan experiments, to find the best way to cut DNA to insert a gene,
to make a site-specific mutation, or for several other applications of recombinant DNA techniques. By
computing first, the laboratory scientist saves considerably on the necessary trial-and-error at the
laboratory bench. For more about restriction enzymes, see the official REBASE site at
http://rebase.neb.com/rebase/rebase.html.
Perl Program Implementation of REBASE
This program looks for restriction sites in a sequence of DNA and reports a restriction map showing
exactly where in the DNA each site occurs. Restriction enzyme data
can be found at the Restriction Enzyme Database (REBASE), which is on the web at
http://rebase.neb.com/rebase/rebase.files.html.
Restriction Enzyme Data
The restriction enzyme data is available in a variety of formats, as a visit to the REBASE web site will
show you. After looking around, you decide to get the information from the bionet file, which has a
fairly simple layout. Here’s the header and a few restriction enzymes from that file:
REBASE version 104                                              bionet.104

    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    REBASE, The Restriction Enzyme Database   http://rebase.neb.com
    Copyright (c) Dr. Richard J. Roberts, 2001.   All rights reserved.
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Rich Roberts                                                    Mar 30 2001

AaaI (XmaIII)       C^GGCCG
AacI (BamHI)        GGATCC
AaeI (BamHI)        GGATCC
AagI (ClaI)         AT^CGAT
AaqI (ApaLI)        GTGCAC
AarI                CACCTGCNNNN^
AarI                ^NNNNNNNNGCAGGTG
AatI (StuI)         AGG^CCT
AatII               GACGT^C
AauI (Bsp1407I)     T^GTACA
AbaI (BclI)         T^GATCA
AbeI (BbvCI)        CC^TCAGC
AbeI (BbvCI)        GC^TGAGG
AbrI (XhoI)         C^TCGAG
AcaI (AsuII)        TTCGAA
AcaII (BamHI)       GGATCC
AcaIII (MstI)       TGCGCA
AcaIV (HaeIII)      GGCC
AccI                GT^MKAC
AccII (FnuDII)      CG^CG
AccIII (BspMII)     T^CCGGA
Acc16I (MstI)       TGC^GCA
Acc36I (BspMI)      ACCTGCNNNN^
Acc36I (BspMI)      ^NNNNNNNNGCAGGT
Acc38I (EcoRII)     CCWGG
Acc65I (KpnI)       G^GTACC
Acc113I (ScaI)      AGT^ACT
AccB1I (HgiCI)      G^GYRCC
AccB2I (HaeII)      RGCGC^Y
AccB7I (PflMI)      CCANNNN^NTGG
AccBSI (BsrBI)      CCG^CTC
AccBSI (BsrBI)      GAG^CGG
AccEBI (BamHI)      G^GATCC
AceI (TseI)         G^CWGC
AceII (NheI)        GCTAG^C
AceIII              CAGCTCNNNNNNN^
AceIII              ^NNNNNNNNNNNGAGCTG
AciI                C^CGC
AciI                G^CGG
AclI                AA^CGTT
AclNI (SpeI)        A^CTAGT
AclWI (BinI)        GGATCNNNN^
The first task is to read this file and get the name and the recognition site (or restriction site) of each
enzyme. To simplify matters for now, simply discard the parenthesized enzyme names.
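As a rough illustration of that first task, the sketch below splits single data lines into a name and a site. It is not the program's parser; the subroutine name parse_bionet_line is hypothetical, and the two sample lines are copied from the listing above. It assumes the name is always the first whitespace-separated field and the site the last, so the parenthesized name (when present) is simply skipped.

```perl
use strict;
use warnings;

# Split one bionet data line into (name, site), ignoring the optional
# parenthesized name in the middle. Returns an empty list for lines
# with fewer than two fields.
sub parse_bionet_line {
    my ($line) = @_;
    my @fields = split ' ', $line;      # split on runs of whitespace
    return unless @fields >= 2;         # not a data line
    return ( $fields[0], $fields[-1] ); # first field and last field
}

for my $line ( 'AaaI (XmaIII)       C^GGCCG',
               'AatII               GACGT^C' ) {
    my ( $name, $site ) = parse_bionet_line($line);
    print "$name => $site\n";
}
```

Taking the first and last fields handles both the two-field and the three-field lines with the same code.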
Pseudocode:
    # Get DNA
    get_file_data
    extract_sequence_from_fasta_data
    # Get the REBASE data into a hash, from file "bionet"
    parseREBASE('bionet')
    for each user query {
        if query is defined in the hash {
            get positions of query in DNA
            report on positions, if any
        }
    }
The core code of the program is below:
Step 1: Translate IUB ambiguity codes to regular expressions
#
# Translate IUB ambiguity codes to regular expressions
#
# A subroutine that, given a sequence with IUB ambiguity codes,
# outputs a translation with IUB codes changed to regular expressions
#
# These are the IUB ambiguity codes
#   R = G or A
#   Y = C or T
#   M = A or C
#   K = G or T
#   S = G or C
#   W = A or T
#   B = not A (C or G or T)
#   D = not C (A or G or T)
#   H = not G (A or C or T)
#   V = not T (A or C or G)
#   N = A or C or G or T
#
sub IUB_to_regexp
{
    my($iub) = @_;
    my $regular_expression = '';
    my %iub2character_class = (
        A => 'A',
        C => 'C',
        G => 'G',
        T => 'T',
        R => '[GA]',
        Y => '[CT]',
        M => '[AC]',
        K => '[GT]',
        S => '[GC]',
        W => '[AT]',
        B => '[CGT]',
        D => '[AGT]',
        H => '[ACT]',
        V => '[ACG]',
        N => '[ACGT]',
    );
    # Remove the ^ signs from the recognition sites
    $iub =~ s/\^//g;
    # Translate each character in the iub sequence
    for ( my $i = 0 ; $i < length($iub) ; ++$i )
    {
        $regular_expression .= $iub2character_class{substr($iub, $i, 1)};
    }
    return $regular_expression;
}
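To see what the translation produces, here is a small self-contained sketch. It uses the same lookup table as IUB_to_regexp but a hypothetical stricter variant, iub_to_regexp_strict, which is not in the original program: it upper-cases its input and dies on any character that is not an IUB code, where the original would append an undefined value. The site G^AATTC is EcoRI as described in the Background section; the others come from the bionet listing.

```perl
use strict;
use warnings;

# IUB code => regex fragment (same mapping as in IUB_to_regexp).
my %iub2character_class = (
    A => 'A', C => 'C', G => 'G', T => 'T',
    R => '[GA]', Y => '[CT]', M => '[AC]', K => '[GT]',
    S => '[GC]', W => '[AT]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Hypothetical stricter variant: upper-cases the input and dies on any
# character that is not an IUB code.
sub iub_to_regexp_strict {
    my ($iub) = @_;
    $iub = uc $iub;
    $iub =~ s/\^//g;                       # drop cut-site markers
    my $regexp = '';
    for my $base ( split //, $iub ) {
        die "Unknown IUB code '$base'\n"
            unless exists $iub2character_class{$base};
        $regexp .= $iub2character_class{$base};
    }
    return $regexp;
}

for my $site (qw(G^AATTC G^GYRCC GT^MKAC CCANNNN^NTGG)) {
    printf "%-14s %s\n", $site, iub_to_regexp_strict($site);
}
```

For example, G^GYRCC becomes GG[CT][GA]CC: the caret is dropped and each ambiguity code turns into a character class.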
Step 2: Read and parse the REBASE bionet file
#
# Subroutine to parse a REBASE bionet datafile
#
# A subroutine to return a hash where
#    key   = restriction enzyme name
#    value = whitespace-separated recognition site and regular expression
#
sub parseREBASE
{
    my($rebasefile) = @_;
    use strict;
    use warnings;
    # Declare variables
    my @rebasefile = ( );
    my %rebase_hash = ( );
    my $name;
    my $site;
    my $regexp;
    # Read in the REBASE file
    # (open_file is a helper that is not shown in this listing;
    # it must return an open filehandle)
    my $rebase_filehandle = open_file($rebasefile);
    while(<$rebase_filehandle>)
    {
        # Discard header lines: everything from line 1 through
        # the line containing "Rich Roberts"
        ( 1 .. /Rich Roberts/ ) and next;
        # Discard blank lines
        /^\s*$/ and next;
        # Split the two (or three if includes parenthesized name) fields
        my @fields = split( " ", $_);
        # Get and store the name and the recognition site
        # Remove parenthesized names, for simplicity's sake,
        # by not saving the middle field, if any,
        # just the first and last
        $name = shift @fields;
        $site = pop @fields;
        # Translate the recognition sites to regular expressions
        $regexp = IUB_to_regexp($site);
        # Store the data into the hash
        # (an enzyme listed on two lines, such as AarI, keeps only its last line)
        $rebase_hash{$name} = "$site $regexp";
    }
    # Return the hash containing the reformatted REBASE data
    return %rebase_hash;
}
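The header-skipping line in parseREBASE relies on Perl's range operator in scalar context, which is easy to misread. The sketch below isolates that idiom on an in-memory stand-in for the bionet file, so it runs without any data file. The subroutine name data_lines and the shortened sample text are assumptions made for this illustration.

```perl
use strict;
use warnings;

# A tiny stand-in for the bionet file: header lines, then data lines.
my $text = <<'END';
REBASE version 104                                              bionet.104

Rich Roberts                                                    Mar 30 2001

AatII               GACGT^C
AbrI (XhoI)         C^TCGAG
END

# Return the data lines, skipping everything up to and including the
# line that matches /Rich Roberts/, and skipping blank lines.
sub data_lines {
    my ($fh) = @_;
    my @lines;
    while ( my $line = <$fh> ) {
        # In scalar context, a constant operand of .. is compared with $.
        # (the current input line number), so this range is true from
        # line 1 through the first line matching the pattern.
        next if 1 .. $line =~ /Rich Roberts/;
        next if $line =~ /^\s*$/;
        chomp $line;
        push @lines, $line;
    }
    return @lines;
}

open( my $fh, '<', \$text ) or die "Cannot open in-memory file: $!\n";
print "$_\n" for data_lines($fh);
close $fh;
```

Only the two enzyme lines are printed. The header and the blank lines around it are dropped.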
Step 3: Make restriction map from user queries
#!/usr/bin/perl -w
# Make restriction map from user queries on names of restriction enzymes
use strict;
use warnings;
# Declare and initialize variables
my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );
# Read in the file "sample.dna"
@file_data = get_file_data("sample.dna");
# Extract the DNA sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);
# Get the REBASE data into a hash, from file "bionet"
%rebase_hash = parseREBASE('bionet');
# Prompt user for restriction enzyme names, create restriction map
do
{
    print "Search for what restriction enzyme (or quit)?: ";
    $query = <STDIN>;
    chomp $query;
    # Exit if empty query
    if ($query =~ /^\s*$/ )
    {
        exit;
    }
    # Perform the search in the DNA sequence
    if ( exists $rebase_hash{$query} )
    {
        ($recognition_site, $regexp) = split (" ", $rebase_hash{$query});
        # Create the restriction map
        @locations = match_positions($regexp, $dna);
        # Report the restriction map to the user
        if (@locations)
        {
            print "Searching for $query $recognition_site $regexp\n";
            print "A restriction site for $query at locations:\n";
            print join(" ", @locations), "\n";
        }
        else
        {
            print "A restriction site for $query is not in the DNA.\n";
        }
    }
    print "\n";
}
until ( $query =~ /quit/ );
exit;
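################################################################################
#
# Subroutine
#
# Find locations of a match of a regular expression in a string
#
#
# return an array of positions where the regular expression
# appears in the string
#
sub match_positions
{
    my($regexp, $sequence) = @_;
    use strict;
    #
    # Declare variables
    #
    my @positions = ( );
    #
    # NOTE: the source text breaks off after the declaration above.
    # The rest of this subroutine is a minimal completion based on the
    # description in the header comment. Case-insensitive matching and
    # 1-based start positions are assumptions.
    #
    while ( $sequence =~ /$regexp/ig )
    {
        # $-[0] is the 0-based offset where the current match starts
        push( @positions, $-[0] + 1 );
    }
    return @positions;
}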
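The listings call three helpers that this document never shows: open_file, get_file_data and extract_sequence_from_fasta_data. What follows is a hedged sketch of what they might look like, inferred only from how they are called; the bodies, the error messages and the handling of "#" comment lines are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading and return the filehandle.
sub open_file {
    my ($filename) = @_;
    open( my $fh, '<', $filename )
        or die "Cannot open file '$filename': $!\n";
    return $fh;
}

# Hypothetical helper: return all lines of a file as a list.
sub get_file_data {
    my ($filename) = @_;
    my $fh    = open_file($filename);
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# Hypothetical helper: join the sequence lines of FASTA data into one
# string, skipping the ">" header line, blank lines and "#" comment
# lines (the comment convention is an assumption), then removing
# all whitespace.
sub extract_sequence_from_fasta_data {
    my (@fasta_file_data) = @_;
    my $sequence = '';
    for my $line (@fasta_file_data) {
        next if $line =~ /^\s*$/;    # blank line
        next if $line =~ /^\s*#/;    # comment line
        next if $line =~ /^>/;       # FASTA header line
        $sequence .= $line;
    }
    $sequence =~ s/\s//g;
    return $sequence;
}

# Usage with in-memory FASTA lines, so no file is needed:
my @fasta = ( ">sample\n", "ACGTGAATTC\n", "\n", "ggatccAAGCTT\n" );
print extract_sequence_from_fasta_data(@fasta), "\n";
```

With helpers like these in the same file as IUB_to_regexp, parseREBASE and match_positions, the main program should have every subroutine it calls, provided the files sample.dna and bionet exist in the working directory.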