
Implementing REBASE using Perl

The Restriction Enzyme Database (Restriction Enzyme dataBASE, or REBASE) is a collection of information

about restriction enzymes and related proteins. It contains published and unpublished references,

recognition and cleavage sites, isoschizomers, commercial availability, methylation sensitivity, crystal,

genome, and sequence data. DNA methyltransferases, homing endonucleases, nicking enzymes,

specificity subunits and control proteins are also included. Putative DNA methyltransferases and

restriction enzymes, as predicted from analysis of genomic sequences, are also listed. REBASE is

updated daily and is constantly expanding.

Background

Restriction enzymes are proteins that cut DNA at short, specific sequences; for example, the popular
restriction enzymes EcoRI and HindIII are widely used in the lab. EcoRI cuts where it finds GAATTC,
between the G and the A. In fact, it cuts both complementary strands, leaving an overhang on each end.
These “sticky ends” of a few single-stranded bases let the fragments re-form, which makes it possible
to insert DNA into vectors for cloning and sequencing, for instance. HindIII recognizes AAGCTT and cuts
between the two As. Some restriction enzymes cut in the middle of the site and leave “blunt ends” with
no overhang. About 1,000 restriction enzymes are known.
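To make the cutting rule concrete, here is a minimal sketch that splits one strand of DNA at every occurrence of a site written in caret notation, such as G^AATTC for EcoRI. The subroutine name digest and the sample sequence are made up for illustration, and the sketch assumes a literal site with no ambiguity codes and models only the top strand, so it does not show the overhangs themselves.

```perl
use strict;
use warnings;

# Cut a DNA string at every occurrence of a site given in caret
# notation (e.g. 'G^AATTC'). Hypothetical helper: literal sites only,
# top strand only.
sub digest {
    my ( $dna, $site_with_caret ) = @_;

    # Number of bases of the site that lie before the cut
    my $offset = index( $site_with_caret, '^' );
    die "Site '$site_with_caret' has no cut marker\n" if $offset < 0;

    ( my $site = $site_with_caret ) =~ s/\^//;

    my @fragments;
    my $start = 0;
    while ( $dna =~ /\Q$site\E/g ) {
        my $cut = $-[0] + $offset;    # $-[0] is the start of this match
        push @fragments, substr( $dna, $start, $cut - $start );
        $start = $cut;
    }
    push @fragments, substr( $dna, $start );    # remainder after last cut
    return @fragments;
}

print join( ' | ', digest( 'TTGAATTCAAGAATTCGG', 'G^AATTC' ) ), "\n";
```

With the sample sequence above, the two EcoRI sites produce three fragments, and each fragment after the first begins with AATTC.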

If you take the reverse complement of the EcoRI recognition site, you get GAATTC, the same
sequence. This is a biological version of a palindrome, a word that reads the same in reverse. Many
restriction sites are palindromes.
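The palindrome property is easy to check in code. The following sketch reverses a site and complements each base, then compares the result with the original. The subroutine names reverse_complement and is_palindromic_site are illustrative choices, and the code assumes plain A, C, G, T input without ambiguity codes.

```perl
use strict;
use warnings;

# Return the reverse complement of a DNA string (plain A, C, G, T only).
sub reverse_complement {
    my ($dna) = @_;
    my $revcom = reverse $dna;           # scalar context: reverse the string
    $revcom =~ tr/ACGTacgt/TGCAtgca/;    # complement each base
    return $revcom;
}

# A site is palindromic in the biological sense when it equals
# its own reverse complement.
sub is_palindromic_site {
    my ($site) = @_;
    return uc($site) eq uc( reverse_complement($site) );
}

for my $site (qw(GAATTC AAGCTT CCTCAGC)) {
    printf "%-8s %-8s %s\n",
        $site, reverse_complement($site),
        is_palindromic_site($site) ? 'palindromic' : 'not palindromic';
}
```

GAATTC and AAGCTT come back unchanged, while CCTCAGC becomes GCTGAGG, which is why a non-palindromic site has to be searched for in both orientations.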

Computing restriction maps is a common and practical bioinformatics calculation in the laboratory.

Restriction maps are computed to plan experiments, to find the best way to cut DNA to insert a gene,

to make a site-specific mutation, or for several other applications of recombinant DNA techniques. By

computing first, the laboratory scientist saves considerably on the necessary trial-and-error at the

laboratory bench. Look for more about restriction enzymes at the official site of REBASE

http://rebase.neb.com/rebase/rebase.html.

Perl Program Implementation of REBASE

This program looks for restriction enzyme recognition sites in a sequence of DNA and reports a
restriction map showing exactly where in the DNA each site appears. Restriction enzyme data
can be found at the Restriction Enzyme Database (REBASE), which is on the web at
http://rebase.neb.com/rebase/rebase.files.html.

Restriction Enzyme Data

The restriction enzyme data is available in a variety of formats, as a visit to the REBASE web site will
show you. After looking around, you decide to get the information from the bionet file, which has a
fairly simple layout. Here’s the header and a few restriction enzymes from that file:

REBASE version 104                                              bionet.104

    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    REBASE, The Restriction Enzyme Database   http://rebase.neb.com
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Rich Roberts                                                    Mar 30 2001

AaaI (XmaIII)                     C^GGCCG
AacI (BamHI)                      GGATCC
AaeI (BamHI)                      GGATCC
AagI (ClaI)                       AT^CGAT
AaqI (ApaLI)                      GTGCAC
AarI                              CACCTGCNNNN^
AarI                              ^NNNNNNNNGCAGGTG
AatI (StuI)                       AGG^CCT
AatII                             GACGT^C
AauI (Bsp1407I)                   T^GTACA
AbaI (BclI)                       T^GATCA
AbeI (BbvCI)                      CC^TCAGC
AbeI (BbvCI)                      GC^TGAGG
AbrI (XhoI)                       C^TCGAG
AcaI (AsuII)                      TTCGAA
AcaII (BamHI)                     GGATCC
AcaIII (MstI)                     TGCGCA
AcaIV (HaeIII)                    GGCC
AccI                              GT^MKAC
AccII (FnuDII)                    CG^CG
AccIII (BspMII)                   T^CCGGA
Acc16I (MstI)                     TGC^GCA
Acc36I (BspMI)                    ACCTGCNNNN^
Acc36I (BspMI)                    ^NNNNNNNNGCAGGT
Acc38I (EcoRII)                   CCWGG
Acc65I (KpnI)                     G^GTACC
Acc113I (ScaI)                    AGT^ACT
AccB1I (HgiCI)                    G^GYRCC
AccB2I (HaeII)                    RGCGC^Y
AccB7I (PflMI)                    CCANNNN^NTGG
AccBSI (BsrBI)                    CCG^CTC
AccBSI (BsrBI)                    GAG^CGG
AccEBI (BamHI)                    G^GATCC
AceI (TseI)                       G^CWGC
AceII (NheI)                      GCTAG^C
AceIII                            CAGCTCNNNNNNN^
AceIII                            ^NNNNNNNNNNNGAGCTG
AciI                              C^CGC
AciI                              G^CGG
AclI                              AA^CGTT
AclNI (SpeI)                      A^CTAGT
AclWI (BinI)                      GGATCNNNN^

The first task is to read this file and get the names and the recognition site (or restriction site) for each

enzyme. To simplify matters for now, simply discard the parenthesized enzyme names.
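As a small illustration of this first task, the sketch below splits a single bionet data line on whitespace and keeps only the first and last fields, so the parenthesized name in the middle is dropped when it is present. The subroutine name parse_bionet_line is hypothetical and is not part of the program described here.

```perl
use strict;
use warnings;

# Split one bionet data line into enzyme name and recognition site.
# The optional parenthesized name in the middle is dropped.
sub parse_bionet_line {
    my ($line) = @_;
    my @fields = split ' ', $line;    # split on runs of whitespace
    return unless @fields >= 2;       # fewer than two fields: nothing to parse
    return ( $fields[0], $fields[-1] );
}

for my $line ( 'AaaI (XmaIII)                     C^GGCCG',
               'AccI                              GT^MKAC' )
{
    my ( $name, $site ) = parse_bionet_line($line);
    print "$name => $site\n";
}
```

Taking the first and last fields works for both the two-field and the three-field lines, which is the same idea the full parser uses with shift and pop.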

Pseudocode:

# Get DNA
get_file_data
extract_sequence_from_fasta_data

# Get the REBASE data into a hash, from file "bionet"
parseREBASE('bionet');

for each user query {
    If query is defined in the hash
        Get positions of query in DNA
        Report on positions, if any
}

The core code of the program is shown below:


Step 1: Translate IUB ambiguity codes to regular expressions

# A subroutine that, given a sequence with IUB ambiguity codes,
# outputs a translation with IUB codes changed to regular expressions
#
# These are the IUB ambiguity codes:
#
#   R = G or A
#   Y = C or T
#   M = A or C
#   K = G or T
#   S = G or C
#   W = A or T
#   B = not A (C or G or T)
#   D = not C (A or G or T)
#   H = not G (A or C or T)
#   V = not T (A or C or G)
#   N = A or C or G or T

sub IUB_to_regexp
{
    my($iub) = @_;

    my $regular_expression = '';

    # Each IUB code maps to a literal base or a character class
    my %iub2character_class = (
        A => 'A',
        C => 'C',
        G => 'G',
        T => 'T',
        R => '[GA]',
        Y => '[CT]',
        M => '[AC]',
        K => '[GT]',
        S => '[GC]',
        W => '[AT]',
        B => '[CGT]',
        D => '[AGT]',
        H => '[ACT]',
        V => '[ACG]',
        N => '[ACGT]',
    );

    # Remove the ^ signs from the recognition sites
    $iub =~ s/\^//g;

    # Translate each character in the iub sequence
    for ( my $i = 0 ; $i < length($iub) ; ++$i )
    {
        $regular_expression .= $iub2character_class{substr($iub, $i, 1)};
    }

    return $regular_expression;
}
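The translation can be tried on a site from the bionet excerpt. The following is a self-contained variant written for illustration: the name iub_to_regexp_checked is hypothetical, and unlike the subroutine above it uppercases its input and dies on a character that is not an IUB code, which is an added assumption rather than behavior of the original.

```perl
use strict;
use warnings;

my %IUB_CLASS = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T',
    R => '[GA]',  Y => '[CT]',  M => '[AC]',  K => '[GT]',
    S => '[GC]',  W => '[AT]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Translate a recognition site with IUB codes into a regular expression,
# rejecting any character that is not a known code.
sub iub_to_regexp_checked {
    my ($iub) = @_;
    $iub =~ s/\^//g;    # drop the cut-site marker
    my $regexp = '';
    for my $code ( split //, uc $iub ) {
        die "Unknown IUB code '$code'\n" unless exists $IUB_CLASS{$code};
        $regexp .= $IUB_CLASS{$code};
    }
    return $regexp;
}

# AccI site from the bionet excerpt
print iub_to_regexp_checked('GT^MKAC'), "\n";    # prints GT[AC][GT]AC
```

The resulting pattern matches GTCGAC, for example, because C is one of the bases allowed by M and G is one of the bases allowed by K.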

Step 2: Read and parse the REBASE bionet file

# Subroutine to parse a REBASE bionet datafile
#
# A subroutine to return a hash where
#    key   = restriction enzyme name
#    value = whitespace-separated recognition site and regular expression

sub parseREBASE
{
    my($rebasefile) = @_;

    use strict;
    use warnings;

    # Declare variables
    my %rebase_hash = ( );
    my $name;
    my $site;
    my $regexp;

    # Read in the REBASE file
    # (open_file is a helper that returns a filehandle; it is not shown in this listing)
    my $rebase_filehandle = open_file($rebasefile);

    while(<$rebase_filehandle>)
    {
        # Discard header lines (everything up to and including the "Rich Roberts" line)
        ( 1 .. /Rich Roberts/ ) and next;

        # Discard blank lines
        /^\s*$/ and next;

        # Split the two (or three if includes parenthesized name) fields
        my @fields = split( " ", $_);

        # Get and store the name and the recognition site
        # Remove parenthesized names, for simplicity's sake,
        # by not saving the middle field, if any,
        # just the first and last
        $name = shift @fields;
        $site = pop @fields;

        # Translate the recognition sites to regular expressions
        $regexp = IUB_to_regexp($site);

        # Store the data into the hash
        # (an enzyme listed on two lines keeps only its last site)
        $rebase_hash{$name} = "$site $regexp";
    }

    # Return the hash containing the reformatted REBASE data
    return %rebase_hash;
}
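The listings call three helper subroutines, open_file, get_file_data and extract_sequence_from_fasta_data, whose definitions do not appear in this document. The sketch below shows one plausible way to write them so that the program can run. The bodies are assumptions based only on how the helpers are called here, not the original implementations.

```perl
use strict;
use warnings;

# Open a file for reading and return the filehandle, or die with a message.
sub open_file {
    my ($filename) = @_;
    open( my $fh, '<', $filename )
        or die "Cannot open file $filename: $!\n";
    return $fh;
}

# Return the lines of a file as a list.
sub get_file_data {
    my ($filename) = @_;
    my $fh    = open_file($filename);
    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# Join the sequence lines of FASTA data into one string, skipping
# the ">" header line, comment lines and blank lines.
sub extract_sequence_from_fasta_data {
    my (@fasta_file_data) = @_;
    my $sequence = '';
    for my $line (@fasta_file_data) {
        next if $line =~ /^\s*$/;    # blank line
        next if $line =~ /^\s*#/;    # comment line
        next if $line =~ /^>/;       # FASTA header line
        $sequence .= $line;
    }
    $sequence =~ s/\s//g;            # remove newlines and spaces
    return $sequence;
}

# In-memory example: a header line followed by two sequence lines
print extract_sequence_from_fasta_data( ">sample\n", "GAATTC\n", "AAGCTT\n" ), "\n";
```

Returning one whitespace-free string matters for the search step, because a recognition site that spans a line break in the FASTA file would otherwise be missed.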

Step 3: Make restriction map from user queries

#!/usr/bin/perl -w
# Make restriction map from user queries on names of restriction enzymes

use strict;
use warnings;

# Declare and initialize variables
my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );

# Read in the file "sample.dna"
@file_data = get_file_data("sample.dna");

# Extract the DNA sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);

# Get the REBASE data into a hash, from file "bionet"
%rebase_hash = parseREBASE('bionet');

# Prompt user for restriction enzyme names, create restriction map
do
{
    print "Search for what restriction enzyme (or quit)?: ";

    $query = <STDIN>;
    chomp $query;

    # Exit if empty query
    if ($query =~ /^\s*$/ )
    {
        exit;
    }

    # Perform the search in the DNA sequence
    if ( exists $rebase_hash{$query} )
    {
        ($recognition_site, $regexp) = split (" ", $rebase_hash{$query});

        # Create the restriction map
        @locations = match_positions($regexp, $dna);

        # Report the restriction map to the user
        if (@locations)
        {
            print "Searching for $query $recognition_site $regexp\n";
            print "A restriction site for $query at locations:\n";
            print join(" ", @locations), "\n";
        }
        else
        {
            print "A restriction site for $query is not in the DNA\n";
        }
    }

    print "\n";
}
until ( $query =~ /quit/ );

exit;

################################################################################
# Subroutine
#
# Find locations of a match of a regular expression in a string
#
# return an array of positions where the regular expression
# appears in the string

sub match_positions
{
    my($regexp, $sequence) = @_;

    use strict;

    # Declare variables
    my @positions = ( );
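The match_positions listing stops after declaring @positions, so the rest of its body is not available here. What follows is a minimal sketch of how such a subroutine could be completed, assuming it reports 1-based start positions and matches case-insensitively. Both choices are assumptions, as is the sample sequence.

```perl
use strict;
use warnings;

# Return the 1-based start positions of every match of $regexp
# in $sequence (non-overlapping matches, case-insensitive).
sub match_positions {
    my ( $regexp, $sequence ) = @_;
    my @positions = ();

    # Scan with a global match; $-[0] is the 0-based offset at
    # which the current match starts.
    while ( $sequence =~ /$regexp/ig ) {
        push @positions, $-[0] + 1;
    }
    return @positions;
}

my $dna = 'CCGAATTCGGAAGCTTGAATTC';
print join( ' ', match_positions( 'GAATTC', $dna ) ), "\n";    # prints 3 17
```

Because a global match resumes after the end of the previous match, overlapping occurrences of a site are not reported by this sketch.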
