Homology
Similarity
The rule of transitivity
Contents of the lecture
1 Homology
   Definition
   Homology and similarity
2 Similarity
   Measures of similarity
   Scoring matrices for proteins
   Statistical meaning
3 The rule of transitivity
Evolution of organisms and genes
The evolution of organisms (speciation) is the effect of differentiation and selection. Species differentiate by accumulating changes (mutations) in their genomes.
The study of evolution at the level of genes and genomes is the subject of molecular phylogenetics, or molecular evolution.
A group of species that diverged from a common ancestor is called a clade (sometimes a phylum, but the latter term has a more specific meaning in taxonomy). Clade is a general term applicable both to groups of related organisms and to groups of related genes.
Genes from a single clade (having a common ancestor gene) are called homologous. This term can NOT be applied to organisms.
Homology of biological sequences
Homology means having the same ancestor gene, period.
There are no measures of homology (or could you have 35% of a brother?)
The term "homologous" is commonly applied to proteins and RNA in the sense that they are products of homologous genes.
The problem is that we usually make (educated) guesses about homology. It is OK to state "there is a 10% chance of homology" if you have calculated such a probability.
In biology, a homologous trait has the broader sense of any characteristic (e.g. a phenotypic feature) derived from a common ancestor.
Variants of homology
Evolution of genes does not necessarily follow speciation.
A gene may be duplicated within a species. If both copies survive in the genome, they often diverge to provide specialized or different functions. Such homologs are called paralogs.
Genes may be transferred from one organism to another, even a completely unrelated one. This is especially common among prokaryotes (that is how resistance to antibiotics spreads), but it happens in higher organisms, too. Homologous genes transferred to unrelated organisms are called xenologs.
The "proper" homology, in which genes accumulate mutations during speciation of their host organisms while (usually) preserving their function, is called orthology.
Variants of homology
Homology     Differentiated due to   Gene function
orthology    speciation              usually the same
paralogy     gene duplication        similar or different
xenology     gene transfer           usually the same
A more complex combination of the above is possible.
As a rule of thumb: homologs from two species related by a common
ancestor are probably orthologs; homologs from the same organism
are paralogs.
Discerning xenology is more difficult: it should be proven that homologs of the gene are missing in the ancestor. Still, the existence of homologs in unrelated species is a strong suggestion.
Exercise
Long, long ago, there lived a cell X with gene A providing function α.
A great-great-...-grandchild of cell X lysed in a horrible way, exposing his DNA to the public. A spectator Y got transfected and discovered himself to have gained a useful function α.
Zillions of years later, a student studied descendants of X (dX) and descendants of Y (dY). He discovered homology between gene A' from dX, with function α, and two genes B and C from dY. B provided function α, and C function β.
Name the homology relations between A, A', B and C.
Similarity
Similarity of sequences can be measured.
Many different measures of similarity are in use.
The seemingly most intuitive measure of similarity, the percentage of identical residues, is the most misleading one!
a) AGCT
b) C
Sequence a) has 25% of its residues identical with b)
a) C
b) AGCT
Sequence a) has 100% of its residues identical with b)
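The asymmetry of percent identity can be checked with a few lines of Python. This is only a sketch: the function name `percent_identity` and the `shift` parameter (where the second sequence is placed against the first, without gaps) are assumptions made for illustration, not part of the lecture.

```python
def percent_identity(a: str, b: str, shift: int = 0) -> float:
    """Percent of the residues of `a` that face an identical residue of `b`.

    `b` is placed so that b[0] lies opposite a[shift]; no gaps are allowed.
    The result is normalized by the length of `a`, which is what makes
    the measure asymmetric.
    """
    matches = sum(
        1
        for i, residue in enumerate(a)
        if 0 <= i - shift < len(b) and residue == b[i - shift]
    )
    return 100.0 * matches / len(a)


# The single C of the short sequence is placed opposite the C of AGCT.
print(percent_identity("AGCT", "C", shift=2))   # 25.0
print(percent_identity("C", "AGCT", shift=-2))  # 100.0
```

The same single matching pair yields 25% or 100% depending only on which sequence is used for normalization.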
Relation between similarity and homology
"Invention" of a functional protein structure is extraordinarily rare.
Mutations are accepted if protein function is not destroyed.
Retained function means retained structure.
Some amino acids at specific positions must be retained, too, but only a limited number.
Other amino acids are replaceable.
The genetic code is degenerate.
Conclusions:
High similarity over a significant length is a consequence of homology.
Low similarity neither suggests nor excludes homology.
Counting residues
All similarity measures count identical and different residues in compared sequences and sum up scores:
$$S = \sum_{i=1}^{\text{length}} s_i$$
Identity matrix
    A   G   T   C
A   1   0   0   0
G   0   1   0   0
T   0   0   1   0
C   0   0   0   1
Transition vs Transversion
    A   G   T   C
A   2   1   0   0
G   1   2   0   0
T   0   0   2   1
C   0   0   1   2
BLASTN
    A   G   T   C
A   5  -4  -4  -4
G  -4   5  -4  -4
T  -4  -4   5  -4
C  -4  -4  -4   5
Example:
AAAGGGCCCTTT
AGCGATCTATCG
Identity matrix: S = 4
Transition vs Transversion: S = 12
BLASTN: S = -12
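The three scores for the example pair can be reproduced with a short Python sketch. The helper names `make_matrix` and `raw_score` are hypothetical, and the BLASTN matrix is assumed here to score +5 for a match and -4 for a mismatch.

```python
BASES = "AGTC"  # row and column order used in the matrices


def make_matrix(rows):
    """Turn four rows (order A, G, T, C) into a dict keyed by (base, base)."""
    return {
        (x, y): rows[i][j]
        for i, x in enumerate(BASES)
        for j, y in enumerate(BASES)
    }


IDENTITY = make_matrix([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
# Transitions (A<->G, T<->C) score 1, transversions 0, identities 2.
TRANSITION = make_matrix([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]])
# Assumed BLASTN-style scores: match +5, mismatch -4.
BLASTN = make_matrix(
    [[5, -4, -4, -4], [-4, 5, -4, -4], [-4, -4, 5, -4], [-4, -4, -4, 5]]
)


def raw_score(seq1, seq2, matrix):
    """Sum the matrix scores s_i over all aligned residue pairs (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(matrix[pair] for pair in zip(seq1, seq2))


first, second = "AAAGGGCCCTTT", "AGCGATCTATCG"
print(raw_score(first, second, IDENTITY))    # 4
print(raw_score(first, second, TRANSITION))  # 12
print(raw_score(first, second, BLASTN))      # -12
```

The same pair of sequences gets three different raw scores, which is why a raw score is meaningless without naming the matrix.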
Scoring similarity with matrices
The table of score points used to calculate similarity is called a scoring matrix or substitution matrix.
The sum of scores for every pair of compared residues is called the raw score. It depends on the scoring matrix and is proportional to the compared length of the sequences (of course!).
A raw score does not mean anything if we don't know which scoring matrix was used, whereas a good similarity measure must give consistent results.
Bit score
The normalized score of similarity is called the bit score.
$$S' = \frac{1}{\ln 2}\left(\lambda S - \ln K\right)$$
S' is the bit score;
S is the raw score, or simply the score;
λ and K depend on the scoring matrix used to calculate S.
The bit score depends only on the compared sequences (their similarity and length) ...
... unless there are gaps!
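As a sketch, the normalization S' = (λS - ln K) / ln 2 can be written as a one-line Python function. The name `bit_score` and the numeric values of λ and K in the usage line are illustrative assumptions, not parameters given in the lecture.

```python
import math


def bit_score(raw: float, lam: float, k: float) -> float:
    """Convert a raw score S into a bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


# Illustrative (assumed) parameter values for some scoring matrix.
print(bit_score(100, lam=0.318, k=0.13))
```

With λ = ln 2 and K = 1 the bit score equals the raw score, which is a quick sanity check of the formula.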
Problems with measuring DNA similarity
1 Unknown reading frame and coding strand
2 Silent mutations: 20 amino acids / 61 codons
   Two genes may have 33% different bases and still encode an identical protein!
3 High threshold of insignificant similarity:
   assume equal probability of each base: 1/4
   make a sequence, drawing base after base randomly
   you get 25% bases identical with anything...
Conclusion:
Homologous proteins may be encoded by quite different genes, whose similarity is often below the level of statistical significance.
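The 25% figure can be checked by simulation. The sketch below draws two random sequences with equal residue probabilities and counts identical positions; the function name `chance_identity`, the sequence length and the seed are arbitrary assumptions. Running it with a 20-letter alphabet shows the corresponding value of 1/20 = 5% for proteins.

```python
import random


def chance_identity(alphabet: str, length: int, seed: int = 0) -> float:
    """Fraction of identical positions between two random sequences.

    Every residue is drawn independently with equal probability,
    so the expected fraction is 1 / len(alphabet).
    """
    rng = random.Random(seed)
    first = [rng.choice(alphabet) for _ in range(length)]
    second = [rng.choice(alphabet) for _ in range(length)]
    same = sum(x == y for x, y in zip(first, second))
    return same / length


print(chance_identity("ACGT", 100_000))                  # close to 0.25
print(chance_identity("ACDEFGHIKLMNPQRSTVWY", 100_000))  # close to 0.05
```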
Advantages of measuring similarity of proteins
No reading frame problems.
Alphabet of 20 residues vs 4 bases:
   Similar analysis gives a threshold of insignificance at 5% identity.
   Statistical analysis gives the real threshold of about 15% identity, but:
      the compared sequences must be long enough;
      it depends on sequence composition and structure.
Conclusion:
Looking for homology, one should compare protein sequences.
Similarity of amino acid residues
What does it really mean, and how to score it?
Similar codons?
Similar size?
Polarity or hydrophobicity?
Acidity or basicity?
Functional groups?
H-bonding properties?
Ion coordination properties?
Nucleophilic properties?
The importance of each property is function-dependent, but we seek a one-for-all measure...
The first idea: Percent Accepted Mutations
Based on the phylogeny of protein sequences sharing at least 85% identical residues;
Matrix $M$ of mutation probabilities $m_{ij}$ of all single amino acid substitutions ($i \to j$), calculated from frequencies of mutations;
$M$ defines a unit of evolutionary distance: 1 PAM is defined as one accepted mutation per 100 amino acids.
Probabilities of multiple mutations (e.g. first $i \to j$, then $j \to k$) are given by $m_{ij} \cdot m_{jk}$. Therefore, the probabilities in $M$ can be extrapolated to any evolutionary distance by multiplying the matrix by itself.
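The extrapolation by multiplication can be sketched in plain Python. The 3-letter alphabet and the probabilities in `TOY_M` are made up for illustration (a real PAM matrix is 20 x 20 and derived from observed mutations), and the helper names `mat_mul` and `extrapolate` are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][j] * b[j][k] for j in range(n)) for k in range(n)]
        for i in range(n)
    ]


def extrapolate(m, steps):
    """Return m multiplied by itself `steps` times (steps >= 1)."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    result = m
    for _ in range(steps - 1):
        result = mat_mul(result, m)
    return result


# Toy mutation probabilities m_ij for a hypothetical 3-letter alphabet;
# row i holds the probabilities of residue i staying or changing to j.
TOY_M = [
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.02, 0.97],
]

two_steps = extrapolate(TOY_M, 2)
# Probability of i -> k in two steps sums m_ij * m_jk over all intermediates j.
print(two_steps[0][1])
# Each row of the extrapolated matrix still sums to 1 (up to rounding).
print([sum(row) for row in extrapolate(TOY_M, 250)])
```

Matrix multiplication sums the path products over every intermediate residue j, so the result is again a matrix of substitution probabilities, now for the longer evolutionary distance.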
Dayhoff, M.O. et al., "A model of evolutionary change in proteins", Atlas of Protein Sequence and Structure, vol. 5(3), pp. 345–352 (1978).