mirror of
https://github.com/biopython/biopython.git
synced 2025-10-20 13:43:47 +08:00
remove deprecated functionality from Bio.Align.AlignInfo (#4907)
Authored-by: Michiel de Hoon <mdehoon@tkx288.genome.gsc.riken.jp>
This commit is contained in:
@ -1098,364 +1098,6 @@ add entries for missing letters, for example
|
||||
|
||||
This also allows you to change the order of letters in the alphabet.
|
||||
|
||||
.. _`sec:summary_info`:
|
||||
|
||||
Calculating summary information
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Once you have an alignment, you are very likely going to want to find
|
||||
out information about it. Instead of trying to have all of the functions
|
||||
that can generate information about an alignment in the alignment object
|
||||
itself, we’ve tried to separate out the functionality into separate
|
||||
classes, which act on the alignment.
|
||||
|
||||
Getting ready to calculate summary information about an object is quick
|
||||
to do. Let’s say we’ve got an alignment object called ``alignment``, for
|
||||
example read in using ``Bio.AlignIO.read(...)`` as described in
|
||||
Chapter :ref:`chapter:msa`. All we need to do to get an object that
|
||||
will calculate summary information is:
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> from Bio.Align import AlignInfo
|
||||
>>> summary_align = AlignInfo.SummaryInfo(msa)
|
||||
|
||||
The ``summary_align`` object is very useful, and will do the following
|
||||
neat things for you:
|
||||
|
||||
#. Calculate a quick consensus sequence – see
|
||||
section :ref:`sec:consensus`
|
||||
|
||||
#. Get a position specific score matrix for the alignment – see
|
||||
section :ref:`sec:pssm`
|
||||
|
||||
#. Calculate the information content for the alignment – see
|
||||
section :ref:`sec:getting_info_content`
|
||||
|
||||
#. Generate information on substitutions in the alignment –
|
||||
section :ref:`sec:substitution_matrices`
|
||||
details using this to generate a substitution matrix.
|
||||
|
||||
.. _`sec:consensus`:
|
||||
|
||||
Calculating a quick consensus sequence
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``SummaryInfo`` object, described in
|
||||
section :ref:`sec:summary_info`, provides functionality to
|
||||
calculate a quick consensus of an alignment. Assuming we’ve got a
|
||||
``SummaryInfo`` object called ``summary_align`` we can calculate a
|
||||
consensus by doing:
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> consensus = summary_align.dumb_consensus()
|
||||
>>> consensus
|
||||
Seq('XCTXCTX')
|
||||
|
||||
As the name suggests, this is a really simple consensus calculator, and
|
||||
will just add up all of the residues at each point in the consensus, and
|
||||
if the most common value is higher than some threshold value will add
|
||||
the common residue to the consensus. If it doesn’t reach the threshold,
|
||||
it adds an ambiguity character to the consensus. The returned consensus
|
||||
object is a ``Seq`` object.
|
||||
|
||||
You can adjust how ``dumb_consensus`` works by passing optional
|
||||
parameters:
|
||||
|
||||
the threshold
|
||||
This is the threshold specifying how common a particular residue has
|
||||
to be at a position before it is added. The default is :math:`0.7`
|
||||
(meaning :math:`70\%`).
|
||||
|
||||
the ambiguous character
|
||||
This is the ambiguity character to use. The default is ’N’.
|
||||
|
||||
Alternatively, you can convert the multiple sequence alignment object
|
||||
``msa`` to a new-style ``Alignment`` object (see section
|
||||
:ref:`sec:alignmentobject`) by using the
|
||||
``alignment`` attribute (see section :ref:`sec:alignment_newstyle`):
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> alignment = msa.alignment
|
||||
|
||||
You can then create a ``Motif`` object (see section
|
||||
:ref:`sec:motif_object`):
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> from Bio.motifs import Motif
|
||||
>>> motif = Motif("ACGT", alignment)
|
||||
|
||||
and obtain a quick consensus sequence:
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> motif.consensus
|
||||
Seq('ACTCCTA')
|
||||
|
||||
The ``motif.counts.calculate_consensus`` method (see section
|
||||
:ref:`sec:motif_consensus`) lets you specify in
|
||||
detail how the consensus sequence should be calculated. For example,
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> motif.counts.calculate_consensus(identity=0.7)
|
||||
'NCTNCTN'
|
||||
|
||||
.. _`sec:pssm`:
|
||||
|
||||
Position Specific Score Matrices
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Position specific score matrices (PSSMs) summarize the alignment
|
||||
information in a different way than a consensus, and may be useful for
|
||||
different tasks. Basically, a PSSM is a count matrix. For each column in
|
||||
the alignment, the number of each alphabet letters is counted and
|
||||
totaled. The totals are displayed relative to some representative
|
||||
sequence along the left axis. This sequence may be the consensus
|
||||
sequence, but can also be any sequence in the alignment.
|
||||
|
||||
For instance for the alignment above:
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> print(msa)
|
||||
Alignment with 4 rows and 7 columns
|
||||
ACTCCTA seq1
|
||||
AAT-CTA seq2
|
||||
CCTACT- seq3
|
||||
TCTCCTC seq4
|
||||
|
||||
we get a PSSM with the consensus sequence along the side using
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> my_pssm = summary_align.pos_specific_score_matrix(consensus, chars_to_ignore=["N"])
|
||||
>>> print(my_pssm)
|
||||
A C T
|
||||
X 2.0 1.0 1.0
|
||||
C 1.0 3.0 0.0
|
||||
T 0.0 0.0 4.0
|
||||
X 1.0 2.0 0.0
|
||||
C 0.0 4.0 0.0
|
||||
T 0.0 0.0 4.0
|
||||
X 2.0 1.0 0.0
|
||||
<BLANKLINE>
|
||||
|
||||
where we ignore any ``N`` ambiguity residues when calculating the PSSM.
|
||||
|
||||
Two notes should be made about this:
|
||||
|
||||
#. To maintain strictness with the alphabets, you can only include
|
||||
characters along the top of the PSSM that are in the alphabet of the
|
||||
alignment object. Gaps are not included along the top axis of the
|
||||
PSSM.
|
||||
|
||||
#. The sequence passed to be displayed along the left side of the axis
|
||||
does not need to be the consensus. For instance, if you wanted to
|
||||
display the second sequence in the alignment along this axis, you
|
||||
would need to do:
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> second_seq = msa[1]
|
||||
>>> my_pssm = summary_align.pos_specific_score_matrix(second_seq, chars_to_ignore=["N"])
|
||||
>>> print(my_pssm)
|
||||
A C T
|
||||
A 2.0 1.0 1.0
|
||||
A 1.0 3.0 0.0
|
||||
T 0.0 0.0 4.0
|
||||
- 1.0 2.0 0.0
|
||||
C 0.0 4.0 0.0
|
||||
T 0.0 0.0 4.0
|
||||
A 2.0 1.0 0.0
|
||||
<BLANKLINE>
|
||||
|
||||
The command above returns a ``PSSM`` object. You can access any element
|
||||
of the PSSM by subscripting like
|
||||
``your_pssm[sequence_number][residue_count_name]``. For instance, to get
|
||||
the counts for the ’A’ residue in the second element of the above PSSM
|
||||
you would do:
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> print(my_pssm[5]["T"])
|
||||
4.0
|
||||
|
||||
The structure of the PSSM class hopefully makes it easy both to access
|
||||
elements and to pretty print the matrix.
|
||||
|
||||
Alternatively, you can convert the multiple sequence alignment object
|
||||
``msa`` to a new-style ``Alignment`` object (see section
|
||||
:ref:`sec:alignmentobject`) by using the
|
||||
``alignment`` attribute (see section :ref:`sec:alignment_newstyle`):
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> alignment = msa.alignment
|
||||
|
||||
You can then create a ``Motif`` object (see section
|
||||
:ref:`sec:motif_object`):
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> from Bio.motifs import Motif
|
||||
>>> motif = Motif("ACGT", alignment)
|
||||
|
||||
and obtain the counts of each nucleotide in each position:
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> counts = motif.counts
|
||||
>>> print(counts)
|
||||
0 1 2 3 4 5 6
|
||||
A: 2.00 1.00 0.00 1.00 0.00 0.00 2.00
|
||||
C: 1.00 3.00 0.00 2.00 4.00 0.00 1.00
|
||||
G: 0.00 0.00 0.00 0.00 0.00 0.00 0.00
|
||||
T: 1.00 0.00 4.00 0.00 0.00 4.00 0.00
|
||||
<BLANKLINE>
|
||||
>>> print(counts["T"][5])
|
||||
4.0
|
||||
|
||||
.. _`sec:getting_info_content`:
|
||||
|
||||
Information Content
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
A potentially useful measure of evolutionary conservation is the
|
||||
information content of a sequence.
|
||||
|
||||
A useful introduction to information theory targeted towards molecular
|
||||
biologists can be found at
|
||||
http://www.lecb.ncifcrf.gov/~toms/paper/primer/. For our purposes, we
|
||||
will be looking at the information content of a consensus sequence, or a
|
||||
portion of a consensus sequence. We calculate information content at a
|
||||
particular column in a multiple sequence alignment using the following
|
||||
formula:
|
||||
|
||||
.. math:: IC_{j} = \sum_{i=1}^{N_{a}} P_{ij} \mathrm{log}\left(\frac{P_{ij}}{Q_{i}}\right)
|
||||
|
||||
where:
|
||||
|
||||
- :math:`IC_{j}` – The information content for the :math:`j`-th column
|
||||
in an alignment.
|
||||
|
||||
- :math:`N_{a}` – The number of letters in the alphabet.
|
||||
|
||||
- :math:`P_{ij}` – The frequency of a particular letter :math:`i` in
|
||||
the :math:`j`-th column (i. e. if G occurred 3 out of 6 times in an
|
||||
alignment column, this would be 0.5)
|
||||
|
||||
- :math:`Q_{i}` – The expected frequency of a letter :math:`i`. This is
|
||||
an optional argument, usage of which is left at the user’s
|
||||
discretion. By default, it is automatically assigned to
|
||||
:math:`0.05 = 1/20` for a protein alphabet, and :math:`0.25 = 1/4`
|
||||
for a nucleic acid alphabet. This is for getting the information
|
||||
content without any assumption of prior distributions. When assuming
|
||||
priors, or when using a non-standard alphabet, you should supply the
|
||||
values for :math:`Q_{i}`.
|
||||
|
||||
Well, now that we have an idea what information content is being
|
||||
calculated in Biopython, let’s look at how to get it for a particular
|
||||
region of the alignment.
|
||||
|
||||
First, we need to use our alignment to get an alignment summary object,
|
||||
which we’ll assume is called ``summary_align`` (see
|
||||
section :ref:`sec:summary_info`) for instructions on how to get
|
||||
this. Once we’ve got this object, calculating the information content
|
||||
for a region is as easy as:
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> e_freq_table = {"A": 0.3, "G": 0.2, "T": 0.3, "C": 0.2}
|
||||
>>> info_content = summary_align.information_content(
|
||||
... 2, 6, e_freq_table=e_freq_table, chars_to_ignore=["N"]
|
||||
... )
|
||||
>>> info_content # doctest:+ELLIPSIS
|
||||
6.3910647...
|
||||
|
||||
Now, ``info_content`` will contain the relative information content over
|
||||
the region [2:6] in relation to the expected frequencies.
|
||||
|
||||
The value return is calculated using base 2 as the logarithm base in the
|
||||
formula above. You can modify this by passing the parameter ``log_base``
|
||||
as the base you want:
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> info_content = summary_align.information_content(
|
||||
... 2, 6, e_freq_table=e_freq_table, log_base=10, chars_to_ignore=["N"]
|
||||
... )
|
||||
>>> info_content # doctest:+ELLIPSIS
|
||||
1.923902...
|
||||
|
||||
By default nucleotide or amino acid residues with a frequency of 0 in a
|
||||
column are not take into account when the relative information column
|
||||
for that column is computed. If this is not the desired result, you can
|
||||
use ``pseudo_count`` instead.
|
||||
|
||||
.. cont-doctest
|
||||
|
||||
.. code:: pycon
|
||||
|
||||
>>> info_content = summary_align.information_content(
|
||||
... 2, 6, e_freq_table=e_freq_table, chars_to_ignore=["N", "-"], pseudo_count=1
|
||||
... )
|
||||
>>> info_content # doctest:+ELLIPSIS
|
||||
4.299651...
|
||||
|
||||
In this case, the observed frequency :math:`P_{ij}` of a particular
|
||||
letter :math:`i` in the :math:`j`-th column is computed as follows:
|
||||
|
||||
.. math:: P_{ij} = \frac{n_{ij} + k\times Q_{i}}{N_{j} + k}
|
||||
|
||||
where:
|
||||
|
||||
- :math:`k` – the pseudo count you pass as argument.
|
||||
|
||||
- :math:`k` – the pseudo count you pass as argument.
|
||||
|
||||
- :math:`Q_{i}` – The expected frequency of the letter :math:`i` as
|
||||
described above.
|
||||
|
||||
Well, now you are ready to calculate information content. If you want to
|
||||
try applying this to some real life problems, it would probably be best
|
||||
to dig into the literature on information content to get an idea of how
|
||||
it is used. Hopefully your digging won’t reveal any mistakes made in
|
||||
coding this function!
|
||||
|
||||
.. _`sec:alignment_newstyle`:
|
||||
|
||||
Getting a new-style Alignment object
|
||||
|
Reference in New Issue
Block a user