remove deprecated functionality from Bio.Align.AlignInfo (#4907)

Authored-by: Michiel de Hoon <mdehoon@tkx288.genome.gsc.riken.jp>
This commit is contained in:
mdehoon
2025-01-16 03:38:28 +09:00
committed by GitHub
parent e297041e3a
commit 4ea2126ec0
6 changed files with 42 additions and 2240 deletions

View File

@ -1098,364 +1098,6 @@ add entries for missing letters, for example
This also allows you to change the order of letters in the alphabet.
.. _`sec:summary_info`:
Calculating summary information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Once you have an alignment, you are very likely going to want to find
out information about it. Instead of trying to have all of the functions
that can generate information about an alignment in the alignment object
itself, weve tried to separate out the functionality into separate
classes, which act on the alignment.
Getting ready to calculate summary information about an object is quick
to do. Lets say weve got an alignment object called ``alignment``, for
example read in using ``Bio.AlignIO.read(...)`` as described in
Chapter :ref:`chapter:msa`. All we need to do to get an object that
will calculate summary information is:
.. cont-doctest
.. code:: pycon
>>> from Bio.Align import AlignInfo
>>> summary_align = AlignInfo.SummaryInfo(msa)
The ``summary_align`` object is very useful, and will do the following
neat things for you:
#. Calculate a quick consensus sequence see
section :ref:`sec:consensus`
#. Get a position specific score matrix for the alignment see
section :ref:`sec:pssm`
#. Calculate the information content for the alignment see
section :ref:`sec:getting_info_content`
#. Generate information on substitutions in the alignment
section :ref:`sec:substitution_matrices`
details using this to generate a substitution matrix.
.. _`sec:consensus`:
Calculating a quick consensus sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``SummaryInfo`` object, described in
section :ref:`sec:summary_info`, provides functionality to
calculate a quick consensus of an alignment. Assuming weve got a
``SummaryInfo`` object called ``summary_align`` we can calculate a
consensus by doing:
.. cont-doctest
.. code:: pycon
>>> consensus = summary_align.dumb_consensus()
>>> consensus
Seq('XCTXCTX')
As the name suggests, this is a really simple consensus calculator, and
will just add up all of the residues at each point in the consensus, and
if the most common value is higher than some threshold value will add
the common residue to the consensus. If it doesnt reach the threshold,
it adds an ambiguity character to the consensus. The returned consensus
object is a ``Seq`` object.
You can adjust how ``dumb_consensus`` works by passing optional
parameters:
the threshold
This is the threshold specifying how common a particular residue has
to be at a position before it is added. The default is :math:`0.7`
(meaning :math:`70\%`).
the ambiguous character
This is the ambiguity character to use. The default is N.
Alternatively, you can convert the multiple sequence alignment object
``msa`` to a new-style ``Alignment`` object (see section
:ref:`sec:alignmentobject`) by using the
``alignment`` attribute (see section :ref:`sec:alignment_newstyle`):
.. cont-doctest
.. code:: pycon
>>> alignment = msa.alignment
You can then create a ``Motif`` object (see section
:ref:`sec:motif_object`):
.. cont-doctest
.. code:: pycon
>>> from Bio.motifs import Motif
>>> motif = Motif("ACGT", alignment)
and obtain a quick consensus sequence:
.. cont-doctest
.. code:: pycon
>>> motif.consensus
Seq('ACTCCTA')
The ``motif.counts.calculate_consensus`` method (see section
:ref:`sec:motif_consensus`) lets you specify in
detail how the consensus sequence should be calculated. For example,
.. cont-doctest
.. code:: pycon
>>> motif.counts.calculate_consensus(identity=0.7)
'NCTNCTN'
.. _`sec:pssm`:
Position Specific Score Matrices
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Position specific score matrices (PSSMs) summarize the alignment
information in a different way than a consensus, and may be useful for
different tasks. Basically, a PSSM is a count matrix. For each column in
the alignment, the number of each alphabet letters is counted and
totaled. The totals are displayed relative to some representative
sequence along the left axis. This sequence may be the consensus
sequence, but can also be any sequence in the alignment.
For instance for the alignment above:
.. cont-doctest
.. code:: pycon
>>> print(msa)
Alignment with 4 rows and 7 columns
ACTCCTA seq1
AAT-CTA seq2
CCTACT- seq3
TCTCCTC seq4
we get a PSSM with the consensus sequence along the side using
.. cont-doctest
.. code:: pycon
>>> my_pssm = summary_align.pos_specific_score_matrix(consensus, chars_to_ignore=["N"])
>>> print(my_pssm)
A C T
X 2.0 1.0 1.0
C 1.0 3.0 0.0
T 0.0 0.0 4.0
X 1.0 2.0 0.0
C 0.0 4.0 0.0
T 0.0 0.0 4.0
X 2.0 1.0 0.0
<BLANKLINE>
where we ignore any ``N`` ambiguity residues when calculating the PSSM.
Two notes should be made about this:
#. To maintain strictness with the alphabets, you can only include
characters along the top of the PSSM that are in the alphabet of the
alignment object. Gaps are not included along the top axis of the
PSSM.
#. The sequence passed to be displayed along the left side of the axis
does not need to be the consensus. For instance, if you wanted to
display the second sequence in the alignment along this axis, you
would need to do:
.. cont-doctest
.. code:: pycon
>>> second_seq = msa[1]
>>> my_pssm = summary_align.pos_specific_score_matrix(second_seq, chars_to_ignore=["N"])
>>> print(my_pssm)
A C T
A 2.0 1.0 1.0
A 1.0 3.0 0.0
T 0.0 0.0 4.0
- 1.0 2.0 0.0
C 0.0 4.0 0.0
T 0.0 0.0 4.0
A 2.0 1.0 0.0
<BLANKLINE>
The command above returns a ``PSSM`` object. You can access any element
of the PSSM by subscripting like
``your_pssm[sequence_number][residue_count_name]``. For instance, to get
the counts for the A residue in the second element of the above PSSM
you would do:
.. cont-doctest
.. code:: pycon
>>> print(my_pssm[5]["T"])
4.0
The structure of the PSSM class hopefully makes it easy both to access
elements and to pretty print the matrix.
Alternatively, you can convert the multiple sequence alignment object
``msa`` to a new-style ``Alignment`` object (see section
:ref:`sec:alignmentobject`) by using the
``alignment`` attribute (see section :ref:`sec:alignment_newstyle`):
.. cont-doctest
.. code:: pycon
>>> alignment = msa.alignment
You can then create a ``Motif`` object (see section
:ref:`sec:motif_object`):
.. cont-doctest
.. code:: pycon
>>> from Bio.motifs import Motif
>>> motif = Motif("ACGT", alignment)
and obtain the counts of each nucleotide in each position:
.. cont-doctest
.. code:: pycon
>>> counts = motif.counts
>>> print(counts)
0 1 2 3 4 5 6
A: 2.00 1.00 0.00 1.00 0.00 0.00 2.00
C: 1.00 3.00 0.00 2.00 4.00 0.00 1.00
G: 0.00 0.00 0.00 0.00 0.00 0.00 0.00
T: 1.00 0.00 4.00 0.00 0.00 4.00 0.00
<BLANKLINE>
>>> print(counts["T"][5])
4.0
.. _`sec:getting_info_content`:
Information Content
~~~~~~~~~~~~~~~~~~~
A potentially useful measure of evolutionary conservation is the
information content of a sequence.
A useful introduction to information theory targeted towards molecular
biologists can be found at
http://www.lecb.ncifcrf.gov/~toms/paper/primer/. For our purposes, we
will be looking at the information content of a consensus sequence, or a
portion of a consensus sequence. We calculate information content at a
particular column in a multiple sequence alignment using the following
formula:
.. math:: IC_{j} = \sum_{i=1}^{N_{a}} P_{ij} \mathrm{log}\left(\frac{P_{ij}}{Q_{i}}\right)
where:
- :math:`IC_{j}` The information content for the :math:`j`-th column
in an alignment.
- :math:`N_{a}` The number of letters in the alphabet.
- :math:`P_{ij}` The frequency of a particular letter :math:`i` in
the :math:`j`-th column (i. e. if G occurred 3 out of 6 times in an
alignment column, this would be 0.5)
- :math:`Q_{i}` The expected frequency of a letter :math:`i`. This is
an optional argument, usage of which is left at the users
discretion. By default, it is automatically assigned to
:math:`0.05 = 1/20` for a protein alphabet, and :math:`0.25 = 1/4`
for a nucleic acid alphabet. This is for getting the information
content without any assumption of prior distributions. When assuming
priors, or when using a non-standard alphabet, you should supply the
values for :math:`Q_{i}`.
Well, now that we have an idea what information content is being
calculated in Biopython, lets look at how to get it for a particular
region of the alignment.
First, we need to use our alignment to get an alignment summary object,
which well assume is called ``summary_align`` (see
section :ref:`sec:summary_info`) for instructions on how to get
this. Once weve got this object, calculating the information content
for a region is as easy as:
.. cont-doctest
.. code:: pycon
>>> e_freq_table = {"A": 0.3, "G": 0.2, "T": 0.3, "C": 0.2}
>>> info_content = summary_align.information_content(
... 2, 6, e_freq_table=e_freq_table, chars_to_ignore=["N"]
... )
>>> info_content # doctest:+ELLIPSIS
6.3910647...
Now, ``info_content`` will contain the relative information content over
the region [2:6] in relation to the expected frequencies.
The value return is calculated using base 2 as the logarithm base in the
formula above. You can modify this by passing the parameter ``log_base``
as the base you want:
.. cont-doctest
.. code:: pycon
>>> info_content = summary_align.information_content(
... 2, 6, e_freq_table=e_freq_table, log_base=10, chars_to_ignore=["N"]
... )
>>> info_content # doctest:+ELLIPSIS
1.923902...
By default nucleotide or amino acid residues with a frequency of 0 in a
column are not take into account when the relative information column
for that column is computed. If this is not the desired result, you can
use ``pseudo_count`` instead.
.. cont-doctest
.. code:: pycon
>>> info_content = summary_align.information_content(
... 2, 6, e_freq_table=e_freq_table, chars_to_ignore=["N", "-"], pseudo_count=1
... )
>>> info_content # doctest:+ELLIPSIS
4.299651...
In this case, the observed frequency :math:`P_{ij}` of a particular
letter :math:`i` in the :math:`j`-th column is computed as follows:
.. math:: P_{ij} = \frac{n_{ij} + k\times Q_{i}}{N_{j} + k}
where:
- :math:`k` the pseudo count you pass as argument.
- :math:`k` the pseudo count you pass as argument.
- :math:`Q_{i}` The expected frequency of the letter :math:`i` as
described above.
Well, now you are ready to calculate information content. If you want to
try applying this to some real life problems, it would probably be best
to dig into the literature on information content to get an idea of how
it is used. Hopefully your digging wont reveal any mistakes made in
coding this function!
.. _`sec:alignment_newstyle`:
Getting a new-style Alignment object