remove deprecated functionality from Bio.Align.AlignInfo (#4907)

Authored-by: Michiel de Hoon <mdehoon@tkx288.genome.gsc.riken.jp>
2025-10-20 13:43:47 +08:00 · 2025-01-16 03:38:28 +09:00
parent e297041e3a
commit 4ea2126ec0
6 changed files with 42 additions and 2240 deletions
--- a/Doc/Tutorial/chapter_msa.rst
+++ b/Doc/Tutorial/chapter_msa.rst
@ -1098,364 +1098,6 @@ add entries for missing letters, for example

 This also allows you to change the order of letters in the alphabet.

-.. _`sec:summary_info`:
-
-Calculating summary information
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Once you have an alignment, you are very likely going to want to find
-out information about it. Instead of trying to have all of the functions
-that can generate information about an alignment in the alignment object
-itself, we’ve tried to separate out the functionality into separate
-classes, which act on the alignment.
-
-Getting ready to calculate summary information about an object is quick
-to do. Let’s say we’ve got an alignment object called ``alignment``, for
-example read in using ``Bio.AlignIO.read(...)`` as described in
-Chapter :ref:`chapter:msa`. All we need to do to get an object that
-will calculate summary information is:
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> from Bio.Align import AlignInfo
-   >>> summary_align = AlignInfo.SummaryInfo(msa)
-
-The ``summary_align`` object is very useful, and will do the following
-neat things for you:
-
-#. Calculate a quick consensus sequence – see
-   section :ref:`sec:consensus`
-
-#. Get a position specific score matrix for the alignment – see
-   section :ref:`sec:pssm`
-
-#. Calculate the information content for the alignment – see
-   section :ref:`sec:getting_info_content`
-
-#. Generate information on substitutions in the alignment –
-   section :ref:`sec:substitution_matrices`
-   details using this to generate a substitution matrix.
-
-.. _`sec:consensus`:
-
-Calculating a quick consensus sequence
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The ``SummaryInfo`` object, described in
-section :ref:`sec:summary_info`, provides functionality to
-calculate a quick consensus of an alignment. Assuming we’ve got a
-``SummaryInfo`` object called ``summary_align`` we can calculate a
-consensus by doing:
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> consensus = summary_align.dumb_consensus()
-   >>> consensus
-   Seq('XCTXCTX')
-
-As the name suggests, this is a really simple consensus calculator, and
-will just add up all of the residues at each point in the consensus, and
-if the most common value is higher than some threshold value will add
-the common residue to the consensus. If it doesn’t reach the threshold,
-it adds an ambiguity character to the consensus. The returned consensus
-object is a ``Seq`` object.
-
-You can adjust how ``dumb_consensus`` works by passing optional
-parameters:
-
-the threshold
-   This is the threshold specifying how common a particular residue has
-   to be at a position before it is added. The default is :math:`0.7`
-   (meaning :math:`70\%`).
-
-the ambiguous character
-   This is the ambiguity character to use. The default is ’N’.
-
-Alternatively, you can convert the multiple sequence alignment object
-``msa`` to a new-style ``Alignment`` object (see section
-:ref:`sec:alignmentobject`) by using the
-``alignment`` attribute (see section :ref:`sec:alignment_newstyle`):
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> alignment = msa.alignment
-
-You can then create a ``Motif`` object (see section
-:ref:`sec:motif_object`):
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> from Bio.motifs import Motif
-   >>> motif = Motif("ACGT", alignment)
-
-and obtain a quick consensus sequence:
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> motif.consensus
-   Seq('ACTCCTA')
-
-The ``motif.counts.calculate_consensus`` method (see section
-:ref:`sec:motif_consensus`) lets you specify in
-detail how the consensus sequence should be calculated. For example,
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> motif.counts.calculate_consensus(identity=0.7)
-   'NCTNCTN'
-
-.. _`sec:pssm`:
-
-Position Specific Score Matrices
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Position specific score matrices (PSSMs) summarize the alignment
-information in a different way than a consensus, and may be useful for
-different tasks. Basically, a PSSM is a count matrix. For each column in
-the alignment, the number of each alphabet letters is counted and
-totaled. The totals are displayed relative to some representative
-sequence along the left axis. This sequence may be the consensus
-sequence, but can also be any sequence in the alignment.
-
-For instance for the alignment above:
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> print(msa)
-   Alignment with 4 rows and 7 columns
-   ACTCCTA seq1
-   AAT-CTA seq2
-   CCTACT- seq3
-   TCTCCTC seq4
-
-we get a PSSM with the consensus sequence along the side using
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> my_pssm = summary_align.pos_specific_score_matrix(consensus, chars_to_ignore=["N"])
-   >>> print(my_pssm)
-       A   C   T
-   X  2.0 1.0 1.0
-   C  1.0 3.0 0.0
-   T  0.0 0.0 4.0
-   X  1.0 2.0 0.0
-   C  0.0 4.0 0.0
-   T  0.0 0.0 4.0
-   X  2.0 1.0 0.0
-   <BLANKLINE>
-
-where we ignore any ``N`` ambiguity residues when calculating the PSSM.
-
-Two notes should be made about this:
-
-#. To maintain strictness with the alphabets, you can only include
-   characters along the top of the PSSM that are in the alphabet of the
-   alignment object. Gaps are not included along the top axis of the
-   PSSM.
-
-#. The sequence passed to be displayed along the left side of the axis
-   does not need to be the consensus. For instance, if you wanted to
-   display the second sequence in the alignment along this axis, you
-   would need to do:
-
-   .. cont-doctest
-
-   .. code:: pycon
-
-      >>> second_seq = msa[1]
-      >>> my_pssm = summary_align.pos_specific_score_matrix(second_seq, chars_to_ignore=["N"])
-      >>> print(my_pssm)
-          A   C   T
-      A  2.0 1.0 1.0
-      A  1.0 3.0 0.0
-      T  0.0 0.0 4.0
-      -  1.0 2.0 0.0
-      C  0.0 4.0 0.0
-      T  0.0 0.0 4.0
-      A  2.0 1.0 0.0
-      <BLANKLINE>
-
-The command above returns a ``PSSM`` object. You can access any element
-of the PSSM by subscripting like
-``your_pssm[sequence_number][residue_count_name]``. For instance, to get
-the counts for the ’A’ residue in the second element of the above PSSM
-you would do:
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> print(my_pssm[5]["T"])
-   4.0
-
-The structure of the PSSM class hopefully makes it easy both to access
-elements and to pretty print the matrix.
-
-Alternatively, you can convert the multiple sequence alignment object
-``msa`` to a new-style ``Alignment`` object (see section
-:ref:`sec:alignmentobject`) by using the
-``alignment`` attribute (see section :ref:`sec:alignment_newstyle`):
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> alignment = msa.alignment
-
-You can then create a ``Motif`` object (see section
-:ref:`sec:motif_object`):
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> from Bio.motifs import Motif
-   >>> motif = Motif("ACGT", alignment)
-
-and obtain the counts of each nucleotide in each position:
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> counts = motif.counts
-   >>> print(counts)
-           0      1      2      3      4      5      6
-   A:   2.00   1.00   0.00   1.00   0.00   0.00   2.00
-   C:   1.00   3.00   0.00   2.00   4.00   0.00   1.00
-   G:   0.00   0.00   0.00   0.00   0.00   0.00   0.00
-   T:   1.00   0.00   4.00   0.00   0.00   4.00   0.00
-   <BLANKLINE>
-   >>> print(counts["T"][5])
-   4.0
-
-.. _`sec:getting_info_content`:
-
-Information Content
-~~~~~~~~~~~~~~~~~~~
-
-A potentially useful measure of evolutionary conservation is the
-information content of a sequence.
-
-A useful introduction to information theory targeted towards molecular
-biologists can be found at
-http://www.lecb.ncifcrf.gov/~toms/paper/primer/. For our purposes, we
-will be looking at the information content of a consensus sequence, or a
-portion of a consensus sequence. We calculate information content at a
-particular column in a multiple sequence alignment using the following
-formula:
-
-.. math:: IC_{j} = \sum_{i=1}^{N_{a}} P_{ij} \mathrm{log}\left(\frac{P_{ij}}{Q_{i}}\right)
-
-where:
-
-  :math:`IC_{j}` – The information content for the :math:`j`-th column
-   in an alignment.
-
-  :math:`N_{a}` – The number of letters in the alphabet.
-
-  :math:`P_{ij}` – The frequency of a particular letter :math:`i` in
-   the :math:`j`-th column (i. e. if G occurred 3 out of 6 times in an
-   alignment column, this would be 0.5)
-
-  :math:`Q_{i}` – The expected frequency of a letter :math:`i`. This is
-   an optional argument, usage of which is left at the user’s
-   discretion. By default, it is automatically assigned to
-   :math:`0.05 = 1/20` for a protein alphabet, and :math:`0.25 = 1/4`
-   for a nucleic acid alphabet. This is for getting the information
-   content without any assumption of prior distributions. When assuming
-   priors, or when using a non-standard alphabet, you should supply the
-   values for :math:`Q_{i}`.
-
-Well, now that we have an idea what information content is being
-calculated in Biopython, let’s look at how to get it for a particular
-region of the alignment.
-
-First, we need to use our alignment to get an alignment summary object,
-which we’ll assume is called ``summary_align`` (see
-section :ref:`sec:summary_info`) for instructions on how to get
-this. Once we’ve got this object, calculating the information content
-for a region is as easy as:
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> e_freq_table = {"A": 0.3, "G": 0.2, "T": 0.3, "C": 0.2}
-   >>> info_content = summary_align.information_content(
-   ...     2, 6, e_freq_table=e_freq_table, chars_to_ignore=["N"]
-   ... )
-   >>> info_content  # doctest:+ELLIPSIS
-   6.3910647...
-
-Now, ``info_content`` will contain the relative information content over
-the region [2:6] in relation to the expected frequencies.
-
-The value return is calculated using base 2 as the logarithm base in the
-formula above. You can modify this by passing the parameter ``log_base``
-as the base you want:
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> info_content = summary_align.information_content(
-   ...     2, 6, e_freq_table=e_freq_table, log_base=10, chars_to_ignore=["N"]
-   ... )
-   >>> info_content  # doctest:+ELLIPSIS
-   1.923902...
-
-By default nucleotide or amino acid residues with a frequency of 0 in a
-column are not take into account when the relative information column
-for that column is computed. If this is not the desired result, you can
-use ``pseudo_count`` instead.
-
-.. cont-doctest
-
-.. code:: pycon
-
-   >>> info_content = summary_align.information_content(
-   ...     2, 6, e_freq_table=e_freq_table, chars_to_ignore=["N", "-"], pseudo_count=1
-   ... )
-   >>> info_content  # doctest:+ELLIPSIS
-   4.299651...
-
-In this case, the observed frequency :math:`P_{ij}` of a particular
-letter :math:`i` in the :math:`j`-th column is computed as follows:
-
-.. math:: P_{ij} = \frac{n_{ij} + k\times Q_{i}}{N_{j} + k}
-
-where:
-
-  :math:`k` – the pseudo count you pass as argument.
-
-  :math:`k` – the pseudo count you pass as argument.
-
-  :math:`Q_{i}` – The expected frequency of the letter :math:`i` as
-   described above.
-
-Well, now you are ready to calculate information content. If you want to
-try applying this to some real life problems, it would probably be best
-to dig into the literature on information content to get an idea of how
-it is used. Hopefully your digging won’t reveal any mistakes made in
-coding this function!
-
 .. _`sec:alignment_newstyle`:

 Getting a new-style Alignment object