Use C code to parse alignments in which dashes represent gaps (#4737)

* update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * fix * update * update * update * update * update * Use int64_t instead of long * avoid testing numpy array output * update * update * update * test * test * test * test * update * update * update * update * update * no more compiler warnings * update * update * update * update * change submodule name --------- Co-authored-by: Michiel de Hoon <mdehoon@tkx249.genome.gsc.riken.jp>
2025-10-20 13:43:47 +08:00 · 2024-06-07 23:39:07 +09:00
parent 09eb5f35f0
commit 53a65b413e
9 changed files with 881 additions and 438 deletions
--- a/Doc/Tutorial/chapter_align.rst
+++ b/Doc/Tutorial/chapter_align.rst
@@ -150,8 +150,14 @@ aligned sequences as follows:
   CGGTTTTT
   AG-TTT--
   AGGTTT--
+   >>> lines = [line.encode() for line in lines]  # convert to bytes
+   >>> lines
+   [b'CGGTTTTT', b'AG-TTT--', b'AGGTTT--']
   >>> sequences, coordinates = Alignment.parse_printed_alignment(lines)
   >>> sequences
+   [b'CGGTTTTT', b'AGTTT', b'AGGTTT']
+   >>> sequences = [sequence.decode() for sequence in sequences]
+   >>> sequences
   ['CGGTTTTT', 'AGTTT', 'AGGTTT']
   >>> coordinates
   array([[0, 2, 3, 6, 8],
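Reviewer note: the hunk above changes ``parse_printed_alignment`` to take and return ``bytes``. As a rough illustration of what parsing a dash-gapped alignment involves, here is a minimal pure-Python sketch. It is not the C implementation added in this commit; the function name ``parse_gapped_lines`` is hypothetical, and it returns plain lists rather than a NumPy array.

```python
GAP = ord("-")  # iterating over bytes yields integers, so compare against the code point


def parse_gapped_lines(lines):
    """Split gapped alignment rows (bytes) into ungapped sequences and coordinates.

    Hypothetical helper for illustration only; it mimics the documented
    behaviour of the tutorial example but is not Biopython's implementation.
    """
    if not lines:
        return [], []
    width = len(lines[0])
    if any(len(line) != width for line in lines):
        raise ValueError("all alignment rows must have the same length")
    # Removing the dashes recovers the ungapped sequences.
    sequences = [line.replace(b"-", b"") for line in lines]
    positions = [0] * len(lines)         # current index into each ungapped sequence
    coordinates = [[0] for _ in lines]   # one row of block boundaries per sequence
    previous = None
    for column in range(width):
        # Gap pattern of this column: True where the row has a dash.
        pattern = tuple(line[column] == GAP for line in lines)
        if previous is not None and pattern != previous:
            # The gap pattern changed, so a new block starts here.
            for row, position in zip(coordinates, positions):
                row.append(position)
        for i, is_gap in enumerate(pattern):
            if not is_gap:
                positions[i] += 1
        previous = pattern
    if width > 0:
        # Close the final block.
        for row, position in zip(coordinates, positions):
            row.append(position)
    return sequences, coordinates


lines = [b"CGGTTTTT", b"AG-TTT--", b"AGGTTT--"]
sequences, coordinates = parse_gapped_lines(lines)
print(sequences)    # [b'CGGTTTTT', b'AGTTT', b'AGGTTT']
print(coordinates)  # [[0, 2, 3, 6, 8], [0, 2, 2, 5, 5], [0, 2, 3, 6, 6]]
```

A new coordinate column is recorded whenever the per-column gap pattern changes, which is why the first row reads ``0, 2, 3, 6, 8`` for the example in the tutorial.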
@@ -4568,6 +4574,12 @@ dictionary. Please refer to the test script ``test_Align_bigbed.py`` in
 the ``Tests`` subdirectory in the Biopython distribution for more
 examples of writing alignment files in the bigBed format.

+Optional arguments are ``compress`` (default value is ``True``), ``blockSize``
+(default value is 256), and ``itemsPerSlot`` (default value is 512). See the
+documentation of UCSC's ``bedToBigBed`` program for a description of these
+arguments.  Searching a ``bigBed`` file can be faster by using
+``compress=False`` and ``itemsPerSlot=1`` when creating the bigBed file.
+
 .. _`subsec:align_psl`:

 Pattern Space Layout (PSL)
@@ -4929,6 +4941,12 @@ See section :ref:`subsec:align_psl` for an explanation on how the
 number of matches, mismatches, repeat region matches, and matches to
 unknown nucleotides are obtained.

+Further optional arguments are ``blockSize`` (default value is 256), and
+``itemsPerSlot`` (default value is 512). See the documentation of UCSC's
+``bedToBigBed`` program for a description of these arguments.  Searching a
+``bigPsl`` file can be faster by using ``compress=False`` and
+``itemsPerSlot=1`` when creating the bigPsl file.
+
 .. _`subsec:align_maf`:

 Multiple Alignment Format (MAF)
@@ -5185,8 +5203,8 @@ bigMaf

 A bigMaf file is a bigBed file with a BED3+1 format consisting of the 3
 required BED fields plus a custom field that stores a MAF alignment
-block as a string, crearing an indexed binary version of a MAF file (see
-section :ref:`subsec:align_bigmaf`). The associated AutoSql file
+block as a string, creating an indexed binary version of a MAF file (see
+section :ref:`subsec:align_maf`). The associated AutoSql file
 `bigMaf.as <https://genome.ucsc.edu/goldenPath/help/examples/bigMaf.as>`__
 is provided by UCSC. To create a bigMaf file, you can either use the
 ``mafToBigMaf`` and ``bedToBigBed`` programs from UCSC, or you can use
@@ -5344,6 +5362,9 @@ be of the form ``reference.chromosome``, where ``reference`` refers to
 the reference species. ``Bio.Align.write`` has the additional keyword
 argument ``compress`` (``True`` by default) specifying whether the data
 should be compressed using zlib.
+Further optional arguments are ``blockSize`` (default value is 256), and
+``itemsPerSlot`` (default value is 512). See the documentation of UCSC's
+``bedToBigBed`` program for a description of these arguments.

 As a bigMaf file is a special case of a bigBed file, you can use the
 ``search`` method on the ``alignments`` object to find alignments to
@@ -5374,6 +5395,9 @@ start and end positions may be ``None`` to start searching from position
 respectively. Note that we can search on genomic position for the
 reference species only.

+Searching a ``bigMaf`` file can be faster by using ``compress=False`` and
+``itemsPerSlot=1`` when creating the bigMaf file.
+
 .. _`subsec:align_chain`:

 UCSC chain file format