Use C code to parse alignments in which dashes represent gaps (#4737)

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* fix

* update

* update

* update

* update

* update

* Use int64_t instead of long

* avoid testing numpy array output

* update

* update

* update

* test

* test

* test

* test

* update

* update

* update

* update

* update

* no more compiler warnings

* update

* update

* update

* update

* change submodule name

---------

Co-authored-by: Michiel de Hoon <mdehoon@tkx249.genome.gsc.riken.jp>
mdehoon
2024-06-07 23:39:07 +09:00
committed by GitHub
parent 09eb5f35f0
commit 53a65b413e
9 changed files with 881 additions and 438 deletions


@ -150,8 +150,14 @@ aligned sequences as follows:
CGGTTTTT
AG-TTT--
AGGTTT--
>>> lines = [line.encode() for line in lines] # convert to bytes
>>> lines
[b'CGGTTTTT', b'AG-TTT--', b'AGGTTT--']
>>> sequences, coordinates = Alignment.parse_printed_alignment(lines)
>>> sequences
[b'CGGTTTTT', b'AGTTT', b'AGGTTT']
>>> sequences = [sequence.decode() for sequence in sequences]
>>> sequences
['CGGTTTTT', 'AGTTT', 'AGGTTT']
>>> coordinates
array([[0, 2, 3, 6, 8],
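
The relationship between the printed lines, the gap-free sequences, and the coordinates can be illustrated with a minimal pure-Python sketch. It is not the C implementation that this commit introduces; the function name ``parse_printed_alignment_sketch`` is hypothetical, it returns plain lists rather than a NumPy array, and it assumes that all lines have equal length and that every sequence is on the forward strand.

```python
def parse_printed_alignment_sketch(lines):
    """Illustrative stand-in only; not Biopython's C implementation.

    `lines` is a list of equal-length bytes objects in which b"-" marks a gap.
    Returns (sequences, coordinates), with coordinates as a list of lists.
    """
    lengths = {len(line) for line in lines}
    if len(lengths) > 1:
        raise ValueError("all lines must have the same length")
    n = lengths.pop() if lengths else 0
    gap = ord("-")  # indexing a bytes object yields integers

    # The gap-free sequences are the lines with all dashes removed.
    sequences = [line.replace(b"-", b"") for line in lines]

    # A new alignment block starts at every column where any row
    # switches between gap and non-gap.
    boundaries = [0]
    for col in range(1, n):
        if any((line[col] == gap) != (line[col - 1] == gap) for line in lines):
            boundaries.append(col)
    if n > 0:
        boundaries.append(n)

    # A sequence coordinate is the column index minus the number of
    # gaps that precede that column in the same row.
    coordinates = [
        [col - line[:col].count(b"-") for col in boundaries] for line in lines
    ]
    return sequences, coordinates


lines = [b"CGGTTTTT", b"AG-TTT--", b"AGGTTT--"]
sequences, coordinates = parse_printed_alignment_sketch(lines)
print(sequences)    # [b'CGGTTTTT', b'AGTTT', b'AGGTTT']
print(coordinates)  # [[0, 2, 3, 6, 8], [0, 2, 2, 5, 5], [0, 2, 3, 6, 6]]
```

Each column of ``coordinates`` marks a block boundary; a row whose value does not change between two adjacent columns has a gap in that block.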
@ -4568,6 +4574,12 @@ dictionary. Please refer to the test script ``test_Align_bigbed.py`` in
the ``Tests`` subdirectory in the Biopython distribution for more
examples of writing alignment files in the bigBed format.
Optional arguments are ``compress`` (default value is ``True``), ``blockSize``
(default value is 256), and ``itemsPerSlot`` (default value is 512). See the
documentation of UCSC's ``bedToBigBed`` program for a description of these
arguments. Searching a ``bigBed`` file can be faster if the file was created
with ``compress=False`` and ``itemsPerSlot=1``.
.. _`subsec:align_psl`:
Pattern Space Layout (PSL)
@ -4929,6 +4941,12 @@ See section :ref:`subsec:align_psl` for an explanation on how the
number of matches, mismatches, repeat region matches, and matches to
unknown nucleotides are obtained.
Further optional arguments are ``blockSize`` (default value is 256) and
``itemsPerSlot`` (default value is 512). See the documentation of UCSC's
``bedToBigBed`` program for a description of these arguments. Searching a
``bigPsl`` file can be faster if the file was created with ``compress=False``
and ``itemsPerSlot=1``.
.. _`subsec:align_maf`:
Multiple Alignment Format (MAF)
@ -5185,8 +5203,8 @@ bigMaf
A bigMaf file is a bigBed file with a BED3+1 format consisting of the 3
required BED fields plus a custom field that stores a MAF alignment
block as a string, crearing an indexed binary version of a MAF file (see
section :ref:`subsec:align_bigmaf`). The associated AutoSql file
block as a string, creating an indexed binary version of a MAF file (see
section :ref:`subsec:align_maf`). The associated AutoSql file
`bigMaf.as <https://genome.ucsc.edu/goldenPath/help/examples/bigMaf.as>`__
is provided by UCSC. To create a bigMaf file, you can either use the
``mafToBigMaf`` and ``bedToBigBed`` programs from UCSC, or you can use
@ -5344,6 +5362,9 @@ be of the form ``reference.chromosome``, where ``reference`` refers to
the reference species. ``Bio.Align.write`` has the additional keyword
argument ``compress`` (``True`` by default) specifying whether the data
should be compressed using zlib.
Further optional arguments are ``blockSize`` (default value is 256) and
``itemsPerSlot`` (default value is 512). See the documentation of UCSC's
``bedToBigBed`` program for a description of these arguments.
As a bigMaf file is a special case of a bigBed file, you can use the
``search`` method on the ``alignments`` object to find alignments to
@ -5374,6 +5395,9 @@ start and end positions may be ``None`` to start searching from position
respectively. Note that we can search on genomic position for the
reference species only.
Searching a ``bigMaf`` file can be faster if the file was created with
``compress=False`` and ``itemsPerSlot=1``.
.. _`subsec:align_chain`:
UCSC chain file format