initial commit

Yaossg 2025-01-18 21:09:52 +08:00
commit 5e601d0401
428 changed files with 206785 additions and 0 deletions

1
.gitattributes vendored Normal file

@@ -0,0 +1 @@
*.pbxproj binary merge=union

48
.gitignore vendored Normal file

@@ -0,0 +1,48 @@
*~
*.dSYM
.DS_Store
tags
*-debug
*-s
*-l
hisat2.xcodeproj/project.xcworkspace
hisat2.xcodeproj/xcuserdata
hisat2.xcodeproj/xcshareddata
*.patch
build_automaton
build_index
clean_alignment
determinize
gcsa_alignment
gcsa_test
hisat2-repeat
hisat2_test/*.bt2
hisat2_test/*.ht2
hisat2_test/*.sam
hisat2_test/paper_example.malignment.automaton
hisat2_test/paper_example.malignment.backbone
hisat2_test/paper_example.malignment.gcsa
hisat2_test/kim_example*.malignment.automaton
hisat2_test/kim_example*.malignment.backbone
hisat2_test/kim_example*.malignment.gcsa
hisat2_test/genome*
hisat2_test/2*
hisat2_test/snp142*
hisat2_test/testset*
.idea
.vscode
.ht2lib-obj*
*.a
*.so
docs/_site
docs/*.lock
docs/.*-cache
*.tar.gz
*.ipynb
*.pyc
cmake*

29
AUTHORS Normal file

@@ -0,0 +1,29 @@
Ben Langmead <langmea@cs.jhu.edu> wrote Bowtie 2, which is based partially on
Bowtie. Bowtie was written by Ben Langmead and Cole Trapnell.
Bowtie & Bowtie 2: http://bowtie-bio.sf.net
A DLL from the pthreads for Win32 library is distributed with the Win32 version
of Bowtie 2. The pthreads for Win32 library and the GnuWin32 package have many
contributors (see their respective web sites).
pthreads for Win32: http://sourceware.org/pthreads-win32
GnuWin32: http://gnuwin32.sf.net
The ForkManager.pm perl module is used in Bowtie 2's random testing framework,
and is included as scripts/sim/contrib/ForkManager.pm. ForkManager.pm is
written by dLux (Szabo, Balazs), with contributions by others. See the perldoc
in ForkManager.pm for the complete list.
The file ls.h includes an implementation of the Larsson-Sadakane suffix sorting
algorithm. The implementation is by N. Jesper Larsson and was adapted somewhat
for use in Bowtie 2.
TinyThreads is a portable thread implementation with a fairly compatible subset
of C++11 thread management classes written by Marcus Geelnard. For more info
check http://tinythreadpp.bitsnbites.eu/
Various users have kindly supplied patches, bug reports and feature requests
over the years. Many, many thanks go to them.
September 2011

BIN
HISAT2-genotype.png Normal file

Binary file not shown (724 KiB).

1
HISAT2_VERSION Normal file

@@ -0,0 +1 @@
2.2.1-3n-0.0.3

674
LICENSE Normal file

@@ -0,0 +1,674 @@
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The GNU General Public License is a free, copyleft license for
software and other kinds of works.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users. We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors. You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received. You must make sure that they, too, receive
or can get the source code. And you must show them these terms so they
know their rights.
Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software. For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so. This is fundamentally incompatible with the aim of
protecting users' freedom to change the software. The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable. Therefore, we
have designed this version of the GPL to prohibit the practice for those
products. If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.
Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary. To prevent this, the GPL assures that
patents cannot be used to render the program non-free.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 3 of the GNU General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Use with the GNU Affero General Public License.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.
14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.
If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper mail.
If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:
<program> Copyright (C) <year> <name of author>
This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".
You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<http://www.gnu.org/licenses/>.
The GNU General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License. But first, please read
<http://www.gnu.org/philosophy/why-not-lgpl.html>.

1467
MANUAL Normal file

File diff suppressed because it is too large

2437
MANUAL.markdown Normal file

File diff suppressed because it is too large

590
Makefile Normal file

@ -0,0 +1,590 @@
#
# Copyright 2015, Daehwan Kim <infphilo@gmail.com>
#
# This file is part of HISAT2.
#
# HISAT 2 is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# HISAT 2 is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with HISAT. If not, see <http://www.gnu.org/licenses/>.
#
#
# Makefile for hisat2-align, hisat2-build, hisat2-inspect
#
INC =
GCC_PREFIX = $(shell dirname `which gcc`)
GCC_SUFFIX =
CC = $(GCC_PREFIX)/gcc$(GCC_SUFFIX)
CPP = $(GCC_PREFIX)/g++$(GCC_SUFFIX)
CXX = $(CPP)
HEADERS = $(wildcard *.h)
BOWTIE_MM = 1
BOWTIE_SHARED_MEM = 0
# Detect Cygwin or MinGW
WINDOWS = 0
CYGWIN = 0
MINGW = 0
ifneq (,$(findstring CYGWIN,$(shell uname)))
WINDOWS = 1
CYGWIN = 1
# POSIX memory-mapped files not currently supported on Windows
BOWTIE_MM = 0
BOWTIE_SHARED_MEM = 0
else
ifneq (,$(findstring MINGW,$(shell uname)))
WINDOWS = 1
MINGW = 1
# POSIX memory-mapped files not currently supported on Windows
BOWTIE_MM = 0
BOWTIE_SHARED_MEM = 0
endif
endif
MACOS = 0
ifneq (,$(findstring Darwin,$(shell uname)))
MACOS = 1
endif
EXTRA_FLAGS += -DPOPCNT_CAPABILITY -std=c++11
INC += -I. -I third_party
MM_DEF =
ifeq (1,$(BOWTIE_MM))
MM_DEF = -DBOWTIE_MM
endif
SHMEM_DEF =
ifeq (1,$(BOWTIE_SHARED_MEM))
SHMEM_DEF = -DBOWTIE_SHARED_MEM
endif
PTHREAD_PKG =
PTHREAD_LIB =
ifeq (1,$(MINGW))
PTHREAD_LIB =
else
PTHREAD_LIB = -lpthread
endif
SEARCH_LIBS =
BUILD_LIBS =
INSPECT_LIBS =
ifeq (1,$(MINGW))
BUILD_LIBS =
INSPECT_LIBS =
endif
USE_SRA = 0
SRA_DEF =
SRA_LIB =
SEARCH_INC =
ifeq (1,$(USE_SRA))
SRA_DEF = -DUSE_SRA
SRA_LIB = -lncbi-ngs-c++-static -lngs-c++-static -lncbi-vdb-static -ldl
SEARCH_INC += -I$(NCBI_NGS_DIR)/include -I$(NCBI_VDB_DIR)/include
SEARCH_LIBS += -L$(NCBI_NGS_DIR)/lib64 -L$(NCBI_VDB_DIR)/lib64
endif
LIBS = $(PTHREAD_LIB)
HT2LIB_DIR = hisat2lib
HT2LIB_CPPS = $(HT2LIB_DIR)/ht2_init.cpp \
$(HT2LIB_DIR)/ht2_repeat.cpp \
$(HT2LIB_DIR)/ht2_index.cpp
SHARED_CPPS = ccnt_lut.cpp ref_read.cpp alphabet.cpp shmem.cpp \
edit.cpp gfm.cpp \
reference.cpp ds.cpp multikey_qsort.cpp limit.cpp \
random_source.cpp tinythread.cpp utility_3n.cpp
SEARCH_CPPS = qual.cpp pat.cpp \
read_qseq.cpp aligner_seed_policy.cpp \
aligner_seed.cpp \
aligner_seed2.cpp \
aligner_sw.cpp \
aligner_sw_driver.cpp aligner_cache.cpp \
aligner_result.cpp ref_coord.cpp mask.cpp \
pe.cpp aln_sink.cpp dp_framer.cpp \
scoring.cpp presets.cpp unique.cpp \
simple_func.cpp \
random_util.cpp \
aligner_bt.cpp sse_util.cpp \
aligner_swsse.cpp outq.cpp \
aligner_swsse_loc_i16.cpp \
aligner_swsse_ee_i16.cpp \
aligner_swsse_loc_u8.cpp \
aligner_swsse_ee_u8.cpp \
aligner_driver.cpp \
splice_site.cpp \
alignment_3n.cpp \
position_3n.cpp \
$(HT2LIB_CPPS)
BUILD_CPPS = diff_sample.cpp
REPEAT_CPPS = \
mask.cpp \
qual.cpp \
aligner_bt.cpp \
scoring.cpp \
simple_func.cpp \
dp_framer.cpp \
aligner_result.cpp \
aligner_sw_driver.cpp \
aligner_sw.cpp \
aligner_swsse_ee_i16.cpp \
aligner_swsse_ee_u8.cpp \
aligner_swsse_loc_i16.cpp \
aligner_swsse_loc_u8.cpp \
aligner_swsse.cpp \
bit_packed_array.cpp \
repeat_builder.cpp
THREE_N_HEADERS = \
position_3n_table.h \
alignment_3n_table.h \
utility_3n_table.h
HISAT2_CPPS_MAIN = $(SEARCH_CPPS) hisat2_main.cpp
HISAT2_BUILD_CPPS_MAIN = $(BUILD_CPPS) hisat2_build_main.cpp
HISAT2_REPEAT_CPPS_MAIN = $(REPEAT_CPPS) $(BUILD_CPPS) hisat2_repeat_main.cpp
SEARCH_FRAGMENTS = $(wildcard search_*_phase*.c)
VERSION := $(shell cat HISAT2_VERSION)
# Convert BITS=?? to a -m flag
BITS=32
ifeq (x86_64,$(shell uname -m))
BITS=64
endif
# msys will always be 32 bit so look at the cpu arch instead.
ifneq (,$(findstring AMD64,$(PROCESSOR_ARCHITEW6432)))
ifeq (1,$(MINGW))
BITS=64
endif
endif
BITS_FLAG =
ifeq (32,$(BITS))
BITS_FLAG = -m32
endif
ifeq (64,$(BITS))
BITS_FLAG = -m64
endif
SSE_FLAG=-msse2
DEBUG_FLAGS = -O0 -g3 $(BITS_FLAG) $(SSE_FLAG)
DEBUG_DEFS = -DCOMPILER_OPTIONS="\"$(DEBUG_FLAGS) $(EXTRA_FLAGS)\""
RELEASE_FLAGS = -O3 $(BITS_FLAG) $(SSE_FLAG) -funroll-loops -g3
RELEASE_DEFS = -DCOMPILER_OPTIONS="\"$(RELEASE_FLAGS) $(EXTRA_FLAGS)\""
NOASSERT_FLAGS = -DNDEBUG
FILE_FLAGS = -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE
HT2LIB_FLAGS = -DHISAT2_BUILD_LIB
ifeq (1,$(USE_SRA))
ifeq (1, $(MACOS))
SRA_LIB += -stdlib=libc++
DEBUG_FLAGS += -mmacosx-version-min=10.10
RELEASE_FLAGS += -mmacosx-version-min=10.10
endif
endif
HISAT2_BIN_LIST = hisat2-build-s \
hisat2-build-l \
hisat2-align-s \
hisat2-align-l \
hisat2-inspect-s \
hisat2-inspect-l \
hisat2-repeat \
hisat-3n-table
HISAT2_BIN_LIST_AUX = hisat2-build-s-debug \
hisat2-build-l-debug \
hisat2-align-s-debug \
hisat2-align-l-debug \
hisat2-inspect-s-debug \
hisat2-inspect-l-debug \
hisat2-repeat-debug
HT2LIB_SRCS = $(SHARED_CPPS) \
$(HT2LIB_CPPS)
HT2LIB_OBJS = $(HT2LIB_SRCS:.cpp=.o)
HT2LIB_DEBUG_OBJS = $(addprefix .ht2lib-obj-debug/,$(HT2LIB_OBJS))
HT2LIB_RELEASE_OBJS = $(addprefix .ht2lib-obj-release/,$(HT2LIB_OBJS))
HT2LIB_SHARED_DEBUG_OBJS = $(addprefix .ht2lib-obj-debug-shared/,$(HT2LIB_OBJS))
HT2LIB_SHARED_RELEASE_OBJS = $(addprefix .ht2lib-obj-release-shared/,$(HT2LIB_OBJS))
HT2LIB_PKG_SRC = \
$(HT2LIB_DIR)/ht2_init.cpp \
$(HT2LIB_DIR)/ht2_repeat.cpp \
$(HT2LIB_DIR)/ht2_index.cpp \
$(HT2LIB_DIR)/ht2.h \
$(HT2LIB_DIR)/ht2_handle.h \
$(HT2LIB_DIR)/java_jni/Makefile \
$(HT2LIB_DIR)/java_jni/ht2module.c \
$(HT2LIB_DIR)/java_jni/HT2Module.java \
$(HT2LIB_DIR)/java_jni/HT2ModuleExample.java \
$(HT2LIB_DIR)/pymodule/Makefile \
$(HT2LIB_DIR)/pymodule/ht2module.c \
$(HT2LIB_DIR)/pymodule/setup.py \
$(HT2LIB_DIR)/pymodule/ht2example.py
GENERAL_LIST = $(wildcard scripts/*.sh) \
$(wildcard scripts/*.pl) \
$(wildcard *.py) \
$(wildcard example/index/*.ht2) \
$(wildcard example/reads/*.fa) \
example/reference/22_20-21M.fa \
example/reference/22_20-21M.snp \
$(PTHREAD_PKG) \
hisat2 \
hisat2-build \
hisat2-inspect \
AUTHORS \
LICENSE \
NEWS \
MANUAL \
MANUAL.markdown \
TUTORIAL \
HISAT2_VERSION
ifeq (1,$(WINDOWS))
HISAT2_BIN_LIST := $(HISAT2_BIN_LIST) hisat2.bat hisat2-build.bat hisat2-inspect.bat
endif
# This is helpful on Windows under MinGW/MSYS, where Make might go for
# the Windows FIND tool instead.
FIND=$(shell which find)
SRC_PKG_LIST = $(wildcard *.h) \
$(wildcard *.hh) \
$(wildcard *.c) \
$(wildcard *.cpp) \
$(HT2LIB_PKG_SRC) \
Makefile \
CMakeLists.txt \
$(GENERAL_LIST)
BIN_PKG_LIST = $(GENERAL_LIST)
.PHONY: all allall both both-debug
all: $(HISAT2_BIN_LIST)
allall: $(HISAT2_BIN_LIST) $(HISAT2_BIN_LIST_AUX)
both: hisat2-align-s hisat2-align-l hisat2-build-s hisat2-build-l
both-debug: hisat2-align-s-debug hisat2-align-l-debug hisat2-build-s-debug hisat2-build-l-debug
repeat: hisat2-repeat
repeat-debug: hisat2-repeat-debug
DEFS :=-fno-strict-aliasing \
-DHISAT2_VERSION="\"`cat HISAT2_VERSION`\"" \
-DBUILD_HOST="\"`hostname`\"" \
-DBUILD_TIME="\"`date`\"" \
-DCOMPILER_VERSION="\"`$(CXX) -v 2>&1 | tail -1`\"" \
$(FILE_FLAGS) \
$(PREF_DEF) \
$(MM_DEF) \
$(SHMEM_DEF)
#
# hisat-bp targets
#
hisat-bp-bin: hisat_bp.cpp $(SEARCH_CPPS) $(SHARED_CPPS) $(HEADERS) $(SEARCH_FRAGMENTS)
$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DBOWTIE2 $(NOASSERT_FLAGS) -Wall \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(HISAT_CPPS_MAIN) \
$(LIBS) $(SEARCH_LIBS)
hisat-bp-bin-debug: hisat_bp.cpp $(SEARCH_CPPS) $(SHARED_CPPS) $(HEADERS) $(SEARCH_FRAGMENTS)
$(CXX) $(DEBUG_FLAGS) \
$(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DBOWTIE2 -Wall \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(HISAT_CPPS_MAIN) \
$(LIBS) $(SEARCH_LIBS)
#
# hisat2-repeat targets
#
hisat2-repeat: hisat2_repeat.cpp $(REPEAT_CPPS) $(SHARED_CPPS) $(HEADERS)
$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(HISAT2_REPEAT_CPPS_MAIN) \
$(LIBS) $(BUILD_LIBS)
hisat2-repeat-debug: hisat2_repeat.cpp $(REPEAT_CPPS) $(SHARED_CPPS) $(HEADERS)
$(CXX) $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(HISAT2_REPEAT_CPPS_MAIN) \
$(LIBS) $(BUILD_LIBS)
#
# hisat2-build targets
#
hisat2-build-s: hisat2_build.cpp $(SHARED_CPPS) $(HEADERS)
$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DBOWTIE2 $(NOASSERT_FLAGS) -Wall -DMASSIVE_DATA_RLCSA \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(HISAT2_BUILD_CPPS_MAIN) \
$(LIBS) $(BUILD_LIBS)
hisat2-build-l: hisat2_build.cpp $(SHARED_CPPS) $(HEADERS)
$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(HISAT2_BUILD_CPPS_MAIN) \
$(LIBS) $(BUILD_LIBS)
hisat2-build-s-debug: hisat2_build.cpp $(SHARED_CPPS) $(HEADERS)
$(CXX) $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DBOWTIE2 -Wall -DMASSIVE_DATA_RLCSA \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(HISAT2_BUILD_CPPS_MAIN) \
$(LIBS) $(BUILD_LIBS)
hisat2-build-l-debug: hisat2_build.cpp $(SHARED_CPPS) $(HEADERS)
$(CXX) $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(HISAT2_BUILD_CPPS_MAIN) \
$(LIBS) $(BUILD_LIBS)
#
# hisat2 targets
#
hisat2-align-s: hisat2.cpp $(SEARCH_CPPS) $(SHARED_CPPS) $(HEADERS) $(SEARCH_FRAGMENTS)
$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) $(SRA_DEF) -DBOWTIE2 $(NOASSERT_FLAGS) -Wall \
$(INC) $(SEARCH_INC) \
-o $@ $< \
$(SHARED_CPPS) $(HISAT2_CPPS_MAIN) \
$(LIBS) $(SRA_LIB) $(SEARCH_LIBS)
hisat2-align-l: hisat2.cpp $(SEARCH_CPPS) $(SHARED_CPPS) $(HEADERS) $(SEARCH_FRAGMENTS)
$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) $(SRA_DEF) -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \
$(INC) $(SEARCH_INC) \
-o $@ $< \
$(SHARED_CPPS) $(HISAT2_CPPS_MAIN) \
$(LIBS) $(SRA_LIB) $(SEARCH_LIBS)
hisat2-align-s-debug: hisat2.cpp $(SEARCH_CPPS) $(SHARED_CPPS) $(HEADERS) $(SEARCH_FRAGMENTS)
$(CXX) $(DEBUG_FLAGS) \
$(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) $(SRA_DEF) -DBOWTIE2 -Wall \
$(INC) $(SEARCH_INC) \
-o $@ $< \
$(SHARED_CPPS) $(HISAT2_CPPS_MAIN) \
$(LIBS) $(SRA_LIB) $(SEARCH_LIBS)
hisat2-align-l-debug: hisat2.cpp $(SEARCH_CPPS) $(SHARED_CPPS) $(HEADERS) $(SEARCH_FRAGMENTS)
$(CXX) $(DEBUG_FLAGS) \
$(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) $(SRA_DEF) -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \
$(INC) $(SEARCH_INC) \
-o $@ $< \
$(SHARED_CPPS) $(HISAT2_CPPS_MAIN) \
$(LIBS) $(SRA_LIB) $(SEARCH_LIBS)
#
# hisat2-inspect targets
#
hisat2-inspect-s: hisat2_inspect.cpp $(HEADERS) $(SHARED_CPPS)
$(CXX) $(RELEASE_FLAGS) \
$(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DBOWTIE2 -DHISAT2_INSPECT_MAIN -Wall \
$(INC) -I . \
-o $@ $< \
$(SHARED_CPPS) \
$(LIBS) $(INSPECT_LIBS)
hisat2-inspect-l: hisat2_inspect.cpp $(HEADERS) $(SHARED_CPPS)
$(CXX) $(RELEASE_FLAGS) \
$(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DBOWTIE2 -DBOWTIE_64BIT_INDEX -DHISAT2_INSPECT_MAIN -Wall \
$(INC) -I . \
-o $@ $< \
$(SHARED_CPPS) \
$(LIBS) $(INSPECT_LIBS)
hisat2-inspect-s-debug: hisat2_inspect.cpp $(HEADERS) $(SHARED_CPPS)
$(CXX) $(DEBUG_FLAGS) \
$(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DBOWTIE2 -DHISAT2_INSPECT_MAIN -Wall \
$(INC) -I . \
-o $@ $< \
$(SHARED_CPPS) \
$(LIBS) $(INSPECT_LIBS)
hisat2-inspect-l-debug: hisat2_inspect.cpp $(HEADERS) $(SHARED_CPPS)
$(CXX) $(DEBUG_FLAGS) \
$(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DBOWTIE2 -DBOWTIE_64BIT_INDEX -DHISAT2_INSPECT_MAIN -Wall \
$(INC) -I . \
-o $@ $< \
$(SHARED_CPPS) \
$(LIBS) $(INSPECT_LIBS)
#
# hisat-3n-table targets
#
hisat-3n-table: hisat_3n_table.cpp $(THREE_N_HEADERS)
$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) $(NOASSERT_FLAGS) $(DEFS) -pthread -o $@ $<
#
# HT2LIB targets
#
ht2lib: libhisat2lib-debug.a libhisat2lib.a libhisat2lib-debug.so libhisat2lib.so
libhisat2lib-debug.a: $(HT2LIB_DEBUG_OBJS)
ar rc $@ $(HT2LIB_DEBUG_OBJS)
libhisat2lib.a: $(HT2LIB_RELEASE_OBJS)
ar rc $@ $(HT2LIB_RELEASE_OBJS)
libhisat2lib-debug.so: $(HT2LIB_SHARED_DEBUG_OBJS)
$(CXX) $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) $(DEFS) $(SRA_DEF) -DBOWTIE2 -Wall $(INC) $(SEARCH_INC) \
-shared -o $@ $(HT2LIB_SHARED_DEBUG_OBJS) $(LIBS) $(SRA_LIB) $(SEARCH_LIBS)
libhisat2lib.so: $(HT2LIB_SHARED_RELEASE_OBJS)
$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) $(DEFS) $(SRA_DEF) -DBOWTIE2 $(NOASSERT_FLAGS) -Wall $(INC) $(SEARCH_INC)\
-shared -o $@ $(HT2LIB_SHARED_RELEASE_OBJS) $(LIBS) $(SRA_LIB) $(SEARCH_LIBS)
.ht2lib-obj-debug/%.o: %.cpp
@mkdir -p $(dir $@)/$(dir $<)
$(CXX) -fPIC $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) $(DEFS) $(SRA_DEF) $(HT2LIB_FLAGS) -DBOWTIE2 -Wall $(INC) $(SEARCH_INC) \
-c -o $@ $<
.ht2lib-obj-release/%.o: %.cpp
@mkdir -p $(dir $@)/$(dir $<)
$(CXX) -fPIC $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) $(DEFS) $(SRA_DEF) $(HT2LIB_FLAGS) -DBOWTIE2 $(NOASSERT_FLAGS) -Wall $(INC) $(SEARCH_INC) \
-c -o $@ $<
.ht2lib-obj-debug-shared/%.o: %.cpp
@mkdir -p $(dir $@)/$(dir $<)
$(CXX) -fPIC $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) $(DEFS) $(SRA_DEF) $(HT2LIB_FLAGS) -DBOWTIE2 -Wall $(INC) $(SEARCH_INC) \
-c -o $@ $<
.ht2lib-obj-release-shared/%.o: %.cpp
@mkdir -p $(dir $@)/$(dir $<)
$(CXX) -fPIC $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) $(DEFS) $(SRA_DEF) $(HT2LIB_FLAGS) -DBOWTIE2 $(NOASSERT_FLAGS) -Wall $(INC) $(SEARCH_INC) \
-c -o $@ $<
#
# repeatexp
#
repeatexp:
g++ -o repeatexp repeatexp.cpp -I hisat2lib libhisat2lib.a
hisat2: ;
hisat2.bat:
echo "@echo off" > hisat2.bat
echo "perl %~dp0/hisat2 %*" >> hisat2.bat
hisat2-build.bat:
echo "@echo off" > hisat2-build.bat
echo "python %~dp0/hisat2-build %*" >> hisat2-build.bat
hisat2-inspect.bat:
echo "@echo off" > hisat2-inspect.bat
echo "python %~dp0/hisat2-inspect %*" >> hisat2-inspect.bat
.PHONY: hisat2-src
hisat2-src: $(SRC_PKG_LIST)
chmod a+x scripts/*.sh scripts/*.pl
mkdir .src.tmp
mkdir .src.tmp/hisat2-$(VERSION)
zip tmp.zip $(SRC_PKG_LIST)
mv tmp.zip .src.tmp/hisat2-$(VERSION)
cd .src.tmp/hisat2-$(VERSION) ; unzip tmp.zip ; rm -f tmp.zip
cd .src.tmp ; zip -r hisat2-$(VERSION)-source.zip hisat2-$(VERSION)
cp .src.tmp/hisat2-$(VERSION)-source.zip .
rm -rf .src.tmp
.PHONY: hisat2-bin
hisat2-bin: $(BIN_PKG_LIST) $(HISAT2_BIN_LIST) $(HISAT2_BIN_LIST_AUX)
chmod a+x scripts/*.sh scripts/*.pl
rm -rf .bin.tmp
mkdir .bin.tmp
mkdir .bin.tmp/hisat2-$(VERSION)
if [ -f hisat2.exe ] ; then \
zip tmp.zip $(BIN_PKG_LIST) $(addsuffix .exe,$(HISAT2_BIN_LIST) $(HISAT2_BIN_LIST_AUX)) ; \
else \
zip tmp.zip $(BIN_PKG_LIST) $(HISAT2_BIN_LIST) $(HISAT2_BIN_LIST_AUX) ; \
fi
mv tmp.zip .bin.tmp/hisat2-$(VERSION)
cd .bin.tmp/hisat2-$(VERSION) ; unzip tmp.zip ; rm -f tmp.zip
cd .bin.tmp ; zip -r hisat2-$(VERSION)-$(BITS).zip hisat2-$(VERSION)
cp .bin.tmp/hisat2-$(VERSION)-$(BITS).zip .
rm -rf .bin.tmp
.PHONY: doc
doc: doc/manual.inc.html MANUAL
doc/manual.inc.html: MANUAL.markdown
pandoc -T "HISAT2 Manual" -o $@ \
--from markdown --to HTML --toc $^
perl -i -ne \
'$$w=0 if m|^</body>|;print if $$w;$$w=1 if m|^<body>|;' $@
MANUAL: MANUAL.markdown
perl doc/strip_markdown.pl < $^ > $@
.PHONY: clean
clean:
rm -f $(HISAT2_BIN_LIST) $(HISAT2_BIN_LIST_AUX) \
$(addsuffix .exe,$(HISAT2_BIN_LIST) $(HISAT2_BIN_LIST_AUX)) \
hisat2-src.zip hisat2-bin.zip
rm -f core.* .tmp.head
rm -rf *.dSYM
rm -rf .ht2lib-obj*
rm -f libhisat2lib*.a libhisat2lib*.so
.PHONY: push-doc
push-doc: doc/manual.inc.html
scp doc/*.*html doc/indexes.txt salz-dmz:/ccb/salz7-data/www/ccb.jhu.edu/html/software/hisat2/

16
NEWS Normal file

@ -0,0 +1,16 @@
HISAT 2 NEWS
=============
HISAT 2 is now available for download from the project website,
http://bowtie-bio.sf.net/bowtie2. 2.0.0-beta is the first version released to
the public and 2.0.7 is the latest version. HISAT 2 is licensed under
the GPLv3 license. See `LICENSE' file for details.
Version Release History
=======================
Version 2.0.0-beta - August XX, 2015
* Improved multithreading support so that Bowtie 2 now uses native Windows
threads when compiled on Windows and uses a faster mutex. Threading
performance should improve on all platforms.

247
README.md Normal file

@ -0,0 +1,247 @@
HISAT-3N
============
Overview
-----------------
HISAT-3N (hierarchical indexing for spliced alignment of transcripts - 3 nucleotides)
is an ultrafast and memory-efficient sequence aligner designed for nucleotide conversion
sequencing technologies. A HISAT-3N index contains two HISAT2 indexes and has a small memory footprint:
for the human genome, it requires 9 GB for the standard 3N-index and 10.5 GB for the repeat 3N-index.
The repeat 3N-index can be used to align one read to thousands of positions 3 times faster than the standard 3N-index.
HISAT-3N is developed based on [HISAT2],
which is particularly optimized for RNA sequencing technology. HISAT-3N supports both strand-specific and non-strand-specific reads.
HISAT-3N can be used for any base-converted sequencing reads, including [BS-seq], [SLAM-seq], [scBS-seq], [scSLAM-seq], and [TAPS].
See the [HISAT-3N] website for more information.
[HISAT2]:https://github.com/DaehwanKimLab/hisat2
[BS-seq]: https://en.wikipedia.org/wiki/Bisulfite_sequencing
[SLAM-seq]: https://www.nature.com/articles/nmeth.4435
[scBS-seq]: https://www.nature.com/articles/nmeth.3035
[scSLAM-seq]: https://www.nature.com/articles/s41586-019-1369-y
[TAPS]: https://www.nature.com/articles/s41587-019-0041-2
[HISAT-3N]:https://daehwankimlab.github.io/hisat2/hisat-3n
Getting started
============
HISAT-3N requires a 64-bit computer running either Linux or Mac OS X and at least 16 GB of RAM.
A few notes:
1. Building the standard 3N index requires 16GB of RAM or less.
2. Building the repeat 3N index requires 256GB of RAM.
3. The alignment process using either the standard or repeat index requires less than 16GB of RAM.
4. [SAMtools] is required to sort SAM files in order to generate a HISAT-3N table.
Install
------------
git clone https://github.com/DaehwanKimLab/hisat2.git hisat-3n
cd hisat-3n
git checkout -b hisat-3n origin/hisat-3n
make
Build a HISAT-3N index with `hisat-3n-build`
-----------
`hisat-3n-build` builds a 3N-index, which contains two HISAT2 indexes, from a set of DNA sequences. For the standard 3N-index,
each index contains 16 files with the suffix `.3n.*.*.ht2`.
For the repeat 3N-index, there are 16 more files in addition to those of the standard 3N-index, and they have the suffix
`.3n.*.rep.*.ht2`.
These files constitute the hisat-3n index, and no other file is needed to align reads to the reference.
* `--base-change <chr1,chr2>` argument is required for `hisat-3n-build` and `hisat-3n`.
Specify which base is converted to another base during the sequencing process. Enter
2 letters separated by ',' for this argument. The first letter (chr1) is the original base that gets converted, and the second letter (chr2) is
the base it is converted to. For example, in SLAM-seq some 'T's are converted to 'C's,
so enter `--base-change T,C`. In bisulfite-seq some 'C's are converted to 'T's, so enter `--base-change C,T`.
* Different conversion types may build the same hisat-3n index. Please check the table below for more detail.
Once you have built the hisat-3n index with C to T conversion (for example, for BS-seq),
you can align T to C conversion reads (for example, SLAM-seq reads) with the same index.
| Conversion Types | HISAT-3N index suffix |
|:----------------------------------:|:-----------------------------:|
|C -> T<br>T -> C<br>A -> G<br>G -> A|.3n.CT.\*.ht2 <br>.3n.GA.\*.ht2|
|A -> C<br>C -> A<br>G -> T<br>T -> G|.3n.AC.\*.ht2 <br>.3n.TG.\*.ht2|
|A -> T<br>T -> A |.3n.AT.\*.ht2 <br>.3n.TA.\*.ht2|
|C -> G<br>G -> C |.3n.CG.\*.ht2 <br>.3n.GC.\*.ht2|
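The table above can be read as a lookup from a `--base-change` value to the pair of index tags that appear in the file suffixes. The following is a minimal Python sketch of that lookup, not code from HISAT-3N itself; the names `INDEX_TAGS` and `index_tags` are hypothetical and exist only for illustration.

```python
from typing import Tuple

# Hypothetical lookup copied row by row from the conversion table above.
# Keys are unordered base pairs, so "C,T" and "T,C" resolve to the same index.
INDEX_TAGS = {
    frozenset("CT"): ("CT", "GA"),
    frozenset("AG"): ("CT", "GA"),
    frozenset("AC"): ("AC", "TG"),
    frozenset("GT"): ("AC", "TG"),
    frozenset("AT"): ("AT", "TA"),
    frozenset("CG"): ("CG", "GC"),
}


def index_tags(base_change: str) -> Tuple[str, str]:
    """Return the two index tags for a value such as 'T,C' (illustrative helper)."""
    parts = [p.strip().upper() for p in base_change.split(",")]
    valid = len(parts) == 2 and all(p in set("ACGT") for p in parts)
    if not valid or parts[0] == parts[1]:
        raise ValueError("expected two different bases separated by ',': %r" % base_change)
    return INDEX_TAGS[frozenset(parts)]


if __name__ == "__main__":
    # Both conversions map to the .3n.CT.*.ht2 / .3n.GA.*.ht2 files.
    print(index_tags("C,T"), index_tags("T,C"))
```

Because the key is an unordered pair, the sketch makes the reuse rule explicit: any two conversion types in the same table row share one index.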
#### Examples:
# Build the standard HISAT-3N index (with C to T conversion):
hisat-3n-build --base-change C,T genome.fa genome
# Build the repeat HISAT-3N index (with T to C conversion, requires 256 GB of memory for the human genome index):
hisat-3n-build --base-change T,C --repeat-index genome.fa genome
Optionally, you can build a graph index and add SNP or splice site information to the index to increase alignment accuracy.
Building the graph index may require more memory than building the linear index.
For more detail, please check the [HISAT2 manual].
[HISAT2 manual]:https://daehwankimlab.github.io/hisat2/manual/
#### Examples:
# Build the standard HISAT-3N integrated index with SNP information
hisat-3n-build --base-change C,T --snp genome.snp genome.fa genome
# Build the standard HISAT-3N integrated index with splice site information
hisat-3n-build --base-change C,T --ss genome.ss --exon genome.exon genome.fa genome
# Build the repeat HISAT-3N integrated index with SNP information
hisat-3n-build --base-change C,T --repeat-index --snp genome.snp genome.fa genome
# Build the repeat HISAT-3N integrated index with splice site information
hisat-3n-build --base-change C,T --repeat-index --ss genome.ss --exon genome.exon genome.fa genome
Alignment with `hisat-3n`
------------
After building the HISAT-3N index, you are ready to use `hisat-3n` for alignment.
HISAT-3N has the same set of parameters as HISAT2, plus some additional arguments. Please refer to the [HISAT2 manual] for more details.
For the human reference genome, HISAT-3N requires about 9GB for alignment with the standard 3N-index and 10.5GB for the repeat 3N-index.
* `--base-change <nt1,nt2>`
Specify the nucleotide conversion type (e.g., C to T in bisulfite-sequencing reads). The option value is two characters separated by ','. Give the original nucleotide as the first character (nt1) and the converted nucleotide as the second character (nt2). For example, if performing [SLAM-seq], where some 'T's are converted to 'C's, input `--base-change T,C`.
As another example, if performing bisulfite-seq, where some 'C's are converted to 'T's, input `--base-change C,T`.
If you want to align non-converted reads to the regular HISAT2 index, omit this option.
* `--index/-x <hisat-3n-idx>`
Specify the index file basename for HISAT-3N. The basename is the name of the index files up to but not including the suffix (`.3n.*.*.ht2`, etc.).
For example, if you built your index with the basename 'genome' using `hisat-3n-build`, input `--index genome`.
* `--directional-mapping`
Perform directional mapping. Please use this option only if your sequencing reads are generated from a strand-specific library.
The directional mapping mode is about 2x faster than the standard (non-directional) mapping mode.
* `--repeat-limit <int>`
Set the number of alignments to be checked for each repeat alignment. Increase this number to direct hisat-3n
to output more alignments when a read has multiple mapping locations. We suggest limiting the repeat number for paired-end read alignment to no more
than 1,000,000. Default: 1000.
* `--unique-only`
Only output uniquely aligned reads.
#### Examples:
* Single-end [SLAM-seq] read (T to C conversion) alignment with standard 3N-index:
`hisat-3n --index genome -f -U read.fa -S output.sam --base-change T,C`
* Paired-end strand-specific bisulfite-seq read (C to T conversion) alignment with repeat 3N-index:
`hisat-3n --index genome -f -1 read_1.fa -2 read_2.fa -S output.sam --base-change C,T --directional-mapping`
* Single-end TAPS read (C to T conversion) alignment with repeat 3N-index, outputting only uniquely aligned results:
`hisat-3n --index genome -q -U read.fq -S output.sam --base-change C,T --unique-only`
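When the alignment step is scripted, the example commands above can be assembled programmatically. The following is a minimal sketch that only builds an argument list and does not run anything; `hisat3n_command` is a hypothetical helper, and it uses only the flags that appear in this README.

```python
from typing import List, Optional


def hisat3n_command(index: str, base_change: str, output_sam: str,
                    unpaired: Optional[str] = None,
                    mate1: Optional[str] = None,
                    mate2: Optional[str] = None,
                    fasta: bool = False,
                    directional: bool = False,
                    unique_only: bool = False) -> List[str]:
    """Assemble (but do not execute) a hisat-3n argument list. Illustrative only."""
    if (mate1 is None) != (mate2 is None):
        raise ValueError("paired-end input needs both mate files")
    paired = mate1 is not None
    if paired == (unpaired is not None):
        raise ValueError("give either unpaired reads or a mate pair, not both or neither")

    # -f marks FASTA reads and -q marks FASTQ reads, as in the examples above.
    cmd = ["hisat-3n", "--index", index, "-f" if fasta else "-q"]
    if paired:
        cmd += ["-1", mate1, "-2", mate2]
    else:
        cmd += ["-U", unpaired]
    cmd += ["-S", output_sam, "--base-change", base_change]
    if directional:
        cmd.append("--directional-mapping")  # strand-specific libraries only
    if unique_only:
        cmd.append("--unique-only")
    return cmd


if __name__ == "__main__":
    print(" ".join(hisat3n_command("genome", "T,C", "output.sam",
                                   unpaired="read.fa", fasta=True)))
```

The returned list could be handed to `subprocess.run` on a machine where `hisat-3n` is installed; that step is left out here so the sketch stays runnable anywhere.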
#### Extra SAM tags generated by HISAT-3N:
* `Yf:i:<N>`: Number of conversions detected in the read.
* `Zf:i:<N>`: Number of unconverted bases detected in the read. Yf + Zf = total number of bases which can be converted in the read sequence.
* `YZ:A:<A>`: The value `+` or `-` indicates the read is mapped to REF-3N (`+`) or REF-RC-3N (`-`), respectively.
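These tags can be read back from a SAM record with a few lines of Python. The sketch below assumes a standard tab-separated SAM line with 11 mandatory fields followed by optional `TAG:TYPE:VALUE` fields; the function names and the sample record are made up for illustration and are not HISAT-3N output.

```python
from typing import Dict, Optional, Union


def conversion_tags(sam_line: str) -> Dict[str, Union[int, str]]:
    """Extract the Yf, Zf and YZ tags from one SAM alignment line (illustrative)."""
    fields = sam_line.rstrip("\n").split("\t")
    tags: Dict[str, Union[int, str]] = {}
    for field in fields[11:]:  # optional fields start after the 11 mandatory ones
        name, _type, value = field.split(":", 2)
        if name in ("Yf", "Zf"):
            tags[name] = int(value)
        elif name == "YZ":
            tags[name] = value
    return tags


def converted_fraction(tags: Dict[str, Union[int, str]]) -> Optional[float]:
    """Yf / (Yf + Zf), or None when the read has no convertible bases."""
    total = tags["Yf"] + tags["Zf"]
    return tags["Yf"] / total if total else None


if __name__ == "__main__":
    # Made-up record, used only to exercise the parser.
    record = ("read1\t0\t1\t11874\t60\t10M\t*\t0\t0\tACGTTTGTAC\tFFFFFFFFFF"
              "\tYf:i:3\tZf:i:1\tYZ:A:+")
    tags = conversion_tags(record)
    print(tags, converted_fraction(tags))
```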
Generate a 3N-conversion-table with `hisat-3n-table`
------------
### Preparation
To generate a 3N-conversion-table, first sort the SAM alignment file generated by `hisat-3n`.
[SAMtools] is required for this sorting process.
Use `samtools sort` to convert the SAM file into a sorted SAM file.
samtools sort output.sam -o output_sorted.sam -O sam
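Before building the table, it can be useful to confirm that a SAM file declares itself as sorted. The sketch below checks the `SO:coordinate` tag on the `@HD` header line, which is where the SAM format records sort order. Two assumptions to keep in mind: it trusts the header rather than verifying the record order, and it assumes coordinate order (the `samtools sort` default) is the order `hisat-3n-table` expects. The function name is hypothetical.

```python
from typing import Iterable


def is_coordinate_sorted(header_lines: Iterable[str]) -> bool:
    """Return True if the @HD header line declares SO:coordinate (illustrative check)."""
    for line in header_lines:
        if line.startswith("@HD"):
            fields = line.rstrip("\n").split("\t")[1:]
            return "SO:coordinate" in fields
    return False  # no @HD line: sort order is not declared


if __name__ == "__main__":
    print(is_coordinate_sorted(["@HD\tVN:1.0\tSO:coordinate\n", "@SQ\tSN:1\tLN:1000\n"]))
    print(is_coordinate_sorted(["@HD\tVN:1.0\tSO:unsorted\n"]))
```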
Generate 3N-conversion-table with `hisat-3n-table`:
### Usage
hisat-3n-table [options]* --alignments <alignmentFile> --ref <refFile> --base-change <char1,char2>
#### Main arguments
* `--alignments <alignmentFile>`
SORTED SAM file. Please enter `-` for standard input.
* `--ref <refFile>`
The reference genome file (FASTA format) that was used to generate the HISAT-3N index.
* `--output-name <outputFile>`
File to write the 3N-conversion-table (TSV format) to. By default, the table is written to standard output (stdout), i.e. the console.
* `--base-change <char1,char2>`
The base-change rule. Enter exactly the same `--base-change` argument that was used for `hisat-3n`.
For example, please enter `--base-change C,T` for bisulfite sequencing reads.
#### Input options
* `-u/--unique-only`
Only count uniquely aligned reads in the 3N-conversion-table.
* `-m/--multiple-only`
Only count multiply aligned reads in the 3N-conversion-table.
* `-c/--CG-only`
Only count CpG sites in the reference genome. This option is designed for bisulfite sequencing reads.
* `--added-chrname`
Add this option if you used `--add-chrname` during `hisat-3n` alignment.
With `--add-chrname`, `hisat-3n` adds the prefix "chr" to each chromosome name in the SAM output.
`hisat-3n-table` then cannot find the chromosome name in the reference because of the additional "chr" prefix. This option helps `hisat-3n-table`
find the matching chromosome name in the reference file. The 3N-table reports the same chromosome names as the SAM file.
* `--removed-chrname`
Add this option if you used `--remove-chrname` during `hisat-3n` alignment.
With `--remove-chrname`, `hisat-3n` removes the prefix "chr" from each chromosome name in the SAM output.
`hisat-3n-table` then cannot find the chromosome name in the reference because the SAM name has no "chr" prefix. This option helps `hisat-3n-table`
find the matching chromosome name in the reference file. The 3N-table reports the same chromosome names as the SAM file.
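The name matching described for `--added-chrname` and `--removed-chrname` can be pictured as a small mapping from the chromosome name in the SAM file back to the name in the reference FASTA. This is only a sketch of the idea under that reading of the options, not the actual `hisat-3n-table` implementation; `reference_name` is a hypothetical helper.

```python
def reference_name(sam_name: str,
                   added_chrname: bool = False,
                   removed_chrname: bool = False) -> str:
    """Map a SAM chromosome name to the name expected in the reference (illustrative)."""
    if added_chrname and removed_chrname:
        raise ValueError("choose at most one of added_chrname / removed_chrname")
    if added_chrname:
        # The aligner added "chr", so the reference name lacks it.
        return sam_name[3:] if sam_name.startswith("chr") else sam_name
    if removed_chrname:
        # The aligner removed "chr", so the reference name still has it.
        return sam_name if sam_name.startswith("chr") else "chr" + sam_name
    return sam_name


if __name__ == "__main__":
    print(reference_name("chr1", added_chrname=True))   # name to look up in the reference
    print(reference_name("1", removed_chrname=True))
```

In both cases the table would still report the name as it appears in the SAM file; only the lookup in the reference uses the mapped name.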
#### Other options:
* `-p/--threads <int>`
Launch `int` parallel threads (default: 1) for table building.
* `-h/--help`
Print usage information and quit.
#### Examples:
# Generate the 3N-conversion-table for bisulfite sequencing data:
hisat-3n-table -p 16 --alignments sorted_alignment_result.sam --ref genome.fa --output-name output.tsv --base-change C,T
# Generate the 3N-conversion-table for TAPS data, counting only bases in CpG sites from uniquely aligned reads:
hisat-3n-table -p 16 --alignments sorted_alignment_result.sam --ref genome.fa --output-name output.tsv --base-change C,T --CG-only --unique-only
# Generate the 3N-conversion-table for bisulfite sequencing data from sorted BAM file:
samtools view -h sorted_alignment_result.bam | hisat-3n-table --ref genome.fa --alignments - --output-name output.tsv --base-change C,T
# Generate the 3N-conversion-table for bisulfite sequencing data from unsorted BAM file:
samtools sort alignment_result.bam -O sam | hisat-3n-table --ref genome.fa --alignments - --output-name output.tsv --base-change C,T
#### Note:
There are 7 columns in the 3N-conversion-table:
1. `ref`: the chromosome name.
2. `pos`: 1-based position in `ref`.
3. `strand`: '+' for forward strand. '-' for reverse strand.
4. `convertedBaseQualities`: the qualities of the converted bases in read-level measurement. The length of this string is equal to the number of converted bases.
5. `convertedBaseCount`: the number of distinct read positions where converted bases in read-level measurements were found.
This number is equal to the length of `convertedBaseQualities`.
6. `unconvertedBaseQualities`: the qualities of the unconverted bases in read-level measurement. The length of this string is equal to the number of unconverted bases in read-level measurement.
7. `unconvertedBaseCount`: the number of distinct read positions where unconverted bases in read-level measurements were found.
This number is equal to the length of `unconvertedBaseQualities`.
##### Sample 3N-conversion-table:
ref pos strand convertedBaseQualities convertedBaseCount unconvertedBaseQualities unconvertedBaseCount
1 11874 + FFFFFB<BF<F 11 0
1 11877 - FFFFFF< 7 0
1 11878 + FFFBB//F/BB 11 0
1 11879 + 0 FFFBB//FB/ 10
1 11880 - F 1 FFFF/ 5
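The column layout above can be read programmatically. The following is a minimal C++ sketch, not part of `hisat-3n-table`: the names `ConversionRow`, `splitTabs`, `parseRow` and `conversionRate` are hypothetical, and it assumes the columns are separated by single tab characters (as the `.tsv` output name suggests). It keeps empty fields, because a quality column is empty whenever its count is 0, and it checks that each count matches the length of its quality string. The "conversion rate" it computes is only an illustrative summary of the two counts.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

// One data row of a 3N-conversion-table. Field names are illustrative and
// mirror the seven columns described above.
struct ConversionRow {
    std::string ref;              // chromosome name
    long pos = 0;                 // 1-based position in ref
    char strand = '+';            // '+' or '-'
    std::string convertedQual;    // empty when convertedCount == 0
    long convertedCount = 0;
    std::string unconvertedQual;  // empty when unconvertedCount == 0
    long unconvertedCount = 0;
};

// Split a line on tabs, keeping empty fields.
std::vector<std::string> splitTabs(const std::string& line) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type tab = line.find('\t', start);
        if (tab == std::string::npos) {
            fields.push_back(line.substr(start));
            return fields;
        }
        fields.push_back(line.substr(start, tab - start));
        start = tab + 1;
    }
}

// Parse one line. Returns false for the header line or any malformed line.
bool parseRow(const std::string& line, ConversionRow& row) {
    const std::vector<std::string> f = splitTabs(line);
    if (f.size() != 7 || f[2].size() != 1) return false;
    try {
        row.pos = std::stol(f[1]);
        row.convertedCount = std::stol(f[4]);
        row.unconvertedCount = std::stol(f[6]);
    } catch (const std::exception&) {
        return false;  // a numeric column did not hold a number
    }
    row.ref = f[0];
    row.strand = f[2][0];
    row.convertedQual = f[3];
    row.unconvertedQual = f[5];
    // Each count should equal the length of its quality string.
    return row.convertedQual.size() == static_cast<std::size_t>(row.convertedCount)
        && row.unconvertedQual.size() == static_cast<std::size_t>(row.unconvertedCount);
}

// Fraction of read-level observations at this position that were converted.
double conversionRate(const ConversionRow& row) {
    assert(row.convertedCount >= 0 && row.unconvertedCount >= 0);
    const long total = row.convertedCount + row.unconvertedCount;
    return total == 0 ? 0.0 : static_cast<double>(row.convertedCount) / total;
}
```

To process a whole table, read the file line by line with `std::getline` and pass each line to `parseRow`, skipping lines for which it returns false (this includes the header line).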
[SAMtools]: http://samtools.sourceforge.net
Publication
============
* HISAT-3N
Zhang, Y., Park, C., Bennett, C., Thornton, M. and Kim, D. [Rapid and accurate alignment of nucleotide conversion sequencing reads with HISAT-3N](https://doi.org/10.1101/gr.275193.120). _Genome Research_ **31(7)**: 1290-1295 (2021)
* HISAT2
Kim, D., Paggi, J.M., Park, C. _et al._ [Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype](https://doi.org/10.1038/s41587-019-0201-4). _Nat Biotechnol_ **37**, 907–915 (2019)

4
TUTORIAL Normal file

@ -0,0 +1,4 @@
See the section toward the end of MANUAL entitled "Getting started with HISAT2". Or,
for tutorial for latest HISAT2 version, visit:
https://ccb.jhu.edu/software/hisat2/manual.shtml#getting-started-with-hisat2

1
_config.yml Normal file

@ -0,0 +1 @@
theme: jekyll-theme-time-machine

1772
aligner_bt.cpp Normal file

File diff suppressed because it is too large

947
aligner_bt.h Normal file

@ -0,0 +1,947 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNER_BT_H_
#define ALIGNER_BT_H_
#include <utility>
#include <stdint.h>
#include "aligner_sw_common.h"
#include "aligner_result.h"
#include "scoring.h"
#include "edit.h"
#include "limit.h"
#include "dp_framer.h"
#include "sse_util.h"
/* Say we've filled in a DP matrix in a cost-only manner, not saving the scores
* for each of the cells. At the end, we obtain a list of candidate cells and
* we'd like to backtrace from them. The per-cell scores are gone, but we have
* to re-create the correct path somehow. Hopefully we can do this without
* recreating most or all of the score matrix, since this takes too much memory.
*
* Approach 1: Naively refill the matrix.
*
* Just refill the matrix, perhaps backwards starting from the backtrace cell.
* Since this involves recreating all or most of the score matrix, this is not
* a good approach.
*
* Approach 2: Naive backtracking.
*
* Conduct a search through the space of possible backtraces, rooted at the
* candidate cell. To speed things along, we can prioritize paths that have a
* high score and that align more characters from the read.
*
* The approach is simple, but it's neither fast nor memory-efficient in
* general.
*
* Approach 3: Refilling with checkpoints.
*
* Refill the matrix "backwards" starting from the candidate cell, but use
* checkpoints to ensure that only a series of relatively small triangles or
* rectangles need to be refilled. The checkpoints must include elements from
* the H, E and F matrices; not just H. After each refill, we backtrace
* through the refilled area, then discard/reuse the fill memory. I call each
* such fill/backtrace a mini-fill/backtrace.
*
* If there's only one path to be found, then this is O(m+n). But what if
* there are many? And what if we would like to avoid paths that overlap in
* one or more cells? There are two ways we can make this more efficient:
*
* 1. Remember the re-calculated E/F/H values and try to retrieve them
* 2. Keep a record of cells that have already been traversed
*
* Legend:
*
* 1: Candidate cell
* 2: Final cell from first mini-fill/backtrace
* 3: Final cell from second mini-fill/backtrace (third not shown)
* +: Checkpointed cell
* *: Cell filled from first or second mini-fill/backtrace
* -: Unfilled cell
*
* ---++--------++--------++----
* --++--------++*-------++-----
* -++--(etc)-++**------++------
* ++--------+3***-----++-------
* +--------++****----++--------
* --------++*****---++--------+
* -------++******--++--------++
* ------++*******-++*-------++-
* -----++********++**------++--
* ----++********2+***-----++---
* ---++--------++****----++----
* --++--------++*****---++-----
* -++--------++*****1--++------
* ++--------++--------++-------
*
* Approach 4: Backtracking with checkpoints.
*
* Conduct a search through the space of possible backtraces, rooted at the
* candidate cell. Use "checkpoints" to prune. That is, when a backtrace
* moves through a cell with a checkpointed score, consider the score
* accumulated so far and the cell's saved score; abort if those two scores
* add to something less than a valid score. Note we're only checkpointing H
* in this case (possibly; see "subtle point"), not E or F.
*
* Subtle point: checkpoint scores are a result of moving forward through
* the matrix whereas backtracking scores result from moving backward. This
* matters because the two paths that meet up at a cell might have both
* factored in a gap open penalty for the same gap, in which case we will
* underestimate the overall score and prune a good path. Here are two ideas
* for how to resolve this:
*
* Idea 1: when we combine the forward and backward scores to find an overall
* score, and our backtrack procedure *just* made a horizontal or vertical
* move, add in a "bonus" equal to the gap open penalty of the appropriate
* type (read gap open for horizontal, ref gap open for vertical). This might
* overcompensate, since
*
* Idea 2: keep the E and F values for the checkpoints around, in addition to
* the H values. When it comes time to combine the score from the forward
* and backward paths, we consider the last move we made in the backward
* backtrace. If it's a read gap (horizontal move), then we calculate the
* overall score as:
*
* max(Score-backward + H-forward, Score-backward + E-forward + read-open)
*
* If it's a reference gap (vertical move), then we calculate the overall
* score as:
*
* max(Score-backward + H-forward, Score-backward + F-forward + ref-open)
*
* What does it mean to abort a backtrack? If we're starting a new branch
* and there is a checkpoint in the bottommost cell of the branch, and the
* overall score is less than the target, then we can simply ignore the
* branch. If the checkpoint occurs in the middle of a string of matches, we
* need to curtail the branch such that it doesn't include the checkpointed
* cell and we won't ever try to enter the checkpointed cell, e.g., on a
* mismatch.
*
* Approaches 3 and 4 seem reasonable, and could be combined. For simplicity,
* we implement only approach 4 for now.
*
* Checkpoint information is propagated from the fill process to the backtracer
* via a
*/
enum {
BT_NOT_FOUND = 1, // could not obtain the backtrace because it
// overlapped a previous solution
BT_FOUND, // obtained a valid backtrace
BT_REJECTED_N, // backtrace rejected because it had too many Ns
BT_REJECTED_CORE_DIAG // backtrace rejected because it failed to overlap a
// core diagonal
};
/**
* Parameters for a matrix of potential backtrace problems to solve.
* Encapsulates information about:
*
* The problem given a particular reference substring:
*
* - The query string (nucleotides and qualities)
* - The reference substring (incl. orientation, offset into overall sequence)
* - Checkpoints (i.e. values of matrix cells)
* - Scoring scheme and other thresholds
*
* The problem given a particular reference substring AND a particular row and
* column from which to backtrace:
*
* - The row and column
* - The target score
*/
class BtBranchProblem {
public:
/**
* Create new uninitialized problem.
*/
BtBranchProblem() { reset(); }
/**
* Initialize a new problem.
*/
void initRef(
const char *qry, // query string (along rows)
const char *qual, // query quality string (along rows)
size_t qrylen, // query string (along rows) length
const char *ref, // reference string (along columns)
TRefOff reflen, // in-rectangle reference string length
TRefOff treflen,// total reference string length
TRefId refid, // reference id
TRefOff refoff, // reference offset
bool fw, // orientation of problem
const DPRect* rect, // dynamic programming rectangle filled out
const Checkpointer* cper, // checkpointer
const Scoring *sc, // scoring scheme
size_t nceil) // max # Ns allowed in alignment
{
qry_ = qry;
qual_ = qual;
qrylen_ = qrylen;
ref_ = ref;
reflen_ = reflen;
treflen_ = treflen;
refid_ = refid;
refoff_ = refoff;
fw_ = fw;
rect_ = rect;
cper_ = cper;
sc_ = sc;
nceil_ = nceil;
}
/**
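* Initialize the backtrace-specific part of the problem: the row and column
* to backtrace from, the strategy flags and the target score.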
*/
void initBt(
size_t row, // row
size_t col, // column
bool fill, // use a filling rather than a backtracking strategy
bool usecp, // use checkpoints to short-circuit while backtracking
TAlScore targ) // target score
{
row_ = row;
col_ = col;
targ_ = targ;
fill_ = fill;
usecp_ = usecp;
if(fill) {
assert(usecp_);
}
}
/**
* Reset to uninitialized state.
*/
void reset() {
qry_ = qual_ = ref_ = NULL;
cper_ = NULL;
rect_ = NULL;
sc_ = NULL;
qrylen_ = reflen_ = treflen_ = refid_ = refoff_ = row_ = col_ = targ_ = nceil_ = 0;
fill_ = fw_ = usecp_ = false;
}
/**
* Return true iff the BtBranchProblem has been initialized.
*/
bool inited() const {
return qry_ != NULL;
}
#ifndef NDEBUG
/**
* Sanity-check the problem.
*/
bool repOk() const {
assert_gt(qrylen_, 0);
assert_gt(reflen_, 0);
assert_gt(treflen_, 0);
assert_lt(row_, qrylen_);
assert_lt((TRefOff)col_, reflen_);
return true;
}
#endif
size_t reflen() const { return reflen_; }
size_t treflen() const { return treflen_; }
protected:
const char *qry_; // query string (along rows)
const char *qual_; // query quality string (along rows)
size_t qrylen_; // query string (along rows) length
const char *ref_; // reference string (along columns)
TRefOff reflen_; // in-rectangle reference string length
TRefOff treflen_;// total reference string length
TRefId refid_; // reference id
TRefOff refoff_; // reference offset
bool fw_; // orientation of problem
const DPRect* rect_; // dynamic programming rectangle filled out
size_t row_; // starting row
size_t col_; // starting column
TAlScore targ_; // target score
const Checkpointer *cper_; // checkpointer
bool fill_; // use mini-fills
bool usecp_; // use checkpointing?
const Scoring *sc_; // scoring scheme
size_t nceil_; // max # Ns allowed in alignment
friend class BtBranch;
friend class BtBranchQ;
friend class BtBranchTracer;
};
/**
* Encapsulates a "branch" which is a diagonal of cells (possibly of length 0)
* in the matrix where all the cells are matches. These stretches are linked
* together by edits to form a full backtrace path through the matrix. Lengths
* are measured w/r/t the number of rows traversed by the path, so a branch
* that represents a read gap extension could have length = 0.
*
* At the end of the day, the full backtrace path is represented as a list of
* BtBranch's where each BtBranch represents a stretch of matching cells (and
* up to one mismatching cell at its bottom extreme) ending in an edit (or in
* the bottommost row, in which case the edit is uninitialized). Each
* BtBranch's row and col fields indicate the bottommost cell involved in the
* diagonal stretch of matches, and the len_ field indicates the length of the
* stretch of matches. Note that the edits themselves also correspond to
* movement through the matrix.
*
* A related issue is how we record which cells have been visited so that we
* never report a pair of paths both traversing the same (row, col) of the
* overall DP matrix. This gets a little tricky because we have to take into
* account the cells covered by *edits* in addition to the cells covered by the
* stretches of matches. For instance: imagine a mismatch. That takes up a
* cell of the DP matrix, but it may or may not be preceded by a string of
* matches. It's hard to imagine how to represent this unless we let the
* mismatch "count toward" the len_ of the branch and let (row, col) refer to
* the cell where the mismatch occurs.
*
* We need BtBranches to "live forever" so that we can make some BtBranches
* parents of others using parent pointers. For this reason, BtBranch's are
* stored in an EFactory object in the BtBranchTracer class.
*/
class BtBranch {
public:
BtBranch() { reset(); }
BtBranch(
const BtBranchProblem& prob,
size_t parentId,
TAlScore penalty,
TAlScore score_en,
int64_t row,
int64_t col,
Edit e,
int hef,
bool root,
bool extend)
{
init(prob, parentId, penalty, score_en, row, col, e, hef, root, extend);
}
/**
* Reset to uninitialized state.
*/
void reset() {
parentId_ = 0;
score_st_ = score_en_ = len_ = row_ = col_ = 0;
curtailed_ = false;
e_.reset();
}
/**
* Caller gives us score_en, row and col. We figure out score_st and len_
* by comparing characters from the strings.
*/
void init(
const BtBranchProblem& prob,
size_t parentId,
TAlScore penalty,
TAlScore score_en,
int64_t row,
int64_t col,
Edit e,
int hef,
bool root,
bool extend);
/**
* Return true iff this branch ends in a solution to the backtrace problem.
*/
bool isSolution(const BtBranchProblem& prob) const {
const bool end2end = prob.sc_->monotone;
return score_st_ == prob.targ_ && (!end2end || endsInFirstRow());
}
/**
* Return true iff this branch could potentially lead to a valid alignment.
*/
bool isValid(const BtBranchProblem& prob) const {
int64_t scoreFloor = prob.sc_->monotone ? MIN_I64 : 0;
if(score_st_ < scoreFloor) {
// Dipped below the score floor
return false;
}
if(isSolution(prob)) {
// It's a solution, so it's also valid
return true;
}
if((int64_t)len_ > row_) {
// Went all the way to the top row
//assert_leq(score_st_, prob.targ_);
return score_st_ == prob.targ_;
} else {
int64_t match = prob.sc_->match();
int64_t bonusLeft = (row_ + 1 - len_) * match;
return score_st_ + bonusLeft >= prob.targ_;
}
}
/**
* Return true iff this branch overlaps with the given branch.
*/
bool overlap(const BtBranchProblem& prob, const BtBranch& bt) const {
// Calculate this branch's diagonal
assert_lt(row_, (int64_t)prob.qrylen_);
size_t fromend = prob.qrylen_ - row_ - 1;
size_t diag = fromend + col_;
int64_t lo = 0, hi = row_ + 1;
if(len_ == 0) {
lo = row_;
} else {
lo = row_ - (len_ - 1);
}
// Calculate other branch's diagonal
assert_lt(bt.row_, (int64_t)prob.qrylen_);
size_t ofromend = prob.qrylen_ - bt.row_ - 1;
size_t odiag = ofromend + bt.col_;
if(diag != odiag) {
return false;
}
int64_t olo = 0, ohi = bt.row_ + 1;
if(bt.len_ == 0) {
olo = bt.row_;
} else {
olo = bt.row_ - (bt.len_ - 1);
}
int64_t losm = olo, hism = ohi;
if(hi - lo < ohi - olo) {
swap(lo, losm);
swap(hi, hism);
}
if((lo <= losm && hi > losm) || (lo < hism && hi >= hism)) {
return true;
}
return false;
}
/**
* Return true iff this branch is higher priority than the branch 'o'.
*/
bool operator<(const BtBranch& o) const {
// Prioritize uppermost above score
if(uppermostRow() != o.uppermostRow()) {
return uppermostRow() < o.uppermostRow();
}
if(score_st_ != o.score_st_) return score_st_ > o.score_st_;
if(row_ != o.row_) return row_ < o.row_;
if(col_ != o.col_) return col_ > o.col_;
if(parentId_ != o.parentId_) return parentId_ > o.parentId_;
assert(false);
return false;
}
/**
* Return true iff the topmost cell involved in this branch is in the top
* row.
*/
bool endsInFirstRow() const {
assert_leq((int64_t)len_, row_ + 1);
return (int64_t)len_ == row_+1;
}
/**
* Return the uppermost row covered by this branch.
*/
size_t uppermostRow() const {
assert_geq(row_ + 1, (int64_t)len_);
return row_ + 1 - (int64_t)len_;
}
/**
* Return the leftmost column covered by this branch.
*/
size_t leftmostCol() const {
assert_geq(col_ + 1, (int64_t)len_);
return col_ + 1 - (int64_t)len_;
}
#ifndef NDEBUG
/**
* Sanity-check this BtBranch.
*/
bool repOk() const {
assert(root_ || e_.inited());
assert_gt(len_, 0);
assert_geq(col_ + 1, (int64_t)len_);
assert_geq(row_ + 1, (int64_t)len_);
return true;
}
#endif
protected:
// ID of the parent branch.
size_t parentId_;
// Penalty associated with the edit at the bottom of this branch (0 if
// there is no edit)
TAlScore penalty_;
// Score at the beginning of the branch
TAlScore score_st_;
// Score at the end of the branch (taking the edit into account)
TAlScore score_en_;
// Length of the branch. That is, the total number of diagonal cells
// involved in all the matches and in the edit (if any). Should always be
// > 0.
size_t len_;
// The row of the final (bottommost) cell in the branch. This might be the
// bottommost match if the branch has no associated edit. Otherwise, it's
// the cell occupied by the edit.
int64_t row_;
// The column of the final (bottommost) cell in the branch.
int64_t col_;
// The edit at the bottom of the branch. If this is the bottommost branch
// in the alignment and it does not end in an edit, then this remains
// uninitialized.
Edit e_;
// True iff this is the bottommost branch in the alignment. We can't just
// use row_ to tell us this because local alignments don't necessarily end
// in the last row.
bool root_;
bool curtailed_; // true -> pruned at a checkpoint where we otherwise
// would have had a match
friend class BtBranchQ;
friend class BtBranchTracer;
};
/**
* Instantiate and solve best-first branch-based backtraces.
*/
class BtBranchTracer {
public:
explicit BtBranchTracer() :
prob_(), bs_(), seenPaths_(DP_CAT), sawcell_(DP_CAT), doTri_() { }
/**
* Add a branch to the queue.
*/
void add(size_t id) {
assert(!bs_[id].isSolution(prob_));
unsorted_.push_back(make_pair(bs_[id].score_st_, id));
}
/**
* Add a branch to the list of solutions.
*/
void addSolution(size_t id) {
assert(bs_[id].isSolution(prob_));
solutions_.push_back(id);
}
/**
* Given a potential branch to add to the queue, see if we can follow the
* branch a little further first. If it's still valid, or if we reach a
* choice between valid outgoing paths, go ahead and add it to the queue.
*/
void examineBranch(
int64_t row,
int64_t col,
const Edit& e,
TAlScore pen,
TAlScore sc,
size_t parentId);
/**
* Take all possible ways of leaving the given branch and add them to the
* branch queue.
*/
void addOffshoots(size_t bid);
/**
* Get the best branch and remove it from the priority queue.
*/
size_t best(RandomSource& rnd) {
assert(!empty());
flushUnsorted();
assert_gt(sortedSel_ ? sorted1_.size() : sorted2_.size(), cur_);
// Perhaps shuffle everyone who's tied for first?
size_t id = sortedSel_ ? sorted1_[cur_] : sorted2_[cur_];
cur_++;
return id;
}
/**
* Return true iff there are no branches left to try.
*/
bool empty() const {
return size() == 0;
}
/**
* Return the size, i.e. the total number of branches contained.
*/
size_t size() const {
return unsorted_.size() +
(sortedSel_ ? sorted1_.size() : sorted2_.size()) - cur_;
}
/**
* Return true iff there are no solutions left to try.
*/
bool emptySolution() const {
return sizeSolution() == 0;
}
/**
* Return the size of the solution set so far.
*/
size_t sizeSolution() const {
return solutions_.size();
}
/**
* Sort unsorted branches, merge them with master sorted list.
*/
void flushUnsorted();
#ifndef NDEBUG
/**
* Sanity-check the queue.
*/
bool repOk() const {
assert_lt(cur_, (sortedSel_ ? sorted1_.size() : sorted2_.size()));
return true;
}
#endif
/**
* Initialize the tracer with respect to a new read. This involves
* resetting all the state relating to the set of cells already visited
*/
void initRef(
const char* rd, // in: read sequence
const char* qu, // in: quality sequence
size_t rdlen, // in: read sequence length
const char* rf, // in: reference sequence
size_t rflen, // in: in-rectangle reference sequence length
TRefOff trflen, // in: total reference sequence length
TRefId refid, // in: reference id
TRefOff refoff, // in: reference offset
bool fw, // in: orientation
const DPRect *rect, // in: DP rectangle
const Checkpointer *cper, // in: checkpointer
const Scoring& sc, // in: scoring scheme
size_t nceil) // in: N ceiling
{
prob_.initRef(rd, qu, rdlen, rf, rflen, trflen, refid, refoff, fw, rect, cper, &sc, nceil);
const size_t ndiag = rflen + rdlen - 1;
seenPaths_.resize(ndiag);
for(size_t i = 0; i < ndiag; i++) {
seenPaths_[i].clear();
}
// clear each of the per-column sets
if(sawcell_.size() < rflen) {
size_t isz = sawcell_.size();
sawcell_.resize(rflen);
for(size_t i = isz; i < rflen; i++) {
sawcell_[i].setCat(DP_CAT);
}
}
for(size_t i = 0; i < rflen; i++) {
sawcell_[i].setCat(DP_CAT);
sawcell_[i].clear(); // clear the set
}
}
/**
* Initialize with a new backtrace.
*/
void initBt(
TAlScore escore, // in: alignment score
size_t row, // in: start in this row
size_t col, // in: start in this column
bool fill, // in: use mini-filling?
bool usecp, // in: use checkpointing?
bool doTri, // in: triangle-shaped mini-fills?
RandomSource& rnd) // in: random gen, to choose among equal paths
{
prob_.initBt(row, col, fill, usecp, escore);
Edit e; e.reset();
unsorted_.clear();
solutions_.clear();
sorted1_.clear();
sorted2_.clear();
cur_ = 0;
nmm_ = 0; // number of mismatches attempted
nnmm_ = 0; // number of mismatches involving N attempted
nrdop_ = 0; // number of read gap opens attempted
nrfop_ = 0; // number of ref gap opens attempted
nrdex_ = 0; // number of read gap extensions attempted
nrfex_ = 0; // number of ref gap extensions attempted
nmmPrune_ = 0; // number of mismatches pruned
nnmmPrune_ = 0; // number of mismatches involving N pruned
nrdopPrune_ = 0; // number of read gap opens pruned
nrfopPrune_ = 0; // number of ref gap opens pruned
nrdexPrune_ = 0; // number of read gap extensions pruned
nrfexPrune_ = 0; // number of ref gap extensions pruned
row_ = row;
col_ = col;
doTri_ = doTri;
bs_.clear();
if(!prob_.fill_) {
size_t id = bs_.alloc();
bs_[id].init(
prob_,
0, // parent id
0, // penalty
0, // starting score
row, // row
col, // column
e,
0,
true, // this is the root
true); // this should be extended with exact matches
if(bs_[id].isSolution(prob_)) {
addSolution(id);
} else {
add(id);
}
} else {
int64_t row = row_, col = col_;
TAlScore targsc = prob_.targ_;
int hef = 0;
bool done = false, abort = false;
size_t depth = 0;
while(!done && !abort) {
// Accumulate edits as we go. We can do this by adding
// BtBranches to the bs_ structure. Each step of the backtrace
// either involves an edit (thereby starting a new branch) or
// extends the previous branch by one more position.
//
// Note: if the BtBranches are in line, then trySolution can be
// used to populate the SwResult and check for various
// situations where we might reject the alignment (i.e. due to
// a cell having been visited previously).
if(doTri_) {
triangleFill(
row, // row of cell to backtrace from
col, // column of cell to backtrace from
hef, // cell to bt from: H (0), E (1), or F (2)
targsc, // score of cell to backtrace from
prob_.targ_, // score of alignment we're looking for
rnd, // pseudo-random generator
row, // out: row we ended up in after bt
col, // out: column we ended up in after bt
hef, // out: H/E/F after backtrace
targsc, // out: score up to cell we ended up in
done, // out: finished tracing out an alignment?
abort); // out: aborted b/c cell was seen before?
} else {
squareFill(
row, // row of cell to backtrace from
col, // column of cell to backtrace from
hef, // cell to bt from: H (0), E (1), or F (2)
targsc, // score of cell to backtrace from
prob_.targ_, // score of alignment we're looking for
rnd, // pseudo-random generator
row, // out: row we ended up in after bt
col, // out: column we ended up in after bt
hef, // out: H/E/F after backtrace
targsc, // out: score up to cell we ended up in
done, // out: finished tracing out an alignment?
abort); // out: aborted b/c cell was seen before?
}
if(depth >= ndep_.size()) {
ndep_.resize(depth+1);
ndep_[depth] = 1;
} else {
ndep_[depth]++;
}
depth++;
assert((row >= 0 && col >= 0) || done);
}
}
ASSERT_ONLY(seen_.clear());
}
/**
* Get the next valid alignment given the backtrace problem. Return false
* if there is no valid solution, e.g., if
*/
bool nextAlignment(
size_t maxiter,
SwResult& res,
size_t& off,
size_t& nrej,
size_t& niter,
RandomSource& rnd);
/**
* Return true iff this tracer has been initialized
*/
bool inited() const {
return prob_.inited();
}
/**
* Return true iff the mini-fills are triangle-shaped.
*/
bool doTri() const { return doTri_; }
/**
* Fill in a triangle of the DP table and backtrace from the given cell to
* a cell in the previous checkpoint, or to the terminal cell.
*/
void triangleFill(
int64_t rw, // row of cell to backtrace from
int64_t cl, // column of cell to backtrace from
int hef, // cell to backtrace from is H (0), E (1), or F (2)
TAlScore targ, // score of cell to backtrace from
TAlScore targ_final, // score of alignment we're looking for
RandomSource& rnd, // pseudo-random generator
int64_t& row_new, // out: row we ended up in after backtrace
int64_t& col_new, // out: column we ended up in after backtrace
int& hef_new, // out: H/E/F after backtrace
TAlScore& targ_new, // out: score up to cell we ended up in
bool& done, // out: finished tracing out an alignment?
bool& abort); // out: aborted b/c cell was seen before?
/**
* Fill in a square of the DP table and backtrace from the given cell to
* a cell in the previous checkpoint, or to the terminal cell.
*/
void squareFill(
int64_t rw, // row of cell to backtrace from
int64_t cl, // column of cell to backtrace from
int hef, // cell to backtrace from is H (0), E (1), or F (2)
TAlScore targ, // score of cell to backtrace from
TAlScore targ_final, // score of alignment we're looking for
RandomSource& rnd, // pseudo-random generator
int64_t& row_new, // out: row we ended up in after backtrace
int64_t& col_new, // out: column we ended up in after backtrace
int& hef_new, // out: H/E/F after backtrace
TAlScore& targ_new, // out: score up to cell we ended up in
bool& done, // out: finished tracing out an alignment?
bool& abort); // out: aborted b/c cell was seen before?
protected:
/**
* Get the next valid alignment given a backtrace problem. Return false
* if there is no valid solution. Use a backtracking search to find the
* solution. This can be very slow.
*/
bool nextAlignmentBacktrace(
size_t maxiter,
SwResult& res,
size_t& off,
size_t& nrej,
size_t& niter,
RandomSource& rnd);
/**
* Get the next valid alignment given a backtrace problem. Return false
* if there is no valid solution. Use a triangle-fill backtrace to find
* the solution. This is usually fast (it's O(m + n)).
*/
bool nextAlignmentFill(
size_t maxiter,
SwResult& res,
size_t& off,
size_t& nrej,
size_t& niter,
RandomSource& rnd);
/**
* Try all the solutions accumulated so far. Solutions might be rejected
* if they, for instance, overlap a previous solution, have too many Ns,
* fail to overlap a core diagonal, etc.
*/
bool trySolutions(
bool lookForOlap,
SwResult& res,
size_t& off,
size_t& nrej,
RandomSource& rnd,
bool& success);
/**
* See if a given solution branch works as a solution (i.e. doesn't overlap
* another one, have too many Ns, fail to overlap a core diagonal, etc.)
*/
int trySolution(
size_t id,
bool lookForOlap,
SwResult& res,
size_t& off,
size_t& nrej,
RandomSource& rnd);
BtBranchProblem prob_; // problem configuration
EFactory<BtBranch> bs_; // global BtBranch factory
// already reported alignments going through these diagonal segments
ELList<std::pair<size_t, size_t> > seenPaths_;
ELSet<size_t> sawcell_; // cells already backtraced through
EList<std::pair<TAlScore, size_t> > unsorted_; // unsorted list of as-yet-unflushed BtBranches
EList<size_t> sorted1_; // list of BtBranch, sorted by score
EList<size_t> sorted2_; // list of BtBranch, sorted by score
EList<size_t> solutions_; // list of solution branches
bool sortedSel_; // true -> 1, false -> 2
size_t cur_; // cursor into sorted list to start from
size_t nmm_; // number of mismatches attempted
size_t nnmm_; // number of mismatches involving N attempted
size_t nrdop_; // number of read gap opens attempted
size_t nrfop_; // number of ref gap opens attempted
size_t nrdex_; // number of read gap extensions attempted
size_t nrfex_; // number of ref gap extensions attempted
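size_t nmmPrune_; // number of mismatches pruned
size_t nnmmPrune_; // number of mismatches involving N pruned
size_t nrdopPrune_; // number of read gap opens pruned
size_t nrfopPrune_; // number of ref gap opens pruned
size_t nrdexPrune_; // number of read gap extensions pruned
size_t nrfexPrune_; // number of ref gap extensions pruned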
size_t row_; // row
size_t col_; // column
bool doTri_; // true -> fill in triangles; false -> squares
EList<CpQuad> sq_; // square to fill when doing mini-fills
ELList<CpQuad> tri_; // triangle to fill when doing mini-fills
EList<size_t> ndep_; // # triangles mini-filled at various depths
#ifndef NDEBUG
ESet<size_t> seen_; // seen branch ids; should never see same twice
#endif
};
#endif /*ndef ALIGNER_BT_H_*/

181
aligner_cache.cpp Normal file

@ -0,0 +1,181 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include "aligner_cache.h"
#include "tinythread.h"
#ifdef ALIGNER_CACHE_MAIN
#include <iostream>
#include <getopt.h>
#include <string>
#include "random_source.h"
using namespace std;
enum {
ARG_TESTS = 256
};
static const char *short_opts = "vCt";
static struct option long_opts[] = {
{(char*)"verbose", no_argument, 0, 'v'},
{(char*)"tests", no_argument, 0, ARG_TESTS},
{(char*)0, 0, 0, 0} // getopt_long requires an all-zero terminating entry
};
static void printUsage(ostream& os) {
os << "Usage: sawhi-cache [options]*" << endl;
os << "Options:" << endl;
os << " --tests run unit tests" << endl;
os << " -v/--verbose talkative mode" << endl;
}
int gVerbose = 0;
static void add(
RedBlack<QKey, QVal>& t,
Pool& p,
const char *dna)
{
QKey qk;
qk.init(BTDnaString(dna, true));
t.add(p, qk, NULL);
}
/**
* Small tests for the AlignmentCache.
*/
static void aligner_cache_tests() {
RedBlack<QKey, QVal> rb(1024);
Pool p(64 * 1024, 1024);
// Small test
add(rb, p, "ACGTCGATCGT");
add(rb, p, "ACATCGATCGT");
add(rb, p, "ACGACGATCGT");
add(rb, p, "ACGTAGATCGT");
add(rb, p, "ACGTCAATCGT");
add(rb, p, "ACGTCGCTCGT");
add(rb, p, "ACGTCGAACGT");
assert_eq(7, rb.size());
rb.clear();
p.clear();
// Another small test
add(rb, p, "ACGTCGATCGT");
add(rb, p, "CCGTCGATCGT");
add(rb, p, "TCGTCGATCGT");
add(rb, p, "GCGTCGATCGT");
add(rb, p, "AAGTCGATCGT");
assert_eq(5, rb.size());
rb.clear();
p.clear();
// Regression test (attempt to make it smaller)
add(rb, p, "CCTA");
add(rb, p, "AGAA");
add(rb, p, "TCTA");
add(rb, p, "GATC");
add(rb, p, "CTGC");
add(rb, p, "TTGC");
add(rb, p, "GCCG");
add(rb, p, "GGAT");
rb.clear();
p.clear();
// Regression test
add(rb, p, "CCTA");
add(rb, p, "AGAA");
add(rb, p, "TCTA");
add(rb, p, "GATC");
add(rb, p, "CTGC");
add(rb, p, "CATC");
add(rb, p, "CAAA");
add(rb, p, "CTAT");
add(rb, p, "CTCA");
add(rb, p, "TTGC");
add(rb, p, "GCCG");
add(rb, p, "GGAT");
assert_eq(12, rb.size());
rb.clear();
p.clear();
// Larger random test
EList<BTDnaString> strs;
char buf[5];
for(int i = 0; i < 4; i++) {
for(int j = 0; j < 4; j++) {
for(int k = 0; k < 4; k++) {
for(int m = 0; m < 4; m++) {
buf[0] = "ACGT"[i];
buf[1] = "ACGT"[j];
buf[2] = "ACGT"[k];
buf[3] = "ACGT"[m];
buf[4] = '\0';
strs.push_back(BTDnaString(buf, true));
}
}
}
}
// Add all of the 4-mers in several different random orders
RandomSource rand;
for(uint32_t runs = 0; runs < 100; runs++) {
rb.clear();
p.clear();
assert_eq(0, rb.size());
rand.init(runs);
EList<bool> used;
used.resize(256);
for(int i = 0; i < 256; i++) used[i] = false;
for(int i = 0; i < 256; i++) {
int r = rand.nextU32() % (256-i);
int unused = 0;
bool added = false;
for(int j = 0; j < 256; j++) {
if(!used[j] && unused == r) {
used[j] = true;
QKey qk;
qk.init(strs[j]);
rb.add(p, qk, NULL);
added = true;
break;
}
if(!used[j]) unused++;
}
assert(added);
}
}
}
/**
* A way of feeding simple tests to the alignment cache infrastructure.
*/
int main(int argc, char **argv) {
int option_index = 0;
int next_option;
do {
next_option = getopt_long(argc, argv, short_opts, long_opts, &option_index);
switch (next_option) {
case 'v': gVerbose = true; break;
case ARG_TESTS: aligner_cache_tests(); return 0;
case -1: break;
default: {
cerr << "Unknown option: " << (char)next_option << endl;
printUsage(cerr);
exit(1);
}
}
} while(next_option != -1);
}
#endif
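
The larger random test above adds all 256 4-mers in 100 different seeded orders, choosing each next key as the r-th still-unused entry. The following is a minimal standalone sketch of that selection scheme and is not part of the commit: `allKmers` and `randomOrder` are hypothetical helpers, and `std::mt19937` stands in for `RandomSource`, so the orders it produces will differ from those in the test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Enumerate all 4^k DNA k-mers in lexicographic order (hypothetical helper).
std::vector<std::string> allKmers(int k) {
    std::vector<std::string> out(1, "");
    for (int pos = 0; pos < k; pos++) {
        std::vector<std::string> next;
        for (const std::string& prefix : out) {
            for (char c : std::string("ACGT")) {
                next.push_back(prefix + c);
            }
        }
        out.swap(next);
    }
    return out;
}

// Build a permutation of 0..n-1 by repeatedly picking the r-th unused index,
// mirroring the nested loop in the test. std::mt19937 is an assumption here;
// the original code draws from RandomSource.
std::vector<std::size_t> randomOrder(std::size_t n, uint32_t seed) {
    std::mt19937 rng(seed);
    std::vector<bool> used(n, false);
    std::vector<std::size_t> order;
    for (std::size_t i = 0; i < n; i++) {
        std::size_t r = rng() % (n - i); // rank among the unused entries
        std::size_t unused = 0;
        for (std::size_t j = 0; j < n; j++) {
            if (used[j]) continue;
            if (unused == r) {
                used[j] = true;
                order.push_back(j);
                break;
            }
            unused++;
        }
    }
    return order;
}
```

Each call yields a permutation of 0..n-1, so every key is inserted exactly once per run. The quadratic scan is harmless for n = 256; `std::shuffle` would be the more conventional choice when matching the original selection logic is not required.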

1013
aligner_cache.h Normal file

File diff suppressed because it is too large

80
aligner_driver.cpp Normal file

@ -0,0 +1,80 @@
/*
* Copyright 2012, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include "aligner_driver.h"
void AlignerDriverRootSelector::select(
const Read& q,
const Read* qo,
bool nofw,
bool norc,
EList<DescentConfig>& confs,
EList<DescentRoot>& roots)
{
// Calculate interval length for both mates
int interval = rootIval_.f<int>((double)q.length());
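if(qo != NULL) {
// Boost interval length by 20% for paired-end reads
interval = (int)(interval * 1.2 + 0.5);
}
// Guard against a non-positive interval, which would keep the loops below
// from ever advancing
if(interval < 1) {
interval = 1;
}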
float pri = 0.0f;
for(int fwi = 0; fwi < 2; fwi++) {
bool fw = (fwi == 0);
if((fw && nofw) || (!fw && norc)) {
continue;
}
// Put down left-to-right roots w/r/t forward and reverse-complement reads
{
bool first = true;
size_t i = 0;
while(first || (i + landing_ <= q.length())) {
confs.expand();
confs.back().cons.init(landing_, consExp_);
roots.expand();
roots.back().init(
i, // offset from 5' end
true, // left-to-right?
fw, // fw?
q.length(), // query length
pri); // root priority
i += interval;
first = false;
}
}
// Put down right-to-left roots w/r/t forward and reverse-complement reads
{
bool first = true;
size_t i = 0;
while(first || (i + landing_ <= q.length())) {
confs.expand();
confs.back().cons.init(landing_, consExp_);
roots.expand();
roots.back().init(
q.length() - i - 1, // offset from 5' end
false, // left-to-right?
fw, // fw?
q.length(), // query length
pri); // root priority
i += interval;
first = false;
}
}
}
}
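
To make the root layout in `AlignerDriverRootSelector::select` easier to follow, here is a minimal standalone sketch of the offsets it generates for one strand. It is an illustration under stated assumptions, not the project's code: `RootSketch`, `boostInterval` and `placeRoots` are hypothetical names, the read length and landing size are assumed to be positive, and a zero interval is clamped to 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for DescentRoot: offset from the 5' end plus direction.
struct RootSketch {
    std::size_t off5p;
    bool l2r;
};

// Paired-end reads get a 20% longer interval, rounded to the nearest integer.
int boostInterval(int interval) {
    return (int)(interval * 1.2 + 0.5);
}

// Place left-to-right roots at 0, interval, 2*interval, ... and right-to-left
// roots at the mirrored offsets. At least one root is placed per direction;
// further roots are placed while 'landing' positions still fit in the read.
std::vector<RootSketch> placeRoots(
    std::size_t len, std::size_t interval, std::size_t landing)
{
    assert(len > 0 && landing > 0); // assumptions of this sketch
    if (interval == 0) interval = 1;
    std::vector<RootSketch> roots;
    for (int dir = 0; dir < 2; dir++) {
        const bool l2r = (dir == 0);
        bool first = true;
        for (std::size_t i = 0; first || i + landing <= len; i += interval) {
            RootSketch r = { l2r ? i : len - i - 1, l2r };
            roots.push_back(r);
            first = false;
        }
    }
    return roots;
}
```

For a 50-character read with an interval of 10 and a landing size of 20, this gives left-to-right roots at offsets 0, 10, 20 and 30 and right-to-left roots at 49, 39, 29 and 19. The original function repeats the same layout for the forward and reverse-complement strands unless `nofw` or `norc` suppresses one of them.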

247
aligner_driver.h Normal file

@ -0,0 +1,247 @@
/*
* Copyright 2012, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
/*
* aligner_driver.h
*
* REDUNDANT SEED HITS
*
* We say that two seed hits are redundant if they trigger identical
* seed-extend dynamic programming problems. Put another way, they both lie on
* the same diagonal of the overall read/reference dynamic programming matrix.
* Detecting redundant seed hits is simple when the seed hits are ungapped. We
* do this after offset resolution but before the offset is converted to genome
* coordinates (see uses of the seenDiags1_/seenDiags2_ fields for examples).
*
* REDUNDANT ALIGNMENTS
*
* In an unpaired context, we say that two alignments are redundant if they
* share any cells in the global DP table. Roughly speaking, this is like
* saying that two alignments are redundant if any read character aligns to the
* same reference character (same reference sequence, same strand, same offset)
* in both alignments.
*
* In a paired-end context, we say that two paired-end alignments are redundant
* if the mate #1s are redundant and the mate #2s are redundant.
*
* How do we enforce this? In the unpaired context, this is relatively simple:
* the cells from each alignment are checked against a set containing all cells
* from all previous alignments. Given a new alignment, for each cell in the
* new alignment we check whether it is in the set. If there is any overlap,
* the new alignment is rejected as redundant. Otherwise, the new alignment is
* accepted and its cells are added to the set.
*
* Enforcement in a paired context is a little trickier. Consider the
* following approaches:
*
* 1. Skip anchors that are redundant with any previous anchor or opposite
* alignment. This is sufficient to ensure no two concordant alignments
* found are redundant.
*
* 2. Same as scheme 1, but with a "transitive closure" scheme for finding all
* concordant pairs in the vicinity of an anchor. Consider a scenario in
* which alignment A is concordant with two opposite alignments, B and C
* (pairs AB and AC). If B is the anchor alignment, we
* will find AB but not AC. But under this scheme, once we find AB we then
* let B be a new anchor and immediately look for its opposites. Likewise,
* if we find any opposite, we make them anchors and continue searching. We
* don't stop searching until every opposite is used as an anchor.
*
* 3. Skip anchors that are redundant with any previous anchor alignment (but
* allow anchors that are redundant with previous opposite alignments).
* This isn't sufficient to avoid redundant concordant alignments. To avoid
* redundant concordants, we need an additional procedure that checks each
* new concordant alignment one-by-one against a list of previous concordant
* alignments to see if it is redundant.
*
* We take approach 1.
*/
#ifndef ALIGNER_DRIVER_H_
#define ALIGNER_DRIVER_H_
#include "aligner_seed2.h"
#include "simple_func.h"
#include "aln_sink.h"
/**
* Concrete subclass of DescentRootSelector. Puts a root every 'ival' chars,
* where 'ival' is determined by user-specified parameters. A root is filtered
* out if the end of the read is less than 'landing' positions away, in the
* direction of the search.
*/
class AlignerDriverRootSelector : public DescentRootSelector {
public:
AlignerDriverRootSelector(
double consExp,
const SimpleFunc& rootIval,
size_t landing)
{
consExp_ = consExp;
rootIval_ = rootIval;
landing_ = landing;
}
virtual ~AlignerDriverRootSelector() { }
virtual void select(
const Read& q, // read that we're selecting roots for
const Read* qo, // opposite mate, if applicable
bool nofw, // don't add roots for fw read
bool norc, // don't add roots for rc read
EList<DescentConfig>& confs, // put DescentConfigs here
EList<DescentRoot>& roots); // put DescentRoot here
protected:
double consExp_;
SimpleFunc rootIval_;
size_t landing_;
};
/**
* Return values from extendSeeds and extendSeedsPaired.
*/
enum {
// Candidates were examined exhaustively
ALDRIVER_EXHAUSTED_CANDIDATES = 1,
// The policy does not need us to look any further
ALDRIVER_POLICY_FULFILLED,
// We stopped because we ran up against a limit on how much work we should
// do for one set of seed ranges, e.g. the limit on number of consecutive
// unproductive DP extensions
ALDRIVER_EXCEEDED_LIMIT
};
/**
* This class is the glue between a DescentDriver and the dynamic programming
* implementations in Bowtie 2. The DescentDriver is used to find some very
* high-scoring alignments, but is additionally used to rank partial alignments
* so that they can be extended using dynamic programming.
*/
template <typename index_t>
class AlignerDriver {
public:
AlignerDriver(
double consExp,
const SimpleFunc& rootIval,
size_t landing,
bool veryVerbose,
const SimpleFunc& totsz,
const SimpleFunc& totfmops) :
sel_(consExp, rootIval, landing),
alsel_(),
dr1_(veryVerbose),
dr2_(veryVerbose)
{
totsz_ = totsz;
totfmops_ = totfmops;
}
/**
* Initialize driver with respect to a new read or pair.
*/
void initRead(
const Read& q1,
bool nofw,
bool norc,
TAlScore minsc,
TAlScore maxpen,
const Read* q2)
{
dr1_.initRead(q1, nofw, norc, minsc, maxpen, q2, &sel_);
red1_.init(q1.length());
paired_ = false;
if(q2 != NULL) {
dr2_.initRead(*q2, nofw, norc, minsc, maxpen, &q1, &sel_);
red2_.init(q2->length());
paired_ = true;
} else {
dr2_.reset();
}
size_t totsz = totsz_.f<size_t>(q1.length());
size_t totfmops = totfmops_.f<size_t>(q1.length());
stop_.init(
totsz,
0,
true,
totfmops);
}
/**
* Start the driver. The driver will begin by conducting a best-first,
* index-assisted search through the space of possible full and partial
* alignments. This search may be followed up with a dynamic programming
* extension step, taking a prioritized set of partial SA ranges found
* during the search and extending each with DP. The process might also be
* iterated, with the search being occasionally halted so that DPs can be
* tried, then restarted, etc.
*/
int go(
const Scoring& sc,
const GFM<index_t>& gfmFw,
const GFM<index_t>& gfmBw,
const BitPairReference& ref,
DescentMetrics& met,
WalkMetrics& wlm,
PerReadMetrics& prm,
RandomSource& rnd,
AlnSinkWrap<index_t>& sink);
/**
* Reset state of all DescentDrivers.
*/
void reset() {
dr1_.reset();
dr2_.reset();
red1_.reset();
red2_.reset();
}
protected:
AlignerDriverRootSelector sel_; // selects where roots should go
DescentAlignmentSelector<index_t> alsel_; // one selector can deal with >1 drivers
DescentDriver<index_t> dr1_; // driver for mate 1/unpaired reads
DescentDriver<index_t> dr2_; // driver for mate 2 of paired-end reads
DescentStoppingConditions stop_; // when to pause index-assisted best-first search
bool paired_; // current read is paired?
SimpleFunc totsz_; // memory limit on best-first search data
SimpleFunc totfmops_; // max # FM ops for best-first search
// For detecting redundant alignments
RedundantAlns red1_; // database of cells used for mate 1 alignments
RedundantAlns red2_; // database of cells used for mate 2 alignments
// For AlnRes::matchesRef
ASSERT_ONLY(SStringExpandable<char> raw_refbuf_);
ASSERT_ONLY(SStringExpandable<uint32_t> raw_destU32_);
ASSERT_ONLY(EList<bool> raw_matches_);
ASSERT_ONLY(BTDnaString tmp_rf_);
ASSERT_ONLY(BTDnaString tmp_rdseq_);
ASSERT_ONLY(BTString tmp_qseq_);
ASSERT_ONLY(EList<index_t> tmp_reflens_);
ASSERT_ONLY(EList<index_t> tmp_refoffs_);
};
#endif /* defined(ALIGNER_DRIVER_H_) */
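
The unpaired redundancy check described in the header comment (reject an alignment if any of its DP cells was already used, otherwise accept it and record its cells) can be sketched as follows. This is a hedged illustration only: `Cell` and `CellSetSketch` are hypothetical names, and the real `RedundantAlns` class may store and index cells quite differently.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <tuple>
#include <vector>

// Hypothetical cell key: (reference id, forward strand?, reference offset,
// read offset). The actual key used by RedundantAlns is not shown in this file.
typedef std::tuple<uint32_t, bool, uint64_t, uint32_t> Cell;

class CellSetSketch {
public:
    // Return false if any cell overlaps a previously accepted alignment.
    // Otherwise record all of the cells and return true.
    bool acceptIfNovel(const std::vector<Cell>& cells) {
        for (const Cell& c : cells) {
            if (seen_.count(c) > 0) {
                return false; // redundant: shares a cell with an earlier one
            }
        }
        seen_.insert(cells.begin(), cells.end());
        return true;
    }

    // Forget all cells, e.g. when moving on to a new read.
    void reset() { seen_.clear(); }

private:
    std::set<Cell> seen_;
};
```

Cells of a rejected alignment are deliberately not added to the set, so a later alignment that overlaps only the rejected one is still accepted.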

352
aligner_metrics.h Normal file

@ -0,0 +1,352 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNER_METRICS_H_
#define ALIGNER_METRICS_H_
#include <math.h>
#include <stdint.h>
#include <iostream>
#include "alphabet.h"
#include "timer.h"
#include "sstring.h"
using namespace std;
/**
* Borrowed from http://www.johndcook.com/standard_deviation.html,
* which in turn is borrowed from Knuth.
*/
class RunningStat {
public:
RunningStat() : m_n(0), m_tot(0.0) { }
void clear() {
m_n = 0;
m_tot = 0.0;
}
void push(float x) {
m_n++;
m_tot += x;
// See Knuth TAOCP vol 2, 3rd edition, page 232
if (m_n == 1) {
m_oldM = m_newM = x;
m_oldS = 0.0;
} else {
m_newM = m_oldM + (x - m_oldM)/m_n;
m_newS = m_oldS + (x - m_oldM)*(x - m_newM);
// set up for next iteration
m_oldM = m_newM;
m_oldS = m_newS;
}
}
int num() const {
return m_n;
}
double tot() const {
return m_tot;
}
double mean() const {
return (m_n > 0) ? m_newM : 0.0;
}
double variance() const {
return ( (m_n > 1) ? m_newS/(m_n - 1) : 0.0 );
}
double stddev() const {
return sqrt(variance());
}
private:
int m_n;
double m_tot;
double m_oldM, m_newM, m_oldS, m_newS;
};
/**
* Encapsulates a set of metrics that we would like an aligner to keep
* track of, so that we can possibly use it to diagnose performance
* issues.
*/
class AlignerMetrics {
public:
AlignerMetrics() :
curBacktracks_(0),
curBwtOps_(0),
first_(true),
curIsLowEntropy_(false),
curIsHomoPoly_(false),
curHadRanges_(false),
curNumNs_(0),
reads_(0),
homoReads_(0),
lowEntReads_(0),
hiEntReads_(0),
alignedReads_(0),
unalignedReads_(0),
threeOrMoreNReads_(0),
lessThanThreeNRreads_(0),
bwtOpsPerRead_(),
backtracksPerRead_(),
bwtOpsPerHomoRead_(),
backtracksPerHomoRead_(),
bwtOpsPerLoEntRead_(),
backtracksPerLoEntRead_(),
bwtOpsPerHiEntRead_(),
backtracksPerHiEntRead_(),
bwtOpsPerAlignedRead_(),
backtracksPerAlignedRead_(),
bwtOpsPerUnalignedRead_(),
backtracksPerUnalignedRead_(),
bwtOpsPer0nRead_(),
backtracksPer0nRead_(),
bwtOpsPer1nRead_(),
backtracksPer1nRead_(),
bwtOpsPer2nRead_(),
backtracksPer2nRead_(),
bwtOpsPer3orMoreNRead_(),
backtracksPer3orMoreNRead_(),
timer_(cout, "", false)
{ }
void printSummary() {
if(!first_) {
finishRead();
}
cout << "AlignerMetrics:" << endl;
cout << " # Reads: " << reads_ << endl;
float hopct = (reads_ > 0) ? (((float)homoReads_)/((float)reads_)) : (0.0f);
hopct *= 100.0f;
cout << " % homo-polymeric: " << (hopct) << endl;
float lopct = (reads_ > 0) ? ((float)lowEntReads_/(float)(reads_)) : (0.0f);
lopct *= 100.0f;
cout << " % low-entropy: " << (lopct) << endl;
float unpct = (reads_ > 0) ? ((float)unalignedReads_/(float)(reads_)) : (0.0f);
unpct *= 100.0f;
cout << " % unaligned: " << (unpct) << endl;
float npct = (reads_ > 0) ? ((float)threeOrMoreNReads_/(float)(reads_)) : (0.0f);
npct *= 100.0f;
cout << " % with 3 or more Ns: " << (npct) << endl;
cout << endl;
cout << " Total BWT ops: avg: " << bwtOpsPerRead_.mean() << ", stddev: " << bwtOpsPerRead_.stddev() << endl;
cout << " Total Backtracks: avg: " << backtracksPerRead_.mean() << ", stddev: " << backtracksPerRead_.stddev() << endl;
time_t elapsed = timer_.elapsed();
cout << " BWT ops per second: " << (bwtOpsPerRead_.tot()/elapsed) << endl;
cout << " Backtracks per second: " << (backtracksPerRead_.tot()/elapsed) << endl;
cout << endl;
cout << " Homo-poly:" << endl;
cout << " BWT ops: avg: " << bwtOpsPerHomoRead_.mean() << ", stddev: " << bwtOpsPerHomoRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPerHomoRead_.mean() << ", stddev: " << backtracksPerHomoRead_.stddev() << endl;
cout << " Low-entropy:" << endl;
cout << " BWT ops: avg: " << bwtOpsPerLoEntRead_.mean() << ", stddev: " << bwtOpsPerLoEntRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPerLoEntRead_.mean() << ", stddev: " << backtracksPerLoEntRead_.stddev() << endl;
cout << " High-entropy:" << endl;
cout << " BWT ops: avg: " << bwtOpsPerHiEntRead_.mean() << ", stddev: " << bwtOpsPerHiEntRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPerHiEntRead_.mean() << ", stddev: " << backtracksPerHiEntRead_.stddev() << endl;
cout << endl;
cout << " Unaligned:" << endl;
cout << " BWT ops: avg: " << bwtOpsPerUnalignedRead_.mean() << ", stddev: " << bwtOpsPerUnalignedRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPerUnalignedRead_.mean() << ", stddev: " << backtracksPerUnalignedRead_.stddev() << endl;
cout << " Aligned:" << endl;
cout << " BWT ops: avg: " << bwtOpsPerAlignedRead_.mean() << ", stddev: " << bwtOpsPerAlignedRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPerAlignedRead_.mean() << ", stddev: " << backtracksPerAlignedRead_.stddev() << endl;
cout << endl;
cout << " 0 Ns:" << endl;
cout << " BWT ops: avg: " << bwtOpsPer0nRead_.mean() << ", stddev: " << bwtOpsPer0nRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPer0nRead_.mean() << ", stddev: " << backtracksPer0nRead_.stddev() << endl;
cout << " 1 N:" << endl;
cout << " BWT ops: avg: " << bwtOpsPer1nRead_.mean() << ", stddev: " << bwtOpsPer1nRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPer1nRead_.mean() << ", stddev: " << backtracksPer1nRead_.stddev() << endl;
cout << " 2 Ns:" << endl;
cout << " BWT ops: avg: " << bwtOpsPer2nRead_.mean() << ", stddev: " << bwtOpsPer2nRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPer2nRead_.mean() << ", stddev: " << backtracksPer2nRead_.stddev() << endl;
cout << " >2 Ns:" << endl;
cout << " BWT ops: avg: " << bwtOpsPer3orMoreNRead_.mean() << ", stddev: " << bwtOpsPer3orMoreNRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPer3orMoreNRead_.mean() << ", stddev: " << backtracksPer3orMoreNRead_.stddev() << endl;
cout << endl;
}
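/**
* Start collecting metrics for a new read, first committing the
* statistics of the previous read (if any).
*/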
void nextRead(const BTDnaString& read) {
if(!first_) {
finishRead();
}
first_ = false;
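// The entropy calculation is disabled, so 'ent' is always 0 and every read
// is classified as homopolymeric below
//float ent = entropyDna5(read);
float ent = 0.0f;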
curIsLowEntropy_ = (ent < 0.75f);
curIsHomoPoly_ = (ent < 0.001f);
curHadRanges_ = false;
curBwtOps_ = 0;
curBacktracks_ = 0;
// Count Ns
curNumNs_ = 0;
const size_t len = read.length();
for(size_t i = 0; i < len; i++) {
if((int)read[i] == 4) curNumNs_++;
}
}
/**
* Note that at least one range was reported for the current read.
*/
void setReadHasRange() {
curHadRanges_ = true;
}
/**
* Commit the running statistics for this read to the overall and
* per-category distributions.
*/
void finishRead() {
reads_++;
if(curIsHomoPoly_) homoReads_++;
else if(curIsLowEntropy_) lowEntReads_++;
else hiEntReads_++;
if(curHadRanges_) alignedReads_++;
else unalignedReads_++;
bwtOpsPerRead_.push((float)curBwtOps_);
backtracksPerRead_.push((float)curBacktracks_);
// Drill down by entropy
if(curIsHomoPoly_) {
bwtOpsPerHomoRead_.push((float)curBwtOps_);
backtracksPerHomoRead_.push((float)curBacktracks_);
} else if(curIsLowEntropy_) {
bwtOpsPerLoEntRead_.push((float)curBwtOps_);
backtracksPerLoEntRead_.push((float)curBacktracks_);
} else {
bwtOpsPerHiEntRead_.push((float)curBwtOps_);
backtracksPerHiEntRead_.push((float)curBacktracks_);
}
// Drill down by whether it aligned
if(curHadRanges_) {
bwtOpsPerAlignedRead_.push((float)curBwtOps_);
backtracksPerAlignedRead_.push((float)curBacktracks_);
} else {
bwtOpsPerUnalignedRead_.push((float)curBwtOps_);
backtracksPerUnalignedRead_.push((float)curBacktracks_);
}
if(curNumNs_ == 0) {
lessThanThreeNRreads_++;
bwtOpsPer0nRead_.push((float)curBwtOps_);
backtracksPer0nRead_.push((float)curBacktracks_);
} else if(curNumNs_ == 1) {
lessThanThreeNRreads_++;
bwtOpsPer1nRead_.push((float)curBwtOps_);
backtracksPer1nRead_.push((float)curBacktracks_);
} else if(curNumNs_ == 2) {
lessThanThreeNRreads_++;
bwtOpsPer2nRead_.push((float)curBwtOps_);
backtracksPer2nRead_.push((float)curBacktracks_);
} else {
threeOrMoreNReads_++;
bwtOpsPer3orMoreNRead_.push((float)curBwtOps_);
backtracksPer3orMoreNRead_.push((float)curBacktracks_);
}
}
// Running-total of the number of backtracks and BWT ops for the
// current read
uint32_t curBacktracks_;
uint32_t curBwtOps_;
protected:
bool first_;
// true iff the current read is low entropy
bool curIsLowEntropy_;
// true if current read is all 1 char (or very close)
bool curIsHomoPoly_;
// true iff the current read has had one or more ranges reported
bool curHadRanges_;
// number of Ns in current read
int curNumNs_;
// # reads
uint32_t reads_;
// # homo-poly reads
uint32_t homoReads_;
// # low-entropy reads
uint32_t lowEntReads_;
// # high-entropy reads
uint32_t hiEntReads_;
// # reads with alignments
uint32_t alignedReads_;
// # reads without alignments
uint32_t unalignedReads_;
// # reads with 3 or more Ns
uint32_t threeOrMoreNReads_;
// # reads with < 3 Ns
uint32_t lessThanThreeNRreads_;
// Distribution of BWT operations per read
RunningStat bwtOpsPerRead_;
RunningStat backtracksPerRead_;
// Distribution of BWT operations per homo-poly read
RunningStat bwtOpsPerHomoRead_;
RunningStat backtracksPerHomoRead_;
// Distribution of BWT operations per low-entropy read
RunningStat bwtOpsPerLoEntRead_;
RunningStat backtracksPerLoEntRead_;
// Distribution of BWT operations per high-entropy read
RunningStat bwtOpsPerHiEntRead_;
RunningStat backtracksPerHiEntRead_;
// Distribution of BWT operations per read that "aligned" (for
// which a range was arrived at - range may not necessarily have
// led to an alignment)
RunningStat bwtOpsPerAlignedRead_;
RunningStat backtracksPerAlignedRead_;
// Distribution of BWT operations per read that didn't align
RunningStat bwtOpsPerUnalignedRead_;
RunningStat backtracksPerUnalignedRead_;
// Distribution of BWT operations/backtracks per read with no Ns
RunningStat bwtOpsPer0nRead_;
RunningStat backtracksPer0nRead_;
// Distribution of BWT operations/backtracks per read with one N
RunningStat bwtOpsPer1nRead_;
RunningStat backtracksPer1nRead_;
// Distribution of BWT operations/backtracks per read with two Ns
RunningStat bwtOpsPer2nRead_;
RunningStat backtracksPer2nRead_;
// Distribution of BWT operations/backtracks per read with three or
// more Ns
RunningStat bwtOpsPer3orMoreNRead_;
RunningStat backtracksPer3orMoreNRead_;
Timer timer_;
};
#endif /* ALIGNER_METRICS_H_ */
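
`RunningStat` above keeps a running mean and sum of squared deviations using the one-pass recurrence cited from Knuth. The following standalone sketch restates that recurrence in compact form and compares it with a straightforward two-pass computation. `WelfordSketch` and `twoPassVariance` are hypothetical names used only for illustration, and the sketch accumulates `double` inputs where `RunningStat::push` takes a `float`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One-pass mean/variance accumulator using the same recurrence as RunningStat:
//   M_k = M_{k-1} + (x - M_{k-1}) / k
//   S_k = S_{k-1} + (x - M_{k-1}) * (x - M_k)
struct WelfordSketch {
    std::size_t n = 0;
    double mean = 0.0;
    double s = 0.0;

    void push(double x) {
        n++;
        double delta = x - mean;  // x - M_{k-1}
        mean += delta / n;        // now M_k
        s += delta * (x - mean);  // (x - M_{k-1}) * (x - M_k)
    }
    // Sample variance (divides by n - 1), 0 when fewer than two values.
    double variance() const { return n > 1 ? s / (n - 1) : 0.0; }
    double stddev() const { return std::sqrt(variance()); }
};

// Reference two-pass sample variance for comparison.
double twoPassVariance(const std::vector<double>& xs) {
    if (xs.size() < 2) return 0.0;
    double mean = 0.0;
    for (double x : xs) mean += x;
    mean /= xs.size();
    double ss = 0.0;
    for (double x : xs) ss += (x - mean) * (x - mean);
    return ss / (xs.size() - 1);
}
```

The one-pass form avoids storing the per-read values, which is why it suits counters that are updated once per read.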

35
aligner_report.h Normal file

@ -0,0 +1,35 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNER_REPORT_H_
#define ALIGNER_REPORT_H_
#include "aligner_cache.h"
class Reporter {
public:
/**
* Placeholder reporting hook; it ignores its arguments and always
* returns true.
*/
bool report(const AlignmentCacheIface<uint32_t>& cache, const QVal<uint32_t>& qv) {
return true; // don't retry
}
};
#endif /*ALIGNER_REPORT_H_*/

2162
aligner_result.cpp Normal file

File diff suppressed because it is too large

2325
aligner_result.h Normal file

File diff suppressed because it is too large

530
aligner_seed.cpp Normal file

@ -0,0 +1,530 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include "aligner_cache.h"
#include "aligner_seed.h"
#include "search_globals.h"
#include "gfm.h"
using namespace std;
/**
* Construct a constraint with no edits of any kind allowed.
*/
Constraint Constraint::exact() {
Constraint c;
c.edits = c.mms = c.ins = c.dels = c.penalty = 0;
return c;
}
/**
* Construct a constraint where the only constraint is a total
* penalty constraint.
*/
Constraint Constraint::penaltyBased(int pen) {
Constraint c;
c.penalty = pen;
return c;
}
/**
* Construct a constraint where the only constraint is a total
* penalty constraint related to the length of the read.
*/
Constraint Constraint::penaltyFuncBased(const SimpleFunc& f) {
Constraint c;
c.penFunc = f;
return c;
}
/**
* Construct a constraint that allows up to 'mms' mismatches and no other
* kinds of edits.
*/
Constraint Constraint::mmBased(int mms) {
Constraint c;
c.mms = mms;
c.edits = c.dels = c.ins = 0;
return c;
}
/**
* Construct a constraint where the only constraint is a limit on the
* total number of edits.
*/
Constraint Constraint::editBased(int edits) {
Constraint c;
c.edits = edits;
c.dels = c.ins = c.mms = 0;
return c;
}
//
// Some static methods for constructing some standard SeedPolicies
//
/**
* Given a read, depth and orientation, extract a seed data structure
* from the read and fill in the steps & zones arrays. The Seed
* contains the sequence and quality values.
*/
bool
Seed::instantiate(
const Read& read,
const BTDnaString& seq, // seed read sequence
const BTString& qual, // seed quality sequence
const Scoring& pens,
int depth,
int seedoffidx,
int seedtypeidx,
bool fw,
InstantiatedSeed& is) const
{
assert(overall != NULL);
int seedlen = len;
if((int)read.length() < seedlen) {
// Shrink seed length to fit read if necessary
seedlen = (int)read.length();
}
assert_gt(seedlen, 0);
is.steps.resize(seedlen);
is.zones.resize(seedlen);
// Fill in 'steps' and 'zones'
//
// The 'steps' list indicates which read character should be
// incorporated at each step of the search process. Often we will
// simply proceed from one end to the other, in which case the
// 'steps' list is ascending or descending. In some cases (e.g.
// the 2mm case), we might want to switch directions at least once
// during the search, in which case 'steps' will jump in the
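// middle. When an element of the 'steps' list is negative, this
// indicates that the next character is incorporated while extending
// right-to-left; a positive element indicates left-to-right extension.
// In both cases the element's absolute value minus one is the offset of
// the read character within the seed.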
//
// The 'zones' list indicates which zone constraint is active at
// each step. Each element of the 'zones' list is a pair; the
// first pair element indicates the applicable zone when
// considering either mismatch or delete (ref gap) events, while
// the second pair element indicates the applicable zone when
// considering insertion (read gap) events. When either pair
// element is a negative number, that indicates that we are about
// to leave the zone for good, at which point we may need to
// evaluate whether we have reached the zone's budget.
//
switch(type) {
case SEED_TYPE_EXACT: {
for(int k = 0; k < seedlen; k++) {
is.steps[k] = -(seedlen - k);
// Zone 0 all the way
is.zones[k].first = is.zones[k].second = 0;
}
break;
}
case SEED_TYPE_LEFT_TO_RIGHT: {
for(int k = 0; k < seedlen; k++) {
is.steps[k] = k+1;
// Zone 0 from 0 up to ceil(len/2), then 1
is.zones[k].first = is.zones[k].second = ((k < (seedlen+1)/2) ? 0 : 1);
}
// Zone 1 ends at the RHS
is.zones[seedlen-1].first = is.zones[seedlen-1].second = -1;
break;
}
case SEED_TYPE_RIGHT_TO_LEFT: {
for(int k = 0; k < seedlen; k++) {
is.steps[k] = -(seedlen - k);
// Zone 0 from 0 up to floor(len/2), then 1
is.zones[k].first = ((k < seedlen/2) ? 0 : 1);
// Inserts: Zone 0 from 0 up to ceil(len/2)+1, then 1
is.zones[k].second = ((k < (seedlen+1)/2+1) ? 0 : 1);
}
is.zones[seedlen-1].first = is.zones[seedlen-1].second = -1;
break;
}
case SEED_TYPE_INSIDE_OUT: {
// Zone 0 from ceil(N/4) up to N-floor(N/4)
int step = 0;
for(int k = (seedlen+3)/4; k < seedlen - (seedlen/4); k++) {
is.zones[step].first = is.zones[step].second = 0;
is.steps[step++] = k+1;
}
// Zone 1 from N-floor(N/4) up
for(int k = seedlen - (seedlen/4); k < seedlen; k++) {
is.zones[step].first = is.zones[step].second = 1;
is.steps[step++] = k+1;
}
// No Zone 1 if seedlen is short (like 2)
//assert_eq(1, is.zones[step-1].first);
is.zones[step-1].first = is.zones[step-1].second = -1;
// Zone 2 from ((seedlen+3)/4)-1 down to 0
for(int k = ((seedlen+3)/4)-1; k >= 0; k--) {
is.zones[step].first = is.zones[step].second = 2;
is.steps[step++] = -(k+1);
}
assert_eq(2, is.zones[step-1].first);
is.zones[step-1].first = is.zones[step-1].second = -2;
assert_eq(seedlen, step);
break;
}
default:
throw 1;
}
// Instantiate constraints
for(int i = 0; i < 3; i++) {
is.cons[i] = zones[i];
is.cons[i].instantiate(read.length());
}
is.overall = *overall;
is.overall.instantiate(read.length());
// Take a sweep through the seed sequence. Consider where the Ns
// occur and how zones are laid out. Calculate the maximum number
// of positions we can jump over initially (e.g. with the ftab) and
// perhaps set this function's return value to false, indicating
// that the arrangements of Ns prevents the seed from aligning.
bool streak = true;
is.maxjump = 0;
bool ret = true;
bool ltr = (is.steps[0] > 0); // true -> left-to-right
for(size_t i = 0; i < is.steps.size(); i++) {
assert_neq(0, is.steps[i]);
int off = is.steps[i];
off = abs(off)-1;
Constraint& cons = is.cons[abs(is.zones[i].first)];
int c = seq[off]; assert_range(0, 4, c);
int q = qual[off];
if(ltr != (is.steps[i] > 0) || // changed direction
is.zones[i].first != 0 || // changed zone
is.zones[i].second != 0) // changed zone
{
streak = false;
}
if(c == 4) {
// Induced mismatch
if(cons.canN(q, pens)) {
cons.chargeN(q, pens);
} else {
// Seed disqualified due to arrangement of Ns
return false;
}
}
if(streak) is.maxjump++;
}
is.seedoff = depth;
is.seedoffidx = seedoffidx;
is.fw = fw;
is.s = *this;
return ret;
}
/**
* Return a set consisting of 1 seed encapsulating an exact matching
* strategy.
*/
void
Seed::zeroMmSeeds(int ln, EList<Seed>& pols, Constraint& oall) {
oall.init();
// Seed policy 1: exact-match search
pols.expand();
pols.back().len = ln;
pols.back().type = SEED_TYPE_EXACT;
pols.back().zones[0] = Constraint::exact();
pols.back().zones[1] = Constraint::exact();
pols.back().zones[2] = Constraint::exact(); // not used
pols.back().overall = &oall;
}
/**
* Return a set of 2 seeds encapsulating a half-and-half 1mm strategy.
*/
void
Seed::oneMmSeeds(int ln, EList<Seed>& pols, Constraint& oall) {
oall.init();
// Seed policy 1: left-to-right search
pols.expand();
pols.back().len = ln;
pols.back().type = SEED_TYPE_LEFT_TO_RIGHT;
pols.back().zones[0] = Constraint::exact();
pols.back().zones[1] = Constraint::mmBased(1);
pols.back().zones[2] = Constraint::exact(); // not used
pols.back().overall = &oall;
// Seed policy 2: right-to-left search
pols.expand();
pols.back().len = ln;
pols.back().type = SEED_TYPE_RIGHT_TO_LEFT;
pols.back().zones[0] = Constraint::exact();
pols.back().zones[1] = Constraint::mmBased(1);
pols.back().zones[1].mmsCeil = 0; // Must have used at least 1 mismatch
pols.back().zones[2] = Constraint::exact(); // not used
pols.back().overall = &oall;
}
/**
* Return a set of 3 seeds encapsulating search roots for:
*
* 1. Starting from the left-hand side and searching toward the
* right-hand side allowing 2 mismatches in the right half.
* 2. Starting from the right-hand side and searching toward the
* left-hand side allowing 2 mismatches in the left half.
* 3. Starting (effectively) from the center and searching out toward
* both the left and right-hand sides, allowing one mismatch on
* either side.
*
* This is not exhaustive. There are 2-mismatch cases missed; if you
* imagine the seed as divided into four successive quarters A, B, C
* and D, the cases we miss are when mismatches occur in A and C or B
* and D.
*/
void
Seed::twoMmSeeds(int ln, EList<Seed>& pols, Constraint& oall) {
oall.init();
// Seed policy 1: left-to-right search
pols.expand();
pols.back().len = ln;
pols.back().type = SEED_TYPE_LEFT_TO_RIGHT;
pols.back().zones[0] = Constraint::exact();
pols.back().zones[1] = Constraint::mmBased(2);
pols.back().zones[2] = Constraint::exact(); // not used
pols.back().overall = &oall;
// Seed policy 2: right-to-left search
pols.expand();
pols.back().len = ln;
pols.back().type = SEED_TYPE_RIGHT_TO_LEFT;
pols.back().zones[0] = Constraint::exact();
pols.back().zones[1] = Constraint::mmBased(2);
pols.back().zones[1].mmsCeil = 1; // Must have used at least 1 mismatch
pols.back().zones[2] = Constraint::exact(); // not used
pols.back().overall = &oall;
// Seed policy 3: inside-out search
pols.expand();
pols.back().len = ln;
pols.back().type = SEED_TYPE_INSIDE_OUT;
pols.back().zones[0] = Constraint::exact();
pols.back().zones[1] = Constraint::mmBased(1);
pols.back().zones[1].mmsCeil = 0; // Must have used at least 1 mismatch
pols.back().zones[2] = Constraint::mmBased(1);
pols.back().zones[2].mmsCeil = 0; // Must have used at least 1 mismatch
pols.back().overall = &oall;
}
/**
* Types of actions that can be taken by the SeedAligner.
*/
enum {
SA_ACTION_TYPE_RESET = 1,
SA_ACTION_TYPE_SEARCH_SEED, // 2
SA_ACTION_TYPE_FTAB, // 3
SA_ACTION_TYPE_FCHR, // 4
SA_ACTION_TYPE_MATCH, // 5
SA_ACTION_TYPE_EDIT // 6
};
#define MIN(x, y) (((x) < (y)) ? (x) : (y))
#ifdef ALIGNER_SEED_MAIN
#include <getopt.h>
#include <string>
/**
* Parse a base-10 integer out of 'arg'. If 'arg' is empty or contains
* anything besides the number, print the given error message and
* throw.
*/
static int parseInt(const char *errmsg, const char *arg) {
long l;
char *endPtr = NULL;
l = strtol(arg, &endPtr, 10);
// strtol always sets endPtr, so compare it against the input rather than
// NULL: it must have advanced and must point at the terminating NUL.
if (endPtr != arg && *endPtr == '\0') {
return (int32_t)l;
}
cerr << errmsg << endl;
throw 1;
return -1;
}
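/*
 * Sketch only, not part of the original source: strtol() always stores a
 * pointer through its second argument, so a careful caller compares that
 * pointer with the input and also checks errno for overflow. The helper
 * name parseIntStrict and the exception types are assumptions made for
 * illustration.
 *
```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse a base-10 int and reject empty input,
// trailing characters and values that do not fit in an int.
static int parseIntStrict(const char *arg) {
    if(arg == NULL) throw std::invalid_argument("null argument");
    errno = 0;
    char *endPtr = NULL;
    long l = std::strtol(arg, &endPtr, 10);
    // endPtr == arg means no digits were consumed; a non-NUL character at
    // endPtr means something followed the number.
    if(endPtr == arg || *endPtr != '\0') {
        throw std::invalid_argument(std::string("not an integer: ") + arg);
    }
    if(errno == ERANGE || l < INT_MIN || l > INT_MAX) {
        throw std::out_of_range(std::string("out of range: ") + arg);
    }
    return static_cast<int>(l);
}
```
 *
 * With this helper, parseIntStrict("777") yields 777, while "12abc" and ""
 * are rejected instead of being read as 12 and 0.
 */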
enum {
ARG_NOFW = 256,
ARG_NORC,
ARG_MM,
ARG_SHMEM,
ARG_TESTS,
ARG_RANDOM_TESTS,
ARG_SEED
};
static const char *short_opts = "vCt";
static struct option long_opts[] = {
{(char*)"verbose", no_argument, 0, 'v'},
{(char*)"color", no_argument, 0, 'C'},
{(char*)"timing", no_argument, 0, 't'},
{(char*)"nofw", no_argument, 0, ARG_NOFW},
{(char*)"norc", no_argument, 0, ARG_NORC},
{(char*)"mm", no_argument, 0, ARG_MM},
{(char*)"shmem", no_argument, 0, ARG_SHMEM},
{(char*)"tests", no_argument, 0, ARG_TESTS},
{(char*)"random", required_argument, 0, ARG_RANDOM_TESTS},
{(char*)"seed", required_argument, 0, ARG_SEED},
{(char*)0, 0, 0, 0} // getopt_long requires an all-zero terminating entry
};
static void printUsage(ostream& os) {
os << "Usage: ac [options]* <index> <patterns>" << endl;
os << "Options:" << endl;
os << " --mm memory-mapped mode" << endl;
os << " --shmem shared memory mode" << endl;
os << " --nofw don't align forward-oriented read" << endl;
os << " --norc don't align reverse-complemented read" << endl;
os << " -t/--timing show timing information" << endl;
os << " -C/--color colorspace mode" << endl;
os << " -v/--verbose talkative mode" << endl;
}
bool gNorc = false;
bool gNofw = false;
bool gColor = false;
int gVerbose = 0;
int gGapBarrier = 1;
bool gColorExEnds = true;
int gSnpPhred = 30;
bool gReportOverhangs = true;
extern void aligner_seed_tests();
extern void aligner_random_seed_tests(
int num_tests,
uint32_t qslo,
uint32_t qshi,
bool color,
uint32_t seed);
/**
* A way of feeding simple tests to the seed alignment infrastructure.
*/
int main(int argc, char **argv) {
bool useMm = false;
bool useShmem = false;
bool mmSweep = false;
bool noRefNames = false;
bool sanity = false;
bool timing = false;
int option_index = 0;
int seed = 777;
int next_option;
do {
next_option = getopt_long(
argc, argv, short_opts, long_opts, &option_index);
switch (next_option) {
case 'v': gVerbose = true; break;
case 'C': gColor = true; break;
case 't': timing = true; break;
case ARG_NOFW: gNofw = true; break;
case ARG_NORC: gNorc = true; break;
case ARG_MM: useMm = true; break;
case ARG_SHMEM: useShmem = true; break;
case ARG_SEED: seed = parseInt("", optarg); break;
case ARG_TESTS: {
aligner_seed_tests();
aligner_random_seed_tests(
100, // num references
100, // queries per reference lo
400, // queries per reference hi
false, // true -> generate colorspace reference/reads
18); // pseudo-random seed
return 0;
}
case ARG_RANDOM_TESTS: {
seed = parseInt("", optarg);
aligner_random_seed_tests(
100, // num references
100, // queries per reference lo
400, // queries per reference hi
false, // true -> generate colorspace reference/reads
seed); // pseudo-random seed
return 0;
}
case -1: break;
default: {
cerr << "Unknown option: " << (char)next_option << endl;
printUsage(cerr);
exit(1);
}
}
} while(next_option != -1);
char *reffn;
if(optind >= argc) {
cerr << "No reference; quitting..." << endl;
return 1;
}
reffn = argv[optind++];
if(optind >= argc) {
cerr << "No reads; quitting..." << endl;
return 1;
}
string gfmBase(reffn);
BitPairReference ref(
gfmBase, // base path
gColor, // whether we expect it to be colorspace
sanity, // whether to sanity-check reference as it's loaded
NULL, // fasta files to sanity check reference against
NULL, // another way of specifying original sequences
false, // true -> infiles (2 args ago) contains raw seqs
useMm, // use memory mapping to load index?
useShmem, // use shared memory (not memory mapping)
mmSweep, // touch all the pages after memory-mapping the index
gVerbose, // verbose
gVerbose); // verbose but just for startup messages
Timer *t = new Timer(cerr, "Time loading fw index: ", timing);
GFM gfmFw(
gfmBase,
0, // don't need entireReverse for fw index
true, // index is for the forward direction
-1, // offrate (irrelevant)
useMm, // whether to use memory-mapped files
useShmem, // whether to use shared memory
mmSweep, // sweep memory-mapped files
!noRefNames, // load names?
false, // load SA sample?
true, // load ftab?
true, // load rstarts?
NULL, // reference map, or NULL if none is needed
gVerbose, // whether to be talkative
gVerbose, // talkative during initialization
false, // handle memory exceptions, don't pass them up
sanity);
delete t;
t = new Timer(cerr, "Time loading bw index: ", timing);
GFM gfmBw(
gfmBase + ".rev",
1, // need entireReverse
false, // index is for the backward direction
-1, // offrate (irrelevant)
useMm, // whether to use memory-mapped files
useShmem, // whether to use shared memory
mmSweep, // sweep memory-mapped files
!noRefNames, // load names?
false, // load SA sample?
true, // load ftab?
false, // load rstarts?
NULL, // reference map, or NULL if none is needed
gVerbose, // whether to be talkative
gVerbose, // talkative during initialization
false, // handle memory exceptions, don't pass them up
sanity);
delete t;
for(int i = optind; i < argc; i++) {
}
}
#endif

2922
aligner_seed.h Normal file

File diff suppressed because it is too large Load Diff

1245
aligner_seed2.cpp Normal file

File diff suppressed because it is too large Load Diff

4291
aligner_seed2.h Normal file

File diff suppressed because it is too large Load Diff

916
aligner_seed_policy.cpp Normal file
View File

@ -0,0 +1,916 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include <string>
#include <iostream>
#include <sstream>
#include <limits>
#include "ds.h"
#include "aligner_seed_policy.h"
#include "mem_ids.h"
using namespace std;
static int parseFuncType(const std::string& otype) {
string type = otype;
if(type == "C" || type == "Constant") {
return SIMPLE_FUNC_CONST;
} else if(type == "L" || type == "Linear") {
return SIMPLE_FUNC_LINEAR;
} else if(type == "S" || type == "Sqrt") {
return SIMPLE_FUNC_SQRT;
} else if(type == "G" || type == "Log") {
return SIMPLE_FUNC_LOG;
}
std::cerr << "Error: Bad function type '" << otype.c_str()
<< "'. Should be C (constant), L (linear), "
<< "S (square root) or G (natural log)." << std::endl;
throw 1;
}
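/*
 * Sketch only, not part of the original source: the four type codes
 * returned by parseFuncType() select how a length-dependent setting such as
 * IVAL or NCEIL is evaluated. This assumes the form
 * constant + coefficient * g(len), clamped to a [min, max] range.
 * SimpleFunc itself lives in simple_func.h, which is not shown here, so
 * evalSketch, SketchFuncType and the SK_* names are hypothetical stand-ins
 * rather than its real interface.
 *
```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical stand-ins for the SIMPLE_FUNC_* codes; values are arbitrary.
enum SketchFuncType { SK_CONST, SK_LINEAR, SK_SQRT, SK_LOG };

// Evaluate constTerm + coeff * g(len) and clamp the result to [lo, hi],
// where g depends on the function type.
static double evalSketch(SketchFuncType type, double constTerm, double coeff,
                         double lo, double hi, double len) {
    assert(lo <= hi);
    double g = 0.0;
    switch(type) {
        case SK_CONST:  g = 0.0;            break; // length is ignored
        case SK_LINEAR: g = len;            break;
        case SK_SQRT:   g = std::sqrt(len); break;
        case SK_LOG:    g = std::log(len);  break; // natural log
    }
    double v = constTerm + coeff * g;
    return std::max(lo, std::min(hi, v));
}
```
 *
 * For example, a square-root function with constant 0.0, coefficient 1.15
 * and minimum 1.0 gives about 11.5 for a 100-base read, and 1.0 for any
 * read short enough that the raw value drops below 1.
 */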
#define PARSE_FUNC(fv) { \
if(ctoks.size() >= 1) { \
fv.setType(parseFuncType(ctoks[0])); \
} \
if(ctoks.size() >= 2) { \
double co; \
istringstream tmpss(ctoks[1]); \
tmpss >> co; \
fv.setConst(co); \
} \
if(ctoks.size() >= 3) { \
double ce; \
istringstream tmpss(ctoks[2]); \
tmpss >> ce; \
fv.setCoeff(ce); \
} \
if(ctoks.size() >= 4) { \
double mn; \
istringstream tmpss(ctoks[3]); \
tmpss >> mn; \
fv.setMin(mn); \
} \
if(ctoks.size() >= 5) { \
double mx; \
istringstream tmpss(ctoks[4]); \
tmpss >> mx; \
fv.setMax(mx); \
} \
}
/**
* Parse alignment policy when provided in this format:
* <lab>=<val>;<lab>=<val>;<lab>=<val>...
*
* And label=value possibilities are:
*
* Bonus for a match
* -----------------
*
* MA=xx (default: MA=0, or MA=2 if --local is set)
*
* xx = Each position where equal read and reference characters match up
* in the alignment contributes this amount to the total score.
*
* Penalty for a mismatch
* ----------------------
*
* MMP={Cxx|Q|RQ} (default: MMP=C6)
*
* Cxx = Each mismatch costs xx. If MMP=Cxx is specified, quality
* values are ignored when assessing penalties for mismatches.
* Q = Each mismatch incurs a penalty equal to the mismatched base's
* quality value.
* R = Each mismatch incurs a penalty equal to the mismatched base's
* rounded quality value. Qualities are rounded off to the
* nearest 10, and qualities greater than 30 are rounded to 30.
*
* Penalty for position with N (in either read or reference)
* ---------------------------------------------------------
*
* NP={Cxx|Q|RQ} (default: NP=C1)
*
* Cxx = Each alignment position with an N in either the read or the
* reference costs xx. If NP=Cxx is specified, quality values are
* ignored when assessing penalties for Ns.
* Q = Each alignment position with an N in either the read or the
* reference incurs a penalty equal to the read base's quality
* value.
* R = Each alignment position with an N in either the read or the
* reference incurs a penalty equal to the read base's rounded
* quality value. Qualities are rounded off to the nearest 10,
* and qualities greater than 30 are rounded to 30.
*
* Penalty for a read gap
* ----------------------
*
* RDG=xx,yy (default: RDG=5,3)
*
* xx = Read gap open penalty.
* yy = Read gap extension penalty.
*
* Total cost incurred by a read gap = xx + (yy * gap length)
*
* Penalty for a reference gap
* ---------------------------
*
* RFG=xx,yy (default: RFG=5,3)
*
* xx = Reference gap open penalty.
* yy = Reference gap extension penalty.
*
* Total cost incurred by a reference gap = xx + (yy * gap length)
*
* Minimum score for valid alignment
* ---------------------------------
*
* MIN=xx,yy (defaults: MIN=-0.6,-0.6, or MIN=0.0,0.66 if --local is set)
*
* xx,yy = For a read of length N, the total score must be at least
* xx + (read length * yy) for the alignment to be valid. The
* total score is the sum of all negative penalties (from
* mismatches and gaps) and all positive bonuses. The minimum
* can be negative (and is by default in global alignment mode).
*
* Score floor for local alignment
* -------------------------------
*
* FL=xx,yy (defaults: FL=-Infinity,0.0, or FL=0.0,0.0 if --local is set)
*
* xx,yy = If a cell in the dynamic programming table has a score less
* than xx + (read length * yy), then no valid alignment can go
* through it. Defaults are highly recommended.
*
* N ceiling
* ---------
*
* NCEIL=xx,yy (default: NCEIL=0.0,0.15)
*
* xx,yy = For a read of length N, the number of alignment
* positions with an N in either the read or the
* reference cannot exceed
* ceiling = xx + (read length * yy). If the ceiling is
* exceeded, the alignment is considered invalid.
*
* Seeds
* -----
*
* SEED=mm,len,ival (default: SEED=0,22)
*
* mm = Maximum number of mismatches allowed within a seed.
* Must be >= 0 and <= 2. Note that 2-mismatch mode is
* not fully sensitive; i.e. some 2-mismatch seed
* alignments may be missed.
* len = Length of seed.
* ival = Interval between seeds. If not specified, seed
* interval is determined by IVAL.
*
* Seed interval
* -------------
*
* IVAL={C|L|S|G},xx,yy (default: IVAL=S,0.0,1.15)
*
* xx and yy are the constant term and the coefficient
* respectively. In every case, intervals less than 1 are
* rounded up to 1.
*
* C = let interval between seeds be the constant xx,
* regardless of the read length.
* L = let interval between seeds be a linear function of the
* read length. In other words, the interval equals
* xx + yy * len, where len is the read length.
* S = let interval between seeds be a function of the square
* root of the read length. In other words, the interval
* equals xx + yy * sqrt(len), where len is the read length.
* G = Like S but uses the natural log of the length instead
* of the square root.
*
* Example 1:
*
* SEED=1,10,5 and read sequence is TGCTATCGTACGATCGTACA:
*
* The following seeds are extracted from the forward
* representation of the read and aligned to the reference
* allowing up to 1 mismatch:
*
* Read: TGCTATCGTACGATCGTACA
*
* Seed 1+: TGCTATCGTA
* Seed 2+: TCGTACGATC
* Seed 3+: CGATCGTACA
*
* ...and the following are extracted from the reverse-complement
* representation of the read and aligned to the reference allowing
* up to 1 mismatch:
*
* Seed 1-: TACGATAGCA
* Seed 2-: GATCGTACGA
* Seed 3-: TGTACGATCG
*
* Example 2:
*
* SEED=1,20,20 and read sequence is TGCTATCGTACGATC. The seed
* length is 20 but the read is only 15 characters long. In this
* case, Bowtie2 automatically shrinks the seed length to be equal
* to the read length.
*
* Read: TGCTATCGTACGATC
*
* Seed 1+: TGCTATCGTACGATC
* Seed 1-: GATCGTACGATAGCA
*
* Example 3:
*
* SEED=1,10,10 and read sequence is TGCTATCGTACGATC. Only one seed
* fits on the read; a second seed would overhang the end of the read
* by 5 positions. In this case, Bowtie2 extracts one seed.
*
* Read: TGCTATCGTACGATC
*
* Seed 1+: TGCTATCGTA
* Seed 1-: TACGATAGCA
*/
void SeedAlignmentPolicy::parseString(
const std::string& s,
bool local,
bool noisyHpolymer,
bool ignoreQuals,
int& bonusMatchType,
int& bonusMatch,
int& penMmcType,
int& penMmcMax,
int& penMmcMin,
int& penScMax,
int& penScMin,
int& penNType,
int& penN,
int& penRdExConst,
int& penRfExConst,
int& penRdExLinear,
int& penRfExLinear,
SimpleFunc& costMin,
SimpleFunc& nCeil,
bool& nCatPair,
int& multiseedMms,
int& multiseedLen,
SimpleFunc& multiseedIval,
size_t& failStreak,
size_t& seedRounds,
SimpleFunc* penCanIntronLen,
SimpleFunc* penNoncanIntronLen)
{
bonusMatchType = local ? DEFAULT_MATCH_BONUS_TYPE_LOCAL : DEFAULT_MATCH_BONUS_TYPE;
bonusMatch = local ? DEFAULT_MATCH_BONUS_LOCAL : DEFAULT_MATCH_BONUS;
penMmcType = ignoreQuals ? DEFAULT_MM_PENALTY_TYPE_IGNORE_QUALS :
DEFAULT_MM_PENALTY_TYPE;
penMmcMax = DEFAULT_MM_PENALTY_MAX;
penMmcMin = DEFAULT_MM_PENALTY_MIN;
penNType = DEFAULT_N_PENALTY_TYPE;
penN = DEFAULT_N_PENALTY;
penScMax = DEFAULT_SC_PENALTY_MAX;
penScMin = DEFAULT_SC_PENALTY_MIN;
const double DMAX = std::numeric_limits<double>::max();
costMin.init(
local ? SIMPLE_FUNC_LOG : SIMPLE_FUNC_LINEAR,
local ? DEFAULT_MIN_CONST_LOCAL : 0.0f,
local ? DEFAULT_MIN_LINEAR_LOCAL : -0.2f);
nCeil.init(
SIMPLE_FUNC_LINEAR, 0.0f, DMAX,
DEFAULT_N_CEIL_CONST, DEFAULT_N_CEIL_LINEAR);
multiseedIval.init(
DEFAULT_IVAL, 1.0f, DMAX,
DEFAULT_IVAL_B, DEFAULT_IVAL_A);
nCatPair = DEFAULT_N_CAT_PAIR;
if(!noisyHpolymer) {
penRdExConst = DEFAULT_READ_GAP_CONST;
penRdExLinear = DEFAULT_READ_GAP_LINEAR;
penRfExConst = DEFAULT_REF_GAP_CONST;
penRfExLinear = DEFAULT_REF_GAP_LINEAR;
} else {
penRdExConst = DEFAULT_READ_GAP_CONST_BADHPOLY;
penRdExLinear = DEFAULT_READ_GAP_LINEAR_BADHPOLY;
penRfExConst = DEFAULT_REF_GAP_CONST_BADHPOLY;
penRfExLinear = DEFAULT_REF_GAP_LINEAR_BADHPOLY;
}
multiseedMms = DEFAULT_SEEDMMS;
multiseedLen = DEFAULT_SEEDLEN;
EList<string> toks(MISC_CAT);
string tok;
istringstream ss(s);
int setting = 0;
// Get each ;-separated token
while(getline(ss, tok, ';')) {
setting++;
EList<string> etoks(MISC_CAT);
string etok;
// Divide into tokens on either side of =
istringstream ess(tok);
while(getline(ess, etok, '=')) {
etoks.push_back(etok);
}
// Must be exactly 1 =
if(etoks.size() != 2) {
cerr << "Error parsing alignment policy setting " << setting
<< "; must be bisected by = sign" << endl
<< "Policy: " << s.c_str() << endl;
assert(false); throw 1;
}
// LHS is tag, RHS value
string tag = etoks[0], val = etoks[1];
// Separate value into comma-separated tokens
EList<string> ctoks(MISC_CAT);
string ctok;
istringstream css(val);
while(getline(css, ctok, ',')) {
ctoks.push_back(ctok);
}
if(ctoks.size() == 0) {
cerr << "Error parsing alignment policy setting " << setting
<< "; RHS must have at least 1 token" << endl
<< "Policy: " << s.c_str() << endl;
assert(false); throw 1;
}
for(size_t i = 0; i < ctoks.size(); i++) {
if(ctoks[i].length() == 0) {
cerr << "Error parsing alignment policy setting " << setting
<< "; token " << i+1 << " on RHS had length=0" << endl
<< "Policy: " << s.c_str() << endl;
assert(false); throw 1;
}
}
// Bonus for a match
// MA=xx (default: MA=0, or MA=2 if --local is set)
if(tag == "MA") {
if(ctoks.size() != 1) {
cerr << "Error parsing alignment policy setting " << setting
<< "; RHS must have 1 token" << endl
<< "Policy: " << s.c_str() << endl;
assert(false); throw 1;
}
string tmp = ctoks[0];
istringstream tmpss(tmp);
tmpss >> bonusMatch;
}
// Scoring for mismatches
// MMP={Cxx|Q|RQ}
// Cxx = constant, where constant is integer xx
// Qxx = equal to quality, scaled
// R = equal to maq-rounded quality value (rounded to nearest
// 10, can't be greater than 30)
else if(tag == "MMP") {
if(ctoks.size() > 3) {
cerr << "Error parsing alignment policy setting "
<< "'" << tag.c_str() << "'"
<< "; RHS must have at most 3 tokens" << endl
<< "Policy: '" << s.c_str() << "'" << endl;
assert(false); throw 1;
}
if(ctoks[0][0] == 'C') {
string tmp = ctoks[0].substr(1);
// Parse constant penalty
istringstream tmpss(tmp);
tmpss >> penMmcMax;
penMmcMin = penMmcMax;
// Set type to constant
penMmcType = COST_MODEL_CONSTANT;
} else if(ctoks[0][0] == 'Q') {
if(ctoks.size() >= 2) {
string tmp = ctoks[1];
istringstream tmpss(tmp);
tmpss >> penMmcMax;
} else {
penMmcMax = DEFAULT_MM_PENALTY_MAX;
}
if(ctoks.size() >= 3) {
string tmp = ctoks[2];
istringstream tmpss(tmp);
tmpss >> penMmcMin;
} else {
penMmcMin = DEFAULT_MM_PENALTY_MIN;
}
if(penMmcMin > penMmcMax) {
cerr << "Error: Maximum mismatch penalty (" << penMmcMax
<< ") is less than minimum penalty (" << penMmcMin
<< ")" << endl;
throw 1;
}
// Set type to =quality
penMmcType = COST_MODEL_QUAL;
} else if(ctoks[0][0] == 'R') {
// Set type to=Maq-quality
penMmcType = COST_MODEL_ROUNDED_QUAL;
} else {
cerr << "Error parsing alignment policy setting "
<< "'" << tag.c_str() << "'"
<< "; RHS must start with C, Q or R" << endl
<< "Policy: '" << s.c_str() << "'" << endl;
assert(false); throw 1;
}
}
else if(tag == "SCP") {
if(ctoks.size() > 3) {
cerr << "Error parsing alignment policy setting "
<< "'" << tag.c_str() << "'"
<< "; SCP must have at most 3 tokens" << endl
<< "Policy: '" << s.c_str() << "'" << endl;
assert(false); throw 1;
}
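// Assumes the same layout as MMP=Q,<max>,<min>: token 1 is the maximum
// penalty and token 2 the minimum. Each is optional, so check the token
// count before indexing.
if(ctoks.size() >= 2) {
istringstream tmpMax(ctoks[1]);
tmpMax >> penScMax;
}
if(ctoks.size() >= 3) {
istringstream tmpMin(ctoks[2]);
tmpMin >> penScMin;
}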
if(penScMin > penScMax) {
cerr << "max (" << penScMax << ") should be >= min (" << penScMin << ")" << endl;
assert(false); throw 1;
}
if(penScMin < 1) {
cerr << "min (" << penScMin << ") should be greater than 0" << endl;
assert(false); throw 1;
}
}
// Scoring for mismatches where read char=N
// NP={Cxx|Q|RQ}
// Cxx = constant, where constant is integer xx
// Q = equal to quality
// R = equal to maq-rounded quality value (rounded to nearest
// 10, can't be greater than 30)
else if(tag == "NP") {
if(ctoks.size() != 1) {
cerr << "Error parsing alignment policy setting "
<< "'" << tag.c_str() << "'"
<< "; RHS must have 1 token" << endl
<< "Policy: '" << s.c_str() << "'" << endl;
assert(false); throw 1;
}
if(ctoks[0][0] == 'C') {
string tmp = ctoks[0].substr(1);
// Parse constant penalty
istringstream tmpss(tmp);
tmpss >> penN;
// Set type to constant
penNType = COST_MODEL_CONSTANT;
} else if(ctoks[0][0] == 'Q') {
// Set type to =quality
penNType = COST_MODEL_QUAL;
} else if(ctoks[0][0] == 'R') {
// Set type to=Maq-quality
penNType = COST_MODEL_ROUNDED_QUAL;
} else {
cerr << "Error parsing alignment policy setting "
<< "'" << tag.c_str() << "'"
<< "; RHS must start with C, Q or R" << endl
<< "Policy: '" << s.c_str() << "'" << endl;
assert(false); throw 1;
}
}
// Scoring for read gaps
// RDG=xx,yy
// xx = read gap open penalty
// yy = read gap extension penalty (falls back to the default
// when omitted)
else if(tag == "RDG") {
if(ctoks.size() >= 1) {
istringstream tmpss(ctoks[0]);
tmpss >> penRdExConst;
} else {
penRdExConst = noisyHpolymer ?
DEFAULT_READ_GAP_CONST_BADHPOLY :
DEFAULT_READ_GAP_CONST;
}
if(ctoks.size() >= 2) {
istringstream tmpss(ctoks[1]);
tmpss >> penRdExLinear;
} else {
penRdExLinear = noisyHpolymer ?
DEFAULT_READ_GAP_LINEAR_BADHPOLY :
DEFAULT_READ_GAP_LINEAR;
}
}
// Scoring for reference gaps
// RFG=xx,yy
// xx = ref gap open penalty
// yy = ref gap extension penalty (falls back to the default
// when omitted)
else if(tag == "RFG") {
if(ctoks.size() >= 1) {
istringstream tmpss(ctoks[0]);
tmpss >> penRfExConst;
} else {
penRfExConst = noisyHpolymer ?
DEFAULT_REF_GAP_CONST_BADHPOLY :
DEFAULT_REF_GAP_CONST;
}
if(ctoks.size() >= 2) {
istringstream tmpss(ctoks[1]);
tmpss >> penRfExLinear;
} else {
penRfExLinear = noisyHpolymer ?
DEFAULT_REF_GAP_LINEAR_BADHPOLY :
DEFAULT_REF_GAP_LINEAR;
}
}
// Minimum score as a function of read length
// MIN=<type>,xx,yy
// <type> = C, L, S or G (see parseFuncType)
// xx = constant term
// yy = coefficient
else if(tag == "MIN") {
PARSE_FUNC(costMin);
}
// Per-read N ceiling as a function of read length
// NCEIL=<type>,xx,yy
// <type> = C, L, S or G (see parseFuncType)
// xx = N ceiling constant term
// yy = N ceiling coefficient (keeps its default if unspecified)
else if(tag == "NCEIL") {
PARSE_FUNC(nCeil);
}
/*
* Seeds
* -----
*
* SEED=mm,len,ival (default: SEED=0,22)
*
* mm = Maximum number of mismatches allowed within a seed.
* Must be >= 0 and <= 2. Note that 2-mismatch mode is
* not fully sensitive; i.e. some 2-mismatch seed
* alignments may be missed. The check below is stricter
* and rejects any value greater than 1.
* len = Length of seed.
* ival = Interval between seeds. The code below accepts at most
* two tokens, so the seed interval is determined by IVAL.
*/
else if(tag == "SEED") {
if(ctoks.size() > 2) {
cerr << "Error parsing alignment policy setting "
<< "'" << tag.c_str() << "'; RHS must have 1 or 2 tokens, "
<< "had " << ctoks.size() << ". "
<< "Policy: '" << s.c_str() << "'" << endl;
assert(false); throw 1;
}
if(ctoks.size() >= 1) {
istringstream tmpss(ctoks[0]);
tmpss >> multiseedMms;
if(multiseedMms > 1) {
cerr << "Error: -N was set to " << multiseedMms << ", but cannot be set greater than 1" << endl;
throw 1;
}
if(multiseedMms < 0) {
cerr << "Error: -N was set to a number less than 0 (" << multiseedMms << ")" << endl;
throw 1;
}
}
if(ctoks.size() >= 2) {
istringstream tmpss(ctoks[1]);
tmpss >> multiseedLen;
} else {
multiseedLen = DEFAULT_SEEDLEN;
}
}
else if(tag == "SEEDLEN") {
if(ctoks.size() > 1) {
cerr << "Error parsing alignment policy setting "
<< "'" << tag.c_str() << "'; RHS must have 1 token, "
<< "had " << ctoks.size() << ". "
<< "Policy: '" << s.c_str() << "'" << endl;
assert(false); throw 1;
}
if(ctoks.size() >= 1) {
istringstream tmpss(ctoks[0]);
tmpss >> multiseedLen;
}
}
else if(tag == "DPS") {
if(ctoks.size() > 1) {
cerr << "Error parsing alignment policy setting "
<< "'" << tag.c_str() << "'; RHS must have 1 token, "
<< "had " << ctoks.size() << ". "
<< "Policy: '" << s.c_str() << "'" << endl;
assert(false); throw 1;
}
if(ctoks.size() >= 1) {
istringstream tmpss(ctoks[0]);
tmpss >> failStreak;
}
}
else if(tag == "ROUNDS") {
if(ctoks.size() > 1) {
cerr << "Error parsing alignment policy setting "
<< "'" << tag.c_str() << "'; RHS must have 1 token, "
<< "had " << ctoks.size() << ". "
<< "Policy: '" << s.c_str() << "'" << endl;
assert(false); throw 1;
}
if(ctoks.size() >= 1) {
istringstream tmpss(ctoks[0]);
tmpss >> seedRounds;
}
}
/*
* Seed interval
* -------------
*
* IVAL={C|L|S|G},xx,yy (default: IVAL=S,0.0,1.15)
*
* xx and yy are the constant term and the coefficient
* respectively. In every case, intervals less than 1 are
* rounded up to 1.
*
* C = let interval between seeds be the constant xx,
* regardless of the read length.
* L = let interval between seeds be a linear function of the
* read length. In other words, the interval equals
* xx + yy * len, where len is the read length.
* S = let interval between seeds be a function of the square
* root of the read length. In other words, the interval
* equals xx + yy * sqrt(len), where len is the read length.
* G = Like S but uses the natural log of the length instead
* of the square root.
*/
else if(tag == "IVAL") {
PARSE_FUNC(multiseedIval);
}
else if(tag == "CANINTRONLEN") {
assert(penCanIntronLen != NULL);
PARSE_FUNC((*penCanIntronLen));
}
else if(tag == "NONCANINTRONLEN") {
assert(penNoncanIntronLen != NULL);
PARSE_FUNC((*penNoncanIntronLen));
}
else {
// Unknown tag
cerr << "Unexpected alignment policy setting "
<< "'" << tag.c_str() << "'" << endl
<< "Policy: '" << s.c_str() << "'" << endl;
assert(false); throw 1;
}
}
}
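/*
 * Sketch only, not part of the original source: Examples 1 to 3 in the
 * comment above parseString() can be reproduced with a few lines. The
 * sketch assumes that seeds start at offset 0, advance by the interval,
 * never overhang the read, and that each '-' seed is the reverse complement
 * of the corresponding '+' seed, which is what the examples show.
 * SketchSeed, revcomp and extractSeeds are hypothetical names; this is not
 * the SeedAligner's actual code.
 *
```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Reverse-complement a DNA string; anything other than A, C, G, T maps to N.
static std::string revcomp(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for(size_t i = 0; i < rc.size(); i++) {
        switch(rc[i]) {
            case 'A': rc[i] = 'T'; break;
            case 'C': rc[i] = 'G'; break;
            case 'G': rc[i] = 'C'; break;
            case 'T': rc[i] = 'A'; break;
            default:  rc[i] = 'N'; break;
        }
    }
    return rc;
}

// One extracted seed: the forward substring and its reverse complement.
struct SketchSeed {
    std::string fw;
    std::string rc;
};

// Extract seeds of length 'len' every 'interval' positions along the read.
static std::vector<SketchSeed> extractSeeds(
    const std::string& read, size_t len, size_t interval)
{
    assert(interval >= 1);
    std::vector<SketchSeed> seeds;
    // Example 2: a seed longer than the read shrinks to the read length.
    len = std::min(len, read.size());
    if(len == 0) return seeds;
    // Example 3: stop as soon as the next seed would overhang the read.
    for(size_t off = 0; off + len <= read.size(); off += interval) {
        SketchSeed s;
        s.fw = read.substr(off, len);
        s.rc = revcomp(s.fw);
        seeds.push_back(s);
    }
    return seeds;
}
```
 *
 * Calling extractSeeds("TGCTATCGTACGATCGTACA", 10, 5) returns the three
 * seed pairs listed in Example 1; the 15-base read of Examples 2 and 3
 * yields a single pair in both cases.
 */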
#ifdef ALIGNER_SEED_POLICY_MAIN
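// NOTE: this self-test does not match the parseString() signature defined
// above: it passes costFloor and mhits, which are not parameters of
// parseString(), and omits penMmcMin, failStreak, seedRounds and the two
// intron-length functions. It needs updating before
// ALIGNER_SEED_POLICY_MAIN can be built.
int main() {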
int bonusMatchType;
int bonusMatch;
int penMmcType;
int penMmc;
int penScMax;
int penScMin;
int penNType;
int penN;
int penRdExConst;
int penRfExConst;
int penRdExLinear;
int penRfExLinear;
SimpleFunc costMin;
SimpleFunc costFloor;
SimpleFunc nCeil;
bool nCatPair;
int multiseedMms;
int multiseedLen;
SimpleFunc msIval;
SimpleFunc posfrac;
SimpleFunc rowmult;
uint32_t mhits;
{
cout << "Case 1: Defaults 1 ... ";
const char *pol = "";
SeedAlignmentPolicy::parseString(
string(pol),
false, // --local?
false, // noisy homopolymers a la 454?
false, // ignore qualities?
bonusMatchType,
bonusMatch,
penMmcType,
penMmc,
penScMax,
penScMin,
penNType,
penN,
penRdExConst,
penRfExConst,
penRdExLinear,
penRfExLinear,
costMin,
costFloor,
nCeil,
nCatPair,
multiseedMms,
multiseedLen,
msIval,
mhits);
assert_eq(DEFAULT_MATCH_BONUS_TYPE, bonusMatchType);
assert_eq(DEFAULT_MATCH_BONUS, bonusMatch);
assert_eq(DEFAULT_MM_PENALTY_TYPE, penMmcType);
assert_eq(DEFAULT_MM_PENALTY_MAX, penMmcMax);
assert_eq(DEFAULT_MM_PENALTY_MIN, penMmcMin);
assert_eq(DEFAULT_N_PENALTY_TYPE, penNType);
assert_eq(DEFAULT_N_PENALTY, penN);
assert_eq(DEFAULT_MIN_CONST, costMin.getConst());
assert_eq(DEFAULT_MIN_LINEAR, costMin.getCoeff());
assert_eq(DEFAULT_FLOOR_CONST, costFloor.getConst());
assert_eq(DEFAULT_FLOOR_LINEAR, costFloor.getCoeff());
assert_eq(DEFAULT_N_CEIL_CONST, nCeil.getConst());
assert_eq(DEFAULT_N_CAT_PAIR, nCatPair);
assert_eq(DEFAULT_READ_GAP_CONST, penRdExConst);
assert_eq(DEFAULT_READ_GAP_LINEAR, penRdExLinear);
assert_eq(DEFAULT_REF_GAP_CONST, penRfExConst);
assert_eq(DEFAULT_REF_GAP_LINEAR, penRfExLinear);
assert_eq(DEFAULT_SEEDMMS, multiseedMms);
assert_eq(DEFAULT_SEEDLEN, multiseedLen);
assert_eq(DEFAULT_IVAL, msIval.getType());
assert_eq(DEFAULT_IVAL_A, msIval.getCoeff());
assert_eq(DEFAULT_IVAL_B, msIval.getConst());
cout << "PASSED" << endl;
}
{
cout << "Case 2: Defaults 2 ... ";
const char *pol = "";
SeedAlignmentPolicy::parseString(
string(pol),
false, // --local?
true, // noisy homopolymers a la 454?
false, // ignore qualities?
bonusMatchType,
bonusMatch,
penMmcType,
penMmc,
penNType,
penN,
penRdExConst,
penRfExConst,
penRdExLinear,
penRfExLinear,
costMin,
costFloor,
nCeil,
nCatPair,
multiseedMms,
multiseedLen,
msIval,
mhits);
assert_eq(DEFAULT_MATCH_BONUS_TYPE, bonusMatchType);
assert_eq(DEFAULT_MATCH_BONUS, bonusMatch);
assert_eq(DEFAULT_MM_PENALTY_TYPE, penMmcType);
assert_eq(DEFAULT_MM_PENALTY_MAX, penMmc);
assert_eq(DEFAULT_MM_PENALTY_MIN, penMmc);
assert_eq(DEFAULT_N_PENALTY_TYPE, penNType);
assert_eq(DEFAULT_N_PENALTY, penN);
assert_eq(DEFAULT_MIN_CONST, costMin.getConst());
assert_eq(DEFAULT_MIN_LINEAR, costMin.getCoeff());
assert_eq(DEFAULT_FLOOR_CONST, costFloor.getConst());
assert_eq(DEFAULT_FLOOR_LINEAR, costFloor.getCoeff());
assert_eq(DEFAULT_N_CEIL_CONST, nCeil.getConst());
assert_eq(DEFAULT_N_CAT_PAIR, nCatPair);
assert_eq(DEFAULT_READ_GAP_CONST_BADHPOLY, penRdExConst);
assert_eq(DEFAULT_READ_GAP_LINEAR_BADHPOLY, penRdExLinear);
assert_eq(DEFAULT_REF_GAP_CONST_BADHPOLY, penRfExConst);
assert_eq(DEFAULT_REF_GAP_LINEAR_BADHPOLY, penRfExLinear);
assert_eq(DEFAULT_SEEDMMS, multiseedMms);
assert_eq(DEFAULT_SEEDLEN, multiseedLen);
assert_eq(DEFAULT_IVAL, msIval.getType());
assert_eq(DEFAULT_IVAL_A, msIval.getCoeff());
assert_eq(DEFAULT_IVAL_B, msIval.getConst());
cout << "PASSED" << endl;
}
{
cout << "Case 3: Defaults 3 ... ";
const char *pol = "";
SeedAlignmentPolicy::parseString(
string(pol),
true, // --local?
false, // noisy homopolymers a la 454?
false, // ignore qualities?
bonusMatchType,
bonusMatch,
penMmcType,
penMmc,
penNType,
penN,
penRdExConst,
penRfExConst,
penRdExLinear,
penRfExLinear,
costMin,
costFloor,
nCeil,
nCatPair,
multiseedMms,
multiseedLen,
msIval,
mhits);
assert_eq(DEFAULT_MATCH_BONUS_TYPE_LOCAL, bonusMatchType);
assert_eq(DEFAULT_MATCH_BONUS_LOCAL, bonusMatch);
assert_eq(DEFAULT_MM_PENALTY_TYPE, penMmcType);
assert_eq(DEFAULT_MM_PENALTY_MAX, penMmcMax);
assert_eq(DEFAULT_MM_PENALTY_MIN, penMmcMin);
assert_eq(DEFAULT_N_PENALTY_TYPE, penNType);
assert_eq(DEFAULT_N_PENALTY, penN);
assert_eq(DEFAULT_MIN_CONST_LOCAL, costMin.getConst());
assert_eq(DEFAULT_MIN_LINEAR_LOCAL, costMin.getCoeff());
assert_eq(DEFAULT_FLOOR_CONST_LOCAL, costFloor.getConst());
assert_eq(DEFAULT_FLOOR_LINEAR_LOCAL, costFloor.getCoeff());
assert_eq(DEFAULT_N_CEIL_CONST, nCeil.getConst());
assert_eq(DEFAULT_N_CEIL_LINEAR, nCeil.getCoeff());
assert_eq(DEFAULT_N_CAT_PAIR, nCatPair);
assert_eq(DEFAULT_READ_GAP_CONST, penRdExConst);
assert_eq(DEFAULT_READ_GAP_LINEAR, penRdExLinear);
assert_eq(DEFAULT_REF_GAP_CONST, penRfExConst);
assert_eq(DEFAULT_REF_GAP_LINEAR, penRfExLinear);
assert_eq(DEFAULT_SEEDMMS, multiseedMms);
assert_eq(DEFAULT_SEEDLEN, multiseedLen);
assert_eq(DEFAULT_IVAL, msIval.getType());
assert_eq(DEFAULT_IVAL_A, msIval.getCoeff());
assert_eq(DEFAULT_IVAL_B, msIval.getConst());
cout << "PASSED" << endl;
}
{
cout << "Case 4: Simple string 1 ... ";
const char *pol = "MMP=C44;MA=4;RFG=24,12;FL=C,8;RDG=2;NP=C4;MIN=C,7";
SeedAlignmentPolicy::parseString(
string(pol),
true, // --local?
false, // noisy homopolymers a la 454?
false, // ignore qualities?
bonusMatchType,
bonusMatch,
penMmcType,
penMmc,
penNType,
penN,
penRdExConst,
penRfExConst,
penRdExLinear,
penRfExLinear,
costMin,
costFloor,
nCeil,
nCatPair,
multiseedMms,
multiseedLen,
msIval,
mhits);
assert_eq(COST_MODEL_CONSTANT, bonusMatchType);
assert_eq(4, bonusMatch);
assert_eq(COST_MODEL_CONSTANT, penMmcType);
assert_eq(44, penMmc);
assert_eq(COST_MODEL_CONSTANT, penNType);
assert_eq(4.0f, penN);
assert_eq(7, costMin.getConst());
assert_eq(DEFAULT_MIN_LINEAR_LOCAL, costMin.getCoeff());
assert_eq(8, costFloor.getConst());
assert_eq(DEFAULT_FLOOR_LINEAR_LOCAL, costFloor.getCoeff());
assert_eq(DEFAULT_N_CEIL_CONST, nCeil.getConst());
assert_eq(DEFAULT_N_CEIL_LINEAR, nCeil.getCoeff());
assert_eq(DEFAULT_N_CAT_PAIR, nCatPair);
assert_eq(2.0f, penRdExConst);
assert_eq(DEFAULT_READ_GAP_LINEAR, penRdExLinear);
assert_eq(24.0f, penRfExConst);
assert_eq(12.0f, penRfExLinear);
assert_eq(DEFAULT_SEEDMMS, multiseedMms);
assert_eq(DEFAULT_SEEDLEN, multiseedLen);
assert_eq(DEFAULT_IVAL, msIval.getType());
assert_eq(DEFAULT_IVAL_A, msIval.getCoeff());
assert_eq(DEFAULT_IVAL_B, msIval.getConst());
cout << "PASSED" << endl;
}
}
#endif /*def ALIGNER_SEED_POLICY_MAIN*/
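
The test cases above pass policy strings such as "MMP=C44;MA=4;RFG=24,12" to parseString. As a rough illustration of the <lab>=<val>;<lab>=<val> layout only, the sketch below splits such a string into labels and comma-separated values. The helper names splitOn and tokenizePolicy are hypothetical and do not appear in the source; the real parser also interprets each value and writes it to the output parameters, which this sketch does not attempt.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original source): split `s` on `delim`.
static std::vector<std::string> splitOn(const std::string& s, char delim) {
    std::vector<std::string> out;
    std::string tok;
    std::istringstream in(s);
    while (std::getline(in, tok, delim)) {
        out.push_back(tok);
    }
    return out;
}

// Hypothetical helper: turn "MA=4;RFG=24,12" into
// { "MA" -> ["4"], "RFG" -> ["24", "12"] }.
// Settings that do not split into exactly one label and one value are skipped.
static std::map<std::string, std::vector<std::string> >
tokenizePolicy(const std::string& pol) {
    std::map<std::string, std::vector<std::string> > settings;
    for (const std::string& setting : splitOn(pol, ';')) {
        std::vector<std::string> kv = splitOn(setting, '=');
        if (kv.size() != 2) continue;
        settings[kv[0]] = splitOn(kv[1], ',');
    }
    return settings;
}
```

Keeping the values as strings leaves the type-specific handling (for example the "C" prefix in MMP=C44) to a later step, which is one plausible way to separate tokenizing from interpretation.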

234
aligner_seed_policy.h Normal file

@ -0,0 +1,234 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNER_SEED_POLICY_H_
#define ALIGNER_SEED_POLICY_H_
#include "scoring.h"
#include "simple_func.h"
#define DEFAULT_SEEDMMS 0
#define DEFAULT_SEEDLEN 22
#define DEFAULT_IVAL SIMPLE_FUNC_SQRT
#define DEFAULT_IVAL_A 1.15f
#define DEFAULT_IVAL_B 0.0f
#define DEFAULT_UNGAPPED_HITS 6
/**
* Encapsulates the set of all parameters that affect what the
* SeedAligner does with reads.
*/
class SeedAlignmentPolicy {
public:
/**
* Parse alignment policy when provided in this format:
* <lab>=<val>;<lab>=<val>;<lab>=<val>...
*
* And label=value possibilities are:
*
* Bonus for a match
* -----------------
*
* MA=xx (default: MA=0, or MA=2 if --local is set)
*
* xx = Each position where equal read and reference characters match up
* in the alignment contributes this amount to the total score.
*
* Penalty for a mismatch
* ----------------------
*
* MMP={Cxx|Q|RQ} (default: MMP=C6)
*
* Cxx = Each mismatch costs xx. If MMP=Cxx is specified, quality
* values are ignored when assessing penalties for mismatches.
* Q = Each mismatch incurs a penalty equal to the mismatched base's
* quality value.
* RQ = Each mismatch incurs a penalty equal to the mismatched base's
* rounded quality value. Qualities are rounded off to the
* nearest 10, and qualities greater than 30 are rounded to 30.
*
* Penalty for position with N (in either read or reference)
* ---------------------------------------------------------
*
* NP={Cxx|Q|RQ} (default: NP=C1)
*
* Cxx = Each alignment position with an N in either the read or the
* reference costs xx. If NP=Cxx is specified, quality values are
* ignored when assessing penalties for Ns.
* Q = Each alignment position with an N in either the read or the
* reference incurs a penalty equal to the read base's quality
* value.
* RQ = Each alignment position with an N in either the read or the
* reference incurs a penalty equal to the read base's rounded
* quality value. Qualities are rounded off to the nearest 10,
* and qualities greater than 30 are rounded to 30.
*
* Penalty for a read gap
* ----------------------
*
* RDG=xx,yy (default: RDG=5,3)
*
* xx = Read gap open penalty.
* yy = Read gap extension penalty.
*
* Total cost incurred by a read gap = xx + (yy * gap length)
*
* Penalty for a reference gap
* ---------------------------
*
* RFG=xx,yy (default: RFG=5,3)
*
* xx = Reference gap open penalty.
* yy = Reference gap extension penalty.
*
* Total cost incurred by a reference gap = xx + (yy * gap length)
*
* Minimum score for valid alignment
* ---------------------------------
*
* MIN=xx,yy (defaults: MIN=-0.6,-0.6, or MIN=0.0,0.66 if --local is set)
*
* xx,yy = For a read of length N, the total score must be at least
* xx + (read length * yy) for the alignment to be valid. The
* total score is the sum of all negative penalties (from
* mismatches and gaps) and all positive bonuses. The minimum
* can be negative (and is by default in global alignment mode).
*
* N ceiling
* ---------
*
* NCEIL=xx,yy (default: NCEIL=0.0,0.15)
*
* xx,yy = For a read of length N, the number of alignment
* positions with an N in either the read or the
* reference cannot exceed
* ceiling = xx + (read length * yy). If the ceiling is
* exceeded, the alignment is considered invalid.
*
* Seeds
* -----
*
* SEED=mm,len,ival (default: SEED=0,22)
*
* mm = Maximum number of mismatches allowed within a seed.
* Must be >= 0 and <= 2. Note that 2-mismatch mode is
* not fully sensitive; i.e. some 2-mismatch seed
* alignments may be missed.
* len = Length of seed.
* ival = Interval between seeds. If not specified, seed
* interval is determined by IVAL.
*
* Seed interval
* -------------
*
* IVAL={L|S|C},xx,yy (default: IVAL=S,1.0,0.0)
*
* L = let interval between seeds be a linear function of the
* read length. xx and yy are the constant and linear
* coefficients respectively. In other words, the interval
* equals a * len + b, where len is the read length.
* Intervals less than 1 are rounded up to 1.
* S = let interval between seeds be a function of the square
* root of the read length. xx and yy are the
* coefficients. In other words, the interval equals
* a * sqrt(len) + b, where len is the read length.
* Intervals less than 1 are rounded up to 1.
* C = Like S but uses cube root of length instead of square
* root.
*
* Example 1:
*
* SEED=1,10,5 and read sequence is TGCTATCGTACGATCGTACA:
*
* The following seeds are extracted from the forward
* representation of the read and aligned to the reference
* allowing up to 1 mismatch:
*
* Read: TGCTATCGTACGATCGTACA
*
* Seed 1+: TGCTATCGTA
* Seed 2+: TCGTACGATC
* Seed 3+: CGATCGTACA
*
* ...and the following are extracted from the reverse-complement
* representation of the read and aligned to the reference allowing
* up to 1 mismatch:
*
* Seed 1-: TACGATAGCA
* Seed 2-: GATCGTACGA
* Seed 3-: TGTACGATCG
*
* Example 2:
*
* SEED=1,20,20 and read sequence is TGCTATCGTACGATC. The seed
* length is 20 but the read is only 15 characters long. In this
* case, Bowtie2 automatically shrinks the seed length to be equal
* to the read length.
*
* Read: TGCTATCGTACGATC
*
* Seed 1+: TGCTATCGTACGATC
* Seed 1-: GATCGTACGATAGCA
*
* Example 3:
*
* SEED=1,10,10 and read sequence is TGCTATCGTACGATC. Only one seed
* fits on the read; a second seed would overhang the end of the read
* by 5 positions. In this case, Bowtie2 extracts one seed.
*
* Read: TGCTATCGTACGATC
*
* Seed 1+: TGCTATCGTA
* Seed 1-: TACGATAGCA
*/
static void parseString(
const std::string& s,
bool local,
bool noisyHpolymer,
bool ignoreQuals,
int& bonusMatchType,
int& bonusMatch,
int& penMmcType,
int& penMmcMax,
int& penMmcMin,
int& penScMax,
int& penScMin,
int& penNType,
int& penN,
int& penRdExConst,
int& penRfExConst,
int& penRdExLinear,
int& penRfExLinear,
SimpleFunc& costMin,
SimpleFunc& nCeil,
bool& nCatPair,
int& multiseedMms,
int& multiseedLen,
SimpleFunc& multiseedIval,
size_t& failStreak,
size_t& seedRounds,
SimpleFunc* penCanIntronLen = NULL,
SimpleFunc* penNoncanIntronLen = NULL);
};
#endif /*ndef ALIGNER_SEED_POLICY_H_*/
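
Examples 1 to 3 in the parseString comment can be reproduced with a few lines of code. The sketch below is an illustration only, under two assumptions taken from those examples: seeds start every ival positions and must fit entirely on the read, and each "-" seed is the reverse complement of the corresponding "+" seed. The names revComp, SeedSet and extractSeeds are hypothetical and are not part of the header.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: reverse complement of a DNA string (A<->T, C<->G).
// Any other character (e.g. N) is passed through unchanged.
static std::string revComp(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default: break;
        }
    }
    return rc;
}

// Hypothetical container for the "+" and "-" seeds of one read.
struct SeedSet {
    std::vector<std::string> fw; // seeds from the forward read
    std::vector<std::string> rc; // reverse complement of each forward seed
};

// Extract seeds of length `len` every `ival` positions.
static SeedSet extractSeeds(const std::string& read, size_t len, size_t ival) {
    SeedSet out;
    len = std::min(len, read.size()); // Example 2: shrink seed to read length
    if (len == 0) return out;         // nothing to extract
    if (ival < 1) ival = 1;           // intervals below 1 are raised to 1
    // Example 3: a seed that would overhang the end of the read is dropped.
    for (size_t off = 0; off + len <= read.size(); off += ival) {
        std::string seed = read.substr(off, len);
        out.fw.push_back(seed);
        out.rc.push_back(revComp(seed));
    }
    return out;
}
```

Calling extractSeeds("TGCTATCGTACGATCGTACA", 10, 5) yields the three forward and three reverse-complement seeds listed in Example 1.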

3214
aligner_sw.cpp Normal file

File diff suppressed because it is too large

648
aligner_sw.h Normal file

@ -0,0 +1,648 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
/*
* aligner_sw.h
*
* Classes and routines for solving dynamic programming problems in aid of read
* alignment. Goals include the ability to handle:
*
* - Both read alignment, where the query must align end-to-end, and local
* alignment, where we seek a high-scoring alignment that need not involve
* the entire query.
* - Situations where: (a) we've found a seed hit and are trying to extend it
* into a larger hit, (b) we've found an alignment for one mate of a pair and
* are trying to find a nearby alignment for the other mate, (c) we're
* aligning against an entire reference sequence.
* - Caller-specified indicators for what columns of the dynamic programming
* matrix we are allowed to start in or end in.
*
* TODO:
*
* - A slicker way to filter out alignments that violate a ceiling placed on
* the number of Ns permitted in the reference portion of the alignment.
* Right now we accomplish this by masking out ending columns that correspond
* to *ungapped* alignments with too many Ns. This results in false
* positives and false negatives for gapped alignments. The margin of error
* (# of Ns by which we might miscount) is bounded by the number of gaps.
*/
/**
* |-maxgaps-|
* ***********oooooooooooooooooooooo -
* ***********ooooooooooooooooooooo |
* ***********oooooooooooooooooooo |
* ***********ooooooooooooooooooo |
* ***********oooooooooooooooooo |
* ***********ooooooooooooooooo read len
* ***********oooooooooooooooo |
* ***********ooooooooooooooo |
* ***********oooooooooooooo |
* ***********ooooooooooooo |
* ***********oooooooooooo -
* |-maxgaps-|
* |-readlen-|
* |-------skip--------|
*/
#ifndef ALIGNER_SW_H_
#define ALIGNER_SW_H_
#define INLINE_CUPS
#include <stdint.h>
#include <iostream>
#include <limits>
#include "threading.h"
#include <emmintrin.h>
#include "aligner_sw_common.h"
#include "aligner_sw_nuc.h"
#include "ds.h"
#include "aligner_seed.h"
#include "reference.h"
#include "random_source.h"
#include "mem_ids.h"
#include "aligner_result.h"
#include "mask.h"
#include "dp_framer.h"
#include "aligner_swsse.h"
#include "aligner_bt.h"
#define QUAL2(d, f) sc_->mm((int)(*rd_)[rdi_ + d], \
(int) rf_ [rfi_ + f], \
(int)(*qu_)[rdi_ + d] - 33)
#define QUAL(d) sc_->mm((int)(*rd_)[rdi_ + d], \
(int)(*qu_)[rdi_ + d] - 33)
#define N_SNP_PEN(c) (((int)rf_[rfi_ + c] > 15) ? sc_->n(30) : sc_->penSnp)
/**
* SwAligner
* =========
*
* Encapsulates facilities for alignment using dynamic programming. Handles
* alignment of nucleotide reads against known reference nucleotides.
*
* The class is stateful. First the user must call init() to initialize the
* object with details regarding the dynamic programming problem to be solved.
* Next, the user calls align() to fill the dynamic programming matrix and
* calculate summaries describing the solutions. Finally the user calls
* nextAlignment(...), perhaps repeatedly, to populate the SwResult object with
* the next result. Results are dispensed in best-to-worst, left-to-right
* order.
*
* The class expects the read string, quality string, and reference string
* provided by the caller live at least until the user is finished aligning and
* obtaining alignments from this object.
*
* There is a design tradeoff between hiding/exposing details of the genome and
* its strands to the SwAligner. In a sense, a better design is to hide
* details such as the id of the reference sequence aligned to, or whether
* we're aligning the read in its original forward orientation or its reverse
* complement. But this means that any alignment results returned by SwAligner
* have to be extended to include those details before they're useful to the
* caller. We opt for messy but expedient - the reference id and orientation
* of the read are given to SwAligner, remembered, and used to populate
* SwResults.
*
* LOCAL VS GLOBAL
*
* The dynamic programming aligner supports both local and global alignment,
* and one option in between. The modes differ in three settings: (a)
* whether negative scores are allowed (i.e. not clamped up to 0), (b)
* whether rows other than the last row are checked for acceptable
* solutions, and (c) whether a bonus is added to the score for matches.
*
* For global alignment, we:
*
* (a) Allow negative scores
* (b) Check only in the last row
* (c) Either add a bonus for matches or not (doesn't matter)
*
* For local alignment, we:
*
* (a) Clamp scores to 0
* (b) Check in any row for a sufficiently high score
* (c) Add a bonus for matches
*
* An in-between solution is to allow alignments to be curtailed on the
* right-hand side if a better score can be achieved thereby, but not on the
* left. For this, we:
*
* (a) Allow negative scores
* (b) Check in any row for a sufficiently high score
* (c) Either add a bonus for matches or not (doesn't matter)
*
* REDUNDANT ALIGNMENTS
*
* When are two alignments distinct and when are they redundant (not distinct)?
* At one extreme, we might say the best alignment from any given dynamic
* programming problem is redundant with all other alignments from that
* problem. At the other extreme, we might say that any two alignments with
* distinct starting points and edits are distinct. The former is probably too
* conservative for mate-finding DP problems. The latter is certainly too
* permissive, since two alignments that differ only in how gaps are arranged
* should not be considered distinct.
*
* Some in-between solutions are:
*
* (a) If two alignments share an end point on either end, they are redundant.
* Otherwise, they are distinct.
* (b) If two alignments share *both* end points, they are redundant.
* (c) If two alignments share any cells in the DP table, they are redundant.
* (d) Two alignments are redundant if either end is within N positions of
* the corresponding end of the other.
* (e) Like (d) but both instead of either
* (f, g) Like d, e, but where N is tied to maxgaps somehow
*
* Why not (a)? One reason is that it's possible for two alignments to have
* different start & end positions but share many cells. Consider alignments 1
* and 2 below; their end-points are labeled.
*
* 1 2
* \ \
* -\
* \
* \
* \
* -\
* \ \
* 1 2
*
* 1 and 2 are distinct according to (a) but they share many cells in common.
*
* Why not (f, g)? It fixes the problem with (a) above by forcing the
* alignments to be spread so far that they can't possibly share diagonal cells
* in common.
*/
class SwAligner {
typedef std::pair<size_t, size_t> SizeTPair;
// States that the aligner can be in
enum {
STATE_UNINIT, // init() hasn't been called yet
STATE_INITED, // init() has been called, but not align()
STATE_ALIGNED, // align() has been called
};
const static size_t ALPHA_SIZE = 5;
public:
explicit SwAligner() :
sseU8fw_(DP_CAT),
sseU8rc_(DP_CAT),
sseI16fw_(DP_CAT),
sseI16rc_(DP_CAT),
state_(STATE_UNINIT),
initedRead_(false),
readSse16_(false),
initedRef_(false),
rfwbuf_(DP_CAT),
btnstack_(DP_CAT),
btcells_(DP_CAT),
btdiag_(),
btncand_(DP_CAT),
btncanddone_(DP_CAT),
btncanddoneSucc_(0),
btncanddoneFail_(0),
cper_(),
cperMinlen_(),
cperPerPow2_(),
cperEf_(),
cperTri_(),
colstop_(0),
lastsolcol_(0),
cural_(0)
ASSERT_ONLY(, cand_tmp_(DP_CAT))
{ }
/**
* Prepare the dynamic programming driver with a new read and a new scoring
* scheme.
*/
void initRead(
const BTDnaString& rdfw, // read sequence for fw read
const BTDnaString& rdrc, // read sequence for rc read
const BTString& qufw, // read qualities for fw read
const BTString& qurc, // read qualities for rc read
size_t rdi, // offset of first read char to align
size_t rdf, // offset of last read char to align
const Scoring& sc); // scoring scheme
/**
* Initialize with a new alignment problem.
*/
void initRef(
bool fw, // whether the forward or revcomp read is being aligned
TRefId refidx, // id of reference aligned against
const DPRect& rect, // DP rectangle
char *rf, // reference sequence
size_t rfi, // offset of first reference char to align to
size_t rff, // offset of last reference char to align to
TRefOff reflen, // length of reference sequence
const Scoring& sc, // scoring scheme
TAlScore minsc, // minimum score
bool enable8, // use 8-bit SSE if possible?
size_t cminlen, // minimum length for using checkpointing scheme
size_t cpow2, // interval b/t checkpointed diags; 1 << this
bool doTri, // triangular mini-fills?
bool extend); // true iff this is a seed extension
/**
* Given a read, an alignment orientation, a range of characters in a
* reference sequence, and a bit-encoded version of the reference,
* execute the corresponding dynamic programming problem.
*
* Here we expect that the caller has already narrowed down the relevant
* portion of the reference (e.g. using a seed hit) and all we do is
* banded dynamic programming in the vicinity of that portion. This is not
* the function to call if we are trying to solve the whole alignment
* problem with dynamic programming (that is TODO).
*
* Returns true if an alignment was found, false otherwise.
*/
void initRef(
bool fw, // whether the forward or revcomp read is being aligned
TRefId refidx, // reference aligned against
const DPRect& rect, // DP rectangle
const BitPairReference& refs, // Reference strings
TRefOff reflen, // length of reference sequence
const Scoring& sc, // scoring scheme
TAlScore minsc, // minimum alignment score
bool enable8, // use 8-bit SSE if possible?
size_t cminlen, // minimum length for using checkpointing scheme
size_t cpow2, // interval b/t checkpointed diags; 1 << this
bool doTri, // triangular mini-fills?
bool extend, // true iff this is a seed extension
size_t upto, // count the number of Ns up to this offset
size_t& nsUpto); // output: the number of Ns up to 'upto'
/**
* Given a read, an alignment orientation, a range of characters in a
* reference sequence, and a bit-encoded version of the reference, set up
* and execute the corresponding ungapped alignment problem. There can
* only be one solution.
*
* The caller has already narrowed down the relevant portion of the
* reference using, e.g., the location of a seed hit, or the range of
* possible fragment lengths if we're searching for the opposite mate in a
* pair.
*/
int ungappedAlign(
const BTDnaString& rd, // read sequence (could be RC)
const BTString& qu, // qual sequence (could be rev)
const Coord& coord, // coordinate aligned to
const BitPairReference& refs, // Reference strings
size_t reflen, // length of reference sequence
const Scoring& sc, // scoring scheme
bool ohang, // allow overhang?
TAlScore minsc, // minimum score
SwResult& res); // put alignment result here
/**
* Align read 'rd' to reference using read & reference information given
* last time init() was called. Uses dynamic programming.
*/
bool align(RandomSource& rnd, TAlScore& best);
/**
* Populate the given SwResult with information about the "next best"
* alignment if there is one. If there isn't one, false is returned. Note
* that false might be returned even though a call to done() would have
* returned false.
*/
bool nextAlignment(
SwResult& res,
TAlScore minsc,
RandomSource& rnd);
/**
* Print out an alignment result as an ASCII DP table.
*/
void printResultStacked(
const SwResult& res,
std::ostream& os)
{
res.alres.printStacked(*rd_, os);
}
/**
* Return true iff there are no more solution cells to backtrace from.
* Note that this may return false in situations where there are actually
* no more solutions, but that hasn't been discovered yet.
*/
bool done() const {
assert(initedRead() && initedRef());
return cural_ == btncand_.size();
}
/**
* Return true iff this SwAligner has been initialized with a reference to
* align against.
*/
inline bool initedRef() const { return initedRef_; }
/**
* Return true iff this SwAligner has been initialized with a read to align.
*/
inline bool initedRead() const { return initedRead_; }
/**
* Reset, signaling that we're done with this dynamic programming problem
* and won't be asking for any more alignments.
*/
inline void reset() { initedRef_ = initedRead_ = false; }
#ifndef NDEBUG
/**
* Check that aligner is internally consistent.
*/
bool repOk() const {
assert_gt(dpRows(), 0);
// Check btncand_
for(size_t i = 0; i < btncand_.size(); i++) {
assert(btncand_[i].repOk());
assert_geq(btncand_[i].score, minsc_);
}
return true;
}
#endif
/**
* Return the number of alignments given out so far by nextAlignment().
*/
size_t numAlignmentsReported() const { return cural_; }
/**
* Merge tallies in the counters related to filling the DP table.
*/
void merge(
SSEMetrics& sseU8ExtendMet,
SSEMetrics& sseU8MateMet,
SSEMetrics& sseI16ExtendMet,
SSEMetrics& sseI16MateMet,
uint64_t& nbtfiltst,
uint64_t& nbtfiltsc,
uint64_t& nbtfiltdo)
{
sseU8ExtendMet.merge(sseU8ExtendMet_);
sseU8MateMet.merge(sseU8MateMet_);
sseI16ExtendMet.merge(sseI16ExtendMet_);
sseI16MateMet.merge(sseI16MateMet_);
nbtfiltst += nbtfiltst_;
nbtfiltsc += nbtfiltsc_;
nbtfiltdo += nbtfiltdo_;
}
/**
* Reset all the counters related to filling in the DP table to 0.
*/
void resetCounters() {
sseU8ExtendMet_.reset();
sseU8MateMet_.reset();
sseI16ExtendMet_.reset();
sseI16MateMet_.reset();
nbtfiltst_ = nbtfiltsc_ = nbtfiltdo_ = 0;
}
/**
* Return the size of the DP problem.
*/
size_t size() const {
return dpRows() * (rff_ - rfi_);
}
protected:
/**
* Return the number of rows that will be in the dynamic programming table.
*/
inline size_t dpRows() const {
assert(initedRead_);
return rdf_ - rdi_;
}
/**
* Align nucleotides from read 'rd' to the reference string 'rf' using
* vector instructions. Return the score of the best alignment found, or
* the minimum integer if an alignment could not be found. Flag is set to
* 0 if an alignment is found, -1 if no valid alignment is found, or -2 if
* the score saturated at any point during alignment.
*/
TAlScore alignNucleotidesEnd2EndSseU8( // unsigned 8-bit elements
int& flag, bool debug);
TAlScore alignNucleotidesLocalSseU8( // unsigned 8-bit elements
int& flag, bool debug);
TAlScore alignNucleotidesEnd2EndSseI16( // signed 16-bit elements
int& flag, bool debug);
TAlScore alignNucleotidesLocalSseI16( // signed 16-bit elements
int& flag, bool debug);
/**
* Aligns by filling a dynamic programming matrix with the SSE-accelerated,
* banded DP approach of Farrar. As it goes, it determines which cells we
* might backtrace from and tallies the best (highest-scoring) N backtrace
* candidate cells per diagonal. Also returns the alignment score of the best
* alignment in the matrix.
*
* This routine does *not* maintain a matrix holding the entire matrix worth of
* scores, nor does it maintain any other dense O(mn) data structure, as this
* would quickly exhaust memory for queries longer than about 10,000 kb.
* Instead, in the fill stage it maintains two columns worth of scores at a
* time (current/previous, or right/left) - these take O(m) space. When
* finished with the current column, it determines which cells from the
* previous column, if any, are candidates we might backtrace from to find a
* full alignment. A candidate cell has a score that rises above the threshold
* and isn't improved upon by a match in the next column. The best N
* candidates per diagonal are stored in a O(m + n) data structure.
*/
TAlScore alignGatherEE8( // unsigned 8-bit elements
int& flag, bool debug);
TAlScore alignGatherLoc8( // unsigned 8-bit elements
int& flag, bool debug);
TAlScore alignGatherEE16( // signed 16-bit elements
int& flag, bool debug);
TAlScore alignGatherLoc16( // signed 16-bit elements
int& flag, bool debug);
/**
* Build query profile look up tables for the read. The query profile look
* up table is organized as a 1D array indexed by [i][j] where i is the
* reference character in the current DP column (0=A, 1=C, etc), and j is
* the segment of the query we're currently working on.
*/
void buildQueryProfileEnd2EndSseU8(bool fw);
void buildQueryProfileLocalSseU8(bool fw);
/**
* Build query profile look up tables for the read. The query profile look
* up table is organized as a 1D array indexed by [i][j] where i is the
* reference character in the current DP column (0=A, 1=C, etc), and j is
* the segment of the query we're currently working on.
*/
void buildQueryProfileEnd2EndSseI16(bool fw);
void buildQueryProfileLocalSseI16(bool fw);
bool gatherCellsNucleotidesLocalSseU8(TAlScore best);
bool gatherCellsNucleotidesEnd2EndSseU8(TAlScore best);
bool gatherCellsNucleotidesLocalSseI16(TAlScore best);
bool gatherCellsNucleotidesEnd2EndSseI16(TAlScore best);
bool backtraceNucleotidesLocalSseU8(
TAlScore escore, // in: expected score
SwResult& res, // out: store results (edits and scores) here
size_t& off, // out: store diagonal projection of origin
size_t& nbts, // out: # backtracks
size_t row, // start in this rectangle row
size_t col, // start in this rectangle column
RandomSource& rand); // random gen, to choose among equal paths
bool backtraceNucleotidesLocalSseI16(
TAlScore escore, // in: expected score
SwResult& res, // out: store results (edits and scores) here
size_t& off, // out: store diagonal projection of origin
size_t& nbts, // out: # backtracks
size_t row, // start in this rectangle row
size_t col, // start in this rectangle column
RandomSource& rand); // random gen, to choose among equal paths
bool backtraceNucleotidesEnd2EndSseU8(
TAlScore escore, // in: expected score
SwResult& res, // out: store results (edits and scores) here
size_t& off, // out: store diagonal projection of origin
size_t& nbts, // out: # backtracks
size_t row, // start in this rectangle row
size_t col, // start in this rectangle column
RandomSource& rand); // random gen, to choose among equal paths
bool backtraceNucleotidesEnd2EndSseI16(
TAlScore escore, // in: expected score
SwResult& res, // out: store results (edits and scores) here
size_t& off, // out: store diagonal projection of origin
size_t& nbts, // out: # backtracks
size_t row, // start in this rectangle row
size_t col, // start in this rectangle column
RandomSource& rand); // random gen, to choose among equal paths
bool backtrace(
TAlScore escore, // in: expected score
bool fill, // in: use mini-fill?
bool usecp, // in: use checkpoints?
SwResult& res, // out: store results (edits and scores) here
size_t& off, // out: store diagonal projection of origin
size_t row, // start in this rectangle row
size_t col, // start in this rectangle column
size_t maxiter,// max # extensions to try
size_t& niter, // # extensions tried
RandomSource& rnd) // random gen, to choose among equal paths
{
bter_.initBt(
escore, // in: alignment score
row, // in: start in this row
col, // in: start in this column
fill, // in: use mini-fill?
usecp, // in: use checkpoints?
cperTri_, // in: triangle-shaped mini-fills?
rnd); // in: random gen, to choose among equal paths
assert(bter_.inited());
size_t nrej = 0;
if(bter_.emptySolution()) {
return false;
} else {
return bter_.nextAlignment(maxiter, res, off, nrej, niter, rnd);
}
}
const BTDnaString *rd_; // read sequence
const BTString *qu_; // read qualities
const BTDnaString *rdfw_; // read sequence for fw read
const BTDnaString *rdrc_; // read sequence for rc read
const BTString *qufw_; // read qualities for fw read
const BTString *qurc_; // read qualities for rc read
TReadOff rdi_; // offset of first read char to align
TReadOff rdf_; // offset of last read char to align
bool fw_; // true iff read sequence is original fw read
TRefId refidx_; // id of reference aligned against
TRefOff reflen_; // length of entire reference sequence
const DPRect* rect_; // DP rectangle
char *rf_; // reference sequence
TRefOff rfi_; // offset of first ref char to align to
TRefOff rff_; // offset of last ref char to align to (excl)
size_t rdgap_; // max # gaps in read
size_t rfgap_; // max # gaps in reference
bool enable8_;// enable 8-bit sse
bool extend_; // true iff this is a seed-extend problem
const Scoring *sc_; // penalties for edit types
TAlScore minsc_; // penalty ceiling for valid alignments
int nceil_; // max # Ns allowed in ref portion of aln
bool sse8succ_; // whether 8-bit worked
bool sse16succ_; // whether 16-bit worked
SSEData sseU8fw_; // buf for fw query, 8-bit score
SSEData sseU8rc_; // buf for rc query, 8-bit score
SSEData sseI16fw_; // buf for fw query, 16-bit score
SSEData sseI16rc_; // buf for rc query, 16-bit score
bool sseU8fwBuilt_; // built fw query profile, 8-bit score
bool sseU8rcBuilt_; // built rc query profile, 8-bit score
bool sseI16fwBuilt_; // built fw query profile, 16-bit score
bool sseI16rcBuilt_; // built rc query profile, 16-bit score
SSEMetrics sseU8ExtendMet_;
SSEMetrics sseU8MateMet_;
SSEMetrics sseI16ExtendMet_;
SSEMetrics sseI16MateMet_;
int state_; // state
bool initedRead_; // true iff initialized with initRead
bool readSse16_; // true -> sse16 from now on for read
bool initedRef_; // true iff initialized with initRef
EList<uint32_t> rfwbuf_; // buffer for wordized ref stretches
EList<DpNucFrame> btnstack_; // backtrace stack for nucleotides
EList<SizeTPair> btcells_; // cells involved in current backtrace
NBest<DpBtCandidate> btdiag_; // per-diagonal backtrace candidates
EList<DpBtCandidate> btncand_; // cells we might backtrace from
EList<DpBtCandidate> btncanddone_; // candidates that we investigated
size_t btncanddoneSucc_; // # investigated and succeeded
size_t btncanddoneFail_; // # investigated and failed
BtBranchTracer bter_; // backtracer
Checkpointer cper_; // structure for saving checkpoint cells
size_t cperMinlen_; // minimum length for using checkpointer
size_t cperPerPow2_; // checkpoint every 1 << perpow2 diags (& next)
bool cperEf_; // store E and F in addition to H?
bool cperTri_; // checkpoint for triangular mini-fills?
size_t colstop_; // bailed on DP loop after this many cols
size_t lastsolcol_; // last DP col with valid cell
size_t cural_; // index of next alignment to be given
uint64_t nbtfiltst_; // # candidates filtered b/c starting cell was seen
uint64_t nbtfiltsc_; // # candidates filtered b/c score uninteresting
uint64_t nbtfiltdo_; // # candidates filtered b/c dominated by other cell
ASSERT_ONLY(SStringExpandable<uint32_t> tmp_destU32_);
ASSERT_ONLY(BTDnaString tmp_editstr_, tmp_refstr_);
ASSERT_ONLY(EList<DpBtCandidate> cand_tmp_);
};
#endif /*ALIGNER_SW_H_*/
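
The LOCAL VS GLOBAL notes in the SwAligner comment describe three modes that differ in whether scores are clamped to 0 and in which rows a solution may end. The sketch below is a toy illustration of those two switches, not the SwAligner implementation: it fills a dense O(mn) table with a single linear gap penalty, whereas the class itself uses SSE-accelerated fills and explicitly avoids dense tables. The names DpMode and bestScore, and the simplified scoring parameters, are assumptions made for this example.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical switches corresponding to points (a) and (b) in the comment.
struct DpMode {
    bool clampToZero; // (a) clamp negative scores up to 0?
    bool anyRow;      // (b) accept a solution in any row, not only the last?
};

// Return the best score of `read` (rows) against `ref` (columns).
// The alignment may start in any reference column (row 0 is all zeros).
static int bestScore(const std::string& read, const std::string& ref,
                     DpMode mode, int bonus, int mmPen, int gapPen) {
    const size_t m = read.size(), n = ref.size();
    if (m == 0) return 0;
    std::vector<std::vector<int> > H(m + 1, std::vector<int>(n + 1, 0));
    for (size_t i = 1; i <= m; i++) {
        // Column 0: every read character so far is aligned to a gap.
        H[i][0] = mode.clampToZero ? 0 : -static_cast<int>(i) * gapPen;
    }
    int best = INT_MIN;
    for (size_t i = 1; i <= m; i++) {
        for (size_t j = 0; j <= n; j++) {
            if (j > 0) {
                int diag = H[i - 1][j - 1] +
                           (read[i - 1] == ref[j - 1] ? bonus : -mmPen);
                int up   = H[i - 1][j] - gapPen; // read char against a gap
                int left = H[i][j - 1] - gapPen; // ref char against a gap
                int h = std::max(diag, std::max(up, left));
                if (mode.clampToZero) h = std::max(h, 0);
                H[i][j] = h;
            }
            // Global mode only inspects the last row; the others inspect all.
            if (mode.anyRow || i == m) best = std::max(best, H[i][j]);
        }
    }
    return best;
}
```

With this layout, global alignment is {false, false}, local alignment is {true, true}, and the in-between mode (curtailed on the right only) is {false, true}.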

305
aligner_sw_common.h Normal file

@ -0,0 +1,305 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNER_SW_COMMON_H_
#define ALIGNER_SW_COMMON_H_
#include "aligner_result.h"
/**
* Encapsulates the result of a dynamic programming alignment, including
* colorspace alignments. In our case, the result is a combination of:
*
* 1. All the nucleotide edits
* 2. All the "edits" where an ambiguous reference char is resolved to
* an unambiguous char.
* 3. All the color edits (if applicable)
* 4. All the color miscalls (if applicable). This is a subset of 3.
* 5. The score of the best alignment
* 6. The score of the second-best alignment
*
* Having scores for the best and second-best alignments gives us an
* idea of where gaps may make reassembly beneficial.
*/
struct SwResult {
SwResult() :
alres(),
sws(0),
swcups(0),
swrows(0),
swskiprows(0),
swskip(0),
swsucc(0),
swfail(0),
swbts(0)
{ }
/**
* Clear all contents.
*/
void reset() {
sws = swcups = swrows = swskiprows = swskip = swsucc =
swfail = swbts = 0;
alres.reset();
}
/**
* Reverse all edit lists.
*/
void reverse() {
alres.reverseEdits();
}
/**
* Return true iff no result has been installed.
*/
bool empty() const {
return alres.empty();
}
#ifndef NDEBUG
/**
* Check that result is internally consistent.
*/
bool repOk() const {
assert(alres.repOk());
return true;
}
/**
* Check that result is internally consistent w/r/t read.
*/
bool repOk(const Read& rd) const {
assert(alres.repOk(rd));
return true;
}
#endif
AlnRes alres;
uint64_t sws; // # DP problems solved
uint64_t swcups; // # DP cell updates
uint64_t swrows; // # DP row updates
uint64_t swskiprows; // # skipped DP row updates (b/c no valid alignments can go thru row)
uint64_t swskip; // # DP problems skipped by sse filter
uint64_t swsucc; // # DP problems resulting in alignment
uint64_t swfail; // # DP problems not resulting in alignment
uint64_t swbts; // # DP backtrace steps
int nup; // upstream decoded nucleotide; for colorspace reads
int ndn; // downstream decoded nucleotide; for colorspace reads
};
/**
* Encapsulates counters that measure how much work has been done by
* the dynamic programming driver and aligner.
*/
struct SwMetrics {
SwMetrics() : mutex_m() {
reset();
}
void reset() {
sws = swcups = swrows = swskiprows = swskip = swsucc = swfail = swbts =
sws10 = sws5 = sws3 =
rshit = ungapsucc = ungapfail = ungapnodec = 0;
exatts = exranges = exrows = exsucc = exooms = 0;
mm1atts = mm1ranges = mm1rows = mm1succ = mm1ooms = 0;
sdatts = sdranges = sdrows = sdsucc = sdooms = 0;
}
void init(
uint64_t sws_,
uint64_t sws10_,
uint64_t sws5_,
uint64_t sws3_,
uint64_t swcups_,
uint64_t swrows_,
uint64_t swskiprows_,
uint64_t swskip_,
uint64_t swsucc_,
uint64_t swfail_,
uint64_t swbts_,
uint64_t rshit_,
uint64_t ungapsucc_,
uint64_t ungapfail_,
uint64_t ungapnodec_,
uint64_t exatts_,
uint64_t exranges_,
uint64_t exrows_,
uint64_t exsucc_,
uint64_t exooms_,
uint64_t mm1atts_,
uint64_t mm1ranges_,
uint64_t mm1rows_,
uint64_t mm1succ_,
uint64_t mm1ooms_,
uint64_t sdatts_,
uint64_t sdranges_,
uint64_t sdrows_,
uint64_t sdsucc_,
uint64_t sdooms_)
{
sws = sws_;
sws10 = sws10_;
sws5 = sws5_;
sws3 = sws3_;
swcups = swcups_;
swrows = swrows_;
swskiprows = swskiprows_;
swskip = swskip_;
swsucc = swsucc_;
swfail = swfail_;
swbts = swbts_;
rshit = rshit_; // previously the rshit_ argument was accepted but never stored
ungapsucc = ungapsucc_;
ungapfail = ungapfail_;
ungapnodec = ungapnodec_;
// Exact end-to-end attempts
exatts = exatts_;
exranges = exranges_;
exrows = exrows_;
exsucc = exsucc_;
exooms = exooms_;
// 1-mismatch end-to-end attempts
mm1atts = mm1atts_;
mm1ranges = mm1ranges_;
mm1rows = mm1rows_;
mm1succ = mm1succ_;
mm1ooms = mm1ooms_;
// Seed attempts
sdatts = sdatts_;
sdranges = sdranges_;
sdrows = sdrows_;
sdsucc = sdsucc_;
sdooms = sdooms_;
}
/**
* Merge (add) the counters in the given SwResult object into this
* SwMetrics object.
*/
void update(const SwResult& r) {
sws += r.sws;
swcups += r.swcups;
swrows += r.swrows;
swskiprows += r.swskiprows;
swskip += r.swskip;
swsucc += r.swsucc;
swfail += r.swfail;
swbts += r.swbts;
}
/**
* Merge (add) the counters in the given SwMetrics object into this
* object. This is the only safe way to update a SwMetrics shared
* by multiple threads.
*/
void merge(const SwMetrics& r, bool getLock = false) {
ThreadSafe ts(&mutex_m, getLock);
sws += r.sws;
sws10 += r.sws10;
sws5 += r.sws5;
sws3 += r.sws3;
swcups += r.swcups;
swrows += r.swrows;
swskiprows += r.swskiprows;
swskip += r.swskip;
swsucc += r.swsucc;
swfail += r.swfail;
swbts += r.swbts;
rshit += r.rshit;
ungapsucc += r.ungapsucc;
ungapfail += r.ungapfail;
ungapnodec += r.ungapnodec;
exatts += r.exatts;
exranges += r.exranges;
exrows += r.exrows;
exsucc += r.exsucc;
exooms += r.exooms;
mm1atts += r.mm1atts;
mm1ranges += r.mm1ranges;
mm1rows += r.mm1rows;
mm1succ += r.mm1succ;
mm1ooms += r.mm1ooms;
sdatts += r.sdatts;
sdranges += r.sdranges;
sdrows += r.sdrows;
sdsucc += r.sdsucc;
sdooms += r.sdooms;
}
void tallyGappedDp(size_t readGaps, size_t refGaps) {
size_t mx = max(readGaps, refGaps);
if(mx < 10) sws10++;
if(mx < 5) sws5++;
if(mx < 3) sws3++;
}
uint64_t sws; // # DP problems solved
uint64_t sws10; // # DP problems solved where max gaps < 10
uint64_t sws5; // # DP problems solved where max gaps < 5
uint64_t sws3; // # DP problems solved where max gaps < 3
uint64_t swcups; // # DP cell updates
uint64_t swrows; // # DP row updates
uint64_t swskiprows; // # skipped DP rows (b/c no valid alns go thru row)
uint64_t swskip; // # DP problems skipped by sse filter
uint64_t swsucc; // # DP problems resulting in alignment
uint64_t swfail; // # DP problems not resulting in alignment
uint64_t swbts; // # DP backtrace steps
uint64_t rshit; // # DP problems avoided b/c seed hit was redundant
uint64_t ungapsucc; // # ungapped alignment attempts that succeeded
uint64_t ungapfail; // # ungapped alignment attempts that failed
uint64_t ungapnodec; // # ungapped alignment attempts with no decision
uint64_t exatts; // total # attempts at exact-hit end-to-end aln
uint64_t exranges; // total # ranges returned by exact-hit queries
uint64_t exrows; // total # rows returned by exact-hit queries
uint64_t exsucc; // exact-hit yielded non-empty result
uint64_t exooms; // exact-hit offset memory exhausted
uint64_t mm1atts; // total # attempts at 1mm end-to-end aln
uint64_t mm1ranges; // total # ranges returned by 1mm-hit queries
uint64_t mm1rows; // total # rows returned by 1mm-hit queries
uint64_t mm1succ; // 1mm-hit yielded non-empty result
uint64_t mm1ooms; // 1mm-hit offset memory exhausted
uint64_t sdatts; // total # attempts to find seed alignments
uint64_t sdranges; // total # seed-alignment ranges found
uint64_t sdrows; // total # seed-alignment rows found
uint64_t sdsucc; // # times seed alignment yielded >= 1 hit
uint64_t sdooms; // # times an OOM occurred during seed alignment
MUTEX_T mutex_m;
};
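// Illustrative sketch, not part of the original header: the merge() comment
// above says that merging is the only safe way to update a shared SwMetrics.
// The usual pattern is for each thread to count into a private object and to
// fold it into the shared one under a lock. The names Counters, SharedCounters
// and mergeAll are hypothetical, and std::mutex stands in for MUTEX_T and
// ThreadSafe, whose definitions are not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical, trimmed-down counter set (two of the SwMetrics fields).
struct Counters {
    uint64_t sws = 0;     // # DP problems solved
    uint64_t swsucc = 0;  // # DP problems resulting in alignment
};

// Shared accumulator: counters plus the mutex that guards them.
struct SharedCounters {
    Counters c;
    std::mutex mtx;

    // Add another object's counters into this one; lock only when asked,
    // mirroring the getLock flag of merge().
    void merge(const Counters& o, bool getLock) {
        std::unique_lock<std::mutex> lk(mtx, std::defer_lock);
        if (getLock) lk.lock();
        c.sws += o.sws;
        c.swsucc += o.swsucc;
    }
};

// Fold a set of per-thread counter objects into one shared total.
inline Counters mergeAll(const std::vector<Counters>& locals) {
    SharedCounters shared;
    for (const Counters& l : locals) {
        shared.merge(l, true);  // one locked update per thread-local object
    }
    return shared.c;
}
```

// Counting into a private object keeps the hot loop lock-free; the lock is
// taken once per merge rather than once per counter increment.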
// The various ways that one might backtrack from a later cell (either oall,
// rdgap or rfgap) to an earlier cell
enum {
SW_BT_OALL_DIAG, // from oall cell to oall cell (diagonal move)
SW_BT_OALL_REF_OPEN, // from oall cell to oall cell (reference gap open)
SW_BT_OALL_READ_OPEN, // from oall cell to oall cell (read gap open)
SW_BT_RDGAP_EXTEND, // from rdgap cell to rdgap cell
SW_BT_RFGAP_EXTEND // from rfgap cell to rfgap cell
};
#endif /*def ALIGNER_SW_COMMON_H_*/

20
aligner_sw_driver.cpp Normal file

@ -0,0 +1,20 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/

2938
aligner_sw_driver.h Normal file

File diff suppressed because it is too large

262
aligner_sw_nuc.h Normal file

@ -0,0 +1,262 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNER_SW_NUC_H_
#define ALIGNER_SW_NUC_H_
#include <stdint.h>
#include "aligner_sw_common.h"
#include "aligner_result.h"
/**
* Encapsulates a backtrace stack frame. Includes enough information that we
* can "pop" back up to this frame and choose to make a different backtracking
* decision. The information included is:
*
* 1. The mask at the decision point. When we first move through the mask and
* when we backtrack to it, we're careful to mask out the bit corresponding
* to the path we're taking. When we move through it after removing the
* last bit from the mask, we're careful to pop it from the stack.
* 2. The sizes of the edit lists. When we backtrack, we resize the lists back
* down to these sizes to get rid of any edits introduced since the branch
* point.
*/
struct DpNucFrame {
/**
* Initialize a new DpNucFrame stack frame.
*/
void init(
size_t nedsz_,
size_t aedsz_,
size_t celsz_,
size_t row_,
size_t col_,
size_t gaps_,
size_t readGaps_,
size_t refGaps_,
AlnScore score_,
int ct_)
{
nedsz = nedsz_;
aedsz = aedsz_;
celsz = celsz_;
row = row_;
col = col_;
gaps = gaps_;
readGaps = readGaps_;
refGaps = refGaps_;
score = score_;
ct = ct_;
}
size_t nedsz; // size of the nucleotide edit list at branch (before
// adding the branch edit)
size_t aedsz; // size of ambiguous nucleotide edit list at branch
size_t celsz; // size of cell-traversed list at branch
size_t row; // row of cell where branch occurred
size_t col; // column of cell where branch occurred
size_t gaps; // number of gaps before branch occurred
size_t readGaps; // number of read gaps before branch occurred
size_t refGaps; // number of ref gaps before branch occurred
AlnScore score; // score where branch occurred
int ct; // table type (oall, rdgap or rfgap)
};
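// Illustrative sketch, not part of the original header: the DpNucFrame comment
// describes recording edit-list sizes at a branch point and resizing the lists
// back down on backtrack. A minimal version of that idea follows, assuming a
// single edit list of strings. MiniFrame and MiniBacktracer are hypothetical
// names and carry far less state than DpNucFrame.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical frame: where the branch happened and how long the edit list
// was at that moment.
struct MiniFrame {
    size_t nedsz;  // edit-list size at the branch
    size_t row;    // row of the branching cell
    size_t col;    // column of the branching cell
};

struct MiniBacktracer {
    std::vector<std::string> edits;  // stand-in for the nucleotide edit list
    std::vector<MiniFrame> stack;    // branch points we can return to

    // Remember the current state before committing to one of several moves.
    void branch(size_t row, size_t col) {
        stack.push_back(MiniFrame{edits.size(), row, col});
    }

    void addEdit(const std::string& e) { edits.push_back(e); }

    // Return to the most recent branch point and drop every edit added since.
    // Returns false if there is no branch point left.
    bool backtrack(size_t& row, size_t& col) {
        if (stack.empty()) return false;
        MiniFrame f = stack.back();
        stack.pop_back();
        edits.resize(f.nedsz);
        row = f.row;
        col = f.col;
        return true;
    }
};
```

// Storing only sizes keeps frames small: undoing a branch is a resize, not a
// copy of the whole edit list.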
enum {
BT_CAND_FATE_SUCCEEDED = 1,
BT_CAND_FATE_FAILED,
BT_CAND_FATE_FILT_START, // skipped b/c starting cell already explored
BT_CAND_FATE_FILT_DOMINATED, // skipped b/c it was dominated
BT_CAND_FATE_FILT_SCORE // skipped b/c score not interesting anymore
};
/**
* Encapsulates a cell that we might want to backtrace from.
*/
struct DpBtCandidate {
DpBtCandidate() { reset(); }
DpBtCandidate(size_t row_, size_t col_, TAlScore score_) {
init(row_, col_, score_);
}
void reset() { init(0, 0, 0); }
void init(size_t row_, size_t col_, TAlScore score_) {
row = row_;
col = col_;
score = score_;
// 0 = invalid; this should be set later according to what happens
// before / during the backtrace
fate = 0;
}
/**
* Return true iff this candidate is (heuristically) dominated by the given
* candidate. As implemented, this is a pure proximity test: the two
* candidates must lie within SQ (= 40) rows and SQ columns of each other,
* in either direction. Scores are not compared here, so any score condition
* has to be enforced by the caller.
*/
inline bool dominatedBy(const DpBtCandidate& o) {
const size_t SQ = 40;
size_t rowhi = row;
size_t rowlo = o.row;
if(rowhi < rowlo) swap(rowhi, rowlo);
size_t colhi = col;
size_t collo = o.col;
if(colhi < collo) swap(colhi, collo);
return (colhi - collo) <= SQ &&
(rowhi - rowlo) <= SQ;
}
/**
* Return true if this candidate is "greater than" (should be considered
* later than) the given candidate.
*/
bool operator>(const DpBtCandidate& o) const {
if(score < o.score) return true;
if(score > o.score) return false;
if(row < o.row ) return true;
if(row > o.row ) return false;
if(col < o.col ) return true;
if(col > o.col ) return false;
return false;
}
/**
* Return true if this candidate is "less than" (should be considered
* sooner than) the given candidate.
*/
bool operator<(const DpBtCandidate& o) const {
if(score > o.score) return true;
if(score < o.score) return false;
if(row > o.row ) return true;
if(row < o.row ) return false;
if(col > o.col ) return true;
if(col < o.col ) return false;
return false;
}
/**
* Return true if this candidate equals the given candidate.
*/
bool operator==(const DpBtCandidate& o) const {
return row == o.row &&
col == o.col &&
score == o.score;
}
bool operator>=(const DpBtCandidate& o) const { return !((*this) < o); }
bool operator<=(const DpBtCandidate& o) const { return !((*this) > o); }
#ifndef NDEBUG
/**
* Check internal consistency.
*/
bool repOk() const {
assert(VALID_SCORE(score));
return true;
}
#endif
size_t row; // cell row
size_t col; // cell column w/r/t LHS of rectangle
TAlScore score; // score of alignment
int fate; // flag indicating whether we succeeded, failed, skipped
};
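// Illustrative sketch, not part of the original header: one way the ordering
// and the proximity test of DpBtCandidate could be combined is to sort
// candidates best-first and then drop any candidate that lies close to one
// already kept. MiniCand, near and prioritize are hypothetical names, and the
// driver that actually consumes DpBtCandidate is not shown here, so this is an
// assumption about usage rather than a description of it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct MiniCand {
    size_t row;
    size_t col;
    int64_t score;

    // "Less than" means "try sooner": higher score first, then higher row,
    // then higher column, matching DpBtCandidate::operator<.
    bool operator<(const MiniCand& o) const {
        if (score != o.score) return score > o.score;
        if (row != o.row) return row > o.row;
        return col > o.col;
    }

    // True iff the two cells are within sq rows and sq columns of each other.
    bool near(const MiniCand& o, size_t sq) const {
        size_t dr = row > o.row ? row - o.row : o.row - row;
        size_t dc = col > o.col ? col - o.col : o.col - col;
        return dr <= sq && dc <= sq;
    }
};

// Sort best-first, then keep a candidate only if no already-kept (and
// therefore at-least-as-good) candidate is nearby.
inline std::vector<MiniCand> prioritize(std::vector<MiniCand> cands, size_t sq) {
    std::sort(cands.begin(), cands.end());
    std::vector<MiniCand> kept;
    for (const MiniCand& c : cands) {
        bool dominated = false;
        for (const MiniCand& k : kept) {
            if (c.near(k, sq)) { dominated = true; break; }
        }
        if (!dominated) kept.push_back(c);
    }
    return kept;
}
```

// Because the list is sorted before filtering, the proximity test alone is
// enough: anything already kept scores at least as well as the newcomer.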
template <typename T>
class NBest {
public:
NBest() { nelt_ = nbest_ = n_ = 0; }
bool inited() const { return nelt_ > 0; }
void init(size_t nelt, size_t nbest) {
nelt_ = nelt;
nbest_ = nbest;
elts_.resize(nelt * nbest);
ncur_.resize(nelt);
ncur_.fill(0);
n_ = 0;
}
/**
* Add a new result to bin 'elt'. Where it gets prioritized in the list of
* results in that bin depends on the result of operator>.
*/
bool add(size_t elt, const T& o) {
assert_lt(elt, nelt_);
const size_t ncur = ncur_[elt];
assert_leq(ncur, nbest_);
n_++;
for(size_t i = 0; i < nbest_ && i <= ncur; i++) {
if(o > elts_[nbest_ * elt + i] || i >= ncur) {
// Insert it here
// Move everyone from here on down by one slot
for(int j = (int)ncur; j > (int)i; j--) {
if(j < (int)nbest_) {
elts_[nbest_ * elt + j] = elts_[nbest_ * elt + j - 1];
}
}
elts_[nbest_ * elt + i] = o;
if(ncur < nbest_) {
ncur_[elt]++;
}
return true;
}
}
return false;
}
/**
* Return true iff there are no solutions.
*/
bool empty() const {
return n_ == 0;
}
/**
* Dump all the items in our payload into the given EList.
*/
template<typename TList>
void dump(TList& l) const {
if(empty()) return;
for(size_t i = 0; i < nelt_; i++) {
assert_leq(ncur_[i], nbest_);
for(size_t j = 0; j < ncur_[i]; j++) {
l.push_back(elts_[i * nbest_ + j]);
}
}
}
protected:
size_t nelt_;
size_t nbest_;
EList<T> elts_;
EList<size_t> ncur_;
size_t n_; // total # results added
};
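// Illustrative sketch, not part of the original header: NBest keeps a bounded,
// ordered list of results per bin. The same idea can be written with
// std::vector and std::upper_bound. MiniNBest is a hypothetical name; it orders
// items with operator< (smallest first), which is a simplification and not the
// exact comparison that NBest::add performs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T>
class MiniNBest {
public:
    MiniNBest(size_t nbins, size_t nbest) : nbest_(nbest), bins_(nbins) {}

    // Insert o into the given bin, keeping the bin sorted and capped at
    // nbest_ items. Returns false if o did not make the cut.
    bool add(size_t bin, const T& o) {
        std::vector<T>& b = bins_.at(bin);
        typename std::vector<T>::iterator pos =
            std::upper_bound(b.begin(), b.end(), o);
        if (pos == b.end() && b.size() >= nbest_) {
            return false;  // worse than everything in a full bin
        }
        b.insert(pos, o);
        if (b.size() > nbest_) b.pop_back();  // evict the worst item
        return true;
    }

    // Concatenate the contents of all bins, bin by bin.
    std::vector<T> dump() const {
        std::vector<T> out;
        for (const std::vector<T>& b : bins_) {
            out.insert(out.end(), b.begin(), b.end());
        }
        return out;
    }

private:
    size_t nbest_;                     // capacity of each bin
    std::vector<std::vector<T> > bins_;
};
```

// NBest itself uses one flat array of nelt * nbest slots plus a per-bin count,
// which avoids per-bin allocations; the nested vectors here trade that for
// brevity.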
#endif /*def ALIGNER_SW_NUC_H_*/

88
aligner_swsse.cpp Normal file

@ -0,0 +1,88 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include <string.h>
#include "aligner_sw_common.h"
#include "aligner_swsse.h"
/**
* Given a number of rows (nrow), a number of columns (ncol), and the
* number of words to fit inside a single __m128i vector, initialize the
* matrix buffer to accommodate the needed configuration of vectors.
*/
void SSEMatrix::init(
size_t nrow,
size_t ncol,
size_t wperv)
{
nrow_ = nrow;
ncol_ = ncol;
wperv_ = wperv;
nvecPerCol_ = (nrow + (wperv-1)) / wperv;
// The +1 is so that we don't have to special-case the final column;
// instead, we just write off the end of the useful part of the table
// with pvEStore.
try {
matbuf_.resizeNoCopy((ncol+1) * nvecPerCell_ * nvecPerCol_);
} catch(exception&) {
cerr << "Tried to allocate DP matrix with " << (ncol+1)
<< " columns, " << nvecPerCol_
<< " vectors per column, and " << nvecPerCell_
<< " vectors per cell" << endl;
throw; // rethrow the original exception without slicing it
}
assert(wperv_ == 8 || wperv_ == 16);
vecshift_ = (wperv_ == 8) ? 3 : 4;
nvecrow_ = (nrow + (wperv_-1)) >> vecshift_;
nveccol_ = ncol;
colstride_ = nvecPerCol_ * nvecPerCell_;
rowstride_ = nvecPerCell_;
inited_ = true;
}
/**
* Initialize the matrix of masks and backtracking flags.
*/
void SSEMatrix::initMasks() {
assert_gt(nrow_, 0);
assert_gt(ncol_, 0);
masks_.resize(nrow_);
reset_.resizeNoCopy(nrow_);
reset_.fill(false);
}
/**
* Given a row, col and matrix (i.e. E, F or H), return the corresponding
* element.
*/
int SSEMatrix::eltSlow(size_t row, size_t col, size_t mat) const {
assert_lt(row, nrow_);
assert_lt(col, ncol_);
assert_leq(mat, 3);
// Move to beginning of column/row
size_t rowelt = row / nvecrow_;
size_t rowvec = row % nvecrow_;
size_t eltvec = (col * colstride_) + (rowvec * rowstride_) + mat;
if(wperv_ == 16) {
return (int)((uint8_t*)(matbuf_.ptr() + eltvec))[rowelt];
} else {
assert_eq(8, wperv_);
return (int)((int16_t*)(matbuf_.ptr() + eltvec))[rowelt];
}
}
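// Illustrative sketch, not part of the original file: eltSlow() maps an element
// row to a vector and a lane within that vector using a striped layout, where
// consecutive rows go to consecutive vectors and wrap around into the next
// lane. The arithmetic can be isolated as below. StripedPos and stripedPos are
// hypothetical names; nvecPerCell defaults to 4 as in SSEMatrix.

```cpp
#include <cassert>
#include <cstddef>

struct StripedPos {
    size_t vec;     // which vector within the column holds the row
    size_t lane;    // which word (lane) within that vector
    size_t offset;  // index of the vector within the flat matrix buffer
};

// Locate element (row, col) of matrix 'mat' (E=0, F=1, H=2) in a buffer laid
// out column by column, with nvecPerCell vectors per cell.
inline StripedPos stripedPos(size_t row, size_t col, size_t mat,
                             size_t nrow, size_t wperv,
                             size_t nvecPerCell = 4) {
    size_t nvecrow = (nrow + wperv - 1) / wperv;  // vectors per column
    size_t colstride = nvecrow * nvecPerCell;     // vectors per column of cells
    size_t rowstride = nvecPerCell;               // vectors per cell
    StripedPos p;
    p.vec = row % nvecrow;
    p.lane = row / nvecrow;
    p.offset = col * colstride + p.vec * rowstride + mat;
    return p;
}
```

// With nrow = 20 and wperv = 8 there are 3 vectors per column, so row 7 lands
// in vector 1, lane 2.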

500
aligner_swsse.h Normal file

@ -0,0 +1,500 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNER_SWSSE_H_
#define ALIGNER_SWSSE_H_
#include "ds.h"
#include "mem_ids.h"
#include "random_source.h"
#include "scoring.h"
#include "mask.h"
#include "sse_util.h"
#include <string>
struct SSEMetrics {
SSEMetrics():mutex_m() { reset(); }
void clear() { reset(); }
void reset() {
dp = dpsat = dpfail = dpsucc =
col = cell = inner = fixup =
gathsol = bt = btfail = btsucc = btcell =
corerej = nrej = 0;
}
void merge(const SSEMetrics& o, bool getLock = false) {
ThreadSafe ts(&mutex_m, getLock);
dp += o.dp;
dpsat += o.dpsat;
dpfail += o.dpfail;
dpsucc += o.dpsucc;
col += o.col;
cell += o.cell;
inner += o.inner;
fixup += o.fixup;
gathsol += o.gathsol;
bt += o.bt;
btfail += o.btfail;
btsucc += o.btsucc;
btcell += o.btcell;
corerej += o.corerej;
nrej += o.nrej;
}
uint64_t dp; // DPs tried
uint64_t dpsat; // DPs saturated
uint64_t dpfail; // DPs failed
uint64_t dpsucc; // DPs succeeded
uint64_t col; // DP columns
uint64_t cell; // DP cells
uint64_t inner; // DP inner loop iters
uint64_t fixup; // DP fixup loop iters
uint64_t gathsol; // DP gather solution cells found
uint64_t bt; // DP backtraces
uint64_t btfail; // DP backtraces failed
uint64_t btsucc; // DP backtraces succeeded
uint64_t btcell; // DP backtrace cells traversed
uint64_t corerej; // DP backtrace core rejections
uint64_t nrej; // DP backtrace N rejections
MUTEX_T mutex_m;
};
/**
* Encapsulates matrix information calculated by the SSE aligner.
*
* Matrix memory is laid out as follows:
*
* - Elements (individual cell scores) are packed into __m128i vectors
* - Vectors are packed into quartets, quartet elements correspond to: a vector
* from E, one from F, one from H, and one that's "reserved"
* - Quartets are packed into columns, where the number of quartets is
* determined by the number of query characters divided by the number of
* elements per vector
*
* Regarding the "reserved" element of the vector quartet: we use it for two
* things. First, we use the first column of reserved vectors to stage the
* initial column of H vectors. Second, we use the "reserved" vectors during
* the backtrace procedure to store information about (a) which cells have been
* traversed, (b) whether the cell is "terminal" (in local mode), etc.
*/
struct SSEMatrix {
// Each matrix element is a quartet of vectors. These constants are used
// to identify members of the quartet.
const static size_t E = 0;
const static size_t F = 1;
const static size_t H = 2;
const static size_t TMP = 3;
SSEMatrix(int cat = 0) : inited_(false), nvecPerCell_(4), matbuf_(cat) { }
/**
* Return a pointer to the matrix buffer.
*/
inline __m128i *ptr() {
assert(inited_);
return matbuf_.ptr();
}
/**
* Return a pointer to the E vector at the given row and column. Note:
* here row refers to rows of vectors, not rows of elements.
*/
inline __m128i* evec(size_t row, size_t col) {
assert_lt(row, nvecrow_);
assert_lt(col, nveccol_);
size_t elt = row * rowstride() + col * colstride() + E;
assert_lt(elt, matbuf_.size());
return ptr() + elt;
}
/**
* Like evec, but it's allowed to ask for a pointer to one column after the
* final one.
*/
inline __m128i* evecUnsafe(size_t row, size_t col) {
assert_lt(row, nvecrow_);
assert_leq(col, nveccol_);
size_t elt = row * rowstride() + col * colstride() + E;
assert_lt(elt, matbuf_.size());
return ptr() + elt;
}
/**
* Return a pointer to the F vector at the given row and column. Note:
* here row refers to rows of vectors, not rows of elements.
*/
inline __m128i* fvec(size_t row, size_t col) {
assert_lt(row, nvecrow_);
assert_lt(col, nveccol_);
size_t elt = row * rowstride() + col * colstride() + F;
assert_lt(elt, matbuf_.size());
return ptr() + elt;
}
/**
* Return a pointer to the H vector at the given row and column. Note:
* here row refers to rows of vectors, not rows of elements.
*/
inline __m128i* hvec(size_t row, size_t col) {
assert_lt(row, nvecrow_);
assert_lt(col, nveccol_);
size_t elt = row * rowstride() + col * colstride() + H;
assert_lt(elt, matbuf_.size());
return ptr() + elt;
}
/**
* Return a pointer to the TMP vector at the given row and column. Note:
* here row refers to rows of vectors, not rows of elements.
*/
inline __m128i* tmpvec(size_t row, size_t col) {
assert_lt(row, nvecrow_);
assert_lt(col, nveccol_);
size_t elt = row * rowstride() + col * colstride() + TMP;
assert_lt(elt, matbuf_.size());
return ptr() + elt;
}
/**
* Like tmpvec, but it's allowed to ask for a pointer to one column after
* the final one.
*/
inline __m128i* tmpvecUnsafe(size_t row, size_t col) {
assert_lt(row, nvecrow_);
assert_leq(col, nveccol_);
size_t elt = row * rowstride() + col * colstride() + TMP;
assert_lt(elt, matbuf_.size());
return ptr() + elt;
}
/**
* Given a number of rows (nrow), a number of columns (ncol), and the
* number of words to fit inside a single __m128i vector, initialize the
* matrix buffer to accommodate the needed configuration of vectors.
*/
void init(
size_t nrow,
size_t ncol,
size_t wperv);
/**
* Return the number of __m128i's you need to skip over to get from one
* cell to the cell one column over from it.
*/
inline size_t colstride() const { return colstride_; }
/**
* Return the number of __m128i's you need to skip over to get from one
* cell to the cell one row down from it.
*/
inline size_t rowstride() const { return rowstride_; }
/**
* Given a row, col and matrix (i.e. E, F or H), return the corresponding
* element.
*/
int eltSlow(size_t row, size_t col, size_t mat) const;
/**
* Given a row, col and matrix (i.e. E, F or H), return the corresponding
* element.
*/
inline int elt(size_t row, size_t col, size_t mat) const {
assert(inited_);
assert_lt(row, nrow_);
assert_lt(col, ncol_);
assert_lt(mat, 3);
// Move to beginning of column/row
size_t rowelt = row / nvecrow_;
size_t rowvec = row % nvecrow_;
size_t eltvec = (col * colstride_) + (rowvec * rowstride_) + mat;
assert_lt(eltvec, matbuf_.size());
if(wperv_ == 16) {
return (int)((uint8_t*)(matbuf_.ptr() + eltvec))[rowelt];
} else {
assert_eq(8, wperv_);
return (int)((int16_t*)(matbuf_.ptr() + eltvec))[rowelt];
}
}
/**
* Return the element in the E matrix at element row, col.
*/
inline int eelt(size_t row, size_t col) const {
return elt(row, col, E);
}
/**
* Return the element in the F matrix at element row, col.
*/
inline int felt(size_t row, size_t col) const {
return elt(row, col, F);
}
/**
* Return the element in the H matrix at element row, col.
*/
inline int helt(size_t row, size_t col) const {
return elt(row, col, H);
}
/**
* Return true iff the given cell has its reportedThru bit set.
*/
inline bool reportedThrough(
size_t row, // current row
size_t col) const // current column
{
return (masks_[row][col] & (1 << 0)) != 0;
}
/**
* Set the given cell's reportedThru bit.
*/
inline void setReportedThrough(
size_t row, // current row
size_t col) // current column
{
masks_[row][col] |= (1 << 0);
}
/**
* Return true iff the H mask has been set with a previous call to hMaskSet.
*/
bool isHMaskSet(
size_t row, // current row
size_t col) const; // current column
/**
* Set the given cell's H mask. This is the mask of remaining legal ways to
* backtrack from the H cell at this coordinate. It's 5 bits long and has
* offset=2 into the 16-bit field.
*/
void hMaskSet(
size_t row, // current row
size_t col, // current column
int mask);
/**
* Return true iff the E mask has been set with a previous call to eMaskSet.
*/
bool isEMaskSet(
size_t row, // current row
size_t col) const; // current column
/**
* Set the given cell's E mask. This is the mask of remaining legal ways to
* backtrack from the E cell at this coordinate. It's 2 bits long and has
* offset=8 into the 16-bit field.
*/
void eMaskSet(
size_t row, // current row
size_t col, // current column
int mask);
/**
* Return true iff the F mask has been set with a previous call to fMaskSet.
*/
bool isFMaskSet(
size_t row, // current row
size_t col) const; // current column
/**
* Set the given cell's F mask. This is the mask of remaining legal ways to
* backtrack from the F cell at this coordinate. It's 2 bits long and has
* offset=11 into the 16-bit field.
*/
void fMaskSet(
size_t row, // current row
size_t col, // current column
int mask);
/**
* Analyze a cell in the SSE-filled dynamic programming matrix. Determine &
* memorize ways that we can backtrack from the cell. If there is at least one
* way to backtrack, select one at random and return the selection.
*
* There are a few subtleties to keep in mind regarding which cells can be at
* the end of a backtrace. First of all: cells from which we can backtrack
* should not be at the end of a backtrace. But we have to distinguish between
* cells whose masks eventually become 0 (we shouldn't end at those), from
* those whose masks were 0 all along (we can end at those).
*/
void analyzeCell(
size_t row, // current row
size_t col, // current column
size_t ct, // current cell type: E/F/H
int refc,
int readc,
int readq,
const Scoring& sc, // scoring scheme
int64_t offsetsc, // offset to add to each score
RandomSource& rand, // rand gen for choosing among equal options
bool& empty, // out: =true iff no way to backtrace
int& cur, // out: =type of transition
bool& branch, // out: =true iff we chose among >1 options
bool& canMoveThru, // out: =true iff ...
bool& reportedThru); // out: =true iff ...
/**
* Initialize the matrix of masks and backtracking flags.
*/
void initMasks();
/**
* Return the number of rows in the dynamic programming matrix.
*/
size_t nrow() const {
return nrow_;
}
/**
* Return the number of columns in the dynamic programming matrix.
*/
size_t ncol() const {
return ncol_;
}
/**
* Prepare a row so we can use it to store masks.
*/
void resetRow(size_t i) {
assert(!reset_[i]);
masks_[i].resizeNoCopy(ncol_);
masks_[i].fillZero();
reset_[i] = true;
}
bool inited_; // initialized?
size_t nrow_; // # rows
size_t ncol_; // # columns
size_t nvecrow_; // # vector rows (<= nrow_)
size_t nveccol_; // # vector columns (<= ncol_)
size_t wperv_; // # words per vector
size_t vecshift_; // # bits to shift to divide by words per vec
size_t nvecPerCol_; // # vectors per column
size_t nvecPerCell_; // # vectors per matrix cell (4)
size_t colstride_; // # vectors b/t adjacent cells in same row
size_t rowstride_; // # vectors b/t adjacent cells in same col
EList_m128i matbuf_; // buffer for holding vectors
ELList<uint16_t> masks_; // buffer for masks/backtracking flags
EList<bool> reset_; // true iff row in masks_ has been reset
};
/**
* All the data associated with the query profile and other data needed for SSE
* alignment of a query.
*/
struct SSEData {
SSEData(int cat = 0) : profbuf_(cat), mat_(cat) { }
EList_m128i profbuf_; // buffer for query profile & temp vecs
EList_m128i vecbuf_; // buffer for 2 column vectors (not using mat_)
size_t qprofStride_; // stride for query profile
size_t gbarStride_; // gap barrier for query profile
SSEMatrix mat_; // SSE matrix for holding all E, F, H vectors
size_t maxPen_; // biggest penalty of all
size_t maxBonus_; // biggest bonus of all
size_t lastIter_; // which 128-bit striped word has final row?
size_t lastWord_; // which word within 128-bit word has final row?
int bias_; // all scores shifted up by this for unsigned
};
/**
* Return true iff the H mask has been set with a previous call to hMaskSet.
*/
inline bool SSEMatrix::isHMaskSet(
size_t row, // current row
size_t col) const // current column
{
return (masks_[row][col] & (1 << 1)) != 0;
}
/**
* Set the given cell's H mask. This is the mask of remaining legal ways to
* backtrack from the H cell at this coordinate. It's 5 bits long and has
* offset=2 into the 16-bit field.
*/
inline void SSEMatrix::hMaskSet(
size_t row, // current row
size_t col, // current column
int mask)
{
assert_lt(mask, 32);
masks_[row][col] &= ~(63 << 1); // clear the set flag (bit 1) and the 5-bit mask (bits 2-6)
masks_[row][col] |= (1 << 1 | mask << 2);
}
/**
* Return true iff the E mask has been set with a previous call to eMaskSet.
*/
inline bool SSEMatrix::isEMaskSet(
size_t row, // current row
size_t col) const // current column
{
return (masks_[row][col] & (1 << 7)) != 0;
}
/**
* Set the given cell's E mask. This is the mask of remaining legal ways to
* backtrack from the E cell at this coordinate. It's 2 bits long and has
* offset=8 into the 16-bit field.
*/
inline void SSEMatrix::eMaskSet(
size_t row, // current row
size_t col, // current column
int mask)
{
assert_lt(mask, 4);
masks_[row][col] &= ~(7 << 7);
masks_[row][col] |= (1 << 7 | mask << 8);
}
/**
* Return true iff the F mask has been set with a previous call to fMaskSet.
*/
inline bool SSEMatrix::isFMaskSet(
size_t row, // current row
size_t col) const // current column
{
return (masks_[row][col] & (1 << 10)) != 0;
}
/**
* Set the given cell's F mask. This is the mask of remaining legal ways to
* backtrack from the F cell at this coordinate. It's 2 bits long and has
* offset=11 into the 16-bit field.
*/
inline void SSEMatrix::fMaskSet(
size_t row, // current row
size_t col, // current column
int mask)
{
assert_lt(mask, 4);
masks_[row][col] &= ~(7 << 10);
masks_[row][col] |= (1 << 10 | mask << 11);
}
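// Illustrative sketch, not part of the original header: the mask setters above
// pack several fields into one 16-bit word per cell. Read off the shifts, the
// layout is: bit 0 reportedThru, bit 1 H-set flag, bits 2-6 H mask, bit 7 E-set
// flag, bits 8-9 E mask, bit 10 F-set flag, bits 11-12 F mask. The free
// functions below are hypothetical (the getters in particular do not appear in
// the header) and work on a plain uint16_t. This sketch clears six bits for H,
// the flag plus all five mask bits.

```cpp
#include <cassert>
#include <cstdint>

// Store a 5-bit H mask and mark it as set.
inline uint16_t setH(uint16_t w, unsigned mask) {
    w &= static_cast<uint16_t>(~(63u << 1));
    w |= static_cast<uint16_t>((1u << 1) | (mask << 2));
    return w;
}

// Store a 2-bit E mask and mark it as set.
inline uint16_t setE(uint16_t w, unsigned mask) {
    w &= static_cast<uint16_t>(~(7u << 7));
    w |= static_cast<uint16_t>((1u << 7) | (mask << 8));
    return w;
}

// Store a 2-bit F mask and mark it as set.
inline uint16_t setF(uint16_t w, unsigned mask) {
    w &= static_cast<uint16_t>(~(7u << 10));
    w |= static_cast<uint16_t>((1u << 10) | (mask << 11));
    return w;
}

// Hypothetical getters for the three mask fields.
inline unsigned getH(uint16_t w) { return (w >> 2) & 31u; }
inline unsigned getE(uint16_t w) { return (w >> 8) & 3u; }
inline unsigned getF(uint16_t w) { return (w >> 11) & 3u; }
```

// Overwriting a field only works if the clear step covers every bit the set
// step can write, which is why the clear width equals flag width plus mask
// width for each field.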
#define ROWSTRIDE_2COL 4
#define ROWSTRIDE 4
#endif /*ndef ALIGNER_SWSSE_H_*/

1911
aligner_swsse_ee_i16.cpp Normal file

File diff suppressed because it is too large

1902
aligner_swsse_ee_u8.cpp Normal file

File diff suppressed because it is too large

2272
aligner_swsse_loc_i16.cpp Normal file

File diff suppressed because it is too large

2266
aligner_swsse_loc_u8.cpp Normal file

File diff suppressed because it is too large

193
alignment_3n.cpp Normal file

@ -0,0 +1,193 @@
/*
* Copyright 2020, Yun (Leo) Zhang <imzhangyun@gmail.com>
*
* This file is part of HISAT-3N.
*
* HISAT-3N is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* HISAT-3N is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with HISAT-3N. If not, see <http://www.gnu.org/licenses/>.
*/
#include "alignment_3n.h"
#include "aln_sink.h"
/**
* Return true if the two locations are concordant.
* Return false if they are not concordant or are too far apart (> maxPairDistance).
*/
bool Alignment::isConcordant(long long int location1, bool &forward1, long long int readLength1, long long int location2, bool &forward2, long long int readLength2) {
if (forward1 == forward2) // same direction
{
return false;
}
// adjust the location of the start of the read
if (!forward1)
{
location1 = location1 + readLength1 - 1;
}
if (!forward2)
{
location2 = location2 + readLength2 - 1;
}
// return false if two reads are too far from each other
if (abs(location1-location2) > maxPairDistance)
{
return false;
}
if (location1 == location2)
{
return true;
}
else if (location1 < location2)
{
if (forward1 && !forward2)
{
return true;
}
}
else
{
if (!forward1 && forward2)
{
return true;
}
}
return false;
}
/**
* Basic function to calculate the DNA pair score.
* If the distance between the 2 alignments is more than penaltyFreeDistance_DNA, we reduce the score by distance/distancePenaltyFraction_DNA.
* If the two alignments are concordant, we add concordantScoreBounce so that the concordant pair is selected as the best pair.
*/
int Alignment::calculatePairScore_DNA (long long int &location0, int& AS0, bool& forward0, long long int readLength0, long long int &location1, int &AS1, bool &forward1, long long int readLength1, bool& concordant) {
int score = ASPenalty*AS0 + ASPenalty*AS1;
long long int distance = abs(location0 - location1); // keep the full 64-bit distance so the range check below cannot be fooled by truncation
if (distance > maxPairDistance) { return numeric_limits<int>::min(); }
if (distance > penaltyFreeDistance_DNA) { score -= distance/distancePenaltyFraction_DNA; }
concordant = isConcordant(location0, forward0, readLength0, location1, forward1, readLength1);
if (concordant) { score += concordantScoreBounce; }
return score;
}
/**
* Basic function to calculate the RNA pair score.
* If the distance between the 2 alignments is more than penaltyFreeDistance_RNA, we reduce the score by distance/distancePenaltyFraction_RNA.
* If the two alignments are concordant, we add concordantScoreBounce so that the concordant pair is selected as the best pair.
*/
int Alignment::calculatePairScore_RNA (long long int &location0, int& XM0, bool& forward0, long long int readLength0, long long int &location1, int &XM1, bool &forward1, long long int readLength1, bool& concordant) {
// Basic function to calculate the pair score.
// If the distance between the 2 alignments is more than 100,000, we reduce the score by distance/1000.
// If the two alignments are concordant, we add 500,000 so that the concordant pair is selected as the best pair.
int score = -ASPenalty*XM0 + -ASPenalty*XM1;
long long int distance = abs(location0 - location1); // keep the full 64-bit distance so the range check below cannot be fooled by truncation
if (distance > maxPairDistance) { return numeric_limits<int>::min(); }
if (distance > penaltyFreeDistance_RNA) { score -= distance/distancePenaltyFraction_RNA; }
concordant = isConcordant(location0, forward0, readLength0, location1, forward1, readLength1);
if (concordant) { score += concordantScoreBounce; }
return score;
}
/**
* Calculate the pair score for a pair of alignment results. Outputs the pair score and the number of pairs.
* Does not update the pairScore stored in either alignment.
*/
int Alignment::calculatePairScore(Alignment *inputAlignment, int &nPair) {
int pairScore = numeric_limits<int>::min();
nPair = 0;
if (pairSegment == inputAlignment->pairSegment){
// when the 2 alignment results are from the same pair segment, output the lowest score and set the number of pairs to zero.
pairScore = numeric_limits<int>::min();
} else if (!mapped && !inputAlignment->mapped) {
// both unmapped.
pairScore = numeric_limits<int>::min()/2 - 1;
} else if (!mapped || !inputAlignment->mapped) {
// one of the segments is unmapped.
pairScore = numeric_limits<int>::min()/2;
nPair = 1;
} else if ((!repeat && !inputAlignment->repeat)){
// both mapped and (both non-repeat or not expand repeat)
bool concordant;
if (DNA) {
pairScore = calculatePairScore_DNA(location,
AS,
forward,
readSequence.length(),
inputAlignment->location,
inputAlignment->AS,
inputAlignment->forward,
inputAlignment->readSequence.length(),
concordant);
} else {
pairScore = calculatePairScore_RNA(location,
XM,
forward,
readSequence.length(),
inputAlignment->location,
inputAlignment->XM,
inputAlignment->forward,
inputAlignment->readSequence.length(),
concordant);
}
setConcordant(concordant);
inputAlignment->setConcordant(concordant);
nPair = 1;
}
return pairScore;
}
void Alignments::reportStats_single(ReportingMetrics& met) {
int nAlignment = alignmentPositions.nBestSingle;
if (nAlignment == 0) {
met.nunp_0++;
} else {
met.nunp_uni++;
if (nAlignment == 1) { met.nunp_uni1++; }
else { met.nunp_uni2++; }
}
}
void Alignments::reportStats_paired(ReportingMetrics& met) {
if (!alignmentPositions.concordantExist) {
met.nconcord_0++;
if (alignmentPositions.nBestPair == 0) {
met.nunp_0_0 += 2;
return;
}
if (alignmentPositions.bestPairScore == numeric_limits<int>::min()/2) {
// one mate is unmapped, one mate is mapped
met.nunp_0_0++;
met.nunp_0_uni++;
if (alignmentPositions.nBestPair == 1) { met.nunp_0_uni1++; }
else { met.nunp_0_uni2++; }
} else { // both mates are mapped
if (alignmentPositions.nBestPair == 1) {
met.ndiscord++;
return;
}
else {
met.nunp_0_uni += 2;
met.nunp_0_uni2 += 2;
}
}
} else {
assert(alignmentPositions.nBestPair > 0);
met.nconcord_uni++;
if (alignmentPositions.nBestPair == 1) { met.nconcord_uni1++; }
else { met.nconcord_uni2++; }
}
}
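
The concordance test and DNA pair score above can be summarised in a small standalone sketch. It is not the HISAT-3N implementation: the constants `kMaxPairDistance`, `kPenaltyFreeDistance`, `kDistancePenaltyFraction`, `kConcordantBonus` and `kAsWeight` are hypothetical stand-ins for `maxPairDistance`, `penaltyFreeDistance_DNA`, `distancePenaltyFraction_DNA`, `concordantScoreBounce` and `ASPenalty`, whose real values are defined elsewhere, and the function names are invented for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <limits>

// Hypothetical values; the real tool defines its own.
const long long kMaxPairDistance = 500000;
const long long kPenaltyFreeDistance = 1000;
const long long kDistancePenaltyFraction = 100;
const int kConcordantBonus = 500000;
const int kAsWeight = 1;

// Mates are concordant when they are on opposite strands, face each
// other, and their read starts are at most kMaxPairDistance apart.
bool isConcordantSketch(long long loc1, bool fwd1, long long len1,
                        long long loc2, bool fwd2, long long len2) {
    if (fwd1 == fwd2) return false;            // same direction
    if (!fwd1) loc1 += len1 - 1;               // a reverse read starts at its right end
    if (!fwd2) loc2 += len2 - 1;
    if (std::llabs(loc1 - loc2) > kMaxPairDistance) return false;
    if (loc1 == loc2) return true;
    // The upstream mate must be the forward one.
    return loc1 < loc2 ? fwd1 : fwd2;
}

// Sum of weighted alignment scores, minus a distance penalty,
// plus a bonus when the pair is concordant.
int pairScoreSketch(long long loc0, int as0, bool fwd0, long long len0,
                    long long loc1, int as1, bool fwd1, long long len1,
                    bool& concordant) {
    concordant = false;
    long long distance = std::llabs(loc0 - loc1);
    if (distance > kMaxPairDistance) return std::numeric_limits<int>::min();
    int score = kAsWeight * as0 + kAsWeight * as1;
    if (distance > kPenaltyFreeDistance) {
        score -= static_cast<int>(distance / kDistancePenaltyFraction);
    }
    concordant = isConcordantSketch(loc0, fwd0, len0, loc1, fwd1, len1);
    if (concordant) score += kConcordantBonus;
    return score;
}
```

With these assumed constants, a forward mate at 1000 and a reverse mate at 1200 (both 100 bases long) are concordant and pay no distance penalty, so the pair score is the sum of the two alignment scores plus the bonus. Using `std::llabs` keeps the distance in 64 bits before it is compared with the limit.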

1214
alignment_3n.h Normal file

File diff suppressed because it is too large Load Diff

287
alignment_3n_table.h Normal file
View File

@ -0,0 +1,287 @@
/*
* Copyright 2020, Yun (Leo) Zhang <imzhangyun@gmail.com>
*
* This file is part of HISAT-3N.
*
* HISAT-3N is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* HISAT-3N is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with HISAT-3N. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNMENT_3N_TABLE_H
#define ALIGNMENT_3N_TABLE_H
#include <string>
#include <vector>
#include "utility_3n_table.h"
extern bool uniqueOnly;
extern bool multipleOnly;
extern char convertFrom;
extern char convertTo;
extern char convertFromComplement;
extern char convertToComplement;
using namespace std;
/**
* the class to store information from one SAM line
*/
class Alignment {
public:
string chromosome;
long long int location;
long long int mateLocation;
int flag;
bool mapped;
char strand;
string sequence;
string quality;
bool unique;
string mapQ;
int NH;
vector<PosQuality> bases;
CIGAR cigarString;
MD_tag MD;
unsigned long long readNameID;
int sequenceCoveredLength; // the sum of the segment lengths in cigarString
bool overlap; // if the segment could overlap with the mate segment.
bool paired;
void initialize() {
chromosome.clear();
location = -1;
mateLocation = -1;
flag = -1;
mapped = false;
MD.initialize();
cigarString.initialize();
sequence.clear();
quality.clear();
unique = false;
mapQ.clear();
NH = -1;
bases.clear();
readNameID = 0;
sequenceCoveredLength = 0;
overlap = false;
paired = false;
}
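/**
* check whether the text at startPosition in inputLine begins with the given tag.
*/
bool startWith(string* inputLine, int startPosition, string tag){
// a tag that would run past the end of the line counts as a mismatch instead of throwing.
if (startPosition < 0 || (size_t)startPosition + tag.size() > inputLine->size()) {
return false;
}
return inputLine->compare(startPosition, tag.size(), tag) == 0;
}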
/**
* generate a hash value for readName
*/
void getNameHash(string& readName) {
readNameID = 0;
int a = 63689;
for (size_t i = 0; i < readName.size(); i++) {
readNameID = (readNameID * a) + (int)readName[i];
}
}
/**
* extract the information from SAM line to Alignment.
*/
void parseInfo(string* line) {
int startPosition = 0;
int endPosition = 0;
int count = 0;
while ((endPosition = line->find("\t", startPosition)) != string::npos) {
if (count == 0) {
string readName = line->substr(startPosition, endPosition - startPosition);
getNameHash(readName);
} else if (count == 1) {
flag = stoi(line->substr(startPosition, endPosition - startPosition));
mapped = (flag & 4) == 0;
paired = (flag & 1) != 0;
} else if (count == 2) {
chromosome = line->substr(startPosition, endPosition - startPosition);
} else if (count == 3) {
location = stoll(line->substr(startPosition, endPosition - startPosition));
} else if (count == 4) {
mapQ = line->substr(startPosition, endPosition - startPosition);
if (mapQ == "1") {
unique = false;
} else {
unique = true;
}
} else if (count == 5) {
cigarString.loadString(line->substr(startPosition, endPosition - startPosition));
} else if (count == 7) {
mateLocation = stoll(line->substr(startPosition, endPosition - startPosition));
} else if (count == 9) {
sequence = line->substr(startPosition, endPosition - startPosition);
} else if (count == 10) {
quality = line->substr(startPosition, endPosition - startPosition);
} else if (count > 10) {
if (startWith(line, startPosition, "MD")) {
MD.loadString(line->substr(startPosition + 5, endPosition - startPosition - 5));
} else if (startWith(line, startPosition, "NM")) {
NH = stoi(line->substr(startPosition + 5, endPosition - startPosition - 5));
} else if (startWith(line, startPosition, "YZ")) {
strand = line->at(endPosition-1);
}
}
startPosition = endPosition + 1;
count++;
}
// the last field has no trailing tab (endPosition is npos here), so read it through to the end of the line.
if (startWith(line, startPosition, "MD")) {
MD.loadString(line->substr(startPosition + 5));
} else if (startWith(line, startPosition, "NM")) {
NH = stoi(line->substr(startPosition + 5));
} else if (startWith(line, startPosition, "YZ")) {
strand = line->back();
}
}
/**
* set overlap = true if the read is not uniquely mapped or the read segment overlaps its mate.
*/
void checkOverlap() {
if (!unique) {
overlap = true;
} else {
if (paired && (location + sequenceCoveredLength >= mateLocation)) {
overlap = true;
} else {
overlap = false;
}
}
}
/**
* parse the sam line to alignment information
*/
void parse(string* line) {
initialize();
parseInfo(line);
if ((uniqueOnly && !unique) || (multipleOnly && unique)) {
return;
}
appendBase();
}
/**
* scan all bases in the read sequence and label the ones that qualify.
*/
void appendBase() {
if (!mapped || sequenceCoveredLength > 500000) { // ignore unmapped reads and reads that cover more than 500,000 bases (e.g. a very long intron)
return;
}
bases.reserve(sequence.size());
for (int i = 0; i < sequence.size(); i++) {
bases.emplace_back(i);
}
int pos = adjustPos();
string match;
while (MD.getNextSegment(match)) {
if (isdigit(match.front())) { // the segment starts with a digit, so it is a run of matches
int len = stoi(match);
for (int i = 0; i < len; i++) {
while (bases[pos].remove) {
pos++;
}
if ((strand == '+' && sequence[pos] == convertFrom) ||
(strand == '-' && sequence[pos] == convertFromComplement)) {
bases[pos].setQual(quality[pos], false);
} else {
bases[pos].remove = true;
}
pos ++;
}
} else if (isalpha(match.front())) { // this is mismatch or conversion
char refBase = match.front();
// for + strand, it should have C->T change
// for - strand, it should have G->A change
while (bases[pos].remove) {
pos++;
}
if ((strand == '+' && refBase == convertFrom && sequence[pos] == convertTo) ||
(strand == '-' && refBase == convertFromComplement && sequence[pos] == convertToComplement)){
bases[pos].setQual(quality[pos], true);
} else {
bases[pos].remove = true;
}
pos ++;
} else { // deletion. do nothing.
}
}
}
/**
* adjust the reference position in bases
*/
int adjustPos() {
int readPos = 0;
int returnPos = 0;
int seqLength = sequence.size();
char cigarSymbol;
int cigarLen;
sequenceCoveredLength = 0;
while (cigarString.getNextSegment(cigarLen, cigarSymbol)) {
sequenceCoveredLength += cigarLen;
if (cigarSymbol == 'S') {
if (readPos == 0) { // soft clip is at the beginning of the read
returnPos = cigarLen;
for (int i = cigarLen; i < seqLength; i++) {
bases[i].refPos -= cigarLen;
}
} else { // soft clip is at the end of the read
// do nothing
}
readPos += cigarLen;
} else if (cigarSymbol == 'N') {
for (int i = readPos; i < seqLength; i++) {
bases[i].refPos += cigarLen;
}
} else if (cigarSymbol == 'M') {
for (int i = readPos; i < readPos+cigarLen; i++) {
bases[i].remove = false;
}
readPos += cigarLen;
} else if (cigarSymbol == 'I') {
for (int i = readPos + cigarLen; i < seqLength; i++) {
bases[i].refPos -= cigarLen;
}
readPos += cigarLen;
} else if (cigarSymbol == 'D') {
for (int i = readPos; i < seqLength; i++) {
bases[i].refPos += cigarLen;
}
}
}
return returnPos;
}
};
#endif //ALIGNMENT_3N_TABLE_H
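
The field handling in `parseInfo` can be illustrated with a small standalone sketch. It is not the parser above: `splitSamLine`, `tagValue`, `isMapped` and `isPaired` are hypothetical helper names, and the only assumptions are the standard SAM layout (eleven mandatory tab-separated fields followed by optional `TAG:TYPE:VALUE` fields) and the flag bits 0x1 (paired) and 0x4 (unmapped). Splitting first means the last field, which has no trailing tab, is handled the same way as every other field.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split one SAM line on tabs; the final field has no trailing tab.
std::vector<std::string> splitSamLine(const std::string& line) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type end = line.find('\t', start);
        if (end == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}

// Return the value of a two-letter optional tag (fields look like
// "MD:Z:value"), or an empty string when the tag is absent.
std::string tagValue(const std::vector<std::string>& fields,
                     const std::string& tag) {
    for (std::size_t i = 11; i < fields.size(); i++) {
        if (fields[i].size() >= 5 && fields[i].compare(0, 2, tag) == 0) {
            return fields[i].substr(5);  // skip "XX:T:"
        }
    }
    return "";
}

// SAM flag bit 0x4 marks an unmapped segment.
bool isMapped(int flag) { return (flag & 4) == 0; }

// SAM flag bit 0x1 marks a read that is part of a pair.
bool isPaired(int flag) { return (flag & 1) != 0; }
```

For a line whose last field is `YZ:A:+`, `tagValue(fields, "YZ")` returns `"+"` without any special case for the end of the line.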

785
aln_sink.cpp Normal file
View File

@ -0,0 +1,785 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include <iomanip>
#include <limits>
#include "aln_sink.h"
#include "aligner_seed.h"
#include "util.h"
using namespace std;
/**
* Initialize state machine with a new read. The state we start in depends
* on whether it's paired-end or unpaired.
*/
void ReportingState::nextRead(bool paired) {
paired_ = paired;
if(paired) {
state_ = CONCORDANT_PAIRS;
doneConcord_ = false;
doneDiscord_ = p_.discord ? false : true;
doneUnpair1_ = p_.mixed ? false : true;
doneUnpair2_ = p_.mixed ? false : true;
exitConcord_ = ReportingState::EXIT_DID_NOT_EXIT;
exitDiscord_ = p_.discord ?
ReportingState::EXIT_DID_NOT_EXIT :
ReportingState::EXIT_DID_NOT_ENTER;
exitUnpair1_ = p_.mixed ?
ReportingState::EXIT_DID_NOT_EXIT :
ReportingState::EXIT_DID_NOT_ENTER;
exitUnpair2_ = p_.mixed ?
ReportingState::EXIT_DID_NOT_EXIT :
ReportingState::EXIT_DID_NOT_ENTER;
} else {
// Unpaired
state_ = UNPAIRED;
doneConcord_ = true;
doneDiscord_ = true;
doneUnpair1_ = false;
doneUnpair2_ = true;
exitConcord_ = ReportingState::EXIT_DID_NOT_ENTER; // not relevant
exitDiscord_ = ReportingState::EXIT_DID_NOT_ENTER; // not relevant
exitUnpair1_ = ReportingState::EXIT_DID_NOT_EXIT;
exitUnpair2_ = ReportingState::EXIT_DID_NOT_ENTER; // not relevant
}
doneUnpair_ = doneUnpair1_ && doneUnpair2_;
done_ = false;
nconcord_ = ndiscord_ = nunpair1_ = nunpair2_ = 0;
nunpairRepeat1_ = nunpairRepeat2_ = 0;
concordBest_ = getMinScore();
}
/**
* Caller uses this member function to indicate that one additional
* concordant alignment has been found.
*/
bool ReportingState::foundConcordant(TAlScore score) {
assert(paired_);
assert_geq(state_, ReportingState::CONCORDANT_PAIRS);
assert(!doneConcord_);
if(score > concordBest_) {
concordBest_ = score;
nconcord_ = 0;
}
nconcord_++;
// DK CONCORDANT - debugging purposes
// areDone(nconcord_, doneConcord_, exitConcord_);
// No need to search for discordant alignments if there are one or more
// concordant alignments.
doneDiscord_ = true;
exitDiscord_ = ReportingState::EXIT_SHORT_CIRCUIT_TRUMPED;
if(doneConcord_) {
// If we're finished looking for concordant alignments, do we have to
// continue on to search for unpaired alignments? Only if our exit
// from the concordant stage is EXIT_SHORT_CIRCUIT_M. If it's
// EXIT_SHORT_CIRCUIT_k or EXIT_WITH_ALIGNMENTS, we can skip unpaired.
assert_neq(ReportingState::EXIT_NO_ALIGNMENTS, exitConcord_);
if(exitConcord_ != ReportingState::EXIT_SHORT_CIRCUIT_M) {
if(!doneUnpair1_) {
doneUnpair1_ = true;
exitUnpair1_ = ReportingState::EXIT_SHORT_CIRCUIT_TRUMPED;
}
if(!doneUnpair2_) {
doneUnpair2_ = true;
exitUnpair2_ = ReportingState::EXIT_SHORT_CIRCUIT_TRUMPED;
}
}
}
updateDone();
return done();
}
/**
* Caller uses this member function to indicate that one additional unpaired
* mate alignment has been found for the specified mate.
*/
bool ReportingState::foundUnpaired(bool mate1, bool repeat) {
assert_gt(state_, ReportingState::NO_READ);
// Note: it's not right to assert !doneUnpair1_/!doneUnpair2_ here.
// Even if we're done with finding
if(mate1) {
nunpair1_++;
if(repeat) {
nunpairRepeat1_++;
}
// Did we just finish with this mate?
if(!doneUnpair1_) {
areDone(nunpair1_, doneUnpair1_, exitUnpair1_);
if(doneUnpair1_) {
doneUnpair_ = doneUnpair1_ && doneUnpair2_;
updateDone();
}
}
if(nunpair1_ > 1) {
doneDiscord_ = true;
exitDiscord_ = ReportingState::EXIT_NO_ALIGNMENTS;
}
} else {
nunpair2_++;
if(repeat) {
nunpairRepeat2_++;
}
// Did we just finish with this mate?
if(!doneUnpair2_) {
areDone(nunpair2_, doneUnpair2_, exitUnpair2_);
if(doneUnpair2_) {
doneUnpair_ = doneUnpair1_ && doneUnpair2_;
updateDone();
}
}
if(nunpair2_ > 1) {
doneDiscord_ = true;
exitDiscord_ = ReportingState::EXIT_NO_ALIGNMENTS;
}
}
return done();
}
/**
* Called to indicate that the aligner has finished searching for
* alignments. This gives us a chance to finalize our state.
*
* TODO: Keep track of short-circuiting information.
*/
void ReportingState::finish() {
if(!doneConcord_) {
doneConcord_ = true;
exitConcord_ =
((nconcord_ > 0) ?
ReportingState::EXIT_WITH_ALIGNMENTS :
ReportingState::EXIT_NO_ALIGNMENTS);
}
assert_gt(exitConcord_, EXIT_DID_NOT_EXIT);
if(!doneUnpair1_) {
doneUnpair1_ = true;
exitUnpair1_ =
((nunpair1_ > 0) ?
ReportingState::EXIT_WITH_ALIGNMENTS :
ReportingState::EXIT_NO_ALIGNMENTS);
}
assert_gt(exitUnpair1_, EXIT_DID_NOT_EXIT);
if(!doneUnpair2_) {
doneUnpair2_ = true;
exitUnpair2_ =
((nunpair2_ > 0) ?
ReportingState::EXIT_WITH_ALIGNMENTS :
ReportingState::EXIT_NO_ALIGNMENTS);
}
assert_gt(exitUnpair2_, EXIT_DID_NOT_EXIT);
if(!doneDiscord_) {
// Check if the unpaired alignments should be converted to a single
// discordant paired-end alignment.
assert_eq(0, ndiscord_);
if(nconcord_ == 0 && nunpair1_ == 1 && nunpair2_ == 1) {
convertUnpairedToDiscordant();
}
doneDiscord_ = true;
exitDiscord_ =
((ndiscord_ > 0) ?
ReportingState::EXIT_WITH_ALIGNMENTS :
ReportingState::EXIT_NO_ALIGNMENTS);
}
assert(!paired_ || exitDiscord_ > ReportingState::EXIT_DID_NOT_EXIT);
doneUnpair_ = done_ = true;
assert(done());
}
/**
* Populate given counters with the number of various kinds of alignments
* to report for this read. Concordant alignments are preferable to (and
* mutually exclusive with) discordant alignments, and paired-end
* alignments are preferable to unpaired alignments.
*
* The caller also needs some additional information for the case where a
* pair or unpaired read aligns repetitively. If the read is paired-end
* and the paired-end has repetitive concordant alignments, that should be
* reported, and 'pairMax' is set to true to indicate this. If the read is
* paired-end, does not have any concordant alignments, but does have
* repetitive alignments for one or both mates, then that should be
* reported, and 'unpair1Max' and 'unpair2Max' are set accordingly.
*
* Note that it's possible in the case of a paired-end read for the read to
* have repetitive concordant alignments, but for one mate to have a unique
* unpaired alignment.
*/
void ReportingState::getReport(
uint64_t& nconcordAln, // # concordant alignments to report
uint64_t& ndiscordAln, // # discordant alignments to report
uint64_t& nunpair1Aln, // # unpaired alignments for mate #1 to report
uint64_t& nunpair2Aln, // # unpaired alignments for mate #2 to report
uint64_t& nunpairRepeat1Aln, // # unpaired repeat alignments for mate #1 to report
uint64_t& nunpairRepeat2Aln, // # unpaired repeat alignments for mate #2 to report
bool& pairMax, // repetitive concordant alignments
bool& unpair1Max, // repetitive alignments for mate #1
bool& unpair2Max) // repetitive alignments for mate #2
const
{
nconcordAln = ndiscordAln = nunpair1Aln = nunpair2Aln = 0;
nunpairRepeat1Aln = nunpairRepeat2Aln = 0;
pairMax = unpair1Max = unpair2Max = false;
assert_gt(p_.khits, 0);
assert_gt(p_.mhits, 0);
if(paired_) {
// Do we have 1 or more concordant alignments to report?
if(exitConcord_ == ReportingState::EXIT_SHORT_CIRCUIT_k) {
// k at random
assert_geq(nconcord_, (uint64_t)p_.khits);
nconcordAln = p_.khits;
return;
} else if(exitConcord_ == ReportingState::EXIT_SHORT_CIRCUIT_M) {
assert(p_.msample);
assert_gt(nconcord_, 0);
pairMax = true; // repetitive concordant alignments
if(p_.mixed) {
unpair1Max = nunpair1_ > (uint64_t)p_.mhits;
unpair2Max = nunpair2_ > (uint64_t)p_.mhits;
}
// Not sure if this is OK
nconcordAln = 1; // 1 at random
return;
} else if(exitConcord_ == ReportingState::EXIT_WITH_ALIGNMENTS) {
assert_gt(nconcord_, 0);
// <= k at random
nconcordAln = min<uint64_t>(p_.khits, nconcord_);
}
assert(!p_.mhitsSet() || nconcord_ <= (uint64_t)p_.mhits+1);
// Do we have a discordant alignment to report?
if(exitDiscord_ == ReportingState::EXIT_WITH_ALIGNMENTS) {
// Report discordant
assert(p_.discord);
ndiscordAln = 1;
return;
}
}
assert_neq(ReportingState::EXIT_SHORT_CIRCUIT_TRUMPED, exitUnpair1_);
assert_neq(ReportingState::EXIT_SHORT_CIRCUIT_TRUMPED, exitUnpair2_);
if((paired_ && !p_.mixed) || nunpair1_ + nunpair2_ == 0) {
// Unpaired alignments either not reportable or nonexistent
return;
}
// Do we have 1 or more alignments for mate #1 to report?
if(exitUnpair1_ == ReportingState::EXIT_SHORT_CIRCUIT_k) {
// k at random
assert_geq(nunpair1_, (uint64_t)p_.khits);
nunpair1Aln = p_.khits;
} else if(exitUnpair1_ == ReportingState::EXIT_SHORT_CIRCUIT_M) {
assert(p_.msample);
assert_gt(nunpair1_, 0);
unpair1Max = true; // repetitive alignments for mate #1
nunpair1Aln = 1; // 1 at random
} else if(exitUnpair1_ == ReportingState::EXIT_WITH_ALIGNMENTS) {
assert_gt(nunpair1_, 0);
// <= k at random
nunpair1Aln = min<uint64_t>(nunpair1_, (uint64_t)p_.khits);
}
assert(!p_.mhitsSet() || paired_ || nunpair1_ <= (uint64_t)p_.mhits+1);
if(p_.repeat) nunpairRepeat1Aln = nunpairRepeat1_;
// Do we have 1 or more alignments for mate #2 to report?
if(exitUnpair2_ == ReportingState::EXIT_SHORT_CIRCUIT_k) {
// k at random
nunpair2Aln = p_.khits;
} else if(exitUnpair2_ == ReportingState::EXIT_SHORT_CIRCUIT_M) {
assert(p_.msample);
assert_gt(nunpair2_, 0);
unpair2Max = true; // repetitive alignments for mate #2
nunpair2Aln = 1; // 1 at random
} else if(exitUnpair2_ == ReportingState::EXIT_WITH_ALIGNMENTS) {
assert_gt(nunpair2_, 0);
// <= k at random
nunpair2Aln = min<uint64_t>(nunpair2_, (uint64_t)p_.khits);
}
assert(!p_.mhitsSet() || paired_ || nunpair2_ <= (uint64_t)p_.mhits+1);
if(p_.repeat) nunpairRepeat2Aln = nunpairRepeat2_;
}
/**
* Given the number of alignments in a category, check whether we
* short-circuited out of the category. Set the done and exit arguments to
* indicate whether and how we short-circuited.
*/
inline void ReportingState::areDone(
uint64_t cnt, // # alignments in category
bool& done, // out: whether we short-circuited out of category
int& exit) const // out: if done, how we short-circuited (-k? -m? etc)
{
assert(!done);
// Have we reached the -k limit?
assert_gt(p_.khits, 0);
assert_gt(p_.mhits, 0);
if(cnt >= (uint64_t)p_.khits && !p_.mhitsSet()) {
done = true;
exit = ReportingState::EXIT_SHORT_CIRCUIT_k;
}
// Have we exceeded the -m or -M limit?
else if(p_.mhitsSet() && cnt > (uint64_t)p_.mhits) {
done = true;
assert(p_.msample);
exit = ReportingState::EXIT_SHORT_CIRCUIT_M;
}
}
#ifdef ALN_SINK_MAIN
#include <iostream>
bool testDones(
const ReportingState& st,
bool done1,
bool done2,
bool done3,
bool done4,
bool done5,
bool done6)
{
assert(st.doneConcordant() == done1);
assert(st.doneDiscordant() == done2);
assert(st.doneUnpaired(true) == done3);
assert(st.doneUnpaired(false) == done4);
assert(st.doneUnpaired() == done5);
assert(st.done() == done6);
assert(st.repOk());
return true;
}
int main(void) {
cerr << "Case 1 (simple unpaired 1) ... ";
{
uint64_t nconcord = 0, ndiscord = 0, nunpair1 = 0, nunpair2 = 0;
bool pairMax = false, unpair1Max = false, unpair2Max = false;
ReportingParams rp(
2, // khits
0, // mhits
0, // pengap
false, // msample
false, // discord
false); // mixed
ReportingState st(rp);
st.nextRead(false); // unpaired read
assert(testDones(st, true, true, false, true, false, false));
st.foundUnpaired(true);
assert(testDones(st, true, true, false, true, false, false));
st.foundUnpaired(true);
assert(testDones(st, true, true, true, true, true, true));
st.finish();
assert(testDones(st, true, true, true, true, true, true));
assert_eq(0, st.numConcordant());
assert_eq(0, st.numDiscordant());
assert_eq(2, st.numUnpaired1());
assert_eq(0, st.numUnpaired2());
assert(st.repOk());
st.getReport(nconcord, ndiscord, nunpair1, nunpair2,
pairMax, unpair1Max, unpair2Max);
assert_eq(0, nconcord);
assert_eq(0, ndiscord);
assert_eq(2, nunpair1);
assert_eq(0, nunpair2);
assert(!pairMax);
assert(!unpair1Max);
assert(!unpair2Max);
}
cerr << "PASSED" << endl;
cerr << "Case 2 (simple unpaired 1) ... ";
{
uint64_t nconcord = 0, ndiscord = 0, nunpair1 = 0, nunpair2 = 0;
bool pairMax = false, unpair1Max = false, unpair2Max = false;
ReportingParams rp(
2, // khits
3, // mhits
0, // pengap
false, // msample
false, // discord
false); // mixed
ReportingState st(rp);
st.nextRead(false); // unpaired read
assert(testDones(st, true, true, false, true, false, false));
st.foundUnpaired(true);
assert(testDones(st, true, true, false, true, false, false));
st.foundUnpaired(true);
assert(testDones(st, true, true, false, true, false, false));
st.foundUnpaired(true);
assert(testDones(st, true, true, false, true, false, false));
st.foundUnpaired(true);
assert(testDones(st, true, true, true, true, true, true));
assert_eq(0, st.numConcordant());
assert_eq(0, st.numDiscordant());
assert_eq(4, st.numUnpaired1());
assert_eq(0, st.numUnpaired2());
st.finish();
assert(testDones(st, true, true, true, true, true, true));
assert_eq(0, st.numConcordant());
assert_eq(0, st.numDiscordant());
assert_eq(4, st.numUnpaired1());
assert_eq(0, st.numUnpaired2());
assert(st.repOk());
st.getReport(nconcord, ndiscord, nunpair1, nunpair2,
pairMax, unpair1Max, unpair2Max);
assert_eq(0, nconcord);
assert_eq(0, ndiscord);
assert_eq(0, nunpair1);
assert_eq(0, nunpair2);
assert(!pairMax);
assert(unpair1Max);
assert(!unpair2Max);
}
cerr << "PASSED" << endl;
cerr << "Case 3 (simple paired 1) ... ";
{
uint64_t nconcord = 0, ndiscord = 0, nunpair1 = 0, nunpair2 = 0;
bool pairMax = false, unpair1Max = false, unpair2Max = false;
ReportingParams rp(
2, // khits
3, // mhits
0, // pengap
false, // msample
false, // discord
false); // mixed
ReportingState st(rp);
st.nextRead(true); // paired-end read
assert(testDones(st, false, true, true, true, true, false));
st.foundUnpaired(true);
assert(testDones(st, false, true, true, true, true, false));
st.foundUnpaired(true);
assert(testDones(st, false, true, true, true, true, false));
st.foundUnpaired(true);
assert(testDones(st, false, true, true, true, true, false));
st.foundUnpaired(true);
assert(testDones(st, false, true, true, true, true, false));
st.foundUnpaired(false);
assert(testDones(st, false, true, true, true, true, false));
st.foundUnpaired(false);
assert(testDones(st, false, true, true, true, true, false));
st.foundUnpaired(false);
assert(testDones(st, false, true, true, true, true, false));
st.foundUnpaired(false);
assert(testDones(st, false, true, true, true, true, false));
st.foundConcordant();
assert(testDones(st, false, true, true, true, true, false));
st.foundConcordant();
assert(testDones(st, false, true, true, true, true, false));
st.foundConcordant();
assert(testDones(st, false, true, true, true, true, false));
st.foundConcordant();
assert(testDones(st, true, true, true, true, true, true));
assert_eq(4, st.numConcordant());
assert_eq(0, st.numDiscordant());
assert_eq(4, st.numUnpaired1());
assert_eq(4, st.numUnpaired2());
st.finish();
assert(testDones(st, true, true, true, true, true, true));
assert_eq(4, st.numConcordant());
assert_eq(0, st.numDiscordant());
assert_eq(4, st.numUnpaired1());
assert_eq(4, st.numUnpaired2());
assert(st.repOk());
st.getReport(nconcord, ndiscord, nunpair1, nunpair2,
pairMax, unpair1Max, unpair2Max);
assert_eq(0, nconcord);
assert_eq(0, ndiscord);
assert_eq(0, nunpair1);
assert_eq(0, nunpair2);
assert(pairMax);
assert(!unpair1Max); // because !mixed
assert(!unpair2Max); // because !mixed
}
cerr << "PASSED" << endl;
cerr << "Case 4 (simple paired 2) ... ";
{
uint64_t nconcord = 0, ndiscord = 0, nunpair1 = 0, nunpair2 = 0;
bool pairMax = false, unpair1Max = false, unpair2Max = false;
ReportingParams rp(
2, // khits
3, // mhits
0, // pengap
false, // msample
true, // discord
true); // mixed
ReportingState st(rp);
st.nextRead(true); // paired-end read
assert(testDones(st, false, false, false, false, false, false));
st.foundUnpaired(true);
assert(testDones(st, false, false, false, false, false, false));
st.foundUnpaired(true);
assert(testDones(st, false, true, false, false, false, false));
st.foundUnpaired(true);
assert(testDones(st, false, true, false, false, false, false));
st.foundUnpaired(true);
assert(testDones(st, false, true, true, false, false, false));
st.foundUnpaired(false);
assert(testDones(st, false, true, true, false, false, false));
st.foundUnpaired(false);
assert(testDones(st, false, true, true, false, false, false));
st.foundUnpaired(false);
assert(testDones(st, false, true, true, false, false, false));
st.foundUnpaired(false);
assert(testDones(st, false, true, true, true, true, false));
st.foundConcordant();
assert(testDones(st, false, true, true, true, true, false));
st.foundConcordant();
assert(testDones(st, false, true, true, true, true, false));
st.foundConcordant();
assert(testDones(st, false, true, true, true, true, false));
st.foundConcordant();
assert(testDones(st, true, true, true, true, true, true));
assert_eq(4, st.numConcordant());
assert_eq(0, st.numDiscordant());
assert_eq(4, st.numUnpaired1());
assert_eq(4, st.numUnpaired2());
st.finish();
assert(testDones(st, true, true, true, true, true, true));
assert_eq(4, st.numConcordant());
assert_eq(0, st.numDiscordant());
assert_eq(4, st.numUnpaired1());
assert_eq(4, st.numUnpaired2());
assert(st.repOk());
st.getReport(nconcord, ndiscord, nunpair1, nunpair2,
pairMax, unpair1Max, unpair2Max);
assert_eq(0, nconcord);
assert_eq(0, ndiscord);
assert_eq(0, nunpair1);
assert_eq(0, nunpair2);
assert(pairMax);
assert(unpair1Max);
assert(unpair2Max);
}
cerr << "PASSED" << endl;
cerr << "Case 5 (potential discordant after concordant) ... ";
{
uint64_t nconcord = 0, ndiscord = 0, nunpair1 = 0, nunpair2 = 0;
bool pairMax = false, unpair1Max = false, unpair2Max = false;
ReportingParams rp(
2, // khits
3, // mhits
0, // pengap
false, // msample
true, // discord
true); // mixed
ReportingState st(rp);
st.nextRead(true);
assert(testDones(st, false, false, false, false, false, false));
st.foundUnpaired(true);
st.foundUnpaired(false);
st.foundConcordant();
assert(testDones(st, false, true, false, false, false, false));
st.finish();
assert(testDones(st, true, true, true, true, true, true));
assert_eq(1, st.numConcordant());
assert_eq(0, st.numDiscordant());
assert_eq(1, st.numUnpaired1());
assert_eq(1, st.numUnpaired2());
assert(st.repOk());
st.getReport(nconcord, ndiscord, nunpair1, nunpair2,
pairMax, unpair1Max, unpair2Max);
assert_eq(1, nconcord);
assert_eq(0, ndiscord);
assert_eq(0, nunpair1);
assert_eq(0, nunpair2);
assert(!pairMax);
assert(!unpair1Max);
assert(!unpair2Max);
}
cerr << "PASSED" << endl;
cerr << "Case 6 (true discordant) ... ";
{
uint64_t nconcord = 0, ndiscord = 0, nunpair1 = 0, nunpair2 = 0;
bool pairMax = false, unpair1Max = false, unpair2Max = false;
ReportingParams rp(
2, // khits
3, // mhits
0, // pengap
false, // msample
true, // discord
true); // mixed
ReportingState st(rp);
st.nextRead(true);
assert(testDones(st, false, false, false, false, false, false));
st.foundUnpaired(true);
st.foundUnpaired(false);
assert(testDones(st, false, false, false, false, false, false));
st.finish();
assert(testDones(st, true, true, true, true, true, true));
assert_eq(0, st.numConcordant());
assert_eq(1, st.numDiscordant());
assert_eq(0, st.numUnpaired1());
assert_eq(0, st.numUnpaired2());
assert(st.repOk());
st.getReport(nconcord, ndiscord, nunpair1, nunpair2,
pairMax, unpair1Max, unpair2Max);
assert_eq(0, nconcord);
assert_eq(1, ndiscord);
assert_eq(0, nunpair1);
assert_eq(0, nunpair2);
assert(!pairMax);
assert(!unpair1Max);
assert(!unpair2Max);
}
cerr << "PASSED" << endl;
cerr << "Case 7 (unaligned pair & uniquely aligned mate, mixed-mode) ... ";
{
uint64_t nconcord = 0, ndiscord = 0, nunpair1 = 0, nunpair2 = 0;
bool pairMax = false, unpair1Max = false, unpair2Max = false;
ReportingParams rp(
1, // khits
1, // mhits
0, // pengap
false, // msample
true, // discord
true); // mixed
ReportingState st(rp);
st.nextRead(true); // paired read
// Argument order of testDones(st, done1, ..., done6):
// assert(st.doneConcordant() == done1);
// assert(st.doneDiscordant() == done2);
// assert(st.doneUnpaired(true) == done3);
// assert(st.doneUnpaired(false) == done4);
// assert(st.doneUnpaired() == done5);
// assert(st.done() == done6);
st.foundUnpaired(true);
assert(testDones(st, false, false, false, false, false, false));
st.foundUnpaired(true);
assert(testDones(st, false, true, true, false, false, false));
assert_eq(0, st.numConcordant());
assert_eq(0, st.numDiscordant());
assert_eq(2, st.numUnpaired1());
assert_eq(0, st.numUnpaired2());
st.finish();
st.getReport(nconcord, ndiscord, nunpair1, nunpair2,
pairMax, unpair1Max, unpair2Max);
assert_eq(0, nconcord);
assert_eq(0, ndiscord);
assert_eq(0, nunpair1);
assert_eq(0, nunpair2);
assert(!pairMax);
assert(unpair1Max);
assert(!unpair2Max);
}
cerr << "PASSED" << endl;
cerr << "Case 8 (unaligned pair & uniquely aligned mate, NOT mixed-mode) ... ";
{
uint64_t nconcord = 0, ndiscord = 0, nunpair1 = 0, nunpair2 = 0;
bool pairMax = false, unpair1Max = false, unpair2Max = false;
ReportingParams rp(
1, // khits
1, // mhits
0, // pengap
false, // msample
true, // discord
false); // mixed
ReportingState st(rp);
st.nextRead(true); // paired read
// Argument order of testDones(st, done1, ..., done6):
// assert(st.doneConcordant() == done1);
// assert(st.doneDiscordant() == done2);
// assert(st.doneUnpaired(true) == done3);
// assert(st.doneUnpaired(false) == done4);
// assert(st.doneUnpaired() == done5);
// assert(st.done() == done6);
st.foundUnpaired(true);
assert(testDones(st, false, false, true, true, true, false));
st.foundUnpaired(true);
assert(testDones(st, false, true, true, true, true, false));
assert_eq(0, st.numConcordant());
assert_eq(0, st.numDiscordant());
assert_eq(2, st.numUnpaired1());
assert_eq(0, st.numUnpaired2());
st.finish();
st.getReport(nconcord, ndiscord, nunpair1, nunpair2,
pairMax, unpair1Max, unpair2Max);
assert_eq(0, nconcord);
assert_eq(0, ndiscord);
assert_eq(0, nunpair1);
assert_eq(0, nunpair2);
assert(!pairMax);
assert(!unpair1Max); // not really relevant
assert(!unpair2Max); // not really relevant
}
cerr << "PASSED" << endl;
cerr << "Case 9 (repetitive pair, only one mate repetitive) ... ";
{
uint64_t nconcord = 0, ndiscord = 0, nunpair1 = 0, nunpair2 = 0;
bool pairMax = false, unpair1Max = false, unpair2Max = false;
ReportingParams rp(
1, // khits
1, // mhits
0, // pengap
true, // msample
true, // discord
true); // mixed
ReportingState st(rp);
st.nextRead(true); // paired read
// Argument order of testDones(st, done1, ..., done6):
// assert(st.doneConcordant() == done1);
// assert(st.doneDiscordant() == done2);
// assert(st.doneUnpaired(true) == done3);
// assert(st.doneUnpaired(false) == done4);
// assert(st.doneUnpaired() == done5);
// assert(st.done() == done6);
st.foundConcordant();
assert(st.repOk());
st.foundUnpaired(true);
assert(st.repOk());
st.foundUnpaired(false);
assert(st.repOk());
assert(testDones(st, false, true, false, false, false, false));
assert(st.repOk());
st.foundConcordant();
assert(st.repOk());
st.foundUnpaired(true);
assert(st.repOk());
assert(testDones(st, true, true, true, false, false, false));
assert_eq(2, st.numConcordant());
assert_eq(0, st.numDiscordant());
assert_eq(2, st.numUnpaired1());
assert_eq(1, st.numUnpaired2());
st.foundUnpaired(false);
assert(st.repOk());
assert(testDones(st, true, true, true, true, true, true));
assert_eq(2, st.numConcordant());
assert_eq(0, st.numDiscordant());
assert_eq(2, st.numUnpaired1());
assert_eq(2, st.numUnpaired2());
st.finish();
st.getReport(nconcord, ndiscord, nunpair1, nunpair2,
pairMax, unpair1Max, unpair2Max);
assert_eq(1, nconcord);
assert_eq(0, ndiscord);
assert_eq(0, nunpair1);
assert_eq(0, nunpair2);
assert(pairMax);
assert(unpair1Max); // not really relevant
assert(unpair2Max); // not really relevant
}
cerr << "PASSED" << endl;
}
#endif /*def ALN_SINK_MAIN*/

4384
aln_sink.h Normal file

File diff suppressed because it is too large

536
alphabet.cpp Normal file

@ -0,0 +1,536 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include <stdint.h>
#include <cassert>
#include <string>
#include "alphabet.h"
using namespace std;
/**
* Mapping from ASCII characters to DNA categories:
*
* 0 = invalid - error
* 1 = DNA
* 2 = IUPAC (ambiguous DNA)
* 3 = not an error, but unmatchable; alignments containing this
* character are invalid
*/
uint8_t asc2dnacat[] = {
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,
/* - */
/* 48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 64 */ 0, 1, 2, 1, 2, 0, 0, 1, 2, 0, 0, 2, 0, 2, 2, 0,
/* A B C D G H K M N */
/* 80 */ 0, 0, 2, 2, 1, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0,
/* R S T V W X Y */
/* 96 */ 0, 1, 2, 1, 2, 0, 0, 1, 2, 0, 0, 2, 0, 2, 2, 0,
/* a b c d g h k m n */
/* 112 */ 0, 0, 2, 2, 1, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0,
/* r s t v w x y */
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};
// 5-bit pop count
int mask2popcnt[] = {
0, 1, 1, 2, 1, 2, 2, 3,
1, 2, 2, 3, 2, 3, 3, 4,
1, 2, 2, 3, 2, 3, 3, 4,
2, 3, 3, 4, 3, 4, 4, 5
};
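// A minimal sketch, not part of the original file: the entries of the
// table above can be cross-checked by counting set bits directly.
// popCount5 is a hypothetical helper name introduced here purely for
// illustration.

```cpp
#include <cassert>

// Count the set bits in the low 5 bits of mask (hypothetical helper).
static inline int popCount5(int mask) {
    assert(mask >= 0 && mask < 32);
    int count = 0;
    for (int bit = 0; bit < 5; ++bit) {
        if (mask & (1 << bit)) ++count;
    }
    return count;
}
```

// For any mask in [0, 31], popCount5(mask) should agree with the
// corresponding table entry; the lookup table simply avoids the loop.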
/**
* Mapping from masks to ASCII characters for ambiguous nucleotides.
*/
char mask2dna[] = {
'?', // 0
'A', // 1
'C', // 2
'M', // 3
'G', // 4
'R', // 5
'S', // 6
'V', // 7
'T', // 8
'W', // 9
'Y', // 10
'H', // 11
'K', // 12
'D', // 13
'B', // 14
'N', // 15 (inclusive N)
'N' // 16 (exclusive N)
};
/**
* Mapping from ASCII characters for ambiguous nucleotides into masks:
*/
uint8_t asc2dnamask[] = {
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 64 */ 0, 1,14, 2,13, 0, 0, 4,11, 0, 0,12, 0, 3,15, 0,
/* A B C D G H K M N */
/* 80 */ 0, 0, 5, 6, 8, 0, 7, 9, 0,10, 0, 0, 0, 0, 0, 0,
/* R S T V W Y */
/* 96 */ 0, 1,14, 2,13, 0, 0, 4,11, 0, 0,12, 0, 3,15, 0,
/* a b c d g h k m n */
/* 112 */ 0, 0, 5, 6, 8, 0, 7, 9, 0,10, 0, 0, 0, 0, 0, 0,
/* r s t v w y */
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};
/**
* Convert a pair of DNA masks to a color mask.
*/
uint8_t dnamasks2colormask[16][16] = {
/* 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 */
/* 0 */ { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 },
/* 1 */ { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
/* 2 */ { 0, 2, 1, 3, 8, 10, 9, 11, 4, 6, 5, 7, 12, 14, 13, 15 },
/* 3 */ { 0, 3, 3, 3, 12, 15, 15, 15, 12, 15, 15, 15, 12, 15, 15, 15 },
/* 4 */ { 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15 },
/* 5 */ { 0, 5, 10, 15, 5, 5, 15, 15, 10, 15, 10, 15, 15, 15, 15, 15 },
/* 6 */ { 0, 6, 9, 15, 9, 15, 9, 15, 6, 6, 15, 15, 15, 15, 15, 15 },
/* 7 */ { 0, 7, 11, 15, 13, 15, 15, 15, 14, 15, 15, 15, 15, 15, 15, 15 },
/* 8 */ { 0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15 },
/* 9 */ { 0, 9, 6, 15, 6, 15, 6, 15, 9, 9, 15, 15, 15, 15, 15, 15 },
/* 10 */ { 0, 10, 5, 15, 10, 10, 15, 15, 5, 15, 5, 15, 15, 15, 15, 15 },
/* 11 */ { 0, 11, 7, 15, 14, 15, 15, 15, 13, 15, 15, 15, 15, 15, 15, 15 },
/* 12 */ { 0, 12, 12, 12, 3, 15, 15, 15, 3, 15, 15, 15, 3, 15, 15, 15 },
/* 13 */ { 0, 13, 14, 15, 7, 15, 15, 15, 11, 15, 15, 15, 15, 15, 15, 15 },
/* 14 */ { 0, 14, 13, 15, 11, 15, 15, 15, 7, 15, 15, 15, 15, 15, 15, 15 },
/* 15 */ { 0, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15 }
};
/**
* Mapping from ASCII DNA characters, including IUPAC codes, to the
* ASCII characters of their complements:
*/
char asc2dnacomp[] = {
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,'-', 0, 0,
/* 48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 64 */ 0,'T','V','G','H', 0, 0,'C','D', 0, 0,'M', 0,'K','N', 0,
/* A B C D G H K M N */
/* 80 */ 0, 0,'Y','S','A', 0,'B','W', 0,'R', 0, 0, 0, 0, 0, 0,
/* R S T V W Y */
/* 96 */ 0,'T','V','G','H', 0, 0,'C','D', 0, 0,'M', 0,'K','N', 0,
/* a b c d g h k m n */
/* 112 */ 0, 0,'Y','S','A', 0,'B','W', 0,'R', 0, 0, 0, 0, 0, 0,
/* r s t v w y */
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};
/**
* Mapping from ASCII color characters to ASCII DNA characters:
*/
char col2dna[] = {
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,'-','N', 0,
/* - . */
/* 48 */'A','C','G','T','N', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 0 1 2 3 4 */
/* 64 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 80 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 96 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 112 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};
/**
* Mapping from ASCII DNA characters to ASCII color characters:
*/
char dna2col[] = {
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,'-', 0, 0,
/* 48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 64 */ 0,'0', 0,'1', 0, 0, 0,'2', 0, 0, 0, 0, 0, 0,'4', 0,
/* A C G N */
/* 80 */ 0, 0, 0, 0,'3', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* T */
/* 96 */ 0,'0', 0,'1', 0, 0, 0,'2', 0, 0, 0, 0, 0, 0,'4', 0,
/* a c g n */
/* 112 */ 0, 0, 0, 0,'3', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* t */
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};
/**
* Mapping from ASCII DNA characters, including IUPAC codes, to strings
* listing the colors they can stand for, separated by '|':
*/
const char* dna2colstr[] = {
/* 0 */ "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?",
/* 16 */ "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?",
/* 32 */ "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "-", "?", "?",
/* 48 */ "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?",
/* 64 */ "?", "0","1|2|3","1","0|2|3","?", "?", "2","0|1|3","?", "?", "2|3", "?", "0|1", ".", "?",
/* A B C D G H K M N */
/* 80 */ "?", "?", "0|2","1|2", "3", "?","0|1|2","0|3","?", "1|3", "?", "?", "?", "?", "?", "?",
/* R S T V W Y */
/* 96 */ "?", "0","1|2|3","1","0|2|3","?", "?", "2","0|1|3","?", "?", "2|3", "?", "0|1", ".", "?",
/* a b c d g h k m n */
/* 112 */ "?", "?", "0|2","1|2", "3", "?","0|1|2","0|3","?", "1|3", "?", "?", "?", "?", "?", "?",
/* r s t v w y */
/* 128 */ "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?",
/* 144 */ "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?",
/* 160 */ "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?",
/* 176 */ "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?",
/* 192 */ "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?",
/* 208 */ "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?",
/* 224 */ "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?",
/* 240 */ "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?", "?"
};
/**
* Mapping from ASCII characters to color categories:
*
* 0 = invalid - error
* 1 = valid color
* 2 = IUPAC (ambiguous DNA) - there is no such thing for colors to my
* knowledge
* 3 = not an error, but unmatchable; alignments containing this
* character are invalid
*/
uint8_t asc2colcat[] = {
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0,
/* - . */
/* 48 */ 1, 1, 1, 1, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 0 1 2 3 4 */
/* 64 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 80 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 96 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 112 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};
/**
* Set the category for all IUPAC codes. By default they're in
* category 2 (IUPAC), but sometimes we'd like to put them in category
* 3 (unmatchable), for example.
*/
void setIupacsCat(uint8_t cat) {
assert(cat < 4);
asc2dnacat[(int)'B'] = asc2dnacat[(int)'b'] =
asc2dnacat[(int)'D'] = asc2dnacat[(int)'d'] =
asc2dnacat[(int)'H'] = asc2dnacat[(int)'h'] =
asc2dnacat[(int)'K'] = asc2dnacat[(int)'k'] =
asc2dnacat[(int)'M'] = asc2dnacat[(int)'m'] =
asc2dnacat[(int)'N'] = asc2dnacat[(int)'n'] =
asc2dnacat[(int)'R'] = asc2dnacat[(int)'r'] =
asc2dnacat[(int)'S'] = asc2dnacat[(int)'s'] =
asc2dnacat[(int)'V'] = asc2dnacat[(int)'v'] =
asc2dnacat[(int)'W'] = asc2dnacat[(int)'w'] =
asc2dnacat[(int)'X'] = asc2dnacat[(int)'x'] =
asc2dnacat[(int)'Y'] = asc2dnacat[(int)'y'] = cat;
}
/// For converting from ASCII to the Dna5 code where A=0, C=1, G=2,
/// T=3, N=4
uint8_t asc2dna[] = {
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 64 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
/* A C G N */
/* 80 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* T */
/* 96 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
/* a c g n */
/* 112 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* t */
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};
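/// Two copies of the ASCII-to-Dna5 table (A=0, C=1, G=2, T=3, N=4);
/// both are initialized identically to asc2dna.
uint8_t asc2dna_3N[2][256] = {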
{
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 64 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
/* A C G N */
/* 80 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* T */
/* 96 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
/* a c g n */
/* 112 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* t */
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
},
{
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 64 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
/* A C G N */
/* 80 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* T */
/* 96 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
/* a c g n */
/* 112 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* t */
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
}
};
// this is only used in BASE_CHANGE case
uint8_t asc2dna_1[] = {
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 64 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
/* A C G N */
/* 80 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* T */
/* 96 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
/* a c g n */
/* 112 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* t */
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};
uint8_t asc2dna_2[] = {
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 48 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 64 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
/* A C G N */
/* 80 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* T */
/* 96 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
/* a c g n */
/* 112 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* t */
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};
/// Convert an ascii char representing a base or a color to a 2-bit
/// code: 0=A,0; 1=C,1; 2=G,2; 3=T,3; 4=N,.
uint8_t asc2dnaOrCol[] = {
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0,
/* - . */
/* 48 */ 0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 0 1 2 3 */
/* 64 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
/* A C G N */
/* 80 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* T */
/* 96 */ 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0,
/* a c g n */
/* 112 */ 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* t */
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};
/// For converting from an ASCII color character to its numeric code:
/// '0'=0, '1'=1, '2'=2, '3'=3; '-' and '.' map to 4
uint8_t asc2col[] = {
/* 0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 16 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 32 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0,
/* - . */
/* 48 */ 0, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 0 1 2 3 */
/* 64 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 80 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 96 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 112 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 128 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 144 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 160 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 176 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 192 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 208 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 224 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 240 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};
/**
* Convert a nucleotide and a color to the paired nucleotide. Indexed
* first by nucleotide then by color. Note that this is exactly the
* same as the dinuc2color array.
*/
uint8_t nuccol2nuc[5][5] = {
/* B G O R . */
/* A */ {0, 1, 2, 3, 4},
/* C */ {1, 0, 3, 2, 4},
/* G */ {2, 3, 0, 1, 4},
/* T */ {3, 2, 1, 0, 4},
/* N */ {4, 4, 4, 4, 4}
};
/**
* Convert a pair of nucleotides to a color.
*/
uint8_t dinuc2color[5][5] = {
/* A */ {0, 1, 2, 3, 4},
/* C */ {1, 0, 3, 2, 4},
/* G */ {2, 3, 0, 1, 4},
/* T */ {3, 2, 1, 0, 4},
/* N */ {4, 4, 4, 4, 4}
};
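// A minimal sketch, not part of the original file: for the four
// unambiguous bases, both 5x5 tables above coincide with a bitwise XOR
// of the two codes, and any N (code 4) yields 4. dinucToColor is a
// hypothetical helper name used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Color for a pair of Dna5 codes (A=0, C=1, G=2, T=3, N=4).
// Hypothetical helper; mirrors the dinuc2color table entry by entry.
static inline uint8_t dinucToColor(uint8_t a, uint8_t b) {
    assert(a <= 4 && b <= 4);
    if (a == 4 || b == 4) return 4;  // any N gives the '.' color
    return static_cast<uint8_t>(a ^ b);
}
```

// Because XOR is its own inverse, the same function also recovers the
// paired nucleotide from a nucleotide and a color, which is why
// nuccol2nuc and dinuc2color hold identical values.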
/// Convert bit encoded DNA char to its complement
int dnacomp[5] = {
3, 2, 1, 0, 4
};
const char *iupacs = "!ACMGRSVTWYHKDBN!acmgrsvtwyhkdbn";
char mask2iupac[16] = {
-1,
'A', // 0001
'C', // 0010
'M', // 0011
'G', // 0100
'R', // 0101
'S', // 0110
'V', // 0111
'T', // 1000
'W', // 1001
'Y', // 1010
'H', // 1011
'K', // 1100
'D', // 1101
'B', // 1110
'N', // 1111
};
int maskcomp[16] = {
0, // 0000 (!) -> 0000 (!)
8, // 0001 (A) -> 1000 (T)
4, // 0010 (C) -> 0100 (G)
12, // 0011 (M) -> 1100 (K)
2, // 0100 (G) -> 0010 (C)
10, // 0101 (R) -> 1010 (Y)
6, // 0110 (S) -> 0110 (S)
14, // 0111 (V) -> 1110 (B)
1, // 1000 (T) -> 0001 (A)
9, // 1001 (W) -> 1001 (W)
5, // 1010 (Y) -> 0101 (R)
13, // 1011 (H) -> 1101 (D)
3, // 1100 (K) -> 0011 (M)
11, // 1101 (D) -> 1011 (H)
7, // 1110 (B) -> 0111 (V)
15, // 1111 (N) -> 1111 (N)
};
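// A minimal sketch, not part of the original file: with bit 0 = A,
// bit 1 = C, bit 2 = G and bit 3 = T, complementing a mask swaps
// A<->T and C<->G, which amounts to reversing the order of the four
// bits. complementMask is a hypothetical helper name used only here.

```cpp
#include <cassert>

// Complement a 4-bit IUPAC mask by reversing its bit order.
// Hypothetical helper; intended to reproduce the maskcomp table.
static inline int complementMask(int mask) {
    assert(mask >= 0 && mask < 16);
    int out = 0;
    for (int bit = 0; bit < 4; ++bit) {
        if (mask & (1 << bit)) out |= 1 << (3 - bit);
    }
    return out;
}
```

// Self-complementary masks such as S (0110) and W (1001) are
// palindromic bit patterns, so they map to themselves.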

199
alphabet.h Normal file

@ -0,0 +1,199 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALPHABETS_H_
#define ALPHABETS_H_
#include <iostream>
#include <stdexcept>
#include <string>
#include <sstream>
#include <stdint.h>
#include "assert_helpers.h"
using namespace std;
/// Convert an ascii char to a DNA category. Categories are:
/// 0 -> invalid
/// 1 -> unambiguous a, c, g or t
/// 2 -> ambiguous
/// 3 -> unmatchable
extern uint8_t asc2dnacat[];
/// Convert masks to ambiguous nucleotides
extern char mask2dna[];
/// Convert ambiguous ASCII nucleotide to mask
extern uint8_t asc2dnamask[];
/// Convert mask to the number of alternatives in the mask
extern int mask2popcnt[];
/// Convert an ascii char to a 2-bit base: 0=A, 1=C, 2=G, 3=T, 4=N
extern uint8_t asc2dna[];
/// Convert an ascii char representing a base or a color to a 2-bit
/// code: 0=A,0; 1=C,1; 2=G,2; 3=T,3; 4=N,.
extern uint8_t asc2dnaOrCol[];
/// Convert a pair of DNA masks to a color mask
extern uint8_t dnamasks2colormask[16][16];
/// Convert an ascii char to a color category. Categories are:
/// 0 -> invalid
/// 1 -> unambiguous 0, 1, 2 or 3
/// 2 -> ambiguous (not applicable for colors)
/// 3 -> unmatchable
extern uint8_t asc2colcat[];
/// Convert an ascii color char to its code: '0'=0, '1'=1, '2'=2, '3'=3, '.'=4
extern uint8_t asc2col[];
/// Convert an ascii char to its DNA complement, including IUPACs
extern char asc2dnacomp[];
/// Convert a pair of 2-bit (and 4=N) encoded DNA bases to a color
extern uint8_t dinuc2color[5][5];
/// Convert a 2-bit nucleotide (and 4=N) and a color to the
/// corresponding 2-bit nucleotide
extern uint8_t nuccol2nuc[5][5];
/// Convert a 4-bit mask into an IUPAC code
extern char mask2iupac[16];
/// Convert an ascii color to an ascii dna char
extern char col2dna[];
/// Convert an ascii dna to a color char
extern char dna2col[];
/// Convert an ascii dna char (including IUPACs) to a string of its possible colors
extern const char* dna2colstr[];
/// Convert bit encoded DNA char to its complement
extern int dnacomp[5];
/// String of all DNA and IUPAC characters
extern const char *iupacs;
/// Map from masks to their complement masks
extern int maskcomp[16];
/**
* Return true iff c is a Dna character.
*/
static inline bool isDna(char c) {
return asc2dnacat[(int)c] > 0;
}
/**
* Return true iff c is a color character.
*/
static inline bool isColor(char c) {
return asc2colcat[(int)c] > 0;
}
/**
* Return true iff c is an ambiguous Dna character.
*/
static inline bool isAmbigNuc(char c) {
return asc2dnacat[(int)c] == 2;
}
/**
* Return true iff c is an ambiguous color character.
*/
static inline bool isAmbigColor(char c) {
return asc2colcat[(int)c] == 2;
}
/**
* Return true iff c is an ambiguous character.
*/
static inline bool isAmbig(char c, bool color) {
return (color ? asc2colcat[(int)c] : asc2dnacat[(int)c]) == 2;
}
/**
* Return true iff c is an unambiguous DNA character.
*/
static inline bool isUnambigNuc(char c) {
return asc2dnacat[(int)c] == 1;
}
/**
* Return the DNA complement of the given ASCII char.
*/
static inline char comp(char c) {
switch(c) {
case 'a': return 't';
case 'A': return 'T';
case 'c': return 'g';
case 'C': return 'G';
case 'g': return 'c';
case 'G': return 'C';
case 't': return 'a';
case 'T': return 'A';
default: return c;
}
}
/**
* Return the complement of a bit-encoded nucleotide.
*/
static inline int compDna(int c) {
assert_leq(c, 4);
return dnacomp[c];
}
/**
* Return true iff c is an unambiguous Dna character.
*/
static inline bool isUnambigDna(char c) {
return asc2dnacat[(int)c] == 1;
}
/**
* Return true iff c is an unambiguous color character (0,1,2,3).
*/
static inline bool isUnambigColor(char c) {
return asc2colcat[(int)c] == 1;
}
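/**
* Decode a possibly ambiguous (IUPAC) uppercase nucleotide. On return,
* num is the number of bases c can stand for and alts[0..num-1] holds
* their codes (A=0, C=1, G=2, T=3). alts must have room for 4 entries.
*/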
static inline void decodeNuc(char c , int& num, int *alts) {
switch(c) {
case 'A': alts[0] = 0; num = 1; break;
case 'C': alts[0] = 1; num = 1; break;
case 'G': alts[0] = 2; num = 1; break;
case 'T': alts[0] = 3; num = 1; break;
case 'M': alts[0] = 0; alts[1] = 1; num = 2; break;
case 'R': alts[0] = 0; alts[1] = 2; num = 2; break;
case 'W': alts[0] = 0; alts[1] = 3; num = 2; break;
case 'S': alts[0] = 1; alts[1] = 2; num = 2; break;
case 'Y': alts[0] = 1; alts[1] = 3; num = 2; break;
case 'K': alts[0] = 2; alts[1] = 3; num = 2; break;
case 'V': alts[0] = 0; alts[1] = 1; alts[2] = 2; num = 3; break;
case 'H': alts[0] = 0; alts[1] = 1; alts[2] = 3; num = 3; break;
case 'D': alts[0] = 0; alts[1] = 2; alts[2] = 3; num = 3; break;
case 'B': alts[0] = 1; alts[1] = 2; alts[2] = 3; num = 3; break;
case 'N': alts[0] = 0; alts[1] = 1; alts[2] = 2; alts[3] = 3; num = 4; break;
default: {
std::cerr << "Bad IUPAC code: " << c << ", (int: " << (int)c << ")" << std::endl;
throw std::runtime_error("");
}
}
}
extern void setIupacsCat(uint8_t cat);
#endif /*ALPHABETS_H_*/
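// A minimal sketch, not part of the original header: the switch in
// decodeNuc can also be derived from the 4-bit mask encoding, where a
// character's position in the string "!ACMGRSVTWYHKDBN" is its mask.
// iupacToMask and decodeAlternatives are hypothetical helper names
// introduced for illustration; uppercase input is assumed.

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>
#include <vector>

// Return the 4-bit mask (bit 0 = A, 1 = C, 2 = G, 3 = T) of an
// uppercase IUPAC code, or throw if c is not one.
static inline int iupacToMask(char c) {
    static const char* codes = "!ACMGRSVTWYHKDBN";
    // Skip the '!' placeholder at index 0; reject the NUL terminator.
    const char* p = (c == '\0') ? nullptr : std::strchr(codes + 1, c);
    if (p == nullptr) throw std::runtime_error("Bad IUPAC code");
    return static_cast<int>(p - codes);
}

// List the base codes (A=0, C=1, G=2, T=3) that c can stand for.
static inline std::vector<int> decodeAlternatives(char c) {
    int mask = iupacToMask(c);
    assert(mask > 0 && mask < 16);
    std::vector<int> alts;
    for (int base = 0; base < 4; ++base) {
        if (mask & (1 << base)) alts.push_back(base);
    }
    return alts;
}
```

// The alternatives come out in ascending base order, matching the
// order in which decodeNuc fills alts.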

294
alt.h Normal file

@ -0,0 +1,294 @@
/*
* Copyright 2015, Daehwan Kim <infphilo@gmail.com>
*
* This file is part of HISAT 2.
*
* HISAT 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* HISAT 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with HISAT 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALT_H_
#define ALT_H_
#include <iostream>
#include <fstream>
#include <limits>
#include "assert_helpers.h"
#include "word_io.h"
#include "mem_ids.h"
using namespace std;
enum ALT_TYPE {
ALT_NONE = 0,
ALT_SNP_SGL, // single nucleotide substitution
ALT_SNP_INS, // small insertion wrt reference genome
ALT_SNP_DEL, // small deletion wrt reference genome
ALT_SNP_ALT, // alternative sequence (to be implemented ...)
ALT_SPLICESITE,
ALT_EXON
};
template <typename index_t>
struct ALT {
ALT() {
reset();
}
void reset() {
type = ALT_NONE;
pos = len = 0;
seq = 0;
}
ALT_TYPE type;
union {
index_t pos;
index_t left;
};
union {
index_t len;
index_t right;
};
union {
uint64_t seq; // used to store 32 bp, but it can be used to store a pointer to EList<uint64_t>
struct {
union {
bool fw;
bool reversed;
};
bool excluded;
};
};
public:
// in order to support a sequence longer than 32 bp
bool snp() const { return type == ALT_SNP_SGL || type == ALT_SNP_DEL || type == ALT_SNP_INS; }
bool splicesite() const { return type == ALT_SPLICESITE; }
bool mismatch() const { return type == ALT_SNP_SGL; }
bool gap() const { return type == ALT_SNP_DEL || type == ALT_SNP_INS || type == ALT_SPLICESITE; }
bool deletion() const { return type == ALT_SNP_DEL; }
bool insertion() const { return type == ALT_SNP_INS; }
bool exon() const { return type == ALT_EXON; }
bool operator< (const ALT& o) const {
if(pos != o.pos) return pos < o.pos;
if(type != o.type) {
if(type == ALT_NONE || o.type == ALT_NONE) {
return type == ALT_NONE;
}
if(type == ALT_SNP_INS) return true;
else if(o.type == ALT_SNP_INS) return false;
return type < o.type;
}
if(len != o.len) return len < o.len;
if(seq != o.seq) return seq < o.seq;
return false;
}
bool compatibleWith(const ALT& o) const {
if(pos == o.pos) return false;
// sort the two SNPs
const ALT& a = (pos < o.pos ? *this : o);
const ALT& b = (pos < o.pos ? o : *this);
if(a.snp()) {
if(a.type == ALT_SNP_DEL || a.type == ALT_SNP_INS) {
if(b.pos <= a.pos + a.len) {
return false;
}
}
} else if(a.splicesite()) {
if(b.pos <= a.right + 2) {
return false;
}
} else {
assert(false);
}
return true;
}
bool isSame(const ALT& o) const {
if(type != o.type)
return false;
if(type == ALT_SNP_SGL) {
return pos == o.pos && seq == o.seq;
} else if(type == ALT_SNP_DEL || type == ALT_SNP_INS || type == ALT_SPLICESITE) {
if(type == ALT_SNP_INS) {
if(seq != o.seq)
return false;
}
if(reversed == o.reversed) {
return pos == o.pos && len == o.len;
} else {
if(reversed) {
return pos - len + 1 == o.pos && len == o.len;
} else {
return pos == o.pos - o.len + 1 && len == o.len;
}
}
} else {
assert(false);
}
return true;
}
#ifndef NDEBUG
bool repOk() const {
if(type == ALT_SNP_SGL) {
if(len != 1) {
assert(false);
return false;
}
if(seq > 3) {
assert(false);
return false;
}
} else if(type == ALT_SNP_DEL) {
if(len <= 0) {
assert(false);
return false;
}
if(seq != 0) {
assert(false);
return false;
}
} else if(type == ALT_SNP_INS) {
if(len <= 0) {
assert(false);
return false;
}
} else if(type == ALT_SPLICESITE) {
assert_lt(left, right);
assert_leq(fw, 1);
} else {
assert(false);
return false;
}
return true;
}
#endif
bool write(ofstream& f_out, bool bigEndian) const {
writeIndex<index_t>(f_out, pos, bigEndian);
writeU32(f_out, type, bigEndian);
writeIndex<index_t>(f_out, len, bigEndian);
writeIndex<uint64_t>(f_out, seq, bigEndian);
return true;
}
bool read(ifstream& f_in, bool bigEndian) {
pos = readIndex<index_t>(f_in, bigEndian);
type = (ALT_TYPE)readU32(f_in, bigEndian);
assert_neq(type, ALT_SNP_ALT);
len = readIndex<index_t>(f_in, bigEndian);
seq = readIndex<uint64_t>(f_in, bigEndian);
return true;
}
};
template <typename index_t>
struct Haplotype {
Haplotype() {
reset();
}
void reset() {
left = right = 0;
alts.clear();
}
index_t left;
index_t right;
EList<index_t, 1> alts;
bool operator< (const Haplotype& o) const {
if(left != o.left) return left < o.left;
if(right != o.right) return right < o.right;
return false;
}
bool write(ofstream& f_out, bool bigEndian) const {
writeIndex<index_t>(f_out, left, bigEndian);
writeIndex<index_t>(f_out, right, bigEndian);
writeIndex<index_t>(f_out, alts.size(), bigEndian);
for(index_t i = 0; i < alts.size(); i++) {
writeIndex<index_t>(f_out, alts[i], bigEndian);
}
return true;
}
bool read(ifstream& f_in, bool bigEndian) {
left = readIndex<index_t>(f_in, bigEndian);
right = readIndex<index_t>(f_in, bigEndian);
assert_leq(left, right);
index_t num_alts = readIndex<index_t>(f_in, bigEndian);
alts.resizeExact(num_alts); alts.clear();
for(index_t i = 0; i < num_alts; i++) {
alts.push_back(readIndex<index_t>(f_in, bigEndian));
}
return true;
}
};
template <typename index_t>
class ALTDB {
public:
ALTDB() :
_snp(false),
_ss(false),
_exon(false)
{}
virtual ~ALTDB() {}
bool hasSNPs() const { return _snp; }
bool hasSpliceSites() const { return _ss; }
bool hasExons() const { return _exon; }
void setSNPs(bool snp) { _snp = snp; }
void setSpliceSites(bool ss) { _ss = ss; }
void setExons(bool exon) { _exon = exon; }
EList<ALT<index_t> >& alts() { return _alts; }
EList<string>& altnames() { return _altnames; }
EList<Haplotype<index_t> >& haplotypes() { return _haplotypes; }
EList<index_t>& haplotype_maxrights() { return _haplotype_maxrights; }
const EList<ALT<index_t> >& alts() const { return _alts; }
const EList<string>& altnames() const { return _altnames; }
const EList<Haplotype<index_t> >& haplotypes() const { return _haplotypes; }
const EList<index_t>& haplotype_maxrights() const { return _haplotype_maxrights; }
private:
bool _snp;
bool _ss;
bool _exon;
EList<ALT<index_t> > _alts;
EList<string> _altnames;
EList<Haplotype<index_t> > _haplotypes;
EList<index_t> _haplotype_maxrights;
};
#endif /*ifndef ALT_H_*/

279
assert_helpers.h Normal file

@ -0,0 +1,279 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ASSERT_HELPERS_H_
#define ASSERT_HELPERS_H_
#include <stdexcept>
#include <string>
#include <cassert>
#include <iostream>
/**
* Exception thrown by release-enabled (rt_assert*) assertions
*/
class ReleaseAssertException : public std::runtime_error {
public:
ReleaseAssertException(const std::string& msg = "") : std::runtime_error(msg) {}
};
/**
* Macros for release-enabled assertions, and helper macros to make
* all assertion error messages more helpful.
*/
#ifndef NDEBUG
#define ASSERT_ONLY(...) __VA_ARGS__
#else
#define ASSERT_ONLY(...)
#endif
#define rt_assert(b) \
if(!(b)) { \
std::cout << "rt_assert at " << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(); \
}
#define rt_assert_msg(b,msg) \
if(!(b)) { \
std::cout << msg << " at " << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(msg); \
}
#define rt_assert_eq(ex,ac) \
if(!((ex) == (ac))) { \
std::cout << "rt_assert_eq: expected (" << (ex) << ", 0x" << std::hex << (ex) << std::dec << ") got (" << (ac) << ", 0x" << std::hex << (ac) << std::dec << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(); \
}
#define rt_assert_eq_msg(ex,ac,msg) \
if(!((ex) == (ac))) { \
std::cout << "rt_assert_eq: " << msg << ": (" << (ex) << ", 0x" << std::hex << (ex) << std::dec << ") got (" << (ac) << ", 0x" << std::hex << (ac) << std::dec << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(msg); \
}
#ifndef NDEBUG
#define assert_eq(ex,ac) \
if(!((ex) == (ac))) { \
std::cout << "assert_eq: expected (" << (ex) << ", 0x" << std::hex << (ex) << std::dec << ") got (" << (ac) << ", 0x" << std::hex << (ac) << std::dec << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
assert(0); \
}
#define assert_eq_msg(ex,ac,msg) \
if(!((ex) == (ac))) { \
std::cout << "assert_eq: " << msg << ": (" << (ex) << ", 0x" << std::hex << (ex) << std::dec << ") got (" << (ac) << ", 0x" << std::hex << (ac) << std::dec << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
assert(0); \
}
#else
#define assert_eq(ex,ac)
#define assert_eq_msg(ex,ac,msg)
#endif
#define rt_assert_neq(ex,ac) \
if(!((ex) != (ac))) { \
std::cout << "rt_assert_neq: expected not (" << (ex) << ", 0x" << std::hex << (ex) << std::dec << ") got (" << (ac) << ", 0x" << std::hex << (ac) << std::dec << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(); \
}
#define rt_assert_neq_msg(ex,ac,msg) \
if(!((ex) != (ac))) { \
std::cout << "rt_assert_neq: " << msg << ": (" << (ex) << ", 0x" << std::hex << (ex) << std::dec << ") got (" << (ac) << ", 0x" << std::hex << (ac) << std::dec << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(msg); \
}
#ifndef NDEBUG
#define assert_neq(ex,ac) \
if(!((ex) != (ac))) { \
std::cout << "assert_neq: expected not (" << (ex) << ", 0x" << std::hex << (ex) << std::dec << ") got (" << (ac) << ", 0x" << std::hex << (ac) << std::dec << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
assert(0); \
}
#define assert_neq_msg(ex,ac,msg) \
if(!((ex) != (ac))) { \
std::cout << "assert_neq: " << msg << ": (" << (ex) << ", 0x" << std::hex << (ex) << std::dec << ") got (" << (ac) << ", 0x" << std::hex << (ac) << std::dec << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
assert(0); \
}
#else
#define assert_neq(ex,ac)
#define assert_neq_msg(ex,ac,msg)
#endif
#define rt_assert_gt(a,b) \
if(!((a) > (b))) { \
std::cout << "rt_assert_gt: expected (" << (a) << ") > (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(); \
}
#define rt_assert_gt_msg(a,b,msg) \
if(!((a) > (b))) { \
std::cout << "rt_assert_gt: " << msg << ": (" << (a) << ") > (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(msg); \
}
#ifndef NDEBUG
#define assert_gt(a,b) \
if(!((a) > (b))) { \
std::cout << "assert_gt: expected (" << (a) << ") > (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
assert(0); \
}
#define assert_gt_msg(a,b,msg) \
if(!((a) > (b))) { \
std::cout << "assert_gt: " << msg << ": (" << (a) << ") > (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
assert(0); \
}
#else
#define assert_gt(a,b)
#define assert_gt_msg(a,b,msg)
#endif
#define rt_assert_geq(a,b) \
if(!((a) >= (b))) { \
std::cout << "rt_assert_geq: expected (" << (a) << ") >= (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(); \
}
#define rt_assert_geq_msg(a,b,msg) \
if(!((a) >= (b))) { \
std::cout << "rt_assert_geq: " << msg << ": (" << (a) << ") >= (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(msg); \
}
#ifndef NDEBUG
#define assert_geq(a,b) \
if(!((a) >= (b))) { \
std::cout << "assert_geq: expected (" << (a) << ") >= (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
assert(0); \
}
#define assert_geq_msg(a,b,msg) \
if(!((a) >= (b))) { \
std::cout << "assert_geq: " << msg << ": (" << (a) << ") >= (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
assert(0); \
}
#else
#define assert_geq(a,b)
#define assert_geq_msg(a,b,msg)
#endif
#define rt_assert_lt(a,b) \
if(!((a) < (b))) { \
std::cout << "rt_assert_lt: expected (" << (a) << ") < (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(); \
}
#define rt_assert_lt_msg(a,b,msg) \
if(!((a) < (b))) { \
std::cout << "rt_assert_lt: " << msg << ": (" << (a) << ") < (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(msg); \
}
#ifndef NDEBUG
#define assert_lt(a,b) \
if(!((a) < (b))) { \
std::cout << "assert_lt: expected (" << (a) << ") < (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
assert(0); \
}
#define assert_lt_msg(a,b,msg) \
if(!((a) < (b))) { \
std::cout << "assert_lt: " << msg << ": (" << (a) << ") < (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
assert(0); \
}
#else
#define assert_lt(a,b)
#define assert_lt_msg(a,b,msg)
#endif
#define rt_assert_leq(a,b) \
if(!((a) <= (b))) { \
std::cout << "rt_assert_leq: expected (" << (a) << ") <= (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(); \
}
#define rt_assert_leq_msg(a,b,msg) \
if(!((a) <= (b))) { \
std::cout << "rt_assert_leq: " << msg << ": (" << (a) << ") <= (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
throw ReleaseAssertException(msg); \
}
#ifndef NDEBUG
#define assert_leq(a,b) \
if(!((a) <= (b))) { \
std::cout << "assert_leq: expected (" << (a) << ") <= (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
assert(0); \
}
#define assert_leq_msg(a,b,msg) \
if(!((a) <= (b))) { \
std::cout << "assert_leq: " << msg << ": (" << (a) << ") <= (" << (b) << ")" << std::endl; \
std::cout << __FILE__ << ":" << __LINE__ << std::endl; \
assert(0); \
}
#else
#define assert_leq(a,b)
#define assert_leq_msg(a,b,msg)
#endif
#ifndef NDEBUG
#define assert_in(c, s) assert_in2(c, s, __FILE__, __LINE__)
static inline void assert_in2(char c, const char *str, const char *file, int line) {
const char *s = str;
while(*s != '\0') {
if(c == *s) return;
s++;
}
std::cout << "assert_in: (" << c << ") not in (" << str << ")" << std::endl;
std::cout << file << ":" << line << std::endl;
assert(0);
}
#else
#define assert_in(c, s)
#endif
#ifndef NDEBUG
#define assert_range(b, e, v) assert_range_helper(b, e, v, __FILE__, __LINE__)
template<typename T>
inline static void assert_range_helper(const T& begin,
const T& end,
const T& val,
const char *file,
int line)
{
if(val < begin || val > end) {
std::cout << "assert_range: (" << val << ") not in ["
<< begin << ", " << end << "]" << std::endl;
std::cout << file << ":" << line << std::endl;
assert(0);
}
}
#else
#define assert_range(b, e, v)
#endif
#endif /*ASSERT_HELPERS_H_*/
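
The comparison macros in assert_helpers.h expand to a bare `if` statement and paste their arguments into a larger expression. The following is a minimal sketch, not part of the original header, of how such a macro can be hardened: each argument is parenthesized and the body is wrapped in `do { } while(0)` so it behaves like a single statement. The names `sketch_rt_assert_lt` and `throwsWhenNotLess` are hypothetical, and `std::runtime_error` stands in for `ReleaseAssertException`.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical hardened variant of a release-enabled "less than" assertion.
// (a) and (b) are parenthesized so that an argument such as `x & 0xff`
// is compared as a whole; do { } while(0) makes the macro one statement.
#define sketch_rt_assert_lt(a, b) \
    do { \
        if(!((a) < (b))) { \
            std::ostringstream os_; \
            os_ << "expected (" << (a) << ") < (" << (b) << ") at " \
                << __FILE__ << ":" << __LINE__; \
            throw std::runtime_error(os_.str()); \
        } \
    } while(0)

// Returns true iff the assertion fires for (a & 0xff) < b.
static bool throwsWhenNotLess(int a, int b) {
    try {
        // The macro is used as the body of an unbraced if/else on purpose.
        if(b != 0) sketch_rt_assert_lt(a & 0xff, b);
        else return false;
    } catch(const std::runtime_error&) {
        return true;
    }
    return false;
}
```

Without the parentheses, `a & 0xff < b` would parse as `a & (0xff < b)`, and without the `do { } while(0)` wrapper the trailing `else` above would not compile as intended.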

27
banded.cpp Normal file

@ -0,0 +1,27 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include <iostream>
#include "banded.h"
#ifdef MAIN_BANDED
int main(void) {
}
#endif

52
banded.h Normal file

@ -0,0 +1,52 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef BANDED_H_
#define BANDED_H_
#include "sse_util.h"
/**
* Use SSE instructions to quickly find stretches with lots of matches, then
* resolve alignments.
*/
class BandedSseAligner {
public:
void init(
int *q, // query, maskized
size_t qi, // query start
size_t qf, // query end
int *r, // reference, maskized
size_t ri, // reference start
size_t rf) // reference end
{
}
void nextAlignment() {
}
protected:
EList_m128i mat_;
};
#endif

102
binary_sa_search.h Normal file

@ -0,0 +1,102 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef BINARY_SA_SEARCH_H_
#define BINARY_SA_SEARCH_H_
#include <stdint.h>
#include <iostream>
#include <limits>
#include "alphabet.h"
#include "assert_helpers.h"
#include "ds.h"
#include "btypes.h"
/**
* Do a binary search using the suffix of 'host' beginning at offset
* 'qry' as the query and 'sa' as an already-lexicographically-sorted
* list of suffixes of host. 'sa' may be all suffixes of host or just
* a subset. Returns the index in sa of the smallest suffix of host
* that is larger than qry, or length(sa) if all suffixes of host are
* less than qry.
*
* We use the Manber and Myers optimization of maintaining a pair of
* counters for the longest lcp observed so far on the left- and right-
* hand sides and using the min of the two as a way of skipping over
* characters at the beginning of a new round.
*
* Returns maximum value if the query suffix matches an element of sa.
*/
template<typename TStr, typename TSufElt> inline
TIndexOffU binarySASearch(
const TStr& host,
TIndexOffU qry,
const EList<TSufElt>& sa)
{
TIndexOffU lLcp = 0, rLcp = 0; // greatest observed LCPs on left and right
TIndexOffU l = 0, r = (TIndexOffU)sa.size()+1; // binary-search window
TIndexOffU hostLen = (TIndexOffU)host.length();
while(true) {
assert_gt(r, l);
TIndexOffU m = (l+r) >> 1;
if(m == l) {
// Binary-search window has closed: we have an answer
if(m > 0 && sa[m-1] == qry) {
return std::numeric_limits<TIndexOffU>::max(); // qry matches
}
assert_leq(m, sa.size());
return m; // Return index of right-hand suffix
}
assert_gt(m, 0);
TIndexOffU suf = sa[m-1];
if(suf == qry) {
return std::numeric_limits<TIndexOffU>::max(); // query matches an elt of sa
}
TIndexOffU lcp = min(lLcp, rLcp);
#ifndef NDEBUG
if(sstr_suf_upto_neq(host, qry, host, suf, lcp)) {
assert(0);
}
#endif
// Keep advancing lcp, but stop when query mismatches host or
// when the counter falls off either the query or the suffix
while(suf+lcp < hostLen && qry+lcp < hostLen && host[suf+lcp] == host[qry+lcp]) {
lcp++;
}
// Fell off the end of either the query or the sa elt?
bool fell = (suf+lcp == hostLen || qry+lcp == hostLen);
if((fell && qry+lcp == hostLen) || (!fell && host[suf+lcp] < host[qry+lcp])) {
// Query is greater than sa elt
l = m; // update left bound
lLcp = max(lLcp, lcp); // update left lcp
}
else if((fell && suf+lcp == hostLen) || (!fell && host[suf+lcp] > host[qry+lcp])) {
// Query is less than sa elt
r = m; // update right bound
rLcp = max(rLcp, lcp); // update right lcp
} else {
assert(false); // Must be one or the other!
}
}
// Shouldn't get here
assert(false);
return std::numeric_limits<TIndexOffU>::max();
}
#endif /*BINARY_SA_SEARCH_H_*/
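
The search in binary_sa_search.h can be illustrated with a simplified sketch over `std::string` and `std::vector`. This is an assumption-laden analogue, not the original code: `suffixUpperBound` is a hypothetical name, and like the original branches it treats a suffix that runs off the end of the host as greater than one that continues. The skip over `min(lLcp, rLcp)` leading characters is the Manber and Myers optimization described in the header comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical simplified analogue of binarySASearch.
// 'sa' holds suffix offsets of 'host', sorted with "end of string" treated
// as greater than any character. Returns the index in 'sa' of the smallest
// suffix greater than host[qry..], sa.size() if there is none, or the
// maximum size_t value if qry itself occurs in 'sa'.
static size_t suffixUpperBound(const std::string& host,
                               size_t qry,
                               const std::vector<size_t>& sa)
{
    const size_t found = std::numeric_limits<size_t>::max();
    const size_t n = host.size();
    size_t lLcp = 0, rLcp = 0;       // longest LCPs seen on each side
    size_t l = 0, r = sa.size() + 1; // window over 1-based positions
    while(true) {
        size_t m = (l + r) >> 1;
        if(m == l) {
            // Window closed: either qry is the left neighbour or we are done
            if(m > 0 && sa[m-1] == qry) return found;
            return m;
        }
        size_t suf = sa[m-1];
        if(suf == qry) return found;
        // Both bounds share at least min(lLcp, rLcp) characters with qry
        size_t lcp = std::min(lLcp, rLcp);
        while(suf + lcp < n && qry + lcp < n &&
              host[suf + lcp] == host[qry + lcp]) {
            lcp++;
        }
        bool qryEnded = (qry + lcp == n);
        bool sufEnded = (suf + lcp == n);
        if(qryEnded || (!sufEnded && host[suf + lcp] < host[qry + lcp])) {
            l = m;                        // query is greater than sa[m-1]
            lLcp = std::max(lLcp, lcp);
        } else {
            r = m;                        // query is less than sa[m-1]
            rLcp = std::max(rLcp, lcp);
        }
    }
}
```

For `host = "banana"` the suffixes at offsets 1, 5 and 2 ("anana", "a", "nana") are already in order under this convention, so `sa = {1, 5, 2}` is a valid sampled suffix array for trying the function out.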

315
bit_packed_array.cpp Normal file

@ -0,0 +1,315 @@
/*
* Copyright 2018, Chanhee Park <parkchanhee@gmail.com> and Daehwan Kim <infphilo@gmail.com>
*
* This file is part of HISAT 2.
*
* HISAT 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* HISAT 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with HISAT 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include <iostream>
#include <vector>
#include <algorithm>
#include <cmath> // ceil, log2 (used by BitPackedArray::init)
#include "timer.h"
#include "aligner_sw.h"
#include "aligner_result.h"
#include "scoring.h"
#include "sstring.h"
#include "bit_packed_array.h"
TIndexOffU BitPackedArray::get(size_t index) const
{
assert_lt(index, cur_);
pair<size_t, size_t> addr = indexToAddress(index);
uint64_t *block = blocks_[addr.first];
pair<size_t, size_t> pos = columnToPosition(addr.second);
TIndexOffU val = getItem(block, pos.first, pos.second);
return val;
}
#define write_fp(x) fp.write((const char *)&(x), sizeof((x)))
void BitPackedArray::writeFile(ofstream &fp)
{
size_t sz = 0;
write_fp(item_bit_size_);
write_fp(elm_bit_size_);
write_fp(items_per_block_bit_);
write_fp(items_per_block_bit_mask_);
write_fp(items_per_block_);
write_fp(cur_);
write_fp(sz_);
write_fp(block_size_);
// number of blocks
sz = blocks_.size();
write_fp(sz);
for(size_t i = 0; i < sz; i++) {
fp.write((const char *)blocks_[i], block_size_);
}
}
void BitPackedArray::writeFile(const char *filename)
{
ofstream fp(filename, std::ofstream::binary);
writeFile(fp);
fp.close();
}
void BitPackedArray::writeFile(const string &filename)
{
writeFile(filename.c_str());
}
#define read_fp(x) fp.read((char *)&(x), sizeof((x)))
void BitPackedArray::readFile(ifstream &fp)
{
size_t val_sz = 0;
read_fp(val_sz);
init_by_log2(val_sz);
//rt_assert_eq(val_sz, item_bit_size_);
read_fp(val_sz);
rt_assert_eq(val_sz, elm_bit_size_);
read_fp(val_sz);
rt_assert_eq(val_sz, items_per_block_bit_);
read_fp(val_sz);
rt_assert_eq(val_sz, items_per_block_bit_mask_);
read_fp(val_sz);
rt_assert_eq(val_sz, items_per_block_);
// skip cur_
size_t prev_cnt = 0;
read_fp(prev_cnt);
cur_ = 0;
// skip sz_
size_t prev_sz = 0;
read_fp(prev_sz);
sz_ = 0;
// block_size_
read_fp(val_sz);
rt_assert_eq(val_sz, block_size_);
// alloc blocks
allocItems(prev_cnt);
rt_assert_eq(prev_sz, sz_);
// number of blocks
read_fp(val_sz);
rt_assert_eq(val_sz, blocks_.size());
for(size_t i = 0; i < blocks_.size(); i++) {
fp.read((char *)blocks_[i], block_size_);
}
cur_ = prev_cnt;
}
void BitPackedArray::readFile(const char *filename)
{
ifstream fp(filename, std::ifstream::binary);
readFile(fp);
fp.close();
}
void BitPackedArray::readFile(const string &filename)
{
readFile(filename.c_str());
}
void BitPackedArray::put(size_t index, TIndexOffU val)
{
assert_lt(index, cur_);
pair<size_t, size_t> addr = indexToAddress(index);
uint64_t *block = blocks_[addr.first];
pair<size_t, size_t> pos = columnToPosition(addr.second);
setItem(block, pos.first, pos.second, val);
}
void BitPackedArray::pushBack(TIndexOffU val)
{
if(cur_ == sz_) {
allocItems(items_per_block_);
}
put(cur_++, val);
assert_leq(cur_, sz_);
}
TIndexOffU BitPackedArray::getItem(uint64_t *block, size_t idx, size_t offset) const
{
size_t remains = item_bit_size_;
TIndexOffU val = 0;
while(remains > 0) {
size_t bits = min(elm_bit_size_ - offset, remains);
uint64_t mask = bitToMask(bits);
// get value from block
TIndexOffU t = (block[idx] >> offset) & mask;
val = val | (t << (item_bit_size_ - remains));
remains -= bits;
offset = 0;
idx++;
}
return val;
}
void BitPackedArray::setItem(uint64_t *block, size_t idx, size_t offset, TIndexOffU val)
{
size_t remains = item_bit_size_;
while(remains > 0) {
size_t bits = min(elm_bit_size_ - offset, remains);
uint64_t mask = bitToMask(bits);
uint64_t dest_mask = mask << offset;
// get 'bits' lsb from val
uint64_t t = val & mask;
val >>= bits;
// save 't' to block[idx]
t <<= offset;
block[idx] &= ~(dest_mask); // clear
block[idx] |= t;
idx++;
remains -= bits;
offset = 0;
}
}
pair<size_t, size_t> BitPackedArray::indexToAddress(size_t index) const
{
pair<size_t, size_t> addr;
addr.first = index >> items_per_block_bit_;
addr.second = index & items_per_block_bit_mask_;
return addr;
}
pair<size_t, size_t> BitPackedArray::columnToPosition(size_t col) const {
pair<size_t, size_t> pos;
pos.first = (col * item_bit_size_) / elm_bit_size_;
pos.second = (col * item_bit_size_) % elm_bit_size_;
return pos;
}
void BitPackedArray::expand(size_t count)
{
if((cur_ + count) > sz_) {
allocItems(count);
}
cur_ += count;
assert_leq(cur_, sz_);
}
void BitPackedArray::allocSize(size_t sz)
{
size_t num_block = (sz * sizeof(uint64_t) + block_size_ - 1) / block_size_;
for(size_t i = 0; i < num_block; i++) {
// block_size_ is a byte count, so convert it to a number of 64-bit words
uint64_t *ptr = new uint64_t[block_size_ / sizeof(uint64_t)];
blocks_.push_back(ptr);
sz_ += items_per_block_;
}
}
void BitPackedArray::allocItems(size_t count)
{
size_t sz = (count * item_bit_size_ + elm_bit_size_ - 1) / elm_bit_size_;
allocSize(sz);
}
void BitPackedArray::init_by_log2(size_t ceil_log2)
{
item_bit_size_ = ceil_log2;
elm_bit_size_ = sizeof(uint64_t) * 8;
items_per_block_bit_ = 20; // 1M
items_per_block_ = 1ULL << (items_per_block_bit_);
items_per_block_bit_mask_ = items_per_block_ - 1;
block_size_ = (items_per_block_ * item_bit_size_ + elm_bit_size_ - 1) / elm_bit_size_ * sizeof(uint64_t);
cur_ = 0;
sz_ = 0;
}
void BitPackedArray::init(size_t max_value)
{
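// Note: ceil(log2(max_value)) bits can represent 0 .. max_value-1 when
// max_value is a power of two, so max_value itself does not fit in that case.
init_by_log2((size_t)ceil(log2(max_value)));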
}
void BitPackedArray::dump() const
{
cerr << "item_bit_size_: " << item_bit_size_ << endl;
cerr << "block_size_: " << block_size_ << endl;
cerr << "items_per_block_: " << items_per_block_ << endl;
cerr << "cur_: " << cur_ << endl;
cerr << "sz_: " << sz_ << endl;
cerr << "number of blocks: " << blocks_.size() << endl;
}
size_t BitPackedArray::getMemUsage() const
{
size_t tot = blocks_.size() * block_size_;
tot += blocks_.totalCapacityBytes();
return tot;
}
BitPackedArray::~BitPackedArray()
{
for(size_t i = 0; i < blocks_.size(); i++) {
uint64_t *ptr = blocks_[i];
delete [] ptr;
}
}
void BitPackedArray::reset()
{
cur_ = 0;
sz_ = 0;
for(size_t i = 0; i < blocks_.size(); i++) {
uint64_t *ptr = blocks_[i];
delete [] ptr;
}
blocks_.clear();
}

105
bit_packed_array.h Normal file

@ -0,0 +1,105 @@
/*
* Copyright 2018, Chanhee Park <parkchanhee@gmail.com> and Daehwan Kim <infphilo@gmail.com>
*
* This file is part of HISAT 2.
*
* HISAT 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* HISAT 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with HISAT 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef __HISAT2_BIT_PACKED_ARRAY_H
#define __HISAT2_BIT_PACKED_ARRAY_H
#include <iostream>
#include <fstream>
#include <limits>
#include <map>
#include "assert_helpers.h"
#include "word_io.h"
#include "mem_ids.h"
#include "ds.h"
using namespace std;
class BitPackedArray {
public:
BitPackedArray () {}
~BitPackedArray();
/**
* Return true iff there are no items
* @return
*/
inline bool empty() const { return cur_ == 0; }
inline size_t size() const { return cur_; }
TIndexOffU get(size_t idx) const;
inline TIndexOffU operator[](size_t i) const { return get(i); }
void pushBack(TIndexOffU val);
void init(size_t max_value);
void reset();
void writeFile(const char *filename);
void writeFile(const string& filename);
void writeFile(ofstream &fp);
void readFile(const char *filename);
void readFile(const string& filename);
void readFile(ifstream &fp);
void dump() const;
size_t getMemUsage() const;
private:
void init_by_log2(size_t ceil_log2);
void put(size_t index, TIndexOffU val);
inline uint64_t bitToMask(size_t bit) const
{
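// Shifting a 64-bit value by 64 is undefined, so handle the full-width mask explicitly
return (bit >= 64) ? ~(uint64_t)0 : (uint64_t) ((1ULL << bit) - 1);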
}
TIndexOffU getItem(uint64_t *block, size_t idx, size_t offset) const;
void setItem(uint64_t *block, size_t idx, size_t offset, TIndexOffU val);
pair<size_t, size_t> indexToAddress(size_t index) const;
pair<size_t, size_t> columnToPosition(size_t col) const;
void expand(size_t count = 1);
void allocSize(size_t sz);
void allocItems(size_t count);
private:
size_t item_bit_size_; // item bit size(e.g. 33bit)
size_t elm_bit_size_; // 64bit
size_t items_per_block_bit_;
size_t items_per_block_bit_mask_;
size_t items_per_block_; // number of items in block
size_t cur_; // current item count
size_t sz_; // maximum item count
size_t block_size_; // block size in byte
// List of packed array
EList<uint64_t *> blocks_;
};
#endif //__HISAT2_BIT_PACKED_ARRAY_H
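
The core of BitPackedArray is reading and writing fixed-width items that may straddle two 64-bit words. Below is a minimal single-block sketch of that idea, assuming a `std::vector<uint64_t>` as backing store; `TinyPackedArray` and its members are hypothetical names and do not exist in HISAT 2. It follows the same chunked loop as `getItem` and `setItem`, without the block addressing or file I/O.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified single-block analogue of BitPackedArray.
class TinyPackedArray {
public:
    // itemBits: width of each item in bits (1..64); count: number of items
    TinyPackedArray(size_t itemBits, size_t count)
        : itemBits_(itemBits), words_((itemBits * count + 63) / 64, 0)
    {
        assert(itemBits >= 1 && itemBits <= 64);
    }

    void set(size_t i, uint64_t val) {
        size_t idx = (i * itemBits_) / 64; // word holding the first bit
        size_t off = (i * itemBits_) % 64; // bit offset inside that word
        size_t remains = itemBits_;
        while(remains > 0) {
            size_t bits = (64 - off < remains) ? (64 - off) : remains;
            uint64_t m = mask(bits);
            words_[idx] &= ~(m << off);      // clear the destination bits
            words_[idx] |= (val & m) << off; // store the low 'bits' of val
            val = (bits < 64) ? (val >> bits) : 0;
            remains -= bits;
            off = 0;                         // later chunks start at bit 0
            idx++;
        }
    }

    uint64_t get(size_t i) const {
        size_t idx = (i * itemBits_) / 64;
        size_t off = (i * itemBits_) % 64;
        size_t remains = itemBits_;
        uint64_t val = 0;
        while(remains > 0) {
            size_t bits = (64 - off < remains) ? (64 - off) : remains;
            uint64_t t = (words_[idx] >> off) & mask(bits);
            val |= t << (itemBits_ - remains); // place chunk above earlier ones
            remains -= bits;
            off = 0;
            idx++;
        }
        return val;
    }

private:
    // Mask with the low 'bits' bits set; 64 is handled without shifting by 64
    static uint64_t mask(size_t bits) {
        return bits >= 64 ? ~(uint64_t)0 : ((1ULL << bits) - 1);
    }
    size_t itemBits_;
    std::vector<uint64_t> words_;
};
```

With 33-bit items, item 1 occupies bits 33 to 65 and therefore spans two words, which exercises the second pass through each loop.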

80
bitpack.h Normal file

@ -0,0 +1,80 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef BITPACK_H_
#define BITPACK_H_
#include <stdint.h>
#include "assert_helpers.h"
/**
* Routines for marshalling 2-bit values into and out of 8-bit or
* 32-bit hosts
*/
static inline void pack_2b_in_8b(const int two, uint8_t& eight, const int off) {
assert_lt(two, 4);
assert_lt(off, 4);
eight |= (two << (off*2));
}
static inline int unpack_2b_from_8b(const uint8_t eight, const int off) {
assert_lt(off, 4);
return ((eight >> (off*2)) & 0x3);
}
static inline void pack_2b_in_32b(const int two, uint32_t& thirty2, const int off) {
assert_lt(two, 4);
assert_lt(off, 16);
thirty2 |= (two << (off*2));
}
static inline int unpack_2b_from_32b(const uint32_t thirty2, const int off) {
assert_lt(off, 16);
return ((thirty2 >> (off*2)) & 0x3);
}
/**
* Routines for marshalling 1-bit values into and out of 8-bit or
* 32-bit hosts
*/
static inline void pack_1b_in_8b(const int one, uint8_t& eight, const int off) {
assert_lt(one, 2);
assert_lt(off, 8);
eight |= (one << off);
}
static inline int unpack_1b_from_8b(const uint8_t eight, const int off) {
assert_lt(off, 8);
return ((eight >> off) & 0x1);
}
static inline void pack_1b_in_32b(const int one, uint32_t& thirty2, const int off) {
assert_lt(one, 2);
assert_lt(off, 32);
thirty2 |= (one << off);
}
static inline int unpack_1b_from_32b(const uint32_t thirty2, const int off) {
assert_lt(off, 32);
return ((thirty2 >> off) & 0x1);
}
#endif /*BITPACK_H_*/
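
As a rough illustration of the 2-bit routines in bitpack.h, the sketch below packs four nucleotide codes into one byte and reads them back. The helpers `pack2`, `unpack2` and `packFourBases` are hypothetical stand-ins that mirror `pack_2b_in_8b` and `unpack_2b_from_8b`; they are written out here only so the example is self-contained.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for pack_2b_in_8b: ORs a 2-bit value into slot 'off'.
static inline void pack2(int two, uint8_t& eight, int off) {
    assert(two >= 0 && two < 4);
    assert(off >= 0 && off < 4);
    eight = static_cast<uint8_t>(eight | (two << (off * 2)));
}

// Hypothetical stand-in for unpack_2b_from_8b: reads the 2-bit value in slot 'off'.
static inline int unpack2(uint8_t eight, int off) {
    assert(off >= 0 && off < 4);
    return (eight >> (off * 2)) & 0x3;
}

// Packs four 2-bit codes into one byte, slot 0 in the lowest bits.
static inline uint8_t packFourBases(const int (&codes)[4]) {
    uint8_t packed = 0; // must start at zero because pack2 only ORs bits in
    for(int i = 0; i < 4; i++) {
        pack2(codes[i], packed, i);
    }
    return packed;
}
```

Packing the codes 0, 1, 2, 3 in that order yields the byte 0xE4. Because the pack routines only OR bits in, a slot has to be zero before it is written; overwriting a slot that already holds a value would need an explicit clear first.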

1113
blockwise_sa.h Normal file

File diff suppressed because it is too large

1237
bp_aligner.h Normal file

File diff suppressed because it is too large

48
btypes.h Normal file

@ -0,0 +1,48 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef BOWTIE_INDEX_TYPES_H
#define BOWTIE_INDEX_TYPES_H
#include <stdint.h> // uint32_t, uint64_t, int64_t
#include <string> // std::string (gfm_ext)
#ifdef BOWTIE_64BIT_INDEX
#define OFF_MASK 0xffffffffffffffff
#define OFF_LEN_MASK 0xc000000000000000
#define LS_SIZE 0x100000000000000
#define OFF_SIZE 8
#define INDEX_MAX 0xffffffffffffffff
typedef uint64_t TIndexOffU;
typedef int64_t TIndexOff;
#else
#define OFF_MASK 0xffffffff
#define OFF_LEN_MASK 0xc0000000
#define LS_SIZE 0x10000000
#define OFF_SIZE 4
#define INDEX_MAX 0xffffffff
typedef uint32_t TIndexOffU;
typedef int TIndexOff;
#endif /* BOWTIE_64BIT_INDEX */
extern const std::string gfm_ext;
#endif /* BOWTIE_INDEX_TYPES_H */
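The header above ties several constants to the width of the index offset type. As a hedged illustration, the sketch below copies the 32-bit branch (the one used when BOWTIE_64BIT_INDEX is not defined) and checks at compile time that the constants agree with the typedef. The helper `fitsInIndex` is hypothetical and only shows how INDEX_MAX might be used as a bound; it does not appear in the original sources.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Copied from the 32-bit branch of btypes.h (BOWTIE_64BIT_INDEX undefined).
#define OFF_SIZE 4
#define INDEX_MAX 0xffffffff
typedef uint32_t TIndexOffU;

// OFF_SIZE is the byte width of the unsigned offset type.
static_assert(sizeof(TIndexOffU) == OFF_SIZE,
              "OFF_SIZE must match sizeof(TIndexOffU)");

// INDEX_MAX is the largest value the unsigned offset type can hold.
static_assert(INDEX_MAX == std::numeric_limits<TIndexOffU>::max(),
              "INDEX_MAX must match the maximum of TIndexOffU");

// Hypothetical helper: true if a 64-bit quantity can be stored as an offset.
static inline bool fitsInIndex(uint64_t n) {
    return n <= INDEX_MAX;
}
```

The same two checks would apply to the 64-bit branch with OFF_SIZE 8 and uint64_t.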

80
ccnt_lut.cpp Normal file

@ -0,0 +1,80 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include <stdint.h>
/* Generated by gen_lookup_tables.pl */
uint8_t cCntLUT_4[4][4][256];
uint8_t cCntLUT_4_rev[4][4][256];
uint8_t cCntBIT[8][256];
int countCnt(int by, int c, uint8_t str) {
int count = 0;
if(by == 0) by = 4;
while(by-- > 0) {
int c2 = str & 3;
str >>= 2;
if(c == c2) count++;
}
return count;
}
int countCnt_rev(int by, int c, uint8_t str) {
int count = 0;
if(by == 0) by = 4;
while(by-- > 0) {
int c2 = (str >> 6) & 3;
str <<= 2;
if(c == c2) count++;
}
return count;
}
void initializeCntLut() {
for(int by = 0; by < 4; by++) {
for(int c = 0; c < 4; c++) {
for(int str = 0; str < 256; str++) {
cCntLUT_4[by][c][str] = countCnt(by, c, str);
cCntLUT_4_rev[by][c][str] = countCnt_rev(by, c, str);
}
}
}
}
int countBit(int b, uint8_t str) {
int count = 0;
if(b == 0) b = 8;
while(b-- > 0) {
if(str & 0x1) count++;
str >>= 1;
}
return count;
}
void initializeCntBit() {
for(int b = 0; b < 8; b++) {
for(int str = 0; str < 256; str++) {
cCntBIT[b][str] = countBit(b, str);
}
}
}
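To make the table contents above concrete: each byte is read as four 2-bit characters, lowest-order pair first, and `countCnt(by, c, str)` counts how many of the first `by` characters equal `c`, with `by == 0` treated as all four. The sketch below restates that logic with an index loop; `countInPrefix` is a hypothetical name used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Restatement of countCnt from ccnt_lut.cpp: counts occurrences of the
// 2-bit character `c` among the first `by` characters of `str`, where
// character i occupies bits (2*i + 1)..(2*i). by == 0 means all four.
static int countInPrefix(int by, int c, uint8_t str) {
    assert(by >= 0 && by < 4);
    assert(c >= 0 && c < 4);
    const int n = (by == 0) ? 4 : by;
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (((str >> (2 * i)) & 3) == c) {
            count++;
        }
    }
    return count;
}
```

For example, the byte 0x1B (binary 00 01 10 11) holds the characters 3, 2, 1, 0 in low-to-high order, so a prefix of length 1 contains one 3 and no 0.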

117
diff_sample.cpp Normal file

@ -0,0 +1,117 @@
/*
* Copyright 2011, Ben Langmead <langmea@cs.jhu.edu>
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include "diff_sample.h"
struct sampleEntry clDCs[16];
bool clDCs_calced = false; /// have clDCs been calculated?
/**
* Entries 4-57 are transcribed from page 6 of Luk and Wong's paper
* "Two New Quorum Based Algorithms for Distributed Mutual Exclusion",
* which is also used and cited in Burkhardt and Karkkainen's
* papers on difference covers for sorting. These samples are optimal
* according to Luk and Wong.
*
* All other entries are generated via the exhaustive algorithm in
* calcExhaustiveDC().
*
* The 0 is stored at the end of the sample as an end-of-list marker,
* but 0 is also an element of each.
*
* Note that every sample listed here contains a 0 and a 1. Intuitively,
* any optimal difference cover sample can be oriented (i.e. rotated)
* such that it includes 0 and 1 as elements.
*
* All samples in this list have been verified to be complete covers.
*
* A value of 0xffffffff in the first column indicates that there is no
* sample for that value of v. We do not keep samples for values of v
* less than 3, since they are trivial (and the caller probably didn't
* mean to ask for them).
*/
uint32_t dc0to64[65][10] = {
{0xffffffff}, // 0
{0xffffffff}, // 1
{0xffffffff}, // 2
{1, 0}, // 3
{1, 2, 0}, // 4
{1, 2, 0}, // 5
{1, 3, 0}, // 6
{1, 3, 0}, // 7
{1, 2, 4, 0}, // 8
{1, 2, 4, 0}, // 9
{1, 2, 5, 0}, // 10
{1, 2, 5, 0}, // 11
{1, 3, 7, 0}, // 12
{1, 3, 9, 0}, // 13
{1, 2, 3, 7, 0}, // 14
{1, 2, 3, 7, 0}, // 15
{1, 2, 5, 8, 0}, // 16
{1, 2, 4, 12, 0}, // 17
{1, 2, 5, 11, 0}, // 18
{1, 2, 6, 9, 0}, // 19
{1, 2, 3, 6, 10, 0}, // 20
{1, 4, 14, 16, 0}, // 21
{1, 2, 3, 7, 11, 0}, // 22
{1, 2, 3, 7, 11, 0}, // 23
{1, 2, 3, 7, 15, 0}, // 24
{1, 2, 3, 8, 12, 0}, // 25
{1, 2, 5, 9, 15, 0}, // 26
{1, 2, 5, 13, 22, 0}, // 27
{1, 4, 15, 20, 22, 0}, // 28
{1, 2, 3, 4, 9, 14, 0}, // 29
{1, 2, 3, 4, 9, 19, 0}, // 30
{1, 3, 8, 12, 18, 0}, // 31
{1, 2, 3, 7, 11, 19, 0}, // 32
{1, 2, 3, 6, 16, 27, 0}, // 33
{1, 2, 3, 7, 12, 20, 0}, // 34
{1, 2, 3, 8, 12, 21, 0}, // 35
{1, 2, 5, 12, 14, 20, 0}, // 36
{1, 2, 4, 10, 15, 22, 0}, // 37
{1, 2, 3, 4, 8, 14, 23, 0}, // 38
{1, 2, 4, 13, 18, 33, 0}, // 39
{1, 2, 3, 4, 9, 14, 24, 0}, // 40
{1, 2, 3, 4, 9, 15, 25, 0}, // 41
{1, 2, 3, 4, 9, 15, 25, 0}, // 42
{1, 2, 3, 4, 10, 15, 26, 0}, // 43
{1, 2, 3, 6, 16, 27, 38, 0}, // 44
{1, 2, 3, 5, 12, 18, 26, 0}, // 45
{1, 2, 3, 6, 18, 25, 38, 0}, // 46
{1, 2, 3, 5, 16, 22, 40, 0}, // 47
{1, 2, 5, 9, 20, 26, 36, 0}, // 48
{1, 2, 5, 24, 33, 36, 44, 0}, // 49
{1, 3, 8, 17, 28, 32, 38, 0}, // 50
{1, 2, 5, 11, 18, 30, 38, 0}, // 51
{1, 2, 3, 4, 6, 14, 21, 30, 0}, // 52
{1, 2, 3, 4, 7, 21, 29, 44, 0}, // 53
{1, 2, 3, 4, 9, 15, 21, 31, 0}, // 54
{1, 2, 3, 4, 6, 19, 26, 47, 0}, // 55
{1, 2, 3, 4, 11, 16, 33, 39, 0}, // 56
{1, 3, 13, 32, 36, 43, 52, 0}, // 57
// Generated by calcExhaustiveDC()
{1, 2, 3, 7, 21, 33, 37, 50, 0}, // 58
{1, 2, 3, 6, 13, 21, 35, 44, 0}, // 59
{1, 2, 4, 9, 15, 25, 30, 42, 0}, // 60
{1, 2, 3, 7, 15, 25, 36, 45, 0}, // 61
{1, 2, 4, 10, 32, 39, 46, 51, 0}, // 62
{1, 2, 6, 8, 20, 38, 41, 54, 0}, // 63
{1, 2, 5, 14, 16, 34, 42, 59, 0} // 64
};
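The comment above says every sample in the table has been verified to be a complete cover. A minimal sketch of such a check, assuming the usual definition (a set D is a difference cover modulo v if every residue d in [0, v) equals (a - b) mod v for some a and b in D), might look as follows. The names `isDifferenceCover` and `rowToSample` are hypothetical and are not taken from diff_sample.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical checker: true if every residue in [0, v) can be written
// as (a - b) mod v with a and b drawn from `sample`.
static bool isDifferenceCover(const std::vector<uint32_t>& sample, uint32_t v) {
    if (v == 0) return false;
    std::vector<bool> covered(v, false);
    for (uint32_t a : sample) {
        for (uint32_t b : sample) {
            // Compute (a - b) mod v in 64 bits so the sum cannot wrap around.
            uint64_t d = (static_cast<uint64_t>(a % v) + v - (b % v)) % v;
            covered[static_cast<size_t>(d)] = true;
        }
    }
    for (uint32_t d = 0; d < v; d++) {
        if (!covered[d]) return false;
    }
    return true;
}

// Hypothetical helper: converts one zero-terminated row in the style of
// dc0to64 into a sample. The trailing 0 is both the end marker and an
// element, so 0 is added explicitly.
static std::vector<uint32_t> rowToSample(const uint32_t* row) {
    assert(row != nullptr);
    std::vector<uint32_t> sample;
    sample.push_back(0);
    for (size_t i = 0; row[i] != 0; i++) {
        sample.push_back(row[i]);
    }
    return sample;
}
```

With this sketch, the row for v = 13, {1, 3, 9, 0}, passes the check, while {1, 0} covers modulo 3 but not modulo 4 because the difference 2 is missing.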

1000
diff_sample.h Normal file

File diff suppressed because it is too large

9
docs/404.html Normal file

@ -0,0 +1,9 @@
---
layout: page
title: 404 Not Found
permalink: 404.html
hide: true
share: false
---
Sorry, the requested page wasn't found on the server.

4
docs/Gemfile Normal file

@ -0,0 +1,4 @@
source 'https://rubygems.org'
gem 'github-pages'
gem 'jekyll-feed'
gem 'jemoji'

21
docs/LICENSE Normal file

@ -0,0 +1,21 @@
The MIT License (MIT)
Copyright (c) 2014 Rohan Chandra
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

59
docs/README.md Normal file

@ -0,0 +1,59 @@
# jekyll-ttskch-theme
A simple and customizable theme for Jekyll.
> This theme was renamed from _jekyll-**qck**-theme_ to _jekyll-**tch**-theme_ on 2016.06.02.
> It was renamed again from _jekyll-**tch**-theme_ to _jekyll-**ttskch**-theme_ on 2016.09.23.
## Screen shot
![image](https://cloud.githubusercontent.com/assets/4360663/18776176/62611b38-81a2-11e6-875b-86a66aa8f15c.png)
## Features
* A lot of Markdown features (also GitHub Flavored Markdown)
* `:emoji:` ready :+1:
* Easy color-scheme customization
* Tags list page
* Monthly Archives page
* Search feature without any Jekyll plugins
* `<!--more-->` tag feature
* Anchor links for each heading
* Sticky side nav
* Responsive
* OGP ready
* Share buttons ready
## Getting started
1. [Fork me](https://github.com/ttskch/jekyll-ttskch-theme/fork)
2. Rename the repository from `jekyll-ttskch-theme` to `{username}.github.io` ([learn more](https://pages.github.com/))
3. Modify `_config.yml`
4. Modify `_sass/base/_variables.scss` if you need to change colors or font sizes
5. Add new posts into `_posts/` :smiley:
## Demo
You can see a live demo below:
* https://ttskch.github.io/jekyll-ttskch-theme/
## Thanks for using :wink:
* http://ttskch.github.io
* http://sitaramshelke.github.io
* http://jffourmond.github.io
* http://vbflash8.github.io
* http://luqitao.github.io
* http://harusametime.github.io
* http://gitzxon.github.io
* http://hutsonlu.github.io
* http://k0-1.github.io
* http://anthonygore.github.io
* http://getjsdojo.github.io
* http://georgezhuo.github.io
* http://neontapir.github.io
* https://sasukeh.github.io
* https://blog.guilhermegarnier.com
Please open a PR if you want to add your blog.

130
docs/_config.yml Normal file

@ -0,0 +1,130 @@
#
# Basic settings.
#
url: http://DaehwanKimLab.github.io
baseurl: /hisat2
title: HISAT2
description: graph-based alignment of next generation sequencing reads to a population of genomes
avatar: /assets/img/ogp.png
# favicon: /favicon.ico
favicon: /assets/img/ogp.png
# language: ja
language: en
#
# Icons
#
icons:
rss: true
email:
github: DaehwanKimLab
bitbucket:
twitter:
facebook:
google_plus:
tumblr:
behance:
dribbble:
flickr:
instagram:
linkedin: # full URL
pinterest:
reddit:
soundcloud:
stack_exchange: # full URL
steam:
wordpress:
youtube:
#
# default for front matter
#
defaults:
-
scope:
path: ""
values:
category: "main"
#
# Prettify url.
#
permalink: pretty
#
# Scripts.
#
google_analytics: # e.g. UA-000000-01
disqus:
#
# Localizations.
#
str_next: Next
str_prev: Prev
str_read_more: Read more...
str_search: Search
str_recent_posts: Recent posts
str_show_all_posts: Show all posts
#
# Recent posts.
#
recent_posts_num: 10
#
# Pagination.
#
paginate: 10
paginate_path: page/:num
#
# Social.
#
share_buttons:
twitter: true
facebook: false # needs ogp.fb.app_id
hatena: false
ogp:
image_url: //ttskch.github.io/jekyll-ttskch-theme/assets/img/ogp.png
fb:
admin: # facebook admin id
app_id: # facebook application id
#
# Plugins.
#
gems:
- jekyll-paginate
- jekyll-feed
- jemoji
#
# Styles: see "_sass/base/_variables.scss"
#
#
# !! Danger zone !!
#
include: ["_pages"]
markdown: kramdown
kramdown:
input: GFM
syntax_highlighter: rouge
excerpt_separator: <!--more-->
sass:
sass_dir: _sass
style: :compressed # or :expanded
exclude:
- Gemfile
- Gemfile.lock
- LICENSE
- README.md
- vendor


@ -0,0 +1,6 @@
- name: Lyda Hill Department of Bioinformatics, The University of Texas Southwestern Medical Center
url: https://www.utsouthwestern.edu/departments/bioinformatics
logo: /assets/img/bioinformatics_utsw_logo.png
- name: Center for Computational Biology, Johns Hopkins University
url: http://ccb.jhu.edu
logo: /assets/img/ccb_jhu_logo_tmp.png


@ -0,0 +1,10 @@
- name: Chanhee Park
url: /chanhee.park/
- name: Ben Langmead
url: http://www.langmead-lab.org/
- name: Yun (Leo) Zhang
url: /leo.zhang/
- name: Steven Salzberg
url: https://salzberg-lab.org/in-the-news/about-me/
- name: Daehwan Kim
url: https://kim-lab.org/daehwan-kim-principal-investigator/


@ -0,0 +1,66 @@
latest_version: 2.2.1,2.2.0,2.1.0
release:
- version: 2.2.1
date: 7/24/2020
name: HISAT2
artifacts:
Source: https://cloud.biohpc.swmed.edu/index.php/s/fE9QCsX3NH4QwBi/download
OSX_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/zMgEtnF6LjnjFrr/download
Linux_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download
- version: 2.2.0
date: 2/6/2020
name: HISAT2
artifacts:
Source: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-220-source/download
OSX_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-220-OSX_x86_64/download
Linux_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-220-Linux_x86_64/download
- version: 2.1.0
date: 6/8/2017
name: HISAT2
artifacts:
Source: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-210-source/download
OSX_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-210-OSX_x86_64/download
Linux_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-210-Linux_x86_64/download
Windows: http://www.di.fc.ul.pt/~afalcao/hisat2_windows.html
- version: 2.0.5
date: 11/4/2016
name: HISAT2
artifacts:
Source: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-205-source/download
OSX_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-205-OSX_x86_64/download
Linux_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-205-Linux_x86_64/download
- version: 2.0.4
date: 5/18/2016
name: HISAT2
artifacts:
Source: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-204-source/download
OSX_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-204-OSX_x86_64/download
Linux_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-204-Linux_x86_64/download
- version: 2.0.3-beta
date: 3/28/2016
name: HISAT2
artifacts:
Source: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-203-beta-source/download
OSX_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-203-beta-OSX_x86_64/download
Linux_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-203-beta-Linux_x86_64/download
- version: 2.0.2-beta
date: 3/17/2016
name: HISAT2
artifacts:
Source: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-202-beta-source/download
OSX_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-202-beta-OSX_x86_64/download
Linux_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-202-beta-Linux_x86_64/download
- version: 2.0.1-beta
date: 11/19/2015
name: HISAT2
artifacts:
Source: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-201-beta-source/download
OSX_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-201-beta-OSX_x86_64/download
Linux_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-201-beta-Linux_x86_64/download
- version: 2.0.0-beta
date: 9/8/2015
name: HISAT2
artifacts:
Source: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-200-beta-source/download
OSX_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-200-beta-OSX_x86_64/download
Linux_x86_64: https://cloud.biohpc.swmed.edu/index.php/s/hisat2-200-beta-Linux_x86_64/download


@ -0,0 +1,81 @@
- organism: H. sapiens
data:
GRCh38:
genome:
url: https://genome-idx.s3.amazonaws.com/hisat/grch38_genome.tar.gz
genome_snp:
url: https://genome-idx.s3.amazonaws.com/hisat/grch38_snp.tar.gz
genome_tran:
url: https://genome-idx.s3.amazonaws.com/hisat/grch38_tran.tar.gz
genome_snp_tran:
url: https://genome-idx.s3.amazonaws.com/hisat/grch38_snptran.tar.gz
genome_rep(above 2.2.0):
url: https://genome-idx.s3.amazonaws.com/hisat/grch38_rep.tar.gz
genome_snp_rep(above 2.2.0):
url: https://genome-idx.s3.amazonaws.com/hisat/grch38_snprep.tar.gz
UCSC hg38:
genome:
url: https://genome-idx.s3.amazonaws.com/hisat/hg38_genome.tar.gz
genome_tran:
url: https://genome-idx.s3.amazonaws.com/hisat/hg38_tran.tar.gz
GRCh37:
genome:
url: https://genome-idx.s3.amazonaws.com/hisat/grch37_genome.tar.gz
genome_snp:
url: https://genome-idx.s3.amazonaws.com/hisat/grch37_snp.tar.gz
genome_tran:
url: https://genome-idx.s3.amazonaws.com/hisat/grch37_tran.tar.gz
genome_snp_tran:
url: https://genome-idx.s3.amazonaws.com/hisat/grch37_snptran.tar.gz
UCSC hg19:
genome:
url: https://genome-idx.s3.amazonaws.com/hisat/hg19_genome.tar.gz
- organism: M. musculus
data:
GRCm38:
genome:
url: https://cloud.biohpc.swmed.edu/index.php/s/grcm38/download
genome_snp:
url: https://cloud.biohpc.swmed.edu/index.php/s/grcm38_snp/download
genome_tran:
url: https://cloud.biohpc.swmed.edu/index.php/s/grcm38_tran/download
genome_snp_tran:
url: https://cloud.biohpc.swmed.edu/index.php/s/grcm38_snp_tran/download
UCSC mm10:
genome:
url: https://genome-idx.s3.amazonaws.com/hisat/mm10_genome.tar.gz
- organism: R. norvegicus
data:
UCSC rn6:
genome:
url: https://genome-idx.s3.amazonaws.com/hisat/rn6_genome.tar.gz
- organism: D. melanogaster
data:
BDGP6:
genome:
url: https://genome-idx.s3.amazonaws.com/hisat/bdgp6.tar.gz
genome_tran:
url: https://genome-idx.s3.amazonaws.com/hisat/bdgp6_tran.tar.gz
UCSC dm6:
genome:
url: https://genome-idx.s3.amazonaws.com/hisat/dm6.tar.gz
- organism: C. elegans
data:
WBcel235:
genome:
url: https://genome-idx.s3.amazonaws.com/hisat/wbcel235.tar.gz
genome_tran:
url: https://genome-idx.s3.amazonaws.com/hisat/wbcel235_tran.tar.gz
UCSC ce10:
genome:
url: https://cloud.biohpc.swmed.edu/index.php/s/bbynxoY2TPpRNQb/download
- organism: S. cerevisiae
data:
R64-1-1:
genome:
url: https://cloud.biohpc.swmed.edu/index.php/s/JRSoKHD5cHfpCFE/download
genome_tran:
url: https://cloud.biohpc.swmed.edu/index.php/s/akeiMrGGtt5KoJY/download
UCSC sacCer3:
genome:
url: https://cloud.biohpc.swmed.edu/index.php/s/Gsq4goLW4TDAz4E/download


@ -0,0 +1,5 @@
<footer>
{% if site.share_buttons and include.share != false %}
{% include share-buttons.html page=include.page %}
{% endif %}
</footer>


@ -0,0 +1,64 @@
{% assign page = include.page %}
<header>
<div class="panel">
<h1>
{% if include.link %}
<a class="post-link" href="{{ page.url | prepend: site.baseurl }}">{{ page.title }}</a>
{% else %}
{{ page.title }}
{% endif %}
</h1>
<ul class="tags">
{% assign tags_num = page.tags | size %}
{% if tags_num > 0 %}
<li><i class="fa fa-tags"></i></li>
{% endif %}
{% for tag in page.tags %}
<li>
<a class="tag" href="{{ '/search/?t=' | append: tag | prepend: site.baseurl }}">#{{ tag }}</a>
</li>
{% endfor %}
</ul>
<div class="clearfix">
<ul class="meta">
{% if page.date %}
<li>
<i class="fa fa-calendar"></i>
{{ page.date | date: "%Y-%m-%d" }}
</li>
{% endif %}
{% if page.author %}
<li>
<a href="{{ '/search/?a=' | append: page.author | prepend: site.baseurl }}">
<i class="fa fa-user"></i>
{{ page.author }}
</a>
</li>
{% if page.icons %}
<li>
<ul class="icons">
{% include icons.html icons=page.icons %}
</ul>
</li>
{% endif %}
{% endif %}
</ul>
</div>
</div>
{% if site.share_buttons and include.share != false %}
<div style="margin-top: 1em;">
{% include share-buttons.html page=page %}
</div>
{% endif %}
{% if include.eye_catch != false and page.eye_catch %}
<p style="text-align: center">
<img class="eye-catch" src="{{ page.eye_catch }}"/>
</p>
{% endif %}
</header>


@ -0,0 +1,10 @@
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_shortname = '{{ site.disqus }}';
(function() {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>


@ -0,0 +1,11 @@
<!-- Init Facebook SDK -->
{% if site.share_buttons.facebook %}
<div id="fb-root"></div>
<script>(function(d, s, id) {
var js, fjs = d.getElementsByTagName(s)[0];
if (d.getElementById(id)) return;
js = d.createElement(s); js.id = id;
js.src = "//connect.facebook.net/ja_JP/sdk.js#xfbml=1&version=v2.5&appId={{ site.ogp.fb.app_id }}";
fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));</script>
{% endif %}


@ -0,0 +1,12 @@
<!-- Google Analytics -->
{% if site.google_analytics %}
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', '{{ site.google_analytics }}', 'auto');
ga('send', 'pageview');
</script>
{% endif %}

161
docs/_includes/icons.html Normal file

@ -0,0 +1,161 @@
{% assign icons = include.icons %}
{% if icons.rss %}
<li>
<a href="{{ '/feed.xml' | prepend: site.baseurl }}">
<i class="fa fa-fw fa-rss"></i>
</a>
</li>
{% endif %}
{% if icons.email %}
<li>
<a href="mailto:{{ icons.email }}">
<i class="fa fa-fw fa-envelope"></i>
</a>
</li>
{% endif %}
{% if icons.github %}
<li>
<a href="https://github.com/{{ icons.github }}">
<i class="fa fa-fw fa-github"></i>
</a>
</li>
{% endif %}
{% if icons.bitbucket %}
<li>
<a href="https://bitbucket.org/{{ icons.bitbucket }}">
<i class="fa fa-fw fa-bitbucket"></i>
</a>
</li>
{% endif %}
{% if icons.twitter %}
<li>
<a href="https://twitter.com/{{ icons.twitter }}">
<i class="fa fa-fw fa-twitter"></i>
</a>
</li>
{% endif %}
{% if icons.facebook %}
<li>
<a href="https://www.facebook.com/{{ icons.facebook }}">
<i class="fa fa-fw fa-facebook"></i>
</a>
</li>
{% endif %}
{% if icons.google_plus %}
<li>
<a href="https://plus.google.com/{{ icons.google_plus }}">
<i class="fa fa-fw fa-google-plus"></i>
</a>
</li>
{% endif %}
{% if icons.tumblr %}
<li>
<a href="https://{{ icons.tumblr }}.tumblr.com/">
<i class="fa fa-fw fa-tumblr"></i>
</a>
</li>
{% endif %}
{% if icons.behance %}
<li>
<a href="https://www.behance.net/{{ icons.behance }}">
<i class="fa fa-fw fa-behance"></i>
</a>
</li>
{% endif %}
{% if icons.dribbble %}
<li>
<a href="https://dribbble.com/{{ icons.dribbble }}">
<i class="fa fa-fw fa-dribbble"></i>
</a>
</li>
{% endif %}
{% if icons.flickr %}
<li>
<a href="https://www.flickr.com/photos/{{ icons.flickr }}">
<i class="fa fa-fw fa-flickr"></i>
</a>
</li>
{% endif %}
{% if icons.instagram %}
<li>
<a href="http://instagram.com/{{ icons.instagram }}">
<i class="fa fa-fw fa-instagram"></i>
</a>
</li>
{% endif %}
{% if icons.linkedin %}
<li>
<a href="{{ icons.linkedin }}">
<i class="fa fa-fw fa-linkedin"></i>
</a>
</li>
{% endif %}
{% if icons.pinterest %}
<li>
<a href="http://www.pinterest.com/{{ icons.pinterest }}">
<i class="fa fa-fw fa-pinterest"></i>
</a>
</li>
{% endif %}
{% if icons.reddit %}
<li>
<a href="https://www.reddit.com/user/{{ icons.reddit }}">
<i class="fa fa-fw fa-reddit"></i>
</a>
</li>
{% endif %}
{% if icons.soundcloud %}
<li>
<a href="https://soundcloud.com/{{ icons.soundcloud }}">
<i class="fa fa-fw fa-soundcloud"></i>
</a>
</li>
{% endif %}
{% if icons.stack_exchange %}
<li>
<a href="{{ icons.stack_exchange }}">
<i class="fa fa-fw fa-stack-exchange"></i>
</a>
</li>
{% endif %}
{% if icons.steam %}
<li>
<a href="http://steamcommunity.com/id/{{ icons.steam }}">
<i class="fa fa-fw fa-steam"></i>
</a>
</li>
{% endif %}
{% if icons.wordpress %}
<li>
<a href="https://{{ icons.wordpress }}.wordpress.com/">
<i class="fa fa-fw fa-wordpress"></i>
</a>
</li>
{% endif %}
{% if icons.youtube %}
<li>
<a href="https://www.youtube.com/user/{{ icons.youtube }}">
<i class="fa fa-fw fa-youtube"></i>
</a>
</li>
{% endif %}


@ -0,0 +1,7 @@
{% assign page = include.page %}
{% if page.canonical %}
{% assign url = page.canonical | prepend: site.baseurl | prepend: site.url %}
{% else %}
{% assign url = page.url | replace: 'index.html', '' | prepend: site.baseurl | prepend: site.url %}
{% endif %}


@ -0,0 +1,29 @@
{% if paginator.total_pages > 1 %}
<div class="pagination">
{% if paginator.previous_page %}
<a class="btn" href="{{ paginator.previous_page_path | prepend: site.baseurl }}">
<i class="fa fa-chevron-left"></i>
{{ site.str_prev }}
</a>
{% else %}
<span class="btn disabled">
<i class="fa fa-chevron-left"></i>
{{ site.str_prev }}
</span>
{% endif %}
{% if paginator.next_page %}
<a class="btn" href="{{ paginator.next_page_path | prepend: site.baseurl }}">
{{ site.str_next }}
<i class="fa fa-chevron-right"></i>
</a>
{% else %}
<span class="btn disabled">
{{ site.str_next }}
<i class="fa fa-chevron-right"></i>
</span>
{% endif %}
</div>
{% endif %}


@ -0,0 +1,22 @@
{% include page-url-resolver.html page=include.page %}
{% assign title = include.page.title | append: ' | ' | append: site.title %}
<div class="clearfix">
<div style="float: right !important;">
{% if site.share_buttons.twitter %}
<div style="margin-right: 5px !important; float: left !important;">
<a href="https://twitter.com/share" class="twitter-share-button" data-url="{{ url }}" data-text="{{ title }}">Tweet</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
</div>
{% endif %}
{% if site.share_buttons.facebook %}
<div style="width: 93px !important; float: left !important;">
<div class="fb-like" data-href="{{ url }}" data-layout="button_count"></div>
</div>
{% endif %}
{% if site.share_buttons.hatena %}
<div style="float: left !important;">
<a href="http://b.hatena.ne.jp/entry/{{ url }}" class="hatena-bookmark-button" data-hatena-bookmark-title="{{ title }}" data-hatena-bookmark-layout="standard-balloon" data-hatena-bookmark-lang="ja" title="Add this entry to Hatena Bookmark"><img src="https://b.st-hatena.com/images/entry-button/button-only@2x.png" alt="Add this entry to Hatena Bookmark" width="20" height="20" style="border: none;" /></a><script type="text/javascript" src="https://b.st-hatena.com/js/bookmark_button.js" charset="utf-8" async="async"></script>
</div>
{% endif %}
</div>
</div>

194
docs/_layouts/default.html Normal file

@ -0,0 +1,194 @@
<!DOCTYPE html>
<html lang="{{ site.language }}">
<head>
{% capture title %}{% if page.title %}{{ page.title }} | {% endif %}{{ site.title }}{% endcapture %}
{% include page-url-resolver.html page=page %}
{% if page.excerpt %}
{% assign description = page.excerpt | strip_html | strip_newlines | truncate: 160 %}
{% else %}
{% assign description = site.description %}
{% endif %}
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>{{ title }}</title>
<meta name="description" content="{{ description }}">
<link rel="shortcut icon" href="{{ site.favicon | prepend: site.baseurl }}" type="image/x-icon">
<link rel="canonical" href="{{ url }}">
<link rel="alternate" type="application/atom+xml" title="{{ site.title }}" href="{{ '/feed.xml' | prepend: site.baseurl }}" />
{% if page.eye_catch %}
{% assign ogp_image_url = page.eye_catch %}
{% else %}
{% assign ogp_image_url = site.ogp.image_url %}
{% endif %}
<meta property="og:title" content="{{ title }}" />
<meta property="og:type" content="website" />
<meta property="og:image" content="{{ ogp_image_url }}" />
<meta property="og:url" content="{{ url }}" />
<meta property="og:site_name" content="{{ site.title }}" />
<meta property="fb:admins" content="{{ site.ogp.fb.admin }}" />
<meta property="fb:app_id" content="{{ site.ogp.fb.app_id }}" />
<meta property="og:description" content="{{ description }}" />
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<script src="https://use.fontawesome.com/1f5f360d80.js"></script>
<link href="//fonts.googleapis.com/css?family=Source+Sans+Pro:400,700,700italic,400italic" rel="stylesheet">
<link href="{{ '/assets/css/style.css' | prepend: site.baseurl }}" rel="stylesheet">
</head>
<body>
<header class="site-header">
<div class="inner clearfix">
{% if site.avatar %}
<a href="{{ '/' | prepend: site.baseurl }}">
<img class="avatar" src="{{ site.avatar | prepend: site.baseurl }}" alt=""/>
</a>
{% endif %}
<h1 class="clearfix">
<a class="title {% if site.avatar == null %}slim{% endif %}" href="{{ '/' | prepend: site.baseurl }}">{{ site.title }}</a>
<br><span class="description">{{ site.description }}</span>
</h1>
</div>
</header>
<div class="site-container">
<div class="site-content">
{{ content }}
</div>
<aside class="site-aside">
<div class="inner">
<div class="block">
<form action="{{ site.baseurl }}/search">
<input type="search" id="search" name="q" placeholder="{{ site.str_search }}" />
</form>
</div>
<div class="block">
<ul>
{% assign pages = site.pages | where: "category", "main" | sort: 'order' %}
{% for page in pages %}
{% if page.title and page.hide != true %}
<li><a class="page-link" href="{{ page.url | prepend: site.baseurl }}">{{ page.title }}</a></li>
{% endif %}
{% endfor %}
</ul>
</div>
<!--
<ul class="icons">
{% include icons.html icons=site.icons %}
</ul>
<hr class="with-no-margin margin-bottom"/>
-->
<div class="block">
<h2>Funding</h2>
<br>
<div style="font-size: 0.8em">
This work was supported in part by the National Human Genome Research Institute under grants R01-HG006102 and R01-HG006677,
NIH grants R01-LM06845 and R01-GM083873, and NSF grant CCF-0347992 to Steven L. Salzberg,
and by the Cancer Prevention Research Institute of Texas under grant RR170068 and NIH grant R01-GM135341 to Daehwan Kim.
</div>
</div>
<div class="block">
<h2>Getting Help</h2>
<br>
Please use <a href="mailto:hisat2.genomics@gmail.com">hisat2.genomics@gmail.com</a> for private communications only. Please do not email technical questions to HISAT2 contributors directly.
</div>
<div class="block">
<h2>Publications</h2>
<div style="font-size: 0.8em">
<ul>
<li>Kim, D., Paggi, J.M., Park, C. <i>et al.</i> <a class="publication" href="https://doi.org/10.1038/s41587-019-0201-4">Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype.</a> <a class="publication" href="https://www.nature.com/nbt/"><i>Nat Biotechnol</i></a> <b>37</b>, 907&ndash;915 (2019).</li>
<li>Kim D, Langmead B and Salzberg SL. <a class="publication" href="https://doi.org/10.1038/nmeth.3317">HISAT: a fast spliced aligner with low memory requirements.</a> <a class="publication" href="https://www.nature.com/nmeth/"><i>Nature Methods</i></a> 2015</li>
<li>Pertea M, Kim D, Pertea G, Leek JT and Salzberg SL. <a class="publication" href="https://doi.org/10.1038/nprot.2016.095">Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.</a> <a class="publication" href="https://www.nature.com/nprot/"><i>Nature Protocols</i></a> 2016</li>
</ul>
</div>
</div>
<div class="block">
<h2>Contributors</h2>
<ul>
{% for item in site.data.contributor %}
<li>
{% if item.url contains "http://" or item.url contains "https://" %}
<a class="page-link" href="{{ item.url }}">{{ item.name }}</a>
{% else %}
<a class="page-link" href="{{ item.url | prepend: site.baseurl }}">{{ item.name }}</a>
{% endif %}
</li>
{% endfor %}
</ul>
</div>
{% if site.data.collaborate %}
<div class="block">
{% for item in site.data.collaborate %}
<ul style="text-align: center">
<a href="{{ item.url }}">
<img class="avatar" src="{{ item.logo | prepend: site.baseurl }}" alt="{{ item.name }}" />
</a>
</ul>
{% endfor %}
</div>
{% endif %}
<!--
<div class="block sticky">
<h2>{{ site.str_recent_posts }}</h2>
<ul>
{% assign posts = '' | split: '' %}
{% for post in site.posts %}
{% if post.hide != true %}
{% assign posts = posts | push: post %}
{% endif %}
{% endfor %}
{% assign posts = posts | sort: 'date' | reverse %}
{% for post in posts limit:site.recent_posts_num %}
<li><a href="{{ post.url | prepend: site.baseurl }}">{{ post.title }}</a></li>
{% endfor %}
</ul>
</div>
-->
</div>
</aside>
</div>
<footer class="site-footer">
<div class="inner">
<span>Powered by <a href="http://jekyllrb.com">Jekyll</a> with <a href="https://github.com/ttskch/jekyll-ttskch-theme">TtskchTheme</a></span>
</div>
</footer>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script src="{{ '/assets/lib/garand-sticky/jquery.sticky.js' | prepend: site.baseurl }}"></script>
<script src="{{ '/assets/js/script.js' | prepend: site.baseurl }}"></script>
{% if page.id %}
<script src="{{ '/assets/js/header-link.js' | prepend: site.baseurl }}"></script>
{% endif %}
{% if page.permalink == '/search/' %}
<script src="{{ '/assets/js/search.js' | prepend: site.baseurl }}"></script>
{% endif %}
{% include fb-root.html %}
{% include google-analytics.html %}
</body>
</html>

13
docs/_layouts/page.html Normal file
View File

@ -0,0 +1,13 @@
---
layout: default
---
<div class="article-wrapper">
<article>
{% include article-header.html page=page link=false share=page.share %}
<section class="post-content">
{{ content }}
</section>
{% include article-footer.html page=page share=page.share %}
</article>
</div>

19
docs/_layouts/post.html Normal file
View File

@ -0,0 +1,19 @@
---
layout: default
---
<div class="article-wrapper">
<article>
{% include article-header.html page=page link=false share=page.share %}
<section class="post-content">
{{ content }}
</section>
{% include article-footer.html page=page share=page.share %}
</article>
</div>
{% if site.disqus %}
<section class="comments">
{% include disqus.html %}
</section>
{% endif %}

9
docs/_pages/about.md Normal file
View File

@ -0,0 +1,9 @@
---
layout: page
title: About
permalink: /about/
order: 2
share: false
---
**HISAT2** is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes as well as to a single reference genome. Based on an extension of BWT for graphs ([Sir&eacute;n et al. 2014](http://dl.acm.org/citation.cfm?id=2674828)), we designed and implemented a graph FM index (GFM), an original approach and its first implementation. In addition to using one global GFM index that represents a population of human genomes, HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome. These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM).

View File

@ -0,0 +1,20 @@
---
layout: page
title: All Posts
permalink: /archives/all/
hide: true
share: false
---
<div id="search-results">
<hr id="first-hr" class="with-no-margin"/>
{% for post in site.posts %}
<div class="article-wrapper">
<article>
{% include article-header.html page=post link=true share=false eye_catch=false %}
</article>
</div>
<hr class="with-no-margin"/>
{% endfor %}
</div>

35
docs/_pages/archives.html Normal file
View File

@ -0,0 +1,35 @@
---
layout: page
title: Archives
permalink: /archives/
order: 3
share: false
hide: true
---
{% for post in site.posts %}
{% unless post.next %}
<h3>{{ post.date | date: '%Y' }}</h3>
<ul>
{% else %}
{% assign year = post.date | date: '%Y' %}
{% assign next_year = post.next.date | date: '%Y' %}
{% if year != next_year %}
</ul>
<h3>{{ post.date | date: '%Y' }}</h3>
<ul>
{% endif %}
{% endunless %}
{% assign month = post.date | date: '%m' %}
{% assign next_month = post.next.date | date: '%m' %}
{% if year != next_year or month != next_month %}
<li><a href="{{ '/search/?d=' | prepend: site.baseurl }}{{ post.date | date: '%Y-%m' }}">{{ post.date | date: '%Y/%m' }}</a></li>
{% endif %}
{% endfor %}
{% if site.posts %}
</ul>
{% endif %}
<a class="btn" href="{{ '/archives/all/' | prepend: site.baseurl }}">{{ site.str_show_all_posts }}</a>

View File

@ -0,0 +1,12 @@
---
layout: page
title: Chanhee Park
permalink: /chanhee.park/
order: 1
share: false
category: contributor
---
Chanhee Park is a Scientific Software Engineer in the Kim Lab at UTSW responsible for maintaining and improving HISAT2.
[Linkedin](https://www.linkedin.com/in/chanhee-park-97677297/)

View File

@ -0,0 +1,12 @@
---
layout: page
title: Yun (Leo) Zhang
permalink: /leo.zhang/
order: 1
share: false
category: contributor
---
Yun (Leo) is a biomedical engineering graduate student at UT Southwestern Medical Center. His main research includes developing advanced alignment tools.
[Linkedin](https://www.linkedin.com/in/zhang-yun-a9565891/)

61
docs/_pages/download.md Normal file
View File

@ -0,0 +1,61 @@
---
layout: page
title: Download
permalink: /download/
order: 5
share: false
---
Please cite:
>Kim, D., Paggi, J.M., Park, C. _et al._ Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. _Nat Biotechnol_ **37**, 907–915 (2019). <https://doi.org/10.1038/s41587-019-0201-4>
- TOC
{:toc}
## Index
HISAT2 indexes are hosted on AWS (Amazon Web Services), thanks to the AWS Public Datasets program. Click this [link](https://registry.opendata.aws/jhu-indexes/) for more details.
{% for item in site.data.download-index %}
### {{ item.organism }}
{% for data in item.data %}
<li>{{ data[0] }}</li>
<table style="border-collapse: collapse; border: none;">
{% for genome in data[1] %}
<tr style="border: none;"><td style="border: none;">{{ genome[0] }}</td>
<td style="border: none;">
{% for url in genome[1] %}
<a href="{{ url[1] }}">{{ url[1] }}</a><br/>
{% endfor %}
</td>
</tr>
{% endfor %}
</table>
{% endfor %}
{% endfor %}
genome: HISAT2 index for reference
genome_snp: HISAT2 Graph index for reference plus SNPs
genome_tran: HISAT2 Graph index for reference plus transcripts
genome_snp_tran: HISAT2 Graph index for reference plus SNPs and transcripts
## Binaries
{: binaries }
{% assign targets = site.data.download-binary.latest_version | split: "," %}
{% for release in site.data.download-binary.release %}
{% assign version = release['version'] %}
{% if targets contains version or targets == null %}
{% assign name = release['name'] %}
### Version: {{name}} {{version}}
<table style="border-collapse: collapse; border: none;">
<tr style="border: none;"><td style="border: none;" colspan="2"><b>Release Date</b>: {{release['date']}}</td></tr>
{% for artifact in release['artifacts'] %}
{% assign type = artifact[0] %}
<tr style="border: none;"><td style="border: none;">{{type}}</td><td style="border: none;"><a href="{{artifact[1]}}">{{artifact[1]}}</a></td></tr>
{% endfor %}
</table>
{% endif %}
{% endfor %}

225
docs/_pages/hisat-3n.md Normal file
View File

@ -0,0 +1,225 @@
---
layout: page
title: HISAT-3N
permalink: /hisat-3n/
order: 4
share: false
---
HISAT-3N
============
Overview
-----------------
**HISAT-3N** (hierarchical indexing for spliced alignment of transcripts - 3 nucleotides)
is designed for nucleotide conversion sequencing technologies and implemented based on HISAT2.
There are two strategies for HISAT-3N to align nucleotide conversion sequencing reads: *standard mode* and *repeat mode*.
The standard mode aligns reads with the standard-3N index only, so it is fast and requires little memory (~9 GB for human genome alignment).
The repeat mode aligns reads with both the standard-3N index and the repeat-3N index, then outputs up to 1,000 alignment results by default (the number can be adjusted with `--repeat-limit`).
The repeat mode can align nucleotide conversion reads more accurately,
and it is only about 10% slower than the standard mode, with slightly higher memory usage (about 10.5 GB).
HISAT-3N is developed based on [HISAT2], which is particularly optimized for RNA sequencing technology.
HISAT-3N can be used for any base-converted sequencing reads, including [BS-seq], [SLAM-seq], [TAB-seq], [oxBS-seq], [TAPS], [scBS-seq], and [scSLAM-seq].
[HISAT2]:https://github.com/DaehwanKimLab/hisat2
[BS-seq]: https://en.wikipedia.org/wiki/Bisulfite_sequencing
[SLAM-seq]: https://www.nature.com/articles/nmeth.4435
[scBS-seq]: https://www.nature.com/articles/nmeth.3035
[scSLAM-seq]: https://www.nature.com/articles/s41586-019-1369-y
[TAPS]: https://www.nature.com/articles/s41587-019-0041-2
[TAB-seq]: https://doi.org/10.1016/j.cell.2012.04.027
[oxBS-seq]: https://science.sciencemag.org/content/336/6083/934
Getting started
============
HISAT-3N requires a 64-bit computer running either Linux or Mac OS X and at least 16 GB of RAM.
A few notes:
1. The repeat 3N index building process requires 256 GB of RAM.
2. The standard 3N index building requires no more than 16 GB of RAM.
3. The alignment process with either standard or repeat index requires no more than 16 GB of RAM.
4. [SAMtools] is required to sort the SAM file for hisat-3n-table.
Install
------------
git clone https://github.com/DaehwanKimLab/hisat2.git
cd hisat2
git checkout -b hisat-3n origin/hisat-3n
make
Make sure that you are in the `hisat-3n` branch
Build a 3N index with `hisat-3n-build`
-----------
`hisat-3n-build` builds a 3N-index, which contains two hisat2 indexes, from a set of DNA sequences. For standard 3N-index,
each index contains 16 files with suffix `.3n.*.*.ht2`.
For repeat 3N-index, there are 16 more files in addition to the standard 3N-index, and they have the suffix
`.3n.*.rep.*.ht2`.
These files constitute the hisat-3n index, and no other files are needed to align reads to the reference.
* Example for standard HISAT-3N index building:
`hisat-3n-build genome.fa genome`
* Example for repeat HISAT-3N index building (requires 256 GB of memory):
`hisat-3n-build --repeat-index genome.fa genome`
Optionally, you can build a graph index and add SNP or splice site information to the index to increase alignment accuracy.
For more detail, please check the [HISAT2 manual].
[HISAT2 manual]:https://daehwankimlab.github.io/hisat2/manual/
# Standard HISAT-3N integrated index with SNP information
hisat-3n-build --exons genome.exon genome.fa genome
# Standard HISAT-3N integrated index with splicing site information
hisat-3n-build --ss genome.ss genome.fa genome
# Repeat HISAT-3N integrated index with SNP information
hisat-3n-build --repeat-index --exons genome.exon genome.fa genome
# Repeat HISAT-3N integrated index with splicing site information
hisat-3n-build --repeat-index --ss genome.ss genome.fa genome
Alignment with `hisat-3n`
------------
After you build the HISAT-3N index, you are ready to use `hisat-3n` for alignment.
HISAT-3N accepts the HISAT2 arguments plus some extra arguments. Please check the [HISAT2 manual] for more detail.
For a human genome reference, HISAT-3N requires about 9 GB for alignment with the standard 3N-index and 10.5 GB with the repeat 3N-index.
* `--base-change <chr1,chr2>`
Specify which base is converted to another base during the sequencing process. Enter
2 letters separated by ',' for this argument. The first letter (chr1) should be the converted base, and the second letter (chr2) should be
the base it is converted to. For example, during SLAM-seq some 'T' is converted to 'C', so
enter `--base-change T,C`. During bisulfite-seq some 'C' is converted to 'T', so enter `--base-change C,T`.
If you want to align non-converted reads to the regular HISAT2 index, do not use this option.
* `--index/-x <hisat-3n-idx>`
The index for HISAT-3N. The basename is the name of the index files up to but not including the suffix `.3n.*.*.ht2` / etc.
For example, if you built your index with the basename 'genome' using `hisat-3n-build`, enter `--index genome`.
* `--repeat-limit <int>`
Set the number of alignments that will be checked for each repeat alignment. You may increase this number to let hisat-3n
output more alignments when a read maps to multiple locations. We suggest a repeat limit of no more
than 1,000,000 for paired-end read alignment. Default: 1000.
* `--unique-only`
Only output uniquely aligned reads.
#### Examples:
* Single-end slam-seq reads (T to C conversion) alignment with standard 3N-index:
`hisat-3n --index genome -f -U read.fa -S alignment_result.sam --base-change T,C`
* Paired-end bisulfite-seq reads (C to T conversion) alignment with repeat 3N-index:
`hisat-3n --index genome -f -1 read_1.fa -2 read_2.fa -S alignment_result.sam --base-change C,T`
* Single-end TAPS reads (C to T conversion) alignment with repeat 3N-index, outputting only uniquely aligned results:
`hisat-3n --index genome -q -U read.fq -S alignment_result.sam --base-change C,T --unique-only`
#### Extra SAM tags generated by HISAT-3N:
* `Yf:i:<N>`: Number of conversions detected in the read.
* `YZ:A:<A>`: The value `+` or `-` indicates whether the read is mapped to REF-3N (`+`) or REF-RC-3N (`-`).
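
The two tags above can be read back from a SAM line with a few lines of Python. This is a minimal sketch, not part of HISAT-3N: the function name `parse_sam_tags` and the example alignment line are made up for illustration, and only the standard `TAG:TYPE:VALUE` layout of SAM optional fields is assumed.

```python
def parse_sam_tags(sam_line):
    """Return the optional fields of one SAM alignment line as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # optional fields start at column 12
        name, value_type, value = field.split(":", 2)
        # Integer-typed tags (e.g. Yf:i:<N>) become ints; others stay strings.
        tags[name] = int(value) if value_type == "i" else value
    return tags


# Hypothetical alignment line, used only to exercise the parser.
line = "read1\t0\t1\t11874\t60\t5M\t*\t0\t0\tACGTT\tFFFFF\tYf:i:2\tYZ:A:+"
tags = parse_sam_tags(line)
print(tags["Yf"], tags["YZ"])
```

Here `Yf` gives the number of detected conversions and `YZ` tells whether the read mapped to REF-3N or REF-RC-3N.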
Generate a 3N-conversion-table with `hisat-3n-table`
------------
### Preparation
To generate a 3N-conversion-table, users need to sort the SAM file generated by `hisat-3n`.
[SAMtools] is required for this sorting process.
Use `samtools sort` to convert the SAM file to a sorted SAM file.
samtools sort alignment_result.sam -o sorted_alignment_result.sam -O sam
Generate 3N-conversion-table with `hisat-3n-table`:
### Usage
hisat-3n-table [options]* --alignments <alignmentFile> --ref <refFile> --output-name <outputFile> --base-change <char1,char2>
#### Main arguments
* `--alignments <alignmentFile>`
SORTED SAM file. Please enter `-` for standard input.
* `--ref <refFile>`
The reference genome file (FASTA format) for generating HISAT-3N index.
* `--output-name <outputFile>`
Filename to write 3N-conversion-table (tsv format) to.
* `--base-change <char1,char2>`
The base-change rule. Enter exactly the same `--base-change` argument that was used with hisat-3n.
For example, please enter `--base-change C,T` for bisulfite sequencing reads.
#### Input options
* `-u/--unique-only`
Only count uniquely aligned reads in the 3N-conversion-table.
* `-m/--multiple-only`
Only count multiply aligned reads in the 3N-conversion-table.
* `-c/--CG-only`
Only count CpG islands in the reference genome. This option is designed for bisulfite sequencing reads.
* `-p/--threads <int>`
Launch `int` parallel threads (default: 1) for table building.
* `-h/--help`
Print usage information and quit.
#### Examples:
* Generate 3N conversion table for bisulfite sequencing data:
`hisat-3n-table -p 16 --alignments sorted_alignment_result.sam --ref genome.fa --output-name output.tsv --base-change C,T`
* Generate 3N-conversion-table for TAPS data, counting only bases in CpG islands from uniquely aligned reads:
`hisat-3n-table -p 16 --alignments sorted_alignment_result.sam --ref genome.fa --output-name output.tsv --base-change C,T --CG-only --unique-only`
* Generate 3N conversion table for bisulfite sequencing data from sorted BAM file:
`samtools view -h sorted_alignment_result.bam | hisat-3n-table --ref genome.fa --alignments - --output-name output.tsv --base-change C,T`
* Generate 3N conversion table for bisulfite sequencing data from unsorted BAM file:
`samtools sort alignment_result.bam -O sam | hisat-3n-table --ref genome.fa --alignments - --output-name output.tsv --base-change C,T`
#### Note:
There are 7 columns in the 3N-conversion-table:
1. `ref`: the chromosome name.
2. `pos`: 1-based position in ref.
3. `strand`: '+' for forward strand. '-' for reverse strand.
4. `convertedBaseQualities`: the qualities of the converted bases in read-level measurement. The length of this string equals
the number of converted bases in read-level measurement.
5. `convertedBaseCount`: number of distinct read positions where converted bases in read-level measurements were found.
This number should equal the length of convertedBaseQualities.
6. `unconvertedBaseQualities`: the qualities of the unconverted bases in read-level measurement. The length of this string equals
the number of unconverted bases in read-level measurement.
7. `unconvertedBaseCount`: number of distinct read positions where unconverted bases in read-level measurements were found.
This number should equal the length of unconvertedBaseQualities.
##### Sample 3N-conversion-table:
    ref   pos     strand   convertedBaseQualities   convertedBaseCount   unconvertedBaseQualities   unconvertedBaseCount
    1     11874   +        FFFFFB<BF<F              11                                              0
    1     11877   -        FFFFFF<                  7                                               0
    1     11878   +        FFFBB//F/BB              11                                              0
    1     11879   +                                 0                    FFFBB//FB/                 10
    1     11880   -        F                        1                    FFFF/                      5
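
As a rough illustration of how the seven columns can be used downstream, the sketch below computes the converted fraction at each position. It is not part of HISAT-3N: the function name `conversion_rates` is hypothetical, and it assumes the table is tab-separated with the header row shown above and that an empty quality string is written as an empty field.

```python
import csv
import io


def conversion_rates(tsv_text):
    """Yield (ref, pos, strand, converted_fraction) for every covered position."""
    # QUOTE_NONE: quality strings may contain '"', which is not a quote here.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for row in reader:
        converted = int(row["convertedBaseCount"])
        unconverted = int(row["unconvertedBaseCount"])
        total = converted + unconverted
        if total == 0:
            continue  # no read-level evidence at this position
        yield row["ref"], int(row["pos"]), row["strand"], converted / total


# Rows taken from the sample table; empty quality columns are assumed
# to be written as empty tab-separated fields.
sample = (
    "ref\tpos\tstrand\tconvertedBaseQualities\tconvertedBaseCount\t"
    "unconvertedBaseQualities\tunconvertedBaseCount\n"
    "1\t11874\t+\tFFFFFB<BF<F\t11\t\t0\n"
    "1\t11879\t+\t\t0\tFFFBB//FB/\t10\n"
    "1\t11880\t-\tF\t1\tFFFF/\t5\n"
)

for ref, pos, strand, rate in conversion_rates(sample):
    print(f"{ref}:{pos}{strand}\t{rate:.3f}")
```

For bisulfite data with `--base-change C,T`, a converted fraction like this is the usual starting point for estimating per-site methylation, although the interpretation depends on the protocol.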
[SAMtools]: http://samtools.sourceforge.net
Publication
============
* HISAT-3N paper
Zhang, Y., Park, C., Bennett, C., Thornton, M., & Kim, D. (2021). [Rapid and accurate alignment of nucleotide conversion sequencing reads with HISAT-3N](https://doi.org/10.1101/gr.275193.120). Genome research, gr.275193.120. Advance online publication.
* HISAT2 paper
Kim, D., Paggi, J.M., Park, C. _et al._ [Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype](https://doi.org/10.1038/s41587-019-0201-4). _Nat Biotechnol_ **37**, 907–915 (2019)

135
docs/_pages/hisat2.md Normal file
View File

@ -0,0 +1,135 @@
---
layout: page
title: Main
permalink: /main/
order: 1
share: false
---
**HISAT2** is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes as well as to a single reference genome. Based on an extension of BWT for graphs ([Sir&eacute;n et al. 2014](http://dl.acm.org/citation.cfm?id=2674828)), we designed and implemented a graph FM index (GFM), an original approach and its first implementation. In addition to using one global GFM index that represents a population of human genomes, **HISAT2** uses a large set of small GFM indexes that collectively cover the whole genome. These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM).
### Index files are moved to the AWS Public Dataset Program. 9/3/2020
We have moved HISAT2 index files to the AWS Public Dataset Program. See the [link](https://registry.opendata.aws/jhu-indexes/) for more details.
### HISAT 2.2.1 release 7/24/2020
This patch version includes the following changes.
* Python3 support
* Remove the HISAT-genotype related scripts. HISAT-genotype moved to [http://daehwankimlab.github.io/hisat-genotype/](http://daehwankimlab.github.io/hisat-genotype/)
* Fixed bugs related to `--read-lengths` option
### HISAT 2.2.0 release 2/6/2020
This major version update includes a new feature to handle “repeat” reads. Based on sets of 100-bp simulated and 101-bp real reads that we tested, we found that 2.6-3.4% and 1.4-1.8% of the reads were mapped to >5 locations and >100 locations, respectively. Attempting to report all alignments would likely consume a prohibitive amount of disk space. In order to address this issue, our repeat indexing and alignment approach directly aligns reads to repeat sequences, resulting in one repeat alignment per read. HISAT2 provides application programming interfaces (API) for C++, Python, and JAVA that rapidly retrieve genomic locations from repeat alignments for use in downstream analyses.
Other minor bug fixes are also included as follows:
* Fixed occasional sign (+ or -) issues of template lengths in SAM file
* Fixed duplicate read alignments in SAM file
* Skip a splice site if exon's last base or first base is ambiguous (N)
### Index files are moved to a different location. 8/30/2019
Due to a high volume of index downloads, we have moved HISAT2 index files to a different location in order to provide faster download speed. If you use wget or curl to download index files, then you may need to use the following commands to get the correct file name.
* `wget --content-disposition` *download_link*
* `curl -OJ` *download_link*
### [The HISAT2 paper](https://www.nature.com/articles/s41587-019-0201-4) is out in *Nature Biotechnology*. 8/2/2019
### HISAT 2.1.0 release 6/8/2017
* This major version includes the first release of HISAT-genotype, which currently performs HLA typing,
DNA fingerprinting analysis, and CYP typing on whole genome sequencing (WGS) reads.
We plan to extend the system so that it can analyze not just a few genes, but a whole human genome.
Please refer to [the HISAT-genotype website](https://daehwankimlab.github.io/hisat-genotype) for more details.
* HISAT2 can be directly compiled and executed on Windows system using Visual Studio, thanks to [Nigel Dyer](http://www2.warwick.ac.uk/fac/sci/systemsbiology/staff/dyer/).
* Implemented `--new-summary` option to output a new style of alignment summary, which is easier to parse for programming purposes.
* Implemented `--summary-file` option to output alignment summary to a file in addition to the terminal (e.g. stderr).
* Fixed discrepancy in HISAT2's alignment summary.
* Implemented `--no-templatelen-adjustment` option to disable automatic template length adjustment for RNA-seq reads.
### HISAT2 2.0.5 release 11/4/2016
Version 2.0.5 is a minor release with the following changes.
* Due to a policy change (HTTP to HTTPS) in using SRA data (`--sra-option`), users are strongly encouraged to use this version. As of 11/9/2016, NCBI will begin a permanent redirect to HTTPS, which means that previous versions of HISAT2 will soon no longer work with the `--sra-acc` option.
* Implemented `-I` and `-X` options for specifying minimum and maximum fragment lengths. The options are valid only when used with `--no-spliced-alignment`, which is used for the alignment of DNA-seq reads.
* Fixed some cases where reads with SNPs on their 5' ends were not properly aligned.
* Implemented `--no-softclip` option to disable soft-clipping.
* Implemented `--max-seeds` to specify the maximum number of seeds that HISAT2 will try to extend to full-length alignments (see [the manual] for details).
### [HISAT, StringTie and Ballgown protocol](http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html) published at Nature Protocols 8/11/2016
### HISAT2 2.0.4 Windows binary available [here](http://www.di.fc.ul.pt/~afalcao/hisat2_windows.html), thanks to [Andre Osorio Falcao](http://www.di.fc.ul.pt/~afalcao/) 5/24/2016
### HISAT2 2.0.4 release 5/18/2016
Version 2.0.4 is a minor release with the following changes.
* Improved template length estimation (the 9th column of the SAM format) of RNA-seq reads by taking introns into account.
* Introduced two options, `--remove-chrname` and `--add-chrname`, to remove "chr" from reference names or add "chr" to reference names in the alignment output, respectively (the 3rd column of the SAM format).
* Changed the maximum of mapping quality (the 5th column of the SAM format) from 255 to 60. Note that 255 is an undefined value according to the SAM manual and some programs would not work with this value (255) properly.
* Fixed NH (number of hits) in the alignment output.
* HISAT2 allows indels of any length pertaining to minimum alignment score (previously, the maximum length of indels was 3 bp).
* Fixed several cases in which alignments went beyond reference sequences.
* Fixed reporting duplicate alignments.
### HISAT2 2.0.3-beta release 3/28/2016
Version 2.0.3-beta is a minor release with the following changes.
* Fixed graph index building when using both SNPs and transcripts. As a result, genome_snp_tran indexes here on the HISAT2 website have been rebuilt.
* Included some missing files needed to follow the small test example (see [the manual] for details).
### HISAT2 2.0.2-beta release 3/17/2016
**Note (3/19/2016):** this version is slightly updated to handle reporting splice sites with the correct chromosome names.
Version 2.0.2-beta is a major release with the following changes.
* Memory-mapped IO (`--mm` option) works now.
* Building a linear index can now be done using multiple threads.
* Changed the minimum score for alignment in keeping with read lengths, so it's now `--score-min L,0.0,-0.2`, meaning a minimum score of -20 for 100-bp reads and -30 for 150-bp reads.
* Fixed a bug that the same read was written into a file multiple times when `--un-conc` was used.
* Fixed another bug that caused reads to map beyond reference sequences.
* Introduced `--haplotype` option in the hisat2-build (index building), which is used with `--snp` option together to incorporate those SNP combinations present in the human population. This option also prevents graph construction from exploding due to exponential combinations of SNPs in small genomic regions.
* Provided a new python script to extract SNPs and haplotypes from VCF files, <i>hisat2_extract_snps_haplotypes_VCF.py</i>
* Changed several python script names as follows:
* *extract_splice_sites.py* to *hisat2_extract_splice_sites.py*
* *extract_exons.py* to *hisat2_extract_exons.py*
* *extract_snps.py* to *hisat2_extract_snps_haplotypes_UCSC.py*
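
The `--score-min L,0.0,-0.2` setting mentioned in the 2.0.2-beta notes above describes a linear function of read length. The sketch below reproduces that arithmetic, assuming the `L,<intercept>,<slope>` form means intercept + slope × read length, which matches the two values quoted in the notes. The function name is illustrative only and is not part of HISAT2.

```python
def min_alignment_score(read_length, intercept=0.0, slope=-0.2):
    """Minimum alignment score for a linear rule L,<intercept>,<slope>."""
    return intercept + slope * read_length


# Read lengths quoted in the release notes.
for length in (100, 150):
    print(length, round(min_alignment_score(length), 1))
```

Under this rule the threshold scales with read length, so longer reads are allowed proportionally more mismatches and gaps.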
### HISAT2 2.0.1-beta release 11/19/2015
Version 2.0.1-beta is a maintenance release with the following changes.
* Fixed a bug that caused reads to map beyond reference sequences.
* Fixed a deadlock issue that happened very rarely.
* Fixed a bug that led to illegal memory access when reading SNP information.
* Fixed a system-specific bug related to popcount instruction.
### HISAT2 2.0.0-beta release 9/8/2015 - first release
We extended the BWT/FM index to incorporate genomic differences among individuals into the reference genome, while keeping memory requirements low enough to fit the entire index onto a desktop computer. Using this novel Hierarchical Graph FM index (HGFM) approach, we built a new alignment system, HISAT2, with an index that incorporates ~12.3M common SNPs from the dbSNP database. HISAT2 provides greater alignment accuracy for reads containing SNPs.
* HISAT2's index size for the human reference genome and 12.3 million common SNPs is 6.2GB (the memory footprint of HISAT2 is 6.7GB). The SNPs consist of 11 million single nucleotide polymorphisms, 728,000 deletions, and 555,000 insertions. The insertions and deletions used in this index are small (usually <20bp).
* HISAT2 comes with several index types:
* Hierarchical FM index (HFM) for a reference genome (index base: <i>genome</i>)
* Hierarchical Graph FM index (HGFM) for a reference genome plus SNPs (index base: <i>genome_snp</i>)
* Hierarchical Graph FM index (HGFM) for a reference genome plus transcripts (index base: <i>genome_tran</i>)
* Hierarchical Graph FM index (HGFM) for a reference genome plus SNPs and transcripts (index base: <i>genome_snp_tran</i>)
* HISAT2 is a successor to both [HISAT](http://ccb.jhu.edu/software/hisat) and [TopHat2](http://ccb.jhu.edu/software/tophat). We recommend that HISAT and TopHat2 users switch to HISAT2.
* HISAT2 can be considered an enhanced version of HISAT with many improvements and bug fixes. The alignment speed and memory requirements of HISAT2 are virtually the same as those of HISAT when using the HFM index (<i>genome</i>).
* When using graph-based indexes (HGFM), the runtime of HISAT2 is slightly slower than HISAT (30~80% additional CPU time).
* HISAT2 allows for mapping reads directly against transcripts, similar to that of TopHat2 (use <i>genome_tran</i> or <i>genome_snp_tran</i>).
* When reads contain SNPs, the SNP information is provided as an optional field in the SAM output of HISAT2 (e.g., **<code>Zs:Z:1|S|rs3747203,97|S|rs16990981</code>** - see [the manual] for details). This feature enables fast and sensitive genotyping in downstream analyses. Note that there is no alignment penalty for mismatches, insertions, and deletions if they correspond to known SNPs.
* HISAT2 provides options for transcript assemblers (e.g., StringTie and Cufflinks) to work better with the alignment from HISAT2 (see options such as `--dta` and `--dta-cufflinks`).
* Some slides about HISAT2 are found [here]({{ '/assets/data/HISAT2-first_release-Sept_8_2015.pdf' | prepend: site.baseurl }}) and we are preparing detailed documentation.
* We plan to incorporate a larger set of SNPs and structural variations (SV) into this index (e.g., long insertions/deletions, inversions, and translocations).
[the manual]: {{ site.baseurl }}{% link _pages/manual.md %}
### The HISAT2 source code is available in a [public GitHub repository](https://github.com/DaehwanKimLab/hisat2) (5/30/2015).

78
docs/_pages/howto.md Normal file
View File

@ -0,0 +1,78 @@
---
layout: page
title: HowTo
permalink: /howto/
order: 6
share: false
---
## HOWTO
{: .no_toc}
- TOC
{:toc}
### Building indexes
Depending on your purpose, you have to download the reference sequence, gene annotation, and SNP files.
We also provide scripts to build indexes. [Download]({{ site.baseurl }}{% link _pages/download.md %})
#### Prepare data
1. Download reference
```
$ wget ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ mv Homo_sapiens.GRCh38.dna.primary_assembly.fa genome.fa
```
1. Download GTF and make exon and splice site files.
If you want to build an HFM index, you can skip this step.
```
$ wget ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz
$ gzip -d Homo_sapiens.GRCh38.84.gtf.gz
$ mv Homo_sapiens.GRCh38.84.gtf genome.gtf
$ hisat2_extract_splice_sites.py genome.gtf > genome.ss
$ hisat2_extract_exons.py genome.gtf > genome.exon
```
1. Download SNP
If you want to build an HFM index, you can skip this step.
```
$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp144Common.txt.gz
$ gzip -d snp144Common.txt.gz
```
Convert chromosome names from the UCSC database style to the Ensembl annotation style
```
$ awk 'BEGIN{OFS="\t"} {if($2 ~ /^chr/) {$2 = substr($2, 4)}; if($2 == "M") {$2 = "MT"} print}' snp144Common.txt > snp144Common.txt.ensembl
```
Make the SNP and haplotype files
```
$ hisat2_extract_snps_haplotypes_UCSC.py genome.fa snp144Common.txt.ensembl genome
```
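
The chromosome renaming done by the `awk` command above (strip a leading `chr`, then map `M` to `MT`) can also be written in Python. This is only a sketch for readers who prefer to inspect the rule step by step: the function name `ucsc_to_ensembl` and the sample lines are hypothetical, and unlike `awk` it splits strictly on tabs rather than on any whitespace.

```python
def ucsc_to_ensembl(line):
    """Rewrite column 2 (chromosome) of a UCSC SNP table line to Ensembl style."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[1]
    if chrom.startswith("chr"):
        chrom = chrom[3:]   # "chr1" -> "1", "chrM" -> "M"
    if chrom == "M":
        chrom = "MT"        # Ensembl names the mitochondrial genome "MT"
    fields[1] = chrom
    return "\t".join(fields)


# Made-up lines with only the first few columns of the UCSC table.
print(ucsc_to_ensembl("585\tchr1\t10177\t10177"))
print(ucsc_to_ensembl("585\tchrM\t100\t101"))
```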
#### Build HFM index
It takes about 20 minutes (depending on hardware specs) to build the index and requires at least 6 GB of memory.
```
$ hisat2-build -p 16 genome.fa genome
```
#### Build HGFM index with SNPs
```
$ hisat2-build -p 16 --snp genome.snp --haplotype genome.haplotype genome.fa genome_snp
```
#### Build HGFM index with transcripts
It takes about 1 hour (depending on hardware specs) to build the index and requires at least 160 GB of memory.
```
$ hisat2-build -p 16 --exon genome.exon --ss genome.ss genome.fa genome_tran
```
#### Build HGFM index with SNPs and transcripts
```
$ hisat2-build -p 16 --snp genome.snp --haplotype genome.haplotype --exon genome.exon --ss genome.ss genome.fa genome_snp_tran
```

17
docs/_pages/links.md Normal file
View File

@ -0,0 +1,17 @@
---
layout: page
title: Links
permalink: /links/
order: 7
share: false
---
* KimLab - <https://kim-lab.org>
* github - <https://github.com/DaehwanKimLab>
* hisat-genotype - <https://daehwankimlab.github.io/hisat-genotype>
* github for hisat-genotype - <https://github.com/DaehwanKimLab/hisat-genotype>
* Lyda Hill Department of Bioinformatics at UT Southwestern Medical Center - <https://www.utsouthwestern.edu/departments/bioinformatics>
* Center for Computational Biology at Johns Hopkins University - <http://www.ccb.jhu.edu>

1545
docs/_pages/manual.md Normal file

File diff suppressed because it is too large Load Diff

26
docs/_pages/search.html Normal file
View File

@ -0,0 +1,26 @@
---
layout: page
title: Search Results
permalink: /search/
hide: true
share: false
---
<script>
var baseurl = "{{ site.baseurl }}";
</script>
<div id="search-results">
<hr id="first-hr" class="with-no-margin"/>
{% for post in site.posts %}
<div id="{{ post.id | replace: '/', '-' }}" style="display: none;">
<div class="article-wrapper">
<article>
{% include article-header.html page=post link=true share=false eye_catch=false %}
</article>
</div>
<hr class="with-no-margin"/>
</div>
{% endfor %}
</div>

14
docs/_pages/tags.html Normal file
View File

@ -0,0 +1,14 @@
---
layout: page
title: Tags
permalink: /tags/
order: 2
share: false
hide: true
---
<ul class="inline">
{% for tag in site.tags %}
<li><a href="{{ '/search/?t=' | prepend: site.baseurl }}{{ tag[0] }}">#{{ tag[0] }}</a></li>
{% endfor %}
</ul>

View File

@ -0,0 +1,13 @@
---
layout: post
title: Daehwan Kim
tags: daehwankim
eye_catch: https://avatars0.githubusercontent.com/u/28678667?s=460&v=4
---
Daehwan Kim is an Assistant Professor at UT Southwestern and was the original designer who laid much of the groundwork for HISAT-genotype.
[Webpage](https://kim-lab.org/daehwan-kim-principal-investigator/)

View File

@ -0,0 +1,11 @@
---
layout: post
title: Steven Salzberg
tags: stevensalzberg
eye_catch: https://avatars0.githubusercontent.com/u/28678667?s=460&v=4
---
Steven Salzberg is the Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University, where he is also Director of the Center for Computational Biology.
[Webpage](https://salzberg-lab.org/in-the-news/about-me/)

View File

@ -0,0 +1,13 @@
---
layout: post
title: Ben Langmead
tags: benlangmead
eye_catch: https://avatars0.githubusercontent.com/u/28678667?s=460&v=4
---
Ben Langmead is an Associate Professor of Computer Science at Johns Hopkins University.
[Webpage](http://www.langmead-lab.org/)

View File

@ -0,0 +1,10 @@
---
layout: post
title: Chanhee Park
tags: chanheepark
eye_catch: https://avatars0.githubusercontent.com/u/28678667?s=460&v=4
---
Chanhee Park is a Scientific Software Engineer in the Kim Lab at UTSW responsible for maintaining and improving HISAT2, the core of HISAT-genotype.
[Linkedin](https://www.linkedin.com/in/chanhee-park-97677297/)

Some files were not shown because too many files have changed in this diff Show More