Add 18 pKa scales for isoelectric point calculation#5200
Conversation
Add support for 18 different pKa scales to Bio.SeqUtils.IsoelectricPoint and Bio.SeqUtils.ProtParam, selectable via the pKa_scale parameter. Available scales: Bjellqvist, DTASelect, Dawson, EMBOSS, Grimsley, IPC2_peptide, IPC2_protein, IPC_peptide, IPC_protein, Lehninger, Nozaki, Patrickios, Rodwell, Sillero, Solomon, Thurlkill, Toseland, Wikipedia. The default remains Bjellqvist for backward compatibility. All existing tests pass unchanged. For improved accuracy, IPC2_protein (proteins) or IPC2_peptide (peptides) are recommended (Kozlowski 2021, DOI: 10.1093/nar/gkab295). Key changes vs the original PR biopython#3819: - Keep Bjellqvist as default (not IPC2_protein) to avoid breaking changes - Raise ValueError for invalid scale names (not silent fallback) - Keep bisection tolerance at 0.0001 (not relaxed to 0.001) - Widen search range to [1.0, 14.0] for all pKa scales - Add comprehensive test suite (56 tests across 9 test classes) Cross-validated against the standalone IPC2 tool (ipc-2.0.1) from https://ipc2.mimuw.edu.pl/ -- max difference 0.007 across 72 comparisons (well within IPC2's own 0.01 precision). Co-authored-by: Claude Opus 4.6 <[email protected]>
682a7e8 to
ed30fe6
Compare
|
Was this PR or its description written with AI tools? |
As I wrote, it picked up the original PR code which had 76% code coverage, which I got tested by Opus 4.6 and added the remaining tests for missing areas, added documentation, etc. The |
Neither, some people could call it an upgrade/improvement. BTW, I made this change 5(!) years ago - this was even before ChatGPT existed. Best wishes, Lukasz |
@lukasz-kozlowski Feel free to pick it up back and re-submit it under your name, as usual. I keep my fingers crossed. Our lab might soon use the pKa functions for our own research, as well as our mass spec facility. |
| """ | ||
|
|
||
| def __init__(self, protein_sequence, aa_content=None): | ||
| def __init__(self, aa_sequence, pKa_scale=None, aa_content=None): |
There was a problem hiding this comment.
Since positional arguments may be called by their name and keyword arguments do not need to called by their name, this may break existing scripts, e.g.
>>> protein = PA(protein_sequence='"PETER")
>>> protein = PA("PETER", my_aa_content)There was a problem hiding this comment.
In general, please stop using silly toy sequences (e.g. "PETER"), even if they are perfectly valid in respect of the amino acid alphabet, they harm the benchmarking methods in real situations.
For benchmarks, you should use classic protein sequences for which we know almost every biochemical aspect, including the structures, and use them to test in folding, MD simulations, etc. For instance:
- Lysozyme 1LYZ -
KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL - Villin 1YU5 - LSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF
- Ubiquitin 1UBQ - MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
etc.
Frequently, many methods will fail or make dull predictions if you feed them with nonsense sequences GIGO
There was a problem hiding this comment.
You didn't get my point. You changed the signature of the class in a way that will break existing scripts if they used the class as I have exemplified, "toy sequences" or not.
There was a problem hiding this comment.
Sorry for the confusion, this was a side comment. A few moments ago we had a comment that there is little difference between methods, and the problem is that biopython for brevity uses those toy sequences, aka I want to run/test the function I am working on right now and I have no idea on which sequence I should run it so I will call it on something short, let it be ... name.
You are right, the positional arguments may be called by their name and keyword arguments, which may be problematic. Feel free to correct it according to biopython standard.
| Sillero 3.2 4.0 4.5 9.0 10.0 8.2 10.4 12.0 6.4 | ||
| Rodwell 3.1 3.68 4.25 8.33 10.07 8.0 11.5 11.5 6.0 | ||
| Patrickios 4.2 4.2 4.2 0.0 0.0 11.2 11.2 11.2 0.0 | ||
| Wikipedia 3.65 3.9 4.07 8.18 10.46 8.2 10.54 12.48 6.04 |
There was a problem hiding this comment.
In the above source (ipc2-isoelectrip-point.org), "Wikipedia" links to http://en.wikipedia.org/wiki/List_of_standard_amino_acids, which re-directs to https://en.wikipedia.org/wiki/Proteinogenic_amino_acid.
The latter Wikipedia entry does not contain the numbers given here. Given the fact that Wikipedia may update these numbers with time, I would remove this table.
There was a problem hiding this comment.
It is up to your decision, but this is a tricky situation. At some point (I think it was around 2015 or so), those pKas were in Wikipedia, and most likely they changed (as anything in Wikipedia), thus you may not find exact numbers like those there anymore. BUT those numbers were hardcoded in bioinformatics software, and those pKa numbers were benchmarked against other pKa sets (proving to be better many times than well-established pKa sets).
We have a similar situation to that with BLOSUM62. Those BLOCKs can not be remade with the current databases and even more, the original calculation of BLOSUM62 was wrong (there were bugs in the code calculating matrices), but the original, wrongly calculated BLOSUM62 was adopted in the field and later even proven to work better than the correctly calculated one, so BLOSUM62 is still used in its initial form.
There was a problem hiding this comment.
So maybe there is a stable reference for exactly these numbers which could serve as a better name?
| correlate with polypeptide compositions. Electrophoresis 1994, 15, 529-539. | ||
|
|
||
| For references of all 18 individual methods, see: | ||
| http://ipc2-isoelectric-point.org/theory.html |
There was a problem hiding this comment.
This one? https://ipc2.mimuw.edu.pl/theory.html
| http://ipc2-isoelectric-point.org/theory.html | ||
|
|
||
| I designed the algorithm according to a note by David L. Tabb, available at: | ||
| http://fields.scripps.edu/DTASelect/20010710-pI-Algorithm.pdf |
There was a problem hiding this comment.
This one does not longer exist. If we update the class we want to remove this reference or find another one describing the algorithm
There was a problem hiding this comment.
https://pubmed.ncbi.nlm.nih.gov/12643522/ - I think we can use this one as a reference; the original file is here
20010710-pI-Algorithm.pdf
Summary
Add support for 18 different pKa scales to
Bio.SeqUtils.IsoelectricPointandBio.SeqUtils.ProtParam, selectable via the newpKa_scaleparameter.Based on the original work by @lukasz-kozlowski in #3819 (stagnant since Dec 2021), rebased onto current master with design improvements to address the review feedback from @MarkusPiotrowski.
Available pKa scales
Bjellqvist, DTASelect, Dawson, EMBOSS, Grimsley, IPC2_peptide, IPC2_protein, IPC_peptide, IPC_protein, Lehninger, Nozaki, Patrickios, Rodwell, Sillero, Solomon, Thurlkill, Toseland, Wikipedia.
Key design decisions (vs PR #3819)
ValueErrorraisedBackward compatibility
The default remains
Bjellqvist— all existing tests pass unchanged (test_ProtParam.py,test_SeqUtils.py). The reviewer's main concern from #3819 is fully addressed.For improved accuracy, users are encouraged to use
IPC2_protein(proteins) orIPC2_peptide(peptides) explicitly, as recommended in Kozlowski 2021 (DOI: 10.1093/nar/gkab295).Verification
Cross-validated against the standalone IPC2 tool (v2.0.1): maximum difference across 72 comparisons (4 sequences × 18 scales) is 0.007, well within IPC2's own precision of 0.01. Our Biopython implementation uses finer bisection tolerance (0.0001 vs IPC2's 0.01).
Usage
Note on Decimal precision
Decimalis not needed for this use case. The Henderson-Hasselbalch calculation involves10 ** (pH - pK)with pKa values known to 2-3 decimal places. Float64 provides ~15 significant digits — far more than the underlying data precision. UsingDecimalwould add ~10-50× slowdown for zero practical benefit.IPC2 currency
As of 2024, newer methods like pKALM (protein language model-based, 2024) and GNN approaches claim to outperform IPC2 on benchmarks. However, IPC2 remains the standard for Henderson-Hasselbalch-based prediction, and the 18 pKa tables integrated here are empirical data — they do not become stale.
Closes #3819
Co-Authored-By: Lukasz P. Kozlowski [email protected]