Add 18 pKa scales for isoelectric point calculation by mmokrejs · Pull Request #5200 · biopython/biopython

mmokrejs · 2026-04-17T21:51:17Z

Summary

Add support for 18 different pKa scales to Bio.SeqUtils.IsoelectricPoint and Bio.SeqUtils.ProtParam, selectable via the new pKa_scale parameter.

Based on the original work by @lukasz-kozlowski in #3819 (stagnant since Dec 2021), rebased onto current master with design improvements to address the review feedback from @MarkusPiotrowski.

Available pKa scales

Bjellqvist, DTASelect, Dawson, EMBOSS, Grimsley, IPC2_peptide, IPC2_protein, IPC_peptide, IPC_protein, Lehninger, Nozaki, Patrickios, Rodwell, Sillero, Solomon, Thurlkill, Toseland, Wikipedia.

Key design decisions (vs PR #3819)

Aspect	PR #3819	This PR
Default pKa scale	IPC2_protein (breaking)	Bjellqvist (backward compat)
Invalid scale name	Silent fallback to IPC2_protein	`ValueError` raised
Bisection tolerance	0.001 (relaxed)	0.0001 (unchanged)
Search range	[1.0, 14.0]	[1.0, 14.0] (widened from [4.05, 12])
Test coverage	66.7% (codecov)	56 new tests, 9 test classes

Backward compatibility

The default remains Bjellqvist — all existing tests pass unchanged (test_ProtParam.py, test_SeqUtils.py). The reviewer's main concern from #3819 is fully addressed.

For improved accuracy, users are encouraged to use IPC2_protein (proteins) or IPC2_peptide (peptides) explicitly, as recommended in Kozlowski 2021 (DOI: 10.1093/nar/gkab295).

Verification

Cross-validated against the standalone IPC2 tool (v2.0.1): maximum difference across 72 comparisons (4 sequences × 18 scales) is 0.007, well within IPC2's own precision of 0.01. Our Biopython implementation uses finer bisection tolerance (0.0001 vs IPC2's 0.01).

Usage

from Bio.SeqUtils.IsoelectricPoint import IsoelectricPoint as IP

# Default (Bjellqvist, backward compatible)
protein = IP("INGAR")
print(protein.pi())  # 9.75

# Recommended for proteins
protein = IP("ADDKNPLEECFRETDYEEFLEIARNGLKATSNPKRVV", "IPC2_protein")
print(protein.pi())  # 4.83

# Recommended for peptides
peptide = IP("EFYTVDGQK", "IPC2_peptide")
print(peptide.pi())  # 4.28

# Via ProteinAnalysis
from Bio.SeqUtils.ProtParam import ProteinAnalysis as PA
analysis = PA("MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRD...")
print(analysis.isoelectric_point("IPC2_protein"))
print(analysis.charge_at_pH(7.4, "IPC2_protein"))

Note on Decimal precision

Decimal is not needed for this use case. The Henderson-Hasselbalch calculation involves 10 ** (pH - pK) with pKa values known to 2-3 decimal places. Float64 provides ~15 significant digits — far more than the underlying data precision. Using Decimal would add ~10-50× slowdown for zero practical benefit.

IPC2 currency

As of 2024, newer methods like pKALM (protein language model-based, 2024) and GNN approaches claim to outperform IPC2 on benchmarks. However, IPC2 remains the standard for Henderson-Hasselbalch-based prediction, and the 18 pKa tables integrated here are empirical data — they do not become stale.

Closes #3819

Co-Authored-By: Lukasz P. Kozlowski [email protected]

Add support for 18 different pKa scales to Bio.SeqUtils.IsoelectricPoint and Bio.SeqUtils.ProtParam, selectable via the pKa_scale parameter. Available scales: Bjellqvist, DTASelect, Dawson, EMBOSS, Grimsley, IPC2_peptide, IPC2_protein, IPC_peptide, IPC_protein, Lehninger, Nozaki, Patrickios, Rodwell, Sillero, Solomon, Thurlkill, Toseland, Wikipedia. The default remains Bjellqvist for backward compatibility. All existing tests pass unchanged. For improved accuracy, IPC2_protein (proteins) or IPC2_peptide (peptides) are recommended (Kozlowski 2021, DOI: 10.1093/nar/gkab295). Key changes vs the original PR biopython#3819: - Keep Bjellqvist as default (not IPC2_protein) to avoid breaking changes - Raise ValueError for invalid scale names (not silent fallback) - Keep bisection tolerance at 0.0001 (not relaxed to 0.001) - Widen search range to [1.0, 14.0] for all pKa scales - Add comprehensive test suite (56 tests across 9 test classes) Cross-validated against the standalone IPC2 tool (ipc-2.0.1) from https://ipc2.mimuw.edu.pl/ -- max difference 0.007 across 72 comparisons (well within IPC2's own 0.01 precision). Co-authored-by: Claude Opus 4.6 <[email protected]>

peterjc · 2026-04-18T20:31:59Z

Was this PR or its description written with AI tools?

mmokrejs · 2026-04-18T21:24:35Z

Was this PR or its description written with AI tools?

As I wrote, it picked up the original PR code which had 76% code coverage, which I got tested by Opus 4.6 and added the remaining tests for missing areas, added documentation, etc. The git log confirms the original author is still as the author, I do not need to have my name there at all. Maybe the initial comment along this thread could be a bit improved. Sorry for the many amends and force-pushes, took me ages to realzie you rely on black-4.10.x, somewhat when reading the CONTRIBUTIONS.rst I overlooked that and the AI student as well. I took care more of the actual git commit descriptions and attributions in there.

lukasz-kozlowski · 2026-04-18T21:24:37Z

Was this PR or its description written with AI tools?

Neither, some people could call it an upgrade/improvement.

BTW, I made this change 5(!) years ago - this was even before ChatGPT existed. Best wishes, Lukasz

mmokrejs · 2026-04-18T21:30:24Z

Was this PR or its description written with AI tools?

Neither, some people could call it an upgrade/improvement.

BTW, I made this change 5(!) years ago - this was even before ChatGPT existed. Best wishes, Lukasz

@lukasz-kozlowski Feel free to pick it up back and re-submit it under your name, as usual. I keep my fingers crossed. Our lab might soon use the pKa functions for our own research, as well as our mass spec facility.

MarkusPiotrowski · 2026-04-20T07:41:07Z

    """

-    def __init__(self, protein_sequence, aa_content=None):
+    def __init__(self, aa_sequence, pKa_scale=None, aa_content=None):


Since positional arguments may be called by their name and keyword arguments do not need to called by their name, this may break existing scripts, e.g.

>>> protein = PA(protein_sequence='"PETER") >>> protein = PA("PETER", my_aa_content)

In general, please stop using silly toy sequences (e.g. "PETER"), even if they are perfectly valid in respect of the amino acid alphabet, they harm the benchmarking methods in real situations.

For benchmarks, you should use classic protein sequences for which we know almost every biochemical aspect, including the structures, and use them to test in folding, MD simulations, etc. For instance:

Lysozyme 1LYZ -
KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL

Villin 1YU5 - LSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF

Ubiquitin 1UBQ - MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
etc.

Frequently, many methods will fail or make dull predictions if you feed them with nonsense sequences GIGO

You didn't get my point. You changed the signature of the class in a way that will break existing scripts if they used the class as I have exemplified, "toy sequences" or not.

Sorry for the confusion, this was a side comment. A few moments ago we had a comment that there is little difference between methods, and the problem is that biopython for brevity uses those toy sequences, aka I want to run/test the function I am working on right now and I have no idea on which sequence I should run it so I will call it on something short, let it be ... name.

You are right, the positional arguments may be called by their name and keyword arguments, which may be problematic. Feel free to correct it according to biopython standard.

MarkusPiotrowski · 2026-04-20T08:06:25Z

+    Sillero      3.2   4.0   4.5   9.0   10.0    8.2   10.4   12.0   6.4
+    Rodwell      3.1   3.68  4.25  8.33  10.07   8.0   11.5   11.5   6.0
+    Patrickios   4.2   4.2   4.2   0.0    0.0   11.2   11.2   11.2   0.0
+    Wikipedia    3.65  3.9   4.07  8.18  10.46   8.2   10.54  12.48  6.04


In the above source (ipc2-isoelectrip-point.org), "Wikipedia" links to http://en.wikipedia.org/wiki/List_of_standard_amino_acids, which re-directs to https://en.wikipedia.org/wiki/Proteinogenic_amino_acid.

The latter Wikipedia entry does not contain the numbers given here. Given the fact that Wikipedia may update these numbers with time, I would remove this table.

It is up to your decision, but this is a tricky situation. At some point (I think it was around 2015 or so), those pKas were in Wikipedia, and most likely they changed (as anything in Wikipedia), thus you may not find exact numbers like those there anymore. BUT those numbers were hardcoded in bioinformatics software, and those pKa numbers were benchmarked against other pKa sets (proving to be better many times than well-established pKa sets).

We have a similar situation to that with BLOSUM62. Those BLOCKs can not be remade with the current databases and even more, the original calculation of BLOSUM62 was wrong (there were bugs in the code calculating matrices), but the original, wrongly calculated BLOSUM62 was adopted in the field and later even proven to work better than the correctly calculated one, so BLOSUM62 is still used in its initial form.

So maybe there is a stable reference for exactly these numbers which could serve as a better name?

MarkusPiotrowski · 2026-04-20T08:12:02Z

+      correlate with polypeptide compositions. Electrophoresis 1994, 15, 529-539.
+
+For references of all 18 individual methods, see:
+http://ipc2-isoelectric-point.org/theory.html


This one? https://ipc2.mimuw.edu.pl/theory.html

MarkusPiotrowski · 2026-04-20T08:24:15Z

+http://ipc2-isoelectric-point.org/theory.html

 I designed the algorithm according to a note by David L. Tabb, available at:
 http://fields.scripps.edu/DTASelect/20010710-pI-Algorithm.pdf


This one does not longer exist. If we update the class we want to remove this reference or find another one describing the algorithm

https://pubmed.ncbi.nlm.nih.gov/12643522/ - I think we can use this one as a reference; the original file is here
20010710-pI-Algorithm.pdf

mmokrejs mentioned this pull request Apr 17, 2026

Adding ability to calculate the isoelectric point by 18 methods. #3819

Open

mmokrejs force-pushed the isoelectric-point-methods branch from 682a7e8 to ed30fe6 Compare April 17, 2026 22:17

mmokrejs mentioned this pull request Apr 17, 2026

SeqAnalysis module for proteins data analysis and visualization #3124

Open

3 tasks

MarkusPiotrowski reviewed Apr 20, 2026

View reviewed changes

Conversation

mmokrejs commented Apr 17, 2026

Summary

Available pKa scales

Key design decisions (vs PR #3819)

Backward compatibility

Verification

Usage

Note on Decimal precision

IPC2 currency

Uh oh!

peterjc commented Apr 18, 2026

Uh oh!

mmokrejs commented Apr 18, 2026

Uh oh!

lukasz-kozlowski commented Apr 18, 2026

Uh oh!

mmokrejs commented Apr 18, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lukasz-kozlowski Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lukasz-kozlowski Apr 20, 2026 •

edited

Loading