Skip to content

Add 18 pKa scales for isoelectric point calculation#5200

Open
mmokrejs wants to merge 1 commit into
biopython:masterfrom
mmokrejs:isoelectric-point-methods
Open

Add 18 pKa scales for isoelectric point calculation#5200
mmokrejs wants to merge 1 commit into
biopython:masterfrom
mmokrejs:isoelectric-point-methods

Conversation

@mmokrejs
Copy link
Copy Markdown

Summary

Add support for 18 different pKa scales to Bio.SeqUtils.IsoelectricPoint and Bio.SeqUtils.ProtParam, selectable via the new pKa_scale parameter.

Based on the original work by @lukasz-kozlowski in #3819 (stagnant since Dec 2021), rebased onto current master with design improvements to address the review feedback from @MarkusPiotrowski.

Available pKa scales

Bjellqvist, DTASelect, Dawson, EMBOSS, Grimsley, IPC2_peptide, IPC2_protein, IPC_peptide, IPC_protein, Lehninger, Nozaki, Patrickios, Rodwell, Sillero, Solomon, Thurlkill, Toseland, Wikipedia.

Key design decisions (vs PR #3819)

Aspect PR #3819 This PR
Default pKa scale IPC2_protein (breaking) Bjellqvist (backward compat)
Invalid scale name Silent fallback to IPC2_protein ValueError raised
Bisection tolerance 0.001 (relaxed) 0.0001 (unchanged)
Search range [1.0, 14.0] [1.0, 14.0] (widened from [4.05, 12])
Test coverage 66.7% (codecov) 56 new tests, 9 test classes

Backward compatibility

The default remains Bjellqvistall existing tests pass unchanged (test_ProtParam.py, test_SeqUtils.py). The reviewer's main concern from #3819 is fully addressed.

For improved accuracy, users are encouraged to use IPC2_protein (proteins) or IPC2_peptide (peptides) explicitly, as recommended in Kozlowski 2021 (DOI: 10.1093/nar/gkab295).

Verification

Cross-validated against the standalone IPC2 tool (v2.0.1): maximum difference across 72 comparisons (4 sequences × 18 scales) is 0.007, well within IPC2's own precision of 0.01. Our Biopython implementation uses finer bisection tolerance (0.0001 vs IPC2's 0.01).

Usage

from Bio.SeqUtils.IsoelectricPoint import IsoelectricPoint as IP

# Default (Bjellqvist, backward compatible)
protein = IP("INGAR")
print(protein.pi())  # 9.75

# Recommended for proteins
protein = IP("ADDKNPLEECFRETDYEEFLEIARNGLKATSNPKRVV", "IPC2_protein")
print(protein.pi())  # 4.83

# Recommended for peptides
peptide = IP("EFYTVDGQK", "IPC2_peptide")
print(peptide.pi())  # 4.28

# Via ProteinAnalysis
from Bio.SeqUtils.ProtParam import ProteinAnalysis as PA
analysis = PA("MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRD...")
print(analysis.isoelectric_point("IPC2_protein"))
print(analysis.charge_at_pH(7.4, "IPC2_protein"))

Note on Decimal precision

Decimal is not needed for this use case. The Henderson-Hasselbalch calculation involves 10 ** (pH - pK) with pKa values known to 2-3 decimal places. Float64 provides ~15 significant digits — far more than the underlying data precision. Using Decimal would add ~10-50× slowdown for zero practical benefit.

IPC2 currency

As of 2024, newer methods like pKALM (protein language model-based, 2024) and GNN approaches claim to outperform IPC2 on benchmarks. However, IPC2 remains the standard for Henderson-Hasselbalch-based prediction, and the 18 pKa tables integrated here are empirical data — they do not become stale.

Closes #3819

Co-Authored-By: Lukasz P. Kozlowski [email protected]

Add support for 18 different pKa scales to Bio.SeqUtils.IsoelectricPoint
and Bio.SeqUtils.ProtParam, selectable via the pKa_scale parameter.

Available scales: Bjellqvist, DTASelect, Dawson, EMBOSS, Grimsley,
IPC2_peptide, IPC2_protein, IPC_peptide, IPC_protein, Lehninger,
Nozaki, Patrickios, Rodwell, Sillero, Solomon, Thurlkill, Toseland,
Wikipedia.

The default remains Bjellqvist for backward compatibility. All existing
tests pass unchanged.  For improved accuracy, IPC2_protein (proteins)
or IPC2_peptide (peptides) are recommended (Kozlowski 2021,
DOI: 10.1093/nar/gkab295).

Key changes vs the original PR biopython#3819:
- Keep Bjellqvist as default (not IPC2_protein) to avoid breaking changes
- Raise ValueError for invalid scale names (not silent fallback)
- Keep bisection tolerance at 0.0001 (not relaxed to 0.001)
- Widen search range to [1.0, 14.0] for all pKa scales
- Add comprehensive test suite (56 tests across 9 test classes)

Cross-validated against the standalone IPC2 tool (ipc-2.0.1) from
https://ipc2.mimuw.edu.pl/ -- max difference 0.007 across 72
comparisons (well within IPC2's own 0.01 precision).

Co-authored-by: Claude Opus 4.6 <[email protected]>
@peterjc
Copy link
Copy Markdown
Member

peterjc commented Apr 18, 2026

Was this PR or its description written with AI tools?

@mmokrejs
Copy link
Copy Markdown
Author

Was this PR or its description written with AI tools?

As I wrote, it picked up the original PR code which had 76% code coverage, which I got tested by Opus 4.6 and added the remaining tests for missing areas, added documentation, etc. The git log confirms the original author is still as the author, I do not need to have my name there at all. Maybe the initial comment along this thread could be a bit improved. Sorry for the many amends and force-pushes, took me ages to realzie you rely on black-4.10.x, somewhat when reading the CONTRIBUTIONS.rst I overlooked that and the AI student as well. I took care more of the actual git commit descriptions and attributions in there.

@lukasz-kozlowski
Copy link
Copy Markdown

Was this PR or its description written with AI tools?

Neither, some people could call it an upgrade/improvement.

BTW, I made this change 5(!) years ago - this was even before ChatGPT existed. Best wishes, Lukasz

@mmokrejs
Copy link
Copy Markdown
Author

Was this PR or its description written with AI tools?

Neither, some people could call it an upgrade/improvement.

BTW, I made this change 5(!) years ago - this was even before ChatGPT existed. Best wishes, Lukasz

@lukasz-kozlowski Feel free to pick it up back and re-submit it under your name, as usual. I keep my fingers crossed. Our lab might soon use the pKa functions for our own research, as well as our mass spec facility.

"""

def __init__(self, protein_sequence, aa_content=None):
def __init__(self, aa_sequence, pKa_scale=None, aa_content=None):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since positional arguments may be called by their name and keyword arguments do not need to called by their name, this may break existing scripts, e.g.

>>> protein = PA(protein_sequence='"PETER")
>>> protein = PA("PETER", my_aa_content)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, please stop using silly toy sequences (e.g. "PETER"), even if they are perfectly valid in respect of the amino acid alphabet, they harm the benchmarking methods in real situations.

For benchmarks, you should use classic protein sequences for which we know almost every biochemical aspect, including the structures, and use them to test in folding, MD simulations, etc. For instance:

  1. Lysozyme 1LYZ -
    KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
  2. Villin 1YU5 - LSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF
  3. Ubiquitin 1UBQ - MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG
    etc.

Frequently, many methods will fail or make dull predictions if you feed them with nonsense sequences GIGO

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You didn't get my point. You changed the signature of the class in a way that will break existing scripts if they used the class as I have exemplified, "toy sequences" or not.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion, this was a side comment. A few moments ago we had a comment that there is little difference between methods, and the problem is that biopython for brevity uses those toy sequences, aka I want to run/test the function I am working on right now and I have no idea on which sequence I should run it so I will call it on something short, let it be ... name.

You are right, the positional arguments may be called by their name and keyword arguments, which may be problematic. Feel free to correct it according to biopython standard.

Sillero 3.2 4.0 4.5 9.0 10.0 8.2 10.4 12.0 6.4
Rodwell 3.1 3.68 4.25 8.33 10.07 8.0 11.5 11.5 6.0
Patrickios 4.2 4.2 4.2 0.0 0.0 11.2 11.2 11.2 0.0
Wikipedia 3.65 3.9 4.07 8.18 10.46 8.2 10.54 12.48 6.04
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the above source (ipc2-isoelectrip-point.org), "Wikipedia" links to http://en.wikipedia.org/wiki/List_of_standard_amino_acids, which re-directs to https://en.wikipedia.org/wiki/Proteinogenic_amino_acid.

The latter Wikipedia entry does not contain the numbers given here. Given the fact that Wikipedia may update these numbers with time, I would remove this table.

Copy link
Copy Markdown

@lukasz-kozlowski lukasz-kozlowski Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is up to your decision, but this is a tricky situation. At some point (I think it was around 2015 or so), those pKas were in Wikipedia, and most likely they changed (as anything in Wikipedia), thus you may not find exact numbers like those there anymore. BUT those numbers were hardcoded in bioinformatics software, and those pKa numbers were benchmarked against other pKa sets (proving to be better many times than well-established pKa sets).

We have a similar situation to that with BLOSUM62. Those BLOCKs can not be remade with the current databases and even more, the original calculation of BLOSUM62 was wrong (there were bugs in the code calculating matrices), but the original, wrongly calculated BLOSUM62 was adopted in the field and later even proven to work better than the correctly calculated one, so BLOSUM62 is still used in its initial form.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe there is a stable reference for exactly these numbers which could serve as a better name?

correlate with polypeptide compositions. Electrophoresis 1994, 15, 529-539.

For references of all 18 individual methods, see:
http://ipc2-isoelectric-point.org/theory.html
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

http://ipc2-isoelectric-point.org/theory.html

I designed the algorithm according to a note by David L. Tabb, available at:
http://fields.scripps.edu/DTASelect/20010710-pI-Algorithm.pdf
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one does not longer exist. If we update the class we want to remove this reference or find another one describing the algorithm

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://pubmed.ncbi.nlm.nih.gov/12643522/ - I think we can use this one as a reference; the original file is here
20010710-pI-Algorithm.pdf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants