perf(build): parallelize cython extension compilation #1689
Merged
Default `build_ext`'s `parallel` to `os.cpu_count()` so the per-module `gcc` compile step runs in parallel. Wheel builds for emulated archs (riscv64 + armv7l via QEMU) currently spend ~16 of their ~20 minutes serializing 18 `gcc -O3` invocations; on 4-core runners this should drop the compile phase to ~4-5 minutes.
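A minimal sketch of the default described above, not the PR's actual diff. `ParallelDefaultMixin` and `FakeBuildExt` are hypothetical names introduced for illustration; the stand-in base only mimics the fact that the real command initializes `parallel` to `None`.

```python
import os


class ParallelDefaultMixin:
    """Hypothetical mixin: default ``parallel`` to the CPU count when unset."""

    def finalize_options(self):
        super().finalize_options()
        # Only fill in a default; an explicitly requested job count is kept.
        if self.parallel is None:
            self.parallel = os.cpu_count() or 1  # cpu_count() may return None


class FakeBuildExt:
    """Stand-in for the real build_ext command (assumption, illustration only)."""

    def initialize_options(self):
        self.parallel = None  # mirrors the base class leaving it unset

    def finalize_options(self):
        pass


class BuildExt(ParallelDefaultMixin, FakeBuildExt):
    pass


cmd = BuildExt()
cmd.initialize_options()
cmd.finalize_options()
print(cmd.parallel)  # number of CPUs on this machine
```

In a real `setup.py` the base class would be the `build_ext` command rather than the stand-in, and a job count passed through its `--parallel` / `-j` option would be left untouched by this default.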
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #1689 +/- ##
=======================================
Coverage 99.76% 99.76%
=======================================
Files 33 33
Lines 3410 3410
Branches 464 464
=======================================
Hits 3402 3402
Misses 5 5
Partials 3 3
This was referenced May 16, 2026
Summary
Default `BuildExt.parallel` to `os.cpu_count()` so per-module `gcc` compilation runs in parallel instead of serially. Targets the wheel-build pipeline, where the QEMU-emulated jobs (riscv64 + armv7l) are the long pole.

Details
Profiling `Wheels for ubuntu-latest (manylinux) riscv64 cp313` (20m20s total):

- `gcc -O3` compiles under QEMU: ~16 min (each 12–98s)
- `build_ext.py:48` never set `self.parallel`, so distutils' `build_extensions` ran each compile one at a time, leaving 3 of the 4 runner cores idle the entire time.

Local sanity check on macOS arm64: full Cython wheel went from serial to ~270% CPU, 10s wall / 23s user, confirming `ThreadPoolExecutor` is dispatching parallel `gcc` invocations. `SKIP_CYTHON=1` path still builds.

Expected CI impact: 4-core ubuntu-latest runners should cut the QEMU compile phase to ~4–5 min, taking those 26 wheel jobs from ~20 min to ~8 min each. Native runners benefit too, smaller win since they're already fast.
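The expected-impact figures can be reproduced with a back-of-the-envelope estimate. This is an idealized sketch: `estimate_wall_minutes` is a hypothetical helper, and it assumes only the compile phase speeds up, that work divides evenly across cores, and that the slowest single compile is a floor. Scheduling overhead and QEMU effects are not modeled.

```python
def estimate_wall_minutes(total_min, compile_min, workers, longest_job_min=0.0):
    """Idealized wall time when only the compile phase is parallelized."""
    other = total_min - compile_min  # non-compile work stays serial
    # Perfect division across workers, but never faster than the slowest job.
    parallel_compile = max(compile_min / workers, longest_job_min)
    return other + parallel_compile


# Profiling figures: ~20 min total, ~16 min compiling, slowest compile 98 s.
estimate = estimate_wall_minutes(20, 16, workers=4, longest_job_min=98 / 60)
print(estimate)  # 8.0
```

The result lines up with the ~8 min per job quoted above; the real number will depend on how evenly the 12–98s compiles pack onto the four cores.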
The `# type: ignore[has-type]` is because `distutils.command.build_ext` ships no type stubs and mypy can't see that `self.parallel` is initialized to `None` by the base class.

Test plan

- `REQUIRE_CYTHON=1 pip wheel --no-deps -w wh .`: wheel builds, CPU saturation visible
- `SKIP_CYTHON=1 pip wheel --no-deps -w wh .`: pure-Python fallback still works
- `gcc` invocations and reduced wall time
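To see the kind of effect the test plan looks for (concurrent invocations, lower wall time) without a full wheel build, the dispatch pattern can be imitated with a thread pool. This is only a sketch: `compile_module` is a hypothetical stand-in that sleeps instead of running `gcc`, and the module names and durations are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compile_module(name, seconds):
    """Stand-in for one compiler invocation; sleeping does not hold the GIL."""
    time.sleep(seconds)
    return name


def build_all(jobs, workers):
    """Run (name, seconds) jobs on a thread pool; return names in submission order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(compile_module, name, secs) for name, secs in jobs]
        return [future.result() for future in futures]


jobs = [(f"mod{i}", 0.05) for i in range(8)]  # hypothetical modules and durations

start = time.perf_counter()
built = build_all(jobs, workers=4)
elapsed = time.perf_counter() - start
print(built, f"{elapsed:.2f}s with 4 workers")
```

Running the same jobs with `workers=1` gives a serial baseline to compare against; threads are enough here because a real compile spends its time in a child process rather than in Python.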