perf(build): parallelize cython extension compilation #1689

Merged
bdraco merged 1 commit into master from perf-parallel-cython-build on May 16, 2026

@bdraco bdraco commented May 16, 2026

Summary

Default BuildExt.parallel to os.cpu_count() so per-module gcc compilation runs in parallel instead of serially. Targets the wheel-build pipeline, where the QEMU-emulated jobs (riscv64 + armv7l) are the long pole.

Details

Profiling Wheels for ubuntu-latest (manylinux) riscv64 cp313 (20m20s total):

  • Cython codegen under QEMU: ~3 min
  • 18 sequential gcc -O3 compiles under QEMU: ~16 min (each 12–98s)
  • 26 such jobs in the matrix (13× riscv64, 13× armv7l) at 13–24 min apiece

build_ext.py:48 never set self.parallel, so distutils' build_extensions ran each compile one at a time, leaving 3 of the 4 runner cores idle the entire time.

Local sanity check on macOS arm64: the full Cython wheel build went from serial to ~270% CPU (10s wall / 23s user), confirming that ThreadPoolExecutor is dispatching parallel gcc invocations. The SKIP_CYTHON=1 path still builds.
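Threads are enough to parallelize compiles because each worker thread spends its time blocked on a child process, not running Python code. The sketch below illustrates that with stdlib calls only. The sleeping child process is a hypothetical stand-in for a compiler invocation, and `run_one` / `run_all` are illustrative names, not code from this PR.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for one slow compiler invocation: a child that sleeps.
FAKE_COMPILE = [sys.executable, "-c", "import time; time.sleep(0.2)"]


def run_one(_index: int) -> int:
    # The calling thread blocks on the child process, so other worker
    # threads are free to launch and wait on their own children.
    return subprocess.run(FAKE_COMPILE, check=False).returncode


def run_all(jobs: int, workers: int) -> tuple[list[int], float]:
    # Dispatch `jobs` fake compiles across `workers` threads and time the batch.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(run_one, range(jobs)))
    return codes, time.perf_counter() - start


serial_codes, serial_wall = run_all(jobs=4, workers=1)
parallel_codes, parallel_wall = run_all(jobs=4, workers=4)
print(f"serial:   {serial_wall:.2f}s wall")
print(f"parallel: {parallel_wall:.2f}s wall")
```

With a real compiler in place of the sleeping child, the same pattern would show up as user CPU time exceeding wall time, as in the macOS numbers above.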

Expected CI impact: 4-core ubuntu-latest runners should cut the QEMU compile phase to ~4–5 min, taking those 26 wheel jobs from ~20 min to ~8 min each. Native runners benefit too, though the win is smaller since they're already fast.
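As a rough cross-check of the ~4–5 min figure, the compile phase can be bounded from the profile numbers alone (~16 min of total compile time, longest single compile 98s, 4 workers). This sketch assumes the workers pull jobs greedily as they become free and that the 4 cores do not slow each other down under QEMU; `compile_phase_bounds` is an illustrative helper, not part of the PR.

```python
def compile_phase_bounds(total_s: float, longest_s: float, workers: int) -> tuple[float, float]:
    # Lower bound: perfect load balance, but never shorter than the longest job.
    lower = max(total_s / workers, longest_s)
    # Upper bound for greedy list scheduling (Graham's bound):
    # total / m + (1 - 1/m) * longest.
    upper = total_s / workers + (1 - 1 / workers) * longest_s
    return lower, upper


lower, upper = compile_phase_bounds(total_s=16 * 60, longest_s=98, workers=4)
print(f"{lower / 60:.1f} to {upper / 60:.1f} min")  # prints "4.0 to 5.2 min"
```

The range depends only on the totals quoted above, so it holds regardless of the order in which the 18 modules are compiled.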

The # type: ignore[has-type] comment is needed because distutils.command.build_ext ships no type stubs, so mypy can't see that self.parallel is initialized to None by the base class.
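The PR's diff is not shown here, so the following is only a minimal sketch of one way to default `parallel` while still respecting an explicit value. `StandInBuildExt` is a hypothetical stand-in that mimics how the distutils base class treats the option (initialized to `None`, string values converted to `int`), so the example runs without setuptools installed. In a real `build_ext.py` the subclass would derive from the setuptools `build_ext` command instead.

```python
import os


class StandInBuildExt:
    """Hypothetical stand-in mimicking how the base command handles `parallel`."""

    def __init__(self) -> None:
        self.initialize_options()

    def initialize_options(self) -> None:
        self.parallel = None  # base-class default: serial build

    def finalize_options(self) -> None:
        # A value supplied as a string (e.g. from the command line) becomes an int.
        if isinstance(self.parallel, str):
            self.parallel = int(self.parallel)


class BuildExt(StandInBuildExt):
    def finalize_options(self) -> None:
        super().finalize_options()
        # Only fill in a default when nothing else chose a value.
        # os.cpu_count() may return None, which leaves the build serial.
        if self.parallel is None:
            self.parallel = os.cpu_count()


default_cmd = BuildExt()
default_cmd.finalize_options()

explicit_cmd = BuildExt()
explicit_cmd.parallel = "2"  # as if the user had asked for two jobs
explicit_cmd.finalize_options()

print(default_cmd.parallel, explicit_cmd.parallel)
```

Applying the default in `finalize_options` rather than unconditionally keeps an explicit job count working, which may or may not match what the merged change does.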

Test plan

  • REQUIRE_CYTHON=1 pip wheel --no-deps -w wh . — wheel builds, CPU saturation visible
  • SKIP_CYTHON=1 pip wheel --no-deps -w wh . — pure-Python fallback still works
  • pre-commit (ruff, mypy, flake8, codespell, cython-lint) passes
  • CI green across the matrix; eyeball a riscv64 wheel job to confirm parallel gcc invocations and reduced wall time

Default `build_ext`'s `parallel` to `os.cpu_count()` so the per-module
`gcc` compile step runs in parallel. Wheel builds for emulated archs
(riscv64 + armv7l via QEMU) currently spend ~16 of their ~20 minutes
serializing 18 `gcc -O3` invocations; on 4-core runners this should drop
the compile phase to ~4-5 minutes.

codspeed-hq Bot commented May 16, 2026

Merging this PR will not alter performance

✅ 6 untouched benchmarks


Comparing perf-parallel-cython-build (dbba4cf) with master (c96a997)

codecov Bot commented May 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.76%. Comparing base (c96a997) to head (dbba4cf).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1689   +/-   ##
=======================================
  Coverage   99.76%   99.76%           
=======================================
  Files          33       33           
  Lines        3410     3410           
  Branches      464      464           
=======================================
  Hits         3402     3402           
  Misses          5        5           
  Partials        3        3           
