EdgeFirst Hardware Abstraction Layer

The EdgeFirst Hardware Abstraction Layer (HAL) is a Rust workspace that provides hardware-accelerated tensor management, image processing, ML model output decoding, and multi-object tracking for edge AI inference pipelines. It ships as a Rust crate, a Python package, and a C library — same code, three language surfaces — with Linux DMA-BUF, OpenGL ES, and NXP G2D acceleration where the platform supports them, and a portable CPU fallback everywhere else.

Features

Zero-copy memory management — DMA-BUF, POSIX shared memory, OpenGL PBO, and heap with automatic backend selection
Hardware-accelerated image processing — OpenGL → G2D → CPU dispatch with shared cache infrastructure
YOLO + ModelPack decoding — YOLOv5 / v8 / v11 / v26 (incl. end-to-end) and ModelPack post-processing
Multi-object tracking — ByteTrack with Kalman filtering and stable per-track UUIDs
Cross-platform — Linux (i.MX 8M Plus / i.MX 95 / desktop), macOS, with three CPU/GPU/DMA tiers
Production-ready — used in the Au-Zone EdgeFirst suite for edge AI deployments

Quick Start

Installation

Python:

pip install edgefirst-hal

Rust:

[dependencies]
edgefirst-hal = "0.22"

C: download a release archive from GitHub Releases and link against libedgefirst_hal.so (or .a); see crates/capi/README.md for full instructions.

Basic usage

Python:

import edgefirst_hal as ef

img = ef.Tensor.load("image.jpg", ef.PixelFormat.Rgb)
processor = ef.ImageProcessor()
output = processor.create_image(640, 640, ef.PixelFormat.Rgb)
processor.convert(img, output)

decoder = ef.Decoder(config, 0.5, 0.45)
boxes, scores, classes, masks = decoder.decode([output0, output1])

# Fused decode + draw — masks never leave Rust
processor.draw_masks(decoder, [output0, output1], output)

Rust:

The umbrella edgefirst-hal crate re-exports its sub-crates as modules, so a single edgefirst-hal = "0.22" dependency is enough — no need to list edgefirst-image / edgefirst-tensor separately in Cargo.toml.

use edgefirst_hal::image::{load_image, ImageProcessor, ImageProcessorTrait, Rotation, Flip, Crop};
use edgefirst_hal::tensor::{PixelFormat, DType};

let bytes = std::fs::read("image.jpg")?;
let input = load_image(&bytes, Some(PixelFormat::Rgb), None)?;
let mut processor = ImageProcessor::new()?;
let mut output = processor.create_image(640, 640, PixelFormat::Rgb, DType::U8, None)?;
processor.convert(&input, &mut output, Rotation::None, Flip::None, Crop::default())?;

If you prefer to depend on the sub-crates directly (e.g. to opt out of features or to track them at independent versions), add the relevant edgefirst-image, edgefirst-tensor, edgefirst-decoder, and edgefirst-tracker entries to your Cargo.toml and use the unprefixed edgefirst_image::* / edgefirst_tensor::* paths above.

C:

#include <edgefirst/hal.h>

struct hal_image_processor *proc = hal_image_processor_new();
/* `src` is loaded from disk or imported from a DMA-BUF fd —
 * see the C API README for hal_tensor_load_file / hal_import_image. */
struct hal_tensor *src = /* ... */;
struct hal_tensor *dst = hal_image_processor_create_image(
    proc, 640, 640, HAL_PIXEL_FORMAT_RGB, HAL_DTYPE_U8);
hal_image_processor_convert(proc, src, dst, HAL_ROTATION_NONE, HAL_FLIP_NONE, NULL);

Per-language quick-starts and richer examples live in each crate's README: Rust (edgefirst-hal), C API, Python.

System Architecture

graph TB
    subgraph "EdgeFirst HAL Ecosystem"
        Python["Python Bindings (edgefirst-hal)<br/>PyO3"]
        CAPI["C API (edgefirst-hal-capi)<br/>cbindgen"]
        Main["Umbrella crate (edgefirst-hal)<br/>Re-exports"]

        Python --> Main
        CAPI --> Main

        Tensor["edgefirst-tensor<br/>Zero-copy buffers"]
        Codec["edgefirst-codec<br/>Image decode"]
        Image["edgefirst-image<br/>Format conv + draw"]
        Decoder["edgefirst-decoder<br/>Model output decode"]
        Tracker["edgefirst-tracker<br/>ByteTrack"]

        Main --> Tensor
        Main --> Codec
        Main --> Image
        Main --> Decoder
        Main -.->|tracker feature| Tracker
        CAPI --> Tracker

        Codec --> Tensor
        Image --> Tensor
        Image --> Decoder
        Image -.optional.-> G2D["g2d-sys<br/>NXP i.MX"]
    end

    Tensor -.-> DMA["Linux DMA-Heap<br/>Shared Memory"]
    Decoder -.-> PostProc["Model Output<br/>Post-Processing"]

    style Python fill:#e1f5ff
    style CAPI fill:#e1f5ff
    style Main fill:#fff4e1
    style Tensor fill:#e8f5e9
    style Codec fill:#e8f5e9
    style Image fill:#e8f5e9
    style Decoder fill:#e8f5e9
    style Tracker fill:#e8f5e9

Core Components

Crate	Role	Architecture	Testing
`edgefirst-tensor`	Zero-copy multi-dim buffers (DMA / SHM / Mem / PBO)	ARCH	TEST
`edgefirst-codec`	JPEG/PNG decode into pre-allocated tensors (strided, multi-dtype)	ARCH	TEST
`edgefirst-image`	OpenGL / G2D / CPU image processor + mask rendering	ARCH	TEST
`edgefirst-decoder`	YOLO + ModelPack post-processing, NMS, proto-mask APIs	ARCH	TEST
`edgefirst-tracker`	ByteTrack multi-object tracking	ARCH	TEST
`edgefirst-hal`	Umbrella + tracing subscriber	ARCH	TEST
`edgefirst-hal-capi`	C ABI + Delegate DMA-BUF framework	ARCH	TEST
`crates/python/` (PyPI: `edgefirst-hal`)	PyO3 bindings, numpy buffer protocol	ARCH	TEST

The deep dive on each component (class diagrams, supported operations, backend dispatch, performance considerations) lives in the per-crate ARCHITECTURE.md. The cross-cutting story (DMA-BUF identity, performance tracing internals, design patterns) lives in the project ARCHITECTURE.md.

Optimization Guide

This section is the rules part of the cross-language performance contract. Each rule has a measurable cost when broken; see BENCHMARKS.md for empirical penalties per platform, ARCHITECTURE.md for why the rule exists, and TESTING.md for how to verify your integration follows it.

Rule	Why it matters	Measured penalty when broken
Reuse tensors across frames	Each new tensor mints a fresh `BufferIdentity`; the EGL image cache misses every frame	1.7–3.3× slower preprocessing on Vivante / Mali
Allocate via `ImageProcessor::create_image()`	Auto-selects DMA-buf / PBO / heap based on the active GPU; bypassing forces a slow transfer path	Forced `glTexSubImage2D` upload or full CPU readback
Cache imported camera tensors by inode, not by fd	V4L2 / libcamera recycle fd numbers across a small buffer pool; an fd-keyed cache misses on every frame even when the physical buffer is the same	Full EGL re-import per frame (≈0.5–1.5 ms on Vivante, doubled with chroma planes)
Build `Decoder` once, decode many	Decoder construction parses model metadata and allocates working buffers	Parse + alloc cost per frame
One `ImageProcessor` per pipeline	Each instance owns its own GL context, EGL display, and per-thread caches	Multiple GL contexts contend on the global `GL_MUTEX`
Use native fp16 / AVX build overrides only on supporting CPUs	These flags unlock native widening / vector paths for local perf testing	Unsupported targets may SIGILL or fail to build; portability loss
Pass numpy arrays straight to `Tensor.from_numpy()` — do not pre-`ascontiguousarray()`	HAL detects strided sources and materializes via numpy's vectorized C strided→contig pass; a manual workaround above HAL adds a redundant copy	Redundant pre-copy on every call (≈ 1.5 ms on a `(1, 116, 8400)` f32 view, rpi5-hailo)
For COCO/IoU evaluation use `MaskResolution::Scaled(orig_w, orig_h)`, not `Proto`	`Scaled` upsamples the proto plane before thresholding (clean sub-pixel edges); `Proto` thresholds at proto resolution and callers typically nearest-upsample (blocky)	Mask mAP regression of up to 0.04–0.05 absolute when `Proto` is nearest-upsampled

Important

The single most common performance bug is calling Tensor::from_fd() (or import_image()) on every frame from a V4L2 / libcamera buffer pool. The HAL's internal EGL image cache cannot rescue you — the cache key includes a per-tensor monotonic ID that is fresh on every import. The fix lives in the calling code, not in HAL.

Rule 1 — Reuse tensors across frames

Allocate input and output tensors once at pipeline startup; reuse the same objects on every frame. The DMA memory backing a tensor is live: when an upstream producer (V4L2 DQBUF, codec output, ISP) writes new pixels into it, the existing tensor and its cached EGLImage remain valid. No re-import, no re-allocation.

let mut proc = ImageProcessor::new()?;
let mut dst = proc.create_image(640, 640, PixelFormat::Rgb, DType::U8, None)?;

for frame in camera_frames {
    proc.convert(&frame, &mut dst, Rotation::None, Flip::None, Crop::default())?;
    run_inference(&dst)?;
}

proc = ef.ImageProcessor()
dst = proc.create_image(640, 640, ef.PixelFormat.Rgb)
for frame in camera_frames:
    proc.convert(frame, dst)
    run_inference(dst)

Rule 2 — Allocate via `ImageProcessor::create_image()`

create_image() selects the fastest memory backend for the active GPU at construction time:

Priority	Backend	Transfer	Platforms
1st	DMA-buf	Zero-copy EGLImage import	NXP i.MX 8M Plus, i.MX 95
2nd	PBO	Zero-copy GL buffer binding	NVIDIA desktop
3rd	Mem (heap)	CPU memcpy fallback	All platforms

The probe runs once at ImageProcessor::new() time. All subsequent create_image() calls reuse the same backend. Use create_image() for every destination passed to convert(); direct Tensor::new(memory=...) bypasses the probe.

For DMA-buf access, the process needs /dev/dma_heap/{linux,cma|system} and a DRM render/card node — the GL backend probes /dev/dri/renderD128, then /dev/dri/card0, then /dev/dri/card1 and uses the first one that opens. On embedded Linux, add the user to video and render groups, or set udev rules. If DMA-buf fails, create_image() transparently falls back to PBO or heap.

Rule 3 — Cache imported camera tensors by inode, not by fd

V4L2, libcamera, and codec output all surface frames as DMA-BUF file descriptors drawn from a small fixed pool (typically 4–16 buffers). The fd number is recycled: the same fd can refer to a different physical buffer between frames, and the same physical buffer can be exported with a different fd over time. A cache keyed by fd will produce false hits or false misses.

The kernel assigns each dma_buf object a unique inode in the anonymous inode filesystem. The inode is constant for the buffer's lifetime regardless of how many times it is exported. Cache imported HAL tensors by (inode, plane_offset):

#include <sys/stat.h>

typedef struct { ino_t inode; size_t offset; } BufferKey;

struct stat st;
if (fstat(fd, &st) != 0) continue;
BufferKey key = { .inode = st.st_ino, .offset = plane_offset };

struct hal_tensor *tensor = lookup_tensor(cache, &key);
if (!tensor) {
    struct hal_plane_descriptor *pd = hal_plane_descriptor_new(fd);
    if (!pd) { perror("hal_plane_descriptor_new"); continue; }
    tensor = hal_import_image(proc, pd, NULL, w, h,
                              HAL_PIXEL_FORMAT_NV12, HAL_DTYPE_U8);
    // pd is consumed by hal_import_image (success or failure)
    if (!tensor) { perror("hal_import_image"); continue; }
    insert_tensor(cache, &key, tensor);
}
hal_image_processor_convert(proc, tensor, dst, /* ... */);

import os
buffer_cache: dict[tuple[int, int], ef.Tensor] = {}

def get_or_import(proc, fd, offset, width, height, fmt):
    key = (os.fstat(fd).st_ino, offset)
    t = buffer_cache.get(key)
    if t is None:
        t = proc.import_image(fd, width, height, fmt, "uint8", offset=offset)
        buffer_cache[key] = t
    return t

EdgeFirst's GStreamer elements implement this as a reference. For other pipelines (libcamera direct, custom V4L2, RTSP decoder) you are responsible for the equivalent layer above HAL. See ARCHITECTURE.md § Appendix C for the full identity-and-caching story.

Rule 4 — Build the decoder once

Decoder parses the model output schema, resolves quantization, and allocates working buffers at construction time. Build it once outside the loop; the decoder clears its output vectors per call:

let decoder = DecoderBuilder::default()
    .with_config_yaml_str(config_yaml)
    .with_score_threshold(0.5)
    .with_iou_threshold(0.45)
    .build()?;

for frame in frames {
    let outputs = run_inference(frame)?;
    let refs: Vec<&TensorDyn> = outputs.iter().collect();
    decoder.decode(&refs, &mut boxes, &mut masks)?;
}

The same applies to ByteTrack: construct once, call update() per frame.

Rule 5 — One `ImageProcessor` per pipeline

ImageProcessor owns its OpenGL context, dedicated GL thread, and EGL image cache. The EGL display itself is process-global (a shared SharedEglDisplay initialized once and never terminated), so additional processors don't pay the display-creation cost — but each one still creates a fresh context and per-instance caches, and all GL operations across every processor serialize on a global GL_MUTEX. Construct one per pipeline (or one per worker thread for parallel pipelines) and share it across all convert(), draw_*(), and create_image() calls.

ImageProcessor is Send + Sync, so it can be moved or shared across threads. Concurrent use of a single shared instance still serializes on GL_MUTEX; per-worker ownership gives more predictable cache behavior.

Rule 6 — Local fp16 / AVX build overrides

The default HAL binary is built to the target triple's guaranteed baseline ISA so a single distributed binary runs on every CPU within that triple. Richer ISAs (ARMv8.2-FP16, x86_64 F16C / FMA / AVX2) are not enabled by default; until HAL gains runtime CPU-feature detection with dynamic dispatch, baking them in would SIGILL on older CPUs.

For local benchmarking on supporting hosts, enable them via RUSTFLAGS:

# Orin Nano (Cortex-A78AE) — exclude the PyO3 binding (cross-Python toolchain not configured)
RUSTFLAGS="-C target-cpu=cortex-a78ae" cargo build --release \
  --target aarch64-unknown-linux-gnu --workspace --exclude edgefirst_hal

# Generic aarch64 with FEAT_FP16 (do NOT use on Cortex-A53 / imx8mp)
RUSTFLAGS="-C target-feature=+fp16" cargo build --release \
  --target aarch64-unknown-linux-gnu -p edgefirst-image

# x86_64 Haswell+ (F16C + FMA + AVX2)
RUSTFLAGS="-C target-feature=+f16c,+fma,+avx2" cargo build --release \
  -p edgefirst-image

When active, the f16 mask kernel at crates/image/src/cpu/masks.rs compiles to native widening (fcvt on aarch64, vcvtph2ps on x86_64), and on x86_64 with +f16c,+fma an explicit 8-lane _mm256_cvtph_ps + _mm256_fmadd_ps intrinsic path is enabled via cfg gate. Verify with scripts/audit_f16_codegen.sh.

Rule 7 — NumPy interop: pass arrays straight to `from_numpy()`

Tensor.from_numpy() (and the implicit copy from numpy arrays passed to Decoder.decode_proto()) handles strided / non-contiguous sources internally. Do not maintain a manual np.ascontiguousarray() workaround — it wastes a copy.

The Python binding's copy_numpy_to_tensor_dyn selects one of three paths based on the source array's layout:

Source layout	Path	Cost
Fully contiguous	Single `copy_from_slice` (memcpy), rayon-parallel ≥ 256 KiB	Lower bound
Strided with contiguous inner rows (column slice, sub-volume, negative stride)	Per-row memcpy iterating outer dimensions	≈ same as contiguous
Fully strided (transposed view, every-other-element)	Internal `np.ascontiguousarray()` materialisation, then Path 1 memcpy	≈ 4× contiguous

The fully-strided case is the one that bites users in practice: HailoRT's natural output is arr.transpose(0, 2, 1) over a (1, anchors, channels) buffer. PR #58 replaced the legacy element-wise loop with internal np.ascontiguousarray materialization (≈ 4× faster than the legacy loop, within ≈ 1.5× of the manual workaround).

# Wrong (post-PR #58): adds an extra copy above HAL.
tensor.from_numpy(np.ascontiguousarray(arr_strided))

# Right: HAL detects the strided layout and materializes internally.
tensor.from_numpy(arr_strided)

The regression tests in tests/test_tensor.py (test_from_numpy_hailort_shape, test_from_numpy_hailort_shape_perf_sanity) pin the behaviour and the ≤ 1.5× perf bound.

Rule 8 — Choose the correct `MaskResolution`

ImageProcessor.materialize_masks() accepts a MaskResolution parameter:

Mode	Output	Pipeline	When to use
`MaskResolution::Proto` (default)	`(roi_h, roi_w, 1)` u8 binary at 160×160 proto resolution	dot → sign threshold → emit	Real-time visualisation, when proto-resolution binary suffices
`MaskResolution::Scaled { width, height }`	`(roi_h, roi_w, 1)` u8 binary at requested resolution	dot → sigmoid → upsample to `(W, H)` → threshold (`>127`)	All COCO / IoU / mAP evaluation

import edgefirst_hal as hal

# Wrong: threshold then upsample → blocky edges, mAP regression.
tiles = proc.materialize_masks(boxes, scores, classes, proto_data, letterbox=lb)
for tile, box in zip(tiles, boxes):
    binary = (tile[:, :, 0] > 127).astype(np.uint8)
    canvas[y:y+h, x:x+w] = cv2.resize(binary, (W, H), cv2.INTER_NEAREST)

# Right: HAL upsamples-then-thresholds inside its batched-GEMM kernel.
tiles = proc.materialize_masks(boxes, scores, classes, proto_data,
                               letterbox=lb,
                               resolution=hal.MaskResolution.Scaled(W, H))
for tile, box in zip(tiles, boxes):
    canvas[y:y+h, x:x+w] = (tile[:, :, 0] > 127).astype(np.uint8)

The Scaled path uses the batched-GEMM materializer (PR #54). At N ≥ 16 detections it amortizes a single GEMM at proto resolution and upsamples per-detection in rayon-parallel — both more accurate than threshold-then-resize and faster than per-detection scalar work in caller code.

Tip

If you see a mask-mAP gap between your HAL validator and a reference (ONNX / numpy) implementation, this rule is almost always the first thing to check.

Where to go next

Document	Level	Use it for
ARCHITECTURE.md § Appendix C: DMA-BUF Identity and Tensor Caching	Architecture	Why the rules exist: `BufferIdentity`, EGL image cache, the v4l2 / GStreamer fd-recycling story, and the inode-keyed downstream cache pattern
image/ARCHITECTURE.md § Performance Considerations	Architecture	GL serialization (`GL_MUTEX`), backend dispatch, per-instance caches
TESTING.md § Validating Optimizations	Testing	Confirming your integration follows the rules
BENCHMARKS.md	Benchmarks	Empirical cost of breaking each rule, per platform

Platform Support

Feature	Linux (i.MX)	Linux (other)	macOS	Windows
DMA tensors	Yes	Yes	No	No
PBO tensors (GPU)	Yes	Yes	No	No
IOSurface tensors (zero-copy)	No	No	Yes (with ANGLE)	No
Shared memory tensors	Yes	Yes	Yes	No
Heap tensors	Yes	Yes	Yes	Yes
G2D acceleration	Yes	No	No	No
OpenGL acceleration	Yes (optional)	Yes (optional)	Yes (with ANGLE)	No
CPU fallback	Yes	Yes	Yes	Yes

On macOS the OpenGL backend is enabled when ANGLE is installed — see macOS GPU Acceleration below for setup. If ANGLE is not present the HAL falls back to the CPU backend.

macOS GPU Acceleration

The HAL uses Google's ANGLE to translate the same OpenGL ES 3.0 calls used on Linux to Metal, and Apple's IOSurface for zero-copy buffer interchange (the role DMA-BUF plays on Linux). ANGLE is not part of macOS and must be installed separately. If it is not present at runtime the HAL logs a warning and falls back to the CPU backend.

Source / Cargo installs

Install ANGLE via the third-party Homebrew tap:

brew install startergo/angle/angle

Then re-sign the installed dylibs. Homebrew's install_name_tool step invalidates the bundled code signatures and macOS 26 (Tahoe) refuses to load dylibs with broken signatures at dlopen time, which manifests as an immediate SIGKILL (Code Signature Invalid) with no stdout. The canonical workaround is an ad-hoc re-sign:

codesign --force --sign - $(brew --prefix)/opt/angle/lib/libEGL.dylib
codesign --force --sign - $(brew --prefix)/opt/angle/lib/libGLESv2.dylib

This is a one-time step per ANGLE install; the next brew upgrade angle needs the re-sign again. See Homebrew/brew#19144 for the upstream tracking issue.

The HAL locates libEGL.dylib through the standard dyld search path. On Apple Silicon, /opt/homebrew/lib is on the default search path; on Intel Macs /usr/local/lib is.

Verifying the GPU backend is active

RUST_LOG=edgefirst_image=debug cargo run --release --example pipeline_demo

Look for ANGLE (Apple, ANGLE Metal Renderer: ...) in the bring-up log. If ANGLE is missing or signatures are still broken you will see a warning and the CPU backend is selected.

Custom ANGLE locations

If your ANGLE install is not in /opt/homebrew/opt/angle/lib (or /usr/local/opt/angle/lib on Intel Macs), set EDGEFIRST_ANGLE_PATH to the directory containing libEGL.dylib and libGLESv2.dylib:

EDGEFIRST_ANGLE_PATH=/path/to/angle/lib cargo run --release ...

The lookup order is: EDGEFIRST_ANGLE_PATH → Homebrew → @loader_path (alongside the binary) → @executable_path → unqualified libEGL.dylib on the dyld search path. For bundled distributions, drop the re-signed ANGLE dylibs next to the executable (or into <App>.app/Contents/Frameworks/) and no env var is needed.

When you don't need this setup

pip install edgefirst-hal — the macOS wheel ships ANGLE bundled alongside the Python extension; no separate install required.
EdgeFirst-signed binary distribution — official binary releases bundle ANGLE re-signed under the EdgeFirst Apple Developer ID. Install and run with no additional setup.

These channels exist precisely so end users do not need to deal with the Homebrew install or re-signing step.

Build System

The workspace builds with standard cargo. The Makefile wraps the common workflows (make test, make bench, make build, make format lint check) with the right flags and gates.

For Python wheels, see crates/python/README.md and crates/python/TESTING.md. For the C library and consumer linking, see crates/capi/README.md.

Environment Variables

Variable	Description
`EDGEFIRST_TENSOR_FORCE_MEM`	`1` forces heap memory (disables DMA / SHM)
`EDGEFIRST_DISABLE_G2D`	Disable G2D backend
`EDGEFIRST_DISABLE_GL`	Disable OpenGL backend
`EDGEFIRST_DISABLE_CPU`	Disable CPU backend
`EDGEFIRST_FORCE_BACKEND`	Force one backend: `cpu`, `g2d`, or `opengl` (disables fallback)
`EDGEFIRST_FORCE_TRANSFER`	Force GL transfer: `pbo`, `dmabuf`, or `sync`
`EDGEFIRST_OPENGL_RENDERSURFACE`	`1` enables EGL renderbuffer path for non-`dma_heap` DMA-BUF (i.MX 95 Neutron NPU)
`EDGEFIRST_PROTO_COMPUTE`	`1` enables GLES 3.1 compute shader for HWC→CHW proto repack
`EDGEFIRST_ANGLE_PATH`	macOS only: directory containing `libEGL.dylib` / `libGLESv2.dylib`. Overrides the default search (Homebrew → `@loader_path` → `@executable_path` → `libEGL.dylib` on dyld). Set this when deploying a bundled or custom-signed ANGLE alongside the binary.
`EDGEFIRST_TESTDATA_DIR`	Override testdata location (used by benches and CI)
`RUST_LOG`	Standard `env_logger` filter — `RUST_LOG=edgefirst_image=debug` for backend dispatch + cache stats

Per-crate variables and additional detail live in each crate's README.

Testing

See TESTING.md for the cross-cutting testing guide (single-threaded rule, on-target gating, cross-compilation, CI matrix, optimization validation). Per-crate testing detail lives in each crate's TESTING.md — links in the Core Components table.

Benchmarking

Binary	Crate	What it measures
`tensor_benchmark`	`edgefirst-tensor`	Tensor allocation and map/unmap latency across buffer types
`image_benchmark`	`edgefirst-image`	Crop, flip, rotate, resize, draw
`pipeline_benchmark`	`edgefirst-image`	Letterbox pipeline + format conversion
`decode_pipeline_benchmark`	`edgefirst-image`	JPEG decode → letterbox convert (strided, HWC/CHW)
`mask_benchmark`	`edgefirst-image`	`draw_decoded_masks`, `draw_proto_masks`, hybrid path
`opencv_benchmark`	`edgefirst-image`	OpenCV baseline comparison
`decoder_benchmark`	`edgefirst-decoder`	YOLO post-processing, NMS, dequant
`tracker_benchmark`	`edgefirst-tracker`	ByteTrack throughput vs. simultaneous tracks

Run on host:

cargo bench -p edgefirst-image --bench pipeline_benchmark -- --bench

# Force a backend
EDGEFIRST_FORCE_BACKEND=cpu cargo bench -p edgefirst-image --bench pipeline_benchmark -- --bench

Cross-compile + deploy to a target (SSH hostnames in ~/.ssh/config: imx8mp-frdm, imx95-frdm, rpi5-hailo, jetson-orin-nano, maivin):

cargo-zigbuild zigbuild --target aarch64-unknown-linux-gnu --release \
  -p edgefirst-image --features opengl --bench pipeline_benchmark

scp target/aarch64-unknown-linux-gnu/release/deps/pipeline_benchmark-* imx8mp-frdm:/tmp/
ssh imx8mp-frdm '/tmp/pipeline_benchmark-* --bench --json /tmp/pipeline.json'

All benchmarks accept --bench --json <path> for structured output. Store results under benchmarks/<platform>/<name>.json. Update BENCHMARKS.md via:

python3 .github/scripts/generate_benchmark_tables.py --data-dir benchmarks/

Performance Tracing

The HAL ships with built-in tracing for capturing detailed performance traces across all processing stages. Traces use the Chrome JSON format and view in Perfetto UI.

How it works

Every HAL library crate emits tracing spans on hot paths. These spans have near-zero overhead when no subscriber is active — each site compiles to a single relaxed atomic load. No heap allocations, no string formatting, no function calls on the hot path.

When a session is started via the API, a Chrome JSON subscriber records all span enter/exit events with high-resolution timestamps and structured metadata (detection counts, proto dimensions, format conversions, memory types, etc.) to a file.

Span coverage

The tracing surface covers decode, image conversion, GL multi-pass, mask materialization, tensor lifecycle, tracker association, and the Python entry points. Each span carries structured fields — see the per-crate ARCHITECTURE.md files for the authoritative list of spans and fields per component.

Enabling tracing

Python:

import edgefirst_hal as hal
with hal.Tracing("/tmp/trace.json"):
    # ... run inference pipeline ...
    pass

Rust:

use edgefirst_hal::trace::{start_tracing, stop_tracing};

start_tracing("/tmp/trace.json").expect("start tracing");
// ... inference pipeline ...
stop_tracing(); // flushes and closes the trace file

C:

#include <edgefirst/hal.h>
hal_start_tracing("/tmp/trace.json");
/* ... inference pipeline ... */
hal_stop_tracing();

Viewing traces

Open https://ui.perfetto.dev/
Drag the generated .json file onto the page
Click slices to see structured fields in the Current Selection panel

Using traces for optimization

The tracing infrastructure complements the rules in the Optimization Guide and the data in BENCHMARKS.md:

Identify bottlenecks — common findings:
- extract_proto > 3 ms → model emits NCHW protos but HAL is transposing (check the layout field)
- cpu_format_convert appearing twice → intermediate format conversion (consider matching src/dst formats)
- tensor_alloc per-frame → tensors not being reused (Rule 1)
Validate rules — re-run with tracing after applying a rule to confirm the expected spans disappear or shrink.
Cross-reference with perf — for CPU-bound spans, combine trace data with perf record for instruction-level hotspots.

Limitations

Only one trace session per process lifetime (Rust global subscriber model).
Rayon worker spans are not automatically parented to the calling span.
The log::* output (via env_logger / C callback logger) operates independently from trace capture; both can be active simultaneously.

Dependencies

Key external dependencies

PyO3 — Python bindings
ndarray — N-dimensional arrays
rayon — Data parallelism
fast_image_resize — CPU image operations
zune-png — PNG image decoding (JPEG uses custom decoder)
dma-heap — Linux DMA allocation
nix — Unix system calls

Internal dependency graph

graph TD
    EF[edgefirst-hal<br/>umbrella]
    Tensor[edgefirst-tensor]
    Image[edgefirst-image]
    Decoder[edgefirst-decoder]
    Tracker[edgefirst-tracker<br/>optional]
    G2D[g2d-sys<br/>optional]

    EF --> Tensor
    EF --> Image
    EF --> Decoder
    Image --> Tensor
    Image --> Decoder
    Image -.optional.-> G2D
    Image -.->|tracker feature| Tracker
    Decoder -.->|tracker feature| Tracker

    Python[edgefirst_hal<br/>PyO3]
    CAPI[edgefirst-hal-capi]

    Python --> EF
    CAPI --> EF
    CAPI --> Tensor
    CAPI --> Image
    CAPI --> Decoder
    CAPI --> Tracker

    style EF fill:#fff4e1
    style Python fill:#e1f5ff
    style CAPI fill:#e1f5ff
    style Tracker fill:#e8f5e9

Future Considerations

Model HAL — planned abstraction for inference engines (ONNX, TFLite, Kinara)
VPI integration — support for NVIDIA Vision Programming Interface
Additional trackers — SORT, Deep SORT
Async I/O — non-blocking image loading and processing
GPU compute — Vulkan / CUDA backends for custom operations

Support

Community resources

GitHub Discussions — questions and ideas
Issue Tracker — bug reports and feature requests

EdgeFirst ecosystem

This project is part of the EdgeFirst Perception stack:

EdgeFirst Studio — complete MLOps platform
EdgeFirst Hardware Platforms — NPU/GPU acceleration on NXP i.MX

Professional services

Au-Zone Technologies offers comprehensive support for production deployments: training & workshops, custom development, integration services, enterprise SLAs, and hardware reference designs.

Contact: [email protected] · au-zone.com

Contributing

We welcome contributions! Please see CONTRIBUTING.md for development setup and guidelines. This project follows our Code of Conduct.

Security

For security vulnerabilities, see SECURITY.md or email [email protected] with subject "Security Vulnerability".

Documentation

ARCHITECTURE.md — cross-crate architecture story
TESTING.md — workspace testing rules and CI matrix
BENCHMARKS.md — empirical performance reference
CHANGELOG.md — release history
Per-crate docs (README + ARCHITECTURE + TESTING) — see Core Components table

License

Apache License 2.0 — see LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 992 Commits
.cargo		.cargo
.config		.config
.github		.github
.vscode		.vscode
crates		crates
scripts		scripts
testdata		testdata
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
BENCHMARKS.md		BENCHMARKS.md
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
Makefile		Makefile
NOTICE		NOTICE
README.md		README.md
SECURITY.md		SECURITY.md
TESTING.md		TESTING.md
entitlements.plist		entitlements.plist
pyproject.toml		pyproject.toml
rustfmt.toml		rustfmt.toml
sonar-project.properties		sonar-project.properties

Folders and files

Latest commit

History

Repository files navigation

EdgeFirst Hardware Abstraction Layer

Features

Quick Start

Installation

Basic usage

System Architecture

Core Components

Optimization Guide

Rule 1 — Reuse tensors across frames

Rule 2 — Allocate via ImageProcessor::create_image()

Rule 3 — Cache imported camera tensors by inode, not by fd

Rule 4 — Build the decoder once

Rule 5 — One ImageProcessor per pipeline

Rule 6 — Local fp16 / AVX build overrides

Rule 7 — NumPy interop: pass arrays straight to from_numpy()

Rule 8 — Choose the correct MaskResolution

Where to go next

Platform Support

macOS GPU Acceleration

Source / Cargo installs

Verifying the GPU backend is active

Custom ANGLE locations

When you don't need this setup

Build System

Environment Variables

Testing

Benchmarking

Performance Tracing

How it works

Span coverage

Enabling tracing

Viewing traces

Using traces for optimization

Limitations

Dependencies

Key external dependencies

Internal dependency graph

Future Considerations

Support

Community resources

EdgeFirst ecosystem

Professional services

Contributing

Security

Documentation

License

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 38

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Rule 2 — Allocate via `ImageProcessor::create_image()`

Rule 5 — One `ImageProcessor` per pipeline

Rule 7 — NumPy interop: pass arrays straight to `from_numpy()`

Rule 8 — Choose the correct `MaskResolution`

Packages