Skip to content

EdgeFirstAI/hal

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

992 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

EdgeFirst Hardware Abstraction Layer

Build Status License Crates.io PyPI

The EdgeFirst Hardware Abstraction Layer (HAL) is a Rust workspace that provides hardware-accelerated tensor management, image processing, ML model output decoding, and multi-object tracking for edge AI inference pipelines. It ships as a Rust crate, a Python package, and a C library — same code, three language surfaces — with Linux DMA-BUF, OpenGL ES, and NXP G2D acceleration where the platform supports them, and a portable CPU fallback everywhere else.

Features

  • Zero-copy memory management — DMA-BUF, POSIX shared memory, OpenGL PBO, and heap with automatic backend selection
  • Hardware-accelerated image processing — OpenGL → G2D → CPU dispatch with shared cache infrastructure
  • YOLO + ModelPack decoding — YOLOv5 / v8 / v11 / v26 (incl. end-to-end) and ModelPack post-processing
  • Multi-object tracking — ByteTrack with Kalman filtering and stable per-track UUIDs
  • Cross-platform — Linux (i.MX 8M Plus / i.MX 95 / desktop), macOS, with three CPU/GPU/DMA tiers
  • Production-ready — used in the Au-Zone EdgeFirst suite for edge AI deployments

Quick Start

Installation

Python:

pip install edgefirst-hal

Rust:

[dependencies]
edgefirst-hal = "0.22"

C: download a release archive from GitHub Releases and link against libedgefirst_hal.so (or .a); see crates/capi/README.md for full instructions.

Basic usage

Python:

import edgefirst_hal as ef

img = ef.Tensor.load("image.jpg", ef.PixelFormat.Rgb)
processor = ef.ImageProcessor()
output = processor.create_image(640, 640, ef.PixelFormat.Rgb)
processor.convert(img, output)

decoder = ef.Decoder(config, 0.5, 0.45)
boxes, scores, classes, masks = decoder.decode([output0, output1])

# Fused decode + draw — masks never leave Rust
processor.draw_masks(decoder, [output0, output1], output)

Rust:

The umbrella edgefirst-hal crate re-exports its sub-crates as modules, so a single edgefirst-hal = "0.22" dependency is enough — no need to list edgefirst-image / edgefirst-tensor separately in Cargo.toml.

use edgefirst_hal::image::{load_image, ImageProcessor, ImageProcessorTrait, Rotation, Flip, Crop};
use edgefirst_hal::tensor::{PixelFormat, DType};

let bytes = std::fs::read("image.jpg")?;
let input = load_image(&bytes, Some(PixelFormat::Rgb), None)?;
let mut processor = ImageProcessor::new()?;
let mut output = processor.create_image(640, 640, PixelFormat::Rgb, DType::U8, None)?;
processor.convert(&input, &mut output, Rotation::None, Flip::None, Crop::default())?;

If you prefer to depend on the sub-crates directly (e.g. to opt out of features or to track them at independent versions), add the relevant edgefirst-image, edgefirst-tensor, edgefirst-decoder, and edgefirst-tracker entries to your Cargo.toml and use the unprefixed edgefirst_image::* / edgefirst_tensor::* paths above.

C:

#include <edgefirst/hal.h>

struct hal_image_processor *proc = hal_image_processor_new();
/* `src` is loaded from disk or imported from a DMA-BUF fd —
 * see the C API README for hal_tensor_load_file / hal_import_image. */
struct hal_tensor *src = /* ... */;
struct hal_tensor *dst = hal_image_processor_create_image(
    proc, 640, 640, HAL_PIXEL_FORMAT_RGB, HAL_DTYPE_U8);
hal_image_processor_convert(proc, src, dst, HAL_ROTATION_NONE, HAL_FLIP_NONE, NULL);

Per-language quick-starts and richer examples live in each crate's README: Rust (edgefirst-hal), C API, Python.

System Architecture

graph TB
    subgraph "EdgeFirst HAL Ecosystem"
        Python["Python Bindings (edgefirst-hal)<br/>PyO3"]
        CAPI["C API (edgefirst-hal-capi)<br/>cbindgen"]
        Main["Umbrella crate (edgefirst-hal)<br/>Re-exports"]

        Python --> Main
        CAPI --> Main

        Tensor["edgefirst-tensor<br/>Zero-copy buffers"]
        Codec["edgefirst-codec<br/>Image decode"]
        Image["edgefirst-image<br/>Format conv + draw"]
        Decoder["edgefirst-decoder<br/>Model output decode"]
        Tracker["edgefirst-tracker<br/>ByteTrack"]

        Main --> Tensor
        Main --> Codec
        Main --> Image
        Main --> Decoder
        Main -.->|tracker feature| Tracker
        CAPI --> Tracker

        Codec --> Tensor
        Image --> Tensor
        Image --> Decoder
        Image -.optional.-> G2D["g2d-sys<br/>NXP i.MX"]
    end

    Tensor -.-> DMA["Linux DMA-Heap<br/>Shared Memory"]
    Decoder -.-> PostProc["Model Output<br/>Post-Processing"]

    style Python fill:#e1f5ff
    style CAPI fill:#e1f5ff
    style Main fill:#fff4e1
    style Tensor fill:#e8f5e9
    style Codec fill:#e8f5e9
    style Image fill:#e8f5e9
    style Decoder fill:#e8f5e9
    style Tracker fill:#e8f5e9
Loading

Core Components

Crate Role Architecture Testing
edgefirst-tensor Zero-copy multi-dim buffers (DMA / SHM / Mem / PBO) ARCH TEST
edgefirst-codec JPEG/PNG decode into pre-allocated tensors (strided, multi-dtype) ARCH TEST
edgefirst-image OpenGL / G2D / CPU image processor + mask rendering ARCH TEST
edgefirst-decoder YOLO + ModelPack post-processing, NMS, proto-mask APIs ARCH TEST
edgefirst-tracker ByteTrack multi-object tracking ARCH TEST
edgefirst-hal Umbrella + tracing subscriber ARCH TEST
edgefirst-hal-capi C ABI + Delegate DMA-BUF framework ARCH TEST
crates/python/ (PyPI: edgefirst-hal) PyO3 bindings, numpy buffer protocol ARCH TEST

The deep dive on each component (class diagrams, supported operations, backend dispatch, performance considerations) lives in the per-crate ARCHITECTURE.md. The cross-cutting story (DMA-BUF identity, performance tracing internals, design patterns) lives in the project ARCHITECTURE.md.

Optimization Guide

This section is the rules part of the cross-language performance contract. Each rule has a measurable cost when broken; see BENCHMARKS.md for empirical penalties per platform, ARCHITECTURE.md for why the rule exists, and TESTING.md for how to verify your integration follows it.

Rule Why it matters Measured penalty when broken
Reuse tensors across frames Each new tensor mints a fresh BufferIdentity; the EGL image cache misses every frame 1.7–3.3× slower preprocessing on Vivante / Mali
Allocate via ImageProcessor::create_image() Auto-selects DMA-buf / PBO / heap based on the active GPU; bypassing forces a slow transfer path Forced glTexSubImage2D upload or full CPU readback
Cache imported camera tensors by inode, not by fd V4L2 / libcamera recycle fd numbers across a small buffer pool; an fd-keyed cache misses on every frame even when the physical buffer is the same Full EGL re-import per frame (≈0.5–1.5 ms on Vivante, doubled with chroma planes)
Build Decoder once, decode many Decoder construction parses model metadata and allocates working buffers Parse + alloc cost per frame
One ImageProcessor per pipeline Each instance owns its own GL context, EGL display, and per-thread caches Multiple GL contexts contend on the global GL_MUTEX
Use native fp16 / AVX build overrides only on supporting CPUs These flags unlock native widening / vector paths for local perf testing Unsupported targets may SIGILL or fail to build; portability loss
Pass numpy arrays straight to Tensor.from_numpy() — do not pre-ascontiguousarray() HAL detects strided sources and materializes via numpy's vectorized C strided→contig pass; a manual workaround above HAL adds a redundant copy Redundant pre-copy on every call (≈ 1.5 ms on a (1, 116, 8400) f32 view, rpi5-hailo)
For COCO/IoU evaluation use MaskResolution::Scaled(orig_w, orig_h), not Proto Scaled upsamples the proto plane before thresholding (clean sub-pixel edges); Proto thresholds at proto resolution and callers typically nearest-upsample (blocky) Mask mAP regression of up to 0.04–0.05 absolute when Proto is nearest-upsampled

Important

The single most common performance bug is calling Tensor::from_fd() (or import_image()) on every frame from a V4L2 / libcamera buffer pool. The HAL's internal EGL image cache cannot rescue you — the cache key includes a per-tensor monotonic ID that is fresh on every import. The fix lives in the calling code, not in HAL.

Rule 1 — Reuse tensors across frames

Allocate input and output tensors once at pipeline startup; reuse the same objects on every frame. The DMA memory backing a tensor is live: when an upstream producer (V4L2 DQBUF, codec output, ISP) writes new pixels into it, the existing tensor and its cached EGLImage remain valid. No re-import, no re-allocation.

let mut proc = ImageProcessor::new()?;
let mut dst = proc.create_image(640, 640, PixelFormat::Rgb, DType::U8, None)?;

for frame in camera_frames {
    proc.convert(&frame, &mut dst, Rotation::None, Flip::None, Crop::default())?;
    run_inference(&dst)?;
}
proc = ef.ImageProcessor()
dst = proc.create_image(640, 640, ef.PixelFormat.Rgb)
for frame in camera_frames:
    proc.convert(frame, dst)
    run_inference(dst)

Rule 2 — Allocate via ImageProcessor::create_image()

create_image() selects the fastest memory backend for the active GPU at construction time:

Priority Backend Transfer Platforms
1st DMA-buf Zero-copy EGLImage import NXP i.MX 8M Plus, i.MX 95
2nd PBO Zero-copy GL buffer binding NVIDIA desktop
3rd Mem (heap) CPU memcpy fallback All platforms

The probe runs once at ImageProcessor::new() time. All subsequent create_image() calls reuse the same backend. Use create_image() for every destination passed to convert(); direct Tensor::new(memory=...) bypasses the probe.

For DMA-buf access, the process needs /dev/dma_heap/{linux,cma|system} and a DRM render/card node — the GL backend probes /dev/dri/renderD128, then /dev/dri/card0, then /dev/dri/card1 and uses the first one that opens. On embedded Linux, add the user to video and render groups, or set udev rules. If DMA-buf fails, create_image() transparently falls back to PBO or heap.

Rule 3 — Cache imported camera tensors by inode, not by fd

V4L2, libcamera, and codec output all surface frames as DMA-BUF file descriptors drawn from a small fixed pool (typically 4–16 buffers). The fd number is recycled: the same fd can refer to a different physical buffer between frames, and the same physical buffer can be exported with a different fd over time. A cache keyed by fd will produce false hits or false misses.

The kernel assigns each dma_buf object a unique inode in the anonymous inode filesystem. The inode is constant for the buffer's lifetime regardless of how many times it is exported. Cache imported HAL tensors by (inode, plane_offset):

#include <sys/stat.h>

typedef struct { ino_t inode; size_t offset; } BufferKey;

struct stat st;
if (fstat(fd, &st) != 0) continue;
BufferKey key = { .inode = st.st_ino, .offset = plane_offset };

struct hal_tensor *tensor = lookup_tensor(cache, &key);
if (!tensor) {
    struct hal_plane_descriptor *pd = hal_plane_descriptor_new(fd);
    if (!pd) { perror("hal_plane_descriptor_new"); continue; }
    tensor = hal_import_image(proc, pd, NULL, w, h,
                              HAL_PIXEL_FORMAT_NV12, HAL_DTYPE_U8);
    // pd is consumed by hal_import_image (success or failure)
    if (!tensor) { perror("hal_import_image"); continue; }
    insert_tensor(cache, &key, tensor);
}
hal_image_processor_convert(proc, tensor, dst, /* ... */);
import os
buffer_cache: dict[tuple[int, int], ef.Tensor] = {}

def get_or_import(proc, fd, offset, width, height, fmt):
    key = (os.fstat(fd).st_ino, offset)
    t = buffer_cache.get(key)
    if t is None:
        t = proc.import_image(fd, width, height, fmt, "uint8", offset=offset)
        buffer_cache[key] = t
    return t

EdgeFirst's GStreamer elements implement this as a reference. For other pipelines (libcamera direct, custom V4L2, RTSP decoder) you are responsible for the equivalent layer above HAL. See ARCHITECTURE.md § Appendix C for the full identity-and-caching story.

Rule 4 — Build the decoder once

Decoder parses the model output schema, resolves quantization, and allocates working buffers at construction time. Build it once outside the loop; the decoder clears its output vectors per call:

let decoder = DecoderBuilder::default()
    .with_config_yaml_str(config_yaml)
    .with_score_threshold(0.5)
    .with_iou_threshold(0.45)
    .build()?;

for frame in frames {
    let outputs = run_inference(frame)?;
    let refs: Vec<&TensorDyn> = outputs.iter().collect();
    decoder.decode(&refs, &mut boxes, &mut masks)?;
}

The same applies to ByteTrack: construct once, call update() per frame.

Rule 5 — One ImageProcessor per pipeline

ImageProcessor owns its OpenGL context, dedicated GL thread, and EGL image cache. The EGL display itself is process-global (a shared SharedEglDisplay initialized once and never terminated), so additional processors don't pay the display-creation cost — but each one still creates a fresh context and per-instance caches, and all GL operations across every processor serialize on a global GL_MUTEX. Construct one per pipeline (or one per worker thread for parallel pipelines) and share it across all convert(), draw_*(), and create_image() calls.

ImageProcessor is Send + Sync, so it can be moved or shared across threads. Concurrent use of a single shared instance still serializes on GL_MUTEX; per-worker ownership gives more predictable cache behavior.

Rule 6 — Local fp16 / AVX build overrides

The default HAL binary is built to the target triple's guaranteed baseline ISA so a single distributed binary runs on every CPU within that triple. Richer ISAs (ARMv8.2-FP16, x86_64 F16C / FMA / AVX2) are not enabled by default; until HAL gains runtime CPU-feature detection with dynamic dispatch, baking them in would SIGILL on older CPUs.

For local benchmarking on supporting hosts, enable them via RUSTFLAGS:

# Orin Nano (Cortex-A78AE) — exclude the PyO3 binding (cross-Python toolchain not configured)
RUSTFLAGS="-C target-cpu=cortex-a78ae" cargo build --release \
  --target aarch64-unknown-linux-gnu --workspace --exclude edgefirst_hal

# Generic aarch64 with FEAT_FP16 (do NOT use on Cortex-A53 / imx8mp)
RUSTFLAGS="-C target-feature=+fp16" cargo build --release \
  --target aarch64-unknown-linux-gnu -p edgefirst-image

# x86_64 Haswell+ (F16C + FMA + AVX2)
RUSTFLAGS="-C target-feature=+f16c,+fma,+avx2" cargo build --release \
  -p edgefirst-image

When active, the f16 mask kernel at crates/image/src/cpu/masks.rs compiles to native widening (fcvt on aarch64, vcvtph2ps on x86_64), and on x86_64 with +f16c,+fma an explicit 8-lane _mm256_cvtph_ps + _mm256_fmadd_ps intrinsic path is enabled via cfg gate. Verify with scripts/audit_f16_codegen.sh.

Rule 7 — NumPy interop: pass arrays straight to from_numpy()

Tensor.from_numpy() (and the implicit copy from numpy arrays passed to Decoder.decode_proto()) handles strided / non-contiguous sources internally. Do not maintain a manual np.ascontiguousarray() workaround — it wastes a copy.

The Python binding's copy_numpy_to_tensor_dyn selects one of three paths based on the source array's layout:

Source layout Path Cost
Fully contiguous Single copy_from_slice (memcpy), rayon-parallel ≥ 256 KiB Lower bound
Strided with contiguous inner rows (column slice, sub-volume, negative stride) Per-row memcpy iterating outer dimensions ≈ same as contiguous
Fully strided (transposed view, every-other-element) Internal np.ascontiguousarray() materialisation, then Path 1 memcpy ≈ 4× contiguous

The fully-strided case is the one that bites users in practice: HailoRT's natural output is arr.transpose(0, 2, 1) over a (1, anchors, channels) buffer. PR #58 replaced the legacy element-wise loop with internal np.ascontiguousarray materialization (≈ 4× faster than the legacy loop, within ≈ 1.5× of the manual workaround).

# Wrong (post-PR #58): adds an extra copy above HAL.
tensor.from_numpy(np.ascontiguousarray(arr_strided))

# Right: HAL detects the strided layout and materializes internally.
tensor.from_numpy(arr_strided)

The regression tests in tests/test_tensor.py (test_from_numpy_hailort_shape, test_from_numpy_hailort_shape_perf_sanity) pin the behaviour and the ≤ 1.5× perf bound.

Rule 8 — Choose the correct MaskResolution

ImageProcessor.materialize_masks() accepts a MaskResolution parameter:

Mode Output Pipeline When to use
MaskResolution::Proto (default) (roi_h, roi_w, 1) u8 binary at 160×160 proto resolution dot → sign threshold → emit Real-time visualisation, when proto-resolution binary suffices
MaskResolution::Scaled { width, height } (roi_h, roi_w, 1) u8 binary at requested resolution dot → sigmoid → upsample to (W, H) → threshold (>127) All COCO / IoU / mAP evaluation
import edgefirst_hal as hal

# Wrong: threshold then upsample → blocky edges, mAP regression.
tiles = proc.materialize_masks(boxes, scores, classes, proto_data, letterbox=lb)
for tile, box in zip(tiles, boxes):
    binary = (tile[:, :, 0] > 127).astype(np.uint8)
    canvas[y:y+h, x:x+w] = cv2.resize(binary, (W, H), cv2.INTER_NEAREST)

# Right: HAL upsamples-then-thresholds inside its batched-GEMM kernel.
tiles = proc.materialize_masks(boxes, scores, classes, proto_data,
                               letterbox=lb,
                               resolution=hal.MaskResolution.Scaled(W, H))
for tile, box in zip(tiles, boxes):
    canvas[y:y+h, x:x+w] = (tile[:, :, 0] > 127).astype(np.uint8)

The Scaled path uses the batched-GEMM materializer (PR #54). At N ≥ 16 detections it amortizes a single GEMM at proto resolution and upsamples per-detection in rayon-parallel — both more accurate than threshold-then-resize and faster than per-detection scalar work in caller code.

Tip

If you see a mask-mAP gap between your HAL validator and a reference (ONNX / numpy) implementation, this rule is almost always the first thing to check.

Where to go next

Document Level Use it for
ARCHITECTURE.md § Appendix C: DMA-BUF Identity and Tensor Caching Architecture Why the rules exist: BufferIdentity, EGL image cache, the v4l2 / GStreamer fd-recycling story, and the inode-keyed downstream cache pattern
image/ARCHITECTURE.md § Performance Considerations Architecture GL serialization (GL_MUTEX), backend dispatch, per-instance caches
TESTING.md § Validating Optimizations Testing Confirming your integration follows the rules
BENCHMARKS.md Benchmarks Empirical cost of breaking each rule, per platform

Platform Support

Feature Linux (i.MX) Linux (other) macOS Windows
DMA tensors Yes Yes No No
PBO tensors (GPU) Yes Yes No No
IOSurface tensors (zero-copy) No No Yes (with ANGLE) No
Shared memory tensors Yes Yes Yes No
Heap tensors Yes Yes Yes Yes
G2D acceleration Yes No No No
OpenGL acceleration Yes (optional) Yes (optional) Yes (with ANGLE) No
CPU fallback Yes Yes Yes Yes

On macOS the OpenGL backend is enabled when ANGLE is installed — see macOS GPU Acceleration below for setup. If ANGLE is not present the HAL falls back to the CPU backend.

macOS GPU Acceleration

The HAL uses Google's ANGLE to translate the same OpenGL ES 3.0 calls used on Linux to Metal, and Apple's IOSurface for zero-copy buffer interchange (the role DMA-BUF plays on Linux). ANGLE is not part of macOS and must be installed separately. If it is not present at runtime the HAL logs a warning and falls back to the CPU backend.

Source / Cargo installs

Install ANGLE via the third-party Homebrew tap:

brew install startergo/angle/angle

Then re-sign the installed dylibs. Homebrew's install_name_tool step invalidates the bundled code signatures and macOS 26 (Tahoe) refuses to load dylibs with broken signatures at dlopen time, which manifests as an immediate SIGKILL (Code Signature Invalid) with no stdout. The canonical workaround is an ad-hoc re-sign:

codesign --force --sign - $(brew --prefix)/opt/angle/lib/libEGL.dylib
codesign --force --sign - $(brew --prefix)/opt/angle/lib/libGLESv2.dylib

This is a one-time step per ANGLE install; the next brew upgrade angle needs the re-sign again. See Homebrew/brew#19144 for the upstream tracking issue.

The HAL locates libEGL.dylib through the standard dyld search path. On Apple Silicon, /opt/homebrew/lib is on the default search path; on Intel Macs /usr/local/lib is.

Verifying the GPU backend is active

RUST_LOG=edgefirst_image=debug cargo run --release --example pipeline_demo

Look for ANGLE (Apple, ANGLE Metal Renderer: ...) in the bring-up log. If ANGLE is missing or signatures are still broken you will see a warning and the CPU backend is selected.

Custom ANGLE locations

If your ANGLE install is not in /opt/homebrew/opt/angle/lib (or /usr/local/opt/angle/lib on Intel Macs), set EDGEFIRST_ANGLE_PATH to the directory containing libEGL.dylib and libGLESv2.dylib:

EDGEFIRST_ANGLE_PATH=/path/to/angle/lib cargo run --release ...

The lookup order is: EDGEFIRST_ANGLE_PATH → Homebrew → @loader_path (alongside the binary) → @executable_path → unqualified libEGL.dylib on the dyld search path. For bundled distributions, drop the re-signed ANGLE dylibs next to the executable (or into <App>.app/Contents/Frameworks/) and no env var is needed.

When you don't need this setup

  • pip install edgefirst-hal — the macOS wheel ships ANGLE bundled alongside the Python extension; no separate install required.
  • EdgeFirst-signed binary distribution — official binary releases bundle ANGLE re-signed under the EdgeFirst Apple Developer ID. Install and run with no additional setup.

These channels exist precisely so end users do not need to deal with the Homebrew install or re-signing step.

Build System

The workspace builds with standard cargo. The Makefile wraps the common workflows (make test, make bench, make build, make format lint check) with the right flags and gates.

For Python wheels, see crates/python/README.md and crates/python/TESTING.md. For the C library and consumer linking, see crates/capi/README.md.

Environment Variables

Variable Description
EDGEFIRST_TENSOR_FORCE_MEM 1 forces heap memory (disables DMA / SHM)
EDGEFIRST_DISABLE_G2D Disable G2D backend
EDGEFIRST_DISABLE_GL Disable OpenGL backend
EDGEFIRST_DISABLE_CPU Disable CPU backend
EDGEFIRST_FORCE_BACKEND Force one backend: cpu, g2d, or opengl (disables fallback)
EDGEFIRST_FORCE_TRANSFER Force GL transfer: pbo, dmabuf, or sync
EDGEFIRST_OPENGL_RENDERSURFACE 1 enables EGL renderbuffer path for non-dma_heap DMA-BUF (i.MX 95 Neutron NPU)
EDGEFIRST_PROTO_COMPUTE 1 enables GLES 3.1 compute shader for HWC→CHW proto repack
EDGEFIRST_ANGLE_PATH macOS only: directory containing libEGL.dylib / libGLESv2.dylib. Overrides the default search (Homebrew → @loader_path@executable_pathlibEGL.dylib on dyld). Set this when deploying a bundled or custom-signed ANGLE alongside the binary.
EDGEFIRST_TESTDATA_DIR Override testdata location (used by benches and CI)
RUST_LOG Standard env_logger filter — RUST_LOG=edgefirst_image=debug for backend dispatch + cache stats

Per-crate variables and additional detail live in each crate's README.

Testing

See TESTING.md for the cross-cutting testing guide (single-threaded rule, on-target gating, cross-compilation, CI matrix, optimization validation). Per-crate testing detail lives in each crate's TESTING.md — links in the Core Components table.

Benchmarking

Binary Crate What it measures
tensor_benchmark edgefirst-tensor Tensor allocation and map/unmap latency across buffer types
image_benchmark edgefirst-image Crop, flip, rotate, resize, draw
pipeline_benchmark edgefirst-image Letterbox pipeline + format conversion
decode_pipeline_benchmark edgefirst-image JPEG decode → letterbox convert (strided, HWC/CHW)
mask_benchmark edgefirst-image draw_decoded_masks, draw_proto_masks, hybrid path
opencv_benchmark edgefirst-image OpenCV baseline comparison
decoder_benchmark edgefirst-decoder YOLO post-processing, NMS, dequant
tracker_benchmark edgefirst-tracker ByteTrack throughput vs. simultaneous tracks

Run on host:

cargo bench -p edgefirst-image --bench pipeline_benchmark -- --bench

# Force a backend
EDGEFIRST_FORCE_BACKEND=cpu cargo bench -p edgefirst-image --bench pipeline_benchmark -- --bench

Cross-compile + deploy to a target (SSH hostnames in ~/.ssh/config: imx8mp-frdm, imx95-frdm, rpi5-hailo, jetson-orin-nano, maivin):

cargo-zigbuild zigbuild --target aarch64-unknown-linux-gnu --release \
  -p edgefirst-image --features opengl --bench pipeline_benchmark

scp target/aarch64-unknown-linux-gnu/release/deps/pipeline_benchmark-* imx8mp-frdm:/tmp/
ssh imx8mp-frdm '/tmp/pipeline_benchmark-* --bench --json /tmp/pipeline.json'

All benchmarks accept --bench --json <path> for structured output. Store results under benchmarks/<platform>/<name>.json. Update BENCHMARKS.md via:

python3 .github/scripts/generate_benchmark_tables.py --data-dir benchmarks/

Performance Tracing

The HAL ships with built-in tracing for capturing detailed performance traces across all processing stages. Traces use the Chrome JSON format and view in Perfetto UI.

How it works

Every HAL library crate emits tracing spans on hot paths. These spans have near-zero overhead when no subscriber is active — each site compiles to a single relaxed atomic load. No heap allocations, no string formatting, no function calls on the hot path.

When a session is started via the API, a Chrome JSON subscriber records all span enter/exit events with high-resolution timestamps and structured metadata (detection counts, proto dimensions, format conversions, memory types, etc.) to a file.

Span coverage

The tracing surface covers decode, image conversion, GL multi-pass, mask materialization, tensor lifecycle, tracker association, and the Python entry points. Each span carries structured fields — see the per-crate ARCHITECTURE.md files for the authoritative list of spans and fields per component.

Enabling tracing

Python:

import edgefirst_hal as hal
with hal.Tracing("/tmp/trace.json"):
    # ... run inference pipeline ...
    pass

Rust:

use edgefirst_hal::trace::{start_tracing, stop_tracing};

start_tracing("/tmp/trace.json").expect("start tracing");
// ... inference pipeline ...
stop_tracing(); // flushes and closes the trace file

C:

#include <edgefirst/hal.h>
hal_start_tracing("/tmp/trace.json");
/* ... inference pipeline ... */
hal_stop_tracing();

Viewing traces

  1. Open https://ui.perfetto.dev/
  2. Drag the generated .json file onto the page
  3. Click slices to see structured fields in the Current Selection panel

Using traces for optimization

The tracing infrastructure complements the rules in the Optimization Guide and the data in BENCHMARKS.md:

  1. Identify bottlenecks — common findings:
    • extract_proto > 3 ms → model emits NCHW protos but HAL is transposing (check the layout field)
    • cpu_format_convert appearing twice → intermediate format conversion (consider matching src/dst formats)
    • tensor_alloc per-frame → tensors not being reused (Rule 1)
  2. Validate rules — re-run with tracing after applying a rule to confirm the expected spans disappear or shrink.
  3. Cross-reference with perf — for CPU-bound spans, combine trace data with perf record for instruction-level hotspots.

Limitations

  • Only one trace session per process lifetime (Rust global subscriber model).
  • Rayon worker spans are not automatically parented to the calling span.
  • The log::* output (via env_logger / C callback logger) operates independently from trace capture; both can be active simultaneously.

Dependencies

Key external dependencies

Internal dependency graph

graph TD
    EF[edgefirst-hal<br/>umbrella]
    Tensor[edgefirst-tensor]
    Image[edgefirst-image]
    Decoder[edgefirst-decoder]
    Tracker[edgefirst-tracker<br/>optional]
    G2D[g2d-sys<br/>optional]

    EF --> Tensor
    EF --> Image
    EF --> Decoder
    Image --> Tensor
    Image --> Decoder
    Image -.optional.-> G2D
    Image -.->|tracker feature| Tracker
    Decoder -.->|tracker feature| Tracker

    Python[edgefirst_hal<br/>PyO3]
    CAPI[edgefirst-hal-capi]

    Python --> EF
    CAPI --> EF
    CAPI --> Tensor
    CAPI --> Image
    CAPI --> Decoder
    CAPI --> Tracker

    style EF fill:#fff4e1
    style Python fill:#e1f5ff
    style CAPI fill:#e1f5ff
    style Tracker fill:#e8f5e9
Loading

Future Considerations

  1. Model HAL — planned abstraction for inference engines (ONNX, TFLite, Kinara)
  2. VPI integration — support for NVIDIA Vision Programming Interface
  3. Additional trackers — SORT, Deep SORT
  4. Async I/O — non-blocking image loading and processing
  5. GPU compute — Vulkan / CUDA backends for custom operations

Support

Community resources

EdgeFirst ecosystem

This project is part of the EdgeFirst Perception stack:

Professional services

Au-Zone Technologies offers comprehensive support for production deployments: training & workshops, custom development, integration services, enterprise SLAs, and hardware reference designs.

Contact: [email protected] · au-zone.com

Contributing

We welcome contributions! Please see CONTRIBUTING.md for development setup and guidelines. This project follows our Code of Conduct.

Security

For security vulnerabilities, see SECURITY.md or email [email protected] with subject "Security Vulnerability".

Documentation

License

Apache License 2.0 — see LICENSE for details.

Copyright 2025-2026 Au-Zone Technologies

About

Hardware Abstraction Layer

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors