PixVOD Pixel-Distributed Direct Visual Odometry and Depth Estimation

  1. Shinjeong Kim
  2. Ignacio Alzugaray
  3. Callum Rhodes
  4. Paul H. J. Kelly
  5. Andrew J. Davison
  1. Imperial College London

arXiv · CVPRW 2026

Teaser

PixVOD targets on-sensor visual odometry: a Pixel Processor Array where each photosite tracks its own slice of the 6-DoF camera pose and its log-depth, exchanging Gaussian messages with its neighbours so that only the high-level signal — pose and depth — ever leaves the chip.

PixVOD pixel-level Gaussian Belief Propagation teaser

Abstract

Images composed of 2D pixel arrays are the standard input to computer vision algorithms, yet many underlying computations can be distributed across pixels. Transmitting raw, redundant, and noisy pixel data off the sensor remains inefficient, motivating a shift toward focal-plane sensor-processors that perform a significant part of the computation directly within each pixel. We envision pixels synthesizing higher-level signals locally, reducing downstream load, and providing richer inputs for higher-level vision tasks.

We propose PixVOD, a fully parallelizable form of visual odometry and depth estimation across pixels, where sensor-processors exchange information through Gaussian Belief Propagation (GBP) to achieve consensus about camera motion and infer depth from per-pixel photometric observations and a surface normal prior. To maintain geometric stability during optimization, we introduce a keyframe-like anchoring mechanism that regulates the effective baseline between frames, enabling consistent motion and depth updates. Our method is evaluated on realistic datasets, demonstrating the feasibility of GBP-based pixel-level distributed visual odometry and depth estimation with keyframe anchoring on-sensor.

Factor Graph Structure

Each pixel holds a stacked variable $\boldsymbol{\mu}_i = (\boldsymbol{\xi}_i, d_i) \in \mathfrak{se}(3) \oplus \mathbb{R}$ — its local estimate of the global 6-DoF camera motion and its log-depth — and exchanges Gaussian messages with its neighbours. Four factor families wire the graph: photometric residuals after warping with the local pose & depth, pose-priors for motion continuity, camera-motion identity factors arranged in a hierarchical quadtree so consensus on the global pose propagates in $O(\log N)$ hops, and normal-integration factors regularizing log-depth between neighbours from a surface-normal prior.

Visualization

Camera-motion estimates aggregated from the per-pixel pose beliefs, and per-pixel depth recovered from the learned log-depth — across Replica office and room sequences, compared against ground truth.

BibTeX

@article{kim2026pixvod,
  title   = {{PixVOD}: Pixel-Distributed Direct Visual Odometry and Depth Estimation},
  author  = {Kim, Shinjeong and Alzugaray, Ignacio and Rhodes, Callum and Kelly, Paul H. J. and Davison, Andrew J.},
  journal = {arXiv preprint arXiv:2606.03989},
  year    = {2026}
}