Visual Odometry with Real-World Data

CS 585 Final Project: Visual Odometry with Real-World Data
Allison Mann
Siddharth Mysore
Date: April 30th 2020

NOTE: This page uses JavaScript packages loaded from the web for TeX display, so the page may not display correctly if viewed offline.

Problem Definition

Visual Odometry (VO) is the process by which the ego-motion of a body is inferred by analyzing images captured while in motion. It is believed to be what many vision-capable organisms use in perceiving their motion and aiding in self-localization. As a focus of this project, different modern approaches to VO were studied, with the following goals in mind:

  1. Understand some of the key approaches to VO currently in application (limiting our study to non-deep-learning techniques).
  2. Implement and compare a few different approaches to better understand the process of each, their strengths, and their limitations.

Project Overview


The term `Visual Odometry' was coined by Nister et al.[1] and is named for its similar utility to wheel odometry - the mechanism by which odometers in cars operate to indicate vehicle speed and heading to drivers. VO uses visual inputs (typically from cameras) to offer similar analyses of ego-motion. An advantage of visual inputs for motion estimation, over approaches such as counting steps or wheel odometry, is that visual information is less sensitive to localized disturbances such as wheel slippage. Visual techniques do, however, require sufficient overlap in the captured images to estimate motion through visual correspondence. When well implemented, VO techniques can achieve very low error rates even in complex tracking problems, though this is highly dependent on the specific techniques utilized.

VO can be viewed as a practical application and extension of the concepts of Structure from Motion (SfM). SfM is often employed in machine perception to reconstruct elements of the 3-D world using inputs from multiple cameras with different points of view. VO infers information about a body's motion by comparing new images captured as the body moves against models of the world developed from prior images. The VO problem can be broadly categorized into stereo VO and monocular VO: the former makes use of images from two or more cameras captured simultaneously, while the latter uses information from a single camera. While monocular approaches may sometimes be preferred for their reduced computational complexity, they often face problems related to the ambiguity of scale that one typically encounters with monocular images. Stereo systems, on the other hand, are capable of determining the depth of 3-D points in the world using stereo correspondence between simultaneously captured images where the separation between the cameras is known, thus eliminating scale ambiguity. Given this convenience afforded by stereo techniques, we chose to focus on stereo VO. For further information on the basics of VO, we encourage readers to read the Tutorials on Visual Odometry by Scaramuzza and Fraundorfer[2-3]. These tutorials were used as a guide for the work done in this project and provide an overview of the history of VO as well as key concepts that serve as a basis for many elements of classical VO.


For the purposes of this project, we focus on stereo VO - working with simultaneous inputs from two cameras, as the use of a stereo baseline allows us to avoid the scale ambiguity of monocular images. We wanted to work with real-world data and found a natural fit in vehicle-motion datasets. While multiple datasets were originally considered, including Waymo's Open Dataset[4], we ultimately opted for the KITTI dataset[5] for its relative ease of access and for the availability of a baseline code implementation for comparison in LibVISO2[6]. Specifically, we use the KITTI odometry dataset, focusing on sequences 03, 04, 06, and 09. There are additional image sequences in the odometry dataset, but they were not used due to run-time constraints imposed by their large number of images.

The KITTI vehicles had two main cameras oriented in the direction of motion. We perform all pose estimations relative to the body coordinate frame of the left camera and assume the world coordinate frame is coincident and aligned with the left camera at the start of each dataset. This is in line with the data provided. The KITTI dataset also allows us to make the following assumptions:

  1. The cameras are rigidly attached to the vehicle in motion and the relative separation between the cameras is unchanging.
  2. The images are sampled simultaneously, regularly, and at discrete times.

Figure 1 - KITTI car sensor layout

Below, we present the 4 datasets used, as videos of image frames from the left camera:

Dataset 03
Dataset 04
Dataset 06
Dataset 09

Methods Pipeline

Scaramuzza and Fraundorfer[2-3] present a general overview of the VO pipeline, as in the figure excerpted below. The methods implemented in our work explore feature detection and matching as well as motion-estimation techniques. Due to the implementation complexity of bundle-adjustment techniques, as well as the difficulty we faced in integrating off-the-shelf packages with our pipeline, we chose to forego the local optimization step. We note, however, that our results are reasonable despite this, which is to be expected, as local optimization primarily serves to refine VO results and is not an integral component of the pipeline.

Figure 2 - Visual Odometry Pipeline[2]

Feature Detection and Matching


During the feature detection step, we comb the images for salient landmarks, called keypoints, with the goal of matching them in the corresponding stereo image from the other camera. We look for features that differ from their immediate neighborhood, such as corners or areas with unique color or texture. These features should ideally be repeatably detectable in both cameras [2]. The region around each detected feature is then converted into a compact descriptor that can be compared against other descriptors. Finally, a matching algorithm associates features across corresponding stereo frames, and the matches serve as input to triangulation.

For this project, we experimented with four feature detection algorithms that were designed to be used for visual odometry and three possible matching algorithms.

Feature Detection Algorithms

Figure 3 - Example of feature detection from the left and right camera for visual odometry

The four feature detection methods we examined were SIFT, ROOTSIFT, SURF, and ORB [9]. SIFT is the baseline feature descriptor for visual odometry. The SIFT detector convolves the image at different scales with a difference-of-Gaussians operator and then takes the local minima or maxima of the output across scales and space. It is popular because it is designed to be scale- and rotation-invariant, unlike most corner and other feature detectors.

SURF takes a similar approach to SIFT, but uses box filters to approximate the Gaussian. The advantage here is that it can be computed much faster than SIFT.

ROOTSIFT builds upon the output of SIFT by applying a Hellinger kernel: each SIFT vector is L1-normalized and then the element-wise square root is taken, which leaves the result L2-normalized. This is a more recent approach that produces very good results in practice, but is slightly slower than SIFT [10].
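The ROOTSIFT transform described above is only a few lines of numpy. The sketch below (with an illustrative random descriptor matrix standing in for real SIFT output) shows the two steps; the function name and `eps` guard are our own additions:

```python
import numpy as np

def rootsift(descriptors, eps=1e-7):
    """Apply the Hellinger-kernel (RootSIFT) transform to SIFT descriptors.

    L1-normalize each descriptor, then take the element-wise square root;
    the result is automatically L2-normalized.
    """
    descriptors = np.asarray(descriptors, dtype=np.float64)
    l1 = np.abs(descriptors).sum(axis=1, keepdims=True) + eps
    return np.sqrt(descriptors / l1)

# Example: random non-negative 128-d vectors standing in for SIFT descriptors
d = np.random.rand(5, 128)
rs = rootsift(d)
# Each transformed descriptor has (near) unit L2 norm
print(np.linalg.norm(rs, axis=1))
```

Because SIFT descriptors are non-negative, the element-wise square root of an L1-normalized vector automatically has unit L2 norm, so no separate normalization step is needed.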

ORB is the final feature detection method we experimented with during this project. It builds descriptors from pairwise brightness comparisons sampled around keypoints. ORB is very fast to compute, adds orientation invariance, and optimizes the sampling scheme for the brightness-value pairs.

Matching Algorithms

Figure 4 - Example of matched features. For visibility, here we are only displaying the top 85 matches. In our system, we use all matches that pass a distance threshold which is generally about 1000 feature pairs.

Given the feature descriptors from the feature detection step, matching algorithms attempt to find matches between features in the input stereo images. We looked at three different matchers for our experiments: Descriptor, Brute Force, and FLANN [8].

Descriptor Matcher is OpenCV's default matcher for visual odometry. It is an abstract class, so it can be customized for the user's dataset and preferences; we used the default settings for this project. Documentation on how this matcher works by default is sparse, but we wanted to include it because it is commonly used in the computer-vision community. After some digging, we believe it is a FLANN-based method.

The Brute Force Matcher is the simplest matcher. It takes a feature descriptor, computes its distance to every feature in the other set, and chooses the closest one. If this match passes a distance threshold, the features are matched.

The final matcher we considered was FLANN, which is faster than Brute Force and uses an approximate nearest-neighbors approach. There are many possible algorithms that fall under this broad category; we used FLANN_INDEX_KDTREE for this experiment, which uses the KD-tree data structure to compute nearest neighbors. Again, we only accept matches that pass a distance threshold.

Sample Results

For initial experiments, we used the smallest video in the KITTI dataset, sample 04, as it allowed us to run the system many times. This dataset is a simple visual odometry scenario with the car travelling in a straight line.

We started with the ROOTSIFT features, as they are reported to be the best in the literature. We compare sample results with the ground truth and the output of LibVISO2, a standard baseline library that performs VO.
The ORB features threw errors during later calculations in the VO system, so we decided not to pursue them, especially since the others gave reasonable results.

Matching Algorithm Comparison

Descriptor Matcher
Brute Force Matcher
FLANN Matcher

Feature Detection Comparison

Figure 5 - Comparing Feature Matching and Detection

Final choice

Based on the sample results, for VO 3D-to-3D we decided to primarily use ROOTSIFT with the brute force matcher.

Motion Estimation

Methods overview

After a set of features have been matched, there are a number of approaches to modeling the motion between frames. These can be broadly categorized into 2D-to-2D, 3D-to-3D, and 3D-to-2D approaches. As each name suggests, 2D-to-2D estimates motion using 2D feature correspondences in a sequence of images; 3D-to-3D attempts to determine the 3D transformation that maps 3D point matches in one frame to another; and 3D-to-2D determines the relative pose by matching 2D features in one sequence to 3D points in another. 2D-to-2D methods are typically applied to monocular image sequences, which can suffer from the aforementioned ambiguities of 3D scale. To avoid these, we focus on 3D-to-3D and 3D-to-2D motion estimation techniques.

Notation overview

Going forward, readers are asked to keep the following notation in mind:

  1. We encode the orientation of a given camera in the form of a rotation matrix $R \in SO(3)$ and the position of its origin as a 3D vector $t \in \mathbb{R}^3$, and compose both into a single pose matrix $P \in SE(3)$ such that: $$ P = \begin{bmatrix} R & t\\ 0^{1 \times 3} & 1\end{bmatrix} $$
  2. We use a superscript notation to represent the target coordinate frame and a subscript to represent the current coordinate frame, so $P_k^{k-1}$ would represent the pose transformation of information from the $k^{th}$ camera frame to the frame of the $k-1^{th}$ camera. This notation also applies to rotations and translations (positional offset) defined in point 1 above.
  3. Per the assumptions established prior, we seek to estimate the current position and orientation of the camera w.r.t. the 1st camera, to which we affix the 'world' coordinate frame, i.e. we want to estimate $P_k^0$ $\forall$ $k$.
  4. When referring to images, we will refer to $I_k$ as the $k^{th}$ images with $I_{l,k}$ and $I_{r,k}$ representing the images from the left and right cameras respectively.

3D-to-3D Motion Estimation

The 3D-to-3D correspondence-based motion estimation pipeline follows the general structure laid out by Scaramuzza and Fraundorfer[2]:

  1. Capture two stereo image pairs, $I_k$ and $I_{k-1}$
  2. Extract and match features between $I_{l,k-1}$ and $I_{r,k-1}$ and determine which of those matched features appear in both $I_{l,k}$ and $I_{r,k}$ (with additional matching operations) - this indicates which features can be used for 3D triangulation in both image pairs
  3. Triangulate matched features to extract 3D features $X_k$ and $X_{k-1}$
  4. Compute $P_k^{k-1}$ from $X_k$ and $X_{k-1}$ by solving the Orthogonal Procrustes problem[7]
  5. Map $P_k^0 = P_{k-1}^0 P_k^{k-1}$
  6. Repeat from (1) for k = k + 1 until all images are processed
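Steps 4-5 above amount to composing 4x4 homogeneous transforms. The numpy sketch below illustrates the pose accumulation; the 5-degree yaw and 1 m forward translation are made-up example values, not data from our pipeline:

```python
import numpy as np

def make_pose(R, t):
    """Pack rotation R (3x3) and translation t (3,) into a 4x4 SE(3) pose."""
    P = np.eye(4)
    P[:3, :3] = R
    P[:3, 3] = t
    return P

# Example relative motion P_k^{k-1}: 1 m forward with a 5-degree yaw
theta = np.deg2rad(5)
R_yaw = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                  [0.0,           1.0, 0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
P_rel = make_pose(R_yaw, np.array([0.0, 0.0, 1.0]))

# Step 5: accumulate P_k^0 = P_{k-1}^0 @ P_k^{k-1} over consecutive frames
P_global = np.eye(4)  # world frame coincides with the first left-camera frame
for _ in range(2):
    P_global = P_global @ P_rel
print(P_global[:3, 3])  # position of the camera in the world frame
```

Note that the relative pose is right-multiplied: each new $P_k^{k-1}$ is expressed in the previous camera frame, so it must be composed after the accumulated global pose.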

3D Point Triangulation

The projection equation takes the form $$ \lambda\begin{bmatrix}x \\ 1 \end{bmatrix} = P_k \begin{bmatrix} X \\ 1 \end{bmatrix}$$ where $x$ represents the camera coordinates of the projection of the 3D point $X \in \mathbb{R}^3$. $x$ is computed as $x = K^{-1}[p, 1]^\top$ where $K$ is the camera calibration matrix and $p$ is the pixel coordinate of the feature to be triangulated. This can be rewritten as: $$ \lambda \bar{x} = P_k \bar{X} $$ where $\bar{x}$ and $\bar{X}$ represent the homogeneous camera and 3D coordinates respectively. Note that $\lambda \bar{x}$ represents a line through the 3D point $X$ and the camera center, passing through the image plane at $\lambda = 1$. Since the cross product of parallel vectors is 0, we can rewrite the relationship as: $$ \bar{x} \times (P_k \bar{X}) = \widehat{x}P_k \bar{X} = 0$$ where $\widehat{x}$ is the skew-symmetric matrix representation of $\bar{x}$.

With a single view, $X$ can only be determined up to scale, but with two or more views it can be determined absolutely through a least-squares minimization, i.e. as the (approximate) null space of $A$ where: $$ \begin{bmatrix} \widehat{x}_l P_l \\ \widehat{x}_r P_r \end{bmatrix} \bar{X} = \begin{bmatrix} \widehat{p}_l KP_l \\ \widehat{p}_r K P_r \end{bmatrix} \bar{X} = A\bar{X} = 0$$ $$ \bar{X} = \text{argmin}_{\bar{X}} ||A\bar{X}||^2 $$ $$ A = USV^\top $$ $$ \bar{X} = V_{end} $$ where $V_{end}$ is the last (4-element) column of the right singular-vector matrix $V$. $\bar{X}$ is rescaled by dividing by its last element and oriented correctly by multiplying by the sign of $\bar{X}_3$ - which corresponds to the z-coordinate - since the point must always lie in front of the camera. $$ X = sign(\bar{X}_3)\frac{\bar{X}_{1:3}}{\bar{X}_4} $$ Note that in our case, $P_l$ and $P_r$ are constants, as the left and right cameras are always fixed in their relative orientations to each other, with $P_l$ being the root of the coordinate frame and thus an identity transform.
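The triangulation above can be sketched directly in numpy. The example below works in normalized (calibrated) coordinates with 3x4 projection matrices; the 0.54 m baseline is a KITTI-like illustrative value, and the point coordinates are made up for the check:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix so that skew(a) @ b == np.cross(a, b)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def triangulate(x_l, x_r, P_l, P_r):
    """Linear triangulation of one point from two normalized-camera views.

    x_l, x_r: homogeneous normalized image coordinates (3,)
    P_l, P_r: 3x4 projection matrices in normalized coordinates
    """
    A = np.vstack([skew(x_l) @ P_l, skew(x_r) @ P_r])
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]                  # least-squares (approximate null-space) solution
    X = Xh[:3] / Xh[3]           # rescale by the last homogeneous element
    return np.sign(X[2]) * X     # keep the point in front of the camera

# Synthetic check with a known stereo baseline of 0.54 m (KITTI-like value)
P_l = np.hstack([np.eye(3), np.zeros((3, 1))])
P_r = np.hstack([np.eye(3), np.array([[-0.54], [0.0], [0.0]])])
X_true = np.array([1.0, 0.5, 8.0])
x_l = np.append(X_true[:2] / X_true[2], 1.0)
x_cam_r = P_r @ np.append(X_true, 1.0)
x_r = x_cam_r / x_cam_r[2]
X_est = triangulate(x_l, x_r, P_l, P_r)
```

With exact correspondences, the recovered point matches the true 3D position; with noisy features, the SVD solution gives the least-squares compromise between the two rays.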

Figure 6 - Sample 3D triangulation for features matched in Figure 3

Solving the Procrustes Problem

A 3D rigid transformation mapping coordinates $b$ to $a$ via a rotation and translation can be represented as: $$ a^{k-1} = R_k^{k-1}b^k + t^{k-1} \ \ \ \text{where } a,b,t \in \mathbb{R}^3$$ For $N \geq 3$ points, the Orthogonal Procrustes problem is solved using the centered point matrices $$ A = [a_1 - \bar{a}, \dots, a_N - \bar{a}] \ \ \text{and } B = [b_1 - \bar{b}, \dots, b_N - \bar{b}], \ \ \ \text{where } \bar{a} = \text{avg}_{\forall i}(a_i) \ \text{and } \bar{b} = \text{avg}_{\forall i}(b_i) $$ $$ AB^\top = USV^\top $$ $$ R_k^{k-1} = U \begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & \det(UV^\top) \end{bmatrix} V^\top $$ $$ t^{k-1} = \bar{a} - R_k^{k-1}\bar{b} $$ The Procrustes problem is solved within a 3-point RANSAC loop to filter out outliers.
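The closed-form Procrustes solution is a short SVD computation. A minimal numpy sketch (without the RANSAC loop; the test rotation, translation, and point count are made-up example values):

```python
import numpy as np

def solve_procrustes(a, b):
    """Recover R, t such that a_i = R @ b_i + t for 3xN point sets (N >= 3)."""
    a_bar = a.mean(axis=1, keepdims=True)
    b_bar = b.mean(axis=1, keepdims=True)
    A = a - a_bar                          # centered target points
    B = b - b_bar                          # centered source points
    U, _, Vt = np.linalg.svd(A @ B.T)
    # det correction guarantees a proper rotation (no reflection)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    R = U @ D @ Vt
    t = (a_bar - R @ b_bar).ravel()
    return R, t

# Synthetic check: rotate and translate random points, then recover the motion
rng = np.random.default_rng(1)
phi = np.deg2rad(10)
R_true = np.array([[np.cos(phi), -np.sin(phi), 0.0],
                   [np.sin(phi),  np.cos(phi), 0.0],
                   [0.0,          0.0,         1.0]])
t_true = np.array([0.5, -0.2, 1.0])
b = rng.random((3, 10))
a = R_true @ b + t_true[:, None]
R, t = solve_procrustes(a, b)
```

In the full pipeline this solver would be called on the 3-point subsets drawn by RANSAC, with the consensus set used for a final refit.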

3D-to-2D Motion Estimation with Optical Flow

Initially, we planned to implement VO with 3D-to-2D correspondences using a method similar to the one described by Scaramuzza and Fraundorfer [2]; however, we had trouble implementing the Perspective-n-Point (PnP) algorithm their method requires (see step 2.3 of Algorithm 3 in Tutorial: Visual Odometry, Part 1 [2] for more details). Through debugging and analysis of the code, we traced the root of the problem to how the PnP problem was being solved: the linearized solution was not sufficiently accurate and was prone to drift, and using non-linear optimization (as is typically employed in PnP) proved difficult, as we were unsure what the appropriate tools would be in Python.

Realizing that we could instead use optical flow to estimate 3D motion, we adapted the 3D-to-2D algorithm to include a measure of optical flow, allowing us to estimate the vehicle's linear and angular velocity, and thereby its position and orientation.

The modified 3D-to-2D motion estimation algorithm which uses optical flow is as follows:

  1. For each iteration, $k$ starting with $k=1$ (the 2nd image), read the current stereo pair: $I_{l,k}$ and $I_{r,k}$
  2. Match and triangulate features between $I_{l,k}$ and $I_{r,k}$
  3. Match the matched features against $I_{l,k-1}$ to get a list of correspondences between the previous frame and the current stereo pair - this gives us a list of features that are tracked between the consecutive frames.
  4. Compute the optical flow, $\dot{p}$ between $I_{l,k}$ and $I_{l,k-1}$ as $\dot{p} = K^{-1}(p_{l,k} - p_{l,k-1})$
  5. Using triangulation from (2), compute rotational and translational velocity (see below for details)
  6. The velocity will be in the current camera frame ($I_{l,k}$) and needs to be cast to the previous frame using rotation $R_k^{k-1}$ which has a ZXY Euler roll, pitch and yaw defined by $\Delta \Omega^k = dt\cdot \dot{\Omega}_k^{k-1}$ where $\Omega = [roll, pitch, yaw]^\top$. This gives us $v^{k-1} = R_k^{k-1} v^{k}$ and $\Delta \Omega^{k-1} = R_k^{k-1}\Delta\Omega^{k}$. Finally, this is cast back to the world-coordinate frame by $v^{0} = R_{k-1}^0 v^{k-1}$ and $\Delta \Omega^0 = R_{k-1}^0 \Delta \Omega^{k-1}$
  7. The Pose at instance $k$ is then computed as $X_k^0 = X_{k-1}^0 + dt \cdot v_k^0$ and $\Omega_k^0 = \Omega_{k-1}^0 + dt \cdot \Delta \Omega_k^0$
  8. Repeat from (1) for $k = k+1$

Estimating Velocity from Optical Flow

Given the optical flow, $\dot{p}$ and feature depths $z$ for each $p$, the following can be established: $$ p = \frac{X}{z} \ \ \rightarrow \ \ \dot{p} = \frac{\dot{X}}{z} - \frac{\dot{z}}{z}p$$ By applying the relationship $\dot{X} = -\Omega\times X - V$, we derive: $$ \dot{p} = \frac{1}{z} e_3 \times (p \times V) + e_3 \times (p \times (p \times \Omega)) \ \ \text{where } e_3 = [0,0,1]^\top $$ $$ \dot{p} = \frac{1}{z}\begin{bmatrix} xV_z - V_x \\ yV_z - V_y \end{bmatrix} + \begin{bmatrix} xy & -(1 + x^2) & y \\ (1 + y^2) & -xy & -x\end{bmatrix}\Omega$$

We can solve for the linear velocity $V$ and rotational velocity $\Omega$ using a linear system of equations with 3 or more point-flow correspondences, with outliers excluded using RANSAC.
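Stacking the flow equation above for each tracked point gives an overdetermined linear system in the six unknowns $[V; \Omega]$. A minimal numpy sketch (without the RANSAC loop; the synthetic points, depths, and motion values below are made up purely to check the solver):

```python
import numpy as np

def flow_matrix(x, y, z):
    """Per-point 2x6 coefficient matrix mapping [V; Omega] to optical flow."""
    return np.array([
        [-1/z, 0.0,  x/z, x*y,      -(1 + x*x),  y],
        [0.0,  -1/z, y/z, 1 + y*y,  -x*y,       -x],
    ])

def estimate_velocity(pts, depths, flows):
    """Least-squares solve for linear velocity V and angular velocity Omega.

    pts:    Nx2 normalized image coordinates
    depths: N feature depths z
    flows:  Nx2 optical-flow vectors (in normalized coordinates)
    """
    A = np.vstack([flow_matrix(x, y, z) for (x, y), z in zip(pts, depths)])
    b = np.asarray(flows).ravel()
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:3], sol[3:]  # V, Omega

# Synthetic check: generate flow from a known motion, then recover it
rng = np.random.default_rng(2)
pts = rng.uniform(-0.5, 0.5, (10, 2))
depths = rng.uniform(5.0, 20.0, 10)
V_true = np.array([0.1, -0.05, 1.0])
W_true = np.array([0.01, 0.02, -0.005])
flows = np.array([flow_matrix(x, y, z) @ np.concatenate([V_true, W_true])
                  for (x, y), z in zip(pts, depths)])
V, W = estimate_velocity(pts, depths, flows)
```

In the full pipeline, this least-squares solve would be wrapped in RANSAC: sample minimal sets of point-flow pairs, solve, and keep the motion hypothesis with the largest consensus.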


Performance on Dataset 03 (801 Image Pairs)

Mean Absolute Error (m)
3D-to-3D: 95.57
3D-to-2D: 11.34
LibViso2: 10.97

Performance on Dataset 04 (271 Image Pairs)

Mean Absolute Error (m)
3D-to-3D: 27.82
3D-to-2D: 6.13
LibViso2: 6.53

Performance on Dataset 06 (1101 Image Pairs)

Mean Absolute Error (m)
3D-to-3D: 160.05
3D-to-2D: 15.69
LibViso2: 25.86

Performance on Dataset 09 (1591 Image Pairs)

Mean Absolute Error (m)
3D-to-3D: 57.28
3D-to-2D: 144.48
LibViso2: 42.30

Quality of Achieved Results

Generally speaking, we find that the results achieved are satisfactory for the goals we set out to achieve. In addition to gaining a clear understanding of VO techniques through implementation and experimentation, we were even able to develop a functional method through intuition - though we doubt we are the first to attempt this approach. It was also interesting to note that our ad-hoc method outperformed the LibVISO2 baseline (which is based on dense stereo alignment and optimization) on 2 of the 4 tested datasets, despite using significantly less computationally intensive methodology.

Reflecting on Results

Working on this project yielded significant insight on the workings of VO and specifically stereo-VO and some of the nuances and limitations associated with it. Two aspects of sparse stereo-VO (as implemented over the course of this project) stand out in particular: Dependence on good Features, and Error Propagation.

Feature saliency appears to be very important in determining the quality of tracking. It is not just the spread or locations of features that matter, but also, as we discovered, their scale invariance. While SURF features did work on the straight road, we were not able to demonstrate a working VO pipeline on more complicated routes with SURF features. A cursory analysis suggested that this was likely due to their limited scale invariance, which matters a great deal as the vehicle approaches or drives away from objects. We further noted that objects with especially salient features - such as passing cars - could overwhelm the RANSAC consensus, with what should be outliers being included while inliers were rejected. These issues were especially pronounced when lens flares caused very few features to be detected, or when there was too much insufficiently salient greenery - the latter was identified as the cause of the failed tracking of the 3D-to-2D method on Dataset 09, where good flow information was lost between frames due to bad feature tracks.

A second key observation concerned error propagation through the pipeline and across multiple frames. A particularly sensitive aspect was the estimation of pose, which gets progressively worse over time if any significant errors are incurred, or if RANSAC does not run successfully (where success can be loosely defined as RANSAC yielding a near-optimal consensus).

Scope For Improvement

While the methods presented make for a reasonable first pass at a VO pipeline, many improvements can still be had. Mainly, we do not use any form of local optimization, filtering, or error correction. Any or all such techniques are likely to boost performance significantly - though with varying degrees of added computational cost. It is also conceivable that using dense features instead of sparse feature detection, as we do now, might improve tracking performance - a rudimentary study of this, however, revealed limited gains, as many features were still discarded in the cross-frame and cross-camera matching, resulting mainly in only the salient features being retained. The level of information loss might, however, be curbed through more careful tuning of matching and tracking thresholds.

Credits and Bibliography

[1] D. Nister, O. Naroditsky, and J. Bergen, "Visual odometry", in Proc., Int. Conf. Computer Vision and Pattern Recognition, 2004
[2] Scaramuzza, D. and Fraundorfer, F., "Tutorial: Visual Odometry, Part 1", IEEE ROBOTICS & AUTOMATION MAGAZINE, 2011
[3] Scaramuzza, D. and Fraundorfer, F., "Tutorial: Visual Odometry, Part 2", IEEE ROBOTICS & AUTOMATION MAGAZINE, 2012
[4] P. Sun et al., "Scalability in Perception for Autonomous Driving: Waymo Open Dataset", 2019
[5] A. Geiger, P. Lenz, R. Urtasun, "Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite", Conference on Computer Vision and Pattern Recognition (CVPR), 2012
[6] A. Geiger, J. Ziegler, C. Stiller, "StereoScan: Dense 3d Reconstruction in Real-time," IEEE Intelligent Vehicles Symposium, 2011
[7] "Orthogonal Procrustes Problem,"
[8]"Feature Matching",
[9]"Introduction to SIFT (Scale-Invariant Feature Transform)",
[10] Three things everyone should know to improve object retrieval. R. Arandjelovic, and A. Zisserman. CVPR, page 2911-2918. IEEE Computer Society, (2012)