Problem Definition
Visual Odometry (VO) is the process by which the egomotion of a body is inferred by analyzing images captured while in motion. It is believed to be what many vision-capable organisms use in perceiving their motion and aiding in self-localization. As a focus of this project, different modern approaches to VO were studied, with the following goals in mind:
- Understanding some of the key approaches to VO currently in application (limiting our study to non-deep-learning techniques).
- Implementing and comparing a few different approaches to better understand the processes of each, their strengths, and their limitations.
Project Overview
Background
The term `Visual Odometry' was coined by Nister et al. [1] and is named for its similar utility to wheel odometry, the mechanism by which odometers in cars indicate vehicle speed and heading direction to drivers. VO uses visual inputs (typically from cameras) to offer similar analyses of egomotion. An advantage of visual inputs for motion estimation, over approaches such as counting steps or wheel odometry, is that visual information is less sensitive to localized disturbances such as wheel slippage. Visual techniques do, however, require sufficient overlap between the captured images so that motion can be estimated through visual correspondence. When well implemented, VO techniques can achieve very low error rates even in complex tracking problems, though this is highly dependent on the specific techniques used.
VO can be viewed as a practical application and extension of the concepts of Structure from Motion (SfM). SfM is often employed in machine perception to reconstruct elements of the 3D world using inputs from multiple cameras with different points of view. VO infers a body's motion by comparing newly captured images to models of the world developed from prior images. The VO problem can be broadly categorized into stereo VO and monocular VO: the former makes use of images from two or more cameras captured simultaneously, while the latter uses information from a single camera. While monocular approaches may sometimes be preferred for their reduced computational complexity, they often face problems related to the ambiguity of scale that one typically encounters with monocular images. Stereo systems, on the other hand, can determine the depth of 3D points in the world using stereo correspondence between simultaneously captured images, where the separation between the cameras is known, thus eliminating scale ambiguity. Given this convenience afforded by stereo techniques, we chose to focus on stereo VO. For further information on the basics of VO, we encourage readers to read the Tutorials on Visual Odometry by Scaramuzza and Fraundorfer [2, 3]. These tutorials were used as a guide for the work done in this project and provide an overview of the history of VO as well as key concepts that serve as a basis for many elements of classical VO.
Datasets
For the purposes of this project, we focus on applications of stereo VO, where we work with simultaneous inputs from two cameras, as the use of a stereo baseline allows us to avoid the scale ambiguity of monocular images. We wanted to work with real-world data and found a natural fit in vehicle motion datasets. While multiple datasets were originally considered, including Waymo's Open Dataset [4], we ultimately opted to focus on the KITTI dataset [5] for the relative ease of access to the data and a baseline code implementation for comparison in LibVISO2 [6]. Specifically, we use the KITTI odometry dataset, focusing on sequences 03, 04, 06, and 09. There are additional image sequences in the odometry dataset, but they are not used due to runtime constraints imposed by their large number of images.
The KITTI vehicles had two main cameras oriented in the direction of motion. We perform all pose estimations relative to the body coordinate frame of the left camera and assume the world coordinate frame is coincident and aligned with the left camera at the start of any dataset. This is in line with the data provided. The KITTI dataset also allows us to make the following assumptions:
- The cameras are rigidly attached to the vehicle in motion, and the relative separation between the cameras is unchanging
- The images are sampled simultaneously, regularly, and at discrete times
Below, we present the 4 datasets used, as videos of image frames from the left camera:
Methods Pipeline
Scaramuzza and Fraundorfer [2, 3] present a general overview of the VO pipeline, as in the figure excerpt below. The methods implemented in our work explore feature detection and matching as well as motion-estimation techniques. Due to the implementation complexity of bundle adjustment techniques, as well as the difficulty we faced integrating off-the-shelf packages with our pipeline, we chose to forego the local optimization step. We note, however, that our results are reasonable despite this, which is to be expected: local optimization primarily serves to improve VO results but is not an integral component of the pipeline.
Feature Detection and Matching
Methods
During the feature detection step, we comb the images for salient landmarks, called keypoints, with the goal of matching them in the corresponding stereo image from the other camera. We look for features that differ from their immediate neighborhood, such as corners or areas with unique color or texture. These features should ideally be repeatably found across the two cameras [2]. The region around each detected feature is converted into a compact descriptor that can be compared against other descriptors during the matching step. Finally, a matching algorithm associates the features in corresponding stereo frames, providing the input to triangulation.
For this project, we experimented with four feature detection algorithms that were designed to be used for visual odometry and three possible matching algorithms.
Feature Detection Algorithms
The four feature detection methods we examined were SIFT, RootSIFT, SURF, and ORB [9]. SIFT is a common baseline feature for visual odometry. The SIFT detector convolves the image at different scales with a difference-of-Gaussians operator and then takes the local minima or maxima of the output across scales and space. It is popular because it is designed to be scale and rotation invariant, unlike most corner and other feature detectors.
SURF takes a similar approach to SIFT but uses box filters to approximate the Gaussians. The advantage is that it can be computed much faster than SIFT.
RootSIFT builds upon the output of SIFT by applying a Hellinger kernel: each SIFT vector is L1-normalized and then the element-wise square root is taken, which leaves the result L2-normalized. This is a more recent approach with very good results in practice, but it is slightly slower than SIFT [10].
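As an illustration, the RootSIFT mapping described above can be sketched in a few lines of NumPy (a sketch, not our pipeline code; the small `eps` guard against division by zero is our own addition):

```python
import numpy as np

def rootsift(desc, eps=1e-7):
    """Apply the Hellinger-kernel (RootSIFT) mapping to SIFT descriptors.

    desc: (N, 128) array of raw SIFT descriptors.
    Each descriptor is L1-normalized, then the element-wise square root
    is taken; the result is automatically L2-normalized.
    """
    desc = np.asarray(desc, dtype=np.float64)
    # L1-normalize each row, then take the element-wise square root.
    desc = desc / (np.abs(desc).sum(axis=1, keepdims=True) + eps)
    return np.sqrt(desc)
```

A convenient property of this mapping is that Euclidean distance between RootSIFT descriptors corresponds to the Hellinger distance between the original SIFT descriptors, so standard L2-based matchers can be reused unchanged.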
ORB is the final feature detection method we experimented with during this project. It builds descriptors from pairwise brightness comparisons sampled around keypoints. ORB is very fast to compute, tackles orientation invariance, and optimizes the sampling scheme for the brightness-value pairs.
Matching Algorithms
Given the feature descriptors from the feature detection step, matching algorithms attempt to find matches between features in the input stereo images. We looked at three different matchers for our experiments: Descriptor, Brute Force, and FLANN [8].
Descriptor Matcher is OpenCV's default matcher for visual odometry. It is an abstract class, so it can be customized for the user's dataset and preferences; we used the default settings for this project. The documentation on how this matcher works by default is sparse, but we wanted to include it because it is commonly used in the computer vision community. After some digging, we believe it is a FLANN-based method.
The Brute Force Matcher is a simple matcher. It takes a feature descriptor, calculates its distance to every feature in the other set, and chooses the closest one. If this match passes a distance threshold, the features are matched.
The final matcher we considered was FLANN. FLANN is faster than Brute Force and uses an approximate nearest-neighbors approach; many possible algorithms fall under this broad category. We used FLANN_INDEX_KDTREE for this experiment, which uses the k-d tree data structure to calculate nearest neighbors. Again, we only accept matches that pass a distance threshold.
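To make the matching logic concrete, here is a minimal NumPy sketch of the brute-force scheme described above (the function name and the `max_dist` value are illustrative, not OpenCV's implementation; in practice the threshold is tuned per feature type):

```python
import numpy as np

def brute_force_match(desc_a, desc_b, max_dist=0.5):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b.

    desc_a: (N, D) and desc_b: (M, D) descriptor arrays.
    A match (i, j) is kept only if the nearest neighbour lies within
    max_dist -- the distance threshold mentioned in the text.
    """
    matches = []
    for i, d in enumerate(desc_a):
        # Euclidean distance from descriptor i to every descriptor in desc_b.
        dists = np.linalg.norm(desc_b - d, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < max_dist:
            matches.append((i, j))
    return matches
```

FLANN replaces the exhaustive inner loop with an approximate nearest-neighbor index (e.g. a k-d tree), trading a small amount of accuracy for a large speedup on big descriptor sets.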
Sample Results
For initial experiments, we used the shortest sequence in the KITTI dataset, sequence 04, as it allowed us to run the system many times. This dataset is a simple visual odometry scenario, with the car travelling in a straight line.
We started with the RootSIFT features, as they are reported to be the best in the literature. We compare sample results with the ground truth and the output of LibVISO, a standard baseline library that performs VO.
The ORB features threw errors during later calculations in the VO system, so we decided not to pursue them, especially since the others gave reasonable results.
Matching Algorithm Comparison  


Feature Detection Comparison  


Final choice
Based on the sample results, for 3D-to-3D VO we decided to primarily use RootSIFT with the brute force matcher.
Motion Estimation
Methods overview
After a set of features has been matched, there are a number of approaches to modeling the motion between frames. These can be broadly categorized into 2D-to-2D, 3D-to-3D, and 3D-to-2D approaches. As each name suggests, 2D-to-2D seeks to estimate motion using 2D feature correspondences in a sequence of images; 3D-to-3D attempts to determine the 3D transformation that maps 3D point matches in one frame to another; and 3D-to-2D seeks to determine the relative pose by matching 2D features in one sequence to 3D points in another. 2D-to-2D methods are typically applied to monocular image sequences, which can suffer from the aforementioned ambiguity of 3D scale. To avoid this, we focus on 3D-to-3D and 3D-to-2D motion estimation techniques.
Notation overview
Going forward, readers are asked to keep the following notation in mind:
- We encode the orientation of a given camera as a rotation matrix $R \in SO(3)$ and the position of its origin as a 3D vector $t \in \mathbb{R}^3$, and compose both into a single pose matrix $P \in SE(3)$ such that: $$ P = \begin{bmatrix} R & t\\ 0^{1 \times 3} & 1\end{bmatrix} $$
- We use a superscript to denote the target coordinate frame and a subscript to denote the current coordinate frame, so $P_k^{k-1}$ represents the pose transformation of information from the $k^{th}$ camera frame to the frame of the $(k-1)^{th}$ camera. This notation also applies to the rotations and translations (positional offsets) defined in point 1 above.
- Per the assumptions established earlier, we seek to estimate the current position and orientation of the camera w.r.t. the 1st camera, to which we affix the 'world' coordinate frame, i.e. we want to estimate $P_k^0$ $\forall$ $k$.
- When referring to images, $I_k$ denotes the $k^{th}$ image pair, with $I_{l,k}$ and $I_{r,k}$ representing the images from the left and right cameras respectively.
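Under this notation, relative poses chain by matrix multiplication, e.g. $P_2^0 = P_1^0 P_2^1$; a small NumPy sketch (illustrative only, not pipeline code):

```python
import numpy as np

def make_pose(R, t):
    """Assemble a 4x4 pose matrix P in SE(3) from R in SO(3) and t in R^3."""
    P = np.eye(4)
    P[:3, :3] = R
    P[:3, 3] = np.asarray(t, dtype=float).ravel()
    return P

# Chaining relative poses into the world frame: P_2^0 = P_1^0 @ P_2^1.
# e.g. two unit steps forward along z compose into a two-unit step:
P_1_0 = make_pose(np.eye(3), [0, 0, 1.0])   # P_1^0
P_2_1 = make_pose(np.eye(3), [0, 0, 1.0])   # P_2^1
P_2_0 = P_1_0 @ P_2_1                       # translation is now (0, 0, 2)
```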
3D-to-3D Motion Estimation
The 3D-to-3D correspondence-based motion estimation pipeline follows the general structure laid out by Scaramuzza and Fraundorfer [2]:
1. Capture two stereo image pairs, $I_k$ and $I_{k-1}$
2. Extract and match features between $I_{l,k-1}$ and $I_{r,k-1}$, and determine which of those matched features are also present in both $I_{l,k}$ and $I_{r,k}$ (with additional matching operations); this identifies the features that can be 3D-triangulated in both image pairs
3. Triangulate the matched features to extract 3D points $X_k$ and $X_{k-1}$
4. Compute $P_k^{k-1}$ from $X_k$ and $X_{k-1}$ by solving the Orthogonal Procrustes problem [7]
5. Compute $P_k^0 = P_{k-1}^0 P_k^{k-1}$
6. Repeat from (1) for $k = k + 1$ until all images are processed
3D Point Triangulation
The projection equation takes the form $$ \lambda\begin{bmatrix}x \\ 1 \end{bmatrix} = P_k \begin{bmatrix} X \\ 1 \end{bmatrix}$$ where $x$ represents the normalized camera coordinates of the projection of the 3D point $X \in \mathbb{R}^3$. $x$ is computed as $x = K^{-1}[p, 1]^\top$, where $K$ is the camera calibration matrix and $p$ is the pixel coordinate of the feature to be triangulated. This can be rewritten as $$ \lambda \bar{x} = P_k \bar{X} $$ where $\bar{x}$ and $\bar{X}$ represent the homogeneous normalized-image and 3D coordinates respectively. Note that $\lambda \bar{x}$ traces the line through the 3D point $X$ and the camera center, passing through the image plane at $\lambda = 1$. Since the cross product between parallel vectors is 0, we can rewrite the relationship as $$ \bar{x} \times (P_k \bar{X}) = \widehat{x}P_k \bar{X} = 0$$ where $\widehat{x}$ is the skew-symmetric matrix form of $\bar{x}$.
With a single view, $X$ can only be determined up to scale, but with two or more views it can be determined absolutely through a least-squares minimization, or as the null space of $A$, where: $$ A\bar{X} = \begin{bmatrix} \widehat{x}_l P_l \\ \widehat{x}_r P_r \end{bmatrix} \bar{X} = 0$$ $$ \bar{X} = \underset{\|\bar{X}\|=1}{\text{argmin}} \ \|A\bar{X}\|^2 $$ $$ A = USV^\top $$ $$ \bar{X} = V_{end} $$ where $V_{end}$ is the 4-element last column of $V$, i.e. the right singular vector corresponding to the smallest singular value. $\bar{X}$ is conditioned for scale by dividing by its last element and oriented correctly by multiplying by the sign of $\bar{X}_3$, which corresponds to the z-coordinate, since the point must lie in front of the camera. $$ X = \text{sign}(\bar{X}_3)\frac{\bar{X}_{1:3}}{\bar{X}_4} $$ Note that in our case, $P_l$ and $P_r$ are constant, as the left and right cameras are always fixed in their relative orientation to each other, with the left camera defining the root of the coordinate frame so that $P_l = [I_{3\times 3} \ | \ 0]$.
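The linear triangulation above can be sketched with NumPy's SVD. This is a sketch under the assumption that pixel coordinates have already been mapped to normalized coordinates via $K^{-1}$; the helper names are our own:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(v) @ w == np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def triangulate(x_l, x_r, P_l, P_r):
    """Linear triangulation of one point from two views.

    x_l, x_r: homogeneous normalized image coordinates (3,) in each view
              (i.e. pixel coordinates premultiplied by K^{-1}).
    P_l, P_r: 3x4 projection matrices of the left and right cameras.
    Returns the 3D point X in the left-camera frame.
    """
    # Stack the cross-product constraints from both views: A @ X_bar = 0.
    A = np.vstack([skew(x_l) @ P_l, skew(x_r) @ P_r])
    # Null vector of A = right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]
    X = Xh[:3] / Xh[3]
    # Orient per the sign convention above (point must lie in front, z > 0).
    return np.sign(X[2]) * X
```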
Solving the Procrustes Problem
A 3D rigid transformation mapping coordinates $b$ to $a$ through a rotation and translation can be represented as: $$ a^{k-1} = R_k^{k-1}b^k + t^{k-1} \ \ \ \text{where } a,b,t \in \mathbb{R}^3$$ For $N \geq 3$ points, the Orthogonal Procrustes problem can be solved by first centering the points: $$ A = [a_1 - \bar{a}, \dots, a_N - \bar{a}], \ \ B = [b_1 - \bar{b}, \dots, b_N - \bar{b}] \ \ \ \text{where } \bar{a} = \text{avg}_{\forall i}(a_i), \ \bar{b} = \text{avg}_{\forall i}(b_i) $$ $$ AB^\top = USV^\top $$ $$ R_k^{k-1} = U \begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & \det(UV^\top) \end{bmatrix} V^\top $$ $$ t^{k-1} = \bar{a} - R_k^{k-1}\bar{b} $$ The Procrustes problem is solved inside a 3-point RANSAC loop to filter out outliers.
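A NumPy sketch of this closed-form solution (shown without the 3-point RANSAC wrapper; the function name is illustrative):

```python
import numpy as np

def procrustes(a_pts, b_pts):
    """Recover (R, t) such that a_i ~= R @ b_i + t, for N >= 3 point pairs.

    a_pts, b_pts: (N, 3) arrays of corresponding 3D points.
    """
    a_mean = a_pts.mean(axis=0)
    b_mean = b_pts.mean(axis=0)
    A = (a_pts - a_mean).T          # 3 x N centred target points
    B = (b_pts - b_mean).T          # 3 x N centred source points
    U, _, Vt = np.linalg.svd(A @ B.T)
    # The det(UV^T) correction guards against reflection solutions.
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    R = U @ D @ Vt
    t = a_mean - R @ b_mean
    return R, t
```

In the pipeline, RANSAC repeatedly runs this solve on random 3-point subsets and keeps the (R, t) with the largest inlier consensus.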
3D-to-2D Motion Estimation with Optical Flow
Initially, we planned to implement VO using 3D-to-2D correspondences via a method similar to the one described by Scaramuzza and Fraundorfer [2]; however, we had trouble implementing the Perspective-n-Point (PnP) algorithm their method requires (see step 2.3 of Algorithm 3 in Tutorial: Visual Odometry, Part 1 [2] for more details). Through debugging and analysis of the code, we realized that the root of the problem was in how the PnP problem was being solved: the linearized solution was not sufficiently accurate and was prone to drift, and using nonlinear optimization (as is typically employed in PnP) proved difficult, as we weren't sure what the appropriate tools would be in Python.
Realizing that we could instead use optical flow to estimate 3D motion, we adapted the 3D-to-2D algorithm to include a measure of optical flow, allowing us to estimate the vehicle's linear and angular velocity and thereby its position and orientation.
The modified 3D-to-2D motion estimation algorithm which uses optical flow is as follows:
1. For each iteration $k$, starting with $k=1$ (the 2nd image), read the current stereo pair $I_{l,k}$ and $I_{r,k}$
2. Match and triangulate features between $I_{l,k}$ and $I_{r,k}$
3. Match the matched features against $I_{l,k-1}$ to get a list of correspondences between the previous frame and the current stereo pair; this gives us a list of features that are tracked between consecutive frames
4. Compute the optical flow $\dot{p}$ between $I_{l,k}$ and $I_{l,k-1}$ as $\dot{p} = K^{-1}(p_{l,k} - p_{l,k-1})$
5. Using the triangulation from (2), compute the rotational and translational velocity (see below for details)
6. The velocity is expressed in the current camera frame ($I_{l,k}$) and needs to be cast to the previous frame using the rotation $R_k^{k-1}$, which has a ZXY Euler roll, pitch and yaw defined by $\Delta \Omega^k = dt\cdot \dot{\Omega}_k^{k-1}$ where $\Omega = [roll, pitch, yaw]^\top$. This gives us $v^{k-1} = R_k^{k-1} v^{k}$ and $\Delta \Omega^{k-1} = R_k^{k-1}\Delta\Omega^{k}$. Finally, this is cast back to the world coordinate frame by $v^{0} = R_{k-1}^0 v^{k-1}$ and $\Delta \Omega^0 = R_{k-1}^0 \Delta \Omega^{k-1}$
7. The pose at instance $k$ is then computed as $X_k^0 = X_{k-1}^0 + dt \cdot v^0$ and $\Omega_k^0 = \Omega_{k-1}^0 + \Delta \Omega^0$
8. Repeat from (1) for $k = k+1$
Estimating Velocity from Optical Flow
Given the optical flow $\dot{p}$ and feature depths $z$ for each $p$, the following can be established: $$ p = \frac{X}{z} \ \ \rightarrow \ \ \dot{p} = \frac{\dot{X}}{z} - \frac{\dot{z}}{z}p$$ By applying the relationship $\dot{X} = -\Omega\times X - V$, we derive (with $p$ in homogeneous form $[x,y,1]^\top$): $$ \dot{p} = \frac{1}{z} e_3 \times (p \times V) - e_3 \times (p \times (p \times \Omega)) \ \ \text{where } e_3 = [0,0,1]^\top $$ $$ \dot{p} = \frac{1}{z}\begin{bmatrix} xV_z - V_x \\ yV_z - V_y \end{bmatrix} + \begin{bmatrix} xy & -(1 + x^2) & y \\ 1 + y^2 & -xy & -x\end{bmatrix}\Omega$$
We can solve for the linear velocity $V$ and rotational velocity $\Omega$ using a linear system of equations given 3 or more point-flow correspondences, with outliers excluded using RANSAC.
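The stacked linear system can be sketched as follows. This is a plain least-squares version with illustrative names; in the pipeline, the solve sits inside the RANSAC loop:

```python
import numpy as np

def velocity_from_flow(pts, flow, depths):
    """Least-squares estimate of (V, Omega) from normalized-point flow.

    pts:    (N, 2) normalized image coordinates (x, y)
    flow:   (N, 2) measured flow (x_dot, y_dot) at those points
    depths: (N,)   triangulated depths z
    Builds the 2Nx6 system from the motion-field equation above, with the
    unknown vector ordered as [Vx, Vy, Vz, wx, wy, wz].
    """
    rows, rhs = [], []
    for (x, y), (xd, yd), z in zip(pts, flow, depths):
        rows.append([-1 / z, 0.0, x / z, x * y, -(1 + x * x), y])
        rows.append([0.0, -1 / z, y / z, 1 + y * y, -x * y, -x])
        rhs.extend([xd, yd])
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return sol[:3], sol[3:]   # V, Omega
```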
Results
Performance on Dataset 03 (801 Image Pairs)  
Performance on Dataset 04 (271 Image Pairs)  
Performance on Dataset 06 (1101 Image Pairs)  
Performance on Dataset 09 (1591 Image Pairs)  
Quality of Achieved Results
Generally speaking, we find that the results achieved are satisfactory for the goals we set out to achieve. In addition to gaining a clear understanding of VO techniques through implementation and experimentation, we were even able to develop a functional method through intuition, though we doubt we are the first to attempt this approach. It was also interesting to note that our ad-hoc method outperformed the baseline LibVISO (which is based on dense stereo alignment and optimization) on 2 of the 4 tested datasets, despite using a significantly less computationally intensive methodology.
Reflecting on Results
Working on this project yielded significant insight into the workings of VO, specifically stereo VO, and some of the nuances and limitations associated with it.
Two aspects of sparse stereo VO (as implemented over the course of this project) stand out in particular: dependence on good features, and error propagation.
Feature saliency appears to be very important in determining the quality of tracking.
It is not just the spread or locations of features that matters, but also, as we discovered, their scale invariance.
While SURF features did work on the straight road, we were not able to demonstrate a working VO pipeline on more complicated routes with SURF features.
A cursory analysis suggested that this was likely due to the lack of scale invariance, which tends to matter a lot as the vehicle approaches or drives away from objects.
We further noted cases where objects with especially salient features, such as passing cars, could overwhelm the RANSAC consensus, with what should be outliers being included while inliers were rejected.
This type of issue was especially pronounced when lens flares caused very few features to be detected, or when there was too much greenery that was not sufficiently salient. The latter was identified as the cause of the failed tracking of the 3D-to-2D method on Dataset 09, where good flow information was lost between frames due to bad feature tracks.
A second key observation concerned error propagation through the pipeline and across multiple frames. A particularly sensitive aspect was the estimation of pose, which gets progressively worse over time if any significant errors are incurred or if RANSAC was not run successfully (success being loosely defined as RANSAC yielding near-optimal results).
Scope For Improvement
While the methods presented make for a reasonable first pass at a VO pipeline, many improvements can still be made. Mainly, we do not use any form of local optimization, filtering, or error correction; any or all such techniques are likely to boost performance significantly, though with varying degrees of added computational cost. It is also conceivable that using dense features instead of sparse feature detection, as we do now, might improve tracking performance. A rudimentary study of this, however, revealed limited gains, as many features were still discarded in the cross-frame and cross-camera matching, leaving mainly the salient features. The level of information loss might be curbed through more careful tuning of matching and tracking thresholds.
Credits and Bibliography
[1] D. Nister, O. Naroditsky, and J. Bergen, "Visual Odometry", in Proc. Int. Conf. on Computer Vision and Pattern Recognition, 2004
[2] D. Scaramuzza and F. Fraundorfer, "Tutorial: Visual Odometry, Part 1", IEEE Robotics & Automation Magazine, 2011
[3] D. Scaramuzza and F. Fraundorfer, "Tutorial: Visual Odometry, Part 2", IEEE Robotics & Automation Magazine, 2012
[4] P. Sun et al., "Scalability in Perception for Autonomous Driving: Waymo Open Dataset", 2019
[5] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite", Conference on Computer Vision and Pattern Recognition (CVPR), 2012
[6] A. Geiger, J. Ziegler, and C. Stiller, "StereoScan: Dense 3d Reconstruction in Real-time", IEEE Intelligent Vehicles Symposium, 2011
[7] "Orthogonal Procrustes Problem", https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem
[8] "Feature Matching", https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_matcher/py_matcher.html
[9] "Introduction to SIFT (Scale-Invariant Feature Transform)", https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_sift_intro/py_sift_intro.html
[10] R. Arandjelovic and A. Zisserman, "Three things everyone should know to improve object retrieval", CVPR, pp. 2911-2918, IEEE Computer Society, 2012