Project 2 - Phase 1: 3D-2D Pose Estimation (PnP)

  • Keywords: 3D-2D Pose Estimation, PnP, OpenCV, SVD, Linear Estimation
  • Coding language: C++, ROS (Docker with image: osrf/ros:kinetic-desktop-full-xenial, Visualization GUI: theasp/novnc:latest), RViz
  • The detailed code implementation can be found in the GitHub code repo

Given 2D image points and their corresponding 3D world points on a plane, estimate the camera’s pose. This project must not use OpenCV’s PnP-related functions; the linear estimation solution is compared against a reference solution based on OpenCV’s PnP-related functions.

Methodology

The projection from a world point to the image plane is:

\[\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K \begin{pmatrix} r_1 & r_2 & r_3 & t \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}\]

Assuming all world points lie on the ground plane $z = 0$, the equation simplifies to:

\[\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K \begin{pmatrix} r_1 & r_2 & t \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}\]

Denoting $H = K \begin{pmatrix} r_1 & r_2 & t \end{pmatrix}$, the projection equation can be rewritten as:

\[\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}\]

Each observation $(u_i, v_i)$ of a feature at $(x_i, y_i)$ contributes two linear constraints on the entries of $H$:

\[\begin{bmatrix} x_i & y_i & 1 & 0 & 0 & 0 & -x_i u_i & -y_i u_i & -u_i \\ 0 & 0 & 0 & x_i & y_i & 1 & -x_i v_i & -y_i v_i & -v_i \end{bmatrix} \begin{bmatrix} h_{11} \\ h_{12} \\ h_{13} \\ h_{21} \\ h_{22} \\ h_{23} \\ h_{31} \\ h_{32} \\ h_{33} \end{bmatrix} = 0\]

Stacking the constraints from all correspondences into a matrix $A$, we can solve for $H$ (up to scale) linearly: the solution is the right singular vector of $A$ associated with its smallest singular value.
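The project code is C++ with OpenCV; as a minimal language-neutral sketch of this DLT step, the same computation in NumPy looks as follows (the function name `estimate_homography` is illustrative, not from the repo):

```python
import numpy as np

def estimate_homography(pts_world, pts_img):
    """DLT: estimate H (up to scale) from N >= 4 planar correspondences.

    pts_world: (N, 2) array of (x, y) on the z = 0 plane.
    pts_img:   (N, 2) array of observed pixel coordinates (u, v).
    """
    rows = []
    for (x, y), (u, v) in zip(pts_world, pts_img):
        # Two constraint rows per correspondence, as in the matrix above.
        rows.append([x, y, 1, 0, 0, 0, -x * u, -y * u, -u])
        rows.append([0, 0, 0, x, y, 1, -x * v, -y * v, -v])
    A = np.asarray(rows)
    # h is the right singular vector of A with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)
```

With noise-free correspondences the recovered matrix matches the true homography exactly, up to an overall scale (and sign).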

Suppose we estimate an $\hat{H}$ from $N \geq 4$ correspondences, and assume the intrinsic matrix $K$ is known. Pose estimation then means recovering $R$ and $t$ from $\hat{H}$ and $K$.

From the equations above, $K^{-1}H = (r_1\quad r_2\quad t)$ should have orthogonal unit vectors as its first two columns. However, since $\hat{H}$ is estimated from noisy data and only up to scale, this property cannot be guaranteed for the estimate.

So, we denote the columns of $K^{-1}\hat{H}$ as $[\tilde{h}_1\quad \tilde{h}_2\quad \tilde{h}_3]$. We then seek the orthonormal $r_1$ and $r_2$ that are closest to $\tilde{h}_1$ and $\tilde{h}_2$. The solution to this problem is given by the SVD.

We find the orthogonal matrix $R$ that is the closest to $(\tilde{h}_1\quad \tilde{h}_2\quad \tilde{h}_1\times\tilde{h}_2)$ by:

\[\arg \min_{R \in SO(3)} \| R - (\tilde{h}_1 \quad \tilde{h}_2 \quad \tilde{h}_1 \times \tilde{h}_2) \|_F^2\]

If the SVD of $(\tilde{h}_1 \quad \tilde{h}_2 \quad \tilde{h}_1 \times \tilde{h}_2)$ is $USV^T$, then

\[R = UV^T\]

Then, the translation vector $t$ is recovered by normalizing out the unknown scale of $\hat{H}$:

\[t = \frac{\tilde{h}_3}{\|\tilde{h}_1\|}\]
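The whole decomposition above can be sketched in NumPy (a sketch under the stated assumptions, not the project’s C++ implementation; `pose_from_homography` is an illustrative name):

```python
import numpy as np

def pose_from_homography(H, K):
    """Recover R, t from an estimated planar homography, given intrinsics K."""
    Ht = np.linalg.inv(K) @ H            # columns ~ (r1, r2, t) up to scale
    h1, h2, h3 = Ht[:, 0], Ht[:, 1], Ht[:, 2]
    # Closest rotation (Frobenius norm) to (h1  h2  h1 x h2) via SVD.
    M = np.column_stack([h1, h2, np.cross(h1, h2)])
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    # Guard against a reflection (det = -1) from the SVD.
    if np.linalg.det(R) < 0:
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    t = h3 / np.linalg.norm(h1)          # normalize out the unknown scale
    return R, t
```

On a noise-free homography this returns the exact pose regardless of the (positive) scale of $H$, since dividing by $\|\tilde{h}_1\|$ cancels the scale.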

Pose Estimation Visualization in RViz

(Figure: p2p1-rviz.png — pose estimation visualized in RViz)

Note: The green arrows are from my odometry, and the red arrows are from the provided reference odometry based on solutions using OpenCV’s functions (cv::solvePnP, cv::Rodrigues).

Result Statistics

The rotation matrices and translation vectors produced by my algorithm were compared with those from the reference method, and the corresponding RMS errors were accumulated over all frames (the rotation error uses the Frobenius norm; the translation error uses the Euclidean norm).

| Metric             | Error Value (Averaged) |
| ------------------ | ---------------------- |
| Rotation Matrix    | 0.0232805              |
| Translation Vector | 0.0384315              |
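The error accumulation described above can be sketched as follows (names are illustrative; the actual statistics were computed in the C++ node):

```python
import numpy as np

def rms_errors(poses_est, poses_ref):
    """RMS errors between estimated and reference poses.

    poses_*: lists of (R, t) pairs. Rotation error is the Frobenius
    norm of the matrix difference; translation error is the Euclidean
    norm of the vector difference.
    """
    r_sq = [np.linalg.norm(Re - Rr, 'fro') ** 2
            for (Re, _), (Rr, _) in zip(poses_est, poses_ref)]
    t_sq = [np.linalg.norm(te - tr) ** 2
            for (_, te), (_, tr) in zip(poses_est, poses_ref)]
    return np.sqrt(np.mean(r_sq)), np.sqrt(np.mean(t_sq))
```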

Implementation Notes

  • When undistorting the points’ coordinates in the image frame, we can use cv::undistortPoints(pts_2, un_pts_2, K, D, noArray(), K); instead of the given code cv::undistortPoints(pts_2, un_pts_2, K, D);. The function cv::undistortPoints() saves the undistorted points in un_pts_2, but they lie in the normalized camera frame; passing the same intrinsic matrix $K$ again as the projection matrix maps them back into pixel coordinates (back into the same frame the correspondences with pts_3 are set up in).
    • If the given cv::undistortPoints(pts_2, un_pts_2, K, D); is used instead, the points are already in normalized coordinates, so we should not multiply by the inverse of $K$ when forming the estimate.
  • In this project, the tag is provided and fixed, resting stably on the floor. So, we assume the world frame is fixed on the $z_{w}=0$ plane; hence the second dimension of the matrix $A$ is 9, and the matrix $H$ has shape (3, 3).
  • After each SVD computation, the orientation consistency is checked (i.e., whether the estimated camera pose is upside down).
  • The RMS error values are acceptable, considering the linear SVD-based formulation yields only an approximate solution.
  • No specific reference.
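One plausible way to implement the orientation-consistency check mentioned in the notes above (an assumed approach, not necessarily the project’s exact test): $\hat{H}$ is determined only up to sign, so if the decomposed translation would place the tag behind the camera ($t_z < 0$), negate $K^{-1}\hat{H}$ before decomposing it.

```python
import numpy as np

def fix_sign(Ht):
    """Resolve the +/- sign ambiguity of Ht = K^-1 * H_hat.

    Ht's third column is proportional to t; requiring a positive z
    (tag in front of the camera) picks the physically valid sign.
    This is one plausible check; the project may implement it differently.
    """
    return -Ht if Ht[2, 2] < 0 else Ht
```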