From AliceVision.
NATURAL FEATURE EXTRACTION
The objective of this step is to extract distinctive groups of pixels that are, to some extent, invariant to changing camera viewpoints during image acquisition. Hence, a feature in the scene should have similar feature descriptions in all images.
The most well-known feature detection method is the SIFT (Scale-Invariant Feature Transform) algorithm. The initial goal of SIFT is to extract discriminative patches in a first image that can be compared to discriminative patches of a second image irrespective of rotation, translation, and scale. As a relevant detail only exists at a certain scale, the extracted patches are centered at stable points of interest. The key idea is that, to some extent, one can use the SIFT invariance to deal with the image transformations occurring when the viewpoints change during image acquisition.
Starting from the representation of one image at different scales, which is technically done by computing a pyramid of downscaled images, SIFT computes scale-space maxima of the Laplacian representation, a specific energy-based representation of the image, using so-called differences of Gaussians. These maxima correspond to points of interest. For each of these maxima, it then samples a square image patch whose origin is the maximum and whose x-direction is the dominant gradient at the origin. A descriptor of this patch is then associated with each keypoint.
The descriptor, which is typically stored in 128 bytes, consists of statistics of gradients computed in regions around the keypoint. The region size is determined by the keypoint scale and its orientation by the dominant axis.
As the number of extracted features may vary a lot due to the variability of texture complexity (from one image to another or between different parts of the same image), a post-filtering step is used to keep the number of extracted features within reasonable limits (for instance between one and ten thousand per image). We use a grid filtering to ensure an even distribution across the image.
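As an illustration, here is a minimal sketch of this extraction step with OpenCV's SIFT, followed by a simple grid filter that keeps the strongest keypoints per cell; the grid size, per-cell limit and the use of the keypoint response for ranking are illustrative assumptions, not AliceVision's actual implementation.

```python
import cv2
import numpy as np

def extract_features(image_path, grid=(4, 4), max_per_cell=500):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)

    # Grid filtering: keep only the strongest keypoints in each cell so that
    # the features are well distributed over the image.
    h, w = img.shape
    kept = []
    for gy in range(grid[1]):
        for gx in range(grid[0]):
            x0, x1 = gx * w / grid[0], (gx + 1) * w / grid[0]
            y0, y1 = gy * h / grid[1], (gy + 1) * h / grid[1]
            cell = [(kp, d) for kp, d in zip(keypoints, descriptors)
                    if x0 <= kp.pt[0] < x1 and y0 <= kp.pt[1] < y1]
            cell.sort(key=lambda kd: kd[0].response, reverse=True)
            kept.extend(cell[:max_per_cell])
    kps = [kp for kp, _ in kept]
    descs = np.array([d for _, d in kept], dtype=np.float32)
    return kps, descs
```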
IMAGE MATCHING
The objective of this part is to find images that look at the same areas of the scene. For that, we use image retrieval techniques to find images that share some content without the cost of resolving all feature matches in detail. The idea is to summarize each image in a compact image descriptor which allows the distance between all image descriptors to be computed efficiently.
One of the most common methods to generate this image descriptor is the vocabulary tree approach. All extracted feature descriptors are passed through the tree and classified by comparing them to the descriptor stored in each node. Each feature descriptor ends up in one leaf, which can be stored as a simple index: the index of this leaf in the tree. The image descriptor is then represented by the collection of used leaf indices.
It is now possible to see if different images share the same content by comparing these image descriptors.
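The sketch below illustrates the retrieval idea under a simplifying assumption: a flat k-means vocabulary replaces the hierarchical vocabulary tree, each image becomes a normalized histogram of visual-word (leaf) indices, and candidate pairs are found by comparing these histograms.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_vocabulary(all_descriptors, n_words=1024):
    # all_descriptors: (N, 128) float32 SIFT descriptors pooled from many images
    centroids, _ = kmeans2(all_descriptors, n_words, minit='points')
    return centroids

def image_descriptor(descriptors, vocabulary):
    # Assign each feature descriptor to its nearest visual word (leaf index)
    words, _ = vq(descriptors, vocabulary)
    hist, _ = np.histogram(words, bins=np.arange(len(vocabulary) + 1))
    hist = hist.astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-8)

def retrieve_candidates(query, database, top_k=10):
    # database: (M, n_words) stack of image descriptors; smaller distance
    # means the two images are more likely to share content
    dists = np.linalg.norm(database - query, axis=1)
    return np.argsort(dists)[:top_k]
```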
FEATURES MATCHING
The objective of this step is to match all features between candidate image pairs.
First, we perform photometric matching between the sets of descriptors from the 2 input images. For each feature in image A, we obtain a list of candidate features in image B. As the descriptor space is not a linear, well-defined space, we cannot rely on absolute distance values to decide whether a match is valid (we can only use an absolute upper bound on the distance). To remove bad candidates, we assume that there is only one valid match in the other image. So, for each feature descriptor of the first image, we look for the 2 closest descriptors in the second image and apply a relative threshold between them. This assumption discards features on repetitive structures but has proved to be a robust criterion [Lowe2004]. This provides a list of feature matching candidates based only on a photometric criterion. Finding the 2 closest descriptors in the second image for each feature is computationally intensive with a brute-force approach, but many optimized algorithms exist. The most common one is Approximate Nearest Neighbor, but there are alternatives like Cascading Hashing.
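A minimal sketch of this photometric step, using OpenCV's brute-force matcher and Lowe's ratio test; for large descriptor sets an approximate nearest-neighbor index would replace the brute-force search, and the ratio value used here is only illustrative.

```python
import cv2

def photometric_matches(desc_a, desc_b, ratio=0.8):
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # For each descriptor in image A, find the 2 closest descriptors in image B
    knn = matcher.knnMatch(desc_a, desc_b, k=2)
    good = []
    for pair in knn:
        if len(pair) < 2:
            continue
        best, second = pair
        # Keep the match only if the best candidate is clearly better than
        # the second one (relative threshold between the two distances).
        if best.distance < ratio * second.distance:
            good.append(best)
    return good
```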
Then, we use the feature positions in the images to perform a geometric filtering using epipolar geometry in an outlier detection framework called RANSAC (RANdom SAmple Consensus). We randomly select a small set of feature correspondences and compute the fundamental (or essential) matrix, then we check the number of features that validate this model, and iterate through the RANSAC framework.
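A hedged sketch of this geometric filtering with OpenCV: the fundamental matrix is estimated with RANSAC and only the matches consistent with the epipolar geometry are kept; the pixel threshold is an illustrative value.

```python
import cv2
import numpy as np

def geometric_filter(kps_a, kps_b, matches):
    pts_a = np.float32([kps_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kps_b[m.trainIdx].pt for m in matches])
    # Robust estimation of the fundamental matrix (3 px threshold, 99.9% confidence)
    F, inlier_mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 3.0, 0.999)
    inliers = [m for m, keep in zip(matches, inlier_mask.ravel()) if keep]
    return F, inliers
```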
STRUCTURE FROM MOTION
The objective of this step is to understand the geometric relationship behind all the observations provided by the input images, and infer the rigid scene structure (3D points) with the pose (position and orientation) and internal calibration of all cameras. The Incremental pipeline is a growing reconstruction process. It first computes an initial two-view reconstruction that is iteratively extended by adding new views.
First, it fuses all feature matches between image pairs into tracks. Each track is supposed to represent a point in space, visible from multiple cameras. However, at this step of the pipeline, the set of tracks still contains many outliers. During this fusion of matches, we remove incoherent tracks.
Then, the incremental algorithm has to choose the best initial image pair. This choice is critical for the quality of the final reconstruction. It should indeed provide robust matches and contain reliable geometric information. So, this image pair should maximize the number of matches and the distribution of the corresponding features in each image. But at the same time, the angle between the cameras should also be large enough to provide reliable geometric information.
Then we compute the fundamental matrix between these 2 images and consider the first one to be the origin of the coordinate system. Now that we know the pose of the first 2 cameras, we can triangulate the corresponding 2D features into 3D points.
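A minimal sketch of this two-view bootstrap, assuming the camera intrinsics K are known (so the essential matrix is estimated instead of the fundamental matrix): the first camera is taken as the origin, the relative pose of the second camera is recovered, and the inlier correspondences are triangulated.

```python
import cv2
import numpy as np

def bootstrap_two_views(pts_a, pts_b, K):
    # pts_a, pts_b: (N, 2) float32 pixel coordinates of matched features
    E, mask = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts_a, pts_b, K, mask=mask)
    in_a = pts_a[mask.ravel() > 0]
    in_b = pts_b[mask.ravel() > 0]

    # Projection matrices: camera A is the coordinate origin, camera B is [R | t]
    P_a = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P_b = K @ np.hstack([R, t])

    pts4d = cv2.triangulatePoints(P_a, P_b, in_a.T, in_b.T)
    pts3d = (pts4d[:3] / pts4d[3]).T   # homogeneous -> Euclidean
    return R, t, pts3d
```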
After that, we select all the images that have enough associations with the features already reconstructed in 3D. This algorithm is called next best views selection. Based on these 2D-3D associations, it performs the resectioning of each of these new cameras. The resectioning is a Perspective-n-Point (PnP) algorithm in a RANSAC framework that finds the camera pose validating most of the feature associations. For each camera, a non-linear minimization is then performed to refine the pose.
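A sketch of the resectioning from 2D-3D associations, using OpenCV's PnP within RANSAC followed by a non-linear pose refinement; known intrinsics K, no distortion, and the thresholds shown are assumptions.

```python
import cv2
import numpy as np

def resection_camera(points3d, points2d, K):
    # points3d: (N, 3) reconstructed points, points2d: (N, 2) observed features
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.float32(points3d), np.float32(points2d), K, None,
        iterationsCount=1000, reprojectionError=4.0)
    if not ok:
        return None
    # Non-linear (Levenberg-Marquardt) refinement of the pose on the inliers
    rvec, tvec = cv2.solvePnPRefineLM(
        np.float32(points3d)[inliers.ravel()],
        np.float32(points2d)[inliers.ravel()], K, None, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec, inliers
```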
From these new camera poses, some tracks become visible in 2 or more resected cameras, so we triangulate them. Then, we launch a Bundle Adjustment to refine everything: the extrinsic and intrinsic parameters of all cameras as well as the positions of all 3D points. We filter the results of the Bundle Adjustment by removing all observations that have a high reprojection error or an insufficient angle between observations.
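To make the filtering criterion concrete, the sketch below computes the reprojection error of one observation and the maximum parallax angle of a 3D point; the thresholds are illustrative, not AliceVision's defaults.

```python
import numpy as np

def reprojection_error(point3d, point2d, K, R, t):
    # Project the 3D point into the camera (R, t) and measure the pixel error
    p_cam = R @ point3d + t.ravel()
    p_img = K @ p_cam
    p_img = p_img[:2] / p_img[2]
    return np.linalg.norm(p_img - point2d)

def max_parallax_angle(point3d, camera_centers):
    # Largest angle (degrees) between the viewing rays of a 3D point
    rays = [(c - point3d) / np.linalg.norm(c - point3d) for c in camera_centers]
    best = 0.0
    for i in range(len(rays)):
        for j in range(i + 1, len(rays)):
            cosang = np.clip(np.dot(rays[i], rays[j]), -1.0, 1.0)
            best = max(best, np.degrees(np.arccos(cosang)))
    return best

def keep_point(errors_px, angle_deg, max_error=4.0, min_angle=2.0):
    # Keep a 3D point only if all its observations reproject well and the
    # observations are separated by a sufficient angle.
    return max(errors_px) < max_error and angle_deg > min_angle
```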
As we have triangulated new points, we get more image candidates for the next best views selection. We iterate in this way, adding cameras, triangulating new 2D features into 3D points and removing 3D points that have become invalid, until we can no longer localize new views.
Many other approaches exist, like Global [Moulon2013], Hierarchical [Havlena2010], [Toldo2015] or multi-stage [Shah2014] approaches.
DEPTH MAPS ESTIMATION
For all cameras that have been resolved by SfM, we want to retrieve the depth value of each pixel. Many approaches exist, like Block Matching, Semi-Global Matching (SGM) [Hirschmüller2005], [Hirschmüller2008] or ADCensus [Xing2011]. We will focus on the SGM method implemented in AliceVision.
For each image, we select the N best/closest cameras around it. We select fronto-parallel planes based on the intersection of the optical axis with the pixels of the selected neighboring cameras. This creates a volume (W, H, Z) with many depth candidates per pixel. We estimate the similarity for all of them. The similarity is computed by the Zero Mean Normalized Cross-Correlation (ZNCC) of a small patch in the main image reprojected into the other camera. This creates a volume of similarities. For each neighboring image, we accumulate similarities into this volume. This volume is very noisy. We apply a filtering step along the X and Y axes which accumulates local costs and drastically reduces the score of isolated high values. We finally select the local minima and replace the selected plane indices with depth values, stored in a depth map. This depth map has banding artifacts as it is based on the original selection of depth values. So a refinement step is applied to get depth values with sub-pixel accuracy.
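For illustration, a minimal sketch of the ZNCC similarity and of how one cell of the similarity volume could be filled; the reproject helper, which maps a pixel and a depth hypothesis into the neighboring image, is a hypothetical placeholder that depends on the actual camera geometry.

```python
import numpy as np

def zncc(patch_a, patch_b, eps=1e-8):
    # Zero Mean Normalized Cross-Correlation: close to 1 for similar patches
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    return float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def similarity_scores(ref_img, neighbor_img, x, y, depth_candidates, reproject):
    # 'reproject(x, y, d)' is a hypothetical helper returning the integer pixel
    # coordinates of (x, y) at depth d in the neighboring image.
    half = 2  # 5x5 patch
    ref_patch = ref_img[y - half:y + half + 1, x - half:x + half + 1]
    scores = []
    for d in depth_candidates:
        xn, yn = reproject(x, y, d)
        nb_patch = neighbor_img[yn - half:yn + half + 1, xn - half:xn + half + 1]
        scores.append(zncc(ref_patch, nb_patch))
    return scores  # one score per depth plane for this pixel
```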
All these depth maps can be computed independently in parallel. Then we apply a filtering step to ensure consistency between multiple cameras. A compromise is chosen based on both similarity value and the number of coherent cameras to keep weakly supported surfaces without adding artefacts.
MESHING
The objective of this step is to create a dense geometric surface representation of the scene.
First, we fuse all the depth maps into a global octree where compatible depth values are merged into the octree cells.
We then perform a 3D Delaunay tetrahedralization. A complex voting procedure is then applied to compute weights on cells and weights on the facets connecting them, as explained in [Jancosek2011] and [Jancosek2014].
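A tiny sketch of the tetrahedralization step on a stand-in point cloud using SciPy; the visibility-based voting and the graph cut that follow are beyond the scope of this snippet.

```python
import numpy as np
from scipy.spatial import Delaunay

points = np.random.rand(1000, 3)   # stand-in for points fused from the depth maps
tetra = Delaunay(points)           # 3D input -> Delaunay tetrahedralization
print(tetra.simplices.shape)       # (n_cells, 4): vertex indices of each tetrahedron
```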
A Graph Cut Max-Flow [Boykov2004] is applied to optimally cut the volume. This cut represents the extracted mesh surface. We filter bad cells on the surface. We finally apply a Laplacian filtering on the mesh to remove local artefacts.
At this point, the mesh can also be simplified to reduce unnecessary vertices.
TEXTURING
The objective of this step is to texture the generated mesh.
If the mesh has no associated UVs, automatic UV maps are computed. AliceVision implements a basic UV mapping approach to minimize the texture space. The standard UV mapping approach is provided by [Levy2002].
For each triangle, we use the visibility information associated with each vertex to retrieve the texture candidates. We filter out the cameras without a good angle to the surface to favor fronto-parallel cameras, and finally average the pixel values. Instead of a naive averaging, we use a generalization of the multi-band blending described in [Burt1983], so we average more views in the low frequencies than in the high frequencies. This approach is in the same spirit as [Baumberg2002] and [Allene2008].
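The sketch below only illustrates the per-triangle view selection: cameras seeing the triangle too obliquely are discarded and the remaining contributions are weighted by how fronto-parallel they are; it is a simple stand-in and does not reproduce the multi-band blending actually used.

```python
import numpy as np

def view_weights(triangle_normal, triangle_center, camera_centers, min_cos=0.2):
    # Weight each candidate camera by the cosine of the angle between the
    # surface normal and the viewing direction; reject oblique views.
    n = triangle_normal / np.linalg.norm(triangle_normal)
    weights = []
    for c in camera_centers:
        view_dir = c - triangle_center
        view_dir /= np.linalg.norm(view_dir)
        cosang = float(np.dot(n, view_dir))
        weights.append(cosang if cosang > min_cos else 0.0)
    w = np.array(weights)
    return w / w.sum() if w.sum() > 0 else w
```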
LOCALIZATION
Based on the SfM results, we can perform camera localization and retrieve the motion of an animated camera in the scene of the 3D reconstruction.
CAMERA CALIBRATION
The internal camera parameters can be calibrated from multiple views of a checkerboard. This allows the focal length, principal point and distortion parameters to be retrieved. A detailed explanation is presented in [opencvCameraCalibration].
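A minimal checkerboard calibration sketch with OpenCV in the spirit of the referenced tutorial; the pattern size and the physical square size are assumptions about the calibration target.

```python
import cv2
import numpy as np

def calibrate(images, pattern=(9, 6), square_size=0.025):
    # images: list of grayscale views of a checkerboard with 9x6 inner corners
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square_size
    obj_points, img_points = [], []
    for img in images:
        found, corners = cv2.findChessboardCorners(img, pattern)
        if found:
            corners = cv2.cornerSubPix(
                img, corners, (11, 11), (-1, -1),
                (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
            obj_points.append(objp)
            img_points.append(corners)
    # K holds the focal length and principal point; dist holds the distortion
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, images[0].shape[::-1], None, None)
    return K, dist, rms
```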
SINGLE CAMERA LOCALIZATION
We use the algorithm presented in the image matching section to localize the closest images in the SfM results. Then we perform feature matching with those images as well as with the N previous frames. From this we directly get 2D-3D associations, which are used to localize the camera. Finally, a global Bundle Adjustment is performed to refine the camera parameters.
RIG OF CAMERAS
If a rig of cameras is used, we can perform the rig calibration. We localize the cameras individually over the whole sequence. Then we use all valid poses to compute the relative poses between the cameras of the rig and choose the most stable values across the images. We initialize the rig relative poses with these values and perform a global Bundle Adjustment on all the cameras of the rig. Once the rig is calibrated, we can use it to directly localize the rig pose from the synchronized multi-camera system using the approaches of [Kneip2014].