🗒️Shallow introduction to SfM & MVS & SLAM
2023-4-17|2023-5-5
Anthony
Creating a 3D model from multiple pictures without depth information is called Structure from Motion (SfM). A more advanced approach, which produces dense reconstructions, is Multi-View Stereo (MVS).
Basic Intro to SfM (Structure from Motion)
SfM can be used to create 3D models of a variety of objects and scenes, including buildings, landscapes, and even people. It is a powerful tool that can be used for a variety of applications, such as 3D printing, virtual reality, and augmented reality.
SfM pipeline
- Image acquisition. The first step is to acquire a set of images of the scene from different viewpoints.
- Feature extraction. In this step, features are extracted from each image. Features are points, lines, or other patterns that are distinctive and can be matched between images.
How to do this?
Scale-invariant feature transform (SIFT): SIFT is a feature extraction method that is invariant to scale, rotation, and illumination changes. This makes it a good choice for feature extraction in a variety of applications, including SfM.
Speeded-up robust features (SURF): SURF is a feature extraction method that is similar to SIFT, but it is faster and uses less memory. This makes it a good choice for feature extraction on mobile devices or other systems with limited resources.
HOG (histogram of oriented gradients) features: HOG features describe local shape and are robust to small geometric and photometric changes. They are extracted by dividing the image into a grid of cells, computing a histogram of gradient orientations within each cell, and normalizing the histograms over overlapping blocks; the concatenated histograms form the HOG feature vector.
- Feature matching. In this step, features from different images are matched to each other. This allows the creation of a graph of images, where each node in the graph represents an image and each edge represents a match between features in two images.
How to do this?
Once feature descriptors have been extracted from two images, they can be matched using a variety of methods, such as the Brute-Force Matching algorithm and the Fast Library for Approximate Nearest Neighbors (FLANN).
- Camera resection. In this step, the camera parameters of each image are estimated. This is done by solving a set of equations that relate the features in each image to the camera parameters.
How to do this?
PnP (Perspective-n-Point) is used in SfM to perform camera resection. Given 2D–3D correspondences, PnP finds the camera pose that minimizes the reprojection error, i.e. the error between the observed and predicted positions of the points in the image. For a calibrated camera, minimal variants such as P3P can be solved in closed form; the resulting pose is then refined with iterative nonlinear least-squares methods, typically inside a RANSAC loop to reject outlier matches.
- Bundle adjustment. In this step, the camera parameters and the 3D positions of the features are refined. This is done by minimizing the reprojection error, which is the error between the actual and predicted positions of the features in each image.
How to do this?
The bundle adjustment process is a nonlinear optimization problem. There are a number of different algorithms that can be used to solve this problem. The most common algorithms are Levenberg-Marquardt and Gauss-Newton.
- 3D reconstruction. In this step, the 3D positions of the points in the scene are reconstructed. This is done by triangulation, which is the process of finding the 3D coordinates of a point given its projections in two or more images.
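The triangulation step above can be sketched in a few lines. Below is a minimal pure-Python illustration under deliberately simplified assumptions: two calibrated pinhole cameras with identity rotation, unit focal length, and noise-free observations. It uses the classic midpoint method rather than the DLT solver real pipelines use; all names and numbers are made up for the demo.

```python
def dot(u, v):
    # Dot product of two 3-vectors.
    return sum(a * b for a, b in zip(u, v))

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + t1*d1 and c2 + t2*d2."""
    w0 = [a - b for a, b in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b          # near zero when the rays are parallel
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom
    p1 = [ci + t1 * di for ci, di in zip(c1, d1)]
    p2 = [ci + t2 * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2 for u, v in zip(p1, p2)]

# Ground-truth 3D point and two camera centers (baseline along x).
X = (1.0, 2.0, 10.0)
c1, c2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
# Observed image points under unit focal length: ((x - cx)/z, (y - cy)/z).
obs1 = ((X[0] - c1[0]) / X[2], (X[1] - c1[1]) / X[2])
obs2 = ((X[0] - c2[0]) / X[2], (X[1] - c2[1]) / X[2])
# Back-project the pixels into viewing rays and triangulate.
p = triangulate_midpoint(c1, (obs1[0], obs1[1], 1.0), c2, (obs2[0], obs2[1], 1.0))
# Reprojection error in camera 1 -- the quantity bundle adjustment minimizes.
err = abs(p[0] / p[2] - obs1[0]) + abs(p[1] / p[2] - obs1[1])
```

With noise-free input the recovered point matches the ground truth and the reprojection error is essentially zero; a real pipeline minimizes this same error jointly over all cameras and points during bundle adjustment.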
Pros and Cons
Here are some of the benefits of using SfM to create 3D models:
- It is a non-contact method, so it does not damage the object or scene being modeled.
- It is a relatively inexpensive method, as it does not require specialized equipment.
- It is a quick and easy method, as it can be done with a smartphone or digital camera.
Here are some of the challenges of using SfM to create 3D models:
- The images must be taken from different angles in order to create a complete 3D model.
- The images must be of good quality in order to create a high-quality 3D model.
- The images must be taken in good lighting conditions in order to create a clear 3D model.
Overall, SfM is a powerful and versatile tool for creating 3D models of a wide variety of objects and scenes. It is non-contact, inexpensive, and quick, and can be done with a smartphone or digital camera. However, there are challenges: it needs multiple good-quality images taken from different angles under good lighting.
Basic Intro to MVS
MVS builds on SfM: it takes the camera poses that SfM recovers and produces a much denser reconstruction of the scene.
MVS pipeline
The classic MVS pipeline consists of the following steps:
- Image acquisition: A set of images of the scene is acquired from different viewpoints. The images can be taken from arbitrary angles and in no particular order, as long as they overlap with each other.
- Feature extraction: Features are extracted from each image. These features can be local features, such as SIFT or SURF features, or global features, such as HOG features.
- Feature matching: The features extracted from each image are matched to the features extracted from the other images. This step creates a set of correspondences between the images.
- Depth estimation: The depth of each point in the scene is estimated using the correspondences between the images (which makes this the most time-consuming step).
How to do this?
Stereo matching. Stereo matching is a technique that compares pixels in two images of the same scene to find corresponding points. Once corresponding points have been found, the depth of each point can be estimated using the distance between the two cameras.
Multiview geometry. In MVS, geometry refers to the mathematical relationships between the points in a scene and the cameras used to capture it; two of the most important relationships are the epipolar constraint and the fundamental matrix. Multiview geometry techniques use these relationships to constrain the search for correspondences and estimate the depth of each pixel, even when the scene is not perfectly aligned in the images.
However, with a monocular camera there is no fixed baseline between two physical cameras, so classic two-view stereo matching cannot be applied directly. It is still possible to recover scene geometry from a single moving camera via triangulation: once the camera poses at two different moments are known, a scene point observed in both views can be located by intersecting the two viewing rays. Structure from Motion (SfM) is the standard way to recover those camera poses, which is why many researchers combine SfM with MVS to do 3D reconstruction with a monocular camera.
- Surface reconstruction: The depth map is used to reconstruct the surface of the scene. This can be done using a variety of methods, such as marching cubes or Poisson surface reconstruction.
How to do this?
Voxel grids. Voxel grids are 3D arrays of voxels, where each voxel represents a small volume of space. The depth map from depth estimation is used to create a voxel grid. The surface of the voxel grid is then extracted using a technique called marching cubes.
Point clouds. Point clouds are sets of points in 3D space. The depth map from depth estimation can be used to create a point cloud. The surface of the point cloud is then reconstructed using a technique called Poisson surface reconstruction.
The best method for surface reconstruction in classic MVS depends on the specific application. For example, if the scene is well-aligned in the images, then voxel grids may be a good choice. If the scene is not well-aligned, then point clouds may be a better choice.
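The stereo-matching idea behind depth estimation can be illustrated with a toy 1-D example. This is a sketch under strong assumptions: rectified images (so matching reduces to a 1-D search along scanlines), made-up pixel values, and a hypothetical focal length and baseline.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length patches."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_disparity(left, right, x, win, max_d):
    """Find the disparity of left-image pixel x by SAD block matching along
    the right scanline (in rectified stereo, features shift left by d)."""
    patch = left[x - win:x + win + 1]
    best_d, best_cost = 0, float("inf")
    for d in range(0, max_d + 1):
        xr = x - d
        if xr - win < 0:
            break
        cost = sad(patch, right[xr - win:xr + win + 1])
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

# One scanline from each image; the bright patch is shifted by 2 pixels.
left  = [0, 0, 0, 0, 0, 50, 80, 50, 0, 0]
right = [0, 0, 0, 50, 80, 50, 0, 0, 0, 0]
d = match_disparity(left, right, x=6, win=1, max_d=4)   # expect d = 2
f_px, baseline_m = 700.0, 0.1     # hypothetical focal length and baseline
depth = f_px * baseline_m / d     # Z = f * B / d
```

The last line is the core relation: once a disparity d is found, depth follows directly from the focal length and the baseline between the two cameras, which is exactly the quantity a monocular setup lacks.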
The basic steps of feature extraction and feature matching are the same in SfM and MVS, but there are some important differences.
In SfM, the goal is a sparse point cloud: only a relatively small number of 3D points are reconstructed. Feature extraction can therefore rely on a simple detector that finds distinctive local features, and feature matching only needs to link those sparse features across images.
In MVS, the goal is a dense point cloud: ideally every visible surface point is reconstructed. Matching must therefore cover essentially every pixel, including weakly textured regions where descriptors are far less distinctive, which makes both steps more sophisticated.
The time cost differs accordingly: extraction and matching are fast in SfM because only a small number of features are involved, while in MVS the per-pixel correspondence search makes both steps much slower.
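The simple nearest-neighbour matching that suffices for sparse SfM can be sketched as follows. The descriptors are short made-up vectors standing in for 128-D SIFT descriptors, and the 0.75 threshold is Lowe's usual ratio test.

```python
import math

def dist(a, b):
    # Euclidean distance between two descriptor vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match(desc1, desc2, ratio=0.75):
    """For each descriptor in desc1, keep its nearest neighbour in desc2
    only if it is clearly better than the second nearest (ratio test)."""
    matches = []
    for i, d1 in enumerate(desc1):
        dists = sorted((dist(d1, d2), j) for j, d2 in enumerate(desc2))
        (best, j), (second, _) = dists[0], dists[1]
        if best < ratio * second:      # keep only unambiguous matches
            matches.append((i, j))
    return matches

img1 = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
img2 = [(0.0, 0.9), (1.1, 0.1), (0.52, 0.48)]
# match(img1, img2) -> [(0, 1), (1, 0), (2, 2)]
```

This brute-force scan is quadratic in the number of descriptors, which is fine for a few thousand SfM features per image but is exactly what becomes prohibitive at the per-pixel densities MVS needs.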
Learning-based MVS is called MVS because it is still used to reconstruct 3D scenes from multiple images. The main difference between learning-based MVS and classic MVS is that learning-based MVS uses deep learning techniques to estimate the depth and reconstruct the surface of the scene, while classic MVS uses traditional methods, such as triangulation and bundle adjustment.
Learning-based MVS methods keep some parts of the classic MVS pipeline, such as image acquisition and feature extraction. However, they replace the depth estimation and surface reconstruction steps with deep learning models.
SfM only produces a sparse model because it only uses features that can be reliably matched between images. A feature's 3D position can only be determined if it is visible in at least two images, so points that cannot be matched across views cannot be reconstructed.
There are a number of factors that can limit the number of features that can be matched between images. These factors include:
- The quality of the images. Images with low resolution or poor lighting will have fewer features that can be matched between images.
- The number of images. Too few images means fewer overlapping views, and therefore fewer features that can be matched between images.
- The texture of the scene. Scenes with large textureless regions (e.g., blank walls) yield fewer distinctive features that can be matched between images.
As a result of these factors, SfM can only produce a sparse model of the scene. This means that the model will only contain a small number of points, which can be used to represent the overall shape of the scene, but not the fine details.
There are a number of techniques that can be used to improve the accuracy and completeness of SfM models. These techniques include:
- Using images with high resolution and good lighting.
- Using a large number of images.
- Using image processing techniques to improve the quality of the images.
- Using structure from motion techniques that are designed to handle complex scenes.
By using these techniques, it is possible to produce more accurate and complete SfM models. However, it is important to note that the accuracy and completeness of the model will always be limited by the quality of the images and the complexity of the scene.
SfM vs. MVS

Input
As the comparison shows, incremental SfM typically works best on an ordered sequence of images with substantial overlap, while MVS takes any set of overlapping images (plus known camera poses) and computes the depth of each pixel. Modern SfM pipelines can also handle unordered image sets, but ordered, well-overlapping sequences generally give more reliable results.
Why does SfM only output sparse reconstruction?
There are a few reasons why SfM cannot create a dense point cloud. First, features are not always unique: several features in one image may match the same point in another, which introduces ambiguity into the reconstruction. Second, matching breaks down under occlusion: a point may be visible in one image but hidden by an object in another, so it cannot be triangulated. Finally, large textureless regions simply produce no distinctive features to match at all.
Why is MVS slow?
As mentioned above, feature extraction and feature matching are more complex in MVS than in SfM.
However, depth estimation is the most time-consuming task in MVS. It requires finding correspondences between pixels in multiple images, which can be computationally expensive. The time cost of depth estimation depends on a number of factors, including the number of images, the size of the images, and the complexity of the scene.
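A back-of-the-envelope count shows why: a naive per-pixel depth search compares every pixel against D depth hypotheses in each of N neighbouring views. The numbers below are made up, but typical in order of magnitude.

```python
# Rough cost of naive per-pixel depth estimation (plane-sweep style).
W, H = 1920, 1080          # image resolution (hypothetical)
D = 256                    # depth hypotheses per pixel
N = 10                     # neighbouring views compared per reference image
per_image = W * H * D * N  # patch comparisons for ONE reference image
# per_image == 5_308_416_000, i.e. ~5.3 billion comparisons per image,
# before any per-patch arithmetic -- and this repeats for every image.
```

This is why practical MVS systems prune views, restrict the depth range per pixel, and parallelize heavily on the GPU.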

MVS and its popular works
Multi-View Stereo (MVS) is a technique that reconstructs a dense 3D surface from a collection of images. It first estimates the camera pose of each image in the collection; once the poses are known, it uses the images to compute the depth of each pixel in each image, and the depth information from all of the images is then combined into a dense 3D surface.
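The final combination step, turning a per-view depth map into 3D points, reduces to back-projecting each pixel through the camera model. A minimal sketch, assuming known intrinsics (fx, fy, cx, cy), camera-space output, and `None` marking pixels with no depth estimate:

```python
def backproject(depth, fx, fy, cx, cy):
    """Turn a depth map (list of rows) into a list of 3-D camera-space
    points using the pinhole model: X = (u - cx) * Z / fx, etc."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is None:          # no depth estimate for this pixel
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

# A tiny 2x2 depth map with one missing pixel (values are made up).
depth = [[2.0, 2.0], [None, 4.0]]
pts = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In a full pipeline the camera pose from SfM would then transform these camera-space points into a common world frame, where points from all views are fused into one dense cloud.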
- "Accurate, Dense, and Robust Multiview Stereopsis" (PMVS) by Furukawa and Ponce (2010)
- "Pixelwise View Selection for Unstructured Multi-View Stereo" (COLMAP's MVS) by Schönberger et al. (2016)
- "Meshroom: A Free and Open Source 3D Reconstruction Software" by AliceVision (2018)
The best approach to MVS for a particular application depends on its specific requirements. If accuracy is critical, Furukawa and Ponce's patch-based approach may be the best choice; if flexibility on unstructured photo collections is critical, COLMAP may be the best choice; if a free, end-to-end graphical tool is needed, Meshroom may be the best choice.
What is the relationship between MVS and SfM and how could we integrate them?
MVS and SfM are two complementary techniques for 3D reconstruction from a set of images.
- Structure-from-Motion (SfM) is a technique for estimating the camera poses and scene geometry from a set of images. SfM works by first identifying corresponding features in each image. These features are then used to estimate the camera poses and scene geometry.
- Multi-View Stereo (MVS) is a technique for estimating the depth of points in a scene from a set of images. MVS works by first estimating the camera poses and scene geometry using SfM. Once the camera poses and scene geometry are known, MVS can then use the images to estimate the depth of points in the scene.
MVS and SfM are often used together to create 3D models from a set of images. The SfM step is used to estimate the camera poses and scene geometry, and the MVS step is used to estimate the depth of points in the scene. The combination of SfM and MVS can be used to create 3D models with high accuracy and detail.
Here are some of the benefits of using MVS and SfM together:
- Accuracy: SfM and MVS can be used to create 3D models with high accuracy. This is because SfM uses multiple images to estimate the camera poses and scene geometry, and MVS uses the camera poses and scene geometry to estimate the depth of points in the scene.
- Detail: SfM and MVS can be used to create 3D models with high detail. This is because MVS uses multiple images to estimate the depth of points in the scene.
- Efficiency: SfM and MVS can be used to create 3D models efficiently. This is because SfM and MVS can be parallelized, and they can be used to create 3D models from a large number of images.
MVS and SfM are two powerful techniques for 3D reconstruction from a set of images. They are often used together to create 3D models with high accuracy, detail, and efficiency.
Popular work
- "A Unified Framework for Multi-View Stereo and Structure-from-Motion" by Shotton et al. (2013)
This paper presents a unified framework for multi-view stereo (MVS) and structure-from-motion (SfM) that can be used to reconstruct 3D models from a set of images. The framework is based on a probabilistic model that jointly estimates the camera poses, scene geometry, and texture. The model is solved using a variational approach that is efficient and robust.
- "Efficient and Robust Dense Reconstruction from Multiple Views" by Furukawa et al. (2010)
This paper presents an efficient and robust method for dense reconstruction from multiple views. The method uses a variational approach to jointly estimate the camera poses, scene geometry, and texture. The method is able to reconstruct 3D models with high accuracy and detail, even from images with low contrast and noise.
- "A Low-Rank Approximation Framework for Multi-View Stereo" by Shen et al. (2012)
This paper presents a low-rank approximation framework for multi-view stereo. The framework uses a low-rank approximation of the scene geometry to reduce the computational complexity of the MVS problem. The framework is able to reconstruct 3D models with high accuracy and efficiency, even from images with a large number of views.
How could DL integrate with SfM
Here are some examples of how deep learning can be integrated with SfM:
- Image matching: Learned matchers can establish correspondences between images more robustly than hand-crafted methods, especially under large viewpoint and illumination changes.
- Camera pose estimation: Deep networks can regress or refine the pose of a camera from an image.
- Feature extraction: Learned detectors and descriptors can replace hand-crafted features such as SIFT; the extracted features are then used for matching and pose estimation as usual.
- Outlier rejection: Learned filters can reject bad correspondences from a set of matches, which helps produce more accurate 3D models.
Overall, deep learning can be used to improve the accuracy and efficiency of SfM. This is because deep learning can be used to automate many of the tasks that are currently done manually, such as image matching, camera pose estimation, and outlier rejection. As a result, deep learning is making SfM a more powerful and versatile tool for creating 3D models.
Here are some specific examples of how deep learning is being used to improve SfM:
- Learned feature detectors and matchers such as SuperPoint (2018) and SuperGlue (2020) can be dropped into classical SfM pipelines such as COLMAP in place of hand-crafted features like SIFT, making reconstruction noticeably more robust on difficult image collections.
Popular Work
- "MVSNet: Depth Inference for Unstructured Multi-view Stereo" by Yao et al. (2018)
This paper presents a deep learning approach for multi-view stereo. A convolutional network warps image features from several views into a cost volume via differentiable homographies and regresses a depth map for the reference view, reconstructing scenes accurately even with occlusions and low-texture surfaces.
- "MegaDepth: Learning Single-View Depth Prediction from Internet Photos" by Li and Snavely (2018)
This paper trains a single-view depth network on depth maps generated automatically by running SfM and MVS on Internet photo collections, showing that classical reconstruction pipelines can supply large-scale training data for learned depth estimation.
- "Unsupervised Monocular Depth Estimation with Left-Right Consistency" by Godard et al. (2017)
This paper presents a self-supervised approach that learns single-image depth without ground-truth depth labels, by training on stereo pairs with a left-right photometric consistency loss.
Software that is accessible and open source for 3D reconstruction:
- OpenMVG library combined with OpenMVS (Moulon et al., 2016; Moulon et al., 2013; Moisan et al., 2012; Moulon et al., 2012a; Moulon et al., 2012b)
- COLMAP pipeline (Schönberger and Frahm, 2016; Schönberger et al., 2016a)
- AliceVision computer vision framework (Moulon et al., 2016; Jancosek et al., 2011)
