Single View to 3D¶

Nader Zantout¶

This work was initially done in partial fulfillment of the requirements of the course Learning for 3D Vision at CMU. It is an implementation of various single-view-to-3D models using voxel, point cloud, and mesh representations. Training was done on a reduced version of the ShapeNet dataset consisting of only the chair class in all sections but section 3.3, in which the models were trained on the chair, car, and plane classes. Here is the code.

1. Exploring loss functions¶

This section explores the binary cross-entropy loss used for voxels, the Chamfer loss used for point clouds, and, for mesh representations, the Chamfer loss combined with a smoothness regularization term. Each loss is used to fit a single object directly to its target representation by minimizing the single-instance loss.
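As a rough sketch, two of these single-instance losses can be written in plain PyTorch. This brute-force version is for illustration only; a real implementation would more likely use pytorch3d's chamfer_distance (and mesh_laplacian_smoothing for the regularization term), and the function names here are my own:

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_logits, gt_occupancy):
    # Binary cross-entropy between predicted occupancy logits and the
    # ground-truth 0/1 voxel grid, averaged over all cells.
    return F.binary_cross_entropy_with_logits(pred_logits, gt_occupancy)

def chamfer_loss(pred_pts, gt_pts):
    # Symmetric Chamfer loss between two (B, N, 3) point clouds, using a
    # brute-force pairwise distance matrix (fine for small N).
    d = torch.cdist(pred_pts, gt_pts)            # (B, N_pred, N_gt)
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()
```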

1.1 Fitting a voxel grid¶

Ground Truth Optimized
GT voxel grid Optimized voxel grid

1.2 Fitting a point cloud¶

Ground Truth Optimized
GT point cloud Optimized point cloud

1.3 Fitting a mesh¶

Ground Truth Optimized
GT mesh Optimized mesh

2. Reconstructing 3D from single view¶

2.1. Image to voxel grid¶

Mesh # Input RGB Predicted Ground Truth Mesh
0 [image] [image] [image]
500 [image] [image] [image]
600 [image] [image] [image]

The network I used for this task is based on the one in the paper Pix2Vox: Context-aware 3D Reconstruction from Single and Multi-view Images. The decoder begins by reshaping the b x 512 input into a b x 512 x 1 x 1 x 1 volume tensor. This is followed by a 3D transposed convolutional layer with 256 filters of size 2 x 2 x 2, stride 1, and no padding, giving an output of size b x 256 x 2 x 2 x 2, followed by a ReLU activation. Next come four 3D transposed convolutional layers with filters of size 4 x 4 x 4, stride 2, padding 1, and [128, 64, 32, 8] filters respectively. With this configuration, each layer doubles the depth, width, and height of its input, so we end up with an output of size b x 8 x 32 x 32 x 32. Each of these four layers is followed by a batch normalization layer and a ReLU activation. Finally, a transposed convolutional layer with a single 1 x 1 x 1 filter, stride 1, and no padding, followed by a sigmoid activation, gives us the output volume.
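A minimal PyTorch sketch of this decoder, following the layer configuration described above (the class name and module structure are mine, not taken from the Pix2Vox code):

```python
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    # Maps a b x 512 image feature to a b x 1 x 32 x 32 x 32 occupancy grid.
    def __init__(self):
        super().__init__()
        layers = [nn.ConvTranspose3d(512, 256, kernel_size=2), nn.ReLU()]
        in_ch = 256
        for out_ch in [128, 64, 32, 8]:
            # Each of these layers doubles depth, height, and width.
            layers += [
                nn.ConvTranspose3d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm3d(out_ch),
                nn.ReLU(),
            ]
            in_ch = out_ch
        layers += [nn.ConvTranspose3d(8, 1, kernel_size=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, feat):                     # feat: (b, 512)
        return self.net(feat.view(-1, 512, 1, 1, 1))
```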

This network was trained for 10,000 iterations with a learning rate of 4e-4, and a batch size of 32. On evaluation, a threshold of 0.2 was used to determine the occupied voxels.

This model achieved an average F1 score @0.05 of 49.6 on the test set.

2.2. Image to point cloud¶

Mesh # Input RGB Predicted Ground Truth Mesh
0 [image] [image] [image]
100 [image] [image] [image]
600 [image] [image] [image]

The point cloud prediction network is a simple multi-layer perceptron (MLP) with 2 hidden layers with 1024 and 2048 hidden units respectively followed by a ReLU each, and an output layer of size 3 * n_points with no nonlinearity.
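This head can be sketched as follows, assuming a 512-dimensional image feature as input (the encoder's output size is my assumption; the MLP itself matches the description above):

```python
import torch
import torch.nn as nn

def make_pointcloud_head(n_points=5000, feat_dim=512):
    # Two hidden layers (1024, 2048) with ReLUs, then a linear layer
    # producing 3 coordinates per predicted point.
    return nn.Sequential(
        nn.Linear(feat_dim, 1024), nn.ReLU(),
        nn.Linear(1024, 2048), nn.ReLU(),
        nn.Linear(2048, 3 * n_points),
    )

feat = torch.randn(4, 512)                        # batch of image features
points = make_pointcloud_head()(feat).view(4, 5000, 3)
```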

This network was trained for 10,000 iterations with a learning rate of 4e-4, a batch size of 32, and 5000 output points.

This model achieved an average F1 score @0.05 of 86.634 on the test set.

2.3. Image to mesh¶

Mesh # Input RGB Predicted Ground Truth Mesh
0 [image] [image] [image]
100 [image] [image] [image]
600 [image] [image] [image]

The mesh prediction network is an MLP similar to the point cloud prediction network with 2 hidden layers of 1024 and 2048 hidden units followed by a ReLU each, and an output layer of size 3 * (number of vertices) with no nonlinearity.
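Since the mesh head predicts per-vertex offsets of a fixed source mesh, it can be sketched like this; a random tensor stands in for the source vertices (in practice they would come from pytorch3d.utils.ico_sphere, whose subdivision level, and hence vertex count, I am assuming here):

```python
import torch
import torch.nn as nn

n_verts = 642                        # vertex count of an ico_sphere at level 3 (assumed)
head = nn.Sequential(                # same MLP shape as the point-cloud head
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, 2048), nn.ReLU(),
    nn.Linear(2048, 3 * n_verts),
)

src_verts = torch.randn(n_verts, 3)            # stand-in for ico_sphere vertices
offsets = head(torch.randn(2, 512)).view(2, n_verts, 3)
deformed = src_verts.unsqueeze(0) + offsets    # deformed meshes, (2, n_verts, 3)
```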

This network was trained for 10,000 iterations with a learning rate of 4e-4, a batch size of 32, and the default ico_sphere initialization.

This model achieved an average F1 score @0.05 of 81.899 on the test set.

2.4. Quantitative comparisons¶

Voxel Grid Point Cloud Mesh
[image] [image] [image]

The point cloud prediction model achieved the highest F1 score on the test dataset, followed by the mesh deformation model, then the voxel grid model with the lowest F1 score. This is expected, since:

  • The point cloud model is the most flexible of the three, as points can go anywhere and are not constrained by a particular topology or grid size. The best possible error, or Bayes error, for this model is effectively 0, as predicted points can be made arbitrarily close to the ground truth points (though in practice this is bounded by sampling stochasticity and the representational power of the network). Additionally, sampling the points more densely leads to a higher F1-score, as the average distances between the predicted points and the points sampled from the ground truth decrease.

  • The mesh model produces a more visually pleasing prediction, but is restricted by the topology of the initial mesh being deformed. If the initial mesh contains no topological holes, it cannot represent holes in the ground truth mesh, as can be seen in mesh 600 in section 2.3. The Bayes error for this model is therefore positive even with perfect reconstruction, and ground-truth meshes with holes lower the F1-score due to the contribution of the points sampled in the region of the hole.

  • The inherent discretization of the voxel grid model leads to a relatively high Bayes error even when perfect reconstruction is possible, as there will always be points that are most certainly not on the original mesh. Additionally, any outliers contribute more significantly to the error than outliers in a densely sampled point cloud or mesh model, due to the discretization. This leads to the lowest F1-score.
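The F1@0.05 metric behind these comparisons can be sketched as follows (a brute-force version for illustration; the function name is mine):

```python
import torch

def f1_score(pred_pts, gt_pts, thresh=0.05):
    # A predicted point counts as a true positive if it lies within
    # `thresh` of some ground-truth point (precision); symmetrically for
    # ground-truth points near a prediction (recall).
    d = torch.cdist(pred_pts, gt_pts)              # (N_pred, N_gt)
    precision = (d.min(dim=1).values < thresh).float().mean()
    recall = (d.min(dim=0).values < thresh).float().mean()
    return 100 * 2 * precision * recall / (precision + recall + 1e-8)
```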

2.5. Analyzing the effects of hyperparameter variations¶

I varied the weight of the smoothness loss w_smoothness between 0 and 1, obtaining the following results:

Hyperparameter Mesh 1 Mesh 2 F1-score
Ground Truth [image] [image]
w_smoothness = 0 [image] [image] [image]
w_smoothness = 0.05 [image] [image] [image]
w_smoothness = 0.1 [image] [image] [image]
w_smoothness = 0.5 [image] [image] [image]
w_smoothness = 1 [image] [image] [image]

We notice that:

  • A low weight for the smoothness loss naturally creates a prediction with highly jagged edges and little smoothness in the mesh's geometry.
  • Increasing the smoothness weight removes these jagged edges, as seen above for weights of 0.1 and 0.5, which better captures smooth sections of the mesh such as the seat and part of the back.
  • However, increasing it too much undermines the network's ability to capture the finer and sharper structures of the chair, such as its legs and edges, and ends up giving them a "bloated" appearance.
  • Consequently, with a higher smoothness loss, the F1-score ends up dipping from 80% when w_smoothness=0 to 77% when w_smoothness = 1.
  • Setting the optimal smoothness loss for mesh deformation is, therefore, dependent on the nature of the classes used in the distribution, and how "smooth" or "spiky" they are.

2.6. Interpret your model¶

Voxel grid prediction models output a 3D grid of occupancy probabilities. Thresholding and displaying the mesh produced by running marching cubes (or PyTorch3D's cubify algorithm) on the cells satisfying the threshold gives us a reasonable representation of the output, but does not give us enough information to gauge the confidence of any particular cell's prediction. For that reason, I visualized the probability of each grid cell by drawing a cube proportional in size to the cell's output probability and ranging in color from small green cubes for low confidence ($0.1 \le conf < 0.2$) to full-sized red cubes for high confidence ($conf \ge 0.5$) in the table below:
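The cube parameters for such a visualization can be sketched as follows (a simplified mapping that illustrates the idea of scaling and colouring cubes by confidence; the exact size and colour bins I used differ, and the function name is hypothetical):

```python
import numpy as np

def cubes_from_probs(probs, lo=0.1, hi=0.5):
    # For every grid cell with probability >= lo, emit a cube whose side
    # length scales with the probability and whose colour interpolates
    # from green (low confidence) to red (saturating at conf >= hi).
    centers, scales, colors = [], [], []
    for idx in np.argwhere(probs >= lo):
        p = float(probs[tuple(idx)])
        t = min(p / hi, 1.0)               # t = 1.0 => full-sized red cube
        centers.append(tuple(idx))
        scales.append(t)
        colors.append((t, 1.0 - t, 0.0))   # RGB: green -> red
    return centers, scales, colors
```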

Mesh # Input RGB Voxel Grid with Confidence Levels Ground Truth Mesh
0 [image] [image] [image]
100 [image] [image] [image]
500 [image] [image] [image]
600 [image] [image] [image]

As we can see, sections of the mesh occupying a large volume such as chair seats and backs have high confidences which taper off as we reach the surface boundaries:

[image] [image] [image]

Sections of the mesh occupying smaller volumes such as chair legs tend to have smaller confidence values:

[image] [image] [image]

Additionally, sections of the mesh that are occluded in the input view, such as the side of the following chair that cannot be seen in the input image, tend to have a lower confidence (although, in this image, the network confidently places the armrests closer to the bottom of the chair than they actually are, which is probably due to the fact that the input view is from a high elevation looking down at the chair):

[image] [image] [image]

3. Exploring other architectures / datasets¶

3.3. Extended dataset for training¶

The three prediction networks (point cloud, voxel, and mesh) were trained on the full three-class dataset (chair, car, and plane) for 1,000 iterations, until the losses stopped decreasing significantly. Each training run used the same hyperparameters described in section 2, and the models were evaluated on the chair test set. The F1-scores, as expected, dropped for each model:

Model Type Chair Dataset Full Dataset
Voxel [image] [image]
Point Cloud [image] [image]
Mesh [image] [image]

Comparing some of the predictions for the voxel model, we observe some interesting phenomena:

Mesh # Input RGB Chair Dataset Full Dataset Ground Truth Mesh
400 [image] [image] [image] [image]
500 [image] [image] [image] [image]
600 [image] [image] [image] [image]

Despite its limitations, the model trained on the chair dataset ends up capturing the unique features of the chair shown in the input image more accurately than the model trained on the full dataset. As one can see in mesh 400, the prediction from the voxel model trained on the full dataset ends up being more of a blob, with thicker legs and an overall shape less consistent with the input image and ground truth mesh. Training on the full dataset seems to have an averaging effect on the features of the individual classes. This may be a result of the model having lower representational power than needed to capture these separate features (i.e. higher bias), or may be due to similarities between input images of other classes and occluded images of chairs, which make the output more difficult to predict given an input. This averaging effect can be seen in the similarity between the predictions for mesh 400 and mesh 500, and it is pronounced in the point cloud and mesh representations as well.

We can see a similar effect with the point cloud model:

Mesh # Input RGB Chair Dataset Full Dataset Ground Truth Mesh
400 [image] [image] [image] [image]
500 [image] [image] [image] [image]
600 [image] [image] [image] [image]

Ignoring the fact that predictions made by the model trained on the full dataset are "fuzzier", we can see that the geometric structures of the two predictions for mesh 400 are visibly different. Only the model trained on the chair dataset correctly predicts a stouter chair with a broader back; the model trained on the full dataset predicts a taller (incorrect) chair which is very similar to its prediction for mesh 500. This is another example of the output samples losing diversity.

Mesh predictions exhibit a similar effect:

Mesh # Input RGB Chair Dataset Full Dataset Ground Truth Mesh
400 [image] [image] [image] [image]
500 [image] [image] [image] [image]
600 [image] [image] [image] [image]

The inconsistency is less noticeable in this case, however, since the model trained on the full dataset outputs a rougher mesh overall.