Lec 4. Learning + Geometry
2025-10-15
Geometric Datasets
4D dataset: 3D geometry + time (e.g., dynamic scans).
Let S be the shape and c the label or condition.
P(c∣S): given the shape, predict the label.
P(S): the distribution of shapes in the world (used for generation, etc.).
P(S∣c): the shape distribution given a label.
Data-Driven Shape Analysis
Multi-View CNN: pass many rendered views of the object separately through a shared CNN1, pool the per-view features, then pass the pooled feature through CNN2 up to the final classifier.
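
A minimal sketch of this pipeline (layer sizes, `num_classes`, and the choice of max pooling across views are illustrative, not the original MVCNN configuration):

```python
import torch
import torch.nn as nn

class MultiViewCNN(nn.Module):
    """Shared CNN1 per view -> view pooling -> CNN2 / classifier."""
    def __init__(self, num_classes=40):
        super().__init__()
        # CNN1: shared backbone applied to each view independently
        self.cnn1 = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (B*V, 32)
        )
        # CNN2: operates on the pooled, view-order-independent feature
        self.cnn2 = nn.Sequential(
            nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, num_classes),
        )

    def forward(self, views):                        # views: (B, V, 3, H, W)
        B, V = views.shape[:2]
        feats = self.cnn1(views.flatten(0, 1))       # (B*V, 32)
        feats = feats.view(B, V, -1).max(dim=1).values  # max pooling over views
        return self.cnn2(feats)                      # (B, num_classes)
```

Because the pooling is a max over the view dimension, the result does not depend on the order in which the views are rendered.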

3D CNN on volumetric data, using 4D kernels (3 spatial dimensions plus a channel dimension). However, the resolution is low due to the computing load of 3D convolutions.
Learn to project: an "X-ray"-style projection into 2D, then use a 2D CNN.

Sparsity of 3D shapes (the portion of occupied voxels is small compared to the total grid): but how do we do convolution?

Sparse convolution: only evaluate the convolution centered at occupied voxels, thus maintaining sparsity.
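
A toy sketch of the idea using a dict of occupied voxels (production systems use hash tables and GPU kernels; the data layout here is illustrative):

```python
import numpy as np

def sparse_conv3d(features, weights, offsets):
    """Submanifold-style sparse convolution: evaluate only at occupied voxels.

    features: dict mapping occupied voxel coords (i, j, k) -> (C_in,) array
    weights:  dict mapping each kernel offset (di, dj, dk) -> (C_in, C_out) array
    offsets:  kernel offsets, e.g. all (di, dj, dk) in {-1, 0, 1}^3
    """
    c_out = next(iter(weights.values())).shape[1]
    out = {}
    for site in features:                       # output sites = input sites,
        acc = np.zeros(c_out)                   # so sparsity is preserved
        for off in offsets:
            nb = (site[0] + off[0], site[1] + off[1], site[2] + off[2])
            if nb in features:                  # empty voxels are skipped entirely
                acc = acc + features[nb] @ weights[off]
        out[site] = acc
    return out
```

Restricting output sites to the input's occupied sites is what keeps the representation sparse layer after layer.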

Octree: recursively partition the space.
Only use small voxels in high-resolution regions (e.g., for a flat table surface, large voxels may be enough).

More memory-efficient, but requires more work on data structures and other issues.
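
A minimal sketch of the recursive partitioning, assuming a hypothetical `occ(lo, hi)` callable that reports the occupied fraction of a cell:

```python
import numpy as np

def build_octree(occ, origin, size, max_depth, depth=0):
    """Recursively partition space; subdivide only 'mixed' cells.

    occ(lo, hi) -> fraction of the axis-aligned cell [lo, hi) that is occupied.
    Returns 'empty' / 'full' leaves, or a dict of 8 children for mixed cells.
    """
    frac = occ(origin, origin + size)
    if frac == 0.0:
        return "empty"                           # one big cell, no subdivision
    if frac == 1.0 or depth == max_depth:
        return "full"
    half = size / 2
    children = {}
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):                    # 8 sub-cells of a mixed cell
                off = origin + half * np.array([dx, dy, dz])
                children[(dx, dy, dz)] = build_octree(
                    occ, off, half, max_depth, depth + 1)
    return children
```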
Regular grids naturally admit convolution kernels, but how do we learn on meshes and point clouds?
Deep learning on irregular data representations (other than grids):
PointNet
Permutation invariance (the order of the points shouldn't change the result).
Intuition: symmetric functions.
Vanilla PointNet
The network computes g(h(x_1), ..., h(x_n)), where h is a shared per-point function and g is symmetric. Suppose g is max pooling (element-wise max over the per-point feature vectors): how can this give a global understanding of the object? (e.g., if h is a voxel-occupancy hash and g is max pooling, then after g we recover the voxel representation; see the sketch below.)
However, vanilla PointNet has no local context and depends on absolute coordinates (the output changes when the object is transformed).
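
A minimal sketch of the vanilla architecture (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class VanillaPointNet(nn.Module):
    """f(x_1, ..., x_n) = g(h(x_1), ..., h(x_n)), with g = element-wise max."""
    def __init__(self, num_classes=40):
        super().__init__()
        self.h = nn.Sequential(                  # shared per-point MLP
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 256),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, pts):                      # pts: (B, N, 3), any point order
        per_point = self.h(pts)                  # (B, N, 256)
        global_feat = per_point.max(dim=1).values  # g: symmetric over the N points
        return self.classifier(global_feat)      # (B, num_classes)
```

Because the max over the point dimension is symmetric, permuting the N input points cannot change `global_feat`, which gives the permutation invariance for free.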
PointNet++
First extract a local-region feature (using a small PointNet).
(x,y)→(x,y,F), where F is the feature vector at (x,y).
To downsample, keep only the center points. Then apply a bigger PointNet globally. For the local PointNet, we place the center point at the origin, which to some extent achieves translation invariance (e.g., when a chair is translated within the scene); see the grouping sketch below.
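
A sketch of the local grouping step, assuming a hypothetical `local_pointnet` callable that maps an (n, 3) patch to a fixed-size feature vector (e.g., a per-point MLP followed by max pooling, as above):

```python
import torch

def set_abstraction(pts, centers, radius, local_pointnet):
    """Group points around each center, re-center to the origin, encode.

    pts:     (N, 3) input points
    centers: (M, 3) kept center points (e.g., from farthest-point sampling)
    returns: (M, C) one local feature per center
    """
    feats = []
    for c in centers:
        mask = (pts - c).norm(dim=1) < radius    # ball query around the center
        local = pts[mask] - c                    # move the center to the origin
        feats.append(local_pointnet(local))      # small PointNet on the patch
    return torch.stack(feats)                    # (M, C)
```

The centers together with their features form the downsampled set (x, y, z, F) that the next, bigger PointNet consumes; re-centering each patch is what buys the partial translation invariance mentioned above.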
One more problem: points are sampled from the surface and the sampling can be uneven, which biases the convolution (heavily sampled regions have a larger effect).
Monte Carlo convolution: e.g., first estimate the point density ρ, then multiply by 1/ρ when doing the convolution.
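
A simplified sketch of the density correction (the kernel here is an arbitrary callable and the density a simple kernel-density estimate; real Monte Carlo convolutions learn the kernel with an MLP):

```python
import numpy as np

def density_weighted_conv(pts, feats, kernel, bandwidth=0.1):
    """Monte Carlo estimate of a continuous conv: weight each sample by 1/rho.

    pts:    (N, 3) sample positions on the surface
    feats:  (N, C) per-point features
    kernel: callable mapping relative offsets (N, 3) -> weights (N,)
    """
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)   # (N, N) distances
    rho = np.exp(-(d / bandwidth) ** 2).sum(axis=1)            # KDE point density
    out = np.zeros_like(feats)
    for i in range(len(pts)):
        w = kernel(pts - pts[i]) / rho        # 1/rho undoes the uneven sampling
        out[i] = (w[:, None] * feats).sum(axis=0) / len(pts)
    return out
```

Dividing by ρ means a densely sampled region no longer dominates the sum just because it contributed more samples.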
Isometric transformations: length-preserving maps. PointNet should be modified (not using raw coordinates) to use geodesic distances or other intrinsic features, making the model invariant to isometric transformations (e.g., a person in different poses).
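
Geodesic distances are preserved by isometries, which makes them a natural input feature. One common graph-based approximation, assuming `edges` lists the mesh edges as vertex-index pairs:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_distances(verts, edges):
    """Approximate geodesic distance as shortest paths along mesh edges.

    verts: (V, 3) vertex positions; edges: (E, 2) vertex-index pairs.
    """
    w = np.linalg.norm(verts[edges[:, 0]] - verts[edges[:, 1]], axis=1)
    n = len(verts)
    g = coo_matrix((w, (edges[:, 0], edges[:, 1])), shape=(n, n))
    return dijkstra(g, directed=False)       # (V, V) pairwise geodesic distances
```

A person bending an arm changes vertex coordinates drastically but leaves these along-the-surface distances (approximately) unchanged.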
Data-Driven Shape Synthesis
How do we get the data, and what network do we use?
2.5D: RGB-D images (let the model do depth prediction).
But we often only get relative depth (to recover metric depth, find anchor objects of known size, e.g., a human is about 1.8 m tall).
Also, the loss needs to be designed for relative depth (scaling the whole depth map should leave the loss unchanged):
$$
L(y, y^*) = \sum_i \left\| \log y_i - \log y_i^* + \alpha(y, y^*) \right\|^2, \qquad \alpha(y, y^*) = \frac{1}{n} \sum_i \left( \log y_i^* - \log y_i \right)
$$
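
A direct NumPy transcription of the loss above; scaling the whole depth map `y` by a constant shifts `d` and `alpha` by opposite amounts, so the loss is unchanged:

```python
import numpy as np

def scale_invariant_loss(y, y_star, eps=1e-8):
    """Scale-invariant log-depth loss (as in Eigen et al.'s depth prediction)."""
    d = np.log(y + eps) - np.log(y_star + eps)   # per-pixel log-depth error
    alpha = -d.mean()                            # optimal global log-scale shift
    return ((d + alpha) ** 2).sum()
```

Sanity check: `scale_invariant_loss(2 * y, y_star)` returns the same value as `scale_invariant_loss(y, y_star)`, since the factor 2 only adds log 2 to every `d` and subtracts it back via `alpha`.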
Image-Based 3D Object Modeling
Input a 2D image, pass it through a CNN to predict 3D, and compare with the ground-truth 3D.
It is hard to collect [image, 3D] pairs; instead, render meshes to generate [image, 3D] data of all kinds.
Learning to Predict Volumetric 3D.
Input a 2D image, output a 3D volume (voxel grid).
But the problem is still low resolution. A solution is the octree: each voxel predicts a 3-way label (−1: empty; 1: full; 0: mixed, decode further at a finer level).
Learning to predict point clouds: we need an order-invariant loss function, e.g., the Chamfer distance:
$$
d_{CD}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2
$$
We need both directions to avoid the situation where one set is a subset of the other.
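
A brute-force NumPy version (fine for small point sets; real pipelines use spatial data structures or GPU kernels for the nearest-neighbor search):

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Symmetric Chamfer distance between point sets s1 (N, 3) and s2 (M, 3)."""
    d2 = ((s1[:, None, :] - s2[None, :, :]) ** 2).sum(-1)  # (N, M) squared dists
    # both directions, so neither set can "hide" as a subset of the other
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()
```

Since each term is a sum of per-point minima, shuffling the rows of `s1` or `s2` leaves the value unchanged, which is exactly the order invariance the loss needs.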
Learning to predict parametric point clouds.
How about learning meshes? Treat mesh prediction as a deformation problem: e.g., just predict vertex positions, deforming a sphere mesh into the target shape.
Differences from predicting point clouds:
There are smoothness objectives, etc.
We can sample many points to compute the loss despite predicting only a few vertices (see the sketch below).
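
A sketch of area-weighted surface sampling, so the loss can be computed on many points even when the network outputs only the mesh vertices:

```python
import numpy as np

def sample_on_mesh(verts, faces, n):
    """Uniformly sample n points on a triangle mesh's surface.

    verts: (V, 3) predicted vertex positions; faces: (F, 3) vertex indices.
    """
    tri = verts[faces]                                    # (F, 3, 3) triangles
    # area-weighted choice of triangle makes sampling uniform over the surface
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    idx = np.random.choice(len(faces), n, p=areas / areas.sum())
    # uniform barycentric coordinates inside each chosen triangle
    u, v = np.random.rand(2, n)
    flip = u + v > 1
    u[flip], v[flip] = 1 - u[flip], 1 - v[flip]
    t = tri[idx]
    return t[:, 0] + u[:, None] * (t[:, 1] - t[:, 0]) \
                   + v[:, None] * (t[:, 2] - t[:, 0])
```

The sampled points can then go straight into the Chamfer distance above, against points sampled from the ground-truth surface.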
Modeling humans
SMPL: disentangle shape (fat, thin) from pose. Instead of using the whole mesh data, use only β and θ (a parametric model of shape and pose).
One can simply decode a latent vector into β and θ.
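
A sketch of the shape half of this parametric idea (mean template plus β-weighted blendshapes); the real SMPL model additionally adds pose blendshapes and applies linear blend skinning driven by θ:

```python
import numpy as np

def smpl_like_template(t_bar, shape_basis, beta):
    """Parametric shape: mean template plus shape blendshapes.

    t_bar:       (V, 3) mean template mesh
    shape_basis: (K, V, 3) shape blendshape directions (PCA-like)
    beta:        (K,) shape coefficients ("fat/thin" etc.)
    """
    return t_bar + np.tensordot(beta, shape_basis, axes=1)   # (V, 3) vertices
```

A few dozen coefficients in β (and θ) replace thousands of raw vertex positions, which is what makes the latent-to-(β, θ) decoding tractable.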
Modeling a 3D scene from a 2D image.
A problem: our objective treats all objects (large or small) equally, and thus ignores fine details.
Solution: decouple prediction of objects, locations, and layout.