Lecture 15¶

Announcements¶

  • Groups for this week:
    • Ned, Jacob, Finn
    • Layla, Charlie, Camiel
    • Grayson, Shingo
  • Today's tea: Yunnan Gold

Goals¶

  • Be able to describe several different ways 3D reconstructions can be represented (mesh, point cloud, signed distance field, voxels, continuous volumetric representations)
  • Explain why positional encodings are necessary for MLP neural networks in contexts like fitting 2D images or 3D scene representations.
  • Be prepared to implement NeRF (Project 4).
    • Know how to perform volume rendering along camera rays to predict a pixel color from the model

3D Representations¶

  • SfM: given images, get camera pose and (sparse) 3D scene geometry
    • Large-scale SfM result examples:
      • Ladybug: http://swehrwein.github.io/sfmflex-vis/Ladybug_demo.html
      • Venice: https://facultyweb.cs.wwu.edu/~wehrwes/sfmflex-vis/Venice_demo.html
  • Multiview stereo / 3D reconstruction: given SfM outputs, recover 3D model of the world
    • https://www.youtube.com/watch?v=5ceiOd8Yx3g&t=25s
    • https://earth.google.com/web/@47.62102506,-122.3493987,55.50286284a,993.8487854d,35y,18.72359613h,64.09030499t,360r/data=OgMKATA
  • Interesting question: how do you represent your 3D model?
Brainstorm¶

How would you reconstruct a 3D model of the world, given images, camera poses, and a sparse point cloud of the world?

How would you even represent a 3D model of the world?

These questions are clearly interrelated. Let's brainstorm:

  • Triangle mesh
  • Denser point cloud
  • Voxels storing
    • color?
    • which sides are opaque (occupancy)
    • density
  • continuous volumetric representation
In [48]:
# boilerplate setup
%load_ext autoreload
%autoreload 2

%matplotlib inline

import os
import sys

src_path = os.path.abspath("../src")
if (src_path not in sys.path):
    sys.path.insert(0, src_path)

# Library imports
import numpy as np
import imageio.v3 as imageio
import matplotlib.pyplot as plt
import skimage as skim
import cv2
import torch
import torch.nn as nn
import torch.nn.functional as F

# codebase imports
import util
import filtering
import features
import geometry
import ML

# Check if GPU is available
if torch.cuda.is_available():
    device = torch.device('cuda') # nvidia/cuda
elif torch.mps.is_available():
    device = torch.device('mps') # apple
else:
    device = torch.device('cpu') # no acceleration

print(f'Using device: {device}')
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Using device: mps

3D Representations - Some ideas¶

Source for some relevant visuals: https://courses.cs.washington.edu/courses/cse455/10wi/lectures/multiview.pdf

  • Depth maps; multi-camera: depth map fusion
  • Voxel grids
  • Point clouds
  • Patch clouds (surfels)
  • Polygon mesh
  • SDF (signed distance field; see the small sketch after this list)
  • Neural network!?
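
To make one of these concrete: a signed distance field stores, for each query point, the distance to the nearest surface, with a negative sign inside the object. Here's a minimal sketch (the sphere and the voxel grid are just illustrative choices, not anything from the lecture codebase):

import numpy as np

def sphere_sdf(points, center, radius):
    # signed distance to a sphere: negative inside, zero on the surface, positive outside
    # points: (N, 3) array of query locations
    return np.linalg.norm(points - center, axis=-1) - radius

# sample the SDF on a coarse voxel grid; the surface is the zero level set
grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 32)] * 3, indexing="ij"), axis=-1)
d = sphere_sdf(grid.reshape(-1, 3), center=np.array([0.0, 0.0, 0.0]), radius=0.5)
print((np.abs(d) < 0.05).sum(), "grid points near the surface")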

Today: Neural Radiance Fields, and other "Learned" 3D Scene Representations¶

ML review: 0 to MLP¶

Review, by example, the anatomy, care, and feeding of an MLP.

In [50]:
import sklearn
import sklearn.datasets

moons = ML.scale_split(sklearn.datasets.make_moons(n_samples=1000, noise=0.1, random_state=0))
X, Xva, y, yva, xx, yy = moons
ML.plot_dataset(X, y)
[figure: scatter plot of the two-moons dataset]

Code tour:

  • look at ML.MLP()
  • look at training routine below
  • flag me down if something's not familiar!
In [51]:
def train(model, X, y, train_iters=1000):
    optimizer = torch.optim.Adam(model.parameters())

    for i in range(train_iters):
        optimizer.zero_grad()

        # sample a random batch of training points (with replacement)
        batch_indices = torch.randint(0, X.shape[0], (1000,))
        batch_X, batch_y = X[batch_indices, :], y[batch_indices]

        # regression loss between predictions and labels
        outputs = model(batch_X).squeeze()
        loss = F.mse_loss(outputs, batch_y)
        loss.backward()
        optimizer.step()
    return model

def plot_trained_model(model, X, y, xx, yy, encode=lambda x: x):
    # evaluate the model on a dense grid and show its predictions behind the data
    with torch.no_grad():
        h, w = xx.shape
        dense_X = encode(torch.vstack([xx.flatten(), yy.flatten()]).T)
        dense_ypred = model(dense_X).reshape((h, w)).flip([0])
        plt.gca().imshow(dense_ypred, extent=[xx.min(), xx.max(), yy.min(), yy.max()])
        ML.plot_dataset(X, y)
In [52]:
model = ML.MLP(2, 1)
model = train(model, X, y, train_iters=1000)

plot_trained_model(model, X, y, xx, yy)
[figure: trained MLP's predictions shown behind the two-moons scatter]
In [55]:
X, y = ML.make_stripes(500, 4, 0.01)
ML.plot_dataset(X, y)
[figure: scatter plot of the 4-stripe dataset]
In [59]:
X, y = ML.make_stripes(500, 10, 0.00)
xx, yy = np.meshgrid(np.arange(0, 1, 0.01), np.arange(0, 1, 0.01))
xx = torch.Tensor(xx)
yy = torch.Tensor(yy)
model = train(ML.MLP(2,1), X, y)

plot_trained_model(model, X, y, xx, yy)
[figure: MLP predictions for the 10-stripe dataset]

Conclusion: high-frequency stuff is hard for the MLP to learn!

Question: We need to go deeper; will more layers fix this?

Try MLP_N

In [63]:
X, y = ML.make_stripes(5000, 20, 0.0)
xx, yy = np.meshgrid(np.arange(0, 1, 0.01), np.arange(0, 1, 0.01))
xx = torch.Tensor(xx)
yy = torch.Tensor(yy)
model = train(ML.MLP_N(2, 6, 128, 1), X, y, 2000)

plot_trained_model(model, X, y, xx, yy)
[figure: deeper MLP_N predictions for the 20-stripe dataset]
In [70]:
sum([t.numel() for t in model.parameters()])
Out[70]:
66561

To a point, but it's going to get expensive...

Alternative: "positional encoding"

Very handwavy intuition: "smear" the input signal across more input channels to allow the network to learn high-frequency stuff.

In [72]:
pi = torch.pi

X, y = ML.make_stripes(1000, 20, 0.0)


# very barebones positional encoding:
def positional_encoding(X):
    return torch.hstack([
        torch.sin(2*pi * X),
        torch.sin(4*pi * X),
        torch.sin(8*pi * X),
        torch.sin(16*pi * X),
        torch.cos(2*pi * X),
        torch.cos(4*pi * X),
        torch.cos(8*pi * X),
        torch.cos(16*pi * X)])

Xpe = positional_encoding(X)

model = train(ML.MLP(Xpe.shape[1], 1), Xpe, y)

plot_trained_model(model, X, y, xx, yy, encode=positional_encoding)
[figure: positionally encoded MLP predictions for the 20-stripe dataset]

Here's a paper that does some deeper analysis on this - has some helpful visuals:

  • https://bmild.github.io/fourfeat/

Neural Radiance Fields¶

Paper with helpful visuals: https://arxiv.org/pdf/2003.08934.pdf

Representation: continuous volume with color and density¶
  • Basic idea: Parameterize a volumetric representation with an MLP

Color and density are a function of 3D location and view direction:

$$ f(x, y, z, \phi, \theta) = (r, g, b, \sigma) $$

  • Detail: density is constrained to depend only on location, not direction.
MLP Architecture:¶
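
(See Fig. 7 of the NeRF paper for the exact architecture: 8 hidden layers of width 256 with a skip connection.) Below is a rough, simplified PyTorch sketch with hypothetical smaller layer sizes, just to show the key structural idea: density is predicted from the encoded position only, while color also sees the encoded view direction.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNeRFMLP(nn.Module):
    # simplified sketch; layer count and widths here are illustrative, not the paper's
    def __init__(self, pos_dim=63, dir_dim=27, hidden=128):
        super().__init__()
        # trunk processes only the (positionally encoded) location
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)            # density: position only
        self.feature = nn.Linear(hidden, hidden)
        self.color_head = nn.Sequential(                  # color: position features + view direction
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3),
        )

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = F.relu(self.sigma_head(h))                # keep density non-negative
        feat = self.feature(h)
        rgb = torch.sigmoid(self.color_head(torch.cat([feat, d_enc], dim=-1)))  # colors in [0, 1]
        return rgb, sigma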

Volume Rendering¶

Given a magic color-density-producing machine, how do you make an image?

(notes, and the above architecture overview figure)

If you want to read more background on where this comes from, check out https://www.scratchapixel.com/lessons/3d-basic-rendering/volume-rendering-for-developers/volume-rendering-summary-equations.html

HW Problem 2¶

The somewhat opaque equation for weighting samples along a volume rendering ray is: $$ C(\mathbf{r})=\int_{t_n}^{t_f}T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt $$ in its continuous form, and the discretized quadrature equation is: $$ \begin{align*} \hat{C}(\mathbf{r}) &= \sum_{i=1}^N w_i \mathbf{c}_i \\ &=\sum_{i=1}^{N}T_i\left(1-\exp(-\sigma_i\delta_i)\right)\mathbf{c}_i \end{align*} $$ where $N$ is the number of samples, $T_i=\exp\left(-\sum_{j=1}^{i-1}\sigma_j\delta_j\right)$, and $\delta_i=t_{i+1}-t_i$ is the distance between adjacent samples. This boils down to a weighted sum of the colors ($\mathbf{c}_i$) along the sample ray.
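
Here is a minimal sketch of computing these weights for a single ray, assuming the densities $\sigma_i$ and sample locations $t_i$ are given as 1D tensors (the final $\delta$ is simply padded here; real implementations handle the last sample in various ways):

import torch

def render_weights(sigma, t):
    # sigma: (N,) densities at the samples; t: (N,) sample locations along the ray
    delta = t[1:] - t[:-1]                      # delta_i = t_{i+1} - t_i
    delta = torch.cat([delta, delta[-1:]])      # pad a delta for the last sample
    alpha = 1.0 - torch.exp(-sigma * delta)     # per-segment opacity
    # T_i = exp(-sum_{j<i} sigma_j delta_j) = prod_{j<i} (1 - alpha_j)
    T = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)
    return T * alpha                            # w_i = T_i (1 - exp(-sigma_i delta_i))

# the pixel color is then the weighted sum of sample colors:
# C_hat = (render_weights(sigma, t)[:, None] * colors).sum(dim=0)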

To get some intuition for this, let's plug in a simple case and plot the weights. Take samples at $t = 1, \dots, 10$ and assume the density is 0 everywhere except for a constant-density object with $\sigma = 0.4$ between $t=4$ and $t=6$ inclusive. Using software of your choice, plug this situation into the above equation to compute the weights $w_{1..10}$, and plot them to show the weights at the 10 sample points.

Positional Encoding¶

High frequencies aren't learned well by the naive approach (feeding raw coordinates into the MLP), so NeRF applies a positional encoding to its inputs:

$$ \gamma(p)=\left(\sin(2^0\pi p), \cos(2^0\pi p), \sin(2^1\pi p), \cos(2^1\pi p), \cdots, \sin(2^{L-2}\pi p), \cos(2^{L-2}\pi p), \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\right) $$
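
This is the same idea as the hard-coded encoding from earlier in the lecture, just generalized to $L$ frequency bands. A small sketch, assuming the input coordinates are already scaled to a reasonable range (e.g. roughly $[-1, 1]$):

import torch

def gamma(p, L=10):
    # p: (..., D) coordinates; returns (..., 2*L*D).
    # The paper uses L=10 for positions and L=4 for view directions.
    out = []
    for k in range(L):
        out.append(torch.sin((2.0 ** k) * torch.pi * p))
        out.append(torch.cos((2.0 ** k) * torch.pi * p))
    return torch.cat(out, dim=-1)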

NeRF Extensions:¶

  • Unbounded and higher quality - Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields https://jonbarron.info/mipnerf360/
  • Deformable - TiNeuVox: Fast Dynamic Radiance Fields with Time-Aware Neural Voxels https://jaminfong.cn/tineuvox/
  • Shape and lighting: https://xiuming.info/projects/nerfactor/
  • Editable: https://zju3dv.github.io/sine/
  • and so on...

More Recently: Gaussian Splats¶

Look ma, no MLP!

Just a giant cloud of 3D Gaussians that are "learned" (optimized) to minimize reprojection error!

Shiny results: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

Paper with some helpful visuals: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/3d_gaussian_splatting_low.pdf

Video with visualizations that help explain the representation and the learning process: https://www.youtube.com/watch?v=T_kXY43VZnk&t=61s

More Classical methods:¶

  • Poisson surface reconstruction (2006) https://doc.cgal.org/latest/Poisson_surface_reconstruction_3/index.html
  • Patch-based multi-view stereo (2007) (remained a workhorse through circa 2015 and beyond) https://www.di.ens.fr/pmvs/pmvs-1/index.html
  • DeepSDF (2019) - neural signed distance functions; ~direct predecessor to NeRF: https://openaccess.thecvf.com/content_CVPR_2019/papers/Park_DeepSDF_Learning_Continuous_Signed_Distance_Functions_for_Shape_Representation_CVPR_2019_paper.pdf