[Paper Notes] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Devin Z
Sep 5, 2023

A pioneering work behind the recent breakthroughs in 3D reconstruction.

Google Moffett Place Campus, September 01, 2023
  • Traditional view synthesis approaches:
    - Interpolation techniques work for reconstruction from dense samples.
    - Mesh-based representations require template meshes and are not well suited to gradient-based optimization.
    - Volumetric representations are well suited to gradient-based optimization but do not scale to high resolutions due to discrete sampling.
  • A static scene is represented as a 5D continuous function:
    - Inputs are the 3D position and the 2D viewing direction of the camera.
    - One output is the view-dependent radiance (i.e. RGB color) emitted at the location along the viewing direction.
    - Another output is the volume density representing the opacity at the location.
  • The neural radiance field (NeRF) is a multilayer perceptron (MLP) that does the above mapping.
    - The shared bottom layers take in the location and output the density along with a 256-dimensional feature vector.
    - An additional fully-connected layer takes in the 256-dimensional feature vector and the camera ray’s viewing direction to output the view-dependent RGB color.
    - Without view dependence, the model would have difficulty representing specularities.
    - Parameters are optimized separately for each scene via gradient descent.
    - The training error is simply the total squared error between the rendered and true pixel colors in the captured images.
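The two-stage mapping above can be sketched as follows. This is a minimal NumPy illustration, not the authors’ code: the trunk is shortened to two layers (the paper uses 8 fully-connected layers of width 256), and the weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Random placeholder weights; the paper's trunk is 8 fully-connected
# layers of width 256, shortened here to 2 for brevity.
W1 = rng.normal(0.0, 0.1, (3, 256))         # position (x, y, z) -> hidden
W2 = rng.normal(0.0, 0.1, (256, 256))       # hidden -> 256-d feature vector
W_sigma = rng.normal(0.0, 0.1, (256, 1))    # feature -> volume density
W_rgb = rng.normal(0.0, 0.1, (256 + 3, 3))  # feature + view dir -> RGB

def nerf_mlp(xyz, view_dir):
    """Map a 3D position and viewing direction to (RGB, density)."""
    h = relu(xyz @ W1)
    feat = relu(h @ W2)            # view-independent feature vector
    sigma = relu(feat @ W_sigma)   # density depends on position only
    # Only the final layer sees the viewing direction, so the density
    # cannot vary with it while the color can.
    rgb_in = np.concatenate([feat, view_dir], axis=-1)
    rgb = 1.0 / (1.0 + np.exp(-(rgb_in @ W_rgb)))  # sigmoid keeps RGB in [0, 1]
    return rgb, sigma
```

The structural point is that density is computed before the viewing direction enters the network, which is what makes the geometry view-consistent while the color remains view-dependent.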
  • The volume density can be interpreted as the differential probability of a ray terminating at an infinitesimal particle at the location.
    - The accumulated transmittance takes an exponential form because radiance is attenuated in proportion to the local density along the ray, which yields a first-order linear ODE (as derived in a Zhihu article).
Continuous Volume Rendering (Source: [1])
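The exponential form follows from that ODE. Writing the camera ray as r(t) = o + t·d with near and far bounds t_n, t_f:

```latex
% Attenuation proportional to density gives a first-order linear ODE:
\frac{dT}{dt} = -\,\sigma(\mathbf{r}(t))\,T(t), \qquad T(t_n) = 1
% whose solution is the accumulated transmittance
T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
% so the expected color of the camera ray is
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,
                \mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt
```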
  • The continuous integral is numerically estimated using quadrature and stratified sampling.
Volume Rendering Estimate
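The quadrature estimate, C ≈ sum_i T_i (1 − exp(−sigma_i · delta_i)) c_i with delta_i the distance between adjacent samples, can be sketched for a single ray. This is a simplified NumPy version; `render_ray` and its argument layout are illustrative, not the authors’ implementation.

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Numerically estimate the volume rendering integral along one ray.

    sigmas: (N,) densities at the sampled points
    colors: (N, 3) RGB values at the sampled points
    t_vals: (N,) sample depths along the ray, in increasing order
    """
    # Distances between adjacent samples; the last interval is effectively infinite.
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)
    alphas = 1.0 - np.exp(-sigmas * deltas)  # per-sample opacity
    # Transmittance T_i: probability the ray reaches sample i unoccluded,
    # the product of (1 - alpha) over all earlier samples.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0), weights
```

The returned `weights` are reused later: they define the distribution that the hierarchical sampling stage draws its second set of samples from.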
  • Positional encoding:
    - Deep networks are biased towards learning lower frequency functions and perform poorly at representing high-frequency variation in color and geometry.
    - The solution is to prepend an encoding function that maps each component of the position (a 3D vector) and the viewing direction (a 3D unit vector) into a higher-dimensional space.
    - The encoding shares its name with the positional encoding in the Transformer, where it serves a different purpose.
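A minimal sketch of the encoding, mapping each scalar component p to (sin(2^0·pi·p), cos(2^0·pi·p), …, sin(2^(L−1)·pi·p), cos(2^(L−1)·pi·p)); the paper uses L = 10 for positions and L = 4 for viewing directions.

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Map each component of p to 2 * num_freqs sinusoidal features."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi  # 2^0 pi ... 2^(L-1) pi
    angles = p[..., None] * freqs                  # (..., D, L)
    # Concatenate sines and cosines per component, then flatten.
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)          # (..., D * 2L)
```

With L = 10, a 3D position becomes a 60-dimensional input, letting the MLP resolve high-frequency variation that it struggles to learn from raw coordinates.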
  • Hierarchical volume sampling:
    - Sampling efficiency needs improvement because free space and occluded regions contribute nothing to the rendered images.
    - The solution is to split the representation into a coarse network and a fine network, each with its own parameters and loss term.
    - The coarse network is evaluated at locations obtained via stratified sampling, and its outputs inform a second, denser sampling.
    - A second set of samples is drawn via inverse transform sampling from the distribution given by the coarse network’s weights, so more samples land in regions expected to contain visible content.
    - The fine network is evaluated at the union of the first and second sets of samples.
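The inverse transform sampling step can be sketched in one dimension. `sample_pdf`, its bin-edge layout, and the uniform draws here are a simplified illustration rather than the authors’ batched implementation.

```python
import numpy as np

def sample_pdf(bins, weights, n_samples, rng=None):
    """Draw new sample locations along a ray via inverse transform sampling.

    bins: (N+1,) edges of the coarse sampling intervals
    weights: (N,) nonnegative per-interval weights from the coarse network
    """
    rng = rng or np.random.default_rng()
    pdf = weights / (weights.sum() + 1e-8)         # piecewise-constant PDF
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    cdf = cdf / cdf[-1]                            # guard against round-off
    u = rng.uniform(0.0, 1.0, n_samples)           # uniform draws in [0, 1)
    # Invert the CDF: locate each draw's interval and place the sample by
    # linear interpolation within it.
    return np.interp(u, cdf, bins)
```

Intervals with larger coarse weights occupy more of the CDF’s range, so uniform draws land in them more often, concentrating the fine samples where visible content is expected.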
