TL;DR: What is NeRF and what does it do?
========================================

NeRF stands for Neural Radiance Fields. It solves view interpolation:
taking a sparse set of input views and synthesizing novel views of the
same scene. Existing RGB volume rendering models are easy to optimize,
but they require extensive storage space (1-10 GB). One side benefit of
NeRF is that the weights generated by the neural network are
$\sim$6000$\times$ smaller than the original images.

Helpful Terminology
===================

**Rasterization**: A technique computer graphics uses to display a 3D
object on a 2D screen. Objects are modeled as collections of virtual
triangles/polygons, and the computer converts those triangles into
pixels, each of which is assigned a color. Overall, this is a
computationally intensive process.\
**Ray Tracing**: In the real world, the 3D objects we see are
illuminated by light, and that light may be blocked, reflected, or
refracted. Ray tracing captures those effects. It is also
computationally intensive, but it creates more realistic images.\
**Ray**: A line cast from the camera center, whose origin is determined
by the camera position parameters and whose direction is determined by
the camera angle (a short sketch of this parameterization follows at
the end of this section).\
**NeRF uses ray tracing rather than rasterization for its models.**\
**Neural Rendering**: As of 2020/2021, this terminology is used when a
neural network is a black box that models the geometry of the world and
a graphics engine renders it. Other terms commonly used are *scene
representations* and, less frequently, *implicit representations*. In
this case, the neural network is just a flexible function approximator
and the rendering machine does not learn at all.
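As a concrete illustration of the **Ray** definition above, here is a
minimal NumPy sketch (not from the paper; `camera_origin`, `view_dir`,
and the depth range are placeholder values) of generating points along
a ray $r(t) = o + td$:

    import numpy as np

    # Illustrative camera center (ray origin) and viewing direction.
    camera_origin = np.array([0.0, 0.0, 0.0])        # o: set by the camera position
    view_dir = np.array([0.0, 0.0, -1.0])            # d: set by the camera angle
    view_dir = view_dir / np.linalg.norm(view_dir)   # normalize to unit length

    # Points along the ray r(t) = o + t * d for a range of depths t.
    t = np.linspace(2.0, 6.0, num=64)                # depths between near and far bounds
    ray_points = camera_origin + t[:, None] * view_dir   # shape (64, 3)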

Approach
========

A continuous scene is represented as a function whose input is a 3D
location *x* = (x, y, z) and a 2D viewing direction $(\theta,\phi)$,
and whose output is an emitted color c = (r, g, b) and a volume density
$\sigma$. The density at each point acts like a differential opacity
controlling how much radiance is accumulated by a ray passing through
point *x*. In other words, an opaque surface has a density of $\infty$
while a transparent surface has $\sigma = 0$. In layman's terms, the
neural network is a black box that is repeatedly asked “what is the
color and what is the density at this point?” and it provides responses
such as “red, dense.”\
This neural network is wrapped into volumetric ray tracing: you start
at the back of the ray (furthest from you) and walk toward the camera,
querying the color and density at each point. The expected color $C(r)$
of a camera ray $r(t) = o + td$ with near and far bounds $t_n$ and
$t_f$ is calculated as follows:

$$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t),d)\,dt,
\quad \text{where} \quad
T(t) = \exp\left(-\int_{t_n}^{t}\sigma(r(s))\,ds\right)$$

To actually calculate this, the authors use a stratified sampling
approach: they partition $[t_n, t_f]$ into N evenly spaced bins and
then draw one sample uniformly at random from each bin:

$$\hat{C}(r) = \sum_{i=1}^{N} T_{i}\bigl(1-\exp(-\sigma_{i}\delta_{i})\bigr)c_{i},
\quad \text{where} \quad
T_{i} = \exp\Bigl(-\sum_{j=1}^{i-1}\sigma_{j}\delta_{j}\Bigr)$$

where $\delta_{i} = t_{i+1} - t_{i}$ is the distance between adjacent
samples.
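To make this discretization concrete, here is a minimal NumPy sketch
(not the authors' code; `sigma` and `color` stand in for the network
outputs at the sampled points of one ray) of the stratified sampling
and the quadrature sum $\hat{C}(r)$ above:

    import numpy as np

    # Stratified sampling: one uniform sample per evenly spaced bin in [t_n, t_f].
    t_n, t_f, N = 2.0, 6.0, 64
    bins = np.linspace(t_n, t_f, N + 1)
    t = bins[:-1] + np.random.rand(N) * (bins[1:] - bins[:-1])

    def render_ray(sigma, color, t):
        """Quadrature estimate of C(r): sigma is (N,), color is (N, 3), t is (N,)."""
        delta = np.diff(t, append=t[-1] + 1e10)           # distances between adjacent samples
        alpha = 1.0 - np.exp(-sigma * delta)              # per-segment opacity
        T = np.cumprod(np.append(1.0, 1.0 - alpha))[:-1]  # accumulated transmittance T_i
        return np.sum((T * alpha)[:, None] * color, axis=0)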

The volume rendering is differentiable, so you can train the model by
minimizing the rendering loss between rendered and observed images:

$$\min_{\Theta}\sum_{i}\left\| \mathrm{render}_{i}(F_{\Theta}) - I_{i}\right\|^{2}$$
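Because the rendering step is differentiable end to end, training
reduces to gradient descent on this loss. A rough JAX-style sketch,
assuming a hypothetical `render_rays(params, rays)` helper and a
`params` pytree of MLP weights (the paper itself uses the Adam
optimizer):

    import jax
    import jax.numpy as jnp

    def loss_fn(params, rays, target_rgb):
        rendered = render_rays(params, rays)           # assumed: (num_rays, 3) predicted colors
        return jnp.mean((rendered - target_rgb) ** 2)  # squared rendering error

    grad_fn = jax.value_and_grad(loss_fn)

    # One plain SGD step for illustration.
    loss, grads = grad_fn(params, rays, target_rgb)
    params = jax.tree_util.tree_map(lambda p, g: p - 5e-4 * g, params, grads)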

[fig:Figure 1]

In practice, the viewing direction is expressed as a 3D Cartesian unit
vector d. You can approximate this scene representation with an MLP,
$F_\Theta : (x, d) \rightarrow (c, \sigma)$.\
**Why does NeRF use an MLP rather than a CNN?** A multilayer perceptron
(MLP) is a simple feed-forward neural network. NeRF queries one
(position, direction) coordinate at a time rather than a whole image,
so there is no spatial grid of features for convolutions to exploit,
and a CNN is not necessary.
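As a rough sketch of what such a function approximator could look like
(illustrative only, written with Flax Linen; the paper's actual coarse
and fine networks use 8 fully connected layers of 256 units):

    import flax.linen as nn
    import jax.numpy as jnp

    class NeRFMLP(nn.Module):
        @nn.compact
        def __call__(self, x, d):
            # x: encoded 3D position, d: encoded viewing direction.
            h = x
            for _ in range(4):                   # fewer layers than the paper, for brevity
                h = nn.relu(nn.Dense(256)(h))
            sigma = nn.relu(nn.Dense(1)(h))      # density depends on position only
            h = nn.relu(nn.Dense(128)(jnp.concatenate([h, d], axis=-1)))
            rgb = nn.sigmoid(nn.Dense(3)(h))     # color depends on position and direction
            return rgb, sigma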

Common issues and mitigation
============================

A naive implementation of a neural radiance field produces blurry
results. To fix this, the 5D coordinates are transformed with a
positional encoding (terminology borrowed from the transformer
literature). $F_\Theta$ becomes a composition of two functions,
$F_\Theta = F'_\Theta \circ \gamma$, which significantly improves
performance:

$$\gamma(p) = \bigl(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\bigr)$$

L determines how many frequency levels there are in the positional
encoding, and it can be used to regularize NeRF (low L = smooth). This
is also known as a Fourier feature, and it turns your MLP into an
interpolation tool. Another way of looking at it: a Fourier-feature
based neural network is just a tiny lookup table with extremely high
resolution. Here is an example of applying Fourier features in code:\

    import numpy as np
    import flax.linen as nn  # SCALE, input_dims, NUM_FEATURES are assumed defined elsewhere

    B = SCALE * np.random.normal(size=(input_dims, NUM_FEATURES))  # random Fourier basis
    x = np.concatenate([np.sin(x @ B), np.cos(x @ B)], axis=-1)    # sin/cos features
    x = nn.Dense(features=256)(x)                                  # first MLP layer (inside a Linen module)
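For comparison, the deterministic encoding $\gamma(p)$ defined above
maps each input coordinate through sines and cosines at frequencies
$2^{0}\pi, \ldots, 2^{L-1}\pi$. A sketch of that version (a
hypothetical helper, not the paper's code):

    import numpy as np

    def positional_encoding(p, L):
        """gamma(p): sin/cos features at L frequency levels; p has shape (..., dims)."""
        freqs = (2.0 ** np.arange(L)) * np.pi                 # 2^k * pi for k = 0..L-1
        angles = p[..., None] * freqs                         # broadcast over frequencies
        feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
        return feats.reshape(*p.shape[:-1], -1)               # shape (..., dims * 2 * L)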

[fig:Figure 2]

NeRF also uses hierarchical volume sampling, with a coarse network and
a fine network. This lets NeRF run more efficiently by deprioritizing
regions along the camera ray that are free space or occluded. The
coarse network evaluates the expected color of the ray at $N_{c}$
stratified sample points. Based on these results, the samples are
biased towards the more relevant parts of the volume by rewriting the
coarse estimate as a weighted sum of the sampled colors:

$$\hat{C}_c(r) = \sum_{i=1}^{N_{c}}w_{i}c_{i},
\quad \text{where} \quad
w_{i}=T_{i}\bigl(1-\exp(-\sigma_{i}\delta_{i})\bigr)$$

A second set of $N_{f}$ locations is then sampled from this
distribution (the normalized weights $\hat{w}_{i} = w_{i}/\sum_{j}
w_{j}$ form a piecewise-constant PDF along the ray) using inverse
transform sampling. This method allocates more samples to regions where
we expect visible content.
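A minimal sketch of that inverse transform step (assuming `w` holds the
coarse weights $w_{i}$ and `t_mid` the matching coarse sample depths;
real implementations also interpolate within each bin):

    import numpy as np

    def sample_fine(t_mid, w, n_fine):
        """Draw fine-network sample depths from the coarse weights via CDF inversion."""
        pdf = (w + 1e-5) / np.sum(w + 1e-5)             # normalize weights into a PDF
        cdf = np.concatenate([[0.0], np.cumsum(pdf)])   # piecewise-constant CDF
        u = np.random.rand(n_fine)                      # uniform samples in [0, 1)
        idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(t_mid) - 1)
        return t_mid[idx]                               # depths biased toward high-weight bins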

Results
=======

The paper goes in depth on quantitative measures of the results, on
which NeRF outperforms existing models. A visual assessment is shared
below:

[fig:Figure 3]

Additional references
=====================

[What’s the difference between ray tracing and
rasterization?](https://blogs.nvidia.com/blog/2018/03/19/whats-difference-between-ray-tracing-rasterization/)
Self-explanatory title; an excellent write-up helping the reader
differentiate between the two concepts.\
[Matthew Tancik NeRF ECCV 2020 Oral](https://www.matthewtancik.com/nerf)
Videos showcasing NeRF-produced images.\
[NeRF: Representing Scenes as Neural Radiance Fields for View
Synthesis](https://towardsdatascience.com/nerf-representing-scenes-as-neural-radiance-fields-for-view-synthesis-ef1e8cebace4)
A simple, alternative explanation of NeRF.\
[NeRF: Representing Scenes as Neural Radiance Fields for View
Synthesis](https://arxiv.org/pdf/2003.08934.pdf) The arXiv paper.\
[CS 231n Spring 2021 Jon Barron Guest
Lecture](https://stanford-pilot.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=66a23f12-764c-4787-a48a-ad330173e4b5)