404—GEN Whitepaper

Drafted 29.02.2024

Abstract

404 provides a platform to democratize 3D content creation, ultimately allowing anyone to create virtual worlds, games and AR/VR/XR experiences via SN17 on Bittensor.

404 leverages the existing fragmented and diverse landscape of open-source 3D content generation models - including Gaussian Splatting, Neural Radiance Fields, and 3D diffusion models - to facilitate innovation. This is an ideal landscape in which to construct decentralized, incentive-based networks via Bittensor.

We aim to kickstart the next revolution in gaming around AI native games.

The interconnectivity of Bittensor subnets can facilitate experiences in which assets, voice and sound are all generated at runtime. This would effectively allow a creative individual without any coding or game-dev experience to simply describe the game they want to create and have it manifested before them in real time.

Introduction

The creation of virtual environments relies on individuals manually modeling, sculpting and/or procedurally scripting 3D digital assets. Even for those with professional training, this process is time-consuming, inflexible and expensive. For this reason, traditional applications such as gaming have exceptionally high barriers to entry: titles take years to build, cost hundreds of thousands of dollars to create, and require technical skill sets beyond creativity.

This problem is getting worse. Computing power is growing at an exponential rate, and so are the expectations of digital consumers regarding the size, density and visual fidelity of virtual worlds. The demand extends far beyond gaming into entertainment more broadly (film and VFX), as well as consumer and retail applications. Recent consumer hardware advances mean this demand will multiply as AR, VR and XR experiences become mainstream and are expected for existing (and new) applications within months and years.

The barriers to entry must come down so that we can democratize 3D content creation.

This is not only a viable solution to meeting the exponentially growing demand for such services; it is also the way to ensure competition and spur innovation in these markets. By allowing non-professional creators to build 3D assets and virtual worlds from text prompts, it makes creativity the differentiating factor rather than technical ability or financial and incumbent positions.

Ultimately, AI allows the creation experience to be made more democratic, efficient and even automated, but the creator should always be able to intervene and express directorial control over any decision.

This shifting economic and societal landscape comes at a time when 3D AI is poised to explode due to the technology advances of the last 12 months. For precedent, we can look to the last three years of dramatic improvements in state-of-the-art (SOTA) 2D AI models, driven by the introduction of Transformer networks and the subsequent creation of foundational models. 3D AI technology has gone through similar research innovations and is now at the stage at which foundational models can be built.

3D AI technology and miner considerations

A variety of technologies have been developed to tackle the problem of 3D content creation. These techniques accept text prompts, images, or a combination of both as input, which means users with no experience in 3D modeling or other traditional forms of 3D content creation can become creators. These models typically generate multiple synthetic views of the desired object or scene and then reconstruct a mesh, radiance field or splat representation from those views.

The 3D space is still so nascent that a 'market winning' technology has yet to be determined, with approaches such as 3D diffusion, Neural Radiance Fields (NeRFs), and Gaussian Splatting (splats) all competing with different underlying neural network architectures. Research into all three of these approaches is rapidly developing, and the benchmark-leading solution changes nearly every week.

In a landscape such as this - with competing, rapidly developing and open source models - Bittensor offers an ideal decentralized incentive-based platform for empowering innovation.

It is important to note that the current SOTA 3D AI networks are not yet at a level that rivals professional 3D content creation, but the speed at which these networks are developing suggests a trajectory in which they will do so in 2024. Further, despite aesthetic limitations (compared to professional 3D artists), the insatiable demand for 3D content means there are already viable applications today, including 1) background / non-hero assets, 2) environments for virtual worlds, and 3) abstract and/or highly stylized assets.

To help understand the technical differences between these networks, a brief overview of the three key technologies powering this revolution is provided below:

Neural Radiance Fields (NeRFs)

Neural Radiance Fields (Mildenhall et al., 2020; Gao et al., 2023) is a machine-learning-based approach that synthesizes novel views of a complex scene and allows the user to generate its 3D representation. The input scene is defined as a set of images with known camera poses. The scene is represented as a continuous 5D vector field whose inputs are a spatial location (x, y, z) and a viewing direction (θ, φ); this continuous 5D function is approximated with a multilayer perceptron (MLP). The output of this network is a volume density and a view-dependent emitted radiance at the queried spatial location. NeRFs are computationally heavy because computing the emitted radiance requires approximating the volume rendering integral along every camera ray.
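Following Mildenhall et al. (2020), the expected color of a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ is

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right),$$

where $\sigma$ is the volume density, $\mathbf{c}$ the view-dependent emitted radiance, and $T(t)$ the accumulated transmittance along the ray; in practice the integral is estimated by quadrature over points sampled along each ray.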

In addition, NeRFs have high memory consumption. Surfaces are usually not clearly defined by volume density fields, and extracting them from the density-based representation often leads to noisy results.

Relevant to making this technology more viable within the subnet, we researched several methods proposed to overcome these computation costs. One of them is "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding" (Müller et al., 2022). The authors replace the large MLP with a much smaller network augmented by a multiresolution hash encoding of trainable feature vectors, which cuts training time from hours to minutes on a single GPU. Two additional influential works are Mip-NeRF (Barron et al., 2021) and Mip-NeRF 360 (Barron et al., 2022). The latter extends Mip-NeRF to unbounded scenes and introduces several important techniques: prediction of appropriate sampling intervals for volumetric density, a novel scene parametrization constructed for the Gaussians in Mip-NeRF, and a new regularization method that helps prevent floater geometric artifacts and background collapse. A final remarkable advancement in the NeRF field worth noting is "Neuralangelo: High-Fidelity Neural Surface Reconstruction" (Li et al., 2023), in which the authors achieved a significant quality improvement in the extracted surfaces.
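To make the hash encoding idea concrete, below is a minimal sketch of the per-level spatial hash from the paper (the XOR-of-primes scheme); the function name is ours, and the real implementation is a fused CUDA kernel rather than NumPy.

```python
import numpy as np

# Primes from Muller et al. (2022); the first coordinate uses 1 so that
# dense, low-resolution levels remain cache-coherent.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_grid_index(voxel_coords: np.ndarray, table_size: int) -> np.ndarray:
    """Map integer 3D voxel corners to slots in a fixed-size feature table."""
    c = voxel_coords.astype(np.uint64)
    h = c[..., 0] * PRIMES[0]
    h ^= c[..., 1] * PRIMES[1]
    h ^= c[..., 2] * PRIMES[2]
    return (h % np.uint64(table_size)).astype(np.int64)

# At each of L resolutions, a query point's 8 surrounding voxel corners are
# looked up through this hash and trilinearly interpolated; concatenating the
# per-level features lets a very small MLP stand in for NeRF's large one.
```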

Relevant to the applications addressed in this paper, NeRFs can be editable. For instance, Edit-NeRF (Liu et al., 2021) provides NeRF editability via image conditioning from user input.

Gaussian Splatting

3D Gaussian Splatting (Kerbl et al., 2023) represents a scene as a set of anisotropic 3D Gaussians, each parametrized by a center position, a 3D covariance matrix, an opacity and a view-dependent color. To keep the covariance matrix valid (positive semi-definite) throughout optimization, it is decomposed as

$$\Sigma = R\,S\,S^{T}R^{T},$$

where S is a scaling matrix and R is a rotation matrix. Using the covariance matrix defined in this form, it is possible to optimize it with stochastic gradient descent.

To accurately capture the view-dependent appearance of the scene, the spherical harmonics coefficients representing the color of each 3D Gaussian should also be optimized. This optimization is interleaved with steps that adaptively control the density of the Gaussians. Following the survey of Chen and Wang (2024), the Gaussian Splatting representation can be defined as:
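a collection of 3D Gaussians, each characterized by its center $\boldsymbol{\mu}$, covariance $\Sigma$, opacity $\alpha$ and spherical harmonics color coefficients, where each Gaussian contributes a density of the form

$$G(\mathbf{x}) = e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})}.$$

Rendering proceeds by projecting ("splatting") the Gaussians onto the image plane and alpha-blending them in depth order for each pixel, which is fast enough for real-time radiance field rendering.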

3D Diffusion

Diffusion models are a class of latent-variable generative models composed of three parts: a forward process, a reverse process, and a sampling procedure. The goal of a diffusion model is to learn a diffusion process that generates the probability distribution of a given dataset. Such models were among the first to be used in text-to-3D generative pipelines. 3D generative models can be trained on explicit representations of structure (e.g., voxels or point clouds). The generative model is trained to slowly add structure to initial random noise via a predefined transition (Poole et al., 2022).
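For intuition, here is a minimal sketch of the standard DDPM-style forward (noising) process; it is a generic illustration with commonly used default schedule values, not the training code of any specific model discussed here.

```python
import torch

T = 1000                                        # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule (common default)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product: abar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    noise = torch.randn_like(x0)
    ab = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise, noise

# A denoising network is trained to predict `noise` from (x_t, t); the reverse
# process then "slowly adds structure" by iteratively denoising pure Gaussian
# noise into a sample, e.g. a voxel grid or point cloud in the 3D setting.
```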

Combinations of these techniques are also emerging. NeRF (Poole et al., 2022; Lin et al., 2023; Melas-Kyriazi et al., 2023) and Gaussian Splatting (Liang et al., 2023; Tang et al., 2023; Wang et al., 2023) approaches have become critical building blocks for inverse rendering and novel-view generation within generative systems. In DreamFusion (Poole et al., 2022), the authors combined a pretrained 2D text-to-image diffusion model and the Score Distillation Sampling (SDS) method with a NeRF model for generating 3D objects. This method has two notable drawbacks: extremely slow NeRF optimization and low-resolution image-space supervision of the NeRF. In follow-up work (Lin et al., 2023), researchers proposed Magic3D, another 3D generative model that outperforms DreamFusion and addresses both drawbacks. Their approach has two major stages. First, a coarse model is obtained using a low-resolution diffusion prior, with computation accelerated by a 3D hash grid structure. The coarse representation is then further optimized as a textured 3D model using an efficient differentiable renderer that interacts with a high-resolution latent diffusion model.
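The sketch below illustrates the Score Distillation Sampling update at the core of this family of methods (our own minimal rendition, after Poole et al., 2022); the renderer, the frozen noise predictor and the weighting w(t) are stand-ins, not any paper's reference implementation.

```python
import torch

def sds_step(render, eps_model, text_emb, alphas_bar, optimizer):
    """One Score Distillation Sampling step (sketch, after Poole et al., 2022).

    render:    differentiable renderer mapping 3D scene parameters -> image
    eps_model: frozen 2D diffusion noise predictor eps(x_t, t, text)
    """
    x = render()                                     # image rendered from the 3D scene
    t = torch.randint(0, len(alphas_bar), (1,))      # random diffusion timestep
    ab = alphas_bar[t]
    noise = torch.randn_like(x)
    x_t = ab.sqrt() * x + (1.0 - ab).sqrt() * noise  # forward-noise the render
    with torch.no_grad():
        eps_hat = eps_model(x_t, t, text_emb)        # frozen model's noise estimate
    w = 1.0 - ab                                     # one common choice of w(t)
    # DreamFusion's key simplification: skip the U-Net Jacobian and push
    # w(t) * (eps_hat - noise) back through the renderer to the 3D parameters.
    optimizer.zero_grad()
    x.backward(gradient=w * (eps_hat - noise))
    optimizer.step()
```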

Further, the DreamGaussian model (Tang et al., 2023) used Gaussian Splatting instead of NeRF for multi-view reconstruction, which significantly speeds up generation to several minutes. The pipeline allows the user to generate a 3D model from a text prompt or from a single image. The authors also addressed the problem of accurate mesh extraction from Gaussian splats and proposed a technique for sharpened texture extraction. LucidDreamer (Liang et al., 2023) is another model that incorporates Gaussian Splatting within a text-to-3D generation pipeline. To solve the problem of over-smoothing caused by the Score Distillation Sampling technique, the authors proposed a novel sampling method called Interval Score Matching, which uses deterministic diffusion trajectories and interval-based score matching to counteract over-smoothing.

Overall, our tests and research suggest that Gaussian Splatting is computationally more efficient and produces aesthetically higher-quality results than the other techniques. Therefore, although the subnet's validation mechanism can accurately evaluate results generated by a diversity of neural networks (as long as the output is a correctly formatted 3D representation), the sample miner code is set up for splats.

Subnet Evaluation / Validation

While such a nascent and diverse technology landscape is ideal for builders and miners seeking innovation within the space, it makes the subnet's evaluation system for open-source models (the validation mechanism) particularly challenging. We have sought to tie key evaluation criteria to those factors which have meaningful implications for both builders (visual fidelity, editability) and miners (training and inference time, memory efficiency), understanding that the distinction is not mutually exclusive.

A persistent challenge faced by subnets operating within AI/ML networks is the evaluation of miners' output when mining results are inherently non-deterministic. Traditional scoring mechanisms struggle in these environments due to the absence of a single, definitive answer, introducing a degree of subjectivity that could be exploited, for instance through prompt engineering or targeted model fine-tuning. This predicament is compounded when requests are issued by both subnet users and validators, amplifying the uncertainty surrounding the "correct" response.

One conventional approach to establishing a benchmark for evaluation is the construction of a curated dataset consisting of predetermined question-answer pairs. Validators then use this dataset to gauge the quality of answers provided by miners. However, this method not only generates additional work with dubious added value but also carries the risk of 'dataset leakage', where miners might access the "correct" answers in advance, leading to manipulation (superficial score improvements) of the evaluation process.

Our solution circumvents these issues by leveraging image references to anchor the evaluation process and minimize subjectivity. These references can either accompany the input directly or be synthesized by image/multimodal subnets within the Bittensor ecosystem. Although image generation introduces some level of subjectivity, advancements in image generation models mitigate this concern significantly. By incorporating image references, our methodology fosters an objective and fair framework for scoring miner performance without the need for a static dataset of known answers, thereby preserving the integrity and reliability of the network.
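As a hypothetical illustration of this image-reference anchoring, a validator could render the miner's submitted 3D representation from several viewpoints and score it by embedding similarity to the reference image. The sketch below uses the public CLIP model via Hugging Face; the function name, model choice and mean-pooling rule are illustrative assumptions, not the subnet's actual scoring code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical validator-side scorer: compare renders of the miner's 3D
# output against the image reference in CLIP's joint embedding space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_submission(rendered_views: list[Image.Image], reference: Image.Image) -> float:
    """Mean cosine similarity between the reference and each rendered view."""
    inputs = processor(images=[reference, *rendered_views], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize embeddings
    ref, views = feats[0], feats[1:]
    return (views @ ref).mean().item()                 # higher = closer to reference
```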

Our initial setup is oriented around onboarding miners and validators with the clear goal of generating synthetic 3D datasets organized into game-type and style categories, so that results can be used as asset packs for immediate applications (as detailed later in this whitepaper). From this foundation, we will continually expand the diversity and quality of the datasets to facilitate new creative applications and better represent user needs with every iteration (while still recording data and model provenance).

This initial setup builds asset packs that over time will rival, and ultimately dwarf, even the largest online 3D asset stores (Sketchfab, Unity Asset Store, etc.) via synthetic traffic. It is ultimately a bridge to our longer-term system, which becomes increasingly bespoke, organic, and user-oriented as game developers, creators and other builders begin directly interacting with the subnet through web2 front-end interfaces and partnerships.

Applications

We expect a variety of diverse real-world applications to emerge, built on top of the 3D AI technology innovations facilitated and empowered by this subnet. In this section, we outline some tangible applications that we will actively facilitate through partnerships, collaborations and front-end interfaces. These applications focus on the gaming vertical, as our team has strong domain knowledge in this industry and sees strong web3 and web2 opportunities arising here. One could easily expand these applications into related entertainment verticals, especially given that the entertainment market as a whole is moving towards gamification - exemplified by the February 2024 Disney x Epic partnership announcement. Attention is becoming an increasingly scarce resource, and passive consumption is being replaced by active engagement in the same way that finalized products in the entertainment industry are being replaced by continuous creation.

“We compete with (and lose to) Fortnite more than HBO” (Netflix Q4 2018 Letter to Shareholders)

Immediate / Short Term: Single vertical game asset creation

We anticipate that some of the first widely adopted use cases will start as modding and vertical creation around a specific game genre and/or art style. This directly ties to our initiative to build out datasets aligned with common gaming genres and aesthetics. As a point of reference, one can consider the narrower focus of Unreal Editor for Fortnite relative to the broader Unreal Engine.

Medium Term: Broader world building platform

Our short-term applications, including the datasets created by miners and validators, can act as a template for a larger flywheel: building games as drivers of constant engagement within expanding platforms, in which each new interconnected game furthers the network effects of the ecosystem. Enabled by AI and Bittensor, this should be possible in a fraction of the time and cost of game development of the last decade. And if historical precedent holds, the first games produced using our subnet do not need to be a success; rather, they simply need to foster growth on the platform and kickstart a creator flywheel until a viral hit is achieved (e.g. Fortnite Battle Royale).

Long Term: Foundational creation engine(s) / OS for entertainment verticals

As viral games and experiences are created using the subnet, this will create more interest and incentive to build atop the subnet, and therefore increase the number of viral games and experiences produced. Such traction will lead to network effects that ultimately facilitate breakthroughs and expansion beyond single games and even platforms. 3D AI combined with decentralization, as envisioned by the 3D Gen Subnet, will enable a world in which an IP, once created, can be iterated upon in multiple AI-native forms. Other creators can mod games built using the subnet, and/or builders can facilitate this by enabling UGC that relies on the same Bittensor subnet. Furthermore, complementary features like NPCs and/or AI-enabled dialogue would naturally leverage other subnets, given the interconnected potential of the Bittensor system. In combination, this would kickstart the next revolution in gaming around AI-native games in which assets, voice and sound are all generated at runtime, effectively allowing a creative individual without any coding or game-dev experience to simply describe the game they want to create and have it manifested before them in real time.

Such applications eventually extend beyond our subnet and by their nature will leverage the broader Bittensor ecosystem. Given 3D experiences are an amalgamation of art, sound, animation, code, physics and more, a fully AI-native engine will not call underlying generative models in isolation (e.g. those provided by 3D Gen Subnet) but rather will be multi-modal, leveraging semantic scene or world understanding as input. This is a unique power of decentralized AI.

This is just the beginning.

The applications above focus on a single vertical (gaming), but a larger societal shift towards immersive environments suggests numerous additional applications beyond gaming: VFX, retail, fashion design, architecture and urban planning, AR/VR/XR experiences, and many more of which we are not yet aware. This hints at the true value of such a decentralized system: it incentivizes the community to build a broad range of initiatives and applications that discover and capture the highest value-added activities of 3D AI atop our subnet.

Conclusion

We have proposed a system to democratize 3D content creation, thereby lowering the barrier to entry for the entertainment and media industries and ultimately empowering any individual to create games, virtual worlds and AR/VR/XR experiences. This system is intrinsically reliant upon the decentralized intelligence of miners incentivized to build upon fragmented and rapidly evolving SOTA 3D AI models. It systematically enables validation of quality and aesthetics by combining synthetic dataset building (e.g. asset packs) with organic, user-oriented requests. Finally, it will excel by tapping into the larger interconnected nature of Bittensor, combining various AI subnets to seed virtual worlds and experiences with the potential to kickstart the next revolution in gaming.

About 404

404 is a team of EU-grant funded AI researchers, gaming industry veterans, and blockchain-native builders. For over three years, we have developed proprietary 3D generative AI solutions for AAA web2 game developers and entertainment industry leaders. We believe this subnet offers a unique opportunity to leverage our industry networks, technology and expertise to onboard web2 industry players into web3.

404.xyz

References

  1. J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan, “Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5855–5864.

  2. J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman, “Mip-nerf 360: Unbounded anti-aliased neural radiance fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5470–5479.

  3. G. Chen and W. Wang, “A survey on 3D Gaussian Splatting,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  4. B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3D Gaussian Splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, 2023.

  5. Y. Liang, X. Yang, J. Lin, H. Li, X. Xu, and Y. Chen, “Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching,” arXiv preprint arXiv:2311.11284, 2023.

  6. Z. Li, T. Müller, A. Evans, R. H. Taylor, M. Unberath, M.-Y. Liu, and C.-H. Lin, “Neuralangelo: High-fidelity neural surface reconstruction,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

  7. C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin, “Magic3d: High-resolution text-to-3d content creation,” arXiv preprint arXiv:2211.10440, 2022.

  8. S. Liu, X. Zhang, Z. Zhang, R. Zhang, J.-Y. Zhu, and B. Russell, “Editing conditional radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5773–5783.

  9. L. Melas-Kyriazi, C. Rupprecht, I. Laina, and A. Vedaldi, “Realfusion: 360° reconstruction of any object from a single image,” arXiv preprint, 2023.

  10. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in European Conference on Computer Vision. Springer, 2020, pp. 405–421.

  11. T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Trans. Graph., vol. 41, no. 4, pp. 102:1–102:15, Jul. 2022. [Online]. Available: https://doi.org/10.1145/3528223.3530127

  12. B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text-to-3d using 2d diffusion,” arXiv preprint arXiv:2209.14988, 2022.

  13. J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng, “Dreamgaussian: Generative gaussian splatting for efficient 3d content creation,” arXiv preprint arXiv:2309.16653, 2023.

  14. X. Li, H. Wang, and K.-K. Tseng, “Gaussiandiffusion: 3d gaussian splatting for denoising diffusion probabilistic models with structured noise,” arXiv preprint arXiv:2311.11221, 2023.

  15. Disney and Epic Games to Create Expansive and Open Games and Entertainment Universe Connected to Fortnite. Press Release: https://thewaltdisneycompany.com/disney-and-epic-games-fortnite/

  16. Netflix Q4 2018 Letter to Shareholders. Source: https://s22.q4cdn.com/959853165/files/doc_financials/quarterly_reports/2018/q4/FINAL-Q418-Shareholder-Letter.pdf

  17. Troy Kirwin and Jonathan Lai. Unbundling the Game Engine: The Rise of Next Generation 3D Creation Engines. 2024.
