Gateway API Whitepaper
Drafted 29.05.2025
Motivation and Overview
In preparation for several large partnerships, public API access, and an expected increase in organic traffic, Subnet 17 sought a robust and efficient subnet-access solution that could improve the way validators work with organic traffic while also creating a foundation for future monetization strategies and additional functionality as business needs evolve.
Subnet 17 works with large 3D model files, so this project is designed not only for high throughput but also to support advanced statistics and dynamic load balancing based on key parameters such as latency and the number of tasks available in each region. This design provides exceptional stability and performance, making it capable of handling 100,000+ user requests per second while laying a solid foundation for future enhancements.
Key Components of the Project:
• Consensus Engine – Based on Raft (OpenRaft), the system achieves distributed consensus across nodes and shares the internal state with every node in the cluster.
• Transport Protocol – QUIC transports Raft messages, ensuring fast, low-latency communication while also encrypting all data in transit.
• API Layer – The interface operates over HTTP/3, adhering to modern web standards for responsive client interactions.
• Encryption – Rustls provides TLS 1.3 encryption to secure communication channels.
• Concurrency Model – The solution is entirely non-blocking and employs lock-free algorithms throughout; the only exception is an RwLock used in the commit log for consistency (see the sketch after this list).
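As a rough illustration of that concurrency model, the sketch below pairs a lock-free atomic counter for frequently updated values with an RwLock-guarded commit log. The type and field names are assumptions for illustration, not the actual Gateway code.

    use std::sync::RwLock;
    use std::sync::atomic::{AtomicU64, Ordering};

    // Hypothetical log entry; the real Gateway entry format is not shown in this document.
    struct LogEntry {
        index: u64,
        payload: Vec<u8>,
    }

    // Hot counters such as the number of available tasks are updated lock-free
    // with atomics, while the commit log sits behind the single RwLock noted above.
    struct NodeState {
        available_tasks: AtomicU64,
        commit_log: RwLock<Vec<LogEntry>>,
    }

    impl NodeState {
        // Lock-free, saturating decrement: concurrent readers never block.
        fn task_taken(&self) {
            let _ = self
                .available_tasks
                .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| n.checked_sub(1));
        }

        // The one blocking point: appending an entry takes the commit-log write lock.
        fn append(&self, entry: LogEntry) {
            self.commit_log.write().expect("commit log lock poisoned").push(entry);
        }
    }

    fn main() {
        let state = NodeState {
            available_tasks: AtomicU64::new(3),
            commit_log: RwLock::new(Vec::new()),
        };
        state.task_taken();
        state.append(LogEntry { index: 1, payload: b"task".to_vec() });
        println!("available tasks: {}", state.available_tasks.load(Ordering::SeqCst));
    }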
How Raft (OpenRaft) Works
Every node initially functions as a voting member, and the first thing the nodes do together is elect a leader. Here's how the process unfolds (a simplified election sketch follows the list):
1. All nodes start as voting members (also known as followers) and participate in the initial leader election.
2. If a node doesn’t hear from an active leader (through heartbeat messages or similar signals), it considers the leader unresponsive, transitions to a candidate state, and initiates an election.
3. The candidate solicits votes from the other nodes—each of which is already a voting member—and if it secures a majority, it is elected as the new leader.
4. With a leader now in place, the leader takes on the responsibility of sending out regular heartbeats and log entries to the followers, ensuring data consistency across the cluster.
5. Maintaining cluster integrity requires that a majority (more than half) of the nodes remain online and responsive, securing the quorum needed for both leader elections and log replication.
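The following is a minimal, self-contained sketch of the follower-to-candidate-to-leader transition described above. It illustrates only the election rules; it is not the OpenRaft API, and the types and thresholds are assumptions.

    #[derive(Debug, PartialEq)]
    enum Role {
        Follower,
        Candidate,
        Leader,
    }

    struct Node {
        role: Role,
        term: u64,
        cluster_size: usize,
    }

    impl Node {
        // Step 2: no heartbeat arrived before the election timeout, so the node
        // becomes a candidate and starts a new term.
        fn on_election_timeout(&mut self) {
            self.role = Role::Candidate;
            self.term += 1;
        }

        // Step 3: count the votes received (including the candidate's own vote);
        // a strict majority of the cluster wins the election.
        fn on_votes_received(&mut self, votes: usize) {
            if self.role == Role::Candidate && votes * 2 > self.cluster_size {
                self.role = Role::Leader;
            }
        }
    }

    fn main() {
        let mut node = Node { role: Role::Follower, term: 0, cluster_size: 3 };
        node.on_election_timeout();   // the leader went silent: follower -> candidate
        node.on_votes_received(2);    // own vote plus one follower = majority of 3
        assert_eq!(node.role, Role::Leader);
        println!("elected leader for term {}", node.term);
    }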
Once a leader is chosen, it immediately starts sending heartbeats and begins processing client requests. This election mechanism guarantees that the cluster always has an active leader, while ensuring that only one leader is active at any time to maintain the consistency of our data (in our case the replicated state is simply a BTreeMap, https://doc.rust-lang.org/std/collections/struct.BTreeMap.html, which stores the internal state of all nodes). Each node has its own task queue, which is not synchronized with other nodes; only the size of its queue of available tasks is shared in the global state.
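A minimal sketch of that split between local and replicated data might look like the following. The GatewayInfo fields mirror the JSON shown below, while Task and the queue layout are assumptions for illustration.

    use std::collections::{BTreeMap, VecDeque};

    // Hypothetical task type; the real task payload is not specified in this document.
    struct Task {
        id: u64,
    }

    // Per-gateway entry in the replicated global state (mirrors the JSON shown below).
    struct GatewayInfo {
        available_tasks: u64,
        last_task_acquisition: u64,
    }

    struct GatewayNode {
        node_id: u64,
        // Local only: the actual tasks are never replicated to other nodes.
        local_queue: VecDeque<Task>,
        // Replicated via Raft: every node converges on the same map.
        global_state: BTreeMap<u64, GatewayInfo>,
    }

    impl GatewayNode {
        // Hand out one task: the local queue shrinks, and only the queue size and
        // the acquisition timestamp are written into the replicated global state.
        fn take_task(&mut self, now: u64) -> Option<Task> {
            let task = self.local_queue.pop_front()?;
            let entry = self
                .global_state
                .entry(self.node_id)
                .or_insert(GatewayInfo { available_tasks: 0, last_task_acquisition: 0 });
            entry.available_tasks = self.local_queue.len() as u64;
            entry.last_task_acquisition = now;
            Some(task)
        }
    }

    fn main() {
        let mut node = GatewayNode {
            node_id: 1,
            local_queue: VecDeque::from(vec![Task { id: 42 }]),
            global_state: BTreeMap::new(),
        };
        if let Some(task) = node.take_task(1746010976) {
            println!("handed out task {}", task.id);
        }
        let info = &node.global_state[&1];
        println!("available_tasks = {}, last_task_acquisition = {}",
            info.available_tasks, info.last_task_acquisition);
    }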
In the cluster, write operations are always handled by the leader. When a node needs to write data, it sends the request directly to the leader. The leader then appends the new entry to its local log and initiates a replication process by sending the entry to the follower nodes. Each follower appends the received entry to its own commit log. Once a majority of nodes have successfully stored the entry, the leader marks it as committed and applies it to its state machine. Only after this confirmation does the client receive an updated response. This process ensures data consistency across the entire cluster.
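To make the commit rule concrete, here is a hedged sketch of the leader-side bookkeeping: an entry counts as committed once a strict majority of the cluster (leader included) has stored it, and only then is it applied to the state machine. This is a simplified model, not the OpenRaft implementation.

    // Returns true once `acks` nodes (including the leader itself) have durably
    // stored the entry; Raft requires a strict majority of the cluster.
    fn is_committed(acks: usize, cluster_size: usize) -> bool {
        acks * 2 > cluster_size
    }

    fn main() {
        let cluster_size = 3;
        // The leader has the entry locally (1 ack), then one follower confirms (2 acks).
        assert!(!is_committed(1, cluster_size));
        assert!(is_committed(2, cluster_size)); // 2 of 3 is a majority: commit and apply
        println!("entry committed with 2/{} acknowledgements", cluster_size);
    }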
Gateways' Global State and JSON Representation
All nodes in the network maintain one global state, and this state is represented using JSON. This JSON structure provides a consistent and unified snapshot of the cluster's status—including each node's details and its respective available_tasks. Because the global state is replicated among all nodes through the Raft log replication process, any node in the network can serve queries about the current state with minimal delay.
For instance, consider the following internally shared JSON structure that represents the entire cluster:
{
  "gateways": [
    {
      "node_id": 1,
      "domain": "gateway-eu.404.xyz",
      "ip": "5.9.29.227",
      "name": "node-1-eu",
      "http_port": 4443,
      "available_tasks": 0,
      "last_task_acquisition": 1746010976,
      "last_update": 1746011013
    },
    {
      "node_id": 2,
      "domain": "gateway-us-east.404.xyz",
      "ip": "3.226.98.135",
      "name": "node-2-us-east",
      "http_port": 4443,
      "available_tasks": 0,
      "last_task_acquisition": 1746010975,
      "last_update": 1746011013
    },
    {
      "node_id": 3,
      "domain": "gateway-us-west.404.xyz",
      "ip": "13.56.102.231",
      "name": "node-3-us-west",
      "http_port": 4443,
      "available_tasks": 0,
      "last_task_acquisition": 1746010978,
      "last_update": 1746011014
    }
  ]
}
Notice that any change made to the available_tasks field is propagated throughout the cluster in a timely manner through log replication. As soon as an update is committed by the leader and applied to the state machines of a majority of nodes, the global JSON state is refreshed. This ensures that if you query any node in the network, you'll receive the most up-to-date information with minimal delay.
The field last_task_acquisition indicates the most recent time a task was obtained from the Gateway. This helps validators develop strategies to balance their subsequent requests effectively and decide whether to collect a task from a specific gateway, especially when the task count is very low.
The validators also receive the structure above whenever they request tasks from any Gateway. This way, they know in advance which gateway to query next and can balance across regions (for example, by combining latency and last_task_acquisition).
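As an illustration of how a validator might consume this structure, the sketch below deserializes the gateway list and scores each gateway by available tasks, idle time since last_task_acquisition, and a locally measured latency. The scoring weights and the latency source are assumptions, not part of the Gateway specification; the serde and serde_json crates are assumed as dependencies.

    use serde::Deserialize;

    // Mirrors the fields of the gateway JSON shown above; remaining fields
    // (ip, name, http_port, last_update) are ignored by serde's defaults.
    #[derive(Deserialize)]
    struct GatewayStates {
        gateways: Vec<Gateway>,
    }

    #[derive(Deserialize)]
    struct Gateway {
        node_id: u64,
        domain: String,
        available_tasks: u64,
        last_task_acquisition: u64,
    }

    // Hypothetical scoring: prefer gateways that have tasks waiting, have not been
    // drained recently, and respond quickly from this validator's location.
    // The weights are illustrative only.
    fn pick_gateway<'a>(
        state: &'a GatewayStates,
        latency_ms: impl Fn(u64) -> u64, // latency measured locally per node_id
        now: u64,
    ) -> Option<&'a Gateway> {
        state
            .gateways
            .iter()
            .filter(|g| g.available_tasks > 0)
            .max_by_key(|g| {
                let idle = now.saturating_sub(g.last_task_acquisition).min(60) as i64;
                let latency = latency_ms(g.node_id).min(500) as i64;
                g.available_tasks as i64 * 10 + idle - latency / 10
            })
    }

    fn main() {
        let json = r#"{"gateways":[{"node_id":1,"domain":"gateway-eu.404.xyz","available_tasks":3,"last_task_acquisition":1746010976}]}"#;
        let state: GatewayStates = serde_json::from_str(json).expect("valid gateway state");
        if let Some(gw) = pick_gateway(&state, |_node_id| 40, 1746011013) {
            println!("next gateway: {}", gw.domain);
        }
    }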
This architecture addresses current high-demand environments and is structured to support further development in analytics, convenient API key management, load balancing, and monetization, ensuring a robust, future-proof platform that can scale beyond conventional limitations.
Validator Behavior and Optimization
Validators are engineered to process and distribute tasks in a highly efficient, adaptive manner. They rely on the latest global state, which includes real-time details such as available_tasks and last_task_acquisition, combined with locally measured latency. Key aspects include:
• Validators initially source tasks from the geographically closest gateway node, minimizing latency and response times.
• Each validator maintains an optimized task queue that minimizes both gateway queue time and validator processing time, ensuring efficient throughput.
• The system implements dynamic load balancing: validators monitor the load across gateway nodes and can pull tasks from distant nodes when necessary to maintain overall system performance.
• Latency tracking mechanisms allow validators to prioritize gateways based on response metrics, continuously adjusting to minimize the average task delivery time to end users.
• Validators retain flexibility in traffic participation, with the option to opt out of serving organic traffic based on their resource allocation preferences.
• For customized deployment, validators can establish their own gateway clusters to attract and manage dedicated traffic streams.
• Advanced configuration options enable validators to interface with multiple separate gateway clusters simultaneously, with assigned priorities to optimize workload distribution (a configuration sketch follows this list).
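As a rough sketch of that multi-cluster configuration, a validator could keep an ordered list of clusters and walk it by priority, falling back to the next cluster when the preferred one has no tasks. The ClusterConfig shape and field names are hypothetical; the actual configuration format is not defined in this document.

    // Hypothetical per-cluster configuration for a validator.
    struct ClusterConfig {
        name: String,
        endpoints: Vec<String>, // gateway domains in this cluster
        priority: u8,           // lower value = tried first
    }

    // Walk clusters in priority order and return the first endpoint that
    // reports available tasks, according to a caller-supplied probe.
    fn select_endpoint(
        clusters: &mut [ClusterConfig],
        has_tasks: impl Fn(&str) -> bool,
    ) -> Option<String> {
        clusters.sort_by_key(|c| c.priority);
        clusters
            .iter()
            .flat_map(|c| c.endpoints.iter())
            .find(|endpoint| has_tasks(endpoint.as_str()))
            .cloned()
    }

    fn main() {
        let mut clusters = vec![
            ClusterConfig {
                name: "own-cluster".into(),
                endpoints: vec!["gateway-eu.404.xyz".into()],
                priority: 0,
            },
            ClusterConfig {
                name: "public-cluster".into(),
                endpoints: vec!["gateway-us-east.404.xyz".into()],
                priority: 1,
            },
        ];
        // Pretend only the second cluster currently has tasks.
        if let Some(endpoint) = select_endpoint(&mut clusters, |e| e.contains("us-east")) {
            println!("pulling tasks from {endpoint}");
        }
    }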
Regarding miners, the established prioritization remains unchanged—organic traffic maintains priority status and is directed to miners first, preserving the core workflow while enhancing the overall system capabilities.
Conclusion
As Subnet 17 prepares for increased organic traffic, we believe this project provides the basis for the flexibility and control necessary not only to grow in scale but also to enable new types of interactions and monetization strategies aligned with the changing business needs of the subnet.
By open-sourcing the underlying code here and detailing the conceptual and technical reasoning behind the project above, we also hope that other subnet owners and validators may benefit from similar implementations.