Building Redis from Scratch #2: Async I/O with Tokio

Background

In my last post, we implemented a simple Redis server from scratch. It works, but it has a critical flaw: it does not scale. Redis is known for handling millions of requests per second. Our simple server cannot operate at that scale; it starts to struggle once thousands of clients are connected at the same time.

The common thread-per-connection model (which we implemented) means that 10,000 connections require 10,000 OS threads, one per connection. A thread is not inherently slower than an async task at executing a request. But as the number of threads grows, they compete with each other for CPU time, and each one needs its own stack. When a server has to sustain a very high request rate across many open connections, the CPU spends a growing share of its time switching between threads and managing them instead of handling requests. For this workload, one thread per connection is not the right choice.
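To put a rough number on the stack cost alone, here is a back-of-the-envelope sketch. It assumes the 2 MiB default stack size that Rust's standard library currently documents for spawned threads, and that nothing overrides it. This is reserved address space, not resident memory, so treat the result as an upper bound on what the OS has to track rather than RAM actually used. The helper names are made up for illustration.

```rust
/// Default stack size Rust's std currently documents for spawned threads (2 MiB).
/// Assumption: not overridden via `std::thread::Builder::stack_size` or RUST_MIN_STACK.
const DEFAULT_STACK_BYTES: u64 = 2 * 1024 * 1024;

/// Address space reserved for thread stacks alone, in bytes.
fn reserved_stack_bytes(threads: u64, stack_bytes: u64) -> u64 {
    threads * stack_bytes
}

/// Convert a byte count to GiB for display.
fn to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0 * 1024.0)
}
```

With 10,000 threads, `to_gib(reserved_stack_bytes(10_000, DEFAULT_STACK_BYTES))` comes out at about 19.5 GiB of reserved stack space, before a single request is handled.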

The alternative is to use an async runtime to handle these connections. In Rust, the most widely used one is Tokio, a crate that provides the building blocks for writing network applications. Instead of spawning one OS thread per connection, Tokio runs a small, fixed pool of worker threads (by default, one per CPU core) and multiplexes all connections across them as lightweight async tasks. 5,000 connections means 5,000 tasks, but on an 8-core machine the OS still sees only about 8 worker threads.
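As a quick sanity check on the "one worker per core" figure, the standard library can report how many hardware threads the OS exposes. The sketch below uses that number as a stand-in for Tokio's default worker count (an assumption: Tokio lets you configure this, and the helpers here are hypothetical, not part of Tokio). The second function assumes tasks are spread perfectly evenly, which a real scheduler does not guarantee.

```rust
use std::thread;

/// Number of hardware threads the OS reports, falling back to 1 if unknown.
fn available_cores() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Connection tasks per worker thread if they were spread evenly, rounded up.
/// `workers` must be greater than zero.
fn tasks_per_worker(connections: usize, workers: usize) -> usize {
    (connections + workers - 1) / workers
}
```

On an 8-core machine, `tasks_per_worker(5_000, 8)` gives 625 tasks per worker thread, compared with 5,000 OS threads in the thread-per-connection design.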

In this post, we will reimplement the same server with Tokio and benchmark it against our previous implementation. Let’s see how it works out.

Mental model

The core difference is not that async makes the CPU faster. It changes what happens while a connection is waiting for I/O.

In the threaded version, every connection owns an OS thread. If that connection is waiting for the client to send the next Redis command, the OS parks that thread. With thousands of mostly idle connections, the process still has thousands of stacks and thousands of schedulable threads for the OS to manage.

In the async version, each connection is represented by a lightweight task. When a task waits for socket I/O, it yields back to Tokio instead of blocking a worker thread. Tokio asks the OS for readiness notifications, then resumes the task when the socket can be read from or written to.

So the benefit is not raw speed per command. The benefit is that many idle or slow connections can be kept open without dedicating one OS thread to each of them.
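To make "yields back instead of blocking" concrete, here is a toy sketch that uses only the standard library. NotReadyOnce is a hypothetical stand-in for a socket read that has no data on the first attempt, and poll_to_completion is a deliberately naive executor that polls in a loop. Tokio does not work like this internally: it parks the task and waits for an OS readiness notification before polling again. The sketch only shows the Pending/Ready handshake that makes that possible.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a socket read that is not ready on the first poll.
struct NotReadyOnce {
    polled: bool,
}

impl Future for NotReadyOnce {
    type Output = &'static str;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.polled {
            // Second poll: pretend the client's command has arrived.
            Poll::Ready("PING")
        } else {
            self.polled = true;
            // Ask to be polled again, then give control back to the executor.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// A waker that does nothing; enough for an executor that polls in a loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Naive executor: polls until the future completes and counts the polls.
/// A real runtime would sleep between polls until the waker fires.
fn poll_to_completion<F: Future>(fut: F) -> (F::Output, usize) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return (out, polls);
        }
    }
}
```

Calling `poll_to_completion(NotReadyOnce { polled: false })` returns `("PING", 2)`: one poll that yields, one that completes. Between those two polls the worker thread is free to run other tasks, which is the whole point.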

What we are building

The requirement is the same as last time.

# This will spin off the server on the default port 6379
$ redis-server

In a new terminal window, we first check that the server responds, then store a key-value pair { "foo": "bar" } inside the Redis store and retrieve the value by passing the key “foo”. It will look something like this:

# PING command to check if Redis server is running
$ redis-cli PING
PONG

# SET command to store a key-value pair
$ redis-cli SET foo bar
OK

# GET command to retrieve a value by key
$ redis-cli GET foo
"bar"

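For context on what the server actually receives: redis-cli sends each command in RESP, the Redis serialization protocol, as an array of bulk strings. The sketch below encodes a command that way. encode_command is a hypothetical helper written for this illustration, not code from the previous post.

```rust
/// Encode a command as a RESP array of bulk strings.
/// Format: `*<count>\r\n`, then `$<len>\r\n<bytes>\r\n` for each argument.
fn encode_command(args: &[&str]) -> Vec<u8> {
    let mut out = format!("*{}\r\n", args.len()).into_bytes();
    for arg in args {
        out.extend_from_slice(format!("${}\r\n", arg.len()).as_bytes());
        out.extend_from_slice(arg.as_bytes());
        out.extend_from_slice(b"\r\n");
    }
    out
}
```

`encode_command(&["SET", "foo", "bar"])` produces the bytes `*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n`. This is the input the Protocol Parser has to turn back into a command and its arguments.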
The Plan

As discussed in the previous post, our implementation is divided into three layers.

Layer 1: TCP Server
Layer 2: Protocol Parser
Layer 3: Command Handler

In the previous implementation, the TCP server listens continuously for incoming connections. Whenever a connection arrives, this layer spawns a worker thread and hands the connection to it, so the main thread can go back to listening for new connections.

The snippet below shows that version. The worker thread reads commands from the connection and passes each request to the next layers: the Protocol Parser parses the request, and the Command Handler executes the command and generates a response.

use std::collections::HashMap;
use std::net::{TcpListener, TcpStream};
use std::sync::{Arc, Mutex};

type Store = Arc<Mutex<HashMap<String, Vec<u8>>>>;

fn main() -> std::io::Result<()> {
    let store: Store = Arc::new(Mutex::new(HashMap::new()));
    let listener = TcpListener::bind("127.0.0.1:6379")?;

    for stream in listener.incoming() {
        let stream = stream?;
        let store = Arc::clone(&store);

        std::thread::spawn(move || {
            if let Err(err) = handle_client(stream, store) {
                eprintln!("Error handling client: {}", err);
            }
        });
    }

    Ok(())
}

fn handle_client(stream: TcpStream, store: Store) -> std::io::Result<()> {
    // read requests from this connection and pass them to the Protocol Parser and Command Handler
    Ok(())
}
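The body of handle_client is left out above. To give an idea of what the Command Handler layer does with a parsed request, here is a minimal sketch. It assumes the parser has already split the request into string arguments, and it covers only PING, SET, and GET. The execute and new_store functions and the error text are hypothetical, not the code from the previous post.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type Store = Arc<Mutex<HashMap<String, Vec<u8>>>>;

/// Create an empty shared store (helper for this sketch).
fn new_store() -> Store {
    Arc::new(Mutex::new(HashMap::new()))
}

/// Hypothetical Command Handler: takes parsed arguments, returns RESP reply bytes.
fn execute(store: &Store, args: &[&str]) -> Vec<u8> {
    // Redis command names are case-insensitive.
    let name = args.first().map(|s| s.to_ascii_uppercase());
    match (name.as_deref(), args) {
        (Some("PING"), [_]) => b"+PONG\r\n".to_vec(),
        (Some("SET"), [_, key, value]) => {
            // The lock is held only for the duration of the insert.
            store
                .lock()
                .unwrap()
                .insert(key.to_string(), value.as_bytes().to_vec());
            b"+OK\r\n".to_vec()
        }
        (Some("GET"), [_, key]) => {
            // Clone the value so the lock is released before building the reply.
            let value = store.lock().unwrap().get(*key).cloned();
            match value {
                Some(bytes) => {
                    // Bulk string reply: $<len>\r\n<bytes>\r\n
                    let mut out = format!("${}\r\n", bytes.len()).into_bytes();
                    out.extend_from_slice(&bytes);
                    out.extend_from_slice(b"\r\n");
                    out
                }
                // Null bulk string: the key does not exist.
                None => b"$-1\r\n".to_vec(),
            }
        }
        _ => b"-ERR unknown command or wrong number of arguments\r\n".to_vec(),
    }
}
```

With a fresh store, `execute(&store, &["SET", "foo", "bar"])` returns `+OK\r\n`, and a following `execute(&store, &["GET", "foo"])` returns `$3\r\nbar\r\n`, which redis-cli displays as "bar".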

Let’s build it

The only thing we have to do is replace the thread-per-connection approach with a task-per-connection one running on Tokio's event loop. Three changes make it asynchronous:

All synchronous APIs have to be replaced with asynchronous ones. For types like TcpListener and TcpStream, we use the versions from tokio::net instead of std::net. The code below also swaps std::sync::Mutex for tokio::sync::Mutex.
Replace std::thread::spawn with tokio::spawn so that every incoming connection gets a new asynchronous task instead of a new OS thread.
Mark main and handle_client as async fn, and .await the calls that used to block.

It will look something like this:

use std::collections::HashMap;
use std::sync::Arc;

use tokio::net::{TcpListener, TcpStream};
use tokio::sync::Mutex;

type Store = Arc<Mutex<HashMap<String, Vec<u8>>>>;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let store: Store = Arc::new(Mutex::new(HashMap::new()));
    let listener = TcpListener::bind("127.0.0.1:6379").await?;

    loop {
        let (stream, _) = listener.accept().await?;
        let store = Arc::clone(&store);

        tokio::spawn(async move {
            if let Err(err) = handle_client(stream, store).await {
                eprintln!("Error handling client: {}", err);
            }
        });
    }
}

async fn handle_client(stream: TcpStream, store: Store) -> std::io::Result<()> {
    // read requests from this connection and pass them to the Protocol Parser and Command Handler
    Ok(())
}

Benchmark

Both implementations were compiled in release mode (cargo build --release) and run locally on the same machine. We used redis-benchmark (the official Redis benchmarking tool) to measure throughput. Each server was started, benchmarked, then killed before the next one started.

The command we ran:

redis-benchmark -p 6379 -n 500000 -c <connections> -t set,get -q

-n 500000: 500,000 total requests per test
-c: number of concurrent connections (varied across runs)
-t set,get: only SET and GET commands
-q: quiet mode, print only the requests-per-second summary
-k 1: persistent connections; not passed explicitly because it is the default, and it matches how real backends connect to Redis

We ran each test at increasing connection counts to see where the two implementations diverge.

Results

Connections	Threaded SET	Async SET	Threaded GET	Async GET
50	165,125 req/s	168,294 req/s	168,350 req/s	121,477 req/s
100	170,126 req/s	153,092 req/s	168,691 req/s	133,120 req/s
200	166,889 req/s	168,976 req/s	167,001 req/s	128,074 req/s
500	160,823 req/s	168,919 req/s	161,551 req/s	124,719 req/s
1,000	158,428 req/s	165,344 req/s	156,055 req/s	122,249 req/s
2,000	141,443 req/s	156,055 req/s	138,889 req/s	112,284 req/s
3,000	128,139 req/s	150,875 req/s	122,309 req/s	105,042 req/s
4,000	120,135 req/s	142,531 req/s	—	101,482 req/s
5,000	—	—	—	—

A dash in the table means the run did not produce a usable result. At 5,000 concurrent connections, neither implementation did. The threaded server ran out of OS threads. The async run hit socket limits at the OS level: even with ulimit raised, macOS imposes additional constraints on the number of concurrent TCP connections per process at this scale. The threaded GET run at 4,000 connections also produced no result.

What the numbers tell us

Low concurrency

At 50–500 connections, there is no clear SET winner. Both implementations sit in the same range, roughly 153k–170k req/s.
This is the boring but important part: threads are not automatically slow.
Threaded GET wins clearly here, at 162k–169k req/s against 121k–133k req/s for async, because the async version pays extra per-request overhead.

High concurrency

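Threaded SET drops from 170k at 100 connections to 120k at 4,000, a 29% degradation.
Async SET drops from its 169k peak at 200 connections to 142k at 4,000, only 16%.
Async SET is already ahead from 200 connections onward, and the gap widens past 2,000: about 10% at 2,000 and 18–19% at 3,000 and 4,000.
Async GET is slower in absolute throughput but degrades more gracefully: from 100 to 3,000 connections it loses about 21%, while threaded GET loses about 27%.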
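The degradation figures can be reproduced directly from the results table. The helper below is a small sketch for checking them; percent_drop is a name chosen for this illustration.

```rust
/// Percentage drop from `from` to `to`, relative to `from`.
fn percent_drop(from: f64, to: f64) -> f64 {
    (from - to) / from * 100.0
}
```

Plugging in the table values: threaded SET, `percent_drop(170_126.0, 120_135.0)`, is about 29.4%, and async SET, `percent_drop(168_976.0, 142_531.0)`, is about 15.7%. For GET between 100 and 3,000 connections, threaded loses about 27.5% (168,691 to 122,309) and async about 21.1% (133,120 to 105,042).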

What is actually scaling better? Async is not making each command faster. It scales better because Tokio does not dedicate one OS thread to each open connection, while the threaded server pays more context-switching and thread-management cost as connections rise.

Summary

The implementation change was small: swap std::thread::spawn for tokio::spawn, replace the sync TCP primitives with their async counterparts, and mark the functions involved as async and .await their calls. The Protocol Parser and Command Handler needed only minimal changes to accommodate async I/O; their core logic did not change.

The benchmark told a more nuanced story than expected. At low concurrency, the threaded server was competitive, and for GET clearly faster, because threads are cheap when there’s almost no I/O wait. The async version paid extra per-request overhead, which we attribute to heap allocations in the parser; it is barely visible on SET but costs GET roughly 21–28% at 50–500 connections. The real difference only showed up under pressure: past 2,000 connections the threaded server's SET throughput fell off faster, and at 5,000 concurrent connections the threaded server hit the OS thread limit, while the async run was blocked by socket limits instead of per-connection thread exhaustion.

A few honest caveats: we benchmarked on loopback (no real network latency); redis-benchmark itself is single-threaded by default, which caps what you can measure at high concurrency; and the thread limit we hit is lower on macOS than on Linux. On a real network with a proper multi-threaded benchmark tool, the gap would likely be more pronounced.

The takeaway isn’t that async is always faster; it isn’t. The takeaway is that thread-per-connection degrades steadily past a couple of thousand concurrent connections and, on this machine, hit a hard ceiling at 5,000. At the kind of scale Redis is designed for, that ceiling matters.