Mailry - Email infrastructure without the friction

# Scaling Email Infrastructure to 10M Emails/Day

Behind the scenes of how we rebuilt our sending architecture to handle massive scale without dropping a single email.

Introduction

Email is often viewed as a solved problem. You connect to an SMTP server, pass in your payload, and the message goes out. But when your volume grows from a few thousand messages a day to 10 million, the complexity grows much faster than the volume does. At 10 million emails per day, you are no longer just sending messages; you are managing a high-throughput, distributed data pipeline with strict latency requirements, external rate limits, and complex deliverability rules.

In this engineering deep dive, we will walk you through the architectural evolution of our email infrastructure. We will explore the bottlenecks we hit, the distributed systems concepts we applied, and the exact technologies we used to build a resilient, highly available email sending platform that never drops a single message.

The Initial Architecture: The Breaking Point

When we first launched, our email sending architecture was straightforward. A monolithic Node.js application handled user requests, composed the email payload, and synchronously dispatched it via a third-party SMTP relay. We tracked bounces and clicks by parsing webhooks that hit a single REST endpoint and updated our PostgreSQL database directly.

This worked perfectly up to about 500,000 emails a day. However, as our user base expanded, the cracks began to show:

Synchronous Blocking: When the SMTP relay experienced latency, our application workers became blocked, causing incoming API requests to queue up and eventually time out.
Database Contention: Processing hundreds of thousands of webhook events for opens, clicks, and bounces caused massive locking and contention in our primary PostgreSQL database.
No Durable Retry Mechanism: If the SMTP server rejected a connection or returned a transient error, the only retries happened in application memory. That approach was memory-intensive, and any message still waiting for a retry was lost whenever a deployment or crash restarted the process.
Deliverability Issues: Sending all emails from a single pool of IP addresses meant that one bad actor sending spam could ruin the reputation of our entire platform.

We realized that to reach 10 million emails a day, we needed a paradigm shift. We had to decouple the submission of an email from its actual transmission.

Redesigning for Scale: Embracing Asynchronous Messaging

The core philosophy of our new architecture was asynchronous processing. We introduced an event-driven architecture built around message queues. After evaluating options including RabbitMQ and Apache Kafka, we settled on Kafka for its high throughput, persistence, and ability to replay events.

The Ingestion Layer

Our new API ingestion layer is incredibly lightweight. When an API request comes in to send an email, the service performs basic validation, authenticates the user, and immediately publishes a generic EmailRequested event to a Kafka topic. It then returns a 202 Accepted response to the client. This decoupled ingestion means our API can handle massive spikes in traffic without breaking a sweat. It merely acts as a fast write proxy to Kafka.
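To make the flow concrete, here is a minimal sketch of such a handler, written in Python for brevity. It is illustrative only: the `InMemoryTopic` class stands in for a real Kafka producer, and the field names (`api_key`, `to`, `template_id`, `data`) are assumptions, not the actual request schema.

```python
import json
import uuid
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class InMemoryTopic:
    """Hypothetical stand-in for a Kafka topic; a real service would use a Kafka producer."""
    name: str
    messages: list = field(default_factory=list)

    def publish(self, key: str, value: bytes) -> None:
        self.messages.append((key, value))


def handle_send_request(body: dict, api_keys: set, topic: InMemoryTopic) -> Tuple[int, dict]:
    """Authenticate, validate, enqueue, and return (status_code, response_body)."""
    if body.get("api_key") not in api_keys:
        return 401, {"error": "invalid API key"}

    missing = [name for name in ("to", "template_id") if not body.get(name)]
    if missing:
        return 400, {"error": "missing fields: " + ", ".join(missing)}

    message_id = str(uuid.uuid4())
    event = {
        "type": "EmailRequested",
        "message_id": message_id,
        "to": body["to"],
        "template_id": body["template_id"],
        "data": body.get("data", {}),
    }
    topic.publish(key=message_id, value=json.dumps(event).encode("utf-8"))

    # 202 means "accepted for processing": nothing has been sent yet.
    return 202, {"message_id": message_id}
```

The handler does no rendering and no SMTP work, which is why it can stay fast under load. Returning the `message_id` gives the client something to correlate with later delivery events.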

The Routing and Rendering Workers

Once the event is in Kafka, a fleet of rendering workers picks it up. These workers are responsible for compiling the email templates (using Handlebars or React Email), injecting dynamic user data, and standardizing the payload. The output is a fully formed, raw MIME message. This compiled message is then pushed to a downstream Kafka topic called EmailReadyForDispatch.

By isolating the CPU-intensive rendering process into its own worker pool, we can scale it independently of the dispatch or ingestion layers.
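As a rough illustration of the rendering step, the sketch below turns an event into a raw MIME message using only the Python standard library. `string.Template` is a deliberately simple substitute for Handlebars or React Email, and the template layout and event fields are assumptions made for the example.

```python
from email.message import EmailMessage
from string import Template


def render_mime(event: dict, templates: dict, sender: str) -> bytes:
    """Compile the template named in the event and return a raw MIME message."""
    template = templates[event["template_id"]]
    data = event.get("data", {})

    message = EmailMessage()
    message["From"] = sender
    message["To"] = event["to"]
    # substitute() raises KeyError if the event lacks a variable the template needs,
    # so a broken payload fails here instead of producing a half-rendered email.
    message["Subject"] = Template(template["subject"]).substitute(data)
    message.set_content(Template(template["body"]).substitute(data))
    return message.as_bytes()


templates = {
    "welcome": {"subject": "Welcome, $name", "body": "Hi $name, thanks for signing up."}
}
event = {"template_id": "welcome", "to": "ada@example.com", "data": {"name": "Ada"}}
raw = render_mime(event, templates, sender="noreply@example.com")
```

The resulting bytes are what a worker would publish to the EmailReadyForDispatch topic, so downstream services never need to know about templates.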

Database Scaling: Handling the Analytics Firehose

At 10 million emails a day, the volume of analytics data (opens, clicks, bounces, spam complaints) is immense. A single email might generate 5 to 10 distinct events over its lifecycle. We were looking at inserting up to 100 million rows a day into our database.
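The figures above translate into per-second rates as follows. These are daily averages; real traffic is bursty, so peak rates will sit well above them.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day: float) -> float:
    """Convert a daily total into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


emails_per_day = 10_000_000
events_low = emails_per_day * 5    # 50,000,000 rows/day
events_high = emails_per_day * 10  # 100,000,000 rows/day

print(round(per_second(emails_per_day)))  # 116 emails/s on average
print(round(per_second(events_low)))      # 579 events/s on average
print(round(per_second(events_high)))     # 1157 events/s on average
```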

PostgreSQL is fantastic, but using it as a time-series sink for this volume of data would quickly lead to bloat and performance degradation.

Moving to ClickHouse

We migrated our analytics workload to ClickHouse, a column-oriented DBMS optimized for OLAP workloads. Instead of writing events directly to the database, our webhook ingestion service writes incoming events to—you guessed it—another Kafka topic.

ClickHouse consumes directly from this Kafka topic using its built-in Kafka table engine. This setup allows us to ingest tens of thousands of events per second without running a separate loader service in between. Querying aggregation metrics (e.g., "How many emails did User X send yesterday, and what was the open rate?") now takes milliseconds instead of minutes. Our primary PostgreSQL database is now strictly reserved for core relational data: user accounts, billing, and configuration.

Deliverability and Smart Dispatching

Sending 10 million emails is only half the battle; ensuring they actually reach the inbox is the real challenge. Mailbox providers (Gmail, Yahoo, Outlook) have stringent rate limits and sophisticated spam filters. If you open 1,000 concurrent connections to Gmail's MX servers, you can expect to be throttled or temporarily blocked.

Traffic Shaping and Rate Limiting

To handle this, we built a custom Dispatcher service in Go. Go's lightweight goroutines make it perfect for managing thousands of concurrent I/O-bound tasks. The Dispatcher pulls messages from the EmailReadyForDispatch topic and groups them by the destination domain.
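The grouping step can be sketched in a few lines (shown in Python for brevity, although the Dispatcher itself is written in Go). The message shape is an assumption. Note also that this groups by the literal recipient domain; a production dispatcher might group by MX host instead, since many custom domains are hosted by the same large providers.

```python
from collections import defaultdict


def recipient_domain(address: str) -> str:
    """Return the lowercased domain part of an email address."""
    return address.rsplit("@", 1)[1].strip().lower()


def group_by_domain(messages: list) -> dict:
    """Bucket messages by destination domain so each bucket can be rate limited separately."""
    batches = defaultdict(list)
    for message in messages:
        batches[recipient_domain(message["to"])].append(message)
    return dict(batches)


batches = group_by_domain([
    {"to": "a@gmail.com"},
    {"to": "b@Yahoo.com"},
    {"to": "c@GMAIL.com"},
])
```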

We use Redis to implement a distributed token bucket rate limiter. Before opening an SMTP connection to gmail.com, the worker must acquire a token for that specific domain. If the token bucket is empty, the worker yields and the message is placed in an in-memory delay queue to be retried shortly. This traffic shaping ensures we strictly adhere to the warm-up schedules and connection limits enforced by major inbox providers.
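The token bucket logic itself is small. The sketch below is an in-process Python version, not the Redis-backed one described above: a distributed limiter would keep the same two values (token count and last-refill time) in Redis and update them atomically. The class names and the per-domain limits are hypothetical.

```python
import time


class TokenBucket:
    """Refills at rate_per_sec tokens per second, up to capacity."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Add the tokens earned since the last call, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class DomainRateLimiter:
    """One bucket per destination domain, created on first use."""

    def __init__(self, limits: dict, default=(1.0, 5), clock=time.monotonic):
        self.limits = limits    # domain -> (rate_per_sec, capacity)
        self.default = default  # applied to domains without an explicit limit
        self.clock = clock
        self.buckets = {}

    def try_acquire(self, domain: str) -> bool:
        bucket = self.buckets.get(domain)
        if bucket is None:
            rate, capacity = self.limits.get(domain, self.default)
            bucket = self.buckets[domain] = TokenBucket(rate, capacity, self.clock)
        return bucket.try_acquire()
```

A worker would call `limiter.try_acquire("gmail.com")` before opening a connection and move the message to the delay queue when it returns `False`. Passing the clock in as a parameter keeps the limiter easy to test without sleeping.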

IP Pooling

We also implemented dynamic IP pooling. Instead of routing all traffic through a single set of IPs, we categorize outgoing mail into "transactional" and "marketing" pools. Furthermore, high-volume senders with excellent reputations are dynamically assigned to premium, dedicated IP pools. This isolation strategy prevents a sudden spike in bounce rates from one user from affecting the deliverability of others.
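One way to express this routing decision is a small pure function. Everything specific here is an assumption made for illustration: the reputation score, both thresholds, and the pool naming scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sender:
    sender_id: str
    daily_volume: int
    reputation: float  # hypothetical score from 0.0 (worst) to 1.0 (best)


# Hypothetical thresholds for moving a sender onto dedicated IPs.
PREMIUM_MIN_VOLUME = 100_000
PREMIUM_MIN_REPUTATION = 0.95


def select_pool(sender: Sender, category: str) -> str:
    """Pick the IP pool for a message based on mail category and sender standing."""
    if category not in ("transactional", "marketing"):
        raise ValueError(f"unknown category: {category}")
    if (sender.daily_volume >= PREMIUM_MIN_VOLUME
            and sender.reputation >= PREMIUM_MIN_REPUTATION):
        return f"dedicated-{sender.sender_id}"
    return f"shared-{category}"
```

Keeping the decision in one function makes it easy to re-evaluate a sender as their volume or reputation changes, which is what makes the assignment dynamic.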

Fault Tolerance: What Happens When Things Fail?

In distributed systems, failure is a guarantee. Network partitions happen, disks fill up, and third-party APIs go down. Our system is designed with a "let it crash" and "retry with backoff" mentality.

Every step of our pipeline includes strict timeout configurations. If an external SMTP server times out, the Dispatcher catches the error, increments a retry counter on the message payload, and publishes it to a Retry-Delay Kafka topic. We utilize a series of delay topics with progressively longer wait times (1 minute, 5 minutes, 15 minutes, 1 hour). If an email still has not been sent after 24 hours, it is finally marked as a Hard Bounce and the user is alerted.
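The retry routing can be sketched as a function that maps a failed message to its next delay topic. The topic names and message fields are hypothetical, and one detail is an assumption: once a message has used the longest delay, the sketch keeps retrying hourly until the 24-hour cutoff.

```python
from typing import Optional, Tuple

RETRY_DELAYS_SECONDS = [60, 300, 900, 3600]  # 1 min, 5 min, 15 min, 1 hour
MAX_AGE_SECONDS = 24 * 60 * 60


def schedule_retry(message: dict, now: float) -> Optional[Tuple[str, dict]]:
    """Return (delay_topic, updated_message), or None if the message has expired.

    A None result means the caller should mark the email as a hard bounce.
    """
    if now - message["first_attempt_at"] >= MAX_AGE_SECONDS:
        return None
    # Past the last tier, stay on the longest delay (assumption, see above).
    tier = min(message["retry_count"], len(RETRY_DELAYS_SECONDS) - 1)
    updated = dict(message, retry_count=message["retry_count"] + 1)
    return f"retry-delay-{RETRY_DELAYS_SECONDS[tier]}s", updated
```

Because the retry counter travels with the message, any Dispatcher instance can pick it up from a delay topic and make the same decision without shared state.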

Because Kafka persists all messages on disk, even if our entire fleet of Dispatcher workers crashes, the messages remain safely queued. Once the workers are back online, they simply resume processing from their last committed offset. We have achieved a true zero-message-loss architecture.
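The resume-from-offset behavior can be simulated without a broker. In this sketch a `PartitionLog` class (hypothetical, standing in for one Kafka partition) stores records plus a committed offset, and the worker commits only after a record has been handled.

```python
class PartitionLog:
    """Hypothetical stand-in for one Kafka partition and its committed offset."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0  # offset of the next record to process after a restart


def run_worker(log: PartitionLog, handle) -> None:
    """Process records from the last committed offset, committing after each success."""
    offset = log.committed
    while offset < len(log.records):
        handle(log.records[offset])  # may raise, simulating a crash
        offset += 1
        log.committed = offset       # commit only once the record is handled


delivered = []
crashed_once = []


def send(record):
    if record == "msg-2" and not crashed_once:
        crashed_once.append(True)
        raise RuntimeError("simulated worker crash")
    delivered.append(record)


log = PartitionLog(["msg-1", "msg-2", "msg-3"])
try:
    run_worker(log, send)
except RuntimeError:
    pass  # the worker died; the committed offset still points at msg-2

run_worker(log, send)  # a restarted worker resumes where the last one stopped
print(delivered)       # ['msg-1', 'msg-2', 'msg-3']
```

Committing after handling gives at-least-once delivery: a crash between sending and committing would cause the same record to be processed again, so the handler needs to tolerate duplicates.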

Observability: Seeing into the Black Box

When managing millions of moving parts, you need unparalleled visibility. We instrumented every microservice using OpenTelemetry. Metrics are scraped by Prometheus and visualized in Grafana dashboards.

We monitor:

Kafka Consumer Lag: The most critical metric. If consumer lag spikes, it means our workers cannot keep up with ingestion, and we automatically scale out the worker pods via Kubernetes HPA (Horizontal Pod Autoscaler).
SMTP Connection Latency: Grouped by provider. If Outlook's latency suddenly jumps from 200ms to 2000ms, our automated alerts page the on-call engineer.
Deliverability Rates: Real-time tracking of bounce rates per IP address. If an IP hits a bounce rate threshold, it is automatically paused and removed from the active rotation.
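As an example of the last check, the sketch below tracks the bounce rate per IP over a sliding window of recent sends and pauses an IP once it crosses a threshold. The threshold, window size, and minimum sample count are hypothetical values, and the class is an illustration rather than the monitoring code described above.

```python
from collections import deque


class BounceMonitor:
    """Pause an IP when its recent bounce rate exceeds a threshold."""

    def __init__(self, threshold: float = 0.05, window: int = 1000, min_samples: int = 100):
        self.threshold = threshold
        self.window = window
        self.min_samples = min_samples  # avoid pausing on a handful of sends
        self.results = {}               # ip -> deque of recent outcomes (True = bounced)
        self.paused = set()

    def record(self, ip: str, bounced: bool) -> None:
        outcomes = self.results.setdefault(ip, deque(maxlen=self.window))
        outcomes.append(bounced)
        if len(outcomes) >= self.min_samples:
            if sum(outcomes) / len(outcomes) > self.threshold:
                self.paused.add(ip)

    def active_ips(self, pool: list) -> list:
        """Return the IPs in the pool that are still in rotation."""
        return [ip for ip in pool if ip not in self.paused]
```

The minimum sample count matters: without it, a single bounce on a freshly added IP would register as a 100% bounce rate and pause it immediately.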

Distributed tracing allows us to follow a single email from the initial API POST request, through the Kafka queues, into the Go Dispatcher, and finally to the ClickHouse analytics event. When a customer asks, "Why was this specific email delayed?", we can pinpoint the exact microservice that caused the bottleneck.

Final Thoughts

Scaling to 10 million emails a day has been an incredible engineering journey. By transitioning from a synchronous monolith to an asynchronous, decoupled, event-driven architecture, we eliminated single points of failure. By adopting specialized data stores like ClickHouse for analytics and leveraging Kafka for resilient message queuing, we built a system that easily handles massive spikes in traffic while maintaining strict deliverability standards.

Engineering a robust email infrastructure is a complex, resource-intensive undertaking. It requires specialized knowledge in distributed systems, network protocols, and continuous monitoring of IP reputation. While building this from scratch is a rewarding challenge for any engineering team, it isn't always the best use of time when you need to focus on your core product.

If you are looking for high-scale, reliable email delivery without the headache of managing Kafka clusters, configuring ClickHouse, and fighting with ISP rate limits, Mailry is the perfect solution for you. Mailry offers a battle-tested, developer-first email infrastructure that handles millions of emails flawlessly out of the box. Let Mailry manage the complexities of deliverability, traffic shaping, and analytics, so your team can get back to building what matters most to your users.
