Scaling Your Email Infrastructure: Lessons from Sending 1B Emails

Scaling email infrastructure to handle billions of messages presents unique challenges. Here's what we learned building Postalynk's platform to handle massive volumes reliably.

The Challenge of Scale

Sending a billion emails per month (assuming a 30-day month) works out to roughly:

33 million emails per day
1.4 million emails per hour
23,000 emails per minute
386 emails per second (sustained)

And that's just the average. Peak loads can be 10x higher. Black Friday, product launches, and breaking news can spike volume dramatically.
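
The arithmetic behind these figures can be reproduced with a few lines of JavaScript. This is a back-of-the-envelope sketch that assumes a 30-day month and perfectly even traffic; `sendRates` and its `peakMultiplier` parameter are illustrative names, not part of any Postalynk API.

```javascript
// Back-of-the-envelope send rates for a given monthly volume.
// Assumes a 30-day month and evenly spread traffic.
function sendRates(emailsPerMonth, peakMultiplier = 10) {
  const perDay = emailsPerMonth / 30;
  const perHour = perDay / 24;
  const perMinute = perHour / 60;
  const perSecond = perMinute / 60;
  return {
    perDay: Math.round(perDay),
    perHour: Math.round(perHour),
    perMinute: Math.round(perMinute),
    perSecond: Math.round(perSecond),
    // Capacity target if peaks reach `peakMultiplier` times the average.
    peakPerSecond: Math.round(perSecond * peakMultiplier),
  };
}

const rates = sendRates(1e9);
console.log(rates);
```

For one billion emails this gives about 386 emails per second on average (385.8 before rounding) and about 3,858 per second at a 10x peak, which is the number capacity planning has to target.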

Architecture Principles

1. Queue Everything

Never try to send emails synchronously. Every send request goes to a queue:

API Request → Validation → Queue → Worker → SMTP → Delivery

This decouples ingestion from delivery, allowing each to scale independently. When a spike hits, the queue absorbs it while workers process at their maximum rate.
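
A minimal sketch of that decoupling, using an in-memory array as a stand-in for a durable queue. The names `validate`, `enqueue`, and `drain` are hypothetical and exist only for this illustration; a production system would use a real broker and asynchronous workers.

```javascript
// In-memory stand-in for a durable queue, to show the shape of the pipeline.
const queue = [];

// Reject obviously bad requests before they ever reach the queue.
function validate(request) {
  if (!request.to || !request.to.includes('@')) {
    throw new Error('invalid recipient');
  }
  return { to: request.to, subject: request.subject || '', body: request.body || '' };
}

// Ingestion path: validate, enqueue, and return immediately.
// No SMTP work happens while the API caller is waiting.
function enqueue(request) {
  const message = validate(request);
  queue.push(message);
  return { accepted: true, queued: queue.length };
}

// Delivery path: a worker drains at its own pace, independent of ingestion.
// `send` is whatever actually delivers one message.
function drain(send, batchSize) {
  const batch = queue.splice(0, batchSize);
  batch.forEach((message) => send(message));
  return batch.length;
}
```

Because `enqueue` never calls `send`, a burst of API requests only makes the queue longer; it does not slow down the callers or overload the delivery side.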

2. Horizontal Scaling

Vertical scaling (bigger servers) has limits. We designed every component to scale horizontally:

**API servers**: Stateless, behind a load balancer
**Queue workers**: Add more workers to increase throughput
**SMTP relays**: Multiple sending nodes with intelligent routing
**Databases**: Sharded by organization for write scaling

3. Graceful Degradation

At scale, failures are inevitable. Design for them:

If a component fails, others continue operating
Retries with exponential backoff
Circuit breakers to prevent cascade failures
Fallback to secondary systems when primary fails
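
Two of these patterns, exponential backoff and circuit breakers, can be sketched in a few dozen lines. This is a simplified, synchronous illustration; the default thresholds and delays are assumptions chosen for the example, not values from our production configuration.

```javascript
// Exponential backoff with "full jitter": a random delay in
// [0, min(capMs, baseMs * 2^attempt)). `random` is injectable for testing.
function backoffDelayMs(attempt, baseMs = 500, capMs = 60000, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Minimal circuit breaker: opens after `threshold` consecutive failures and
// rejects calls until `cooldownMs` has passed, then lets a trial call through.
class CircuitBreaker {
  constructor(threshold = 5, cooldownMs = 30000, now = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null;
  }

  call(fn) {
    if (this.openedAt !== null && this.now() - this.openedAt < this.cooldownMs) {
      // Fail fast instead of piling more load onto a struggling dependency.
      throw new Error('circuit open');
    }
    try {
      const result = fn();
      // Success closes the circuit and resets the failure count.
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) {
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

The jitter matters at scale: without it, thousands of workers that failed at the same moment would all retry at the same moment too.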

Database Considerations

The Problem with SQL at Scale

At billions of emails, a single database becomes a bottleneck:

INSERT rate limits
Index maintenance overhead
Query performance degradation
Backup and maintenance windows

Our Solution: Hybrid Approach

Hot Data (recent emails): PostgreSQL with time-based partitioning

Last 30 days of data
Automatic partition rotation
Fast queries for recent activity

Warm Data (older emails): ClickHouse

Columnar storage for analytics
Excellent compression (10x reduction)
Fast aggregation queries

Cold Data (archives): Object storage (S3)

Compliance and audit purposes
Rarely accessed, highly compressed
Retrieved on-demand
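
The tiering decision comes down to the age of a record. The sketch below uses the 30-day hot window described above; the 365-day cutoff between warm and cold storage is an assumed placeholder, since the right value depends on your retention and compliance requirements. `storageTier` is a hypothetical helper name.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

const HOT_DAYS = 30;   // hot window, as described in the text
const WARM_DAYS = 365; // assumption: warm-to-cold cutoff chosen for illustration

// Decide which store should hold an email record, based on its age.
function storageTier(sentAt, now = new Date()) {
  const ageDays = (now.getTime() - sentAt.getTime()) / DAY_MS;
  if (ageDays <= HOT_DAYS) return 'postgres';
  if (ageDays <= WARM_DAYS) return 'clickhouse';
  return 's3';
}

const now = new Date('2024-06-30T00:00:00Z');
console.log(storageTier(new Date('2024-06-20T00:00:00Z'), now)); // 10 days old
console.log(storageTier(new Date('2022-01-01T00:00:00Z'), now)); // over two years old
```

A query router can use the same function to decide where to look for a message, so reads and the background migration jobs agree on where data lives.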

Sharding Strategy

We shard by organization ID:

Each organization's data lives on a dedicated shard
Prevents noisy neighbor problems
Allows per-customer scaling
Simplifies data isolation and deletion
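
One way to map an organization to a shard is a stable hash with an explicit override table for customers that need a shard of their own. This is a sketch under that assumption, not a description of our actual placement logic; `fnv1a`, `shardFor`, and the `pinned` map are names invented for the example.

```javascript
// FNV-1a 32-bit hash. Stable across processes and restarts, so every service
// computes the same shard for the same ID. Assumes ASCII organization IDs.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Explicit placements win, so a large customer can be pinned to its own shard.
// Everyone else is spread across the shared shards by hash.
function shardFor(orgId, shardCount, pinned = new Map()) {
  if (pinned.has(orgId)) return pinned.get(orgId);
  return fnv1a(orgId) % shardCount;
}

const pinned = new Map([['org_bigcorp', 99]]);
console.log(shardFor('org_bigcorp', 16, pinned));
console.log(shardFor('org_smallshop', 16, pinned));
```

Note that plain modulo hashing moves most organizations when `shardCount` changes. A lookup table or consistent hashing avoids that, at the cost of more moving parts.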

Queue Architecture

Why We Chose Kafka

At our scale, traditional message queues (RabbitMQ, SQS) struggle. Kafka gives us:

**Durability**: Messages survive broker failures
**Replay**: Re-process messages if needed
**Ordering**: Maintain order per partition
**Throughput**: Millions of messages per second

Queue Design

Partitions: 64 (allows up to 64 parallel consumers in one consumer group)
Retention: 7 days (for replay capability)
Replication: 3 (committed data survives 2 broker failures)

Messages are partitioned by organization ID, ensuring all emails for one customer are processed in order.
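
Expressed as configuration, those settings might look like the sketch below. `retention.ms` and `min.insync.replicas` are standard Kafka topic-level settings, but the topic name and the `min.insync.replicas` value are assumptions for illustration. The object shape loosely follows what the kafkajs admin client accepts for topic creation; treat that mapping as an assumption and check your own client's documentation.

```javascript
const RETENTION_DAYS = 7;

// Topic settings from the text, using Kafka's topic-level config names.
const emailTopic = {
  topic: 'outbound-email', // hypothetical topic name
  numPartitions: 64,
  replicationFactor: 3,
  configEntries: [
    // Kafka expects retention in milliseconds, passed as a string.
    { name: 'retention.ms', value: String(RETENTION_DAYS * 24 * 60 * 60 * 1000) },
    { name: 'min.insync.replicas', value: '2' }, // assumption, not from the text
  ],
};

// Keying each record by organization ID keeps one customer's messages on one
// partition, which is what preserves per-customer ordering.
function toRecord(email) {
  return { key: email.organizationId, value: JSON.stringify(email) };
}

console.log(emailTopic.configEntries[0]);
console.log(toRecord({ organizationId: 'org_42', to: 'user@example.com' }).key);
```

The trade-off of keying by organization is that one very large sender can create a hot partition, so partition lag is worth watching per partition, not just in aggregate.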

Dead Letter Queues

Failed messages go to a dead letter queue for:

Manual inspection
Automated retry after fixes
Error analysis and alerting
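
The handoff to a dead letter queue can be sketched as a bounded retry loop that parks the message, with its error details, once attempts run out. This is a simplified synchronous illustration; `processWithDeadLetter` is a hypothetical helper, and the dead letter "queue" here is a plain array.

```javascript
// Try a handler up to `maxAttempts` times. On final failure, park the message
// in the dead letter list together with enough context to debug it later.
function processWithDeadLetter(message, handler, deadLetters, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { ok: true, result: handler(message), attempts: attempt };
    } catch (err) {
      lastError = err; // remember the most recent failure and try again
    }
  }
  deadLetters.push({
    message,
    error: String(lastError && lastError.message),
    attempts: maxAttempts,
    failedAt: new Date().toISOString(),
  });
  return { ok: false, attempts: maxAttempts };
}

const deadLetters = [];
const outcome = processWithDeadLetter(
  { id: 'msg_1', to: 'user@example.com' },
  () => { throw new Error('550 mailbox unavailable'); },
  deadLetters
);
console.log(outcome.ok, deadLetters.length);
```

Storing the error text and attempt count alongside the original message is what makes the later steps (inspection, automated retry, alerting) possible without digging through logs.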

SMTP Layer

IP Reputation Management

At scale, IP reputation is everything. We maintain:

**Dedicated IPs per customer** (Enterprise plans)
**Shared IP pools** (grouped by sender reputation)
**Warm-up pools** (for new IPs)

Intelligent Routing

Not all emails should go through the same path:

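// Choose a sending pool for one message. Pool names are illustrative.
const pools = { gmail: 'gmail-optimized', high: 'dedicated-high-priority', fallback: 'default' };

function selectPool(recipient, priority) {
  // Domain names are case-insensitive, so normalize before comparing.
  const domain = recipient.domain.toLowerCase();
  if (domain === 'gmail.com') return pools.gmail; // provider-specific tuning
  if (priority === 'high') return pools.high;     // latency-sensitive mail
  return pools.fallback;                          // everything else
}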

Rate Limiting

ISPs enforce rate limits. Exceed them, and your mail gets deferred or blocked. We implement:

Per-domain sending limits
Automatic backoff when limits are hit
Queue prioritization when near limits
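
Per-domain limits are commonly implemented as a token bucket per destination domain. The sketch below assumes that approach; the class name, the default rate, and the per-domain numbers are all invented for illustration, and real limits have to be learned from each provider's behavior.

```javascript
// Token bucket per recipient domain. Each domain refills at its own rate and
// can burst up to one second's worth of sends.
class DomainRateLimiter {
  constructor(limitsPerSecond, defaultPerSecond = 10, now = Date.now) {
    this.limits = limitsPerSecond; // Map of domain -> sends per second
    this.defaultPerSecond = defaultPerSecond;
    this.now = now;                // injectable clock, for testing
    this.buckets = new Map();
  }

  tryAcquire(domain) {
    const rate = this.limits.has(domain) ? this.limits.get(domain) : this.defaultPerSecond;
    const t = this.now();
    let bucket = this.buckets.get(domain);
    if (!bucket) {
      bucket = { tokens: rate, updatedAt: t };
      this.buckets.set(domain, bucket);
    }
    // Refill in proportion to elapsed time, capped at the bucket size.
    const elapsedSec = (t - bucket.updatedAt) / 1000;
    bucket.tokens = Math.min(rate, bucket.tokens + elapsedSec * rate);
    bucket.updatedAt = t;
    if (bucket.tokens >= 1) {
      bucket.tokens -= 1;
      return true;
    }
    return false; // caller should delay or requeue the message
  }
}

const limiter = new DomainRateLimiter(new Map([['gmail.com', 2]]));
console.log(limiter.tryAcquire('gmail.com'));
```

When `tryAcquire` returns `false`, the worker puts the message back with a delay instead of sending, which is the automatic backoff described above.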

Monitoring at Scale

Metrics That Matter

We track thousands of metrics, but these are critical:

**Delivery rate by ISP**: Are Gmail, Outlook, Yahoo accepting our mail?
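**Queue depth**: A steadily growing queue means delivery is falling behind ingestion
**Bounce rate by type**: Soft bounces (temporary failures) call for retries; hard bounces (permanent failures) call for suppressing the address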
**p99 delivery latency**: 99th percentile delivery time
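
Two of these metrics are simple to compute from raw samples. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions, and assumes delivery events carry a `type` field of `delivered`, `soft`, or `hard`. Both function names and the event shape are assumptions for the example.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Share of soft and hard bounces among all delivery events.
function bounceRates(events) {
  const counts = { delivered: 0, soft: 0, hard: 0 };
  for (const event of events) {
    if (Object.prototype.hasOwnProperty.call(counts, event.type)) {
      counts[event.type] += 1;
    }
  }
  const total = counts.delivered + counts.soft + counts.hard;
  if (total === 0) return { soft: 0, hard: 0 };
  return { soft: counts.soft / total, hard: counts.hard / total };
}

const latenciesMs = Array.from({ length: 100 }, (_, i) => i + 1);
console.log(percentile(latenciesMs, 99));
```

In production these numbers usually come from histogram buckets in the metrics system instead of raw samples, but the definition being approximated is the same.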

Alerting Philosophy

Alert on symptoms, not causes
Page for customer-impacting issues only
Use anomaly detection for traffic patterns
Correlate across services to find root cause
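
As a minimal illustration of anomaly detection on a traffic metric, the sketch below flags a value that sits more than a few standard deviations from its recent history. This z-score check is a deliberately simple stand-in: it ignores daily and weekly seasonality, which real traffic has, and the threshold of 3 is an assumption.

```javascript
// Flag `current` as anomalous if it is more than `threshold` sample standard
// deviations away from the mean of `history`.
function isAnomalous(history, current, threshold = 3) {
  if (history.length < 2) return false; // not enough data to judge
  const mean = history.reduce((sum, x) => sum + x, 0) / history.length;
  const variance =
    history.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (history.length - 1);
  const std = Math.sqrt(variance);
  if (std === 0) return current !== mean; // flat history: any change stands out
  return Math.abs(current - mean) / std > threshold;
}

const recentQueueDepth = [100, 102, 98, 101, 99];
console.log(isAnomalous(recentQueueDepth, 100));
console.log(isAnomalous(recentQueueDepth, 500));
```

A check like this fits the "symptoms, not causes" rule: it fires when queue depth or send volume looks wrong, regardless of which component is responsible.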

Observability Stack

**Metrics**: Prometheus + Grafana
**Logs**: Elasticsearch + Kibana
**Tracing**: Jaeger for request tracing
**Dashboards**: Real-time delivery visualizations

Lessons Learned

1. Optimize for the Common Case

80% of our traffic is transactional email to major providers. Optimize that path first. Edge cases can be slower.

2. Test at Production Scale

A system that works at 1,000 emails/second might fail at 10,000. Load test regularly at 2x your expected peak.

3. Plan for Failure

Every component will fail eventually. We do regular "chaos engineering" exercises to ensure the system handles failures gracefully.

4. Cache Aggressively

DNS lookups, MX records, and recipient verification are cached. At scale, even small optimizations compound.
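
A time-to-live cache is the simplest form of this. The sketch below is a generic in-memory version with an injectable clock; `TtlCache`, the five-minute TTL, and the fake loader are assumptions for illustration. A real MX cache would typically honor the TTL returned in the DNS response and handle asynchronous lookups and failures.

```javascript
// Small TTL cache: returns a cached value while it is fresh, otherwise calls
// `load` once and stores the result.
class TtlCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, for testing
    this.entries = new Map();
  }

  get(key, load) {
    const t = this.now();
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > t) return hit.value;
    const value = load(key);
    this.entries.set(key, { value, expiresAt: t + this.ttlMs });
    return value;
  }
}

// Usage with a fake loader standing in for a real MX lookup.
const mxCache = new TtlCache(5 * 60 * 1000);
const fakeLookup = (domain) => ['mx1.' + domain, 'mx2.' + domain];
console.log(mxCache.get('example.com', fakeLookup));
console.log(mxCache.get('example.com', fakeLookup)); // served from cache
```

At hundreds of sends per second to the same handful of large providers, almost every lookup is a cache hit, which removes a network round trip from the hot path.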

5. Monitor Business Metrics

Technical metrics aren't enough. Track:

Revenue per email
Customer churn correlation with delivery issues
Support ticket rate vs. delivery problems

The Human Side

Scaling isn't just technical:

**On-call rotations**: Well-rested engineers make better decisions
**Documentation**: Future you will thank present you
**Runbooks**: Predefined responses to common issues
**Post-mortems**: Learn from every incident

Conclusion

Building email infrastructure at scale requires careful attention to architecture, monitoring, and operational practices. There are no shortcuts, but the investment pays off in reliability and customer trust.

At Postalynk, we've applied these lessons to create infrastructure that just works, so you can focus on your product instead of email delivery.