Scaling Your Email Infrastructure: Lessons from Sending 1B Emails
Insights and lessons learned from scaling our email infrastructure to handle billions of emails per month.
David Park
Infrastructure Lead
Scaling email infrastructure to handle billions of messages presents unique challenges. Here's what we learned building Postalynk's platform to handle massive volumes reliably.
The Challenge of Scale
Sending a billion emails per month works out to roughly (assuming a 30-day month):
- 33 million emails per day
- 1.4 million emails per hour
- 23,000 emails per minute
- 385 emails per second (sustained)
And that's just the average. Peak loads can be 10x higher. Black Friday, product launches, and breaking news can spike volume dramatically.
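The arithmetic behind these figures can be checked with a short sketch. It assumes a 30-day month and takes the 10x peak factor from the text; the function name `averageRates` is made up for illustration.

```javascript
// Average send rates for a given monthly volume, assuming a 30-day month.
function averageRates(emailsPerMonth, daysPerMonth = 30) {
  const perDay = emailsPerMonth / daysPerMonth;
  const perHour = perDay / 24;
  const perMinute = perHour / 60;
  const perSecond = perMinute / 60;
  return { perDay, perHour, perMinute, perSecond };
}

const rates = averageRates(1e9);
// Peak estimate: 10x the sustained per-second average.
const peakPerSecond = rates.perSecond * 10;

console.log(Math.round(rates.perDay));    // 33333333
console.log(Math.round(rates.perHour));   // 1388889
console.log(Math.round(rates.perMinute)); // 23148
console.log(Math.floor(rates.perSecond)); // 385
console.log(Math.round(peakPerSecond));   // 3858
```

Capacity planning should target the peak figure (several thousand emails per second), not the sustained average.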
Architecture Principles
1. Queue Everything
Never try to send emails synchronously. Every send request goes to a queue:
API Request → Validation → Queue → Worker → SMTP → Delivery
This decouples ingestion from delivery, allowing each to scale independently. When a spike hits, the queue absorbs it while workers process at their maximum rate.
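A minimal in-memory sketch of that decoupling follows. It is not Postalynk's implementation: `SendQueue`, `acceptSendRequest`, and `workerTick` are hypothetical names, and a real deployment would use a durable queue rather than an array.

```javascript
// In-memory stand-in for a durable queue (illustration only).
class SendQueue {
  constructor() {
    this.items = [];
  }
  enqueue(message) {
    this.items.push(message);
  }
  // Remove and return up to `max` messages, oldest first.
  dequeueBatch(max) {
    return this.items.splice(0, max);
  }
  get depth() {
    return this.items.length;
  }
}

// Ingestion path: validate, then enqueue. Nothing is sent here.
function acceptSendRequest(queue, request) {
  if (!request.to || !request.to.includes("@")) {
    return { accepted: false, reason: "invalid recipient" };
  }
  queue.enqueue({ to: request.to, subject: request.subject });
  return { accepted: true };
}

// Delivery path: one worker tick handles at most `capacity` messages.
function workerTick(queue, capacity, deliver) {
  const batch = queue.dequeueBatch(capacity);
  batch.forEach(deliver);
  return batch.length;
}

// A spike of 1,000 requests arrives; one tick drains only 100 of them.
const queue = new SendQueue();
for (let i = 0; i < 1000; i++) {
  acceptSendRequest(queue, { to: `user${i}@example.com`, subject: "Hi" });
}
const delivered = [];
workerTick(queue, 100, (message) => delivered.push(message.to));
console.log(queue.depth, delivered.length); // 900 100
```

The API accepted every request immediately, while the remaining 900 messages wait in the queue for later ticks or additional workers.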
2. Horizontal Scaling
Vertical scaling (bigger servers) has limits. We designed every component to scale horizontally:
- **API servers**: Stateless, behind a load balancer
- **Queue workers**: Add more workers to increase throughput
- **SMTP relays**: Multiple sending nodes with intelligent routing
- **Databases**: Sharded by organization for write scaling
3. Graceful Degradation
At scale, failures are inevitable. Design for them:
- If a component fails, others continue operating
- Retries with exponential backoff
- Circuit breakers to prevent cascade failures
- Fallback to secondary systems when primary fails
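Two of these patterns, exponential backoff and circuit breakers, can be sketched briefly. The thresholds and delays below are illustrative assumptions, not values from our production system, and the clock is injectable only to make the example deterministic.

```javascript
// Delay before retry number `attempt` (0-based): doubles each time, capped at maxMs.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

class CircuitBreaker {
  constructor({ failureThreshold = 5, resetAfterMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null; // null means the breaker is closed
  }
  // Closed: allow calls. Open: reject until resetAfterMs has elapsed,
  // then allow trial calls again.
  allowRequest() {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.resetAfterMs;
  }
  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }
  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = this.now();
  }
}

// Usage with a fake clock so the behaviour is repeatable.
let fakeNow = 0;
const breaker = new CircuitBreaker({
  failureThreshold: 3,
  resetAfterMs: 30000,
  now: () => fakeNow,
});
for (let i = 0; i < 3; i++) breaker.recordFailure();
console.log(breaker.allowRequest()); // false
fakeNow += 30000;
console.log(breaker.allowRequest()); // true
console.log([0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt))); // [ 1000, 2000, 4000, 8000 ]
```

In practice, adding random jitter to the backoff delay helps avoid many workers retrying at the same instant.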
Database Considerations
The Problem with SQL at Scale
At billions of emails, a single database becomes a bottleneck:
- INSERT rate limits
- Index maintenance overhead
- Query performance degradation
- Backup and maintenance windows
Our Solution: Hybrid Approach
Hot Data (recent emails): PostgreSQL with time-based partitioning
- Last 30 days of data
- Automatic partition rotation
- Fast queries for recent activity
Warm Data (older emails): ClickHouse
- Columnar storage for analytics
- Excellent compression (10x reduction)
- Fast aggregation queries
Cold Data (archives): Object storage (S3)
- Compliance and audit purposes
- Rarely accessed, highly compressed
- Retrieved on-demand
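As a rough sketch, tier selection by message age might look like the following. The 30-day hot window comes from the description above; the 365-day warm cutoff, the daily partition granularity, and the `emails_YYYY_MM_DD` naming are assumptions made for illustration.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Pick a storage tier from the age of a message.
// hotDays is from the text; warmDays is an assumed cutoff.
function storageTier(sentAt, now, { hotDays = 30, warmDays = 365 } = {}) {
  const ageDays = (now.getTime() - sentAt.getTime()) / DAY_MS;
  if (ageDays <= hotDays) return "hot";
  if (ageDays <= warmDays) return "warm";
  return "cold";
}

// Hypothetical daily partition name, derived from the UTC send date.
function dailyPartitionName(sentAt) {
  const isoDate = sentAt.toISOString().slice(0, 10); // YYYY-MM-DD
  return `emails_${isoDate.replace(/-/g, "_")}`;
}

const now = new Date("2024-03-01T00:00:00Z");
console.log(storageTier(new Date("2024-02-20T00:00:00Z"), now)); // hot
console.log(storageTier(new Date("2023-12-01T00:00:00Z"), now)); // warm
console.log(storageTier(new Date("2022-01-01T00:00:00Z"), now)); // cold
console.log(dailyPartitionName(new Date("2024-02-20T12:00:00Z"))); // emails_2024_02_20
```

With time-based partitions, rotating data out of the hot tier means dropping or detaching whole partitions instead of deleting rows one by one.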
Sharding Strategy
We shard by organization ID:
- Each organization's data lives on a dedicated shard
- Prevents noisy neighbor problems
- Allows per-customer scaling
- Simplifies data isolation and deletion
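One way to picture this is a lookup directory that maps each organization to its shard. The sketch below is an assumption about how such a directory could work, with hypothetical shard names; it is not a description of our actual routing layer.

```javascript
// Explicit organization-to-shard assignments (illustration only).
class ShardDirectory {
  constructor() {
    this.assignments = new Map();
  }
  assign(orgId, shard) {
    this.assignments.set(orgId, shard);
  }
  shardFor(orgId) {
    const shard = this.assignments.get(orgId);
    if (!shard) throw new Error(`no shard assigned for organization ${orgId}`);
    return shard;
  }
}

const directory = new ShardDirectory();
directory.assign("org_small", "pg-shard-01");
directory.assign("org_large", "pg-shard-02");

// Per-customer scaling: move a growing organization to bigger hardware
// by updating its assignment (after its data has been migrated).
directory.assign("org_large", "pg-shard-xl-01");

console.log(directory.shardFor("org_small")); // pg-shard-01
console.log(directory.shardFor("org_large")); // pg-shard-xl-01
```

An explicit directory costs an extra lookup compared with hashing, but it lets a single organization be moved without touching anyone else's data.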
Queue Architecture
Why We Chose Kafka
At our scale, traditional message queues (RabbitMQ, SQS) struggle. Kafka gives us:
- **Durability**: Messages survive broker failures
- **Replay**: Re-process messages if needed
- **Ordering**: Maintain order per partition
- **Throughput**: Millions of messages per second
Queue Design
Partitions: 64 (allows up to 64 parallel consumers per consumer group)
Retention: 7 days (for replay capability)
Replication: 3 (data survives 2 broker failures)
Messages are partitioned by organization ID, ensuring all emails for one customer are processed in order.
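Keyed partitioning can be illustrated with a simple hash. Note the assumption: the sketch uses an FNV-1a hash for readability, which is not the hash Kafka's built-in partitioner uses; only the principle (same key, same partition) carries over.

```javascript
// 32-bit FNV-1a hash over the string's UTF-16 code units.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Map an organization ID to one of `partitionCount` partitions.
function partitionFor(orgId, partitionCount = 64) {
  return fnv1a(orgId) % partitionCount;
}

const first = partitionFor("org_12345");
const second = partitionFor("org_12345");
console.log(first === second); // true
console.log(first >= 0 && first < 64); // true
```

Because every message for an organization lands on one partition, a single very large sender can make that partition hot, so per-partition lag is worth watching.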
Dead Letter Queues
Failed messages go to a dead letter queue for:
- Manual inspection
- Automated retry after fixes
- Error analysis and alerting
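A minimal sketch of the dead-letter pattern: retry a handler a bounded number of times, then set the message aside with enough context to inspect it later. The function name, the record fields, and the limit of three attempts are illustrative assumptions.

```javascript
// Try each message up to maxAttempts times; collect permanent failures.
function processWithDeadLetter(messages, handler, maxAttempts = 3) {
  const deadLetters = [];
  let succeeded = 0;
  for (const message of messages) {
    let lastError = null;
    let done = false;
    for (let attempt = 1; attempt <= maxAttempts && !done; attempt++) {
      try {
        handler(message);
        done = true;
      } catch (err) {
        lastError = err;
      }
    }
    if (done) {
      succeeded += 1;
    } else {
      // Keep the original message plus context for inspection and replay.
      deadLetters.push({
        message,
        error: lastError.message,
        attempts: maxAttempts,
        failedAt: new Date().toISOString(),
      });
    }
  }
  return { succeeded, deadLetters };
}

// Hypothetical handler that rejects malformed recipients.
function sendOrThrow(message) {
  if (!message.to.includes("@")) throw new Error("malformed recipient");
}

const result = processWithDeadLetter(
  [{ to: "a@example.com" }, { to: "broken-address" }],
  sendOrThrow
);
console.log(result.succeeded); // 1
console.log(result.deadLetters[0].error); // malformed recipient
```

Storing the error alongside the message is what makes automated retry after a fix practical: entries can be filtered by failure type and replayed selectively.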
SMTP Layer
IP Reputation Management
At scale, IP reputation is everything. We maintain:
- **Dedicated IPs per customer** (Enterprise plans)
- **Shared IP pools** (grouped by sender reputation)
- **Warm-up pools** (for new IPs)
Intelligent Routing
Not all emails should go through the same path:
    if (recipient.domain === 'gmail.com') {
      route.through(gmailOptimizedPool);
    } else if (priority === 'high') {
      route.through(dedicatedHighPriorityPool);
    } else {
      route.through(defaultPool);
    }
Rate Limiting
ISPs enforce rate limits. Exceed them, and you're blocked. We implement:
- Per-domain sending limits
- Automatic backoff when limits are hit
- Queue prioritization when near limits
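Per-domain limits are commonly implemented as token buckets. The sketch below assumes that approach; the limit values are invented for the example (real ISP limits are not published in this form), and the clock is injectable so the behaviour is deterministic.

```javascript
// One token bucket per recipient domain. Rates are in messages per second.
class DomainRateLimiter {
  constructor(limitsPerSecond, defaultLimit = 10, now = Date.now) {
    this.limits = limitsPerSecond;
    this.defaultLimit = defaultLimit;
    this.now = now;
    this.buckets = new Map();
  }
  // Returns true if a send to `domain` is allowed right now.
  tryAcquire(domain) {
    const rate = this.limits[domain] ?? this.defaultLimit;
    const currentMs = this.now();
    const bucket = this.buckets.get(domain) ?? { tokens: rate, updatedAt: currentMs };
    // Refill in proportion to elapsed time, never above the bucket size.
    const elapsedSec = (currentMs - bucket.updatedAt) / 1000;
    bucket.tokens = Math.min(rate, bucket.tokens + elapsedSec * rate);
    bucket.updatedAt = currentMs;
    this.buckets.set(domain, bucket);
    if (bucket.tokens >= 1) {
      bucket.tokens -= 1;
      return true;
    }
    return false;
  }
}

let fakeNowMs = 0;
const limiter = new DomainRateLimiter({ "gmail.com": 2 }, 10, () => fakeNowMs);
const burst = [
  limiter.tryAcquire("gmail.com"),
  limiter.tryAcquire("gmail.com"),
  limiter.tryAcquire("gmail.com"),
];
console.log(burst); // [ true, true, false ]
fakeNowMs += 500; // half a second refills one token at 2 per second
console.log(limiter.tryAcquire("gmail.com")); // true
```

When `tryAcquire` returns false, the message would go back to the queue with a delay instead of being dropped.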
Monitoring at Scale
Metrics That Matter
We track thousands of metrics, but these are critical:
- **Delivery rate by ISP**: Are Gmail, Outlook, Yahoo accepting our mail?
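- **Queue depth**: A steadily growing queue means workers are not keeping up with delivery
- **Bounce rate by type**: Soft bounces (temporary failures, such as a full mailbox) call for retries; hard bounces (permanent failures, such as a nonexistent address) call for suppression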
- **p99 delivery latency**: 99th percentile delivery time
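For reference, a percentile such as p99 can be computed from raw samples with the nearest-rank method, sketched below. Monitoring systems typically estimate percentiles from histograms instead, so treat this as an illustration of the definition only.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// 100 delivery latencies: 1 ms, 2 ms, ..., 100 ms.
const latenciesMs = Array.from({ length: 100 }, (_, i) => i + 1);
console.log(percentile(latenciesMs, 50)); // 50
console.log(percentile(latenciesMs, 99)); // 99
```

The p99 matters because averages hide the slow tail: a handful of very slow deliveries barely moves the mean but shows up clearly in the 99th percentile.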
Alerting Philosophy
- Alert on symptoms, not causes
- Page for customer-impacting issues only
- Use anomaly detection for traffic patterns
- Correlate across services to find root cause
Observability Stack
- **Metrics**: Prometheus + Grafana
- **Logs**: Elasticsearch + Kibana
- **Tracing**: Jaeger for request tracing
- **Dashboards**: Real-time delivery visualizations
Lessons Learned
1. Optimize for the Common Case
80% of our traffic is transactional email to major providers. Optimize that path first. Edge cases can be slower.
2. Test at Production Scale
A system that works at 1,000 emails/second might fail at 10,000. Load test regularly at 2x expected peak.
3. Plan for Failure
Every component will fail eventually. We do regular "chaos engineering" exercises to ensure the system handles failures gracefully.
4. Cache Aggressively
DNS lookups, MX records, and recipient verification are cached. At scale, even small optimizations compound.
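A small time-to-live cache illustrates the idea. The sketch assumes a fixed five-minute TTL and a synchronous, fake lookup function (`fakeMxLookup`); real DNS lookups are asynchronous, and real records carry their own TTLs that a production cache should honor.

```javascript
// Cache values for a fixed time-to-live; reload after expiry.
class TtlCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }
  getOrLoad(key, loader) {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value;
    const value = loader(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}

// Stand-in for an MX lookup that counts how often it is called.
let lookups = 0;
const fakeMxLookup = (domain) => {
  lookups += 1;
  return [`mx1.${domain}`];
};

let fakeClockMs = 0;
const mxCache = new TtlCache(5 * 60 * 1000, () => fakeClockMs);

mxCache.getOrLoad("example.com", fakeMxLookup);
mxCache.getOrLoad("example.com", fakeMxLookup);
console.log(lookups); // 1

fakeClockMs += 5 * 60 * 1000; // the entry has now expired
mxCache.getOrLoad("example.com", fakeMxLookup);
console.log(lookups); // 2
```

At hundreds of sends per second to the same handful of large providers, nearly every lookup becomes a cache hit.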
5. Monitor Business Metrics
Technical metrics aren't enough. Track:
- Revenue per email
- Customer churn correlation with delivery issues
- Support ticket rate vs. delivery problems
The Human Side
Scaling isn't just technical:
- **On-call rotations**: Well-rested engineers make better decisions
- **Documentation**: Future you will thank present you
- **Runbooks**: Predefined responses to common issues
- **Post-mortems**: Learn from every incident
Conclusion
Building email infrastructure at scale requires careful attention to architecture, monitoring, and operational practices. There are no shortcuts - but the investment pays off in reliability and customer trust.
At Postalynk, we've applied these lessons to create infrastructure that just works, so you can focus on your product instead of email delivery.