---
title: "Design a Notification System"
tags: [push-notifications, message-queues, fan-out, user-preferences, apns, fcm]
difficulty: medium
estimated_time: 45min
companies: [Google, Meta, Amazon, Uber, LinkedIn]
---

## Problem Statement

Design a notification system that sends notifications to users across multiple
channels (push, email, SMS) triggered by various application events.

## Clarifying Questions

Ask these before designing:

- Channels supported? (Push notifications, email, SMS — all three)
- Scale? (Assume 10M notifications/day, spikes up to 1M/hr during events)
- Latency requirement? (Push: < 5s; email: best-effort; SMS: < 30s)
- User preferences respected? (Yes — users can opt out per channel/type)
- Templating and personalization? (Yes — variable substitution in templates)
- Analytics / delivery tracking? (Yes — delivered, opened, clicked)

## Capacity Estimates

  10M notifications/day ≈ 116/sec average
  Peak: 1M/hr ≈ 278/sec during marketing campaigns
  Fanout: a single event (e.g., new follower) → 1 notification
          a broadcast (e.g., marketing blast) → up to 100M notifications,
          a burst far above the daily baseline that has to be spread out over time
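
The rates above are plain division, and the same arithmetic shows why a broadcast needs its own throughput budget. The sketch below is illustrative only; the 2-hour broadcast window is an assumption, not a stated requirement.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def per_second(count: int, window_seconds: int) -> float:
    """Average send rate needed to push `count` messages within the window."""
    return count / window_seconds


average_rate = per_second(10_000_000, SECONDS_PER_DAY)  # daily baseline
peak_rate = per_second(1_000_000, SECONDS_PER_HOUR)     # hourly spike

# Assumption: a 100M-user broadcast should complete within 2 hours.
broadcast_rate = per_second(100_000_000, 2 * SECONDS_PER_HOUR)

print(f"average:   {average_rate:,.0f}/sec")
print(f"peak:      {peak_rate:,.0f}/sec")
print(f"broadcast: {broadcast_rate:,.0f}/sec")
```

Under that assumption the broadcast rate is roughly fifty times the stated peak, which is why broadcasts get a dedicated fan-out path and outbound rate limiting.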

## High-Level Design

```
[Event Sources]
(App Services, Scheduled Jobs, Admin Campaigns)
     |
     v
[Notification Service / API]
     |
     v
[Validation + User Preference Check]
     |
     v
[Message Queue (Kafka)]
     |
  ┌──┴──────────────┬───────────────┐
  v                 v               v
[Push Worker]   [Email Worker]  [SMS Worker]
  |                 |               |
  v                 v               v
[APNs / FCM]   [SendGrid/SES]   [Twilio]
  |                 |               |
  v                 v               v
         [Delivery Tracker DB]
```

## Event Ingestion

Notification triggers arrive from many sources:
- Application services (user liked your post, your order shipped)
- Scheduled jobs (weekly digest emails, daily reminders)
- Admin campaigns (marketing blasts, product announcements)

The Notification Service exposes an internal API:

  POST /notifications
  {
    "user_id": "uuid",
    "type": "ORDER_SHIPPED",
    "template_id": "order-shipped-v2",
    "data": { "order_id": "123", "carrier": "UPS" },
    "channels": ["push", "email"]
  }
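
A minimal sketch of how the service might validate this request body before doing any further work. The `NotificationRequest` class and `parse_request` helper are hypothetical names, and treating `data` as optional is an assumption.

```python
from dataclasses import dataclass

ALLOWED_CHANNELS = {"push", "email", "sms"}
REQUIRED_FIELDS = ("user_id", "type", "template_id", "channels")


@dataclass(frozen=True)
class NotificationRequest:
    user_id: str
    type: str
    template_id: str
    data: dict
    channels: tuple


def parse_request(body: dict) -> NotificationRequest:
    """Reject malformed requests early, before preference checks or enqueuing."""
    missing = [name for name in REQUIRED_FIELDS if not body.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    unknown = set(body["channels"]) - ALLOWED_CHANNELS
    if unknown:
        raise ValueError(f"unsupported channels: {sorted(unknown)}")
    return NotificationRequest(
        user_id=body["user_id"],
        type=body["type"],
        template_id=body["template_id"],
        data=dict(body.get("data", {})),  # assumption: template data is optional
        channels=tuple(body["channels"]),
    )
```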

## User Preferences

Before enqueuing, check whether the notification should be sent at all:
- User opted out of this notification type?
- User opted out of this channel?
- User in do-not-disturb hours?
- User's device token still valid?

Store preferences in a fast read store (Redis or a SQL table with read replicas).
Invalid/expired tokens should be cleaned up proactively (APNs/FCM return errors
for invalid tokens — act on these responses).
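
The four checks can be sketched as a single predicate. `UserPreferences` and `should_send` are hypothetical names, and this sketch simply drops notifications that fall inside do-not-disturb hours; a real system might defer them instead.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Optional


@dataclass
class UserPreferences:
    opted_out_types: set = field(default_factory=set)
    opted_out_channels: set = field(default_factory=set)
    dnd_start: Optional[time] = None  # user's local time
    dnd_end: Optional[time] = None


def in_dnd(prefs: UserPreferences, local_now: time) -> bool:
    """True if local_now falls in the do-not-disturb window (may wrap midnight)."""
    if prefs.dnd_start is None or prefs.dnd_end is None:
        return False
    if prefs.dnd_start <= prefs.dnd_end:
        return prefs.dnd_start <= local_now < prefs.dnd_end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return local_now >= prefs.dnd_start or local_now < prefs.dnd_end


def should_send(prefs: UserPreferences, notification_type: str, channel: str,
                local_now: time, has_valid_token: bool = True) -> bool:
    if notification_type in prefs.opted_out_types:
        return False
    if channel in prefs.opted_out_channels:
        return False
    if in_dnd(prefs, local_now):
        return False
    if channel == "push" and not has_valid_token:
        return False
    return True
```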

## Message Queue: Kafka

Use separate Kafka topics per channel:
  notifications.push   → consumed by Push Workers
  notifications.email  → consumed by Email Workers
  notifications.sms    → consumed by SMS Workers

Benefits:
  - Channel workers scale independently (push volume >> SMS volume)
  - Backpressure: if email provider is slow, email queue absorbs the spike
  - Retry: failed deliveries requeued with exponential backoff
  - Replay: re-send a batch if a provider had an outage
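
As a rough illustration of the routing and retry points, the sketch below maps channels to the topics listed above and computes a retry delay. The "full jitter" variant of exponential backoff and the `base`/`cap` defaults are assumptions; the actual producer and consumer code is omitted.

```python
import random

TOPIC_BY_CHANNEL = {
    "push": "notifications.push",
    "email": "notifications.email",
    "sms": "notifications.sms",
}


def topics_for(channels):
    """Resolve the Kafka topic for each requested channel."""
    return [TOPIC_BY_CHANNEL[channel] for channel in channels]


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng=random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Full jitter: a uniform value in [0, min(cap, base * 2**attempt)), so
    workers that failed together do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

Kafka itself does not delay messages, so a worker would typically either sleep before re-processing or publish the message to a separate retry topic along with its next-attempt time.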

## Push Notification Workers

**iOS (APNs):**
  - HTTP/2 connection to APNs; send device token + payload
  - Max payload: 4 KB; keep notifications small
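  - Handle 410 (Unregistered) and 400 with reason BadDeviceToken → remove the device token; other 400 reasons (e.g., BadTopic) indicate a bad request, not a bad token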

**Android (FCM):**
  - HTTP v1 API; send registration token + message
  - Similar token lifecycle management as APNs
  - Supports data messages (silent wake) and notification messages

**Token management:**
  - Device registers token on app install/update → stored in DB
  - Tokens expire or are invalidated on reinstall — APNs/FCM tell you
  - Cron job: prune tokens that haven't been refreshed in 90+ days
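
A small sketch of the two cleanup paths: reacting to provider responses and pruning stale tokens. The function names are hypothetical, and the status/reason mapping reflects one reading of the APNs response codes; check it against the current provider documentation before relying on it.

```python
from datetime import datetime, timedelta, timezone


def should_remove_token(status: int, reason: str = "") -> bool:
    """Decide from an APNs-style response whether the device token is dead."""
    if status == 410:  # token is no longer active for the app
        return True
    return status == 400 and reason == "BadDeviceToken"


def stale_tokens(last_refreshed: dict, now: datetime, max_age_days: int = 90) -> list:
    """Tokens whose last refresh is older than the cutoff (candidates for pruning)."""
    cutoff = now - timedelta(days=max_age_days)
    return [token for token, seen in last_refreshed.items() if seen < cutoff]


# Usage: the cron job would load (token, last_refreshed) pairs from the DB.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
tokens = {
    "token-a": now - timedelta(days=10),
    "token-b": now - timedelta(days=120),
}
print(stale_tokens(tokens, now))
```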

## Fan-out for Broadcast Campaigns

Sending to 100M users at once:
1. Admin creates campaign: template + target segment + schedule
2. Campaign Service queries user segment (e.g., "all US users")
3. Produces user_ids in batches to a campaign Kafka topic
4. Fan-out Workers read batches, apply preference checks, write to channel queues
5. Channel workers deliver as normal
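
Steps 3 and 4 can be sketched as two small functions: one that chunks the segment into batches, and one that expands a batch into per-channel messages. `batched`, `fan_out`, and the `allows` callback are hypothetical names; Kafka I/O and the segment query are left out.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(user_ids: Iterable, batch_size: int) -> Iterator[list]:
    """Yield lists of up to batch_size ids without loading the whole segment."""
    iterator = iter(user_ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fan_out(batch: list, channels: list, allows: Callable) -> list:
    """Expand one batch into (channel, user_id) pairs, dropping opted-out pairs.

    `allows(user_id, channel)` stands in for the preference check.
    """
    return [(channel, user_id)
            for user_id in batch
            for channel in channels
            if allows(user_id, channel)]
```

Batching keeps each campaign message small and lets many Fan-out Workers process one campaign in parallel.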

Rate limiting on the way out:
  - APNs: no published rate limit, but reuse long-lived HTTP/2 connections and back off on 429/503 responses
  - Email: SendGrid/SES have per-second send limits — use their batch APIs
  - SMS: most expensive channel — additional confirmation step for bulk

## Delivery Tracking

Track each notification through its lifecycle:

  notification_log
  ┌──────────────────┬──────────────────────────────────┐
  │ notification_id  │ UUID        PRIMARY KEY           │
  │ user_id          │ UUID                              │
  │ type             │ VARCHAR                           │
  │ channel          │ ENUM(push, email, sms)            │
  │ status           │ ENUM(queued, sent, delivered,     │
  │                  │      failed, opened, clicked)     │
  │ provider_id      │ VARCHAR     (provider message ID) │
  │ created_at       │ TIMESTAMP                         │
  │ updated_at       │ TIMESTAMP                         │
  └──────────────────┴──────────────────────────────────┘

Delivery confirmations:
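  - APNs/FCM: a success response only means the provider accepted the message, not that the device displayed it; track delivered/opened through client-side reporting (or provider analytics where available)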
  - Email: webhook from SendGrid for opens/clicks (requires tracking pixel/links)
  - SMS: Twilio webhooks for delivery status
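
Because confirmations arrive asynchronously from different providers, a status update handler needs a rule for events that show up late or out of order. The sketch below is one possible rule, not a prescribed one: statuses only move forward, and `failed` is treated as terminal and only reachable before delivery. Both choices are assumptions.

```python
# Forward-only ordering of the non-failure statuses from notification_log.
STATUS_RANK = {"queued": 0, "sent": 1, "delivered": 2, "opened": 3, "clicked": 4}


def apply_event(current: str, incoming: str) -> str:
    """Return the status to store after receiving `incoming`."""
    if incoming == "failed":
        # Assumption: a failure report after delivery is ignored.
        return "failed" if current in ("queued", "sent") else current
    if current == "failed":
        return current  # assumption: failed is terminal in this sketch
    if STATUS_RANK[incoming] > STATUS_RANK[current]:
        return incoming
    return current  # stale or duplicate event: keep the later status
```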

## Deep Dives

### Deduplication
  - Events can arrive more than once (at-least-once delivery from upstream services)
  - Generate a deterministic notification_id from (user_id + event_type + event_id)
  - Check dedup cache (Redis) before enqueuing — skip if already processed
  - Window: 24h dedup to prevent double-notifications from retried events
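
A minimal sketch of the idea, using a name-based UUID (UUIDv5) for the deterministic ID and an in-memory dictionary standing in for Redis. The namespace string and the `DedupCache` class are hypothetical; with Redis, the equivalent check is typically a single `SET key value NX EX 86400`.

```python
import time
import uuid

# Assumption: any fixed namespace works, as long as every producer uses the same one.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "notifications.example.com")


def notification_id(user_id: str, event_type: str, event_id: str) -> uuid.UUID:
    """Same inputs always produce the same ID, so retried events collide."""
    return uuid.uuid5(NAMESPACE, f"{user_id}:{event_type}:{event_id}")


class DedupCache:
    """In-memory stand-in for a Redis key with a TTL."""

    def __init__(self, ttl_seconds: float = 24 * 60 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}

    def first_time(self, key) -> bool:
        """True if key was not seen within the TTL window; records it if so."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + self._ttl
        return True
```

Usage: compute the ID, call `first_time`, and enqueue only when it returns `True`.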

### Priority Queues
  - Transactional notifications (order shipped, password reset): high priority, low latency
  - Marketing notifications: low priority, bulk delivery acceptable
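  - Kafka has no per-message priority, so use separate topics per priority level within each channel and give the high-priority consumers dedicated capacity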
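
One way a worker could favor transactional traffic when it reads from a high-priority and a low-priority source is strict priority draining, sketched below with in-memory deques in place of Kafka consumers. `next_batch` is a hypothetical helper; note that strict priority can starve the low-priority queue under sustained load, so a weighted split may be preferable.

```python
from collections import deque


def next_batch(high: deque, low: deque, max_items: int) -> list:
    """Fill the batch from the high-priority queue first, then from low."""
    batch = []
    while high and len(batch) < max_items:
        batch.append(high.popleft())
    while low and len(batch) < max_items:
        batch.append(low.popleft())
    return batch
```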

### Template Engine
  - Templates stored in DB with variable placeholders: "Hi {{first_name}}, your order..."
  - Render at worker time (not enqueue time) — most up-to-date user data
  - Localization: store template per locale; select based on user.locale
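
A small sketch of placeholder substitution and locale selection. The `render` and `pick_template` helpers are hypothetical, and the fallback order (exact locale, then language, then a default) is an assumption; a production system would more likely use an established template library.

```python
import re

# Matches {{name}} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, data: dict) -> str:
    """Replace each {{variable}} with its value; fail loudly on missing data."""
    def substitute(match):
        key = match.group(1)
        if key not in data:
            raise KeyError(f"missing template variable: {key}")
        return str(data[key])
    return PLACEHOLDER.sub(substitute, template)


def pick_template(templates_by_locale: dict, locale: str, default: str = "en") -> str:
    """Try the exact locale (e.g. 'pt-BR'), then the language ('pt'), then the default."""
    for candidate in (locale, locale.split("-")[0], default):
        if candidate in templates_by_locale:
            return templates_by_locale[candidate]
    raise LookupError(f"no template for locale {locale!r} or default {default!r}")
```

Failing on a missing variable at render time keeps a half-filled message such as "Hi , your order..." from reaching users.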

### Rate Limiting per User
  - Protect users from notification spam during system incidents
  - Max N notifications per user per hour (configurable per type)
  - Use Redis token bucket per user_id
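
The token bucket logic itself is short. The sketch below keeps state in process memory for clarity; with Redis, the same refill-and-take step would usually run atomically (for example in a Lua script) against a key per user_id. The class name and parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity: int, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._updated = clock()

    def allow(self) -> bool:
        """Take one token if available; return whether the send may proceed."""
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._updated = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# Usage: one bucket per user, e.g. at most 10 notifications per hour on average.
buckets = {}


def allow_for_user(user_id: str) -> bool:
    bucket = buckets.setdefault(user_id, TokenBucket(10, 10 / 3600))
    return bucket.allow()
```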

## Key Decisions to Highlight

1. Separate Kafka topics per channel — independent scaling and failure isolation
2. Preference check before queuing — don't enqueue what won't be delivered
3. Idempotent notification IDs — prevent duplicate delivery from upstream retries
4. Delivery status tracking — enables analytics and debugging failed deliveries