Building a Scalable User Notification System: Architecture and Database Design
Design patterns, database schemas, and architectural decisions for building enterprise notification systems that handle millions of users
Multi-channel notification systems fail when teams model them as template-plus-send pipelines instead of per-user routing decisions that must fan out, coalesce, suppress, or defer across email, SMS, push, and in-app channels. Without an explicit router layer, delivery guarantees collapse, user preferences get ignored, and compliance auditability becomes impossible. This post covers the event-driven architecture, channel router pattern, PostgreSQL schema, and observability surface needed to build a production notification system.
This post is part 1 of a series on building a production notification system. It covers the event-driven architecture (producers, router, dispatcher), the database schema for events, preferences, and delivery state, the channel router pattern, and the observability surface that makes failures debuggable before they reach the user.
The Hidden Complexity of “Simple” Notifications
The initial mental model for notifications is: trigger event → send message → done. The production reality is a complex orchestration of user preferences, delivery channels, rate limiting, retry logic, template management, analytics tracking, and regulatory compliance.
The complexity becomes apparent during a first major product launch. With 10,000 users suddenly receiving welcome emails, password resets, and activity notifications simultaneously, the email service starts throttling, the database connection pool maxes out, and users start complaining about duplicate notifications.
System Architecture
The architecture below has been validated across different scales and industries. Every component exists because something broke in production without it.
Event-Driven Architecture
Notifications are not request-response operations. They are fire-and-forget events that require asynchronous processing. The following event structure works reliably across multiple systems:
interface NotificationEvent {
id: string;
userId: string;
type: NotificationType;
templateId?: string;
data: Record<string, any>;
priority: 'low' | 'normal' | 'high' | 'critical';
scheduledAt?: Date;
expiresAt?: Date;
metadata: {
source: string;
correlationId: string;
retryCount: number;
maxRetries: number;
};
}
enum NotificationType {
PROJECT_UPDATE = 'project_update',
SECURITY_ALERT = 'security_alert',
FEATURE_ANNOUNCEMENT = 'feature_announcement',
SYSTEM_MAINTENANCE = 'system_maintenance',
USER_ACTIVITY = 'user_activity',
INTEGRATION_UPDATE = 'integration_update'
}
The metadata section is crucial: the correlation ID enables tracing notification flows across distributed systems and is essential for debugging delivery failures.
The Notification Engine: Heart of the System
The notification engine is where most of the complexity lives. The following implementation reflects several iterations of refinement:
class NotificationEngine {
constructor(
private eventBus: EventBus,
private templateService: TemplateService,
private preferenceManager: PreferenceManager,
private rateLimiter: RateLimiter,
private channelRouter: ChannelRouter,
private analytics: AnalyticsService
) {}
async processEvent(event: NotificationEvent): Promise<void> {
try {
// Check if user exists and is active
const user = await this.getUserWithPreferences(event.userId);
if (!user?.isActive) {
await this.analytics.trackSkipped(event.id, 'user_inactive');
return;
}
// Apply user preferences filtering
const enabledChannels = await this.preferenceManager
.getEnabledChannels(event.userId, event.type);
if (enabledChannels.length === 0) {
await this.analytics.trackSkipped(event.id, 'all_channels_disabled');
return;
}
// Rate limiting check
const rateLimitResult = await this.rateLimiter
.checkLimits(event.userId, event.type);
if (!rateLimitResult.allowed) {
await this.scheduleRetry(event, rateLimitResult.retryAfter);
return;
}
// Process each enabled channel
const deliveryPromises = enabledChannels.map(channel =>
this.processChannel(event, channel, user)
);
const results = await Promise.allSettled(deliveryPromises);
await this.analytics.trackDeliveryResults(event.id, results);
} catch (error) {
await this.handleProcessingError(event, error);
}
}
private async processChannel(
event: NotificationEvent,
channel: NotificationChannel,
user: User
): Promise<DeliveryResult> {
// Template rendering with user data
const template = await this.templateService.getTemplate(
event.type,
channel,
user.locale
);
const renderedContent = await this.templateService.render(
template,
{ ...event.data, user }
);
// Route to appropriate channel handler
return await this.channelRouter.deliver(
channel,
user,
renderedContent,
event.metadata
);
}
}
The key insight here: every operation can fail, and you need to handle failures gracefully while maintaining visibility into what’s happening.
Database Design: Schema and Indexing Strategy
The schema below has been refined across multiple iterations and has held up under production-scale load:
Core Tables
-- Users table (assuming it exists)
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email VARCHAR(255) UNIQUE NOT NULL,
phone VARCHAR(20),
locale VARCHAR(10) DEFAULT 'en',
timezone VARCHAR(50) DEFAULT 'UTC',
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
-- The preference system - this gets complex fast
CREATE TABLE notification_preferences (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id) ON DELETE CASCADE,
notification_type VARCHAR(100) NOT NULL,
channel VARCHAR(50) NOT NULL,
enabled BOOLEAN DEFAULT true,
frequency VARCHAR(20) DEFAULT 'immediate', -- immediate, daily, weekly
quiet_hours_start TIME DEFAULT '22:00:00',
quiet_hours_end TIME DEFAULT '08:00:00',
metadata JSONB DEFAULT '{}', -- for channel-specific settings
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(user_id, notification_type, channel)
);
-- Template management - localization is crucial
CREATE TABLE notification_templates (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name VARCHAR(255) NOT NULL,
notification_type VARCHAR(100) NOT NULL,
channel VARCHAR(50) NOT NULL,
locale VARCHAR(10) DEFAULT 'en',
subject VARCHAR(500),
body TEXT NOT NULL,
variables JSONB DEFAULT '{}', -- expected variables
is_active BOOLEAN DEFAULT true,
version INTEGER DEFAULT 1,
created_by UUID REFERENCES users(id),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(notification_type, channel, locale, version)
);
Event Storage and Tracking
Event storage is the area most prone to scaling surprises. The schema below addresses the most common failure modes:
-- Main event table - this gets HUGE
CREATE TABLE notification_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id) ON DELETE CASCADE,
notification_type VARCHAR(100) NOT NULL,
template_id UUID REFERENCES notification_templates(id),
priority VARCHAR(20) DEFAULT 'normal',
data JSONB DEFAULT '{}',
scheduled_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
expires_at TIMESTAMP WITH TIME ZONE,
status VARCHAR(20) DEFAULT 'pending',
processed_at TIMESTAMP WITH TIME ZONE,
correlation_id VARCHAR(255), -- for tracing
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
INDEX idx_notification_events_user_created (user_id, created_at DESC),
INDEX idx_notification_events_status (status, scheduled_at),
INDEX idx_notification_events_correlation (correlation_id)
);
-- Delivery tracking - separate for performance
CREATE TABLE notification_deliveries (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_id UUID REFERENCES notification_events(id) ON DELETE CASCADE,
channel VARCHAR(50) NOT NULL,
status VARCHAR(20) DEFAULT 'pending', -- pending, sent, delivered, failed, bounced
attempt_count INTEGER DEFAULT 0,
max_attempts INTEGER DEFAULT 3,
next_retry_at TIMESTAMP WITH TIME ZONE,
sent_at TIMESTAMP WITH TIME ZONE,
delivered_at TIMESTAMP WITH TIME ZONE,
failed_at TIMESTAMP WITH TIME ZONE,
error_code VARCHAR(50),
error_message TEXT,
provider_id VARCHAR(255), -- external provider message ID
provider_response JSONB,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
INDEX idx_deliveries_event_channel (event_id, channel),
INDEX idx_deliveries_retry (status, next_retry_at) WHERE status = 'pending'
);
-- Analytics aggregation table - learned this the hard way
CREATE TABLE notification_metrics (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
date DATE NOT NULL,
hour SMALLINT NOT NULL, -- 0-23
notification_type VARCHAR(100) NOT NULL,
channel VARCHAR(50) NOT NULL,
status VARCHAR(20) NOT NULL,
count INTEGER DEFAULT 1,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(date, hour, notification_type, channel, status)
);
Performance Lessons from Production
Here are the indexing strategies that actually matter when you’re processing millions of notifications:
-- Critical indexes based on query patterns
CREATE INDEX idx_events_user_type_created
ON notification_events(user_id, notification_type, created_at DESC);
CREATE INDEX idx_events_processing_queue
ON notification_events(status, scheduled_at)
WHERE status IN ('pending', 'retry');
CREATE INDEX idx_deliveries_retry_queue
ON notification_deliveries(next_retry_at, status)
WHERE status = 'pending' AND next_retry_at IS NOT NULL;
-- Partial indexes for common queries
CREATE INDEX idx_events_recent_active
ON notification_events(created_at DESC)
WHERE created_at > NOW() - INTERVAL '7 days';
-- For analytics queries
CREATE INDEX idx_metrics_time_type
ON notification_metrics(date, hour, notification_type);
The partial indexes are crucial. Without them, your analytics queries will start timing out when you hit millions of events.
User Preference Management
User preferences seem straightforward until you hit edge cases. Here’s the preference manager that’s handled real-world complexity:
class PreferenceManager {
async getEnabledChannels(
userId: string,
notificationType: string
): Promise<NotificationChannel[]> {
// Check global user preferences
const userPrefs = await this.db.query(`
SELECT np.channel, np.enabled, np.frequency,
np.quiet_hours_start, np.quiet_hours_end,
u.timezone
FROM notification_preferences np
JOIN users u ON u.id = np.user_id
WHERE np.user_id = $1 AND np.notification_type = $2
`, [userId, notificationType]);
if (userPrefs.length === 0) {
// Use default preferences for this notification type
return this.getDefaultChannels(notificationType);
}
const currentTime = new Date();
const enabledChannels: NotificationChannel[] = [];
for (const pref of userPrefs) {
if (!pref.enabled) continue;
// Check quiet hours
if (this.isInQuietHours(currentTime, pref)) {
// Check if this is a critical notification that overrides quiet hours
if (!this.isCriticalNotification(notificationType)) {
continue;
}
}
// Check frequency preferences
if (!this.shouldSendBasedOnFrequency(userId, pref.frequency, notificationType)) {
continue;
}
enabledChannels.push(pref.channel as NotificationChannel);
}
return enabledChannels;
}
private isInQuietHours(currentTime: Date, pref: any): boolean {
// Convert current time to user's timezone
const userTime = moment(currentTime)
.tz(pref.timezone || 'UTC')
.format('HH:mm:ss');
const quietStart = pref.quiet_hours_start;
const quietEnd = pref.quiet_hours_end;
// Handle quiet hours that cross midnight
if (quietStart > quietEnd) {
return userTime >= quietStart || userTime <= quietEnd;
}
return userTime >= quietStart && userTime <= quietEnd;
}
}
Timezone handling alone requires multiple iterations to get right. User preferences become significantly more complex in a global application than they first appear.
Template System: Localization and Personalization
Templates are where the rubber meets the road for user experience. Here’s the template service that handles localization, personalization, and A/B testing:
interface Template {
id: string;
name: string;
type: string;
channel: string;
locale: string;
subject?: string;
body: string;
variables: Record<string, TemplateVariable>;
abTest?: ABTestConfig;
}
class TemplateService {
async getTemplate(
notificationType: string,
channel: NotificationChannel,
locale: string = 'en'
): Promise<Template> {
// Try to get localized template first
let template = await this.db.findTemplate({
type: notificationType,
channel,
locale,
isActive: true
});
// Fallback to English if no localized version
if (!template && locale !== 'en') {
template = await this.db.findTemplate({
type: notificationType,
channel,
locale: 'en',
isActive: true
});
}
if (!template) {
throw new Error(`No template found for ${notificationType}/${channel}/${locale}`);
}
return template;
}
async render(template: Template, data: Record<string, any>): Promise<RenderedContent> {
try {
// Validate required variables
await this.validateTemplateData(template, data);
// Process template with Handlebars or similar
const subject = template.subject
? await this.renderString(template.subject, data)
: undefined;
const body = await this.renderString(template.body, data);
return {
subject,
body,
templateId: template.id,
locale: template.locale
};
} catch (error) {
// Log template rendering errors for debugging
await this.logger.error('Template rendering failed', {
templateId: template.id,
error: error.message,
data: this.sanitizeDataForLogging(data)
});
throw new TemplateRenderError(`Failed to render template ${template.id}`, error);
}
}
}
Rate Limiting: Protecting Users and Providers
Rate limiting balances user experience with system stability. The following implementation covers per-user, per-type limits with atomic Redis checks:
interface RateLimitConfig {
notificationType: string;
channel: string;
limits: {
perMinute: number;
perHour: number;
perDay: number;
};
burstAllowance: number;
}
class RateLimiter {
constructor(private redis: Redis, private configs: RateLimitConfig[]) {}
async checkLimits(
userId: string,
notificationType: string
): Promise<RateLimitResult> {
const config = this.getConfig(notificationType);
if (!config) {
return { allowed: true, remainingToday: Infinity };
}
const now = Date.now();
const keys = {
minute: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 60000)}`,
hour: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 3600000)}`,
day: `rate_limit:${userId}:${notificationType}:${Math.floor(now / 86400000)}`
};
// Use Redis pipeline for atomic checks
const pipeline = this.redis.pipeline();
pipeline.incr(keys.minute);
pipeline.expire(keys.minute, 60);
pipeline.incr(keys.hour);
pipeline.expire(keys.hour, 3600);
pipeline.incr(keys.day);
pipeline.expire(keys.day, 86400);
const results = await pipeline.exec();
const counts = {
minute: results[0][1] as number,
hour: results[2][1] as number,
day: results[4][1] as number
};
// Check against limits
if (counts.minute > config.limits.perMinute ||
counts.hour > config.limits.perHour ||
counts.day > config.limits.perDay) {
return {
allowed: false,
retryAfter: this.calculateRetryAfter(counts, config),
remainingToday: Math.max(0, config.limits.perDay - counts.day)
};
}
return {
allowed: true,
remainingToday: config.limits.perDay - counts.day
};
}
}
Key Lessons for Production Systems
Systems handling millions of messages daily surface the same lessons repeatedly:
-
Start with idempotency: Every notification operation should be idempotent. Users notice duplicates more acutely than missing notifications.
-
Design for observability: Debugging delivery issues consumes more time than building features. Correlation IDs and detailed logging are not optional.
-
Separate concerns early: A notification engine that grows into a monolith becomes hard to scale. Each channel should be independently deployable.
-
Plan for data retention: Notification data grows quickly. A retention and archiving strategy is needed from day one.
-
User preferences are complex: A simple on/off toggle becomes timezone-aware, frequency-based, channel-specific preferences with quiet hours and emergency overrides.
The next part of this series covers real-time delivery mechanisms: WebSocket connections, push notifications, and the channel-specific implementations. It also covers why retry logic and circuit breakers are required, not optional, in a production notification service.
References
- What is Amazon SNS? - AWS Documentation - Overview of Amazon Simple Notification Service as the pub/sub backbone for multi-channel notification fanout architecture
- Common Amazon SNS scenarios - AWS Documentation - Fanout, mobile push, and application-to-application messaging patterns that underpin scalable notification architectures
- Using dead-letter queues in Amazon SQS - AWS Documentation - SQS dead-letter queue configuration for isolating and replaying failed notification deliveries
- Firebase Cloud Messaging - firebase.google.com - Official FCM documentation covering multi-platform push notification delivery for Android, iOS, and web
- Sending notification requests to APNs - Apple Developer - Apple’s official guide for transmitting push notification payloads to iOS and macOS devices via APNs
The foundation outlined here may seem over-engineered for a simple notification system. When debugging why 50,000 users did not receive their password reset emails during a product launch, every piece of observability and resilience built into this design proves its value.
Building a Scalable User Notification System
A comprehensive 4-part series covering the design, implementation, and production challenges of building enterprise-grade notification systems. From architecture and database design to real-time delivery, debugging at scale, and performance optimization.
All posts in this series
Related posts
Choose the right database for your project across SQL, NoSQL, NewSQL, and edge solutions, with real implementation stories and performance benchmarks.
What Aurora Serverless v2 is under the hood: the shared storage layer, ACU-driven compute, the Caspian substrate, scale-to-zero, and mixed-mode clusters.
Build SaaS authorization with AWS Cognito and Verified Permissions, covering Cedar policies, multi-tenant patterns, JWT flow, and cost in TypeScript.
A vendor-neutral evaluation of AWS Verified Permissions, SpiceDB, OpenFGA, Cerbos, and OPA, with architecture patterns, cost analysis, and a decision framework.
A deep comparison of Cedar, Rego, OpenFGA DSL, and Cerbos YAML/CEL policy languages: syntax, performance, formal verification, tooling, and TypeScript integration.