In today’s software systems, many applications need to execute background tasks—at scale, reliably, and at the right time. This blog post covers a comprehensive system design for a Distributed Job Scheduler, diving deep into functional/non-functional requirements, high-level design, storage choices, idempotency, queue processing, and integration with tools like AWS SQS and Kafka.
🔧 Functional Requirements
- Users can submit jobs/tasks to run on a particular schedule.
- Support for manual triggering of jobs.
- Jobs are Python scripts (support for other languages can be added later).
- At-least-once execution of scheduled jobs.
- Recurring jobs with enable/disable mechanisms.
- Job sequencing support: Job1 -> Job2 -> Job3
📈 Non-Functional Requirements
- High availability.
- At-least-once execution of jobs.
- Job runs must happen on or close to schedule (2-4s delay acceptable).
- Durability: no job should be lost even if system crashes.
- Notify users when job runs are delayed.
- Scalability to ~1 billion (10⁹) jobs/day.
📊 Throughput & Storage Estimations
Throughput: 10⁹ jobs/day ÷ ~86,400 s/day ≈ 10⁴ jobs/sec
Storage: assuming 200 lines/job at 50 bytes/line (≈10 KB/job), total: 200 × 50 × 10⁹ = 10¹³ bytes ≈ 10 TB/day for job scripts
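These figures can be sanity-checked with quick arithmetic:

```java
public class Estimations {
    public static void main(String[] args) {
        long jobsPerDay = 1_000_000_000L;            // 10^9 jobs/day
        long secondsPerDay = 24L * 60 * 60;          // 86,400
        long jobsPerSec = jobsPerDay / secondsPerDay;
        System.out.println("jobs/sec ~= " + jobsPerSec); // ~11,574, on the order of 10^4

        long bytesPerJob = 200L * 50L;                // 200 lines * 50 bytes/line = 10 KB
        long bytesPerDay = bytesPerJob * jobsPerDay;  // 10^13 bytes
        System.out.println("TB/day ~= " + bytesPerDay / 1_000_000_000_000L); // 10 TB
    }
}
```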
🏗️ High-Level Design
🔹 API Design
POST /api/v1/jobs
{
  "taskId": "someUniqueId",
  "cron": "*/5 * * * *",
  "params": {
    "p1": "val1",
    "p2": "val2"
  }
}
GET /api/v1/jobs/:jobId
Returns the job's status.
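A response might look like the following (the field names are illustrative, mirroring the sample schema later in this post):

```json
{
  "jobId": "job_abc_123",
  "status": "RUNNING",
  "attempt": 1,
  "lastRunAt": "2025-05-24T10:00:05Z",
  "nextRunAt": "2025-05-24T10:05:00Z"
}
```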
🧠 Database Considerations
Since the system is write-heavy, a NoSQL store such as DynamoDB, Cassandra, or MongoDB is a better fit.
Writes:
- Create job
- Record job run
- Update status
Reads:
- Poll jobs due in the next minute (window is configurable)
- Sort and enqueue them
❌ Why not SQL?
- No normalization required; each job record is a self-contained document.
- Strict ACID guarantees across tables are not critical for this workload.
Sample Schema
{
  "time_bucket": "2025-05-24T10:00:00Z",
  "execution_time": "2025-05-24T10:00:05Z",
  "job_id": "job_abc_123",
  "user_id": "user_123",
  "status": "PENDING",
  "attempt": 0,
  "script_url": "s3://bucket-name/path-to-script.py"
}
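The time_bucket field can be derived by truncating the execution time to the start of its one-minute polling window — a small sketch using java.time:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class TimeBucket {
    // Derive the partition key ("time_bucket") from a job's execution time
    // by truncating to the start of its one-minute window.
    static Instant bucketOf(Instant executionTime) {
        return executionTime.truncatedTo(ChronoUnit.MINUTES);
    }

    public static void main(String[] args) {
        Instant exec = Instant.parse("2025-05-24T10:00:05Z");
        System.out.println(bucketOf(exec)); // 2025-05-24T10:00:00Z
    }
}
```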
🔄 Scheduling & Execution Flow
- WatcherService polls the DB every minute (via a CronJob or periodic ECS task).
- Picks jobs due in the next 1-minute window.
- Pushes them to a FIFO SQS queue.
- ExecutorService pulls from SQS.
- Spawns an isolated ECS container to run each job securely.
- Performs an idempotency check.
- Executes the job and updates the DB.
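The watcher's pick-and-sort step can be sketched in-memory as follows (the real service queries the DB by time_bucket and pushes to SQS; the `Job` record and `pickWindow` method are illustrative):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Comparator;
import java.util.List;

public class WatcherSketch {
    record Job(String jobId, Instant executionTime) {}

    // Select jobs whose execution time falls in the next one-minute window,
    // sorted by execution time (the real version is a DB query on time_bucket).
    static List<Job> pickWindow(List<Job> all, Instant now) {
        Instant windowEnd = now.plus(1, ChronoUnit.MINUTES);
        return all.stream()
                .filter(j -> !j.executionTime().isBefore(now)
                          && j.executionTime().isBefore(windowEnd))
                .sorted(Comparator.comparing(Job::executionTime))
                .toList();
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2025-05-24T10:00:00Z");
        List<Job> jobs = List.of(
                new Job("job_b", Instant.parse("2025-05-24T10:00:30Z")),
                new Job("job_a", Instant.parse("2025-05-24T10:00:05Z")),
                new Job("job_c", Instant.parse("2025-05-24T10:02:00Z"))); // outside window
        pickWindow(jobs, now).forEach(j -> System.out.println(j.jobId())); // job_a, job_b
    }
}
```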
🧾 Sequence Diagram – LLD
✅ Handling Duplicate Events (Idempotency)
Problem:
If the ExecutorService crashes mid-execution or fails before DeleteMessage, SQS redelivers the message.
Solution:
Implement idempotency by checking if jobRunId is already processed before executing:
if (jobRunId is already in the DB with status SUCCESS) {
    // Duplicate delivery -> skip processing
} else {
    // Execute the job, then update the DB
}
📬 FIFO SQS Acknowledgement Workflow
- ReceiveMessage: Executor polls for messages.
- Visibility Timeout starts.
- If successfully processed → DeleteMessage() → removes from queue.
- If not deleted within timeout → message is re-queued.
Note that the consumer must explicitly delete the message after processing; this applies to all SQS-based consumers, standard and FIFO queues alike.
🧪 Java Boilerplate (SQS FIFO Consumer)
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import java.util.List;

AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
String queueUrl = "https://sqs.us-east-1.amazonaws.com/your-queue-url";

ReceiveMessageRequest request = new ReceiveMessageRequest()
        .withQueueUrl(queueUrl)
        .withWaitTimeSeconds(10)       // long polling to reduce empty responses
        .withMaxNumberOfMessages(1);

while (true) {
    List<Message> messages = sqs.receiveMessage(request).getMessages();
    for (Message msg : messages) {
        String body = msg.getBody();
        // Deserialize and process
        // Idempotency check here (skip if jobRunId is already SUCCESS)
        // Delete only after successful execution
        sqs.deleteMessage(queueUrl, msg.getReceiptHandle());
    }
}
🔁 How Does Kafka Handle This?
Kafka differs from SQS:
- Kafka does not delete messages on consumption; records are retained per the topic's retention policy.
- Consumers commit offsets, manually when auto-commit is disabled.
- To prevent duplicate execution, keep the idempotency check (e.g., on jobRunId).
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
    if (!jobAlreadyProcessed(record.key())) {   // idempotency check on jobRunId
        process(record.value());
        markJobComplete(record.key());
    }
}
consumer.commitSync();   // commit offsets only after processing succeeds
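Manual offset commits require turning off auto-commit when configuring the consumer — a minimal sketch of the relevant property (the broker address and group id are placeholders):

```java
import java.util.Properties;

public class KafkaCommitConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "job-executor");            // illustrative consumer group
        // Disable auto-commit so offsets advance only after the idempotency
        // check and DB update succeed (pairs with consumer.commitSync() above).
        props.put("enable.auto.commit", "false");
        System.out.println(props.getProperty("enable.auto.commit")); // false
    }
}
```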
🧠 What is Idempotency?
A process is idempotent if running it multiple times has the same effect as running it once.
Example in Job Scheduler:
If a jobRunId is already marked as completed, skip execution.
This ensures safe reprocessing in case of retries.
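As a minimal, self-contained sketch (the `Set` stands in for the job-run table; in the real system the SUCCESS status would be written to the DB after execution):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

public class IdempotencyDemo {
    static final Set<String> completed = new HashSet<>();   // stand-in for job-run table
    static final AtomicInteger executions = new AtomicInteger();

    // Execute at most once per jobRunId, no matter how many times it is delivered.
    static void runOnce(String jobRunId) {
        if (!completed.add(jobRunId)) return; // already processed -> skip
        executions.incrementAndGet();          // actual job execution goes here
    }

    public static void main(String[] args) {
        runOnce("run-42");
        runOnce("run-42"); // redelivery: no second execution
        System.out.println(executions.get()); // 1
    }
}
```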
📈 Metrics & Monitoring (Enhancement Ideas)
Execution Metrics:
- Jobs submitted/executed per minute/hour/day
- Average execution time
- Error/Success ratio
Infrastructure Health:
- CPU/Memory usage for Executor nodes
- Queue size (SQS/Kafka lag)
- DB read/write latency
Alerts & Dashboards:
- Set up alarms for job delays, failures, or high retry count
- Use tools like Prometheus + Grafana or AWS CloudWatch
Dead Letter Queues (DLQ):
- For jobs that fail after multiple retries
Audit Logs:
- Log job submissions, execution attempts, and results
- Store in Elasticsearch or S3 for analysis
📌 Final Thoughts
This blog outlines the complete design of a scalable, reliable distributed job scheduler with real-world production-grade strategies:
- Durable job ingestion
- Idempotent execution
- FIFO queue management (SQS)
- Resilience with AWS services
- Extensibility to Kafka or other pub/sub platforms
- Ideas for metrics, monitoring, and alerting integration
Thanks for reading! I share insightful Real World System Designs and DSA patterns every week. Follow for high-quality, actionable content that will level up your skills!
Follow me for more such posts: https://www.linkedin.com/in/gauravsingh9356/