In today’s software systems, many applications need to execute background tasks—at scale, reliably, and at the right time. This blog post covers a comprehensive system design for a Distributed Job Scheduler, diving deep into functional/non-functional requirements, high-level design, storage choices, idempotency, queue processing, and integration with tools like AWS SQS and Kafka.
🔧 Functional Requirements
- Users can submit jobs/tasks to run on a particular schedule.
- Support for manual triggering of jobs.
- Jobs are Python scripts (support for other languages can be added later).
- At-least-once execution of scheduled jobs.
- Recurring jobs with enable/disable mechanisms.
- Job sequencing support: Job1 -> Job2 -> Job3
📈 Non-Functional Requirements
- High availability.
- At-least-once execution of jobs.
- Job runs must happen on or close to schedule (2-4s delay acceptable).
- Durability: no job should be lost even if system crashes.
- Notify users when job runs are delayed.
- Scalability to ~1 billion (10⁹) jobs/day.
📊 Throughput & Storage Estimations
Throughput: 10⁹ jobs/day ÷ ~86,400 s/day ≈ 10⁴ jobs/sec
Storage: assuming 200 lines/job at 50 bytes/line (≈10 KB/job), total: 200 × 50 × 10⁹ = 10¹³ bytes ≈ 10 TB/day for job scripts
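These figures can be sanity-checked with quick arithmetic:

```java
public class Estimations {
    public static void main(String[] args) {
        long jobsPerDay = 1_000_000_000L;            // 10^9 jobs/day
        long secondsPerDay = 24L * 60 * 60;          // 86,400
        long jobsPerSec = jobsPerDay / secondsPerDay;
        System.out.println("jobs/sec ~= " + jobsPerSec); // ~11,574, on the order of 10^4

        long bytesPerJob = 200L * 50L;                // 200 lines * 50 bytes/line = 10 KB
        long bytesPerDay = bytesPerJob * jobsPerDay;  // 10^13 bytes
        System.out.println("TB/day ~= " + bytesPerDay / 1_000_000_000_000L); // 10 TB
    }
}
```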
🏗️ High-Level Design
🔹 API Design
POST /api/v1/jobs
{
  "taskId": "someUniqueId",
  "cron": "*/5 * * * *",
  "params": {
    "p1": "val1",
    "p2": "val2"
  }
}
GET /api/v1/jobs/:jobId
Returns the job's status.
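A response might look like the following (the field names are illustrative, mirroring the sample schema later in this post):

```json
{
  "jobId": "job_abc_123",
  "status": "RUNNING",
  "attempt": 1,
  "lastRunAt": "2025-05-24T10:00:05Z",
  "nextRunAt": "2025-05-24T10:05:00Z"
}
```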
🧠 Database Considerations
Since the system is write-heavy, a NoSQL store such as DynamoDB, Cassandra, or MongoDB is a better fit.
Writes:
- Create job
- Record job run
- Update status
Reads:
- Poll jobs due in the next minute (window is configurable)
- Sort and enqueue them
❌ Why not SQL?
- No normalization required; each job record is a self-contained document.
- Strict ACID guarantees across tables are not critical for this workload.
Sample Schema
{
  "time_bucket": "2025-05-24T10:00:00Z",
  "execution_time": "2025-05-24T10:00:05Z",
  "job_id": "job_abc_123",
  "user_id": "user_123",
  "status": "PENDING",
  "attempt": 0,
  "script_url": "s3://bucket-name/path-to-script.py"
}
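The time_bucket field can be derived by truncating the execution time to the start of its one-minute polling window — a small sketch using java.time:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class TimeBucket {
    // Derive the partition key ("time_bucket") from a job's execution time
    // by truncating to the start of its one-minute window.
    static Instant bucketOf(Instant executionTime) {
        return executionTime.truncatedTo(ChronoUnit.MINUTES);
    }

    public static void main(String[] args) {
        Instant exec = Instant.parse("2025-05-24T10:00:05Z");
        System.out.println(bucketOf(exec)); // 2025-05-24T10:00:00Z
    }
}
```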
🔄 Scheduling & Execution Flow
- WatcherService polls the DB every minute (via a CronJob or periodic ECS task).
- Picks jobs due in the next 1-minute window.
- Pushes them to a FIFO SQS queue.
- ExecutorService pulls from SQS.
- Spawns an isolated ECS container to run each job securely.
- Performs an idempotency check.
- Executes the job and updates the DB.
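The watcher's pick-and-sort step can be sketched in-memory as follows (the real service queries the DB by time_bucket and pushes to SQS; the `Job` record and `pickWindow` method are illustrative):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Comparator;
import java.util.List;

public class WatcherSketch {
    record Job(String jobId, Instant executionTime) {}

    // Select jobs whose execution time falls in the next one-minute window,
    // sorted by execution time (the real version is a DB query on time_bucket).
    static List<Job> pickWindow(List<Job> all, Instant now) {
        Instant windowEnd = now.plus(1, ChronoUnit.MINUTES);
        return all.stream()
                .filter(j -> !j.executionTime().isBefore(now)
                          && j.executionTime().isBefore(windowEnd))
                .sorted(Comparator.comparing(Job::executionTime))
                .toList();
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2025-05-24T10:00:00Z");
        List<Job> jobs = List.of(
                new Job("job_b", Instant.parse("2025-05-24T10:00:30Z")),
                new Job("job_a", Instant.parse("2025-05-24T10:00:05Z")),
                new Job("job_c", Instant.parse("2025-05-24T10:02:00Z"))); // outside window
        pickWindow(jobs, now).forEach(j -> System.out.println(j.jobId())); // job_a, job_b
    }
}
```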
🧾 Sequence Diagram – LLD
✅ Handling Duplicate Events (Idempotency)
Problem:
If the ExecutorService crashes mid-execution or fails before DeleteMessage, SQS redelivers the message.
Solution:
Implement idempotency by checking if jobRunId is already processed before executing:
if (jobRunId is already in the DB with status SUCCESS) {
    // Duplicate delivery -> skip processing
} else {
    // Execute the job, then update the DB
}
📬 FIFO SQS Acknowledgement Workflow
- ReceiveMessage: Executor polls for messages.
- Visibility Timeout starts.
- If successfully processed → DeleteMessage() → removes from queue.
- If not deleted within timeout → message is re-queued.
Note that the consumer must explicitly delete the message after processing; this applies to all SQS-based consumers, standard and FIFO queues alike.
🧪 Java Boilerplate (SQS FIFO Consumer)
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import java.util.List;

AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
String queueUrl = "https://sqs.us-east-1.amazonaws.com/your-queue-url";

ReceiveMessageRequest request = new ReceiveMessageRequest()
        .withQueueUrl(queueUrl)
        .withWaitTimeSeconds(10)       // long polling to reduce empty responses
        .withMaxNumberOfMessages(1);

while (true) {
    List<Message> messages = sqs.receiveMessage(request).getMessages();
    for (Message msg : messages) {
        String body = msg.getBody();
        // Deserialize and process
        // Idempotency check here (skip if jobRunId is already SUCCESS)
        // Delete only after successful execution
        sqs.deleteMessage(queueUrl, msg.getReceiptHandle());
    }
}
🔁 How Does Kafka Handle This?
Kafka differs from SQS:
- Kafka does not delete messages on consumption; records are retained per the topic's retention policy.
- Consumers commit offsets, manually when auto-commit is disabled.
- To prevent duplicate execution, keep the idempotency check (e.g., on jobRunId).
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
    if (!jobAlreadyProcessed(record.key())) {   // idempotency check on jobRunId
        process(record.value());
        markJobComplete(record.key());
    }
}
consumer.commitSync();   // commit offsets only after processing succeeds
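Manual offset commits require turning off auto-commit when configuring the consumer — a minimal sketch of the relevant property (the broker address and group id are placeholders):

```java
import java.util.Properties;

public class KafkaCommitConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "job-executor");            // illustrative consumer group
        // Disable auto-commit so offsets advance only after the idempotency
        // check and DB update succeed (pairs with consumer.commitSync() above).
        props.put("enable.auto.commit", "false");
        System.out.println(props.getProperty("enable.auto.commit")); // false
    }
}
```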
🧠 What is Idempotency?
A process is idempotent if running it multiple times has the same effect as running it once.
Example in Job Scheduler:
If a jobRunId is already marked as completed, skip execution.
This ensures safe reprocessing in case of retries.
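As a minimal, self-contained sketch (the `Set` stands in for the job-run table; in the real system the SUCCESS status would be written to the DB after execution):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

public class IdempotencyDemo {
    static final Set<String> completed = new HashSet<>();   // stand-in for job-run table
    static final AtomicInteger executions = new AtomicInteger();

    // Execute at most once per jobRunId, no matter how many times it is delivered.
    static void runOnce(String jobRunId) {
        if (!completed.add(jobRunId)) return; // already processed -> skip
        executions.incrementAndGet();          // actual job execution goes here
    }

    public static void main(String[] args) {
        runOnce("run-42");
        runOnce("run-42"); // redelivery: no second execution
        System.out.println(executions.get()); // 1
    }
}
```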
📈 Metrics & Monitoring (Enhancement Ideas)
Execution Metrics:
- Jobs submitted/executed per minute/hour/day
- Average execution time
- Error/Success ratio
Infrastructure Health:
- CPU/Memory usage for Executor nodes
- Queue size (SQS/Kafka lag)
- DB read/write latency
Alerts & Dashboards:
- Set up alarms for job delays, failures, or high retry count
- Use tools like Prometheus + Grafana or AWS CloudWatch
Dead Letter Queues (DLQ):
- For jobs that fail after multiple retries
Audit Logs:
- Log job submissions, execution attempts, and results
- Store in Elasticsearch or S3 for analysis
📌 Final Thoughts
This blog outlines the complete design of a scalable, reliable distributed job scheduler with real-world production-grade strategies:
- Durable job ingestion
- Idempotent execution
- FIFO queue management (SQS)
- Resilience with AWS services
- Extensibility to Kafka or other pub/sub platforms
- Ideas for metrics, monitoring, and alerting integration
Thanks for reading! I share insightful Real World System Designs and DSA patterns every week. Follow for high-quality, actionable content that will level up your skills!
Follow me for more such posts: https://www.linkedin.com/in/gauravsingh9356/