Saturday, May 24, 2025

Building a Distributed Job Scheduler: End2End Design


In today’s software systems, many applications need to execute background tasks—at scale, reliably, and at the right time. This blog post covers a comprehensive system design for a Distributed Job Scheduler, diving deep into functional/non-functional requirements, high-level design, storage choices, idempotency, queue processing, and integration with tools like AWS SQS and Kafka.

🔧 Functional Requirements

  • Users can submit jobs/tasks to run on a particular schedule.
  • Support for manual triggering of jobs.
  • Jobs are Python scripts (support for other languages can be added later).
  • At-least-once execution of scheduled jobs.
  • Recurring jobs with enable/disable mechanisms.
  • Job sequencing support: Job1 -> Job2 -> Job3

📈 Non-Functional Requirements
  • High availability.
  • At-least-once execution of jobs.
  • Job runs must happen on or close to schedule (2-4s delay acceptable).
  • Durability: no job should be lost even if system crashes.
  • Notify users when a job is delayed.
  • Scalability to 1 billion (10⁹) jobs/day.

📊 Throughput & Storage Estimations
Throughput: 10⁹ jobs/day ÷ 86,400 s/day ≈ 1.2 × 10⁴ jobs/sec (order of 10⁴)
Storage: assuming 200 lines/job at 50 bytes/line (10 KB per script), total = 200 × 50 × 10⁹ bytes = 10¹³ bytes = 10 TB/day of job scripts
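
The estimates above are easy to re-run with different inputs. Here is a minimal sketch of the arithmetic; the class and method names are made up for illustration, and the 200 lines/job and 50 bytes/line figures are the same assumptions used in the estimate.

```java
final class CapacityEstimate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average jobs per second for a given daily volume.
    static double jobsPerSecond(long jobsPerDay) {
        return (double) jobsPerDay / SECONDS_PER_DAY;
    }

    // Bytes of script storage added per day.
    static long scriptBytesPerDay(long jobsPerDay, long linesPerJob, long bytesPerLine) {
        return jobsPerDay * linesPerJob * bytesPerLine;
    }

    public static void main(String[] args) {
        long jobsPerDay = 1_000_000_000L; // 10^9 jobs/day
        System.out.printf("%.0f jobs/sec%n", jobsPerSecond(jobsPerDay));
        long terabytes = scriptBytesPerDay(jobsPerDay, 200, 50) / 1_000_000_000_000L;
        System.out.printf("%d TB/day%n", terabytes);
    }
}
```

Note that this is an average rate. Cron-style schedules tend to cluster at the top of the minute or hour, so peak load can be well above the average.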

🏗️ High-Level Design
🔹 API Design
POST /api/v1/jobs

{
  "taskId": "someUniqueId",
  "cron": "*/5 * * * *",
  "params": {
    "p1": "val1",
    "p2": "val2"
  }
}

GET /api/v1/jobs/:jobId

Returns job status.
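
To make the two endpoints concrete, here is a rough in-memory sketch of what the handlers might do. It is not a real web layer. The class name, the server-generated jobId, the initial PENDING status, and the five-field cron check are all assumptions for illustration; the post does not specify them.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class JobApiSketch {
    // Stand-in for the jobs table: jobId -> status.
    private final Map<String, String> statusByJobId = new ConcurrentHashMap<>();

    // POST /api/v1/jobs: validate the payload, store the job, return its id.
    // params would be stored with the job; this sketch ignores them.
    String submit(String taskId, String cron, Map<String, String> params) {
        if (taskId == null || taskId.trim().isEmpty()) {
            throw new IllegalArgumentException("taskId is required");
        }
        // A standard cron expression has five whitespace-separated fields.
        if (cron == null || cron.trim().split("\\s+").length != 5) {
            throw new IllegalArgumentException("cron must have 5 fields");
        }
        String jobId = UUID.randomUUID().toString(); // assumed: server-generated id
        statusByJobId.put(jobId, "PENDING");
        return jobId;
    }

    // GET /api/v1/jobs/:jobId: return the status, or null if the job is unknown.
    String status(String jobId) {
        return statusByJobId.get(jobId);
    }
}
```

Rejecting a malformed cron expression at submission time keeps bad schedules out of the store, so the watcher never has to deal with them.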

🧠 Database Considerations
Since the system is write-heavy, choose NoSQL stores like DynamoDB, Cassandra, or MongoDB.

Writes:

  • Create job
  • Record job run
  • Update status

Reads:

  • Poll jobs due in the next minute (configurable)
  • Sort and enqueue

❌ Why not SQL?
  • No normalization is required.
  • ACID guarantees are not critical.

Sample Schema

{
  "time_bucket": "2025-05-24T10:00:00Z",
  "execution_time": "2025-05-24T10:00:05Z",
  "job_id": "job_abc_123",
  "user_id": "user_123",
  "status": "PENDING",
  "attempt": 0,
  "script_url": "s3://bucket-name/path-to-script.py"
}
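
In the sample record, time_bucket looks like execution_time rounded down to the minute. Assuming one-minute buckets (the post does not state the bucket size, but it matches the one-minute polling window), the bucket could be derived like this. The class and method names are illustrative.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

final class TimeBucket {
    // Truncate an execution time to the start of its one-minute bucket.
    static Instant bucketFor(Instant executionTime) {
        return executionTime.truncatedTo(ChronoUnit.MINUTES);
    }

    public static void main(String[] args) {
        Instant executionTime = Instant.parse("2025-05-24T10:00:05Z");
        System.out.println(bucketFor(executionTime)); // 2025-05-24T10:00:00Z
    }
}
```

Using the bucket as the partition key would let the watcher fetch everything due in a given minute with a single-partition query instead of a table scan.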

🔄 Scheduling & Execution Flow

  • WatcherService polls the DB every minute (via a CronJob or a periodic ECS task).
  • Picks the jobs due in the next 1-minute window.
  • Pushes them to a FIFO SQS queue.
  • ExecutorService pulls messages from SQS.
  • Spawns an isolated ECS container to run each job securely.
  • Performs an idempotency check.
  • Executes the job and updates the DB.
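
The watcher's "pick and sort" step from the flow above can be sketched in memory as follows. This is only an illustration: JobRun and dueWithin are hypothetical names, the list stands in for a DB query, and jobs that are already overdue are deliberately left out of this sketch.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class WatcherSketch {
    static final class JobRun {
        final String jobId;
        final Instant executionTime;

        JobRun(String jobId, Instant executionTime) {
            this.jobId = jobId;
            this.executionTime = executionTime;
        }
    }

    // Select runs with now <= executionTime < now + window, earliest first.
    static List<JobRun> dueWithin(List<JobRun> pending, Instant now, Duration window) {
        Instant end = now.plus(window);
        List<JobRun> due = new ArrayList<>();
        for (JobRun run : pending) {
            if (!run.executionTime.isBefore(now) && run.executionTime.isBefore(end)) {
                due.add(run);
            }
        }
        // Sort so the queue receives jobs in execution order.
        due.sort(Comparator.comparing((JobRun r) -> r.executionTime));
        return due;
    }
}
```

The half-open window means a job scheduled exactly on a window boundary is picked up by one poll only, not by two consecutive polls.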

🧾 Sequence Diagram – LLD

✅ Handling Duplicate Events (Idempotency)
Problem:
If the ExecutorService crashes mid-execution or fails before DeleteMessage, SQS redelivers the message.

Solution:
Implement idempotency by checking if jobRunId is already processed before executing:

if (jobRunId already in DB with status SUCCESS) {
   // Skip processing
} else {
   // Execute and update DB
}
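
As a runnable illustration of the check above, here is a minimal in-memory sketch. The ConcurrentHashMap stands in for the job-run table, and IdempotentExecutor and runIfNotDone are hypothetical names, not part of any library.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class IdempotentExecutor {
    // Stand-in for the job-run table: jobRunId -> status.
    private final Map<String, String> statusByRunId = new ConcurrentHashMap<>();

    // Returns true if the job body ran, false if the run was already SUCCESS.
    boolean runIfNotDone(String jobRunId, Runnable job) {
        if ("SUCCESS".equals(statusByRunId.get(jobRunId))) {
            return false; // duplicate delivery: skip
        }
        // If job.run() throws, no status is recorded, so a redelivery retries.
        job.run();
        statusByRunId.put(jobRunId, "SUCCESS");
        return true;
    }
}
```

This check-then-act pattern still lets two executors that receive the same message at nearly the same time both pass the check. That is consistent with at-least-once semantics, but a real store would likely use a conditional write to narrow the window.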

📬 FIFO SQS Acknowledgement Workflow

  • ReceiveMessage: Executor polls for messages.
  • Visibility Timeout starts.
  • If successfully processed → DeleteMessage() → removes from queue.
  • If not deleted within the timeout → the message becomes visible again and is redelivered.

The consumer must explicitly delete the message after processing it. This applies to both standard and FIFO SQS queues.
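
To see why a missed delete leads to redelivery, here is a toy in-memory model of the visibility timeout. It is not the SQS API: the class is hypothetical, time is passed in explicitly to keep it deterministic, and messages are deleted by body where real SQS uses a receipt handle.

```java
import java.util.ArrayList;
import java.util.List;

final class VisibilityQueueSketch {
    private static final class Entry {
        final String body;
        long invisibleUntilMillis; // 0 means visible now

        Entry(String body) {
            this.body = body;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final long visibilityTimeoutMillis;

    VisibilityQueueSketch(long visibilityTimeoutMillis) {
        this.visibilityTimeoutMillis = visibilityTimeoutMillis;
    }

    void send(String body) {
        entries.add(new Entry(body));
    }

    // Return the first visible message and hide it for the timeout, or null.
    String receive(long nowMillis) {
        for (Entry e : entries) {
            if (e.invisibleUntilMillis <= nowMillis) {
                e.invisibleUntilMillis = nowMillis + visibilityTimeoutMillis;
                return e.body;
            }
        }
        return null;
    }

    // Acknowledge: remove the message so it is never redelivered.
    boolean delete(String body) {
        return entries.removeIf(e -> e.body.equals(body));
    }
}
```

With a 30-second timeout, a message received at t=0 is hidden from a second receive at t=10s, reappears at t=30s if it was not deleted, and is gone for good once delete is called.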

🧪 Java Boilerplate (SQS FIFO Consumer)

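import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import java.util.List;

// AWS SDK for Java v1; the queue URL below is a placeholder.
AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
String queueUrl = "https://sqs.us-east-1.amazonaws.com/your-queue-url";

ReceiveMessageRequest request = new ReceiveMessageRequest()
    .withQueueUrl(queueUrl)
    .withWaitTimeSeconds(10)       // long polling: wait up to 10s for a message
    .withMaxNumberOfMessages(1);

while (true) {
    List<Message> messages = sqs.receiveMessage(request).getMessages();
    for (Message msg : messages) {
        try {
            String body = msg.getBody();
            // Deserialize, run the idempotency check, then execute the job

            // Delete only after successful execution
            sqs.deleteMessage(queueUrl, msg.getReceiptHandle());
        } catch (Exception e) {
            // Log the failure and do not delete: the message becomes visible
            // again after the visibility timeout and is redelivered.
        }
    }
}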

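🔁 How Does Kafka Handle This?
Kafka is different from SQS:

  • Kafka does not delete a message when it is consumed; messages stay in the log until the topic's retention policy removes them.
  • Consumers track their position by committing offsets. For at-least-once processing, disable auto-commit (enable.auto.commit=false) and commit manually after the work is done.
  • To prevent duplication, maintain idempotency checks (e.g., jobRunId).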

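// Assumes enable.auto.commit=false so offsets are committed only after processing.
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        // record.key() is assumed to carry the jobRunId
        if (!jobAlreadyProcessed(record.key())) {
            process(record.value());
            markJobComplete(record.key());
        }
    }
    // Commit after the batch so a crash causes redelivery, not loss
    consumer.commitSync();
}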

🧠 What is Idempotency?
A process is idempotent if running it multiple times has the same effect as running it once.

Example in Job Scheduler:

If a jobRunId is already marked as completed, skip execution.
This ensures safe reprocessing in case of retries.

📈 Metrics & Monitoring (Enhancement Ideas)
Execution Metrics:

  • Jobs submitted/executed per minute/hour/day
  • Average execution time
  • Error/Success ratio

Infrastructure Health:

  • CPU/Memory usage for Executor nodes
  • Queue size (SQS/Kafka lag)
  • DB read/write latency

Alerts & Dashboards:

  • Set up alarms for job delays, failures, or high retry count
  • Use tools like Prometheus + Grafana or AWS CloudWatch

Dead Letter Queues (DLQ):

  • For jobs that fail after multiple retries
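
The retry-then-DLQ decision can be sketched as a simple counter. With SQS this is normally handled by the queue itself through a redrive policy (maxReceiveCount), so the hypothetical class below only illustrates the logic; the in-memory queues and names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class DlqRouterSketch {
    final Deque<String> retryQueue = new ArrayDeque<>();
    final Deque<String> deadLetterQueue = new ArrayDeque<>();
    private final Map<String, Integer> failures = new HashMap<>();
    private final int maxAttempts;

    DlqRouterSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Record a failed attempt; retry until maxAttempts is reached, then park in the DLQ.
    void onFailure(String jobRunId) {
        int count = failures.merge(jobRunId, 1, Integer::sum);
        if (count >= maxAttempts) {
            deadLetterQueue.add(jobRunId);
        } else {
            retryQueue.add(jobRunId);
        }
    }
}
```

With maxAttempts set to 3, the first two failures of a run go back to the retry queue and the third moves it to the DLQ, where it can be inspected without blocking other jobs.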

Audit Logs:

  • Log job submissions, execution attempts, and results
  • Store in Elasticsearch or S3 for analysis

📌 Final Thoughts
This blog outlines the complete design of a scalable, reliable distributed job scheduler with real-world production-grade strategies:

  • Durable job ingestion
  • Idempotent execution
  • FIFO queue management (SQS)
  • Resilience with AWS services
  • Extendable to Kafka or other Pub/Sub platforms
  • Ideas for metrics, monitoring, and alerting integration

Thanks for reading! I share insightful Real World System Designs and DSA patterns every week. Follow for high-quality, actionable content that will level up your skills!

Follow me for more such posts: https://www.linkedin.com/in/gauravsingh9356/
