Celery Task Monitoring: The Complete Guide
Celery powers background jobs for thousands of Python applications—processing payments, sending emails, generating reports, syncing data. But here's the uncomfortable truth: most Celery deployments have no idea when tasks silently fail.
Your worker crashed mid-execution. Your Beat scheduler died at 3 AM. A task that should run every hour hasn't run in six hours. Without proper monitoring, you won't know until customers complain—or worse, until you discover corrupted data weeks later.
This guide covers everything you need to monitor Celery in production: from basic visibility with Flower to comprehensive alerting with Prometheus, and solving the hardest problem of all—detecting when scheduled tasks don't run at all.
Why Celery monitoring is harder than it looks
Celery's architecture creates monitoring blind spots that catch teams off guard.
The silent failure problem
Celery's task_failure signal fires when a task raises an exception. But it doesn't fire when:
- A worker gets OOM-killed (SIGKILL)
- The hard time limit is exceeded
- The worker crashes during execution
- The result backend fails to store the result
- A task is lost in transit between broker and worker
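For context, hooking this signal looks like the sketch below (a minimal example; notify_oncall is a placeholder for whatever alerting you already use). None of the scenarios in the list above ever reach this handler.

```python
from celery.signals import task_failure

@task_failure.connect
def alert_on_task_failure(sender=None, task_id=None, exception=None, **kwargs):
    # Only fires when the task body raises an exception inside a live worker.
    # OOM kills, hard time limits, and lost messages never trigger this handler.
    notify_oncall(f"Task {sender.name} [{task_id}] failed: {exception!r}")  # placeholder helper
```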
By default, Celery acknowledges failed tasks and moves on. No notification, no retry, no trace:
```python
# Default behavior - failed tasks disappear silently
task_acks_on_failure_or_timeout = True  # This is the default!
```

The Beat scheduler blind spot
Celery Beat is a single process. If it dies, every scheduled task in your application stops running. The problem? Nothing in Celery's ecosystem tells you this happened.
Flower, the de facto standard monitoring tool, cannot monitor Beat at all. Your workers look healthy, your queues look empty, and meanwhile your hourly data sync hasn't run in 18 hours.
What you actually need to monitor
A production Celery deployment requires visibility into four layers:
- Task execution: Did individual tasks succeed or fail?
- Worker health: Are workers alive and processing?
- Queue health: Are tasks backing up? Is the broker healthy?
- Schedule execution: Did scheduled tasks actually run when they should have?
No single tool covers all four. Let's build a monitoring stack that does.
Setting up Flower for real-time visibility
Flower is the starting point for any Celery monitoring setup. It provides a real-time web UI showing task progress, worker status, and queue depths.
Installation and basic setup
```bash
pip install flower
```

Start Flower alongside your Celery workers:

```bash
# Basic startup
celery -A your_project flower

# With authentication (required for production)
celery -A your_project flower \
  --basic_auth=admin:your-secure-password \
  --port=5555
```

For Docker deployments:
```yaml
# docker-compose.yml
services:
  flower:
    image: mher/flower
    command: celery --broker=redis://redis:6379/0 flower
    ports:
      - "5555:5555"
    environment:
      - FLOWER_BASIC_AUTH=admin:password
    depends_on:
      - redis
```

Enabling Celery events
Flower relies on Celery's event system. Enable it in your configuration:
```python
# celery.py
app.conf.update(
    worker_send_task_events=True,
    task_send_sent_event=True,  # Track queue wait time
    task_track_started=True,    # Know when tasks actually start
)
```

Without task_send_sent_event, you can't measure how long tasks wait in the queue before a worker picks them up—one of the most important production metrics.
What Flower can and cannot do
Flower excels at:
- Real-time task visibility (in-progress, succeeded, failed)
- Worker management (restart, shutdown, pool scaling)
- Queue depth monitoring
- Basic rate limiting
Flower's limitations:
- No alerting—you must watch the dashboard
- Data stored in RAM—history lost on restart
- Cannot monitor Celery Beat
- Misses tasks that fail while Flower is down
Flower is essential for debugging and development. For production alerting, you need more.
Prometheus metrics for production alerting
Prometheus + Grafana gives you historical metrics, dashboards, and alerting. Two approaches exist for exporting Celery metrics.
Option 1: Flower's built-in Prometheus endpoint
Flower exposes metrics at /metrics when started with the flag:
```bash
celery -A your_project flower --prometheus_metrics
```

Option 2: Dedicated Celery exporter (recommended)
The danihodovic/celery-exporter provides more comprehensive metrics and runs independently from Flower:
```yaml
# docker-compose.yml
services:
  celery-exporter:
    image: danihodovic/celery-exporter
    environment:
      - CE_BROKER_URL=redis://redis:6379/0
    ports:
      - "9808:9808"
```

Key metrics to track
Configure alerts for these critical metrics:
```yaml
# prometheus/alerts.yml
groups:
  - name: celery
    rules:
      # Alert when any task fails
      - alert: CeleryTaskFailed
        expr: increase(celery_task_failed_total[5m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Celery task {{ $labels.name }} failed"

      # Alert when no workers are online
      - alert: CeleryNoWorkers
        expr: celery_workers == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No Celery workers are running"

      # Alert when queue is backing up
      - alert: CeleryQueueBacklog
        expr: celery_queue_length > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Celery queue {{ $labels.queue }} has {{ $value }} pending tasks"

      # Alert on low success rate
      - alert: CeleryLowSuccessRate
        expr: |
          (
            sum(rate(celery_task_succeeded_total[5m]))
            /
            sum(rate(celery_task_succeeded_total[5m]) + rate(celery_task_failed_total[5m]))
          ) < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Celery task success rate is {{ $value | humanizePercentage }}"
```

The most important metric you're not tracking
Queue wait time—the duration between a task being sent and a worker starting it—is the best indicator of capacity problems:
```
# Requires task_send_sent_event=True in config
# Metric: celery_task_queue_time_seconds
```

If queue wait time increases, you need more workers or faster task execution. This metric catches problems before they become outages.
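If your exporter doesn't surface queue wait time, you can approximate it in-process. The sketch below (an assumption-laden example, not a CronRadar or Celery built-in) stamps each message with an enqueue timestamp and measures the gap when a worker picks it up; exporting the resulting value to your metrics backend is left to you.

```python
import time

from celery.signals import before_task_publish, task_prerun

@before_task_publish.connect
def stamp_enqueue_time(headers=None, **kwargs):
    # Custom headers travel with the message (task protocol v2)
    headers['enqueued_at'] = time.time()

@task_prerun.connect
def record_queue_wait(task=None, **kwargs):
    enqueued_at = task.request.get('enqueued_at')
    if enqueued_at:
        wait_seconds = time.time() - enqueued_at
        # Hand wait_seconds to your metrics client here (histogram, StatsD, logs)
```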
Catching errors with Sentry integration
Prometheus tells you how many tasks failed. Sentry tells you why.
Basic setup
```bash
pip install sentry-sdk
```

```python
# celery.py
import sentry_sdk
from sentry_sdk.integrations.celery import CeleryIntegration

sentry_sdk.init(
    dsn="https://your-sentry-dsn",
    integrations=[CeleryIntegration()],
    traces_sample_rate=0.1,  # Adjust based on volume
)
```

This automatically captures:
- Full stack traces with local variables
- Task arguments and metadata
- Distributed traces linking tasks to triggering code
Sentry Crons for Beat monitoring
Sentry's Crons feature can auto-discover Celery Beat tasks:
```python
from sentry_sdk.crons import monitor

@app.task
@monitor(monitor_slug='daily-report')
def generate_daily_report():
    # Your task logic
    pass
```

This alerts you when scheduled tasks don't run on time—but requires Sentry's paid tier and tight coupling to their platform.
Detecting when scheduled tasks don't run
Here's the problem none of the tools above fully solve: how do you know when a task that should run every hour hasn't run at all?
Flower shows tasks that executed. Prometheus counts tasks that ran. Sentry captures errors from tasks that failed. But if Beat dies, or a task gets lost before reaching a worker, these tools show... nothing. An empty dashboard. No alerts.
This is where dead man's switch monitoring—also called heartbeat monitoring—becomes essential.
How dead man's switch monitoring works
Instead of monitoring for failures, you monitor for the absence of success:
- Configure a monitor expecting a ping every hour
- Your task sends a ping when it completes successfully
- If the ping doesn't arrive on schedule, you get alerted
The monitor doesn't care why the task didn't run. Beat crashed? Worker died? Task lost in queue? Network partition? You get alerted regardless.
cronradar">Implementing heartbeat monitoring with CronRadar
CronRadar provides dead man's switch monitoring designed for scheduled tasks. Here's how to integrate it with Celery:
```python
# tasks.py
import requests
from celery import Celery

app = Celery('tasks')

CRONRADAR_MONITORS = {
    'tasks.daily_backup': 'https://cronradar.io/ping/abc123',
    'tasks.sync_inventory': 'https://cronradar.io/ping/def456',
    'tasks.generate_reports': 'https://cronradar.io/ping/ghi789',
}

@app.task
def daily_backup():
    monitor_url = CRONRADAR_MONITORS.get('tasks.daily_backup')

    # Signal task started
    requests.get(f"{monitor_url}/start", timeout=5)

    try:
        # Your backup logic here
        perform_backup()
        # Signal success
        requests.get(monitor_url, timeout=5)
    except Exception:
        # Signal failure
        requests.get(f"{monitor_url}/fail", timeout=5)
        raise
```

Automatic monitoring with Celery signals
For cleaner code, use Celery signals to handle monitoring automatically:
```python
# monitoring.py
import requests
from celery.signals import task_prerun, task_success, task_failure

CRONRADAR_MONITORS = {
    'tasks.daily_backup': 'https://cronradar.io/ping/abc123',
    'tasks.sync_inventory': 'https://cronradar.io/ping/def456',
}

def get_monitor_url(task_name):
    return CRONRADAR_MONITORS.get(task_name)

@task_prerun.connect
def on_task_start(sender=None, **kwargs):
    url = get_monitor_url(sender.name)
    if url:
        try:
            requests.get(f"{url}/start", timeout=5)
        except requests.RequestException:
            pass  # Don't fail the task if the monitoring ping fails

@task_success.connect
def on_task_success(sender=None, **kwargs):
    url = get_monitor_url(sender.name)
    if url:
        try:
            requests.get(url, timeout=5)
        except requests.RequestException:
            pass

@task_failure.connect
def on_task_failure(sender=None, **kwargs):
    url = get_monitor_url(sender.name)
    if url:
        try:
            requests.get(f"{url}/fail", timeout=5)
        except requests.RequestException:
            pass
```

Now any task registered in CRONRADAR_MONITORS is automatically tracked without modifying the task code.
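One gotcha worth noting: signal handlers only connect if the module defining them is imported when the worker starts. A simple way to guarantee that (assuming the handlers live in monitoring.py on your import path) is to import the module from your Celery app module:

```python
# celery.py
import monitoring  # noqa: F401  -- importing the module registers the signal handlers
```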
Using the CronRadar Python SDK
For a cleaner integration, use the CronRadar SDK:
```bash
pip install cronradar
```

```python
# celery.py
from celery import Celery

from cronradar.celery import CronRadarMiddleware

app = Celery('tasks')

# Register the middleware
app.steps['worker'].add(CronRadarMiddleware)

# Configure monitors
app.conf.cronradar_monitors = {
    'tasks.daily_backup': 'abc123',
    'tasks.sync_inventory': 'def456',
}
```

The SDK handles retries, timeouts, and edge cases automatically.
Docker and Kubernetes health checks
Container orchestrators need to know when workers are unhealthy. The standard celery inspect ping command doesn't work well—it blocks during long-running tasks.
Docker health check
```dockerfile
# Dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD celery -A your_project inspect ping -d celery@$HOSTNAME || exit 1
```

Kubernetes with liveness probes
For Kubernetes, implement a file-based health check using Celery signals:
```python
# health.py
import time
from pathlib import Path

from celery.signals import worker_ready, heartbeat_sent

HEARTBEAT_FILE = Path('/tmp/celery_heartbeat')

@worker_ready.connect
def on_worker_ready(**kwargs):
    HEARTBEAT_FILE.touch()

@heartbeat_sent.connect
def on_heartbeat(**kwargs):
    HEARTBEAT_FILE.touch()

def is_worker_healthy(max_age_seconds=60):
    """Check if the heartbeat file was updated recently."""
    if not HEARTBEAT_FILE.exists():
        return False
    age = time.time() - HEARTBEAT_FILE.stat().st_mtime
    return age < max_age_seconds
```

```yaml
# kubernetes deployment
livenessProbe:
  exec:
    command:
      - python
      - -c
      - "from health import is_worker_healthy; exit(0 if is_worker_healthy() else 1)"
  initialDelaySeconds: 30
  periodSeconds: 30
```

Production configuration checklist
Before deploying, ensure your Celery configuration includes these monitoring essentials:
```python
# celery.py
app.conf.update(
    # === Event Configuration ===
    worker_send_task_events=True,
    task_send_sent_event=True,
    task_track_started=True,

    # === Reliability ===
    task_acks_late=True,
    task_reject_on_worker_lost=True,

    # === Timeouts (always set these!) ===
    task_time_limit=3600,       # Hard limit: 1 hour
    task_soft_time_limit=3300,  # Soft limit: 55 minutes

    # === Result Backend ===
    result_extended=True,  # Store task args in result
    task_store_errors_even_if_ignored=True,

    # === Worker Stability ===
    worker_max_tasks_per_child=1000,  # Prevent memory leaks
    worker_prefetch_multiplier=4,     # Balance throughput/latency
)
```

The complete monitoring stack
For production Celery, deploy all four monitoring layers:
| Layer | Tool | Purpose |
|---|---|---|
| Real-time visibility | Flower | Debugging, task inspection |
| Metrics & alerting | Prometheus + Grafana | Historical trends, threshold alerts |
| Error tracking | Sentry | Stack traces, root cause analysis |
| Schedule monitoring | CronRadar | Detect when tasks don't run |
What to do next
Start with Flower to understand what's happening in your Celery deployment. Add Prometheus for alerting on failures and queue backlogs. Integrate Sentry to capture error details.
Then ask yourself: would I know if my hourly backup task stopped running? If your Beat scheduler died right now, how long until you'd notice?
If you don't have a good answer, set up dead man's switch monitoring for your critical scheduled tasks. CronRadar's free tier monitors up to 20 jobs—enough to cover most applications' essential background tasks.
Your Celery workers are doing important work. Make sure you know when they stop.