Celery Task Monitoring: The Complete Guide
Celery powers background jobs for thousands of Python applications—processing payments, sending emails, generating reports, syncing data. But here's the uncomfortable truth: most Celery deployments have no idea when tasks silently fail.
Your worker crashed mid-execution. Your Beat scheduler died at 3 AM. A task that should run every hour hasn't run in six hours. Without proper monitoring, you won't know until customers complain—or worse, until you discover corrupted data weeks later.
This guide covers everything you need to monitor Celery in production: from basic visibility with Flower to comprehensive alerting with Prometheus, and solving the hardest problem of all—detecting when scheduled tasks don't run at all.
Why Celery monitoring is harder than it looks
Celery's architecture creates monitoring blind spots that catch teams off guard.
The silent failure problem
Celery's task_failure signal fires when a task raises an exception. But it doesn't fire when:
- A worker gets OOM-killed (SIGKILL)
- The hard time limit is exceeded
- The worker crashes during execution
- The result backend fails to store the result
- A task is lost in transit between broker and worker
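For context, hooking this signal looks like the sketch below (a minimal example; notify_oncall is a placeholder for whatever alerting you already use). None of the scenarios in the list above ever reach this handler.

```python
from celery.signals import task_failure

@task_failure.connect
def alert_on_task_failure(sender=None, task_id=None, exception=None, **kwargs):
    # Only fires when the task body raises an exception inside a live worker.
    # OOM kills, hard time limits, and lost messages never trigger this handler.
    notify_oncall(f"Task {sender.name} [{task_id}] failed: {exception!r}")  # placeholder helper
```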
By default, Celery acknowledges failed tasks and moves on. No notification, no retry, no trace:
```python
# Default behavior - failed tasks disappear silently
task_acks_on_failure_or_timeout = True  # This is the default!
```

The Beat scheduler blind spot
Celery Beat is a single process. If it dies, every scheduled task in your application stops running. The problem? Nothing in Celery's ecosystem tells you this happened.
Flower, the de facto standard monitoring tool, cannot monitor Beat at all. Your workers look healthy, your queues look empty, and meanwhile your hourly data sync hasn't run in 18 hours.
What you actually need to monitor
A production Celery deployment requires visibility into four layers:
- Task execution: Did individual tasks succeed or fail?
- Worker health: Are workers alive and processing?
- Queue health: Are tasks backing up? Is the broker healthy?
- Schedule execution: Did scheduled tasks actually run when they should have?
No single tool covers all four. Let's build a monitoring stack that does.
Setting up Flower for real-time visibility
Flower is the starting point for any Celery monitoring setup. It provides a real-time web UI showing task progress, worker status, and queue depths.
Installation and basic setup
```bash
pip install flower
```

Start Flower alongside your Celery workers:

```bash
# Basic startup
celery -A your_project flower

# With authentication (required for production)
celery -A your_project flower \
  --basic_auth=admin:your-secure-password \
  --port=5555
```

For Docker deployments:
```yaml
# docker-compose.yml
services:
  flower:
    image: mher/flower
    command: celery --broker=redis://redis:6379/0 flower
    ports:
      - "5555:5555"
    environment:
      - FLOWER_BASIC_AUTH=admin:password
    depends_on:
      - redis
```

Enabling Celery events
Flower relies on Celery's event system. Enable it in your configuration:
```python
# celery.py
app.conf.update(
    worker_send_task_events=True,
    task_send_sent_event=True,  # Track queue wait time
    task_track_started=True,    # Know when tasks actually start
)
```

Without task_send_sent_event, you can't measure how long tasks wait in the queue before a worker picks them up—one of the most important production metrics.
What Flower can and cannot do
Flower excels at:
- Real-time task visibility (in-progress, succeeded, failed)
- Worker management (restart, shutdown, pool scaling)
- Queue depth monitoring
- Basic rate limiting
Flower's limitations:
- No alerting—you must watch the dashboard
- Data stored in RAM—history lost on restart
- Cannot monitor Celery Beat
- Misses tasks that fail while Flower is down
Flower is essential for debugging and development. For production alerting, you need more.
Prometheus metrics for production alerting
Prometheus + Grafana gives you historical metrics, dashboards, and alerting. Two approaches exist for exporting Celery metrics.
Option 1: Flower's built-in Prometheus endpoint
Flower exposes metrics at /metrics when started with the flag:
```bash
celery -A your_project flower --prometheus_metrics
```

Option 2: Dedicated Celery exporter (recommended)
The danihodovic/celery-exporter provides more comprehensive metrics and runs independently from Flower:
```yaml
# docker-compose.yml
services:
  celery-exporter:
    image: danihodovic/celery-exporter
    environment:
      - CE_BROKER_URL=redis://redis:6379/0
    ports:
      - "9808:9808"
```

Key metrics to track
Configure alerts for these critical metrics:
```yaml
# prometheus/alerts.yml
groups:
  - name: celery
    rules:
      # Alert when any task fails
      - alert: CeleryTaskFailed
        expr: increase(celery_task_failed_total[5m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Celery task {{ $labels.name }} failed"

      # Alert when no workers are online
      - alert: CeleryNoWorkers
        expr: celery_workers == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No Celery workers are running"

      # Alert when queue is backing up
      - alert: CeleryQueueBacklog
        expr: celery_queue_length > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Celery queue {{ $labels.queue }} has {{ $value }} pending tasks"

      # Alert on low success rate
      - alert: CeleryLowSuccessRate
        expr: |
          (
            sum(rate(celery_task_succeeded_total[5m]))
            /
            sum(rate(celery_task_succeeded_total[5m]) + rate(celery_task_failed_total[5m]))
          ) < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Celery task success rate is {{ $value | humanizePercentage }}"
```

The most important metric you're not tracking
Queue wait time—the duration between a task being sent and a worker starting it—is the best indicator of capacity problems:
```
# Requires task_send_sent_event=True in config
# Metric: celery_task_queue_time_seconds
```

If queue wait time increases, you need more workers or faster task execution. This metric catches problems before they become outages.
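If your exporter doesn't surface queue wait time, you can approximate it in-process. The sketch below (an assumption-laden example, not a CronRadar or Celery built-in) stamps each message with an enqueue timestamp and measures the gap when a worker picks it up; exporting the resulting value to your metrics backend is left to you.

```python
import time

from celery.signals import before_task_publish, task_prerun

@before_task_publish.connect
def stamp_enqueue_time(headers=None, **kwargs):
    # Custom headers travel with the message (task protocol v2)
    headers['enqueued_at'] = time.time()

@task_prerun.connect
def record_queue_wait(task=None, **kwargs):
    enqueued_at = task.request.get('enqueued_at')
    if enqueued_at:
        wait_seconds = time.time() - enqueued_at
        # Hand wait_seconds to your metrics client here (histogram, StatsD, logs)
```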
Catching errors with Sentry integration
Prometheus tells you how many tasks failed. Sentry tells you why.
Basic setup
```bash
pip install sentry-sdk
```

```python
# celery.py
import sentry_sdk
from sentry_sdk.integrations.celery import CeleryIntegration

sentry_sdk.init(
    dsn="https://your-sentry-dsn",
    integrations=[CeleryIntegration()],
    traces_sample_rate=0.1,  # Adjust based on volume
)
```

This automatically captures:
- Full stack traces with local variables
- Task arguments and metadata
- Distributed traces linking tasks to triggering code
Sentry Crons for Beat monitoring
Sentry's Crons feature can auto-discover Celery Beat tasks:
```python
from sentry_sdk.crons import monitor

@app.task
@monitor(monitor_slug='daily-report')
def generate_daily_report():
    # Your task logic
    pass
```

This alerts you when scheduled tasks don't run on time—but requires Sentry's paid tier and tight coupling to their platform.
Detecting when scheduled tasks don't run
Here's the problem none of the tools above fully solve: how do you know when a task that should run every hour hasn't run at all?
Flower shows tasks that executed. Prometheus counts tasks that ran. Sentry captures errors from tasks that failed. But if Beat dies, or a task gets lost before reaching a worker, these tools show... nothing. An empty dashboard. No alerts.
This is where dead man's switch monitoring—also called heartbeat monitoring—becomes essential.
How dead man's switch monitoring works
Instead of monitoring for failures, you monitor for the absence of success:
- Configure a monitor expecting a ping every hour
- Your task sends a ping when it completes successfully
- If the ping doesn't arrive on schedule, you get alerted
The monitor doesn't care why the task didn't run. Beat crashed? Worker died? Task lost in queue? Network partition? You get alerted regardless.
cronradar">Implementing heartbeat monitoring with CronRadar
CronRadar provides dead man's switch monitoring designed for scheduled tasks. Here's how to integrate it with Celery:
```python
# tasks.py
import requests
from celery import Celery

app = Celery('tasks')

CRONRADAR_MONITORS = {
    'tasks.daily_backup': 'https://cronradar.io/ping/abc123',
    'tasks.sync_inventory': 'https://cronradar.io/ping/def456',
    'tasks.generate_reports': 'https://cronradar.io/ping/ghi789',
}

@app.task
def daily_backup():
    monitor_url = CRONRADAR_MONITORS.get('tasks.daily_backup')

    # Signal task started
    requests.get(f"{monitor_url}/start", timeout=5)

    try:
        # Your backup logic here
        perform_backup()
        # Signal success
        requests.get(monitor_url, timeout=5)
    except Exception:
        # Signal failure
        requests.get(f"{monitor_url}/fail", timeout=5)
        raise
```

Automatic monitoring with Celery signals
For cleaner code, use Celery signals to handle monitoring automatically:
```python
# monitoring.py
import requests
from celery.signals import task_prerun, task_success, task_failure

CRONRADAR_MONITORS = {
    'tasks.daily_backup': 'https://cronradar.io/ping/abc123',
    'tasks.sync_inventory': 'https://cronradar.io/ping/def456',
}

def get_monitor_url(task_name):
    return CRONRADAR_MONITORS.get(task_name)

@task_prerun.connect
def on_task_start(sender=None, **kwargs):
    url = get_monitor_url(sender.name)
    if url:
        try:
            requests.get(f"{url}/start", timeout=5)
        except requests.RequestException:
            pass  # Don't fail the task if the monitoring ping fails

@task_success.connect
def on_task_success(sender=None, **kwargs):
    url = get_monitor_url(sender.name)
    if url:
        try:
            requests.get(url, timeout=5)
        except requests.RequestException:
            pass

@task_failure.connect
def on_task_failure(sender=None, **kwargs):
    url = get_monitor_url(sender.name)
    if url:
        try:
            requests.get(f"{url}/fail", timeout=5)
        except requests.RequestException:
            pass
```

Now any task registered in CRONRADAR_MONITORS is automatically tracked without modifying the task code.
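One gotcha worth noting: signal handlers only connect if the module defining them is imported when the worker starts. A simple way to guarantee that (assuming the handlers live in monitoring.py on your import path) is to import the module from your Celery app module:

```python
# celery.py
import monitoring  # noqa: F401  -- importing the module registers the signal handlers
```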
Using the CronRadar Python SDK
For a cleaner integration, use the CronRadar SDK:
```bash
pip install cronradar
```

```python
# celery.py
from celery import Celery

from cronradar.celery import CronRadarMiddleware

app = Celery('tasks')

# Register the middleware
app.steps['worker'].add(CronRadarMiddleware)

# Configure monitors
app.conf.cronradar_monitors = {
    'tasks.daily_backup': 'abc123',
    'tasks.sync_inventory': 'def456',
}
```

The SDK handles retries, timeouts, and edge cases automatically.
Docker and Kubernetes health checks
Container orchestrators need to know when workers are unhealthy. The standard celery inspect ping command doesn't work well—it blocks during long-running tasks.
Docker health check
```dockerfile
# Dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD celery -A your_project inspect ping -d celery@$HOSTNAME || exit 1
```

Kubernetes with liveness probes
For Kubernetes, implement a file-based health check using Celery signals:
```python
# health.py
import time
from pathlib import Path

from celery.signals import worker_ready, heartbeat_sent

HEARTBEAT_FILE = Path('/tmp/celery_heartbeat')

@worker_ready.connect
def on_worker_ready(**kwargs):
    HEARTBEAT_FILE.touch()

@heartbeat_sent.connect
def on_heartbeat(**kwargs):
    HEARTBEAT_FILE.touch()

def is_worker_healthy(max_age_seconds=60):
    """Check if the heartbeat file was updated recently."""
    if not HEARTBEAT_FILE.exists():
        return False
    age = time.time() - HEARTBEAT_FILE.stat().st_mtime
    return age < max_age_seconds
```

```yaml
# kubernetes deployment
livenessProbe:
  exec:
    command:
      - python
      - -c
      - "from health import is_worker_healthy; exit(0 if is_worker_healthy() else 1)"
  initialDelaySeconds: 30
  periodSeconds: 30
```

Production configuration checklist
Before deploying, ensure your Celery configuration includes these monitoring essentials:
```python
# celery.py
app.conf.update(
    # === Event Configuration ===
    worker_send_task_events=True,
    task_send_sent_event=True,
    task_track_started=True,

    # === Reliability ===
    task_acks_late=True,
    task_reject_on_worker_lost=True,

    # === Timeouts (always set these!) ===
    task_time_limit=3600,       # Hard limit: 1 hour
    task_soft_time_limit=3300,  # Soft limit: 55 minutes

    # === Result Backend ===
    result_extended=True,  # Store task args in result
    task_store_errors_even_if_ignored=True,

    # === Worker Stability ===
    worker_max_tasks_per_child=1000,  # Prevent memory leaks
    worker_prefetch_multiplier=4,     # Balance throughput/latency
)
```

The complete monitoring stack
For production Celery, deploy all four monitoring layers:
| Layer | Tool | Purpose |
|---|---|---|
| Real-time visibility | Flower | Debugging, task inspection |
| Metrics & alerting | Prometheus + Grafana | Historical trends, threshold alerts |
| Error tracking | Sentry | Stack traces, root cause analysis |
| Schedule monitoring | CronRadar | Detect when tasks don't run |
What to do next
Start with Flower to understand what's happening in your Celery deployment. Add Prometheus for alerting on failures and queue backlogs. Integrate Sentry to capture error details.
Then ask yourself: would I know if my hourly backup task stopped running? If your Beat scheduler died right now, how long until you'd notice?
If you don't have a good answer, set up dead man's switch monitoring for your critical scheduled tasks. CronRadar's free tier monitors up to 20 jobs—enough to cover most applications' essential background tasks.
Your Celery workers are doing important work. Make sure you know when they stop.