Kubernetes CronJob Monitoring: From Setup to Alerting
A CronJob fails silently for 24 days. No alerts fire. No one notices. That's what happened to the Kubernetes SIG-Network bot in 2020: when engineers finally investigated, they discovered it had hit the infamous "100 missed schedules" threshold and permanently stopped running. The CronJob wasn't broken—it had simply given up, and Kubernetes doesn't page anyone when that happens.
This scenario plays out across organizations daily. Unlike a crashing deployment that triggers immediate alerts, a CronJob that never runs produces no signal. There's simply an absence of activity. If your nightly database backup silently stops, you won't know until you need that backup.
This guide covers everything you need to monitor Kubernetes CronJobs effectively: native kubectl commands for debugging, Prometheus alerting rules that actually work, external monitoring for catching silent failures, and a troubleshooting flowchart for when things go wrong.
How Kubernetes CronJobs Actually Work
Understanding CronJob architecture helps explain why they fail in unexpected ways.
Kubernetes CronJobs sit atop a three-tier object hierarchy: CronJob → Job → Pod, with your container running inside the Pod. The CronJob Controller, running inside kube-controller-manager, checks every 10 seconds whether any CronJob needs to create a new Job. When the schedule matches, it spawns a Job object, which creates Pods to execute your workload.
The flow works like this: the CronJob creates a Job on schedule, the Job immediately creates a Pod, and the Pod runs your container. Job completion is tracked via status fields, and the Pod's exit code determines success or failure.
Each execution creates ephemeral Pods—unlike traditional cron on a Linux server, there's no persistent state between runs. This architecture creates failure modes that traditional monitoring can't detect.
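You can walk this chain with kubectl to see exactly which Jobs and Pods a given CronJob produced. A minimal sketch, assuming a CronJob named database-backup and jq available locally; Kubernetes does not add a label pointing from a Job back to its CronJob, so this matches on ownerReferences (the placeholder job name is illustrative):

```bash
CRONJOB=database-backup

# Jobs owned by the CronJob (matched via ownerReferences, not labels)
kubectl get jobs -o json | jq -r --arg cj "$CRONJOB" \
  '.items[] | select(.metadata.ownerReferences[]?.name == $cj) | .metadata.name'

# Pods created by a specific Job carry the job-name label
kubectl get pods -l job-name=<job-name-from-above>
```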
The CronJob Spec Fields That Matter
Before diving into monitoring, you need to understand which configuration fields affect reliability:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
spec:
  schedule: "0 2 * * *"              # Standard cron syntax
  timeZone: "America/New_York"       # IANA timezone (stable since v1.27)
  concurrencyPolicy: Forbid          # Allow | Forbid | Replace
  startingDeadlineSeconds: 200       # Grace period for late starts
  successfulJobsHistoryLimit: 3      # Keep last 3 successful jobs
  failedJobsHistoryLimit: 1          # Keep last 1 failed job
  jobTemplate:
    spec:
      backoffLimit: 3                # Retry failed pods up to 3 times
      activeDeadlineSeconds: 1800    # Kill job after 30 minutes
      template:
        spec:
          containers:
            - name: backup
              image: backup-tool:v2.1
              command: ["/scripts/backup.sh"]
          restartPolicy: OnFailure
```

`concurrencyPolicy` controls what happens when a new job is due but the previous one is still running:
- `Allow`: Run jobs concurrently (dangerous for most use cases)
- `Forbid`: Skip the new job (safe, but you might miss runs)
- `Replace`: Kill the running job and start a new one
startingDeadlineSeconds is the most misunderstood field. It defines how late a job can start. If the controller misses the scheduled time (due to downtime, resource pressure, or bugs), it will still create the job as long as it's within this deadline.
Here's the critical part: if more than 100 scheduled runs are missed within the startingDeadlineSeconds window, the CronJob permanently stops scheduling. This is the bug that caused the 24-day outage. The CronJob logs an error and gives up forever:
```
Cannot determine if job needs to be started. Too many missed start time (> 100)
```

`timeZone` (stable since Kubernetes 1.27) lets you specify when jobs run in human terms. Without it, schedules use the kube-controller-manager's timezone—usually UTC in managed clusters.
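Because the failure mode is silent, it helps to know how to spot a CronJob that has already given up. A quick check with kubectl (events expire after an hour by default, so the event lookup only works shortly after the fact):

```bash
CRONJOB=database-backup

# When did the controller last actually schedule a Job?
kubectl get cronjob "$CRONJOB" -o jsonpath='{.status.lastScheduleTime}{"\n"}'

# Look for the tell-tale warning event while it is still retained
kubectl get events --field-selector reason=TooManyMissedTimes,involvedObject.name="$CRONJOB"
```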
Why CronJobs Fail Silently
CronJobs fail in ways that don't trigger traditional alerts:
| Failure Mode | What Happens | Traditional Monitoring Detects It? |
|---|---|---|
| Job never scheduled | Controller misses window, no Job created | ❌ No |
| 100 missed schedules | CronJob permanently stops | ❌ No |
| Pod exits 0 but work incomplete | Job marked "successful" | ❌ No |
| Resource quota exhausted | Job created, Pod never scheduled | ⚠️ Maybe |
| Wrong timezone | Job runs at unexpected time | ❌ No |
| Controller restart race | Duplicate jobs with Forbid policy | ❌ No |
The common thread: the absence of activity produces no signal. Your Prometheus metrics show nothing wrong because nothing is happening.
Native Kubernetes Monitoring with kubectl
Start with kubectl for debugging. These commands should be muscle memory:
```bash
# List all CronJobs with last schedule time
kubectl get cronjobs -o wide

# Detailed view including events
kubectl describe cronjob database-backup

# Jobs created by a specific CronJob (requires a cron-job-name label set in the jobTemplate; not added automatically)
kubectl get jobs -l cron-job-name=database-backup

# Pods for a specific Job
kubectl get pods --selector=job-name=database-backup-28547893

# Logs from the most recent job
kubectl logs job/database-backup-28547893

# Logs from a crashed container (previous attempt)
kubectl logs <pod-name> --previous

# Watch events in real-time
kubectl events --for cronjob/database-backup --watch

# Manual trigger for testing
kubectl create job manual-backup-test --from=cronjob/database-backup
```

For a quick health check across all CronJobs:
```bash
kubectl get cronjobs -o custom-columns=\
NAME:.metadata.name,\
SCHEDULE:.spec.schedule,\
SUSPENDED:.spec.suspend,\
LAST:.status.lastScheduleTime,\
ACTIVE:.status.active
```

Key Events to Watch For
| Event | Meaning | Action Required |
|---|---|---|
| `SuccessfulCreate` | Job created on schedule | None (good) |
| `SawCompletedJob` | Controller noticed completion | None (good) |
| `FailedCreate` | Couldn't create Job | Check RBAC, quotas |
| `MissSchedule` | Missed scheduled time | Check controller health |
| `TooManyMissedTimes` | 100+ misses, CronJob stopped | Delete and recreate |
| `JobAlreadyActive` | Skipped due to Forbid policy | May need longer deadline |
The critical limitation: Kubernetes events expire after 1 hour by default. If your CronJob fails at 2 AM and you check at 9 AM, the evidence is gone.
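If you want evidence that outlives the default TTL, export CronJob- and Job-related events periodically and ship them to whatever log store you already run. A rough sketch; the output files are just stand-ins for your log pipeline:

```bash
# Dump all CronJob and Job events across namespaces as JSON for long-term storage
kubectl get events --all-namespaces \
  --field-selector involvedObject.kind=CronJob -o json > cronjob-events.json

kubectl get events --all-namespaces \
  --field-selector involvedObject.kind=Job -o json > job-events.json
```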
Prometheus and kube-state-metrics Setup
For persistent monitoring, deploy kube-state-metrics and configure Prometheus alerts.
Essential CronJob Metrics
kube-state-metrics exposes these CronJob-specific metrics:
| Metric | Type | What It Tells You |
|---|---|---|
| `kube_cronjob_status_last_schedule_time` | Gauge | When the CronJob last ran |
| `kube_cronjob_next_schedule_time` | Gauge | When it should run next |
| `kube_cronjob_status_active` | Gauge | Currently running jobs |
| `kube_cronjob_spec_suspend` | Gauge | Whether it's suspended |
| `kube_job_status_failed` | Gauge | Job failure count |
| `kube_job_status_succeeded` | Gauge | Job success count |
| `kube_job_status_start_time` | Gauge | When job started |
| `kube_job_status_completion_time` | Gauge | When job finished |
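Before writing alerts, confirm these metrics are actually being scraped. A quick spot check, assuming a port-forward to Prometheus on localhost:9090 (the service name and namespace vary by install; jq is only used to count results):

```bash
# Forward the Prometheus UI/API locally (adjust service name and namespace to your setup)
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &

# Should return one series per CronJob if kube-state-metrics is being scraped
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=kube_cronjob_status_last_schedule_time' \
  | jq '.data.result | length'
```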
Alerting Rules That Actually Work
Alert on failed jobs:
```yaml
groups:
  - name: cronjob-alerts
    rules:
      - alert: KubeJobFailed
        expr: |
          kube_job_status_failed > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job_name }} failed in {{ $labels.namespace }}"
          runbook: "Check logs: kubectl logs job/{{ $labels.job_name }} -n {{ $labels.namespace }}"
```

Alert on missed schedules (the silent killer):
```yaml
- alert: KubeCronJobMissedSchedule
  expr: |
    (
      time() - kube_cronjob_status_last_schedule_time
      >
      1.5 * (kube_cronjob_next_schedule_time - kube_cronjob_status_last_schedule_time)
    )
    and kube_cronjob_spec_suspend == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "CronJob {{ $labels.cronjob }} missed its schedule"
    description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} has not run when expected"
```

Alert on jobs running too long:
```yaml
- alert: KubeJobRunningTooLong
  expr: |
    (time() - kube_job_status_start_time) > 3600
    and kube_job_status_active > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Job {{ $labels.job_name }} running over 1 hour"
```

Alert on stuck concurrent jobs:
```yaml
- alert: KubeCronJobTooManyActive
  expr: kube_cronjob_status_active > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "CronJob {{ $labels.cronjob }} has {{ $value }} active jobs"
```

Avoiding Alert Fatigue
A common problem: alerts fire on old failed jobs that are retained in history. Use this recording rule to track only the most recent job per CronJob:
```yaml
- record: cronjob:last_job_start_time:max
  expr: |
    max by (namespace, owner_name) (
      kube_job_status_start_time
      * on (job_name, namespace) group_left(owner_name)
      kube_job_owner{owner_kind="CronJob"}
    )
```

Here `owner_name` carries the CronJob's name. Then alert only on the most recent job's status rather than all historical jobs.
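Rule files are easy to break with a stray indent, and a rule that fails to load alerts on nothing. A quick pre-deploy check, assuming promtool (shipped with Prometheus) is installed and the rules above are saved as cronjob-rules.yaml:

```bash
# Validate alerting and recording rule syntax before Prometheus loads the file
promtool check rules cronjob-rules.yaml
```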
External Monitoring: The Dead Man's Switch
Here's the uncomfortable truth about internal monitoring: if the thing that's supposed to alert you is broken, you won't get alerted.
Prometheus can't tell you that:
- The entire cluster is down
- The controller-manager crashed
- A CronJob was never scheduled in the first place
- The monitoring system itself has issues
This is where external monitoring comes in. The pattern is called a dead man's switch: your job pings an external service upon completion. If the ping doesn't arrive, something went wrong—regardless of what that something is.
Implementing Ping-Based Monitoring
Add a curl command to your CronJob that signals success or failure:
apiVersion: batch/v1
kind: CronJob
metadata:
name: database-backup
spec:
schedule: "0 2 * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: backup-tool:v2.1
command:
- /bin/sh
- -c
- |
# Signal job started
curl -fsS https://cronradar.io/ping/abc123/start
# Run the actual backup
if /scripts/backup.sh; then
# Signal success
curl -fsS https://cronradar.io/ping/abc123
else
# Signal failure with exit code
curl -fsS https://cronradar.io/ping/abc123/fail
exit 1
fi
restartPolicy: OnFailureThe external service expects a ping within a configured grace period. No ping means something broke—and it doesn't matter whether the issue was in your code, the scheduler, the cluster, or anywhere else.
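If you instrument many CronJobs this way, the curl boilerplate gets repetitive. One option is a small wrapper script baked into your job images; a sketch assuming the ping URL arrives via a PING_URL environment variable (the script name and variable are illustrative, following the same /start and /fail convention as above):

```bash
#!/bin/sh
# with-ping.sh: run a command and report the outcome to an external monitor.
# Assumes PING_URL is set, e.g. PING_URL=https://cronradar.io/ping/abc123
# Usage: with-ping.sh /scripts/backup.sh

set -u

curl -fsS "${PING_URL}/start" || true    # signal start; never fail the job because a ping failed

if "$@"; then
  curl -fsS "${PING_URL}" || true        # signal success
else
  status=$?
  curl -fsS "${PING_URL}/fail" || true   # signal failure
  exit "$status"
fi
```

The container command then shrinks to something like /scripts/with-ping.sh /scripts/backup.sh, with the ping URL injected from a Secret as in the full example later in this guide.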
What External Monitoring Catches That Internal Can't
| Failure Scenario | Prometheus | External Ping |
|---|---|---|
| Job fails with error | ✅ Yes | ✅ Yes |
| Job hangs indefinitely | ⚠️ Needs timeout config | ✅ Grace period expires |
| Job never scheduled | ❌ No metrics exist | ✅ No ping arrives |
| Controller crashes | ❌ No | ✅ No ping arrives |
| Cluster network partition | ❌ No | ✅ No ping arrives |
| Monitoring stack down | ❌ No | ✅ Independent system |
The recommendation: use both. Prometheus for metrics, dashboards, and detailed debugging. External monitoring as the safety net that catches what internal monitoring misses.
Troubleshooting Decision Tree
When a CronJob isn't working, follow this diagnostic path:
```
CronJob Not Working?
│
├─► Is the CronJob suspended?
│ └─► kubectl get cronjob <name> -o jsonpath='{.spec.suspend}'
│ └─► If true: kubectl patch cronjob <name> -p '{"spec":{"suspend":false}}'
│
├─► Is a Job being created?
│ └─► kubectl get jobs -l cron-job-name=<name>
│ │
│ ├─► No Jobs found:
│ │ ├─► Check schedule syntax at crontab.guru
│ │ ├─► Verify startingDeadlineSeconds > 10
│ │ ├─► Look for "TooManyMissedTimes" in events
│ │ └─► Check controller-manager logs
│ │
│ └─► Jobs exist but failing:
│ └─► Continue to Pod diagnostics
│
├─► Is a Pod being created?
│ └─► kubectl get pods --selector=job-name=<job-name>
│ │
│ ├─► No Pods:
│ │ ├─► kubectl describe resourcequota
│ │ ├─► Check LimitRange constraints
│ │ └─► Verify serviceAccount permissions
│ │
│ └─► Pods exist but failing:
│ └─► Continue to container diagnostics
│
└─► What's the Pod status?
│
├─► Pending:
│ ├─► Check node resources (kubectl describe node)
│ ├─► Verify tolerations match taints
│ └─► Check nodeSelector/affinity rules
│
├─► ImagePullBackOff:
│ ├─► Verify image name and tag
│ └─► Check imagePullSecrets
│
├─► CrashLoopBackOff:
│ └─► kubectl logs <pod> --previous
│
└─► OOMKilled (exit 137):
          └─► Increase memory limits
```
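When you reach the Pod-status branches, the container's exit code and termination reason usually settle the diagnosis. A quick way to pull them for a single-container Pod (the pod name is a placeholder):

```bash
# Exit code once the Pod has finished (137 usually means OOMKilled)
kubectl get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}{"\n"}'

# If the container restarted (restartPolicy: OnFailure), the previous attempt is in lastState
kubectl get pod <pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
```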
"Too many missed start times"
The CronJob hit 100 missed schedules and stopped forever. Only fix: delete and recreate it.
```bash
kubectl get cronjob <name> -o yaml > cronjob-backup.yaml
kubectl delete cronjob <name>
kubectl apply -f cronjob-backup.yaml
```

Prevent this by setting startingDeadlineSeconds to limit the lookback window.
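For CronJobs already running in the cluster, you can add the deadline in place instead of redeploying. A sketch using kubectl patch, with one hour as an example window:

```bash
# Cap how far back the controller looks for missed schedules (here: one hour)
kubectl patch cronjob <name> --type merge \
  -p '{"spec":{"startingDeadlineSeconds":3600}}'
```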
Jobs running at the wrong time
Prior to Kubernetes 1.27, CronJobs used the controller's timezone (usually UTC). Add the timeZone field:
```yaml
spec:
  timeZone: "America/New_York"
  schedule: "0 9 * * *"  # 9 AM New York time
```

Duplicate jobs with Forbid policy
In rare cases, controller restarts can cause duplicates even with concurrencyPolicy: Forbid. If this is critical (like payment processing), implement idempotency in your application logic.
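One lightweight approach for shell-based jobs is a per-run marker: the first Job to create the marker does the work, and any duplicate exits cleanly. A sketch using a ConfigMap as the marker (assumes the Job's service account is allowed to create ConfigMaps; the naming scheme is illustrative):

```bash
#!/bin/sh
# Idempotency guard: only one Job per schedule window does the real work.
RUN_ID=$(date -u +%Y-%m-%d)    # one marker per day, matching a daily schedule

if kubectl create configmap "backup-run-${RUN_ID}" \
     --from-literal=startedAt="$(date -u +%Y-%m-%dT%H:%M:%SZ)"; then
  /scripts/backup.sh           # we claimed the marker first; do the work
else
  echo "Run ${RUN_ID} already claimed by another Job; exiting."
  exit 0                       # the duplicate exits successfully without side effects
fi
```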
Jobs stuck in "Active" state
The Pod might be hanging. Set activeDeadlineSeconds on the Job spec to enforce a timeout:
```yaml
jobTemplate:
  spec:
    activeDeadlineSeconds: 1800  # Kill after 30 minutes
```

Production Checklist
Before deploying a CronJob to production, verify each item:
Schedule Configuration
- [ ] Schedule syntax validated at crontab.guru
- [ ] `timeZone` explicitly set (don't rely on defaults)
- [ ] Schedule accounts for job duration (won't overlap unexpectedly)
Failure Handling
- [ ] `startingDeadlineSeconds` set (prevents 100 missed schedules bug)
- [ ] `concurrencyPolicy` appropriate for workload
- [ ] `backoffLimit` set on Job spec
- [ ] `activeDeadlineSeconds` set to prevent hung jobs
Resource Management
- [ ] Container resource requests and limits defined
- [ ] Namespace has sufficient quota
- [ ] `ttlSecondsAfterFinished` set to clean up completed Jobs
Monitoring
- [ ] Prometheus alerts configured for failures
- [ ] External ping-based monitoring for silent failures
- [ ] Log aggregation captures job output
- [ ] Runbooks exist for common failure scenarios
History and Debugging
- [ ] `successfulJobsHistoryLimit` allows sufficient debugging (3-7)
- [ ] `failedJobsHistoryLimit` retains recent failures (1-3)
- [ ] Events are being shipped to long-term storage
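Most of the spec-level items above can be verified before anything reaches the cluster. A pre-deploy check, assuming the manifest is saved as cronjob.yaml:

```bash
# Full API-server validation without creating anything; a malformed schedule
# string, for example, is rejected here instead of failing silently at 2 AM.
kubectl apply --dry-run=server -f cronjob.yaml
```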
Complete Production Example
Here's a production-ready CronJob with monitoring integrated:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: production-db-backup
  namespace: production
  labels:
    app: database-backup
    team: platform
spec:
  schedule: "0 2 * * *"
  timeZone: "UTC"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 3600
  successfulJobsHistoryLimit: 7
  failedJobsHistoryLimit: 3
  jobTemplate:
    metadata:
      labels:
        app: database-backup
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 1800
      ttlSecondsAfterFinished: 86400
      template:
        spec:
          serviceAccountName: backup-service-account
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000
          containers:
            - name: backup
              image: company/backup-tool:v2.1.0
              imagePullPolicy: IfNotPresent
              resources:
                requests:
                  memory: "256Mi"
                  cpu: "100m"
                limits:
                  memory: "512Mi"
                  cpu: "500m"
              env:
                - name: BACKUP_BUCKET
                  value: "s3://company-backups/database"
                - name: CRONRADAR_PING_URL
                  valueFrom:
                    secretKeyRef:
                      name: monitoring-secrets
                      key: cronradar-ping-url
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  # Signal start
                  curl -fsS "${CRONRADAR_PING_URL}/start" || true
                  # Run backup
                  if /scripts/backup.sh; then
                    echo "Backup completed successfully"
                    curl -fsS "${CRONRADAR_PING_URL}"
                  else
                    echo "Backup failed"
                    curl -fsS "${CRONRADAR_PING_URL}/fail"
                    exit 1
                  fi
          restartPolicy: OnFailure
```

Key Takeaways
Kubernetes CronJob monitoring requires a layered approach:
- Native kubectl commands for real-time debugging and manual intervention
- Prometheus with kube-state-metrics for dashboards, trends, and alerting on failures
- External ping-based monitoring for catching silent failures that produce no internal signal
The most dangerous CronJob failures are the ones that don't happen—jobs that never run don't generate metrics, logs, or events. External monitoring acts as a dead man's switch: if the ping doesn't arrive, something went wrong, and you'll know about it regardless of what broke.
Configure startingDeadlineSeconds to prevent the 100 missed schedules bug. Set activeDeadlineSeconds to prevent hung jobs. Use timeZone to avoid schedule confusion. And always have monitoring that's independent of your cluster—because when everything fails, that's exactly when you need to know.
Want to monitor your Kubernetes CronJobs without maintaining Prometheus infrastructure? CronRadar provides dead man's switch monitoring with a single curl command. Get instant alerts when jobs fail or never run. Start monitoring free →