Monitoring Sidekiq Jobs in Ruby on Rails Applications
Sidekiq powers background job processing for thousands of Rails applications, handling everything from email delivery to payment processing. But without proper monitoring, failed jobs can silently pile up while your team remains unaware—sometimes for hours.
Consider this real-world incident: A document processing company discovered that podcast import jobs were starving all other jobs in their default queue. The issue went undetected for seven hours before an internal user noticed the backlog. By then, thousands of jobs had failed, and customers were affected.
This guide covers everything you need to implement production-ready Sidekiq monitoring: the metrics that matter, common failure patterns, native monitoring options, and how to set up external alerting that catches problems before your users do.
Why Sidekiq monitoring matters
Sidekiq's Web UI shows you what's happening right now, but it won't tell you when problems start. The fundamental challenge is the visibility gap between "jobs are processing" and "jobs are processing correctly and on time."
Without monitoring, you'll discover issues through:
- Customer complaints about missing emails or delayed reports
- Database bloat from unprocessed cleanup jobs
- Revenue loss from failed payment processing
- Manual dashboard checks (that nobody remembers to do)
Proper monitoring shifts discovery from reactive to proactive. Instead of learning about a queue backlog from an angry customer, you get a Slack alert the moment latency exceeds your threshold.
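For example, a minimal proactive check can run every minute from cron or your scheduler and post to Slack when latency crosses a threshold. This is only a sketch—the SLACK_WEBHOOK_URL variable and the 30-second threshold are assumptions to adapt:

# Sketch: alert when any queue's latency exceeds a threshold.
# Run periodically (cron, sidekiq-cron, etc.); SLACK_WEBHOOK_URL is assumed.
require 'sidekiq/api'
require 'net/http'
require 'json'

LATENCY_THRESHOLD = 30 # seconds

Sidekiq::Queue.all.each do |queue|
  next if queue.latency < LATENCY_THRESHOLD

  message = { text: "Sidekiq queue '#{queue.name}' latency is #{queue.latency.round}s" }
  Net::HTTP.post(
    URI(ENV.fetch('SLACK_WEBHOOK_URL')),
    message.to_json,
    'Content-Type' => 'application/json'
  )
end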
Key metrics to track
Effective Sidekiq monitoring requires visibility into three categories: queue health, job performance, and infrastructure stability.
Queue health metrics
| Metric | What It Measures | How to Access | Alert Threshold |
|---|---|---|---|
| Queue latency | Seconds since the oldest job was enqueued | Sidekiq::Queue.new("default").latency | >30s warning, >60s critical |
| Queue size | Number of jobs waiting | Sidekiq::Queue.new("default").size | >100 jobs (varies by app) |
| Scheduled set size | Jobs scheduled for future execution | Sidekiq::ScheduledSet.new.size | Baseline + 50% |
Queue latency is more meaningful than queue size. A queue with 1,000 fast jobs might have lower latency than a queue with 10 slow jobs. Latency tells you how long jobs actually wait before processing begins.
# Check queue health across all queues
Sidekiq::Queue.all.each do |queue|
puts "#{queue.name}: #{queue.size} jobs, #{queue.latency.round(2)}s latency"
end

Job performance metrics
| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Failure rate | Failed jobs / total jobs | >1% (investigate), >5% (critical) |
| Retry queue size | Jobs awaiting retry | Steadily growing = persistent issue |
| Dead job count | Jobs that exhausted all retries | Any increase warrants investigation |
| Processing throughput | Jobs processed per second | Below baseline = capacity issue |
The retry queue deserves special attention. Sidekiq retries failed jobs with exponential backoff over approximately 21 days (25 attempts). A growing retry queue often indicates a systemic problem—a down API, a database issue, or a bug affecting a class of jobs.
# Monitor retry and dead job counts
retry_count = Sidekiq::RetrySet.new.size
dead_count = Sidekiq::DeadSet.new.size
if dead_count > 0
puts "⚠️ #{dead_count} jobs in dead queue - investigate immediately"
end

Infrastructure metrics
| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Busy workers | Active threads processing jobs | All busy + growing queue = scale up |
| Memory per process | Sidekiq process RAM usage | Continuous growth = memory leak |
| Redis memory | Redis used_memory | >70% of maxmemory |
| Redis latency | Connection response time | >5ms concerning |
Ruby processes rarely release memory back to the OS, so Sidekiq memory usage tends to grow over time. Some growth is normal, but unbounded growth indicates a problem—often caused by loading large datasets into memory or ActiveRecord query caching.
# Get worker and process stats
stats = Sidekiq::Stats.new
processes = Sidekiq::ProcessSet.new
puts "Processed: #{stats.processed}"
puts "Failed: #{stats.failed}"
puts "Busy workers: #{processes.sum { |p| p['busy'] }}"
puts "Total workers: #{processes.sum { |p| p['concurrency'] }}"Common Sidekiq problems and how monitoring catches them
Understanding failure modes helps you configure alerts that catch real problems without creating noise.
Problem 1: Jobs stuck processing
Symptoms: Queue grows continuously, workers show as busy, latency increases
Root cause: Usually missing timeouts on network calls—a job waiting on an unresponsive external API blocks its worker thread indefinitely.
What monitoring catches: Queue latency exceeding threshold while worker utilization remains high
Prevention:
class ExternalApiJob
include Sidekiq::Job
sidekiq_options retry: 3
def perform(url)
# Always set timeouts on external calls
response = HTTP.timeout(connect: 5, read: 10).get(url)
process_response(response)
end
end

Problem 2: Silent job failures
Symptoms: Jobs disappear without errors, work never completes, no retries triggered
Root cause: Custom middleware or rescue blocks catching exceptions without re-raising them
What monitoring catches: Heartbeat monitoring detects when expected job completions don't occur
# BAD: Swallows exceptions silently
def perform(user_id)
process_user(user_id)
rescue StandardError => e
Rails.logger.error(e.message)
# Job appears successful but work didn't complete
end
# GOOD: Log and re-raise to trigger retry
def perform(user_id)
process_user(user_id)
rescue StandardError => e
Rails.logger.error(e.message)
raise # Re-raise to trigger Sidekiq retry
end

Problem 3: Memory bloat
Symptoms: Process memory grows continuously, eventually OOM-killed
Root causes:
- Loading entire tables into memory (User.all.each instead of User.find_each)
- ActiveRecord query cache accumulation
- Memory fragmentation (especially with glibc malloc)
What monitoring catches: Memory metrics exceeding baseline, process restarts
Prevention:
class LargeDatasetJob
include Sidekiq::Job
def perform
# BAD: Loads all records into memory
# User.all.each { |user| process(user) }
# GOOD: Processes in batches of 1000
User.find_each(batch_size: 1000) do |user|
process(user)
end
end
end

For memory fragmentation, set MALLOC_ARENA_MAX=2 in your environment to limit the number of glibc malloc arenas and reduce fragmentation-driven memory growth.
Problem 4: Queue starvation
Symptoms: High-priority jobs wait while long-running jobs consume all workers
Root cause: Long-running jobs in shared queues block other job types
What monitoring catches: Latency spikes on specific queues, throughput drops
Prevention: Use dedicated queues with weight-based processing:
# config/sidekiq.yml
:queues:
- [critical, 10]
- [default, 5]
- [bulk, 1]
# Route long-running jobs to dedicated queue
class PodcastImportJob
include Sidekiq::Job
sidekiq_options queue: :bulk
end

Problem 5: Jobs lost during deploys
Symptoms: Jobs that were in progress during a deploy never complete
Root cause: In Sidekiq OSS, the worker pops each job from Redis before processing it. A clean shutdown pushes unfinished jobs back onto the queue, but if the process crashes or is SIGKILLed mid-job, that job is gone.
What monitoring catches: Job completion rates drop during deploy windows, heartbeats miss expected check-ins
Prevention for Heroku (dynos get 30 seconds after SIGTERM before SIGKILL):
# config/sidekiq.yml
:timeout: 25 # Sidekiq's default; keep it below your platform's kill window

For critical jobs, consider Sidekiq Pro's reliable fetch (super_fetch), which uses RPOPLPUSH to move each in-progress job to a per-process working queue so it can be recovered after a crash.
Problem 6: Transaction race conditions
Symptoms: ActiveRecord::RecordNotFound errors on recently created records
Root cause: Job executes before the database transaction that created the record commits
# BAD: Job may run before transaction commits
User.transaction do
user = User.create!(params)
WelcomeEmailJob.perform_async(user.id) # May fail with RecordNotFound
end
# GOOD: Use after_commit callback
class User < ApplicationRecord
after_commit :send_welcome_email, on: :create
private
def send_welcome_email
WelcomeEmailJob.perform_async(id)
end
end
# ALSO GOOD: Enable transactional push (Sidekiq 7.1+)
# config/initializers/sidekiq.rb
Sidekiq.transactional_push!

Native Sidekiq monitoring options
Sidekiq Web UI (Free)
The Web UI provides a real-time dashboard showing:
- Processed and failed job counts
- Queue sizes and latency
- Busy workers and their current jobs
- Scheduled, retry, and dead job queues
Setup:
# config/routes.rb
require 'sidekiq/web'
# With Devise authentication
authenticate :user, ->(user) { user.admin? } do
mount Sidekiq::Web => '/sidekiq'
end
# Or with HTTP Basic Auth
Sidekiq::Web.use Rack::Auth::Basic do |username, password|
ActiveSupport::SecurityUtils.secure_compare(username, ENV['SIDEKIQ_USER']) &
ActiveSupport::SecurityUtils.secure_compare(password, ENV['SIDEKIQ_PASSWORD'])
end
mount Sidekiq::Web => '/sidekiq'

Limitations:
- Point-in-time view only—no historical data or trends
- No alerting capabilities—requires manual checking
- Won't tell you when problems started or how they evolved
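If you only need lightweight trend data, one workaround is to snapshot the stats API (covered next) on a schedule and ship the numbers to your log aggregator. A sketch, assuming a hypothetical SidekiqStatsSnapshotJob scheduled every few minutes:

# Sketch: periodically record Sidekiq stats as structured logs for trend analysis.
require 'sidekiq/api'

class SidekiqStatsSnapshotJob
  include Sidekiq::Job

  def perform
    stats = Sidekiq::Stats.new
    Rails.logger.info({
      event: 'sidekiq.stats.snapshot',
      processed: stats.processed,
      failed: stats.failed,
      enqueued: stats.enqueued,
      retry_size: stats.retry_size,
      dead_size: stats.dead_size,
      default_latency: Sidekiq::Queue.new('default').latency.round(2)
    }.to_json)
  end
end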
Sidekiq API
For programmatic access, Sidekiq provides a comprehensive stats API:
require 'sidekiq/api'
stats = Sidekiq::Stats.new
stats.processed # Total jobs processed
stats.failed # Total jobs failed
stats.enqueued # Jobs currently enqueued
stats.scheduled_size # Jobs scheduled for future
stats.retry_size # Jobs in retry queue
stats.dead_size # Jobs in dead queue
stats.processes_size # Number of running processes
# Historical data (last 5 days)
history = Sidekiq::Stats::History.new(5)
history.processed # Hash of date => count
history.failed # Hash of date => count

Sidekiq Pro and Enterprise
Sidekiq Pro ($99/month) adds:
- Reliable fetch (jobs not lost on crash)
- Batch jobs with completion callbacks
- DogStatsD metrics export
Sidekiq Enterprise (from $229/month) adds:
- Historical metrics retention
- Periodic jobs (built-in cron)
- Multi-process management with auto-restart
- Rate limiting with visibility
Setting up external monitoring
Native monitoring tells you what's happening now. External monitoring alerts you when things go wrong and provides historical context for debugging.
Approach 1: Heartbeat monitoring
Heartbeat (or "dead man's switch") monitoring works by expecting regular check-ins. If a check-in doesn't arrive on schedule, you get alerted.
This approach is ideal for:
- Scheduled jobs that should run at specific intervals
- Critical jobs that must complete successfully
- Jobs where silence (no errors) could indicate a problem
Basic implementation:
class DailyReportJob
include Sidekiq::Job
def perform
generate_report
# Ping monitoring service on success
uri = URI.parse(ENV['CRONRADAR_PING_URL'])
Net::HTTP.get_response(uri)
rescue StandardError => e
# Ping failure endpoint
uri = URI.parse("#{ENV['CRONRADAR_PING_URL']}/fail")
Net::HTTP.get_response(uri)
raise
end
end

With timeout tracking:
class CriticalSyncJob
include Sidekiq::Job
EXPECTED_DURATION = 300 # 5 minutes
def perform
start_time = Time.current
# Notify job started
ping_monitor("/start")
perform_sync
duration = Time.current - start_time
if duration > EXPECTED_DURATION
ping_monitor("/fail?message=exceeded_duration")
else
ping_monitor
end
rescue StandardError => e
ping_monitor("/fail?message=#{CGI.escape(e.message)}")
raise
end
private
def ping_monitor(path = "")
uri = URI.parse("#{ENV['CRONRADAR_PING_URL']}#{path}")
Net::HTTP.get_response(uri)
rescue => e
Rails.logger.warn "Monitor ping failed: #{e.message}"
end
end

Approach 2: Custom middleware
Middleware provides monitoring for all jobs without modifying individual job classes:
# config/initializers/sidekiq.rb
class MonitoringMiddleware
include Sidekiq::ServerMiddleware
def call(job_instance, job_payload, queue)
started_at = Time.current
success = false
error_message = nil
begin
yield
success = true
rescue StandardError => e
error_message = e.message
raise
ensure
record_metrics(
job_class: job_payload['class'],
queue: queue,
jid: job_payload['jid'],
duration: Time.current - started_at,
enqueued_at: job_payload['enqueued_at'],
success: success,
error: error_message
)
end
end
private
def record_metrics(job_class:, queue:, jid:, duration:, enqueued_at:, success:, error:)
latency = enqueued_at ? Time.current.to_f - enqueued_at : nil
# Send to your monitoring service
Rails.logger.info({
event: 'sidekiq.job.completed',
job_class: job_class,
queue: queue,
jid: jid,
duration_ms: (duration * 1000).round,
latency_ms: latency ? (latency * 1000).round : nil,
success: success,
error: error
}.to_json)
end
end
Sidekiq.configure_server do |config|
config.server_middleware do |chain|
chain.add MonitoringMiddleware
end
end

Approach 3: Death handlers
Get alerted when jobs exhaust all retries and move to the dead queue:
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
config.death_handlers << ->(job, exception) do
job_class = job['class']
job_id = job['jid']
error = exception.message
# Send alert
SlackNotifier.alert(
channel: '#sidekiq-alerts',
text: "🔴 Job permanently failed: #{job_class}\nJID: #{job_id}\nError: #{error}"
)
# Ping monitoring service
uri = URI.parse("#{ENV['CRONRADAR_PING_URL']}/#{job_class.underscore}/fail")
Net::HTTP.get_response(uri)
rescue => e
Rails.logger.error "Death handler failed: #{e.message}"
end
end

Approach 4: Health check endpoint
Expose Sidekiq health for load balancers and external monitoring:
# config/routes.rb
get '/health/sidekiq', to: 'health#sidekiq'
# app/controllers/health_controller.rb
class HealthController < ApplicationController
skip_before_action :authenticate_user!
def sidekiq
checks = {
queues: check_queues,
redis: check_redis,
processes: check_processes
}
healthy = checks.values.all? { |c| c[:status] == 'ok' }
render json: {
status: healthy ? 'ok' : 'degraded',
checks: checks,
timestamp: Time.current.iso8601
}, status: healthy ? 200 : 503
end
private
def check_queues
critical_queue = Sidekiq::Queue.new('critical')
default_queue = Sidekiq::Queue.new('default')
latency_ok = critical_queue.latency < 30 && default_queue.latency < 60
{
status: latency_ok ? 'ok' : 'degraded',
critical_latency: critical_queue.latency.round(2),
default_latency: default_queue.latency.round(2)
}
end
def check_redis
  started_at = Time.current
  Sidekiq.redis { |conn| conn.ping }
  { status: 'ok', latency_ms: ((Time.current - started_at) * 1000).round(1) }
rescue Redis::BaseError => e
  { status: 'error', message: e.message }
end
def check_processes
processes = Sidekiq::ProcessSet.new
{
status: processes.size > 0 ? 'ok' : 'error',
count: processes.size,
busy: processes.sum { |p| p['busy'] }
}
end
end

Approach 5: Prometheus + Grafana
For self-hosted metrics and dashboards, use yabeda-sidekiq:
# Gemfile
gem 'yabeda-sidekiq'
gem 'yabeda-prometheus'
# config/initializers/yabeda.rb
Yabeda.configure do
# Custom metrics if needed
gauge :custom_queue_size do
description "Custom queue size metric"
tags [:queue]
end
end
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |_config|
Yabeda::Prometheus::Exporter.start_metrics_server!
end

Pre-built Grafana dashboard available: ID 11667
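The custom gauge defined above still needs to be populated. A sketch using Yabeda's collect hook, which runs each time Prometheus scrapes the metrics endpoint (yabeda-sidekiq already exports standard queue metrics, so this is only needed for your own gauges):

# config/initializers/yabeda.rb — collect blocks can live alongside metric definitions
Yabeda.configure do
  collect do
    # Refresh the custom gauge on every scrape
    Sidekiq::Queue.all.each do |queue|
      Yabeda.custom_queue_size.set({ queue: queue.name }, queue.size)
    end
  end
end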
Best practices by application scale
Small applications (< 10,000 jobs/day)
Recommended stack:
- Sidekiq Web UI for visibility
- Error tracking (Sentry, Honeybadger)
- Heartbeat monitoring for critical scheduled jobs
Alert thresholds:
- Queue latency > 30 seconds
- Any job in dead queue
- Daily job count below baseline
Implementation:
# Monitor critical scheduled jobs with heartbeat pings
class NightlyBackupJob
include Sidekiq::Job
def perform
perform_backup
Net::HTTP.get(URI(ENV['BACKUP_JOB_PING_URL']))
end
end

Medium applications (10,000 - 500,000 jobs/day)
Recommended stack:
- APM tool (AppSignal, Scout, New Relic)
- Heartbeat monitoring for all scheduled jobs
- Custom health check endpoint
- Slack/PagerDuty integration for alerts
Alert thresholds:
- Queue latency > 60 seconds (warning), > 120 seconds (critical)
- Failure rate > 1%
- Memory growth > 50% of baseline
- Retry queue > 500 jobs
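A sketch of a periodic check for the failure-rate and retry-queue thresholds above. The Rails.cache counter storage and the alert helper are assumptions—swap in your Slack or PagerDuty client:

# Sketch: compare current counters against the last run to estimate failure rate.
require 'sidekiq/api'

class SidekiqThresholdCheckJob
  include Sidekiq::Job

  def perform
    stats = Sidekiq::Stats.new
    previous = Rails.cache.read('sidekiq_threshold_check') ||
               { processed: stats.processed, failed: stats.failed }

    processed_delta = stats.processed - previous[:processed]
    failed_delta = stats.failed - previous[:failed]
    failure_rate = processed_delta.positive? ? failed_delta.to_f / processed_delta : 0.0

    alert("Failure rate #{(failure_rate * 100).round(2)}% since last check") if failure_rate > 0.01
    alert("Retry queue at #{stats.retry_size} jobs") if stats.retry_size > 500

    Rails.cache.write('sidekiq_threshold_check', { processed: stats.processed, failed: stats.failed })
  end

  private

  def alert(message)
    Rails.logger.warn("[sidekiq-alert] #{message}") # replace with your Slack/PagerDuty call
  end
end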
High-volume applications (> 500,000 jobs/day)
Recommended stack:
- Prometheus + Grafana for metrics
- Sidekiq Enterprise for historical data
- Dedicated Redis instance with monitoring
- Autoscaling based on queue metrics (see the sketch after the thresholds table)
Alert thresholds:
| Metric | Warning | Critical |
|---|---|---|
| Queue latency | > 60s | > 300s |
| Failure rate | > 0.5% | > 2% |
| Memory usage | > 80% | > 95% |
| Redis memory | > 70% | > 85% |
| Retry queue | > 1,000 | > 5,000 |
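For the autoscaling item above, the decision can be driven by the same queue metrics. This sketch covers the decision logic only—the numbers are illustrative, and the actual scale-up call depends on your platform (Heroku API, Kubernetes HPA, and so on):

# Sketch: decide how many Sidekiq processes you want from backlog and latency.
require 'sidekiq/api'

TARGET_LATENCY   = 60   # seconds; illustrative
MAX_PROCESSES    = 10
JOBS_PER_PROCESS = 250  # rough throughput per process per scaling interval

def desired_sidekiq_processes
  enqueued = Sidekiq::Stats.new.enqueued
  worst_latency = Sidekiq::Queue.all.map(&:latency).max || 0

  # Latency far past target: jump straight to the ceiling
  return MAX_PROCESSES if worst_latency > TARGET_LATENCY * 5

  (enqueued.to_f / JOBS_PER_PROCESS).ceil.clamp(1, MAX_PROCESSES)
end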
Monitoring checklist
Before going to production, verify you have:
Queue visibility:
- [ ] Queue latency monitoring with alerts
- [ ] Queue size baseline established
- [ ] Retry queue growth alerts
- [ ] Dead queue alerts (any job death = investigate)
Job performance:
- [ ] Failure rate tracking
- [ ] Critical job completion monitoring (heartbeats)
- [ ] Job duration baselines for anomaly detection
Infrastructure:
- [ ] Redis memory monitoring
- [ ] Worker memory monitoring
- [ ] Process count verification
- [ ] Health check endpoint for load balancers
Alerting:
- [ ] Alerts route to appropriate channels (Slack, PagerDuty)
- [ ] Escalation path defined for critical alerts
- [ ] Runbook for common alert types
Summary
Sidekiq's reliability depends on visibility. The Web UI shows current state, but production applications need proactive monitoring that catches problems before they impact users.
Start with the basics: monitor queue latency and dead jobs, add heartbeat monitoring for critical scheduled jobs, and set up alerts that reach your team. As your application grows, layer in APM tools and custom metrics.
The goal isn't perfect monitoring—it's catching problems in minutes instead of hours. With the right setup, you'll know about queue backlogs, stuck workers, and failed jobs before anyone else does.