Monitoring Sidekiq Jobs in Ruby on Rails Applications
Sidekiq powers background job processing for thousands of Rails applications, handling everything from email delivery to payment processing. But without proper monitoring, failed jobs can silently pile up while your team remains unaware—sometimes for hours.
Consider this real-world incident: A document processing company discovered that podcast import jobs were starving all other jobs in their default queue. The issue went undetected for seven hours before an internal user noticed the backlog. By then, thousands of jobs had failed, and customers were affected.
This guide covers everything you need to implement production-ready Sidekiq monitoring: the metrics that matter, common failure patterns, native monitoring options, and how to set up external alerting that catches problems before your users do.
Why Sidekiq monitoring matters
Sidekiq's Web UI shows you what's happening right now, but it won't tell you when problems start. The fundamental challenge is the visibility gap between "jobs are processing" and "jobs are processing correctly and on time."
Without monitoring, you'll discover issues through:
- Customer complaints about missing emails or delayed reports
- Database bloat from unprocessed cleanup jobs
- Revenue loss from failed payment processing
- Manual dashboard checks (that nobody remembers to do)
Proper monitoring shifts discovery from reactive to proactive. Instead of learning about a queue backlog from an angry customer, you get a Slack alert the moment latency exceeds your threshold.
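For example, a minimal proactive check can run every minute from cron or your scheduler and post to Slack when latency crosses a threshold. This is only a sketch—the SLACK_WEBHOOK_URL variable and the 30-second threshold are assumptions to adapt:

# Sketch: alert when any queue's latency exceeds a threshold.
# Run periodically (cron, sidekiq-cron, etc.); SLACK_WEBHOOK_URL is assumed.
require 'sidekiq/api'
require 'net/http'
require 'json'

LATENCY_THRESHOLD = 30 # seconds

Sidekiq::Queue.all.each do |queue|
  next if queue.latency < LATENCY_THRESHOLD

  message = { text: "Sidekiq queue '#{queue.name}' latency is #{queue.latency.round}s" }
  Net::HTTP.post(
    URI(ENV.fetch('SLACK_WEBHOOK_URL')),
    message.to_json,
    'Content-Type' => 'application/json'
  )
end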
Key metrics to track
Effective Sidekiq monitoring requires visibility into three categories: queue health, job performance, and infrastructure stability.
Queue health metrics
| Metric | What It Measures | How to Access | Alert Threshold |
|---|---|---|---|
| Queue latency | Seconds since the oldest job was enqueued | Sidekiq::Queue.new("default").latency | >30s warning, >60s critical |
| Queue size | Number of jobs waiting | Sidekiq::Queue.new("default").size | >100 jobs (varies by app) |
| Scheduled set size | Jobs scheduled for future execution | Sidekiq::ScheduledSet.new.size | Baseline + 50% |
Queue latency is more meaningful than queue size. A queue with 1,000 fast jobs might have lower latency than a queue with 10 slow jobs. Latency tells you how long jobs actually wait before processing begins.
# Check queue health across all queues
Sidekiq::Queue.all.each do |queue|
puts "#{queue.name}: #{queue.size} jobs, #{queue.latency.round(2)}s latency"
end

Job performance metrics
| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Failure rate | Failed jobs / total jobs | >1% (investigate), >5% (critical) |
| Retry queue size | Jobs awaiting retry | Steadily growing = persistent issue |
| Dead job count | Jobs that exhausted all retries | Any increase warrants investigation |
| Processing throughput | Jobs processed per second | Below baseline = capacity issue |
The retry queue deserves special attention. Sidekiq retries failed jobs with exponential backoff over approximately 21 days (25 attempts). A growing retry queue often indicates a systemic problem—a down API, a database issue, or a bug affecting a class of jobs.
# Monitor retry and dead job counts
retry_count = Sidekiq::RetrySet.new.size
dead_count = Sidekiq::DeadSet.new.size
if dead_count > 0
puts "⚠️ #{dead_count} jobs in dead queue - investigate immediately"
end

Infrastructure metrics
| Metric | What It Measures | Alert Threshold |
|---|---|---|
| Busy workers | Active threads processing jobs | All busy + growing queue = scale up |
| Memory per process | Sidekiq process RAM usage | Continuous growth = memory leak |
| Redis memory | Redis used_memory | >70% of maxmemory |
| Redis latency | Connection response time | >5ms concerning |
Ruby processes rarely release memory back to the OS, so Sidekiq memory usage tends to grow over time. Some growth is normal, but unbounded growth indicates a problem—often caused by loading large datasets into memory or ActiveRecord query caching.
# Get worker and process stats
stats = Sidekiq::Stats.new
processes = Sidekiq::ProcessSet.new
puts "Processed: #{stats.processed}"
puts "Failed: #{stats.failed}"
puts "Busy workers: #{processes.sum { |p| p['busy'] }}"
puts "Total workers: #{processes.sum { |p| p['concurrency'] }}"Common Sidekiq problems and how monitoring catches them
Understanding failure modes helps you configure alerts that catch real problems without creating noise.
Problem 1: Jobs stuck processing
Symptoms: Queue grows continuously, workers show as busy, latency increases
Root cause: Usually missing timeouts on network calls—a job waiting on an unresponsive external API blocks its worker thread indefinitely.
What monitoring catches: Queue latency exceeding threshold while worker utilization remains high
Prevention:
class ExternalApiJob
include Sidekiq::Job
sidekiq_options retry: 3
def perform(url)
# Always set timeouts on external calls
response = HTTP.timeout(connect: 5, read: 10).get(url)
process_response(response)
end
end

Problem 2: Silent job failures
Symptoms: Jobs disappear without errors, work never completes, no retries triggered
Root cause: Custom middleware or rescue blocks catching exceptions without re-raising them
What monitoring catches: Heartbeat monitoring detects when expected job completions don't occur
# BAD: Swallows exceptions silently
def perform(user_id)
process_user(user_id)
rescue StandardError => e
Rails.logger.error(e.message)
# Job appears successful but work didn't complete
end
# GOOD: Log and re-raise to trigger retry
def perform(user_id)
process_user(user_id)
rescue StandardError => e
Rails.logger.error(e.message)
raise # Re-raise to trigger Sidekiq retry
end

Problem 3: Memory bloat
Symptoms: Process memory grows continuously, eventually OOM-killed
Root causes:
- Loading entire tables into memory (User.all.each instead of User.find_each)
- ActiveRecord query cache accumulation
- Memory fragmentation (especially with glibc malloc)
What monitoring catches: Memory metrics exceeding baseline, process restarts
Prevention:
class LargeDatasetJob
include Sidekiq::Job
def perform
# BAD: Loads all records into memory
# User.all.each { |user| process(user) }
# GOOD: Processes in batches of 1000
User.find_each(batch_size: 1000) do |user|
process(user)
end
end
end

For memory fragmentation, set MALLOC_ARENA_MAX=2 in your environment to limit the number of glibc malloc arenas and reduce fragmentation-driven memory growth.
Problem 4: Queue starvation
Symptoms: High-priority jobs wait while long-running jobs consume all workers
Root cause: Long-running jobs in shared queues block other job types
What monitoring catches: Latency spikes on specific queues, throughput drops
Prevention: Use dedicated queues with weight-based processing:
# config/sidekiq.yml
:queues:
- [critical, 10]
- [default, 5]
- [bulk, 1]
# Route long-running jobs to dedicated queue
class PodcastImportJob
include Sidekiq::Job
sidekiq_options queue: :bulk
end

Problem 5: Jobs lost during deploys
Symptoms: Jobs that were in progress during a deploy never complete
Root cause: In Sidekiq OSS, the worker pops each job from Redis before processing it. A clean shutdown pushes unfinished jobs back onto the queue, but if the process crashes or is SIGKILLed mid-job, that job is gone.
What monitoring catches: Job completion rates drop during deploy windows, heartbeats miss expected check-ins
Prevention for Heroku (dynos get 30 seconds after SIGTERM before SIGKILL):
# config/sidekiq.yml
:timeout: 25 # Sidekiq's default; keep it below your platform's kill window

For critical jobs, consider Sidekiq Pro's reliable fetch (super_fetch), which uses RPOPLPUSH to move each in-progress job to a per-process working queue so it can be recovered after a crash.
Problem 6: Transaction race conditions
Symptoms: ActiveRecord::RecordNotFound errors on recently created records
Root cause: Job executes before the database transaction that created the record commits
# BAD: Job may run before transaction commits
User.transaction do
user = User.create!(params)
WelcomeEmailJob.perform_async(user.id) # May fail with RecordNotFound
end
# GOOD: Use after_commit callback
class User < ApplicationRecord
after_commit :send_welcome_email, on: :create
private
def send_welcome_email
WelcomeEmailJob.perform_async(id)
end
end
# ALSO GOOD: Enable transactional push (Sidekiq 7.1+)
# config/initializers/sidekiq.rb
Sidekiq.transactional_push!

Native Sidekiq monitoring options
Sidekiq Web UI (Free)
The Web UI provides a real-time dashboard showing:
- Processed and failed job counts
- Queue sizes and latency
- Busy workers and their current jobs
- Scheduled, retry, and dead job queues
Setup:
# config/routes.rb
require 'sidekiq/web'
# With Devise authentication
authenticate :user, ->(user) { user.admin? } do
mount Sidekiq::Web => '/sidekiq'
end
# Or with HTTP Basic Auth
Sidekiq::Web.use Rack::Auth::Basic do |username, password|
ActiveSupport::SecurityUtils.secure_compare(username, ENV['SIDEKIQ_USER']) &
ActiveSupport::SecurityUtils.secure_compare(password, ENV['SIDEKIQ_PASSWORD'])
end
mount Sidekiq::Web => '/sidekiq'

Limitations:
- Point-in-time view only—no historical data or trends
- No alerting capabilities—requires manual checking
- Won't tell you when problems started or how they evolved
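If you only need lightweight trend data, one workaround is to snapshot the stats API (covered next) on a schedule and ship the numbers to your log aggregator. A sketch, assuming a hypothetical SidekiqStatsSnapshotJob scheduled every few minutes:

# Sketch: periodically record Sidekiq stats as structured logs for trend analysis.
require 'sidekiq/api'

class SidekiqStatsSnapshotJob
  include Sidekiq::Job

  def perform
    stats = Sidekiq::Stats.new
    Rails.logger.info({
      event: 'sidekiq.stats.snapshot',
      processed: stats.processed,
      failed: stats.failed,
      enqueued: stats.enqueued,
      retry_size: stats.retry_size,
      dead_size: stats.dead_size,
      default_latency: Sidekiq::Queue.new('default').latency.round(2)
    }.to_json)
  end
end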
Sidekiq API
For programmatic access, Sidekiq provides a comprehensive stats API:
require 'sidekiq/api'
stats = Sidekiq::Stats.new
stats.processed # Total jobs processed
stats.failed # Total jobs failed
stats.enqueued # Jobs currently enqueued
stats.scheduled_size # Jobs scheduled for future
stats.retry_size # Jobs in retry queue
stats.dead_size # Jobs in dead queue
stats.processes_size # Number of running processes
# Historical data (last 5 days)
history = Sidekiq::Stats::History.new(5)
history.processed # Hash of date => count
history.failed # Hash of date => count

Sidekiq Pro and Enterprise
Sidekiq Pro ($99/month) adds:
- Reliable fetch (jobs not lost on crash)
- Batch jobs with completion callbacks
- DogStatsD metrics export
Sidekiq Enterprise (from $229/month) adds:
- Historical metrics retention
- Periodic jobs (built-in cron)
- Multi-process management with auto-restart
- Rate limiting with visibility
Setting up external monitoring
Native monitoring tells you what's happening now. External monitoring alerts you when things go wrong and provides historical context for debugging.
Approach 1: Heartbeat monitoring
Heartbeat (or "dead man's switch") monitoring works by expecting regular check-ins. If a check-in doesn't arrive on schedule, you get alerted.
This approach is ideal for:
- Scheduled jobs that should run at specific intervals
- Critical jobs that must complete successfully
- Jobs where silence (no errors) could indicate a problem
Basic implementation:
class DailyReportJob
include Sidekiq::Job
def perform
generate_report
# Ping monitoring service on success
uri = URI.parse(ENV['CRONRADAR_PING_URL'])
Net::HTTP.get_response(uri)
rescue StandardError => e
# Ping failure endpoint
uri = URI.parse("#{ENV['CRONRADAR_PING_URL']}/fail")
Net::HTTP.get_response(uri)
raise
end
end

With timeout tracking:
class CriticalSyncJob
include Sidekiq::Job
EXPECTED_DURATION = 300 # 5 minutes
def perform
start_time = Time.current
# Notify job started
ping_monitor("/start")
perform_sync
duration = Time.current - start_time
if duration > EXPECTED_DURATION
ping_monitor("/fail?message=exceeded_duration")
else
ping_monitor
end
rescue StandardError => e
ping_monitor("/fail?message=#{CGI.escape(e.message)}")
raise
end
private
def ping_monitor(path = "")
uri = URI.parse("#{ENV['CRONRADAR_PING_URL']}#{path}")
Net::HTTP.get_response(uri)
rescue => e
Rails.logger.warn "Monitor ping failed: #{e.message}"
end
end

Approach 2: Custom middleware
Middleware provides monitoring for all jobs without modifying individual job classes:
# config/initializers/sidekiq.rb
class MonitoringMiddleware
include Sidekiq::ServerMiddleware
def call(job_instance, job_payload, queue)
started_at = Time.current
success = false
error_message = nil
begin
yield
success = true
rescue StandardError => e
error_message = e.message
raise
ensure
record_metrics(
job_class: job_payload['class'],
queue: queue,
jid: job_payload['jid'],
duration: Time.current - started_at,
enqueued_at: job_payload['enqueued_at'],
success: success,
error: error_message
)
end
end
private
def record_metrics(job_class:, queue:, jid:, duration:, enqueued_at:, success:, error:)
latency = enqueued_at ? Time.current.to_f - enqueued_at : nil
# Send to your monitoring service
Rails.logger.info({
event: 'sidekiq.job.completed',
job_class: job_class,
queue: queue,
jid: jid,
duration_ms: (duration * 1000).round,
latency_ms: latency ? (latency * 1000).round : nil,
success: success,
error: error
}.to_json)
end
end
Sidekiq.configure_server do |config|
config.server_middleware do |chain|
chain.add MonitoringMiddleware
end
end

Approach 3: Death handlers
Get alerted when jobs exhaust all retries and move to the dead queue:
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
config.death_handlers << ->(job, exception) do
job_class = job['class']
job_id = job['jid']
error = exception.message
# Send alert
SlackNotifier.alert(
channel: '#sidekiq-alerts',
text: "🔴 Job permanently failed: #{job_class}\nJID: #{job_id}\nError: #{error}"
)
# Ping monitoring service
uri = URI.parse("#{ENV['CRONRADAR_PING_URL']}/#{job_class.underscore}/fail")
Net::HTTP.get_response(uri)
rescue => e
Rails.logger.error "Death handler failed: #{e.message}"
end
end

Approach 4: Health check endpoint
Expose Sidekiq health for load balancers and external monitoring:
# config/routes.rb
get '/health/sidekiq', to: 'health#sidekiq'
# app/controllers/health_controller.rb
class HealthController < ApplicationController
skip_before_action :authenticate_user!
def sidekiq
checks = {
queues: check_queues,
redis: check_redis,
processes: check_processes
}
healthy = checks.values.all? { |c| c[:status] == 'ok' }
render json: {
status: healthy ? 'ok' : 'degraded',
checks: checks,
timestamp: Time.current.iso8601
}, status: healthy ? 200 : 503
end
private
def check_queues
critical_queue = Sidekiq::Queue.new('critical')
default_queue = Sidekiq::Queue.new('default')
latency_ok = critical_queue.latency < 30 && default_queue.latency < 60
{
status: latency_ok ? 'ok' : 'degraded',
critical_latency: critical_queue.latency.round(2),
default_latency: default_queue.latency.round(2)
}
end
def check_redis
  started_at = Time.current
  Sidekiq.redis { |conn| conn.ping }
  { status: 'ok', latency_ms: ((Time.current - started_at) * 1000).round(1) }
rescue Redis::BaseError => e
  { status: 'error', message: e.message }
end
def check_processes
processes = Sidekiq::ProcessSet.new
{
status: processes.size > 0 ? 'ok' : 'error',
count: processes.size,
busy: processes.sum { |p| p['busy'] }
}
end
end

Approach 5: Prometheus + Grafana
For self-hosted metrics and dashboards, use yabeda-sidekiq:
# Gemfile
gem 'yabeda-sidekiq'
gem 'yabeda-prometheus'
# config/initializers/yabeda.rb
Yabeda.configure do
# Custom metrics if needed
gauge :custom_queue_size do
description "Custom queue size metric"
tags [:queue]
end
end
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |_config|
Yabeda::Prometheus::Exporter.start_metrics_server!
end

Pre-built Grafana dashboard available: ID 11667
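The custom gauge defined above still needs to be populated. A sketch using Yabeda's collect hook, which runs each time Prometheus scrapes the metrics endpoint (yabeda-sidekiq already exports standard queue metrics, so this is only needed for your own gauges):

# config/initializers/yabeda.rb — collect blocks can live alongside metric definitions
Yabeda.configure do
  collect do
    # Refresh the custom gauge on every scrape
    Sidekiq::Queue.all.each do |queue|
      Yabeda.custom_queue_size.set({ queue: queue.name }, queue.size)
    end
  end
end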
Best practices by application scale
Small applications (< 10,000 jobs/day)
Recommended stack:
- Sidekiq Web UI for visibility
- Error tracking (Sentry, Honeybadger)
- Heartbeat monitoring for critical scheduled jobs
Alert thresholds:
- Queue latency > 30 seconds
- Any job in dead queue
- Daily job count below baseline
Implementation:
# Monitor critical scheduled jobs with heartbeat pings
class NightlyBackupJob
include Sidekiq::Job
def perform
perform_backup
Net::HTTP.get(URI(ENV['BACKUP_JOB_PING_URL']))
end
end

Medium applications (10,000 - 500,000 jobs/day)
Recommended stack:
- APM tool (AppSignal, Scout, New Relic)
- Heartbeat monitoring for all scheduled jobs
- Custom health check endpoint
- Slack/PagerDuty integration for alerts
Alert thresholds:
- Queue latency > 60 seconds (warning), > 120 seconds (critical)
- Failure rate > 1%
- Memory growth > 50% of baseline
- Retry queue > 500 jobs
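A sketch of a periodic check for the failure-rate and retry-queue thresholds above. The Rails.cache counter storage and the alert helper are assumptions—swap in your Slack or PagerDuty client:

# Sketch: compare current counters against the last run to estimate failure rate.
require 'sidekiq/api'

class SidekiqThresholdCheckJob
  include Sidekiq::Job

  def perform
    stats = Sidekiq::Stats.new
    previous = Rails.cache.read('sidekiq_threshold_check') ||
               { processed: stats.processed, failed: stats.failed }

    processed_delta = stats.processed - previous[:processed]
    failed_delta = stats.failed - previous[:failed]
    failure_rate = processed_delta.positive? ? failed_delta.to_f / processed_delta : 0.0

    alert("Failure rate #{(failure_rate * 100).round(2)}% since last check") if failure_rate > 0.01
    alert("Retry queue at #{stats.retry_size} jobs") if stats.retry_size > 500

    Rails.cache.write('sidekiq_threshold_check', { processed: stats.processed, failed: stats.failed })
  end

  private

  def alert(message)
    Rails.logger.warn("[sidekiq-alert] #{message}") # replace with your Slack/PagerDuty call
  end
end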
High-volume applications (> 500,000 jobs/day)
Recommended stack:
- Prometheus + Grafana for metrics
- Sidekiq Enterprise for historical data
- Dedicated Redis instance with monitoring
- Autoscaling based on queue metrics (see the sketch after the thresholds table)
Alert thresholds:
| Metric | Warning | Critical |
|---|---|---|
| Queue latency | > 60s | > 300s |
| Failure rate | > 0.5% | > 2% |
| Memory usage | > 80% | > 95% |
| Redis memory | > 70% | > 85% |
| Retry queue | > 1,000 | > 5,000 |
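For the autoscaling item above, the decision can be driven by the same queue metrics. This sketch covers the decision logic only—the numbers are illustrative, and the actual scale-up call depends on your platform (Heroku API, Kubernetes HPA, and so on):

# Sketch: decide how many Sidekiq processes you want from backlog and latency.
require 'sidekiq/api'

TARGET_LATENCY   = 60   # seconds; illustrative
MAX_PROCESSES    = 10
JOBS_PER_PROCESS = 250  # rough throughput per process per scaling interval

def desired_sidekiq_processes
  enqueued = Sidekiq::Stats.new.enqueued
  worst_latency = Sidekiq::Queue.all.map(&:latency).max || 0

  # Latency far past target: jump straight to the ceiling
  return MAX_PROCESSES if worst_latency > TARGET_LATENCY * 5

  (enqueued.to_f / JOBS_PER_PROCESS).ceil.clamp(1, MAX_PROCESSES)
end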
Monitoring checklist
Before going to production, verify you have:
Queue visibility:
- [ ] Queue latency monitoring with alerts
- [ ] Queue size baseline established
- [ ] Retry queue growth alerts
- [ ] Dead queue alerts (any job death = investigate)
Job performance:
- [ ] Failure rate tracking
- [ ] Critical job completion monitoring (heartbeats)
- [ ] Job duration baselines for anomaly detection
Infrastructure:
- [ ] Redis memory monitoring
- [ ] Worker memory monitoring
- [ ] Process count verification
- [ ] Health check endpoint for load balancers
Alerting:
- [ ] Alerts route to appropriate channels (Slack, PagerDuty)
- [ ] Escalation path defined for critical alerts
- [ ] Runbook for common alert types
Summary
Sidekiq's reliability depends on visibility. The Web UI shows current state, but production applications need proactive monitoring that catches problems before they impact users.
Start with the basics: monitor queue latency and dead jobs, add heartbeat monitoring for critical scheduled jobs, and set up alerts that reach your team. As your application grows, layer in APM tools and custom metrics.
The goal isn't perfect monitoring—it's catching problems in minutes instead of hours. With the right setup, you'll know about queue backlogs, stuck workers, and failed jobs before anyone else does.