Class: Wurk::Swarm
- Inherits: Object
  - Object
    - Wurk::Swarm
- Includes: Component
- Defined in: lib/wurk/swarm.rb, lib/wurk/swarm/child_boot.rb
Overview
Parent supervisor. Forks N children per the worker topology, monitors their PIDs, relays signals, respawns crashed children, performs a rolling restart on SIGUSR1, and recycles RSS-bloated children.
Boot ordering (must be exact — see docs/idea/03-process-model.md):
1. Host app boots fully; eager loads done.
2. Railtie `after_initialize` fires.
3. `boot` closes parent-side connections (Redis, ActiveRecord).
4. `boot` forks N children.
5. Each child reconnects DB + opens a fresh Redis pool, then
installs its own signal handlers and starts the Launcher.
6. Parent calls `supervise` to enter the wait/relay loop.
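The ordering in steps 3 to 5 can be illustrated with a minimal sketch. It assumes a stand-in `FakeConnection` class and a `fork_with_fresh_connections` helper, both hypothetical names that are not part of Wurk; the point is only that the parent closes its connection before forking and each child opens its own afterwards.

```ruby
# Hypothetical stand-in for a pooled connection (Redis, ActiveRecord, ...).
class FakeConnection
  attr_reader :owner_pid

  def initialize
    @owner_pid = Process.pid # remember which process opened the connection
    @open = true
  end

  def close
    @open = false
  end

  def open?
    @open
  end
end

# Closes the parent's connection, forks `count` children, and lets each child
# open its own connection. Returns the child PIDs.
def fork_with_fresh_connections(parent_conn, count)
  parent_conn.close # step 3: no live connection is shared across the fork
  Array.new(count) do
    Process.fork do # step 4
      conn = FakeConnection.new # step 5: fresh connection owned by the child
      ok = conn.open? && conn.owner_pid == Process.pid && !parent_conn.open?
      exit!(ok ? 0 : 1) # exit! skips at_exit handlers inherited from the parent
    end
  end
end

parent_conn = FakeConnection.new
pids = fork_with_fresh_connections(parent_conn, 2)
# Step 6 is reduced here to a plain blocking wait on each child.
statuses = pids.map { |pid| Process.wait2(pid).last }
```

Closing before the fork matters because a forked child inherits the parent's open file descriptors; two processes writing to the same socket would interleave their traffic.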
Signals (see docs/idea/04-signals.md):
TERM/INT → `shutdown` (graceful drain)
TSTP → relay TSTP (pause fetch)
CONT → relay CONT (resume fetch)
USR1 → `rolling_restart` (zero-downtime cycle)
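The table above can be sketched as a dispatch map plus a queue. The constructor shown later on this page creates a `Thread::Queue` named `@signal_queue`, and `supervise` calls `drain_signals`; how Wurk connects the two is not shown here, so the code below is an assumption about the general pattern, with `SIGNAL_ACTIONS` and `install_traps` as hypothetical names.

```ruby
# Signal name => [action, argument], mirroring the table above.
SIGNAL_ACTIONS = {
  'TERM' => [:shutdown, nil],
  'INT'  => [:shutdown, nil],
  'TSTP' => [:relay, 'TSTP'],
  'CONT' => [:relay, 'CONT'],
  'USR1' => [:rolling_restart, nil]
}.freeze

# Trap handlers only enqueue the signal name. The real work happens later,
# on the thread that runs the supervise loop.
def install_traps(queue)
  SIGNAL_ACTIONS.each_key do |name|
    Signal.trap(name) { queue << name }
  end
end

# Drains the queue without blocking and returns the [action, argument]
# pairs in arrival order. Assumes a single consumer.
def drain_signals(queue)
  actions = []
  until queue.empty?
    name = queue.pop(true) # non-blocking pop
    actions << SIGNAL_ACTIONS.fetch(name)
  end
  actions
end
```

Keeping trap handlers this small avoids doing work such as logging or locking in trap context, where Ruby restricts what is allowed.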
Defined Under Namespace
Classes: ChildBoot
Constant Summary
- SUPERVISE_TICK = 0.2
- RESPAWN_BACKOFF = 1.0
- HEARTBEAT_WAIT = 30
- MEMORY_CHECK_INTERVAL = 10
- DEFAULT_SHUTDOWN_TIMEOUT = 25
Instance Attribute Summary
-
#children ⇒ Object
readonly
Returns the value of attribute children.
-
#config ⇒ Object
included
from Component
readonly
Returns the value of attribute config.
-
#topology ⇒ Object
readonly
Returns the value of attribute topology.
Instance Method Summary
-
#boot(install_signals: true) ⇒ Object
install_signals: is false in tests so the integration suite can drive shutdown / rolling_restart directly without poisoning the test process's signal handlers.
- #default_tag(dir = Dir.pwd) ⇒ Object included from Component
-
#fire_event(event, oneshot: true, reverse: false, reraise: false) ⇒ Object
included
from Component
Invokes lifecycle hooks for event.
- #handle_exception(ex, ctx = {}) ⇒ Object included from Component
- #hostname ⇒ Object included from Component
- #identity ⇒ Object included from Component
-
#initialize(topology:, config: Wurk.configuration, memory_limit: config.memory_limit_kb, shutdown_timeout: DEFAULT_SHUTDOWN_TIMEOUT) ⇒ Swarm
constructor
A new instance of Swarm.
-
#leader? ⇒ Boolean
included
from Component
True iff this process currently holds the cluster dear-leader lock.
-
#logger ⇒ Object
included
from Component
--- delegated to config -------------------------------------------.
- #mono_ms ⇒ Object included from Component
- #process_nonce ⇒ Object included from Component
-
#real_ms ⇒ Object
included
from Component
--- clocks ---------------------------------------------------------.
- #redis ⇒ Object included from Component
-
#rolling_restart ⇒ Object
SIGUSR1.
-
#safe_thread(name, priority: nil, &block) ⇒ Object
included
from Component
Spawns a named thread that runs block under watchdog(name).
- #shutdown(timeout: @shutdown_timeout) ⇒ Object
- #supervise ⇒ Object
-
#tid ⇒ Object
included
from Component
--- identity -------------------------------------------------------.
-
#watchdog(last_words) ⇒ Object
included
from Component
Wraps a block at a thread boundary: any unhandled exception is reported via handle_exception (so it lands in error_handlers / the log) and then re-raised.
Constructor Details
#initialize(topology:, config: Wurk.configuration, memory_limit: config.memory_limit_kb, shutdown_timeout: DEFAULT_SHUTDOWN_TIMEOUT) ⇒ Swarm
Returns a new instance of Swarm.
# File 'lib/wurk/swarm.rb', line 39
def initialize(topology:, config: Wurk.configuration, memory_limit: config.memory_limit_kb,
               shutdown_timeout: DEFAULT_SHUTDOWN_TIMEOUT)
  @topology = topology
  @config = config
  @memory_limit = memory_limit
  @shutdown_timeout = shutdown_timeout
  @children = {}
  @assignments = []
  @stopping = false
  @last_memory_check = 0
  @signal_queue = ::Thread::Queue.new
end
Instance Attribute Details
#children ⇒ Object (readonly)
Returns the value of attribute children.
# File 'lib/wurk/swarm.rb', line 37
def children
  @children
end
#config ⇒ Object (readonly) Originally defined in module Component
Returns the value of attribute config.
#topology ⇒ Object (readonly)
Returns the value of attribute topology.
# File 'lib/wurk/swarm.rb', line 37
def topology
  @topology
end
Instance Method Details
#boot(install_signals: true) ⇒ Object
install_signals: is false in tests so the integration suite can
drive shutdown / rolling_restart directly without poisoning the
test process's signal handlers.
# File 'lib/wurk/swarm.rb', line 55
def boot(install_signals: true)
  raise 'Wurk::Swarm already booted' unless @assignments.empty?
  raise ArgumentError, 'Topology has no slots' if @topology.empty?

  @assignments = @topology.assignments.freeze
  close_parent_sockets
  fork_children
  install_signal_handlers if install_signals
  @children.keys
end
#default_tag(dir = Dir.pwd) ⇒ Object Originally defined in module Component
#fire_event(event, oneshot: true, reverse: false, reraise: false) ⇒ Object Originally defined in module Component
Invokes lifecycle hooks for event. Hooks run in registration order
(or LIFO when reverse: true, used for teardown). A raise in one hook
is reported via handle_exception and does NOT stop the next hook unless
reraise: true (used in tests / fail-fast boot). oneshot: true
clears the bucket after dispatch so the event can't fire twice.
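The semantics described above can be approximated in a few lines. This is a sketch under stated assumptions, not Component's actual implementation: `HookRegistry`, `on`, and the `on_error` callable (standing in for handle_exception) are hypothetical names.

```ruby
# Hypothetical registry illustrating the oneshot / reverse / reraise options.
class HookRegistry
  def initialize(on_error: ->(ex, ctx) { warn "#{ctx[:event]}: #{ex.message}" })
    @hooks = Hash.new { |hash, key| hash[key] = [] }
    @on_error = on_error # stand-in for handle_exception
  end

  # Registers a hook for `event`.
  def on(event, &block)
    @hooks[event] << block
  end

  def fire_event(event, oneshot: true, reverse: false, reraise: false)
    hooks = @hooks[event].dup
    hooks.reverse! if reverse # LIFO order, e.g. for teardown
    hooks.each do |hook|
      hook.call
    rescue StandardError => e
      @on_error.call(e, { event: event })
      raise if reraise # fail fast; otherwise continue with the next hook
    end
    @hooks[event].clear if oneshot # a second fire_event becomes a no-op
  end
end
```

In this sketch a re-raised error skips the oneshot clear, so the bucket is only emptied after a dispatch that ran to completion.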
#handle_exception(ex, ctx = {}) ⇒ Object Originally defined in module Component
#hostname ⇒ Object Originally defined in module Component
#identity ⇒ Object Originally defined in module Component
#leader? ⇒ Boolean Originally defined in module Component
True iff this process currently holds the cluster dear-leader lock.
Per spec, the check is performed at call time (Wurk does not cache);
callers must not poll faster than the 60s follower cadence. Returns
false unconditionally when WURK_LEADER=false (or SIDEKIQ_LEADER=false)
is set on the process (opt-out hot-standby). Any Redis error is swallowed →
false, so a transient partition can't propagate as an exception into user
code.
Spec: docs/target/sidekiq-ent.md §6.1.
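The two guard rails described above (environment opt-out and swallowed errors) can be sketched without Redis. The `lock_holder` callable is a hypothetical stand-in for the real lock lookup, which this page does not show; it returns the identity currently holding the lock and may raise.

```ruby
# Returns true only when `identity` holds the lock. The opt-out variables
# short-circuit the check, and any lookup error is treated as "not leader".
def leader?(identity, lock_holder, env: ENV)
  return false if env['WURK_LEADER'] == 'false' || env['SIDEKIQ_LEADER'] == 'false'

  lock_holder.call == identity
rescue StandardError
  false # a transient failure must not surface as an exception to callers
end
```

For example, `leader?('host:1', -> { raise IOError, 'partition' }, env: {})` returns false rather than raising.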
#logger ⇒ Object Originally defined in module Component
--- delegated to config -------------------------------------------
#mono_ms ⇒ Object Originally defined in module Component
#process_nonce ⇒ Object Originally defined in module Component
#real_ms ⇒ Object Originally defined in module Component
--- clocks ---------------------------------------------------------
#redis ⇒ Object Originally defined in module Component
#rolling_restart ⇒ Object
SIGUSR1. For each existing child, fork a replacement, wait for its first heartbeat, then TERM + drain the old one. Long-running jobs in the old slot get the full shutdown_timeout while the replacement is already serving new work.
# File 'lib/wurk/swarm.rb', line 86
def rolling_restart
  @children.dup.each do |old_pid, |
    replacement = fork_child([:slot], [:index])
    @children[replacement] =
    unless wait_for_heartbeat(replacement)
      logger.warn do
        "swarm: replacement #{replacement} heartbeat not seen within #{HEARTBEAT_WAIT}s; proceeding anyway"
      end
    end
    safe_kill(old_pid, 'TERM')
    wait_pid(old_pid, @shutdown_timeout)
    @children.delete(old_pid)
  end
end
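The start-before-stop ordering is the core of the method. The sketch below isolates it from process management by injecting three callables; `rolling_replace`, `spawn`, `ready`, and `stop` are hypothetical names, and the per-child value is treated as an opaque slot because its real structure is not visible in the listing above.

```ruby
# Replaces every entry in `children` one at a time: start the replacement,
# wait until it is ready, and only then stop the old one.
def rolling_replace(children, spawn:, ready:, stop:)
  children.dup.each do |old_id, slot| # iterate a copy; the original is mutated
    new_id = spawn.call(slot)
    children[new_id] = slot
    ready.call(new_id) # e.g. wait for the first heartbeat
    stop.call(old_id)  # e.g. TERM, then wait up to the shutdown timeout
    children.delete(old_id)
  end
  children
end
```

Because the replacement is registered before the old child is stopped, the slot is never left without a serving process.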
#safe_thread(name, priority: nil, &block) ⇒ Object Originally defined in module Component
Spawns a named thread that runs block under watchdog(name). The
parent must retain the returned Thread, because the garbage collector
is not guaranteed to. report_on_exception is disabled on the thread so
its death is not logged twice.
#shutdown(timeout: @shutdown_timeout) ⇒ Object
# File 'lib/wurk/swarm.rb', line 75
def shutdown(timeout: @shutdown_timeout)
  @stopping = true
  relay_signal('TERM')
  wait_for_children(timeout)
  hard_kill_stragglers
end
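The TERM, wait, then hard-kill sequence can be sketched with plain `Process` calls. The helpers relay_signal, wait_for_children, and hard_kill_stragglers are not shown on this page, so `stop_children` and `signal_quietly` below are hypothetical names illustrating the general technique only.

```ruby
# Sends `sig` to `pid`; returns false if the process is already gone.
def signal_quietly(pid, sig)
  Process.kill(sig, pid)
  true
rescue Errno::ESRCH
  false
end

# Asks every child to terminate, polls until `timeout` seconds have passed,
# then KILLs whatever is left. Returns the PIDs that needed a hard kill.
def stop_children(pids, timeout:, tick: 0.05)
  remaining = pids.dup
  remaining.each { |pid| signal_quietly(pid, 'TERM') }

  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  until remaining.empty? || Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    # Process.wait with WNOHANG returns the pid once the child has exited, else nil.
    remaining.reject! { |pid| Process.wait(pid, Process::WNOHANG) }
    sleep tick unless remaining.empty?
  end

  stragglers = remaining.dup
  stragglers.each do |pid|
    signal_quietly(pid, 'KILL')
    Process.wait(pid) # reap so no zombie is left behind
  end
  stragglers
end
```

A monotonic clock is used for the deadline so that wall-clock adjustments cannot shorten or extend the drain window.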
#supervise ⇒ Object
# File 'lib/wurk/swarm.rb', line 66
def supervise
  until done?
    drain_signals
    reap_one_child
    check_memory_pressure
    sleep SUPERVISE_TICK
  end
end
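A loop like this needs a reaping step that never blocks, so signals and memory checks still run on every tick. The body of reap_one_child is not shown on this page; the version below is a hypothetical sketch of a non-blocking reap built on `Process.wait2` with `WNOHANG`, taking the children table as an explicit argument.

```ruby
# Reaps at most one exited child without blocking. Returns a hash describing
# the child, or nil when nothing has exited (or there are no children left).
def reap_one_child(children)
  pid, status = Process.wait2(-1, Process::WNOHANG) # -1 means "any child"
  return nil unless pid

  slot = children.delete(pid) # the caller can respawn into the freed slot
  { pid: pid, slot: slot, status: status }
rescue Errno::ECHILD
  nil # no child processes exist at all
end
```

Reaping one child per tick keeps each loop iteration short; with a 0.2 second tick, a burst of crashes is still drained within a few iterations.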
#tid ⇒ Object Originally defined in module Component
--- identity -------------------------------------------------------
#watchdog(last_words) ⇒ Object Originally defined in module Component
Wraps a block at a thread boundary: any unhandled exception is reported
via handle_exception (so it lands in error_handlers / the log) and then
re-raised. last_words is the component label included in the context.
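The report-then-re-raise behaviour, together with the thread wrapper from safe_thread, can be sketched as follows. This is an illustration under assumptions: the `on_error` callable stands in for handle_exception, and both method signatures differ from Component's because they take that callable explicitly.

```ruby
# Runs the block; any exception is reported with its label and re-raised.
def watchdog(last_words, on_error)
  yield
rescue Exception => e # deliberately broad: this is a thread boundary
  on_error.call(e, { context: last_words })
  raise
end

# Starts a named thread whose body runs under watchdog. Ruby's own
# report_on_exception output is disabled because watchdog already reports.
def safe_thread(name, on_error, &block)
  Thread.new do
    Thread.current.name = name
    Thread.current.report_on_exception = false
    watchdog(name, on_error, &block)
  end
end
```

Because the exception is re-raised, a caller that joins the thread still observes the failure; the report only guarantees it is recorded even if nobody joins.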