Design Decision: Node starting & stopping
Background / Context
The potential use of a crash shell is relevant to the high availability capabilities of nodes.
1. Use crash shell
Advantages:
- Already built into the node.
- Custom commands could potentially be added.
Disadvantages:
- Won’t work reliably if the node is in an unstable state.
- Not practical for running hundreds of nodes, as our customers are already trying to do.
- Doesn’t mesh with the user access controls of the organisation.
- Doesn’t interface with the existing monitoring and control systems, e.g. Nagios, Geneos ITRS, Docker Swarm.
2. Delegate to external tools
Advantages:
- Doesn’t require change from our customers.
- Will work even if the node is completely stuck.
- Allows scripted node restart schedules.
- Doesn’t raise questions about access control lists and audit.
Disadvantages:
- More uncertainty about what customers do.
- Might place more requirements on us to interact nicely with lots of different products.
- Might mean we get blamed for faults in other people’s control software.
- Doesn’t coordinate with the node for graceful shutdown.
- Doesn’t address any crypto features that target protecting the AMQP headers.
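To make the trade-off concrete, here is a minimal sketch of what an external control tool typically does when it stops or restarts a node process. It is an illustration only: `stop_process` and `restart_process` are hypothetical names, the grace period is an assumed parameter, and nothing here is part of the node itself or of any of the products named above. The sketch shows why this approach works on a stuck process (the operating system can force-kill it) and why it does not coordinate a graceful shutdown (the only notice the process receives is a termination signal).

```python
import subprocess


def stop_process(proc, grace_seconds=10.0):
    """Stop a child process from the outside.

    Sends a termination request first (SIGTERM on POSIX), then force-kills
    the process if it has not exited within grace_seconds.
    Returns "already-exited", "terminated" or "killed".
    """
    if proc.poll() is not None:
        return "already-exited"
    proc.terminate()  # polite request; the process may ignore it or be stuck
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # cannot be ignored, so this works on a hung process
        proc.wait()  # reap the process so it does not linger as a zombie
        return "killed"


def restart_process(command, proc, grace_seconds=10.0):
    """Stop the running process, then start a fresh one from `command`.

    `command` is the argument list used to launch the process (hypothetical;
    supplied by whoever operates the node).
    """
    outcome = stop_process(proc, grace_seconds)
    return outcome, subprocess.Popen(command)
```

A scheduler such as cron, or a monitoring system's restart hook, could call logic of this shape on a timetable, which is what a scripted restart schedule amounts to in practice.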
Recommendation and justification
Proceed with Option 2: Delegate to external tools