mirror of
https://github.com/itme-brain/nixos.git
synced 2026-05-08 14:50:12 -04:00
2 KiB
2 KiB
Rigby Recovery
This directory contains disaster-recovery automation for rigby, the Ubuntu
AI rig at 192.168.0.23.
Scope
This automation manages the host state after a manual Ubuntu install.
It is intended to restore the working state we validated for:
- AMD ROCm
7.2.1 amdgpu-dkmsamdgpu.cwsr_enable=0- pinned ComfyUI checkout
uv-managed Python3.13venv- ROCm PyTorch
- ComfyUI service layout
- output sharing over Samba
- required groups and permissions
Manual Prerequisites
These are intentionally documented, not automated:
- Install Ubuntu
24.04.4 - Update BIOS to the known-good version for the board
- Verify BIOS settings:
Above 4G Decoding = EnabledSVM = Enabled- UEFI boot
- sane PCIe slot configuration
- Ensure host SSH is reachable as
bryan - Ensure passwordless sudo works for
bryan - Ensure the initial DHCP lease is known so recovery can begin
Recovery Flow
- Install Ubuntu manually.
- Clone this repository onto the operator machine.
- From the repo root, run
just rigby-check HOST=<rigby-ip>. - Run
just rigby-recover HOST=<rigby-ip>. - Reboot
rigby. - Validate:
rocminforocm-smi- ComfyUI startup
Notes
- The AMD repo and package installs are automated here, but BIOS and physical host setup remain manual.
- ComfyUI itself is deployed as an application under
/home/comfy/ComfyUI. - The
comfyui.serviceunit is installed but left disabled so the service is started on demand. - Models, LoRAs, VAEs, outputs, and other AI assets are not restored by this
automation.
rigbyis the source of truth for that data, so disaster recovery for models requires a separate backup strategy. - The
justentrypoints acceptHOST=<ip>so recovery does not depend on a fixed DHCP lease. - Recovery installs the configured SSH key for
bryan. - Static IP configuration is applied at the end of the playbook via netplan. The SSH session used for recovery may be interrupted once the new address is applied, and subsequent access should use the final static IP.