|
2 | 2 |
|
3 | 3 | See the role README.md |
4 | 4 |
|
5 | | -# Results/progress |
| 5 | +# CI workflow |
6 | 6 |
|
7 | | -Without any metadata: |
| 7 | +The compute node rebuild is tested in CI after the tests for rebuilding the |
| 8 | +login and control nodes. The process is as follows: |
8 | 9 |
|
9 | | - [root@rl9-compute-0 rocky]# systemctl status ansible-init |
10 | | - ● ansible-init.service |
11 | | - Loaded: loaded (/etc/systemd/system/ansible-init.service; enabled; preset: disabled) |
12 | | - Active: activating (start) since Fri 2024-12-13 20:41:16 UTC; 1min 45s ago |
13 | | - Main PID: 16089 (ansible-init) |
14 | | - Tasks: 8 (limit: 10912) |
15 | | - Memory: 99.5M |
16 | | - CPU: 11.687s |
17 | | - CGroup: /system.slice/ansible-init.service |
18 | | - ├─16089 /usr/lib/ansible-init/bin/python /usr/bin/ansible-init |
19 | | - ├─16273 /usr/lib/ansible-init/bin/python3.9 /usr/lib/ansible-init/bin/ansible-playbook --connection local --inventory 127.0.0.1, /etc/ansible-init/playbooks/1-compute-init.yml |
20 | | - ├─16350 /usr/lib/ansible-init/bin/python3.9 /usr/lib/ansible-init/bin/ansible-playbook --connection local --inventory 127.0.0.1, /etc/ansible-init/playbooks/1-compute-init.yml |
21 | | - ├─16361 /bin/sh -c "/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1734122485.9542894-16350-45936546411977/AnsiballZ_mount.py && sleep 0" |
22 | | - ├─16362 /usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1734122485.9542894-16350-45936546411977/AnsiballZ_mount.py |
23 | | - ├─16363 /usr/bin/mount /mnt/cluster |
24 | | - └─16364 /sbin/mount.nfs 192.168.10.12:/exports/cluster /mnt/cluster -o ro,sync |
| 10 | +1. Compute nodes are reimaged: |
25 | 11 |
|
26 | | - Dec 13 20:41:24 rl9-compute-0.rl9.invalid ansible-init[16273]: ok: [127.0.0.1] |
27 | | - Dec 13 20:41:24 rl9-compute-0.rl9.invalid ansible-init[16273]: TASK [Report skipping initialization if not compute node] ********************** |
28 | | - Dec 13 20:41:25 rl9-compute-0.rl9.invalid ansible-init[16273]: skipping: [127.0.0.1] |
29 | | - Dec 13 20:41:25 rl9-compute-0.rl9.invalid ansible-init[16273]: TASK [meta] ******************************************************************** |
30 | | - Dec 13 20:41:25 rl9-compute-0.rl9.invalid ansible-init[16273]: skipping: [127.0.0.1] |
31 | | - Dec 13 20:41:25 rl9-compute-0.rl9.invalid ansible-init[16273]: TASK [Ensure the mount directory exists] *************************************** |
32 | | - Dec 13 20:41:25 rl9-compute-0.rl9.invalid python3[16346]: ansible-file Invoked with path=/mnt/cluster state=directory owner=root group=root mode=u=rwX,go= recurse=False force=False follow=True modification_time_format=%Y%m%d%H%M.%S access> |
33 | | - Dec 13 20:41:25 rl9-compute-0.rl9.invalid ansible-init[16273]: changed: [127.0.0.1] |
34 | | - Dec 13 20:41:25 rl9-compute-0.rl9.invalid ansible-init[16273]: TASK [Mount /mnt/cluster] ****************************************************** |
35 | | - Dec 13 20:41:26 rl9-compute-0.rl9.invalid python3[16362]: ansible-mount Invoked with path=/mnt/cluster src=192.168.10.12:/exports/cluster fstype=nfs opts=ro,sync state=mounted boot=True dump=0 passno=0 backup=False fstab=None |
36 | | - [root@rl9-compute-0 rocky]# systemctl status ansible-init |
| 12 | + ansible-playbook -v --limit compute ansible/adhoc/rebuild.yml |
37 | 13 |
|
38 | | -Added metadata via horizon: |
| 14 | +2. Ansible-init runs against newly reimaged compute nodes |
39 | 15 |
|
40 | | - compute_groups ["compute"] |
| 16 | +3. Run `sinfo` and check that nodes are in the expected Slurm state: |
41 | 17 |
|
42 | | - |
43 | | -OK: |
44 | | - |
45 | | - [root@rl9-compute-0 rocky]# systemctl status ansible-init |
46 | | - ● ansible-init.service |
47 | | - Loaded: loaded (/etc/systemd/system/ansible-init.service; enabled; preset: disabled) |
48 | | - Active: active (exited) since Fri 2024-12-13 20:43:31 UTC; 33s ago |
49 | | - Process: 16089 ExecStart=/usr/bin/ansible-init (code=exited, status=0/SUCCESS) |
50 | | - Main PID: 16089 (code=exited, status=0/SUCCESS) |
51 | | - CPU: 13.003s |
52 | | - |
53 | | - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16273]: ok: [127.0.0.1] => { |
54 | | - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16273]: "msg": "Skipping compute initialization as cannot mount exports/cluster share" |
55 | | - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16273]: } |
56 | | - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16273]: TASK [meta] ******************************************************************** |
57 | | - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16273]: PLAY RECAP ********************************************************************* |
58 | | - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16273]: 127.0.0.1 : ok=4 changed=1 unreachable=0 failed=0 skipped=1 rescued=0 ignored=1 |
59 | | - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16089]: [INFO] executing remote playbooks for stage - post |
60 | | - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16089]: [INFO] writing sentinel file /var/lib/ansible-init.done |
61 | | - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16089]: [INFO] ansible-init completed successfully |
62 | | - Dec 13 20:43:31 rl9-compute-0.rl9.invalid systemd[1]: Finished ansible-init.service. |
63 | | - |
64 | | -Now run site.yml, then restart ansible-init again: |
65 | | - |
66 | | - |
67 | | - [root@rl9-compute-0 rocky]# systemctl status ansible-init |
68 | | - ● ansible-init.service |
69 | | - Loaded: loaded (/etc/systemd/system/ansible-init.service; enabled; preset: disabled) |
70 | | - Active: active (exited) since Fri 2024-12-13 20:50:10 UTC; 11s ago |
71 | | - Process: 18921 ExecStart=/usr/bin/ansible-init (code=exited, status=0/SUCCESS) |
72 | | - Main PID: 18921 (code=exited, status=0/SUCCESS) |
73 | | - CPU: 8.240s |
74 | | - |
75 | | - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[19110]: TASK [Report skipping initialization if cannot mount nfs] ********************** |
76 | | - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[19110]: skipping: [127.0.0.1] |
77 | | - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[19110]: TASK [meta] ******************************************************************** |
78 | | - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[19110]: skipping: [127.0.0.1] |
79 | | - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[19110]: PLAY RECAP ********************************************************************* |
80 | | - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[19110]: 127.0.0.1 : ok=3 changed=1 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0 |
81 | | - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[18921]: [INFO] executing remote playbooks for stage - post |
82 | | - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[18921]: [INFO] writing sentinel file /var/lib/ansible-init.done |
83 | | - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[18921]: [INFO] ansible-init completed successfully |
84 | | - Dec 13 20:50:10 rl9-compute-0.rl9.invalid systemd[1]: Finished ansible-init.service. |
85 | | - [root@rl9-compute-0 rocky]# ls /mnt/cluster/host |
86 | | - hosts hostvars/ |
87 | | - [root@rl9-compute-0 rocky]# ls /mnt/cluster/hostvars/rl9-compute- |
88 | | - rl9-compute-0/ rl9-compute-1/ |
89 | | - [root@rl9-compute-0 rocky]# ls /mnt/cluster/hostvars/rl9-compute- |
90 | | - rl9-compute-0/ rl9-compute-1/ |
91 | | - [root@rl9-compute-0 rocky]# ls /mnt/cluster/hostvars/rl9-compute-0/ |
92 | | - hostvars.yml |
93 | | - |
94 | | -This commit - shows that hostvars have loaded: |
95 | | - |
96 | | - [root@rl9-compute-0 rocky]# systemctl status ansible-init |
97 | | - ● ansible-init.service |
98 | | - Loaded: loaded (/etc/systemd/system/ansible-init.service; enabled; preset: disabled) |
99 | | - Active: active (exited) since Fri 2024-12-13 21:06:20 UTC; 5s ago |
100 | | - Process: 27585 ExecStart=/usr/bin/ansible-init (code=exited, status=0/SUCCESS) |
101 | | - Main PID: 27585 (code=exited, status=0/SUCCESS) |
102 | | - CPU: 8.161s |
103 | | - |
104 | | - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27769]: TASK [Demonstrate hostvars have loaded] **************************************** |
105 | | - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27769]: ok: [127.0.0.1] => { |
106 | | - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27769]: "prometheus_version": "2.27.0" |
107 | | - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27769]: } |
108 | | - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27769]: PLAY RECAP ********************************************************************* |
109 | | - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27769]: 127.0.0.1 : ok=5 changed=0 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0 |
110 | | - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27585]: [INFO] executing remote playbooks for stage - post |
111 | | - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27585]: [INFO] writing sentinel file /var/lib/ansible-init.done |
112 | | - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27585]: [INFO] ansible-init completed successfully |
113 | | - Dec 13 21:06:20 rl9-compute-0.rl9.invalid systemd[1]: Finished ansible-init.service. |
| 18 | + ansible-playbook -v ansible/ci/check_slurm.yml |
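
As an illustration of what the step 3 check amounts to, here is a minimal Python sketch that parses `sinfo` output and reports nodes in an unexpected state. It is not the contents of `ansible/ci/check_slurm.yml`, which this diff does not show. The function names, the `sinfo` format string, and the assumption that `idle` is the only acceptable state are all hypothetical choices for this sketch.

```python
import subprocess

# Assumption for illustration: only "idle" counts as healthy after a rebuild.
# The real CI check may accept other states.
EXPECTED_STATES = {"idle"}


def parse_sinfo(output):
    """Map node name -> state from `sinfo -h -N -o "%N %T"` style output."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # ignore blank or malformed lines
            node, state = parts
            states[node] = state
    return states


def unexpected_nodes(states, expected=EXPECTED_STATES):
    """Return only the nodes whose state is not in the expected set."""
    return {node: state for node, state in states.items() if state not in expected}


def check_cluster():
    """Run sinfo (requires a Slurm client on PATH) and fail on unexpected states."""
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %T"],
        check=True,
        capture_output=True,
        text=True,
    )
    bad = unexpected_nodes(parse_sinfo(result.stdout))
    if bad:
        raise SystemExit(f"nodes in unexpected state: {bad}")
```

Calling `check_cluster()` on a login or control node would exit non-zero if any node reports a state outside `EXPECTED_STATES`. The parsing is kept separate from the `sinfo` call so it can be exercised without a running cluster.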