11# compute-init
22
3- TODO: describe current status.
3+ The following roles are currently functional:
4+ - resolv_conf
5+ - etc_hosts
6+ - stackhpc.openhpc
47
58# Development
69
710To develop/debug this without actually having to build an image:
811
912
10131 . Deploy a cluster using tofu and ansible/site.yml as normal. This will
11- additionally configure the control node to export compute hosts over NFS.
14+ additionally configure the control node to export compute hostvars over NFS.
1215 Check the cluster is up.
1316
14172 . Reimage the compute nodes:
@@ -22,6 +25,10 @@ To develop/debug this without actually having to build an image:
2225
2326 ansible-playbook ansible/fatimage.yml --tags compute_init
2427
28+ NB: This will also re-export the compute hostvars, as the nodes are not
29+ in the builder group, which conveniently means any changes made to that
30+ play also get picked up.
31+
25325 . Fake a reimage of compute to run ansible-init and the compute-init playbook:
2633
2734 On compute node where metadata was added:
@@ -31,8 +38,9 @@ To develop/debug this without actually having to build an image:
3138
3239 Use `systemctl status ansible-init` to view stdout/stderr from Ansible.
3340
34- Steps 4/5 can be repeated with changes to the compute script. If desirable
35- reimage the compute node(s) first as in step 3.
41+ Steps 4/5 can be repeated with changes to the compute script. If required,
42+ reimage the compute node(s) first as in step 2 and/or add additional metadata
43+ as in step 3.
3644
3745# Results/progress
3846
@@ -144,3 +152,40 @@ This commit - shows that hostvars have loaded:
144152 Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27585]: [INFO] ansible-init completed successfully
145153 Dec 13 21:06:20 rl9-compute-0.rl9.invalid systemd[1]: Finished ansible-init.service.
146154
155+ # Design notes
156+
157+ - In general, we don't want to rely on the NFS export, so we should e.g. copy files
158+ from this mount ASAP in the compute-init script. TODO:
159+ - There are a few possible approaches:
160+
161+ 1 . Control node copies files resulting from role into cluster exports,
162+ compute-init copies to local disk. Only works if files are not host-specific.
163+ Examples: etc_hosts, eessi config?
164+
165+ 2 . Re-implement the role. Works if the role vars are not too complicated
166+ (else they all need to be duplicated in compute-init). Could also only
167+ support certain subsets of role functionality or variables.
168+ Examples: resolv_conf, stackhpc.openhpc
169+
170+
171+ # Problems with templated hostvars
172+
173+ Here are all the variables which actually rely on hostvars from other nodes,
174+ and which therefore aren't available:
175+
176+ ```
177+ [root@rl9-compute-0 rocky]# grep hostvars /mnt/cluster/hostvars/rl9-compute-0/hostvars.yml
178+ "grafana_address": "{{ hostvars[groups['grafana'].0].api_address }}",
179+ "grafana_api_address": "{{ hostvars[groups['grafana'].0].internal_address }}",
180+ "mysql_host": "{{ hostvars[groups['mysql'] | first].api_address }}",
181+ "nfs_server_default": "{{ hostvars[groups['control'] | first ].internal_address }}",
182+ "openhpc_slurm_control_host": "{{ hostvars[groups['control'].0].api_address }}",
183+ "openondemand_address": "{{ hostvars[groups['openondemand'].0].api_address if groups['openondemand'] | count > 0 else '' }}",
184+ "openondemand_node_proxy_directives": "{{ _opeonondemand_unset_auth if (openondemand_auth == 'basic_pam' and 'openondemand_host_regex' and groups['grafana'] | length > 0 and hostvars[ groups['grafana'] | first]._grafana_auth_is_anonymous) else '' }}",
185+ "openondemand_servername": "{{ hostvars[ groups['openondemand'] | first].ansible_host }}",
186+ "prometheus_address": "{{ hostvars[groups['prometheus'].0].api_address }}",
187+ "{{ hostvars[groups['freeipa_server'].0].ansible_host }}"
188+ ```
189+
190+ More generally, there is nothing to stop any group var depending on a
191+ "{{ hostvars[ ] }}" interpolation ...
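
The grep check above could be scripted to make such dependencies easier to spot. The following is a minimal Python sketch and is not part of this repository: the function name `find_cross_host_vars` is hypothetical, and it assumes the exported hostvars file has one JSON-style `"name": "value"` entry per line, as in the output above. It reports every line containing `hostvars[` together with the inventory groups named anywhere on that line.

```python
import re

# Assumption: keyed entries look like  "name": "value",  one per line.
KEYED_LINE = re.compile(r'^\s*"(?P<name>[^"]+)"\s*:')
# Matches groups['<group>'] references inside a template string.
GROUP_REF = re.compile(r"groups\[\s*'([^']+)'\s*\]")


def find_cross_host_vars(lines):
    """Return (name, groups) for each line that contains a hostvars[...] lookup.

    name is None when the line has no "name": prefix (e.g. a list item or a
    continuation line). groups is the sorted list of group names referenced
    anywhere on the line, not only inside the hostvars[...] lookup.
    """
    found = []
    for line in lines:
        if "hostvars[" not in line:
            continue
        match = KEYED_LINE.match(line)
        name = match.group("name") if match else None
        groups = sorted(set(GROUP_REF.findall(line)))
        found.append((name, groups))
    return found


if __name__ == "__main__":
    # Two lines taken from the grep output above.
    sample = [
        "\"mysql_host\": \"{{ hostvars[groups['mysql'] | first].api_address }}\",",
        "\"{{ hostvars[groups['freeipa_server'].0].ansible_host }}\"",
    ]
    for name, groups in find_cross_host_vars(sample):
        print(name, groups)
```

To run this against a real export, the lines could be read from the hostvars file (for example `open(path).read().splitlines()`). Being line-based, it shares grep's limitation: a template split over several lines is only partially reported.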