So, I have this host group, which consists of 35k VMs.
Aaaand I need to run a playbook over it.
If it matters: the playbook is just a call to a community role that installs node_exporter.
But I'm having a hard time running it across the whole group.
I know that running it as-is on such a huge host group will definitely cause an OOM, so I've made a bunch of attempts to make it both reliable (it finishes and doesn't get killed) and fast (but in reality it isn't).
So here's what I'm doing:
- Using strategy: free
- Using serial: 350
- Collecting only the facts I need:
    gather_facts: true
    gather_subset:
      - "default_ipv4"
      - "system"
      - "service_mgr"
      - "pkg_mgr"
      - "os_family"
      - "selinux"
      - "user"
      - "mounts"
      - "!all"
      - "!min"
- Using -f 350 when calling the playbook, so that it runs against 350 machines simultaneously.
- Using persistent connection settings so that it holds the SSH connections open:
    use_persistent_connections = True
    [ssh_connection]
    pipelining = True
    ssh_args = -o ControlMaster=auto -o ControlPersist=1800s -o PreferredAuthentications=publickey -o ForwardAgent=yes
    [connection]
    ansible_pipelining = True
    [persistent_connection]
    connect_timeout = 1800
And, well... it doesn't work. The biggest problem I see is that it's not really spawning 350 forks at a time. All I see is ~3-5 processes running something on a remote host (the most I've seen was maybe 20), so it's painfully slow. Running it on 350 hosts takes ~1.5h, which is insane, since running this playbook/role on 30 machines takes around 3-4 minutes.
Plus, it's OOMing anyway at some point.
I'm running it on a 32-core / 64 GB RAM VM that is currently dedicated to running this one playbook, and it still OOMs, which is insane.
From my understanding, the serial setting should prevent that, since it would free up some memory after every batch. But it seems it doesn't. Memory usage just constantly grows.
Now I'm running it using a bash script which builds batches of machines and then calls the playbook with -l "machine1:machine2:.....:machine350", but that feels completely wrong.
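A minimal sketch of that kind of batching wrapper, to make the workaround concrete. It is not the actual script: the function names build_limits and run_batches, the input file with one host name per line, and the default batch size of 350 are all illustrative assumptions. Only the -l and -f options of ansible-playbook come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical batching wrapper; function names, file layout and the
# default batch size are assumptions made for illustration.

# Read host names from stdin (one per line) and print one colon-joined
# limit pattern per batch of at most $1 hosts.
build_limits() {
  local size="$1" count=0 line="" host
  while IFS= read -r host; do
    [ -z "$host" ] && continue        # skip blank lines
    if [ "$count" -eq 0 ]; then
      line="$host"
    else
      line="$line:$host"
    fi
    count=$((count + 1))
    if [ "$count" -eq "$size" ]; then
      printf '%s\n' "$line"           # batch is full, emit it
      line=""
      count=0
    fi
  done
  # Emit the final, partially filled batch, if any.
  if [ "$count" -gt 0 ]; then
    printf '%s\n' "$line"
  fi
}

# Run the playbook once per batch. Each batch is a separate
# ansible-playbook process, so its memory is released when it exits.
run_batches() {
  local hosts_file="$1" playbook="$2" size="${3:-350}" limit
  build_limits "$size" < "$hosts_file" | while IFS= read -r limit; do
    # stdin is redirected so ansible-playbook cannot consume the batch list
    ansible-playbook "$playbook" -f "$size" -l "$limit" < /dev/null
  done
}
```

It would be called as something like run_batches hosts.txt node_exporter.yml 350, where both file names are placeholders.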
So my questions are: why am I not able to run the role/playbook on the whole host group at once, why is it so slow, why is it OOMing, and how do I prevent that?
TIA for all the help!