Centralized Management

Overview

In a distributed OCS involving multiple hosts, it is advantageous to have a way to start and stop Agents without ssh-ing to the various host systems.

The HostManager Agent, in combination with the ocsbow CLI script or the HostManager panel in ocs-web, provides this functionality. When fully configured, the system supports the following:

  • Any OCS Agent in the system can be started and stopped from a single client. This includes support for bringing down all Agents, across the system, without having to connect to multiple hosts.

  • The OCS Agents running on a particular system will start up automatically when the system is booted (even if the Agents are not contained in Docker containers).

  • The basic health of Agents, across the system, can be monitored and individual agents restarted using HostManager panels in ocs-web.

Warning

The HostManager system, once in place, should be the only means by which those managed Agents are started or stopped. For Agents running on the native OS, HostManager will run them as child processes so that it can monitor their states more easily. For Agents running in docker containers, HostManager takes charge of the implicated containers and there will be conflicts if users also try to use docker compose to restart containers.

The main components of this system are:

  • HostManager Agent – an instance of this Agent must be set up for each host (or docker pseudo-host) in the site_config.yaml file.

  • systemd scripts – there should be one systemd script (and launcher script) set up for each HostManager Agent instance; the command-line tool ocs-install-systemd helps with this.

  • ocsbow – the command-line client for communicating with HostManager Agents.

Configuration of HostManager Agents

The HostManager Agents will normally run on the bare systems, rather than in Docker containers. This is because they need to start and stop other processes, including Docker containers, on the system.

To enable full centralized control of your system, there must be an instance of HostManager Agent set up for each host in the Site Config file (SCF). Some hosts in the SCF describe agents running in Docker containers, and normally these are grouped to correspond to a single docker-compose.yaml file. Each such host needs a HostManager set up, though the HostManager runs on the native system and not in a docker container.

Config for native system hosts

Considering the Example Config from OCS Site Config File, the SCF there defines three hosts: host-1, host-1-docker, and host-2. We must add a HostManager block to the 'agent-instances' list in each case. For example, the host-1 block would become:

host-1: {

  # Directory for logs.
  'log-dir': '/simonsobs/log/ocs/',

  # List of additional paths to Agent plugin modules.
  'agent-paths': [
    '/simonsobs/ocs/agents/',
  ],

  # Description of host-1's Agents.
  # We have two readout devices; they are both Lakeshore 240. But they can
  # be distinguished, on startup, by a device serial number.
  # We also have a HostManager.

  'agent-instances': [
    {'agent-class': 'Lakeshore240Agent',
     'instance-id': 'thermo1',
     'arguments': [['--serial-number', 'LSA11AA'],
                   ['--mode', 'idle']]},
    {'agent-class': 'Lakeshore240Agent',
     'instance-id': 'thermo2',
     'arguments': [['--serial-number', 'LSA22BB'],
                   ['--mode', 'acq']]},
    {'agent-class': 'HostManager',
     'instance-id': 'hm-host-1'},
  ]
}

To test the configuration, you can try to launch the HostManager. In a fully configured system, this will be done through systemd. But for initial setup you can use the ocs-local-support program.

Note

When you launch HostManager, it will try to start new processes for each of its managed Agents! So you should shut down any running instances, and be in a state where it’s acceptable to start up new instances.

To launch the HostManager agent for the system you’re logged into, run:

$ ocs-local-support start agent --foreground

You can Ctrl-C out of this to kill the agent. (If you accidentally run this without the --foreground, you can try using ocs-local-support stop agent to stop it.)

To start using ocsbow to communicate with this HostManager, see Communicating with HostManager Agents. To set the HostManager up in systemd (useful especially to have the HostManager and managed agents start up when the system boots), see systemd Control of HostManagers.

Config for docker pseudo-hosts

Considering the Example Config from OCS Site Config File, the host host-1-docker describes agents that are launched in containers using docker compose. For HostManager to best manage these agents, a HostManager should be described in this same host config block. The HostManager won’t run in a docker container – it will run on the host system. In this case the HostManager should have a --docker-compose argument that specifies the docker-compose.yaml file (or multiple, comma-separated, files) containing services to manage.

In addition to adding the HostManager, every other agent instance in the config must include the setting 'manage': 'docker'.

So the host-1-docker block in the site config file would become:

host-1-docker: {

  # Description of host-1's Agents running in Docker containers.
  # We have one readout device; a Lakeshore 372.

  'agent-instances': [
    {'agent-class': 'Lakeshore372Agent',
     'instance-id': 'LSARR00',
     'manage': 'docker',
     'arguments': [['--serial-number', 'LSARR00'],
                   ['--ip-address', '10.10.10.55']]},
    {'agent-class': 'HostManager',
     'instance-id': 'hm-host-1-docker',
     'arguments': [['--initial-state', 'up'],
                   ['--docker-compose', '/home/ocs/site-config/host-1-docker/docker-compose.yaml']]},
  ]
}

To launch this agent, for testing, you can run:

$ ocs-local-support start agent --site-host=host-1-docker --foreground

(The --site-host argument helps ocs-local-support to find the HostManager config in the host-1-docker block of site config, instead of the host-1 block.)

Note

The HostManager process must be running as a user with sufficient privileges to run docker and docker compose. Usually that means that the user must be root, or must be in the “docker” user group. The recommendation is that you add the OCS user to the docker group (see docker-linux-postinstall).
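For example, on most Linux systems the standard post-install step is to add the user to the docker group (here assuming the OCS processes run as user ocs; adjust for your setup):

$ sudo usermod -aG docker ocs

The user will need to log out and back in for the new group membership to take effect.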

In order for HostManager to recognize that services defined in your docker-compose.yaml correspond to certain agent instance_id values, make sure the services are called ocs-[instance_id]. (The choice of ocs- prefix is configurable with a command-line argument to HostManager, and can be set to the empty string if you want). In ocsbow and ocs-web, agents running in docker containers will show up with a [d] appended to their usual agent_class name.
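For example, the LSARR00 instance from the config above would be matched to a service defined like this in docker-compose.yaml (the image name is a placeholder; use whatever image you normally run for that agent):

services:
  # Service name is "ocs-" + instance-id, so HostManager can associate
  # this container with the LSARR00 agent instance in site config.
  ocs-LSARR00:
    image: simonsobs/socs    # placeholder image name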

If HostManager finds services in the docker-compose.yaml that don’t seem to correspond to agent instances in site config, it will still permit them to be “managed” (brought up and down). The agent_class, in ocsbow or ocs-web, will show up as simply “[docker]”.

Advanced host config

The manage setting in the instance description can be used to fine-tune the treatment of each Agent instance by HostManager. For example, to exclude an instance from HostManager tracking and control, specify 'manage': 'ignore'. It is also possible to specify that certain instances should not be started automatically (for example "host/down" or "docker/down"). For information on the available settings for “manage”, see the description in ocs.site_config.InstanceConfig.from_dict().
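For example (the agent classes and instance-ids here are illustrative):

    {'agent-class': 'FakeDataAgent',
     'instance-id': 'faker4',
     'manage': 'ignore'},        # not tracked or controlled by HostManager
    {'agent-class': 'Lakeshore240Agent',
     'instance-id': 'thermo9',
     'manage': 'host/down'},     # managed, but not started automatically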

It is possible to mix host- and docker-based agents in a single host config block, and control them all with a single HostManager instance. Just make sure your docker-based agents are marked with 'manage': 'docker' in site config, and have service name ocs-[instance-id] as usual. Usually, docker-based agents have some command-line parameter overrides set in docker-compose.yaml (or in the site config block), because the crossbar address, as seen from inside the container, differs from the one used on the host. If the hostname inside the docker container is not the same as on the host system, then specify the native host's hostname with the --site-host parameter. In the usual example, an Agent instance in a container would see system hostname host-1-docker, and you’d want to pass --site-host=host-1 so that it finds its config in the host-1 part of the site config file.
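A hypothetical sketch of such an override, assuming an agent image whose entrypoint forwards extra command-line arguments to the agent (the image name and instance-id are illustrative):

services:
  ocs-thermo4:
    image: simonsobs/socs    # placeholder image name
    # Pass --site-host so the agent looks up its config under host-1,
    # not under the container's own hostname.
    command: ["--instance-id", "thermo4", "--site-host", "host-1"]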

Communicating with HostManager Agents

This section describes using the ocsbow command line tool to communicate with all the HostManager agents in an OCS setup. A complementary approach is to use ocs-web; see Using ocs-web with HostManager.

ocsbow is a special client program that knows how to parse the SCF and figure out which HostManager instances are configured in the system. This allows it to query each one (using standard OCS techniques) and present the status of all the managed agents.

Like any other OCS client program, ocsbow needs to be able to find the site config file. (If you have just made changes to the SCF to add HostManager agents, make sure the system you’re running this client on also has access to that updated SCF.)
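OCS clients usually locate the SCF through the OCS_CONFIG_DIR environment variable, so something like the following should work (the path here is illustrative):

$ export OCS_CONFIG_DIR=/home/ocs/site-config
$ ocsbow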

Inspecting status

The basic status display is shown if you run ocsbow. In the example above, the output will look something like this:

$ ocsbow
ocs status
----------

The site config file is :
  /home/ocs/site-config/default.yaml

The crossbar base url is :
  http://my-crossbar-server:8001/call

---------------------------------------------------------------------------
Host: host-1

  [instance-id]                  [agent-class]           [state]   [target]
  hm-host-1                      HostManager                  up        n/a
  thermo1                        Lakeshore240Agent            up         up
  thermo2                        Lakeshore240Agent            up         up

---------------------------------------------------------------------------
Host: host-1-docker

  [instance-id]                  [agent-class]           [state]   [target]
  LSARR00                        Lakeshore372Agent[d]         up         up

---------------------------------------------------------------------------
Host: host-2

  [instance-id]                  [agent-class]           [state]   [target]
  thermo3                        Lakeshore240Agent            up         up
  aggregator                     AggregatorAgent              up         up

The output is interpreted as follows. After an initial statement of what site config file is being used, and the crossbar access address, a block is presented for each host in the SCF. Within each host block, each agent instance-id is listed, along with its agent-class and values for “state” and “target”.

The agent in host-1-docker has the annotation [d] beside its class name, indicating this is an agent managed through a docker container. (The docker service name, in this example, would be ocs-LSARR00.)

If an Agent has been configured with 'manage': 'ignore', it will be marked with suffix [unman] and will have question marks in the state and target fields, e.g.:

[instance-id]        [agent-class]                     [state]   [target]
registry             RegistryAgent[unman]                    ?          ?

If the SCF seen by ocsbow and the information in HostManager are not in agreement, then the agent-class will include two values, connected with a slash. For example, if the local SCF expects the instance to be managed through docker, but the HostManager reports it running on the host, then the line might look like this:

[instance-id]        [agent-class]                             [state]   [target]
LSARR00              Lakeshore372Agent[d]/Lakeshore372Agent         up         up

A managed docker container that has not been associated with a specific instance will show up with agent-class “?/[docker]” and an instance-id corresponding to the service name. For example:

[instance-id]        [agent-class]                     [state]   [target]
influxdb             ?/[docker]                             up         up

state and target

The state column shows whether the Agent is currently running (up) or not (down). This column may also show the value unstable, meaning that the Agent keeps restarting (usually a sign of a code, configuration, or hardware error that causes the agent to crash shortly after start-up). The value may also be ?, indicating that the agent is marked to be run through Docker, but no corresponding docker service has been identified.

For the non-HostManager agents, the target column shows the state that HostManager will try to achieve for that Agent. So if target=up then the HostManager will start the Agent, and keep restarting the Agent if it crashes or otherwise terminates. If target=down then the HostManager will stop the Agent and not restart it. (Note that in the case of Agents in docker containers, the HostManager will use docker and docker compose to monitor the state of containers, and request start or stop in order to match the target state.)

Each HostManager can be commanded to change the target state of Agents it controls; see Start/Stop Agents.

For the HostManager lines, the target will always be n/a and the state will be up, down, or sleeping. When the HostManager appears to be functioning normally, the state will be up. If the HostManager appears to not be running at all, the state will be down. If the HostManager is running but the “manage” Process is not running for some reason, the state will be sleeping.

Start/Stop Agents

To start an Agent, through its HostManager, run ocsbow up, specifying the instance-id. For example:

$ ocsbow up thermo1

The correct HostManager will be contacted and target=up will be set for that Agent instance. Similarly:

$ ocsbow down thermo1

will set target=down for the thermo1 instance.

Start/Stop Batches of Agents

You can pass multiple instance-id targets in a single line, even if they are managed by different HostManagers. For example:

$ ocsbow down thermo1 thermo3

If you pass the instance-id of a HostManager, then the target state will be applied to all its managed agents. So in our example:

$ ocsbow down hm-host-1

is equivalent to:

$ ocsbow down thermo1 thermo2

You can target all the managed agents in a system using the -a (--all) switch:

$ ocsbow down -a    # Bring down all the agents!
$ ocsbow up -a      # Bring up all the agents!

Note that none of these commands will cause the HostManager agents to stop. Restarting HostManagers must be done through another means (the systemd controls, or ocs-local-support).
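For example, to restart a HostManager that is registered as a systemd service (see systemd Control of HostManagers, below):

$ sudo systemctl restart ocs-hostmanager.service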

systemd Control of HostManagers

systemd is widely used on Linux systems to manage services and daemons (and lots of other stuff). The OCS program ocs-install-systemd may be used to help register each HostManager Agent as a systemd service. The systemctl program (part of systemd) can then be used to start and stop the Agent, or to configure it to start automatically on system boot.

Note

Before bothering with systemd, you must already have ocs installed on the host in question, with the site config specified for this host and a HostManager instance properly configured to control agents on the system.

Configuring the systemd service

The service configuration consists of two files, which are described in more detail a little later:

  • The .service file

  • The launcher script

To generate those files, run:

$ hostname
ocs-host5
$ cd $OCS_CONFIG_DIR
$ ocs-install-systemd --service-dir=.
Writing /home/ocs/ocs-site-configs/my-ocs/launcher-hostmanager-ocs-host5.sh ...
Writing ./ocs-hostmanager.service ...

After generating the .service file, copy it to the systemd folder:

$ sudo cp ocs-hostmanager.service /etc/systemd/system/

At this point you should be able to check the “status” of the service:

$ sudo systemctl status ocs-hostmanager.service

It probably won’t say very much. If you’ve updated the service file recently (i.e. reinstalled it, with or without changes), it might recommend that you run systemctl daemon-reload; you should probably do so.
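That is:

$ sudo systemctl daemon-reload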

At this point you might want to jump to Controlling the systemd service. Some additional details about the service file and launcher script are provided here.

One Host, Many Managers

If you need to run two or more HostManagers on one system, you probably also want to have multiple services set up. (This might be the case if you’re using multiple docker-compose.yaml files, or if you have both docker and native system agents running.)

Use arguments --site-host and --service-host to identify which HostManager you mean, and give the services different names:

$ ocs-install-systemd --service-dir=. --service-host=host-1 --site-host=host-1
Writing /home/ocs/ocs-site-configs/my-ocs/launcher-hm-host-1.sh ...
Writing ./ocs-hostmanager-host-1.service ...

$ ocs-install-systemd --service-dir=. --service-host=host-1-docker --site-host=host-1-docker
Writing /home/ocs/ocs-site-configs/my-ocs/launcher-hm-host-1-docker.sh ...
Writing ./ocs-hostmanager-host-1-docker.service ...

The --site-host argument helps the code find the instance_id of the HostManager in the SCF, and to name the launcher script. The --service-host argument is used simply to give the .service file a different filename.

The .service file

The .service file is a service configuration file for systemd, and there are lots of things that could be set up in there. The file created by ocs-install-systemd is minimal, but sufficient. It should look something like this:

[Unit]
Description=OCS HostManager for server5

[Service]
ExecStart=/home/ocs/git/ocs-site-configs/my-lab/launcher-hm-server5.sh
User=ocs
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target

This can be edited further before (or after) it is installed. You can control the hostname (server5 here) and system user (ocs here) that get dropped into the template with the --service-host and --service-user arguments to ocs-install-systemd… or just edit them by hand.
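For example, to set those fields explicitly when generating the files:

$ ocs-install-systemd --service-dir=. --service-host=server5 --service-user=ocs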

If you want to keep copies of the service files in version control, note that while the installed service file can be called ocs-hostmanager.service on each system, the copies in your site config dir will need distinct filenames (probably ocs-hostmanager-<hostname>.service).
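For example, with a per-host copy in the repository (the filename here is illustrative), you might rename it at install time:

$ sudo cp ocs-hostmanager-server5.service /etc/systemd/system/ocs-hostmanager.service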

The launcher script

The launcher script is a bash script that runs HostManager. It is called by systemd when starting the service. Any environment variables or additional command line arguments that need to be set for the HostManager instance can be set in this script. The script should normally be kept with other OCS configuration files, such as the SCF.
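The generated script is specific to your installation, but a hypothetical sketch might look like the following (the paths, instance-id, and exact invocation are illustrative; inspect the script that ocs-install-systemd actually writes for you):

#!/bin/bash
# Hypothetical launcher sketch -- the generated script will differ.
export OCS_CONFIG_DIR=/home/ocs/site-config    # where the SCF lives
exec ocs-agent-cli --instance-id hm-host-1 --site-host host-1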

Strictly speaking, the launcher script is not needed, because the extra configuration (such as environment variables) could instead be placed directly in the .service file. But shell scripts are more familiar to most users, so the default behavior provides one.

Controlling the systemd service

The usual systemctl commands (start, stop, restart, enable, disable) are used to control the service.

Starting and stopping the service

Use the usual systemctl commands to start …:

$ sudo systemctl start ocs-hostmanager.service

… or to stop the service:

$ sudo systemctl stop ocs-hostmanager.service

Checking status

The status of the service (including whether it is running, whether it is enabled, and a few lines from the logs) can be obtained from the “status” command to systemctl:

$ sudo systemctl status ocs-hostmanager.service

Controlling startup on boot

The systemd terminology for “will be launched when system boots” is “enabled”. To enable launch-on-boot:

$ sudo systemctl enable ocs-hostmanager.service

To disable launch-on-boot:

$ sudo systemctl disable ocs-hostmanager.service

Using ocs-web with HostManager

The ocs-web system includes a Panel for HostManager agents. Here’s a screenshot of what that looks like:

[Screenshot: the HostManager panel in ocs-web (ocs_web_hostmanager.png)]

In its current form, the control panel is associated with a single HostManager, and there is no way to broadcast target state requests to multiple HostManagers at once.