HostManager Agent

The HostManager Agent helps to manage the many Agent instances that need to run on a single host machine, providing a way to start and stop Agents without connecting to the host system.

For a complete discussion of this Agent and how to best use it, see Centralized Management.

usage: agent.py [-h] [--initial-state {up,down}]
                [--docker-compose DOCKER_COMPOSE]
                [--docker-service-prefix DOCKER_SERVICE_PREFIX]
                [--docker-compose-bin DOCKER_COMPOSE_BIN] [--quiet]

Agent Options

--initial-state

Possible choices: up, down

Force a single target state for all agents, on start-up.

--docker-compose

Comma-separated list of docker-compose files to parse and manage.

--docker-service-prefix

Prefix, to be used in combination with instance-id, for recognizing docker services that correspond to entries in site config.

Default: 'ocs-'
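As a minimal sketch of the matching rule described above, assuming a docker service corresponds to a site config entry when its name is exactly the prefix followed by the instance-id. The helper name instance_id_from_service is hypothetical and not part of ocs.

```python
def instance_id_from_service(service_name, prefix="ocs-"):
    """Return the instance-id implied by a docker service name, or None.

    Assumption: a service matches a site config entry when its name is
    exactly the prefix followed by the instance-id.
    """
    if service_name.startswith(prefix):
        return service_name[len(prefix):]
    return None


print(instance_id_from_service("ocs-faker3"))  # faker3
print(instance_id_from_service("grafana"))     # None
```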

--docker-compose-bin

Path to docker-compose binary. This will be interpreted as a path relative to the current working directory. If not specified, the agent will try to locate the binary by running “which docker-compose”.
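The lookup order described above might be sketched as follows. This is an illustration only: resolve_compose_bin is a hypothetical helper, and shutil.which stands in for the shell's which command; the agent's own implementation may differ.

```python
import os
import shutil


def resolve_compose_bin(docker_compose_bin=None):
    """Return a path to the docker-compose binary, or None if not found."""
    if docker_compose_bin is not None:
        # A relative path is resolved against the current working directory.
        return os.path.abspath(docker_compose_bin)
    # Fall back to searching PATH, much as "which docker-compose" would.
    return shutil.which("docker-compose")


print(resolve_compose_bin("bin/docker-compose"))
print(resolve_compose_bin())
```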

--quiet

Suppress output to stdout/stderr.

Default: False

Configuration File Examples

Note that the HostManager Agent usually runs on the native system, and not in a Docker container. (If you did set up a HostManager in a Docker container, it would only be able to start/stop agents within that container.)

OCS Site Config

Here’s an example configuration block:

{'agent-class': 'HostManager',
 'instance-id': 'hm-mydaqhost1'}

By convention, the HostManager responsible for host <hostname> should be given instance-id hm-<hostname>.

Description

Please see Centralized Management.

Agent API

class ocs.agents.host_manager.agent.HostManager(agent, docker_composes=[], docker_service_prefix='ocs-')[source]

This Agent is used to start and stop OCS-relevant services on a particular host. If the HostManager is launched automatically when a system boots, it can then be used to start up the rest of OCS on that host (either automatically or on request).

manager(requests=[], reload_config=True)[source]

Process - The “manager” Process maintains a list of child Agents for which it is responsible. In response to requests from a client, the Process will launch or terminate child Agents.

Parameters:
  • requests (list) – List of agent instance target state requests; e.g. [('instance1', 'down')]. See description in update() Task.

  • reload_config (bool) – When starting up, discard any cached database of tracked agents and rescan the Site Config File. This is mostly for debugging.

Notes

If an Agent process exits unexpectedly, it will be relaunched within a few seconds.

When this Process is started (or restarted), the list of tracked agents and their status is completely reset, and the Site Config File is read in.

Once this process is running, the target states for managed Agents can be manipulated through the update() task.

Note that when a stop is requested on this Process, all managed Agents will be moved to the “down” state and an attempt will be made to terminate them before the Process exits.

The session.data is a dict, with entries:

  • ‘child_states’ - list with the status of each managed Agent.

  • ‘config_parse_status’ - indicates how recently the various input files have been parsed.

  • ‘orphans’ - lists any orphaned (in the sense of docker compose) containers.

  • ‘new_tags’ - dict mapping instance_id to new docker image tag, if that tag is not known to docker system. Only populated for tracked instances where the tag is not known.

The ‘child_states’ entry is a list with the status of each managed Agent; for example:

[
  {'next_action': 'up',
   'target_state': 'up',
   'stability': 1.0,
   'agent_class': 'Lakeshore372Agent',
   'instance_id': 'thermo1'},
  {'next_action': 'down',
   'target_state': 'down',
   'stability': 1.0,
   'agent_class': 'ACUAgent',
   'instance_id': 'acu-1'},
  {'next_action': 'up',
   'target_state': 'up',
   'stability': 1.0,
   'agent_class': 'FakeDataAgent[d]',
   'instance_id': 'faker6'},
  ]

If you are looking for the “current state”, it’s called “next_action” here.

The agent_class may include a suffix [d] or [d?], indicating that the agent is configured to run within a docker container. (The question mark indicates that the HostManager cannot actually identify the docker-compose service associated with the agent description in the SCF.)
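A client reading ‘child_states’ might want to strip the docker suffix and find instances that have not yet reached their target state. The sketch below assumes that an instance is settled when next_action equals target_state; both helper names are hypothetical, and the second sample entry is invented to show a transitional state.

```python
def split_agent_class(agent_class):
    """Split 'FakeDataAgent[d]' into ('FakeDataAgent', 'd').

    Returns (name, None) when there is no docker suffix.
    """
    for suffix in ("[d?]", "[d]"):
        if agent_class.endswith(suffix):
            return agent_class[:-len(suffix)], suffix[1:-1]
    return agent_class, None


def unsettled(child_states):
    """List instance_ids whose next_action differs from target_state."""
    return [c["instance_id"] for c in child_states
            if c["next_action"] != c["target_state"]]


# Entries modelled on the example above; the second one is invented to
# show a transitional next_action.
child_states = [
    {"next_action": "up", "target_state": "up", "stability": 1.0,
     "agent_class": "FakeDataAgent[d]", "instance_id": "faker6"},
    {"next_action": "wait_start", "target_state": "up", "stability": 1.0,
     "agent_class": "Lakeshore372Agent", "instance_id": "thermo1"},
]

for child in child_states:
    print(child["instance_id"], split_agent_class(child["agent_class"]))
print(unsettled(child_states))  # ['thermo1']
```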

The ‘config_parse_status’ is a dict where the key is a docker compose filename, or “[SCF]” for the site config file, and the value is a tuple (success, timestamp, message).
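For instance, a client could report which inputs failed to parse. In this sketch only the (success, timestamp, message) shape comes from the description above; the helper name and the sample values are hypothetical.

```python
def failed_parses(config_parse_status):
    """Return {source: message} for every input that failed to parse."""
    return {source: message
            for source, (success, timestamp, message)
            in config_parse_status.items()
            if not success}


# Hypothetical values; only the (success, timestamp, message) shape is
# taken from the documentation.
status = {
    "[SCF]": (True, 1700000000.0, "ok"),
    "/home/ocs/config/docker-compose.yml": (False, 1700000005.0, "parse error"),
}
print(failed_parses(status))
```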

The ‘orphans’ entry is a dict mapping docker container ID to some information about the container. E.g.:

{
  "30027f37e0ef4b...": {
    "compose_file": "/home/ocs/config/docker-compose.yml",
    "service": "ocs-faker3",
    "container_id": "30027f37e0ef4b...",
    "running": true,
    "exit_code": 0,
    "container_found": true,
    "running_image": "sha256:7eaa6d6f6..."
  }
}

The ‘new_tags’ entry looks like this:

{
  "faker1": "simonsobs/ocs:v0.11.3",
  "faker2": "simonsobs/ocs:v0.11.3"
}

update(requests=[], reload_config=False)[source]

Task - Update the target state for any subset of the managed agent instances. Optionally, trigger a full reload of the Site Config File first.

Parameters:
  • requests (list) – Default is []. Each entry must be a tuple of the form (instance_id, target_state). The instance_id must be a string that matches an item in the current list of tracked agent instances, or be the string ‘all’, which will match all items being tracked. The target_state must be ‘up’ or ‘down’.

  • reload_config (bool) – Default is False. If True, the site config file and docker-compose files are reparsed in order to (re-)populate the database of child Agent instances.

Examples

update(requests=[('thermo1', 'down')])
update(requests=[('all', 'up')])
update(reload_config=True)

Notes

Starting and stopping agent instances is handled by the manager() Process; if that Process is not running then no action is taken by this Task and it will exit with an error.

The entries in the requests list are processed in order. For example, the requests [('all', 'up'), ('data1', 'down')] would set the target_state of all known children to “up”, except for “data1”, which would be given a target_state of “down”.
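The in-order semantics can be illustrated with a small stand-alone function. This is a sketch, not the Task's code: apply_requests is a hypothetical name, and raising an exception for an unknown instance_id or target_state is an assumption about error handling.

```python
def apply_requests(target_states, requests):
    """Apply (instance_id, target_state) requests in order.

    target_states maps instance_id to 'up' or 'down'; a new dict is
    returned and the input is left untouched.
    """
    result = dict(target_states)
    for instance_id, state in requests:
        if state not in ("up", "down"):
            raise ValueError(f"invalid target_state: {state!r}")
        if instance_id == "all":
            # 'all' matches every tracked instance.
            for key in result:
                result[key] = state
        elif instance_id in result:
            result[instance_id] = state
        else:
            raise KeyError(instance_id)
    return result


tracked = {"data1": "down", "thermo1": "down"}
print(apply_requests(tracked, [("all", "up"), ("data1", "down")]))
# {'data1': 'down', 'thermo1': 'up'}
```

Because later entries override earlier ones, placing ('all', ...) first and the exceptions after it gives the "everything except" behavior.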

If reload_config is True, the Site Config File will be reloaded (as described in _reload_config()) before any of the requests are processed.

Managed docker-compose.yaml files are reparsed continuously by the manager Process; this Task takes no specific action on them. Note that adding to or changing the list of docker-compose.yaml files requires restarting the agent.

remove_orphans(stop_time=10.)[source]

Task - Use docker stop and docker rm to remove orphaned containers associated with managed docker compose files.

This does not really do any error checking.

docker_pull()[source]

Task - Use docker compose to pull any (new) images for the managed docker compose files.

die(disown_dockers=False)[source]

Task - Trigger a shutdown of the manager Process and then stop the reactor, causing the HostManager to exit.

Parameters:

disown_dockers (bool) – If True, then all tracked docker services will be put in “passive tracking” mode, meaning that they will not be stopped and removed during this shutdown process. This can be used to restart HostManager without needing to also restart all (docker-based) agents on the system.

Supporting APIs

HostManager._reload_config(session)[source]

This helper function is called by both the manager Process at startup, and the update Task.

The Site Config File is parsed and used to update the internal database of child instances. Any previously unknown child Agent is added to the internal tracking database, and assigned whatever target state is specified for that instance. Any previously known child Agent instance is not modified.

If any child Agent instances in the internal database appear to have been removed from the SCF, then they are set to have target_state “down” and will be deleted from the database when that state is reached.
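The merge rules above can be sketched with plain dicts. The data layout here is hypothetical (the real database holds ManagedInstance objects), and marking removed entries with a retired flag is an assumption based on the ManagedInstance field of the same name.

```python
def merge_site_config(tracked, scf_entries):
    """Update the tracking database from the Site Config File entries.

    tracked:     dict of instance_id -> {'target_state': ..., ...}
    scf_entries: dict of instance_id -> target state given in the SCF
    """
    # Previously unknown instances are added with their configured state;
    # already-known instances are left unmodified.
    for instance_id, initial_state in scf_entries.items():
        if instance_id not in tracked:
            tracked[instance_id] = {"target_state": initial_state,
                                    "retired": False}
    # Instances that vanished from the SCF are sent down and flagged so
    # that they can be dropped once they reach that state.
    for instance_id, info in tracked.items():
        if instance_id not in scf_entries:
            info["target_state"] = "down"
            info["retired"] = True
    return tracked


db = {"thermo1": {"target_state": "up", "retired": False},
      "old-agent": {"target_state": "up", "retired": False}}
merge_site_config(db, {"thermo1": "down", "faker6": "up"})
print(db)
```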

class ocs.agents.host_manager.drivers.ManagedInstance(management: str, agent_class: str, instance_id: str, full_name: str, operable: bool = False, retired: bool = False, passive_tracking: bool = False, agent_script: str | None = None, prot: object | None = None, restart_required: bool = False, target_state: str = 'down', next_action: str = 'down', at: float = 0, fail_times: ~typing.List = <factory>)[source]

Bases: object

Tracks the properties of a managed Agent-instance, including how to launch it, the current run state, target state, etc.

management: str

How the instance is managed; either “host”, “docker”, or “retired”.

agent_class: str

Agent class name (which may include suffix “[d]” or “[d?]” for docker-managed instances; or simply “[docker]” for services that do not seem to be registered in the SCF).

instance_id: str

The agent instance’s instance_id, or else the docker service name associated with entry in the SCF.

full_name: str

Identifier constructed as agent_class:instance_id.

operable: bool = False

Indicates whether the instance can be manipulated (whether calls to up/down should be expected to work).

retired: bool = False

Indicates if instance is retired and can be removed from tracking.

passive_tracking: bool = False

Indicates if instance should be “passively” managed, e.g. not be enforced other than ephemerally to attempt a start / stop. This is expected to only be used for docker-based instances.

agent_script: str = None

The docker service name, if docker-managed; otherwise the string __plugin__ to indicate it is host managed.

prot: object = None

The Twisted ProcessProtocol object, if host system managed; or else the DockerContainerHelper if docker-based.

restart_required: bool = False

Indicates a restart is in order, due to change of docker tag or other new software version.

target_state: str = 'down'

The run state HostManager is trying to enforce (up, down, passive).

next_action: str = 'down'

The thing HostManager plans to do next; this will sometimes mirror the current state (up or down) and will sometimes carry a transitional state, such as “wait_start”.

at: float = 0

Unix timestamp, used by transitional states to indicate time at which some subsequent action should be taken.

fail_times: List

List of unix timestamps for recent events where an instance stopped unexpectedly; used to identify “unstable” agents.

ocs.agents.host_manager.drivers.resolve_child_state(minst)[source]

Parameters:

minst (ManagedInstance) – the instance state information. This will be modified in place.

Returns:

Dict with important actions for caller to take. Content is:

  • ‘messages’ (list of str): messages for the session.

  • ‘launch’ (bool): whether to launch a new instance.

  • ‘terminate’ (bool): whether to terminate the instance.

  • ‘sleep’ (float): maximum delay before checking back, or None if this machine doesn’t care.
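A caller might consume the returned dict roughly as follows. This is a sketch under assumptions: apply_actions and shortest_sleep are hypothetical helpers, the launch and terminate callbacks are supplied by the caller, and handling terminate before launch is a choice made here, not documented behavior.

```python
def apply_actions(actions, launch, terminate, log=print):
    """Carry out one actions dict and return its 'sleep' value."""
    for message in actions["messages"]:
        log(message)
    if actions["terminate"]:
        terminate()
    if actions["launch"]:
        launch()
    return actions["sleep"]


def shortest_sleep(sleeps, default=1.0):
    """Smallest non-None delay, or default if no child requested one."""
    requested = [s for s in sleeps if s is not None]
    return min(requested) if requested else default


events = []
actions = {"messages": ["Requested launch for thermo1"],
           "launch": True, "terminate": False, "sleep": 5.0}
delay = apply_actions(actions,
                      launch=lambda: events.append("launch"),
                      terminate=lambda: events.append("terminate"),
                      log=events.append)
print(events, delay)
print(shortest_sleep([delay, None, 2.0]))  # 2.0
```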

ocs.agents.host_manager.drivers.stability_factor(times, window=120)[source]

Given an increasing list of failure times, quantify the stability of the activity.

A single failure, 10 seconds in the past, has a stability factor of 0.5; if there were additional failures before that, the stability factor will be lower.

Returns a culled list of stop times and a stability factor (0 - 1).
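The formula itself is not given here. The sketch below is one formula that is consistent with the description (a single failure 10 seconds ago yields 0.5, and more failures yield less); it is an assumption and not necessarily what ocs computes. The function name, the explicit now argument and the min_age guard are all hypothetical.

```python
def stability_sketch(times, now, window=120.0, min_age=5.0):
    """Return (recent_times, factor), with factor between 0 and 1.

    Assumed formula: each failure within the window contributes a
    penalty of 5 / age_in_seconds, and the penalties are summed.
    """
    # Cull failures that are older than the window.
    recent = [t for t in times if now - t <= window]
    # Clamp the age so that a failure "right now" cannot divide by zero.
    penalty = sum(5.0 / max(now - t, min_age) for t in recent)
    return recent, max(0.0, 1.0 - penalty)


print(stability_sketch([90.0], now=100.0))   # ([90.0], 0.5)
print(stability_sketch([], now=100.0))       # ([], 1.0)
```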

class ocs.agents.host_manager.drivers.AgentProcessHelper(instance_id, cmd)[source]

Bases: ProcessProtocol

connectionMade()[source]

Called when a connection is made.

This may be considered the initializer of the protocol, because it is called when the connection is completed. For clients, this is called once the connection to the server has been established; for servers, this is called after an accept() call stops blocking and a socket has been received. If you need to send any greeting or initial message, do it here.

inConnectionLost()[source]

This will be called when stdin is closed.

processExited(status)[source]

This will be called when the subprocess exits.

outReceived(data)[source]

Some data was received from stdout.

errReceived(data)[source]

Some data was received from stderr.

class ocs.agents.host_manager.drivers.DockerContainerHelper(service, docker_bin=None)[source]

Bases: object

Class for managing the docker container associated with some service. Provides some of the same interface as AgentProcessHelper. Pass in a service description dict (such as the ones returned by parse_docker_state).

update(service)[source]

Update self.status based on service info (in format returned by parse_docker_state).

ocs.agents.host_manager.drivers.parse_docker_state(docker_compose_file)[source]

Analyze a docker compose.yaml file to get a list of services. Using docker compose ps and docker inspect, determine whether each service is running or not.

Returns:

services: A dict where the key is the service name and each value is a dict with the following entries:

  • ‘compose_file’: the path to the docker compose file

  • ‘service’: service name

  • ‘image_tag’: the tag listed for the image in the compose file (this may differ from the running image).

  • ‘image_id’: the docker image ID corresponding to ‘image_tag’; will be “unknown” if, e.g., listed tag is not yet pulled to the running system.

  • ‘container_found’: bool, indicates whether a container for this service was found (whether or not it was running).

  • ‘container_id’: the docker ID of the container (if found).

  • ‘running’: bool, indicating that the found container is in state “Running”.

  • ‘running_image’: the ID of the image for the container (if found; e.g. “sha:0f…”).

  • ‘exit_code’: int, which is either extracted from the docker inspect output or is set to 127. (This should never be None.)

orphans: A dict (by container id) of dicts describing running containers that are associated with this compose file but have apparently been removed from the service list. Key is the service name.
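Given a services dict of the shape described above, a caller could sort services by what needs attention. This sketch makes several assumptions: classify_services is a hypothetical helper, ‘running_image’ and ‘image_id’ are taken to be directly comparable image IDs, and the sample values (including None for a missing container) are invented.

```python
def classify_services(services):
    """Sort service names into 'needs_pull', 'stale' and 'stopped' lists."""
    report = {"needs_pull": [], "stale": [], "stopped": []}
    for name, info in services.items():
        if info["image_id"] == "unknown":
            # The tag in the compose file has not been pulled yet.
            report["needs_pull"].append(name)
        if not (info["container_found"] and info["running"]):
            report["stopped"].append(name)
        elif (info["image_id"] != "unknown"
              and info["running_image"] != info["image_id"]):
            # Running, but on a different image than the compose file names.
            report["stale"].append(name)
    return report


# Invented sample data using the documented keys.
services = {
    "ocs-faker1": {"image_id": "sha256:aaa", "container_found": True,
                   "running": True, "running_image": "sha256:aaa"},
    "ocs-faker2": {"image_id": "sha256:bbb", "container_found": True,
                   "running": True, "running_image": "sha256:aaa"},
    "ocs-faker3": {"image_id": "unknown", "container_found": False,
                   "running": False, "running_image": None},
}
print(classify_services(services))
```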