HostManager Agent
The HostManager Agent helps to manage the many Agent instances that need to run on a single host machine, providing a way to start and stop Agents without connecting to the host system.
For a complete discussion of this Agent and how to best use it, see Centralized Management.
usage: agent.py [-h] [--initial-state {up,down}]
[--docker-compose DOCKER_COMPOSE]
[--docker-service-prefix DOCKER_SERVICE_PREFIX]
[--docker-compose-bin DOCKER_COMPOSE_BIN] [--quiet]
Agent Options
- --initial-state
Possible choices: up, down
Force a single target state for all agents, on start-up.
- --docker-compose
Comma-separated list of docker-compose files to parse and manage.
- --docker-service-prefix
Prefix, to be used in combination with instance-id, for recognizing docker services that correspond to entries in site config.
Default:
'ocs-'
- --docker-compose-bin
Path to docker-compose binary. This will be interpreted as a path relative to current working directory. If not specified, will try to use which docker-compose.
- --quiet
Suppress output to stdout/stderr.
Default:
False
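As a rough illustration of the options above, the following sketch builds an argparse parser that accepts the same flags and splits the comma-separated --docker-compose value into a list. It is not the Agent's own parser: the function names make_parser and split_compose_files are hypothetical, and the defaults of None for the unset options are assumptions.

```python
import argparse


def make_parser():
    """Hypothetical parser mirroring the usage text (not the Agent's own code)."""
    parser = argparse.ArgumentParser(prog='agent.py')
    parser.add_argument('--initial-state', choices=['up', 'down'], default=None)
    parser.add_argument('--docker-compose', default=None)
    parser.add_argument('--docker-service-prefix', default='ocs-')
    parser.add_argument('--docker-compose-bin', default=None)
    parser.add_argument('--quiet', action='store_true')
    return parser


def split_compose_files(value):
    """Split a comma-separated --docker-compose value into a list of paths."""
    if not value:
        return []
    return [path.strip() for path in value.split(',') if path.strip()]


# Example command line; the file names are placeholders.
args = make_parser().parse_args(
    ['--initial-state', 'up', '--docker-compose', 'a.yaml,b.yaml'])
compose_files = split_compose_files(args.docker_compose)
```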
Configuration File Examples
Note that the HostManager Agent usually runs on the native system, and not in a Docker container. (If you did set up a HostManager in a Docker container, it would only be able to start/stop agents within that container.)
OCS Site Config
Here’s an example configuration block:
{'agent-class': 'HostManager',
'instance-id': 'hm-mydaqhost1'}
By convention, the HostManager responsible for host <hostname>
should be given instance-id hm-<hostname>.
Description
Please see Centralized Management.
Agent API
- class ocs.agents.host_manager.agent.HostManager(agent, docker_composes=[], docker_service_prefix='ocs-')[source]
This Agent is used to start and stop OCS-relevant services on a particular host. If the HostManager is launched automatically when a system boots, it can then be used to start up the rest of OCS on that host (either automatically or on request).
- manager(requests=[], reload_config=True)[source]
Process - The “manager” Process maintains a list of child Agents for which it is responsible. In response to requests from a client, the Process will launch or terminate child Agents.
- Parameters:
requests (list) – List of agent instance target state requests; e.g. [('instance1', 'down')]. See description in the update() Task.
reload_config (bool) – When starting up, discard any cached database of tracked agents and rescan the Site Config File. This is mostly for debugging.
Notes
If an Agent process exits unexpectedly, it will be relaunched within a few seconds.
When this Process is started (or restarted), the list of tracked agents and their status is completely reset, and the Site Config File is read in.
Once this process is running, the target states for managed Agents can be manipulated through the update() task.
Note that when a stop is requested on this Process, all managed Agents will be moved to the “down” state and an attempt will be made to terminate them before the Process exits.
The session.data is a dict, with entries:
- ‘child_states’ - list with the status of each managed Agent.
- ‘config_parse_status’ - indicates how recently the various input files have been parsed.
- ‘orphans’ - lists any orphaned (in the sense of docker compose) containers.
- ‘new_tags’ - dict mapping instance_id to new docker image tag, if that tag is not known to docker system. Only populated for tracked instances where the tag is not known.
The ‘child_states’ entry is a list of managed Agent status; for example:
[
  {'next_action': 'up',
   'target_state': 'up',
   'stability': 1.0,
   'agent_class': 'Lakeshore372Agent',
   'instance_id': 'thermo1'},
  {'next_action': 'down',
   'target_state': 'down',
   'stability': 1.0,
   'agent_class': 'ACUAgent',
   'instance_id': 'acu-1'},
  {'next_action': 'up',
   'target_state': 'up',
   'stability': 1.0,
   'agent_class': 'FakeDataAgent[d]',
   'instance_id': 'faker6'},
]
If you are looking for the “current state”, it’s called “next_action” here.
The agent_class may include a suffix [d] or [d?], indicating that the agent is configured to run within a docker container. (The question mark indicates that the HostManager cannot actually identify the docker-compose service associated with the agent description in the SCF.)
The ‘config_parse_status’ is a dict where the key is a docker compose filename, or “[SCF]” for the site config file, and the value is a tuple (success, timestamp, message).
The ‘orphans’ entry is a dict mapping docker container ID to some information about the container. E.g.:
{
  "30027f37e0ef4b...": {
    "compose_file": "/home/ocs/config/docker-compose.yml",
    "service": "ocs-faker3",
    "container_id": "30027f37e0ef4b...",
    "running": true,
    "exit_code": 0,
    "container_found": true,
    "running_image": "sha256:7eaa6d6f6..."
  }
}
The ‘new_tags’ entry looks like this:
{
  "faker1": "simonsobs/ocs:v0.11.3",
  "faker2": "simonsobs/ocs:v0.11.3"
}
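A client inspecting session.data might summarize it along the following lines. This is a minimal sketch that relies only on the keys described above; the helper names and the sample values (timestamps, messages, and the “wait_start” entry) are made up for illustration, and the way session.data is obtained from the Agent is not shown.

```python
def pending_instances(child_states):
    """Return instance_ids whose next_action differs from their target_state."""
    return [c['instance_id'] for c in child_states
            if c['next_action'] != c['target_state']]


def docker_managed(child_states):
    """Return instance_ids whose agent_class carries a '[d]' or '[d?]' suffix."""
    return [c['instance_id'] for c in child_states
            if c['agent_class'].endswith(('[d]', '[d?]'))]


def failed_parses(config_parse_status):
    """Return the input files whose most recent parse was not successful."""
    return [name
            for name, (success, timestamp, message) in config_parse_status.items()
            if not success]


# Illustrative data only; the values below are invented for this example.
session_data = {
    'child_states': [
        {'next_action': 'up', 'target_state': 'up', 'stability': 1.0,
         'agent_class': 'Lakeshore372Agent', 'instance_id': 'thermo1'},
        {'next_action': 'wait_start', 'target_state': 'up', 'stability': 1.0,
         'agent_class': 'FakeDataAgent[d]', 'instance_id': 'faker6'},
    ],
    'config_parse_status': {
        '[SCF]': (True, 1700000000.0, 'ok'),
        '/home/ocs/config/docker-compose.yml': (False, 1700000005.0, 'parse error'),
    },
}

pending = pending_instances(session_data['child_states'])
in_docker = docker_managed(session_data['child_states'])
bad_files = failed_parses(session_data['config_parse_status'])
```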
- update(requests=[], reload_config=False)[source]
Task - Update the target state for any subset of the managed agent instances. Optionally, trigger a full reload of the Site Config File first.
- Parameters:
requests (list) – Default is []. Each entry must be a tuple of the form (instance_id, target_state). The instance_id must be a string that matches an item in the current list of tracked agent instances, or be the string ‘all’, which will match all items being tracked. The target_state must be ‘up’ or ‘down’.
reload_config (bool) – Default is False. If True, the site config file and docker-compose files are reparsed in order to (re-)populate the database of child Agent instances.
Examples
update(requests=[('thermo1', 'down')])
update(requests=[('all', 'up')])
update(reload_config=True)
Notes
Starting and stopping agent instances is handled by the manager() Process; if that Process is not running then no action is taken by this Task and it will exit with an error.
The entries in the requests list are processed in order. For example, the requests [('all', 'up'), ('data1', 'down')] would result in setting all known children to have target_state “up”, except for “data1”, which would be given a target state of “down”.
If reload_config is True, the Site Config File will be reloaded (as described in _reload_config()) before any of the requests are processed.
Managed docker-compose.yaml files are reparsed continuously by the manager process; no specific action is taken with those in this Task. Note that adding/changing the list of docker-compose.yaml files requires restarting the agent.
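The in-order processing of requests can be modeled with a plain dict of target states. The sketch below is not HostManager's implementation; apply_requests is a hypothetical helper, and raising an exception for an unknown instance_id or an invalid state is an assumption made only for this example.

```python
def apply_requests(target_states, requests):
    """Apply (instance_id, target_state) requests, in order, to a copy of a dict."""
    updated = dict(target_states)
    for instance_id, state in requests:
        if state not in ('up', 'down'):
            raise ValueError(f'invalid target_state: {state!r}')
        if instance_id == 'all':
            # 'all' matches every tracked instance.
            for key in updated:
                updated[key] = state
        elif instance_id in updated:
            updated[instance_id] = state
        else:
            raise KeyError(f'unknown instance_id: {instance_id!r}')
    return updated


states = {'thermo1': 'down', 'data1': 'up', 'faker6': 'down'}
result = apply_requests(states, [('all', 'up'), ('data1', 'down')])
reversed_result = apply_requests(states, [('data1', 'down'), ('all', 'up')])
```

Because later entries override earlier ones, reversing the two requests leaves every instance, including data1, with target state “up”.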
- remove_orphans(stop_time=10.)[source]
Task - Use docker stop and docker rm to remove orphaned containers associated with managed docker compose files.
This does not really do any error checking.
- docker_pull()[source]
Task - Use docker compose to pull any (new) images for the managed docker compose files.
- die(disown_dockers=False)[source]
Task - Trigger a shutdown of the manager Process and then stop the reactor, causing the HostManager to exit.
- Parameters:
disown_dockers (bool) – If True, then all tracked docker services will be put in “passive tracking” mode, meaning that they will not be stopped and removed during this shutdown process. This can be used to restart HostManager without needing to also restart all (docker-based) agents on the system.
Supporting APIs
- HostManager._reload_config(session)[source]
This helper function is called by both the manager Process at startup, and the update Task.
The Site Config File is parsed and used to update the internal database of child instances. Any previously unknown child Agent is added to the internal tracking database, and assigned whatever target state is specified for that instance. Any previously known child Agent instance is not modified.
If any child Agent instances in the internal database appear to have been removed from the SCF, then they are set to have target_state “down” and will be deleted from the database when that state is reached.
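The merge rules described here can be illustrated with two dicts keyed by instance_id. This is a simplified model, not the actual helper: merge_site_config and its arguments are hypothetical, the tracking database is reduced to a mapping from instance_id to target state, and the later deletion of instances that have reached “down” is not modeled.

```python
def merge_site_config(tracked, scf_instances):
    """Merge instances parsed from the SCF into a tracked {instance_id: target_state} dict."""
    merged = dict(tracked)
    # Previously unknown instances are added with the target state given for them;
    # previously known instances are left unmodified.
    for instance_id, initial_state in scf_instances.items():
        if instance_id not in merged:
            merged[instance_id] = initial_state
    # Instances that have disappeared from the SCF are set to "down".
    for instance_id in merged:
        if instance_id not in scf_instances:
            merged[instance_id] = 'down'
    return merged


tracked = {'thermo1': 'up', 'old-agent': 'up'}
scf = {'thermo1': 'down', 'faker6': 'up'}
merged = merge_site_config(tracked, scf)
```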
- class ocs.agents.host_manager.drivers.ManagedInstance(management: str, agent_class: str, instance_id: str, full_name: str, operable: bool = False, retired: bool = False, passive_tracking: bool = False, agent_script: str | None = None, prot: object | None = None, restart_required: bool = False, target_state: str = 'down', next_action: str = 'down', at: float = 0, fail_times: ~typing.List = <factory>)[source]
Bases:
object
Tracks the properties of a managed Agent-instance, including how to launch it, the current run state, target state, etc.
- management: str
How the instance is managed; either “host”, “docker”, or “retired”.
- agent_class: str
Agent class name (which may include suffix “[d]” or “[d?]” for docker-managed instances); or simply “[docker]” for services that do not seem to be registered in the SCF.
- instance_id: str
The agent instance’s instance_id, or else the docker service name associated with entry in the SCF.
- full_name: str
Identifier constructed as agent_class:instance_id.
- operable: bool = False
Indicates whether the instance can be manipulated (whether calls to up/down should be expected to work).
- retired: bool = False
Indicates if instance is retired and can be removed from tracking.
- passive_tracking: bool = False
Indicates if instance should be “passively” managed, e.g. not be enforced other than ephemerally to attempt a start / stop. This is expected to only be used for docker-based instances.
- agent_script: str = None
The docker service name, if docker-managed; otherwise the string __plugin__ to indicate it is host managed.
- prot: object = None
The Twisted ProcessProtocol object, if host system managed; or else the DockerContainerHelper if docker-based.
- restart_required: bool = False
Indicates a restart is in order, due to change of docker tag or other new software version.
- target_state: str = 'down'
The run state HostManager is trying to enforce (up, down, passive).
- next_action: str = 'down'
The thing HostManager plans to do next; this will sometimes mirror the current state (up or down) and will sometimes carry a transitional state, such as “wait_start”.
- at: float = 0
Unix timestamp, used by transitional states to indicate time at which some subsequent action should be taken.
- fail_times: List
List of unix timestamps for recent events where an instance stopped unexpectedly; used to identify “unstable” agents.
- ocs.agents.host_manager.drivers.resolve_child_state(minst)[source]
- Parameters:
minst (ManagedInstance) – the instance state information. This will be modified in place.
- Returns:
Dict with important actions for caller to take. Content is:
‘messages’ (list of str): messages for the session.
‘launch’ (bool): whether to launch a new instance.
‘terminate’ (bool): whether to terminate the instance.
‘sleep’ (float): maximum delay before checking back, or None if this machine doesn’t care.
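A caller might act on the returned dict roughly as follows. This is only a sketch of how the four documented keys could be consumed: act_on, the launch and terminate callbacks, and the default_sleep value are hypothetical and are not taken from the HostManager source.

```python
def act_on(actions, launch, terminate, log=print, default_sleep=1.0):
    """Carry out the actions in a resolve_child_state-style dict; return the delay to use."""
    for message in actions['messages']:
        log(message)
    if actions['terminate']:
        terminate()
    if actions['launch']:
        launch()
    sleep = actions['sleep']
    # 'sleep' is a maximum delay; None means no constraint from this instance.
    return default_sleep if sleep is None else min(default_sleep, sleep)


# Record what happens instead of touching any real process.
calls = []
actions = {'messages': ['starting'], 'launch': True,
           'terminate': False, 'sleep': 0.5}
delay = act_on(actions,
               launch=lambda: calls.append('launch'),
               terminate=lambda: calls.append('terminate'),
               log=calls.append)
```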
- ocs.agents.host_manager.drivers.stability_factor(times, window=120)[source]
Given an increasing list of failure times, quantify the stability of the activity.
A single failure, 10 seconds in the past, has a stability factor of 0.5; if there were additional failures before that, the stability factor will be lower.
Returns a culled list of stop times and a stability factor (0 - 1).
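The formula itself is not given here. The sketch below shows one formula that is consistent with the properties stated above (a lone failure 10 seconds ago yields 0.5, and additional failures lower the result); the per-failure penalty of 5 / age, the explicit now argument, and the value 1.0 for an empty list are all assumptions, so the actual driver may behave differently.

```python
def stability_sketch(times, now, window=120):
    """Return (recent failure times, stability in [0, 1]) under an assumed formula."""
    # Keep only failures inside the window and strictly in the past.
    recent = [t for t in times if now - window <= t < now]
    # Assumed penalty: each failure contributes 5 / (seconds since the failure).
    penalty = sum(5.0 / (now - t) for t in recent)
    return recent, max(1.0 - penalty, 0.0)


single = stability_sketch([990.0], now=1000.0)
double = stability_sketch([970.0, 990.0], now=1000.0)
culled = stability_sketch([800.0, 990.0], now=1000.0)
```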
- class ocs.agents.host_manager.drivers.AgentProcessHelper(instance_id, cmd)[source]
Bases:
ProcessProtocol
- connectionMade()[source]
Called when a connection is made.
This may be considered the initializer of the protocol, because it is called when the connection is completed. For clients, this is called once the connection to the server has been established; for servers, this is called after an accept() call stops blocking and a socket has been received. If you need to send any greeting or initial message, do it here.
- class ocs.agents.host_manager.drivers.DockerContainerHelper(service, docker_bin=None)[source]
Bases:
object
Class for managing the docker container associated with some service. Provides some of the same interface as AgentProcessHelper. Pass in a service description dict (such as the ones returned by parse_docker_state).
- ocs.agents.host_manager.drivers.parse_docker_state(docker_compose_file)[source]
Analyze a docker compose.yaml file to get a list of services. Using docker compose ps and docker inspect, determine whether each service is running or not.
- Returns:
- services:
A dict where the key is the service name and each value is a dict with the following entries:
‘compose_file’: the path to the docker compose file
‘service’: service name
‘image_tag’: the tag listed for the image in the compose file (this may differ from the running image).
‘image_id’: the docker image ID corresponding to ‘image_tag’; will be “unknown” if, e.g., listed tag is not yet pulled to the running system.
‘container_found’: bool, indicates whether a container for this service was found (whether or not it was running).
‘container_id’: the docker ID of the container (if found).
‘running’: bool, indicating that the found container is in state “Running”.
‘running_image’: the ID of the image for the container (if found; e.g. “sha:0f…”).
‘exit_code’: int, which is either extracted from the docker inspect output or is set to 127. (This should never be None.)
- orphans:
A dict (by container id) of dicts describing running containers that are associated with this compose file but have apparently been removed from the service list. Key is the service name.
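Given a services dict with the entries listed above, a caller could pick out services that are stopped or that are running an image other than the one the compose file now points to. The helpers below are a sketch that uses only the documented keys; the function names and the sample IDs are invented for illustration.

```python
def stopped_services(services):
    """Return names of services with no container, or with a container that is not running."""
    return sorted(name for name, info in services.items()
                  if not (info['container_found'] and info['running']))


def outdated_services(services):
    """Return names of running services whose container image differs from 'image_id'."""
    return sorted(name for name, info in services.items()
                  if info['running']
                  and info['image_id'] != 'unknown'
                  and info['running_image'] != info['image_id'])


# Illustrative entries; the image IDs here are placeholders, not real hashes.
services = {
    'ocs-faker1': {'container_found': True, 'running': True,
                   'image_id': 'sha256:aaa', 'running_image': 'sha256:aaa'},
    'ocs-faker2': {'container_found': True, 'running': True,
                   'image_id': 'sha256:bbb', 'running_image': 'sha256:aaa'},
    'ocs-faker3': {'container_found': False, 'running': False,
                   'image_id': 'unknown', 'running_image': None},
}

stopped = stopped_services(services)
outdated = outdated_services(services)
```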