The SKA Telescope Manager (TM) is the core package of the SKA Telescope: it is responsible for scheduling observations, controlling their execution, monitoring the health of the telescope, and diagnosing and fixing its faults. To do so, TM interfaces directly with the Local Monitoring and Control systems (LMCs) of the various SKA Elements (e.g. Dishes, Low-Frequency Aperture Array), exchanging commands and data with each of them. TM in turn needs to be monitored and controlled to ensure its continuous and proper operation, and therefore that of the whole SKA Telescope. Indeed, while the unavailability of one or more instances of any other SKA Element should only degrade the operation of the telescope as a whole, a problem in TM could bring all operations to a complete stop. In addition to this higher responsibility, a local monitoring and control system for TM has to collect logging data and display them directly to operators, perform lifecycle management of TM applications and, when possible, deal directly with the management of TM faults (which also includes direct handling of TM status and performance data). This paper discusses the peculiarities of TM monitoring and control and their consequences for the design of the corresponding LMC system.
In the overall SKA architecture, each of the two telescopes (SKA MID and SKA LOW) is composed of several Elements that together cover all required functionalities: DISH and MFAA (Mid Frequency Aperture Array, for SKA MID) and LFAA (Low Frequency Aperture Array, for SKA LOW) are the front-end Elements for direct radiation detection, while Elements such as CSP (Central Signal Processor), SDP (Science Data Processor), SAT (Synchronization And Timing), INFRA (Infrastructure) and SaDT (Signal and Data Transport) are devoted to all other operational and support functionalities. The global orchestration of this large system is performed by a central Element called Telescope Manager (TM).
SKA Elements (level 2) consist of multiple sub-elements (level 3), which in turn can be decomposed into applications (level 4), components (level 5) and so on, down to the line replaceable units (LRUs). Each SKA Element is provided with a Local Monitoring and Control (LMC) system. TM interfaces with each of the SKA Element LMCs to exchange commands and responses, gather monitoring data, events and alarms, and provide capabilities for diagnostics and upgrades. TM itself, like any other SKA Element, is provided with its own Local Monitoring and Control system (TM.LMC), whose purpose is to support the operations of TM. This includes dedicated software that configures, monitors and controls the performance of TM and ensures that diagnostics can be performed. TM.LMC will also perform lifecycle control for the TM applications. In addition to these functions, TM.LMC plays a special role in the overall "SKA LMC scenario": it must guarantee that, even in the case of severe failures of the TM sub-elements (e.g. node failures), TM remains accessible to external operators (engineers) for off-line exploration of monitoring data, diagnostics and troubleshooting, so that it can be restored as soon as possible. Therefore, TM.LMC must have the following responsibilities:
- Eliminate common failure modes with TM: if TM fails, TM.LMC must still be able to resolve and/or report the problem,
- Monitor and control:
  - TM process applications and process initialization,
  - Execution and stopping order,
  - TM processes (process status) and hosts (heartbeat),
- Report the state of the overall TM Element or of its components,
- Configure the TM Element and its components, and support reconfiguration of the TM Element at runtime.
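Some of the responsibilities listed above (ordered start and stop, heartbeat-based process monitoring, and reporting of the overall Element state) can be illustrated with a minimal Python sketch. It is not the TM.LMC design: the names `LmcMonitor`, `ManagedProcess` and `State`, the three-level state scale, the 5-second heartbeat timeout and the use of sub-element names as process labels are all assumptions made for illustration. Timestamps are passed in explicitly so that the logic does not depend on a wall clock.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List, Optional


class State(IntEnum):
    """Hypothetical health states, ordered from best to worst."""
    OK = 0
    DEGRADED = 1
    FAILED = 2


@dataclass
class ManagedProcess:
    """A monitored TM application (illustrative fields only)."""
    name: str
    start_order: int                        # lower values are started first
    running: bool = False
    last_heartbeat: Optional[float] = None  # seconds, on the caller's clock


@dataclass
class LmcMonitor:
    """Toy monitor: ordered start/stop, heartbeats, state roll-up."""
    heartbeat_timeout: float = 5.0          # assumed value, not from the paper
    processes: Dict[str, ManagedProcess] = field(default_factory=dict)

    def register(self, proc: ManagedProcess) -> None:
        self.processes[proc.name] = proc

    def start_all(self, now: float) -> List[str]:
        """Start processes in ascending start_order; return the order used."""
        ordered = sorted(self.processes.values(), key=lambda p: p.start_order)
        for proc in ordered:
            proc.running = True
            proc.last_heartbeat = now
        return [proc.name for proc in ordered]

    def stop_all(self) -> List[str]:
        """Stop processes in the reverse of the start order."""
        ordered = sorted(self.processes.values(),
                         key=lambda p: p.start_order, reverse=True)
        for proc in ordered:
            proc.running = False
        return [proc.name for proc in ordered]

    def heartbeat(self, name: str, now: float) -> None:
        """Record that process `name` reported itself alive at time `now`."""
        self.processes[name].last_heartbeat = now

    def process_state(self, name: str, now: float) -> State:
        """FAILED if stopped or if the last heartbeat is too old, else OK."""
        proc = self.processes[name]
        if not proc.running or proc.last_heartbeat is None:
            return State.FAILED
        if now - proc.last_heartbeat > self.heartbeat_timeout:
            return State.FAILED
        return State.OK

    def element_state(self, now: float) -> State:
        """Roll process states up: all OK, all FAILED, otherwise DEGRADED."""
        states = [self.process_state(name, now) for name in self.processes]
        if all(s is State.OK for s in states):
            return State.OK
        if all(s is State.FAILED for s in states):
            return State.FAILED
        return State.DEGRADED


if __name__ == "__main__":
    monitor = LmcMonitor()
    for order, label in enumerate(["LINFRA", "TELMGT", "OBSMGT"]):
        monitor.register(ManagedProcess(label, start_order=order))
    print(monitor.start_all(now=0.0))
    print(monitor.element_state(now=1.0).name)
```

In such a sketch the monitor only reads heartbeats and never relies on the monitored processes to compute their own state, which reflects the requirement that TM.LMC must not share failure modes with the rest of TM. A real implementation would additionally have to run on resources that survive a TM node failure.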
Figure 1 shows the product breakdown structure of TM, which is composed of LMC and three other sub-elements:
- TELMGT, which is responsible for the management of the telescope hardware;
- OBSMGT, which is responsible for observation management;
- LINFRA, which is responsible for the local infrastructure.
In general, an Element LMC is meant to provide TM with a single point of control: it enables TM to view the Element as a single entity and provides a standard interface to communicate with it. Applying the same view to TM.LMC, however, shows that TM needs a single point of control over itself, which is a rather peculiar requirement. In the following sections, the functional aspects and technical design of TM.LMC are analyzed and described.