Incidents
Incidents draw attention to problems in the infrastructure. For instance, Step can periodically run plans that test the behavior of a system. When one of these plans fails to execute correctly, or a user-defined assertion about its performance does not hold, this constitutes an incident.
Step allows incidents to be automatically created and managed according to user-defined alerting rules. These rules are evaluated in response to various events (most notably after every plan execution). Thus, a failed execution could lead to the opening of an incident, while a subsequent successful one would automatically close the incident.
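As a rough illustration of this open-on-failure, close-on-success behavior, the sketch below models a single rule as a plain function. It is not Step's rule syntax or API: the function name `evaluate_rule`, its parameters, and the returned action strings are all hypothetical and chosen only for this example.

```python
from typing import Optional


def evaluate_rule(execution_passed: bool, incident_open: bool) -> Optional[str]:
    """Decide what a hypothetical alerting rule does after one execution.

    Returns "open", "close", or None when nothing has to change.
    """
    if not execution_passed and not incident_open:
        return "open"   # first failure: open a new incident
    if execution_passed and incident_open:
        return "close"  # success while an incident is open: close it
    return None         # state already matches the latest result


# Replay a sequence of execution results (True = passed).
incident_open = False
actions = []
for passed in [True, False, False, True]:
    action = evaluate_rule(passed, incident_open)
    if action == "open":
        incident_open = True
    elif action == "close":
        incident_open = False
    actions.append(action)
```

In this replay, only the first failure opens an incident and only the later success closes it; the second failure triggers no state change.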
Incident Lifecycle
An incident as such has only two states:
- When a problem appears (according to the user-defined criteria), an incident is created / opened. It remains open as long as the issue persists.
- When the problem disappears, the respective incident is closed.
In Step, incidents are usually tied to the execution status (passed or failed) of individual plans or schedules: when an execution of a plan or schedule fails, a corresponding incident is opened; it is closed when the execution succeeds again. While an incident is open, the system can optionally record recurrences of the failure. Note that this recording does not change the status of the incident; it merely adds informational records to it.
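The lifecycle described above, including the optional recording of recurrences, can be sketched as a small in-memory model. This is a minimal sketch under stated assumptions, not Step's implementation: the names `Incident`, `IncidentStatus`, `IncidentTracker`, and `on_execution` are hypothetical, and keying incidents by plan name is a simplification made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class IncidentStatus(Enum):
    OPEN = "open"
    CLOSED = "closed"


@dataclass
class Incident:
    plan: str
    status: IncidentStatus = IncidentStatus.OPEN
    # Purely informational records; they never affect the status.
    recurrences: List[str] = field(default_factory=list)


class IncidentTracker:
    """Hypothetical tracker holding at most one open incident per plan."""

    def __init__(self, record_recurrences: bool = True) -> None:
        self.record_recurrences = record_recurrences
        self.open_incidents: Dict[str, Incident] = {}

    def on_execution(self, plan: str, execution_id: str,
                     passed: bool) -> Optional[Incident]:
        incident = self.open_incidents.get(plan)
        if not passed:
            if incident is None:
                # First failure: open a new incident.
                incident = Incident(plan=plan)
                self.open_incidents[plan] = incident
            elif self.record_recurrences:
                # Repeated failure: add a record, status stays OPEN.
                incident.recurrences.append(execution_id)
            return incident
        if incident is not None:
            # Success while an incident is open: close it.
            incident.status = IncidentStatus.CLOSED
            del self.open_incidents[plan]
        return incident


tracker = IncidentTracker()
incident = tracker.on_execution("checkout-plan", "exec-1", passed=False)
tracker.on_execution("checkout-plan", "exec-2", passed=False)
tracker.on_execution("checkout-plan", "exec-3", passed=True)
```

After these three calls, the incident holds one recurrence record (`exec-2`) and is closed; a further successful execution would find no open incident and change nothing.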
Incidents list view
This view lists all incidents created in Step; by default, the filters display only open incidents:
- The incidents menu entry
- The title: clicking it opens the incident details view
- The root cause: clicking it takes you to the root cause (for instance, the corresponding execution)
- The analyze action: opens the analytics dashboards, prefiltered for further analysis of this incident
Incident details view
This view displays the details of a single incident. As in the incidents list view, you have direct access to the root cause and the analyze action for the displayed incident. You can also view all bindings (data) recorded for the events related to the incident: