Alerting rules
Starting with version 24, Step includes a rule-based mechanism that allows users to define flexible reactions to various events.
In a nutshell, rules allow you to define when to react to specific events (conditions), and how (actions).
The following illustration presents the process in a simplified, informal form:
Events
Events are automatically generated by the Step environment and fed into the alerting rules evaluation component. Events contain further details about what happened, in the form of Bindings. These bindings can be used during rule evaluation, e.g. to express conditions based on the value of individual bindings. Furthermore, the bindings are also available when actions are executed; for instance, a mail or webhook sent as a result of a rule can define a template of its content with placeholders that will be appropriately substituted with the concrete binding values.
Events class hierarchy and bindings
Events are structured in a class hierarchy, where individual subclasses add more specialized bindings depending on their context.
Only the leaf classes are actually instantiated (emitted) by the system, but they can conceptually be treated as any of their respective superclasses, and contain all their superclasses’ bindings.
Please note that for simplicity, only three types of bindings are supported: strings, lists of strings, and maps with string keys and string values.
| Event class | Binding (Type) | Description |
|---|---|---|
| AlertingEvent | | Base class of all alerting events |
| | eventClass (String) | Concrete class of the event |
| | eventClasses (List) | All classes of the event (including superclasses) |
| | controllerUrl (String) | URL of the Step controller |
| | projectId (String) | ID of the Step project in which the event occurred |
| | projectName (String) | Name of the Step project in which the event occurred |
| | eventSummary (String) | Short human-readable summary/title of the event |
| ExecutionEvent | (extends AlertingEvent) | Base class of execution events |
| | executionId (String) | ID of the execution |
| | planId (String) | ID of the plan that was executed |
| | executionDescription (String) | Description of the execution, e.g. plan name |
| | executionUrl (String) | URL of the execution |
| | executionUserName (String) | Step user who performed the execution |
| | executionParameters (Map) | Parameters of the execution |
| AbstractExecutionEndedEvent | (extends ExecutionEvent) | Base class of execution ended events |
| | executionStatus (String) | Execution status (e.g. PASSED, TECHNICAL_ERROR) |
| | errorSummary (String) | Human-readable error summary (if applicable) |
| | errorCodes (List) | Error codes (if applicable) |
| ExecutionEndedEvent | (extends AbstractExecutionEndedEvent) | Regular execution ended (e.g. plan was executed). Triggered after the execution of a plan. |
| ScheduledExecutionEndedEvent | (extends AbstractExecutionEndedEvent) | Scheduled execution ended. Triggered after each execution of a schedule. |
| | scheduleId (String) | ID of the schedule that triggered the execution |
| | scheduleName (String) | Name of the schedule that triggered the execution |
| | scheduleStatus (String) | Status of the schedule (after this execution) |
| | scheduleSucceeded (String) | Boolean indicating whether the schedule is considered successful (true/false) |
| | assertionPlanExecutionStatus (String) | Status of the assertion plan execution (if applicable) |
| | assertionPlanExecutionUrl (String) | URL of the assertion plan execution (if applicable) |
| | assertionPlanErrorSummary (String) | Error summary of the assertion plan execution (if applicable) |
| | assertionPlanErrorCodes (List) | Error codes of the assertion plan execution (if applicable) |
| IncidentEvent | (extends AlertingEvent) | Base class of incident events |
| | incidentId (String) | ID of the incident |
| | incidentUrl (String) | URL of the incident |
| | incidentStatus (String) | Status of the incident (OPEN/CLOSED) |
| | incidentTitle (String) | Incident title |
| | incidentCauseEventClass (String) | Concrete class of the event that caused the incident event |
| | incidentCauseEventClasses (List) | All classes of the event that caused the incident event |
| | | See further notes in the description below |
| IncidentOpenedEvent | (extends IncidentEvent) | An incident was opened. Triggered after the opening of an incident. |
| IncidentClosedEvent | (extends IncidentEvent) | An incident was closed. Triggered after the closing of an incident. |
| IncidentRecordedEvent | (extends IncidentEvent) | An already-open incident reoccurred. Triggered after re-observing a related event. |
Notes
- IncidentEvent instances will also copy most of the bindings of their causing event. For instance, if an incident was opened in response to a ScheduledExecutionEndedEvent, it will also contain the scheduleName etc. bindings.
- IncidentRecordedEvents are only informational and do not change the status of incidents. They are emitted when an incident would have been opened, but such an incident is already open. In this case, the occurrence is added as an informational entry to the existing incident.
- After a schedule is executed, the system emits two events: one ExecutionEndedEvent related to the actual plan that was executed, and one ScheduledExecutionEndedEvent related to the schedule itself. These events do not necessarily have the same result status, because the schedule may be subject to the evaluation of an Assertion Plan which determines its status, independently of the status of the underlying plan.
- The bindings can be evaluated in conditions and in actions. See the section on Binding evaluation for more details.
Rules
As mentioned above, Step generates events automatically as they occur in the system. All events are fed into the alerting rules subsystem, but actions are only taken for events which specifically match the defined rules.
Here is an example of a project configured with two rules – one for automatically managing (i.e. opening/closing) incidents based on events emitted from scheduled executions, and one for sending notifications if incidents are opened.
The definition of the first rule is as shown below – as can be seen, it will react to ScheduledExecutionEndedEvents, and perform the action “Open/close incident automatically” as appropriate in response to these events.
Once the first rule is processed, it may in turn itself generate events related to incidents. These events are captured by the second rule, which reacts to IncidentEvents (except if the event is an IncidentRecordedEvent, see the description of conditions below), and actually sends a notification by mail.
Conditions
Every rule must be associated with at least one event class for which it will be evaluated. This can be any class in the hierarchy, but it is recommended to be as specific as possible; in other words, use the most specific subclass suitable for the task.
For example, a rule where the event class is set to the top-level AlertingEvent class will be evaluated for every single event that occurs. If the class is set to IncidentEvent, the rule will only be evaluated for incident-related events (remember that subclasses will also be matched); finally, if it is set to IncidentOpenedEvent, the rule will only be triggered on events where incidents were opened.
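The class matching described above can be pictured as a membership test against the eventClasses binding. The following is only a minimal sketch in Python; the function name and the dictionary layout are illustrative assumptions and not part of Step's API.

```python
# Hypothetical sketch: rule_applies_to and the dict layout are illustrative only.
def rule_applies_to(rule_event_class: str, bindings: dict) -> bool:
    """Return True if the rule's event class is the event's own class or a superclass.

    Relies on the eventClasses binding, which lists the concrete class
    together with all of its superclasses.
    """
    return rule_event_class in bindings.get("eventClasses", [])


opened = {
    "eventClass": "IncidentOpenedEvent",
    "eventClasses": ["IncidentOpenedEvent", "IncidentEvent", "AlertingEvent"],
}

print(rule_applies_to("AlertingEvent", opened))        # broadest class
print(rule_applies_to("IncidentEvent", opened))        # intermediate class
print(rule_applies_to("IncidentClosedEvent", opened))  # unrelated sibling class
```

A rule set to IncidentClosedEvent is never evaluated for this event, which is why choosing the most specific class keeps the number of evaluations low.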
In addition to this required condition which matches the event class, an arbitrary number of further conditions may be specified.
For the time being, only conditions performing checks on binding values are supported, but more condition types may be added in the future if the need arises.
Binding conditions
In order to restrict a rule to matching only specific events (apart from the broad filtering by event class), the content of the bindings present in the event can be evaluated. To give a few examples:
- If one only wishes to react to (regular) executions that were not successful, and ran in the “PROD” environment (which was specified using an execution parameter env), the conditions would (logically) be: eventClass == 'ExecutionEndedEvent', executionStatus != 'PASSED', and executionParameters[env] == 'PROD'.
- To further restrict rules to only match incident events which occur when an incident was opened or closed (but not when an open incident recorded a re-occurrence), a condition on the eventClass binding can be employed, specifying that this binding must not match IncidentRecordedEvent. This is what was done in the second rule shown above. Note that this is functionally equivalent to creating two separate rules (one for IncidentOpenedEvent, one for IncidentClosedEvent) with the same action.
Predicates
All binding conditions require a binding key (i.e., which binding should be considered), and a predicate (i.e., what should be checked). Before going into more detail, here is an example of a slightly complex condition:
As you can see, multiple predicates are available (along with their negated variants):
- equals: performs a simple (String) equality check of the given binding value against a specified value.
- matches regex: checks if the given binding value matches the specified regular expression. Please make sure to provide a syntactically valid regular expression, as validation is only performed at rule evaluation time.
- exists: verifies that the binding exists at all.
For simple String bindings, it should be intuitively clear how this behaves. See the section on Binding evaluation below for more information about the syntax and behavior in the case of List or Map bindings.
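To make the three predicates and their negated variants concrete, here is a rough Python sketch for String bindings. It is not Step's implementation: the function name is hypothetical, and two behaviors are assumptions made for illustration only, namely that the regular expression must match the whole value, and that a missing binding fails the non-negated equals and matches regex checks.

```python
import re
from typing import Optional


# Hypothetical sketch; names and edge-case behavior are assumptions, not Step's API.
def check(predicate: str, value: Optional[str], expected: str = "",
          negate: bool = False) -> bool:
    """Evaluate one predicate against a String binding value (None = binding absent)."""
    if predicate == "exists":
        result = value is not None
    elif predicate == "equals":
        result = value is not None and value == expected
    elif predicate == "matches regex":
        # Assumption: the whole value must match. An invalid pattern raises
        # re.error here, i.e. only when the condition is actually evaluated.
        result = value is not None and re.fullmatch(expected, value) is not None
    else:
        raise ValueError(f"unknown predicate: {predicate}")
    # The negated variant simply inverts the outcome.
    return result != negate


bindings = {"executionStatus": "TECHNICAL_ERROR", "projectName": "Common"}

print(check("equals", bindings.get("executionStatus"), "PASSED", negate=True))
print(check("matches regex", bindings.get("executionStatus"), r"TECHNICAL_.*"))
print(check("exists", bindings.get("scheduleName")))
```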
Actions
Once a rule has been evaluated and all its conditions were found to apply to the respective event, the final stage in the rule processing is the execution of the defined actions. A rule can contain an arbitrary number of actions, all of which will be performed in the specified order.
Initially, two kinds of actions are supported; other actions are expected to become available in the future.
Open/close incident
This action will automatically manage incidents, opening or closing them as needed, based on the outcome of the incoming event.
While it is technically possible to associate this action with any incoming event, it can only properly derive the required information from “execution ended” events (AbstractExecutionEndedEvent or its subclasses in the event hierarchy), and therefore will not have an effect when applied to other events.
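The open/close behavior, together with the IncidentRecordedEvent described in the notes on the event hierarchy, can be summarized as a small state machine. The sketch below is a simplified Python illustration under stated assumptions: the function name, the in-memory set of open incidents, and the failed flag (standing in for however the outcome of the incoming event is derived) are all hypothetical.

```python
from typing import Optional, Set


# Hypothetical sketch of the incident lifecycle; not Step's implementation.
def manage_incident(open_incidents: Set[str], key: str, failed: bool) -> Optional[str]:
    """Update the set of open incidents and return the name of the emitted event class."""
    if failed:
        if key in open_incidents:
            # An incident would have been opened, but one is already open:
            # the occurrence is only recorded, the status does not change.
            return "IncidentRecordedEvent"
        open_incidents.add(key)
        return "IncidentOpenedEvent"
    if key in open_incidents:
        open_incidents.remove(key)
        return "IncidentClosedEvent"
    return None  # success and nothing open: nothing to do


open_incidents: Set[str] = set()
for outcome_failed in (True, True, False, False):
    print(manage_incident(open_incidents, "schedule-1", outcome_failed))
```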
Compound Key
This action has an optional parameter named Compound Key. It is recommended to leave this empty in normal circumstances, as it influences the way that incidents are “grouped” by the system, and the default implementation should be suitable for most use cases. The default behavior is as follows:
- “Regular” execution ended events will use the planId binding as the key, thus grouping incidents by plan.
- Scheduled execution ended events will use the scheduleId binding.
In some cases, you may want to deviate from this default grouping. For example, if you have multiple environments (TEST and PROD), which can be identified via the executionParameters key env, the default grouping would create an incident whenever a plan fails, regardless of the environment. Consider the case where a single plan consistently fails in one environment, but consistently succeeds in another. In this case, incidents would constantly be opened and closed.
One solution would be to add a condition on the environment to the rules, so only specific environments are even considered for auto-managing incidents. Another option, which this parameter allows, is to specify a compound key for the incident grouping, which for this example would be planId, executionParameters[env] (see below for the syntax). This has the effect of using both bindings together as a compound key for identifying incidents, and will effectively manage incidents separately by plan and environment.
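The effect of the compound key on grouping can be illustrated with a short Python sketch. The helper names are hypothetical, and the parsing (comma-separated binding references, with an optional map key in brackets) is an assumption based on the example above rather than a description of Step's actual parser.

```python
# Hypothetical sketch; incident_key and lookup are illustrative names only.
def lookup(bindings: dict, reference: str):
    """Resolve 'name' or 'name[key]' against the bindings (None if absent)."""
    if reference.endswith("]") and "[" in reference:
        name, key = reference[:-1].split("[", 1)
        return bindings.get(name, {}).get(key)
    return bindings.get(reference)


def incident_key(bindings: dict, compound_key: str = "") -> tuple:
    """Compute the grouping key for an execution ended event."""
    if compound_key.strip():
        references = [part.strip() for part in compound_key.split(",")]
    elif "ScheduledExecutionEndedEvent" in bindings.get("eventClasses", []):
        references = ["scheduleId"]  # default for scheduled executions
    else:
        references = ["planId"]      # default for regular executions
    return tuple(lookup(bindings, ref) for ref in references)


test_run = {"eventClasses": ["ExecutionEndedEvent"], "planId": "p1",
            "executionParameters": {"env": "TEST"}}
prod_run = {"eventClasses": ["ExecutionEndedEvent"], "planId": "p1",
            "executionParameters": {"env": "PROD"}}

# Default grouping: both runs share one incident key.
print(incident_key(test_run) == incident_key(prod_run))
# Compound key: the two environments are tracked separately.
print(incident_key(test_run, "planId, executionParameters[env]"))
print(incident_key(prod_run, "planId, executionParameters[env]"))
```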
Send Notification
This action sends a notification using one of the supported Notification mechanisms (E-Mail or Webhook), chosen by selecting the corresponding Notification Preset.
Note that depending on the definition of the selected Notification Preset (which may or may not include data definitions for all necessary fields), you may be allowed to, or even have to, provide the data to use for individual fields (e.g. Mail recipients), so that the content of the notification to send is fully defined at this stage.
Gateway Notification (legacy)
This action sends a notification via a gateway defined using the (deprecated) Notifications mechanism.
You will first need to define a suitable gateway (either Mail or Custom Webhook) in the system settings. For mail gateways (only), the definition of this action requires the list of mail recipients to be specified. Also note that legacy Step Webhook gateways are not supported here.
Binding evaluation
As mentioned, all events contain one or more bindings providing more detailed information about the event. These bindings can be used to define more specific conditions for rules, and they can be used in rule actions.
In rule conditions
In rule conditions where bindings are referenced, it is generally as straightforward as directly using the binding name. However, there are different types of bindings, and depending on the type of binding and the requirements that the condition should express, more complex specifications are possible.
String bindings
String bindings are the simplest case, and the only supported syntax is the binding name, verbatim (example: projectName).
Map bindings
For map bindings (e.g., the executionParameters binding), one usually wishes to inspect a particular value in the map, in which case the respective map key should be appended in brackets (with no further formatting or escaping). This results in a familiar syntax, such as executionParameters[env] or executionParameters[cluster].
A more unusual requirement would be to check all entries (more specifically, their values) in the map, in which case the explicit syntax for such a wildcard check would be executionParameters[*]. In this case, the condition will be considered successful if any of the values satisfies it.
List bindings
List bindings are conceptually similar to map bindings in that they can contain multiple values; the difference is that individual items cannot be identified by a key, but instead by their index (position) in the list.
Thus, a syntax similar to that for map bindings is supported, such as assertionPlanErrorCodes[*] or assertionPlanErrorCodes[0].
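The reference syntax for the three binding types can be sketched as a function that returns the candidate values a reference points at; a condition then succeeds if any candidate satisfies the predicate. This Python sketch is illustrative only: the function name is hypothetical, and the handling of missing keys, out-of-range indexes, and list wildcards (treated like map wildcards) are assumptions.

```python
# Hypothetical sketch; resolve is an illustrative name, not part of Step.
def resolve(bindings: dict, reference: str) -> list:
    """Return the values selected by 'name', 'name[key]', 'name[index]' or 'name[*]'."""
    if reference.endswith("]") and "[" in reference:
        name, selector = reference[:-1].split("[", 1)
        container = bindings.get(name)
        if isinstance(container, dict):
            if selector == "*":
                return list(container.values())  # wildcard: all map values
            return [container[selector]] if selector in container else []
        if isinstance(container, list):
            if selector == "*":
                return list(container)           # wildcard: all list items
            if selector.isdigit() and int(selector) < len(container):
                return [container[int(selector)]]
        return []
    value = bindings.get(reference)
    return [] if value is None else [value]


bindings = {
    "projectName": "Common",
    "executionParameters": {"env": "TEST", "cluster": "eu"},
    "assertionPlanErrorCodes": ["E1", "E2"],
}

print(resolve(bindings, "projectName"))
print(resolve(bindings, "executionParameters[env]"))
print(resolve(bindings, "assertionPlanErrorCodes[0]"))
# Wildcard check: succeeds if any value satisfies the condition.
print(any(v == "eu" for v in resolve(bindings, "executionParameters[*]")))
```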
In rule actions
For the compound key definition of incident actions, the same syntax as for the conditions applies.
However, for notifications, where the data (e.g. mail content, or webhook payload) is generally user-defined, a simple and familiar syntax for string interpolation (also widely employed by other integration solutions) is used:
- The string ${someBinding} will be replaced by the (serialized) content of the binding named someBinding.
- For map values, ${mapBinding[someKey]} will be replaced by the value of the key someKey in binding mapBinding.
This syntax should be suitable for most output targeted for humans, like mail notifications. For integration with webhooks, it may be more suitable to directly use a machine-readable format like JSON. For this purpose, the following replacements are performed in addition:
- The string %{someBinding} will be replaced by the JSON representation of the (serialized) content of the binding named someBinding.
- For map values, %{mapBinding[someKey]} will be replaced by the JSON representation of the value of the key someKey in binding mapBinding.
Finally, the two special values ${bindings} and %{bindings} will produce a (more or less) human-readable, and a JSON-formatted, machine-readable, representation of all bindings present.
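The two placeholder styles can be approximated with the following Python sketch. It is not Step's template engine: the function name is hypothetical, and the treatment of missing bindings (replaced by an empty string) and the plain-text serialization of lists and maps via str() are assumptions chosen for illustration.

```python
import json
import re

# Matches ${reference} and %{reference}; group 1 is the sigil, group 2 the reference.
PLACEHOLDER = re.compile(r"([$%])\{([^}]+)\}")


# Hypothetical sketch; render is an illustrative name, not part of Step.
def render(template: str, bindings: dict) -> str:
    """Substitute ${...} (plain text) and %{...} (JSON) placeholders in a template."""
    def substitute(match: "re.Match") -> str:
        sigil, reference = match.group(1), match.group(2)
        if reference == "bindings":
            value = bindings  # special value: all bindings
        elif reference.endswith("]") and "[" in reference:
            name, key = reference[:-1].split("[", 1)
            value = bindings.get(name, {}).get(key, "")
        else:
            value = bindings.get(reference, "")
        if sigil == "%":
            return json.dumps(value)  # machine-readable JSON representation
        return value if isinstance(value, str) else str(value)

    return PLACEHOLDER.sub(substitute, template)


bindings = {
    "scheduleName": "Sleep every minute",
    "executionParameters": {"env": "TEST"},
    "errorCodes": [],
}

print(render("Schedule ${scheduleName} ran in ${executionParameters[env]}", bindings))
print(render('{"schedule": %{scheduleName}, "codes": %{errorCodes}}', bindings))
```

Using the % form inside a webhook payload keeps the result valid JSON, because string values are quoted and escaped, while lists and maps are emitted as JSON arrays and objects.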
For example, using a mail gateway with the template Hello, here are all the bindings: %{bindings} and sending a notification in reaction to an incident event gives the following result (only formatted for readability):
Hello, here are all the bindings: {
  "eventClass": "IncidentClosedEvent",
  "eventClasses": [
    "IncidentClosedEvent",
    "IncidentEvent",
    "AlertingEvent"
  ],
  "controllerUrl": "http://localhost:4201",
  "incidentTitle": "Assertion for schedule 'Sleep every minute' failed",
  "incidentId": "655dd559d60ac6051820d11b",
  "incidentUrl": "http://localhost:4201/#/root/incidents/655dd559d60ac6051820d11b?tenant=Common",
  "incidentStatus": "CLOSED",
  "incidentCauseEventClasses": [
    "ScheduledExecutionEndedEvent",
    "AbstractExecutionEndedEvent",
    "ExecutionEvent",
    "AlertingEvent"
  ],
  "incidentCauseEventClass": "ScheduledExecutionEndedEvent",
  "projectId": "654b0e3329deb95b5c828e3c",
  "projectName": "Common",
  "executionId": "655dd594d60ac6051820d122",
  "executionDescription": "Sleep",
  "executionUrl": "http://localhost:4201/#/root/executions/655dd594d60ac6051820d122?tenant=Common",
  "executionUserName": "admin",
  "executionParameters": {
    "env": "TEST"
  },
  "planId": "654b16d2beafb638286e54e7",
  "executionStatus": "PASSED",
  "errorSummary": "",
  "errorCodes": [],
  "assertionPlanExecutionStatus": "PASSED",
  "assertionPlanErrorSummary": "",
  "assertionPlanErrorCodes": [],
  "assertionPlanExecutionUrl": "http://localhost:4201/#/root/executions/655dd594d60ac6051820d19e?tenant=Common",
  "scheduleId": "654e4a880a142b3f2af5e052",
  "scheduleName": "Sleep every minute",
  "scheduleStatus": "PASSED",
  "eventSummary": "Incident closed: Assertion for schedule 'Sleep every minute' failed",
  "scheduleSucceeded": "true"
}