How to monitor service availability and performance
This tutorial shows how step can be used to monitor service availability and performance metrics by combining plan executions with the notification package, the scheduler, and the monitoring view used as a dashboard.
Prerequisites
Before going through this tutorial, make sure you understand what step keywords and plans are.
You can also refer to the example usage of test sets and test cases for a better understanding of how monitoring plans are defined.
Services availability and performance monitoring
Keywords definition
Let’s assume you are operating 3 different Windows services and want to monitor them by checking their status (running or stopped) and measuring the average response time needed to execute the check.
A simple keyword will be used to check these services and return the service’s health status (running / stopped) as an output. The keyword has 2 inputs:
- executablePath: the path to the executable used to perform the check; in this tutorial we use PowerShell
- serviceDisplayName: the service name as displayed in the Windows Service Control Manager
You can find a sample of the keyword code to be used (Windows OS) in the step sample project hosted on GitHub.
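For orientation, here is a minimal, illustrative sketch of what such a keyword could look like using step's Java keyword API. The keyword method name and the output attribute names (`status`, `responseTimeMs`) are assumptions made for this tutorial and may differ from the actual sample on GitHub:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import step.handlers.javahandler.AbstractKeyword;
import step.handlers.javahandler.Keyword;

public class WindowsServiceStatusKeyword extends AbstractKeyword {

    @Keyword
    public void CheckWindowsServiceStatus() throws Exception {
        // Both inputs come from the keyword call defined in the plan
        String executablePath = input.getString("executablePath");         // e.g. path to powershell.exe
        String serviceDisplayName = input.getString("serviceDisplayName"); // name shown in the Service Control Manager

        long start = System.currentTimeMillis();

        // Ask PowerShell for the service status (Running, Stopped, ...)
        Process process = new ProcessBuilder(
                executablePath, "-Command",
                "(Get-Service -DisplayName '" + serviceDisplayName + "').Status")
                .redirectErrorStream(true)
                .start();

        String status;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            status = reader.readLine();
        }
        process.waitFor();

        long duration = System.currentTimeMillis() - start;

        // Return the health status and the measured response time as keyword outputs
        output.add("status", status == null ? "unknown" : status.trim());
        output.add("responseTimeMs", Long.toString(duration));
    }
}
```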
To register the sample keyword (called WindowsServiceStatusKeyword) in step, follow the instructions below:
- check out the step samples project using your favorite SCM tool
- execute a Maven build with the “package” goal on the demo-system-monitoring project to produce the keyword jar (see the note after this list)
- register the WindowsServiceStatusKeyword keyword in step as described here
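Assuming a standard Maven setup, the build step typically amounts to running `mvn package` from the demo-system-monitoring module; the keyword jar is then produced under the module’s `target` directory.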
Define your monitoring plans
Set up your keywords and TestCases
For clarity, the keyword calls have been labeled with the corresponding service names, as shown below:
- Check_DHCP_Client_Service_Health
- Check_DNS_Client_Service_Health
- Check_Print_Spooler_Service_Health
Let’s define a simple plan of type TestSet containing 3 TestCase controls that execute the service health check keywords:
A good practice is to wrap each of your checks in a TestCase control. This enables a per-test-case split view of the execution and gives you better control over what is executed in the monitoring test set.
In addition, all test cases defined under a TestSet control are executed in parallel.
Add an assertion on keyword output
To check the service status within our test plan (is the service running or not?), let’s add an Assert control under each of our keywords:
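In this example, the assertion would typically check the keyword output attribute carrying the health status (the hypothetical `status` attribute from the sketch above) against the expected value `Running`.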
Let's execute the plan and click on the **Check Print Spooler** test case to display its content:
As shown in the screenshot above, the Check Print Spooler keyword node is red and marked as FAILED because the service is stopped!
These functional checks can now be put to productive use to monitor our services.
Schedule your plan
Now that we have some functional checks, let’s schedule them to run periodically using the scheduler.
From an execution of your plan, click the “Schedule” button in the top right panel:
You can now define how frequently your monitoring plan should be executed. In this example, we are using the "Every 5 minutes" preset (you can use the [Java CRON](https://docs.oracle.com/cd/E12058_01/doc/doc.1014/e12030/cron_expressions.htm) expression of your choice):
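For reference, the "Every 5 minutes" preset corresponds to a CRON expression such as `0 0/5 * * * ?`.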
Click the **"OK"** button : you are redirected to the *step* [scheduler](https://step.dev/knowledgebase/3.19/userdocs/executions/#scheduling) tab from where you can see and edit all the scheduling entries you created :
Add notifications to your plan
Notifications are made available as an Enterprise Plugin.
To alert you about the status of your services, step can trigger automatic email notifications based on the overall status of the executed plan.
We recommend following this page to configure your email gateway(s) before setting up notifications on your plans.
Once done, open an execution of your monitoring test plan and click the “Add notifications” link in the bottom right panel:
Choose the kind of event that should trigger a notification (execution ended or execution failed), then fill in your notification gateway and the notification recipient:
Here we choose to receive a notification when the monitoring test fails. Let’s stop the Print Spooler service on our monitored instance and wait for the next execution.
After 5 minutes, the following email was received in our mailbox:
Consult your monitoring dashboard
Monitoring dashboards are available in step’s Enterprise versions.
Let’s have a look at the monitoring dashboard: you can access it by clicking the “Monitoring” tab in the top menu:
As you can see, the latest execution of our monitoring plan ended as FAILED because the Print Spooler service is not running.
To illustrate the behavior of the “Last status change” column, let’s fix the Print Spooler service and wait for the next plan execution.
Here is the monitoring view 5 minutes after the fix was applied to the service:
We can see that the “Last status change” column has been updated according to the overall status of the last plan execution!
Long term trends / history
To display the performance metrics over time, open any execution of your monitoring test plan, switch to the "Performance" tab, then click "Interactive analytics":
Now that we have been redirected to RTM, we can remove the existing filter based on the execution id: its purpose is to restrict the measurements to the selected execution, whereas in our case we are interested in all the measurements of a specific plan over time.
Click the red cross next to the filter to remove it:
Let's now filter our results to display only the response time of the **Check_Print_Spooler_Service_Health** keyword: add a simple "Text filter" based on the keyword name:
The graph below shows the average response time of the selected keyword:
To retrieve the execution data of all our keywords, we can use a regular expression filter, still based on the keyword **name**, as shown below (in our example, all the keyword names start with "Check"):
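For instance, a pattern like `Check.*` matches all three keyword names used in this tutorial.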
You can now see the graph showing the average response time of each service over time: