
Unlock hidden insights; Build a stunning Grafana dashboard to monitor OPA like a bro (part 4)

Hi there “Process Automation” fans,

Welcome to a new installment of “Process Automation” tips.

We continue our journey with Grafana, but what have we seen so far?

  • Post 1: Was all about installing Grafana, Prometheus, and the JSON exporter, and gluing it all together to produce metric values.
  • Post 2: Was all about creating a first Grafana dashboard from the metrics, and showing it in a ‘Web content’ panel for a homepage layout in our beloved OPA platform.
  • Post 3: Was all about creating a custom exporter in Python to query our Postgres DB for interesting numbers.
  • Post 4: Our final post (this one) will show you the alerting mechanism in Prometheus when &^%$#&@^%$ hits the fan!

A question upfront: Why alerting in Prometheus and not in Grafana? Because with Prometheus we’re at the heart of our metrics (incl. the scraping times, with no delays!)…Grafana is just the sauce that makes it look nice! Prometheus uses rule-based alerting decoupled from any UI, which also makes the YAML config files version-controllable!


Let’s get right into it…

Our alerting (in this post) will send out an e-mail, so the first thing we need in place is an SMTP mail server! We use the simplest and most elegant mail server available: “smtp4dev”. You can find an example in action here.
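If you don’t have a test mail server yet, a quick way to spin up smtp4dev is via Docker (a minimal sketch; the rnwood/smtp4dev image and port mapping are assumptions based on the project’s documentation):

# Web UI on http://localhost:3000, SMTP listening on port 25
docker run --rm -it -p 3000:80 -p 25:25 rnwood/smtp4dev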

You can also alert to other “receivers” like MS-Teams, Slack, Discord, etc. (here’s an overview). Not for this post, but for you to further explore!
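As a little taste, a Slack receiver could look roughly like this in the Alertmanager config (a sketch only; the webhook URL and channel name are placeholders, not real values):

receivers:
  - name: 'slack_ops'
    slack_configs:
      # Incoming-webhook URL from your own Slack workspace (placeholder!)
      - api_url: 'https://hooks.slack.com/services/T000/B000/XXXX'
        channel: '#opa-alerts'
        send_resolved: true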

Alongside this mail server (which I run locally on my Windows laptop), I also make sure my OPA VM is fully operational. This includes Grafana, Prometheus, the JSON exporter, and our custom DB exporter (written in Python).

Once all is running, we first need to install the Prometheus Alertmanager with these commands:

wget https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz
sudo mkdir /opt/prometheus/alertmanager
tar -xzf alertmanager-0.28.1.linux-amd64.tar.gz
sudo mv alertmanager-0.28.1.linux-amd64/* /opt/prometheus/alertmanager
sudo vi /opt/prometheus/alertmanager/opa_config.yml

This is the starting input for the config file opa_config.yml:

global:
  resolve_timeout: 1m

route:
  group_by: ['opa_health']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'smtp4dev'

receivers:
  - name: 'smtp4dev'
    email_configs:
      - to: opa@alerting.com
        from: no_reply@alerting.com
        smarthost: 192.168.56.1:25
        require_tls: false

Notes:

  • Our “smtp4dev” mail server accepts everything (all domains) for the from/to fields!
  • I need to override require_tls with a false value to be able to send a mail later in this post…
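Before starting it up, you can validate this config with amtool, which ships in the same Alertmanager tarball (a quick sanity check; the path assumes our install location from above):

/opt/prometheus/alertmanager/amtool check-config /opt/prometheus/alertmanager/opa_config.yml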

Now you can start “Alertmanager” (incl. the config file) with sudo /opt/prometheus/alertmanager/alertmanager --config.file=/opt/prometheus/alertmanager/opa_config.yml and watch the log in the console! Open the UI in your browser at http://192.168.56.107:9093

[Image grafana_001: the Alertmanager UI]
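Side note: I start Alertmanager by hand here, but if you want it to survive a VM reboot, a minimal systemd unit could look like this (a sketch matching our install paths; save it as /etc/systemd/system/alertmanager.service, then run sudo systemctl daemon-reload and sudo systemctl enable --now alertmanager):

[Unit]
Description=Prometheus Alertmanager
After=network.target

[Service]
ExecStart=/opt/prometheus/alertmanager/alertmanager \
  --config.file=/opt/prometheus/alertmanager/opa_config.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target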

Our next step is to create rules for the manager (all nicely decoupled!): sudo vi /opt/prometheus/alertmanager/opa_alert_rules.yml

groups:
  - name: opa_health_alerts
    rules:
      - alert: opa_health_service_down
        expr: sum by(check_status) (opa_metics_up) != 7
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OPA Service down"
          description: "OPA Service has been down for more than 1 minute!"

Our JSON exporter exposes 7 UP services (for now #SUPPORT, as it normally exposes more services on the OPA health endpoint); when one is DOWN, we’ll get a notification.
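Tip: you can validate the rule file with promtool, which ships with Prometheus (I assume it landed in /opt/prometheus next to the prometheus binary), and you can paste the expression sum by(check_status) (opa_metics_up) into the Prometheus UI to confirm it returns 7 when everything is healthy:

/opt/prometheus/promtool check rules /opt/prometheus/alertmanager/opa_alert_rules.yml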

The next step is to add the additional alerting configuration to the Prometheus config file (reading the rules from above and calling the alert system): sudo vi /opt/prometheus/opa_config.yml

global:
  scrape_interval: 15s
  external_labels:
    monitor: 'mon_opa'

scrape_configs:
  - job_name: 'custom_postgres_app_metrics'
    static_configs:
      - targets: ['localhost:7878']
  - job_name: 'job_prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'job_opa'
    scrape_interval: 5s
    metrics_path: '/probe'
    static_configs:
      - targets: ['http://192.168.56.107:8080/home/opa_tips/app/mp/health/ready']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.56.107:7979

rule_files:
  - /opt/prometheus/alertmanager/opa_alert_rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'
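Before restarting, a quick validation of the whole config (incl. the referenced rule file) never hurts (again assuming promtool sits in /opt/prometheus):

/opt/prometheus/promtool check config /opt/prometheus/opa_config.yml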

Stop Prometheus with <Ctrl>+<c> and restart it: sudo /opt/prometheus/prometheus --config.file=/opt/prometheus/opa_config.yml

In the Prometheus UI, you will see something like this:

[Image grafana_002: our alert rule showing up in the Prometheus UI]

It’s time to shut down one of the service containers in OPA (in /system)! After a while you will see things starting to move:

[Image grafana_003: the alert changing state in the Prometheus UI]

And after 1 minute (as configured!) it moves again:

[Image grafana_004: the alert now in the ‘firing’ state]

Now the Alertmanager is also involved, with a newly fired alert:

[Image grafana_005: the fired alert in the Alertmanager UI]

Fascinating to follow the flow! 👍

What about that mail? It’s nicely delivered:

[Image grafana_006: the alert mail delivered in smtp4dev]

A final question: What happens when we start the service container again? Well, the alert in Prometheus goes back to the ‘inactive’ state…This completes the alert cycle! 👌

This concept works the same for our custom Python exporter querying the OPA database! That’s the power of metrics data and a loosely coupled implementation.
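For example, a rule on a database metric could look roughly like this (a sketch; opa_db_failed_tasks is a hypothetical metric name for illustration, not the actual metric from post 3):

groups:
  - name: opa_db_alerts
    rules:
      # Hypothetical metric name; replace with the metric your exporter exposes!
      - alert: opa_db_failed_tasks_high
        expr: opa_db_failed_tasks > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Too many failed tasks in the OPA database"
          description: "More than 10 failed tasks for over 5 minutes!"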

The MVP is running for consumption…I leave the further fine-tuning up to you, as I already see a glitch in the mail and the alert splitting into two alerts (for UP and DOWN)!?…Not for now; I leave it with you!

These are some helpful resources from my side (with a little ChatGPT input on the config YML files). You can also follow those resources, but this blog puts them more in the context of our beloved OPA platform.


So, how hard can it be!? It’s what I always say…If I can do it, you can do it too! I give it a “DONE” with a final, quickly configurable alerting mechanism for our Prometheus metrics. The past 4 posts gave an interesting insight into Grafana and its related services. It’s an interesting technique to play with, and I will implement it at one of my customers if they allow me to do so. I’ve been won over to the other side of the line and now also belong to the group of enthusiastic people using these tools…I now also better understand the WHY!? Have a great weekend; see you next week for another great topic in “Process Automation Tips”!

Don’t forget to subscribe to get updates on the activities happening on this site. Have you noticed the quiz where you find out if you are also “The Process Automation guy”?