Setting up an Apache Kafka cluster with monitoring and rolling updates using Ansible, Prometheus, and Grafana
I’ve been working on a project where we use Apache Kafka as our message broker: a place where every microservice publishes its events as messages on topics, and the other microservices subscribe to those topics to receive them. At first, when we didn’t have many microservices and hadn’t defined many events, we were fine with a single Apache Kafka broker. As the number of microservices and events grew, we felt the need for a more reliable and well-monitored Apache Kafka cluster. We required a solution to:
- Set up and recover quickly in case of a broker failure
- Update our cluster without any downtime
- Monitor the cluster statistics
Setting up the Cluster
To set up an Apache Kafka cluster, each broker must be assigned a unique broker id. The brokers also need to connect to the same Zookeeper cluster, which means we need a reliable Zookeeper cluster. I have created an Ansible role, chubock.kafka, to set up an Apache Kafka cluster on Debian platforms. You can install it with: ansible-galaxy install chubock.kafka
To set up an Apache Zookeeper cluster we need at least 3 nodes. It’s strongly recommended to have an odd number of nodes because of the voting procedure used to elect a leader. For more information about Apache Zookeeper leader election, you can visit https://data-flair.training/blogs/zookeeper-leader-election. There is an Ansible role, chubock.zookeeper, that you can use to set up your Apache Zookeeper cluster on Debian platforms. First, install it with: ansible-galaxy install chubock.zookeeper. Now that you’ve installed the roles, you can set up your Apache Zookeeper and Apache Kafka clusters with an Ansible playbook like this:
---
- hosts: zookeeper
  become: true
  roles:
    - { role: chubock.zookeeper, java_home: '/opt/java/default' }
  serial: 1
- hosts: kafka
  become: true
  roles:
    - { role: chubock.kafka, zookeeper_hosts: "{{ groups['zookeeper'] }}", stable_rolling_update: false }
  serial: 1
There are lots of variables that you can use to configure the clusters. For more information about the roles visit ansible-role-zookeeper and ansible-role-kafka.
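The playbook above assumes an inventory with zookeeper and kafka groups. A minimal sketch of such an inventory, with hypothetical hostnames you would replace with your own, could look like this:

```yaml
# inventory.yml — hostnames are placeholders; adjust to your environment
all:
  children:
    zookeeper:
      hosts:
        zk1.local:
        zk2.local:
        zk3.local:
    kafka:
      hosts:
        kfk1.local:
        kfk2.local:
        kfk3.local:
```

With serial: 1 in the plays, Ansible applies each role to one host at a time, which is what makes the rolling behavior described below possible.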
Cluster Zero Downtime Update
To update the Apache Kafka cluster with no downtime, we needed to update and restart the brokers one at a time, waiting for the cluster to have no under-replicated partitions before moving on to the next broker.
Kafka exposes a lot of metrics over JMX, one of which is kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions, which shows the number of partitions that are under replicated. We used Jolokia to read the value of this metric after each broker restart. Jolokia is a JMX-HTTP bridge with a JVM agent that you can attach when launching Java applications like Apache Kafka. To use it with Kafka, download the Jolokia JVM agent and define it as a java agent in the KAFKA_OPTS environment variable when starting your broker: KAFKA_OPTS=-javaagent:{{ jolokia_path }}/jolokia-jvm-1.6.2-agent.jar
Starting your broker with the Jolokia agent enables you to read the metric via the URL http://localhost:8778/jolokia/read/kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager/Value, which returns a response like this:
{
  "request": {
    "mbean": "kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager",
    "attribute": "Value",
    "type": "read"
  },
  "value": 0,
  "timestamp": 1609865259,
  "status": 200
}
We poll this URL a couple of times after each broker restarts and wait for the value field of the response to become 0. Then we can restart the next broker to update it. If you are using the chubock.kafka role, setting the role variable stable_rolling_update to true will do the job and make the play wait until the cluster has no under-replicated partitions.
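If you want to implement the wait yourself, a task along these lines could do it. This is only a sketch using Ansible's built-in uri module; the Jolokia port matches the default used above, but the retry count and delay are assumptions you should tune:

```yaml
# Sketch: after restarting a broker, poll Jolokia until no partitions
# are under replicated, then let the serial play move to the next host.
- name: Wait until the cluster has no under replicated partitions
  uri:
    url: "http://localhost:8778/jolokia/read/kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager/Value"
    return_content: yes
  register: urp
  until: (urp.content | from_json).value == 0
  retries: 30   # assumed: give up after 30 attempts
  delay: 10     # assumed: 10 seconds between attempts
```

Because the play runs with serial: 1, a broker that never catches up fails the play for that host and stops the rollout instead of restarting the next broker into a degraded cluster.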
Monitoring the cluster with Prometheus
Prometheus is a monitoring system and a time-series database. There are many exporters and integrations available for it that retrieve metrics from different sources and make them available over an HTTP URL, which Prometheus can poll every couple of seconds. If you haven’t installed Prometheus, there is an Ansible role, chubock.prometheus, you can use on Debian platforms. First, install it with: ansible-galaxy install chubock.prometheus. Then you can install Prometheus on your host with the following Ansible playbook:
---
- hosts: {{ hosts to install }}
  become: yes
  roles:
    - chubock.prometheus
To make Kafka metrics available for Prometheus to poll, you should download the Prometheus JMX exporter agent and add it to KAFKA_OPTS when launching the broker. For example:
KAFKA_OPTS=-javaagent:{{ prometheus_path }}/jmx_prometheus_javaagent-0.13.0.jar=18080:{{ prometheus_path }}/kafka-2_0_0.yml -javaagent:{{ jolokia_path }}/jolokia-jvm-1.6.2-agent.jar
After restarting the broker, an HTTP GET request to http://localhost:18080/metrics will return those metrics in a format Prometheus can consume. Now we should define a job for the Kafka cluster in the Prometheus configuration file, prometheus.yml, and provide the exporters’ addresses:
- job_name: 'kafka'
  static_configs:
    - targets:
        - kfk1.local:18080
        - kfk2.local:18080
        - kfk3.local:18080
Of course, you should change the host addresses accordingly. If you are using the chubock.prometheus Ansible role, you can define the Prometheus job in your Ansible playbook as well:
- hosts: {{ hosts to install }}
  become: yes
  roles:
    - chubock.prometheus
  vars:
    jobs:
      kafka:
        targets:
          - kfk1.local:18080
          - kfk2.local:18080
          - kfk3.local:18080
        scrape_interval: 30s
If you are using the chubock.kafka Ansible role, you can set the prometheus_exporter variable to true in your Ansible playbook to install the Prometheus exporter agent on the brokers.
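Putting the rolling-update and monitoring pieces together, the kafka play might look like this. This is a sketch, omitting the Zookeeper connection variable shown earlier; the variable spellings are assumptions based on the role behavior described in this story:

```yaml
# Sketch: one-broker-at-a-time update with the Prometheus exporter
# installed and the stable rolling update enabled.
- hosts: kafka
  become: true
  serial: 1
  roles:
    - role: chubock.kafka
      stable_rolling_update: true   # wait for zero under replicated partitions
      prometheus_exporter: true     # install the JMX exporter agent
```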
Visualizing data by Grafana
Grafana is a multi-platform open-source analytics and interactive visualization web application. We use Grafana to visualize Prometheus data. First, we need to add a data source for our Prometheus in the Configuration/Data Sources section. Then we can create dashboards with charts and graphs representing our data. One of the best things about Grafana is that the community has developed a lot of dashboards for different sources, including Apache Kafka. We started with one of the available dashboards and then added some more charts and graphs to improve it. The dashboard is now available at https://grafana.com/grafana/dashboards/13684.
The only thing left to do is to import the dashboard into Grafana and select the Prometheus data source you added.
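If you prefer configuration as code over clicking through the UI, Grafana can also provision the data source from a file at startup. A minimal sketch, assuming Prometheus is reachable at a hypothetical prometheus.local on its default port:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.local:9090   # assumed Prometheus address
    isDefault: true
```

Grafana loads files in the provisioning/datasources directory on startup, so the data source appears without any manual setup.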
That was all
I hope this story has been helpful. Any comments and contributions to the Ansible roles mentioned in this story are greatly appreciated.