Setting up an Apache Kafka cluster with monitoring and rolling updates using Ansible, Prometheus, and Grafana
I’ve been working on a project where we use Apache Kafka as our message broker: a place where every microservice publishes its events as messages on topics, and the other microservices subscribe to those topics to receive them. At first, when we didn’t have many microservices and hadn’t defined many events, we were fine with a single Apache Kafka broker. As the number of microservices and events grew, we felt the need for a more reliable and well-monitored Apache Kafka cluster. We required a solution to:
- Set up and recover quickly in case of a broker failure
- Update our cluster without any downtime
- Monitor the cluster statistics
Setting up the Cluster
To set up an Apache Kafka cluster, each broker must be assigned a unique broker id. The brokers also need to connect to the same Zookeeper cluster, which means we need a reliable Zookeeper cluster. I have created an Ansible role, chubock.kafka, to set up an Apache Kafka cluster on Debian platforms. You can install it with: ansible-galaxy install chubock.kafka
To set up an Apache Zookeeper cluster we need at least 3 nodes. It’s strongly recommended to have an odd number of nodes because of the voting procedure used to elect a leader. For more information about Apache Zookeeper leader election, you can visit https://data-flair.training/blogs/zookeeper-leader-election. There is an Ansible role, chubock.zookeeper, that you can use to set up your Apache Zookeeper cluster on Debian platforms. First, install it with: ansible-galaxy install chubock.zookeeper. Now that you’ve installed the roles, you can set up your Apache Zookeeper and Apache Kafka clusters with an Ansible playbook like this:
---
- hosts: zookeeper
  become: true
  roles:
    - { role: chubock.zookeeper, java_home: '/opt/java/default' }
  serial: 1
- hosts: kafka
  become: true
  roles:
    - { role: chubock.kafka, zookeeper_hosts: "{{ groups['zookeeper'] }}", stable_rolling_update: false }
  serial: 1
There are lots of variables that you can use to configure the clusters. For more information about the roles visit ansible-role-zookeeper and ansible-role-kafka.
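The playbook above assumes an inventory with zookeeper and kafka groups. A minimal sketch of such an inventory, with hypothetical hostnames you would replace with your own, could look like this:

```yaml
# inventory.yml — hostnames are placeholders; adjust to your environment
all:
  children:
    zookeeper:
      hosts:
        zk1.local:
        zk2.local:
        zk3.local:
    kafka:
      hosts:
        kfk1.local:
        kfk2.local:
        kfk3.local:
```

With serial: 1 in the plays, Ansible applies each role to one host at a time, which is what makes the rolling behavior described below possible.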
Cluster Zero Downtime Update
To update the Apache Kafka cluster with no downtime, we needed to update and restart the brokers one at a time, waiting for the cluster to have no under-replicated partitions before moving on to the next broker.
Kafka exposes a lot of metrics over JMX, one of which is kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions, which shows the number of partitions that are under replicated. We used Jolokia to read the value of this metric after each broker restart. Jolokia is a JMX-HTTP bridge with a JVM agent that you can attach when launching Java applications like Apache Kafka. To use it with Kafka, download the Jolokia JVM agent and define it as a java agent in the KAFKA_OPTS environment variable when starting your broker: KAFKA_OPTS=-javaagent:{{ jolokia_path }}/jolokia-jvm-1.6.2-agent.jar
Starting your broker with the Jolokia agent enables you to read the metric via the URL http://localhost:8778/jolokia/read/kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager/Value, which returns a response like this:
{
  "request": {
    "mbean": "kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager",
    "attribute": "Value",
    "type": "read"
  },
  "value": 0,
  "timestamp": 1609865259,
  "status": 200
}
We poll this URL a couple of times after each broker restarts and wait for the value field of the response to become 0. Then we can restart the next broker to update it. If you are using the chubock.kafka role, setting the role variable stable_rolling_update to true will do the job and make the play wait until the cluster has no under-replicated partitions.
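If you want to implement the wait yourself, a task along these lines could do it. This is only a sketch using Ansible's built-in uri module; the Jolokia port matches the default used above, but the retry count and delay are assumptions you should tune:

```yaml
# Sketch: after restarting a broker, poll Jolokia until no partitions
# are under replicated, then let the serial play move to the next host.
- name: Wait until the cluster has no under replicated partitions
  uri:
    url: "http://localhost:8778/jolokia/read/kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager/Value"
    return_content: yes
  register: urp
  until: (urp.content | from_json).value == 0
  retries: 30   # assumed: give up after 30 attempts
  delay: 10     # assumed: 10 seconds between attempts
```

Because the play runs with serial: 1, a broker that never catches up fails the play for that host and stops the rollout instead of restarting the next broker into a degraded cluster.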
Monitoring the cluster with Prometheus
Prometheus is a monitoring system and a time-series database. There are many exporters and integrations available for it that retrieve metrics from different sources and make them available over an HTTP URL, which Prometheus can poll every couple of seconds. If you haven’t installed Prometheus, there is an Ansible role, chubock.prometheus, you can use on Debian platforms. First, install it with: ansible-galaxy install chubock.prometheus. Then you can install Prometheus on your host with the following Ansible playbook:
---
- hosts: {{ hosts to install }}
  become: yes
  roles:
    - chubock.prometheus
To make Kafka metrics available for Prometheus to poll, you should download the Prometheus JMX exporter agent and add it to KAFKA_OPTS when launching the broker. For example:
KAFKA_OPTS=-javaagent:{{ prometheus_path }}/jmx_prometheus_javaagent-0.13.0.jar=18080:{{ prometheus_path }}/kafka-2_0_0.yml -javaagent:{{ jolokia_path }}/jolokia-jvm-1.6.2-agent.jar
After restarting the broker, an HTTP GET request to http://localhost:18080/metrics will return those metrics in a format Prometheus can consume. Now we should define a job for the Kafka cluster in the Prometheus configuration file, prometheus.yml, and provide the exporters’ addresses:
- job_name: 'kafka'
  static_configs:
    - targets:
        - kfk1.local:18080
        - kfk2.local:18080
        - kfk3.local:18080
Of course, you should change the host addresses accordingly. If you are using the chubock.prometheus Ansible role, you can define the Prometheus job in your Ansible playbook as well:
- hosts: {{ hosts to install }}
  become: yes
  roles:
    - chubock.prometheus
  vars:
    jobs:
      kafka:
        targets:
          - kfk1.local:18080
          - kfk2.local:18080
          - kfk3.local:18080
        scrape_interval: 30s
If you are using the chubock.kafka Ansible role, you can set the prometheus_exporter variable to true in your Ansible playbook to install the Prometheus exporter agent on the brokers.
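Putting the rolling-update and monitoring pieces together, the kafka play might look like this. This is a sketch, omitting the Zookeeper connection variable shown earlier; the variable spellings are assumptions based on the role behavior described in this story:

```yaml
# Sketch: one-broker-at-a-time update with the Prometheus exporter
# installed and the stable rolling update enabled.
- hosts: kafka
  become: true
  serial: 1
  roles:
    - role: chubock.kafka
      stable_rolling_update: true   # wait for zero under replicated partitions
      prometheus_exporter: true     # install the JMX exporter agent
```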
Visualizing data by Grafana
Grafana is a multi-platform open-source analytics and interactive visualization web application. We use Grafana to visualize Prometheus data. First, we need to add a data source for our Prometheus in the Configuration/Data Sources section. Then we can create dashboards with charts and graphs representing our data. One of the best things about Grafana is that the community has developed a lot of dashboards for different sources, including Apache Kafka. We started with one of the available dashboards and then added some more charts and graphs to improve it. The dashboard is now available at https://grafana.com/grafana/dashboards/13684.
The only thing left to do is to import the dashboard into Grafana and select the Prometheus data source you added.
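If you prefer configuration as code over clicking through the UI, Grafana can also provision the data source from a file at startup. A minimal sketch, assuming Prometheus is reachable at a hypothetical prometheus.local on its default port:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.local:9090   # assumed Prometheus address
    isDefault: true
```

Grafana loads files in the provisioning/datasources directory on startup, so the data source appears without any manual setup.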
That was all
I hope this story has been helpful. Any comments and contributions to the Ansible roles mentioned in this story are greatly appreciated.