Kafka

Kafka is primarily used to enable real-time data streaming and processing between systems. It ensures reliable, scalable, and fault-tolerant communication in data-driven environments, making it ideal for building event-driven architectures.

Data-Driven Environment

A data-driven environment is a setting where decisions and operations are guided by data and analytics. It involves collecting, centralizing, and analyzing data to gain insights and drive evidence-based decisions. Examples include:

  • Healthcare: Predicting disease outbreaks using patient data.
  • Finance: Detecting fraud through transaction analysis.
  • Retail: Optimizing inventory based on customer trends.
  • Aviation: Monitoring aircraft telemetry for real-time safety and predictive maintenance.

Functionalities of Kafka

  1. Publish-Subscribe Messaging:

    • Kafka allows applications to produce messages (events) to a topic.
    • Consumers can subscribe to these topics and receive the events in real time.
  2. Scalability:

    • Kafka is horizontally scalable. Topics are divided into partitions, which can be distributed across multiple brokers (nodes) in the Kafka cluster.
    • This ensures high throughput by parallelizing reads and writes.
  3. Data Durability:

    • Kafka stores messages on disk, allowing consumers to retrieve them later.
    • Messages are retained for a configurable period, even after consumption, ensuring reliability.
  4. Fault Tolerance:

    • Kafka replicates partitions across multiple brokers for redundancy.
    • If a broker fails, the replicas take over seamlessly.
  5. High Throughput:

    • Designed to handle millions of events per second, Kafka sustains high throughput with low overhead even in high-traffic systems.
  6. Event Storage:

    • Kafka acts as a durable log for event storage, allowing downstream systems to replay or process historical events.
  7. Real-Time Processing:

    • Kafka supports stream processing using tools like Kafka Streams or external frameworks such as Apache Flink or Apache Spark.
  8. Decoupling Systems:

    • Kafka acts as a mediator, enabling independent evolution of producers and consumers.
    • Producers and consumers can operate asynchronously, decoupling their lifecycles.
  9. Integration:

    • Kafka connects heterogeneous systems by acting as a central hub for data exchange.
    • With Kafka Connect, it integrates with databases, storage systems, and other services for seamless data ingestion and export.
  10. Replayability:

    • Consumers can read data from a specific point (offset) in the stream, making Kafka suitable for debugging, audits, and reprocessing.
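
To make replayability concrete, the following minimal sketch uses the plain Java consumer client to assign a partition and seek back to an earlier offset before polling. It is illustrative only: the topic name "events", partition 0, and the localhost:9092 broker address are assumptions rather than part of any setup described here.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually assign one partition and rewind to offset 0
            // (seekToBeginning would jump to the earliest offset still retained)
            TopicPartition partition = new TopicPartition("events", 0);
            consumer.assign(List.of(partition));
            consumer.seek(partition, 0L);

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}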

Use Cases

  1. Log Aggregation:

    • Collects logs from different services and systems into a central platform for processing.
  2. Real-Time Analytics:

    • Enables event-driven analytics for dashboards, fraud detection, and recommendation systems.
  3. Stream Processing:

    • Transforms or aggregates event data for complex processing pipelines.
  4. Event Sourcing:

    • Captures the state changes in applications as events for replaying and recovering state.
  5. Data Integration:

    • Acts as a bridge for real-time data movement between databases, cloud services, and applications.
  6. IoT and Monitoring:

    • Processes streams from IoT devices or monitoring systems in real time.

Kafka Components

  1. Producer: Sends messages to Kafka topics.
  2. Broker: Servers that store and serve messages.
  3. Topic: A category where messages are published.
  4. Partition: A subset of a topic for parallelism.
  5. Consumer: Reads messages from topics.
  6. ZooKeeper: Manages cluster metadata and leader election; deprecated in newer versions and replaced by Kafka’s own KRaft metadata quorum.
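
To make partitions and replication concrete, here is a minimal sketch that creates a topic with the Kafka AdminClient. The broker address, topic name, and counts are illustrative assumptions; a replication factor of 3 requires a cluster with at least three brokers.

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for parallel consumption, replication factor 3 for fault tolerance
            NewTopic topic = new NewTopic("flight-events", 3, (short) 3);
            admin.createTopics(List.of(topic)).all().get(); // blocks until the broker confirms
        }
    }
}
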
graph LR
    A[Kafka Architecture] --> B[Core Components]
    A --> C[Data Flow]
    A --> D[Key Features]
    B --> B1[Producer]
    B1 --> B1a[Publishes messages to topics]
    B1 --> B1b[Defines message keys for partitioning]
    B --> B2[Broker]
    B2 --> B2a[Stores and manages messages]
    B2 --> B2b[Handles partition replication]
    B2 --> B2c[Distributes load across brokers in a cluster]
    B --> B3[Topic]
    B3 --> B3a[Logical category for messages]
    B3 --> B3b[Divided into partitions for scalability]
    B --> B4[Partition]
    B4 --> B4a[Subset of a topic]
    B4 --> B4b[Ensures message order within a partition]
    B --> B5[Consumer]
    B5 --> B5a[Subscribes to topics and reads messages]
    B5 --> B5b[Tracks offset for reprocessing or continuous reading]
    B --> B6[ZooKeeper/Metadata Quorum]
    B6 --> B6a[Manages metadata and leader election]
    B6 --> B6b[Ensures cluster coordination]
    C --> C1[Producers send messages to topics]
    C1 --> C1a[Messages are distributed across partitions]
    C --> C2[Brokers store messages persistently]
    C2 --> C2a[Messages are retained based on time or size limits]
    C --> C3[Consumers fetch messages]
    C3 --> C3a[Can read in real-time or replay old messages]
    D --> D1[High Throughput]
    D1 --> D1a[Handles millions of messages per second]
    D --> D2[Fault Tolerance]
    D2 --> D2a[Partition replication across brokers]
    D --> D3[Scalability]
    D3 --> D3a[Add more brokers and partitions]
    D --> D4[Replayability]
    D4 --> D4a[Consumers can re-read messages from any offset]
    D --> D5[Decoupled Systems]
    D5 --> D5a[Producers and consumers are independent]
    D --> D6[Integration with Stream Processing]
    D6 --> D6a[Kafka Streams, Flink, or Spark for real-time analytics]

  1. Core Components:

    • Producers: Applications that publish events/messages to Kafka topics.
    • Brokers: Kafka servers that store and serve messages to consumers.
    • Topics: Logical channels where messages are categorized.
    • Partitions: Subsets of topics for parallel processing and scalability.
    • Consumers: Applications that read messages from Kafka topics.
    • ZooKeeper/Metadata Quorum: Manages metadata and cluster coordination (newer versions replace ZooKeeper with Kafka's native quorum).
  2. Data Flow:

    • Producers send messages to topics, and these messages are distributed across partitions.
    • Brokers store the messages persistently, retaining them for a specified duration.
    • Consumers fetch these messages, either in real time or by replaying historical data.
  3. Key Features:

    • High throughput, fault tolerance, and scalability.
    • Ability to replay messages and decouple producers and consumers.
    • Seamless integration with stream processing tools for analytics and transformation.
graph LR
    A[Kafka Use Case: Real-Time Data Streaming Platform]
    %% Producers
    A --> B[Producers]
    B --> B1[Web Apps]
    B --> B2[IoT Devices]
    B --> B3[Logs & Monitoring]
    B --> B4[Database Change Events]
    %% Kafka Core
    A --> C[Kafka Core]
    C --> C1[Topics]
    C1 --> C2[Partitions]
    C --> C3[Brokers]
    C --> C4[Message Retention]
    C --> C5[Replication]
    %% Consumers
    A --> D[Consumers]
    D --> D1[Analytics & Dashboards]
    D --> D2[Stream Processing]
    D --> D3[Machine Learning Pipelines]
    D --> D4[Database Updates]
    D --> D5[Data Warehouses]
    %% Use Case Workflow
    subgraph Workflow
        E1[Producers publish events]
        E2[Kafka stores in partitions]
        E3[Consumers fetch messages]
        E4[Consumers process/store data]
    end
    C --> Workflow
    %% Monitoring and Integration
    A --> F[Monitoring & Integration]
    F --> F1[Prometheus & Grafana]
    F --> F2[Kafka Connect]
    F --> F3[ElasticSearch]
    %% Features
    subgraph Features
        G1[High Throughput]
        G2[Low Latency]
        G3[Scalability]
        G4[Fault Tolerance]
        G5[Message Replay]
    end
    A --> Features

  1. Producers:

    • Web Applications: Publish user interactions, transactions, or events to Kafka topics.
    • IoT Devices: Send sensor data for real-time processing.
    • Logs and Monitoring Tools: Centralize logs from various systems.
    • Database Change Events (CDC): Capture changes in relational databases (e.g., MySQL, PostgreSQL) to propagate updates.
  2. Kafka Core:

    • Topics act as logical channels for communication.
    • Partitions enable parallelism and scalability.
    • Brokers store and serve messages.
    • Replication ensures data durability and fault tolerance.
  3. Consumers:

    • Real-time dashboards for analytics.
    • Stream processing for data aggregation, filtering, or transformation.
    • Machine learning pipelines for real-time model training or inference.
    • Updates to relational databases.
    • Storing data in data warehouses for business intelligence.
  4. Workflow:

    • Events flow from producers to topics.
    • Kafka retains messages in partitions for a configurable period.
    • Consumers retrieve messages for processing or storage.
  5. Monitoring and Integration:

    • Use tools like Prometheus and Grafana for monitoring Kafka.
    • Integrate with external systems using Kafka Connect (e.g., ElasticSearch, Hadoop).
  6. Features:

    • Kafka’s high throughput, low latency, scalability, and fault tolerance make it ideal for real-time data streaming use cases.

Use Case in the Aviation Industry: Real-Time Flight Monitoring and Data Streaming

graph LR
    A[Kafka in Aviation Use Case: Real-Time Flight Monitoring]
    %% Producers
    A --> B[Producers]
    B --> B1[Aircraft Sensors]
    B --> B2[Air Traffic Control Systems]
    B --> B3[Airline Operations Centers]
    B --> B4[Weather Data Providers]
    %% Kafka Core
    A --> C[Kafka Core]
    C --> C1[Topics]
    C1 --> C1a[AircraftTelemetry]
    C1 --> C1b[FlightSchedules]
    C1 --> C1c[WeatherUpdates]
    C --> C2[Partitions]
    C --> C3[Brokers]
    C --> C4[Message Retention]
    C --> C5[Replication for Fault Tolerance]
    %% Consumers
    A --> D[Consumers]
    D --> D1[Airline Dashboards]
    D --> D2[Real-Time Alerts to Pilots and Ground Staff]
    D --> D3[Maintenance and Diagnostics Systems]
    D --> D4[Passenger Notification Systems]
    D --> D5[Predictive Analytics for Flight Efficiency]
    %% Use Case Workflow
    subgraph Workflow
        E1[Step 1: Sensors publish telemetry data to Kafka]
        E2[Step 2: Weather providers update weather information]
        E3[Step 3: Air traffic systems share flight schedules]
        E4[Step 4: Kafka stores and distributes data to consumers]
        E5[Step 5: Consumers process and display information]
    end
    C --> Workflow
    %% Features
    subgraph Features
        F1[Low Latency]
        F2[High Throughput]
        F3[Data Replay for Auditing]
        F4[Scalability for Multiple Flights]
        F5[Fault Tolerance]
    end
    A --> Features

  1. Producers:

    • Aircraft Sensors: Send real-time telemetry data, including speed, altitude, and system health.
    • Air Traffic Control Systems: Provide updates on flight schedules, airspace usage, and routing.
    • Airline Operations Centers: Share crew assignments, gate information, and ground operations data.
    • Weather Data Providers: Deliver live weather updates, turbulence warnings, and wind patterns.
  2. Kafka Core:

    • Topics like AircraftTelemetry, FlightSchedules, and WeatherUpdates organize data streams.
    • Partitions divide topics for scalability (e.g., telemetry keyed by aircraft ID; see the sketch after this list).
    • Brokers store messages, enabling real-time streaming and fault-tolerant data storage.
  3. Consumers:

    • Airline Dashboards: Display real-time flight statuses and operational insights.
    • Real-Time Alerts: Notify pilots or ground staff of critical updates (e.g., turbulence or maintenance needs).
    • Maintenance Systems: Analyze telemetry for predictive maintenance and diagnostics.
    • Passenger Notification Systems: Inform passengers of delays or gate changes.
    • Predictive Analytics: Use data to optimize flight paths, fuel efficiency, and scheduling.
  4. Workflow:

    • Data from producers (sensors, systems) is streamed to Kafka topics.
    • Consumers fetch this data for processing, visualization, and decision-making.
    • Kafka enables seamless communication between aviation systems.
  5. Features:

    • Kafka ensures low latency for critical updates, fault tolerance for reliability, and scalability to handle data from thousands of flights.
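
A minimal sketch of the keyed partitioning mentioned above, using the plain Java producer: records that share a key (the aircraft ID) always hash to the same partition, so events from one aircraft are consumed in the order they were produced. The broker address, topic name, aircraft ID, and JSON payload are illustrative.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TelemetryKeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String aircraftId = "AI-202"; // key: all events for this aircraft land in one partition
            String payload = "{\"altitude\":35000,\"speed\":830}";
            producer.send(new ProducerRecord<>("AircraftTelemetry", aircraftId, payload));
        } // closing the producer flushes any pending sends
    }
}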

Project Setup

graph LR
    A[Spring Boot Application: Real-Time Aviation Monitoring]
    %% Producers
    A --> B[Telemetry Producer]
    B --> B1["REST API Endpoint /api/telemetry/send"]
    B --> B2[Kafka Template]
    B2 --> B3[Kafka Topic: AircraftTelemetry]
    %% Kafka Core
    A --> C[Kafka Core]
    C --> C1[Topics]
    C1 --> C2[AircraftTelemetry]
    C --> C3[Brokers]
    C --> C4[Partitions]
    C --> C5[Replication for Fault Tolerance]
    %% Consumers
    A --> D[Telemetry Consumer]
    D --> D1[Kafka Listener: AircraftTelemetry]
    D1 --> D2[Process Telemetry Data]
    D2 --> D3[Log Telemetry to Console]
    D2 --> D4[Save Data to Database]
    %% External Systems
    B1 --> E[Client Application]
    D --> F[Real-Time Dashboard]
    D --> G[Alerting System]
    D --> H[Data Analytics Pipeline]
    %% Enhancements
    subgraph Enhancements
        H1[Monitoring with Prometheus]
        H2[Grafana for Visualizing Metrics]
        H3[Stream Processing with Kafka Streams]
        H4[Database Integration for Historical Data]
    end
    A --> Enhancements

1. Spring Boot Application (A)

Central application for handling real-time aviation telemetry data.

2. Producers (B)

  • Telemetry Producer: Sends telemetry data to Kafka.
  • REST API Endpoint: Receives telemetry data via HTTP.
  • Kafka Template: Sends data to Kafka.
  • Kafka Topic (AircraftTelemetry): The channel where data is published.

3. Kafka Core (C)

  • Topics: Organizes data streams.
  • AircraftTelemetry: Specific topic for aircraft data.
  • Brokers: Kafka servers managing data.
  • Partitions: Allows parallel processing.
  • Replication: Ensures fault tolerance.

4. Consumers (D)

  • Telemetry Consumer: Subscribes to the Kafka topic to process data.
  • Kafka Listener: Listens for incoming data.
  • Process Telemetry Data: Processes the received data.
  • Log to Console: Logs data for monitoring.
  • Save to Database: Stores data for future use.

5. External Systems

  • Client Application: Sends data via the REST API.
  • Real-Time Dashboard: Displays data in real time.
  • Alerting System: Notifies about anomalies.
  • Data Analytics Pipeline: Analyzes telemetry data.

6. Enhancements

  • Prometheus/Grafana: Monitoring and visualization.
  • Kafka Streams: Stream processing.
  • Database Integration: Stores historical data.
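
One possible way to realize the Kafka Streams enhancement is a small streams application that reads the AircraftTelemetry topic (the model class is defined in Step 1 below) and routes suspicious readings to a separate alert topic. This is a hedged sketch: the TelemetryAlerts topic name and the altitude/speed thresholds are illustrative assumptions, and it additionally requires the kafka-streams dependency.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.springframework.kafka.support.serializer.JsonSerde;
import com.example.aviation.model.AircraftTelemetry;

public class TelemetryAlertStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "telemetry-alerts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // JSON serde from spring-kafka, matching the JSON serialization configured later in this setup
        JsonSerde<AircraftTelemetry> telemetrySerde = new JsonSerde<>(AircraftTelemetry.class);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("AircraftTelemetry", Consumed.with(Serdes.String(), telemetrySerde))
               // Flag aircraft that are unusually low and fast (illustrative thresholds)
               .filter((aircraftId, t) -> t.getAltitude() < 10000 && t.getSpeed() > 400)
               .to("TelemetryAlerts", Produced.with(Serdes.String(), telemetrySerde));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}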

Create a Spring Boot Project

dependencies:
  • Spring Web
  • Spring Kafka
  • Spring Boot Actuator (for monitoring)
  • Spring DevTools (optional for development)
<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>

application.properties

spring.kafka.bootstrap-servers=localhost:9092
spring.kafka.consumer.group-id=aviation-group
spring.kafka.consumer.auto-offset-reset=earliest
spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
spring.kafka.producer.value-serializer=org.springframework.kafka.support.serializer.JsonSerializer
spring.kafka.consumer.key-deserializer=org.apache.kafka.common.serialization.StringDeserializer
spring.kafka.consumer.value-deserializer=org.springframework.kafka.support.serializer.JsonDeserializer
spring.kafka.consumer.properties.spring.json.trusted.packages=com.example.aviation.model
  1. spring.kafka.bootstrap-servers=localhost:9092

    • This specifies the Kafka broker (or cluster of brokers) the application connects to.
    • In this case, it’s a single Kafka broker running on localhost at port 9092.
  2. spring.kafka.consumer.group-id=aviation-group

    • Kafka consumers are organized into groups. Each consumer in a group is assigned a subset of partitions to read messages.
    • Here, the consumer group is identified as aviation-group.
  3. spring.kafka.consumer.auto-offset-reset=earliest

    • Determines what the consumer should do if there is no initial offset or if the current offset is invalid (e.g., messages are deleted).
    • earliest: Reads messages starting from the beginning of the topic's log.
  4. spring.kafka.producer.key-serializer and spring.kafka.producer.value-serializer

    • Producers must serialize message keys and values into bytes before Kafka can store and transmit them.
    • These properties specify the serializers for keys and values:
      • org.apache.kafka.common.serialization.StringSerializer converts the message key to a string.
      • org.springframework.kafka.support.serializer.JsonSerializer converts the message value (the AircraftTelemetry object) to JSON.
  5. spring.kafka.consumer.key-deserializer, spring.kafka.consumer.value-deserializer, and spring.json.trusted.packages

    • Consumers need to deserialize the keys and values of messages received from Kafka.
    • These properties specify the deserializers for keys and values:
      • org.apache.kafka.common.serialization.StringDeserializer restores the key as a String.
      • org.springframework.kafka.support.serializer.JsonDeserializer maps the JSON payload back to an AircraftTelemetry object; spring.json.trusted.packages restricts which packages it may instantiate (here, the application's model package).

Step 1: Define Models

Create a model for telemetry data from aircraft sensors:

package com.example.aviation.model;

public class AircraftTelemetry {
    private String aircraftId;
    private double latitude;
    private double longitude;
    private double altitude;
    private double speed;

    // Getters and Setters
    public String getAircraftId() { return aircraftId; }
    public void setAircraftId(String aircraftId) { this.aircraftId = aircraftId; }
    public double getLatitude() { return latitude; }
    public void setLatitude(double latitude) { this.latitude = latitude; }
    public double getLongitude() { return longitude; }
    public void setLongitude(double longitude) { this.longitude = longitude; }
    public double getAltitude() { return altitude; }
    public void setAltitude(double altitude) { this.altitude = altitude; }
    public double getSpeed() { return speed; }
    public void setSpeed(double speed) { this.speed = speed; }

    // Readable representation so the consumer's log output shows actual field values
    @Override
    public String toString() {
        return "AircraftTelemetry{aircraftId='" + aircraftId + "', latitude=" + latitude
                + ", longitude=" + longitude + ", altitude=" + altitude + ", speed=" + speed + "}";
    }
}

Step 2: Producer Implementation

Create a Kafka producer to send telemetry data.

package com.example.aviation.producer;

import com.example.aviation.model.AircraftTelemetry;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class TelemetryProducer {

    private static final String TOPIC = "AircraftTelemetry";

    @Autowired
    private KafkaTemplate<String, AircraftTelemetry> kafkaTemplate;

    public void sendTelemetry(AircraftTelemetry telemetry) {
        // Key by aircraft ID so all events for one aircraft go to the same partition (order is preserved)
        kafkaTemplate.send(TOPIC, telemetry.getAircraftId(), telemetry);
    }
}
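
kafkaTemplate.send() is asynchronous. In recent Spring Kafka versions (3.x) it returns a CompletableFuture (older versions return a ListenableFuture), so delivery success or failure can be observed. A minimal sketch, assuming Spring Kafka 3.x:

kafkaTemplate.send(TOPIC, telemetry.getAircraftId(), telemetry)
        .whenComplete((result, ex) -> {
            if (ex != null) {
                // Delivery failed after the client's retries; log it or hand off to a fallback
                System.err.println("Failed to send telemetry: " + ex.getMessage());
            } else {
                // RecordMetadata reports the partition and offset the record was written to
                System.out.println("Telemetry sent to partition "
                        + result.getRecordMetadata().partition()
                        + " at offset " + result.getRecordMetadata().offset());
            }
        });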

Step 3: Consumer Implementation

Create a Kafka consumer to process telemetry data.

package com.example.aviation.consumer;

import com.example.aviation.model.AircraftTelemetry;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Service;

@Service
public class TelemetryConsumer {

    @KafkaListener(topics = "AircraftTelemetry", groupId = "aviation-group")
    public void consumeTelemetry(AircraftTelemetry telemetry) {
        // The JSON value deserializer configured in application.properties rebuilds the AircraftTelemetry object
        System.out.println("Received telemetry: " + telemetry);
        // Add logic to process or save telemetry data
    }
}

Step 4: REST API for Sending Data

Expose an endpoint to allow external systems to send telemetry data.

package com.example.aviation.controller;

import com.example.aviation.model.AircraftTelemetry;
import com.example.aviation.producer.TelemetryProducer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/telemetry")
public class TelemetryController {

    @Autowired
    private TelemetryProducer producer;

    @PostMapping("/send")
    public String sendTelemetry(@RequestBody AircraftTelemetry telemetry) {
        producer.sendTelemetry(telemetry);
        return "Telemetry sent successfully!";
    }
}
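
For a quick end-to-end check, a JSON body matching the AircraftTelemetry model can be POSTed to /api/telemetry/send (for example with curl or Postman). The values below are illustrative:

{
  "aircraftId": "AI-202",
  "latitude": 28.5562,
  "longitude": 77.1003,
  "altitude": 35000,
  "speed": 830
}

If everything is wired correctly, the controller returns "Telemetry sent successfully!" and the consumer logs the received telemetry to the console, confirming that producer, broker, and consumer are connected.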

Kafka vs RabbitMQ

graph LR
    A[Kafka vs RabbitMQ] --> B[Use Case]
    A --> C[Architecture]
    A --> D[Scalability and Performance]
    A --> E[Messaging Patterns]
    A --> F[Durability and Reliability]
    A --> G[Tooling and Ecosystem]
    A --> H[Protocols]
    B --> B1[Kafka: High-throughput, real-time streaming]
    B --> B2[RabbitMQ: Task queues and low-latency messaging]
    C --> C1[Kafka: Log-based storage]
    C --> C2[RabbitMQ: Queue-based storage]
    C --> C3[Kafka: Consumer-managed offsets]
    C --> C4[RabbitMQ: Explicit acknowledgment]
    D --> D1[Kafka: Extremely high throughput]
    D --> D2[RabbitMQ: Lower throughput, low latency]
    D --> D3[Kafka: Horizontal scalability with partitions]
    D --> D4[RabbitMQ: Scalable with queues]
    E --> E1[Kafka: Built-in pub-sub model]
    E --> E2[RabbitMQ: Advanced routing with exchanges]
    E --> E3[Kafka: Limited point-to-point capabilities]
    E --> E4[RabbitMQ: Designed for point-to-point messaging]
    F --> F1[Kafka: Retains messages for reprocessing]
    F --> F2[RabbitMQ: Messages typically deleted after consumption]
    F --> F3[Kafka: Fault-tolerant with replication]
    F --> F4[RabbitMQ: Reliable with HA queues]
    G --> G1[Kafka: More complex setup, high learning curve]
    G --> G2[RabbitMQ: Easier configuration and usage]
    G --> G3[Kafka: Ecosystem with Kafka Streams and Connect]
    G --> G4[RabbitMQ: AMQP protocol support and plugins]
    H --> H1[Kafka: Uses its own binary protocol]
    H --> H2[RabbitMQ: Supports AMQP, MQTT, STOMP, HTTP]