
Kafka Word Splitter


A production-ready Apache Kafka application demonstrating real-time file processing in a distributed environment. The application monitors directories for new files, splits content into words, and distributes them across Kafka topics based on word length.

Features

  • Real-time File Monitoring: Watches directories for new files using Java NIO WatchService
  • Distributed Processing: Leverages Apache Kafka for scalable message distribution
  • Dynamic Topic Routing: Routes words to topics based on length (3-10 characters)
  • Graceful Shutdown: Production-ready shutdown hooks for clean resource cleanup
  • Production Hardened: Security patches, comprehensive error handling, and resource management
  • CI/CD Ready: Automated testing, security scanning, and code quality checks
  • Container Runtime Support: Works with both Podman (recommended) and Docker, with automatic runtime detection
  • Infrastructure Automation: Automated scripts for startup, shutdown, status checking, and topic creation

Architecture

System Overview

┌──────────────┐
│  File System │
│  (Watch Dir) │
└──────┬───────┘
       │
       v
┌──────────────┐      ┌─────────────┐      ┌──────────────┐
│   Producer   │─────>│    Kafka    │─────>│   Consumer   │
│ (File Reader)│      │   Brokers   │      │ (File Writer)│
└──────────────┘      │   Topics    │      └──────────────┘
                      │   3,4,5..10 │
                      └─────────────┘

Components

  1. Producer Application

    • Monitors directory for new text files
    • Reads and splits file content into words
    • Sends each word to a Kafka topic based on word length
    • Package: org.example.ProducerApp
  2. Consumer Application

    • Subscribes to specific Kafka topics by word length
    • Receives words from Kafka
    • Writes words to output files
    • Package: org.example.ConsumerApp
  3. Kafka Infrastructure

    • Message broker for distributed communication
    • Topics: 3, 4, 5, 6, 7, 8, 9, 10 (word lengths)
    • Deployed via Docker Compose
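
The producer's directory monitoring is built on Java NIO's WatchService (see Features above). For orientation, a minimal watch loop looks like the sketch below; the class and variable names are illustrative, and the actual implementation lives in org.example.ProducerApp:

import java.nio.file.*;

public class WatchLoopSketch {
    public static void main(String[] args) throws Exception {
        Path watchDir = Paths.get(args[0]);
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            // Register for file-creation events only
            watchDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take();              // blocks until an event arrives
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path created = watchDir.resolve((Path) event.context());
                    System.out.println("File detected: " + created);
                    // the real producer reads, splits, and sends the words here
                }
                if (!key.reset()) break;                    // directory no longer accessible
            }
        }
    }
}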

For detailed architecture documentation, see ARCHITECTURE_REPORT.md.

Prerequisites

Before running this application, ensure you have:

  • Java Development Kit (JDK) 17 or higher

  • Podman and Podman Compose (recommended) OR Docker and Docker Compose

    • Required for Kafka infrastructure
    • The application works with both container runtimes (auto-detected)

    Podman (Recommended):

    • macOS: brew install podman then podman machine init && podman machine start
    • Linux: sudo apt-get install podman (Ubuntu/Debian)
    • Verify: podman --version and podman compose version
    • Download: Podman Installation Guide

    Docker (Alternative):

    • Verify: docker --version and docker compose version
    • Download: Docker Desktop
  • Git (for cloning the repository)

    • Verify: git --version

Quick Start

Get up and running in 5 minutes:

# 1. Clone the repository
git clone https://github.com/strawberry-code/kafka-word-splitter.git
cd kafka-word-splitter

# 2. Start Kafka infrastructure (auto-detects Podman or Docker)
./start-kafka.sh

# 3. Build the project
./gradlew clean build

# 4. Create Kafka topics (required topics 3-10)
./scripts/create-topics.sh

# 5. Run the consumer (in one terminal)
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar org.example.ConsumerApp 3 output-3.txt

# 6. Run the producer (in another terminal)
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar org.example.ProducerApp /path/to/watch

# 7. Add a text file to the watch directory and see the magic!
echo "hello world from kafka" > /path/to/watch/test.txt

Note: This project uses java -cp (classpath) instead of java -jar to run applications. This allows running both ProducerApp and ConsumerApp from the same JAR file by specifying the main class explicitly on the command line.

For a more detailed guide, see docs/getting-started/QUICK_START.md.

Building the Project

Standard Build

# Clean and build the project
./gradlew clean build

# Build fat JAR with all dependencies
./gradlew shadowJar

# Output: build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar

Development Build

# Fast build without tests
./gradlew build -x test

# Continuous build (watches for changes)
./gradlew build --continuous

For comprehensive build documentation, see docs/getting-started/BUILD.md.

Running the Application

Start Kafka Infrastructure

# Start Kafka and Zookeeper (auto-detects Podman or Docker)
./start-kafka.sh

# Check infrastructure status
./scripts/kafka-status.sh

# Create required topics (topics 3-10)
./scripts/create-topics.sh

Run Producer

The producer monitors a directory and sends words to Kafka:

# Using the fat JAR
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ProducerApp /path/to/watch/directory

# Using Gradle
./gradlew run --args="/path/to/watch/directory"
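
Note that no broker address is passed on the command line: the producer connects to the local Kafka started by ./start-kafka.sh. A typical producer configuration for that setup looks like the following sketch (the exact properties used by ProducerApp may differ):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // local broker from start-kafka.sh
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // topic name is the word length as a string; the value is the word itself
            producer.send(new ProducerRecord<>("5", "hello"));
        }
    }
}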

Run Consumer

The consumer reads from a specific topic and writes to a file:

# Subscribe to topic "3" (3-letter words)
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 3 output-3.txt

# Run multiple consumers for different word lengths
for i in {3..10}; do
  java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
    org.example.ConsumerApp "$i" "output-$i.txt" &
done

Stop Kafka Infrastructure

# Gracefully stop Kafka and Zookeeper
./scripts/stop-kafka.sh

Graceful Shutdown

Both applications support graceful shutdown:

# Press Ctrl+C or send SIGTERM
kill -TERM <pid>

The applications will:

  • Stop accepting new work
  • Complete in-flight processing
  • Flush and close Kafka connections
  • Clean up all resources
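
Under the hood this relies on the standard JVM shutdown-hook mechanism. A minimal sketch of the pattern (not the applications' exact code) looks like this:

public class ShutdownSketch {
    private static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        // Invoked when the JVM receives SIGINT/SIGTERM (Ctrl+C or kill -TERM)
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            running = false;    // stop accepting new work
            System.out.println("Shutting down: flushing and closing Kafka connections...");
            // the real applications flush/close producers, consumers, and files here
        }));
        while (running) {
            Thread.sleep(200);  // stand-in for the main processing loop
        }
    }
}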

See docs/architecture/SHUTDOWN.md for detailed shutdown behavior.

Testing the Application

This section provides a complete end-to-end testing workflow to verify the application is working correctly after setup.

Prerequisites for Testing

Before testing, ensure you have:

  • Built the application (./gradlew clean build)
  • Kafka infrastructure running (./start-kafka.sh)
  • Required topics created (./scripts/create-topics.sh)
  • Watch directory created manually (producer will not start without it)
  • Output directory created if using subdirectories (e.g., output/file.txt)

Complete End-to-End Example

1. Prepare Test Directories

The producer requires the watch directory to exist before starting:

# Create watch directory (REQUIRED - producer validates this exists)
mkdir -p /tmp/kafka-watch

# Optional: Create output directory if using subdirectories
# (Not needed if output files are in current directory)
mkdir -p /tmp/kafka-output

2. Start Infrastructure and Build

# Ensure Kafka is running
./start-kafka.sh

# Verify infrastructure status
./scripts/kafka-status.sh

# Build the application (if not already built)
./gradlew clean build

# Create required topics (if not already created)
./scripts/create-topics.sh

3. Start Consumer(s) - Terminal 1

Start one or more consumers to receive words of specific lengths:

# Example 1: Single consumer for 5-letter words
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 5 output-5.txt

# Example 2: Multiple consumers for different word lengths
# Terminal 1: 3-letter words
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 3 output-3.txt

# Terminal 2: 4-letter words
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 4 output-4.txt

# Terminal 3: 5-letter words
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 5 output-5.txt

You should see log output indicating the consumer is running:

INFO  Consumer started - Topic: 5, Output: output-5.txt
INFO  Subscribed to topic: 5

4. Start Producer - Terminal 2

In a new terminal, start the producer:

java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ProducerApp /tmp/kafka-watch

You should see:

INFO  Producer started - Watch directory: /tmp/kafka-watch
INFO  Watching directory: /tmp/kafka-watch

5. Add Test Files

Now add a test file to the watch directory. The file will be automatically processed and deleted.

Simple test with echo:

echo "hello world from kafka testing application" > /tmp/kafka-watch/test.txt

Expected word routing:

  • "hello" (5 chars) → topic "5" → output-5.txt
  • "world" (5 chars) → topic "5" → output-5.txt
  • "from" (4 chars) → topic "4" → output-4.txt
  • "kafka" (5 chars) → topic "5" → output-5.txt
  • "testing" (7 chars) → topic "7" → output-7.txt
  • "application" (11 chars) → FILTERED OUT (exceeds 10 character limit)

Note: Only words with 3-10 characters are processed. Words outside this range are ignored.

6. Verify Results

Check the output files to verify words were processed:

# View content of output files
cat output-5.txt
# Expected output:
# hello
# world
# kafka

cat output-4.txt
# Expected output:
# from

# Check word counts
wc -l output-*.txt

Producer logs should show:

INFO  File detected: test.txt
INFO  Processing file: /tmp/kafka-watch/test.txt
INFO  Sent word 'hello' to topic '5'
INFO  Sent word 'world' to topic '5'
INFO  Sent word 'from' to topic '4'
INFO  Sent word 'kafka' to topic '5'
INFO  Sent word 'testing' to topic '7'
INFO  File processing completed: test.txt

Consumer logs should show:

INFO  Received word: hello
INFO  Received word: world
INFO  Received word: kafka

Quick Testing Tips

Using README.md as Test File

You can use existing files to test, but remember they will be deleted after processing:

# IMPORTANT: Make a copy! The original will be deleted.
cp README.md /tmp/kafka-watch/README-copy.md

This will process all words from the README and distribute them to topics 3-10.

Simple One-Liner Tests

# Test specific word lengths
echo "the cat sat" > /tmp/kafka-watch/test-3.txt          # All 3-letter words
echo "test" > /tmp/kafka-watch/test-4.txt                  # Single 4-letter word
echo "hello world" > /tmp/kafka-watch/test-5.txt           # 5-letter words

# Test word filtering (words too long or too short)
echo "a in application infrastructure" > /tmp/kafka-watch/test-filter.txt
# Result: "a" (1 char) and "in" (2 chars) are filtered out
#         "application" (11 chars) and "infrastructure" (14 chars) are filtered out

Testing Multiple Topics Simultaneously

Run consumers for all supported word lengths in separate terminals or background processes:

# Start all consumers (run each in separate terminal or use & for background)
for i in {3..10}; do
  java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
    org.example.ConsumerApp "$i" "output-$i.txt" &
done

# Add comprehensive test file
cat > /tmp/kafka-watch/comprehensive.txt << 'EOF'
The quick brown fox jumps over lazy dog
Testing distributed systems with Apache Kafka
Processing words character lengths three through ten
EOF

# Wait a moment, then check all output files
sleep 3
for i in {3..10}; do
  echo "=== Topic $i (${i}-letter words) ==="
  cat "output-$i.txt" 2>/dev/null || echo "No words"
done

Important Behaviors

File Deletion Warning

CRITICAL: Files are automatically deleted after successful processing.

When the producer processes a file:

  1. Reads the entire file content
  2. Splits content into words by whitespace
  3. Sends valid words (3-10 characters) to Kafka
  4. DELETES the original file upon successful completion

To preserve files:

  • Always copy files to the watch directory (don't move originals)
  • Use cp instead of mv: cp myfile.txt /tmp/kafka-watch/
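
Put together, the four lifecycle steps above map to code of roughly this shape (the method name is hypothetical; the real logic is in org.example.ProducerApp):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class FileLifecycleSketch {
    static void processAndDelete(Path file) throws IOException {
        String content = Files.readString(file);            // 1. read entire file (UTF-8 by default)
        for (String word : content.trim().split("\\s+")) {  // 2. split on whitespace
            // 3. valid (3-10 character) words are sent to Kafka here
        }
        Files.delete(file);                                 // 4. delete only after successful completion
    }
}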

Supported File Types

Any UTF-8 text file is supported:

  • No file extension restrictions (.txt, .md, .log, .csv, anything works)
  • Files must be readable as UTF-8 text
  • Binary files will cause errors

Examples:

echo "hello" > /tmp/kafka-watch/test.txt      # .txt works
echo "hello" > /tmp/kafka-watch/test.md       # .md works
echo "hello" > /tmp/kafka-watch/test.log      # .log works
echo "hello" > /tmp/kafka-watch/test          # no extension works

Word Processing Rules

Word Length Filter: 3-10 characters only

# Words that WILL be processed:
echo "cat dog bird" > /tmp/kafka-watch/valid.txt
# "cat" (3), "dog" (3), "bird" (4) - all valid

# Words that will be FILTERED OUT:
echo "a in application" > /tmp/kafka-watch/filtered.txt
# "a" (1) - too short
# "in" (2) - too short
# "application" (11) - too long

Word Splitting: By whitespace

  • Words are split using regex: \s+ (any whitespace)
  • Includes spaces, tabs, newlines
  • Empty strings are filtered out

Topic Routing: By word length

  • "cat" (3 chars) → topic "3"
  • "word" (4 chars) → topic "4"
  • "hello" (5 chars) → topic "5"
  • ... and so on up to topic "10"
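
Taken together, these rules amount to a few lines of logic. The sketch below is illustrative (class and variable names are hypothetical), not the project's exact code:

public class WordRulesSketch {
    public static void main(String[] args) {
        String line = "a in cat hello application";
        for (String word : line.split("\\s+")) {       // split on any whitespace
            int len = word.length();
            if (len < 3 || len > 10) continue;         // keep 3-10 character words only
            String topic = String.valueOf(len);        // topic name = word length
            System.out.println(word + " -> topic " + topic);
        }
    }
}

Running it prints cat -> topic 3 and hello -> topic 5; "a", "in", and "application" are filtered out.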

Automatic File Creation

Output files are created automatically:

  • You don't need to create output files beforehand
  • Consumers create files on first write
  • If using subdirectories (e.g., output/file.txt), the directory must exist
  • Files are appended to (not overwritten) if they already exist
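
This append-on-write behavior matches opening the output file with the CREATE and APPEND options, as in this hedged sketch (the real code is in org.example.ConsumerApp):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class OutputWriterSketch {
    // Creates the file on first write, then appends on every subsequent write.
    // Fails if the parent directory (e.g. output/) does not exist.
    static void appendWord(Path outputFile, String word) throws IOException {
        Files.writeString(outputFile, word + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}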

Verification Steps

Check Producer Status

# Producer should show these logs when running:
INFO  Producer started - Watch directory: /tmp/kafka-watch
INFO  Watching directory: /tmp/kafka-watch

# When a file is added:
INFO  File detected: <filename>
INFO  Processing file: /tmp/kafka-watch/<filename>
INFO  Sent word '<word>' to topic '<length>'

Check Consumer Status

# Consumer should show these logs when running:
INFO  Consumer started - Topic: <topic>, Output: <output-file>
INFO  Subscribed to topic: <topic>

# When receiving words:
INFO  Received word: <word>

Verify Output Files

# Check if output files were created
ls -lh output-*.txt

# View contents
cat output-5.txt

# Count words received
wc -l output-*.txt

# Check all output files
for f in output-*.txt; do
  echo "=== $f ==="
  cat "$f"
  echo
done

Verify Kafka Topics

# Check if topics exist
./scripts/create-topics.sh  # Will show existing topics

# Or use Kafka console consumer to verify messages
podman exec -it kafka-word-splitter-kafka \
  kafka-console-consumer.sh \
    --bootstrap-server localhost:9092 \
    --topic 5 \
    --from-beginning \
    --max-messages 10

Troubleshooting

Producer Won't Start

Error: "Watch directory does not exist"

# Solution: Create the directory before starting producer
mkdir -p /tmp/kafka-watch

Error: "Connection refused" or "Cannot connect to Kafka"

# Solution: Ensure Kafka is running
./scripts/kafka-status.sh

# If not running, start it:
./start-kafka.sh

Consumer Won't Start

Error: "Parent directory does not exist"

# Solution: If output file uses subdirectories, create them
mkdir -p output

# Then restart consumer with:
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 5 output/file.txt

Error: "Topic does not exist"

# Solution: Create required topics
./scripts/create-topics.sh

No Words in Output Files

Problem: Output file is empty or not created

Check:

  1. Word length: Only 3-10 character words are processed

    # Test with known valid words
    echo "hello world kafka" > /tmp/kafka-watch/test.txt
  2. File was detected: Check producer logs for "File detected" message

  3. File format: Ensure file is text (UTF-8)

    # Check file encoding
    file /tmp/kafka-watch/test.txt
  4. Consumer is subscribed to correct topic:

    • If your file contains 5-letter words, ensure consumer is listening to topic "5"
  5. Topic exists: Verify topic was created

    ./scripts/create-topics.sh

File Not Being Processed

Problem: File added to watch directory but nothing happens

Check:

  1. Producer is running: Look for "Watching directory" log message
  2. Directory is correct: Verify watch directory path matches
  3. File permissions: Ensure file is readable
    chmod 644 /tmp/kafka-watch/test.txt

Example Testing Session

Here's a complete testing session from start to finish:

# Terminal 1: Start Kafka
./start-kafka.sh
./scripts/create-topics.sh

# Terminal 2: Build and start consumer for 5-letter words
./gradlew clean build
mkdir -p /tmp/kafka-watch
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 5 output-5.txt

# Terminal 3: Start producer
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ProducerApp /tmp/kafka-watch

# Terminal 4: Add test file
echo "hello world kafka" > /tmp/kafka-watch/test.txt

# Wait 2-3 seconds, then check results
cat output-5.txt
# Expected output:
# hello
# world
# kafka

# Verify file was deleted
ls /tmp/kafka-watch/
# Output: (empty - file was deleted after processing)

Success! You've verified the complete end-to-end workflow.

For more advanced testing scenarios and performance testing, see docs/getting-started/QUICK_START.md.

Kafka Infrastructure Management

This project uses Podman (or Docker) to run Kafka and Zookeeper locally. The infrastructure scripts automatically detect which container runtime you have installed.

Available Scripts

  • ./start-kafka.sh - Start Kafka infrastructure with automatic runtime detection
  • ./scripts/stop-kafka.sh - Gracefully stop Kafka infrastructure
  • ./scripts/kafka-status.sh - Check infrastructure health and connectivity
  • ./scripts/create-topics.sh - Create required Kafka topics (3-10)
  • ./scripts/validate-podman.sh - Validate Podman migration and installation

Container Runtime Detection

The scripts automatically detect and use available runtimes in this priority order:

  1. Podman Compose (podman compose) - recommended
  2. Podman Compose standalone (podman-compose)
  3. Docker Compose v2 (docker compose)
  4. Docker Compose v1 (docker-compose)

No configuration needed - just install your preferred runtime.

Infrastructure Operations

# Start Kafka (first time or after restart)
./start-kafka.sh

# Check status anytime
./scripts/kafka-status.sh

# Create topics before running application
./scripts/create-topics.sh

# View logs
podman logs kafka-word-splitter-kafka          # or use 'docker' instead
podman logs kafka-word-splitter-zookeeper

# Stop when done
./scripts/stop-kafka.sh

Troubleshooting

# Check if services are running
podman ps                                       # or 'docker ps'

# Restart infrastructure
./scripts/stop-kafka.sh && ./start-kafka.sh

# Validate installation (Podman users)
./scripts/validate-podman.sh

For detailed migration notes and platform-specific guidance, see MIGRATION-NOTES.md.

Documentation

Key documents referenced throughout this README:

  • docs/getting-started/QUICK_START.md - Quick start and testing guide
  • docs/getting-started/BUILD.md - Build documentation
  • ARCHITECTURE_REPORT.md - Detailed architecture report
  • docs/architecture/SHUTDOWN.md - Graceful shutdown behavior
  • SECURITY.md - Security policy and current status
  • MIGRATION-NOTES.md - Container runtime migration notes
  • SUPPORT.md - How to get help

Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/strawberry-code/kafka-word-splitter.git
cd kafka-word-splitter

# Build the project
./gradlew clean build

# Run tests
./gradlew test

# Run code quality checks
./gradlew checkstyleMain spotbugsMain

# Run security scan
./gradlew dependencyCheckAnalyze

Code Quality Tools

The project uses several quality tools:

  • Checkstyle: Code style validation
  • SpotBugs: Static bug detection
  • JaCoCo: Code coverage reporting
  • OWASP Dependency Check: Security vulnerability scanning

Run all quality checks:

./gradlew check

Local CI Simulation

Simulate the CI pipeline locally:

# Run the CI build script
./scripts/ci-build.sh

# Run security checks
./scripts/security-check.sh

# Run quality checks
./scripts/quality-check.sh

Security

Phase 1 Security Improvements

All critical and high severity CVEs have been patched:

  • CVE-2024-31141 (Kafka) - CRITICAL - Privilege escalation
  • CVE-2023-6378 (Logback) - HIGH - Deserialization DoS
  • CVE-2021-42550 (Logback) - MEDIUM - JNDI injection

Current Security Status

  • Zero critical or high severity vulnerabilities
  • Automated security scanning in CI/CD pipeline
  • Regular dependency updates via dependency-check plugin

See SECURITY.md for detailed security information.

Reporting Vulnerabilities

Please report security vulnerabilities privately using the process described in SECURITY.md.

Contributing

We welcome contributions! Please read our Contributing Guide for:

  • Development setup
  • Code style guidelines
  • Testing requirements
  • Pull request process
  • Code review guidelines

Quick Contribution Workflow

# 1. Fork the repository
# 2. Create a feature branch
git checkout -b feature/my-feature

# 3. Make your changes
# 4. Run quality checks
./gradlew check

# 5. Commit your changes
git commit -m "Add my feature"

# 6. Push and create a pull request
git push origin feature/my-feature

Technology Stack

  • Language: Java 17
  • Build Tool: Gradle 8.10.2
  • Message Broker: Apache Kafka 3.8.0
  • Logging: SLF4J 2.0.9 + Logback 1.4.14
  • Testing: JUnit 5.10.0
  • Container Runtime: Podman (recommended) or Docker
  • Container Orchestration: Compose v3.8 (Podman Compose or Docker Compose)
  • CI/CD: GitHub Actions

Project Status

  • Phase 1: COMPLETED - Critical stabilization, security patches, CI/CD
  • Current Version: 1.0-SNAPSHOT
  • Production Ready: Yes (with production Kafka configuration)

Roadmap

Completed

  • Build system stabilization
  • Security vulnerability patches
  • Code quality improvements
  • Graceful shutdown implementation
  • CI/CD pipeline
  • Comprehensive documentation

Planned

  • Unit and integration test suite
  • Metrics and monitoring integration
  • Configuration externalization
  • Performance optimization
  • Kubernetes deployment manifests

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Need help? See SUPPORT.md for ways to get it.

Acknowledgments

  • Apache Kafka community for excellent documentation
  • Gradle team for the build system
  • All contributors to this project

Author

  • Coordinator: strawberry-code

Built with Apache Kafka | Powered by Java 17 | Automated with GitHub Actions
