
Kafka Word Splitter


A production-ready Apache Kafka application demonstrating real-time file processing in a distributed environment. The application monitors directories for new files, splits content into words, and distributes them across Kafka topics based on word length.

Features

  • Real-time File Monitoring: Watches directories for new files using Java NIO WatchService
  • Distributed Processing: Leverages Apache Kafka for scalable message distribution
  • Dynamic Topic Routing: Routes words to topics based on length (3-10 characters)
  • Graceful Shutdown: Production-ready shutdown hooks for clean resource cleanup
  • Production Hardened: Security patches, comprehensive error handling, and resource management
  • CI/CD Ready: Automated testing, security scanning, and code quality checks
  • Container Runtime Support: Works with both Podman (recommended) and Docker, with automatic runtime detection
  • Infrastructure Automation: Automated scripts for startup, shutdown, status checking, and topic creation

Architecture

System Overview

┌──────────────┐
│  File System │
│  (Watch Dir) │
└──────┬───────┘
       │
       v
┌──────────────┐      ┌─────────────┐      ┌──────────────┐
│   Producer   │─────>│    Kafka    │─────>│   Consumer   │
│ (File Reader)│      │   Brokers   │      │ (File Writer)│
└──────────────┘      │   Topics    │      └──────────────┘
                      │   3,4,5..10 │
                      └─────────────┘

Components

  1. Producer Application

    • Monitors directory for new text files
    • Reads and splits file content into words
    • Sends each word to a Kafka topic based on word length
    • Package: org.example.ProducerApp
  2. Consumer Application

    • Subscribes to specific Kafka topics by word length
    • Receives words from Kafka
    • Writes words to output files
    • Package: org.example.ConsumerApp
  3. Kafka Infrastructure

    • Message broker for distributed communication
    • Topics: 3, 4, 5, 6, 7, 8, 9, 10 (word lengths)
    • Deployed via Docker Compose
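
The producer's directory monitoring is built on Java NIO's WatchService (see Features above). For orientation, a minimal watch loop looks like the sketch below; the class and variable names are illustrative, and the actual implementation lives in org.example.ProducerApp:

import java.nio.file.*;

public class WatchLoopSketch {
    public static void main(String[] args) throws Exception {
        Path watchDir = Paths.get(args[0]);
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            // Register for file-creation events only
            watchDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take();              // blocks until an event arrives
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path created = watchDir.resolve((Path) event.context());
                    System.out.println("File detected: " + created);
                    // the real producer reads, splits, and sends the words here
                }
                if (!key.reset()) break;                    // directory no longer accessible
            }
        }
    }
}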

For detailed architecture documentation, see ARCHITECTURE_REPORT.md.

Prerequisites

Before running this application, ensure you have:

  • Java Development Kit (JDK) 17 or higher

  • Podman and Podman Compose (recommended) OR Docker and Docker Compose

    • Required for Kafka infrastructure
    • The application works with both container runtimes (auto-detected)

    Podman (Recommended):

    • macOS: brew install podman then podman machine init && podman machine start
    • Linux: sudo apt-get install podman (Ubuntu/Debian)
    • Verify: podman --version and podman compose version
    • Download: Podman Installation Guide

    Docker (Alternative):

    • Verify: docker --version and docker compose version
    • Download: Docker Desktop
  • Git (for cloning the repository)

    • Verify: git --version

Quick Start

Get up and running in 5 minutes:

# 1. Clone the repository
git clone https://github.com/strawberry-code/kafka-word-splitter.git
cd kafka-word-splitter

# 2. Start Kafka infrastructure (auto-detects Podman or Docker)
./start-kafka.sh

# 3. Build the project
./gradlew clean build

# 4. Create Kafka topics (required topics 3-10)
./scripts/create-topics.sh

# 5. Run the consumer (in one terminal)
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar org.example.ConsumerApp 3 output-3.txt

# 6. Run the producer (in another terminal)
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar org.example.ProducerApp /path/to/watch

# 7. Add a text file to the watch directory and see the magic!
echo "hello world from kafka" > /path/to/watch/test.txt

Note: This project uses java -cp (classpath) instead of java -jar to run applications. This allows running both ProducerApp and ConsumerApp from the same JAR file by specifying the main class explicitly on the command line.

For a more detailed guide, see docs/getting-started/QUICK_START.md.

Building the Project

Standard Build

# Clean and build the project
./gradlew clean build

# Build fat JAR with all dependencies
./gradlew shadowJar

# Output: build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar

Development Build

# Fast build without tests
./gradlew build -x test

# Continuous build (watches for changes)
./gradlew build --continuous

For comprehensive build documentation, see docs/getting-started/BUILD.md.

Running the Application

Start Kafka Infrastructure

# Start Kafka and Zookeeper (auto-detects Podman or Docker)
./start-kafka.sh

# Check infrastructure status
./scripts/kafka-status.sh

# Create required topics (topics 3-10)
./scripts/create-topics.sh

Run Producer

The producer monitors a directory and sends words to Kafka:

# Using the fat JAR
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ProducerApp /path/to/watch/directory

# Using Gradle
./gradlew run --args="/path/to/watch/directory"
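
Note that no broker address is passed on the command line: the producer connects to the local Kafka started by ./start-kafka.sh. A typical producer configuration for that setup looks like the following sketch (the exact properties used by ProducerApp may differ):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // local broker from start-kafka.sh
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // topic name is the word length as a string; the value is the word itself
            producer.send(new ProducerRecord<>("5", "hello"));
        }
    }
}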

Run Consumer

The consumer reads from a specific topic and writes to a file:

# Subscribe to topic "3" (3-letter words)
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 3 output-3.txt

# Run multiple consumers for different word lengths
for i in {3..10}; do
  java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
    org.example.ConsumerApp "$i" "output-$i.txt" &
done

Stop Kafka Infrastructure

# Gracefully stop Kafka and Zookeeper
./scripts/stop-kafka.sh

Graceful Shutdown

Both applications support graceful shutdown:

# Press Ctrl+C or send SIGTERM
kill -TERM <pid>

The applications will:

  • Stop accepting new work
  • Complete in-flight processing
  • Flush and close Kafka connections
  • Clean up all resources
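
Under the hood this relies on the standard JVM shutdown-hook mechanism. A minimal sketch of the pattern (not the applications' exact code) looks like this:

public class ShutdownSketch {
    private static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        // Invoked when the JVM receives SIGINT/SIGTERM (Ctrl+C or kill -TERM)
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            running = false;    // stop accepting new work
            System.out.println("Shutting down: flushing and closing Kafka connections...");
            // the real applications flush/close producers, consumers, and files here
        }));
        while (running) {
            Thread.sleep(200);  // stand-in for the main processing loop
        }
    }
}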

See docs/architecture/SHUTDOWN.md for detailed shutdown behavior.

Testing the Application

This section provides a complete end-to-end testing workflow to verify the application is working correctly after setup.

Prerequisites for Testing

Before testing, ensure you have:

  • Built the application (./gradlew clean build)
  • Kafka infrastructure running (./start-kafka.sh)
  • Required topics created (./scripts/create-topics.sh)
  • Watch directory created manually (producer will not start without it)
  • Output directory created if using subdirectories (e.g., output/file.txt)

Complete End-to-End Example

1. Prepare Test Directories

The producer requires the watch directory to exist before starting:

# Create watch directory (REQUIRED - producer validates this exists)
mkdir -p /tmp/kafka-watch

# Optional: Create output directory if using subdirectories
# (Not needed if output files are in current directory)
mkdir -p /tmp/kafka-output

2. Start Infrastructure and Build

# Ensure Kafka is running
./start-kafka.sh

# Verify infrastructure status
./scripts/kafka-status.sh

# Build the application (if not already built)
./gradlew clean build

# Create required topics (if not already created)
./scripts/create-topics.sh

3. Start Consumer(s) - Terminal 1

Start one or more consumers to receive words of specific lengths:

# Example 1: Single consumer for 5-letter words
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 5 output-5.txt

# Example 2: Multiple consumers for different word lengths
# Terminal 1: 3-letter words
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 3 output-3.txt

# Terminal 2: 4-letter words
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 4 output-4.txt

# Terminal 3: 5-letter words
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 5 output-5.txt

You should see log output indicating the consumer is running:

INFO  Consumer started - Topic: 5, Output: output-5.txt
INFO  Subscribed to topic: 5

4. Start Producer - Terminal 2

In a new terminal, start the producer:

java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ProducerApp /tmp/kafka-watch

You should see:

INFO  Producer started - Watch directory: /tmp/kafka-watch
INFO  Watching directory: /tmp/kafka-watch

5. Add Test Files

Now add a test file to the watch directory. The file will be automatically processed and deleted.

Simple test with echo:

echo "hello world from kafka testing application" > /tmp/kafka-watch/test.txt

Expected word routing:

  • "hello" (5 chars) → topic "5" → output-5.txt
  • "world" (5 chars) → topic "5" → output-5.txt
  • "from" (4 chars) → topic "4" → output-4.txt
  • "kafka" (5 chars) → topic "5" → output-5.txt
  • "testing" (7 chars) → topic "7" → output-7.txt
  • "application" (11 chars) → FILTERED OUT (exceeds 10 character limit)

Note: Only words with 3-10 characters are processed. Words outside this range are ignored.

6. Verify Results

Check the output files to verify words were processed:

# View content of output files
cat output-5.txt
# Expected output:
# hello
# world
# kafka

cat output-4.txt
# Expected output:
# from

# Check word counts
wc -l output-*.txt

Producer logs should show:

INFO  File detected: test.txt
INFO  Processing file: /tmp/kafka-watch/test.txt
INFO  Sent word 'hello' to topic '5'
INFO  Sent word 'world' to topic '5'
INFO  Sent word 'from' to topic '4'
INFO  Sent word 'kafka' to topic '5'
INFO  Sent word 'testing' to topic '7'
INFO  File processing completed: test.txt

Consumer logs should show:

INFO  Received word: hello
INFO  Received word: world
INFO  Received word: kafka

Quick Testing Tips

Using README.md as Test File

You can use existing files to test, but remember they will be deleted after processing:

# IMPORTANT: Make a copy! The original will be deleted.
cp README.md /tmp/kafka-watch/README-copy.md

This will process all words from the README and distribute them to topics 3-10.

Simple One-Liner Tests

# Test specific word lengths
echo "the cat sat" > /tmp/kafka-watch/test-3.txt          # All 3-letter words
echo "test" > /tmp/kafka-watch/test-4.txt                  # Single 4-letter word
echo "hello world" > /tmp/kafka-watch/test-5.txt           # 5-letter words

# Test word filtering (words too long or too short)
echo "a in application infrastructure" > /tmp/kafka-watch/test-filter.txt
# Result: "a" (1 char) and "in" (2 chars) are filtered out
#         "application" (11 chars) and "infrastructure" (14 chars) are filtered out

Testing Multiple Topics Simultaneously

Run consumers for all supported word lengths in separate terminals or background processes:

# Start all consumers (run each in separate terminal or use & for background)
for i in {3..10}; do
  java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
    org.example.ConsumerApp "$i" "output-$i.txt" &
done

# Add comprehensive test file
cat > /tmp/kafka-watch/comprehensive.txt << 'EOF'
The quick brown fox jumps over lazy dog
Testing distributed systems with Apache Kafka
Processing words character lengths three through ten
EOF

# Wait a moment, then check all output files
sleep 3
for i in {3..10}; do
  echo "=== Topic $i (${i}-letter words) ==="
  cat "output-$i.txt" 2>/dev/null || echo "No words"
done

Important Behaviors

File Deletion Warning

CRITICAL: Files are automatically deleted after successful processing.

When the producer processes a file:

  1. Reads the entire file content
  2. Splits content into words by whitespace
  3. Sends valid words (3-10 characters) to Kafka
  4. DELETES the original file upon successful completion

To preserve files:

  • Always copy files to the watch directory (don't move originals)
  • Use cp instead of mv: cp myfile.txt /tmp/kafka-watch/
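
Put together, the four lifecycle steps above map to code of roughly this shape (the method name is hypothetical; the real logic is in org.example.ProducerApp):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class FileLifecycleSketch {
    static void processAndDelete(Path file) throws IOException {
        String content = Files.readString(file);            // 1. read entire file (UTF-8 by default)
        for (String word : content.trim().split("\\s+")) {  // 2. split on whitespace
            // 3. valid (3-10 character) words are sent to Kafka here
        }
        Files.delete(file);                                 // 4. delete only after successful completion
    }
}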

Supported File Types

Any UTF-8 text file is supported:

  • No file extension restrictions (.txt, .md, .log, .csv, anything works)
  • Files must be readable as UTF-8 text
  • Binary files will cause errors

Examples:

echo "hello" > /tmp/kafka-watch/test.txt      # .txt works
echo "hello" > /tmp/kafka-watch/test.md       # .md works
echo "hello" > /tmp/kafka-watch/test.log      # .log works
echo "hello" > /tmp/kafka-watch/test          # no extension works

Word Processing Rules

Word Length Filter: 3-10 characters only

# Words that WILL be processed:
echo "cat dog bird" > /tmp/kafka-watch/valid.txt
# "cat" (3), "dog" (3), "bird" (4) - all valid

# Words that will be FILTERED OUT:
echo "a in application" > /tmp/kafka-watch/filtered.txt
# "a" (1) - too short
# "in" (2) - too short
# "application" (11) - too long

Word Splitting: By whitespace

  • Words are split using regex: \s+ (any whitespace)
  • Includes spaces, tabs, newlines
  • Empty strings are filtered out

Topic Routing: By word length

  • "cat" (3 chars) → topic "3"
  • "word" (4 chars) → topic "4"
  • "hello" (5 chars) → topic "5"
  • ... and so on up to topic "10"
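
Taken together, these rules amount to a few lines of logic. The sketch below is illustrative (class and variable names are hypothetical), not the project's exact code:

public class WordRulesSketch {
    public static void main(String[] args) {
        String line = "a in cat hello application";
        for (String word : line.split("\\s+")) {       // split on any whitespace
            int len = word.length();
            if (len < 3 || len > 10) continue;         // keep 3-10 character words only
            String topic = String.valueOf(len);        // topic name = word length
            System.out.println(word + " -> topic " + topic);
        }
    }
}

Running it prints cat -> topic 3 and hello -> topic 5; "a", "in", and "application" are filtered out.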

Automatic File Creation

Output files are created automatically:

  • You don't need to create output files beforehand
  • Consumers create files on first write
  • If using subdirectories (e.g., output/file.txt), the directory must exist
  • Files are appended to (not overwritten) if they already exist
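
This append-on-write behavior matches opening the output file with the CREATE and APPEND options, as in this hedged sketch (the real code is in org.example.ConsumerApp):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class OutputWriterSketch {
    // Creates the file on first write, then appends on every subsequent write.
    // Fails if the parent directory (e.g. output/) does not exist.
    static void appendWord(Path outputFile, String word) throws IOException {
        Files.writeString(outputFile, word + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}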

Verification Steps

Check Producer Status

# Producer should show these logs when running:
INFO  Producer started - Watch directory: /tmp/kafka-watch
INFO  Watching directory: /tmp/kafka-watch

# When a file is added:
INFO  File detected: <filename>
INFO  Processing file: /tmp/kafka-watch/<filename>
INFO  Sent word '<word>' to topic '<length>'

Check Consumer Status

# Consumer should show these logs when running:
INFO  Consumer started - Topic: <topic>, Output: <output-file>
INFO  Subscribed to topic: <topic>

# When receiving words:
INFO  Received word: <word>

Verify Output Files

# Check if output files were created
ls -lh output-*.txt

# View contents
cat output-5.txt

# Count words received
wc -l output-*.txt

# Check all output files
for f in output-*.txt; do
  echo "=== $f ==="
  cat "$f"
  echo
done

Verify Kafka Topics

# Check if topics exist
./scripts/create-topics.sh  # Will show existing topics

# Or use Kafka console consumer to verify messages
podman exec -it kafka-word-splitter-kafka \
  kafka-console-consumer.sh \
    --bootstrap-server localhost:9092 \
    --topic 5 \
    --from-beginning \
    --max-messages 10

Troubleshooting

Producer Won't Start

Error: "Watch directory does not exist"

# Solution: Create the directory before starting producer
mkdir -p /tmp/kafka-watch

Error: "Connection refused" or "Cannot connect to Kafka"

# Solution: Ensure Kafka is running
./scripts/kafka-status.sh

# If not running, start it:
./start-kafka.sh

Consumer Won't Start

Error: "Parent directory does not exist"

# Solution: If output file uses subdirectories, create them
mkdir -p output

# Then restart consumer with:
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 5 output/file.txt

Error: "Topic does not exist"

# Solution: Create required topics
./scripts/create-topics.sh

No Words in Output Files

Problem: Output file is empty or not created

Check:

  1. Word length: Only 3-10 character words are processed

    # Test with known valid words
    echo "hello world kafka" > /tmp/kafka-watch/test.txt
  2. File was detected: Check producer logs for "File detected" message

  3. File format: Ensure file is text (UTF-8)

    # Check file encoding
    file /tmp/kafka-watch/test.txt
  4. Consumer is subscribed to correct topic:

    • If your file contains 5-letter words, ensure consumer is listening to topic "5"
  5. Topic exists: Verify topic was created

    ./scripts/create-topics.sh

File Not Being Processed

Problem: File added to watch directory but nothing happens

Check:

  1. Producer is running: Look for "Watching directory" log message
  2. Directory is correct: Verify watch directory path matches
  3. File permissions: Ensure file is readable
    chmod 644 /tmp/kafka-watch/test.txt

Example Testing Session

Here's a complete testing session from start to finish:

# Terminal 1: Start Kafka
./start-kafka.sh
./scripts/create-topics.sh

# Terminal 2: Build and start consumer for 5-letter words
./gradlew clean build
mkdir -p /tmp/kafka-watch
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ConsumerApp 5 output-5.txt

# Terminal 3: Start producer
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
  org.example.ProducerApp /tmp/kafka-watch

# Terminal 4: Add test file
echo "hello world kafka" > /tmp/kafka-watch/test.txt

# Wait 2-3 seconds, then check results
cat output-5.txt
# Expected output:
# hello
# world
# kafka

# Verify file was deleted
ls /tmp/kafka-watch/
# Output: (empty - file was deleted after processing)

Success! You've verified the complete end-to-end workflow.

For more advanced testing scenarios and performance testing, see docs/getting-started/QUICK_START.md.

Kafka Infrastructure Management

This project uses Podman (or Docker) to run Kafka and Zookeeper locally. The infrastructure scripts automatically detect which container runtime you have installed.

Available Scripts

  • ./start-kafka.sh - Start Kafka infrastructure with automatic runtime detection
  • ./scripts/stop-kafka.sh - Gracefully stop Kafka infrastructure
  • ./scripts/kafka-status.sh - Check infrastructure health and connectivity
  • ./scripts/create-topics.sh - Create required Kafka topics (3-10)
  • ./scripts/validate-podman.sh - Validate Podman migration and installation

Container Runtime Detection

The scripts automatically detect and use available runtimes in this priority order:

  1. Podman Compose (podman compose) - recommended
  2. Podman Compose standalone (podman-compose)
  3. Docker Compose v2 (docker compose)
  4. Docker Compose v1 (docker-compose)

No configuration needed - just install your preferred runtime.

Infrastructure Operations

# Start Kafka (first time or after restart)
./start-kafka.sh

# Check status anytime
./scripts/kafka-status.sh

# Create topics before running application
./scripts/create-topics.sh

# View logs
podman logs kafka-word-splitter-kafka          # or use 'docker' instead
podman logs kafka-word-splitter-zookeeper

# Stop when done
./scripts/stop-kafka.sh

Troubleshooting

# Check if services are running
podman ps                                       # or 'docker ps'

# Restart infrastructure
./scripts/stop-kafka.sh && ./start-kafka.sh

# Validate installation (Podman users)
./scripts/validate-podman.sh

For detailed migration notes and platform-specific guidance, see MIGRATION-NOTES.md.

Documentation

Key documents referenced throughout this README:

  • docs/getting-started/QUICK_START.md - Quick start and testing guide
  • docs/getting-started/BUILD.md - Build documentation
  • ARCHITECTURE_REPORT.md - Detailed architecture report
  • docs/architecture/SHUTDOWN.md - Graceful shutdown behavior
  • SECURITY.md - Security policy and current status
  • MIGRATION-NOTES.md - Container runtime migration notes
  • SUPPORT.md - How to get help

Development

Setting Up Development Environment

# Clone the repository
git clone https://github.com/strawberry-code/kafka-word-splitter.git
cd kafka-word-splitter

# Build the project
./gradlew clean build

# Run tests
./gradlew test

# Run code quality checks
./gradlew checkstyleMain spotbugsMain

# Run security scan
./gradlew dependencyCheckAnalyze

Code Quality Tools

The project uses several quality tools:

  • Checkstyle: Code style validation
  • SpotBugs: Static bug detection
  • JaCoCo: Code coverage reporting
  • OWASP Dependency Check: Security vulnerability scanning

Run all quality checks:

./gradlew check

Local CI Simulation

Simulate the CI pipeline locally:

# Run the CI build script
./scripts/ci-build.sh

# Run security checks
./scripts/security-check.sh

# Run quality checks
./scripts/quality-check.sh

Security

Phase 1 Security Improvements

All critical and high severity CVEs have been patched:

  • CVE-2024-31141 (Kafka) - CRITICAL - Privilege escalation
  • CVE-2023-6378 (Logback) - HIGH - Deserialization DoS
  • CVE-2021-42550 (Logback) - MEDIUM - JNDI injection

Current Security Status

  • Zero critical or high severity vulnerabilities
  • Automated security scanning in CI/CD pipeline
  • Regular dependency updates via dependency-check plugin

See SECURITY.md for detailed security information.

Reporting Vulnerabilities

Please report security vulnerabilities privately using the process described in SECURITY.md.

Contributing

We welcome contributions! Please read our Contributing Guide for:

  • Development setup
  • Code style guidelines
  • Testing requirements
  • Pull request process
  • Code review guidelines

Quick Contribution Workflow

# 1. Fork the repository
# 2. Create a feature branch
git checkout -b feature/my-feature

# 3. Make your changes
# 4. Run quality checks
./gradlew check

# 5. Commit your changes
git commit -m "Add my feature"

# 6. Push and create a pull request
git push origin feature/my-feature

Technology Stack

  • Language: Java 17
  • Build Tool: Gradle 8.10.2
  • Message Broker: Apache Kafka 3.8.0
  • Logging: SLF4J 2.0.9 + Logback 1.4.14
  • Testing: JUnit 5.10.0
  • Container Runtime: Podman (recommended) or Docker
  • Container Orchestration: Compose v3.8 (Podman Compose or Docker Compose)
  • CI/CD: GitHub Actions

Project Status

  • Phase 1: COMPLETED - Critical stabilization, security patches, CI/CD
  • Current Version: 1.0-SNAPSHOT
  • Production Ready: Yes (with production Kafka configuration)

Roadmap

Completed

  • Build system stabilization
  • Security vulnerability patches
  • Code quality improvements
  • Graceful shutdown implementation
  • CI/CD pipeline
  • Comprehensive documentation

Planned

  • Unit and integration test suite
  • Metrics and monitoring integration
  • Configuration externalization
  • Performance optimization
  • Kubernetes deployment manifests

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Need help? See SUPPORT.md for ways to get it.

Acknowledgments

  • Apache Kafka community for excellent documentation
  • Gradle team for the build system
  • All contributors to this project

Author

  • Coordinator: strawberry-code

Built with Apache Kafka | Powered by Java 17 | Automated with GitHub Actions
