A production-ready Apache Kafka application demonstrating real-time file processing in a distributed environment. The application monitors directories for new files, splits content into words, and distributes them across Kafka topics based on word length.
- Features
- Architecture
- Prerequisites
- Quick Start
- Building the Project
- Running the Application
- Testing the Application
- Kafka Infrastructure Management
- Documentation
- Development
- Security
- Contributing
- License
- Support
- Real-time File Monitoring: Watches directories for new files using Java NIO WatchService
- Distributed Processing: Leverages Apache Kafka for scalable message distribution
- Dynamic Topic Routing: Routes words to topics based on length (3-10 characters)
- Graceful Shutdown: Production-ready shutdown hooks for clean resource cleanup
- Production Hardened: Security patches, comprehensive error handling, and resource management
- CI/CD Ready: Automated testing, security scanning, and code quality checks
- Container Runtime Support: Works with both Podman (recommended) and Docker with automatic detection
- Infrastructure Automation: Automated scripts for startup, shutdown, status checking, and topic creation
┌──────────────┐
│ File System │
│ (Watch Dir) │
└──────┬───────┘
│
v
┌──────────────┐ ┌─────────────┐ ┌──────────────┐
│ Producer │─────>│ Kafka │─────>│ Consumer │
│ (File Reader)│ │ Brokers │ │ (File Writer)│
└──────────────┘ │ Topics │ └──────────────┘
│ 3,4,5..10 │
└─────────────┘
- Producer Application
  - Monitors directory for new text files
  - Reads and splits file content into words
  - Sends each word to a Kafka topic based on word length
  - Package: org.example.ProducerApp
- Consumer Application
  - Subscribes to a specific Kafka topic (selected by word length)
  - Receives words from Kafka
  - Writes words to output files
  - Package: org.example.ConsumerApp
- Kafka Infrastructure
  - Message broker for distributed communication
  - Topics: 3, 4, 5, 6, 7, 8, 9, 10 (one per word length)
  - Deployed via Docker Compose (or Podman Compose)
For detailed architecture documentation, see ARCHITECTURE_REPORT.md.
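To make the data flow concrete, here are two minimal Java sketches of the producer and consumer loops. Both are illustrative simplifications, not the actual ProducerApp/ConsumerApp sources; the bootstrap address matches the local Compose setup, while the consumer group id is an assumption.

```java
import java.nio.file.*;

// Illustrative sketch of the producer watch loop (not the actual ProducerApp
// source): block on directory events, process each new file, then delete it.
public class ProducerSketch {
    public static void main(String[] args) throws Exception {
        Path watchDir = Paths.get(args[0]);              // e.g. /tmp/kafka-watch
        WatchService watcher = FileSystems.getDefault().newWatchService();
        watchDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take();               // blocks until a file appears
            for (WatchEvent<?> event : key.pollEvents()) {
                Path file = watchDir.resolve((Path) event.context());
                processFile(file);                       // read, split, route to Kafka
                Files.deleteIfExists(file);              // source file removed after success
            }
            key.reset();                                 // re-arm the key for further events
        }
    }

    static void processFile(Path file) { /* see the word-routing sketch in the Testing section */ }
}
```

```java
import java.io.FileWriter;
import java.io.PrintWriter;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Illustrative sketch of the consumer loop (not the actual ConsumerApp source).
// Args mirror the real CLI, e.g. "3 output-3.txt"; the group id is an assumption.
public class ConsumerSketch {
    public static void main(String[] args) throws Exception {
        String topic = args[0];
        String outputFile = args[1];

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "word-splitter-" + topic);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             PrintWriter out = new PrintWriter(new FileWriter(outputFile, true))) { // append mode
            consumer.subscribe(List.of(topic));
            while (true) {                               // runs until Ctrl+C / SIGTERM
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    out.println(record.value());         // one word per line
                }
                out.flush();
            }
        }
    }
}
```

Each consumer instance handles exactly one topic, which is why the run examples below start one JVM per word length.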
Before running this application, ensure you have:
- Java Development Kit (JDK) 17 or higher
- Podman and Podman Compose (recommended) OR Docker and Docker Compose
  - Required for the Kafka infrastructure
  - The application works with both container runtimes (auto-detected)
Podman (Recommended):
- macOS: brew install podman, then podman machine init && podman machine start
- Linux: sudo apt-get install podman (Ubuntu/Debian)
- Verify: podman --version and podman compose version
- Download: Podman Installation Guide
Docker (Alternative):
- Verify: docker --version and docker compose version
- Download: Docker Desktop
- Git (for cloning the repository)
  - Verify: git --version
Get up and running in 5 minutes:
# 1. Clone the repository
git clone https://github.com/strawberry-code/kafka-word-splitter.git
cd kafka-word-splitter
# 2. Start Kafka infrastructure (auto-detects Podman or Docker)
./start-kafka.sh
# 3. Build the project
./gradlew clean build
# 4. Create Kafka topics (required topics 3-10)
./scripts/create-topics.sh
# 5. Run the consumer (in one terminal)
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar org.example.ConsumerApp 3 output-3.txt
# 6. Run the producer (in another terminal)
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar org.example.ProducerApp /path/to/watch
# 7. Add a text file to the watch directory and see the magic!
echo "hello world from kafka" > /path/to/watch/test.txtNote: This project uses java -cp (classpath) instead of java -jar to run applications. This allows running both ProducerApp and ConsumerApp from the same JAR file by specifying the main class explicitly on the command line.
For a more detailed guide, see docs/getting-started/QUICK_START.md.
# Clean and build the project
./gradlew clean build
# Build fat JAR with all dependencies
./gradlew shadowJar
# Output: build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar

# Fast build without tests
./gradlew build -x test
# Continuous build (watches for changes)
./gradlew build --continuous

For comprehensive build documentation, see docs/getting-started/BUILD.md.
# Start Kafka and Zookeeper (auto-detects Podman or Docker)
./start-kafka.sh
# Check infrastructure status
./scripts/kafka-status.sh
# Create required topics (topics 3-10)
./scripts/create-topics.sh

The producer monitors a directory and sends words to Kafka:
# Using the fat JAR
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
org.example.ProducerApp /path/to/watch/directory
# Using Gradle
./gradlew run --args="/path/to/watch/directory"

The consumer reads from a specific topic and writes to a file:
# Subscribe to topic "3" (3-letter words)
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
org.example.ConsumerApp 3 output-3.txt
# Run multiple consumers for different word lengths
for i in {3..10}; do
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
org.example.ConsumerApp "$i" "output-$i.txt" &
done

# Gracefully stop Kafka and Zookeeper
./scripts/stop-kafka.sh

Both applications support graceful shutdown:

# Press Ctrl+C or send SIGTERM
kill -TERM <pid>

The applications will:
- Stop accepting new work
- Complete in-flight processing
- Flush and close Kafka connections
- Clean up all resources
See docs/architecture/SHUTDOWN.md for detailed shutdown behavior.
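In code, the hook looks roughly like this (a minimal sketch of the pattern with a KafkaProducer, assuming a running flag checked by the main loop; the project's actual behavior is specified in SHUTDOWN.md):

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

// Minimal shutdown-hook sketch (an assumed pattern, not the actual source):
// SIGTERM/Ctrl+C triggers the hook, which stops the work loop, then flushes
// and closes the producer so buffered words are not lost.
public class ShutdownSketch {
    static final AtomicBoolean running = new AtomicBoolean(true);

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            running.set(false);   // stop accepting new work
            producer.flush();     // complete in-flight sends
            producer.close();     // close connections, release resources
        }));

        while (running.get()) {
            Thread.sleep(100);    // stand-in for the real watch-and-send loop
        }
    }
}
```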
This section provides a complete end-to-end testing workflow to verify the application is working correctly after setup.
Before testing, ensure you have:
- Built the application (./gradlew clean build)
- Kafka infrastructure running (./start-kafka.sh)
- Required topics created (./scripts/create-topics.sh)
- Watch directory created manually (the producer will not start without it)
- Output directory created if using subdirectories (e.g., output/file.txt)
The producer requires the watch directory to exist before starting:
# Create watch directory (REQUIRED - producer validates this exists)
mkdir -p /tmp/kafka-watch
# Optional: Create output directory if using subdirectories
# (Not needed if output files are in current directory)
mkdir -p /tmp/kafka-output

# Ensure Kafka is running
./start-kafka.sh
# Verify infrastructure status
./scripts/kafka-status.sh
# Build the application (if not already built)
./gradlew clean build
# Create required topics (if not already created)
./scripts/create-topics.sh

Start one or more consumers to receive words of specific lengths:
# Example 1: Single consumer for 5-letter words
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
org.example.ConsumerApp 5 output-5.txt
# Example 2: Multiple consumers for different word lengths
# Terminal 1: 3-letter words
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
org.example.ConsumerApp 3 output-3.txt
# Terminal 2: 4-letter words
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
org.example.ConsumerApp 4 output-4.txt
# Terminal 3: 5-letter words
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
org.example.ConsumerApp 5 output-5.txt

You should see log output indicating the consumer is running:
INFO Consumer started - Topic: 5, Output: output-5.txt
INFO Subscribed to topic: 5
In a new terminal, start the producer:
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
org.example.ProducerApp /tmp/kafka-watch

You should see:
INFO Producer started - Watch directory: /tmp/kafka-watch
INFO Watching directory: /tmp/kafka-watch
Now add a test file to the watch directory. The file will be automatically processed and deleted.
Simple test with echo:
echo "hello world from kafka testing application" > /tmp/kafka-watch/test.txtExpected word routing:
- "hello" (5 chars) → topic "5" →
output-5.txt - "world" (5 chars) → topic "5" →
output-5.txt - "from" (4 chars) → topic "4" →
output-4.txt - "kafka" (5 chars) → topic "5" →
output-5.txt - "testing" (7 chars) → topic "7" →
output-7.txt - "application" (11 chars) → FILTERED OUT (exceeds 10 character limit)
Note: Only words with 3-10 characters are processed. Words outside this range are ignored.
Check the output files to verify words were processed:
# View content of output files
cat output-5.txt
# Expected output:
# hello
# world
# kafka
cat output-4.txt
# Expected output:
# from
# Check word counts
wc -l output-*.txt

Producer logs should show:
INFO File detected: test.txt
INFO Processing file: /tmp/kafka-watch/test.txt
INFO Sent word 'hello' to topic '5'
INFO Sent word 'world' to topic '5'
INFO Sent word 'from' to topic '4'
INFO Sent word 'kafka' to topic '5'
INFO Sent word 'testing' to topic '7'
INFO File processing completed: test.txt
Consumer logs should show:
INFO Received word: hello
INFO Received word: world
INFO Received word: kafka
You can use existing files to test, but remember they will be deleted after processing:
# IMPORTANT: Make a copy! The original will be deleted.
cp README.md /tmp/kafka-watch/README-copy.md

This will process all words from the README and distribute them to topics 3-10.
# Test specific word lengths
echo "the cat sat" > /tmp/kafka-watch/test-3.txt # All 3-letter words
echo "test" > /tmp/kafka-watch/test-4.txt # Single 4-letter word
echo "hello world" > /tmp/kafka-watch/test-5.txt # 5-letter words
# Test word filtering (words too long or too short)
echo "a in application infrastructure" > /tmp/kafka-watch/test-filter.txt
# Result: "a" (1 char) and "in" (2 chars) are filtered out
# "application" (11 chars) and "infrastructure" (14 chars) are filtered outRun consumers for all supported word lengths in separate terminals or background processes:
# Start all consumers (run each in separate terminal or use & for background)
for i in {3..10}; do
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
org.example.ConsumerApp "$i" "output-$i.txt" &
done
# Add comprehensive test file
cat > /tmp/kafka-watch/comprehensive.txt << 'EOF'
The quick brown fox jumps over lazy dog
Testing distributed systems with Apache Kafka
Processing words character lengths three through ten
EOF
# Wait a moment, then check all output files
sleep 3
for i in {3..10}; do
echo "=== Topic $i (${i}-letter words) ==="
cat "output-$i.txt" 2>/dev/null || echo "No words"
done

CRITICAL: Files are automatically deleted after successful processing.
When the producer processes a file:
- Reads the entire file content
- Splits content into words by whitespace
- Sends valid words (3-10 characters) to Kafka
- DELETES the original file upon successful completion
To preserve files:
- Always copy files to the watch directory (don't move originals)
- Use cp instead of mv: cp myfile.txt /tmp/kafka-watch/
Any UTF-8 text file is supported:
- No file extension restrictions (.txt, .md, .log, .csv, anything works)
- Files must be readable as UTF-8 text
- Binary files will cause errors
Examples:
echo "hello" > /tmp/kafka-watch/test.txt # .txt works
echo "hello" > /tmp/kafka-watch/test.md # .md works
echo "hello" > /tmp/kafka-watch/test.log # .log works
echo "hello" > /tmp/kafka-watch/test # no extension worksWord Length Filter: 3-10 characters only
# Words that WILL be processed:
echo "cat dog bird" > /tmp/kafka-watch/valid.txt
# "cat" (3), "dog" (3), "bird" (4) - all valid
# Words that will be FILTERED OUT:
echo "a in application" > /tmp/kafka-watch/filtered.txt
# "a" (1) - too short
# "in" (2) - too short
# "application" (11) - too longWord Splitting: By whitespace
- Words are split using the regex \s+ (any whitespace)
- Includes spaces, tabs, and newlines
- Empty strings are filtered out
Topic Routing: By word length
- "cat" (3 chars) → topic "3"
- "word" (4 chars) → topic "4"
- "hello" (5 chars) → topic "5"
- ... and so on up to topic "10"
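Taken together, the splitting, filtering, and routing rules amount to something like this (a sketch of the rules above, not the actual ProducerApp code):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of the split-filter-route pipeline (assumed code, not the actual source).
public class RoutingSketch {
    static void processFile(KafkaProducer<String, String> producer, Path file) throws Exception {
        String content = Files.readString(file, StandardCharsets.UTF_8);  // file must be UTF-8 text
        for (String word : content.split("\\s+")) {                       // split on any whitespace
            int len = word.length();
            if (len < 3 || len > 10) continue;           // drops empties, short and long words
            producer.send(new ProducerRecord<>(String.valueOf(len), word)); // topic "3".."10"
        }
    }
}
```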
Output files are created automatically:
- You don't need to create output files beforehand
- Consumers create files on first write
- If using subdirectories (e.g., output/file.txt), the directory must exist
- Files are appended to (not overwritten) if they already exist
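A sketch of that behavior (assumed code, not the actual ConsumerApp source): fail fast when a parent directory is missing, otherwise create-or-append:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Output-file handling sketch (assumed code): mirrors the rules listed above.
public class OutputFileSketch {
    static void appendWord(Path outputFile, String word) throws IOException {
        Path parent = outputFile.getParent();
        if (parent != null && !Files.exists(parent)) {   // e.g. output/ must already exist
            throw new IOException("Parent directory does not exist: " + parent);
        }
        // CREATE + APPEND: the file is created on first write and never truncated
        Files.writeString(outputFile, word + System.lineSeparator(), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```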
# Producer should show these logs when running:
INFO Producer started - Watch directory: /tmp/kafka-watch
INFO Watching directory: /tmp/kafka-watch
# When a file is added:
INFO File detected: <filename>
INFO Processing file: /tmp/kafka-watch/<filename>
INFO Sent word '<word>' to topic '<length>'

# Consumer should show these logs when running:
INFO Consumer started - Topic: <topic>, Output: <output-file>
INFO Subscribed to topic: <topic>
# When receiving words:
INFO Received word: <word>

# Check if output files were created
ls -lh output-*.txt
# View contents
cat output-5.txt
# Count words received
wc -l output-*.txt
# Check all output files
for f in output-*.txt; do
echo "=== $f ==="
cat "$f"
echo
done

# Check if topics exist
./scripts/create-topics.sh # Will show existing topics
# Or use Kafka console consumer to verify messages
podman exec -it kafka-word-splitter-kafka \
kafka-console-consumer.sh \
--bootstrap-server localhost:9092 \
--topic 5 \
--from-beginning \
--max-messages 10

Error: "Watch directory does not exist"
# Solution: Create the directory before starting producer
mkdir -p /tmp/kafka-watch

Error: "Connection refused" or "Cannot connect to Kafka"
# Solution: Ensure Kafka is running
./scripts/kafka-status.sh
# If not running, start it:
./start-kafka.sh

Error: "Parent directory does not exist"
# Solution: If output file uses subdirectories, create them
mkdir -p output
# Then restart consumer with:
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
org.example.ConsumerApp 5 output/file.txt

Error: "Topic does not exist"
# Solution: Create required topics
./scripts/create-topics.sh

Problem: Output file is empty or not created
Check:
- Word length: Only 3-10 character words are processed
  # Test with known valid words
  echo "hello world kafka" > /tmp/kafka-watch/test.txt
- File was detected: Check producer logs for the "File detected" message
- File format: Ensure the file is text (UTF-8)
  # Check file encoding
  file /tmp/kafka-watch/test.txt
- Consumer is subscribed to the correct topic:
  - If your file contains 5-letter words, ensure the consumer is listening to topic "5"
- Topic exists: Verify the topic was created
  ./scripts/create-topics.sh
Problem: File added to watch directory but nothing happens
Check:
- Producer is running: Look for "Watching directory" log message
- Directory is correct: Verify watch directory path matches
- File permissions: Ensure file is readable
chmod 644 /tmp/kafka-watch/test.txt
Here's a complete testing session from start to finish:
# Terminal 1: Start Kafka
./start-kafka.sh
./scripts/create-topics.sh
# Terminal 2: Build and start consumer for 5-letter words
./gradlew clean build
mkdir -p /tmp/kafka-watch
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
org.example.ConsumerApp 5 output-5.txt
# Terminal 3: Start producer
java -cp build/libs/kafka-word-splitter-1.0-SNAPSHOT-all.jar \
org.example.ProducerApp /tmp/kafka-watch
# Terminal 4: Add test file
echo "hello world kafka" > /tmp/kafka-watch/test.txt
# Wait 2-3 seconds, then check results
cat output-5.txt
# Expected output:
# hello
# world
# kafka
# Verify file was deleted
ls /tmp/kafka-watch/
# Output: (empty - file was deleted after processing)

Success! You've verified the complete end-to-end workflow.
For more advanced testing scenarios and performance testing, see docs/getting-started/QUICK_START.md.
This project uses Podman (or Docker) to run Kafka and Zookeeper locally. The infrastructure scripts automatically detect which container runtime you have installed.
- ./start-kafka.sh - Start Kafka infrastructure with automatic runtime detection
- ./scripts/stop-kafka.sh - Gracefully stop Kafka infrastructure
- ./scripts/kafka-status.sh - Check infrastructure health and connectivity
- ./scripts/create-topics.sh - Create required Kafka topics (3-10)
- ./scripts/validate-podman.sh - Validate Podman migration and installation
The scripts automatically detect and use available runtimes in this priority order:
- Podman Compose (podman compose) - recommended
- Podman Compose standalone (podman-compose)
- Docker Compose v2 (docker compose)
- Docker Compose v1 (docker-compose)
No configuration needed - just install your preferred runtime.
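./scripts/create-topics.sh provisions the required topics (3-10). For reference, equivalent provisioning can be expressed with Kafka's Java AdminClient; this is a sketch only, and the partition count and replication factor below are assumptions, not necessarily what the script uses:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

// Sketch: create topics "3".."10" programmatically (assumed settings).
public class CreateTopicsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            List<NewTopic> topics = new ArrayList<>();
            for (int len = 3; len <= 10; len++) {
                topics.add(new NewTopic(String.valueOf(len), 1, (short) 1)); // 1 partition, RF 1 (assumed)
            }
            admin.createTopics(topics).all().get();      // waits for broker-side completion
        }
    }
}
```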
# Start Kafka (first time or after restart)
./start-kafka.sh
# Check status anytime
./scripts/kafka-status.sh
# Create topics before running application
./scripts/create-topics.sh
# View logs
podman logs kafka-word-splitter-kafka # or use 'docker' instead
podman logs kafka-word-splitter-zookeeper
# Stop when done
./scripts/stop-kafka.sh

# Check if services are running
podman ps # or 'docker ps'
# Restart infrastructure
./scripts/stop-kafka.sh && ./start-kafka.sh
# Validate installation (Podman users)
./scripts/validate-podman.sh

For detailed migration notes and platform-specific guidance, see MIGRATION-NOTES.md.
- Quick Start Guide - 5-minute setup
- Build Instructions - Comprehensive build guide
- Architecture Report - System design and patterns
- Project Structure - File organization
- Shutdown Procedures - Graceful shutdown mechanisms
- Security Report - CVE patches and analysis
- Dependency Update Policy - Keeping dependencies secure
- CI/CD Documentation - Pipeline guide
- DevOps Report - Infrastructure setup
- Branch Protection - Git workflow
- Migration Notes - Docker to Podman migration
- Contributing Guide - How to contribute
- Support - Getting help
- Changelog - Version history
# Clone the repository
git clone https://github.com/strawberry-code/kafka-word-splitter.git
cd kafka-word-splitter
# Build the project
./gradlew clean build
# Run tests
./gradlew test
# Run code quality checks
./gradlew checkstyleMain spotbugsMain
# Run security scan
./gradlew dependencyCheckAnalyze

The project uses several quality tools:
- Checkstyle: Code style validation
- SpotBugs: Static bug detection
- JaCoCo: Code coverage reporting
- OWASP Dependency Check: Security vulnerability scanning
Run all quality checks:
./gradlew check

Simulate the CI pipeline locally:
# Run the CI build script
./scripts/ci-build.sh
# Run security checks
./scripts/security-check.sh
# Run quality checks
./scripts/quality-check.sh

All critical and high severity CVEs have been patched:
- CVE-2024-31141 (Kafka) - CRITICAL - Privilege escalation
- CVE-2023-6378 (Logback) - HIGH - Deserialization DoS
- CVE-2021-42550 (Logback) - MEDIUM - JNDI injection
- Zero critical or high severity vulnerabilities
- Automated security scanning in CI/CD pipeline
- Regular dependency updates via dependency-check plugin
See SECURITY.md for detailed security information.
Please report security vulnerabilities via:
- GitHub Security Advisories
- Email: security@example.com (configure this)
We welcome contributions! Please read our Contributing Guide for:
- Development setup
- Code style guidelines
- Testing requirements
- Pull request process
- Code review guidelines
# 1. Fork the repository
# 2. Create a feature branch
git checkout -b feature/my-feature
# 3. Make your changes
# 4. Run quality checks
./gradlew check
# 5. Commit your changes
git commit -m "Add my feature"
# 6. Push and create a pull request
git push origin feature/my-feature

- Language: Java 17
- Build Tool: Gradle 8.10.2
- Message Broker: Apache Kafka 3.8.0
- Logging: SLF4J 2.0.9 + Logback 1.4.14
- Testing: JUnit 5.10.0
- Container Runtime: Podman (recommended) or Docker
- Container Orchestration: Compose v3.8 (Podman Compose or Docker Compose)
- CI/CD: GitHub Actions
- Phase 1: COMPLETED - Critical stabilization, security patches, CI/CD
- Current Version: 1.0-SNAPSHOT
- Production Ready: Yes (with production Kafka configuration)
- Build system stabilization
- Security vulnerability patches
- Code quality improvements
- Graceful shutdown implementation
- CI/CD pipeline
- Comprehensive documentation
- Unit and integration test suite
- Metrics and monitoring integration
- Configuration externalization
- Performance optimization
- Kubernetes deployment manifests
This project is licensed under the MIT License - see the LICENSE file for details.
Need help? Here's how to get it:
- Documentation: Check the docs directory
- Issues: GitHub Issues
- Discussions: GitHub Discussions
See SUPPORT.md for more information.
- Apache Kafka community for excellent documentation
- Gradle team for the build system
- All contributors to this project
- Coordinator: strawberry-code
Built with Apache Kafka | Powered by Java 17 | Automated with GitHub Actions