metadata-ingestion/docs/sources/kafka-connect/README.md
This plugin extracts the following:
- For Source connectors - Data Jobs to represent lineage from source dataset to Kafka topic, one per `{connector_name}:{source_dataset}` combination
- For Sink connectors - Data Jobs to represent lineage from Kafka topic to destination dataset, one per `{connector_name}:{topic}` combination
### Requirements
**Java Runtime Dependency:**
This source requires Java to be installed and available on the system for transform pipeline support (RegexRouter, etc.). The Java runtime is accessed via JPype to enable Java regex pattern matching that's compatible with Kafka Connect transforms.
- **Docker deployments**: Ensure your DataHub ingestion Docker image includes a Java runtime. The official DataHub images include Java by default.
- **Impact**: Without Java, transform pipeline features will be disabled and lineage accuracy may be reduced for connectors using transforms.
**Note for Docker users**: If you're building custom Docker images for DataHub ingestion, ensure a Java Runtime Environment (JRE) is included in your image to support full transform pipeline functionality.
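Because the source degrades gracefully when Java is missing, it can be useful to verify up front that a Java runtime is visible to the ingestion process. A minimal sketch of such a check (illustrative only — DataHub performs its own detection via JPype; the `java_available` helper is not part of DataHub's API):

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` binary is on PATH and responds to -version."""
    java = shutil.which("java")
    if java is None:
        return False
    try:
        subprocess.run([java, "-version"], capture_output=True, check=True)
        return True
    except (OSError, subprocess.CalledProcessError):
        return False


if not java_available():
    print("WARNING: no Java runtime found; transform pipeline support will be limited")
```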
### Environment Support
DataHub's Kafka Connect source supports both **self-hosted** and **Confluent Cloud** environments with automatic detection and environment-specific topic retrieval strategies:
#### Self-hosted Kafka Connect
- **Topic Discovery**: Uses the runtime `/connectors/{name}/topics` API endpoint
- **Accuracy**: Returns the actual topics that connectors are currently reading from or writing to
- **Benefits**: Most accurate topic information, as it reflects actual runtime state
- **Requirements**: Standard Kafka Connect REST API access
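Querying that runtime endpoint is straightforward. A hedged sketch (the URL shape follows the standard Kafka Connect REST API; the helper names here are illustrative, not DataHub's internals):

```python
import json
from urllib.request import urlopen


def topics_url(connect_uri: str, connector: str) -> str:
    """Build the /connectors/{name}/topics endpoint URL."""
    return f"{connect_uri.rstrip('/')}/connectors/{connector}/topics"


def parse_topics(payload: dict, connector: str) -> list:
    # Kafka Connect responds with: {"<connector>": {"topics": ["topic-a", ...]}}
    return payload[connector]["topics"]


def active_topics(connect_uri: str, connector: str) -> list:
    """Fetch the topics a connector is currently reading/writing."""
    with urlopen(topics_url(connect_uri, connector)) as resp:
        return parse_topics(json.load(resp), connector)
```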
#### Confluent Cloud
- **Topic Discovery**: Uses the comprehensive Kafka REST API v3 for optimal transform pipeline support, with config-based fallback
- **Method**: Retrieves all topics from the Kafka cluster via the REST API, then applies the reverse transform pipeline for accurate mappings
- **Transform Support**: Full support for complex transform pipelines via a reverse-pipeline strategy using actual cluster topics
- **Fallback**: Falls back to config-based derivation if the Kafka API is unavailable
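The fallback described above amounts to: prefer the live topic list from the Kafka cluster, and only derive topics from the connector config (its `topics` or `topics.regex` setting) when that call fails. A sketch under those assumptions (function names are illustrative, not DataHub's internals):

```python
import re


def topics_from_config(config: dict, all_known=None) -> list:
    """Config-based derivation: an explicit `topics` list, or `topics.regex`
    matched against whatever topics are already known."""
    if "topics" in config:
        return [t.strip() for t in config["topics"].split(",") if t.strip()]
    if "topics.regex" in config and all_known is not None:
        pattern = re.compile(config["topics.regex"])
        return [t for t in all_known if pattern.fullmatch(t)]
    return []


def discover_topics(fetch_cluster_topics, config: dict) -> list:
    """Prefer the Kafka REST API; fall back to config if it is unavailable."""
    try:
        return fetch_cluster_topics()  # e.g. a Kafka REST API v3 topic listing
    except Exception:
        return topics_from_config(config)
```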
**Environment Detection**: Automatically detects environment based on `connect_uri` patterns containing `confluent.cloud`.
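That heuristic reduces to a substring check on the host portion of `connect_uri` — a simplified sketch of the idea, not DataHub's exact code:

```python
from urllib.parse import urlparse


def is_confluent_cloud(connect_uri: str) -> bool:
    # Confluent Cloud Connect endpoints live under *.confluent.cloud hosts.
    return "confluent.cloud" in urlparse(connect_uri).netloc
```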
### Concept Mapping
This ingestion source maps the following Source System Concepts to DataHub Concepts:
DataHub supports different connector types with varying levels of lineage extraction capabilities depending on the environment (self-hosted vs Confluent Cloud):
### Source Connectors
| Connector Type | Self-hosted Support | Confluent Cloud Support | Topic Discovery Method | Lineage Extraction |
|----------------|---------------------|-------------------------|------------------------|--------------------|
| **BigQuery Sink**<br/>`com.wepay.kafka.connect.bigquery.BigQuerySinkConnector` | ✅ Full | ✅ Full | Runtime API / Config-based | Topic → Table mapping |
| **S3 Sink**<br/>`io.confluent.connect.s3.S3SinkConnector` | ✅ Full | ✅ Full | Runtime API / Config-based | Topic → S3 object mapping |
| **Snowflake Sink**<br/>`com.snowflake.kafka.connector.SnowflakeSinkConnector` | ✅ Full | ✅ Full | Runtime API / Config-based | Topic → Table mapping |
| **Cloud PostgreSQL Sink**<br/>`PostgresSink` | ✅ Full | ✅ Full | Runtime API / Config-based | Topic → Table mapping |
| **Cloud MySQL Sink**<br/>`MySqlSink` | ✅ Full | ✅ Full | Runtime API / Config-based | Topic → Table mapping |
| **Cloud Snowflake Sink**<br/>`SnowflakeSink` | ✅ Full | ✅ Full | Runtime API / Config-based | Topic → Table mapping |
**Legend:**
- ✅ **Full**: Complete lineage extraction with accurate topic discovery
- ✅ **Partial**: Lineage extraction supported but topic discovery may be limited (config-based only)
- 🔧 **Config Required**: Requires `generic_connectors` configuration for lineage mapping
### Supported Transforms
DataHub uses an **advanced transform pipeline strategy** that automatically handles complex transform chains by applying the complete pipeline to all topics and checking if results exist. This provides robust support for any combination of transforms.
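The strategy described above can be sketched for a chain of RegexRouter-style transforms: apply each transform's regex rewrite to every actual cluster topic and record the original-to-final name mapping. Note that the real RegexRouter uses Java regex semantics (hence the Java requirement noted earlier); this Python sketch with `re` is only an approximation, and its helper names are illustrative:

```python
import re


def apply_regex_router(topic: str, pattern: str, replacement: str) -> str:
    """Approximate RegexRouter: rewrite the topic name if the whole name
    matches `pattern`, otherwise leave it unchanged. (Java RegexRouter uses
    $1-style backreferences; Python's `re` uses \\1-style.)"""
    m = re.fullmatch(pattern, topic)
    return m.expand(replacement) if m else topic


def resolve_lineage(cluster_topics, transforms) -> dict:
    """Apply the full transform chain to every actual topic and return
    original -> final-name mappings (the reverse-pipeline idea)."""
    mapping = {}
    for topic in cluster_topics:
        name = topic
        for t in transforms:
            name = apply_regex_router(name, t["regex"], t["replacement"])
        mapping[topic] = name
    return mapping
```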