# Kafka Streaming Source


Description
-----------
Kafka streaming source. Emits a record with the schema specified by the user. If no schema
is specified, it will emit a record with two fields: 'key' (nullable string) and 'message'
(bytes). This plugin uses the Kafka 0.10.2 Java APIs.


Use Case
--------
This source is used whenever you want to read from Kafka. For example, you may want to read messages
from Kafka and write them to a Table.


Properties
----------
**referenceName:** This will be used to uniquely identify this source for lineage, annotating metadata, etc.

**brokers:** List of Kafka brokers specified in host1:port1,host2:port2 form. (Macro-enabled)

**topic:** The Kafka topic to read from. (Macro-enabled)

**partitions:** List of topic partitions to read from. If not specified, all partitions will be read. (Macro-enabled)

**defaultInitialOffset:** The default initial offset for all topic partitions.
An offset of -2 means the smallest offset. An offset of -1 means the latest offset. Defaults to -1.
Offsets are inclusive. If an offset of 5 is used, the message at offset 5 will be read.
If you wish to set different initial offsets for different partitions, use the initialPartitionOffsets property. (Macro-enabled)

**initialPartitionOffsets:** The initial offset for each topic partition. If this is not specified,
all partitions will use the same initial offset, which is determined by the defaultInitialOffset property.
Any partitions specified in the partitions property but not in this property will use the defaultInitialOffset.
An offset of -2 means the smallest offset. An offset of -1 means the latest offset.
Offsets are inclusive. If an offset of 5 is used, the message at offset 5 will be read. (Macro-enabled)
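
For instance, here is a minimal sketch combining the three offset-related properties. The
comma-separated partition:offset syntax shown for initialPartitionOffsets is an assumption
made for illustration; check your plugin version for the exact key-value format:

    {
        "partitions": "0,1,2",
        "defaultInitialOffset": "-1",
        "initialPartitionOffsets": "0:100,1:100"
    }

In this sketch, partitions 0 and 1 would start reading at offset 100 (inclusive), while
partition 2 would fall back to the defaultInitialOffset of -1, the latest offset.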

**schema:** Output schema of the source. If you would like the output records to contain a field with the
Kafka message key, the schema must include a field of type bytes or nullable bytes, and you must set the
keyField property to that field's name. Similarly, if you would like the output records to contain a field with
the timestamp of when the record was read, the schema must include a field of type long or nullable long, and you
must set the timeField property to that field's name. Any field that is not the timeField or keyField will be used
in conjunction with the format to parse Kafka message payloads.

**format:** Optional format of the Kafka event message. Any format supported by CDAP is supported.
For example, a value of 'csv' will attempt to parse Kafka payloads as comma-separated values.
If no format is given, Kafka message payloads will be treated as bytes.

**timeField:** Optional name of the field containing the read time of the batch.
If this is not set, no time field will be added to output records.
If set, this field must be present in the schema property and must be a long.

**keyField:** Optional name of the field containing the message key.
If this is not set, no key field will be added to output records.
If set, this field must be present in the schema property and must be bytes.

**partitionField:** Optional name of the field containing the partition the message was read from.
If this is not set, no partition field will be added to output records.
If set, this field must be present in the schema property and must be an int.

**offsetField:** Optional name of the field containing the partition offset the message was read from.
If this is not set, no offset field will be added to output records.
If set, this field must be present in the schema property and must be a long.
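
As an illustrative sketch (the field names 'partition' and 'offset' below are placeholders,
not required names), the two properties could be set as:

    {
        "partitionField": "partition",
        "offsetField": "offset"
    }

With these set, the schema property would also need an int field named 'partition' and a
long field named 'offset'.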

**maxRatePerPartition:** Maximum number of records to read per second per partition. Defaults to 1000.
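
For example, with the default maxRatePerPartition of 1000, a topic with 4 partitions, and a
10-second batch interval, each micro batch would contain at most 1000 * 4 * 10 = 40,000 messages.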

**principal:** The Kerberos principal used for the source when Kerberos security is enabled for Kafka.

**keytabLocation:** The keytab location for the Kerberos principal when Kerberos security is enabled for Kafka.
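
As an illustrative sketch (the principal and keytab path below are placeholder values), the
two properties might look like:

    {
        "principal": "kafka-client/host1.example.com@EXAMPLE.COM",
        "keytabLocation": "/etc/security/keytabs/kafka-client.keytab"
    }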

Example
-------
This example reads from the 'purchases' topic of a Kafka instance running
on brokers host1.example.com:9092 and host2.example.com:9092. The source will add
a time field named 'readTime' that contains a timestamp corresponding to the micro
batch in which the record was read. Each record will also contain a field named 'key'
holding the message key. It parses the Kafka messages using the 'csv' format
with 'user', 'item', 'count', and 'price' as the message schema.

    {
        "name": "Kafka",
        "type": "streamingsource",
        "properties": {
            "topic": "purchases",
            "brokers": "host1.example.com:9092,host2.example.com:9092",
            "format": "csv",
            "timeField": "readTime",
            "keyField": "key",
            "schema": "{
                \"type\":\"record\",
                \"name\":\"purchase\",
                \"fields\":[
                    {\"name\":\"readTime\",\"type\":\"long\"},
                    {\"name\":\"key\",\"type\":\"bytes\"},
                    {\"name\":\"user\",\"type\":\"string\"},
                    {\"name\":\"item\",\"type\":\"string\"},
                    {\"name\":\"count\",\"type\":\"int\"},
                    {\"name\":\"price\",\"type\":\"double\"}
                ]
            }"
        }
    }

For each Kafka message read, it will output a record with the schema:

    +=====================+
    | field name | type   |
    +=====================+
    | readTime   | long   |
    | key        | bytes  |
    | user       | string |
    | item       | string |
    | count      | int    |
    | price      | double |
    +=====================+

Note that the readTime field is not derived from the Kafka message, but from the time that the
message was read.
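
For example, if a message with key 'user0' and payload 'samuel,wallet,1,11.99' (values
illustrative) were read, the output record would contain readTime set to the micro batch
time, key set to the bytes of 'user0', user 'samuel', item 'wallet', count 1, and price 11.99.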