@btkcodedev

Closes #14733

Summary

Adds an HBase metadata ingestion plugin to DataHub to extract namespaces, tables, column families and schema metadata via the HBase Thrift API. Includes backend source implementation, plugin registration, UI config stubs, docs and minimal wiring to surface the connector in the ingestion UI.

Motivation

Adds support for HBase metadata ingestion so users can index HBase namespaces and tables into DataHub, making them discoverable, searchable, and available for lineage.

What changed

Added backend source implementation
Registered plugin entrypoint

The connector is a StatefulIngestionSourceBase and supports:
Containers: HBase namespaces
Schema metadata: row key + column families / qualifiers
Deletion detection via stateful ingestion
Platform instance and env config
Thrift-based connection (Thrift / happybase recommended)
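
As a rough illustration of the schema mapping listed above (row key plus column families), the sketch below turns a happybase-style `families()` result into a flat field list. `FieldSketch` and `fields_from_families` are hypothetical names for this example, not the connector's actual classes, and the input is assumed to be a dict keyed by column family name as bytes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FieldSketch:
    """Hypothetical stand-in for a DataHub schema field."""
    path: str
    native_type: str
    description: str


def fields_from_families(families: Dict[bytes, dict]) -> List[FieldSketch]:
    """Build one field for the row key, then one per column family."""
    fields = [FieldSketch("rowkey", "bytes", "HBase row key")]
    for raw_name in sorted(families):
        # Family names may carry a trailing ':' separator; drop it if present.
        name = raw_name.decode("utf-8").rstrip(":")
        fields.append(FieldSketch(name, "column_family", f"Column family {name}"))
    return fields
```

Sorting the family names keeps the emitted schema stable between runs, which matters if downstream state compares one ingestion run against the next.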

Compatibility / Risk

No breaking changes.
Requires Thrift-related dependencies and (optionally) UI assets.
If the Thrift libraries are missing, the connector logs a clear error and fails to connect.
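
A minimal sketch of the missing-dependency behaviour described above, assuming the connector guards its imports at connect time. `load_optional_dependency` is a hypothetical helper written for this example, not a function taken from the PR.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


def load_optional_dependency(module_name: str) -> Optional[ModuleType]:
    """Import a module by name; log a clear error and return None if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.error(
            "Missing dependency %r; install it to use the HBase source.",
            module_name,
        )
        return None
```

A caller would check for `None` and abort the connection attempt rather than fail later with an unrelated `NameError`.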

@github-actions bot added the ingestion (PR or Issue related to the ingestion of metadata), product (PR or Issue related to the DataHub UI/UX), and community-contribution (PR or Issue raised by member(s) of DataHub Community) labels on Nov 10, 2025
@datahub-cyborg bot added the needs-review (Label for PRs that need review from a maintainer) label on Nov 10, 2025
@codecov

codecov bot commented Nov 10, 2025

Codecov Report

❌ Patch coverage is 7.69231% with 24 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines | Patch % | Lines
...eb-react/src/app/ingest/source/conf/hbase/hbase.ts | 7.69% | 12 Missing ⚠️
...-react/src/app/ingestV2/source/conf/hbase/hbase.ts | 7.69% | 12 Missing ⚠️

@codecov

codecov bot commented Nov 10, 2025

Bundle Report

Changes will increase total bundle size by 8.36kB (0.03%) ⬆️. This is within the configured threshold ✅

Detailed changes
Bundle name | Size | Change
datahub-react-web-esm | 28.64MB | 8.36kB (0.03%) ⬆️

Affected Assets, Files, and Routes:

view changes for bundle: datahub-react-web-esm

Assets Changed:

Asset Name | Size Change | Total Size | Change (%)
assets/index-*.js | 3.87kB | 19.01MB | 0.02%
assets/hbaselogo-*.png (New) | 4.49kB | 4.49kB | 100.0% 🚀

Files in assets/index-*.js:

  • ./src/app/ingestV2/source/builder/constants.ts → Total Size: 5.9kB

  • ./src/app/ingest/source/builder/constants.ts → Total Size: 6.87kB

  • ./src/app/ingest/source/builder/sources.json → Total Size: 35.11kB

  • ./src/app/ingestV2/source/builder/RecipeForm/constants.ts → Total Size: 10.03kB

  • ./src/app/ingestV2/source/builder/sources.json → Total Size: 34.27kB

  • ./src/images/hbaselogo.png → Total Size: 45 bytes

  • ./src/app/ingestV2/source/builder/RecipeForm/hbase.ts → Total Size: 2.67kB

"glue = datahub.ingestion.source.aws.glue:GlueSource",
"sagemaker = datahub.ingestion.source.aws.sagemaker:SagemakerSource",
"hana = datahub.ingestion.source.sql.hana:HanaSource",
"hbase = datahub.ingestion.source.sql.hbase:HBaseSource",
Contributor

Author

Added ✅

Contributor

this rectangle shape may not fit well in the UI, have you checked?
a logo that fits better in a square/circle shape would look better

Author

Resized to square

@datahub-cyborg bot added the pending-submitter-response (Issue/request has been reviewed but requires a response from the submitter) label and removed the needs-review (Label for PRs that need review from a maintainer) label on Nov 12, 2025
Contributor

@sgomezvillamor left a comment

Overall looks good

I miss:

@sgomezvillamor
Contributor

Please double-check that you followed the steps defined here: https://docs.datahub.com/docs/metadata-ingestion/adding-source (e.g., I miss updates in constants.ts)

@datahub-cyborg bot added the needs-review (Label for PRs that need review from a maintainer) label and removed the pending-submitter-response (Issue/request has been reviewed but requires a response from the submitter) label on Nov 13, 2025
@btkcodedev
Author

Changes Made:

Backend:

  • Added dependencies (happybase, thrift) to setup.py
  • Fixed linting issues in hbase.py

Frontend:

  • Added HBase to the sources registry datahub-web-react/src/app/ingest/source/builder/sources.json with the default recipe
  • Created UI form fields datahub-web-react/src/app/ingestV2/source/builder/RecipeForm/hbase.ts
  • Updated constants for logo and URN mapping

Documentation:

  • Resized logo to square
  • Added setup guide and sample recipe

Contributor

HBase is not a SQL database, so the source should not live in metadata-ingestion/src/datahub/ingestion/source/sql/

Could you place it in metadata-ingestion/src/datahub/ingestion/source/hbase/?

Author

Created dedicated src/datahub/ingestion/source/hbase/ directory

self.connection = Hbase.Client(protocol)

# Open connection
transport.open()
Contributor

should this be closed?

Author

Revamped the connection lifecycle; it is now properly managed
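
One common way to make sure an opened transport is always closed is a context manager. The sketch below is an illustration of that pattern, not the code from the PR; `opened` and `FakeTransport` are hypothetical names, and the only assumption about the real transport is that it exposes `open()` and `close()`.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def opened(transport: Any) -> Iterator[Any]:
    """Open the transport and guarantee close() runs, even on errors."""
    transport.open()
    try:
        yield transport
    finally:
        transport.close()


class FakeTransport:
    """Stand-in for a Thrift transport, used only to demonstrate the pattern."""

    def __init__(self) -> None:
        self.is_open = False

    def open(self) -> None:
        self.is_open = True

    def close(self) -> None:
        self.is_open = False
```

With this shape, `with opened(transport): ...` releases the connection whether the metadata extraction finishes normally or raises.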

    )
    auth_mechanism: Optional[str] = Field(
        default=None,
        description="Authentication mechanism (None, KERBEROS, or custom)",
Contributor

Do we support Kerberos in the current implementation?
What does "custom" mean?

Author

Removed auth_mechanism field

- Table properties and configuration
HBase is a distributed, scalable, big data store built on top of Hadoop.
This connector uses the HBase Thrift API to extract metadata.
Contributor

Code uses Thrift1 API but doesn't validate version. HBase supports both Thrift1 and Thrift2 with different APIs:

  • Thrift1: Older, limited namespace support
  • Thrift2: Newer, full namespace support

Current implementation works around Thrift1 limitations by parsing table names, but this is fragile.

Wouldn't it be better to go with Thrift2 directly?
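
For context, the table-name parsing mentioned above could look roughly like the sketch below. It relies on the HBase convention that fully qualified table names are written `namespace:table` and that unqualified tables belong to the `default` namespace. `split_table_name` is a hypothetical helper for illustration, not the function from the PR.

```python
from typing import Tuple

DEFAULT_NAMESPACE = "default"  # HBase's built-in namespace for unqualified tables


def split_table_name(full_name: str) -> Tuple[str, str]:
    """Split 'namespace:table' into its parts, falling back to 'default'."""
    namespace, separator, table = full_name.partition(":")
    if not separator:
        # No ':' present, so the whole string is the table name.
        return DEFAULT_NAMESPACE, full_name
    return namespace, table
```

The fragility the reviewer points at is visible here: the namespace is inferred from a string convention rather than returned by the API.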

Contributor

Also, have you considered happybase? It seems to be a very popular Python library for consuming HBase, and it may save you from setting up Thrift.

Author

Replaced raw Thrift implementation with happybase library

        default=True,
        description="Include column families as schema metadata",
    )
    max_column_qualifiers: int = Field(
Contributor

is this being used?
I haven't found code for sampling in this source

Author

Removed max_column_qualifiers field from configuration

@datahub-cyborg bot added the pending-submitter-response (Issue/request has been reviewed but requires a response from the submitter) label and removed the needs-review (Label for PRs that need review from a maintainer) label on Nov 14, 2025
Contributor

@sgomezvillamor left a comment

Left some comments

The main blocker is still testing:

  • unit tests for configuration validation
  • unit tests for namespace/table discovery and schema field generation
  • integration tests (with docker?)
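
To make the first item concrete, here is a minimal sketch of what configuration-validation tests might look like. `HBaseConfigSketch` is a hypothetical stdlib stand-in; the real config class in the PR uses `Field(...)` declarations and may validate differently. The default port of 9090 is an assumption based on the usual HBase Thrift server port.

```python
from dataclasses import dataclass


@dataclass
class HBaseConfigSketch:
    """Hypothetical stand-in for the connector's config model."""
    host: str
    port: int = 9090  # assumed default; 9090 is the usual HBase Thrift port

    def __post_init__(self) -> None:
        if not self.host:
            raise ValueError("host must not be empty")
        if not 1 <= self.port <= 65535:
            raise ValueError(f"port out of range: {self.port}")


def test_defaults() -> None:
    assert HBaseConfigSketch(host="localhost").port == 9090


def test_rejects_bad_port() -> None:
    try:
        HBaseConfigSketch(host="localhost", port=0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for port=0")
```

Tests of this kind need no running HBase instance, so they can run on every commit while the Docker-based integration tests are still being set up.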

Contributor

The v1 frontend was deprecated, so you can skip updating anything in datahub-web-react/src/app/ingest/source/

Just keep the updates in datahub-web-react/src/app/ingestV2/

@btkcodedev
Author

  • Moved from sql/ to src/datahub/ingestion/source/hbase/
  • Switched to happybase library (replaced raw Thrift)
  • Removed unused max_column_qualifiers config field
  • Removed unsupported auth_mechanism config field
  • Added proper connection cleanup in close() method
  • Added 65+ test cases (20+ config, 35+ unit, 10+ integration)
  • Updated setup.py dependencies: happybase>=1.2.0

All tests are passing

@btkcodedev
Author

Integration tests are still in development (facing some issues with Docker)

@datahub-cyborg bot added the needs-review (Label for PRs that need review from a maintainer) label and removed the pending-submitter-response (Issue/request has been reviewed but requires a response from the submitter) label on Nov 17, 2025
@btkcodedev
Author

Integration tests are passing with mock

Labels

community-contribution: PR or Issue raised by member(s) of DataHub Community
ingestion: PR or Issue related to the ingestion of metadata
needs-review: Label for PRs that need review from a maintainer
product: PR or Issue related to the DataHub UI/UX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

How can I ingest the metadata of HBase?

3 participants