@jjoyce0510
Collaborator

Introducing Documents in DataHub (Context)

This PR introduces a new Document entity to DataHub, enabling users to create, manage, and organize first-party knowledge base content directly within the platform. Documents can be hierarchically organized, linked to data assets, and managed through a complete lifecycle including draft/publish workflows.

Core Data Models

Introduces comprehensive metadata models for the Document entity in DataHub:

Entity Definition

  • New document entity with key aspect documentKey and search capabilities
  • Full support for standard DataHub aspects: ownership, domains, tags, glossary terms, structured properties, institutional memory

Core Aspects (PDL Models)

  • DocumentKey - Unique identifier for documents
  • DocumentInfo - Primary aspect containing:
    • Title and text contents
    • Document status (PUBLISHED/UNPUBLISHED)
    • Source information (distinguishes first-party vs third-party ingested documents)
    • Audit stamps (created/lastModified with actor and timestamp)
    • Hierarchical parent-child relationships
    • Related assets (datasets, dashboards, etc.) and related documents
    • Draft workflow support via draftOf field
  • DocumentContents - Text content storage
  • DocumentStatus & DocumentState - Publication state management
  • DocumentSource - Tracking external sources for third-party integrations
  • ParentDocument, RelatedAsset, RelatedDocument - Relationship models
  • DraftOf - Draft-to-published document linking
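
To make the shape of the DocumentInfo aspect easier to picture, here is a minimal Python sketch of the fields listed above. It is not the PDL schema from this PR: the field names, the external_source flag, and the URN strings are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DocumentState(Enum):
    PUBLISHED = "PUBLISHED"
    UNPUBLISHED = "UNPUBLISHED"


@dataclass
class AuditStamp:
    actor: str  # URN of the user who made the change
    time: int   # epoch milliseconds


@dataclass
class DocumentInfo:
    # Illustrative mirror of the aspect; names are assumptions, not the PDL fields.
    title: str
    contents: str
    state: DocumentState
    created: AuditStamp
    last_modified: AuditStamp
    external_source: bool = False          # hypothetical: True for third-party ingested docs
    parent_document: Optional[str] = None  # URN of the parent, None for a root document
    related_assets: List[str] = field(default_factory=list)
    related_documents: List[str] = field(default_factory=list)
    draft_of: Optional[str] = None         # URN of the published document this drafts

    def is_draft(self) -> bool:
        # A document is treated as a draft when it points at a published document.
        return self.draft_of is not None


stamp = AuditStamp(actor="urn:li:corpuser:jdoe", time=1762905600000)
draft = DocumentInfo(
    title="Runbook",
    contents="Restart steps...",
    state=DocumentState.UNPUBLISHED,
    created=stamp,
    last_modified=stamp,
    draft_of="urn:li:document:runbook",  # assumed URN format
)
```

Keeping draft_of on the primary aspect, as the list above describes, means a draft can be recognized from a single aspect read.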

GraphQL APIs

Comprehensive GraphQL API surface in knowledge.graphql:

Mutations

  1. createDocument - Create new documents with content, relationships, and hierarchy
    • Supports custom IDs or auto-generated UUIDs
    • Can create as draft or published
    • Automatic ownership assignment to creator
  2. updateDocumentContents - Update document text and title
  3. updateDocumentRelatedEntities - Manage relationships to assets and other documents
  4. moveDocument - Relocate documents within the hierarchy
  5. deleteDocument - Remove documents and their references
  6. updateDocumentStatus - Toggle between PUBLISHED/UNPUBLISHED states
  7. mergeDraft - Merge draft content into published document with optional draft deletion
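
The ID handling in createDocument and the merge step in mergeDraft can be sketched with a toy in-memory store. This is an illustration under stated assumptions, not the resolver or DocumentService code from this PR: the class, method names, default state, and URN format are hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Doc:
    title: str
    text: str
    state: str = "UNPUBLISHED"      # PUBLISHED or UNPUBLISHED
    draft_of: Optional[str] = None  # URN of the published document, if this is a draft


class InMemoryDocuments:
    """Toy stand-in for a document store; not DataHub's DocumentService."""

    def __init__(self) -> None:
        self.docs: Dict[str, Doc] = {}

    def create(self, title: str, text: str, doc_id: Optional[str] = None,
               draft_of: Optional[str] = None, state: str = "UNPUBLISHED") -> str:
        # Use the caller's ID when given, otherwise fall back to a generated UUID.
        urn = f"urn:li:document:{doc_id or uuid.uuid4()}"  # assumed URN format
        if urn in self.docs:
            raise ValueError(f"document already exists: {urn}")
        self.docs[urn] = Doc(title, text, state, draft_of)
        return urn

    def merge_draft(self, draft_urn: str, delete_draft: bool = True) -> str:
        draft = self.docs[draft_urn]
        if draft.draft_of is None:
            raise ValueError(f"{draft_urn} is not a draft")
        target = self.docs[draft.draft_of]
        # Copy the draft's title and text onto the published document.
        target.title, target.text = draft.title, draft.text
        if delete_draft:
            del self.docs[draft_urn]
        return draft.draft_of
```

A caller would create the published document, create a draft that points at it through draft_of, and then call merge_draft to fold the draft's content back in.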

Queries

  1. document(urn) - Fetch document by URN with full metadata
  2. searchDocuments - Hybrid semantic search with rich filtering:
    • Semantic query support
    • Filter by parent document (hierarchical browsing)
    • Filter by types, domains, states
    • Option to include/exclude drafts
    • Faceted search support
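
The filter options above compose as a conjunction. The sketch below shows only that filtering step over an in-memory list; it does not model semantic ranking or facets, and the record fields and parameter names are assumptions rather than the searchDocuments input schema.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class DocRecord:
    urn: str
    doc_type: str
    state: str
    parent: Optional[str] = None
    domain: Optional[str] = None
    draft_of: Optional[str] = None


def filter_documents(
    docs: Iterable[DocRecord],
    parent: Optional[str] = None,
    types: Optional[Set[str]] = None,
    domains: Optional[Set[str]] = None,
    states: Optional[Set[str]] = None,
    include_drafts: bool = False,
) -> List[DocRecord]:
    """Keep documents that pass every filter that was supplied."""
    result = []
    for d in docs:
        if not include_drafts and d.draft_of is not None:
            continue  # drafts are hidden unless explicitly requested
        if parent is not None and d.parent != parent:
            continue  # hierarchical browsing: direct children only
        if types is not None and d.doc_type not in types:
            continue
        if domains is not None and d.domain not in domains:
            continue
        if states is not None and d.state not in states:
            continue
        result.append(d)
    return result
```

Treating an omitted filter as "no restriction" keeps the common case (browse everything under one parent) to a single argument.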

Special Features

  • drafts field - Lists all draft versions of a published document
  • changeHistory field - Chronological audit log of document modifications, covering content changes, parent changes (moves), relationship changes, state changes, and more

Authorization & Privileges

New Platform Privilege

  • MANAGE_DOCUMENTS - Platform-level privilege for managing all documents

Entity-Level Privileges

Documents support standard DataHub entity privileges:

  • VIEW_ENTITY_PAGE / GET_ENTITY - View document
  • EDIT_ENTITY_DOCS / EDIT_ENTITY - Edit document content
  • CREATE_ENTITY - Create documents
  • EDIT_ENTITY_OWNERS - Manage ownership
  • EDIT_ENTITY_DOMAINS - Assign domains
  • SHARE_ENTITY - Share documents
  • EDIT_ENTITY_PROPERTIES - Edit structured properties

Authorization Logic

  • canCreateDocument() - Requires CREATE_ENTITY for documents or MANAGE_DOCUMENTS
  • canEditDocument() - Requires EDIT_ENTITY_DOCS, EDIT_ENTITY, or MANAGE_DOCUMENTS
  • canGetDocument() - Requires VIEW_ENTITY_PAGE or MANAGE_DOCUMENTS
  • canDeleteDocument() - Requires delete authorization or MANAGE_DOCUMENTS
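
The four checks share one pattern: MANAGE_DOCUMENTS always suffices, otherwise any one of the listed entity privileges does. A minimal sketch of that pattern follows. It flattens platform and entity privileges into a single set for brevity, and DELETE_ENTITY is an assumed name for the "delete authorization" mentioned above, not something this description confirms.

```python
from typing import Set

MANAGE_DOCUMENTS = "MANAGE_DOCUMENTS"
DELETE_PRIVILEGE = "DELETE_ENTITY"  # assumption: the description only says "delete authorization"


def _has_any(granted: Set[str], *required: str) -> bool:
    # The platform privilege overrides every entity-level requirement.
    return MANAGE_DOCUMENTS in granted or any(p in granted for p in required)


def can_create_document(granted: Set[str]) -> bool:
    return _has_any(granted, "CREATE_ENTITY")


def can_edit_document(granted: Set[str]) -> bool:
    return _has_any(granted, "EDIT_ENTITY_DOCS", "EDIT_ENTITY")


def can_get_document(granted: Set[str]) -> bool:
    return _has_any(granted, "VIEW_ENTITY_PAGE")


def can_delete_document(granted: Set[str]) -> bool:
    return _has_any(granted, DELETE_PRIVILEGE)
```

For example, a user holding only VIEW_ENTITY_PAGE can read a document but cannot edit it, while a user holding only MANAGE_DOCUMENTS passes all four checks.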

Backend Services

DocumentService

Complete service layer implementation in metadata-service/services:

  • CRUD operations with validation
  • Draft workflow management (create, merge, track)
  • Hierarchical structure management (move operations)
  • Relationship management (assets and documents)
  • Ownership management
  • State transition handling
  • Full audit trail via lastModified timestamps
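
One detail a move operation typically has to handle is refusing to place a document underneath itself or one of its own descendants. The description does not say whether DocumentService performs this check, so the sketch below is an assumption about how such validation could look, using a hypothetical mapping from each document URN to its parent URN.

```python
from typing import Dict, Optional


def move_document(parents: Dict[str, Optional[str]], urn: str,
                  new_parent: Optional[str]) -> None:
    """Re-parent urn, rejecting moves that would create a cycle.

    parents maps each document URN to its parent URN (None for roots)
    and is assumed to be acyclic before the call.
    """
    if urn not in parents:
        raise KeyError(f"unknown document: {urn}")
    if new_parent is not None:
        if new_parent not in parents:
            raise KeyError(f"unknown parent: {new_parent}")
        # Walk up from the new parent; reaching urn means the target is
        # the document itself or one of its descendants.
        cursor: Optional[str] = new_parent
        while cursor is not None:
            if cursor == urn:
                raise ValueError("move would create a cycle")
            cursor = parents[cursor]
    parents[urn] = new_parent
```

Passing None as new_parent moves the document to the root of the hierarchy.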

Timeline Support

  • DocumentInfoChangeEventGenerator - Generates change events for audit history
  • Tracks all modifications to document aspects
  • Integrates with DataHub's timeline service
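
Conceptually, a change event generator compares the previous and current versions of an aspect and emits one event per kind of difference. The sketch below illustrates that idea only. It is not the DocumentInfoChangeEventGenerator implementation, and the snapshot fields and category names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    title: str
    text: str
    state: str
    parent: Optional[str]
    related: Tuple[str, ...]
    modified_at: int  # epoch milliseconds
    actor: str


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[dict]:
    """Return one event per kind of change between two versions."""
    events: List[dict] = []

    def emit(category: str, description: str) -> None:
        events.append({
            "category": category,  # hypothetical category names
            "description": description,
            "timestamp": new.modified_at,
            "actor": new.actor,
        })

    if (old.title, old.text) != (new.title, new.text):
        emit("CONTENT", "title or text changed")
    if old.parent != new.parent:
        emit("PARENT", f"moved from {old.parent} to {new.parent}")
    if set(old.related) != set(new.related):
        emit("RELATIONSHIP", "related assets or documents changed")
    if old.state != new.state:
        emit("STATE", f"{old.state} -> {new.state}")
    return events
```

Sorting the events from successive version pairs by timestamp would give the chronological log that the changeHistory field describes.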

Factory Beans

  • DocumentServiceFactory - Spring factory for service instantiation
  • Integration with GraphQL engine

Test Coverage

Smoke Tests

  • document_test.py (410 lines) - End-to-end document lifecycle tests
  • document_draft_test.py (326 lines) - Draft creation, merging, and workflows
  • document_change_history_test.py (281 lines) - Timeline and change tracking

Unit Tests

  • DocumentServiceTest.java (486 lines) - Service layer business logic
  • GraphQL resolver tests for all mutations and queries
  • DocumentMapperTest.java - Type mapping validation
  • DocumentInfoChangeEventGeneratorTest.java - Timeline event generation

Key Features & Use Cases

  1. Knowledge Base Management - Create and organize internal documentation, FAQs, tutorials, and runbooks
  2. Asset Documentation - Link documents to data assets for enriched context
  3. Draft Workflows - Work on document updates without publishing immediately
  4. Hierarchical Organization - Structure documents in parent-child relationships
  5. Semantic Search - Find relevant documents through hybrid search
  6. Change Tracking - Full audit history of all document modifications
  7. Third-Party Integration Ready - Source field supports ingesting external docs (Confluence, Notion, etc.)

This PR lays the foundation for DataHub to become a central knowledge hub, combining first-party documentation with data asset management in a unified platform.

Coming in a followup PR:

  • Add browse paths for docs, enabling us to replicate hierarchical structure from other places.
  • Add the "container" story for docs. One option is to define a parent container type as a Dataset entity (e.g. Dataset = Collection of Documents), which is then itself within a container.
  • Models for document-level lineage, and UI support for creating document-level lineage links.
  • Support Document Tags, Glossary Terms, and inclusion in Data Products

Status

Ready for review.

@github-actions github-actions bot added the docs, product, devops, and smoke_test labels Nov 12, 2025
}
}
} catch (Exception e) {
e.printStackTrace();

@aikido-pr-checks aikido-pr-checks bot Nov 12, 2025

Stacktrace might be exposed to end user - medium severity
Handling exceptions only with a printStackTrace() might result in stacktraces and variables being exposed in log files. Moreover, the developers who might need these stacktraces to detect the problems might never find them in those logs.

Remediation: Log the errors to a special-purpose error tracking system, such as Sentry.

@jjoyce0510
Collaborator Author

Currently addressing Abe's comments, raising test coverage to 70%, and finalizing the API smoke tests.

@datahub-cyborg datahub-cyborg bot added the needs-review label Nov 12, 2025

codecov bot commented Nov 12, 2025

Bundle Report

Changes will increase total bundle size by 185 bytes (0.0%) ⬆️. This is within the configured threshold ✅

Detailed changes
Bundle name | Size | Change
datahub-react-web-esm | 28.64MB | 185 bytes (0.0%) ⬆️

Affected Assets, Files, and Routes:

view changes for bundle: datahub-react-web-esm

Assets Changed:

Asset Name | Size Change | Total Size | Change (%)
assets/index-*.js | 185 bytes | 19.01MB | 0.0%

@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review labels Nov 13, 2025
@jjoyce0510 jjoyce0510 merged commit 711ac49 into master Nov 19, 2025
77 checks passed
@jjoyce0510 jjoyce0510 deleted the jj--oss-context-base-v1 branch November 19, 2025 19:00
Labels

  • devops - PR or Issue related to DataHub backend & deployment
  • docs - Issues and Improvements to docs
  • pending-submitter-merge
  • product - PR or Issue related to the DataHub UI/UX
  • smoke_test - Contains changes related to smoke tests

5 participants