Conversation

Contributor

@mbani01 mbani01 commented Oct 31, 2025

This pull request introduces enhancements to the Tinybird client library, adds new utility scripts, and updates dependencies to support Snowflake integration. The most significant improvements are the addition of new methods for interacting with Tinybird datasources and the refactoring of header management for API requests.

Tinybird client enhancements:

  • Added a private getHeaders method to centralize and simplify HTTP header creation for Tinybird API requests in tinybirdClient.ts.
  • Refactored existing API calls to use the new getHeaders method for both GET and POST requests, improving maintainability. [1] [2]
  • Introduced a new rawSql method to execute raw SQL queries directly against Tinybird, supporting operations outside of named pipes.
  • Added a deleteDatasource method to allow selective deletion of data from a Tinybird datasource using the official delete API.
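The client changes above can be sketched roughly as follows. The method names (getHeaders, rawSql, deleteDatasource) come from the PR description; the endpoint paths, signatures, and header shape here are assumptions for illustration, not the actual implementation.

```typescript
// Minimal sketch of the new TinybirdClient surface. Endpoint paths follow
// the public Tinybird API (/v0/sql, /v0/datasources/{name}/delete); the
// exact signatures in the real tinybirdClient.ts may differ.
class TinybirdClient {
  constructor(
    private readonly token: string,
    private readonly host = 'https://api.tinybird.co',
  ) {}

  // Centralized HTTP header creation for all Tinybird API requests.
  private getHeaders(contentType?: string): Record<string, string> {
    const headers: Record<string, string> = {
      Authorization: `Bearer ${this.token}`,
    }
    if (contentType) {
      headers['Content-Type'] = contentType
    }
    return headers
  }

  // Execute a raw SQL query outside of named pipes (e.g. ALTER TABLE).
  async rawSql<T = unknown>(query: string): Promise<T> {
    const url = `${this.host}/v0/sql?q=${encodeURIComponent(query)}`
    const res = await fetch(url, { headers: this.getHeaders() })
    if (!res.ok) {
      throw new Error(`Tinybird rawSql failed with status ${res.status}`)
    }
    return (await res.json()) as T
  }

  // Selectively delete rows from a datasource via the official delete API.
  async deleteDatasource(datasource: string, deleteCondition: string): Promise<void> {
    const res = await fetch(`${this.host}/v0/datasources/${datasource}/delete`, {
      method: 'POST',
      headers: this.getHeaders('application/x-www-form-urlencoded'),
      body: new URLSearchParams({ delete_condition: deleteCondition }),
    })
    if (!res.ok) {
      throw new Error(`Tinybird delete failed with status ${res.status}`)
    }
  }
}
```

Centralizing header creation this way means a change such as adding a tracing header touches one method instead of every call site.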

Script and dependency updates:

  • Added a new script cleanup-fork-activities to the script_executor_worker package for maintenance tasks.
  • Updated dependencies in both pnpm-lock.yaml and package.json to include @crowd/snowflake, preparing the project for Snowflake integration. [1] [2]

@mbani01 mbani01 self-assigned this Oct 31, 2025

@github-actions github-actions bot left a comment


Conventional Commits FTW!

@mbani01 mbani01 changed the title from "feat: Cleanup old inaccurate activities/maintainers for previously processed (forked) repos [CM-749]" to "feat: cleanup old inaccurate activities & maintainers for previously processed fork repos [CM-749]" Oct 31, 2025
@mbani01 mbani01 requested review from epipav and ulemons October 31, 2025 16:42
@mbani01 mbani01 marked this pull request as ready for review October 31, 2025 16:43
  })

- constructor() {
+ constructor(token?: string) {
Contributor

Why are we taking the token as a param? If we have some use case where this env variable is not set, shouldn't we also pass the host? @mbani01

Contributor Author

I added the param so we can pass another token with different permissions than the one in the env var.
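The fallback behavior described here could look like the sketch below; the env var name TINYBIRD_TOKEN is an assumption, as is the class shape.

```typescript
// Sketch of the optional-token constructor. TINYBIRD_TOKEN is an assumed
// env var name -- the actual variable in the codebase may differ.
class TinybirdClientSketch {
  private readonly token: string

  constructor(token?: string) {
    // An explicit token (e.g. one with different permissions) wins;
    // otherwise fall back to the environment variable.
    const resolved = token ?? process.env.TINYBIRD_TOKEN
    if (!resolved) {
      throw new Error('No Tinybird token given and TINYBIRD_TOKEN is not set')
    }
    this.token = resolved
  }

  // Exposed only so the fallback behavior is easy to verify.
  getToken(): string {
    return this.token
  }
}
```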

Comment on lines +482 to +499
const maintainersDeletedPostgres = await deleteMaintainersFromPostgres(
clients.postgres,
repo.id,
repo.url,
dryRun,
)
const maintainersDeletedTinybird = await deleteMaintainersFromTinybird(
tinybird,
repo.id,
repo.url,
dryRun,
)
const maintainersDeletedSnowflake = await deleteMaintainersFromSnowflake(
clients.snowflake,
repo.id,
repo.url,
dryRun,
)
Contributor

All these deletions are not in a transaction. What if some deletions succeed and others fail (e.g. the Postgres deletion succeeds but the Tinybird deletion fails)? Can we simply rerun the script?
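One way to make a rerun safe without cross-store transactions is to keep each per-store delete idempotent and collect failures instead of aborting on the first one. A generic sketch (the function names here are hypothetical, not the script's actual code):

```typescript
// Hypothetical sketch: run every store's delete even if one fails, then
// report which stores need a rerun. Because a DELETE of already-removed
// rows is a no-op, rerunning the whole script converges to the same state.
type DeleteFn = () => Promise<number>

async function deleteAcrossStores(
  stores: Record<string, DeleteFn>,
): Promise<Record<string, number>> {
  const deleted: Record<string, number> = {}
  const failed: string[] = []
  for (const [name, del] of Object.entries(stores)) {
    try {
      deleted[name] = await del()
    } catch {
      // Keep going so a single failing store does not block the others.
      failed.push(name)
    }
  }
  if (failed.length > 0) {
    throw new Error(`Deletion failed for: ${failed.join(', ')}; rerun the script`)
  }
  return deleted
}
```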

* Execute raw SQL query on Tinybird
* Useful for queries that don't go through named pipes (e.g., ALTER TABLE, direct SELECT)
*/
async rawSql<T = unknown>(query: string): Promise<T> {
Contributor

Maybe we could reuse this rawSql function in pipeSql?

Contributor Author

Good catch! done ✅

Collaborator

@epipav epipav left a comment

It looks good overall. Before merging, let's:

  • Double-check that delete jobs have the proper delete_condition set with this call
  • Reduce the batch size a bit more, maybe to 200
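The batch-size suggestion could be applied with a small chunk helper; this is a generic sketch, not the script's actual batching code.

```typescript
// Generic sketch of batching work at the suggested size of 200.
function chunk<T>(items: T[], size = 200): T[][] {
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}
```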

Contributor

@joanagmaia joanagmaia left a comment

Hey Mouad, I saw that the table names were not correct for Snowflake. Also, in Snowflake we will need to delete from both Kafka tables and Fivetran tables.

Can you try to access https://app.snowflake.com/jnmhvwd/xpb85243/#/homepage with your contractor email? You should have access. You will also need to update the columns you are reading for these tables according to the data schema.

In terms of credentials, we will use COMMUNITY_MANAGEMENT_USER; we are just waiting on delete permissions for all of these tables. Thread -> https://linuxfoundation.slack.com/archives/C06GFSRQQER/p1762177471448389

https://linuxfoundation.slack.com/archives/C02RMNFE9FF/p1762537289862279

if (dryRun) {
log.info(`[DRY RUN] Querying maintainers from Snowflake for repository: ${repoUrl}`)
try {
const query = `SELECT COUNT(*) as count FROM "maintainersInternal" WHERE "repoId" = '${repoId}'`
Contributor

For maintainersInternal:
Table names should be:

  • FIVETRAN_INGEST.CROWD_PROD_PUBLIC.MAINTAINERSINTERNAL
  • KAFKA_INGEST.COMMUNITY_MANAGEMENT.MAINTAINERS_INTERNAL

For Kafka I can help you get the columns.
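Applying the fully qualified table names from this comment, the dry-run count could iterate both ingest tables. The REPO_ID column name below is a placeholder; per the review, the real columns must be taken from each table's schema, and repoId should be bound as a query parameter rather than interpolated in real code.

```typescript
// Sketch using the corrected Snowflake table names for maintainersInternal.
// REPO_ID is a placeholder column name to be confirmed against each schema.
const MAINTAINERS_TABLES = [
  'FIVETRAN_INGEST.CROWD_PROD_PUBLIC.MAINTAINERSINTERNAL',
  'KAFKA_INGEST.COMMUNITY_MANAGEMENT.MAINTAINERS_INTERNAL',
]

function buildMaintainerCountQueries(repoId: string): string[] {
  // Deletions (and dry-run counts) must cover both the Fivetran table
  // and the Kafka table, so one query is built per table.
  return MAINTAINERS_TABLES.map(
    (table) => `SELECT COUNT(*) AS count FROM ${table} WHERE REPO_ID = '${repoId}'`,
  )
}
```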

try {
// Delete from activityRelations first (foreign key dependency)
log.info('Deleting from activityRelations table...')
const activityRelationsQuery = `DELETE FROM activityRelations WHERE activityId IN (${idsString})`
Contributor

For activityRelations in Snowflake:
Table names should be:

  • FIVETRAN_INGEST.CROWD_PROD_PUBLIC.ACTIVITYRELATIONS
  • KAFKA_INGEST.COMMUNITY_MANAGEMENT.ACTIVITY_RELATIONS

For Kafka I can help you get the columns.


// Delete from activities table
log.info('Deleting from activities table...')
const activitiesQuery = `DELETE FROM activities WHERE id IN (${idsString})`
Contributor

For activities in Snowflake:
Table names should be:

  • FIVETRAN_INGEST.CROWD_PROD_PUBLIC.ACTIVITIES
  • KAFKA_INGEST.COMMUNITY_MANAGEMENT.ACTIVITIES

For Kafka I can help you get the columns.
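Pulling these review comments together: each entity's delete would target both ingest tables while keeping the existing foreign-key order (activityRelations before activities). The table names are the ones given in the comments above; the column names are placeholders to be confirmed against each schema.

```typescript
// Sketch combining the corrected Snowflake table names for activities and
// activityRelations. ACTIVITY_ID and ID are placeholder column names.
const SNOWFLAKE_TABLES = {
  activityRelations: [
    'FIVETRAN_INGEST.CROWD_PROD_PUBLIC.ACTIVITYRELATIONS',
    'KAFKA_INGEST.COMMUNITY_MANAGEMENT.ACTIVITY_RELATIONS',
  ],
  activities: [
    'FIVETRAN_INGEST.CROWD_PROD_PUBLIC.ACTIVITIES',
    'KAFKA_INGEST.COMMUNITY_MANAGEMENT.ACTIVITIES',
  ],
}

function buildActivityDeleteQueries(idsString: string): string[] {
  // activityRelations first (foreign key dependency), then activities;
  // within each entity, both the Fivetran and the Kafka table.
  return [
    ...SNOWFLAKE_TABLES.activityRelations.map(
      (t) => `DELETE FROM ${t} WHERE ACTIVITY_ID IN (${idsString})`,
    ),
    ...SNOWFLAKE_TABLES.activities.map(
      (t) => `DELETE FROM ${t} WHERE ID IN (${idsString})`,
    ),
  ]
}
```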
