Skip to content

Commit 6ae100e

Browse files
authored
feat: tinybird pipes and datasources for leaderboards (IN-813) (#3575)
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
1 parent 8be9871 commit 6ae100e

18 files changed

+683
-20
lines changed

services/libs/tinybird/datasources/activities.datasource

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@ DESCRIPTION >
1616
- `url` is the direct link to the activity on the source platform (empty string if not available).
1717
- Sentiment analysis fields (`sentimentLabel`, `sentimentScore*`) provide sentiment metrics (-1 default for no analysis).
1818
- Git-specific fields (`git*`) track code changes, branch info, and merge status for code-related activities (0 default for non-git).
19+
1920
TAGS "Activity preprocessing pipeline"
2021

2122
SCHEMA >
@@ -46,8 +47,8 @@ SCHEMA >
4647
`createdById` UUID `json:$.createdById` DEFAULT toUUID('00000000-0000-0000-0000-000000000000'),
4748
`updatedById` UUID `json:$.updatedById` DEFAULT toUUID('00000000-0000-0000-0000-000000000000'),
4849
`segmentId` LowCardinality(String) `json:$.segmentId` DEFAULT ''
49-
50-
ENGINE "ReplacingMergeTree"
51-
ENGINE_PARTITION_KEY "toYear(createdAt)"
52-
ENGINE_SORTING_KEY "id"
53-
ENGINE_VER "updatedAt"
50+
51+
ENGINE ReplacingMergeTree
52+
ENGINE_PARTITION_KEY toYear(createdAt)
53+
ENGINE_SORTING_KEY id
54+
ENGINE_VER updatedAt

services/libs/tinybird/datasources/activities_deduplicated_ds.datasource

Lines changed: 9 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
DESCRIPTION >
2-
- `activities_deduplicated_ds` contains deduplicated raw activity events without relationship data.
2+
- `activities_deduplicated_ds` contains deduplicated raw activity events without relationship data.
33
- Created via copy pipe from `activities` datasource with deduplication and field selection for performance.
44
- Since aggregations are mainly done on relationships, `activityRelations_deduplicated_cleaned_ds` should be used for reporting purposes instead.
55
- Optimized subset of activity fields focused on core analytics needs.
@@ -9,11 +9,12 @@ DESCRIPTION >
99
- `type` specifies the activity type (issues-opened, pull-request-opened, etc.) using LowCardinality.
1010
- `channel` contains the repository, channel, or forum where activity occurred.
1111
- `sourceId` is the unique identifier from the source platform.
12-
- `score` is the computed importance/impact score for the activity.
12+
- `score` is the computed importance/impact score for the activity.
1313
- `attributes` contains additional JSON metadata specific to the activity type.
1414
- `body` contains the activity’s textual body/content when applicable; empty string if not applicable.
1515
- `title` contains the activity’s title/subject when applicable; empty string if not applicable.
1616
- `url` is the direct link to the activity on the source platform; empty string if not available.
17+
1718
TAGS "Activity preprocessing pipeline"
1819

1920
SCHEMA >
@@ -31,11 +32,10 @@ SCHEMA >
3132
`url` String DEFAULT '',
3233
`updatedAt` DateTime64(3)
3334

34-
ENGINE "MergeTree"
35-
ENGINE_PARTITION_KEY "toYear(timestamp)"
36-
ENGINE_SORTING_KEY "id, platform, channel"
37-
38-
3935
INDEXES >
40-
idx_body_ngram3 body TYPE ngrambf_v1(3, 2048, 6, 0) GRANULARITY 64,
41-
idx_title_ngram3 title TYPE ngrambf_v1(3, 512, 6, 0) GRANULARITY 64
36+
idx_body_ngram3 body TYPE ngrambf_v1(3, 2048, 6, 0) GRANULARITY 64
37+
idx_title_ngram3 title TYPE ngrambf_v1(3, 512, 6, 0) GRANULARITY 64
38+
39+
ENGINE MergeTree
40+
ENGINE_PARTITION_KEY toYear(timestamp)
41+
ENGINE_SORTING_KEY id, platform, channel

services/libs/tinybird/datasources/issues_analyzed.datasource

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -8,14 +8,17 @@ DESCRIPTION >
88
- `closedAt` is the timestamp when the issue was closed (nullable for open issues).
99
- `closedInSeconds` is the computed duration from open to close in seconds (nullable for open issues).
1010

11-
TAGS "Issue analytics", "Performance metrics"
11+
TAGS "Performance metrics", "Issue analytics"
1212

1313
SCHEMA >
1414
`id` String,
1515
`sourceId` String,
16+
`segmentId` String,
1617
`openedAt` DateTime64(3),
18+
`commentedAt` Nullable(DateTime64(3)),
1719
`closedAt` Nullable(DateTime64(3)),
18-
`closedInSeconds` Nullable(Int64)
20+
`closedInSeconds` Nullable(Int64),
21+
`respondedInSeconds` Nullable(Int64)
1922

2023
ENGINE MergeTree
2124
ENGINE_PARTITION_KEY toYear(openedAt)
Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,30 @@
1+
DESCRIPTION >
2+
- `leaderboards_copy_ds` contains pre-computed leaderboard rankings for open source projects across multiple categories.
3+
- Created via copy pipe that runs daily at 1 AM, combining data from multiple leaderboard metric pipes.
4+
- Supports six leaderboard types: longest-running, active-contributors, active-organizations, commit-activity, focused-teams, and small-teams-massive-output.
5+
- Designed for high-performance read access to power project ranking and comparison features.
6+
- `rank` is the position of the project within its leaderboard type (1 = top ranked).
7+
- `id` is the unique project identifier.
8+
- `segmentId` is the internal segment identifier for the project.
9+
- `name` is the human-readable project name.
10+
- `slug` is the URL-friendly project identifier used for filtering and routing.
11+
- `logoUrl` is the URL to the project's logo image.
12+
- `leaderboardType` identifies the ranking category (e.g., 'commit-activity', 'active-contributors').
13+
- `value` is the primary metric value used for ranking (interpretation depends on leaderboardType).
14+
- `previousPeriodValue` is the comparison metric from the previous period (0 for non-temporal leaderboards).
15+
16+
TAGS "Analytics, Leaderboards, Project rankings"
17+
18+
SCHEMA >
19+
`rank` UInt64,
20+
`id` String,
21+
`segmentId` String,
22+
`name` String,
23+
`slug` String,
24+
`logoUrl` String,
25+
`leaderboardType` String,
26+
`value` Float64,
27+
`previousPeriodValue` Float64
28+
29+
ENGINE MergeTree
30+
ENGINE_SORTING_KEY leaderboardType, rank, slug, id

services/libs/tinybird/pipes/activities_backup_cleaned_mv.pipe

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ DESCRIPTION >
55

66
NODE activities_backup_mv_v1
77
SQL >
8-
SELECT
8+
SELECT
99
id,
1010
type,
1111
timestamp,

services/libs/tinybird/pipes/issue_analysis_copy_pipe.pipe

Lines changed: 18 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -3,29 +3,44 @@ DESCRIPTION >
33

44
NODE issues_opened
55
SQL >
6-
SELECT activityId as id, sourceId, timestamp AS openedAt
6+
SELECT activityId as id, sourceId, timestamp AS openedAt, segmentId
77
FROM activityRelations_deduplicated_cleaned_ds
88
WHERE type = 'issues-opened'
99

1010
NODE issues_closed
1111
SQL >
1212
SELECT sourceParentId, MIN(timestamp) AS closedAt
1313
FROM activityRelations_deduplicated_cleaned_ds
14-
WHERE type = 'issues-closed'
14+
WHERE type = 'issues-closed' AND sourceParentId != ''
15+
GROUP BY sourceParentId
16+
17+
NODE issues_comment
18+
SQL >
19+
SELECT sourceParentId, MIN(timestamp) AS commentedAt
20+
FROM activityRelations_deduplicated_cleaned_ds
21+
WHERE type = 'issue-comment' AND sourceParentId != '' AND toYear(timestamp) >= 1971
1522
GROUP BY sourceParentId
1623

1724
NODE issue_analysis_results_merged
1825
SQL >
1926
SELECT
2027
opened.id,
2128
opened.sourceId,
29+
opened.segmentId as segmentId,
2230
opened.openedAt,
31+
comment.commentedAt,
2332
IF(closed.closedAt = toDateTime(0), NULL, closed.closedAt) AS closedAt,
2433
IF(
2534
closedAt IS NULL, NULL, toUnixTimestamp(closed.closedAt) - toUnixTimestamp(opened.openedAt)
26-
) AS closedInSeconds
35+
) AS closedInSeconds,
36+
IF(
37+
commentedAt IS NULL,
38+
NULL,
39+
toUnixTimestamp(comment.commentedAt) - toUnixTimestamp(opened.openedAt)
40+
) AS respondedInSeconds
2741
FROM issues_opened opened
2842
LEFT JOIN issues_closed AS closed ON opened.sourceId = closed.sourceParentId
43+
LEFT JOIN issues_comment AS comment ON opened.sourceId = comment.sourceParentId
2944

3045
TYPE COPY
3146
TARGET_DATASOURCE issues_analyzed
Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,47 @@
1+
DESCRIPTION >
2+
Main leaderboard query endpoint that retrieves leaderboard data from the consolidated datasource.
3+
Supports filtering by project slug, leaderboard type, and search term. Includes pagination support.
4+
5+
NODE leaderboards_list
6+
DESCRIPTION >
7+
Queries the unified leaderboards datasource with optional filtering and pagination
8+
9+
SQL >
10+
%
11+
SELECT *
12+
FROM leaderboards_copy_ds
13+
WHERE
14+
1 = 1
15+
{% if defined(slug) %}
16+
AND slug = {{ String(slug, description="Project slug", required=False) }}
17+
{% end %}
18+
{% if defined(leaderboardType) %}
19+
AND leaderboardType
20+
= {{ String(leaderboardType, description="Leaderboard type", required=False) }}
21+
{% end %}
22+
{% if defined(search) %}
23+
AND name
24+
ilike '%'
25+
|| {{
26+
String(
27+
search,
28+
description="Search project leaderboard open ended wildcard using slug",
29+
required=False,
30+
)
31+
}}
32+
|| '%'
33+
{% end %}
34+
{% if defined(maxRank) %}
35+
AND rank
36+
<= {{
37+
Int32(
38+
maxRank,
39+
description="Define max rank to send",
40+
required=False,
41+
)
42+
}}
43+
{% end %}
44+
ORDER BY rank ASC
45+
{% if not defined(maxRank) %}
46+
LIMIT {{ Int32(pageSize, 20) }} OFFSET {{ Int32(page, 0) * Int32(pageSize, 20) }}
47+
{% end %}
Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
DESCRIPTION >
2+
Leaderboard ranking projects by average commits per author (focused teams). Calculates the ratio
3+
of total commits to unique contributors, identifying projects with highly productive individual
4+
contributors. Only includes projects with git platform integrations.
5+
6+
NODE leaderboards_avg_commits_per_author_projects
7+
DESCRIPTION >
8+
Retrieves all projects that have git integration enabled
9+
10+
SQL >
11+
SELECT id, name, slug, segmentId, logoUrl
12+
FROM insights_projects_populated_ds
13+
WHERE has(connectedPlatforms, 'git')
14+
GROUP BY id, name, slug, segmentId, logoUrl
15+
16+
NODE leaderboards_avg_commits_per_author_activities
17+
DESCRIPTION >
18+
Counts total commits and unique authors per project for calculating average commits per author
19+
20+
SQL >
21+
SELECT segmentId, count() as commits, uniq(memberId) as unique_authors
22+
FROM activityRelations_deduplicated_cleaned_ds
23+
WHERE
24+
activityId != ''
25+
AND (type = 'authored-commit' OR type = 'committed-commit')
26+
AND platform = 'git'
27+
GROUP BY segmentId
28+
HAVING unique_authors > 0
29+
30+
NODE leaderboards_avg_commits_per_author_results
31+
DESCRIPTION >
32+
Calculates average commits per author ratio and ranks projects by highest productivity per contributor
33+
34+
SQL >
35+
SELECT
36+
row_number() OVER (
37+
ORDER BY
38+
CAST(coalesce(c.commits, 0) AS Float64)
39+
/ CAST(coalesce(c.unique_authors, 1) AS Float64) DESC
40+
) as rank,
41+
p.id,
42+
p.segmentId,
43+
p.name,
44+
p.slug,
45+
p.logoUrl,
46+
(coalesce(c.commits, 0) / coalesce(c.unique_authors, 1)) as value
47+
FROM leaderboards_avg_commits_per_author_projects p
48+
INNER JOIN leaderboards_avg_commits_per_author_activities c ON p.segmentId = c.segmentId
Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,42 @@
1+
DESCRIPTION >
2+
Leaderboard ranking projects by codebase size, measured as the net lines of code changed
3+
(total insertions minus deletions) across all git commits. Only includes projects with
4+
git platform integrations.
5+
6+
NODE leaderboards_codebase_size_projects
7+
DESCRIPTION >
8+
Retrieves all projects that have git integration enabled
9+
10+
SQL >
11+
SELECT id, name, slug, segmentId, logoUrl
12+
FROM insights_projects_populated_ds
13+
WHERE has(connectedPlatforms, 'git')
14+
GROUP BY id, name, slug, segmentId, logoUrl
15+
16+
NODE leaderboards_codebase_size_activities
17+
DESCRIPTION >
18+
Calculates net lines changed per project by summing git insertions and subtracting deletions
19+
20+
SQL >
21+
SELECT segmentId, SUM(gitInsertions) - SUM(gitDeletions) AS lineDifference
22+
FROM activityRelations_deduplicated_cleaned_ds
23+
WHERE platform = 'git' AND (gitInsertions > 0 OR gitDeletions > 0)
24+
GROUP BY segmentId
25+
26+
NODE leaderboards_codebase_size_result
27+
DESCRIPTION >
28+
Joins project metadata with codebase size metrics and assigns rankings based on total lines changed
29+
30+
SQL >
31+
SELECT
32+
row_number() OVER (ORDER BY coalesce(o.linesChanged, 0) DESC) as rank,
33+
p.id,
34+
p.segmentId,
35+
p.name,
36+
p.slug,
37+
p.logoUrl,
38+
CAST(coalesce(o.linesChanged, 0) as Float64) as value
39+
FROM leaderboards_codebase_size_projects p
40+
INNER JOIN leaderboards_codebase_size_activities o ON p.segmentId = o.segmentId
41+
WHERE o.linesChanged > 0
42+
ORDER BY value DESC
Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
DESCRIPTION >
2+
Leaderboard ranking projects by total commit activity. Compares the current period (last 12 months)
3+
with the previous period (12-24 months ago). Higher commit counts rank higher, showing the most
4+
active development projects.
5+
6+
NODE leaderboards_commits_projects
7+
DESCRIPTION >
8+
Retrieves all projects from the populated datasource
9+
10+
SQL >
11+
SELECT id, name, slug, segmentId, logoUrl
12+
FROM insights_projects_populated_ds
13+
GROUP BY id, name, slug, segmentId, logoUrl
14+
15+
NODE leaderboards_commits_current_period
16+
DESCRIPTION >
17+
Counts total git commits for each project in the last 12 months
18+
19+
SQL >
20+
SELECT segmentId, count(activityId) as commits
21+
FROM activityRelations_deduplicated_cleaned_ds
22+
WHERE
23+
timestamp >= now() - INTERVAL 12 MONTH
24+
AND timestamp < now()
25+
AND activityId != ''
26+
AND (type = 'authored-commit' OR type = 'committed-commit')
27+
AND platform = 'git'
28+
GROUP BY segmentId
29+
30+
NODE leaderboards_commits_previous_period
31+
DESCRIPTION >
32+
Counts total git commits for each project in the previous 12 months (12-24 months ago)
33+
34+
SQL >
35+
SELECT segmentId, count(activityId) as commits
36+
FROM activityRelations_deduplicated_cleaned_ds
37+
WHERE
38+
timestamp >= now() - INTERVAL 24 MONTH
39+
AND timestamp < now() - INTERVAL 12 MONTH
40+
AND activityId != ''
41+
AND (type = 'authored-commit' OR type = 'committed-commit')
42+
AND platform = 'git'
43+
GROUP BY segmentId
44+
45+
NODE leaderboards_copy_commits
46+
DESCRIPTION >
47+
Joins project metadata with current and previous period commit counts, ranks by most active
48+
49+
SQL >
50+
SELECT
51+
row_number() OVER (ORDER BY coalesce(o.commits, 0) DESC) as rank,
52+
p.id,
53+
p.segmentId,
54+
p.name,
55+
p.slug,
56+
p.logoUrl,
57+
cast(coalesce(o.commits, 0) as Float64) as value,
58+
cast(coalesce(pp.commits, 0) as Float64) as previousPeriodValue
59+
FROM leaderboards_commits_projects p
60+
INNER JOIN leaderboards_commits_current_period o ON p.segmentId = o.segmentId
61+
LEFT JOIN leaderboards_commits_previous_period pp ON p.segmentId = pp.segmentId
62+
WHERE o.commits > 0
63+
ORDER BY value DESC

0 commit comments

Comments
 (0)