Skip to content

Conversation

@schmikei
Copy link
Contributor

This PR updates the discourse-mixin to use commonlib/g.libsonnet to generate its dashboards. It also employs the signals API to help manage the different prometheus targets which can be extended into the future.

@schmikei schmikei marked this pull request as ready for review November 10, 2025 20:05
@schmikei schmikei requested a review from a team as a code owner November 10, 2025 20:05
@schmikei
Copy link
Contributor Author

Currently working on runbook for this but think the Mixin should be ready for some review

Copy link
Member

@Dasomeone Dasomeone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple comments, thanks Keith!

local signals = this.signals,

// Overview dashboard panels
trafficByResponseCode:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update this panel to have a right-aligned tabel legend please, it very quickly gets large with the cardinality we have in the legend of {{ api }} - {{ verb }} - {{ status }}

Comment on lines +27 to +73
latestMedianRequestTime: {
name: 'Latest median request time',
type: 'gauge',
unit: 's',
description: 'The median amount of time for "latest" page requests.',
sources: {
prometheus: {
expr: 'discourse_http_duration_seconds{%(queriesSelector)s,quantile="0.5",action="latest"}',
},
},
},

topicMedianRequestTime: {
name: 'Topic median request time',
type: 'gauge',
unit: 's',
description: 'The median amount of time for "topics show" requests.',
sources: {
prometheus: {
expr: 'discourse_http_duration_seconds{%(queriesSelector)s,quantile="0.5",controller="topics"}',
},
},
},

latest99thPercentileRequestTime: {
name: 'Latest 99th percentile request time',
type: 'gauge',
unit: 's',
description: 'The 99th percentile amount of time for "latest" page requests.',
sources: {
prometheus: {
expr: 'discourse_http_duration_seconds{%(queriesSelector)s,quantile="0.99",action="latest"}',
},
},
},

topic99thPercentileRequestTime: {
name: 'Topic 99th percentile request time',
type: 'gauge',
unit: 's',
description: 'The 99th percentile amount of time for "topics show" requests.',
sources: {
prometheus: {
expr: 'discourse_http_duration_seconds{%(queriesSelector)s,quantile="0.99",controller="topics"}',
},
},
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we have proper histogram data from the looks of it, what do you think of replacing these with histograms? Or perhaps in addition to the 99th quantile

Copy link
Contributor Author

@schmikei schmikei Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good idea 👍 will look into it 👀

Comment on lines 62 to 70
sidekiqJobDuration:
commonlib.panels.generic.stat.base.new('Sidekiq job duration', targets=[signals.jobs.sidekiqJobDuration.asTarget()])
+ g.panel.stat.panelOptions.withDescription('Time spent in Sidekiq jobs broken out by job name.')
+ g.panel.stat.standardOptions.withUnit('s'),

scheduledJobDuration:
commonlib.panels.generic.stat.base.new('Scheduled job duration', targets=[signals.jobs.scheduledJobDuration.asTarget()])
+ g.panel.stat.panelOptions.withDescription('Time spent in scheduled jobs broken out by job name.')
+ g.panel.stat.standardOptions.withUnit('s'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to see these two panels also have right-aligned table legends, with an additional column for max/mode/current duration

Comment on lines 8 to 23
alert: 'DiscourseHigh5xxErrors',
expr: |||
100 * rate(discourse_http_requests{status="500"}[5m]) / on() group_left() (sum(rate(discourse_http_requests[5m])) by (instance)) > %(alertsCritical5xxResponses)s
||| % this.config,
'for': '5m',
labels: {
severity: 'critical',
},
annotations: {
summary: 'More than %(alertsCritical5xxResponses)s%% of all requests result in a 5XX.' % this.config,
description:
('{{ printf "%%.2f" $value }}%% of all requests are resulting in 500 status codes, ' +
'which is above the threshold %(alertsCritical5xxResponses)s%%, ' +
'indicating a potentially larger issue for {{$labels.instance}}') % this.config,
},
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We reference 5xx in the summary and alert name, but only 500 in the query expression and description.

Should update this to be a proper regex the same as 4.. below

@schmikei schmikei force-pushed the chore/discourse-modernization branch from 468dbe1 to 8778033 Compare November 11, 2025 21:18
@schmikei schmikei force-pushed the chore/discourse-modernization branch from 8778033 to 52beb8e Compare November 11, 2025 21:27
Copy link

@aalhour aalhour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments and requests for changes.

+ g.panel.timeSeries.options.legend.withAsTable(true)
+ g.panel.timeSeries.options.legend.withPlacement('right')
+ g.panel.timeSeries.panelOptions.withDescription('Time spent in Sidekiq jobs broken out by job name.')
+ g.panel.timeSeries.standardOptions.withUnit('s'),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this panel would look better with stacking enabled? Since we want to show cumulative job duration and breakdown by job.

+ g.panel.timeSeries.options.legend.withAsTable(true)
+ g.panel.timeSeries.options.legend.withPlacement('right')
+ g.panel.timeSeries.panelOptions.withDescription('Time spent in scheduled jobs broken out by job name.')
+ g.panel.timeSeries.standardOptions.withUnit('s'),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar question to the sidekiqJobDuration.

}
},
"version": "master"
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add a dependency to grafonnet-v11.4.0?

targets=[signals.overview.pageViews.asTarget()]
)
+ g.panel.timeSeries.panelOptions.withDescription('Rate of pageviews for the entire application. Grouped by type and service.')
+ g.panel.timeSeries.standardOptions.withUnit('none'),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to change the unit to something more semantic like views/sec?

g.panel.histogram.new('Latest request time')
+ g.panel.histogram.queryOptions.withTargets([signals.overview.latest99thPercentileRequestTime.asTarget()])
+ g.panel.histogram.panelOptions.withDescription('The 99th percentile amount of time for "latest" page requests for the selected site.')
+ g.panel.histogram.standardOptions.withUnit('s'),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know whether these should be timeseries or histograms. Since the datasource is not a histogram but a precalculated quantile value this might be broken. I am in favor of unifying both. What do you think?

g.panel.histogram.new('Topic show request time')
+ g.panel.histogram.queryOptions.withTargets([signals.overview.topic99thPercentileRequestTime.asTarget()])
+ g.panel.histogram.panelOptions.withDescription('The amount of time for "topics show" requests for the selected site.')
+ g.panel.histogram.standardOptions.withUnit('s'),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment to the histogram above it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants