-
Notifications
You must be signed in to change notification settings - Fork 177
chore: Discourse Mixin Modernization #1539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Currently working on runbook for this but think the Mixin should be ready for some review |
Dasomeone
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple comments, thanks Keith!
| local signals = this.signals, | ||
|
|
||
| // Overview dashboard panels | ||
| trafficByResponseCode: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update this panel to have a right-aligned tabel legend please, it very quickly gets large with the cardinality we have in the legend of {{ api }} - {{ verb }} - {{ status }}
| latestMedianRequestTime: { | ||
| name: 'Latest median request time', | ||
| type: 'gauge', | ||
| unit: 's', | ||
| description: 'The median amount of time for "latest" page requests.', | ||
| sources: { | ||
| prometheus: { | ||
| expr: 'discourse_http_duration_seconds{%(queriesSelector)s,quantile="0.5",action="latest"}', | ||
| }, | ||
| }, | ||
| }, | ||
|
|
||
| topicMedianRequestTime: { | ||
| name: 'Topic median request time', | ||
| type: 'gauge', | ||
| unit: 's', | ||
| description: 'The median amount of time for "topics show" requests.', | ||
| sources: { | ||
| prometheus: { | ||
| expr: 'discourse_http_duration_seconds{%(queriesSelector)s,quantile="0.5",controller="topics"}', | ||
| }, | ||
| }, | ||
| }, | ||
|
|
||
| latest99thPercentileRequestTime: { | ||
| name: 'Latest 99th percentile request time', | ||
| type: 'gauge', | ||
| unit: 's', | ||
| description: 'The 99th percentile amount of time for "latest" page requests.', | ||
| sources: { | ||
| prometheus: { | ||
| expr: 'discourse_http_duration_seconds{%(queriesSelector)s,quantile="0.99",action="latest"}', | ||
| }, | ||
| }, | ||
| }, | ||
|
|
||
| topic99thPercentileRequestTime: { | ||
| name: 'Topic 99th percentile request time', | ||
| type: 'gauge', | ||
| unit: 's', | ||
| description: 'The 99th percentile amount of time for "topics show" requests.', | ||
| sources: { | ||
| prometheus: { | ||
| expr: 'discourse_http_duration_seconds{%(queriesSelector)s,quantile="0.99",controller="topics"}', | ||
| }, | ||
| }, | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we have proper histogram data from the looks of it, what do you think of replacing these with histograms? Or perhaps in addition to the 99th quantile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good idea 👍 will look into it 👀
discourse-mixin/panels.libsonnet
Outdated
| sidekiqJobDuration: | ||
| commonlib.panels.generic.stat.base.new('Sidekiq job duration', targets=[signals.jobs.sidekiqJobDuration.asTarget()]) | ||
| + g.panel.stat.panelOptions.withDescription('Time spent in Sidekiq jobs broken out by job name.') | ||
| + g.panel.stat.standardOptions.withUnit('s'), | ||
|
|
||
| scheduledJobDuration: | ||
| commonlib.panels.generic.stat.base.new('Scheduled job duration', targets=[signals.jobs.scheduledJobDuration.asTarget()]) | ||
| + g.panel.stat.panelOptions.withDescription('Time spent in scheduled jobs broken out by job name.') | ||
| + g.panel.stat.standardOptions.withUnit('s'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love to see these two panels also have right-aligned table legends, with an additional column for max/mode/current duration
| alert: 'DiscourseHigh5xxErrors', | ||
| expr: ||| | ||
| 100 * rate(discourse_http_requests{status="500"}[5m]) / on() group_left() (sum(rate(discourse_http_requests[5m])) by (instance)) > %(alertsCritical5xxResponses)s | ||
| ||| % this.config, | ||
| 'for': '5m', | ||
| labels: { | ||
| severity: 'critical', | ||
| }, | ||
| annotations: { | ||
| summary: 'More than %(alertsCritical5xxResponses)s%% of all requests result in a 5XX.' % this.config, | ||
| description: | ||
| ('{{ printf "%%.2f" $value }}%% of all requests are resulting in 500 status codes, ' + | ||
| 'which is above the threshold %(alertsCritical5xxResponses)s%%, ' + | ||
| 'indicating a potentially larger issue for {{$labels.instance}}') % this.config, | ||
| }, | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We reference 5xx in the summary and alert name, but only 500 in the query expression and description.
Should update this to be a proper regex the same as 4.. below
468dbe1 to
8778033
Compare
8778033 to
52beb8e
Compare
aalhour
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments and requests for changes.
| + g.panel.timeSeries.options.legend.withAsTable(true) | ||
| + g.panel.timeSeries.options.legend.withPlacement('right') | ||
| + g.panel.timeSeries.panelOptions.withDescription('Time spent in Sidekiq jobs broken out by job name.') | ||
| + g.panel.timeSeries.standardOptions.withUnit('s'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this panel would look better with stacking enabled? Since we want to show cumulative job duration and breakdown by job.
| + g.panel.timeSeries.options.legend.withAsTable(true) | ||
| + g.panel.timeSeries.options.legend.withPlacement('right') | ||
| + g.panel.timeSeries.panelOptions.withDescription('Time spent in scheduled jobs broken out by job name.') | ||
| + g.panel.timeSeries.standardOptions.withUnit('s'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar question to the sidekiqJobDuration.
| } | ||
| }, | ||
| "version": "master" | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to add a dependency to grafonnet-v11.4.0?
| targets=[signals.overview.pageViews.asTarget()] | ||
| ) | ||
| + g.panel.timeSeries.panelOptions.withDescription('Rate of pageviews for the entire application. Grouped by type and service.') | ||
| + g.panel.timeSeries.standardOptions.withUnit('none'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to change the unit to something more semantic like views/sec?
| g.panel.histogram.new('Latest request time') | ||
| + g.panel.histogram.queryOptions.withTargets([signals.overview.latest99thPercentileRequestTime.asTarget()]) | ||
| + g.panel.histogram.panelOptions.withDescription('The 99th percentile amount of time for "latest" page requests for the selected site.') | ||
| + g.panel.histogram.standardOptions.withUnit('s'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether these should be timeseries or histograms. Since the datasource is not a histogram but a precalculated quantile value this might be broken. I am in favor of unifying both. What do you think?
| g.panel.histogram.new('Topic show request time') | ||
| + g.panel.histogram.queryOptions.withTargets([signals.overview.topic99thPercentileRequestTime.asTarget()]) | ||
| + g.panel.histogram.panelOptions.withDescription('The amount of time for "topics show" requests for the selected site.') | ||
| + g.panel.histogram.standardOptions.withUnit('s'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment to the histogram above it.
This PR updates the discourse-mixin to use commonlib/g.libsonnet to generate its dashboards. It also employs the signals API to help manage the different prometheus targets which can be extended into the future.