Commit 03e8349

doc(Metrics&Alert): add README for the monitoring/alert handling module
1 parent 8fd98a7 commit 03e8349

internal/alerting/README.md

Lines changed: 176 additions & 8 deletions
@@ -1,8 +1,176 @@
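-The monitoring and alerting module can also be split into several folders:
-alerting/
-├── receiver/      # alert reception and handling
-├── rules/         # alert rule tuning
-├── metadata/      # monitoring and alerting metadata
-├── healthcheck/   # periodic health-check tasks
-├── severity/      # alert severity calculation
-└── remediation/   # automated remediation actions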
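Monitoring/Alert Handling Module (Alerting)

Goals
- Receive, aggregate, and deduplicate alert events from Prometheus, ES, and third-party sources in one place
- Merge events into alert issues (Issue) with lifecycle and state-machine management
- Combine service metadata to compute the impact scope and produce the final alert level (P0/P1/P2/Warning)
- Support automatic and semi-automatic healing, writing results back as comments to form a traceable handling record
- Provide query, search, and statistics APIs that serve the console and automation workflows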
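
Module boundaries
- Inputs: alert event streams (webhook/polling), metric queries, service metadata
- Outputs: alert issue (Issue) data, comment records, notifications, healing execution
- Out of scope: metric collection and operating the underlying storage (handled by shared components)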

Directory structure (Hexagonal/Clean)
internal/alerting/
- domain/                   domain models and ports (interfaces)
  - types.go                Level/Issue/Comment/Event, etc.
  - ports.go                IssueRepository/RuleCalculator/Notifier/Healer, etc.
- usecase/                  application services (aggregation, state machine, level calculation, healing orchestration)
  - service.go              New(repo, rules, notifiers, healers)
- adapter/                  adapters: transport/storage/rules/notification/healing/ingestion
  - httpapi/                HTTP routes and DTOs
  - repository/memory/      in-memory repository (example)
  - rules/default/          default level calculator
  - notifier/feishu/        Feishu notifier (example)
  - ingest/prometheus/      Prometheus event ingestion (example)
- api/                      convenience wiring (called directly by the example project)
- scheduler/                (optional) scheduled health-check/inspection tasks
- README.md                 this document
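
The split between `domain` ports and `adapter` implementations could look roughly like the sketch below. It is only an illustration: the README names the types and ports but not their fields or method sets, so every field, method signature, the `Handle` method, the `passThrough` calculator, and the two-argument `New` (trimmed from the four-argument form listed above) are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Level is the final alert level named in the README.
type Level string

const (
	LevelP0 Level = "P0"
	LevelP1 Level = "P1"
)

// Event is one normalized alert event (field set is an assumption).
type Event struct {
	Source      string
	Fingerprint string
	Title       string
	RawLevel    Level
	At          time.Time
}

// Issue is the aggregated alert problem, trimmed to a few fields.
type Issue struct {
	ID          string
	Fingerprint string
	Title       string
	Level       Level
	AlertSince  time.Time
}

// Ports: these method sets are hypothetical.
type IssueRepository interface {
	Save(issue Issue) error
	FindByFingerprint(fp string) (Issue, error)
}

type RuleCalculator interface {
	Calculate(e Event) Level
}

var ErrNotFound = errors.New("issue not found")

// MemoryRepo is a toy adapter that implements IssueRepository.
type MemoryRepo struct{ byFP map[string]Issue }

func NewMemoryRepo() *MemoryRepo { return &MemoryRepo{byFP: map[string]Issue{}} }

func (r *MemoryRepo) Save(i Issue) error { r.byFP[i.Fingerprint] = i; return nil }

func (r *MemoryRepo) FindByFingerprint(fp string) (Issue, error) {
	i, ok := r.byFP[fp]
	if !ok {
		return Issue{}, ErrNotFound
	}
	return i, nil
}

// passThrough is a placeholder calculator that keeps the source level.
type passThrough struct{}

func (passThrough) Calculate(e Event) Level { return e.RawLevel }

// Service lives in usecase and depends only on the ports.
type Service struct {
	repo  IssueRepository
	rules RuleCalculator
}

func New(repo IssueRepository, rules RuleCalculator) *Service {
	return &Service{repo: repo, rules: rules}
}

// Handle creates the issue on first sight, recomputes its level, and saves it.
func (s *Service) Handle(e Event) (Issue, error) {
	issue, err := s.repo.FindByFingerprint(e.Fingerprint)
	if errors.Is(err, ErrNotFound) {
		issue = Issue{ID: "iss-" + e.Fingerprint, Fingerprint: e.Fingerprint, Title: e.Title, AlertSince: e.At}
	} else if err != nil {
		return Issue{}, err
	}
	issue.Level = s.rules.Calculate(e)
	return issue, s.repo.Save(issue)
}

func main() {
	svc := New(NewMemoryRepo(), passThrough{})
	issue, err := svc.Handle(Event{Source: "prometheus", Fingerprint: "abc", Title: "latency high", RawLevel: LevelP1, At: time.Now()})
	fmt.Println(issue.ID, issue.Level, err)
}
```

Because `Service` only sees the two interfaces, the in-memory repository can be replaced by a MySQL-backed one without touching the use case.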

Data model (MySQL)
1) alert_issues (alert issue table)
- Primary key: id (string or auto-increment; a string is recommended so IDs stay unique across sources)
- Fields:
  - state: issue state (Open/Closed)
  - level: alert level (P0/P1/P2/Warning)
  - alert_state: handling state (InProcessing/AutoRestored/Restored)
  - title: title
  - labels: JSON ([{key,value}...])
  - alert_since: DATETIME (time of the first alert)
  - resolved_at: DATETIME (recovery time, nullable)
  - source: origin (prometheus/es/...)
  - fingerprint: deduplication fingerprint (events for the same problem are merged)
  - extra: JSON extension (raw dimensions, links, etc.)

2) alert_issue_comments (alert issue comment table)
- Primary key: id (auto-increment)
- issue_id: foreign key referencing alert_issues.id
- created_at: DATETIME
- author: string (e.g. system/ai/user@name)
- content: MEDIUMTEXT (Markdown; records AI, system, and human actions)

Table creation example
```sql
-- One row per aggregated alert issue.
CREATE TABLE alert_issues (
  id           VARCHAR(64)  PRIMARY KEY,
  state        VARCHAR(16)  NOT NULL,  -- Open/Closed
  level        VARCHAR(16)  NOT NULL,  -- P0/P1/P2/Warning
  alert_state  VARCHAR(32)  NOT NULL,  -- InProcessing/AutoRestored/Restored
  title        VARCHAR(255) NOT NULL,
  labels       JSON         NULL,      -- [{key,value}...]
  alert_since  DATETIME(3)  NOT NULL,  -- time of the first alert
  resolved_at  DATETIME(3)  NULL,      -- recovery time
  source       VARCHAR(32)  NOT NULL,  -- prometheus/es/...
  fingerprint  VARCHAR(128) NOT NULL,  -- deduplication fingerprint
  extra        JSON         NULL,
  KEY idx_state_level (state, level),
  KEY idx_fingerprint (fingerprint),
  KEY idx_alert_since (alert_since)
);

-- Comments record AI, system, and human actions on an issue.
CREATE TABLE alert_issue_comments (
  id          BIGINT      PRIMARY KEY AUTO_INCREMENT,
  issue_id    VARCHAR(64) NOT NULL,
  created_at  DATETIME(3) NOT NULL,
  author      VARCHAR(64) NOT NULL,
  content     MEDIUMTEXT  NOT NULL,  -- Markdown
  KEY idx_issue (issue_id),
  CONSTRAINT fk_issue FOREIGN KEY (issue_id) REFERENCES alert_issues(id)
);
```

State machine
- Issue.state: Open → Closed (one-way, closes the loop)
- Issue.alert_state:
  - InProcessing (being handled after the alert fires)
  - AutoRestored (recovered by automatic healing)
  - Restored (recovered manually or by an external system)
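
As a rough illustration of the two state fields, the sketch below enforces the one-way Open → Closed rule and a small transition table for `alert_state`. The README names the states but not the allowed edges, so the table (InProcessing may move to AutoRestored or Restored, and both of those are terminal) is an assumption, as are the method names.

```go
package main

import "fmt"

type State string
type AlertState string

const (
	Open   State = "Open"
	Closed State = "Closed"

	InProcessing AlertState = "InProcessing"
	AutoRestored AlertState = "AutoRestored"
	Restored     AlertState = "Restored"
)

// allowedAlert lists the assumed alert_state transitions.
// States without an entry are treated as terminal.
var allowedAlert = map[AlertState][]AlertState{
	InProcessing: {AutoRestored, Restored},
}

type Issue struct {
	State      State
	AlertState AlertState
}

// Close moves the issue from Open to Closed; any other starting state is an error.
func (i *Issue) Close() error {
	if i.State != Open {
		return fmt.Errorf("cannot close issue in state %s", i.State)
	}
	i.State = Closed
	return nil
}

// SetAlertState applies a transition only if the table allows it.
func (i *Issue) SetAlertState(next AlertState) error {
	for _, s := range allowedAlert[i.AlertState] {
		if s == next {
			i.AlertState = next
			return nil
		}
	}
	return fmt.Errorf("transition %s -> %s not allowed", i.AlertState, next)
}

func main() {
	issue := &Issue{State: Open, AlertState: InProcessing}
	fmt.Println(issue.SetAlertState(AutoRestored))
	fmt.Println(issue.Close())
	fmt.Println(issue.Close()) // second close is rejected
}
```

The sketch follows the one-way rule as stated in this section; supporting the `reopen` endpoint listed under the API would need an additional Closed → Open transition.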
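
Alert level calculation
- Input: the original alert level (from the source) plus the service impact scope (traffic, tenant count, region, criticality)
- Output: the final level (P0/P1/P2/Warning)
- The calculator lives in `rules/` behind an interface, so it can be swapped out and unit tested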
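
One possible shape for such a calculator is sketched below. The README does not specify the rules, so the `Impact` fields, the thresholds, and the "raise by one step per strong signal" policy are all invented for illustration.

```go
package main

import "fmt"

// Level is ordered so that a larger value means a more severe alert.
type Level int

const (
	Warning Level = iota
	P2
	P1
	P0
)

func (l Level) String() string { return [...]string{"Warning", "P2", "P1", "P0"}[l] }

// Impact describes the service impact scope; field names and units are assumptions.
type Impact struct {
	TrafficShare float64 // fraction of total traffic affected, 0..1
	Tenants      int     // number of affected tenants
	Regions      int     // number of affected regions
	Core         bool    // whether the service is on a core path
}

// Calculator is the port; implementations can be swapped or faked in tests.
type Calculator interface {
	Calculate(raw Level, imp Impact) Level
}

// DefaultCalculator raises the source level by one step per strong signal.
type DefaultCalculator struct{}

func (DefaultCalculator) Calculate(raw Level, imp Impact) Level {
	level := raw
	if imp.Core {
		level++
	}
	// Illustrative thresholds only.
	if imp.TrafficShare >= 0.5 || imp.Tenants >= 100 || imp.Regions >= 2 {
		level++
	}
	if level > P0 {
		level = P0 // never exceed the top level
	}
	return level
}

func main() {
	var calc Calculator = DefaultCalculator{}
	fmt.Println(calc.Calculate(P2, Impact{Core: true, Regions: 3}))
}
```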
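
Aggregation and deduplication
- Fingerprint: fingerprint = hash(source, rule_id, resource, dimensions...)
- Events with the same fingerprint within the time window are merged into one Issue, updating `alert_since`, the count, and the last-seen time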
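
The fingerprint and the time-window merge could be sketched as follows. SHA-256, the NUL separator, sorting the dimension keys, and the `Aggregator` type are assumptions rather than the module's actual choices. The sketch keeps `AlertSince` at the first occurrence and only advances the count and last-seen time on a merge.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
	"time"
)

// Fingerprint hashes the identifying fields. Dimension keys are sorted so
// that map iteration order does not change the result.
func Fingerprint(source, ruleID, resource string, dims map[string]string) string {
	keys := make([]string, 0, len(dims))
	for k := range dims {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, part := range []string{source, ruleID, resource} {
		b.WriteString(part)
		b.WriteByte(0) // separator so ("ab","c") differs from ("a","bc")
	}
	for _, k := range keys {
		b.WriteString(k)
		b.WriteByte('=')
		b.WriteString(dims[k])
		b.WriteByte(0)
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

type Issue struct {
	Fingerprint string
	AlertSince  time.Time
	LastSeen    time.Time
	Count       int
}

// Aggregator merges events that share a fingerprint within a time window.
type Aggregator struct {
	window time.Duration
	latest map[string]*Issue
}

func NewAggregator(window time.Duration) *Aggregator {
	return &Aggregator{window: window, latest: map[string]*Issue{}}
}

// Observe returns the issue the event was merged into, or a new one
// when the previous issue was last seen longer ago than the window.
func (a *Aggregator) Observe(fp string, at time.Time) *Issue {
	if cur, ok := a.latest[fp]; ok && at.Sub(cur.LastSeen) <= a.window {
		cur.Count++
		cur.LastSeen = at
		return cur
	}
	issue := &Issue{Fingerprint: fp, AlertSince: at, LastSeen: at, Count: 1}
	a.latest[fp] = issue
	return issue
}

func main() {
	fp := Fingerprint("prometheus", "rule-1", "s3apiv2", map[string]string{"idc": "yzh", "api": "s3apiv2.putobject"})
	agg := NewAggregator(5 * time.Minute)
	now := time.Now()
	agg.Observe(fp, now)
	issue := agg.Observe(fp, now.Add(time.Minute))
	fmt.Println(len(fp), issue.Count)
}
```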

API
1) List
GET /v1/issues?start=xxx&limit=10&state=Closed|Open
Response:
{
  "items": [
    {
      "id": "xxx",
      "state": "Closed",
      "level": "P0",
      "alertState": "Restored",
      "title": "yzh S3APIV2s3apiv2.putobject 0_64K upload response time P95: 50012ms > 450ms",
      "labels": [{"key":"api","value":"s3apiv2.putobject"},{"key":"idc","value":"yzh"}],
      "alertSince": "2025-05-05T11:00:00Z",
      "resolved_at": "2025-05-05T12:00:00Z"
    }
  ],
  "next": "cursor-token"
}

2) Detail
GET /v1/issues/:issueID
Response:
{
  "id": "xxx",
  "state": "Closed",
  "level": "P0",
  "alertState": "Restored",
  "title": "yzh S3APIV2s3apiv2.putobject 0_64K upload response time P95: 50012ms > 450ms",
  "labels": [{"key":"api","value":"s3apiv2.putobject"},{"key":"idc","value":"yzh"}],
  "alertSince": "2025-05-05T11:00:00Z",
  "resolved_at": "2025-05-05T12:00:00Z",
  "comments": [
    {"createdAt": "2024-01-03T03:00:00Z", "author": "ai", "content": "markdown content"}
  ]
}

3) Add a comment
POST /v1/issues/:issueID/comments
Request:
{ "author": "user@name", "content": "markdown" }
Response: 204

4) Manually close or reopen
POST /v1/issues/:issueID/close
POST /v1/issues/:issueID/reopen
Response: 200
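
A DTO layer matching the list response might look like the sketch below. The JSON keys are copied from the samples above, including the `resolved_at` spelling. The Go type names, `ParseListQuery`, the default limit of 10, and the validation rules are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
)

type Label struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// IssueDTO mirrors one element of "items" in the list response.
type IssueDTO struct {
	ID         string  `json:"id"`
	State      string  `json:"state"`
	Level      string  `json:"level"`
	AlertState string  `json:"alertState"`
	Title      string  `json:"title"`
	Labels     []Label `json:"labels"`
	AlertSince string  `json:"alertSince"`
	ResolvedAt *string `json:"resolved_at"` // nil encodes as null while unresolved
}

type ListResponse struct {
	Items []IssueDTO `json:"items"`
	Next  string     `json:"next"`
}

// ListQuery holds the parsed query parameters of GET /v1/issues.
type ListQuery struct {
	Start string
	Limit int
	State string
}

// ParseListQuery validates the raw query string (the part after "?").
func ParseListQuery(raw string) (ListQuery, error) {
	v, err := url.ParseQuery(raw)
	if err != nil {
		return ListQuery{}, err
	}
	q := ListQuery{Start: v.Get("start"), Limit: 10, State: v.Get("state")}
	if s := v.Get("limit"); s != "" {
		n, err := strconv.Atoi(s)
		if err != nil || n <= 0 {
			return ListQuery{}, fmt.Errorf("invalid limit %q", s)
		}
		q.Limit = n
	}
	switch q.State {
	case "", "Open", "Closed":
	default:
		return ListQuery{}, fmt.Errorf("invalid state %q", q.State)
	}
	return q, nil
}

// Encode renders a list response as JSON text.
func Encode(resp ListResponse) (string, error) {
	b, err := json.Marshal(resp)
	return string(b), err
}

func main() {
	q, err := ParseListQuery("start=abc&limit=10&state=Closed")
	fmt.Println(q, err)

	resolved := "2025-05-05T12:00:00Z"
	out, _ := Encode(ListResponse{
		Items: []IssueDTO{{
			ID: "xxx", State: "Closed", Level: "P0", AlertState: "Restored",
			Title:      "example",
			Labels:     []Label{{Key: "idc", Value: "yzh"}},
			AlertSince: "2025-05-05T11:00:00Z", ResolvedAt: &resolved,
		}},
		Next: "cursor-token",
	})
	fmt.Println(out)
}
```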

Ingestion (Ingress)
- Prometheus webhook: /v1/ingest/prometheus
- Elastic/Logs: custom handlers under `ingest/`
- Each ingestion adapter normalizes its input into the internal Event and hands it to the service layer for aggregation
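
The normalization step for the Prometheus endpoint could look like the sketch below. It assumes the endpoint receives Alertmanager's webhook payload and reads only the `alerts[].status`, `labels`, `annotations`, `startsAt`, and `fingerprint` fields. The internal `Event` shape and the choice of title (the `summary` annotation, falling back to the `alertname` label) are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// webhookPayload covers only the fields this sketch reads.
type webhookPayload struct {
	Alerts []struct {
		Status      string            `json:"status"`
		Labels      map[string]string `json:"labels"`
		Annotations map[string]string `json:"annotations"`
		StartsAt    time.Time         `json:"startsAt"`
		Fingerprint string            `json:"fingerprint"`
	} `json:"alerts"`
}

// Event is the internal, source-independent form (hypothetical shape).
type Event struct {
	Source      string
	Fingerprint string
	Title       string
	Firing      bool
	Labels      map[string]string
	At          time.Time
}

// Normalize converts one webhook body into internal events.
func Normalize(body []byte) ([]Event, error) {
	var p webhookPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, fmt.Errorf("decode webhook: %w", err)
	}
	events := make([]Event, 0, len(p.Alerts))
	for _, a := range p.Alerts {
		title := a.Annotations["summary"]
		if title == "" {
			title = a.Labels["alertname"]
		}
		events = append(events, Event{
			Source:      "prometheus",
			Fingerprint: a.Fingerprint,
			Title:       title,
			Firing:      a.Status == "firing",
			Labels:      a.Labels,
			At:          a.StartsAt,
		})
	}
	return events, nil
}

func main() {
	body := []byte(`{"alerts":[{"status":"firing","labels":{"alertname":"HighLatency","idc":"yzh"},"startsAt":"2025-05-05T11:00:00Z","fingerprint":"abc"}]}`)
	events, err := Normalize(body)
	fmt.Println(len(events), err)
}
```

An HTTP handler for `/v1/ingest/prometheus` would read the request body, call `Normalize`, and pass each event to the service layer.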

Healing
- `healing/` defines actions (such as restart, scale-out, cache clearing) that an orchestrator chains together
- Execution results are written to `alert_issue_comments` and may update `alert_state`
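
A minimal orchestrator along these lines is sketched below. The `Action` interface, the stop-on-first-failure policy, the `system` author, and the in-memory comment slice standing in for `alert_issue_comments` are all assumptions.

```go
package main

import "fmt"

// Action is one healing step, such as a restart or a cache clear.
type Action interface {
	Name() string
	Run(issueID string) error
}

// funcAction adapts a plain function to the Action interface.
type funcAction struct {
	name string
	fn   func(issueID string) error
}

func (a funcAction) Name() string             { return a.name }
func (a funcAction) Run(issueID string) error { return a.fn(issueID) }

// Succeed and Fail build stub actions for the example.
func Succeed(name string) Action {
	return funcAction{name: name, fn: func(string) error { return nil }}
}

func Fail(name, reason string) Action {
	return funcAction{name: name, fn: func(string) error { return fmt.Errorf("%s", reason) }}
}

// Comment stands in for a row of alert_issue_comments.
type Comment struct {
	IssueID string
	Author  string
	Content string
}

type Orchestrator struct {
	Actions  []Action
	Comments []Comment
}

// Heal runs the actions in order, records each result as a comment, and
// returns the resulting alert_state. It stops at the first failure.
func (o *Orchestrator) Heal(issueID string) string {
	for _, a := range o.Actions {
		if err := a.Run(issueID); err != nil {
			o.note(issueID, fmt.Sprintf("action `%s` failed: %v", a.Name(), err))
			return "InProcessing"
		}
		o.note(issueID, fmt.Sprintf("action `%s` succeeded", a.Name()))
	}
	return "AutoRestored"
}

func (o *Orchestrator) note(issueID, content string) {
	o.Comments = append(o.Comments, Comment{IssueID: issueID, Author: "system", Content: content})
}

func main() {
	o := &Orchestrator{Actions: []Action{Succeed("clear-cache"), Fail("restart", "pod not found")}}
	fmt.Println(o.Heal("iss-1"))
	for _, c := range o.Comments {
		fmt.Println(c.Author, c.Content)
	}
}
```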

Notification (Notifier)
- Triggered when state changes or the level is upgraded
- Adapts DingTalk, Feishu, and email through `notifier/`, with support for silence windows and deduplication
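
The trigger, deduplication, and silence-window rules could be combined in a small gate placed in front of the real notifiers, as sketched below. The README does not define how the silence window interacts with upgrades, so this sketch makes the simplest assumption: a notification is sent only when the state changed or the level rose since the last one sent, and nothing is sent for an issue within the silence window after its previous notification. All names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// rank orders levels so that an upgrade can be detected.
var rank = map[string]int{"Warning": 0, "P2": 1, "P1": 2, "P0": 3}

type Notification struct {
	IssueID string
	State   string
	Level   string
}

// Gate decides whether a notification should reach the notifiers.
type Gate struct {
	silence time.Duration
	last    map[string]Notification // last notification sent per issue
	sentAt  map[string]time.Time    // when it was sent
}

func NewGate(silence time.Duration) *Gate {
	return &Gate{silence: silence, last: map[string]Notification{}, sentAt: map[string]time.Time{}}
}

// ShouldSend reports whether n passes the gate at time now, and records it if so.
func (g *Gate) ShouldSend(n Notification, now time.Time) bool {
	prev, seen := g.last[n.IssueID]
	changed := !seen || prev.State != n.State || rank[n.Level] > rank[prev.Level]
	if !changed {
		return false // duplicate or downgrade: nothing new to report
	}
	if seen && now.Sub(g.sentAt[n.IssueID]) < g.silence {
		return false // still inside the silence window
	}
	g.last[n.IssueID] = n
	g.sentAt[n.IssueID] = now
	return true
}

func main() {
	g := NewGate(time.Minute)
	now := time.Now()
	fmt.Println(g.ShouldSend(Notification{"iss-1", "Open", "P1"}, now))
	fmt.Println(g.ShouldSend(Notification{"iss-1", "Open", "P1"}, now.Add(2*time.Minute)))
	fmt.Println(g.ShouldSend(Notification{"iss-1", "Open", "P0"}, now.Add(3*time.Minute)))
}
```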

Scheduled health checks (Scheduler)
- Periodically inspect SLOs and critical paths, turning anomalies into Issues that flow into the same closed loop

Security and audit
- API requests are authenticated at the gateway; important actions (close/ignore/heal) are recorded as comments and in the audit log
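
Code organization suggestions (Go)
- domain: domain models and port interfaces; no external dependencies
- usecase: depends only on domain; calls adapters through ports
- adapter: implements the ports; adapters are decoupled from each other and replaceable
- api: the HTTP layer only handles DTOs and orchestration and depends on usecase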

Testing suggestions
- rules: table-driven tests based on sample data
- service: unit tests for the state machine and aggregation flow
- api: end-to-end tests at the handler layer (through a fake service)
