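1 | | -The monitoring and alerting module can also be split into multiple folders: |
2 | | -alerting/ |
3 | | -├── receiver/ # Alert reception and handling |
4 | | -├── rules/ # Alert rule adjustment |
5 | | -├── metadata/ # Monitoring and alerting metadata |
6 | | -├── healthcheck/ # Periodic health-check tasks |
7 | | -├── severity/ # Alert level calculation |
8 | | -└── remediation/ # Automated healing actions |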
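| 1 | +Monitoring/Alert Handling Module (Alerting)
| 2 | +
| 3 | +Goals
| 4 | +- Receive, aggregate, and deduplicate alert events from Prometheus, ES, and third-party sources in one place
| 5 | +- Merge events into alert issues (Issue) with lifecycle and state-machine management
| 6 | +- Combine service metadata to compute the impact scope and produce the final alert level (P0/P1/P2/Warning)
| 7 | +- Support automatic and semi-automatic healing, with results written back as comments to form a traceable handling record
| 8 | +- Provide query, search, and statistics APIs for the console and automation workflows
| 9 | +
| 10 | +Module boundaries
| 11 | +- Input: alert event streams (webhook/polling), metric queries, service metadata
| 12 | +- Output: alert issue (Issue) data, comment records, notifications, healing execution
| 13 | +- Out of scope: metric collection and operation of the underlying storage (handled by shared components)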
| 14 | + |
| 15 | +Directory layout (Hexagonal/Clean)
| 16 | +internal/alerting/
| 17 | +- domain/ Domain models and ports (interfaces)
| 18 | + - types.go Level/Issue/Comment/Event, etc.
| 19 | + - ports.go IssueRepository/RuleCalculator/Notifier/Healer, etc.
| 20 | +- usecase/ Application services (aggregation, state machine, level calculation, healing orchestration)
| 21 | + - service.go New(repo, rules, notifiers, healers)
| 22 | +- adapter/ Adapters: transport/storage/rules/notification/healing/ingestion
| 23 | + - httpapi/ HTTP routes and DTOs
| 24 | + - repository/memory/ In-memory repository (example)
| 25 | + - rules/default/ Default level calculator
| 26 | + - notifier/feishu/ Feishu notifier (example)
| 27 | + - ingest/prometheus/ Prometheus event ingestion (example)
| 28 | +- api/ Convenience wiring (called directly by the example project)
| 29 | +- scheduler/ (Optional) scheduled health-check/inspection tasks
| 30 | +- README.md This document
| 31 | + |
| 32 | +Data model (MySQL)
| 33 | +1) alert_issues (alert issue table)
| 34 | +- Primary key: id (string or auto-increment; a string is recommended so IDs stay unique across sources)
| 35 | +- Fields:
| 36 | +- state: issue state (Open/Closed)
| 37 | +- level: alert level (P0/P1/P2/Warning)
| 38 | +- alert_state: handling state (InProcessing/AutoRestored/Restored)
| 39 | +- title: title
| 40 | +- labels: JSON ([{key,value}...])
| 41 | +- alert_since: DATETIME (time of the first alert)
| 42 | +- resolved_at: DATETIME (recovery time, nullable)
| 43 | +- source: origin (prometheus/es/...)
| 44 | +- fingerprint: deduplication fingerprint (events for the same problem are merged)
| 45 | +- extra: JSON extension (raw dimensions, links, etc.)
| 46 | +
| 47 | +2) alert_issue_comments (alert issue comment table)
| 48 | +- Primary key: id (auto-increment)
| 49 | +- issue_id: foreign key referencing alert_issues.id
| 50 | +- created_at: DATETIME
| 51 | +- author: string (e.g. system/ai/user@name)
| 52 | +- content: TEXT (Markdown; records AI, system, and human actions)
| 53 | +
| 54 | +Example DDL
| 55 | +```sql |
| 56 | +CREATE TABLE alert_issues ( |
| 57 | + id VARCHAR(64) PRIMARY KEY, |
| 58 | + state VARCHAR(16) NOT NULL, |
| 59 | + level VARCHAR(16) NOT NULL, |
| 60 | + alert_state VARCHAR(32) NOT NULL, |
| 61 | + title VARCHAR(255) NOT NULL, |
| 62 | + labels JSON NULL, |
| 63 | + alert_since DATETIME(3) NOT NULL, |
| 64 | + resolved_at DATETIME(3) NULL, |
| 65 | + source VARCHAR(32) NOT NULL, |
| 66 | + fingerprint VARCHAR(128) NOT NULL, |
| 67 | + extra JSON NULL, |
| 68 | + KEY idx_state_level (state, level), |
| 69 | + KEY idx_fingerprint (fingerprint), |
| 70 | + KEY idx_alert_since (alert_since) |
| 71 | +); |
| 72 | + |
| 73 | +CREATE TABLE alert_issue_comments ( |
| 74 | + id BIGINT PRIMARY KEY AUTO_INCREMENT, |
| 75 | + issue_id VARCHAR(64) NOT NULL, |
| 76 | + created_at DATETIME(3) NOT NULL, |
| 77 | + author VARCHAR(64) NOT NULL, |
| 78 | + content MEDIUMTEXT NOT NULL, |
| 79 | + KEY idx_issue (issue_id), |
| 80 | + CONSTRAINT fk_issue FOREIGN KEY (issue_id) REFERENCES alert_issues(id) |
| 81 | +); |
| 82 | +``` |
| 83 | + |
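| 84 | +State machine
| 85 | +- Issue.state: Open → Closed (one-way, closes the loop)
| 86 | +- Issue.alert_state:
| 87 | +- InProcessing (being handled after the alert fired)
| 88 | +- AutoRestored (recovered by automatic healing)
| 89 | +- Restored (recovered manually or by an external system)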
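The two state fields above can be sketched as Go string types with explicit transition checks. The type and function names are hypothetical, and the rule that `InProcessing` is the only non-terminal `alert_state` is an assumption, since the list above names the values but not the allowed moves between them.

```go
package main

import "fmt"

// IssueState mirrors the alert_issues.state column.
type IssueState string

const (
	StateOpen   IssueState = "Open"
	StateClosed IssueState = "Closed"
)

// AlertState mirrors the alert_issues.alert_state column.
type AlertState string

const (
	AlertInProcessing AlertState = "InProcessing"
	AlertAutoRestored AlertState = "AutoRestored"
	AlertRestored     AlertState = "Restored"
)

// CanTransitionState allows only Open -> Closed, the one-way rule stated above.
func CanTransitionState(from, to IssueState) bool {
	return from == StateOpen && to == StateClosed
}

// CanTransitionAlert treats InProcessing as the only state that can be left.
// This is an assumption made for the sketch, not a documented rule.
func CanTransitionAlert(from, to AlertState) bool {
	if from != AlertInProcessing {
		return false
	}
	return to == AlertAutoRestored || to == AlertRestored
}

func main() {
	fmt.Println(CanTransitionState(StateOpen, StateClosed))
	fmt.Println(CanTransitionState(StateClosed, StateOpen))
	fmt.Println(CanTransitionAlert(AlertInProcessing, AlertAutoRestored))
}
```

Keeping the checks as pure functions makes the state machine easy to cover with table-driven tests before it is wired into the service layer.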
| 90 | + |
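| 91 | +Alert level calculation
| 92 | +- Input: raw alert level (from the source) + service impact scope (traffic, tenant count, regions, criticality)
| 93 | +- Output: final level (P0/P1/P2/Warning)
| 94 | +- The calculator lives under `rules/`; because it sits behind an interface, it can be hot-swapped and unit-tested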
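A minimal sketch of such a calculator follows. `RuleCalculator` is the port name listed in `ports.go`, but the method signature, the `Impact` struct, and every threshold below are hypothetical values chosen for illustration only.

```go
package main

import "fmt"

// Level is ordered so that a larger value means a more severe alert.
type Level int

const (
	Warning Level = iota
	P2
	P1
	P0
)

func (l Level) String() string {
	return [...]string{"Warning", "P2", "P1", "P0"}[l]
}

// Impact is a hypothetical summary of the service impact scope.
type Impact struct {
	TrafficShare float64 // fraction of total traffic affected, 0..1
	Tenants      int     // number of affected tenants
	Regions      int     // number of affected regions
	Core         bool    // whether the service is a core service
}

// RuleCalculator is the port; the signature is an assumption.
type RuleCalculator interface {
	Calculate(raw Level, impact Impact) Level
}

// DefaultCalculator escalates the raw level by at most two steps.
type DefaultCalculator struct{}

func (DefaultCalculator) Calculate(raw Level, im Impact) Level {
	level := raw
	if im.Core { // core services escalate one step
		level = escalate(level)
	}
	// Illustrative thresholds: a wide blast radius escalates one more step.
	if im.TrafficShare >= 0.5 || im.Tenants >= 100 || im.Regions >= 2 {
		level = escalate(level)
	}
	return level
}

// escalate moves one step toward P0 and saturates there.
func escalate(l Level) Level {
	if l < P0 {
		return l + 1
	}
	return P0
}

func main() {
	var calc RuleCalculator = DefaultCalculator{}
	fmt.Println(calc.Calculate(P2, Impact{Core: true, Tenants: 200}))
}
```

Because the use case depends only on the interface, a different rule set can replace `DefaultCalculator` without touching the aggregation code.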
| 95 | + |
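| 96 | +Aggregation and deduplication
| 97 | +- Fingerprint: fingerprint = hash(source, rule_id, resource, dimensions...)
| 98 | +- Events with the same fingerprint inside the time window are merged into the same Issue, updating `alert_since`/count/last-seen time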
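One way to compute a stable fingerprint is sketched below. The choice of SHA-256, the NUL field separator, and the function name `Fingerprint` are assumptions; the formula above only says `hash(...)`. Dimensions are sorted by key so that map iteration order does not change the result.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Fingerprint hashes the identifying fields of an event.
// The hex-encoded SHA-256 digest is 64 characters, which fits VARCHAR(128).
func Fingerprint(source, ruleID, resource string, dims map[string]string) string {
	keys := make([]string, 0, len(dims))
	for k := range dims {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order regardless of map iteration

	h := sha256.New()
	write := func(s string) {
		h.Write([]byte(s))
		h.Write([]byte{0}) // separator so ("ab","c") differs from ("a","bc")
	}
	write(source)
	write(ruleID)
	write(resource)
	for _, k := range keys {
		write(k)
		write(dims[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	fp := Fingerprint("prometheus", "rule-1", "s3apiv2.putobject",
		map[string]string{"idc": "yzh", "api": "s3apiv2.putobject"})
	fmt.Println(len(fp), fp)
}
```

The time-window check is left to the service layer: it would look up an open Issue by fingerprint and merge the event only if the Issue was last seen within the window.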
| 99 | + |
| 100 | +API endpoints
| 101 | +1) List
| 102 | +GET /v1/issues?start=xxx&limit=10&state=Closed|Open
| 103 | +Response:
| 104 | +{ |
| 105 | + "items": [ |
| 106 | + { |
| 107 | + "id": "xxx", |
| 108 | + "state": "Closed", |
| 109 | + "level": "P0", |
| 110 | + "alertState": "Restored", |
| 111 | + "title": "yzh S3APIV2s3apiv2.putobject 0_64K上传响应时间95值:50012ms > 450ms", |
| 112 | + "labels": [{"key":"api","value":"s3apiv2.putobject"},{"key":"idc","value":"yzh"}], |
| 113 | + "alertSince": "2025-05-05T11:00:00Z", |
| 114 | + "resolved_at": "2025-05-05T12:00:00Z" |
| 115 | + } |
| 116 | + ], |
| 117 | + "next": "cursor-token" |
| 118 | +} |
| 119 | + |
| 120 | +2) Detail
| 121 | +GET /v1/issues/:issueID
| 122 | +Response:
| 123 | +{ |
| 124 | + "id": "xxx", |
| 125 | + "state": "Closed", |
| 126 | + "level": "P0", |
| 127 | + "alertState": "Restored", |
| 128 | + "title": "yzh S3APIV2s3apiv2.putobject 0_64K upload response time p95: 50012ms > 450ms",
| 129 | + "labels": [{"key":"api","value":"s3apiv2.putobject"},{"key":"idc","value":"yzh"}], |
| 130 | + "alertSince": "2025-05-05T11:00:00Z", |
| 131 | + "resolved_at": "2025-05-05T12:00:00Z", |
| 132 | + "comments": [ |
| 133 | + {"createdAt": "2024-01-03T03:00:00Z", "author": "ai", "content": "markdown content"} |
| 134 | + ] |
| 135 | +} |
| 136 | + |
| 137 | +3) Add a comment
| 138 | +POST /v1/issues/:issueID/comments
| 139 | +Request:
| 140 | +{ "author": "user@name", "content": "markdown" }
| 141 | +Response: 204
| 142 | + |
| 143 | +4) Manual close / reopen
| 144 | +POST /v1/issues/:issueID/close
| 145 | +POST /v1/issues/:issueID/reopen
| 146 | +Response: 200
| 147 | + |
| 148 | +Ingestion (Ingress)
| 149 | +- Prometheus Webhook: /v1/ingest/prometheus
| 150 | +- Elastic/Logs: custom handlers under `ingest/`
| 151 | +- Each ingestion adapter normalizes its input into the internal Event and hands it to the service layer for aggregation
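The normalization step might look like the sketch below. It assumes the webhook body follows the Alertmanager webhook JSON shape (an `alerts` array whose entries carry `status`, `labels`, `annotations`, and `startsAt`); the `Event` fields and the `Normalize` function are hypothetical, and only the payload fields used here are decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event is a hypothetical internal event; the real type lives in domain/types.go.
type Event struct {
	Source   string
	Title    string
	Labels   map[string]string
	StartsAt time.Time
	Firing   bool
}

// webhookPayload decodes only the subset of the webhook body used below.
type webhookPayload struct {
	Alerts []struct {
		Status      string            `json:"status"`
		Labels      map[string]string `json:"labels"`
		Annotations map[string]string `json:"annotations"`
		StartsAt    time.Time         `json:"startsAt"`
	} `json:"alerts"`
}

// Normalize converts a webhook body into internal events.
func Normalize(body []byte) ([]Event, error) {
	var p webhookPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, fmt.Errorf("decode webhook: %w", err)
	}
	events := make([]Event, 0, len(p.Alerts))
	for _, a := range p.Alerts {
		title := a.Annotations["summary"]
		if title == "" {
			title = a.Labels["alertname"] // fall back to the alert name
		}
		events = append(events, Event{
			Source:   "prometheus",
			Title:    title,
			Labels:   a.Labels,
			StartsAt: a.StartsAt,
			Firing:   a.Status == "firing",
		})
	}
	return events, nil
}

func main() {
	body := []byte(`{"alerts":[{"status":"firing",
		"labels":{"alertname":"HighLatency","idc":"yzh"},
		"annotations":{"summary":"p95 latency too high"},
		"startsAt":"2025-05-05T11:00:00Z"}]}`)
	events, err := Normalize(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", events)
}
```

An HTTP handler for `/v1/ingest/prometheus` would read the request body, call `Normalize`, and pass each event to the service layer.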
| 152 | + |
| 153 | +Healing
| 154 | +- `healing/` defines actions (e.g. restart, scale out, clear cache), chained together by an orchestrator
| 155 | +- Execution results are written to `alert_issue_comments` and may update `alert_state`
| 156 | + |
| 157 | +Notification (Notifier)
| 158 | +- Triggered when the state changes or the level escalates
| 159 | +- `notifier/` adapts to DingTalk/Feishu/email and supports silence windows and deduplication
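The trigger rule can be sketched as a pure function that compares an Issue before and after an update. `Snapshot` and `ShouldNotify` are hypothetical names, and silence windows and deduplication are deliberately left out of this sketch.

```go
package main

import "fmt"

// Level is ordered so that a larger value means a more severe alert.
type Level int

const (
	Warning Level = iota
	P2
	P1
	P0
)

// Snapshot captures the fields that drive notification decisions.
type Snapshot struct {
	State string // "Open" or "Closed"
	Level Level
}

// ShouldNotify reports whether an update warrants a notification:
// either the state changed or the level became more severe.
// A level downgrade alone does not trigger a notification.
func ShouldNotify(prev, cur Snapshot) bool {
	return prev.State != cur.State || cur.Level > prev.Level
}

func main() {
	prev := Snapshot{State: "Open", Level: P2}
	fmt.Println(ShouldNotify(prev, Snapshot{State: "Open", Level: P0}))
	fmt.Println(ShouldNotify(prev, Snapshot{State: "Open", Level: Warning}))
}
```

A notifier adapter would call this check first and then apply its own silence-window and deduplication logic before sending.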
| 160 | + |
| 161 | +Scheduled health checks (Scheduler)
| 162 | +- Periodically inspect SLOs and critical paths, turning anomalies into Issues that flow into the same closed loop
| 163 | + |
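| 164 | +Security and auditing
| 165 | +- APIs are authenticated at the gateway; important actions (close/ignore/heal) are recorded as comments and in the audit log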
| 166 | + |
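| 167 | +Code organization recommendations (Go)
| 168 | +- domain: domain models and port interfaces; no external dependencies
| 169 | +- usecase: depends only on domain; calls adapters through ports
| 170 | +- adapter: implements the ports; adapters are decoupled from each other and replaceable
| 171 | +- api: the HTTP layer only handles DTOs and orchestration, and depends on usecase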
| 172 | + |
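| 173 | +Testing recommendations
| 174 | +- rules: table-driven tests based on sample data
| 175 | +- service: unit tests for the state machine and the aggregation flow
| 176 | +- api: end-to-end tests at the handler layer (through a fake service)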