A comprehensive ML-powered forecasting and anomaly detection system for Kubernetes infrastructure, providing real-time predictions for CPU, Memory, Disk, I/O, and Network metrics using ensemble models (Prophet, ARIMA, LSTM).
- Dual-Layer Forecasting: Separate models for Host (full node) and Pod (Kubernetes workloads) layers
- Kubernetes Cluster Awareness: Automatic cluster identification and per-cluster model training
- Standalone Node Support: Separate handling for nodes without Kubernetes workloads
- Ensemble Models: Combines Prophet, ARIMA, and LSTM for robust predictions
- Multiple Metric Types: CPU, Memory, Disk Usage, Disk I/O, Network bandwidth
- Anomaly Detection: Isolation Forest-based classification with temporal awareness (seasonality-aware: hour, day-of-week patterns)
- Temporal-Aware Forecasting: All models account for daily/weekly patterns (batch jobs, weekend backups, etc.)
- Minimal Updates: Efficient forecast mode with incremental model updates
- Selective Retraining: Retrain specific nodes/mounts/signals without full retraining
- Comprehensive Plotting: Automatic generation of forecast and backtest plots
- Integrated Alerting: Webhook payloads plus Prometheus Pushgateway metrics for SRE workflows
- Parallel Processing: Automatic CPU detection with 80% resource utilization rule, respects Kubernetes/Docker container limits
flowchart LR
subgraph Metrics["Metrics Plane"]
P["Prometheus/VictoriaMetrics"]
end
subgraph Forecast["Forecast Engine"]
F["metrics.py"]
C[("Model Artifacts")]
M[("Manifests & Plots")]
end
subgraph Alerting["Alerting Plane"]
W["Webhook Endpoints"]
G["Prometheus Pushgateway"]
end
P -->|"Query API"| F
F -->|"Minimal Updates"| C
F -->|"Forecast + Crisis Data"| M
F -->|"JSON Alerts"| W
F -->|"metrics_ai_* gauges"| G
subgraph Ops["Ops Consumers"]
S["SRE Dashboards/Alertmanager"]
O["Ops SME/Technical Architects"]
end
G --> S
W --> O
- Metrics Plane – Prometheus or VictoriaMetrics hosts raw time-series data queried during each run.
- Forecast Engine – `metrics.py` performs ensemble forecasts, anomaly detection, and manifest/plot management using cached artifacts.
- Alerting Plane – Webhook payloads provide rich JSON for chat/incident tools, while Pushgateway metrics feed Prometheus/Alertmanager.
- Ops Consumers – SREs, architects, and SMEs consume plots, manifests, and alerts to drive operational decisions.
- Python 3.8+
- Access to Prometheus/VictoriaMetrics endpoint
- Required Python packages (see `requirements.txt`)
# Clone the repository
git clone <repository-url>
cd metrics-ai
# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Optional: Install TensorFlow for LSTM support
pip install tensorflow-cpu

The system is highly configurable via environment variables:
- `VM_URL`: Prometheus/VictoriaMetrics query endpoint (default: `http://vm.london.local/api/v1/query_range`)
- `STEP`: Query step size (default: `60s`)
- `START_HOURS_AGO`: Historical data range in hours (default: `360`)
- `HORIZON_MIN`: Forecast horizon in minutes (default: `15`)
- `LOOKBACK_HRS`: Lookback window for anomaly detection (default: `24`)
- `CONTAMINATION`: Anomaly detection contamination rate (default: `0.12`)
- `TRAIN_FRACTION`: Train/test split ratio (default: `0.8`)
- `PLOT_HISTORY_HOURS`: Hours of historical data to include in plots (default: `168` → 7 days)
- `PLOT_FORECAST_HOURS`: Hours of forecast horizon to draw in plots (default: `168` → 7 days)
- `LSTM_SEQ_LEN`: LSTM sequence length (default: `60`)
- `LSTM_EPOCHS`: LSTM training epochs (default: `10`)
- `MODEL_FILES_DIR`: Directory for model files (default: `./model_files`)
- `FORECAST_PLOTS_DIR`: Directory for forecast plots (default: `./forecast_plots`)
- `DNS_DOMAIN_SUFFIXES`: Comma-separated DNS suffixes for hostname resolution (default: `.london.local,.local`)
- `INSTANCE_ALIAS_MAP`: JSON map for instance aliases (default: `{}`)
- `AUTO_ALIAS_ENABLED`: Enable automatic alias detection (default: `1`)
- `VERBOSE_LEVEL`: Logging verbosity (default: `0`)
- `MAX_WORKER_THREADS`: Maximum number of parallel workers (default: auto-detected as 80% of available CPUs)
  - Automatically detects the physical CPU count via `os.cpu_count()`
  - Respects Kubernetes/Docker container CPU limits via cgroups
  - Applies the 80% utilization rule (leaves 20% headroom)
  - Can be overridden via environment variable (e.g., `export MAX_WORKER_THREADS=4`) or the CLI flag `--parallel N`
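As a rough sketch of this detection logic (illustrative only; the function name is hypothetical, and reading cgroup v2's `cpu.max` file is an assumed mechanism for detecting container limits, not necessarily what `metrics.py` does):

```python
import os

def detect_max_workers(env_override=None):
    """Return a worker count: 80% of available CPUs, honoring cgroup limits.

    env_override stands in for MAX_WORKER_THREADS; per the README, the 80%
    rule is applied to that value too.
    """
    cpus = os.cpu_count() or 1
    # Kubernetes/Docker expose CPU limits via cgroup v2 (cpu.max: "<quota> <period>")
    try:
        with open("/sys/fs/cgroup/cpu.max") as f:
            quota, period = f.read().split()
            if quota != "max":
                cpus = min(cpus, max(1, int(int(quota) / int(period))))
    except (OSError, ValueError):
        pass  # not in a container, or a cgroup v1 layout
    limit = env_override if env_override is not None else cpus
    return max(1, int(limit * 0.8))  # 80% rule: leave 20% headroom
```

On a 10-core host with no container limit this yields 8 workers, matching the "8 cores on a 10-core system" example later in this README.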
export VM_URL="http://prometheus.example.com/api/v1/query_range"
export MODEL_FILES_DIR="/var/lib/metrics-ai/models"
export FORECAST_PLOTS_DIR="/var/lib/metrics-ai/plots"
export HORIZON_MIN=30
export START_HOURS_AGO=720

Train all models from scratch:
python3 metrics.py --training

Output:
- Trains Host, Pod, Disk, I/O, and Network models
- Generates backtest plots and metrics
- Saves all models to `MODEL_FILES_DIR`
- Creates forecast plots in `FORECAST_PLOTS_DIR`
Generate forecasts using cached models with minimal updates:
# Single run (no loop)
python3 metrics.py --forecast --interval 0
# Continuous monitoring (runs every 15 seconds)
python3 metrics.py --forecast
# Custom interval (runs every 30 seconds)
python3 metrics.py --forecast --interval 30
# With custom plot windows
python3 metrics.py --forecast --plot-history-hours 48 --plot-forecast-hours 12
# With alerting
python3 metrics.py --forecast \
--alert-webhook https://hooks.slack.com/services/... \
--pushgateway http://pushgateway.monitoring:9091
# Continuous monitoring with alerts (recommended for production)
python3 metrics.py --forecast \
--interval 15 \
--alert-webhook https://hooks.slack.com/services/... \
--pushgateway http://pushgateway.monitoring:9091
# Skip plot generation for faster execution
python3 metrics.py --forecast --interval 15
# Generate forecast plots only
python3 metrics.py --forecast --interval 15 --plot forecast
# Generate backtest plots only
python3 metrics.py --forecast --interval 15 --plot backtest
# Generate both forecast and backtest plots
python3 metrics.py --forecast --interval 15 --plot both

Command-Line Options:
- `--forecast`: Enable forecast mode (uses cached models, minimal updates)
- `--interval <seconds>`: Run continuously with the specified interval (default: 15). Set to `0` for a single run.
- `--alert-webhook <URL>`: HTTP webhook URL for alert delivery (Slack, Teams, PagerDuty, etc.)
- `--pushgateway <URL>`: Prometheus Pushgateway URL for metrics export
- `--plot <forecast|backtest|both>`: Generate and save plot files (PNG images). Options:
  - `forecast`: Generate only forecast plots (future predictions)
  - `backtest`: Generate only backtest plots (model performance evaluation)
  - `both`: Generate both forecast and backtest plots
  - If not specified, plots are skipped to save time
- `--plot-history-hours <hours>`: Override plot history window (default: 168 = 7 days)
- `--plot-forecast-hours <hours>`: Override plot forecast window (default: 168 = 7 days)
- `--parallel <N>`: Override automatic CPU detection and use N parallel workers (overrides the 80% rule and the `MAX_WORKER_THREADS` env var, and bypasses the >10 items threshold). Example: `--parallel 4`
- `--forecast-horizon <realtime|neartime|future>`: Override `HORIZON_MIN` for forecast length:
  - `realtime`: 15 minutes (short-term decisions)
  - `neartime`: 3 hours (near-term planning)
  - `future`: 7 days (long-term capacity planning)
  - Default: uses the `HORIZON_MIN` env var (15 minutes)
- `-v, --verbose`: Increase verbosity (repeatable: `-vv`, `-vvv`)
- `-q, --quiet`: Suppress verbose output
- `--dump-csv <dir>`: Dump training datasets for each model into the specified directory (created if missing)
Output:
- Updates models with latest data (minimal updates)
- Generates forecast plots for all metrics (only when `--plot forecast` or `--plot both` is provided)
- Generates backtest plots for all models (only when `--plot backtest` or `--plot both` is provided)
- Displays predictions and anomalies in tabular format
- Saves updated models to disk
- Dispatches alerts via webhook/Pushgateway when actionable issues are detected
- Runs continuously every `--interval` seconds when `--interval > 0`
Note: Plot generation can be time-consuming. Use --plot only when you need visualizations. For production monitoring focused on alerts, omit --plot for faster execution. In training mode (--training), plots are generated by default (both forecast and backtest).
Use Case:
- Single run: Use `--interval 0` for one-time forecasts or testing
- Continuous monitoring: Use `--interval 15` (or higher) for production monitoring with real-time alerts
- Kubernetes Deployment: Deploy as a `Deployment` (not a CronJob) with `--interval 15` for continuous monitoring
View backtest performance of cached models:
python3 metrics.py --show-backtest

Output:
- Displays backtest metrics (MAE, train/test split) for all models
- Generates backtest plots
- Does not retrain models (uses cached)
Use pre-trained models without updates:
python3 metrics.py

Output:
- Uses cached models as-is
- No plots generated
- No backtest metrics shown
- Fast execution
Retrain specific disk models:
# Retrain all disk models
python3 metrics.py --disk-retrain all
# Retrain specific node
python3 metrics.py --disk-retrain host02
# Retrain specific node:mountpoint combination
python3 metrics.py --disk-retrain host02:/,worker01:/home
# Retrain multiple targets
python3 metrics.py --disk-retrain host02,worker01:/home,worker03:/

Output:
- Retrains only specified disk models
- Generates backtest plots and metrics for retrained models
- Updates manifest with new predictions
Retrain specific I/O and Network models:
# Retrain all I/O and Network models
python3 metrics.py --io-net-retrain all
# Retrain all signals for a specific node
python3 metrics.py --io-net-retrain host02
# Retrain specific signal for a node
python3 metrics.py --io-net-retrain host02:DISK_IO_WAIT
# Retrain multiple targets
python3 metrics.py --io-net-retrain host02:DISK_IO_WAIT,worker01:NET_TX_BW

Output:
- Retrains only specified I/O and Network models
- Generates backtest plots and metrics for retrained models
- Updates manifest with new models
# Training with backtest metrics
python3 metrics.py --training --show-backtest
# Forecast mode with verbose output
python3 metrics.py --forecast -v
# Forecast with custom plot windows
python3 metrics.py --forecast --plot-history-hours 72 --plot-forecast-hours 6
# Retrain specific models and show backtest
python3 metrics.py --disk-retrain host02 --io-net-retrain worker01 --show-backtest

The system supports two alert delivery mechanisms for real-time notification of detected issues:
POSTs a JSON payload to HTTP webhooks (Slack, Teams, PagerDuty, custom endpoints) whenever actionable alerts are detected.
Usage:
python3 metrics.py --forecast --alert-webhook https://hooks.slack.com/services/YOUR/WEBHOOK/URL

Webhook Payload Structure:
{
"timestamp": "2025-11-21T11:32:56.529548",
"summary_text": "Disk → 0 critical, 0 warning, 1 soon | I/O+Network Crisis → 0 | I/O+Network Anomaly → 0 | Golden Signals → 0 | Classification Anomalies → 1 | Host Pressure → 1",
"disk": {
"critical": 0,
"warning": 0,
"soon": 1,
"total": 1,
"samples": [
{
"instance": "host02 (192.168.10.82)",
"mountpoint": "/",
"current_%": 79.67,
"days_to_90pct": 31.8,
"ensemble_eta": 31.8,
"linear_eta": 9999.0,
"prophet_eta": 31.8,
"alert": "SOON"
}
]
},
"io_network_crisis": {
"total": 0,
"samples": []
},
"io_network_anomaly": {
"total": 0,
"samples": []
},
"golden_anomaly": {
"total": 0,
"samples": []
},
"classification_anomaly": {
"total": 1,
"samples": [
{
"instance": "host02 (192.168.10.82)",
"host_cpu": 0.313898,
"host_mem": 0.454784,
"pod_cpu": 0.843041,
"pod_mem": 0.974526,
"severity": "WARNING",
"signal": "anomalous_node",
"detected_at": "2025-11-21 11:32:56"
}
]
},
"host_pressure": {
"total": 1,
"samples": [
{
"instance": "pi (192.168.10.200)",
"host_cpu": 0.174918,
"host_mem": 0.840407,
"severity": "WARNING",
"signal": "host_pressure",
"detected_at": "2025-11-21 11:32:56"
}
]
}
}

Alert Categories:
- Disk Alerts: CRITICAL (<3 days to 90%), WARNING (3-7 days), SOON (7-30 days). Only non-OK alerts are included.
- I/O+Network Crisis: Predicted I/O or network saturation within 30 days
- I/O+Network Anomaly: Anomalous I/O or network patterns detected
- Golden Anomaly: Root-cause signals (iowait, inodes, network drops, OOM kills, etc.)
- Classification Anomaly: Nodes with anomalous host/pod usage patterns (high host usage but low pod usage, or vice versa)
- Host Pressure: Nodes with high host CPU/memory usage but minimal Kubernetes workload (suggests OS-level processes)
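As a rough sketch of the disk thresholds above (illustrative only; the function name and input shape are hypothetical, the 3/7/30-day boundaries come from this README, and the handling of values exactly at a boundary is a guess):

```python
def classify_disk_alert(days_to_90pct: float) -> str:
    """Map an ETA (days until the mount reaches 90% full) to an alert level."""
    if days_to_90pct < 3:
        return "CRITICAL"
    if days_to_90pct <= 7:
        return "WARNING"
    if days_to_90pct <= 30:
        return "SOON"
    return "OK"  # OK alerts are excluded from webhook payloads
```

For example, a mount with 5 days to 90% classifies as WARNING, while 45 days yields OK and is dropped from the payload.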
Note: Webhooks are only sent when there are actionable alerts (non-OK status). If all systems are healthy, no webhook is dispatched.
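On the receiving side, the payload shown above can be flattened for routing to chat or incident tools. A consumer-side sketch (the helper name and output format are illustrative, not part of metrics-ai; only the category and field names come from the payload structure above):

```python
def summarize_alerts(payload: dict) -> list:
    """Flatten non-empty alert categories from a metrics-ai webhook payload."""
    lines = [payload["summary_text"]]
    for category in ("disk", "io_network_crisis", "io_network_anomaly",
                     "golden_anomaly", "classification_anomaly", "host_pressure"):
        for sample in payload.get(category, {}).get("samples", []):
            # Disk samples carry "alert"; anomaly/pressure samples carry "severity"
            level = sample.get("alert") or sample.get("severity", "")
            lines.append(f"{category}: {sample.get('instance', '?')} {level}".strip())
    return lines
```

Applied to the sample payload above, this yields the `summary_text` plus one line each for the SOON disk, the classification anomaly, and the host-pressure warning.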
Publishes metrics to Prometheus Pushgateway for integration with Alertmanager and Grafana dashboards.
Usage:
python3 metrics.py --forecast --pushgateway http://pushgateway.monitoring:9091

Published Metrics:
- `metrics_ai_disk_alerts_critical` - Count of critical disk alerts
- `metrics_ai_disk_alerts_warning` - Count of warning disk alerts
- `metrics_ai_disk_alerts_soon` - Count of soon disk alerts
- `metrics_ai_disk_alerts_total` - Total non-OK disk alerts
- `metrics_ai_io_network_crisis_total` - I/O+Network crisis count
- `metrics_ai_io_network_anomaly_total` - I/O+Network anomaly count
- `metrics_ai_golden_anomaly_total` - Golden anomaly signal count
- `metrics_ai_classification_anomaly_total` - Classification anomaly count
- `metrics_ai_host_pressure_total` - Host pressure alert count
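Under the hood, a Pushgateway push is a plain HTTP `PUT` of text-format samples to `/metrics/job/<job>`. A minimal stdlib sketch (the `metrics_ai` job name and both helper functions are illustrative assumptions, not the actual `metrics.py` code):

```python
import urllib.request

def format_gauges(counts: dict) -> str:
    """Render gauge samples in Prometheus text exposition format."""
    return "".join(f"{name} {value}\n" for name, value in sorted(counts.items()))

def push_alert_counts(gateway_url: str, counts: dict, job: str = "metrics_ai") -> int:
    """PUT the rendered gauges to the Pushgateway's per-job endpoint."""
    req = urllib.request.Request(
        f"{gateway_url}/metrics/job/{job}",
        data=format_gauges(counts).encode(),
        method="PUT",
        headers={"Content-Type": "text/plain"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Usage would look like `push_alert_counts("http://pushgateway.monitoring:9091", {"metrics_ai_host_pressure_total": 1})`.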
Continuous Monitoring:
Both alert mechanisms honor the --interval flag (default 15 seconds). When running with --interval > 0, the system runs continuously and dispatches alerts on each cycle when actionable issues are detected. Set --interval 0 for a single run.
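Conceptually, the interval handling reduces to the following (a sketch, not the actual `metrics.py` main loop; `run_once` stands in for a full forecast-plus-alert cycle):

```python
import time

def run_forecast_loop(run_once, interval_s: int):
    """Run one forecast cycle, or loop forever when interval_s > 0."""
    if interval_s <= 0:
        return run_once()      # single run (--interval 0)
    while True:
        run_once()             # forecast + alert dispatch each cycle
        time.sleep(interval_s)
```

With `--interval 0` the cycle runs exactly once and exits, which is what makes it suitable for CI/CD smoke tests.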
Example - Continuous Monitoring with Alerts:
python3 metrics.py --forecast \
--interval 15 \
--alert-webhook https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
--pushgateway http://pushgateway.monitoring:9091

- Single-shot validation: `python3 metrics.py --forecast --interval 0` (CI/CD smoke test).
- Continuous monitoring: Kubernetes `Deployment` with `--forecast --interval 15` plus alert sinks.
- Scheduled retraining: CronJob or CI pipeline invoking `--training --show-backtest`.
- Verify manifests and plots are refreshed (`MODEL_FILES_DIR`, `FORECAST_PLOTS_DIR`).
- Confirm Pushgateway metrics (`metrics_ai_*`) are scraped (Prometheus `up` and `metrics_ai_*` queries).
- Webhook payloads should include `summary_text` and samples when alerts trigger.
- Review console tables (disk forecast, classification anomalies, host pressure).
- Inspect plots in `FORECAST_PLOTS_DIR` for impacted nodes.
- Check manifests to confirm minimal updates were saved (`disk_full_models.pkl`, `io_net_models.pkl`).
- Create/acknowledge an incident using the webhook payload context.
- Disk: `--disk-retrain host02,/home`
- I/O+Network: `--io-net-retrain host02:DISK_IO_WAIT`
- Combine with `--show-backtest` to verify MAE improvements.
- Use `--dump-csv ./exports` to capture the exact training datasets for audits or offline experimentation.
- Missing alerts: run with `-v` to inspect “DEBUG: disk_alerts…” output; confirm thresholds are still relevant.
- Data gaps: verify Prometheus query responses (HTTP 200, non-empty JSON).
- Model drift: trigger full training (`--training`) and compare MAE/RMSE.
- Webhook/Pushgateway errors: check HTTP status codes logged after “Alert webhook sent” / “Metrics pushed”.
- Document CLI flags and env overrides in change tickets.
- Attach latest plots/manifests when promoting new models.
- Ensure rollback plan (previous manifests/models) is stored with release artifacts.
- `k8s_cluster_{cluster_id}_forecast.pkl` - Per-cluster ensemble model (host + pod combined)
- `k8s_cluster_{cluster_id}_arima.pkl` - Per-cluster ARIMA model parameters
- `k8s_cluster_{cluster_id}_prophet_params.pkl` - Per-cluster Prophet hyperparameters
- `k8s_cluster_{cluster_id}_forecast.pkl.meta.json` - Cluster model metadata
- `standalone_forecast.pkl` - Standalone nodes ensemble model
- `standalone_arima.pkl` - Standalone ARIMA model parameters
- `standalone_prophet_params.pkl` - Standalone Prophet hyperparameters
- `host_forecast.pkl` - Legacy host layer ensemble model
- `pod_forecast.pkl` - Legacy pod layer ensemble model
- `host_arima.pkl` - Legacy host ARIMA model parameters
- `host_prophet_params.pkl` - Legacy host Prophet hyperparameters
- `pod_arima.pkl` - Legacy pod ARIMA model parameters
- `pod_prophet_params.pkl` - Legacy pod Prophet hyperparameters
- `lstm_model.pkl` - LSTM model (if TensorFlow is available; shared across clusters)
- `disk_full_models.pkl` - Disk models manifest
- `io_net_models.pkl` - I/O and Network models manifest
- `isolation_forest_anomaly.pkl` - Anomaly detection model (per-cluster)
- `*.meta.json` - Model metadata files

- `k8s_layer_forecast.png` - Kubernetes cluster forecast (aggregated across all clusters; default 7d historical + 7d forecast, configurable)
- `k8s_layer_backtest.png` - Kubernetes cluster backtest (generated during training/`--show-backtest`)
- `standalone_layer_forecast.png` - Standalone nodes forecast (default 7d historical + 7d forecast, configurable)
- `standalone_layer_backtest.png` - Standalone nodes backtest (generated during training/`--show-backtest`)
- Legacy: `host_layer_forecast.png`, `pod_layer_forecast.png`, `host_layer_backtest.png`, `pod_layer_backtest.png` (for backward compatibility)
- `disk_{node}_{mountpoint}_forecast.png` - Individual disk forecast plots
- `{node}_{signal}_crisis_forecast.png` - Crisis prediction forecast plots (Prophet-based threshold monitoring)
- `{node}_{signal}_crisis_backtest.png` - Crisis prediction backtest plots (model performance evaluation)
- `{signal}_{node}__{ip}__ensemble_layer_forecast.png` - Ensemble forecast plots (Prophet + ARIMA + LSTM, future predictions)
- `{signal}_{node}__{ip}__ensemble_layer_backtest.png` - Ensemble backtest plots (Prophet + ARIMA + LSTM, model performance)
- `classification_host_vs_pod.png` - Anomaly classification scatter plot
- `disk_full_prediction.csv` - Disk full prediction report
# Set configuration
export VM_URL="http://prometheus.example.com/api/v1/query_range"
export MODEL_FILES_DIR="/opt/metrics-ai/models"
export FORECAST_PLOTS_DIR="/opt/metrics-ai/plots"
# Initial training
python3 metrics.py --training
# Expected output:
# - All models trained
# - Backtest metrics displayed
# - All plots generated
# - Models saved to /opt/metrics-ai/models

# Add to crontab for every 15 seconds (using external scheduler)
# Or use systemd timer, Kubernetes CronJob, etc.
# Forecast mode (lightweight, fast)
python3 metrics.py --forecast
# Expected output:
# - Models updated with latest data
# - Forecast plots generated
# - Predictions displayed
# - Models saved with latest timestamp

# Retrain and monitor specific disk
python3 metrics.py --disk-retrain host02:/ --show-backtest
# Expected output:
# - Only host02:/ disk model retrained
# - Backtest metrics for that disk
# - Updated forecast plot

# Retrain I/O models for specific node
python3 metrics.py --io-net-retrain host02:DISK_IO_WAIT --show-backtest
# Expected output:
# - Only DISK_IO_WAIT model for host02 retrained
# - Backtest metrics displayed
# - Updated forecast plot

# View all backtest metrics without retraining
python3 metrics.py --show-backtest
# Expected output:
# - All backtest metrics for all models
# - All backtest plots generated
# - No model retraining

Each forecasting model uses a 3-model ensemble:
- Prophet: Handles seasonality and trends with temporal awareness (daily/weekly patterns)
- ARIMA: Captures autoregressive patterns and short-term dependencies
- LSTM: Deep learning for complex patterns (optional, requires TensorFlow)
Final prediction: Simple average of all three models (equal weighting)
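A minimal sketch of the equal-weight combination (assuming each model yields a per-timestep list of predictions; the actual `metrics.py` internals may differ, and LSTM is simply omitted when TensorFlow is unavailable):

```python
from statistics import fmean

def ensemble_predict(prophet_pred, arima_pred, lstm_pred=None):
    """Element-wise mean across whichever model outputs are available."""
    preds = [p for p in (prophet_pred, arima_pred, lstm_pred) if p is not None]
    return [fmean(values) for values in zip(*preds)]
```

For example, `ensemble_predict([1, 2], [3, 4])` averages only Prophet and ARIMA when no LSTM output is supplied.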
All ensemble models generate comprehensive backtest metrics:
- MAE (Mean Absolute Error): Average absolute prediction error
- MAPE (Mean Absolute Percentage Error): Relative error as a percentage
- Expected Error Rate (%): Same as MAPE, provides intuitive accuracy measure
- Confidence Level (1-10): Model reliability score where 1 is highest confidence
- 1-2: Excellent
- 3-4: Good
- 5-6: Moderate
- 7-8: Low
- 9-10: Very Low
Note: For metrics with very small values (e.g., DISK_IO_WAIT ratios), MAPE can be misleadingly high. The system displays a note when MAPE > 50% and MAE < 0.01, indicating the high MAPE is due to small actual values rather than poor model performance.
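This caveat is easy to reproduce with the standard metric definitions (illustrative computation only; the sample values are made up to mimic tiny I/O-wait ratios):

```python
def mae(actual, predicted):
    """Mean Absolute Error: average absolute prediction error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean Absolute Percentage Error; undefined when an actual value is 0."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Tiny ratios: the absolute error is negligible (MAE ~ 0.001),
# yet MAPE reports ~100% because the actual values themselves are tiny.
actual = [0.001, 0.001, 0.001]
predicted = [0.002, 0.002, 0.002]
```

Here MAE stays well under the 0.01 threshold mentioned above while MAPE exceeds 50%, which is exactly the situation the system flags with a note.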
In --forecast mode, models receive minimal updates (required, not optional):
- Prophet: Loads saved hyperparameters, fits on the last 7 days of data with optimized settings (`n_changepoints=10` for 30-40% faster fitting), saves updated hyperparameters
- ARIMA: Uses the saved model order, fits on the latest data, saves the updated model
- LSTM: Loads the pre-trained model, fine-tunes for 2 epochs on the last 2 days, saves the updated model

This preserves learned patterns while incorporating recent trends. All model files (including `host_arima.pkl`, `pod_arima.pkl`, `host_prophet_params.pkl`, `pod_prophet_params.pkl`) are updated with the latest timestamp after minimal updates.
Performance: Minimal updates are optimized for speed, especially beneficial when processing 100+ nodes. I/O and Network models include additional optimizations: pre-computed retrain matching, progress reporting, and conditional plot generation.
- Defaults: 7 days of history + 7 days of forecast on every chart (host/pod, disk, I/O, network).
- Override globally with env vars (`PLOT_HISTORY_HOURS`, `PLOT_FORECAST_HOURS`).
- Override per run via CLI flags (`--plot-history-hours`, `--plot-forecast-hours`).
- All plot titles/legends automatically reflect the configured durations (e.g., “48 hours Historical + 12 hours Forecast”).
If you see warnings about missing models:
# Train models first
python3 metrics.py --training

LSTM is optional. Install TensorFlow:
pip install tensorflow-cpu

The system will work without LSTM, using only Prophet and ARIMA.
Check your VM_URL configuration:
# Test connection
curl "$VM_URL?query=up&start=$(date -d '1 hour ago' +%s)&end=$(date +%s)&step=60s"

Ensure write permissions for `FORECAST_PLOTS_DIR`:
mkdir -p forecast_plots
chmod 755 forecast_plots

- Training Mode: ~5-15 minutes (depends on data volume)
- Forecast Mode: ~10-30 seconds for typical deployments (optimized for frequent runs)
- I/O and Network: ~200-400 seconds for 100 nodes (optimized with faster Prophet settings, progress reporting, and conditional plot generation)
- Normal Mode: ~5-10 seconds (uses cached models)
The system automatically parallelizes model training and forecasting across multiple CPU cores:
- Automatic CPU Detection: Detects available CPU cores and applies 80% utilization rule
- Container-Aware: Respects Kubernetes/Docker CPU limits via cgroups detection
- Override Options:
  - Environment Variable: `export MAX_WORKER_THREADS=4` (applies the 80% rule to this value, respects thresholds)
  - CLI Flag: `--parallel 4` (direct override, highest priority, bypasses the 80% rule AND thresholds)
- Parallelization Thresholds (only apply when using automatic detection, not with `--parallel`):
  - Disk Models: Parallelizes when processing >10 disks
  - I/O Network Crisis: Parallelizes when processing >10 nodes
  - I/O Network Ensemble: Parallelizes when processing >10 nodes
  - Note: When `--parallel` is set, these thresholds are bypassed and parallel processing is used regardless of item count
- Worker Count: Uses `min(total_items, MAX_WORKER_THREADS)` to avoid over-subscription
- Performance Gains:
- Sequential: 1 core utilization
- Parallel: Up to 80% of available cores (e.g., 8 cores on 10-core system = 8 workers)
- Expected Speedup: 3-6x for large deployments (100+ nodes/disks) depending on CPU count
Example: On a 10-core system with 100 disks:
- Sequential: ~400 seconds (1 core)
- Parallel (8 workers): ~50-80 seconds (8 cores, accounting for overhead)
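The `min(total_items, MAX_WORKER_THREADS)` sizing described above maps naturally onto `concurrent.futures` (a sketch; `run_parallel` and the per-item work function are hypothetical names, not the actual `metrics.py` API):

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(items, work_fn, max_workers: int):
    """Fan per-item work out across workers, capped at the number of items."""
    workers = max(1, min(len(items), max_workers))  # avoid over-subscription
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(work_fn, items))
```

With 100 disks and `max_workers=8`, this keeps 8 workers busy; with 3 disks it spawns only 3.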
Override Examples:
# Use 4 workers regardless of CPU count (bypasses 80% rule AND thresholds)
# Will parallelize even with <10 items
python3 metrics.py --forecast --parallel 4
# Use 16 workers (useful for high-core systems where you want full utilization)
python3 metrics.py --training --parallel 16
# Force parallel processing on small deployments (6 nodes, 9 disks)
python3 metrics.py --parallel 2 # Uses 2 workers even with <10 items
# Override forecast horizon
python3 metrics.py --forecast --forecast-horizon neartime # 3-hour forecasts
python3 metrics.py --forecast --forecast-horizon future # 7-day forecasts
python3 metrics.py --forecast --forecast-horizon realtime # 15-minute forecasts (default)

- README.md - This file: Quick start and overview
- Docs/SYSTEM_DOCUMENTATION.md - Comprehensive system documentation
- Docs/CONFIGURATION_VARIABLES.md - Detailed explanation of all configuration variables
- Docs/MODEL_TRAINING_AND_PREDICTION_GUIDE.md - Step-by-step guide for all models
- Docs/VISUAL_ARCHITECTURE_GUIDE.md - Architecture diagrams, flowcharts, and comparison tables
- Docs/ANOMALY_DETECTION.md - Anomaly detection details
- Docs/SLI_SLO_GUIDE.md - SLI/SLO configuration guide
- Docs/ERROR_BUDGET_EXPLAINED.md - Error budget calculations
- Docs/ERROR_BUDGET_CALCULATION_EXAMPLE.md - Error budget calculation examples
See LICENSE for details.
Contributions are welcome! Please open an issue or submit a pull request.
For questions or issues, please open an issue in this repository.