Commit 6e065d1

feat: add backing store for disk buffering of events
1 parent 40e7a0c commit 6e065d1

5 files changed, +850 -3 lines changed

audittools/README.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
// SPDX-FileCopyrightText: 2025 SAP SE or an SAP affiliate company
// SPDX-License-Identifier: Apache-2.0

# audittools

`audittools` provides a standard interface for generating and sending CADF (Cloud Auditing Data Federation) audit events to a RabbitMQ message broker.

## Certification Requirements (PCI DSS, SOC 2, and more)

As a cloud provider subject to strict audits (including PCI DSS and more), we must ensure the **completeness** and **integrity** of audit logs while maintaining service **availability**.

### Standard Production Configuration

**You MUST configure a Backing Store with Persistent Storage (PVC).**

* **Configuration**: Set `BackingStorePath` to a mount point backed by a PVC.
* **Requirement**: This ensures that audit events are preserved even in double-failure scenarios (RabbitMQ outage + Pod crash/reschedule).
* **Compliance**: Satisfies requirements for guaranteed event delivery and audit trail completeness.

### Non-Compliant Configurations

The following configurations are available for development or specific edge cases but are **NOT** recommended for production services subject to audit:

1. **Ephemeral Storage (emptyDir)**:
    * *Risk*: Data loss if the Pod is rescheduled during a RabbitMQ outage.
    * *Status*: **Development / Testing Only**.

2. **No Backing Store**:
    * *Behavior*: The service will **block** (stop processing requests) if the RabbitMQ broker is down and the memory buffer fills up.
    * *Risk*: Service downtime (Violation of Availability targets).
    * *Status*: **Not Recommended**. Only acceptable if service downtime is preferred over *any* storage complexity.

## Usage

### Basic Setup

To use `audittools`, you typically initialize an `Auditor` with your RabbitMQ connection details. `NewAuditor` takes a `context.Context` as its first argument.

```go
import (
	"context"
	"log"

	"github.com/sapcc/go-bits/audittools"
)

func main() {
	ctx := context.Background()
	// ...
	auditor, err := audittools.NewAuditor(ctx, audittools.AuditorOpts{
		EnvPrefix: "MYSERVICE_AUDIT", // Configures env vars like MYSERVICE_AUDIT_RABBITMQ_URL
	})
	if err != nil {
		log.Fatal(err)
	}
	// ...
}
```

### Sending Events

```go
event := cadf.Event{
	// ... fill in event details ...
}
auditor.Record(event)
```

## Disk-Based Buffering

`audittools` includes buffering to ensure audit events are not lost if the RabbitMQ broker becomes unavailable. Events are temporarily written to disk and replayed once the connection is restored.

### Configuration

Buffering is enabled by providing a `BackingStorePath`.

```go
auditor, err := audittools.NewAuditor(ctx, audittools.AuditorOpts{
	EnvPrefix:        "MYSERVICE_AUDIT",
	BackingStorePath: "/var/lib/myservice/audit-buffer",
})
```

Or via environment variables, which are only read for fields left unset in `AuditorOpts`:

* `MYSERVICE_AUDIT_BACKING_STORE_PATH`: Directory to store buffered events.
* `MYSERVICE_AUDIT_BACKING_STORE_MAX_TOTAL_SIZE`: (Optional) Max total size of the buffer in bytes. A value that cannot be parsed as an integer is ignored.
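
The precedence between explicit options and environment variables can be sketched roughly as follows. This is a minimal, standalone illustration and not the library's code: `backingStoreConfig`, `resolveFromEnv`, and the injected `getenv` function are hypothetical names. As a deliberate deviation from `NewAuditor`, which skips an unparsable size, the sketch returns an error for it.

```go
package main

import (
	"fmt"
	"strconv"
)

// backingStoreConfig is a hypothetical stand-in for the two backing store
// fields of audittools.AuditorOpts.
type backingStoreConfig struct {
	Path         string
	MaxTotalSize int64
}

// resolveFromEnv fills unset fields from "<prefix>_BACKING_STORE_*" variables.
// Explicitly set fields win over the environment. getenv is injected so the
// sketch can run without touching the real process environment.
func resolveFromEnv(cfg backingStoreConfig, prefix string, getenv func(string) string) (backingStoreConfig, error) {
	if prefix == "" {
		return cfg, nil
	}
	if cfg.Path == "" {
		cfg.Path = getenv(prefix + "_BACKING_STORE_PATH")
	}
	if cfg.MaxTotalSize == 0 {
		if val := getenv(prefix + "_BACKING_STORE_MAX_TOTAL_SIZE"); val != "" {
			size, err := strconv.ParseInt(val, 10, 64)
			if err != nil {
				// Assumption of this sketch: report bad values instead of ignoring them.
				return cfg, fmt.Errorf("invalid %s_BACKING_STORE_MAX_TOTAL_SIZE %q: %w", prefix, val, err)
			}
			cfg.MaxTotalSize = size
		}
	}
	return cfg, nil
}

func main() {
	env := map[string]string{
		"MYSERVICE_AUDIT_BACKING_STORE_PATH":           "/var/lib/myservice/audit-buffer",
		"MYSERVICE_AUDIT_BACKING_STORE_MAX_TOTAL_SIZE": "1073741824", // 1 GiB
	}
	getenv := func(key string) string { return env[key] }

	cfg, err := resolveFromEnv(backingStoreConfig{}, "MYSERVICE_AUDIT", getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.Path, cfg.MaxTotalSize)
}
```

Failing fast on a malformed size is one possible design choice; it surfaces typos at startup instead of silently running without a size limit.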

### Kubernetes Deployment

If running in Kubernetes, you have two main options for the backing store:

1. **Persistent Storage (PVC) - Recommended for Audit Compliance**:
    * Mount a Persistent Volume Claim (PVC) at the `BackingStorePath`.
    * **Pros**: Data survives Pod deletion, rescheduling, and rolling updates.
    * **Cons**: Adds complexity (volume management, access modes).
    * **Use Case**: **Required** for strict audit compliance to ensure no data is lost even if the service instance fails during a broker outage.

2. **Ephemeral Storage (emptyDir)**:
    * Mount an `emptyDir` volume at the `BackingStorePath`.
    * **Pros**: Simple, fast, no persistent volume management.
    * **Cons**: Data is lost if the *Pod* is deleted or rescheduled. However, it survives container restarts within the same Pod.
    * **Use Case**: Suitable for non-critical environments or where occasional data loss during complex failure scenarios (simultaneous broker outage + pod rescheduling) is acceptable.

### Behavior

The auditor moves through the following states so that no event is dropped:

1. **Normal Operation**: Events are sent directly to RabbitMQ.
2. **RabbitMQ Outage**: Events are written to the disk backing store. The application continues without blocking.
3. **Disk Full**: If the backing store reaches `BackingStoreMaxTotalSize`, writes to it fail and events are held in the in-memory buffer instead (up to 20 events).
4. **Fail-Closed**: If the memory buffer fills up, `auditor.Record()` **blocks**. This pauses the application to prevent data loss.
5. **Recovery**: A background routine continuously drains the backing store to RabbitMQ once it becomes available. New events are persisted to disk during draining to prevent blocking.
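
The fallback order in states 1 to 4 can be sketched as a small decision chain. This is a simplified model under stated assumptions, not how `audittools` is implemented internally: the `recorder` type, its `publish` and `persist` hooks, and `wouldBlock` are all hypothetical names, and the broker and disk are simulated with flags.

```go
package main

import (
	"errors"
	"fmt"
)

// All names below are hypothetical; they only model the documented states.
type event string

var (
	errBrokerDown = errors.New("broker unavailable")
	errStoreFull  = errors.New("backing store full")
)

type recorder struct {
	publish func(event) error // state 1: send to the broker
	persist func(event) error // state 2: write to the disk backing store
	memory  chan event        // state 3: bounded in-memory buffer
}

// record walks the fallback chain and reports which stage accepted the event.
// The channel send blocks once the buffer is full (state 4, fail-closed).
func (r *recorder) record(e event) string {
	if err := r.publish(e); err == nil {
		return "broker"
	}
	if err := r.persist(e); err == nil {
		return "disk"
	}
	r.memory <- e
	return "memory"
}

// wouldBlock reports whether the next memory-buffered event would block.
func (r *recorder) wouldBlock() bool {
	return len(r.memory) == cap(r.memory)
}

func main() {
	brokerUp, diskFull := true, false
	r := &recorder{
		publish: func(event) error {
			if brokerUp {
				return nil
			}
			return errBrokerDown
		},
		persist: func(event) error {
			if diskFull {
				return errStoreFull
			}
			return nil
		},
		memory: make(chan event, 20),
	}

	fmt.Println(r.record("e1")) // broker
	brokerUp = false
	fmt.Println(r.record("e2")) // disk
	diskFull = true
	fmt.Println(r.record("e3")) // memory
	fmt.Println(r.wouldBlock()) // false: 1 of 20 slots used
}
```

Recovery (state 5) is left out of the sketch; it would correspond to a background goroutine that drains disk and memory once `publish` succeeds again.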

**Additional Details**:

* **Security**: The directory is created with `0700` permissions, and files with `0600`, ensuring only the service user can access the sensitive audit data.
* **Capacity**: If `BackingStoreMaxTotalSize` is configured, the store will reject new writes when full, preserving existing data.

### Metrics

The backing store exports the following Prometheus metrics:

* `audittools_backing_store_writes_total`: Total number of audit events written to disk.
* `audittools_backing_store_reads_total`: Total number of audit events read from disk.
* `audittools_backing_store_errors_total`: Total number of errors, labeled by operation:
    * `write_stat`: Failed to stat file during rotation check
    * `write_full`: Backing store is full (exceeds `MaxTotalSize`)
    * `write_open`: Failed to open backing store file for writing
    * `write_marshal`: Failed to marshal event to JSON
    * `write_io`: Failed to write event to disk
    * `read_open`: Failed to open backing store file for reading
    * `read_scan`: Failed to scan backing store file
    * `corrupted_event`: Encountered corrupted event during read (skipped)
    * `commit_remove`: Failed to remove file after successful processing
* `audittools_backing_store_size_bytes`: Current total size of the backing store in bytes.
* `audittools_backing_store_files_count`: Current number of files in the backing store.

audittools/auditor.go

Lines changed: 44 additions & 0 deletions
@@ -17,10 +17,12 @@ import (
 	"fmt"
 	"net"
 	"net/url"
+	"os"
 	"strconv"
 	"testing"
 
 	"github.com/prometheus/client_golang/prometheus"
+
 	"github.com/sapcc/go-api-declarations/cadf"
 
 	"github.com/sapcc/go-bits/assert"
@@ -63,6 +65,14 @@ type AuditorOpts struct {
 	// - "audittools_successful_submissions" (counter, no labels)
 	// - "audittools_failed_submissions" (counter, no labels)
 	Registry prometheus.Registerer
+
+	// Optional. If given, the Auditor will buffer events in this directory when the RabbitMQ server is unreachable.
+	// If EnvPrefix is given, this can also be set via the environment variable "${PREFIX}_BACKING_STORE_PATH".
+	BackingStorePath string
+
+	// Optional. If given, limits the total size of the backing store in bytes.
+	// If EnvPrefix is given, this can also be set via the environment variable "${PREFIX}_BACKING_STORE_MAX_TOTAL_SIZE".
+	BackingStoreMaxTotalSize int64
 }
 
 func (opts AuditorOpts) getConnectionOptions() (rabbitURL url.URL, queueName string, err error) {
@@ -99,6 +109,7 @@ func (opts AuditorOpts) getConnectionOptions() (rabbitURL url.URL, queueName str
 		User: url.UserPassword(username, pass),
 		Path: "/",
 	}
+
 	return rabbitURL, queueName, nil
 }
 
@@ -139,16 +150,49 @@ func NewAuditor(ctx context.Context, opts AuditorOpts) (Auditor, error) {
 		opts.Registry.MustRegister(failureCounter)
 	}
 
+	// read backing store options from environment if EnvPrefix is set
+	if opts.EnvPrefix != "" {
+		if opts.BackingStorePath == "" {
+			opts.BackingStorePath = os.Getenv(opts.EnvPrefix + "_BACKING_STORE_PATH")
+		}
+		if opts.BackingStoreMaxTotalSize == 0 {
+			if val := os.Getenv(opts.EnvPrefix + "_BACKING_STORE_MAX_TOTAL_SIZE"); val != "" {
+				size, err := strconv.ParseInt(val, 10, 64)
+				if err == nil {
+					opts.BackingStoreMaxTotalSize = size
+				}
+			}
+		}
+	}
+
 	// spawn event delivery goroutine
 	rabbitURL, queueName, err := opts.getConnectionOptions()
 	if err != nil {
 		return nil, err
 	}
+
+	var backingStore BackingStore
+	if opts.BackingStorePath != "" {
+		bsRegistry := opts.Registry
+		if bsRegistry == nil {
+			bsRegistry = prometheus.DefaultRegisterer
+		}
+		backingStore, err = NewFileBackingStore(FileBackingStoreOpts{
+			Directory:    opts.BackingStorePath,
+			MaxTotalSize: opts.BackingStoreMaxTotalSize,
+			Registry:     bsRegistry,
+		})
+		if err != nil {
+			return nil, fmt.Errorf("failed to initialize backing store: %w", err)
+		}
+	}
+
 	eventChan := make(chan cadf.Event, 20)
 	go auditTrail{
 		EventSink:           eventChan,
 		OnSuccessfulPublish: func() { successCounter.Inc() },
 		OnFailedPublish:     func() { failureCounter.Inc() },
+		BackingStore:        backingStore,
 	}.Commit(ctx, rabbitURL, queueName)
 
 	return &standardAuditor{
