This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit cf9774d

Add playbook entry for CortexGossipMembersMismatch.
1 parent 0b75ccd commit cf9774d

1 file changed: +25 -1 lines changed

cortex-mixin/docs/playbooks.md

Lines changed: 25 additions & 1 deletion
@@ -572,7 +572,31 @@ How to **fix**:
 
 ### CortexGossipMembersMismatch
 
-_TODO: this playbook has not been written yet._
+This alert fires when any instance does not register all other instances as members of the memberlist cluster.
+
+How it **works**:
+- This alert applies when memberlist is used for the ring backing store.
+- All Cortex instances, regardless of type, join a single memberlist cluster.
+- Each instance (=memberlist cluster member) should be able to see all others.
+- Therefore the following should be equal for every instance:
+  - The reported number of cluster members (`memberlist_client_cluster_members_count`).
+  - The total number of currently responsive instances.
+
+How to **investigate**:
+- The instance which has the incomplete view of the cluster (too few members) is specified in the alert.
+- If the count is zero:
+  - It is possible that joining the cluster has yet to succeed.
+  - The following log message indicates that the _initial_ join did not succeed: `failed to join memberlist cluster`
+  - The following log message indicates that subsequent re-join attempts are failing: `re-joining memberlist cluster failed`
+  - If the initial join failed, take action according to the reason given.
+- Verify communication with other members by checking that memberlist traffic is being sent and received by the instance, using the following metrics:
+  - `memberlist_tcp_transport_packets_received_total`
+  - `memberlist_tcp_transport_packets_sent_total`
+- If traffic is present, verify that there are no errors sending or receiving packets, using the following metrics:
+  - `memberlist_tcp_transport_packets_sent_errors_total`
+  - `memberlist_tcp_transport_packets_received_errors_total`
+  - These errors (and others) can be found by searching for messages prefixed with `TCPTransport:`.
+- Logs coming directly from memberlist are also logged by Cortex; they may indicate where to investigate further. They can be identified by the tag `caller=memberlist_logger.go:xyz`.
 
 ### EtcdAllocatingTooMuchMemory
 
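
The two checks described in the new playbook entry can be sketched in a few lines of Python. This is a minimal illustration, not part of the commit or of Cortex: the function names and the instance names are hypothetical, the samples are assumed to have already been fetched from your metrics store, and the member count is assumed to include the reporting instance itself. Only the metric name and the two log messages come from the playbook text.

```python
from typing import Dict, List

# Log fragments quoted in the playbook entry, mapped to what they indicate.
JOIN_FAILURE_MESSAGES = {
    "failed to join memberlist cluster": "initial join failed",
    "re-joining memberlist cluster failed": "re-join attempt failed",
}


def instances_with_incomplete_view(members_seen: Dict[str, int]) -> List[str]:
    """Return instances that report fewer members than there are responsive instances.

    members_seen maps an instance name (hypothetical) to its latest
    memberlist_client_cluster_members_count sample. Assumption: every
    instance that reported a sample counts as responsive, and the count
    includes the reporting instance itself.
    """
    expected = len(members_seen)
    return sorted(name for name, count in members_seen.items() if count < expected)


def classify_join_failure(log_line: str) -> str:
    """Return which join failure a log line indicates, or "" if it matches neither message."""
    for fragment, meaning in JOIN_FAILURE_MESSAGES.items():
        if fragment in log_line:
            return meaning
    return ""


if __name__ == "__main__":
    # Hypothetical samples: querier-0 sees no other members.
    samples = {"ingester-0": 3, "ingester-1": 3, "querier-0": 0}
    print(instances_with_incomplete_view(samples))
    print(classify_join_failure('msg="re-joining memberlist cluster failed"'))
```

An instance flagged by the first function with a count of zero is the case where the join-related log messages are worth checking first; a non-zero but low count points instead at the transport metrics listed in the entry.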
