26 changes: 26 additions & 0 deletions common-content/en/module/distributed-tracing/introduction/index.md
+++
title = "Distributed Tracing"
time = 120
objectives = [
"Contrast distributed tracing and metrics.",
"Explain how distributed tracing helps understand a request flow through several systems.",
]
[build]
render = "never"
list = "local"
publishResources = false
+++

In the past we have added metrics to our programs and collected and aggregated those metrics using the Prometheus monitoring tool. Metrics are a widely-used methodology for understanding the behaviour of our systems at a statistical level: what percentage of requests are being completed successfully, what is the 90th percentile latency, what is our current cache hit rate or queue length. These kinds of queries are very useful for telling us whether our systems seem to be healthy overall or not, and, in some cases, may provide useful insights into problems or inefficiencies.

However, one thing that metrics are not normally very good for is understanding how user experience may vary between different types of requests, why particular requests are outliers in terms of latency, or how a single user request flows through backend services - a complex web service may involve dozens of backend services or datastores. It may be possible to answer some of these questions using log analysis, but there is a better solution, designed just for this problem: distributed tracing.

Distributed tracing has two key concepts: traces and spans. A trace represents a whole request or transaction. Traces are uniquely identified by trace IDs. Traces are made up of a set of spans, each tagged with the trace ID of the trace it belongs to. Each span is a unit of work: a remote procedure call or web request to a specific service, a method execution, or perhaps the time that a message spends in a queue. Spans can have child spans. There are specific tools that are designed to collect and store distributed traces, and to perform useful queries against them.

One of the key aspects of distributed tracing is that when services call other services the trace ID is propagated to those calls (in HTTP-based systems this is done using a special HTTP [traceparent header](https://uptrace.dev/opentelemetry/opentelemetry-traceparent.html)) so that the overall trace may be assembled. This is necessary because each service in a complex chain of calls independently posts its spans to the distributed trace collector. The collector uses the trace ID to assemble the spans together, like a jigsaw puzzle, so that we can see a holistic view of an
entire operation.
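
For example, in HTTP the trace context travels in the `traceparent` header, which packs a version, the trace ID, the calling (parent) span's ID, and trace flags into one dash-separated value. The IDs below are the illustrative values used in the W3C Trace Context specification:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```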

[OpenTelemetry](https://opentelemetry.io/) (also known as OTel) is the main industry standard for distributed tracing. It governs the format of traces and spans, and how traces and spans are collected. It is worth spending some time exploring the [OTel documentation](https://opentelemetry.io/docs/), particularly the Concepts section. The [awesome-opentelemetry repo](https://github.com/magsther/awesome-opentelemetry) is another very comprehensive set of
resources.

Distributed tracing is a useful technique for understanding how complex systems are operating.
+++
title = "Using Honeycomb"
time = 180
objectives = [
"Publish trace spans to Honeycomb.",
"View an assembled trace in Honeycomb.",
"Identify outliers in Honeycomb.",
]
[build]
render = "never"
list = "local"
publishResources = false
+++

[Honeycomb](https://www.honeycomb.io/) is a {{<tooltip title="SaaS">}}Software as a Service is software that someone else runs and that we can rely on.{{</tooltip>}} distributed tracing provider.

Honeycomb provides API endpoints where we can upload trace spans. Honeycomb assembles spans which belong to the same trace. We can then view, query, and inspect those entire traces, seeing how our request flowed through a system.

We will experiment with Honeycomb locally with a single program running on one computer, to practice uploading and interpreting spans.

Sign up to Honeycomb for free.

{{<note type="Exercise">}}
Write a small standalone command line application which:
1. Picks a random number of iterations between 2 and 10 (we'll call it `n`).
2. `n` times, creates a span, sleeps for a random amount of time between 10ms and 5s, then uploads the span.
3. Between each span, sleeps for a random amount of time between 100ms and 5s.

Each time you run your program, it should use a unique trace ID, but within one program execution, all spans should have the same trace ID.

There are standard libraries for creating and sending OTel spans, such as [in Go](https://docs.honeycomb.io/send-data/go/opentelemetry-sdk/) and [in Java](https://docs.honeycomb.io/send-data/java/opentelemetry-agent/).
{{</note>}}
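
A minimal sketch of such a program in Go, assuming the OpenTelemetry Go SDK with the OTLP/gRPC exporter. The environment variable name `HONEYCOMB_API_KEY` and the tracer name `span-practice` are our own choices, and Honeycomb's send-data docs cover extras such as setting a service name/dataset:

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"os"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"google.golang.org/grpc/credentials"
)

func main() {
	ctx := context.Background()

	// Export spans to Honeycomb over OTLP/gRPC. The API key is read from an
	// environment variable; the variable name is our own choice.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("api.honeycomb.io:443"),
		otlptracegrpc.WithTLSCredentials(credentials.NewClientTLSFromCert(nil, "")),
		otlptracegrpc.WithHeaders(map[string]string{
			"x-honeycomb-team": os.Getenv("HONEYCOMB_API_KEY"),
		}),
	)
	if err != nil {
		panic(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer tp.Shutdown(ctx) // flush any buffered spans before the program exits
	otel.SetTracerProvider(tp)
	tracer := otel.Tracer("span-practice")

	// One root span per program run: every child span started from its context
	// shares the same, freshly generated trace ID.
	ctx, root := tracer.Start(ctx, "run")
	defer root.End()

	n := 2 + rand.Intn(9) // 2..10 iterations
	for i := 0; i < n; i++ {
		_, span := tracer.Start(ctx, fmt.Sprintf("work-%d", i))
		time.Sleep(time.Duration(10+rand.Intn(4991)) * time.Millisecond) // 10ms..5s inside the span
		span.End()
		time.Sleep(time.Duration(100+rand.Intn(4901)) * time.Millisecond) // 100ms..5s between spans
	}
}
```

Running this a few times should create several distinct traces in Honeycomb, each containing one root span with `n` children.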

{{<note type="Exercise">}}
Run your program 10 times, making sure it uploads its spans to Honeycomb with your API key.

Explore the Honeycomb UI. Try to work out:
1. What was the biggest `n` generated by one of your program runs?
2. Which was the fastest run? What was `n` for that run?
3. What was the longest individual sleep performed in your program during a span?
4. What was the longest individual sleep _between_ spans in your program?
{{</note>}}
27 changes: 27 additions & 0 deletions common-content/en/module/event-driven/kafka-in-a-nutshell/index.md
+++
title = "Kafka in a Nutshell"
time = 120
objectives = [
"List the components of the Kafka architecture.",
"Explain the purpose of a producer, consumer, and broker.",
"Defined a record.",
"Define a topic.",
"Define a partition.",
"Explain the relationship (and differences) between topics and partitions.",
"Explain how Kafka knows when a consumer has successfully handled a record.",
"Contrast at-most-once and at-least-once delivery.",
"Explain why exactly-once delivery is very hard to achieve.",
]
[build]
render = "never"
list = "local"
publishResources = false
+++

Kafka is a commonly-used open-source distributed queue.

{{<note type="Reading">}}
Read [Apache Kafka in a Nutshell](https://medium.com/swlh/apache-kafka-in-a-nutshell-5782b01d9ffb).

Make sure you have achieved all of the learning objectives for this prep.
{{</note>}}
23 changes: 23 additions & 0 deletions common-content/en/module/event-driven/kafka-paper/index.md
+++
title = "Kafka Paper"
time = 120
objectives = [
"Describe how Kakfa stores data internally.",
"Calculate how many partitions are needed to serve a given number of consumers on a topic.",
"Contrast push-based and pull-based queueing systems.",
"Describe what delivery ordering constraints are and aren't guaranteed by Kafka.",
"Explain limitations of Kafka compared to systems with acknowledgements or two-phase commits.",
]
[build]
render = "never"
list = "local"
publishResources = false
+++

Kafka is a commonly-used open-source distributed queue.

{{<note type="Reading">}}
Read about the core Kafka concepts in the [Kafka: a Distributed Messaging System for Log Processing paper](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf).

Make sure you have achieved all of the learning objectives for this prep.
{{</note>}}
51 changes: 51 additions & 0 deletions common-content/en/module/event-driven/project/alerting/index.md
+++
title = "Alerting"
time = 180
objectives = [
"Identify and graph metrics to indicate specific problems.",
"Create an alert triggered by a specific problem.",
]
[build]
render = "never"
list = "local"
publishResources = false
+++

Write an [AlertManager configuration](https://prometheus.io/docs/alerting/latest/alertmanager/) and set up at least one alert.

For instance:

- We could alert on the age of jobs when they are dequeued - if this rises too high (more than a few seconds) then users' jobs aren't being executed in a timely fashion. We should use a percentile for this calculation.
- Note that this may require us to add extra code/data to our system in order to produce these metrics.
- We could also alert on failure to queue jobs, and failure to read from the queue.
- We expect to see fetch requests against all of our topics. If we don't, it may mean that our consumers are not running, or are otherwise broken. We could set up alerts on the `kafka_server_BrokerTopicMetrics_Count{name="TotalFetchRequestsPerSec"}` metric to check this.
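
As a sketch of how the first and last examples above might look as Prometheus alerting rules (alerting rules live in a Prometheus rule file; Alertmanager then routes the firing alerts, e.g. to email). The `cron_job_queue_age_seconds_bucket` histogram is a hypothetical metric that our own code would need to export; the fetch-request metric is the one named above:

```yaml
groups:
  - name: cron-system
    rules:
      - alert: JobsDequeuedTooSlowly
        # Hypothetical histogram exported by our consumers: the age of each job
        # at the moment it is dequeued. Fires if the 90th percentile exceeds 5s.
        expr: histogram_quantile(0.9, sum(rate(cron_job_queue_age_seconds_bucket[5m])) by (le)) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: p90 job queue age has exceeded 5 seconds

      - alert: NoFetchRequests
        # No consumer has fetched from a topic recently, suggesting consumers
        # are down or otherwise broken.
        expr: rate(kafka_server_BrokerTopicMetrics_Count{name="TotalFetchRequestsPerSec"}[10m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No fetch requests seen against a Kafka topic for 10 minutes
```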

For critical alerts in a production environment we would usually use PagerDuty or a similar tool, but for our purposes the easiest way to configure an alert is to use email.
This article describes how to send [Alertmanager email using GMail](https://www.robustperception.io/sending-email-with-the-alertmanager-via-gmail/) as an email server.

> [!WARNING]
>
> If you do this, be careful not to check your `GMAIL_AUTH_TOKEN` into GitHub - we should never check ANY token into source control. Instead, we can check in a template file and use a tool such as [heredoc](https://tldp.org/LDP/abs/html/here-docs.html) to substitute the value of an environment variable (our token) into the final generated Alertmanager configuration (and include this step in a build script/Makefile).
>
> It is also advisable to use a throwaway GMail account for this purpose, for additional security - just in case.

We can also build a Grafana dashboard to display our Prometheus metrics. The [Grafana Fundamentals](https://grafana.com/tutorials/grafana-fundamentals/) tutorial will walk you through how to do this (although we will need to use our own application and not their sample application).

{{<note type="Exercise">}}
Simulate several potential problems that could happen when running this system. Make sure you can identify the problems on your Grafana dashboard.

Examples of problems to simulate and identify:
* Kafka crashes and doesn't start again
* One kind of job always fails
* There are too many jobs and we can't get to them all in a timely manner.
* One kind of job fails whenever it runs in a particular topic (but succeeds in the other topics)
* One kind of job takes a really long time to run and means other jobs don't start in a timely manner.
* No consumers are pulling jobs out of a particular topic.
* A producer is producing jobs which the consumers can't understand (e.g. they have missing JSON fields).

Prepare a demo where you can show how your Grafana dashboard can help you diagnose and debug these problems.
{{</note>}}

{{<note type="Exercise">}}
Create at least one alert to notify you by email of a problem with your cron system.
{{</note>}}
43 changes: 43 additions & 0 deletions common-content/en/module/event-driven/project/cron/index.md
+++
title = "Cron"
time = 300
objectives = [
"Describe the purpose of cron.",
"Write a crontab to run a job every minute, or at fixed times.",
"Write a program to parse files containing crontabs and schedule jobs.",
]
[build]
render = "never"
list = "local"
publishResources = false
+++

We are going to implement a distributed version of the `cron` job scheduler (read about [cron](https://en.wikipedia.org/wiki/Cron) if you are not familiar with it). Cron jobs are defined by two attributes: the command to be executed, and the schedule of times at which the job should run. The schedule is defined according to the `crontab` format.

Most languages have parsers for the crontab format - you do not need to write one yourself (though it can be an interesting challenge!). Some examples:
* For Go, the most widely used is [robfig/cron](https://github.com/robfig/cron).
* For Java, [Quartz](https://www.quartz-scheduler.org/documentation/quartz-2.4.x/) has a Cron parser/scheduler, see [this quick start guide](https://betterstack.com/community/questions/how-to-run-cron-jobs-in-java/) for how to use it.\
Note that Quartz parsing requires a leading seconds specifier, which is non-standard. You can convert a regular cron expression to a Quartz-compatible one by adding the prefix `"0 "`.
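
As a rough sketch of the Go option, assuming `robfig/cron` v3 (the cron expression is just an example):

```go
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

func main() {
	// Parse a standard five-field crontab expression ("at quarter past every hour").
	sched, err := cron.ParseStandard("15 * * * *")
	if err != nil {
		panic(err)
	}

	// Ask the schedule when it next fires, relative to now.
	next := sched.Next(time.Now())
	fmt.Println("next run at:", next)

	// Sleeping until `next` (and then re-computing) is one simple way to drive
	// a scheduler loop without pulling in the full cron.New() runner.
	time.Sleep(time.Until(next))
	fmt.Println("running job 1")
}
```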

The `cron` tool common to Unix operating systems runs jobs on a schedule. Cron only works on a single host. We want to create a version of cron that can schedule jobs across multiple workers, running on different hosts.

### Writing a cron scheduler without Kafka

The first step won't involve Kafka at all, or running custom jobs. These will come later.

{{<note type="Exercise">}}
Write code which parses a file containing a list of crontabs, one per line, and prints "Running job [line number]" for each line, according to that line's schedule.

e.g. if passed the file:
```
* * * * *
15 * * * *
```

Your program should print "Running job 0" every minute, and "Running job 1" once an hour at quarter past the hour.
{{</note>}}

{{<note type="Exercise">}}
Create a Docker image for your cron scheduler program. Make sure you can run the image.
{{</note>}}
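
One possible shape for that image, assuming a Go program and a crontab file named `crontab.txt` (the file names and paths here are our own choices), is a multi-stage Dockerfile:

```dockerfile
# Build stage: compile the scheduler into a static binary.
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /cron-scheduler .

# Run stage: copy only the binary and the crontab file into a minimal image.
FROM alpine:3.19
COPY --from=build /cron-scheduler /usr/local/bin/cron-scheduler
COPY crontab.txt /etc/cron-scheduler/crontab.txt
ENTRYPOINT ["/usr/local/bin/cron-scheduler", "/etc/cron-scheduler/crontab.txt"]
```
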
+++
title = "Distributed Tracing in Kafka"
time = 300
objectives = [
"Instrument a producer and consumer with OTel.",
"Interpret a trace in Honeycomb across producer and consumer.",
"Identify outliers in Honeycomb.",
]
[build]
render = "never"
list = "local"
publishResources = false
+++

We know that metrics can help give us aggregate information about all actions, and distributed tracing can help us better understand the flow of particular requests through systems.

A single cron job invocation is like a user request. It originates in one system (the producer), then flows through Kafka, and may run on one consumer (if it succeeds the first time) or on more than one consumer (if it fails and needs to be retried).

We can use distributed tracing to trace individual cron job invocations.
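
One way to do this, sketched below in Go using the OpenTelemetry propagation API, is to inject the W3C trace context into the job message on the producer side and extract it again in the consumer. The `Job` struct and its field names are our own invention, and the sketch assumes a tracer provider exporting to Honeycomb has already been configured (as in the earlier Honeycomb exercise):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// Job is a hypothetical job message format; the Trace field carries the W3C
// trace context (traceparent etc.) alongside the job data.
type Job struct {
	Command string            `json:"command"`
	Trace   map[string]string `json:"trace"`
}

func main() {
	prop := propagation.TraceContext{}
	tracer := otel.Tracer("cron") // assumes otel.SetTracerProvider was already called

	// Producer side: start a span for the invocation and inject its context
	// into the message before writing it to Kafka.
	ctx, span := tracer.Start(context.Background(), "schedule-job")
	job := Job{Command: "echo hello", Trace: map[string]string{}}
	prop.Inject(ctx, propagation.MapCarrier(job.Trace))
	payload, _ := json.Marshal(job)
	span.End()

	// Consumer side: decode the message, extract the trace context, and start
	// a child span so both ends appear in the same Honeycomb trace.
	var received Job
	_ = json.Unmarshal(payload, &received)
	consumerCtx := prop.Extract(context.Background(), propagation.MapCarrier(received.Trace))
	_, runSpan := tracer.Start(consumerCtx, "run-job")
	fmt.Println("running:", received.Command)
	runSpan.End()
}
```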

{{<note type="Exercise">}}
Add span publishing to your producer and consumers.

For the spans to be assembled into the same trace, all of the services will need to use the same trace ID. You may need to modify your job data format to enable this.

Run your system, publishing to Honeycomb, and inspect the traces in Honeycomb. Identify:
1. How long jobs spend waiting in Kafka between the producer and consumer. What was the longest time a job waited there? What was the shortest time?
2. What was the largest number of retries any job took?
3. How many jobs always failed all retries?
4. Which jobs fail the most or the least?
{{</note>}}
+++
title = "Distributing with Kafka"
time = 90
objectives = [
"Run a Kafka queue using `docker-compose`.",
"Produce messages into a Kafka queue from a custom written producer.",
"Consume messages from a Kafka queue in a custom written consumer.",
"Consume different messages from a Kafka queue from multiple consumers on the same topic.",
"Run a Kafka pipeline with producers and consumers in `docker-compose`.",
]
[build]
render = "never"
list = "local"
publishResources = false
+++

Having built a local cron scheduler, we can now expand this to create a functional distributed cron system. We will build two separate programs:

- A Kafka producer that reads configuration files for jobs and queues tasks for execution
- A Kafka consumer that dequeues jobs from a queue and runs them

In _this_ step, we will just make dummy producers and consumers that send messages on the correct schedule and log that they were received. In the next step we will make them actually run command lines.

Kafka itself is a queue that lets you communicate single messages in a structured and asynchronous way between producers and consumers. Therefore, all the scheduling logic for managing recurring jobs must be part of your producer (although it is recommended to reuse a suitable library to assist with parsing crontabs and scheduling). Every time a job is due to be run, your producer creates a new message and writes it to Kafka, for a consumer to dequeue and run.

We'll need to be able to run Kafka. The easiest way is to use `docker-compose`. The [conduktor/kafka-stack-docker-compose](https://github.com/conduktor/kafka-stack-docker-compose) project provides several starter configurations for running Kafka. The config for `zk-single-kafka-single.yml` will work for development purposes.

There are existing Kafka clients for many languages, such as:
* A [Golang Kafka client](https://docs.confluent.io/kafka-clients/go/current/overview.html#go-example-code).
* A [Java Kafka client](https://docs.confluent.io/kafka-clients/java/current/overview.html).

We may want to run other Docker containers later, so it is worth making our own copy of that docker-compose configuration that we can add to.

Our producer program needs to be able to do the following:

- Read and parse a file with cron job definitions (we'll set up our own for this project, don't reuse the system cron config file because we will want to modify the format later) - you should already have written this code.
- Write a message to Kafka specifying the command to run, the intended start time of the job, and any other information that we think is necessary. It probably makes sense to encode this information as JSON.
- We will also need to [create a Kafka topic](https://kafka.apache.org/documentation/#quickstart_createtopic). In a production environment we would probably use separate tooling to manage topics (perhaps Terraform), but for this project, we can create our Kafka topic using code like these examples in [Go](https://github.com/confluentinc/examples/blob/7.3.0-post/clients/cloud/go/producer.go#L39) or [Java](https://github.com/confluentinc/examples/blob/6d4c49b20662cb4c8b4a668622cb2e9442b59a20/clients/cloud/java/src/main/java/io/confluent/examples/clients/cloud/ProducerExample.java#L39).
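
A minimal sketch of the producing side, assuming the Confluent Go client (v2 module path) and `github.com/google/uuid` for keys; the topic name `cron-jobs` and the message fields are our own choices, and topic creation is left out:

```go
package main

import (
	"encoding/json"
	"time"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
	"github.com/google/uuid"
)

// JobMessage is a hypothetical message format; the field names are our own choice.
type JobMessage struct {
	Command   string    `json:"command"`
	StartTime time.Time `json:"start_time"`
}

func main() {
	producer, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092", // should match the docker-compose Kafka listener
	})
	if err != nil {
		panic(err)
	}
	defer producer.Close()

	topic := "cron-jobs" // hypothetical topic name
	payload, _ := json.Marshal(JobMessage{Command: "echo hello", StartTime: time.Now()})

	// A random UUID key spreads messages roughly evenly across partitions.
	err = producer.Produce(&kafka.Message{
		TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
		Key:            []byte(uuid.NewString()),
		Value:          payload,
	}, nil)
	if err != nil {
		panic(err)
	}
	producer.Flush(5000) // wait up to 5s for delivery before exiting
}
```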

Our consumer program needs to be able to do the following things:

- Read job information from a Kafka queue (decoding JSON)
- Execute the commands to run the jobs (assume each is a simple one-line command that you can `exec`) - for now we will just log the job number (like we were doing in our local version), but in a future step we will make it run commands.
- Because the producer writes jobs to the queue only when they are ready to run, your consumer does not need to do any scheduling or to parse the crontab format
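
A matching consumer sketch, again assuming the Confluent Go client; the group ID `cron-workers` is our own choice, and consumers that share it will split the topic's partitions between them:

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

type JobMessage struct {
	Command   string `json:"command"`
	StartTime string `json:"start_time"`
}

func main() {
	consumer, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092",
		"group.id":          "cron-workers", // consumers in one group share the topic's partitions
		"auto.offset.reset": "earliest",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	if err := consumer.SubscribeTopics([]string{"cron-jobs"}, nil); err != nil {
		log.Fatal(err)
	}

	for {
		msg, err := consumer.ReadMessage(-1) // block until a message arrives
		if err != nil {
			log.Printf("read error: %v", err)
			continue
		}
		var job JobMessage
		if err := json.Unmarshal(msg.Value, &job); err != nil {
			log.Printf("bad message: %v", err)
			continue
		}
		log.Printf("received job %q from partition %d", job.Command, msg.TopicPartition.Partition)
	}
}
```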

We want to run two consumers - therefore, when we create our topic, we should create two partitions of the topic. We will also need to specify a key for each Kafka message that we produce - Kafka assigns messages to partitions based on a hash of the message key. We can generate UUIDs to use as keys.

We can build Docker containers for our producer and consumer and add these to our docker-compose configuration. We should create a Makefile or script to make this repeatable.

Test our implementation and observe both of our consumers running jobs scheduled by our producer. What happens if we only create one partition in our topic? What happens if we create three?