Skip to content

Conversation

@Drowze
Copy link
Contributor

@Drowze Drowze commented Sep 1, 2025

What does this PR do?

On the Karafka integration, when distributed_tracing is on, the original trace was lost after iterating through the messages. This PR aims to fix that and also add span links linking the message traces with the parent consumer trace (so it's easier to find each other using the Datadog APM UI).

Fixes #4873

Motivation:
Fix and improve the Datadog Karafka support.

Change log entry

Additional Notes:

How to test the change?
See #4873 for a full testing snippet.

@Drowze Drowze requested review from a team as code owners September 1, 2025 13:33
@github-actions github-actions bot added integrations Involves tracing integrations tracing labels Sep 1, 2025
@Drowze Drowze changed the title Fix missing original karafka trace after iterating Fix losing original karafka trace after iterating through the messages with distributed_tracing on Sep 1, 2025
@Drowze
Copy link
Contributor Author

Drowze commented Sep 15, 2025

Hey @marcotc @ivoanjo can you have a quick look here please? 🙇

Would like to get this merged to not conflict with other Karafka/Waterdrop pull requests (e.g.: #4874 - and also I want to open a new PR soon about turning off distributed tracing on a per-topic basis, see here)

@ivoanjo
Copy link
Member

ivoanjo commented Sep 16, 2025

Err thanks for the patience, I've asked the team to take a look at this asap :)

@marcotc
Copy link
Member

marcotc commented Sep 17, 2025

@Drowze I took a first pass and I need to allocate proper time to understand how we best want to present this kind of workflow (span links, directly linking, something else).

I have scheduled time to take a good look at this next week.

@p-datadog
Copy link
Member

The repair to #4876 LGTM. Span links do not appear to be used in dd-trace-rb except in one place where an open telemetry trace is converted (mapped?) to a datadog trace. @Drowze how do you feel about moving the span link code into a separate PR and just leaving the fix in this PR?

@p-datadog
Copy link
Member

Regarding span links: I read https://docs.datadoghq.com/tracing/trace_collection/span_links/ to try to understand their usage. The documentation page says the span links are for relationships other than parent/child. In this PR they are used for parent/child relationship? What is the actual gain/benefit?

@p-datadog
Copy link
Member

Looking at this PR again I don't understand the fix. The problem is the parent trace gets reset somehow, therefore I would expect the fix to repair the reset. How is adding links fixing the reset?

@Drowze Drowze force-pushed the fix-lost-karafka-consumer-traces branch from 54c704d to 4f1847d Compare November 28, 2025 15:04
@Drowze
Copy link
Contributor Author

Drowze commented Dec 1, 2025

Coming back to this now - apologies for the delay 🙇

do you feel about moving the span link code into a separate PR and just leaving the fix in this PR?

Sure - sounds good to me. I understand that span links are not widely used at the moment and using them here just made the PR a bit confusing to review, so I'll move that part into a separate pull request - and we can continue discussing span links there.

@p-datadog I just rebased the PR with latest master and removed the code referring to span links - can you have another look please? 🙇

Copy link
Member

@p-datadog p-datadog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Drowze for splitting the PR.

I think this PR look good but I am not an expert in the tracing side, I'll try to find another reviewer.


# assert that the message span is not continuation of the producer span
expect(span.parent_id).to eq(consumer_span.id)
expect(span.trace_id).to eq(consumer_trace.id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the other test, is consumer (the middle span, if I am understanding the logic correctly) expected to be a child of producer? Is that asserted somewhere?

Copy link
Contributor Author

@Drowze Drowze Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these tests, the consumer span is never expected to be a child of any span. Referring to each test individually:

  1. when the message has tracing headers, when distributed tracing is enabled, it continues the span that produced the message

    • consumer is a root span and it doesn't have any child spans
    • producer is a root span and it has one child span from iterating the messages(*)
  2. when the message has tracing headers, when distributed tracing is not enabled, it does not continue the span that produced the message

    • consumer is a root span and it has one child span from iterating the messages(*)
    • producer is a root span and it doesn't have any child spans

Hope that makes sense! 🙇


(*) we create spans from iterating the messages as the karafka integration patches ::Karafka::Messages::Messages#each

Copy link
Contributor Author

@Drowze Drowze Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(*) we create spans from iterating the messages as the karafka integration patches ::Karafka::Messages::Messages#each

I think that this might not be clear enough from our tests... The messages are being iterated here:

# this creates a new span
expect(messages).to all(be_a(::Karafka::Messages::Message))

# ...

# in these expectations, `span` refers to the span created from iterating the messages
expect(span.parent_id).to eq(consumer_span.id)
expect(span.trace_id).to eq(consumer_trace.id)

# because of these `let`s (from the beginning of the file)
let(:span) do
  spans.find { |s| s.name == span_name }
end
let(:span_name) { Datadog::Tracing::Contrib::Karafka::Ext::SPAN_MESSAGE_CONSUME }

I'll update the test now to use an explicit each and add a comment about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just updated the test with comments better expliciting the test intentions - please let me know if it's more clear now 🙇


note: I didn't add an explicit each because this same pattern of relying on expect(...).to all(...) to create a span is used across pretty much all the tests in that file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think I understand what you are saying, but it's strange to me that the iterating span changed from being under the producer to being under the consumer but there is no relationship between the producer and the consumer. Might be a question outside the scope of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterating the message does not really change any spans - it's continuing a trace that changes the parent/child relationship of spans.
In other words:

  • when distributed_tracing=true, the message span is a continuation of the span that created it (if any) (i.e.: the message span turns into a child of the span who produced the message)
  • when distributed_tracing=false, the message span is just a regular span (which may have a parent - which in the case of theses tests is the consumer span who's iterating the messages)

Distributed tracing gets confusing quickly 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just fyi: I found this issue describing a similar use-case, so I've included our own use-case there as well (along with an example & screenshots)

@p-datadog p-datadog requested a review from ericfirth December 2, 2025 18:26
@Drowze Drowze force-pushed the fix-lost-karafka-consumer-traces branch from f8a9e45 to c32d6a6 Compare December 2, 2025 21:38
@Drowze Drowze requested a review from p-datadog December 2, 2025 21:38
@Drowze
Copy link
Contributor Author

Drowze commented Dec 9, 2025

Hey Datadog team 👋 @vpellan @p-datadog @marcotc @ivoanjo
Is there anything I can do to help moving this PR forward?

As of now we're having to use our own dd-trace-rb fork due to some issues/missing features around the Karafka/WaterDrop integration (our fork is the latest dd-trace-rb release (v2.22.0), plus this PR, #5120 and #5107).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

integrations Involves tracing integrations tracing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: Karafka's "worker.process" trace is lost after iterating the messages (when distributed tracing is enabled)

5 participants