Improve SIP packet detection using heuristic parsing #2024

sorooshm78 · 2025-11-17T16:11:39Z

This PR improves SIP packet detection in PcapPlusPlus by introducing heuristic parsing based on Wireshark’s SIP dissector. SIP messages are now detected from the UDP payload itself, not only when using port 5060.

Add static bool dissectSipHeuristic(const uint8_t* data, size_t dataLen) to detect SIP requests/responses from payload content (Wireshark-style logic)
Use the new heuristic in UdpLayer so SIP packets on non-standard ports are correctly classified
Preserve existing behavior for non-SIP payloads

Related issue: #2022

Note: I’m not yet fully familiar with PcapPlusPlus’ internal structure, so if there are better places, names, or patterns for this logic, I’m happy to adjust the PR based on your feedback.

Packet++/header/SipLayer.h

seladb · 2025-11-23T03:41:42Z

Packet++/header/SipLayer.h

+		/// @param[in] data Pointer to the raw data buffer
+		/// @param[in] dataLen Length of the data buffer in bytes
+		/// @return True if the first line matches SIP request/response syntax, false otherwise
+		static bool dissectSipHeuristic(const uint8_t* data, size_t dataLen)


We already have SipRequestFirstLine and SipResponseFirstLine, which parse the first line. Maybe we could use them instead of adding more first-line parsing logic?

I agree that we already have SipRequestFirstLine and SipResponseFirstLine for parsing the first line, but I think keeping the heuristic logic separate is still necessary because the goals are different.

The Sip*FirstLine classes assume we already decided that the payload is SIP, operate on a Sip*Layer instance, update internal state (m_IsComplete, offsets, logging, etc.), and are meant for full parsing.

In contrast, dissectSipHeuristic() is a stateless, side-effect-free check that runs directly on raw data to answer a simpler question: “does this buffer look like a SIP message at all?”. This also matches Wireshark’s design, where heuristic detection is separate from the actual SIP dissector that parses the first line and fields.

This separation is particularly important for TCP: when we inspect data per segment, the first line may be incomplete. In that case the heuristic must be able to say “need more data / undecided” without constructing SIP layers or marking anything as invalid, which is a different lifecycle than the existing first-line parsers.

In this pull request I’m not yet handling TCP segmentation or IP fragmentation — the heuristic currently assumes it sees at least one complete first line. I plan to address proper TCP stream reassembly / IP fragmentation handling in a separate follow-up PR.
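To make the "need more data / undecided" idea concrete, here is a minimal sketch of a stateless check with a three-way result. The names `SipHeuristicVerdict` and `checkSipFirstLine` are hypothetical and are not part of this PR or of the PcapPlusPlus API; the check itself (look for `SIP/` in the first line) is deliberately simplistic.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical tri-state result; not part of the PcapPlusPlus API.
enum class SipHeuristicVerdict
{
	NotSip,        // a complete first line was seen and it has no SIP version token
	NeedMoreData,  // no CRLF seen yet, so the first line may be incomplete
	LooksLikeSip   // the first line contains the "SIP/" version token
};

// Stateless check: reads the buffer only, allocates nothing, builds no layer.
inline SipHeuristicVerdict checkSipFirstLine(const uint8_t* data, size_t dataLen)
{
	if (data == nullptr || dataLen == 0)
		return SipHeuristicVerdict::NeedMoreData;

	// Locate the end of the first line (CRLF).
	size_t lineLen = 0;
	bool foundCrlf = false;
	for (size_t i = 0; i + 1 < dataLen; ++i)
	{
		if (data[i] == '\r' && data[i + 1] == '\n')
		{
			lineLen = i;
			foundCrlf = true;
			break;
		}
	}
	if (!foundCrlf)
		return SipHeuristicVerdict::NeedMoreData;

	// Both request lines and response lines carry "SIP/" somewhere in the line.
	static const char token[] = "SIP/";
	const size_t tokenLen = sizeof(token) - 1;
	for (size_t i = 0; i + tokenLen <= lineLen; ++i)
	{
		if (std::memcmp(data + i, token, tokenLen) == 0)
			return SipHeuristicVerdict::LooksLikeSip;
	}
	return SipHeuristicVerdict::NotSip;
}
```

A real version would probably cap the number of bytes it scans before giving up, so that binary payloads without any CRLF are rejected instead of staying undecided.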

I agree that we already have SipRequestFirstLine and SipResponseFirstLine for parsing the first line, but I think keeping the heuristic logic separate is still necessary because the goals are different.

The Sip*FirstLine classes assume we already decided that the payload is SIP, operate on a Sip*Layer instance, update internal state (m_IsComplete, offsets, logging, etc.), and are meant for full parsing.

In contrast, dissectSipHeuristic() is a stateless, side-effect-free check that runs directly on raw data to answer a simpler question: “does this buffer look like a SIP message at all?”. This also matches Wireshark’s design, where heuristic detection is separate from the actual SIP dissector that parses the first line and fields.

I just noticed Sip*FirstLine classes do accept a request/response pointer in their constructor, so they can't be used directly. However, they do contain static methods such as parseStatusCode(), parseVersion(), parseMethod() that can definitely be used. If we see we still have a lot of common code between these classes and the parsing logic you need we can think what's the best way to refactor them so they can be used in both scenarios.

In this pull request I’m not yet handling TCP segmentation or IP fragmentation — the heuristic currently assumes it sees at least one complete first line. I plan to address proper TCP stream reassembly / IP fragmentation handling in a separate follow-up PR.

Handling TCP segmentation or IP fragmentation is trickier: PcapPlusPlus parses packets one by one, and there is currently no built-in way to run TcpReassembly or IPReassembly and then parse the reassembled message again as a packet.

Thanks for the suggestion!

I refactored SipLayer::dissectSipHeuristic() to use the static parsing helpers from SipResponseFirstLine and SipRequestFirstLine instead of manually tokenizing the first line.

For responses I'm now using parseVersion() and parseStatusCode(), and for requests I'm using parseMethod(), parseVersion() and parseUri(). This removes the duplicated parsing logic and keeps the heuristic in sync with the actual SIP first-line parsers.

I didn't use the Sip*FirstLine constructors themselves, as they still require a request/response pointer as you mentioned.

You’re absolutely right that handling TCP segmentation and IP fragmentation is more complex. Since PcapPlusPlus currently processes packets one by one, in this PR I focused only on heuristic detection on the first line of a single packet. My plan is to add built-in IP fragmentation support to PcapPlusPlus itself in a separate PR, and I’m really excited to work on that.

codecov · 2025-11-23T03:53:35Z

Codecov Report

❌ Patch coverage is 83.65385% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.88%. Comparing base (24cc309) to head (6c6acda).
⚠️ Report is 2 commits behind head on dev.

Files with missing lines	Patch %	Lines
Packet++/header/SipLayer.h	83.16%	16 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##              dev    #2024   +/-   ##
=======================================
  Coverage   83.87%   83.88%           
=======================================
  Files         307      307           
  Lines       53952    54046   +94     
  Branches    11352    11335   -17     
=======================================
+ Hits        45254    45334   +80     
- Misses       7483     7494   +11     
- Partials     1215     1218    +3

Flag	Coverage Δ
alpine320	`75.91% <87.03%> (+0.01%)`	⬆️
fedora42	`75.47% <86.04%> (+0.01%)`	⬆️
macos-14	`81.56% <74.41%> (-0.02%)`	⬇️
macos-15	`81.56% <74.41%> (-0.02%)`	⬇️
mingw32	`69.97% <64.86%> (-0.03%)`	⬇️
mingw64	`69.99% <72.09%> (+0.12%)`	⬆️
npcap	`?`
rhel94	`75.49% <85.45%> (+0.01%)`	⬆️
ubuntu2004	`59.50% <75.47%> (+0.02%)`	⬆️
ubuntu2004-zstd	`59.61% <75.47%> (+0.04%)`	⬆️
ubuntu2204	`75.40% <85.45%> (+0.01%)`	⬆️
ubuntu2204-icpx	`57.91% <72.09%> (+0.06%)`	⬆️
ubuntu2404	`75.53% <87.03%> (+0.04%)`	⬆️
ubuntu2404-arm64	`75.56% <87.03%> (+0.02%)`	⬆️
unittest	`83.88% <83.65%> (+<0.01%)`	⬆️
windows-2022	`85.40% <81.11%> (+0.15%)`	⬆️
windows-2025	`85.42% <81.11%> (+0.09%)`	⬆️
winpcap	`85.42% <81.11%> (-0.10%)`	⬇️

…sers in heuristic SIP detection

seladb · 2025-12-03T07:16:57Z

@sorooshm78 I just saw we already have most of the logic that you implemented: https://github.com/seladb/PcapPlusPlus/blob/master/Packet%2B%2B/src/UdpLayer.cpp#L113-L120

The 2 things you implemented on top of what we have are parsing the request version and the URI, but are they actually required? 🤔

sorooshm78 · 2025-12-03T15:33:54Z

@seladb
I had a PCAP file in which SIP packets with source and destination ports other than 5060 were not detected, so they were never parsed. The cause is this condition in the code: SipLayer::isSipPort(portDst) || SipLayer::isSipPort(portSrc).

After reviewing the code in Wireshark, I looked into this further. The logic in Wireshark essentially takes the first line and splits it into three tokens, following the request and response formats defined in the RFC. Specifically, the RFC defines SIP requests and responses in the following way:

SIP Request Line: Method SP Request-URI SP SIP-Version CRLF
SIP Response Line: SIP-Version SP Status-Code SP Reason-Phrase CRLF

However, based on your comment, I noticed that part of the logic in your parsing already includes this check, so I decided to build on that.

For SIP responses, everything was straightforward and clear because of the fixed length of the version and the three-digit status code, which made detection easy. I could use the SipResponseFirstLine::parseStatusCode and SipResponseFirstLine::parseVersion methods because of their static nature.

However, for SipRequestFirstLine, it was different. In this class, only the SipRequestFirstLine::parseMethod method is static, while methods like SipRequestFirstLine::parseVersion() and SipRequestFirstLine::getUri() are not static. To use these methods, an instance of the SipRequestFirstLine class would need to be created, which, as per the logic, isn't correct because we are still in the phase of identifying the SIP protocol.

Therefore, I added two static methods to the SipRequestFirstLine class:

SipRequestFirstLine::parseUri(const char* data, size_t dataLen)
SipRequestFirstLine::parseVersion(const char* data, size_t dataLen)

This was necessary to ensure that the parsing of SIP requests was handled correctly without violating the existing design logic.

Please let me know if you think there's a better way to approach this or if there are any concerns with my changes.
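For illustration, the three-token check described above could look roughly like the sketch below. It is a standalone approximation, not the code in this PR and not Wireshark's dissector; `SipLineKind` and `classifySipFirstLine` are hypothetical names, and the sketch uses `std::string` for brevity where the real code works on raw buffers.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper types; not the PcapPlusPlus or Wireshark implementation.
enum class SipLineKind { None, Request, Response };

inline SipLineKind classifySipFirstLine(const std::string& payload)
{
	// The first line ends at the first CRLF.
	const size_t eol = payload.find("\r\n");
	if (eol == std::string::npos)
		return SipLineKind::None;
	const std::string line = payload.substr(0, eol);

	// Split at the first two spaces only; the third token keeps any further
	// spaces, because a Reason-Phrase such as "Not Found" may contain them.
	const size_t sp1 = line.find(' ');
	if (sp1 == std::string::npos || sp1 == 0)
		return SipLineKind::None;
	const size_t sp2 = line.find(' ', sp1 + 1);
	if (sp2 == std::string::npos || sp2 == sp1 + 1)
		return SipLineKind::None;

	const std::string first = line.substr(0, sp1);
	const std::string second = line.substr(sp1 + 1, sp2 - sp1 - 1);
	const std::string third = line.substr(sp2 + 1);

	if (first.compare(0, 4, "SIP/") == 0)
	{
		// Response: SIP-Version SP Status-Code SP Reason-Phrase
		const bool threeDigits = second.size() == 3 &&
		                         std::isdigit(static_cast<unsigned char>(second[0])) &&
		                         std::isdigit(static_cast<unsigned char>(second[1])) &&
		                         std::isdigit(static_cast<unsigned char>(second[2]));
		return threeDigits ? SipLineKind::Response : SipLineKind::None;
	}

	// Request: Method SP Request-URI SP SIP-Version
	if (third.compare(0, 4, "SIP/") == 0)
		return SipLineKind::Request;

	return SipLineKind::None;
}
```

This sketch does not validate the method name or the URI; it only shows how the position of the SIP-Version token separates requests from responses.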

seladb · 2025-12-04T07:10:29Z

@sorooshm78 I think I understand it better now: you'd like to be able to parse SIP messages that are not on port 5060.

Since the parsing logic is part of the fast path and can potentially run on every captured packet, it should be as efficient as possible, which is why we need to be careful when parsing strings. That's the reason we check the port first and then try to parse the first line.

To be honest, I don't see a good way to run this parsing logic on every packet in an efficient manner that would not hurt the performance. Do you know if there are standard SIP ports other than 5060? If yes, maybe we can add them to the list and be able to parse a larger variety of SIP packets... 🤔

Dimi1010 · 2025-12-04T13:19:31Z

Since the parsing logic is part of the fast path and can potentially run on every captured packet, it should be as efficient as possible, which is why we need to be careful when parsing strings. That's the reason we check the port first and then try to parse the first line.

To be honest, I don't see a good way to run this parsing logic on every packet in an efficient manner that would not hurt the performance. Do you know if there are standard SIP ports other than 5060? If yes, maybe we can add them to the list and be able to parse a larger variety of SIP packets... 🤔

@seladb A solution I have considered plausible is a 2 pass system. Basically the ports work only as a hint to which protocol type to try to parse as first. This would allow the library to match packets from "known" protocol ports fast, while also allowing parsing capability of packets coming from "unknown" ports. Below is a flow chart on how it can work.

flowchart TD
  start[Inbound Packet]-->portCheck{Check Ports}
  portCheck -->|Port Matches| dissectHintCheck{Run dissector for hinted port}
  dissectHintCheck --> |No match| runAllDissectors
  portCheck -->|No match| runAllDissectors{Run all dissectors}
  dissectHintCheck -->|Data matches| packetParsed[Parsed Packet]
  runAllDissectors -->|Data matches| packetParsed
  runAllDissectors--> |No match| unknownPacket[Unknown Packet]

Basically if the inbound packet is HTTP and comes on port 80, it will be fast tracked to the HTTP parser by the port 80 -> HTTP hint. If it came on port 41518, it would fail all port hints and go to the second stage where all the possible protocols are tested for signature match, eventually getting to the HTTP parser and matching.

Such architecture should keep performance on the fast path, while allowing protocols on unknown ports to be parsed. The only trade-off is that completely unparsable packets need to go through more checks until they are discarded.
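As a rough, self-contained sketch of that flow (not PcapPlusPlus code): every name below (`ProtocolEntry`, `classify`, `demoTable`, and the toy port and payload checks) is hypothetical, and the payload checks are intentionally trivial stand-ins for real dissectors.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical registry entry: a port hint plus a payload signature check.
struct ProtocolEntry
{
	std::string name;
	bool (*matchesPort)(uint16_t port);
	bool (*matchesPayload)(const uint8_t* data, size_t len);
};

inline bool startsWith(const uint8_t* data, size_t len, const char* prefix)
{
	const size_t n = std::strlen(prefix);
	return len >= n && std::memcmp(data, prefix, n) == 0;
}

// Toy stand-ins for real port tables and dissectors.
inline bool isSipPort(uint16_t port) { return port == 5060; }
inline bool isHttpPort(uint16_t port) { return port == 80; }
inline bool looksLikeSip(const uint8_t* d, size_t n) { return startsWith(d, n, "SIP/") || startsWith(d, n, "INVITE "); }
inline bool looksLikeHttp(const uint8_t* d, size_t n) { return startsWith(d, n, "HTTP/") || startsWith(d, n, "GET "); }

inline std::vector<ProtocolEntry> demoTable()
{
	return { { "SIP", isSipPort, looksLikeSip }, { "HTTP", isHttpPort, looksLikeHttp } };
}

inline std::string classify(const std::vector<ProtocolEntry>& table, uint16_t srcPort, uint16_t dstPort,
                            const uint8_t* data, size_t len)
{
	// Pass 1: only run the dissectors hinted by the ports (fast path).
	for (const auto& entry : table)
	{
		if ((entry.matchesPort(srcPort) || entry.matchesPort(dstPort)) && entry.matchesPayload(data, len))
			return entry.name;
	}

	// Pass 2: the port hints failed, so try every dissector on the payload.
	for (const auto& entry : table)
	{
		if (entry.matchesPayload(data, len))
			return entry.name;
	}

	return "unknown";
}
```

With this table, an HTTP request on port 80 is matched in the first pass, the same request on port 41518 is matched in the second pass, and a payload that matches no signature pays for both passes before it is reported as unknown.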

sorooshm78 · 2025-12-04T22:14:28Z

@seladb
The SIP port can be anything, and its default is 5060. I understand your performance concerns, and you’re right that we shouldn’t call the heuristic function for every packet. However, relying only on the port to determine the packet type is not sufficient. Because of that, two approaches come to my mind:

We can run the heuristic function only after all port-based detections have been checked. If we can’t identify the packet based on its port, then we call the heuristic functions. This way we still get fast detection when the port is enough, and if we can’t determine the packet type from the port, instead of marking it as “unknown” we run the heuristic functions to identify it. This is exactly what Wireshark does: it first tries fast port-based classification, and if that fails, it calls each protocol’s heuristic dissector to detect the packet more accurately.
The second approach that comes to mind, which can be used alongside port-based detection for fast classification, is this: since the SIP request and response formats are well defined, we can read the first 3 characters of the payload. If it’s a response, those 3 characters will always be SIP. If it’s a request, the first token is the SIP method, and the set of SIP methods is known. So we can check just the first 3 characters of that token. For example, if the request is INVITE sip:uri SIP/2.0, by reading the first 3 characters INV we can verify that they match the beginning of a valid SIP method.

SIP Request Line: Method SP Request-URI SP SIP-Version CRLF
SIP Response Line: SIP-Version SP Status-Code SP Reason-Phrase CRLF
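A minimal sketch of the 3-character check from the second approach, assuming the usual set of SIP methods (INVITE, ACK, BYE, CANCEL, OPTIONS, REGISTER, PRACK, SUBSCRIBE, NOTIFY, PUBLISH, INFO, REFER, MESSAGE, UPDATE). The function name `hasSipPrefix` is hypothetical and not part of PcapPlusPlus.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical pre-filter: true if the payload starts with "SIP" (response)
// or with the first 3 characters of a known SIP method (request).
inline bool hasSipPrefix(const uint8_t* data, size_t dataLen)
{
	if (data == nullptr || dataLen < 3)
		return false;

	static const char* const prefixes[] = {
		"SIP",                                     // response: SIP-Version comes first
		"INV", "ACK", "BYE", "CAN", "OPT", "REG",  // INVITE, ACK, BYE, CANCEL, OPTIONS, REGISTER
		"PRA", "SUB", "NOT", "PUB",                // PRACK, SUBSCRIBE, NOTIFY, PUBLISH
		"INF", "REF", "MES", "UPD"                 // INFO, REFER, MESSAGE, UPDATE
	};

	for (const char* prefix : prefixes)
	{
		if (std::memcmp(data, prefix, 3) == 0)
			return true;
	}
	return false;
}
```

A prefix match alone is only a cheap pre-filter: unrelated text can also start with `NOT` or `REF`, so a match would still need to be followed by the fuller first-line checks.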

seladb · 2025-12-05T08:06:15Z

We can run the heuristic function only after all port-based detections have been checked. If we can’t identify the packet based on its port, then we call the heuristic functions. This way we still get fast detection when the port is enough, and if we can’t determine the packet type from the port, instead of marking it as “unknown” we run the heuristic functions to identify it. This is exactly what Wireshark does: it first tries fast port-based classification, and if that fails, it calls each protocol’s heuristic dissector to detect the packet more accurately.

@sorooshm78 I think this aligns with @Dimi1010 's suggestion above ☝️ The main issue I see with this approach is that all of the non-classified protocols (meaning protocols PcapPlusPlus doesn't yet parse) will fall into this bucket and will be checked for SIP matching (and in the future, maybe more protocols).

Maybe we can combine this with option (2): we can add 2 static SipLayer::parse() methods with signatures like this:

// Parse with checking the port
static SipLayer* parseSipLayer(uint8_t* data, size_t dataLen, Layer* prevLayer, Packet* packet, uint16_t srcPort, uint16_t dstPort);

// Parse without checking the port
static SipLayer* parseSipLayer(uint8_t* data, size_t dataLen, Layer* prevLayer, Packet* packet);

The first method will do roughly what the current parsing logic does: check the ports first, then check if it's a request or a response. It will be called instead of the current SIP parsing.

The second method will do what option (2) suggests: check the first 3 characters: if it looks like a SIP response - continue checking the version, status code, etc. If it looks like a request - continue checking the SIP method. This method will be called last - after all the port checks.

Please let me know what you think.

sorooshm78 · 2025-12-05T16:48:46Z

The approach you suggested works very well for SIP, and I agree it’s a good fit there. However, what I’m proposing (and I believe this is also what @Dimi1010 intends) is a more general change to how PcapPlusPlus detects protocols.

Right now detection relies heavily on ports, but changing default ports is not rare at all – it’s actually quite common in real deployments. This is not specific to SIP; any supported protocol might run on a non-default port. In such cases, PcapPlusPlus will simply fail to identify the packet if we only look at ports.

I understand that PcapPlusPlus doesn’t support every protocol, but even for the protocols it does support, port-only detection means we don’t get full coverage. As soon as someone changes the default port, those packets are treated as “unknown” by the library. So the trade-off is:

very fast detection but low accuracy (fail whenever the port is non-default), vs
slightly slower detection but the ability to correctly recognize all protocols that PcapPlusPlus supports, even on non-default ports.

What I’m suggesting is:

First try to detect the protocol based on the port (fast path).
If that fails, fall back to calling the heuristic functions of the protocols.

Any new protocol added in the future can follow the same pattern. Instead of removing heuristic functions, we can focus on keeping them as lightweight and fast as possible.

For example, in my initial SIP heuristic I was taking the first line, splitting it into three tokens, and then parsing each one, which is more expensive. A faster approach is to just look at the first 3 characters of the payload:

For a response, the first 3 characters will be SIP.
For a request, the first token is the SIP method, and we can check only the first 3 characters (e.g. INV, REG, BYE, etc.) against the known SIP methods.

This keeps the heuristic very cheap, but still doesn’t rely on the port.

I’d really like to hear what @Dimi1010 and @seladb think.

Dimi1010 · 2025-12-05T17:42:34Z

I was indeed talking in the general case. Not just for SIP, but as a general solution to the current limitation of port based detection, which is brittle.

Yes, it will mean that the packets that don't fit the fast-path "known ports" heuristic will be slower to parse, as they have to run through all dissectors, but I deem that acceptable for the improved functionality. Otherwise, they won't be parsed at all.

As for the heuristic dissector functions, I think it's fine if they have some more expensive checks as long as they are later in the order. IMO, the first checks in a detector should be cheap ones that can reject as many true negatives as possible.
In a truly random stream of packets, a negative is statistically more likely for a given protocol heuristic function, so keeping the rejection time low is just as important as the pass case.

I'm not super familiar with SIP, but the proposed solution seems fine? If the first characters are not "SIP," that can reject the majority of the true negatives without further inspection?

sorooshm78 · 2025-12-06T12:02:54Z

Another idea that came to my mind is to make the use of heuristics configurable. That way, users who care more about processing speed can disable them and keep the current behavior, while users who prefer higher accuracy and fully parsed results can enable them.
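A possible shape for such a switch, purely as a sketch: `HeuristicOptions`, `shouldTrySipParser`, and the default value are all assumptions, and PcapPlusPlus does not expose this setting today.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical user-facing option; off by default to keep today's behavior.
struct HeuristicOptions
{
	bool enableSipHeuristic = false;
};

// Toy payload check standing in for a real SIP heuristic.
inline bool payloadStartsWithSip(const uint8_t* data, size_t len)
{
	return data != nullptr && len >= 4 && std::memcmp(data, "SIP/", 4) == 0;
}

// Decides whether the payload should be handed to the SIP parser at all.
inline bool shouldTrySipParser(const HeuristicOptions& options, uint16_t srcPort, uint16_t dstPort,
                               const uint8_t* data, size_t len)
{
	// Port hint: same trigger as today (the existing first-line checks still follow).
	if (srcPort == 5060 || dstPort == 5060)
		return true;

	// Non-standard port: only consult the payload if the user opted in.
	return options.enableSipHeuristic && payloadStartsWithSip(data, len);
}
```

Users who care most about throughput would leave the flag off and get the current port-only behavior, while users who want better coverage would turn it on.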

seladb · 2025-12-07T07:04:56Z

@sorooshm78 @Dimi1010 my suggestion is very similar to what both of you suggest:

// Parse with checking the port
static SipLayer* parseSipLayer(uint8_t* data, size_t dataLen, Layer* prevLayer, Packet* packet, uint16_t srcPort, uint16_t dstPort);

// Parse without checking the port
static SipLayer* parseSipLayer(uint8_t* data, size_t dataLen, Layer* prevLayer, Packet* packet);

The first overload is the "parse SIP by port" which will be called in the same place it's called today.

The second overload is the one that does "heuristic parsing" and will be called at the end, after the port search is exhausted.

I agree we can think of a more generic approach which will have heuristics to parse more protocols, but for now maybe we can start with SIP only and see how it goes.

For example, in my initial SIP heuristic I was taking the first line, splitting it into three tokens, and then parsing each one, which is more expensive. A faster approach is to just look at the first 3 characters of the payload:
* For a response, the first 3 characters will be `SIP`.
* For a request, the first token is the SIP method, and we can check only the first 3 characters (e.g. `INV`, `REG`, `BYE`, etc.) against the known SIP methods.

This keeps the heuristic very cheap, but still doesn’t rely on the port.

@sorooshm78 I agree that could be a good heuristic that can minimize the performance impact.

sorooshm78 · 2025-12-07T10:23:49Z

@seladb
Sounds good, I’ll update the PR with these changes soon.

Dimi1010 · 2025-12-07T17:47:19Z

The first overload is the "parse SIP by port" which will be called in the same place it's called today.

The second overload is the one that does "heuristic parsing" and will be called at the end, after the port search is exhausted.

@seladb Just to clarify, do you mean that the first one does only the port check or port check + heuristic for parse test?

Because I think only the heuristic parse check overload is needed (without the port params). The dispatch by port hint should be handled inside the UdpLayer::parseNextLayer, as that isn't really a conclusive heuristic to determine SIP Packet.

Note: IMO, since the SipLayer constructors accept pure data buffers during parse, the actual heuristic function should be implemented as a standalone function, instead of inside parseSipLayer. This way the actual heuristic checks are decoupled from the layer construction and can allow some more flexibility like potentially utilizing Layer::constructNextLayer<T> inside UdpLayer::parseNextLayer, even if we still keep the parse function.

This is a quick draft on how I can imagine the 2 stage parser to work.

enum class SipPacketDissectResult
{
  Rejected = 0,
  Request = 1,
  Response = 2,
};

SipPacketDissectResult dissectSipHeuristic(const uint8_t* data, size_t dataLen)
{
  // Do heuristic checks here.
  // Return SipPacketDissectResult::Rejected on reject.
  // Return SipPacketDissectResult::Request or SipPacketDissectResult::Response if it matches a type.
}

void UdpLayer::parseNextLayer()
{
  // Guards against insufficient data length here.

  uint16_t portSrc, portDst;
  // Unpack ports
  bool skipPortCheck = false;
  for (int i = 0; i < 2; i++)
  {
    /* Other if (skipPortCheck || TLayer::isTPort(portSrc) || TLayer::isTPort(portDst)) */

    // SIP check
    if (skipPortCheck || SipLayer::isSipPort(portSrc) || SipLayer::isSipPort(portDst))
    {
      auto dissectResult = SipLayer::dissectSipHeuristic(udpData, udpDataLen);
      // Possibly add initial "if(dissectResult == SipPacketDissectResult::Rejected) {} else"
      // to short-circuit the other ifs but probably won't be necessary. Benchmark first.
      if (dissectResult == SipPacketDissectResult::Request)
      {
        constructNextLayer<SipRequestLayer>(udpData, udpDataLen, m_Packet);
        break; // We constructed the next layer. Exit the for-loop.
      }
      else if (dissectResult == SipPacketDissectResult::Response)
      {
        constructNextLayer<SipResponseLayer>(udpData, udpDataLen, m_Packet);
        break; // We constructed the next layer. Exit the for-loop.
      }
    }

    /* Other if (skipPortCheck || TLayer::isTPort(portSrc) || TLayer::isTPort(portDst)) */
    
    // ----- After all if parse checks ------
    // After the first iteration is done. Set skipPortCheck to true to ignore the ports on the second run.
    skipPortCheck = true;
  }

  // Final check if no layer could be identified.
  if(!hasNextLayer())
  {
    // Construct a generic payload layer.
    constructNextLayer<PayloadLayer>(udpData, udpDataLen, m_Packet);
  }
}

bool SipLayer::isSipPort(uint16_t port) { /*... */}

// Maybe not needed?
SipLayer* SipLayer::parseSipLayer(uint8_t* data, size_t dataLen, Layer* prevLayer, Packet* packet)
{
  // Does heuristic checks here.
  // Return `nullptr` on reject.
  // Return SipLayer object on success.
}

PS: This architecture would also allow multiple protocol types to share a "port hint" and still work, although the more overlap there is the less benefits there are to actually hinting.

seladb · 2025-12-08T04:11:43Z

@Dimi1010 I think that parsing a SIP layer if a port is known is quite different vs. any packet: if you know it's a SIP port you can expect a request or response and parse them accordingly. For example: if the dst port is 5060 it can't be a SIP request and vice versa. However, for any packet, you should first decide if it's a request or a response (probably by examining the first 3 characters), and then continue parsing.

That's why I suggested 2 parsing methods vs. just one.

Note: IMO, since the SipLayer constructors accept pure data buffers during parse, the actual heuristic function should be implemented as a standalone function, instead of inside parseSipLayer. This way the actual heuristic checks are decoupled from the layer construction and can allow some more flexibility like potentially utilizing Layer::constructNextLayer<T> inside UdpLayer::parseNextLayer, even if we still keep the parse function.

Instead of UdpLayer or TcpLayer looking at the response and creating a layer, my suggestion was to encapsulate this logic inside the parse() method, which can decide which message it is and create the layer. We use this pattern in other places in the codebase as well so it's not a new behavior.

This is a quick draft on how I can imagine the 2 stage parser to work.

I think we should find a better approach than the 2-iteration for-loop, because packets that don't match any layer will take twice as long to parse, as they will go through the whole set of checks again...

Dimi1010 · 2025-12-08T06:35:58Z

I think that parsing a SIP layer if a port is known is quite different vs. any packet: if you know it's a SIP port you can expect a request or response and parse them accordingly. For example: if the dst port is 5060 it can't be a SIP request and vice versa. However, for any packet, you should first decide if it's a request or a response (probably by examining the first 3 characters), and then continue parsing.

@seladb Yes, but you don't know it's a SIP packet when you are comparing the ports. If one of the ports is 5060 it is highly likely it might be SIP, but not guaranteed. You still have to examine the packet, possibly by the first 3 chars. I could very well pass an HTTP packet on port 5060 otherwise. I could also very well do an unholy abomination that runs the SIP request / response layer in reverse. This is why I said that the main dissector should be port agnostic and always run, and ports are just a hint on which type of dissector to run first. At the end of the day conceptually you only get one incoming stream of "any packets".

Instead of UdpLayer or TcpLayer looking at the response and creating a layer, my suggestion was to encapsulate this logic inside the parse() method, which can decide which message it is and create the layer. We use this pattern in other places in the codebase as well so it's not a new behavior.

What I am worried about is that that solution essentially locks the layers to always be constructed on heap, which somewhat interferes with my experiments for the arena based allocation packet I have mentioned before, as the heap allocations are the main bottleneck I discovered in the parser library. I am fine with the parse method existing, but I would prefer for the actual dissector logic to be a separate function that can be utilized as standalone too, essentially leading to dissector standalone method + parse factory method that runs the dissector and optionally creates the layer.
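To illustrate the split I mean, here is a small sketch with stand-in types. `DemoSipLayer`, `dissectSip`, `parseSipOnHeap`, and `parseSipInPlace` are hypothetical names and not PcapPlusPlus classes; the point is only that the pure check is reusable by callers with different allocation strategies.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <new>

enum class SipKind { Rejected, Request, Response };

// Standalone dissector: a pure check with no allocation and no layer construction.
inline SipKind dissectSip(const uint8_t* data, size_t len)
{
	if (data == nullptr)
		return SipKind::Rejected;
	if (len >= 4 && std::memcmp(data, "SIP/", 4) == 0)
		return SipKind::Response;
	if (len >= 7 && std::memcmp(data, "INVITE ", 7) == 0)  // toy check: one method only
		return SipKind::Request;
	return SipKind::Rejected;
}

// Stand-in for a layer object (hypothetical, far simpler than a real layer).
struct DemoSipLayer
{
	SipKind kind;
	const uint8_t* data;
	size_t len;
};

// Factory 1: runs the dissector and allocates the layer on the heap.
inline std::unique_ptr<DemoSipLayer> parseSipOnHeap(const uint8_t* data, size_t len)
{
	const SipKind kind = dissectSip(data, len);
	if (kind == SipKind::Rejected)
		return nullptr;
	return std::unique_ptr<DemoSipLayer>(new DemoSipLayer{ kind, data, len });
}

// Factory 2: same dissector, but the caller supplies the storage (e.g. from an arena).
// `storage` must be suitably sized and aligned for DemoSipLayer.
inline DemoSipLayer* parseSipInPlace(void* storage, const uint8_t* data, size_t len)
{
	const SipKind kind = dissectSip(data, len);
	if (kind == SipKind::Rejected)
		return nullptr;
	return new (storage) DemoSipLayer{ kind, data, len };
}
```

Both factories share the same dissector, so changing where layers live does not require touching the detection logic.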

I think we should find a better approach than the 2-iteration for-loop, because packets that don't match any layer will take twice as long to parse, as they will go through the whole set of checks again...

In the first pass, the majority of the handlers are skipped by the short circuit by the ports, which is the same as we have currently. In the second pass they do run everything. Unfortunately this is unavoidable, as you need to run those because you are out of hints when the ports don't work and you have no guarantee that the packets are truly unknown yet. This is why I said that the dissectors should reject non-matching packets as fast as possible. Otherwise you have faster pipeline for unknown packets, but you also throw packets that could have been known into it.

seladb · 2025-12-08T08:28:51Z

@seladb Yes, but you don't know it's a SIP packet when you are comparing the ports. If one of the ports is 5060 it is highly likely it might be SIP, but not guaranteed. You still have to examine the packet, possibly by the first 3 chars. I could very well pass an HTTP packet on port 5060 otherwise. I could also very well do an unholy abomination that runs the SIP request / response layer in reverse. This is why I said that the main dissector should be port agnostic and always run, and ports are just a hint on which type of dissector to run first. At the end of the day conceptually you only get one incoming stream of "any packets".

You're right, that's why we check the port + the first line. However, if the dstport is 5060 and it's a SIP packet - it has to be a SIP request, and same if the srcport is 5060 - it cannot be a SIP request.

What I am worried about is that that solution essentially locks the layers to always be constructed on heap, which somewhat interferes with my experiments for the arena based allocation packet I have mentioned before, as the heap allocations are the main bottleneck I discovered in the parser library. I am fine with the parse method existing, but I would prefer for the actual dissector logic to be a separate function that can be utilized as standalone too, essentially leading to dissector standalone method + parse factory method that runs the dissector and optionally creates the layer.

Why would this block the option of arena/pool based allocation? It doesn't matter who creates the layer instance, once we have a different way to allocate packets - we can change the code to use it 🤔

In the first pass, the majority of the handlers are skipped by the short circuit by the ports, which is the same as we have currently. In the second pass they do run everything. Unfortunately this is unavoidable, as you need to run those because you are out of hints when the ports don't work and you have no guarantee that the packets are truly unknown yet. This is why I said that the dissectors should reject non-matching packets as fast as possible. Otherwise you have faster pipeline for unknown packets, but you also throw packets that could have been known into it.

I'd like the second pass to be shorter and only include protocols that have a heuristic-based dissector. The assumption is that most protocols don't have such a dissector.

Dimi1010 · 2025-12-08T10:23:15Z

However, if the dstport is 5060 and it's a SIP packet - it has to be a SIP request, and same if the srcport is 5060 - it cannot be a SIP request.

Why is that assumed? As far as I know, the SIP protocol is port agnostic. There is nothing in the spec that requires that port 5060 sends only responses and not requests. The server and client are not bound to specific ports. The "well known" port is, at the end of the day, a suggestion. Hypothetically you could have a connection where the server runs on port 6000 and port 5060 is used by the client to send requests and it would be a valid connection, even if unusual.

Why would this block the option of arena/pool based allocation? It doesn't matter who creates the layer instance, once we have a different way to allocate packets - we can change the code to use it 🤔

Sure, we can change that when we get to it. I just wanted to give an example on why I think it would be beneficial to have the pure dissector checks in a separate function that is not the factory function.

I'd like the second pass to be shorter and only include protocols that have a heuristic-based dissector. The assumption is that most protocols don't have such a dissector.

Sure, we can do that. The for-loop was drafted so that only protocols that support dissectors go inside it. The old-style protocols that don't have dissectors can go before the for-loop, since we don't need a second pass on them. Personally, I think we should make it a milestone to have a dissector for every protocol to allow port-agnostic parsing.

Essentially we would end up with something like this:

{
  // Init and port unpack.

  /* Old style protocols w/o dissector support (port only heuristic) go here. */
  
  // Return if old style protocol was matched.
  if(hasNextLayer()) { return; }

  bool skipPortCheck = false;
  // Pass 0 honors the port hints; pass 1 ignores them. Stop as soon as a layer is matched.
  for (int i = 0; i < 2 && !hasNextLayer(); i++)
  {
    /* Protocol w/ dissector support checks go here */
    /* if (skipPortCheck || TLayer::isTPort(portSrc) || TLayer::isTPort(portDst)) { ... } */

    // After the first iteration, set skipPortCheck to true so the second run ignores the ports.
    skipPortCheck = true;
  }

  // Final check if no layer could be identified.
  if(!hasNextLayer())
  {
    // Construct a generic payload layer.
    constructNextLayer<PayloadLayer>(udpData, udpDataLen, m_Packet);
  }
}
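
The outline above can be tried out in isolation with a small, self-contained mock. This is only a sketch: `HeuristicHandler`, `identify`, `demoHandlers`, and the two toy protocols "A" and "B" (ports 1000 and 2000, recognized by their first payload byte) are made up for illustration and do not reflect the actual `UdpLayer` code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical handler entry: a port hint plus a stateless payload check.
struct HeuristicHandler
{
    const char* name;
    bool (*isPort)(std::uint16_t);
    bool (*looksLike)(const std::uint8_t*, std::size_t);
};

// Toy protocols, defined only so the example can run.
inline bool isPortA(std::uint16_t port) { return port == 1000; }
inline bool isPortB(std::uint16_t port) { return port == 2000; }
inline bool looksLikeA(const std::uint8_t* d, std::size_t n) { return n > 0 && d[0] == 'A'; }
inline bool looksLikeB(const std::uint8_t* d, std::size_t n) { return n > 0 && d[0] == 'B'; }

inline std::vector<HeuristicHandler> demoHandlers()
{
    return { { "A", isPortA, looksLikeA }, { "B", isPortB, looksLikeB } };
}

// Two-pass identification: pass 0 only tries handlers whose port hint matches,
// pass 1 tries every handler regardless of ports.
inline std::string identify(const std::vector<HeuristicHandler>& handlers,
                            std::uint16_t portSrc, std::uint16_t portDst,
                            const std::uint8_t* data, std::size_t dataLen)
{
    bool skipPortCheck = false;
    for (int pass = 0; pass < 2; ++pass)
    {
        for (const auto& handler : handlers)
        {
            const bool portHint = skipPortCheck || handler.isPort(portSrc) || handler.isPort(portDst);
            if (portHint && handler.looksLike(data, dataLen))
                return handler.name;
        }
        // Ignore the ports on the second run.
        skipPortCheck = true;
    }
    // Nothing matched: fall back to a generic payload.
    return "Payload";
}
```

A payload that starts with `'B'` but arrives on port 1000 is rejected by A in the first pass, skipped by B because of the port, and only matched as B in the second pass. A's check runs twice for that packet, which is why each check should reject as quickly as possible.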

sorooshm78 added 5 commits November 17, 2025 15:51

add initial heuristic detection for SIP packets

f8b2d4d

add comment

07ca6b7

refactor move helper methods to private, keep public API minimal

f4fba28

use function dissectSipHeuristic

d18abd0

Remove unused #include <iostream>

6c6acda

sorooshm78 requested a review from seladb as a code owner November 17, 2025 16:11

seladb reviewed Nov 23, 2025

sorooshm78 added 2 commits November 29, 2025 07:55

move the implementation to SipLayer.cpp

2f2cc66

Merge branch 'dev' into add-sip-heuristic

09cb3f9

seladb linked an issue Nov 30, 2025 that may be closed by this pull request

SIP detection in PcapPlusPlus relies solely on port 5060 #2022

Open

sorooshm78 added 3 commits December 2, 2025 17:10

Merge branch 'dev' into add-sip-heuristic

8362e31

refactor: use SipRequestFirstLine and SipResponseFirstLine static parsers in heuristic SIP detection

dfc161c

Remove unused helper functions

9f5c52d

sorooshm78 changed the title ~~Improve SIP Packet Detection Using Heuristic Parsing (Fixes #2022)~~ Improve SIP packet detection using heuristic parsing Dec 4, 2025

Merge branch 'dev' into add-sip-heuristic

29063ca
