Skip to content

Conversation

@bemoody
Copy link

@bemoody bemoody commented Nov 8, 2023

The find_iter and captures_iter functions iterate over the distinct, non-overlapping matches within the subject string.

Normally, this means the iterator will start searching at the end position of the previous match. However, if the previous match was zero-width, we want to advance by an additional character so we don't return the same match again.

In byte mode, this just means incrementing the position by 1. In "UTF" mode, however, we need to advance to the next UTF-8 character boundary.

Note that to determine whether the regex is in UTF mode, we can't simply check the utf flag of its config, because UTF mode can also be turned on by including (*UTF) in the pattern itself. So we need to invoke pcre2_pattern_info_8 to check which mode is actually used.

(I also considered that, instead of adding a utf field to Regex, the Regex::build function could modify the saved config. It's possible that might work but it seems fragile.)

Benjamin Moody added 2 commits November 7, 2023 22:48
Following an empty match, the iterator (Matches or CaptureMatches)
advances the last_end position so as not to return the same match
twice.

However, if the regex uses UTF mode, the position passed to
find_at_with_match_data or captures_read_at is required to be a UTF-8
character boundary, so last_end must be advanced by a whole UTF-8
character, not just one byte.

Determining whether or not the regex is using UTF mode requires
PCRE2_INFO_ALLOPTIONS (checking the config is not enough.)
These tests will fail if the iterator does not correctly advance the
end position (by either a byte or a whole character, as appropriate)
following an empty match.
@BurntSushi
Copy link
Owner

This PR unfortunately doesn't seem quite correct. In particular, it doesn't take PCRE2_MATCH_INVALID_UTF into account. I'd expect some tests for inputs containing invalid UTF-8.

@bemoody
Copy link
Author

bemoody commented Sep 21, 2025

Thanks for looking at this! It's been quite a while since I thought about this, so I may be missing some of the details.

I'm not sure what you mean about PCRE2_MATCH_INVALID_UTF. But, from a quick skim of the documentation, I think that:

  • The existing behavior of pcre2 is unsound (when JIT is used together with UTF mode and not UCP mode.)
  • This pull request would incidentally fix undefined behavior in some circumstances, but by no means all.

From what I can recall, I wasn't trying to fix undefined behavior with this pull request. Nor was I trying to fix anything related to invalid UTF-8 (rather, I was trying to fix the handling of valid UTF-8 with valid UTF-8 patterns.)

So I'd be happy to try to address the larger issue but I'd need to first fully understand what the larger issue is.

Also, is there something that you think this code should be doing to take PCRE2_MATCH_INVALID_UTF into account? I'm not sure what the desired semantics are if you try to use UTF matching with something that is truly invalid UTF-8.

@BurntSushi
Copy link
Owner

I don't think UB is related here. There shouldn't be any UB before or after this PR.

When PCRE2_MATCH_INVALID_UTF is set, PCRE2 can and should be able to search invalid UTF-8. But the way you are advancing input here means it could completely skip over huge portions of the haystack.

@BurntSushi
Copy link
Owner

I am not sure what the right thing is here. I haven't investigated too deeply.

@bemoody
Copy link
Author

bemoody commented Sep 21, 2025

I don't think UB is related here. There shouldn't be any UB before or after this PR.

Yes, I think I was reading the PCRE2 documentation incorrectly. (It says that using pcre2_jit_match without PCRE2_MATCH_INVALID_UTF, if the string is non-UTF-8, gives undefined behavior. It doesn't say the same about pcre2_match.)

When PCRE2_MATCH_INVALID_UTF is set, PCRE2 can and should be able to search invalid UTF-8.

Hmm. It's not completely obvious what "searching invalid UTF-8" means, or, again, what the intended semantics are. https://www.pcre.org/current/doc/html/pcre2unicode.html says:

In this mode, an invalid code unit sequence in the subject never matches any pattern item. It does not match dot, it does not match \p{Any}, it does not even match negative items such as [^X]. A lookbehind assertion fails if it encounters an invalid sequence while moving the current point backwards. In other words, an invalid UTF code unit sequence acts as a barrier which no match can cross.

Which doesn't answer the question of whether there can be a zero-length match between two invalid UTF-8 bytes. But if that were allowed, that would also imply matching in the middle of a valid UTF-8 character, which feels wrong to me.

In fact, the current behavior of pcre2 seems to be inconsistent between JIT and non-JIT; for example:

use pcre2::bytes::RegexBuilder;
fn main() {
    let re1 = RegexBuilder::new().ucp(true).build(r".*").unwrap();
    let re2 = RegexBuilder::new().jit(true).ucp(true).build(r".*").unwrap();
    println!("{:?}", re1.find_at(b"a\x80", 1));
    println!("{:?}", re2.find_at(b"a\x80", 1));
}

prints:

Ok(Some(Match { subject: [97, 128], start: 2, end: 2 }))
Ok(Some(Match { subject: [97, 128], start: 1, end: 1 }))

@BurntSushi
Copy link
Owner

Searching invalid UTF-8 is critical for perf and is used by ripgrep. (ripgrep is why I made this crate.)

The regex crate also supports it: https://docs.rs/regex/latest/regex/bytes/index.html#syntax

This might also be useful to read for the general principle: https://docs.rs/bstr/latest/bstr/#handling-of-invalid-utf-8

In terms of what PCRE2 should or does do, that's a harder question. Its docs can be difficult to read at points.

Sorry I don't have more to offer right now. If I get a chance soon, I can try to elaborate and dig into this more. I can't really be an oracle here. I would need to dig into the PCRE2 docs and behavior to figure out the right answer here.

@bemoody
Copy link
Author

bemoody commented Sep 22, 2025

When I try using regex::bytes::Regex, in "unicode" mode, with a pattern that matches an empty string, its find_iter does indeed yield a match at every byte position, including in the middle of a UTF-8 character.

PCRE2 (with PCRE2_MATCH_INVALID_UTF) doesn't do that if JIT is disabled. It only yields a match if the following byte is not in the range 0x80 to 0xbf.

PCRE2 does appear to do the same as regex::bytes when JIT is enabled (on x86-64, at least). This inconsistency is probably a bug in PCRE2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants