Fix handling of empty matches in iterators in UTF mode #36

bemoody · 2023-11-08T03:55:58Z

The find_iter and captures_iter functions iterate over the distinct, non-overlapping matches within the subject string.

Normally, this means the iterator will start searching at the end position of the previous match. However, if the previous match was zero-width, we want to advance by an additional character so we don't return the same match again.

In byte mode, this just means incrementing the position by 1. In "UTF" mode, however, we need to advance to the next UTF-8 character boundary.

Note that to determine whether the regex is in UTF mode, we can't simply check the utf flag of its config, because UTF mode can also be turned on by including (*UTF) in the pattern itself. So we need to invoke pcre2_pattern_info_8 to check which mode is actually used.

(I also considered that, instead of adding a utf field to Regex, the Regex::build function could modify the saved config. It's possible that might work but it seems fragile.)

Following an empty match, the iterator (Matches or CaptureMatches) advances the last_end position so as not to return the same match twice. However, if the regex uses UTF mode, the position passed to find_at_with_match_data or captures_read_at is required to be a UTF-8 character boundary, so last_end must be advanced by a whole UTF-8 character, not just one byte. Determining whether or not the regex is using UTF mode requires PCRE2_INFO_ALLOPTIONS (checking the config is not enough.)

These tests will fail if the iterator does not correctly advance the end position (by either a byte or a whole character, as appropriate) following an empty match.

BurntSushi · 2025-09-21T13:42:14Z

This PR unfortunately doesn't seem quite correct. In particular, it doesn't take PCRE2_MATCH_INVALID_UTF into account. I'd expect some tests for inputs containing invalid UTF-8.

bemoody · 2025-09-21T19:27:37Z

Thanks for looking at this! It's been quite a while since I thought about this, so I may be missing some of the details.

I'm not sure what you mean about PCRE2_MATCH_INVALID_UTF. But, from a quick skim of the documentation, I think that:

The existing behavior of pcre2 is unsound (when JIT is used together with UTF mode and not UCP mode.)
This pull request would incidentally fix undefined behavior in some circumstances, but by no means all.

From what I can recall, I wasn't trying to fix undefined behavior with this pull request. Nor was I trying to fix anything related to invalid UTF-8 (rather, I was trying to fix the handling of valid UTF-8 with valid UTF-8 patterns.)

So I'd be happy to try to address the larger issue but I'd need to first fully understand what the larger issue is.

Also, is there something that you think this code should be doing to take PCRE2_MATCH_INVALID_UTF into account? I'm not sure what the desired semantics are if you try to use UTF matching with something that is truly invalid UTF-8.

BurntSushi · 2025-09-21T20:53:00Z

I don't think UB is related here. There shouldn't be any UB before or after this PR.

When PCRE2_MATCH_INVALID_UTF is set, PCRE2 can and should be able to search invalid UTF-8. But the way you are advancing input here means it could completely skip over huge portions of the haystack.

BurntSushi · 2025-09-21T20:54:13Z

I am not sure what the right thing is here. I haven't investigated too deeply.

bemoody · 2025-09-21T21:41:16Z

I don't think UB is related here. There shouldn't be any UB before or after this PR.

Yes, I think I was reading the PCRE2 documentation incorrectly. (It says that using pcre2_jit_match without PCRE2_MATCH_INVALID_UTF, if the string is non-UTF-8, gives undefined behavior. It doesn't say the same about pcre2_match.)

When PCRE2_MATCH_INVALID_UTF is set, PCRE2 can and should be able to search invalid UTF-8.

Hmm. It's not completely obvious what "searching invalid UTF-8" means, or, again, what the intended semantics are. https://www.pcre.org/current/doc/html/pcre2unicode.html says:

In this mode, an invalid code unit sequence in the subject never matches any pattern item. It does not match dot, it does not match \p{Any}, it does not even match negative items such as [^X]. A lookbehind assertion fails if it encounters an invalid sequence while moving the current point backwards. In other words, an invalid UTF code unit sequence acts as a barrier which no match can cross.

Which doesn't answer the question of whether there can be a zero-length match between two invalid UTF-8 bytes. But if that were allowed, that would also imply matching in the middle of a valid UTF-8 character, which feels wrong to me.

In fact, the current behavior of pcre2 seems to be inconsistent between JIT and non-JIT; for example:

use pcre2::bytes::RegexBuilder;
fn main() {
    let re1 = RegexBuilder::new().ucp(true).build(r".*").unwrap();
    let re2 = RegexBuilder::new().jit(true).ucp(true).build(r".*").unwrap();
    println!("{:?}", re1.find_at(b"a\x80", 1));
    println!("{:?}", re2.find_at(b"a\x80", 1));
}

prints:

Ok(Some(Match { subject: [97, 128], start: 2, end: 2 }))
Ok(Some(Match { subject: [97, 128], start: 1, end: 1 }))

BurntSushi · 2025-09-21T21:59:55Z

Searching invalid UTF-8 is critical for perf and is used by ripgrep. (ripgrep is why I made this crate.)

The regex crate also supports it: https://docs.rs/regex/latest/regex/bytes/index.html#syntax

This might also be useful to read for the general principle: https://docs.rs/bstr/latest/bstr/#handling-of-invalid-utf-8

In terms of what PCRE2 should or does do, that's a harder question. Its docs can be difficult to read at points.

Sorry I don't have more to offer right now. If I get a chance soon, I can try to elaborate and dig into this more. I can't really be an oracle here. I would need to dig into the PCRE2 docs and behavior to figure out the right answer here.

bemoody · 2025-09-22T02:11:27Z

When I try using regex::bytes::Regex, in "unicode" mode, with a pattern that matches an empty string, its find_iter does indeed yield a match at every byte position, including in the middle of a UTF-8 character.

PCRE2 (with PCRE2_MATCH_INVALID_UTF) doesn't do that if JIT is disabled. It only yields a match if the following byte is not in the range 0x80 to 0xbf.

PCRE2 does appear to do the same as regex::bytes when JIT is enabled (on x86-64, at least). This inconsistency is probably a bug in PCRE2.

Benjamin Moody added 2 commits November 7, 2023 22:48

tests: check iterating over empty matches in UTF-8 text

fb07086

These tests will fail if the iterator does not correctly advance the end position (by either a byte or a whole character, as appropriate) following an empty match.

IsaacOscar mentioned this pull request Aug 5, 2025

Iterating over byte regexes with unicode support enabled does not work properly when matches are empty rust-lang/regex#1275

Closed

bemoody mentioned this pull request Sep 28, 2025

Fix handling of empty matches in iterators, using PCRE2_NOTEMPTY_ATSTART #52

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix handling of empty matches in iterators in UTF mode #36

Fix handling of empty matches in iterators in UTF mode #36

Uh oh!

bemoody commented Nov 8, 2023

Uh oh!

BurntSushi commented Sep 21, 2025

Uh oh!

bemoody commented Sep 21, 2025

Uh oh!

BurntSushi commented Sep 21, 2025

Uh oh!

BurntSushi commented Sep 21, 2025

Uh oh!

bemoody commented Sep 21, 2025

Uh oh!

BurntSushi commented Sep 21, 2025

Uh oh!

bemoody commented Sep 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Fix handling of empty matches in iterators in UTF mode #36

Are you sure you want to change the base?

Fix handling of empty matches in iterators in UTF mode #36

Uh oh!

Conversation

bemoody commented Nov 8, 2023

Uh oh!

BurntSushi commented Sep 21, 2025

Uh oh!

bemoody commented Sep 21, 2025

Uh oh!

BurntSushi commented Sep 21, 2025

Uh oh!

BurntSushi commented Sep 21, 2025

Uh oh!

bemoody commented Sep 21, 2025

Uh oh!

BurntSushi commented Sep 21, 2025

Uh oh!

bemoody commented Sep 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants