-
-
Notifications
You must be signed in to change notification settings - Fork 37
Fix handling of empty matches in iterators in UTF mode #36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Following an empty match, the iterator (Matches or CaptureMatches) advances the last_end position so as not to return the same match twice. However, if the regex uses UTF mode, the position passed to find_at_with_match_data or captures_read_at is required to be a UTF-8 character boundary, so last_end must be advanced by a whole UTF-8 character, not just one byte. Determining whether or not the regex is using UTF mode requires PCRE2_INFO_ALLOPTIONS (checking the config is not enough.)
These tests will fail if the iterator does not correctly advance the end position (by either a byte or a whole character, as appropriate) following an empty match.
|
This PR unfortunately doesn't seem quite correct. In particular, it doesn't take |
|
Thanks for looking at this! It's been quite a while since I thought about this, so I may be missing some of the details. I'm not sure what you mean about PCRE2_MATCH_INVALID_UTF. But, from a quick skim of the documentation, I think that:
From what I can recall, I wasn't trying to fix undefined behavior with this pull request. Nor was I trying to fix anything related to invalid UTF-8 (rather, I was trying to fix the handling of valid UTF-8 with valid UTF-8 patterns.) So I'd be happy to try to address the larger issue but I'd need to first fully understand what the larger issue is. Also, is there something that you think this code should be doing to take PCRE2_MATCH_INVALID_UTF into account? I'm not sure what the desired semantics are if you try to use UTF matching with something that is truly invalid UTF-8. |
|
I don't think UB is related here. There shouldn't be any UB before or after this PR. When PCRE2_MATCH_INVALID_UTF is set, PCRE2 can and should be able to search invalid UTF-8. But the way you are advancing input here means it could completely skip over huge portions of the haystack. |
|
I am not sure what the right thing is here. I haven't investigated too deeply. |
Yes, I think I was reading the PCRE2 documentation incorrectly. (It says that using pcre2_jit_match without PCRE2_MATCH_INVALID_UTF, if the string is non-UTF-8, gives undefined behavior. It doesn't say the same about pcre2_match.)
Hmm. It's not completely obvious what "searching invalid UTF-8" means, or, again, what the intended semantics are. https://www.pcre.org/current/doc/html/pcre2unicode.html says:
Which doesn't answer the question of whether there can be a zero-length match between two invalid UTF-8 bytes. But if that were allowed, that would also imply matching in the middle of a valid UTF-8 character, which feels wrong to me. In fact, the current behavior of pcre2 seems to be inconsistent between JIT and non-JIT; for example: prints: |
|
Searching invalid UTF-8 is critical for perf and is used by ripgrep. (ripgrep is why I made this crate.) The regex crate also supports it: https://docs.rs/regex/latest/regex/bytes/index.html#syntax This might also be useful to read for the general principle: https://docs.rs/bstr/latest/bstr/#handling-of-invalid-utf-8 In terms of what PCRE2 should or does do, that's a harder question. Its docs can be difficult to read at points. Sorry I don't have more to offer right now. If I get a chance soon, I can try to elaborate and dig into this more. I can't really be an oracle here. I would need to dig into the PCRE2 docs and behavior to figure out the right answer here. |
|
When I try using PCRE2 (with PCRE2_MATCH_INVALID_UTF) doesn't do that if JIT is disabled. It only yields a match if the following byte is not in the range 0x80 to 0xbf. PCRE2 does appear to do the same as |
The
find_iterandcaptures_iterfunctions iterate over the distinct, non-overlapping matches within the subject string.Normally, this means the iterator will start searching at the end position of the previous match. However, if the previous match was zero-width, we want to advance by an additional character so we don't return the same match again.
In byte mode, this just means incrementing the position by 1. In "UTF" mode, however, we need to advance to the next UTF-8 character boundary.
Note that to determine whether the regex is in UTF mode, we can't simply check the
utfflag of itsconfig, because UTF mode can also be turned on by including(*UTF)in the pattern itself. So we need to invokepcre2_pattern_info_8to check which mode is actually used.(I also considered that, instead of adding a
utffield toRegex, theRegex::buildfunction could modify the savedconfig. It's possible that might work but it seems fragile.)