Commit d046e4f
syntax: reject '(?-u)\W' when UTF-8 mode is enabled
When Unicode mode is disabled (i.e., (?-u)), the Perl character classes
(\w, \d and \s) revert to their ASCII definitions. The negated forms
of these classes are also derived from their ASCII definitions, and this
means that they may actually match bytes outside of ASCII and thus
possibly invalid UTF-8. For this reason, when the translator is
configured to only produce HIR that matches valid UTF-8, '(?-u)\W'
should be rejected.
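To make the byte-level behavior concrete, here is a minimal std-only sketch. It is not the regex crate's implementation; the helper names `is_ascii_word_byte` and `find_non_word_byte` are hypothetical. It models the ASCII definition of \w and its negation over raw bytes.

```rust
/// ASCII definition of `\w`: [0-9A-Za-z_].
/// (Hypothetical helper for illustration, not part of the regex crate.)
pub fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Models the ASCII-derived negation (`\W` under `(?-u)`): returns the offset
/// of the first byte that is NOT an ASCII word byte, i.e. a one-byte match
/// at `offset..offset + 1`. Every byte >= 0x80 qualifies, including the
/// individual bytes of a multi-byte UTF-8 sequence.
pub fn find_non_word_byte(haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| !is_ascii_word_byte(b))
}
```

Under this model, the snowman "☃" is encoded as the three bytes E2 98 83, and the negated class already matches the lead byte E2 on its own, which is not valid UTF-8 in isolation.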
Previously, it was not being rejected, which could actually lead to
matches that produced offsets that split codepoints, and thus lead to
panics when match offsets are used to slice a string. For example, this
code
    fn main() {
        let re = regex::Regex::new(r"(?-u)\W").unwrap();
        let haystack = "☃";
        if let Some(m) = re.find(haystack) {
            println!("{:?}", &haystack[m.range()]);
        }
    }
panics with
byte index 1 is not a char boundary; it is inside '☃' (bytes 0..3) of `☃`
That is, it reports a match at 0..1, which is technically correct, but
the regex itself should have been rejected in the first place since the
top-level Regex API always has UTF-8 mode enabled.
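The panic itself comes from string slicing, not from the search. As a small std-only sketch (the helper name `checked_slice` is hypothetical, and this is not something the commit adds), `str::get` shows the same boundary check without panicking:

```rust
use std::ops::Range;

/// Slices `haystack` by byte offsets, returning `None` instead of panicking
/// when either end of the range falls inside a multi-byte codepoint.
/// (Hypothetical helper for illustration only.)
pub fn checked_slice(haystack: &str, range: Range<usize>) -> Option<&str> {
    haystack.get(range)
}
```

With the haystack "☃", the range 0..1 yields `None` because byte index 1 is not a char boundary, while 0..3 yields the whole string. This only illustrates why the reported offsets were unusable with `&str`. The fix described here is to reject the pattern up front.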
Also, many of the replacement tests were using '(?-u)\W' (or similar)
for some reason. I'm not sure why, so I just removed the '(?-u)' to make
those tests pass. Whether Unicode is enabled or not doesn't seem to be
an interesting detail for those tests. (All haystacks and replacements
appear to be ASCII.)
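One way to see why the flag should not matter for ASCII-only haystacks is the rough std-only sketch below. It assumes that `char::is_alphanumeric` plus '_' is a good enough stand-in for Unicode \w; that is an approximation, since the real Unicode definition covers additional categories. All function names are hypothetical.

```rust
/// ASCII definition of `\w`, as used when Unicode mode is disabled.
pub fn is_ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Approximate stand-in for Unicode `\w` (assumption: the real definition
/// includes more categories, but they contain no additional ASCII characters).
pub fn is_unicode_word_approx(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Returns true if both definitions classify every ASCII character identically.
pub fn definitions_agree_on_ascii() -> bool {
    (0u8..=127)
        .map(char::from)
        .all(|c| is_ascii_word(c) == is_unicode_word_approx(c))
}
```

The two definitions only diverge on non-ASCII input such as 'é', which is consistent with the tests passing once '(?-u)' is removed.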
Fixes #895, Partially addresses #738
1 parent 2f405f3, commit d046e4f
2 files changed: +92 -40 lines changed