Commit dd75533

Update to Unicode 17 (#830)
1 parent 3b91977 commit dd75533

29 files changed: +5760 -4635 lines changed

.gitattributes

Lines changed: 5 additions & 0 deletions
@@ -1,2 +1,7 @@
 testdata/* -text
 maint/manifest-* -text
+maint/ucptestdata -text
+*.sh text eol=lf
+pcre2-config.in text eol=lf
+RunTest text eol=lf
+RunGrepTest text eol=lf

maint/FetchUcd.sh

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+#! /bin/sh
+
+# Small helper script to fetch the Unicode Character Database files
+
+VER=17.0.0
+
+cd "$(dirname "$0")"
+pwd
+
+rm -rf Unicode.tables/
+mkdir Unicode.tables
+
+fetch_file()
+{
+  url="$1"
+  i="$2"
+
+  echo "=== Downloading $i ==="
+  # Download each file with curl and place into the Unicode.tables folder
+  # Reject the download if there is an HTTP error
+  if ! curl --fail -o Unicode.tables/$i -L "$url"; then
+    echo "Error downloading $i"
+    rm -f Unicode.tables/$i
+  fi
+}
+
+for i in BidiMirroring.txt \
+         CaseFolding.txt \
+         DerivedCoreProperties.txt \
+         PropertyAliases.txt \
+         PropertyValueAliases.txt \
+         PropList.txt \
+         ScriptExtensions.txt \
+         Scripts.txt \
+         UnicodeData.txt \
+         ; do
+  fetch_file "https://www.unicode.org/Public/$VER/ucd/$i" "$i"
+done
+
+for i in DerivedBidiClass.txt \
+         DerivedGeneralCategory.txt \
+         ; do
+  fetch_file "https://www.unicode.org/Public/$VER/ucd/extracted/$i" "$i"
+done
+
+for i in GraphemeBreakProperty.txt \
+         ; do
+  fetch_file "https://www.unicode.org/Public/$VER/ucd/auxiliary/$i" "$i"
+done
+
+for i in emoji-data.txt \
+         ; do
+  fetch_file "https://www.unicode.org/Public/$VER/ucd/emoji/$i" "$i"
+done
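
Note: the four loops in FetchUcd.sh differ only in the sub-directory under ucd/ that they fetch from. The Python sketch below is not part of the commit; it rebuilds the same list of URLs so the fetch set can be checked without network access. The names BASE, UCD_FILES and ucd_urls are hypothetical and introduced only for illustration.

```python
# Hypothetical helper (not in the commit): enumerate the URLs that
# FetchUcd.sh requests, using the file lists copied from its four loops.
BASE = "https://www.unicode.org/Public"

# Sub-directory under ucd/ -> file names fetched from it.
UCD_FILES = {
    "": [
        "BidiMirroring.txt",
        "CaseFolding.txt",
        "DerivedCoreProperties.txt",
        "PropertyAliases.txt",
        "PropertyValueAliases.txt",
        "PropList.txt",
        "ScriptExtensions.txt",
        "Scripts.txt",
        "UnicodeData.txt",
    ],
    "extracted/": ["DerivedBidiClass.txt", "DerivedGeneralCategory.txt"],
    "auxiliary/": ["GraphemeBreakProperty.txt"],
    "emoji/": ["emoji-data.txt"],
}


def ucd_urls(version="17.0.0"):
    """Return (url, local_name) pairs in the order the script fetches them."""
    return [
        (f"{BASE}/{version}/ucd/{subdir}{name}", name)
        for subdir, names in UCD_FILES.items()
        for name in names
    ]


if __name__ == "__main__":
    for url, name in ucd_urls():
        print(f"{name:32} {url}")
```

As the script is written, a failed download is reported and the partial file is removed, after which the loop continues with the next file. Comparing the contents of Unicode.tables against a list like this one would be one way to spot a missing file.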

maint/GenerateCommon.py

Lines changed: 1 addition & 1 deletion
@@ -348,7 +348,7 @@ def open_output(default):
 POSSIBILITY OF SUCH DAMAGE.
 -----------------------------------------------------------------------------
 */
-\n""")
+\n\n""")
   return file
 
 # End of UcpCommon.py

maint/GenerateUcd.py

Lines changed: 3 additions & 0 deletions
@@ -788,10 +788,13 @@ def write_bitsets(list, item_size):
 just one of these tables is actually needed. When compiling the library, some
 headers are needed. */
 
+
 #ifndef PCRE2_PCRE2TEST
 #include "pcre2_internal.h"
 #endif /* PCRE2_PCRE2TEST */
 
+
+
 /* The tables herein are needed only when UCP support is built, and in PCRE2
 that happens automatically with UTF support. This module should not be
 referenced otherwise, so it should not matter whether it is compiled or not.

maint/README

Lines changed: 9 additions & 4 deletions
@@ -60,6 +60,10 @@ GenerateUcpTables.py
 GenerateCommon.py and Unicode data files. The generated file contains tables
 for looking up Unicode property names.
 
+FetchUcd.sh
+A shell script to download the UCD data from the Unicode website into
+the Unicode.tables directory.
+
 FilterCoverage.py
 A small helper used by the RunCoverage script.
 
@@ -141,10 +145,11 @@ Updating to a new Unicode release
 =================================
 
 When there is a new release of Unicode, the files in Unicode.tables must be
-refreshed from the Unicode web site. Once that is done, the four Python scripts
-that generate files from the Unicode data can be run from within the "maint"
-directory. Note that the format used for those files is not stable, and
-therefore changes to the scripts might be needed to support new versions.
+refreshed from the Unicode web site, which can be done with the script
+FetchUcd.sh. Once that is done, the four Python scripts that generate files from
+the Unicode data can be run from within the "maint" directory. Note that the
+format used for those files is not stable, and therefore changes to the scripts
+might be needed to support new versions.
 
 Note: Previously, it was necessary to update lists of scripts and their
 abbreviations by hand before running the Python scripts. This is no longer

maint/Unicode.tables/BidiMirroring.txt

Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,6 @@
-# BidiMirroring-16.0.0.txt
-# Date: 2024-01-30
-# © 2024 Unicode®, Inc.
+# BidiMirroring-17.0.0.txt
+# Date: 2025-08-01
+# © 2025 Unicode®, Inc.
 # Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
 # For terms of use and license, see https://www.unicode.org/terms_of_use.html
 #
@@ -16,7 +16,7 @@
 # value, for which there is another Unicode character that typically has a glyph
 # that is the mirror image of the original character's glyph.
 #
-# The repertoire covered by the file is Unicode 16.0.0.
+# The repertoire covered by the file is Unicode 17.0.0.
 #
 # The file contains a list of lines with mappings from one code point
 # to another one for character-based mirroring.

maint/Unicode.tables/CaseFolding.txt

Lines changed: 34 additions & 6 deletions
@@ -1,6 +1,6 @@
-# CaseFolding-16.0.0.txt
-# Date: 2024-04-30, 21:48:11 GMT
-# © 2024 Unicode®, Inc.
+# CaseFolding-17.0.0.txt
+# Date: 2025-07-30, 23:54:36 GMT
+# © 2025 Unicode®, Inc.
 # Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
 # For terms of use and license, see https://www.unicode.org/terms_of_use.html
 #
@@ -18,15 +18,15 @@
 # The data supports both implementations that require simple case foldings
 # (where string lengths don't change), and implementations that allow full case folding
 # (where string lengths may grow). Note that where they can be supported, the
-# full case foldings are superior: for example, they allow "MASSE" and "Maße" to match.
+# full case foldings are superior: for example, they allow "FUSS" and "Fuß" to match.
 #
 # All code points not listed in this file map to themselves.
 #
 # NOTE: case folding does not preserve normalization formats!
 #
 # For information on case folding, including how to have case folding
-# preserve normalization formats, see Section 3.13 Default Case Algorithms in
-# The Unicode Standard.
+# preserve normalization formats, see the
+# "Conformance" / "Default Case Algorithms" section of the core specification.
 #
 # ================================================================================
 # Format
@@ -1243,7 +1243,10 @@ A7C7; C; A7C8; # LATIN CAPITAL LETTER D WITH SHORT STROKE OVERLAY
 A7C9; C; A7CA; # LATIN CAPITAL LETTER S WITH SHORT STROKE OVERLAY
 A7CB; C; 0264; # LATIN CAPITAL LETTER RAMS HORN
 A7CC; C; A7CD; # LATIN CAPITAL LETTER S WITH DIAGONAL STROKE
+A7CE; C; A7CF; # LATIN CAPITAL LETTER PHARYNGEAL VOICED FRICATIVE
 A7D0; C; A7D1; # LATIN CAPITAL LETTER CLOSED INSULAR G
+A7D2; C; A7D3; # LATIN CAPITAL LETTER DOUBLE THORN
+A7D4; C; A7D5; # LATIN CAPITAL LETTER DOUBLE WYNN
 A7D6; C; A7D7; # LATIN CAPITAL LETTER MIDDLE SCOTS S
 A7D8; C; A7D9; # LATIN CAPITAL LETTER SIGMOID S
 A7DA; C; A7DB; # LATIN CAPITAL LETTER LAMBDA
@@ -1616,6 +1619,31 @@ FF3A; C; FF5A; # FULLWIDTH LATIN CAPITAL LETTER Z
 16E5D; C; 16E7D; # MEDEFAIDRIN CAPITAL LETTER O
 16E5E; C; 16E7E; # MEDEFAIDRIN CAPITAL LETTER AI
 16E5F; C; 16E7F; # MEDEFAIDRIN CAPITAL LETTER Y
+16EA0; C; 16EBB; # BERIA ERFE CAPITAL LETTER ARKAB
+16EA1; C; 16EBC; # BERIA ERFE CAPITAL LETTER BASIGNA
+16EA2; C; 16EBD; # BERIA ERFE CAPITAL LETTER DARBAI
+16EA3; C; 16EBE; # BERIA ERFE CAPITAL LETTER EH
+16EA4; C; 16EBF; # BERIA ERFE CAPITAL LETTER FITKO
+16EA5; C; 16EC0; # BERIA ERFE CAPITAL LETTER GOWAY
+16EA6; C; 16EC1; # BERIA ERFE CAPITAL LETTER HIRDEABO
+16EA7; C; 16EC2; # BERIA ERFE CAPITAL LETTER I
+16EA8; C; 16EC3; # BERIA ERFE CAPITAL LETTER DJAI
+16EA9; C; 16EC4; # BERIA ERFE CAPITAL LETTER KOBO
+16EAA; C; 16EC5; # BERIA ERFE CAPITAL LETTER LAKKO
+16EAB; C; 16EC6; # BERIA ERFE CAPITAL LETTER MERI
+16EAC; C; 16EC7; # BERIA ERFE CAPITAL LETTER NINI
+16EAD; C; 16EC8; # BERIA ERFE CAPITAL LETTER GNA
+16EAE; C; 16EC9; # BERIA ERFE CAPITAL LETTER NGAY
+16EAF; C; 16ECA; # BERIA ERFE CAPITAL LETTER OI
+16EB0; C; 16ECB; # BERIA ERFE CAPITAL LETTER PI
+16EB1; C; 16ECC; # BERIA ERFE CAPITAL LETTER ERIGO
+16EB2; C; 16ECD; # BERIA ERFE CAPITAL LETTER ERIGO TAMURA
+16EB3; C; 16ECE; # BERIA ERFE CAPITAL LETTER SERI
+16EB4; C; 16ECF; # BERIA ERFE CAPITAL LETTER SHEP
+16EB5; C; 16ED0; # BERIA ERFE CAPITAL LETTER TATASOUE
+16EB6; C; 16ED1; # BERIA ERFE CAPITAL LETTER UI
+16EB7; C; 16ED2; # BERIA ERFE CAPITAL LETTER WASSE
+16EB8; C; 16ED3; # BERIA ERFE CAPITAL LETTER AY
 1E900; C; 1E922; # ADLAM CAPITAL LETTER ALIF
 1E901; C; 1E923; # ADLAM CAPITAL LETTER DAALI
 1E902; C; 1E924; # ADLAM CAPITAL LETTER LAAM
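
Note: each data line in CaseFolding.txt has the form `code; status; mapping; # name`. The Python sketch below shows one way such lines might be read and the added entries spot-checked. It is not how the PCRE2 generator scripts process the file; parse_case_folding, SAMPLE and folds are hypothetical names used only for illustration, and the sample lines are copied from the hunks above.

```python
def parse_case_folding(lines):
    """Yield (code_point, status, mapping) for each data line.

    Assumes the `code; status; mapping; # name` layout seen in the diff;
    blank lines and comment-only lines are skipped.
    """
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        code, status, mapping = [field.strip() for field in data.split(";")[:3]]
        # A mapping may list several code points (full case folding),
        # so it is always returned as a list.
        yield int(code, 16), status, [int(cp, 16) for cp in mapping.split()]


SAMPLE = """\
# comment lines are ignored
A7CE; C; A7CF; # LATIN CAPITAL LETTER PHARYNGEAL VOICED FRICATIVE
A7D2; C; A7D3; # LATIN CAPITAL LETTER DOUBLE THORN
16EA0; C; 16EBB; # BERIA ERFE CAPITAL LETTER ARKAB
16EB8; C; 16ED3; # BERIA ERFE CAPITAL LETTER AY
""".splitlines()

folds = {cp: mapping for cp, _status, mapping in parse_case_folding(SAMPLE)}

# In the lines added above, each Beria Erfe capital letter folds to the
# code point 0x1B positions higher.
beria = {cp: m for cp, m in folds.items() if 0x16EA0 <= cp <= 0x16EB8}
assert all(m == [cp + 0x1B] for cp, m in beria.items())
```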
