ENH: FontDescriptor: Initiate from embedded font resource #3551

PJBrs · 2025-12-06T08:38:37Z

This PR enables initiating a font descriptor based on a font resource for an embedded font. To that end, it takes the code for collecting character widths from the Font class in pypdf/_text_extraction/_layout_mode/_font.py and adds some lines to collect the other font metrics. This would be necessary for forms that do not use the 14 Adobe standard fonts.

To reduce code duplication, I replaced the existing width_map in the Font class with the FontDescriptor class, which keeps all functionality intact, because it uses the same code for collecting font widths.

I did notice that character widths are currently not collected for Type1 and TrueType fonts that specify their encoding as a string.
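
For readers unfamiliar with how widths are stored, here is a minimal sketch of the idea for a simple (non-CID) font. It is not the code from this PR: the function name `collect_simple_font_widths` is hypothetical, and a plain Python dict stands in for the PDF font dictionary. It only assumes the standard `/FirstChar` and `/Widths` entries, where `/Widths` holds one width per character code starting at `/FirstChar`.

```python
from typing import Any


def collect_simple_font_widths(font_dict: dict[str, Any]) -> dict[int, float]:
    """Map character codes to widths for a simple (non-CID) font dictionary.

    Hypothetical helper for illustration; a plain dict stands in for the
    PDF font dictionary object.
    """
    first_char = int(font_dict.get("/FirstChar", 0))
    widths = font_dict.get("/Widths", [])
    # /Widths holds one entry per character code, starting at /FirstChar.
    return {first_char + offset: float(width) for offset, width in enumerate(widths)}


# Made-up example values, not taken from a real font.
example_font = {
    "/Subtype": "/TrueType",
    "/FirstChar": 32,
    "/LastChar": 34,
    "/Widths": [250, 333, 408],
}
simple_widths = collect_simple_font_widths(example_font)
```

Note that this sketch keys the result by character code. Turning codes into characters additionally needs the font's encoding or character map, which is where the string-encoding case mentioned above comes in.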

codecov · 2025-12-06T08:48:14Z

Codecov Report

❌ Patch coverage is 98.30508% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 97.15%. Comparing base (2cd7409) to head (def511b).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
pypdf/_font.py	98.14%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3551      +/-   ##
==========================================
- Coverage   97.15%   97.15%   -0.01%     
==========================================
  Files          56       56              
  Lines        9783     9802      +19     
  Branches     1784     1789       +5     
==========================================
+ Hits         9505     9523      +18     
  Misses        167      167              
- Partials      111      112       +1

stefan6419846 · 2025-12-07T14:01:52Z

pypdf/_font.py

+            }
+
+        # CID fonts have a /W array mapping character codes to widths stashed in /DescendantFonts
+        if "/DescendantFonts" in pdf_font_dict:


Are we able to move this whole conditional block into a dedicated method to keep methods short and readable where possible?

PJBrs · Dec 8, 2025

Not quite, or at least not very elegantly. If I did, the code for finding the widths would also have to return the font descriptor dictionary, which would be a rather ugly solution.

Conceptually, it would be quite different if the FontDescriptor here were also a subclass of a Font class. More on that below.

PJBrs · Dec 9, 2025

Hmm... I can make this work, of course. I will try later.
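
As a rough illustration of what such a dedicated method could look like, here is a sketch under stated assumptions rather than the code in this PR. The names `parse_w_array` and `cid_widths_and_descriptor` are hypothetical, and plain Python dicts and lists stand in for PDF objects. The `/W` array uses two entry forms: `c [w1 w2 ...]` assigns widths to consecutive CIDs starting at `c`, and `c_first c_last w` assigns one width to a whole range. The second helper returns the descendant font's descriptor dictionary alongside the widths, which is presumably the awkward double return value discussed above.

```python
from typing import Any


def parse_w_array(w_array: list[Any]) -> dict[int, float]:
    """Expand a CID font /W array into a mapping of CID to width."""
    widths: dict[int, float] = {}
    index = 0
    while index < len(w_array):
        start = int(w_array[index])
        if index + 1 < len(w_array) and isinstance(w_array[index + 1], list):
            # Form 1: c [w1 w2 ... wn] gives widths for consecutive CIDs.
            for offset, width in enumerate(w_array[index + 1]):
                widths[start + offset] = float(width)
            index += 2
        elif index + 2 < len(w_array):
            # Form 2: c_first c_last w gives one width for the whole range.
            last = int(w_array[index + 1])
            width = float(w_array[index + 2])
            for cid in range(start, last + 1):
                widths[cid] = width
            index += 3
        else:
            raise ValueError("Malformed /W array")
    return widths


def cid_widths_and_descriptor(
    font_dict: dict[str, Any],
) -> tuple[dict[int, float], dict[str, Any]]:
    """Return (widths, font descriptor dict) from the first descendant font."""
    descendant = font_dict["/DescendantFonts"][0]
    widths = parse_w_array(descendant.get("/W", []))
    return widths, descendant.get("/FontDescriptor", {})


# Made-up example values for illustration only.
example_type0 = {
    "/Subtype": "/Type0",
    "/DescendantFonts": [
        {
            "/W": [120, [400, 325, 500], 7080, 7082, 1000],
            "/FontDescriptor": {"/FontName": "/Example", "/Ascent": 800},
        }
    ],
}
cid_widths, descriptor_dict = cid_widths_and_descriptor(example_type0)
```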

PJBrs · 2025-12-08T17:43:01Z

@stefan6419846 I've done some more thinking about this, also in light of trying to fix creating appearance streams for Arabic text.

I think we took a bit of a wrong turn when we added the character widths in the CORE_FONT_METRICS dict to the FontDescriptor class. In a PDF document, widths are defined directly in the font dictionary, at the same level as the font descriptor. So, in keeping with the logic of a PDF font resource, a Font class should consist of:

A character map
An encoding
Character widths
A Font Descriptor.

Now, conceptually, our font metrics consist of character widths and a font descriptor, but our FontDescriptor class includes the character widths as well.

Furthermore, in order to solve the Arabic text case, we would at least need to update the document's font resource, for which it would be really nice to have an associated Font class.

I'd like to:

Remove character widths from FontDescriptor (and make the corresponding changes in CORE_FONT_METRICS and the script that produces it)
Add a Font class to pypdf/_font.py, consisting of character map, encoding, character widths, and font descriptor. This is almost the same as the existing class in pypdf/_text_extraction/_layout_mode/_font.py.
Initialise Font from font dictionary
Add code to initialise font from font file (not this PR)
Add code to either update or add a font resource, based on a file. (not this PR).
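
To make the proposed structure concrete, here is a minimal sketch of how such a Font class might be laid out. All field and method names are assumptions for illustration, not pypdf's actual API. The only point is that character widths sit next to the font descriptor instead of inside it, mirroring the layout of a PDF font dictionary.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class FontDescriptor:
    """Font-wide metrics only; no per-character widths (hypothetical layout)."""

    name: str = "Unknown"
    ascent: float = 0.0
    descent: float = 0.0
    cap_height: float = 0.0
    italic_angle: float = 0.0


@dataclass
class Font:
    """Hypothetical container mirroring the parts of a PDF font resource."""

    name: str
    # Maps encoded characters to Unicode text (e.g. derived from /ToUnicode).
    character_map: dict[str, str] = field(default_factory=dict)
    # Either an encoding name or an explicit code-to-character table.
    encoding: Union[str, dict[int, str]] = "unknown"
    # Per-character widths, kept beside the descriptor rather than inside it.
    character_widths: dict[str, float] = field(default_factory=dict)
    font_descriptor: FontDescriptor = field(default_factory=FontDescriptor)

    def text_width(self, text: str, default: float = 0.0) -> float:
        """Sum the widths of all characters, using `default` for unknown ones."""
        return sum(self.character_widths.get(char, default) for char in text)


# Made-up widths for illustration only.
sketch_font = Font(name="/F1", character_widths={"A": 722.0, "B": 667.0})
```

With this layout, code that only needs font-wide metrics can take a FontDescriptor, while code that measures text works with the Font as a whole.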

This patch copies over the logic for acquiring character widths from a PDF font dictionary from the Font class in pypdf/_text_extraction/_layout_mode/_font.py. Later on, this makes it possible to initiate a FontDescriptor from an embedded font.

Replace the width_map in the Font class with the FontDescriptor class.

stefan6419846 · 2025-12-09T16:19:04Z

I have to admit that I cannot completely follow you here due to my limited knowledge of how all the font aspects work and relate. If you plan another round of refactoring, please keep it understandable, maintainable, and reviewable in a straightforward manner.

stefan6419846 reviewed Dec 7, 2025

pypdf/_font.py (outdated)

stefan6419846 reviewed Dec 7, 2025

pypdf/_font.py (outdated)

stefan6419846 reviewed Dec 7, 2025

pypdf/_font.py (outdated)

stefan6419846 reviewed Dec 7, 2025

tests/test_text_extraction.py (outdated)

PJBrs marked this pull request as draft December 8, 2025 17:30

PJBrs added 5 commits December 8, 2025 19:55

ENH: FontDescriptor: Be more explicit about typing

085e326

ENH: Collect all font descriptor metrics

a67c05c

MAINT: Refactor Font class

7cbee81

Replace the width_map in the Font class with the FontDescriptor class.

MAINT: Remove while IndirectObject loop

def511b

PJBrs force-pushed the fontwork branch from 9ac5d71 to def511b December 8, 2025 19:38
