@PJBrs
Contributor

@PJBrs PJBrs commented Dec 6, 2025

This PR enables initializing a font descriptor from the font resource of an embedded font. To that end, it takes the code for collecting character widths from the Font class in pypdf/_text_extraction/_layout_mode/_font.py and adds some lines to collect the other font metrics. This would be necessary for forms that do not use the 14 Adobe standard fonts.

To reduce code duplication, I replaced the existing width_map in the Font class with the FontDescriptor class, which keeps all functionality intact, because it uses the same code for collecting font widths.

I did notice that character widths are in fact not collected for Type1 and TrueType fonts that specify their encoding as a string.
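
For reference, a minimal sketch of how the widths of a simple (Type1 or TrueType) font can be read from its font dictionary regardless of how /Encoding is written. This is not the code from this PR: it uses a plain Python dict in place of pypdf objects, the function name `simple_font_widths` is hypothetical, and it only relies on the /FirstChar and /Widths entries that the PDF specification defines for simple fonts. The result is keyed by character code, not by decoded character.

```python
def simple_font_widths(font_dict: dict) -> dict[int, float]:
    """Map character codes to widths for a simple font dictionary.

    Assumption: font_dict is a plain dict standing in for a PDF font
    dictionary; /Widths holds one entry per code starting at /FirstChar.
    """
    first_char = font_dict.get("/FirstChar")
    widths = font_dict.get("/Widths")
    if first_char is None or widths is None:
        # Nothing to collect, e.g. a standard font without explicit widths.
        return {}
    return {first_char + offset: float(w) for offset, w in enumerate(widths)}
```

Because the lookup is by character code, the form of /Encoding (name string or dictionary) does not affect this step; it only matters later, when codes are mapped to characters.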

@codecov

codecov bot commented Dec 6, 2025

Codecov Report

❌ Patch coverage is 98.30508% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 97.15%. Comparing base (2cd7409) to head (def511b).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pypdf/_font.py 98.14% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3551      +/-   ##
==========================================
- Coverage   97.15%   97.15%   -0.01%     
==========================================
  Files          56       56              
  Lines        9783     9802      +19     
  Branches     1784     1789       +5     
==========================================
+ Hits         9505     9523      +18     
  Misses        167      167              
- Partials      111      112       +1     


}

# CID fonts have a /W array mapping character codes to widths stashed in /DescendantFonts
if "/DescendantFonts" in pdf_font_dict:
Collaborator

Are we able to move this whole conditional block into a dedicated method to keep methods short and readable where possible?

Contributor Author

Not quite, at least, not very elegantly. If I did, then the code for finding the widths should also return the font descriptor dictionary, which would be a rather ugly solution.

Conceptually, it would be a lot different if the FontDescriptor here were also a subclass of a Font class. More on that below.

Contributor Author

Hmm... I can make this work, of course, will try later.
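
For illustration only, a dedicated helper of the kind discussed in this thread might look like the sketch below. It is not the PR's code: it works on plain Python dicts and lists rather than pypdf objects, the names `parse_cid_widths` and `cid_font_widths` are hypothetical, and it assumes the /W layout from the PDF specification (either `c [w1 ... wn]` or `c_first c_last w`). Returning the descendant's /FontDescriptor alongside the widths is the tuple-style return mentioned above.

```python
from typing import Optional


def parse_cid_widths(w_array: list) -> dict[int, float]:
    """Expand a CID font /W array into a {cid: width} mapping."""
    widths: dict[int, float] = {}
    i = 0
    while i + 1 < len(w_array):
        start, second = w_array[i], w_array[i + 1]
        if isinstance(second, list):
            # Form 1: c [w1 w2 ... wn] assigns widths to consecutive CIDs.
            for offset, w in enumerate(second):
                widths[start + offset] = float(w)
            i += 2
        elif i + 2 < len(w_array):
            # Form 2: c_first c_last w assigns one width to a CID range.
            w = float(w_array[i + 2])
            for cid in range(start, second + 1):
                widths[cid] = w
            i += 3
        else:
            break  # Truncated array: stop rather than guess.
    return widths


def cid_font_widths(font_dict: dict) -> tuple[dict[int, float], Optional[dict]]:
    """Return (widths, font descriptor dict) from the first descendant font."""
    descendants = font_dict.get("/DescendantFonts")
    if not descendants:
        return {}, None
    descendant = descendants[0]
    return parse_cid_widths(descendant.get("/W", [])), descendant.get("/FontDescriptor")
```

CIDs that are not listed in /W fall back to the descendant font's /DW default width, which this sketch leaves to the caller.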

@PJBrs PJBrs marked this pull request as draft December 8, 2025 17:30
@PJBrs
Contributor Author

PJBrs commented Dec 8, 2025

@stefan6419846 I've done some more thinking about this, also in light of trying to fix creating appearance streams for Arabic text.

I think we took a bit of a wrong turn when we added the character widths in the CORE_FONT_METRICS dict to the FontDescriptor class. In a PDF document, widths are defined directly in the font dictionary, at the same level as the font descriptor. So, in keeping with the logic of a PDF font resource, a Font class should consist of:

  • A character map
  • An encoding
  • Character widths
  • A Font Descriptor.

Now, conceptually, our font metrics consist of character widths and a font descriptor, but our FontDescriptor class includes the character widths as well.

Furthermore, in order to solve the Arabic text case, we would at least need to update the document's font resource, for which it would be really nice to have an associated Font class.

I'd like to:

  • Remove character widths from FontDescriptor (and make the corresponding changes in CORE_FONT_METRICS and the script that produces it)
  • Add a Font class to pypdf/_font.py, consisting of character map, encoding, character widths, and font descriptor. This is almost the same as the existing class in pypdf/_text_extraction/_layout_mode/_font.py.
  • Initialise Font from font dictionary
  • Add code to initialise font from font file (not this PR)
  • Add code to either update or add a font resource, based on a file. (not this PR).
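
A rough sketch of the proposed structure, to make the plan above concrete. Everything here is an assumption for illustration: the class names `FontSketch` and `FontDescriptorSketch`, their fields, and the `from_font_dict` constructor are hypothetical and do not reflect the eventual pypdf API. A plain dict stands in for the PDF font dictionary, and widths are keyed by character code for simplicity.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class FontDescriptorSketch:
    """Font-wide metrics only; no character widths."""
    ascent: float = 0.0
    descent: float = 0.0


@dataclass
class FontSketch:
    """Mirrors a PDF font resource: widths sit next to the descriptor."""
    name: str
    character_map: dict[str, str] = field(default_factory=dict)
    encoding: Union[str, dict[int, str]] = ""
    character_widths: dict[int, float] = field(default_factory=dict)
    font_descriptor: FontDescriptorSketch = field(default_factory=FontDescriptorSketch)

    @classmethod
    def from_font_dict(cls, font_dict: dict) -> "FontSketch":
        # Widths are read from the font dictionary itself (simple fonts).
        first_char = font_dict.get("/FirstChar", 0)
        widths = {
            first_char + offset: float(w)
            for offset, w in enumerate(font_dict.get("/Widths", []))
        }
        # The descriptor is a sibling entry at the same level as the widths.
        raw = font_dict.get("/FontDescriptor", {})
        descriptor = FontDescriptorSketch(
            ascent=float(raw.get("/Ascent", 0.0)),
            descent=float(raw.get("/Descent", 0.0)),
        )
        return cls(
            name=str(font_dict.get("/BaseFont", "Unknown")),
            encoding=font_dict.get("/Encoding", ""),
            character_widths=widths,
            font_descriptor=descriptor,
        )
```

The point of the layout is that character widths live on the font, while the descriptor keeps only font-wide metrics, matching how the two are stored in the PDF font dictionary.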

PJBrs added 5 commits December 8, 2025 19:55
This patch copies over the logic for acquiring character widths
from a PDF font dictionary from the Font class in
pypdf/_text_extraction/_layout_mode/_font.py. Later on, this makes
it possible to initialize a FontDescriptor from an embedded font.
Replace the width_map in the Font class with the FontDescriptor
class.
@stefan6419846
Collaborator

I have to admit that I cannot completely follow you here due to my limited knowledge of how all the font aspects work and relate. If you plan another round of refactoring, please keep in mind to keep it understandable, maintainable and reviewable in a straightforward manner.
