Skip to content

Conversation

@steven10a
Copy link
Collaborator

@steven10a steven10a commented Nov 10, 2025

Remove dependency on presidio-anonymizer to prevent package clash

  • presidio-anonymizer is being used to mask detected entities. This PR implements that functionality ourselves without depending on an external library
  • Additionally, updates BIC_SWIFT detection which was resulting in too many false positives
  • Updated and added relevant tests

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR removes the dependency on presidio-anonymizer and implements custom text masking functionality to prevent package conflicts. The changes include a new lightweight anonymizer module, improved BIC/SWIFT detection patterns to reduce false positives, and comprehensive test coverage for both the anonymizer and BIC detection.

Key changes:

  • Implemented custom anonymizer in src/guardrails/utils/anonymizer.py to replace presidio-anonymizer
  • Enhanced BIC/SWIFT detection with context-aware patterns and bank code whitelisting to reduce false positives
  • Added baseline tests to ensure the custom anonymizer matches expected behavior

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/guardrails/utils/anonymizer.py New module implementing custom PII masking functionality with overlap resolution
src/guardrails/checks/text/pii.py Updated to use custom anonymizer and improved BIC/SWIFT detection patterns
src/guardrails/_base_client.py Migrated from presidio-anonymizer to custom anonymizer for structured content masking
tests/unit/checks/test_pii.py Added tests for BIC/SWIFT detection including false positive prevention
tests/unit/checks/test_anonymizer_baseline.py New baseline tests ensuring custom anonymizer produces expected results
pyproject.toml Removed presidio-anonymizer dependency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@steven10a steven10a requested a review from Copilot November 10, 2025 20:12
Copilot finished reviewing on behalf of steven10a November 10, 2025 20:13
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@steven10a steven10a requested a review from Copilot November 10, 2025 20:22
Copilot finished reviewing on behalf of steven10a November 10, 2025 20:26
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

"ANZB|NATA|WPAC|CTBA|" # Australia
"BKCH|MHCB|BOTK|" # Japan
"ICBK|ABOC|PCBC|" # China
"HSBC|SCBL|" # Hong Kong
Copy link

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"HSBC" is duplicated in the known_bank_codes string. It already appears in line 176 as part of the "Major international" banks. Consider removing the duplicate "HSBC" from the Hong Kong section to avoid redundancy.

Suggested change
"HSBC|SCBL|" # Hong Kong
"SCBL|" # Hong Kong

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit]

Comment on lines +162 to +165
# Check that at least the first format is detected
assert "<PHONE_NUMBER>" in result1.info["checked_text"] # noqa: S101


Copy link

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The texts_and_results list is created and appended to but never used. Consider removing this unused variable or adding assertions that use it.

Suggested change
# Check that at least the first format is detected
assert "<PHONE_NUMBER>" in result1.info["checked_text"] # noqa: S101
# Check that all formats are detected
for original, masked in texts_and_results:
assert "<PHONE_NUMBER>" in masked # noqa: S101

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fine unit test for now

"ICBK|ABOC|PCBC|" # China
"HSBC|SCBL|" # Hong Kong
"DBSS|OCBC|UOVB|" # Singapore
"CZNB|SHBK|KOEX|HVBK|NACF|IBKO|KODB|HNBN|CITI" # South Korea
Copy link

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"CITI" is duplicated in the known_bank_codes string. It already appears in line 176 as part of the "Major international" banks. Consider removing the duplicate "CITI" from the South Korea section to avoid redundancy.

Suggested change
"CZNB|SHBK|KOEX|HVBK|NACF|IBKO|KODB|HNBN|CITI" # South Korea
"CZNB|SHBK|KOEX|HVBK|NACF|IBKO|KODB|HNBN" # South Korea

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit]

@gabor-openai gabor-openai merged commit bf65130 into main Nov 10, 2025
9 checks passed
@gabor-openai gabor-openai deleted the dev/steven/remove_anonymizer branch November 10, 2025 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants