Conversation

@ywang1110 ywang1110 commented Nov 6, 2025

User description

This is a POC and cannot be merged as-is: it demonstrates the complete logic flow and can be tested via Postman, but it does not yet fit the current BotSharp architecture, and the vocabulary data source relies on local CSV files instead of database reads.

How to Test Locally

1. Copy Data Source

Copy the folder /BotSharp/src/Plugins/BotSharp.Plugin.FuzzySharp/data/fuzzySharp to /BotSharp/src/WebStarter/bin/Debug/net8.0/data/plugins.

2. Start the Application

Run the WebStarter project.

3. Run Tests via Postman
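No request collection is included in the PR, but a body along these lines should work against the analyze endpoint. The field names text, vocabulary_folder_name, and domain_term_mapping_file are inferred from the request model and the compliance notes on this page; the sample values and the exact route are placeholders:

```json
{
  "text": "acount blance inquiry",
  "vocabulary_folder_name": "banking",
  "domain_term_mapping_file": "domain_terms.csv"
}
```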


PR Type

Enhancement


Description

  • Add FuzzySharp plugin for domain-specific text analysis

    • Detects typos, exact matches, and domain term mappings
    • Supports n-gram processing with fuzzy matching
  • Implement text tokenization and vocabulary loading

    • Loads domain vocabularies from CSV files with caching
    • Supports domain term mapping for business abbreviations
  • Create REST API endpoint for text analysis

    • Returns flagged items with match types and confidence scores
    • Includes optional token output and processing time metrics
  • Integrate plugin into BotSharp architecture

    • Registers services via dependency injection
    • Adds the plugin to the solution and WebStarter configuration
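The n-gram matching described above can be sketched roughly as follows. This is illustrative C#, not the plugin's actual NgramProcessor or FuzzyMatcher (which use the FuzzySharp package's similarity scoring); the class and method names here are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NgramSketch
{
    // Yield every contiguous n-gram of the token list, longest spans first,
    // mirroring priority-based matching (longer phrases win over sub-spans).
    public static IEnumerable<(string Span, int Start, int Length)> Ngrams(
        IReadOnlyList<string> tokens, int maxN)
    {
        for (var n = Math.Min(maxN, tokens.Count); n >= 1; n--)
            for (var i = 0; i + n <= tokens.Count; i++)
                yield return (string.Join(" ", tokens.Skip(i).Take(n)), i, n);
    }

    // Classic Levenshtein edit distance, used here as a stand-in for the
    // FuzzySharp package's ratio-based scoring of candidate spans.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (var i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (var j = 0; j <= b.Length; j++) d[0, j] = j;
        for (var i = 1; i <= a.Length; i++)
            for (var j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }
}
```

A span whose edit distance to a vocabulary entry falls under a threshold would be flagged as a typo; distance zero is an exact match.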

Diagram Walkthrough

flowchart LR
  A["Text Input"] --> B["TextTokenizer"]
  B --> C["VocabularyService"]
  C --> D["NgramProcessor"]
  D --> E["Token Matchers"]
  E --> F["ResultProcessor"]
  F --> G["TextAnalysisResponse"]
  H["CSV Files"] -.-> C
  I["Domain Term Mapping"] -.-> E

File Walkthrough

Relevant files

Enhancement (21 files)

  • TextAnalysisRequest.cs: Text analysis request model with parameters (+52/-0)
  • INgramProcessor.cs: N-gram processing interface definition (+27/-0)
  • IResultProcessor.cs: Result deduplication and sorting interface (+18/-0)
  • ITextAnalysisService.cs: Main text analysis service interface (+13/-0)
  • ITokenMatcher.cs: Token matching interface and context models (+40/-0)
  • IVocabularyService.cs: Vocabulary and domain term loading interface (+9/-0)
  • FlaggedItem.cs: Flagged item model with match metadata (+50/-0)
  • TextAnalysisResponse.cs: Text analysis response model structure (+31/-0)
  • MatchReason.cs: Match type constants for results (+21/-0)
  • TextConstants.cs: Text tokenization separator characters (+30/-0)
  • FuzzySharpController.cs: REST API endpoint for text analysis (+61/-0)
  • FuzzySharpPlugin.cs: Plugin registration and dependency injection (+29/-0)
  • DomainTermMatcher.cs: Domain term mapping matcher implementation (+24/-0)
  • ExactMatcher.cs: Exact vocabulary match implementation (+24/-0)
  • FuzzyMatcher.cs: Fuzzy matching for typo correction (+82/-0)
  • NgramProcessor.cs: N-gram processing with priority-based matching (+134/-0)
  • ResultProcessor.cs: Result deduplication and sorting logic (+103/-0)
  • TextAnalysisService.cs: Main text analysis service implementation (+141/-0)
  • VocabularyService.cs: CSV vocabulary and domain term loading (+186/-0)
  • Using.cs: Global using statements for plugin (+10/-0)
  • TextTokenizer.cs: Text preprocessing and tokenization utilities (+64/-0)

Configuration changes (4 files)

  • BotSharp.sln: Add FuzzySharp plugin to solution (+11/-0)
  • BotSharp.Plugin.FuzzySharp.csproj: FuzzySharp plugin project file (+21/-0)
  • WebStarter.csproj: Add FuzzySharp plugin reference (+1/-0)
  • appsettings.json: Register FuzzySharp in plugin list (+2/-1)

Dependencies (1 file)

  • Directory.Packages.props: Add CsvHelper and FuzzySharp dependencies (+3/-0)


qodo-merge-pro bot commented Nov 6, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Path traversal

Description: The service loads all .csv files from a folder name supplied via request-controlled input
and constructs file paths under AppContext.BaseDirectory without explicit sanitization,
creating a possible path traversal or arbitrary file read risk if an attacker can
influence vocabulary_folder_name.
VocabularyService.cs [147-166]

Referred Code
private async Task<Dictionary<string, string>> LoadCsvFilesFromFolderAsync(string folderName)
{
    var csvFileDict = new Dictionary<string, string>();
    var searchFolder = Path.Combine(AppContext.BaseDirectory, "data", "plugins", "fuzzySharp", folderName);
    if (!Directory.Exists(searchFolder))
    {
        _logger.LogWarning($"Folder does not exist: {searchFolder}");
        return csvFileDict;
    }

    var csvFiles = Directory.GetFiles(searchFolder, "*.csv");
    foreach (var file in csvFiles)
    {
        var fileName = Path.GetFileNameWithoutExtension(file);
        csvFileDict[fileName] = file;
    }

    _logger.LogInformation($"Loaded {csvFileDict.Count} CSV files from {searchFolder}");
    return await Task.FromResult(csvFileDict);
}
Arbitrary file read

Description: The domain term mapping file path is built from a user-controlled filename appended to a
base folder; without strict allowlisting or sanitization, this may allow reading
unintended files via crafted domain_term_mapping_file.
VocabularyService.cs [56-115]

Referred Code
public async Task<Dictionary<string, (string DbPath, string CanonicalForm)>> LoadDomainTermMappingAsync(string? filename)
{
    var result = new Dictionary<string, (string DbPath, string CanonicalForm)>();
    if (string.IsNullOrWhiteSpace(filename))
    {
        return result;
    }

    var searchFolder = Path.Combine(AppContext.BaseDirectory, "data", "plugins", "fuzzySharp");
    var filePath = Path.Combine(searchFolder, filename);

    if (string.IsNullOrEmpty(filePath) || !File.Exists(filePath))
    {
        return result;
    }

    try
    {
        using var reader = new StreamReader(filePath); 
        using var csv = new CsvReader(reader, CreateCsvConfig());



 ... (clipped 39 lines)
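Both findings share the same fix pattern: resolve the combined path and reject anything that escapes the intended base folder before reading. A minimal sketch (the helper name is hypothetical, not part of the PR):

```csharp
using System;
using System.IO;

public static class SafePath
{
    // Resolve a caller-supplied name under a fixed base folder and return null
    // for anything that escapes it (e.g. "../../appsettings.json") or is an
    // absolute path (Path.Combine returns a rooted second argument as-is).
    public static string? ResolveUnder(string baseFolder, string name)
    {
        var baseFull = Path.GetFullPath(baseFolder)
            .TrimEnd(Path.DirectorySeparatorChar) + Path.DirectorySeparatorChar;
        var candidate = Path.GetFullPath(Path.Combine(baseFull, name));
        // Containment check only; callers still verify the file exists.
        // Case-insensitive comparison suits Windows; adjust per platform.
        return candidate.StartsWith(baseFull, StringComparison.OrdinalIgnoreCase)
            ? candidate
            : null;
    }
}
```

Stricter alternatives are an allowlist of folder names or rejecting any input containing directory separators or "..".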
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢 Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

🔴 Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Leaks exception info: The API returns internal exception messages to the client in the 500 response body (Error
analyzing text: {ex.Message}), potentially exposing internal details.

Referred Code
catch (Exception ex)
{
    _logger.LogError(ex, "Error analyzing text");
    return StatusCode(500, new { error = $"Error analyzing text: {ex.Message}" });
}
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing audit logs: The new POST endpoint performs text analysis without recording structured audit logs for
the request, user identity, parameters used, or outcome beyond a generic info log, which
may be required for auditing critical actions.

Referred Code
[ProducesResponseType(typeof(TextAnalysisResponse), StatusCodes.Status200OK)]
[ProducesResponseType(StatusCodes.Status400BadRequest)]
[ProducesResponseType(StatusCodes.Status500InternalServerError)]
public async Task<IActionResult> AnalyzeText([FromBody] TextAnalysisRequest request)
{
    try
    {
        if (string.IsNullOrWhiteSpace(request.Text))
        {
            return BadRequest(new { error = "Text is required" });
        }

        var result = await _textAnalysisService.AnalyzeTextAsync(request);
        return Ok(result);
    }
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Generic 500 message: The controller returns a 500 with exception message text, and service methods rely on
broad try/catch with rethrow, lacking explicit validation of external CSV inputs and
boundary cases across the pipeline.

Referred Code
catch (Exception ex)
{
    _logger.LogError(ex, "Error analyzing text");
    return StatusCode(500, new { error = $"Error analyzing text: {ex.Message}" });
}
Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Potential PII logging: Informational logs include text length and flagged item counts, and error logs may capture
exception context tied to user-provided text, which could risk sensitive data exposure
depending on inputs.

Referred Code
    _logger.LogInformation(
        $"Text analysis completed in {response.ProcessingTimeMs}ms | " +
        $"Text length: {request.Text.Length} chars | " +
        $"Flagged items: {flagged.Count}");

    return response;
}
catch (Exception ex)
{
    stopwatch.Stop();
    _logger.LogError(ex, $"Error analyzing text after {stopwatch.Elapsed.TotalMilliseconds}ms");
    throw;
}
Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Path handling risk: CSV file and folder inputs are combined with base directories without explicit
normalization or whitelisting, which may allow unintended file access if user-controlled
values are passed.

Referred Code
private async Task<Dictionary<string, string>> LoadCsvFilesFromFolderAsync(string folderName)
{
    var csvFileDict = new Dictionary<string, string>();
    var searchFolder = Path.Combine(AppContext.BaseDirectory, "data", "plugins", "fuzzySharp", folderName);
    if (!Directory.Exists(searchFolder))
    {
        _logger.LogWarning($"Folder does not exist: {searchFolder}");
        return csvFileDict;
    }

    var csvFiles = Directory.GetFiles(searchFolder, "*.csv");
    foreach (var file in csvFiles)
    {
        var fileName = Path.GetFileNameWithoutExtension(file);
        csvFileDict[fileName] = file;
    }

    _logger.LogInformation($"Loaded {csvFileDict.Count} CSV files from {searchFolder}");
    return await Task.FromResult(csvFileDict);
}
Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label


qodo-merge-pro bot commented Nov 6, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

High-level
Integrate data sources with the architecture

The current implementation uses local CSV files for data, which is unsuitable
for production. Refactor the data loading to integrate with the platform's
architecture by fetching data from a database or a dedicated service.

Examples:

src/Plugins/BotSharp.Plugin.FuzzySharp/Services/VocabularyService.cs [21-115]
        public async Task<Dictionary<string, HashSet<string>>> LoadVocabularyAsync(string? foldername)
        {
            var vocabulary = new Dictionary<string, HashSet<string>>();

            if (string.IsNullOrEmpty(foldername))
            {
                return vocabulary;
            }

            // Load CSV files from the folder

 ... (clipped 85 lines)

Solution Walkthrough:

Before:

// In VocabularyService.cs
public class VocabularyService : IVocabularyService
{
    public async Task<Dictionary<string, HashSet<string>>> LoadVocabularyAsync(string? foldername)
    {
        // ...
        var searchFolder = Path.Combine(AppContext.BaseDirectory, "data", "plugins", "fuzzySharp", foldername);
        var csvFiles = Directory.GetFiles(searchFolder, "*.csv");
        // ... load from files
    }

    public async Task<Dictionary<string, (string, string)>> LoadDomainTermMappingAsync(string? filename)
    {
        // ...
        var searchFolder = Path.Combine(AppContext.BaseDirectory, "data", "plugins", "fuzzySharp");
        var filePath = Path.Combine(searchFolder, filename);
        // ... load from file
    }
}

After:

// In VocabularyService.cs
public class VocabularyService : IVocabularyService
{
    private readonly IMyDatabaseService _dbService; // or a DbContext

    public VocabularyService(IMyDatabaseService dbService)
    {
        _dbService = dbService;
    }

    public async Task<Dictionary<string, HashSet<string>>> LoadVocabularyAsync(string? vocabularySourceId)
    {
        // Fetch vocabulary from a database or central service
        var vocabularyData = await _dbService.GetVocabularyByIdAsync(vocabularySourceId);
        // ... process data into the required dictionary format
        return processedVocabulary;
    }

    public async Task<Dictionary<string, (string, string)>> LoadDomainTermMappingAsync(string? mappingSourceId)
    {
        // Fetch domain terms from a database or central service
        var mappingData = await _dbService.GetDomainTermMappingAsync(mappingSourceId);
        // ... process data into the required dictionary format
        return processedMapping;
    }
}
Suggestion importance[1-10]: 9


Why: This suggestion correctly identifies the most critical architectural issue in the PR—the reliance on local file-based data sources—which is explicitly mentioned by the author as a reason this is a proof-of-concept.

High
General
Optimize regex normalization for performance

Optimize the Normalize method for performance by pre-compiling the regular
expression and reordering the string manipulation operations.

src/Plugins/BotSharp.Plugin.FuzzySharp/Services/Matching/FuzzyMatcher.cs [74-80]

+private static readonly Regex NormalizationRegex = new Regex(@"[^\w']+", RegexOptions.Compiled);
+
 private string Normalize(string text)
 {
-    // Replace non-word characters (except apostrophes) with spaces
-    var normalized = Regex.Replace(text, @"[^\w']+", " ", RegexOptions.IgnoreCase);
-    // Convert to lowercase, collapse multiple spaces, and trim
-    return Regex.Replace(normalized.ToLowerInvariant(), @"\s+", " ").Trim();
+    // Convert to lowercase first
+    var lowerText = text.ToLowerInvariant();
+    // Replace non-word characters (except apostrophes) with a single space
+    var normalized = NormalizationRegex.Replace(lowerText, " ");
+    // Trim and collapse multiple spaces that might be introduced
+    return Regex.Replace(normalized, @"\s+", " ").Trim();
 }
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies a performance bottleneck and proposes a valid optimization by pre-compiling a regex and reordering operations, which is beneficial as this method is in a hot path.

Low
Improve performance of n-gram extraction

Improve the performance of ExtractContentSpan by replacing LINQ's Skip and Take
methods with the more efficient List.GetRange.

src/Plugins/BotSharp.Plugin.FuzzySharp/Services/Processors/NgramProcessor.cs [124-132]

 private (string ContentSpan, List<string> Tokens, List<int> ContentIndices) ExtractContentSpan(
     List<string> tokens, 
     int startIdx, 
     int n)
 {
-    var span = tokens.Skip(startIdx).Take(n).ToList();
+    var span = tokens.GetRange(startIdx, n);
     var indices = Enumerable.Range(startIdx, n).ToList();
     return (string.Join(" ", span), span, indices);
 }
Suggestion importance[1-10]: 5


Why: The suggestion correctly points out that using GetRange is more performant than Skip and Take for List<T>, which is a valid optimization for code inside nested loops.

Low
Learned best practice
Guard nulls and hide exception details

Add a null check for the request object and avoid returning raw exception
messages; instead return a generic error. This prevents null-reference issues
and leaking internal details.

src/Plugins/BotSharp.Plugin.FuzzySharp/Controllers/FuzzySharpController.cs [42-59]

-public async Task<IActionResult> AnalyzeText([FromBody] TextAnalysisRequest request)
+public async Task<IActionResult> AnalyzeText([FromBody] TextAnalysisRequest? request)
 {
     try
     {
-        if (string.IsNullOrWhiteSpace(request.Text))
+        if (request == null || string.IsNullOrWhiteSpace(request.Text))
         {
             return BadRequest(new { error = "Text is required" });
         }
 
         var result = await _textAnalysisService.AnalyzeTextAsync(request);
         return Ok(result);
     }
     catch (Exception ex)
     {
         _logger.LogError(ex, "Error analyzing text");
-        return StatusCode(500, new { error = $"Error analyzing text: {ex.Message}" });
+        return StatusCode(500, new { error = "Error analyzing text" });
     }
 }
Suggestion importance[1-10]: 6


Why: Relevant best practice - Ensure nullability guards and safe fallbacks before property access and when returning error details to avoid null reference issues or leaking sensitive exception messages.

Low

@iceljc iceljc marked this pull request as draft November 6, 2025 22:04
iceljc commented Nov 6, 2025

Please remove any business related documents.
