Conversation

@ywang1110 ywang1110 commented Nov 6, 2025

User description

This is a POC and cannot be merged as-is: it demonstrates the complete logic flow and can be tested via Postman, but it does not yet fit the current BotSharp architecture, and the vocabulary data source relies on local CSV files instead of database reads.

How to Test Locally

1. Copy Data Source

Copy the folder /BotSharp/src/Plugins/BotSharp.Plugin.FuzzySharp/data/fuzzySharp to /BotSharp/src/WebStarter/bin/Debug/net8.0/data/plugins.

2. Start the Application

Run the WebStarter project.

3. Run Tests via Postman
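No request collection is included in the PR, but a body along these lines should work against the analyze endpoint. The field names text, vocabulary_folder_name, and domain_term_mapping_file are inferred from the request model and the compliance notes on this page; the sample values and the exact route are placeholders:

```json
{
  "text": "acount blance inquiry",
  "vocabulary_folder_name": "banking",
  "domain_term_mapping_file": "domain_terms.csv"
}
```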


PR Type

Enhancement


Description

  • Add FuzzySharp plugin for domain-specific text analysis

    • Detects typos, exact matches, and domain term mappings
    • Supports n-gram processing with fuzzy matching
  • Implement text tokenization and vocabulary loading

    • Loads domain vocabularies from CSV files with caching
    • Supports domain term mapping for business abbreviations
  • Create REST API endpoint for text analysis

    • Returns flagged items with match types and confidence scores
    • Includes optional token output and processing time metrics
  • Integrate plugin into BotSharp architecture

    • Registers services via dependency injection
    • Adds the plugin to the solution and WebStarter configuration
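The n-gram matching described above can be sketched roughly as follows. This is illustrative C#, not the plugin's actual NgramProcessor or FuzzyMatcher (which use the FuzzySharp package's similarity scoring); the class and method names here are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NgramSketch
{
    // Yield every contiguous n-gram of the token list, longest spans first,
    // mirroring priority-based matching (longer phrases win over sub-spans).
    public static IEnumerable<(string Span, int Start, int Length)> Ngrams(
        IReadOnlyList<string> tokens, int maxN)
    {
        for (var n = Math.Min(maxN, tokens.Count); n >= 1; n--)
            for (var i = 0; i + n <= tokens.Count; i++)
                yield return (string.Join(" ", tokens.Skip(i).Take(n)), i, n);
    }

    // Classic Levenshtein edit distance, used here as a stand-in for the
    // FuzzySharp package's ratio-based scoring of candidate spans.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (var i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (var j = 0; j <= b.Length; j++) d[0, j] = j;
        for (var i = 1; i <= a.Length; i++)
            for (var j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }
}
```

A span whose edit distance to a vocabulary entry falls under a threshold would be flagged as a typo; distance zero is an exact match.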

Diagram Walkthrough

flowchart LR
  A["Text Input"] --> B["TextTokenizer"]
  B --> C["VocabularyService"]
  C --> D["NgramProcessor"]
  D --> E["Token Matchers"]
  E --> F["ResultProcessor"]
  F --> G["TextAnalysisResponse"]
  H["CSV Files"] -.-> C
  I["Domain Term Mapping"] -.-> E

File Walkthrough

Relevant files

Enhancement (21 files)

  • TextAnalysisRequest.cs: Text analysis request model with parameters (+52/-0)
  • INgramProcessor.cs: N-gram processing interface definition (+27/-0)
  • IResultProcessor.cs: Result deduplication and sorting interface (+18/-0)
  • ITextAnalysisService.cs: Main text analysis service interface (+13/-0)
  • ITokenMatcher.cs: Token matching interface and context models (+40/-0)
  • IVocabularyService.cs: Vocabulary and domain term loading interface (+9/-0)
  • FlaggedItem.cs: Flagged item model with match metadata (+50/-0)
  • TextAnalysisResponse.cs: Text analysis response model structure (+31/-0)
  • MatchReason.cs: Match type constants for results (+21/-0)
  • TextConstants.cs: Text tokenization separator characters (+30/-0)
  • FuzzySharpController.cs: REST API endpoint for text analysis (+61/-0)
  • FuzzySharpPlugin.cs: Plugin registration and dependency injection (+29/-0)
  • DomainTermMatcher.cs: Domain term mapping matcher implementation (+24/-0)
  • ExactMatcher.cs: Exact vocabulary match implementation (+24/-0)
  • FuzzyMatcher.cs: Fuzzy matching for typo correction (+82/-0)
  • NgramProcessor.cs: N-gram processing with priority-based matching (+134/-0)
  • ResultProcessor.cs: Result deduplication and sorting logic (+103/-0)
  • TextAnalysisService.cs: Main text analysis service implementation (+141/-0)
  • VocabularyService.cs: CSV vocabulary and domain term loading (+186/-0)
  • Using.cs: Global using statements for plugin (+10/-0)
  • TextTokenizer.cs: Text preprocessing and tokenization utilities (+64/-0)

Configuration changes (4 files)

  • BotSharp.sln: Add FuzzySharp plugin to solution (+11/-0)
  • BotSharp.Plugin.FuzzySharp.csproj: FuzzySharp plugin project file (+21/-0)
  • WebStarter.csproj: Add FuzzySharp plugin reference (+1/-0)
  • appsettings.json: Register FuzzySharp in plugin list (+2/-1)

Dependencies (1 file)

  • Directory.Packages.props: Add CsvHelper and FuzzySharp dependencies (+3/-0)


qodo-merge-pro bot commented Nov 6, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Path traversal

Description: The service loads all .csv files from a folder name supplied via request-controlled input
and constructs file paths under AppContext.BaseDirectory without explicit sanitization,
creating a possible path traversal or arbitrary file read risk if an attacker can
influence vocabulary_folder_name.
VocabularyService.cs [147-166]

Referred Code
private async Task<Dictionary<string, string>> LoadCsvFilesFromFolderAsync(string folderName)
{
    var csvFileDict = new Dictionary<string, string>();
    var searchFolder = Path.Combine(AppContext.BaseDirectory, "data", "plugins", "fuzzySharp", folderName);
    if (!Directory.Exists(searchFolder))
    {
        _logger.LogWarning($"Folder does not exist: {searchFolder}");
        return csvFileDict;
    }

    var csvFiles = Directory.GetFiles(searchFolder, "*.csv");
    foreach (var file in csvFiles)
    {
        var fileName = Path.GetFileNameWithoutExtension(file);
        csvFileDict[fileName] = file;
    }

    _logger.LogInformation($"Loaded {csvFileDict.Count} CSV files from {searchFolder}");
    return await Task.FromResult(csvFileDict);
}
Arbitrary file read

Description: The domain term mapping file path is built from a user-controlled filename appended to a
base folder; without strict allowlisting or sanitization, this may allow reading
unintended files via crafted domain_term_mapping_file.
VocabularyService.cs [56-115]

Referred Code
public async Task<Dictionary<string, (string DbPath, string CanonicalForm)>> LoadDomainTermMappingAsync(string? filename)
{
    var result = new Dictionary<string, (string DbPath, string CanonicalForm)>();
    if (string.IsNullOrWhiteSpace(filename))
    {
        return result;
    }

    var searchFolder = Path.Combine(AppContext.BaseDirectory, "data", "plugins", "fuzzySharp");
    var filePath = Path.Combine(searchFolder, filename);

    if (string.IsNullOrEmpty(filePath) || !File.Exists(filePath))
    {
        return result;
    }

    try
    {
        using var reader = new StreamReader(filePath); 
        using var csv = new CsvReader(reader, CreateCsvConfig());



 ... (clipped 39 lines)
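Both findings share the same fix pattern: resolve the combined path and reject anything that escapes the intended base folder before reading. A minimal sketch (the helper name is hypothetical, not part of the PR):

```csharp
using System;
using System.IO;

public static class SafePath
{
    // Resolve a caller-supplied name under a fixed base folder and return null
    // for anything that escapes it (e.g. "../../appsettings.json") or is an
    // absolute path (Path.Combine returns a rooted second argument as-is).
    public static string? ResolveUnder(string baseFolder, string name)
    {
        var baseFull = Path.GetFullPath(baseFolder)
            .TrimEnd(Path.DirectorySeparatorChar) + Path.DirectorySeparatorChar;
        var candidate = Path.GetFullPath(Path.Combine(baseFull, name));
        // Containment check only; callers still verify the file exists.
        // Case-insensitive comparison suits Windows; adjust per platform.
        return candidate.StartsWith(baseFull, StringComparison.OrdinalIgnoreCase)
            ? candidate
            : null;
    }
}
```

Stricter alternatives are an allowlist of folder names or rejecting any input containing directory separators or "..".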
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢 Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

🔴 Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Leaks exception info: The API returns internal exception messages to the client in the 500 response body (Error
analyzing text: {ex.Message}), potentially exposing internal details.

Referred Code
catch (Exception ex)
{
    _logger.LogError(ex, "Error analyzing text");
    return StatusCode(500, new { error = $"Error analyzing text: {ex.Message}" });
}
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing audit logs: The new POST endpoint performs text analysis without recording structured audit logs for
the request, user identity, parameters used, or outcome beyond a generic info log, which
may be required for auditing critical actions.

Referred Code
[ProducesResponseType(typeof(TextAnalysisResponse), StatusCodes.Status200OK)]
[ProducesResponseType(StatusCodes.Status400BadRequest)]
[ProducesResponseType(StatusCodes.Status500InternalServerError)]
public async Task<IActionResult> AnalyzeText([FromBody] TextAnalysisRequest request)
{
    try
    {
        if (string.IsNullOrWhiteSpace(request.Text))
        {
            return BadRequest(new { error = "Text is required" });
        }

        var result = await _textAnalysisService.AnalyzeTextAsync(request);
        return Ok(result);
    }
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Generic 500 message: The controller returns a 500 with exception message text, and service methods rely on
broad try/catch with rethrow, lacking explicit validation of external CSV inputs and
boundary cases across the pipeline.

Referred Code
catch (Exception ex)
{
    _logger.LogError(ex, "Error analyzing text");
    return StatusCode(500, new { error = $"Error analyzing text: {ex.Message}" });
}
Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Potential PII logging: Informational logs include text length and flagged item counts, and error logs may capture
exception context tied to user-provided text, which could risk sensitive data exposure
depending on inputs.

Referred Code
    _logger.LogInformation(
        $"Text analysis completed in {response.ProcessingTimeMs}ms | " +
        $"Text length: {request.Text.Length} chars | " +
        $"Flagged items: {flagged.Count}");

    return response;
}
catch (Exception ex)
{
    stopwatch.Stop();
    _logger.LogError(ex, $"Error analyzing text after {stopwatch.Elapsed.TotalMilliseconds}ms");
    throw;
}
Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Path handling risk: CSV file and folder inputs are combined with base directories without explicit
normalization or whitelisting, which may allow unintended file access if user-controlled
values are passed.

Referred Code
private async Task<Dictionary<string, string>> LoadCsvFilesFromFolderAsync(string folderName)
{
    var csvFileDict = new Dictionary<string, string>();
    var searchFolder = Path.Combine(AppContext.BaseDirectory, "data", "plugins", "fuzzySharp", folderName);
    if (!Directory.Exists(searchFolder))
    {
        _logger.LogWarning($"Folder does not exist: {searchFolder}");
        return csvFileDict;
    }

    var csvFiles = Directory.GetFiles(searchFolder, "*.csv");
    foreach (var file in csvFiles)
    {
        var fileName = Path.GetFileNameWithoutExtension(file);
        csvFileDict[fileName] = file;
    }

    _logger.LogInformation($"Loaded {csvFileDict.Count} CSV files from {searchFolder}");
    return await Task.FromResult(csvFileDict);
}
Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label


qodo-merge-pro bot commented Nov 6, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

High-level
Integrate data sources with the architecture

The current implementation uses local CSV files for data, which is unsuitable
for production. Refactor the data loading to integrate with the platform's
architecture by fetching data from a database or a dedicated service.

Examples:

src/Plugins/BotSharp.Plugin.FuzzySharp/Services/VocabularyService.cs [21-115]
        public async Task<Dictionary<string, HashSet<string>>> LoadVocabularyAsync(string? foldername)
        {
            var vocabulary = new Dictionary<string, HashSet<string>>();

            if (string.IsNullOrEmpty(foldername))
            {
                return vocabulary;
            }

            // Load CSV files from the folder

 ... (clipped 85 lines)

Solution Walkthrough:

Before:

// In VocabularyService.cs
public class VocabularyService : IVocabularyService
{
    public async Task<Dictionary<string, HashSet<string>>> LoadVocabularyAsync(string? foldername)
    {
        // ...
        var searchFolder = Path.Combine(AppContext.BaseDirectory, "data", "plugins", "fuzzySharp", foldername);
        var csvFiles = Directory.GetFiles(searchFolder, "*.csv");
        // ... load from files
    }

    public async Task<Dictionary<string, (string, string)>> LoadDomainTermMappingAsync(string? filename)
    {
        // ...
        var searchFolder = Path.Combine(AppContext.BaseDirectory, "data", "plugins", "fuzzySharp");
        var filePath = Path.Combine(searchFolder, filename);
        // ... load from file
    }
}

After:

// In VocabularyService.cs
public class VocabularyService : IVocabularyService
{
    private readonly IMyDatabaseService _dbService; // or a DbContext

    public VocabularyService(IMyDatabaseService dbService)
    {
        _dbService = dbService;
    }

    public async Task<Dictionary<string, HashSet<string>>> LoadVocabularyAsync(string? vocabularySourceId)
    {
        // Fetch vocabulary from a database or central service
        var vocabularyData = await _dbService.GetVocabularyByIdAsync(vocabularySourceId);
        // ... process data into the required dictionary format
        return processedVocabulary;
    }

    public async Task<Dictionary<string, (string, string)>> LoadDomainTermMappingAsync(string? mappingSourceId)
    {
        // Fetch domain terms from a database or central service
        var mappingData = await _dbService.GetDomainTermMappingAsync(mappingSourceId);
        // ... process data into the required dictionary format
        return processedMapping;
    }
}
Suggestion importance[1-10]: 9


Why: This suggestion correctly identifies the most critical architectural issue in the PR—the reliance on local file-based data sources—which is explicitly mentioned by the author as a reason this is a proof-of-concept.

High
General
Optimize regex normalization for performance

Optimize the Normalize method for performance by pre-compiling the regular
expression and reordering the string manipulation operations.

src/Plugins/BotSharp.Plugin.FuzzySharp/Services/Matching/FuzzyMatcher.cs [74-80]

+private static readonly Regex NormalizationRegex = new Regex(@"[^\w']+", RegexOptions.Compiled);
+
 private string Normalize(string text)
 {
-    // Replace non-word characters (except apostrophes) with spaces
-    var normalized = Regex.Replace(text, @"[^\w']+", " ", RegexOptions.IgnoreCase);
-    // Convert to lowercase, collapse multiple spaces, and trim
-    return Regex.Replace(normalized.ToLowerInvariant(), @"\s+", " ").Trim();
+    // Convert to lowercase first
+    var lowerText = text.ToLowerInvariant();
+    // Replace non-word characters (except apostrophes) with a single space
+    var normalized = NormalizationRegex.Replace(lowerText, " ");
+    // Trim and collapse multiple spaces that might be introduced
+    return Regex.Replace(normalized, @"\s+", " ").Trim();
 }
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies a performance bottleneck and proposes a valid optimization by pre-compiling a regex and reordering operations, which is beneficial as this method is in a hot path.

Low
Improve performance of n-gram extraction

Improve the performance of ExtractContentSpan by replacing LINQ's Skip and Take
methods with the more efficient List.GetRange.

src/Plugins/BotSharp.Plugin.FuzzySharp/Services/Processors/NgramProcessor.cs [124-132]

 private (string ContentSpan, List<string> Tokens, List<int> ContentIndices) ExtractContentSpan(
     List<string> tokens, 
     int startIdx, 
     int n)
 {
-    var span = tokens.Skip(startIdx).Take(n).ToList();
+    var span = tokens.GetRange(startIdx, n);
     var indices = Enumerable.Range(startIdx, n).ToList();
     return (string.Join(" ", span), span, indices);
 }
Suggestion importance[1-10]: 5


Why: The suggestion correctly points out that using GetRange is more performant than Skip and Take for List<T>, which is a valid optimization for code inside nested loops.

Low
Learned best practice
Guard nulls and hide exception details

Add a null check for the request object and avoid returning raw exception
messages; instead return a generic error. This prevents null-reference issues
and leaking internal details.

src/Plugins/BotSharp.Plugin.FuzzySharp/Controllers/FuzzySharpController.cs [42-59]

-public async Task<IActionResult> AnalyzeText([FromBody] TextAnalysisRequest request)
+public async Task<IActionResult> AnalyzeText([FromBody] TextAnalysisRequest? request)
 {
     try
     {
-        if (string.IsNullOrWhiteSpace(request.Text))
+        if (request == null || string.IsNullOrWhiteSpace(request.Text))
         {
             return BadRequest(new { error = "Text is required" });
         }
 
         var result = await _textAnalysisService.AnalyzeTextAsync(request);
         return Ok(result);
     }
     catch (Exception ex)
     {
         _logger.LogError(ex, "Error analyzing text");
-        return StatusCode(500, new { error = $"Error analyzing text: {ex.Message}" });
+        return StatusCode(500, new { error = "Error analyzing text" });
     }
 }
Suggestion importance[1-10]: 6


Why: Relevant best practice - Ensure nullability guards and safe fallbacks before property access and when returning error details to avoid null reference issues or leaking sensitive exception messages.

Low

@iceljc iceljc marked this pull request as draft November 6, 2025 22:04
iceljc commented Nov 6, 2025

Please remove any business related documents.
