Hello Adam Mucha,
Welcome to Microsoft Q&A, and thank you for posting your question here.
I understand that you're facing an issue with digit recognition on your invoices, where digits separated by long distances are being read as separate numbers (e.g., 1 and 032 instead of 1032). Unfortunately, Azure does not have a built-in function to adjust the spacing threshold for recognizing digits as part of the same number.
However, here are some steps you can take to mitigate this issue:
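- If you are not already using Azure Form Recognizer (now called Azure AI Document Intelligence), consider it, as it is optimized for extracting structured data such as tables and numbers. Use the prebuilt invoice model or a custom model, depending on your invoice format.
- Develop post-processing logic to merge split digits by identifying patterns (e.g., spacing or context like thousands separators). Apply it only to fields that you expect to hold a single number, since it will also join numbers that are genuinely separate. For example:
import re
def merge_digits(text): return re.sub(r'(?<=\d)[ \t]+(?=\d)', '', text) # Basic example: removes spaces only between digits, leaving other text untouched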
- If using Form Recognizer’s custom model, include training data with spaced numbers to improve recognition accuracy. Use labeled data with expected outputs to guide the model on how to interpret such cases.
- Increase scanning resolution to 300 DPI or higher. You can use preprocessing techniques like binarization to improve OCR results.
- Explore the OCR API parameters to fine-tune recognition (if applicable). For instance, some OCR tools allow you to tweak settings for character separation.
- If none of the above resolves the issue, raise a support request to Azure for possible feature enhancements or technical guidance.
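If your OCR result exposes a position for each word (Document Intelligence returns a bounding polygon per word), the merge step above can take the actual gap into account, which gives you the spacing threshold that the service itself lacks. Below is a minimal sketch, not a definitive implementation. The `Word` structure, the `merge_split_numbers` function, and the `max_gap` value are all hypothetical names chosen for illustration; you would need to fill `Word` from your own OCR output and tune `max_gap` to your invoices.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word on a single text line (hypothetical structure)."""
    text: str
    left: float   # x coordinate of the word's left edge
    right: float  # x coordinate of the word's right edge


def merge_split_numbers(words, max_gap):
    """Join neighbouring digit-only words when the gap between them is at most max_gap."""
    merged = []
    for word in sorted(words, key=lambda w: w.left):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.text.isdigit()
            and word.text.isdigit()
            and word.left - prev.right <= max_gap
        ):
            # Replace the previous entry with the combined number.
            merged[-1] = Word(prev.text + word.text, prev.left, word.right)
        else:
            merged.append(word)
    return merged


# Assumed sample data: "1" and "032" are 24 units apart, "7" is far away.
line = [Word("Total:", 0, 50), Word("1", 60, 66), Word("032", 90, 110), Word("7", 300, 306)]
print([w.text for w in merge_split_numbers(line, max_gap=40)])
# prints ['Total:', '1032', '7']
```

The function expects the words of one text line at a time, so group the OCR output by line before calling it. A smaller `max_gap` keeps more numbers separate, while a larger one merges more aggressively.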
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close out the thread by upvoting and accepting this as the answer if it was helpful.