Why is the prebuilt 1040 model only reading a couple of lines from my example?

Question

When I tried the prebuilt 1040 model, it read at most four lines. I even enabled the add-on to see whether that would improve things, but it made no difference. How is anyone using this when the results are so bad? Since it is a prebuilt model, there is nothing I can do. It would be stupid to train a custom model on a document type that Microsoft has already trained.

Answer

@Pauline Nguyen

Welcome to Microsoft Q&A.

I understand your frustration with the 1040 prebuilt model not performing as expected. It sounds like it's not capturing the full content you need, which can be quite limiting.

Here are a few things you can try to improve the results:

Document Formatting: Ensure that the document is properly formatted. Sometimes, simple changes like adjusting the line spacing, font size, or layout can make a difference in how the model reads the text.
Document Quality: Check the quality of the document. If it's a scanned document, make sure the scan is clear and free of any distortions or blemishes.
Use Different Add-ons: Explore other add-ons or tools that might enhance the reading capabilities of the model.
Custom Model: While it might seem redundant to train a custom model on a document that Microsoft has already trained, it could still provide better results tailored to your specific needs. You might be able to refine the model to better understand your particular documents.
Feedback to Microsoft: Provide feedback to Microsoft about your experience with the prebuilt model. This can help them improve their models and address any issues.

I hope this helps. Let me know if you have any further questions or need additional assistance.

If this answers your query, please click "Upvote" and "Accept the answer", which may be beneficial to other community members reading this thread.

Answer

Hi Pauline Nguyen,

Welcome to Microsoft Q&A forum. Thanks for posting your query.

You're absolutely right that it shouldn’t be necessary to build a custom model when Microsoft already provides a prebuilt one for 1040 forms. However, if the prebuilt model is not extracting all the necessary information, here are some potential reasons and solutions:

Possible Causes of the Issue:

Page Segmentation & Processing Limitations:

The model may process only part of the document at a time. As seen with barcodes, breaking the document into sections improved extraction.

Some fields might be prioritized for extraction, leaving others ignored.

Document Parsing Method in Azure:

Azure’s prebuilt models often use region-based extraction rather than reading all of the text the way plain OCR does. If the layout of your form differs slightly from what the model was trained on, it may skip sections that don’t match the expected positions.

Processing Order & Internal Heuristics:

The prebuilt-1040 model might prioritize certain sections when analyzing the full document.

Potential Solutions & Workarounds:

Test with Document Splitting

Instead of processing the entire form, try splitting the PDF into smaller sections:

Top half vs. Bottom half

A single page at a time

Line-by-line text blocks
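The page-at-a-time option can be sketched roughly as below. This is only an illustration, not the service's own method: `analyze` is a hypothetical callable that you would write yourself to send one page range to the 1040 model and return the extracted fields as a plain dictionary. I believe the analyze operation accepts a page-range option in the form "1-3,5", but please confirm that against the SDK version you are using. The field names in the usage lines are made up.

```python
from typing import Callable, Dict, List


def page_ranges(page_count: int, pages_per_chunk: int = 1) -> List[str]:
    """Return 1-based page-range strings such as "1" or "2-3" covering the document."""
    if page_count < 1 or pages_per_chunk < 1:
        raise ValueError("page_count and pages_per_chunk must be positive")
    ranges = []
    for start in range(1, page_count + 1, pages_per_chunk):
        end = min(start + pages_per_chunk - 1, page_count)
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


def analyze_in_chunks(
    analyze: Callable[[str], Dict[str, str]],
    page_count: int,
    pages_per_chunk: int = 1,
) -> Dict[str, str]:
    """Call `analyze` once per page range and merge the extracted fields.

    `analyze` is a hypothetical, caller-supplied function: it takes a
    page-range string and returns {field_name: value} for that range.
    The first non-empty value seen for a field wins; later chunks only
    fill in fields that are still missing.
    """
    merged: Dict[str, str] = {}
    for pages in page_ranges(page_count, pages_per_chunk):
        for name, value in analyze(pages).items():
            if value and name not in merged:
                merged[name] = value
    return merged


# Usage with a stand-in for the real service call (field names are invented).
fake_results = {
    "1": {"TaxpayerName": "Jane Doe", "Wages": ""},
    "2": {"Wages": "85,000"},
}
print(analyze_in_chunks(lambda pages: fake_results[pages], page_count=2))
```

Keeping the service call behind a plain callable means you can compare a whole-document run against a chunked run without changing the merging logic.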

Use Prebuilt Layout as a Backup for Missing Text

If the prebuilt-1040 model still misses data, run the prebuilt-layout model separately to extract raw text.

Then manually combine both outputs for full extraction.
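Combining the two outputs could look something like the sketch below. It assumes you have already flattened the 1040 result into a dictionary of field values and the layout result into a list of text lines; those shapes, the function names, and the field names are assumptions for illustration, not the format the service returns.

```python
from typing import Dict, List, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def uncovered_lines(fields: Dict[str, Optional[str]], layout_lines: List[str]) -> List[str]:
    """Return layout lines that do not contain any extracted field value."""
    values = [normalize(v) for v in fields.values() if v]
    leftover = []
    for line in layout_lines:
        norm = normalize(line)
        # Skip blank lines; keep lines that no field value accounts for.
        if norm and not any(v in norm for v in values):
            leftover.append(line)
    return leftover


def combine(fields: Dict[str, Optional[str]], layout_lines: List[str]) -> dict:
    """Keep the non-empty structured fields and attach the text they missed."""
    return {
        "fields": {k: v for k, v in fields.items() if v},
        "unmatched_text": uncovered_lines(fields, layout_lines),
    }


# Usage with invented sample data.
fields = {"TaxpayerName": "Jane Doe", "Wages": "85,000", "Refund": None}
lines = ["Jane  DOE", "1 Wages 85,000", "35a Refund 1,200"]
print(combine(fields, lines))
```

The substring match is deliberately crude: a very short field value such as "1" would mark many lines as covered, so for real forms you may want to match on position or on longer values only.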

Final Recommendation:

One of the most effective ways to improve text extraction with the prebuilt-1040 model in Azure Document Intelligence is to split the document into smaller sections before processing. This approach has already proven successful with the prebuilt-layout model for barcode extraction, and the same technique can be applied to text extraction from the 1040 tax form.

Hope this helps.

Thank You.
