How to improve line breaks in Document Intelligence's OCR output?

Yathharthha Kaushal 0

Hello everyone,

I'm currently working with an image of code that I need to convert into text using OCR. The issue is that the code uses line breaks in a way that, if modified, could lead to compile/runtime errors.

Here's an example of the OCR output I am getting:

Paragraph:
public static int foo(int bar) {
==========
Paragraph:
bar++; if (bar < 10) bar = foo(bar);
==========
Paragraph:
int i = 0; int j = 0; while (i > foo(j - bar)) { j++; bar += j;
==========
Paragraph:
3
==========
Paragraph:
return bar;
==========
Paragraph:
}
==========

However, the code should look like this:

public static int foo(int bar) {
    bar++;
    if (bar < 10)
        bar = foo(bar);

    int i = 0;
    int j = 0;
    while (i > foo(j - bar)) {
        j++;
        bar += j;
    }

    return bar;
}

Here's the actual image used for the OCR:

[Image: actual image used for the OCR]

Is there any way to get Document Intelligence's OCR to preserve line breaks, so that the extracted code keeps its original formatting?

Best regards,
YK

Pavankumar Purilla 5,070 Reputation points Microsoft External Staff

2025-02-10T21:34:09.3633333+00:00

Hi Yathharthha Kaushal,
Greetings & Welcome to Microsoft Q&A forum! Thanks for posting your query!

To improve line breaks in OCR output, there are a few strategies you can try:

First, try using the prebuilt-layout model instead of the prebuilt-read model. The layout model extracts document structure in addition to text, which may make it more suitable for code extraction.
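
As a rough illustration of the suggestion above, here is a minimal Python sketch that calls prebuilt-layout and reads the result line by line rather than paragraph by paragraph. The "Paragraph:" labels in the question suggest the paragraphs collection was being printed, and each page in the result also carries a lines collection. The sketch assumes the azure-ai-formrecognizer package (DocumentAnalysisClient); the endpoint, key, and image path are placeholders, and the helper names are made up for this example. Whether the returned lines match the source line breaks for a given image is not guaranteed.

```python
def lines_from_result(result):
    """Return the text of every OCR line, page by page, in the order returned."""
    texts = []
    for page in result.pages:
        # page.lines holds one entry per detected visual line of text.
        for line in page.lines or []:
            texts.append(line.content)
    return texts


def analyze_with_layout(endpoint, key, image_path):
    """Run prebuilt-layout on a local image (hypothetical helper).

    The SDK is imported here so lines_from_result can be used on its own.
    """
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(image_path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-layout", document=f)
        return poller.result()


# Usage (placeholders, not real values):
# result = analyze_with_layout("<endpoint>", "<key>", "code.png")
# print("\n".join(lines_from_result(result)))
```

Printing the lines collection should give one entry per visual line, but it does not restore indentation by itself.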

If formatting issues still occur, a post-processing script can help fix misplaced line breaks and indentation.
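
One possible shape for such a post-processing script is sketched below in Python. It assumes you already have each OCR line's text together with the x-coordinate of its left edge (for example, taken from the first point of the line's bounding polygon), and it estimates indentation levels from those offsets. The function name, the (text, left_x) input format, and the tolerance value are assumptions made for this example, not part of the service.

```python
def reindent(lines, indent="    ", tolerance=5.0):
    """Rebuild indentation from horizontal offsets.

    lines: list of (text, left_x) pairs in top-to-bottom order.
    tolerance: offsets at or below this (same unit as left_x) count as noise.
    """
    if not lines:
        return ""
    base = min(x for _, x in lines)          # left edge of the least-indented line
    offsets = [x - base for _, x in lines]
    # Treat the smallest offset clearly above the noise level as one indent step.
    candidates = [o for o in offsets if o > tolerance]
    step = min(candidates) if candidates else None
    out = []
    for (text, _), offset in zip(lines, offsets):
        level = round(offset / step) if step else 0
        out.append(indent * level + text)
    return "\n".join(out)


sample = [
    ("public static int foo(int bar) {", 10),
    ("bar++;", 50),
    ("if (bar < 10)", 51),
    ("bar = foo(bar);", 90),
    ("return bar;", 49),
    ("}", 11),
]
print(reindent(sample))
```

This only restores indentation. Blank lines between statements and misread characters (such as the closing brace that came back as "3" in the output above) would need separate handling, for instance by looking at vertical gaps between lines.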

Another approach is to train a custom OCR model that recognizes code formatting more accurately.

I hope this information helps. Thank you.

Pavankumar Purilla 5,070 Reputation points Microsoft External Staff

2025-02-11T17:47:02.2966667+00:00

Hi Yathharthha Kaushal,
We haven’t heard back from you on the last response and are just checking to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, reply with more details and we will try to help.

Pavankumar Purilla 5,070 Reputation points Microsoft External Staff

2025-02-12T18:07:42.3766667+00:00

Hi Yathharthha Kaushal,
Just checking back to see if you have a resolution yet.
