Image processing to improve Tesseract OCR accuracy

小开

Best answer

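Fix the DPI (if needed); 300 DPI is the minimum
Fix the text size (e.g. 12 pt should be OK)
Try to fix the text lines (deskew and dewarp the text)
Try to fix the illumination of the image (e.g. no dark parts in the image)
Binarize and de-noise the image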
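
For the DPI point, the target pixel size is simple arithmetic. Here is a minimal sketch, assuming you know (or can estimate) the DPI of the scan; `size_for_target_dpi` is a hypothetical helper name, not part of Tesseract or any library:

```python
def size_for_target_dpi(width_px, height_px, current_dpi, target_dpi=300):
    """Return (new_width, new_height) that brings a scan up to target_dpi.

    Images already at or above the target keep their original size.
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    # Only scale up; never shrink an image that is already dense enough
    scale = max(1.0, target_dpi / current_dpi)
    return round(width_px * scale), round(height_px * scale)


# A letter-size page scanned at 100 DPI needs to be tripled in each dimension
print(size_for_target_dpi(850, 1100, 100))
```

The resulting size can then be passed to whatever resize routine you use.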

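There is no universal command line that fits all cases (sometimes you need to blur and sharpen the image). But you can try TEXTCLEANER from Fred's ImageMagick Scripts.

If you are not a fan of the command line, you could try the open-source scantailor.sourceforge.net or the commercial bookrestorer.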

小开

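I am by no means an OCR expert. But this week I needed to convert text out of a jpg.

I started with a color, RGB 445x747 pixel jpg. I immediately tried Tesseract on it, and the program converted almost nothing. I then went into GIMP and did the following.

Image > Mode > Grayscale
Image > Scale Image > 1191x2000 pixels
Filters > Enhance > Unsharp Mask
radius = 6.8, amount = 2.69, threshold = 0

I then saved it as a new jpg at 100% quality.

Tesseract was then able to extract all the text into a .txt file.

GIMP is your friend.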
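
For anyone wondering what the Unsharp Mask step does: it adds back a multiple of the difference between the image and a blurred copy of it. The following is a rough pure-Python sketch on a grayscale image stored as a list of rows. It uses a 3x3 box blur as a stand-in for GIMP's Gaussian blur, so the `radius` setting is not modeled, and the function names are my own:

```python
def box_blur(img):
    """3x3 mean blur with clamped edges (stand-in for a Gaussian blur)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


def unsharp_mask(img, amount=1.0, threshold=0):
    """sharpened = original + amount * (original - blurred), clamped to 0..255."""
    blurred = box_blur(img)
    out = []
    for row, blurred_row in zip(img, blurred):
        new_row = []
        for p, b in zip(row, blurred_row):
            diff = p - b
            # Differences smaller than the threshold are left untouched
            if abs(diff) >= threshold:
                p = p + amount * diff
            new_row.append(int(min(255, max(0, round(p)))))
        out.append(new_row)
    return out


edge = [[50, 50, 200, 200] for _ in range(3)]
print(unsharp_mask(edge)[1])
```

On the small edge image, the dark pixel next to the edge gets darker and the bright one gets brighter, which is why character outlines become crisper.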

小开

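Although this question is very old, this might still be useful.

In my experience, resizing the image in memory before passing it to Tesseract sometimes helps.

Try different interpolation modes. The post https://stackoverflow.com/a/4756906/146003 helped me a lot.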
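
To illustrate why the interpolation mode matters, here is a small pure-Python sketch that upscales one row of gray values with two modes. `resize_row` is a hypothetical helper written for this example; in real code you would pass an interpolation flag to your imaging library's resize function instead:

```python
def resize_row(row, new_len, mode="nearest"):
    """Resample a row of gray values to new_len samples."""
    factor = len(row) / new_len
    out = []
    for x in range(new_len):
        pos = x * factor                      # position in source coordinates
        left = int(pos)
        right = min(left + 1, len(row) - 1)   # clamp at the right border
        frac = pos - left
        if mode == "nearest":
            out.append(row[right] if frac >= 0.5 else row[left])
        elif mode == "linear":
            out.append(round((1 - frac) * row[left] + frac * row[right]))
        else:
            raise ValueError("unknown mode: " + mode)
    return out


row = [0, 100, 200]
print(resize_row(row, 6, "nearest"))  # blocky steps
print(resize_row(row, 6, "linear"))   # smooth ramp
```

Nearest-neighbor keeps hard steps (jagged glyph edges), while linear interpolation produces intermediate gray values, which changes what the binarization step sees afterwards.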

小开

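What was extremely helpful to me along the way was the source code of the Capture2Text project: http://sourceforge.net/projects/capture2text/files/capture2text/

BTW: kudos to its author for sharing such a painstaking algorithm.

Pay special attention to the file Capture2Text\SourceCode\leptonica_util\leptonica_util.c, which is the essence of the image preprocessing for this utility.

If you run the binaries, you can check the image transformation before/after the process in the Capture2Text\Output\ folder.

P.S. The mentioned solution uses Tesseract for OCR and Leptonica for preprocessing.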

小开

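Three points to improve the readability of the image:

Resize the image with variable height and width (multiply the image height and width by 0.5, 1 and 2).
Convert the image to grayscale format (black and white).
Remove the noise pixels to make the image clearer (filter the image).

Refer to the code below:

Resize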

public Bitmap Resize(Bitmap bmp, int newWidth, int newHeight)
{
    Bitmap temp = (Bitmap)bmp;
    Bitmap bmap = new Bitmap(newWidth, newHeight, temp.PixelFormat);

    // Ratio of source size to target size along each axis
    double nWidthFactor = (double)temp.Width / (double)newWidth;
    double nHeightFactor = (double)temp.Height / (double)newHeight;

    double fx, fy, nx, ny;
    int cx, cy, fr_x, fr_y;
    Color color1 = new Color();
    Color color2 = new Color();
    Color color3 = new Color();
    Color color4 = new Color();
    byte nRed, nGreen, nBlue;
    byte bp1, bp2;

    for (int x = 0; x < bmap.Width; ++x)
    {
        for (int y = 0; y < bmap.Height; ++y)
        {
            // Top-left source pixel and its right/bottom neighbors (clamped at the border)
            fr_x = (int)Math.Floor(x * nWidthFactor);
            fr_y = (int)Math.Floor(y * nHeightFactor);
            cx = fr_x + 1;
            if (cx >= temp.Width) cx = fr_x;
            cy = fr_y + 1;
            if (cy >= temp.Height) cy = fr_y;

            // Bilinear interpolation weights
            fx = x * nWidthFactor - fr_x;
            fy = y * nHeightFactor - fr_y;
            nx = 1.0 - fx;
            ny = 1.0 - fy;

            color1 = temp.GetPixel(fr_x, fr_y);
            color2 = temp.GetPixel(cx, fr_y);
            color3 = temp.GetPixel(fr_x, cy);
            color4 = temp.GetPixel(cx, cy);

            // Blue
            bp1 = (byte)(nx * color1.B + fx * color2.B);
            bp2 = (byte)(nx * color3.B + fx * color4.B);
            nBlue = (byte)(ny * (double)(bp1) + fy * (double)(bp2));

            // Green
            bp1 = (byte)(nx * color1.G + fx * color2.G);
            bp2 = (byte)(nx * color3.G + fx * color4.G);
            nGreen = (byte)(ny * (double)(bp1) + fy * (double)(bp2));

            // Red
            bp1 = (byte)(nx * color1.R + fx * color2.R);
            bp2 = (byte)(nx * color3.R + fx * color4.R);
            nRed = (byte)(ny * (double)(bp1) + fy * (double)(bp2));

            bmap.SetPixel(x, y, System.Drawing.Color.FromArgb(255, nRed, nGreen, nBlue));
        }
    }

    bmap = SetGrayscale(bmap);
    bmap = RemoveNoise(bmap);

    return bmap;
}

SetGrayscale

public Bitmap SetGrayscale(Bitmap img)
{
    Bitmap temp = (Bitmap)img;
    Bitmap bmap = (Bitmap)temp.Clone();
    Color c;
    for (int i = 0; i < bmap.Width; i++)
    {
        for (int j = 0; j < bmap.Height; j++)
        {
            c = bmap.GetPixel(i, j);
            // Weighted sum of the channels (luma weights)
            byte gray = (byte)(.299 * c.R + .587 * c.G + .114 * c.B);
            bmap.SetPixel(i, j, Color.FromArgb(gray, gray, gray));
        }
    }
    return (Bitmap)bmap.Clone();
}

RemoveNoise

public Bitmap RemoveNoise(Bitmap bmap)
{
    for (var x = 0; x < bmap.Width; x++)
    {
        for (var y = 0; y < bmap.Height; y++)
        {
            var pixel = bmap.GetPixel(x, y);
            // Push clearly dark pixels to black and clearly light pixels to white
            if (pixel.R < 162 && pixel.G < 162 && pixel.B < 162)
                bmap.SetPixel(x, y, Color.Black);
            else if (pixel.R > 162 && pixel.G > 162 && pixel.B > 162)
                bmap.SetPixel(x, y, Color.White);
        }
    }

    return bmap;
}

[Input image]

[Output image]

小开

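Adaptive thresholding is important if the lighting is uneven across the image. My preprocessing using GraphicsMagick is mentioned in this post: https://groups.google.com/forum/#!topic/tesseract-ocr/jongschlrv4

GraphicsMagick also has the -lat option for linear-time adaptive thresholding, which I will try soon.

Another thresholding method, using OpenCV, is described here: https://docs.opencv.org/4.x/d7/d4d/tutorial_py_thresholding.html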
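
To show why a single global threshold fails under uneven lighting, here is a minimal pure-Python sketch of mean-based adaptive thresholding. This is not the GraphicsMagick or OpenCV implementation; the function names, the window `radius` and the offset `c` are assumptions chosen for the example:

```python
def global_threshold(img, t=128):
    """One cutoff for the whole image."""
    return [[255 if p > t else 0 for p in row] for row in img]


def adaptive_mean_threshold(img, radius=2, c=10):
    """Compare each pixel with the mean of its neighborhood minus c."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [img[yy][xx] for yy in ys for xx in xs]
            local_mean = sum(window) / len(window)
            out[y][x] = 255 if img[y][x] > local_mean - c else 0
    return out


# Background fades from 240 to 100; "ink" pixels sit at index 2 and 5
page = [[240, 220, 120, 180, 160, 60, 120, 100]]
print(global_threshold(page)[0])         # the dark end of the background turns black
print(adaptive_mean_threshold(page)[0])  # only the two ink pixels turn black
```

The global cutoff treats the dim part of the background as ink, while the local comparison only flags pixels that are darker than their surroundings.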

小开

Java (Android) version of Sathyaraj's code above:

import android.graphics.Bitmap;
import android.graphics.Color;

// Resize (bilinear interpolation), then grayscale and noise removal
public Bitmap resize(Bitmap img, int newWidth, int newHeight) {
    // Create the output at the requested size; copying img would keep the old size
    Bitmap bmap = Bitmap.createBitmap(newWidth, newHeight, img.getConfig());

    double nWidthFactor = (double) img.getWidth() / (double) newWidth;
    double nHeightFactor = (double) img.getHeight() / (double) newHeight;

    double fx, fy, nx, ny;
    int cx, cy, fr_x, fr_y;
    int color1, color2, color3, color4;
    // Java bytes are signed, so channel values (0..255) are kept as ints
    int nRed, nGreen, nBlue;
    int bp1, bp2;

    for (int x = 0; x < bmap.getWidth(); ++x) {
        for (int y = 0; y < bmap.getHeight(); ++y) {
            fr_x = (int) Math.floor(x * nWidthFactor);
            fr_y = (int) Math.floor(y * nHeightFactor);
            cx = fr_x + 1;
            if (cx >= img.getWidth())
                cx = fr_x;
            cy = fr_y + 1;
            if (cy >= img.getHeight())
                cy = fr_y;
            fx = x * nWidthFactor - fr_x;
            fy = y * nHeightFactor - fr_y;
            nx = 1.0 - fx;
            ny = 1.0 - fy;

            color1 = img.getPixel(fr_x, fr_y);
            color2 = img.getPixel(cx, fr_y);
            color3 = img.getPixel(fr_x, cy);
            color4 = img.getPixel(cx, cy);

            // Blue
            bp1 = (int) (nx * Color.blue(color1) + fx * Color.blue(color2));
            bp2 = (int) (nx * Color.blue(color3) + fx * Color.blue(color4));
            nBlue = (int) (ny * bp1 + fy * bp2);

            // Green
            bp1 = (int) (nx * Color.green(color1) + fx * Color.green(color2));
            bp2 = (int) (nx * Color.green(color3) + fx * Color.green(color4));
            nGreen = (int) (ny * bp1 + fy * bp2);

            // Red
            bp1 = (int) (nx * Color.red(color1) + fx * Color.red(color2));
            bp2 = (int) (nx * Color.red(color3) + fx * Color.red(color4));
            nRed = (int) (ny * bp1 + fy * bp2);

            bmap.setPixel(x, y, Color.argb(255, nRed, nGreen, nBlue));
        }
    }

    bmap = setGrayscale(bmap);
    bmap = removeNoise(bmap);

    return bmap;
}

// SetGrayscale
private Bitmap setGrayscale(Bitmap img) {
    Bitmap bmap = img.copy(img.getConfig(), true);
    int c;
    for (int i = 0; i < bmap.getWidth(); i++) {
        for (int j = 0; j < bmap.getHeight(); j++) {
            c = bmap.getPixel(i, j);
            int gray = (int) (.299 * Color.red(c) + .587 * Color.green(c)
                    + .114 * Color.blue(c));
            bmap.setPixel(i, j, Color.argb(255, gray, gray, gray));
        }
    }
    return bmap;
}

// RemoveNoise
private Bitmap removeNoise(Bitmap bmap) {
    for (int x = 0; x < bmap.getWidth(); x++) {
        for (int y = 0; y < bmap.getHeight(); y++) {
            int pixel = bmap.getPixel(x, y);
            // Push clearly dark pixels to black and clearly light pixels to white
            if (Color.red(pixel) < 162 && Color.green(pixel) < 162 && Color.blue(pixel) < 162) {
                bmap.setPixel(x, y, Color.BLACK);
            } else if (Color.red(pixel) > 162 && Color.green(pixel) > 162 && Color.blue(pixel) > 162) {
                bmap.setPixel(x, y, Color.WHITE);
            }
        }
    }
    return bmap;
}

小开

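I did these steps to get good results out of an image whose text is not very small:

Apply blur to the original image.
Apply adaptive thresholding.
Apply a sharpening effect.

If you still don't get good results, scale the image to 150% or 200%.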

小开

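The Tesseract documentation contains some good details on how to improve OCR quality via image processing steps.

To some degree, Tesseract applies them automatically. It is also possible to tell Tesseract to write an intermediate image for inspection, i.e. to check how well the internal image processing works (search for tessedit_write_images in the above reference).

More importantly, the new neural network system in Tesseract 4 yields much better OCR results, in general and especially for images with some noise. It is enabled with --oem 1, e.g.: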

$ tesseract --oem 1 -l deu page.png result pdf

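(this example selects the German language)

Thus, it makes sense to first test how far you get with the new Tesseract LSTM mode before applying custom image preprocessing steps.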
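
If you call Tesseract from a script, the options above can be assembled programmatically. This is a minimal sketch, assuming the `tesseract` binary is on your PATH; `build_tesseract_command` is a hypothetical helper, and the argument order follows the usual `tesseract imagename outputbase [options] [configfile]` form:

```python
def build_tesseract_command(image, output_base, lang="eng", oem=1, psm=None,
                            write_images=False, output_format=None):
    """Return the argument list for a tesseract invocation (does not run it)."""
    cmd = ["tesseract", image, output_base, "--oem", str(oem), "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if write_images:
        # Ask Tesseract to dump its intermediate image for inspection
        cmd += ["-c", "tessedit_write_images=true"]
    if output_format:
        cmd.append(output_format)  # e.g. "pdf" or "txt"
    return cmd


cmd = build_tesseract_command("page.png", "result", lang="deu", output_format="pdf")
print(" ".join(cmd))
```

The list can then be handed to `subprocess.run(cmd, check=True)`; building it separately makes it easy to log or compare the exact options between runs.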

小开

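Reading text from image documents with any OCR engine runs into several issues that affect accuracy. There is no fixed solution for all cases, but here are a few things that should be considered to improve OCR results.

1) Presence of noise due to poor image quality or unwanted elements/blobs in the background region. This requires pre-processing operations such as noise removal, which can easily be done with a Gaussian filter or ordinary median filter methods. These are also available in OpenCV.

2) Wrong orientation of the image: because of the wrong orientation, the OCR engine fails to segment the lines and words in the image correctly, which gives the worst accuracy.

3) Presence of lines: while doing word or line segmentation, the OCR engine sometimes also tries to merge words and lines together, so it processes the wrong content and gives wrong results. There are other issues as well, but these are the basic ones.

This post on an OCR application is an example case where some image pre-processing, plus post-processing of the OCR result, can be applied to get better OCR accuracy.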
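
As an illustration of point 1, a median filter replaces each pixel with the median of its neighborhood, which removes isolated specks while keeping edges fairly sharp. A minimal pure-Python sketch on a grayscale image stored as a list of rows (in practice OpenCV's `cv2.medianBlur` does this much faster):

```python
from statistics import median


def median_filter(img, radius=1):
    """Replace each pixel by the median of its (2*radius+1)^2 neighborhood."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Border pixels are handled by clamping coordinates to the image
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            out[y][x] = median(window)
    return out


# A light background with a single dark speck in the middle
noisy = [[200] * 5 for _ in range(5)]
noisy[2][2] = 0
cleaned = median_filter(noisy)
print(cleaned[2])
```

A single speck never forms the majority of any 3x3 window, so it disappears; real strokes that are several pixels wide survive.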

小开

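As a rule of thumb, I usually apply the following image pre-processing techniques using the OpenCV library:

Rescaling the image (recommended if you're working with images that have a DPI of less than 300 dpi):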
```
img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
```

Converting image to grayscale:

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Applying dilation and erosion to remove the noise (you may play with the kernel size depending on your data set):

kernel = np.ones((1, 1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)

Applying blur, which can be done by using one of the following lines (each of which has its pros and cons, however, median blur and bilateral filter usually perform better than gaussian blur.):

cv2.threshold(cv2.GaussianBlur(img, (5, 5), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]


cv2.threshold(cv2.bilateralFilter(img, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]


cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]


cv2.adaptiveThreshold(cv2.GaussianBlur(img, (5, 5), 0), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)


cv2.adaptiveThreshold(cv2.bilateralFilter(img, 9, 75, 75), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)


cv2.adaptiveThreshold(cv2.medianBlur(img, 3), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
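
For readers wondering what `cv2.THRESH_OTSU` does in the lines above: Otsu's method picks the cutoff that maximizes the between-class variance of the gray-level histogram. Here is a minimal pure-Python sketch of that idea (my own illustrative code, not OpenCV's implementation), operating on a flat list of 8-bit gray values:

```python
def otsu_threshold(pixels):
    """Return t so that splitting into (<= t) and (> t) maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class
    sum0 = 0    # sum of gray values in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


# Dark ink around 30-40, light paper around 200-210
pixels = [30] * 50 + [40] * 50 + [200] * 60 + [210] * 40
t = otsu_threshold(pixels)
print(t)
```

For a clearly bimodal histogram like this one, the threshold lands between the ink and paper clusters, which is why blurring first (to make the histogram cleaner) tends to help.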

I've recently written a pretty simple guide to Tesseract but it should enable you to write your first OCR script and clear up some hurdles that I experienced when things were less clear than I would have liked in the documentation.

In case you'd like to check them out, here I'm sharing the links with you:

小开

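Text recognition depends on a variety of factors to produce good-quality output. OCR output depends heavily on the quality of the input image. This is why every OCR engine provides guidelines regarding the quality and size of the input image. These guidelines help the OCR engine produce accurate results.

I have written a detailed article on image processing in Python. Please follow the link below for more explanation. Python source code implementing these processes is included as well.

Please write a comment if you have a suggestion or a better idea on this topic to improve it.

https://medium.com/cashify-engineering/improve-accuracy-of-ocr-using-image-preprocessing-8df29ec3a033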

小开

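You can do noise reduction and then apply thresholding, but you can also tune the configuration of the OCR by changing the --psm and --oem values.

Try: --psm 5 --oem 2

You can also look at the following link for further details.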
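
Since the best combination depends on the page layout, it can help to try a few combinations systematically. A small sketch that only generates the option strings; the mode descriptions are paraphrased from `tesseract --help-extra`, and `candidate_configs` is a hypothetical helper:

```python
from itertools import product

# A few page segmentation modes (--psm)
PSM_MODES = {
    3: "fully automatic page segmentation, no OSD (default)",
    5: "single uniform block of vertically aligned text",
    6: "single uniform block of text",
}

# OCR engine modes (--oem)
OEM_MODES = {
    0: "legacy engine only",
    1: "LSTM engine only",
    2: "legacy + LSTM engines",
}


def candidate_configs(psm_values=PSM_MODES, oem_values=OEM_MODES):
    """Return every '--psm X --oem Y' combination as a string."""
    return [f"--psm {psm} --oem {oem}" for psm, oem in product(psm_values, oem_values)]


for config in candidate_configs([5, 6], [1, 2]):
    print(config)
```

Each string can be appended to the tesseract command line, and the outputs compared on a few representative pages.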

小开

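So far, I have used Tesseract 3.x, 4.x and 5.0.0. In my experience, Tesseract 4.x and 5.x yield exactly the same accuracy.

Sometimes I get better results with the legacy engine (using --oem 0) and sometimes with the LSTM engine (--oem 1). Generally speaking, I get the best results on upscaled images with the LSTM engine. The latter is on par with my earlier engine (ABBYY CLI OCR 11 for Linux).

Of course, the trained data needs to be downloaded from GitHub, since most Linux distributions only provide the fast versions. Trained data that works for both the legacy and LSTM engines can be downloaded from https://github.com/tesseract-ocr/tessdata with commands like the following. Don't forget to download the OSD trained data too.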

curl -L https://github.com/tesseract-ocr/tessdata/blob/main/eng.traineddata?raw=true -o /usr/share/tesseract/tessdata/eng.traineddata
curl -L https://github.com/tesseract-ocr/tessdata/blob/main/osd.traineddata?raw=true -o /usr/share/tesseract/tessdata/osd.traineddata

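I ended up using ImageMagick as my image preprocessor, since it is convenient and easy to script. You can install it with yum install ImageMagick or apt install imagemagick, depending on your distribution.

So here is my one-liner preprocessor that handles most of the stuff I feed to my OCR: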

convert my_document.jpg -units PixelsPerInch -respect-parenthesis \( -compress LZW -resample 300 -bordercolor black -border 1 -trim +repage -fill white -draw "color 0,0 floodfill" -alpha off -shave 1x1 \) \( -bordercolor black -border 2 -fill white -draw "color 0,0 floodfill" -alpha off -shave 0x1 -deskew 40 +repage \) -antialias -sharpen 0x3 preprocessed_my_document.tiff

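Basically we:

use the TIFF format, since Tesseract likes it more than JPG (decompressor related, who knows)
use lossless LZW TIFF compression
resample the image to 300 dpi
use some black magic to remove unwanted colors
try to rotate the page if rotation can be detected
antialias the image
sharpen the text

The resulting image can then be fed to Tesseract with: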

tesseract -l eng preprocessed_my_document.tiff - --oem 1 --psm 1

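By the way, some years ago I wrote the "poor man's OCR server" (pmocr), which watches a given directory for changed files and launches OCR on all files that have not been OCRed yet. pmocr is compatible with Tesseract 3.x-5.x and abbyyocr11. See the pmocr project on GitHub.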