Tesseract running error

小开

You can call the tesseract API functions from C++ code:

#include <cstdio>                // printf
#include <tesseract/baseapi.h>
#include <tesseract/ocrclass.h>  // ETEXT_DESC


using namespace tesseract;


class TessAPI : public TessBaseAPI {
public:
void PrintRects(int len);
};


...
TessAPI *api = new TessAPI();
int res = api->Init(NULL, "rus");
api->SetAccuracyVSpeed(AVS_MOST_ACCURATE);
api->SetImage(data, w0, h0, bpp, stride);
api->SetRectangle(x0,y0,w0,h0);


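char *text;
ETEXT_DESC monitor;
api->RecognizeForChopTest(&monitor);
text = api->GetUTF8Text();
printf("text: %s\n", text);
// count and progress are integers, so print them with %d rather than %s
printf("m.count: %d\n", monitor.count);
printf("m.progress: %d\n", monitor.progress);
delete[] text;  // GetUTF8Text() returns a buffer allocated with new[]


api->RecognizeForChopTest(&monitor);
text = api->GetUTF8Text();
printf("text: %s\n", text);
delete[] text;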
...
api->End();

Then compile the code:

g++ -g -I. -I/usr/local/include -o _test test.cpp -ltesseract_api -lfreeimageplus

(I needed FreeImage to load the picture.)

小开

You can grab eng.traineddata from GitHub:

wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata

Check https://github.com/tesseract-ocr/tessdata for the full list of trained language data.

When you grab the file(s), move them to the /usr/local/share/tessdata folder. Warning: some Linux distributions (such as openSUSE and Ubuntu) may be expecting it in /usr/share/tessdata instead.

# If you got the data from Google, unzip it first!
gunzip eng.traineddata.gz
# Move the data
sudo mv -v eng.traineddata /usr/local/share/tessdata/
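
Because the expected folder differs between systems, it can help to check where a language file actually ended up. The following is a minimal Python sketch, not part of Tesseract itself: the function name `find_traineddata` and the `CANDIDATE_DIRS` tuple are hypothetical, and the listed folders are only the ones mentioned in this thread.

```python
from pathlib import Path
from typing import Iterable, Optional

# Folders mentioned in this thread; your system may use a different one.
CANDIDATE_DIRS = (
    "/usr/local/share/tessdata",
    "/usr/share/tessdata",
    "/usr/share/tesseract-ocr/tessdata",
)


def find_traineddata(lang: str, dirs: Iterable[str] = CANDIDATE_DIRS) -> Optional[Path]:
    """Return the first existing <lang>.traineddata file, or None if none is found."""
    for directory in dirs:
        candidate = Path(directory) / f"{lang}.traineddata"
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_traineddata("eng")` returns the path of the first match, or `None` if the file is in none of the listed folders.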

小开

The easiest way is to install the required packages:

sudo apt-get install tesseract-ocr-eng  #for english
sudo apt-get install tesseract-ocr-tam  #for tamil
sudo apt-get install tesseract-ocr-deu  #for deutsch (German)

As you can see, the same naming pattern works for other languages (e.g. tesseract-ocr-fra).

小开

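None of the previous solutions worked for me.

I installed through apt-get, downloaded the tessdata manually, moved it around under /usr, and so on. Nothing worked, even though I exported the variable a thousand times.

Finally, as a last attempt before starting to cry, I tried passing the path directly to the Tesseract() instance.

For the record, I am using the tesserwrap module.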

小开

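I had this error too on a Windows machine.

My solution:

1) Download your language files from https://github.com/tesseract-ocr/tessdata/tree/3.04.00

For example, for eng I downloaded all files with the eng prefix.

2) Put them into a tessdata directory inside some folder, and add that folder to the system environment variables as TESSDATA_PREFIX.

The result is the system environment variable TESSDATA_PREFIX=D:/Java/OCR, and the OCR folder contains tessdata with the language files.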


小开

I'm using Visual Studio 2017 Community Edition.
I solved this by creating a directory named tessdata in the project's Debug directory. Then I put the eng.traineddata file into that directory.

小开

tesseract  --tessdata-dir <tessdata-folder> <image-path> stdout --oem 2 -l <lng>

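In my case, these are the mistakes I made and the attempts I tried:

1) I cloned the GitHub repo and copied the files to
- /usr/local/share/tessdata/
- /usr/share/tesseract-ocr/tessdata/
- /usr/share/tessdata/
2) Used TESSDATA_PREFIX with the above paths
3) sudo apt-get install tesseract-ocr-eng

The first two attempts did not work because the files from git clone did not work, for reasons I do not know. I also do not know why the third attempt worked for me.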

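Finally,

1) I downloaded the eng.traineddata file using wget
2) Copied it to some directory
3) Used --tessdata-dir with the directory name

The takeaway for me is to learn the tool well and use it directly, rather than relying on the package manager's installation and directories.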
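
For anyone scripting this, the same invocation can be driven from Python with `subprocess`. This is only a sketch under assumptions: the helper names `build_tesseract_command` and `run_ocr` are made up for illustration, the `tesseract` binary is assumed to be on the PATH, and the `--oem` option from the command above is left out for brevity.

```python
import subprocess
from typing import List


def build_tesseract_command(image_path: str, tessdata_dir: str, lang: str) -> List[str]:
    """Build the argument list: explicit tessdata folder, text written to stdout."""
    return [
        "tesseract",
        "--tessdata-dir", tessdata_dir,
        image_path,
        "stdout",
        "-l", lang,
    ]


def run_ocr(image_path: str, tessdata_dir: str, lang: str = "eng") -> str:
    """Run tesseract and return the recognized text; raises if tesseract fails."""
    cmd = build_tesseract_command(image_path, tessdata_dir, lang)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the tessdata path contains spaces.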

小开

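I am on Windows, and I tried all of the solutions above, but none of them worked.

In the end, I installed Tesseract-OCR on the D drive (where I run my Python scripts) instead of the C drive, and it worked.

So, if you are on Windows, run your Python script from the same drive as your Tesseract-OCR installation.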

小开

Windows users:

In Environment Variables, add a new system variable named "TESSDATA_PREFIX" with the value "C:\Program Files (x86)\Tesseract-OCR\tessdata"

小开

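C# developer working on Windows here. I just had to download the eng.traineddata file from the URL below:

https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata

and copy it to the following directory in my console application project:

[Project Directory]\bin\Debug\tessdata

I did have to create the tessdata folder above manually.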

小开

tessdata_dir_config = r'--tessdata-dir "/usr/local/Cellar/tesseract/4.1.1/share/tessdata"'
pytesseract.image_to_string(imgCrop,lang='eng',config=tessdata_dir_config)

小开

Add the following lines to your code:

instance.setDatapath("C:\\somepath\\tessdata");


instance.setLanguage("eng");

小开

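How I solved this on Manjaro Xfce:

The message was: "TesseractError: (1, 'Error opening data file /home/julio/nap/tesseract/common/eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.')"

Then, on my Manjaro, I typed: sudo pacman -S tesseract, and the system installed "tesseract" and a package named "leptonica".

After this step I thought everything was fine and tried to run my simple script. However, the error message changed to this (the earlier "/home" location was replaced by another one, something like "/usr"): "Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract."

Then I realized that this message had appeared while I was installing "tesseract" with pacman: "you must install one of the tesseract-data-* packages or the whole tesseract-data group"

So I tried the command "sudo pacman -S tesseract-data", and the system offered me many language options. I picked a few languages, installed them as follows, and the module started working like a charm: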

sudo pacman -S tesseract-data-eng

sudo pacman -S tesseract-data-por

sudo pacman -S tesseract-data-fra

sudo pacman -S tesseract-data-spa

I tried some special Portuguese characters (such as "ão"), and they only worked when I passed the argument lang='por', as in pytesseract.image_to_string(img, lang='por')

小开

For Ubuntu, just run the command below and the environment variable error will go away.

Command:

export TESSDATA_PREFIX=Path_of_your_tessdata_folder

Example command:

export TESSDATA_PREFIX=/home/amar/Desktop/OCR/tesseract-4.1.1/tessdata

This command stores the tessdata folder's path in the TESSDATA_PREFIX environment variable, which resolves the error above.
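
Note that export only affects the current shell and the programs started from it, so a script launched from elsewhere (an IDE, for instance) will not see the variable. One possible workaround is to set it inside the Python process before Tesseract is called. This is a minimal sketch: `use_tessdata_dir` is a hypothetical helper name. Answers in this thread differ on whether the variable should name the tessdata folder itself or its parent; the sketch follows the example above and uses the tessdata folder itself.

```python
import os
from pathlib import Path


def use_tessdata_dir(tessdata_dir: str, lang: str = "eng") -> str:
    """Set TESSDATA_PREFIX for this process (and its child processes).

    Fails early if <lang>.traineddata is not in the folder, which is
    easier to debug than Tesseract's own start-up error.
    """
    path = Path(tessdata_dir).expanduser().resolve()
    if not (path / f"{lang}.traineddata").is_file():
        raise FileNotFoundError(f"{lang}.traineddata not found in {path}")
    os.environ["TESSDATA_PREFIX"] = str(path)
    return str(path)
```

Call it once at the top of the script, before the first OCR call, so that any tesseract process started afterwards inherits the variable.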

小开

As of 2021, my solution for Ubuntu is to download the zip files from https://github.com/tesseract-ocr/tessdata_best/releases/tag/4.1.0, extract them, and copy the necessary .traineddata files into /usr/local/share/tessdata. This is the default folder where tesseract 4.1.1 searches for trained data.

小开

For me, the problem was how I downloaded the traineddata file.

Initially I used:

wget https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata

When I changed it to:

wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata

it worked.
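
The difference is that the /blob/ URL returns GitHub's HTML page for the file, while the /raw/ URL returns the file contents. A quick way to catch this mistake is to look at the first bytes of the download. The sketch below is only a heuristic, and `looks_like_html` is a hypothetical helper name:

```python
def looks_like_html(path: str, probe_bytes: int = 512) -> bool:
    """Return True if the file starts like an HTML page instead of binary data."""
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes)
    head = head.lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")
```

If `looks_like_html("eng.traineddata")` returns `True`, the file is a web page rather than trained data and should be downloaded again from the /raw/ URL.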

小开

I ran into the same problem on macOS when using the DEU language. I was able to solve it by installing the additional languages, like this:

brew install tesseract-lang

as suggested at https://formulae.brew.sh/formula/tesseract

小开

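If you are on Windows, add your Tesseract-OCR installation to the system variables:

1) Find the path where Tesseract is installed on the C drive (in my case it was r"C:\Program Files\Tesseract-OCR\tesseract.exe").
2) Make sure you have the required files, i.e. tessdata and langdata. If not, download them from https://github.com/tesseract-ocr/tessdata and https://github.com/tesseract-ocr/langdata (at least for the languages you want to convert).
3) Paste them into the main directory, in my case C:\Program Files\Tesseract-OCR.
4) Add the directory's path to the system environment variables. To do that, search for "environment variable" in the Start bar, go to Environment Variables, click Path under the system variables (not the user variables), and paste the Tesseract path.

That's all.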

小开

In Google Colab, I solved the problem this way:

!sudo apt-get install tesseract-ocr-*

If you use !sudo apt install tesseract-ocr instead, only 2 languages get installed, so the command above is the one to use when you intend to work with non-English languages. Then install the Python wrapper with !pip install pytesseract. You can also check the installed languages with !tesseract --list-langs
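
To use that language check inside a script, the output of tesseract --list-langs can be parsed. This is a rough sketch: `parse_list_langs` is a hypothetical helper, and it assumes the output is a header line starting with "List of available languages" followed by one language code per line, which may vary between Tesseract versions.

```python
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the assumed header line.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs
```

With that, a check such as `"deu" in parse_list_langs(text)` can confirm a language is installed before OCR is attempted.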