Issue
I am extracting non-English text from an image (a bill) using Tesseract through Tess4J.
When I call doOCR(BufferedImage var1), I get the following error (there is no error when the language is set to English):
contains_unichar_id(unichar_id):Error:Assert failed:in file c:\projects\github\tesseract-ocr\src\ccutil\unicharset.h, line 511
Exception in thread "main" java.lang.Error: Invalid memory access
at com.sun.jna.Native.invokePointer(Native Method)
at com.sun.jna.Function.invokePointer(Function.java:470)
at com.sun.jna.Function.invoke(Function.java:404)
at com.sun.jna.Function.invoke(Function.java:315)
at com.sun.jna.Library$Handler.invoke(Library.java:212)
at com.sun.proxy.$Proxy0.TessBaseAPIGetUTF8Text(Unknown Source)
at net.sourceforge.tess4j.Tesseract.getOCRText(Tesseract.java:433)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:288)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:260)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:241)
My code:
ITesseract iT = new Tesseract();
iT.setLanguage(LANGUAGE);
iT.setDatapath(System.getenv("TESSDATA_PREFIX"));
try {
    return iT.doOCR(bufferedImage);
} catch (Exception e) {
    // Print the failure; the original e.getMessage() call discarded its result.
    e.printStackTrace();
    return "Error while reading image";
}
Some bills are extracted successfully, but in some special cases I get the error above.
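A side note on the try/catch above: the stack trace reports a java.lang.Error, and Error is not a subclass of Exception, so a catch (Exception e) block does not intercept it. The following is a minimal sketch of that behaviour, not the actual Tess4J call. OcrErrorHandlingSketch, failingOcrCall, catchExceptionOnly and catchExceptionAndError are hypothetical names made up for illustration, and the native failure is only simulated. Whether it is safe to keep running after a real native memory fault is a separate question, so treat the second handler as damage control rather than a fix.

```java
import java.util.function.Supplier;

public class OcrErrorHandlingSketch {

    // Hypothetical stand-in for the native OCR call. It throws the same
    // kind of java.lang.Error that appears in the stack trace.
    public static String failingOcrCall() {
        throw new Error("Invalid memory access");
    }

    // Mirrors the original handler: only Exception is caught,
    // so a java.lang.Error propagates to the caller.
    public static String catchExceptionOnly(Supplier<String> ocr) {
        try {
            return ocr.get();
        } catch (Exception e) {
            return "Error while reading image";
        }
    }

    // Handles Exception and Error separately so the failure is logged
    // and the fallback message is returned in both cases.
    public static String catchExceptionAndError(Supplier<String> ocr) {
        try {
            return ocr.get();
        } catch (Exception e) {
            System.err.println("OCR failed: " + e.getMessage());
            return "Error while reading image";
        } catch (Error e) {
            System.err.println("Native OCR call failed: " + e.getMessage());
            return "Error while reading image";
        }
    }

    public static void main(String[] args) {
        // The Error is caught here and the fallback message is returned.
        System.out.println(catchExceptionAndError(OcrErrorHandlingSketch::failingOcrCall));

        // Here the Error escapes the Exception-only handler.
        try {
            catchExceptionOnly(OcrErrorHandlingSketch::failingOcrCall);
        } catch (Error e) {
            System.out.println("Escaped catch (Exception e): " + e.getMessage());
        }
    }
}
```

This explains why the program ended with "Exception in thread "main"" instead of returning the fallback string, even though the call was wrapped in a try block.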
Solution
I solved the issue by upgrading Tess4J. In pom.xml, I changed the dependency from:
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.0.0</version>
</dependency>
to:
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.3.0</version>
</dependency>
Answered By - Jay
Answer Checked By - Willingham (JavaFixing Volunteer)