Update to v5 (#830)

master
Balearica 1 year ago committed by GitHub
parent ccf7414bc2
commit 6ebe92fb5b
GPG Key ID: 4AEE18F83AFDEB23
  1. 4
      .gitignore
  2. 111
      README.md
  3. 11
      benchmarks/browser/auto-rotate-benchmark.html
  4. 24
      benchmarks/browser/speed-benchmark.html
  5. 2
      benchmarks/node/speed-benchmark.js
  6. 270
      docs/api.md
  7. 44
      docs/examples.md
  8. 2
      docs/faq.md
  9. 68
      docs/intro.md
  10. 29
      docs/local-installation.md
  11. 29
      docs/performance.md
  12. 51
      docs/workers_vs_schedulers.md
  13. 10
      examples/browser/basic-efficient.html
  14. 8
      examples/browser/basic-scheduler.html
  15. 26
      examples/browser/basic.html
  16. 160
      examples/browser/demo.html
  17. 8
      examples/browser/download-pdf.html
  18. 10
      examples/browser/image-processing.html
  19. 13
      examples/node/detect.js
  20. 2
      examples/node/download-pdf.js
  21. 2
      examples/node/image-processing.js
  22. 4
      examples/node/recognize.js
  23. 28
      examples/node/scheduler.js
  24. 14
      package-lock.json
  25. 4
      package.json
  26. 2
      scripts/server.js
  27. 49
      scripts/webpack.config.dev.js
  28. 8
      src/Tesseract.js
  29. 5
      src/constants/config.js
  30. 4
      src/constants/defaultOptions.js
  31. 82
      src/createWorker.js
  32. 7
      src/index.d.ts
  33. 14
      src/worker-script/browser/getCore.js
  34. 117
      src/worker-script/index.js
  35. 15
      src/worker-script/node/getCore.js
  36. 10
      src/worker/browser/defaultOptions.js
  37. 2
      tests/FS.test.html
  38. 2
      tests/FS.test.js
  39. 4
      tests/constants.js
  40. 2
      tests/detect.test.html
  41. 4
      tests/detect.test.js
  42. 2
      tests/recognize.test.html
  43. 31
      tests/recognize.test.js
  44. 2
      tests/scheduler.test.html
  45. 5
      tests/scheduler.test.js

4
.gitignore vendored

@ -1,8 +1,8 @@
.DS_Store .DS_Store
node_modules/* node_modules/*
yarn.lock yarn.lock
tesseract.dev.js tesseract.min.js
worker.dev.js worker.min.js
*.traineddata *.traineddata
*.traineddata.gz *.traineddata.gz
.nyc_output .nyc_output

@ -31,82 +31,32 @@ Video Real-time Recognition
Tesseract.js wraps a [webassembly port](https://github.com/naptha/tesseract.js-core) of the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR Engine. Tesseract.js wraps a [webassembly port](https://github.com/naptha/tesseract.js-core) of the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR Engine.
It works in the browser using [webpack](https://webpack.js.org/) or plain script tags with a [CDN](#CDN) and on the server with [Node.js](https://nodejs.org/en/). It works in the browser using [webpack](https://webpack.js.org/), esm, or plain script tags with a [CDN](#CDN) and on the server with [Node.js](https://nodejs.org/en/).
After you [install it](#installation), using it is as simple as: After you [install it](#installation), using it is as simple as:
```javascript
import Tesseract from 'tesseract.js';
Tesseract.recognize(
'https://tesseract.projectnaptha.com/img/eng_bw.png',
'eng',
{ logger: m => console.log(m) }
).then(({ data: { text } }) => {
console.log(text);
})
```
Or using workers (recommended for production use):
```javascript ```javascript
import { createWorker } from 'tesseract.js'; import { createWorker } from 'tesseract.js';
const worker = await createWorker({
logger: m => console.log(m)
});
(async () => { (async () => {
await worker.loadLanguage('eng'); const worker = await createWorker('eng');
await worker.initialize('eng'); const data = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png'); console.log(data.text);
console.log(text);
await worker.terminate(); await worker.terminate();
})(); })();
``` ```
When recognizing multiple images, users should create a worker once, run `worker.recognize` for each image, and then run `worker.terminate()` once at the end (rather than running the above snippet for every image).
For a basic overview of the functions, including the pros/cons of different approaches, see the [intro](./docs/intro.md). [Check out the docs](#documentation) for a full explanation of the API.
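The advice above about reusing one worker across several images can be sketched as follows. This is a minimal illustration, not code from the repository: `recognizeAll` is a hypothetical helper, and `createWorker` is passed in as a parameter (an assumption made so the snippet runs without the package installed; in real use it would be the `createWorker` exported by `tesseract.js`).

```javascript
// Hypothetical helper: one worker is created, reused for every image,
// and terminated exactly once at the end.
async function recognizeAll(createWorker, images, lang = 'eng') {
  const worker = await createWorker(lang);
  try {
    const texts = [];
    for (const image of images) {
      // Each job goes through the same worker, so startup cost is paid once.
      const { data: { text } } = await worker.recognize(image);
      texts.push(text);
    }
    return texts;
  } finally {
    // Release the worker even if one of the recognize calls throws.
    await worker.terminate();
  }
}
```

Running the images sequentially keeps the sketch simple; parallel processing is what schedulers are for.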
## Major changes in v4
Version 4 includes many new features and bug fixes--see [this issue](https://github.com/naptha/tesseract.js/issues/662) for a full list. Several highlights are below.
- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy
- Processed images (rotated, grayscale, binary) can now be retrieved
- Improved support for parallel processing (schedulers)
- Breaking changes:
- `createWorker` is now async
- `getPDF` function replaced by `pdf` recognize option
## Major changes in v3
- Significantly faster performance
- Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the [example images](./examples/data)
- Upgrade to Tesseract v5.1.0 (using emscripten 3.1.18)
- Added SIMD-enabled build for supported devices
- Added support:
- Node.js version 18
- Removed support:
- ASM.js version, any other old versions of Tesseract.js-core (<3.0.0)
- Node.js versions 10 and 12
## Major changes in v2
- Upgrade to tesseract v4.1.1 (using emscripten 1.39.10 upstream)
- Support multiple languages at the same time, eg: eng+chi\_tra for English and Traditional Chinese
- Supported image formats: png, jpg, bmp, pbm
- Support WebAssembly (fallback to ASM.js when browser doesn't support)
- Support Typescript
Read a story about v2: <a href="https://jeromewu.github.io/why-i-refactor-tesseract.js-v2/">Why I refactor tesseract.js v2?</a><br>
Check the <a href="https://github.com/naptha/tesseract.js/tree/support/1.x">support/1.x</a> branch for version 1
## Installation ## Installation
Tesseract.js works with a `<script>` tag via local copy or CDN, with webpack via `npm` and on Node.js with `npm/yarn`. Tesseract.js works with a `<script>` tag via local copy or CDN, with webpack via `npm` and on Node.js with `npm/yarn`.
### CDN ### CDN
```html ```html
<!-- v4 --> <!-- v5 -->
<script src='https://cdn.jsdelivr.net/npm/tesseract.js@4/dist/tesseract.min.js'></script> <script src='https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js'></script>
``` ```
After including the script the `Tesseract` variable will be globally available. After including the script the `Tesseract` variable will be globally available and a worker can be created using `Tesseract.createWorker`.
Alternatively, an ESM build (used with `import` syntax) can be found at `https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.esm.min.js`.
### Node.js ### Node.js
@ -122,16 +72,51 @@ npm install tesseract.js@3.0.3
yarn add tesseract.js@3.0.3 yarn add tesseract.js@3.0.3
``` ```
## Documentation ## Documentation
* [Intro](./docs/intro.md) * [Workers vs. Schedulers](./docs/workers_vs_schedulers.md)
* [Examples](./docs/examples.md) * [Examples](./docs/examples.md)
* [Image Format](./docs/image-format.md) * [Supported Image Formats](./docs/image-format.md)
* [API](./docs/api.md) * [API](./docs/api.md)
* [Local Installation](./docs/local-installation.md) * [Local Installation](./docs/local-installation.md)
* [FAQ](./docs/faq.md) * [FAQ](./docs/faq.md)
## Major changes in v5
Version 5 changes are documented in [this issue](https://github.com/naptha/tesseract.js/issues/820). Highlights are below.
- Significantly smaller files by default (54% smaller for English, 73% smaller for Chinese)
- This results in a ~50% reduction in runtime for first-time users (who do not have the files cached yet)
- Significantly lower memory usage
- Compatible with iOS 17 (using default settings)
- Breaking changes:
- `createWorker` arguments changed
- Setting non-default language and OEM now happens in `createWorker`
- E.g. `createWorker("chi_sim", 1)`
- `worker.initialize` and `worker.loadLanguage` functions now do nothing and can be deleted from code
- See [this issue](https://github.com/naptha/tesseract.js/issues/820) for full list
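The breaking changes listed above can be sketched as a before/after. This is an illustration under stated assumptions, not an official migration guide: `recognizeText` is a hypothetical helper name, and `createWorker` is injected as a parameter so the snippet runs standalone (in real use it would be imported from `tesseract.js`).

```javascript
// Hypothetical helper showing the v5 call pattern described in the list above.
async function recognizeText(createWorker, image, langs = 'eng', oem = 1) {
  // v4 pattern, for comparison:
  //   const worker = await createWorker();
  //   await worker.loadLanguage(langs);
  //   await worker.initialize(langs);
  // v5 pattern: language and OEM are arguments of createWorker itself.
  const worker = await createWorker(langs, oem);
  try {
    const { data: { text } } = await worker.recognize(image);
    return text;
  } finally {
    // Always terminate the worker, even if recognition fails.
    await worker.terminate();
  }
}
```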
## Major changes in v4
Version 4 includes many new features and bug fixes--see [this issue](https://github.com/naptha/tesseract.js/issues/662) for a full list. Several highlights are below.
- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy
- Processed images (rotated, grayscale, binary) can now be retrieved
- Improved support for parallel processing (schedulers)
- Breaking changes:
- `createWorker` is now async
- `getPDF` function replaced by `pdf` recognize option
## Major changes in v3
- Significantly faster performance
- Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the [example images](./examples/data)
- Upgrade to Tesseract v5.1.0 (using emscripten 3.1.18)
- Added SIMD-enabled build for supported devices
- Added support:
- Node.js version 18
- Removed support:
- ASM.js version, any other old versions of Tesseract.js-core (<3.0.0)
- Node.js versions 10 and 12
## Use tesseract.js the way you like! ## Use tesseract.js the way you like!
- Electron Version: https://github.com/Balearica/tesseract.js-electron - Electron Version: https://github.com/Balearica/tesseract.js-electron
@ -167,7 +152,7 @@ npm start
``` ```
The development server will be available at http://localhost:3000/examples/browser/demo.html in your favorite browser. The development server will be available at http://localhost:3000/examples/browser/demo.html in your favorite browser.
It will automatically rebuild `tesseract.dev.js` and `worker.dev.js` when you change files in the **src** folder. It will automatically rebuild `tesseract.min.js` and `worker.min.js` when you change files in the **src** folder.
### Online Setup with a single Click ### Online Setup with a single Click

@ -1,7 +1,7 @@
<html> <html>
<head> <head>
<script src="/dist/tesseract.dev.js"></script> <script src="/dist/tesseract.min.js"></script>
<style> <style>
.column { .column {
float: left; float: left;
@ -37,15 +37,10 @@
const element = document.getElementById("imgRow"); const element = document.getElementById("imgRow");
const worker = await Tesseract.createWorker({ const worker = await Tesseract.createWorker('eng', 0, {
// corePath: '/tesseract-core-simd.wasm.js', // corePath: '/tesseract-core-simd.wasm.js',
workerPath: "/dist/worker.dev.js" workerPath: "/dist/worker.min.js"
}); });
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.initialize();
const fileArr = ["../data/meditations.jpg", "../data/tyger.jpg", "../data/testocr.png"]; const fileArr = ["../data/meditations.jpg", "../data/tyger.jpg", "../data/testocr.png"];
let timeTotal = 0; let timeTotal = 0;

@ -1,6 +1,6 @@
<html> <html>
<head> <head>
<script src="/dist/tesseract.dev.js"></script> <script src="/dist/tesseract.min.js"></script>
</head> </head>
<body> <body>
<textarea id="message">Working...</textarea> <textarea id="message">Working...</textarea>
@ -13,20 +13,21 @@
const { createWorker } = Tesseract; const { createWorker } = Tesseract;
(async () => { (async () => {
const worker = await createWorker({ const worker = await createWorker("eng", 1, {
// corePath: '/tesseract-core-simd.wasm.js', corePath: '../../node_modules/tesseract.js-core',
workerPath: "/dist/worker.dev.js" workerPath: "/dist/worker.min.js",
}); });
await worker.loadLanguage('eng');
await worker.initialize('eng');
// The performance.measureUserAgentSpecificMemory function only runs under specific circumstances for security reasons. // The performance.measureUserAgentSpecificMemory function only runs under specific circumstances for security reasons.
// See: https://developer.mozilla.org/en-US/docs/Web/API/Performance/measureUserAgentSpecificMemory#security_requirements // See: https://developer.mozilla.org/en-US/docs/Web/API/Performance/measureUserAgentSpecificMemory#security_requirements
// Launching a server using `npm start` and accessing via localhost on the same system should meet these conditions. // Launching a server using `npm start` and accessing via localhost on the same system should meet these conditions.
const debugMemory = true; const debugMemory = true;
if (debugMemory && crossOriginIsolated) { if (debugMemory && crossOriginIsolated) {
console.log("Memory utilization after initialization:"); const memObj = await performance.measureUserAgentSpecificMemory();
console.log(await performance.measureUserAgentSpecificMemory()); const memMb = memObj.breakdown.map((x) => {if(x.attribution?.[0]?.scope == "DedicatedWorkerGlobalScope") return x.bytes}).reduce((a, b) => (a || 0) + (b || 0), 0) / 1e6;
console.log(`Worker memory utilization after initialization: ${memMb} MB`);
} else { } else {
console.log("Unable to run `performance.measureUserAgentSpecificMemory`: not crossOriginIsolated.") console.log("Unable to run `performance.measureUserAgentSpecificMemory`: not crossOriginIsolated.")
} }
@ -45,8 +46,11 @@
} }
if (debugMemory && crossOriginIsolated) { if (debugMemory && crossOriginIsolated) {
console.log("Memory utilization after recognition:"); const memObj = await performance.measureUserAgentSpecificMemory();
console.log(await performance.measureUserAgentSpecificMemory()); const memMb = memObj.breakdown.map((x) => {if(x.attribution?.[0]?.scope == "DedicatedWorkerGlobalScope") return x.bytes}).reduce((a, b) => (a || 0) + (b || 0), 0) / 1e6;
console.log(`Worker memory utilization after recognition: ${memMb} MB`);
} }
document.getElementById('message').innerHTML += "\nTotal runtime: " + timeTotal + "s"; document.getElementById('message').innerHTML += "\nTotal runtime: " + timeTotal + "s";
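The worker-memory figure logged in the benchmark above is computed with a single `map`/`reduce` expression. A more readable sketch of the same arithmetic is shown below; `workerMemoryMb` is a hypothetical name, and the input is assumed to have the shape returned by `performance.measureUserAgentSpecificMemory()` (a `breakdown` array whose entries carry `bytes` and an `attribution` list).

```javascript
// Hypothetical helper: sums the bytes attributed to dedicated workers
// and converts the total to megabytes (1 MB = 1e6 bytes, as in the benchmark).
function workerMemoryMb(memObj) {
  return memObj.breakdown
    // Keep only entries whose first attribution is a dedicated worker.
    .filter((x) => x.attribution?.[0]?.scope === 'DedicatedWorkerGlobalScope')
    .reduce((sum, x) => sum + x.bytes, 0) / 1e6;
}
```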

@ -4,8 +4,6 @@ const { createWorker } = require('../../');
(async () => { (async () => {
const worker = await createWorker(); const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const fileArr = ["../data/meditations.jpg", "../data/tyger.jpg", "../data/testocr.png"]; const fileArr = ["../data/meditations.jpg", "../data/tyger.jpg", "../data/testocr.png"];
let timeTotal = 0; let timeTotal = 0;
for (let file of fileArr) { for (let file of fileArr) {

@ -1,16 +1,15 @@
# API # API
- [createWorker()](#create-worker) - [createWorker()](#create-worker)
- [Worker.recognize](#worker-recognize)
- [Worker.setParameters](#worker-set-parameters)
- [Worker.reinitialize](#worker-reinitialize)
- [Worker.detect](#worker-detect)
- [Worker.terminate](#worker-terminate)
- [Worker.writeText](#worker-writeText) - [Worker.writeText](#worker-writeText)
- [Worker.readText](#worker-readText) - [Worker.readText](#worker-readText)
- [Worker.removeFile](#worker-removeFile) - [Worker.removeFile](#worker-removeFile)
- [Worker.FS](#worker-FS) - [Worker.FS](#worker-FS)
- [Worker.loadLanguage](#worker-load-language)
- [Worker.initialize](#worker-initialize)
- [Worker.setParameters](#worker-set-parameters)
- [Worker.recognize](#worker-recognize)
- [Worker.detect](#worker-detect)
- [Worker.terminate](#worker-terminate)
- [createScheduler()](#create-scheduler) - [createScheduler()](#create-scheduler)
- [Scheduler.addWorker](#scheduler-add-worker) - [Scheduler.addWorker](#scheduler-add-worker)
- [Scheduler.addJob](#scheduler-add-job) - [Scheduler.addJob](#scheduler-add-job)
@ -27,10 +26,13 @@
<a name="create-worker"></a> <a name="create-worker"></a>
## createWorker(options): Worker ## createWorker(options): Worker
createWorker is a factory function that creates a tesseract worker, a worker is basically a Web Worker in browser and Child Process in Node. `createWorker` is a function that creates a Tesseract.js worker. A Tesseract.js worker is an object that creates and manages an instance of Tesseract running in a web worker (browser) or worker thread (Node.js). Once created, OCR jobs are sent through the worker.
**Arguments:** **Arguments:**
- `langs` a string to indicate the languages traineddata to download, multiple languages are concatenated with **+**, ex: **eng+chi\_tra**
- `oem` an enum to indicate the OCR Engine Mode you use
- `options` an object of customized options - `options` an object of customized options
- `corePath` path to a directory containing **both** `tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js` from [Tesseract.js-core](https://www.npmjs.com/package/tesseract.js-core) package - `corePath` path to a directory containing **both** `tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js` from [Tesseract.js-core](https://www.npmjs.com/package/tesseract.js-core) package
- Setting `corePath` to a specific `.js` file is **strongly discouraged.** To provide the best performance on all devices, Tesseract.js needs to be able to pick between `tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js`. See [this issue](https://github.com/naptha/tesseract.js/issues/735) for more detail. - Setting `corePath` to a specific `.js` file is **strongly discouraged.** To provide the best performance on all devices, Tesseract.js needs to be able to pick between `tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js`. See [this issue](https://github.com/naptha/tesseract.js/issues/735) for more detail.
@ -43,6 +45,8 @@ createWorker is a factory function that creates a tesseract worker, a worker is
- readOnly: read cache and not to write back - readOnly: read cache and not to write back
- refresh: not to read cache and write back - refresh: not to read cache and write back
- none: not to read cache and not to write back - none: not to read cache and not to write back
- `legacyCore` set to `true` to ensure any code downloaded supports the Legacy model (in addition to LSTM model)
- `legacyLang` set to `true` to ensure any language data downloaded supports the Legacy model (in addition to LSTM model)
- `workerBlobURL` a boolean to define whether to use Blob URL for worker script, default: true - `workerBlobURL` a boolean to define whether to use Blob URL for worker script, default: true
- `gzip` a boolean to define whether the traineddata from the remote is gzipped, default: true - `gzip` a boolean to define whether the traineddata from the remote is gzipped, default: true
- `logger` a function to log the progress, a quick example is `m => console.log(m)` - `logger` a function to log the progress, a quick example is `m => console.log(m)`
@ -59,255 +63,211 @@ const worker = await createWorker({
}); });
``` ```
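The `langs` format described in the arguments above (language codes joined with **+**) can be built from a list. The helper below is a hypothetical convenience, not part of Tesseract.js; its result is the kind of string that would be passed as the first argument of `createWorker`.

```javascript
// Hypothetical helper (not part of Tesseract.js): builds the `langs` string.
function joinLangs(codes) {
  if (!Array.isArray(codes) || codes.length === 0) {
    throw new TypeError('expected a non-empty array of language codes');
  }
  // Trim stray whitespace, then join with "+" as the docs describe.
  return codes.map((code) => code.trim()).join('+');
}

console.log(joinLangs(['eng', 'chi_tra'])); // eng+chi_tra
```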
## Worker <a name="worker-recognize"></a>
### Worker.recognize(image, options, jobId): Promise
A Worker helps you to do the OCR related tasks, it takes few steps to setup Worker before it is fully functional. The full flow is:
- FS functions // optional
- loadLanguage
- initialize
- setParameters // optional
- recognize or detect
- terminate
Each function is async, so using async/await or Promise is required. When it is resolved, you get an object:
```json
{
"jobId": "Job-1-123",
"data": { ... }
}
```
jobId is generated by Tesseract.js, but you can put your own when calling any of the function above.
<a name="worker-writeText"></a> Worker.recognize() provides core function of Tesseract.js as it executes OCR
### Worker.writeText(path, text, jobId): Promise
Worker.writeText() writes a text file to the path specified in MEMFS, it is useful when you want to use some features that requires tesseract.js Figures out what words are in `image`, where the words are in `image`, etc.
to read file from file system. > Note: `image` should be sufficiently high resolution.
> Often, the same image will get much better results if you upscale it before calling `recognize`.
**Arguments:** **Arguments:**
- `path` text file path - `image` see [Image Format](./image-format.md) for more details.
- `text` content of the text file - `options` an object of customized options
- `rectangle` an object to specify the regions you want to recognized in the image, should contain top, left, width and height, see example below.
- `output` an object specifying which output formats to return (by default `text`, `blocks`, `hocr`, and `tsv` are returned)
- `jobId` Please see details above - `jobId` Please see details above
**Output:**
**Examples:** **Examples:**
```javascript ```javascript
const { createWorker } = Tesseract;
(async () => { (async () => {
await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n'); const worker = await createWorker('eng');
const { data: { text } } = await worker.recognize(image);
console.log(text);
})(); })();
``` ```
<a name="worker-readText"></a> With rectangle
### Worker.readText(path, jobId): Promise
Worker.readText() reads a text file to the path specified in MEMFS, it is useful when you want to check the content.
**Arguments:**
- `path` text file path
- `jobId` Please see details above
**Examples:**
```javascript ```javascript
const { createWorker } = Tesseract;
(async () => { (async () => {
const { data } = await worker.readText('tmp.txt'); const worker = await createWorker('eng');
console.log(data); const { data: { text } } = await worker.recognize(image, {
rectangle: { top: 0, left: 0, width: 100, height: 100 },
});
console.log(text);
})(); })();
``` ```
<a name="worker-removeFile"></a> <a name="worker-set-parameters"></a>
### Worker.removeFile(path, jobId): Promise ### worker.setParameters(params, jobId): Promise
Worker.removeFile() remove a file in MEMFS, it is useful when you want to free the memory. `worker.setParameters()` set parameters for Tesseract API (using SetVariable()), it changes the behavior of Tesseract and some parameters like tessedit\_char\_whitelist is very useful.
**Arguments:** **Arguments:**
- `path` file path - `params` an object with key and value of the parameters
- `jobId` Please see details above - `jobId` Please see details above
**Examples:** Note: `worker.setParameters` cannot be used to change the `oem`, as this value is set at initialization. `oem` is initially set using an argument of `createWorker`. After a worker already exists, changing `oem` requires running `worker.reinitialize`.
```javascript **Useful Parameters:**
(async () => {
await worker.removeFile('tmp.txt');
})();
```
<a name="worker-FS"></a>
### Worker.FS(method, args, jobId): Promise
Worker.FS() is a generic FS function to do anything you want, you can check [HERE](https://emscripten.org/docs/api_reference/Filesystem-API.html) for all functions.
**Arguments:** | name | type | default value | description |
| --------------------------- | ------ | ----------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode |
| tessedit\_char\_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful if content in image is limited |
| preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words |
| user\_defined\_dpi | string | '' | Define custom dpi, use to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** |
- `method` method name This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.)
- `args` array of arguments to pass
- `jobId` Please see details above
**Examples:** **Examples:**
```javascript ```javascript
(async () => { (async () => {
await worker.FS('writeFile', ['tmp.txt', 'Hi\nTesseract.js\n']); await worker.setParameters({
// equal to: tessedit_char_whitelist: '0123456789',
// await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n'); });
})(); })();
``` ```
<a name="worker-load-language"></a> <a name="worker-reinitialize"></a>
### Worker.loadLanguage(langs, jobId): Promise ### worker.reinitialize(langs, oem, jobId): Promise
Worker.loadLanguage() loads traineddata from cache or download traineddata from remote, and put traineddata into the WebAssembly file system. `worker.reinitialize()` re-initializes an existing Tesseract.js worker with different `langs` and `oem` arguments.
**Arguments:** **Arguments:**
- `langs` a string to indicate the languages traineddata to download, multiple languages are concatenated with **+**, ex: **eng+chi\_tra** - `langs` a string to indicate the languages traineddata to download, multiple languages are concatenated with **+**, ex: **eng+chi\_tra**
- `oem` an enum to indicate the OCR Engine Mode you use
- `jobId` Please see details above - `jobId` Please see details above
Note: to switch from Tesseract LSTM (`oem` value `1`) to Tesseract Legacy (`oem` value `0`) using `worker.reinitialize()`, the worker must already contain the code required to run the Tesseract Legacy model. Setting `legacyCore: true` and `legacyLang: true` in `createWorker` options ensures this is the case.
**Examples:** **Examples:**
```javascript ```javascript
(async () => { await worker.reinitialize('eng', 1);
await worker.loadLanguage('eng+chi_tra');
})();
``` ```
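The note above about switching between LSTM and Legacy can be sketched as follows. This is a minimal illustration under stated assumptions: `createSwitchableWorker` is a hypothetical helper, and `createWorker` is injected as a parameter so the snippet runs without the package installed (in real use it would come from `tesseract.js`).

```javascript
// Hypothetical helper: creates a worker that can later switch engines.
async function createSwitchableWorker(createWorker, langs = 'eng') {
  // Start in LSTM mode (oem 1), but request Legacy-capable code and
  // language data so a later switch to oem 0 is possible.
  const worker = await createWorker(langs, 1, { legacyCore: true, legacyLang: true });
  return {
    worker,
    // Re-initialize the existing worker with the Legacy engine (oem 0).
    useLegacy: () => worker.reinitialize(langs, 0),
    // Re-initialize the existing worker with the LSTM engine (oem 1).
    useLstm: () => worker.reinitialize(langs, 1),
  };
}
```

Without `legacyCore` and `legacyLang` set at creation time, the `useLegacy` step would be expected to fail, which is why the options are set up front.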
<a name="worker-initialize"></a> <a name="worker-detect"></a>
### Worker.initialize(langs, oem, jobId): Promise ### Worker.detect(image, jobId): Promise
Worker.detect() does OSD (Orientation and Script Detection) to the image instead of OCR.
Worker.initialize() initializes the Tesseract API, make sure it is ready for doing OCR tasks. Note: Running `worker.detect` requires a worker with code and language data that supports Tesseract Legacy (this is not enabled by default). If you want to run `worker.detect`, set `legacyCore` and `legacyLang` to `true` in the `createWorker` options.
**Arguments:** **Arguments:**
- `langs` a string to indicate the languages loaded by Tesseract API, it can be the subset of the language traineddata you loaded from Worker.loadLanguage. - `image` see [Image Format](./image-format.md) for more details.
- `oem` an enum to indicate the OCR Engine Mode you use
- `jobId` Please see details above - `jobId` Please see details above
**Examples:** **Examples:**
```javascript ```javascript
const { createWorker } = Tesseract;
(async () => { (async () => {
/** You can load more languages in advance, but use only part of them in Worker.initialize() */ const worker = await createWorker('eng', 1, {legacyCore: true, legacyLang: true});
await worker.loadLanguage('eng+chi_tra'); const { data } = await worker.detect(image);
await worker.initialize('eng'); console.log(data);
})(); })();
``` ```
<a name="worker-set-parameters"></a>
### Worker.setParameters(params, jobId): Promise
Worker.setParameters() set parameters for Tesseract API (using SetVariable()), it changes the behavior of Tesseract and some parameters like tessedit\_char\_whitelist is very useful.
**Arguments:**
- `params` an object with key and value of the parameters
- `jobId` Please see details above
<a name="worker-terminate"></a>
### Worker.terminate(jobId): Promise
**Useful Parameters:** Worker.terminate() terminates the worker and cleans up
| name | type | default value | description |
| --------------------------- | ------ | ----------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| tessedit\_ocr\_engine\_mode | enum | OEM.DEFAULT | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L268) for definition of each mode |
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode |
| tessedit\_char\_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful if content in image is limited |
| preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words |
| user\_defined\_dpi | string | '' | Define custom dpi, use to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** |
This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.)
**Examples:**
```javascript ```javascript
(async () => { (async () => {
await worker.setParameters({ await worker.terminate();
tessedit_char_whitelist: '0123456789', })();
});
})();
``` ```
<a name="worker-recognize"></a>
### Worker.recognize(image, options, jobId): Promise
Worker.recognize() provides core function of Tesseract.js as it executes OCR <a name="worker-writeText"></a>
### Worker.writeText(path, text, jobId): Promise
Figures out what words are in `image`, where the words are in `image`, etc. Worker.writeText() writes a text file to the path specified in MEMFS, it is useful when you want to use some features that requires tesseract.js
> Note: `image` should be sufficiently high resolution. to read file from file system.
> Often, the same image will get much better results if you upscale it before calling `recognize`.
**Arguments:** **Arguments:**
- `image` see [Image Format](./image-format.md) for more details. - `path` text file path
- `options` an object of customized options - `text` content of the text file
- `rectangle` an object to specify the regions you want to recognized in the image, should contain top, left, width and height, see example below.
- `output` an object specifying which output formats to return (by default `text`, `blocks`, `hocr`, and `tsv` are returned)
- `jobId` Please see details above - `jobId` Please see details above
**Output:**
**Examples:** **Examples:**
```javascript ```javascript
const { createWorker } = Tesseract;
(async () => { (async () => {
const worker = await createWorker(); await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n');
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize(image);
console.log(text);
})(); })();
``` ```
With rectangle <a name="worker-readText"></a>
### Worker.readText(path, jobId): Promise
Worker.readText() reads a text file to the path specified in MEMFS, it is useful when you want to check the content.
**Arguments:**
- `path` text file path
- `jobId` Please see details above
**Examples:**
```javascript ```javascript
const { createWorker } = Tesseract;
(async () => { (async () => {
const worker = await createWorker(); const { data } = await worker.readText('tmp.txt');
await worker.loadLanguage('eng'); console.log(data);
await worker.initialize('eng');
const { data: { text } } = await worker.recognize(image, {
rectangle: { top: 0, left: 0, width: 100, height: 100 },
});
console.log(text);
})(); })();
``` ```
<a name="worker-detect"></a> <a name="worker-removeFile"></a>
### Worker.detect(image, jobId): Promise ### Worker.removeFile(path, jobId): Promise
Worker.detect() does OSD (Orientation and Script Detection) to the image instead of OCR. Worker.removeFile() remove a file in MEMFS, it is useful when you want to free the memory.
**Arguments:** **Arguments:**
- `image` see [Image Format](./image-format.md) for more details. - `path` file path
- `jobId` Please see details above - `jobId` Please see details above
**Examples:** **Examples:**
```javascript ```javascript
const { createWorker } = Tesseract;
(async () => { (async () => {
const worker = await createWorker(); await worker.removeFile('tmp.txt');
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data } = await worker.detect(image);
console.log(data);
})(); })();
``` ```
<a name="worker-terminate"></a> <a name="worker-FS"></a>
### Worker.terminate(jobId): Promise ### Worker.FS(method, args, jobId): Promise
Worker.terminate() terminates the worker and cleans up Worker.FS() is a generic FS function to do anything you want, you can check [HERE](https://emscripten.org/docs/api_reference/Filesystem-API.html) for all functions.
**Arguments:**
- `method` method name
- `args` array of arguments to pass
- `jobId` Please see details above
**Examples:**
```javascript ```javascript
(async () => { (async () => {
await worker.terminate(); await worker.FS('writeFile', ['tmp.txt', 'Hi\nTesseract.js\n']);
// equal to:
// await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n');
})(); })();
``` ```
@ -404,13 +364,17 @@ setLogging(true);
<a name="recognize"></a> <a name="recognize"></a>
## recognize(image, langs, options): Promise ## recognize(image, langs, options): Promise
recognize() is a function to quickly do recognize() task, it is not recommended to use in real application, but useful when you want to save some time. This function is deprecated and should be replaced with `worker.recognize` (see above).
`recognize` works the same as `worker.recognize`, except that a new worker is created, loaded, and destroyed every time the function is called.
See [Tesseract.js](../src/Tesseract.js) See [Tesseract.js](../src/Tesseract.js)
<a name="detect"></a> <a name="detect"></a>
## detect(image, options): Promise ## detect(image, options): Promise
This function is deprecated and should be replaced with `worker.detect` (see above).
Same background as recognize(), but it does detect instead. Same background as recognize(), but it does detect instead.
See [Tesseract.js](../src/Tesseract.js) See [Tesseract.js](../src/Tesseract.js)

@ -7,11 +7,9 @@ You can also check [examples](../examples) folder.
```javascript ```javascript
const { createWorker } = require('tesseract.js'); const { createWorker } = require('tesseract.js');
const worker = await createWorker(); const worker = await createWorker('eng');
(async () => { (async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png'); const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text); console.log(text);
await worker.terminate(); await worker.terminate();
@ -23,13 +21,11 @@ const worker = await createWorker();
```javascript ```javascript
const { createWorker } = require('tesseract.js'); const { createWorker } = require('tesseract.js');
const worker = await createWorker({ const worker = await createWorker('eng', 1, {
logger: m => console.log(m), // Add logger here logger: m => console.log(m), // Add logger here
}); });
(async () => { (async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png'); const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text); console.log(text);
await worker.terminate(); await worker.terminate();
@ -41,11 +37,9 @@ const worker = await createWorker({
```javascript ```javascript
const { createWorker } = require('tesseract.js'); const { createWorker } = require('tesseract.js');
const worker = await createWorker(); const worker = await createWorker('eng+chi_tra');
(async () => { (async () => {
await worker.loadLanguage('eng+chi_tra');
await worker.initialize('eng+chi_tra');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png'); const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text); console.log(text);
await worker.terminate(); await worker.terminate();
@ -56,11 +50,9 @@ const worker = await createWorker();
```javascript ```javascript
const { createWorker } = require('tesseract.js'); const { createWorker } = require('tesseract.js');
const worker = await createWorker(); const worker = await createWorker('eng');
(async () => { (async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.setParameters({ await worker.setParameters({
tessedit_char_whitelist: '0123456789', tessedit_char_whitelist: '0123456789',
}); });
@ -77,11 +69,9 @@ Check here for more details of pageseg mode: https://github.com/tesseract-ocr/te
```javascript ```javascript
const { createWorker, PSM } = require('tesseract.js'); const { createWorker, PSM } = require('tesseract.js');
const worker = await createWorker(); const worker = await createWorker('eng');
(async () => { (async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.setParameters({ await worker.setParameters({
tessedit_pageseg_mode: PSM.SINGLE_BLOCK, tessedit_pageseg_mode: PSM.SINGLE_BLOCK,
}); });
@ -105,12 +95,10 @@ Node: [download-pdf.js](../examples/node/download-pdf.js)
```javascript ```javascript
const { createWorker } = require('tesseract.js'); const { createWorker } = require('tesseract.js');
const worker = await createWorker(); const worker = await createWorker('eng');
const rectangle = { left: 0, top: 0, width: 500, height: 250 }; const rectangle = { left: 0, top: 0, width: 500, height: 250 };
(async () => { (async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png', { rectangle }); const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png', { rectangle });
console.log(text); console.log(text);
await worker.terminate(); await worker.terminate();
@ -122,7 +110,7 @@ const rectangle = { left: 0, top: 0, width: 500, height: 250 };
```javascript ```javascript
const { createWorker } = require('tesseract.js'); const { createWorker } = require('tesseract.js');
const worker = await createWorker(); const worker = await createWorker('eng');
const rectangles = [ const rectangles = [
{ {
left: 0, left: 0,
@ -139,8 +127,6 @@ const rectangles = [
]; ];
(async () => { (async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
const values = []; const values = [];
for (let i = 0; i < rectangles.length; i++) { for (let i = 0; i < rectangles.length; i++) {
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png', { rectangle: rectangles[i] }); const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png', { rectangle: rectangles[i] });
@ -157,8 +143,8 @@ const rectangles = [
const { createWorker, createScheduler } = require('tesseract.js'); const { createWorker, createScheduler } = require('tesseract.js');
const scheduler = createScheduler(); const scheduler = createScheduler();
const worker1 = await createWorker(); const worker1 = await createWorker('eng');
const worker2 = await createWorker(); const worker2 = await createWorker('eng');
const rectangles = [ const rectangles = [
{ {
left: 0, left: 0,
@ -175,10 +161,6 @@ const rectangles = [
]; ];
(async () => { (async () => {
await worker1.loadLanguage('eng');
await worker2.loadLanguage('eng');
await worker1.initialize('eng');
await worker2.initialize('eng');
scheduler.addWorker(worker1); scheduler.addWorker(worker1);
scheduler.addWorker(worker2); scheduler.addWorker(worker2);
const results = await Promise.all(rectangles.map((rectangle) => ( const results = await Promise.all(rectangles.map((rectangle) => (
@ -195,14 +177,10 @@ const rectangles = [
const { createWorker, createScheduler } = require('tesseract.js'); const { createWorker, createScheduler } = require('tesseract.js');
const scheduler = createScheduler(); const scheduler = createScheduler();
const worker1 = await createWorker(); const worker1 = await createWorker('eng');
const worker2 = await createWorker(); const worker2 = await createWorker('eng');
(async () => { (async () => {
await worker1.loadLanguage('eng');
await worker2.loadLanguage('eng');
await worker1.initialize('eng');
await worker2.initialize('eng');
scheduler.addWorker(worker1); scheduler.addWorker(worker1);
scheduler.addWorker(worker2); scheduler.addWorker(worker2);
/** Add 10 recognition jobs */ /** Add 10 recognition jobs */

@ -19,8 +19,6 @@ Default settings should provide optimal results for most users. If you do want
# Trained Data # Trained Data
## How does tesseract.js download and keep \*.traineddata? ## How does tesseract.js download and keep \*.traineddata?
The language model is downloaded by `worker.loadLanguage()` and you need to pass the langs to `worker.initialize()`.
During the downloading of language model, Tesseract.js will first check if \*.traineddata already exists. (browser: [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API), Node.js: fs, in the folder you execute the command) If the \*.traineddata doesn't exist, it will fetch \*.traineddata.gz from [tessdata](https://github.com/naptha/tessdata), ungzip and store in IndexedDB or fs, you can delete it manually and it will download again for you. During the downloading of language model, Tesseract.js will first check if \*.traineddata already exists. (browser: [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API), Node.js: fs, in the folder you execute the command) If the \*.traineddata doesn't exist, it will fetch \*.traineddata.gz from [tessdata](https://github.com/naptha/tessdata), ungzip and store in IndexedDB or fs, you can delete it manually and it will download again for you.
## How can I train my own \*.traineddata? ## How can I train my own \*.traineddata?

@ -1,68 +0,0 @@
# Overview
Tesseract.js offers 3 different ways to recognize text, which vary in complexity. This allows Tesseract.js to provide ease of use to new users experimenting with Tesseract.js, while offering control and performance to more experienced users. Each option is described in brief below, in order of complexity. For more detailed documentation on each function, see the [api page](./api.md).
# Option 1: Single Function
By using `Tesseract.recognize`, you can recognize text with just 1 function and 2 arguments (image and language). This makes it easy for new users to experiment with Tesseract.js.
```
Tesseract.recognize(
'https://tesseract.projectnaptha.com/img/eng_bw.png',
'eng'
).then(({ data: { text } }) => {
console.log(text);
})
```
This option should generally be avoided in production code. Using `Tesseract.recognize` results in a new worker being created and loaded with language data whenever `Tesseract.recognize` is run. This is inefficient for reasons explained below.
# Option 2: Using Workers
Tesseract.js also supports creating and managing workers (the objects that execute recognition) manually.
```
(async () => {
const worker = await Tesseract.createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
This code block is no more efficient than the `Tesseract.recognize` example as written (in both cases a worker is created and destroyed for recognizing a single image). However, within the context of an actual application, separating (1) creating a worker and loading data and (2) running recognition jobs provides developers the control necessary to write more efficient code:
1. Workers can be prepared ahead of time
- E.g. a worker can be created and loaded with language data when the page is first loaded, rather than waiting for a user to upload an image to recognize
1. Workers can be reused for multiple recognition jobs, rather than creating a new worker and loading language data for every image recognized (as `Tesseract.recognize` does)
# Option 3: Using Schedulers + Workers
Finally, Tesseract.js supports schedulers. A scheduler is an object that contains multiple workers, which it uses to execute jobs in parallel.
```
const scheduler = Tesseract.createScheduler();
// Creates worker and adds to scheduler
const workerGen = async () => {
const worker = await Tesseract.createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
scheduler.addWorker(worker);
}
const workerN = 4;
(async () => {
const resArr = Array(workerN);
for (let i=0; i<workerN; i++) {
resArr[i] = workerGen();
}
await Promise.all(resArr);
/** Add 4 recognition jobs */
const results = await Promise.all(Array(10).fill(0).map(() => (
scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png').then((x) => console.log(x.data.text))
)))
await scheduler.terminate(); // It also terminates all workers.
})();
```
While using schedulers is no more efficient for a single job, they allow for quickly executing large numbers of jobs in parallel.
When working with schedulers, note that workers added to the same scheduler should all be homogenous—they should have the same language be configured with the same parameters. Schedulers assign jobs to workers in a non-deterministic manner, so if the workers are not identical then recognition results will depend on which worker the job is assigned to.

@ -8,21 +8,11 @@ Because of this we recommend loading `tesseract.js` from a CDN. But if you reall
In Node.js environment, the only path you may want to customize is languages/langPath. In Node.js environment, the only path you may want to customize is languages/langPath.
```javascript
Tesseract.recognize(image, langs, {
workerPath: 'https://cdn.jsdelivr.net/npm/tesseract.js@v4.0.3/dist/worker.min.js',
langPath: 'https://tessdata.projectnaptha.com/4.0.0',
corePath: 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3',
})
```
Or
```javascript ```javascript
const worker = await createWorker({ const worker = await createWorker({
workerPath: 'https://cdn.jsdelivr.net/npm/tesseract.js@v4.0.3/dist/worker.min.js', workerPath: 'https://cdn.jsdelivr.net/npm/tesseract.js@v5.0.0/dist/worker.min.js',
langPath: 'https://tessdata.projectnaptha.com/4.0.0', langPath: 'https://tessdata.projectnaptha.com/4.0.0',
corePath: 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3', corePath: 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v5.0.0',
}); });
``` ```
@ -30,11 +20,18 @@ const worker = await createWorker({
A string specifying the location of the `worker.js` file. A string specifying the location of the `worker.js` file.
### langPath ### langPath
A string specifying the location of the tesseract language files, with default value 'https://tessdata.projectnaptha.com/4.0.0'. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`. A string specifying the location of the tesseract language files. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`. If `langPath` is not specified by the user, then the correct language data will be automatically downloaded from the jsDelivr CDN.
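The URL formula above can be sketched as a small helper. This is an illustration only: `languageFileUrl` is a hypothetical name, not part of the Tesseract.js API, and the sketch assumes that a single `/` separates the directory from the file name, which the formula does not spell out.

```javascript
// Hypothetical helper (not part of Tesseract.js): builds the URL a language
// file would be fetched from, following `langPath + langCode + '.traineddata.gz'`.
// Assumption: a single '/' separates the directory from the file name.
function languageFileUrl(langPath, langCode) {
  const base = langPath.replace(/\/+$/, ''); // drop any trailing slashes
  return `${base}/${langCode}.traineddata.gz`;
}

console.log(languageFileUrl('https://tessdata.projectnaptha.com/4.0.0', 'eng'));
// prints "https://tessdata.projectnaptha.com/4.0.0/eng.traineddata.gz"
```

If you self-host language data, a helper like this can be handy for checking that every expected file is reachable before pointing `langPath` at your own server.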
### corePath ### corePath
A string specifying the location of the [tesseract.js-core](https://github.com/naptha/tesseract.js-core) files, with default value 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3'. A string specifying the location of the [tesseract.js-core](https://github.com/naptha/tesseract.js-core) files, with default value 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v5.0.0'.
If you set the `corePath` argument, be sure to set it to a directory that contains **all 4** of these files:
1. `tesseract-core.wasm.js`
2. `tesseract-core-simd.wasm.js`
3. `tesseract-core-lstm.wasm.js`
4. `tesseract-core-simd-lstm.wasm.js`
`corePath` should be set to a directory containing both `tesseract-core-simd.wasm.js` and `tesseract-core.wasm.js`. Tesseract.js will load either `tesseract-core-simd.wasm.js` or `tesseract-core.wasm.js` from the directory depending on whether the users' device supports SIMD (see [https://webassembly.org/roadmap/](https://webassembly.org/roadmap/)). Tesseract.js will pick the correct file based on your users' device and the `createWorker` options.
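To see why the directory needs all four files, here is a rough sketch of how a file could be chosen from it. The function `pickCoreFile` and its `simd` / `lstmOnly` inputs are assumptions made for illustration; the actual selection logic inside Tesseract.js may differ.

```javascript
// Hypothetical sketch (not the library's actual code): choose one of the four
// core files from a `corePath` directory.
// Assumptions: `simd` says whether the device supports WebAssembly SIMD, and
// `lstmOnly` says whether only the LSTM recognition model is needed.
function pickCoreFile(corePath, { simd, lstmOnly }) {
  const name = `tesseract-core${simd ? '-simd' : ''}${lstmOnly ? '-lstm' : ''}.wasm.js`;
  return `${corePath.replace(/\/+$/, '')}/${name}`; // avoid a doubled '/'
}

console.log(pickCoreFile('/vendor/tesseract.js-core', { simd: true, lstmOnly: true }));
// prints "/vendor/tesseract.js-core/tesseract-core-simd-lstm.wasm.js"
```

Under these assumptions, any of the four files may be requested depending on the device and options, so a directory missing one of them would fail for some users.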
To avoid breaking old code, when `corePath` is set to a specific `.js` file (e.g. `https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3/tesseract-core.wasm.js`), it will load that file regardless of whether the users' device supports SIMD or not. This behavior only exists to preserve backwards compatibility—setting `corePath` to a specific `.js` file is strongly discouraged. Doing so will either result in much slower performance (if `tesseract-core.wasm.js` is specified) or failure to run on certain devices (if `tesseract-core-simd.wasm.js` is specified). To avoid breaking old code, when `corePath` is set to a specific `.js` file (e.g. `https://cdn.jsdelivr.net/npm/tesseract.js-core@v5.0.0/tesseract-core.wasm.js`), it will load that file regardless of whether the users' device supports SIMD or not. This behavior only exists to preserve backwards compatibility—setting `corePath` to a specific `.js` file is strongly discouraged. Doing so will either result in much slower performance (if `tesseract-core.wasm.js` is specified) or failure to run on certain devices (if `tesseract-core-simd.wasm.js` is specified).

@ -2,38 +2,37 @@
This guide contains tips and strategies for getting the fastest performance from Tesseract.js. While some of the tips below involve avoiding pitfalls and should be universally implemented, other strategies (changing the language data or recognition model) may harm recognition quality. Therefore, whether these strategies are appropriate depends on the application, and users should always benchmark performance and quality before changing important settings from their defaults. This guide contains tips and strategies for getting the fastest performance from Tesseract.js. While some of the tips below involve avoiding pitfalls and should be universally implemented, other strategies (changing the language data or recognition model) may harm recognition quality. Therefore, whether these strategies are appropriate depends on the application, and users should always benchmark performance and quality before changing important settings from their defaults.
# Reducing Setup Time # Reducing Setup Time
Within certain applications, the majority of runtime may be attributable to setup steps (`createWorker`, `worker.initialize`, and `worker.loadLanguage`) rather than recognition (`worker.recognize`). Implementing the strategies in this section should reduce the time spent on these steps. Within certain applications, the majority of runtime may be attributable to setup steps (`createWorker`) rather than recognition (`worker.recognize`). Implementing the strategies in this section should reduce the time spent on these steps.
Notably, the time spent on setup for first-time users may not be apparent to developers, as Tesseract.js caches language data after it is downloaded for the first time. To experience Tesseract.js as a first-time user, set `cacheMethod: 'none'` in the [createWorker options](./api.md#createworkeroptions-worker). Be sure to remove this setting before publishing your app. Notably, the time spent on setup for first-time users may not be apparent to developers, as Tesseract.js caches language data after it is downloaded for the first time. To experience Tesseract.js as a first-time user, set `cacheMethod: 'none'` in the [createWorker options](./api.md#createworkeroptions-worker). Be sure to remove this setting before publishing your app.
### Reuse Workers ### Reuse Workers
When recognizing multiple images, some users will create/load/destroy a new worker for each image. This is never the correct option. If the images are being recognized one after the other, all of the extra `createWorker`/`worker.initialize`/`worker.loadLanguage` steps are wasted runtime, as `worker.recognize` could be run with the same `worker`. Workers do not break after one use. When recognizing multiple images, some users will create/load/destroy a new worker for each image. This is never the correct option. If the images are being recognized one after the other, all of the extra steps required to create/load/destroy a new worker are wasted runtime, as `worker.recognize` could be run with the same `worker`. Workers do not break after one use.
Alternatively, if images are being recognized in parallel, then creating a new worker for each recognition job is likely to cause crashes due to resource limitations. As each Tesseract.js worker uses a high amount of memory, code should never be able to create an arbitrary number of `workers`. Instead, schedulers should be used to create a specific pool for workers (say, 4 workers), and then jobs assigned through the scheduler. Alternatively, if images are being recognized in parallel, then creating a new worker for each recognition job is likely to cause crashes due to resource limitations. As each Tesseract.js worker uses a high amount of memory, code should never be able to create an arbitrary number of `workers`. Instead, schedulers should be used to create a specific pool for workers (say, 4 workers), and then jobs assigned through the scheduler.
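One way to keep the pool bounded is to derive the worker count from the number of jobs while enforcing a hard cap. The sketch below is a minimal illustration: `workerCountFor` is a hypothetical helper, and the default cap of 4 simply mirrors the example figure above rather than a benchmarked recommendation.

```javascript
// Hypothetical helper (not part of Tesseract.js): decide how many workers to
// add to a scheduler for a batch of jobs, never exceeding `maxWorkers`.
function workerCountFor(jobCount, maxWorkers = 4) {
  if (!Number.isInteger(jobCount) || jobCount < 0) {
    throw new RangeError('jobCount must be a non-negative integer');
  }
  // Never create more workers than jobs, and never more than the cap.
  return Math.min(jobCount, maxWorkers);
}

console.log(workerCountFor(10)); // prints 4
console.log(workerCountFor(2));  // prints 2
```

The returned number would then determine how many times `createWorker` is called before the workers are added to the scheduler.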
### Set Up Workers Ahead of Time ### Set Up Workers Ahead of Time
Rather than waiting until the last minute to load code and data, you can set up a worker ahead of time. Doing so greatly reduces runtime the first time a user runs recognition. This requires managing workers rather than using `Tesseract.recognize`, which is explained [here](./intro.md). An example where a worker is prepared ahead of time can be found [here](../examples/browser/basic-efficient.html). Rather than waiting until the last minute to load code and data, you can set up a worker ahead of time. Doing so greatly reduces runtime the first time a user runs recognition. An example where a worker is prepared ahead of time can be found [here](../examples/browser/basic-efficient.html).
The appropriate time to load Tesseract.js workers and data is application-specific. For example, if you have a web app where only 5% of users need OCR, it likely does not make sense to download ~15MB in code and data upon a page load. You could consider instead loading Tesseract.js when a user indicates they want to perform OCR, but before they select a specific image. The appropriate time to load Tesseract.js workers and data is application-specific. For example, if you have a web app where only 5% of users need OCR, it likely does not make sense to download ~15MB in code and data upon a page load. You could consider instead loading Tesseract.js when a user indicates they want to perform OCR, but before they select a specific image.
### Do Not Disable Language Data Caching ### Do Not Disable Language Data Caching
Language data is, by far, the largest download required to run Tesseract.js. The default `eng.traineddata` file is 10.4MB compressed. The default `chi_sim.traineddata` file is 19.2MB compressed. Language data is one of the largest downloads required to run Tesseract.js. While most language data files (including the default English file) are ~2MB, in a worst-case scenario they can be much larger. For example, setting the recognition model (`oem`) to Tesseract Legacy and language to Chinese (simplified) results in a ~20MB file being downloaded.
To avoid downloading language data multiple times, Tesseract.js caches `.traineddata` files. In past versions of Tesseract.js, this caching behavior contained bugs, so some users disabled it (setting `cacheMethod: 'none'` or `cacheMethod: 'refresh'`). As these bugs were fixed in [v4.0.6](https://github.com/naptha/tesseract.js/releases/tag/v4.0.6), it is now recommended that users use the default `cacheMethod` value (i.e. just ignore the `cacheMethod` argument). To avoid downloading language data multiple times, Tesseract.js caches `.traineddata` files. In past versions of Tesseract.js, this caching behavior contained bugs, so some users disabled it (setting `cacheMethod: 'none'` or `cacheMethod: 'refresh'`). As these bugs were fixed in [v4.0.6](https://github.com/naptha/tesseract.js/releases/tag/v4.0.6), it is now recommended that users use the default `cacheMethod` value (i.e. just ignore the `cacheMethod` argument).
### Consider Using Smaller Language Data
The default language data used by Tesseract.js includes data for both Tesseract engines (LSTM [the default model] and Legacy), and is optimized for quality rather than speed. Both the inclusion of multiple models and the focus on quality increase the size of the language data. Setting a non-default `langData` path may result in significantly smaller files being downloaded.
For example, by changing `langPath` from the default (`https://tessdata.projectnaptha.com/4.0.0`) to `https://tessdata.projectnaptha.com/4.0.0_fast` the size of the compressed English language data is reduced from 10.9MB to 2.0MB. Note that this language data (1) only supports the default LSTM model and (2) is optimized for size/speed rather than quality, so users should switch only after testing whether this data works for their application.
# Reducing Recognition Runtime # Reducing Recognition Runtime
### Use the Latest Version of Tesseract.js ### Use the Latest Version of Tesseract.js
Old versions of Tesseract.js are significantly slower. Notably, v2 (now deprecated) takes 10x longer to recognize certain images compared to the latest version. Old versions of Tesseract.js are significantly slower. Notably, v2 (now deprecated) takes 10x longer to recognize certain images compared to the latest version.
### Consider Using the Legacy Model ### Do Not Set `corePath` to a Single `.js` file
In general, the LSTM (default) recognition model provides the best quality. However, the Legacy model generally runs faster, and depending on your application, may provide sufficient recognition quality. If runtime is a significant concern, consider experimenting with the Legacy model (by setting `oem` to `”0”` within `worker.initialize`). As a rule of thumb, the Legacy model is usually viable when the input data is high-quality (high-definition screenshots, document scans, etc.). If you set the `corePath` argument, be sure to set it to a directory that contains **all 4** of these files:
### Consider Using "Fast" Language Data 1. `tesseract-core.wasm.js`
By default, Tesseract.js uses language data that is optimized for quality rather than speed. You can also experiment with using language data that is optimized for speed by setting `langPath` to `https://tessdata.projectnaptha.com/4.0.0_fast`. 2. `tesseract-core-simd.wasm.js`
3. `tesseract-core-lstm.wasm.js`
4. `tesseract-core-simd-lstm.wasm.js`
### Do Not Set `corePath` to a Single `.js` file Tesseract.js needs to be able to pick between these files—setting `corePath` to a specific `.js` file will significantly degrade performance or compatibility.
If you set the `corePath` argument, be sure to set it to a directory that contains both `tesseract-core.wasm.js` or `tesseract-core-simd.wasm.js`. Tesseract.js needs to be able to pick between both files—setting `corePath` to a specific `.js` file will significantly degrade performance or compatibility. See [this comment](https://github.com/naptha/tesseract.js/issues/735#issuecomment-1519157646) for explanation.
### Consider Using "Fast" Language Data
By default, Tesseract.js uses language data that is optimized for quality rather than speed. You can also experiment with using language data that is optimized for speed by setting `langPath` to `https://tessdata.projectnaptha.com/4.0.0_fast`. We have not benchmarked the impact this has on performance or accuracy, so feel free to open a Git Issue if you do so and wish to share results.

@ -0,0 +1,51 @@
# Overview
Tesseract.js offers 2 ways to run recognition jobs: (1) using a worker directly, or (2) using a scheduler to run jobs on multiple workers in parallel. The syntax for the latter is more complicated, but using parallel processing via schedulers provides significantly better performance for large jobs. For more detailed documentation on each function, see the [api page](./api.md).
# Option 1: Using Workers Directly
Tesseract.js supports creating and managing workers (the objects that execute recognition) manually.
```
(async () => {
const worker = await Tesseract.createWorker('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
In actual use, the `createWorker` step should be separated from the `worker.recognize` step. Doing so enables the following benefits:
1. Workers can be prepared ahead of time
- E.g. a worker can be created when the page is first loaded, rather than waiting for a user to upload an image to recognize
1. Workers can be reused for multiple recognition jobs, rather than creating a new worker and loading language data for every image recognized
- Remember to call `worker.terminate()` after all recognition is complete to free memory
# Option 2: Using Schedulers + Workers
Tesseract.js also supports executing jobs using schedulers. A scheduler is an object that contains multiple workers, which it uses to execute jobs in parallel. For example, the following code executes 10 jobs in parallel using 4 workers.
```
const scheduler = Tesseract.createScheduler();
// Creates worker and adds to scheduler
const workerGen = async () => {
const worker = await Tesseract.createWorker('eng');
scheduler.addWorker(worker);
}
const workerN = 4;
(async () => {
const resArr = Array(workerN);
for (let i=0; i<workerN; i++) {
resArr[i] = workerGen();
}
await Promise.all(resArr);
/** Add 10 recognition jobs */
const results = await Promise.all(Array(10).fill(0).map(() => (
scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png').then((x) => console.log(x.data.text))
)))
await scheduler.terminate(); // It also terminates all workers.
})();
```
While using schedulers is no more efficient for a single job, they allow for quickly executing large numbers of jobs in parallel.
When working with schedulers, note that workers added to the same scheduler should all be homogeneous: they should use the same language and be configured with the same parameters. Schedulers assign jobs to workers in a non-deterministic manner, so if the workers are not identical, recognition results will depend on which worker a job is assigned to.

@ -1,7 +1,7 @@
<!DOCTYPE HTML> <!DOCTYPE HTML>
<html> <html>
<head> <head>
<script src="/dist/tesseract.dev.js"></script> <script src="/dist/tesseract.min.js"></script>
</head> </head>
<body> <body>
<input type="file" id="uploader" multiple> <input type="file" id="uploader" multiple>
@ -10,16 +10,12 @@
// This is a basic example more efficient than "basic.html". // This is a basic example more efficient than "basic.html".
// In this example we create a worker once, and this worker is re-used // In this example we create a worker once, and this worker is re-used
// every time the user uploads a new file. // every time the user uploads a new file.
const worker = await Tesseract.createWorker("eng", 1, {
const worker = await Tesseract.createWorker({
corePath: '../../node_modules/tesseract.js-core', corePath: '../../node_modules/tesseract.js-core',
workerPath: "/dist/worker.dev.js", workerPath: "/dist/worker.min.js",
logger: function(m){console.log(m);} logger: function(m){console.log(m);}
}); });
await worker.loadLanguage('eng');
await worker.initialize('eng');
const recognize = async function(evt){ const recognize = async function(evt){
const files = evt.target.files; const files = evt.target.files;

@ -1,7 +1,7 @@
<!DOCTYPE HTML> <!DOCTYPE HTML>
<html> <html>
<head> <head>
<script src="/dist/tesseract.dev.js"></script> <script src="/dist/tesseract.min.js"></script>
</head> </head>
<body> <body>
<input type="file" id="uploader" multiple> <input type="file" id="uploader" multiple>
@ -16,13 +16,11 @@
// Creates worker and adds to scheduler // Creates worker and adds to scheduler
const workerGen = async () => { const workerGen = async () => {
const worker = await Tesseract.createWorker({ const worker = await Tesseract.createWorker("eng", 1, {
corePath: '../../node_modules/tesseract.js-core', corePath: '../../node_modules/tesseract.js-core',
workerPath: "/dist/worker.dev.js", workerPath: "/dist/worker.min.js",
logger: function(m){console.log(m);} logger: function(m){console.log(m);}
}); });
await worker.loadLanguage('eng');
await worker.initialize('eng');
scheduler.addWorker(worker); scheduler.addWorker(worker);
} }

@ -1,26 +0,0 @@
<html>
<head>
<script src="/dist/tesseract.dev.js"></script>
</head>
<body>
<input type="file" id="uploader">
<script>
// This is the most basic example (contains a single function call).
// However, in cases when multiple recognition jobs are run,
// calling Tesseract.recognize() each time is inefficient.
// See "basic-efficient.html" for a more efficient example.
const recognize = async ({ target: { files } }) => {
const { data: { text } } = await Tesseract.recognize(files[0], 'eng', {
corePath: '../../node_modules/tesseract.js-core',
workerPath: "/dist/worker.dev.js",
logger: m => console.log(m),
});
console.log(text);
}
const elm = document.getElementById('uploader');
elm.addEventListener('change', recognize);
</script>
</body>
</html>

@ -1,160 +0,0 @@
<script src="/dist/tesseract.dev.js"></script>
<script>
function progressUpdate(packet){
var log = document.getElementById('log');
if(log.firstChild && log.firstChild.status === packet.status){
if('progress' in packet){
var progress = log.firstChild.querySelector('progress')
progress.value = packet.progress
}
}else{
var line = document.createElement('div');
line.status = packet.status;
var status = document.createElement('div')
status.className = 'status'
status.appendChild(document.createTextNode(packet.status))
line.appendChild(status)
if('progress' in packet){
var progress = document.createElement('progress')
progress.value = packet.progress
progress.max = 1
line.appendChild(progress)
}
if(packet.status == 'done'){
var pre = document.createElement('pre')
pre.appendChild(document.createTextNode(packet.data.data.text))
line.innerHTML = ''
line.appendChild(pre)
}
log.insertBefore(line, log.firstChild)
}
}
async function recognizeFile(file) {
document.querySelector("#log").innerHTML = ''
const corePath = '../../node_modules/tesseract.js-core';
const lang = document.querySelector('#langsel').value
const data = await Tesseract.recognize(file, lang, {
corePath,
logger: progressUpdate,
});
progressUpdate({ status: 'done', data });
}
</script>
<select id="langsel" onchange="window.lastFile && recognizeFile(window.lastFile)">
<option value='afr' > Afrikaans </option>
<option value='ara' > Arabic </option>
<option value='aze' > Azerbaijani </option>
<option value='bel' > Belarusian </option>
<option value='ben' > Bengali </option>
<option value='bul' > Bulgarian </option>
<option value='cat' > Catalan </option>
<option value='ces' > Czech </option>
<option value='chi_sim' > Chinese </option>
<option value='chi_tra' > Traditional Chinese </option>
<option value='chr' > Cherokee </option>
<option value='dan' > Danish </option>
<option value='deu' > German </option>
<option value='ell' > Greek </option>
<option value='eng' selected> English </option>
<option value='enm' > English (Old) </option>
<option value='meme' > Internet Meme </option>
<option value='epo' > Esperanto </option>
<option value='epo_alt' > Esperanto alternative </option>
<option value='est' > Estonian </option>
<option value='eus' > Basque </option>
<option value='fin' > Finnish </option>
<option value='fra' > French </option>
<option value='frk' > Frankish </option>
<option value='frm' > French (Old) </option>
<option value='glg' > Galician </option>
<option value='grc' > Ancient Greek </option>
<option value='heb' > Hebrew </option>
<option value='hin' > Hindi </option>
<option value='hrv' > Croatian </option>
<option value='hun' > Hungarian </option>
<option value='ind' > Indonesian </option>
<option value='isl' > Icelandic </option>
<option value='ita' > Italian </option>
<option value='ita_old' > Italian (Old) </option>
<option value='jpn' > Japanese </option>
<option value='kan' > Kannada </option>
<option value='kor' > Korean </option>
<option value='lav' > Latvian </option>
<option value='lit' > Lithuanian </option>
<option value='mal' > Malayalam </option>
<option value='mkd' > Macedonian </option>
<option value='mlt' > Maltese </option>
<option value='msa' > Malay </option>
<option value='nld' > Dutch </option>
<option value='nor' > Norwegian </option>
<option value='pol' > Polish </option>
<option value='por' > Portuguese </option>
<option value='ron' > Romanian </option>
<option value='rus' > Russian </option>
<option value='slk' > Slovakian </option>
<option value='slv' > Slovenian </option>
<option value='spa' > Spanish </option>
<option value='spa_old' > Old Spanish </option>
<option value='sqi' > Albanian </option>
<option value='srp' > Serbian (Latin) </option>
<option value='swa' > Swahili </option>
<option value='swe' > Swedish </option>
<option value='tam' > Tamil </option>
<option value='tel' > Telugu </option>
<option value='tgl' > Tagalog </option>
<option value='tha' > Thai </option>
<option value='tur' > Turkish </option>
<option value='ukr' > Ukrainian </option>
<option value='vie' > Vietnamese </option>
</select>
<button onclick="recognizeFile('../../tests/assets/images/simple.png')">Sample Image</button>
<input type="file" onchange="recognizeFile(window.lastFile=this.files[0])">
<div id="log"></div>
<style>
#log > div {
color: #313131;
border-top: 1px solid #dadada;
padding: 9px;
display: flex;
}
#log > div:first-child {
border: 0;
}
.status {
min-width: 250px;
}
#log {
border: 1px solid #dadada;
padding: 10px;
margin-top: 20px;
min-height: 100px;
}
body {
font-family: sans-serif;
margin: 30px;
}
progress {
display: block;
width: 100%;
transition: opacity 0.5s linear;
}
progress[value="1"] {
opacity: 0.5;
}
</style>

@ -1,6 +1,6 @@
<html> <html>
<head> <head>
<script src="/dist/tesseract.dev.js"></script> <script src="/dist/tesseract.min.js"></script>
</head> </head>
<body> <body>
<div> <div>
@ -10,17 +10,15 @@
<textarea id="board" readonly rows="8" cols="80">Upload an image file</textarea> <textarea id="board" readonly rows="8" cols="80">Upload an image file</textarea>
<script type="module"> <script type="module">
const { createWorker } = Tesseract; const { createWorker } = Tesseract;
const worker = await createWorker({ const worker = await createWorker("eng", 1, {
corePath: '/node_modules/tesseract.js-core', corePath: '/node_modules/tesseract.js-core',
workerPath: "/dist/worker.dev.js", workerPath: "/dist/worker.min.js",
logger: m => console.log(m), logger: m => console.log(m),
}); });
const uploader = document.getElementById('uploader'); const uploader = document.getElementById('uploader');
const dlBtn = document.getElementById('download-pdf'); const dlBtn = document.getElementById('download-pdf');
let pdf; let pdf;
const recognize = async ({ target: { files } }) => { const recognize = async ({ target: { files } }) => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
const res = await worker.recognize(files[0],{pdfTitle: "Example PDF"},{pdf: true}); const res = await worker.recognize(files[0],{pdfTitle: "Example PDF"},{pdf: true});
pdf = res.data.pdf; pdf = res.data.pdf;
const text = res.data.text; const text = res.data.text;

@ -1,7 +1,7 @@
<html> <html>
<head> <head>
<script src="/dist/tesseract.dev.js"></script> <script src="/dist/tesseract.min.js"></script>
<style> <style>
.column { .column {
float: left; float: left;
@ -37,14 +37,10 @@
<script> <script>
const recognize = async ({ target: { files } }) => { const recognize = async ({ target: { files } }) => {
document.getElementById("imgInput").src = URL.createObjectURL(files[0]); document.getElementById("imgInput").src = URL.createObjectURL(files[0]);
const worker = await Tesseract.createWorker({ const worker = await Tesseract.createWorker("eng", 1, {
// corePath: '/tesseract-core-simd.wasm.js', // corePath: '/tesseract-core-simd.wasm.js',
workerPath: "/dist/worker.dev.js" workerPath: "/dist/worker.min.js"
}); });
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.initialize();
const ret = await worker.recognize(files[0], {rotateAuto: true}, {imageColor: true, imageGrey: true, imageBinary: true}); const ret = await worker.recognize(files[0], {rotateAuto: true}, {imageColor: true, imageGrey: true, imageBinary: true});
document.getElementById("imgOriginal").src = ret.data.imageColor; document.getElementById("imgOriginal").src = ret.data.imageColor;
document.getElementById("imgGrey").src = ret.data.imageGrey; document.getElementById("imgGrey").src = ret.data.imageGrey;

@ -1,13 +0,0 @@
#!/usr/bin/env node
const path = require('node:path');
const Tesseract = require('../../');
const [,, imagePath] = process.argv;
const image = path.resolve(__dirname, (imagePath || '../../tests/assets/images/cosmic.png'));
console.log(`Recognizing ${image}`);
Tesseract.detect(image, { logger: m => console.log(m) })
.then(({ data }) => {
console.log(data);
});

@ -10,8 +10,6 @@ console.log(`Recognizing ${image}`);
(async () => { (async () => {
const worker = await createWorker(); const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text, pdf } } = await worker.recognize(image, {pdfTitle: "Example PDF"}, {pdf: true}); const { data: { text, pdf } } = await worker.recognize(image, {pdfTitle: "Example PDF"}, {pdf: true});
console.log(text); console.log(text);
fs.writeFileSync('tesseract-ocr-result.pdf', Buffer.from(pdf)); fs.writeFileSync('tesseract-ocr-result.pdf', Buffer.from(pdf));

@ -21,8 +21,6 @@ const convertImage = (imageSrc) => {
(async () => { (async () => {
const worker = await createWorker(); const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { imageColor, imageGrey, imageBinary } } = await worker.recognize(image, {rotateAuto: true}, {imageColor: true, imageGrey: true, imageBinary: true}); const { data: { imageColor, imageGrey, imageBinary } } = await worker.recognize(image, {rotateAuto: true}, {imageColor: true, imageGrey: true, imageBinary: true});
console.log('Saving intermediate images: imageColor.png, imageGrey.png, imageBinary.png'); console.log('Saving intermediate images: imageColor.png, imageGrey.png, imageBinary.png');

@ -8,11 +8,9 @@ const image = path.resolve(__dirname, (imagePath || '../../tests/assets/images/c
console.log(`Recognizing ${image}`); console.log(`Recognizing ${image}`);
(async () => { (async () => {
const worker = await createWorker({ const worker = await createWorker("eng", 1, {
logger: m => console.log(m), logger: m => console.log(m),
}); });
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize(image); const { data: { text } } = await worker.recognize(image);
console.log(text); console.log(text);
await worker.terminate(); await worker.terminate();

@ -1,12 +1,19 @@
const { createWorker, createScheduler } = require('../../'); const { createWorker, createScheduler } = require('../../');
const path = require('path');
const [,, imagePath] = process.argv;
// Note: This example recognizes the same image 4 times in parallel
// to show how schedulers can be used to speed up bulk jobs.
// In actual use you would (obviously) not want to run multiple identical jobs.
const image = path.resolve(__dirname, (imagePath || '../../tests/assets/images/cosmic.png'));
const imageArr = [image, image, image, image];
const scheduler = createScheduler(); const scheduler = createScheduler();
// Creates worker and adds to scheduler // Creates worker and adds to scheduler
const workerGen = async () => { const workerGen = async () => {
const worker = await createWorker({cachePath: "."}); const worker = await createWorker("eng", 1, {cachePath: "."});
await worker.loadLanguage('eng');
await worker.initialize('eng');
scheduler.addWorker(worker); scheduler.addWorker(worker);
} }
@ -14,12 +21,17 @@ const workerN = 4;
(async () => { (async () => {
const resArr = Array(workerN); const resArr = Array(workerN);
for (let i=0; i<workerN; i++) { for (let i=0; i<workerN; i++) {
resArr[i] = await workerGen(); resArr[i] = workerGen();
} }
await Promise.all(resArr); await Promise.all(resArr);
/** Add 4 recognition jobs */
const results = await Promise.all(Array(10).fill(0).map(() => ( const resArr2 = Array(imageArr.length);
scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png').then((x) => console.log(x.data.text))
))) for (let i = 0; i < imageArr.length; i++) {
resArr2[i] = scheduler.addJob('recognize', image).then((x) => console.log(x.data.text));
}
await Promise.all(resArr2);
await scheduler.terminate(); // It also terminates all workers. await scheduler.terminate(); // It also terminates all workers.
})(); })();

14
package-lock.json generated

@ -17,7 +17,7 @@
"node-fetch": "^2.6.9", "node-fetch": "^2.6.9",
"opencollective-postinstall": "^2.0.3", "opencollective-postinstall": "^2.0.3",
"regenerator-runtime": "^0.13.3", "regenerator-runtime": "^0.13.3",
"tesseract.js-core": "^4.0.4", "tesseract.js-core": "^5.0.0-beta.1",
"wasm-feature-detect": "^1.2.11", "wasm-feature-detect": "^1.2.11",
"zlibjs": "^0.3.1" "zlibjs": "^0.3.1"
}, },
@ -8663,9 +8663,9 @@
} }
}, },
"node_modules/tesseract.js-core": { "node_modules/tesseract.js-core": {
"version": "4.0.4", "version": "5.0.0-beta.1",
"resolved": "https://registry.npmjs.org/tesseract.js-core/-/tesseract.js-core-4.0.4.tgz", "resolved": "https://registry.npmjs.org/tesseract.js-core/-/tesseract.js-core-5.0.0-beta.1.tgz",
"integrity": "sha512-MJ+vtktjAaT0681uPl6TDUPhbRbpD/S9emko5rtorgHRZpQo7R3BG7h+3pVHgn1KjfNf1bvnx4B7KxEK8YKqpg==" "integrity": "sha512-lzRLGeNWVwGLi96unpzmYqXshdGWF/IR8LY5Ds+em6twjYQVSQlvpSgJ+2Y5vfxOzbtiFif0gtSZYBqzH4u03w=="
}, },
"node_modules/test-exclude": { "node_modules/test-exclude": {
"version": "6.0.0", "version": "6.0.0",
@ -16060,9 +16060,9 @@
} }
}, },
"tesseract.js-core": { "tesseract.js-core": {
"version": "4.0.4", "version": "5.0.0-beta.1",
"resolved": "https://registry.npmjs.org/tesseract.js-core/-/tesseract.js-core-4.0.4.tgz", "resolved": "https://registry.npmjs.org/tesseract.js-core/-/tesseract.js-core-5.0.0-beta.1.tgz",
"integrity": "sha512-MJ+vtktjAaT0681uPl6TDUPhbRbpD/S9emko5rtorgHRZpQo7R3BG7h+3pVHgn1KjfNf1bvnx4B7KxEK8YKqpg==" "integrity": "sha512-lzRLGeNWVwGLi96unpzmYqXshdGWF/IR8LY5Ds+em6twjYQVSQlvpSgJ+2Y5vfxOzbtiFif0gtSZYBqzH4u03w=="
}, },
"test-exclude": { "test-exclude": {
"version": "6.0.0", "version": "6.0.0",

@ -12,7 +12,7 @@
"profile:tesseract": "webpack-bundle-analyzer dist/tesseract-stats.json", "profile:tesseract": "webpack-bundle-analyzer dist/tesseract-stats.json",
"profile:worker": "webpack-bundle-analyzer dist/worker-stats.json", "profile:worker": "webpack-bundle-analyzer dist/worker-stats.json",
"prepublishOnly": "npm run build", "prepublishOnly": "npm run build",
"wait": "rimraf dist && wait-on http://localhost:3000/dist/tesseract.dev.js", "wait": "rimraf dist && wait-on http://localhost:3000/dist/tesseract.min.js",
"test": "npm-run-all -p -r start test:all", "test": "npm-run-all -p -r start test:all",
"test:all": "npm-run-all wait test:browser:* test:node:all", "test:all": "npm-run-all wait test:browser:* test:node:all",
"test:node": "nyc mocha --exit --bail --require ./scripts/test-helper.js", "test:node": "nyc mocha --exit --bail --require ./scripts/test-helper.js",
@ -69,7 +69,7 @@
"node-fetch": "^2.6.9", "node-fetch": "^2.6.9",
"opencollective-postinstall": "^2.0.3", "opencollective-postinstall": "^2.0.3",
"regenerator-runtime": "^0.13.3", "regenerator-runtime": "^0.13.3",
"tesseract.js-core": "^4.0.4", "tesseract.js-core": "^5.0.0",
"wasm-feature-detect": "^1.2.11", "wasm-feature-detect": "^1.2.11",
"zlibjs": "^0.3.1" "zlibjs": "^0.3.1"
}, },

@ -3,7 +3,7 @@ const middleware = require('webpack-dev-middleware');
const express = require('express'); const express = require('express');
const path = require('node:path'); const path = require('node:path');
const cors = require('cors'); const cors = require('cors');
const webpackConfig = require('./webpack.config.dev'); const webpackConfig = require('./webpack.config.prod');
const compiler = webpack(webpackConfig); const compiler = webpack(webpackConfig);
const app = express(); const app = express();

@ -1,49 +0,0 @@
const path = require('node:path');
const webpack = require('webpack');
const { BundleAnalyzerPlugin } = require('webpack-bundle-analyzer');
const common = require('./webpack.config.common');
const genConfig = ({
entry, filename, library, libraryTarget,
}) => ({
...common,
mode: 'development',
devtool: 'source-map',
entry,
output: {
filename,
library,
libraryTarget,
},
plugins: [
new webpack.ProvidePlugin({
Buffer: ['buffer', 'Buffer'],
}),
new webpack.DefinePlugin({
'process.env': {
TESS_ENV: JSON.stringify('development'),
},
}),
new BundleAnalyzerPlugin({
analyzerMode: 'disable',
statsFilename: `${filename.split('.')[0]}-stats.json`,
generateStatsFile: true
}),
],
devServer: {
allowedHosts: ['localhost', '.gitpod.io'],
},
});
module.exports = [
genConfig({
entry: path.resolve(__dirname, '..', 'src', 'index.js'),
filename: 'tesseract.dev.js',
library: 'Tesseract',
libraryTarget: 'umd',
}),
genConfig({
entry: path.resolve(__dirname, '..', 'src', 'worker-script', 'browser', 'index.js'),
filename: 'worker.dev.js',
}),
];

@ -1,9 +1,7 @@
const createWorker = require('./createWorker'); const createWorker = require('./createWorker');
const recognize = async (image, langs, options) => { const recognize = async (image, langs, options) => {
const worker = await createWorker(options); const worker = await createWorker(langs, 1, options);
await worker.loadLanguage(langs);
await worker.initialize(langs);
return worker.recognize(image) return worker.recognize(image)
.finally(async () => { .finally(async () => {
await worker.terminate(); await worker.terminate();
@ -11,9 +9,7 @@ const recognize = async (image, langs, options) => {
}; };
const detect = async (image, options) => { const detect = async (image, options) => {
const worker = await createWorker(options); const worker = await createWorker('osd', 0, options);
await worker.loadLanguage('osd');
await worker.initialize('osd');
return worker.detect(image) return worker.detect(image)
.finally(async () => { .finally(async () => {
await worker.terminate(); await worker.terminate();

@ -1,5 +0,0 @@
const OEM = require('./OEM');
module.exports = {
defaultOEM: OEM.DEFAULT,
};

@ -1,8 +1,4 @@
module.exports = { module.exports = {
/*
* default path for downloading *.traineddata
*/
langPath: 'https://tessdata.projectnaptha.com/4.0.0',
/* /*
* Use BlobURL for worker script by default * Use BlobURL for worker script by default
* TODO: remove this option * TODO: remove this option

@ -3,7 +3,7 @@ const circularize = require('./utils/circularize');
const createJob = require('./createJob'); const createJob = require('./createJob');
const { log } = require('./utils/log'); const { log } = require('./utils/log');
const getId = require('./utils/getId'); const getId = require('./utils/getId');
const { defaultOEM } = require('./constants/config'); const OEM = require('./constants/OEM');
const { const {
defaultOptions, defaultOptions,
spawnWorker, spawnWorker,
@ -15,7 +15,7 @@ const {
let workerCounter = 0; let workerCounter = 0;
module.exports = async (_options = {}) => { module.exports = async (langs = 'eng', oem = OEM.LSTM_ONLY, _options = {}, config = {}) => {
const id = getId('Worker', workerCounter); const id = getId('Worker', workerCounter);
const { const {
logger, logger,
@ -28,6 +28,13 @@ module.exports = async (_options = {}) => {
const resolves = {}; const resolves = {};
const rejects = {}; const rejects = {};
// Current langs, oem, and config file.
// Used if the user ever re-initializes the worker using `worker.reinitialize`.
const currentLangs = typeof langs === 'string' ? langs.split('+') : langs;
let currentOem = oem;
let currentConfig = config;
const lstmOnlyCore = [OEM.DEFAULT, OEM.LSTM_ONLY].includes(oem) && !options.legacyCore;
let workerResReject; let workerResReject;
let workerResResolve; let workerResResolve;
const workerRes = new Promise((resolve, reject) => { const workerRes = new Promise((resolve, reject) => {
@ -69,7 +76,7 @@ module.exports = async (_options = {}) => {
const loadInternal = (jobId) => ( const loadInternal = (jobId) => (
startJob(createJob({ startJob(createJob({
id: jobId, action: 'load', payload: { options }, id: jobId, action: 'load', payload: { options: { lstmOnly: lstmOnlyCore, corePath: options.corePath, logging: options.logging } },
})) }))
); );
@ -105,22 +112,62 @@ module.exports = async (_options = {}) => {
})) }))
); );
const loadLanguage = (langs = 'eng', jobId) => ( const loadLanguage = () => (
startJob(createJob({ console.warn('`loadLanguage` is deprecated and should be removed from code (workers now come with language pre-loaded)')
);
const loadLanguageInternal = (_langs, jobId) => startJob(createJob({
id: jobId, id: jobId,
action: 'loadLanguage', action: 'loadLanguage',
payload: { langs, options }, payload: {
})) langs: _langs,
options: {
langPath: options.langPath,
dataPath: options.dataPath,
cachePath: options.cachePath,
cacheMethod: options.cacheMethod,
gzip: options.gzip,
lstmOnly: [OEM.DEFAULT, OEM.LSTM_ONLY].includes(currentOem)
&& !options.legacyLang,
},
},
}));
const initialize = () => (
console.warn('`initialize` is deprecated and should be removed from code (workers now come pre-initialized)')
); );
const initialize = (langs = 'eng', oem = defaultOEM, config, jobId) => ( const initializeInternal = (_langs, _oem, _config, jobId) => (
startJob(createJob({ startJob(createJob({
id: jobId, id: jobId,
action: 'initialize', action: 'initialize',
payload: { langs, oem, config }, payload: { langs: _langs, oem: _oem, config: _config },
})) }))
); );
const reinitialize = (langs = 'eng', oem, config, jobId) => { // eslint-disable-line
if (lstmOnlyCore && [OEM.TESSERACT_ONLY, OEM.TESSERACT_LSTM_COMBINED].includes(oem)) throw Error('Legacy model requested but code missing.');
const _oem = oem || currentOem;
currentOem = _oem;
const _config = config || currentConfig;
currentConfig = _config;
// Only load langs that are not already loaded.
// This logic fails if the user downloaded the LSTM-only English data for a language
// and then uses `worker.reinitialize` to switch to the Legacy engine.
// However, the correct data will still be downloaded after initialization fails
// and this can be avoided entirely
const langsArr = typeof langs === 'string' ? langs.split('+') : langs;
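const _langs = langsArr.filter((x) => !currentLangs.includes(x));
currentLangs.push(..._langs);
// Skip the download step when every requested language is already loaded.
if (_langs.length === 0) return initializeInternal(langs, _oem, _config, jobId);
return loadLanguageInternal(_langs, jobId)
.then(() => initializeInternal(langs, _oem, _config, jobId));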
};
const setParameters = (params = {}, jobId) => ( const setParameters = (params = {}, jobId) => (
startJob(createJob({ startJob(createJob({
id: jobId, id: jobId,
@ -148,13 +195,15 @@ module.exports = async (_options = {}) => {
})); }));
}; };
const detect = async (image, jobId) => ( const detect = async (image, jobId) => {
startJob(createJob({ if (lstmOnlyCore) throw Error('`worker.detect` requires Legacy model, which was not loaded.');
return startJob(createJob({
id: jobId, id: jobId,
action: 'detect', action: 'detect',
payload: { image: await loadImage(image) }, payload: { image: await loadImage(image) },
})) }));
); };
const terminate = async () => { const terminate = async () => {
if (worker !== null) { if (worker !== null) {
@ -207,6 +256,7 @@ module.exports = async (_options = {}) => {
FS, FS,
loadLanguage, loadLanguage,
initialize, initialize,
reinitialize,
setParameters, setParameters,
recognize, recognize,
getPDF, getPDF,
@ -214,7 +264,11 @@ module.exports = async (_options = {}) => {
terminate, terminate,
}; };
loadInternal().then(() => workerResResolve(resolveObj)).catch(() => {}); loadInternal()
.then(() => loadLanguageInternal(langs))
.then(() => initializeInternal(langs, oem, config))
.then(() => workerResResolve(resolveObj))
.catch(() => {});
return workerRes; return workerRes;
}; };

7
src/index.d.ts vendored

@ -1,6 +1,6 @@
declare namespace Tesseract { declare namespace Tesseract {
function createScheduler(): Scheduler function createScheduler(): Scheduler
function createWorker(options?: Partial<WorkerOptions>): Promise<Worker> function createWorker(langs?: string | Lang[], oem?: OEM, options?: Partial<WorkerOptions>, config?: string | Partial<InitOptions>): Promise<Worker>
function setLogging(logging: boolean): void function setLogging(logging: boolean): void
function recognize(image: ImageLike, langs?: string, options?: Partial<WorkerOptions>): Promise<RecognizeResult> function recognize(image: ImageLike, langs?: string, options?: Partial<WorkerOptions>): Promise<RecognizeResult>
function detect(image: ImageLike, options?: Partial<WorkerOptions>): any function detect(image: ImageLike, options?: Partial<WorkerOptions>): any
@ -20,8 +20,7 @@ declare namespace Tesseract {
readText(path: string, jobId?: string): Promise<ConfigResult> readText(path: string, jobId?: string): Promise<ConfigResult>
removeText(path: string, jobId?: string): Promise<ConfigResult> removeText(path: string, jobId?: string): Promise<ConfigResult>
FS(method: string, args: any[], jobId?: string): Promise<ConfigResult> FS(method: string, args: any[], jobId?: string): Promise<ConfigResult>
loadLanguage(langs?: string | Lang[], jobId?: string): Promise<ConfigResult> reinitialize(langs?: string | Lang[], oem?: OEM, config?: string | Partial<InitOptions>, jobId?: string): Promise<ConfigResult>
initialize(langs?: string | Lang[], oem?: OEM, config?: string | Partial<InitOptions>, jobId?: string): Promise<ConfigResult>
setParameters(params: Partial<WorkerParams>, jobId?: string): Promise<ConfigResult> setParameters(params: Partial<WorkerParams>, jobId?: string): Promise<ConfigResult>
getImage(type: imageType): string getImage(type: imageType): string
recognize(image: ImageLike, options?: Partial<RecognizeOptions>, output?: Partial<OutputFormats>, jobId?: string): Promise<RecognizeResult> recognize(image: ImageLike, options?: Partial<RecognizeOptions>, output?: Partial<OutputFormats>, jobId?: string): Promise<RecognizeResult>
@ -61,6 +60,8 @@ declare namespace Tesseract {
cacheMethod: string cacheMethod: string
workerBlobURL: boolean workerBlobURL: boolean
gzip: boolean gzip: boolean
legacyLang: boolean
legacyCore: boolean
logger: (arg: LoggerMessage) => void, logger: (arg: LoggerMessage) => void,
errorHandler: (arg: any) => void errorHandler: (arg: any) => void
} }

@ -1,9 +1,11 @@
const { simd } = require('wasm-feature-detect'); const { simd } = require('wasm-feature-detect');
const { dependencies } = require('../../../package.json'); const { dependencies } = require('../../../package.json');
module.exports = async (corePath, res) => { module.exports = async (lstmOnly, corePath, res) => {
if (typeof global.TesseractCore === 'undefined') { if (typeof global.TesseractCore === 'undefined') {
res.progress({ status: 'loading tesseract core', progress: 0 }); const statusText = 'loading tesseract core';
res.progress({ status: statusText, progress: 0 });
// If the user specifies a core path, we use that // If the user specifies a core path, we use that
// Otherwise, default to CDN // Otherwise, default to CDN
@ -19,7 +21,13 @@ module.exports = async (corePath, res) => {
} else { } else {
const simdSupport = await simd(); const simdSupport = await simd();
if (simdSupport) { if (simdSupport) {
if (lstmOnly) {
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core-simd-lstm.wasm.js`;
} else {
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core-simd.wasm.js`; corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core-simd.wasm.js`;
}
} else if (lstmOnly) {
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core-lstm.wasm.js`;
} else { } else {
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core.wasm.js`; corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core.wasm.js`;
} }
@ -36,7 +44,7 @@ module.exports = async (corePath, res) => {
} else if (typeof global.TesseractCore === 'undefined') { } else if (typeof global.TesseractCore === 'undefined') {
throw Error('Failed to load TesseractCore'); throw Error('Failed to load TesseractCore');
} }
res.progress({ status: 'loading tesseract core', progress: 1 }); res.progress({ status: statusText, progress: 1 });
} }
return global.TesseractCore; return global.TesseractCore;
}; };

@ -28,15 +28,19 @@ let api = null;
let latestJob; let latestJob;
let adapter = {}; let adapter = {};
let params = defaultParams; let params = defaultParams;
let cachePathWorker; let loadLanguageLangsWorker;
let cacheMethodWorker; let loadLanguageOptionsWorker;
let dataFromCache = false;
const load = async ({ workerId, jobId, payload: { options: { corePath, logging } } }, res) => { const load = async ({ workerId, jobId, payload: { options: { lstmOnly, corePath, logging } } }, res) => { // eslint-disable-line max-len
setLogging(logging); setLogging(logging);
const statusText = 'initializing tesseract';
if (!TessModule) { if (!TessModule) {
const Core = await adapter.getCore(corePath, res); const Core = await adapter.getCore(lstmOnly, corePath, res);
res.progress({ workerId, status: 'initializing tesseract', progress: 0 }); res.progress({ workerId, status: statusText, progress: 0 });
Core({ Core({
TesseractProgress(percent) { TesseractProgress(percent) {
@ -49,7 +53,7 @@ const load = async ({ workerId, jobId, payload: { options: { corePath, logging }
}, },
}).then((tessModule) => { }).then((tessModule) => {
TessModule = tessModule; TessModule = tessModule;
res.progress({ workerId, status: 'initialized tesseract', progress: 1 }); res.progress({ workerId, status: statusText, progress: 1 });
res.resolve({ loaded: true }); res.resolve({ loaded: true });
}); });
} else { } else {
@ -72,13 +76,26 @@ const loadLanguage = async ({
cachePath, cachePath,
cacheMethod, cacheMethod,
gzip = true, gzip = true,
lstmOnly,
}, },
}, },
}, },
res) => { res) => {
// Remember cache options for later, as cache may be deleted if `initialize` fails // Remember options for later, as cache may be deleted if `initialize` fails
cachePathWorker = cachePath; loadLanguageLangsWorker = langs;
cacheMethodWorker = cacheMethod; loadLanguageOptionsWorker = {
langPath,
dataPath,
cachePath,
cacheMethod,
gzip,
lstmOnly,
};
const statusText = 'loading language traineddata';
const langsArr = typeof langs === 'string' ? langs.split('+') : langs;
let progress = 0;
const loadAndGunzipFile = async (_lang) => { const loadAndGunzipFile = async (_lang) => {
const lang = typeof _lang === 'string' ? _lang : _lang.code; const lang = typeof _lang === 'string' ? _lang : _lang.code;
@ -94,8 +111,8 @@ res) => {
const _data = await readCache(`${cachePath || '.'}/${lang}.traineddata`); const _data = await readCache(`${cachePath || '.'}/${lang}.traineddata`);
if (typeof _data !== 'undefined') { if (typeof _data !== 'undefined') {
log(`[${workerId}]: Load ${lang}.traineddata from cache`); log(`[${workerId}]: Load ${lang}.traineddata from cache`);
res.progress({ workerId, status: 'loading language traineddata (from cache)', progress: 0.5 });
data = _data; data = _data;
dataFromCache = true;
} else { } else {
throw Error('Not found in cache'); throw Error('Not found in cache');
} }
@ -106,14 +123,19 @@ res) => {
if (typeof _lang === 'string') { if (typeof _lang === 'string') {
let path = null; let path = null;
// If `langPath` is not explicitly set by the user, the jsdelivr CDN is used.
// Data supporting the Legacy model is only included if `lstmOnly` is not true.
// This saves a significant amount of data for the majority of users that use LSTM only.
const langPathDownload = langPath || (lstmOnly ? `https://cdn.jsdelivr.net/npm/@tesseract.js-data/${lang}/4.0.0_best_int` : `https://cdn.jsdelivr.net/npm/@tesseract.js-data/${lang}/4.0.0`);
// For Node.js, langPath may be a URL or local file path // For Node.js, langPath may be a URL or local file path
// The is-url package is used to tell the difference // The is-url package is used to tell the difference
// For the browser version, langPath is assumed to be a URL // For the browser version, langPath is assumed to be a URL
if (env !== 'node' || isURL(langPath) || langPath.startsWith('moz-extension://') || langPath.startsWith('chrome-extension://') || langPath.startsWith('file://')) { /** When langPath is an URL */ if (env !== 'node' || isURL(langPathDownload) || langPathDownload.startsWith('moz-extension://') || langPathDownload.startsWith('chrome-extension://') || langPathDownload.startsWith('file://')) { /** When langPathDownload is an URL */
path = langPath.replace(/\/$/, ''); path = langPathDownload.replace(/\/$/, '');
} }
// langPath is a URL, fetch from server // langPathDownload is a URL, fetch from server
if (path !== null) { if (path !== null) {
const fetchUrl = `${path}/${lang}.traineddata${gzip ? '.gz' : ''}`; const fetchUrl = `${path}/${lang}.traineddata${gzip ? '.gz' : ''}`;
const resp = await (env === 'webworker' ? fetch : adapter.fetch)(fetchUrl); const resp = await (env === 'webworker' ? fetch : adapter.fetch)(fetchUrl);
@ -122,16 +144,19 @@ res) => {
} }
data = new Uint8Array(await resp.arrayBuffer()); data = new Uint8Array(await resp.arrayBuffer());
// langPath is a local file, read .traineddata from local filesystem // langPathDownload is a local file, read .traineddata from local filesystem
// (adapter.readCache is a generic file read function in Node.js version) // (adapter.readCache is a generic file read function in Node.js version)
} else { } else {
data = await adapter.readCache(`${langPath}/${lang}.traineddata${gzip ? '.gz' : ''}`); data = await adapter.readCache(`${langPathDownload}/${lang}.traineddata${gzip ? '.gz' : ''}`);
} }
} else { } else {
data = _lang.data; // eslint-disable-line data = _lang.data; // eslint-disable-line
} }
} }
progress += 0.5 / langsArr.length;
if (res) res.progress({ workerId, status: statusText, progress });
// Check for gzip magic numbers (1F and 8B in hex) // Check for gzip magic numbers (1F and 8B in hex)
const isGzip = (data[0] === 31 && data[1] === 139) || (data[1] === 31 && data[0] === 139); const isGzip = (data[0] === 31 && data[1] === 139) || (data[1] === 31 && data[0] === 139);
@ -144,7 +169,7 @@ res) => {
try { try {
TessModule.FS.mkdir(dataPath); TessModule.FS.mkdir(dataPath);
} catch (err) { } catch (err) {
res.reject(err.toString()); if (res) res.reject(err.toString());
} }
} }
TessModule.FS.writeFile(`${dataPath || '.'}/${lang}.traineddata`, data); TessModule.FS.writeFile(`${dataPath || '.'}/${lang}.traineddata`, data);
@ -158,16 +183,19 @@ res) => {
log(err.toString()); log(err.toString());
} }
} }
return Promise.resolve();
progress += 0.5 / langsArr.length;
// Make sure last progress message is 1 (not 0.9999)
if (Math.round(progress * 100) === 100) progress = 1;
if (res) res.progress({ workerId, status: statusText, progress });
}; };
res.progress({ workerId, status: 'loading language traineddata', progress: 0 }); if (res) res.progress({ workerId, status: statusText, progress: 0 });
try { try {
await Promise.all((typeof langs === 'string' ? langs.split('+') : langs).map(loadAndGunzipFile)); await Promise.all(langsArr.map(loadAndGunzipFile));
res.progress({ workerId, status: 'loaded language traineddata', progress: 1 }); if (res) res.resolve(langs);
res.resolve(langs);
} catch (err) { } catch (err) {
res.reject(err.toString()); if (res) res.reject(err.toString());
} }
}; };
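
The progress accounting added to `loadAndGunzipFile` above can be sketched in isolation. This is a minimal illustration, not code from the commit: `simulateProgress` is a hypothetical helper, and it walks the languages sequentially, whereas the real function runs them concurrently through `Promise.all`, so the order of messages may differ. Each language contributes half of its share once its data is available and the other half once the data has been written.

```javascript
// Hypothetical helper (not part of tesseract.js): replays the progress
// arithmetic from loadAndGunzipFile for a list of language codes.
function simulateProgress(langsArr) {
  const events = [];
  let progress = 0;
  // Initial message sent before any language is processed.
  events.push({ lang: null, progress });
  for (const lang of langsArr) {
    // First half of this language's share: data fetched or read from cache.
    progress += 0.5 / langsArr.length;
    events.push({ lang, progress });
    // Second half: data written to the virtual filesystem.
    progress += 0.5 / langsArr.length;
    // Snap to exactly 1 so floating-point error never leaves 0.9999...
    if (Math.round(progress * 100) === 100) progress = 1;
    events.push({ lang, progress });
  }
  return events;
}

const events = simulateProgress('eng+chi_tra+osd'.split('+'));
console.log(events.map((e) => e.progress));
```

Because every message now carries the same status text, a consumer can treat `progress` as one continuous value per stage rather than matching on separate "loading" and "loaded" strings.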
@ -208,9 +236,11 @@ const initialize = async ({
? _langs ? _langs
: _langs.map((l) => ((typeof l === 'string') ? l : l.data)).join('+'); : _langs.map((l) => ((typeof l === 'string') ? l : l.data)).join('+');
const statusText = 'initializing api';
try { try {
res.progress({ res.progress({
workerId, status: 'initializing api', progress: 0, workerId, status: statusText, progress: 0,
}); });
if (api !== null) { if (api !== null) {
api.End(); api.End();
@ -230,22 +260,55 @@ const initialize = async ({
} }
api = new TessModule.TessBaseAPI(); api = new TessModule.TessBaseAPI();
const status = api.Init(null, langs, oem); let status = api.Init(null, langs, oem);
if (status === -1) { if (status === -1) {
// Cache is deleted if initialization fails to avoid keeping bad data in cache // Cache is deleted if initialization fails to avoid keeping bad data in cache
// This assumes that initialization failing only occurs due to bad .traineddata, // This assumes that initialization failing only occurs due to bad .traineddata,
// this should be refined if other reasons for init failing are encountered. // this should be refined if other reasons for init failing are encountered.
if (['write', 'refresh', undefined].includes(cacheMethodWorker)) { // The "if" condition skips this section if either (1) cache is disabled [so the issue
// is definitely unrelated to cached data] or (2) cache is set to read-only
// [so we do not have permission to make any changes].
if (['write', 'refresh', undefined].includes(loadLanguageOptionsWorker.cacheMethod)) {
const langsArr = langs.split('+'); const langsArr = langs.split('+');
const delCachePromise = langsArr.map((lang) => adapter.deleteCache(`${cachePathWorker || '.'}/${lang}.traineddata`)); const delCachePromise = langsArr.map((lang) => adapter.deleteCache(`${loadLanguageOptionsWorker.cachePath || '.'}/${lang}.traineddata`));
await Promise.all(delCachePromise); await Promise.all(delCachePromise);
// Check for the case when (1) data was loaded from the cache and
// (2) the data does not support the requested OEM.
// In this case, loadLanguage is re-run and initialization is attempted a second time.
// This is because `loadLanguage` has no mechanism for checking whether the cached data
// supports the requested model, so this only becomes apparent when initialization fails.
// Check for this error message:
// eslint-disable-next-line
// "Tesseract (legacy) engine requested, but components are not present in ./eng.traineddata!!""
// The .wasm build of Tesseract saves this message in a separate file
// (in addition to the normal debug file location).
const debugStr = TessModule.FS.readFile('/debugDev.txt', { encoding: 'utf8', flags: 'a+' });
if (dataFromCache && /components are not present/.test(debugStr)) {
log('Data from cache missing requested OEM model. Attempting to refresh cache with new language data.');
// In this case, language data is re-loaded
await loadLanguage({ workerId, payload: { langs: loadLanguageLangsWorker, options: loadLanguageOptionsWorker } }); // eslint-disable-line max-len
status = api.Init(null, langs, oem);
if (status === -1) {
log('Language data refresh failed.');
const delCachePromise2 = langsArr.map((lang) => adapter.deleteCache(`${loadLanguageOptionsWorker.cachePath || '.'}/${lang}.traineddata`));
await Promise.all(delCachePromise2);
} else {
log('Language data refresh successful.');
} }
}
}
}
if (status === -1) {
res.reject('initialization failed'); res.reject('initialization failed');
} }
params = defaultParams; params = defaultParams;
await setParameters({ payload: { params } }); await setParameters({ payload: { params } });
res.progress({ res.progress({
workerId, status: 'initialized api', progress: 1, workerId, status: statusText, progress: 1,
}); });
res.resolve(); res.resolve();
} catch (err) { } catch (err) {
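
The retry path added to `initialize` above can be sketched as a standalone control flow. Everything in this sketch is an assumption made for illustration: `initWithCacheRefresh` and its parameters are hypothetical, with `init`, `reload`, and `clearCache` standing in for `api.Init`, `loadLanguage`, and `adapter.deleteCache`, and `cacheWritable` standing in for the `cacheMethod` check.

```javascript
// Hypothetical stand-ins; none of these names come from tesseract.js.
async function initWithCacheRefresh({
  init, reload, clearCache, cacheWritable, dataFromCache, debugStr,
}) {
  let status = init();
  if (status === -1 && cacheWritable) {
    // A failed init is assumed to mean bad cached data, so the cache is cleared.
    await clearCache();
    // Retry only when the data came from the cache and the engine reported
    // that the requested model is missing from the .traineddata file.
    if (dataFromCache && /components are not present/.test(debugStr)) {
      await reload();
      status = init();
      // A second failure clears whatever the reload wrote to the cache.
      if (status === -1) await clearCache();
    }
  }
  if (status === -1) throw new Error('initialization failed');
  return status;
}

// Usage: the first init fails, the reload fixes the data, the second succeeds.
let attempts = 0;
initWithCacheRefresh({
  init: () => { attempts += 1; return attempts === 1 ? -1 : 0; },
  reload: async () => {},
  clearCache: async () => {},
  cacheWritable: true,
  dataFromCache: true,
  debugStr: 'Tesseract (legacy) engine requested, but components are not present in ./eng.traineddata!!',
}).then((status) => console.log(status, attempts));
```

The retry is deliberately limited to one attempt: if freshly downloaded data also fails, the failure is unlikely to be a cache problem.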

@ -1,20 +1,29 @@
const { simd } = require('wasm-feature-detect'); const { simd } = require('wasm-feature-detect');
const OEM = require('../../constants/OEM');
let TesseractCore = null; let TesseractCore = null;
/* /*
* getCore is a sync function to load and return * getCore is a sync function to load and return
* TesseractCore. * TesseractCore.
*/ */
module.exports = async (_, res) => { module.exports = async (oem, _, res) => {
if (TesseractCore === null) { if (TesseractCore === null) {
const statusText = 'loading tesseract core';
const simdSupport = await simd(); const simdSupport = await simd();
res.progress({ status: 'loading tesseract core', progress: 0 }); res.progress({ status: statusText, progress: 0 });
if (simdSupport) { if (simdSupport) {
if ([OEM.DEFAULT, OEM.LSTM_ONLY].includes(oem)) {
TesseractCore = require('tesseract.js-core/tesseract-core-simd-lstm');
} else {
TesseractCore = require('tesseract.js-core/tesseract-core-simd'); TesseractCore = require('tesseract.js-core/tesseract-core-simd');
}
} else if ([OEM.DEFAULT, OEM.LSTM_ONLY].includes(oem)) {
TesseractCore = require('tesseract.js-core/tesseract-core-lstm');
} else { } else {
TesseractCore = require('tesseract.js-core/tesseract-core'); TesseractCore = require('tesseract.js-core/tesseract-core');
} }
res.progress({ status: 'loaded tesseract core', progress: 1 }); res.progress({ status: statusText, progress: 1 });
} }
return TesseractCore; return TesseractCore;
}; };
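
The branching in the new `getCore` above reduces to two independent choices: whether SIMD is available, and whether the requested OEM needs only the LSTM model. The sketch below shows that mapping as a pure function. `pickCorePath` is a hypothetical helper, and the numeric `OEM` values are assumed to mirror `src/constants/OEM.js` rather than taken from this diff.

```javascript
// Assumed to match src/constants/OEM.js; treat the numbers as illustrative.
const OEM = {
  TESSERACT_ONLY: 0,
  LSTM_ONLY: 1,
  TESSERACT_LSTM_COMBINED: 2,
  DEFAULT: 3,
};

// Hypothetical helper: returns the module path getCore would require.
function pickCorePath(simdSupport, oem) {
  // DEFAULT and LSTM_ONLY never touch the Legacy engine,
  // so the smaller LSTM-only build is sufficient.
  const lstmOnly = [OEM.DEFAULT, OEM.LSTM_ONLY].includes(oem);
  const base = 'tesseract.js-core/tesseract-core';
  return `${base}${simdSupport ? '-simd' : ''}${lstmOnly ? '-lstm' : ''}`;
}

console.log(pickCorePath(true, OEM.LSTM_ONLY));
console.log(pickCorePath(false, OEM.TESSERACT_ONLY));
```

Note that `getCore` caches the loaded module in `TesseractCore`, so the choice is made once per worker; a later request for a different OEM does not reload the core.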

@ -1,4 +1,3 @@
const resolveURL = (s) => (new URL(s, window.location.href)).href;
const { version } = require('../../../package.json'); const { version } = require('../../../package.json');
const defaultOptions = require('../../constants/defaultOptions'); const defaultOptions = require('../../constants/defaultOptions');
@ -7,12 +6,5 @@ const defaultOptions = require('../../constants/defaultOptions');
*/ */
module.exports = { module.exports = {
...defaultOptions, ...defaultOptions,
workerPath: (typeof process !== 'undefined' && process.env.TESS_ENV === 'development') workerPath: `https://cdn.jsdelivr.net/npm/tesseract.js@v${version}/dist/worker.min.js`,
? resolveURL(`/dist/worker.dev.js?nocache=${Math.random().toString(36).slice(3)}`)
: `https://cdn.jsdelivr.net/npm/tesseract.js@v${version}/dist/worker.min.js`,
/*
* If browser doesn't support WebAssembly,
* load ASM version instead
*/
corePath: null,
}; };

@ -7,7 +7,7 @@
<div id="mocha"></div> <div id="mocha"></div>
<script src="../node_modules/mocha/mocha.js"></script> <script src="../node_modules/mocha/mocha.js"></script>
<script src="../node_modules/expect.js/index.js"></script> <script src="../node_modules/expect.js/index.js"></script>
<script src="../dist/tesseract.dev.js"></script> <script src="../dist/tesseract.min.js"></script>
<script src="./constants.js"></script> <script src="./constants.js"></script>
<script>mocha.setup('bdd');</script> <script>mocha.setup('bdd');</script>
<script src="./FS.test.js"></script> <script src="./FS.test.js"></script>

@ -3,7 +3,7 @@ const FS_WAIT = 500;
let worker; let worker;
before(async function cb() { before(async function cb() {
this.timeout(0); this.timeout(0);
worker = await createWorker(OPTIONS); worker = await createWorker("eng", 1, OPTIONS);
}); });
describe('FS', async () => { describe('FS', async () => {

@ -6,14 +6,14 @@ const OPTIONS = {
langPath: 'http://localhost:3000/tests/assets/traineddata', langPath: 'http://localhost:3000/tests/assets/traineddata',
cachePath: './tests/assets/traineddata', cachePath: './tests/assets/traineddata',
corePath: '../node_modules/tesseract.js-core/tesseract-core.wasm.js', corePath: '../node_modules/tesseract.js-core/tesseract-core.wasm.js',
...(IS_BROWSER ? { workerPath: '../dist/worker.dev.js' } : {}), ...(IS_BROWSER ? { workerPath: '../dist/worker.min.js' } : {}),
}; };
const SIMPLE_TEXT = 'Tesseract.js\n'; const SIMPLE_TEXT = 'Tesseract.js\n';
const SIMPLE_TEXT_HALF = 'Tesse\n'; const SIMPLE_TEXT_HALF = 'Tesse\n';
const COMSIC_TEXT = 'HellO World\nfrom beyond\nthe Cosmic Void\n'; const COMSIC_TEXT = 'HellO World\nfrom beyond\nthe Cosmic Void\n';
const TESTOCR_TEXT = 'This is a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format.\n\nThe quick brown dog jumped over the\nlazy fox. The quick brown dog jumped\nover the lazy fox. The quick brown dog\njumped over the lazy fox. The quick\nbrown dog jumped over the lazy fox.\n'; const TESTOCR_TEXT = 'This is a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format.\n\nThe quick brown dog jumped over the\nlazy fox. The quick brown dog jumped\nover the lazy fox. The quick brown dog\njumped over the lazy fox. The quick\nbrown dog jumped over the lazy fox.\n';
const CHINESE_TEXT = '繁 體 中 文 測 試\n'; const CHINESE_TEXT = '繁 體 中 文 測 試\n';
const BILL_SPACED_TEXT = 'FIRST CHEQUING\n\nLine of Credit 100,000.00 Rate 4.2000\n\nDate Description Number Debits Credits Balance\n31Jul2018 Balance Forward 99,878.08 -\n01Aug2018 Clearing Cheque 4987 36.07 99,914.15 -\n01Aug2018 Clearing Cheque 4986 60.93 99,975.08 -\n01Aug2018 Clearing Cheque 4982 800.04 100,775.12 EX\n01Aug2018 Clearing Cheque 4981 823.34 101,598.46 EX\n01Aug2018 Incoming Interac e-Transfer 1454 101,583.92 EX\n01Aug2018 Incoming Interac e-Transfer 400.00 101,183.92 EX\n01Aug2018 Assisted Deposit 3241450 68,769.42 -\n01Aug2018 Transfer out to loan 7 1,500.00 70,269.42 -\n02Aug2018 Clearing Cheque 4984 48.08 70,317.50 -\n02Aug2018 Clearing Cheque 4985 7051 70,388.01 -\n02Aug2018 Clearing Cheque 4992 500.00 70.888.01 -\n'; const BILL_SPACED_TEXT = 'FIRST CHEQUING\n\nLine of Credit 100,000.00 Rate 4.2000\n\nDate Description Number Debits Credits Balance\n31Jul2018 Balance Forward 99,878.08 -\n01Aug2018 Clearing Cheque 4987 36.07 99,914.15 -\n01Aug2018 Clearing Cheque 4986 60.93 99,975.08 -\n01Aug2018 Clearing Cheque 4982 800.04 100,775.12 EX\n01Aug2018 Clearing Cheque 4981 823.34 101,598.46 EX\n01Aug2018 Incoming Interac e-Transfer 1454 101,583.92 EX\n01Aug2018 Incoming Interac e-Transfer 400.00 101,183.92 EX\n01Aug2018 Assisted Deposit 3241450 68,769.42 -\n01Aug2018 Transfer out to loan 7 1,500.00 70,269.42 -\n02Aug2018 Clearing Cheque 4984 48.08 70,317.50 -\n02Aug2018 Clearing Cheque 4985 7051 70,388.01 -\n02Aug2018 Clearing Cheque 4992 500.00 70,888.01 -\n';
const SIMPLE_WHITELIST_TEXT = 'Tesses\n'; const SIMPLE_WHITELIST_TEXT = 'Tesses\n';
const FORMATS = ['png', 'jpg', 'bmp', 'pbm', 'webp', 'gif']; const FORMATS = ['png', 'jpg', 'bmp', 'pbm', 'webp', 'gif'];
const SIMPLE_PNG_BASE64 = ''; const SIMPLE_PNG_BASE64 = '';

@ -7,7 +7,7 @@
<div id="mocha"></div> <div id="mocha"></div>
<script src="../node_modules/mocha/mocha.js"></script> <script src="../node_modules/mocha/mocha.js"></script>
<script src="../node_modules/expect.js/index.js"></script> <script src="../node_modules/expect.js/index.js"></script>
<script src="../dist/tesseract.dev.js"></script> <script src="../dist/tesseract.min.js"></script>
<script src="./constants.js"></script> <script src="./constants.js"></script>
<script>mocha.setup('bdd');</script> <script>mocha.setup('bdd');</script>
<script src="./detect.test.js"></script> <script src="./detect.test.js"></script>

@ -2,7 +2,7 @@ const { createWorker } = Tesseract;
let worker; let worker;
before(async function cb() { before(async function cb() {
this.timeout(0); this.timeout(0);
worker = await createWorker(OPTIONS); worker = await createWorker("osd", 0, OPTIONS);
}); });
describe('detect()', async () => { describe('detect()', async () => {
@ -10,8 +10,6 @@ describe('detect()', async () => {
[ [
{ name: 'cosmic.png', ans: { script: 'Latin' } }, { name: 'cosmic.png', ans: { script: 'Latin' } },
].forEach(async ({ name, ans: { script } }) => { ].forEach(async ({ name, ans: { script } }) => {
await worker.loadLanguage('osd');
await worker.initialize('osd');
const { data: { script: s } } = await worker.detect(`${IMAGE_PATH}/${name}`); const { data: { script: s } } = await worker.detect(`${IMAGE_PATH}/${name}`);
expect(s).to.be(script); expect(s).to.be(script);
}); });

@ -7,7 +7,7 @@
<div id="mocha"></div> <div id="mocha"></div>
<script src="../node_modules/mocha/mocha.js"></script> <script src="../node_modules/mocha/mocha.js"></script>
<script src="../node_modules/expect.js/index.js"></script> <script src="../node_modules/expect.js/index.js"></script>
<script src="../dist/tesseract.dev.js"></script> <script src="../dist/tesseract.min.js"></script>
<script src="./constants.js"></script> <script src="./constants.js"></script>
<script>mocha.setup('bdd');</script> <script>mocha.setup('bdd');</script>
<script src="./recognize.test.js"></script> <script src="./recognize.test.js"></script>

@ -2,15 +2,14 @@ const { createWorker, PSM } = Tesseract;
let worker; let worker;
before(async function cb() { before(async function cb() {
this.timeout(0); this.timeout(0);
worker = await createWorker(OPTIONS); worker = await createWorker("eng+chi_tra+osd", 1, OPTIONS);
await worker.loadLanguage('eng+chi_tra+osd');
}); });
describe('recognize()', () => { describe('recognize()', () => {
describe('should read bmp, jpg, png and pbm format images', () => { describe('should read bmp, jpg, png and pbm format images', () => {
FORMATS.forEach(format => ( FORMATS.forEach(format => (
it(`support ${format} format`, async () => { it(`support ${format} format`, async () => {
await worker.initialize('eng'); await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(`${IMAGE_PATH}/simple.${format}`); const { data: { text } } = await worker.recognize(`${IMAGE_PATH}/simple.${format}`);
expect(text).to.be(SIMPLE_TEXT); expect(text).to.be(SIMPLE_TEXT);
}).timeout(TIMEOUT) }).timeout(TIMEOUT)
@ -23,7 +22,7 @@ describe('recognize()', () => {
{ format: 'jpg', image: SIMPLE_JPG_BASE64, ans: SIMPLE_TEXT }, { format: 'jpg', image: SIMPLE_JPG_BASE64, ans: SIMPLE_TEXT },
].forEach(({ format, image, ans }) => ( ].forEach(({ format, image, ans }) => (
it(`recongize ${format} in base64`, async () => { it(`recongize ${format} in base64`, async () => {
await worker.initialize('eng'); await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(image); const { data: { text } } = await worker.recognize(image);
expect(text).to.be(ans); expect(text).to.be(ans);
}).timeout(TIMEOUT) }).timeout(TIMEOUT)
@ -37,7 +36,7 @@ describe('recognize()', () => {
{ name: 'simple-270.jpg', desc: 'simple', ans: SIMPLE_TEXT }, { name: 'simple-270.jpg', desc: 'simple', ans: SIMPLE_TEXT },
].forEach(({ name, desc, ans }) => ( ].forEach(({ name, desc, ans }) => (
it(`recongize ${desc} image`, async () => { it(`recongize ${desc} image`, async () => {
await worker.initialize('eng'); await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(`${IMAGE_PATH}/${name}`); const { data: { text } } = await worker.recognize(`${IMAGE_PATH}/${name}`);
expect(text).to.be(ans); expect(text).to.be(ans);
}).timeout(TIMEOUT) }).timeout(TIMEOUT)
@ -62,7 +61,7 @@ describe('recognize()', () => {
{ name: 'chinese.png', lang: 'chi_tra', ans: CHINESE_TEXT }, { name: 'chinese.png', lang: 'chi_tra', ans: CHINESE_TEXT },
].forEach(({ name, lang, ans }) => ( ].forEach(({ name, lang, ans }) => (
it(`recongize ${lang}`, async () => { it(`recongize ${lang}`, async () => {
await worker.initialize(lang); await worker.reinitialize(lang);
const { data: { text } } = await worker.recognize(`${IMAGE_PATH}/${name}`); const { data: { text } } = await worker.recognize(`${IMAGE_PATH}/${name}`);
expect(text).to.be(ans); expect(text).to.be(ans);
}).timeout(TIMEOUT) }).timeout(TIMEOUT)
@ -76,7 +75,7 @@ describe('recognize()', () => {
{ name: 'testocr.png', desc: 'large', ans: TESTOCR_TEXT }, { name: 'testocr.png', desc: 'large', ans: TESTOCR_TEXT },
].forEach(({ name, desc, ans }) => ( ].forEach(({ name, desc, ans }) => (
it(`recongize ${desc} image`, async () => { it(`recongize ${desc} image`, async () => {
await worker.initialize('eng'); await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(`${IMAGE_PATH}/${name}`); const { data: { text } } = await worker.recognize(`${IMAGE_PATH}/${name}`);
expect(text).to.be(ans); expect(text).to.be(ans);
}).timeout(TIMEOUT) }).timeout(TIMEOUT)
@ -92,7 +91,7 @@ describe('recognize()', () => {
name, left, top, width, height, ans, name, left, top, width, height, ans,
}) => ( }) => (
it(`recongize half ${name}`, async () => { it(`recongize half ${name}`, async () => {
await worker.initialize('eng'); await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize( const { data: { text } } = await worker.recognize(
`${IMAGE_PATH}/${name}`, `${IMAGE_PATH}/${name}`,
{ {
@ -108,7 +107,7 @@ describe('recognize()', () => {
describe('should work with selected parameters', () => { describe('should work with selected parameters', () => {
it('support preserve_interword_spaces', async () => { it('support preserve_interword_spaces', async () => {
await worker.initialize('eng'); await worker.reinitialize('eng');
await worker.setParameters({ await worker.setParameters({
preserve_interword_spaces: '1', preserve_interword_spaces: '1',
}); });
@ -117,7 +116,7 @@ describe('recognize()', () => {
}).timeout(TIMEOUT); }).timeout(TIMEOUT);
it('support tessedit_char_whitelist', async () => { it('support tessedit_char_whitelist', async () => {
await worker.initialize('eng'); await worker.reinitialize('eng');
await worker.setParameters({ await worker.setParameters({
tessedit_char_whitelist: 'Tess', tessedit_char_whitelist: 'Tess',
}); });
@ -132,7 +131,7 @@ describe('recognize()', () => {
.map(name => ({ name, mode: PSM[name] })) .map(name => ({ name, mode: PSM[name] }))
.forEach(({ name, mode }) => ( .forEach(({ name, mode }) => (
it(`support PSM.${name} mode`, async () => { it(`support PSM.${name} mode`, async () => {
await worker.initialize('eng'); await worker.reinitialize('eng');
await worker.setParameters({ await worker.setParameters({
tessedit_pageseg_mode: mode, tessedit_pageseg_mode: mode,
}); });
@ -146,7 +145,7 @@ describe('recognize()', () => {
FORMATS.forEach(format => ( FORMATS.forEach(format => (
it(`support ${format} format`, async () => { it(`support ${format} format`, async () => {
const buf = fs.readFileSync(path.join(__dirname, 'assets', 'images', `simple.${format}`)); const buf = fs.readFileSync(path.join(__dirname, 'assets', 'images', `simple.${format}`));
await worker.initialize('eng'); await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(buf); const { data: { text } } = await worker.recognize(buf);
expect(text).to.be(SIMPLE_TEXT); expect(text).to.be(SIMPLE_TEXT);
}).timeout(TIMEOUT) }).timeout(TIMEOUT)
@ -158,7 +157,7 @@ describe('recognize()', () => {
it(`support ${format} format`, async () => { it(`support ${format} format`, async () => {
const imageDOM = document.createElement('img'); const imageDOM = document.createElement('img');
imageDOM.setAttribute('src', `${IMAGE_PATH}/simple.${format}`); imageDOM.setAttribute('src', `${IMAGE_PATH}/simple.${format}`);
await worker.initialize('eng'); await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(imageDOM); const { data: { text } } = await worker.recognize(imageDOM);
expect(text).to.be(SIMPLE_TEXT); expect(text).to.be(SIMPLE_TEXT);
}).timeout(TIMEOUT) }).timeout(TIMEOUT)
@ -170,7 +169,7 @@ describe('recognize()', () => {
it(`support ${format} format`, async () => { it(`support ${format} format`, async () => {
const videoDOM = document.createElement('video'); const videoDOM = document.createElement('video');
videoDOM.setAttribute('poster', `${IMAGE_PATH}/simple.${format}`); videoDOM.setAttribute('poster', `${IMAGE_PATH}/simple.${format}`);
await worker.initialize('eng'); await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(videoDOM); const { data: { text } } = await worker.recognize(videoDOM);
expect(text).to.be(SIMPLE_TEXT); expect(text).to.be(SIMPLE_TEXT);
}).timeout(TIMEOUT) }).timeout(TIMEOUT)
@ -202,7 +201,7 @@ describe('recognize()', () => {
formats.forEach(format => ( formats.forEach(format => (
it(`support ${format} format`, async () => { it(`support ${format} format`, async () => {
await worker.initialize('eng'); await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(canvasDOM); const { data: { text } } = await worker.recognize(canvasDOM);
expect(text).to.be(SIMPLE_TEXT); expect(text).to.be(SIMPLE_TEXT);
}).timeout(TIMEOUT) }).timeout(TIMEOUT)
@ -234,7 +233,7 @@ describe('recognize()', () => {
formats.forEach(format => ( formats.forEach(format => (
it(`support ${format} format`, async () => { it(`support ${format} format`, async () => {
await worker.initialize('eng'); await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(offscreenCanvas); const { data: { text } } = await worker.recognize(offscreenCanvas);
expect(text).to.be(SIMPLE_TEXT); expect(text).to.be(SIMPLE_TEXT);
}).timeout(TIMEOUT) }).timeout(TIMEOUT)

@ -7,7 +7,7 @@
<div id="mocha"></div> <div id="mocha"></div>
<script src="../node_modules/mocha/mocha.js"></script> <script src="../node_modules/mocha/mocha.js"></script>
<script src="../node_modules/expect.js/index.js"></script> <script src="../node_modules/expect.js/index.js"></script>
<script src="../dist/tesseract.dev.js"></script> <script src="../dist/tesseract.min.js"></script>
<script src="./constants.js"></script> <script src="./constants.js"></script>
<script>mocha.setup('bdd');</script> <script>mocha.setup('bdd');</script>
<script src="./scheduler.test.js"></script> <script src="./scheduler.test.js"></script>

@ -7,10 +7,7 @@ before(async function cb() {
const NUM_WORKERS = 5; const NUM_WORKERS = 5;
console.log(`Initializing ${NUM_WORKERS} workers`); console.log(`Initializing ${NUM_WORKERS} workers`);
workers = await Promise.all(Array(NUM_WORKERS).fill(0).map(async () => { workers = await Promise.all(Array(NUM_WORKERS).fill(0).map(async () => {
const w = await createWorker(OPTIONS); return await createWorker("eng", 1, OPTIONS);
await w.loadLanguage('eng');
await w.initialize('eng');
return w;
})); }));
console.log(`Initialized ${NUM_WORKERS} workers`); console.log(`Initialized ${NUM_WORKERS} workers`);
}); });
