Update to v5 (#830)

master
Balearica 1 year ago committed by GitHub
parent ccf7414bc2
commit 6ebe92fb5b
  1. 4
      .gitignore
  2. 111
      README.md
  3. 11
      benchmarks/browser/auto-rotate-benchmark.html
  4. 24
      benchmarks/browser/speed-benchmark.html
  5. 2
      benchmarks/node/speed-benchmark.js
  6. 270
      docs/api.md
  7. 44
      docs/examples.md
  8. 2
      docs/faq.md
  9. 68
      docs/intro.md
  10. 29
      docs/local-installation.md
  11. 29
      docs/performance.md
  12. 51
      docs/workers_vs_schedulers.md
  13. 10
      examples/browser/basic-efficient.html
  14. 8
      examples/browser/basic-scheduler.html
  15. 26
      examples/browser/basic.html
  16. 160
      examples/browser/demo.html
  17. 8
      examples/browser/download-pdf.html
  18. 10
      examples/browser/image-processing.html
  19. 13
      examples/node/detect.js
  20. 2
      examples/node/download-pdf.js
  21. 2
      examples/node/image-processing.js
  22. 4
      examples/node/recognize.js
  23. 28
      examples/node/scheduler.js
  24. 14
      package-lock.json
  25. 4
      package.json
  26. 2
      scripts/server.js
  27. 49
      scripts/webpack.config.dev.js
  28. 8
      src/Tesseract.js
  29. 5
      src/constants/config.js
  30. 4
      src/constants/defaultOptions.js
  31. 86
      src/createWorker.js
  32. 7
      src/index.d.ts
  33. 16
      src/worker-script/browser/getCore.js
  34. 117
      src/worker-script/index.js
  35. 17
      src/worker-script/node/getCore.js
  36. 10
      src/worker/browser/defaultOptions.js
  37. 2
      tests/FS.test.html
  38. 2
      tests/FS.test.js
  39. 4
      tests/constants.js
  40. 2
      tests/detect.test.html
  41. 4
      tests/detect.test.js
  42. 2
      tests/recognize.test.html
  43. 31
      tests/recognize.test.js
  44. 2
      tests/scheduler.test.html
  45. 5
      tests/scheduler.test.js

4
.gitignore vendored

@ -1,8 +1,8 @@
.DS_Store
node_modules/*
yarn.lock
tesseract.dev.js
worker.dev.js
tesseract.min.js
worker.min.js
*.traineddata
*.traineddata.gz
.nyc_output

@ -31,82 +31,32 @@ Video Real-time Recognition
Tesseract.js wraps a [webassembly port](https://github.com/naptha/tesseract.js-core) of the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR Engine.
It works in the browser using [webpack](https://webpack.js.org/) or plain script tags with a [CDN](#CDN) and on the server with [Node.js](https://nodejs.org/en/).
It works in the browser using [webpack](https://webpack.js.org/), esm, or plain script tags with a [CDN](#CDN) and on the server with [Node.js](https://nodejs.org/en/).
After you [install it](#installation), using it is as simple as:
```javascript
import Tesseract from 'tesseract.js';
Tesseract.recognize(
'https://tesseract.projectnaptha.com/img/eng_bw.png',
'eng',
{ logger: m => console.log(m) }
).then(({ data: { text } }) => {
console.log(text);
})
```
Or using workers (recommended for production use):
```javascript
import { createWorker } from 'tesseract.js';
const worker = await createWorker({
logger: m => console.log(m)
});
(async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
const worker = await createWorker('eng');
const ret = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(ret.data.text);
await worker.terminate();
})();
```
When recognizing multiple images, users should create a worker once, run `worker.recognize` for each image, and then run `worker.terminate()` once at the end (rather than running the above snippet for every image).
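The create-once, recognize-many, terminate-once pattern described above can be sketched roughly as follows. This is an illustration rather than part of the library: `recognizeAll` is a hypothetical helper, and `createWorker` is passed in as an argument so the snippet stands on its own (in an application it would be the `createWorker` exported by `tesseract.js`).

```javascript
// Hypothetical helper (not part of Tesseract.js): recognize several images
// with a single worker. `createWorker` is assumed to behave like the v5
// function of the same name exported by tesseract.js.
async function recognizeAll(createWorker, images, langs = 'eng') {
  const worker = await createWorker(langs); // created once
  const texts = [];
  try {
    for (const image of images) {
      const { data: { text } } = await worker.recognize(image);
      texts.push(text);
    }
  } finally {
    await worker.terminate(); // terminated once, even if a job throws
  }
  return texts;
}
```

With the real library this would be called as `recognizeAll(require('tesseract.js').createWorker, ['a.png', 'b.png'])`, where the file names are placeholders.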
For a basic overview of the functions, including the pros/cons of different approaches, see the [intro](./docs/intro.md). [Check out the docs](#documentation) for a full explanation of the API.
## Major changes in v4
Version 4 includes many new features and bug fixes--see [this issue](https://github.com/naptha/tesseract.js/issues/662) for a full list. Several highlights are below.
- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy
- Processed images (rotated, grayscale, binary) can now be retrieved
- Improved support for parallel processing (schedulers)
- Breaking changes:
- `createWorker` is now async
- `getPDF` function replaced by `pdf` recognize option
## Major changes in v3
- Significantly faster performance
- Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the [example images](./examples/data)
- Upgrade to Tesseract v5.1.0 (using emscripten 3.1.18)
- Added SIMD-enabled build for supported devices
- Added support:
- Node.js version 18
- Removed support:
- ASM.js version, any other old versions of Tesseract.js-core (<3.0.0)
- Node.js versions 10 and 12
## Major changes in v2
- Upgrade to tesseract v4.1.1 (using emscripten 1.39.10 upstream)
- Support multiple languages at the same time, eg: eng+chi\_tra for English and Traditional Chinese
- Supported image formats: png, jpg, bmp, pbm
- Support WebAssembly (fallback to ASM.js when browser doesn't support)
- Support Typescript
Read a story about v2: <a href="https://jeromewu.github.io/why-i-refactor-tesseract.js-v2/">Why I refactor tesseract.js v2?</a><br>
Check the <a href="https://github.com/naptha/tesseract.js/tree/support/1.x">support/1.x</a> branch for version 1
## Installation
Tesseract.js works with a `<script>` tag via local copy or CDN, with webpack via `npm` and on Node.js with `npm/yarn`.
### CDN
```html
<!-- v4 -->
<script src='https://cdn.jsdelivr.net/npm/tesseract.js@4/dist/tesseract.min.js'></script>
<!-- v5 -->
<script src='https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js'></script>
```
After including the script the `Tesseract` variable will be globally available.
After including the script the `Tesseract` variable will be globally available and a worker can be created using `Tesseract.createWorker`.
Alternatively, an ESM build (used with `import` syntax) can be found at `https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.esm.min.js`.
### Node.js
@ -122,16 +72,51 @@ npm install tesseract.js@3.0.3
yarn add tesseract.js@3.0.3
```
## Documentation
* [Intro](./docs/intro.md)
* [Workers vs. Schedulers](./docs/workers_vs_schedulers.md)
* [Examples](./docs/examples.md)
* [Image Format](./docs/image-format.md)
* [Supported Image Formats](./docs/image-format.md)
* [API](./docs/api.md)
* [Local Installation](./docs/local-installation.md)
* [FAQ](./docs/faq.md)
## Major changes in v5
Version 5 changes are documented in [this issue](https://github.com/naptha/tesseract.js/issues/820). Highlights are below.
- Significantly smaller files by default (54% smaller for English, 73% smaller for Chinese)
- This results in a ~50% reduction in runtime for first-time users (who do not have the files cached yet)
- Significantly lower memory usage
- Compatible with iOS 17 (using default settings)
- Breaking changes:
- `createWorker` arguments changed
- Setting non-default language and OEM now happens in `createWorker`
- E.g. `createWorker("chi_sim", 1)`
- `worker.initialize` and `worker.loadLanguage` functions now do nothing and can be deleted from code
- See [this issue](https://github.com/naptha/tesseract.js/issues/820) for full list
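The breaking changes listed above amount to moving the language and OEM out of separate calls and into `createWorker`. A rough before/after sketch, based only on the call shapes shown elsewhere in this commit; `createChineseWorker` is a hypothetical wrapper, and `createWorker` is passed in as an argument so the snippet stands on its own:

```javascript
// v4 (before):
//   const worker = await createWorker({ logger });
//   await worker.loadLanguage('chi_sim');
//   await worker.initialize('chi_sim', 1);
//
// v5 (after): language and OEM are the first two arguments, and the
// options object moves to the third position.
// Hypothetical wrapper; `createWorker` is assumed to be the v5 function
// exported by tesseract.js.
async function createChineseWorker(createWorker, logger) {
  const LSTM_ONLY = 1; // oem value 1 selects the LSTM model
  return createWorker('chi_sim', LSTM_ONLY, { logger });
}
```

The `loadLanguage` and `initialize` calls are simply dropped; per the list above they do nothing in v5.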
## Major changes in v4
Version 4 includes many new features and bug fixes--see [this issue](https://github.com/naptha/tesseract.js/issues/662) for a full list. Several highlights are below.
- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy
- Processed images (rotated, grayscale, binary) can now be retrieved
- Improved support for parallel processing (schedulers)
- Breaking changes:
- `createWorker` is now async
- `getPDF` function replaced by `pdf` recognize option
## Major changes in v3
- Significantly faster performance
- Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the [example images](./examples/data)
- Upgrade to Tesseract v5.1.0 (using emscripten 3.1.18)
- Added SIMD-enabled build for supported devices
- Added support:
- Node.js version 18
- Removed support:
- ASM.js version, any other old versions of Tesseract.js-core (<3.0.0)
- Node.js versions 10 and 12
## Use tesseract.js the way you like!
- Electron Version: https://github.com/Balearica/tesseract.js-electron
@ -167,7 +152,7 @@ npm start
```
The development server will be available at http://localhost:3000/examples/browser/demo.html in your favorite browser.
It will automatically rebuild `tesseract.dev.js` and `worker.dev.js` when you change files in the **src** folder.
It will automatically rebuild `tesseract.min.js` and `worker.min.js` when you change files in the **src** folder.
### Online Setup with a single Click

@ -1,7 +1,7 @@
<html>
<head>
<script src="/dist/tesseract.dev.js"></script>
<script src="/dist/tesseract.min.js"></script>
<style>
.column {
float: left;
@ -37,15 +37,10 @@
const element = document.getElementById("imgRow");
const worker = await Tesseract.createWorker({
const worker = await Tesseract.createWorker('eng', 0, {
// corePath: '/tesseract-core-simd.wasm.js',
workerPath: "/dist/worker.dev.js"
workerPath: "/dist/worker.min.js"
});
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.initialize();
const fileArr = ["../data/meditations.jpg", "../data/tyger.jpg", "../data/testocr.png"];
let timeTotal = 0;

@ -1,6 +1,6 @@
<html>
<head>
<script src="/dist/tesseract.dev.js"></script>
<script src="/dist/tesseract.min.js"></script>
</head>
<body>
<textarea id="message">Working...</textarea>
@ -13,20 +13,21 @@
const { createWorker } = Tesseract;
(async () => {
const worker = await createWorker({
// corePath: '/tesseract-core-simd.wasm.js',
workerPath: "/dist/worker.dev.js"
const worker = await createWorker("eng", 1, {
corePath: '../../node_modules/tesseract.js-core',
workerPath: "/dist/worker.min.js",
});
await worker.loadLanguage('eng');
await worker.initialize('eng');
// The performance.measureUserAgentSpecificMemory function only runs under specific circumstances for security reasons.
// See: https://developer.mozilla.org/en-US/docs/Web/API/Performance/measureUserAgentSpecificMemory#security_requirements
// Launching a server using `npm start` and accessing via localhost on the same system should meet these conditions.
const debugMemory = true;
if (debugMemory && crossOriginIsolated) {
console.log("Memory utilization after initialization:");
console.log(await performance.measureUserAgentSpecificMemory());
const memObj = await performance.measureUserAgentSpecificMemory();
const memMb = memObj.breakdown.map((x) => {if(x.attribution?.[0]?.scope == "DedicatedWorkerGlobalScope") return x.bytes}).reduce((a, b) => (a || 0) + (b || 0), 0) / 1e6;
console.log(`Worker memory utilization after initialization: ${memMb} MB`);
} else {
console.log("Unable to run `performance.measureUserAgentSpecificMemory`: not crossOriginIsolated.")
}
@ -45,8 +46,11 @@
}
if (debugMemory && crossOriginIsolated) {
console.log("Memory utilization after recognition:");
console.log(await performance.measureUserAgentSpecificMemory());
const memObj = await performance.measureUserAgentSpecificMemory();
const memMb = memObj.breakdown.map((x) => {if(x.attribution?.[0]?.scope == "DedicatedWorkerGlobalScope") return x.bytes}).reduce((a, b) => (a || 0) + (b || 0), 0) / 1e6;
console.log(`Worker memory utilization after recognition: ${memMb} MB`);
}
document.getElementById('message').innerHTML += "\nTotal runtime: " + timeTotal + "s";

@ -4,8 +4,6 @@ const { createWorker } = require('../../');
(async () => {
const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const fileArr = ["../data/meditations.jpg", "../data/tyger.jpg", "../data/testocr.png"];
let timeTotal = 0;
for (let file of fileArr) {

@ -1,16 +1,15 @@
# API
- [createWorker()](#create-worker)
- [Worker.recognize](#worker-recognize)
- [Worker.setParameters](#worker-set-parameters)
- [Worker.reinitialize](#worker-reinitialize)
- [Worker.detect](#worker-detect)
- [Worker.terminate](#worker-terminate)
- [Worker.writeText](#worker-writeText)
- [Worker.readText](#worker-readText)
- [Worker.removeFile](#worker-removeFile)
- [Worker.FS](#worker-FS)
- [Worker.loadLanguage](#worker-load-language)
- [Worker.initialize](#worker-initialize)
- [Worker.setParameters](#worker-set-parameters)
- [Worker.recognize](#worker-recognize)
- [Worker.detect](#worker-detect)
- [Worker.terminate](#worker-terminate)
- [createScheduler()](#create-scheduler)
- [Scheduler.addWorker](#scheduler-add-worker)
- [Scheduler.addJob](#scheduler-add-job)
@ -27,10 +26,13 @@
<a name="create-worker"></a>
## createWorker(options): Worker
createWorker is a factory function that creates a tesseract worker, a worker is basically a Web Worker in browser and Child Process in Node.
`createWorker` is a function that creates a Tesseract.js worker. A Tesseract.js worker is an object that creates and manages an instance of Tesseract running in a web worker (browser) or worker thread (Node.js). Once created, OCR jobs are sent through the worker.
**Arguments:**
- `langs` a string indicating which languages' traineddata to download; multiple languages are concatenated with **+**, e.g. **eng+chi\_tra**
- `oem` an enum indicating the OCR Engine Mode to use
- `options` an object of customized options
- `corePath` path to a directory containing **both** `tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js` from [Tesseract.js-core](https://www.npmjs.com/package/tesseract.js-core) package
- Setting `corePath` to a specific `.js` file is **strongly discouraged.** To provide the best performance on all devices, Tesseract.js needs to be able to pick between `tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js`. See [this issue](https://github.com/naptha/tesseract.js/issues/735) for more detail.
@ -43,6 +45,8 @@ createWorker is a factory function that creates a tesseract worker, a worker is
- readOnly: read cache and not to write back
- refresh: not to read cache and write back
- none: not to read cache and not to write back
- `legacyCore` set to `true` to ensure any code downloaded supports the Legacy model (in addition to LSTM model)
- `legacyLang` set to `true` to ensure any language data downloaded supports the Legacy model (in addition to LSTM model)
- `workerBlobURL` a boolean to define whether to use Blob URL for worker script, default: true
- `gzip` a boolean to define whether the traineddata from the remote is gzipped, default: true
- `logger` a function to log the progress, a quick example is `m => console.log(m)`
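As a slightly fuller take on the `logger` option than `m => console.log(m)`, the sketch below formats each message as a one-line progress report. It assumes that logger messages carry a `status` string and a `progress` number between 0 and 1; `formatProgress` is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper: turn a logger message into "<status>: <percent>%".
// Assumes `m.status` is a string and `m.progress` is a number in [0, 1];
// a missing progress value is treated as 0.
function formatProgress(m) {
  const percent = Math.round((m.progress ?? 0) * 100);
  return `${m.status}: ${percent}%`;
}

// A logger function suitable for passing as the `logger` option.
const progressLogger = (m) => console.log(formatProgress(m));
```

It would then be passed as `createWorker('eng', 1, { logger: progressLogger })`.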
@ -59,255 +63,211 @@ const worker = await createWorker({
});
```
## Worker
A Worker helps you to do the OCR related tasks, it takes few steps to setup Worker before it is fully functional. The full flow is:
- FS functions // optional
- loadLanguage
- initialize
- setParameters // optional
- recognize or detect
- terminate
Each function is async, so using async/await or Promise is required. When it is resolved, you get an object:
```json
{
"jobId": "Job-1-123",
"data": { ... }
}
```
jobId is generated by Tesseract.js, but you can put your own when calling any of the function above.
<a name="worker-recognize"></a>
### Worker.recognize(image, options, jobId): Promise
<a name="worker-writeText"></a>
### Worker.writeText(path, text, jobId): Promise
Worker.recognize() provides core function of Tesseract.js as it executes OCR
Worker.writeText() writes a text file to the path specified in MEMFS, it is useful when you want to use some features that requires tesseract.js
to read file from file system.
Figures out what words are in `image`, where the words are in `image`, etc.
> Note: `image` should be sufficiently high resolution.
> Often, the same image will get much better results if you upscale it before calling `recognize`.
**Arguments:**
- `path` text file path
- `text` content of the text file
- `image` see [Image Format](./image-format.md) for more details.
- `options` an object of customized options
- `rectangle` an object specifying the region of the image to recognize; it should contain top, left, width and height (see example below).
- `output` an object specifying which output formats to return (by default `text`, `blocks`, `hocr`, and `tsv` are returned)
- `jobId` Please see details above
**Output:**
**Examples:**
```javascript
const { createWorker } = Tesseract;
(async () => {
await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n');
const worker = await createWorker('eng');
const { data: { text } } = await worker.recognize(image);
console.log(text);
})();
```
<a name="worker-readText"></a>
### Worker.readText(path, jobId): Promise
Worker.readText() reads a text file to the path specified in MEMFS, it is useful when you want to check the content.
**Arguments:**
- `path` text file path
- `jobId` Please see details above
**Examples:**
With rectangle
```javascript
const { createWorker } = Tesseract;
(async () => {
const { data } = await worker.readText('tmp.txt');
console.log(data);
const worker = await createWorker('eng');
const { data: { text } } = await worker.recognize(image, {
rectangle: { top: 0, left: 0, width: 100, height: 100 },
});
console.log(text);
})();
```
<a name="worker-removeFile"></a>
### Worker.removeFile(path, jobId): Promise
<a name="worker-set-parameters"></a>
### worker.setParameters(params, jobId): Promise
Worker.removeFile() remove a file in MEMFS, it is useful when you want to free the memory.
`worker.setParameters()` sets parameters for the Tesseract API (using SetVariable()). This changes the behavior of Tesseract; some parameters, such as tessedit\_char\_whitelist, are very useful.
**Arguments:**
- `path` file path
- `params` an object with key and value of the parameters
- `jobId` Please see details above
**Examples:**
```javascript
(async () => {
await worker.removeFile('tmp.txt');
})();
```
Note: `worker.setParameters` cannot be used to change the `oem`, as this value is set at initialization. `oem` is initially set using an argument of `createWorker`. After a worker already exists, changing `oem` requires running `worker.reinitialize`.
<a name="worker-FS"></a>
### Worker.FS(method, args, jobId): Promise
Worker.FS() is a generic FS function to do anything you want, you can check [HERE](https://emscripten.org/docs/api_reference/Filesystem-API.html) for all functions.
**Useful Parameters:**
**Arguments:**
| name | type | default value | description |
| --------------------------- | ------ | ----------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode |
| tessedit\_char\_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful if content in image is limited |
| preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words |
| user\_defined\_dpi | string | '' | Define custom dpi, use to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** |
- `method` method name
- `args` array of arguments to pass
- `jobId` Please see details above
This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.)
**Examples:**
```javascript
(async () => {
await worker.FS('writeFile', ['tmp.txt', 'Hi\nTesseract.js\n']);
// equal to:
// await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n');
})();
await worker.setParameters({
tessedit_char_whitelist: '0123456789',
});
})
```
<a name="worker-load-language"></a>
### Worker.loadLanguage(langs, jobId): Promise
<a name="worker-reinitialize"></a>
### worker.reinitialize(langs, oem, jobId): Promise
Worker.loadLanguage() loads traineddata from cache or download traineddata from remote, and put traineddata into the WebAssembly file system.
`worker.reinitialize()` re-initializes an existing Tesseract.js worker with different `langs` and `oem` arguments.
**Arguments:**
- `langs` a string indicating which languages' traineddata to download; multiple languages are concatenated with **+**, e.g. **eng+chi\_tra**
- `oem` an enum indicating the OCR Engine Mode to use
- `jobId` Please see details above
Note: to switch from Tesseract LSTM (`oem` value `1`) to Tesseract Legacy (`oem` value `0`) using `worker.reinitialize()`, the worker must already contain the code required to run the Tesseract Legacy model. Setting `legacyCore: true` and `legacyLang: true` in `createWorker` options ensures this is the case.
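The note above can be illustrated with a short sketch: the worker starts in LSTM mode but is created with `legacyCore` and `legacyLang` enabled, so a later `reinitialize` call with `oem` value `0` has the Legacy code and data available. Both function names are hypothetical, and `createWorker` is passed in as an argument so the snippet stands on its own.

```javascript
// Hypothetical helper: create an LSTM worker (oem 1) that is able to
// switch to the Legacy model later. `createWorker` is assumed to be the
// v5 function exported by tesseract.js.
async function createSwitchableWorker(createWorker, langs = 'eng') {
  // legacyCore / legacyLang request Legacy-capable code and language data
  // up front, even though the worker starts in LSTM mode.
  return createWorker(langs, 1, { legacyCore: true, legacyLang: true });
}

// Hypothetical helper: re-initialize an existing worker with the
// Tesseract Legacy model (oem 0).
async function switchToLegacy(worker, langs = 'eng') {
  await worker.reinitialize(langs, 0);
  return worker;
}
```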
**Examples:**
```javascript
(async () => {
await worker.loadLanguage('eng+chi_tra');
})();
await worker.reinitialize('eng', 1);
```
<a name="worker-initialize"></a>
### Worker.initialize(langs, oem, jobId): Promise
<a name="worker-detect"></a>
### Worker.detect(image, jobId): Promise
Worker.detect() does OSD (Orientation and Script Detection) to the image instead of OCR.
Worker.initialize() initializes the Tesseract API, make sure it is ready for doing OCR tasks.
Note: Running `worker.detect` requires a worker with code and language data that supports Tesseract Legacy (this is not enabled by default). If you want to run `worker.detect`, set `legacyCore` and `legacyLang` to `true` in the `createWorker` options.
**Arguments:**
- `langs` a string to indicate the languages loaded by Tesseract API, it can be the subset of the languauge traineddata you loaded from Worker.loadLanguage.
- `oem` a enum to indicate the OCR Engine Mode you use
- `image` see [Image Format](./image-format.md) for more details.
- `jobId` Please see details above
**Examples:**
```javascript
const { createWorker } = Tesseract;
(async () => {
/** You can load more languages in advance, but use only part of them in Worker.initialize() */
await worker.loadLanguage('eng+chi_tra');
await worker.initialize('eng');
const worker = await createWorker('eng', 1, {legacyCore: true, legacyLang: true});
const { data } = await worker.detect(image);
console.log(data);
})();
```
<a name="worker-set-parameters"></a>
### Worker.setParameters(params, jobId): Promise
Worker.setParameters() set parameters for Tesseract API (using SetVariable()), it changes the behavior of Tesseract and some parameters like tessedit\_char\_whitelist is very useful.
**Arguments:**
- `params` an object with key and value of the parameters
- `jobId` Please see details above
**Useful Parameters:**
| name | type | default value | description |
| --------------------------- | ------ | ----------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| tessedit\_ocr\_engine\_mode | enum | OEM.DEFAULT | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L268) for definition of each mode |
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode |
| tessedit\_char\_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful if content in image is limited |
| preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words |
| user\_defined\_dpi | string | '' | Define custom dpi, use to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** |
This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.)
<a name="worker-terminate"></a>
### Worker.terminate(jobId): Promise
**Examples:**
Worker.terminate() terminates the worker and cleans up
```javascript
(async () => {
await worker.setParameters({
tessedit_char_whitelist: '0123456789',
});
})
await worker.terminate();
})();
```
<a name="worker-recognize"></a>
### Worker.recognize(image, options, jobId): Promise
Worker.recognize() provides core function of Tesseract.js as it executes OCR
<a name="worker-writeText"></a>
### Worker.writeText(path, text, jobId): Promise
Figures out what words are in `image`, where the words are in `image`, etc.
> Note: `image` should be sufficiently high resolution.
> Often, the same image will get much better results if you upscale it before calling `recognize`.
Worker.writeText() writes a text file to the path specified in MEMFS, it is useful when you want to use some features that requires tesseract.js
to read file from file system.
**Arguments:**
- `image` see [Image Format](./image-format.md) for more details.
- `options` an object of customized options
- `rectangle` an object to specify the regions you want to recognized in the image, should contain top, left, width and height, see example below.
- `output` an object specifying which output formats to return (by default `text`, `blocks`, `hocr`, and `tsv` are returned)
- `path` text file path
- `text` content of the text file
- `jobId` Please see details above
**Output:**
**Examples:**
```javascript
const { createWorker } = Tesseract;
(async () => {
const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize(image);
console.log(text);
await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n');
})();
```
With rectangle
<a name="worker-readText"></a>
### Worker.readText(path, jobId): Promise
Worker.readText() reads a text file from the path specified in MEMFS; it is useful when you want to check the content.
**Arguments:**
- `path` text file path
- `jobId` Please see details above
**Examples:**
```javascript
const { createWorker } = Tesseract;
(async () => {
const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize(image, {
rectangle: { top: 0, left: 0, width: 100, height: 100 },
});
console.log(text);
const { data } = await worker.readText('tmp.txt');
console.log(data);
})();
```
<a name="worker-detect"></a>
### Worker.detect(image, jobId): Promise
<a name="worker-removeFile"></a>
### Worker.removeFile(path, jobId): Promise
Worker.detect() does OSD (Orientation and Script Detection) to the image instead of OCR.
Worker.removeFile() removes a file in MEMFS; it is useful when you want to free the memory.
**Arguments:**
- `image` see [Image Format](./image-format.md) for more details.
- `path` file path
- `jobId` Please see details above
**Examples:**
```javascript
const { createWorker } = Tesseract;
(async () => {
const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data } = await worker.detect(image);
console.log(data);
await worker.removeFile('tmp.txt');
})();
```
<a name="worker-terminate"></a>
### Worker.terminate(jobId): Promise
<a name="worker-FS"></a>
### Worker.FS(method, args, jobId): Promise
Worker.terminate() terminates the worker and cleans up
Worker.FS() is a generic FS function to do anything you want, you can check [HERE](https://emscripten.org/docs/api_reference/Filesystem-API.html) for all functions.
**Arguments:**
- `method` method name
- `args` array of arguments to pass
- `jobId` Please see details above
**Examples:**
```javascript
(async () => {
await worker.terminate();
await worker.FS('writeFile', ['tmp.txt', 'Hi\nTesseract.js\n']);
// equal to:
// await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n');
})();
```
@ -404,13 +364,17 @@ setLogging(true);
<a name="recognize"></a>
## recognize(image, langs, options): Promise
recognize() is a function to quickly do recognize() task, it is not recommended to use in real application, but useful when you want to save some time.
This function is deprecated and should be replaced with `worker.recognize` (see above).
`recognize` works the same as `worker.recognize`, except that a new worker is created, loaded, and destroyed every time the function is called.
See [Tesseract.js](../src/Tesseract.js)
<a name="detect"></a>
## detect(image, options): Promise
This function is deprecated and should be replaced with `worker.detect` (see above).
Same background as recognize(), but it does detect instead.
See [Tesseract.js](../src/Tesseract.js)

@ -7,11 +7,9 @@ You can also check [examples](../examples) folder.
```javascript
const { createWorker } = require('tesseract.js');
const worker = await createWorker();
const worker = await createWorker('eng');
(async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
@ -23,13 +21,11 @@ const worker = await createWorker();
```javascript
const { createWorker } = require('tesseract.js');
const worker = await createWorker({
const worker = await createWorker('eng', 1, {
logger: m => console.log(m), // Add logger here
});
(async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
@ -41,11 +37,9 @@ const worker = await createWorker({
```javascript
const { createWorker } = require('tesseract.js');
const worker = await createWorker();
const worker = await createWorker('eng+chi_tra');
(async () => {
await worker.loadLanguage('eng+chi_tra');
await worker.initialize('eng+chi_tra');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
@ -56,11 +50,9 @@ const worker = await createWorker();
```javascript
const { createWorker } = require('tesseract.js');
const worker = await createWorker();
const worker = await createWorker('eng');
(async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.setParameters({
tessedit_char_whitelist: '0123456789',
});
@ -77,11 +69,9 @@ Check here for more details of pageseg mode: https://github.com/tesseract-ocr/te
```javascript
const { createWorker, PSM } = require('tesseract.js');
const worker = await createWorker();
const worker = await createWorker('eng');
(async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.setParameters({
tessedit_pageseg_mode: PSM.SINGLE_BLOCK,
});
@ -105,12 +95,10 @@ Node: [download-pdf.js](../examples/node/download-pdf.js)
```javascript
const { createWorker } = require('tesseract.js');
const worker = await createWorker();
const worker = await createWorker('eng');
const rectangle = { left: 0, top: 0, width: 500, height: 250 };
(async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png', { rectangle });
console.log(text);
await worker.terminate();
@ -122,7 +110,7 @@ const rectangle = { left: 0, top: 0, width: 500, height: 250 };
```javascript
const { createWorker } = require('tesseract.js');
const worker = await createWorker();
const worker = await createWorker('eng');
const rectangles = [
{
left: 0,
@ -139,8 +127,6 @@ const rectangles = [
];
(async () => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
const values = [];
for (let i = 0; i < rectangles.length; i++) {
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png', { rectangle: rectangles[i] });
@ -157,8 +143,8 @@ const rectangles = [
const { createWorker, createScheduler } = require('tesseract.js');
const scheduler = createScheduler();
const worker1 = await createWorker();
const worker2 = await createWorker();
const worker1 = await createWorker('eng');
const worker2 = await createWorker('eng');
const rectangles = [
{
left: 0,
@ -175,10 +161,6 @@ const rectangles = [
];
(async () => {
await worker1.loadLanguage('eng');
await worker2.loadLanguage('eng');
await worker1.initialize('eng');
await worker2.initialize('eng');
scheduler.addWorker(worker1);
scheduler.addWorker(worker2);
const results = await Promise.all(rectangles.map((rectangle) => (
@ -195,14 +177,10 @@ const rectangles = [
const { createWorker, createScheduler } = require('tesseract.js');
const scheduler = createScheduler();
const worker1 = await createWorker();
const worker2 = await createWorker();
const worker1 = await createWorker('eng');
const worker2 = await createWorker('eng');
(async () => {
await worker1.loadLanguage('eng');
await worker2.loadLanguage('eng');
await worker1.initialize('eng');
await worker2.initialize('eng');
scheduler.addWorker(worker1);
scheduler.addWorker(worker2);
/** Add 10 recognition jobs */

@ -19,8 +19,6 @@ Default settings should provide optimal results for most users. If you do want
# Trained Data
## How does tesseract.js download and keep \*.traineddata?
The language model is downloaded by `worker.loadLanguage()` and you need to pass the langs to `worker.initialize()`.
When downloading a language model, Tesseract.js first checks whether \*.traineddata already exists (browser: [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API), Node.js: fs, in the folder where you execute the command). If \*.traineddata does not exist, it fetches \*.traineddata.gz from [tessdata](https://github.com/naptha/tessdata), ungzips it, and stores it in IndexedDB or fs. You can delete the stored file manually, and it will be downloaded again.
## How can I train my own \*.traineddata?

@ -1,68 +0,0 @@
# Overview
Tesseract.js offers 3 different ways to recognize text, which vary in complexity. This allows Tesseract.js to provide ease of use to new users experimenting with Tesseract.js, while offering control and performance to more experienced users. Each option is described in brief below, in order of complexity. For more detailed documentation on each function, see the [api page](./api.md).
# Option 1: Single Function
By using `Tesseract.recognize`, you can recognize text with just 1 function and 2 arguments (image and language). This makes it easy for new users to experiment with Tesseract.js.
```
Tesseract.recognize(
'https://tesseract.projectnaptha.com/img/eng_bw.png',
'eng'
).then(({ data: { text } }) => {
console.log(text);
})
```
This option should generally be avoided in production code. Using `Tesseract.recognize` results in a new worker being created and loaded with language data whenever `Tesseract.recognize` is run. This is inefficient for reasons explained below.
# Option 2: Using Workers
Tesseract.js also supports creating and managing workers (the objects that execute recognition) manually.
```
(async () => {
const worker = await Tesseract.createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
This code block is no more efficient than the `Tesseract.recognize` example as written (in both cases a worker is created and destroyed for recognizing a single image). However, within the context of an actual application, separating (1) creating a worker and loading data and (2) running recognition jobs provides developers the control necessary to write more efficient code:
1. Workers can be prepared ahead of time
- E.g. a worker can be created and loaded with language data when the page is first loaded, rather than waiting for a user to upload an image to recognize
1. Workers can be reused for multiple recognition jobs, rather than creating a new worker and loading language data for every image recognized (as `Tesseract.recognize` does)
# Option 3: Using Schedulers + Workers
Finally, Tesseract.js supports schedulers. A scheduler is an object that contains multiple workers, which it uses to execute jobs in parallel.
```
const scheduler = Tesseract.createScheduler();
// Creates worker and adds to scheduler
const workerGen = async () => {
const worker = await Tesseract.createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
scheduler.addWorker(worker);
}
const workerN = 4;
(async () => {
const resArr = Array(workerN);
for (let i=0; i<workerN; i++) {
resArr[i] = workerGen();
}
await Promise.all(resArr);
/** Add 4 recognition jobs */
const results = await Promise.all(Array(10).fill(0).map(() => (
scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png').then((x) => console.log(x.data.text))
)))
await scheduler.terminate(); // It also terminates all workers.
})();
```
While using schedulers is no more efficient for a single job, they allow for quickly executing large numbers of jobs in parallel.
When working with schedulers, note that workers added to the same scheduler should all be homogeneous: they should have the same language and be configured with the same parameters. Schedulers assign jobs to workers in a non-deterministic manner, so if the workers are not identical then recognition results will depend on which worker the job is assigned to.

@ -8,21 +8,11 @@ Because of this we recommend loading `tesseract.js` from a CDN. But if you reall
In a Node.js environment, the only path you may want to customize is languages/langPath.
```javascript
Tesseract.recognize(image, langs, {
workerPath: 'https://cdn.jsdelivr.net/npm/tesseract.js@v4.0.3/dist/worker.min.js',
langPath: 'https://tessdata.projectnaptha.com/4.0.0',
corePath: 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3',
})
```
Or
```javascript
const worker = await createWorker({
workerPath: 'https://cdn.jsdelivr.net/npm/tesseract.js@v4.0.3/dist/worker.min.js',
workerPath: 'https://cdn.jsdelivr.net/npm/tesseract.js@v5.0.0/dist/worker.min.js',
langPath: 'https://tessdata.projectnaptha.com/4.0.0',
corePath: 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3',
corePath: 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v5.0.0',
});
```
@ -30,11 +20,18 @@ const worker = await createWorker({
A string specifying the location of the `worker.js` file.
### langPath
A string specifying the location of the tesseract language files, with default value 'https://tessdata.projectnaptha.com/4.0.0'. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`.
A string specifying the location of the tesseract language files. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`. If `langPath` is not specified by the user, then the correct language data will be automatically downloaded from the jsDelivr CDN.
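As a rough illustration of the formula above, the following sketch builds a language file URL. The helper name `languageDataUrl` is hypothetical and not part of the Tesseract.js API, and it assumes a `/` separator is placed between `langPath` and the language code, which the formula as written leaves implicit.

```javascript
// Hypothetical helper, not part of Tesseract.js: illustrates how a language
// file URL could be derived from `langPath` and a language code.
function languageDataUrl(langPath, langCode) {
  // Assumption: drop any trailing slash, then join with a single "/".
  const base = langPath.replace(/\/+$/, '');
  return `${base}/${langCode}.traineddata.gz`;
}

console.log(languageDataUrl('https://tessdata.projectnaptha.com/4.0.0', 'eng'));
```

If you host language data yourself, a helper like this can be handy for checking that the expected `.traineddata.gz` files actually exist at the location you pass as `langPath`.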
### corePath
A string specifying the location of the [tesseract.js-core](https://github.com/naptha/tesseract.js-core) files, with default value 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3'.
A string specifying the location of the [tesseract.js-core](https://github.com/naptha/tesseract.js-core) files, with default value 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v5.0.0'.
If you set the `corePath` argument, be sure to set it to a directory that contains **all 4** of these files:
1. `tesseract-core.wasm.js`
2. `tesseract-core-simd.wasm.js`
3. `tesseract-core-lstm.wasm.js`
4. `tesseract-core-simd-lstm.wasm.js`
`corePath` should be set to a directory containing both `tesseract-core-simd.wasm.js` and `tesseract-core.wasm.js`. Tesseract.js will load either `tesseract-core-simd.wasm.js` or `tesseract-core.wasm.js` from the directory depending on whether the users' device supports SIMD (see [https://webassembly.org/roadmap/](https://webassembly.org/roadmap/)).
Tesseract.js will pick the correct file based on your users' device and the `createWorker` options.
To avoid breaking old code, when `corePath` is set to a specific `.js` file (e.g. `https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3/tesseract-core.wasm.js`), it will load that file regardless of whether the users' device supports SIMD or not. This behavior only exists to preserve backwards compatibility—setting `corePath` to a specific `.js` file is strongly discouraged. Doing so will either result in much slower performance (if `tesseract-core.wasm.js` is specified) or failure to run on certain devices (if `tesseract-core-simd.wasm.js` is specified).
To avoid breaking old code, when `corePath` is set to a specific `.js` file (e.g. `https://cdn.jsdelivr.net/npm/tesseract.js-core@v5.0.0/tesseract-core.wasm.js`), it will load that file regardless of whether the users' device supports SIMD or not. This behavior only exists to preserve backwards compatibility—setting `corePath` to a specific `.js` file is strongly discouraged. Doing so will either result in much slower performance (if `tesseract-core.wasm.js` is specified) or failure to run on certain devices (if `tesseract-core-simd.wasm.js` is specified).
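The file selection described above might look roughly like the sketch below. This is an illustration under stated assumptions, not the library's actual code: `pickCoreFile` and its `simd` and `lstmOnly` flags are hypothetical names, with `simd` standing in for the device capability check and `lstmOnly` standing in for whichever `createWorker` options determine that only the LSTM model is needed.

```javascript
// Hypothetical sketch, not the Tesseract.js implementation.
function pickCoreFile(corePath, { simd, lstmOnly }) {
  // Backwards-compatible case: a specific .js file is used as-is,
  // so the device and option flags are ignored entirely.
  if (corePath.endsWith('.js')) return corePath;

  // Otherwise treat corePath as a directory and build one of the 4 file names.
  const name = [
    'tesseract-core',
    simd ? '-simd' : '',
    lstmOnly ? '-lstm' : '',
    '.wasm.js',
  ].join('');
  return `${corePath.replace(/\/+$/, '')}/${name}`;
}

console.log(pickCoreFile('/node_modules/tesseract.js-core', { simd: true, lstmOnly: true }));
```

The first branch shows why pinning `corePath` to a single file is discouraged: once a specific file is given, nothing about the device or the worker options can influence which build is loaded.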

@ -2,38 +2,37 @@
This guide contains tips and strategies for getting the fastest performance from Tesseract.js. While some of the tips below involve avoiding pitfalls and should be universally implemented, other strategies (changing the language data or recognition model) may harm recognition quality. Therefore, whether these strategies are appropriate depends on the application, and users should always benchmark performance and quality before changing important settings from their defaults.
# Reducing Setup Time
Within certain applications, the majority of runtime may be attributable to setup steps (`createWorker`, `worker.initialize`, and `worker.loadLanguage`) rather than recognition (`worker.recognize`). Implementing the strategies in this section should reduce the time spent on these steps.
Within certain applications, the majority of runtime may be attributable to setup steps (`createWorker`) rather than recognition (`worker.recognize`). Implementing the strategies in this section should reduce the time spent on these steps.
Notably, the time spent on setup for first-time users may not be apparent to developers, as Tesseract.js caches language data after it is downloaded for the first time. To experience Tesseract.js as a first-time user, set `cacheMethod: 'none'` in the [createWorker options](./api.md#createworkeroptions-worker). Be sure to remove this setting before publishing your app.
### Reuse Workers
When recognizing multiple images, some users will create/load/destroy a new worker for each image. This is never the correct option. If the images are being recognized one after the other, all of the extra `createWorker`/`worker.initialize`/`worker.loadLanguage` steps are wasted runtime, as `worker.recognize` could be run with the same `worker`. Workers do not break after one use.
When recognizing multiple images, some users will create/load/destroy a new worker for each image. This is never the correct option. If the images are being recognized one after the other, all of the extra steps required to create/load/destroy a new worker are wasted runtime, as `worker.recognize` could be run with the same `worker`. Workers do not break after one use.
Alternatively, if images are being recognized in parallel, then creating a new worker for each recognition job is likely to cause crashes due to resource limitations. As each Tesseract.js worker uses a large amount of memory, code should never be able to create an arbitrary number of workers. Instead, a scheduler should be used to create a fixed-size pool of workers (say, 4 workers), with jobs then assigned through the scheduler.
### Set Up Workers Ahead of Time
Rather than waiting until the last minute to load code and data, you can set up a worker ahead of time. Doing so greatly reduces runtime the first time a user runs recognition. This requires managing workers rather than using `Tesseract.recognize`, which is explained [here](./intro.md). An example where a worker is prepared ahead of time can be found [here](../examples/browser/basic-efficient.html).
Rather than waiting until the last minute to load code and data, you can set up a worker ahead of time. Doing so greatly reduces runtime the first time a user runs recognition. An example where a worker is prepared ahead of time can be found [here](../examples/browser/basic-efficient.html).
The appropriate time to load Tesseract.js workers and data is application-specific. For example, if you have a web app where only 5% of users need OCR, it likely does not make sense to download ~15MB in code and data upon a page load. You could consider instead loading Tesseract.js when a user indicates they want to perform OCR, but before they select a specific image.
### Do Not Disable Language Data Caching
Language data is, by far, the largest download required to run Tesseract.js. The default `eng.traineddata` file is 10.4MB compressed. The default `chi_sim.traineddata` file is 19.2MB compressed.
Language data is one of the largest downloads required to run Tesseract.js. While most language data files (including the default English file) are ~2MB, in a worst-case scenario they can be much larger. For example, setting the recognition model (`oem`) to Tesseract Legacy and language to Chinese (simplified) results in a ~20MB file being downloaded.
To avoid downloading language data multiple times, Tesseract.js caches `.traineddata` files. In past versions of Tesseract.js, this caching behavior contained bugs, so some users disabled it (setting `cacheMethod: 'none'` or `cacheMethod: 'refresh'`). As these bugs were fixed in [v4.0.6](https://github.com/naptha/tesseract.js/releases/tag/v4.0.6), it is now recommended that users use the default `cacheMethod` value (i.e. just ignore the `cacheMethod` argument).
### Consider Using Smaller Language Data
The default language data used by Tesseract.js includes data for both Tesseract engines (LSTM [the default model] and Legacy), and is optimized for quality rather than speed. Both the inclusion of multiple models and the focus on quality increase the size of the language data. Setting a non-default `langPath` may result in significantly smaller files being downloaded.
For example, by changing `langPath` from the default (`https://tessdata.projectnaptha.com/4.0.0`) to `https://tessdata.projectnaptha.com/4.0.0_fast` the size of the compressed English language data is reduced from 10.9MB to 2.0MB. Note that this language data (1) only supports the default LSTM model and (2) is optimized for size/speed rather than quality, so users should switch only after testing whether this data works for their application.
# Reducing Recognition Runtime
### Use the Latest Version of Tesseract.js
Old versions of Tesseract.js are significantly slower. Notably, v2 (now deprecated) takes 10x longer to recognize certain images compared to the latest version.
### Consider Using the Legacy Model
In general, the LSTM (default) recognition model provides the best quality. However, the Legacy model generally runs faster, and depending on your application, may provide sufficient recognition quality. If runtime is a significant concern, consider experimenting with the Legacy model (by setting `oem` to `"0"` within `worker.initialize`). As a rule of thumb, the Legacy model is usually viable when the input data is high-quality (high-definition screenshots, document scans, etc.).
### Do Not Set `corePath` to a Single `.js` file
If you set the `corePath` argument, be sure to set it to a directory that contains **all 4** of these files:
### Consider Using "Fast" Language Data
By default, Tesseract.js uses language data that is optimized for quality rather than speed. You can also experiment with using language data that is optimized for speed by setting `langPath` to `https://tessdata.projectnaptha.com/4.0.0_fast`.
1. `tesseract-core.wasm.js`
2. `tesseract-core-simd.wasm.js`
3. `tesseract-core-lstm.wasm.js`
4. `tesseract-core-simd-lstm.wasm.js`
### Do Not Set `corePath` to a Single `.js` file
If you set the `corePath` argument, be sure to set it to a directory that contains both `tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js`. Tesseract.js needs to be able to pick between both files; setting `corePath` to a specific `.js` file will significantly degrade performance or compatibility. See [this comment](https://github.com/naptha/tesseract.js/issues/735#issuecomment-1519157646) for explanation.
Tesseract.js needs to be able to pick between these files—setting `corePath` to a specific `.js` file will significantly degrade performance or compatibility.
### Consider Using "Fast" Language Data
By default, Tesseract.js uses language data that is optimized for quality rather than speed. You can also experiment with using language data that is optimized for speed by setting `langPath` to `https://tessdata.projectnaptha.com/4.0.0_fast`. We have not benchmarked the impact this has on performance or accuracy, so feel free to open a Git Issue if you do so and wish to share results.

@ -0,0 +1,51 @@
# Overview
Tesseract.js offers 2 ways to run recognition jobs: (1) using a worker directly, or (2) using a scheduler to run jobs on multiple workers in parallel. The syntax for the latter is more complicated, but using parallel processing via schedulers provides significantly better performance for large jobs. For more detailed documentation on each function, see the [api page](./api.md).
# Option 1: Using Workers Directly
Tesseract.js supports creating and managing workers (the objects that execute recognition) directly.
```
(async () => {
const worker = await Tesseract.createWorker('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
In actual use, the `createWorker` step should be separated from the `worker.recognize` step. Doing so enables the following benefits:
1. Workers can be prepared ahead of time
- E.g. a worker can be created when the page is first loaded, rather than waiting for a user to upload an image to recognize
1. Workers can be reused for multiple recognition jobs, rather than creating a new worker and loading language data for every image recognized
- Remember to call `worker.terminate()` after all recognition is complete to free memory
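One way to separate the two steps is to create the worker lazily and share it across jobs. The sketch below is a minimal illustration; `lazyWorker` is a hypothetical helper rather than part of Tesseract.js, and the factory passed to it is assumed to be something like `() => Tesseract.createWorker('eng')`.

```javascript
// Hypothetical helper: wraps a worker factory so the worker is created at
// most once, and the same promise is handed to every caller.
function lazyWorker(createWorkerFn) {
  let workerPromise = null;
  return () => {
    if (workerPromise === null) {
      workerPromise = createWorkerFn();
    }
    return workerPromise;
  };
}

// Assumed usage (requires Tesseract.js to be loaded):
// const getWorker = lazyWorker(() => Tesseract.createWorker('eng'));
// getWorker(); // start setup early, e.g. when the user opens the OCR panel
// const worker = await getWorker(); // later calls reuse the same worker
// const { data: { text } } = await worker.recognize(image);
```

Because every call returns the same promise, setup can be triggered early without risking a second worker being created when the first recognition job arrives.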
# Option 2: Using Schedulers + Workers
Tesseract.js also supports executing jobs using schedulers. A scheduler is an object that contains multiple workers, which it uses to execute jobs in parallel. For example, the following code executes 10 jobs in parallel using 4 workers.
```
const scheduler = Tesseract.createScheduler();
// Creates worker and adds to scheduler
const workerGen = async () => {
const worker = await Tesseract.createWorker('eng');
scheduler.addWorker(worker);
}
const workerN = 4;
(async () => {
const resArr = Array(workerN);
for (let i=0; i<workerN; i++) {
resArr[i] = workerGen();
}
await Promise.all(resArr);
/** Add 10 recognition jobs */
const results = await Promise.all(Array(10).fill(0).map(() => (
scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png').then((x) => console.log(x.data.text))
)))
await scheduler.terminate(); // It also terminates all workers.
})();
```
While using schedulers is no more efficient for a single job, they allow for quickly executing large numbers of jobs in parallel.
When working with schedulers, note that workers added to the same scheduler should all be homogeneous: they should have the same language and be configured with the same parameters. Schedulers assign jobs to workers in a non-deterministic manner, so if the workers are not identical then recognition results will depend on which worker the job is assigned to.

@ -1,7 +1,7 @@
<!DOCTYPE HTML>
<html>
<head>
<script src="/dist/tesseract.dev.js"></script>
<script src="/dist/tesseract.min.js"></script>
</head>
<body>
<input type="file" id="uploader" multiple>
@ -10,16 +10,12 @@
// This is a basic example more efficient than "basic.html".
// In this example we create a worker once, and this worker is re-used
// every time the user uploads a new file.
const worker = await Tesseract.createWorker({
const worker = await Tesseract.createWorker("eng", 1, {
corePath: '../../node_modules/tesseract.js-core',
workerPath: "/dist/worker.dev.js",
workerPath: "/dist/worker.min.js",
logger: function(m){console.log(m);}
});
await worker.loadLanguage('eng');
await worker.initialize('eng');
const recognize = async function(evt){
const files = evt.target.files;

@ -1,7 +1,7 @@
<!DOCTYPE HTML>
<html>
<head>
<script src="/dist/tesseract.dev.js"></script>
<script src="/dist/tesseract.min.js"></script>
</head>
<body>
<input type="file" id="uploader" multiple>
@ -16,13 +16,11 @@
// Creates worker and adds to scheduler
const workerGen = async () => {
const worker = await Tesseract.createWorker({
const worker = await Tesseract.createWorker("eng", 1, {
corePath: '../../node_modules/tesseract.js-core',
workerPath: "/dist/worker.dev.js",
workerPath: "/dist/worker.min.js",
logger: function(m){console.log(m);}
});
await worker.loadLanguage('eng');
await worker.initialize('eng');
scheduler.addWorker(worker);
}

@ -1,26 +0,0 @@
<html>
<head>
<script src="/dist/tesseract.dev.js"></script>
</head>
<body>
<input type="file" id="uploader">
<script>
// This is the most basic example (contains a single function call).
// However, in cases when multiple recognition jobs are run,
// calling Tesseract.recognize() each time is inefficient.
// See "basic-efficient.html" for a more efficient example.
const recognize = async ({ target: { files } }) => {
const { data: { text } } = await Tesseract.recognize(files[0], 'eng', {
corePath: '../../node_modules/tesseract.js-core',
workerPath: "/dist/worker.dev.js",
logger: m => console.log(m),
});
console.log(text);
}
const elm = document.getElementById('uploader');
elm.addEventListener('change', recognize);
</script>
</body>
</html>

@ -1,160 +0,0 @@
<script src="/dist/tesseract.dev.js"></script>
<script>
function progressUpdate(packet){
var log = document.getElementById('log');
if(log.firstChild && log.firstChild.status === packet.status){
if('progress' in packet){
var progress = log.firstChild.querySelector('progress')
progress.value = packet.progress
}
}else{
var line = document.createElement('div');
line.status = packet.status;
var status = document.createElement('div')
status.className = 'status'
status.appendChild(document.createTextNode(packet.status))
line.appendChild(status)
if('progress' in packet){
var progress = document.createElement('progress')
progress.value = packet.progress
progress.max = 1
line.appendChild(progress)
}
if(packet.status == 'done'){
var pre = document.createElement('pre')
pre.appendChild(document.createTextNode(packet.data.data.text))
line.innerHTML = ''
line.appendChild(pre)
}
log.insertBefore(line, log.firstChild)
}
}
async function recognizeFile(file) {
document.querySelector("#log").innerHTML = ''
const corePath = '../../node_modules/tesseract.js-core';
const lang = document.querySelector('#langsel').value
const data = await Tesseract.recognize(file, lang, {
corePath,
logger: progressUpdate,
});
progressUpdate({ status: 'done', data });
}
</script>
<select id="langsel" onchange="window.lastFile && recognizeFile(window.lastFile)">
<option value='afr' > Afrikaans </option>
<option value='ara' > Arabic </option>
<option value='aze' > Azerbaijani </option>
<option value='bel' > Belarusian </option>
<option value='ben' > Bengali </option>
<option value='bul' > Bulgarian </option>
<option value='cat' > Catalan </option>
<option value='ces' > Czech </option>
<option value='chi_sim' > Chinese </option>
<option value='chi_tra' > Traditional Chinese </option>
<option value='chr' > Cherokee </option>
<option value='dan' > Danish </option>
<option value='deu' > German </option>
<option value='ell' > Greek </option>
<option value='eng' selected> English </option>
<option value='enm' > English (Old) </option>
<option value='meme' > Internet Meme </option>
<option value='epo' > Esperanto </option>
<option value='epo_alt' > Esperanto alternative </option>
<option value='est' > Estonian </option>
<option value='eus' > Basque </option>
<option value='fin' > Finnish </option>
<option value='fra' > French </option>
<option value='frk' > Frankish </option>
<option value='frm' > French (Old) </option>
<option value='glg' > Galician </option>
<option value='grc' > Ancient Greek </option>
<option value='heb' > Hebrew </option>
<option value='hin' > Hindi </option>
<option value='hrv' > Croatian </option>
<option value='hun' > Hungarian </option>
<option value='ind' > Indonesian </option>
<option value='isl' > Icelandic </option>
<option value='ita' > Italian </option>
<option value='ita_old' > Italian (Old) </option>
<option value='jpn' > Japanese </option>
<option value='kan' > Kannada </option>
<option value='kor' > Korean </option>
<option value='lav' > Latvian </option>
<option value='lit' > Lithuanian </option>
<option value='mal' > Malayalam </option>
<option value='mkd' > Macedonian </option>
<option value='mlt' > Maltese </option>
<option value='msa' > Malay </option>
<option value='nld' > Dutch </option>
<option value='nor' > Norwegian </option>
<option value='pol' > Polish </option>
<option value='por' > Portuguese </option>
<option value='ron' > Romanian </option>
<option value='rus' > Russian </option>
<option value='slk' > Slovakian </option>
<option value='slv' > Slovenian </option>
<option value='spa' > Spanish </option>
<option value='spa_old' > Old Spanish </option>
<option value='sqi' > Albanian </option>
<option value='srp' > Serbian (Latin) </option>
<option value='swa' > Swahili </option>
<option value='swe' > Swedish </option>
<option value='tam' > Tamil </option>
<option value='tel' > Telugu </option>
<option value='tgl' > Tagalog </option>
<option value='tha' > Thai </option>
<option value='tur' > Turkish </option>
<option value='ukr' > Ukrainian </option>
<option value='vie' > Vietnamese </option>
</select>
<button onclick="recognizeFile('../../tests/assets/images/simple.png')">Sample Image</button>
<input type="file" onchange="recognizeFile(window.lastFile=this.files[0])">
<div id="log"></div>
<style>
#log > div {
color: #313131;
border-top: 1px solid #dadada;
padding: 9px;
display: flex;
}
#log > div:first-child {
border: 0;
}
.status {
min-width: 250px;
}
#log {
border: 1px solid #dadada;
padding: 10px;
margin-top: 20px;
min-height: 100px;
}
body {
font-family: sans-serif;
margin: 30px;
}
progress {
display: block;
width: 100%;
transition: opacity 0.5s linear;
}
progress[value="1"] {
opacity: 0.5;
}
</style>

@ -1,6 +1,6 @@
<html>
<head>
<script src="/dist/tesseract.dev.js"></script>
<script src="/dist/tesseract.min.js"></script>
</head>
<body>
<div>
@ -10,17 +10,15 @@
<textarea id="board" readonly rows="8" cols="80">Upload an image file</textarea>
<script type="module">
const { createWorker } = Tesseract;
const worker = await createWorker({
const worker = await createWorker("eng", 1, {
corePath: '/node_modules/tesseract.js-core',
workerPath: "/dist/worker.dev.js",
workerPath: "/dist/worker.min.js",
logger: m => console.log(m),
});
const uploader = document.getElementById('uploader');
const dlBtn = document.getElementById('download-pdf');
let pdf;
const recognize = async ({ target: { files } }) => {
await worker.loadLanguage('eng');
await worker.initialize('eng');
const res = await worker.recognize(files[0],{pdfTitle: "Example PDF"},{pdf: true});
pdf = res.data.pdf;
const text = res.data.text;

@ -1,7 +1,7 @@
<html>
<head>
<script src="/dist/tesseract.dev.js"></script>
<script src="/dist/tesseract.min.js"></script>
<style>
.column {
float: left;
@ -37,14 +37,10 @@
<script>
const recognize = async ({ target: { files } }) => {
document.getElementById("imgInput").src = URL.createObjectURL(files[0]);
const worker = await Tesseract.createWorker({
const worker = await Tesseract.createWorker("eng", 1, {
// corePath: '/tesseract-core-simd.wasm.js',
workerPath: "/dist/worker.dev.js"
workerPath: "/dist/worker.min.js"
});
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.initialize();
const ret = await worker.recognize(files[0], {rotateAuto: true}, {imageColor: true, imageGrey: true, imageBinary: true});
document.getElementById("imgOriginal").src = ret.data.imageColor;
document.getElementById("imgGrey").src = ret.data.imageGrey;

@ -1,13 +0,0 @@
#!/usr/bin/env node
const path = require('node:path');
const Tesseract = require('../../');
const [,, imagePath] = process.argv;
const image = path.resolve(__dirname, (imagePath || '../../tests/assets/images/cosmic.png'));
console.log(`Recognizing ${image}`);
Tesseract.detect(image, { logger: m => console.log(m) })
.then(({ data }) => {
console.log(data);
});

@ -10,8 +10,6 @@ console.log(`Recognizing ${image}`);
(async () => {
const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text, pdf } } = await worker.recognize(image, {pdfTitle: "Example PDF"}, {pdf: true});
console.log(text);
fs.writeFileSync('tesseract-ocr-result.pdf', Buffer.from(pdf));

@ -21,8 +21,6 @@ const convertImage = (imageSrc) => {
(async () => {
const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { imageColor, imageGrey, imageBinary } } = await worker.recognize(image, {rotateAuto: true}, {imageColor: true, imageGrey: true, imageBinary: true});
console.log('Saving intermediate images: imageColor.png, imageGrey.png, imageBinary.png');

@ -8,11 +8,9 @@ const image = path.resolve(__dirname, (imagePath || '../../tests/assets/images/c
console.log(`Recognizing ${image}`);
(async () => {
const worker = await createWorker({
const worker = await createWorker("eng", 1, {
logger: m => console.log(m),
});
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize(image);
console.log(text);
await worker.terminate();

@ -1,12 +1,19 @@
const { createWorker, createScheduler } = require('../../');
const path = require('path');
const [,, imagePath] = process.argv;
// Note: This example recognizes the same image 4 times in parallel
// to show how schedulers can be used to speed up bulk jobs.
// In actual use you would (obviously) not want to run multiple identical jobs.
const image = path.resolve(__dirname, (imagePath || '../../tests/assets/images/cosmic.png'));
const imageArr = [image, image, image, image];
const scheduler = createScheduler();
// Creates worker and adds to scheduler
const workerGen = async () => {
const worker = await createWorker({cachePath: "."});
await worker.loadLanguage('eng');
await worker.initialize('eng');
const worker = await createWorker("eng", 1, {cachePath: "."});
scheduler.addWorker(worker);
}
@ -14,12 +21,17 @@ const workerN = 4;
(async () => {
const resArr = Array(workerN);
for (let i=0; i<workerN; i++) {
resArr[i] = await workerGen();
resArr[i] = workerGen();
}
await Promise.all(resArr);
/** Add 4 recognition jobs */
const results = await Promise.all(Array(10).fill(0).map(() => (
scheduler.addJob('recognize', 'https://tesseract.projectnaptha.com/img/eng_bw.png').then((x) => console.log(x.data.text))
)))
const resArr2 = Array(imageArr.length);
for (let i = 0; i < imageArr.length; i++) {
resArr2[i] = scheduler.addJob('recognize', image).then((x) => console.log(x.data.text));
}
await Promise.all(resArr2);
await scheduler.terminate(); // It also terminates all workers.
})();

14
package-lock.json generated

@ -17,7 +17,7 @@
"node-fetch": "^2.6.9",
"opencollective-postinstall": "^2.0.3",
"regenerator-runtime": "^0.13.3",
"tesseract.js-core": "^4.0.4",
"tesseract.js-core": "^5.0.0-beta.1",
"wasm-feature-detect": "^1.2.11",
"zlibjs": "^0.3.1"
},
@ -8663,9 +8663,9 @@
}
},
"node_modules/tesseract.js-core": {
"version": "4.0.4",
"resolved": "https://registry.npmjs.org/tesseract.js-core/-/tesseract.js-core-4.0.4.tgz",
"integrity": "sha512-MJ+vtktjAaT0681uPl6TDUPhbRbpD/S9emko5rtorgHRZpQo7R3BG7h+3pVHgn1KjfNf1bvnx4B7KxEK8YKqpg=="
"version": "5.0.0-beta.1",
"resolved": "https://registry.npmjs.org/tesseract.js-core/-/tesseract.js-core-5.0.0-beta.1.tgz",
"integrity": "sha512-lzRLGeNWVwGLi96unpzmYqXshdGWF/IR8LY5Ds+em6twjYQVSQlvpSgJ+2Y5vfxOzbtiFif0gtSZYBqzH4u03w=="
},
"node_modules/test-exclude": {
"version": "6.0.0",
@ -16060,9 +16060,9 @@
}
},
"tesseract.js-core": {
"version": "4.0.4",
"resolved": "https://registry.npmjs.org/tesseract.js-core/-/tesseract.js-core-4.0.4.tgz",
"integrity": "sha512-MJ+vtktjAaT0681uPl6TDUPhbRbpD/S9emko5rtorgHRZpQo7R3BG7h+3pVHgn1KjfNf1bvnx4B7KxEK8YKqpg=="
"version": "5.0.0-beta.1",
"resolved": "https://registry.npmjs.org/tesseract.js-core/-/tesseract.js-core-5.0.0-beta.1.tgz",
"integrity": "sha512-lzRLGeNWVwGLi96unpzmYqXshdGWF/IR8LY5Ds+em6twjYQVSQlvpSgJ+2Y5vfxOzbtiFif0gtSZYBqzH4u03w=="
},
"test-exclude": {
"version": "6.0.0",

@ -12,7 +12,7 @@
"profile:tesseract": "webpack-bundle-analyzer dist/tesseract-stats.json",
"profile:worker": "webpack-bundle-analyzer dist/worker-stats.json",
"prepublishOnly": "npm run build",
"wait": "rimraf dist && wait-on http://localhost:3000/dist/tesseract.dev.js",
"wait": "rimraf dist && wait-on http://localhost:3000/dist/tesseract.min.js",
"test": "npm-run-all -p -r start test:all",
"test:all": "npm-run-all wait test:browser:* test:node:all",
"test:node": "nyc mocha --exit --bail --require ./scripts/test-helper.js",
@ -69,7 +69,7 @@
"node-fetch": "^2.6.9",
"opencollective-postinstall": "^2.0.3",
"regenerator-runtime": "^0.13.3",
"tesseract.js-core": "^4.0.4",
"tesseract.js-core": "^5.0.0",
"wasm-feature-detect": "^1.2.11",
"zlibjs": "^0.3.1"
},

@ -3,7 +3,7 @@ const middleware = require('webpack-dev-middleware');
const express = require('express');
const path = require('node:path');
const cors = require('cors');
const webpackConfig = require('./webpack.config.dev');
const webpackConfig = require('./webpack.config.prod');
const compiler = webpack(webpackConfig);
const app = express();

@ -1,49 +0,0 @@
const path = require('node:path');
const webpack = require('webpack');
const { BundleAnalyzerPlugin } = require('webpack-bundle-analyzer');
const common = require('./webpack.config.common');
const genConfig = ({
entry, filename, library, libraryTarget,
}) => ({
...common,
mode: 'development',
devtool: 'source-map',
entry,
output: {
filename,
library,
libraryTarget,
},
plugins: [
new webpack.ProvidePlugin({
Buffer: ['buffer', 'Buffer'],
}),
new webpack.DefinePlugin({
'process.env': {
TESS_ENV: JSON.stringify('development'),
},
}),
new BundleAnalyzerPlugin({
analyzerMode: 'disable',
statsFilename: `${filename.split('.')[0]}-stats.json`,
generateStatsFile: true
}),
],
devServer: {
allowedHosts: ['localhost', '.gitpod.io'],
},
});
module.exports = [
genConfig({
entry: path.resolve(__dirname, '..', 'src', 'index.js'),
filename: 'tesseract.dev.js',
library: 'Tesseract',
libraryTarget: 'umd',
}),
genConfig({
entry: path.resolve(__dirname, '..', 'src', 'worker-script', 'browser', 'index.js'),
filename: 'worker.dev.js',
}),
];

@ -1,9 +1,7 @@
const createWorker = require('./createWorker');
const recognize = async (image, langs, options) => {
const worker = await createWorker(options);
await worker.loadLanguage(langs);
await worker.initialize(langs);
const worker = await createWorker(langs, 1, options);
return worker.recognize(image)
.finally(async () => {
await worker.terminate();
@ -11,9 +9,7 @@ const recognize = async (image, langs, options) => {
};
const detect = async (image, options) => {
const worker = await createWorker(options);
await worker.loadLanguage('osd');
await worker.initialize('osd');
const worker = await createWorker('osd', 0, options);
return worker.detect(image)
.finally(async () => {
await worker.terminate();

@ -1,5 +0,0 @@
const OEM = require('./OEM');
module.exports = {
defaultOEM: OEM.DEFAULT,
};

@ -1,8 +1,4 @@
module.exports = {
/*
* default path for downloading *.traineddata
*/
langPath: 'https://tessdata.projectnaptha.com/4.0.0',
/*
* Use BlobURL for worker script by default
* TODO: remove this option

@ -3,7 +3,7 @@ const circularize = require('./utils/circularize');
const createJob = require('./createJob');
const { log } = require('./utils/log');
const getId = require('./utils/getId');
const { defaultOEM } = require('./constants/config');
const OEM = require('./constants/OEM');
const {
defaultOptions,
spawnWorker,
@ -15,7 +15,7 @@ const {
let workerCounter = 0;
module.exports = async (_options = {}) => {
module.exports = async (langs = 'eng', oem = OEM.LSTM_ONLY, _options = {}, config = {}) => {
const id = getId('Worker', workerCounter);
const {
logger,
@ -28,6 +28,13 @@ module.exports = async (_options = {}) => {
const resolves = {};
const rejects = {};
// Current langs, oem, and config file.
// Used if the user ever re-initializes the worker using `worker.reinitialize`.
const currentLangs = typeof langs === 'string' ? langs.split('+') : langs;
let currentOem = oem;
let currentConfig = config;
const lstmOnlyCore = [OEM.DEFAULT, OEM.LSTM_ONLY].includes(oem) && !options.legacyCore;
let workerResReject;
let workerResResolve;
const workerRes = new Promise((resolve, reject) => {
@ -69,7 +76,7 @@ module.exports = async (_options = {}) => {
const loadInternal = (jobId) => (
startJob(createJob({
id: jobId, action: 'load', payload: { options },
id: jobId, action: 'load', payload: { options: { lstmOnly: lstmOnlyCore, corePath: options.corePath, logging: options.logging } },
}))
);
@ -105,22 +112,62 @@ module.exports = async (_options = {}) => {
}))
);
const loadLanguage = (langs = 'eng', jobId) => (
startJob(createJob({
id: jobId,
action: 'loadLanguage',
payload: { langs, options },
}))
const loadLanguage = () => (
console.warn('`loadLanguage` is deprecated and should be removed from code (workers now come with language pre-loaded)')
);
const initialize = (langs = 'eng', oem = defaultOEM, config, jobId) => (
const loadLanguageInternal = (_langs, jobId) => startJob(createJob({
id: jobId,
action: 'loadLanguage',
payload: {
langs: _langs,
options: {
langPath: options.langPath,
dataPath: options.dataPath,
cachePath: options.cachePath,
cacheMethod: options.cacheMethod,
gzip: options.gzip,
lstmOnly: [OEM.DEFAULT, OEM.LSTM_ONLY].includes(currentOem)
&& !options.legacyLang,
},
},
}));
const initialize = () => (
console.warn('`initialize` is deprecated and should be removed from code (workers now come pre-initialized)')
);
const initializeInternal = (_langs, _oem, _config, jobId) => (
startJob(createJob({
id: jobId,
action: 'initialize',
payload: { langs, oem, config },
payload: { langs: _langs, oem: _oem, config: _config },
}))
);
const reinitialize = (langs = 'eng', oem, config, jobId) => { // eslint-disable-line
if (lstmOnlyCore && [OEM.TESSERACT_ONLY, OEM.TESSERACT_LSTM_COMBINED].includes(oem)) throw Error('Legacy model requested but code missing.');
const _oem = oem || currentOem;
currentOem = _oem;
const _config = config || currentConfig;
currentConfig = _config;
// Only load langs that are not already loaded.
// This logic fails if the user downloaded the LSTM-only data for a language
// and then uses `worker.reinitialize` to switch to the Legacy engine.
// However, the correct data will still be downloaded after initialization fails
// and this can be avoided entirely
const langsArr = typeof langs === 'string' ? langs.split('+') : langs;
const _langs = langsArr.filter((x) => !currentLangs.includes(x));
currentLangs.push(..._langs);
return loadLanguageInternal(_langs, jobId)
.then(() => initializeInternal(langs, _oem, _config, jobId));
};
const setParameters = (params = {}, jobId) => (
startJob(createJob({
id: jobId,
@ -148,13 +195,15 @@ module.exports = async (_options = {}) => {
}));
};
const detect = async (image, jobId) => (
startJob(createJob({
const detect = async (image, jobId) => {
if (lstmOnlyCore) throw Error('`worker.detect` requires Legacy model, which was not loaded.');
return startJob(createJob({
id: jobId,
action: 'detect',
payload: { image: await loadImage(image) },
}))
);
}));
};
const terminate = async () => {
if (worker !== null) {
@ -207,6 +256,7 @@ module.exports = async (_options = {}) => {
FS,
loadLanguage,
initialize,
reinitialize,
setParameters,
recognize,
getPDF,
@ -214,7 +264,11 @@ module.exports = async (_options = {}) => {
terminate,
};
loadInternal().then(() => workerResResolve(resolveObj)).catch(() => {});
loadInternal()
.then(() => loadLanguageInternal(langs))
.then(() => initializeInternal(langs, oem, config))
.then(() => workerResResolve(resolveObj))
.catch(() => {});
return workerRes;
};
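
Note: the comment in `reinitialize` says that only languages not already loaded should be fetched. A minimal standalone sketch of that filtering step is below. `langsToLoad` is a hypothetical helper written for illustration, not a function in the library, and the language codes are only example inputs.

```javascript
// Hypothetical helper (not part of tesseract.js): returns the languages in
// `requested` that are not yet present in `loaded`.
const langsToLoad = (requested, loaded) => {
  // Accept either 'eng+osd' or ['eng', 'osd'], as the worker code does.
  const requestedArr = typeof requested === 'string' ? requested.split('+') : requested;
  return requestedArr.filter((lang) => !loaded.includes(lang));
};

const loaded = ['eng'];
const missing = langsToLoad('eng+chi_tra', loaded);
// Spread the new entries so that `loaded` stays a flat array of strings.
loaded.push(...missing);

console.log(missing); // [ 'chi_tra' ]
console.log(loaded); // [ 'eng', 'chi_tra' ]
```

Pushing the array itself (without the spread) would nest it inside `loaded`, and later `includes` checks against language codes would then miss those entries.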

7
src/index.d.ts vendored

@ -1,6 +1,6 @@
declare namespace Tesseract {
function createScheduler(): Scheduler
function createWorker(options?: Partial<WorkerOptions>): Promise<Worker>
function createWorker(langs?: string | Lang[], oem?: OEM, options?: Partial<WorkerOptions>, config?: string | Partial<InitOptions>): Promise<Worker>
function setLogging(logging: boolean): void
function recognize(image: ImageLike, langs?: string, options?: Partial<WorkerOptions>): Promise<RecognizeResult>
function detect(image: ImageLike, options?: Partial<WorkerOptions>): any
@ -20,8 +20,7 @@ declare namespace Tesseract {
readText(path: string, jobId?: string): Promise<ConfigResult>
removeText(path: string, jobId?: string): Promise<ConfigResult>
FS(method: string, args: any[], jobId?: string): Promise<ConfigResult>
loadLanguage(langs?: string | Lang[], jobId?: string): Promise<ConfigResult>
initialize(langs?: string | Lang[], oem?: OEM, config?: string | Partial<InitOptions>, jobId?: string): Promise<ConfigResult>
reinitialize(langs?: string | Lang[], oem?: OEM, config?: string | Partial<InitOptions>, jobId?: string): Promise<ConfigResult>
setParameters(params: Partial<WorkerParams>, jobId?: string): Promise<ConfigResult>
getImage(type: imageType): string
recognize(image: ImageLike, options?: Partial<RecognizeOptions>, output?: Partial<OutputFormats>, jobId?: string): Promise<RecognizeResult>
@ -61,6 +60,8 @@ declare namespace Tesseract {
cacheMethod: string
workerBlobURL: boolean
gzip: boolean
legacyLang: boolean
legacyCore: boolean
logger: (arg: LoggerMessage) => void,
errorHandler: (arg: any) => void
}

@ -1,9 +1,11 @@
const { simd } = require('wasm-feature-detect');
const { dependencies } = require('../../../package.json');
module.exports = async (corePath, res) => {
module.exports = async (lstmOnly, corePath, res) => {
if (typeof global.TesseractCore === 'undefined') {
res.progress({ status: 'loading tesseract core', progress: 0 });
const statusText = 'loading tesseract core';
res.progress({ status: statusText, progress: 0 });
// If the user specifies a core path, we use that
// Otherwise, default to CDN
@ -19,7 +21,13 @@ module.exports = async (corePath, res) => {
} else {
const simdSupport = await simd();
if (simdSupport) {
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core-simd.wasm.js`;
if (lstmOnly) {
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core-simd-lstm.wasm.js`;
} else {
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core-simd.wasm.js`;
}
} else if (lstmOnly) {
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core-lstm.wasm.js`;
} else {
corePathImportFile = `${corePathImport.replace(/\/$/, '')}/tesseract-core.wasm.js`;
}
@ -36,7 +44,7 @@ module.exports = async (corePath, res) => {
} else if (typeof global.TesseractCore === 'undefined') {
throw Error('Failed to load TesseractCore');
}
res.progress({ status: 'loading tesseract core', progress: 1 });
res.progress({ status: statusText, progress: 1 });
}
return global.TesseractCore;
};
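
Note: the nested branches above choose one of four core builds from two flags (SIMD support and LSTM-only). The same mapping can be sketched as a small pure function. `coreFileName` and `coreUrl` are hypothetical names used for illustration only, and `https://example.com/core/` is a placeholder path; the file names are the four that appear in this diff.

```javascript
// Hypothetical helper: builds the core script name from the two flags.
const coreFileName = (simdSupport, lstmOnly) => {
  const simd = simdSupport ? '-simd' : '';
  const lstm = lstmOnly ? '-lstm' : '';
  return `tesseract-core${simd}${lstm}.wasm.js`;
};

// Joins a base path and the file name, stripping one trailing slash
// in the same way as the `corePathImport.replace(/\/$/, '')` calls.
const coreUrl = (corePath, simdSupport, lstmOnly) => (
  `${corePath.replace(/\/$/, '')}/${coreFileName(simdSupport, lstmOnly)}`
);

console.log(coreFileName(true, true)); // tesseract-core-simd-lstm.wasm.js
console.log(coreFileName(false, false)); // tesseract-core.wasm.js
console.log(coreUrl('https://example.com/core/', false, true));
```

This is only a compact restatement of the branch logic; the diff itself keeps explicit branches, which some readers may find easier to scan.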

@ -28,15 +28,19 @@ let api = null;
let latestJob;
let adapter = {};
let params = defaultParams;
let cachePathWorker;
let cacheMethodWorker;
let loadLanguageLangsWorker;
let loadLanguageOptionsWorker;
let dataFromCache = false;
const load = async ({ workerId, jobId, payload: { options: { corePath, logging } } }, res) => {
const load = async ({ workerId, jobId, payload: { options: { lstmOnly, corePath, logging } } }, res) => { // eslint-disable-line max-len
setLogging(logging);
const statusText = 'initializing tesseract';
if (!TessModule) {
const Core = await adapter.getCore(corePath, res);
const Core = await adapter.getCore(lstmOnly, corePath, res);
res.progress({ workerId, status: 'initializing tesseract', progress: 0 });
res.progress({ workerId, status: statusText, progress: 0 });
Core({
TesseractProgress(percent) {
@ -49,7 +53,7 @@ const load = async ({ workerId, jobId, payload: { options: { corePath, logging }
},
}).then((tessModule) => {
TessModule = tessModule;
res.progress({ workerId, status: 'initialized tesseract', progress: 1 });
res.progress({ workerId, status: statusText, progress: 1 });
res.resolve({ loaded: true });
});
} else {
@ -72,13 +76,26 @@ const loadLanguage = async ({
cachePath,
cacheMethod,
gzip = true,
lstmOnly,
},
},
},
res) => {
// Remember cache options for later, as cache may be deleted if `initialize` fails
cachePathWorker = cachePath;
cacheMethodWorker = cacheMethod;
// Remember options for later, as cache may be deleted if `initialize` fails
loadLanguageLangsWorker = langs;
loadLanguageOptionsWorker = {
langPath,
dataPath,
cachePath,
cacheMethod,
gzip,
lstmOnly,
};
const statusText = 'loading language traineddata';
const langsArr = typeof langs === 'string' ? langs.split('+') : langs;
let progress = 0;
const loadAndGunzipFile = async (_lang) => {
const lang = typeof _lang === 'string' ? _lang : _lang.code;
@ -94,8 +111,8 @@ res) => {
const _data = await readCache(`${cachePath || '.'}/${lang}.traineddata`);
if (typeof _data !== 'undefined') {
log(`[${workerId}]: Load ${lang}.traineddata from cache`);
res.progress({ workerId, status: 'loading language traineddata (from cache)', progress: 0.5 });
data = _data;
dataFromCache = true;
} else {
throw Error('Not found in cache');
}
@ -106,14 +123,19 @@ res) => {
if (typeof _lang === 'string') {
let path = null;
// If `langPath` is not explicitly set by the user, the jsdelivr CDN is used.
// Data supporting the Legacy model is only included if `lstmOnly` is not true.
// This saves a significant amount of data for the majority of users that use LSTM only.
const langPathDownload = langPath || (lstmOnly ? `https://cdn.jsdelivr.net/npm/@tesseract.js-data/${lang}/4.0.0_best_int` : `https://cdn.jsdelivr.net/npm/@tesseract.js-data/${lang}/4.0.0`);
// For Node.js, langPath may be a URL or local file path
// The is-url package is used to tell the difference
// For the browser version, langPath is assumed to be a URL
if (env !== 'node' || isURL(langPath) || langPath.startsWith('moz-extension://') || langPath.startsWith('chrome-extension://') || langPath.startsWith('file://')) { /** When langPath is an URL */
path = langPath.replace(/\/$/, '');
if (env !== 'node' || isURL(langPathDownload) || langPathDownload.startsWith('moz-extension://') || langPathDownload.startsWith('chrome-extension://') || langPathDownload.startsWith('file://')) { /** When langPathDownload is a URL */
path = langPathDownload.replace(/\/$/, '');
}
// langPath is a URL, fetch from server
// langPathDownload is a URL, fetch from server
if (path !== null) {
const fetchUrl = `${path}/${lang}.traineddata${gzip ? '.gz' : ''}`;
const resp = await (env === 'webworker' ? fetch : adapter.fetch)(fetchUrl);
@ -122,16 +144,19 @@ res) => {
}
data = new Uint8Array(await resp.arrayBuffer());
// langPath is a local file, read .traineddata from local filesystem
// langPathDownload is a local file, read .traineddata from local filesystem
// (adapter.readCache is a generic file read function in Node.js version)
} else {
data = await adapter.readCache(`${langPath}/${lang}.traineddata${gzip ? '.gz' : ''}`);
data = await adapter.readCache(`${langPathDownload}/${lang}.traineddata${gzip ? '.gz' : ''}`);
}
} else {
data = _lang.data; // eslint-disable-line
}
}
progress += 0.5 / langsArr.length;
if (res) res.progress({ workerId, status: statusText, progress });
// Check for gzip magic numbers (1F and 8B in hex)
const isGzip = (data[0] === 31 && data[1] === 139) || (data[1] === 31 && data[0] === 139);
@ -144,7 +169,7 @@ res) => {
try {
TessModule.FS.mkdir(dataPath);
} catch (err) {
res.reject(err.toString());
if (res) res.reject(err.toString());
}
}
TessModule.FS.writeFile(`${dataPath || '.'}/${lang}.traineddata`, data);
@ -158,16 +183,19 @@ res) => {
log(err.toString());
}
}
return Promise.resolve();
progress += 0.5 / langsArr.length;
// Make sure last progress message is 1 (not 0.9999)
if (Math.round(progress * 100) === 100) progress = 1;
if (res) res.progress({ workerId, status: statusText, progress });
};
res.progress({ workerId, status: 'loading language traineddata', progress: 0 });
if (res) res.progress({ workerId, status: statusText, progress: 0 });
try {
await Promise.all((typeof langs === 'string' ? langs.split('+') : langs).map(loadAndGunzipFile));
res.progress({ workerId, status: 'loaded language traineddata', progress: 1 });
res.resolve(langs);
await Promise.all(langsArr.map(loadAndGunzipFile));
if (res) res.resolve(langs);
} catch (err) {
res.reject(err.toString());
if (res) res.reject(err.toString());
}
};
@ -208,9 +236,11 @@ const initialize = async ({
? _langs
: _langs.map((l) => ((typeof l === 'string') ? l : l.data)).join('+');
const statusText = 'initializing api';
try {
res.progress({
workerId, status: 'initializing api', progress: 0,
workerId, status: statusText, progress: 0,
});
if (api !== null) {
api.End();
@ -230,22 +260,55 @@ const initialize = async ({
}
api = new TessModule.TessBaseAPI();
const status = api.Init(null, langs, oem);
let status = api.Init(null, langs, oem);
if (status === -1) {
// Cache is deleted if initialization fails to avoid keeping bad data in cache
// This assumes that initialization failing only occurs due to bad .traineddata,
// this should be refined if other reasons for init failing are encountered.
if (['write', 'refresh', undefined].includes(cacheMethodWorker)) {
// The "if" condition skips this section if either (1) cache is disabled [so the issue
// is definitely unrelated to cached data] or (2) cache is set to read-only
// [so we do not have permission to make any changes].
if (['write', 'refresh', undefined].includes(loadLanguageOptionsWorker.cacheMethod)) {
const langsArr = langs.split('+');
const delCachePromise = langsArr.map((lang) => adapter.deleteCache(`${cachePathWorker || '.'}/${lang}.traineddata`));
const delCachePromise = langsArr.map((lang) => adapter.deleteCache(`${loadLanguageOptionsWorker.cachePath || '.'}/${lang}.traineddata`));
await Promise.all(delCachePromise);
// Check for the case when (1) data was loaded from the cache and
// (2) the data does not support the requested OEM.
// In this case, loadLanguage is re-run and initialization is attempted a second time.
// This is because `loadLanguage` has no mechanism for checking whether the cached data
// supports the requested model, so this only becomes apparent when initialization fails.
// Check for this error message:
// eslint-disable-next-line
// "Tesseract (legacy) engine requested, but components are not present in ./eng.traineddata!!""
// The .wasm build of Tesseract saves this message in a separate file
// (in addition to the normal debug file location).
const debugStr = TessModule.FS.readFile('/debugDev.txt', { encoding: 'utf8', flags: 'a+' });
if (dataFromCache && /components are not present/.test(debugStr)) {
log('Data from cache missing requested OEM model. Attempting to refresh cache with new language data.');
// In this case, language data is re-loaded
await loadLanguage({ workerId, payload: { langs: loadLanguageLangsWorker, options: loadLanguageOptionsWorker } }); // eslint-disable-line max-len
status = api.Init(null, langs, oem);
if (status === -1) {
log('Language data refresh failed.');
const delCachePromise2 = langsArr.map((lang) => adapter.deleteCache(`${loadLanguageOptionsWorker.cachePath || '.'}/${lang}.traineddata`));
await Promise.all(delCachePromise2);
} else {
log('Language data refresh successful.');
}
}
}
}
if (status === -1) {
res.reject('initialization failed');
}
params = defaultParams;
await setParameters({ payload: { params } });
res.progress({
workerId, status: 'initialized api', progress: 1,
workerId, status: statusText, progress: 1,
});
res.resolve();
} catch (err) {
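
Note: in `loadLanguage` above, the download URL depends on `langPath`, `lstmOnly`, and `gzip`. A rough sketch of how the final fetch URL is assembled follows. `trainedDataUrl` is a hypothetical helper, and defaulting `lstmOnly` to `true` here is an assumption made for the example; the CDN paths and the `gzip = true` default are taken from the diff.

```javascript
// Hypothetical helper mirroring the default-URL logic in `loadLanguage`.
const trainedDataUrl = (lang, { langPath, lstmOnly = true, gzip = true } = {}) => {
  // Without a user-supplied langPath, the jsdelivr CDN is used; the
  // LSTM-only data lives under a different version directory.
  const defaultBase = lstmOnly
    ? `https://cdn.jsdelivr.net/npm/@tesseract.js-data/${lang}/4.0.0_best_int`
    : `https://cdn.jsdelivr.net/npm/@tesseract.js-data/${lang}/4.0.0`;
  // Strip one trailing slash before joining.
  const base = (langPath || defaultBase).replace(/\/$/, '');
  return `${base}/${lang}.traineddata${gzip ? '.gz' : ''}`;
};

console.log(trainedDataUrl('eng'));
console.log(trainedDataUrl('osd', { lstmOnly: false }));
console.log(trainedDataUrl('eng', { langPath: 'http://localhost:3000/tests/assets/traineddata/', gzip: false }));
```

The sketch covers only the URL case; for Node.js the worker script also accepts a local directory as `langPath` and reads the file from disk instead of fetching it.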

@ -1,20 +1,29 @@
const { simd } = require('wasm-feature-detect');
const OEM = require('../../constants/OEM');
let TesseractCore = null;
/*
* getCore is an async function that loads and returns
* TesseractCore.
*/
module.exports = async (_, res) => {
module.exports = async (lstmOnly, _, res) => {
if (TesseractCore === null) {
const statusText = 'loading tesseract core';
const simdSupport = await simd();
res.progress({ status: 'loading tesseract core', progress: 0 });
res.progress({ status: statusText, progress: 0 });
if (simdSupport) {
TesseractCore = require('tesseract.js-core/tesseract-core-simd');
if (lstmOnly) {
TesseractCore = require('tesseract.js-core/tesseract-core-simd-lstm');
} else {
TesseractCore = require('tesseract.js-core/tesseract-core-simd');
}
} else if (lstmOnly) {
TesseractCore = require('tesseract.js-core/tesseract-core-lstm');
} else {
TesseractCore = require('tesseract.js-core/tesseract-core');
}
res.progress({ status: 'loaded tesseract core', progress: 1 });
res.progress({ status: statusText, progress: 1 });
}
return TesseractCore;
};

@ -1,4 +1,3 @@
const resolveURL = (s) => (new URL(s, window.location.href)).href;
const { version } = require('../../../package.json');
const defaultOptions = require('../../constants/defaultOptions');
@ -7,12 +6,5 @@ const defaultOptions = require('../../constants/defaultOptions');
*/
module.exports = {
...defaultOptions,
workerPath: (typeof process !== 'undefined' && process.env.TESS_ENV === 'development')
? resolveURL(`/dist/worker.dev.js?nocache=${Math.random().toString(36).slice(3)}`)
: `https://cdn.jsdelivr.net/npm/tesseract.js@v${version}/dist/worker.min.js`,
/*
* If browser doesn't support WebAssembly,
* load ASM version instead
*/
corePath: null,
workerPath: `https://cdn.jsdelivr.net/npm/tesseract.js@v${version}/dist/worker.min.js`,
};

@ -7,7 +7,7 @@
<div id="mocha"></div>
<script src="../node_modules/mocha/mocha.js"></script>
<script src="../node_modules/expect.js/index.js"></script>
<script src="../dist/tesseract.dev.js"></script>
<script src="../dist/tesseract.min.js"></script>
<script src="./constants.js"></script>
<script>mocha.setup('bdd');</script>
<script src="./FS.test.js"></script>

@ -3,7 +3,7 @@ const FS_WAIT = 500;
let worker;
before(async function cb() {
this.timeout(0);
worker = await createWorker(OPTIONS);
worker = await createWorker("eng", 1, OPTIONS);
});
describe('FS', async () => {

@ -6,14 +6,14 @@ const OPTIONS = {
langPath: 'http://localhost:3000/tests/assets/traineddata',
cachePath: './tests/assets/traineddata',
corePath: '../node_modules/tesseract.js-core/tesseract-core.wasm.js',
...(IS_BROWSER ? { workerPath: '../dist/worker.dev.js' } : {}),
...(IS_BROWSER ? { workerPath: '../dist/worker.min.js' } : {}),
};
const SIMPLE_TEXT = 'Tesseract.js\n';
const SIMPLE_TEXT_HALF = 'Tesse\n';
const COMSIC_TEXT = 'HellO World\nfrom beyond\nthe Cosmic Void\n';
const TESTOCR_TEXT = 'This is a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format.\n\nThe quick brown dog jumped over the\nlazy fox. The quick brown dog jumped\nover the lazy fox. The quick brown dog\njumped over the lazy fox. The quick\nbrown dog jumped over the lazy fox.\n';
const CHINESE_TEXT = '繁 體 中 文 測 試\n';
const BILL_SPACED_TEXT = 'FIRST CHEQUING\n\nLine of Credit 100,000.00 Rate 4.2000\n\nDate Description Number Debits Credits Balance\n31Jul2018 Balance Forward 99,878.08 -\n01Aug2018 Clearing Cheque 4987 36.07 99,914.15 -\n01Aug2018 Clearing Cheque 4986 60.93 99,975.08 -\n01Aug2018 Clearing Cheque 4982 800.04 100,775.12 EX\n01Aug2018 Clearing Cheque 4981 823.34 101,598.46 EX\n01Aug2018 Incoming Interac e-Transfer 1454 101,583.92 EX\n01Aug2018 Incoming Interac e-Transfer 400.00 101,183.92 EX\n01Aug2018 Assisted Deposit 3241450 68,769.42 -\n01Aug2018 Transfer out to loan 7 1,500.00 70,269.42 -\n02Aug2018 Clearing Cheque 4984 48.08 70,317.50 -\n02Aug2018 Clearing Cheque 4985 7051 70,388.01 -\n02Aug2018 Clearing Cheque 4992 500.00 70.888.01 -\n';
const BILL_SPACED_TEXT = 'FIRST CHEQUING\n\nLine of Credit 100,000.00 Rate 4.2000\n\nDate Description Number Debits Credits Balance\n31Jul2018 Balance Forward 99,878.08 -\n01Aug2018 Clearing Cheque 4987 36.07 99,914.15 -\n01Aug2018 Clearing Cheque 4986 60.93 99,975.08 -\n01Aug2018 Clearing Cheque 4982 800.04 100,775.12 EX\n01Aug2018 Clearing Cheque 4981 823.34 101,598.46 EX\n01Aug2018 Incoming Interac e-Transfer 1454 101,583.92 EX\n01Aug2018 Incoming Interac e-Transfer 400.00 101,183.92 EX\n01Aug2018 Assisted Deposit 3241450 68,769.42 -\n01Aug2018 Transfer out to loan 7 1,500.00 70,269.42 -\n02Aug2018 Clearing Cheque 4984 48.08 70,317.50 -\n02Aug2018 Clearing Cheque 4985 7051 70,388.01 -\n02Aug2018 Clearing Cheque 4992 500.00 70,888.01 -\n';
const SIMPLE_WHITELIST_TEXT = 'Tesses\n';
const FORMATS = ['png', 'jpg', 'bmp', 'pbm', 'webp', 'gif'];
const SIMPLE_PNG_BASE64 = '';

@ -7,7 +7,7 @@
<div id="mocha"></div>
<script src="../node_modules/mocha/mocha.js"></script>
<script src="../node_modules/expect.js/index.js"></script>
<script src="../dist/tesseract.dev.js"></script>
<script src="../dist/tesseract.min.js"></script>
<script src="./constants.js"></script>
<script>mocha.setup('bdd');</script>
<script src="./detect.test.js"></script>

@ -2,7 +2,7 @@ const { createWorker } = Tesseract;
let worker;
before(async function cb() {
this.timeout(0);
worker = await createWorker(OPTIONS);
worker = await createWorker("osd", 0, OPTIONS);
});
describe('detect()', async () => {
@ -10,8 +10,6 @@ describe('detect()', async () => {
[
{ name: 'cosmic.png', ans: { script: 'Latin' } },
].forEach(async ({ name, ans: { script } }) => {
await worker.loadLanguage('osd');
await worker.initialize('osd');
const { data: { script: s } } = await worker.detect(`${IMAGE_PATH}/${name}`);
expect(s).to.be(script);
});

@ -7,7 +7,7 @@
<div id="mocha"></div>
<script src="../node_modules/mocha/mocha.js"></script>
<script src="../node_modules/expect.js/index.js"></script>
<script src="../dist/tesseract.dev.js"></script>
<script src="../dist/tesseract.min.js"></script>
<script src="./constants.js"></script>
<script>mocha.setup('bdd');</script>
<script src="./recognize.test.js"></script>

@ -2,15 +2,14 @@ const { createWorker, PSM } = Tesseract;
let worker;
before(async function cb() {
this.timeout(0);
worker = await createWorker(OPTIONS);
await worker.loadLanguage('eng+chi_tra+osd');
worker = await createWorker("eng+chi_tra+osd", 1, OPTIONS);
});
describe('recognize()', () => {
describe('should read bmp, jpg, png and pbm format images', () => {
FORMATS.forEach(format => (
it(`support ${format} format`, async () => {
await worker.initialize('eng');
await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(`${IMAGE_PATH}/simple.${format}`);
expect(text).to.be(SIMPLE_TEXT);
}).timeout(TIMEOUT)
@ -23,7 +22,7 @@ describe('recognize()', () => {
{ format: 'jpg', image: SIMPLE_JPG_BASE64, ans: SIMPLE_TEXT },
].forEach(({ format, image, ans }) => (
it(`recognize ${format} in base64`, async () => {
await worker.initialize('eng');
await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(image);
expect(text).to.be(ans);
}).timeout(TIMEOUT)
@ -37,7 +36,7 @@ describe('recognize()', () => {
{ name: 'simple-270.jpg', desc: 'simple', ans: SIMPLE_TEXT },
].forEach(({ name, desc, ans }) => (
it(`recognize ${desc} image`, async () => {
await worker.initialize('eng');
await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(`${IMAGE_PATH}/${name}`);
expect(text).to.be(ans);
}).timeout(TIMEOUT)
@ -62,7 +61,7 @@ describe('recognize()', () => {
{ name: 'chinese.png', lang: 'chi_tra', ans: CHINESE_TEXT },
].forEach(({ name, lang, ans }) => (
it(`recognize ${lang}`, async () => {
await worker.initialize(lang);
await worker.reinitialize(lang);
const { data: { text } } = await worker.recognize(`${IMAGE_PATH}/${name}`);
expect(text).to.be(ans);
}).timeout(TIMEOUT)
@ -76,7 +75,7 @@ describe('recognize()', () => {
{ name: 'testocr.png', desc: 'large', ans: TESTOCR_TEXT },
].forEach(({ name, desc, ans }) => (
it(`recognize ${desc} image`, async () => {
await worker.initialize('eng');
await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(`${IMAGE_PATH}/${name}`);
expect(text).to.be(ans);
}).timeout(TIMEOUT)
@ -92,7 +91,7 @@ describe('recognize()', () => {
name, left, top, width, height, ans,
}) => (
it(`recognize half ${name}`, async () => {
await worker.initialize('eng');
await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(
`${IMAGE_PATH}/${name}`,
{
@ -108,7 +107,7 @@ describe('recognize()', () => {
describe('should work with selected parameters', () => {
it('support preserve_interword_spaces', async () => {
await worker.initialize('eng');
await worker.reinitialize('eng');
await worker.setParameters({
preserve_interword_spaces: '1',
});
@ -117,7 +116,7 @@ describe('recognize()', () => {
}).timeout(TIMEOUT);
it('support tessedit_char_whitelist', async () => {
await worker.initialize('eng');
await worker.reinitialize('eng');
await worker.setParameters({
tessedit_char_whitelist: 'Tess',
});
@ -132,7 +131,7 @@ describe('recognize()', () => {
.map(name => ({ name, mode: PSM[name] }))
.forEach(({ name, mode }) => (
it(`support PSM.${name} mode`, async () => {
await worker.initialize('eng');
await worker.reinitialize('eng');
await worker.setParameters({
tessedit_pageseg_mode: mode,
});
@ -146,7 +145,7 @@ describe('recognize()', () => {
FORMATS.forEach(format => (
it(`support ${format} format`, async () => {
const buf = fs.readFileSync(path.join(__dirname, 'assets', 'images', `simple.${format}`));
await worker.initialize('eng');
await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(buf);
expect(text).to.be(SIMPLE_TEXT);
}).timeout(TIMEOUT)
@ -158,7 +157,7 @@ describe('recognize()', () => {
it(`support ${format} format`, async () => {
const imageDOM = document.createElement('img');
imageDOM.setAttribute('src', `${IMAGE_PATH}/simple.${format}`);
await worker.initialize('eng');
await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(imageDOM);
expect(text).to.be(SIMPLE_TEXT);
}).timeout(TIMEOUT)
@ -170,7 +169,7 @@ describe('recognize()', () => {
it(`support ${format} format`, async () => {
const videoDOM = document.createElement('video');
videoDOM.setAttribute('poster', `${IMAGE_PATH}/simple.${format}`);
await worker.initialize('eng');
await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(videoDOM);
expect(text).to.be(SIMPLE_TEXT);
}).timeout(TIMEOUT)
@ -202,7 +201,7 @@ describe('recognize()', () => {
formats.forEach(format => (
it(`support ${format} format`, async () => {
await worker.initialize('eng');
await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(canvasDOM);
expect(text).to.be(SIMPLE_TEXT);
}).timeout(TIMEOUT)
@ -234,7 +233,7 @@ describe('recognize()', () => {
formats.forEach(format => (
it(`support ${format} format`, async () => {
await worker.initialize('eng');
await worker.reinitialize('eng');
const { data: { text } } = await worker.recognize(offscreenCanvas);
expect(text).to.be(SIMPLE_TEXT);
}).timeout(TIMEOUT)

@ -7,7 +7,7 @@
<div id="mocha"></div>
<script src="../node_modules/mocha/mocha.js"></script>
<script src="../node_modules/expect.js/index.js"></script>
<script src="../dist/tesseract.dev.js"></script>
<script src="../dist/tesseract.min.js"></script>
<script src="./constants.js"></script>
<script>mocha.setup('bdd');</script>
<script src="./scheduler.test.js"></script>

@ -7,10 +7,7 @@ before(async function cb() {
const NUM_WORKERS = 5;
console.log(`Initializing ${NUM_WORKERS} workers`);
workers = await Promise.all(Array(NUM_WORKERS).fill(0).map(async () => {
const w = await createWorker(OPTIONS);
await w.loadLanguage('eng');
await w.initialize('eng');
return w;
return await createWorker("eng", 1, OPTIONS);
}));
console.log(`Initialized ${NUM_WORKERS} workers`);
});
