Tesseract.js wraps a [webassembly port](https://github.com/naptha/tesseract.js-core) of the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR Engine.
Tesseract.js wraps a [webassembly port](https://github.com/naptha/tesseract.js-core) of the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR Engine.
It works in the browser using [webpack](https://webpack.js.org/) or plain script tags with a [CDN](#CDN) and on the server with [Node.js](https://nodejs.org/en/).
It works in the browser using [webpack](https://webpack.js.org/), esm, or plain script tags with a [CDN](#CDN) and on the server with [Node.js](https://nodejs.org/en/).
After you [install it](#installation), using it is as simple as:
After you [install it](#installation), using it is as simple as:
Or using workers (recommended for production use):
```javascript
```javascript
import { createWorker } from 'tesseract.js';
import { createWorker } from 'tesseract.js';
const worker = await createWorker({
logger: m => console.log(m)
});
(async () => {
(async () => {
await worker.loadLanguage('eng');
const worker = await createWorker('eng');
await worker.initialize('eng');
const data = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(data.text);
console.log(text);
await worker.terminate();
await worker.terminate();
})();
})();
```
```
When recognizing multiple images, users should create a worker once, run `worker.recognize` for each image, and then run `worker.terminate()` once at the end (rather than running the above snippet for every image).
For a basic overview of the functions, including the pros/cons of different approaches, see the [intro](./docs/intro.md). [Check out the docs](#documentation) for a full explanation of the API.
## Major changes in v4
Version 4 includes many new features and bug fixes--see [this issue](https://github.com/naptha/tesseract.js/issues/662) for a full list. Several highlights are below.
Version 5 changes are documented in [this issue](https://github.com/naptha/tesseract.js/issues/820). Highlights are below.
- Significantly smaller files by default (54% smaller for English, 73% smaller for Chinese)
- This results in a ~50% reduction in runtime for first-time users (who do not have the files cached yet)
- Significantly lower memory usage
- Compatible with iOS 17 (using default settings)
- Breaking changes:
- `createWorker` arguments changed
- Setting non-default language and OEM now happens in `createWorker`
- E.g. `createWorker("chi_sim", 1)`
- `worker.initialize` and `worker.loadLanguage` functions now do nothing and can be deleted from code
- See [this issue](https://github.com/naptha/tesseract.js/issues/820) for full list
## Major changes in v4
Version 4 includes many new features and bug fixes--see [this issue](https://github.com/naptha/tesseract.js/issues/662) for a full list. Several highlights are below.
createWorker is a factory function that creates a tesseract worker, a worker is basically a Web Worker in browser and Child Process in Node.
`createWorker` is a function that creates a Tesseract.js worker. A Tesseract.js worker is an object that creates and manages an instance of Tesseract running in a web worker (browser) or worker thread (Node.js). Once created, OCR jobs are sent through the worker.
**Arguments:**
**Arguments:**
- `langs` a string to indicate the languages traineddata to download, multiple languages are concated with **+**, ex: **eng+chi\_tra**
- `oem` a enum to indicate the OCR Engine Mode you use
- `options` an object of customized options
- `options` an object of customized options
- `corePath` path to a directory containing **both**`tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js` from [Tesseract.js-core](https://www.npmjs.com/package/tesseract.js-core) package
- `corePath` path to a directory containing **both**`tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js` from [Tesseract.js-core](https://www.npmjs.com/package/tesseract.js-core) package
- Setting `corePath` to a specific `.js` file is **strongly discouraged.** To provide the best performance on all devices, Tesseract.js needs to be able to pick between `tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js`. See [this issue](https://github.com/naptha/tesseract.js/issues/735) for more detail.
- Setting `corePath` to a specific `.js` file is **strongly discouraged.** To provide the best performance on all devices, Tesseract.js needs to be able to pick between `tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js`. See [this issue](https://github.com/naptha/tesseract.js/issues/735) for more detail.
@ -43,6 +45,8 @@ createWorker is a factory function that creates a tesseract worker, a worker is
- readOnly: read cache and not to write back
- readOnly: read cache and not to write back
- refresh: not to read cache and write back
- refresh: not to read cache and write back
- none: not to read cache and not to write back
- none: not to read cache and not to write back
- `legacyCore` set to `true` to ensure any code downloaded supports the Legacy model (in addition to LSTM model)
- `legacyLang` set to `true` to ensure any language data downloaded supports the Legacy model (in addition to LSTM model)
- `workerBlobURL` a boolean to define whether to use Blob URL for worker script, default: true
- `workerBlobURL` a boolean to define whether to use Blob URL for worker script, default: true
- `gzip` a boolean to define whether the traineddata from the remote is gzipped, default: true
- `gzip` a boolean to define whether the traineddata from the remote is gzipped, default: true
- `logger` a function to log the progress, a quick example is `m => console.log(m)`
- `logger` a function to log the progress, a quick example is `m => console.log(m)`
Worker.removeFile() remove a file in MEMFS, it is useful when you want to free the memory.
`worker.setParameters()` set parameters for Tesseract API (using SetVariable()), it changes the behavior of Tesseract and some parameters like tessedit\_char\_whitelist is very useful.
**Arguments:**
**Arguments:**
- `path` file path
- `params` an object with key and value of the parameters
- `jobId` Please see details above
- `jobId` Please see details above
**Examples:**
Note: `worker.setParameters` cannot be used to change the `oem`, as this value is set at initialization. `oem` is initially set using an argument of `createWorker`. After a worker already exists, changing `oem` requires running `worker.reinitialize`.
```javascript
**Useful Parameters:**
(async () => {
await worker.removeFile('tmp.txt');
})();
```
<aname="worker-FS"></a>
### Worker.FS(method, args, jobId): Promise
Worker.FS() is a generic FS function to do anything you want, you can check [HERE](https://emscripten.org/docs/api_reference/Filesystem-API.html) for all functions.
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode |
| tessedit\_char\_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful if content in image is limited |
| preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words |
| user\_defined\_dpi | string | '' | Define custom dpi, use to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** |
- `method` method name
This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.)
Worker.loadLanguage() loads traineddata from cache or download traineddata from remote, and put traineddata into the WebAssembly file system.
`worker.reinitialize()` re-initializes an existing Tesseract.js worker with different `langs` and `oem` arguments.
**Arguments:**
**Arguments:**
- `langs` a string to indicate the languages traineddata to download, multiple languages are concated with **+**, ex: **eng+chi\_tra**
- `langs` a string to indicate the languages traineddata to download, multiple languages are concated with **+**, ex: **eng+chi\_tra**
- `oem` a enum to indicate the OCR Engine Mode you use
- `jobId` Please see details above
- `jobId` Please see details above
Note: to switch from Tesseract LSTM (`oem` value `1`) to Tesseract Legacy (`oem` value `0`) using `worker.reinitialize()`, the worker must already contain the code required to run the Tesseract Legacy model. Setting `legacyCore: true` and `legacyLang: true` in `createWorker` options ensures this is the case.
**Examples:**
**Examples:**
```javascript
```javascript
(async () => {
await worker.reinitialize('eng', 1);
await worker.loadLanguage('eng+chi_tra');
})();
```
```
<aname="worker-initialize"></a>
<aname="worker-detect"></a>
### Worker.initialize(langs, oem, jobId): Promise
### Worker.detect(image, jobId): Promise
Worker.detect() does OSD (Orientation and Script Detection) to the image instead of OCR.
Worker.initialize() initializes the Tesseract API, make sure it is ready for doing OCR tasks.
Note: Running `worker.detect` requires a worker with code and language data that supports Tesseract Legacy (this is not enabled by default). If you want to run `worker.detect`, set `legacyCore` and `legacyLang` to `true` in the `createWorker` options.
**Arguments:**
**Arguments:**
- `langs` a string to indicate the languages loaded by Tesseract API, it can be the subset of the languauge traineddata you loaded from Worker.loadLanguage.
- `image` see [Image Format](./image-format.md) for more details.
- `oem` a enum to indicate the OCR Engine Mode you use
- `jobId` Please see details above
- `jobId` Please see details above
**Examples:**
**Examples:**
```javascript
```javascript
const { createWorker } = Tesseract;
(async () => {
(async () => {
/** You can load more languages in advance, but use only part of them in Worker.initialize() */
Worker.setParameters() set parameters for Tesseract API (using SetVariable()), it changes the behavior of Tesseract and some parameters like tessedit\_char\_whitelist is very useful.
**Arguments:**
- `params` an object with key and value of the parameters
- `jobId` Please see details above
<aname="worker-terminate"></a>
### Worker.terminate(jobId): Promise
**Useful Parameters:**
Worker.terminate() terminates the worker and cleans up
| tessedit\_ocr\_engine\_mode | enum | OEM.DEFAULT | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L268) for definition of each mode |
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode |
| tessedit\_char\_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful if content in image is limited |
| preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words |
| user\_defined\_dpi | string | '' | Define custom dpi, use to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** |
This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.)
Worker.detect() does OSD (Orientation and Script Detection) to the image instead of OCR.
Worker.removeFile() remove a file in MEMFS, it is useful when you want to free the memory.
**Arguments:**
**Arguments:**
- `image` see [Image Format](./image-format.md) for more details.
- `path` file path
- `jobId` Please see details above
- `jobId` Please see details above
**Examples:**
**Examples:**
```javascript
```javascript
const { createWorker } = Tesseract;
(async () => {
(async () => {
const worker = await createWorker();
await worker.removeFile('tmp.txt');
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data } = await worker.detect(image);
console.log(data);
})();
})();
```
```
<aname="worker-terminate"></a>
<aname="worker-FS"></a>
### Worker.terminate(jobId): Promise
### Worker.FS(method, args, jobId): Promise
Worker.terminate() terminates the worker and cleans up
Worker.FS() is a generic FS function to do anything you want, you can check [HERE](https://emscripten.org/docs/api_reference/Filesystem-API.html) for all functions.
recognize() is a function to quickly do recognize() task, it is not recommended to use in real application, but useful when you want to save some time.
This function is depreciated and should be replaced with `worker.recognize` (see above).
`recognize` works the same as `worker.recognize`, except that a new worker is created, loaded, and destroyed every time the function is called.
See [Tesseract.js](../src/Tesseract.js)
See [Tesseract.js](../src/Tesseract.js)
<aname="detect"></a>
<aname="detect"></a>
## detect(image, options): Promise
## detect(image, options): Promise
This function is depreciated and should be replaced with `worker.detect` (see above).
Same background as recognize(), but it does detect instead.
Same background as recognize(), but it does detect instead.
@ -19,8 +19,6 @@ Default settings should provide optimal results for most users. If you do want
# Trained Data
# Trained Data
## How does tesseract.js download and keep \*.traineddata?
## How does tesseract.js download and keep \*.traineddata?
The language model is downloaded by `worker.loadLanguage()` and you need to pass the langs to `worker.initialize()`.
During the downloading of language model, Tesseract.js will first check if \*.traineddata already exists. (browser: [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API), Node.js: fs, in the folder you execute the command) If the \*.traineddata doesn't exist, it will fetch \*.traineddata.gz from [tessdata](https://github.com/naptha/tessdata), ungzip and store in IndexedDB or fs, you can delete it manually and it will download again for you.
During the downloading of language model, Tesseract.js will first check if \*.traineddata already exists. (browser: [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API), Node.js: fs, in the folder you execute the command) If the \*.traineddata doesn't exist, it will fetch \*.traineddata.gz from [tessdata](https://github.com/naptha/tessdata), ungzip and store in IndexedDB or fs, you can delete it manually and it will download again for you.
Tesseract.js offers 3 different ways to recognize text, which vary in complexity. This allows Tesseract.js to provide ease of use to new users experimenting with Tesseract.js, while offering control and performance to more experienced users. Each option is described in brief below, in order of complexity. For more detailed documentation on each function, see the [api page](./api.md).
# Option 1: Single Function
By using `Tesseract.recognize`, you can recognize text with just 1 function and 2 arguments (image and language). This makes it easy for new users to experiment with Tesseract.js.
This option should generally be avoided in production code. Using `Tesseract.recognize` results in a new worker being created and loaded with language data whenever `Tesseract.recognize` is run. This is inefficient for reasons explained below.
# Option 2: Using Workers
Tesseract.js also supports creating and managing workers (the objects that execute recognition) manually.
```
(async () => {
const worker = await Tesseract.createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
This code block is no more efficient than the `Tesseract.recognize` example as written (in both cases a worker is created and destroyed for recognizing a single image). However, within the context of an actual application, separating (1) creating a worker and loading data and (2) running recognition jobs provides developers the control necessary to write more efficient code:
1. Workers can be prepared ahead of time
- E.g. a worker can be created and loaded with language data when the page is first loaded, rather than waiting for a user to upload an image to recognize
1. Workers can be reused for multiple recognition jobs, rather than creating a new worker and loading language data for every image recognized (as `Tesseract.recognize` does)
# Option 3: Using Schedulers + Workers
Finally, Tesseract.js supports schedulers. A scheduler is an object that contains multiple workers, which it uses to execute jobs in parallel.
await scheduler.terminate(); // It also terminates all workers.
})();
```
While using schedulers is no more efficient for a single job, they allow for quickly executing large numbers of jobs in parallel.
When working with schedulers, note that workers added to the same scheduler should all be homogenous—they should have the same language be configured with the same parameters. Schedulers assign jobs to workers in a non-deterministic manner, so if the workers are not identical then recognition results will depend on which worker the job is assigned to.
A string specifying the location of the `worker.js` file.
A string specifying the location of the `worker.js` file.
### langPath
### langPath
A string specifying the location of the tesseract language files, with default value 'https://tessdata.projectnaptha.com/4.0.0'. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`.
A string specifying the location of the tesseract language files. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`. If `langPath` is not specified by the user, then the correct language data will be automatically downloaded from the jsDelivr CDN.
### corePath
### corePath
A string specifying the location of the [tesseract.js-core](https://github.com/naptha/tesseract.js-core) files, with default value 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3'.
A string specifying the location of the [tesseract.js-core](https://github.com/naptha/tesseract.js-core) files, with default value 'https://cdn.jsdelivr.net/npm/tesseract.js-core@v5.0.0'.
If you set the `corePath` argument, be sure to set it to a directory that contains **all 4** of these files:
1. `tesseract-core.wasm.js`
2. `tesseract-core-simd.wasm.js`
3. `tesseract-core-lstm.wasm.js`
4. `tesseract-core-simd-lstm.wasm.js`
`corePath` should be set to a directory containing both `tesseract-core-simd.wasm.js` and `tesseract-core.wasm.js`. Tesseract.js will load either `tesseract-core-simd.wasm.js` or `tesseract-core.wasm.js` from the directory depending on whether the users' device supports SIMD (see [https://webassembly.org/roadmap/](https://webassembly.org/roadmap/)).
Tesseract.js will pick the correct file based on your users' device and the `createWorker` options.
To avoid breaking old code, when `corePath` is set to a specific `.js` file (e.g. `https://cdn.jsdelivr.net/npm/tesseract.js-core@v4.0.3/tesseract-core.wasm.js`), it will load that file regardless of whether the users' device supports SIMD or not. This behavior only exists to preserve backwards compatibility—setting `corePath` to a specific `.js` file is strongly discouraged. Doing so will either result in much slower performance (if `tesseract-core.wasm.js` is specified) or failure to run on certain devices (if `tesseract-core-simd.wasm.js` is specified).
To avoid breaking old code, when `corePath` is set to a specific `.js` file (e.g. `https://cdn.jsdelivr.net/npm/tesseract.js-core@v5.0.0/tesseract-core.wasm.js`), it will load that file regardless of whether the users' device supports SIMD or not. This behavior only exists to preserve backwards compatibility—setting `corePath` to a specific `.js` file is strongly discouraged. Doing so will either result in much slower performance (if `tesseract-core.wasm.js` is specified) or failure to run on certain devices (if `tesseract-core-simd.wasm.js` is specified).
This guide contains tips and strategies for getting the fastest performance from Tesseract.js. While some of the tips below involve avoiding pitfalls and should be universally implemented, other strategies (changing the language data or recognition model) may harm recognition quality. Therefore, whether these strategies are appropriate depends on the application, and users should always benchmark performance and quality before changing important settings from their defaults.
This guide contains tips and strategies for getting the fastest performance from Tesseract.js. While some of the tips below involve avoiding pitfalls and should be universally implemented, other strategies (changing the language data or recognition model) may harm recognition quality. Therefore, whether these strategies are appropriate depends on the application, and users should always benchmark performance and quality before changing important settings from their defaults.
# Reducing Setup Time
# Reducing Setup Time
Within certain applications, the majority of runtime may be attributable to setup steps (`createWorker`, `worker.initialize`, and `worker.loadLanguage`) rather than recognition (`worker.recognize`). Implementing the strategies in this section should reduce the time spent on these steps.
Within certain applications, the majority of runtime may be attributable to setup steps (`createWorker`) rather than recognition (`worker.recognize`). Implementing the strategies in this section should reduce the time spent on these steps.
Notably, the time spent on setup for first-time users may not be apparent to developers, as Tesseract.js caches language data after it is downloaded for the first time. To experience Tesseract.js as a first-time user, set `cacheMethod: 'none'` in the [createWorker options](./api.md#createworkeroptions-worker). Be sure to remove this setting before publishing your app.
Notably, the time spent on setup for first-time users may not be apparent to developers, as Tesseract.js caches language data after it is downloaded for the first time. To experience Tesseract.js as a first-time user, set `cacheMethod: 'none'` in the [createWorker options](./api.md#createworkeroptions-worker). Be sure to remove this setting before publishing your app.
### Reuse Workers
### Reuse Workers
When recognizing multiple images, some users will create/load/destroy a new worker for each image. This is never the correct option. If the images are being recognized one after the other, all of the extra `createWorker`/`worker.initialize`/`worker.loadLanguage` steps are wasted runtime, as `worker.recognize` could be run with the same `worker`. Workers do not break after one use.
When recognizing multiple images, some users will create/load/destroy a new worker for each image. This is never the correct option. If the images are being recognized one after the other, all of the extra steps required to create/load/destroy a new worker are wasted runtime, as `worker.recognize` could be run with the same `worker`. Workers do not break after one use.
Alternatively, if images are being recognized in parallel, then creating a new worker for each recognition job is likely to cause crashes due to resource limitations. As each Tesseract.js worker uses a high amount of memory, code should never be able to create an arbitrary number of `workers`. Instead, schedulers should be used to create a specific pool for workers (say, 4 workers), and then jobs assigned through the scheduler.
Alternatively, if images are being recognized in parallel, then creating a new worker for each recognition job is likely to cause crashes due to resource limitations. As each Tesseract.js worker uses a high amount of memory, code should never be able to create an arbitrary number of `workers`. Instead, schedulers should be used to create a specific pool for workers (say, 4 workers), and then jobs assigned through the scheduler.
### Set Up Workers Ahead of Time
### Set Up Workers Ahead of Time
Rather than waiting until the last minute to load code and data, you can set up a worker ahead of time. Doing so greatly reduces runtime the first time a user run recognition. This requires managing workers rather than using `Tesseract.recognize`, which is explained [here](./intro.md). An example where a worker is prepared ahead of time can be found [here](../examples/browser/basic-efficient.html).
Rather than waiting until the last minute to load code and data, you can set up a worker ahead of time. Doing so greatly reduces runtime the first time a user run recognition. An example where a worker is prepared ahead of time can be found [here](../examples/browser/basic-efficient.html).
The appropriate time to load Tesseract.js workers and data is application-specific. For example, if you have an web app where only 5% of users need OCR, it likely does not make sense to download ~15MB in code and data upon a page load. You could consider instead loading Tesseract.js when a user indicates they want to perform OCR, but before they select a specific image.
The appropriate time to load Tesseract.js workers and data is application-specific. For example, if you have an web app where only 5% of users need OCR, it likely does not make sense to download ~15MB in code and data upon a page load. You could consider instead loading Tesseract.js when a user indicates they want to perform OCR, but before they select a specific image.
### Do Not Disable Language Data Caching
### Do Not Disable Language Data Caching
Language data is, by far, the largest download required to run Tesseract.js. The default `eng.traineddata` file is 10.4MB compressed. The default `chi_sim.traineddata` file is 19.2MB compressed.
Language data is one of the largest downloads required to run Tesseract.js. While most language data files (including the default English file) are ~2MB, in a worst-case scenario they can be much larger. For example, setting the recognition model (`oem`) to Tesseract Legacy and language to Chinese (simplified) results in a ~20MB file being downloaded.
To avoid downloading language data multiple times, Tesseract.js caches `.traineddata` files. In past versions of Tesseract.js, this caching behavior contained bugs, so some users disabled it (setting `cacheMethod: 'none'` or `cacheMethod: 'refresh'`). As these bugs were fixed in [v4.0.6](https://github.com/naptha/tesseract.js/releases/tag/v4.0.6), it is now recommended that users use the default `cacheMethod` value (i.e. just ignore the `cacheMethod` argument).
To avoid downloading language data multiple times, Tesseract.js caches `.traineddata` files. In past versions of Tesseract.js, this caching behavior contained bugs, so some users disabled it (setting `cacheMethod: 'none'` or `cacheMethod: 'refresh'`). As these bugs were fixed in [v4.0.6](https://github.com/naptha/tesseract.js/releases/tag/v4.0.6), it is now recommended that users use the default `cacheMethod` value (i.e. just ignore the `cacheMethod` argument).
### Consider Using Smaller Language Data
The default language data used by Tesseract.js includes data for both Tesseract engines (LSTM [the default model] and Legacy), and is optimized for quality rather than speed. Both the inclusion of multiple models and the focus on quality increase the size of the language data. Setting a non-default `langData` path may result in significantly smaller files being downloaded.
For example, by changing `langPath` from the default (`https://tessdata.projectnaptha.com/4.0.0`) to `https://tessdata.projectnaptha.com/4.0.0_fast` the size of the compressed English language data is reduced from 10.9MB to 2.0MB. Note that this language data (1) only supports the default LSTM model and (2) is optimized for size/speed rather than quality, so users should switch only after testing whether this data works for their application.
# Reducing Recognition Runtime
# Reducing Recognition Runtime
### Use the Latest Version of Tesseract.js
### Use the Latest Version of Tesseract.js
Old versions of Tesseract.js are significantly slower. Notably, v2 (now depreciated) takes 10x longer to recognize certain images compared to the latest version.
Old versions of Tesseract.js are significantly slower. Notably, v2 (now depreciated) takes 10x longer to recognize certain images compared to the latest version.
### Consider Using the Legacy Model
### Do Not Set `corePath` to a Single `.js` file
In general, the LSTM (default) recognition model provides the best quality. However, the Legacy model generally runs faster, and depending on your application, may provide sufficient recognition quality. If runtime is a significant concern, consider experimenting with the Legacy model (by setting `oem` to `”0”` within `worker.initialize`). As a rule of thumb, the Legacy model is usually viable when the input data is high-quality (high-definition screenshots, document scans, etc.).
If you set the `corePath` argument, be sure to set it to a directory that contains **all 4** of these files:
### Consider Using "Fast" Language Data
1. `tesseract-core.wasm.js`
By default, Tesseract.js uses language data that is optimized for quality rather than speed. You can also experiment with using language data that is optimized for speed by setting `langPath` to `https://tessdata.projectnaptha.com/4.0.0_fast`.
2. `tesseract-core-simd.wasm.js`
3. `tesseract-core-lstm.wasm.js`
4. `tesseract-core-simd-lstm.wasm.js`
### Do Not Set `corePath` to a Single `.js` file
Tesseract.js needs to be able to pick between these files—setting `corePath` to a specific `.js` file will significantly degrade performance or compatibility.
If you set the `corePath` argument, be sure to set it to a directory that contains both `tesseract-core.wasm.js` or `tesseract-core-simd.wasm.js`. Tesseract.js needs to be able to pick between both files—setting `corePath` to a specific `.js` file will significantly degrade performance or compatibility. See [this comment](https://github.com/naptha/tesseract.js/issues/735#issuecomment-1519157646) for explanation.
### Consider Using "Fast" Language Data
By default, Tesseract.js uses language data that is optimized for quality rather than speed. You can also experiment with using language data that is optimized for speed by setting `langPath` to `https://tessdata.projectnaptha.com/4.0.0_fast`. We have not benchmarked the impact this has on performance or accuracy, so feel free to open a Git Issue if you do so and wish to share results.
Tesseract.js offers 2 ways to run recognition jobs: (1) using a worker directly, or (2) using a scheduler to run jobs on multiple workers in parallel. The syntax for the latter is more complicated, but using parallel processing via schedulers provides significantly better performance for large jobs. For more detailed documentation on each function, see the [api page](./api.md).
# Option 1: Using Workers Directly
Tesseract.js also supports creating and managing workers (the objects that execute recognition) manually.
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
In actual use, the `createWorker` step should be separated from the `worker.recognize` step. Doing so enables the following benefits:
1. Workers can be prepared ahead of time
- E.g. a worker can be created when the page is first loaded, rather than waiting for a user to upload an image to recognize
1. Workers can be reused for multiple recognition jobs, rather than creating a new worker and loading language data for every image recognized
- Remember to call `worker.terminate()` after all recognition is complete to free memory
# Option 2: Using Schedulers + Workers
Tesseract.js also supports executing jobs using schedulers. A scheduler is an object that contains multiple workers, which it uses to execute jobs in parallel. For example, the following code executes 10 jobs in parallel using 4 workers.
await scheduler.terminate(); // It also terminates all workers.
})();
```
While using schedulers is no more efficient for a single job, they allow for quickly executing large numbers of jobs in parallel.
When working with schedulers, note that workers added to the same scheduler should all be homogenous—they should have the same language be configured with the same parameters. Schedulers assign jobs to workers in a non-deterministic manner, so if the workers are not identical then recognition results will depend on which worker the job is assigned to.
// For Node.js, langPath may be a URL or local file path
// For Node.js, langPath may be a URL or local file path
// The is-url package is used to tell the difference
// The is-url package is used to tell the difference
// For the browser version, langPath is assumed to be a URL
// For the browser version, langPath is assumed to be a URL
if(env!=='node'||isURL(langPath)||langPath.startsWith('moz-extension://')||langPath.startsWith('chrome-extension://')||langPath.startsWith('file://')){/** When langPath is an URL */
if(env!=='node'||isURL(langPathDownload)||langPathDownload.startsWith('moz-extension://')||langPathDownload.startsWith('chrome-extension://')||langPathDownload.startsWith('file://')){/** When langPathDownload is an URL */
constTESTOCR_TEXT='This is a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format.\n\nThe quick brown dog jumped over the\nlazy fox. The quick brown dog jumped\nover the lazy fox. The quick brown dog\njumped over the lazy fox. The quick\nbrown dog jumped over the lazy fox.\n';
constTESTOCR_TEXT='This is a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format.\n\nThe quick brown dog jumped over the\nlazy fox. The quick brown dog jumped\nover the lazy fox. The quick brown dog\njumped over the lazy fox. The quick\nbrown dog jumped over the lazy fox.\n';