`createWorker` is a function that creates a Tesseract.js worker. A Tesseract.js worker is an object that creates and manages an instance of Tesseract running in a web worker (browser) or worker thread (Node.js). Once created, OCR jobs are sent through the worker.
-`corePath` path to a directory containing **both**`tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js` from [Tesseract.js-core](https://www.npmjs.com/package/tesseract.js-core) package
- Setting `corePath` to a specific `.js` file is **strongly discouraged.** To provide the best performance on all devices, Tesseract.js needs to be able to pick between `tesseract-core.wasm.js` and `tesseract-core-simd.wasm.js`. See [this issue](https://github.com/naptha/tesseract.js/issues/735) for more detail.
-`config` an object of customized options which are set prior to initialization
- This argument allows for setting "init only" Tesseract parameters
- Most Tesseract parameters can be set after a worker is initialized, using either `worker.setParameters` or the `options` argument of `worker.recognize`.
- A handful of Tesseract parameters, referred to as "init only" parameters in Tesseract documentation, cannot be modified after Tesseract is initialized--these can only be set using this argument
- Examples include `load_system_dawg`, `load_number_dawg`, and `load_punc_dawg`
`worker.recognize` returns a promise to an object containing `jobId` and `data` properties. The `data` property contains output in all of the formats specified using the `output` argument.
> [!NOTE]
> `worker.recognize` still returns an output object even if no text is detected (the outputs will simply contain no words). No exception is thrown as determining the page is empty is considered a valid result.
`worker.setParameters()` set parameters for Tesseract API (using SetVariable()), it changes the behavior of Tesseract and some parameters like tessedit\_char\_whitelist is very useful.
Note: `worker.setParameters` cannot be used to change the `oem`, as this value is set at initialization. `oem` is initially set using an argument of `createWorker`. After a worker already exists, changing `oem` requires running `worker.reinitialize`.
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for definition of each mode |
| tessedit\_char\_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful if content in image is limited |
| preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words |
| user\_defined\_dpi | string | '' | Define custom dpi, use to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** |
This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.)
Note: to switch from Tesseract LSTM (`oem` value `1`) to Tesseract Legacy (`oem` value `0`) using `worker.reinitialize()`, the worker must already contain the code required to run the Tesseract Legacy model. Setting `legacyCore: true` and `legacyLang: true` in `createWorker` options ensures this is the case.
> Running `worker.detect` requires a worker with code and language data that supports Tesseract Legacy (this is not enabled by default). If you want to run `worker.detect`, set `legacyCore` and `legacyLang` to `true` in the `createWorker` options.
`worker.FS` is a generic FS function to do anything you want, you can check [HERE](https://emscripten.org/docs/api_reference/Filesystem-API.html) for all functions.
`createScheduler` is a factory function to create a scheduler, a scheduler manages a job queue and workers to enable multiple workers to work together, it is useful when you want to speed up your performance.