# API

- [createWorker()](#create-worker)
  - [Worker.load](#worker-load)
  - [Worker.writeText](#worker-writeText)
  - [Worker.readText](#worker-readText)
  - [Worker.removeFile](#worker-removeFile)
  - [Worker.FS](#worker-FS)
  - [Worker.loadLanguage](#worker-load-language)
  - [Worker.initialize](#worker-initialize)
  - [Worker.setParameters](#worker-set-parameters)
  - [Worker.recognize](#worker-recognize)
  - [Worker.detect](#worker-detect)
  - [Worker.terminate](#worker-terminate)
- [createScheduler()](#create-scheduler)
  - [Scheduler.addWorker](#scheduler-add-worker)
  - [Scheduler.addJob](#scheduler-add-job)
  - [Scheduler.getQueueLen](#scheduler-get-queue-len)
  - [Scheduler.getNumWorkers](#scheduler-get-num-workers)
  - [Scheduler.terminate](#scheduler-terminate)
- [setLogging()](#set-logging)
- [recognize()](#recognize)
- [detect()](#detect)
- [PSM](#psm)
- [OEM](#oem)

---

<a name="create-worker"></a>
## createWorker(options): Worker

createWorker is a factory function that creates a Tesseract worker. A worker is essentially a Web Worker in the browser and a Child Process in Node.

**Arguments:**

- `options` an object of customized options
  - `corePath` path to the tesseract-core.js script
  - `langPath` path for downloading traineddata; do not include `/` at the end of the path
  - `workerPath` path for downloading the worker script
  - `dataPath` path for saving traineddata in the WebAssembly file system; rarely needs to be modified
  - `cachePath` path for the cached traineddata; more useful in Node, in the browser it only changes the key in IndexedDB
  - `cacheMethod` a string indicating the cache management method; should be one of the following options
    - write: read the cache and write back (default method)
    - readOnly: read the cache but do not write back
    - refresh: do not read the cache, but write back
    - none: do not read the cache and do not write back
  - `workerBlobURL` a boolean that defines whether to use a Blob URL for the worker script, default: true
  - `gzip` a boolean that defines whether the traineddata from the remote is gzipped, default: true
  - `logger` a function to log the progress, a quick example is `m => console.log(m)`
  - `errorHandler` a function to handle worker errors, a quick example is `err => console.error(err)`

**Examples:**

```javascript
const { createWorker } = Tesseract;
const worker = createWorker({
  langPath: '...',
  logger: m => console.log(m),
});
```
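
Two of the options above carry constraints that are easy to get wrong: `langPath` must not end with `/`, and `cacheMethod` must be one of four strings. The following is a minimal sketch of a pre-check; `buildWorkerOptions` and `CACHE_METHODS` are hypothetical names and not part of Tesseract.js. The returned object is meant to be passed to `createWorker`.

```javascript
// Hypothetical helper, not part of Tesseract.js: normalizes an options
// object before it is passed to createWorker().
const CACHE_METHODS = ['write', 'readOnly', 'refresh', 'none'];

function buildWorkerOptions({ langPath, cacheMethod = 'write', ...rest } = {}) {
  // cacheMethod must be one of the four documented strings.
  if (!CACHE_METHODS.includes(cacheMethod)) {
    throw new Error(`Unknown cacheMethod: ${cacheMethod}`);
  }
  const options = { ...rest, cacheMethod };
  if (typeof langPath === 'string') {
    // langPath must not include '/' at the end, so strip any trailing slashes.
    options.langPath = langPath.replace(/\/+$/, '');
  }
  return options;
}
```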

## Worker

A Worker helps you do OCR-related tasks. It takes a few steps to set up a Worker before it is fully functional. The full flow is:

- load
- FS functions // optional
- loadLanguage
- initialize
- setParameters // optional
- recognize or detect
- terminate

Each function is async, so using async/await or Promise is required. When it is resolved, you get an object:

```json
{
  "jobId": "Job-1-123",
  "data": { ... }
}
```

jobId is generated by Tesseract.js, but you can supply your own when calling any of the functions above.

<a name="worker-load"></a>
### Worker.load(jobId): Promise

Worker.load() loads the tesseract.js-core scripts (downloading them from remote if not present) and makes the Web Worker/Child Process ready for the next action.

**Arguments:**

- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  await worker.load();
})();
```

<a name="worker-writeText"></a>
### Worker.writeText(path, text, jobId): Promise

Worker.writeText() writes a text file to the specified path in MEMFS. It is useful when you want to use features that require tesseract.js to read a file from the file system.

**Arguments:**

- `path` text file path
- `text` content of the text file
- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n');
})();
```

<a name="worker-readText"></a>
### Worker.readText(path, jobId): Promise

Worker.readText() reads a text file from the specified path in MEMFS. It is useful when you want to check the content.

**Arguments:**

- `path` text file path
- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  const { data } = await worker.readText('tmp.txt');
  console.log(data);
})();
```

<a name="worker-removeFile"></a>
### Worker.removeFile(path, jobId): Promise

Worker.removeFile() removes a file in MEMFS. It is useful when you want to free the memory.

**Arguments:**

- `path` file path
- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  await worker.removeFile('tmp.txt');
})();
```

<a name="worker-FS"></a>
### Worker.FS(method, args, jobId): Promise

Worker.FS() is a generic FS function that can call any file system method. See [HERE](https://emscripten.org/docs/api_reference/Filesystem-API.html) for all functions.

**Arguments:**

- `method` method name
- `args` array of arguments to pass
- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  await worker.FS('writeFile', ['tmp.txt', 'Hi\nTesseract.js\n']);
  // equal to:
  // await worker.writeText('tmp.txt', 'Hi\nTesseract.js\n');
})();
```

<a name="worker-load-language"></a>
### Worker.loadLanguage(langs, jobId): Promise

Worker.loadLanguage() loads traineddata from the cache or downloads it from remote, and puts it into the WebAssembly file system.

**Arguments:**

- `langs` a string indicating the language traineddata to download; multiple languages are concatenated with **+**, ex: **eng+chi\_tra**
- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  await worker.loadLanguage('eng+chi_tra');
})();
```

<a name="worker-initialize"></a>
### Worker.initialize(langs, oem, jobId): Promise

Worker.initialize() initializes the Tesseract API and makes sure it is ready for OCR tasks.

**Arguments:**

- `langs` a string indicating the languages loaded by the Tesseract API; it can be a subset of the language traineddata you loaded with Worker.loadLanguage.
- `oem` an enum indicating the OCR Engine Mode you use
- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
  /** You can load more languages in advance, but use only part of them in Worker.initialize() */
  await worker.loadLanguage('eng+chi_tra');
  await worker.initialize('eng');
})();
```
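
The setup steps described so far (load, loadLanguage, initialize) always run in the same order, so they can be bundled into one call. The sketch below is only an illustration: `prepareWorker` is a hypothetical helper, not part of Tesseract.js, and it assumes the same language string is used for both loading and initializing.

```javascript
// Hypothetical helper, not part of Tesseract.js: runs the mandatory setup
// steps on a worker in the documented order and returns the same worker.
async function prepareWorker(worker, langs = 'eng') {
  await worker.load();              // load the tesseract.js-core scripts
  await worker.loadLanguage(langs); // fetch traineddata (cache or remote)
  await worker.initialize(langs);   // initialize the Tesseract API
  return worker;
}

// Usage sketch (assumes `Tesseract` is available as in the examples above):
// const worker = await prepareWorker(Tesseract.createWorker(), 'eng');
```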

<a name="worker-set-parameters"></a>
### Worker.setParameters(params, jobId): Promise

Worker.setParameters() sets parameters for the Tesseract API (using SetVariable()). It changes the behavior of Tesseract, and some parameters such as tessedit\_char\_whitelist are very useful.

**Arguments:**

- `params` an object with the keys and values of the parameters
- `jobId` Please see details above

**Supported Parameters:**

| name | type | default value | description |
| --------------------------- | ------ | ----------------- | ----------- |
| tessedit\_ocr\_engine\_mode | enum | OEM.DEFAULT | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L268) for the definition of each mode |
| tessedit\_pageseg\_mode | enum | PSM.SINGLE\_BLOCK | Check [HERE](https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163) for the definition of each mode |
| tessedit\_char\_whitelist | string | '' | setting whitelist characters makes the result contain only these characters, useful when the content of the image is limited |
| preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the spaces between words |
| user\_defined\_dpi | string | '' | defines a custom dpi, use it to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** |
| tessjs\_create\_hocr | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes hocr in the result |
| tessjs\_create\_tsv | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes tsv in the result |
| tessjs\_create\_box | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes box in the result |
| tessjs\_create\_unlv | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes unlv in the result |
| tessjs\_create\_osd | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes osd in the result |

**Examples:**

```javascript
(async () => {
  await worker.setParameters({
    tessedit_char_whitelist: '0123456789',
  });
})();
```
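
The `tessjs_create_*` parameters in the table take the strings '0' and '1' rather than booleans, which is easy to get wrong. As a hedged sketch, the hypothetical helper below (`outputParams` is not part of Tesseract.js) converts a map of booleans into a `params` object that could be passed to Worker.setParameters.

```javascript
// Hypothetical helper, not part of Tesseract.js: maps booleans to the
// '0'/'1' strings expected by the tessjs_create_* parameters.
const OUTPUT_FORMATS = ['hocr', 'tsv', 'box', 'unlv', 'osd'];

function outputParams(flags) {
  const params = {};
  for (const [format, enabled] of Object.entries(flags)) {
    // Only the five formats listed in the table are accepted.
    if (!OUTPUT_FORMATS.includes(format)) {
      throw new Error(`Unknown output format: ${format}`);
    }
    params[`tessjs_create_${format}`] = enabled ? '1' : '0';
  }
  return params;
}

// Usage sketch: turn off hocr and tsv output.
// await worker.setParameters(outputParams({ hocr: false, tsv: false }));
```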

<a name="worker-recognize"></a>
### Worker.recognize(image, options, jobId): Promise

Worker.recognize() provides the core function of Tesseract.js: it executes OCR.

It figures out what words are in `image`, where the words are in `image`, etc.

> Note: `image` should be sufficiently high resolution.
> Often, the same image will get much better results if you upscale it before calling `recognize`.

**Arguments:**

- `image` see [Image Format](./image-format.md) for more details.
- `options` an object of customized options
  - `rectangle` an object specifying the region of the image you want to recognize; it should contain top, left, width and height, see the example below.
- `jobId` Please see details above

**Output:**

The Promise resolves to an object with the `jobId` and `data` fields described under Worker above; as the examples below show, `data.text` holds the recognized text.

**Examples:**

```javascript
const { createWorker } = Tesseract;
(async () => {
  const worker = createWorker();
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data: { text } } = await worker.recognize(image);
  console.log(text);
})();
```

With rectangle

```javascript
const { createWorker } = Tesseract;
(async () => {
  const worker = createWorker();
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data: { text } } = await worker.recognize(image, {
    rectangle: { top: 0, left: 0, width: 100, height: 100 },
  });
  console.log(text);
})();
```
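
When several regions of the same image are of interest, the `rectangle` option can be applied once per region. The following is a minimal sketch under that assumption; `recognizeRegions` is a hypothetical helper, not part of Tesseract.js, and it expects a worker that has already been loaded and initialized. It awaits each call before starting the next, so a single worker handles one region at a time.

```javascript
// Hypothetical helper, not part of Tesseract.js: recognizes several
// rectangles of one image sequentially and returns their texts in order.
async function recognizeRegions(worker, image, rectangles) {
  const texts = [];
  for (const rectangle of rectangles) {
    // Each rectangle needs top, left, width and height.
    const { data: { text } } = await worker.recognize(image, { rectangle });
    texts.push(text);
  }
  return texts;
}
```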

<a name="worker-detect"></a>
### Worker.detect(image, jobId): Promise

Worker.detect() performs OSD (Orientation and Script Detection) on the image instead of OCR.

**Arguments:**

- `image` see [Image Format](./image-format.md) for more details.
- `jobId` Please see details above

**Examples:**

```javascript
const { createWorker } = Tesseract;
(async () => {
  const worker = createWorker();
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data } = await worker.detect(image);
  console.log(data);
})();
```

<a name="worker-terminate"></a>
### Worker.terminate(jobId): Promise

Worker.terminate() terminates the worker and cleans up.

```javascript
(async () => {
  await worker.terminate();
})();
```
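
Because terminate is the last step of the Worker flow, it is easy to skip when an earlier step throws. One possible pattern, sketched here with a hypothetical helper (`withWorker` is not part of Tesseract.js), is to run the work inside `try`/`finally` so the worker is terminated whether or not the task succeeds.

```javascript
// Hypothetical helper, not part of Tesseract.js: runs `task` with the
// worker and always terminates the worker afterwards.
async function withWorker(worker, task) {
  try {
    return await task(worker);
  } finally {
    // Runs on success and on failure; errors from `task` still propagate.
    await worker.terminate();
  }
}

// Usage sketch:
// const text = await withWorker(worker, async (w) => {
//   const { data } = await w.recognize(image);
//   return data.text;
// });
```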

<a name="create-scheduler"></a>
## createScheduler(): Scheduler

createScheduler() is a factory function that creates a scheduler. A scheduler manages a job queue and a pool of workers so that multiple workers can work together; it is useful when you want to speed up processing.

**Examples:**

```javascript
const { createScheduler } = Tesseract;
const scheduler = createScheduler();
```

## Scheduler

<a name="scheduler-add-worker"></a>
### Scheduler.addWorker(worker): string

Scheduler.addWorker() adds a worker to the worker pool inside the scheduler. It is recommended to add each worker to only one scheduler.

**Arguments:**

- `worker` see Worker above

**Examples:**

```javascript
const { createWorker, createScheduler } = Tesseract;
const scheduler = createScheduler();
const worker = createWorker();
scheduler.addWorker(worker);
```

<a name="scheduler-add-job"></a>
### Scheduler.addJob(action, ...payload): Promise

Scheduler.addJob() adds a job to the job queue; the scheduler then waits for an idle worker to take the job.

**Arguments:**

- `action` a string indicating the action you want to perform; right now only **recognize** and **detect** are supported
- `payload` an arbitrary number of args depending on the action you called.

**Examples:**

```javascript
(async () => {
  const { data: { text } } = await scheduler.addJob('recognize', image, options);
  const { data } = await scheduler.addJob('detect', image);
})();
```
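
Since addJob returns a Promise per job, several images can be queued at once and awaited together, letting the scheduler hand them to idle workers. The sketch below is an illustration only: `recognizeAll` is a hypothetical helper, not part of Tesseract.js, and it assumes the scheduler already has loaded and initialized workers added to it.

```javascript
// Hypothetical helper, not part of Tesseract.js: queues one 'recognize'
// job per image and resolves to the texts in the same order as `images`.
async function recognizeAll(scheduler, images) {
  const results = await Promise.all(
    images.map((image) => scheduler.addJob('recognize', image)),
  );
  return results.map(({ data: { text } }) => text);
}
```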

<a name="scheduler-get-queue-len"></a>
### Scheduler.getQueueLen(): number

Scheduler.getQueueLen() returns the length of the job queue.

<a name="scheduler-get-num-workers"></a>
### Scheduler.getNumWorkers(): number

Scheduler.getNumWorkers() returns the number of workers added to the scheduler.

<a name="scheduler-terminate"></a>
### Scheduler.terminate(): Promise

Scheduler.terminate() terminates all workers added to the scheduler; useful for quick cleanup.

**Examples:**

```javascript
(async () => {
  await scheduler.terminate();
})();
```

<a name="set-logging"></a>
## setLogging(logging: boolean)

setLogging() sets the logging flag. Call `setLogging(true)` to see detailed information, which is useful for debugging.

**Arguments:**

- `logging` a boolean that defines whether to show detailed logs, default: false

**Examples:**

```javascript
const { setLogging } = Tesseract;
setLogging(true);
```

<a name="recognize"></a>
## recognize(image, langs, options): Promise

recognize() is a shortcut for quickly running a recognize task. It is not recommended for use in real applications, but it is useful when you want to save some time.

See [Tesseract.js](../src/Tesseract.js)

<a name="detect"></a>
## detect(image, options): Promise

Same background as recognize(), but it runs detect instead.

See [Tesseract.js](../src/Tesseract.js)

<a name="psm"></a>
## PSM

See [PSM.js](../src/constants/PSM.js)

<a name="oem"></a>
## OEM

See [OEM.js](../src/constants/OEM.js)