FAQ
===

# Project
## What is the scope of this project? 
Tesseract.js is the JavaScript/WebAssembly port of the Tesseract OCR engine. We do not edit the underlying Tesseract recognition engine in any way. Therefore, if you encounter bugs caused by the Tesseract engine, you may open an issue here to raise awareness among other users, but fixing them is outside the scope of this repository.

If you encounter a Tesseract bug you would like to see fixed, first confirm that the behavior is the same in the [main (CLI) version](https://github.com/tesseract-ocr/tesseract) of Tesseract, and then open a GitHub issue in that repository.

# Trained Data
## How does tesseract.js download and keep \*.traineddata?

The language model is downloaded by `worker.loadLanguage()`. You then need to pass the same languages to `worker.initialize()`.
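The two calls above are typically made one after the other. The sketch below wraps them in a small helper; `prepareLanguages` is a hypothetical name used only for illustration and is not part of Tesseract.js, and it assumes `worker` is an object exposing the `loadLanguage()` and `initialize()` methods described above.

```javascript
// Hypothetical helper (not part of Tesseract.js): downloads the language
// data first, then initializes the worker with the same language string.
async function prepareLanguages(worker, langs) {
  await worker.loadLanguage(langs); // fetches or reads cached *.traineddata
  await worker.initialize(langs);   // must receive the same langs
  return worker;
}
```

In a real project, `worker` would come from `createWorker()` in the `tesseract.js` package, and the call would look like `await prepareLanguages(worker, 'eng')`.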

When loading a language model, Tesseract.js first checks whether the \*.traineddata file already exists locally (browser: [IndexedDB](https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API); Node.js: the file system, in the folder where you execute the command). If the \*.traineddata file doesn't exist, Tesseract.js fetches \*.traineddata.gz from [tessdata](https://github.com/naptha/tessdata), decompresses it, and stores the result in IndexedDB or on the file system. You can delete the cached file manually, and it will be downloaded again the next time it is needed.
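The check-then-download flow described above can be illustrated for the Node.js case with a short sketch. This is not the actual Tesseract.js implementation: `loadTrainedData`, `cacheDir`, and `fetchGz` are hypothetical names, and the download step is passed in as a function so the example does not assume any particular URL or HTTP client.

```javascript
const fs = require('fs');
const path = require('path');
const zlib = require('zlib');

// Hypothetical helper (not part of Tesseract.js): returns the traineddata
// bytes for `lang`, using a file in `cacheDir` as the cache.
// `fetchGz` is an assumed async function that resolves to the gzipped bytes.
async function loadTrainedData(lang, cacheDir, fetchGz) {
  const cachePath = path.join(cacheDir, `${lang}.traineddata`);

  // 1. Use the cached file if it already exists.
  if (fs.existsSync(cachePath)) {
    return { data: fs.readFileSync(cachePath), fromCache: true };
  }

  // 2. Otherwise download the .gz file, decompress it, and store it.
  const gz = await fetchGz(lang);
  const data = zlib.gunzipSync(gz);
  fs.writeFileSync(cachePath, data);
  return { data, fromCache: false };
}
```

Deleting `<lang>.traineddata` from `cacheDir` makes the next call take the download branch again, which mirrors the behavior described above.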

## How can I train my own \*.traineddata?

See the documentation from the main [Tesseract project](https://tesseract-ocr.github.io/tessdoc/) for training instructions.