Performing OCR on an image with pytesseract
The pytesseract library makes it possible to extract text from images. In this recipe, we will use it to perform OCR on an image. Tesseract is an open source OCR engine sponsored by Google. The source, along with more information on the library, is available at https://github.com/tesseract-ocr/tesseract. pytesseract is a thin Python wrapper that provides a Pythonic API to the Tesseract executable.
Getting ready
Make sure you have pytesseract installed:
pip install pytesseract
You will also need to install tesseract-ocr itself, since pytesseract only wraps the executable. On Windows, there is an executable installer, which you can get here: https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-for-windows. On a Debian-based Linux system, you can use apt-get:
sudo apt-get install tesseract-ocr
The easiest means of installation on a Mac is using brew:
brew install tesseract
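Whichever installer you use, it can be worth confirming that the tesseract executable is actually reachable before running any OCR code. The following is a minimal sketch using only the Python standard library; the helper name tesseract_version is made up for this illustration, and it assumes the executable is named tesseract and is on your PATH.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not found.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    exe = shutil.which("tesseract")  # look the executable up on PATH
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some Tesseract builds write the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


print(tesseract_version() or "tesseract not found on PATH")
```

If this reports that tesseract was not found, revisit the installation step for your platform or add the install directory to PATH.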
The code for this recipe is in 04/10_perform_ocr.py.
How to do it
Execute the script for the recipe....
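As a rough idea of what an OCR script built on pytesseract can look like, here is a minimal sketch. It is not necessarily the contents of 04/10_perform_ocr.py: the function name extract_text, the default language "eng", and the image file name in the usage comment are assumptions made for illustration. It relies on Pillow (installed as a dependency of pytesseract) to load the image and on pytesseract.image_to_string to run the OCR.

```python
def extract_text(image_path, lang="eng"):
    """Run Tesseract OCR on an image file and return the recognized text.

    Hypothetical helper for illustration; requires pytesseract, Pillow,
    and the tesseract executable to be installed.
    """
    # Imported lazily so this file can be loaded without the OCR dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        # image_to_string invokes the tesseract executable and returns a str.
        return pytesseract.image_to_string(img, lang=lang)


# Example usage (the file name is a placeholder, not from the recipe):
# print(extract_text("some_image.png"))
```

If the executable is installed somewhere that is not on PATH, which is common on Windows, pytesseract can be pointed at it by setting pytesseract.pytesseract.tesseract_cmd to the full path of the executable before calling image_to_string.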