Python OCR for All Images in a Folder Simultaneously
If you have a folder full of images that contain text, and you need to extract that text either into a separate folder (one file per image, named after the image) or into a single file, this is the code you are looking for.
This article covers the basics of OCR (Optical Character Recognition) and shows how to create a .txt output file for each image in the main folder and save it in a predetermined location.
Required Libraries –
pip3 install pillow
The os module is part of the Python standard library and does not need to be installed with pip.
You’ll also need Tesseract-OCR and pytesseract. Tesseract-OCR is a separate program that has to be downloaded and installed on its own, and pytesseract can be installed using pip3 install pytesseract.
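pytesseract is only a wrapper that calls the Tesseract program, so OCR fails if that program cannot be found. Below is a minimal sketch of a pre-flight check that uses only the standard library. The helper name `find_tesseract` is hypothetical, and the sketch assumes the executable is called `tesseract` and is expected on the PATH:

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("Tesseract was not found on PATH; install it or add it to PATH.")
    else:
        print("Tesseract found at", location)
```

If Tesseract is installed somewhere outside the PATH, pytesseract can be pointed at it by assigning the full path of the executable to `pytesseract.pytesseract.tesseract_cmd`.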
The following Python program stores the text of each image in its own file:
<pre><code class="language-python line-numbers"># Python program to extract text from all the images in a folder,
# storing the text in corresponding files in a different folder
from PIL import Image
import pytesseract as pt
import os


def main():
    # path of the folder containing the raw images
    path = "E:\\GeeksforGeeks\\images"

    # path of the folder that receives the output text files
    tempPath = "E:\\GeeksforGeeks\\textFiles"

    # iterating over the images inside the folder
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)

        # applying OCR using pytesseract
        text = pt.image_to_string(img, lang="eng")

        # removing the extension (e.g. .jpg) from the image name
        baseName = os.path.splitext(imageName)[0]
        fullTempPath = os.path.join(tempPath, 'time_' + baseName + ".txt")
        print(text)

        # saving the text for each image in a separate .txt file
        with open(fullTempPath, "w") as file1:
            file1.write(text)


if __name__ == '__main__':
    main()</code></pre>
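Note that os.listdir returns every entry in the folder, including subfolders and files that are not images, and Image.open cannot read those. Here is a small sketch of how the loop input could be filtered and the output file name derived. The helper names and the extension list are illustrative assumptions, not part of the original program:

```python
import os

# Extensions treated as images here; an assumed list, adjust as needed.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff")


def list_image_files(folder):
    """Return the sorted names of files in folder that have an image extension."""
    names = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        # Skip subfolders and anything without a known image extension.
        if os.path.isfile(full_path) and name.lower().endswith(IMAGE_EXTENSIONS):
            names.append(name)
    return names


def text_file_name(image_name, prefix="time_"):
    """Build the output name, e.g. 'scan.jpeg' becomes 'time_scan.txt'."""
    base_name = os.path.splitext(image_name)[0]
    return prefix + base_name + ".txt"
```

os.path.splitext handles extensions of any length, whereas slicing off the last four characters would mangle a name such as photo.jpeg.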
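Input image:
image_sample1
Output:
The text recognized in each image is printed to the console and saved to a .txt file, named after the image, in the output folder.
The following program instead stores the text of all images in a single file: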
<pre><code class="language-python line-numbers"># extract text from all the images in a folder,
# storing the text in a single file
from PIL import Image
import pytesseract as pt
import os


def main():
    # path of the folder containing the raw images
    path = "E:\\GeeksforGeeks\\images"

    # path of the file in which the output is kept
    fullTempPath = "E:\\GeeksforGeeks\\output\\outputFile.txt"

    # iterating over the images inside the folder
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)

        # applying OCR using pytesseract
        text = pt.image_to_string(img, lang="eng")

        # appending the text to the output file:
        # the "a+" mode creates the file if it is not present
        # and appends to its content if it is
        with open(fullTempPath, "a+") as file1:
            # writing the name of the image
            file1.write(imageName + "\n")

            # writing the content of the image
            file1.write(text + "\n")

    # printing the output file
    with open(fullTempPath, 'r') as file2:
        print(file2.read())


if __name__ == '__main__':
    main()</code></pre>
Input image:
image_sample1
image_sample2
Output:
This produces a single file containing the text extracted from all the images in the folder. The file format is as follows:
Name of the image
Content of the image
Name of the next image, and so on.
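The layout described above can be reproduced without running OCR at all. Below is a minimal sketch, assuming the extracted text is already available as (name, text) pairs. The function `write_combined` and the sample strings are made up for illustration and stand in for real OCR output:

```python
import os
import tempfile


def write_combined(results, output_path):
    """Append each image name followed by its text to output_path."""
    # Mode "a" creates the file if it is missing and appends otherwise.
    with open(output_path, "a", encoding="utf-8") as out_file:
        for image_name, text in results:
            out_file.write(image_name + "\n")
            out_file.write(text + "\n")


if __name__ == "__main__":
    # Placeholder strings standing in for real OCR output.
    sample = [
        ("image_sample1.jpg", "first text"),
        ("image_sample2.jpg", "second text"),
    ]
    target = os.path.join(tempfile.mkdtemp(), "outputFile.txt")
    write_combined(sample, target)
    with open(target, encoding="utf-8") as in_file:
        print(in_file.read())
```

Because the file is opened in append mode, running the program twice adds a second copy of the results. Delete the output file first if a fresh run is wanted.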