Summary: Guest blogger, Ben Vierck, talks about using Windows PowerShell to export text from an image.
Microsoft Scripting Guy, Ed Wilson, is here. Welcome back guest blogger, Ben Vierck, for Part 2 of PSImaging. Read Part 1 before diving into today’s post: PSImaging Part 1: Test-Image.
Now, here’s Ben...
In the first blog post of this series, we wrote the Windows PowerShell function Test-Image to definitively detect whether a file is a known image type by analyzing the first 8 bytes of its header. In this post, we're going to build a Windows PowerShell cmdlet called Export-ImageText that can easily export text from our scanned document images.
Several popular cloud drive offerings have recently begun offering Optical Character Recognition (OCR) as a free add-on to their service. Among others, check out:
- Office Lens: A OneNote scanner for your pocket
- Evernote Scannable
- About Optical Character Recognition in Google Drive
High-quality OCR was once the sole purview of tremendously expensive enterprise software. Now it's a commoditized add-on feature for cloud services. This raises the question, "How can we take advantage of modern OCR on our own systems?"
Let's start with the most accurate open-source OCR engine available: Tesseract-ocr by Google. After installing the Tesseract runtimes, one option is to automate the executable. Instead, I chose to wrap the SDK in a Windows PowerShell binary module. By doing this, we can bundle the dependencies into the module folder so that distribution is a piece of cake.
Rather than leaving this as an exercise for the reader, I've done the work and open-sourced the project here: Positronic-IO/PSImaging. To get the PSImaging module without the source, you can run this one-liner:
& ([scriptblock]::Create((iwr -uri http://tinyurl.com/Install-GitHubHostedModule).Content)) -GitHubUserName Positronic-IO -ModuleName PSImaging -Branch 'master' -Scope CurrentUser
Now let's play...
I have a folder with sample scanned documents, including an image with the repeating text of the Quick Brown Fox. Let's start by extracting all of the text from this file:
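The exact command is not reproduced in the text, so here is a minimal sketch of what the full-image run might look like. It assumes the PSImaging module is installed, that the sample file sits in the current folder, and that Export-ImageText accepts piped file objects without a -Rect argument, in the same way as the -Rect example later in this post:

```powershell
# Assumption: the PSImaging module is installed for the current user.
Import-Module PSImaging

# Pipe the file to the cmdlet with no rectangle, so the whole image is read.
$text = Get-Item .\Quick-Brown-Fox.png | Export-ImageText

# Display whatever text the OCR engine recognized.
$text
```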
Right away, I notice that this command seems slow. In fact, it clocks in at 1.3 seconds on my machine. Luckily, we can limit what gets read by passing in a rectangle. Let's see how narrowing the scope this way affects performance.
First, we'll isolate an interesting rectangle. Here I've opened the Quick-Brown-Fox.png file in Paint, and I added a rectangle around the word "fox":
Paint tells us the coordinates: x,y = 172,152 and w,h = 36,33. Let's pass these values to a System.Drawing.Rectangle (the constructor takes x, y, width, and height, in that order):
$rect = New-Object System.Drawing.Rectangle 172,152,36,33
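To double-check that each value landed where expected, a quick sketch like the following lists the rectangle's properties. The Add-Type line is an assumption on my part: it is only needed if System.Drawing is not already loaded in your session.

```powershell
# Load System.Drawing in case the session has not loaded it yet.
Add-Type -AssemblyName System.Drawing

# Constructor order is x, y, width, height.
$rect = New-Object System.Drawing.Rectangle 172,152,36,33

# Right = X + Width = 208, Bottom = Y + Height = 185.
$rect | Select-Object X, Y, Width, Height, Right, Bottom
```

If Width and Height come out swapped relative to what Paint reports, the last two constructor arguments are in the wrong order.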
Now we'll pass $rect to our Export-ImageText cmdlet:
dir .\Quick-Brown-Fox.png | Export-ImageText -Rect $rect
Profiling this run of the command shows that it took just 200 ms. At that rate, a database of a million scanned images works out to about 56 hours of single-threaded processing, compared with roughly 15 days at 1.3 seconds per image. That's a command I can run at scale and be done in a reasonable amount of time.
Next up in this series, we'll leverage another open-source technology within Windows PowerShell to automatically group images by document similarity.
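If you want to reproduce timings like these on your own machine, one way is to average several runs with Measure-Command. The helper below is a sketch: Get-AverageMilliseconds is a hypothetical name of my own, not part of PSImaging, and your numbers will differ with your hardware and images.

```powershell
# Hypothetical helper: run a script block several times and return the
# average elapsed time in milliseconds.
function Get-AverageMilliseconds {
    param(
        [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
        [int] $Count = 5
    )

    # Collect one timing sample per run.
    $samples = foreach ($i in 1..$Count) {
        (Measure-Command -Expression $ScriptBlock).TotalMilliseconds
    }

    # Average the samples to smooth out one-off spikes.
    ($samples | Measure-Object -Average).Average
}

# Usage, assuming PSImaging is imported and $rect is defined as shown earlier:
# Get-AverageMilliseconds { Get-Item .\Quick-Brown-Fox.png | Export-ImageText }
# Get-AverageMilliseconds { Get-Item .\Quick-Brown-Fox.png | Export-ImageText -Rect $rect }
```

Averaging over a few runs matters here because the first call in a session usually includes one-time costs, such as loading the module, that would otherwise skew the comparison.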
~Ben
Thanks again, Ben. I'm looking forward to tomorrow's post.
I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.
Ed Wilson, Microsoft Scripting Guy