PSImaging Part 2: Export-Text from Images

Summary: Guest blogger, Ben Vierck, talks about using Windows PowerShell to export text from an image.

Microsoft Scripting Guy, Ed Wilson, is here. Welcome back guest blogger, Ben Vierck, for Part 2 of PSImaging. Read Part 1 before diving into today’s post: PSImaging Part 1: Test-Image.

Now, here’s Ben...

In the first blog post of this series, we wrote the Windows PowerShell function Test-Image to definitively detect whether a file is a known image type by analyzing the first 8 bytes of its header. In this post, we're going to write a Windows PowerShell cmdlet called Export-ImageText that can easily export text from our scanned document images.

Several popular cloud drive offerings have recently begun offering Optical Character Recognition (OCR) as a free add-on to their service. Among others, check out:

High-quality OCR was once the sole purview of tremendously expensive enterprise software. Now it's a commoditized add-on feature for cloud services. This raises the question, "How can we take advantage of modern OCR on our own systems?"

Let's start with the most accurate open-source OCR engine available: Tesseract-ocr by Google. After installing the Tesseract runtimes, one option is to automate the executable. Instead, I chose to wrap the SDK in a Windows PowerShell binary module. By doing this, we can bundle the dependencies into the module folder so that distribution is a piece of cake.

Rather than leaving this as an exercise for the reader, I've done the work and open-sourced the project here: Positronic-IO/PSImaging. To get the PSImaging module without the source, you can run this one-liner:

& ([scriptblock]::Create((iwr -uri http://tinyurl.com/Install-GitHubHostedModule).Content)) -GitHubUserName Positronic-IO -ModuleName PSImaging -Branch 'master' -Scope CurrentUser

Now let's play...

I have a folder with sample scanned documents, including an image with the repeating text of the Quick Brown Fox. Let's start by extracting all of the text from this file:

Right away, I notice that this command seems slow. In fact, it clocks in at 1.3 seconds on my machine. Luckily, we can limit what gets read by passing in a rectangle. Let's see how limiting the scope this way affects performance.
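One way to get a timing like this is the built-in Measure-Command cmdlet. The sketch below is not part of PSImaging: Measure-AverageMs is a hypothetical helper name, and averaging over three runs is an assumption made here to smooth out one-off noise.

```powershell
# Hypothetical helper (not part of PSImaging): average wall-clock time of a script block.
function Measure-AverageMs {
    param(
        [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
        [int] $Count = 3
    )
    $samples = foreach ($i in 1..$Count) {
        # Measure-Command returns a TimeSpan and discards the script block's output.
        (Measure-Command -Expression $ScriptBlock).TotalMilliseconds
    }
    # Average of the collected samples, in milliseconds.
    ($samples | Measure-Object -Average).Average
}
```

With the PSImaging module imported and the sample image in the current folder, a call such as Measure-AverageMs -ScriptBlock { dir .\Quick-Brown-Fox.png | Export-ImageText } should give a figure you can compare between runs.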

First, we'll isolate an interesting rectangle. Here I've opened the Quick-Brown-Fox.png file in Paint, and I added a rectangle around the word "fox":

Paint tells us the coordinates: x,y = 172,152 and w,h = 36,33. Let's pass those values to the System.Drawing.Rectangle constructor, which takes x, y, width, and height in that order:

$rect = New-Object System.Drawing.Rectangle 172,152,36,33
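To double-check that the numbers landed where we expect, we can inspect the rectangle's properties. This is only a quick sanity check. The Add-Type line is an assumption for sessions where System.Drawing has not already been loaded.

```powershell
# Load System.Drawing in case nothing in the session has loaded it yet.
Add-Type -AssemblyName System.Drawing

# The constructor takes x, y, width, height, in that order.
$rect = New-Object System.Drawing.Rectangle 172,152,36,33

$rect.X, $rect.Y           # upper-left corner: 172, 152
$rect.Width, $rect.Height  # size: 36, 33
$rect.Right, $rect.Bottom  # X + Width and Y + Height: 208, 185
```

If the width and height look swapped, the rectangle will clip the word, so it's worth a glance before running OCR on it.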

Now we'll pass $rect to our Export-ImageText cmdlet:

dir .\Quick-Brown-Fox.png | Export-ImageText -Rect $rect

Profiling this run of the command shows us that it took just 200 ms. That's a command I can run on a database of a million scanned images and be done in a reasonable amount of time.
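To put those two timings in perspective, here is a rough back-of-the-envelope projection. It assumes the images are processed one at a time on a single machine and that every image takes about as long as my sample did. Both are simplifications.

```powershell
$imageCount = 1000000

# Seconds per image, taken from the two timings in this post.
$secondsPerImage = [ordered]@{ 'Full image' = 1.3; 'Rectangle only' = 0.2 }

$projectedDays = [ordered]@{}
foreach ($name in $secondsPerImage.Keys) {
    # Total sequential run time for the whole set, expressed as a TimeSpan.
    $total = [TimeSpan]::FromSeconds($imageCount * $secondsPerImage[$name])
    $projectedDays[$name] = [math]::Round($total.TotalDays, 1)
    '{0}: about {1} days' -f $name, $projectedDays[$name]
}
```

Under those assumptions, the full-image run works out to roughly 15 days, and the rectangle-only run to a little over 2 days.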

Next up in this series, we'll leverage another open-source technology within Windows PowerShell to automatically group images by document similarity.

~Ben

Thanks again, Ben. I'm looking forward to tomorrow's post.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy
