Can't extract the accurate text embedded in an image


I've written a script in Python using pytesseract to get the text embedded in an image. When I run my script, the scraper does its job poorly: the text I get as a result is quite different from what is in the image.

Script I've tried with:

import requests
import io
import pytesseract
from PIL import Image

response = requests.get('http://skoleadresser.no/4DCGI/WC_Pedlex_Adresse/864928.jpg')
img = Image.open(io.BytesIO(response.content))
imagetext = pytesseract.image_to_string(img)
print(imagetext)

The text in the image looks like:

[image: the address text in the source image]

The result I'm getting:

Adresse WM 0an Hanssensm 7 A 4u21 Slavanqer  warm 52 m so no  Te‘efaks 52 m 90 m  E'Dus‘x Van’s strandflanlmu 

How can I get the accurate result?

 


tl;dr:

import requests
import io
import pytesseract
from PIL import Image

response = requests.get('http://skoleadresser.no/4DCGI/WC_Pedlex_Adresse/864928.jpg')
img = Image.open(io.BytesIO(response.content))

width, height = img.size
new_size = width*6, height*6
img = img.resize(new_size, Image.LANCZOS)
img = img.convert('L')
img = img.point(lambda x: 0 if x < 155 else 255, '1')

imagetext = pytesseract.image_to_string(img)
print(imagetext)

Results in:

Adresse Prof. Olav Hanssens vei 7 A
4021 Stavanger

Telefon 52 70 90 00

Telefaks 52 70 90 01

E-post vanja.strand@aof.no

Instructions/How To

OCR is designed to recognize letters in a printed, handwritten or typed document scanned at high resolution with essentially no blur. There may be tools dedicated to low-resolution, blurry digital images, but in general OCR can't guess letters from such input at any reasonable rate: the image is simply too blurry and has too few pixels for an OCR tool to do anything useful with.

This may sound as if there is little chance of getting it to work. Just scaling the image up without any further processing doesn't do the trick, as you'll see later on: the result would still be too far from resembling typed/printed text.

I did some trial and error with the scaling factor and found 6 to work best with this image, so:

width, height = img.size
new_size = width*6, height*6

Scaling it up by a factor of 6 without any resampling:

img = img.resize(new_size) 

Gives us this image, which is pretty useless: it is basically the same unreadable image as before, except that each 1px*1px block is now 6px*6px (notice the grey areas that almost intersect between the letters; Pr, s and k in particular will cause big problems):

[image: upscaled by 6 without resampling]

Fortunately there are resampling filters that give very good results; in PIL there is PIL.Image.LANCZOS (amongst others), which applies the Lanczos resampling formula:

img = img.resize(new_size, Image.LANCZOS) 
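For intuition, the Lanczos kernel is a windowed sinc function, L(x) = sinc(x)·sinc(x/a) for |x| < a and 0 outside the window; the resized pixel values are weighted sums of neighbours using these weights. A minimal sketch of the kernel in plain Python (a=3 here, which as far as I know matches the support Pillow uses for LANCZOS):

```python
import math

def lanczos_kernel(x, a=3):
    # Windowed sinc: sinc(x) * sinc(x/a) inside |x| < a, else 0.
    # Expanded: a * sin(pi*x) * sin(pi*x/a) / (pi*x)^2
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

print(lanczos_kernel(0))    # 1.0 at the center
print(lanczos_kernel(1.5))  # small negative lobe between the zeros
print(lanczos_kernel(3))    # 0.0 outside the window
```

The negative lobes are what give Lanczos its slight sharpening effect compared to a plain box or bilinear filter.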

[image: upscaled by 6 with Lanczos resampling]

The difference may not seem huge at first, but now the letters have a better fill instead of those black and grey blocks, and a much more natural blur which we can work with in the next step. Looking at Pr, s and k now, we see that they no longer intersect as badly.

What needs to be done next, in order to make the image look more like an actually printed document, is to make it black and white by removing the blur. The first step is to convert the image to mode L (8-bit greyscale):

img = img.convert('L') 
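As a side note, PIL's RGB-to-L conversion uses the ITU-R 601-2 luma transform, L = R·299/1000 + G·587/1000 + B·114/1000; per pixel it can be sketched as:

```python
def rgb_to_l(r, g, b):
    # ITU-R 601-2 luma transform, the formula PIL documents for mode 'L';
    # green contributes most because the eye is most sensitive to it.
    return (r * 299 + g * 587 + b * 114) // 1000

print(rgb_to_l(255, 255, 255))  # white -> 255
print(rgb_to_l(0, 0, 0))        # black -> 0
print(rgb_to_l(255, 0, 0))      # pure red -> 76, a fairly dark grey
```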

[image: converted to 8-bit greyscale]

Of course there is virtually no visible difference, since the source image was already black text on a white background, but this step is needed to work with a brightness threshold and transform it into a true b/w image.

This is done by evaluating every single pixel through its 8-bit value; a good value to start with is 128, which is 50% black:

img = img.point(lambda x: 0 if x < 128 else 255, '1') 
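What img.point() does here can be sketched per pixel in plain Python: every grey value below the threshold becomes black (0), everything else white (255). The sample grey values below are made up for illustration:

```python
# Emulate the threshold lambda on a handful of 8-bit grey values
# (0 = black, 255 = white); in PIL this runs over every pixel.
threshold = 128
pixels = [12, 90, 127, 128, 200, 255]
binarised = [0 if x < threshold else 255 for x in pixels]
print(binarised)  # [0, 0, 0, 255, 255, 255]
```

Note that 128 itself lands on the white side, since the comparison is strict.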

Which gives us text that is far too thin; the OCR tool will recognize most 5s as S and some 0s as O:

[image: threshold 128, text far too thin]

Setting the brightness threshold to 200 instead, we get the following image:

[image: threshold 200, text too bold]

The OCR tool can handle this text since it looks like a bold font. But, as stated previously, OCR tools aim at normally printed text, so it might well fail on text that is actually bold in the image, which would come out far too heavy compared to the normal text.

Let's set the threshold somewhere between 128 and 200 so that we get natural-looking printed text. By trial and error I found 155 to work well, producing roughly the same font weight as the original image:

[image: threshold 155, natural font weight]

Since this looks very much like a high-res scan of a poorly printed b/w document, the OCR tool can now do its job properly.
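If you want to automate the trial and error, one rough sketch is to sweep candidate thresholds and compare a cheap proxy such as the black-pixel ratio; on the real image you would binarise with img.point() at each threshold and pick the one whose output OCRs best. The grey values below are made up so the snippet stays self-contained:

```python
# Sweep candidate thresholds and report the fraction of pixels that
# end up black; a higher threshold turns more grey pixels black,
# i.e. produces a heavier-looking font.
sample_pixels = [30, 60, 90, 120, 150, 180, 210, 240]  # made-up grey values

for threshold in (128, 155, 200):
    black_ratio = sum(x < threshold for x in sample_pixels) / len(sample_pixels)
    print(threshold, black_ratio)
```

On this toy data the ratio climbs from 0.5 at threshold 128 to 0.75 at 200, mirroring the too-thin / too-bold extremes shown above.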
