I've written a script using pytesseract to extract the text embedded in an image. When I run it, the OCR does its job poorly: the text I get as a result is quite different from what is in the image.
Script I've tried with:
import requests, io, pytesseract
from PIL import Image

response = requests.get('http://skoleadresser.no/4DCGI/WC_Pedlex_Adresse/864928.jpg')
img = Image.open(io.BytesIO(response.content))
imagetext = pytesseract.image_to_string(img)
print(imagetext)
The text in the image looks like:
Result I'm having:
Adresse WM 0an Hanssensm 7 A 4u21 Slavanqer warm 52 m so no Te‘efaks 52 m 90 m E'Dus‘x Van’s strandﬂanlmu
How can I get an accurate result?
import requests
import io
import pytesseract
from PIL import Image

# download the image and open it with PIL
response = requests.get('http://skoleadresser.no/4DCGI/WC_Pedlex_Adresse/864928.jpg')
img = Image.open(io.BytesIO(response.content))

# upscale by a factor of 6 using Lanczos resampling
width, height = img.size
new_size = width*6, height*6
img = img.resize(new_size, Image.LANCZOS)

# convert to greyscale, then binarise with a brightness threshold of 155
img = img.convert('L')
img = img.point(lambda x: 0 if x < 155 else 255, '1')

imagetext = pytesseract.image_to_string(img)
print(imagetext)
Adresse Prof. Olav Hanssens vei 7 A
Telefon 52 70 90 00
Telefaks 52 70 90 01
OCR is designed to recognize letters in a printed, handwritten or typed document that has been scanned at high resolution with essentially no blur. There may be tools dedicated to scanning low-resolution, blurry digital images, but in general OCR can't guess letters from such input at any reasonable rate - the image is simply too blurry and has too few pixels for an OCR tool to make anything useful of the data.
This may sound as if there were little chance of getting it to work. Nonetheless it can be done - but just scaling the image up without any further processing doesn't do the trick, as you'll see later on: the result would still be too far from resembling typed or printed text.
I did some trial and error with the scaling factor and found 6 to work best with this image, so:

width, height = img.size
new_size = width*6, height*6
Scaling it up by a factor of 6 without any smoothing (nearest-neighbour resampling - given explicitly here, since newer Pillow versions default to bicubic):

img = img.resize(new_size, Image.NEAREST)
Gives us this image, which is pretty useless because it is basically the exact same unreadable image as before, just with every single pixel blown up to a 6px*6px block (notice the grey areas which almost intersect between the letters - especially the k will lead to big problems):
Fortunately there are resampling filters which give very good results; PIL offers PIL.Image.LANCZOS (amongst others), which applies the Lanczos resampling formula:
img = img.resize(new_size, Image.LANCZOS)
The difference may not seem that huge at first - but now we have a better fill for the letters instead of those black and grey blocks, and a much more natural blur which we can work with in the next step. Looking now at the k, we see that the letters don't intersect as badly anymore.
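The effect of the resampling filter can be seen even on a tiny synthetic image: nearest-neighbour scaling only repeats the existing pixel values, while Lanczos interpolates and introduces intermediate grey levels that smooth the edges. A small sketch with made-up pixel data:

```python
from PIL import Image

# Two-pixel greyscale image: one black and one white pixel.
src = Image.new('L', (2, 1))
src.putdata([0, 255])

nearest = src.resize((12, 6), Image.NEAREST)  # blocky upscaling
lanczos = src.resize((12, 6), Image.LANCZOS)  # Lanczos resampling

print(sorted(set(nearest.getdata())))   # only the original values: [0, 255]
print(len(set(lanczos.getdata())) > 2)  # True: intermediate greys appear
```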
What needs to be done next, to make the image look more like an actually printed document, is to make it black and white by removing the blur. The first step is to convert the image to mode L (8-bit greyscale pixels):
img = img.convert('L')
Of course there is virtually no visible difference, since the source image was already black text on a white background - but you still need this step to be able to apply a brightness threshold and transform it into a b/w image.
This is done by evaluating every single pixel in the image through its 8-bit brightness value; a good value to start trying is 128, which is 50% black:
img = img.point(lambda x: 0 if x < 128 else 255, '1')
Which gives us a text that is far too thin, so the OCR tool will still misread many of the characters.
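Rather than guessing thresholds blindly, one option is to look at the greyscale histogram to see where the dark (text) and light (background) pixels cluster. A minimal sketch, using synthetic pixel data since the actual image isn't included here:

```python
from PIL import Image

# Tiny synthetic greyscale image: four dark and four light pixels.
img = Image.new('L', (4, 2))
img.putdata([10, 20, 240, 250, 15, 245, 30, 235])

hist = img.histogram()   # 256 counts, one per 8-bit brightness level
dark = sum(hist[:128])   # pixels darker than 50% grey
light = sum(hist[128:])  # pixels at or above 50% grey
print(dark, light)       # 4 4
```

A real scan will show two clusters; a threshold in the valley between them is a reasonable starting point.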
Now, setting the brightness threshold to 200, we get the following image:
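For reference, that bolder variant is the same point() call with the higher cutoff, sketched here on two synthetic pixels:

```python
from PIL import Image

# Same binarisation as before, but with a threshold of 200: every pixel
# darker than 200 becomes black, so the letters come out much bolder.
img = Image.new('L', (2, 1))
img.putdata([180, 220])      # synthetic: one mid-grey, one light pixel
bold = img.point(lambda x: 0 if x < 200 else 255, '1')
print(list(bold.getdata()))  # [0, 255]
```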
The OCR tool can handle this text since it looks just like a bold font - but, as stated previously, OCR tools aim at normally printed text, so chances are it would fail to recognize text that is actually bold within the image, since that would come out far too bold in comparison to normal text.
Let's set the threshold somewhere between 128 and 200 so that we get a natural-looking printed text. With a bit of trial and error I found 155 to work great, making it look like the same font weight as in the original image:
Since this looks very much like a high-res scan of a poorly printed b/w document, the OCR tool can now do its job properly.
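To reuse this on other images, the steps above can be wrapped into a helper; the scale factor and threshold defaults are the trial-and-error values for this particular image and will likely need re-tuning elsewhere:

```python
from PIL import Image

def preprocess(img, scale=6, threshold=155):
    """Upscale, greyscale and binarise an image before OCR."""
    width, height = img.size
    img = img.resize((width * scale, height * scale), Image.LANCZOS)
    img = img.convert('L')
    return img.point(lambda x: 0 if x < threshold else 255, '1')

# Tiny synthetic check: a 1x2 grey image becomes a 6x12 black/white one.
src = Image.new('L', (1, 2))
src.putdata([100, 200])
out = preprocess(src)
print(out.mode, out.size)  # 1 (6, 12)
```

The result can then be fed straight into pytesseract.image_to_string(preprocess(img)).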