Python: read PDF text straight across, as it looks in the PDF


If I use the code in the answer here: Extracting text from a PDF file using PDFMiner in python?

I can extract the text when I apply it to this PDF: https://www.tencent.com/en-us/articles/15000691526464720.pdf

However, under "CONSOLIDATED INCOME STATEMENT" it reads down the page: first the labels (Revenues, VAS, Online advertising, ...) and only later the numbers. I want it to read across, i.e.:

Revenues 73,528 49,552 73,528 66,392 VAS 46,877 35,108 etc.

Is there a way to do this?

I am also open to solutions other than pdfminer.

If I try this code with PyPDF2, not all of the text even shows up:

    # importing required modules
    import PyPDF2

    # creating a pdf file object
    pdfFileObj = open(file, 'rb')

    # creating a pdf reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # getting the number of pages in the pdf file
    a = pdfReader.numPages

    # creating a page object for each page and printing its text
    for i in range(0, a):
        pageObj = pdfReader.getPage(i)
        print(pageObj.extractText())

 


It is difficult to say why pdfminer is giving you the text extraction results that it does. Perhaps something is going awry with its layout algorithm.

The company I work for has sample code for its PDF library that does this. As an exercise to see whether the results you are seeking are achievable, I ran the TextExtract C# sample (which illustrates how to write code that extracts text from a PDF document) on your document and got the following from page 7:

CONSOLIDATED INCOME STATEMENT
RMB in million, unless specified
Unaudited Unaudited
1Q2018 1Q2017 1Q2018 4Q2017
Revenues 73,528 49,552 73,528 66,392
VAS 46,877 35,108 46,877 39,947
Online advertising 10,689 6,888 10,689 12,361
Others 15,962 7,556 15,962 14,084
Cost of revenues (36,486) (24,109) (36,486) (34,897)
Gross profit 37,042 25,443 37,042 31,495
Gross margin 50% 51% 50% 47%
Interest income 1,065 808 1,065 1,156
Other gains, net 7,585 3,191 7,585 7,906
Selling and marketing expenses (5,570) (3,158) (5,570) (6,022)
General and administrative expenses (9,430) (7,012) (9,430) (8,811)
Operating profit 30,692 19,272 30,692 25,724
Operating margin 42% 39% 42% 39%
Finance costs, net (654) (691) (654) (859)
Share of profit/(loss) of associates and joint ventures (319) (375) (319) (120)
Profit before income tax 29,719 18,206 29,719 24,745
Income tax expense (5,746) (3,658) (5,746) (3,123)
Profit for the period 23,973 14,548 23,973 21,622
Net margin 33% 29% 33% 33%
Attributable to:
Equity holders of the Company 23,290 14,476 23,290 20,797
Non-controlling interests 683 72 683 825
Non-GAAP profit attributable to equity holders of the Company 18,313 14,211 18,313 17,454
Earnings per share for profit attributable to equity holders of the Company (in RMB per share)
- basic 2.470 1.540 2.470 2.206
- diluted 2.435 1.522 2.435 2.177

As you can see, it returns the text the way you are asking for: 'across' the page, one table row at a time.
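I can't say how that library gets this order internally, but if you want to stay in Python, one common approach is to take text fragments together with their page coordinates and regroup them into rows yourself. The sketch below is only an illustration of that idea, not the library's method. It assumes you have already pulled (x, y, text) triples out of the PDF with whatever extractor you use (for example from the bounding boxes that pdfminer's layout objects carry). The function name read_across, the y_tolerance parameter, and the sample coordinates are all made up for this example.

```python
from typing import Iterable, List, Tuple

# (x, y, text): left edge, vertical position, and content of one text fragment.
# In PDF coordinates y grows upward, so a larger y means higher on the page.
Fragment = Tuple[float, float, str]


def read_across(fragments: Iterable[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group fragments into visual rows and return one string per row."""
    rows: List[List[Fragment]] = []
    # Visit fragments from the top of the page to the bottom.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Join the current row if this fragment sits at (nearly) the same
        # height as the fragment that started that row.
        if rows and abs(rows[-1][0][1] - frag[1]) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, order the fragments left to right.
    return [
        " ".join(text for _, _, text in sorted(row, key=lambda f: f[0]))
        for row in rows
    ]


# Hypothetical coordinates, listed the way a column-wise extractor would
# emit them: all labels first, then the numbers.
sample = [
    (50.0, 700.0, "Revenues"),
    (50.0, 685.0, "VAS"),
    (300.0, 700.2, "73,528"),
    (300.0, 685.1, "46,877"),
    (380.0, 699.9, "49,552"),
    (380.0, 684.8, "35,108"),
]

for line in read_across(sample):
    print(line)
```

The tolerance is there because fragments on the same visual line rarely share exactly the same y value. You would likely need to tune it to the font size of the document, and a multi-column page layout would need extra handling that this sketch does not attempt.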
