Parse PDFs using Tabula

16 Mar 2019

this is pretty sick (and accurate): parse your pdfs (with tables) using Tabula!!!

https://pypi.org/project/tabula-py/

i would first split the PDF into its individual pages, then use tabula on the required pages (it can parse multiple tables too!!)

To split your pdf into individual pages:


from PyPDF2 import PdfFileWriter, PdfFileReader

"""This outputs each selected page as a separate PDF page, with the page number as a prefix."""

dir_path = 'Desktop/abc/'
filename = 'abc.pdf'
pdf_reader = PdfFileReader(open(dir_path + filename, 'rb'))

for i in range(start_page, end_page):
    output = PdfFileWriter()
    output.addPage(pdf_reader.getPage(i-1))
    output_filepath = dir_path + str(i) + '_' + filename
    with open(output_filepath, 'wb') as outputStream:
        output.write(outputStream)

To extract text from your single pdf page:


from PyPDF2 import PdfFileWriter, PdfFileReader

filename = 'abc.pdf'
pdf_reader = PdfFileReader(open(filename, 'rb'))
pageObj = pdf_reader.getPage(0)
text = pageObj.extractText()
text = text.replace('\n', '').replace('\t', '')


##if the above doesn't work, try pdfminer.six for python 3+
## below example from https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/

import io
 
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
 
def extract_text_from_pdf(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
 
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, 
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)
 
        text = fake_file_handle.getvalue()
 
    # close open handles
    converter.close()
    fake_file_handle.close()
 
    if text:
        return text
 
if __name__ == '__main__':
    print(extract_text_from_pdf('abc.pdf'))

To extract a single table from your single pdf page:


import tabula

pdf_filepath = '1_abc.pdf'
df = tabula.read_pdf(pdf_filepath)

To extract multiple tables from your single pdf page:


import tabula

pdf_filepath = '1_abc.pdf'
df = tabula.read_pdf(pdf_filepath, multiple_tables=True)

too amazing - tabula parses tables almost perfectly!

read here for other sick features: https://blog.chezo.uno/tabula-py-now-able-to-extract-remote-pdf-and-multiple-tables-at-once-6108e24ac07c


BONUS

Camelot sounds really promising - might give it a try!

https://hackernoon.com/announcing-camelot-a-python-library-to-extract-tabular-data-from-pdfs-605f8e63c2d5


some other links for reference if you really want…

https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/

https://automatetheboringstuff.com/chapter13/