Parse PDFs using Tabula

16 Mar 2019

this is pretty sick (and accurate): parse your pdfs (with tables) using Tabula!!!

i would first split the PDF into its individual pages, then use tabula on the required pages (it can parse multiple tables too!!)

To split your pdf into individual pages:

from PyPDF2 import PdfFileWriter, PdfFileReader

"""This outputs each selected page as a separate PDF page, with the page number as a prefix."""

dir_path = 'Desktop/abc/'
filename = 'abc.pdf'
pdf_reader = PdfFileReader(open(dir_path + filename, 'rb'))

for i in range(start_page, end_page):
    output = PdfFileWriter()
    output_filepath = dir_path + str(i) + '_' + filename
    with open(output_filepath, 'wb') as outputStream:

To extract text from your single pdf page:

from PyPDF2 import PdfFileWriter, PdfFileReader

filename = 'abc.pdf'
pdf_reader = PdfFileReader(open(filename, 'rb'))
pageObj = pdf_reader.getPage(0)
text = pageObj.extractText()
text = text.replace('\n', '').replace('\t', '')

##if the above doesn't work, try pdfminer.six for python 3+
## below example from

import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
def extract_text_from_pdf(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, 
        text = fake_file_handle.getvalue()
    # close open handles
    if text:
        return text
if __name__ == '__main__':

To extract a single table from your single pdf page:

import tabula

pdf_filepath = '1_abc.pdf'
df = tabula.read_pdf(pdf_filepath)

To extract multiple tables from your single pdf page:

import tabula

pdf_filepath = '1_abc.pdf'
df = tabula.read_pdf(pdf_filepath, multiple_tables=True)

too amazing - tabula parses tables almost perfectly!

read here for other sick features:


Camelot sounds really promising - might give it a try!

some other links for reference if you really want…