Table Extraction from PDF

This section describes two methods for extracting tables from PDF files.

Note

To extract tables from any document, set the strategy parameter to hi_res in both methods below.

Method 1: Using partition_pdf

To extract tables from PDF files using partition_pdf, set the infer_table_structure parameter to True and the strategy parameter to hi_res.

Usage

from unstructured.partition.pdf import partition_pdf

fname = "example-docs/layout-parser-paper.pdf"

elements = partition_pdf(
    filename=fname,
    infer_table_structure=True,
    strategy="hi_res",
)

tables = [el for el in elements if el.category == "Table"]

print(tables[0].text)
print(tables[0].metadata.text_as_html)
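
The text_as_html value printed above is an HTML string, so it often helps to turn it into rows before further processing. The following is a minimal sketch using only the Python standard library. The names TableRowParser and html_table_to_rows are hypothetical helpers, and sample_html is a made-up string standing in for tables[0].metadata.text_as_html.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of each <tr> in an HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Hypothetical helper: parse an HTML table string into a list of rows."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample; in practice this would be tables[0].metadata.text_as_html.
sample_html = (
    "<table>"
    "<tr><th>Name</th><th>Count</th></tr>"
    "<tr><td>alpha</td><td>1</td></tr>"
    "</table>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```

This sketch ignores nested tables and merged cells (rowspan, colspan), so treat it as a starting point rather than a general HTML table reader.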

Method 2: Using Auto Partition or Unstructured API

By default, table extraction is disabled for the pdf, jpg, png, xls, and xlsx file types. To enable table extraction from PDFs and these other file types using Auto Partition or the Unstructured API, set the skip_infer_table_types parameter to '[]' and the strategy parameter to hi_res.

Usage: Auto Partition

from unstructured.partition.auto import partition

filename = "example-docs/layout-parser-paper.pdf"

elements = partition(
    filename=filename,
    strategy="hi_res",
    skip_infer_table_types='[]',  # keep the quotes around the square brackets
)

tables = [el for el in elements if el.category == "Table"]

print(tables[0].text)
print(tables[0].metadata.text_as_html)
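
The usage examples index tables[0] directly, which raises an IndexError when no table is detected. A minimal sketch of a guard follows. The first_table helper is hypothetical, and the SimpleNamespace objects are stand-ins for the elements that partition would return, assuming only that each element exposes category and text attributes as in the example above.

```python
from types import SimpleNamespace


def first_table(elements):
    """Return the first element whose category is "Table", or None if there is none."""
    return next((el for el in elements if el.category == "Table"), None)


# Stand-in objects for illustration only; real elements come from partition().
elements = [
    SimpleNamespace(category="Title", text="A heading"),
    SimpleNamespace(category="Table", text="Name Count alpha 1"),
]

table = first_table(elements)
if table is None:
    print("No tables were detected in this document.")
else:
    print(table.text)
```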

Usage: API Parameters

curl -X 'POST' \
    'https://api.unstructured.io/general/v0/general' \
    -H 'accept: application/json' \
    -H 'Content-Type: multipart/form-data' \
    -F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
    -F 'strategy=hi_res' \
    -F 'skip_infer_table_types=[]' \
    | jq -C . | less -R
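
If the response is saved rather than piped to jq, the table elements can be picked out in Python. This is a minimal sketch that assumes the response body is a JSON array of element objects with type, text, and metadata.text_as_html keys, mirroring the attributes used in the Python examples; verify this against an actual response. The tables_from_response helper is hypothetical and sample_body is made up for illustration.

```python
import json


def tables_from_response(body):
    """Return (text, html) pairs for every Table element in a JSON response body."""
    elements = json.loads(body)
    return [
        (el.get("text", ""), el.get("metadata", {}).get("text_as_html"))
        for el in elements
        if el.get("type") == "Table"
    ]


# Made-up response body for illustration; a real response carries more fields.
sample_body = json.dumps([
    {"type": "Title", "text": "A heading", "metadata": {}},
    {
        "type": "Table",
        "text": "Name Count alpha 1",
        "metadata": {
            "text_as_html": "<table><tr><td>alpha</td><td>1</td></tr></table>"
        },
    },
])

for text, html in tables_from_response(sample_body):
    print(text)
    print(html)
```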

Warning

You may get a warning when the pdf_infer_table_structure parameter is set to True and pdf is also included in the skip_infer_table_types list. Despite the conflict, tables are still extracted from the PDF.