Table Extraction from PDF
This section describes two methods for extracting tables from PDF files.
Note
To extract tables from any document, set the strategy parameter to hi_res for both methods below.
Method 1: Using partition_pdf
To extract tables from PDF files with partition_pdf, set the infer_table_structure parameter to True and the strategy parameter to hi_res.
Usage
from unstructured.partition.pdf import partition_pdf
fname = "example-docs/layout-parser-paper.pdf"
elements = partition_pdf(
    filename=fname,
    infer_table_structure=True,
    strategy='hi_res',
)
tables = [el for el in elements if el.category == "Table"]
print(tables[0].text)
print(tables[0].metadata.text_as_html)
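The text_as_html value printed above is an HTML table, so it can be turned into rows of cell text for further processing. The following is a minimal sketch that uses only Python's standard html.parser module. The TableRowParser class, the html_table_to_rows helper, and the sample HTML string are hypothetical and are not part of unstructured. The sketch assumes a simple, non-nested table.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of a simple HTML table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    """Hypothetical helper: parse an HTML table string into rows of strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hand-written sample; in practice pass tables[0].metadata.text_as_html.
sample_html = (
    "<table><thead><tr><th>Name</th><th>Value</th></tr></thead>"
    "<tbody><tr><td>alpha</td><td>1</td></tr></tbody></table>"
)
rows = html_table_to_rows(sample_html)
print(rows)
```

Tables with merged or nested cells would need a fuller HTML parser than this sketch.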
Method 2: Using Auto Partition or Unstructured API
By default, table extraction is disabled for the pdf, jpg, png, xls, and xlsx file types. To enable it for PDFs and these other file types through Auto Partition or the Unstructured API, set the skip_infer_table_types parameter to '[]' and the strategy parameter to hi_res.
Usage: Auto Partition
from unstructured.partition.auto import partition
filename = "example-docs/layout-parser-paper.pdf"
elements = partition(
    filename=filename,
    strategy='hi_res',
    skip_infer_table_types='[]',  # keep the quotes around the square brackets
)
tables = [el for el in elements if el.category == "Table"]
print(tables[0].text)
print(tables[0].metadata.text_as_html)
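The usage snippets index tables[0] directly, which raises an IndexError when no table element is returned (for example, if table inference is still skipped for the file type). The following is a minimal sketch of a guard. The first_table and describe_table helpers are hypothetical, and the stand-in objects only mimic the category, text, and metadata.text_as_html attributes used in this section.

```python
from types import SimpleNamespace


def first_table(elements):
    """Return the first element whose category is "Table", or None."""
    for el in elements:
        if getattr(el, "category", None) == "Table":
            return el
    return None


def describe_table(el):
    """Return (text, html) for an element; html is None if it is missing."""
    metadata = getattr(el, "metadata", None)
    html = getattr(metadata, "text_as_html", None)
    return el.text, html


# Stand-in objects for illustration only; real elements would come from
# partition() or partition_pdf().
fake_elements = [
    SimpleNamespace(category="Title", text="A title", metadata=SimpleNamespace()),
    SimpleNamespace(
        category="Table",
        text="a b",
        metadata=SimpleNamespace(
            text_as_html="<table><tr><td>a</td><td>b</td></tr></table>"
        ),
    ),
]

table = first_table(fake_elements)
if table is None:
    print("No tables found; check strategy and skip_infer_table_types.")
else:
    text, html = describe_table(table)
    print(text)
    print(html)
```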
Usage: API Parameters
curl -X 'POST' \
'https://api.unstructured.io/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'strategy=hi_res' \
-F 'skip_infer_table_types=[]' \
| jq -C . | less -R
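The JSON returned by the request can be filtered for tables in the same way as the Python element list. The sketch below assumes the response body is a JSON array of element objects with "type", "text", and "metadata" keys. That shape is an assumption here, not something this section states. The tables_from_response helper and the sample body are hypothetical and hand-written, not real API output.

```python
import json


def tables_from_response(elements):
    """Keep only the element dicts whose "type" is "Table" (assumed key)."""
    return [el for el in elements if el.get("type") == "Table"]


# Hand-written stand-in for a response body; not real API output.
sample_body = json.dumps([
    {"type": "Title", "text": "A title", "metadata": {}},
    {
        "type": "Table",
        "text": "a b",
        "metadata": {
            "text_as_html": "<table><tr><td>a</td><td>b</td></tr></table>"
        },
    },
])

found = tables_from_response(json.loads(sample_body))
for table in found:
    print(table["text"])
    # .get() avoids a KeyError if the HTML rendering is absent.
    print(table.get("metadata", {}).get("text_as_html"))
```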
Warning
You may get a warning when the pdf_infer_table_structure parameter is set to True and pdf is also included in the skip_infer_table_types list. Despite the conflict, tables are still extracted from the PDF.