Full Installation
Basic Usage
For a complete set of extras catering to every document type, use:
pip install "unstructured[all-docs]"
Installation for Specific Document Types
If you’re processing document types beyond the basics, you can install the necessary extras:
pip install "unstructured[docx,pptx]"
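To illustrate how the extras syntax composes, here is a minimal sketch that builds the pip command for a chosen set of document types. The helper name build_install_command is hypothetical and not part of unstructured; the set of extras simply mirrors the available document types listed in this section.

```python
# Document-type extras, mirroring the "Available document types" list.
DOC_TYPE_EXTRAS = {
    "csv", "doc", "docx", "epub", "image", "md", "msg", "odt",
    "org", "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx",
}


def build_install_command(extras):
    """Return a pip command string for the given document-type extras.

    Hypothetical helper for illustration; not part of unstructured.
    """
    unknown = sorted(set(extras) - DOC_TYPE_EXTRAS)
    if unknown:
        raise ValueError(f"Unknown document-type extras: {unknown}")
    # Sort and de-duplicate so the same selection always yields the same command.
    joined = ",".join(sorted(set(extras)))
    return f'pip install "unstructured[{joined}]"'


if __name__ == "__main__":
    print(build_install_command(["pptx", "docx"]))
```

The quotes around the requirement keep shells that treat square brackets specially from interpreting them.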
Available document types:
"csv", "doc", "docx", "epub", "image", "md", "msg", "odt", "org", "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx"
Installation for Specific Data Connectors
To use any of the data connectors, you must install the specific dependency:
pip install "unstructured[s3]"
Available data connectors:
"airtable", "azure", "azure-cognitive-search", "biomed", "box", "confluence", "delta-table", "discord", "dropbox", "elasticsearch", "gcs", "github", "gitlab", "google-drive", "jira", "mongodb", "notion", "opensearch", "onedrive", "outlook", "reddit", "s3", "sharepoint", "salesforce", "slack", "wikipedia"
Installation with conda on Windows
You can install and run unstructured on Windows with conda, but the process
involves a few extra steps. This section will help you get up and running.
1. Install Anaconda on your Windows machine.
2. Install Microsoft C++ Build Tools using the instructions in this Stack Overflow post. C++ build tools are required for the pycocotools dependency.
3. Run conda env create -f environment.yml using the environment.yml file in the unstructured repo to create a virtual environment. The environment will be named unstructured.
4. Run conda activate unstructured to activate the virtual environment.
5. Run pip install unstructured to install the unstructured library.
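After the last step, you may want to confirm that the package landed in the active environment. The following is a minimal sketch using only the Python standard library; the function name installed_version is a hypothetical helper, and it should be run with the unstructured conda environment activated.

```python
from importlib import metadata


def installed_version(package="unstructured"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("unstructured is not installed in this environment")
    else:
        print(f"unstructured {version} is installed")
```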
Setting up unstructured for local inference
If you need to run model inferences locally, there are a few additional steps you need to
take. The main challenge is installing detectron2 for PDF layout parsing. detectron2
does not officially support Windows, but it is possible to get it to install on Windows.
The installation instructions are based on the instructions LayoutParser provides
here.
1. Run pip install pycocotools-windows to install a Windows compatible version of pycocotools. Alternatively, you can run pip3 install "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI" as outlined in this GitHub issue.
2. Run git clone https://github.com/ivanpp/detectron2.git, then cd detectron2, then pip install -e . to install a Windows compatible version of the detectron2 library.
3. Install a Windows compatible version of iopath using the instructions outlined in this GitHub issue. First, run git clone https://github.com/facebookresearch/iopath --single-branch --branch v0.1.8. Then on line 753 in iopath/iopath/common/file_io.py change filename = path.split("/")[-1] to filename = parsed_url.path.split("/")[-1]. After that, navigate to the iopath directory and run pip install -e .
4. Run pip install "unstructured[local-inference]". This will install the unstructured_inference dependency.
At this point, you can verify the installation by running the following from the root directory of the unstructured repo:
from unstructured.partition.pdf import partition_pdf
partition_pdf("example-docs/layout-parser-paper-fast.pdf", url=None)
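To see what the call produced rather than only checking that it runs without error, you can tally the returned elements by type. This is a minimal sketch: summarize_elements is a hypothetical helper, and it assumes only that partition_pdf returns an iterable of element objects whose class names describe their type.

```python
from collections import Counter


def summarize_elements(elements):
    """Count elements by class name (for example Title or NarrativeText)."""
    return Counter(type(element).__name__ for element in elements)


if __name__ == "__main__":
    # Imported here so the helper above can be used without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", url=None)
    for name, count in summarize_elements(elements).most_common():
        print(f"{name}: {count}")
```

An empty tally would suggest that the layout model did not load or that the document path is wrong.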
Installing PaddleOCR
PaddleOCR is another package that is helpful to use in conjunction with unstructured.
You can use the following steps to install paddleocr in your unstructured conda
environment.
1. Run conda install -c esri paddleocr
2. If you have the Windows version of detectron2 cloned and installed locally, change the name of detectron2/tools to detectron2/detectron2_tools. Otherwise, you will hit the module name conflict error described in this issue.
3. Set the environment variable KMP_DUPLICATE_LIB_OK to "TRUE". This prevents the libiomp5md.dll linking issue described in this issue on GitHub.
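If you prefer not to change the variable system-wide, one option is to set it from Python for the current process only. The sketch below assumes the variable is read when the OpenMP runtime is first loaded, so it is set before paddleocr is imported; that ordering is an assumption rather than something this guide states.

```python
import os

# Assumption: set the variable before importing paddleocr (or any other library
# that loads the OpenMP runtime); setting it afterwards may be too late.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

print(os.environ["KMP_DUPLICATE_LIB_OK"])
```

This affects only the running Python process and any child processes it starts.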
At this point, you can verify the installation using the following commands. Choose a
.jpg image that contains text.
import numpy as np
from PIL import Image
from paddleocr import PaddleOCR
filename = "path/to/my/image.jpg"
img = np.array(Image.open(filename))
ocr = PaddleOCR(lang="en", use_gpu=False, show_log=False)
result = ocr.ocr(img=img)
Logging
You can set the logging level for the package with the LOG_LEVEL environment variable.
By default, the log level is set to WARNING. For debugging, consider setting the log
level to INFO or DEBUG.
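For a single Python process, the variable can be set through os.environ. The sketch below also shows how a level name such as DEBUG maps onto the standard logging levels. Note that resolve_log_level is a hypothetical helper written for illustration, not how unstructured reads the variable internally, and the need to set the variable before importing unstructured is an assumption.

```python
import logging
import os


def resolve_log_level(default="WARNING"):
    """Map the LOG_LEVEL environment variable to a numeric logging level.

    Hypothetical helper for illustration; unstructured's own handling may differ.
    """
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = logging.getLevelName(name)
    # getLevelName returns an int for known names and a string otherwise.
    return level if isinstance(level, int) else logging.WARNING


if __name__ == "__main__":
    # Assumption: the variable should be set before unstructured is imported.
    os.environ["LOG_LEVEL"] = "DEBUG"
    print(resolve_log_level())
```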
Extra Dependencies
Filetype Detection
The filetype module in unstructured uses libmagic to detect MIME types. For
this to work, you’ll need libmagic installed on your computer. On a Mac, you can run:
$ brew install libmagic
On Debian, run:
$ sudo apt-get install -y libmagic-dev
If you are on Windows using conda, run:
$ conda install -c conda-forge libmagic
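After installing, a quick way to check whether the shared library is visible is the standard library's ctypes.util.find_library. This is a minimal sketch with a hypothetical helper name, has_shared_library. A negative result is not conclusive, since the lookup rules differ by platform and may not cover libraries installed inside a conda environment.

```python
from ctypes.util import find_library


def has_shared_library(name):
    """Return True if the system's library search can locate the named library."""
    return find_library(name) is not None


if __name__ == "__main__":
    if has_shared_library("magic"):
        print("libmagic found")
    else:
        print("libmagic not found by the default search; check your installation")
```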
XML/HTML Dependencies
For XML and HTML parsing, you’ll need libxml2 and libxslt installed. On a Mac, you can do
that with:
$ brew install libxml2
$ brew install libxslt
Huggingface Dependencies
The transformers library requires the Rust compiler to be present on your system in
order to properly pip install. If a Rust compiler is not available on your system,
you can run the following command to install it:
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Additionally, some tokenizers in the transformers library require the sentencepiece
library. This is not included as an unstructured dependency because it only applies
to some tokenizers. See the
sentencepiece install instructions for
information on how to install sentencepiece if your tokenizer requires it.
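To see which of these prerequisites are already present, a short standard-library check can help. The sketch below uses a hypothetical helper, check_transformers_prerequisites, that looks for rustc on PATH and tests whether sentencepiece is importable. If you have just installed Rust, you may need to open a new shell before rustc appears on PATH.

```python
import importlib.util
import shutil


def check_transformers_prerequisites():
    """Report whether rustc is on PATH and sentencepiece is importable."""
    return {
        "rustc": shutil.which("rustc") is not None,
        "sentencepiece": importlib.util.find_spec("sentencepiece") is not None,
    }


if __name__ == "__main__":
    for name, available in check_transformers_prerequisites().items():
        print(f"{name}: {'available' if available else 'missing'}")
```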
Note on Older Versions
For versions of unstructured earlier than 0.9.0, the following installation pattern was recommended:
pip install "unstructured[local-inference]"
While “local-inference” remains supported in newer versions for backward compatibility, it might be deprecated in future releases. It’s advisable to transition to the “all-docs” extra for comprehensive support.