stephane.bio

ljvmiranda921/prodigy-pdf-custom-recipe: Custom recipe and utilities for document processing

URL
https://github.com/ljvmiranda921/prodigy-pdf-custom-recipe

🪐 spaCy Project: Prodigy recipes for document processing and layout understanding

This repository contains recipes that show how to use Prodigy and Hugging Face to annotate, train on, and review document layout datasets. We'll be fine-tuning a LayoutLMv3 model on FUNSD, a dataset of noisy scanned documents.


This project also illustrates how to design document processing solutions. I attempted to generalize this approach into a framework, which you can read more about on my blog.


📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| install | Install dependencies |
| hydrate-db | Hydrate the Prodigy database with annotated data from FUNSD |
| review | Review hydrated annotations |
| train | Train FUNSD model |
| qa | Perform QA for the test dataset using a trained model |
| clean-db | Drop all generated Prodigy datasets |
| clean-files | Clean all intermediary files |
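
As a minimal sketch of how these commands could be scripted, the Python snippet below builds the `spacy project run` invocation for one of the commands listed above. The helper name `build_run_command` and the `COMMANDS` tuple are hypothetical and exist only for illustration; they are not part of this repository. The sketch assumes spaCy is installed in the current Python environment.

```python
import sys

# Command names copied from the list above.
COMMANDS = ("install", "hydrate-db", "review", "train", "qa", "clean-db", "clean-files")


def build_run_command(name, dry=False):
    """Build the argument list for `python -m spacy project run <name>`."""
    if name not in COMMANDS:
        raise ValueError(f"unknown project command: {name!r}")
    argv = [sys.executable, "-m", "spacy", "project", "run", name]
    if dry:
        # --dry asks spaCy to print the steps without executing them.
        argv.append("--dry")
    return argv


# Show the arguments that follow the interpreter path.
print(build_run_command("train")[1:])
```

To actually execute a command, the returned list can be passed to `subprocess.run(argv, check=True)` from inside the project directory.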

⏭ Workflows

The following workflows are defined by the project. They can be executed using spacy project run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| all | install → hydrate-db → train |
| clean-all | clean-db → clean-files |
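
To make the ordering explicit, here is a small Python sketch that expands each workflow into its steps. The `WORKFLOWS` mapping mirrors the workflows listed above; the function name `expand_workflow` is hypothetical. In practice `spacy project run all` resolves the steps itself, so this only illustrates the order in which commands run.

```python
# Workflow definitions copied from the list above.
WORKFLOWS = {
    "all": ["install", "hydrate-db", "train"],
    "clean-all": ["clean-db", "clean-files"],
}


def expand_workflow(name):
    """Return the ordered command names that make up a workflow."""
    if name not in WORKFLOWS:
        raise KeyError(f"unknown workflow: {name!r}")
    # Return a copy so callers cannot mutate the definition.
    return list(WORKFLOWS[name])


for workflow in WORKFLOWS:
    print(workflow, "->", " -> ".join(expand_workflow(workflow)))
```

Note that neither workflow includes `review` or `qa`, so those commands have to be run individually.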

🗂 Assets

The following assets are defined by the project. They can be fetched by running spacy project assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/funsd.zip | URL | FUNSD dataset - noisy scanned documents for layout understanding |
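
After fetching the assets, it can be useful to confirm that the archive is in place before hydrating the database. The following Python sketch checks that `assets/funsd.zip` exists and is a readable zip file. The helper name `asset_is_ready` is hypothetical, and the check is an assumption about what a reasonable sanity test looks like rather than something the project itself performs.

```python
import zipfile
from pathlib import Path

# Asset location as listed above, relative to the project directory.
ASSET_PATH = Path("assets") / "funsd.zip"


def asset_is_ready(path=ASSET_PATH):
    """Return True if the file exists and looks like a zip archive."""
    path = Path(path)
    return path.is_file() and zipfile.is_zipfile(path)


status = "ready" if asset_is_ready() else "missing or not a zip archive"
print(f"{ASSET_PATH}: {status}")
```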

Made with Notion, Published on Super - 2026 © Stephane Boghossian
