poller
Processes documents and indexes them to be searched.
Every 5 minutes, this polls for new documents as follows:
- Moves any documents from `s3://cavepediav2-import` to `s3://cavepediav2-files` and updates the `metadata` table.
  - This table has a `split` column, indicating whether the file has been split into individual pages.
- Checks the `metadata` table for any unsplit files, then splits them, stores the pages in `s3://cavepediav2-pages`, and creates a row in the `embeddings` table for each page.
- Checks Claude for any OCR batches that have finished, then stores the results in the `embeddings` table.
- Checks the `embeddings` table for un-OCR'd pages and batches them in groups of 1000 to be OCR'd by Claude.
  - Only 1 batch is created per 5 minutes, as it can be easy to overload the server hosting the files.
  - A temporary public S3 file link is generated using a presigned S3 URL.
- Checks the `embeddings` table for any rows that have been OCR'd but do not have embeddings generated, then generates embeddings with Cohere.
  - No batching is used with Cohere.
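
The "one batch of 1000 per cycle" rule from the OCR step can be sketched as follows. This is only an illustration, not the poller's actual code: `next_ocr_batch`, `run_cycle`, and the `submit_batch` callback are hypothetical names, and the real submission to Claude and the database lookups are left out.

```python
from itertools import islice
from typing import Callable, Iterable, List

OCR_BATCH_SIZE = 1000  # group size described in the steps above


def next_ocr_batch(pending_page_keys: Iterable[str],
                   batch_size: int = OCR_BATCH_SIZE) -> List[str]:
    """Return at most one batch of un-OCR'd page keys for this cycle."""
    return list(islice(pending_page_keys, batch_size))


def run_cycle(pending_page_keys: Iterable[str],
              submit_batch: Callable[[List[str]], None]) -> int:
    """Submit a single batch (hypothetical callback) and report its size.

    Any pages beyond the first batch are left for a later cycle, which
    keeps the load on the file server bounded.
    """
    batch = next_ocr_batch(pending_page_keys)
    if batch:
        submit_batch(batch)
    return len(batch)
```

Calling `run_cycle` once per 5-minute poll would submit at most 1000 pages each time, so a backlog of 2500 pages would take three cycles to drain.
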
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `COHERE_API_KEY` | Yes | - | Cohere API key for embeddings |
| `S3_ACCESS_KEY` | Yes | - | S3/MinIO access key |
| `S3_SECRET_KEY` | Yes | - | S3/MinIO secret key |
| `DB_PASSWORD` | Yes | - | PostgreSQL password |
| `ANTHROPIC_API_KEY` | Yes | - | Claude API key for OCR |
| `DB_HOST` | No | localhost | PostgreSQL host |
| `DB_PORT` | No | 5432 | PostgreSQL port |
| `DB_NAME` | No | cavepediav2_db | PostgreSQL database name |
| `DB_USER` | No | cavepediav2_user | PostgreSQL username |
| `S3_ENDPOINT` | No | https://s3.bigcavemaps.com | S3 endpoint URL |
| `S3_REGION` | No | eu | S3 region |
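
As a rough illustration of how the required variables and defaults above fit together, the sketch below reads them from the environment and fails early if a required one is missing. The `load_config` helper is hypothetical and may not match how `main.py` actually reads its settings.

```python
import os
from typing import Dict, Mapping

# Variables marked "Yes" in the Required column.
REQUIRED = (
    "COHERE_API_KEY",
    "S3_ACCESS_KEY",
    "S3_SECRET_KEY",
    "DB_PASSWORD",
    "ANTHROPIC_API_KEY",
)

# Optional variables and their documented defaults.
DEFAULTS = {
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_NAME": "cavepediav2_db",
    "DB_USER": "cavepediav2_user",
    "S3_ENDPOINT": "https://s3.bigcavemaps.com",
    "S3_REGION": "eu",
}


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect settings, raising if any required variable is unset or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    config = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        config[name] = env.get(name, default)
    return config
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.
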
Development
# Create .env file with required variables
cp .env.example .env
# Install dependencies
uv sync
# Run
python main.py
Deployment
The poller is automatically built and pushed to git.seaturtle.pw/cavepedia/cavepediav2-poller:latest on push to main.
docker run \
-e COHERE_API_KEY="xxx" \
-e S3_ACCESS_KEY="xxx" \
-e S3_SECRET_KEY="xxx" \
-e DB_PASSWORD="xxx" \
-e DB_HOST="postgres" \
-e ANTHROPIC_API_KEY="xxx" \
git.seaturtle.pw/cavepedia/cavepediav2-poller:latest