Files
cavepediav2/poller/README.md
Paul Walko feabb681e8
Some checks failed
Build and Push Agent Docker Image / build (push) Failing after 1m9s
Build and Push Web Docker Image / build (push) Successful in 3m30s
docs, py 3.14
2025-12-12 17:42:40 +01:00

1.1 KiB

poller

Processes documents and indexes them to be searched.

Every 5 minutes, this polls for new documents as follows:

  1. Moves any documents from s3://cavepediav2-import to s3://cavepediav2-files and updates the metadata table.
    • This table has a split column, indicating if the file has been split into individual pages.
  2. Checks the metadata table for any unsplit files, then splits them and stores the pages in s3://cavepediav2-pages and creates an row in the embeddings table for each page.
  3. Checks claude for any OCR batches that have finished, then stores the results in the embeddings table.
  4. Checks the embeddings table for un-OCR'd pages and batches them in groups of 1000 to be OCR'd by claude.
    • Only 1 batch is created per 5 minutes, as it can be easy to overload the server hosting the files.
    • A temporary public S3 file link is generated using a presigned s3 url.
  5. Checks the embeddings table for any rows that have been OCR'd, but do not have embeddings generated, then generates embeddings with cohere.
    • No batching is used with cohere.