poller
Processes documents and indexes them to be searched.
Every 5 minutes, this polls for new documents as follows:
- Moves any documents from `s3://cavepediav2-import` to `s3://cavepediav2-files` and updates the `metadata` table.
  - This table has a `split` column, indicating whether the file has been split into individual pages.
- Checks the `metadata` table for any unsplit files, then splits them, stores the pages in `s3://cavepediav2-pages`, and creates a row in the `embeddings` table for each page.
- Checks Claude for any OCR batches that have finished, then stores the results in the `embeddings` table.
- Checks the `embeddings` table for un-OCR'd pages and batches them in groups of 1000 to be OCR'd by Claude.
  - Only 1 batch is created per 5 minutes, as it can be easy to overload the server hosting the files.
  - A temporary public S3 file link is generated using a presigned S3 URL.
- Checks the `embeddings` table for any rows that have been OCR'd but do not have embeddings generated, then generates embeddings with Cohere.
  - No batching is used with Cohere.
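
The row selection behind the last two steps can be sketched in a few lines of Python. This is only an illustration, not the poller's actual code: `PageRow`, its fields (including `batch_id`, which assumes submitted pages are marked somehow), and the function names are hypothetical, and the rows are held in memory rather than read from the real `embeddings` table.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

POLL_INTERVAL_SECONDS = 5 * 60  # the poller runs every 5 minutes
OCR_BATCH_SIZE = 1000           # pages per OCR batch


@dataclass
class PageRow:
    """Hypothetical stand-in for one row of the embeddings table."""
    key: str                                 # page object key (assumed column)
    batch_id: Optional[str] = None           # set once submitted for OCR (assumed column)
    ocr_text: Optional[str] = None           # None until OCR results are stored
    embedding: Optional[List[float]] = None  # None until embeddings are generated


def next_ocr_batch(rows: Sequence[PageRow],
                   limit: int = OCR_BATCH_SIZE) -> List[PageRow]:
    """Return at most one batch of pages that have not been sent for OCR.

    Calling this once per cycle gives the "only 1 batch per 5 minutes" limit.
    """
    pending = [r for r in rows if r.ocr_text is None and r.batch_id is None]
    return pending[:limit]


def rows_needing_embeddings(rows: Sequence[PageRow]) -> List[PageRow]:
    """Return rows that have OCR text but no embedding yet."""
    return [r for r in rows if r.ocr_text is not None and r.embedding is None]


if __name__ == "__main__":
    pages = [PageRow(key=f"doc/page-{i}.pdf") for i in range(2500)]
    pages[0].ocr_text = "already OCR'd"
    print(len(next_ocr_batch(pages)))           # size of this cycle's OCR batch
    print(len(rows_needing_embeddings(pages)))  # rows waiting for embeddings
```

With 2500 pending pages, a single cycle would pick up only the first 1000; the remainder waits for later cycles, which is what keeps load on the file server bounded.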