From 8f84e250aaa8b4cb5154b7996636132c2af19300 Mon Sep 17 00:00:00 2001
From: Paul Walko
Date: Wed, 22 Feb 2023 12:31:33 -0500
Subject: [PATCH] clarify py scripts

---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index c69fd32..6eca5f4 100644
--- a/README.md
+++ b/README.md
@@ -10,8 +10,9 @@
 1. Once a day, the `main.py` script runs, which:
     1. Pulls additions to deletions from `s3://pew-cavepedia-data/00_files/` to `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
     1. Validates `metadata.py` contains data for any new folders.
-    1. Runs `00-01.py`
-    1. Runs `01-02.py`
+    1. Runs `00-01.py`: OCR's PDFs.
+    1. Runs `01-02.py`: Processes OCR'd text and puts it into a `02_json` folder JSON blob with metadata fields.
+    1. Runs `02-03.py` (TODO): Copies JSON blobs from `02_json` into `03_index` with bad OCR text fixed.
     1. Pushes additions or deletions to `s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}`
     1. At this point all newly index data should be OCR'd and processed.
 1. Once a day, the cavepedia application (must be running on the same host), checks for any updates: