clarify py scripts

master
Paul Walko 2023-02-22 12:31:33 -05:00
parent dab94013f1
commit 8f84e250aa
1 changed file with 3 additions and 2 deletions

@@ -10,8 +10,9 @@
1. Once a day, the `main.py` script runs, which:
1. Pulls additions or deletions from `s3://pew-cavepedia-data/00_files/` to `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
1. Validates `metadata.py` contains data for any new folders.
-1. Runs `00-01.py`
-1. Runs `01-02.py`
+1. Runs `00-01.py`: OCRs PDFs.
+1. Runs `01-02.py`: Processes OCR'd text and writes it, with metadata fields, to a JSON blob in the `02_json` folder.
+1. Runs `02-03.py` (TODO): Copies JSON blobs from `02_json` into `03_index` with bad OCR text fixed.
1. Pushes additions or deletions to `s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}`
1. At this point all newly indexed data should be OCR'd and processed.
1. Once a day, the cavepedia application (which must be running on the same host) checks for any updates: