diff --git a/README.md b/README.md
index c69fd32..6eca5f4 100644
--- a/README.md
+++ b/README.md
@@ -10,8 +10,9 @@
 1. Once a day, the `main.py` script runs, which:
     1. Pulls additions to deletions from `s3://pew-cavepedia-data/00_files/` to `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
     1. Validates `metadata.py` contains data for any new folders.
-    1. Runs `00-01.py`
-    1. Runs `01-02.py`
+    1. Runs `00-01.py`: OCR's PDFs.
+    1. Runs `01-02.py`: Processes OCR'd text and puts it into a `02_json` folder JSON blob with metadata fields.
+    1. Runs `02-03.py` (TODO): Copies JSON blobs from `02_json` into `03_index` with bad OCR text fixed.
     1. Pushes additions or deletions to `s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}`
     1. At this point all newly index data should be OCR'd and processed.
 1. Once a day, the cavepedia application (must be running on the same host), checks for any updates: