clarify py scripts
parent dab94013f1
commit 8f84e250aa
@@ -10,8 +10,9 @@
 1. Once a day, the `main.py` script runs, which:
 1. Pulls additions or deletions from `s3://pew-cavepedia-data/00_files/` to `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
 1. Validates that `metadata.py` contains data for any new folders.
-1. Runs `00-01.py`
-1. Runs `01-02.py`
+1. Runs `00-01.py`: OCRs PDFs.
+1. Runs `01-02.py`: Processes the OCR'd text and writes it to the `02_json` folder as a JSON blob with metadata fields.
+1. Runs `02-03.py` (TODO): Copies JSON blobs from `02_json` into `03_index` with bad OCR text fixed.
 1. Pushes additions or deletions to `s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}`
 1. At this point, all newly indexed data should be OCR'd and processed.
 1. Once a day, the cavepedia application (which must be running on the same host) checks for any updates:
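The daily `main.py` flow listed above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names `find_folders_missing_metadata` and `run_stages` are hypothetical, and it assumes `metadata.py` exposes a dict keyed by folder name, which the README does not confirm. The S3 pull and push steps are omitted because the README does not say which tool performs them.

```python
import subprocess
import sys
from pathlib import Path

# Script order taken from the README steps; `02-03.py` is marked TODO there,
# so it is left out of this sketch.
STAGE_SCRIPTS = ["00-01.py", "01-02.py"]


def find_folders_missing_metadata(files_root, metadata):
    """Return the names of sub-folders of files_root that have no metadata entry.

    `metadata` is assumed (hypothetically) to be a dict keyed by folder name.
    Plain files directly under files_root are ignored.
    """
    folders = sorted(p.name for p in Path(files_root).iterdir() if p.is_dir())
    return [name for name in folders if name not in metadata]


def run_stages(scripts, runner=subprocess.run):
    """Run each stage script in order with the current Python interpreter.

    With the default runner, check=True raises CalledProcessError on the first
    failing script, so later stages never run on partial output.
    """
    for script in scripts:
        runner([sys.executable, script], check=True)
```

A caller might refuse to start the stages when `find_folders_missing_metadata(...)` returns a non-empty list, which mirrors the "Validates" step running before `00-01.py`.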