# cavepedia
## Requirements

- go 1.16+
3.x milestone
## Main pipeline

- Use wasabi s3 as the source of truth. Any new docs are uploaded to wasabi at `s3://pew-cavepedia-data/00_files/`
- Once a day, the `main.py` script runs, which:
  - Pulls additions or deletions from `s3://pew-cavepedia-data/00_files/` to `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
  - Validates that `metadata.py` contains data for any new folders.
  - Runs `00-01.py`: OCRs PDFs.
  - Runs `01-02.py`: processes the OCR'd text and puts it into a JSON blob with metadata fields in the `02_json` folder.
  - Runs `02-03.py` (TODO): copies JSON blobs from `02_json` into `03_index` with bad OCR text fixed.
  - Pushes additions or deletions to `s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}`
- At this point all newly indexed data should be OCR'd and processed.
- Once a day, the cavepedia application (which must be running on the same host) checks for any updates:
  - Pulls additions or deletions from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
  - If there are changes, deletes the local index and reindexes all documents
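The layout of a `02_json` blob is not shown in this README; a minimal sketch of a plausible shape, with hypothetical field names (`id`, `tenants`, `metadata` are assumptions, not the actual schema used by `01-02.py`):

```python
import json

# Hypothetical 02_json blob; actual field names are not documented here.
blob = {
    "id": "example-newsletter-1990-03",  # assumed: derived from the source folder/file
    "tenants": ["public", "vpi"],        # tenants allowed to see this document (see Multi-tenant)
    "text": "OCR'd page text goes here...",
    "metadata": {                        # assumed: the fields validated by metadata.py
        "title": "Example Newsletter",
        "year": 1990,
    },
}

# One JSON file per document is written under 02_json/.
serialized = json.dumps(blob)
```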
## Offline export

`./launch.sh release [tenant]` creates a local `release` directory for offline usage:

- Pulls files for the respective tenant from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/` to `./00_files/`
- Indexes all tenant documents
- Saves the index
## Multi-tenant

- Change the url to have a `/{tenant}/` path part just after the host, for example `https://trog.bigcavemaps.com/public/search` or `https://trog.bigcavemaps.com/vpi/search`.
- During document indexing, each document has a list of tenants. During search, only documents owned by the given tenant are returned.
- Each path has its own password, specified in a hash format as env vars, e.g. `VPI_PASSWORD=asdf`, `PUBLIC_PASSWORD=`.
- The `public` tenant is all documents that are allowed to be public.
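The tenant filtering and password checks described above live in the application itself; as a rough sketch of the logic, assuming each indexed document carries a `tenants` list and that passwords are stored as SHA-256 hex digests (the exact hash format is not specified in this README):

```python
import hashlib
import os

# Sketch: return only documents owned by the given tenant that match the query.
# Field and function names are illustrative, not the actual implementation.
def search(documents, tenant, query):
    return [
        doc for doc in documents
        if tenant in doc["tenants"] and query.lower() in doc["text"].lower()
    ]

# Passwords come from env vars such as VPI_PASSWORD or PUBLIC_PASSWORD.
# An empty value (e.g. PUBLIC_PASSWORD=) means no password is required.
def check_password(tenant, attempt):
    stored = os.environ.get(f"{tenant.upper()}_PASSWORD", "")
    if stored == "":
        return True
    return hashlib.sha256(attempt.encode()).hexdigest() == stored

docs = [
    {"text": "Cave survey notes", "tenants": ["public", "vpi"]},
    {"text": "Private trip report", "tenants": ["vpi"]},
]
```

For example, `search(docs, "public", "cave")` returns only the first document, since the second is owned solely by the `vpi` tenant.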
## Add more documents

- Download documents to `/bigdata/archive/cavepedia/cavepedia-data/00_files`
- Process the documents to generate JSON files
- Create a new release
## Run

- Download the latest release
- Run it
## Run in docker

- Run `./docker.sh up`
## Release

On a reasonably powerful PC that can access `/bigdata`:

- Remove `cavepedia.bleve` if it exists
- Run `./launch.sh build release` to build linux and windows binaries
- Run `./launch.sh run` to index documents
- Run `./launch.sh release` to bundle the data, binaries, the `docker.sh` script, and index data into a zip
- Copy the zip to `/bigdata/archive/cavepedia/release/cavepedia-X.Y.zip`
## TODO
- highlight fuzzy matching
- speed up pdf loading
- Remove cavepedia-data repo eventually