3.x milestone

master
Paul Walko 2023-02-22 12:15:54 -05:00
parent 33d50adf07
commit dab94013f1
1 changed files with 26 additions and 0 deletions

@@ -4,6 +4,32 @@
- go 1.16+
## 3.x milestone
### Main pipeline
1. Use Wasabi S3 as the source of truth. Any new docs are uploaded to Wasabi at `s3://pew-cavepedia-data/00_files/`.
1. Once a day, the `main.py` script runs, which:
    1. Pulls additions or deletions from `s3://pew-cavepedia-data/00_files/` to `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
    1. Validates that `metadata.py` contains data for any new folders.
    1. Runs `00-01.py`
    1. Runs `01-02.py`
    1. Pushes additions or deletions to `s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}`
1. At this point all newly indexed data should be OCR'd and processed.
1. Once a day, the cavepedia application (which must be running on the same host) checks for any updates:
    1. Pulls additions or deletions from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
    1. If there are changes, deletes the local index and reindexes all documents.
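
The "additions or deletions" and metadata validation steps above can be sketched roughly as follows. This is only an illustration, not the actual `main.py`: the function names, the idea of comparing plain key listings, and the assumption that metadata is a dict keyed by top-level folder name are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def diff_keys(source: Iterable[str], mirror: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (additions, deletions) needed to make `mirror` match `source`."""
    source_set, mirror_set = set(source), set(mirror)
    additions = source_set - mirror_set   # present upstream, missing locally
    deletions = mirror_set - source_set   # present locally, removed upstream
    return additions, deletions


def folders_missing_metadata(keys: Iterable[str], metadata: Dict[str, dict]) -> List[str]:
    """List top-level folders among `keys` that have no metadata entry.

    Assumption: metadata is keyed by the first path segment under 00_files/.
    """
    folders = {key.split("/", 1)[0] for key in keys if "/" in key}
    return sorted(folders - set(metadata))
```

A daily run could call `diff_keys` on the bucket listing and the local listing, then refuse to continue if `folders_missing_metadata` returns anything for the additions.
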
### Offline export
1. `./launch.sh release [tenant]` creates a local `release` directory for offline usage:
    1. Pulls files for the respective tenant from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/` to `./00_files/`
    1. Indexes all tenant documents
    1. Saves index
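
The first export step (pulling only one tenant's files) might look something like the sketch below. It is not what `launch.sh` actually does: `export_tenant_files` and the `folder_tenants` mapping (folder name to list of tenants) are hypothetical names introduced for illustration.

```python
import shutil
from pathlib import Path
from typing import Dict, List


def export_tenant_files(src_root: Path, dest_root: Path,
                        folder_tenants: Dict[str, List[str]], tenant: str) -> List[str]:
    """Copy every folder owned by `tenant` from src_root to dest_root.

    Assumption: ownership is tracked per top-level folder in `folder_tenants`.
    Returns the names of the folders that were copied.
    """
    copied = []
    for folder, tenants in sorted(folder_tenants.items()):
        src = src_root / folder
        if tenant in tenants and src.is_dir():
            # dirs_exist_ok (Python 3.8+) lets a repeated export overwrite in place.
            shutil.copytree(src, dest_root / folder, dirs_exist_ok=True)
            copied.append(folder)
    return copied
```

Indexing and saving the index would then run against `dest_root` only, so the release directory never contains another tenant's documents.
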
### Multi-tenant
1. Change the URL to have a `/{tenant}/` path part just after the host, for example `https://trog.bigcavemaps.com/public/search` or `https://trog.bigcavemaps.com/vpi/search`.
1. During document indexing, each document has a list of tenants. During search, only documents owned by a given tenant are returned.
1. Each tenant path has its own password, supplied in hashed form via env vars, e.g. `VPI_PASSWORD=asdf`, `PUBLIC_PASSWORD=`.
1. The `public` tenant contains all documents that are allowed to be public.
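
A minimal sketch of the multi-tenant rules above, under several stated assumptions: the hash format is not specified in this README, so SHA-256 hex digests are assumed here; an empty value (as in `PUBLIC_PASSWORD=`) is assumed to mean "no password required"; and the function names and the `tenants` field on a document are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, List, Mapping, Optional
from urllib.parse import urlsplit


def tenant_from_url(url: str) -> Optional[str]:
    """Return the first path segment after the host, or None if there is none."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[0] if segments else None


def filter_by_tenant(docs: List[Dict], tenant: str) -> List[Dict]:
    """Keep only documents whose tenant list includes `tenant`."""
    return [doc for doc in docs if tenant in doc.get("tenants", [])]


def password_ok(tenant: str, supplied: str, env: Mapping[str, str]) -> bool:
    """Check `supplied` against the {TENANT}_PASSWORD env var.

    Assumptions: the stored value is a SHA-256 hex digest, an empty value
    means the tenant is open, and a missing variable denies access.
    """
    expected = env.get(f"{tenant.upper()}_PASSWORD")
    if expected is None:
        return False
    if expected == "":
        return True
    digest = hashlib.sha256(supplied.encode("utf-8")).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(digest.encode("utf-8"), expected.encode("utf-8"))
```

In a real deployment `env` would be `os.environ`; passing it in explicitly keeps the check easy to test.
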
## Add more documents