# cavepedia

## Requirements

- go 1.16+

## 3.x milestone

### Main pipeline

1. Use Wasabi S3 as the source of truth. Any new docs are uploaded to Wasabi at `s3://pew-cavepedia-data/00_files/`.
1. Once a day, the `main.py` script runs, which (hedged sketches of these steps appear under "Example sketches" at the end of this README):
    1. Pulls additions or deletions from `s3://pew-cavepedia-data/00_files/` to `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
    1. Validates that `metadata.py` contains data for any new folders
    1. Runs `00-01.py`: OCRs the PDFs
    1. Runs `01-02.py`: processes the OCR'd text and writes it into the `02_json` folder as a JSON blob with metadata fields
    1. Runs `02-03.py` (TODO): copies JSON blobs from `02_json` into `03_index` with bad OCR text fixed
    1. Pushes additions or deletions to `s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}`
1. At this point all newly indexed data should be OCR'd and processed.
1. Once a day, the cavepedia application (which must be running on the same host) checks for updates:
    1. Pulls additions or deletions from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
    1. If there are changes, deletes the local index and reindexes all documents

### Offline export

1. `./launch.sh release [tenant]` creates a local `release` directory for offline usage:
    1. Pulls the respective tenant's files from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/` to `./00_files/`
    1. Indexes all of the tenant's documents
    1. Saves the index

### Multi-tenant

1. Change the URL to include a `/{tenant}/` path segment just after the host, for example `https://trog.bigcavemaps.com/public/search` or `https://trog.bigcavemaps.com/vpi/search`.
1. During indexing, each document gets a list of tenants. During search, only documents owned by the given tenant are returned (see the tenant-filter sketch under "Example sketches" below).
1. Each tenant path has its own password, specified in hash format as an env var, e.g. `VPI_PASSWORD=asdf`, `PUBLIC_PASSWORD=`.
1. The `public` tenant holds all documents that are allowed to be public.

## Add more documents

1. Download documents to `/bigdata/archive/cavepedia/cavepedia-data/00_files`
1. Process the documents to generate JSON files
1. Create a new release

## Run

1. Download the latest release
1. Run it

## Run in docker

1. Run `./docker.sh up`

## Release

On a reasonably powerful PC that can access `/bigdata`:

1. Remove `cavepedia.bleve` if it exists
1. Run `./launch.sh build release` to build Linux and Windows binaries
1. Run `./launch.sh run` to index documents
1. Run `./launch.sh release` to bundle the data, binaries, the `docker.sh` script, and the index data into a zip
1. Copy the zip to `/bigdata/archive/cavepedia/release/cavepedia-X.Y.zip`

## TODO

- Highlight fuzzy matching
- Speed up PDF loading
- Remove the cavepedia-data repo eventually
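## Example sketches

The snippets below are non-normative sketches, not this project's actual code. First, the daily pull step in `main.py`: a minimal sketch assuming `boto3` as the S3 client and Wasabi's standard endpoint. The real script may use a different client (or `aws s3 sync`), and deletion handling is omitted here.

```python
# Sketch of the daily pull from Wasabi (assumed implementation).
import os

import boto3  # assumed dependency

BUCKET = "pew-cavepedia-data"
PREFIX = "00_files/"
LOCAL_ROOT = "/bigdata/archive/cavepedia/pew-cavepedia-data"

# Wasabi is S3-compatible; it only needs a custom endpoint URL.
s3 = boto3.client("s3", endpoint_url="https://s3.wasabisys.com")

def pull_additions():
    """Download any objects under 00_files/ that are missing locally."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            dest = os.path.join(LOCAL_ROOT, key)
            if os.path.exists(dest):
                # Already pulled; a real sync would also compare sizes/ETags.
                continue
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(BUCKET, key, dest)

if __name__ == "__main__":
    pull_additions()
```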
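The OCR step (`00-01.py`) turns the PDFs under `00_files` into text. A minimal sketch, assuming `pdf2image` (which needs poppler) and `pytesseract` (which needs the tesseract binary) as the tooling; the actual OCR engine, output layout (`01_ocr` vs. `01_pages`), and file suffixes are assumptions.

```python
# Sketch of the 00-01.py OCR step (assumed tooling and paths).
from pathlib import Path

import pytesseract                       # assumed dependency
from pdf2image import convert_from_path  # assumed dependency

SRC = Path("/bigdata/archive/cavepedia/pew-cavepedia-data/00_files")
DST = Path("/bigdata/archive/cavepedia/pew-cavepedia-data/01_ocr")

def ocr_pdf(pdf: Path) -> str:
    """Render each page to an image, OCR it, and join pages with form feeds."""
    pages = convert_from_path(str(pdf), dpi=300)
    return "\f".join(pytesseract.image_to_string(p) for p in pages)

for pdf in SRC.rglob("*.pdf"):
    out = DST / pdf.relative_to(SRC).with_suffix(".txt")
    if out.exists():
        continue  # skip PDFs that have already been OCR'd
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(ocr_pdf(pdf))
```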
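The `01-02.py` step wraps the OCR'd text in JSON blobs with metadata fields. Every field name below, including the `tenants` list that multi-tenant search relies on, is hypothetical; the real per-folder values come from `metadata.py`.

```python
# Sketch of the 01-02.py step: OCR text -> 02_json blobs.
# Field names are assumptions; real metadata comes from metadata.py.
import json
from pathlib import Path

OCR_DIR = Path("/bigdata/archive/cavepedia/pew-cavepedia-data/01_ocr")
JSON_DIR = Path("/bigdata/archive/cavepedia/pew-cavepedia-data/02_json")

for txt in OCR_DIR.rglob("*.txt"):
    blob = {
        "id": str(txt.relative_to(OCR_DIR).with_suffix("")),
        "text": txt.read_text(),
        # Tenant ownership drives search filtering (see Multi-tenant above);
        # "public" here is a hypothetical default.
        "tenants": ["public"],
    }
    out = JSON_DIR / txt.relative_to(OCR_DIR).with_suffix(".json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(blob))
```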
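The application itself is Go (see Requirements) and searches a bleve index, so this last sketch only illustrates the multi-tenant rules in Python terms: a document is returned only if the requesting tenant appears in its `tenants` list, and each tenant's password is checked against a hashed env var. The choice of sha256 and the "empty var means no password" behavior for `PUBLIC_PASSWORD=` are assumptions.

```python
# Illustration only: the real app is Go + bleve. Env var names follow the
# README; sha256 and the empty-password rule are assumptions.
import hashlib
import os

def visible_to(doc: dict, tenant: str) -> bool:
    """Search returns a document only if the requesting tenant owns it."""
    return tenant in doc.get("tenants", [])

def check_password(tenant: str, supplied: str) -> bool:
    """Compare a supplied password against the tenant's hashed env var,
    e.g. VPI_PASSWORD. An empty value is treated as 'no password'."""
    stored = os.environ.get(f"{tenant.upper()}_PASSWORD", "")
    if stored == "":
        return True
    return hashlib.sha256(supplied.encode()).hexdigest() == stored

docs = [
    {"id": "trog-1978-01", "tenants": ["public", "vpi"]},  # hypothetical docs
    {"id": "vpi-members-roster", "tenants": ["vpi"]},
]
# The public tenant sees only documents marked public.
assert [d["id"] for d in docs if visible_to(d, "public")] == ["trog-1978-01"]
```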