# cavepedia

## Requirements

- go 1.16+

## 3.x milestone

### Main pipeline

1. Use Wasabi S3 as the source of truth. Any new docs are uploaded to Wasabi at `s3://pew-cavepedia-data/00_files/`.
1. Once a day, the `main.py` script runs, which (hedged sketches of these steps appear under "Example sketches" at the end of this README):
    1. Pulls additions or deletions from `s3://pew-cavepedia-data/00_files/` to `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
    1. Validates that `metadata.py` contains data for any new folders
    1. Runs `00-01.py`: OCRs the PDFs
    1. Runs `01-02.py`: processes the OCR'd text and writes it into the `02_json` folder as a JSON blob with metadata fields
    1. Runs `02-03.py` (TODO): copies JSON blobs from `02_json` into `03_index` with bad OCR text fixed
    1. Pushes additions or deletions to `s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}`
1. At this point all newly indexed data should be OCR'd and processed.
1. Once a day, the cavepedia application (which must be running on the same host) checks for updates:
    1. Pulls additions or deletions from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
    1. If there are changes, deletes the local index and reindexes all documents

### Offline export

1. `./launch.sh release [tenant]` creates a local `release` directory for offline usage:
    1. Pulls the respective tenant's files from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/` to `./00_files/`
    1. Indexes all of the tenant's documents
    1. Saves the index

### Multi-tenant

1. Change the URL to include a `/{tenant}/` path segment just after the host, for example `https://trog.bigcavemaps.com/public/search` or `https://trog.bigcavemaps.com/vpi/search`.
1. During indexing, each document gets a list of tenants. During search, only documents owned by the given tenant are returned (see the tenant-filter sketch under "Example sketches" below).
1. Each tenant path has its own password, specified in hash format as an env var, e.g. `VPI_PASSWORD=asdf`, `PUBLIC_PASSWORD=`.
1. The `public` tenant holds all documents that are allowed to be public.

## Add more documents

1. Download documents to `/bigdata/archive/cavepedia/cavepedia-data/00_files`
1. Process the documents to generate JSON files
1. Create a new release

## Run

1. Download the latest release
1. Run it

## Run in docker

1. Run `./docker.sh up`

## Release

On a reasonably powerful PC that can access `/bigdata`:

1. Remove `cavepedia.bleve` if it exists
1. Run `./launch.sh build release` to build Linux and Windows binaries
1. Run `./launch.sh run` to index documents
1. Run `./launch.sh release` to bundle the data, binaries, the `docker.sh` script, and the index data into a zip
1. Copy the zip to `/bigdata/archive/cavepedia/release/cavepedia-X.Y.zip`

## TODO

- Highlight fuzzy matching
- Speed up PDF loading
- Remove the cavepedia-data repo eventually
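## Example sketches

The snippets below are non-normative sketches, not this project's actual code. First, the daily pull step in `main.py`: a minimal sketch assuming `boto3` as the S3 client and Wasabi's standard endpoint. The real script may use a different client (or `aws s3 sync`), and deletion handling is omitted here.

```python
# Sketch of the daily pull from Wasabi (assumed implementation).
import os

import boto3  # assumed dependency

BUCKET = "pew-cavepedia-data"
PREFIX = "00_files/"
LOCAL_ROOT = "/bigdata/archive/cavepedia/pew-cavepedia-data"

# Wasabi is S3-compatible; it only needs a custom endpoint URL.
s3 = boto3.client("s3", endpoint_url="https://s3.wasabisys.com")

def pull_additions():
    """Download any objects under 00_files/ that are missing locally."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            dest = os.path.join(LOCAL_ROOT, key)
            if os.path.exists(dest):
                # Already pulled; a real sync would also compare sizes/ETags.
                continue
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(BUCKET, key, dest)

if __name__ == "__main__":
    pull_additions()
```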
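The OCR step (`00-01.py`) turns the PDFs under `00_files` into text. A minimal sketch, assuming `pdf2image` (which needs poppler) and `pytesseract` (which needs the tesseract binary) as the tooling; the actual OCR engine, output layout (`01_ocr` vs. `01_pages`), and file suffixes are assumptions.

```python
# Sketch of the 00-01.py OCR step (assumed tooling and paths).
from pathlib import Path

import pytesseract                       # assumed dependency
from pdf2image import convert_from_path  # assumed dependency

SRC = Path("/bigdata/archive/cavepedia/pew-cavepedia-data/00_files")
DST = Path("/bigdata/archive/cavepedia/pew-cavepedia-data/01_ocr")

def ocr_pdf(pdf: Path) -> str:
    """Render each page to an image, OCR it, and join pages with form feeds."""
    pages = convert_from_path(str(pdf), dpi=300)
    return "\f".join(pytesseract.image_to_string(p) for p in pages)

for pdf in SRC.rglob("*.pdf"):
    out = DST / pdf.relative_to(SRC).with_suffix(".txt")
    if out.exists():
        continue  # skip PDFs that have already been OCR'd
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(ocr_pdf(pdf))
```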
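The `01-02.py` step wraps the OCR'd text in JSON blobs with metadata fields. Every field name below, including the `tenants` list that multi-tenant search relies on, is hypothetical; the real per-folder values come from `metadata.py`.

```python
# Sketch of the 01-02.py step: OCR text -> 02_json blobs.
# Field names are assumptions; real metadata comes from metadata.py.
import json
from pathlib import Path

OCR_DIR = Path("/bigdata/archive/cavepedia/pew-cavepedia-data/01_ocr")
JSON_DIR = Path("/bigdata/archive/cavepedia/pew-cavepedia-data/02_json")

for txt in OCR_DIR.rglob("*.txt"):
    blob = {
        "id": str(txt.relative_to(OCR_DIR).with_suffix("")),
        "text": txt.read_text(),
        # Tenant ownership drives search filtering (see Multi-tenant above);
        # "public" here is a hypothetical default.
        "tenants": ["public"],
    }
    out = JSON_DIR / txt.relative_to(OCR_DIR).with_suffix(".json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(blob))
```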
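The application itself is Go (see Requirements) and searches a bleve index, so this last sketch only illustrates the multi-tenant rules in Python terms: a document is returned only if the requesting tenant appears in its `tenants` list, and each tenant's password is checked against a hashed env var. The choice of sha256 and the "empty var means no password" behavior for `PUBLIC_PASSWORD=` are assumptions.

```python
# Illustration only: the real app is Go + bleve. Env var names follow the
# README; sha256 and the empty-password rule are assumptions.
import hashlib
import os

def visible_to(doc: dict, tenant: str) -> bool:
    """Search returns a document only if the requesting tenant owns it."""
    return tenant in doc.get("tenants", [])

def check_password(tenant: str, supplied: str) -> bool:
    """Compare a supplied password against the tenant's hashed env var,
    e.g. VPI_PASSWORD. An empty value is treated as 'no password'."""
    stored = os.environ.get(f"{tenant.upper()}_PASSWORD", "")
    if stored == "":
        return True
    return hashlib.sha256(supplied.encode()).hexdigest() == stored

docs = [
    {"id": "trog-1978-01", "tenants": ["public", "vpi"]},  # hypothetical docs
    {"id": "vpi-members-roster", "tenants": ["vpi"]},
]
# The public tenant sees only documents marked public.
assert [d["id"] for d in docs if visible_to(d, "public")] == ["trog-1978-01"]
```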