# cavepedia
## Requirements
- Go 1.16+
## 3.x milestone
### Main pipeline
1. Use Wasabi S3 as the source of truth. Any new docs are uploaded to Wasabi at `s3://pew-cavepedia-data/00_files/`.
1. Once a day, the `main.py` script runs, which:
    1. Pulls additions or deletions from `s3://pew-cavepedia-data/00_files/` to `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`.
    1. Validates that `metadata.py` contains data for any new folders.
    1. Runs `00-01.py`: OCRs PDFs.
    1. Runs `01-02.py`: Processes the OCR'd text and writes it to the `02_json` folder as JSON blobs with metadata fields.
    1. Runs `02-03.py` (TODO): Copies JSON blobs from `02_json` into `03_index` with bad OCR text fixed.
    1. Pushes additions or deletions to `s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}`.
    1. At this point, all new data should be OCR'd and processed.
1. Once a day, the cavepedia application (which must be running on the same host) checks for any updates:
    1. Pulls additions or deletions from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`.
    1. If there are changes, deletes the local index and reindexes all documents.
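
The daily `main.py` flow above could be sketched roughly as follows. This is only an illustration, not the project's actual script: the function names, the idea that `metadata.py` exposes a mapping keyed by top-level folder name, and the way stage scripts are invoked are all assumptions made for the example.

```python
import subprocess
import sys
from typing import Callable, Iterable, Mapping, Sequence, Set, Tuple

# Stage scripts named in the README; 02-03.py is still marked TODO there.
STAGES: Tuple[str, ...] = ("00-01.py", "01-02.py")


def diff_listing(source: Iterable[str], local: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (additions, deletions) needed to make `local` match `source`."""
    source_set, local_set = set(source), set(local)
    return source_set - local_set, local_set - source_set


def folders_missing_metadata(additions: Iterable[str], metadata: Mapping[str, dict]) -> Set[str]:
    """Top-level folders of newly added files that have no metadata entry.

    Assumes (hypothetically) that metadata is keyed by folder name.
    """
    folders = {key.split("/", 1)[0] for key in additions if "/" in key}
    return folders - set(metadata)


def run_stages(stages: Sequence[str] = STAGES, run: Callable = subprocess.run) -> None:
    """Run each stage script in order; check=True stops at the first failure."""
    for script in stages:
        run([sys.executable, script], check=True)
```

Passing the runner in as a parameter keeps the ordering logic testable without actually launching the OCR stages.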
### Offline export
1. `./launch.sh release [tenant]` creates a local `release` directory for offline usage:
    1. Pulls files for the respective tenant from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/` to `./00_files/`
    1. Indexes all tenant documents
    1. Saves index
### Multi-tenant
1. Change the URL to have a `/{tenant}/` path part just after the host, for example `https://trog.bigcavemaps.com/public/search` or `https://trog.bigcavemaps.com/vpi/search`.
1. During indexing, each document is tagged with a list of tenants. During search, only documents owned by the given tenant are returned.
1. Each path has its own password, specified in a hash format through environment variables, e.g. `VPI_PASSWORD=asdf`, `PUBLIC_PASSWORD=`.
1. The `public` tenant covers all documents that are allowed to be public.
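
The tenant rules above can be illustrated with a small sketch. The application itself is written in Go, so this is not its implementation: the function names, the document shape (`tenants` and `text` keys), and the substring match standing in for the real search are all hypothetical. Password verification is left out because the README does not specify the hash format.

```python
from typing import Dict, Iterable, List, Optional


def tenant_from_path(path: str) -> Optional[str]:
    """Extract the tenant from a URL path such as '/vpi/search'."""
    parts = [p for p in path.split("/") if p]
    return parts[0] if parts else None


def password_env_var(tenant: str) -> str:
    """Name of the per-tenant password variable, e.g. 'vpi' -> 'VPI_PASSWORD'."""
    return f"{tenant.upper()}_PASSWORD"


def search(documents: Iterable[Dict], tenant: str, query: str) -> List[Dict]:
    """Return only documents that list `tenant` and whose text contains `query`."""
    needle = query.lower()
    return [
        doc for doc in documents
        if tenant in doc["tenants"] and needle in doc["text"].lower()
    ]
```

For example, a request to `/vpi/search` would resolve to the `vpi` tenant, read `VPI_PASSWORD`, and see only documents whose tenant list includes `vpi`.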
## Add more documents
1. Download documents to `/bigdata/archive/cavepedia/cavepedia-data/00_files`
1. Process documents to generate JSON files
1. Create a new release
## Run
1. Download latest release
1. Run it
## Run in docker
1. Run `./docker.sh up`
## Release
On a reasonably powerful PC that can access `/bigdata`:
1. Remove `cavepedia.bleve` if it exists
1. Run `./launch.sh build release` to build Linux and Windows binaries
1. Run `./launch.sh run` to index documents
1. Run `./launch.sh release` to bundle the data, binaries, docker.sh script, and index data into a zip
1. Copy the zip to `/bigdata/archive/cavepedia/release/cavepedia-X.Y.zip`
## TODO
- Highlight fuzzy matching
- Speed up PDF loading
- Remove cavepedia-data repo eventually