Paul Walko 8f84e250aa clarify py scripts 2023-02-22 12:31:33 -05:00
src update for 2.3 2022-12-28 15:39:00 -05:00
.gitignore init commit to reduce size 2021-07-16 08:09:27 -04:00
README.md clarify py scripts 2023-02-22 12:31:33 -05:00
docker.sh update for 2.3 2022-12-28 15:39:00 -05:00
launch.sh update for 2.3 2022-12-28 15:39:00 -05:00

README.md

cavepedia

Requirements

  • go 1.16+

3.x milestone

Main pipeline

  1. Use wasabi s3 as source of truth. Any new docs are uploaded to wasabi at s3://pew-cavepedia-data/00_files/
  2. Once a day, the main.py script runs, which:
    1. Pulls additions or deletions from s3://pew-cavepedia-data/00_files/ to /bigdata/archive/cavepedia/pew-cavepedia-data/00_files/
    2. Validates that metadata.py contains data for any new folders.
    3. Runs 00-01.py: OCRs PDFs.
    4. Runs 01-02.py: Processes the OCR'd text and writes it to the 02_json folder as a JSON blob with metadata fields.
    5. Runs 02-03.py (TODO): Copies JSON blobs from 02_json into 03_index with bad OCR text fixed.
    6. Pushes additions or deletions to s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}
  3. At this point all newly added data should be OCR'd and processed.
  4. Once a day, the cavepedia application (which must be running on the same host) checks for any updates:
    1. Pulls additions or deletions from /bigdata/archive/cavepedia/pew-cavepedia-data/00_files/
    2. If anything changed, deletes the local index and reindexes all documents.
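
The pull, validate, and push steps above come down to comparing two file listings and checking folder names against the metadata. The following is a minimal Python sketch under stated assumptions: `plan_sync` and `folders_missing_metadata` are hypothetical helper names, listings are plain collections of relative paths, and the metadata is assumed to be a dict keyed by top-level folder name. The real main.py and metadata.py may be structured differently.

```python
from pathlib import PurePosixPath


def plan_sync(source_keys, dest_keys):
    """Return (to_copy, to_delete) so that the destination mirrors the source.

    Both arguments are iterables of relative paths such as "folder-a/1.pdf".
    Only presence is compared; a real sync would also compare sizes or
    checksums to catch files that were modified in place.
    """
    source, dest = set(source_keys), set(dest_keys)
    return sorted(source - dest), sorted(dest - source)


def folders_missing_metadata(keys, metadata):
    """Return the top-level folders in `keys` that have no metadata entry.

    `metadata` is assumed (hypothetically) to be a dict keyed by folder name.
    """
    folders = {PurePosixPath(key).parts[0] for key in keys}
    return sorted(folders - set(metadata))
```

For example, `plan_sync(["a/1.pdf", "b/2.pdf"], ["a/1.pdf", "old/9.pdf"])` reports `b/2.pdf` as an addition and `old/9.pdf` as a deletion, and a non-empty result from `folders_missing_metadata` would be a reason to stop before the OCR step runs.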

Offline export

  1. ./launch.sh release [tenant] creates a local release directory for offline usage:
    1. Pulls files for the respective tenant from /bigdata/archive/cavepedia/pew-cavepedia-data/00_files/ to ./00_files/
    2. Indexes all tenant documents
    3. Saves index
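
The first export step (pulling one tenant's files into ./00_files/) could look roughly like the sketch below. It is illustrative only: `export_tenant_files` is a hypothetical function, and the `folder_tenants` mapping from folder name to allowed tenants is an assumed structure, not the project's actual metadata format. It needs Python 3.8+ for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path


def export_tenant_files(source_root, dest_root, folder_tenants, tenant):
    """Copy every top-level folder visible to `tenant` into dest_root.

    folder_tenants maps a folder name to the list of tenants allowed to
    see it (assumed structure). Returns the sorted names of copied folders.
    """
    source_root, dest_root = Path(source_root), Path(dest_root)
    copied = []
    for folder, tenants in sorted(folder_tenants.items()):
        src = source_root / folder
        # Skip folders the tenant may not see, or that are missing on disk.
        if tenant in tenants and src.is_dir():
            shutil.copytree(src, dest_root / folder, dirs_exist_ok=True)
            copied.append(folder)
    return copied
```

Copying whole folders keeps the release directory layout identical to the source, so the indexing step can run against ./00_files/ unchanged.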

Multi-tenant

  1. Change the URL to have a /{tenant}/ path part just after the host, for example https://trog.bigcavemaps.com/public/search or https://trog.bigcavemaps.com/vpi/search
    1. During document indexing, each document has a list of tenants. During search, only documents owned by a given tenant are returned.
    2. Each tenant path has its own password, specified in a hash format as env vars, e.g. VPI_PASSWORD=asdf, PUBLIC_PASSWORD=.
    3. The public tenant is all documents that are allowed to be public.
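
The tenant rules above can be sketched in a few lines. This is a Python illustration under assumptions, not the application's actual code: the function names are hypothetical, documents are assumed to be dicts with a `tenants` list, and the env var name is derived from the `VPI_PASSWORD` / `PUBLIC_PASSWORD` examples. Password hashing and verification are left out because the hash format is not specified here.

```python
from urllib.parse import urlparse


def tenant_from_url(url):
    """Return the first path segment after the host, or None if there is none."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if segments else None


def visible_documents(documents, tenant):
    """Keep only the documents whose tenant list includes `tenant`."""
    return [doc for doc in documents if tenant in doc["tenants"]]


def password_env_var(tenant):
    """Name of the env var assumed to hold the tenant's password."""
    return f"{tenant.upper()}_PASSWORD"
```

With these helpers, `tenant_from_url("https://trog.bigcavemaps.com/vpi/search")` gives `"vpi"`, which maps to the `VPI_PASSWORD` variable, and a search for the `public` tenant only ever sees documents that list `public` among their tenants.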

Add more documents

  1. Download documents to /bigdata/archive/cavepedia/cavepedia-data/00_files
  2. Process documents to generate JSON files
  3. Create a new release

Run

  1. Download latest release
  2. Run it

Run in docker

  1. Run ./docker.sh up

Release

On a reasonably powerful PC that can access /bigdata:

  1. Remove cavepedia.bleve if it exists
  2. Run ./launch.sh build release to build Linux and Windows binaries
  3. Run ./launch.sh run to index documents
  4. Run ./launch.sh release to bundle the data, binaries, docker.sh script, and index data into a zip
  5. Copy the zip to /bigdata/archive/cavepedia/release/cavepedia-X.Y.zip

TODO

  • highlight fuzzy matching
  • speed up pdf loading
  • Remove cavepedia-data repo eventually