


  • go 1.16+

3.x milestone

Main pipeline

  1. Use Wasabi S3 as the source of truth. Any new docs are uploaded to Wasabi at s3://pew-cavepedia-data/00_files/
  2. Once a day, the script runs, which:
    1. Pulls additions or deletions from s3://pew-cavepedia-data/00_files/ to /bigdata/archive/cavepedia/pew-cavepedia-data/00_files/
    2. Validates that any new folders contain data.
    3. Runs OCR on any new PDFs.
    4. Processes the OCR'd text into JSON blobs with metadata fields in the 02_json folder.
    5. (TODO) Copies JSON blobs from 02_json into 03_index with bad OCR text fixed.
    6. Pushes additions or deletions to s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}
  3. At this point all newly indexed data should be OCR'd and processed.
  4. Once a day, the cavepedia application (which must be running on the same host) checks for any updates:
    1. Pulls additions or deletions from /bigdata/archive/cavepedia/pew-cavepedia-data/00_files/
    2. If there are changes, deletes the local index and reindexes all documents.

Offline export

  1. ./ release [tenant] creates a local release directory for offline usage:
    1. Pulls files for the respective tenant from /bigdata/archive/cavepedia/pew-cavepedia-data/00_files/ to ./00_files/
    2. Indexes all tenant documents
    3. Saves index


  1. Change the URL to have a /{tenant}/ path part just after the host, for example /vpi/ or /public/:
    1. During document indexing, each document has a list of tenants. During search, only documents owned by a given tenant are returned.
    2. Each path has its own password, specified in a hash format as env vars, e.g. VPI_PASSWORD=asdf, PUBLIC_PASSWORD=.
    3. The public tenant holds all documents that are allowed to be public.
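The tenant model above can be sketched as follows. The `Document` struct and function names are illustrative stand-ins for cavepedia's real types; only the behavior (per-document tenant lists filtered at search time, per-tenant passwords in env vars) comes from the description above:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Document carries the list of tenants allowed to see it.
type Document struct {
	ID      string
	Tenants []string
}

// filterByTenant returns only the documents owned by the given tenant,
// mirroring the search-time restriction described above.
func filterByTenant(docs []Document, tenant string) []Document {
	var out []Document
	for _, d := range docs {
		for _, t := range d.Tenants {
			if t == tenant {
				out = append(out, d)
				break
			}
		}
	}
	return out
}

// tenantPassword reads the per-tenant password from an env var such as
// VPI_PASSWORD or PUBLIC_PASSWORD.
func tenantPassword(tenant string) string {
	return os.Getenv(strings.ToUpper(tenant) + "_PASSWORD")
}

func main() {
	docs := []Document{
		{ID: "trip-report-1", Tenants: []string{"vpi"}},
		{ID: "newsletter-2", Tenants: []string{"vpi", "public"}},
	}
	// A request to /public/... only ever sees documents the public tenant owns.
	for _, d := range filterByTenant(docs, "public") {
		fmt.Println(d.ID) // newsletter-2
	}
}
```

In the real application this filter would be applied inside the search query rather than over an in-memory slice, but the ownership check is the same.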

Add more documents

  1. Download documents to /bigdata/archive/cavepedia/cavepedia-data/00_files
  2. Process documents to generate json files
  3. Create a new release


  1. Download latest release
  2. Run it

Run in docker

  1. Run ./ up


On a reasonably powerful PC that can access /bigdata:

  1. Remove cavepedia.bleve if it exists
  2. Run ./ build release to build Linux and Windows binaries
  3. Run ./ run to index documents
  4. Run ./ release to bundle the data, binaries, script, and index data into a zip
  5. Copy the zip to /bigdata/archive/cavepedia/release/


  • Highlight fuzzy matching
  • Speed up PDF loading
  • Remove the cavepedia-data repo eventually