# cavepedia
## Requirements

- go 1.16+
3.x milestone
## Main pipeline

- Use wasabi s3 as the source of truth. Any new docs are uploaded to wasabi at `s3://pew-cavepedia-data/00_files/`
- Once a day, the `main.py` script runs, which:
  - Pulls additions or deletions from `s3://pew-cavepedia-data/00_files/` to `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
  - Validates that `metadata.py` contains data for any new folders.
  - Runs `00-01.py`: OCRs PDFs.
  - Runs `01-02.py`: processes the OCR'd text and puts it into a JSON blob with metadata fields in the `02_json` folder.
  - Runs `02-03.py` (TODO): copies JSON blobs from `02_json` into `03_index` with bad OCR text fixed.
  - Pushes additions or deletions to `s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}`
- At this point all newly indexed data should be OCR'd and processed.
- Once a day, the cavepedia application (which must be running on the same host) checks for any updates:
  - Pulls additions or deletions from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
  - If there are changes, deletes the local index and reindexes all documents
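The layout of a `02_json` blob is not shown in this README; a minimal sketch of a plausible shape, with hypothetical field names (`id`, `tenants`, `metadata` are assumptions, not the actual schema used by `01-02.py`):

```python
import json

# Hypothetical 02_json blob; actual field names are not documented here.
blob = {
    "id": "example-newsletter-1990-03",  # assumed: derived from the source folder/file
    "tenants": ["public", "vpi"],        # tenants allowed to see this document (see Multi-tenant)
    "text": "OCR'd page text goes here...",
    "metadata": {                        # assumed: the fields validated by metadata.py
        "title": "Example Newsletter",
        "year": 1990,
    },
}

# One JSON file per document is written under 02_json/.
serialized = json.dumps(blob)
```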
## Offline export

`./launch.sh release [tenant]` creates a local `release` directory for offline usage:

- Pulls files for the respective tenant from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/` to `./00_files/`
- Indexes all tenant documents
- Saves the index
## Multi-tenant

- Change the url to have a `/{tenant}/` path part just after the host, for example `https://trog.bigcavemaps.com/public/search` or `https://trog.bigcavemaps.com/vpi/search`.
- During document indexing, each document has a list of tenants. During search, only documents owned by the given tenant are returned.
- Each path has its own password, specified in a hash format as env vars, e.g. `VPI_PASSWORD=asdf`, `PUBLIC_PASSWORD=`.
- The `public` tenant is all documents that are allowed to be public.
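The tenant filtering and password checks described above live in the application itself; as a rough sketch of the logic, assuming each indexed document carries a `tenants` list and that passwords are stored as SHA-256 hex digests (the exact hash format is not specified in this README):

```python
import hashlib
import os

# Sketch: return only documents owned by the given tenant that match the query.
# Field and function names are illustrative, not the actual implementation.
def search(documents, tenant, query):
    return [
        doc for doc in documents
        if tenant in doc["tenants"] and query.lower() in doc["text"].lower()
    ]

# Passwords come from env vars such as VPI_PASSWORD or PUBLIC_PASSWORD.
# An empty value (e.g. PUBLIC_PASSWORD=) means no password is required.
def check_password(tenant, attempt):
    stored = os.environ.get(f"{tenant.upper()}_PASSWORD", "")
    if stored == "":
        return True
    return hashlib.sha256(attempt.encode()).hexdigest() == stored

docs = [
    {"text": "Cave survey notes", "tenants": ["public", "vpi"]},
    {"text": "Private trip report", "tenants": ["vpi"]},
]
```

For example, `search(docs, "public", "cave")` returns only the first document, since the second is owned solely by the `vpi` tenant.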
## Add more documents

- Download documents to `/bigdata/archive/cavepedia/cavepedia-data/00_files`
- Process the documents to generate JSON files
- Create a new release
## Run

- Download the latest release
- Run it
## Run in docker

- Run `./docker.sh up`
## Release

On a reasonably powerful PC that can access `/bigdata`:

- Remove `cavepedia.bleve` if it exists
- Run `./launch.sh build release` to build linux and windows binaries
- Run `./launch.sh run` to index documents
- Run `./launch.sh release` to bundle the data, binaries, the `docker.sh` script, and index data into a zip
- Copy the zip to `/bigdata/archive/cavepedia/release/cavepedia-X.Y.zip`
## TODO
- highlight fuzzy matching
- speed up pdf loading
- Remove cavepedia-data repo eventually