# cavepedia

## Requirements

- go 1.16+

## 3.x milestone

### Main pipeline

- Use wasabi s3 as the source of truth. Any new docs are uploaded to wasabi at `s3://pew-cavepedia-data/00_files/`
- Once a day, the `main.py` script runs, which:
  - Pulls additions or deletions from `s3://pew-cavepedia-data/00_files/` to `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
  - Validates that `metadata.py` contains data for any new folders.
  - Runs `00-01.py`: OCRs PDFs.
  - Runs `01-02.py`: processes OCR'd text into a JSON blob with metadata fields in the `02_json` folder.
  - Runs `02-03.py` (TODO): copies JSON blobs from `02_json` into `03_index` with bad OCR text fixed.
  - Pushes additions or deletions to `s3://pew-cavepedia-data/{01_ocr,01_pages,02_json,02_text}`
- At this point all newly indexed data should be OCR'd and processed.
- Once a day, the cavepedia application (which must be running on the same host) checks for any updates:
  - Pulls additions or deletions from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/`
  - If there are changes, deletes the local index and reindexes all documents (a sketch of this check follows the list)
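The update check itself is not spelled out here. As a rough illustration only, below is a minimal Go sketch of the "reindex if anything changed" step; the directory fingerprint, the marker file, and `reindexAll` are hypothetical stand-ins, not the application's actual mechanism:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

const dataDir = "/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/"
const stampFile = "cavepedia.lastsync" // hypothetical marker file

// reindexAll stands in for the application's real indexing pass.
func reindexAll(root string) error { return nil }

// fingerprint hashes every file path, size, and mtime under root, so any
// addition, deletion, or modification changes the result.
func fingerprint(root string) (string, error) {
	h := sha256.New()
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		fmt.Fprintf(h, "%s|%d|%d\n", path, info.Size(), info.ModTime().UnixNano())
		return nil
	})
	return fmt.Sprintf("%x", h.Sum(nil)), err
}

// checkForUpdates rebuilds the local index only when the contents of
// 00_files differ from the previous run.
func checkForUpdates() error {
	fp, err := fingerprint(dataDir)
	if err != nil {
		return err
	}
	last, _ := os.ReadFile(stampFile) // missing on first run, which forces a reindex
	if fp == string(last) {
		return nil // nothing changed since the last check
	}
	if err := os.RemoveAll("cavepedia.bleve"); err != nil {
		return err
	}
	if err := reindexAll(dataDir); err != nil {
		return err
	}
	return os.WriteFile(stampFile, []byte(fp), 0o644)
}

func main() {
	if err := checkForUpdates(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Hashing path, size, and mtime catches additions, deletions, and edits without reading file contents, which keeps a daily check cheap.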

### Offline export

`./launch.sh release [tenant]` creates a local `release` directory for offline usage:

- Pulls files for the respective tenant from `/bigdata/archive/cavepedia/pew-cavepedia-data/00_files/` to `./00_files/`
- Indexes all tenant documents (see the indexing sketch after this list)
- Saves the index
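The indexing step is not shown in this README, but the `cavepedia.bleve` directory mentioned in the Release section below suggests the index is a bleve index. Here is a minimal sketch assuming bleve v2 and assuming the index is built from the `02_json` blobs; the `Document` fields are guesses for illustration:

```go
package main

import (
	"encoding/json"
	"io/fs"
	"log"
	"os"
	"path/filepath"

	"github.com/blevesearch/bleve/v2"
)

// Document guesses at the metadata fields the pipeline writes into the
// 02_json blobs; the real field set may differ.
type Document struct {
	Title   string   `json:"title"`
	Text    string   `json:"text"`
	Tenants []string `json:"tenants"`
}

// buildIndex creates an on-disk bleve index and adds every JSON blob under
// jsonDir. bleve persists as it indexes, so "saving" is just closing it.
func buildIndex(jsonDir, indexPath string) (bleve.Index, error) {
	index, err := bleve.New(indexPath, bleve.NewIndexMapping())
	if err != nil {
		return nil, err
	}
	err = filepath.WalkDir(jsonDir, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil || d.IsDir() || filepath.Ext(path) != ".json" {
			return walkErr
		}
		raw, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		var doc Document
		if err := json.Unmarshal(raw, &doc); err != nil {
			return err
		}
		// Any stable ID works; the file path is convenient.
		return index.Index(path, doc)
	})
	return index, err
}

func main() {
	index, err := buildIndex("02_json", "cavepedia.bleve")
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()
}
```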

### Multi-tenant

- Change the url to have a `/{tenant}/` path part just after the host, for example `https://trog.bigcavemaps.com/public/search` or `https://trog.bigcavemaps.com/vpi/search`
- During document indexing, each document gets a list of tenants. During search, only documents owned by the given tenant are returned (see the sketch after this list).
- Each path has its own password, specified in a hash format as env vars, e.g. `VPI_PASSWORD=asdf`, `PUBLIC_PASSWORD=`.
- The `public` tenant is all documents that are allowed to be public.
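To make the tenant model concrete, here is a hedged Go sketch of a search handler that scopes results to one tenant and checks that tenant's password. The handler shape, HTTP basic auth, and plain-text comparison are simplifications (the README says passwords are stored in a hash format), and the `tenants` field name carries over from the indexing sketch above:

```go
package main

import (
	"log"
	"net/http"
	"os"
	"strings"

	"github.com/blevesearch/bleve/v2"
)

// searchHandler serves /{tenant}/search, e.g. /vpi/search?q=survey.
func searchHandler(index bleve.Index) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// The first path segment names the tenant: /vpi/search -> "vpi".
		tenant := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/"), "/", 2)[0]

		// Each tenant's password lives in a TENANT_PASSWORD env var.
		// A real implementation would compare against a stored hash.
		_, pass, _ := r.BasicAuth()
		if pass != os.Getenv(strings.ToUpper(tenant)+"_PASSWORD") {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}

		// Combine the user's text query with a term filter on the
		// document's tenants list, so only owned documents come back.
		text := bleve.NewMatchQuery(r.URL.Query().Get("q"))
		owned := bleve.NewTermQuery(tenant)
		owned.SetField("tenants")
		req := bleve.NewSearchRequest(bleve.NewConjunctionQuery(text, owned))

		res, err := index.Search(req)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		_, _ = w.Write([]byte(res.String()))
	}
}

func main() {
	index, err := bleve.Open("cavepedia.bleve")
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/", searchHandler(index))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Keying the env var off the upper-cased tenant name reproduces the `VPI_PASSWORD` / `PUBLIC_PASSWORD` pattern above; with `PUBLIC_PASSWORD=` left empty, an empty password grants access to the `public` tenant.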

## Add more documents

- Download documents to `/bigdata/archive/cavepedia/cavepedia-data/00_files`
- Process documents to generate json files
- Create a new release

## Run

- Download latest release
- Run it

## Run in docker

- Run `./docker.sh up`

## Release

On a reasonably powerful PC that can access `/bigdata`:

- Remove `cavepedia.bleve` if it exists
- Run `./launch.sh build release` to build linux and windows binaries
- Run `./launch.sh run` to index documents
- Run `./launch.sh release` to bundle the data, binaries, `docker.sh` script, and index data into a zip
- Copy the zip to `/bigdata/archive/cavepedia/release/cavepedia-X.Y.zip`

## TODO

- highlight fuzzy matching
- speed up pdf loading
- Remove cavepedia-data repo eventually