From a2f4ff6087e6608c6a3027a36adebd4e22e53d07 Mon Sep 17 00:00:00 2001 From: Paul Walko Date: Fri, 12 Dec 2025 19:24:04 +0100 Subject: [PATCH] readme --- README.md | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 111 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..a9de63f --- /dev/null +++ b/README.md @@ -0,0 +1,111 @@ +# Cavepedia v2 + +A RBAC managed cave information system with document processing, semantic search, and an AI chat interface. Email paul@bigcavemaps.com for access. Only about 60% vibe coded. + +## Architecture + +``` + +------------------+ + | Auth0 | + +--------+---------+ + | + v ++------------------+ +----------+----------+ +| | WebSocket | | +| Browser +------------->+ web/ (Next.js) | +| | | - CopilotKit UI | ++------------------+ | - Auth0 SSO | + +----------+----------+ + | + v + +----------+----------+ + | web/agent/ | + | (LangGraph) | + | - Google Gemini | + +----------+----------+ + | + v + +----------+----------+ + | mcp/ | + | (FastMCP Server) | + | - Semantic search | + +----------+----------+ + | + +--------------------+--------------------+ + | | + v v + +----------+----------+ +----------+----------+ + | PostgreSQL | | Cohere | + | (pgvector) | | (Embeddings) | + | - embeddings | +---------------------+ + | - metadata | + | - batches | + +----------+----------+ + ^ + | + +----------+----------+ + | poller/ | + | (Document Pipeline)| + | - PDF splitting | + | - OCR (Claude) | + | - Embeddings | + +----------+----------+ + | + +-------------+-------------+ + | | | + v v v + +------+------+ +----+----+ +------+------+ + | S3: import | | S3: files| | S3: pages | + +-------------+ +----------+ +-------------+ +``` + +## Components + +| Component | Description | Tech Stack | +|-----------|-------------|------------| +| **web/** | Frontend application with chat UI | Next.js, CopilotKit, Auth0 | +| **web/agent/** | AI agent for answering cave questions | LangGraph, Google Gemini | +| **mcp/** | MCP server exposing semantic search tools | FastMCP, Cohere | +| **poller/** | Document ingestion and processing pipeline | Python, Claude API, Cohere | + +## Data Flow + +1. **Document Ingestion** (poller) + - PDFs uploaded to `s3://cavepediav2-import` + - Poller moves to `s3://cavepediav2-files`, splits into pages + - Pages stored in `s3://cavepediav2-pages` + - Claude extracts text via OCR + - Cohere generates embeddings + - Stored in PostgreSQL with pgvector + +2. **Search & Chat** (mcp + agent) + - User asks question via web UI + - Agent calls MCP tools for semantic search + - MCP queries pgvector for relevant documents + - Agent synthesizes response with citations + +## Getting Started + +See individual component READMEs: +- [web/README.md](web/README.md) - Frontend and agent setup +- [poller/README.md](poller/README.md) - Document processing pipeline + +## Environment Variables + +Each component requires its own environment variables. See the respective READMEs for details. + +**Never commit `.env` files** - they are gitignored. + +## CI/CD + +Gitea Actions workflows build and push Docker images on changes to `main`: + +| Workflow | Trigger Path | Image | +|----------|--------------|-------| +| build-push-web | `web/**` (excluding agent) | `cavepediav2-web:latest` | +| build-push-agent | `web/agent/**` | `cavepediav2-agent:latest` | +| build-push-poller | `poller/**` | `cavepediav2-poller:latest` | + +## License + +MIT