Launching Sombra: Building an AI-Ready Knowledge Base from Web Content
Big day today - I've taken the leap and published Sombra to the Chrome Web Store. There are a million things that can be improved, but at some point you just have to get it out there and start seeing whether the idea actually resonates.
The Problem That Started It All
I've been increasingly frustrated by the limitations of AI agents when it comes to accessing and working with web content. The core issues kept surfacing again and again:
Context Barriers: If content isn't publicly accessible, you need some method to expose that data to AI tools like Claude Desktop. This creates silos where valuable information gets trapped behind authentication walls or simply isn't available. Anthropic also appears to run a cache in front of web requests, which makes it even harder to know exactly what the model sees.
Context Rebuilding: Having to reconstruct the same research context repeatedly is pure busywork. You find relevant articles, papers, or resources, but then have to manually feed them to your AI assistant piece by piece, losing the connections between sources.
Resource Discovery: AI agents struggle with deciding which web resources might be relevant to your specific use case, especially when dealing with nuanced or domain-specific content.
Projects in Claude Desktop help a lot, along with existing MCP integrations, but the UI feels clunky for managing larger research workflows, and there wasn't anything that did exactly what I wanted.
A Research Shadow, Sombra
Sombra combines traditional web scraping techniques with modern AI integration, specifically built around the Model Context Protocol (MCP) for seamless AI tool connectivity.
The core concept is simple: transform your browsing into an intelligent, AI-accessible knowledge base. Web pages you save are converted to searchable markdown using the original arc90 readability algorithm, organized into collections, and made available as MCP resources that your AI tools can directly access.
How It Works
Client-Side Capture: Scraping happens entirely in your browser. If you can see the content in Chrome, you can save it to your collection. This bypasses all the authentication headaches that plague server-side solutions.
Intelligent Processing: Each saved page is processed into clean markdown format, with screenshots captured for visual reference. The content is stored with metadata about the source, timestamp, and host information.
Collection Organization: Pages can be organized into thematic collections - research projects, competitive analysis, technical documentation, whatever makes sense for your workflow.
AI Integration: Through MCP connections, these collections become directly accessible to compatible AI clients. No more copy-pasting or manual context reconstruction.
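As a rough illustration of how saved pages might surface as MCP resources, here is a minimal TypeScript sketch. The SavedPage fields, the sombra:// URI scheme, and the function names are assumptions made up for this example, not Sombra's actual schema. Only the resource descriptor fields (uri, name, description, mimeType) and the shape of a resources/list result come from the MCP specification.

```typescript
// Hypothetical shape of a saved page; field names are illustrative only.
interface SavedPage {
  id: string;
  collection: string;
  url: string; // source URL of the page
  title: string;
  savedAt: string; // ISO 8601 timestamp
  markdown: string; // cleaned-up page content
}

// Resource descriptor fields as defined by the MCP specification.
interface McpResource {
  uri: string;
  name: string;
  description?: string;
  mimeType?: string;
}

// Map one saved page to a resource descriptor.
// The "sombra://" URI scheme is an assumption for this sketch.
function toMcpResource(page: SavedPage): McpResource {
  const host = new URL(page.url).host;
  const collection = encodeURIComponent(page.collection);
  const id = encodeURIComponent(page.id);
  return {
    uri: `sombra://collections/${collection}/pages/${id}`,
    name: page.title,
    description: `Saved from ${host} at ${page.savedAt}`,
    mimeType: "text/markdown",
  };
}

// Build the body of a `resources/list` result for a single collection.
function listResources(
  pages: SavedPage[],
  collection: string,
): { resources: McpResource[] } {
  return {
    resources: pages
      .filter((page) => page.collection === collection)
      .map(toMcpResource),
  };
}
```

An MCP client that lists resources would see one entry per saved page, and could then request the markdown for any URI it finds relevant.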
The Technology Stack
The architecture reflects some deliberate choices:
Backend: Clojure with Pedestal and Datomic. The functional approach and immutable data structures work beautifully for content archival and retrieval. The server renders HTML as Hiccup with Tailwind V4 for styling.
Extension: TypeScript with a side-panel UI, using the Chrome APIs for content script injection and screenshot capture via captureVisibleTab.
Processing: HTML to markdown transformation happens through a worker pool behind a queue system. Dropbox sync is handled by periodic worker processes checking for unsynced saves.
MCP Implementation: Built the MCP integration from scratch - there isn't much out there in terms of libraries, especially for Clojure backends. I'm considering open-sourcing this component once it's more mature.
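On the extension side, the screenshot step mentioned above could look something like the sketch below. chrome.tabs.captureVisibleTab is a real Chrome API that resolves to a data URL. Everything else here (the TabsApi interface, the Screenshot record, and the captureScreenshot helper) is hypothetical and only there to keep the example self-contained and testable outside a browser.

```typescript
// Minimal slice of the Chrome tabs API that this sketch relies on.
// Declared locally so the example also compiles outside an extension.
interface TabsApi {
  captureVisibleTab(
    windowId: number,
    options: { format?: "jpeg" | "png"; quality?: number },
  ): Promise<string>;
}

// Hypothetical screenshot record; not Sombra's actual data model.
interface Screenshot {
  mimeType: string;
  base64: string;
  capturedAt: string; // ISO 8601 timestamp
}

// Capture the visible area of the active tab and split the resulting
// data URL ("data:image/png;base64,....") into its parts.
async function captureScreenshot(
  tabs: TabsApi,
  windowId: number,
  now: () => Date = () => new Date(),
): Promise<Screenshot> {
  const dataUrl = await tabs.captureVisibleTab(windowId, { format: "png" });
  const match = /^data:([^;,]+);base64,(.*)$/.exec(dataUrl);
  if (!match) {
    throw new Error("Unexpected data URL format");
  }
  const [, mimeType = "", base64 = ""] = match;
  return { mimeType, base64, capturedAt: now().toISOString() };
}
```

Inside an extension, the tabs argument would be chrome.tabs and the window ID could be chrome.windows.WINDOW_ID_CURRENT. The call needs either the activeTab permission or host access to the page being captured.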
Dogfooding in Practice
One of the most validating aspects has been using Sombra to build Sombra itself. I've created collections of competitor research, industry analysis, and technical documentation, then used Claude to analyze positioning, feature comparisons, and strategic decisions.
It's a considerable multiplier to have Claude work with curated, relevant context rather than hoping it can magically discover and connect the right resources. Instead of rebuilding context from scratch each time, I can focus on making connections and asking higher-level strategic questions.
The Name and Future Vision
Why Sombra? I was thinking of sci-fi references like Peter Hamilton's "u-shadow" or the concept of a "shadow" in the Silo series. The idea of an intelligent companion that accumulates knowledge and provides context resonates with where I want to take this concept.
Future developments include exposing screenshots to MCP (still determining if visual references add value when markdown is available), JSON-LD support for richer structured data extraction, and potentially team collaboration features.
Pricing and Availability
The extension offers three tiers:
- Free: 100 lifetime saves with all features unlocked
- Pro: $7/month or $70/year for 2,000 saves annually
- Supporter: $120/year for unlimited saves and screenshots
(with localised pricing available in EUR/GBP)
The goal is making the core functionality accessible while supporting development through users who find real value in the tool.
What's Next
This feels like one of those projects that started by scratching a personal itch and possibly got a bit out of hand. But having built it, it would be a shame not to put it out there in case it helps others facing similar frustrations with AI workflow integration.
I'm particularly curious about feedback on the screenshot integration - whether visual context adds enough value to justify the complexity, or if the markdown extraction captures what's needed.
If you're dealing with similar challenges around AI context management, or if you're just curious about the intersection of web archival and AI augmentation, I'd love to hear your thoughts.
Try Sombra on the Chrome Web Store
The future of AI isn't just about smarter models - it's about smarter integration with how we actually work and research. Sombra is my attempt to bridge that gap.
Built something interesting? I'm always curious about where AI tooling is heading and how we can make these workflows less clunky. Let's connect.