
Honcho

A content engine that combines aggregation, full-text search, storage, and a headless API, replacing the typical stack of separate services with one coherent system where content flows from source to index to API without integration seams.

The Short Version

Honcho is a content engine that indexes large collections of documents—research archives, institutional publications, news corpora, internal knowledge bases—and makes them searchable and accessible to AI assistants. Import tens of thousands of articles from a CMS, bulk-load a document archive, or crawl hundreds of feeds over time. Everything lands in the same full-text index.

What makes it powerful is what happens next. Connect Honcho to Claude or another AI assistant, and the entire corpus becomes conversational. Ask “how have expert views on China’s economy evolved over the past decade?” and the AI synthesizes across thousands of indexed documents—threading together analysis, tracking how positions shifted, surfacing connections that no keyword search would find.

Content flows in from many directions: bulk importers for CMS migrations and document archives, an extract API with declarative rules for structured data sources, RSS/Atom crawling for ongoing feeds, and AI assistants that save articles and notes on your behalf.

It supports multiple users and groups, so a team can share a common pool of content and collaborate through their AI assistants. One person’s assistant can summarize something and share it to the team’s feed, building a shared knowledge base without anyone copying and pasting links around.

Honcho is also a persistence layer for AI workflows. Agent apps that summarize articles, monitor topics, or produce analysis can write their output back to Honcho, where it becomes searchable alongside everything else. The content you curate and the content your tools produce all live in one place.

What does this look like in practice?

Cross-topic synthesis across years of financial commentary. Briefings assembled from dozens of sources in minutes. Expert opinion tracked over time. Pattern recognition across industries.

See real use cases →

What It Does

Most content platforms are assembled from separate services—a CMS for storage, Elasticsearch for search, a feed reader for ingestion, S3 for assets, a custom API layer to glue it all together. Honcho replaces that stack with six tightly integrated capabilities.

  • Aggregate: Crawl RSS, Atom, and HTML sources on configurable schedules. A declarative rules engine lets you define per-site extraction logic—CSS selectors, field mappings, timestamp parsing, fallback chains—in a config file instead of code. Built-in deduplication and change detection keep the index clean.
  • Ingest: Content doesn’t just come from feeds. AI assistants save articles and notes via MCP. A save-URL API endpoint supports bookmarklets and mobile shortcuts. An extract API accepts raw JSON, HTML, or XML and runs it through extraction rules. Bulk importers handle WordPress exports and other CMS migrations. Everything lands in the same index.
  • Index: Full-text search powered by Lucene 9 with a configurable text analysis pipeline. Boolean queries, time-range filters, tag and topic facets, custom numeric fields, and configurable relevance scoring.
  • Store: Persistent content storage with content hashing for change detection. Encrypted backups with cloud KMS integration. Protobuf-based metadata model.
  • Serve: GraphQL, REST, and MCP (Model Context Protocol) endpoints for search, retrieval, and content distribution. No rendering opinions—bring your own frontend, feed reader, mobile app, or AI assistant.
  • Replicate: Push and pull replication between instances. Distribute content across organizational boundaries with topic-based routing and editorial workflows.
Content flows from crawl to index to API to replication as a single transaction with a single data model.
| Concern | Typical Stack | Honcho |
| --- | --- | --- |
| Storage | PostgreSQL / DynamoDB | Built-in |
| Search | Algolia / Elasticsearch | Built-in |
| Crawling | Scrapy / custom crawlers | Built-in |
| Ingestion | Custom importers / ETL | Built-in (MCP, save URL, extract API, bulk import) |
| Assets | S3 / cloud storage | Built-in |
| API | Custom REST / GraphQL | Built-in (GraphQL + REST + MCP) |
| Sync | Custom ETL / webhooks | Built-in |
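As a sketch of what the API side of that comparison might look like from a client, the snippet below builds a GraphQL search request. The endpoint path, query fields, and argument names are illustrative assumptions, not Honcho's actual schema—the point is the shape of a headless search call:

```python
import json

# Hypothetical GraphQL search call against a Honcho instance. The endpoint
# path, field names, and arguments are illustrative: they sketch the shape
# of a headless search API, not Honcho's actual schema.
QUERY = """
query Search($q: String!, $after: String, $limit: Int) {
  search(query: $q, publishedAfter: $after, limit: $limit) {
    entries { uid title author publishedAt }
  }
}
"""

def build_search_request(q: str, published_after: str, limit: int = 20) -> dict:
    """Build the JSON body for a standard GraphQL POST request."""
    return {
        "query": QUERY,
        "variables": {"q": q, "after": published_after, "limit": limit},
    }

# A boolean, tag-filtered query of the kind described above.
body = build_search_request('"China economy" AND tag:analysis', "2015-01-01")
payload = json.dumps(body)  # POST this to the instance's GraphQL endpoint
```

The same search is equally reachable over REST or MCP; GraphQL is shown only because it makes the query shape explicit.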

What Makes It Different

  • Search is built in, not bolted on: The Lucene index is the primary read path. Content is indexed at write time with no sync lag. The query API exposes the full power of Lucene: phrase queries, field-scoped search, boolean composition, range-boosted relevance, and custom numeric fields for domain-specific ranking.
  • Structured content fragments: Content is modeled as typed fragments—paragraphs, code blocks, headings, pull quotes, recipe steps. Each fragment type is indexed separately, so you can search within specific block types: find all entries where a code block mentions HashMap, or where a heading contains authentication.
  • Declarative content extraction: A rules DSL lets you define how to extract entries from any JSON, HTML, or XML source—selectors, extractors, transforms, timestamp parsers—without writing code. Rules compose via config layering: define a generic RSS base, then override only the fields that differ per source. You can develop rules conversationally through an AI assistant—paste in your raw content, iterate on the rules until the extraction is right, then save them.
  • One system, not six: Each service in the typical stack is another deployment, another set of credentials, another failure mode, another thing to keep in sync. Honcho’s tight integration eliminates the boundaries where things break.
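To make the layering idea concrete, here is what a rules file could look like. The syntax is invented for illustration—every key name below is hypothetical, not Honcho's actual DSL—but it shows the base-plus-override pattern: a generic RSS base, with one source overriding only the fields that differ.

```yaml
# Hypothetical extraction rules (illustrative syntax, not Honcho's DSL).
bases:
  rss-default:
    entry_selector: "channel > item"
    fields:
      title:     { selector: "title" }
      link:      { selector: "link" }
      published: { selector: "pubDate", parse: rfc822 }
      body:      { selector: "description", transform: strip-html }

sources:
  example-finance-blog:
    extends: rss-default
    fields:
      # Override only what differs: this feed puts full HTML in
      # content:encoded and uses ISO-8601 dates.
      body:      { selector: "content:encoded", transform: strip-html }
      published: { selector: "dc:date", parse: iso8601 }
```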
Most content platforms are assembled from parts that weren’t designed to work together. Honcho was built as one system from the start—search, storage, ingestion, and API share the same data model, the same transaction, and the same deployment.

Architecture

Built on standard enterprise infrastructure that any Java team can deploy and maintain. No exotic dependencies, no cloud-specific lock-in, no operational surprises.

  • Runtime: Java 17 on Jetty 12 with Jakarta Servlet.
  • Search engine: Lucene 9 with unified numeric fields, near-real-time search with searcher warm-up, and proper FILTER vs MUST clause handling.
  • API layer: GraphQL for flexible queries, REST for JSON/RSS/Sitemap output, and MCP (Model Context Protocol) for AI assistant integration.
  • Data model: Protocol Buffers for the internal data model and wire format, with a purpose-built JSON encoder for browser and API clients.
  • Instrumentation: Dropwizard Metrics on every significant operation—search latency, indexing throughput, crawl rates, storage I/O.
  • Text analysis: Pluggable stemmer (KStem, Porter, minimal, none), stop words, protected words, ASCII folding, per-index analyzer selection.
  • Custom fields: Define domain-specific indexed fields via configuration, with configurable tokenization and case handling.
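A sketch of what that per-index configuration might look like. The stemmer options come from the list above; all key names and the file layout are invented for illustration, not Honcho's actual configuration format:

```yaml
# Hypothetical per-index analysis and custom-field configuration
# (key names are illustrative, not Honcho's actual format).
index:
  analyzer:
    stemmer: kstem              # kstem | porter | minimal | none
    stop_words: english
    protected_words: [HashMap]  # never stemmed or folded
    ascii_folding: true
  custom_fields:
    - name: risk_score          # numeric field for range queries and ranking
      type: float
    - name: ticker
      type: keyword             # not tokenized; exact match
      lowercase: false
```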
Every dependency is production-grade open source with a permissive license—Apache 2.0, MIT, or BSD. Deploy anywhere, license freely, and know exactly what you’re running.
Java 17 · Jetty 12 · Jakarta Servlet · Lucene 9 · Protocol Buffers · GraphQL · Java MCP SDK · MariaDB · Maven · Dropwizard Metrics

Built-in MCP Server

Honcho includes a built-in Model Context Protocol server, so AI assistants like Claude can search, retrieve, and create content directly. Tools cover search, content management, collaboration, memory, and personalized digest. Point any MCP client at the /mcp endpoint and the entire index becomes conversational. See practical use cases →

| Tool | What It Does |
| --- | --- |
| search_content | Full-text search with host, author, date range, tag, type, and sort filters |
| get_entry | Retrieve a single entry by UID or most recent |
| get_entries | Retrieve multiple entries by UID in a single call (max 25) |
| find_similar_entries | Find content similar to a given entry |
| list_hosts | List content sources, optionally filtered |
| term_frequency | Term frequency statistics for any indexed field |
| create_entry | Create a new entry with title, content, tags, topics, type, author, and metadata |
| update_entry | Update fields on an existing entry—replace or append tags/topics, merge metadata |
| delete_entry | Soft-delete an entry from the database and search index |
| tag_entries | Add or remove tags from entries matching a search query |
| list_groups | List your collaborative groups with members and shared feed status |
| discover_feeds | Discover RSS/Atom feeds from any URL |
| add_source | Add a content source and enable crawling |
| add_source_to_group | Share an existing source with a group so all members can access it |
| remove_source_from_group | Remove a source from a group |
| send_message | Send a message to a group’s shared feed |
| share_to_group | Share an existing entry to a group feed, preserving original author |
| get_status | Account overview—sources, entries, groups, and favorites |
| save_memory | Save a note that can be recalled in future conversations |
| recall_memory | Search or list saved memories |
| get_digest | Get a personalized feed from favorited authors, sources, hosts, and saved searches |
| list_favorites | List favorites with enabled/disabled status |
| add_favorite | Follow an author, source, host, or search query |
| remove_favorite | Remove a favorite from the personalized digest |
  • Read and write: AI assistants can search and retrieve content, but also create entries, update metadata, manage tags, and curate the index—all through the same MCP interface.
  • Structured retrieval: Content fragments let MCP tools return specific block types—code examples, definitions, key paragraphs—rather than dumping entire documents into the context window.
  • Multi-source aggregation: Hundreds of curated sources through one interface—industry news, internal docs, regulatory updates—without the model needing to know where each piece lives.
  • Real-time content: Continuous crawling and near-real-time indexing. Content is searchable within seconds of arrival.
  • Group collaboration: Multiple users sharing an MCP server can write to shared group feeds. “Summarize this and send it to the research group”—an AI-mediated collaboration channel where each person’s assistant contributes to a shared knowledge pool.
  • AI memory: Assistants can save and recall notes across conversations. “Remember that the client prefers weekly reports on Mondays”—and it’s there next time you ask.
  • Messaging: Lightweight messaging through group feeds. AI assistants can send messages, share summaries, and post updates on behalf of their users.
  • Cross-posting: Share entries across groups while preserving original author attribution. Add commentary when sharing—provenance metadata tracks who shared what and from where.
  • Personalized digest: Users build a personal digest by favoriting authors, sources, hosts, and search queries. “Give me my morning briefing” returns a curated feed of what matters to them, updated in real time.
  • Feed management: AI assistants can discover feeds from any URL and add them as crawled sources—“follow this site” becomes a single conversational command.
  • Multi-user isolation: Each authenticated user sees only their assigned sources. OAuth with DB-backed tokens provides secure, persistent access.
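Under the hood, MCP is JSON-RPC 2.0, so invoking one of the tools above is just a structured request. The sketch below encodes a `tools/call` message for `search_content`; the method and envelope follow the MCP specification, but the argument names are assumptions rather than Honcho's documented parameters, and transport details (HTTP session setup, OAuth bearer token) are omitted:

```python
import json

def mcp_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Encode an MCP tool invocation as a JSON-RPC 2.0 request string."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Invoke search_content. The argument names below are assumptions for
# illustration; the encoded string would be POSTed to the /mcp endpoint.
req = mcp_tool_call(1, "search_content", {
    "query": "China economy",
    "tags": ["analysis"],
    "date_from": "2015-01-01",
    "sort": "date_desc",
})
```

In practice an MCP client library (or an assistant like Claude) handles this framing automatically; the point is that every tool in the table reduces to the same simple request shape.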

Use Cases

  • Knowledge base for AI: Aggregate domain-specific content, index with structured fragments, and serve to AI assistants via the built-in MCP server. AI assistants can search, create content, manage sources, and collaborate through groups—the entire index becomes conversational.
  • Content aggregation and curation: Crawl and normalize hundreds of sources—industry publications, competitor blogs, wire services, regulatory feeds—into a single searchable interface with editorial taxonomy and workflows.
  • Headless search API: Lucene-quality full-text search as a service. Boolean queries, faceted filtering, configurable relevance, time-range constraints, and custom ranking signals—without the operational overhead of Elasticsearch or the per-query cost of hosted search.
  • Feed infrastructure: Crawl RSS/Atom/HTML, normalize to a clean structured model, and re-encode as JSON or protobuf. Per-source extraction rules handle non-standard feeds and custom HTML without code changes, with deduplication, metadata enrichment, and configurable update intervals built in.
  • Content archiving: Long-term content storage with full-text search and retrieval. Encrypted backups for compliance and retention requirements.

Status

Honcho is actively developed and available for licensing, collaboration, or investment.

The code is proprietary and designed for on-premise deployment—your data stays on your infrastructure. If any of this interests you, please get in touch.