Fast HTML-to-Markdown extraction from any URL, for LLMs (r11y)

I keep needing the same thing: take a URL, give me back the actual content as clean markdown, with the navigation and cookie banners and decorative junk stripped out. Not the raw HTML, not a screenshot, just the words that matter in a format an LLM can read without choking on noise.

Sombra does this in the browser - you save a page, it gets converted and filed away. But a lot of the time I'm not in a browser. I'm in a terminal, or inside an agent loop, and I want to pull a page down and pipe it somewhere without leaving the command line. So I built r11y.

r11y is a fast, native CLI and small Clojure library that turns any URL into clean Markdown: boilerplate stripped, metadata kept, ready to hand to an LLM.

[Demo: r11y turning a web page into clean Markdown, then again with metadata, in well under half a second]

The name is readability squashed down to its first and last letters with a number in between, in the style of i18n or a11y. (Strictly, that scheme counts the letters between the ends, which would give r9y; the 11 here is the length of the whole word.) Say it out loud and it's also "oh rlly?", which if you're old enough to remember the owl meme is either a feature or a warning.

What it does

r11y https://example.com

That fetches the page, finds the main content, and prints markdown to stdout. Pipe it into a file, into pbcopy, into another tool, whatever you like. Add -m and you get YAML frontmatter on top, with fields such as the title, author, date, description, canonical URL, hero image and favicon:

r11y -m https://www.wired.com/story/intelligence-evolved-at-least-twice/
---
title: Intelligence on Earth Evolved Independently at Least Twice
author: Yasemin Saplakoglu
url: https://www.wired.com/story/intelligence-evolved-at-least-twice/
canonical-url: https://www.wired.com/story/intelligence-evolved-at-least-twice/
hostname: www.wired.com
sitename: WIRED
date: 2025-05-11T07:00:00.000-04:00
icon: https://www.wired.com/.../favicon.ico
image: https://media.wired.com/photos/.../Lede.jpeg
---

# Article content here...

The metadata comes from JSON-LD where it exists (walking @graph and preferring the article-typed object when a page ships several), falling back to OpenGraph, Twitter Card tags, <time> elements and date patterns in the URL. r11y also throws out the placeholder garbage you find in templated pages: things like {template.literal} and #author.fullName that slip through when a CMS doesn't fill them in.
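To make the JSON-LD step concrete, here is a minimal sketch of walking @graph, preferring an article-typed object, and rejecting unfilled template values. It is not r11y's actual code: the function names, the set of "article" types and the placeholder regex are all assumptions, and the input is JSON-LD that has already been parsed into Clojure maps with string keys.

```clojure
(require '[clojure.string :as str])

;; Assumption: these schema.org types count as "article-typed".
(def article-types #{"Article" "NewsArticle" "BlogPosting"})

(defn nodes
  "Flattens parsed JSON-LD (a map, a vector of maps, or a map with @graph)
  into a flat seq of candidate objects."
  [json-ld]
  (cond
    (sequential? json-ld) (mapcat nodes json-ld)
    (map? json-ld)        (if-let [graph (get json-ld "@graph")]
                            (nodes graph)
                            [json-ld])
    :else                 []))

(defn article-node
  "Prefers the first article-typed object, else falls back to the first node."
  [json-ld]
  (let [all      (nodes json-ld)
        types-of (fn [node]
                   (let [t (get node "@type")]
                     (if (sequential? t) t [t])))]
    (or (first (filter #(some article-types (types-of %)) all))
        (first all))))

(defn placeholder?
  "True for unfilled template values such as {template.literal} or #author.fullName."
  [s]
  (boolean (re-find #"^\s*(\{[^}]*\}|#[A-Za-z]+(\.[A-Za-z]+)+)\s*$" s)))

(defn clean-value
  "Returns the trimmed string, or nil when it is blank or a placeholder."
  [s]
  (when (and (string? s) (not (str/blank? s)) (not (placeholder? s)))
    (str/trim s)))

;; Hand-written example input, not taken from a real page.
(def page
  [{"@type" "WebSite" "name" "WIRED"}
   {"@graph" [{"@type" "Organization" "name" "WIRED"}
              {"@type"    ["NewsArticle"]
               "headline" "Intelligence on Earth Evolved Independently at Least Twice"
               "author"   "#author.fullName"}]}])

(println (get (article-node page) "headline"))
(println (clean-value (get (article-node page) "author"))) ; placeholder, so nil
```

When clean-value returns nil for a field, a real extractor would move on to the next source in the fallback chain (OpenGraph, Twitter Card tags and so on) rather than emit the placeholder.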

Boilerplate is expensive

The reason to bother with any of this is that the thing reading the page is increasingly a language model, and models pay by the token. Raw HTML is mostly not content. It's scripts, inline styles, analytics, navigation, cookie banners, three copies of the logo and a footer with forty links. People who have measured it tend to land on a 60 to 80 percent reduction in token count when they strip a page down to the actual prose. One widely shared Cloudflare figure had a single blog post going from around 16k tokens of HTML to around 3k of markdown, a cut of roughly 80 percent.
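As a quick worked version of that arithmetic, using the rounded 16k and 3k figures quoted above (the function name and the 1,000-page corpus are mine, purely for illustration):

```clojure
(defn reduction-pct
  "Percentage of tokens removed when a page shrinks from `before` to `after`."
  [before after]
  (* 100.0 (/ (- before after) before)))

(defn tokens-saved
  "Total tokens saved across `pages` pages of the same size."
  [before after pages]
  (* pages (- before after)))

(println (reduction-pct 16000 3000))     ; 81.25
(println (tokens-saved 16000 3000 1000)) ; 13000000
```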

That waste compounds. Feed the soup straight into a model and you pay for the junk, you push the real content further down the context window where attention is weaker, and you hand the model more chances to fixate on a "subscribe to our newsletter" interstitial than on the thing you actually asked about. Clean input is about the cheapest quality lever there is, and almost all of it sits upstream of which model you picked.

Speed: how it compares to trafilatura, Defuddle and readability-rust

r11y is compiled to a native binary with GraalVM, so it starts in around 40ms. There's no JVM warmup, no Node process spinning up, no Python interpreter. For a single call that hardly matters. For an agent firing off dozens of fetches in a loop, or a script churning through a list of URLs, the startup cost is the thing that quietly dominates, and shaving it to nothing changes how the whole thing feels.

I ran it against a few of the well-known extractors on representative pages, 5 runs each, wall-clock seconds including the network fetch:

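| Page                   | r11y | defuddle | trafilatura | readability-rust |
|------------------------|------|----------|-------------|------------------|
| Long essay (~8k words) | 0.43 | 0.74     | 0.99        | 0.57             |
| Docs page (Cloudflare) | 0.26 | 0.54     | 0.53        | 0.36             |
| Next.js landing page   | 0.30 | 0.63     | 0.52        | 0.27             |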

That makes r11y consistently around 1.5 to 2x faster than the typical Node and Python tools. Against readability-rust it's ahead on larger pages and roughly tied on small ones. Every figure includes the network fetch, which costs all the tools about the same and adds noise between runs, so the gap in actual processing time is wider than the numbers suggest.
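If you want the ratios rather than the raw seconds, they fall straight out of the table. A small sketch, with the timings copied from the table above and the helper name made up for the example:

```clojure
;; Wall-clock seconds, copied from the benchmark table above.
(def timings
  {"Long essay (~8k words)" {:r11y 0.43 :defuddle 0.74 :trafilatura 0.99 :readability-rust 0.57}
   "Docs page (Cloudflare)" {:r11y 0.26 :defuddle 0.54 :trafilatura 0.53 :readability-rust 0.36}
   "Next.js landing page"   {:r11y 0.30 :defuddle 0.63 :trafilatura 0.52 :readability-rust 0.27}})

(defn speedups
  "How many times longer each other tool took than r11y, rounded to two places."
  [row]
  (let [base (:r11y row)]
    (into {}
          (for [[tool secs] (dissoc row :r11y)]
            [tool (/ (Math/round (* 100.0 (/ secs base))) 100.0)]))))

(doseq [[page row] timings]
  (println page (speedups row)))
```

On these three rows defuddle and trafilatura come out at roughly 1.7 to 2.3 times r11y's wall-clock time, and readability-rust at roughly 0.9 to 1.4 times. Since every timing includes the network fetch, these are ratios of total time, not of extraction time alone.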

The annoying parts of the modern web

Most of the work in r11y is not the happy path. It's the pages that fight you.

Servers that already have markdown. A growing number of docs sites can hand you markdown directly. r11y sends Accept: text/markdown and uses the response as-is when the server honours it. Cloudflare-fronted sites are a recurring offender here: they'll return a markdown body but label it text/html. r11y sniffs the body and recognises it anyway, then strips and rebuilds any upstream frontmatter into its own schema.
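The body sniffing can be pictured roughly like this. It is a guess at the shape of the heuristic, not r11y's implementation: the function names and the specific markers checked (headings, list items, fences, a leading --- block) are assumptions.

```clojure
(require '[clojure.string :as str])

(defn looks-like-markdown?
  "Heuristic: the body does not open with a tag and contains a line that
  starts like markdown (frontmatter fence, ATX heading, list item or code fence)."
  [body]
  (let [trimmed (str/triml body)]
    (and (not (str/starts-with? trimmed "<"))
         (boolean (re-find #"(?m)^(---\s*$|#{1,6} \S|[-*] \S|```)" trimmed)))))

(defn split-frontmatter
  "Returns [frontmatter body]. frontmatter is nil when there is no leading --- block."
  [text]
  (if-let [[_ fm body] (re-matches #"(?s)---\r?\n(.*?)\r?\n---\r?\n?(.*)" text)]
    [fm body]
    [nil text]))

;; A response labelled text/html whose body is really markdown.
(def response
  {:content-type "text/html"
   :body         "---\ntitle: Docs\n---\n# Heading\n\nBody"})

(when (looks-like-markdown? (:body response))
  (let [[upstream-frontmatter markdown] (split-frontmatter (:body response))]
    (println "upstream frontmatter:" upstream-frontmatter)
    (println markdown)))
```

The point of splitting rather than passing the body through is that the upstream frontmatter can then be re-emitted in one consistent schema instead of whatever each site happens to use.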

React and Next.js semantic soup. Frameworks love to render <div role="paragraph"> and <div role="list"> instead of actual paragraphs and lists. Extraction algorithms that look for real HTML structure miss all of it. r11y normalises those role-divs back into proper elements before it starts pruning, so the structure survives.
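A sketch of that normalisation step, under a simplifying assumption: the tree here is hiccup-style vectors with an attribute map in second position, whereas r11y itself works on JSoup elements. The role-to-tag mapping (including listitem) is also an assumption for the example.

```clojure
(require '[clojure.walk :as walk])

;; Assumed mapping from ARIA roles to the elements they stand in for.
(def role->tag
  {"paragraph" :p
   "list"      :ul
   "listitem"  :li})

(defn normalise-node
  "Rewrites [:div {:role \"paragraph\"} ...] to [:p {} ...]; leaves other nodes alone."
  [node]
  (if (and (vector? node)
           (= :div (first node))
           (map? (second node)))
    (if-let [tag (role->tag (:role (second node)))]
      (into [tag (dissoc (second node) :role)] (drop 2 node))
      node)
    node))

(defn normalise-roles
  "Applies normalise-node bottom-up across the whole tree."
  [tree]
  (walk/postwalk normalise-node tree))

(println
 (normalise-roles
  [:div {:class "post"}
   [:div {:role "paragraph"} "Hello"]
   [:div {:role "list"}
    [:div {:role "listitem"} "One"]]]))
;; prints [:div {:class post} [:p {} Hello] [:ul {} [:li {} One]]]
```

Doing this before any pruning matters because the pruning heuristics score real paragraphs and lists; a div with a role looks like anonymous layout to them.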

Decorative noise. Inline SVG icons, spacer images, layout tables masquerading as data tables, the same logo repeated fifteen times as UI chrome. All stripped. Tables with no <th> and border=0 get unwrapped rather than rendered as a garbled markdown grid.
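The layout-table rule is simple enough to write down. Again this is a sketch over hiccup-style vectors rather than JSoup, with hypothetical function names, and it only encodes the two signals mentioned above (no th anywhere, border of 0).

```clojure
(defn elements
  "The node and all element descendants, depth-first."
  [node]
  (tree-seq vector? #(filter vector? (rest %)) node))

(defn layout-table?
  "True when the table has no :th anywhere and declares border=0."
  [table]
  (let [attrs (if (map? (second table)) (second table) {})
        tags  (set (map first (elements table)))]
    (and (not (contains? tags :th))
         (= "0" (str (:border attrs))))))

(defn unwrap-table
  "Returns the contents of each :td in order, dropping the table scaffolding.
  Does not special-case nested tables."
  [table]
  (->> (elements table)
       (filter #(= :td (first %)))
       (mapcat #(remove map? (rest %)))
       vec))

(def layout [:table {:border "0"} [:tr [:td [:p "Left"]] [:td [:p "Right"]]]])
(def data   [:table {:border "1"} [:tr [:th "Name"]] [:tr [:td "r11y"]]])

(println (layout-table? layout) (unwrap-table layout))
(println (layout-table? data))
```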

GitHub. Point it at a repo and it pulls the README. Point it at a blob URL and it fetches the raw content while keeping the metadata. This one I use constantly.
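For the blob case, the usual trick is to rewrite the URL to its raw.githubusercontent.com equivalent. I'm assuming that is roughly what happens here; the function name and the example repository are hypothetical, and this simple regex treats the first path segment after /blob/ as the ref, so branch names containing slashes would need more care.

```clojure
(defn blob->raw
  "Rewrites a github.com blob URL to its raw.githubusercontent.com equivalent.
  Returns nil for anything that is not a blob URL."
  [url]
  (when-let [[_ owner repo git-ref path]
             (re-matches #"https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)" url)]
    (str "https://raw.githubusercontent.com/" owner "/" repo "/" git-ref "/" path)))

(println (blob->raw "https://github.com/some-user/some-repo/blob/main/docs/intro.md"))
(println (blob->raw "https://github.com/some-user/some-repo")) ; not a blob URL, so nil
```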

What it won't do. r11y works on the HTML a server actually returns, and it doesn't run JavaScript. For server-rendered pages, including most React and Next.js sites that ship real markup, that's exactly what you want. For a pure single-page app where the content only materialises after client-side scripts run, there's nothing in the initial HTML to extract, and you'd want a headless browser in front of it first. That's the same boundary trafilatura and the other static extractors live with.

Use it from babashka without GraalVM

The extractor is a plain library underneath the CLI, and it's babashka-compatible. So if you're scripting and don't want to install the native binary, you can pull it straight from git:

bb -Sdeps '{:deps {io.github.dazld/r11y {:git/tag "v1.0.7" :git/sha "59a594a"}}}' \
  -e '(require (quote [r11y.lib.html :as html]))
      (println (:markdown (html/extract-content-from-url "https://example.com" :format :markdown)))'

bb resolves the dep, pulls JSoup transitively, and runs it. No GraalVM, no build step.

The CLI prints markdown and stops there, but the library hands back a lot more. extract-content-from-url returns a map:

{:markdown "# Clojure\n..."          ; the rendered markdown
 :links    [{:text "Why Clojure?" :url "http://..."} ...]
 :images   [{:alt "" :url "https://..."} ...]
 :metadata {:title "Clojure" :sitename "Wikipedia" :date "..." ...}}

:links and :images are deduplicated, sorted, and pulled from the main content rather than the whole page, so you get the article's real links and figures without the nav and footer noise. The cleaned HTML and the live JSoup element are tucked onto the result's metadata too, if you want to run your own selects over the subtree. Run that against the Clojure article on Wikipedia and you get around 355 links and its two actual images alongside the markdown.

That structured return is what makes r11y a pipeline component rather than a fetch-and-print. From one call you can follow :links to crawl, collect :images, index on :metadata, and hand :markdown to a model.
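Here is what fanning one result out to those four uses might look like. The result map below is a hand-written stand-in shaped like the one shown above, not real r11y output, and crawl-queue, index-entry and prompt are hypothetical helpers.

```clojure
;; Stand-in for the map returned by extract-content-from-url.
(def result
  {:markdown "# Clojure\n\nClojure is a dialect of Lisp."
   :links    [{:text "Why Clojure?" :url "https://example.com/why"}
              {:text "Lisp"         :url "https://example.org/lisp"}]
   :images   [{:alt "" :url "https://example.com/logo.png"}]
   :metadata {:title "Clojure" :sitename "Wikipedia"}})

(defn crawl-queue
  "URLs from :links that stay on the given host, ready to fetch next."
  [result host]
  (->> (:links result)
       (map :url)
       (filter #(= host (.getHost (java.net.URI. %))))
       distinct
       vec))

(defn index-entry
  "A small record to store in a search index or database."
  [result]
  {:title       (get-in result [:metadata :title])
   :site        (get-in result [:metadata :sitename])
   :image-count (count (:images result))
   :link-count  (count (:links result))})

(defn prompt
  "Wraps the markdown in an instruction for a model."
  [result]
  (str "Summarise the following page.\n\n" (:markdown result)))

(println (crawl-queue result "example.com"))
(println (index-entry result))
(println (prompt result))
```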

Installing r11y

On macOS (arm64) or Linux (x86_64):

brew install dazld/tap/r11y

Or grab a binary from the releases page, or build it yourself - there's a Makefile and a Docker path for cross-compiling to Linux.

Where it sits, and a real use for it

This started as a piece of the Sombra story: getting clean, AI-ready content out of messy web pages. Sombra solves that for browsing, where the page is already in front of you and authentication is handled. r11y solves it for everything else, the headless and scripted and agent-driven cases where there's no browser and you just want the text.

The use I'm most pleased with so far is in an onboarding flow. When a new customer signs up, we point r11y at their website, pull the readable content, and hand it to a model to bootstrap a first cut of their profile: what they do, how they describe themselves, the language they use. It turns a blank form into something already half filled in, from a single URL.

The other end of the spectrum is bulk ingestion. When you're feeding a wider pipeline that pulls in many thousands of pages, the per-page cost stops being a rounding error and becomes the whole budget. Run r11y as the library in-process and there's no subprocess or JVM spin-up to pay per page, just the extraction. Being consistently faster than the Node and Python options then stops reading as milliseconds and starts reading as hours of wall-clock time once you multiply it across the corpus. The fact that it's the same code whether I call it on one URL from the terminal or fan it across a queue is a lot of why I reach for it.
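A rough sketch of the in-process fan-out, assuming nothing about r11y beyond "a function from URL to result map". The ingest helper and fake-extract are made up for the example; in real use you would pass something like #(html/extract-content-from-url % :format :markdown), the call shown in the babashka one-liner earlier.

```clojure
(require '[clojure.string :as str])

(defn ingest
  "Runs extract-fn over urls in parallel, recording a failure per URL
  instead of letting one bad page abort the batch."
  [extract-fn urls]
  (->> urls
       (pmap (fn [url]
               (try
                 {:url url :ok? true :result (extract-fn url)}
                 (catch Exception e
                   {:url url :ok? false :error (ex-message e)}))))
       doall))

;; Stand-in extractor so the sketch runs without the network.
(defn fake-extract [url]
  (if (str/includes? url "bad")
    (throw (ex-info "fetch failed" {:url url}))
    {:markdown (str "# " url)}))

(doseq [row (ingest fake-extract ["https://example.com/a"
                                  "https://example.com/bad"])]
  (println row))
```

pmap is the bluntest possible choice here; a real pipeline would more likely bound concurrency per host and retry failures, but the shape (one function, many URLs, errors as data) stays the same.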

The active centre of gravity for this kind of tool is JavaScript and Python - Readability and Defuddle in JS, trafilatura in Python, Firecrawl and friends as hosted services. The JVM is not empty: Crux sits on JSoup and pulls article content plus metadata, boilerpipe goes back to a 2010 paper and is still influential, and there are Readability ports like JReadability and Readability4J. So extraction itself is well-trodden ground here.

What's unusual is the combination r11y goes for. Most of those are libraries you embed in a running JVM, so used on their own they pay startup on every call, and they hand you cleaned HTML or plain text rather than the markdown-with-frontmatter that an LLM actually wants. Outside of Crux, several are quiet these days, and the older ports predate the modern web they now have to survive - React role-div soup, Cloudflare serving markdown under an HTML content-type, the GitHub special cases. A maintained, native-compiled, markdown-and-metadata-first extractor you can drop into a shell pipe or an agent loop is the part I couldn't find. JSoup hands you the parsing, but the heuristics, the metadata handling, the markdown conversion and all the awkward real-world cases are yours to build. So I built them.

It's the tool I reach for when I want an LLM to read something on the web and not have to argue with the markup first. Feeding models clean context is half of getting good results out of LLM tooling; this is that half showing up at the input. Source is on GitHub, MIT licensed. If you end up using it in something interesting, I'd like to hear about it.

FAQ

How do I convert a web page to Markdown from the command line?

Install r11y (brew install dazld/tap/r11y) and run r11y <url>. It fetches the page, extracts the main content, and prints Markdown to stdout, so you can redirect it to a file or pipe it into another tool.

What's the fastest HTML-to-Markdown tool for feeding LLMs?

In my benchmarks r11y runs 1.5 to 2x faster than the common Node and Python extractors, mostly because the GraalVM binary has no interpreter or VM startup to pay. That gap matters most when you're ingesting many pages rather than one.

Can I use it without installing anything?

Yes. It's a babashka-compatible library, so a bb -Sdeps one-liner pulls it from git and runs it with no native binary and no build step.

Does it return more than Markdown?

As a library, yes. A single call hands back the Markdown plus deduplicated lists of links and images and a metadata map, which is what makes it useful as a building block in a larger ingestion pipeline.


Tags: clojure llm web-scraping cli content-extraction ai markdown

Copyright © 2026 Dan Peddle