Extracting Content From the Chaos of the Web

Introducing the Mercury Web Parser.

The web is full of web pages. But it’s hard to gain access to the content inside. Is this an ad or a headline? Is this a byline or a paragraph? The problem is simple: A huge portion of the web’s best content has no API. When you need an API for the web’s content, that’s when we hope you’ll reach for the Mercury Web Parser.

The Mercury Web Parser is a new API from Postlight Labs that extracts meaningful content from the chaos of any web page — in particular from articles, blog posts, and similar web content.

Consider the image below. On the left is a page from the New York Times. No filler; just pure content. On the right are the first 100 or so words of the same article as sent by the New York Times to your web browser — the source.

A fraction of the HTML on any page is devoted to laying out the page’s pure content. The rest is dedicated to meta tags, sidebars, headers, footers, advertisements, and all sorts of information that’s useful in the context of your web browser but useless when you want to do something interesting with the content.

Stop Scraping By

There are lots of ways to extract data from a web page. You could write your own scraper, and that’ll work, but bespoke scrapers are tedious to write and notoriously brittle: if the page’s markup changes, the scraper stops working.
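To make that brittleness concrete, here is a minimal sketch of a hand-rolled scraper using only Python's standard library. The class name `headline` and the sample markup are invented for illustration; the point is only that the scraper is tied to one exact tag and class.

```python
from html.parser import HTMLParser


class HeadlineScraper(HTMLParser):
    """Collects the text inside <h1 class="headline"> and nothing else."""

    def __init__(self):
        super().__init__()
        self._in_headline = False
        self.headline = ""

    def handle_starttag(self, tag, attrs):
        # Hard-coded to one tag and one class name: this is the brittle part.
        if tag == "h1" and dict(attrs).get("class") == "headline":
            self._in_headline = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_headline = False

    def handle_data(self, data):
        if self._in_headline:
            self.headline += data


def scrape_headline(html):
    """Return the headline text, or an empty string if nothing matched."""
    scraper = HeadlineScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.headline.strip()


# Hypothetical markup before and after a site redesign.
before = '<h1 class="headline">Big News</h1><p>Story text.</p>'
after = '<h1 class="article-title">Big News</h1><p>Story text.</p>'

print(repr(scrape_headline(before)))  # 'Big News'
print(repr(scrape_headline(after)))   # '' because the class was renamed
```

Nothing about the article changed between the two pages, yet the scraper silently returns nothing for the second one.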

In contrast, the Mercury Web Parser API extracts pure content from any article on the web. Where most web pages include markup for ads, sidebars, and chum, the Mercury Web Parser sees — and returns — pure content.
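As a rough sketch of what calling such an API could look like from Python: the endpoint URL, the `x-api-key` header name, and the JSON response shape below are assumptions, not details given in this post, so check them against the documentation you receive when you sign up.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; this post does not state the actual URL.
PARSER_ENDPOINT = "https://mercury.postlight.com/parser"


def build_request(article_url, api_key, endpoint=PARSER_ENDPOINT):
    """Build a GET request asking the parser for a single article."""
    query = urllib.parse.urlencode({"url": article_url})
    return urllib.request.Request(
        f"{endpoint}?{query}",
        headers={"x-api-key": api_key},  # assumed header name
    )


def parse_article(article_url, api_key, opener=urllib.request.urlopen):
    """Send the request and decode the JSON body into a dict.

    `opener` is injectable so the function can be exercised without
    touching the network.
    """
    with opener(build_request(article_url, api_key)) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `parse_article("https://example.com/story", "YOUR_API_KEY")` would then return a dictionary. Fields like `title` and `content` are a plausible shape for the result, but they are also an assumption here.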

Right now, the Mercury Web Parser is being used by our AMP Converter to make any website AMP-friendly with just one line of code. It’s working incredibly well there, turning millions of URLs into AMP pages. We can see lots of other ways that Mercury can be used:

  1. Migrating legacy content into new CMSes. Migrations are notoriously time-consuming and filled with edge cases. You can use Mercury to quickly grab the content of a web page and put it into a new database. And that will work for millions of URLs.
  2. Generating mobile experiences. We use the Mercury Parser to build great AMP pages. But AMP is not the end-game for mobile. You can take your entire website, run it through Mercury, and build on the scaled-back version to create a mobile web (or app) experience.
  3. Creating new experiences for new platforms. The future is not just words on a screen. There’s a panoply of speaking, whirring, humming devices, all of which need content-rich experiences. There’s Amazon Echo, Siri, and enhanced ebook formats, and with the Mercury Parser, your web content is ready for them.
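The migration case in the first item might be sketched like this. The `parse` argument is a hypothetical callable that takes a URL and returns a dictionary with `title` and `content` keys (an assumed shape), and the SQLite table layout is invented for illustration.

```python
import sqlite3


def migrate(urls, parse, db_path=":memory:"):
    """Parse each URL and store the result; returns the open connection.

    `parse` is a hypothetical callable: URL in, dict with "title" and
    "content" out. Failures are recorded per URL instead of aborting.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, title TEXT, content TEXT, error TEXT)"
    )
    for url in urls:
        try:
            article = parse(url)
            row = (url, article.get("title"), article.get("content"), None)
        except Exception as exc:  # migrations are full of edge cases
            row = (url, None, None, str(exc))
        # INSERT OR REPLACE keeps reruns of the same URL idempotent.
        conn.execute("INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?)", row)
    conn.commit()
    return conn
```

Recording the error next to the URL, rather than stopping at the first failure, leaves a list of pages to revisit by hand once the bulk of the content has moved.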

But that’s just the beginning: In time, we hope it becomes the go-to tool used by programmers, artists, and hackers across the world who want to remake and remix the web.

The Mercury Web Parser is available now, for free. Sign up today.

Postlight is a digital product studio in NYC. We design and build great apps and experiences for our clients and for ourselves. If you’d like to work with us, get in touch.

Story published on Oct 26, 2016.