Formatting WordPress HTML Content for Instant Articles

by Edward on

There were three main factors motivating us to write this plugin:

1. Integrate the Facebook Instant Articles API so we could use live syncing and analytics.
2. Gain better control over Instant Articles options, both globally and on a per-article basis.
3. Produce better Instant Articles markup from the WordPress-generated HTML.

In this blog post I’m going to talk about how we tackled number three.

The basic problem is that the HTML structure for Facebook Instant Articles is very strict, while the HTML content produced by WordPress can be very messy. This is especially true when it comes to images, oEmbeds, and HTML generated by shortcodes. Often, large portions of an article would be incorrectly formatted, images wouldn’t show, embeds wouldn’t be present, or the article content would be missing altogether. Not really ideal. Of course, you can always go into the Instant Articles management console on Facebook and correct the markup manually, but that is time consuming and the fixes get overwritten every time the article is updated in WordPress.

The Problem

All the existing plugins we tried, as well as Facebook’s own PHP library, assume that the HTML you’re producing is already clean and strictly structured. For example:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed faucibus urna a blandit semper.</p>

<img src="my_image.jpg" />

<p>Nunc facilisis aliquet orci vel iaculis.</p>

This would convert perfectly well to an Instant Article. However, we found that in WordPress the content is more likely to look like this:


<p>
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed faucibus urna a blandit semper.
  <div>
    <img src="my_image.jpg" />
  </div>
  Nunc facilisis aliquet orci vel iaculis.
</p>

In this case the image either wouldn’t show up or would be moved out of position and appended to the end of the paragraph. In Instant Articles, images need to be top-level elements and not wrapped in p tags, otherwise Facebook just ignores them; divs are also invalid. Either way, the article didn’t really work. And that’s quite a simple example: when it came to embeds, shortcodes and JavaScript, it was much worse.

The Solution

After a lot of tinkering with what was possible and what we needed to achieve, we ended up writing our own content parser. From the get-go we designed it specifically to format and adapt messy HTML correctly and to deal with embeds, images and anything else you might come across in a WordPress post.

Taking the above code as an example: all divs are removed by default and all p tags are parsed. If an image is found inside a paragraph, the paragraph is broken in two, with the correctly formatted image placed in between, like this:


<p>
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed faucibus urna a blandit semper.
</p>

<figure>
  <img src="my_image.jpg" />
</figure>

<p>
  Nunc facilisis aliquet orci vel iaculis.
</p>
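
The transformation above can be sketched in a few lines. This is only an illustration of the idea, written in Python with the standard library, and not the plugin’s actual code: the names ParagraphSplitter and split_paragraphs are hypothetical, and the sketch assumes that every tag other than p and img can simply be dropped while its text is kept. A real parser would need to preserve inline markup such as links and emphasis.

```python
from html import escape
from html.parser import HTMLParser


class ParagraphSplitter(HTMLParser):
    """Hypothetical helper: lifts images out of paragraphs and drops divs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []  # finished top-level blocks, in document order
        self.text = []    # text collected for the paragraph being built

    def flush_paragraph(self):
        # Emit the collected text as a paragraph, skipping empty ones.
        content = " ".join("".join(self.text).split())
        if content:
            self.blocks.append("<p>" + escape(content, quote=False) + "</p>")
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Close the paragraph in progress so the image lands at top level.
            self.flush_paragraph()
            src = dict(attrs).get("src") or ""
            self.blocks.append('<figure><img src="%s" /></figure>' % escape(src))
        elif tag == "p":
            self.flush_paragraph()
        # Assumption: div and all other tags are dropped, their text is kept.

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()

    def handle_data(self, data):
        self.text.append(data)


def split_paragraphs(html):
    """Return the content as newline-separated top-level blocks."""
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # keep any trailing text outside a closed <p>
    return "\n".join(parser.blocks)
```

Calling split_paragraphs on the messy example from earlier yields two paragraphs with a figure between them. A tokenizing parser is used here rather than a strict DOM parser because the input is not validly nested in the first place.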

We parse and remove any empty elements, images that don’t work, and featured images that are repeated in the article body, all of which contravene Instant Articles syntax. We don’t stop there either: invalid elements are converted or removed, embeds are wrapped, analytics and ads can be added automatically, and shortcodes are correctly parsed, amongst other things.
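
As a rough illustration of the first two clean-up steps, here is a minimal Python sketch that filters a list of already separated top-level blocks. It is not the plugin’s code: clean_blocks, the featured_src parameter and the list of tags treated as “empty” are all assumptions made for the example, and detecting images that don’t load is left out entirely.

```python
import re

# Assumption: these are the block tags we treat as removable when empty.
EMPTY_ELEMENT = re.compile(
    r"<(p|h1|h2|li|blockquote)\b[^>]*>(?:\s|&nbsp;)*</\1>", re.IGNORECASE
)
IMG_SRC = re.compile(r'<img\b[^>]*\bsrc="([^"]*)"', re.IGNORECASE)


def clean_blocks(blocks, featured_src):
    """Drop empty elements and any image that repeats the featured image."""
    cleaned = []
    for block in blocks:
        # Skip elements containing nothing but whitespace or &nbsp;.
        if EMPTY_ELEMENT.fullmatch(block.strip()):
            continue
        # Skip a body image whose src matches the featured image.
        match = IMG_SRC.search(block)
        if match and match.group(1) == featured_src:
            continue
        cleaned.append(block)
    return cleaned
```

Regular expressions are only workable here because each block is assumed to be a single, already normalised element; on raw WordPress output a proper parser is the safer choice.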

Now, obviously we’re not saying it’s perfect. There’s still a lot of testing to be done and small bugs to work out, but we’re very happy so far (honestly, it wasn’t easy!). It’s currently being used in production on a number of large sites, and the number of incorrectly formatted articles has dropped significantly. Our eventual goal is to get that number to practically zero. Alongside the new features we’re planning to release over the next few months, you can rest assured that we’ll also be working hard on making the content parser better and better.

If you’d like to help us test, and you find any content it fails to parse correctly, please get in touch and let us know. We’d love to hear from you.