Back in January we announced that the fantastic publishing technology team from PubFactory had joined Safari Books Online. Since then we’ve been hard at work integrating the team into our systems, and they’ve been hard at work building and maintaining search and reference products for their clients in academic publishing.

It’s been a singular experience for me as these are my former colleagues: I worked at iFactory for a number of years as a software engineer. That was my first job connected to publishing. Before that I would have self-identified as a generic “web developer.” While I had always tried to work on web projects that mattered, it was clear to me after my very first publishing project that I’d found my industry. I started Threepress in 2008 to work as a digital publishing technologist.

Threepress specialized in ebook formats and ereaders, while the PubFactory team serves reference and academic publishers. It’s been instructive for me to compare how these two worlds have diverged or converged in the five years since I last worked in the reference field.

Books aren’t data

The EPUB format is strictly XML-based. From the metadata to the table of contents to the book content, an EPUB file must be almost entirely composed of text marked up in well-defined XML schemas. Those schemas allow the EPUB book to be validated by a computer program that follows the schema and other well-defined business rules, ensuring consistent production. At the other end of the workflow, those same schemas would assure reading systems of the predictability of the books added to them.

EPUB 2 was released in 2007, though its design history extends back in the 1990s. At that time, academic publishers were among the only publishers producing and exchanging book data with retailers, mostly via library aggregators and portals. Those became natural models for the commercial ebook industry that did not yet exist. Outside of publishing, XML was “obviously” on a path to overtake historically messy HTML, and so aligning with XML was aligning with the future of web standards.

These were all reasonable assumptions based on the shape of the digital publishing industry when EPUB 2 and its predecessors were codified.

At that time, trade book publishers largely had no need for textual markup. It was not a part of their production workflow, nor was it natively how they produced “digital books”, which with few exceptions were always PDFs. (Safari Books Online was one of those exceptions as we initially required DocBook XML, but we eventually accepted PDF and later EPUB.)

Why is XML so foreign to trade publishers?

XML excels as a data exchange format for textual content with hierarchy. Dictionary entries and journal articles are data. Dictionary entries and journal articles are regular. Even when somewhat unstructured, as in a research paper, the work still has a predictable shape, and its primary goal is information exchange.

A trade book is not data. Even non-fiction trade is a work of human creativity with unpredictable contours. In programming terms, most books are BLOBs, opaque shadowy things that can be moved from system to system but whose contents cannot be inspected in a mechanical way.

Novelists don’t create data. They create books.

Books can’t be wrong

Strict XHTML as a book markup format was the solution to a problem that didn’t exist. It didn’t fit neatly into an XML-based workflow because most book publishers didn’t speak XML anyway. It didn’t align with the direction of web standards, which abandoned an XML-centric approach for good in 2009. It didn’t make ebook consumption any easier for ereaders, because the challenges in ebook display are in the CSS and UI layers. And it didn’t make writing an ereader any easier because embeddable web browsers quickly became the de facto rendering engine, and those already excelled at rendering plain old HTML.

By far the biggest advantage of XML workflows is at the time of production, where one can validate that the XML document contains all of the data that is expected in the correct order, format, and position in the hierarchy.

Books aren’t actually subject to these constraints. You can’t write an XML schema to validate that a book has one or more chapters, as it may have no chapters at all. It may not have an author. It may not have any wordsIt may not have pages.

(I’d go on, but any discussion of the heterogeneity of books inevitably devolves into one of those tedious “What is a book?” slides at publishing conferences.)

Books can’t be right

An ebook application can’t do a lot of things that an XML-driven reference application can. In design meetings I find myself striking out interesting feature after feature: we can’t aggregate indexes terms across a corpus because there’s no standardized EPUB markup for them. We can’t apply a consistent style to chapter titles because of incompetent, un-semantic markup like <p class="header">. We can’t extract quotable epigraphs or context-highlight code samples or anything that my PubFactory colleagues can dream up with their neatly ordered, well-defined XML inputs. EPUB content is a BLOB.

Some ebook systems do apply consistent styling or extract interesting information out of books, but that’s powered either by a huge amount of invisible human effort or a lot of advanced machine learning and heuristics. That capability doesn’t flow naturally out of the markup.

On the other hand, I can throw just about anything even resembling an EPUB book at our reading system — even if it’s completely invalid with HTML tag soup — and it’ll load. We have very little preprocessing necessary; XSLT, which is hard to learn and harder to master, is almost absent from our workflow. And users can upload their own books from anywhere else in the publishing ecosystem.

The paperback ebook

Since EPUB emerged, a variety of simpler formats have been proposed, usually by individuals from the technology industry. They do a better job of solving the problem of book production by capable amateurs, but don’t serve the diverse needs of the publishing industry that EPUB represents: the print-disabled who need rich semantic markup, library catalog systems that want to analyze highly granular metadata, fixed layout books, multi-lingual books, graphic novels, interactive textbooks, and on and on. Full-blown EPUB solves real problems, but as John Maxwell put it at Books in Browsers 2012, XML is a format that serves incumbents.

I hope that the next revision of EPUB allows HTML5 markup, without the leading X-, as I don’t think that XML requirement is solving any problems for anyone. Rich metadata, on the other hand, offers a great deal to the ecosystem, and is a reasonable tradeoff for authoring complexity.

Until we have an EPUB sans XHTML, it’s worth considering a lightweight subset of the format, one that represents a convention over configuration approach. A “microformat” version — EPUB: the beach novel edition — could be mechanically “upsampled” into big boy EPUB for use in the real ecosystem. It won’t solve the problem of heterogeneity in books (which is, after all, not actually a problem except to reading system developers), but it could make it easier for even experienced ebook authors to create publications without firing up an XML editor, for the majority of books that have very simple metadata requirements.  I’ll outline some ideas for that in a future post.