Things You Should Know About Ebook Specs And Readers


A summary of the weird things I came across writing the XHTML to ebook transcoders in use on this site.


So, now that this blog is available in several ebook formats, I figured it'd be a good idea to sum up some of the problems I noticed while writing those scripts to transcode things on this blog to ebooks. You know, just 'cause they suck and you should know about them before you try to do yer own.

And before I start, I'd like to point out one thing that people probably aren't aware of: EPUB files are actually ZIP archives containing a bunch of XML files, which in turn use profiles very close to what's on the web right now. Amazon's kindlegen produces MobiPocket .mobi files, which are essentially Kindle ebooks without DRM. Kindles can't read EPUBs, and yet kindlegen works best if you feed it the raw, uncompressed contents of an EPUB file. I think that already sets the theme for this bit. Right, let's get busy then.

[Screenshot: not the most helpful of error messages]

Ebook Readers Actually Care About File Names

That screenshot up there is the result of trying to open an ebook with a content document inside it with the file name extension "frob" instead of "xhtml" on a 4th Generation Kindle.

Now, some of you may not be aware of this, but filenames are actually bollocks. They carry no inherent meaning. This is why on UNIX systems there are tools like file, which will tell you what kind of content is in a given file. But on Windows, there's this weird trend of assigning a special meaning to the last couple of letters after the last dot ('.') in the file name. The idea is that these couple of letters will help the operating system, and applications running on it, in figuring out what kind of file it's dealing with. Except this is completely bogus.

On Kindles, you actually give kindlegen the manifest of an EPUB ebook so it can work its magic. This manifest contains all files you're going to use, along with a MIME type that precisely specifies what type of file the Kindle will be dealing with. There is no reason whatsoever to get file names involved in any way! It makes no sense at all. Seriously, none. And the error message you get on a Kindle is really kind of useless, too. Both for the author of the ebook and the reader - because, why would either of them assume you care about the file name of a file you somehow mangled into an opaque archive. Someone pass a memo to those project leads, please!

By the way, for anyone curious about this, or those curious about whether this is solved in later firmware releases for your Kindle, here's the working Kindle file and the one that doesn't work. I really only changed the file name, which you can see if you unzip this working EPUB version and this EPUB with the contents of the broken one. Colons in filenames seem to cause similar problems with most EPUB readers, weird as that is (ZIP will handle colons just fine, and so will UNIX file systems).
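If you'd rather catch this before your reader does, here's a minimal sketch of a manifest check in Python. It only looks at the manifest of a package document and flags entries whose file name might upset a picky reader. The sample manifest and the table of "expected" extensions are my own assumptions for illustration, not something any spec mandates.

```python
import posixpath
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"

# Hypothetical package document, trimmed down to the manifest for illustration.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="toc" href="toc.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="ch1" href="chapter-1.frob" media-type="application/xhtml+xml"/>
    <item id="fig" href="figure.svg" media-type="image/svg+xml"/>
  </manifest>
</package>
"""

# Extensions that readers seem to expect per media type.
# This table is an assumption based on observed behaviour, not on a spec.
EXPECTED_EXTENSIONS = {
    "application/xhtml+xml": {".xhtml", ".html", ".htm"},
    "image/svg+xml": {".svg"},
}


def suspicious_items(opf_text):
    """Return (href, media_type) pairs whose file name might trip up a reader."""
    root = ET.fromstring(opf_text)
    found = []
    for item in root.iter("{%s}item" % OPF_NS):
        href = item.get("href", "")
        media_type = item.get("media-type", "")
        extension = posixpath.splitext(href)[1].lower()
        expected = EXPECTED_EXTENSIONS.get(media_type)
        # Flag colons in the name, and extensions that disagree with the media type.
        if ":" in href or (expected is not None and extension not in expected):
            found.append((href, media_type))
    return found


for href, media_type in suspicious_items(SAMPLE_OPF):
    print("might break on some readers:", href, "declared as", media_type)
```

The point of the sketch is that the manifest already says everything a reader needs to know; the check only exists because readers look at the file name anyway.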

Terrible SVG Support

Two things: any non-XML files are a pain to work with in an XML pipeline - SVGs are XML-based image files - and the SVG specs are old. They came out in 2001 to be precise, so in computer terms they're the equivalent of ancient Roman aqueducts. Yet for some reason, SVG image support in modern products is still kind of an afterthought, and it shows. The EPUB standards authors were nice enough to include support for these files in the EPUB standards, which is great in principle, yet they included content negotiation mechanisms so that SVG support is still technically optional. This in turn means that content authors are forced to include alternatives for their SVG content in case the ebook reader author couldn't be arsed to support this format.

Then there are those special ebook readers that might read SVGs fine as separate files, but they can't handle it if you include the SVG's source directly in the XHTML source of your EPUB. Now that's just really weird.

Fortunately, support in fourth generation Kindles is decent, except that you can't use SVGs as cover images - beats me why that is, and why kindlegen won't just transform the SVG to a format the Kindle can handle.

All in all, support is pretty half-assed, and I don't get why. SVGs are great: they scale well because it's a vector image format, the text is present as actual, searchable, highlightable text and in a pinch you can include encoded versions of JPEGs and PNGs if you really must have raster images. You'd think publishers would jump at the idea of using those files. I can see how on a device with limited processing power you'd not want to decode the files every time you encounter them, but given that at least Kindles are able to do so they could just store temporary renders of it if that really were a problem.

The Navigation Document Structure Is Bollocks

Here's the spec if you're interested. I'm picking on this one in particular because the semantic issues are the most apparent. There's similar issues in the other profiles. This applies to both EPUB and Kindle ebooks.

If you look at the specification, you'll see that its introduction defines navigation documents as the way to create tables of contents for your ebook. That's good. It then goes on to define that navigation documents must be proper XHTML content documents - only that their definition of an XHTML content document is really an HTML 5 document in XHTML notation, as specified here, and here. This is why they can then use the nav element, which is not really part of XHTML in any sense of the word, but rather an element introduced in HTML 5. Using the XHTML MIME type, namespace and doctype as in the examples is then, of course, quite misleading.

And then there's the example for that navigation document. It's really stripped down - just the nav element and some example content. Looks like it could be the whole file, except you really need all the other XHTML boilerplate. Then the example is using an h2 heading: this is bad if it's the only assumed content on the page. Hell, chances are you have a document dedicated to just the main table of contents, meaning you'll have only one heading in there and that should be an h1.

This wouldn't be such a big deal if I didn't know for a fact that content authors have some real issues with using the right heading for whatever it is they want to get across. Just have a look at a few random ebooks at Project Gutenberg, and you'll see most of them don't have much of a problem starting their document with h2 and h3 - sometimes using higher level headings later, or even using the heading levels for visual effects.

And then comes the coup de grâce: the standard forces you to use an ol, li, a construct - that's ordered list, list item and regular hyperlinks, and a good guideline in principle. Except sub-headings have to be enclosed in span-elements instead of one of the dedicated HTML heading elements! Are you kidding me? spans are meaningless, blind, inline markup elements, whose only point is that they can be used to define textual styling without conveying any semantic meaning - but these are clearly sub headings, which is an important semantic aspect that a screen reader would want to know about. Who the hell wrote this!? Use a goddamn heading, you only have like, five levels of nesting left!
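To make that structure a bit more concrete, here's a rough sketch in Python that builds the nav element from a nested list of entries. It follows my reading of the spec as described above: links become a elements, entries without a link become span elements heading a nested ol. The function names and the sample entries are made up for illustration, and you'd still need to wrap the result in the usual XHTML boilerplate.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
EPUB_NS = "http://www.idpf.org/2007/ops"

# Serialise with a default XHTML namespace and the conventional epub: prefix.
ET.register_namespace("", XHTML_NS)
ET.register_namespace("epub", EPUB_NS)


def build_list(parent, entries):
    """Append an ol to parent; entries is a list of (title, href, children) tuples."""
    ol = ET.SubElement(parent, "{%s}ol" % XHTML_NS)
    for title, href, children in entries:
        li = ET.SubElement(ol, "{%s}li" % XHTML_NS)
        if href is None:
            # No link target: this is the "sub-heading" case, which has to be a span.
            label = ET.SubElement(li, "{%s}span" % XHTML_NS)
        else:
            label = ET.SubElement(li, "{%s}a" % XHTML_NS, {"href": href})
        label.text = title
        if children:
            build_list(li, children)
    return ol


def build_nav(title, entries):
    """Build a nav element of type toc with a single h1 heading and a nested ol."""
    nav = ET.Element("{%s}nav" % XHTML_NS, {"{%s}type" % EPUB_NS: "toc"})
    heading = ET.SubElement(nav, "{%s}h1" % XHTML_NS)
    heading.text = title
    build_list(nav, entries)
    return nav


# Hypothetical table of contents: one unlinked part heading with a chapter below it.
toc = [
    ("Part One", None, [("Chapter 1", "chapter-1.xhtml", [])]),
    ("Appendix", "appendix.xhtml", []),
]
print(ET.tostring(build_nav("Contents", toc), encoding="unicode"))
```

Note how "Part One" ends up as a plain span, even though it's clearly a heading for everything nested below it.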

Hello~? Semantic markup plzkthx?

Nonexistent MathML Support

I write a lot about maths, and as anyone who does this can tell you, mathematical texts require a lot of funky formulas. You can emulate a lot of the typical mathematical layout with tables and various CSS formatting options, but getting this stuff to look nice is a huge pain in the bum. This is why all real scientists typically use LaTeX as opposed to, say, Word for writing documents: all the maths looks ugly in Word. Very. Ugly.

And this is exactly where MathML comes in. Like SVG, this is a fairly old format, only that this one allows you to mark up mathematical formulas instead of vector graphics. Like SVG it's supported in EPUB - kind of. Like SVG it's part of the new HTML 5 standard. Like SVG most decent, modern browsers support it. Like SVG it's XML-based, so it works great in your XML pipeline. Unlike SVG, your Kindle won't know what to do with MathML, so you end up with a lot of gibberish. That means you have to jump through quite a few hoops to get formulas to appear nicely in your ebooks: I ended up using a set of XSLT stylesheets that translate MathML to SVG, and then running inkscape over the output to turn the text into SVG line art. Which sucks. And you lose the structured data that was used to create those formulas, meaning they can never reflow and they look out of place in pure text paragraphs.

Check the makefile that comes with this site's source code if you're curious as to just how ugly that scripting ended up.

Who Came Up With This XML+ZIP Trend?

This has to stop. Seriously, what's up with that? First office programmes and now EPUB. Here's why this is bad: it's a pain to work with, and it doesn't make sense in the first place. EPUB is basically a set of subprofiles for XHTML, a manifest and a useless sentinel file that contains the MIME type of the archive. As a set of files, this allows the content author to structure the content documents - individual XHTML files, e.g. one for each chapter; you could also create directories for each chapter or group image files you'll be using together in some way. You then ZIP up the whole thing and you get one file to distribute.

Having a single file to lug around is a good thing, I'm not denying that. But choosing ZIP for the job just blows. First, let's not forget that XHTML, SVG, MathML and most other content are really nothing more than XML files. Working with several files in an XML pipeline is a chore; it's much nicer if you have one big XML file with all the content. You could still structure it whatever way you want if you so desired and if your XML container format supported it - and I can't be the only one thinking that way, because the folks over at Microsoft apparently got the message for their XML office file formats. You actually have a choice as to whether you want to save those as a ZIP file with XML documents inside or whether you'd instead prefer to save them in a single, large XML file.

Working with ZIP files in an XML pipeline is even worse than working with a bunch of small XML files. If you only have one huge XML file, you can just run XSLTs over it for processing. If you have a ZIP file with lots of small files in it, you need to create a script to unzip it, do your processing and then ZIP it back together.
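For the record, that unzip, process, re-zip dance looks roughly like this in Python. It's a minimal sketch and not the script used on this site: repack_epub and the transform callback are names I made up for the example, and the only EPUB-specific bit is writing the mimetype entry first and uncompressed.

```python
import zipfile


def repack_epub(source, target, transform):
    """Copy an EPUB from source to target, passing each member through transform(name, data).

    source and target may be file names or binary file objects.
    """
    with zipfile.ZipFile(source) as src, zipfile.ZipFile(target, "w") as dst:
        # The sentinel "mimetype" entry goes first and is stored uncompressed.
        dst.writestr(
            zipfile.ZipInfo("mimetype"),
            src.read("mimetype"),
            compress_type=zipfile.ZIP_STORED,
        )
        for info in src.infolist():
            if info.filename == "mimetype" or info.is_dir():
                continue
            data = transform(info.filename, src.read(info.filename))
            dst.writestr(info.filename, data, compress_type=zipfile.ZIP_DEFLATED)


def fix_typo(name, data):
    """Example transform: touch only the content documents, pass everything else through."""
    if name.endswith(".xhtml"):
        return data.replace(b"teh", b"the")
    return data
```

You'd call it as repack_epub("in.epub", "out.epub", fix_typo), with both file names being placeholders. Compare that to running a single XSLT over a single XML file and you'll see what I mean.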

Also, why'd it have to be ZIP, anyway? At least you could use something simple like CPIO or TAR. Those come in fully text-based variants, too, so you have a slight chance of processing them with XSLT.

Oh, you'd like it not just hierarchically structured but also compressed? How about taking a page from the same UNIX book that tells you all about TAR and CPIO: keep it plain and simple and in one big file, then just run gzip or bzip2 or lzma over it. There, now it's one small, compressed file. Just like you wanted. And it's pretty easy to determine what's in it, too: just uncompress the first 4k or so and parse the XML. No need to force content generators to include an uncompressed file entry with the MIME type as the first file in the archive. Not that most readers cared much about MIME types to begin with. cough
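Just to show that the "uncompress the first 4k and parse the XML" bit isn't hand-waving, here's a small sketch in Python. It assumes a hypothetical single-file ebook format that is nothing more than gzip-compressed XML; sniff_root is a made-up name, and the 4096 byte limit is just the "4k or so" from above.

```python
import gzip
import xml.etree.ElementTree as ET


def sniff_root(source, limit=4096):
    """Return the qualified name of the root element of a gzip-compressed XML file.

    Only the first `limit` uncompressed bytes are read; source may be a file
    name or a binary file object.
    """
    with gzip.open(source, "rb") as plain:
        head = plain.read(limit)
    # A pull parser copes with the document being cut off after the first chunk.
    parser = ET.XMLPullParser(events=("start",))
    parser.feed(head)
    for _event, element in parser.read_events():
        return element.tag  # the first start event belongs to the root element
    return None
```

The root element's namespace and name then tell you what you're dealing with, e.g. a DocBook book or an Atom feed, no sentinel file required.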

Why Is There No DocBook Support?

Okay this isn't specific to DocBook, really, but rather: if the lot of ye tried so hard to stick to established web tech and you sort of established that it'd be great if it were limited to semantic markup only, so that the users get to decide things like the font size and colour and there are no forced line breaks and page sizes like with PDFs but rather reflowing layouts, why would you base it on XHTML+CSS and then come up with a plethora of subprofiles for it? EPUB people, this means you! Well, the Kindle folks too, in a way.

Here's my angle: DocBook has been around for decades. It enforces semantic markup and the latest version is XML-based. Before that it was SGML, which is the meta-language HTML itself was originally defined in. Why would you even think of creating yet another set of subprofiles to XHTML to try and limit it to a feature set that is essentially DocBook, but as a crossbreed with HTML 5? And then you go and force poor semantics with some of those profiles, like the Navigation Documents.

But if you really needed it to be web-based, at the very least you could've used, like, Atom or another syndication format as the basis for EPUB. Because, you know, it's older than EPUB - 2005 vs 2007 - and already solves the problems you've been trying to solve. That'd even have solved the thing about needing a packaging format to conveniently wrap up your content in. That's what a syndication format is for: just use Atom and embed XHTML content files. That also makes it great to work with in an XML pipeline - although Atom would allow you to keep the files separate if you really wanted that. And you don't need a separate navigation document either, because that's implicit with the feed layout.
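Here's a rough sketch of what I mean, again in Python: a book as a single Atom feed, with one entry per chapter and the chapter text embedded as XHTML. The atom_book function, the identifiers and the sample chapters are all made up for illustration. The only thing taken from the Atom format itself is that content of type "xhtml" wraps a single div in the XHTML namespace.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def atom(name):
    """Qualify an element name with the Atom namespace."""
    return "{%s}%s" % (ATOM_NS, name)


def atom_book(title, author, book_id, updated, chapters):
    """Build an Atom feed; chapters is a list of (chapter_title, xhtml_fragment) pairs."""
    feed = ET.Element(atom("feed"))
    ET.SubElement(feed, atom("title")).text = title
    ET.SubElement(feed, atom("id")).text = book_id
    ET.SubElement(feed, atom("updated")).text = updated
    author_element = ET.SubElement(feed, atom("author"))
    ET.SubElement(author_element, atom("name")).text = author
    for number, (chapter_title, fragment) in enumerate(chapters, start=1):
        entry = ET.SubElement(feed, atom("entry"))
        ET.SubElement(entry, atom("title")).text = chapter_title
        ET.SubElement(entry, atom("id")).text = "%s#chapter-%d" % (book_id, number)
        ET.SubElement(entry, atom("updated")).text = updated
        content = ET.SubElement(entry, atom("content"), {"type": "xhtml"})
        # XHTML content is wrapped in exactly one div in the XHTML namespace.
        wrapper = '<div xmlns="%s">%s</div>' % (XHTML_NS, fragment)
        content.append(ET.fromstring(wrapper))
    return feed


# Hypothetical two-chapter book.
book = atom_book(
    "Example Book",
    "Jane Doe",
    "urn:example:book",
    "2012-01-01T00:00:00Z",
    [("Chapter 1", "<p>It begins.</p>"), ("Chapter 2", "<p>It ends.</p>")],
)
print(ET.tostring(book, encoding="unicode"))
```

The order of the entries is the table of contents, and the whole book is one XML file you can throw at an XSLT.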

Or, of course, you could've used a dedicated markup format for books. Like DocBook! This article is also available as a DocBook 5 file, by the way. wink

Written by Magnus Deininger.