Brief Musings on HTML Semantics

I have decided to invest in a copy of Vellum, the software that everyone agrees is the most hassle-free when it comes to preparing a manuscript for self-publishing. To use it, I also had to purchase a refurbished MacBook. In the near future, after I have some other things squared away, I’ll be looking into commissioning cover art, and after that, the first two volumes of Jake and the Dynamo will appear, probably within a month or two of each other.

I don’t have my new software yet because I’m waiting on the computer, but I wanted to try an experiment in the meantime. According to everything I have read, Amazon charges self-published authors a delivery fee based on the size of an eBook’s file (at least under its higher royalty option, as I understand it), which eats into the profit from each sale, so there is a motive to make the file as small as possible. To that end, I set out to create the smallest and cleanest Word document I could before converting it through Vellum, to see if it makes a difference in the final file size.

As it turned out, setting up this experiment took me a ridiculously long time, though future manuscripts I produce will be better for it, since I now have a clean, reusable template to work from. The DOCX files used by Word are infamous for “cruft,” that is, bloated code: If you manually change your font, for example, the default font is still lurking in the code of your document and will get repeated in every paragraph you make. The only way I know of to fix this is to edit the template and set up Word’s so-called “styles” with all the formatting changes you’re planning to use throughout the document.
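For the curious, here is a minimal Python sketch of how one might measure this kind of cruft. It assumes that manual font changes show up as `w:rFonts` elements in the `word/document.xml` part of the DOCX package, which is my understanding of the format. The function names and the sample XML are made up for illustration, and the sample package is far too bare for Word to actually open.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by the main WordprocessingML document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical body: two paragraphs with a manual font override, one styled paragraph.
SAMPLE_DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:rPr><w:rFonts w:ascii="Garamond"/></w:rPr><w:t>One.</w:t></w:r></w:p>'
    '<w:p><w:r><w:rPr><w:rFonts w:ascii="Garamond"/></w:rPr><w:t>Two.</w:t></w:r></w:p>'
    '<w:p><w:pPr><w:pStyle w:val="BodyText"/></w:pPr><w:r><w:t>Three.</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def make_sample_package(document_xml: str) -> bytes:
    """Zip the XML into a stand-in package (not a complete, openable DOCX)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


def count_direct_font_overrides(docx_bytes: bytes) -> int:
    """Count font settings written directly into the body text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    return sum(1 for _ in root.iter(f"{{{W_NS}}}rFonts"))


if __name__ == "__main__":
    sample = make_sample_package(SAMPLE_DOCUMENT_XML)
    print(count_direct_font_overrides(sample))
```

In a document formatted entirely through styles, the count should drop toward zero, because the font is then declared once in the style definitions rather than repeated in the body.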

Basically, I created a new template, copied and pasted the entire plain text of the first volume of Jake and the Dynamo, and reformatted it by hand. That took … longer than I expected. But the end result was a DOCX file half the size of the original.

DOCX files, I have learned, are actually ZIP archives full of XML (eXtensible Markup Language) files, and XML is a close cousin of the HTML on which webpages are built. Some of Word’s “styles” correspond closely to HTML tags, and at least a few of these are important for accessibility and proper document structure. Most importantly, all chapter headings in a manuscript should be made with actual heading styles, both so you can automatically generate a table of contents in the finished product and so the document will be navigable to anyone using assistive technologies such as screen readers. Using headings on web pages is important for the same reason, which is why I’ve been going back to edit some of my longer posts.

While I was rebuilding my manuscript, I got to thinking about other HTML semantics. One goofy feature of HTML is that, basically as an artifact of its developmental history, it has at least three ways that I know of to mark text so a user agent will render it in italics. Originally, there was the <i> tag, which just meant italic. But over time, the decision was made that HTML should be purely a semantic language while visual appearances should be handled by a separate style sheet language, CSS (Cascading Style Sheets). So the current HTML spec reimagines the <i> tag as standing for “idiomatic,” denoting text that is in a different mood or voice. The spec gives the concrete examples of names of ships or taxonomic designations, which are conventionally rendered in italics in English.

Now, in addition to the already-existing tag, there is <em>, which is also rendered in italics by default, but which denotes text that is emphasized. Then there is <cite>, which user agents also render in italics by default, but which is for titles of creative works—even though not all titles should be italicized.
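To see how a given page divides its italics among these three tags, something like the following Python sketch would do. It uses only the standard library's `html.parser`; the class name, function name, and sample sentence are invented for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# The three tags discussed above that browsers italicize by default.
ITALIC_BY_DEFAULT = {"i", "em", "cite"}


class ItalicTagCounter(HTMLParser):
    """Tally opening tags that are rendered in italics by default."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag names already lowercased.
        if tag in ITALIC_BY_DEFAULT:
            self.counts[tag] += 1


def count_italic_tags(html: str) -> Counter:
    parser = ItalicTagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    sample = (
        "<p>Aboard the <i>Nautilus</i>, I <em>told</em> you to read "
        "<cite>Moby-Dick</cite>, and I meant it <em>twice</em>.</p>"
    )
    print(count_italic_tags(sample))
```

All four marked phrases in the sample would look identical on screen; only the markup records which is a ship, which is a title, and which is emphasis.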

Microsoft Word, of course, allows you to put text in italics while you’re typing, but it also has “emphasis” as one of its default styles. With a little fiddling around, I figured out, sure enough, that the two are stored differently: pressing the italic button applies direct italic formatting, Word’s equivalent of an <i> tag, whereas the emphasis style is recorded as a named style, the equivalent of <em>. So, get this, I actually differentiated between the two in my document, using emphasis only for emphasized words and using standard italics for conventions and titles (there is no way, as far as I can tell, to insert a <cite> tag in a Word document).
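Here is a rough Python sketch of how that difference could be checked inside the document's XML. It rests on two assumptions about the DOCX format as I understand it: the italic button writes a `w:i` property on the run of text, while the emphasis style is stored as a `w:rStyle` reference whose value is `Emphasis` (the style ID may differ in non-English versions of Word). The function name and sample XML are hypothetical, and the sketch ignores the case where italics are explicitly switched off.

```python
import xml.etree.ElementTree as ET

# Namespace used by the main WordprocessingML document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical paragraph: a ship name in direct italics, a stressed word
# in the Emphasis style, and one plain run.
SAMPLE_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:rPr><w:i/></w:rPr><w:t>Nautilus</w:t></w:r>'
    '<w:r><w:t> is </w:t></w:r>'
    '<w:r><w:rPr><w:rStyle w:val="Emphasis"/></w:rPr><w:t>not</w:t></w:r>'
    '</w:p></w:body></w:document>'
)


def classify_italic_runs(document_xml: str) -> dict:
    """Count runs italicized directly versus runs using the Emphasis style."""
    counts = {"direct_italic": 0, "emphasis_style": 0}
    root = ET.fromstring(document_xml)
    for run in root.iter(f"{{{W_NS}}}r"):
        props = run.find(f"{{{W_NS}}}rPr")
        if props is None:
            continue  # plain run with no formatting of its own
        style = props.find(f"{{{W_NS}}}rStyle")
        if style is not None and style.get(f"{{{W_NS}}}val") == "Emphasis":
            counts["emphasis_style"] += 1
        elif props.find(f"{{{W_NS}}}i") is not None:
            counts["direct_italic"] += 1
    return counts


if __name__ == "__main__":
    print(classify_italic_runs(SAMPLE_XML))
```

Running the same check on a converted file's source would be one way to find out whether the distinction survives the trip.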

Now, why would I do this? For my own satisfaction, mostly, but also because I want to see if it carries over to the new file types when I convert.

But then there is another question: is there actually any point to having these different tags? I have read long, drawn-out discussions among web developers over whether a particular instance of text should have <i> or <em> on it, but as far as I can tell, it makes no practical difference. Visually, they look the same, which means they can only be of importance to machines, but the machines that might distinguish between the tags don’t. Google has said plainly that it does not parse the text of web pages finely enough to care which of these two tags is used, and if Google doesn’t care, it’s unlikely that other search engines do. Screen readers, which read web pages out loud to the visually impaired, could potentially read the marked-up text in a different tone of voice, but according to what I have read, few if any of them actually do. In other words, <em> designates emphasis, but that emphasis is not reproduced verbally.

On top of that, content creators, such as bloggers, are mostly ignorant of these distinctions: They simply write using an editor designed to resemble a word processor, and they hit the italics button when they want italics. In most online editors, that means an <em> tag, which is used indiscriminately for the names of ships and paintings and books as well as for emphasis.

Worse still, the distinction between the tags is ridiculously ambiguous, as evidenced by the lengthy arguments over proper use. The spec for the <i> tag is basically a frantic attempt to justify its continued existence.

But, in any case, I’ll let you know if, in the near future, I am the proud owner of the most semantically correct eBook on the market.

The State of the eBook Exploration

So, I’ve been exploring the subject of how to get into self-publishing and generate my own professional-looking books. General agreement is that the best software for doing this is Vellum, though it has a prohibitive price ($250 for the full package) and only runs on a Mac.

Besides that, there is a slew of open-source programs that, altogether, will probably accomplish the same tasks but with considerably more difficulty for the end-user.

Adding to these difficulties, my laptop is now extremely out of date. I’m still running Windows 7, and much of the software I would like to try will only run on Windows 8 or later. This includes Amazon’s free Kindle eBook generator.

When I started exploring this, I naïvely thought that I might not have too much difficulty. As it turns out, eBooks are packages of CSS and XHTML files. I saw some authors complaining that most of the software aside from Vellum requires some coding knowledge, and I thought to myself, “Hold on, I can write CSS and HTML.”

So I took an eBook generated in Vellum and pulled it open using an open-source EPUB editor called Sigil, and I didn’t have too much difficulty figuring out how it was built. Not only that, but I thought to myself that, by editing the code directly, I could probably create a much cleaner, more compact file with fewer <div>s and without all the unused CSS rules. I could stick to readable web-safe fonts too. Small file size, after all, is important to sales and royalties since Amazon takes its slice based on file size.
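Since an EPUB is a ZIP archive under the hood, a few lines of Python can show where the bytes are going without any special software. This is only a sketch: the function names and the tiny sample package are invented for illustration, and the sample is missing the parts a real EPUB needs in order to open in a reader.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical contents for a stand-in package.
SAMPLE_CSS = b"p { margin: 0; text-indent: 1.5em; }"
SAMPLE_XHTML = b"<html><body><p>Chapter one.</p></body></html>"


def make_sample_epub() -> bytes:
    """Build a tiny, uncompressed stand-in for an EPUB (not a valid eBook)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as package:
        package.writestr("mimetype", b"application/epub+zip")
        package.writestr("OEBPS/chapter1.xhtml", SAMPLE_XHTML)
        package.writestr("OEBPS/style.css", SAMPLE_CSS)
    return buffer.getvalue()


def compressed_size_by_extension(epub_bytes: bytes) -> dict:
    """Total the stored size of every file in the package, grouped by extension."""
    totals = defaultdict(int)
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as package:
        for info in package.infolist():
            if info.is_dir():
                continue
            extension = PurePosixPath(info.filename).suffix.lower() or "(none)"
            totals[extension] += info.compress_size
    return dict(totals)


if __name__ == "__main__":
    for extension, size in sorted(compressed_size_by_extension(make_sample_epub()).items()):
        print(f"{extension}: {size} bytes")
```

For a real book, the bytes of the file on disk would be passed in instead of the sample. I would expect embedded fonts and images to outweigh the XHTML and CSS in most eBooks, which is part of the appeal of sticking to web-safe fonts.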

So I started editing the first volume of Jake and the Dynamo in Sigil, and while I could indeed make a slim file with a lot of the same cosmetic features typical of a professionally generated eBook, it was incredibly time-consuming, basically requiring me to insert and edit each paragraph individually (mostly to make sure the italics were in the right places). With a judicious selection of web-safe font stacks, the existing images, and some proper HTML semantics, presto, the result was what you see in the header.

The result looks good in Sigil. But that’s the important part—in Sigil. I opened it with another program and started seeing problems, such as my drop-caps wandering all over the page (and I don’t know why; the CSS for my drop-caps is very similar to how WordPress does it).

But the biggest mess came from Amazon, which insists on a proprietary filetype, MOBI. I made the conversion to MOBI using Calibre, which I can only use in an older version because the latest doesn’t work on Windows 7, and the result was a complete mess. Most especially, either the MOBI filetype or Calibre (not sure which) doesn’t like a lot of my CSS; the kind of stuff I’d do on the web to make sure images resize while keeping their proportions, or to stylize certain tags, apparently doesn’t work in Amazon’s eBooks.

I’ve needed to update my computer for some time, and that need has become more apparent over the last few days as I’ve repeatedly tried to run software that simply won’t run on my antiquated system. What I’m thinking at present is that I might go ahead and shell out for a refurbished MacBook and a copy of Vellum, and then continue to plod along with my current system for everything else as long as possible. Meanwhile, I’ll add a laptop-update fund to the monthly budget.