Brief Musings on HTML Semantics

I have decided to invest in a copy of Vellum, the software that everyone agrees is the most hassle-free when it comes to preparing a manuscript for self-publishing. To use it, I also had to purchase a refurbished MacBook. In the near future, after I have some other things squared away, I’ll be looking into commissioning cover art, and after that, the first two volumes of Jake and the Dynamo will appear, probably within a month or two of each other.

I don’t have my new software yet because I’m waiting on the computer, but I wanted to try an experiment in the meantime. According to everything I have read, Amazon charges self-publishing authors a delivery fee based on the size of the eBook’s file (at least under its higher royalty option), and that fee comes out of the author’s profit on each sale, meaning there is a motive to make the file as small as possible. To that end, I set out to create the smallest and cleanest Word document I could before converting it through Vellum, to see if it makes a difference in the final file size.

As it turned out, setting up this experiment took me a ridiculously long time, though future manuscripts I produce will be better for it, since I now have a clean, reusable template to work from. The DOCX files used by Word are infamous for “cruft,” that is, bloated code: If you manually change your font, for example, the default font is still lurking in the code of your document and will get repeated in every paragraph you make. The only way I know of to fix this is to edit the template and set up Word’s so-called “styles” with all the formatting changes you’re planning to use throughout the document.
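
A minimal sketch, in Python, of how one might measure that kind of cruft. It assumes the main text of a DOCX lives in a WordprocessingML part (normally word/document.xml inside the ZIP archive) and that a manually chosen font shows up as a `w:rFonts` element on each run of text. The function name and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def count_direct_fonts(document_xml: str) -> tuple[int, int]:
    """Return (runs carrying a hard-coded font, total runs)."""
    root = ET.fromstring(document_xml.strip())
    runs = root.findall(".//w:r", NS)
    with_font = [r for r in runs if r.find("w:rPr/w:rFonts", NS) is not None]
    return len(with_font), len(runs)


# Hand-written stand-in for a real document.xml (hypothetical content):
# two paragraphs with the font set by hand, one that relies on a style.
SAMPLE = f"""
<w:document xmlns:w="{W}"><w:body>
  <w:p><w:r><w:rPr><w:rFonts w:ascii="Garamond"/></w:rPr><w:t>One</w:t></w:r></w:p>
  <w:p><w:r><w:rPr><w:rFonts w:ascii="Garamond"/></w:rPr><w:t>Two</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="BodyText"/></w:pPr><w:r><w:t>Three</w:t></w:r></w:p>
</w:body></w:document>
"""

print(count_direct_fonts(SAMPLE))
```

For a real manuscript, the XML could be pulled out with the standard library's `zipfile` module, for example `zipfile.ZipFile(path).read("word/document.xml").decode("utf-8")`. The closer the first number gets to zero, the more of the formatting is coming from styles rather than from repeated overrides.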

Basically, I created a new template, copied and pasted the entire plain text of the first volume of Jake and the Dynamo, and reformatted it by hand. That took … longer than I expected. But the end result was a DOCX file half the size of the original.

DOCX files, I have learned, are actually ZIP archives of XML, eXtensible Markup Language, which is a close cousin of the HTML on which webpages are built. Some of Word’s “styles” correspond to HTML tags when the document is converted, and at least a few of these are important for accessibility and proper document structure. Most importantly, all chapter headings in a manuscript should be made with actual heading styles, both so you can automatically generate a table of contents in the finished product and so the document will be navigable to anyone using assistive technologies such as screen readers. Using headings on web pages is important for the same reason, which is why I’ve been going back to edit some of my longer posts.
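
To see why real headings matter, here is a rough sketch of what a table-of-contents generator or a screen reader's heading list has to work with. It uses Python's built-in `html.parser` and collects only genuine `h1` through `h6` elements, so text that merely looks like a heading (bold, say) is invisible to it. The class name, function name, and sample page are hypothetical.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect (level, text) pairs for every h1-h6 element."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def build_outline(html: str):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


# Made-up page: the bold paragraph never reaches the outline.
SAMPLE = """
<h1>Volume One</h1>
<p><b>Not a heading</b>, just bold text.</p>
<h2>Chapter 1</h2>
<p>Text.</p>
<h2>Chapter 2</h2>
"""

for level, title in build_outline(SAMPLE):
    print("  " * (level - 1) + title)
```

The indented printout is, in effect, the table of contents. Anything formatted by hand to look like a chapter title simply never appears in it.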

While I was rebuilding my manuscript, I got to thinking about other HTML semantics. One goofy feature of HTML is that, basically as an artifact of its developmental history, it has at least three (that I know of) ways of marking text so a user agent will render it in italics. Originally, there was the <i> tag, which just meant italic. But over time, the decision was made that HTML should be purely a semantic language while visual appearances should be handled by a separate stylesheet language, CSS (Cascading Style Sheets). So the current HTML spec redefines the <i> tag, which some references now gloss as “idiomatic text,” as denoting text that is in an alternate voice or mood. The spec gives the concrete examples of names of ships or taxonomic designations, which are conventionally rendered in italics in English.

Now, in addition to the already-existing tag, there is <em>, which is also rendered in italics by default, but which denotes text that is emphasized. Then there is <cite>, which user agents also render in italics by default, but which is for titles of creative works—even though not all titles should be italicized.
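
The three tags look identical on screen but stay distinct in the markup, which is easy to demonstrate. The following is a small sketch using Python's built-in `html.parser`. The sample sentence and all the names in the code are invented for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class ItalicTagCounter(HTMLParser):
    """Count the three tags that browsers italicize by default."""

    TAGS = ("i", "em", "cite")

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGS:
            self.counts[tag] += 1


def count_italic_tags(html: str) -> dict:
    parser = ItalicTagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


# Hypothetical snippet: a ship name and a taxonomic name in <i>,
# an emphasized word in <em>, and a book title in <cite>.
SAMPLE = (
    "<p>The <i>Titanic</i> sank in 1912. I <em>told</em> you so. "
    "See <cite>A Night to Remember</cite> and <i>Homo sapiens</i>.</p>"
)

print(count_italic_tags(SAMPLE))
```

A browser would show all four marked-up phrases in the same italic face. Only a program reading the source, like this one, can tell them apart.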

Microsoft Word, of course, allows you to put text in italics while you’re typing, but it also has “emphasis” as one of its default styles. With a little fiddling around, I figured out, sure enough, that text set with the italic button comes out in <i> tags when the document is turned into HTML, whereas the emphasis style comes out as <em>. So, get this, I actually differentiated between the two in my document, using emphasis only for emphasized words and using standard italics for conventions and titles (there is no way, as far as I can tell, to insert a <cite> tag in a Word document).
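
One way to check which kind of italics a given stretch of text carries is to look at the document's underlying XML. This is a minimal sketch under stated assumptions: direct italics are stored as a `w:i` run property, the emphasis style is stored as a `w:rStyle` reference, and that style's ID is `Emphasis` (it may differ by template or language). The function name and sample are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def classify_runs(document_xml: str, emphasis_style: str = "Emphasis"):
    """Label each run of text as 'emphasis', 'italic', or 'plain'.

    Simplification: any w:i element counts as italic, even though a
    real document can switch it off again with w:val="0" or "false".
    """
    root = ET.fromstring(document_xml.strip())
    labels = []
    for run in root.iterfind(".//w:r", NS):
        text = "".join(t.text or "" for t in run.iterfind("w:t", NS))
        style = run.find("w:rPr/w:rStyle", NS)
        italic = run.find("w:rPr/w:i", NS)
        if style is not None and style.get(f"{{{W}}}val") == emphasis_style:
            labels.append((text, "emphasis"))
        elif italic is not None:
            labels.append((text, "italic"))
        else:
            labels.append((text, "plain"))
    return labels


# Hand-written stand-in for a real document.xml (hypothetical content).
SAMPLE = f"""
<w:document xmlns:w="{W}"><w:body><w:p>
  <w:r><w:t>I </w:t></w:r>
  <w:r><w:rPr><w:rStyle w:val="Emphasis"/></w:rPr><w:t>told</w:t></w:r>
  <w:r><w:t> you the </w:t></w:r>
  <w:r><w:rPr><w:i/></w:rPr><w:t>Titanic</w:t></w:r>
</w:p></w:body></w:document>
"""

for text, kind in classify_runs(SAMPLE):
    print(f"{kind:9} {text!r}")
```

Running the same check on the converted eBook's markup would show whether the distinction survives the trip.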

Now, why would I do this? For my own satisfaction, mostly, but also because I want to see if it carries over to the new file types when I convert.

But then there is another question: is there actually any point to having these different tags? I have read long, drawn-out discussions among web developers over whether a particular instance of text should have <i> or <em> on it, but as far as I can tell, it makes no practical difference. Visually, they look the same, which means they can only be of importance to machines, but the machines that might distinguish between the tags don’t. Google has said plainly that it does not parse the text of web pages finely enough to care which of these two tags is used, and if Google doesn’t care, it’s unlikely that other search engines do. Screen readers, which read web pages out loud to the visually impaired, could potentially read the marked-up text in a different tone of voice, but according to what I have read, few if any of them actually do. In other words, <em> designates emphasis, but that emphasis is not reproduced verbally.

On top of that, content creators, such as bloggers, are mostly ignorant of these distinctions: They simply write using an editor designed to resemble a word processor, and they hit the italics button when they want italics. In most online editors, that means an <em> tag, which is used indiscriminately for the names of ships and paintings and books as well as for emphasis.

Worse still, the distinction between the tags is ridiculously ambiguous, as evidenced by the lengthy arguments over proper use. The spec for the <i> tag is basically a frantic attempt to justify its continued existence.

But, in any case, I’ll let you know if, in the near future, I am the proud owner of the most semantically correct eBook on the market.

Author: D. G. D. Davidson

D. G. D. Davidson is an archaeologist, librarian, Catholic, and magical girl enthusiast. He is the author of JAKE AND THE DYNAMO.