Category Archive: Tutorials

To Digital in a Day: Act I

Sat 1:40 PM

I receive a Word and RTF document from a source I won’t disclose just yet. I asked for both because sometimes Word documents don’t open at all on my Mac.

It is 127,000 words long. The length doesn’t matter much to me, except for the number of CSS classes my various HTML generators might create and, more importantly, the time it takes.

Sat 1:45 PM

Open Word document successfully in both OpenOffice and TextEdit, save both as different HTML files. I try the Word document first because it preserves smart quotes. This will be the first time I’ve tried an HTML save from OpenOffice.

On to decipheration and conversion.

Sat 1:50 PM

Dear OpenOffice,

This is not how you win my love.

Not using the OO-generated HTML.

Sat 1:51 PM

Dear TextEdit,

You could use some work, but your near HTML 4 compliance, coupled with distinct, if overly thorough, CSS directives, will assist greatly when I further process this HTML into text for EPub that will pass EPubcheck and Adobe Digital Editions.

Proceeding with TextEdit-generated HTML.

To work!

Sat 1:55 PM

Using Ruby-Epub’s epub tool to create a work directory that will become the Epub book.

Sat 2:00 PM

Things I proceed to do with the powers of MacVim:

Kill the <meta> lines.

Replace the <title> with the actual name of the book.

Remove the generated ToC; we’ll be creating a new linked one later.

Examine HTML/CSS for anything repeated, redundant, or otherwise not useful. Often includes extraneous/repeated CSS classes, extra linebreaks between paragraphs, overcompensating HTML, empty bold/italic/etc tags.

Important note: this is all different from document to document, even by the same author. Generators are thorough, but not all that smart.

Serious text search and replace follows with regular expressions. Note to those not familiar with regular expressions: what follows in this section will make no sense to you. But here’s what it means:

  • I take a lot of care in converting things to mean what they’re supposed to mean (like determining scenebreaks versus letters versus typeset characters versus normal text). Many of my fellow hand-converters and all of my fellow generators do not do this.

  • But I also have tools in my hands that allow me to take care of these in seconds when I find them and can determine the patterns.

  • Really, I spend most of my time investigating and understanding the structure and style of the text, although it doesn’t mean I have to read all of the text—just enough.

  • If you’re feeling guilty, it’s not your fault. Like I said, generators are stupid. There is sometimes nothing that convinces them that surrounding black text with more black text is redundant. (We’re a long way from The Singularity.)

We now descend into geekery. You can skip over this if you like.

Vim commands:

:%s/<p class="p1"><b><\/b><br><\/p>\n//
:%s/<p class="p[0-9]\+"><br><\/p>\n//
:%s/<span class="Apple-converted-space">[^<]*<\/span>//g
:%s/<b><\/b>//g
:%s/<i><\/i>//g
:%s/<b><\/b>//g
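
For anyone who would rather script this than type it into Vim, here is a rough Python sketch of the same cleanup. It assumes the TextEdit output matches the patterns in the commands above; the function name strip_generator_noise is made up for illustration and is not part of any tool.

```python
import re

def strip_generator_noise(html: str) -> str:
    """Drop empty paragraphs, converted-space spans, and empty bold/italic tags."""
    # Empty p1 paragraphs holding an empty bold tag and a line break.
    html = re.sub(r'<p class="p1"><b></b><br></p>\n', '', html)
    # Any numbered paragraph class that contains nothing but a line break.
    html = re.sub(r'<p class="p[0-9]+"><br></p>\n', '', html)
    # Spans that TextEdit wraps around runs of converted spaces.
    html = re.sub(r'<span class="Apple-converted-space">[^<]*</span>', '', html)
    # Empty inline tags. Loop until nothing changes, because removing an
    # empty <i></i> can leave behind a newly empty <b></b>.
    previous = None
    while previous != html:
        previous = html
        html = re.sub(r'<(b|i)></\1>', '', html)
    return html
```

The loop at the end is why the empty-bold substitution appears twice in the Vim list: a second pass catches bold tags that only became empty after the italics were removed.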

An interesting case to mention: there are a few places where a break/tab, instead of a paragraph tag, is used. These must be replaced appropriately.

:%s/<span class="s4"><br>\n<\/span><span class="Apple-tab-span">[^<]*<\/span>/<\/p><Control-V><Return><p>/

Manual replacement is needed in some cases.

Removing <span class="s3"> because black is still black.
Removing <span class="s4"> because Lucida Grande is still Lucida Grande.

Meanwhile, I make note of which CSS classes really matter. They often need to be replaced by appropriate HTML tags for structuring (often they’re chapter headings, for instance), but sometimes they’re needed for special fiction formatting.

If I run into a CSS class with a semantic difference that matters in this way, I rename it to an indication of what it means (such as changing “p15” to “scenebreak”).

Note: this is where I also find out where paragraph classes no longer occur because they had surrounded empty bold/italic/whatever tags. I delete them from the stylesheet.
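
Spotting those orphaned classes can also be scripted. The sketch below is only an approximation: it assumes the stylesheet has one rule per selector in the form p.name {...}, which is how the generated stylesheet is described here, and the function name unused_paragraph_classes is hypothetical.

```python
import re

def unused_paragraph_classes(stylesheet: str, body: str) -> set[str]:
    """Return paragraph classes the stylesheet defines but the body never uses."""
    # Class names declared as "p.<name> {" in the stylesheet.
    defined = set(re.findall(r'\bp\.([A-Za-z0-9_-]+)\s*\{', stylesheet))
    # Class names that actually appear on <p> tags in the body.
    used = set()
    for attr in re.findall(r'<p\b[^>]*\bclass="([^"]*)"', body):
        used.update(attr.split())
    return defined - used
```

Anything it returns is a candidate for deletion from the stylesheet, though a quick manual check is still worthwhile.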

Now, paragraphs:

(change p.p2 in stylesheet to p.title)
%s/class="p2"/class="title"/g
(change p.p3 in stylesheet to p.byline)
%s/class="p3"/class="byline"/g
(change p.p7 in stylesheet to p.chapter)
%s/class="p7"/class="chapter"/g
(change p.p10 in stylesheet to p.no-indent)
%s/class="p10"/class="no-indent"/g
(change p.p15 in stylesheet to p.scenebreak)
%s/class="p15"/class="scenebreak"/g
(p.p19, p.p22, p.p25 mean the same thing as p.p15, remove)
%s/class="p19"/class="scenebreak"/g
%s/class="p22"/class="scenebreak"/g
%s/class="p25"/class="scenebreak"/g
(change p.p21 in stylesheet to p.monospace)
%s/class="p21"/class="monospace"/g
(change p.p27 in stylesheet to p.end-text)
%s/class="p27"/class="end-text"/g
(change p.p33, p35, p37, p38 to p.centered)
%s/class="p33"/class="centered"/g
%s/class="p35"/class="centered"/g
%s/class="p37"/class="centered"/g
%s/class="p38"/class="centered"/g
(p.p39 has larger fonts, but since this matters less on mobile readers, I'll keep the centering and make the font size normal.  This can simply be done by merging the class with the "centered" class.)
%s/class="p39"/class="centered"/g
(Redundant paragraph classes that just mean normal text, strip)
:%s/^<p class="p9"/<p/
:%s/^<p class="p11"/<p/
:%s/^<p class="p12"/<p/
:%s/^<p class="p17"/<p/
:%s/^<p class="p18"/<p/
:%s/^<p class="p20"/<p/
[20 more, not covering here]
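
The whole batch of renames can be expressed as one lookup table. This is a sketch under assumptions, not the method actually used: the mapping is copied from the notes above, the PLAIN set holds only the six classes shown (the elided twenty are left out), and unlike the Vim commands it only touches class attributes that sit directly on a <p> tag. The names RENAMES, PLAIN, and rename_classes are invented for the example.

```python
import re

# Generated class -> semantic class, taken from the substitutions above.
RENAMES = {
    "p2": "title", "p3": "byline", "p7": "chapter", "p10": "no-indent",
    "p15": "scenebreak", "p19": "scenebreak", "p22": "scenebreak",
    "p25": "scenebreak", "p21": "monospace", "p27": "end-text",
    "p33": "centered", "p35": "centered", "p37": "centered",
    "p38": "centered", "p39": "centered",
}
# Classes that just mean normal text (only the six listed above).
PLAIN = {"p9", "p11", "p12", "p17", "p18", "p20"}

def rename_classes(html: str) -> str:
    """Rename meaningful paragraph classes and strip the redundant ones."""
    def fix(match: re.Match) -> str:
        name = match.group(1)
        if name in PLAIN:
            return "<p"                      # drop the class entirely
        if name in RENAMES:
            return f'<p class="{RENAMES[name]}"'
        return match.group(0)                # unknown class: leave untouched
    return re.sub(r'<p class="(p[0-9]+)"', fix, html)
```

Leaving unknown classes untouched is deliberate: since every document is different, anything not in the table should be looked at by hand.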

Many spans will be eaten in the belly of the Slorg, because they are often redundant once their surrounding paragraph becomes an h2 or something. (Amusing alternative: or it becomes a link, and therefore underlining it and marking it in blue is not necessary… and often unreadable on grayscale readers.) And some are just redundant, and were removed in one of the previous steps.

Another interesting case comes up: Apple-tab-spans that create a list. This is a little troubling, because there are plenty of mobile readers that can’t deal with HTML lists, so I need to be creative. In the end I keep the bullets as explicit text and shift-right the paragraphs with another CSS class. I remove the tabs as well in this instance.

Other ways I could have gone: replaced the tabs with multiple &nbsp;, risk using HTML lists, used floating divs with set widths.

It’s not perfect, but few things dealing with lists are.

Sat 3:21 PM

Now I clean up the stylesheet itself to remove extraneous CSS directives. Like, for instance, setting margins to 0, or resetting the font to the same one in every class. Or, um, setting left/right margins and indentation on a piece of text that’s going to be dead-centered anyways. Stupid generators.

Sat 3:30 PM

Now I start replacing things like p.chapter with their structural elements. I also add some style of my own to distinguish structural elements of different types.

p.title becomes h1.title (and I strip out the bold tags).

p.byline stays that way, but I increase the font size and make it bold.

p.chapter becomes h2.chapter. Or,

:%s/<p class="chapter">\(.*\)<\/p>/<h2 class="chapter">\1<\/h2>/
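
The same promotion in Python, as a minimal sketch. It assumes each chapter paragraph sits on a single line, as the Vim pattern does; the function name promote_chapters is made up.

```python
import re

def promote_chapters(html: str) -> str:
    """Turn <p class="chapter"> paragraphs into <h2 class="chapter"> headings."""
    # "." does not match newlines here, so each match stays on one line.
    return re.sub(r'<p class="chapter">(.*)</p>',
                  r'<h2 class="chapter">\1</h2>', html)
```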

I bold the p.end-text.

Any <br> left over must be replaced by the XML-compliant <br />.
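
A small Python sketch of that replacement, assuming the leftover breaks carry no attributes; close_breaks is a hypothetical helper name.

```python
import re

def close_breaks(html: str) -> str:
    """Rewrite bare <br> tags as self-closing <br /> for XML parsers."""
    # Already self-closed tags contain a slash, so they are not matched again.
    return re.sub(r'<br\s*>', '<br />', html)
```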

Hyperlinks have been changed by the RTF filters to explicitly list the URL alongside the anchor text and remove the anchor tags, so I change all that back to the way it was.

I scan for missing images. The more images authors use, the harder life becomes for me, but fortunately there’s just the one, the Book View Cafe logo. (To get at it, I needed the OpenOffice conversion, because it extracts the images to files.) I add it back, centered.

I add the proper UTF-8 encoding declaration at the top. (Sometimes I get ISO-encoded files; I have to watch out for that, and use the right one.)

I finish up by adding the proper namespace for the outermost <html> tag.

I check the final HTML in Firefox.

Sat 3:45 PM

What do we have so far?

  1. I reduced a 100-line embedded stylesheet to 9 lines.
  2. I reduced the number of CSS classes from 100 to 9.
  3. I reduced the number of CSS directives from over 400 to just over 20.
  4. I replaced pseudo-structural elements with real structural elements.

But it’s not ready for prime-time just yet.

I copy the entire working directory to my encrypted remote file share because I’m paranoid like that. I verify the copy.

I’m going to take a small break now.

Sat 4:00 PM

I post this to my blog. Then the showering, food-eating, other stuff.

ETA: Break might be until tomorrow. Friend and I are contemplating Watchmen again. Yes, I thought it was that good.

Creating eBooks: An ePub Tutorial

Most recent version of this document is now at the Spontaneous Derivation Wiki.

This is a step-by-step tutorial, with an example, of making a standards-compliant ePub book by hand.

We’ll be using the public domain (in both illustrations and text) book The Velveteen Rabbit. It has the following good qualifications as a tutorial example:

  • Small.
  • Exists in HTML form in the public domain.
  • Tiny table of contents, but a table of contents still exists.
  • Images.

Click here to read more »

eBookifying the Scifiction Archives

A long time ago (in Internet time, anyways), Scifi.com had a section called Scifiction, where they published science fiction stories online—both “classics”, from writers hoary with age (well… maybe not that hoary; Robert Silverberg, Avram Davidson, Barry N. Malzberg, etc), and “originals”, from newer writers (you know, like Elizabeth Bear, Lucius Shepard, M. Rickert, etc).

Then, for whatever reason, Scifi killed Scifiction. All links to the stories evaporated.

But the Scifiction archives live on. Horribly slow, badly formatted, aging and uncared for, and nearly unreadable in mobile readers like the Kindle, but still there. (Of course, I’m inserting extra drama here. Cue timpanis.)

I got tired of this, so I created eBooks of them, one per year. This was actually my second serious endeavor in the world of eBooks. It was amusing, because the archives are huge; some 325+ stories reside there, spread out over five years. I can’t distribute them, of course, because the stories are all under copyright—and tracking down over 50 writers, some dead so I’d have to contact their estate, is not something I’m about to do. Nor would they wish me to, I think. So I don’t distribute them, and never will.

But the knowledge of how to do it, for people who wish to make personal eBooks, is distributable.

So here’s a description of how I did it, after the cut. It’s not complete in every detail, because some of the process was manual—there are multiple pitfalls in how the archives work, a lot of it because the archives are spread out over five years, and templates and presentation change enough to cause unwary scripts to die with gurgles halfway through the work. And even so, you end up needing to massage things by hand anyways.

Note: this is quite a bit of effort, but it was still less effort than doing it all by hand. I’m rather proud of this. And, of course, it’s a very tl;dr, mid-level technical discussion. I think it’s mostly a geeky thing.

Click here to read more »
