Tag Archive: to digital in a day

To Digital in a Day: Curtains Down

Sun 3:30 AM

Tinkering around with html2ps. Discover that it works okay if every higher-bit character is html-entitied.

Did some search and replace in MacVim. (Note: could also have simply used HTML Entities for Ruby, and will do so in the future.)
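
For the scripted route, here is a minimal sketch in plain Ruby that needs no gem. It turns every character above 7-bit ASCII into a numeric entity, which is the shape of input html2ps was happy with. The helper name entitize is made up for illustration; the HTML Entities library mentioned above has its own API, which is not shown here.

```ruby
# Replace every character outside 7-bit ASCII with a numeric HTML entity,
# so a tool that expects ISO/ASCII input never sees raw UTF-8 bytes.
# `entitize` is a hypothetical helper name, not part of any library.
def entitize(text)
  text.gsub(/[^\x00-\x7F]/) { |char| format("&#%d;", char.ord) }
end

puts entitize("“Smart quotes” aren’t ASCII.")
```

Numeric entities are used rather than named ones because they need no lookup table and are valid in both HTML and XHTML.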

Sun 3:54 AM

And now we have a PDF with clicky-links as well. The full six!

To Digital in a Day: Conclusion

So, to wit:

Downloaded a Word document (127,000 words) from an author.

2 hours to convert to HTML without a table of contents.

45 minutes to tidy the HTML (harder than you think) and create an Epub file that validates.

1 minute to generate the Mobipocket file.

5 minutes to create the table of contents for the HTML file.

More than 2 hours to discover that I can’t generate PDF with links in the Table of Contents, though obviously I can just open the HTML file and print it to PDF.

2 minutes to generate the Sony Reader file, mostly because I’m still unfamiliar with calibre’s any2lrf. It’d be less than a minute otherwise.

2 minutes to generate the Microsoft Reader file, similarly for oeb2lit.

So basically: not counting the clueless bumbling around with respect to PDF, in three hours I created 5 (6 if you count PDF without hyperlinks, but I don’t really) ebook formats.

And I suppose some time was spent writing all this down for the blog and posterity.

The time-eater is what I’ll call feature-filled PDF, and always has been. That’s what takes the real hours of frustration. If you only worried about the other formats, you’d be fairly set.

A book publisher who still does traditional print must of course worry about finely set PDF with all the trimmings that the physical medium of a book requires: headers, footers, page numbering, left versus right pages, and extreme control over the text’s appearance. But a book publisher who only sells ebooks needn’t worry about PDF at all, or at least, not PDF for book-printing. And that frees up a fair amount of time.

Save for the galley checks, of course. Nothing frees up that time. At least, not until The Singularity.

In the end, either the author or I (or heck, both!) can put the files on our download servers, with the logistics of print out of the way. There’s more effort involved if you want to sell the book behind a store front, but that’s more or less what Fictionwise, Amazon’s DTP, LuLu, and Webscriptions are for.

In case you were wondering, if you were hiring me in my full capacity, rather than just a scripter of ruby and knowledge-engine of HTML/CSS, you’d be paying me over $50 an hour. However, I think frankly you don’t need a (cheap) $50/hour programmer who normally works on large-scale highly-available internet services to do this type of stuff, and you could get away with maybe $20/hour.

In the end, it’s still $60 a book that way.
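
The $60 figure can be checked against the tally above. This is a rough sketch: the step times and the $20 rate come from this post, while rounding up to whole billed hours is an assumption added for the example.

```ruby
# Minutes per step, taken from the tally above (the PDF detour is excluded).
STEP_MINUTES = {
  "Word to HTML"           => 120,
  "Tidy HTML, build Epub"  => 45,
  "Mobipocket"             => 1,
  "HTML table of contents" => 5,
  "Sony Reader"            => 2,
  "Microsoft Reader"       => 2
}.freeze

HOURLY_RATE = 20 # dollars per hour, the lower rate suggested above

def total_minutes(steps)
  steps.values.sum
end

# Assumption: time is billed in whole hours, rounded up.
def conversion_cost(steps, hourly_rate)
  billed_hours = (total_minutes(steps) / 60.0).ceil
  billed_hours * hourly_rate
end

puts "#{total_minutes(STEP_MINUTES)} minutes, about $#{conversion_cost(STEP_MINUTES, HOURLY_RATE)}"
```

That is 175 minutes, just under the three hours quoted, which at $20 an hour comes to $60.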

But I really hope you can sell more than 30 copies of your book at $3 a pop (being generous with the taxes here). Wil Wheaton, I hear, sold through more copies of his Sunken Treasure ebook than the print version in a few days. Perhaps even just one day.

Plus, for a book of 80,000 words or more? Thank the Kindle store, which has been consistently training people to buy new Kindle books at $10.00 a pop.

Your move, traditional publishing.

To Digital in a Day: Act III

Sat 9:30 PM

This book needs a cover; that would have been a nice tutorial on Covers for All Book Formats More or Less, but that’s a blog post for another day.

Time for Mobipocket.

mobigen run on the Epub file finishes in a minute. File checks out.

Sat 9:32 PM

Wondering what format to do next.

Oh yes, PDF. Which means a detour via html2ps and then Ghostscript’s ps2pdf.

This is a little more complicated.

Especially since html2ps is segfaulting for some reason on my Mac.

Sat 10:02 PM

Removed incidental cause of the seg fault, will be fixing it for real tomorrow or sommat.

Or… not. html2ps doesn’t deal with raw utf8; it lives on ISO encoding. Like the rest of perl. No Ruby (defaults to utf8) equivalent around.

Sat 10:24 PM

The other option I know of, wkpdf, needs a consolidated HTML file.

So… let’s take the worked original file, and add anchors. MacVim again!

This time I’m running short perl filters against the text directly in the editor, then using grep to grab the anchors and more regular expression/replace to create the links.

Incidentally, this also gives us the one-file HTML version with a table of contents.
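
The anchors-plus-links step could also be scripted. The following is only a sketch in Ruby, not the perl filters and grep actually used here. It assumes chapter headings have the exact form the splitting script in Act II matches, and the names add_anchors, toc_html, and the chapter-NN ids are invented for the example.

```ruby
# Chapter headings as they appear in the cleaned-up HTML (an assumption
# based on the pattern the Act II splitting script matches).
HEADING = %r{<h2 class="chapter"><b>([^<]+)</b></h2>}

# Give each chapter heading an id and collect [id, title] pairs.
def add_anchors(html)
  entries = []
  anchored = html.gsub(HEADING) do
    title = Regexp.last_match(1)
    id = format("chapter-%02d", entries.size + 1)
    entries << [id, title]
    %(<h2 class="chapter" id="#{id}"><b>#{title}</b></h2>)
  end
  [anchored, entries]
end

# Render the collected entries as one linked paragraph per chapter.
def toc_html(entries)
  entries.map do |id, title|
    format('<p class="toc"><a href="#%s">%s</a></p>', id, title)
  end.join("\n")
end
```

The table of contents is built from paragraphs rather than an HTML list, since list support on mobile readers is patchy.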

Sat 11:09 PM

Well, I ended up enabling the web server with PHP5 and increasing the memory limit and the PCRE backtrack limit, to try to run html2pdf, the PHP version. But most of my problems stem from the document being too large.

At this point, I can already generate every other format. Except for PDF with hyperlinks (PDF without such links I can do). Which has always been a bit of an Achilles heel for me.

By the way, apparently the Mac these days comes with textutil, so I might have been able to save a bit of time earlier in Act I. Hey, it can convert to… Word document, Open Office Document Text, …. hang on….

Nah, doesn’t preserve inter-text hyperlinks.

Sat 11:20 PM

Let’s just get all the other mobile formats out of the way.

Using calibre’s any2lrf, we get a valid Sony Reader file in under two minutes.

Using calibre’s oeb2lit, we get a valid Microsoft Reader file in under two minutes.

It takes me longer to type all this down for you and to locate my bookmark to the calibre site, actually.

Sat 11:29 PM

The state of affairs:

  • Valid Epub.
  • Valid Mobipocket (MOBI).
  • (Really) Valid HTML with linkage.
  • Valid Sony Reader.
  • Valid Microsoft Reader.

All in one day. Actually, counting up the time, less than one day.

To Digital in a Day: Act II

Sat 8:27 PM

Rested off some dizziness and decided to pick this up again so that there’s at least a book for my Kindle.

The HTML looks pretty good. It’s at the state where someone could email it to their Kindle’s email account and have it converted fairly well. It lacks a table of contents, though.

But first, let’s do the Epub.

Sat 8:30 PM

The annoying thing about Epub: most Epub readers don’t deal well when the source contains a very large HTML file. Other mobile formats are internally broken down into separate records, to be loaded in small chunks so as not to stress memory. But many Epub readers try to load the entire file in one go, including Adobe Digital Editions.

So the first task is to break this file up into multiple files. I usually do one file per chapter. My concern is that the chapters here are large (about 20k words each), but we’ll have to see what happens.

First, breaking out the stylesheet into its own style.css file, which can then be linked to by each chapter html file.

<link rel="stylesheet" type="text/css" href="style.css" />

Now I write a ruby script to chop the file up, because I’m just that way. You could do this in perl or python as well, but I prefer ruby.

#!/usr/bin/ruby

#
# Opens a new file handle to a file with the given
# filename (no suffix), writing the initial header for
# the file, also using the given title in the <head> section.
#
def open_section(name, title)
    section = File.new("#{name}.html", "w")
    section.puts <<-END
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
   "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>#{title}</title>
  <link rel="stylesheet" type="text/css" href="style.css" />
</head>
<body>
    END

    return section
end

#
# Closes the given file handle after writing the HTML footer.
#
def close_section(section)
    section.puts <<-END
</body>
</html>
    END
    section.close
end

File.open("TextilePlanet-te.html") do |file|

    current_section = nil
    section_num = 1
    state = :in_head

    file.each do |line|

        case state
        when :in_head
            if line =~ %r|<body|
                state = :in_body
                current_section = open_section('section-00', 'Title')
            end
        when :in_body
            if line =~ %r|<h2 class="chapter"><b>([^<]+)</b></h2>|
                title = $1
                if current_section
                    close_section(current_section)
                end
                current_section = open_section("section-%02d" % section_num, title)
                section_num += 1
            end

            current_section.puts line
        end
    end

    close_section(current_section)
end

The end result is 11 files, from section-00.html to section-10.html.

I check the various HTML files in Firefox to make sure we’ve got everything. Everything is quite nicely split; section-00.html has the title page, section-{01-08}.html each contain an entire chapter, section-09.html contains the author bio, and section-10.html the copyright page.

We need an explicit ToC file for Mobipocket. (Epub doesn’t need it, but I’m working from the same source for all the formats.)

Sat 8:55 PM

I could write the ToC file by hand… or generate it with a script… or just use MacVim and shell commands and some writing by hand.

I do that.
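
The "generate it with a script" option might look something like the sketch below. It assumes the section files carry their chapter name in the title element, as the splitting script above writes them; section_title, toc_entries, and the toc CSS class are hypothetical names.

```ruby
# Pull the <title> text out of one section file's contents.
def section_title(html)
  html[%r{<title>([^<]*)</title>}, 1]
end

# Build the body of a ToC page from [filename, title] pairs.
def toc_entries(sections)
  sections.map do |file, title|
    %(<p class="toc"><a href="#{file}">#{title}</a></p>)
  end.join("\n")
end

# Usage, assuming the section-NN.html files sit in the current directory.
sections = Dir.glob("section-*.html").sort.map do |file|
  [file, section_title(File.read(file))]
end
puts toc_entries(sections)
```

The output still needs the same XHTML header and footer as the section files before it can serve as toc.html.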

Sat 9:00 PM

Moving the non-toc, non-section, non-stylesheet, non-image files to another directory.

Now using Ruby Epub tools in the root of the work directory. Also, I’ve discovered that an old friend of mine, HTML Tidy, still exists. Big help for finding illegal character sequences and replacing them appropriately.

% epub add-to-opf . content/*
% vim metadata.opf   # (reorder spine)
% epub add-guide . content/toc.html \
      --type "toc" \
      --title "Table of Contents"
% epub add-guide . content/section-01.html \
      --type "text" \
      --title "Start Reading"
% epub add-to-ncx . content/section*
% epub add-to-ncx . content/toc.html
% epub compile .
  adding: mimetype (stored 0%)
  adding: META-INF/container.xml (deflated 35%)
  adding: metadata.opf (deflated 77%)
  adding: content/style.css (deflated 59%)
  adding: content/section-10.html (deflated 55%)
  adding: content/section-09.html (deflated 33%)
  adding: content/section-02.html (deflated 63%)
  adding: content/section-03.html (deflated 62%)
  adding: content/section-01.html (deflated 62%)
  adding: content/section-06.html (deflated 63%)
  adding: content/section-07.html (deflated 53%)
  adding: content/section-04.html (deflated 62%)
  adding: content/bookview-logo.png (stored 0%)
  adding: content/section-05.html (deflated 63%)
  adding: toc.ncx (deflated 79%)
  adding: content/toc.html (deflated 59%)
  adding: content/section-00.html (deflated 32%)
  adding: content/section-08.html (deflated 33%)
% ~/Software/ebooks/epub/epubcheck *epub
No errors or warnings detected

Sat 9:18 PM

Test reading it successfully in Adobe Digital Editions, with a table of contents and a proper NCX.

Ladies and gentlemen, as of 9:18 PM we have valid Epub. And a blog entry shortly.

To Digital in a Day: Act I

Sat 1:40 PM

I receive a Word and RTF document from a source I won’t disclose just yet. I asked for both because sometimes Word documents don’t open at all on my Mac.

It is 127,000 words long. The length doesn’t matter much to me in itself. It affects the number of CSS classes my various HTML generators might create, and more than that, the time this will take.

Sat 1:45 PM

Open Word document successfully in both OpenOffice and TextEdit, save both as different HTML files. I try the Word document first because it preserves smart quotes. This will be the first time I’ve tried an HTML save from OpenOffice.

On to decipheration and conversion.

Sat 1:50 PM

Dear OpenOffice,

This is not how you win my love.

Not using the OO-generated HTML.

Sat 1:51 PM

Dear TextEdit,

You could use some work, but your otherwise near HTML-4 compliancy coupled with distinct, if overly thorough, CSS directives will assist greatly when I further process this HTML into text for EPub that will pass EPubcheck and Adobe Digital Editions.

Proceeding with TextEdit-generated HTML.

To work!

Sat 1:55 PM

Using Ruby-Epub’s epub tool to create a work directory that will become the Epub book.

Sat 2:00 PM

Things I proceed to do with the powers of MacVim:

Kill the <meta> lines.

Replace the <title> with the actual name of the book.

Remove the generated ToC; we’ll be creating a new linked one later.

Examine HTML/CSS for anything repeated, redundant, or otherwise not useful. Often includes extraneous/repeated CSS classes, extra linebreaks between paragraphs, overcompensating HTML, empty bold/italic/etc tags.

Important note: this is all different from document to document, even by the same author. Generators are thorough, but not all that smart.

Serious text search and replace follows with regular expressions. Note to those not familiar with regular expressions: what follows in this section will make no sense to you. But here’s what it means:

  • I spend a lot of care in converting things to mean what they’re supposed to mean (like determining scenebreaks versus letters versus typeset characters versus normal text). Many of my fellow hand-converters and all of my fellow generators do not do this.

  • But I also have tools in my hands that allow me to take care of these in seconds when I find them and can determine the patterns.

  • Really, I spend most of my time investigating and understanding the structure and style of the text, although it doesn’t mean I have to read all of the text—just enough.

  • If you’re feeling guilty, it’s not your fault. Like I said, generators are stupid. There is sometimes nothing that convinces them that surrounding black text with more black text is redundant. (We’re a long way from The Singularity.)

We now descend into geekery. You can skip over this if you like.

Vim commands:

:%s/<p class="p1"><b><\/b><br><\/p>\n//
:%s/<p class="p[0-9]\+"><br><\/p>\n//
:%s/<span class="Apple-converted-space">[^<]*<\/span>//g
:%s/<b><\/b>//g
:%s/<i><\/i>//g
:%s/<b><\/b>//g
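
For those who prefer a script to an editor session, the same substitutions can be sketched in Ruby. This assumes the TextEdit output looks exactly as in the patterns above; strip_generator_noise is a made-up name.

```ruby
# The Vim substitutions above as one Ruby method, applied in the same order.
# The empty-bold pass runs twice on purpose: removing an empty <i></i>
# can leave a new empty <b></b> behind.
def strip_generator_noise(html)
  html
    .gsub(%r{<p class="p1"><b></b><br></p>\n}, "")
    .gsub(%r{<p class="p[0-9]+"><br></p>\n}, "")
    .gsub(%r{<span class="Apple-converted-space">[^<]*</span>}, "")
    .gsub(%r{<b></b>}, "")
    .gsub(%r{<i></i>}, "")
    .gsub(%r{<b></b>}, "")
end
```

Using %r{...} instead of /.../ means the slashes in closing tags need no escaping.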

An interesting case to mention: there are a few places where a break/tab, instead of a paragraph tag, is used. These must be replaced appropriately.

:%s/<span class="s4"><br>\n<\/span><span class="Apple-tab-span">[^<]*<\/span>/<\/p><Control-V><Return><p>/

Manual replacement is needed in some cases.

Removing <span class="s3"> because black is still black.
Removing <span class="s4"> because Lucida Grande is still Lucida Grande.

Meanwhile, I make note of which CSS classes really matter. They often need to be replaced by appropriate HTML tags for structuring (often they’re chapter headings, for instance), but sometimes they’re needed for special fiction formatting.

If I run into a CSS class with a semantic difference that matters in this way, I rename it to an indication of what it means (such as changing “p15” to “scenebreak”).

Note: this is where I also find out where paragraph classes no longer occur because they had surrounded empty bold/italic/whatever tags. I delete them from the stylesheet.

Now, paragraphs:

(change p.p2 in stylesheet to p.title)
:%s/class="p2"/class="title"/g
(change p.p3 in stylesheet to p.byline)
:%s/class="p3"/class="byline"/g
(change p.p7 in stylesheet to p.chapter)
:%s/class="p7"/class="chapter"/g
(change p.p10 in stylesheet to p.no-indent)
:%s/class="p10"/class="no-indent"/g
(change p.p15 in stylesheet to p.scenebreak)
:%s/class="p15"/class="scenebreak"/g
(p.p19, p.p22, p.p25 mean the same thing as p.p15; merge them into p.scenebreak and remove them from the stylesheet)
:%s/class="p19"/class="scenebreak"/g
:%s/class="p22"/class="scenebreak"/g
:%s/class="p25"/class="scenebreak"/g
(change p.p21 in stylesheet to p.monospace)
:%s/class="p21"/class="monospace"/g
(change p.p27 in stylesheet to p.end-text)
:%s/class="p27"/class="end-text"/g
(change p.p33, p.p35, p.p37, p.p38 to p.centered)
:%s/class="p33"/class="centered"/g
:%s/class="p35"/class="centered"/g
:%s/class="p37"/class="centered"/g
(p.p39 has larger fonts, but since this matters less on mobile readers, I'll keep the centering and make the font size normal.  This can simply be done by merging the class with the "centered" class.)
:%s/class="p39"/class="centered"/g
(Redundant paragraph classes that just mean normal text, strip)
:%s/^<p class="p9"/<p/
:%s/^<p class="p11"/<p/
:%s/^<p class="p12"/<p/
:%s/^<p class="p17"/<p/
:%s/^<p class="p18"/<p/
:%s/^<p class="p20"/<p/
[20 more, not covering here]
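
With this many renames, a lookup table is one alternative to typing each substitution. The sketch below covers only the classes spelled out above, and the mapping is specific to this one document. CLASS_RENAMES, REDUNDANT_CLASSES, and rename_classes are names invented for the example.

```ruby
# Generated class name => meaningful name, for the classes listed above.
CLASS_RENAMES = {
  "p2"  => "title",      "p3"  => "byline",     "p7"  => "chapter",
  "p10" => "no-indent",  "p15" => "scenebreak", "p19" => "scenebreak",
  "p22" => "scenebreak", "p25" => "scenebreak", "p21" => "monospace",
  "p27" => "end-text",   "p33" => "centered",   "p35" => "centered",
  "p37" => "centered",   "p39" => "centered"
}.freeze

# Paragraph classes that only restate the default text style.
REDUNDANT_CLASSES = %w[p9 p11 p12 p17 p18 p20].freeze

def rename_classes(html)
  renamed = html.gsub(/class="(p\d+)"/) do
    whole = Regexp.last_match(0)
    name  = Regexp.last_match(1)
    CLASS_RENAMES.key?(name) ? %(class="#{CLASS_RENAMES[name]}") : whole
  end
  # Strip redundant classes from paragraphs that start a line.
  renamed.gsub(/^<p class="(p\d+)"/) do
    whole = Regexp.last_match(0)
    REDUNDANT_CLASSES.include?(Regexp.last_match(1)) ? "<p" : whole
  end
end
```

Classes not in either table are left alone, so anything unexpected stays visible for a manual look.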

Many spans will be eaten in the belly of the Slorg, because they are often redundant once their surrounding paragraph becomes an h2 or something. (Amusing alternative: or it becomes a link, and therefore underlining it and marking it in blue is not necessary…. and often unreadable on grayscale readers.) And some are just redundant, and were removed in one of the previous steps.

Another interesting case comes up: Apple-tab-spans that create a list. This is a little troubling, because there are plenty of mobile readers that can’t deal with HTML lists, so I need to be creative. In the end I keep the bullets as explicit text and shift-right the paragraphs with another CSS class. I remove the tabs as well in this instance.

Other ways I could have gone: replaced the tabs with multiple &nbsp;, risk using HTML lists, used floating divs with set widths.

It’s not perfect, but few things dealing with lists are.

Sat 3:21 PM

Now I clean up the stylesheet itself to remove extraneous CSS directives. Like, for instance, setting margins to 0, or resetting the font to the same one in every class. Or, um, setting left/right margins and indentation on a piece of text that’s going to be dead-centered anyways. Stupid generators.

Sat 3:30 PM

Now I start replacing things like p.chapter with their structural elements. I also add some style of my own to distinguish structural elements of different types.

p.title becomes h1.title (and I strip out the bold tags).

p.byline stays that way, but I increase the font size and make it bold.

p.chapter becomes h2.chapter. Or,

:%s/<p class="chapter">\(.*\)<\/p>/<h2 class="chapter">\1<\/h2>/

I bold the p.end-text.

Any <br> left over must be replaced by the XML-compliant <br />.
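
The chapter-heading and line-break replacements translate directly to Ruby. A small sketch, assuming each chapter paragraph sits on a single line as it does in the Vim pattern above; structuralize is a hypothetical name.

```ruby
# Turn p.chapter paragraphs into h2.chapter headings and make any
# remaining <br> XML-compliant.
def structuralize(html)
  html
    .gsub(%r{<p class="chapter">(.*)</p>}, '<h2 class="chapter">\1</h2>')
    .gsub(/<br>/, "<br />")
end
```

As in Vim, the dot does not match a newline, so a chapter paragraph that spans several lines would be missed and need a manual fix.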

Hyperlinks have been changed by the RTF filters to explicitly list the URL alongside the anchor text and remove the anchor tags, so I change all that back to the way it was.

I scan for missing images. The more images authors use, the harder life becomes for me, but fortunately there’s just the one, the Book View Cafe logo. (To get at it, I needed the OpenOffice conversion, because it extracts the images to files.) I add it back, centered.

I add the proper UTF-8 encoding declaration at the top. (Sometimes I get ISO-encoded files; I have to watch out for that, and use the right one.)
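
One way to catch an ISO-encoded file before it causes trouble is to test whether the bytes are valid UTF-8. The sketch below uses Ruby's built-in encoding support and makes one loud assumption: anything that is not valid UTF-8 is treated as ISO-8859-1. The name to_utf8 is invented for the example.

```ruby
# Return the given bytes as a UTF-8 string.
# Assumption: input that is not valid UTF-8 is ISO-8859-1 (Latin-1).
def to_utf8(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Usage with a hypothetical filename:
#   html = to_utf8(File.binread("book.html"))
```

A file in some other legacy encoding, such as Windows-1252 with its own smart-quote bytes, would need a different fallback.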

I finish up by adding the proper namespace for the outermost <html> tag.

I check the final HTML in Firefox.

Sat 3:45 PM

What do we have so far?

  1. I reduced a 100-line embedded stylesheet to 9 lines.
  2. I reduced the number of CSS classes from 100 to 9.
  3. I reduced the number of CSS directives from over 400 to just over 20.
  4. I replaced pseudo-structural elements with real structural elements.

But it’s not ready for prime-time just yet.

I copy the entire working directory to my encrypted remote file share because I’m paranoid like that. I verify the copy.

I’m going to take a small break now.

Sat 4:00 PM

I post this to my blog. Then the showering, food-eating, other stuff.

ETA: Break might be until tomorrow. Friend and I are contemplating Watchmen again. Yes, I thought it was that good.