Tag Archive: ruby-epub

To Digital in a Day: Act II

Sat 8:27 PM

Rested off some dizziness and decided to pick this up again so that there’s at least a book for my Kindle.

The HTML looks pretty good. It’s at the state where someone could email it to their Kindle’s email account and have it converted fairly well. It lacks a table of contents, though.

But first, let’s do the Epub.

Sat 8:30 PM

The annoying thing about Epub: most Epub readers don’t cope well when the source contains a very large HTML file. Other mobile formats are internally broken down into separate records, to be loaded in small chunks so as not to stress memory. But many Epub readers, including Adobe Digital Editions, try to load the entire file in one go.

So the first task is to break this file up into multiple files. I usually do one file per chapter. My concern is that the chapters are quite large (about 20k words each), but we’ll have to see what happens.

First, breaking out the stylesheet into its own style.css file, which can then be linked to by each chapter html file.

<link rel="stylesheet" type="text/css" href="style.css" />

Now I write a ruby script to chop the file up, because I’m just that way. You could do this in perl or python as well, but I prefer ruby.

#!/usr/bin/ruby

#
# Opens a new file handle to a file with the given
# filename (no suffix), writing the initial header for
# the file, also using the given title in the <head> section.
#
def open_section(name, title)
    section = File.new("#{name}.html", "w")
    section.puts <<-END
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
   "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>#{title}</title>
  <link rel="stylesheet" type="text/css" href="style.css" />
</head>
<body>
    END

    return section
end

#
# Closes the given file handle after writing the HTML footer.
#
def close_section(section)
    section.puts <<-END
</body>
</html>
    END
    section.close
end

File.open("TextilePlanet-te.html") do |file|

    current_section = nil
    section_num = 1
    state = :in_head

    file.each do |line|

        case state
        when :in_head
            if line =~ %r|<body|
                state = :in_body
                current_section = open_section('section-00', 'Title')
            end
        when :in_body
            if line =~ %r|<h2 class="chapter"><b>([^<]+)</b></h2>|
                title = $1
                if current_section
                    close_section(current_section)
                end
                current_section = open_section("section-%02d" % section_num, title)
                section_num += 1
            end

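            # Skip the source's own closing tags; close_section writes them.
            next if line =~ %r{^\s*</(?:body|html)>\s*$}

            current_section.puts line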
        end
    end

    close_section(current_section)
end

The end result is 11 files, from section-00.html to section-10.html.

I check the various HTML files in Firefox to make sure we’ve got everything. Everything is quite nicely split; section-00.html has the title page, section-{01-08}.html each contain an entire chapter, section-09.html contains the author bio, and section-10.html the copyright page.
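Since the worry was chapter size, a quick word count per file is a cheap sanity check. Here is a minimal sketch, assuming the split files sit in the current directory under the section-NN.html names used above. The helper names word_count and section_sizes are made up for illustration, and the tag stripping is deliberately crude, so treat the numbers as approximate.

```ruby
#!/usr/bin/ruby

#
# Rough word count for a chunk of HTML: replace anything that looks
# like a tag with a space, then count whitespace-separated tokens.
#
def word_count(html)
  html.gsub(/<[^>]*>/, " ").split.size
end

#
# Returns [filename, word count] pairs for every file matching
# the given glob pattern, sorted by filename.
#
def section_sizes(pattern = "section-*.html")
  Dir.glob(pattern).sort.map do |name|
    [name, word_count(File.read(name))]
  end
end

if __FILE__ == $0
  section_sizes.each do |name, words|
    puts "%-20s %7d words" % [name, words]
  end
end
```

Run it in the directory holding the section files; any chapter that dwarfs the others is a candidate for a further split.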

We need an explicit ToC file for Mobipocket. (Epub doesn’t need it, but I’m working from the same source for all the formats.)

Sat 8:55 PM

I could write the ToC file by hand… or generate it with a script… or just use MacVim and shell commands and some writing by hand.

I do that.
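For the generate-it-with-a-script route, something like the following would probably do. It is only a sketch of that alternative, not what was actually used here: it assumes each section file carries the `<title>` written by open_section above, and the toc.html layout (an `<h2>` followed by a plain `<ul>` of links) is my own guess at a reasonable page. The helper names extract_title and toc_html are hypothetical.

```ruby
#!/usr/bin/ruby

#
# Pulls the text of the <title> element out of an HTML string,
# or returns nil when there is none.
#
def extract_title(html)
  html[%r|<title>([^<]*)</title>|, 1]
end

#
# Builds a complete ToC page from [filename, title] pairs.
# Titles are inserted as-is, since they came out of HTML already.
#
def toc_html(entries)
  items = entries.map do |file, title|
    %Q|  <li><a href="#{file}">#{title}</a></li>|
  end
  <<-END
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
   "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Table of Contents</title>
  <link rel="stylesheet" type="text/css" href="style.css" />
</head>
<body>
<h2>Table of Contents</h2>
<ul>
#{items.join("\n")}
</ul>
</body>
</html>
  END
end

if __FILE__ == $0
  entries = Dir.glob("section-*.html").sort.map do |file|
    [file, extract_title(File.read(file)) || file]
  end
  # Only write the file when there is something to list.
  File.write("toc.html", toc_html(entries)) unless entries.empty?
end
```

The front and back matter (title page, author bio, copyright) would still want their link text tidied by hand.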

Sat 9:00 PM

Moving the non-toc, non-section, non-stylesheet, non-image files to another directory.

Now using Ruby Epub tools in the root of the work directory. Also, I’ve discovered that an old friend of mine, HTML Tidy, still exists. Big help for finding illegal character sequences and replacing them appropriately.
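HTML Tidy does far more than this, but if all you want is to locate stray bytes before running it, a few lines of Ruby can point at the offending lines. A minimal sketch, assuming the files are meant to be UTF-8 (as the XML declaration in the script above says); the helper name invalid_utf8_lines is made up for illustration, and this is not a description of how Tidy itself works.

```ruby
#!/usr/bin/ruby

#
# Returns [line number, line] pairs for every line in the given
# text that is not valid UTF-8. The text is treated as raw bytes
# so that splitting into lines never trips over a bad sequence.
#
def invalid_utf8_lines(text)
  bad = []
  raw = text.dup.force_encoding("BINARY")
  raw.each_line.with_index(1) do |line, num|
    bad << [num, line] unless line.dup.force_encoding("UTF-8").valid_encoding?
  end
  bad
end

if __FILE__ == $0
  ARGV.each do |name|
    invalid_utf8_lines(File.binread(name)).each do |num, line|
      puts "#{name}:#{num}: #{line.inspect}"
    end
  end
end
```

Pass it the section files as arguments; it prints nothing when every line is clean.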

% epub add-to-opf . content/*
% vim metadata.opf   # (reorder spine)
% epub add-guide . content/toc.html \
      --type "toc" \
      --title "Table of Contents"
% epub add-guide . content/section-01.html \
      --type "text" \
      --title "Start Reading"
% epub add-to-ncx . content/section*
% epub add-to-ncx . content/toc.html
% epub compile .
  adding: mimetype (stored 0%)
  adding: META-INF/container.xml (deflated 35%)
  adding: metadata.opf (deflated 77%)
  adding: content/style.css (deflated 59%)
  adding: content/section-10.html (deflated 55%)
  adding: content/section-09.html (deflated 33%)
  adding: content/section-02.html (deflated 63%)
  adding: content/section-03.html (deflated 62%)
  adding: content/section-01.html (deflated 62%)
  adding: content/section-06.html (deflated 63%)
  adding: content/section-07.html (deflated 53%)
  adding: content/section-04.html (deflated 62%)
  adding: content/bookview-logo.png (stored 0%)
  adding: content/section-05.html (deflated 63%)
  adding: toc.ncx (deflated 79%)
  adding: content/toc.html (deflated 59%)
  adding: content/section-00.html (deflated 32%)
  adding: content/section-08.html (deflated 33%)
% ~/Software/ebooks/epub/epubcheck *epub
No errors or warnings detected

Sat 9:18 PM

Test reading it successfully in Adobe Digital Editions, with a table of contents and a proper NCX.
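One detail worth noticing in the zip listing: mimetype goes in first and is stored rather than deflated. The Epub container format requires that, so a reader can identify the file from its first few dozen bytes. epubcheck already verifies it, but here is a minimal sketch of the same check done by hand, reading the first zip local file header directly. The function name epub_mimetype_ok? is hypothetical, and this covers only that one rule, nothing else epubcheck does.

```ruby
#!/usr/bin/ruby

#
# Checks that the given bytes (the start of a zip file) begin with
# an uncompressed entry named "mimetype", with no extra field,
# whose content is "application/epub+zip".
#
def epub_mimetype_ok?(bytes)
  return false if bytes.nil? || bytes.bytesize < 58

  # Zip local file header: 30 fixed bytes, then the filename.
  sig, _version, _flags, method, _time, _date,
    _crc, _csize, _usize, name_len, extra_len =
    bytes.unpack("a4vvvvvVVVvv")

  sig == "PK\x03\x04" &&
    method == 0 &&        # 0 means stored, not deflated
    name_len == 8 &&
    extra_len == 0 &&
    bytes.byteslice(30, 8) == "mimetype" &&
    bytes.byteslice(38, 20) == "application/epub+zip"
end

if __FILE__ == $0
  ARGV.each do |name|
    ok = epub_mimetype_ok?(File.binread(name, 58))
    puts "#{name}: mimetype entry #{ok ? 'looks right' : 'is missing or malformed'}"
  end
end
```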

Ladies and gentlemen, as of 9:18 PM we have valid Epub. And a blog entry shortly.

RubyEpub Tools (ruby-epub) 0.0.2 Released

Added the ‘add-to-ncx’ operation on the epub script, and removed the creation of the template HTML file, which just got in the way.

See the ruby-epub Google Code page.

RubyEpub Tools (ruby-epub) 0.0.1 Released

Right now this is a very minimal bundle of functionality. Basically it’s my create/add-buncha-files/compile script, and not much else. It’ll work on Mac OS X and any Unix. Windows is, on the other hand, special. I don’t have a Windows box, so I don’t know.

On the ruby-epub Google Code page is a featured download (ruby-epub-0.0.1.gem) and a featured wiki page (“Installing”), and also the road map (more a check list) prominently displayed.