Tag Archive: ruby

To Digital in a Day: Act II

Sat 8:27 PM

Rested off some dizziness and decided to pick this up again so that there’s at least a book for my Kindle.

The HTML looks pretty good. It’s at the state where someone could email it to their Kindle’s email account and have it converted fairly well. It lacks a table of contents, though.

But first, let’s do the Epub.

Sat 8:30 PM

The annoying thing about Epub: most Epub readers don’t deal well when the source contains a very large HTML file. Other mobile formats are internally broken down into separate records, to be loaded in small chunks so as not to stress memory. But many Epub readers try to load the entire file in one go, including Adobe Digital Editions.

So the first task is to break this file up into multiple files. I usually do one file per chapter. My concern is that the chapters are so large (about 20k words each), but we’ll have to see what happens.

First, breaking out the stylesheet into its own style.css file, which can then be linked to by each chapter html file.

<link rel="stylesheet" type="text/css" href="style.css" />

Now I write a Ruby script to chop the file up, because I’m just that way. You could do this in Perl or Python as well, but I prefer Ruby.

#!/usr/bin/ruby

#
# Opens a new file handle to a file with the given
# filename (no suffix), writing the initial header for
# the file, also using the given title in the <head> section.
#
def open_section(name, title)
    section = File.new("#{name}.html", "w")
    section.puts <<-END
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
   "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>#{title}</title>
  <link rel="stylesheet" type="text/css" href="style.css" />
</head>
<body>
    END

    return section
end

#
# Closes the given file handle after writing the HTML footer.
#
def close_section(section)
    section.puts <<-END
</body>
</html>
    END
    section.close
end

File.open("TextilePlanet-te.html") do |file|

    current_section = nil
    section_num = 1
    state = :in_head

    file.each do |line|

        case state
        when :in_head
            if line =~ %r|<body|
                state = :in_body
                current_section = open_section('section-00', 'Title')
            end
        when :in_body
            if line =~ %r|<h2 class="chapter"><b>([^<]+)</b></h2>|
                title = $1
                if current_section
                    close_section(current_section)
                end
                current_section = open_section("section-%02d" % section_num, title)
                section_num += 1
            end

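            # close_section writes the closing tags itself, so drop the
            # source's own </body> and </html> lines instead of copying them.
            if line !~ %r{^\s*</(body|html)>\s*$}
                current_section.puts line
            end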
        end
    end

    close_section(current_section)
end

The end result is 11 files, from section-00.html to section-10.html.

I check the various HTML files in Firefox to make sure we’ve got everything. Everything is quite nicely split; section-00.html has the title page, section-{01-08}.html each contain an entire chapter, section-09.html contains the author bio, and section-10.html the copyright page.

We need an explicit ToC file for Mobipocket. (Epub doesn’t need it, but I’m working from the same source for all the formats.)

Sat 8:55 PM

I could write the ToC file by hand… or generate it with a script… or just use MacVim and shell commands and some writing by hand.

I do that.
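For the record, the scripted route I skipped might look something like the sketch below. It assumes the section-NN.html files from the split above sit in one directory and carry their chapter titles in <title>. The helper names (section_entries, build_toc) are made up for illustration, and the markup is only a guess at what a bare-bones ToC page needs.

```ruby
# Hypothetical helpers, not part of the original workflow.

# Returns [filename, title] pairs for every section-*.html file in dir,
# sorted by filename. Falls back to the bare filename when no <title> is found.
def section_entries(dir)
  Dir.glob(File.join(dir, "section-*.html")).sort.map do |path|
    title = File.read(path)[%r{<title>([^<]*)</title>}, 1]
    [File.basename(path), title || File.basename(path, ".html")]
  end
end

# Builds the XHTML for a simple linked table of contents.
# Titles are inserted as-is, on the assumption that they came out of
# already-escaped HTML.
def build_toc(entries)
  lines = [
    '<?xml version="1.0" encoding="utf-8"?>',
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"',
    '   "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">',
    '<html xmlns="http://www.w3.org/1999/xhtml">',
    '<head>',
    '  <title>Table of Contents</title>',
    '  <link rel="stylesheet" type="text/css" href="style.css" />',
    '</head>',
    '<body>',
    '<h2>Table of Contents</h2>',
    '<ul>'
  ]
  entries.each do |href, title|
    lines << %(  <li><a href="#{href}">#{title}</a></li>)
  end
  lines.concat(['</ul>', '</body>', '</html>'])
  lines.join("\n") + "\n"
end
```

Something like File.write("toc.html", build_toc(section_entries("."))) would then produce the file in the current directory.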

Sat 9:00 PM

Moving the non-toc, non-section, non-stylesheet, non-image files to another directory.

Now using Ruby Epub tools in the root of the work directory. Also, I’ve discovered that an old friend of mine, HTML Tidy, still exists. Big help for finding illegal character sequences and replacing them appropriately.

% epub add-to-opf . content/*
% vim metadata.opf   # (reorder spine)
% epub add-guide . content/toc.html \
      --type "toc" \
      --title "Table of Contents"
% epub add-guide . content/section-01.html \
      --type "text" \
      --title "Start Reading"
% epub add-to-ncx . content/section*
% epub add-to-ncx . content/toc.html
% epub compile .
  adding: mimetype (stored 0%)
  adding: META-INF/container.xml (deflated 35%)
  adding: metadata.opf (deflated 77%)
  adding: content/style.css (deflated 59%)
  adding: content/section-10.html (deflated 55%)
  adding: content/section-09.html (deflated 33%)
  adding: content/section-02.html (deflated 63%)
  adding: content/section-03.html (deflated 62%)
  adding: content/section-01.html (deflated 62%)
  adding: content/section-06.html (deflated 63%)
  adding: content/section-07.html (deflated 53%)
  adding: content/section-04.html (deflated 62%)
  adding: content/bookview-logo.png (stored 0%)
  adding: content/section-05.html (deflated 63%)
  adding: toc.ncx (deflated 79%)
  adding: content/toc.html (deflated 59%)
  adding: content/section-00.html (deflated 32%)
  adding: content/section-08.html (deflated 33%)
% ~/Software/ebooks/epub/epubcheck *epub
No errors or warnings detected

Sat 9:18 PM

Test reading it successfully in Adobe Digital Editions, with a table of contents and a proper NCX.

Ladies and gentlemen, as of 9:18 PM we have valid Epub. And a blog entry shortly.
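A note on the compile step: the "adding:" lines in the session are in the format the Info-ZIP zip command prints, and the Epub container format wants the mimetype file to be the first entry in the archive, stored uncompressed. Here is a minimal sketch of that ordering, assuming the zip command is on the path. The helper names are hypothetical and this is not the actual epub script.

```ruby
# Hypothetical helpers, not the real ruby-epub code.

# Returns the two zip invocations as argument arrays:
#   1. mimetype first, stored (-0) with no extra file attributes (-X)
#   2. everything else, recursively (-r), compressed (-9),
#      without separate directory entries (-D)
def epub_zip_commands(epub_path, entries)
  rest = entries.reject { |entry| entry == "mimetype" }
  [
    ["zip", "-X0", epub_path, "mimetype"],
    ["zip", "-Xr9D", epub_path, *rest]
  ]
end

# Runs the commands in order from the project directory; true only if both succeed.
def compile_epub(epub_path, entries)
  epub_zip_commands(epub_path, entries).all? { |cmd| system(*cmd) }
end
```

From the root of the work directory, compile_epub("book.epub", ["mimetype", "META-INF", "metadata.opf", "toc.ncx", "content"]) would build the archive in that order (book.epub is a placeholder name).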

RubyEpub Tools (ruby-epub) 0.0.2 Released

Added the ‘add-to-ncx’ operation on the epub script, and removed the creation of the template HTML file, which just got in the way.

See the ruby-epub GoogleCode page.

RubyEpub Tools (ruby-epub) 0.0.1 Released

Right now this is a very minimal bundle of functionality. Basically it’s my create/add-buncha-files/compile script, and not much else. It’ll work on Mac OS X and any Unix. Windows is, on the other hand, special. I don’t have a Windows box, so I don’t know.

On the ruby-epub Google Code page is a featured download (ruby-epub-0.0.1.gem) and a featured wiki page (“Installing”), and also the road map (more a check list) prominently displayed.

Perfecting (Simple) PDF Conversion to EPub and Mobipocket

Problem: Convert PDF to reflowable text, preferably HTML.

Why: Text that reflows based on the size of the screen, the size of the font, or the length of a page (or, indeed, without any concept of a page) is what best suits mobile ebook readers and their smaller screens.

Apologies for two Geekery posts in a row. The rest of the discussion is under the fold.

Click here to read more »

LRF to HTML: The Rough Guide

As of this writing, calibre, which converts between many formats and comes with command-line tools, does not convert LRF to HTML, or indeed to almost anything other than LRS, an XML format. Currently this is not a high-priority item to fix in calibre itself, because calibre is aimed at converting things to LRF. (The ePub conversion is still relatively new and shiny.)

ETA: Here’s the LRS specification.

So. Heck. Why not. I’m using Ruby, by the way, because Ruby has the kick-ass REXML library, which also forms the cornerstone for my ruby-epub stuff (still in the making).

Geekery after the cut.

Click here to read more »

State of the Union: A Very Basic Ruby Epub Library

Currently I enjoy, through hacks and such, easy ways to create an ePub project directory, update the OPF and NCX files within, compile it all, and even run epubcheck.

I’m starting to refactor and redesign all that, with an eye to providing a Ruby library that allows manipulation of various parts of Epub, as well as a Project class tying the elements together (after all, when a unique identifier needs to be synced between two files with different formats of almost entirely different inheritances, it gets a bit annoying to do manually).
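To make the identifier problem concrete: in Epub 2 the OPF package names its unique identifier through the unique-identifier attribute, which points at a dc:identifier element by id, while the NCX carries the same value in a meta element named dtb:uid. Here is a minimal sketch of a sync check with REXML. The method names are hypothetical and this is not the library's actual Project class.

```ruby
require "rexml/document"

# Hypothetical helpers for illustration only.

# Returns the text of the dc:identifier element that the OPF package's
# unique-identifier attribute points at, or nil if there is none.
def opf_unique_id(opf_xml)
  root = REXML::Document.new(opf_xml).root
  wanted = root.attributes["unique-identifier"]
  root.each_recursive do |el|
    if el.name == "identifier" && el.attributes["id"] == wanted
      return el.text.to_s.strip
    end
  end
  nil
end

# Returns the content of the NCX <meta name="dtb:uid"> element, or nil.
def ncx_uid(ncx_xml)
  root = REXML::Document.new(ncx_xml).root
  root.each_recursive do |el|
    if el.name == "meta" && el.attributes["name"] == "dtb:uid"
      return el.attributes["content"]
    end
  end
  nil
end

# True when both files carry the same, non-empty identifier.
def identifiers_in_sync?(opf_xml, ncx_xml)
  id = opf_unique_id(opf_xml)
  !id.nil? && !id.empty? && id == ncx_uid(ncx_xml)
end
```

Called as identifiers_in_sync?(File.read("metadata.opf"), File.read("toc.ncx")), it gives a quick yes or no before compiling.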

One day I’ll probably stick this all in a RubyCocoa interface, so that we have an opposite number to the Windows-only Mobipocket tools.

This library will one day, given the blessings of the RubyForge administrators, become a gem for people to play with. Right now it resides over at https://ruby-epub.googlecode.com/, where you can see a roadmap and browse the code and check-ins.

The state of the library is that it’s in very primitive mode at the moment, and some things still need to be tested more thoroughly (and some things are already tested fairly intensively, but you can almost always use more), so it’s not yet released. Right now there are two scripts, and

    “At this point we can create a basic epub and then compile it immediately and the result passes epubcheck.”

as part of the last check-in proclaims.

It’s all GPLv3 licensed, by the way. I like things to be open, because then people can patch stuff, or use the library and do other things of their interest, or suggest more in-depth design changes I’m missing out on because I’m not an in-depth Ruby-ist, and so on. Not that I’ll listen to all of it (another reason I like things to be open: people can branch). I’m not sure how I’ll deal with things once it all goes Cocoa, but I look forward to the future with optimism.

(And also to the new Subversion and its changelists. Changelists are godsends.)

(Now is not the time to argue with me about using wxWidgets or the benefits of being truly cross-platform. No, it really isn’t, unless you want to end up quarantined for a while. I care about as much as anybody writing Mac programs cares, which is about enough to fill a thimble.)

(Nor is this the time to tell me to use Python, as just about all the rest of the Epub tools out there do, save for the few in Java. I’ve used Python. I’m probably one of the few people who likes the syntax, in fact, but at the moment Ruby needs a library and I work in Ruby. Pardon me for being selfish, but I am indeed both open and selfish at the same time.)

During all this I ended up learning Rake and part of Gem creation, and also dove a bit more into Ruby, and thus wrote up a couple quick references (now featured on the new Quick! page).

By the way, things about my coding style and approach:

  • Copious comments, unless I’ve obviously rushed things (in which case I feel horrible). One of the first stops I made during my Ruby crash course was to find out how Rdoc worked.

  • Object-orientated, modular design. I really like responsibilities to belong to cohesive units that can be called by other cohesive units. Crazy, I know.

  • I tend not to do wild and crazy kool-kid hacker things, because I like my code to be readable. And yes, some people do think I’m stupid because I don’t use unless, or dance around with the ternary operator and bit-mode flags, or re-implement my own XML parser, but whatever.

  • I don’t worry about speed and rock-hard reliability against all edge-cases up front. I add that in later, because the devil’s-advocate unit tests keep failing until I do.

Mike and Psmith, Psmith in the City Epub Versions

In celebration of moving my downloads over to WP DownloadManager, I decided to release Epub versions of Mike and Psmith and Psmith in the City. ETA: And also Psmith, Journalist. For more about Psmith, see my Kindle-licious series.


  Mike and Psmith: Epub (157.6 KiB, 371 hits)
  Psmith in the City: Epub (157.9 KiB, 402 hits)
  Psmith, Journalist: Epub (167.1 KiB, 395 hits)

I’ve written this warning a few times before, but I might as well do it again:

Warning Warning Warning

The above texts are public domain only in the United States, anywhere with a Berne-convention-style copyright that expires 25 years after author death, and anywhere else without copyright laws.

If you live anywhere else, especially in Canada, Mexico, the United Kingdom, Ireland, every country in continental Europe, almost every country in Asia, South America, and Africa, these are not legal for you to download, read, read aloud, print, or store on a computer or server unless it happens to be housed in the United States, etc etc etc. [1]

For more information, see Copyright and Wodehouse.

End Warning

I wrote a few scripts to make the Epub process a snap for those of us working by hand. I’m not ready to release them, but here’s an example session (warning: extreme geek):

Click here to read more »

  1. If you think this is ridiculous, join the club. I’m not against copyright in general, far from it, but the man’s been dead for over 25 years.

Fooling Around with Amazon Images

I’m a fussy person. I wanted to show books in my library on my blog sidebar—but not just in any old way.

Update: Fixed the code samples and script.

Requirements

  • One or more things in the sidebar that show books in my library.

  • I want to show books I’ve read, books I’m currently reading (and, in some cases, re-reading), and books I will read.

  • Preferably separated from each other.

  • I want to be able to adjust the number of books shown in each category.

  • I want someone else to worry about all the little book images and not have to store/resize them myself.

  • I want flexibility in linking; sometimes I want to link to reviews I’ve written, for instance, and sometimes direct to Amazon, Audible, Webscriptions, etc.

  • If I’m linking to Amazon, I want my Amazon Associates code attached. Optionally, if any other stores have associates programs, I want to use those tags too.
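
Boiled down, the shelf and linking requirements amount to a small data structure. The sketch below is purely illustrative: the Book fields, the placeholder associates tag, and the Amazon URL pattern with a tag parameter are all assumptions, not the script from this post, and it leaves the images out entirely.

```ruby
# Hypothetical sketch of the requirements above; every name is made up.

# status is one of :read, :reading, :to_read.
# link is an optional override (a review, Audible, Webscriptions, ...).
Book = Struct.new(:title, :status, :asin, :link, keyword_init: true)

ASSOCIATES_TAG = "example-20" # placeholder tag, not a real account

# Prefers an explicit link; otherwise builds an Amazon link carrying the tag.
# The URL pattern is an assumption for illustration.
def book_url(book, tag = ASSOCIATES_TAG)
  return book.link if book.link
  "https://www.amazon.com/dp/#{book.asin}/?tag=#{tag}"
end

# Returns at most limit books in the given category, in their original order.
def shelf(books, status, limit)
  books.select { |book| book.status == status }.first(limit)
end
```

Each sidebar block would then be one call such as shelf(books, :reading, 3), with its own limit per category.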

Widgets from Book Social Networks

None of the widgets from Shelfari, GoodReads, or LibraryThing could satisfy these requirements.

The widgets at GoodReads came closest, but in the end they weren’t flexible enough.

Now we descend into high geekery, including ruby code, so the rest of this goes under the cut.

Click here to read more »