As of this writing, calibre, which converts between many e-book formats and ships with command-line tools, does not convert LRF to HTML, or indeed to much of anything other than LRS, an XML format. Fixing this is not currently a high priority for calibre itself, because calibre is aimed at converting things to LRF. (The ePub conversion is still relatively new and shiny.)

ETA: Here’s the LRS specification.

So. Heck. Why not. I’m using Ruby, by the way, because Ruby has the kick-ass REXML library, which also forms the cornerstone for my ruby-epub stuff (still in the making).

Geekery after the cut.

The scope of this code: extremely basic. It should be run on the LRS file produced by calibre’s lrf2lrs utility. The finer details of calibre’s LRS are skipped over, and there are some hacks. It is smart enough to cope with strange formatting (though not illegal formatting).
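
To give a feel for the input, here is a minimal sketch of pulling the title and author out of an LRS document with REXML. The XML is a hand-made, stripped-down stand-in (an assumption, not real lrf2lrs output); only the element path is taken from the script further down. The names `BookInfo`, `SAMPLE_LRS` and `read_book_info` are made up for this sketch.

```ruby
require 'rexml/document'

# Hypothetical container for this sketch (the real script uses OpenStruct).
BookInfo = Struct.new(:title, :author)

# Hand-made stand-in for an LRS file; real lrf2lrs output has far more in it.
SAMPLE_LRS = <<-XML
<BBeBXylog>
  <BookInformation>
    <Info>
      <BookInfo>
        <Title>An Example Book</Title>
        <Author>Arthur Not a Poet</Author>
      </BookInfo>
    </Info>
  </BookInformation>
</BBeBXylog>
XML

# Hypothetical helper: looks up the two fields by their element path.
def read_book_info(xml)
  doc = REXML::Document.new(xml)
  base = 'BBeBXylog/BookInformation/Info/BookInfo'
  BookInfo.new(doc.elements["#{base}/Title"].text,
               doc.elements["#{base}/Author"].text)
end

info = read_book_info(SAMPLE_LRS)
puts "#{info.title} by #{info.author}"
```

The full script goes on to handle styles and pages as well.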

But basically it does this:

prompt% ./lrs2html AnExampleBook.lrs
Parsing XML
Done parsing
Attributes: #<OpenStruct title="An Example Book", author="Arthur Not a Poet">
Processing Styles
Styles: {"208"=>"text-align: center; ", "220"=>"text-align: center; ", "209"=>"text-align: center; ", "221"=>"text-align: foot; ", "210"=>"text-align: center; ", "213"=>"text-align: center; ", "214"=>"text-align: center; "}
Processing Pages (Sections)
Procesing text for Section 0
Title: An Example Book
Processed section An Example Book
Procesing text for Section 1
Title: Chapter 1: In the Beginning
Processed section Chapter 1: In the Beginning
Procesing text for Section 2
Title: Chapter 2: Flowering
Processed section Chapter 2: Flowering
Procesing text for Section 3
Title: Chapter 3: Autumn
Processed section Chapter 3: Autumn
Creating directory An_Example_Book
Writing sections
Writing 'An Example Book' to title.html
Writing 'Chapter 1: In the Beginning' to section-01.html
Writing 'Chapter 2: Flowering' to section-02.html
Writing 'Chapter 3: Autumn' to section-03.html
Writing TOC
DONE
prompt% ls An_Example_Book
section-01.html
section-02.html
section-03.html
title.html
toc.html
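
The directory and file names in that listing come from two small rules in the script. Here is a rough sketch of just those rules, pulled out into two helpers whose names (`dir_name_for`, `file_name_for`) are invented for illustration; the regular expression and the format string are the ones the script uses.

```ruby
# Directory name: anything that isn't a letter, digit, or hyphen becomes "_".
def dir_name_for(title)
  title.gsub(/[^-A-Za-z0-9]/, '_')
end

# Section 0 is treated as the title page; the rest get zero-padded numbers.
def file_name_for(section_num)
  section_num == 0 ? 'title.html' : format('section-%02d.html', section_num)
end

puts dir_name_for('An Example Book')   # An_Example_Book
puts file_name_for(0)                  # title.html
puts file_name_for(3)                  # section-03.html
```

Note that punctuation in a title (colons, apostrophes and so on) turns into underscores as well, so two differently punctuated titles can end up with the same directory name.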

So, here’s the code, which is cheerily commented as always… you might want to download the file, since there’s a control character that WordPress wisely does not allow me to post.

  lrs2html.rb (10.8 KiB, 307 hits)
#!/usr/bin/ruby
#
# Copyright (c) 2008 Arachne Jericho <arachne.jericho@gmail.com>
#
# License: GPL v3 (See http://www.gnu.org/licenses/gpl-3.0.html)
#
# Description:
#
# Converts a calibre .lrs XML file to a directory of HTML
# files suitable for grinding through your favorite epub,
# mobipocket, plucker, etc. creation tools.
#
# Dependencies:
#
# If you're running a standard Ruby 1.8.6 installation, you don't
# need any extra libraries.  This includes most people running Mac
# OS X, since ruby comes for free with Panther onwards.
#

require 'rexml/document'
require 'ostruct'
require 'fileutils'

if ARGV.length != 1
    STDERR.puts "Usage: #{$0} <lrs file from calibre's lrf2lrs>"
    exit 1
end

#----- UTILITY FUNCTIONS -----#
#
# If you're a programmer, you may be interested in swapping these
# in and out, depending on how annoying the formatting is.
#

#
# Removes all <span> and <br/> elements from a given paragraph
# HTML element.  This presumes that the LRS tags have been renamed
# to the standard HTML ones.
#
# Parameters:
# - p : paragraph element to "flatten"
#
# Returns:
# - new "flattened" XML element
#
def flatten_paragraph(p)
    xml = p.to_s
    xml.gsub!(/<span[^>]*>/, '')
    xml.gsub!('</span>', '')
    xml.gsub!('<br/>', '')
    minidoc = REXML::Document.new xml
    return minidoc.elements[1]
end

#
# Removes all <br/> elements from the top level of the given element.
# This means that if e looks like
#
#  <blockquote>
#    <br/>
#    <p>This is some stuff<br/>maybe set in free verse<br/>too many poets</p>
#    <br/>
#  </blockquote>
#
# the result of un_br(e) looks like
#
#  <blockquote>
#    <p>This is some stuff<br/>maybe set in free verse<br/>too many poets</p>
#  </blockquote>
#
# Parameters:
# - e : element to remove top-level <br/>'s from
#
# Returns:
# - new XML element
# 
def un_br(e)
    # Work on a deep copy so the caller's element is left untouched,
    # and delete only the <br/> elements that are direct children.
    copy = e.deep_clone
    copy.elements.delete_all('br')
    return copy
end

#
# Recursively standardizes LRS elements to HTML ones.  This
# function destructively modifies the given XML element, renaming
# it and fiddling with any non-standard attributes (such as
# fontsize).
#
# Parameters:
# - element : XML element to be destructively modified.
#
def process_element(element)
    case element.name
    when 'CR'
        element.name = 'br'
    when 'Italic'
        element.name = 'em'
    else
        element.name = element.name.downcase
    end

    # remove illegal attributes or change them into style
    style = ''
    element.attributes.each do |name, value|
        case name
        when 'fontsize'
            style += "font-size: #{value}px; "
            element.delete_attribute(name)
        end
    end
    if style.length > 0
        # We use to_s on a possibly nil attribute because the nil
        # singleton in ruby actually does support a few methods.
        # to_s returns an empty string for it.
        element.add_attribute('style', element.attributes['style'].to_s + style)
    end

    #
    # Remove illegal characters from element text
    # if it has any.
    #
    # Adding to the list of illegal characters you run across
    # may be something you want to do.  Here I just run into
    # form feeds.  You may need to remove carriage returns and
    # such (the \f below stands in for a literal control-L).
    #
    if element.has_text?
        element.text = element.text.gsub(/[\f]/, '')
    end

    if element.has_elements?
        element.elements.each do |e|
            process_element e
        end
    end
end

puts "Parsing XML"
lrf = File.new ARGV[0]
doc = REXML::Document.new lrf
lrf.close
puts "Done parsing\n"

attributes = OpenStruct.new
attributes.title = doc.elements['BBeBXylog/BookInformation/Info/BookInfo/Title'].text
attributes.author = doc.elements['BBeBXylog/BookInformation/Info/BookInfo/Author'].text

puts "Attributes: #{attributes.inspect}\n"

# Skipping the toc for now -- the books I'm processing have
# no reasonable ones.

# Process text styles; ignoring page styles for now
puts "Processing Styles"
styles = {}
doc.elements.each('BBeBXylog/Style/TextStyle') do |style|
    style_string = ''
    if style.attributes.has_key? 'align'
        align = style.attributes['align']
        if align != 'head'
            # "head" seems to be the normal alignment. So let's
            # leave it out.
            style_string += "text-align: #{style.attributes['align']}; "
        end
    end
    if style.attributes.has_key? 'parindent'
        # We're not treating parindent well -- just processing
        # an obvious indent of 0 and letting the default indent
        # take over.
        indent = style.attributes['parindent'].to_i
        if (indent == 0)
            # XXX: We're really not treating parindent well, but this
            # is commented out to avoid all those files where people
            # think you should separate every paragraph with 3 <br/>
            # instead of, you know, indenting.
            #
            #style_string += "text-indent: 0ex; "
        end
    end
    if style.attributes.has_key? 'fontweight'
        # XXX: In a sane world I wouldn't comment this out.  Sadly,
        # this is not a sane world, and some people think that all text
        # needs to be bold.
        #style_string += "font-weight: #{style.attributes['fontweight']}; "
    end

    if style_string.length > 0
        styles[style.attributes['stylelabel']] = style_string
    end
end
puts "Styles: #{styles.inspect}\n"

puts "Processing Pages (Sections)"
sections = []

#
# Starting our section number at 0 not because it's used for indexing
# (it is indeed used for display) but since the first section is 99.99%
# likely to be the title page.  We might as well increment on the way.
#
section_num = 0
doc.elements.each('BBeBXylog/Main/Page') do |page|
    section = OpenStruct.new
    section.title = ''
    section.num = section_num
    section_num += 1
    section.text = ''

    # We're merrily skipping over BlockSpace because I have
    # no idea what it means or how it should translate to HTML.

    puts "Procesing text for Section #{section.num}"

    page.elements.each('TextBlock') do |text_block|

        # Reference the style block and attach styles to
        # each paragraph, since not every reader knows how
        # to treat a surrounding div.
        style = styles[text_block.attributes['textstyle']]
        if style
            text_block.elements.each('P') do |paragraph|
                paragraph.add_attribute('style', style)
            end
        end

        text_block.elements.each do |element|
            # Just for readability of the result; this
            # is HTML so this won't affect formatting.
            add_newline = false

            process_element(element)

            case element.name
            when 'br'
                # Just forget all really toplevel <br/>'s
                # (a la calling un_br on the text block
                # LRS element, but since the TextBlock is
                # purely LRS and not HTML, I'm avoiding
                # making that call.
                next
            when 'p'
                # XXX: Indeed, not all spans are bad.  Spans
                # are often used in output that wants to
                # avoid any formatting tags at all, preferring
                # to go CSS all the way.  I find this is not the
                # case with a lot of LRF output.
                element = flatten_paragraph element

                # Here we divine the title from the first paragraph
                # of the Page, or at least the first one that has
                # content.  We also mark this element with the
                # special title class.
                if (element.has_text? && section.title.length == 0)
                    element.text =~ /^([^.?!]+)/
                    section.title = $1
                    puts "Title: #{section.title}"

                    element.add_attribute('class', 'title')
                end
                add_newline = true
            when 'plot'
                # Wha? Skipping this LRS element.
                next
            end

            section.text += un_br(element).to_s
            if add_newline
                section.text += "\n"
            end
        end
    end

    # Eliminate any empty paragraphs.  These may have resulted
    # through any fiddling you might do above.  This will
    # catch empty paragraphs split across multiple lines.
    section.text.gsub!(%r|<p[^>]*>[ \n\t]*</p>|m, '')

    puts "Processed section #{section.title}"
    sections.push section
end

# Create a directory and start sticking files into it.
dir_name = attributes.title.gsub(/[^-A-Za-z0-9]/, '_')
puts "Creating directory #{dir_name}"
FileUtils.mkdir_p dir_name

puts "Writing sections"
sections.each do |section|

    if (section.num == 0)
        # Special case - section 0 being 99% likely the title page
        # You may want to add more cases here if you run more often
        # into separate copyright etc. pages
        section.file = 'title.html'
        section.title = attributes.title
    else
        section.file = 'section-%02d.html' % section.num
        section.title = "#{section.title}"
    end

    puts "Writing '#{section.title}' to #{section.file}"
    # And here's where the special title class assigned earlier
    # comes into play formatting-wise.
    #
    # Admittedly there could be something else in the text that
    # uses the title class.  But I leave that to manual finagling
    # afterwards.
    File.open(File.join(dir_name, section.file), 'w') do |file|
        file.puts <<-END
<?xml version="1.0" encoding="UTF-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>#{section.title}</title>
    <style type="text/css">
      .title { font-size: 1.5em; font-weight: bold; margin: 1ex 0 2ex 0; }
    </style>
  </head>
  <body>
  #{section.text}
  </body>
</html>
        END
    end
end

# We're not writing an NCX toc, just a normal ToC that can be used
# by any reader, whether or not they support NCX. I'm thinking of
# Mobipocket readers here.
puts "Writing TOC"
File.open(File.join(dir_name, 'toc.html'), 'w') do |file|

    # Some readers will understand the hanging indenting CSS
    # combination here (Kindle for instance); others, not so much,
    # but it doesn't hurt them.
    file.puts <<-END
<?xml version="1.0" encoding="UTF-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Table of Contents</title>
  <style type="text/css">
    p { text-indent: -4ex; text-align: left; padding-left: 4ex; }
    h3 { margin-bottom: 1ex; }
  </style>
  </head>
  <body>
    <h3>Table of Contents</h3>
    END

    sections.each do |section|
        # Don't put the title into the table of contents. 
        if (section.num == 0)
            next
        end

        file.puts <<-END
<p>
<a href="#{section.file}">#{section.title}</a>
</p>
        END
    end
    file.puts <<-END
  </body>
</html>
    END
end

puts "DONE"
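
If you want to poke at the tag renaming on its own, here is a cut-down sketch of the idea behind process_element. It only renames tags (no attribute or text handling), the function name `rename_lrs_tags` is made up for this example, and the fragment is hand-written rather than taken from a real book.

```ruby
require 'rexml/document'

# Simplified stand-in for the script's process_element: it only renames
# tags and leaves attributes and text alone.
def rename_lrs_tags(element)
  element.name =
    case element.name
    when 'CR'     then 'br'
    when 'Italic' then 'em'
    else element.name.downcase
    end
  # Recurse into child elements so nested LRS tags get renamed too.
  element.elements.each { |child| rename_lrs_tags(child) }
  element
end

# Hand-written fragment, not taken from a real book.
fragment = REXML::Document.new('<P>Hello <Italic>world</Italic><CR/></P>').root
renamed = rename_lrs_tags(fragment)
puts renamed.to_s
```

Adding another `when` branch is the obvious place to start if your LRS files use tags that should not simply be lowercased.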