The Switch to Hakyll

This site was originally built with Jekyll. Technically I began with the pre-packaged distribution known as Octopress which offered a Rakefile for common tasks as well as an out-of-the-box directory structure. I didn't use many of these features, however, so I had been wanting to shed traces of Octopress, partly motivated by the pursuit of increased speed in site generation. I found the opportunity to do this when Jekyll 1.0 was released recently.

To cut away the unnecessary components of Octopress, I decided to go through every file and keep only what I absolutely needed. This is evident in commits after 712168ec.

I was well on my way to making the site's source a lot leaner when I remembered that I had been wanting to try Hakyll, a static site generator written in Haskell that I had heard about on Hacker News. Given that I was more or less starting the site from scratch, I figured it was the perfect opportunity to try it.

Ultimately, this site is now compiled with Hakyll. It took me about a week to implement every feature I wanted in Hakyll and Pandoc. The net effect is a site that generates noticeably faster and is far more flexible.

# File Structure

Oftentimes when new to a system, learning its directory structure can help one to get oriented. Unlike some other static site generators, Hakyll does not enforce any particular directory structure or convention. The one I have adopted for my repository looks like this:

| Entry           | Purpose                       |
|-----------------|-------------------------------|
| provider/       | compilable content            |
| src/            | Hakyll, Pandoc customizations |
| Setup.hs        | build type                    |
| blaenk.cabal    | dependency management         |
| readme.markdown | repository information        |

I build the site binary with cabal build, which results in a new top-level directory dist/ that stores the object files generated by GHC. The site binary, stored at the top level, is the actual binary used for generating and manipulating the site. This binary has a variety of options; the ones I commonly use are:

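| Option  | Purpose                                                        |
|---------|----------------------------------------------------------------|
| build   | Generate the entire site                                       |
| preview | Generate changes on-the-fly and serve them on a preview server |
| deploy  | Deploy the site using a custom deploy procedure                |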

The build option creates a top-level directory generated/ with two sub-directories: cache/ for cached content and site/ where the compiled site is stored.

The deploy option copies the compiled site into the top-level directory deploy/, which is git-controlled, and force pushes the content to the master branch, effectively deploying the site (on GitHub).

# Hakyll

As I mentioned earlier, Hakyll is a static site generator written in Haskell. Hakyll sites are fleshed out using a Haskell Embedded Domain Specific Language (EDSL). This EDSL is used to declare rules for different patterns which should be searched for within the provider directory and what should be done with them.

For example, in the following Hakyll program:

{-# LANGUAGE OverloadedStrings #-}
import Hakyll

main :: IO ()
main = hakyll $ do
  -- copy every image verbatim, keeping its path
  match "images/*" $ do
    route   idRoute
    compile copyFileCompiler

match "images/*" is a Rule stating that every file in the provider directory matching the glob images/* should be routed using idRoute and compiled using the Compiler copyFileCompiler.

Routing a file in the context of a static site generator like Hakyll refers to the mapping between the file as it sits in the provider directory and its name/path in the compiled directory; in this case, idRoute keeps the same name/path in the compiled directory.

Compiling a file in this context refers to the operations that should be performed on the contents of the file, for example processing through Pandoc for Markdown to HTML generation, or in this case, simply copying the file from the provider directory to the compiled directory.

Compiler is a Monad, which allows for seamless chaining of operations that should be performed on any given file. For example, here is my Rule for regular posts:

match "posts/*" $ do
  route $ nicePostRoute
  compile $ getResourceBody
    >>= withItemBody (abbreviationFilter)
    >>= pandocCompiler
    >>= loadAndApplyTemplate "templates/post.html" (tagsCtx tags <> postCtx)
    >>= loadAndApplyTemplate "templates/layout.html" postCtx

This states that the compilation process for any given post is as follows:

  1. the post body (i.e. excluding post metadata) is read
  2. the result is passed to an abbreviation substitution filter
  3. the result is passed to my custom Pandoc compiler
  4. the result is embedded into a post template with a so-called "post context"
  5. the result is embedded into the page layout

A post is routed using the nicePostRoute function which is largely borrowed from Yann Esposito. It simply routes a posts/this-post.markdown to posts/this-post/index.html so that the post can be viewed at posts/this-post/.
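To make the routing concrete, here is a minimal sketch of just the path rewrite, assuming POSIX path separators. It is not the actual nicePostRoute; the name nicePath is hypothetical and only the filepath package that ships with GHC is used.

```haskell
import System.FilePath (takeBaseName, takeDirectory, (</>))

-- Hypothetical helper: rewrites "dir/name.ext" to "dir/name/index.html",
-- so the page can be served at "dir/name/".
nicePath :: FilePath -> FilePath
nicePath path = takeDirectory path </> takeBaseName path </> "index.html"
```

In a Hakyll rule, a function like this could presumably be turned into a route with customRoute (nicePath . toFilePath), though I have only sketched the pure part here.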

An interesting thing to note is that when templates are applied, they are supplied a Context. A Context is simply a Monoid that encapsulates a key (i.e. String identifier for the field) and an Item. During application of the template, if a field of the form $key$ is encountered, the supplied Context is searched for an appropriate handler (i.e. one with the same key). If one is found, the item is passed to that Context's handler and the result is substituted into the template.

In the above Rule for posts, I pass a pre-crafted post Context, postCtx, and mappend to it a special tags Context, tagsCtx, which encapsulates tag information for that post.
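The idea of combining contexts and looking up $key$ fields can be illustrated with a toy model. This is only a sketch under simplifying assumptions: Ctx, constCtx and applyTemplate are hypothetical names, values are plain Strings, and unknown keys are left untouched, none of which matches Hakyll's real Context type or template engine.

```haskell
import Data.Maybe (fromMaybe)

-- Toy stand-in for a Context: given a key, maybe produce a value.
newtype Ctx = Ctx { runCtx :: String -> Maybe String }

-- Combining two contexts tries the left one first, then the right one.
instance Semigroup Ctx where
  Ctx f <> Ctx g = Ctx $ \k -> case f k of
    Just v  -> Just v
    Nothing -> g k

instance Monoid Ctx where
  mempty = Ctx (const Nothing)

-- A context that handles exactly one key.
constCtx :: String -> String -> Ctx
constCtx key val = Ctx $ \k -> if k == key then Just val else Nothing

-- Substitute every $key$ field that the context can handle.
applyTemplate :: Ctx -> String -> String
applyTemplate ctx = go
  where
    go ('$' : rest) =
      case break (== '$') rest of
        (key, '$' : after) | not (null key) ->
          fromMaybe ('$' : key ++ "$") (runCtx ctx key) ++ go after
        _ -> '$' : go rest
    go (c : cs) = c : go cs
    go []       = []
```

For instance, applyTemplate (constCtx "title" "Hi" <> constCtx "date" "today") applied to a template fills in both $title$ and $date$, which mirrors how tagsCtx tags <> postCtx makes both sets of fields available to one template.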


# SCSS

The first customization I made was to allow support for SCSS. This is usually possible with a simple line:

getResourceString >>= withItemBody (unixFilter "sass" ["-s", "--scss"])

This works fine in POSIX environments such as Linux, which is my primary development environment. However, it's very useful to me to have Windows support as well. The problem is that on Windows, Ruby gem binaries (such as scss) are implemented using batch file stubs. The underlying function used for creating the process in unixFilter is createProcess from System.Process, specifically with the proc type. On Windows, this uses the CreateProcess function, which does not run batch files unless they are run explicitly with cmd.exe /c batchfile. The problem is that there is no simple way to find the file path of the batch file stub for scss.

The solution to this is to use the shell type with createProcess instead of proc. This behaves like the C system function: the command string is interpreted by the shell, which in the case of Windows is cmd.exe. As a result, the program can simply be called as scss, leaving the shell to automatically run the appropriate batch file stub.

To accomplish this, I had to implement what was essentially a mirror copy of Hakyll.Core.UnixFilter with proc swapped for shell. I'll be proposing a pull request upstream soon which gives the user the option and removes the duplicate code. Now I can implement an SCSS compiler like the following, though I additionally pass it a few extra parameters in my actual implementation:

getResourceString >>= withItemBody (shellFilter "sass -s --scss")
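The difference between the two process types can be sketched with System.Process alone. This is not the shellFilter from my source tree; runWithProc and runWithShell are hypothetical names, and error handling (exit codes, stderr) is left to readCreateProcess, which throws on a non-zero exit.

```haskell
import System.Process (proc, readCreateProcess, shell)

-- Runs an executable directly: the program path and its arguments are
-- passed as-is, with no shell involved.
runWithProc :: FilePath -> [String] -> String -> IO String
runWithProc prog args = readCreateProcess (proc prog args)

-- Hands the whole command line to the system shell (cmd.exe on Windows),
-- which resolves the command itself, including batch file stubs.
runWithShell :: String -> String -> IO String
runWithShell cmdline = readCreateProcess (shell cmdline)
```

Both functions feed the last argument to the process on stdin and return its stdout. Inside a Hakyll rule, something like withItemBody (unsafeCompiler . runWithShell "sass -s --scss") would presumably play the role of the filter above.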

# Abbreviations

One feature I missed from kramdown that wasn't available in my new markdown processor, Pandoc, was abbreviation substitution. It consists of writing abbreviation definitions which are then used to turn every occurrence of the abbreviation into a proper abbr HTML tag with an accompanying tooltip consisting of the definition.

I had hardly used regular expressions in Haskell before, so the way they are used was pretty confusing to me at first. There's a base regex package called regex-base which exposes a common interface, and then there are a variety of backend implementations. Hakyll happens to use regex-tdfa, a fast and popular backend, so I decided to use that one instead of introducing additional dependencies.

One way of using regular expressions in Haskell is through type inference, as is described in the Text.Regex.Base.Context documentation:

> This module name is Context because they [sic] operators are context dependent: use them in a context that expects an Int and you get a count of matches, use them in a Bool context and get True if there is a match, etc.

Keeping this in mind, I explicitly annotated the [[String]] type since I wanted every match and sub-match. I created a function abbreviationReplace that takes a String, removes the abbreviation definitions, and then creates abbr tags out of every occurrence of the abbreviation using the parsed definitions.

The abbreviationReplace function begins like this:

abbreviationReplace :: String -> String
abbreviationReplace body =
  let pat = "^\\*\\[(.+)\\]: (.+)$" :: String
      found = body =~ pat :: [[String]]
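For readers who want to see the whole idea end to end, here is a dependency-free sketch of the same technique. It deliberately swaps the regular expressions for plain prefix matching from base, so it is not my actual abbreviationReplace; parseDef, replaceAll and abbreviate are hypothetical names. It also ignores word boundaries, and with several definitions a later one may rewrite text inside an earlier tag.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (isNothing, mapMaybe)

-- Parse a definition line of the form "*[ABBR]: definition".
parseDef :: String -> Maybe (String, String)
parseDef line = do
  rest <- stripPrefix "*[" line
  let (abbr, after) = break (== ']') rest
  def <- stripPrefix "]: " after
  if null abbr || null def then Nothing else Just (abbr, def)

-- Replace every occurrence of a non-empty needle in a string.
replaceAll :: String -> String -> String -> String
replaceAll needle new = go
  where
    go [] = []
    go s@(c : cs) = case stripPrefix needle s of
      Just rest -> new ++ go rest
      Nothing   -> c : go cs

-- Drop the definition lines, then wrap each abbreviation in an abbr tag.
abbreviate :: String -> String
abbreviate body = foldr apply text defs
  where
    ls   = lines body
    defs = mapMaybe parseDef ls
    text = unlines (filter (isNothing . parseDef) ls)
    apply (abbr, def) =
      replaceAll abbr ("<abbr title=\"" ++ def ++ "\">" ++ abbr ++ "</abbr>")
```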

# Git Tag

In a previous post I talked about a liquid tag I created for Jekyll which inserts the SHA of the commit on which the site was last generated. I have come to like this small feature of my site. It's not some tacky "Powered by blah" footer. It's pretty unobtrusive. It seems unimportant to people who wouldn't understand what it's about, and those who would understand it might immediately recognize its purpose.

Update: I have stopped including the git commit in the footer of every page. The problem with doing this was that, in order to have every page reflect the new commit, I had to regenerate every page before deploy. This obviously doesn't scale well as more and more pages are added to the site. Instead I have adopted a per-post commit and history link which I believe is a lot more meaningful and meshes perfectly well with generation of pages, i.e. if a post is modified, there'll be a commit made for it, and since it was modified it will have to be regenerated anyway. Now I simply include social links in the footer.

One thing I forgot to update the previous post about was that I ended up switching from using the Rugged git-bindings for Ruby to just using straight up commands and reading their output. The reason for doing this was that, while everything worked perfectly fine on Linux, Rugged had problems building on Windows. It turned out that taking this approach ended up being simpler and had the added benefit of decreasing my dependencies.

The equivalent of a liquid tag in Jekyll would be a field, expressed as a Context. For this reason I created the gitTag function that takes a desired key, such as git, which would be used as $git$ in templates, and returns a Context which yields the String of formatted HTML. One problem was that to do this I had to use IO, so I needed some way to run an IO action inside the Compiler Monad. It turned out that Hakyll already had a function for this called unsafeCompiler, which it uses for UnixFilter for example.

Here's what gitTag looks like:

import Hakyll
import System.Process (readProcess)

gitTag :: String -> Context String
gitTag key = field key $ \_ ->
  -- unsafeCompiler lifts the IO action into the Compiler Monad
  unsafeCompiler $ do
    sha <- readProcess "git" ["log", "-1", "HEAD", "--pretty=format:%H"] []
    message <- readProcess "git" ["log", "-1", "HEAD", "--pretty=format:%s"] []
    return ("<a href=\"" ++ sha ++
           "\" title=\"" ++ message ++ "\">" ++ take 8 sha ++ "</a>")

# Pandoc

Hakyll configuration is fairly straightforward. What took longer was the process of re-implementing some features that I had in kramdown when I used Jekyll that weren't available in my new document processor, Pandoc.

Pandoc is a very interesting project that basically works by parsing input documents into a common intermediate form represented as an abstract syntax tree (AST). This AST can then be used to generate an output document in a variety of formats. In this spirit, I feel it's a lot like the LLVM project. It seems to me that it has been gaining popularity especially from an end-user perspective (i.e. using the pandoc binary), commonly used to do things such as write manual pages in markdown or generate ebooks.

The very nature of how Pandoc transforms input documents into an AST lends itself to straightforward AST transformations. I have created two such transformations so far: one for Pygments syntax-highlighting and another for fancy table of contents generation.

One of the things I needed to implement, however, was the abbreviation substitution described above. I would have implemented it as a Pandoc customization, but Pandoc has no representation for abbreviations in its abstract syntax tree. This was why I implemented it as a Hakyll compiler instead, using simple regular expressions.

There is actually work towards implementing abbreviation substitution according to the readme under the section "Extension: abbrevations" [sic] but it says:

> Note that the pandoc document model does not support abbreviations, so if this extension is enabled, abbreviation keys are simply skipped (as opposed to being parsed as paragraphs).

# Pygments

Update: This has been through two redesigns since this was written. The first involved a filesystem-backed caching system, but this was still too slow, since the bottleneck seemed to be caused by continuously spawning a new pygmentize process. Most recently I've created a pygments server that the site opens alongside it at launch, and this Pandoc AST transformer communicates with it through its stdout/stdin handles. It works perfectly and the site compiles a lot quicker. It also fully supports UTF-8.

One of the first things I wanted to implement right away was syntax highlighting with Pygments. There are a variety of options for syntax highlighting. In fact, Pandoc comes with support for kate: a Haskell package for syntax highlighting written by the author of Pandoc. However, I don't find it to be on par with Pygments. In the past, I simply posted code to Gist and then embedded it into posts. This caused unnecessary overhead and, more importantly, would break my site when GitHub made changes to the service.

Eventually I realized that GitHub just uses Pygments underneath, so I implemented a Pandoc AST transformer that finds every CodeBlock, extracts the code within it, passes it to Pygments, and replaces that CodeBlock with a RawBlock containing the raw HTML output by Pygments. I also implemented a way to specify an optional caption which is shown under the code block. I use blaze-html for the parts where I need to hand-craft HTML.

Ultimately, this all means that I can write a fenced code block in markdown and name its language with a lang attribute.

Or, with a caption, by adding a text attribute alongside lang.

One thing I had to do was invoke unsafePerformIO in the function I created which runs the code through pygmentize, an end-user binary for the Pygments library. I'm not sure if there's a better way to do this, but my justification for using it is that Pygments should return the same output for any given input. If it doesn't, then there are probably larger problems.

pygmentize :: String -> String -> String
pygmentize lang contents = unsafePerformIO $ do

I don't feel particularly worried about it, given my justification. It's a similar justification used by Real World Haskell when creating bindings for PCRE with the foreign function interface:

> It lets us say to the compiler, "I know what I'm doing - this code really is pure". For regular expression compilation, we know this to be the case: given the same pattern, we should get the same regular expression matcher every time. However, proving that to the compiler is beyond the Haskell type system, so we're forced to assert that this code is pure.

This is what the AST transformer I wrote looks like:

pygments :: Block -> Block
pygments (CodeBlock (_, _, namevals) contents) =
  let -- language and optional caption come from the block's key-value attributes
      lang = fromMaybe "text" $ lookup "lang" namevals
      text = fromMaybe "" $ lookup "text" namevals
      -- highlighted code from Pygments, wrapped in a container div
      colored = renderHtml $ H.div ! A.class_ "code-container" $ do
                  preEscapedToHtml $ pygmentize lang contents
      -- the caption is only rendered when one was given
      caption = if text /= ""
                then renderHtml $ H.figcaption $ H.span $ H.toHtml text
                else ""
      composed = renderHtml $ H.figure ! A.class_ "code" $ do
                   preEscapedToHtml $ colored ++ caption
  in RawBlock "html" composed
-- every other kind of block is left untouched
pygments x = x

# Table of Contents

The more sophisticated and complex of the AST transformers I wrote for Pandoc is table of contents generation. This is something that kramdown had out of the box, though not as fancy. Paired with automatic id generation for headers, this meant that simply placing {:toc} in my page would replace it with an automatically generated table of contents based on the headers used in the page.

# Alternatives

Pandoc actually does have support for table of contents generation using the --toc flag. In fact, Julien Tanguy recently devised a way to generate a separate version of every post which only included the table of contents, then re-introduced the table of contents as a Context field $toc$.

I actually tried this approach, along with a metadata field that decided if the table of contents should be included in a given post or page. However, I ended up deciding against using it. One advantage would be that it took less code on my end, and possibly I would avoid re-inventing the wheel. One reason I didn't keep it was that there was a tiny increase in compilation time which I fear might accumulate in the future as the number of posts grows. The reason for this is that the table of contents is generated for every post/page, instead of only the ones that should display it.

Another reason was that it would require me to implement the fancy section numbering in JavaScript, which I don't think would be too difficult since in this case the table of contents already exists and I simply need to insert my numbering. The main reason I decided against it, on top of the previous two, is that there would be a noticeable delay between the time when the table of contents is shown plainly and when it is transformed into my custom table of contents.

# Implementation

Implementing this involved many steps. In general terms, I had to make a pass through the document to collect all of the headers, then I had to make another pass to find a special sentinel marker I would manually place in the document to replace it with the generated table of contents. This effectively makes table of contents generation a two-pass transformer.

Gathering all of the headers and their accompanying information, i.e. HTML id, text, level, proved to be a pretty straightforward task using queryWith from the pandoc-types package:

queryWith :: (Data a, Monoid b, Data c) => (a -> b) -> c -> b
-- Runs a query on matching a elements in a c.
-- The results of the queries are combined using mappend.

Once I collect all of the Header items' information, I normalize them by finding the smallest header level (i.e. the biggest header) and shifting all headers relative to that level. That is, if the smallest header level is 3 (i.e. h3), every header has 2 subtracted from its level so that all headers are level 1 and above. Note that I'm not actually modifying the headers in the document, just the information about them that I've collected.
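The normalization step is small enough to sketch in isolation. Here the collected header information is reduced to hypothetical (level, title) pairs, and normalizeLevels is a name made up for this example rather than the function in my source.

```haskell
-- Shift all levels so that the smallest one becomes 1.
-- Assumes (level, title) pairs stand in for the collected header info.
normalizeLevels :: [(Int, String)] -> [(Int, String)]
normalizeLevels [] = []
normalizeLevels hs = [(lvl - offset, title) | (lvl, title) <- hs]
  where
    offset = minimum (map fst hs) - 1
```

With headers at levels 3, 4 and 3, the offset is 2 and the result has levels 1, 2 and 1.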

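Next, a Data.Tree is constructed out of the headers which automatically encodes the nesting of the headers. This is done by exploiting groupBy by passing it < as an equivalence predicate. Although < is not a true equivalence relation, this works because groupBy compares each element against the first element of the current group, so every run of deeper headers ends up grouped under the header that precedes it: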

tocTree :: [TocItem] -> Forest TocItem
tocTree = map (\(x:xs) -> Node x (tocTree xs)) . groupBy (comp)
  where comp (TocItem a _ _) (TocItem b _ _) = a < b

This Tree is finally passed to a recursive function that folds every level of the Tree (known as a Forest) into a numbered, unordered list. While that may sound like an oxymoron, the point is that I wanted to have nested numbering in my table of contents. For this reason, I create an unordered list with a span containing the section number concatenated to the parent's section number. This function generates the HTML.
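The grouping and the nested numbering can be tried out together in a small, self-contained sketch. The TocItem here is a simplified, hypothetical two-field version (level and title only), numbered is a made-up name, and the HTML generation is left out so that only the section numbers are produced.

```haskell
import Data.List (groupBy, intercalate)
import Data.Tree (Forest, Tree (..))

-- Simplified stand-in for the real TocItem: header level and title.
data TocItem = TocItem Int String

-- Nest each run of deeper headers under the header that precedes it.
tocTree :: [TocItem] -> Forest TocItem
tocTree = map toNode . groupBy deeper
  where
    deeper (TocItem a _) (TocItem b _) = a < b
    toNode (x : xs) = Node x (tocTree xs)
    toNode []       = error "groupBy never yields an empty group"

-- Pair every title with its section number, prefixed by the parent's number.
numbered :: Forest TocItem -> [(String, String)]
numbered = go []
  where
    go prefix forest = concat (zipWith (visit prefix) [1 :: Int ..] forest)
    visit prefix n (Node (TocItem _ title) kids) =
      let path = prefix ++ [n]
      in (intercalate "." (map show path), title) : go path kids
```

Given headers at levels 1, 2, 2 and 1, the section numbers come out as 1, 1.1, 1.2 and 2. Wrapping each number in a span inside a list item is then a matter of rendering.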

The final problem was finding a way to insert the table of contents on-demand, in a location of my choosing. In kramdown, this is achieved by writing {:toc}, which gets substituted with the table of contents. Pandoc has no such thing, however. For this reason, I chose a list with a single item, "toc", as the placeholder for the table of contents. This means that I write a one-item list containing only toc wherever I want the table of contents to show up.

You can take a look at the beginning of this post to see what the generated table of contents looks like, especially the nested numbering I was referring to.

# Deploying

I host my site using GitHub Pages. Such sites are deployed by pushing the site to the master branch of the repository. I wrote a quick shell script that accomplishes this in a pretty straightforward manner. It creates a git-ignored directory, deploy/, which is itself under git control and associated with the same repository, but tracks its master branch instead.

When I deploy the site with ./site deploy, the contents of deploy/ are removed (except for the .git/ directory) and then all of the newly generated files are copied into it. A commit is then generated for the deployment, tagged with the SHA identifier of the commit from which the site was generated, to make it easy for me to track things down later. An eight-character, truncated SHA is used as follows:

COMMIT=$(git log -1 HEAD --pretty=format:%H)
SHA=${COMMIT:0:8}  # first eight characters of the full hash
git commit -m "generated from $SHA" -q

Finally, the commit is force pushed to the repository, replacing everything already there, effectively deploying the site.

# Conclusion

Preliminary migration to Hakyll was pretty quick. This included porting all of my posts, pages, and other assets to the Hakyll and Pandoc Markdown formats. The rest of the week was spent implementing the various features, some outlined above, and refining the code base.

At first I was a little rusty with my Haskell and found myself at odds with the seemingly capricious compiler, trying to find one way or another to appease it. I quickly remembered that patience pays off with Haskell, and eventually came to really enjoy reasoning out the problems and solving them with Haskell.

The site binary which is in charge of generation, previewing, etc. is compiled. Once you have configured Hakyll to your liking, you have a very fast binary, especially compared to other site generators which are known not to scale well with the number of posts. The Compiler Monad in Hakyll takes care of dependency tracking, allowing re-generation of only those items which are affected by those which were changed, instead of the whole site.

But perhaps my favorite aspect of Hakyll is that it's more like a library for static site generation which you use as you see fit, and as a result, your site is entirely customizable.

May 14, 2013