bup - towards the perfect backup

Since I discovered it, I've been in love with the concept behind bup.

bup appeals to my sense of efficiency in taking backups: backups should back up the absolute minimum amount of data so I can keep the most versions, and that frees me to use whatever level of redundancy I deem appropriate for my backup media.

But more than that, the underlying technology of bup is ripe with possibility: the basic premise of a backup tool gives rise to the possibility of a sync tool, a deduplicated home directory tool, distributed repositories, archives and more.

how it works

A more complete explanation can be found on the main GitHub repository, but essentially bup applies rsync's rolling-checksum (literally, the same algorithm) to determine file-differences, and then only backs up the differences - somewhat like rsnapshot.

Unlike rsnapshot however, bup then applies deduplication of the chunks produced this way using SHA1 hashes, and stores the results in the git-packfile format.

This is both very fast (rsnapshot, conversely, is quite slow) and very redundant - the Git tooling is able to read and understand a bup-repository as just a Git repository with a specific commit structure (you can run gitk -a in a .bup directory to inspect it).
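
For example, ordinary Git tooling can be pointed straight at a bup repository (a quick sketch, assuming the default repository location of ~/.bup):

$ GIT_DIR=~/.bup git branch -a             # one branch per backup name
$ GIT_DIR=~/.bup git log --oneline --all   # each bup save shows up as a commit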

why it's space efficient

bup's archive and rolling-checksum format mean it is very space efficient. bup can correctly deduplicate data that undergoes insertions, deletions, shifts and copies. bup deduplicates across your entire backup set, meaning the same file uploaded 50 times is only stored once - in fact it will only be transferred across the network once.
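
A minimal backup run looks something like this (the paths and the 'home' save name are just placeholders); repeating the save after changes only stores - and, with -r, only transfers - chunks it hasn't seen before:

$ bup init                        # create the local repository (~/.bup)
$ bup index /home/will            # scan for new and changed files
$ bup save -n home /home/will     # write a snapshot to the 'home' branch

# the same thing against a remote repository over SSH:
$ bup init -r backupserver:
$ bup index /home/will
$ bup save -r backupserver: -n home /home/will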

For comparison, I recently moved 180 GB of ZFS snapshots of the same dataset undergoing various daily changes into a bup archive, and successfully compacted it down to 50 GB. I suspect I could have gotten it smaller if I'd unpacked some of the archive files that had been created in that backup set.

That is a dataset which is already deduplicated via copy-on-write semantics (it was not using ZFS deduplication because you should basically never use ZFS deduplication).

why it's fast

Git is well known for being bad at handling large binary files - it was designed to handle patches of source code, and makes assumptions to that effect. bup steps around this problem because it only uses the Git packfile and index format to store data: where Git is slow, bup implements its own packfile writers and index readers to make looking up data in Git structures fast.

bup also uses some other tricks to do this: it combines individual indexes into midx files to speed up lookups, and builds Bloom filters as data is added (a Bloom filter is a fast, hash-based data structure which tells you something is either 'probably in the data set' or definitely not).
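
bup maintains these automatically as you save, but the same structures can also be rebuilt by hand (check bup midx --help and bup bloom --help for the options your version supports):

$ bup midx -a    # fold any loose .idx files into .midx files
$ bup bloom      # (re)generate the bloom filter over the pack indexes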

using bup for Windows backups

bup is a Unix/Linux oriented tool, but in practice I've applied it most usefully at the moment to some Windows servers.

Running bup under Cygwin on Windows is far superior to the built-in Windows backup system for file-based backups. It's best to combine it with the vscsc tool, which lets you take one-time VSS snapshots so the backup doesn't capture an inconsistent state.

There's a link to a Gist here with my current favorite script for this type of thing - the bash script needs to be invoked from a scheduled task which runs a batch file like this.

If you want to use this script on Cygwin then you need to install the mail utility for sending email, as well as rsync and bup.

This script is reasonably complicated, but it is designed to be robust against failures in a sensible way - and if running bup somehow fails, to fall back to making tar archives, giving us an opportunity to fix a broken backup set.

This script will work for backing up to your own remote server today. But, it was developed to work around limitations which can be fixed - and which I have fixed - and so the bup of tomorrow will not have them.

towards the perfect backup

The script above was developed for a client, and the rsync-first stage was designed to ensure that the most recent backup would always be directly readable from a Windows Samba share and not require using the command line.

It was also designed to work around a flaw with bup's indexing step which makes it difficult to use with variable paths as produced by the vscsc tool in cygwin. Although bup will work just fine, it will insist on trying to hash the entire backup set every time - which is slow. This can be worked around by symlinking the backup path in cygwin beforehand, but since we needed a readable backup set it was as quick to use rsync in this instance.
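
For reference, the symlink trick amounts to something like this ($SNAPSHOT_PATH is a stand-in for whatever path the vscsc-invoked script is handed):

# give the ever-changing shadow-copy path a stable name before indexing
ln -sfn "$SNAPSHOT_PATH" /home/backup/current
bup index /home/backup/current/
bup save -n winserver /home/backup/current/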

But it doesn't have to be this way. I've submitted several patches against bup which are also available in my personal development repository of bup on GitHub.

The indexing problem is fixed via index-grafts: modifying the bup-index to support representing the logical structure as it is intended to be in the bup repository, rather than the literal on-disk path structure. This allows the index to work as intended without any games on the filesystem, hashing only modified or updated files.

The need for a directly accessible version of the backup is solved via a few other patches. We can modify the bup virtual-filesystems layer to support a dynamic view of the bup repository fairly easily, and add WebDAV support to the bup-web command (the dynamic-vfs and bup-webdav branches).

With these changes, a bup repository can now be directly mounted as a Windows mapped network drive via Explorer's web client, and files opened and copied directly from the share. Any version of a backup set is then trivially accessible, and importantly we can simply start bup-web as a Cygwin service and leave it running.

Hopefully these patches will be incorporated into mainline bup soon (they are awaiting review).

so should I use it?

Even with the things I've had to fix, the answer is absolutely. bup is by far the best backup tool I've encountered lately. For a basic Linux system it will work great, for manual backups it will work great, and with a little scripting it will work great for automatic backups under Windows and Linux.

The brave can try out the cutting-edge branch on my GitHub account to test out the fixes in this blog post, and if you do, posting to [bup-list@googlegroups.com](https://groups.google.com/forum/#!forum/bup-list) with any problems, successes or code reviews would help a lot.

Flashing Marlin with Eclipse and AVR Dude

Intro

3D printing is still very much in the hobbyist stage, and I am a tinkerer at heart. So flashing my printer's firmware is a basic operation - in fact it has to be a basic operation, since most configuration changes are baked into the firmware.

But I run my printer at 250000 baud over USB. This is an excellent choice for printing, since the error rate on the Arduino's serial line is basically zero, and it's faster to boot. Excellent for rapid G-code sending.

But it can make reprogramming using the standard Arduino IDE a bit of a pain, and by and large I don't like the standard Arduino IDE.

The problem is that your Arduino in 3D printing mode is running at a non-standard baud rate. Linux can handle this just fine, as can avrdude, the programming tool underlying the Arduino IDE, but the IDE doesn't let you just type in what you need. Eclipse does. Get the latest version of avrdude you can, since it makes things a lot easier - I have 6.0.1, which ships with Ubuntu "Trusty Tahr".

Reprogramming with Eclipse

People have covered setting up Eclipse for Arduino elsewhere, and I will in the future cover my setup in my own words (I believe understanding comes from finding an idea explained in the right voice a lot of the time) but for now I'll just say it works pretty well.

Reprogramming an Arduino Mega 2560 (or compatible)

The basic command line you need for the Arduino Mega when it's running at 250000 bps is very simple. The avrdude command should be avrdude -cstk500v2 -P/dev/ttyACM0 -b250000 in Eclipse. You can enter this by typing in the baud rate directly in the drop down box - Eclipse will accept it just fine.
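
Put together, a full flash command along those lines looks something like this from a terminal (the MCU and hex file name here are assumptions - adjust them to your board and firmware build):

$ avrdude -cstk500v2 -P/dev/ttyACM0 -b250000 -patmega2560 -D -Uflash:w:Marlin.hex:i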

The AVRdude settings screen should look something like this when you're done:

AVRDude Configuration Screen

The benefit of this configuration is that your printer (or Arduino for another project) can be easily reflashed without having to fiddle with timing a reset just so you can run avrdude at one of the "standard" baud rates.

It's worth noting that by creating a custom boards.txt configuration it might be possible to accomplish this in the Arduino IDE as well, but Eclipse is so much easier to use as a dev environment for big Arduino projects (like 3D printer firmwares) that I don't really have any inclination to go back (plus it generalizes out to non-Arduino AVR programming nicely).

Going further

This is just a short "I did this a few minutes ago" note. In the future I'll detail my Arduino Eclipse workspace, to add to the signal-to-noise ratio on that subject.

Quick note - uninstalling r8168-dkms

This is a quick note on something I encountered while trying to work out why my Realtek NICs are so finicky about connecting and staying connected at gigabit speeds when running Linux.

The current hypothesis is that the r8168 driver isn't helping very much. So I uninstalled it - and ran into two problems.

Firstly

...you need to uninstall it on Ubuntu/Debian with apt-get remove --purge r8168-dkms or the config files (and it's all config files) won't be properly removed, and the module will be left installed.

Secondly

...you really need to make sure you've removed all the blacklist r8169 entries. They can be left behind if you don't purge configuration files, but I found I'd also left a few hanging around in the /etc/modprobe.d directory from earlier efforts. So a quick fgrep r8169 * would've saved me a lot of trouble and confusion as to why r8169 wasn't being automatically detected.

In my case it turned out I'd put a very official-looking blacklist-networking.conf file in my modprobe.d directory. On both my machines.
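
Putting the two fixes together, the cleanup looks something like this (run as root; the update-initramfs step only matters if your initramfs carries a copy of the blacklist):

$ apt-get remove --purge r8168-dkms
$ fgrep -r r8169 /etc/modprobe.d/    # hunt down any stray blacklist entries
$ update-initramfs -u
$ modprobe r8169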

Something about Realtek NICs?

If I find an answer I'll surely provide updates, but needless to say there's no rhyme or reason to when they do or don't work, other than that they consistently don't work on kernel 3.11 with the r8168 driver, it seems.

Running Gnome Tracker on a Server

In a passing comment it was suggested to me that it would be really great if the home fileserver offered some type of web interface to find things. We've been aggregating downloaded files there for a while, and there have been attempts at categorization, but this all really falls apart when you wonder "what does 'productivity' mean? And does this go under 'Linux' or some other thing?"

Since lately I've been wanting to get desktop search working on my actual desktops, via Gnome's Tracker project and its tie-in to Nautilus and Nemo (possibly the subject of a future blog), it seemed logical to run it on the fileserver as an indexer for our shared directories - and then to tie some kind of web UI to that.

Unfortunately, Tracker is very desktop oriented - there's no easy out-of-the-box daemon mode for running it on a headless system - but with a little tweaking you can make it work for you quite easily.

How to

On my system I keep Tracker running as its own user under a system account. On Ubuntu you need to create this like so (using a root shell - sudo -i):

$ adduser --system --shell=/bin/false --disabled-login --home=/var/lib/tracker tracker
$ adduser tracker root

Since tracker uses GSettings for its configuration these days, you need to su into the user you just created to actually configure the directories which should be indexed. Since this is a server you probably just have a list of them, so set it somewhat like the example below. Note: you must run the dbus-launch commands in order to have a viable session bus for dconf to work with. This will also be a requirement of Tracker later on.

$ su --shell /bin/bash tracker
$ eval `dbus-launch --sh-syntax`
$ dconf write /org/freedesktop/tracker/miner/files/index-recursive-directories "['/path/to/my/dir/1', '/path/to/my/dir/2', '/etc/etc']"
$ kill $DBUS_SESSION_BUS_PID
$ exit

Your Tracker user is now ready at this point. To start and stop the service, we use an Upstart script like the one below:

description "gnome tracker system startup script"
author "wrouesnel"

start on (local-filesystems and net-device-up)
stop on shutdown

respawn
respawn limit 5 60

setuid tracker

script
    chdir /var/lib/tracker
    eval `dbus-launch --sh-syntax`
    echo $DBUS_SESSION_BUS_PID > .tracker-sessionbus.pid
    echo $DBUS_SESSION_BUS_ADDRESS > .tracker-sessionbus
    /usr/lib/tracker/tracker-store
end script

post-start script
    chdir /var/lib/tracker
    while [ ! -e .tracker-sessionbus ]; do sleep 1; done
    DBUS_SESSION_BUS_ADDRESS=$(cat .tracker-sessionbus) /usr/lib/tracker/tracker-miner-fs &
end script

post-stop script 
    # We need to kill off the DBUS session here
    chdir /var/lib/tracker
    kill $(cat .tracker-sessionbus.pid)
    rm .tracker-sessionbus.pid
    rm .tracker-sessionbus
end script

Some things to focus on about the script: we launch and save the DBus session parameters. We'll need these to reconnect to the session to run tracker related commands. The post-stop stanza is to kill off the DBus session.

You do need to explicitly launch tracker-miner-fs in order for file indexing to work, but you don't need to kill it explicitly - it will be automatically shut down when Upstart kills tracker-store.

Also note that since tracker runs as the user tracker it can only index files and directories which it is allowed to traverse, so check your permissions.

You can now start Tracker as your user with start tracker. And stop it with stop tracker. Simple and clean.

Using this

My plan for this setup is to throw together a Node.js app on my server that will forward queries to the tracker command line client - that app will be a future post when it's done.
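
In the meantime, queries can be run by hand from inside the tracker user's session (assuming the tracker-search command-line tool is installed; the query string is just an example):

$ su --shell /bin/bash tracker
$ eval `dbus-launch --sh-syntax`
$ tracker-search --limit 10 "ubuntu iso"
$ kill $DBUS_SESSION_BUS_PID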

Migrating to Gmail

Why

In a twist of irony, given my prior articles wrestling with a decent IDLE daemon for use with getmail, I'm now faced with a new problem: figuring out the best way to migrate all my existing, locally hosted email to Gmail.

This is evidently not an uncommon problem for people, presumably for largely the same reasons I'm facing: although I like having everything locally on my own server, it only works in places where (1) I live in the same place as the server and (2) my server isn't double-NAT'd, so dynamic DNS can actually reach it.

How

My personal email has been hosted on a Dovecot IMAP server in a Maildir up till now. Our tool of choice for this migration will be the venerable OfflineIMAP utility, available on Debian-ish systems with apt-get install offlineimap.

A Foreword

I tried a lot to get this to work properly in a Maildir -> Gmail configuration, and while it's technically possible, I couldn't seem to get the folder creation to play nicely with tags - OfflineIMAP wants to create them with a leading separator ('/' on Gmail), but Gmail itself doesn't recognize that as a root tag. There doesn't seem to be any way around this behavior with name translation or anything.

I suspect you could work around this by uploading to a subdirectory, and then moving everything out of the subdirectory (sub-tag?) on Gmail, but didn't try it.

Configuration file

In your home directory (I did this on my home server, since 7 GB of email takes a long time to upload over ADSL) you need to create a .offlineimaprc file. For an IMAP -> IMAP transfer, it has a structure something like this:

[general]
accounts = Gmail-wrouesnel

# Gmail max attachment size - you'll get errors otherwise.
maxsize = 25000000
socktimeout = 600

[Account Gmail-wrouesnel]
# Note the ordering - Gmail is the 'local' folder.
remoterepository = Local
localrepository = Gmail

[Repository Local]
type = IMAP
# This ensures we only do a 1-way transfer. If you want to do 2-way then you need a
# rule to exclude the Gmail [All Mail] folder.
readonly = True
remotehost = localhost
remoteuser = <local user>
remotepass = <local password>
ssl = yes
# I use SSL so this is needed - let it throw an error, then copy the hash back.
cert_fingerprint = 60571343279e7f43ee95000762f5fcd54ad24816
sep = .
subscribedonly = no

[Repository Gmail]
type = IMAP
ssl = yes
remotehost = imap.googlemail.com
remoteuser = <gmail user>
remotepass = <gmail password>
sslcacertfile = /etc/ssl/certs/ca-certificates.crt
sep = /
subscribedonly = no

Running

Test the process first with offlineimap --dry-run to check that things are going to turn out roughly how you expect. Then execute offlineimap to start the process. I really recommend doing this in a byobu or screen session, or at least with the nohup utility since a connection drop will cause offlineimap to abort.

Check back on the process once every day or so to make sure it's still running - OR - write a shell script to re-invoke it until it succeeds.
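
I haven't tested this myself, but the sort of wrapper I have in mind is just a retry loop like this:

#!/bin/bash
# keep re-running offlineimap until it exits cleanly
until offlineimap; do
    echo "offlineimap exited with status $?; retrying in 60 seconds..." >&2
    sleep 60
done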

Personal thoughts

This seems to be the most painless way to upload old email to Gmail. In my case the move is prompted by a real-life move, where my 24TB server won't be coming with me. I looked into some options for moving my email system, for example to a Docker image at $5 a month for 20 GB, but at the end of the day had to face the fact that there was a perfectly capable free alternative available, and it would just be throwing money away. Everything already operates through my Gmail accounts anyway, so it's not like there's a security concern there - and when it comes to email, you either use GPG or you're doing nothing anyway.

It's worth observing here that the same process used for the migration can also be used for a local backup, which is a system I will most definitely be using in the future. OfflineIMAP can write Maildir natively, so there's no need to run an IMAP server locally for that, and it helpfully solves the "what if Gmail suddenly disappears" problem (more likely from a power failure than anything else, but my email is important to me).

A Better Getmail IDLE client

Updates

(2013-10-15) And like that I've broken it again. Fixing the crash on IMAP disconnect actually broke IMAP disconnect handling. The problem here is that IMAPClient's exceptions are not documented at all, so a time-based thing like IDLE requires some guessing as to what IMAPClient will handle and what you need to handle. This would all be fine if there was a way to get Gmail to boot my client after 30 seconds so I could test it easily.

I've amended the code so that any time it would call _imaplogin() it explicitly dumps the IMAPClient object after trying to log it out, and recreates it. As near as I can tell this seems to be the safe way to do it, since the IMAPClient object does open a socket connection when created, and doesn't necessarily re-open it if you simply re-issue the login command.

There's an ongoing lesson here that doing anything that needs to stay up with protocol like IMAP is an incredible pain.

(2013-10-14) So after 4 days of continuous usage I'm happy with this script. The most important thing it does is crash properly when it encounters a bug. I've tweaked the Gist a few times in response (a typo meant imaplogin didn't recover gracefully) and added a call to notify_mail on exit which should've been there to start with.

It's also becoming abundantly clear that I'm way too click-happy with publishing things to this blog, so some type of interface to show my revisions is probably in the future (along with a style overhaul).

Why

My previous attempt at a GetMail IDLE client was a huge disappointment, since imaplib2 seems to be buggy when handling long-running processes. It's possible that some magic hard termination of the IMAP session after each IDLE cycle is necessary, but that raises the question of why the idle() function in the library doesn't immediately exit when this happens - to me it implies I could still end up with a zombie daemon that doesn't retrieve any mail.

Thus a new project - this time based on the Python imapclient library. imapclient uses imaplib behind the scenes, and seems to enjoy a little more use than imaplib2, so it seemed a good candidate.

The script

Dependencies

The script has a couple of dependencies, most easily installed with pip:

$ pip install psutil imapclient

Get it from a Gist here - I'm currently running it on my server, and naturally I'll update this article based on how it performs as I go.

Design

The script implements a Unix daemon, and uses pidfiles to avoid concurrent executions. It's designed to be stuck in a crontab file to recover from crashes.

I went purist on this project since I wanted to avoid as many additional frameworks as possible and work mostly with built-in constructs - partly as just an exercise in what can be done. At the end of the day I ended up implementing a somewhat half-baked messaging system to manage all the threads based on Queues.

The main thread, being the listener for signals, creates a "manager" thread, which in turn spawns all my actual "idler" threads.

Everything talks via Queue.Queue() objects and blocks on the get() method, which uses the CPU efficiently. The actual idle() call, being blocking, runs on its own thread and posts "new mail" events back to the idler thread, which then invokes getmail.

The biggest challenge was making sure exceptions were caught in all the right places - imapclient has no way to cleanly kill off an idle() call, so a shutdown involves making the idle_check() call raise an exception.

I kind of hacked this together as I went - the main thing I really targeted was trying to make sure failure modes caused crashes, which is hard to do with Python threading a lot of the time. A crashed script can be restarted; a zombie script doing nothing looks like it's alive and well.

Personal thoughts

Pure Python is not the best for this sort of thing - an evented IMAP library would definitely be better, but this way I can stick with a mostly single-file deployment, and I don't want to write my own IMAP client at the moment.

Of course IMAP is a simple enough protocol in most respects, so it's not like it would be hard, but the exercise was still interesting. If I turn this into a new project, I'd still like to tackle it in something like Haskell.

A GetMail IDLE daemon script

Updates

Although the script in this article works, I'm having some problems with it after long-running sessions. The symptom seems to be that imaplib2 just stops processing IDLE session responses - it terminates and recreates them just fine, but no new mail is ever detected and thus getmail is never triggered. With 12 or so hours of usage out of the script, this seems odd as hell and probably like an imaplib2 bug.

With the amount of time sunk into this, I'm tempted to go in one of two directions: re-tool the script to simply invoke getmail's IDLE functionality and basically remove imaplib2 from the equation, or write my own functions to speak IMAP and use the IDLE command.

Currently I'm going with option 3: turn imaplib2's debugging up to max and see if I can spot the bug - but at the moment I can't really recommend this particular approach to anyone, since it's just not reliable enough - though it does underline the fact that Python really doesn't have a good IMAP IDLE library.

Updates 2

After another long-running session of perfect performance, I'm once again stuck with a process that claims to start idling successfully, but seems to hang - giving no exceptions or warnings of any kind and only doing so after 8+ hours of perfect functioning. It's not a NAT issue since this is far short of the 5-day default timeout.

At a best guess the problem seems to be that once logged in, imaplib2 leaves the session open but dumbly just listens to the socket - which eventually dies for some reason (my ISP re-assigning IPs, maybe?) - but imaplib2's "reader" thread just blocks on polling rather than triggering the callback code (the notable thing is that I can see the poll commands in the log stop, and the session timeout being detected, but no invocation of the callback).

As it stands, I have to strongly recommend against using imaplib2 for any long-running process like IDLE - you simply can't deal with a library that silently hangs itself after a half-day or so without crashing or logging anything to indicate it has happened. The only detection is when self-addressed emails stop arriving, and that's a really stupid keep-alive protocol. I'll be retooling the script to try out imapclient next, but that will be a future article and a separate Gist.

Why

This is a script which took way too long to come together in Python 2.7 using imaplib2 (pip install imaplib2).

The basic idea is to use the very reliable GetMail4 (apt-get install getmail4) - which is written in Python - to fetch from my IMAP mail accounts when new mail arrives, rather than polling with a 1-minute cronjob as I had been doing (which is slightly too slow for how we use email these days, may not be liked by some mail servers, and is resource intensive to boot).

The big benefit here is rapid mail delivery, but the other benefit is that it solves the problem of cron causing overlapping executions of getmail, which can lead to blank messages (though not message loss). Other ways of solving this, such as wrapping the cron job in a flock call, aren't great, since if the lockfiles don't get cleaned up it just stops working silently.
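
For the record, the flock approach is just a crontab one-liner along these lines (the lock path and rc file name are placeholders):

* * * * * flock -n /tmp/getmail.lock -c 'getmail -r config.getmailrc'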

Requirements

Writing a reliable IDLE daemon that won't cause us to spend half a day wondering where our email is, is not easy. This was an interesting Python project for me, and it's certainly not pretty or long - but I spent a ton of time trying to think through as many edge cases as I could. In the end I settled on tying the daemon itself to sendmail on my system, so at least if it crashes or an upstream server goes offline I'm notified and have a decent chance of seeing why; the use of pidfiles means I can have cron re-execute it every 5 minutes as a failsafe if it does go down.

The Script

I started with the example I found here, but ended up modifying it pretty heavily. That code isn't a great approach in my opinion, since it overwhelms the stack size pretty quickly with multiple accounts - imaplib2 is multithreaded behind the scenes (2 threads per account), so spawning an extra thread to handle each account gives you 3 per account; 6 accounts gives you 18 threads, plus the overhead of forking and running GetMail in a subprocess.

All things considered, I didn't improve things all that much, but using a single overwatch thread to reset the IDLE call on each object is simpler to manage (even if I don't present it that way). The important thing is that it works.

Download

The script is quite long, so grab it from the Gist. It has a few dependencies, best installed with pip:

$ pip install imaplib2 psutil
$ ./getmail-idler.py -h
usage: getmail-idler.py [-h] [-r GETMAILRC] [--pid-file PIDFILE] [--verbose]
                        [--daemonize] [--logfile LOGFILE]

optional arguments:
  -h, --help            show this help message and exit
  -r GETMAILRC          getmail configuration file to use (can specify more
                        then once)
  --pid-file PIDFILE, -p PIDFILE
                        pidfile to use for process limiting
  --verbose, -v         set output verbosity
  --daemonize           should process daemonize?
  --logfile LOGFILE     file to redirect log output too (useful for daemon
                        mode)

It uses a comprehensive argparse interface; the most important parameter is -r. This works exactly like getmail's -r option and accepts files in the same format - it doesn't search all the same locations, though it will search $HOME/.getmail/.

Currently it only handles IMAPSSL, which you should be using anyway. It should be easy to hack it to support plain IMAP; I just have no incentive to at the moment.

Currently I use this with a cronjob set to every minute or 5 minutes - with verbose logging (-vv) it won't produce output until it forks into a daemon. This means if it crashes (and I've tried to make it crash reliably) cron will restart it on the next round, and it'll email a tracelog (hopefully).

My current crontab using this script:

* * * * * /home/will/bin/getmail-idler.py -r config1.getmailrc -r config2.getmailrc -r config3.getmailrc -r config4.getmailrc -r config5.getmailrc --pid-file /tmp/will-getmail-idler.pid --logfile .getmail-idler.log -vv --daemonize

Personal thoughts

I'm pretty pleased with how this turned out (edit: see the updates section at the top on how that's changed - I'm happy with the script, less happy with imaplib2), since it was a great exercise in learning some new things about Python. That said, compared to something like NodeJS, I feel that with the right library this would've been faster to write in a language with great eventing support, rather than Python's weird middle ground of "not quite parallel" threads. But I keep coming back to the language, and the demo code I started from was Python, so it must be doing something right.

I'll probably keep refining this if I run into problems - though if it doesn't actually stop working, I'll leave it alone. The biggest downside of the whole self-hosted email thing is when your listener dies and you stop getting email - that's the problem I've really tried to solve here: IDLE push email functionality, and highly visible notifications when something is wrong.

Setting up sshttp

When I was travelling around Europe I found some surprisingly restrictive wi-fi hotspots in hotels. This was annoying because I use SSH to upload photos back home from my phone, but having not set up any tunnelling helpers I just had to wait till I found a better hotspot.

There are a number of solutions to SSH tunneling, but the main thing I wanted to do was implement something which would let me run several fallbacks at once. Enter sshttp.

sshttp is related to sslh, in the sense that they are both SSH connection multiplexers. The idea is that you point a web browser at port 80 and you get a web page; you point your SSH client at it and you get an SSH connection. Naive firewalls let the SSH traffic through without complaint.

The benefit of sshttp over sslh is that it uses Linux's IP_TRANSPARENT flag, which means that your SSH and HTTP logs all show proper source IPs, which is great for auditing and security.

This is a post about how I set it up for my specific server; the instructions I used as a guide were adapted from here.

Components

My home server hosts a number of daemons, most notably a large number of nginx name-based virtual hosts for things on my network. I specifically don't want nginx serving most of these pages to the web.

The idea is that sshttp is my first firewall punching fallback, and then I can install some sneakier options on the web-side of sshttp (topic for a future blog). I also wanted sshttp to be nicely integrated with upstart in case I wanted to add more daemons/redirects in the future.

Installing sshttp

There's no deb package available, so installation is from GitHub, and then I copy the binary manually to /usr/local/sbin:

$ git clone https://github.com/stealth/sshttp
$ cd sshttp
$ make
$ sudo cp sshttpd /usr/local/sbin

Upstart Script

I settled on the following upstart script for sshttp (adapted from my favorite nodeJS launching script):

# sshttpd launcher
# note: this at minimum needs an iptables configuration which allows the
# outside ports you're requesting through.

description "sshttpd server upstart script"
author "will rouesnel"

start on (local-filesystems and net-device-up)
stop on shutdown

instance "sshttpd - $NAME"
expect daemon

#respawn
#respawn limit 5 60

pre-start script
    # Check script exists
    if [ ! -e /etc/sshttp.d/$NAME.conf ]; then
        return 1
    fi
    . /etc/sshttp.d/$NAME.conf

    # Clear up any old rules this instance may have left around from an
    # unclean shutdown
    iptables -t mangle -D OUTPUT -p tcp --sport ${SSH_PORT} -j sshttpd-$NAME || true
    iptables -t mangle -D OUTPUT -p tcp --sport ${HTTP_PORT} -j sshttpd-$NAME || true
    iptables -t mangle -D PREROUTING -p tcp --sport ${SSH_PORT} -m socket -j sshttpd-$NAME || true
    iptables -t mangle -D PREROUTING -p tcp --sport ${HTTP_PORT} -m socket -j sshttpd-$NAME || true

    iptables -t mangle -F sshttpd-$NAME || true
    iptables -X sshttpd-$NAME || true

    # Add routing rules
    if ! ip rule show | grep -q "lookup ${TABLE}"; then
        ip rule add fwmark ${MARK} lookup ${TABLE}
    fi

    if ! ip route show table ${TABLE} | grep -q "default"; then
        ip route add local 0.0.0.0/0 dev lo table ${TABLE}
    fi

    # Add iptables mangle rule chain for this instance
    iptables -t mangle -N sshttpd-$NAME || true
    iptables -t mangle -A sshttpd-$NAME -j MARK --set-mark ${MARK}
    iptables -t mangle -A sshttpd-$NAME -j ACCEPT

    # Add the output and prerouting rules
    iptables -t mangle -A OUTPUT -p tcp --sport ${SSH_PORT} -j sshttpd-$NAME
    iptables -t mangle -A OUTPUT -p tcp --sport ${HTTP_PORT} -j sshttpd-$NAME
    iptables -t mangle -A PREROUTING -p tcp --sport ${SSH_PORT} -m socket -j sshttpd-$NAME
    iptables -t mangle -A PREROUTING -p tcp --sport ${HTTP_PORT} -m socket -j sshttpd-$NAME
end script

# the daemon
script
    . /etc/sshttp.d/$NAME.conf

    /usr/local/sbin/sshttpd -n 1 -S ${SSH_PORT} -H ${HTTP_PORT} -L${LISTEN_PORT} -U nobody -R /var/empty >> ${LOG_PATH} 2>&1
end script

post-stop script
    . /etc/sshttp.d/$NAME.conf

    # Try and leave a clean environment
    iptables -t mangle -D OUTPUT -p tcp --sport ${SSH_PORT} -j sshttpd-$NAME || true
    iptables -t mangle -D OUTPUT -p tcp --sport ${HTTP_PORT} -j sshttpd-$NAME || true
    iptables -t mangle -D PREROUTING -p tcp --sport ${SSH_PORT} -m socket -j sshttpd-$NAME || true
    iptables -t mangle -D PREROUTING -p tcp --sport ${HTTP_PORT} -m socket -j sshttpd-$NAME || true

    iptables -t mangle -F sshttpd-$NAME || true
    iptables -X sshttpd-$NAME || true

    # Remove routing rules
    if ip rule show | grep -q "lookup ${TABLE}"; then
        ip rule del fwmark ${MARK} lookup ${TABLE}
    fi
    if ip route show table ${TABLE} | grep -q "default"; then
        ip route del local 0.0.0.0/0 dev lo table ${TABLE}
    fi

    # Let sysadmin know we went down for some reason.
    cat ${LOG_PATH} | mail -s "sshttpd - $NAME process killed." root
end script

This script nicely sets up and tears down the bits of iptables mangling and routing infrastructure needed for sshttp, and neatly creates chains for different sshttp instances based on configuration files. It'll only launch a single instance, so launching them all on boot is handled by this upstart script.

To use this script, you need an /etc/sshttp.d directory:

$ sudo mkdir /etc/sshttp.d

and a configuration file like the following, with a *.conf extension:

$ cat /etc/sshttp.d/http.conf
SSH_PORT=22022
HTTP_PORT=20080
LISTEN_PORT=20081
MARK=1
TABLE=22080
LOG_PATH=/var/log/sshttp.log

LISTEN_PORT is your sshttp port. It's 20081 because we're going to use iptables to forward port 80 to 20081 (to accommodate nginx - more on this later). SSH_PORT is an extra SSH port for openssh - so we have both 22 and 22022 open as SSH ports, since 22022 can't be publicly accessible (and we'd like 22 to be publicly accessible).

HTTP_PORT is the port your web server is listening on, for the same reasons as SSH_PORT. MARK is the connection mark the daemon looks for - it has to be unique for each instance. TABLE is the routing table used for the lookups - I think. The value can be anything - I think.

LOG_PATH I currently set to the same value for each host for simplicity - sshttp doesn't really log anything too useful (and one of its features is that you don't need its logs anyway).
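
With a config file in place, and assuming the job above is saved as /etc/init/sshttpd.conf, an instance is started and stopped by name:

$ sudo start sshttpd NAME=http
$ sudo stop sshttpd NAME=http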

IP Tables configuration

In addition to the sshttpd upstart scripts, some general iptables configuration was needed for my specific server.

To make configuring nginx simpler for my local network, iptables is set to redirect all traffic coming into the server on the ppp0 interface (my DSL line) on ports 80 and 443 to the listen ports specified for sshttpd for each instance (so port 80 goes to port 20081 in the above).

This means I can happily keep setting up private internal servers on port 80 on my server without futzing around with bind IPs, and instead can put anything I want to serve externally onto port 20080, as per the configuration file above.

I like Firewall Builder at the moment for my configuration (although I'm thinking a shell script might be better practice in future).

The relevant iptables line, if you were doing it manually, would be something like:

iptables -t nat -A PREROUTING -i ppp0 -p tcp -m tcp -d <your external ip> --dport 80 -j REDIRECT --to-ports 20081

Configuring iptables like this is covered fantastically elsewhere so I won't go into it here.

But with this redirect, externally port 80 now goes to sshttp, which then redirects it to either SSH or to the specific external application I want to serve over HTTP on port 20080.

Conclusion

At the moment, with this setup, I just have nginx serving 404s back on my sshttp server ports. But the real benefit is that I can turn those into secure proxies to use with something like corkscrew or proxytunnel.

Or I can go further - use httptunnel to tunnel TCP over GET and POST requests (my actual intent) on the public-facing HTTP ports. Or do both - each method has its trade-offs, so we can just step down the list till we find one which works!

Upstart script not recognized

I frequently find myself writing upstart scripts which check out OK, but for some reason don't get detected by the upstart daemon in the init directory, so when I run start myscript I get unknown job back. Some experimentation seems to indicate that the problem is that I used gedit over GVFS SFTP to author a lot of these scripts.

For something like myscript.conf, I find the following fixes this problem:

mv myscript.conf myscript.conf.d
mv myscript.conf.d myscript.conf

And then hey presto, the script works perfectly.

Along the same lines, the init-checkconf utility isn't mentioned enough for upstart debugging - my last post shows I clearly didn't know about it. Using it is simple:

$ init-checkconf /etc/init/myscript.conf

Note it needs to be run as a regular user. I'm often logged in as root, so sudo suffices:

$ sudo -u nobody init-checkconf /etc/init/myscript.conf

Wintersmithing

Wintersmith

How to set up and use Wintersmith is covered pretty thoroughly elsewhere on the net (namely the Wintersmith homepage).

Instead I'll cover a few tweaks I had to do to get it running the way I wanted. To avoid confusion, all the paths referenced here are relative to the site you create by running wintersmith new <your site dir here>.
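
For completeness, the initial setup boils down to a few commands (covered properly in the Wintersmith docs; the site name is just a placeholder):

$ sudo npm install -g wintersmith
$ wintersmith new myblog
$ cd myblog
$ wintersmith preview    # live preview on http://localhost:8080
$ wintersmith build      # render static output into build/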

LiveReload plugin

There's a Wintersmith LiveReload plugin available which makes previewing your site with wintersmith preview very easy - it's great for editing or setting up CSS.

Installing the LiveReload plugin on Linux Mint (which I run) can be done with sudo npm install -g wintersmith-livereload

You then need to add it to your config.json file under "plugins", e.g. for this blog:

{
  "locals": {
    "url": "http://localhost:8080",
    "name": "wrouesnel_blog",
    "owner": "Will Rouesnel",
    "description": "negating information entropy"
  },
  "plugins": [
    "./plugins/paginator.coffee",
    "wintersmith-stylus",
    "wintersmith-livereload"
  ],
  "require": {
    "moment": "moment",
    "_": "underscore",
    "typogr": "typogr"
  },
  "jade": {
    "pretty": true
  },
  "markdown": {
    "smartLists": true,
    "smartypants": true
  },
  "paginator": {
    "perPage": 20
  }
}

You then want to insert the line:

!{ env.helpers.livereload() }

into the templates/layout.jade file - giving you something like the following at the top of the file

!!! 5
block vars
  - var bodyclass = null;
html(lang='en')
  head
    block head
      meta(charset='utf-8')
      meta(http-equiv='X-UA-Compatible', content='IE=edge,chrome=1')
      meta(name='viewport', content='width=device-width')
      !{ env.helpers.livereload() }
      script(type='text/javascript').

I also add a script section with Google Analytics to layout.jade because I'm vain like that:

!!! 5
block vars
  - var bodyclass = null;
html(lang='en')
  head
    block head
      meta(charset='utf-8')
      meta(http-equiv='X-UA-Compatible', content='IE=edge,chrome=1')
      meta(name='viewport', content='width=device-width')
      !{ env.helpers.livereload() }
      script(type='text/javascript').
        // google analytics
        (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
        (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
        m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
        })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

        ga('create', 'UA-43235370-1', 'wrouesnel.github.io');
        ga('send', 'pageview');

The glitch: running wintersmith preview with these changes, you'll find that browsing to the main index.html page fails with something like "env is undefined". This is a glitch in the paginator plugin which has been fixed upstream, but not in wintersmith@2.0.5 on npm.

To fix it I just copied the commit patch manually into my local copy of plugins/paginator.coffee:

diff --git a/examples/blog/plugins/paginator.coffee b/examples/blog/plugins/paginator.coffee
index b8e9032..19098d5 100644
--- a/examples/blog/plugins/paginator.coffee
+++ b/examples/blog/plugins/paginator.coffee
@@ -43,7 +43,7 @@ module.exports = (env, callback) ->
         return callback new Error "unknown paginator template '#{ options.template }'"

       # setup the template context
-      ctx = {contents, @articles, @prevPage, @nextPage}
+      ctx = {env, contents, @articles, @prevPage, @nextPage}

       # extend the template context with the enviroment locals
       env.utils.extend ctx, locals

Show most recent article in the index

By default Wintersmith shows short summaries of articles on your index.html page. I can't decide whether or not I like this behavior yet, but until I do, what I wanted was to always have the index show my most recent post in full.

To do this, we take advantage of Jade's iteration and if/else functionality to modify templates/index.jade.

As of this article my index.jade looks as follows:

extends layout

block content
  include author
  each article, num in articles
    if num === 0
      // First article - render in full
      article.article.intro
      header
        h1.indexfullarticle
          a(href=article.url)= article.title
        div.date
          span= moment(article.date).format('DD. MMMM YYYY')
        p.author
            mixin author(article.metadata.author)
      section.content!= typogr(article.html).typogrify()
    else
      article.article.intro
      header
        h2
          a(href=article.url)= article.title
        div.date
          span= moment(article.date).format('DD. MMMM YYYY')
        p.author
            mixin author(article.metadata.author)
      section.content
        !{ typogr(article.intro).typogrify() }
        if article.hasMore
          p.more
            a(href=article.url) more

block prepend footer
  div.nav
    if prevPage
      a(href=prevPage.url) « Newer
    else
      a(href='/archive.html') « Archives
    if nextPage
      a(href=nextPage.url) Next page »

There might be a better way to do this, but for me, for now, it works. Basically Jade's iterators will provide an iteration number if you add a variable name for it (num in this case), and the articles are in reverse-chronological order by default, so index 0 is always the most recent.

From there I just duplicate some code from templates/article.jade to have it render the full article in section.content - which is article.html - rather than just the intro section, which is article.intro.

An important note here is that the default CSS selectors require some modification to get things to look right. I'm not sure I've nailed it yet, so editing those is an exercise left to the reader (or just a matter of downloading the stylesheet from this site).

Deploy Makefile

This site is hosted on GitHub Pages, which has no support for Wintersmith - so it's necessary to build the static content locally and upload that. make is more than capable of handling this task, and while we're at it, it's a decent tool for automating housekeeping - in particular, I wanted my article metadata to be automatically tagged with a date if the date field was blank.

Automatic date tagging

After banging my head against awk and sed one-liners (it can probably be done that way) I came to my senses and wrote a bash script to do this for me:

#!/bin/bash

find contents -name '*.md' | while read markdownfile; do
    datemeta=$(grep -m1 date: "$markdownfile")
    datestamp=$(grep -m1 date: "$markdownfile" | cut -d' ' -f2)

    if [ ! -z "$datemeta" ]; then
        if [ -z "$datestamp" ]; then
            # generate a datestamp entry and replace the field with sed
            echo "Date stamping unstamped article $markdownfile"
            datestamp=$(date '+%Y-%m-%d %H:%M GMT%z')
            sed -i "s/date:\ .*/date: $datestamp/" "$markdownfile"
        fi
    fi
done

Git Submodules

Since I use Git to manage the blog, but GitHub Pages uses a git repo to represent the finished blog, it's necessary on my local machine to somehow have two repositories - one representing the Wintersmith site in source form, and one representing the GitHub Pages site after it's rendered.

I do this by treating the build/ directory of my Wintersmith site as a Git submodule. Git won't check out an empty repo, so you need to create a real repo somewhere and then push it to your normal storage (in my case my private server, but it could be somewhere else on GitHub):

$ mkdir build
$ cd build
$ git init
$ git remote add origin ssh://will@myserver/~/wrouesnel.github.io~build.git
$ touch .gitignore
$ git add *
$ git commit
$ git push origin master

At this point you can delete the build/ directory you just created - it's not needed any more. The repo can then be imported as a submodule of the main Wintersmith repo. We also need to add a remote for pushing the rendered output to GitHub:

$ cd your_wintersmith_repo
$ git submodule add ssh://will@myserver/~/wrouesnel.github.io~build.git build
$ cd build
$ git remote add github git@github.com:wrouesnel/wrouesnel.github.io.git

And after all that effort your submodule is imported and ready to participate in the build process.

Putting the makefile together

The final makefile looks something like this:

# Makefile to deploy the blog

# Search article markdown for "date" metadata that is unset and set it.
date: 
    ./add-date-stamps.bsh

# Draft's are pushed to my private server
draft: date
    wintersmith build
    cd build; git add *; git commit -m "draft" ; \
    git push origin

# Publish makes a draft, but then pushes to GitHub.
publish: draft
    cd build; git commit -m "published to github"; \
    git push github master

.PHONY: date draft publish

The workflow is that I call make draft, which builds the site and commits the build to my private repo (which just tracks drafts), and make publish when I want things to go live.

There are obviously other ways this could work - for example, I could use post-commit hooks on the server to push to GitHub Pages - but the idea here is that provided I can access the Wintersmith repository, everything else can be rebuilt.

Personal thoughts

I've been meaning to blog for some time, to have somewhere to put the things I do or random bits of knowledge I pick up so they might help someone else, but for one reason or another most blogging engines never did it for me.

I've never been much of a fan of managed services - they lead to sprawling personal "infrastructure" - and I'll be happy when my entire digital life can be backed up just by making a copy of my home directory.

So for blogging, I've not much cared for the services out there or their focus. I don't particularly want to manage a heavyweight WordPress or other CMS installation on a web server just for a personal blog, since that requires a lot of careful attention to security, patching and updates, and I simply don't need the features.

At the same time, services like Tumblr never quite seemed right for me - they skirt the line between microblogging and blogging, and the relationship with Markdown and code didn't gel for me. A deluge of social networking features is also not what I wanted.

With GitHub Pages offering free static site hosting, I initially looked at Jekyll as a static site generator for putting something together. But Jekyll is written in Ruby, and at the moment I'm on a Node.js kick, so I really wanted something in that direction. Hence Wintersmith - simple, easy to use, written in something I'm inclined to hack on, and with enough features out of the box (code highlighting in particular) to not feel onerous.

So far I'm really liking the static site model - it's simple, secure and easy to store, manage and keep in a nice neat git repository. Guess I'll see how it goes.