Archive for January, 2008

Regular automated backup of WordPress blog

Monday, January 14th, 2008

For my old uboot blog I had a very simple backup strategy.

The whole blog was available as an RSS feed. (I.e. not just the latest entries: the feed included every article right back to the beginning.) I set up a crontab job on a friendly Linux machine to download that with “wget”, then commit the resulting file to my private Subversion repository.

A similar technique is available with WordPress. It has an “export” feature which allows you to download the entire blog in an “extended” RSS format, including comments etc.

You have to be logged in to the WordPress admin system to use the export. However “wget” takes the “--load-cookies” parameter, which takes a “cookies.txt” file. The documentation assumes you’ll be running your browser on the same machine, so it expects you’ll have a “cookies.txt” file available. I didn’t, but it was a small matter to take the “cookies.txt” file from my Windows machine (where Mozilla correctly stores the file under the “Documents and Settings” Windows directory), find the two lines for the cookies “wordpressuser_nnn” and “wordpresspass_nnn” (the “nnn” is a long hex string) and strip out the rest. I copied the result across to the Linux machine with “scp” and “wget” accepted the file fine.
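The trimming of “cookies.txt” could also be scripted rather than done by hand. The following is only a sketch, assuming the file uses the usual Netscape cookies.txt layout (tab-separated fields with the cookie name in the sixth field); the function name filter_wp_cookies and the file names in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Sketch: keep only the WordPress login cookies from a browser's cookies.txt.
# Assumption: Netscape cookies.txt layout, i.e. tab-separated fields with
# the cookie name in field 6. filter_wp_cookies is a hypothetical helper.
filter_wp_cookies() {
    awk -F '\t' '
        /^#/ { print; next }                      # keep header/comment lines
        $6 ~ /^wordpress(user|pass)_/ { print }   # keep the two login cookies
    ' "$1"
}

# Usage (file names are placeholders):
#   filter_wp_cookies full-cookies.txt > cookies.txt
```

Keeping the comment lines preserves whatever header the browser wrote at the top of the file; everything else that is not one of the two login cookies is dropped.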

A small change to my script now downloads this WordPress export, and commits it into my Subversion repository as before.

There was one slight inelegance. Subversion only creates a new revision when the files being committed actually have different content, and I like this feature. This worked fine with the uboot RSS feed, but the WordPress export includes a comment “generated on <date/time to minute accuracy>”, so every night when the script ran a new revision would be created. No matter: I added a line using “sed” to strip that out of the comment, committed it, then did an “svn diff” to check that really only that line had changed.

So now my script looks like:

#!/usr/local/bin/bash

# in crontab:
# 0 4 * * * ~/private-svn/blog/backup/download-backup.sh

cd ~/private-svn/blog/backup
export SITE=http://www.databasesandlife.com/
export FILE=databasesandlife.wordpress.xml

wget --quiet \
  --output-document "$FILE.tmp" --load-cookies cookies.txt \
  "$SITE/wp-admin/export.php?author=all&download=true&submit=xx"

# the created="xx" attribute in a comment causes each download
# to be a new commit; yet i only want actual changes to show
# up in the subversion revision history
sed 's/created="....-..-.. ..:.."-->/-->/' < "$FILE.tmp" > "$FILE"

svn commit -m '* Automatic blog download from crontab' $FILE
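
To see what the “sed” step does, here is a small sketch that runs the same substitution (with the comment terminator written in plain ASCII as `-->`) over two sample comment lines. The sample lines are made up: the generator="..." attribute is an assumption for illustration, and only the created="YYYY-MM-DD HH:MM"--> part follows from the pattern itself.

```shell
#!/bin/sh
# Sketch: the timestamp-stripping step applied to two sample comment lines.
# Assumption: the generator="..." attribute is illustrative only; the
# created="YYYY-MM-DD HH:MM"--> part is what the sed pattern expects.
strip_created() {
    sed 's/created="....-..-.. ..:.."-->/-->/'
}

monday='<!-- generator="wordpress" created="2008-01-14 04:00"-->'
tuesday='<!-- generator="wordpress" created="2008-01-15 04:00"-->'

a=$(printf '%s\n' "$monday" | strip_created)
b=$(printf '%s\n' "$tuesday" | strip_created)

# After stripping, both nights produce the same text, so a file that
# differs only in this line gives Subversion nothing new to commit.
if [ "$a" = "$b" ]; then
    echo "identical after stripping"
fi
```

Each "." in the pattern matches any single character, so the expression depends only on the shape of the timestamp and not on its value.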

The cycle of programming languages

Thursday, January 10th, 2008

The following cycle never ceases to amaze me:

  1. People learning programming find “real” languages such as C++ or Java filled with too many “complex” constructs.
  2. They find or invent languages such as JavaScript or PHP or BASIC and think they can get the job done without “unnecessary complexity”.
  3. As these programmers develop, they write increasingly complex programs, and find that constructs such as classes, inheritance, exceptions, generics/templates, errors upon encountering undefined variables, and static typing help them debug their code and write better code quicker.
  4. They then add these features to their programming languages and everyone rejoices believing they’ve done something new and great.
  5. Other programmers - just starting out - find the current set of languages to be too complex as they contain features they don’t understand they need, such as classes, inheritance, exceptions, etc.: go to step 1.

I mean PHP5 includes features such as classes, exceptions, and “phpdoc”, similar to Java. When displaying an uncaught exception, the $ex->__toString() method even returns a stack backtrace just like Java. (But global errors, which are different from exceptions as they were invented before PHP5, do not.)

And now Axel has blogged enthusiastically about the improvements to JavaScript in the next version. I agree that these are great improvements, but believe it is incorrect to applaud JavaScript for these improvements. These are simply useful features which exist in other languages; one can applaud JavaScript for realizing they are useful, however at the same time one must observe that its designers did not realize they were useful when designing previous versions.

I also started programming using BASIC. It did not have advanced constructs such as abstract classes and exception handling. I did not know I needed them when I started programming. So I can certainly sympathize with people at step #1 above.

But it is incorrect to treat languages such as (the original versions of) JavaScript or PHP or BASIC as anything other than beginners’ languages, useful as a stepping stone in the process of learning to program. If you want a programming language for writing expressive and maintainable software, it would seem less effort just to use existing languages which already have the necessary constructs for doing so, rather than extending beginners’ languages with constructs identical to those of these existing languages.

New Year, New Blog

Tuesday, January 8th, 2008

I shall be blogging here henceforth. I have moved all my old articles over from my previous uboot.com blog.

There were several reasons for the move:

  • It’s important to use the software you write, to experience its successes and limitations. I am a contractor for uboot and have been using their blog; however I am also a contractor for easyname and in December we took online the hosting features we’d been developing in 2006. I’m glad to report they work just great!
  • I wanted more control over the design. The text was small at the uboot blog and didn’t invite reading.
  • I have discovered hierarchical categories! So one can view just my software design posts for example. They are cool. Did uboot have them? If so, I never found them.
  • I wanted a facility for seeing new comments without checking all posts over all pages, and comparing the current number of comments on that post with my memory of the previous number of comments.
  • Trackback wasn’t fully working with uboot. Although I suppose in the time it took me to set up my own blog, I could have repaired the uboot feature!

It wasn’t as easy as I had hoped to set up this blog. I imagined just FTPing over a WordPress installation and that would be it. While I don’t want to sound ungrateful to the open-source programmers who made both the blogging software and the hosting software possible, there are a sufficient number of small problems - both in implementation and in architecture - with the internet, web servers, web protocols, and in every piece of software, as to make the seemingly simple process of installing some blogging software annoying and time-consuming. I am writing this at the end of the second day of full-time work creating this blog.

  • My plan to import my own data was to import the RSS feed from the old blog. However, that RSS import software had three separate bugs. I have fixed these problems now in the source, and will submit the fixes in due course.
    • The RSS importer didn’t use an XML parser, but instead regular expressions. Thus it required an <item> tag to look exactly like “<item>”; whereas the uboot RSS feed includes attributes on the item tag, i.e. <item x="y">. So it didn’t match and simply reported that it had “successfully” imported 0 posts.
    • Newlines in the HTML content were turned into <br> tags. My HTML content had a lot of newlines (that’s how the gmail WYSIWYG editor produces the content). These are ignored by the browser, so shouldn’t be turned into <br>s which are not ignored by the browser. I solved the problem by replacing all newlines in the HTML with spaces, before the <br> conversion happened.
    • HTML escaping was being performed needlessly on the article titles. The titles were already in HTML in the RSS file. So “&quot;” text was introduced into the user-visible titles. I know the RSS feed is correct in this regard as it renders correctly on Google Reader, Bloglines, etc. I have removed this conversion.
  • Alas I lacked sufficient knowledge of CSS so getting the style correct was a great pain. Yet I didn’t have particularly ambitious style requirements, as any viewer of the new blog can confirm.
    • One problem that took ages was the removal of some two-coloured vertical lines. Were they images? Clever CSS borders? I couldn’t find the border commands in the CSS file but also couldn’t find any image commands. Nor were they referenced from the HTML file. Finally I checked the images directory and found an image; then full-text searched all files. I found the image referenced in a <style> element in the <head> of one of the HTML files.
    • Various IE7 problems. I even had to insert an “if browser=IE” JavaScript check in one place.
  • All embedded image references and intra-blog links had to be changed. (I couldn’t even just leave them pointing to the old blog, as they were relative links, i.e. <img src="/x/y.jpg">, so didn’t work after the new content had been imported.)
  • I created a “.htaccess” file to password-protect the website while it was under development. Later I deleted the file. However WordPress had written some rules into the file (without it being obvious to me) so that URLs like “/post-name” would be mapped to the correct PHP files. So after I deleted the “.htaccess” file to give everyone access, the blog no longer worked. (It took me some time before I discovered that, as the URL “/” still worked; so it was not obvious which action had led to the pages stopping working.)
    • Let us not forget that the syntax of “.htaccess” and “.htpasswd” files is far from obvious in the first place. (But thankfully my hoster has a tool to write these files - actually I wrote that piece of software!)
  • I tested the RSS feeds from the new blog in Google Reader just at the moment when the .htaccess file was broken. Thus, Google Reader cached a non-working version of the page with 0 posts. And as Google Reader shares that cache between its users, I knew that anyone trying to subscribe to the feed would see the same thing. It’s fixed now though (by time).
  • I’m sure there were more problems but I can’t remember. I should have written them down as I was working; after all, the probability of me not writing a blog entry about the difficulties of installing the new blog software was clearly not particularly high.
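
The first importer bug in the list above (matching “<item>” literally) is easy to reproduce. This is only a sketch with a made-up two-item feed, not the importer’s actual code; the tolerant pattern merely shows what the literal match misses, and a real XML parser would still be the proper fix.

```shell
#!/bin/sh
# Sketch: why a literal "<item>" match finds 0 posts in a feed whose
# item tags carry attributes. The sample feed text is made up.
feed='<channel>
<item x="y"><title>First</title></item>
<item x="y"><title>Second</title></item>
</channel>'

# Literal match: no line contains exactly "<item>", so the count is 0.
# ("|| true" because grep exits non-zero when nothing matches.)
strict=$(printf '%s\n' "$feed" | grep -c '<item>' || true)

# Tolerant pattern: "<item" followed by ">" directly, or by whitespace
# and attributes and then ">".
tolerant=$(printf '%s\n' "$feed" | grep -Ec '<item([[:space:]][^>]*)?>' || true)

echo "strict=$strict tolerant=$tolerant"
```

Note that “grep -c” counts matching lines, which only equals the number of items here because the sample puts one item on each line.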

So essentially, had I not been deeply familiar with PHP, HTML, JavaScript, .htaccess files, FTP, XML and (to an extent) CSS, I would not have made it. This is not something I would recommend for “Aunt Tillie”.

Anyway, now it’s done, and as I now maintain this software in contrast to before, I look forward to also having to fix it when it breaks randomly in the future (as inevitably software always does).