Store data files in one big directory

A server filesystem can store millions of files in one directory. Unless you need to store more files than that, there is no need to introduce an artificial extra layer of directories just to keep the number of files per directory down. We kept millions of files in one directory at uboot.com.

Often one needs to store lots of files in a filesystem. For example, at uboot.com each user has a "nickpage", and each of these nickpages requires a file, and there are about 4M users.

It is tempting to use a hierarchical directory structure for this. For example, take the last two digits of the nickpage-ID and create a directory for each value. Then, in each of these directories, create directories for the 3rd/4th last digits of the nickpage-ID, and store the nickpage file in there. The 4,000,001st nickpage might then be stored in a file like /var/nickpages/01/00/4000001.xml.
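To make the scheme concrete, here is a minimal Java sketch of the path computation described above. The class and method names are hypothetical, and the root /var/nickpages is simply taken from the example path; this is an illustration, not the code any real site used.

```java
import java.io.File;
import java.util.Locale;

class HierarchicalNickpagePath {
    // Assumed root directory, copied from the example path in the text.
    static final File ROOT = new File("/var/nickpages");

    // Builds ROOT/<last two digits>/<3rd and 4th last digits>/<id>.xml
    static File fileFor(long nickpageId) {
        if (nickpageId < 0) {
            throw new IllegalArgumentException("nickpage ID must not be negative");
        }
        // Last two digits of the ID, zero-padded to two characters.
        String lastTwo = String.format(Locale.ROOT, "%02d", nickpageId % 100);
        // 3rd and 4th last digits of the ID, zero-padded to two characters.
        String nextTwo = String.format(Locale.ROOT, "%02d", (nickpageId / 100) % 100);
        File level1 = new File(ROOT, lastTwo);
        File level2 = new File(level1, nextTwo);
        return new File(level2, nickpageId + ".xml");
    }

    public static void main(String[] args) {
        // On a Unix-like system this prints /var/nickpages/01/00/4000001.xml
        System.out.println(fileFor(4000001L));
    }
}
```

The arithmetic is small, but every program, script and person that touches the files has to repeat it.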

But this decision, seemingly so obvious, contains an implicit assumption, which is wrong. It assumes one cannot store more than a few hundred files in a directory.

In 2000 a consultant for Tru64 UNIX told us that one should store no more than one million files in a directory. We stored (I think) about 0.5M files per directory, and it worked fine. We had over 1M page impressions per day. Modern-day hardware (CPUs, disks) is much faster.

I think one needs to do the following things with a directory containing data files:

Having worked with directory structures both in terms of programming and in terms of operations and live bug-fixing, I can say that it really is simpler to have simple directory structures, and that they really do work in production. Being able to type vi followed by the ID, or to write new File(directory, id+".xml"), is simpler than working out a nested path first.
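For comparison, a flat layout needs no path arithmetic at all. The following is a minimal sketch; the class name, method name and the /var/nickpages directory are assumptions for illustration.

```java
import java.io.File;

class FlatNickpagePath {
    // Every nickpage lives directly in the one data directory.
    static File fileFor(File directory, long nickpageId) {
        return new File(directory, nickpageId + ".xml");
    }

    public static void main(String[] args) {
        // Assumed data directory, used here only as an example.
        File directory = new File("/var/nickpages");
        // On a Unix-like system this prints /var/nickpages/4000001.xml
        System.out.println(fileFor(directory, 4000001L));
    }
}
```

The same knowledge (directory plus ID) is all an operator needs on the command line as well.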

Using intermediate directories is really just doing in programming what the OS does for you anyway.

P.S. I like padding IDs with zeros, for example 0004000001.xml. This means that files are always listed in numerical order if you sort them alphabetically. That said, I would argue that listing the files is something one rarely wants to do: it takes a long time if you store files flat, and isn't possible at all if you store files hierarchically.
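A minimal sketch of the padding idea, assuming a fixed width of 10 digits (inferred from the example name 0004000001.xml) and hypothetical class and method names:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class PaddedNickpageNames {
    // Zero-pads the ID to 10 digits; the width is an assumption taken
    // from the example file name in the text.
    static String paddedName(long nickpageId) {
        return String.format(Locale.ROOT, "%010d.xml", nickpageId);
    }

    // Returns the padded names in plain alphabetical (lexicographic) order.
    static List<String> sortedPaddedNames(long... nickpageIds) {
        List<String> names = new ArrayList<>();
        for (long id : nickpageIds) {
            names.add(paddedName(id));
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Prints [0000000009.xml, 0000000010.xml, 0004000001.xml]
        System.out.println(sortedPaddedNames(10L, 9L, 4000001L));
    }
}
```

Without padding, the same alphabetical sort would put 10.xml before 9.xml, because the names are compared character by character.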


This article is © Adrian Smith.
It was originally published on 17 Aug 2010