Archive for the ‘Software Design’ Category

How many days until the end of the month?

Wednesday, July 27th, 2011

It’s a common enough problem, you want to find out if a date is in a particular month, or you want to know how many days there are left in the current month, and so on.

I’ve seen the following awkward solution plenty of times at different companies: To find out if a day is within a month, you see if the date in question is greater than the start of the month and less than the end of the month; this seems logical enough. To determine the number of days left, you subtract the date in question from the last date in the month; this also seems reasonable enough.

The trouble is it’s not particularly easy to find out the last date in a given month. I’ve seen at least the following solutions:

  • Using database functions to find the last day of the month
  • Having an array of 12 elements, maintaining the number of days in each month of the year
  • Simply using the date “YYYY-MM-31″.

Using database functions is the best, for sure, as they’ll do the calculations right, but it’s still a certain amount of work to connect to the database just for this purpose. Obviously there are libraries such as Java’s Calendar which will do the same work but again, it’s a certain amount of effort. The array isn’t good as you have to introduce further logic to handle leap years. The last approach is obviously ridiculous as there are plenty of months for which such a date is invalid. (But MySQL 4 allows such dates and MySQL 5 doesn’t; which as part of a database migration project has brought this point to my attention.)

However, there is a much simpler solution, that is to compare the date with the first day of the next month. Finding the next month is easy (there are always 12 months in the year).

Atomic operations over filesystem and database

Sunday, November 21st, 2010

I had the situation recently where I needed a “button”; when the user clicks the button, then:

  • An “svn up” should occur, to update some XML config files + data files (CSV)
  • If the config files or CSV files have errors, the operation should be aborted and an error printed
  • If correct, the CSV data should be imported into the database ready for processing
  • If correct, the config files should be made available for use

The config files are a bunch of XML files; each file has a complex structure, and they themselves are in a complex directory structure. One option would be to put them into a database as well but I felt that a simpler solution was just to leave them on the disk where they could be parsed. Directories, files and XML files more naturally fit the hierarchical nature of the configuration in question. I didn’t want to have multiple representations of this configuration (one being directories/XML; the other being a database).

This is all OK, but how to make all of that (config files update + database work) atomic? i.e.

  • If anything goes wrong, the entire operation fails and rolls back with an error message
  • If it succeeds then the entire operation succeeds
  • If the server crashes, then an update was either successful or rolled back (but nothing in between)

I came up with a nice solution:

  • A database transaction is started
  • The “svn up” process programmatically connects to the Subversion server and does a brand new checkout into a directory with a random name (i.e. actually “svn co” not “svn up”).
  • The XML files are parsed and checked; if there is an error, the operation is aborted
  • The CSV files are loaded into the database within the transaction, if there is an error, the operation is aborted (& rolled back)
  • The new random directory name is written into a table “current directory name”, also in the transaction
  • The transaction is committed
  • Each time access is needed to the XML files, a lookup is first done in the table to retrieve the directory name, the files are then read from this directory.

This has the consequence that:

  • Either the “commit” occurs and the new XML files are “live” (table points to new directory name) and the CSV data has been imported
  • Or a “rollback” (or crash and auto-rollback) occurs in which case the CSV import is rolled back, and the table points to the old version of the XML files which are still on the disk

I’m pleased with this solution :-)

One should really also write a clean-up script, deleting directories which (a) aren’t in the database and (b) were created a certain amount of time ago e.g. a few days. That way old directories, or directories checked out which had errors or failed, are removed, but directories which are in the process of being checked out are not removed.

P.S. Allowing the user to submit new versions of the data by checking it into Subversion is a nice idea:

  • It’s a bunch of structured data (bunch of XML and CSV files in a particular directory format)
  • This is easy to visualize with Windows Explorer and TortoiseSVN
  • This would not be easy or nice to do with multiple <input type=file> file upload buttons
  • The “svn co” command in the software is easy; there are various libraries for programmatic access to SVN: I used svnkit: I found it well documented, seemingly bug free, it just worked first time.
  • Obligatory link about why I didn’t use Git ;-) (the person involved can use TortoiseSVN but wouldn’t have been able to handle git)

Store data files in one big directory

Tuesday, August 17th, 2010

A server filesystem can store millions of files in one directory. Unless you need to store more files than this, there is no need to artificially introduce an extra layer of directories to keep the number of files per directory down to a lower number.

Often one needs to store lots of files in a filesystem. For example, at Uboot, each user has a “nickpage”, and each of these nickpages requires a file, and there are about 4M users.

It is tempting to use a hierarchical directory structure for this, for example take the last two digits of the nickpage-ID and create directories for each value, then in each of these directories create directories for the 3rd/4th last digits of the nickpage-ID, and in that store the nickpage file, for example the 4,000,001st nickpage might be stored in a file like /var/nickpages/01/00/4000001.xml.

But this decision, seemingly so obvious, contains an implicit assumption, which is wrong. It assumes one cannot store more than a few hundred files in a directory.

In 2000 a consultant for tru64 UNIX told us that one should store no more than one million files in a directory. We stored (I think) about 0.5M files per directory, and it worked fine. We had over 1M page impressions per day. Modern day hardware is much faster (CPUs, disks).

  • tru64 used a tree structure for the relationship between filenames and the information about the file, similar to a normal index in a database table. (And database tables clearly support having more than a few hundred rows in a table!)
  • Solaris ZFS uses a hash structure to identify files. This is even quicker in some respects (as the tree doesn’t have to be traversed) and it requires fewer locks (as operations on two files do not have any common tree parent nodes) although no doubt re-hashing will have to be done if the directory expands beyond a certain limit (in contrast to the tree-approach).

I think one needs to do the following things with a directory containing data files:

  • Find a particular file, in which case it’s easier in terms of programming, and faster for the OS to deal with a flat structure, than with a structure with lots of intermediate directories.
  • Do an operation on all files, for example search for particular content within all files. In that case the time is spent looking into the files, the structure of the files in the directory does not matter.

Having worked with directory structures both in terms of programming and in terms of operations and live bug-fixing, I can say that it really is simpler to have simple directory structures, and really does work in production. Being able to vi <id> or new File(directory, id+”.xml”) is simpler.

Using intermediate directories is really just doing in programming what the OS does for you anyway.

P.S. I like padding IDs with zeros for example “0004000001.xml”, this means that files are always listed in numerical order if you sort them alphabetically. Although I assert this is something one rarely wants to do – takes a long time if you store files flat, and isn’t possible at all if you store files hierarchically.

Multi-core RISC OS proposal

Friday, June 11th, 2010

I was reading an article about the RISC OS operating system and the fact that it doesn’t support multiple CPU cores. Then I got to thinking, how would one implement multi-CPU support on such an operating system?

Between 1989 and 1994 I used a British computer running the operating system RISC OS. This operating system is still alive, and is still in use. It was produced by Acorn Computers and runs on the ARM processor (“Acorn Risc Machine”), now found in many computing platforms such as the iPhone. It has many innovative features which haven’t made it yet to other operating systems (although is the topic for a post of its own!)

Back then there was obviously no need to support multiple CPUs. So today one has the choice of either finding single-core CPUs (will become increasingly difficult) or using a multi-core CPU with all but one core inactive (which is a waste). Various fora speculate on the technical possibility or impossibility of utilizing the extra cores. I don’t think it would be especially difficult. This article describes how I would go about it.

Constraints and information about RISC OS:

  • RISC OS can run multiple programs at once but yet has no concept of pre-emptive multi-tasking, threads, processes, etc.
  • It’s extremely old, therefore extremely fast on modern processors (even running on only one core)
  • Many parts of the operating system, such as those to do with the filesystem, are marked as “non re-entrant” in the documentation, that means they would need to be re-written or altered to work with pre-emptive multi-tasking or simultaneous execution (multi-core), for example locks would need to be introduced.

For me at least, I would say that the way the system currently operates on a single core is already very fast (when running at less than full speed via emulation on an x86 processor – it runs faster than it did on native hardware in the 1990s, it runs faster than Windows ran in the 1990s, and faster than Windows runs today.) So I think for most tasks, there is no need to use additional cores.

I am used to working in companies with limited human resources, and due to the fact that RISC OS isn’t a mainstream operating system, I am going to assume that this would also be a constraint with any solution to introduce multi-core capabilities to RISC OS. That means a complete re-write of RISC OS and applications, or even visiting every single part of the operating system and introducing locking and the expectation that any function can be called at any time due to pre-emptive multi-tasking and simultaneous execution by multiple cores, is also out of the question.

So my solution would be to assume that those extra cores should do “pure CPU” work, for example video processing, rendering fractals, or other CPU-intensive work. All work like accessing the filesystem, interacting with the user, which is already fast enough, would not be given the opportunity to work even faster by being executed in parallel on multiple cores.

I would write a very simply layer, similar to a hypervisor, which would run under RISC OS, which would simply handle allocation of CPU and memory resources. This supervisor would allow processes to run, which would not share memory (and thus not need to support shared-memory primitives such as locking). Pipes would allow processes managed by the supervisor to communicate.

The first, and often the only process, which this supervisor would manage, would be the entire existing RISC OS and all applications. This process would manage the user-interface, network, filesystem, printers and communication between all running RISC OS applications. So this “RISC OS + applications” process would run largely unaltered from the existing software (saving much development time; but also providing no benefit in terms of speed increase).

But there would be an extra API available to applications to spawn new processes managed by the supervisor, which would execute asynchronously, and primitives to pass/receive messages from those spawned processes. The supervisor would run these processes on the different cores, and pre-emptively multi-task the processes in the case of more processes than cores.

I would take the approach to concurrency of “message-passing with no shared memory” as opposed to “shared memory (threads)” for two reasons:

  • Existing RISC OS lacks the primitives to deal with shared memory and pre-emptive concurrency (such as locking), so it would not be possible to allow the spawned processes to interact with the main memory of the RISC OS applications. (Although it would be theoretically possible to allow the spawned processes to themselves have threads, although that would require introducing extra concepts such as locking to the supervisor.)
  • It might be argued one would need to support shared-memory computation to allow existing software to be easily ported to the platform, however this isn’t really a concern as porting software to RISC OS is next to impossible for other reasons: there is no POSIX, and the concepts provided by POSIX such as streams unifying filesystem access, inter-process communication, network access, are not available on RISC OS (and I would not advocate building them in the supervisor.)

I would allow “pipes” to be created between the processes and the pipes store an ordered set of “messages”, each message is simply a bunch of bytes. RISC OS software is generally written in C so just passing the byte-contents of a “struct” is easy and convenient. (This is the Erlang philosophy: Facebook say “Erlang… approaches concurrency with three iron rules: no shared memory even between processes on the same computer, a standard format for messages passed between processes, and a guarantee that messages are read in the order in which they were received.”)

This proposal has the following consequences:

  • The supervisor would be reasonably easy to write
  • RISC OS and applications would not need to be re-written. The application developer can, at a time of their choosing, decide to take advantage of the new feature, and surgically alter their code by altering only that part of the code where the speed improvement is desired (i.e. processing logic, not the GUI)
  • The supervisor as described would not consume many resources (i.e. the system would not suddenly become radically slower due to the introduction of this layer.)
  • The system would work as well on single-CPU systems as on multi-CPU systems (so software written with the new system can work on older single-core hardware)
  • One could envisage such software also running across multiple networked computers, so if a colleague’s computer was not in use (e.g. they were on holiday) ones own computer could become faster.

Admittedly there although there is no performance loss, there is also no performance gain until apps are extended. However I think spending the work to add pre-emptive multi-tasking to RISC OS (a huge undertaking) simply isn’t necessary: RISC OS works well as it is, and arguably is a lot more reliable (in terms of repeatable and therefore testable behaviour) due to its cooperative nature. The original designers of RISC OS could have created a pre-emptive multi-tasking operating system instead of RISC OS — the hardware supported it (RISC/iX ran on it) — but they actively decided not to do that.

See also:

Programming with unique constraints

Friday, June 13th, 2008

If you’re using a remote database system, your application doesn’t have access to all the data at any point in time. (I.e. you just load and save the rows you’re interested in in a particular transaction). Therefore if you want to do some database-wide operation, you need to ask the database to do it.

When you want to enforce uniqueness, for example across a whole table (for example a document name needs to be unique), or across a particular part of a table (for example each user must have documents with unique names), you need to do that in the database.

There is only one acceptable way to do this with a SQL database:

  1. Insert a new row
  2. If the insert succeeds, there wasn’t a row there before, and now there is
  3. If the insert throws a unique constraint violation, the row is already there
  4. If you want to update the row (i.e. an “insert or update” operation to maintain a “lazy singleton” in the database), you can update the row with safety after the unique constraint violation, as you can be certain the row is already there.

The following methods are all not acceptable:

  • Do a “select” to find out how many rows there are. If there aren’t any, do an “insert”. However someone may have inserted a row between your “select” and your “insert”.
  • Do an “update” and if the database says that 0 rows have been updated, do an “insert”. Again, someone may have inserted a row between your “update” and the “insert”.
  • Do a “select for update” statement (Oracle, Postgres, InnoDB) to check that there aren’t any rows while creating a lock, and then do an insert. However that statement only locks the rows it returns, so if it doesn’t return any rows, it doesn’t create any locks, so you still can’t be certain that no one has inserted a row between the “select” and the “insert”.
  • Lock the whole table and do one of the above. This works, but it means that all write access is “serialized” i.e. happens after one another. Any other operation, writing something completely irrelevant, will now also have to wait until the end of your transaction, whereas it shouldn’t have. This reduces concurrency.

The way I program this is the following. On the “insert” statement, I catch the database error, and see if it’s a “unique constraint violation”-type error. If it is, I throw an (unchecked) Exception. The calling code can catch that and do something with it (or not, if the statement should not generate such an error, in which case it will propogate to the main loop like any other database exception). I have had the pleasure of introducing this to easyname.eu and now also the pleasure of introducing this to the WebTek framework.

It is extremely frustrating working with big complex frameworks, whose usage is much more complex than just writing SQL manually, and not being able to do the above properly.

  • Hibernate clearly states in its documentation that if any database error occurs, the Session (main object managing all persistence) must be destroyed as its state will be out-of-sync with the database. But there is no way other than the above to do this sort of check (as far as I know).
  • OptimalJ generates code that, if a database error occurs, sets the transaction to rollback. Including any other work you may have done.

I mean checking unique constraints is something every database application needs, so the fact it’s not supported by major frameworks is just unbelievable.

Oracle, Nulls and the empty string

Thursday, June 12th, 2008

Oracle has an unusual feature, which attracts it a lot of criticism. If you try to insert the empty string into a column marked “not null”, you get an error. The empty string is treated the same as “null” by Oracle.

This is different to programming languages (and indeed other databases, at least MySQL), which means one has to be careful not to make a mistake when using a programming language to talk to the database. That it’s different from other systems is the main reason for the criticism.

However, how many times have you wanted the following to be allowed in your data schema, for example on a “first name” column:

  • Writing nulls is not allowed
  • Writing the empty string is allowed

I just wrote a program and the front-end framework helpfully noticed that the field was “not null” in the database and gave the user an error in the front-end if the field was empty. However when I altered the code slightly, it no longer gave an error. Because the field in the program was the empty string, and not null.

However, I assert, when dealing with data, checking only for not null is not useful; you also want to check that the string contains some data in that case.

I have now updated the framework, so that if the field is marked “not null”, then the error is presented to the front-end not only if the variable in the program is “null”, but also if it is the empty string.

(Note: I am not advocating e.g. Java lose the distinction between ==null and .isEmpty(): for some reason this is a useful distinction when doing programming and data manipulation—such as null indicating that a variable isn’t initialized yet—I just don’t think it’s a useful distinction in a system solely designed to model persistent data.)

Encapsulation or public attributes: but nothing inbetween

Monday, May 19th, 2008

This post asked the question:

Whenever a class in my model contains a collection which requires that particular care be taken with its items, there’s an internal debate regarding how to expose it to other classes. And with this, there are two major schools: one, the paranoia-based approach which doesn’t allow external code to touch the collection’s internal items and two, the trusting approach which just returns the collection for everyone to deal with.

What are your thoughts on the matter? What do you use, when and why?

Definitely paranoia.

It may make certain things more difficult or require more code, but one of the key cornerstones of object-oriented programming is encapsulation and not exposing your internal data in a way that means that others can break it.

If one wants to go for the trust approach – sure it’s easier – but if easiness is the objective one can write a C program and just declare a struct. Then anyone can access the data anyway they want and there’s absolutely no code to write (not even getters and setters). But the world moved away from that model towards encapsulation as with N lines of code (or Nk LOC or NM LOC), without encapsulation, any of those can be responsible for the creation of inconsistent data.

Explicit vs Implicit data typing

Thursday, April 10th, 2008

I was reading this article about how certain data gets messed up when one imports it into Excel (certain data looks like a date and thus gets converted into one), and it reminded me of a problem I had when transferring data over an XML protocol from Perl (the SOAP library was inspecting the hex data I was transferring, but a small percentage of hex numbers look like “123e123″, which looked like a floating point number to the library)

I think both problems are actually the same problem. It can be traced back to the necessity to make exactly one of the following two decisions when creating data-processing systems:

  1. Either you try and work out what the datatype of a piece of data is by looking at e.g. the data’s string representation. E.g. this data is “abcd” so it’s a string, this data is “123″ so it’s a number.
  2. Or you explicitly store and state, external to the data, its data type. E.g. store not only that the data is “123″, but that it’s a number.

Option 1 seems attractive as it’s simpler as you only need to store one piece of data. It also feels more normalized, as one piece of data is generally better than two (e.g. what if they are inconsistent, e.g. data is “abcjzh” and type is “number”?)

But the trouble is you option 1 doesn’t work (see above.)

But it gets worse. Option 1 seems to work, yet does not actually work in all case (and you want your software to work in all cases). That’s more dangerous that if it simply and clearly didn’t work.

The authors of the SOAP library in my example presumably believed their software worked. And I believed my software, built on top of the SOAP library, worked. It worked in my unit tests and when I tested it by clicking-through the front-end. Only 0.6% of users had a code with a hex string that looked like an exponent, so it’s understandable that I just didn’t hit it when testing. But with e.g. 2M users in the database, some of my users will hit it. And that means that the software I released didn’t work (working meaning working 100% for everyone.)

But I like my software to work. The way I achieve that is to avoid errors which are difficult to detect. Making errors is human; if they are easy to catch, one can spot them and then correct them.

XML in the database

Monday, March 3rd, 2008

As I do a lot of programming using open source databases, I couldn’t help but notice that inclusion of XML in the database seems to be a hot topic. Postgres 8.1 (just released) and MySQL 5.1 (not released yet) will allow XML documents to be stored as the value of cells, and allow those documents to be manipulated and inspected. (Oracle has had this feature since 2001)

I generally don’t like XML (as Robin put it: most formats are designed to be either conveniently readable by humans, or by computer. XML is conveniently readable by neither.)

So I especially wouldn’t like to store XML in the database. However, there are some times when one would naturally want to store some hierarchical data and the relational model doesn’t really allow that conveniently. Compared to:

  • Storing e.g. Java serialized objects as a BLOB in the database (with no chance of ever doing a “select” against that data, or migrating that data using a SQL migration script);
  • Or storing direct XML without manipulation functions (slightly more readable, but still not programatically manipulatable using SQL)

I suppose using the XML features provided in the database is better than either of the above solutions.

The following table contains documentation copied from the documentation or the changelog of the database in question:

New in Postgres 8.3
(4th Feb 2008)
“XML Support: New XML data type fully supports the SQL/XML specification of ANSI SQL:2003, including well-formedness checks, type-safe operations, SQL/XML publishing and XPath queries. Version 8.3 also includes additional functions for XML data export.”
New in MySQL 5.1
(In development)
“XML functions with XPath support. ExtractValue() returns the content of a fragment of XML matching a given XPath expression. UpdateXML() replaces the element selected from a fragment of XML by an XPath expression supplied by the user with a second XML fragment (also user-supplied), and returns the modified XML. See Section 11.10, ‘XML Functions’.”
New in Oracle 9.0.1
(Released 2001)
“The introduction of Oracle XML DB and the XMLType datatype provides new techniques that facilitate the persistence of XML content in the database. These techniques include the ability to store XML documents in an XMLType column or table, or in Oracle XML DB Repository. Storing XML as an XMLType column or table makes Oracle Database aware that the content is XML. This allows the database to: Perform XML-specific validations, operations, and optimizations on the XML content; Facilitate highly efficient processing of XML content by Oracle XML DB.”
New in MS SQL Server 2005
(Released 2005)
“SQL Server 2005 introduces a native data type called XML. A user can create a table that has one or more columns of type XML in addition to relational columns. … Together with a large set of functions, embedded XQuery and data modification languages provide rich support for manipulating XML data.”

Which log levels to use when?

Tuesday, February 5th, 2008

I’m sure there are a lot of opinions in the world about which log levels to use for which errors. Log levels in the sense of if a text destined for a logfile should be prefixed with “Info”, “Warning”, “Error” etc.

There is even great debate about which log levels there should be. E.g. should there be a “Debug” level or a “Trace” level? Or both? In which case what’s the difference?

I tended to ignore those debates—for me a log statement was simply a log statement—but recently I was deploying an application, and various people had worked on it and used log levels inconsistently. Hardly a surprise as I’d never set any guidelines on how they should be used, due to aforestated ignoring. So I started to think about how I would have wanted the log levels to have been in the application I was at that moment deploying.

Firstly, what does a log level actually influence?

  • Different texts are stated in the log file. The only differences this makes to anyone is that operations teams, not too familiar with the software, tend to understandable freak out when one has lots of ERRORs in the log file; whereas they tend to freak out less with lots of INFOs.
  • You can do monitoring based on log files. E.g. one can create assertions which can be monitored, such as “INFOs may be in the log file, but ERRORs may not.”
  • You can set a log level and state that errors below a certain level will not be displayed, e.g. “ERRORs should be displayed but WARNINGs should not”. So the log level of an error may influence whether it gets logged at all.

And how would those differences influence ones decision about which level to make a particular log?

I came up with the following scheme for me:

TRACE For things that should happen, but one doesn’t want to see live. (For example SQL statement logs, etc.)
INFO For things that should happen, but one does want to log live. (For example, “writing file to /x/y.pdf…”) It’s important to state IDs and paths in log files, for if something goes wrong, one needs to be able to work out exactly what the robot did with what rows and objects. And if the robot crashes, a log, written before an action is done, such as “Writing PDF file…” can help to identify what part of the process caused the error.
WARNING   For things that shouldn’t happen, but where the user sees the correct result. For example files that should always exist, but if they don’t, they can be automatically recreated. If those happen live, one might want to look at them. But as long as the user sees the right thing, one can look at them in ones own time.
ERROR For things that shouldn’t happen, and where the user sees an incorrect result. Including “fatal” errors.

And in that case, one would monitor ERRORs (or maybe WARNINGs if one was otherwise bored), and would set the log level to TRACE on all test servers and INFO on all live servers.