Archive for March, 2007

We don't need these users - let's move them to an "archive" table!

Thursday, March 29th, 2007

For one of the customers I currently work for, when we first designed the platform in Q1/2000, there was the "account" table, there we stored our users. There were always various pressures to move "inactive" users to a separate "archive" table. I was always against this decision.

In Q4/2005, during a period of my absence, it was decided to implement this decision. A bunch of users were to be deleted, but "not quite", in case we needed their data again. Their data was to be moved from the "account" table to an "account_archive" table.

This was really the worst decision ever made. I said that before, and now I see the consequences. I want anyone who considers such an operation good, to understand the consequences. So I list them here.

  • More and more, bosses and business people require we do operations on "all" users, which includes the "account_archive" table. This generally involves a "union" of both tables.
  • Now I have to create a real-time data interface to a slave system. This also including archived users. That means I have a "who has changed" table (input queue for the process exporting changed users to the slave system). This table references account_ids, but I can't create an FK from this table to "account", as sometimes an "account_id" references "account" and sometimes "account_archive".
  • There are classes which model a User, and this uses the "account" table as the underlying table. This enables me to build logic functions on the User class, and this has been done. However, at the time the class was built, there was only "account", so I can't use this class to model users who are stored in the "account_archive" table. (I'm not going to extend the User object to include the "account_archive" table, that will make this critical code too complex)
  • Now I have to allow users to "unsubscribe" from a newsletter, and "archive" people can receive newsletters, if they elected to receive them while they were active. Again, I can't use the User objects to do that. So I have to just program in plain SQL in an fcgi (or create a second class MaybeArchivedUser to model a user which could be in either table, and then duplicate some instance methods - that's what I chose to do).
  • It was suggested "maybe we archived the wrong users". But it's nearly impossible to re-create them as the schema is different, and some information has not been kept on in "account_archive". Their nicknames, which are unique amongst active users (but not amongst archive users) might have been reused in the meantime.

It would have been necessary to decide on one of the two courses of action:

  • We actually will never need these users again: we delete them
  • We might or do need them in the future. In which case we set a special "status" in the "account" table. They can't log in. But we can build User objects. We can re-enable them if necessary. We can even let them log in to some mini-platform where they can do a few things such as delete themselves or request their reactivation.

It has been said that removing users from account "increases performance". However:

  • It's more probable that two accounts will be read from the same disk block, after a defragmentation has occurred (did any defragmentation run? I don't think so)
  • If there are half the number of users, that's one less binary-index level. If there are 2M users that's 21 index branches. If 1M users that's 20 index branches. Hardly a big saving.
  • Although backup (and recovery) no doubt became quicker
  • This hardly makes up for the other disadvantages

Splitting a table up into two tables, for "performance", or whatever reason, is never a good thing to do. Add a status flag.

Fast

Wednesday, March 28th, 2007

From our Oracle test instance:

1360965 rows created.
Elapsed: 00:01:17.90

That's 1¼ minutes to insert (and index) over 1¼ million rows.

And this is a very old test instance. I think the hardware was last updated 2-3 years ago.

That's pretty quick!

Mouse reboot

Wednesday, March 28th, 2007

I have been using a trusty wireless mouse for about 3 months now. (I didn’t want a wireless mouse, but here in Macau, I didn’t know what was going on, so I walked into an expensive hardware store—the only hardware store I knew—and they only had wireless mice. Well I thought, it may be twice the price but even twice the price isn’t expensive, and I need a mouse…)

It suddenly stopped working while I was using it.

  • The light under the mouse was on, so the mouse thought it was working.
  • The touch pad built into the laptop worked, so Windows was still working and accepting pointer-movement instructions.
  • I took out the USB device, which communicates with the mouse, and put it back in. “Detecting new hardware” etc. But it didn’t start working again.
  • I plugged the USB device into a new port. Even more “Detecting new hardware” etc. But it still didn’t start working again.

Then I took the batteries out of the mouse (which has no on/off switch), then put them back in again. Then it started working again.

My mouse had crashed, and needed a reboot.

MySQL's "enum" datatype is a good thing

Friday, March 23rd, 2007

I’ve often had discussions with people about whether the “enum” type in MySQL is a good thing or not. Basically there are two ways to use your database:

  1. As an unstructured bunch of “stuff” to store whatever the software needs to persist. Such databases use lots of “blob” data with serialized objects (it’s easy to program), tables with multiple functions (”object” table with “type” column), few constraints, and so on.
  2. As a representation of the data the program is trying to model. Such databases have meaningful column names, two different types of things are two different tables. Adding constraints is easy.

There are good arguments for both. As one is programming in a programming language, whatever is most convenient for that programming language (e.g. serialize a bunch of objects to a blob) means less effort, less code: which means less cost and less bugs. Those are all good things.

But for some reason I’ve always been the 2nd type. I like to look at a database and understand what data is being modeled. It creates a certain self-documentation which can often be lacking from complicated software. Constraints can be added which acts like functional assertions (functional in the sense that they involve no state: you say that this value must be between 1 and 10 and then it is that way. You don’t have to program any “path” or state to check that).

That an item can be in exactly one a distinct set of states is a fact of life in all types of domain modeling:

  • What state is this invoice in? Is it “paid”, “open”, “paid+processed”?
  • Is this item deleted or not? (Or marked temporarily suspended, pending administrator checking the content)
  • Is this photo public, or does it belong to a user?

To explicitly work with enumerations is to acknowledge their existence within the domain, and to create a more 1:1 mapping between the domain and its implementation. That’s what all software should be about, modeling. Ideally a program is nothing more than a model and a few rules for how that model transitions between one consistent state and another.

If you program databases like #2 above, like I do, then enumerations really do make the data model richer.

  • The database maps to the domain easier (thereby documenting the domain, in case this is not done elsewhere)
  • The database maps to a domain-consistent implementation in the programming language easier
  • The database implicitly then has a constraint, as you can’t set the column to be some value which makes no sense for the domain (and therefore the program)

So enums in DBs are a good thing, and should be used wherever possible. Oracle should implement them just like MySQL does. And enums should be used in programming languages too. Why enums appeared only since Java 1.5 I don’t know. And the fact that php, perl etc do not support them, is yet another reason not to use those programming languages to try and attempt to build any software system whose function it is to model a domain.

It can be said that then it makes it “more difficult” to add a new state, as now you have to “change the database as well”. But if the model changes, and the database reflects the model, then that’s a good thing, not a bad thing. And it isn’t even much effort: If you say that invoices can be in a new state, then there’ll be a lot of programming work to support that change (UI, billing logic, robots, test scripts). The “alter table” statement is no work at all proportionally, and you’ll also be thankful for every extra constraint the database or programming language can offer you (which such a change feels very scary in a Perl program, but not so much in a Java program).

Windows path length limit

Wednesday, March 21st, 2007

It really seems that Windows does indeed have a path limit.

While checking some files into a subversion repository:

  • The repository was D:\Adrian\my-respository
  • Within the repository I had quite a deep directory structure, to access this particular project
  • Within this particular project, the IDE I was using had a few levels of directories, to include "work/src" and so on
  • The class path of the Java classes was quite deep, "com/company/project" etc
  • Subversion itself puts a few levels of dirs in ".svn/text-base" and so on

While checking in all this stuff, I got the error "path invalid" suddenly. And opening the created .svn directory in Windows Explorer, right-clicking, and chooseing "new directory" immediately brought up the error:

Cannot create 'New Folder': path invalid

So it seems paths have a limit in Windows. The existing working path was 220 characters long, with 20 directories including the working directory and the hard disk's root directory.

This is all very annoying, as I can't really do anything about any of the above reasons why the path is so long.

Beautiful Kettle

Tuesday, March 20th, 2007

One really does appreciate things that work elegantly.

My girlfriend insisted we bought the following kettle, on account of its colour.

It was quite an expensive kettle but it did look so good, I thought well, OK.

But with its expense comes more than just its beauty.

  • After the water has boiled, to open the spout, to pour the water, there is an extra plastic bit (looking like an ear) which stays cold. Meaning you don't burn your hands or need to use a dish cloth.
  • The purple handle is designed in such a way that it too doesn't get hot.
  • You can lift the lid of the kettle off without having to move the handle to any special angle (as was the case with the previous kettle)
  • The water pours beautifully. No spillage whatsoever.
  • Without water, it's light (ideal for the ladies for whom it was presumably designed).
  • It's the right size. There's no point making a kettle holding only one cup of water but most families don't need industrial sized kettles which can cook 34 cups of water either. This kettle is physically small but can boil 3-4 cups of water.
  • The spout is big enough that one can poor water in it, without having to lift the lid off, if that's one's preferred method of filling the kettle (it is mine)

If only all things in life were made as well as this kettle.

What does this error mean?

Tuesday, March 20th, 2007

While moving a folder "old-cvs-data", with many subdirectories, to the Recycle Bin under Windows XP…

Maybe each file that is stored in the recycle bin has a "original path" attribute, with a max length 256 chars, and that stores the original path like "Dir1\Dir2\Dir3\file.txt". Maybe if files are nested too deeply that attribute cannot hold the value. But that's just a guess.

Maybe it really is time to get a Mac.

Concurrency control using Oracle’s “Select for Update”

Monday, March 19th, 2007

There are times when one need to prevent a certain “critical section” of code from being executed by more than one process on the same object at the same time. For example, if the requirements state that a user cannot have two subscriptions of the same type active at the same time. So to enable a subscription, if one does “a) check if user can have the subscription, b) enable the subscription”, one need to make sure that there aren’t multiple processes doing that simultaneously.

Oracle (and thus MySQL InnoDB and PostgreSQL) support row-level write locks. This means that (in contrast to other databases) if one database connection, in a transaction, writes to a row (but has not commited yet) then other connections can freely read the row: they get the version which was most recently commited, which was the version before this transaction. Only if a second connection tries to write to the same row, will the connection wait for the first connection to perform a commit or rollback.

The “select for update” statement acts like a “write” above. All rows which are returned from the select to a first connection are not being written to by any other connection, and any other connection which later attempts to write (or “select for update”) the rows will wait until the first transaction has commited or rolled back. Locking rows using “select for update” is thus an appropriate mechanism to implement locking.

As “select for update” still can only lock existing rows, one needs to decide what the appropriate row(s) are to lock for a particular operation. Remember that the rows are only locked to other connections using “select for update” or writing to the row: anyone can still read the rows. If one wishes to exclude the possibilty of a user having two subscriptions of the same type simultaneously, one cannot lock the “zero rows” the user must have in the subscriptions table beforehand, as only existing rows can be locked.

It thus makes sense to lock the “user” row representing the user (or some other table containing exactly one row per user).

Therefore, for certain critical operations, such as adding a new subscription line for a user, I shall perform the following operations:

  1. Open transaction
  2. “Select account_id from account where account_id=?”
  3. Does the user already have a subscriptions row of this type currently active?
  4. If not, create the new row for the user
  5. Commit

Note that with other database engines, simply reading the row within a transaction may be enough for it to acquire an exclusive lock. Or worse, for the entire page (Informix?) or entire table (MySQL MyISAM?) to acquire an exclusive lock.

And even if it’s just the row being locked (best case), other reads (e.g. a user just wants to view some data) will cause the reading process to block, increasing the number of processes waiting from just those needing to wait due to execution of a critical section, to all processes accessing the table.

Copy/Paste between two rich text editors in web browsers

Thursday, March 15th, 2007

I don't know on what technology the gmail rich text editor is based, nor the uboot rich text editor for composing blog posts (although the latter I should know!) but despite having different appearances (fonts etc) and existing in different websites, one can copy/paste formatted data (e.g. lists, bold) from one to the other (at least using Firefox). The text takes on the appearance of the editor one copies it into.

I don't know how that works, but I think that's quite impressive!

Bottles more than 100% full

Thursday, March 15th, 2007

I don't know how they do it, but here in Macau, one can buy e.g. a 1.5L plastic bottle of a non-fizzy drink such as still water, and instead of having a small space of air at the top, when one opens it

  1. The line of the water is exactly at the top of the bottle, i.e. one couldn't fill it any more
  2. Before one gets a chance to observe #1, water has spilt out

As far as I can see, the above is impossible (as well as unusual, from the perspective of a European). But nevertheless, despite impossibility, it is actually the case.