Archive for the ‘Databases’ Category

Programming with unique constraints

Friday, June 13th, 2008

If you’re using a remote database system, your application doesn’t have access to all the data at any point in time. (I.e. you just load and save the rows you’re interested in in a particular transaction). Therefore if you want to do some database-wide operation, you need to ask the database to do it.

When you want to enforce uniqueness, for example across a whole table (for example a document name needs to be unique), or across a particular part of a table (for example each user must have documents with unique names), you need to do that in the database.

There is only one acceptable way to do this with a SQL database:

  1. Insert a new row
  2. If the insert succeeds, there wasn’t a row there before, and now there is
  3. If the insert throws a unique constraint violation, the row is already there
  4. If you want to update the row (i.e. an “insert or update” operation to maintain a “lazy singleton” in the database), you can update the row with safety after the unique constraint violation, as you can be certain the row is already there.

The following methods are all not acceptable:

  • Do a “select” to find out how many rows there are. If there aren’t any, do an “insert”. However someone may have inserted a row between your “select” and your “insert”.
  • Do an “update” and if the database says that 0 rows have been updated, do an “insert”. Again, someone may have inserted a row between your “update” and the “insert”.
  • Do a “select for update” statement (Oracle, Postgres, InnoDB) to check that there aren’t any rows while creating a lock, and then do an insert. However that statement only locks the rows it returns, so if it doesn’t return any rows, it doesn’t create any locks, so you still can’t be certain that no one has inserted a row between the “select” and the “insert”.
  • Lock the whole table and do one of the above. This works, but it means that all write access is “serialized” i.e. happens after one another. Any other operation, writing something completely irrelevant, will now also have to wait until the end of your transaction, whereas it shouldn’t have. This reduces concurrency.

The way I program this is the following. On the “insert” statement, I catch the database error, and see if it’s a “unique constraint violation”-type error. If it is, I throw an (unchecked) Exception. The calling code can catch that and do something with it (or not, if the statement should not generate such an error, in which case it will propogate to the main loop like any other database exception). I have had the pleasure of introducing this to easyname.eu and now also the pleasure of introducing this to the WebTek framework.

It is extremely frustrating working with big complex frameworks, whose usage is much more complex than just writing SQL manually, and not being able to do the above properly.

  • Hibernate clearly states in its documentation that if any database error occurs, the Session (main object managing all persistence) must be destroyed as its state will be out-of-sync with the database. But there is no way other than the above to do this sort of check (as far as I know).
  • OptimalJ generates code that, if a database error occurs, sets the transaction to rollback. Including any other work you may have done.

I mean checking unique constraints is something every database application needs, so the fact it’s not supported by major frameworks is just unbelievable.

Oracle, Nulls and the empty string

Thursday, June 12th, 2008

Oracle has an unusual feature, which attracts it a lot of criticism. If you try to insert the empty string into a column marked “not null”, you get an error. The empty string is treated the same as “null” by Oracle.

This is different to programming languages (and indeed other databases, at least MySQL), which means one has to be careful not to make a mistake when using a programming language to talk to the database. That it’s different from other systems is the main reason for the criticism.

However, how many times have you wanted the following to be allowed in your data schema, for example on a “first name” column:

  • Writing nulls is not allowed
  • Writing the empty string is allowed

I just wrote a program and the front-end framework helpfully noticed that the field was “not null” in the database and gave the user an error in the front-end if the field was empty. However when I altered the code slightly, it no longer gave an error. Because the field in the program was the empty string, and not null.

However, I assert, when dealing with data, checking only for not null is not useful; you also want to check that the string contains some data in that case.

I have now updated the framework, so that if the field is marked “not null”, then the error is presented to the front-end not only if the variable in the program is “null”, but also if it is the empty string.

(Note: I am not advocating e.g. Java lose the distinction between ==null and .isEmpty(): for some reason this is a useful distinction when doing programming and data manipulation—such as null indicating that a variable isn’t initialized yet—I just don’t think it’s a useful distinction in a system solely designed to model persistent data.)

XML in the database

Monday, March 3rd, 2008

As I do a lot of programming using open source databases, I couldn’t help but notice that inclusion of XML in the database seems to be a hot topic. Postgres 8.1 (just released) and MySQL 5.1 (not released yet) will allow XML documents to be stored as the value of cells, and allow those documents to be manipulated and inspected. (Oracle has had this feature since 2001)

I generally don’t like XML (as Robin put it: most formats are designed to be either conveniently readable by humans, or by computer. XML is conveniently readable by neither.)

So I especially wouldn’t like to store XML in the database. However, there are some times when one would naturally want to store some hierarchical data and the relational model doesn’t really allow that conveniently. Compared to:

  • Storing e.g. Java serialized objects as a BLOB in the database (with no chance of ever doing a “select” against that data, or migrating that data using a SQL migration script);
  • Or storing direct XML without manipulation functions (slightly more readable, but still not programatically manipulatable using SQL)

I suppose using the XML features provided in the database is better than either of the above solutions.

The following table contains documentation copied from the documentation or the changelog of the database in question:

New in Postgres 8.3
(4th Feb 2008)
“XML Support: New XML data type fully supports the SQL/XML specification of ANSI SQL:2003, including well-formedness checks, type-safe operations, SQL/XML publishing and XPath queries. Version 8.3 also includes additional functions for XML data export.”
New in MySQL 5.1
(In development)
“XML functions with XPath support. ExtractValue() returns the content of a fragment of XML matching a given XPath expression. UpdateXML() replaces the element selected from a fragment of XML by an XPath expression supplied by the user with a second XML fragment (also user-supplied), and returns the modified XML. See Section 11.10, ‘XML Functions’.”
New in Oracle 9.0.1
(Released 2001)
“The introduction of Oracle XML DB and the XMLType datatype provides new techniques that facilitate the persistence of XML content in the database. These techniques include the ability to store XML documents in an XMLType column or table, or in Oracle XML DB Repository. Storing XML as an XMLType column or table makes Oracle Database aware that the content is XML. This allows the database to: Perform XML-specific validations, operations, and optimizations on the XML content; Facilitate highly efficient processing of XML content by Oracle XML DB.”
New in MS SQL Server 2005
(Released 2005)
“SQL Server 2005 introduces a native data type called XML. A user can create a table that has one or more columns of type XML in addition to relational columns. … Together with a large set of functions, embedded XQuery and data modification languages provide rich support for manipulating XML data.”

Reading row-by-row into Java from MySQL

Thursday, February 7th, 2008

Trying to read a large amount of data from MySQL using Java using one query is not as easy as one might think.

I want to read the results of the query a chunk at a time. If I read it all at once, the JVM understandably runs out of memory. In this case I am stuffing all the resulting data into a Lucene index, but the same would apply if I was writing the data out to a file, another database, etc.

Naively, I assumed that this would just work by default. My initial program looked like this (I’ve left out certain things such as closing the PreparedStatement):

public void processBigTable() {
    PreparedStatement stat = connection.prepareStatement(
        "SELECT * FROM big_table");
    ResultSet results = stat.executeQuery();
    while (results.next()) { ... }
}

Failed with the following error:

Exception in thread "main"
        java.lang.OutOfMemoryError: Java heap space
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2823)
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2763)
    ...
    at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1657)
    ...

The line it failed at was the exceuteQuery. So as we can see from the stack backtrace, it’s clearly trying to load all the results into memory simultaneously.

I tried all sorts of things but it was only after I took at the MySQL JDBC driver code did I find the answer. In StatementImpl.java:

protected boolean createStreamingResultSet() {
    return ((resultSetType == ResultSet.TYPE_FORWARD_ONLY)
        && (resultSetConcurrency == ResultSet.CONCUR_READ_ONLY)
        && (fetchSize == Integer.MIN_VALUE));
}

This boolean function determines if it’s going to use the approach “read all data first” or “read rows a few at a time” (= “streaming” in their terminology). I clearly need the latter.

You can specify, using the generic JDBC API, the number of rows you want to fetch at once (the “fetchSize”). Why would you have to set that to Integer.MIN_VALUE, which is stated to be −231, in order to get streaming data? I wouldn’t have guessed that.

Basically this very important decision about which approach to use, which in my case amounts to “program works” or “program crashes”, is left to test whether three variables are set to various values. I am not aware if this is in the documentation (I didn’t find it), nor if this decision is guaranteed to be stable, i.e. won’t change in some future driver version.

Now my code looks like the following:

public void processBigTable() {
    PreparedStatement stat = c.prepareStatement(
        "SELECT * FROM big_table",
        ResultSet.TYPE_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY);
    stat.setFetchSize(Integer.MIN_VALUE);
    ResultSet results = stat.executeQuery();
    while (results.next()) { ... }
}

This code works, and reads chunks of rows at a time.

Well I’m not sure if it reads chunks of rows at a time, or just one row at a time. I hope it doesn’t read one row at a time, because that would be very inefficient in terms of number of round trips from the software to the database. I assumed this was what the fetchSize parameter was controlling, so you could tune the size of the chunks to meet your particular latency and memory setup. But being forced to set it to a large negative number in order to get it to work means one has no control over the size of the chunks (as far as I can see).

(I am using Java 6 with MySQL 5.0 and the JDBC driver “MySQL Connector” 5.1.15.)

COUNT(*) vs COUNT(pk_col)

Monday, November 19th, 2007

A while back I was doing some performance tuning on MySQL 5 for a customer. A SELECT was counting the number of rows in the table. I always use COUNT(*) for that but I know a lot of people, including the customer, use COUNT(pk_col). The query was taking a long time (a few minutes). I analyzed that the problem came from the usage of COUNT(pk_col) instead of COUNT(*). With COUNT(*) it was instantaneous.

I didn’t know that there was a difference between the two. There is no difference in their semantics, therefore it didn’t occur to me that there might be a difference in the way they were executed.

Just to recap the SQL syntax:

  • COUNT(DISTINCT col) counts the number of distinct values that are in “col”
  • COUNT(col) counts the number of rows where col is not null.
  • COUNT(*) counts the number of rows in a table

It follows that for columns which don’t have any “null” values in them, COUNT(col) must be the same as COUNT(*). For any column marked “not null”, such as a primary key, this is always the case. And thus it follows that COUNT(pk_col) must always deliver an identical result to COUNT(*).

However, the database in question was executing a COUNT(pk_col) query and a COUNT(*) query differently.

  • For the COUNT(*) queries it was simply counting the number of rows in the table (taking constant time and not requiring reading the rows from disk)
  • For the COUNT(pk_col) it was going through all the rows, presumably checking if the pk_col was in fact null, and counting the number of rows where it wasn’t, thus requiring a lot of disk access.

Alas I didn’t write down the output from the EXPLAIN statement and I’ve just tried it on some other databases (using MySQL 4.1 and 5.0) and was unable to reproduce this behaviour. So this is a bit of a useless blog entry! Nevertheless - you must believe me - it did happen!

So the conclusion is, one should always use COUNT(*) and never COUNT(pk_col).

Interesting Oracle/MySQL locking difference

Wednesday, September 5th, 2007

I know the rules for Oracle row locking well. A row can be locked for write if one updates it, or if one “select for update”s it.

  1. create table a (x number); (and equivalent in MySQL for InnoDB)
  2. Session A: insert into a values (9);
  3. Session A: commit;
  4. Session A: start transaction (in mysql)
  5. Session A: select * from a where x=9 for update;
  6. Session B: start transaction (in mysql)
  7. Session B: select * from a where x=9 for update;
  8. Session B hangs, waiting for row-level lock to be release from the row by Session A
  9. Session A: update a set x=4;
  10. Session A: commit;
  11. At this point, Session B returns no rows. Lock has been released, and row no longer confirms to where, so is not returned.
  12. Session A: update a set x=5;

This is where the difference occurs:

  • Oracle returns the row to session A. The command “for update” in session B did not return any rows, and thus no rows were locked, and thus session A has nothing to wait on.
  • MySQL (version 4.1.18) blocks session A waiting on the transaction in session B to be ended. That means A requires a lock owned by B. But what is this lock? Is it a row-level lock on the row which was not selected? Some other type of lock?

The reason I tested this was it just occurred to me that in a “select for update” the where clause is applied twice:

  1. Firstly to determine which rows to return. A lock is requested for those rows, which can involve waiting until the lock is released if it is owned by some other transaction.
  2. After the lock has been acquired, the rows is checked again to see if it still confirms to the where clause. If not, it is not returned (although the session blocked waiting for it)

I suspect that both databases simply do step 1 involving waiting and acquiring the lock on the rows which initially conform to the where clause. And both apply the where in step 2, as in both cases no rows were actually returned dispite the wait. But Oracle, upon noticing the row will not be returned, actually releases the lock as the command implies only rows returned will be locked. But MySQL just leaves the lock form step 1 hanging around, which is wrong.

Fast

Wednesday, March 28th, 2007

From our Oracle test instance:

1360965 rows created.
Elapsed: 00:01:17.90

That's 1¼ minutes to insert (and index) over 1¼ million rows.

And this is a very old test instance. I think the hardware was last updated 2-3 years ago.

That's pretty quick!

MySQL's "enum" datatype is a good thing

Friday, March 23rd, 2007

I’ve often had discussions with people about whether the “enum” type in MySQL is a good thing or not. Basically there are two ways to use your database:

  1. As an unstructured bunch of “stuff” to store whatever the software needs to persist. Such databases use lots of “blob” data with serialized objects (it’s easy to program), tables with multiple functions (”object” table with “type” column), few constraints, and so on.
  2. As a representation of the data the program is trying to model. Such databases have meaningful column names, two different types of things are two different tables. Adding constraints is easy.

There are good arguments for both. As one is programming in a programming language, whatever is most convenient for that programming language (e.g. serialize a bunch of objects to a blob) means less effort, less code: which means less cost and less bugs. Those are all good things.

But for some reason I’ve always been the 2nd type. I like to look at a database and understand what data is being modeled. It creates a certain self-documentation which can often be lacking from complicated software. Constraints can be added which acts like functional assertions (functional in the sense that they involve no state: you say that this value must be between 1 and 10 and then it is that way. You don’t have to program any “path” or state to check that).

That an item can be in exactly one a distinct set of states is a fact of life in all types of domain modeling:

  • What state is this invoice in? Is it “paid”, “open”, “paid+processed”?
  • Is this item deleted or not? (Or marked temporarily suspended, pending administrator checking the content)
  • Is this photo public, or does it belong to a user?

To explicitly work with enumerations is to acknowledge their existence within the domain, and to create a more 1:1 mapping between the domain and its implementation. That’s what all software should be about, modeling. Ideally a program is nothing more than a model and a few rules for how that model transitions between one consistent state and another.

If you program databases like #2 above, like I do, then enumerations really do make the data model richer.

  • The database maps to the domain easier (thereby documenting the domain, in case this is not done elsewhere)
  • The database maps to a domain-consistent implementation in the programming language easier
  • The database implicitly then has a constraint, as you can’t set the column to be some value which makes no sense for the domain (and therefore the program)

So enums in DBs are a good thing, and should be used wherever possible. Oracle should implement them just like MySQL does. And enums should be used in programming languages too. Why enums appeared only since Java 1.5 I don’t know. And the fact that php, perl etc do not support them, is yet another reason not to use those programming languages to try and attempt to build any software system whose function it is to model a domain.

It can be said that then it makes it “more difficult” to add a new state, as now you have to “change the database as well”. But if the model changes, and the database reflects the model, then that’s a good thing, not a bad thing. And it isn’t even much effort: If you say that invoices can be in a new state, then there’ll be a lot of programming work to support that change (UI, billing logic, robots, test scripts). The “alter table” statement is no work at all proportionally, and you’ll also be thankful for every extra constraint the database or programming language can offer you (which such a change feels very scary in a Perl program, but not so much in a Java program).

Concurrency control using Oracle’s “Select for Update”

Monday, March 19th, 2007

There are times when one need to prevent a certain “critical section” of code from being executed by more than one process on the same object at the same time. For example, if the requirements state that a user cannot have two subscriptions of the same type active at the same time. So to enable a subscription, if one does “a) check if user can have the subscription, b) enable the subscription”, one need to make sure that there aren’t multiple processes doing that simultaneously.

Oracle (and thus MySQL InnoDB and PostgreSQL) support row-level write locks. This means that (in contrast to other databases) if one database connection, in a transaction, writes to a row (but has not commited yet) then other connections can freely read the row: they get the version which was most recently commited, which was the version before this transaction. Only if a second connection tries to write to the same row, will the connection wait for the first connection to perform a commit or rollback.

The “select for update” statement acts like a “write” above. All rows which are returned from the select to a first connection are not being written to by any other connection, and any other connection which later attempts to write (or “select for update”) the rows will wait until the first transaction has commited or rolled back. Locking rows using “select for update” is thus an appropriate mechanism to implement locking.

As “select for update” still can only lock existing rows, one needs to decide what the appropriate row(s) are to lock for a particular operation. Remember that the rows are only locked to other connections using “select for update” or writing to the row: anyone can still read the rows. If one wishes to exclude the possibilty of a user having two subscriptions of the same type simultaneously, one cannot lock the “zero rows” the user must have in the subscriptions table beforehand, as only existing rows can be locked.

It thus makes sense to lock the “user” row representing the user (or some other table containing exactly one row per user).

Therefore, for certain critical operations, such as adding a new subscription line for a user, I shall perform the following operations:

  1. Open transaction
  2. “Select account_id from account where account_id=?”
  3. Does the user already have a subscriptions row of this type currently active?
  4. If not, create the new row for the user
  5. Commit

Note that with other database engines, simply reading the row within a transaction may be enough for it to acquire an exclusive lock. Or worse, for the entire page (Informix?) or entire table (MySQL MyISAM?) to acquire an exclusive lock.

And even if it’s just the row being locked (best case), other reads (e.g. a user just wants to view some data) will cause the reading process to block, increasing the number of processes waiting from just those needing to wait due to execution of a critical section, to all processes accessing the table.

CHAR vs VARCHAR (and VARCHAR2)

Friday, February 23rd, 2007

A friend just asked me:

I have a DB, informix actually but I think its unimportant. A column is a char100. I have a string of text in a row in that col. the string is 4 characters long. When I select the char100 column I get a space padded string of 100 characters with my string at the front. Have you ever seen this?

Yep that's normal. A char(100) column is exactly that: 100 characters, no more, no less. So if you put too few characters in the field, rather than giving an error, the db pads the value with spaces.

The "very advanced" datatype varchar (or varchar2 in oracle) remembers how many characters you put in the field. (I shudder to think what oracle's varchar datatype must have been like. These days it's an alias to varchar2).

There are basically no reasons to use a char. We used e.g. char(2) to store a user's country in uboot ("at", "de" etc) but in fact we even had some problems with that, so we decided to always use varchar thenceforth.

In MySQL, if you define a table where it can pre-compute the byte-width of a row in advance, i.e. it's only composed of chars, ints, etc, no varchars, then if you delete rows and then re-insert rows, you never lose any space. As it can slot new rows exactly into the space taken up by old rows.

But if you use any varchars, then it can't. And in that case, it states there is no advantage to having any chars. So if you have a table with some chars and some varchars, then it converts those chars to varchars.

Incidentally, my book on Oracle (which serves also as an advertisement for Oracle) says that Informix, if you do an insert, then the page where the row gets written gets locked for the duration of the transaction, i.e. page lock not row lock. So if you do another insert from another transaction, even though it's an independent row, it's probably going to be written to the same page, so it must wait on the first transaction. So there's a lot of waiting going on. Not being able to do two inserts on the same table from two different transactions simultaneously would seem to be exeedingly rubbish. So I'm curious if it's really true or just Oracle marketing. Or maybe it was the case in Informix 1.0. But let's not forget that Oracle 1.0 probably didn't have varchar2.

Update: my friend writes: spoke to our informix guy, he says that was true in informix 7.10 from '97.