Archive for the ‘MySQL’ Category

Reading row-by-row into Java from MySQL

Thursday, February 7th, 2008

Trying to read a large amount of data from MySQL using Java using one query is not as easy as one might think.

I want to read the results of the query a chunk at a time. If I read it all at once, the JVM understandably runs out of memory. In this case I am stuffing all the resulting data into a Lucene index, but the same would apply if I was writing the data out to a file, another database, etc.

Naively, I assumed that this would just work by default. My initial program looked like this (I’ve left out certain things such as closing the PreparedStatement):

public void processBigTable() {
    PreparedStatement stat = connection.prepareStatement(
        "SELECT * FROM big_table");
    ResultSet results = stat.executeQuery();
    while (results.next()) { ... }
}

Failed with the following error:

Exception in thread "main"
        java.lang.OutOfMemoryError: Java heap space
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2823)
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2763)
    ...
    at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1657)
    ...

The line it failed at was the exceuteQuery. So as we can see from the stack backtrace, it’s clearly trying to load all the results into memory simultaneously.

I tried all sorts of things but it was only after I took at the MySQL JDBC driver code did I find the answer. In StatementImpl.java:

protected boolean createStreamingResultSet() {
    return ((resultSetType == ResultSet.TYPE_FORWARD_ONLY)
        && (resultSetConcurrency == ResultSet.CONCUR_READ_ONLY)
        && (fetchSize == Integer.MIN_VALUE));
}

This boolean function determines if it’s going to use the approach “read all data first” or “read rows a few at a time” (= “streaming” in their terminology). I clearly need the latter.

You can specify, using the generic JDBC API, the number of rows you want to fetch at once (the “fetchSize”). Why would you have to set that to Integer.MIN_VALUE, which is stated to be −231, in order to get streaming data? I wouldn’t have guessed that.

Basically this very important decision about which approach to use, which in my case amounts to “program works” or “program crashes”, is left to test whether three variables are set to various values. I am not aware if this is in the documentation (I didn’t find it), nor if this decision is guaranteed to be stable, i.e. won’t change in some future driver version.

Now my code looks like the following:

public void processBigTable() {
    PreparedStatement stat = c.prepareStatement(
        "SELECT * FROM big_table",
        ResultSet.TYPE_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY);
    stat.setFetchSize(Integer.MIN_VALUE);
    ResultSet results = stat.executeQuery();
    while (results.next()) { ... }
}

This code works, and reads chunks of rows at a time.

Well I’m not sure if it reads chunks of rows at a time, or just one row at a time. I hope it doesn’t read one row at a time, because that would be very inefficient in terms of number of round trips from the software to the database. I assumed this was what the fetchSize parameter was controlling, so you could tune the size of the chunks to meet your particular latency and memory setup. But being forced to set it to a large negative number in order to get it to work means one has no control over the size of the chunks (as far as I can see).

(I am using Java 6 with MySQL 5.0 and the JDBC driver “MySQL Connector” 5.1.15.)

Interesting Oracle/MySQL locking difference

Wednesday, September 5th, 2007

I know the rules for Oracle row locking well. A row can be locked for write if one updates it, or if one “select for update”s it.

  1. create table a (x number); (and equivalent in MySQL for InnoDB)
  2. Session A: insert into a values (9);
  3. Session A: commit;
  4. Session A: start transaction (in mysql)
  5. Session A: select * from a where x=9 for update;
  6. Session B: start transaction (in mysql)
  7. Session B: select * from a where x=9 for update;
  8. Session B hangs, waiting for row-level lock to be release from the row by Session A
  9. Session A: update a set x=4;
  10. Session A: commit;
  11. At this point, Session B returns no rows. Lock has been released, and row no longer confirms to where, so is not returned.
  12. Session A: update a set x=5;

This is where the difference occurs:

  • Oracle returns the row to session A. The command “for update” in session B did not return any rows, and thus no rows were locked, and thus session A has nothing to wait on.
  • MySQL (version 4.1.18) blocks session A waiting on the transaction in session B to be ended. That means A requires a lock owned by B. But what is this lock? Is it a row-level lock on the row which was not selected? Some other type of lock?

The reason I tested this was it just occurred to me that in a “select for update” the where clause is applied twice:

  1. Firstly to determine which rows to return. A lock is requested for those rows, which can involve waiting until the lock is released if it is owned by some other transaction.
  2. After the lock has been acquired, the rows is checked again to see if it still confirms to the where clause. If not, it is not returned (although the session blocked waiting for it)

I suspect that both databases simply do step 1 involving waiting and acquiring the lock on the rows which initially conform to the where clause. And both apply the where in step 2, as in both cases no rows were actually returned dispite the wait. But Oracle, upon noticing the row will not be returned, actually releases the lock as the command implies only rows returned will be locked. But MySQL just leaves the lock form step 1 hanging around, which is wrong.

MySQL's "enum" datatype is a good thing

Friday, March 23rd, 2007

I’ve often had discussions with people about whether the “enum” type in MySQL is a good thing or not. Basically there are two ways to use your database:

  1. As an unstructured bunch of “stuff” to store whatever the software needs to persist. Such databases use lots of “blob” data with serialized objects (it’s easy to program), tables with multiple functions (”object” table with “type” column), few constraints, and so on.
  2. As a representation of the data the program is trying to model. Such databases have meaningful column names, two different types of things are two different tables. Adding constraints is easy.

There are good arguments for both. As one is programming in a programming language, whatever is most convenient for that programming language (e.g. serialize a bunch of objects to a blob) means less effort, less code: which means less cost and less bugs. Those are all good things.

But for some reason I’ve always been the 2nd type. I like to look at a database and understand what data is being modeled. It creates a certain self-documentation which can often be lacking from complicated software. Constraints can be added which acts like functional assertions (functional in the sense that they involve no state: you say that this value must be between 1 and 10 and then it is that way. You don’t have to program any “path” or state to check that).

That an item can be in exactly one a distinct set of states is a fact of life in all types of domain modeling:

  • What state is this invoice in? Is it “paid”, “open”, “paid+processed”?
  • Is this item deleted or not? (Or marked temporarily suspended, pending administrator checking the content)
  • Is this photo public, or does it belong to a user?

To explicitly work with enumerations is to acknowledge their existence within the domain, and to create a more 1:1 mapping between the domain and its implementation. That’s what all software should be about, modeling. Ideally a program is nothing more than a model and a few rules for how that model transitions between one consistent state and another.

If you program databases like #2 above, like I do, then enumerations really do make the data model richer.

  • The database maps to the domain easier (thereby documenting the domain, in case this is not done elsewhere)
  • The database maps to a domain-consistent implementation in the programming language easier
  • The database implicitly then has a constraint, as you can’t set the column to be some value which makes no sense for the domain (and therefore the program)

So enums in DBs are a good thing, and should be used wherever possible. Oracle should implement them just like MySQL does. And enums should be used in programming languages too. Why enums appeared only since Java 1.5 I don’t know. And the fact that php, perl etc do not support them, is yet another reason not to use those programming languages to try and attempt to build any software system whose function it is to model a domain.

It can be said that then it makes it “more difficult” to add a new state, as now you have to “change the database as well”. But if the model changes, and the database reflects the model, then that’s a good thing, not a bad thing. And it isn’t even much effort: If you say that invoices can be in a new state, then there’ll be a lot of programming work to support that change (UI, billing logic, robots, test scripts). The “alter table” statement is no work at all proportionally, and you’ll also be thankful for every extra constraint the database or programming language can offer you (which such a change feels very scary in a Perl program, but not so much in a Java program).

Database error messages

Wednesday, February 21st, 2007

Database error messages in general are very bad. Why?

Oracle Version 8 had lots of messages such as "Invalid column name" where they meant:

  1. Column name not found in the table in question. The word "invalid" is the wrong word as it implies illegal characters or something like that.
  2. Which column? Which table? The parser surely knows this at the time it generates the error message. But it helpfully chooses not to inform the user.

Thankfully Oracle 10 has improved its error messages a lot. They include the statement in question and the point in the statement producing the problem. And error messages contain which foreign key constraint has been violated, and so on.

But I have the following problem with MySQL. I try to create a table with InnoDB with a foreign key constraint and it says:

ERROR 1005 (HY000):
Can't create table './myschema/mytable.frm' (errno: 150)

What it means is: the statement has an error in it. But what is the error? In this case, there was a foreign key constraint and the column didn't exist in the referenced table. But why couldn't it tell me this?

How MySQL reduces error messages in your program

Thursday, February 1st, 2007

Ah MySQL (at least MyISAM) so isn't a real database!

Firstly, when doing an insert, I did some arithmetic. The numeric column was of a certain width. If the result of the arithmetic is larger than the maximum allowed value my number was just getting turned into that maximum allowed value, without warning or error. A large number suddenly becoming some other large number may sound good in the philosophy of "errors are bad - we want to minimize errors!" but literally it's never what you want. Oracle gives an error if a number is too big to be stored in a column. Which is what you always want.

Secondly, due to above arithmetic overflow errors, my insert statement was failing (as multiple values that should have been distinct, but beyond the maximum, were then identical, equal to the maximum). I kept on doing it and it kept on failing. Then I looked at the table and each time I'd done such an unsuccessful insert (a single statement to insert maybe 10k rows) some rows (but not all - due to the error) were getting inserted. Having half a statement succeed is never what you want! Oracle sets an invisible checkpoint before each statement and if the statement fails, rolls the database back to that checkpoint. That's always what you want!