Archive for the ‘MySQL’ Category

Constraint name visibility on MySQL and PostgreSQL

Thursday, February 7th, 2013

Can one have two tables with constraints of the same name? Is that even consistent between types of constraints? What about between database vendors?

It turns out that it’s neither consistent between types, nor is the way in which it’s inconsistent consistent between database vendors.

                                  MySQL 5.5         PostgreSQL 9.2
  Unique constraint names         Local to table    Global to DB
  Foreign Key constraint names    Global to DB      Local to table

From MySQL:

mysql> CREATE TABLE foo (x INTEGER, 
    ->   CONSTRAINT foo_unique UNIQUE(x));
Query OK, 0 rows affected (0.01 sec)

-- Second UNIQUE constraint CAN be created with same name
mysql> CREATE TABLE foo2 (x INTEGER, 
    ->   CONSTRAINT foo_unique UNIQUE(x));
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE TABLE foo3 (x INTEGER, 
    ->   CONSTRAINT foo_fk FOREIGN KEY (x) REFERENCES foo(x));
Query OK, 0 rows affected (0.01 sec)

-- Second FOREIGN KEY constraint CANNOT be created with same name
mysql> CREATE TABLE foo4 (x INTEGER, 
    ->   CONSTRAINT foo_fk FOREIGN KEY (x) REFERENCES foo(x));
ERROR 1005 (HY000): Can't CREATE TABLE 'test.foo4' (errno: 121)

From PostgreSQL:

postgres=# CREATE TABLE foo (x INTEGER, 
postgres-#   CONSTRAINT foo_unique UNIQUE (x));
CREATE TABLE

-- Second UNIQUE constraint CANNOT be created with same name
postgres=# CREATE TABLE foo2 (x INTEGER,
postgres-#   CONSTRAINT foo_unique UNIQUE (x));
ERROR:  relation "foo_unique" already exists

postgres=# CREATE TABLE foo3 (x INTEGER,
postgres-#   CONSTRAINT foo_fk FOREIGN KEY (x) REFERENCES foo(x));
CREATE TABLE

-- Second FOREIGN KEY constraint CAN be created with same name
postgres=# CREATE TABLE foo4 (x INTEGER,
postgres-#   CONSTRAINT foo_fk FOREIGN KEY (x) REFERENCES foo(x));
CREATE TABLE

MySQL Performance: MyISAM or InnoDB?

Friday, November 16th, 2012

A friend of mine posed the following question recently:

Do you have an opinion over using myisam or innodb for a read heavy MySQL DB?

I do indeed have an opinion on this topic. It is thus: Use InnoDB for everything, use MyISAM for nothing.

In some speed tests, it might appear that MyISAM is faster for read-heavy workloads. However, that only translates to reality when you’ve got a database which is doing nothing apart from read-only work (e.g. a data warehouse system generating reports during the day and importing new transaction data at night).

In most environments, the speed you get is measured by the amount of concurrency you can support, not by the throughput of a single transaction running without anything else running on the machine. InnoDB allows reads and writes simultaneously (row-level locking; multi-version concurrency control), whereas MyISAM just locks the whole table as soon as you do anything.

Take, for example, a system where one user is reading their data, another user is writing their data, and a third user is running some reports over the entire database. With MyISAM the following will happen:

  • As soon as e.g. a minute-long report query starts, nothing else will be able to write to the database until it ends. This means that all your users who want to write to the database will simply wait that number of minutes. Further, if it’s a web application, each of those waiting users ties up an Apache child process, so any further read-only requests will also fail once Apache runs out of available child processes.
  • If you have a number of individual users doing small write queries and small read queries, the writes will wait on the reads and vice versa. This means you will have low concurrent throughput.

With InnoDB, on the other hand, writes lock only the rows involved, and only against other writes.

  • The minute-long report query doesn’t lock anything. Other users can read and write anything they like.
  • Multiple individual users can write to their own rows concurrently without waiting on each other.
  • Even after a “commit” has been issued for a change, the old value is stored. This means that the minute-long report sees the values of the data which were valid at the point in time the statement or transaction started, even if, during the execution of the statement, another transaction altered the data and committed that change (“consistent read”).

And then there is the long list of non-performance-related advantages of InnoDB over MyISAM: the fact that InnoDB allows you to use transactions, allows foreign key constraint checking, and various other things which you’ll be wanting to use.
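If you want to see which engine a table is currently using, and convert it, that’s one statement per table. A minimal sketch (the table name is invented):

-- Which engine is each table using? Look at the "Engine" column.
SHOW TABLE STATUS;

-- Convert a single table in place (the table is locked while it is rebuilt)
ALTER TABLE orders ENGINE=InnoDB;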

BTW if you’re going to do a migration, a company I worked for did this once:

Rather than doing a “turn off; migrate; turn on” scenario, they set up a new server with the new storage engine and turned on replication. A few weeks later, all the data was replicated to the new server. Then they just turned off the old server, made sure the two instances were in sync, and turned on the new server, which reduced the downtime by quite a bit. That also works from MySQL 4.x -> 5.0, and 5.0 -> 5.1; indeed, going from 4.x MyISAM to 5.1 InnoDB involved a chain of databases all replicating from one another.

For more information on migration, see: http://www.mysqlperformanceblog.com/2009/01/12/should-you-move-from-myisam-to-innodb/

curtime

Thursday, July 7th, 2011

The MySQL function curtime has (at least) the following issues:

  1. The function returns the current time as a number, e.g. quarter past 8 in the evening would be returned from the function as the number 201500. (So if you subtract one from such a number you get 201499, which has no meaning.)
  2. The function only returns this numeric form if you use it in an “integer context”, i.e. (x+0) causes x to be evaluated as an integer. (Otherwise it produces a string in a different format; see the sketch after this list.)
  3. At some point the function was changed not to return an integer but to return a decimal with 6 decimal places. As far as I can tell these 6 decimal places are always zero. The example in the documentation also supports this experience.
  4. There is no way to specify a timezone. When going from a point in time (e.g. unix time or simply the concept of “now”) to/from an external format (hours, minutes, year, …) you must always specify a timezone. The concept of the “default timezone” should just be eradicated and deleted from the world. If I had a time-machine, the first thing I would do is go back and prevent this concept’s creation. (The user must see the data in their timezone, which on a web application needn’t be the same as the server’s. And if you are writing a desktop application, you need to decide explicitly to use the user’s timezone, as not all data might be for the user, e.g. data from a server or database.)
  5. I don’t really think the name of the function is very good (saving characters by abbreviating the name, but why?)
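A minimal sketch of points 1 and 2 (the exact output varies a little by version, e.g. whether the numeric form carries the 6 zero decimal places):

SELECT CURTIME();          -- '20:15:00'   (string context)
SELECT CURTIME() + 0;      -- 201500       (integer context; newer versions: 201500.000000)
SELECT CURTIME() + 0 - 1;  -- 201499       (a "time" with no meaning)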

Why would you ever use such a function? Why would you ever add it to your database product? *Shakes Head*

I welcome feedback of the form “this is an amazing function, I’ve used it in my application and it fitted my need exactly”. I really do welcome this. Please don’t hold back.

Related: curdate, now.

MySQL “lock tables” does an implicit commit

Thursday, May 7th, 2009

The MySQL “lock tables” (and “unlock tables”) command has a nasty side-effect: it implicitly commits the current transaction.

This caused a bug in production code. A normally irrelevant temporary error, which should have caused a rollback, only rolled back to the last “unlock tables” command, due to its implicit commit, and thus left the database in an inconsistent state. When the request was retried, the software found the database in a state it wasn’t expecting and so couldn’t process the request, even though the irrelevant temporary error had by then been fixed. So the request got retried indefinitely, and the result was that a lot of orphan invoices were created in the database. That wouldn’t have been a problem, except that invoice numbers were consumed by these unused invoices, and Austrian law dictates that invoice numbers must be sequential, i.e. the numbering system shouldn’t have huge holes in it. Or, as in our case, a number of huge holes: the system was up and running for other requests during this bug (a feature – one faulty request shouldn’t take the whole system down), so other invoices did get correctly generated at the time, and their invoice numbers were then small islands in the sea of unused/invalid invoice numbers produced by the bug.

I wish this had been more obviously documented (MySQL 5.1 lock tables documentation). It is there in the bullet points (and the comments!), but it’s hardly well emphasized, and I missed it.

The trouble is that the “lock tables” command comes from the stone-age of MySQL before it supported transactions at all.

This has been “fixed” in MySQL 6.0 (MySQL 6.0 changelog), via the introduction of the new options “in share mode” / “in exclusive mode”.

In the meantime, if you need to lock a table (which one needs to do in order to ensure table-wide sequential numbering, as a primary key isn’t designed for that), I would recommend creating a table with a name like “mutex”, creating a row for each type of table-lock you might like to acquire, and then doing a “select for update” on the row corresponding to your table. This creates an in-transaction row-level lock, doesn’t do an implicit commit, and the lock is automatically released when the transaction ends.
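A minimal sketch of that pattern (the table and row names are my own invention):

-- One row per logical table-wide lock you might want to acquire
CREATE TABLE mutex (
    name VARCHAR(50) NOT NULL PRIMARY KEY
) ENGINE=InnoDB;

INSERT INTO mutex (name) VALUES ('invoice_numbering');

-- Inside the transaction that needs the lock:
START TRANSACTION;
SELECT name FROM mutex WHERE name = 'invoice_numbering' FOR UPDATE;
-- ... allocate the next invoice number, insert the invoice, etc. ...
COMMIT;  -- the row lock is released here (or on ROLLBACK)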

Automatic reconnect from Hibernate to MySQL

Friday, October 24th, 2008

Yesterday I spent the entire day getting the following amazing state-of-the-art not-ever-done-before feature to work:

  • Executing a SQL statement from my program

Because, as everyone knows, I don’t suffer from NIHS, I used standard object-relational mapping software Hibernate, with a standard programming language Java, using the standard web-application server Tomcat, and now I am using the standard “connection pooling” software C3P0 (which I didn’t know I needed to execute a SQL statement, see below..)

The program is, in fact, already complete, and is nearly deployed. On the test server it works fine, and even on the (future) live server it worked fine. But the customer noticed that if one installed it one day, the next day it didn’t work. I’ve had such symptoms many times before, so I knew immediately what was going on:

  • MySQL drops a connection after 8 hours (configurable)
  • The software is used during the day, but isn’t used during the night, therefore the connection times out in the night
  • Therefore in the morning, the program one installed the day before no longer works

Perhaps I exaggerated the simplicity above of what I was really trying to achieve. It should really be expressed as the following:

  • Executing a SQL statement from my program, even if a long time has passed since the last one was executed

But that amounts to the same thing in my opinion! It isn’t rocket science! (But in fact is, see below..)

An obvious non-solution is to increase the “connection drop after” time on the MySQL server from 8 hours to e.g. “2 weeks” (“wait_timeout” in “mysql.cnf”). But software has got to be capable of reconnecting after a connection drops. The database server may need to be reset, it may crash, it may suffer hardware failure, etc. If, every time one restarts one particular service, one has to restart a thousand dependent services (maybe some Java, some Perl, some PHP, some robots, ..) and then maybe restart services which are dependent on them – that’s a maintenance nightmare. So the software has to be altered to be able to handle connection drops automatically, by reconnecting. Once the software has been so altered, one no longer needs to alter the “wait_timeout” on the server.
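For reference (and not as the fix), the timeout can be inspected and changed at runtime; the two-week figure is just the example from above:

SHOW VARIABLES LIKE 'wait_timeout';
SET GLOBAL wait_timeout = 1209600;  -- 2 weeks in seconds; only affects new connections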

The error was:

org.hibernate.util.JDBCExceptionReporter: The last packet successfully received from the server was 56697 seconds ago. The last packet sent successfully to the server was 56697 seconds ago, which  is longer than the server configured value of ‘wait_timeout’. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property ‘autoReconnect=true’ to avoid this problem.

Quite a helpful error message, don’t you think? But

  • I’m not going to increase “wait_timeout” as discussed above,
  • “testing validity” in the application – well I was using standard software Hibernate which should take care of this sort of thing automatically, but evidently wasn’t
  • and we were already using ?autoReconnect=true in the JDBC URL (this evidently wasn’t working).

I figured I really needed to get to the bottom of this. Googling just showed (many) people with the same problem, but no solutions. The only way to get to the bottom of software is to read the source. (It has been the way to resolve issues of simple things simply not working in MySQL before.)

I stopped looking in the MySQL source for why “autoReconnect=true” didn’t work when I saw the following text in the source describing the autoReconnect parameter:

The use of this feature is not recommended, because it has side effects related to session state and data consistency

I have no idea what particular side-effects are meant here. I guess that’s left as an exercise for the reader, to test their imagination.

And anyway, I figure that a reconnect facility belongs in the “application” (Hibernate in my case) as opposed to in database-vendor-specific code. I mean exactly the same logic would be necessary if one were connecting to PostgreSQL or Oracle, so it doesn’t make sense to build it into the database driver.

So then I looked in the Hibernate code. To cut a long story short, the basic connection mechanism of Hibernate (as specified in all the introductory books and websites, which is probably how most people learn Hibernate) doesn’t support reconnecting; one has to use the C3P0 connection pool (which itself didn’t always support reconnecting).

(I don’t want to use container/Tomcat-managed connections, as I have some command-line batch jobs, and I don’t want to use different code for the batch jobs than for the web application. Although another company defined Servlets which were “batch job servlets”, and these were executed via a “wget” entered into crontab – to get the use of container-managed connections – but that seems too complex a solution for my taste.)

But once one’s using C3P0, the default behavior seems to be that if the connection is dead when a request comes in, the user sees an error – but at least it reconnects for the next request. I suppose one error is better than infinite errors, but still not as good as zero errors. It turns out one needs the option testConnectionOnCheckout – which the documentation doesn’t recommend, because testing the connection before each request might lead to lower performance. Surely the software firstly has to work; only secondly does it have to work fast.

So, to summarize, to get a connection to “work” (which I define as including handling dropped connections by reconnecting without error): In “hibernate.cfg.xml”:

<!-- hibernate.cfg.xml -->
<property name="c3p0.min_size">5</property>
<property name="c3p0.max_size">20</property>
<property name="c3p0.timeout">1800</property>
<property name="c3p0.max_statements">50</property>
<property name="connection.provider_class">
   org.hibernate.connection.C3P0ConnectionProvider</property>
<!-- no "connection.pool_size" entry! -->

Then create a file “c3p0.properties” which must be in the root of the classpath (i.e. no way to override it for particular parts of the application):

# c3p0.properties
c3p0.testConnectionOnCheckout=true

Amazing, that that stuff doesn’t just work out of the box. Programming the solution myself in Uboot took, I think, 1 line, and I’m sure it’s not more in WebTek either.

That was an amazing amount of effort and research to get the simplest thing to work. Now if only this project had been paid by the hour…..

[Update 28 May 2009] More Java hate today. Started a new application, deployed it, and it didn’t work. In the morning, the application was down. Reason: the new project used Hibernate 3.3, and upgrading from 3.2 to 3.3 requires the “connection.provider_class” property to be set. Previously the presence of “c3p0.max_size” was enough.

Reading row-by-row into Java from MySQL

Thursday, February 7th, 2008

Trying to read a large amount of data from MySQL in Java with a single query is not as easy as one might think.

I want to read the results of the query a chunk at a time. If I read it all at once, the JVM understandably runs out of memory. In this case I am stuffing all the resulting data into a Lucene index, but the same would apply if I was writing the data out to a file, another database, etc.

Naively, I assumed that this would just work by default. My initial program looked like this (I’ve left out certain things such as closing the PreparedStatement):

public void processBigTable() throws SQLException {
    PreparedStatement stat = connection.prepareStatement(
        "SELECT * FROM big_table");
    ResultSet results = stat.executeQuery();
    while (results.next()) { ... }
}

It failed with the following error:

Exception in thread "main"
        java.lang.OutOfMemoryError: Java heap space
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2823)
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2763)
    ...
    at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1657)
    ...

The line it failed at was the executeQuery. So, as we can see from the stack backtrace, it’s clearly trying to load all the results into memory simultaneously.

I tried all sorts of things, but it was only after I took a look at the MySQL JDBC driver code that I found the answer. In StatementImpl.java:

protected boolean createStreamingResultSet() {
    return ((resultSetType == ResultSet.TYPE_FORWARD_ONLY)
        && (resultSetConcurrency == ResultSet.CONCUR_READ_ONLY)
        && (fetchSize == Integer.MIN_VALUE));
}

This boolean function determines if it’s going to use the approach “read all data first” or “read rows a few at a time” (= “streaming” in their terminology). I clearly need the latter.

You can specify, using the generic JDBC API, the number of rows you want to fetch at once (the “fetchSize”). Why would you have to set that to Integer.MIN_VALUE, which is stated to be −2³¹, in order to get streaming data? I wouldn’t have guessed that.

Basically this very important decision about which approach to use, which in my case amounts to “program works” or “program crashes”, is left to a test of whether three variables are set to particular values. I am not aware whether this is in the documentation (I didn’t find it), nor whether this decision is guaranteed to be stable, i.e. won’t change in some future driver version.

Now my code looks like the following:

public void processBigTable() throws SQLException {
    PreparedStatement stat = connection.prepareStatement(
        "SELECT * FROM big_table",
        ResultSet.TYPE_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY);
    stat.setFetchSize(Integer.MIN_VALUE);  // switches the driver to streaming mode
    ResultSet results = stat.executeQuery();
    while (results.next()) { ... }
}

This code works, and reads chunks of rows at a time.

Well I’m not sure if it reads chunks of rows at a time, or just one row at a time. I hope it doesn’t read one row at a time, because that would be very inefficient in terms of number of round trips from the software to the database. I assumed this was what the fetchSize parameter was controlling, so you could tune the size of the chunks to meet your particular latency and memory setup. But being forced to set it to a large negative number in order to get it to work means one has no control over the size of the chunks (as far as I can see).

(I am using Java 6 with MySQL 5.0 and the JDBC driver “MySQL Connector” 5.1.15.)

Interesting Oracle/MySQL locking difference

Wednesday, September 5th, 2007

I know the rules for Oracle row locking well. A row can be locked for write if one updates it, or if one “select for update”s it.

  1. create table a (x number); (and equivalent in MySQL for InnoDB)
  2. Session A: insert into a values (9);
  3. Session A: commit;
  4. Session A: start transaction (in MySQL)
  5. Session A: select * from a where x=9 for update;
  6. Session B: start transaction (in MySQL)
  7. Session B: select * from a where x=9 for update;
  8. Session B hangs, waiting for the row-level lock to be released from the row by Session A
  9. Session A: update a set x=4;
  10. Session A: commit;
  11. At this point, Session B returns no rows. The lock has been released, and the row no longer conforms to the where clause, so it is not returned.
  12. Session A: update a set x=5;

This is where the difference occurs:

  • Oracle returns the row to session A. The command “for update” in session B did not return any rows, and thus no rows were locked, and thus session A has nothing to wait on.
  • MySQL (version 4.1.18) blocks session A, waiting for the transaction in session B to end. That means A requires a lock owned by B. But what is this lock? Is it a row-level lock on the row which was not selected? Some other type of lock?
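One way to at least see what InnoDB thinks it is holding and waiting on is the InnoDB monitor output; its TRANSACTIONS section lists lock waits per transaction (the command was renamed in later versions):

SHOW INNODB STATUS;         -- MySQL 4.1
SHOW ENGINE INNODB STATUS;  -- MySQL 5.x onwards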

The reason I tested this was that it had just occurred to me that in a “select for update” the where clause is applied twice:

  1. Firstly to determine which rows to return. A lock is requested for those rows, which can involve waiting until the lock is released if it is owned by some other transaction.
  2. After the lock has been acquired, the rows are checked again to see if they still conform to the where clause. If not, they are not returned (although the session blocked waiting for them).

I suspect that both databases simply do step 1, involving waiting for and acquiring the lock on the rows which initially conform to the where clause. And both apply the where clause in step 2, as in both cases no rows were actually returned despite the wait. But Oracle, upon noticing the row will not be returned, actually releases the lock, as the command implies that only rows returned will be locked. MySQL just leaves the lock from step 1 hanging around, which is wrong.

MySQL's "enum" datatype is a good thing

Friday, March 23rd, 2007

I’ve often had discussions with people about whether the “enum” type in MySQL is a good thing or not. Basically there are two ways to use your database:

  1. As an unstructured bunch of “stuff” to store whatever the software needs to persist. Such databases use lots of “blob” data with serialized objects (it’s easy to program), tables with multiple functions (“object” table with “type” column), few constraints, and so on.
  2. As a representation of the data the program is trying to model. Such databases have meaningful column names, and two different types of things go in two different tables. Adding constraints is easy.

There are good arguments for both. As one is programming in a programming language, whatever is most convenient for that programming language (e.g. serializing a bunch of objects to a blob) means less effort and less code: which means less cost and fewer bugs. Those are all good things.

But for some reason I’ve always been the 2nd type. I like to look at a database and understand what data is being modeled. It creates a certain self-documentation which can often be lacking from complicated software. Constraints can be added which act like functional assertions (functional in the sense that they involve no state: you say that this value must be between 1 and 10 and then it is that way; you don’t have to program any “path” or state to check that).

That an item can be in exactly one of a distinct set of states is a fact of life in all types of domain modeling:

  • What state is this invoice in? Is it “paid”, “open”, “paid+processed”?
  • Is this item deleted or not? (Or marked temporarily suspended, pending administrator checking the content)
  • Is this photo public, or does it belong to a user?
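For example, the invoice states above might be declared directly in the schema. A sketch with invented names:

CREATE TABLE invoice (
    id    INTEGER NOT NULL PRIMARY KEY,
    state ENUM('open', 'paid', 'paid_processed') NOT NULL
) ENGINE=InnoDB;

-- Rejected in strict mode ('foo' is not a value the domain allows);
-- in non-strict mode it is stored as '' with a warning.
INSERT INTO invoice (id, state) VALUES (1, 'foo');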

To explicitly work with enumerations is to acknowledge their existence within the domain, and to create a more 1:1 mapping between the domain and its implementation. That’s what all software should be about, modeling. Ideally a program is nothing more than a model and a few rules for how that model transitions between one consistent state and another.

If you program databases like #2 above, like I do, then enumerations really do make the data model richer.

  • The database maps to the domain easier (thereby documenting the domain, in case this is not done elsewhere)
  • The database maps to a domain-consistent implementation in the programming language easier
  • The database implicitly then has a constraint, as you can’t set the column to be some value which makes no sense for the domain (and therefore the program)

So enums in DBs are a good thing, and should be used wherever possible. Oracle should implement them just like MySQL does. And enums should be used in programming languages too. Why enums appeared in Java only since 1.5 I don’t know. And the fact that PHP, Perl etc. do not support them is yet another reason not to use those programming languages to try to build any software system whose function it is to model a domain.

It can be said that this makes it “more difficult” to add a new state, as now you have to “change the database as well”. But if the model changes, and the database reflects the model, then that’s a good thing, not a bad thing. And it isn’t even much effort: if you say that invoices can be in a new state, then there’ll be a lot of programming work to support that change (UI, billing logic, robots, test scripts). The “alter table” statement is no work at all proportionally, and you’ll also be thankful for every extra constraint the database or programming language can offer you (such a change feels very scary in a Perl program, but not so much in a Java program).
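The “alter table” in question really is trivial – again a sketch with the invented names from above:

ALTER TABLE invoice
    MODIFY state ENUM('open', 'paid', 'paid_processed', 'refunded') NOT NULL;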

(Enums are supported in PostgreSQL since v8.3)

How MySQL reduces error messages in your program

Thursday, February 1st, 2007

Ah MySQL (at least MyISAM) so isn't a real database!

Firstly, when doing an insert, I did some arithmetic. The numeric column was of a certain width. If the result of the arithmetic was larger than the maximum allowed value, my number was just silently turned into that maximum allowed value, without warning or error. A large number suddenly becoming some other large number may sound good in the philosophy of "errors are bad – we want to minimize errors!", but literally it's never what you want. Oracle gives an error if a number is too big to be stored in a column. Which is what you always want.
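A small sketch of that clamping behavior, assuming the default non-strict sql_mode (in strict mode you do get an error):

CREATE TABLE t (n TINYINT) ENGINE=MyISAM;
INSERT INTO t VALUES (999);  -- warning only, not an error
SELECT n FROM t;             -- returns 127, the column's maximum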

Secondly, due to above arithmetic overflow errors, my insert statement was failing (as multiple values that should have been distinct, but beyond the maximum, were then identical, equal to the maximum). I kept on doing it and it kept on failing. Then I looked at the table and each time I'd done such an unsuccessful insert (a single statement to insert maybe 10k rows) some rows (but not all – due to the error) were getting inserted. Having half a statement succeed is never what you want! Oracle sets an invisible checkpoint before each statement and if the statement fails, rolls the database back to that checkpoint. That's always what you want!