Archive for the ‘Java’ Category

Java gotcha: anArray.hashCode isn’t deep

Thursday, February 14th, 2008

Every object has a hashCode and an equals method. A hash-based collection uses hashCode to decide which bucket an object goes into, and equals to decide whether two objects that land in the same bucket actually are the same. If you add objects to a Set, which stores only unique objects, it uses these methods to determine whether two objects are the same and thus shouldn’t both be stored.

If you have code like:

Set<byte[]> uniqueArrays = new HashSet<byte[]>();
uniqueArrays.add(new byte[] { 1,2,3 });
uniqueArrays.add(new byte[] { 1,2,3 });
uniqueArrays.add(new byte[] { 1,2 });
System.out.println(uniqueArrays.size() + " unique byte arrays");

This code prints 3. You might expect this program to print 2, as there are only two unique arrays within the Set. But arrays inherit hashCode and equals from Object, so both are based on object identity and not on contents: two different arrays with the same contents are not equal, and their hashCode methods do not (in general) return the same result. This is in contrast to, for example, the String class, which does indeed consider the String’s contents when computing the hashCode.

Set<String> uniqueStrings = new HashSet<String>();
uniqueStrings.add(new String("123"));
uniqueStrings.add(new String("123"));
uniqueStrings.add(new String("12"));
System.out.println(uniqueStrings.size() + " unique strings");

This code prints 2. (The slightly strange-looking “new String” here is to make sure that there are actually different object instances with the same content being passed to the add method; otherwise the Java compiler would use the same object instance for the two calls, as the string-content is the same.)

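The solution is to use the Arrays.hashCode(anArray) method (and Arrays.equals(a, b) to compare the contents).

This isn’t particularly convenient if you want to store unique arrays in a set, because the set calls the array’s own hashCode rather than Arrays.hashCode. But if you have an object with e.g. a byte[] instance variable, then you can implement the hashCode method on that object to use Arrays.hashCode, or you can use the code below. Be aware that it keys on the hash code alone, so two different arrays which happen to share a hash code would overwrite one another: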

Map<Integer, byte[]> map = new HashMap<Integer, byte[]>();
map.put(Arrays.hashCode(anArray), anArray);
Collection<byte[]> uniqueByteArrays = map.values();
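
If you do want a real set of unique arrays, one option is to wrap each array in a small key object whose equals and hashCode delegate to Arrays.equals and Arrays.hashCode. The following is only a sketch: the names ByteArrayKey and countUnique are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ByteArrayKeyDemo {

    /** Hypothetical wrapper: compares byte arrays by content rather than by identity. */
    static final class ByteArrayKey {
        private final byte[] bytes;

        ByteArrayKey(byte[] bytes) {
            // Copy, so later changes to the caller's array can't alter the key
            this.bytes = bytes.clone();
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof ByteArrayKey
                && Arrays.equals(bytes, ((ByteArrayKey) other).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }
    }

    /** Counts how many of the given arrays differ in content. */
    static int countUnique(byte[]... arrays) {
        Set<ByteArrayKey> unique = new HashSet<ByteArrayKey>();
        for (byte[] array : arrays) {
            unique.add(new ByteArrayKey(array));
        }
        return unique.size();
    }

    public static void main(String[] args) {
        int n = countUnique(new byte[] { 1, 2, 3 }, new byte[] { 1, 2, 3 }, new byte[] { 1, 2 });
        System.out.println(n + " unique byte arrays"); // prints "2 unique byte arrays"
    }
}
```

Unlike the map keyed on the hash code, this still keeps two different arrays apart when their hash codes collide, because equals compares the actual contents.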

Creating an Iterator for a streaming ResultSet in Java

Monday, February 11th, 2008

The Java Iterator interface requires one to implement a hasNext method, which reports whether there are further items to iterate over. The MySQL driver’s implementation of the JDBC ResultSet object, if one uses streaming mode, throws an exception from its isLast method. (Streaming mode prevents the JVM from running out of memory, which it would do if it tried to fetch all the results at once.)

Therefore I’ve developed an Iterator class based on such a ResultSet whose “next” method actually pre-fetches the row after the current one. The Iterator’s “hasNext” method therefore just returns whether a pre-fetched row exists. And the “next” method returns the pre-fetched row, and fetches the one after it.

And in order to make this code reusable, it’s an abstract superclass, and you can implement a method in a concrete subclass which converts the row into an object of your choosing. And thus the concrete subclass will provide an implementation of Iterator<T> for your T.

And to make this code reusable to people other than me, I hereby make it available.

ResultSetIterator.java
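
The linked file isn’t reproduced here. As a rough sketch of the idea described above, and not the file’s actual contents, such a class might look like the following. The names PrefetchingResultSetIterator and convertRow are invented for this sketch, and wrapping SQLException in a RuntimeException is just one possible choice.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Sketch of a pre-fetching Iterator over a forward-only ResultSet. */
abstract class PrefetchingResultSetIterator<T> implements Iterator<T> {

    private final ResultSet results;
    private T prefetched;          // the object the next call to next() will return
    private boolean hasPrefetched; // false once the ResultSet is exhausted
    private boolean started;       // the first row is fetched lazily

    protected PrefetchingResultSetIterator(ResultSet results) {
        this.results = results;
    }

    /** Implemented by the concrete subclass: converts the current row into a T. */
    protected abstract T convertRow(ResultSet row) throws SQLException;

    /** Advances the ResultSet by one row and converts that row, if there is one. */
    private void fetch() {
        try {
            hasPrefetched = results.next();
            prefetched = hasPrefetched ? convertRow(results) : null;
        } catch (SQLException e) {
            // Iterator methods can't throw checked exceptions
            throw new RuntimeException(e);
        }
    }

    private void ensureStarted() {
        if (!started) {
            started = true;
            fetch();
        }
    }

    @Override
    public boolean hasNext() {
        ensureStarted();
        return hasPrefetched;
    }

    @Override
    public T next() {
        ensureStarted();
        if (!hasPrefetched) {
            throw new NoSuchElementException();
        }
        T current = prefetched;
        fetch(); // pre-fetch the row after the one being returned
        return current;
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}
```

A concrete subclass only has to implement convertRow, for instance returning row.getString(1) to get an Iterator<String> over the first column. Note that isLast is never called, so the sketch relies only on ResultSet.next.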

Reading row-by-row into Java from MySQL

Thursday, February 7th, 2008

Reading a large amount of data from MySQL into Java with a single query is not as easy as one might think.

I want to read the results of the query a chunk at a time. If I read it all at once, the JVM understandably runs out of memory. In this case I am stuffing all the resulting data into a Lucene index, but the same would apply if I was writing the data out to a file, another database, etc.

Naively, I assumed that this would just work by default. My initial program looked like this (I’ve left out certain things such as closing the PreparedStatement):

public void processBigTable() throws SQLException {
    PreparedStatement stat = connection.prepareStatement(
        "SELECT * FROM big_table");
    ResultSet results = stat.executeQuery();
    while (results.next()) { ... }
}

It failed with the following error:

Exception in thread "main"
        java.lang.OutOfMemoryError: Java heap space
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2823)
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2763)
    ...
    at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1657)
    ...

The line it failed at was the executeQuery. So as we can see from the stack trace, it’s clearly trying to load all the results into memory simultaneously.

I tried all sorts of things, but it was only after I took a look at the MySQL JDBC driver code that I found the answer. In StatementImpl.java:

protected boolean createStreamingResultSet() {
    return ((resultSetType == ResultSet.TYPE_FORWARD_ONLY)
        && (resultSetConcurrency == ResultSet.CONCUR_READ_ONLY)
        && (fetchSize == Integer.MIN_VALUE));
}

This boolean function determines if it’s going to use the approach “read all data first” or “read rows a few at a time” (= “streaming” in their terminology). I clearly need the latter.

You can specify, using the generic JDBC API, the number of rows you want to fetch at once (the “fetchSize”). Why would you have to set that to Integer.MIN_VALUE, which is −2^31 (−2147483648), in order to get streaming data? I wouldn’t have guessed that.

Basically this very important decision about which approach to use, which in my case amounts to “program works” or “program crashes”, is left to a test of whether three variables are set to particular values. I don’t know whether this is in the documentation (I didn’t find it), nor whether this decision is guaranteed to be stable, i.e. won’t change in some future driver version.

Now my code looks like the following:

public void processBigTable() throws SQLException {
    PreparedStatement stat = connection.prepareStatement(
        "SELECT * FROM big_table",
        ResultSet.TYPE_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY);
    stat.setFetchSize(Integer.MIN_VALUE);
    ResultSet results = stat.executeQuery();
    while (results.next()) { ... }
}

This code works, and reads chunks of rows at a time.

Well I’m not sure if it reads chunks of rows at a time, or just one row at a time. I hope it doesn’t read one row at a time, because that would be very inefficient in terms of number of round trips from the software to the database. I assumed this was what the fetchSize parameter was controlling, so you could tune the size of the chunks to meet your particular latency and memory setup. But being forced to set it to a large negative number in order to get it to work means one has no control over the size of the chunks (as far as I can see).

(I am using Java 6 with MySQL 5.0 and the JDBC driver “MySQL Connector” 5.1.15.)

Random unreproducible Java error of the day

Monday, January 21st, 2008

I’m really kind of of the opinion that Java Servlets, at least when using Tomcat and the other open source tools, don’t work. I mean, surely it can’t be difficult to implement a Servlet container or logging framework!

I just tried to start Tomcat and it refused to start because of the following error:

log4j:ERROR Error occured while converting date.
java.lang.NullPointerException
  at java.lang.System.arraycopy(Native Method)
  at java.lang.AbstractStringBuilder.getChars
  at java.lang.StringBuffer.getChars
  at org.apache.log4j.helpers.ISO8601DateFormat.format
  at java.text.DateFormat.format
  ...
  at org.apache.log4j.Category.log
  at org.apache.commons.logging.impl.Log4JLogger.error
  ...
  at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt
  at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run
  at java.lang.Thread.run
log4j:ERROR Error occured while converting date.

So I just hit “start” again, and this time it starts without error.

And people trust their mission-critical server architecture to this stuff!

Web software front-end test cases

Wednesday, January 16th, 2008

Recently I was working on a website which was developed in PHP without a web framework. A lot of things were programmed manually which would normally be taken care of by a web framework (for example things like: in case of an error, the HTML fields on the form are re-populated on the response page).

So I came up with an extensive set of test cases, and made sure I used them all on every field on every page.

These are the tests, in no particular order.

  1. On a form with radio buttons and text fields next to the radio buttons, clicking on the radio button should position the text cursor in the text field (as typing there is the next thing you’re always going to want to do).
  2. Clicking on the text field by a radio button should select the radio button.
  3. Checkboxes and radio buttons which have text beside them must have the text in a <label> so that clicking the text selects the checkbox or radio button.
  4. On a <form>, pressing return must do the same as clicking “OK”.
  5. Pressing the TAB key on the keyboard should progress from one field to the next in a reasonable order.
  6. Every action where something is done (e.g. delete an FTP account) should have a confirmation text on the result page like “ftp account XYZ deleted”. I think it’s important to include the name of the object being acted upon in the message.
  7. All forms should use “post” and “get” appropriately. I mean various people have various strong views about when to use one and when to use the other. But for me the difference is the browser’s “are you sure you want to repost?” message when you click refresh. Do you want it? If so use “post”, otherwise use “get”. Also bookmarkability.
  8. Is there a reasonably small amount of HTML generated? E.g. in uboot, displaying the address book page requires about 1MB of HTML to be downloaded to the browser (not including any external CSS, graphics etc.). That’s too much.
  9. The back button should work everywhere. (E.g. Uboot multimedia gallery: click on a picture, click back, you’re at page 1 of the gallery rather than the page you were on.)
  10. Strings too long? Then “…” should be displayed, to avoid breaking the layout.
  11. If breadcrumbs are used (i.e. main page > hosting > ftp accounts > ftp account ‘x’), then they must work. Ideally they should contain data such as “ftp account ‘x’”.
  12. While loading a page on a slow connection, is the text readable before the background image loads? E.g. white text on a black background image may not be readable before the image loads. There should be an alternative flat-colour background of a similar colour. (Thanks Helge for pointing this out a few years ago!)
  13. While loading the page, does the page move around a lot due to width=x height=x attributes missing on an <img> tag? It’s annoying when you start to read the text on the page and then it moves a few seconds later just because some logo has finally loaded.
  14. Every time there is an error in the form and the page is reloaded with the error: is the original data still in the form?
  15. In all places where the user may enter free text: Do weird non-Latin1 characters work correctly, both entering them and displaying them? Do ” ‘ work correctly? Are < > & displayed correctly? Is all of this written to the database correctly? (Sometimes the browser sends &#123; style code to the server. If this isn’t escaped on the display code then the character may appear to have been processed correctly. But you don’t want &#123; strings in your database data.)
  16. Type long values into text fields. Values should not be truncated, nor should an internal error be produced (the default behaviour if you’re using MySQL or Oracle respectively).
  17. If there are any id=nn type parameters in the URL, then adding or subtracting 1 from the URL should not allow you to view other people’s content.
  18. Is the <title> tag set usefully? This is necessary when you use the down-arrow by the “back” button on the browser, to determine how far back you want to go. A whole lot of options such as “MyWebsite”, “MyWebsite”, “MyWebsite” is not very helpful.
  19. Test in a high-latency environment, such as over a UMTS connection. If there are a lot of redirects, or images referenced from CSS referenced from HTML, or Javascript making AJAX calls, processing the result then making more calls, then the page will be slow there, even though it works fine over a LAN connection.
  20. Test in an unreliable environment, e.g. where packets get lost. Google Spreadsheets, when you type in a value, lets you continue editing the page while it’s sending the value to the server. But if there’s an error or timeout with the sending you see an error, and the cell is reverted to its previous value. You simply have to type in the same data all over again. Instead of that it should remember your data, and display a warning “can’t connect to server right now; retrying…”
  21. If you type in URLs in capitals or mixed case: do they still work? (Thanks again Helge!)
  22. AJAX progress indicators: Is it the case that you’ve written an AJAX site, and when the user clicks an action, absolutely no visual feedback is given to the user that he’s clicked? You need to have some kind of feedback, e.g. the “loading…” of gmail.
  23. Are the buttons large enough to be easily clicked on? E.g. confirmation page with “Yes”, “No” options displayed as links in a tiny font. They’re hard to click!
  24. Do the colours work even if you view your laptop at a weird angle? A site I was using recently used a white background (what a surprise..) and to highlight the tool you had selected used a light-grey background. Worked fine on their monitors I’m sure, but at the airport with your laptop on your lap, they’re difficult to see.
  25. Does the site work, or at least fail gracefully on old browsers? An error message immediately is preferable to allowing the user to type in lots of text then lose it when the user presses “OK”, due to some browser incompatibility issue (e.g. MediaWiki on Safari 3 beta for Windows)
  26. What about small screens? My parents’ computer uses 800×600 and I use my laptop in 1064×600 normally. In Google Reader I can hardly see any feeds at that size. The whole screen is taken up with toolbars, menus, etc.
  27. Is the session timeout compatible with the company’s policy? E.g. do you really need the user to log in again after 10 minutes, i.e. when he just had to nip off for a meeting?
  28. If the session expires, what happens to the user’s data? Composing a long email and clicking “send” only to receive the response page “please log in again!”, and losing the email, is wrong.
  29. If there is a possibility to log in on every screen (e.g. “logged out” at the top-right of the screen, or like on uboot), then logging in should take you back to the screen where you were. Because that’s what the user would want.
  30. If you are logged out and go to a particular screen e.g. via URL or bookmark sent while a user was logged in, do you get useful information? A redirect to the homepage is also OK, but “general error” is not.
  31. If you go to a page with a form, is the text cursor already in the first field? Or do you have to reach for the mouse and click on the first field in order to use the form on the page?
  32. Does the site look good on both LCD monitors and conventional monitors? Some colour combinations (e.g. dark green on a light green background) are perfectly readable and look nice on LCDs, but are completely unreadable on conventional monitors.
  33. If a browser window is open, and a user is logged in, and from another browser (or directly in the database) that user’s password is changed, the user deleted, or disabled, is the first browser immediately logged out?
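
For test 15, the display side usually comes down to escaping the characters which are special in HTML before writing user-supplied text into the page. The site in question was written in PHP; the sketch below shows the same idea in Java, to match the other posts in this category. The name escapeHtml is made up for illustration, and real projects would normally use the escaping function of their framework or template engine.

```java
class HtmlEscapeSketch {

    /** Replaces the characters that are special in HTML text and attribute values. */
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<b>Tom & \"Jerry\"</b>"));
        // prints: &lt;b&gt;Tom &amp; &quot;Jerry&quot;&lt;/b&gt;
    }
}
```

Because & is escaped too, a stored string such as &#123; is shown literally instead of being silently turned into a character by the browser, which makes that kind of bad database data visible during testing.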

Java 5 enums can be compared with ==

Thursday, September 6th, 2007

Java Enum instances are singletons. This seems to be not clearly documented by Sun (at least I found it difficult to find). But it’s the case.

What this means is that it’s possible to compare enumerated types by identity, which is cool for readability. (And it means that the switch statement works.)

You don’t have to write this:

if (PurchaseState.complete.equals(anItem.getPurchaseState())) { ...

You can write:

if (anItem.getPurchaseState() == PurchaseState.complete) { ...

This is documented here in the “discussion” section.
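
A small self-contained sketch of both forms follows. The enum here is a stand-in named after the one in the snippets above, and the method names are made up. One practical difference worth knowing: == copes with a null value, whereas a switch on a null enum throws a NullPointerException.

```java
class EnumIdentityDemo {

    // Hypothetical enum, named after the one used in the snippets above
    enum PurchaseState { pending, complete }

    static String describe(PurchaseState state) {
        // == is null-safe: a null state is simply not identical to any constant
        if (state == PurchaseState.complete) {
            return "done";
        }
        return "not done";
    }

    static String describeWithSwitch(PurchaseState state) {
        // Unlike ==, switch throws a NullPointerException if state is null
        switch (state) {
            case complete: return "done";
            default:       return "not done";
        }
    }

    public static void main(String[] args) {
        // valueOf returns the one existing instance, not a copy
        System.out.println(PurchaseState.valueOf("complete") == PurchaseState.complete); // true
        System.out.println(describe(null));                                // not done
        System.out.println(describeWithSwitch(PurchaseState.complete));    // done
    }
}
```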

Java: List<X> or X[] ?

Thursday, July 12th, 2007

Since Java 1.5, one’s been able to parametrize classes using generics, with a syntax similar to C++ templates.

Before Java 1.5, I would always return simple list data structures as arrays. This had two advantages:

  • It was type-safe (e.g. User[] as opposed to List; with the former one knows what’s in the collection, with the latter one doesn’t)
  • One could find out the length of the collection with array.length (in contrast to C arrays)

But since Java 1.5, one has a choice. One could use the Java collections framework, now supporting generics, or still use arrays.

Perhaps it’s because I don’t like change, but I would still advocate using arrays as opposed to Lists:

  • The generic information is thrown away at compile-time, so a List<X> and List<Y> look the same at run-time, whereas X[] and Y[] do not. Introspection, and getting exceptions at the time of a wrong array cast, and not later, are the benefits here.
  • You can easily create an array declaratively. int[] x = new int[] { 1, 2 }; You can’t do quite the same with the collections framework (the closest is Arrays.asList(1, 2), which returns a fixed-size List).
  • I’m sure arrays are faster
  • Arrays are also simpler. I think one should, given two solutions to the same problem, nearly always take the simpler unless there’s a clear benefit of the more complex solution (which I don’t see in this case)

Maybe the points about speed and simplicity aren’t really relevant these days. But collections are still the sort of things which one accesses in inner loops. Consider the ways in which X[] is faster than List<X>:

  • To iterate over the collection, with List<X> you need to create an Iterator. This also happens if you use the “foreach” construct.
  • To get a particular element, or to get the length, you have to call methods, like aList.get(3).

It has been said that using Iterators is preferable to using a for loop over array indexes, for software design reasons. This may be the case in some special situations, but I really don’t think it’s an advantage in common usage.

  • One can iterate over an array or collection with the Java 1.5 “foreach” construct: so in this case the source code looks the same.
  • The code “for (i=0; i<array.length; i++)”, i.e. non-iterator code, is not really difficult to write or difficult to read.
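
The point made further up about run-time type information (an array knows its element type, a List<X> doesn’t) can be seen in a small experiment. This is only an illustrative sketch; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

class ArrayVersusListDemo {

    /** Returns true if storing an Integer into a String[] fails at the moment of the store. */
    static boolean arrayRejectsWrongType() {
        Object[] objects = new String[1]; // legal: arrays are covariant
        try {
            objects[0] = Integer.valueOf(1); // checked against the real element type
            return false;
        } catch (ArrayStoreException expected) {
            return true;
        }
    }

    /** Returns true if a List<String> accepts an Integer through a raw reference. */
    @SuppressWarnings({ "unchecked", "rawtypes" })
    static boolean listAcceptsWrongType() {
        List<String> strings = new ArrayList<String>();
        List raw = strings;          // generics are erased, so this compiles
        raw.add(Integer.valueOf(1)); // no exception at this point
        // The failure would only show up later, when an element is read as a String
        return strings.size() == 1;
    }

    public static void main(String[] args) {
        System.out.println(arrayRejectsWrongType()); // true
        System.out.println(listAcceptsWrongType());  // true
    }
}
```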

Hibernate / Boolean Fields / MySQL 5.0

Wednesday, July 4th, 2007

There’s a problem persisting boolean fields using Hibernate 3.2.2 to MySQL 5.0, if you let Hibernate generate your schema in the default way. It works fine on MySQL 4.1, and it doesn’t matter if you use boolean (primitive) or Boolean (object) types for the fields.

Take a class such as:

public class MyObject {
   protected boolean myField;
   public boolean getMyField() { return myField; }
   public void setMyField(boolean x) { myField = x; }
}

and a Hibernate mapping such as:

<property name="myField" column="my_field" not-null="true" />

and allow Hibernate to generate the schema on startup, e.g. by writing the following in the “hibernate.cfg.xml” file:

<property name="hbm2ddl.auto">create</property>

Against MySQL 4.1 this all works, and the column has the data type tinyint(1). In MySQL 5.0, however, the data type is bit(1) (which seems logical enough), and Hibernate then throws the following unhelpful exception upon every insert:

could not insert: [com.company.MyObject]
org.hibernate.exception.DataException: could not insert: [com.company.MyObject]
at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:77)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:43)
....
Caused by: java.sql.SQLException: Data too long for column 'my_field' at row 1
at ....

The solution is to change the Hibernate mapping for the field to this:

<property name="myField" not-null="true" >
   <column sql-type="BOOLEAN" not-null="true" name="my_field" />
</property>

Then the field is generated as tinyint(1) and then it all works fine again.

Generate Javadoc HTML only for public members

Monday, July 2nd, 2007

In Java there are four protection levels which members (fields and methods) can have:

  1. Private
  2. Protected
  3. Package-level
  4. Public

Any member can have Javadoc (including private members).

But when one generates the Javadoc, which protection levels should be included?

Generated Javadoc is used by humans. These humans are probably not you, and thus are probably clients of your classes, either within or outside of your organization. It’s possible, although unlikely, that they may be able to access package-level members. It’s possible they may need to subclass your class, although in (nearly) all cases I can conceive of, they won’t do that without looking at your source code.

Javadoc should be simple to understand. There’s simply a lot of potentially documentable stuff going on in a class, which is capable of reducing simplicity: for example, setters which only Hibernate needs to see (private), or which only your factories in your package need to see (package-level).

Javadoc should therefore be generated only for public members. That’s what Sun’s JDK docs do as well (for example you don’t see any protected or private stuff here). An additional benefit for simplicity is that, if this is the only level for which the Javadoc is being generated, it doesn’t even state the protection level in the summary, so you see “int getX()” in the method list as opposed to “public int getX()”.

This can be achieved with the “-public” option to the Javadoc generation program. In Netbeans 5.5, right-click on the project in the “projects” tab, select the menu item “properties”, go to the “documentation” entry under the “build” entry, and enter “-public” in the “additional javadoc options” field.