Creating an Iterator for a streaming ResultSet in Java

The Java Iterator interface requires one implements a hasNext method, to determine if the current item is the last to be iterated over, or not. The MySQL driver’s implementation of the JDBC ResultSet object, if one uses streaming mode throws an exception from its isLast method. (Streaming mode prevents the JVM from running out of memory, which it would do if it tried to fetch all the results at once.)

Therefore I’ve developed an Iterator class based on such a ResultSet whose “next” method actually pre-fetches the row after the current one. The Iterator’s “hasNext” method therefore just returns if the row was created or not. And the “next” method returns the pre-fetched one, and fetches the next one.

And in order to make this code reusable, it’s an abstract superclass, and you can implement a method in a concrete subclass which converts the row into an object of your choosing. And thus the concrete subclass will provide an implementation of Iterator<T> for your T.

And to make this code reusable to people other than me, I hereby make it available.

ResultSetIterator.java

Reading row-by-row into Java from MySQL

Trying to read a large amount of data from MySQL using Java using one query is not as easy as one might think.

I want to read the results of the query a chunk at a time. If I read it all at once, the JVM understandably runs out of memory. In this case I am stuffing all the resulting data into a Lucene index, but the same would apply if I was writing the data out to a file, another database, etc.

Naively, I assumed that this would just work by default. My initial program looked like this (I’ve left out certain things such as closing the PreparedStatement):

public void processBigTable() {
    PreparedStatement stat = connection.prepareStatement(
        "SELECT * FROM big_table");
    ResultSet results = stat.executeQuery();
    while (results.next()) { ... }
}

Failed with the following error:

Exception in thread "main"
        java.lang.OutOfMemoryError: Java heap space
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2823)
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2763)
    ...
    at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1657)
    ...

The line it failed at was the exceuteQuery. So as we can see from the stack backtrace, it’s clearly trying to load all the results into memory simultaneously.

I tried all sorts of things but it was only after I took at the MySQL JDBC driver code did I find the answer. In StatementImpl.java:

protected boolean createStreamingResultSet() {
    return ((resultSetType == ResultSet.TYPE_FORWARD_ONLY)
        && (resultSetConcurrency == ResultSet.CONCUR_READ_ONLY)
        && (fetchSize == Integer.MIN_VALUE));
}

This boolean function determines if it’s going to use the approach “read all data first” or “read rows a few at a time” (= “streaming” in their terminology). I clearly need the latter.

You can specify, using the generic JDBC API, the number of rows you want to fetch at once (the “fetchSize”). Why would you have to set that to Integer.MIN_VALUE, which is stated to be −231, in order to get streaming data? I wouldn’t have guessed that.

Basically this very important decision about which approach to use, which in my case amounts to “program works” or “program crashes”, is left to test whether three variables are set to various values. I am not aware if this is in the documentation (I didn’t find it), nor if this decision is guaranteed to be stable, i.e. won’t change in some future driver version.

Now my code looks like the following:

public void processBigTable() {
    PreparedStatement stat = c.prepareStatement(
        "SELECT * FROM big_table",
        ResultSet.TYPE_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY);
    stat.setFetchSize(Integer.MIN_VALUE);
    ResultSet results = stat.executeQuery();
    while (results.next()) { ... }
}

This code works, and reads chunks of rows at a time.

Well I’m not sure if it reads chunks of rows at a time, or just one row at a time. I hope it doesn’t read one row at a time, because that would be very inefficient in terms of number of round trips from the software to the database. I assumed this was what the fetchSize parameter was controlling, so you could tune the size of the chunks to meet your particular latency and memory setup. But being forced to set it to a large negative number in order to get it to work means one has no control over the size of the chunks (as far as I can see).

(I am using Java 6 with MySQL 5.0 and the JDBC driver “MySQL Connector” 5.1.15.)

Which log levels to use when?

I’m sure there are a lot of opinions in the world about which log levels to use for which errors. Log levels in the sense of if a text destined for a logfile should be prefixed with “Info”, “Warning”, “Error” etc.

There is even great debate about which log levels there should be. E.g. should there be a “Debug” level or a “Trace” level? Or both? In which case what’s the difference?

I tended to ignore those debates—for me a log statement was simply a log statement—but recently I was deploying an application, and various people had worked on it and used log levels inconsistently. Hardly a surprise as I’d never set any guidelines on how they should be used, due to aforestated ignoring. So I started to think about how I would have wanted the log levels to have been in the application I was at that moment deploying.

Firstly, what does a log level actually influence?

  • Different texts are stated in the log file. The only differences this makes to anyone is that operations teams, not too familiar with the software, tend to understandable freak out when one has lots of ERRORs in the log file; whereas they tend to freak out less with lots of INFOs.
  • You can do monitoring based on log files. E.g. one can create assertions which can be monitored, such as “INFOs may be in the log file, but ERRORs may not.”
  • You can set a log level and state that errors below a certain level will not be displayed, e.g. “ERRORs should be displayed but WARNINGs should not”. So the log level of an error may influence whether it gets logged at all.

And how would those differences influence ones decision about which level to make a particular log?

I came up with the following scheme for me:

TRACE For things that should happen, but one doesn’t want to see live. (For example SQL statement logs, etc.)
INFO For things that should happen, but one does want to log live. (For example, “writing file to /x/y.pdf…”) It’s important to state IDs and paths in log files, for if something goes wrong, one needs to be able to work out exactly what the robot did with what rows and objects. And if the robot crashes, a log, written before an action is done, such as “Writing PDF file…” can help to identify what part of the process caused the error.
WARNING   For things that shouldn’t happen, but where the user sees the correct result. For example files that should always exist, but if they don’t, they can be automatically recreated. If those happen live, one might want to look at them. But as long as the user sees the right thing, one can look at them in ones own time.
ERROR For things that shouldn’t happen, and where the user sees an incorrect result. Including “fatal” errors.

And in that case, one would monitor ERRORs (or maybe WARNINGs if one was otherwise bored), and would set the log level to TRACE on all test servers and INFO on all live servers.

The Internet: It’ll get slower before it gets faster

For the 3 weeks I was in the UK recently I used a UMTS modem (i.e. like a 3G phone) to surf the web and do all my work. Going round to my friend Robin’s house, who also works in IT, he does all his surfing through a cable from his phone to his computer: i.e. also UMTS.

At least in the UK, this is extremely popular. Also in Asia it makes a lot of sense; they have excellent high-speed mobile phone networks there and all ones preconceptions about the Asians having the latest handset devices: I can confirm first-hand that they’re all true.

As we all know and have been experiencing since about 2000, more and more phones are going to get more powerful and have larger screens. Full browsers will (and do) run on them. They will also be UMTS devices.

And for those people who don’t surf via UMTS, nearly everyone I know surfs at home using WLAN. A lot of offices use WLAN too. And obviously all the surfing at airports, coffee houses, hotels, conferences etc. all goes on via WLAN.

UMTS and WLAN have high bandwidth, but they have extremely high latency compared with a cable connection. That means that although the bytes flow fast once they’ve started, it takes a long time for the first byte to arrive.

I am quite proud of the fact that when I designed the “Uboot Joe” software (Windows software which ran on the user’s PC, sat on the notification area by the clock, and communicated with Uboot) I took this into account. Every action you do with the Joe is at most one client-server round trip. For example to view all the thumbnails in a folder, there is a single request from Joe to the server like “get all data in folder_id” and the return structure is a) information about the folder, b) information on all the photos within the folder, and c) all the binary JPEG data for the thumbnails of all those images. You can try using the Uboot Joe on a UMTS link, and it works faster than any website.

Contrast this design with HTML. The first response from the server contains <img src=xx> tags and only once that has been received can the browser make the further requests necessary to retrieve the images. If the first bytes of every response take a long time to arrive, then the user experiences that “long time” twice before they see the data they requested; first to get the HTML page then second to get the images.

In fact it’s worse. If a page has 50 embedded images, it doesn’t open up 50 concurrent connections to the server (for good reason). Instead it opens e.g. 4 connections. Which means that e.g. image number 5 has to wait for the “long time” of fetching image 1 to complete. (Some sites try to get around this by having lots of servers with different names e.g. img341.domain.com and distributing the images over these servers.)

And it’s even worse than that. Even if the application only does one round-trip to the server, the underlying protocols might do more round-trips, for example firstly to contact the DNS server to get the IP address for the domain name used in the URL; and then secondly to request the data from the server.

In addition to this being a problem with UMTS and WLAN, one also has to take into account that the Internet is global. When I’m in Macau accessing European servers I get a round trip of about 300ms. So if one adds three “long times” to an otherwise extremely fast request—easily done—one has added a whole second on to the time the user has to wait. And Jakob Nielsen says that after 1 second in total, users start to lose focus on what they’re doing.

So to design applications in this age, one needs to be aware of the number of serial server round-trips (i.e. the number of times you need to ask the server for something, and only once it’s been delivered, must you ask the server for something else).

For example:

  1. An HTML page which contains an external CSS file, and this CSS file contains URLs to images.
  2. Pages with many images. The browser only requests a few files from the same server at once, so again the response to image number 1 must be finished before the request for image number 3 can begin.
  3. Javascript software which does multiple serial calls to the server, e.g. “get session token for username/password” then “get info to display on page for session token”.
  4. A form which submits data to a piece of software. The software does something but instead of returning a result page, returns a redirection command to a “real” result page. Often done to allow one to hit “refresh” safely on the result page, or make the URL of the result page look nicer.

GWT is excellent in this regard. It has the ability to download lots of those small icon-size images in one request (it makes one big image on the server and chops them up again on the client) and it makes you explicitly aware of the number of server round trips by forcing you to define interfaces for client-server interactions - as opposed to some automated scheme where you write code and the framework decides when to insert client-server round-trips. (Wicket makes client-server round trips easy with AjaxLink; my fear is it might be too easy, and one might do them too often, and lose the overview of how many are happening).

Pre-caching is a good idea too. E.g. if you are a photo viewer application, with a photo shown full screen with a “next” button, it makes sense to load the image on the “next” page even before the user’s clicked on it. That download won’t interfere with the rest of the activity the user is doing, as the bandwidth is not the bottleneck, just the time between starting the download and the bytes starting to arrive at the client. (Although one can’t download too much without the user noticing, as some people pay per MB!)

But the most important point, I think, is: these days, one must test ones web applications on a high latency connection. Generally speaking, historically I have tended to develop locally (everything installed on my laptop), or I develop in an office with a network cable and high-speed Internet and a link to the data center where the test server sits—and the office is in the same country as the server. Maybe this sounds strange, but I think one should develop web applications while using a UMTS card.

Copy/paste between Excel and MSN

Not even the simplest things work with computers these days.

I have an Excel sheet and I want to copy a value into an MSN conversation. On Windows. Notice the vendor of all these products.

  1. I copy the cell and in the middle of the sentence I’m tying into MSN I press Ctrl-V. MSN hangs for about 5 seconds. Then a notification is sent to the other party that I want to transfer a file, something .tmp.gif, i.e. an bitmap image of the value I’m trying to my sentence. Of course the other party hasn’t seen the first part of my sentence yet, as it’s still in the message composer and I haven’t pressed Return yet, so this file transfer request would come as a bit of a surprise to the other party.
  2. No problem - I can double-click the cell in Excel to edit it, and copy the characters from the cell, as opposed to copying the cell itself. However, the cell is a formula. That means that when I edit the cell I get text such as =D1+E9 as opposed to the numeric result value which I wanted to paste into the MSN conversation.

So what is the solution? As far as I can see I have to have both windows open side-by-side on the screen, and type in the value into the MSN window that Excel is displaying….

Useful “standby” button

Before I left Vienna, the Internet stopped working in my flat. There were some major electrical engineering works going on in the building, so I assumed this was the cause. Now I’ve come back, they’ve stopped, but my Internet hasn’t started working again.

The modem, from my ISP Chello, was an “ARRIS type”, and the “online” light (light 2 in the following diagram) was permanently flashing.

What was the modem trying to communicate to me? On the Chello homepage it says one should turn the modem off and on if this happens. Needless to say, by the time I read that helpful advice, I’d already done that quite a few times!

But it turns out there is a button on the modem, which is marked “standby”. Amazingly enough pressing this has the effect of

  1. making the Internet no longer work, and
  2. making the “online” light flash instead of being constant.

Why would one want that? How had it come to be in that state?

Anyway, pressing it resolved the problem, and freed me from the necessity of making a telephone call to Chello support.

Back to Vienna: Site of Previous Toilet Explosions

I’m heading back to Vienna tomorrow (Saturday). I land at 17:30 or thereabouts.

Alas it’s too late to go to BarCamp Vienna (on Saturday), which upsets me greatly. I think that’s the 4th BarCamp I’ve missed due to not being in Vienna. Hopefully I’ll be here for the next one. I tried to change my flight but the cost would have been over € 300.

In addition there’s an open day on Sunday at the Tudor Barn in England where I’m getting married. It would have been lovely to go to that but that also would have required a change of flight (in the other direction).

And hopefully when I get to Vienna, my toilet will be OK…..

The night before I left Vienna (mid December), which was a Saturday, I went to the pub with some friends. Needless to say my plan was to pack on the day of departure, Sunday. I got back late, but with 6 hours sleep I would have had (just) sufficient time to get up at 9am, pack, and make my train.

I got back and to my horror — the whole of my toilet floor was covered in water (thankfully water into the toilet not out of it!). I had to leave the next day, a Sunday, it was now 03:00 on that Sunday morning, and my parents had paid for the hotel room in the Austrian village where they we were staying, for Sunday night.

What I did first—obviously!—was panic! What I did second was to ring the Austrian girlfriend of the friend I’d been out with that night. I knew she’d be awake as we shared a taxi together and I got out first. She advised me of an Austrian 24-hour plumber I could call. That was a real life-saver!

Then I made a bit of a negotiation mistake. The plumber advised me on the phone he only accepted cash. I pointed out that I would only be able to get out € 400 (the limit on my card). While he was on his way, I went to the cash machine and got out that full amount. Once he’d done the work he told me the bill came to € 398, quite a coincidence…?

Now the thing about my toilet is that it used to be the floor’s toilet. So the main tap which turns off the water to my flat doesn’t actually turn off the water to the toilet; the toilet is before that tap. But not to fear, there’s another tap just for the toilet. I started to turn that tap and the toilet did stop dripping, but now the tap started dripping! I turned the tap back on, hoping to revert to the situation that the toilet was dripping, but now they were both dripping! Nightmare.

When he came, he said he’d have to turn the water off in the basement. Now when I moved in to my flat 9 years ago, there were a bunch of keys hanging up on the inside of one of the cupboards. To this day I don’t know what they’re all for. So I grabbed them all and I set off for the basement with the plumber who’d just arrived; at 04:00 Sunday morning, in winter in continental Europe, chilly with snow on the ground outside, 5 hours before the time to get up (and not fully sober, I must confess, although seeing all that water helped.)

Basically under the building where I live—which is neither modern nor well-maintained I must assert—there is a labyrinth system of brick tunnels. It’s really amazing. And so many wires and pipes, in all different directions. And all so dusty and nasty! And one of them, probably, was the tap to turn off the water to the tap next to my toilet.

We didn’t find the tap. He even asked me if I had any documentation on the tap structure of the building! So he did what I can only describe as a “hack” on the tap next to my toilet to make it stop dripping and also stop supplying water to the toilet. He fixed the toilet with what might also be described as a “hack”. I finally got to bed at 6am.

A few hours later, I awoke, I packed, and I got my train. Needless to say I forgot any of my skiing clothes (I was going to a ski resort, although not to ski), I brought normal clothes including some summer clothes, enough for about 4 days. I wondered why my case was so light yet I was going away for so long? But it didn’t occur to me what the reason could be or what I could do about it!

I have been living with the consequences of that bad packing experience—the complete lack of clothes—for the last 6 weeks.

Anyway - this is all to say, I hope his hack worked? I’ll see when I get back to Vienna tomorrow.

Subscribe via Email or RSS

To keep up to date with new postings here on my blog you can use one of the following subscription facilities:

  1. Enter your email address on the subscribe by email page. You’ll get an email each time I do a post. You can unsubscribe any time. You can also subscribe to just some categories (e.g. Databases, or Life).
  2. Use an RSS reader. If you don’t already have one, you can sign up free for Google Reader. Find the “add feed” option (in Google Reader it’s on the left hand side) and just type in my URL, www.databasesandlife.com.

Deceptive practices in mobile data tariffs

As those who know me know, I was sent a bill for € 2600 from T-Mobile Austria because I’d used their data card in the T-Mobile UK network. I’d downloaded 260MB. T-Mobile Austria claimed that T-Mobile UK was a “foreign network” (choosing to be one brand only when it suits them) and charged me € 10/MB despite charging only € 39/month for 800MB when I’m in Austria. Quite a price difference. It clearly didn’t cost them that much, they only adopt this strategy to rip people off. I even went to the consumer protection agency in Austria who agreed it was outrageous. They wrote T-Mobile AT a letter asking them to reduce the bill. T-Mobile promptly wrote back and said “no”.

I considered switching operators in Austria, but I looked around, and it turns out all the other operators have a similar price structure. Apart from the operator 3. They have a no-roaming-fee policy, as long as one stays within their network.

So, now I have signed up for 3 UK (as I was in the UK at the time I needed it, and I figured with no roaming costs, it doesn’t matter where you sign up). If I’d been with 3 before rather than T-Mobile before, my monthly bill would have been £10 rather than € 2600.

But I was, nevertheless, still extremely suspicious of network operators, after my experience with T-Mobile. And with good reason, it turns out.

3 supply a tool (with a horrible UI! - but it works) to allow you to connect to the Internet. I have only used their tool to connect, not any other. And helpfully it tells you how many MB you have used. You get 1000MB included in the monthly fee and every extra MB costs 10p.

But the other day I booted my computer and found the number of MB used had gone down overnight! The display on the previous day had been around 400MB and now it was showing around 300MB. This couldn’t be right.

One of the things I learned from my T-Mobile experience is that operators have websites where you can check your current bill. Supposing that 3 also had such a facility, I went around looking for it and sure enough it does have one, called my3.

And I suspected, the tool was showing an incorrect amount.

  • Here is a screenshot of the tool which says that 439+91MB have been used, i.e. 530MB have been used, so 470MB should be remaining from my monthly 1000MB.
  • A screenshot of the website however clearly shows I only 228MB are actually remaining.

If I had believed the 3-branded tool which installs itself onto the computer the first time you plug the 3-branded modem into a USB slot on your computer, I would have surfed e.g. 242MB more than my 1000MB allowance - or maybe more if the used amount goes down again mysteriously. That would have cost £24 extra, which is not the end of the world, but it’s not the £10/month deal I signed up for.

I wrote to customer services explaining the situation and suggesting they file a bug report against their software, as it displays incorrect data, and the data is directly relevant to billing. Amazingly a customer services representative called me and explained to me that only the value on the my3 website was the authoritative value. I explained that I understood that, but he should still get the program fixed, or at least display a warning. He said he’d “look into it” but it didn’t sound to me like he had any power to do anything about that. But maybe I’m wrong.

But basically, be very careful of mobile companies concerning data tariffs.

  1. T-Mobile AT charging € 39/month for 800MB is € 0.04 per MB. They charge € 0.20 for any MB used over the 800MB limit. But roaming to the network apparently with the same brand in the UK, they charged me € 10 per MB, which is 50x € 0.20. But they’ve got the UMTS equipment in the UK anyway, for their UK customers who are on similar price-plans. The only extra complexity is provisioning and billing. I don’t believe that costs 50x more than building the UMTS network in the first place.
  2. 3 UK produce a 3-branded tool installed from a 3-branded modem which came with my 3 contract, which displays blatantly incorrect data. Upon calling customer services one learns one cannot trust the tool, one can only trust the my3 website. But that is far from obvious. Anyone trusting the number on the tool will end up with a bill for a higher number of MB than they had believed they had used.

Random unreproducable Java error of the day

I mean I’m really kind of of the opinion that Java Sevlets, at least when using Tomcat and the other open source tools, don’t work. I mean surely it can’t be difficult to implement a Servlet container or logging framework!

I just tried to start Tomcat and it refused to start because of the following error:

log4j:ERROR Error occured while converting date.
java.lang.NullPointerException
  at java.lang.System.arraycopy(Native Method)
  at java.lang.AbstractStringBuilder.getChars
  at java.lang.StringBuffer.getChars
  at org.apache.log4j.helpers.ISO8601DateFormat.format
  at java.text.DateFormat.format
  ...
  at org.apache.log4j.Category.log
  at org.apache.commons.logging.impl.Log4JLogger.error
  ...
  at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt
  at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run
  at java.lang.Thread.run
log4j:ERROR Error occured while converting date.

So I just hit “start” again, and this time it starts without error.

And people trust their mission-critical server architecture to this stuff!