Java gotcha: anArray.hashCode isn’t deep
Every object has a hashCode and an equals method. These are used to determine where to place an object within a hashing algorithm, and if two objects with the same place in the hashing algorithm actually are the same, respectively. If you want to add objects to a Set—which stores only unique objects—it uses these methods to determine whether two objects are the same and thus shouldn’t both be stored.
If you have code like:
Set<byte[]> uniqueArrays = new HashSet<byte[]>();
uniqueArrays.add(new byte[] { 1,2,3 });
uniqueArrays.add(new byte[] { 1,2,3 });
uniqueArrays.add(new byte[] { 1,2 });
System.out.println(uniqueArrays.size() + " unique byte arrays");
This code prints 3. You might expect this program to print 2, as there are only two unique arrays within the Set. But arrays’ hashCode methods do not return the same result for two different arrays with the same contents. This is in contrast to, for example, the String class, which does indeed consider the String’s contents when computing the hashCode.
Set<String> uniqueStrings = new HashSet<String>();
uniqueStrings.add(new String("123"));
uniqueStrings.add(new String("123"));
uniqueStrings.add(new String("12"));
System.out.println(uniqueStrings.size() + " unique strings");
This code prints 2. (The slightly strange-looking “new String” here is to make sure that there are actually different object instances with the same content being passed to the add method; otherwise the Java compiler would use the same object instance for the two calls, as the string-content is the same.)
The solution is to use the Arrays.hashCode(anArray) method.
This isn’t particularly convenient if you want to store unique arrays in a set. But if you have an object with e.g. a byte[] instance variable, then you can implement the hashCode method on that object to use Arrays.hashCode, or you can use the code:
Map<Integer, byte[]> map = new HashMap<Integer, byte[]>(); map.put(Arrays.hashCode(anArray), anArray); Collection<byte[]> uniqueByteArrays = map.values();
The (default) hash of mutable objects should only depend on the identity of the object. If this isn’t the case, then mutating the object changes the hash. This, in turn, breaks the assumptions of data structures (read: hash tables) based on hashing. Let’s assume that the hash of an array would depend on the elements stored in the array. Consider a program that first inserts an array into a hash table and then mutates the array. After the mutation, the hash of the array has changed. Trying to lookup the array will now fail.
Of course, one fundamental problem with Java is that it doesn’t provide a comprehensive set of immutable data types. This leads to many kinds of problems. One well known problem is the need to make copies of data structures before returning them to avoid clients to break class invariants via mutation. Contrast with a sane imperative language like Standard ML that provides comprehensive support for both mutable (ref cells and arrays) and immutable data types (vectors, tuples, records, variant types, numeric types, …). In Standard ML one can entirely avoid using mutable data types in interfaces except when the data structures are supposed to be mutated by the clients. Consequently, there is basically never need to make copies of data structures just to be safe.
Unfortunately, arrays inherit their implementation of hashCode(), equals() and toString() fro Object. There is no Array class type that all array inherit from only an Array & Arrays helper classes.
This means the hashCode for an array is based on it internal object number, equals is only true for the same object and toString() prints the internal class name and the internal id… Not very useful IMHO.