If you’re going to change databases, do it in one go

At Uboot, there was the suggestion that we should change database technology. This was because we would save money with by moving away from Oracle to e.g. MySQL.

It was further decided, as we had 200k+ lines of code, and 100+ tables, that if a migration was to occur, it shouldn’t happen  ”all at once”, but instead one would have some parts of the system migrated with other parts of the system not yet migrated.

But the strategy of migrating databases piece by piece has the following problems:

  1. Data consistency. Initially, all important data are within the source database. It is possible to guarantee certain consistency constraints (e.g. all photos are owned by existing users). However, if e.g. photos are migrated to MySQL, while users remain in Oracle, no such consistency can be guaranteed. Inconsistency has the consequence that software must accept not only valid data, but also invalid data. It is impossible to test the software for all combinations of invalid data, and with the rule that “software which is not tested does not work”, one can deduce that the software will then not work. The set of inconsistencies will increase over time, so the set of users who experience bugs will also increase over time. Afterwards, going from an inconsistent state to a consistent state is nearly impossible.
  2. Reporting over multiple system areas. With one database, one can ask a question such as “how many photos belong to users logged in in the last month”. A “SQL join” is used. However if the tables reside in different database, no such query can be asked. One would have to export the data from one database to another, in order to ask the query. Or export all data into a centralized data warehouse. Or write software to do what a single line of “join” would do. All this effort is saved, by using a single database.
  3. Point-in-time recovery. With one system, if the database crashes or an error is made (e.g. “drop table account” is executed by an employee), a “point-in-time recovery” can be performed. Data is restored to its consistent state as it was at a certain point in time. If one uses more than one database, a point-in-time recovery will be more difficult. For example if the databases are out-of-sync by 1 second, then some photos will exist for which there are no users, or vice versa, and data consistency problems exist as described in point 2.
  4. Spending money on migration; no saving in license fees. If a migration is done, time/money is spent. Not only in doing the migration, but fixing performance problems after the system is done. If money can be saved on the Oracle license, perhaps this time is worth it, however if one is using multiple databases then one is still paying for Oracle.

The solution is either to change databases in one go, or simply don’t change databases. You want to have your data in one place, at all times.

Minimal security is gained by using case-sensitive passwords

My colleague suggested that systems should compare passwords in a case-sensitive manner. He pointed out the larger space of possible passwords, and thus the longer passwords would take to crack by a brute-force attack. This is all standard knowledge.

I still maintain, however, as discussed before, that the correct trade-off between usability and security is to compare passwords in a case-insensitive manner.

The increase in usability in having case-insensitive passwords is obvious (and documented in the previous post), so to understand the usability/security trade-off, how much is security reduced by using them?

The following calculations assume the following:

  • The NSA, or whoever, wishes to attack you, has access to the hashed password (e.g. database leak)
  • They have a strong machine such as this one
  • They use a brute-force attack and try every combination (they don’t just use dictionary words)
  • Assuming you’re a person of interest, let’s say they are prepared to expend 2 months running the machine at full power just to crack your password
  • A case-sensitive password has 80 possible characters to choose from (upper-case, lower-case, numbers, symbols), a case-insensitive normalized password as described in my previous blog post has 54 possible characters.

So how long would your password have to be to defeat the above attacker, with case-sensitive and case-insensitive passwords?

SHA1 Passwords

The machine above can try 63 billion passwords per second. That means, in the 2 months available, it can try out 3.4×1017 passwords.

  • A case-sensitive password with length 10 has 10.7×1017 possibilities, so cannot be cracked
  • A case-insensitive password with length 11 has 11.3×1017 possibilities, so cannot be cracked

So the “lack of security” imposed by case-insensitivity can be mitigated by having a single extra character in your password, or, put another way, making your password 10% longer.

To those who would argue that it’s likely people will use random 10 character passwords but not random 11 character passwords: I propose that there are those of us who will generate n character passwords using a tool (the site is free to suggest n, meaning its value doesn’t matter), and there are those who would use their pet’s name as their password, in which case even a case-sensitive password is insecure, meaning case-sensitivity doesn’t matter either.

bcrypt(10)

But let’s try a more realistic example. Who uses SHA1? We use bcrypt, like presumably everyone else.

bcrypt has a strength parameter. It re-hashes the password 2n times. So each time you increase the strength parameter by one, it takes twice as long to calculate. By default this strenth parameter is 10, which is fine for us: it takes our server 0.1 seconds to calculate such a hash.

The web-page says that monster machine can do 71k bcrypt(5) passwords per second. So that means it can calculate 2.2k bcrypt(10) passwords per second. Meaning in the two months, it can calculate 1.1×1010 passwords. So that means:

  • A case-sensitive password with length 6 has 26×1010 possibilities, so cannot be cracked
  • A case-insensitive password with length 6 has 2×1010 possibilities, so cannot be cracked

So we find out that with a normal hashing strategy, the password doesn’t have to be made longer to remain at the same level of security.

The “lack of security” imposed by case-insensitive passwords mean that the password either has to be slightly longer, or not longer at all. The usability advantages are very real. So that, in my mind, makes the usability vs security trade-off a clear win for case-insensitive passwords.

When and who should fix bugs?

There are (at least) the following options that a project manager must make when organizing bug fixes to software:

  • Should the original author of the code in question fix the bug? Or should there be a “bug team” who surgically go in and fix bugs?
  • Should one fix bugs as one goes along? Or concentrate on features and write the bugs down and fix them in a “bug sprint”?

I am very much in favour of the original author fixing bugs, and fixing bugs as soon as they occur. Because:

  • It’s difficult to predict how long it will take to fix a bug, e.g. estimates might be “between 5 minutes to 4 hours”. It’s easier to see how far through the project you are if you have 50% of the features working without bugs, than if you have 75% with an unknown number of bugs each taking an unknown length of time to fix.
  • If a programmer makes a mistake, it’s important that they learn it, so they won’t make it next time. That’s why the original author should fix the bug.
  • Do you put your best programmers or your worst programmers on bug fixing? Bug fixing is tricky, so if you put your worst programmers on bug fixing they’ll only make the situation worse. If you put your best programmers on bug fixing, they’ll all quit.
  • If one person is maintaining one piece of code, they feel “ownership” over it. This is perhaps the opposite of the desirable quality that everyone knows the code. Nevertheless, I think “ownership” is a powerful concept that causes people to take more care of their work than they otherwise would, leading to a better piece of software.
  • Assuming that bugs are fixed immediately, and you find a bug in an existing part of the system, you’ll report it and/or fix it. If bugs are left until the end, and you see a bug, you’ll just ignore it, as you know there are existing bugs. This might cause a new/different bug to not get reported.
  • If you fix something weeks later, you might have forgotten the original code, including weird important speical cases. Ideally everything would be perfectly documented and readable, but that often isn’t the case in reality. Having the special cases in your head will prevent your bug fix from actually being an introduction of further bugs.

One of my old bosses used to utter the phrase “you break it, you fix it!”. I never had good associations with that phrase, every time I heard it, new pieces of work were only seconds away from being assigned to me :-)

Don’t use constants for table and column names when writing SQL statements

I was always in two minds about using constants for table and column names when writing SELECT queries. Now I’ve concluded that constants are definitely bad, and should not be used. Here’s why.

The topic of discussion is difference is between writing

sql = "SELECT * FROM " + TABLE_NAME + " WHERE ..."

and

sql = "SELECT * FROM my_table WHERE ..."

There are the following consequences from this choice as far as I can see:

  • A compiler like Java can tell you if you misspell a variable name (e.g. tableName) but not if you misspell “myTable” in the middle of a string (Win for constants, if you’re using a static typing language)
  • You can rename the values easier: just change and rename the constant. But how often do just table names or columns get renamed? Normally when I change the database I am implementing a new feature, and everything has to be changed anyway. (Marginal win for constants)
  • The layout of the query is shorter and easier to read if constants are not used. (Win for not using constants)

To avoid having this choice in the first place:

  • One could use an ORM, but at least in Java, e.g. HQL in Hibernate still has a string for column names, and table names if you’re doing joins, so the problem is still there.
  • Using a system like LINQ in .NET which allows you to specify queries in a way the compiler understands, not just a string. (But can it do everything SQL can do including vendor-specific things?)
  • Being able to extend the language with other languages such as SQL and regular expressions. This is a fantasy of mine and a friend, hasn’t happened yet. This would work by the compiler working in conjunction with the database engine to assert that the query is valid at compile time (and possibly even creating an db-specific internal parsed representation right there and then.)

Compare the following two pieces of code and I think the choice will become obvious. Both pieces of code come from the current code-base I’m working on, neither have I written myself.

sb.append("SELECT ").append(RecruiterRole.TABLE_NAME).append(".*,");
sb.append(Login.TABLE_NAME).append(".*");
sb.append(" FROM ").append(RecruiterRole.TABLE_NAME);
sb.append(',').append(Login.TABLE_NAME);
sb.append(" WHERE ");
sb.append(RecruiterRole.COLUMN_COMPANY_ID).append(" = ?");
sb.append(" AND ");
sb.append(RecruiterRole.TABLE_NAME).append('.').
sb.append(RecruiterRole.COLUMN_LOGIN_ID).append('=?');

vs

sql.append(" SELECT * ");
sql.append(" FROM application");
sql.append(" WHERE job_advert_id IN (");
sql.append("   SELECT job_advert_id");
sql.append("   FROM share");
sql.append("   WHERE talent_scout_login_id = ?)");
sql.append(" AND potential_applicant_identity_id NOT IN (");
sql.append("   SELECT potential_applicant_identity_id");
sql.append("   FROM positive_endorsement");
sql.append("   WHERE talent_scout_login_id = ?)");
sql.append(" AND company_id = ?");
sql.append(" AND share_talent_scout_login_id = ?");
sql.append(" ORDER BY datetime_utc DESC");

Here are the reasons why the second code is more readable:

  • Because table/column names are inline, the code reads easier
  • Indenting is used for sub-selects
  • Each condition, order-by is on its own line (e.g. “AND company_id=?”)
  • Keywords uppercase, column and table names lowercase

The danger of the above code is less that errors in spelling will only be detected at run-time and not at compile-time, but that the query does the wrong thing (while appearing to do the right thing). For example, an error I saw recently (which obviously did not make the live system!) was that users could see data not only from themselves but from all users because the “WHERE login_id=?” had been forgotten. But to the untrained eye, or a user on the test system with only a few users, the query appeared to work.

In this case, it’s a clear win for readability, over compile-time checking of a mistake which will is unlikely to happen and will be identified at run-time.

Are Java enum values instances or classes?

That depends on if the enum values provide methods which differ from one another.

The following code produces just one class file, “Network.class”. “facebook” and “linkedIn” are just instances of the Network Java class.

public enum Network {
  facebook, 
  linkedIn;

  public void printName() { System.out.println(getName()); }
}

But the following code produces one class file for each value, named “Network$1.class” etc., as well as one class file for the abstract superclass, “Network.class”.

public enum Network {
  facebook {
    public Client newClient() { return new FacebookClient(); }
  },
  linkedIn {
    public Client newClient() { return new LinkedInClient(); }
  };

  public abstract Client newClient();
}

“facebook” and “linkedIn” are in fact different Java classes now.

Having a constructor taking parameters, and initializing each value of the enum by calling this constructor with values, is not sufficient to force the generation of individual classes per value.

Just because they are different classes in this situation doesn’t automatically mean you can do everything you’d expect to be able to do with a class. You can’t test for class membership using “instanceof” for example (not that this would be very useful for an enum, as there is only one instance of every value).

PostgreSQL vs MySQL “update” difference

While one session is doing a “delete then insert”, and another session doing an “update”, the update waits until the “delete then insert” is finished. But does the “update” then affect the “inserted” row or the “deleted” row?

Create a table foo(id) and insert one row with id=5 into it. Then:

Session A Session B
START TRANSACTION START TRANSACTION
DELETE FROM foo WHERE id=5
UPDATE foo SET id=6 WHERE id=5 (blocks)
INSERT INTO foo VALUES (5)
COMMIT (unblocks)
SELECT id FROM foo

What does Session B see?

The result depends on your database vendor:

  • PostgreSQL: Row has id=5 (Session B’s update affected the deleted row)
  • MySQL InnoDB: Row has id=6 (Session B’s update affected the inserted row)

I don’t think I’d say either of these is the “right” approach, I think they are both valid. If two things happen at the same time, who’s to say which should win?

I would actually advise against deleting then inserting. I would insert then, if a unique constraint violation is triggered, do an update. But this doesn’t change the fact that, if you’re doing two things at the same time, you can’t attach meaning to which one should win.

BTW if you’re using an ORM like Hibernate, which claims to abstract away from the database, and allow you to use any vendor, do you think it takes differences like this into account?

Originally posted here:
https://news.ycombinator.com/item?id=5435786

Religious arguments, fear of the unknown

A friend said this to me today:

it’s really odd, why do fans of dynamic languages often have this cult-like approach?

It’s an interesting thought. I know many people who love dynamic typing and reject static typing without knowing much about it. (Of course there are many people who love dynamic typing but who do know lots about static typing.)

It got me to thinking, I reckon if you only know X and know nothing about non-X, you might start to fear non-X. Your fear might express itself by reinforcing your belief that X is the best, that non-X is no good.

I’m sure this is a standard human emotion, but I’ve absolutely no idea what it’s called. “Fear of the unknown” perhaps.

Anyway, an hour later I experienced a similar emotion when reading this:
http://blogs.msdn.com/b/oldnewthing/archive/2013/03/20/10403718.aspx

I have absolutely no idea what this article is on about. I know so little about Windows. I always deploy on UNIX or Linux. I always have. There’s so much to know about Windows. And all that stuff that there is to know: I don’t know any of it.

I reckon, when being hit with a volley of emotional, religious, and cult-like arguments about why all things other than X are bad, one should take into account the possibility that the arguer doesn’t know anything other than X, and is afraid of everything which would expose that lack of knowledge.

Generating XML programmatically? Don’t use CDATA

If you want to represent characters, you want to represent any possible sequence of characters. In an XML file, escaping < with &lt; gives you a way to represent any character. Using CDATA doesn’t give you a way to do that.

In an XML document there are two ways to represent character data. Either

  1. You just write the characters in the XML file (in which case you need to escape characters that look like XML tags i.e. < with &lt; etc.), or
  2. You use CDATA, no longer having to care about replacing < with &lt; etc. Ends as soon as the first ]]> is encountered, which is unlikely to appear in your characters.

These two syntaxes result in an identical XML document. An XML parser must consider XML files identical, which differ only by if CDATA is used, or not. In both cases, there is text within a tag.

CDATA has a (somewhat strange) syntax like:

<my-tag><![CDATA[My characters]]></my-tag>

From its naming, CDATA (“Character DATA”) might seem like exactly what you need to represent character data. Combine that with the fact that characters such as < don’t need to be escaped.

However, in fact it’s exactly the opposite of what you need. Whereas, by not using CDATA, it’s possible to escape the characters that mustn’t appear (i.e. replace < with &lt;) with CDATA there is no way to escape the characters that mustn’t appear (i.e. ]]>).

It might seem “unlikely” that ]]> is actually going to appear in the character data your user wishes to represent. (This is actually irrelevant, as software should work all the time, not just be “unlikely” not to work.) However, even this “unlikelihood” is misleading. No matter what sequence of characters XML had chosen to end CDATA, as soon as you represent data in itself (e.g. send an XML document in a tag), this sequence will appear in the data. So if you’re working with XML (which you are), this will happen more often than you think.

CDATA is just a convenience mechanism for writing XML files by hand. If you’re writing files by hand, you know which characters appear in your data, so you know whether you can use CDATA safely or not. If you’re creating a program to write XML files, you don’t know what the user’s data will contain, so you can’t use CDATA.

“To programmize” (verb)

Talking to someone who doesn’t work in IT, and who isn’t a native English speaker; She was talking about various people in software development, and referred to them doing “programmizing” (as opposed to “programming”).

That’s a great concept. “Progammizing” reminds me of “terrorizing”. As in: before you have a bunch of processes involving paper and people and it all works. Then, some people come in and do some programmizing. Afterwards, nothing works, everything’s slow, crashes, has bugs, loses data, you have to turn stuff off and on a lot….

Map keys: Only allow strings, don’t allow “any object”

Associative arrays. Maps. Hashes. Dictionaries. They are an important cornerstone of data modelling in software. Every language has them.

Question: What types should the keys be?

The answer depends on what language you’re using. I have spent a lot of time programming Perl and Java, and they answer this question differently. Perl’s hashes force you to have string keys. Java’s Map allows you to have any objects as keys, and you can override (amongst others) the equals method to allow the map to know if two keys are the same or not. C++ STL’s “map” is the same as Java, but in addition allows you to supply code, when you create the map, to determine if two key objects are the same or not.

The Java and C++ STL way definitely seem to be more general. But after years of programming in both worlds, I’ve come to a conclusion: The way Perl does it, allowing only strings as keys, is better.

It’s obvious whether two strings are equal or not. It’s not obvious whether two arbitrary objects are equal or not.

Take a User object. This represents a user in the system. It has an id, a name, a picture and so on. You want information against users, so you create a Map with Users as keys, and the information as values. The code Map<User, MyInfo> reads easily enough. Adding entries and fetching entries from this map reads easily enough.

But what does it actually do? Under what circumstances are two User objects equal? When their IDs are equal? When all fields are equal? When certain key fields such as names are equal? What if two users haven’t been inserted into the database yet and thus don’t have IDs yet?

This is a whole can of worms. I have encountered subtle bugs due to object equality not being what I expect it to be in a certain situation (even though it looked reasonable enough when the class was defined). These bugs have been difficult to trace and fix, and aren’t always the sort of bug which even have a manifestation quickly or all the time.

Adding information to the map with map.add(user.id, myInfo) is also easy to read. You know exactly what it does. You don’t have to program those annoying equals and hashCode methods, so your business logic becomes less diluted by implementation artefacts. Those methods also can’t have bugs if you haven’t programmed them.

A disadvantage: If you’re using a language supporting static typing and generics, then extra readability and compiler-verification can be gained by stating the type of the key of a relationship. Just using “String” doesn’t allow this (is it a user id? the name of the user?). But I still think that the simplicity aspect described above is more important.

Perhaps one could argue that numbers should be allowed as keys too. But I think, for simplicity, there should only be one type of key. A number can be converted to a string easily enough (as long as one makes sure to remove any leading zeros, which would make two identical numbers appear to be different strings). More complex data, that one might want as a key, can be more easily converted to a string than to a number.

Regarding sets and finding unique objects, the situation is similar. Perl has no facility to do this. Java has the ability to add objects to a Set, the elements’ equals method is used to determine if objects are the same or not. C++ STL’s “set” is the same as Java, but also allows you to supply a function to determine if two elements are the same or not.

Again, if you want to find which (unique) users certain information pertains to, in Java you’d create a Set<User> and put the users in there. Again, what makes one User the same as another? Again, subtle bugs can be introduced when your understanding of what makes a user unique doesn’t coincide with the author’s of the user class.

This is the way Perl does it: There are only maps, no sets. You place objects in the map as values, and the unique aspect is the key of the map. At the end you just take all the values of the map, and ignore the key. map.add(user.id, user) is easy to write, it’s clear what’s going on.