Archive for the ‘Coding’ Category

Recognizing URLs within plain text, and displaying them as clickable links in HTML, in Wicket

Wednesday, March 12th, 2014

I have just, out of necessity for a customer project, written code which takes user-entered plain text, and creates out of that HTML with URLs marked up as clickable links.

Although marking up links in user-entered text is standard functionality, Stack Overflow would have you believe that it’s not something that should not be attempted, as it cannot be done perfectly. This is technically correct, however, users are accustomed to software which does a best-effort attempt, and customers are accustomed to take delivery of software meeting users expectations.

The software I have written is available as open-source, either as a Java class with the method encodeLinksToHtml which takes some plain text and returns safe HTML with clickable links, or as a component in the Wicket web framework called MultilineLabelWithClickableLinks.

Finding links within text is not as easy at it seems

Users may enter with/without protocol (http://). Domains may or may not have www at the start. There may or may not be a trailing slash. There may or may not be information after the URL. Having a whitelist of acceptable domain endings such as “.com” is a bad idea as the list is large and subject to change over time. Punctuation after links should not be included (for example “see foo.com.”, with a trailing dot which is not part of the URL)

The software matches “foo://foo.foo/foo”, where:

  • Protocol is optional
  • Domain must contain at least one dot
  • Last part is optional and can contain anything apart from space and trailing punctuation (= part of the sentence in which the link is embedded)

Quotes are not allowed because we don’t want <a href=”foo”> to have foo containing quotes (XSS).

Making links clickable is not as easy as it seems

Facts:

  • Conversion from plain text to HTML requires that entities such as “&” get replaced by “&amp;”.
  • Links such as “foo.com/a&b” need to get replaced by “<a href=’foo.com/a&b’>foo.com/a&amp;b</a>”. (“&” in URL needs to stay “&” in the href, but needs to become “&amp;” in the visible text part)

Therefore,

  • One cannot firstly replace entities and then markup links, as the links should contain unescaped “&” as opposed to “&amp;”.
  • One cannot firstly encode links and then replace entities as the angle brackets in the link’s “<a href..” would get replaced by “&lt;a href…” which the browser would not understand.

Therefore, the replacement of HTML entities, and the replacement of links, must be done in a single (complicated) pass, rather than two (simple) passes.

Jetty doesn’t show errors on web application start-up

Monday, December 23rd, 2013

From a certain version of the “jetty” package in Debian Linux, if the web application didn’t start up (servlet init() throws an Exception), this error wasn’t logged anywhere. The solution is to install the libjetty-extra package.

sudo apt-get install libjetty-extra

It took some amount of experimentation to find the solution. I don’t know why you’d ever want to not log errors; i.e. why the logging of errors is an “extra”.

Jetty is a Java web server, similar to Tomcat, it’s my server of choice. It’s simple, doesn’t seem to do much apart from run WARs, and (apart from this issue) I’ve rarely had any problems with it.

Wrapping IDs in objects

Thursday, December 5th, 2013

In Java, I like to wrap ID values in objects, rather than just passing them around the code as their native “int” or “long” or “String” values.

The reasons are twofold why using an object is better:

  • Code becomes more readable, for example foo(LoginId x) is more readable than foo(long x). (Although perhaps neither foo nor x are good names, so perhaps the example over-exaggerates this improvement.)
  • The compiler can do more checking. If you pass a job advert’s id (as a long) to a function expecting a login id (as a long), the compiler cannot warn you of your mistake. This becomes particularly relevant if you have a function taking two IDs and you pass them the wrong way around.

Things to consider when writing such an “ID object”

  • Do not allow the ID contained within the object to be null. Having “LoginId x” where x is not null, but the value contained within x is null, makes no sense. (For example, use primitive types if dealing with numerical IDs, as they cannot be null.)
  • If the ID is a string, don’t allow this string to be empty; same as above.
  • Implement equals and hashCode methods so that these IDs can be used within Sets, or as keys within Maps, or as keys in Wicket drop-downs, or wherever.
  • Make them serializable.
  • Make the ID attribute a “public final” attribute. That means useless getter methods can be avoided, from the object itself and from client code.
  • The object should have a single constructor which takes the value and sets it in the attribute.
  • Implement toString so that the debugger can display the object usefully.

Don’t use constants for table and column names when writing SQL statements

Friday, June 14th, 2013

I was always in two minds about using constants for table and column names when writing SELECT queries. Now I’ve concluded that constants are definitely bad, and should not be used. Here’s why.

The topic of discussion is difference is between writing

sql = "SELECT * FROM " + TABLE_NAME + " WHERE ..."

and

sql = "SELECT * FROM my_table WHERE ..."

There are the following consequences from this choice as far as I can see:

  • A compiler like Java can tell you if you misspell a variable name (e.g. tableName) but not if you misspell “myTable” in the middle of a string (Win for constants, if you’re using a static typing language)
  • You can rename the values easier: just change and rename the constant. But how often do just table names or columns get renamed? Normally when I change the database I am implementing a new feature, and everything has to be changed anyway. (Marginal win for constants)
  • The layout of the query is shorter and easier to read if constants are not used. (Win for not using constants)

To avoid having this choice in the first place:

  • One could use an ORM, but at least in Java, e.g. HQL in Hibernate still has a string for column names, and table names if you’re doing joins, so the problem is still there.
  • Using a system like LINQ in .NET which allows you to specify queries in a way the compiler understands, not just a string. (But can it do everything SQL can do including vendor-specific things?)
  • Being able to extend the language with other languages such as SQL and regular expressions. This is a fantasy of mine and a friend, hasn’t happened yet. This would work by the compiler working in conjunction with the database engine to assert that the query is valid at compile time (and possibly even creating an db-specific internal parsed representation right there and then.)

Compare the following two pieces of code and I think the choice will become obvious. Both pieces of code come from the current code-base I’m working on, neither have I written myself.

sb.append("SELECT ").append(RecruiterRole.TABLE_NAME).append(".*,");
sb.append(Login.TABLE_NAME).append(".*");
sb.append(" FROM ").append(RecruiterRole.TABLE_NAME);
sb.append(',').append(Login.TABLE_NAME);
sb.append(" WHERE ");
sb.append(RecruiterRole.COLUMN_COMPANY_ID).append(" = ?");
sb.append(" AND ");
sb.append(RecruiterRole.TABLE_NAME).append('.').
sb.append(RecruiterRole.COLUMN_LOGIN_ID).append('=?');

vs

sql.append(" SELECT * ");
sql.append(" FROM application");
sql.append(" WHERE job_advert_id IN (");
sql.append("   SELECT job_advert_id");
sql.append("   FROM share");
sql.append("   WHERE talent_scout_login_id = ?)");
sql.append(" AND potential_applicant_identity_id NOT IN (");
sql.append("   SELECT potential_applicant_identity_id");
sql.append("   FROM positive_endorsement");
sql.append("   WHERE talent_scout_login_id = ?)");
sql.append(" AND company_id = ?");
sql.append(" AND share_talent_scout_login_id = ?");
sql.append(" ORDER BY datetime_utc DESC");

Here are the reasons why the second code is more readable:

  • Because table/column names are inline, the code reads easier
  • Indenting is used for sub-selects
  • Each condition, order-by is on its own line (e.g. “AND company_id=?”)
  • Keywords uppercase, column and table names lowercase

The danger of the above code is less that errors in spelling will only be detected at run-time and not at compile-time, but that the query does the wrong thing (while appearing to do the right thing). For example, an error I saw recently (which obviously did not make the live system!) was that users could see data not only from themselves but from all users because the “WHERE login_id=?” had been forgotten. But to the untrained eye, or a user on the test system with only a few users, the query appeared to work.

In this case, it’s a clear win for readability, over compile-time checking of a mistake which will is unlikely to happen and will be identified at run-time.

Are Java enum values instances or classes?

Saturday, April 13th, 2013

That depends on if the enum values provide methods which differ from one another.

The following code produces just one class file, “Network.class”. “facebook” and “linkedIn” are just instances of the Network Java class.

public enum Network {
  facebook, 
  linkedIn;

  public void printName() { System.out.println(getName()); }
}

But the following code produces one class file for each value, named “Network$1.class” etc., as well as one class file for the abstract superclass, “Network.class”.

public enum Network {
  facebook {
    public Client newClient() { return new FacebookClient(); }
  },
  linkedIn {
    public Client newClient() { return new LinkedInClient(); }
  };

  public abstract Client newClient();
}

“facebook” and “linkedIn” are in fact different Java classes now.

Having a constructor taking parameters, and initializing each value of the enum by calling this constructor with values, is not sufficient to force the generation of individual classes per value.

Just because they are different classes in this situation doesn’t automatically mean you can do everything you’d expect to be able to do with a class. You can’t test for class membership using “instanceof” for example (not that this would be very useful for an enum, as there is only one instance of every value).

Generating XML programmatically? Don’t use CDATA

Wednesday, March 20th, 2013

If you want to represent characters, you want to represent any possible sequence of characters. In an XML file, escaping < with &lt; gives you a way to represent any character. Using CDATA doesn’t give you a way to do that.

In an XML document there are two ways to represent character data. Either

  1. You just write the characters in the XML file (in which case you need to escape characters that look like XML tags i.e. < with &lt; etc.), or
  2. You use CDATA, no longer having to care about replacing < with &lt; etc. Ends as soon as the first ]]> is encountered, which is unlikely to appear in your characters.

These two syntaxes result in an identical XML document. An XML parser must consider XML files identical, which differ only by if CDATA is used, or not. In both cases, there is text within a tag.

CDATA has a (somewhat strange) syntax like:

<my-tag><![CDATA[My characters]]></my-tag>

From its naming, CDATA (“Character DATA”) might seem like exactly what you need to represent character data. Combine that with the fact that characters such as < don’t need to be escaped.

However, in fact it’s exactly the opposite of what you need. Whereas, by not using CDATA, it’s possible to escape the characters that mustn’t appear (i.e. replace < with &lt;) with CDATA there is no way to escape the characters that mustn’t appear (i.e. ]]>).

It might seem “unlikely” that ]]> is actually going to appear in the character data your user wishes to represent. (This is actually irrelevant, as software should work all the time, not just be “unlikely” not to work.) However, even this “unlikelihood” is misleading. No matter what sequence of characters XML had chosen to end CDATA, as soon as you represent data in itself (e.g. send an XML document in a tag), this sequence will appear in the data. So if you’re working with XML (which you are), this will happen more often than you think.

CDATA is just a convenience mechanism for writing XML files by hand. If you’re writing files by hand, you know which characters appear in your data, so you know whether you can use CDATA safely or not. If you’re creating a program to write XML files, you don’t know what the user’s data will contain, so you can’t use CDATA.

Don’t names classes AbstractThing or ISerializable

Monday, February 25th, 2013

Naming a class AbstractThing violates the Liskov Substitution Principle, and causes client code to read wrong. Call a class a name which you’d be happy to call any of the objects belong to the class.

If you want to borrow your friend’s phone, and you’re not sure what brand it is, how many times have you asked “Excuse me mate, can I borrow your abstract phone? Cheers.”

Object-oriented modeling and programming allows certain classes of objects to specialize other general classes of objects. For example, both a Rectangle and Circle are special types of of Shape.

Every Rectangle is a Shape. Everywhere you can use a Shape, you can use a Rectangle. Code that deals with a Rectangle (but which can also deal with Circles) might look like:

Shape s = .....;
s.drawTo(myGraphicsContext);

Using an object (e.g. a Rectangle) anywhere you can use a generalization of it (e.g. a Shape) is an important part of the object-oriented concept, and known as the Liskov Substitution Principle.

It’s also obvious: what sort of sentence or logic would make statements about shapes, but then not be applicable to rectangles?

If you name the generalization an AbstractShape, this principle is violated. A Rectangle isn’t an AbstractShape. If anything it’s a “Concrete Shape”! A rectangle isn’t abstract (in the sense of “I don’t know what type of shape this is. Could be a rectangle, could be anything else.”). Code using AbstractShape then reads wrong:

AbstractShape s = new Rectangle(...);

The same is true of interfaces. What do all objects which are serializable have in common? They are Serializable. Naming the interface ISerializable would lead to code like:

ISerializable s = new String("hello")

But s refers to a particular instance; a particular thing. There’s nothing abstract or “just an interface” about s. s is a String, which is a specialization of the more general type of thing called a Serializable. The line of code, including s’s type, must reflect what s actually is. It’s a serializable.

Further, what if you change an interface to an abstract class, or a concrete class to an abstract class, or vice-versa? The client code must be updated. While isn’t too much of a pain with refactoring tools, it still leads to the suspicion that irrelevant information about the tools used to construct the implementation are leaking out to client code.

A class of a set of objects is named to capture the common aspects of those objects. If all objects are rectangles, call their class Rectangle. Naming a superclass, abstract or not, is no different. What do all the object modeled have in common? Some are rectangles, some are circles, but they are all Shapes.

The objects being classified into classes are not AbstractShapes, BaseShapes, IShapes, or similar.

Always use a “default” clause in a “switch” statement

Tuesday, February 19th, 2013

I was told that it’s good practice to include a default clause in all switch statements. I recently remembered this advice but can’t remember what the justification was.

Things can and do go wrong. Values will not be what you expect, and so on. To quote the old computer science motto: If it can go wrong, it will.

Not wanting to include a default clause implies you are confident that you know the set of possible values. If that’s the case, then a value which isn’t one of them indicates an error somewhere. And if that’s the case, you’d certainly want to be informed about it.

switch (myVar) {
   case 1: ......; break;
   case 2: ......; break;
   default: throw new RuntimeException("unreachable: myVar" + myVar);
}

There’s no reason to include more information than just the value of any variables which are known only at run-time. Anything which is visible from the source code, you don’t need to include in the text, as you’ll be looking at the source code to debug the error. (The exception stack-trace will include the line number.)

Originally from http://stackoverflow.com/questions/4649423/should-switch-statements-always-contain-a-default-clause/4649518

No need to support IE8 (even for corporate software)

Tuesday, February 12th, 2013

One of my customers (mobilreport) is used by accountancy departments in large companies in Austria and Germany to analyze their employees’ mobile phone usage. More so than average web software, this web software needs to work on whatever web browsers large companies have installed; large companies tend to use more Internet Explorer than the world at large.

So it’s with pleasure that I report the following trend: (Numbers are the % of all users who come to the website within that month using IE8)

June 2012 34.94
July 2012 44.97
August 2012 22.12
September 2012 17.43
October 2012 16.00
November 2012 14.00
December 2012 7.78
January 2013 4.21
February 2013 1.85  (so far)

It’s amazing to think, only 7 months ago (July 2012), that 45% of our users were using IE8 (the remaining 55% being split between IE6, IE7, IE9, FF, …). And now less than 2%, hopefully soon even less.

I am developing more and more software which uses SVG to visualize things (e.g. United Youth Symbol Generator), and IE9+ (and other desktop browsers, and most mobile browsers) all support SVG. So the drop of IE8 is great news.

Do not use automatic code reformatting

Thursday, January 24th, 2013

Beautiful source code is about a communication between the author of a program and the reader of a program. It’s a communication between two people. A computer cannot communicate as well with a human as a human can.

I’ve often heard it asserted that people cannot read other people’s code without auto-formatting. However, I think that reflects poorly on both the author (who isn’t thinking enough about the reader), and the reader (reading code often requires as much effort as writing code; often readers are lazy). If the authoring programmer cannot lay out code so that it’s understandable, they’re probably writing bad code; reformatting their code won’t suddenly make it good code.

Further to the reasons why auto-formatting do not help, here are some examples where it actually hinders. There are some examples from a customer I went back to work with in 2007, who I hadn’t visited for a few months.


In the following example, there is a clear structure to the statement. A restriction is added which is an “or”, composing of a left-hand-side and a right-hand-side. I formatted each one of those sides as on a single line indented from the surrounding “or”.

result.add(Restrictions.or(
    Restrictions.ltProperty("wonUnitCount", "desiredUnitCount"),
    Restrictions.eq("reserveMet", false)));

The reformatted code does not have this structure. The “.eq” on the second line is a method on the Restrictions class of the right-hand-side. From the structure one might think that it’s one of the most structurally significant methods of the statement, but it’s not.

result.add(Restrictions.or(Restrictions.ltProperty("wonUnitCount", "desiredUnitCount"), Restrictions
    .eq("reserveMet", false)));

I needed to protect a bunch of parameters against being null. The eye scanning the following code can clearly see the “==” and the “=” in all the statements, thus one can see that all the statements are related and do similar things.

if (offeredUnitCount     == null) offeredUnitCount     = 1;
if (startCentsPerUnit    == null) startCentsPerUnit    = 1;
if (reserveCentsPerUnit  == null) reserveCentsPerUnit  = startCentsPerUnit;
if (autoBidIntervalCents == null) autoBidIntervalCents = 1;

However, the reformatted code is not only twice as long (thus meaning that some statements and methods which one would otherwise be able to see now scroll off the bottom of the monitor) but the reader is forced to examine each statement to see what they do as “==” and “=” visual pattern is no longer there.

if (offeredUnitCount == null)
    offeredUnitCount = 1;
if (startCentsPerUnit == null)
    startCentsPerUnit = 1;
if (reserveCentsPerUnit == null)
    reserveCentsPerUnit = startCentsPerUnit;
if (autoBidIntervalCents == null)
    autoBidIntervalCents = 1;

A method was going to throw a bunch of Exceptions but didn’t do so yet. This was just “in progress code”.

* @throws CentsPerUnitBeyondSystemMaximumException if either startCentsPerUnit
* or autoBidMaxCentsPerUnit are greater than the system maximum.
*
* TODO - throws DesiredUnitCountTooLowException(min)
* TODO - throws DesiredUnitCountGreaterThanOfferedUnitCountException
* TODO - throws StartCentsPerUnitTooLowException(min)
* TODO - throws CannotBidOnOwnItemException
* TODO - throws MustIncreaseUnitCountOrChangeMaxBid

Afterwards, one doesn’t know what’s going on at all.

* @throws CentsPerUnitBeyondSystemMaximumException if either startCentsPerUnit or autoBidMaxCentsPerUnit are greater than the
* system maximum. TODO - throws DesiredUnitCountTooLowException(min) TODO - throws
* DesiredUnitCountGreaterThanOfferedUnitCountException TODO - throws StartCentsPerUnitTooLowException(min) TODO - throws
* CannotBidOnOwnItemException TODO - throws MustIncreaseUnitCountOrChangeMaxBid

The method here takes few arguments. Listed in this way it’s clear that the function takes four arguments.

public static List<AuctionForSeller> getAuctionListForSeller(
    AuctionListTypeForSeller type,
    Member seller,
    int firstIndex,
    int itemsPerPage
) {

Afterwards there are 3 arguments on the first line and 1 argument on the last line. It’s almost as if the 4th argument is somehow of special significance; that’s the way it appears to the eye (even before one thinks about it with the conscious brain). But that’s not the case.

public static List<AuctionForSeller> getAuctionListForSeller(AuctionListTypeForSeller type, Member seller, int firstIndex,
    int itemsPerPage) {