Algorithm: Recognizing URLs within plain text, and displaying them as clickable links in HTML, in Wicket

Out of necessity for a customer project, I have just written code which takes user-entered plain text and turns it into HTML, with any URLs marked up as clickable links.

Although marking up links in user-entered text is standard functionality, Stack Overflow would have you believe that it's something that should not be attempted, as it cannot be done perfectly. This is technically correct. However, users are accustomed to software which makes a best-effort attempt, and customers are accustomed to taking delivery of software which meets users' expectations.

The software I have written is available as open source, either as a Java class with the method encodeLinksToHtml, which takes some plain text and returns safe HTML with clickable links, or as a component for the Wicket web framework called MultilineLabelWithClickableLinks.

Users may enter URLs with or without a protocol (http://). Domains may or may not have www at the start. There may or may not be a trailing slash. There may or may not be further information (a path or query) after the domain. Having a whitelist of acceptable domain endings such as ".com" is a bad idea, as the list is large and subject to change over time. Punctuation after a link should not be included in the link (for example "see foo.com.", where the trailing dot is not part of the URL).

The software matches foo://foo.foo/foo, where:

Quotes are not allowed anywhere in the matched URL, because in <a href="foo"> a quote inside foo would end the attribute early and allow markup to be injected (XSS).
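To make the requirements above concrete, here is a minimal sketch of a matcher along those lines. It is not the code from the released class: the class name UrlFinderSketch, the regular expression and the set of trailing punctuation characters are all assumptions made for illustration. In particular, this sketch only accepts http:// and https:// as protocols, which is a more conservative choice than matching any foo://.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the released implementation.
class UrlFinderSketch {

    // Optional protocol, then a host of at least two dot-separated labels,
    // then an optional path. Whitespace, quotes and angle brackets are
    // excluded from the path, so a match can never contain a quote.
    static final Pattern URL = Pattern.compile(
        "(?:https?://)?"                         // optional protocol (assumption: http/https only)
        + "[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+"   // host such as foo.com or www.foo.com
        + "(?:/[^\\s\"'<>]*)?");                 // optional path, including a bare trailing slash

    // Returns every URL-like substring of the text, without trailing punctuation.
    static List<String> findUrls(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            result.add(trimTrailingPunctuation(m.group()));
        }
        return result;
    }

    // "see foo.com/a." should yield foo.com/a, not foo.com/a.
    static String trimTrailingPunctuation(String url) {
        int end = url.length();
        while (end > 0 && ".,;:!?)".indexOf(url.charAt(end - 1)) >= 0) {
            end--;
        }
        return url.substring(0, end);
    }
}
```

Because there is no whitelist of domain endings, a pattern like this will also match things that merely look like domains, such as "e.g" in "e.g." or the domain part of an email address. That is the best-effort trade-off described above.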

Facts:

Therefore,

Therefore, the replacement of HTML entities, and the replacement of links, must be done in a single (complicated) pass, rather than two (simple) passes.
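The single pass can be sketched as follows: walk through the plain text once, escape the stretches between links as ordinary text, and emit each link as an anchor whose href and label are escaped separately. Linking first and escaping afterwards would escape the generated <a> tags themselves, while escaping first would mean matching URLs against text already containing entities such as &amp;amp;. This is only an illustration under assumptions, not the released encodeLinksToHtml: the class name LinkEncoderSketch, the pattern, the http:// default for protocol-less links and the restriction to http/https are all hypothetical choices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the released implementation.
class LinkEncoderSketch {

    // Assumed pattern: optional http/https protocol, dotted host, optional quote-free path.
    static final Pattern URL = Pattern.compile(
        "(?:https?://)?[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+(?:/[^\\s\"'<>]*)?");

    // Escapes the characters that are significant in HTML text and attribute values.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // One pass over the input: plain stretches are escaped, links become anchors.
    static String encode(String plainText) {
        StringBuilder html = new StringBuilder();
        Matcher m = URL.matcher(plainText);
        int pos = 0; // start of the text not yet written to the output
        while (m.find()) {
            String url = m.group();
            // Drop trailing punctuation; it is written out later as ordinary text.
            int end = url.length();
            while (end > 0 && ".,;:!?)".indexOf(url.charAt(end - 1)) >= 0) {
                end--;
            }
            url = url.substring(0, end);

            html.append(escapeHtml(plainText.substring(pos, m.start())));
            String href = url.contains("://") ? url : "http://" + url;
            html.append("<a href=\"").append(escapeHtml(href)).append("\">")
                .append(escapeHtml(url)).append("</a>");
            pos = m.start() + end;
        }
        html.append(escapeHtml(plainText.substring(pos)));
        return html.toString();
    }
}
```

For example, calling LinkEncoderSketch.encode on the text "see foo.com." links only foo.com and leaves the final dot outside the anchor.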


This article is © Adrian Smith.
It was originally published on 12 Mar 2014