VPN fail

I tried to use my company laptop outside of the office for the first time. I clicked on the “VPN” option in the start menu which the company had installed on the laptop. Up came a browser, trying to access the VPN page (accessible from the Internet). However, it didn’t work, as the browser was trying to connect to the company’s web proxy. The company web proxy is obviously only available once one is in the company network, i.e. once the VPN is connected.

(The solution was to include the VPN website server in the list of sites not to use the company proxy for; however as I only just managed that, I wonder how many of the non-technical users within the company will manage it.)

Store data files in one big directory

A server filesystem can store millions of files in one directory. Unless you need to store more files than this, there is no need to artificially introduce an extra layer of directories to keep the number of files per directory down to a lower number.

Often one needs to store lots of files in a filesystem. For example, at Uboot, each user has a “nickpage”, and each of these nickpages requires a file, and there are about 4M users.

It is tempting to use a hierarchical directory structure for this, for example take the last two digits of the nickpage-ID and create directories for each value, then in each of these directories create directories for the 3rd/4th last digits of the nickpage-ID, and in that store the nickpage file, for example the 4,000,001st nickpage might be stored in a file like /var/nickpages/01/00/4000001.xml.

But this decision, seemingly so obvious, contains an implicit assumption, which is wrong. It assumes one cannot store more than a few hundred files in a directory.

In 2000 a consultant for tru64 UNIX told us that one should store no more than one million files in a directory. We stored (I think) about 0.5M files per directory, and it worked fine. We had over 1M page impressions per day. Modern day hardware is much faster (CPUs, disks).

  • tru64 used a tree structure for the relationship between filenames and the information about the file, similar to a normal index in a database table. (And database tables clearly support having more than a few hundred rows in a table!)
  • Solaris ZFS uses a hash structure to identify files. This is even quicker in some respects (as the tree doesn’t have to be traversed) and it requires fewer locks (as operations on two files do not have any common tree parent nodes) although no doubt re-hashing will have to be done if the directory expands beyond a certain limit (in contrast to the tree-approach).

I think one needs to do the following things with a directory containing data files:

  • Find a particular file, in which case it’s easier in terms of programming, and faster for the OS to deal with a flat structure, than with a structure with lots of intermediate directories.
  • Do an operation on all files, for example search for particular content within all files. In that case the time is spent looking into the files, the structure of the files in the directory does not matter.

Having worked with directory structures both in terms of programming and in terms of operations and live bug-fixing, I can say that it really is simpler to have simple directory structures, and really does work in production. Being able to vi <id> or new File(directory, id+”.xml”) is simpler.

Using intermediate directories is really just doing in programming what the OS does for you anyway.

P.S. I like padding IDs with zeros for example “0004000001.xml”, this means that files are always listed in numerical order if you sort them alphabetically. Although I assert this is something one rarely wants to do – takes a long time if you store files flat, and isn’t possible at all if you store files hierarchically.

XSS attack via unchecked image uploads

If you allow users of your website to upload data (e.g. images), and you display this data to other users, you need to open the file on the server to examine it and check that it really is what it should be (e.g. an image).

Most website software will need to examine the image anyway, to extract thumbnails, determine width/height, etc. In which case, this security comes for free. But I’ve seen software which doesn’t have any such needs, and thus server-side examination is not done.

The reason is:

  • An uploading user (attacker) may upload a file such as “xxx.jpg” whose contents are in fact HTML/Javascript rather than an image
  • The web server may serve this file with the extension in the URL and the Content-Type: image/jpeg, however…
  • Internet Explorer may under some circumstances ignore these two pieces of information, instead preferring to look at the bytes of the file, and work out what type of content the file contains
  • Even if the image looks right in the <img> tag, if the viewing user right-clicks and views the image in its own window, then the data will be interpreted as HTML/Javascript
  • This gives the file the capability to read the viewing user’s cookies, look into forms in other windows, etc., as the HTML/Javascript appears to have been served by the site the viewing user is viewing.
  • Through having e.g. <img> tags (not affected by single-source policy) the file can send e.g. GET requests to an external site and transport this cookie and form information.
  • Thus the attacker’s external site now has cookie/form information, and the attacker can use this to impersonate the user at the site.

I was unaware of this before @ch2500 brought this my attention, thanks! More information.

Letters

The UNIX program “mail”, not the mail client one would use by choice under most circumstances. However today I have the pleasure of using a Solaris box which seems to be filled only with legacy tools like original “vi”, a “tail” where if you do “tail -f *” it only tails the first matching file and not all files.. but I digress.

$ mail
mail: Too many letters, overflowing letters concatenated

I mean ignoring the error itself and the disrespect which it forces me to show this program, look at the wording.

The error involves the word “letters”, i.e. the program was written before the word “mail” was the standard word for internet mail.

Humans reading source is not a strategy to defeat viruses

Apple’s policy of reviewing all apps is officially (amongst other reasons probably) to make sure apps don’t contain viruses etc.

However here we clearly see that having humans look over source code of all apps does not catch all policy violations (violations include writing viruses, and include other things as in this case):
http://www.mobilecrunch.com/2010/08/12/apple-pulls-camera-from-the-app-store-after-its-developers-reveal-a-contraband-feature/

If they can’t catch this policy violation, they won’t be able to catch all viruses either. (And they would need to catch all viruses to make it worth looking for viruses at all; catching all but one viruses is not sufficient. A single virus can do a lot of damage and infect widely.)

Apple not allowing one to just write and distribute code on their iPhone platform is an outrageous limitation of freedom. Presumably they have their reasons for doing so (e.g. so they make money on every commercial purchase of an application), but at least that reason they claim for doing it (protecting the public) is demonstrably not true.

(P.S. I’m not proposing a strategy which will defeat viruses; I’m just saying that Apple’s app store policy is not one.)

(P.P.S. Game console manufacturers have always had similar restrictions, those limitations of freedom are just as outrageous.)

Multi-core RISC OS proposal

I was reading an article about the RISC OS operating system and the fact that it doesn’t support multiple CPU cores. Then I got to thinking, how would one implement multi-CPU support on such an operating system?

Between 1989 and 1994 I used a British computer running the operating system RISC OS. This operating system is still alive, and is still in use. It was produced by Acorn Computers and runs on the ARM processor (“Acorn Risc Machine”), now found in many computing platforms such as the iPhone. It has many innovative features which haven’t made it yet to other operating systems (although is the topic for a post of its own!)

Back then there was obviously no need to support multiple CPUs. So today one has the choice of either finding single-core CPUs (will become increasingly difficult) or using a multi-core CPU with all but one core inactive (which is a waste). Various fora speculate on the technical possibility or impossibility of utilizing the extra cores. I don’t think it would be especially difficult. This article describes how I would go about it.

Constraints and information about RISC OS:

  • RISC OS can run multiple programs at once but yet has no concept of pre-emptive multi-tasking, threads, processes, etc.
  • It’s extremely old, therefore extremely fast on modern processors (even running on only one core)
  • Many parts of the operating system, such as those to do with the filesystem, are marked as “non re-entrant” in the documentation, that means they would need to be re-written or altered to work with pre-emptive multi-tasking or simultaneous execution (multi-core), for example locks would need to be introduced.

For me at least, I would say that the way the system currently operates on a single core is already very fast (when running at less than full speed via emulation on an x86 processor – it runs faster than it did on native hardware in the 1990s, it runs faster than Windows ran in the 1990s, and faster than Windows runs today.) So I think for most tasks, there is no need to use additional cores.

I am used to working in companies with limited human resources, and due to the fact that RISC OS isn’t a mainstream operating system, I am going to assume that this would also be a constraint with any solution to introduce multi-core capabilities to RISC OS. That means a complete re-write of RISC OS and applications, or even visiting every single part of the operating system and introducing locking and the expectation that any function can be called at any time due to pre-emptive multi-tasking and simultaneous execution by multiple cores, is also out of the question.

So my solution would be to assume that those extra cores should do “pure CPU” work, for example video processing, rendering fractals, or other CPU-intensive work. All work like accessing the filesystem, interacting with the user, which is already fast enough, would not be given the opportunity to work even faster by being executed in parallel on multiple cores.

I would write a very simply layer, similar to a hypervisor, which would run under RISC OS, which would simply handle allocation of CPU and memory resources. This supervisor would allow processes to run, which would not share memory (and thus not need to support shared-memory primitives such as locking). Pipes would allow processes managed by the supervisor to communicate.

The first, and often the only process, which this supervisor would manage, would be the entire existing RISC OS and all applications. This process would manage the user-interface, network, filesystem, printers and communication between all running RISC OS applications. So this “RISC OS + applications” process would run largely unaltered from the existing software (saving much development time; but also providing no benefit in terms of speed increase).

But there would be an extra API available to applications to spawn new processes managed by the supervisor, which would execute asynchronously, and primitives to pass/receive messages from those spawned processes. The supervisor would run these processes on the different cores, and pre-emptively multi-task the processes in the case of more processes than cores.

I would take the approach to concurrency of “message-passing with no shared memory” as opposed to “shared memory (threads)” for two reasons:

  • Existing RISC OS lacks the primitives to deal with shared memory and pre-emptive concurrency (such as locking), so it would not be possible to allow the spawned processes to interact with the main memory of the RISC OS applications. (Although it would be theoretically possible to allow the spawned processes to themselves have threads, although that would require introducing extra concepts such as locking to the supervisor.)
  • It might be argued one would need to support shared-memory computation to allow existing software to be easily ported to the platform, however this isn’t really a concern as porting software to RISC OS is next to impossible for other reasons: there is no POSIX, and the concepts provided by POSIX such as streams unifying filesystem access, inter-process communication, network access, are not available on RISC OS (and I would not advocate building them in the supervisor.)

I would allow “pipes” to be created between the processes and the pipes store an ordered set of “messages”, each message is simply a bunch of bytes. RISC OS software is generally written in C so just passing the byte-contents of a “struct” is easy and convenient. (This is the Erlang philosophy: Facebook say “Erlang… approaches concurrency with three iron rules: no shared memory even between processes on the same computer, a standard format for messages passed between processes, and a guarantee that messages are read in the order in which they were received.”)

This proposal has the following consequences:

  • The supervisor would be reasonably easy to write
  • RISC OS and applications would not need to be re-written. The application developer can, at a time of their choosing, decide to take advantage of the new feature, and surgically alter their code by altering only that part of the code where the speed improvement is desired (i.e. processing logic, not the GUI)
  • The supervisor as described would not consume many resources (i.e. the system would not suddenly become radically slower due to the introduction of this layer.)
  • The system would work as well on single-CPU systems as on multi-CPU systems (so software written with the new system can work on older single-core hardware)
  • One could envisage such software also running across multiple networked computers, so if a colleague’s computer was not in use (e.g. they were on holiday) ones own computer could become faster.

Admittedly there although there is no performance loss, there is also no performance gain until apps are extended. However I think spending the work to add pre-emptive multi-tasking to RISC OS (a huge undertaking) simply isn’t necessary: RISC OS works well as it is, and arguably is a lot more reliable (in terms of repeatable and therefore testable behaviour) due to its cooperative nature. The original designers of RISC OS could have created a pre-emptive multi-tasking operating system instead of RISC OS — the hardware supported it (RISC/iX ran on it) — but they actively decided not to do that.

See also:

Sudoku Solver

The above is a Sudoku solver I wrote in Java a while back. Here it is running in the browser. The Java has been converted to in-browser Javascript by Google Web Toolkit.

Please feel free to type in the hardest Sudokus you can find anywhere, and let me know if it can solve them, I haven’t found any it can’t yet.

I assume the in-page Javascript will not work in RSS readers, email notifications, etc. If that’s the case, click here to get to the original Sudoku solver page on my blog.

The algorithm works by backtracking, i.e. simply going through all squares from top-left to bottom-right, and finding the set of “candidate numbers” for that square (i.e. which numbers are not already used on that row, column or 3×3 block). The first candidate number is “applied” by placed it into the square, and the algorithm moves to the next square. If there is a square where the set of candidate numbers is empty (i.e. no number would work, given the candidates applied in previous squares) then the system returns to the previous square and tries the next candidate for that square. Once the system has gone “beyond” the bottom-right square, then the set of applied candidates consitutes a solution. However, backtracking continues, if the system reaches beyond the bottom-right square again, then the previously found solution
was not unique, and the algorithm terminates. Once the system has backtracked back to the start, if a single solution was found then it is a “unique solution” (success case), otherwise if there are no solutions then the Sudoku is unsolvable.

By the way, Internet Explorer really isn’t that fast; I appreciate that is well-known to anyone who’s ever read Google Chrome marketing material! But the Sudoku here takes 708ms to solve on IE8, 334ms on Firefox 3.6, about 100ms on Safari 4 and Opera 10, and 49ms on Chrome 4, on this 2.3GHz/core Intel Windows Vista computer.

Update: Alas there were Sudokus which took too long for the original synchronous approach (i.e. user clicks, do calculation, update the screen with the results). So now, in the case the calculation takes too long, a “worker” is spawned, which does the calculation, and the spinner is displayed in that case.

Download the source here.

Mistake

I sent an invoice to an agent. The first page of the PDF was the normal invoice stuff – my name and address, their name and address, the number of hours worked, hourly rate, total amount, etc. The second page was, as they had requested, the hours that I’d worked; each day on a separate line with the start/end time and number of hours lunch.

Obviously I’m not an expert on the German language and tone is always difficult to gauge but they replied:

mit Ihrem Anhang kann ich nichts anfangen, da es weder eine Tagesstundensumme noch eine Monatsumme, noch Ihren Name, noch das Projekt aufweist. …  bitte ich Sie, mir eine ordnungsgemäße Stundenaufzeichnung zu schicken

My name was on the first page, why do they need it on the second as well? They want to know how many hours I’ve worked per day, well there is an invention known as the spreadsheet, and that comes with a SUM(..) function.

But it was the tone which annoyed me the most, I mean if they need this info, they can just ask me, I’m more than happy to provide it.

I mean I guess one has to decide, does one just “take it” (signalling that this sort of communication is acceptable) or does one “attack” (leading potentially to disharmony, or, in my case, possible inappropriateness if I’ve misinterpreted the tone). But I don’t want to just always have to “take it” because I’m not confident enough about my German.

So I amended the PDF and sent it to them, but wrote to them in the mail that I assumed they would already know my name, as it was written on the first page…. (The bit about spreadsheets having been invented I left out.)

So I sent that off, was looking forward to see what the reply would be, when I realized that every email I wrote to them I started with “Sehr geehrter Herr X” and every email they wrote to me they signed “Frau X”, nice one…. I think I have my explanation as to why they were so displeased with me… and tomorrow they’ve got my nice aggressive answer waiting in their inbox, also addressed to Herr X …

My mate came across the following HTML

<select class="selectTitle" id="C5" name="C5">
  <option value="">--select--</option>
  <option value="1">Mr</option>
  <option value="2">Ms</option>
  <option value="3">Mrs</option>
  <option value="4">Miss</option>
  <option value="5">Doctor</option>
  <option value="6">Judge</option>
  <option value="7">Professor</option>
  <option value="8">Councillor</option>
  <option value="9">Madam</option>
  <option value="10">Mr &amp; Mrs</option>

Read the rest of this entry »

The special variable “_”

Reading this blog post, Destructuring binds in Ruby just now reminded me of a feature I love about Prolog which I wish would make it into other languages.

Firstly, I love assigning a list to a list of lvalues i.e. variables; this is possible in both PHP and Perl which I use regularly; and no doubt many other languages. (But not Java: why not!?)

    ($a, $b) =      ($b, $a);  // Perl
list($a, $b) = array($b, $a);  // PHP

PHP, as always, wins in inelegance, having the left side syntactically different to the right side. While it’s obviously the case that a list of values and a list of lvalues are technically different, I don’t think this difference should be expressed in the syntax.

I mean, in most languages you write e.g. $foo=4 and $bar=$foo; in both those cases you write $foo but yet they do something different (lvalue and rvalue); given that you write them the same there i think the same should apply to lists.

But I digress – What I want to mention is using “underscore” to mean “any variable”. I first saw this in Prolog.

For example, imagine you have to implement an interface (e.g. in Java), it requires you to write a function taking two parameters, but one of the parameters you don’t care about. Wouldn’t it be nice to write

interface ExistingApi {
   public void createObject(String name, Object otherInformation);
}

class MyInstanceOfTheApi {
   public void createObject(String name, _) {
      ...
   }
}

i.e. this shows clearly you do not care about the second parameter.

In current Java (and all languages I program including Perl, PHP) you have to give all variables a name even if you don’t use them, either in function definitions or in “assign to a list” scenarios mentioned above. It is then left as an exercise to the reader to determine if these variables are used or not, and indeed an exercise to the writer to name the variable they are never going to use.

I mean yes, technically you can actually just call variables “_” (or “$_” (except in Perl where “$_” already means something)) but that would then be a coding convention as opposed to a language feature, and who knows if the coding convention is actually used correctly by a programmer. (If “_” is a variable there’s nothing stopping someone from using its value.)

And then you have the problem if you have two variables you don’t care about, you can’t call them both “_”.

By the way, Programming in Prolog is an excellent book.