I like things happening in parallel, but without the mental effort of thinking about synchronization, race conditions and so on; i.e. where one can write code that basically looks like sequential code but is actually parallel. Futures, backed by threads, which are trivial to implement in Java, are great at this.
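For example, here's a minimal sketch of that idea using the JDK's own ExecutorService and Future (the sleep is just a stand-in for slow work):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newCachedThreadPool();

        // Submitting a Callable returns immediately with a Future;
        // the work runs on a background thread.
        Future<String> greeting = executor.submit(new Callable<String>() {
            public String call() throws Exception {
                Thread.sleep(100); // stand-in for slow work
                return "hello";
            }
        });

        // The calling code reads like sequential code:
        // get() blocks only if the result isn't ready yet.
        System.out.println(greeting.get());

        executor.shutdown();
    }
}
```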
My employee Martin’s written some code to fetch data (friends) from various connected accounts, which currently can include:
- Google Contacts
- and potentially others in the future
These lookups could take time:
- Facebook’s systems might take e.g. one second to deliver the information,
- and there is potentially latency between our system and these various external systems.
To allow a faster user experience, he suggested doing all these lookups across all networks simultaneously in threads, and only returning when all threads are finished. Good suggestion.
I further suggested not writing a routine wrapping all the social networks (the “social network lookup code”) which starts all the threads and then waits for them to finish. Instead, each network starts a thread of its own and immediately returns a future, which blocks until the thread is finished only when some client actually accesses the data.
Why? There is also a huge graph of data displayed over all users (maybe up to 1k users), in which “friends” are highlighted. Displaying the graph involves a massive SELECT against the database. Using futures means the SELECT can also do its work in a thread and return immediately with a future. This means that, without any of the social networks or the SELECT knowing about one another, all these things can run in parallel.
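As a sketch of that idea (the method names and 100 ms delays are invented for illustration; this isn't our real code), each lookup and the SELECT submit their work to a thread pool and return a Future immediately, and only the final get() calls block:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLookups {
    static final ExecutorService EXECUTOR = Executors.newCachedThreadPool();

    // Hypothetical network lookup: starts its own thread and returns a
    // Future immediately, knowing nothing about any other work in flight.
    static Future<List<String>> fetchFriends(final String network) {
        return EXECUTOR.submit(new Callable<List<String>>() {
            public List<String> call() throws Exception {
                Thread.sleep(100); // stand-in for external-system latency
                return Arrays.asList(network + "-friend");
            }
        });
    }

    // Hypothetical graph query: the massive SELECT also runs in its own thread.
    static Future<Integer> loadGraphRowCount() {
        return EXECUTOR.submit(new Callable<Integer>() {
            public Integer call() throws Exception {
                Thread.sleep(100); // stand-in for the big SELECT
                return 1000;
            }
        });
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();

        // All the work starts now, in parallel.
        Future<List<String>> google = fetchFriends("google");
        Future<List<String>> facebook = fetchFriends("facebook");
        Future<Integer> graph = loadGraphRowCount();

        // Only here do we block, as each result is actually needed.
        System.out.println(google.get());
        System.out.println(facebook.get());
        System.out.println(graph.get());

        // Total time is roughly the slowest task (~100 ms), not the sum (~300 ms).
        long elapsedMs = (System.nanoTime() - start) / 1000000;
        System.out.println(elapsedMs < 300);

        EXECUTOR.shutdown();
    }
}
```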
So far so good, but it gets better. He suggested using a Wicket AjaxLazyLoadPanel, which means the page gets displayed, and immediately upon loading, the page does an AJAX request to fetch the data within the panel. It’s a nice Wicket feature, and you program it the same as any other panel. The idea is that even if all the above operations take a long time, the page can be loaded quickly (albeit with a “loading” sign for the data).
So, what we’re going to do is start all those futures off when the original page is loaded, and the implementation to display the contents of the panel just uses the results of the futures (blocking if they’re not finished). This means all of the following are done in parallel:
- Hitting various external systems for friends: FB, LinkedIn, ….
- Hitting our database to get information for the graph
- Latency of the user’s internet connection to deliver the original page, and request the contents of the panel.
And all of these are things that use different resources (FB’s servers, our DB server, the client’s internet connection), so they will really happen in parallel (in contrast, e.g., to performing two disk-bound database operations in parallel).
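The end-to-end pattern can be sketched like this (the class and method names are invented for illustration; the real page is a Wicket page whose panel is an AjaxLazyLoadPanel, not a main method):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FriendsPage {
    static final ExecutorService EXECUTOR = Executors.newCachedThreadPool();

    private final Future<String> friends;
    private final Future<String> graphData;

    // Constructing the page starts all the futures; the page itself can be
    // delivered to the browser while this work is still running.
    public FriendsPage() {
        friends = EXECUTOR.submit(new Callable<String>() {
            public String call() throws Exception {
                Thread.sleep(100); // stand-in for the social network lookups
                return "friends";
            }
        });
        graphData = EXECUTOR.submit(new Callable<String>() {
            public String call() throws Exception {
                Thread.sleep(100); // stand-in for the graph SELECT
                return "graph";
            }
        });
    }

    // Called later, from the AJAX request that fills the lazy-load panel;
    // get() blocks only if a future hasn't finished yet.
    public String renderPanel() throws Exception {
        return friends.get() + " + " + graphData.get();
    }

    public static void main(String[] args) throws Exception {
        FriendsPage page = new FriendsPage();
        // ...meanwhile the browser downloads the page and issues the AJAX call...
        System.out.println(page.renderPanel());
        EXECUTOR.shutdown();
    }
}
```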
This algorithm rocks :-) Futures rock.
P.S. The algorithm admittedly only works on one server. But our web framework Wicket stores a lot of state in the server’s memory so it’s already a necessity to use sticky sessions (i.e. each click for each session goes to the same server). Fail-over (one server dies, session gets replicated and picked up by another server) won’t work for these futures. That’s an unlikely event, one that will be solved with one refresh of the browser, and is a price I’m willing to pay.
P.P.S. Caching the information would be another approach to speeding up the algorithm, and it’s something we might need to look into in addition. But caching has disadvantages: Do FB’s T&Cs allow us to cache their friends? Even if so, what about stale data? For the big database query, how do we invalidate that cache when a relevant row in the database gets changed? Caching will never parallelize the work with the latency of the client’s internet connection. And what about the first request, where no information is cached yet?