Category Archives: Software engineering

The problem with Backing Stores, or what is NoSQL and why would you use it anyway

Durability is something you normally want somewhere in a system: the data should survive reboots, crashes, and the other sorts of things that routinely happen to real-world systems.

Over the many years that I have worked in system design, how to handle this durable data has been a recurring thorny problem.  In practice, when building a new system, it comes down to the question “what should we use as our backing store?”.  Backing stores are often called “databases”, but everyone has a different view of what “database” means, so I’ll try to avoid the word for now.

In a perfect world a backing store would be:

  • Correct
  • Quick
  • Always available
  • Geographically distributed
  • Highly scalable

While we can do these things quite easily these days with the stateless parts of an application, doing them with durable data is non-trivial. In fact, in the general case, it’s impossible to do all of these things at once (The CAP theorem describes this quite well).

This has always been a challenge, but as applications move onto the Internet, and as businesses become more geographically distributed, the problem has become more acute.

Relational databases (RDBMSes) have been around a very long time, but they’re not the only kind of database you can use. There have always been other kinds of store around, but the so-called NoSQL Movement has had particular prominence recently. This champions the use of new backing stores not based on the relational design, and not using SQL as a language. Many of these have radically different designs from the sort of RDBMS system that has been widely used for the last 30 years.

When and how to use NoSQL systems is a fascinating question, and here I’ll put forward our thinking on it. As always, it’s kind of complicated.  It certainly isn’t the case that throwing out an RDBMS and sticking in Mongo will make your application awesome.

Although these systems are lumped together as “NoSQL”, that is not actually a useful definition, because there is very little they all have in common. Instead I suggest there are three types of NoSQL backing store available to us right now:

  • Document stores – MongoDB, XML databases, ZODB
  • Graph databases – Neo4j
  • Key/value stores – Dynamo, BigTable, Cassandra, Redis, Riak, Couch

These are so different from each other that lumping them into the same category is really quite unhelpful.

Graph databases

Graph databases have some very specific use cases, for which they are excellent, and probably a lot of utility elsewhere. However, for our purposes they’re not something we’d consider generally, and I’ll not say any more about them here.

Document stores

I am pretty firmly in the camp that Document stores, such as MongoDB, should never be used generally either (for which I will undoubtedly catch some flak). I have a lot of experience with document databases, particularly ZODB and dbxml, and I know whereof I speak.

These databases store “documents” as schema-less objects. What we mean by a “document” here is something that is:

  • self-contained
  • always required in its entirety
  • more valuable than the links between documents or its metadata.

My experience is that although you may often think you have documents in your system, in practice this is rarely the case, and it certainly won’t continue to be the case. Often you start with documents, but over time you gain more and more references between documents, and then you gain records and all sorts of other things.

Document stores are poor at handling references, and because of the requirement to retrieve things in their entirety you denormalise a lot. The end result of this is loss of consistency, and eventually doom with no way of recovering consistency.

We do not recommend document stores in the general case.

Key/value stores

These are the really interesting kind of NoSQL database, and I think these have a real general potential when held up against the RDBMS options.  However, there is no magic bullet and you need to choose when to use them carefully.

You have to be careful when deciding to build something without an RDBMS. An RDBMS delivers a huge amount of value in a number of areas, and for all sorts of reasons. Many of the reasons are not because the RDBMS architecture is necessarily better but because they are old, well-supported and well-understood.

For example, PostgreSQL (our RDBMS of choice):

  • has mature software libraries for all platforms
  • has well-understood semantics for backup and restore, which work reliably
  • has mature online backup options
  • has had decades of performance engineering
  • has well understood load and performance characteristics
  • has good operational tooling
  • is well understood by many developers

These are significant advantages over newer stores, even if they might technically be better in specific use cases.

All that said, there are some definite reasons you might consider using a key/value store instead of an RDBMS.

Reason 1: Performance

Key/value stores often naively appear more performant than RDBMS products, and you can see some spectacular performance figures in direct comparisons. However, none of them really provides a magic performance increase over an RDBMS; what they do is provide different tradeoffs. You need to decide where the performance tradeoffs lie for your particular system.

In practice what key/value stores mostly do is provide some form of precomputed cache of your data, by making it easy (or even mandatory) to denormalise your data, and by providing the performance characteristics to make pre-computation reasonable.

If you have a key/value store that has high write throughput characteristics, and you write denormalized data into it in a read-friendly manner then what you are actually doing is precomputing values. This is basically Just A Cache. Although it’s a pattern that is often facilitated by various NoSQL solutions, it doesn’t depend on them.

RDBMS products are optimised for correctness and query performance, and write performance takes second place to these.  This means they are often not a good place to implement a pre-computed cache (where you often write values you never read).

It’s not insane to combine an RDBMS as your master source of data with something like Redis as an intermediate cache.  This can give you most of the advantages of a completely NoSQL solution, without throwing out all of the advantages of the RDBMS backing store, and it’s something we do a lot.
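For illustration, here is a minimal sketch of that combination, assuming a PostgreSQL `users` table and Redis as a read-through cache in front of it. The table, key names and TTL are invented for the example.

import json

import psycopg2
import redis

db = psycopg2.connect("dbname=app")   # the authoritative source of truth
cache = redis.StrictRedis()           # denormalised, precomputed copies
CACHE_TTL = 300                       # seconds; arbitrary for the example

def get_user_profile(user_id):
    # Read-through cache: try Redis first, fall back to PostgreSQL on a miss.
    key = "user_profile:%d" % user_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    cur = db.cursor()
    cur.execute("SELECT name, email FROM users WHERE id = %s", (user_id,))
    name, email = cur.fetchone()
    profile = {"name": name, "email": email}
    # Store the read-friendly, denormalised form with a bounded lifetime.
    cache.setex(key, CACHE_TTL, json.dumps(profile))
    return profile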

Reason 2: Distributed datastores

If you need your data to be highly available and distributed (particularly geographically) then an RDBMS is probably a poor choice. It’s just very difficult to do this reliably and you often have to make some very painful and hard-to-predict tradeoffs in application design, user interface and operational procedures.

Some of these key/value stores (particularly Riak) can really deliver in this environment, but there are a few things you need to consider before throwing out the RDBMS completely.

Availability is often a tradeoff one can sensibly make.  When you understand quite what this means in terms of cost, both in design and operational support (all of these vary depending on the choices you make), it is often the right tradeoff to tolerate some downtime occasionally.  In practice a system that works brilliantly almost all of the time, but goes down in exceptional circumstances, is generally better than one that is in some ways worse all of the time.

If you really do need high availability though, it is still worth considering a single RDBMS in one physical location with distributed caches (just as with the performance option above).  Distribute your caches geographically, offload work to them and use queue-based fanout on write. This gives you eventual consistency, whilst still having an RDBMS at the core.

This can make sense if your application has relatively low write throughput, because all writes can be sent to the single location RDBMS, but be prepared for read-after-write race conditions. Solutions to this tend to be pretty crufty.
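As a sketch of that shape (names invented, and Redis pub/sub standing in for whichever queue or fan-out layer you actually use): all writes go to the single-location RDBMS, and an update message is fanned out to cache nodes sitting near the users. A reader that hits its local cache before the message arrives sees stale data, which is the read-after-write race mentioned above.

import json

import psycopg2
import redis

db = psycopg2.connect("dbname=app")   # the single, authoritative RDBMS
bus = redis.StrictRedis()             # stands in for your queue/fan-out layer

def save_profile(user_id, name):
    # 1. All writes go to the one central database...
    cur = db.cursor()
    cur.execute("UPDATE users SET name = %s WHERE id = %s", (name, user_id))
    db.commit()
    # 2. ...then fan out so the remote caches converge (eventual consistency).
    bus.publish("profile-updates", json.dumps({"id": user_id, "name": name}))

def run_cache_node(local_cache):
    # Runs in each remote location; applies updates to its local cache as
    # they arrive off the queue.
    pubsub = bus.pubsub()
    pubsub.subscribe("profile-updates")
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        update = json.loads(message["data"])
        local_cache["user:%d" % update["id"]] = update["name"]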

Reason 3: Application semantics vs SQL

NoSQL databases tend not to have an abstraction like SQL. SQL is decent in its core areas, but it is often really hard to encapsulate some important application semantics in SQL.

A good example of this is asynchronous access to data as part of a calculation. It’s not uncommon to need to query external services, but SQL really isn’t set up for this. Although there are some hacky workarounds, if you have a microservice architecture you may find SQL really doesn’t do what you need.

Another example is staleness policies.  These are particularly problematic when you have distributed systems with parts implemented in other languages such as Javascript, for example if your client is a browser or a mobile application and it encapsulates some business logic.

Endpoint caches in browsers and mobile apps need to represent the same staleness policies you might have in your backing store and you end up implementing the same staleness policies in Javascript and then again in SQL, and maintaining them. These are hard to maintain and test at the best of times. If you can implement them in fewer places, or fewer languages, that is a significant advantage.
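To make “staleness policy” concrete, here is the sort of thing that ends up duplicated across layers. The record types and lifetimes are invented for illustration; the point of the argument above is that you want to write this once, not once per language.

# Maximum acceptable age, in seconds, per kind of record.
STALENESS_POLICY = {
    "user_profile": 60 * 60,        # an hour is fine
    "account_balance": 5,           # must be nearly fresh
    "country_list": 60 * 60 * 24,   # changes rarely
}

def is_stale(record_type, fetched_at, now):
    # A cached record is stale once it is older than its allowed age.
    return (now - fetched_at) > STALENESS_POLICY.get(record_type, 0)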

In addition, it is a practical case that we’re not all SQL gurus. Having something that is suboptimal in some cases but where we are practically able to exploit it more cheaply is a rational economic tradeoff.  It may make sense to use a key/value store just because of the different semantics it provides – but be aware of how much you are losing without including an RDBMS, and don’t be surprised if you end up reintroducing one later as a platform for analysis of your key/value data.

Reason 4: Load patterns

NoSQL systems can exhibit very different performance characteristics from SQL systems under real loads. Having some choice in where load falls in a system is sometimes useful.

For example, if you have something that scales front-end webservers horizontally easily, but you only have one datastore, it can be really useful to have the load occur on the application servers rather than the datastore – because then you can distribute load much more easily.

Although this is potentially less efficient, it’s very easy and often cheap to spin up more application servers at times of high load than it is to scale a database server on the fly.

Also, SQL databases tend to have far better read performance than write performance, so fan-out on write (where you might have 90% writes to 10% reads as a typical load pattern) is probably better implemented using a different backing store that has different read/write performance characteristics.

Which backing store to use, and how to use it, is the kind of decision that can have huge ramifications for every part of a system.  This post has only had an opportunity to scratch the surface of this subject and I know I’ve not given some parts of it the justice they deserve – but hopefully it’s clear that every decision has tradeoffs and there is no right answer for every system.


Using mock.patch in automated unit testing

Mocking is a critical technique for automated testing. It allows you to isolate the code you are testing, which means you test what you think you are testing. It also makes tests less fragile because it removes unexpected dependencies.

However, creating your own mocks by hand is fiddly, and some things are quite difficult to mock unless you are a metaprogramming wizard. Thankfully Michael Foord has written a mock module, which automates a lot of this work for you, and it’s awesome. It’s included in Python 3, and is easily installable in Python 2.

Since I’ve just written a test case using mock.patch, I thought I could walk through the process of how I approached writing the test case and it might be useful for anyone who hasn’t come across this.

It is important to decide when you approach writing an automated test what level of the system you intend to test. If you think it would be more useful to test an orchestration of several components then that is an integration test of some form and not a unit test. I’d suggest you should still write unit tests where it makes sense for this too, but then add in a sensible sprinkling of integration tests that ensure your moving parts are correctly connected.

Mocks can be useful for integration tests too, however the bigger the subsystem you are mocking the more likely it is that you want to build your own “fake” for the entire subsystem.

You should design fake implementations like this as part of your architecture, and consider them when factoring and refactoring. Often the faking requirements can drive out some real underlying architectural requirements that are not clear otherwise.

Whereas unit tests should test very limited functionality, I think integration tests should be much more like smoke tests and exercise a lot of functionality at once. You aren’t interested in isolating specific behaviour, you want to make it break. If an integration test fails, and no unit tests fail, you have a potential hotspot for adding additional unit tests.

Anyway, my example here is a Unit Test. What that means is we only want to test the code inside the single function being tested. We don’t want to actually call any other functions outside the unit under test. Hence mocking: we want to replace all function calls and external objects inside the unit under test with mocks, and then ensure they were called with the expected arguments.

Here is the code I need to test, specifically the ‘fetch’ method of this class:

import abc
import logging
import urllib2

import error  # the application's error module, providing FetchFailedException (elided in the original post)

logger = logging.getLogger(__name__)


class CloudImage(object):
 
    __metaclass__ = abc.ABCMeta
 
    blocksize = 81920

    def __init__(self, pathname, release, arch):
        self.pathname = pathname
        self.release = release
        self.arch = arch
        self.remote_hash = None
        self.local_hash = None
 
    @abc.abstractmethod
    def remote_image_url(self):
        """ Return a complete url of the remote virtual machine image """
 
    def fetch(self):
        remote_url = self.remote_image_url()
        logger.info("Retrieving {0} to {1}".format(remote_url, self.pathname))
        try:
            response = urllib2.urlopen(remote_url)
        except urllib2.HTTPError:
            raise error.FetchFailedException("Unable to fetch {0}".format(remote_url))
        local = open(self.pathname, "w")
        while True:
            data = response.read(self.blocksize)
            if not data:
                break
            local.write(data)

I want to write a test case for the ‘fetch’ method. I have elided everything in the class that is not relevant to this example.

Looking at this function, I want to test that:

  1. The correct URL is opened
  2. If an HTTPError is raised, the correct exception is raised
  3. Open is called with the correct pathname, and is opened for writing
  4. Read is called successive times, and that everything returned is passed to local.write, until a False value is returned

I need to mock the following:

  1. self.remote_image_url()
  2. urllib2.urlopen()
  3. open()
  4. response.read()
  5. local.write()

This is an abstract base class, so we’re going to need a concrete implementation to test. In my test module therefore I have a concrete implementation to use. I’ve implemented the other abstract methods, but they’re not shown.

class MockCloudImage(base.CloudImage):
    def remote_image_url(self):
        return "remote_image_url"

Because there are other methods on this class I will also be testing, I create an instance of it in setUp as a property of my test case:

class TestCloudImage(unittest2.TestCase):

    def setUp(self):
        self.cloud_image = MockCloudImage("pathname", "release", "arch")

Now I can write my test methods.

I’ve mocked self.remote_image_url; now I need to mock urllib2.urlopen() and open(). The other things to mock are things returned from these mocks, so they’ll be automatically mocked.

Here’s the first test:

    @mock.patch('urllib2.urlopen')
    @mock.patch('__builtin__.open')
    def test_fetch(self, m_open, m_urlopen):
        m_urlopen().read.side_effect = ["foo", "bar", ""]
        self.cloud_image.fetch()
        self.assertEqual(m_urlopen.call_args, mock.call('remote_image_url'))
        self.assertEqual(m_open.call_args, mock.call('pathname', 'w'))
        self.assertEqual(m_open().write.call_args_list, [mock.call('foo'), mock.call('bar')])

The mock.patch decorators replace the specified functions with mock objects within the context of this function, and then unmock them afterwards. The mock objects are passed into your test, in the order in which the decorators are applied (bottom to top).

Now we need to make sure our read calls return something useful. Accessing any attribute or method on a mock returns a new mock, and the same mock is consistently returned for that attribute on every access. That means we can write:

m_urlopen().read

To get the read call that will be made inside the function. We can then set its “side_effect” – what it does when called. In this case we pass it a list of values, and it will return each of them in turn on successive calls.

Now we can call our fetch method, which will terminate because read eventually returns an empty string.

Now we just need to check each of our methods was called with the appropriate arguments, and hopefully that’s pretty clear how that works from the code above. It’s important to understand the difference between:

m_open.call_args

and

m_open().write.call_args_list

The first is the arguments passed to “open(…)”. The second is the list of arguments passed to:

local = open(); local.write(...)

Another test method, testing the exception, is very similar:

    @mock.patch('urllib2.urlopen')
    @mock.patch('__builtin__.open')
    def test_fetch_httperror(self, m_open, m_urlopen):
        m_urlopen.side_effect = urllib2.HTTPError(*[None] * 5)
        self.assertRaises(error.FetchFailedException, self.cloud_image.fetch)

You can see I’ve created an instance of the HTTPError exception class (with dummy arguments), and this is the side_effect of calling urlopen().

Now we can assert our method raises the correct exception.

Hopefully you can see how the mock.patch decorator saved me a spectacular amount of grief.

If you need to, it can be used as a context manager as well, with “with”, giving similar behaviour. This is useful in setUp functions particularly, where you can use the with context manager to create a mocked closure used by the system under test only, and not applied globally.
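For instance, here is a sketch (not from the original post) of the first test rewritten using the context manager form; the patches only apply inside the with block:

    def test_fetch_with_context_managers(self):
        # Same assertions as test_fetch above, but the patching is scoped to
        # the with block instead of being applied by decorators.
        with mock.patch('urllib2.urlopen') as m_urlopen, \
                mock.patch('__builtin__.open') as m_open:
            m_urlopen().read.side_effect = ["foo", "bar", ""]
            self.cloud_image.fetch()
            self.assertEqual(m_urlopen.call_args, mock.call('remote_image_url'))
            self.assertEqual(m_open.call_args, mock.call('pathname', 'w'))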


Generating sample unicode values for testing

When writing tests, there’s one thing we’re now sure to do because it’s caught us out so many times before: use unicode everywhere, particularly in Python.

Often, people will just go to Wikipedia and paste bits of unicode chosen at random into sample values, but that doesn’t always make for good readability when the tests you forgot to comment break two months later and you have to revisit them.

I’ve found that a simple way to generate test unicode values that make sense is to use an upside-down text generator, or some other l33t text transformer that produces unicode. If you use text describing whatever the sample value is supposed to represent, it’s still pretty legible at a glance and you’ll hopefully flag up those pesky UnicodeDecodeErrors quicker.
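For instance, a couple of hypothetical sample values produced that way; each one is just its own description flipped upside down, so it stays readable while still containing non-ASCII characters:

# -*- coding: utf-8 -*-
SAMPLE_USERNAME = u"ǝɯɐuɹǝsn"         # "username", flipped
SAMPLE_ADDRESS = u"ssǝɹppɐ ʇǝǝɹʇs"    # "street address", flipped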

There’s a handy list of different text transformations and websites that will perform them for you on Wikipedia.

Black Box BlackBerry

Debugging software is best done using the scientific method: gather evidence about the effects of the bug, conjure up hypotheses to explain the behaviour, experiment to test the hypotheses and modify the code to change the behaviour. Rinse and repeat. If you can’t consistently reproduce the bug though, it can get tricky.

Recently, while developing a site targeted at mobile devices, we came across an intermittent problem when using a BlackBerry device. Testing mobile sites with desktop browsers and emulators can only take you so far. Eventually you reach the point where real devices begin to exhibit their own peccadillos and so we use DeviceAnywhere to access a whole host of remote-controlled physical devices.

Using the BlackBerry Curve, occasionally, our login page wouldn’t proceed to the home page after successful authentication. But we could never reproduce this in our development environments, only on live; and only sometimes.

One major difference between the two environments was that the live one had dozens of servers behind a load-balancer which used a URL parameter for session affinity (we couldn’t assume all mobile devices would support cookies), whereas the development environment was a single server. We also had a staging environment which closely reproduced the live environment, although there were only a couple of servers behind its load-balancer. Initial tests on the staging environment indicated that the problem didn’t appear there either.

To rule out the mobile network provider, we installed the excellent Opera Mini browser on the BlackBerry and it worked every time. This also ruled out any issues with pages being cached by Akamai, the content delivery network. So we were now looking for a problem with our code interacting with the BlackBerry browser, but only behind our live load-balancer; sometimes.

After painstakingly tracing through the live Apache logs we closed in on the unexpected cause: a bug in the BlackBerry browser. When a server tells a browser to redirect it sends the full URL, including in our case the all-important session parameter. This URL was being tampered with before the browser navigated to it. The parameter name was being converted to lower-case (if it wasn’t preceded by a slash). This meant that the load-balancer didn’t use it for server affinity so the home page server probably didn’t have a logged-in session, and so it would bounce back to the login page.

The reason this problem had been so hard to reproduce was that in development there was only one server so affinity wasn’t an issue and the server software didn’t care about the case of the session parameter. Also the site URL was different and so the session parameter always had a preceding slash which didn’t trigger the BlackBerry URL tampering, so it never appeared as lower-case in the development logs. And on the staging environment, because there were only two servers, the device would hit the same server, notwithstanding any affinity failure caused by the lower-casing, half of the time by chance alone. The live environment was more likely to fail, but even it gave a sizeable probability of hitting the same server successively by chance alone.

We built a test server and, using some black box reverse-engineering (because the BlackBerry browser is closed-source), we reckon the logic inside the browser’s redirect code goes something like this: “lower-case all the characters in the location URL up to the first slash” presumably with the intention of making the DNS name lower-case. But it should be: “… up to the first slash or ?” to preserve the case of any query parameters.
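In rough Python, our best guess at the two behaviours looks like this (this is our reconstruction, not RIM’s code; `rest` stands for the part of the Location URL after the scheme prefix):

def buggy_normalise(rest):
    # Lower-case up to the first slash, intending to normalise the hostname.
    # If the redirect URL has no path ("Host.example.com?SessionId=ABC"),
    # there is no slash, so the query string is lower-cased as well.
    cut = rest.find('/')
    if cut == -1:
        return rest.lower()
    return rest[:cut].lower() + rest[cut:]

def fixed_normalise(rest):
    # Stop at the first slash *or* '?' so query parameters keep their case.
    positions = [i for i in (rest.find('/'), rest.find('?')) if i != -1]
    if not positions:
        return rest.lower()
    cut = min(positions)
    return rest[:cut].lower() + rest[cut:]

# buggy_normalise("Host.example.com?SessionId=ABC")
#   -> "host.example.com?sessionid=abc"   (affinity parameter mangled)
# fixed_normalise("Host.example.com?SessionId=ABC")
#   -> "host.example.com?SessionId=ABC"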

Googling for this issue returns a number of other sites having redirect and login issues with BlackBerrys. I wonder how many are caused by this subtle, case-sensitive bug?

We’ve since searched our logs and found the bug across this wide range of BlackBerry devices/versions:

  • BlackBerry8100/4.2.0
  • BlackBerry8100/4.5.0.52
  • BlackBerry8110/4.3.0
  • BlackBerry8120/4.5.0.52
  • BlackBerry8310/4.2.2
  • BlackBerry8700/4.2.1
  • BlackBerry8800/4.2.1
  • BlackBerry8820/4.2.2
  • BlackBerry8830/4.2.2
  • BlackBerry8900/4.6.1.101
  • BlackBerry8900/4.6.1.109
  • BlackBerry9000/4.6.0.125
  • BlackBerry9000/4.6.0.221

We’ve logged it with BlackBerry. I’ll post an update if we receive any response.

Some thoughts on concurrency

In an earlier post over on the Twisted blog, Duncan McGreggor asked us to expand a bit on where we think Twisted may be lacking in its support for concurrency. I’m afraid this has turned into a meandering essay, since I needed to reference so much background. It does come to the point eventually…

An unsolved problem

To many people it must seem as though “computers” are a solved problem. They seem to improve constantly, they do many remarkable things and the Internet, for example, is a wonder of the modern world. Of course there are screw ups, especially in large IT projects, and these are generally blamed on incompetent officials and greedy consulting firms and so on.

Although undoubtedly officials are incompetent and consultants are greedy, these projects are often crippled by the failure of industry to recognise that some of the core problems of systems design are an unsolved problem. Concurrency is one of the major areas where they fall down. Building an IT system to service a single person is straightforward. Rolling that same system out to service hundreds of thousands is not.

It may seem odd to people outside the world of software, but concurrency (“doing several things at once”) is *still* one of the hot topics in software architecture and language design. Not only is it not a solved problem, there’s still a lot of disagreement on what the problem even *is*.

Here’s a typical scenario in IT systems rollout; every experienced engineer will have been involved in one. A project seemed to be going OK: the software was substantially complete and people were talking about live dates. So the developers chuck it over the wall to the systems guys, so they can run some tests to work out how much hardware they’ll need.

And the answer comes back something like “we’re going to need one server per user” or “it falls over with four simultaneous users”. And I can tell you, if you get *that far* and discover this, the best option is to flee. Run for the hills and don’t look back.

Two worlds

There has always been a distinction between the worlds of academia and industry. Academics frame problems in levels of theoretical purity, and then address them in the abstract. Industry is there to solve immediate problems on the ground, using the tools that are available.

Academics have come up with a thousand ways to address concurrency, and a lot of these were dreamt up in the early days of computing. All the things I’m going to talk about here were substantially understood in the eighties. But these days it takes twenty years for something to make it from academia to something industry can use, and that time lag is increasing.

Industry only really cares about its tooling. The fact that academics have dreamt up some magic language that does really cool stuff is of no interest if there isn’t an ecosystem big enough to use. That ecosystem needs trained developers, books, training courses, compilers, interpreters, debuggers, profilers and of course huge systems libraries to support all the random crap every project needs (oh, it’s just like the last project except we need to write iCalendar files *and* access a remote MIDI music device). It also needs actual physical “tin” on which to run the code, and the characteristics of the tin make a lot of difference.

Toy academic languages are *no use*, as far as most of industry is concerned, for solving their problems. If you can’t go and get five hundred contractors with it on their CV, then you’re stuck.

The multicore bombshell

So, all industry has these days, really, is C++ and Java. C++ is still very widely used, but Java is gaining ground rapidly, and one of the reasons for this is its support for concurrency. I’ll quote Steve Yegge:

But it’s interesting because C++ is obviously faster for, you know, the short-running [programs], but Java cheated very recently. With multicore! This is actually becoming a huge thorn in the side of all the C++ programmers, including my colleagues at Google, who’ve written vast amounts of C++ code that doesn’t take advantage of multicore. And so the extent to which the cores, you know, the processors become parallel, C++ is gonna fall behind.

But for now, Java programs are getting amazing throughput because they can parallelize and they can take advantage of it. They cheated! Right? But threads aside, the JVM has gotten really really fast, and at Google it’s now widely admitted on the Java side that Java’s just as fast as C++.

His point here is vitally important. The reason Java is gaining is not an abstract language reason, it’s because of a change in the architecture of computers. Most new computers these days are multicore. They have more than one CPU on the processor die. Java has fundamental support for threading, which is one approach to concurrency, and so some programs can take advantage of the extra cores. On a quad-core machine, with the right program, Java will run four times faster than C++. A win, right?

Well here’s a comment from the master himself, Don Knuth:

I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won’t be surprised at all if the whole multithreading idea turns out to be a flop, worse than the “Itanium” approach that was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write.

Let me put it this way: During the past 50 years, I’ve written well over a thousand programs, many of which have substantial size. I can’t think of even five of those programs that would have been enhanced noticeably by parallelism or multithreading. Surely, for example, multiple processors are no help to TeX….

I know that important applications for parallelism exist—rendering graphics, breaking codes, scanning images, simulating physical and biological processes, etc. But all these applications require dedicated code and special-purpose techniques, which will need to be changed substantially every few years.

(via Ted Tso)

Hardware designers are threatening to increase the numbers of cores massively. Right now you get two, four maybe eight core systems. But soon maybe hundreds of cores. This is important.

The problems with threading

Until recently, if you’d said to pretty much any developer that concurrency was an unsolved problem, they’d look at you like you were insane. Threading was the answer – everyone knew that. It’s supported in all kernels in all major Operating Systems. Any serious software used threads widely to handle all sorts of concurrency, and hey it was easy – Java, for example, provides primitives in the language itself to manage synchronisation and all the other stuff you need.

But then some people started realising that it wasn’t quite so good as it seemed. Steve Yegge again:

I do know that I did write a half a million lines of Java code for this game, this multi-threaded game I wrote. And a lot of weird stuff would happen. You’d get NullPointerExceptions in situations where, you know, you thought you had gone through and done a more or less rigorous proof that it shouldn’t have happened, right?

And so you throw in an “if null”, right? And I’ve got “if null”s all over. I’ve got error recovery threaded through this half-million line code base. It’s contributing to the half million lines, I tell ya. But it’s a very robust system.

You can actually engineer these things, as long as you engineer them with the certain knowledge that you’re using threads wrong, and they’re going to bite you. And even if you’re using them right, the implementation probably got it wrong somewhere.

It’s really scary, man. I don’t… I can’t talk about it anymore. I’ll start crying.

This is a pretty typical experience of anyone who has coded something serious with threads. Weird stuff happens. You get deadlocks and breakage and just utterly confusing random stuff.

And you know, all those times your Windows system just goes weird, and stuff hangs and crashes and all sorts. I’m willing to bet a good proportion of those are due to errors in threading.

In reality threads are hard. It’s sort of accepted wisdom these days (at least amongst some of the community) that threads are actually too hard. Too hard for most programmers anyhow.

Python

We’re Python coders, so Python is obviously of particular interest to us. We also write concurrent systems. Python’s creator (Guido van Rossum) took an approach to threading that has become pretty standard in most modern “dynamic” languages. Rather than ensure the whole Python core is “thread-safe” he introduced a Global Interpreter Lock. This means that in practice when one thread is doing something it’s often impossible for the interpreter to context switch to other threads, because the whole interpreter is locked.

It certainly means threads in Python are massively less useful than they are in, say, Java. For a lot of people this has doomed Python – “what, no threads!?” they cry, and then move on. Which is a shame, because threads are not the only answer, and as I’ve said I don’t even think they are a good answer.

Enter Twisted. Twisted is single-threaded, so it avoids all of the problems of threads. Concurrency is handled cooperatively, with separate subsystems within your program yielding control, either voluntarily or when they would block (i.e. when they are waiting for input).
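A tiny sketch of the model (the delays and values here are invented): two pieces of work interleave in a single thread, with the reactor switching between them whenever one of them is waiting.

from twisted.internet import defer, reactor

def fetch_later(value, delay):
    # Stands in for any non-blocking operation: returns a Deferred that fires
    # with `value` after `delay` seconds, leaving the reactor free to service
    # other work in the meantime.
    d = defer.Deferred()
    reactor.callLater(delay, d.callback, value)
    return d

def show(result):
    print(result)

# Two "concurrent" operations in one thread, interleaved cooperatively.
fetch_later("slow result", 2).addCallback(show)
fetch_later("quick result", 1).addCallback(show)
reactor.callLater(3, reactor.stop)
reactor.run()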

This model fits a large proportion of programming problems very effectively, and it’s much more efficient than threads. So how does this handle multicore? Pretty effectively right now. We design our software in such a way that core parts can be run separately and scaled by adding more of them (“horizontal” scaling in the parlance). Our soon-to-be-released CMS, Exotypes, works this way, using multiple processes to exploit multiple cores.

This is a really effective approach. We can run say six processes, load balance between them and it takes great advantage of the hardware. Because we’ve designed it to work this way, we can even scale across multiple physical computers, giving us a lot of potential scale.

But what of machines of the future? Over a hundred cores, run a hundred processes? Over a thousand? At large numbers of cores the multi-process model breaks down too. In fact I don’t think any commonly deployed OS will handle this sort of hardware well at all, except for specialised applications. This is where I think Twisted falls down, through no fault of its own. I just suspect, like Don Knuth, that the hardware environment of the future is one that’s going to be extremely challenging for us to work in.

Two worlds, reprise

Of course, these issues have been addressed in academia, and I think, to finally answer Duncan’s question, that the long term solution to concurrency has to be addressed as part of the language. The only architecture that I think will handle it is the sort of thing represented in Erlang – lightweight processes that share no state.

Erlang addresses the challenge of multicore computing fantastically well, but as a language for writing real programs it has some serious gaps. I don’t think it’s Erlang that’s going to win, but the winner will be a language with many of its features.

First, Erlang is purely functional, with no object-oriented structures. Pretty much every coder in the world has been trained, and is familiar with, the OO paradigm. For a language to gain traction it’s going to need to support this. This is quite compatible with Erlang’s concurrency model, and shouldn’t be too hard to support.

It also needs a decent library. Right now, the Erlang library ecosystem is, well, sparse.

Finally it needs wide adoption.

So, gods of the machines, I want something that’s got OCaml’s functional OO, Erlang’s concurrency and distribution, Python’s syntax and Python’s standard library. And I want you to bribe people to use it.

If you can do all this, not only will we be able to support multicore, but we might also, finally, be able to actually build a large IT system that actually works.

Interesting times in image processing

You’ve all seen the awesome Photosynth demo on TED. (If you haven’t, do.)

There’re several interesting things there. I especially liked the infinite-resolution Seadragon demo, with the startling claim that the only thing that should limit the speed of the application is the number of pixels being thrown around on-screen: not the size of the underlying image.

But the star of the show is arguably the ability to recognise features in photos, in order to composite them together intelligently. I presume the same technology would allow the computer to recognise known features or places, adding a semantic layer onto images currently absent. Imagine having your holiday snaps automatically tagged with the correct placenames, landmarks and from that, geodata.

Simultaneously, lots of companies (and certainly lots of government agencies) are working on facial recognition. You can already use Riya to search and tag your photo collection.

I expect this to be offered by Picasa, Flickr, iPhoto etc. within 2 years or so, whether they buy the technology, or develop it from scratch. And I certainly expect it to help power Google searches, within the same timeframe. (In the meantime, they’re building up a semantic layer around photos by other means, e.g. the delightful Image Labeler.)

I’ll leave it for a different post (or commenters) to explore the implications for privacy.

Actually recognising faces (but not identity) in photos is already becoming common in cameras, e.g. my Canon Ixus 70. In tests, it recognised the faces of sandstone angels in a cemetery, and in the office we were able to draw rudimentary faces on paper that the camera recognised.

Riya have also extended their technology into shopping: their proof-of-concept like.com allows you to search for shoes, handbags, clothes etc. on the basis of “likeness”. I don’t think it’s a silver bullet for the online shopping experience, but it’s certainly valuable (to the user, and to them as a business).

Here are another couple of interesting things. Given sufficient processing power, and numbers of photos (and that’s not hard these days), you can perform what seems like magic.

  • Scene completion using millions of photographs
    “The algorithm patches up holes in images by finding similar image regions in the database that are not only seamless but also semantically valid.”
  • Content-Aware Image Sizing
    “It demonstrates a software application that resizes images in such a way that the content of the image is preserved intelligently.” Has to be seen to be believed.
  • Reconstructing 3D models from 2D photographs, e.g. Fotowoosh

These are all things that our brains are capable of doing without thinking, but we are gradually developing the processing power, the visual memory (repository of images), and clever algorithms to make it possible.

Chandler, and the hardness of software

CIO Insight magazine have a good interview with Scott Rosenberg, who was involved in the Chandler project. In many ways it seems to me like a poster-boy for the Agile movement. If Chandler had been approached in a more agile manner they might not only have shipping code by now, but I think that shipping code would be for a very different piece of software.

Chandler made quite a splash when the project was launched because the project lead was Mitch Kapor, creator of Lotus 1-2-3, Notes and the ill-fated but ambitious Groove, and one of the leading lights in software development. One of the key things in an open source project is the project lead, and Mitch is a proper heavyweight. He attracted a lot of very good developers and the project got moving with quite a fanfare.

I remember when the Chandler project was announced. Back then it was a fantastic idea, although ambitious. I still think the idea has a lot of legs. The idea was to produce a Microsoft Outlook killer, but using good UI and engineering design techniques. If you analyse it, Outlook is a pretty terrible application. Most of its users have probably never even thought about it, but the application only barely satisfies common use cases, and those with a lot of laborious work on the part of the user. It is all most people have used though, and people tend not to be terribly introspective about their software.

Mozilla Thunderbird is just as bad, and even less ambitious than Outlook. As an email client, neither of them compares at all well with the power and features of mutt, for example. Mutt however takes ages to learn and is ultimately limited by the capabilities of a terminal.

So, Chandler was supposed to change all this. Bring the sensibilities of a spreadsheet (i.e. a thinly disguised programming environment) to Groupware. To be a Notes for the 21st Century. Satisfy the Real Needs of Groupware users.

This is a Hard Problem. It has all the features of a project that disappears up its own arse, just as Mozilla did. There is a huge scope for architecture with such a large problem, and so that’s what they did: Architecture. Just as Mozilla did. Even though they sensibly chose Python to develop Chandler (unlike C and the homegrown XUL for Mozilla), they indulged in huge amounts of Big Design Up Front, which means that the environment and probably the users are now well ahead of where Chandler was aiming.

Maybe, like the Mozilla project, it will suddenly emerge from obscurity to take over the world. They will have a much more difficult environment for launch than Mozilla did however. The browser incumbent, Internet Explorer, was so appallingly bad it made a very easy target. The incumbents for Chandler will not only be better but will be hosted, browser accessed applications – which provides a set of behaviours that Chandler will find it impossible to replicate.

All of this may not sound like a software problem, but instead a marketing problem. To call it that is to miss much of the point in the real difficulty in software development. The most intractable problem is not avoiding building buggy software. The problem is in building the software that people actually want.

Queues are Databases?

Arnon Rotem-Gal-Oz mentions a thesis by Jim Gray in his post Queues Are Databases?. I had the opportunity to architect a large system that relied entirely on a large network of queues earlier in the year. It seemed natural to build the queue on top of a database.

Arnon is making the assumption here that the “database” is an RDBMS, which is where I think the real difference lies. Further to this he really seems to mean that the queue is in fact a single table – that seems to be the implication in the follow-up Queues Are Databases: Round Two. (You might need to hack around with that page to see all the content, because one of the adverts has gone mad and eaten it.)

That is definitely wrong in my view, because of the issues in factoring objects. The underlying queue structure shouldn’t impose factoring requirements on the objects in the queue.

I used a class that multiply inherited from an Axiom Item and a standard Twisted DeferredQueue. Although very simple, that provides all of the primitives of a persistent queue, and very cheaply to boot.
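The original code isn’t reproduced here, but the shape of the idea is roughly the following sketch, with sqlite3 standing in for Axiom purely for illustration (the table and class names are invented):

import sqlite3

from twisted.internet import defer

class PersistentQueue(object):
    """A queue with DeferredQueue semantics whose contents survive restarts.

    The original used an Axiom Item for persistence; sqlite3 is used here
    purely for illustration.
    """

    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS queue "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
        self._pending = defer.DeferredQueue()
        # Reload anything that survived a restart into the in-memory queue.
        for (payload,) in self._db.execute(
                "SELECT payload FROM queue ORDER BY id"):
            self._pending.put(payload)

    def put(self, payload):
        # Durable first, then visible to consumers.
        self._db.execute("INSERT INTO queue (payload) VALUES (?)", (payload,))
        self._db.commit()
        self._pending.put(payload)

    def get(self):
        # Returns a Deferred that fires with the next payload; the row is
        # deleted once the payload has been handed to the consumer.
        return self._pending.get().addCallback(self._delivered)

    def _delivered(self, payload):
        self._db.execute(
            "DELETE FROM queue WHERE id = (SELECT MIN(id) FROM queue "
            "WHERE payload = ?)", (payload,))
        self._db.commit()
        return payload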

Verbs in REST

More good stuff from lesscode.org with Useful and Useless REST. The verb issue is one I’ve come across quite a few times without ever really deciding what I think. Ryan’s comments on the semantics for proxies are interesting, and might be the real decider.

The guys at lesscode write some really interesting stuff, but I do wish they’d do their bickering in private.