Category Archives: Python

Links

There is a new version of gunicorn, 19.0, which has a couple of significant changes, including some interesting workers (gthread and gaiohttp) and actually responding to signals properly, which will make it work with Heroku.

The HTTP RFC, 2616, is now officially obsolete. It has been replaced by a bunch of RFCs from 7230 to 7235, covering different parts of the specification. The new RFCs look loads better, and it’s worth having a look through them to get familiar with them.

Some kind person has produced a recommended set of SSL directives for common webservers, which provide an A+ on the SSL Labs test, while still supporting older IEs. We’ve struggled to find a decent config for SSL that provides broad browser support, whilst also having the best levels of encryption, so this is very useful.

A few people are still struggling with Git.  There are lots of git tutorials around the Internet, but this one from Git Tower looks like it might be the best for the complete beginner. You know it’s for noobs, of course, because they make a client for the Mac :)

I haven’t seen a lot of noise about this, but the EU has outlawed pre-ticked checkboxes.  We have always recommended that these are not used, since they are evil UX, but now there’s an argument that might persuade everyone.

Here is a really nice post about splitting user stories. I think we are pretty good at this anyhow, but this is a nice way of describing the approach.

@monkchips gave a talk at IBM Impact about the effect of Mobile First. I think we’re on the right page with most of these things, but it’s interesting to see mobile called-out as one of the key drivers for these changes.

I’d not come across the REST Cookbook before, but here is a decent summary of how to treat PUT vs POST when designing RESTful APIs.

Fastly have produced a spectacularly detailed article about how to get tracking cookies working with Varnish.  This is very relevant to consumer facing projects.

This post from ThoughtWorks, “The Software Testing Cupcake”, is absolutely spot on, and I think accurately describes an important aspect of testing.

As an example of how to make unit tests less fragile, this is a decent description of how to isolate tests, which is a key technique. The examples are in Ruby, but the principle is valid everywhere.

Still on unit testing, Facebook have open sourced a Javascript unit testing framework called Jest. It looks really very good.

A nice implementation of “sudo mode” for Django. This ensures the user has recently entered their password, and is suitable for protecting particularly valuable assets in a web application like profile views or stored card payments.

If you are using Redis directly from Python, rather than through Django’s cache wrappers, then HOT Redis looks useful. This provides atomic operations for compound Python types stored within Redis.

Using mock.patch in automated unit testing

Mocking is a critical technique for automated testing. It allows you to isolate the code you are testing, which means you test what you think you are testing. It also makes tests less fragile because it removes unexpected dependencies.

However, creating your own mocks by hand is fiddly, and some things are quite difficult to mock unless you are a metaprogramming wizard. Thankfully Michael Foord has written a mock module, which automates a lot of this work for you, and it’s awesome. It’s included in Python 3, and is easily installable in Python 2.
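
If you haven’t used mock before, the core idea looks something like this (a minimal sketch; the api object and its get_user method are made up for illustration):

import mock

# Attribute access on a Mock returns a child mock; calls are recorded.
api = mock.Mock()
api.get_user.return_value = {"name": "alice"}

assert api.get_user(42) == {"name": "alice"}
api.get_user.assert_called_once_with(42)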

Since I’ve just written a test case using mock.patch, I thought I would walk through how I approached it, in case it’s useful for anyone who hasn’t come across this before.

It is important to decide, when you approach writing an automated test, what level of the system you intend to test. If you think it would be more useful to test an orchestration of several components, then that is an integration test of some form, not a unit test. I’d suggest you still write unit tests where it makes sense, but then add in a sensible sprinkling of integration tests that ensure your moving parts are correctly connected.

Mocks can be useful for integration tests too; however, the bigger the subsystem you are mocking, the more likely it is that you want to build your own “fake” for the entire subsystem.

You should design fake implementations like this as part of your architecture, and consider them when factoring and refactoring. Often the faking requirements can drive out some real underlying architectural requirements that are not clear otherwise.

Whereas unit tests should test very limited functionality, I think integration tests should be much more like smoke tests and exercise a lot of functionality at once. You aren’t interested in isolating specific behaviour, you want to make it break. If an integration test fails, and no unit tests fail, you have a potential hotspot for adding additional unit tests.

Anyway, my example here is a unit test. What that means is we only want to test the code inside the single function being tested. We don’t want to actually call any other functions outside the unit under test. Hence mocking: we want to replace all function calls and external objects inside the unit under test with mocks, and then ensure they were called with the expected arguments.

Here is the code I need to test, specifically the ‘fetch’ method of this class:

class CloudImage(object):
 
    __metaclass__ = abc.ABCMeta
 
    blocksize = 81920
    def __init__(self, pathname, release, arch):
        self.pathname = pathname
        self.release = release
        self.arch = arch
        self.remote_hash = None
        self.local_hash = None
 
    @abc.abstractmethod
    def remote_image_url(self):
        """ Return a complete url of the remote virtual machine image """
 
    def fetch(self):
        remote_url = self.remote_image_url()
        logger.info("Retrieving {0} to {1}".format(remote_url, self.pathname))
        try:
            response = urllib2.urlopen(remote_url)
        except urllib2.HTTPError:
            raise error.FetchFailedException("Unable to fetch {0}".format(remote_url))
        local = open(self.pathname, "w")
        while True:
            data = response.read(self.blocksize)
            if not data:
                break
            local.write(data)

I want to write a test case for the ‘fetch’ method. I have elided everything in the class that is not relevant to this example.

Looking at this function, I want to test that:

  1. The correct URL is opened
  2. If an HTTPError is raised, the correct exception is raised
  3. Open is called with the correct pathname, and is opened for writing
  4. Read is called successive times, and that everything returned is passed to local.write, until a False value is returned

I need to mock the following:

  1. self.remote_image_url()
  2. urllib2.urlopen()
  3. open()
  4. response.read()
  5. local.write()

This is an abstract base class, so we’re going to need a concrete implementation to test; my test module therefore contains one. I’ve implemented the other abstract methods, but they’re not shown.

class MockCloudImage(base.CloudImage):
    def remote_image_url(self):
        return "remote_image_url"

Because there are other methods on this class I will also be testing, I create an instance of it in setUp as a property of my test case:

class TestCloudImage(unittest2.TestCase):

    def setUp(self):
        self.cloud_image = MockCloudImage("pathname", "release", "arch")

Now I can write my test methods.

I’ve mocked self.remote_image_url; now I need to mock urllib2.urlopen() and open(). The other things to mock are returned from these mocks, so they’ll be mocked automatically.

Here’s the first test:

    @mock.patch('urllib2.urlopen')
    @mock.patch('__builtin__.open')
    def test_fetch(self, m_open, m_urlopen):
        m_urlopen().read.side_effect = ["foo", "bar", ""]
        self.cloud_image.fetch()
        self.assertEqual(m_urlopen.call_args, mock.call('remote_image_url'))
        self.assertEqual(m_open.call_args, mock.call('pathname', 'w'))
        self.assertEqual(m_open().write.call_args_list, [mock.call('foo'), mock.call('bar')])

The mock.patch decorators replace the specified functions with mock objects within the context of this function, and then unmock them afterwards. The mock objects are passed into your test, in the order in which the decorators are applied (bottom to top).

Now we need to make sure our read calls return something useful. Retrieving any property or method from a mock returns a new mock, and the same mock is then consistently returned for that property or method. That means we can write:

m_urlopen().read

to get the read method that will be called inside the function. We can then set its “side_effect” – what it does when called. In this case, we pass it a list and it will return each of those values on successive calls.
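
To see the mechanism in isolation (a minimal sketch; the reader mock is made up for illustration):

import mock

reader = mock.Mock()
reader.read.side_effect = ["foo", "bar", ""]

assert reader.read() == "foo"
assert reader.read() == "bar"
assert reader.read() == ""  # a further call would raise StopIteration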

Now we can call our fetch method, which will terminate because read eventually returns an empty string.

Now we just need to check each of our methods was called with the appropriate arguments; hopefully it’s pretty clear from the code above how that works. It’s important to understand the difference between:

m_open.call_args

and

m_open().write.call_args_list

The first is the arguments passed to “open(…)”. The second is the list of arguments passed to:

local = open(); local.write(...)

Another test method, testing the exception is now very similar:

    @mock.patch('urllib2.urlopen')
    @mock.patch('__builtin__.open')
    def test_fetch_httperror(self, m_open, m_urlopen):
        m_urlopen.side_effect = urllib2.HTTPError(*[None] * 5)
        self.assertRaises(error.FetchFailedException, self.cloud_image.fetch)

You can see I’ve created an instance of the HTTPError exception class (with dummy arguments), and this is the side_effect of calling urlopen().

Now we can assert our method raises the correct exception.

Hopefully you can see how the mock.patch decorator saved me a spectacular amount of grief.

If you need to, mock.patch can also be used as a context manager with “with”, giving similar behaviour. This is particularly useful in setUp functions, where you can use the context manager to create a mock that is used by the system under test only, rather than applied globally.
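
For illustration, here is the first test again in that form (a sketch with the same behaviour as the decorator version above, assuming Python 2.7’s multiple context managers):

    def test_fetch_with_context_managers(self):
        with mock.patch('urllib2.urlopen') as m_urlopen, \
             mock.patch('__builtin__.open') as m_open:
            m_urlopen().read.side_effect = ["foo", ""]
            self.cloud_image.fetch()
        self.assertEqual(m_open.call_args, mock.call('pathname', 'w'))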

About us: Isotoma is a bespoke software development company based in York and London specialising in web apps, mobile apps and product design. If you’d like to know more you can review our work or get in touch.

Reviewing Django REST Framework

Recently, we used Django REST Framework to build the backend for an API-first web application. Here I’ll attempt to explain why we chose REST Framework and how successfully it helped us build our software.

Why Use Django REST Framework?

RFC-compliant HTTP Response Codes

Clients (javascript and rich desktop/mobile/tablet applications) will more than likely expect your REST service endpoint to return status codes as specified in the HTTP/1.1 spec. Returning a 200 response containing {‘status’: ‘error’} goes against the principles of HTTP and you’ll find that HTTP-compliant javascript libraries will get their knickers in a twist. In our backend code, we ideally want to raise native exceptions and return native objects; status codes and content should be inferred and serialised as required.

If authentication fails, REST Framework serves a 401 response. Raise a PermissionDenied and you automatically get a 403 response. Raise a ValidationError when examining the submitted data and you get a 400 response. POST successfully and get a 201, PATCH and get a 200. And so on.
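
As a sketch of what this looks like in practice (the JobViewSet, its serializer and the is_staff rule here are made up for illustration):

from django.core.exceptions import PermissionDenied

from rest_framework import viewsets

class JobViewSet(viewsets.ModelViewSet):
    serializer_class = JobSerializer

    def destroy(self, request, *args, **kwargs):
        if not request.user.is_staff:
            # REST Framework's exception handling turns this into
            # a 403 response; no manual status-code plumbing needed.
            raise PermissionDenied
        return super(JobViewSet, self).destroy(request, *args, **kwargs)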

Methods

You could PATCH an existing user profile with just the field that was changed in your UI, DELETE a comment, PUT a new shopping basket, and so on. HTTP methods exist so that you don’t have to encode the nature of your request within the body of your request. REST Framework has support for these methods natively in its base ViewSet class which is used to build each of your endpoints; verbs are mapped to methods on your view class which, by default, are implemented to do everything you’d expect (create, update, delete).

Accepts

The base ViewSet class looks at the Accept header and encodes the response accordingly. You need only specify which formats you wish to support in your settings.py.
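
In settings.py that can be as simple as this (a sketch; pick whichever renderers you need):

REST_FRAMEWORK = {
    'DEFAULT_RENDERER_CLASSES': [
        'rest_framework.renderers.JSONRenderer',
    ],
}

A client sending Accept: application/json then gets JSON back, without any per-view code.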

Serializers are not Forms

Django Forms do not provide a sufficient abstraction to handle object PATCHing (only PUT) and cannot encode more complex, nested data structures. The latter limitation lies with HTTP, not with Django Forms; HTTP forms cannot natively encode nested data structures (both application/x-www-form-urlencoded and multipart/form-data rely on flat key-value formats). Therefore, if you want to declaratively define a schema for the data submitted by your users, you’ll find life a lot easier if you discard Django Forms and use REST Framework’s Serializer class instead.

If the consumers of your API wish to use PATCH rather than PUT, and chances are they will, you’ll need to account for that in your validation. The REST Framework ModelSerializer class adds fields that map automatically to Model Field types, in much the same way that Django’s ModelForm does. Serializers also allow nesting of other Serializers for representing fields from related resources, providing an alternative to referencing them with a unique identifier or hyperlink.
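
Nesting looks something like this (a sketch; the Job and Application models are made up for illustration):

from rest_framework import serializers

class ApplicationSerializer(serializers.ModelSerializer):
    class Meta:
        model = Application
        fields = ('user', 'created')

class JobSerializer(serializers.ModelSerializer):
    # Embed the related applications rather than referencing them by id.
    applications = ApplicationSerializer(many=True)

    class Meta:
        model = Job
        fields = ('id', 'title', 'applications')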

More OPTIONS

Should you choose to go beyond an AJAX-enabled site and implement a fully-documented, public API then best practice and an RFC or two suggest that you make your API discoverable by allowing OPTIONS requests. REST Framework allows an OPTIONS request to be made on every endpoint, for which it examines request.user and returns the HTTP methods available to that user, and the schema required for making requests with each one.

OAuth2

Support for OAuth 1 and 2 is available out of the box and OAuth permissions, should you choose to use them, can be configured as a permissions backend.

Browsable

REST framework provides a browsable HTTP interface that presents your API as a series of forms that you can submit to. We found it incredibly useful for development, but a bit too rough around the edges to offer as an aid for third parties wishing to explore the API. We therefore used the following snippet in our settings.py file to make the browsable API available only when DEBUG is set to True:

if DEBUG:
    REST_FRAMEWORK['DEFAULT_RENDERER_CLASSES'].append(
        'rest_framework.renderers.BrowsableAPIRenderer'
    )

Testing

REST Framework gives you an APITestCase class which comes with a modified test client. You give this client a dictionary and encoding and it will serialise the request and deserialise the response. You only ever deal in python dictionaries and your tests will never need to contain a single instance of json.loads.
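
A test then looks something like this (a sketch; the endpoint and payload are made up for illustration):

from rest_framework.test import APITestCase

class JobAPITest(APITestCase):
    def test_create_job(self):
        response = self.client.post('/jobs', {'title': 'Tester'},
                                    format='json')
        self.assertEqual(response.status_code, 201)
        # response.data is already a deserialised dictionary.
        self.assertEqual(response.data['title'], 'Tester')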

Documentation

The documentation is of a high quality, copying the Django project’s three-pronged approach – tutorial, topics, and API reference – so Django buffs will find it familiar and easy to parse. The tutorial quickly gives readers a feeling of accomplishment, the high-level topic-driven core of the documentation allows readers to quickly get a solid understanding of how the framework should be used, and the method-by-method API documentation is very detailed, frequently offering examples of how to override existing functionality.

Project Status

At the time of writing the project remains under active development. The roadmap is fairly clear and the chap in charge has a solid grasp of the state of affairs. Test coverage is good. There’s promising evidence in the issue history that creators of useful but non-essential components are encouraged to publish their work as new, separate projects, which are then linked to from the REST Framework documentation.

Criticisms

Permissions

We found that writing permissions was messy and we had to work hard to avoid breaking DRY. An example is required. Let’s define a ViewSet representing both a resource collection and any document from that collection:

views.py:

class JobViewSet(ViewSet):
    """
    Handles both URLs:
    /jobs
    /jobs/(?P<id>\d+)/$
    """
    serializer_class = JobSerializer
    permission_classes = (IsAuthenticated, JobPermission)
 
    def get_queryset(self):
        if self.request.user.is_superuser:
            return Job.objects.all()
 
        return Job.objects.filter(
            Q(applications__user=self.request.user) |
            Q(reviewers__user=self.request.user)
        )

If the Job collection is requested, the queryset from get_queryset() will be run through the serializer_class and returned in an HTTP response with the requested encoding.

If a Job item is requested and it is in the queryset from get_queryset(), it is run through the serializer_class and served. If a Job item is requested and is not in the queryset, the view returns a 404 status code. But we want a 403.

So if we define that JobPermission class, we can fail the object permission test, resulting in a 403 status code:

permissions.py:

class JobPermission(permissions.BasePermission):
    def has_object_permission(self, request, view, obj):
        return obj in Job.objects.filter(
            Q(applications__user=request.user) |
            Q(reviewers__user=request.user))

Not only have we duplicated the logic from the view’s get_queryset() method (we could admittedly reuse view.get_queryset(), but the method and its underlying query would still be executed twice), but without the duplication the client is sent a completely misleading response code.

The neatest way to solve this issue seems to be to use the DjangoObjectPermissionsFilter together with the django-guardian package. Not only will this allow you to define object permissions independently of your views, it’ll also allow you filter querysets using the same logic. Disclaimer: I’ve not tried this solution, so it might be a terrible thing to do.
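
For the record, that configuration would look something like this (hypothetical and untried, as noted above; the filter backend requires django-guardian to be installed and configured):

from rest_framework import viewsets
from rest_framework.filters import DjangoObjectPermissionsFilter
from rest_framework.permissions import DjangoObjectPermissions

class JobViewSet(viewsets.ModelViewSet):
    queryset = Job.objects.all()
    serializer_class = JobSerializer
    permission_classes = (DjangoObjectPermissions,)
    filter_backends = (DjangoObjectPermissionsFilter,)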

Nested Resources

REST Framework is not built to support nested resources of the form /baskets/15/items. It requires that you keep your API flat, of the form /baskets/15 and /items/?basket=15.

We did eventually choose to implement some parts of our API using nested URLs; however, it was hard work, and we had to alter public method signatures and the data types of public attributes within our subclasses. We needed heavily modified Router, Serializer, and ViewSet classes. It is worth noting that REST Framework deserves praise for making each of these components so pluggable.

Very specifically, the biggest issue preventing us pushing our nested resources components upstream was REST Framework’s decision to make lookup_field on the HyperlinkedIdentityField and HyperlinkedRelatedField a single string value (e.g. “baskets”). To support any number of parent collections, we had to create a NestedHyperlinkedIdentityField with a new lookup_fields list attribute, e.g. ["baskets", "items"].

Conclusions

REST Framework is great. It has flaws but continues to mature as an increasingly popular open source project. I’d whole-heartedly recommend that you use it for creating full, public APIs, and also for creating a handful of endpoints for the bits of your site that need to be AJAX-enabled. It’s as lightweight as you need it to be and most of what it does, it does extremely well.


Django Class-Based Generic Views: tips for beginners (or things I wish I’d known when I was starting out)

Django is renowned for being a powerful web framework with a relatively shallow learning curve, making it easy to get into as a beginner and hard to put down as an expert. However, when class-based generic views arrived on the scene, they were met with a lukewarm reception from the community: some said they were too difficult, while others bemoaned a lack of decent documentation. But if you can power through the steep learning curve, you will see they are also incredibly powerful and produce clean, reusable code with minimal boilerplate in your views.py.

So to help you on your journey with CBVs, here are some handy tips I wish I had known when I first started learning all about them. This isn’t a tutorial, but more a set of side notes to refer to as you are learning; information which isn’t necessarily available or obvious in the official docs.

Starting out

If you are just getting to grips with CBVs, the only view you need to worry about is TemplateView. Don’t try anything else until you can make a ‘hello world’ template and view it on your dev instance. This is covered in the docs. Once you can handle that, keep reading the docs and make sure you understand how to subclass a ListView and DetailView to render model data into a template.
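
For reference, the ‘hello world’ of CBVs looks something like this (a minimal sketch; the view name, template and old-style patterns() urls.py are my own assumptions):

# views.py
from django.views.generic import TemplateView

class HelloWorldView(TemplateView):
    template_name = 'hello_world.html'

# urls.py
from django.conf.urls import patterns, url

from myapp.views import HelloWorldView

urlpatterns = patterns('',
    url(r'^hello/$', HelloWorldView.as_view(), name='hello_world'),
)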

OK, now we’re ready for the tricky stuff!

Customising CBVs

Once you have the basics down, you will find that most of your work revolves around subclassing the built-in class-based generic views and overriding one or two methods. At the start of your journey, it is not very obvious what to override to achieve your goals, so remember:

  • If you need to get some extra variables into a template, use get_context_data() (see the sketch after this list)
  • If it is a low-level permissions check on the user, you probably want dispatch()
  • If you need to do a complicated database query on a DetailView, ListView etc, try get_queryset()
  • If you need to pass some extra parameters to a form when constructing it via a FormView, UpdateView etc, try get_form() or get_form_kwargs()
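
As promised, a sketch of the most common case, get_context_data() (the Job model and the extra context variable are made up for illustration):

from django.views.generic import ListView

class JobListView(ListView):
    model = Job

    def get_context_data(self, **kwargs):
        # Extend the context built by the parent class, don't replace it.
        context = super(JobListView, self).get_context_data(**kwargs)
        context['open_job_count'] = Job.objects.filter(is_open=True).count()
        return context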

ccbv.co.uk

If you haven’t heard of ccbv.co.uk, go there and bookmark it now. It is possibly the most useful reference out there for working with class-based generic views. When you are subclassing views and trying to work out which methods to override, and the official docs just don’t seem to cut it, ccbv.co.uk has your back. If it wasn’t for that site, I think we would all be that little bit grumpier about using CBVs.

Forms

CBVs cut a LOT of boilerplate code out of the process of writing forms. You should already be using ModelForms wherever you can to save effort, and there are generic class-based views available (CreateView/UpdateView) that allow you to plug in your ModelForms and reduce your boilerplate code even further. Always use this approach if you can. If your form does not map to a particular model in the database, use FormView.
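
For example, plugging a ModelForm into a CreateView can be as little as this (a sketch; the Job model, template and URL are made up for illustration):

from django import forms
from django.views.generic.edit import CreateView

class JobForm(forms.ModelForm):
    class Meta:
        model = Job
        fields = ['title', 'description']

class JobCreateView(CreateView):
    # The view handles GET (render the form) and POST (validate and
    # save), then redirects to success_url.
    form_class = JobForm
    template_name = 'job_form.html'
    success_url = '/jobs/'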

Permissions

If you want to put some guards on your view e.g. check if the user is logged in, check they have a certain permission etc, you will usually want to do it on the dispatch() method of the view. This is the very first method that is called in your view, so if a user shouldn’t have access then this is the place to intercept them:

from django.core.exceptions import PermissionDenied
from django.views.generic import TemplateView
 
class NoJimsView(TemplateView):
    template_name = 'secret.html'
 
    def dispatch(self, request, *args, **kwargs):
        if request.user.username == 'jim':
            raise PermissionDenied # HTTP 403
        return super(NoJimsView, self).dispatch(request, *args, **kwargs)

Note: If you just want to restrict access to logged-in users, you can wrap the dispatch() method with Django’s login_required decorator. This is covered in the docs, and it may be sufficient for your purposes, but I usually end up having to modify it to handle AJAX requests nicely as well.
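
Because login_required is a function decorator, applying it to a method needs method_decorator, along these lines (a sketch based on the approach in the docs; the view and template names are made up):

from django.contrib.auth.decorators import login_required
from django.utils.decorators import method_decorator
from django.views.generic import TemplateView

class MembersOnlyView(TemplateView):
    template_name = 'members.html'

    @method_decorator(login_required)
    def dispatch(self, request, *args, **kwargs):
        return super(MembersOnlyView, self).dispatch(request, *args, **kwargs)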

Multiple inheritance

Once you start subclassing and overriding generic views, you will probably find yourself needing multiple inheritance. For example, perhaps you want to extend your “No Jims” policy (see above) to several other views. The best way to achieve this is to write a small Mixin and inherit from it along with the generic view. For example:

class NoJimsMixin(object):
    def dispatch(self, request, *args, **kwargs):
        if request.user.username == 'jim':
            raise PermissionDenied # HTTP 403
        return super(NoJimsMixin, self).dispatch(request, *args, **kwargs)
 
class NoJimsView(NoJimsMixin, TemplateView):
    template_name = 'secret.html'
 
class OtherNoJimsView(NoJimsMixin, TemplateView):
    template_name = 'other_secret.html'

Now you have entered the world of python’s multiple inheritance and Method Resolution Order. Long story short: order is important. If you inherit from two classes that both define a foo() method, your new class will use the one from the parent class that was first in the list. So in the above example, in your NoJimsView class, if you listed TemplateView before NoJimsMixin, django would use TemplateView’s dispatch() method instead of NoJimsMixin’s. But in the above example, not only will your NoJimsMixin’s dispatch() get called first, but when you call super(NoJimsMixin, self).dispatch(), it will call TemplateView’s dispatch() method. How I wish I had known this when I was learning about CBVs!

View/BaseView/Mixin

As you browse around the docs, code and ccbv.co.uk, you will see references to Views, BaseViews and Mixins. They are largely a naming convention in the django code: a BaseView is like a View except it doesn’t have a render_to_response() method so it won’t render a template. Almost all Views inherit from a corresponding BaseView and add a render_to_response() method e.g. DetailView/BaseDetailView, UpdateView/BaseUpdateView etc. This is useful if you are subclassing from two Views, because it means you can choose which one renders the final output. It is also useful if you want to render to JSON, say in an AJAX response, and don’t need HTML rendering at all (in this case you’d need to provide your own render_to_response() method that returns a HttpResponse).
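
For example, a JSON-only detail view might look something like this (a sketch; the Job model and the payload fields are made up for illustration):

import json

from django.http import HttpResponse
from django.views.generic.detail import BaseDetailView

class JobJSONView(BaseDetailView):
    model = Job

    def render_to_response(self, context, **response_kwargs):
        # No template rendering: serialise the object straight to JSON.
        payload = {'id': self.object.pk, 'title': self.object.title}
        return HttpResponse(json.dumps(payload),
                            content_type='application/json')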

Mixin classes provide a few helper methods, but can’t be used on their own, as they are not full Views.

So in short, if you are just subclassing one thing, you will usually subclass a View. If you want to manually render a non-HTML response, you probably need a BaseView. If you are inheriting from multiple classes, you will need a combination of some or all of View, BaseView and Mixin.

A final note on AJAX

Django is not particularly good at serving AJAX requests out of the box, and once you start trying to use CBVs to do AJAX form submissions, things get quite complicated.

The docs offer some help with this in the form of a Mixin you can copy and paste into your code, which gives you JSON responses instead of HTML. You will also need to pass CSRF tokens in your POST requests, and again there is an example of how to do this in the docs.

This should be enough to get you started, but I often find myself having to write some extra Mixins, and that is before even considering the javascript code on the front end to send requests and parse responses, complete with handling of validation and transport errors. Here at Isotoma, we are working on some tools to address this, which we hope to open-source in the near future. So watch this space!

Conclusion

In case you hadn’t worked it out, we at Isotoma are fans of Django’s class-based generic views. They are definitely not straightforward for newcomers, but hopefully with the help of this article and other resources (did I mention ccbv.co.uk?), it’ll be plain sailing before you know it. And once you get what they’re all about, you won’t look back.


Buildout Basics Part 1

Introduction to the series

This is the first in a 3 part series of tutorials about creating, configuring and maintaining buildout configuration files, and how buildout can be used to deploy and configure both python-based and other software.
During the course of this series, I will cover buildout configuration files, some buildout recipes, and a simple overview of the structure of a buildout recipe. I will not cover creating a recipe, or developing buildout itself.

For a very good guide on the python packaging techniques that we will be relying on, see this guide: http://guide.python-distribute.org

All code samples will be python 2.4+ compatible; system command lines will be debian/ubuntu specific, but simple enough to generalise out to most systems (OSX and Windows included).
Where a sample project or code is required, I’ve used Django as it’s what I’m most familiar with, but this series is all about the techniques and configuration in buildout itself; it is not Django specific, so don’t be scared off if you happen to be using something else.

Buildout Basics

So, what’s this buildout thing anyway?

If you’re a python developer, or even just a python user, you will probably have come across either easy_install or pip (or both). These are pretty much two methods of achieving the same thing; namely, installing python software:

$> sudo easy_install Django

This is a fairly simple command: it will install Django onto the system path, so from anywhere you can do

>>> import django
>>>

While this is handy, it’s not ideal for production deployment. System-installing a package will lead to problems with maintenance, and probably also to version conflicts, particularly if you have multiple sites or environments deployed on the same server. One environment may need Django 1.1, the newest may need 1.3. There are significant differences in the framework from one major version to another, and a clean upgrade may not be possible. So, system-installing things is generally considered to be a bad idea, unless it’s guaranteed to be a dedicated machine.

So what do you do about it?
One answer is to use a virtualenv. This is a python package that will create a ‘clean’ python environment in a particular directory:

$> virtualenv deployment --no-site-packages

This will create a directory called ‘deployment’, containing a clean python interpreter with only a local path. This environment will ignore any system-installed packages (--no-site-packages), and give you a fresh, self contained python environment.

Once you have activated this environment, you can then easy_install or pip install the packages you require, and then use them as you would normally, safe in the knowledge that anything you install in this environment is not going to affect the mission-critical sales (or pie-ordering website) process that’s running in the directory next door.
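
Activating the environment and installing into it looks something like this (the package and version here are just examples):

$> cd deployment
$> source bin/activate
(deployment)$> pip install Django==1.3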

So, if virtualenv can solve this problem for us, why do we need something else, something more complex to solve essentially the same problem?

The answer is that buildout doesn’t just solve this problem; it solves a whole lot more, particularly when it comes to ‘how do I release this code to production, yet make sure I can still work on it, without breaking the release?’

Building something with buildout

The intent is to show you the parts of a buildout config, then show how it all fits together. If you want to see the final product first, then disassemble it to find the overall picture, scroll to the end of this, have a look, then come back. Go on, it’s digital, this will still be here when you come back….

Config Files

Pretty much everything that happens with buildout is controlled by its config file (this isn’t quite true, but hey, ‘basics’). A config file is a simple ini (ConfigParser) style text file that defines some sections, some options for those sections, and some values for those options.

In this case, the sections of a buildout config file (henceforth referred to as buildout.cfg) are generally referred to as parts. The most important of these parts is the buildout part itself, which controls the options for the buildout process.

An absolute minimum buildout part looks something like this:

[buildout]
parts = noseinstall

While this is not a complete buildout.cfg, it is the minimum that is required in the buildout part itself. All it is doing is listing the other parts that buildout will use to actually do something; in this case, it is looking for a single part named noseinstall. As this part doesn’t exist yet, it won’t actually work. So, let’s add the part, and in the next section, see what it does:

[buildout]
parts = noseinstall

[noseinstall]
recipe = zc.recipe.egg
eggs = Nose

An aside about bootstrap.py

We now have a config file that we’re reasonably sure will do something; if we’re really lucky, it’ll do something that we actually want it to do. But how do we run it? We will need buildout itself, but we don’t have that yet. At this point, there are two ways to proceed.

  1. sudo apt-get install python-zc.buildout
  2. wget http://python-distribute.org/bootstrap.py && python bootstrap.py

For various reasons, unless you need a very specific version of buildout, it is best to use the bootstrap.py file. This is a simple file that contains enough of buildout to install buildout itself inside your environment. As it’s cleanly installed, it can be upgraded, version pinned and generally used in the same manner as in a virtualenv style build. If you system-install buildout, you will not be able to easily upgrade the buildout instance, and may run into version conflicts if a project specifies a version newer than the one you have. Both approaches have their advantages; I prefer the second as it is slightly more self contained. Mixing the approaches (using bootstrap.py with a system install) is possible, but can expose some bugs in the buildout install procedure.

The rest of this document is going to assume that you have used bootstrap.py to install buildout.

Running some buildout

Now that we have a method of running buildout, it’s time to do it in the directory where we left the buildout.cfg file created earlier:

$> bin/buildout

At this point, buildout will output something along the lines of:

Getting distribution for 'zc.recipe.egg'.
Got zc.recipe.egg 1.3.2.
Installing noseinstall.
Getting distribution for 'Nose'.
no previously-included directories found matching 'doc/.build'
Got nose 1.0.0.
Generated script '/home/tomwardill/tmp/buildoutwriteup/bin/nosetests-2.6'.
Generated script '/home/tomwardill/tmp/buildoutwriteup/bin/nosetests'.

Your output may not be exactly similar, but should contain broadly those lines.

The simple sample here is using the zc.recipe.egg recipe. This is probably the most common of all buildout recipes, as it is the one that will do the heavy work of downloading an egg, analysing its setup.py for dependencies (and installing them if required), and then finally installing the egg into the buildout path for use. Recipes are just python eggs that contain code that buildout will run. The easiest way to think of this is that while a recipe is an egg, the recipe contains instructions for the buildout process itself, and therefore will not be available to your code at the end.

An analysis of the buildout output shows exactly what it has done. It has downloaded an egg for zc.recipe.egg and run the noseinstall part. Let’s take a closer look at that noseinstall part from before:

[noseinstall]
recipe = zc.recipe.egg
eggs = Nose

So, we can see why buildout has installed zc.recipe.egg, it is specified in the recipe option of this part, so buildout will download it, install it and then run it. We will take a closer look at the construction of a recipe in a later article, but for now, assume that buildout has executed a bunch of python code in the recipe, and we’ll carry on.
The python code in this case will look at the part that it is in, and look for an option called eggs. As we have specified this option, it will then look at this as a list, and install all the eggs that we have listed; in this case, just the one, the unittest test runner Nose.
As you can see from the bottom of the buildout output, the recipe has downloaded Nose, extracted it and created two files: bin/nosetests and bin/nosetests-2.6. Running one of those files like so:

$> bin/nosetests

----------------------------------------------------------------------
Ran 0 tests in 0.002s

OK
$>

We can see that this is nose, as we expect it to be. Two files have been generated because that is what the setup.py for Nose defines: a base nosetests executable, and one for the specific python version that we have used (python 2.6 in my case). These are specified in the setup.py that makes up the nose egg, which will be covered in a later article.

Conclusion

We can install buildout into a development environment, and use a simple config file to install a python egg. The next article will cover a development example for use with Django, and some niceties such as version pinning and running python in our buildout environment.

The Third Manifesto Implementers’ Workshop

Earlier this month I went to the Third Manifesto Implementers’ Workshop at Northumbria University in Newcastle. A group of us discussed recent developments in implementing the relational data model.

The relational data model was proposed by E. F. Codd in 1969 in response to the complex, hierarchical, data storage solutions of the time which required programs to be written and compiled for each database query. It was a powerful abstraction, but unfortunately SQL and its implementations missed out on important features, and broke it in fundamental ways. In response to this problem, and the industry’s approach towards object-databases, Chris Date and Hugh Darwen wrote “The Third Manifesto” (TTM) to put forward their ideas on how future database systems should work. I urge you to read their books (even if you’re not interested in the subject) – the language is amazing: precise, concise, comprehensive and easy to read – other technical authors don’t come close.

The relational model treats data as sets of logical propositions and allows them to be queried, manipulated and constrained declaratively. It abstracts away from physical storage and access issues which is why it will still be used a hundred years from now (and why NoSQL discussions like these http://wiki.apache.org/couchdb/EntityRelationship http://www.cmlenz.net/archives/2007/10/couchdb-joins are retrograde). If you’re writing loops to query your data, or having to navigate prescribed connection paths, then your abstractions are feeble and limited.

At the workshop, I talked about my project, Dee, which implements the relational ideas from TTM in Python. You can see my slides here (or here if you have an older browser).

Erwin Smout gave a couple of talks about implementing transition constraints and dispensing with data definition language.

David Livingstone walked us through the RAQUEL architecture – a layered approach along the lines of the OSI network model.

Hugh Darwen discussed the features and implementation of IBM’s Business System 12, one of the first ever relational database systems which had some surprisingly dynamic features, including key inferencing, so that view definitions could keep tabs on their underlying constraints.

Chris Date took us through his latest thoughts on how to update relational views in a generic way. The aim is for database users to be able to treat views and base tables in the same way, for both reading and writing. Lots to think about here, and my to-do list for Dee has grown another section.

Adrian Hudnott discussed a couple of research projects around optimising multiple relational assignments and tracking the source of updates so that transition constraints could be more effective.

Renaud de Landtsheer gave an insight into the work he’s been doing implementing first-order-logic constraints within Oracle databases.

I sat next to Toon Koppelaars (whose name went down well with the Geordies) and then I realised I had his book (Applied Mathematics for Database Professionals) waiting to be read on my desk at work, thanks to the eclectic Isotoma library (and Wes).

It was a packed couple of days with plenty of food for thought. Thank you to David Livingstone and Safwat Mansi for organising and hosting such an enjoyable and interesting event.

Tamper-protection for Bank Transactions

If you need to send electronic transactions to Swedish banks, you’ll be required to add anti-tampering seals to the files. The banks recommend you use a third-party system to create the HMAC SHA256-128 seals, but that could involve a fair amount of expensive server software and maintenance contracts (some linked to the number of people who work in your company).

Instead, you can do it yourself in Python like this:

import hmac
import hashlib
import string

NORMALISE = string.maketrans(
    '\xC9\xC4\xD6\xC5\xDC\xE9\xE4\xF6\xE5\xFC' + ''.join(
    [chr(x) for x in range(0,32)]) + ''.join(
    [chr(x) for x in range(127,256)
     if x not in (201,196,214,197,220,233,
                  228,246,229,252)]),
    '\x40\x5B\x5C\x5D\x5E\x60\x7B\x7C\x7D\x7E' + ''.join(
    [chr(195) for x in range(0,32)]) + ''.join(
    [chr(195) for x in range(127,256)
     if x not in (201,196,214,197,220,233,
                  228,246,229,252)]))

def hex_to_bytes(hexs):
    """Convert string of hex into bytes"""
    return ''.join(['%s' % chr(int(hexs[i:i+2], 16))
                    for i in range(0, len(hexs), 2)])

def get_signature(contents, key):
    """Calculate the HMAC SHA256-128 signature

       contents - an iso-8859-1 (latin-1) encoded string
       key - a string of hex characters

       Returns a 32 char string of hex characters (128 bits)
    """
    key = hex_to_bytes(key)

    #Normalise the contents
    contents = contents.translate(NORMALISE, '\r\n')

    dig = hmac.new(key, msg=contents,
                   digestmod=hashlib.sha256).digest()
    return ''.join(['%02X' % ord(x) for x in dig[:16]])

And then to calculate the signature for a file:

>>> print get_signature(open('bankfile.dat').read(),
...                     '1234567890abcdef1234567890abcdef')
25122AE4179BD51DC87AD6EA08D16D45

Scaffolding template tags for Django forms

We love Django here at Isotoma, and we love using Django’s awesome form classes to generate self-validating [X]HTML forms.

However, in practically every new Django project I find myself doing the same thing over and over again (and I know others do too): breaking the display of a Django form instance up into individual fields, with appropriate mark-up wrappers.

Effectively I keep recreating the output of BaseForm.as_p/as_ul/as_table with template tags and mark-up.

For example, outputting a login form, rather than doing:

{{ form.as_p }}

We would do:

<p>
{% if form.username.errors %}
  {% for error in form.username.errors %}
    {{ error }}
  {% endfor %}
{% endif %}
{{ form.username.label }} {{ form.username }}
</p>
<p>
{% if form.password.errors %}
  {% for error in form.password.errors %}
    {{ error }}
  {% endfor %}
{% endif %}
{{ form.password.label }} {{ form.password }}
</p>

Why would you want to do this? There are several reasons, but generally it’s to apply custom mark-up to a particular element (notice I said mark-up, not styling – that can be done with the generated field IDs), as well as to completely customise the output of the form (using <div>s instead, etc.), and also because some designers tend to prefer this way of looking at a template.

“But”, you might say, “Django already creates all this for us with the handy as_p/as_ul/as_table methods, can’t you just take the output from that?”
Well, yes; in fact on a project a couple of weeks ago that’s exactly what I did, outputting as_p in a template, and then editing the source chucked out in a browser.
Which gave me the idea to create a simple little tool to do this for me, but with the Django template tags for dynamically outputting the field labels and fields themselves.

I created django-form-scaffold to do just this, and now I can do this from a Python shell:

>>> from dfs import scaffold
>>> from MyProject.MyApp.forms import MyForm
>>> form = MyForm()
>>> # We can pass either an instance of our form class
>>> # or the class itself, but better to pass an instance.
>>> print scaffold.as_p(form)

{% if form.email.errors %}{% for error in form.email.errors %}
{{ error }}{% endfor %}{% endif %}
<p>{{ form.email.label }} {{ form.email }}</p>
{% if form.password1.errors %}{% for error in form.password1.errors %}
{{ error }}{% endfor %}{% endif %}
<p>{{ form.password1.label }} {{ form.password1 }}</p>
{% if form.password2.errors %}{% for error in form.password2.errors %}
{{ error }}{% endfor %}{% endif %}
<p>{{ form.password2.label }} {{ form.password2 }}</p>

Copy and paste this into a template, tweak, and Robert’s your mother’s brother.

As well as as_p(), the dfs.scaffold module also has the equivalent functions as_ul() and as_table(), and an extra as_div() function.

Writing interactive command line DB query tools with pyDBCLI

While some people can’t navigate a relational database without reaching for a GUI, the vast majority of us spend a good proportion of our lives inside interactive command line interfaces, such as psql or mysql.

What do you do, however, if there isn’t a CLI query tool available for the DB you’re working with?
I had this exact problem recently on a project; the main reason for the lack of any such tooling was that I wasn’t actually dealing with a real DB at all, but an ODBC interface to a web service that exposed reporting data as if it were tables in a DB. Rather than spend the rest of my life messing around with queries in a hooked-up spreadsheet, or repeatedly writing one-off Python snippets, I decided to write my own psql-like tool.

Thus the first version of pyDBCLI was born; well, not really – my first tool only handled querying Webtrends (I’ll save that for another post), while pyDBCLI is a base class for making such tools, as long as you have a DB API compatible cursor to query.
pyDBCLI is based on the fantastic cmd.Cmd, so you can extend it pretty much exactly as you would Cmd, except that some extra properties such as cursor and multi_prompt are provided.

In order to make a Cmd based tool behave more like psql I ended up overriding the parseline method with regular expressions to handle escaped commands such as “\d” and “\c”, fuzzing them if not escaped, or un-escaping them if escaped so that Cmd’s unmodified parseline method can handle parsing and dispatching to defined do_* methods.

The other main change was modifying the default method to dispatch command lines to a query method, which runs them against the DB API cursor. As well as this, default and several other commands will detect an unfinished SQL query and wait for a terminating character (“;” by default) before executing it (or doing anything else). This is what the multi_prompt property is for: the prompt property is replaced with multi_prompt when a query spans more than one line, and is set back again when the query is finished and executed.
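
The buffering mechanism looks roughly like this (a simplified sketch of the idea, not pyDBCLI’s actual code; the QueryShell name is made up):

import cmd

class QueryShell(cmd.Cmd):
    prompt = 'db=> '
    multi_prompt = 'db-> '

    def __init__(self, cursor):
        cmd.Cmd.__init__(self)
        self.cursor = cursor
        self.buffer = []

    def default(self, line):
        # Buffer lines until the statement terminator arrives.
        self.buffer.append(line)
        if line.rstrip().endswith(';'):
            sql = ' '.join(self.buffer).rstrip(';')
            self.buffer = []
            self.prompt = QueryShell.prompt
            self.cursor.execute(sql)
            for row in self.cursor.fetchall():
                print row
        else:
            self.prompt = self.multi_prompt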

2 example tools are bundled with pyDBCLI, in the extras package:

  • odbc – a tool to query an ODBC exposed data source, using PyODBC; takes PyODBC compatible DSN strings.
  • litecli – a tool to query a SQLite database; SQLite has its own CLI tool to do this, which is very well rounded and much better than litecli, but this is provided as a fairly functional example tool.

Tomorrow I’ll discuss the original reason I whipped up pyDBCLI: creating a tool for querying Webtrends, via ODBC, quickly and with more ease.

Querying Webtrends analytics via ODBC with SQLAlchemy

Webtrends is a web traffic analytics package, similar to Google Analytics. Recently we had the requirement of being able to pull data out of reports on a Webtrends instance.
Luckily enough they have a nice RESTful data extraction API; not so luckily it is only available for Webtrends Analytics 9 instances, while we were limited in our requirements to Webtrends 8.

Prior to Webtrends 9, the official data extraction method is a Windows-only ODBC driver, primarily used for connecting Excel spreadsheets and Microsoft ADO applications. The driver provides a pseudo-relational-database interface to pre-made reports on a Webtrends instance, which you can query using a simple SQL subset.
Notice I use qualifiers like “pseudo” and “interface”; that’s because what’s really going on in the background is that the driver makes an HTTP call (with details such as the SQL being sent) to a web service on the Webtrends instance, and the web service returns some binary data representing data in the queried report, which the driver then returns as a table. The reports themselves aren’t actually real tables in a real database, although I’m sure this is how they’re represented somewhere in the Webtrends system; what we get back is a set of aggregated data normally used to display pretty graphs and bar charts in the web interface.

To make life easier for us, as we were already using SQLAlchemy for querying PostgreSQL tables, and we would need to mock our Webtrends data at some point, it made sense to be able to use SQLAlchemy for all the data objects; with that in mind I made the SQLAWebtrends dialect.
After installing SQLAWebtrends you can create ORM classes matching the Webtrends reports you want to query, and then, using a specially formatted DSN, run queries against them as you would any other DB.

Considerations in making a dialect for Webtrends:

  • Nearly all the special features of SQLAlchemy, from full unicode support to field binding and various row counting hacks, need to be disabled, as the feature-set provided by the Windows ODBC driver is extremely limited.
  • The method for getting meta-data such as table names and columns needed to be overridden and reimplemented in a Microsoft ADO compatible way.
  • Some combination of PyODBC’s column pre-binding and the Webtrends driver’s complete lack of features means any attempt at binding will fail, so before executing queries I needed to “unbind” them, replacing ? placeholders with actual data, and relying on the inbuilt filtering method to provide any sort of field escaping/filtering.
  • LIMIT clauses are also extremely, erm, limited, for want of a better word. So that needed overriding too.
  • Finally, SQLAlchemy likes to wrap every value, including names, in quotes; Webtrends doesn’t like this, and quite frankly borks, so we disable this functionality too.
  • Bonus point: PyODBC rocks.

Caveats for use:

  • Your SQLAlchemy models for Webtrends reports shouldn’t include primary key columns, because frankly there probably aren’t any unique primary keys in the reports, but this doesn’t matter as SQLAlchemy won’t care as long as Webtrends doesn’t complain (which it won’t).
  • As with any other SQLAlchemy model you can call properties in the ORM class anything you like, but the underlying table column names need to match up to the column names in the Webtrends report.
  • The ODBC driver and web service don’t support JOINs, so you can’t use these with your ORM models either.
  • The iterator wrapper around the PyODBC cursor instance returned by queries will only ever return one row unless you call .yield_per(1) on the query-set. I haven’t had time to figure out why this is the case, but I suspect it’s something to do with row pre-buffering, which is disabled as a consequence of yield_per.
  • Every now and again you’ll see rows with lots of blank values in them, except that any number values (measures in Webtrends) will be higher. If you look closely, these are actually sums of all the values following them, up until the next row of seemingly blank data. These are aggregated summary rows, displaying sums of the data for that particular subset of data (depending on which field is the “dimension” for the report, a sort of primary key used in creating reports). Unless you’re after this data as well, I find the best thing is to just do a quick check for so many blank fields and skip these rows.

Example using models and running a query:

from sqlalchemy.orm import mapper, sessionmaker
from sqlalchemy import create_engine
from sqlalchemy import String, MetaData, Column, Table

# You would probably setup your username, password
# host etc. here, including the profile and template
# you want to query.

# Create the DB connection
engine = create_engine(
    "webtrends+pyodbc://%s:%s@%s:80/%s?dsn=Webtrends&profile_guid=%s" %
        (user, password, host, template, profile)
)
metadata = MetaData(bind=engine)
Session = sessionmaker(bind=engine)
session = Session()

# Table schema
wt_user_report = Table('UsersByPages', metadata,
    Column('User', String, nullable=True),
    Column('PagesURLs', String, nullable=True),
    Column('PageHits', String, nullable=True),
    Column('TimePeriod', String, nullable=True),
    Column('StartDate', String, nullable=True),
    Column('EndDate', String, nullable=True)
)

# ORM class
class WTUserReport(object):
    pass
mapper(WTUserReport, wt_user_report)

# Create a query
query = session.query(WTUserReport).filter_by(
    TimePeriod="2010.m06.d22"
).yield_per(1) # Remember we need this for the iterator to work

# Iterate over the query-set and print some columns
for r in query:
    print "User %s hit %s %s times" % (
        r.User,
        r.PagesURLs,
        r.PageHits,
)