Category Archives: Databases

The problem with Backing Stores, or what is NoSQL and why would you use it anyway

Durability is something that you normally want somewhere in a system: the data should survive reboots, crashes and the other sorts of things that routinely happen to real-world systems.

Over the many years that I have worked in system design, there has been a recurring thorny problem of how to handle this durable data.  What this means in practice when building a new system is the question “what should we use as our backing store?”.  Backing stores are often called “databases”, but everyone has a different view of what database means, so I’ll try and avoid it for now.

In a perfect world a backing store would be:

  • Correct
  • Quick
  • Always available
  • Geographically distributed
  • Highly scalable

While we can do these things quite easily these days with the stateless parts of an application, doing them with durable data is non-trivial. In fact, in the general case, it’s impossible to do all of these things at once (The CAP theorem describes this quite well).

This has always been a challenge, but as applications move onto the Internet, and as businesses become more geographically distributed, the problem has become more acute.

Relational databases (RDBMSes) have been around a very long time, but they’re not the only kind of database you can use. There have always been other kinds of store around, but the so-called NoSQL Movement has had particular prominence recently. This champions the use of new backing stores not based on the relational design, and not using SQL as a language. Many of these have radically different designs from the sort of RDBMS system that has been widely used for the last 30 years.

When and how to use NoSQL systems is a fascinating question, and here I’ll put forward our thinking on this. As always, it’s kind of complicated. It certainly isn’t the case that throwing out an RDBMS and sticking in Mongo will make your application awesome.

Although they are lumped together as “NoSQL”, this is not actually a useful definition, because there is very little that all of these have in common. Instead I suggest that there are these types of NoSQL backing store available to us right now:

  • Document stores – MongoDB, XML databases, ZODB
  • Graph databases – Neo4j
  • Key/value stores – Dynamo, BigTable, Cassandra, Redis, Riak, Couch

These are so different from each other that lumping them into the same category is really quite unhelpful.

Graph databases

Graph databases have some very specific use cases, for which they are excellent, and probably a lot of utility elsewhere. However, for our purposes they’re not something we’d consider generally, and I’ll not say any more about them here.

Document stores

I am pretty firmly in the camp that Document stores, such as MongoDB, should never be used generally either (for which I will undoubtedly catch some flak). I have a lot of experience with document databases, particularly ZODB and dbxml, and I know whereof I speak.

These databases store “documents” as schema-less objects. What we mean by a “document” here is something that is:

  • self-contained
  • always required in its entirety
  • more valuable than the links between documents or its metadata.

My experience is that although you may often think you have documents in your system, in practice this is rarely the case, and it certainly won’t continue to be the case. Often you start with documents, but over time you gain more and more references between documents, and then you gain records and all sorts of other things.

Document stores are poor at handling references, and because of the requirement to retrieve things in their entirety you denormalise a lot. The end result of this is loss of consistency, and eventually doom with no way of recovering consistency.

We do not recommend document stores in the general case.

Key/value stores

These are the really interesting kind of NoSQL database, and I think these have a real general potential when held up against the RDBMS options.  However, there is no magic bullet and you need to choose when to use them carefully.

You have to be careful when deciding to build something without an RDBMS. An RDBMS delivers a huge amount of value in a number of areas, and for all sorts of reasons. Many of the reasons are not because the RDBMS architecture is necessarily better but because they are old, well-supported and well-understood.

For example, PostgreSQL (our RDBMS of choice):

  • has mature software libraries for all platforms
  • has well-understood semantics for backup and restore, which work reliably
  • has mature online backup options
  • has had decades of performance engineering
  • has well understood load and performance characteristics
  • has good operational tooling
  • is well understood by many developers

These are significant advantages over newer stores, even if they might technically be better in specific use cases.

All that said, there are some definite reasons you might consider using a key/value store instead of an RDBMS.

Reason 1: Performance

Key/value stores often naively appear more performant than RDBMS products, and you can see some spectacular performance figures in direct comparisons. However, none of them really provide magic performance increases over RDBMS systems, what they do is provide different tradeoffs. You need to decide where your performance tradeoffs lie for your particular system.

In practice what key/value stores mostly do is provide some form of precomputed cache of your data, by making it easy (or even mandatory) to denormalize your data, and by providing the performance characteristics to make pre-computation reasonable.

If you have a key/value store that has high write throughput characteristics, and you write denormalized data into it in a read-friendly manner then what you are actually doing is precomputing values. This is basically Just A Cache. Although it’s a pattern that is often facilitated by various NoSQL solutions, it doesn’t depend on them.

RDBMS products are optimised for correctness and query performance, and write performance takes second place to these. This means they are often not a good place to implement a pre-computed cache (where you often write values you never read).

It’s not insane to combine an RDBMS as your master source of data with something like Redis as an intermediate cache.  This can give you most of the advantages of a completely NoSQL solution, without throwing out all of the advantages of the RDBMS backing store, and it’s something we do a lot.
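To make that concrete, here is a minimal read-through cache sketch. It assumes psycopg2 and redis-py, a book table and a 60-second expiry – the names and the expiry policy are illustrative rather than anything we’d prescribe, and a real design would depend on your invalidation needs.

import json
import psycopg2
import redis

pg = psycopg2.connect("dbname=books")
cache = redis.StrictRedis()

def get_book(book_id):
    key = "book:%s" % book_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)           # fast path: the precomputed/cached copy
    with pg.cursor() as cur:                # slow path: the authoritative RDBMS copy
        cur.execute("SELECT id, title FROM book WHERE id = %s", (book_id,))
        row = cur.fetchone()
    book = {"id": row[0], "title": row[1]}
    cache.setex(key, 60, json.dumps(book))  # Redis is Just A Cache: safe to lose
    return book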

Reason 2: Distributed datastores

If you need your data to be highly available and distributed (particularly geographically) then an RDBMS is probably a poor choice. It’s just very difficult to do this reliably and you often have to make some very painful and hard-to-predict tradeoffs in application design, user interface and operational procedures.

Some of these key/value stores (particularly Riak) can really deliver in this environment, but there are a few things you need to consider before throwing out the RDBMS completely.

Availability is often a tradeoff one can sensibly make.  When you understand quite what this means in terms of cost, both in design and operational support (all of these vary depending on the choices you make), it is often the right tradeoff to tolerate some downtime occasionally.  In practice a system that works brilliantly almost all of the time, but goes down in exceptional circumstances, is generally better than one that is in some ways worse all of the time.

If you really do need high availability though, it is still worth considering a single RDBMS in one physical location with distributed caches (just as with the performance option above).  Distribute your caches geographically, offload work to them and use queue-based fanout on write. This gives you eventual consistency, whilst still having an RDBMS at the core.

This can make sense if your application has relatively low write throughput, because all writes can be sent to the single location RDBMS, but be prepared for read-after-write race conditions. Solutions to this tend to be pretty crufty.
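A rough sketch of that shape, again assuming psycopg2 and redis-py – the hostnames and the update_book/fanout_worker functions are invented purely for illustration:

import json
import psycopg2
import redis

pg = psycopg2.connect("dbname=books host=central-db.example.com")
queue = redis.StrictRedis(host="queue.example.com")
regional_caches = [redis.StrictRedis(host=h)
                   for h in ("cache-eu.example.com", "cache-us.example.com")]

def update_book(book_id, title):
    # the single authoritative write goes to the central RDBMS...
    with pg:
        with pg.cursor() as cur:
            cur.execute("UPDATE book SET title = %s WHERE id = %s", (title, book_id))
    # ...then a message is queued for fan-out to the regional caches
    queue.rpush("book-updates", json.dumps({"id": book_id, "title": title}))

def fanout_worker():
    # run somewhere near the queue; readers hit their local cache, so until this
    # catches up they may see stale data -- the read-after-write race noted above
    while True:
        _, payload = queue.blpop("book-updates")
        book = json.loads(payload)
        for cache in regional_caches:
            cache.set("book:%s" % book["id"], payload)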

Reason 3: Application semantics vs SQL

NoSQL databases tend not to have an abstraction like SQL. SQL is decent in its core areas, but it is often really hard to encapsulate some important application semantics in SQL.

A good example of this is asynchronous access to data as part of calculations. It’s not uncommon to need to query external services, but SQL really isn’t set up for this. Although there are some hacky workarounds, if you have a microservice architecture you may find SQL really doesn’t do what you need.

Another example is staleness policies.  These are particularly problematic when you have distributed systems with parts implemented in other languages such as Javascript, for example if your client is a browser or a mobile application and it encapsulates some business logic.

Endpoint caches in browsers and mobile apps need to honour the same staleness policies you have in your backing store, so you end up implementing those policies in Javascript and then again in SQL, and maintaining both. These are hard to maintain and test at the best of times. If you can implement them in fewer places, or fewer languages, that is a significant advantage.

In addition, it is a practical case that we’re not all SQL gurus. Having something that is suboptimal in some cases but where we are practically able to exploit it more cheaply is a rational economic tradeoff.  It may make sense to use a key/value store just because of the different semantics it provides – but be aware of how much you are losing without including an RDBMS, and don’t be surprised if you end up reintroducing one later as a platform for analysis of your key/value data.

Reason 4: Load patterns

NoSQL systems can exhibit very different performance characteristics from SQL systems under real loads. Having some choice in where load falls in a system is sometimes useful.

For example, if you have something that scales front-end webservers horizontally easily, but you only have one datastore, it can be really useful to have the load occur on the application servers rather than the datastore – because then you can distribute load much more easily.

Although this is potentially less efficient, it’s far easier and often cheaper to spin up more application servers at times of high load than it is to scale a database server on the fly.

Also, SQL databases tend to have far better read performance than write performance, so fan-out on write (where you might have 90% writes to 10% reads as a typical load pattern) is probably better implemented using a different backing store that has different read/write performance characteristics.

Which backing store to use, and how to use it, is the kind of decision that can have huge ramifications for every part of a system.  This post has only had an opportunity to scratch the surface of this subject and I know I’ve not given some parts of it the justice they deserve – but hopefully it’s clear that every decision has tradeoffs and there is no right answer for every system.


A Different View (part 2)

In the previous post, we saw how we could use a single, raw query in Django to combat ORMs’ tendency to generate query explosions within loops. We used raw() with joins and some column renaming to ensure all the data we needed came back in one go. We had to modify the property names in the template slightly (e.g. book.jacket.image became book.jacket_image) and the result was a RawQuerySet which had a fair number of limitations but, we avoided the typical order-of-magnitude increase in query count.

Super models

[Image: supermodel – by Jamie Beck and Kevin Burg]

There is a way we can use a raw query, get all the benefits described in the previous post, and still return a real QuerySet. We need a Django model, and so we need an underlying table. A much underused feature of SQL databases is the view. Views are virtual tables derived from queries. Most are currently implemented as read-only, but they provide many benefits – another blog post would be needed to explain them all – including:

  • adding a level of abstraction, effectively an API (which is also available to other applications)
  • protecting the application from underlying table and column changes
  • giving fine-grained control of permissions, especially row-level
  • providing a way to efficiently add calculated columns such as line_total = quantity * price
  • avoiding de-normalising things which should remain derived from other data
  • avoiding the not-insubstantial cost of repeatedly sending the same complicated queries over the network
  • enabling the database to pre-compile and cache the privilege-checks and query plans

So we could define a database view (with a few extra foreign-key columns for use later):

CREATE VIEW vbook AS
SELECT
  book.id,
  book.id AS book_id,
  book.title,
  jacket.id AS jacket_id,
  jacket.image AS jacket_image,
  author.id AS author_id,
  author.name AS author_name,
  shelf."position"
FROM
  book
  NATURAL JOIN shelf
  NATURAL JOIN author
  NATURAL JOIN jacket

and then map a Django model to it, setting Meta.db_table = 'vbook' and Meta.managed = False to tell Django not to maintain the table definition.

(note again the use of natural join for simplicity (and short table names) and that they don’t actually work as-is with Django’s table/column naming)
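For the plain-Django version, the mapping might look roughly like this minimal sketch (the field names mirror the view columns above; treat it as an illustration rather than a recipe):

from django.db import models

class vBook(models.Model):
    title = models.CharField(max_length=100)
    jacket_image = models.ImageField(null=True)
    author_name = models.CharField(max_length=100)
    shelf_position = models.IntegerField(db_column='position')

    class Meta:
        db_table = 'vbook'
        managed = False   # Django must never try to create, alter or drop this "table"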

We’ve done this in the past for some of our projects, and shoe-horning the view-creation script into Django/buildout was always the fiddly bit.

Recently, we used django-postgres which makes the model-view mapping a bit simpler. You subclass pg.View instead of models.Model and set the sql property to your query, e.g.

class vBook(pg.View):
    title = models.CharField(max_length=100)
    jacket_image = models.ImageField(null=True)
    author_name = models.CharField(max_length=100)
    shelf_position = models.IntegerField()

    sql = BOOK_DETAILS

manage.py sync_pgviews then creates the views for the models in the specified application. It’s quite a lightweight tool, and simply builds and issues the SQL CREATE VIEW statements needed to wrap each view-based model’s sql query.

(There is a Kickstarter project under way which should also add view handling to Django)

We now have an almost fully-fledged Django model. We can query the objects against an effectively de-normalised table, while still retaining an essentially normalised schema. We can chain refinement methods to the resulting QuerySet to further filter and order the objects, including select_related() if necessary (although it would always be preferable, given time, to put the extra join into the view). We can have custom object managers, Meta options, introspection and all the other features of a Django model, even some read-only admin which would be good for reporting.

Note that we removed the

WHERE shelf.position <= 10

from the view definition. We can define the view without such filters and leave it to the Django application to apply them, e.g.

vbooks = vBook.objects.filter(shelf__position__lte = 10)

This gives a more flexible model, more akin to the original Book model, although you could keep the WHERE clauses in if you wanted to logically partition a table into multiple views/models.

Agile

Another great reason for using such view-based models is to postpone decisions about optimising bits of the system before we know anything about it. As an example, let’s say that a book needs to have an average rating that should be calculated from the entries in a ratings table. This could accurately be calculated using SQL’s AVG function, but that might be expensive and so we might need to pre-calculate it and possibly store it outside of the database. That would require us to design, build, test, document, and maintain:

  • an external data source capable of handling the unknown load and volume
  • a way to link a book to the external data source, possibly using new drivers
  • a way of updating the external data source for a book, duplicating the rating table entries
  • a way to synchronise the two data storage systems, possibly involving cron jobs and queues
  • a way to build and sustain the data source and links in whatever production environment we use, e.g. AWS

All of this would add complexity, up-front R&D and ongoing maintenance. The results would not be as fresh or reliable as the database results, and not necessarily any faster. Using AVG in a view leaves us scope to replace how it is calculated at a later date, once we know more about the performance.
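For example, assuming a rating table with book_id and score columns (hypothetical names), the view could carry the calculation itself:

CREATE VIEW vbook AS
SELECT
  book.id,
  book.title,
  (SELECT AVG(rating.score)
     FROM rating
    WHERE rating.book_id = book.id) AS average_rating
FROM
  book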

Often, caching the book results using Django’s cache is enough and we can stick with the basic approach. Premature optimisation could well be a huge waste of effort, and the view-based models let us defer those decisions and get started very quickly.

Abstract

Picasso - Skull and Pitcher

Defining views early in a project could also be a way to postpone building complex parts of the system. Let’s say a book has a complicated ‘cost’ attribute and we’re not sure yet how it will be calculated or where all the data will come from but we need to start to display it during the initial iterations. We could add the column to the book view on day one and worry about where it comes from later. e.g.

CREATE VIEW vbook AS
SELECT
  book.id,
  book.id AS book_id,
  book.title,
  6.283185 AS cost, -- todo: use a cost function here instead
  ...

And then vbook.cost can be used in the knowledge that the reference won’t change. Also, if the cost calculation is defined within the view and it needs to change post-production, the view can be recreated while the application is running with no migration or down-time.

More modelling

We can further enhance the view-based model and add relationships to give us back the all-too-convenient dot-notation – still useful if we’re careful to use it outside of large loops. We should make sure any relationships don’t give Django the impression that it needs to try to maintain integrity – it doesn’t need to since the underlying table only has virtual rows. We can do this using on_delete=models.DO_NOTHING, e.g.

class vBook(pg.View):
    title = models.CharField(max_length=100)
    jacket_image = models.ImageField(null=True)
    author_name = models.CharField(max_length=100)
    shelf_position = models.IntegerField()

    author = models.ForeignKey(Author,
                               on_delete=models.DO_NOTHING)
    # needs author_id to be returned by the sql

    sql = BOOK_DETAILS

These are, of course, complete Django relationships and so we can access them from any direction, e.g.

my_author.vbook_set.all()

would return an author’s books as vBook objects using a single extra query. You could even go further and use select_related('vbook') when getting the author. Django treats these models just like any other.

We can’t save or create using such view-based models – the database will reject that for anything but the simplest views, so we still use the underlying table-based models to do that. But we can link the models together, with a one-to-one relationship, to make things easier, e.g.

class vBook(pg.View):
    book = models.OneToOneField(Book,
                                on_delete=models.DO_NOTHING)
    # needs book_id to be returned by the sql

    title = models.CharField(max_length=100)
    jacket_image = models.ImageField(null=True)
    author_name = models.CharField(max_length=100)
    shelf_position = models.IntegerField()

    author = models.ForeignKey(Author,
                               on_delete=models.DO_NOTHING)
    # needs author_id to be returned by the sql

    sql = BOOK_DETAILS

Now, given a vbook instance, we can update the underlying book, for example:

vbook.book.title = 'A Tale of Two Cities'
vbook.book.save()

And also, given a book instance, we can ask for all the pre-joined book information:

book.vbook.author_name

There are still some questions about how best to use these models, e.g. how to share methods with underlying table-based models and how best to name fields from other models.

Deferred Optimisation

Once we’ve defined view-based models to bring together the core concepts of the system, we can build the application to use them for reading and the base-table models for writing. We then have a variety of options to greatly optimise the system at a later date.

Here are some ways we can optimise database views. The important points are:

  • just by using views to join tables, we’ve already made great savings in the number of database queries needed
  • the ideas below can be applied later, once we know if and where further optimisation is needed
  • none of these would require any changes to the application code using the data: they’re true implementation changes, and database views give us such a clear separation of interface and implementation that the application could stay running!

Pull from other tables

For example, at a later point, we could modify the view to join to a table containing pre-calculated ratings and return that instead of the AVG sub-select.
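For instance, assuming a pre-calculated book_rating table (hypothetical), the relevant part of the view might become a join rather than a sub-select, with the application none the wiser:

CREATE OR REPLACE VIEW vbook AS
SELECT
  book.id,
  book.title,
  book_rating.average_rating
FROM
  book
  LEFT JOIN book_rating ON book_rating.book_id = book.id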

Pull from external systems

We could use PostgreSQL’s foreign data wrappers to join to external data sources. It might be useful in itself to map a Django model to such a foreign table without an intermediate view, if all the data is foreign.
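A minimal sketch using postgres_fdw, assuming the ratings live in a second PostgreSQL database (all of the names here are invented):

CREATE EXTENSION postgres_fdw;

CREATE SERVER ratings_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'ratings.example.com', dbname 'ratings');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER ratings_server
  OPTIONS (user 'readonly', password 'secret');

CREATE FOREIGN TABLE external_book_rating (
  book_id integer,
  average_rating numeric
) SERVER ratings_server OPTIONS (table_name 'book_rating');

-- which vbook can then join to, just like a local table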

Use database functions

Database functions can be written in a number of languages including PL/pgSQL, Python, Ruby and Javascript. They are pre-compiled and very efficient, not least because they can access the data without it needing to be returned first, and because they run in situ they can filter out unwanted results.
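As a sketch, the average-rating calculation could move into a PL/pgSQL function (using the same hypothetical rating table as before) and be called from the view:

CREATE FUNCTION average_rating(p_book_id integer) RETURNS numeric AS $$
BEGIN
  RETURN (SELECT AVG(score) FROM rating WHERE book_id = p_book_id);
END;
$$ LANGUAGE plpgsql STABLE;

-- in the view: average_rating(book.id) AS average_rating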

Load-balance across read-only replicas

PostgreSQL has the concept of standby/slave servers that can be kept synchronised with a master server. These can be made available for read-only queries to spread the load. Since we can be sure that view-based models are only used for reading, we could use Django’s database router to route all queries against those models to the slave servers.
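A minimal sketch of such a router, assuming a 'replica' alias is configured in DATABASES and that the view-based models are grouped into their own application (here called 'views'):

class ViewModelRouter(object):
    """Send reads for view-based models to a read-only replica."""

    def db_for_read(self, model, **hints):
        if model._meta.app_label == 'views':   # assumption: the pg.View models live here
            return 'replica'
        return None                            # fall through to the default database

    def db_for_write(self, model, **hints):
        return None                            # writes always use the default database

The router would then be listed in DATABASE_ROUTERS in settings.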

Materialise

Views are virtual, except when they’re not. PostgreSQL 9.3 added materialised views.

CREATE MATERIALIZED VIEW vbook AS
SELECT ...

These are still declared as derived tables but the data in them is physically stored for much faster access: the results are effectively cached in the database. These are useful if the application can handle some staleness, and the only change necessary is in the way the view is declared – the application needn’t know or care. Such views can be refreshed periodically (although before 9.4 the refresh takes out a table lock). Indexes can be added to materialised views, potentially giving massive performance increases, especially for calculated columns.
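Refreshing is a single statement, which could be run from cron or a management command:

REFRESH MATERIALIZED VIEW vbook;

-- from 9.4, given a unique index on the materialised view,
-- readers need not be locked out while it refreshes:
REFRESH MATERIALIZED VIEW CONCURRENTLY vbook;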

Although I think materialisation is a last resort, it’s a very powerful one. I’ve added a MaterializedView class to isotoma/django-postgres which should help create them, though it could do with some more testing and options to control refreshing.

This really gives the best of both worlds: a single query can provide Django objects with the speed expected from de-normalised storage, but derived by the database from a normalised schema.

Summary

Django’s ORM makes accessing related objects simple and convenient. However, when using an ORM, accessing related objects in loops often leads to an explosion of supporting queries, which can go unnoticed during development but which can lead to poor performance. Django’s ORM has some methods that try to alleviate the problem but they have limitations. We can use raw SQL to efficiently join related information in the database to avoid these query explosions, as well as giving us more powerful ways to group and summarise data. I think these SQL queries should be given more prominence in our projects.

We can go further and push the raw SQL into the database by declaring virtual tables (database views). We can then map Django models onto these virtual tables to also give an extra layer of abstraction on top of the base models. This lets us defer implementation decisions and provides lots of ways to optimise the system at a later stage, once we know more about its performance, without affecting the application code.



A Different View

One of the consequences of the mismatches between databases and ORMs is that iterating over objects can generate explosions of database queries. Often, these explosions go unnoticed because the individual queries are fast and the number of concurrent users is small, especially during development. The explosions are sometimes hard to detect, either because developers don’t look for them or the tools that can expose the explosions don’t report everything (e.g. the Django debug toolbar doesn’t report the queries generated by ajax calls).

Exploding

Django does a good job of lazily evaluating what it can, but the simple act of using the convenient dot-notation inside a loop can turn a single database query into a cascade of supporting queries.

For example, given this book model:

class Book(models.Model):
    title = models.CharField(max_length=100)
    author = models.ForeignKey(Author)
    jacket = models.ForeignKey(Jacket)
    shelf = models.ForeignKey(Shelf)

and this books query:

books = Book.objects.filter(shelf__position__lte = 10)

which returns 10 rows. Then this template:

{% for book in books %}
  {{ book.jacket.image }} {{ book.title }} 
  by {{ book.author.name }}
{% endfor %}

would issue a single query to get all 10 rows, but would then generate 2 additional queries per iteration to get the jacket and the author for each book. So a total of 21 queries. You really need to use the Django debug toolbar to realise this – install it now and keep your eye on it until the very end of the project: it only takes a late-breaking, well-intentioned dot to cause an explosion.

A good way to detect when explosions happen is to check the number of queries in your unit tests.

self.assertNumQueries(self.BOOKS_QUERY_COUNT, self.client.get, reverse('books'))

Add these at the start of your project. It means you have to keep increasing the counts while developing, but it does make it obvious when something has exploded.

Remember, every one of those queries involves the following steps:

  • create a connection to the database (tcp/ip – handshake – username, password) (set CONN_MAX_AGE in Django 1.6 to help reduce this overhead)
  • send the SQL query over the network
  • parse the SQL query
  • check the table and column read/write permissions
  • optimise the query plan based on the latest statistics
  • start a transaction
  • perform the query (checking for other users’ locks)
  • return the resulting metadata and rows over the network
  • commit the transaction
  • close the connection

Joining

A far better approach is to get the database to do the work. It can join together the jacket and author information much more efficiently (orders of magnitude more efficiently – but that’s another topic).

Django added select_related() to try to address this problem. Given a field name, it passes the required table to the database to be joined with the original queryset, e.g.

books = Book.objects.filter(shelf__position__lte = 10
                           ).select_related('jacket',
                                            'author')

It is limited though, since it only works with some relationships in some directions, and it pulls in all the columns which is usually very wasteful, especially if large text fields are present. Adding defer() or only() to work around this can help, but isn’t pretty, e.g.

books = Book.objects.filter(shelf__position__lte = 10
                           ).select_related('jacket', 'author'
                           ).defer('author__biography',
                                   'author__summary',
                                   'author__resume')

prefetch_related() was added later to help with more complex relationships but still issues multiple queries and does the ‘joining’ in Python. Besides that, it also has a number of caveats and special interactions with other methods. Trying to combine these with annotations for aggregate counts and such would lead to some fragile, hard-to-construct and, I think, hard-to-read syntax that can fail even for slightly complicated queries.
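For illustration only: if Book also had a many-to-many tags field (hypothetical), the combined call might look like this – and it would still issue a separate query for the tags and stitch the results together in Python:

books = Book.objects.filter(shelf__position__lte = 10
                           ).select_related('jacket', 'author'
                           ).prefetch_related('tags')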

So instead, we could issue the following query:

books = Book.objects.raw("""
  SELECT
    book.id, book.title,
    jacket.image AS jacket_image,
    author.name AS author_name
  FROM book
    NATURAL JOIN shelf NATURAL JOIN author NATURAL JOIN jacket
  WHERE shelf.position <= 10
""")

(note: I’ll use natural join throughout for simplicity, and short table names for that matter, but they don’t actually work as-is with Django’s table/column naming)

And then, a slightly modified template:

{% for book in books %}
  {{ book.jacket_image }} {{ book.title }} 
  by {{ book.author_name }}
{% endfor %}

would only issue 1 query, regardless of the number of books.

Adding further related information to the original template will dramatically increase the number of queries, whereas joining another table to the raw SQL version will add none. Similarly, returning 20 books instead of 10 would just about double the number of queries for the original version (so 41 for our example), whereas the raw one would still only issue a single query.

As well as avoiding query explosions, there are other benefits to using raw SQL. A full explanation needs another blog post, but some of the major ones are:

  • it’s the only way to do anything but the simplest of aggregation queries.
  • it only returns what is required (assuming you avoid SELECT *). Pulling every column, including large blobs, over the network when they’re not referenced wastes CPU, temporary sort/cache space and network bandwidth.
  • you can use easy-to-read comparisons (e.g. position <= 10 instead of shelf__position__lte = 10).
  • if you need/want, you can use database-specific features, such as function calls, range queries, NULL ordering, etc.
  • checking the query plan is easier – there’s no need to set a breakpoint and then print the queryset.query.
  • it returns consistent results, whereas multiple queries, even in the same transaction, can see different data depending on what’s been committed by concurrent users (unless we used the serializable isolation level, which we don’t).

Formatting

I’ve tried a number of ways to embed SQL queries into Django modules:

...raw("""
SELECT ...
FROM ...
WHERE ...
""")

is jarring.

...raw("""SELECT ...
          FROM ...
          WHERE ...
       """)

is better, but adds too much whitespace and makes it harder to edit.

...raw("""SELECT ..."""
       """FROM ..."""
       """WHERE ...""")

avoids the extra whitespace but doesn’t ensure enough whitespace unless you’re careful to add a space at the end of each line.

None are great and none of them provide an easy way to copy and paste the query, which is necessary to tune it and check its plan in a database client.

Elevating

For a better approach, I suggest creating an sql directory (alongside models.py – it is that important) and creating python modules in there to hold the named query declarations, e.g. sql/books.py:

BOOK_DETAILS = """
SELECT
  book.id,
  book.title,
  jacket.image AS jacket_image,
  author.name AS author_name
FROM
  book
  NATURAL JOIN shelf
  NATURAL JOIN author
  NATURAL JOIN jacket
WHERE
  shelf.position <= 10
"""

then it can be called like this:

from sql.books import BOOK_DETAILS
 
books = Book.objects.raw(BOOK_DETAILS)

Clicking the BOOK_DETAILS reference in any decent IDE will take you to the query. Now that the query has a bit more respect it can more easily be copied and pasted into other tools, and formatted as you’d like, and re-used without repetition.

One more note about formatting. Although you could put Python comments before the declaration, I suggest keeping the query comments inside the query because they may be useful on the server, e.g. when viewing the database logs. So instead of:

#Get top 10 book summaries
#(including author and jacket details)
BOOK_DETAILS = """
SELECT
...
"""

use:

BOOK_DETAILS = """
/*
Get top 10 book summaries
(including author and jacket details)
*/
SELECT
...
"""

Downsides

There are some downsides with raw(), however.

  • You need some SQL knowledge of course, and an understanding of how Django links tables together, but I think you should have that anyway.
  • If your SQL does introduce some non-standard syntax (which shouldn’t be the case for straightforward joins) it could make your application less portable.  Although I’d argue that some features are worth that sacrifice, and would hope the likelihood of switching to a different database would be low if you’re already using a good one.
  • Even though raw() ensures only 1 query is used, that query must still be sent over the network and parsed every time it’s run.
  • The real problem is that raw() returns a RawQuerySet, not a QuerySet, so no further refinement methods can be applied (e.g. filter, order_by) – though arguably they’re better off being added to the SQL.

We can do better, as I’ll explain in part 2.


The Third Manifesto Implementers’ Workshop

Earlier this month I went to the Third Manifesto Implementers’ Workshop at Northumbria University in Newcastle. A group of us discussed recent developments in implementing the relational data model.

The relational data model was proposed by E. F. Codd in 1969 in response to the complex, hierarchical, data storage solutions of the time which required programs to be written and compiled for each database query. It was a powerful abstraction, but unfortunately SQL and its implementations missed out on important features, and broke it in fundamental ways. In response to this problem, and the industry’s approach towards object-databases, Chris Date and Hugh Darwen wrote “The Third Manifesto” (TTM) to put forward their ideas on how future database systems should work. I urge you to read their books (even if you’re not interested in the subject) – the language is amazing: precise, concise, comprehensive and easy to read – other technical authors don’t come close.

The relational model treats data as sets of logical propositions and allows them to be queried, manipulated and constrained declaratively. It abstracts away from physical storage and access issues which is why it will still be used a hundred years from now (and why NoSQL discussions like these http://wiki.apache.org/couchdb/EntityRelationship http://www.cmlenz.net/archives/2007/10/couchdb-joins are retrograde). If you’re writing loops to query your data, or having to navigate prescribed connection paths, then your abstractions are feeble and limited.

At the workshop, I talked about my project, Dee, which implements the relational ideas from TTM in Python. You can see my slides here (or here if you have an older browser).

Erwin Smout gave a couple of talks about implementing transition constraints and dispensing with data definition language.

David Livingstone walked us through the RAQUEL architecture – a layered approach along the lines of the OSI network model.

Hugh Darwen discussed the features and implementation of IBM’s Business System 12, one of the first ever relational database systems which had some surprisingly dynamic features, including key inferencing, so that view definitions could keep tabs on their underlying constraints.

Chris Date took us through his latest thoughts on how to update relational views in a generic way. The aim is for database users to be able to treat views and base tables in the same way, for both reading and writing. Lots to think about here, and my to-do list for Dee has grown another section.

Adrian Hudnott discussed a couple of research projects around optimising multiple relational assignments and tracking the source of updates so that transition constraints could be more effective.

Renaud de Landtsheer gave an insight into the work he’s been doing implementing first-order-logic constraints within Oracle databases.

I sat next to Toon Koppelaars (whose name went down well with the Geordies) and then I realised I had his book (Applied Mathematics for Database Professionals) waiting to be read on my desk at work, thanks to the eclectic Isotoma library (and Wes).

It was a packed couple of days with plenty of food for thought. Thank you to David Livingstone and Safwat Mansi for organising and hosting such an enjoyable and interesting event.

Writing interactive command line DB query tools with pyDBCLI

While some people can’t navigate a relational database without reaching for a GUI, the vast majority of us spend a good proportion of our lives inside interactive command line interfaces, such as psql or mysql.

What do you do, however, if there isn’t a CLI query tool available for the DB you’re working with?
I had this exact problem recently with a project. The main reason for the lack of any such tooling was that I wasn’t actually dealing with a real DB at all, but with an ODBC interface to a web service that exposed reporting data as if it were tables in a DB. Rather than spend the rest of my life messing around with queries in a hooked-up spreadsheet, or repeatedly writing one-off Python snippets, I decided to write my own psql-like tool.

Thus the first version of pyDBCLI was born; well, not really: my first tool only handled querying Webtrends (I’ll save that for another post), whereas pyDBCLI is a base class for building such tools, as long as you have a DB API compatible cursor to query.
pyDBCLI is based on the fantastic cmd.Cmd, so you can extend it pretty much exactly as you would Cmd, except that some extra properties such as cursor and multi_prompt are provided.

In order to make a Cmd based tool behave more like psql I ended up overriding the parseline method with regular expressions to handle escaped commands such as “\d” and “\c”, fuzzing them if not escaped, or un-escaping them if escaped so that Cmd’s unmodified parseline method can handle parsing and dispatching to defined do_* methods.

The other main change was modifying the default method to dispatch command lines to a query method which runs them against the DB API cursor. As well as this, default and several other commands will detect an unfinished SQL query and wait for a finishing character (“;” by default) before sending it (or doing anything else). This is what the multi_prompt property is for: the prompt property is replaced with multi_prompt when a query spans more than one line, and is set back again when the query is finished and executed.
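To make that concrete, here is a rough sketch (not pyDBCLI’s actual source) of a Cmd subclass doing both of those things: translating backslash commands in parseline, and buffering multi-line SQL in default until a “;” arrives.

import cmd
import re

class DBShell(cmd.Cmd):
    prompt = 'db> '
    multi_prompt = '... '                 # shown while a query spans multiple lines

    def __init__(self, cursor):
        cmd.Cmd.__init__(self)
        self.cursor = cursor              # any DB API compatible cursor
        self._buffer = []

    def parseline(self, line):
        # turn psql-style "\d foo" into "d foo" so Cmd dispatches to do_d()
        match = re.match(r'\\(\w+)\s*(.*)', line)
        if match:
            line = '%s %s' % match.groups()
        return cmd.Cmd.parseline(self, line)

    def default(self, line):
        # anything that isn't a recognised command is treated as (part of) a SQL query
        self._buffer.append(line)
        if line.rstrip().endswith(';'):
            sql = ' '.join(self._buffer)
            self._buffer = []
            self.prompt = DBShell.prompt
            self.cursor.execute(sql)
            for row in self.cursor.fetchall():
                print row
        else:
            self.prompt = self.multi_prompt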

2 example tools are bundled with pyDBCLI, in the extras package:

  • odbc – a tool to query an ODBC exposed data source, using PyODBC; takes PyODBC compatible DSN strings.
  • litecli – a tool to query a SQLite database; SQLite has its own CLI tool for this, which is very well rounded and much better than litecli, but litecli is provided as a fairly functional example tool.

Tomorrow I’ll discuss the original reason I whipped up pyDBCLI: creating a tool for querying Webtrends, via ODBC, quickly and with more ease.

Querying Webtrends analytics via ODBC with SQLAlchemy

Webtrends is a web traffic analytics package, similar to Google Analytics. Recently we had the requirement of being able to pull data out of reports on a Webtrends instance.
Luckily enough they have a nice RESTful data extraction API; not so luckily it is only available for Webtrends Analytics 9 instances, while we were limited in our requirements to Webtrends 8.

Prior to Webtrends 9 the official data extraction method is a Windows-only ODBC driver, primarily used for connecting Excel spreadsheets and Microsoft ADO applications. The driver provides a pseudo-relational-database interface to pre-made reports on a Webtrends instance, which you can query using a simple SQL-subset.
Notice I use qualifiers like “pseudo” and “interface”: what’s really going on in the background is that the driver makes an HTTP call (including details such as the SQL being sent) to a web service on the Webtrends instance, and the web service returns some binary data representing the queried report, which the driver then presents as a table. The reports themselves aren’t actually real tables in a real database (although I’m sure that is how they’re represented somewhere in the Webtrends system); what we get back is a set of aggregated data normally used to display pretty graphs and bar charts in the web interface.

To make life easier for us, as we were already using SQLAlchemy for querying PostgreSQL tables, and we would need to mock our Webtrends data at some point, it made sense to be able to use SQLAlchemy for all the data objects; with that in mind I made the SQLAWebtrends dialect.
After installing SQLAWebtrends you can create ORM classes matching the Webtrends reports you want to query, and then, using a specially formatted DSN, run queries against them as you would any other DB.

Considerations in making a dialect for Webtrends:

  • Nearly all the special features of SQLAlchemy, from full unicode support to field binding and various row counting hacks, need to be disabled, as the feature set provided by the Windows ODBC driver is extremely limited.
  • The method for getting metadata such as table names and columns needed to be overridden and implemented using a Microsoft ADO compatible approach.
  • Some combination of PyODBC’s column pre-binding and the Webtrends driver’s complete lack of features means any attempt at binding will fail, so before executing queries I needed to “unbind” them, replacing ? placeholders with actual data and relying on the inbuilt filtering methods to provide any sort of field escaping/filtering.
  • LIMIT clauses are also extremely, erm, limited, for want of a better word. So that needed overriding too.
  • Finally, SQLAlchemy likes to wrap every value, including names, in quotes; Webtrends doesn’t like this, and quite frankly borks, so we disable this functionality too.
  • Bonus point: PyODBC rocks.

Caveats for use:

  • Your SQLAlchemy models for Webtrends reports shouldn’t include primary key columns, because frankly there probably aren’t any unique primary keys in the reports, but this doesn’t matter as SQLAlchemy won’t care as long as Webtrends doesn’t complain (which it won’t).
  • As with any other SQLAlchemy model you can call properties in the ORM class anything you like, but the underlying table column names need to match up to the column names in the Webtrends report.
  • The ODBC driver and web service don’t support JOINs, so you can’t use these with your ORM models either.
  • The iterator wrapper around the PyODBC cursor instance returned by queries will only ever return one row unless you call .yield_per(1) on the query-set. I haven’t had time to figure out why this is the case, but I suspect it’s something to do with row pre-buffering, which is disabled as a consequence of yield_per.
  • Every now and again you’ll see rows with lots of blank values in them, except that any numeric values (measures in Webtrends) will be higher. If you look closely, these are actually sums of all the values that follow, up until the next row of seemingly blank data. They are aggregated summary rows, displaying sums of the data for that particular subset (depending on which field is the “dimension” for the report, a sort of primary key used in creating reports). Unless you’re after this data as well, I find the best thing is to do a quick check for that many blank fields and skip these rows.

Example using models and running a query:

from sqlalchemy.orm import mapper, sessionmaker
from sqlalchemy import create_engine
from sqlalchemy import String, MetaData, Column, Table

# You would probably setup your username, password
# host etc. here, including the profile and template
# you want to query.

# Create the DB connection
engine = create_engine(
    "webtrends+pyodbc://%s:%s@%s:80/%s?dsn=Webtrends&profile_guid=%s" %
        (user, password, host, template, profile)
)
metadata = MetaData(bind=engine)
Session = sessionmaker(bind=engine)
session = Session()

# Table schema
wt_user_report = Table('UsersByPages', metadata,
    Column('User', String, nullable=True),
    Column('PagesURLs', String, nullable=True),
    Column('PageHits', String, nullable=True),
    Column('TimePeriod', String, nullable=True),
    Column('StartDate', String, nullable=True),
    Column('EndDate', String, nullable=True)
)

# ORM class
class WTUserReport(object):
    pass
mapper(WTUserReport, wt_user_report)

# Create a query
query = session.query(WTUserReport).filter_by(
    TimePeriod="2010.m06.d22"
).yield_per(1) # Remember we need this for the iterator to work

# Iterate over the query-set and print some columns
for r in query:
    print "User %s hit %s %s times" % (
        r.User,
        r.PagesURLs,
        r.PageHits,
    )