Data Modelling for Startups (Part I)

Paddy Alton
Apolitical Engineering
12 min read · May 15, 2023


This is the first article in a three-part series.

Why read these articles?

Over the last five years at Apolitical, the global platform for government, my team and I had the chance to do something exciting: to build up the company’s data infrastructure and systems from scratch.

Always a small team, we had to keep it lean — building only what was necessary to get the job done at the time. As the company has matured, so has our data operation. But if I were doing it all again, I wouldn’t do it the same way — data tooling has come a long way in half a decade!

In these articles I will try to distil a few lessons for startups beginning to grapple with their data, based on how I’d do it now.

Introduction — data modelling

Here’s something we’ve changed recently: in 2022 we took the leap from ETL to ELT. In other words, we now aggregate all our data in a central data warehouse (in as close to its ‘raw’ form as possible), then transform it in situ.

This change to our working methods has been revolutionary. The transformation step entails bringing together data that originated in different sources, then executing joins, aggregates, filters etc. to bring it into a form that

  1. matches the business understanding of the data as closely as possible
  2. is suitable for ingestion by business intelligence (BI) tools, machine-learning workflows, and regular-old data analysts

This process is what I mean by data modelling.

Under the old ETL model, doing it well requires a lot of up-front planning — which puts a startup in a tricky spot. But there are real benefits to proper data modelling. Our switch to ELT has greatly improved our situation, because we can now quickly iterate our transformation logic without painstakingly backfilling data.

In turn, this has allowed us to introduce more powerful BI tools that expect and require well-modelled data in order to be effective.

There are many articles ‘out there’ about data modelling. However, I have tended to find that these present idealised cases, rather than dealing with some of the more complex situations one encounters in the real world.

This makes sense. Writing about complexity takes a lot more time … and a lot more words. I hope that a summary of what I’ve learnt over the last year will offer a new and different perspective — one that I trust will be relevant to others working in small, lean teams. In short: it’s the article I wish I’d read a year ago.

This first article focuses on the background needed to understand the point of data modelling and to participate in discussions about it. I include some of the thornier issues that don’t always get a mention in introductory articles.

The second article will cover how you might put the necessary infrastructure in place to do data modelling properly and reap its benefits.

The final article will draw these two strands together, offering examples and practical advice.

First, some theory

From now on I will use the term data models to refer to the final products of the data transformation process. Typically, these will be tables stored in your data warehouse.

Data models in a warehouse (a.k.a. an OLAP database) do not need to follow the highly normalised schema typical of a production SQL database (a.k.a. an OLTP database). Instead, data should be denormalised into a relatively small number of large tables.

We generally stop denormalising when we achieve something called a star schema, which is what you get when you morph your data into fact and dimension tables (sometimes with additional bridge tables, a.k.a. junction tables).

Facts, dimensions, bridges, stars…?

A dimension table represents a business entity of a particular type.

Each row represents a single entity, and each column an attribute of such entities. There should be a unique ID (primary key) to act as a stable identifier for the entity.

A table with three columns: ID, Name, and Country
Every row represents a specific customer. The ID doesn’t have to be numeric; it could be a UUID.

For example: in a table of customers, every row should represent a single customer. The table should have a ‘customer ID’ column. Any other columns will represent attributes: things we know about that customer (e.g. given name, country of residence).

We would expect dimension tables to change over time, since some attributes may be mutable. For example, a customer can change their contact details, or even their name.
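To make this concrete, here’s a minimal sketch of a customers dimension written as a dbt-style model (the staging model stg_users and its columns are hypothetical stand-ins for whatever your warehouse actually contains):

-- models/dimensions/customers.sql
-- A minimal customers dimension ('stg_users' is a hypothetical staging model)

SELECT
    user_id AS customer_id,  -- stable primary key
    given_name,              -- mutable attribute
    country                  -- mutable attribute
FROM {{ ref('stg_users') }}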

A fact table represents a business process we want to measure and monitor.

A row in such a table represents an event or measurement (or perhaps an aggregated group of events/measurements). A ‘pure’ fact table row should consist of

  • identifiers (foreign keys) representing links to various dimension tables
  • any measures (numeric values) associated with the fact
A table with four columns: Customer ID, Timestamp, Tax, and Gross Value
Every row represents a specific sale that took place.

For example: in a table of sales, every row would represent a single sale. The table might have a ‘customer ID’ column, containing IDs from the customers (dimension) table. The table could also have a ‘timestamp’ column (indicating when the sale took place) and a ‘value’ column, capturing the value of the sale. We can imagine other measure columns capturing things like the tax applied to the sale.

We would not generally expect facts to change after they are first recorded (unless some kind of correction is issued). New facts can be simply appended to the rest of the data.
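A corresponding sales fact might be sketched like this (again dbt-style; stg_orders and its columns are hypothetical):

-- models/facts/sales.sql
-- A minimal sales fact ('stg_orders' is a hypothetical staging model)

SELECT
    customer_id,                   -- foreign key into the customers dimension
    ordered_at AS sale_timestamp,  -- when the sale took place
    tax,                           -- measure
    gross_value                    -- measure
FROM {{ ref('stg_orders') }}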

A bridge table can be used to represent many-to-many relationships between dimension tables (i.e. mutable relationships that aren’t considered to be a ‘fact’).

Two ‘dimension’ tables with ID columns (one has a ‘Name’ column, the other an ‘Address’ column). Arrows point from the ID columns to a third (bridge) table. It has two columns, Customer ID and Address ID. It pairs customers with addresses.
This bridge table simply maps IDs from one dimension to another, representing a many-to-many relationship.

For example: if multiple customers live at the same address, and some customers have multiple delivery addresses, we might have an Address dimension and a Customer dimension, with a mapping table matching Addresses to Customers.

Alternative ways to deal with this issue:

  • use a nested data-type on the dimension table, i.e. an array of addresses on the customer table (but not many BI tools can handle this)
  • change the ‘grain’ — create a ‘Localised Customers’ dimension table representing everything we know about a particular customer in a particular place, with a surrogate key generated for each unique customer/place pair (but this abstraction is hard for end users to reason about)
  • rethink the model to disassociate the dimensions by putting a fact table between them, e.g. in this case a ‘deliveries’ fact table can reference an Address and a Customer without creating an additional link between the two (but of course this is not always possible)

Because of the drawbacks to these alternatives, bridge tables have their place.
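For illustration, here’s how an analyst might hop across a hypothetical customer_addresses bridge to count the customers registered at each address (all table and column names invented):

-- How many customers are registered at each address?
SELECT
    address_id,
    COUNT(DISTINCT b.customer_id) AS n_customers
FROM addresses AS a
LEFT JOIN customer_addresses AS b USING (address_id)
GROUP BY address_id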

A star schema imagines a fact table (representing a particular business process) with many dimensions hanging off it (each fact table column is either a measure or references a single dimension table). Thus, the fact table is the centre of the star and the dimension tables are the points.

A blue square labelled ‘Sales’ is surrounded by three red squares labelled customers, products, and dates. Arrows point from these to the central blue square.
Foreign keys in the fact table reference the primary keys of the dimension tables (more on treating dates as a dimension later in the article). In practice, fact tables capture many-to-many relationships (e.g. a product may be bought by many customers, and a customer may buy many products).
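The payoff is that most questions about the business process become a short query. For example, total sales value per customer country is a single join from fact to dimension (names illustrative):

SELECT
    c.country,
    SUM(s.gross_value) AS total_value
FROM sales AS s
LEFT JOIN customers AS c USING (customer_id)
GROUP BY c.country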

In practice, the situation is often more complex:

  • we may be using multiple tables, with a bridge table, to capture all the relevant information about a particular dimension
  • multiple facts may reference the same dimensions. We call these ‘conformed dimensions’ (they act as a single-source-of-truth for the entities they represent across all analysis contexts).
  • often we want to compare facts (e.g. different actions such as subscriptions and purchases) with a common dimension … or more than one common dimension

This final pattern is sometimes called the ‘chasm trap’. It looks like this:

A blue square labelled Sales is on the left, and another labelled Reviews is on the right. Three red squares form a vertical line between the two blue squares, dividing them. The red squares are labelled Customers, Products, and Dates. Two arrows start at each red square, pointing at each of the blue squares.
If we want to compare two fact tables, we have to work out how to ‘cross the chasm’ (i.e. understand the relationship between the two based on their shared dimensions and join them appropriately).

The reason it’s a ‘trap’ is that, when you have multiple ‘join paths’ between fact tables, it’s very easy to end up double-counting facts. Some business intelligence software can manage this issue.
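To see the trap in action, consider a naive join between the two facts (a sketch with invented names; suppose some customer has three sales and two reviews):

-- DANGER: joining two facts directly on a shared dimension key fans out
-- the rows: three sales x two reviews = six rows for that customer,
-- so each sale is counted twice
SELECT
    customer_id,
    SUM(s.gross_value) AS total_value  -- inflated!
FROM sales AS s
JOIN reviews USING (customer_id)
GROUP BY customer_id

One safe pattern is to aggregate each fact to the shared grain first and join the aggregates, so that there is only ever one row per customer on each side of the join.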

Things can get a bit more dangerous when we create a ‘galaxy’ schema — a set of star schemas connected via conformed (shared) dimensions:

Four red squares represent dimensions as follows: dates, products, customers, and subscriptions. Three blue squares represent facts as follows: sales, reviews, and subscription creations. Sales and reviews facts both link to dates, products, and customers dimensions. Subscription creations facts link to dates, subscriptions, and customers.
An example ‘galaxy schema’. Note that queries including the ‘Subscription Creation’ fact could reference the ‘Subscriptions’ dimension, which is NOT shared with the other fact tables.

This can sometimes cause problems if queries regarding multiple different facts reference dimensions that are not shared.

In these scenarios, care needs to be taken that the query is meaningful! As a rule of thumb, filtering by an attribute of a dimension that isn’t shared is usually okay (consider, for example, counting sales made to customers with a particular type of subscription). However, queries that attempt to aggregate by such attributes should be checked carefully.
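For instance, against the galaxy schema above, the (usually okay) filtering case might look like this (a sketch; the plan column on the subscriptions dimension is invented for illustration):

-- Count sales made to customers holding a 'premium' subscription
SELECT COUNT(*) AS n_sales
FROM sales
WHERE customer_id IN (
    SELECT sc.customer_id
    FROM subscription_creations AS sc
    JOIN subscriptions AS sub USING (subscription_id)
    WHERE sub.plan = 'premium'
)

By contrast, joining sales directly to subscriptions in order to group by plan would fan out the rows for customers holding multiple subscriptions; that’s the sort of query to check carefully.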

One way to avoid this is to expose (e.g. in a business intelligence tool) coherent groups of data models that avoid such pitfalls. These model groups would contain either

  • a single star
  • a single chasm
This is like the previous diagram, with two overlapping shaded areas. An orange shaded area groups dates, customers, and subscriptions dimensions with the subscription creations fact (a star). A purple shaded area groups dates, customers, and products dimensions with sales and reviews facts (a chasm).
The orange area represents a single star schema, the purple area represents a single chasm schema.

Queries executed against such model groups are much more likely to produce meaningful results.

To OBT or not to OBT?

Perhaps the simplest way to create a reliable model group is to construct an OBT model — that stands for ‘One Big Table’.

In the era of cloud data warehouses, a rule of thumb is that storage is cheap, processing is expensive. Therefore it may make a lot of sense to pre-join your data models into One Big Table: process everything you need once, store the results for later.

This is easiest to think about in the case of the star schema. Let us say you have a fact table, sales, linked to dimension tables customers, products, and dates.

A blue square labelled ‘Sales’ is surrounded by three red squares labelled customers, products, and dates. Arrows point from these to the central blue square.

You could create a SQL query to join all of these into One Big Sales Table:

-- models/obt/obt_sales.sql
-- Pre-join the sales fact with all of its dimensions into One Big Table

SELECT *
FROM {{ ref('sales') }} AS s
RIGHT JOIN {{ ref('customers') }} USING (customer_id)
RIGHT JOIN {{ ref('products') }} USING (product_id)
RIGHT JOIN {{ ref('dates') }} AS d ON s.sale_date = d.calendar_date

I have written this in the style of a dbt model.

Note the (somewhat unusual) use of a RIGHT JOIN. This is a convention I picked up from our BI tool:

  • the tool encourages us to think of fact tables — such as sales — as the principal table from which we select data, onto which dimensions will be joined (if needed — a simple COUNT can be done without joins)
  • we can expect (if we’ve done our modelling properly) that there will never be any rows in our fact table that relate to nonexistent rows in our dimension tables
  • however, there may be rows in our dimension tables that are not linked to any rows in the sales table (some products could be very unpopular and have no sales)

We could use an INNER JOIN (and then every row would represent a sale, complete with all associated information) … but what if we want to use our OBT model to compute the total value of sales per product?

SELECT
    product_id,
    IFNULL(SUM(sale_value), 0) AS total_value
FROM obt_sales
GROUP BY product_id
ORDER BY total_value

Because we used RIGHT JOIN to construct our OBT model, some rows have dimension attributes, but NULL sales measures. We can use the query above to reveal which products have zero sales value associated with them. It all depends on your use cases for the model!

You can hopefully see that OBT models, though not as flexible as the more fundamental facts and dimensions, offer convenience and safety.

The downsides become clearer when you try to construct an OBT model for a chasm schema.

Let us imagine we have two fact tables, sales and reviews. Both fact tables are linked to the dimension tables customers and products. That is to say:

  • customers can buy products
  • customers can review products
Two red squares (dimensions) labelled Customers and Products at the top and bottom, two blue squares (facts) labelled Sales and Reviews on the left and right. Arrows connect each dimension to each fact.
A diagram displaying the chasm trap schema just described.

Perhaps we want to answer questions like ‘do people with repeat purchases leave better reviews?’ In order to construct a single table that correctly exposes this collection of dimensions and facts, we’d need to ensure it contained all information on sales and reviews.

However, it’s not clear that a review can be linked directly to a sale. Indeed, your company may allow someone to buy the same product multiple times, but to review it only once: the link only exists via the dimension tables.

There are different ways to handle this. One would be:

  • for every combination of customer and product, create a row
  • join aggregated sales data (derived from any ‘sale’ facts that exist for that combination of customer and product)
  • join reviews data (derived from any ‘review’ fact that exists for that combination of customer and product)

You can see that this particular choice depended closely on the business logic.
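Here’s a minimal sketch of that approach as a dbt-style model (BigQuery-flavoured SQL; note that I only generate the customer/product pairs that appear in at least one fact, rather than taking a full cross join of the two dimensions, and I assume each review carries a single score column):

-- models/obt/obt_customer_product.sql

WITH pairs AS (
    -- every customer/product combination seen in either fact
    SELECT customer_id, product_id FROM {{ ref('sales') }}
    UNION DISTINCT
    SELECT customer_id, product_id FROM {{ ref('reviews') }}
),

sales_agg AS (
    -- aggregate the sales fact to the customer/product grain
    SELECT
        customer_id,
        product_id,
        COUNT(*) AS n_sales,
        SUM(gross_value) AS total_value
    FROM {{ ref('sales') }}
    GROUP BY customer_id, product_id
)

SELECT
    customer_id,
    product_id,
    IFNULL(n_sales, 0) AS n_sales,
    IFNULL(total_value, 0) AS total_value,
    r.score AS review_score  -- at most one review per pair, per the business rule
FROM pairs
LEFT JOIN sales_agg USING (customer_id, product_id)
LEFT JOIN {{ ref('reviews') }} AS r USING (customer_id, product_id)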

If instead customers can write multiple reviews of a single product, we could also aggregate our reviews data before joining. But aggregation loses information. Will you store the average score? The maximum score? The most recent score?

Welcome to the date dimension

There is another approach I want to mention at this point.

It is usual in data modelling to consider a date to be a dimension. It has the right properties:

  • fact tables will often have a date (or timestamp) column
  • each date is a unique, immutable identifier for a day
  • each day has attributes that we could list (for example, what day of the week is it? In which calendar quarter does it lie?) and use for filtering, grouping etc

It may very well make sense to create a ‘calendar table’, treating the date as the primary key (and any date columns in fact tables as references to this key). Here’s an example the good people at Count have provided, which you could use yourself.
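For a flavour of what such a model contains, here’s a minimal BigQuery-flavoured sketch (the date range is arbitrary):

-- models/dimensions/dates.sql
-- A minimal date dimension (calendar table)

SELECT
    calendar_date,  -- the date itself serves as the primary key
    EXTRACT(YEAR FROM calendar_date) AS year,
    EXTRACT(QUARTER FROM calendar_date) AS quarter,
    FORMAT_DATE('%A', calendar_date) AS day_of_week,
    EXTRACT(DAYOFWEEK FROM calendar_date) BETWEEN 2 AND 6 AS is_weekday
FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2015-01-01', DATE '2030-12-31')) AS calendar_date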

In the case just discussed, it may well be that sales and reviews are additionally related via a date. This would look like the first example of a chasm schema from earlier in the article:

A blue square labelled Sales is on the left, and another labelled Reviews is on the right. Three red squares form a vertical line between the two blue squares, dividing them. The red squares are labelled Customers, Products, and Dates. Two arrows start at each red square, pointing at each of the blue squares.

In that case, we could build an OBT model by creating combinations of customers, products, and dates. Such a model avoids aggregating all the data … at least, it avoids aggregating it so much that we lose important information. We would describe the resulting table as being at the customer/product/daily grain. This is a more ‘fine-grained’ table than the version at the customer/product grain.

This table does what is required. However, the drawbacks are clear:

  1. such tables could end up being very large, especially if they become more fine-grained
  2. we still needed to aggregate the table at some level — e.g. if two sales of the same product are made to the same customer on the same day, we’d need to aggregate those sales. We could go to the hour (or even minute!) grain, but that makes problem (1) worse.

All in all, perhaps it would be better just to stick with separate fact and dimension tables in this case.

This does highlight the slightly odd status of the date dimension:

  • many BI tools don’t require you to bring your own date dimension table in order to work properly — after all, the ‘foreign key’ for a date is usually just the date (2023-01-05 is a compact, immutable ID for a day), and any BI tool worth its salt can derive useful aggregations and filters (e.g. by month, ‘business day’ etc) from that information alone
  • many fact tables contain a timestamp column with very high resolution (e.g. to the microsecond). It would be ridiculously impractical to create a ‘microsecond dimension table’, so in this case you can either convert the timestamp to a date (you might ask yourself whether business intelligence use cases need higher resolution — this depends on the business!) or simply leave it as-is, expecting your BI tool to handle it
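The conversion itself is simple (a BigQuery-flavoured sketch; the events table, column names, and time zone are illustrative):

-- Truncate a high-resolution timestamp to the daily grain
SELECT
    DATE(event_timestamp, 'Europe/London') AS event_date,
    COUNT(*) AS n_events
FROM events
GROUP BY event_date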

Conclusion

Data modelling is a big subject, but I hope this article provides something of a map for the start of your journey. To recap, I’ve covered:

  • the different components of data modelling (facts, dimensions, and bridges)
  • data model schemas (star schemas, chasm traps, ‘galaxy’ schemas) and the role of conformed dimensions
  • OBT models and the tradeoffs involved in creating them
  • some data modelling pitfalls that you’re likely to encounter

In the next article I’ll move on from data modelling theory. I will offer practical advice for teams just getting started with building a fit-for-purpose data infrastructure.
