5 Foundational Design Patterns for Data Modelling
Edit: 5 Situational Design Patterns for Data Modelling (part 2) is out!
Like there are software design patterns, there also exist data model design patterns. Unfortunately, data model design patterns are not nearly as well-known as software design patterns. One popular example of object-oriented design patterns is by the Gang of Four (GoF), often taught in undergraduate computer science courses. That is surprising because data modelling underpins fields like software design and database architecture. So you would expect their patterns to be more widely known.
Plenty of books and internet articles have been written on the topic of data modelling. But none of those that I have read was able to provide a succinct description of its patterns. Also, what some call “patterns” is in my eyes not generic enough. They, for example, prescribe solutions for managing users or building a product database. Of course, these are common challenges, but I wouldn’t call them patterns. Patterns should be universally applicable — even if you don’t have users or products.
So in this post, I present my attempt to distil recurring problems in data modelling on the most elementary level, and some possible solutions. I will keep these as generic as possible and should therefore be applied to any domain.
Preliminary
Data modelling
Before we go into the design patterns, I provide a quick refresher on data modelling fundamentals. This also helps to clarify the nomenclature and conventions that I will be using throughout this post. (Feel free to skip right to the patterns if you are in a rush, there is nothing controversial here).
To visualize the patterns I will use entity-relationship (ER) models. An ER model specifies the entity types and the possible relationship types between instances of those entity types.
An entity is something that exists independently. It can have one or more attributes, which are properties describing that entity. Each attribute has a name (lowercase) and sometimes a datatype (after :
) if relevant. An entity will usually have a primary key (abbreviated to PK) attribute, often called an identifier (ID), that is unique to that entity. The attribute values of an entity can change over time, but its identity (coded by its ID) stays the same. Just like you as a person will age, but you are still you.
When data modelling, we are more likely to talk about entities on a higher level in terms of entity types. These are classes of entities that all have the same attribute names but with possibly different values. For example, a person is an entity type with an age attribute. Every person has an age, but the number of years can vary.
Often the terms “entity” and “entity type” are used interchangeably. An entity type is frequently referred to as simply “entity”. I will do the same in this post.
Generic names, which should be renamed to specific concepts in your domain model, are indicated with angel brackets like “<name>”.
Entities can be associated with other entities through relationships. When talking about directed relationships, I refer to the two entities on both sides of the relationship using “subject” and object” or “parent” and “child” if there is an obvious hierarchy.
There are multiple types of associations that differ from each other semantically. You can have an aggregation association which implies a relationship where the child can exist independently of the parent. Think of a teacher (parent) and a student (child). And you can have a composite association which implies a relationship where the child cannot exist independently of the parent. Think of a house (parent) and room (child).
Another special type of association is specialization/generalization. It is called specialization when the children are subtypes of their parent. Its inverse, called generalization, is when the children are supertypes of their parent. This type of relationship has the following notation:
What is the difference between an association and a relationship you may be thinking? The way I see it, an association is a link between two things in the real world. A relationship is how the association is encoded in the data model.
When talking about relationships between entity types there are two other characteristics:
Relationship optionality specifies whether there is a mandatory or optional association between every entity of one type to at least one entity of the other type.
For example, there is a mandatory relationship between a leaf to the plant it grew from (because all leaves grow on a plant). But there is an optional relationship between a plant to fruit (because not all plants grow fruit).
Mandatory relationships are used for aggregate associations and optional relationships are used for composite associations.
Relationship cardinality specifies whether every entity of one type can be associated with one or many entities of the other type.
More frequently, people specify the cardinality of both sides of a relationship with these terms:
- One-to-one. For example, a motorcycle and its rider (though you could have two riders I guess)
- One-to-many. For example, a primary school teacher and their pupils (again, you can think of exceptions)
- Many-to-many. For example, car models and car parts
The optionality and cardinality characteristics are usually specified together. In this post I use crows foot notation:
One-to-one and one-to-many relationships are usually modelled using a foreign key (abbreviated to FK). This is an attribute containing another entity’s primary key. In the example below, the Pupil
entity has a teacher id
foreign key that links it to a single teacher.
Many-to-many relationships require an associative entity to cross-reference instances of one entity type to instances of another entity type. The associative entity has two foreign keys, linking to the primary keys of the entities it is associating. This creates two one-to-many relationships on either side that together make a many-to-many relationship.
In the example of car models and car parts, which have a many-to-many relationship, you can see how the Model part
associative entity cross-references them below.
A relationship can be modelled by adding a foreign key to your entity which points to the primary key of another entity.
The aforementioned relationships are all strict relationships. They are enforced by the model structure. Loose relationships are not enforced. They are useful when we want to model an association between two entity types without creating an explicit relationship between them. This is usually achieved by storing the entity type name and entity ID of the entity instance we wish to link to.
Loose relationships are useful when we have a large number of entity types that are associated with each other. A single entity type name and ID can link to many entity types. Their downside is that the integrity of a model containing loose relationships is difficult to maintain. When an entity is deleted, you need to make sure that all the loose relationships are deleted as well. For this reason, I generally advise against loose relationships, except for when the model would be too complicated without them.
One model concept you will see sparingly used in the patterns is the exclusive arc. This is visualised as an arc over multiple relationships in the ER diagram. It means that each entity instance can have only one of those relationships. In other words, only one foreign key is non-null and the others are null. In the example below this is used to model pets which can either be a cat or a dog.
What are patterns again?
Patterns are reusable solutions to common problems. Their benefits include:
- being tried-and-tested solutions that collective experience has shown to work well
- allowing for faster development
- providing a common language and jargon for describing a solution
- helping to create more consistent, and thus easier-to-understand, data models
So it’s clear they have a lot of advantages. But they have some requirements in order to be effective. Good patterns:
- are generic and can be applied anywhere in your model
- are composable and can thus be mixed and matched with other patterns to solve more complex problems
- do not repeat information because this helps maintain data integrity
- are database system-agnostic (however, I’ve used focused on relational/SQL databases because those are the most popular)
- are intuitive and easy to understand (I hope 😜)
Patterns
Now I am going to discuss the data modelling design patterns that I have drawn out of literature and my experience.
Even more than in software modelling, in data modelling there are multiple ways to solve a problem. There is not always a clear-cut solution. It can depend on (among others):
- the likelihood of the model’s need to change in the future
- your team’s experience
- the available time
- your company structure
- your tools
For that reason, I present multiple solutions to each problem. Each time I start with the simplest solution. These have a strict model structure that makes them easier to understand and less prone to errors. However, because they tend to solve a problem in only one place of the model (i.e. locally), you may need to repeat them many times. This can lead to a large model.
Then I discuss increasingly more flexible solutions. These have a model structure that is more generic and allows solving the problem in multiple places of the model at once (i.e. globally), now and in the future. However, they have the drawback of being more complicated to understand and implement correctly. That is because the business rules are less likely to be enforced by the model structure. So the business rules need to be enforced by application logic, which requires more testing as well.
Here is a comparison:
To denote the flexibility of the model I use 5 levels. L0 is the least flexible. L4 is the most flexible. The meanings are roughly:
- L0: solves the problem ad-hoc (this tends to be the simplest thing you can think of, though the inflexibility can be a major shortcoming)
- L1: better, but still some downsides
- L2: solves the problem in one place in a generic way so that future changes are easier to make
- L3: allows reapplying pattern in the same place of the model without changing the schema
- L4: allows reapplying pattern anywhere in the model without changing the schema
There will not always be 5 solution levels for every problem.
Note: as we are concerning ourselves here with logical data models and not physical models, implementation details for improving performance, storage size and other database-specific concepts are of lesser importance. You might find when applying these patterns in practice that you need to make some adaptations for implementing them successfully.
Type pattern
Problem: classify entities into a finite set of types (these can also be called “categories”, “classes”, and “statuses”).
L0 solution
The easiest thing to do is just add a <type>
attribute of datatype varchar
(i.e. variable-length string) to the subject. This attribute contains the type of entity as a hardcoded string.
This model obviously does not constrain all the possible values of <type>
. so you need to enforce this during data entry. Also, there is duplication — if you want to change the name of a type, then you need to remember to update all the entities with that type.
L1 solution
Some databases have support for the enum
datatype. An attribute with this datatype cannot be assigned any value which is not in the set of allowed values.
This solves the problem of duplication on the database level. However, a modification (e.g. addition, removal, change of name) of a type now requires a schema migration.
L2 solution
This solution eliminates duplication, just like L1, while making it easier to change the allowed types. The type has been extracted into its own <Type>
entity. Assigning the type means setting the <type> id
foreign key on your <Subject>
to the correct identifier. Inserting, updating, or removing types can be easily done by modifying the <Type>
entities.
Another advantage over the L1 solution is that additional information can be tied to the <Type>
like extra attributes or relationships with other entities. You could even build a hierarchy of types if you wanted!
A downside is that another entity type needs to be added.
L3 solution
Now imagine you’d like to apply multiple types to the same subject. These types are independent from each other. In that case, you could add a many-to-many <Subject type>
associative entity.
It is now possible for the subject to have any number of types.
Unfortunately, what you lose in this model is the ability to enforce which combinations of types are possible for the subject. That now needs to be controlled by the program or user.
L4 solution
The L4 solution allows assigning a type to any existing entity in your model. It achieves this by creating a loose relationship to <Type>
by specifying the subject entity type name and entity ID in <Subject type>
. To make it possible to assign any number of types to an entity, you could make the right side of <Subject type>
a loose relationship as well.
As mentioned before, the downside of loose relationships is that data integrity is more difficult to maintain. You cannot take advantage of some relational databases which can cascade the deletion of a row to everything that has a relationship with it.
Role pattern
Problem: assign a set of associations (i.e. link to objects) to entities.
The set of associations we call a role. A role represents the part that an actor or a group of actors plays within a system. The meaning of an actor can be taken quite broadly — it is anything that can do something autonomously. Usually, it is a person. But it can be a piece of software as well, like an automated background process.
L0 solution
The simplest thing to do here is just assign the associations to your subject directly through foreign keys.
However, this does not enforce that a set of associations are always assigned together. For this, we need a role entity.
L1 solution
Here we explicitly model a <Role>
entity that has the associations that we’d like to assign as a unit. It has a link to the subject through the <subject> id
foreign key.
However, this approach has the downside that we cannot reuse a role. A role can only be used once. This duplication will lead to problems if we want to change a role that is been applied to many subjects at once.
L2 solution
For a role to be reused, we change the foreign key relationship to be coming from the <Subject>
. Now, a subject can be bound to a role by setting the <role> id
foreign key.
A big drawback of this model is that every entity can only have one role. Sometimes that’s okay. But in the more complex systems, it’s often not.
Also, if you would like to change the associations that a role can have (e.g. add <Object 3>
, then you need to modify the schema.
L3 solution
This more advanced solution allows binding roles to any number of subjects easily. In addition, roles can have any number of associations which change be changed easily.
Even though this model is more flexible, it also exhibits a higher degree of generalization. That is a good thing if your team and tools can support the many-to-many relationships it requires. But otherwise, make sure you are not over-engineering it.
L4 solution
In order to allow binding a role to any subject entity, we can do the same as in L3 but use a loose relationship to the subject just like in Type pattern L4 solution. But remember, it also suffers from the same disadvantages as described there.
Likewise, there can be a need to link a role to any association. This can also be solved using loose relationships.
Context pattern
Problem: add more information to existing relationships.
For example, you would like an extra attribute, type (see Type pattern), relationship, or a whole role (see Role pattern). That is when the Context pattern comes in handy.
L0 solution
Like before, the easiest thing to do is to add the context straight to the entity where it is needed. In this case, we place the information regarding the subject-object relationship on the object itself in the <subject context>
attribute.
This requires the least upfront work but has the downside that the context is not explicitly related to the relationship to the subject. So whoever changes the <subject> id
foreign key, must remember to change the <subject context>
as well.
L1 solution
The problem in the previous solution can be solved by adding the context directly to the associative entity that links the object to the subject.
If there is a many-to-many relationship, then there will be an associative entity already. If not, then a new associative entity needs to be added.
It is now much clearer that the context is specific to the subject-object association. If the associative entity is dropped, the context is dropped together with it.
However, there is still some duplication here. A context cannot be shared between multiple subject-object associations.
L2 solution
This solution allows reusing the context, something that wasn’t possible in the L1 solution.
The context is extracted into its own entity. A foreign key is added to the <Subject object>
entity to make use of the context.
This solution will work for most people. However, sometimes you want even more flexibility and assign multiple contexts to the same subject-object association.
L3 Solution
This solution is more flexible than L2. A many-to-many association has been added between <Subject object>
and <Context>
.
While this solution is very flexible, it is also less straightforward to understand and implement. Information on the subject-object association is now stored in three entities (<Subject object>
, <Subject object context>
, and <Context>
). So you always need to do at least two joins to query it.
L4 solution
An L4 solution for adding context to any existing relationship is possible (using loose relationships, see Type pattern L4 solution) but I wouldn’t normally recommend it. The model will become very abstract. Also, you need to keep in mind that every relationship could then have additional context information stored in a separate entity without an strict relationship linking to it.
Hierarchy pattern
Problem: define a tree structure where every parent has a one-to-many relationship to its children.
This leads to a pyramid structure that consists of levels. An important characteristic is that the associations between the entities in adjacent levels is of the same semantic nature. For example:
- <parent> owns <children>
- <parent> is boss of <children>
- <parent> is derived from <children>
- <parent> is composed of <children>
- <parent> is generalization of <children>
An organizational structure is a typical example of a hierarchy. There is a single CEO, under which there are multiple senior executives, under which there might be directors, managers, and finally individual contributors.
Another typical example is a taxonomy. A taxonomy is a hierarchy of types.
L0 solution
There is no L0 solution for this problem.
L1 solution
The simplest way to satisfy the requirements of a hierarchy is to add one-to-many relationships between the entities in adjacent levels.
This solution will actually work well in many cases. It is easy to understand as there are clear levels. Also, it enforces the notion that parents can only be directly related to their children and the children to their grandchildren.
To add more information to the relationships, the Context pattern can be used. That does, however, not take away that there is a bit of duplication here. If there is a similar context for all the relationships between the levels, that will have to be repeated.
L2 solution
The problem of duplicated information in the L1 solution can be solved using recursion. Using a self-relationship we can say there is now a recursive relationship between the entities.
The levels are not clear anymore — you need to count the number of ancestors in order to find an entity’s level. Also, the levels are not static. An entity can have any other entity as a parent. This was different in the L1 solution. This can be either an advantage or a dealbreaker, depending on how fluid your structure needs to be.
In addition, you may not wish to model all entities the same. Your tree might be heterogenous and so entities might have different attributes from each other.
Finally, the self-relationship adds a cycle to the model. Not all technologies can handle this trivially. So that needs to be taken into account as well.
L3 solution
The L2 solution can be a nice and clean approach if you need just one hierarchy. But what if you plan on changing, adding, and removing hierarchies? This L3 solution can handle those mechanisms without the need for schema changes. It is achieved by adding a new associative entity, called <Descendant>
, which stores all the parent-child relationships. It also has a <hierarchy> id
attribute to distinguish which hierarchy it belongs to. Information for each hierarchy can be attached to this new ID.
This solution is much more flexible. However, the tree structure is less clear. Also, you need to make sure a child can only have a single parent in each hierarchy. Some technologies allow adding a uniqueness constraint on(<hierarchy> id, <child> id)
which guarantees this. But if that’s not possible, then you need to enforce this constraint through other means instead.
L4 solution
Building hierarchies out of arbitrary entities is possible using a loose relationship to the parent and the child.
Though I would only recommend this if your desired hierarchy consists of many disparate entities. Linking them up with s relationships would require a large schema change. However, in this case, with loose relationships you only need to add the <Descendant>
entity. But as always, be careful with loose relationships.
Network pattern
Problem: define a web structure where any two entities can have an association between them.
A social media network is a typical example. Any two persons can be friends.
Instead of talking about parents and children as in the Hierarchy pattern, we will use the term “peer”. That is because in a network there are no levels anymore and so all entities are peers of each other.
L0 solution
There is no L0 solution for this problem.
L1 solution
We can simply create a network structure by adding an associative entity for each pair of entities that may have an association between them.
For a network of n entity types, (n*(n-1))/2 associative entity types are required. For a small number of entity types, this might be okay. But for large heterogenous networks, this doesn’t scale. Also, by storing all the peer-to-peer relationships in separate associative entities, there is a lot of duplication. Imagine having to add a new attribute to all peer-to-peer relationships. That will be a lot of work!
L2 solution
A massively simplified network structure can be created if all entities and peer-to-peer relationships between them can be modelled the same.
But that’s a big if. It assumes that we are working with a homogenous network where all peers are of the same type. However, it’s not uncommon for there to be schematic differences between peers. If so, you can either stick with L1, or make the type, properties and associations of an entity flexible (spoiler: in part 2 I describe how).
Like with other L2 solutions, adding multiple separate networks requires schemas changes. To make this easy we need an L3 solution.
L3 solution
This solution is very similar to the L2 solution. The difference is the added <network> id
attribute that distinguishes to which network each peer-to-peer relationship belongs (see Hierarchy pattern L3 solution for an explanation of the benefits).
L4 solution
This solution is just like the Hierarchy pattern L4 solution and has the same advantage and disadvantages.
Conclusion
If you have experience with data modelling, these patterns might look obvious to you. Likely you have already applied some of them without being aware of it. But then, my goal was not to show things you techniques you have never seen. The objective was to provide a structured overview of the available patterns. The patterns have been untangled, broken down, and named. Now they are easy to reuse — just look up your problem, choose your desired flexibility level, and find a solution.
The patterns presented in this part are the simplest possible building blocks in your data models. In part 2 I present advanced patterns that are applicable in more niche situations. They go by jazzy names such as Property pattern, Shapeshifter pattern, Temporal pattern, Changelog pattern, and Party pattern.