sql

When building our applications, we often try to identify some problem and transform it into a programming language by creating appropriate data structures and relations between them. Then, when we want to store that data in an SQL database, things sometimes get tricky. In this post, I want to show three ways we can represent real-life inheritance in an SQL database.

In general, there are three most basic ways of representing two types that are in the same inheritance line, ie. one extends another. In our examples, I am using PostgreSQL database. In order to run it using docker, you need to run this command:

docker run --name some-postgres -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 -d postgres

Postgres image in various DB versions are available on Docker Hub.

In our example, we'll want to create a Player and NBAPlayer objects, represented in Golang as:

type Player struct {
    ID      int
    Name    string
    Club    string
    Country string
}

type NBAPlayer struct {
    Player
    PointsScored int
}

Note that by having Player inside NBAPlayer struct we indicate that the latter has all the attributes present in the former.

One common table

The fastest approach would be to create one table that will store data from both parent and child entities:

CREATE TABLE player (
    id SERIAL PRIMARY KEY NOT NULL,
    name VARCHAR(255) NOT NULL,
    type VARCHAR(10) NOT NULL, --we need that!
    club VARCHAR(100) NOT NULL,
    country CHAR(2) NOT NULL,
    points_total INT --can be null
);

Note that using this approach requires us to add some attribute that would allow us to filter out all non-NBAPlayers. In order to do that we add type column that will store eg. NBA as one of its values.

This approach is nice, easy and fast. Selecting all players is obvious, as well as all NBA players:

postgres=# SELECT * FROM player;
 id |        name        |...| points_total 
----+--------------------+---|-------------
  1 | LeBron James       |...|        28787
  2 | Kevin Durant       |...|        19121
  3 | Robert Lewandowski |...|             
  4 | Marcin Gortat      |...|         7028
(4 rows)

postgres=# SELECT id, name, points_total FROM player WHERE type = 'NBA';
 id |     name      |...| points_total 
----+---------------+---|--------------
  1 | LeBron James  |...|        28787
  2 | Kevin Durant  |...|        19121
  4 | Marcin Gortat |...|         7028
(3 rows)

What is obvious is that our one single non-NBA Player will not have any points_scored, so his value there would be NULL. This is a bit of a problem because this value should not be nullable in our application, but for the sake of database schema, we must allow it. When we expand our model by adding other NBA-specific columns (and later on adding FIFAPlayer with some attributes of its own) we will end up with a huge table with lots of columns and even more NULL values.

Another downside of this approach is connected to our type column, as we are now required to add it to each and every insert:

INSERT INTO player(name, type, club, country, points_total) VALUES ('Cristiano Ronaldo', 'FIFA', 'Real Madrid', 'PT', NULL);

I also added explicit NULL value for points_total to show how our application might behave when we want to insert a full row every time.

Separate tables

The second approach, also quite fast and simple is to create two separate table, so that we can avoid type column and NULL values. We would not, however, add non-NBA players to both tables. Just separate them as we did with Golang structs:

CREATE TABLE player (
    id SERIAL PRIMARY KEY NOT NULL,
    name VARCHAR(255) NOT NULL,
    club VARCHAR(100) NOT NULL,
    country CHAR(2) NOT NULL
);

CREATE TABLE nba_player (
    id SERIAL PRIMARY KEY NOT NULL,
    name VARCHAR(255) NOT NULL,
    club VARCHAR(100) NOT NULL,
    country CHAR(2) NOT NULL,
    points_total INT NOT NULL -- can be nullable now
);

Now when we insert players, we don't need to add type as a column, but make sure we put them in a correct table:

INSERT INTO player(name, club, country) VALUES ('Cristiano Ronaldo', 'Real Madrid', 'PT');
INSERT INTO nba_player(name, club, country, points_total) VALUES ('Pau Gasol', 'San Antonio Spurs', 'ES', 20001);

So as you can see, we haven't saved a lot with this move, but it's still easy to do it on the code level, so we don't worry too much about that. Selecting all NBA players is also very simple:

# SELECT * FROM nba_player;
 id |     name      |         club          | country | points_total 
----+---------------+-----------------------+---------+--------------
  1 | LeBron James  | Cleveland Cavaliers   | US      |        28787
  2 | Kevin Durant  | Golden State Warriors | US      |        19121
  3 | Marcin Gortat | Washington Wizards    | PL      |         7028
  4 | Pau Gasol     | San Antonio Spurs     | ES      |        20001
(4 rows)

But when we want to see all players, we must ask our SQL database to UNION the results, since they are in separate places. Not only that, but we need to list the columns explicitly, to make sure they match and can be displayed in the same table:

# SELECT id, name, club, country FROM player 
UNION 
SELECT id, name, club, country FROM nba_player;
 id |        name        |         club          | country 
----+--------------------+-----------------------+---------
  3 | Marcin Gortat      | Washington Wizards    | PL
  2 | Kevin Durant       | Golden State Warriors | US
  2 | Cristiano Ronaldo  | Real Madrid           | PT
  1 | LeBron James       | Cleveland Cavaliers   | US
  4 | Pau Gasol          | San Antonio Spurs     | ES
  1 | Robert Lewandowski | Bayern Munich         | PL
(4 rows)

As you can see, we don't get points_total but that's OK (after all we wanted to list all players, so we mustn't be looking for points values), but we might be worried more about that we have a duplicate in id column...

Also, think how much effort it would require if you want to alter club column eg. to allow it to contain longer strings? That's right, you would have to do it in both tables.

Reference table

The third approach is more difficult but is probably the most popular when it comes to data duplication and consistency. It requires some more work when it comes to inserting and selecting data, but it's fine - we are paid to do that, right?

What we want to do is create some abstract (but not really in OOP sense) object that will store all base information for each player. Then, in a separate table we'll reference that base by its id, therefore, in the end, our result will combine data from those two places and result in one row per player. For the end user it will not make any difference (duh!), and underneath we will have just one place where we define common attributes:

CREATE TABLE player (
    id SERIAL PRIMARY KEY NOT NULL,
    name VARCHAR(255) NOT NULL,
    club VARCHAR(100) NOT NULL,
    country CHAR(2) NOT NULL
);
CREATE TABLE nba_player (
    player_id INT REFERENCES player (id),
    points_total INT NOT NULL
);

Now it gets very tricky as we need to use some PostgreSQL magic to make it all work and be consistent. First, we need to use RETURNING clause on INSERT command to get the ID of a row inserted into player column. Then its result must be used as WITH ... AS, so that it can actually be reused. Finally, we SELECT that id, alongside with any hard-coded (or inserted by our application) values to populate nba_player table:

WITH new_player AS (
    INSERT INTO player(name, club, country) VALUES ('Pau Gasol', 'San Antonio Spurs', 'ES') RETURNING id
)
INSERT INTO nba_player(player_id, points_total) 
SELECT id, 20001 FROM new_player;

With a regular player, we can just insert the base data without worrying about anything:

INSERT INTO player(name, club, country) VALUES ('Cristiano Ronaldo', 'Real Madrid', 'PT');

Finally, when it comes to selecting data, we have an easy way to see all players:

# SELECT * from player;
 id |        name        |         club          | country 
----+--------------------+-----------------------+---------
  1 | LeBron James       | Cleveland Cavaliers   | US
  2 | Kevin Durant       | Golden State Warriors | US
  3 | Robert Lewandowski | Bayern Munich         | PL
  4 | Marcin Gortat      | Washington Wizards    | PL
  5 | Cristiano Ronaldo  | Real Madrid           | PT
  6 | Pau Gasol          | San Antonio Spurs     | ES

But when it comes to all NBA players, we need to join results from two places:

# SELECT * FROM player JOIN nba_player ON player.id = nba_player.player_id;
 id |     name      |...| country | player_id | points_total 
----+---------------+---+---------+-----------+--------------
  1 | LeBron James  |...| US      |         1 |        28787
  2 | Kevin Durant  |...| US      |         2 |        19121
  4 | Marcin Gortat |...| PL      |         4 |         7028
  6 | Pau Gasol     |...| ES      |         6 |        20001

As you can see, we got the id twice (under player_id as well) but it can be cleared when we list all the columns we are interested in.

Summary

The same thing can be achieved in many different ways. If you want a quick solution, you will probably go with either of the first two. In my opinion, however, the third approach is the best even though it requires us a bit more effort. On the other hand, we still need to build those INSERT and SELECT calls once in our code, so if this is harder at the beginning it wouldn't worry me that much. In the long run, saving time on table alterations or removing data inconsistencies might be harder to do.