Pre-loading a PostgreSQL Docker Container
PostgreSQL is a powerful relational database management system that can be easily downloaded and installed from its official image on DockerHub using Docker. However, it’s not so straightforward to pre-load your own data. This blog post is about preparing a derivative of the base PostgreSQL Docker image that’s preloaded with your own database and pushing it back to DockerHub for redistribution.
Prerequisites
I'm going to assume you have a modern version of Docker running. I'm using Docker Desktop for Mac. I've also installed PostgreSQL using brew install postgresql, which puts a suite of command line utilities on your system, including createdb, which I use in the middle of this tutorial. I usually use fish, but the following instructions are given in Bourne-again shell (bash) syntax.
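If you want to confirm the prerequisites before starting, a quick sanity check looks something like this:

# confirm the Docker daemon is running and reachable
$ docker info
# confirm the PostgreSQL command line utilities from Homebrew are on your path
$ which createdb
$ createdb --version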
Notes
Throughout this post, I'll shorten PostgreSQL to postgres. All the commands in this tutorial are run from the shell of the host system, i.e., I did not ssh or exec into the Docker container itself to run them.
Run the base image
The first step is to run a base image. This could be as simple as docker run postgres, but there are a few options to add that make the rest of this process simpler. I wrote this post while preparing the Docker image for the Biolookup Service, so biolookup appears many times as names/passwords throughout. I'll definitely be writing another post soon about what that service is and what it does (think Ontology Lookup Service, but not just restricted to ontologies).
$ docker pull postgres
$ docker run \
-p 5434:5432 \
--name postgres-biolookup \
--detach \
-e POSTGRES_PASSWORD=biolookup \
-e PGDATA=/var/lib/postgresql/pgdata \
--shm-size 1gb \
postgres
-p / --publish
This takes an argument looking like <X>:<Y>. The <Y> corresponds to the port inside the Docker container, and the <X> corresponds to what's visible outside. I'm mapping from the default postgres port inside the container (i.e., 5432) to a non-default one outside (i.e., 5434) to avoid conflict with my local installation of postgres.

--name
This gives a nice name to the container for lookup later. It doesn't have to be the same as the name you give when you push to DockerHub, but it's probably better to stay consistent. If you don't give one, Docker assigns a silly name for you.

-d / --detach
Rather than running in my current shell, this backgrounds the container. Since I used --name, I can look up my container directly using $(docker ps --filter "name=postgres-biolookup" -q).

-e / --env
This allows you to specify environment variables.

- Setting POSTGRES_PASSWORD explicitly sets the password for the default postgres user (named postgres).
- Setting PGDATA ensures that a docker commit will actually persist the database's content. The cryptic path that comes after it is just the standard path postgres uses. Move along.

--shm-size
By default, the shared memory is 64 MB. When loading up this big database, that caused the following crash:

...
  File "/Users/cthoyt/dev/pyobo/src/pyobo/database/sql/cli.py", line 52, in load
    _load(
  File "/Users/cthoyt/dev/pyobo/src/pyobo/database/sql/loader.py", line 65, in load
    _load_definition(engine=engine, table=defs_table, path=defs_path, test=test)
  File "/Users/cthoyt/dev/pyobo/src/pyobo/database/sql/loader.py", line 97, in _load_definition
    _load_table(
  File "/Users/cthoyt/dev/pyobo/src/pyobo/database/sql/loader.py", line 311, in _load_table
    cursor.execute(sql)
psycopg2.errors.DiskFull: could not resize shared memory segment "/PostgreSQL.1699521131" to 67128576 bytes: No space left on device

Luckily, StackOverflow had me covered and suggested increasing the shared memory to 1 GB using --shm-size.
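Before moving on, it's worth checking that the container actually came up. Since it was started with --detach, the logs don't stream to your shell, but the standard Docker CLI can still get at them by name:

# list the running container by the name given above
$ docker ps --filter "name=postgres-biolookup"
# follow the postgres logs; wait for "database system is ready to accept connections"
$ docker logs -f postgres-biolookup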
Create the database
Creating the database on the already running postgres Docker container is a bit more straightforward:
$ PGPASSWORD=biolookup createdb -h localhost -p 5434 -U postgres biolookup
PGPASSWORD=biolookup sets the password in the environment when this command gets run, so there's no need to manually interact with it. -h is for the host, -p is for the port, and -U is for the username. -e can optionally be added to echo the commands that are run, for debugging. The final part, biolookup, is the name of the database that gets created.
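If you want to double-check that the database was actually created, the psql client that comes with the same Homebrew installation can list the databases on the containerized server (same host, port, and credentials as above):

# list all databases on the server running inside the container
$ PGPASSWORD=biolookup psql -h localhost -p 5434 -U postgres -l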
Load the database
The motivation for this post was to prepare a rather large database for the
Biolookup Service, which contains hundreds of millions of identifiers, names,
definitions, and alternative identifiers for biomedical entities. I’ve written
previously about building this database,
which after an incredible effort boils down to running pyobo database build
from the shell. It also automatically uploads its contents to Zenodo:
Data | DOI
---|---
Names |
Definitions |
Alternative Identifiers |
Species |
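For reference, that build step boils down to something like the following; you don't need to run it yourself, since the loader below pulls the pre-built artifacts from Zenodo:

$ python -m pip install pyobo
$ pyobo database build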
Loading the database requires the
biolookup
Python package. It
automatically downloads the data from the latest releases on Zenodo if not
available locally, then puts it in the database.
$ python -m pip install biolookup
$ biolookup load --uri postgresql+psycopg2://postgres:biolookup@localhost:5434/biolookup --test
The --test flag makes the loader only insert 100K records instead of hundreds of millions. For building a real database, remove it.
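Once the load finishes, a quick way to spot-check what landed in the database is to list the tables and their sizes with psql (I'm not naming specific tables here, since those are defined by the biolookup schema):

# list the tables in the biolookup database along with their sizes
$ PGPASSWORD=biolookup psql -h localhost -p 5434 -U postgres -d biolookup -c '\dt+'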
Commit and push to DockerHub
Running docker commit computes the difference between the base image and the current state of the container. Because of the -e PGDATA=..., it also captures the newly added data. For the Biolookup Service, the image has gone from about 300 MB to almost 40 GB, so be patient. I went and made breakfast while this was happening, and it was done by the time I came back. For reference, it was a Saturday morning American breakfast.
$ docker commit \
-a "Charles Tapley Hoyt <cthoyt@gmail.com>" \
-m "Added biolookup schema and data" \
$(docker ps --filter "name=postgres-biolookup" -q) \
biopragmatics/postgres-biolookup:latest
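Before pushing, you can sanity-check that the commit captured the data by looking at the size of the newly tagged image:

# the committed image should be dramatically larger than the ~300 MB base
$ docker images biopragmatics/postgres-biolookup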
After committing, it’s time to push to DockerHub. You might need to do
docker login
before this. The name of the image takes the form
<organization>/<name>[:<tag>]
. Make sure you push to an organization that you
have rights to, and the tag (the part after the colon) is optional.
$ docker push biopragmatics/postgres-biolookup:latest
You can check that it was uploaded properly on its DockerHub page.
Run locally
Since the biolookup web application is automatically installed with PyOBO and the database is now built locally, you can test it with:
$ biolookup web --sql --uri postgresql+psycopg2://postgres:biolookup@localhost:5434/biolookup
Run with Docker Compose
You can use the following configuration as a docker-compose.yml
file to
orchestrate the pre-loaded database with the front-end web application (more
information on that here):
version: "3"
services:
app:
image: biopragmatics/biolookup:latest
environment:
BIOLOOKUP_SQLALCHEMY_URI: postgresql+psycopg2://postgres:biolookup@database/biolookup
restart: always
ports:
- "8765:8765"
depends_on:
- database
database:
image: biopragmatics/postgres-biolookup:latest
ports:
- "5432:5432"
You can run this with docker-compose up --detach
. When you do this, you need
to be patient (1-5 minutes) for the database to start up before making requests
from the web application. If you didn’t detach when running docker-compose,
postgres will actually log when it’s ready. If you try making a request before
it’s done starting, you’ll probably get an error message that looks like this:
Traceback (most recent call last):
...
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 605, in __connect
connection = pool._invoke_creator(self)
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 578, in connect
return dialect.connect(*cargs, **cparams)
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 584, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/usr/local/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: the database system is starting up
If you want to be more correct about this, you can use pg_isready to check the connection status of a PostgreSQL server, but I think it's practical enough just to wait a few minutes.
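For reference, a manual readiness check against the composed database looks like the first command below. If you'd rather bake the wait into the orchestration, adding a healthcheck to the database service is one option; the healthcheck block here is my own addition and isn't part of the configuration above:

# check whether the composed database is accepting connections
$ pg_isready -h localhost -p 5432 -U postgres -d biolookup

# optional: in docker-compose.yml, under services:
database:
  image: biopragmatics/postgres-biolookup:latest
  ports:
    - "5432:5432"
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U postgres -d biolookup"]
    interval: 10s
    timeout: 5s
    retries: 30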
Next Steps
My next steps are to figure out the best way to automate the first three steps (running the base image, creating the database, and loading the database) then hopefully do it in an automated setting in GitHub Actions.
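As a very rough sketch (untested, and reusing all of the same names, ports, and passwords from above), those first three steps chain together into a single shell script something like this:

#!/bin/bash
# Untested sketch of automating the first three steps: run the base image,
# create the database, and load it. The docker commit/push from above would follow.
set -euo pipefail

docker pull postgres
docker run \
    -p 5434:5432 \
    --name postgres-biolookup \
    --detach \
    -e POSTGRES_PASSWORD=biolookup \
    -e PGDATA=/var/lib/postgresql/pgdata \
    --shm-size 1gb \
    postgres

# wait until the containerized postgres accepts connections
until pg_isready -h localhost -p 5434 -U postgres; do sleep 2; done

PGPASSWORD=biolookup createdb -h localhost -p 5434 -U postgres biolookup
python -m pip install biolookup
# drop --test for a full build
biolookup load --uri postgresql+psycopg2://postgres:biolookup@localhost:5434/biolookup --test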
Special thanks to Ben Gyori for outlining how to do this and getting me going in the right direction. Thanks to Scott Colby for feedback and pro docker tips.