Pre-loading a PostgreSQL Docker Container
PostgreSQL is a powerful relational database management system that can be easily downloaded and installed from its official image on DockerHub using Docker. However, it’s not so straightforward to pre-load your own data. This blog post is about preparing a derivative of the base PostgreSQL Docker image that’s preloaded with your own database and pushing it back to DockerHub for redistribution.
Prerequisites
I'm going to assume you have a modern version of docker running; I'm using Docker Desktop for Mac. I've also installed PostgreSQL with brew install postgresql, which provides a suite of command line utilities, including createdb, which I use in the middle of this tutorial. I usually use fish, but the following instructions are given in Bourne-again shell (bash) syntax.
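If you want to confirm that both are available before starting, the standard version checks are enough:
$ docker --version
$ createdb --version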
Notes
Throughout this post, I'll shorten PostgreSQL to postgres. All the commands in this tutorial are run from the shell of the host system, i.e., I did not ssh or exec into the Docker container itself to run them.
Run the base image
The first step is to run a base image. This could be as simple as docker run postgres, but there are a few options to add that make the rest of this process simpler. I wrote this post while preparing the Docker image for the Biolookup Service, so biolookup appears many times as names/passwords throughout. I'll definitely be writing another post soon about what that service is and what it does (think Ontology Lookup Service, but not restricted to just ontologies).
$ docker pull postgres
$ docker run \
-p 5434:5432 \
--name postgres-biolookup \
--detach \
-e POSTGRES_PASSWORD=biolookup \
-e PGDATA=/var/lib/postgresql/pgdata \
--shm-size 1gb \
postgres
-p / --publish: This takes an argument looking like <X>:<Y>. The <Y> corresponds to the port inside the docker container, and the <X> corresponds to what's visible outside. I'm mapping from the default postgres port inside the container (i.e., 5432) to a non-default one outside (i.e., 5434) to avoid conflict with my local installation of postgres.

--name: This gives a nice name to the container for lookup later. This doesn't have to be the same as the name you give when you push to DockerHub, but it's probably better to stay consistent. If you don't give one, docker assigns a silly name for you.

-d / --detach: Rather than running in my current shell, this backgrounds the container. Since I used --name, I can look up the container directly using $(docker ps --filter "name=postgres-biolookup" -q).

-e / --env: This allows you to specify environment variables.
- Setting POSTGRES_PASSWORD explicitly sets the password for the default postgres user (named postgres).
- Setting PGDATA ensures that a docker commit will actually persist the database's content. The cryptic path that comes after it is just the standard path postgres uses. Move along.

--shm-size: By default, the shared memory is 64mb. When loading up this big database, the default caused the following crash:

...
  File "/Users/cthoyt/dev/pyobo/src/pyobo/database/sql/cli.py", line 52, in load
    _load(
  File "/Users/cthoyt/dev/pyobo/src/pyobo/database/sql/loader.py", line 65, in load
    _load_definition(engine=engine, table=defs_table, path=defs_path, test=test)
  File "/Users/cthoyt/dev/pyobo/src/pyobo/database/sql/loader.py", line 97, in _load_definition
    _load_table(
  File "/Users/cthoyt/dev/pyobo/src/pyobo/database/sql/loader.py", line 311, in _load_table
    cursor.execute(sql)
psycopg2.errors.DiskFull: could not resize shared memory segment "/PostgreSQL.1699521131" to 67128576 bytes: No space left on device

Luckily, StackOverflow had me covered and suggested increasing the shared memory to 1gb using --shm-size.
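Before moving on, it doesn't hurt to check that the container actually came up. The standard docker ps and docker logs commands, using the name chosen above, are enough:
$ docker ps --filter "name=postgres-biolookup"
$ docker logs postgres-biolookup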
Create the database
Creating the database on the already-running postgres docker container is a bit more straightforward:
$ PGPASSWORD=biolookup createdb -h localhost -p 5434 -U postgres biolookup
PGPASSWORD=biolookup sets the password in the environment when this command gets run so there's no need to manually interact with it. -h is for host, -p is for port, and -U is for username. -e can be added optionally to show the commands that are run for debugging. The final part, biolookup, is the name of the database that gets created.
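To confirm the database was created, you can list the databases on the server with psql, which comes from the same brew formula as createdb:
$ PGPASSWORD=biolookup psql -h localhost -p 5434 -U postgres -l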
Load the database
The motivation for this post was to prepare a rather large database for the Biolookup Service, which
contains hundreds of millions of identifiers, names, definitions, and alternative identifiers for
biomedical entities. I’ve written previously about building
this database, which after an incredible effort boils down to running
pyobo database build
from the shell. It also automatically uploads its contents to Zenodo:
- Names
- Definitions
- Alternative Identifiers
- Species
Loading the database requires the biolookup Python package. It automatically downloads the data from the latest releases on Zenodo if not available locally, then puts it in the database.
$ python -m pip install biolookup
$ biolookup load --uri postgresql+psycopg2://postgres:biolookup@localhost:5434/biolookup --test
The --test flag makes the loader only load 100K records instead of hundreds of millions. For building a real database, remove this flag.
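Once loading finishes, listing the tables is a quick way to confirm that data actually landed in the database. The exact table names are up to the biolookup loader, so they may differ between versions:
$ PGPASSWORD=biolookup psql -h localhost -p 5434 -U postgres -d biolookup -c '\dt'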
Commit and push to DockerHub
The docker commit command computes the difference between the base image and the current state of the container. Because of the -e PGDATA=..., it also captures the newly added data. For the Biolookup Service, the image grew from about 300mb to almost 40gb, so be patient. I went and made breakfast while this was happening, and it was done by the time I came back. For reference, it was a Saturday morning American breakfast.
$ docker commit \
-a "Charles Tapley Hoyt <cthoyt@gmail.com>" \
-m "Added biolookup schema and data" \
$(docker ps --filter "name=postgres-biolookup" -q) \
biopragmatics/postgres-biolookup:latest
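Once the commit finishes, you can confirm that the new image exists locally and see how large it ended up with:
$ docker images biopragmatics/postgres-biolookup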
After committing, it's time to push to DockerHub. You might need to do docker login before this. The name of the image takes the form <organization>/<name>[:<tag>]. Make sure you push to an organization that you have rights to; the tag (the part after the colon) is optional.
$ docker push biopragmatics/postgres-biolookup:latest
You can check that it was uploaded properly on the image's DockerHub page.
Run locally
Since the biolookup
web application is automatically installed with PyOBO and the database is now
built locally, you can test it locally with:
$ biolookup web --sql --uri postgresql+psycopg2://postgres:biolookup@localhost:5434/biolookup
Run with Docker Compose
You can use the following configuration as a docker-compose.yml
file to orchestrate the pre-loaded
database with the front-end web application (more information on
that here):
version: '3'
services:
  app:
    image: biopragmatics/biolookup:latest
    environment:
      BIOLOOKUP_SQLALCHEMY_URI: postgresql+psycopg2://postgres:biolookup@database/biolookup
    restart: always
    ports:
      - "8765:8765"
    depends_on:
      - database
  database:
    image: biopragmatics/postgres-biolookup:latest
    ports:
      - "5432:5432"
You can run this with docker-compose up --detach
. When you do this, you need to be patient (1-5
minutes) for the database to start up before making requests from the web application. If you didn’t
detach when running docker-compose, postgres will actually log when it’s ready. If you try making a
request before it’s done starting, you’ll probably get an error message that looks like this:
Traceback (most recent call last):
...
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 605, in __connect
connection = pool._invoke_creator(self)
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 578, in connect
return dialect.connect(*cargs, **cparams)
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 584, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/usr/local/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: the database system is starting up
If you want to be more correct about this, you can use pg_isready to check the connection status of a PostgreSQL server. But I think it's practical enough to just wait a few minutes.
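For example, a small shell loop on the host will block until the server inside the compose stack reports that it's ready. This assumes the 5432 port mapping from the configuration above and the postgres client tools from the prerequisites; pg_isready doesn't authenticate, so no password is needed:
$ until pg_isready -h localhost -p 5432 -U postgres; do sleep 2; done
You could also bake pg_isready into a healthcheck on the database service in the Docker Compose configuration, but for local testing the one-liner is usually enough.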
Next Steps
My next steps are to figure out the best way to automate the first three steps (running the base image, creating the database, and loading the database) then hopefully do it in an automated setting in GitHub Actions.
Special thanks to Ben Gyori for outlining how to do this and getting me going in the right direction. Thanks to Scott Colby for feedback and pro docker tips.