December 11, 2017
How to Initialize a Postgres Docker with a Million-Plus Records
Besides the official Docker documentation, there are several good sources on the internet where you can read about how to kick-start your Dockerized environment on your project, including Stack Overflow Docker tag, official Docker images for MySQL, and Postgres. Unfortunately, there aren’t so many places where one can read about the failures in setting up Docker in various common use cases.
You may wonder, “Why would anyone be interested in such a thing?” I think the value of such stories lays in the fact that we discover how such a solution works many times by trial and error - pretty much the same method applied by a 2nd grader. Do you find yourself in a similar situation? You probably do, so the ideas from here will bear some value to you.
If this is the first time you’ve heard about Docker, head here first https://docs.docker.com/get-started and then return to this article. I have used Docker 1.12 on MacOS, in native mode - this means with Docker managing the build in hypervisor.
Database Initialization: Problem Definition
We all use databases. It’s the most common use case in every small or large distributed service or web application. If you don’t use an ORM of any sort where your entities are “magically” created by the framework, then you are probably extracting some valuable data from a database that you don’t primarily own. In that case, you’ll need some data initialization where you, at the very minimum, will have to create the database and the tables needed by your application.
The basic recipe will install the software and any utilities needed alongside, then will “mount” the data volume and launch the init script; the init script will check if the data folder is empty or not and depending on that will launch the initialization scripts from the folder
docker-entrypoint-initdb.d which were mounted by the user.
If you apply this recipe to initialize your database, it will work.
But what if you have the following use case: using a Docker container in development, you mock the real database with some large quantities of data - let’s say about a million. The above initialization will work, but it’s time consuming. If you’re combining this with several other images in a Travis configuration where you want to run your cool suite of automated tests, then you’ll find yourself waiting for the container to become available…for a long time…say 15 minutes. This may not be an inconvenience in some cases, but most of the times it is, for a number of reasons:
- You are using a Docker compose configuration locally and waiting 15 minutes is too much
- Your Continuous Integration processing queue is taking too long.
Database Initialization: Proposed Solution
In one of our projects, where we needed to mock a Redshift database in the CI/CD setup, we’ve come across the problem described above. In order to circumvent it, we have thought to initialize the database during image construction. We needed to alter the dB initialization script to the following workflow:
The excerpt of the Dockerfile that implements the above functionality is highlighted below:
# source: https://github.com/kiasaki/docker-alpine-postgres FROM alpine RUN echo "@edge http://nl.alpinelinux.org/alpine/edge/main" >> /etc/apk/repositories && \ apk update && \ apk add curl "libpq@edge<9.7" "postgresql-client@edge<9.7" "postgresql@edge<9.7" "postgresql-contrib@edge<9.7" && \ curl -o /usr/local/bin/gosu -sSL "https://github.com/tianon/gosu/releases/download/1.2/gosu-amd64" && \ chmod +x /usr/local/bin/gosu && \ apk del curl && \ rm -rf /var/cache/apk/* ENV LANG en_US.utf8 ENV PGDATA /var/lib/postgresql/data/ ENV POSTGRES_DB datastore ENV POSTGRES_USER docker ENV POSTGRES_PASSWORD letmein RUN mkdir -p /opt/setup/data-scripts.d/ RUN mkdir -p /zdata/ COPY ./data-scripts.d/* /opt/setup/data-scripts.d/ WORKDIR /opt/setup/ COPY db-setup.sh /opt/setup/ COPY db-pack.sh /opt/setup/ COPY db-run.sh /opt/setup/ RUN ./db-setup.sh RUN ./db-pack.sh VOLUME $PGDATA EXPOSE 5432 ENTRYPOINT [ "/opt/setup/db-run.sh" ] CMD [ "postgres" ]
For brevity, we’ll ignore the packing and unpacking scripts, though one can find them in this gist:
https://gist.github.com/robert-malai/19d139937a6b51d2709a6584702b4e5e. The remaining part is the logic baked in when the container is launched - that part is implemented in db-run.sh:
#!/bin/sh set -e if [ ! -f "$PGDATA/PG_VERSION" ]; then echo "Restoring $PGDATA ..." tar -xf /zdata/backup.tar -C $PGDATA sync echo "Done." else echo "$PGDATA was already there, skipping restore." fi echo "Launching command: $@ ..." if [ "$1" = 'postgres' ]; then gosu postgres "$@" else exec "$@" fi
During the bootstrap of the container, the whole database is being unpacked and restored in the data mounted folder, and at the end, the server is launched via the “postgres” mnemonic.
Running the above container had increased the image build to about 15 minutes - the initialization scripts were comprised of about 2 million records on a very simple table structure, with only 5 fields. The good part is that the container took only about 2 minutes to be bootstrapped from the base image. This greatly improved the user experience of the developers who will debug locally using a database, which is very close to what is in production.
The customization of the database initialization scripts prove to be of moderate difficulty (it’s by no means a trivial task) but the end result was reliable, so we can recommend this approach to anyone who is in a similar situation.
Spread the Word
If you like this article, please help us spread the word about it via the Click to Tweet button below or one of the social sharing icons at the bottom of the post.
[bctt tweet="What's the best way to initialize a #Postgres database with a million-plus records in @Docker? Find out from Robert Malai. " via="no"]