How to Initialize a Postgres Docker with a Million-Plus Records

Besides the official Docker documentation, there are several good sources on the internet that explain how to kick-start a Dockerized environment for your project, including the Stack Overflow Docker tag and the official Docker images for MySQL and Postgres. Unfortunately, there are far fewer places where you can read about the failures encountered while setting up Docker for common use cases.

You may wonder, “Why would anyone be interested in such a thing?” I think the value of such stories lies in the fact that we often discover how a solution works by trial and error – pretty much the same method a second grader applies. Do you find yourself in a similar situation? You probably do, so the ideas presented here should hold some value for you.

If this is the first time you’ve heard about Docker, head to https://docs.docker.com/get-started first and then return to this article. I used Docker 1.12 on macOS in native mode – that is, with Docker managing the built-in hypervisor.

Database Initialization: Problem Definition

We all use databases. They are the most common dependency in every small or large distributed service or web application. If you don’t use an ORM of some sort, where your entities are “magically” created by the framework, then you are probably extracting valuable data from a database that you don’t primarily own. In that case, you’ll need some data initialization: at the very minimum, you will have to create the database and the tables needed by your application.

It is important to understand the generic pattern applied by the MySQL and Postgres images to achieve exactly that:


The basic recipe installs the software and any utilities needed alongside it, then “mounts” the data volume and launches the init script. The init script checks whether the data folder is empty and, depending on that, runs the initialization scripts from the docker-entrypoint-initdb.d folder that were mounted by the user.
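
With the official Postgres image, for instance, this pattern means you only have to mount your scripts into the well-known folder; they are executed on the first start, when the data volume is still empty. The environment variables below mirror the ones used later in this article, and the local ./initdb folder name is our own choice:

# First run: the data volume is empty, so every script in
# docker-entrypoint-initdb.d is executed before the server accepts connections.
docker run -d \
    -e POSTGRES_DB=datastore \
    -e POSTGRES_USER=docker \
    -e POSTGRES_PASSWORD=letmein \
    -v "$PWD/initdb:/docker-entrypoint-initdb.d" \
    -v pgdata:/var/lib/postgresql/data \
    -p 5432:5432 \
    postgres:9.6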

If you apply this recipe to initialize your database, it will work.

But what if you have the following use case: using a Docker container in development, you mock the real database with a large quantity of data – say, about a million records. The above initialization will work, but it is time-consuming. If you combine this with several other images in a Travis configuration where you want to run your suite of automated tests, you’ll find yourself waiting a long time – say, 15 minutes – for the container to become available. This may not be an inconvenience in some cases, but most of the time it is, for a number of reasons:

  • You are using a Docker Compose configuration locally, and waiting 15 minutes on every fresh start is too much.
  • Your continuous integration queue takes too long to process each build (a typical wait loop is shown below).
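
In a CI script, “waiting for the container” usually looks like the polling loop below; the host and credentials are illustrative. With a cold docker-entrypoint-initdb.d initialization, this loop can spin for many minutes:

# Poll until Postgres accepts connections before starting the test suite.
until pg_isready -h localhost -p 5432 -U docker; do
    echo "Waiting for the database ..."
    sleep 5
done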

Database Initialization: Proposed Solution

In one of our projects, where we needed to mock a Redshift database in the CI/CD setup, we came across the problem described above. To circumvent it, we decided to initialize the database during image construction. This meant altering the database initialization to the following workflow: set up and populate the database at build time, pack the resulting data folder into an archive, and unpack that archive into the data volume when the container starts.


The excerpt of the Dockerfile that implements this workflow is shown below:

# source: https://github.com/kiasaki/docker-alpine-postgres
FROM alpine

# Install Postgres 9.6.x (pinned below 9.7) from the edge repository, plus
# gosu for dropping root privileges; curl is removed again to keep the image small.
RUN echo "@edge http://nl.alpinelinux.org/alpine/edge/main" >> /etc/apk/repositories && \
    apk update && \
    apk add curl "libpq@edge<9.7" "postgresql-client@edge<9.7" "postgresql@edge<9.7" "postgresql-contrib@edge<9.7" && \
    curl -o /usr/local/bin/gosu -sSL "https://github.com/tianon/gosu/releases/download/1.2/gosu-amd64" && \
    chmod +x /usr/local/bin/gosu && \
    apk del curl && \
    rm -rf /var/cache/apk/*

ENV LANG en_US.utf8
ENV PGDATA /var/lib/postgresql/data/

# Credentials and database name, baked in at build time.
ENV POSTGRES_DB datastore
ENV POSTGRES_USER docker
ENV POSTGRES_PASSWORD letmein

# data-scripts.d/ holds the user-provided scripts that populate the database;
# /zdata/ will hold the packed archive of the initialized cluster.
RUN mkdir -p /opt/setup/data-scripts.d/
RUN mkdir -p /zdata/
COPY ./data-scripts.d/* /opt/setup/data-scripts.d/

WORKDIR /opt/setup/
COPY db-setup.sh /opt/setup/
COPY db-pack.sh /opt/setup/
COPY db-run.sh /opt/setup/

# Build-time initialization: create and populate the cluster, then pack it into
# /zdata, because the VOLUME below discards the build-time contents of $PGDATA.
RUN ./db-setup.sh
RUN ./db-pack.sh

VOLUME $PGDATA

EXPOSE 5432

ENTRYPOINT [ "/opt/setup/db-run.sh" ]

CMD [ "postgres" ]
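
With the Dockerfile in place, the workflow becomes the usual build-once, run-many one (the image and container names below are our own placeholders):

# The slow part -- loading the data -- happens once, at build time.
docker build -t mock-datastore .

# Starting a container from the baked image then takes only a couple of minutes.
docker run -d --name datastore -p 5432:5432 mock-datastore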

For brevity, we won’t walk through the packing and unpacking scripts here; the full versions can be found in this gist:
https://gist.github.com/robert-malai/19d139937a6b51d2709a6584702b4e5e.
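
In rough terms, though, they work as follows (the sketch below is our simplified reconstruction based on the Dockerfile above, not the exact code from the gist). db-setup.sh boots a throw-away server, creates the role and database from the build-time environment variables, and runs every script found in data-scripts.d; db-pack.sh then archives the freshly populated data folder into /zdata, because the VOLUME declaration would otherwise discard everything written to $PGDATA during the build:

#!/bin/sh
# db-setup.sh -- simplified, hypothetical sketch
set -e

mkdir -p "$PGDATA" /run/postgresql
chown -R postgres:postgres "$PGDATA" /run/postgresql

# Initialize a fresh cluster and start a temporary server.
gosu postgres initdb
gosu postgres pg_ctl -w start

# Create the application role and database from the build-time env vars.
gosu postgres psql -c "CREATE USER $POSTGRES_USER WITH SUPERUSER PASSWORD '$POSTGRES_PASSWORD';"
gosu postgres createdb -O "$POSTGRES_USER" "$POSTGRES_DB"

# Load the data -- the slow part, now paid for once at build time.
for f in /opt/setup/data-scripts.d/*; do
    case "$f" in
        *.sql) gosu postgres psql -d "$POSTGRES_DB" -f "$f" ;;
        *.sh)  sh "$f" ;;
    esac
done

gosu postgres pg_ctl -w stop

#!/bin/sh
# db-pack.sh -- simplified, hypothetical sketch
set -e

# Archive the cluster outside $PGDATA so db-run.sh can restore it at runtime.
tar -cf /zdata/backup.tar -C "$PGDATA" .
rm -rf "$PGDATA"/*

The remaining part is the logic executed when the container is launched; it is implemented in db-run.sh: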

#!/bin/sh

set -e

# If the (possibly freshly mounted) data volume is still empty, restore the
# cluster that was packed into /zdata at build time.
if [ ! -f "$PGDATA/PG_VERSION" ]; then
    echo "Restoring $PGDATA ..."
    tar -xf /zdata/backup.tar -C "$PGDATA"
    sync
    echo "Done."
else
    echo "$PGDATA was already there, skipping restore."
fi

echo "Launching command: $@ ..."
if [ "$1" = 'postgres' ]; then
    # exec + gosu: drop root privileges and let postgres replace this shell,
    # so it runs as PID 1 and receives signals directly.
    exec gosu postgres "$@"
else
    exec "$@"
fi

During the bootstrap of the container, the whole database is unpacked and restored into the mounted data folder; at the end, the server is launched via the postgres command.
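
As for the data scripts themselves, anything that psql can execute will do. Below is a hypothetical example of a data-scripts.d entry – the table and column names are made up for illustration, but it matches the scale we dealt with: a couple of million rows in a simple five-column table, generated in bulk with generate_series:

#!/bin/sh
# Hypothetical data-scripts.d/10-mock-data.sh -- not the actual project script.
set -e

gosu postgres psql -d "$POSTGRES_DB" <<'EOF'
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    user_id    integer     NOT NULL,
    event_type text        NOT NULL,
    payload    text,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- A single set-based INSERT is far faster than millions of row-by-row inserts.
INSERT INTO events (user_id, event_type, payload)
SELECT (random() * 100000)::int,
       'click',
       md5(random()::text)
FROM generate_series(1, 2000000);
EOF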

Conclusion

Moving the initialization into the build increased the image build time to about 15 minutes – the initialization scripts loaded about 2 million records into a very simple table with only 5 fields. The good part is that a container bootstrapped from the resulting image took only about 2 minutes to become available. This greatly improved the experience of developers who debug locally against a database that is very close to what runs in production.

Pros:
  • Greatly reduced container bootstrap time
  • Reliable

Cons:
  • Requires an image repository
  • Maintenance of the database Dockerfile

Customizing the database initialization scripts proved to be of moderate difficulty (it’s by no means a trivial task), but the end result was reliable, so we can recommend this approach to anyone who is in a similar situation.

Robert Malai

Engineering Manager

Robert works as an Engineering Manager at 3Pillar Global. He is extremely passionate about technology and engineering, which he has practiced for the last 10 years on various teams. He has built products ranging from embedded software in automotive to distributed services in the cloud. He is also passionate about databases, Linux servers, Python, JavaScript, and software development. Prior to joining 3Pillar, Robert worked in the automotive industry during his stint at Continental Automotive.

2 Responses to “How to Initialize a Postgres Docker with a Million-Plus Records”
  1. Swapnil:

    This is not a clean solution either. Instead, why can’t you port the data into an IMDG like Hazelcast or Apache Ignite, working like an SDO or SSD?

    Porting the whole data set into the volume is a disaster waiting to happen.

    • Robert Malai:

      Thank you for your comment.
      I believe your proposed solution is a valid one as well, though it involves an additional container. It would be great for the readers if you have a gist to share…
