
Replibyte In Practice

Matt J


Replibyte is a tool that makes it easy to migrate data from one database to another. But the real value comes from the steps it lets you perform during the migration, such as transformations to obfuscate sensitive data. I saw it a few weeks ago in the TLDR Newsletter and have been wanting to try it out ever since.

I have a side project I have been working on and using for years: a finance management application where I record all of my financial accounts and transactions, with over two years of data in it. I have been wanting to find an easy way to demo it for people without exposing my actual financial details.

My initial thought was to build a fixture generator to create dummy data in a demo environment. But with that approach, you lose some of the integrity of the data because it is all fake, and you have to run the generators periodically, as if you were entering transactions manually, to keep the data continuous. Building those generators would also eat into what little time I have to work on the project itself.

With Replibyte, I can have the best of both worlds - real-world data obfuscated to avoid exposing my financial details.

Concepts

First, let's talk about how Replibyte works. The main functionality of Replibyte is to export the data from a source database to a data dump file and then import that dump file into a target database. On top of that, a few features make Replibyte more valuable than a simple backup with pg_dump.

Here is an illustration they provide in their documentation, and I will describe each step below.

Steps that Replibyte follows to migrate data from a source database to a destination database.

Transformation

For me, transformation is the primary differentiator from most backup utilities. It lets you transform data from the source database as it moves into the data dump. The primary use for this feature is to obfuscate sensitive data from a production environment, such as PII, credit card info, or sensitive notes - or in my case, account names and transaction information. It is also helpful for redacting credentials like access tokens tied to an account or user.

While they do have some built-in transformers, they don't cover every use case. Luckily, they have decent documentation for building your own, and I have written another post that goes over how to make a custom transformer to manipulate some data from my side project's production database.
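
To give a sense of how transformers are wired up, here is a minimal sketch based on the examples in their documentation. The table and column names are placeholders standing in for my schema; random, email, and first-name are among the built-in transformers.

source:
  connection_uri: $SOURCE_DATABASE_URL
  transformers:
    - database: public
      table: accounts            # placeholder table from my schema
      columns:
        - name: name
          transformer_name: random
    - database: public
      table: users               # placeholder table
      columns:
        - name: email
          transformer_name: email
        - name: first_name
          transformer_name: first-name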

Datastore

Their default examples all use AWS S3 as the datastore, but they support a complete list of options and have some pretty good documentation on how to set them up. The example presented below stores the dumps in a local directory, which is not documented as well as the other datastores but is pretty intuitive once you see the configuration.

Supported Datastores by Replibyte

Subsetting

Subsetting is more experimental, but I'm very hopeful for this feature. It allows you to filter down the amount of data being dumped. Currently, there seems to be only one type of subsetting available - random sampling - which dumps a certain percentage of rows selected at random. You can also configure it to export only certain tables, but you can't be any more specific about which data gets exported.
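
For reference, here is roughly what the random-sampling configuration looks like, modeled on their documentation; the table name is a placeholder:

source:
  connection_uri: $SOURCE_DATABASE_URL
  database_subset:
    database: public
    table: transactions     # placeholder - rows are sampled from this table
    strategy_name: random
    strategy_options:
      percent: 50           # dump roughly half the rows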


Example Usage

Before we start, you should know that Replibyte is designed to restore an entire database, not just specific data, which means each restore wipes the existing data from the target database and recreates it from the data dump. I'd suggest creating a test database to try this out rather than pointing it at a database you care about. In Postgres, it's as simple as running this command:

psql postgres -c 'create database replibyte_test;'

For simplicity's sake, I will just walk through getting set up and migrating from one database to another using a local directory to store the data dump(s).

⚠️
The following steps and configuration use version 0.9.6; there is no guarantee they will work the same on future versions.

Install

The first thing to do is install the replibyte command line tool. The following commands are for macOS, but they have documentation for all other operating systems here.

brew tap Qovery/replibyte
brew install replibyte

Now you should have the replibyte command available to you in your terminal. Running replibyte -h will give you a quick usage summary.

Configure

The next step is to set up a configuration file to tell Replibyte how to access the source and target databases and where to store the data dumps. This configuration file can also be used to configure transformations and subsets of data, but this is all you need to get started. You can find a more comprehensive example file in their documentation.

encryption_key: $REPLIBYTE_ENCRYPTION
source:
  connection_uri: $SOURCE_DATABASE_URL
destination:
  connection_uri: $TARGET_DATABASE_URL
datastore:
  local_disk:
    dir: ./dumps

This configuration is pretty straightforward - it tells Replibyte that the connection strings for the source and destination databases are stored in environment variables, and that the data dump should be stored in a local directory named dumps (which should be created before executing the first dump).
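
For illustration, wiring up the environment might look something like this - the connection strings are placeholders for your own databases, and openssl is just one way to generate a random key:

export REPLIBYTE_ENCRYPTION=$(openssl rand -hex 16)   # any sufficiently random secret works
export SOURCE_DATABASE_URL=postgres://user:password@localhost:5432/finance_prod
export TARGET_DATABASE_URL=postgres://user:password@localhost:5432/replibyte_test
mkdir -p ./dumps   # the local_disk datastore directory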

🗒️
I use dotenv to store my environment variables in a .env.replibyte file, but I am excluding the usage in this example for brevity. 

Another nifty feature I haven't mentioned yet is the encryption key, which keeps your data dumps secure while at rest in your datastore. This is important even if you are transforming PII and other sensitive data, because the dump is still your proprietary data, and you don't want anyone else getting ahold of it.

Execute

The last step is to export your source database to your designated datastore and restore it to your target database.

replibyte -c ./conf.yml dump create
replibyte -c ./conf.yml dump restore remote -v latest

The dump create subcommand will dump the data from the source database into the configured datastore, which in this case is the local directory named dumps.

The dump restore remote -v latest subcommand will restore the latest dump in the datastore to the target database. You can list all of the dumps in the datastore with the dump list subcommand. The output will look something like this:

$ replibyte -c ./conf.yml dump list

 name               | size   | when          | compressed | encrypted
--------------------+--------+---------------+------------+-----------
 dump-1659702710545 | 625 kB | 2 minutes ago | true       | true
 dump-1659663215513 | 625 kB | 11 hours ago  | true       | true
 dump-1659662277379 | 625 kB | 11 hours ago  | true       | true
 dump-1659658713948 | 625 kB | 12 hours ago  | true       | true

To restore a different data dump instead of the latest one, your subcommand would look similar to dump restore remote -v dump-1659663215513.

⚠️
One thing to notice is that this dump information is stored in a metadata.json file within the datastore. If you remove or alter this file, replibyte doesn't seem to know anything about your data dumps.
A better approach would probably have been to store a file alongside each dump so replibyte could derive this information from those files instead of requiring that this one metadata file be persisted.

That's all it takes. The transformation(s) and data subsetting happen when you create the dump, and you can restore from any previously generated data dump. Because dumping and restoring are decoupled, you can make one data dump and restore it to multiple different target databases if you need to.
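
For example, since the configuration reads the destination from an environment variable, restoring the same dump to a second database is just a matter of pointing that variable somewhere else (the database name here is a placeholder):

TARGET_DATABASE_URL=postgres://user:password@localhost:5432/demo_db \
  replibyte -c ./conf.yml dump restore remote -v dump-1659663215513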


Future Features?

Replibyte is still in its infancy as far as features go, so here are some of the features I would love to see in future versions.

Specific Subsetting

As I mentioned above, the only subsetting supported is specifying certain tables or a certain percentage of randomly selected rows from a table. I'd love to be able to provide a query or some sort of config as input to direct how the subsetting works.

For example, I want to be able to pull specific accounts, the transactions tied to those accounts, and any other ancillary data those transactions need from other tables.
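
Something like this hypothetical configuration is what I have in mind - to be clear, this is made-up syntax, and no such query strategy exists in Replibyte today:

source:
  database_subset:
    database: public
    table: accounts
    strategy_name: query          # hypothetical - not a real strategy
    strategy_options:
      where: "id IN (1, 2)"       # keep these accounts plus their related rows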

Partial Dump and Restore

Along with narrower subsetting, it would also help to be able to create a partial dump that could be inserted into the target database instead of overwriting it entirely. With this, you'd probably also want a way to tag or name data dumps instead of the default dump-$timestamp.

For example, at work we have a custom script that queries all of the data pertaining to a specific deal from multiple tables and inserts it into a test database, which lets us replicate a specific state of the application. As of now, Replibyte could not perform this kind of action without wiping out the entire target database.

Transformer Config

I cover how to create a transformer in another post, but one thing I noticed is that there is no way to configure them - the only input is the specific data to transform. It would be nice to be able to write a more generic transformer with configurable inputs, which would make it more shareable and useful across multiple projects and a wider audience.
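
For instance, a hypothetical transformer_options block in the column configuration would let one transformer serve many projects; again, this is made-up syntax, not something Replibyte supports today:

columns:
  - name: amount
    transformer_name: random-number     # hypothetical custom transformer
    transformer_options:                # hypothetical per-column configuration
      min: 1
      max: 500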


I'm happy to have Replibyte as a tool in my arsenal, and I'm excited to see where it goes from here.

Let me know if this was helpful for you and what you think about Replibyte if you have tried it out.

Find me on Twitter or email me at [email protected]