How To Migrate Data From One Database To Another


And do it efficiently


Bryan Ho

25th June 2021


Introduction

Data migration is the process of moving data between locations, formats, or systems. The data can be a table, a schema, or even a whole database. There are many ways to move data from one location to another, and there are multiple data storage platforms to migrate to, such as Postgres, MySQL, Amazon S3 and so on. Migrating between these platforms requires slightly different methods.

It is also crucial to transfer the data precisely, lest some data get lost in the process or duplicate entries be created. As such, finding a method that is both efficient and effective is important.
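One simple way to verify a transfer, for instance, is to compare row counts on both sides after the migration. Below is a minimal sketch of that idea; the connection strings and the table name are placeholders, and it assumes two SQLAlchemy engines.

from sqlalchemy import create_engine, text

# Placeholder connection strings; substitute your own hosts, credentials and databases
source_engine = create_engine("postgresql+psycopg2://user:password@source-host:5432/source_db")
dest_engine = create_engine("postgresql+psycopg2://user:password@dest-host:5432/dest_db")

def row_count(engine, table):
    # COUNT(*) is enough to catch dropped or duplicated rows in a bulk copy
    with engine.connect() as conn:
        return conn.execute(text(f"SELECT COUNT(*) FROM {table}")).scalar()

src = row_count(source_engine, "api_log")
dst = row_count(dest_engine, "api_log")
print(f"source={src}, destination={dst}, match={src == dst}")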

In this article, I will share some of the methods I have used to migrate varying volumes of data. Of course, these are not cookie-cutter methods that can be reapplied anywhere, but I will do my best to explain the theory behind each migration, and hopefully it will be applicable to your own data migration situation.

Content Page

  1. Hevo
  2. Writing A Script
  3. Import And Export
Hevo

how hevo functions
Image from Hevo, a pictorial representation of how Hevo functions


Starting off simple, we can use Hevo for data migration and transformation. Click on this link to access the official Hevo page. It is by far the best data migration tool I have used, and it is also very efficient: it took me less than 20 minutes to migrate 5 million rows of data! Cross-platform migration has never been easier; supported data sources include Amazon Redshift, MongoDB, Postgres, Google Drive, Shopify and a ton more.


Video from YouTube by Hevo Data, an introductory overview of Hevo


The steps are straightforward:

  1. Set up a destination, which is the new location of your data store. Fill in the credentials (IP address, database, schema and so on) and they will be verified.
  2. Create a pipeline. There are a few ways to migrate, so select the pipeline mode that suits your needs. Key in the credentials, wait for them to be verified, and you are good to go!

My only gripe with Hevo is that it cannot accommodate extremely large table migrations. Previously, I had to migrate a single table of 68 million rows from one Postgres server to another, but I could not use Hevo because its limit was 5 million rows.

There was a feature to replicate the entire database, but that included tables I did not want, and it felt too cumbersome to replicate everything and then waste more time deleting 95% of it. Luckily, I managed to figure out a cost-effective (free) way to migrate huge amounts of data relatively quickly. If you are interested, check out the data migration article here.

Do note that Hevo is not free, but there are free trials, which come in very handy if you are doing the bulk of the data migration within a short time span.

Writing A Script

Writing a script is appropriate when transferring moderately large amounts of data. Transferring the data in batches is recommended: it not only speeds up the transfer, but also lets you log its progress.

During my internship, I was assigned to integrate 2 tables and migrate them to another location. However, my script crashed midway through, either because of an edge case I failed to consider when integrating the 2 tables or because the database connection timed out. Luckily, since I did the migration in batches, I knew where the migration last stopped and picked up from that same spot. This would have been impossible if I had migrated the entire chunk at once.

In the example below, I wrote a Python script to transfer some data from Postgres. This is only the logic of the script, not the full code.


import pandas as pd

# engine_dataDump (source) and engine_public (destination) are assumed to be
# pre-configured SQLAlchemy engines
start = 0
limit = 100000  # number of rows per batch
for i in range(1, 5):
    try:
        holder = i * limit  # offset for the next batch
        # OFFSET/LIMIT pull one batch of rows at a time
        query = """SELECT * FROM api_log OFFSET {} LIMIT {}""".format(start, limit)
        df = pd.read_sql(query, engine_dataDump)
        df.to_sql('api_log_testing', con=engine_public, if_exists='append', index=False)
        start = holder
        print(f"completed transfer till row {start}")
    except Exception as e:
        # log the error; start is not advanced, so the failed batch is retried in the next loop
        print(e)
                  

Breaking down the code: I am transferring the data in 4 batches, so the code block loops 4 times, as indicated by the for loop. In the SQL query, I select specific chunks of data from the api_log table with the help of the OFFSET and LIMIT clauses. LIMIT is fixed, as it is the number of rows to transfer per loop. OFFSET changes every loop, as it is based on the number of rows already transferred in the previous loops.

pd.read_sql executes the query against engine_dataDump, the source database I am connected to. df.to_sql then sends the queried rows to the new database (engine_public), into a table titled api_log_testing.
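For completeness, the two engine objects are not shown in the snippet above. A minimal sketch of how they could be created with SQLAlchemy, using placeholder connection strings, looks like this:

from sqlalchemy import create_engine

# Placeholder connection strings; substitute your own hosts, credentials and database names
engine_dataDump = create_engine("postgresql+psycopg2://user:password@source-host:5432/source_db")
engine_public = create_engine("postgresql+psycopg2://user:password@dest-host:5432/dest_db")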

Then, I reassign "holder" to the "start" variable so that the offset for the next loop points to the block of data I have yet to upload.
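If you want a crashed run to resume where it stopped (as in my internship story above), one option is to persist that offset outside the process. The following is just a sketch of the idea, and the checkpoint file name is a placeholder:

import os

CHECKPOINT_FILE = "api_log_offset.txt"  # hypothetical checkpoint file

def load_offset():
    # Resume from the last saved offset, or start from row zero
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return int(f.read().strip())
    return 0

def save_offset(offset):
    # Call this only after a batch has been written successfully
    with open(CHECKPOINT_FILE, "w") as f:
        f.write(str(offset))

Calling start = load_offset() before the loop and save_offset(start) after each successful batch would let the script pick up where it left off.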

The print statement logs my progress, while the try and except statements capture any errors so they can be fixed in the future. If I had done this logging during my migration at my internship, I would have been able to catch the error and avoid it the next time!
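Printing works, but Python's built-in logging module keeps a persistent record on disk that survives the terminal session. Here is a minimal sketch; the log file name and the example offset are arbitrary choices.

import logging

# Write progress and errors to a file instead of the console
logging.basicConfig(
    filename="migration.log",  # arbitrary file name
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

start = 100000  # example offset; in the real script this comes from the loop
try:
    logging.info("completed transfer till row %s", start)
except Exception:
    # logging.exception also records the full traceback for later debugging
    logging.exception("batch starting at offset %s failed", start)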

Import And Export

how to import and export with pgAdmin
Image from postgrestutorials, finding the import/export option


This method is very useful if you have fewer than a million rows of data. In essence, you download all the data from your source onto your local computer, then upload that data into the new location.

For example, if you are in pgAdmin, just right-click on the table you want to migrate (1) and pick export (2). Alternatively, locate the download icon, which serves the same purpose. Name the file with the appropriate extension, such as .csv. Next, navigate to the location where you would like your new table to be, right-click again, choose import, pick the csv file and you are done!
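If you prefer to script the same export/import round trip instead of clicking through pgAdmin, a rough pandas equivalent (with placeholder connection strings, table name and file name) might look like this:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection strings; substitute your own hosts, credentials and databases
source = create_engine("postgresql+psycopg2://user:password@source-host:5432/source_db")
destination = create_engine("postgresql+psycopg2://user:password@dest-host:5432/dest_db")

# Export: pull the whole table and write it to a local CSV file
pd.read_sql("SELECT * FROM my_table", source).to_csv("my_table.csv", index=False)

# Import: read the CSV back and load it into the destination database
pd.read_csv("my_table.csv").to_sql("my_table", destination, if_exists="fail", index=False)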

Conclusion

These are the 3 ways I did my data migration that I felt were efficient and reliable. It is important to know the size of your data so you can select the appropriate method -- one that doesn't waste unnecessary resources while still being highly effective. I hope this article is useful and that it helps speed up your mundane migration tasks. Stay cool Cucumbers!