Recently we completed a project involving large-volume data replication from an Oracle database to PostgreSQL. The challenge we faced was a very limited window in which to complete the large-volume data load on a daily basis.
Initially we tried psql, the PostgreSQL command-line loading tool; it took 4.5 hours to transfer approximately 250,000 records. But regular business operations require moving more data in far less time, e.g. replicating 1 million records within 30 minutes. At psql's rate (it very likely loads on a single thread), replicating 1 million records would take 18 hours (1,080 minutes), which is simply not feasible from a business perspective.
We built an application based on Quartz Scheduler that performs table-to-table asynchronous replication in a near-real-time manner.
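For readers unfamiliar with Quartz, here is a minimal sketch of how such a replication job can be scheduled. The job class, the identity/group names, and the 60-second polling interval are illustrative assumptions, not details of our actual application.

```java
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

// A replication job fired periodically by Quartz (names and interval are assumptions).
public class ReplicationJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // Read newly changed rows from the Oracle source table and insert them
        // into the matching PostgreSQL target table (application-specific logic).
    }

    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(ReplicationJob.class)
                .withIdentity("replicationJob", "replication")
                .build();

        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("replicationTrigger", "replication")
                .startNow()
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInSeconds(60)   // poll the source once a minute
                        .repeatForever())
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```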
To improve performance for bulk-loading cases, we decided to implement the data replication with a multi-threaded approach. Whenever the result set count reaches 1,000 records, we open a new thread to handle the insert operation, allowing up to 10 threads to run at the same time. The loading test showed that the 1-million-record load completed in only 108 minutes, cutting 90% off the original loading time!
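Below is a simplified sketch of this chunk-and-thread idea using plain JDBC and a fixed thread pool. The table names (source_table, target_table), column names, and connection handling are hypothetical placeholders; the point is handing each 1,000-record chunk to one of 10 worker threads.

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelLoader {
    private static final int CHUNK_SIZE = 1_000;   // hand off every 1,000 source rows
    private static final int THREAD_COUNT = 10;    // at most 10 concurrent insert threads

    public void replicate(Connection oracleConn, String targetJdbcUrl)
            throws SQLException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(THREAD_COUNT);
        try (Statement stmt = oracleConn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, payload FROM source_table")) {
            List<Object[]> chunk = new ArrayList<>(CHUNK_SIZE);
            while (rs.next()) {
                chunk.add(new Object[] { rs.getLong("id"), rs.getString("payload") });
                if (chunk.size() == CHUNK_SIZE) {
                    final List<Object[]> rows = chunk;
                    pool.submit(() -> insertRows(targetJdbcUrl, rows));  // one thread per chunk
                    chunk = new ArrayList<>(CHUNK_SIZE);
                }
            }
            if (!chunk.isEmpty()) {
                final List<Object[]> rows = chunk;
                pool.submit(() -> insertRows(targetJdbcUrl, rows));      // flush the last partial chunk
            }
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }

    private void insertRows(String targetJdbcUrl, List<Object[]> rows) {
        // Each worker thread opens its own PostgreSQL connection and inserts its chunk.
        try (Connection pg = DriverManager.getConnection(targetJdbcUrl);
             PreparedStatement ps = pg.prepareStatement(
                     "INSERT INTO target_table (id, payload) VALUES (?, ?)")) {
            for (Object[] row : rows) {
                ps.setLong(1, (Long) row[0]);
                ps.setString(2, (String) row[1]);
                ps.executeUpdate();   // one insert per row; batching is added in the next step
            }
        } catch (SQLException e) {
            throw new RuntimeException("Chunk insert failed", e);
        }
    }
}
```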
But nearly two hours of loading time was still too long, so we went one step further and added batch processing to the PreparedStatement. To our surprise, the 1-million-record loading test completed in 27 minutes, another 75% improvement on top of the multi-threaded loading.
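Here is a minimal sketch of the batched worker method; read it as a drop-in replacement for the insertRows method in the previous sketch (same hypothetical table and columns). Instead of executing one INSERT per row, each chunk is queued with addBatch() and flushed with a single executeBatch(), and auto-commit is turned off so the whole chunk commits once.

```java
private void insertRowsBatched(String targetJdbcUrl, List<Object[]> rows) {
    try (Connection pg = DriverManager.getConnection(targetJdbcUrl);
         PreparedStatement ps = pg.prepareStatement(
                 "INSERT INTO target_table (id, payload) VALUES (?, ?)")) {
        pg.setAutoCommit(false);               // commit once per chunk instead of per row
        for (Object[] row : rows) {
            ps.setLong(1, (Long) row[0]);
            ps.setString(2, (String) row[1]);
            ps.addBatch();                     // queue the insert instead of executing it
        }
        ps.executeBatch();                     // execute the queued inserts as one batch
        pg.commit();
    } catch (SQLException e) {
        throw new RuntimeException("Batched chunk insert failed", e);
    }
}
```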
Based on the loading tests, the final loading time is just 2.5% of the original. The improvement is truly significant!
So, for large-volume data loading, to improve your data replication performance significantly:
- Implementing multi-threaded writes to your target data store helps a lot;
- Adding batch processing to your PreparedStatement cuts loading time significantly.
Should you have any questions on how to improve your data replication efficiency, please feel free to contact us.