Speed up PostgreSQL to BigQuery


I would like to upload some data that is currently stored in PostgreSQL to Google BigQuery to see how the two tools compare.



There are many options for moving data around, but the most user-friendly one I have found so far leverages Python pandas.


sql = "SELECT * FROM ".format(input_table_name)
i = 0
for chunk in pd.read_sql_query(sql , engine, chunksize=10000):
print("Chunk number: ",i)
i += 1
df.to_gbq(destination_table="my_new_dataset.test_pandas",
project_id = "aqueduct30",
if_exists= "append" )



However, this approach is rather slow, and I was wondering what options I have to speed things up. My table has 11 million rows and 100 columns.



The PostgreSQL database is on AWS RDS, and I call Python from an Amazon EC2 instance. Both are large and fast. I am currently not using multiple processors, although there are 16 available.
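Since there are 16 idle cores on the EC2 instance, one option is to parallelise the chunked upload across processes. Below is a minimal sketch of that idea; it assumes the source table has an integer primary key named id (an assumption, not something stated in the question) so each worker can read a disjoint slice, and the connection string and row range are placeholders.

from multiprocessing import Pool

import pandas as pd
from sqlalchemy import create_engine

PG_URI = "postgresql://user:pass@my-rds-host:5432/mydb"   # placeholder connection string
TABLE = "my_table"                                        # placeholder source table
STEP = 500_000                                            # rows per worker slice


def upload_slice(start_id):
    # Each worker opens its own connection and pushes one slice of rows.
    # The destination table is assumed to already exist in BigQuery.
    engine = create_engine(PG_URI)
    sql = ("SELECT * FROM {} WHERE id >= {} AND id < {}"
           .format(TABLE, start_id, start_id + STEP))
    df = pd.read_sql_query(sql, engine)
    if not df.empty:
        df.to_gbq(destination_table="my_new_dataset.test_pandas",
                  project_id="aqueduct30",
                  if_exists="append")
    return start_id, len(df)


if __name__ == "__main__":
    # Assume ids run from 1 to ~11 million; adjust to the real key range.
    starts = range(1, 11_000_001, STEP)
    with Pool(processes=16) as pool:
        for start, n in pool.imap_unordered(upload_slice, starts):
            print("slice starting at", start, "uploaded", n, "rows")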





We shift new data (daily partitions) from PG to BigQuery daily, and in most cases we use a program written in Go. But for really huge tables, around ~100 million rows and roughly 20 GB in size, we use a bash script that wraps an export from PG into CSV format and a load into BQ using the bq command-line tool, because the Go program took several hours. The script also covers conversion of the table structure into the data types used on BQ - postgresql.freeideas.cz/…
– JosMac
Jul 10 at 13:26
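JosMac's script is bash around psql and bq; as a rough Python equivalent of the same export-to-CSV-then-load flow, something like the sketch below could work. Table names, credentials and file paths are placeholders, and BigQuery schema autodetection stands in for the explicit type-conversion logic the linked script provides.

import psycopg2
from google.cloud import bigquery

CSV_PATH = "/tmp/my_table.csv"

# 1. Dump the table to CSV with PostgreSQL's COPY (fast server-side export).
conn = psycopg2.connect("dbname=mydb user=me host=my-rds-host")
with conn, conn.cursor() as cur, open(CSV_PATH, "w") as f:
    cur.copy_expert("COPY my_table TO STDOUT WITH CSV HEADER", f)

# 2. Load the CSV into BigQuery with a load job (what `bq load` does under the hood).
client = bigquery.Client(project="aqueduct30")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,          # stand-in for the explicit type mapping the comment mentions
    write_disposition="WRITE_APPEND",
)
with open(CSV_PATH, "rb") as f:
    load_job = client.load_table_from_file(
        f, "my_new_dataset.test_pandas", job_config=job_config)
load_job.result()             # wait for the load to finish
print("Loaded", load_job.output_rows, "rows")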







1 Answer



As alluded to in the comment from JosMac, your solution/approach simply won't scale with large datasets. Since you're already running on AWS/RDS, something like the following would, in my opinion, be better:

1. Export the data from RDS to CSV files on S3.
2. Transfer the CSV files from S3 to Google Cloud Storage.
3. Load the files from Cloud Storage into BigQuery, orchestrating the pipeline with Cloud Functions or Dataflow.





Great suggestion. For which part do I need Cloud Functions, and for which part Dataflow? I've not used those services, and the field is crowded with Pub/Sub, AWS Lambda, etc.
– Rutger Hofste
Jul 11 at 9:01





shinesolutions.com/2017/03/23/…
– Graham Polley
Jul 11 at 10:11





Eventually I used a large EC2 instance with 32 vCPUs to query the database and send the resulting CSV files directly to Google Cloud Storage, skipping the S3 step.
– Rutger Hofste
Jul 17 at 12:29
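For reference, here is a minimal sketch of that last approach: upload an exported CSV to Google Cloud Storage, then load it into BigQuery straight from the gs:// URI. Bucket, blob and file names are placeholders, not values from the thread.

from google.cloud import bigquery, storage

BUCKET = "my-export-bucket"                 # placeholder bucket
BLOB = "exports/my_table.csv"               # placeholder object path
GCS_URI = "gs://{}/{}".format(BUCKET, BLOB)

# 1. Upload the exported CSV from the EC2 instance to Cloud Storage.
storage.Client(project="aqueduct30").bucket(BUCKET).blob(BLOB) \
    .upload_from_filename("/tmp/my_table.csv")

# 2. Kick off a BigQuery load job that reads directly from the GCS URI.
bq_client = bigquery.Client(project="aqueduct30")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition="WRITE_APPEND",
)
load_job = bq_client.load_table_from_uri(
    GCS_URI, "my_new_dataset.test_pandas", job_config=job_config)
load_job.result()   # block until the load completes
print("Loaded", load_job.output_rows, "rows from", GCS_URI)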






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard