Best Practices for High-Performance ETL to BigQuery
Table of Contents
BigQuery Introduction
GCS - Staging Area for BigQuery Upload
Nested & Repeated Data
Data Compression
Time Series Data & Table Partitioning
Streaming Insert
Bulk Updates
Transforming Data After Load (ELT)
Federated Tables for Ad-hoc Analysis
Access Control & Data Encryption
Character Encoding
Backup & Restore
Bonus - There is An Easier Way to Perform ETL!
BigQuery Introduction
In today's data-intensive ecosystem, there is always a need for a reliable data warehouse. BigQuery, a fully managed cloud data warehouse for analytics from Google Cloud Platform (GCP), is one of the most popular cloud-based analytics solutions. Due to its unique architecture and seamless integration with other GCP services, there are certain elements to be considered as best practices while migrating data to BigQuery. The important points are discussed in the following sections:
GCS - Staging Area for BigQuery Upload
Nested & Repeated data
Data Compression
Time Series Data & Table Partitioning
Streaming Insert
Bulk Updates
Transforming Data After Load (ELT)
Federated Tables for Ad-hoc Analysis
Access Control & Data Encryption
Character Encoding
Backup & Restore
GCS - Staging Area for BigQuery Upload
Unless you are directly loading data from your local machine, you have to upload the data to GCS before loading it into BigQuery. There are multiple options to move data to GCS:
gsutil is a command-line tool that can be used to upload data to GCS from different servers.
If your data is present in an online data source such as AWS S3, you can use the Storage Transfer Service from Google Cloud. This service also has options to schedule transfer jobs.
Other things to note while loading data to GCS (a minimal upload-and-load sketch follows this list):
The GCS bucket and the BigQuery dataset should be in the same location, with one exception: if the dataset is in the US multi-regional location, data can be loaded from a GCS bucket in any regional or multi-regional location.
Formats supported for upload from GCS to BigQuery are: Comma-separated values (CSV), JSON (newline-delimited), Avro, Parquet, ORC, Cloud Datastore exports, and Cloud Firestore exports.
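As an illustration, here is a minimal sketch that uploads a local CSV file to GCS with the google-cloud-storage client and then loads it into BigQuery with a load job. The bucket, file, dataset, and table names are hypothetical placeholders; adjust them to your project.

# Sketch: upload a local CSV to a GCS staging bucket, then load it into BigQuery.
from google.cloud import storage, bigquery

# 1. Upload the local file to a GCS staging bucket (placeholder names).
storage_client = storage.Client()
bucket = storage_client.bucket("my-staging-bucket")
blob = bucket.blob("staging/orders.csv")
blob.upload_from_filename("orders.csv")  # local file

# 2. Load the staged file from GCS into a BigQuery table.
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # infer the schema from the file
)
load_job = bq_client.load_table_from_uri(
    "gs://my-staging-bucket/staging/orders.csv",
    "my_project.my_dataset.orders",  # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete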
Nested & Repeated Data
BigQuery performs best when the data is denormalized. Instead of keeping relations,
denormalize the data and take advantage of nested and repeated fields. Nested and
repeated fields are supported in Avro, Parquet, ORC, JSON (newline delimited) formats.
STRUCT is the type used to represent an object (which can be nested), and ARRAY is the type used for repeated values. For example, in the following row from a BigQuery table, the "addresses" field is an array of structs (a matching schema declaration follows the example):
{
"id": "1",
"first_name": "John",
"last_name": "Doe",
"dob": "1968-01-22",
"addresses": [
{
"status": "current",
"address": "123 First Avenue",
"city": "Seattle",
"state": "WA",
"zip": "11111",
"numberOfYears": "1"
},
{
"status": "previous",
"address": "456 Main Street",
"city": "Portland",
"state": "OR",
"zip": "22222",
"numberOfYears": "5"
}
]
}
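A schema for the row above could be declared as sketched below with the Python client; a REPEATED RECORD field corresponds to an ARRAY of STRUCTs. The field names mirror the example, while the project, dataset, and table names are placeholders.

# Sketch: declare a table schema with a nested, repeated "addresses" field.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("id", "STRING"),
    bigquery.SchemaField("first_name", "STRING"),
    bigquery.SchemaField("last_name", "STRING"),
    bigquery.SchemaField("dob", "DATE"),
    # ARRAY of STRUCT == a RECORD field with mode REPEATED
    bigquery.SchemaField(
        "addresses", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("status", "STRING"),
            bigquery.SchemaField("address", "STRING"),
            bigquery.SchemaField("city", "STRING"),
            bigquery.SchemaField("state", "STRING"),
            bigquery.SchemaField("zip", "STRING"),
            bigquery.SchemaField("numberOfYears", "STRING"),
        ],
    ),
]

table = bigquery.Table("my_project.my_dataset.customers", schema=schema)  # placeholder
client.create_table(table)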
Data Compression
Most of the time, data will be compressed before transfer. Consider the points below while compressing data (a short load sketch follows):
The binary Avro format is the most efficient format for loading compressed data.
Parquet and ORC formats are also good choices, as they can be loaded in parallel.
For CSV and JSON, BigQuery can load uncompressed files significantly faster than
compressed files because uncompressed files can be read in parallel.
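For example, a compressed Avro file staged in GCS can be loaded simply by declaring the source format, since Avro is self-describing; the URI and table name below are placeholders.

# Sketch: load a compressed Avro file from GCS into BigQuery.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
client.load_table_from_uri(
    "gs://my-staging-bucket/staging/orders.avro",   # placeholder URI
    "my_project.my_dataset.orders",                 # placeholder table
    job_config=job_config,
).result()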
Time Series Data & Table Partitioning
Time series data is a generic term used to indicate a sequence of data points paired with timestamps. Common examples are clickstream events from a website or transactions from a Point of Sale machine. The velocity of this kind of data is much higher, and the volume increases over time. Partitioning is a common technique used to efficiently analyze time series data, and BigQuery has good support for this with partitioned tables.
A partitioned table is a special BigQuery table that is divided into segments, often called partitions. It is important to partition bigger tables for better maintainability and query performance. It also helps to control costs by reducing the amount of data read by a query. BigQuery mainly has three options to partition a table:
Ingestion-time partitioned tables - For this type of table, BigQuery automatically loads data into daily, date-based partitions that reflect the data's ingestion date. A pseudo column named _PARTITIONTIME holds this date information and can be used in queries.
Partitioned tables - The most common type of partitioning, based on a TIMESTAMP or DATE column. Data is written to a partition based on the date value in that column. Queries can specify predicate filters on this partitioning column to reduce the amount of data scanned.
Use the date or timestamp column that is most frequently used in queries as the partition column.
The partition column should also distribute data evenly across partitions; make sure it has enough cardinality.
Also, note that the maximum number of partitions per partitioned table is 4,000.
Legacy SQL is not supported for querying partitioned tables or for writing query results to them.
Sharded tables - You can also shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD and use a UNION while selecting data.
Generally, partitioned tables perform better than tables sharded by date. However, if you have a specific use case that calls for multiple tables, you can use sharded tables. Ingestion-time partitioned tables can be tricky if you are re-inserting data as part of a bug fix. You can read a detailed comparison here. A minimal sketch of creating a column-partitioned table follows.
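The sketch below creates a table partitioned on a TIMESTAMP column with the Python client; the schema, field names, and table name are hypothetical placeholders.

# Sketch: create a table partitioned by a TIMESTAMP column.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("event_name", "STRING"),
]

table = bigquery.Table("my_project.my_dataset.events", schema=schema)  # placeholder
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",  # omit `field` for an ingestion-time partitioned table
)
client.create_table(table)

# Queries should filter on the partitioning column to prune partitions, e.g.:
# SELECT ... FROM my_dataset.events WHERE event_ts >= TIMESTAMP('2019-02-01')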
Streaming Insert
For inserting data into a BigQuery table in batch mode, a load job is created which reads data from the source and inserts it into the table (read more on load jobs). Streaming inserts let you query data without the delay of a load job. Streaming inserts can be done to any BigQuery table using the Cloud SDKs or other GCP services like Dataflow (Dataflow is an auto-scalable stream and batch data processing service from GCP - read more).
The following points should be noted while using streaming inserts (a minimal sketch follows this list):
Streamed data is available for query within a few seconds of the first streaming insert into the table.
It can take up to 90 minutes for streamed data to become available for copy and export operations.
While streaming to an ingestion-time partitioned table, the value of the _PARTITIONTIME pseudo column will be NULL while the data is in the streaming buffer.
While streaming to a table partitioned on a DATE or TIMESTAMP column, the value in
that column should be between 1 year in the past and 6 months in the future. Data
outside this range will be rejected.
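A minimal streaming-insert sketch using the Python client; the table name and rows are hypothetical placeholders.

# Sketch: stream rows into an existing BigQuery table.
from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"event_ts": "2019-02-21T10:00:00", "user_id": "u1", "event_name": "click"},
    {"event_ts": "2019-02-21T10:00:05", "user_id": "u2", "event_name": "view"},
]

# insert_rows_json uses the streaming API (tabledata.insertAll) under the hood.
errors = client.insert_rows_json("my_project.my_dataset.events", rows)  # placeholder table
if errors:
    print("Rows could not be streamed:", errors)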
Bulk Updates
BigQuery has quotas and limits for DML statements, and these limits have been increasing over time. As of now, the limit on combined INSERT, UPDATE, DELETE, and MERGE statements per day per table is 1,000. Note that this is not the number of rows; it is the number of statements, and a single DML statement can affect millions of rows.
Within this limit, you can run update or merge statements affecting any number of rows. Unlike many other analytical solutions, this will not affect query performance.
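For example, a single MERGE statement can upsert an arbitrary number of rows while counting as only one DML statement against the daily quota. The dataset and table names in this sketch are hypothetical placeholders.

# Sketch: one MERGE statement upserts many rows but counts as a single DML statement.
from google.cloud import bigquery

client = bigquery.Client()
merge_sql = """
MERGE my_dataset.inventory AS t
USING my_dataset.inventory_staging AS s
ON t.product = s.product
WHEN MATCHED THEN
  UPDATE SET t.quantity = s.quantity
WHEN NOT MATCHED THEN
  INSERT (product, quantity) VALUES (s.product, s.quantity)
"""
client.query(merge_sql).result()  # waits for the DML job to finish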
Transforming Data After Load (ELT)
Sometimes it is really handy to transform data within BigQuery using SQL, which is often referred to as Extract, Load, Transform (ELT). BigQuery supports both INSERT INTO SELECT and CREATE TABLE AS SELECT approaches to transfer data across tables.
Example:
INSERT ds.DetailedInv (product, quantity)
VALUES('countertop microwave',
  (SELECT quantity FROM ds.DetailedInv
   WHERE product = 'microwave'))
CREATE TABLE mydataset.top_words AS
SELECT corpus, ARRAY_AGG(STRUCT(word, word_count)) AS top_words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
Federated Tables for Ad-hoc Analysis
You can directly query data stored in the locations below from BigQuery; these are called federated data sources or tables.
Cloud Bigtable
GCS
Google Drive
Read more on Federated tables
Things to note while using this option (a sketch of defining an external table follows this list):
Query performance might not be as good as with a native BigQuery table.
No consistency is guaranteed if the external data is changed while querying.
You can't export data from an external data source using a BigQuery job.
Currently, the Parquet and ORC formats are not supported.
Query results are not cached, unlike with native BigQuery tables.
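The sketch below defines a permanent external (federated) table over CSV files in GCS using the Python client; the bucket URI and table name are hypothetical placeholders.

# Sketch: define a permanent external table backed by CSV files in GCS.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-staging-bucket/exports/*.csv"]  # placeholder URIs
external_config.autodetect = True

table = bigquery.Table("my_project.my_dataset.external_orders")  # placeholder table
table.external_data_configuration = external_config
client.create_table(table)

# The external table can now be queried like any other table, but the data
# stays in GCS and the performance and caching caveats above apply.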
Access Control & Data Encryption
Data stored in BigQuery is encrypted by default, and the keys are managed by GCP. Alternatively, customers can manage keys using the Cloud KMS service.
To grant access to resources, BigQuery uses IAM (Identity and Access Management) down to the dataset level. Tables and views are child resources of datasets and inherit permissions from the dataset. There are predefined roles like bigquery.dataViewer and bigquery.dataEditor, or users can create custom roles. Check out the documentation to learn more.
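As an illustration, dataset-level access can be granted with the Python client as sketched below; the dataset name and email address are hypothetical placeholders.

# Sketch: grant a user read access at the dataset level.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my_project.my_dataset")  # placeholder dataset

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",                    # dataset-level read access
        entity_type="userByEmail",
        entity_id="analyst@example.com",  # placeholder user
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])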
Character Encoding
Sometimes it takes a while to get the character encoding scheme right while transferring data. Take note of the points below, as they will help you get it right the first time.
BigQuery expects all source data to be UTF-8 encoded, with the below exception:
If a CSV file has data encoded in ISO-8859-1 format, the encoding should be specified explicitly and BigQuery will properly convert the data to UTF-8.
Delimiters should be encoded as ISO-8859-1.
Non-convertible characters will be replaced with the Unicode replacement character: �.
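For example, the encoding of a CSV load job can be declared explicitly so that BigQuery converts ISO-8859-1 (Latin-1) input to UTF-8; the URI and table name are hypothetical placeholders.

# Sketch: tell BigQuery the CSV source is ISO-8859-1 so it is converted to UTF-8.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    encoding="ISO-8859-1",   # default is UTF-8
    skip_leading_rows=1,
)
client.load_table_from_uri(
    "gs://my-staging-bucket/staging/latin1_orders.csv",  # placeholder URI
    "my_project.my_dataset.orders",                      # placeholder table
    job_config=job_config,
).result()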
Backup & Restore
BigQuery addresses backup and disaster recovery at the service level, so the user does not need to worry about it. Still, BigQuery maintains a complete 7-day history of changes to tables and allows you to query a point-in-time snapshot of a table. Read more about the syntax of querying a snapshot.
There is An Easier Way To Perform ETL!
The detailed steps of ETL processing using Google BigQuery mentioned above involve multiple complex stages and can be a cumbersome experience. If you want to load any data easily into Google BigQuery without any hassle, you can try out Hevo. Hevo automates the flow of data from various sources to Google BigQuery in real time and with zero data loss. In addition to migrating data, you can also build aggregates and joins on Google BigQuery to create materialized views that enable faster query processing.
Hevo integrates with a variety of data sources ranging from SQL and NoSQL databases to SaaS applications, file storage, webhooks, etc., with the click of a button.
Sign up for a free trial or view a quick video on how Hevo can make ETL easy.
Published on: 21st February 2019