Guidance on Schema Migration While Onboarding to BigQuery

Jayakumar
7 min read · Jan 26, 2021

Once you decide to migrate from on-premises to BigQuery, the first and foremost process to plan is schema migration. It is critical that this decision is well thought out, because it shapes the outcome of the overall migration. Schema migration requires meticulous planning in deciding both the approach to follow and the implementation strategy. This blog explores the possible approaches and explains their merits and demerits.

Broadly, the schema migration strategy can be bucketed as follows.

Re-Engineer:

Re-engineering opens up a lot of potential compared to AS-IS migration, which is essentially retrofitting. Re-engineering gives you the option to optimize the data structures to reap the benefits Google BigQuery has to offer.

As a best practice, Google recommends reviewing the schema whenever possible. A common practice in BigQuery is to keep the data in a flat, denormalized structure. If the source schema follows the star schema pattern, some amount of flattening can improve SQL performance in BigQuery. The goal is to produce a wide, column-oriented table. Although BigQuery does support joins on a relational/star schema, it is usually best practice to consider flattening the data given the columnar nature of a database like BigQuery. Techniques like flattening into nested and repeated fields can be employed to further optimize data storage and processing.
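As a minimal sketch of such flattening (the dataset, table, and column names such as sales_fact and customer_dim are assumptions for illustration, not from the original post), a fact table and its dimension can be materialized into one wide, denormalized table:

-- Hypothetical star schema: a fact table joined to a dimension table,
-- flattened into a single wide, column-oriented BigQuery table.
CREATE OR REPLACE TABLE analytics.sales_wide AS
SELECT
  f.sale_id,
  f.sale_ts,
  f.amount,
  c.customer_id,
  c.customer_name,
  c.customer_segment
FROM warehouse.sales_fact AS f
JOIN warehouse.customer_dim AS c
  ON f.customer_id = c.customer_id;

Downstream queries then read only the columns they need from the wide table, with no join at query time.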

It is counterintuitive, by traditional modeling standards, to denormalize a table in the name of optimization. However, BigQuery is not a traditional database, and it requires some level of unlearning. BigQuery tackles and debunks the two key benefits normalization brings to the table:

  1. Reduction of redundant data
  2. Data consistency and data integrity

Both of these become largely irrelevant with BigQuery. BigQuery is not a transaction processing system; hence, we are not going to create inconsistencies in the data. In addition, the extra cost associated with storing and querying extra copies of data is often outweighed by the much faster query times. (Remember that BigQuery pricing is driven more by the volume of data processed per query than by the cost of storing the data.)

The reason denormalization works better in BigQuery is that it reduces the number of joins to be performed. Joining tables is a time-consuming operation; while this is not very noticeable on smaller tables, it soon becomes a bottleneck as the tables grow, and the time required to join high-volume tables increases steeply.

Below is the recommendation from Google.

Follow these general guidelines to design the optimal schema for BigQuery:

  1. Denormalize a dimension table that is larger than 10 gigabytes, unless you see strong evidence that data manipulation (UPDATE and DELETE operation) costs outweigh the benefits of optimal queries.
  2. Keep a dimension table that is smaller than 10 gigabytes normalized, unless the table rarely goes through UPDATE and DELETE operations.
  3. Take full advantage of nested and repeated fields in denormalized tables.

AS-IS migration:

Oftentimes, since consumption dictates the table structure, and when the consumers are not yet ready for a change, AS-IS schema migration is a viable option for moving the data warehouse to GCP as quickly as possible. In this approach, DB objects like tables and views are migrated AS-IS, with little or no change to their structure beyond what is needed for data type compatibility.

For use cases where the on-premises schema is in 3NF, the scope of migration is restricted to the warehouse alone, and consumption still requires the old pattern, denormalization is no longer an option. However, techniques like partitioning and clustering can still be employed to increase performance.

Conclusion:

When Google recommends denormalizing data while bringing it into BigQuery, it is not a blanket statement. One has to understand the current data volume and the future growth pattern. Also, considering the level of data manipulation a table goes through, not every table benefits from denormalization.

For migration use cases, a combination of AS-IS and denormalized table structures is recommended. For greenfield development, when modeling for BigQuery, one can opt for more denormalized structures, taking advantage of nested and repeated fields. Oftentimes an additional layer has to sit on top of these objects for consumption.

The increase in elapsed time from the additional joins, because the tables are in 3NF or even snowflake dimensions, can be compensated for by clustering on the key columns used in the joins, as long as the tables are below the 10 GB limit; the level of join nesting also plays a major role.

Options at your disposal while planning for schema migration, and critical points to consider

1) Nested and repeated fields (STRUCT and ARRAY data types): One can take full advantage of the STRUCT and ARRAY data types, whenever possible, to logically group column values that are typically accessed together (column families), so that these columns are read only when required while still being part of the same table. Individual elements can be accessed using OFFSET, or the structure can be fully flattened, depending on the need, using UNNEST with a CROSS JOIN.

Below is an example of a flat table where STRUCT and ARRAY data types are employed on the Address and Departments columns.
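A minimal sketch of what such a table could look like follows; the table name employee_flat and the exact fields inside Address and Departments are assumptions for illustration:

-- Hypothetical flat table using a STRUCT (address) and an ARRAY (departments).
CREATE OR REPLACE TABLE hr.employee_flat (
  employee_id INT64,
  employee_name STRING,
  address STRUCT<street STRING, city STRING, zip STRING>,
  departments ARRAY<STRING>
);

-- Access a single array element with OFFSET ...
SELECT employee_id, address.city, departments[OFFSET(0)] AS first_department
FROM hr.employee_flat;

-- ... or fully flatten the array using UNNEST with a CROSS JOIN.
SELECT employee_id, department
FROM hr.employee_flat
CROSS JOIN UNNEST(departments) AS department;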

2) Denormalization impact on SCD-type tables

When it comes to maintaining SCD tables, a normalized table minimizes the impact, whereas a denormalized flat fact (with the dimensions folded into the same table) poses challenges in maintaining SCD. There are certain techniques Google recommends; however, there is no one-size-fits-all solution. Oftentimes one must sacrifice efficient SCD handling in exchange for optimized query performance, and vice versa.

SCD Type 2 handling with a denormalized structure

Google recommends the following techniques to avoid analysis downtime.

Technique 1: view switching

This technique is based on two views of the data: "main" vs. "shadow". On update cycles, the "shadow" view is created or updated and goes through data correctness tests while users keep working against the "main" view. Once the tests pass, the shadow is swapped with the main and the now-shadow is purged.
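As a hedged sketch of the swap step (the dataset, table, and view names here are assumptions, not from the original post), the "main" object users query can be a view that is repointed once the shadow table passes its checks:

-- Build/refresh the shadow table while users keep querying the "main" view.
CREATE OR REPLACE TABLE analytics.orders_shadow AS
SELECT * FROM staging.orders_latest;

-- Run data-correctness checks against analytics.orders_shadow here.

-- Once the checks pass, swap: repoint the view that users query.
CREATE OR REPLACE VIEW analytics.orders_main AS
SELECT * FROM analytics.orders_shadow;

The repointing itself is a metadata-only operation, so consumers see the new data without any load-in-place window.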

Technique 2: in-place partition loading

When the change in data can be isolated to a partition and a brief analysis downtime is tolerable, one can opt for partition swapping: the partition's data is held in another table, or as a file exported to Google Cloud Storage, the update is applied there, and the result is loaded back into the target partition.
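One way to sketch this with DML (the table and column names, and the assumption that the target is partitioned on a sale_date column, are illustrative only) is to rebuild the affected day's rows in a staging table and then overwrite just that partition:

-- Target table is assumed to be partitioned on sale_date (DATE).
-- 1. Rebuild the affected day's rows in a staging table.
CREATE OR REPLACE TABLE staging.sales_20210125 AS
SELECT * FROM corrections.sales_fixed
WHERE sale_date = DATE '2021-01-25';

-- 2. Replace only that partition in the target table
--    (the brief window between DELETE and INSERT is the tolerated downtime).
DELETE FROM warehouse.sales WHERE sale_date = DATE '2021-01-25';

INSERT INTO warehouse.sales
SELECT * FROM staging.sales_20210125;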

Technique 3: update data masking

A very small, frequently changing dimension can be normalized and isolated, and then conditionally joined back to the main table, as in the query below.

select
  main.branch_ky
  ,coalesce(iso.branch_nm, main.branch_nm)
  ,coalesce(iso.branch_typ, main.branch_typ)
from tbl_branch main
left outer join tbl_branch_iso iso
  on main.branch_ky = iso.branch_ky

Note: Denormalized schemas are not storage-optimal, but BigQuery's low cost of storage addresses concerns about storage inefficiency. Contrasting the storage cost against the gains in query speed shows why storage is not a significant factor.

3) Benefit achieved by employing clustering

With No Cluster defined on the join column

With Cluster defined on the join column
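As a sketch of how the clustered variant could be defined (the table and column names are assumptions), clustering the larger table on the join key lets BigQuery prune and co-locate data for the join:

-- Clustered copy of a large table, clustered on the join key.
CREATE OR REPLACE TABLE warehouse.orders_clustered
CLUSTER BY customer_id AS
SELECT * FROM warehouse.orders;

-- The join can now benefit from data being co-located by customer_id.
SELECT c.customer_name, SUM(o.amount) AS total_amount
FROM warehouse.orders_clustered AS o
JOIN warehouse.customer_dim AS c
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name;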

4) Partitioning vs. Clustering
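As a brief illustration (table and column names assumed), the two features are often combined: partitioning prunes whole date partitions at query time, while clustering orders the data within each partition by the chosen columns for finer-grained pruning:

-- Partitioned on a DATE column for coarse, date-based pruning,
-- clustered on a frequently filtered/joined column for finer pruning.
CREATE OR REPLACE TABLE warehouse.sales_partitioned_clustered (
  sale_id INT64,
  sale_date DATE,
  customer_id INT64,
  amount NUMERIC
)
PARTITION BY sale_date
CLUSTER BY customer_id;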

5) DB object compatibility matrix from popular RDBMSs to BigQuery

Note:

  1. BigQuery supports only UTC timestamps; care should be taken if the data being migrated carries a specific time zone (see the small example after this list).
  2. Because of the datatype consolidation that happens while bringing data into BigQuery, one should expect data volume inflation in the range of 150% to 200%, and plan the storage requirements accordingly.
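For instance (a hedged sketch; the source column value and the zone America/New_York are assumptions), a local wall-clock time from the source system can be converted to a UTC TIMESTAMP during load by supplying the original time zone explicitly:

-- TIMESTAMP(datetime, time_zone) interprets the DATETIME in that zone
-- and stores the result as a UTC TIMESTAMP in BigQuery.
SELECT
  TIMESTAMP(DATETIME '2021-01-26 09:30:00', 'America/New_York') AS event_ts_utc;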

Consolidation of datatypes from various popular RDBMSs to BigQuery.

Teradata to BigQuery migration

Following are some of the datatype compatibility changes when migrating from Teradata to BigQuery.

Netezza to BigQuery migration

Following are some of the datatype compatibility changes when migrating from Netezza to BigQuery.

Note: There is no ARRAY data type in Netezza. Array-like values are instead stored in a varchar field; these can be converted to the ARRAY type in BigQuery.
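For example (assuming the values arrive as a comma-delimited string; the staging table and the column name skills_csv are hypothetical), the conversion can be done with SPLIT during or after load:

-- Convert a comma-delimited varchar (as stored in Netezza)
-- into a native ARRAY<STRING> column in BigQuery.
SELECT
  employee_id,
  SPLIT(skills_csv, ',') AS skills
FROM staging.netezza_employees;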

Redshift to BigQuery migration

Following are some of the datatype compatibility changes when migrating from Redshift to BigQuery.

Note: BigQuery also has the following data types that do not have a direct Redshift analog: TIME, STRUCT, ARRAY, and BYTES.

Oracle to BigQuery migration

Following are some of the datatype compatibility changes when migrating from Oracle to BigQuery.
