CloverDX Blog on Data Integration

6 steps to modernizing your data warehouse in the cloud

Written by CloverDX | December 01, 2021

On-premise data warehouses can be incredibly expensive and difficult to maintain. Beyond the cost of building the infrastructure, you have to consider ongoing hardware updates, hiring skilled admins, fault-proofing, and more.

For many businesses, this on-premise infrastructure can become too much to manage in time.

That's where moving to a cloud data warehouse becomes more appealing. In the public cloud, scalability, reliability and ease-of-access are built-in. You can benefit from quick deployment and lower costs, allowing you to champion innovation.

Keen to find out more? Let's explore your options before explaining the six key steps to modernization.

Prefer video? Watch the full session on Modernizing your Data Warehouse in the Cloud

Your cloud data warehouse options

Of course, there are a lot of options to choose from when looking for a cloud data warehouse. Your choice is determined by several factors:

  • You might already be using one ecosystem (e.g. AWS), so it makes more sense to stick with it.
  • One of your teams might prefer one option over another.
  • You're either looking for more efficiency and simplicity, or you want to avoid vendor lock-in.

The last factor will decide whether you choose one vendor or multiple vendors.

Here are some of the top cloud data warehouses on the market:

  • Amazon Redshift is one of the most utilized warehouses on the market. It's brilliant for predictable pricing, support for transactional data, and familiarity.
  • Google BigQuery is good for workloads with huge spikes. You have access to a fast engine when you need it.
  • Snowflake is the new kid on the block. It takes more of an ELT approach and is more flexible and intuitive than some of the other options, such as Redshift.
  • Yellowbrick Data is ****a unique cloud data warehouse that offers solutions for a hybrid cloud approach.
  • IBM Db2 Database is a relational database that specializes in providing actionable and insightful data insights.

The 6 pillars of data warehouse modernization

Now that we've given a brief overview of some of the various cloud data warehouses on the market, let's explore the six pillars you need to consider when modernizing your warehouse.

1. Architecture

Let's start with your architecture.

This typically involves taking information from your databases, applications and repositories, staging it, and then loading it into a data warehouse.

But there are a couple of different ways to go about it.

  • Pushing to the cloud. In this case, you have your ETL sitting locally on your on-premise infrastructure. All you need to do is push those resources into your cloud staging area, and then pipeline that to your chosen data warehouse.
  • Pulling data from on-prem. In this instance, you open up ports to a cloud ETL provider, who would then perform the staging and loading into the cloud data warehouse. Bear in mind, however, that this approach can rack up network charges and compute resources.

Either approach works; it's entirely down to your organization and the cost and resource factors involved.

2. Building modern ETL data pipelines

Next, you'll need to plan the implementation; how do you plan on getting your data from your on-premise or cloud resources and into a new data warehouse?

Think about the following:

  • Use case. Your industry might determine the way in which you move your data. Healthcare organizations, for instance, may need to process their data in an on-premise environment rather than in the cloud. On top of this, you'll need to consider your sensitive data and any future projects you plan to build. How will data regulations and compliance factor into your efforts?
  • Managed services vs self-managed. Again, this is a question of your unique business needs. A managed service may be convenient and offer peace of mind, but a self-managed service will give you the flexibility to deploy and build data pipelines as you see fit.
  • Business needs. When choosing an ETL or ELT tool, look beyond your current needs. Will you need something that's more flexible and long-term? Are there other aspects of your business that could make use of this tool? If you're going to make a purchase, you may as well choose something that will stand the test of time and keep providing value.

The next question you'll need to ask yourself is: do we want a general-purpose or cloud-native ETL?

General-purpose vs cloud-native ETL

Oftentimes, you'll find that cloud-native tools are simple to use and provide good integration between specific systems and applications. But sometimes the simplest tool isn't the best.

For more complicated or custom transformations, cloud-native might not be fit for the job.

This is where general-purpose ETLs, such as CloverDX, are beneficial. They support a wide range of different data sources and targets, and allow you to build your own custom components to grab data and transform as you see fit.

The right platform will support both on-premise infrastructures as well as cloud marketplaces.

3. Data quality

Poor data quality amounts to $15 million worth of losses per year, according to a Gartner research report.

It'll come as no surprise that data quality is one of the biggest obstacles when moving to a cloud data warehouse. Why? Because data frequently comes directly from your users, often in an application or Excel spreadsheet. As processes and ideas change without proper documentation, this data becomes less and less reliable.

As such, you need to make sure you're cleansing your data pipelines before migrating to the cloud. Modernizing your data warehouse is a perfect opportunity to sift through all your problematic data and update your 'messy' data.

A platform such as CloverDX can make this task easier through automated data quality processes.

4. Operations and maintenance

Loading your data is only a small part of the data warehouse modernization process.

You'll also have to extract your data from your sources. This means keeping a close eye on the health of your processing, from extraction to status reporting and error handling.

This involves anticipating problems, training your teams on your chosen data warehouse, and adopting new roles.

For instance, if you currently have an in-house infrastructure team, but you're planning on moving to Snowflake (a platform that requires less infrastructure work), you may need to shift about role expectations. There could be other areas of your business where you could utilize their skills.

5. Cost management

Cost management is another big consideration. Of course, this can be hard to estimate upfront. Different providers charge differently and, unfortunately, may make their services appear cheaper than they actually are.

Before you make any decisions, you'll need to understand the relevant pricing models, as well as monitor your spending closely to ensure you're not paying for anything you're no longer using.

With that in mind, here are three areas to focus on:

  • Storage. This involves the storage of the warehouse itself, the object storage you'll use to push or pull data into the warehouse, and backups.
  • Compute. With this, you want to find a balance between performance and price. If you want to use a certain cloud vendor's very powerful CPU, you're bound to pay more per computational second, depending on how it's billed. But if you can tackle the query in a slightly longer timeframe on a lower compute node, for less money, that might be better fit for you. It depends on what's important to your business - availability and speed or cost.
  • Traffic. When you're moving data inside and outside of your cloud instance, you'll need to pay for the networking costs. In scenarios where you might need to increase your normal needs just to move test copies around or to conduct UATs, this cost will only rise. There's no way around this, but compression can help. It's important to make sure you communicate these traffic needs with your team (to avoid any surprise bills).

6. Migrating an existing warehouse

Last but not least, you'll need to think about the process behind migrating an existing warehouse to the cloud.

Before the migration, you need to:

  • Weigh up the utility of legacy resources versus the cost of implementing them. Do you want to migrate your existing investment or create a new implementation?
  • Decide whether to migrate that warehouse or take an initial load from the sources around it.
  • Examine whether you need all of your data.

On top of this, you'll also want to avoid the 'sunk cost fallacy'. You need to take what's really worthwhile to your business and avoid the time, effort, and cost of resources that are less valuable. Ultimately, you don't want to waste your teams' time needlessly on moving data that's not going to be used.

On that note, you'll need to consider what your future might look like. As you go through this modernization process, the nature and function of your teams will change. So, think about how you can prepare your data engineering and business analytics teams for this new cloud data warehouse.

Move to the cloud with ease

There's a lot of careful planning, cost considerations and resource management to consider before modernizing your data warehouse.

From deciding whether to 'push' or 'pull' your data to cleansing your data, you need to ensure you prepare properly. That way, you'll get a better head start in the cloud.

Watch the full video for more detail on modernizing your data warehouse in the cloud