Data Migration

A reference guide

A data migration reference guide for anyone wanting a deeper understanding of what is involved in some typical migrations.

Basics

What is a data migration?

    Data migration is the process of moving data from one place to another - typically applications, storage systems and various data storage formats. There is one active data set before and after the migration. This is in contrast to data integrations where data is active in both, or several, places.

    Data migrations are one of the most common data processes and almost all companies will eventually undertake one. There can be many drivers, including application replacements or upgrades, business process changes, data volume growth and performance requirements.

Why should I care?

    Despite being a common process, data migrations are often underestimated as they are usually not the end goal of a project and represent just a part of a larger project. For example, the acquisition and implementation of a new CRM system in order to improve the sales team's performance is an end goal but in order to get there, the data from the incumbent system will need to be migrated to the new one.

Examples

    One of the most common migrations is moving an organization's data to the cloud. Sometimes this is as simple as replacing a self-hosted email server with a cloud-hosted solution. But in many cases the migration is significantly more complex, since it involves multiple application migrations. Very commonly this includes CRM applications, document storage, accounting or payroll applications and more.

    Many cloud providers simplify these migrations by providing tools to import existing data from common self-hosted solutions. However, those tools are not always usable "as-is" and custom development is often required to migrate the data.

    Another common case is a migration where the viability of the current system has reached its limits in terms of performance, features or even support availability. This might be moving a data warehouse from one database to another. Such moves are often not 1:1 copies and often involve data cleansing exercises or more complicated changes if the target database behaves differently from the source (e.g. the target is a columnar database while the source is a traditional relational database).

Common Migration Strategies

Big Bang Migration

    A Big Bang data migration is one in which all data is migrated in one operation. Even though the process of actual migration can take a while, there is a single point in time after which data in the old system is no longer used and the new system goes live. For anyone using the data, this point in time is the “big bang” event.

    Big bang migrations typically have significant preparation periods and short periods of downtime, during which the system is unavailable. The ideal big bang migration has zero downtime, but this cannot always be guaranteed.

    The overall process can be visualized as shown below.

    [Figure: the big bang migration process]

    • In the first phase, the data and its usage are analyzed and, based on the results, the migration architecture is proposed and the process planned.
    • Most time is usually spent developing and testing the migration tools, whether using traditional programming languages or more specialized tools, including ETL.
    • During the design and development phase, no actual data is migrated. The migration itself is a short burst of activity that takes place once development and testing are complete.
    • Finally, the owners of the data verify the migration results.

    Only after data owners and all other stakeholders confirm that the migration was successful can the whole process be considered complete.

    The fact that the migration itself only happens toward the end of the process is both a benefit and a drawback of a big bang migration. It can be beneficial, as users do not need to think about two different systems simultaneously or switch between two live systems.

    On the other hand, migrating the data so late increases the burden on the planning, development and testing phases, and shortcomings in these areas can lead to expensive failures.

Trickle Migration

    A trickle migration can be likened to an agile approach to a data migration, breaking the migration down into many sub-migrations, each with its own set of data and migration goals.

    [Figure: the trickle migration process]

    This approach allows stakeholders to verify the success of each individual phase, giving a stepwise indication of progress. Should any of the sub-processes fail, it is usually necessary to re-run only the failed process and lessons learned from that failure can be applied to subsequent runs.

    Trickle migrations, however, may require more complex planning. They also place a higher burden on users of the data since they have to keep working with two systems while the ongoing overall migration takes place.

Big Bang vs. Trickle

    Each methodology has its own pros and cons. These usually fall into functional rather than financial categories. Regardless of the methodology, migrations can be expensive, especially if badly planned or executed.

    One of the biggest advantages of a big bang migration is that all changes happen only once, in a relatively short space of time, and so there is a single cut-off point. However, this can also be a disadvantage: if the migration fails, a complete rollback is usually required.

    On the other hand, a trickle migration is a series of smaller migrations, each with its own deadlines and scope. If a failure occurs during an individual phase, then that phase alone needs to be rolled back and repeated.

    The big bang approach can result in easier development given its one-time nature, whereas a trickle migration can require considerable effort to keep both the legacy and the new systems running in parallel. This effort is not limited to technical staff; it also falls on end users.

    In some cases a big bang migration may be impossible, as you might not be able to fully automate all steps of the process. For example, a migration might require prolonged interactions with customers. If you need to use the new system's capabilities during the potentially many months it might take to reach contractual agreement with customers, then a big bang approach is unfeasible.

Selecting the right migration strategy

    The differences between a trickle and a big bang migration are considerable, and the choice between them should be made very early on.

    The decision is often driven by several key questions.

    • What is the migration deadline?
    • Can the system (or systems) experience downtime?
    • Do we fully understand our data so we can plan the whole process end to end?

    There is no simple rule; however, big bang migrations are normally selected where the scope is well defined from the outset and where deadlines or other project constraints mandate it.

    Conversely, trickle migrations are beneficial when the migration can be easily split into several different stages. They are also suitable when the scope is hard to define. In such cases, the phased approach allows you to migrate “easier” data first while dealing with the more complex processes later.

    Another major consideration is the experience of your team. Each migration can pose different technical and project management challenges. Ensuring you have the most suitable people with the most relevant experience on your team can play a significant role in the final outcome. Teams that prefer an agile approach typically prefer trickle migrations and, conversely, teams more used to a waterfall methodology prefer the big bang approach.

Pitfalls

Not understanding your data quality

    Data quality is one of the most underestimated properties of data. Data in any system that has been in production for a while can have all sorts of data quality issues. These can range from simple issues such as typos through to missing or invalid data.

    Data owners often have only a vague idea about the overall quality of their data and the impact on subsequent data-oriented processes. While they will clearly understand more obvious issues, they may be completely unaware of more complex or legacy problems.

    We would strongly recommend doing a full data quality evaluation of your production data early on in a data-centric project. This can be complicated by security restrictions, but in our experience, data in a test environment never fully captures the depth and complexity of the issues that appear in production systems.

Inadequate system knowledge

    Data applications, especially mission-critical ones, tend to be in production for extended periods of time as companies typically do not want to invest in technology unless absolutely necessary.

    This means that institutional knowledge of applications is often lost due to inadequate documentation or staff turnover.

    A lack of system knowledge can negatively affect data migration or integration projects due to over-optimistic estimates or data mapping issues that only manifest themselves in later stages of the project.

    As such, a lack of knowledge can be very expensive and any poor estimates or deficient data mapping exercises can lead to costly project restarts and substantial budget overruns.

Changing requirements or scope

    It is not uncommon for the project scope to change during a project’s lifetime, which is one reason why so many IT projects end up over budget.

    One of the most common reasons for this is that the involved applications or data sources are not necessarily fully understood or taken into account during the analysis phase. This is compounded by the fact that documentation of the systems involved is often incomplete, leading to guesses rather than estimates of what is required to implement the business logic. Such guesses are normally overly optimistic.

    The best way to avoid this is to be thorough and honest during the design phase and to ensure that all stakeholders are invited to contribute to the scope discussions. While this can be difficult and time-consuming in larger organizations, the benefits can easily outweigh the expense of a slightly longer design phase.

Ignoring the possibility of failures

    Planning for failure is rarely taken seriously enough. Ensuring that contingencies are in place, should a migration or other integration process fail, is very important, especially for mission-critical applications. A failed migration can leave both the legacy and the new system inoperable, causing major organizational disruption.

    It is important to set thresholds so that the migration phase is only considered failed if it makes more sense to perform a rollback instead of living with the result and fixing the issues afterwards.

    Some projects might require a success threshold close to 100%, whereas others can tolerate a lower one. If the new system, even with its migration shortcomings, delivers a good enough outcome and its improved functionality can be taken advantage of, there may be no point in rolling back.
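
    As a purely illustrative sketch (not tied to any specific tool), the threshold decision can be reduced to a simple calculation. The record counts and the 98% threshold below are made-up values; real thresholds should be agreed with stakeholders per phase.

      # Illustrative only: decide between rollback and fix-forward based on a
      # pre-agreed success threshold. All numbers here are hypothetical.

      def migration_outcome(records_total: int, records_migrated_ok: int,
                            success_threshold: float) -> str:
          """Return 'accept' if the success rate meets the agreed threshold,
          otherwise 'rollback'."""
          success_rate = records_migrated_ok / records_total
          return "accept" if success_rate >= success_threshold else "rollback"

      # Example: 1,000,000 records, 985,000 migrated cleanly, 98% threshold
      # agreed with stakeholders -> accept the phase and fix the remaining
      # issues in the new system instead of rolling back.
      print(migration_outcome(1_000_000, 985_000, 0.98))  # accept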

    Rollback planning is especially important for trickle data migrations. As they have multiple phases, the chance of at least one failure is high. As a result, it is important to design rollbacks in such a way that subsequent data migration phases are not made unnecessarily difficult or even impossible.

Custom coded solutions

    Companies regularly take a custom-coded approach when working with data. This can work perfectly well in the short term or for simpler projects. However, there are some important considerations over time.

    • As the amount of code grows, maintainability becomes a serious challenge
    • Logging is typically an afterthought, so when issues arise there is a lack of diagnostic information just when you need it
    • Integration with new technology is slow to implement
    • Performance is rarely a consideration in the early days, but years later it often is, and serious bottlenecks can develop
    • Performance bottlenecks can require a full refactoring that can take a great deal of time
    • Bugs can and will happen, and debugging can be hard, causing downtime or interruptions
    • Developers carry a lot of knowledge in their heads and when they leave they take it with them
    • Code is often poorly documented and maintenance depends on institutional knowledge, so when people familiar with the codebase leave, the company can be left with major maintenance headaches

The Migration Process

Data Discovery

    Data discovery is an often overlooked and underestimated part of any data-related project. People often make false assumptions about their data, since most people view the data from their own perspective only. However, data discovery is a critical part of the design phase, as it provides the input for the scope definition and project estimates.

    Discovery involves a formal analysis of the data itself, finding relationships within and between datasets. One also has to consider all applications that use the data and need to be reconfigured or updated, especially in the case of data migrations.

    As such, discovery in larger organizations can be an activity involving various teams and often crosses departmental boundaries.
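
    As a hedged illustration of what "finding relationships between datasets" can look like in practice, the sketch below measures how well the values of one column are contained in another, flagging candidate join keys worth investigating. The column names and sample data are invented for the example.

      # Illustrative sketch: estimate whether one dataset's column references
      # another's by measuring value containment. Sample data is made up.

      def containment(child_values, parent_values) -> float:
          """Fraction of (non-null) child values that also appear in the parent."""
          child = {v for v in child_values if v is not None}
          parent = set(parent_values)
          return len(child & parent) / len(child) if child else 0.0

      orders_customer_id = ["C1", "C2", "C2", "C7", None]
      crm_customer_id = ["C1", "C2", "C3", "C4"]

      score = containment(orders_customer_id, crm_customer_id)
      print(f"containment: {score:.0%}")  # 67% -- a candidate relationship to investigate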

Data Cleansing

    During data discovery, you will often find that the data cannot be used in its current form and first needs to be cleansed. There are many different reasons for low data quality, ranging from simple ones (anything involving human data entry is likely to have various errors including typos, missing data, data misuse etc.) all the way to complex issues stemming from improper data handling practices and software bugs.

    Data cleansing is the process of taking “dirty” data in its original location and cleaning it before it is used in any data transformation. Data cleansing is often an integral part of the business logic with the data being cleaned in the transformation but left unchanged in the originating system. Other approaches can also be used. For example, a separate, clean copy of the data can be created if the data needs to be reused or if cleansing is time-consuming and requires human interaction.
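
    To make the "clean in the transformation, leave the source untouched" idea concrete, here is a minimal, generic sketch in plain Python (not tied to any particular tool); the field names and cleansing rules are invented for the example.

      # Illustrative sketch: cleansing applied on the fly as part of the
      # transformation, without modifying the originating system.

      def cleanse(record: dict) -> dict:
          """Return a cleaned copy of the record; the source record is not changed."""
          cleaned = dict(record)
          cleaned["email"] = (cleaned.get("email") or "").strip().lower()
          # Keep digits only in phone numbers (a deliberately simple example rule).
          cleaned["phone"] = "".join(ch for ch in (cleaned.get("phone") or "") if ch.isdigit())
          cleaned["country"] = (cleaned.get("country") or "").strip().upper() or "UNKNOWN"
          return cleaned

      source_record = {"email": " Jane.Doe@Example.COM ",
                       "phone": "+1 (555) 010-2030",
                       "country": " us "}
      print(cleanse(source_record))
      # {'email': 'jane.doe@example.com', 'phone': '15550102030', 'country': 'US'}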

Data Validation

    Data validation is an important process in achieving sufficient data quality in the target system and goes hand in hand with data cleansing. Data issues are discovered during validation and data that does not pass the validation step is marked for cleansing.

    Data validation can be a manual or an automated process. A high degree of automation is, of course, preferable and indeed necessary for larger systems. Nevertheless, some manual validation may still be necessary, especially for the most important data that cannot be allowed through without human confirmation or correction.
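
    The sketch below shows one generic way to express automated validation rules that mark failing records for cleansing rather than loading them; the rules and field names are assumptions made purely for illustration.

      # Illustrative sketch: rule-based validation that routes failing records
      # to a cleansing queue instead of the target system.
      import re

      RULES = {
          "email looks valid": lambda r: re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", r.get("email") or ""),
          "amount is non-negative": lambda r: (r.get("amount") or 0) >= 0,
      }

      def validate(record):
          """Return the list of rules the record fails (empty list = valid)."""
          return [name for name, rule in RULES.items() if not rule(record)]

      records = [
          {"email": "jane@example.com", "amount": 10},
          {"email": "not-an-email", "amount": -5},
      ]
      for rec in records:
          failures = validate(rec)
          print("load to target" if not failures else f"send to cleansing: {failures}")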

Data Mapping

    Data mapping is a process that maps data from the format used in one system to a format used by another system. The mapping can take a significant amount of time as

    • multiple applications often store data for the same entity
    • applications can be unfamiliar or poorly documented.

    The data mapping documentation has a significant impact on the overall implementation effort as, in many cases, there is no single correct way of mapping data between different structures. Common reasons for this are

    • there is no direct one to one mapping between values
    • data structures representing the same entity are too different.

    Proper data mapping requires detailed knowledge from the data discovery project phase. It also usually involves substantial input from data consumers.
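
    As a small, generic sketch of what a documented mapping can turn into, the example below maps one invented customer structure to another, including a value translation where no direct one-to-one correspondence exists. The field names, status codes and lookup table are all assumptions for illustration only.

      # Illustrative sketch: mapping a source CRM record to a different target
      # structure. Field names and the status lookup are hypothetical.

      # No 1:1 mapping exists between the two status code lists, so a lookup
      # table agreed during the mapping exercise is used instead.
      STATUS_LOOKUP = {"A": "ACTIVE", "P": "ACTIVE", "X": "INACTIVE"}

      def map_customer(src: dict) -> dict:
          return {
              # The target stores a single display name; the source splits it in two.
              "displayName": f"{src['first_name']} {src['last_name']}".strip(),
              "status": STATUS_LOOKUP.get(src["status_code"], "UNKNOWN"),
              "contact": {"email": src.get("email")},
          }

      print(map_customer({"first_name": "Jane", "last_name": "Doe",
                          "status_code": "P", "email": "jane@example.com"}))
      # {'displayName': 'Jane Doe', 'status': 'ACTIVE', 'contact': {'email': 'jane@example.com'}}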

    The mapping process is simplified by tools that visualize the mapping between different entities and automate parts of the process.

    Data mapping also needs to consider the future development of the applications involved. As applications evolve, some data structures might change, and it is important for the mapping and its implementation to accommodate such changes as easily as possible.

Best Practice

Data Quality Monitoring

    Monitoring data quality in every application involved is vital in order to prevent low-quality data coming from a single source from polluting multiple applications.

    Monitoring often consists of data validation rules that are applied to each record as it is transformed into its destination format.
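
    A minimal, tool-agnostic sketch of this kind of record-level monitoring is shown below; the monitored attributes and rules are illustrative assumptions, and the resulting counters are the sort of figures that would later feed a quality report.

      # Illustrative sketch: apply monitoring rules to each record as it is
      # transformed and accumulate counters for later reporting.
      from collections import Counter

      MONITORING_RULES = {
          "missing email": lambda r: not r.get("email"),
          "missing postcode": lambda r: not r.get("postcode"),
      }

      def monitor(records) -> Counter:
          issues = Counter()
          for record in records:
              for issue, rule in MONITORING_RULES.items():
                  if rule(record):
                      issues[issue] += 1
          return issues

      sample = [{"email": "a@example.com", "postcode": ""},
                {"email": "", "postcode": "12345"}]
      print(monitor(sample))  # one record is missing an email, one a postcode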

    The choice of which data attributes to monitor, and how to monitor them, is one of the key decisions of the design phase of the whole project. Too much monitoring with overly detailed reports can overwhelm stakeholders, resulting in important problems being overlooked. On the other hand, too little monitoring is undesirable, as important observations are simply never reported.

    Proper data quality monitoring is, therefore, about carefully balancing the investment in building monitoring rules with the volume of output.

Reporting

    Reporting is a key part of data migrations and other integrations, yet it is rarely done well. Well-executed reporting ensures that stakeholders get status information quickly and are able to react in a short timeframe. This, in turn, shortens the time it takes to determine whether or not everything was successful.

    A variety of reports could be generated. Some common examples are

    • data quality reports
    • effort reports
    • resource utilization reports

    A well designed reporting pipeline will notify relevant parties so they can take necessary action should issues arise.

Migrations with CloverDX

Connectivity

    One of the first questions asked is how to connect to the many different input and output systems in existence.

    Integrations and migrations often involve dozens, or even hundreds, of different applications, each with its own storage and connectivity requirements. This can include various file types in remote and local file systems, web services, database types, messaging queues and more.

    CloverDX offers a wide variety of “connectors” that deliver the desired connectivity and also supports powerful abstraction of connectivity parameters. This makes it simple to work with data regardless of its source or target.

    For example, file processing abstraction allows you to work with your files whether they are local, in the cloud or accessed via FTP. Similarly, we offer powerful database abstraction so you can use the same set of components with any relational database.

    Such abstractions allow graph developers to build their solution locally before simply switching the connectivity parameters when ready for the production deployment.

    • Structured data: delimited and fixed-length files in any format
    • Hierarchical data: JSON, XML and custom formats
    • Excel files: XLS and XLSX
    • Relational databases: any database with a JDBC driver
    • NoSQL databases: MongoDB
    • Big Data: HDFS, S3
    • REST and SOAP APIs
    • Other: SalesForce, Tableau, email, LDAP, JMS, ...
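
    As a generic illustration of the connectivity-abstraction idea described above (this is plain Python, not CloverDX), the sketch below reads records through a single function while the location is supplied purely as configuration; the URLs are placeholders and should point at real locations before running.

      # Illustrative sketch: the transformation logic only sees "a readable
      # stream"; whether the data lives on local disk, an FTP server or a web
      # service is just a configuration parameter. URLs below are placeholders.
      from urllib.request import urlopen

      def read_lines(url: str):
          """Open any URL scheme supported by urllib (file://, http://, ftp://, ...)."""
          with urlopen(url) as stream:
              for line in stream:
                  yield line.decode("utf-8").rstrip("\n")

      # Development run against a local file ...
      DEV_SOURCE = "file:///tmp/customers.csv"
      # ... and the same code in production with nothing but a parameter change.
      PROD_SOURCE = "ftp://example.com/exports/customers.csv"

      for line in read_lines(DEV_SOURCE):
          print(line)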

Data Structure Management (Metadata)

    You’ll often have to work with a large number of different data structures. Each application may have its own way of storing data, even when the applications serve the same purpose. For example, consider something as simple as storing customer information in a CRM system: there are countless ways of representing details about a person or a company, including the associated contact information.

    CloverDX can be thought of as “strongly typed”. Before processing each record, CloverDX needs to fully understand the structure and attributes of the data. This ensures that the transformation can enforce static business logic checking and prevent mismatched record types in various operations. This in turn can prevent logical errors when working with seemingly similar but actually different data structures.

    CloverDX’s ability to parametrize any aspect of a data transformation allows you to design generic transformations that generate their data type information on the fly based on the data they process. This can dramatically simplify the design of graphs that operate on more complex data.
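
    The idea of declaring record structure up front can be illustrated generically (again in plain Python, not CloverDX metadata): the schema below is an invented example, and records are checked against it before any business logic runs.

      # Illustrative sketch: a declared record structure ("metadata") that every
      # record must satisfy before the transformation is allowed to process it.
      SCHEMA = {"customer_id": int, "name": str, "balance": float}

      def check_record(record: dict, schema: dict) -> None:
          """Raise if the record does not match the declared structure."""
          missing = set(schema) - set(record)
          if missing:
              raise ValueError(f"missing fields: {sorted(missing)}")
          for field, expected_type in schema.items():
              if not isinstance(record[field], expected_type):
                  raise TypeError(f"{field}: expected {expected_type.__name__}, "
                                  f"got {type(record[field]).__name__}")

      check_record({"customer_id": 42, "name": "Jane Doe", "balance": 10.5}, SCHEMA)  # passes
      try:
          check_record({"customer_id": "42", "name": "Jane Doe", "balance": 10.5}, SCHEMA)
      except TypeError as err:
          print(err)  # customer_id: expected int, got str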

Validation

    Validation is one way to ensure clean, valid data. The aim is to automate the detection of as many invalid records as possible, minimizing the need for human intervention.

    To simplify the process, CloverDX provides many built-in components with a UI based configuration, removing the need for coding.

    For example, the Validator component allows developers to quickly and visually design complex, rule-based validations. These validation rules generate information for each error that can then be fed into a reporting pipeline for the creation of quality reports.

    [Figure: the Validator component]

    The ProfilerProbe component allows for easy measurement of various data properties such as data patterns, basic statistical properties of the data (minimum, maximum etc.), unique values and more.
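
    A rough, generic equivalent of this kind of profiling (plain Python, not the ProfilerProbe itself) might look like the sketch below; the sample field values are invented.

      # Illustrative sketch: basic profiling of one field across a set of
      # records -- counts, nulls, distinct values, and min/max where comparable.
      def profile_field(values):
          non_null = [v for v in values if v is not None and v != ""]
          stats = {
              "count": len(values),
              "nulls": len(values) - len(non_null),
              "distinct": len(set(non_null)),
          }
          if non_null:
              stats["min"] = min(non_null)
              stats["max"] = max(non_null)
          return stats

      ages = [34, 41, None, 41, 29, ""]
      print(profile_field(ages))
      # {'count': 6, 'nulls': 2, 'distinct': 3, 'min': 29, 'max': 41}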

Data Mapping

    Data mapping is a fundamental part of data migration, data ingestion and other integration processes. It is essential to support the widest range of data transformation options as well as various means of streamlining transformation design and testing.

    CloverDX provides many mapping related components ranging from the simple (sorting, deduplication, lookups etc.) through to various programmable components that allow you to code your own business logic (Reformat, Normalizer, Denormalizer, Joiners, etc.).

    [Figure: mapping components in a transformation]

    Many CloverDX components allow you to code your own business logic using our own scripting language, CTL, or Java, giving you access to the widest possible range of your own or external libraries.

    [Figure: business logic written in CTL]

    CloverDX provides comprehensive code debugging. Further, we also support data debugging by allowing you to inspect data in full detail as it flows through each part of your transformation.

    [Figure: data inspection on an edge]

Reusability

    In the world of software development, reusability is taken for granted through mechanisms such as functions and classes.

    In CloverDX we fully support reusability by allowing developers to build “child transformations”. A child transformation is a regular transformation that can be referenced from any other transformation. As well as providing the advantages of reusability, child transformations can significantly improve the readability of your top-level transformations.

    [Figure: a child transformation (subgraph)]

    CloverDX also supports reusability in several other ways. You can manage your own code libraries, store externalized connection settings for your databases, create shared libraries of your data structures and more.

    CloverDX is fully compatible with all common version control systems, so you can collaborate on projects using your preferred SCM tools such as git, Mercurial and Subversion.

Automation and Jobflows

    It is important that your transformation tool allows developers to build complete pipelines consisting of several transformations and processes with inter-process dependencies.

    CloverDX handles this using Jobflows. A Jobflow executes and monitors internal processes (data transformations or other Jobflows) as well as external processes (typically scripts). Together with the powerful error handling and visual nature of the jobflow design process, CloverDX allows developers to very quickly build entire pipelines that properly react to success and failure.

    [Figure: a jobflow]

    Thanks to its ability to automate entire processes, monitor them and send appropriate notifications, CloverDX can quickly become a central orchestration point, keeping responsible and interested parties informed of the progress of migrations and other data integration processes in real time.
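
    As a very rough, tool-agnostic sketch of the orchestration pattern described above (this is not a Jobflow; the step names, notification hook and script path are hypothetical), several dependent steps run in order, and a failure stops the pipeline and triggers a notification.

      # Illustrative sketch: a tiny orchestration loop -- run dependent steps in
      # order, stop on the first failure and notify someone. The step names and
      # the external script path are made up.
      import subprocess

      def notify(message: str) -> None:
          print(f"NOTIFY: {message}")  # stand-in for email/chat/monitoring integration

      def extract():   print("extracting...")
      def transform(): print("transforming...")
      def load():
          # External processes (scripts, loaders) can be part of the same pipeline.
          subprocess.run(["/opt/migration/load.sh"], check=True)

      PIPELINE = [("extract", extract), ("transform", transform), ("load", load)]

      for name, step in PIPELINE:
          try:
              step()
              notify(f"step '{name}' finished")
          except Exception as exc:
              notify(f"step '{name}' failed: {exc}")
              break  # do not run dependent steps after a failure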

    API Endpoints

    CloverDX exposes multiple different APIs that are useful for integration with 3rd party tools. For example, it is possible to trigger jobs via REST or SOAP API calls, or to monitor various external systems for changes such as arriving files or JMS messages.
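
    As an illustration of triggering a job over REST, the sketch below uses plain Python with a placeholder URL and job name; the actual endpoint paths, parameters and authentication should be taken from the CloverDX Server documentation rather than from this example.

      # Illustrative sketch: kicking off a job via an HTTP API call.
      # The URL, job name and request format below are placeholders, not real
      # CloverDX Server endpoints.
      import json
      from urllib.request import Request, urlopen

      SERVER_URL = "https://clover.example.com/api/run-job"   # placeholder
      payload = json.dumps({"job": "migrations/customers.grf"}).encode("utf-8")

      request = Request(SERVER_URL, data=payload,
                        headers={"Content-Type": "application/json"},
                        method="POST")
      with urlopen(request) as response:
          print(response.status, response.read().decode("utf-8"))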