Every data integration project is different, but they're all susceptible to some of the same design and implementation mistakes.
CloverDX Senior Consultant Kevin Scott walked through some of the common pitfalls in our webinar and gave some tips on how to avoid them in your projects. Watch the webinar or read the transcript below.
One common mistake we see is starting a data integration project too quickly, sometimes referred to as ‘Ready, Shoot, Aim’. When deadlines are looming, it’s tempting to start making immediate progress.
The danger here is wasted effort, because you’re likely proceeding without a firm set of requirements. And when those requirements do surface or change, it’s going to be costly to go back and rework.
When you’re approaching a project, the temptation is often to focus on the current state – the current character of the data – and then to design a solution around that state.
But the data you’re working with is dynamic. It will grow, and it will also likely change structures and formats. Your processes might change too – maybe today you need to move a daily batch of data in a two-hour window, or handle 1000 new events per hour throughout the day, but these are rarely static numbers.
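To make the scale point concrete, here is a rough back-of-the-envelope sketch of how long a daily batch keeps fitting in its window as volume grows. Every number in it (batch size, throughput, growth rate) is a made-up assumption for illustration, and the function name is hypothetical; plug in your own measurements.

```python
def months_until_window_exceeded(rows_per_batch, rows_per_second,
                                 window_hours, monthly_growth,
                                 max_months=120):
    """Return the first month in which the batch no longer fits the window.

    Returns 0 if the batch already does not fit, and None if it still
    fits after max_months. All inputs are assumptions you supply.
    """
    # Rows the pipeline can process inside the window at a steady rate.
    capacity = rows_per_second * window_hours * 3600
    volume = rows_per_batch
    for month in range(max_months + 1):
        if volume > capacity:
            return month
        volume *= 1 + monthly_growth  # compound growth, month over month
    return None


# Hypothetical figures: 5 million rows today, 1,000 rows per second,
# a two-hour window, and 5% growth per month.
months = months_until_window_exceeded(5_000_000, 1_000, 2, 0.05)
print(months)
```

With these particular assumptions the batch outgrows its two-hour window in month 8, which is the kind of number worth knowing before you settle on a design.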
So our first two mistakes: don’t rush in, and spend some time thinking about scale.
And when I applied these two lessons to this talk, I discovered I actually have 16 mistakes, not 10.
Let’s start with some mistakes that can happen as you initially begin to define a data problem and possible solution. I already started with the two above.
Closely related to the scaling problem is misunderstanding the lifespan of your data solution. It’s too easy to think of a data integration (DI) project as a one-off, with definite start and end dates. The reality is that most DI projects are better thought of as an ongoing initiative.
Most data integration projects are never done. (Maybe counter-intuitively, this is particularly true for successful projects, as people will often want more.)
There’s a great marketplace of software tools available for your data integration project, and you'll probably spend significant effort evaluating and choosing the right data tools.
But it’s a mistake to confuse that effort with the effort of actually building your solution. The tools will make you more efficient, but the hard work of designing your data pipeline will still be hard work.
As you define your data integration project, you will need to think carefully about your end users. If your solution is not aligned with user expectations, either in their needs or their capabilities, your success will be limited.
If your users think the solution is too difficult or time-consuming to use, your effort will be wasted when no-one uses it. Underestimating need can also be a problem if the solution gets used more than you had planned for and maybe ends up failing under the load.
A data integration project doesn’t often start with a clean slate. More likely is that there are processes in place that are no longer working. Maybe they’re too slow, maybe broken, maybe too difficult to maintain.
Not taking the time to fully recognize, understand and memorialize this pain can be a mistake.
Fail to understand the motivating pain and you risk building a solution that may be better, but not better in ways that actually address that pain. We’ve seen projects that make some technology change, or move from on-prem to the cloud, without really addressing the root motivation.
Let's move on to a new class of mistakes, ones that tend to surface as you assemble the team that is going to work on the project.
Your project needs a business owner: someone who understands the business goals of the project and can answer questions related to functionality, schedule and tradeoffs – and someone with authority to change the project scope.
Equally importantly, the business owner must be actively involved in the project. Too often business owners are figureheads, attending a weekly status meeting or keeping only a passing interest in progress.
The choice of data integration software can be dizzying. It’s tempting to try and choose tools that will meet all your data management needs - for this project and others, and for needs that might occur in the future.
In our experience, it can be better to take a more measured approach to software selection. Don’t insist on features you won’t need in the next year, or are unable to use. For example, we’ve seen this a lot with big data technologies: until recently every data integration project added Hadoop to the end of its tool requirements list, seemingly regardless of any plans for big data analytics.
When you assemble a team, they will of course be confident, smart and well-suited for the project. It can be tempting to presume that they can learn new tools quickly by themselves and don’t need any formal training.
In our experience it can be a mistake to skip this training. In training, you’ll likely learn tips and tricks that don’t surface in self-guided study, and expert training can often provide more than just tool expertise. Your vendor’s training staff often has valuable experience with many similar projects, and a session with them can be a source of domain expertise.
Establishing a personal relationship with technical staff from your vendor can also be a valuable resource as you execute.
How much third-party help do you want for your project? This depends on many factors, from the skill set and availability of your internal staff to the aggressiveness of your schedule.
There is a spectrum of building-it-yourself options. What parts of the project do you want to build? What parts do you want to buy? Spend the time necessary to choose where on that spectrum to position your project.
At one extreme is a 100% custom solution built without any tools. Maybe you have a strong internal team of developers who may be able to make rapid initial progress on core project goals – the goals that address the pain we talked about earlier.
The mistake we see with this approach is underestimating the ‘assumed’ features that surround a data integration solution – things such as data quality assessment, logging, error handling and reporting, security, monitoring and so on. These foundational features can cause your solution to quickly gain weight beyond the core requirements.
Your programming team may be able to build it, but is that the best use of their time?
Closely related to the decision of where to place your project on the DIY spectrum is your assessment of total project cost. Misjudging costs is a common mistake.
At the outset it is tempting to equate the cost of the tool with the cost of the solution.
Data integration architectures can of course range from the very basic and project-specific to the more generic and framework-centric.
I need to be careful not to over-generalize, but in my experience, the more DIY flavor a project has, the more likely it is to tend towards project-specific architecture.
When I say framework, I mean some abstract or generic approach to architecture, often driven by some configuration file. So instead of writing code that reads a specific customer file and inserts its content into a customer table in a database, you create a framework which will read an arbitrary file and insert its content into an arbitrary table, based on some guidance in a configuration file.
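As a minimal sketch of that idea, assuming a delimited text file and a SQLite database, the loader below knows nothing about customers. The table name, delimiter and column mapping all come from a configuration dictionary, which in a real project might be parsed from a JSON or YAML file. The function name `load_file` and the config keys are hypothetical, invented for this illustration.

```python
import csv
import io
import sqlite3


def load_file(conn, source, config):
    """Insert rows from a delimited text source into the configured table.

    The config keys ("table", "delimiter", "columns") are hypothetical.
    The config is treated as trusted input: table and column names are
    placed into the SQL text, while row values go through placeholders.
    """
    columns = config["columns"]  # maps file header -> table column
    reader = csv.DictReader(source, delimiter=config.get("delimiter", ","))

    col_list = ", ".join(f'"{name}"' for name in columns.values())
    placeholders = ", ".join("?" for _ in columns)
    table = config["table"]
    sql = f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})'

    rows = [tuple(record[field] for field in columns) for record in reader]
    conn.executemany(sql, rows)
    conn.commit()
    return len(rows)


# Example configuration for a customer file. A different file and table
# would need a different config, not different code.
customer_config = {
    "table": "customer",
    "delimiter": ";",
    "columns": {"Name": "full_name", "Email": "email"},
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (full_name TEXT, email TEXT)")
sample = io.StringIO("Name;Email\nAda;ada@example.com\nGrace;grace@example.com\n")
loaded = load_file(conn, sample, customer_config)
```

Supporting a new feed then becomes a matter of writing a new configuration entry, which is where the flexibility of the framework approach comes from.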
Data integration frameworks can appear intimidating and unnecessarily complex. Initial progress can seem slow when compared to a more direct treatment of project-specific requirements.
Project-specific architecture is easier to conceive and faster to develop, but is less flexible and extensible.
Also, frameworks will be more likely to support corporate data governance and auditability goals, and will be more tolerant of, and even welcoming to, business process changes and the resulting data changes.
A POC can:
But be careful how you represent a POC to stakeholders. POCs are great for validating requirements, but can also convey a false sense of progress to the casual observer.
The goal is to avoid surprises. Demo as much as possible, as early as possible, to as many stakeholders as possible. Call it Agile or Lean development, as in software.
Handling bad data is often an afterthought in data projects and that can be a costly mistake.
When bad data creeps into our systems it can affect credibility and can be expensive to repair. Bad data handling should be a core part of your DI project.
Remedies can include adding explicit validation stages that look for bad data throughout the pipeline and catch it as early as possible, and defining processes not only to detect bad data but also to correct it.
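An explicit validation stage can be as simple as the sketch below: each record is checked against a few rules, good records continue down the pipeline, and bad ones are set aside together with the reasons they failed so that someone can correct them. The field names and rules are hypothetical examples, not a recommended rule set.

```python
from datetime import datetime


def validate_order(record):
    """Return a list of problems found in one record (empty means valid)."""
    problems = []
    if not record.get("order_id"):
        problems.append("missing order_id")
    try:
        if float(record.get("amount", "")) < 0:
            problems.append("negative amount")
    except (TypeError, ValueError):
        problems.append("amount is not a number")
    try:
        datetime.strptime(record.get("order_date", ""), "%Y-%m-%d")
    except (TypeError, ValueError):
        problems.append("order_date is not YYYY-MM-DD")
    return problems


def validation_stage(records):
    """Split records into those that pass and those that need attention."""
    valid, rejected = [], []
    for record in records:
        problems = validate_order(record)
        if problems:
            # Keep the original record next to its problems for later repair.
            rejected.append({"record": record, "problems": problems})
        else:
            valid.append(record)
    return valid, rejected


orders = [
    {"order_id": "A1", "amount": "19.99", "order_date": "2024-03-01"},
    {"order_id": "", "amount": "abc", "order_date": "01/03/2024"},
]
valid, rejected = validation_stage(orders)
```

Keeping the rejected records and their reasons, rather than silently dropping them, is what turns detection into something you can build a correction process on.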
I have left discussion of testing as the last item. Ironic, because this is also often left as the final step in DI projects - which is a mistake.
The biggest mistake we see in the testing phase is the failure to obtain relevant and valid test data.
Robust test data is vital to ensure your solution works, but at the same time it is remarkably difficult to obtain. The best test data is actual production data, but that is almost always off limits, because (a) it is in production on production infrastructure and (b) it may hold sensitive data or PII. Synthesizing or anonymizing test data is the next best thing, but make sure you plan and budget for this effort as well.
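One common way to prepare production-like test data is to replace sensitive fields with keyed hashes, as in the sketch below. Strictly speaking this is pseudonymization rather than full anonymization: the same input always maps to the same token, so joins between tables still work, but anyone holding the key could rebuild the mapping. The field names, key and helper names are hypothetical, and whether this approach is sufficient for your data is a question for your compliance team.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace a value with a short, repeatable token derived from a secret key."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize_record(record, pii_fields, secret_key):
    """Return a copy of the record with the listed fields pseudonymized."""
    return {
        field: pseudonymize(value, secret_key) if field in pii_fields else value
        for field, value in record.items()
    }


# Hypothetical key and record; a real key belongs in a secrets store.
key = b"test-data-key-kept-out-of-source-control"
customer = {"email": "ada@example.com", "country": "UK", "plan": "premium"}
masked = anonymize_record(customer, {"email"}, key)
```

Because the tokens are repeatable, a customer who appears in several source files gets the same token in each, which keeps the test data realistic enough to exercise joins and deduplication logic.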
We have seen projects delayed months due to the inability to obtain test data of quality sufficient to perform the required QA assessment.
And that wraps up the list. I am sure you have seen many of these before, but I trust at least a couple of them have given you something new to think about.
At CloverDX we have been helping clients plan and execute data integration projects for nearly 20 years, and we welcome the opportunity to talk with you in more detail about your specific challenges, and help you avoid some of these costly mistakes.