BI & Analytics

A reference guide

A BI & Analytics reference guide for managers and technical staff, providing a deeper understanding of what is typically involved and how to prepare.

Basics

What is BI?

  • What is BI?

    Business Intelligence (BI) is the technology and strategy used to analyze business data. It is a broad field and can draw on historical, operational, or predictive data. Functions commonly grouped under business intelligence include

    • Reporting
    • OLAP
    • Analytics
    • Data mining
    • Event processing
    • Performance management

    BI solutions are usually built to work with large quantities of data in order to answer specific business questions.

    For example, an e-commerce company selling products online may want to understand what drives business value within a particular demographic that purchases certain products, in order to market to that demographic more effectively and increase sales.

What is Analytics?

  • What is Analytics?

    Analytics is a function of business intelligence and typically involves identifying patterns in data that can be used in business planning.

    The expected consequence of this is the adoption of a data-driven approach to operating a business. This is not a new function of BI, but it has received a great deal of attention in recent years, to the extent that many companies have invested heavily in data scientists to help make data-driven decisions.

Why should I care?

  • Why should I care?

    Business intelligence is an essential function that needs to be properly considered and planned for in order to get the desired results. A solid business intelligence strategy can help propel the business forward, but a loosely defined strategy can set the business back. So how can you optimize the BI process?

    • Analyze and understand all data, internal and external
    • Understand what reports and analytics your data should yield. If you increase your marketing spend in NYC for a product, what return can you expect to see?
    • Carefully plan and design your architecture and data strategy
    • Consider whether there are any regulatory or privacy concerns within the data and, if so, how you will mitigate them

Recommended approaches

Planning

  • Planning

    One of the first things you will want to do is to plan for the solution that you would like to build. Some of the things you might need to consider are

    • what exact data you need to use
    • how you will on-board and process the data
    • how you will store the data
    • how you will publish the data
    • how often the data needs to be refreshed
    • how much historical data you need

Data strategy and architecture

  • Data Strategy and Architecture

    Once you have defined your strategy, you will then need to start designing the architecture for the solution. This will include all of the technologies that you plan to use in the end-to-end workflow.

    • Where and how will you process data? This could be an on-premise, cloud, or hybrid approach.
    • Where and how will you store the data? This could be a relational database, a document or columnar database, flat files, or some combination of these, stored on-premise or in the cloud.
    • What level of performance do you require? The last thing you want is to produce a dashboard for the CEO where they have to wait 40 seconds for content to appear.

Publishing your data

  • Publishing your data

    How you publish data to end users, and how quickly they can view it on demand, is an important factor.

    • Are you only analyzing a few aggregated tables and can statically populate the structure once a day?
    • Do you need the data to be refreshed every time someone refreshes the dashboard?
    • Do you need a tool that can load and handle very high data volumes?
    • In general, analytical tools are very good at publishing and visualizing data, but they fall short when they have to join or aggregate large quantities of data. Each approach has its own challenges, so keep performance in mind when defining your requirements: if the data takes too long to load, its value is already diminished. Pre-aggregating data ahead of publishing, as in the sketch below, is one common way to keep dashboards responsive.
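
    As a rough sketch of the pre-aggregation approach mentioned above (assuming, purely for illustration, a transactions.csv file with order_date, product and amount columns), a small daily job can pre-compute the summary table the dashboard reads:

        # Minimal sketch: pre-aggregate detailed transactions once a day so the
        # dashboard reads a small summary table instead of raw data.
        # File and column names are illustrative assumptions, not from the guide.
        import pandas as pd

        def build_daily_summary(source_path: str, target_path: str) -> None:
            transactions = pd.read_csv(source_path, parse_dates=["order_date"])
            transactions["order_day"] = transactions["order_date"].dt.date

            summary = (
                transactions.groupby(["order_day", "product"], as_index=False)
                .agg(total_revenue=("amount", "sum"), order_count=("amount", "size"))
            )

            # The dashboard only ever queries this small, pre-computed table.
            summary.to_csv(target_path, index=False)

        if __name__ == "__main__":
            build_daily_summary("transactions.csv", "daily_product_summary.csv")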

Automation

  • Automation

    One of the first things you will want to consider for a BI and analytics project is how to automate the process of data ingestion. In order to understand the automation process, you need to understand the lineage of your data.

    Once the lineage is understood, you can start mapping the data from the source systems to your target analytics platform. Keep in mind, you want this to be an automated and reliable data feed so your reporting is always up-to-date.

    Other complexities can also arise, such as the need to guarantee data availability with failovers, data recovery plans, standby servers, operational continuity and so on.
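
    As a rough illustration of what such an automated feed might look like, the sketch below wires together a fetch step, a load step, retries and a failure notification; the fetch, load and notification functions are hypothetical placeholders for whatever your own pipeline uses.

        # Minimal sketch of an automated, repeatable ingestion step with retries
        # and a failure notification hook. fetch_from_source, load_into_target and
        # notify_on_failure are hypothetical placeholders for your own pipeline.
        import logging
        import time

        logging.basicConfig(level=logging.INFO)
        log = logging.getLogger("ingestion")

        def fetch_from_source() -> list:
            # Placeholder: pull new or changed records from the source system.
            return [{"id": 1, "value": 42}]

        def load_into_target(records: list) -> None:
            # Placeholder: write records into the analytics platform.
            log.info("Loaded %d records", len(records))

        def notify_on_failure(error: Exception) -> None:
            # Placeholder: e-mail, chat message or monitoring alert.
            log.error("Ingestion failed: %s", error)

        def run_feed(max_retries: int = 3, backoff_seconds: int = 60) -> None:
            for attempt in range(1, max_retries + 1):
                try:
                    load_into_target(fetch_from_source())
                    log.info("Feed completed on attempt %d", attempt)
                    return
                except Exception as error:  # broad by design: any failure should alert
                    log.warning("Attempt %d failed: %s", attempt, error)
                    if attempt == max_retries:
                        notify_on_failure(error)
                        raise
                    time.sleep(backoff_seconds)

        if __name__ == "__main__":
            run_feed()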

Data mapping

  • Data mapping

    Data mapping is a process that maps data from the format used in one system to a format used by another system. The mapping can take a significant amount of time as

    • multiple applications often store data for the same entity
    • applications can be unfamiliar or poorly documented.

    The data mapping documentation has a significant impact on the overall implementation effort as, in many cases, there is no single correct way of mapping data between different structures. Common reasons for this are

    • there is no direct one-to-one mapping between values
    • data structures representing the same entity are too different.

    Proper data mapping requires detailed knowledge from the data discovery project phase. It also usually involves substantial input from data consumers.

    The mapping process is simplified with tools that visualize the mapping between different entities and provide automation of the mapping process.

    Data mapping also needs to consider the future development of the applications involved. Data structures might change over time, and it is important for the mapping and its implementation to be able to accommodate such changes as easily as possible.
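
    As a minimal illustration of field- and value-level mapping (the field names and status codes below are hypothetical; a real mapping would come out of the data discovery phase), a simple mapping table might look like this:

        # Minimal sketch of a field-level mapping between two customer structures.
        # Field names and the status code translation are illustrative assumptions.
        SOURCE_TO_TARGET_FIELDS = {
            "cust_name": "full_name",
            "cust_mail": "email",
            "cust_status": "status",
        }

        # Example of a value-level mapping where there is no one-to-one correspondence.
        STATUS_CODES = {"A": "active", "I": "inactive", "P": "pending"}

        def map_customer(source: dict) -> dict:
            target = {}
            for source_field, target_field in SOURCE_TO_TARGET_FIELDS.items():
                target[target_field] = source.get(source_field)

            # Translate coded values into the target system's vocabulary.
            target["status"] = STATUS_CODES.get(target.get("status"), "unknown")
            return target

        print(map_customer({"cust_name": "Ada Lovelace", "cust_mail": "ada@example.com", "cust_status": "A"}))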

Data cleansing

  • Data cleansing

    During data discovery, you will often find that the data cannot be used in its current form and first needs to be cleansed. There are many different reasons for low data quality, ranging from simple ones (anything involving human data entry is likely to have various errors including typos, missing data, data misuse etc.) all the way to complex issues stemming from improper data handling practices and software bugs.

    Data cleansing is the process of taking “dirty” data in its original location and cleaning it before it is used in any data transformation. Data cleansing is often an integral part of the business logic with the data being cleaned in the transformation but left unchanged in the originating system. Other approaches can also be used. For example, a separate, clean copy of the data can be created if the data needs to be reused or if cleansing is time-consuming and requires human interaction.
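
    A minimal sketch of cleansing applied inside the transformation, leaving the source untouched, might look like the following; the specific rules (whitespace trimming, an e-mail format check, a default for a missing country) are illustrative assumptions.

        # Minimal sketch of cleansing rules applied inside the transformation,
        # working on a copy so the originating system is never modified.
        import re

        EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

        def cleanse(record: dict) -> dict:
            cleaned = dict(record)  # work on a copy of the record

            # Trim stray whitespace introduced by manual data entry.
            cleaned["name"] = (cleaned.get("name") or "").strip()

            # Invalid e-mail addresses are blanked rather than propagated downstream.
            email = (cleaned.get("email") or "").strip().lower()
            cleaned["email"] = email if EMAIL_PATTERN.match(email) else None

            # Fill missing values with an explicit default so reports can group on it.
            cleaned["country"] = (cleaned.get("country") or "UNKNOWN").upper()
            return cleaned

        print(cleanse({"name": "  Jane Doe ", "email": "JANE@EXAMPLE.COM ", "country": None}))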

Governance and safeguards

  • Governance and safeguards

    Another important aspect of the planning phase is to decide how to expose the data to users. Typical questions asked in this phase of pipeline design can include:

    • Will this be used internally?
    • Will this be used externally?
    • Who will have access and what kind of access will they have?
    • Do you have sensitive data that will need to be protected and regulated?

    These considerations are often not planned properly and result in delays, cost overruns and increased end user frustration.

Pitfalls

Ignoring data preparation

  • Ignoring data preparation

    If you are looking at a BI/Analytics solution, do not ignore your data preparation strategy. Too often, we see people spend their budget on visualization software without thinking about how they plan to prepare the data that will feed it.

    Your visualization tool is only as good as the quality of your data. So, if you have low quality data, your visualizations will not give you valuable information or a good return on your investment.

    Think about the entire picture from the creation of data to the visualization of data before deciding how you want things to be visualized.

Self service analytics is best

  • Thinking that self service analytics is best

    Self service analytics, when provisioned properly, provides access to the broadest universe of data in an organization. This is helpful: if an analyst needs access to a particular piece of data, they can expect to find it somewhere.

    However, this can become problematic as the number of data sets, and their complexity, increases. The challenge is then how to find the right data and understand the relationships between the many datasets. Beyond a certain level of complexity, self-service environments can be very time consuming and frustrating to work with.

    In such cases, the best approach is to ensure that all raw data is transformed, prepared and staged in a data store that provides friendly and efficient access to analysts.

Starting with a complicated solution

  • Starting with a complicated solution

    In general, if you start with a complicated solution, you are opening yourself up for more risk of failure. We recommend that you go the opposite direction and start small.

    Starting small generally means that rather than worrying about an expensive BI or analytical tool, you create reports that help the business team to understand their data. Once you outgrow your reports you can then start looking at more innovative ways to drive insights about your company.

    If you make data the focal point from the ground up, you will be in a better position to understand the data you currently manage and to drive better insights from it.

Choosing the wrong tool or solution

  • Choosing the wrong tool/solution

    Picking the wrong tool for the job will not only set you back in terms of time, but it will also set you back on the bottom line. Each tool/solution is different, so you need to fully understand the problem you are looking to solve and the outcome you are hoping to achieve.

    You wouldn’t use a screwdriver to hammer a nail into a piece of wood, would you? You could, but it’s probably not the most efficient tool if you need to do it 100 times.

Not planning for new sources, technologies and applications

  • New systems, applications and technologies

    In today’s world, businesses are constantly looking to optimize their systems to understand their data, expose it flexibly and extract maximum value from it.

    As a result, it is important to factor in these eventualities and be as prepared as possible to deal with

    • short term requests from management
    • system and data structure changes that might be forced on you in the future
    • new systems coming onstream that need to be incorporated

Not understanding infrastructure

  • Not understanding your infrastructure

    Most data integration projects involve IT and require a thorough understanding of your infrastructure. There are many moving parts to a project, and these all need to be understood and planned for in order to avoid unforeseen delays mid-project.

    • Appropriately configured firewall rules that allow connection to external data sources
    • Suitable access permissions to all systems
    • Knowledge of existing data structures, especially those in other domains or departments
    • Impact of temporary unavailability of source data on your ingestion processes
    • Knowledge of governance and compliance processes, such as password rotations, and how they impact ingestion

Not understanding your data quality

  • Not understanding your data quality

    Data quality is one of the most underestimated properties of data. Data in any system that has been in production for a while can have all sorts of data quality issues. These can range from simple issues such as typos through to missing or invalid data.

    Data owners often have only a vague idea about the overall quality of their data and its impact on subsequent data-oriented processes. While they will clearly understand the more obvious issues, they may be completely unaware of more complex or legacy problems.

    We would strongly recommend doing a full data quality evaluation of your production data early on in a data-centric project. This can be complicated by security restrictions, but in our experience, data in a test environment never fully captures the depth and complexity of the issues that appear in production systems.
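
    A basic profiling pass, such as the sketch below (assuming tabular data that pandas can load; the file name is a placeholder), is often enough to surface null rates, unexpected values and other surprises early:

        # Minimal profiling sketch for an early data quality evaluation:
        # per-column null rate, distinct count and min/max for numeric columns.
        import pandas as pd

        def profile(df: pd.DataFrame) -> pd.DataFrame:
            rows = []
            for column in df.columns:
                series = df[column]
                numeric = pd.api.types.is_numeric_dtype(series)
                rows.append({
                    "column": column,
                    "null_rate": round(float(series.isna().mean()), 4),
                    "distinct_values": series.nunique(dropna=True),
                    "min": series.min() if numeric else None,
                    "max": series.max() if numeric else None,
                })
            return pd.DataFrame(rows)

        if __name__ == "__main__":
            data = pd.read_csv("production_extract.csv")  # placeholder file name
            print(profile(data).to_string(index=False))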

Inadequate system knowledge

  • Inadequate system knowledge

    Data applications, especially mission-critical ones, tend to be in production for extended periods of time as companies typically do not want to invest in technology unless absolutely necessary.

    This means that institutional knowledge of applications is often lost due to inadequate documentation or staff turnover.

    A lack of system knowledge can negatively affect data migration or integration projects due to over-optimistic estimates or data mapping issues that only manifest themselves in later stages of the project.

    As such, a lack of knowledge can be very expensive and any poor estimates or deficient data mapping exercises can lead to costly project restarts and substantial budget overruns.

Custom coded solutions

  • Custom coded solutions

    Companies regularly take a coding approach when working with data. This can work perfectly well in the short term or for simpler projects. However, there are some important considerations over time.

    • As the amount of code grows, maintainability becomes a serious challenge
    • Logging is typically an afterthought, so when issues arise there is often no diagnostic information available exactly when you need it
    • Integration with new technology is slow to implement
    • Performance is rarely a consideration in the early days, but years later it often is, and serious bottlenecks can develop
    • Performance bottlenecks can require a full refactoring, which can take a great deal of time
    • Bugs can and will happen, and debugging can be difficult and cause downtime or interruptions
    • Developers carry a lot of knowledge in their heads, and when they leave they take it with them
    • Code is often poorly documented and its maintenance depends on institutional knowledge, so the departure of people familiar with the codebase can leave the company with major maintenance headaches

Changing requirements or scope

  • Changing requirements or scope

    It is not uncommon for the project scope to change during a project’s lifetime, which is why so many IT projects end up over budget.

    One of the most common reasons for this is that the involved applications or data sources are not necessarily fully understood or taken into account during the analysis phase. This is compounded by the fact that documentation of the systems involved is often incomplete, leading to guesses rather than estimates of what is required to implement the business logic. Such guesses are normally overly optimistic.

    The best way to avoid this is to be thorough and honest during the design phase and to ensure that all stakeholders are invited to contribute to the scope discussions. While this can be difficult and time-consuming in larger organizations, the benefits can easily outweigh the expense of a slightly longer design phase.

Best Practice

Start small

  • Start small

    If you haven’t started using your data to drive your business, start small.

    Define a few key metrics that you would like to understand, and build a solution that automates the reporting you are looking for. Once you have simple reports set up, you will want more information, which will entail a larger and more expensive solution.

Identify requirements early

  • Identify your requirements early

    Let’s say you are past the initial, small scale BI/Analytics solution, and are thinking about a larger project to deliver more insights into the business.

    In our experience, it’s best to define what you want at the beginning, and work towards that goal. For example, if you are looking to serve data to data analysts via a dashboard so they can define the metrics themselves, that’s a very different project than defining the metrics that are required before the project.

    One approach requires a much more flexible architecture (self-serve), whereas the other requires a more focused approach (similar to a data warehouse). Neither is better than the other; they are simply different, so you need to understand what you need before you begin the implementation.

Invest wisely

  • Invest wisely

    This is a general principle that you should follow in all aspects of your life, but it is especially true when deciding on a technology or solution that will help drive your business.

    You might consider using a free tool, only to find it is not a wise investment once you account for the time and money it takes to build and maintain a solution around it. On the other hand, you could buy an expensive tool ($250k+) and find that the output you receive from it is not what you actually need.

    Take a step back, understand your requirements, do your research, and pick a partner. Once that’s done, then you can start defining how you want to get the most from your data.

Data Quality Monitoring

  • Data quality monitoring

    Monitoring of data quality for every application involved is vital in order to prevent multiple applications from being polluted with low-quality data coming from a single source.

    Monitoring often consists of data validation rules that are applied to each record as it is transformed into its destination format.
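
    As a minimal sketch of such per-record rules (the rule names and fields are illustrative assumptions), each record can be checked against a small rule set and the failures counted for reporting:

        # Minimal sketch of per-record validation rules feeding a monitoring summary.
        from collections import Counter

        RULES = {
            "missing_id": lambda r: not r.get("id"),
            "negative_amount": lambda r: (r.get("amount") or 0) < 0,
            "missing_email": lambda r: not r.get("email"),
        }

        def monitor(records: list) -> Counter:
            issues = Counter()
            for record in records:
                for rule_name, rule in RULES.items():
                    if rule(record):
                        issues[rule_name] += 1
            return issues

        print(monitor([
            {"id": 1, "amount": 10, "email": "a@example.com"},
            {"id": None, "amount": -5, "email": ""},
        ]))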

    The choice of which data attributes to monitor, and how to monitor them, is one of the key decisions of the design phase of the whole project. Too much monitoring with overly detailed reports can overwhelm stakeholders, resulting in important problems being overlooked. On the other hand, too little monitoring is undesirable, as important observations simply go unreported.

    Proper data quality monitoring is, therefore, about carefully balancing the investment in building monitoring rules with the volume of output.

Reporting

  • Reporting

    Reporting is a key part of data migrations and other integrations, and is rarely done well. Well-executed reporting ensures that stakeholders get status information quickly and are able to react in a short timeframe. This, in turn, shortens the time it takes to determine whether or not everything was successful.

    A variety of reports could be generated. Some common examples are

    • data quality reports
    • effort reports
    • resource utilization reports

    A well designed reporting pipeline will notify relevant parties so they can take necessary action should issues arise.
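
    As a rough sketch of that notification idea (the threshold, report fields and delivery mechanism below are assumptions), a reporting step can summarize results and flag when attention is needed:

        # Minimal sketch of a reporting step: summarize data quality results and
        # flag when an error threshold is exceeded. Threshold and fields are assumptions.
        import json
        import logging

        logging.basicConfig(level=logging.INFO)
        log = logging.getLogger("reporting")

        ERROR_RATE_THRESHOLD = 0.05  # hypothetical: alert if more than 5% of records fail

        def build_report(total_records: int, failed_records: int) -> dict:
            error_rate = failed_records / total_records if total_records else 0.0
            return {
                "total_records": total_records,
                "failed_records": failed_records,
                "error_rate": round(error_rate, 4),
                "status": "ALERT" if error_rate > ERROR_RATE_THRESHOLD else "OK",
            }

        def publish(report: dict) -> None:
            # Placeholder for e-mail, chat or dashboard delivery of the report.
            log.info("Data quality report: %s", json.dumps(report))
            if report["status"] == "ALERT":
                log.warning("Error rate above threshold - notifying responsible parties")

        if __name__ == "__main__":
            publish(build_report(total_records=10_000, failed_records=620))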

BI & Analytics using CloverDX

Connectivity

  • Connectivity

    One of the first questions asked is how to connect to the many different input and output systems in existence.

    Integrations and migrations often involve dozens, or even hundreds, of different applications, each with its own storage and connectivity requirements. This can include various file types in remote and local file systems, web services, database types, messaging queues and more.

    CloverDX offers a wide variety of “connectors” that provide the desired connectivity, and also supports powerful abstraction of connectivity parameters. This makes it simple to work with the data regardless of source or target.

    For example, file processing abstraction allows you to work with your files whether local, in the Cloud or via FTP. Similarly, we offer powerful database abstraction so you can use the same set of components with any relational database.

    Such abstractions allow graph developers to build their solution locally before simply switching the connectivity parameters when ready for the production deployment.

    • Structured data: delimited and fixed-length files in any format
    • Hierarchical data: JSON, XML and custom formats
    • Excel files: XLS and XLSX
    • Relational databases: any database with JDBC driver
    • NoSQL databases: MongoDB
    • Big Data: HDFS, S3
    • REST and SOAP APIs
    • Other: SalesForce, Tableau, email, LDAP, JMS, ...

Data Structure Management (Metadata)

  • Data Structure Management

    You’ll often have to work with a large number of different data structures. Each application may have its own way of storing data, even if the applications serve the same purpose. For example, consider something as simple as storing customer information in a CRM system: there are countless ways of representing details about a person or a company, including associated contact information.

    CloverDX can be thought of as “strongly typed”. Before processing each record, CloverDX needs to fully understand the structure and attributes of the data. This ensures that the transformation can enforce static business logic checking and prevent mismatched record types in various operations. This in turn can prevent logical errors when working with seemingly similar but actually different data structures.

    CloverDX’s ability to parametrize any aspect of a data transformation allows you to design generic transformations that generate their data type information on the fly based on the data they process. This can dramatically simplify the design of graphs that operate on more complex data.
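
    As a generic illustration of the benefit of strongly typed records (this is not CloverDX metadata itself; the field names and types are assumptions), a declared structure lets a pipeline fail fast on mismatched data:

        # Generic sketch of enforcing a declared record structure before processing.
        from dataclasses import dataclass

        @dataclass
        class Customer:
            customer_id: int
            full_name: str
            email: str

        def parse_customer(raw: dict) -> Customer:
            # Fails fast if a field is missing or has the wrong type, instead of
            # letting a mismatched record flow silently through the transformation.
            return Customer(
                customer_id=int(raw["customer_id"]),
                full_name=str(raw["full_name"]),
                email=str(raw["email"]),
            )

        print(parse_customer({"customer_id": "42", "full_name": "Ada Lovelace", "email": "ada@example.com"}))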

Validation

  • Validation

    Validation is one way to ensure clean, valid data. The aim is to automate the detection of as many invalid records as possible, minimizing the need for human intervention.

    To simplify the process, CloverDX provides many built-in components with a UI based configuration, removing the need for coding.

    For example, the Validator component allows developers to quickly and visually design complex, rule-based validations. These validation rules generate information for each error that can then be fed into a reporting pipeline for the creation of quality reports.

    The ProfilerProbe component allows for easy measurement of various data properties such as data patterns, basic statistical properties of the data (minimum, maximum etc.), unique values and more.

Data mapping

  • Data Mapping

    Data mapping is a fundamental part of data migration, data ingestion and other integration processes. It is essential to support the widest range of data transformation options as well as various means of streamlining transformation design and testing.

    CloverDX provides many mapping related components ranging from the simple (sorting, deduplication, lookups etc.) through to various programmable components that allow you to code your own business logic (Reformat, Normalizer, Denormalizer, Joiners, etc.).

    Many CloverDX components allow you to code your own business logic using our own scripting language, CTL, or Java, giving you access to the widest possible range of your own or external libraries.

    CloverDX provides comprehensive code debugging. Further, we also support data debugging by allowing you to inspect data in full detail as it flows through each part of your transformation.

Reusability

  • Reusability

    In the world of software development, reusability is taken for granted through mechanisms such as functions and classes.

    In CloverDX we fully support reusability by allowing developers to build “child transformations”. A child transformation is a regular transformation that can be referenced from any other transformation. As well as providing the advantages of reusability, child transformations can significantly improve the readability of your top-level transformations.

    CloverDX also supports reusability in several other ways. You can manage your own code libraries, store externalized connection settings for your databases, create shared libraries of your data structures and more.

    CloverDX is fully compatible with all common version control systems, so you can collaborate on projects using your preferred SCM tools such as git, Mercurial and Subversion.

Automation and Jobflows

  • Automation and Jobflows

    It is important that your transformation tool allows developers to build complete pipelines consisting of several transformations and processes with inter-process dependencies.

    CloverDX handles this using Jobflows. A Jobflow executes and monitors internal processes (data transformations or other Jobflows) as well as external processes (typically scripts). Together with the powerful error handling and visual nature of the jobflow design process, CloverDX allows developers to very quickly build entire pipelines that properly react to success and failure.

    Thanks to its ability to automate entire processes, monitor them and send appropriate notifications, CloverDX can quickly become a central orchestration point, notifying responsible and interested parties of the progress of migrations and other data integration processes in real time.

    API Endpoints

    CloverDX exposes multiple APIs that are useful for integration with third-party tools. For example, it is possible to trigger jobs via REST or SOAP API calls, or to monitor various external systems for changes such as arriving files or JMS messages.
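
    As a rough sketch of triggering a job over HTTP from an external application (the server URL, endpoint path and credentials below are hypothetical placeholders; refer to the CloverDX Server documentation for the actual API), the call might look like this:

        # Minimal sketch of triggering a job over HTTP from an external scheduler.
        # URL, path and credentials are hypothetical placeholders, not the real API.
        import requests

        SERVER_URL = "https://clover.example.com"   # placeholder
        LAUNCH_PATH = "/api/launch-job"             # hypothetical path
        AUTH = ("api_user", "api_password")         # placeholder credentials

        def trigger_job(job_file: str) -> dict:
            response = requests.post(
                f"{SERVER_URL}{LAUNCH_PATH}",
                json={"job": job_file},
                auth=AUTH,
                timeout=30,
            )
            response.raise_for_status()
            return response.json()

        if __name__ == "__main__":
            print(trigger_job("jobflows/nightly_load.jbf"))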

Creating APIs

  • Creating APIs

    Many data integrations depend on being able to define and quickly expose APIs that can be consumed by external applications. In CloverDX, this can be very effectively done for REST APIs via the Data Services feature.

    CloverDX Data Services allows developers to design a REST API and visually develop transformation logic that sits behind this API call. The resulting API endpoint can be quickly published from the CloverDX Server.

    A CloverDX API endpoint can be used by many different systems such as Tableau, SalesForce and any other services that can call REST APIs.

    Typically, each call to the endpoint launches a transformation that implements the required business logic and returns the result to the caller.
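
    As a rough sketch of an external client consuming such a published endpoint (the URL and query parameter are hypothetical placeholders), the call is a plain REST request:

        # Minimal sketch of an external client calling a published REST endpoint.
        # The endpoint URL and parameter are hypothetical placeholders.
        import requests

        ENDPOINT = "https://clover.example.com/data-services/customers"  # placeholder

        def fetch_customers(country: str) -> list:
            response = requests.get(ENDPOINT, params={"country": country}, timeout=30)
            response.raise_for_status()
            return response.json()

        if __name__ == "__main__":
            for customer in fetch_customers("US"):
                print(customer)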