Contact

R20/Consultancy

+31 252-514080

info@r20.nl

Title: Lean Data Architectures to Minimize Data Copying

Subtitle: From Data-by-copying to Data-on-demand

Introduction

There was a time when people would visit a record store to buy a copy of an album to listen to at home. There was also a time when people went to a video store to rent a DVD so they could watch a movie at home. Not anymore: music and video are streamed. People no longer listen to or watch copies. Music-by-copying has been replaced by music-on-demand, and video-by-copying by video-on-demand.

Unfortunately, the world of data has not made this transition. Data is still copied, often several times, before it is even consumed. Many organizations are light-years away from data architectures that support data-on-demand.

Most data architectures are duplication-heavy: the same data is stored many times over. For example, data about a specific customer can be stored in a transactional system, a staging area, a data warehouse, several data marts, and a data lake. Even within one database, data can be stored multiple times to support different data consumers. Additionally, redundant copies of the data are stored in development and test environments. Business users copy data as well, for example from central databases to private files and spreadsheets. Moreover, data infrastructures currently consist of data lakes, data hubs, data warehouses, and data marts, and all these systems contain overlapping data.

In addition to these intra-organizational forms of data copying, massive inter-organizational copying takes place. When organizations exchange data with each other, the receiving organizations store the data in their own systems, creating even more copies of the data.

It is time for lean data architectures that minimize the copying of data. The advantages are manifold: a lean architecture is more flexible, improves productivity and maintainability, lowers data latency (enabling real-time or near-real-time data-on-demand solutions), and is less error-prone.

A lean architecture relies more heavily on the strength and performance of technology. The entire workload is no longer distributed across several data marts or across a data lake plus a data warehouse, but must be processed by a smaller number of databases. The good news is that new technology is available that can handle these bigger workloads. Fast analytical databases and scalable cloud platforms in particular make lean architectures a reality.

In the old days, several reasons existed to create data copies. But database performance, cloud technology, and network speed have improved enormously, often making copying of data unnecessary. Unfortunately, new data architectures are still being designed in which data is stored redundantly. Architects think too casually about copying data and storing it redundantly. Copying data has many drawbacks and challenges:

  • Higher data latency
  • Complex data synchronization
  • More complex data security and data privacy enforcement
  • Higher development and maintenance costs
  • Higher technology costs
  • More complex database administration
  • More complex metadata administration
  • Reduced data quality

Redundant data is introduced too easily, and this unrestrained duplication must stop. Lean data architectures aim to minimize the number of times data is copied.

During this seminar, Rick van der Lans explains how to design a lean data architecture and which solutions and technologies are available to develop one. Design guidelines for zero-copy and single-copy data architectures are presented and compared with duplication-heavy architectures. The seminar discusses how to minimize intra- and inter-organizational copying, presents the impact on existing data warehouse, data lake, and data hub architectures, and gives a complete picture of designing lean data architectures in real-life projects.

Subjects

Part 1: Unrestrained Copying of Data

  • Examples of intra-organizational data copying
  • Examples of inter-organizational data copying
  • Copying data in new data architectures, such as data lakes and data hubs
  • What is data minimization?
  • From data-by-delivery to data-on-demand
  • Risks and drawbacks of copying and duplicating data

Part 2: Justifying Lean Data Architectures

  • Business advantages of lean data architectures, such as improved time-to-market, support for (near) real-time data consumers (internal and external), improved conformance to data security and privacy, and improved data quality
  • Technical advantages of lean data architectures, such as simplified development, management, and operation of synchronization programs, and less complex database and metadata administration

Part 3: New Technologies Enabling Lean Data Architectures

  • Analytical database servers and their distributed, shared-nothing architecture
  • Translytical database servers: combining transactions and analysis
  • Cloud technology offers the required scalability and centralization of data
  • Data virtualization enables reduction of redundant data (see the sketch after this list)
  • Messaging and streaming technology
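
To make the data virtualization idea tangible, here is a deliberately minimal sketch in Python, using only the standard library's sqlite3 module rather than one of the commercial data virtualization products covered in the seminar. Two independent source databases are queried in place through a single federated connection; the database files, tables, and values are invented purely for illustration.

    import sqlite3

    # Build two independent "source systems", each in its own database
    # file (invented example data).
    crm = sqlite3.connect("crm.db")
    crm.execute("DROP TABLE IF EXISTS customers")
    crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    crm.execute("INSERT INTO customers VALUES (1, 'Alice')")
    crm.commit()
    crm.close()

    sales = sqlite3.connect("sales.db")
    sales.execute("DROP TABLE IF EXISTS orders")
    sales.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    sales.execute("INSERT INTO orders VALUES (1, 99.50)")
    sales.commit()
    sales.close()

    # The "virtualization layer": one connection attaches both sources and
    # answers a federated query on demand. No rows are copied into a
    # central store; the sources are read at query time.
    hub = sqlite3.connect(":memory:")
    hub.execute("ATTACH DATABASE 'crm.db' AS crm")
    hub.execute("ATTACH DATABASE 'sales.db' AS sales")
    query = """
        SELECT c.name, SUM(o.amount)
        FROM crm.customers AS c
        JOIN sales.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
    """
    print(hub.execute(query).fetchall())  # [('Alice', 99.5)]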

Part 4: Design Guidelines for Lean Data Architectures

  • Differences between zero-copy solutions (real-time data-on-demand) and single-copy solutions (near-real-time data-on-demand); see the sketch after this list
  • Valid reasons for copying data, such as: the source does not keep track of history, the availability level of the source is too low, or extracting data from the source is too expensive
  • Use the 1:1+ approach for table design with single-copy solutions
  • Extended copies contain data not stored by the source; reasons may be the need for artificial data, additional metadata and auditability
  • The difference between technical and functional data corrections
  • Trust the performance of database servers
  • Keeping track of data history only once
  • Copy when needed, but not by default
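
The following minimal sketch, again in Python with the standard library's sqlite3 module, illustrates the zero-copy versus single-copy distinction under invented table names. A view answers every query from the source at the moment it runs, so nothing is stored twice; a materialized copy must be refreshed and is only as fresh as the last synchronization run.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE source_orders (id INTEGER, region TEXT, amount REAL)")
    con.executemany("INSERT INTO source_orders VALUES (?, ?, ?)",
                    [(1, "EU", 10.0), (2, "US", 20.0)])

    # Zero-copy: a view. Every query runs against the source at the moment
    # it is asked, so consumers always see current data and no second copy
    # of the data is stored.
    con.execute("""CREATE VIEW revenue_per_region AS
                   SELECT region, SUM(amount) AS revenue
                   FROM source_orders GROUP BY region""")

    # Single-copy: a derived, materialized table. It must be refreshed by
    # a synchronization job, so consumers see data that is as old as the
    # last refresh.
    con.execute("""CREATE TABLE revenue_per_region_copy AS
                   SELECT region, SUM(amount) AS revenue
                   FROM source_orders GROUP BY region""")

    con.execute("INSERT INTO source_orders VALUES (3, 'EU', 5.0)")
    print(con.execute("SELECT * FROM revenue_per_region").fetchall())       # reflects the new order
    print(con.execute("SELECT * FROM revenue_per_region_copy").fetchall())  # stale until refreshed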

Part 5: Minimizing Inter-organizational Copying of Data

  • Replacing managed file transfer with data-on-demand across organizations (see the sketch after this list)
  • Challenges: Extra infrastructure needed at source, more unpredictable workloads, service-level agreements
  • Accessing geographically dispersed data sources
  • Maximizing performance of distributed queries by centralizing data in the cloud
  • What can we learn from video-streaming services, such as Netflix and Amazon Prime?
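
To illustrate the direction, here is a minimal Python sketch of a data-on-demand endpoint that could replace a nightly managed file transfer, built with only the standard library. The endpoint path, port, and data are invented; a real solution would also need the authentication, service-level agreements, and workload management mentioned in the challenges above.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def current_customer_data():
        # Placeholder for a query against the live operational database
        # (invented example data).
        return [{"id": 1, "name": "Alice"}]

    class DataOnDemandHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/customers":
                # The consuming organization receives the data as it is
                # right now, instead of storing yet another copy of a file
                # produced by last night's export job.
                body = json.dumps(current_customer_data()).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), DataOnDemandHandler).serve_forever()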

Part 6: Transforming Current Data Architectures to Lean Architectures

  • From traditional data warehouse architectures to logical data warehouse architectures
  • From physical data lake with zones and tiers to virtual data lakes
  • From data lakehouses to logical data lakehouses
  • From data fabrics to logical data fabrics
  • The impact of lean data architectures on data privacy aspects

Part 7: Closing Remarks

  • General recommendations for designing lean data architectures
  • 'Netflixing' your data

What You Will Learn:

  • How to design lean data integration architectures using examples.
  • What the real drawbacks of creating too many copies of data are, including higher data latency, complex data synchronization, more complex data security and privacy, and higher development and maintenance costs.
  • How new database, integration, and cloud technology can help to design lean data architectures that contain less copied data.
  • What the effect is of applying data minimization to data warehouse and data lake architectures.
  • How to design the data in single-copy solutions.
  • What the 1:1+ approach for data architectures means.
  • How to replace managed-file-transfer solutions with data-on-demand solutions, and how to reduce inter-organizational data flows.
  • How to design data architectures from the perspective of data processing specifications and not data stores.

Related Books:

 Data Virtualization: Selected Writings by Rick F. van der Lans

 Data Virtualization for Business Intelligence Systems by Rick F. van der Lans

Related Articles and Blogs:

 Part 1: Drowning in Data Delivery Systems, May 2018

 Part 2: Key Benefits of a Unified Data Delivery Platform, June 2018

 Part 3: How Siloed Data Delivery Systems Were Born, June 2018

 Part 4: Big Data is Not the Biggest Change in IT, June 2018

 Part 5: Requirements for a Unified Data Delivery Platform, June 2018

 Part 6: A Unified Data Delivery Platform - A Summary, June 2018

Related Whitepapers:

 The Fusion of Distributed Data Lakes - Developing Modern Data Lakes; February 2019; sponsored by TIBCO Software

 Unifying Data Delivery Systems Through Data Virtualization; October 2018; sponsored by fraXses

 Architecting the Multi-Purpose Data Lake With Data Virtualization; April 2018; sponsored by Denodo Technologies

 The Next Wave of Analytics - At the Edge; December 2017; sponsored by Edge Intelligence Software

 Developing a Data Delivery Platform with Composite Information Server; June 2010; sponsored by Cisco (Composite Software)

Geared to: Data architects; enterprise architects; solutions architects; business intelligence specialists; data analysts; data warehouse designers; business analysts; data scientists; technology planners; technical architects; IT consultants; IT strategists; systems analysts; database developers; database administrators; IT managers.