Data is a precious asset in the Information Age. Because data comes from an enormous range of sources, most organizations struggle with data integration and governance, and each company needs a dependable solution to solve this problem and enhance its business performance. This blog presents the general approaches to tackling these tasks.

Data warehouse

Definition

A data warehouse is a central repository for storing transformed data. Once data is stored in the warehouse, it is not changed. Various types of input, whether structured, semi-structured, or unstructured, are converted by the ETL process before entering the data warehouse. The main functions of a data warehouse are analyzing historical data and running queries.

Data warehouse characteristics

A data warehouse is considered a data management system that assists in making decisions. It has four distinct characteristics, as follows:

Subject-oriented

A data warehouse is designed around a specific subject, such as sales, distribution, or marketing, rather than the organization’s ongoing operations. It supports decision-making on that particular subject by modeling the data and eliminating irrelevant or redundant information.

Integrated

Data from numerous sources, including mainframes, flat files, and relational databases, is integrated to build the data warehouse. This integration establishes a standard format for all similar data derived from the different databases and enables proper data analysis. The critical point is to keep naming conventions, measurement units, encoding structures, and so on consistent.

Data integration issues (Source: Guru99.com)

Take an example from Guru99: three different applications, labeled A, B, and C, hold information about gender, date, and balance. However, each application stores the gender data in a different format, as shown in the figure above.

  • Application A: gender data is a logical value.
  • Application B: gender data is a numerical value.
  • Application C: gender data is a character value.

After processing and cleansing, the gender data from the three applications is converted to the same format in the data warehouse.
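As a minimal sketch of this kind of standardization (the mapping rules below are illustrative assumptions, not taken from Guru99), each application’s encoding can be translated into one canonical value before loading:

    # Illustrative records: each application encodes gender differently.
    records = [
        {"app": "A", "gender": True},  # Application A: logical value
        {"app": "B", "gender": 1},     # Application B: numerical value
        {"app": "C", "gender": "m"},   # Application C: character value
    ]

    # One conversion rule per source application, all mapping to 'M'/'F'.
    RULES = {
        "A": lambda v: "M" if v else "F",
        "B": lambda v: "M" if v == 1 else "F",
        "C": lambda v: v.upper(),
    }

    for r in records:
        print(r["app"], "->", RULES[r["app"]](r["gender"]))  # A -> M, B -> M, C -> M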

Time-Variant

Time variance is a distinct feature: a data warehouse is consistent within a period, meaning it is loaded daily, hourly, or on some other periodic basis and does not change within that period. This sets data warehouses apart from operational systems, which store daily business transaction records and keep only up-to-date information.

As seen in Fig 1, the operational system is the original source of data before it is loaded into the data warehouse, where past data can be queried and analyzed across a given period. In transactional systems, by contrast, old data is either moved out or deleted.

Operational system in the data warehouse

For example, suppose a customer bought products from your company. She lived in Melbourne, Australia, then in Sydney from 2008 to 2013, and has settled in Paris, France, since July 2014.

Your business could use a data warehouse to look through all of the customer’s past addresses, whereas the transactional system keeps only the current one.
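A minimal sketch of the difference, with hypothetical in-memory tables standing in for the two systems: the warehouse keeps one row per validity period, while the operational record holds only the latest value:

    from datetime import date

    # Warehouse: the full address history, one row per validity period.
    address_history = [
        {"city": "Melbourne", "from": date(2000, 1, 1), "to": date(2008, 1, 1)},
        {"city": "Sydney",    "from": date(2008, 1, 1), "to": date(2013, 12, 31)},
        {"city": "Paris",     "from": date(2014, 7, 1), "to": None},  # current
    ]

    # Operational system: only the up-to-date address survives.
    current_address = "Paris"

    def address_on(history, as_of):
        """Query the warehouse for the address valid on a given date."""
        for row in history:
            if row["from"] <= as_of and (row["to"] is None or as_of <= row["to"]):
                return row["city"]
        return "unknown"

    print(address_on(address_history, date(2010, 6, 1)))  # Sydney
    print(current_address)                                # Paris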

Non-Volatile 

As data archived in the data warehouse is permanent, it is not erased or deleted when new data is entered. Data is refreshed at predetermined times and is otherwise read-only, which suits examining and analyzing historical data.

Transactional processing, recovery, and concurrency control techniques are therefore not required. Functions such as deleting, updating, and inserting, which are performed in an operational application, are excluded from the data warehouse environment. Instead, there are only two types of data operations, as the sketch below illustrates:

  • Data Loading
  • Data Access
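A minimal sketch of this read-only discipline, using a toy in-memory table as an assumption: the only operations exposed are loading and access, never update or delete:

    class WarehouseTable:
        """Toy non-volatile store: rows are loaded and read, never changed."""

        def __init__(self):
            self._rows = []

        def load(self, batch):
            # Data Loading: append a batch at a scheduled refresh time.
            self._rows.extend(batch)

        def access(self, predicate=lambda row: True):
            # Data Access: read-only queries over the full history.
            return [row for row in self._rows if predicate(row)]

        # Deliberately no update() or delete(): archived data stays intact.

    table = WarehouseTable()
    table.load([{"year": 2021, "sales": 100}])
    table.load([{"year": 2022, "sales": 120}])  # new data does not erase old data
    print(table.access(lambda r: r["year"] == 2021))  # [{'year': 2021, 'sales': 100}]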

Data warehouse architecture

Modern data warehouses rely on a “data warehouse architecture”: the design and building blocks for storing, cleaning, and organizing data properly. The main purpose of this architecture is to determine the most effective approaches for consolidating corporate information from diverse data sources to support customized queries, business analysis, and decision-making.

According to the data warehouse model below, there are five main components in the process, including:

  • Data warehouse access tool (ETL tools)
  • Central database (Data Warehouse Database)
  • Metadata 
  • Query tools
  • Data warehouse bus

However, the components may vary according to the enterprises’ requirements and conditions.

Data warehouse access tools (ETL tools)

ETL tools extract data from the original sources and convert it into a standard format before it enters the data warehouse.

The tasks of these ETL tools include (a sketch after the list illustrates a few of them):

  • Removing irrelevant data from operational databases before loading it into the data warehouse
  • Calculating summaries of the extracted data
  • Putting default values in place for any missing data
  • De-duplicating repeated data derived from multiple original sources
  • Detecting common names and definitions for data derived from different sources and substituting them consistently.
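A hand-rolled sketch of three of these tasks (the field names, defaults, and records are illustrative assumptions, not from any particular ETL product):

    # Hypothetical records from two sources that name the same field differently.
    source_a = [{"cust_name": "Alice", "region": "EU"},
                {"cust_name": "Alice", "region": "EU"}]   # repeated record
    source_b = [{"customer": "Bob", "region": None}]      # missing value

    FIELD_MAP = {"cust_name": "customer"}  # common name substituted across sources
    DEFAULTS = {"region": "UNKNOWN"}       # default values for missing data

    def prepare(records):
        # Substitute common field names.
        renamed = [{FIELD_MAP.get(k, k): v for k, v in r.items()} for r in records]
        # Put default values in place of missing data.
        filled = [{k: (DEFAULTS.get(k, v) if v is None else v) for k, v in r.items()}
                  for r in renamed]
        # De-duplicate identical records derived from multiple sources.
        seen, unique = set(), []
        for r in filled:
            key = tuple(sorted(r.items()))
            if key not in seen:
                seen.add(key)
                unique.append(r)
        return unique

    print(prepare(source_a + source_b))
    # [{'customer': 'Alice', 'region': 'EU'}, {'customer': 'Bob', 'region': 'UNKNOWN'}]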

Central Database

A central database is a database that archives all organizational data and makes it available for analytics and reporting. The key point is that the company has to determine which types of databases are appropriate for storing information in the data warehouse. 

Recently, many enterprises have chosen relational databases as the central database to build a solid foundation. Today, relational databases are increasingly being utilized for business intelligence, data integration, and data warehousing thanks to advances in relational technology.

Central database model

Alongside relational databases in business intelligence, various alternative technologies now serve many parts of the technical architecture. Some prevalent technologies include:

  • OLAP databases
  • Massively parallel processing (MPP) databases
  • Data virtualization
  • In-database analytics
  • In-memory analytics
  • Cloud-based BI, DW, or data integration
  • NoSQL databases

Metadata

Metadata is “data about data”: information about how data is created, transformed, stored, accessed, and exploited in the enterprise. Metadata determines how data is modified and processed in the data warehouse, and it clarifies the data’s origin, usage, values, and features.

To handle metadata effectively, it is important to understand what the data represents, where it was extracted from, how it was transformed, and so on, in order to deliver consistent, complete, validated, clean, and up-to-date data for business analytics.

An example of how metadata converts raw data into valuable knowledge (a parsing sketch follows the list):

Raw data: 2501 TK019 09.08.22

  • Model number: 2501
  • Sales agent ID: TK019
  • Date: 09/08/2022
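A minimal sketch of how a metadata definition might drive this interpretation (the field layout is an assumption inferred from the example record):

    # Metadata: the position and meaning of each field in the raw record.
    LAYOUT = [("model_number", 0), ("sales_agent_id", 1), ("date", 2)]

    raw = "2501 TK019 09.08.22"

    def interpret(record):
        """Use the layout metadata to turn a raw record into named fields."""
        tokens = record.split()
        return {name: tokens[pos] for name, pos in LAYOUT}

    parsed = interpret(raw)
    day, month, year = parsed["date"].split(".")
    parsed["date"] = f"{day}/{month}/20{year}"  # normalize 09.08.22 -> 09/08/2022
    print(parsed)
    # {'model_number': '2501', 'sales_agent_id': 'TK019', 'date': '09/08/2022'}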

Metadata comes in two types that IT and software vendors often confuse: technical and business metadata.

Technical metadata: data that is processed by software tools and commonly used by data warehouse designers and administrators. For example:

  • ETL tools record source fields, mappings between sources and targets, transformations, and workflows.
  • Databases describe columns (format, size, etc.), tables, and indexes.
  • BI tools define fields and report layouts.

Business metadata: data that offers end users an easier way to understand a particular data set stored in the data warehouse. For example, the business context for reporting on weekly sales, inventory turns, or budget variances is business metadata.

Query tools

Query tools allow organizations to interact with the data warehouse system. There are four categories:

  • Query and reporting tools
  • Application development tools
  • Data mining tools
  • OLAP tools

Data Warehouse Bus

The Data Warehouse Bus determines the data flows in the data warehouse: inflow, outflow, upflow, downflow, and meta flow. When building a data bus, the dimensions and facts shared between data marts must be considered thoroughly.

Data mart

A data mart is, to put it simply, a division of a data warehouse concentrated on a single functional area of a company. It is used by a particular department, unit, or set of users, such as sales, marketing, finance, or HR, and so is commonly managed by a single department. Compared to data warehouses, data marts are smaller but more flexible, drawing data from a few sources rather than a broad range of inputs.

According to Guru99, there are five fundamental steps to implementing a data mart.

a) Designing

  • Gathering the technical as well as business requirements and locating data sources.
  • Choosing a suitable data subset.
  • Establishing the data mart’s logical and physical structure.

b) Constructing

Deploy the physical database designed in the earlier stage, including the objects used in the database schema, such as tables, indexes, and views.
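As a rough sketch of this step, assuming SQLite as a stand-in for the actual engine and a hypothetical sales subject area, the schema objects might be created like this:

    import sqlite3

    conn = sqlite3.connect(":memory:")  # stand-in for the real data mart database

    # Table: fact rows for a hypothetical sales data mart.
    conn.execute("""CREATE TABLE sales_fact (
                        sale_id   INTEGER PRIMARY KEY,
                        agent_id  TEXT,
                        sale_date TEXT,
                        amount    REAL)""")

    # Index: speeds up queries that filter by sales agent.
    conn.execute("CREATE INDEX idx_sales_agent ON sales_fact(agent_id)")

    # View: a business-friendly summary object.
    conn.execute("""CREATE VIEW sales_by_agent AS
                    SELECT agent_id, SUM(amount) AS total_amount
                    FROM sales_fact GROUP BY agent_id""")
    conn.commit()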

c) Populating

  • Mapping source data to the target data mart
  • Extracting the source data
  • Cleansing and converting the data
  • Loading the data into the data mart
  • Building and archiving metadata

d) Accessing

The access step needs to perform the following tasks:

  • Set up a meta layer that translates database structures and object names into business terms, so non-technical users can work with the data mart easily.
  • Set up and maintain the database structures.
  • Establish APIs and interfaces if needed.

e) Managing

The data mart implementation process ends with this step, which includes management duties such as:

  • Managing continuous user access
  • Optimizing and tuning system performance
  • Loading and managing fresh data into the data mart on an ongoing basis
  • Preparing recovery methods and guaranteeing the system’s availability if it fails.

Data Warehouse Reporting Layer

Through the reporting layer of the data warehouse, end users access the BI interface or database architecture. This layer serves as a dashboard for data visualization, report creation, and data extraction.

A data warehouse only accepts data in a unified format, so it is essential to use a tool such as an ETL tool to convert raw data from multiple sources into the appropriate formats. ETL tools boost the speed and efficiency of extracting, transforming, and loading huge amounts of data into the data warehouse while guaranteeing high data quality.

ETL (Extract, Transform, and Load) in the Data Warehouse

Definition

ETL stands for Extract, Transform, and Load, which are three steps in blending data from different sources into a unified view for future work. 

According to Google Cloud, ETL is “the end-to-end process by which a company takes its full breadth of data—structured and unstructured and managed by any number of teams from anywhere in the world—and gets it to a state where it’s useful for business purposes.”

To picture it simply, imagine ETL as a clothes store in which the products are data. The first step is to take the clothes from different clothing racks. Next, you wrap them in floral or polka dot wrapping paper according to the customer’s requirements. Finally, you place the wrapped clothes into the customers’ bags.

How does ETL work?

Undeniably, ETL is not as simple as a clothes store, as it requires a deeper understanding of the data and the processes behind it. As mentioned above, ETL handles data integration and loads the result into the data warehouse. ETL involves three stages:

The ETL process explanation

Extraction

The first step of the ETL process is the extraction of data. The main function of this phase is to derive data from various sources as follows: 

  • Legacy databases and storage.
  • Customer relationship management systems (CRM).
  • Enterprise resource planning systems.
  • Sensor data from the internet of things (IoT).
  • Sales and marketing applications.
  • Customer transaction data.
  • Social media and other online sources.

The data sources can come in various formats: structured data (ready for extraction as-is) or unstructured data (needing some preparation, such as removing whitespace or emoticons, before extraction), and they may include existing databases, CRM systems, the cloud, and other data warehouses.

Three approaches to extracting data:

  • Notifications-based data extraction: one of the easiest ways to derive data. Some data sources send a notification to the ETL system whenever a data change occurs, so the extraction phase only needs to pick up the new data.
  • Full Extraction: some sources cannot identify which data has changed at all, so the ETL system has to extract the entire data set from the source. This approach requires you to retain a copy of the last extract so it can be compared with the new copy. Because of the high data transfer volumes, full extraction is preferable only for small tables.
  • Partial Extraction: unlike notifications-based extraction, some sources do not announce data changes but do keep a record of which data changed. The ETL system regularly checks these sources and incrementally extracts the changed data (see the sketch after this list). The pitfall of this approach is that the ETL system cannot detect records deleted from the source data.
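A minimal sketch of partial extraction, assuming each source row carries a last-modified timestamp that can be compared against a saved watermark:

    from datetime import datetime

    # Hypothetical source rows, each stamped with its last modification time.
    source_rows = [
        {"id": 1, "value": "a", "modified": datetime(2022, 8, 1)},
        {"id": 2, "value": "b", "modified": datetime(2022, 8, 9)},
        {"id": 3, "value": "c", "modified": datetime(2022, 8, 10)},
    ]

    def extract_changed(rows, watermark):
        """Extract only rows changed since the last run; advance the watermark."""
        changed = [r for r in rows if r["modified"] > watermark]
        new_watermark = max((r["modified"] for r in changed), default=watermark)
        return changed, new_watermark

    last_run = datetime(2022, 8, 5)
    changed, last_run = extract_changed(source_rows, last_run)
    print([r["id"] for r in changed])  # [2, 3]
    # Rows deleted at the source leave no timestamp behind, so this approach
    # cannot detect them: the pitfall noted above.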

Transformation

Frequently, the data is not ready for delivery to the data warehouse in its original format. Hence, raw data goes through several sub-steps:

  1. Clean: remove irrelevant data and fix inconsistent or missing values.
  2. Standardize: convert data held in different formats into a common form.
  3. De-duplicate: eliminate redundant and repetitive data.
  4. Verify: check the accuracy of the data; the ETL system locates and flags irregular data in this step.
  5. Sort: organize the data by type.

Sometimes the data needs no transformation at all; this is called direct move or pass-through data. In other cases, several conditions must be applied, based on the requirements, to ensure data quality and availability.
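A minimal sketch chaining these sub-steps on a few hypothetical records (field names, formats, and rules are assumptions for illustration only):

    from datetime import datetime

    raw = [
        {"id": 2, "amount": "30.5", "date": "2022-08-09"},
        {"id": 1, "amount": "10",   "date": "09/08/2022"},  # different date format
        {"id": 2, "amount": "30.5", "date": "2022-08-09"},  # duplicate
        {"id": 3, "amount": None,   "date": "2022-08-10"},  # missing value
    ]

    def clean(rows):        # 1. remove records with missing values
        return [r for r in rows if all(v is not None for v in r.values())]

    def standardize(rows):  # 2. convert dates in different formats to ISO form
        for r in rows:
            for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
                try:
                    r["date"] = datetime.strptime(r["date"], fmt).date().isoformat()
                    break
                except ValueError:
                    continue
        return rows

    def deduplicate(rows):  # 3. keep one record per id
        return list({r["id"]: r for r in rows}.values())

    def verify(rows):       # 4. flag irregular data: amounts must be positive
        return [r for r in rows if float(r["amount"]) > 0]

    def sort_rows(rows):    # 5. organize the data, here by date then id
        return sorted(rows, key=lambda r: (r["date"], r["id"]))

    print(sort_rows(verify(deduplicate(standardize(clean(raw))))))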

The following example from SAP demonstrates a data transformation diagram in ETL. As the diagram shows, data derived from the Acme and Small Corp databases is loaded into DataMerge, sorted by DataSort, and then delivered into the Giant Corp database.

Data Transformation Diagrams (ETL) (Source: SAP)

Load 

The final step in the ETL process is the loading phase. At this stage, the transformed data is loaded into the new target location, the data warehouse. The data format at the target destination may be a text file, Excel file, JSON file, XML file, or database.

There are two methods to load the transformed data: 

  • Full Load

The full load writes all transformed data into new, unique records in the data warehouse or repository as a single batch. This method is simpler but more time-consuming than incremental loading.

  • Incremental Load

This loading approach compares incoming data with what is already on hand and adds a new record only when unique data is found. Although the incremental load is more manageable, data inconsistency may occur if there is a system failure.
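A minimal sketch of the two strategies against a toy target table keyed by record id (the records are illustrative assumptions):

    target = {}  # toy warehouse table keyed by record id

    def full_load(table, batch):
        """Full load: replace the table with the whole batch in one pass."""
        table.clear()
        table.update({r["id"]: r for r in batch})

    def incremental_load(table, batch):
        """Incremental load: add only records whose keys are not yet present."""
        for r in batch:
            if r["id"] not in table:
                table[r["id"]] = r

    full_load(target, [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
    incremental_load(target, [{"id": 2, "v": "b"}, {"id": 3, "v": "c"}])
    print(sorted(target))  # [1, 2, 3]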

To decide the right load strategy, it is essential to consider many factors, including customized requirements and the characteristics of data destinations (speed, capacity, data interfaces). 

Some common requirements are as below:

  1. Set up a layer for business intelligence and analytics over the data.
  2. Exploit data as a searchable database.
  3. Create a training model for a machine learning algorithm.
  4. Build up an alert system from the data.

Different types of ETL Tools

ETL tools are software applications that facilitate the ETL process: extracting raw data, transforming it into a standard format, and loading it into data warehouses. These tools streamline data management and help ensure data quality.

According to HubSpot, there are four types of ETL tools, defined by the infrastructure and the supporting organization or vendor behind them:

Enterprise Software ETL Tools

Commercial organizations develop and sell enterprise software ETL products. Since these businesses were the first to advocate for ETL tools, their solutions tend to be the most mature and reliable in the industry. These tools offer graphical user interfaces (GUIs) for designing and executing ETL pipelines, and they support relational and non-relational databases, formats such as JSON and XML, event streaming sources, and so on.

Some enterprise software ETL tools are as below: 

  • Informatica PowerCenter.
  • IBM InfoSphere DataStage.
  • Oracle Data Integrator (ODI).
  • Microsoft SQL Server Integration Services (SSIS).
  • Ab Initio.
  • SAP Data Services.
  • SAS Data Manager.

Open-Source ETL Tools

The growth of the open-source movement has extended the capabilities, speed, and sophistication of open-source ETL tools. Many are freely available and offer GUIs for designing data-sharing processes and monitoring the flow of information.

These tools have both advantages and disadvantages. A benefit of open-source solutions is that the source code is accessible, so organizations can explore the tools’ internals and extend their capabilities. A drawback is inconsistency, as commercial organizations don’t usually support them.

Some open-source ETL tools are as below: 

  • Talend Open Studio.
  • Pentaho Data Integration (PDI).
  • Hadoop.

Talend ETL Tool (Source: Medium)

Cloud-Based ETL Tools

Many cloud service providers (CSPs), such as Amazon AWS, Google Cloud Platform, and Microsoft Azure, have developed ETL tools on their own infrastructure, a result of the proliferation of cloud computing and integration-platform-as-a-service offerings.

Like open-source ETL tools, they have pros and cons. On the plus side, cloud-based ETL tools are low-latency, efficient, and elastic: computing resources expand to accommodate current data processing needs. In addition, the pipeline can be further optimized if the enterprise stores its data with the same cloud service provider.

On the other hand, a shortcoming of cloud-based ETL tools is that they are highly proprietary and tied to a specific cloud vendor’s platform. An organization cannot point them at data in other cloud storage or move them into its own data centers.

Some cloud-based ETL tools are as below:

  • AWS EMR.
  • AWS Glue.
  • Azure Data Factory.
  • Google Cloud Dataflow.

Custom ETL Tools

Many businesses create their own ETL solutions using general-purpose programming languages. This approach is highly flexible and can produce a solution tailored to the organization’s needs and workflows. The drawback is that it requires the most effort and expense to build and operate.

Popular programming languages for building custom ETL tools (a minimal sketch follows the list):

  • SQL.
  • Python.
  • Java.
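As a minimal sketch of such a hand-rolled pipeline in Python (the source data, schema, and table name are illustrative assumptions), the three stages reduce to three small functions:

    import csv
    import io
    import sqlite3

    def extract(csv_text):
        """Extract: read raw rows from a CSV source."""
        return list(csv.DictReader(io.StringIO(csv_text)))

    def transform(rows):
        """Transform: standardize names and cast amounts to numbers."""
        return [(r["customer"].strip().title(), float(r["amount"])) for r in rows]

    def load(rows, conn):
        """Load: insert the transformed rows into the target table."""
        conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        conn.commit()

    source = "customer,amount\n alice ,10\nBOB,20.5\n"
    conn = sqlite3.connect(":memory:")
    load(transform(extract(source)), conn)
    print(conn.execute("SELECT * FROM sales").fetchall())
    # [('Alice', 10.0), ('Bob', 20.5)]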

Learn more at: 4 Types of ETL tools: Description, Pros & Cons, and Use Cases 

Conclusion

A data warehouse is a destination for storing data in a unified format, but the ETL process is needed to derive data from many sources, convert it, and finally load it into the warehouse. With the rise of Big Data, ETL has become more prominent in every industry.

The ETL environment has been transformed greatly by the demands of business initiatives and imperatives. According to Deloitte, future ETL will concentrate more on data streams than on tools, and organizations are advised to take real-time latency, schema evolution, source centralization, and continuous integration into consideration. Going forward, the use of ETL will become widespread as it becomes the go-to solution for data integration, governance, quality, and security.

Don’t miss our latest updates and events – Follow us on Facebook and LinkedIn!