What is a data warehouse? A source of business intelligence
Databases are typically classified as relational (SQL) or NoSQL, and as transactional (OLTP), analytical (OLAP), or hybrid (HTAP). Departmental and special-purpose databases were initially seen as a significant improvement in business practices, but were later derided as “islands.” Attempts to create a unified database for all of a company’s data are classified as data lakes if the data is left in its native format, and as data warehouses if the data is brought into a common format and schema. Subsets of a data warehouse are called data marts.
Data warehouse defined
Basically, a data warehouse is an analytical database, usually relational, that is created from two or more data sources, typically to store historical data, which can reach petabyte scale. Data warehouses often have significant compute and memory resources for running complex queries and generating reports. They are often the data sources for business intelligence (BI) systems and machine learning.
Why use a data warehouse?
One of the primary motivations for using an enterprise data warehouse (EDW) is that an operational (OLTP) database limits the number and kinds of indexes you can create, since each extra index adds work to every write, and that limit slows down analytical queries. Once you have copied the data into the data warehouse, you can index everything you care about there to get good analytical query performance without affecting the write performance of the OLTP database.
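As a minimal sketch of this idea, the following uses two in-memory SQLite databases from Python's standard library to stand in for the OLTP system and the warehouse. The `orders` table, its columns, and the index names are hypothetical, chosen only for illustration; a real EDW would use its own loading and indexing tools.

```python
import sqlite3

ORDERS_DDL = (
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, region TEXT, sold_on TEXT, amount REAL)"
)

# Operational (OLTP) side: no secondary indexes, so inserts stay cheap.
oltp = sqlite3.connect(":memory:")
oltp.execute(ORDERS_DDL)
oltp.executemany(
    "INSERT INTO orders (region, sold_on, amount) VALUES (?, ?, ?)",
    [("east", "2021-01-05", 120.0),
     ("west", "2021-01-05", 80.0),
     ("east", "2021-01-06", 45.5)],
)

# Warehouse side: copy the rows, then index the columns analysts filter on.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(ORDERS_DDL)
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    oltp.execute("SELECT id, region, sold_on, amount FROM orders").fetchall(),
)
warehouse.execute("CREATE INDEX idx_orders_region ON orders (region)")
warehouse.execute("CREATE INDEX idx_orders_sold_on ON orders (sold_on)")


def index_names(conn):
    """Return the names of all indexes defined in a database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name"
    )
    return [name for (name,) in rows]


# An analytical query that can use the warehouse-only index on region.
east_total = warehouse.execute(
    "SELECT SUM(amount) FROM orders WHERE region = ?", ("east",)
).fetchone()[0]

print(index_names(oltp))       # indexes on the OLTP copy
print(index_names(warehouse))  # indexes on the warehouse copy
print(east_total)
```

The OLTP table keeps only its primary key, while the warehouse copy carries the extra indexes, so the cost of maintaining them never touches the operational write path.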
Another reason to use an enterprise data warehouse is that it lets you combine data from multiple sources for analysis. For example, an OLTP sales application probably doesn’t need to know the weather at your sales locations, but your sales forecasts could make use of that data. If you add historical weather data to your data warehouse, it becomes easy to factor it into your models of historical sales data.
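The sales-and-weather example might look like the following sketch, again using an in-memory SQLite database as a stand-in for the warehouse. The table names, columns, and sample values are assumptions made up for illustration.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE daily_sales (sold_on TEXT, store TEXT, revenue REAL);
    CREATE TABLE daily_weather (observed_on TEXT, store TEXT, rain_mm REAL);

    INSERT INTO daily_sales VALUES
        ('2021-06-01', 'downtown', 900.0),
        ('2021-06-02', 'downtown', 400.0);
    INSERT INTO daily_weather VALUES
        ('2021-06-01', 'downtown', 0.0),
        ('2021-06-02', 'downtown', 12.5);
""")

# Join each day's sales to the weather observed at the same store that day.
sales_with_weather = warehouse.execute("""
    SELECT s.sold_on, s.store, s.revenue, w.rain_mm
    FROM daily_sales AS s
    JOIN daily_weather AS w
      ON w.observed_on = s.sold_on AND w.store = s.store
    ORDER BY s.sold_on
""").fetchall()

for row in sales_with_weather:
    print(row)
```

Because both data sets live in the same warehouse, a single join is enough to give a forecasting model revenue and rainfall side by side.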
Data warehouse and data lake
A data lake, which stores data files in their native format, is essentially “schema on read.” This means that any application that reads data from the lake must impose its own types and relationships on the data. A data warehouse, on the other hand, is “schema on write”: data types, indexes, and relationships are imposed on the data when it is stored in the EDW.
“Schema on read” is suitable for data that may be used in several contexts. It poses little risk of losing data, although there is a risk that the data will never be used at all. (Qubole, a vendor of cloud data warehouse tools for data lakes, estimates that 90% of the data in most data lakes is inactive.) “Schema on write” is suitable for data that has a specific purpose and that must relate correctly to data from other sources. The risk is that incorrectly formatted data will fail to convert to the desired data type and will be discarded during import.
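The trade-off between the two approaches can be sketched in a few lines of plain Python. The record layout and the function names (`parse`, `read_from_lake`, `load_into_warehouse`) are hypothetical and only illustrate where the schema is enforced; real lakes and warehouses do this with their own tooling.

```python
import json
from datetime import date

raw_lines = [
    '{"sold_on": "2021-03-01", "amount": "19.99"}',
    '{"sold_on": "03/02/2021", "amount": "5.00"}',  # date is not ISO formatted
]


def parse(record):
    """Impose the schema: an ISO date and a numeric amount."""
    return {
        "sold_on": date.fromisoformat(record["sold_on"]),
        "amount": float(record["amount"]),
    }


# Schema on read: the lake keeps every raw line untouched.
lake = list(raw_lines)


def read_from_lake(lines):
    """Each reader applies its own schema and skips what it cannot parse."""
    typed = []
    for line in lines:
        try:
            typed.append(parse(json.loads(line)))
        except (ValueError, KeyError):
            continue  # skipped by this reader, but still present in the lake
    return typed


def load_into_warehouse(lines):
    """Schema on write: rows that fail conversion never reach the table."""
    table, rejected = [], []
    for line in lines:
        try:
            table.append(parse(json.loads(line)))
        except (ValueError, KeyError):
            rejected.append(line)
    return table, rejected


table, rejected = load_into_warehouse(raw_lines)
print(len(lake), len(read_from_lake(lake)), len(table), len(rejected))
```

The badly formatted line survives in the lake, where a more lenient reader could still recover it, but it is dropped from the warehouse table unless the loader keeps the rejects somewhere.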
Data warehouse and data mart
A data warehouse contains data for the entire company, while a data mart contains data for a specific line of business. A data mart can be dependent on the data warehouse, independent of the data warehouse (that is, derived from an operational database or an external source), or a hybrid of the two.
Reasons for creating a data mart include using less space, returning query results faster, and costing less to run than a full data warehouse. Data marts often contain summarized and selected data instead of, or in addition to, the detailed data found in the data warehouse.
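A dependent data mart of this kind can be sketched as a summary table built from the warehouse's detailed rows. The `sales` table, the `retail_mart` name, and the sample figures below are assumptions for illustration, using SQLite in place of a real EDW.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE sales (
        order_id INTEGER, line_of_business TEXT, region TEXT, amount REAL
    );
    INSERT INTO sales VALUES
        (1, 'retail', 'east', 100.0),
        (2, 'retail', 'east', 50.0),
        (3, 'retail', 'west', 70.0),
        (4, 'wholesale', 'east', 900.0);

    -- Dependent data mart: one line of business, summarized by region.
    CREATE TABLE retail_mart AS
        SELECT region, COUNT(*) AS order_count, SUM(amount) AS revenue
        FROM sales
        WHERE line_of_business = 'retail'
        GROUP BY region;
""")

retail_mart = warehouse.execute(
    "SELECT region, order_count, revenue FROM retail_mart ORDER BY region"
).fetchall()
print(retail_mart)
```

The mart holds two summary rows instead of four detailed ones, which is the space and query-time saving described above, at the cost of losing order-level detail.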
Data warehouse architecture
Typically, a data warehouse has a layered architecture: source data, a staging database, ETL (extract, transform, and load) or ELT (extract, load, and transform) tools, the data storage proper, and data presentation tools. Each layer serves a different purpose.
Source data often includes operational databases from sales, marketing, and other parts of the business. It can also include social media and external data such as surveys and demographics.
The staging (intermediate) layer stores the data retrieved from the data sources. If a source is unstructured, such as social media text, this is where a schema is imposed. It is also where quality checks are applied to remove poor-quality data and correct common errors. ETL tools extract the data, perform any necessary mappings and transformations, and load the data into the data storage layer.
ELT tools load the data first and transform it later. If you use ELT tools, you can also use a data lake and skip the traditional staging database layer.
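The difference between the two orderings can be sketched as follows. The functions `extract`, `transform`, `run_etl`, and `run_elt` and the sample records are hypothetical; they illustrate only the order of the steps, not any particular ETL or ELT product.

```python
def extract():
    """Pull raw records from a (hypothetical) source system."""
    return [
        {"customer": " Ada ", "amount": "10.50"},
        {"customer": "Bob", "amount": "n/a"},  # poor-quality record
    ]


def transform(rows):
    """Trim names, convert amounts, and drop rows that fail conversion."""
    clean = []
    for row in rows:
        try:
            clean.append(
                {"customer": row["customer"].strip(),
                 "amount": float(row["amount"])}
            )
        except ValueError:
            continue
    return clean


def run_etl():
    """ETL: transform before loading, so storage only sees clean rows."""
    storage = []
    storage.extend(transform(extract()))
    return storage


def run_elt():
    """ELT: load the raw rows first, then transform them inside the target."""
    raw_zone = list(extract())       # load step: raw data lands as-is
    analytics = transform(raw_zone)  # transform step runs later
    return raw_zone, analytics


etl_rows = run_etl()
raw_zone, elt_rows = run_elt()
print(etl_rows)
print(len(raw_zone), elt_rows)
```

Both paths end with the same clean rows, but only the ELT path still has the original raw records available for a different transformation later.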
The data storage layer of a data warehouse contains cleansed, transformed data that is ready for analysis. It is often a row-oriented relational store, but it can also be column-oriented or have inverted-list indexes for full-text search. Data warehouses often have many more indexes than operational data stores to speed up analytical queries.
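To make the row-oriented versus column-oriented distinction concrete, here is a minimal in-memory sketch. The field names and values are made up, and real storage engines add compression, paging, and much more; the point is only how the same data is laid out.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 80.0},
    {"order_id": 3, "region": "east", "amount": 45.5},
]

# Column-oriented layout: each field is stored as its own contiguous list.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one field touches every record in the row layout...
total_from_rows = sum(row["amount"] for row in rows)

# ...but only a single list in the column layout.
total_from_columns = sum(columns["amount"])

print(columns["region"])
print(total_from_rows, total_from_columns)
```

Analytical queries that aggregate a few fields across many records tend to favor the column layout, while fetching or updating one whole record is simpler in the row layout.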
Data presentation from a data warehouse is often done by running SQL queries, which can be built with the help of GUI tools. The output of the SQL queries is then used to create charts, dashboards, reports, and forecasts, often with business intelligence (BI) tools.
Recently, data warehouses have started to support machine learning to improve the quality of models and forecasts. For example, Google BigQuery has added SQL statements that support linear regression models for forecasting and binary logistic regression models for classification. Some data warehouses have even integrated with deep learning libraries and automated machine learning (AutoML) tools.
Cloud data warehouse and on-premise data warehouse
A data warehouse can be implemented on-premises, in the cloud, or as a hybrid. Historically, data warehouses were always on-premises, but the capital cost and lack of scalability of on-premises data center servers can be an issue. EDW installations grew when vendors began offering data warehouse appliances. Now, however, the trend is to move all or part of the data warehouse to the cloud to take advantage of the inherent scalability of cloud EDWs and the ease of connecting to other cloud services.
The downside of putting petabytes of data in the cloud is the operational cost, both for cloud data storage and for cloud data warehouse compute and memory resources. The time needed to upload petabytes of data to the cloud may seem like a major hurdle, but the hyperscale cloud providers now offer high-capacity, disk-based data transfer services.
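A rough back-of-the-envelope calculation shows why network upload looks like a hurdle at this scale. The link speed below (a fully utilized 1 Gbps connection) is an assumption chosen for illustration, not a figure from any provider, and the helper `upload_days` is hypothetical.

```python
def upload_days(data_bytes, link_bits_per_second):
    """Days needed to push data_bytes over a link at full, sustained speed."""
    seconds = data_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15           # decimal petabyte, in bytes
GIGABIT_PER_SECOND = 10**9  # assumed sustained link speed, in bits per second

days = upload_days(PETABYTE, GIGABIT_PER_SECOND)
print(round(days, 1))  # 92.6
```

Under these assumptions a single petabyte takes roughly three months to upload, which is why shipping disks can be faster than the network.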
Top-down and bottom-up data warehouse design
There are two main schools of thought on how to design a data warehouse. The difference between them lies in the direction of data flow between the data warehouse and the data marts.
The top-down design (known as the Inmon approach) treats the data warehouse as the centralized data repository for the entire enterprise. Data marts are derived from the data warehouse.
The bottom-up design (known as the Kimball approach) treats the data marts as primary and combines them into the data warehouse. In Kimball’s definition, a data warehouse is “a copy of transaction data specifically structured for query and analysis.”
Insurance and manufacturing applications of the EDW tend to favor the Inmon top-down design approach. Marketing tends to favor the Kimball approach.
Data lake, data mart or data warehouse?
Ultimately, all decisions related to an enterprise data warehouse boil down to the company’s goals, resources, and budget. The first question is whether you need a data warehouse at all. The next task is to identify your data sources, their size, their current growth rate, and what you are currently doing to use and analyze them. You can then start experimenting with data lakes, data marts, and data warehouses to see what works for your organization.
We recommend a proof of concept using a small subset of the data, hosted either on your existing on-premises hardware or on a small cloud installation. Once you have validated your design and demonstrated its benefits to your organization, you can scale up to a full-fledged installation with full management support.
Copyright © 2021 IDG Communications, Inc.