MSBI : BI # 69 : Business Intelligence – Tools & Theory # 61 : Implementing Business Intelligence #3 : BI Target Database & Data Mart

Hi Folks,

This post is part of the series Business Intelligence – Tools & Theory.

The topics currently running in this series are listed below:

Series Business Intelligence – Tools & Theory

>>Chapter 1 : Business Intelligence an Introduction

>>Chapter 2 : Business Intelligence Essentials

>>Chapter 3 : Business Intelligence Types

>>Chapter 4 : Architecting the Data

>>Chapter 5 : Introduction of Data Mining

>>Chapter 6 : Data Mining Techniques

>>Chapter 7 : Introduction to Data Warehousing

>>Chapter 8 : Different Ways of Data Warehousing

>>Chapter 9 : Knowledge Management

>>Chapter 10 : Data Extraction

>>Chapter 11 : Business Intelligence Life Cycle

>>Chapter 12 : Business Intelligence User Model

>>Chapter 13 : Business Intelligence Issues and Challenges

>>Chapter 14 : Business Intelligence Strategy and Roadmap

>>Chapter 15 : Implementing Business Intelligence <You are here>

Continuing from my previous post in this series; if you have missed any part, please visit the links above.

We are going to cover the following points in this article:

  • BI Target Database
  • Data Mart

BI Target Database

In contrast to the data-in philosophy of the operational systems (data entry), the data-out philosophy of the BI applications (reporting and querying) rests on the following design considerations:

· The BI target databases are designed for simplified, high-performance data retrieval, not for efficiency of data storage and maintenance, which are the vital design considerations for operational databases.

· Eliminating or minimizing data redundancy is not a goal when designing BI target databases. If a choice has to be made, redundancy is chosen over complexity, but the redundancy must be controlled: redundant data must be consistent and reconcilable. The essential presumptions for designing BI target databases are:

· Data is stored so that it is easily accessible in the ways that interest business people.

· The design is driven by access and usage patterns.

· A normalized design is not intuitive for a business person and can therefore become quite complex.

· BI data cannot be invented. All data in the BI target databases must be present in, or derivable from, the current internal or external operational data sources.

A main decision for every BI application is whether, and at what level, to store summarized data in the BI target databases. The database administrator and the lead developer may choose to store both detailed and summarized data, either together in the same BI target database or in different BI target databases. This database design decision should be based on access and usage requirements.

Logical Database Design

Because the goals and purposes of operational systems and BI applications differ, different database design techniques have been developed for the BI target databases. These highly denormalized designs store aggregated and summarized data in a multidimensional fashion. The logical database designs are documented as physical data models along with the technical metadata.

Aggregation and summarization are possibly the most important contributors to good BI application performance. If most business analysts need to see summarized data, those totals should be pre-calculated and stored for fast retrieval. It is necessary to discuss the level of granularity with the business representative, as well as with the other business analysts who will use the BI target databases, since they will expect the database design to let them drill down to some level of detail.
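As a minimal T-SQL sketch of such pre-calculated summarization, the following builds a monthly summary table from a hypothetical detail-level sales fact table; all table and column names here are illustrative assumptions, not from the original text:

```sql
-- Hypothetical monthly summary of a detail-level sales fact table.
-- Pre-computing the totals once avoids re-aggregating millions of
-- detail rows every time an analyst asks for monthly figures.
CREATE TABLE Sales_Monthly_Summary (
    Product_Key  INT           NOT NULL,
    Year_Month   CHAR(6)       NOT NULL,  -- e.g. '202401'
    Total_Qty    INT           NOT NULL,
    Total_Amount DECIMAL(18,2) NOT NULL,
    CONSTRAINT PK_Sales_Monthly_Summary
        PRIMARY KEY (Product_Key, Year_Month)
);

INSERT INTO Sales_Monthly_Summary
SELECT Product_Key,
       CONVERT(CHAR(6), Sale_Date, 112),  -- 'yyyymm' slice of 'yyyymmdd'
       SUM(Quantity),
       SUM(Sale_Amount)
FROM   Sales_Fact                         -- assumed detail fact table
GROUP BY Product_Key, CONVERT(CHAR(6), Sale_Date, 112);
```

Analysts who need to drill below the monthly level would then be routed to the detail table, which is why the granularity discussion with the business side matters.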

Multidimensional database designs support fast retrieval of a broad range of data. The two popular multidimensional design techniques are the star schema and the snowflake schema.

The Star Schema

The star schema is the simplest form of dimensional modeling. It consists of one or more fact tables referencing a number of dimension tables. The resulting schema resembles a starburst, with the central fact table and the surrounding dimension tables arranged in a radial pattern.

In a star schema, the data is represented as an array of pre-calculated values, called facts, around which the analysis is performed. These pre-calculated facts represent atomic operational data values that have been pre-summarized along certain dimensions, such as customer, product, and time. A dimension in a star schema is similar to an entity in a logical data model; it is a business object about which data is collected for the business process.

The star schema reflects the view of the business query. As the name indicates, the star schema has a single object in the middle, called the fact table, which is linked in a radial fashion to many objects, called the dimension tables. Figure 15.2 gives an example of a star schema for a store.

Figure 15.2: Star schema

A star schema has only two levels:

· A fact table

· A series of single-level dimension tables

Fact tables have several characteristics. A fact table:

· Represents an important business event, that is, a business activity or transaction, such as a sale or a claim.

· Stores the quantifiable aspects of the business event as its columns (the measures).

· Relates to its dimension tables, that is, the business objects such as customer or product.

· Has a long composite key consisting of the primary keys of the related dimension tables, which are foreign keys in the fact table.

A number of highly redundant fact tables may exist for a given subject area, each containing a different aggregation level of the same data.

Dimension tables also have several characteristics. Dimension tables:

· Represent the business objects, that is, the various views from which the facts in a fact table can be viewed and analyzed.

· Usually have a single-attribute primary key.

· Are denormalized, that is, data belonging together from a particular business perspective, such as a roll-up hierarchy, is put into one table. This produces some redundant data values, which is acceptable in this design schema.

· Are short and wide: the tables have relatively few rows (short) but many columns (wide).

· Should, whenever possible, be shared by the fact tables (conformed dimensions).

· Often include a time dimension, with attributes describing the timestamp, such as calendar year, quarter, season, fiscal period, or accounting period. Other examples of common dimension tables are customer, product, policy, sales representative, region, and store. A sketch of these structures follows this list.
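To make these characteristics concrete, here is a minimal T-SQL sketch of a store star schema along the lines of Figure 15.2; the table and column names are illustrative assumptions:

```sql
-- Dimension tables: short and wide, single-attribute primary keys.
CREATE TABLE Dim_Date (
    Date_Key      INT      NOT NULL PRIMARY KEY,  -- e.g. 20240131
    Calendar_Year SMALLINT NOT NULL,
    Quarter       TINYINT  NOT NULL,
    Fiscal_Period CHAR(6)  NOT NULL
);

CREATE TABLE Dim_Product (
    Product_Key  INT          NOT NULL PRIMARY KEY,
    Product_Name VARCHAR(100) NOT NULL,
    Category     VARCHAR(50)  NOT NULL  -- denormalized roll-up hierarchy
);

CREATE TABLE Dim_Store (
    Store_Key  INT          NOT NULL PRIMARY KEY,
    Store_Name VARCHAR(100) NOT NULL,
    Region     VARCHAR(50)  NOT NULL
);

-- Fact table: long composite key built from the dimension foreign keys,
-- plus the quantifiable measures of the business event.
CREATE TABLE Fact_Sales (
    Date_Key    INT NOT NULL REFERENCES Dim_Date(Date_Key),
    Product_Key INT NOT NULL REFERENCES Dim_Product(Product_Key),
    Store_Key   INT NOT NULL REFERENCES Dim_Store(Store_Key),
    Quantity    INT           NOT NULL,
    Sale_Amount DECIMAL(18,2) NOT NULL,
    CONSTRAINT PK_Fact_Sales PRIMARY KEY (Date_Key, Product_Key, Store_Key)
);
```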

Most multidimensional DBMSs deal successfully with the optimization of large multi-table JOINs.

One way to find out whether the DBMS is resolving the query efficiently is to look at the optimized plan for the query. For example:

· If the fact table is the last table joined, this indicates good optimization. If the fact table appears somewhere in the middle, or even at the beginning, the DBMS might not resolve the JOIN optimally unless it uses more sophisticated JOIN algorithms.

· If the DBMS does not use Cartesian product JOINs, it might take the qualifying dimension row keys and apply them to a composite fact table index, or it might apply them through an index intersection against multiple single-column fact table indices.

In either case, verify that the DBMS is executing the multidimensional queries in the most efficient manner, since performance depends on it.
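As an example against the sketch above, the following star-join query asks SQL Server for its estimated plan via SET SHOWPLAN_ALL, one of several plan-inspection options; the names remain illustrative:

```sql
-- Ask SQL Server for the estimated plan instead of executing the query.
SET SHOWPLAN_ALL ON;
GO
SELECT d.Calendar_Year,
       p.Category,
       SUM(f.Sale_Amount) AS Total_Sales
FROM   Fact_Sales  f
JOIN   Dim_Date    d ON d.Date_Key    = f.Date_Key
JOIN   Dim_Product p ON p.Product_Key = f.Product_Key
WHERE  d.Calendar_Year = 2024
GROUP BY d.Calendar_Year, p.Category;
GO
SET SHOWPLAN_ALL OFF;
GO
```

In the resulting plan, look at where the fact table sits in the join order, as described above.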

The star schema is the most widely accepted database design schema for BI applications, for several reasons:

· Yielding the best performance for trend analysis, queries, and reports that include years of historical data.

· Providing maximum flexibility for multidimensional data analysis.

· Being supported by most relational DBMS vendors, who have made changes to their DBMS optimizers for it.

· Making complex data analysis much less complicated than with a standard normalized design.

For example, it is easier to ask questions such as the following:

o Which insurance broker is providing the most or the least lucrative business?

o What are the most frequently occurring types of claims from that broker?

o When are the claims occurring?

· Enabling addressing and search schemes that require few seeks, often only one per retrieval.

· Running many operations in parallel.

Partitioning

Make sure that the tables are partitioned well across multiple disks. This is especially important for very large databases (VLDBs), where the fact tables can reach many hundreds of gigabytes. Partitioning allows the data of one "logical" table to be distributed across multiple physical datasets. The physical data distribution is based on a partitioning column, which is usually a date. The partitioning column must be part of the table's primary key, cannot be a derived column, and cannot contain NULL values. Partitioning also allows you to back up and restore a portion of a table without affecting the availability of other portions of the same table that are not being backed up or restored.
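In SQL Server this is expressed with a partition function and a partition scheme; a minimal sketch, with the filegroups simplified to PRIMARY and all names assumed:

```sql
-- Partition by month: rows route to a partition based on Date_Key ranges.
CREATE PARTITION FUNCTION pf_SalesByMonth (INT)
AS RANGE RIGHT FOR VALUES (20240101, 20240201, 20240301);

-- Map each partition to a filegroup (all to PRIMARY here for brevity;
-- in a real VLDB each filegroup would sit on a different disk).
CREATE PARTITION SCHEME ps_SalesByMonth
AS PARTITION pf_SalesByMonth ALL TO ([PRIMARY]);

-- The partitioning column must be part of the primary key.
CREATE TABLE Fact_Sales_Partitioned (
    Date_Key    INT NOT NULL,
    Product_Key INT NOT NULL,
    Store_Key   INT NOT NULL,
    Quantity    INT           NOT NULL,
    Sale_Amount DECIMAL(18,2) NOT NULL,
    CONSTRAINT PK_Fact_Sales_Part
        PRIMARY KEY (Date_Key, Product_Key, Store_Key)
) ON ps_SalesByMonth (Date_Key);
```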

Clustering

Define the clustered table requirements, and physically co-locate related tables on the disk drive. Clustering is a very helpful technique for sequential access of large amounts of data. It is achieved through clustering indices, which determine the sequential order in which the rows of a table are physically stored in the datasets. Cluster the primary key of each table to avoid page splits, that is, to ensure that new rows inserted into a table are stored sequentially on disk according to the columns of the clustering index. This technique can significantly enhance performance, because sequential access of data is the norm in BI applications. When the rows of a table are no longer stored in the same order as the clustering index (data fragmentation), performance will suffer and the table will have to be reorganized.
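In SQL Server terms, clustering corresponds to a clustered index; a minimal sketch, assuming a hypothetical heap table Claims_Fact with no existing clustered index:

```sql
-- Physically order claim rows by claim date so that the sequential,
-- date-range reads typical of BI queries hit contiguous pages.
CREATE CLUSTERED INDEX CIX_Claims_Fact_Date
    ON Claims_Fact (Claim_Date, Policy_Key);
```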

Indexing

Two extreme indexing strategies are to index everything and to index nothing; neither is advisable. Instead of swinging to either extreme, index the columns that are frequently searched and have a high distribution of values (many distinct values), such as Account Open Date. Do not index columns that have a low distribution of values, such as Gender Code.

Once you have decided which columns to index, choose the indexing strategy to use. Most DBMSs offer a variety of access methods to select from: either sequential access, or direct access using any of the following well-known indexing algorithms:

· B-tree

· Hash

· Binary

· Sparse
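SQL Server's ordinary indexes, for example, are B-tree based; a minimal sketch of the advice above, with the table and column names assumed:

```sql
-- High-cardinality, frequently searched column: a good index candidate.
CREATE NONCLUSTERED INDEX IX_Customer_AcctOpenDate
    ON Dim_Customer (Account_Open_Date);

-- A low-cardinality column such as Gender_Code would be left unindexed;
-- with only a few distinct values, an index is rarely selective enough.
```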

Reorganizations

From time to time the databases will need to be reorganized, because the incremental loads fragment the datasets over time and the inserted rows are no longer stored in a logical sequence. This fragmentation can result in long data retrieval chains, and performance can drop off significantly. Most DBMSs provide reorganization routines to rearrange a fragmented database in order to reclaim the space occupied by deleted data or to move records from overflow areas into free space in the prime data areas.

The basic activities involved in reorganizing a database are to copy the old database onto another device, re-block the rows, and reload them. This is not a small effort for BI target databases. The good news is that most DBMSs can run a partial reorganization routine on individual database partitions, which is one more reason why the database administrator partitions the BI target databases.
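In SQL Server, for instance, fragmentation can be measured and corrected per partition; a sketch using the assumed partitioned fact table from earlier:

```sql
-- Inspect fragmentation for each partition of the fact table's indexes.
SELECT index_id, partition_number, avg_fragmentation_in_percent
FROM   sys.dm_db_index_physical_stats(
           DB_ID(), OBJECT_ID('Fact_Sales_Partitioned'),
           NULL, NULL, 'LIMITED');

-- Reorganize just one fragmented partition rather than the whole table.
ALTER INDEX PK_Fact_Sales_Part
    ON Fact_Sales_Partitioned
    REORGANIZE PARTITION = 2;
```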

Backup and recovery

Since software and hardware can fail, it is important to set up backup and recovery procedures. The DBMSs provide utilities to take full backups as well as incremental backups. Many organizations are under the mistaken impression that the BI target databases can simply be recreated from the original source data. They fail to realize that it could take a very long time to re-establish the BI target databases if all the initial and historical extract/transform/load (ETL) programs had to be rerun, assuming the original source files are even still available.

Disaster recovery is also an issue for BI applications. If the backup tapes or cartridges are damaged during a disaster, it can be hard, and can take a very long time, to recreate the BI target databases. For the same reason, many companies prefer to store their database backups at remote locations.
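A minimal T-SQL sketch of such a backup procedure, with the database name and paths assumed:

```sql
-- Weekly full backup of the BI target database.
BACKUP DATABASE BITargetDB
    TO DISK = N'E:\Backups\BITargetDB_full.bak'
    WITH INIT, CHECKSUM;

-- Nightly differential backup: only changes since the last full backup.
-- Copies of both files would then be shipped to a remote location.
BACKUP DATABASE BITargetDB
    TO DISK = N'E:\Backups\BITargetDB_diff.bak'
    WITH DIFFERENTIAL, INIT, CHECKSUM;
```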

Snowflake Schema: In a snowflake schema, some dimension tables are normalized, and their data is thereby broken out into additional tables. The resulting schema resembles a snowflake.
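For contrast with the star schema sketch above, a snowflaked version would normalize the product dimension's roll-up hierarchy into its own table (names again assumed):

```sql
-- Category is moved out of Dim_Product into its own normalized table.
CREATE TABLE Dim_Category (
    Category_Key  INT         NOT NULL PRIMARY KEY,
    Category_Name VARCHAR(50) NOT NULL
);

CREATE TABLE Dim_Product_Snowflaked (
    Product_Key  INT          NOT NULL PRIMARY KEY,
    Product_Name VARCHAR(100) NOT NULL,
    Category_Key INT NOT NULL REFERENCES Dim_Category(Category_Key)
);
```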

(For additional information on the star schema and the snowflake schema, refer to Chapter 7: Introduction to Data Warehousing.)

Data Mart

A data mart is a database that has the same characteristics as a data warehouse but is usually smaller and is focused on the data of one division or one workgroup within an enterprise.

There are three different views of the data mart's place in the world of data warehousing:

· Specialized data marts are created from a subset of the information in the data warehouse. These are easier to use because they hold only the specific information that a particular user group requires. Using many data marts allows the querying load to be spread across several computers, which can also minimize network traffic.

· Free-standing data marts are developed independently of the data warehouse. The information for such a data mart might come from just one legacy system, and it is faster and cheaper to develop a separate data mart than to build an enterprise-wide data warehouse and derive the data marts from it. The drawback of this approach is that the company's data will not be integrated, which violates one of Bill Inmon's original defining features of a data warehouse. If several separate data marts are built using this strategy, they will usually contain data that is duplicated and inconsistent.

· The data mart serves as the prototype or first step in the data warehousing process. An enterprise picks the division or group that would benefit most from data-based knowledge and builds a data mart with that group's data. Additional types of information are added to the data mart over time until it grows into the data warehouse.

The data mart possibly has a marketing advantage over the data warehouse. The entire data warehousing process is about creating data-based knowledge and bringing that knowledge to people. A warehouse is a place where things are stored away; a mart is a convenient place to buy something. Most data warehousing professionals include quick access to information as a defining feature of the term 'data warehouse'. There are three keys to implementing a data mart faster:

· Follow an iterative, phased methodology: Spend most of the project time focusing on the particular business value the end user wants, and over several iterations build the solution toward that vision.

· Hold to a fixed time for each phase: If, for example, two weeks are set aside for scoping, stick to that window. Do not extend any phase, especially the early ones, unless the project is otherwise doomed to failure.

· Avoid scope creep at all costs: Though scope creep is costly and dangerous in any project, data warehousing or otherwise, feature requests that keep creeping in long past the cutoff point can destroy a data mart effort. Last-minute features usually add complexity to the data mart with only slight incremental business value, if any, and so can put the whole project at risk.

Hope you like the Series Business Intelligence – Tools & Theory!

If you have not yet subscribed to this blog, please subscribe from the "follow me" tab!

That way you will be updated in real time, with all the updated knowledge in your mail daily for free, without any RSS subscription or news reading!!

Happy Learning and Sharing !!

For more information related to the BI world, visit all our Mentalist network blogs:

SQL Server Mentalist … SQL Learning Blog

Business Intelligence Mentalist … Business Intelligence World

Microsoft Mentalist … MVC,ASP.NET, WCF & LinQ

MSBI Mentalist … MS BI and SQL Server

NMUG Bloggers …Navi Mumbai User Group Blog

Architectural Shack … Architectural implementation and design patterns

DBA Mentalist …Advance SQL Server Blog

MVC Mentalist … MVC Learning Blog

Link Mentalist … Daily Best link @ your email

Infographics Mentalist … Image worth explaining thousand Words

Hadoop Mentalist … Blog on Big Data

BI Tools Analysis … BI Tools

Connect With me on

| Facebook |Twitter | LinkedIn| Google+ | Word Press | RSS | About Me |