MSBI : BI # 45 : Business Intelligence – Tools & Theory # 37 : Data Extraction #3 : Various Data Extraction Techniques

Hi Folks,

This post is part of Series Business Intelligence – Tools & Theory

The topics currently covered in this series are listed below:

Series Business Intelligence – Tools & Theory

>>Chapter 1 : Business Intelligence an Introduction

>>Chapter 2 : Business Intelligence Essentials

>>Chapter 3 : Business Intelligence Types

>>Chapter 4 : Architecting the Data

>>Chapter 5 : Introduction of Data Mining

>>Chapter 6 : Data Mining Techniques

>>Chapter 7 : Introduction to Data Warehousing

>>Chapter 8 : Different Ways of Data Warehousing

>>Chapter 9 : Knowledge Management

>>Chapter 10 : Data Extraction<You are here>

Continuing from my previous post in this series; if you have missed any post, please visit the links below.

We are going to cover the following points in this article:

  • Various data extraction techniques

Various data extraction techniques

The extraction method has to be selected based on the source system and on the business needs in the target data warehouse environment. Often there is no option to add additional logic to the source systems to support incremental extraction of data, due to the performance impact or the increased workload on these systems. Sometimes the customer cannot add anything at all to an out-of-the-box application system.

The expected amount of data to be extracted, and the stage of the ETL process (initial load or ongoing maintenance of data), may also influence the decision of how to extract. Basically, the decision has to be made on how to extract data logically and physically.

The Logical Extraction Methods

There are two kinds of logical extraction methods:

§ Full Extraction

§ Incremental Extraction

Full Extraction

In full extraction the data is extracted completely from the source system. Since this extraction reflects all the data presently available in the source system, there is no need to keep track of changes to the data source since the previous successful extraction. The source data is provided as-is, and no additional logical information (for example, timestamps) is required on the source site. An example of full extraction is an export file of a single table, or a remote SQL statement scanning the complete source table.

Incremental Extraction

At a particular point in time, only the data that has changed since a well-defined event back in history is extracted. This event might be the last time of extraction, or a more complex business event such as the last booking day of a fiscal period. To recognize this delta change, there must be a way to identify all the information changed since this particular time event. This information can either be provided by the source data itself, such as an application column reflecting the last-changed timestamp, or by a change table, where an appropriate additional mechanism keeps track of the changes apart from the originating transactions. In most cases, using the latter method means adding extraction logic to the source system.
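As a minimal sketch of the idea, the delta selection can be expressed in a few lines of Python. The rows, the `last_modified` column name, and the cut-off value are illustrative assumptions, not part of any particular source system:

```python
from datetime import datetime

def incremental_extract(rows, last_extract_time):
    """Return only the rows modified since the previous successful extraction.

    Assumes each row carries a 'last_modified' timestamp column; the
    column name is illustrative, not a fixed standard.
    """
    return [r for r in rows if r["last_modified"] > last_extract_time]

source_rows = [
    {"id": 1, "amount": 100, "last_modified": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "amount": 250, "last_modified": datetime(2024, 1, 2, 14, 30)},
    {"id": 3, "amount": 75,  "last_modified": datetime(2024, 1, 3, 8, 15)},
]

# Only rows changed after the last run are picked up.
delta = incremental_extract(source_rows, datetime(2024, 1, 2, 0, 0))
print([r["id"] for r in delta])  # [2, 3]
```

In a real system the filter would of course run as a WHERE clause on the source database rather than in the client.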

Many data warehouses do not use any change-capture technique as part of the extraction process. Instead, the entire table is extracted from the source system to the data warehouse or staging area, and these tables are then compared with a previous extract from the source system to identify the changed data. This approach may not have a significant impact on the source systems, but it clearly places a significant burden on the data warehouse processes.
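The snapshot-comparison approach described above can be sketched as follows; the row layout and the choice of `id` as the primary key are assumptions for illustration:

```python
def snapshot_diff(previous, current, key="id"):
    """Compare a full extract against the previous one to find changed data.

    Returns (inserted, updated, deleted) row lists. Rows are dicts;
    'key' names the primary-key column (an assumption for this sketch).
    """
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    inserted = [r for k, r in curr.items() if k not in prev]
    updated = [r for k, r in curr.items() if k in prev and r != prev[k]]
    deleted = [r for k, r in prev.items() if k not in curr]
    return inserted, updated, deleted

# Yesterday's and today's full extracts of the same source table.
old = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
new = [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}]
ins, upd, dele = snapshot_diff(old, new)
print(ins, upd, dele)
```

The comparison cost grows with the full table size on every run, which is exactly the burden on the warehouse side that the paragraph above mentions.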

Physical Extraction Methods

Depending on the chosen logical extraction method, and on the capabilities and restrictions on the source side, the retrieved data can be physically extracted by two methods. The data can either be extracted online from the source system, or from an offline structure. Such an offline structure may already exist, or it may be generated by an extraction routine.

The following are the methods of physical extraction:

§ Online Extraction

§ Offline Extraction

Online Extraction

In online extraction the data is extracted directly from the source system itself. The extraction process can connect directly to the source system to access the source tables themselves, or to an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs or change tables). Note that the intermediate system is not necessarily physically different from the source system.

With online extractions, we need to consider whether the distributed transactions are using the original source objects or prepared source objects.

Offline Extraction

The data is not extracted directly from the source system, but is staged explicitly outside the original source system. The data already has an existing structure, for example redo logs, archive logs, or transportable tablespaces.

The following structures can be considered:

§ Flat files: In flat files the data is in a defined, generic format. Additional information about the source object is required for further processing.

§ Dump files: In dump files, information about the containing objects is included.

§ Redo and archive logs: Here the information is in a special, additional dump file.

§ Transportable tablespaces: Transportable tablespaces are a powerful way to extract and move large volumes of data between Oracle databases. Oracle Corporation suggests that transportable tablespaces be used whenever possible, because they can provide significant advantages in performance and manageability over the other extraction techniques.

Change Data Capture

An important consideration in extraction is incremental extraction, also known as Change Data Capture. If a data warehouse extracts data from an operational system on a nightly basis, then the data warehouse may require only the data that has changed since the last extraction, that is, the data that has been modified in the past 24 hours.

When it is possible to efficiently identify and extract only the most recently changed data, the extraction process, as well as all the downstream operations in the ETL process, can be much more efficient, because a much smaller volume of data must be extracted. Unfortunately, for many source systems, identifying the recently modified data is difficult or intrusive to the operation of the system. Change Data Capture is typically the most challenging technical issue in data extraction.

Because change data capture is often desirable as part of the extraction process, and it may not always be possible to use Oracle's Change Data Capture mechanism, there are several techniques for implementing self-developed change capture on Oracle source systems:

§ Timestamps

§ Partitioning

§ Triggers

These techniques are based on the characteristics of the source systems, or may need modifications to the source systems. Thus, each of these techniques must be carefully evaluated by the owners of the source system before implementation.

Each of these techniques can work in combination with the data extraction techniques. For example, the timestamps can be used whether the data is being unloaded to a file or accessed through a distributed query.

Timestamps

Some tables in operational systems have timestamp columns. Such a timestamp records the time and date at which a given row was last modified. If a table in an operational system has timestamp columns, then the latest data can easily be identified using them. For example, the following query might help extract today's data from an orders table:

SELECT * FROM orders
WHERE TRUNC(order_date) = TRUNC(SYSDATE);

If timestamp information is not available in an operational source system, it will not always be possible to modify the system to include timestamps. Such a modification would first require altering the operational system's tables to include a new timestamp column, and then creating a trigger to update the timestamp column following every operation that modifies a given row.

Partitioning

Some source systems may use Oracle range partitioning, so that the source tables are partitioned along a date key, which allows easy identification of new data. For example, if you are extracting from an orders table, and the orders table is partitioned by week, then it is easy to identify the current week's data.
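A minimal sketch of the idea, with the partitions modeled as a Python dict keyed by ISO (year, week); the data layout and the key choice are illustrative assumptions, not how Oracle stores partitions:

```python
from datetime import date

def week_key(d):
    """ISO (year, week) pair, used here as an illustrative partition key."""
    iso = d.isocalendar()
    return (iso[0], iso[1])

def current_week_partition(partitions, today):
    """Pick only the partition holding the present week's orders."""
    return partitions.get(week_key(today), [])

# Orders pre-grouped by week, mimicking a range-partitioned table.
partitions = {
    (2024, 1): [{"id": 1}],
    (2024, 2): [{"id": 2}, {"id": 3}],
}
this_week = current_week_partition(partitions, date(2024, 1, 10))
print(this_week)
```

The point the paragraph makes is that partition pruning lets the extraction read only one partition instead of scanning the whole table.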

Triggers

Triggers can be created in operational systems to keep track of recently updated records. They can then be used along with timestamp columns to identify the exact time and date when a given row was last modified. This is done by creating a trigger on each source table that requires change data capture. Following each DML statement executed on the source table, the trigger updates the timestamp column with the current time. Thus, the timestamp column provides the exact time and date when a given row was last modified.
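The trigger mechanism described above can be demonstrated with SQLite standing in for the operational system; the table, column names, and trigger name are hypothetical, and Oracle's trigger syntax differs, but the principle is the same:

```python
import sqlite3

# In-memory SQLite database playing the role of the source system.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    amount INTEGER,
    last_modified TEXT)""")

# Trigger: after every UPDATE, stamp the row with the current time.
# (A production setup would cover INSERT as well.)
conn.execute("""
CREATE TRIGGER orders_touch AFTER UPDATE ON orders
BEGIN
    UPDATE orders SET last_modified = datetime('now') WHERE id = NEW.id;
END""")

conn.execute("INSERT INTO orders (id, amount) VALUES (1, 100)")
conn.execute("UPDATE orders SET amount = 150 WHERE id = 1")
ts = conn.execute(
    "SELECT last_modified FROM orders WHERE id = 1").fetchone()[0]
print(ts)
```

An incremental extraction can then simply filter on `last_modified`, exactly as in the timestamp technique above.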

A similar internalized trigger-based technique is used for Oracle materialized view logs. These logs are used by materialized views to identify changed data, and they are accessible to end users. A materialized view log can be created on each source table requiring change data capture. Then, whenever modifications are made to the source table, a record is inserted into the materialized view log indicating which rows were modified. Materialized view logs rely on triggers, but they provide an advantage in that the creation and maintenance of the change-data system is largely managed by Oracle.

However, Oracle recommends the use of synchronous Change Data Capture for trigger-based change capture, since Change Data Capture provides an externalized interface for accessing the change information and also provides a framework for maintaining the distribution of the information to various clients.

Trigger-based techniques affect performance on the source systems, and should be carefully considered before being applied to a production source system.

Extracting data from the operational systems

Once the data required for analysis has been identified, it has to be located within the company's operational systems. The data required for the warehouse is extracted from the source operational systems and written to the staging area, where it will later be transformed. To minimize the performance impact on the source database, the data should be loaded without applying transformations to it.

More often than not, the operational system's owners will not let the warehouse developers access the systems directly, but can provide periodic extracts. These extracts are usually flat, sequential operating-system files, which make up the staging area.

Application programs have to be developed to select the fields and records necessary for the warehouse. If the data is kept in a legacy system, these programs may be written in COBOL, which requires special logic to handle features such as repeating fields in the COBOL OCCURS clause. The data warehouse designers have to work with the application developers of the OLTP systems to build extract scripts that provide the required columns and formats of the data.

As part of designing the ETL process, it should be determined how frequently data is extracted from the operational systems. Extraction can occur at the end of some time period or business event, such as the end of the day or week, or the closing of the fiscal quarter. The meaning of "end of the day" or "last day of the week" has to be clearly defined if the system is used across different time zones. Extraction can be done at different times for different systems, with the data then loaded into the warehouse during an upcoming batch window. Another aspect of the warehouse design process involves deciding what level of aggregation to apply.

It may not be possible to identify validation errors until the data has been extracted from the operational systems. This situation can arise when the data is extracted from multiple data sources. For example, integrating data extracted from separate sales tracking, shipping, and billing systems might uncover discrepancies that should be addressed in one or more of the source systems.

You may also recognize inconsistencies, as opposed to errors, in the data after it has been extracted. For example, different data sources may use different coding systems for the same kind of data. You can use translation tables to reconcile these differences, either during the extraction operation or later during the transformation operations. For example, a legacy system might code state or province names using a three-character code, whereas another system might use a two-character code. The data obtained from one or both of these systems should be translated into a single set of codes before it is loaded into the data warehouse.

In other cases, inconsistencies can be discovered if the source systems allow free-form entry of text information. Such data is often internally inconsistent, because different data-entry personnel can enter the same data in different ways. The inconsistent representations of the same data have to be reconciled if the data is to be used for analysis. For example, in a data source that allows free-form text entry for the state or province portion of an address, the state of Florida can be entered as FL, Fla, Florida, or even Flor. It may be difficult to modify legacy source systems to implement standard coding validation. Manual transformation adjustments might be necessary to reconcile such differences if the contributing source systems cannot be modified.
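Both the translation-table case and the free-form-entry case can be handled with a small lookup during extraction or transformation. The mapping below is hypothetical, covering a legacy three-character code as well as free-form spellings; unknown values are passed through unchanged for manual review:

```python
# Hypothetical translation table mapping legacy codes and free-form
# spellings onto the single two-character code used in the warehouse.
STATE_XLAT = {
    "FL": "FL", "FLA": "FL", "FLORIDA": "FL", "FLOR": "FL",
}

def normalize_state(value):
    """Translate one state value; unknown values pass through as-is."""
    return STATE_XLAT.get(value.strip().upper(), value)

samples = ["FL", "Fla", "Florida", "Flor", "Ontario"]
print([normalize_state(s) for s in samples])
```

In practice such a table would live in the staging database and be maintained alongside the other transformation rules, rather than hard-coded.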

Data in Operational Systems

Source systems usually store data in one of two ways, and operational data can accordingly be classified into two broad categories. The choice of data extraction technique may depend on the nature of each of the following two categories:

§ Current Value

§ Periodic Status

Current Value: Most of the attributes in the source systems fall into this category. Here, the stored value of an attribute shows the value of the attribute at this moment in time. These values are transient or transitory: as business transactions occur, the values change, and no prediction can be made of how long the present value will last or when it will change.

Periodic Status: In periodic status, the value of the attribute is stored as a status every time a change happens; that is, the status value is stored with reference to time. For example, the data about an insurance policy is kept as the status of the policy at each point in time. The history of changes is thus kept in the source systems themselves, which makes data extraction relatively simple.
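The contrast between the two categories can be sketched in Python; the policy attribute and dates are made up for illustration:

```python
from datetime import date

# Current value: the stored attribute holds only the value right now.
policy_current = {"policy_id": 7, "status": "active"}
policy_current["status"] = "lapsed"  # overwrite; the old value is lost

# Periodic status: every change is stored with a reference to time,
# so the source system itself keeps the history.
policy_history = [
    (date(2023, 1, 1), "active"),
    (date(2024, 3, 1), "lapsed"),    # append; history is preserved
]

def status_as_of(history, when):
    """Return the policy status in effect on a given date."""
    status = None
    for ts, value in history:
        if ts <= when:
            status = value
    return status

print(status_as_of(policy_history, date(2023, 6, 1)))  # 'active'
```

With current-value storage, reconstructing history requires change data capture on the warehouse side; with periodic status, the history can simply be extracted.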

Hope you like the Series Business Intelligence – Tools & Theory series!

If you have not yet subscribed to this blog, please subscribe from the “follow me” tab!

That way you will be updated in real time, with all the new posts delivered to your mail daily for free, without any RSS subscription or news reader !!

Happy Learning and Sharing !!

For More information related to BI World visit our all Mentalist networks Blog

SQL Server Mentalist … SQL Learning Blog

Business Intelligence Mentalist … Business Intelligence World

Microsoft Mentalist … MVC,ASP.NET, WCF & LinQ

MSBI Mentalist … MS BI and SQL Server

NMUG Bloggers …Navi Mumbai User Group Blog

Architectural Shack … Architectural implementation and design patterns

DBA Mentalist …Advance SQL Server Blog

MVC Mentalist … MVC Learning Blog

Link Mentalist … Daily Best link @ your email

Infographics Mentalist … Image worth explaining thousand Words

Hadoop Mentalist … Blog on Big Data

BI Tools Analysis … BI Tools

Connect With me on

| Facebook |Twitter | LinkedIn| Google+ | Word Press | RSS | About Me |
