MSBI : BI # 43 : Business Intelligence – Tools & Theory # 35 : Data Extraction #1 : Data Extraction & Role of ETL Process

Hi Folks,

This post is part of Series Business Intelligence – Tools & Theory

Currently running topic for this series is listed as below :

Series Business Intelligence – Tools & Theory

>>Chapter 1 : Business Intelligence an Introduction

>>Chapter 2 : Business Intelligence Essentials

>>Chapter 3 : Business Intelligence Types

>>Chapter 4 : Architecting the Data

>>Chapter 5 : Introduction of Data Mining

>>Chapter 6 : Data Mining Techniques

>>Chapter 7 : Introduction to Data Warehousing

>>Chapter 8 : Different Ways of Data Warehousing

>>Chapter 9 : Knowledge Management

>>Chapter 10 : Data Extraction <You are here>

Continuing from my previous post in this series. If you have missed any post, please visit the series links listed above.

We are going to cover the following points in this article:

  • Data Extraction
  • Role of ETL process

Data Extraction

By now you should be familiar with the need for and significance of knowledge management, as well as with generating, using and storing knowledge and the various technologies used for knowledge management.

Data extraction is the process of retrieving data from a source system so that it can be used in a data warehouse environment. The overall process of obtaining data from the source systems and bringing it into the data warehouse is called Extraction, Transformation, and Loading (ETL). After extraction, the data is transformed and loaded into the data warehouse.

Extract, Transform and Load (ETL) utilities were developed to overcome the problems of loading data into the data warehouse initially, keeping it updated and resolving discrepancies. Source systems may be very complex and poorly documented, so determining which data needs to be extracted can be difficult. The data normally has to be extracted not just once but several times, on a periodic basis, to supply all changed data to the warehouse and keep it up to date. Source identification, the recognition of all the correct data sources, is a significant first step in the data extraction function. There are different kinds of sources and extraction methods, such as logical extraction methods, physical extraction methods and change data capture.
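To make change data capture a little more concrete, here is a minimal sketch in Python, using the standard library's sqlite3 module as a stand-in for an operational source system. The orders table, its last_modified column and the stored last-run timestamp are all hypothetical; real ETL tools implement this in many different ways.

```python
import sqlite3

# Hypothetical operational source table (sqlite3 stands in for the real source system).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 120.00, "2024-01-01T10:00:00"),
     (2, 75.50, "2024-01-03T09:30:00")],
)

# Timestamp of the previous extraction run; an ETL tool would persist this between runs.
last_run = "2024-01-02T00:00:00"

# Change data capture by timestamp: pull only the rows changed since the last run,
# so repeated periodic extractions keep the warehouse up to date without full reloads.
changed_rows = source.execute(
    "SELECT id, amount, last_modified FROM orders WHERE last_modified > ?",
    (last_run,),
).fetchall()
print(changed_rows)  # only order 2 changed after the last run
```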

Data extraction is also the act or process of extracting data out of data sources, which are usually unstructured or badly structured, for further data processing, data storage or data migration. This data can be extracted from the web. Internet pages in HTML, XML and similar formats can be considered unstructured data sources because of the wide variety of coding styles, including exceptions to and violations of standard coding practices. The import into the intermediate extraction system is usually followed by data transformation and possibly the addition of metadata prior to export to the next stage in the data workflow.

Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, spool files and so on. Extracting data from these unstructured sources has become a considerable technical challenge, whereas historically data extraction mainly had to deal with changes in physical hardware formats. The majority of current data extraction deals with extracting data from unstructured data sources and from different software formats. The growing practice of extracting data from the web is also known as web scraping.
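As a minimal sketch of web scraping, the following Python example pulls the title and headings out of an HTML page. It assumes the third-party requests and beautifulsoup4 packages are installed and uses a placeholder URL; the post itself does not prescribe any particular tool.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL for illustration only.
url = "https://example.com/report.html"

# Fetch the page; the unstructured HTML is the raw data source here.
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML and pull out a little structure: the title and the headings.
soup = BeautifulSoup(response.text, "html.parser")
print("Title:", soup.title.get_text(strip=True) if soup.title else None)
for heading in soup.find_all(["h1", "h2"]):
    print("Heading:", heading.get_text(strip=True))
```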

Structure can be added to unstructured data in a number of ways:

§ Using text pattern matching, also known as regular expressions, to recognize small- or large-scale structure, for example records in a report and their related data in headers and footers (see the sketch after this list).

§ Using a table-based approach to recognize common sections within a limited domain, for example identifying skills, previous work experience and qualifications in resumes using a standard set of commonly used headings (education might be found under Education, Qualification or Courses).

§ Using text analytics to try to understand the text and then link it to other information.
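To make the pattern-matching idea concrete, here is a minimal Python sketch that uses a regular expression to pull structured records out of a plain-text report. The report layout and field names are hypothetical.

```python
import re

# Hypothetical plain-text report: each record line follows "id | name | amount".
report = """
SALES REPORT - JANUARY
1001 | Widget A | 250.00
1002 | Widget B | 99.50
Page 1 of 1
"""

# A regular expression describing one record; the header and footer lines do not match.
record_pattern = re.compile(r"^(\d{4})\s*\|\s*(.+?)\s*\|\s*(\d+\.\d{2})$", re.MULTILINE)

# Turn the matched lines into structured records.
records = [
    {"id": m.group(1), "name": m.group(2), "amount": float(m.group(3))}
    for m in record_pattern.finditer(report)
]
print(records)
```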

Role of ETL process

Overview of Extraction, Transformation, and Loading (ETL)

The data warehouse has to be loaded regularly so that it can serve its purpose of facilitating business analysis. To perform this operation, data from one or more operational systems must be extracted and copied into the warehouse. The process of obtaining data from the source systems and bringing it into the data warehouse is usually called Extraction, Transformation, and Loading (ETL).

The acronym ETL is perhaps too simplistic, because it omits the transportation phase and implies that each of the other phases of the process is distinct. Nevertheless, the whole process, including data loading, is referred to as ETL.

ETL refers to a broad process rather than to three well-defined steps.

The methodology and tasks of ETL have been well known for many years, and are not essentially unique to data warehouse environments: a wide variety of proprietary applications and database systems form the IT backbone of any enterprise. Data has to be shared between applications or systems in order to integrate them and to give at least two applications the same picture of the world. This kind of data sharing has long been addressed by mechanisms similar to what is now known as ETL.

In data warehouse environments, beyond simple exchange there is the additional burden of integrating, rearranging and consolidating data across many systems, thereby providing a new, unified information base for business intelligence. Furthermore, the data volume in data warehouse environments tends to be very large.

In the ETL process, during extraction the desired data is identified and extracted from many different sources, including database systems and applications. Often it is not possible to identify the specific subset of interest, so more data than necessary has to be extracted, and the recognition of the relevant data is done at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during the extraction process. The size of the extracted data can range from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two logically identical extractions: the time span can vary from days or hours down to minutes and near real time. Web server log files, for example, can easily grow to hundreds of megabytes in a very short period of time.

After the data has been extracted, it has to be physically transported to the target system or to an intermediate system for further processing. Depending on the chosen means of transportation, some transformations can be done during this process. For example, a SQL statement that directly accesses a remote target through a gateway can concatenate two columns as part of the SELECT statement.
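A minimal sketch of that idea in Python, using the built-in sqlite3 module in place of a remote gateway: the customers table and its columns are hypothetical, but the point is that the concatenation happens inside the SELECT that moves the data, rather than as a separate transformation step.

```python
import sqlite3

# Hypothetical source table standing in for a remote system behind a gateway.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO customers VALUES ('Ada', 'Lovelace')")

# The transformation (concatenating two columns) is folded into the SELECT
# that transports the data to the next stage.
rows = conn.execute(
    "SELECT first_name || ' ' || last_name AS full_name FROM customers"
).fetchall()
print(rows)  # [('Ada Lovelace',)]
```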

Offline Extract, Transform, and Load (ETL)

Previously, the one common interface between the dissimilar systems in an organization was magnetic tape. Tapes were standardized, and any system could write tapes that could be read by other systems. So the first data warehouses were fed by magnetic tapes prepared by different systems within the organization, which left the problem of data disparity: there was often little relation between data written to tape by one system and data written by another. The data warehouse's database was designed to support the analytical functions necessary for the business intelligence function. It was a well-structured database with complex indices to support Online Analytical Processing (OLAP). Databases configured for OLAP allow complex analytical and ad hoc queries with quick execution times. The data provided to the data warehouse from the enterprise systems has to be transformed into a format understandable to the data warehouse. To overcome the problems of loading the data into the data warehouse initially, keeping it updated and resolving discrepancies, the Extract, Transform and Load (ETL) utilities were developed. Figure 10.1 below shows how the data is extracted from the source databases, transformed into the common data warehouse format, and loaded into the data warehouse:

[Figure 10.1: Extract, Transform and Load]

The key to the success of this approach is the transformation function, which applies a series of rules to the extracted data so that it is correctly formatted for loading into the data warehouse.

Examples of transformation rules include (a small sketch combining a few of them follows the list):

§ Selecting the data to load.

§ Translating the encoded items.

§ Encoding and standardizing the free-form values.

§ Deriving new calculated values, for example sale price = price – discount.

§ Merging of data from multiple sources.

§ Summarizing or aggregating certain rows and columns.

§ Splitting a column into multiple columns, for example a comma-separated list.

§ Resolving the discrepancies between similar data items.

§ Validating the data.

§ Ensuring the data consistency.
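As a minimal sketch (in Python, with hypothetical field names) of how a few of these rules might be applied to one extracted record, combining derivation of a calculated value, splitting a comma-separated column and standardizing a free-form value:

```python
# One record as extracted from a hypothetical source system.
extracted = {
    "price": 100.0,
    "discount": 15.0,
    "tags": "retail, online , clearance",
    "country": "u.s.a.",
}

def transform(record):
    """Apply a few example transformation rules before loading."""
    # Derive a new calculated value: sale price = price - discount.
    sale_price = record["price"] - record["discount"]

    # Split a comma-separated column into multiple values.
    tags = [t.strip() for t in record["tags"].split(",")]

    # Standardize a free-form value against a small lookup table.
    country_codes = {"u.s.a.": "US", "usa": "US", "united states": "US"}
    country = country_codes.get(record["country"].strip().lower(), "UNKNOWN")

    return {"sale_price": sale_price, "tags": tags, "country": country}

print(transform(extracted))
# {'sale_price': 85.0, 'tags': ['retail', 'online', 'clearance'], 'country': 'US'}
```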

The ETL function allows the integration of multiple data sources into a well-structured database for use in complex analyses. The ETL process has to be executed periodically, such as daily, weekly or monthly, depending on business needs. This process is called offline ETL because the target database is not updated continuously; it is updated periodically, in batches. Though offline ETL serves its purpose well, it has some serious drawbacks:

§ The data in the data warehouse could be weeks old. It is therefore helpful for planned functions but is not particularly adaptable to tactical uses.

§ The source database typically has to be quiescent during the extract process; otherwise, the target database is not in a consistent state following the load. As a result, the source applications must be shut down, often for hours.

In online ETL, the ETL function supports real-time business intelligence and must therefore be continuous and non-invasive. In contrast to offline ETL, which gives old but consistent responses to queries, online ETL gives current but varying responses to successive queries, because the data it uses is continuously updated to reflect the present state of the enterprise.

Offline ETL technology has served businesses well for decades. The intelligence obtained from this data informs long-term, reactive, strategic decision making. Short-term operational and proactive tactical decision making, on the other hand, continues to rely on instinct.

Hope you will like the Business Intelligence – Tools & Theory series!

If you have not yet subscribed to this blog, please subscribe from the “follow me” tab!

That way you will be updated in real time and get all new content in your mail daily for free, without any RSS subscription or news reading!!

Happy Learning and Sharing !!

For More information related to BI World visit our all Mentalist networks Blog

SQL Server Mentalist … SQL Learning Blog

Business Intelligence Mentalist … Business Intelligence World

Microsoft Mentalist … MVC,ASP.NET, WCF & LinQ

MSBI Mentalist … MS BI and SQL Server

NMUG Bloggers …Navi Mumbai User Group Blog

Architectural Shack … Architectural implementation and design patterns

DBA Mentalist …Advance SQL Server Blog

MVC Mentalist … MVC Learning Blog

Link Mentalist … Daily Best link @ your email

Infographics Mentalist … Image worth explaining thousand Words

Hadoop Mentalist … Blog on Big Data

BI Tools Analysis … BI Tools

Connect With me on

| Facebook |Twitter | LinkedIn| Google+ | Word Press | RSS | About Me |
