MSBI : BI # 44 : Business Intelligence – Tools & Theory # 36 : Data Extraction #2 : Importance of Source Identification

Hi Folks,

This post is part of Series Business Intelligence – Tools & Theory

Currently running topic for this series is listed as below :

Series Business Intelligence – Tools & Theory

>>Chapter 1 : Business Intelligence an Introduction

>>Chapter 2 : Business Intelligence Essentials

>>Chapter 3 : Business Intelligence Types

>>Chapter 4 : Architecting the Data

>>Chapter 5 : Introduction of Data Mining

>>Chapter 6 : Data Mining Techniques

>>Chapter 7 : Introduction to Data Warehousing

>>Chapter 8 : Different Ways of Data Warehousing

>>Chapter 9 : Knowledge Management

>>Chapter 10 : Data Extraction<You are here>

Continuing from my previous post in this series. If you have missed any post, please visit the link below.

We are going to cover the following points in this article:

  • Importance of source identification

Importance of source identification

Source identification covers the identification or recognition of all the correct data sources. It does not end with merely identifying the data sources; it goes further, to examining and verifying that the identified sources can actually provide the required value to the data warehouse.

Consider a part of the database, say one of the data marts, designed to provide strategic information on the fulfillment of orders. For this purpose, historical information about fulfilled and pending orders has to be stored. If orders are shipped through multiple delivery channels, data about these channels has to be captured. If the users are interested in analyzing orders by their status as the orders go through the fulfillment process, data on the order statuses has to be extracted as well.

In the fact table for order fulfillment, attributes are needed for the total order amount, discounts, commissions, expected delivery time, actual delivery time, and dates at the different stages of the process.

Dimension tables for product, order disposition, delivery channel, and customer are also required.

First, it has to be determined whether the source systems contain the data needed for this data mart. Then, the correct data source for each data element in the data mart has to be established from those source systems. Finally, there should be a verification process to make sure that the identified sources are in fact the right ones.
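To make these data elements concrete, here is a small sketch of what one row of the order fulfillment fact table and its dimension keys might look like. Every name and value below is a hypothetical illustration, not a prescribed design:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical shape of one row in the order fulfillment fact table.
# Keys to the product, order disposition, delivery channel and customer
# dimensions sit alongside the measures discussed above.
@dataclass
class OrderFulfillmentFact:
    product_key: int
    disposition_key: int
    channel_key: int
    customer_key: int
    order_date: date
    ship_date: date
    total_order_amount: float
    discount: float
    commission: float
    expected_delivery_days: int
    actual_delivery_days: int

fact = OrderFulfillmentFact(
    product_key=101, disposition_key=3, channel_key=2, customer_key=5001,
    order_date=date(2024, 1, 10), ship_date=date(2024, 1, 14),
    total_order_amount=12000.0, discount=600.0, commission=240.0,
    expected_delivery_days=5, actual_delivery_days=4,
)
print(fact)
```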

The following figure describes a stepwise approach to source identification for order fulfillment:

[Figure: Source Identification – a stepwise approach]

Source identification is not as simple a process as it may sound. It is a significant first step in the data extraction function, and it has to be carried out for every piece of information that is to be stored in the data warehouse.
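As a minimal sketch of this stepwise process, assuming entirely hypothetical source system, table, and column names, the idea is to build a source-to-target map for every target data element and then verify each candidate source:

```python
# A minimal sketch of stepwise source identification for the order
# fulfillment data mart. All system, table and column names here are
# hypothetical illustrations, not actual source definitions.

# Target data elements in the data mart mapped to candidate sources.
source_map = {
    "total_order_amount": {"system": "order_entry", "table": "orders", "column": "order_amt"},
    "discount":           {"system": "order_entry", "table": "orders", "column": "disc_amt"},
    "delivery_channel":   {"system": "shipping",    "table": "shipments", "column": "channel_cd"},
    "order_status":       {"system": "order_entry", "table": "order_status_hist", "column": "status_cd"},
}

def verify_source(candidate: dict) -> bool:
    """Placeholder verification step: confirm the candidate source exists,
    is populated, and carries the meaning the data mart expects."""
    # In practice this would involve data profiling and consultation with
    # the owners of the source system.
    return True

# For every piece of information needed in the data warehouse, identify
# the source and verify that it is in fact the right one.
for element, candidate in source_map.items():
    if not verify_source(candidate):
        print(f"Re-examine the source for {element}")
```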

Different ways of recognizing the source

The nature of the source data that has to be extracted or captured must be clearly understood before choosing the data extraction techniques. It should also be understood how the extracted data will be used, because the source data is in a state of constant change.

Business transactions keep changing the data in the source systems. In most cases, the value of an attribute in a source system is the value of that attribute at the present time. Day-to-day business transactions continually change the values of the attributes of the data structures in the source operational systems. When a customer moves to another state, the data about the customer changes in the customer table of the source system. When additional packaging options are added to the way a product can be sold, the data for the product changes in the source system. When a correction is applied to the quantity ordered, the data for the order changes in the source system.

The data in the source systems is therefore time-dependent, or temporal, because it keeps changing with time; the value of a single variable changes over time. Consider again the example of a change of a customer's address for a move from Bengaluru to Pune. In the operational system, what matters is that the current address of the customer now has MH as the state code. The actual change transaction, stating that the previous state code was KA and the revised state code is MH, need not be preserved there. But think about how this change affects the information in the data warehouse. If the state code is used for analyzing a measurement such as sales, the sales to this customer before the move must be counted under Karnataka (Bengaluru) and those after the move must be counted under Maharashtra (Pune). In other words, history cannot be overlooked in the data warehouse. This brings us to the question of how to capture history from the source systems. The answer depends on how exactly the data is retained in the source systems.
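As a hedged illustration of what capturing history can mean on the warehouse side, the sketch below keeps every value of the customer's state code together with effective dates instead of overwriting it the way the operational system does. The row layout, column names, and dates are assumptions made for this example only:

```python
from datetime import date

# Hypothetical customer dimension rows that preserve the history of the
# state code, each value carrying its own effective date range.
customer_dim = [
    {"sk": 1, "customer_id": "C100", "state_code": "KA",
     "effective_from": date(2020, 1, 1), "effective_to": date(2023, 6, 30)},
    {"sk": 2, "customer_id": "C100", "state_code": "MH",
     "effective_from": date(2023, 7, 1), "effective_to": None},  # current row
]

def state_code_on(customer_id: str, as_of: date) -> str:
    """Return the state code that was current for the customer on a given date."""
    for row in customer_dim:
        if row["customer_id"] != customer_id:
            continue
        ends = row["effective_to"] or date.max
        if row["effective_from"] <= as_of <= ends:
            return row["state_code"]
    raise LookupError("no row covers that date")

# Sales before the move are attributed to KA, sales after it to MH.
print(state_code_on("C100", date(2023, 5, 15)))  # KA
print(state_code_on("C100", date(2023, 8, 1)))   # MH
```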

The most common technique used traditionally is to build regular expressions that can match pieces such as URLs and link titles. Screen-scraper software actually started out as applications written in Perl. Apart from regular expressions, some code is usually written in something like Java or Active Server Pages to parse out bigger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can become disordered when a script contains a lot of them. At the same time, if you are already comfortable with regular expressions and the scraping project is relatively small, this can be a great solution.
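A minimal sketch of this technique is shown below; the HTML snippet and the pattern are illustrative assumptions, not a production-ready scraper:

```python
import re

# A small, illustrative chunk of HTML; in practice this would be the
# text of a page fetched from a source website.
html = """
<a href="http://example.com/orders">Order status</a>
<a href="http://example.com/products">Product list</a>
"""

# Match anchor tags and capture the URL and the link title.
# This deliberately simple pattern will break on more complex markup,
# which is exactly the fragility discussed below.
link_pattern = re.compile(r'<a\s+href="([^"]+)"\s*>([^<]+)</a>')

for url, title in link_pattern.findall(html):
    print(url, "->", title)
```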

The other techniques for getting the data out can become very sophisticated, as algorithms that make use of artificial intelligence are applied to the page. Some programs will actually analyze the semantic content of an HTML page and then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies that offer commercial applications built specifically for screen-scraping. The applications differ quite a bit, but for medium to large-sized projects they are often a good solution. Each one has its own learning curve, so plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping, it is probably a good idea to at least look around for a screen-scraping application, as it will save time and money in the long run.

So, the best approach to data extraction really depends on the needs and the resources available. The advantages and disadvantages of the various approaches are as follows:

Raw regular expressions and codes

Advantages:

§ If you are already familiar with regular expressions and at least one programming language, this can be a quick solution.

§ Regular expressions allow a fair amount of "fuzziness" in the matching, so that minor changes to the content will not break the extraction.

§ There is no need to learn any new languages or tools, assuming you are already familiar with regular expressions and a programming language.

§ Regular expressions are supported in almost all programming languages.

Disadvantages:

§ Regular expressions can be complex for those who don't have a lot of experience with them. Learning them is not like going from Perl to Java; it is more like going from Perl to XSLT.

§ They can often be confusing to analyze. For example, consider the many regular expressions people have created to match something as simple as an email address.

§ If the content to be matched changes, for example if the web page is changed by adding a new "font" tag, the regular expressions have to be updated to account for the change.

§ The data discovery portion of the process, that is, traversing various web pages to get to the page containing the required data, still has to be handled, and it can get fairly complex if there is a need to deal with cookies (a minimal sketch follows below).

Where can this approach be used: Straight regular expressions are used for screen-scraping when there is a small job that has to be done quickly.
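As a hedged sketch of the data discovery step mentioned above, the code below uses a session object to carry cookies from a login page through to the pages that actually hold the data. The URLs, form fields, and patterns are placeholders, and it assumes the third-party requests library is available:

```python
import re
import requests  # assumed to be installed; any HTTP client with cookie support would do

# Placeholder URLs purely for illustration.
LOGIN_URL = "http://example.com/login"
LISTING_URL = "http://example.com/orders"

# A session keeps cookies between requests, which is what makes the
# multi-page "data discovery" part of scraping workable.
session = requests.Session()
session.post(LOGIN_URL, data={"user": "demo", "password": "demo"})

listing_html = session.get(LISTING_URL).text

# Follow each detail link found on the listing page and extract a field.
for detail_path in re.findall(r'href="(/orders/\d+)"', listing_html):
    detail_html = session.get("http://example.com" + detail_path).text
    match = re.search(r'<span class="total">([^<]+)</span>', detail_html)
    if match:
        print(detail_path, match.group(1))
```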

Ontologies and artificial intelligence

Advantages:

§ The extraction engine has to be created only once, and it can then more or less extract the data from any page within the content domain being targeted.

§ A data model is generally built in. For example, if data about cars has to be extracted from web sites, the extraction engine already knows what the make, the model, and the price are, which makes it easy to map them to existing data structures, for example to insert the data into the correct location in the database (a minimal sketch follows after this list).

§ This requires relatively little long-term maintenance. As the websites change, very little will likely need to be done to the extraction engine in order to keep up with the changes.
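As a minimal, hypothetical illustration of that mapping step, the sketch below stubs out the extraction engine's output as a list of named fields and slots them straight into a throwaway table. The field names, values, and table are assumptions for the example:

```python
import sqlite3

# Pretend output of an ontology-driven extraction engine: the engine
# already knows which pieces of the page are the make, model and price.
extracted = [
    {"make": "Tata", "model": "Nexon", "price": 950000},
    {"make": "Maruti", "model": "Swift", "price": 700000},
]

# Because the fields are named, mapping them to an existing data
# structure (here a throwaway in-memory table) is straightforward.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_listing (make TEXT, model TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO car_listing (make, model, price) VALUES (:make, :model, :price)",
    extracted,
)
for row in conn.execute("SELECT make, model, price FROM car_listing"):
    print(row)
```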

Disadvantages:

§ Creating and working with such an engine is relatively complex. The level of skill required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is necessary to deal with regular expressions.

§ These types of engines are expensive to build. There are some commercial offerings that provide a basis for doing this type of data extraction, but they still need to be configured to work with the specific content domain being targeted.

§ The data discovery portion of the process still has to be dealt with, and it might not fit well with this approach. Data discovery is the process of crawling through websites so that you arrive at the pages from which the data has to be extracted.

When to use this approach: Typically, ontologies and artificial intelligence are used when there are plans to extract information from a very large number of sources. This makes sense when the data to be extracted is in an unstructured format, for example newspaper classified ads. In cases where the data is very structured, that is, where there are clear labels identifying the various data fields, it makes more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

§ This abstracts most of the complicated stuff away. Some pretty sophisticated things can be done in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

§ This dramatically reduces the amount of time needed to set up a site to be scraped. Once a particular screen-scraping application has been learnt, the time required to scrape sites is significantly lower than with the other methods.

§ There is support from a commercial company. If there is trouble using a commercial screen-scraping application, there are support forums and help lines where assistance can be obtained.

Disadvantages:

§ The learning curve. Each screen-scraping application has its own way of going about things. This may mean learning a new scripting language in addition to becoming familiar with how the core application works.

§ A potential cost. Most of the finished screen-scraping applications are commercial.

§ A proprietary approach. Using a proprietary application (proprietary being a matter of degree) to solve a computing problem may or may not be a big deal, but you should at least consider how well the application will integrate with your other software. For example, once the screen-scraping application has extracted the data, how easy will it be to get at that data from your own code?

When to use this approach: Screen-scraping applications vary widely in ease of use, price, and suitability to handle a broad range of scenarios. Chances are that if you do not mind paying a bit, you can save a significant amount of time by using one. If you only need a quick scrape of a single page, then just about any language with regular expressions will do. If the data has to be extracted from hundreds of web sites that are all formatted differently, then you are probably better off investing in a complex system that uses ontologies and/or artificial intelligence.

Hope you like the Business Intelligence – Tools & Theory series!

If you have not yet subscribed to this blog, please subscribe from the "follow me" tab!

That way you will be updated in real time and get all new posts in your mail daily for free, without any RSS subscription or news reader!!

Happy Learning and Sharing !!

For more information related to the BI world, visit all our Mentalist network blogs:

SQL Server Mentalist … SQL Learning Blog

Business Intelligence Mentalist … Business Intelligence World

Microsoft Mentalist … MVC,ASP.NET, WCF & LinQ

MSBI Mentalist … MS BI and SQL Server

NMUG Bloggers …Navi Mumbai User Group Blog

Architectural Shack … Architectural implementation and design patterns

DBA Mentalist …Advance SQL Server Blog

MVC Mentalist … MVC Learning Blog

Link Mentalist … Daily Best link @ your email

Infographics Mentalist … Image worth explaining thousand Words

Hadoop Mentalist … Blog on Big Data

BI Tools Analysis … BI Tools

Connect With me on

| Facebook |Twitter | LinkedIn| Google+ | Word Press | RSS | About Me |
