MSBI : BI # 23 : Business Intelligence – Tools & Theory # 15 : Introduction to Data Mining #2 : Architecture of Data Mining & Kinds of Data which can be mined

Hi Folks,

This post is part of Series Business Intelligence – Tools & Theory

The currently running topic for this series is listed below:

Series Business Intelligence – Tools & Theory

>>Chapter 1 : Business Intelligence an Introduction

>>Chapter 2 : Business Intelligence Essentials

>>Chapter 3 : Business Intelligence Types

>>Chapter 4 : Architecting the Data

>>Chapter 5 : Introduction of Data Mining<You are here>

This post continues from my previous post in this series. If you have missed any post, please visit the link below.

We are going to cover the following points in this article:

  • Architecture of Data Mining
  • Kinds of Data which can be mined

Architecture of Data Mining

Now that we are familiar with how DM works, let us study the architecture of DM.

The technical objective in the KDD (Knowledge Discovery in Databases) process is to design an architecture for Data Mining that also addresses the process-related issues. It is assumed that Data Mining is processor-, memory- and data-intensive work which requires constant interaction with the database.

Data Preparation (Extraction, Transformation, Cleansing and Loading) is considered to be outside the scope of the Data Mining architecture. To protect the correctness of the data mining results, however, the Data Preparation process must be completed before the Data Mining process, as explained in the earlier topic. Data preparation includes:

· Extraction: To make sure that the information is pulled from the accurate and reliable ‘master’ source. For example, in an organization, a customer ID and address may be available in three different systems, but they should be taken from the source which is most authentic and complete. The key reference for this purpose is the “Source Systems Mapping”.

· Transformation: Once the data is extracted, a large number and variety of transformation activities are performed to make it ready for loading into the data warehouse/presentation server. This is the most intricate part and the biggest challenge in data warehousing. Example: It is comparable to the end-to-end process of a steel mill, churning out stainless steel billets that have the right mix of iron, nickel and chromium, with each molecule aligned in the desired crystalline form.

· Cleansing: The starting point of data cleansing is knowing the type and extent of the data quality issues. For example, cleansing is needed when there are differing records for the same customer: one record has the right name and address, while the other has the right telephone and fax numbers. We combine the two records so that all the fields are filled in correctly.

· Loading: Data loading starts once the prepared data sets are ready to be loaded into the presentation server. This is considered to be a simple process. The key concern in the ‘Loading’ process is speed, which can be achieved in various ways. Example: Certain ETL tools allow you to extract, transform, and load in one process, so it is not necessary to create intermediate files.
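The four preparation steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular ETL tool works: the two source records, the `customer` table layout and all function names are hypothetical, and a wrong value in a source system is simplified here to a missing (`None`) value. The whole flow runs in one process, without intermediate files.

```python
import sqlite3

# Hypothetical records for the same customer held in two source systems.
crm_record = {"customer_id": 101, "name": "A. Rao",
              "address": "12 Hill Road", "phone": None, "fax": None}
billing_record = {"customer_id": 101, "name": None,
                  "address": None, "phone": "555-0101", "fax": "555-0102"}


def extract():
    """Extraction: pull the records from the (assumed) source systems."""
    return [crm_record, billing_record]


def cleanse(records):
    """Cleansing: merge records sharing a customer_id, keeping the first
    non-empty value found for each field."""
    merged = {}
    for record in records:
        target = merged.setdefault(record["customer_id"], {})
        for field, value in record.items():
            if value not in (None, "") and field not in target:
                target[field] = value
    return list(merged.values())


def transform(record):
    """Transformation: bring a cleansed record into the target column order."""
    columns = ("customer_id", "name", "address", "phone", "fax")
    return tuple(record.get(column) for column in columns)


def load(rows, connection):
    """Loading: write the prepared rows into the target table in one batch."""
    connection.execute(
        "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, "
        "name TEXT, address TEXT, phone TEXT, fax TEXT)"
    )
    connection.executemany("INSERT INTO customer VALUES (?, ?, ?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load([transform(record) for record in cleanse(extract())], connection)
loaded = connection.execute("SELECT * FROM customer").fetchall()
print(loaded)
```

The two partial records end up as a single complete row in the target table, which is the kind of input the Data Mining process expects.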

DM has three layers:

· Database layer with sub-layers of database & metadata.

· Application layer with data management & algorithms.

· Front-end layer for administration, input parameter settings and results display/visualization.


[Figure: Architecture of Data Mining]

The first tier is the Database tier, where the data and metadata are prepared and stored. The second tier is the Data Mining Application, where the algorithms process the data and store the results in the database. The third tier is the Front-End layer, which provides the parameter settings for the Data Mining Application and the visualization of the results in a comprehensible form.

· Database tier: It is not essential that the Database tier is hosted on an RDBMS. It can be a combination of an RDBMS and a file system, or just a file system. For example, the data from source systems may be staged on a file system and then loaded into an RDBMS. The Database tier consists of several layers. The data in these layers interfaces with multiple systems, depending on the activities in which it participates.


[Figure: Database Tier]

· Metadata layer: The Metadata layer is the most frequently used layer. It contains information about sources, transformations, cleansing rules and the Data Mining Results. It forms the backbone for the data in the entire Data Mining Architecture.

· Data Layer: This layer consists of the Staging Area, the Prepared / Processed Data and the Data Mining Results. The Staging Area is used for temporarily holding the data taken from various source systems. It can be kept in any form, e.g. flat files or tables in an RDBMS. This data is transformed, cleansed, combined and loaded into a structured schema during the Data Preparation process. This prepared data is used as the Input Data for Data Mining. The base data may undergo summarization or derivation, based on the business case, before it is presented to the Data Mining Application.

The Data Mining output can be captured in the Data Mining Results layer so that it can be made available to the users for visualization and analysis.

· Data Mining Application: The Data Mining Application has two main components, as shown in the figure:

o Data Manager

o Data Mining Tools/Algorithms


[Figure: Data Mining Tools/Algorithms]

· Data Manager: This layer manages the data in the Database Tier and controls the data flow for data mining. It has the following functionality:

· Manage Data Sets: Organizing the input data is essential for building the DMM (Data Mining Model) and for the final testing and operational tasks. The Data Manager layer helps in dividing the data into multiple sets so that they can be used during the various stages of the Data Mining task. The same applies to the results of the Data Mining task, which might be used for further processing.

· Input Data Flow: The data needs to be extracted from the database in the format required by the Data Mining task. The data flow also needs to be controlled as per the requirements of the Data Mining task, i.e., row by row or bulk load. The Data Mining task may also require data in a specific arrangement (such as itemized data for Associations). A few transformation routines will be needed to change the data from the database format into the required arrangement. Alternatively, the data can be converted within the database itself.

· Output Data Flow: The output generated by the Data Mining task needs to be managed and delivered to the target systems (the Front End or other systems like CRM) in the required data format and according to the data flow specifications.

The Data Manager layer needs to be configurable, depending on the database from which the data has to be extracted and on the Data Mining tool.
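As a rough sketch of the Data Manager duties listed above, the following Python snippet divides the input into two data sets, feeds data either row by row or in bulk batches, and rearranges rows into itemized form for association tasks. All function names and the sample rows are hypothetical; a real Data Manager would sit on top of the database tier rather than an in-memory list.

```python
import random

# Hypothetical input rows: (transaction_id, item) pairs from the prepared data.
rows = [(1, "video"), (1, "game"), (2, "video"), (3, "VCR"),
        (3, "video"), (3, "game"), (4, "game"), (4, "video")]


def split_data_sets(rows, test_fraction=0.25, seed=42):
    """Manage Data Sets: divide the input into a build set and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def row_by_row(rows):
    """Input Data Flow: hand the data to the mining task one row at a time."""
    for row in rows:
        yield row


def bulk(rows, batch_size):
    """Input Data Flow: hand the data to the mining task in batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def to_itemized(rows):
    """Transformation routine: one set of items per transaction,
    the arrangement typically needed for association tasks."""
    baskets = {}
    for transaction_id, item in rows:
        baskets.setdefault(transaction_id, set()).add(item)
    return baskets


build_set, test_set = split_data_sets(rows)
batches = list(bulk(rows, 3))
itemized = to_itemized(rows)
print(len(build_set), len(test_set), [len(batch) for batch in batches])
print(itemized)
```

The output data flow would mirror this in the other direction: the results are collected and handed to the front end or another target system in the format it expects.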

· Data Mining Tools/Algorithms: This is the core component of the complete architecture. The Data Mining Tool comprises various tasks. The main function of a task is to analyze the data and generate the results. Various techniques/algorithms can be used, depending on the business case. These are described in the topic on data mining techniques.

Several tools are available in the market to produce the best possible results, e.g. SAS, Teradata Miner and IBM Intelligent Miner. These tools simply make it easier to apply the algorithms to the input data. But the most significant task, which is always tied to the specific business case, is setting the parameters for the algorithms.

· Front End: It is the user interface layer. It has following prime functionalities:

o Administration

o Input Parameter Settings

o Data Mining Results/Visualization

· Administration: Administration screens for the ETL and Data Mining tasks are usually provided as a part of the products/tools.

These are utilized to manage the following main tasks:

o Data flow processes (Example: Extracts, Loads)

o Data Mining routines

o Error reporting and correction

o User security settings

· Input Parameter Settings: During the Data Mining Model build, iterations are expected. These iterations are needed to refine the model by changing the various parameters involved in the model. To perform a Data Mining task, the user provides the key parameters, then monitors their effect on the results and changes the parameters if needed, based on an understanding and interpretation of the results. This ability is provided in the Front End.

· Data Mining Results: The results of the data mining task have to be formatted and interpreted for the user. The front end is tied to the predefined formats of the exported files generated by the individual Data Mining technique. The user has the flexibility to analyze the results of Data Mining. A reporting service does the job of displaying reports, charts and smart reports (Example: Clusters, Trees, and Networks).
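The iteration described under Input Parameter Settings can be pictured as a simple loop: run the task with one parameter value, look at the result, adjust, and run again. The sketch below assumes a toy task (finding items that appear in at least a given fraction of baskets) with a single hypothetical parameter, `min_support`. The data and names are made up for illustration and do not come from any specific tool.

```python
from collections import Counter

# Hypothetical prepared input: one set of rented items per transaction.
baskets = [
    {"video", "game"},
    {"video"},
    {"video", "VCR", "game"},
    {"game", "video"},
    {"VCR"},
]


def frequent_items(baskets, min_support):
    """Return the items whose share of baskets is at least min_support."""
    counts = Counter(item for basket in baskets for item in basket)
    total = len(baskets)
    return {item: count / total
            for item, count in counts.items()
            if count / total >= min_support}


# The "front end" loop: try a parameter value, inspect the result, adjust.
results = {}
for min_support in (0.3, 0.5, 0.7):
    results[min_support] = frequent_items(baskets, min_support)
    print(min_support, sorted(results[min_support]))
```

A low setting keeps almost every item, while a high setting keeps only the most common one. Choosing the right value depends on the business case, which is why this setting is left to the user.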

Kinds of Data which can be mined

In the previous section we learnt about the Data Mining architecture. Now we will study the kinds of data which can be mined. In principle, data mining is not specific to one type of media or data. Data mining should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data. The challenges presented by different types of data vary significantly.

Data mining is applied to and studied for databases, including relational databases, object-relational databases and object-oriented databases.

Here are some examples in more detail:

· Flat files: These are the most common data source for data mining algorithms, particularly at the research stage. These are simple data files in text or binary format with a structure known to the data mining algorithm. The data in these files can be transactions, time-series data, scientific measurements, etc.

· Relational Databases: A relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. These tables have rows and columns, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects, and is identified by a set of attribute values representing a unique key. For example, consider the relations Customer, Items, and Borrow, representing business activity in a fictional video store, Mega Video Store. The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count. For example, a SQL query to count the videos grouped by category would be:

SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category;
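To try the aggregate query above, here is a small sketch using Python's built-in sqlite3 module. The column layout of the Items table and the sample rows are assumptions made for illustration (the actual Mega Video Store tables are not shown in this post), and the query uses the standard COUNT(*) aggregate.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Assumed layout of the Items relation; only type and category matter here.
connection.execute(
    "CREATE TABLE Items (item_id INTEGER PRIMARY KEY, "
    "title TEXT, type TEXT, category TEXT)"
)
connection.executemany(
    "INSERT INTO Items (title, type, category) VALUES (?, ?, ?)",
    [
        ("Title A", "video", "comedy"),
        ("Title B", "video", "drama"),
        ("Title C", "video", "comedy"),
        ("Title D", "game", "action"),
    ],
)

# Count the videos in each category; ORDER BY makes the output stable.
rows = connection.execute(
    "SELECT category, COUNT(*) FROM Items "
    "WHERE type = 'video' GROUP BY category ORDER BY category"
).fetchall()
print(rows)
```

The game row is filtered out by the WHERE clause, and the remaining videos are counted per category.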

· Data Warehouses: A data warehouse is a repository of data collected from multiple data sources (often heterogeneous) and is intended to be used as a whole under the same unified schema. It makes it possible to analyze data from different sources in one place. Example: Let us suppose that Mega Video Store becomes a franchise in India. If the executive of the company wants to use the data from all stores for strategic decision-making, future direction, marketing, etc., it would be more suitable to store all the data in one site with a homogeneous structure that allows interactive analysis. In other words, data from the different stores would be loaded, cleansed, transformed and integrated together. To facilitate decision making and multi-dimensional views, data warehouses are usually modeled by a multi-dimensional data structure.
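A multi-dimensional view can be pictured as the same facts being totalled along different combinations of dimensions. The sketch below is only an illustration: the store names, categories, quarters and rental counts are invented, and `roll_up` is a hypothetical helper, not a feature of any warehouse product.

```python
from collections import defaultdict

# Hypothetical integrated facts: (store, category, quarter, rentals).
facts = [
    ("Mumbai", "comedy", "2013-Q1", 120),
    ("Mumbai", "drama", "2013-Q1", 80),
    ("Pune", "comedy", "2013-Q1", 60),
    ("Pune", "comedy", "2013-Q2", 90),
    ("Mumbai", "comedy", "2013-Q2", 100),
]
DIMENSIONS = ("store", "category", "quarter")


def roll_up(facts, *dimensions):
    """Total the rentals measure over the chosen dimensions."""
    positions = [DIMENSIONS.index(name) for name in dimensions]
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[position] for position in positions)
        totals[key] += fact[-1]
    return dict(totals)


by_store = roll_up(facts, "store")
by_category_quarter = roll_up(facts, "category", "quarter")
print(by_store)
print(by_category_quarter)
```

Because all stores share one homogeneous structure, the same facts can be viewed per store, per category and quarter, or along any other combination of dimensions.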

· Transaction Databases: A transaction database is a set of records representing transactions, each with a time stamp, an identifier and a set of items. Descriptive data for the items may also be associated with the transaction records. For example, in the video store, the rentals table holds such records. Each record is a rental contract with a customer identifier, a date, and the list of items rented (that is, video tapes, games, VCR, and so on). One typical data mining analysis on such data is the so-called ‘basket analysis’ or ‘association rules’, in which associations between items occurring together or in sequence are studied.
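To give a feel for basket analysis, the sketch below counts how often pairs of items are rented together and derives two common measures for association rules: support (the share of transactions containing both items) and confidence (of the transactions containing the first item, the share that also contain the second). The rental records are hypothetical, and this is a brute-force count over pairs rather than a full association rule mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rentals: (transaction_id, customer_id, date, items rented).
rentals = [
    (1, "C01", "2013-01-05", ["video", "game"]),
    (2, "C02", "2013-01-05", ["video", "VCR"]),
    (3, "C01", "2013-01-12", ["video", "game", "VCR"]),
    (4, "C03", "2013-01-13", ["game"]),
]
baskets = [set(items) for _, _, _, items in rentals]

# How often each item, and each pair of items, appears in a basket.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def support(item_a, item_b):
    """Share of all transactions that contain both items."""
    pair = tuple(sorted((item_a, item_b)))
    return pair_counts[pair] / len(baskets)


def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the share that also
    contain consequent (the rule: antecedent -> consequent)."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


print(support("game", "video"))
print(confidence("VCR", "video"))
print(confidence("video", "VCR"))
```

Note that confidence is not symmetric: in this toy data every VCR rental also includes a video, but only some video rentals include a VCR.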

· Multimedia Databases: These include video, image, audio and text media. They can be stored in extended object-relational or object-oriented databases, or simply on a file system. Multimedia data is characterized by its high dimensionality, which makes data mining even more challenging. Data mining from multimedia repositories may need computer vision, computer graphics, image interpretation, and natural language processing methodologies.

· Spatial Databases: These databases store geographical data such as maps and global or regional positioning. Such databases present new challenges to data mining algorithms.

· Time-Series Databases: These databases contain time-related data, for example stock market data or logged activities. They typically have a continuous flow of new data, which sometimes creates the need for demanding real-time analysis. Data mining in such databases includes the study of trends and of correlations between the evolution of different variables, as well as the prediction of trends and movements of the variables over time.
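The three kinds of analysis mentioned for time-series data (trends, relationships between variables, and forecasts) can be sketched with plain Python. The two price series are invented, the moving average is just one simple way to expose a trend, and the forecast is a deliberately naive straight-line extrapolation, not a recommended forecasting method.

```python
# Hypothetical daily closing prices for two stocks.
price_a = [10, 11, 12, 13, 14, 15]
price_b = [30, 28, 27, 25, 22, 21]


def moving_average(values, window):
    """Trend: average of each run of `window` consecutive values."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def pearson(xs, ys):
    """Relationship: Pearson correlation coefficient of two series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (variance_x * variance_y) ** 0.5


def naive_forecast(values):
    """Forecast: extend the average step between first and last value."""
    average_step = (values[-1] - values[0]) / (len(values) - 1)
    return values[-1] + average_step


trend_a = moving_average(price_a, 3)
correlation = pearson(price_a, price_b)
next_a = naive_forecast(price_a)
print(trend_a, correlation, next_a)
```

Here the first series trends steadily upward while the second moves in the opposite direction, so the correlation comes out strongly negative.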

· World Wide Web: This is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are constantly contributing to its growth and change, and a huge number of users are using its resources daily. Data in the World Wide Web is organized in interconnected documents. These documents can be text, audio, video, raw data, and even applications.

Conceptually, the Web comprises three major components:

o The content of the Web, which includes the documents available.

o The structure of the Web, which covers the hyperlinks and the relations between documents.

o The usage of the Web, describing how and when the resources are accessed.

Data mining on the Web, or web mining, is accordingly often divided into web content mining, web structure mining and web usage mining.
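The three components of the Web listed above can be illustrated with a tiny, made-up site: the page text stands for content, the hyperlinks for structure, and an access log for usage. The page names, texts and log entries are all hypothetical, and simple counting stands in for real mining techniques.

```python
from collections import Counter

# Hypothetical site: page text (content) and outgoing hyperlinks (structure).
pages = {
    "/home": {"text": "video store home", "links": ["/videos", "/games"]},
    "/videos": {"text": "video rental list", "links": ["/home"]},
    "/games": {"text": "game rental list", "links": ["/home", "/videos"]},
}

# Hypothetical access log (usage): (page, time of access).
access_log = [
    ("/home", "09:00"), ("/videos", "09:01"), ("/home", "09:05"),
    ("/games", "09:06"), ("/videos", "09:07"), ("/videos", "09:30"),
]

# Content: which words occur most often across the documents.
word_counts = Counter(
    word for page in pages.values() for word in page["text"].split()
)

# Structure: how many hyperlinks point to each page.
incoming_links = Counter(
    target for page in pages.values() for target in page["links"]
)

# Usage: how often each page was accessed.
hits = Counter(url for url, _ in access_log)

print(word_counts.most_common(2))
print(incoming_links)
print(hits)
```

Each counter looks at a different component of the same site, which is why different mining approaches are applied to content, structure and usage.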

Hope you will like the Business Intelligence – Tools & Theory series!

If you have not yet subscribed to this blog, please subscribe from the “follow me” tab, so that you get all the updates in your mail daily, in real time and for free, without any RSS subscription or news reading!

Happy Learning and Sharing !!

