An Introduction to Data Mining

Anand Tamboli

Data mining is a methodical technique for analyzing data and the behavior it records. Its goal is to extract important, previously unknown information from data, which can help in recognizing patterns or trends.

The biggest challenge is not only obtaining the information but also searching through it to find previously unknown connections and data points. While data mining is not good at telling you "why" data behaves in a certain way, it is an excellent tool for telling you "how."

Comparing Data Mining to the Six Sigma Methodology

In comparison, the Six Sigma methodology can explain why data behaves in a certain way.

Six Sigma is famous for its data-driven approach. During the Measure and Analyze phases of this methodology, rigorous steps are followed to gather and analyze data. These steps typically incorporate well-known tools such as root cause analysis and statistical hypothesis testing.

The Measure and Analyze phases help to identify why things are the way they are. This knowledge in turn can be used to establish linkages between inputs and outputs; these identified linkages can then help to carry out improvements.


Since the quality of results is only as good as the quality and treatment of the data, it is highly recommended to follow the data mining approach rigorously while working on the Measure and Analyze phases.

While Six Sigma itself contains some of the data mining steps, it does not provide detailed guidance on how to perform them.

Data Mining Steps

Data mining consists of four steps: clustering, classification, regression and association rule learning.

However, one more important step is required before actual data mining can start: pre-processing, in which a target data set is assembled. A common source of data is an organization’s database, which often contains garbage or irrelevant data points. Therefore, the target data set has to be cleaned. Cleaning removes noisy records and records with missing information. It is also necessary to validate the integrity of the data points in the set. These steps are essential for obtaining sound results.
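
As a minimal sketch of this cleaning step, the following Python removes records with missing values, implausible values and duplicates. The field names, the sample records and the plausibility ranges are all assumptions made for illustration; a real data set would need its own rules.

```python
# Hypothetical raw records, as they might come out of an organization's database.
raw_records = [
    {"customer_id": 1, "age": 34, "balance": 1200.0},
    {"customer_id": 2, "age": None, "balance": 560.0},  # missing value
    {"customer_id": 3, "age": 29, "balance": -1.0e9},   # implausible value (noise)
    {"customer_id": 1, "age": 34, "balance": 1200.0},   # duplicate of the first record
    {"customer_id": 4, "age": 51, "balance": 80.5},
]


def is_valid(record):
    """Integrity check: every field is present and within an assumed plausible range."""
    if any(value is None for value in record.values()):
        return False
    return 0 < record["age"] < 120 and -1e6 < record["balance"] < 1e7


def clean(records):
    """Keep the first valid record seen for each customer_id."""
    seen_ids = set()
    cleaned = []
    for record in records:
        if not is_valid(record) or record["customer_id"] in seen_ids:
            continue
        seen_ids.add(record["customer_id"])
        cleaned.append(record)
    return cleaned


target_dataset = clean(raw_records)
print([record["customer_id"] for record in target_dataset])  # prints [1, 4]
```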

Once data cleansing is performed, clustering follows. This is the task of discovering structures and groups in the data that are similar in one or more ways. It does not necessarily require prior knowledge about the given data.
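
One common way to discover such groups is k-means clustering, which is used here only as an illustration; the article does not prescribe a particular algorithm. The sketch below works on one-dimensional data, assumes k is at least 2, and uses made-up card balances.

```python
def k_means_1d(values, k, iterations=20):
    """Group numbers into k clusters around their means (requires k >= 2)."""
    ordered = sorted(values)
    # Deterministic start: pick k values spread evenly across the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for value in values:
            # Assign each value to the nearest centroid.
            nearest = min(range(k), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical card balances with two obvious groups.
balances = [120, 150, 130, 900, 950, 1000, 140, 980]
centroids, clusters = k_means_1d(balances, k=2)
print(centroids)  # prints [135.0, 957.5]
```

No labels were supplied: the two groups emerge from the values alone, which is what distinguishes clustering from classification.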

Each data point is then assigned to a class (classification), which generalizes the known structure of the data so that it can be applied to new data. This often helps in narrowing down assessment points, thereby reducing the complexity of the overall data analysis.
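
To illustrate classification, here is a small k-nearest-neighbors sketch: a new point receives the majority label of the closest labeled examples. The choice of algorithm, the two features (card balance in thousands, number of open loans) and the labels are assumptions for this example only.

```python
import math
from collections import Counter

# Hypothetical labeled examples: (card balance in thousands, open loans) -> label.
labeled = [
    ((0.5, 1), "non_responder"),
    ((0.8, 0), "non_responder"),
    ((1.0, 1), "non_responder"),
    ((4.5, 3), "responder"),
    ((5.0, 4), "responder"),
    ((5.5, 3), "responder"),
]


def classify(point, examples, k=3):
    """Return the majority label among the k examples closest to point."""
    nearest = sorted(examples, key=lambda example: math.dist(point, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(classify((0.9, 1), labeled))  # prints non_responder
print(classify((5.2, 3), labeled))  # prints responder
```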

The next step is regression. Regression attempts to find a (typically mathematical) function that models the data with the least possible error. This further generalizes the data set.
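
The simplest instance is fitting a straight line by ordinary least squares, sketched below. The data points are invented for illustration, and a straight line is only one of many functions that regression could fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)

# Sum of squared errors: the quantity that least squares minimizes.
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
print(slope, intercept, sse)
```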

Following regression, association rule learning searches for relationships between variables. For example, a supermarket may gather data on customer buying habits. Using association rule learning, marketers can then determine which products customers frequently buy together and subsequently use this data for marketing purposes.
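
The supermarket example can be sketched with two standard measures: support (the share of baskets containing both items) and confidence (the share of baskets with the first item that also contain the second). The baskets and the thresholds below are made up, and the code only considers pairs of items rather than larger item sets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]


def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) for item pairs above both thresholds."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these baskets, three rules pass the thresholds: bread -> butter, butter -> bread and jam -> bread.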

Data mining is always followed by one final critical step: results validation. Results validation verifies that the patterns produced by the data mining algorithms hold for the wider data set. Not all patterns found by the algorithms are necessarily valid; some show only a weak correlation, or none at all, once they are checked against more data.
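
One simple way to perform such a check, sketched below under assumed data, is to recompute a rule's confidence on a wider data set and reject the rule if the confidence drops too far. The baskets, the two candidate rules and the `max_drop` threshold are hypothetical.

```python
def rule_confidence(baskets, lhs, rhs):
    """Share of baskets containing lhs that also contain rhs (None if lhs never occurs)."""
    with_lhs = [basket for basket in baskets if lhs in basket]
    if not with_lhs:
        return None
    return sum(rhs in basket for basket in with_lhs) / len(with_lhs)


def holds_up(rule, mined_on, wider_set, max_drop=0.25):
    """True if the rule's confidence on the wider set falls by at most max_drop."""
    lhs, rhs = rule
    before = rule_confidence(mined_on, lhs, rhs)
    after = rule_confidence(wider_set, lhs, rhs)
    if before is None or after is None:
        return False
    return before - after <= max_drop


# Hypothetical sample the rules were mined on, and a wider set that includes it.
sample = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"jam", "bread"}, {"jam", "bread"}]
wider = sample + [
    {"bread", "butter"},
    {"jam"},
    {"jam", "milk"},
    {"jam", "cereal"},
    {"bread", "butter", "jam"},
]

print(holds_up(("butter", "bread"), sample, wider))  # prints True
print(holds_up(("jam", "bread"), sample, wider))     # prints False
```

Both rules look perfect on the sample, but only the first survives on the wider set, where the confidence of jam -> bread falls from 1.0 to 0.5.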

An Example of Data Mining in Financial Services

A popular example of data mining is the use of past behavior data to rank customers and choose how to approach them with various offers. For instance, financial institutions have often used these techniques to decide what approach to take when offering new loans and credit cards to customers.

In any financial institution, the company’s internal database captures a large number of customer characteristics, such as card balances, number of open loans, and whether or not a customer has ever responded to a loan offer made by phone call, e-mail or direct mail (an example of clustering).

Data mining thus helps establish common characteristics within available customer data, which can subsequently establish a predictive model (an example of generalization and association rule learning). The financial institution may then use this knowledge to create a new campaign with the hope of increasing its customer base and annual revenue.


With this basic insight into data mining, it is evident that the Six Sigma methodology includes almost all of the data mining steps as part of its rigor, and that practitioners can obtain better results by explicitly performing those steps in the Measure and Analyze phases.