Transforming Unstructured into Structured Data



Daniel Senter
07/18/2012

Have you been reading about Big Data? If you have, you'll have come across three words used to describe it. If you haven't, here they are:

- Volume

- Velocity

- Variety

The concept of developing processes to manage the increasing 'volume' and 'velocity' of data almost seems conceivable. However, from a process excellence point of view I'm specifically interested in the 'variety', as this relates to two data types: structured data and unstructured data.

Structured data is relatively simple and easy to use in process improvements, as it generally resides in databases in the form of columns and rows. It is grouped into relations or classes based upon shared characteristics. Each class is generally allocated attributes (data descriptions) that help in ordering and logically grouping the data. Finally, it can be described by predefined formats (string or value) with predefined character lengths.

This makes structured data a good starting point for anyone looking for robust data from which to create information and form meaningful insights. Structured data can be queried and analysed (sorted, grouped, filtered, counted and summed) in order to answer business questions or measure process capability. Whilst this doesn't account for the validity of the data, it does make the data relatively easy to verify and observe. Structured data forms a large part of the data used by many in process improvements; however, this is quickly changing as the dominance of unstructured data increases.
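
As a minimal sketch of the operations just described (filter, group, count, sum and sort), the Python below works on a handful of rows. The order records, the column names and the `summarise_by_region` helper are all invented for illustration; in practice the same question would more likely be put to a database.

```python
from collections import defaultdict

# Hypothetical order records; column names and values are made up for illustration.
orders = [
    {"region": "North", "status": "complete", "amount": 120.0},
    {"region": "South", "status": "complete", "amount": 80.0},
    {"region": "North", "status": "cancelled", "amount": 50.0},
    {"region": "North", "status": "complete", "amount": 30.0},
]


def summarise_by_region(rows):
    """Filter to completed orders, then count and sum the amounts per region."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for row in rows:
        if row["status"] != "complete":  # filter
            continue
        group = summary[row["region"]]   # group
        group["count"] += 1              # count
        group["total"] += row["amount"]  # sum
    # Sort the regions by total amount, largest first.
    return sorted(summary.items(), key=lambda item: item[1]["total"], reverse=True)


region_summary = summarise_by_region(orders)
for region, stats in region_summary:
    print(region, stats["count"], stats["total"])
```

Because every row has the same named fields, each step is a one-line operation. That is the property unstructured data lacks.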

Unstructured data is a generic term used to describe data that doesn't sit in databases and is a mixture of textual and non-textual data. Unstructured non-textual data generally relates to media such as images, video and audio files. As the volume of this type of data increases through the use of smartphones and mobile Internet, the need to analyse and understand it grows too. Slightly less unwieldy is unstructured textual data, made up of files (documents, spreadsheets, presentations), email messages and an array of other files generated and stored on corporate networks.

As unstructured data resides on corporate networks, within collaboration tools and in the cloud, it can be extremely difficult to interrogate or even locate. In order to search the data, processes need to be in place to help tag and sort it. This step is key to allowing semantic searching against key words or contexts. Unstructured data is being utilised in a big way by social media companies wanting to understand their markets and customers in more depth. This presents the same opportunity to many of our businesses: to better understand not only their customers, but also their internal operations.
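
To illustrate the tagging step, here is a minimal Python sketch that tags each document with the key words it contains and builds a lookup from tag to documents. The file names, the texts and the `KEYWORDS` set are hypothetical, and this is plain key word matching rather than true semantic search, which would need considerably more than this.

```python
from collections import defaultdict

# Hypothetical documents and key word list, invented for illustration.
documents = {
    "email_001.txt": "Customer reports a late delivery and asks for a refund.",
    "notes_q2.doc": "Delivery times improved after the warehouse process change.",
    "memo_17.txt": "Refund requests fell this quarter.",
}
KEYWORDS = {"delivery", "refund", "warehouse"}


def tag_document(text, keywords=KEYWORDS):
    """Return the sorted key words that appear in the text."""
    words = {word.strip(".,!?").lower() for word in text.split()}
    return sorted(words & keywords)


def build_index(docs):
    """Map each tag to the list of documents that carry it."""
    index = defaultdict(list)
    for name, text in docs.items():
        for tag in tag_document(text):
            index[tag].append(name)
    return index


index = build_index(documents)
print(index["refund"])
```

Once documents carry tags, locating everything that mentions refunds becomes a lookup rather than a trawl through file shares.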

A recent IDC report predicted that the volume of digital content in 2012 would increase by 48% on 2011 figures to over 2.7 zettabytes (ZB), continuing to an estimated 7.9 ZB by 2015! Over 90% of this data is estimated to be unstructured, which highlights the need to develop robust methods to understand and analyse the embedded information.

The challenge for businesses is to develop processes that apply structure to unstructured data. For example, determining the level of customer satisfaction by analysing emails and social media may involve searching for words or phrases. These words and phrases may then be grouped into positive, negative or neutral classifications.

At this stage the unstructured data is transformed into structured data: the words found are assigned a value based on their classification. A positive word may equal 1, a negative word -1 and a neutral word 0. The resulting data can now be stored and analysed as you would any structured data. Much more work is needed to analyse unstructured non-textual data, and many of the big vendors are working on solutions.
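
The scoring described above can be sketched in a few lines of Python. The word lists, the sample messages and the function names are assumptions made for illustration; a real classification would need a far larger vocabulary and some handling of context (negation, sarcasm and so on).

```python
import string

# Hypothetical word classifications; any word not listed is treated as neutral.
POSITIVE = {"great", "happy", "excellent", "thanks"}
NEGATIVE = {"late", "broken", "poor", "disappointed"}


def score_word(word):
    """Return 1 for a positive word, -1 for a negative word and 0 otherwise."""
    if word in POSITIVE:
        return 1
    if word in NEGATIVE:
        return -1
    return 0


def score_message(text):
    """Lower-case the text, strip punctuation and sum the word scores."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return sum(score_word(word) for word in cleaned.split())


# Hypothetical customer messages.
messages = [
    "Thanks, the service was excellent!",
    "My order arrived late and the box was broken.",
    "Please send me the invoice.",
]

# Each free-text message becomes a structured row with a numeric score.
rows = [{"message": message, "score": score_message(message)} for message in messages]
for row in rows:
    print(row["score"], row["message"])
```

Each message is now a row with a numeric column, so it can be sorted, grouped, counted and summed like any other structured data.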

I believe the businesses that will get the most out of their unstructured data sources are those that find ways and tools to transform the unstructured into structured data...