Unstructured Data and the Affordances of Digital Technology
In 2010, the IT consultancy Gartner predicted that the amount of data held by enterprises would increase by 800% over the next five years. It also predicted that 80% of this data would be unstructured, i.e. made up of emails, reports, webpages, blogs, comments, videos, audio clips and so on. And this is just the data that organisations hold directly. In addition, there are the open data sets made available by government, and the ever-growing body of data produced online every day via social media, e.g. by the UK’s 10 million Twitter users.
However, there is a problem. Unlike structured data (e.g. data held in spreadsheets and databases), unstructured data (e.g. free-form text such as Word documents and webpages) is difficult for computers to process and analyse. It still requires human input to make sense of this increasingly vast source of rich data, and limited human resources cannot keep pace with the sheer volume of unstructured data.
At Nominet Trust we have started to explore whether new developments in digital technology can help us better understand and make use of unstructured data. We want to know if technologies such as natural language processing and machine learning provide the key to unlocking the full value and range of insights contained in our increasingly vast data sets. Could approaches such as semantic analysis (e.g. computers being able to identify positive or negative feelings towards a person, subject or experience) help both funders and the social enterprises they support better understand their impact, while allowing us all to respond more quickly to emerging needs?
In early February 2015, Nominet Trust brought together a number of social enterprises that are already using digital technology to analyse unstructured data, with the aim of applying their practical experience to the questions above.
During the event we heard from: Kev Kirkland, Data Unity; Jamie Bartlett, Demos; Nick Wilsdon, Youth Music; Sarah Johns, Blackburn Youth Zone; and Nick Hamlin and Marc Maxson, Global Giving.
The main points to come out of the day were:
- Yes we can - everyone was able to present practical examples of where their organisation was already using digital technology to help analyse unstructured data. Jamie Bartlett from Demos and The Centre for the Analysis of Social Media (CASM) shared detailed insights into how bespoke tools and models could be used to analyse massive Twitter datasets to gauge people’s attitudes about specific subjects in real time, far more quickly than traditional polling methods. Sarah Johns from Blackburn Youth Zone shared her experience of using an off-the-shelf product to analyse interactions on Facebook, in order to understand the impact of their work on young people’s personal development. Nick Wilsdon from Youth Music demonstrated how NVivo could be used to manage large sets of monitoring and evaluation material, making it invaluable for identifying emerging themes in funded projects and for finding material for reports. Across these examples new insights were being generated, and the sheer volume of data would have been impossible to process without the use of digital technology.
- Limitations - are significant, and every organisation described problems it had encountered. The main issue is that even the most sophisticated natural language processing tools available cannot yet replace human expertise in understanding the full content, context and nuance of human communication. Specifically:
  - Human use of language is very flexible, and heavily influenced by specific events and immediate context. Fixed classifications and models, no matter how detailed, cannot respond to the nuances of human language in different situations. A good example given by Jamie was a classification framework that can accurately analyse book reviews on Amazon, but does not work well with DVD reviews on Amazon.
  - Computers find it hard to pick up on irony or colloquial uses of negative or positive words, e.g. “that’s sick” can be a good thing.
  - A deep understanding of the subject matter, recent events and sometimes even the people and communities involved gives an expert human analyst the kind of insight into unstructured data that cannot currently be replicated by machines, e.g. knowing that a certain phrase or word may have a very specific or alternative meaning in a particular community.
  - Sentiment analysis, i.e. judging whether a data point (e.g. a tweet) expresses positive or negative feelings, is a common feature boasted by digital tools. However, especially from a social research and evaluation perspective, this really does not tell you very much.
- Off the shelf - products offer some advantages, but also bring their own limitations. On the plus side, they are often easy to use and work ‘out of the box’ without requiring any technical knowledge. This contrasts with open source tools, which can be very useful but require technical skills, e.g. the Natural Language Toolkit, NLTK (Python) or the Stanford NLP Group’s toolkit (Java). On the downside, off-the-shelf products are often ‘black box’ tools, i.e. information goes in and results come out, but the workings are hidden. This means replicating results, sharing models, or configuring tools to specific contexts is often not possible. They can also be very expensive: one example given was a service that was practically useful, but cost c.£300 per month at a not-for-profit rate, too much for a medium-sized social enterprise.
- Bespoke - a key point raised was that analysts need to be able to understand the tools and models they are using, and to configure them to their specific contexts. Analysts and inquirers should shape the form and function of digital tools, not the other way around. For Nick Hamlin at Global Giving it was very clear that any tools they developed and made available to partners had to support their theory of change, i.e. that learning organisations have more impact.
- Potential - as with most emerging technologies, the tools currently available have limitations, but it is still early days, and social enterprises are already starting to explore the potential of a wide range of technologies to address key questions. For Cassie, Digital Manager at the Royal Institution, it’s a question of evaluating the impact of their work, e.g. ‘5000 people watched one of our videos. So what? How did it affect them? How did they respond?’ For Jamie at Demos it is the exciting potential of being able to access a whole new form of user-generated data (e.g. tweets) to carry out social research: data that is unstructured, unprompted and immediate. Kev Kirkland from Data Unity drew attention to the ability of semantic web technology to make search tools much more accurate and better able to find relevant data, while also helping to answer difficult questions that are of interest to social enterprises, such as the relationships between people and things. Furthermore, digital tools such as Data Unity that use semantic web principles can help make sense of datasets that are constantly evolving and changing.
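To make the limitation around irony and colloquial language more concrete, here is a minimal sketch of the simplest kind of sentiment scoring: counting words from fixed positive and negative lists. It is not the method used by any of the speakers or by any particular product; the word lists, the function name `lexicon_sentiment` and the example sentences are all made up for illustration.

```python
import re

# Hypothetical, hand-made word lists for illustration only. Real tools use
# far larger lexicons or trained statistical models.
POSITIVE = {"good", "great", "love", "brilliant", "amazing"}
NEGATIVE = {"bad", "sick", "awful", "hate", "terrible"}


def lexicon_sentiment(text: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


examples = [
    "I love this youth club, the staff are great",  # plainly positive
    "That gig was sick",                # slang: the speaker means it was excellent
    "Oh great, the bus is late again",  # irony: the speaker is annoyed
]

for text in examples:
    print(lexicon_sentiment(text), text)
```

The first sentence gets a positive score, as you would hope. The slang sentence is scored as negative and the ironic one as positive, both the opposite of what a human reader would understand, because a fixed word list has no access to context.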
There are still lots of challenges, and no magic answers, at least not yet. However, what became clear over the course of the event were the opportunities and interest amongst the organisations present to share what they were working on. This involved sharing technology, experiences and contacts, but also looking at how resources such as algorithms, models and classifications could be shared and developed collaboratively.
As previously mentioned, sharing and collaboration will be crucial given the complexity of the underlying technology and models, the limited resources available and the real need to ensure that tools and models are developed that reflect the expertise, understanding and needs of the social sector. Moving forward it would be good to see:
- Sharing - of practical experiences of using digital technology to analyse unstructured data, including successes, failures, insights gained and lessons learned
- Open source technology - with sharing extending to tools, models, classification frameworks, queries, algorithms, coding, etc.
- Investment - by social funders and enterprises in technology that is open and configurable, but also easy to use and designed to encourage learning and experimentation
Please do get in touch if there are experiences, insights or questions that you would like to share, or if you would like to be involved in the next stages of this conversation.
Our speakers have produced some really useful tools and insights, which they are more than happy to share, including:
- Data Unity: an open source web tool which lets you explore and visualise data then share discoveries with others
- Global Giving: shared tools and an online resource for collecting and analysing thousands of stakeholder stories
- Demos: reports, examples, blogs and insights from The Centre for the Analysis of Social Media (CASM), a collaboration between Demos and the Text Analytics Group at the University of Sussex.
We’ve also put together a list of resources here that might be of interest, including free to use tools, demos, and practical reflections.