Let me start with a quote from Tim Berners-Lee, the inventor of the World Wide Web. He said, "Data is a precious thing and will last longer than the systems themselves." For example, if you look at systems like Google or Yahoo, you will find many who argue that the data those companies have collected over their operation are indeed the most important asset they have. Those data give them the power to either optimize what they do or to go in new directions.
If you look around, you will see that there is so much data available. Let me try to touch on a few types of data.
- Sensors – human activities (e.g. near-field communication), RFID, nature (weather), surveillance, traffic, intelligence, etc.
- Activities on the World Wide Web
- POS and transaction logs
- Social networks
- Data collected by governments, NGOs etc.
The paper Miller, H.J., "The data avalanche is here. Shouldn't we be digging?", Journal of Regional Science, 2010, is a nice discussion of the subject.
Data do come in many shapes and forms. Some of them are moving data, or data streams, while others are at rest; some are public, some are tightly controlled; some are small, and some are large.
Think about a day in your life, and you will realize how much data is around you, data that you know is available but is very hard to access or process. For example, do you know the distribution of your spending? Why is it so hard to find the best deal to buy a used car? Why can't I find the best route to drive right now? The list goes on and on.
It is said that we are drowning in an ocean of data, and making sense of that data is considered the challenge of our time. Come to think of it, Google has made a fortune by solving a seemingly simple problem: content-based search. There are so many companies that either provide data (e.g. maps, best deals) or provide add-on services on top of the data (e.g. analytics, targeted advertising).
As I mentioned earlier, we have two types of data. First, moving data are data streams, and users want to process them in near real time, either to adapt themselves (e.g. monitoring the stock market) or to control the outcome (e.g. battlefield observations or logistics management). The second is data at rest. We want to store them, search them, and then process them. This processing aims either to detect patterns (fraud detection, anti-money laundering, surveillance) or to make predictions (e.g. predicting the cost of a project, predicting natural disasters).
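To make the stream-processing side concrete, here is a minimal sketch of near-real-time processing over moving data: smoothing a stream of stock prices with a fixed-size sliding window. The class and names are illustrative assumptions, not any particular framework's API; real stream processors add time-based windows, fault tolerance, and distribution.

```python
from collections import deque

class SlidingAverage:
    """Toy sliding-window average over a data stream (illustrative sketch)."""

    def __init__(self, size):
        # deque with maxlen automatically drops the oldest value
        # once the window is full
        self.window = deque(maxlen=size)

    def update(self, value):
        """Consume one stream element and return the current smoothed value."""
        self.window.append(value)
        return sum(self.window) / len(self.window)

# Simulate prices arriving one at a time
avg = SlidingAverage(size=3)
for price in [10.0, 12.0, 11.0, 13.0]:
    smoothed = avg.update(price)

print(smoothed)  # average of the last three prices: (12 + 11 + 13) / 3 = 12.0
```

The same shape (consume one element, update a small amount of state, emit a result) underlies much more elaborate stream-processing systems.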
So broadly we have two main challenges.
- How to store and query data in a scalable manner?
- How to make sense of data (how to run the transformations data → information → knowledge → insight)?
And there are many other challenges; the following are some of them.
- Supporting semantics. This includes extracting semantics from data (e.g. using heuristic-based AI systems or statistical methods) and supporting efficient semantics-based queries.
- Supporting multiple representations of the same data. Is converting on demand the right way to go, or should we standardize? Is standardization practical?
- Master data management – making sure all copies of data are updated, and that any related data is identified, referenced, and updated together.
- Data ownership, delegation, and permissions.
- Privacy concerns: unintended use of data and the ability to correlate too much information.
- Exposing private data in a controlled manner.
- Making data accessible to all intended parties, from anywhere, at any time, from any device, in any format (subject to permissions).
- Making close to real-time decisions with large-scale data (e.g. targeted advertising); in other words, how to make analytical jobs faster.
- Distributed frameworks and languages for large-scale data processing tasks. Is MapReduce good enough? What about other classes of parallel problems?
- Ability to measure the confidence associated with results generated from a given set of data.
- Making decisions in the face of missing data (e.g. lost events). Regardless of the design, some data will be lost while monitoring a system. Decision models still have to work, either ignoring or interpolating the missing data.
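As a small illustration of the MapReduce model mentioned in the list above, here is the canonical word-count example sketched in plain Python. The function names are my own; a real framework such as Hadoop runs each phase in parallel across many machines, but the three-phase shape (map, shuffle/group, reduce) is the same.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each word's list of counts into a total."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data at rest and data in motion"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["data"])  # "data" appears three times across the documents
```

The appeal of the model is that both the map and reduce phases are embarrassingly parallel; the open question raised above is how well this shape fits problems that are not so cleanly decomposable.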
I am not trying to explain the solutions here, but I hope to write future posts discussing the state of the art for some of these challenges.