The rapid proliferation of smartphones, the maturation of RFID technology and the increasing use of social networks as a platform for commerce all mean that businesses are bracing themselves for exponential growth in data volumes over the coming years.
But if there is one community that has even more data on its hands than global businesses, it is high-energy physicists. Their work with huge experimental facilities such as the Large Hadron Collider produces phenomenal quantities of data, all of which must be processed and analysed in minute detail.
As head of the Petabyte Storage Group at Oxfordshire’s Rutherford Appleton Laboratory, David Corney works at the sharp end of the high-energy physics community’s ravenous hunger for data. Specialising in Oracle’s database technology, the group collects and manages data from a number of sources, including many petabytes from the LHC, making it available to high-energy physics researchers across the UK.
Information Age spoke to Corney about the challenges that arise from the task, how information technology is helping to expand the boundaries of science, and how the scientific community is figuring out how to share more data.
Information Age: What does the Petabyte Storage Group do?
David Corney: One of our jobs is to make the data from the Large Hadron Collider available for the UK high-energy physics community. If these guys can’t get to their data or our machines go down, we get a lot of heat.
We run a system called the CERN Advanced Storage manager, or CASTOR. We take the data from CERN and shove it onto disk. We have about seven petabytes sitting in front of two tape robots. Some people want it on disk and some want it on tape, and we make it available to them as quickly as possible.
What are the challenges involved?
The rate at which data is starting to accumulate at places like the LHC means that we can just about stay on top of storing it, but what we can’t do is pick out the important data at the same time.
There are periods of stability, when you can use and manage large volumes of data, but at some point the system you are using to do that ages. I don’t just mean the hardware – you can always replace the hardware – but also the software management system. A new technology could come along and make it obsolete.
How do you overcome that?
If you want to change the system you’re running and you’ve got 20 petabytes or whatever, the obvious way to do it is to eliminate the need for data migration by repointing the metadata. You don’t want to have to migrate all of the data and re-ingest it – it’s just far too big an operation.
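To picture what “repointing the metadata” means in practice, here is a minimal sketch. It assumes a hypothetical file catalogue rather than CASTOR’s actual schema: each logical file name maps to a physical location, so moving to a new storage back-end becomes an update to those location entries rather than a copy and re-ingest of the data itself.

```python
# Minimal sketch (hypothetical schema): a catalogue that maps logical file
# names to physical locations. Changing storage systems then means rewriting
# the location entries, not moving petabytes of data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE catalogue (
        logical_name TEXT PRIMARY KEY,   -- the name the physicists use
        location     TEXT NOT NULL       -- where the bytes actually live
    )
""")
conn.executemany(
    "INSERT INTO catalogue VALUES (?, ?)",
    [
        ("lhc/run42/events.dat", "castor://old-pool/a1/events.dat"),
        ("lhc/run42/calib.dat",  "castor://old-pool/a1/calib.dat"),
    ],
)

def repoint(old_prefix: str, new_prefix: str) -> int:
    """Point every entry under old_prefix at the new back-end."""
    cur = conn.execute(
        "UPDATE catalogue SET location = REPLACE(location, ?, ?) "
        "WHERE location LIKE ? || '%'",
        (old_prefix, new_prefix, old_prefix),
    )
    conn.commit()
    return cur.rowcount

# Migrate the namespace, not the data: only metadata rows change.
changed = repoint("castor://old-pool", "newstore://pool-b")
print(changed, "entries repointed")
print(conn.execute("SELECT * FROM catalogue").fetchall())
```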
Sharing Science
How is the ability to handle large volumes of data helping the scientific community?
The computing power that is now available allows you to ask questions that you could never have dreamed of asking before. There are questions that would have taken ten years of processing to answer, but now you can do it in minutes.
Also, there’s the potential of what will arise when disciplines can access other disciplines’ data, make sense of it, mix it with their own and ask questions that they haven’t previously thought of.
For example, if a biophysicist can understand the data from the Natural Environment Research Council (NERC) that describes the ocean’s chemical make-up, that’s hugely beneficial.
Are scientists happy to share their data sets?
There is a notion that has been around for a while that scientists do not just accumulate prestige through publications, but also through accredited data sets. They will present their analysis and conclusions, but also the data set that they used to get their results, for the world to use.
Also, research councils are developing policies that require grant holders to have some sort of data-archiving policy that means that when they finish, their data isn’t going to get lost and will be available for the rest of the world. And the EU is only dishing out money to projects that will make their data available across communities.
So the data is starting to double up. But no-one knows quite how to manage it.
How can different disciplines understand one another’s data?
There are common data formats available. For example, there is a format called NeXus that is used at the ISIS neutron and muon source and at the Diamond Light Source (both experimental facilities in the UK). NeXus is used to describe all aspects of a particular experiment and, because it is well understood, it can be shared across the whole of the protein crystallography community.
If I were a crystallographer and I wanted to make my data available to the whole physics community, I would use NeXus. Other scientists can pick up a NeXus file, know what it is and get on with it.
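NeXus files are built on the HDF5 container format, so a generic HDF5 library can open one without any instrument-specific code. The sketch below, using the h5py library, shows the idea of “picking up a NeXus file and knowing what it is” by listing each group along with its declared NeXus class; the file name is a placeholder and the exact layout will vary by experiment.

```python
# Minimal sketch: walk a NeXus (HDF5) file and report what each part claims
# to be, via the NX_class attribute that NeXus uses to label its groups.
# The file name here is hypothetical.
import h5py

def describe(name, obj):
    """Print each group/dataset along with its NeXus class, if declared."""
    nx_class = obj.attrs.get("NX_class", b"")
    if isinstance(nx_class, bytes):
        nx_class = nx_class.decode()
    kind = "group" if isinstance(obj, h5py.Group) else "dataset"
    print(f"{name:40s} {kind:8s} {nx_class}")

with h5py.File("example_scan.nxs", "r") as f:   # hypothetical file
    f.visititems(describe)
```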
And what about making it available to the wider community?
There are tools being developed to introduce another layer on top of the data that contains the information and explanation that make the data understandable.
For example, there’s a project called CASPAR [Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval] that is looking at ways to do that. We’re still in the early stages, but I think a common standard will emerge eventually.
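One simple way to picture that extra layer – as an illustration only, not CASPAR’s actual model – is a sidecar record that travels with a data file and spells out, for an outsider, what the file contains and how to read it. All of the fields and file names below are hypothetical.

```python
# Illustrative sketch: a sidecar description that makes a plain data file
# understandable to someone outside the discipline that produced it.
import json

description = {
    "data_file": "ocean_chemistry_2011.csv",      # hypothetical data set
    "producer": "NERC survey vessel (example)",
    "columns": {
        "depth_m": "sampling depth in metres",
        "salinity_psu": "salinity on the practical salinity scale",
        "ph": "acidity of the sample",
    },
    "format": "CSV, comma-separated, UTF-8",
    "contact": "data-archive@example.org",
}

with open("ocean_chemistry_2011.description.json", "w") as f:
    json.dump(description, f, indent=2)
```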
Do you think business can benefit from all this?
The scientific community is really focused on doing what it does, which is science, but there’s lots of fallout, such as automated workflows. All of those things are lying by the side of the road for businesses to come along and look at.
There are tools coming out of science, but it is up to business to decide how to use them.