Search giant Google has been so successful at indexing and analysing the web that few companies would be so foolish as to attempt to do the job for themselves.
However, there is plenty of information on the web that businesses could use but that Google does not make available through its many services. For example, the graph of interconnections between websites that mention a particular brand may reveal how information about that brand is disseminated, but Google's web graph is proprietary information.
Giving smaller and less Internet-savvy organisations access to this kind of web analytics is the ultimate aim of an EC-funded project called LEADS (Large-scale Elastic Architecture for Data-as-a-Service).
The three-year project aims to build an experimental system for analysing the public web that companies could, in theory, use as a shared, pay-as-you-go service. Led by the University of Neuchâtel in Switzerland, the LEADS consortium also includes Dresden University of Technology, open source software vendor Red Hat and sports equipment manufacturer Adidas.
"Adidas is interested in seeing how the web reacts to the launch of a new product, for example," explains Dr. Etienne Riviere, a computer scientist at the University of Neuchâtel and the project's coordinator. "But while they are a big company, they are not specialists in the Internet and they don't have the capacity to run analytics like this."
The plan is to build a web analytics platform that stores and indexes the web, including both historical data and real-time updates. The platform would allow third parties to run analytical queries on a pay-per-use basis, similar to a cloud infrastructure service.
Customers could track, for example, the frequency with which a certain keyword is mentioned in a day, or map the connections between particular sites.
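To make those two example queries concrete, here is a minimal Python sketch. It is not the LEADS platform or its query interface, which the project has not described here. The `PAGES` records, their field names (`url`, `seen`, `text`, `links`) and the two helper functions are hypothetical, chosen only to show what "keyword mentions per day" and "connections between sites" could mean over a small set of crawled pages.

```python
from collections import Counter, defaultdict
from datetime import date
from urllib.parse import urlparse

# Hypothetical crawl records; the field names and contents are illustrative only.
PAGES = [
    {"url": "http://blog.example.com/a", "seen": date(2013, 1, 1),
     "text": "The new Adidas boot is out. Adidas fans approve.",
     "links": ["http://shop.example.org/boot"]},
    {"url": "http://news.example.net/b", "seen": date(2013, 1, 1),
     "text": "Review of the adidas launch.",
     "links": ["http://blog.example.com/a", "http://shop.example.org/boot"]},
    {"url": "http://blog.example.com/c", "seen": date(2013, 1, 2),
     "text": "Nothing about shoes today.",
     "links": []},
]


def mentions_per_day(pages, keyword):
    """Count case-insensitive occurrences of keyword, grouped by crawl date."""
    counts = Counter()
    needle = keyword.lower()
    for page in pages:
        hits = page["text"].lower().count(needle)
        if hits:
            counts[page["seen"]] += hits
    return counts


def site_graph(pages):
    """Map each source host to the set of other hosts its pages link to."""
    graph = defaultdict(set)
    for page in pages:
        source = urlparse(page["url"]).netloc
        for link in page["links"]:
            target = urlparse(link).netloc
            # Skip malformed links and links within the same site.
            if target and target != source:
                graph[source].add(target)
    return dict(graph)


if __name__ == "__main__":
    for day, count in sorted(mentions_per_day(PAGES, "adidas").items()):
        print(day.isoformat(), count)
    for source, targets in sorted(site_graph(PAGES).items()):
        print(source, "->", ", ".join(sorted(targets)))
```

A real service would run such queries over a continuously updated index rather than an in-memory list, but the shape of the answers (a count per day, a graph of hosts) would be similar.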
One of the key challenges the project will examine is how to do this while preserving the privacy of its users. "The idea is that the queries themselves are encrypted," says Riviere. "The computation is using some mathematical constructs that allow you to compare elements without revealing the elements of the query."
What about the privacy of people mentioned on the web? "As soon as something is published on the web, we consider it public," he says. "Our system is no better or worse than what Google and Yahoo! are already doing."
Energy-efficient 'micro clouds'
The analytics platform itself is not the only experimental component of the project. LEADS is also pioneering an energy-efficient distributed computing model, in which heat produced by a number of 'micro clouds' is reused to warm the buildings in which they are located.
This system is based on 'micro cloud' technology from German start-up AoTerra. A 'micro cloud' is a system of a dozen or so servers fitted with heat-recapturing technology.
Recapturing the heat means that the 'micro clouds' can be located near the point of consumption, not in a distant, polar climate.
"Our current network infrastructure is not saturated by cloud computing traffic but it may soon be," explains Riviere. "The problem is that if you ship everything to the North Pole, then you are going to have big problems with the size of the [network] pipes which were not created for this kind of usage."
LEADS is an experimental research project – the real objective is to provoke and then solve as many thorny technical challenges as possible. Nevertheless, Riviere hopes that the answers it spawns may one day go on to make highly sophisticated web analytics available as a service to mainstream companies.
"The main impact we'd like to see is for small companies and companies that are not specialists in the Internet to get access to the kind of analytics they wouldn't have used before," he says.