HOW does a business know who its customers, its employees and its partners really are? There are many different forms of information it can use: an address, a birth date, an account number. But none of these are of much use unless they are paired with the short character strings that we have used to identify individuals since before history began: our names.
And yet despite the central role that names play in distinguishing people from one another, information management systems have to date been singularly poor at handling name data. Data quality tools can quickly discern whether a telephone number has the correct number of digits, or whether a postal code is legitimate. But when it comes to applying the same quality checks to names, the tools have a critical blind spot.
This problem is becoming more pressing as businesses extend their global reach, says Dr Jack Hermanson, whose company Language Analysis Systems (LAS), a specialist in the analysis of names, was acquired by IBM in March 2006.
The leading application of name analysis software has so far been the detection of fraud against public authorities and businesses. Hermanson’s company was originally engaged by the US government to compare the names of people applying for entry into the country against lists of wanted criminals or known terrorists.
A common technique that such people use to evade detection, and one also used in money laundering, says Hermanson, is to give an alternative Anglicisation of a name. Without creating a false identity, wanted men could simply breeze through border controls because the systems checking their names against national security lists had no ‘understanding’ of the original language.
The classic, if unfortunate, example that Hermanson cites to demonstrate the need for more intelligent name analysis systems is the story of Aimal Kansi, a Pakistani militant who, despite being on various wanted lists, entered the US and, after living there for a year, killed two CIA agents.
On entering the country, Kansi had given his family name as ‘Kasi’, an Anglicisation of a Pakistani name. Hermanson believes that had this name been subjected to the software LAS developed, the parallel between ‘Kasi’ and ‘Kansi’ would have been identified, and agents would have had a chance at least to ascertain his true identity.
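The article does not describe how LAS's software scores such a pair. As a rough illustration only, assuming plain edit distance as a stand-in for the linguistic rules the company actually uses, a watch-list check that tolerates one missing letter might look like this. The function names and the one-edit threshold are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def is_possible_match(candidate: str, listed: str, max_distance: int = 1) -> bool:
    """Flag a name for human review if it is within max_distance edits of a listed name.

    The threshold is an assumption made for this sketch, not a documented LAS setting.
    """
    return edit_distance(candidate.lower(), listed.lower()) <= max_distance


print(is_possible_match("Kasi", "Kansi"))
```

An exact-match lookup treats ‘Kasi’ and ‘Kansi’ as unrelated strings; a tolerance of a single edit is enough to surface the pair for review, though a purely character-based measure like this one knows nothing about the original language.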
LAS’s software, to be sold by IBM’s Entity Analytics arm, which Hermanson now heads, was built through the co-operation of computer scientists and linguists. The analysis of name data lags behind other comparable analytical techniques, says Hermanson, because understanding human language is not the kind of task that suits the engineering mindset of most programmers.
“Computer engineers always want to find the solution to a problem. But linguists recognise that with language there is nothing to solve. All we can do is to add very slightly to a body of work,” he says.
Through the co-operation of the two fields, LAS developed algorithms that compare text strings against linguistic logic. Words are broken into their phonetic components, each of which can be written down in many different ways depending upon the linguistic background of the writer.
To demonstrate the complexity of the task, Hermanson describes how a single Chinese name, which has only one incarnation in Chinese script, can be Anglicised in many different ways, from ‘Xue’ to ‘Hsueh’. Only a system that incorporates linguistic rules could identify these as the same name. “There are eleven different ways of spelling Osama Bin Laden in Chinese,” he adds.
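The idea of breaking a name into phonetic components, each with several possible spellings, can be sketched in a few lines. This is a toy under stated assumptions, not LAS's algorithm: the `FRAGMENT_TO_KEY` table is hypothetical, hand-written to cover only the ‘Xue’ and ‘Hsueh’ example, and a real system would need linguist-built rules for each language and romanisation scheme.

```python
# Hypothetical lookup table: each spelling fragment maps to a shared phonetic key.
# It covers only the 'Xue' / 'Hsueh' example and is not a real romanisation table.
FRAGMENT_TO_KEY = {
    "hs": "x",
    "x": "x",
    "ueh": "ue",
    "ue": "ue",
}


def phonetic_key(name: str) -> tuple[str, ...]:
    """Split a name into known fragments (longest first) and return their keys."""
    text = name.lower()
    fragments = sorted(FRAGMENT_TO_KEY, key=len, reverse=True)
    keys = []
    pos = 0
    while pos < len(text):
        for fragment in fragments:
            if text.startswith(fragment, pos):
                keys.append(FRAGMENT_TO_KEY[fragment])
                pos += len(fragment)
                break
        else:
            # Unknown letter: keep it unchanged so unrelated names stay distinct.
            keys.append(text[pos])
            pos += 1
    return tuple(keys)


print(phonetic_key("Xue") == phonetic_key("Hsueh"))
```

Because both spellings reduce to the same sequence of keys, they compare as equal even though the strings share few characters in the same positions, which is the kind of match a purely character-based check would miss.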
LAS’s technology has been applied in many circumstances, from airlines checking that flyers don’t book multiple seats under different variations of their own name in case they miss a flight, to a company searching marketing lists to find women from ethnic minorities.
The company’s Name Inspector tool scans databases for likely errors in name fields. Hermanson believes that name analysis tools deliver most value when applied to giant databases, such as that maintained by one Name Inspector user, the US Postal Service. “On that scale, if you improve accuracy by 1%, you can make millions of dollars’ worth of savings,” he says.
Having started out, like so many other technologies, in the service of US national security, name analysis technology is now moving into the corporate mainstream. For IBM, the LAS acquisition forms part of its Master Data Management strategy, and will bolster its offerings for retail, financial services and healthcare providers, according to Ovum analyst Helena Schwenk. Other vendors, such as Dutch company Human Inference, sell comparable technology to marketing departments.
And the technology will continue to develop as linguistic understanding, and techniques for codifying that understanding, advance. Areas to be refined include the extraction of names from text, which is still a crude art.
“Names are not just character strings, they are little databases that contain a wealth of information,” says Hermanson. “Only once we apply a knowledge-based framework for name analysis do we begin to unlock their potential.”