Dr Lea El Samarji, who has led intelligent industry solutions for EMEA at Avanade since November 2019, is overseeing a project, currently in development, that will improve access to government services. The pilot initiative has focused on improving virtual agents' understanding of local dialects, starting with Singlish, through the use of speech analytics.
This Q&A explores how ethical speech analytics capabilities have increased accessibility, how AI can best represent social nuances, and how bias in models can be managed.
Could you please provide an example of how ethical speech analytics can be put to good use?
For background, the national language of Singapore is Malay, and the four commonly used languages are English, Chinese, Malay, and Tamil. Singapore English refers to the varieties of English native to Singapore, of which there are two main forms: Standard Singapore English and Singapore Colloquial English, better known as Singlish.
Whereas the commonly used languages of Singapore can be supported by today's advanced cognitive services frameworks and tools, Singlish remains a big challenge for a machine to understand and support. We've trained the machine to understand Singlish terms and definitions: we trained the speech-to-text component of the cognitive service with a data set of more than 300 words relating to people, places, and languages.
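The interview doesn't reveal the project's implementation, but the approach described, biasing a speech-to-text component with a custom lexicon of dialect terms, can be sketched minimally. The terms, glosses, and the post-correction step below are all illustrative assumptions, not the actual Avanade/Azure pipeline:

```python
# Minimal sketch (illustrative, not the production system): bias a
# speech-to-text pipeline with a small custom Singlish lexicon.
# Hypothetical sample of the ~300-word lexicon: Singlish term mapped
# to a standard-English gloss used by downstream intent models.
SINGLISH_LEXICON = {
    "makan": "eat",
    "shiok": "great",
    "kopitiam": "coffee shop",
    "lah": "",  # discourse particle, often dropped in normalisation
}

def normalise_transcript(transcript: str) -> str:
    """Replace recognised Singlish terms with their glosses so that
    downstream models trained on standard English still work."""
    words = []
    for word in transcript.lower().split():
        gloss = SINGLISH_LEXICON.get(word, word)
        if gloss:  # skip terms that normalise to nothing
            words.append(gloss)
    return " ".join(words)

print(normalise_transcript("where to makan lah"))  # where to eat
```

In a real Azure-based system, the same custom vocabulary would instead be fed into the trainable speech-to-text model itself, so recognition (not just post-processing) improves.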
In Singapore, there is a need for more virtual assistants built on Microsoft Azure that have been trained and customised for the Singlish dialect, reducing bias for greater inclusion and better performance. The project is now in pilot mode, and the plan is to move to production very soon. We started with Singapore, and we plan to expand to the rest of Asia (Japan and China) and also to test across Europe.
Why does bias relating to company culture sometimes end up being included in models?
When it comes to machine learning solutions, it is teams of people (data scientists, natural language processing specialists, engineers, developers, business people, and others) who design machines to be intelligent and to fulfil the business need. These are the people who must identify the data sets used to train the machines and customise the variables we need to pay attention to if we are to reduce bias.
There are two important things to consider here. First, we create the features that the machine should consider when learning; second, we choose the data sets on which the machine learning algorithm will learn. Bias can exist in both. Consider the first: the features, also called machine learning variables, represent the business and data science expertise. It is the expertise of people working in and for the organisation that shapes the training and feature sets, so it is important to have a diverse team made up of men and women, different cultural and social backgrounds, different ages, and, where possible, different countries. The more diverse the team building the solution, the more diverse the solution and its features will be.
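One practical way bias in features or data surfaces is as a performance gap between groups. The audit below is a self-contained illustration of that idea; the groups and records are invented for the example and are not from the project:

```python
# Illustrative bias audit: compute model accuracy per group. A large
# gap between groups suggests the features or training data
# under-serve one of them. Data here is invented for illustration.
from collections import defaultdict

def accuracy_by_group(records):
    """records: list of (group, predicted, actual) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        correct[group] += int(predicted == actual)
    return {g: correct[g] / total[g] for g in total}

records = [
    ("dialect_a", 1, 1), ("dialect_a", 0, 0), ("dialect_a", 1, 1),
    ("dialect_b", 1, 0), ("dialect_b", 0, 1), ("dialect_b", 1, 1),
]
print(accuracy_by_group(records))
# dialect_a is recognised far better than dialect_b here, which would
# prompt a review of the features and training data for that group.
```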
The second point is the data set, and this is a key topic we're working on with the Universidad Francisco de Vitoria (UFV) in Madrid. We are running a research collaboration with the university, developing training algorithms that use multiple data sets, different languages, and different points of view. We are developing an asset that searches and scrapes the web to gather data reflecting diverse points of view, scraping articles from multiple journals, because we are convinced that the more diverse the data sets used to train AI, the more inclusive the AI solution will be.
How can societal nuances be best represented by artificial intelligence?
Societal nuances are best represented in AI by using more diverse data sets, considering all points of view, and opening up the possibility of searching for and collecting data sets from outside the organisation. The Internet is a good place to mine data because you can find diverse perspectives there. Another way is to have a diverse team working together to develop a less biased AI solution.
How can bias best be removed from models, and in turn, how can a more diverse training data set be ensured?
If an organisation has an existing model that contains bias, we start with a deep dive into the model and the organisation to understand how the bias is generated. We then inspect the features implemented in the algorithm, determine which of them are creating the bias, and try to replace or remove those or add others. Next, we can add more diverse data sets to the training phase, sourced from both inside and outside the organisation. Third, we study the organisation and the team that built the solution, and recommend bringing more diversity to the team where possible. These three measures (adding new data sets, inspecting and modifying the features in place, and recommending a more diverse team) help reduce bias in AI models.
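The second of those steps, removing a feature that is creating bias, can be made concrete with a toy example. Everything below is invented for illustration: the "model" is a trivial all-features-must-pass rule, and `postcode_score` plays the role of a feature that acts as a proxy for group membership:

```python
# Toy illustration of removing a biased feature: compare selection
# rates per group before and after dropping a proxy feature.
# Data, feature names, and the scoring rule are all hypothetical.
def selection_rate(applicants, features):
    """Fraction of each group selected when the decision requires
    every listed feature to be positive."""
    rates = {}
    for group in {a["group"] for a in applicants}:
        members = [a for a in applicants if a["group"] == group]
        selected = [a for a in members if all(a[f] for f in features)]
        rates[group] = len(selected) / len(members)
    return rates

applicants = [
    {"group": "A", "skill": 1, "postcode_score": 1},
    {"group": "A", "skill": 1, "postcode_score": 1},
    {"group": "B", "skill": 1, "postcode_score": 0},
    {"group": "B", "skill": 1, "postcode_score": 0},
]

# With the proxy feature included, group B is never selected:
print(selection_rate(applicants, ["skill", "postcode_score"]))
# Dropping the proxy equalises selection rates across groups:
print(selection_rate(applicants, ["skill"]))
```

In practice the model would be retrained after the feature change and re-evaluated on the more diverse data sets the answer recommends, rather than re-scored with a fixed rule.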