The Billion Dollar Mission to Integrate Language & Technology
Imagine, for a moment, three rooms: two each containing a single human being, and the third a computer.
The roles for each are well defined, and each remains separate from the other. One human must ask a series of questions, which the computer and the other human must answer. As each answer arrives, and after enough time passes, the questioner must then select which answer belonged to the computer, and which belonged to the human. The questioner’s choices are then assessed. Should they have chosen incorrectly a third of the time (or more), the computer is considered “as human” as its counterpart.
It has passed the Turing test.
Since Alan Turing conceptualised the test in 1950, our sense of what defines awareness in a machine, let alone its capacity for ‘humanity’, has shifted dramatically. Nowadays, bots that have ‘passed’ (or cheated) the Turing test don’t really seem to be intelligent. We’ve since hypothesised that simply answering questions is not a sufficient definition of true understanding and that real intelligence lies in an entity’s relationship with creativity.
(Alan Turing, Oxford, 1936)
It’s been a long time since 1950. Alan Turing predicted that we’d have many programs passing the test by 2000, so why hasn’t this been the case? Why is it so difficult for a computer to communicate as a human would? Natural Language Processing (or NLP) aims to bridge this gap between man and machine.
NLP tools help process, analyse, and report on the swathes of unstructured text or voice data in the world. The field sits at the intersection of statistics, linguistics, and computer programming. Using machine learning and neural networks to train models of natural language, NLP attempts to solve difficult tasks such as translation, sentiment analysis, summarisation, and more. It also powers more everyday tasks, and you almost certainly use some form of NLP every day: spam filters, autocomplete, spell check, voice assistants.
“If a lion could talk, we couldn’t understand it…”
“Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo” is a valid grammatical sentence (it means “Buffalo bison, that other Buffalo bison bully, also bully Buffalo bison”).
‘The old man the boat’ makes sense, but only after a few readings (‘old’ is the noun, and ‘man’ is the verb, so it reads more like “The old people man the boat”).
“My stomach hurts” means something very different about my mood if you know that I was laughing while I said it.
Language is strange and difficult. The limits of your language are often the limits of your world, so to speak. The words and symbols we use to describe, explain and express allow us to share our individual realities with each other. In its most common role (communication), the meaning of words is often obscured, amplified and punctuated by body language, prosody, or semiotics. But we don’t all pick up the same cues, and as much of meaning is garnered from personal experience, our understanding is not uniform across the board. The same words mean different things in different cultures.
And then there’s slang, which comes and goes (and sometimes stays).
References can be completely nonsensical without their context, and some metaphors seem meaningless unless you already understand what they are saying. The movie ‘Arrival’, which was based on the short story by Ted Chiang called ‘Story of Your Life’, explores the ‘Sapir-Whorf Hypothesis’ – which states that our language influences our worldviews, thoughts and actions.
Despite the ever-evolving complexity of language, humans find using it remarkably simple. Our mastery of language as a tool of expression has enabled us to translate complex problems into relatively simple-to-understand concepts and sentences, and this has allowed societies to propagate and flourish.
We humans can teach each other language, but can we create a machine that understands it the same way?
Just how important is data?
Data is a collection of values that becomes information once analysed and used. Until recently, we could only really use data that was stored in organised, predefined formats (usually in relational databases or spreadsheets). Referred to as structured data, these could be words that were stored in character strings of variable length, dates that follow one of many formats (DD-MM-YY, MM/DD/YYYY), prices that are numeric values attached to a currency symbol, or many more. We did (and continue to do) amazing things with structured data. It’s the basis of the stock exchange. We sent people to the moon with it. We test and approve medicine through statistical testing.
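To make the idea of a predefined format concrete, here is a minimal sketch in Python (a language chosen for illustration; the article does not name one). The helper names `parse_date` and `parse_price` are hypothetical, and the sketch assumes prices carry a single-character currency symbol.

```python
from datetime import date, datetime

# The two date layouts mentioned above: DD-MM-YY and MM/DD/YYYY.
DATE_FORMATS = ("%d-%m-%y", "%m/%d/%Y")


def parse_date(text):
    """Try each known layout in turn; structured data only works if one fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


def parse_price(text):
    """Split a price such as '$19.99' into (currency symbol, amount)."""
    return text[0], float(text[1:])


print(parse_date("25-12-21"), parse_date("12/25/2021"), parse_price("$19.99"))
```

Anything that does not fit one of the expected layouts is rejected outright, which is exactly why free-form text is so much harder to work with.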
However, today, structured data comprises only about 20% of global enterprise data. All those file cabinets filled with customer data, long email chains, and recorded phone calls were difficult to analyse at scale. They couldn’t be slotted into a spreadsheet or tallied up in an SQL database. These are forms of ‘unstructured data’, and are a significant portion of what is normally referred to as ‘big data’. Or large data. Or extremely, insanely massive data: 500 million tweets are made, 1 billion hours of YouTube are watched, 4.4 million blog posts are written, and 300 billion emails are sent, every single day. The volume of data in the world is predicted to grow to 163 billion terabytes by 2025, and the data analytics market to a value of USD 77.6 billion.
Unstructured data was barely usable until the early 2000s. We didn’t have the technology to comb through fields of text, digitise it, and understand it, without doing it manually. Imagine going through massive drawers of customer feedback, identifying the most common issues, and reporting them to relevant stakeholders.
Today, that task is quite common. You could scan the papers and use an OCR (Optical Character Recognition) tool to convert them to text. A Topic Analysis tool would help identify common themes among the feedback, and a Sentiment Analysis tool could identify whether the feedback is negative or positive.
What has changed?
Accelerated progress as a result of Machine Learning
The rise of machine learning accelerated a move away from the traditional theoretical approach to understanding language, towards a computational one. Algorithms now essentially learn the rules of language, by themselves, from the text that already exists in books and documents (corpus linguistics). By leaving them to figure out the quirky bits of language, huge amounts of text data can be statistically modelled and analysed.
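As a rough illustration of learning from existing text, the sketch below counts which word tends to follow which in a tiny, made-up corpus. This is a simple bigram model, far smaller than anything used in practice, and the names `train_bigrams` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus; a real system would learn from millions of sentences.
CORPUS = [
    "the fire is hot",
    "the ocean is water",
    "the fire is hot and bright",
]


def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]


model = train_bigrams(CORPUS)
print(predict_next(model, "is"))  # the word seen most often after "is"
```

No grammar rules are written down anywhere in this program. Whatever it "knows" comes from counting the text it was given.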
When you think “fire”, you probably think “hot”. When you think “ocean”, you probably think “water”. Creating these associations is essentially what machine learning is all about. If you write a program and define the hex code “#FFFFFF” as the representation of the colour white, then you can easily identify blank images – they would be those with 100% white pixels. You could then tweak the program a bit to have it identify whether the image is in colour or is black and white. You could even tell whether an image is mostly red, blue, or green.
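The hand-written image checks described here might look something like the following sketch. It assumes an image is simply a non-empty list of (red, green, blue) tuples; the function names are made up for illustration.

```python
WHITE = (255, 255, 255)  # RGB equivalent of the hex code #FFFFFF


def is_blank(pixels):
    """True if every pixel is pure white."""
    return all(pixel == WHITE for pixel in pixels)


def is_greyscale(pixels):
    """True if every pixel has equal red, green and blue values."""
    return all(r == g == b for r, g, b in pixels)


def dominant_channel(pixels):
    """Return 'red', 'green' or 'blue', whichever channel has the largest total."""
    totals = [sum(channel) for channel in zip(*pixels)]
    names = ("red", "green", "blue")
    return names[totals.index(max(totals))]


image = [(200, 10, 10), (180, 50, 20), (255, 255, 255)]
print(is_blank(image), is_greyscale(image), dominant_channel(image))
```

Every rule here was spelled out by the programmer; nothing is learned from data.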
These are relatively simple programs, and can hardly be called machine learning, but what if you wanted to identify whether each image contained a car? In this case, you would need a slightly more advanced process. You would need to label a set of images as either having a car or not and use a machine-learning algorithm. The algorithm would then go through this set of information, called “training data”, in order to identify what exactly it is about the pixels of the “has-car” images that gives them that label. It would look at the patterns of density, colour, edges, slopes (and much more) in the images, and create a probabilistic model to determine whether an image would have a car in it.
One of the most valuable things about machine learning is that, using deep learning, you do not need to identify these patterns yourself. You provide the data, and the algorithm uses statistical inference to discover them automatically. Once this is done, you can provide any image to the algorithm and have it classify it as a “has-car” image or a “does-not-have-car” image. You can then train it on more data and tweak the algorithm in order to improve its accuracy. In a nutshell, this is what machine learning is.
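The train-then-classify loop can be sketched with a perceptron, one of the simplest learning algorithms and a distant ancestor of deep networks. To keep the example short, each image is reduced to two made-up numbers (an edge density and a count of round shapes); a real deep learning system would work from the raw pixels instead. All data and names below are hypothetical.

```python
# Hypothetical training data: (edge density, round shapes) -> 1 for "has-car", 0 otherwise.
TRAINING_DATA = [
    ((0.9, 4.0), 1),
    ((0.8, 3.0), 1),
    ((0.7, 4.0), 1),
    ((0.2, 0.0), 0),
    ((0.3, 1.0), 0),
    ((0.1, 0.0), 0),
]


def predict(weights, bias, features):
    """Classify one example: 1 if the weighted sum is positive, else 0."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


def train(data, epochs=20, learning_rate=0.1):
    """Nudge the weights after every wrong answer until the labels are reproduced."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, label in data:
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train(TRAINING_DATA)
print(predict(weights, bias, (0.85, 4.0)))  # classify an unseen example
```

The program is never told which feature matters or where the threshold lies; it adjusts its weights from the labelled examples alone.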
Similarly, by creating an algorithm that is trained on large corpora of text data (or voice data), you can train models to predict whether a piece of text is positive or negative, what the topic of the text is, or whether it matches a certain consensus. You could have it learn to translate the text, summarise it, or predict what the next paragraph would look like. These tasks can be incredibly valuable to businesses. Support tickets can be quickly categorised and assigned to team leaders. The satisfaction of internal and external stakeholders can be identified, and core themes within communication channels that require attention can be easily found. Translating content & media through machine learning is incredibly valuable for scaling a business. Speech recognition and summarisation allow for efficient minute taking, and speed up microtasks across the board, saving valuable minutes. It’s no wonder that AI-powered NLP services are a multi-billion dollar industry.
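As one possible sketch of ticket categorisation, the example below trains a small Naive Bayes classifier, a common baseline for text classification, on four made-up support tickets. The ticket texts, labels, and function names are all assumptions for illustration, and a production system would need far more data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled support tickets.
TICKETS = [
    ("i was charged twice on my invoice", "billing"),
    ("refund the payment on my card", "billing"),
    ("the app crashes when i log in", "technical"),
    ("error message when i open the app", "technical"),
]


def train(examples):
    """Collect word frequencies per label, ticket counts per label, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        words = text.split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary


def classify(model, text):
    """Pick the label with the highest log prior plus smoothed log likelihood."""
    word_counts, label_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TICKETS)
print(classify(model, "my card was charged"))
```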
Natural Language Processing is a collection of subtasks that are powered by machine learning, dictionaries, and neural networks. Some of these tasks are powerful enough to be products themselves, and others behave like helpers, solving individual, specialised problems. For example, a subtask that identifies the topics present in a piece of text is incredibly useful to businesses, but a stemming algorithm is most useful when applied to text analysis.
There’s no escaping it. The landscape of enterprise tools is changing rapidly, and the core engine for this change is artificial intelligence. As processes are innovated and revised, conversations change. It is important to be aware of what tools are at our disposal. While the following is not an exhaustive list, it is an introduction to the various subtasks of Natural Language Processing that will help us navigate the expanding territory of language, data, and AI.
“Every company has big data in its future, and every company will eventually be in the data business.”
- Thomas H. Davenport
Stemming & Lemmatisation
With language, we can often express the same intent, emotion, or meaning in different ways. Ordinarily, this is not a problem for humans, but it can be quite a challenge for computers. How do we instruct a computer that “I would rather walk there” and “I prefer walking there” mean the same thing?
Stemming and lemmatisation attempt to solve this problem by reducing each word to its root form, known as a stem or lemma, respectively. Stemming tries to cut off suffixes, prefixes, or other inflections in a quicker, rule-based manner, while lemmatisation uses a contextual, dictionary-based approach instead.
For example, compare the sentences “By that reckoning, it would take a long time to walk there.” and “Walking there will be your reckoning.” In this case, stemming would miss that the word “reckoning” means something very different in the second sentence, and would still try to reduce it to the form “reckon”. By using a corpus to infer context, lemmatisation would not have this problem.
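The difference between the two approaches can be sketched as follows. The suffix list and the lemma dictionary are deliberately tiny and entirely hypothetical; here the "context" is reduced to a part-of-speech hint, which is cruder than the sense-level distinction described above.

```python
SUFFIXES = ("ing", "ed", "s")  # a deliberately tiny rule set


def stem(word):
    """Rule-based: chop off the first matching suffix, ignoring meaning."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("walking", "verb"): "walk",
    ("reckoning", "verb"): "reckon",
    ("reckoning", "noun"): "reckoning",  # as a noun, the word is its own lemma
}


def lemmatise(word, part_of_speech):
    """Dictionary-based: look the word up together with its role in the sentence."""
    return LEMMAS.get((word, part_of_speech), word)


print(stem("reckoning"), lemmatise("reckoning", "noun"))
```

The stemmer always produces "reckon", while the lemmatiser can keep the noun "reckoning" intact because it knows how the word is being used.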
Part of speech tagging is the process by which the function of each word in a sentence is identified. In the sentence “I prefer walking there”, the lemma “walk” functions as a verb. However, in the sentence “I prefer a walk there”, it functions as a noun.
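A toy tagger for the two "walk" sentences might look like this. It relies on a hypothetical mini-lexicon and a single context rule; real taggers are trained statistically on large annotated corpora.

```python
# Hypothetical mini-lexicon: the tags each word can take, most likely first.
LEXICON = {
    "i": ["pronoun"],
    "prefer": ["verb"],
    "a": ["determiner"],
    "walk": ["verb", "noun"],
    "walking": ["verb"],
    "there": ["adverb"],
}


def tag(sentence):
    """Return a list of (word, tag) pairs for a whitespace-separated sentence."""
    tagged = []
    for word in sentence.lower().split():
        options = LEXICON.get(word, ["unknown"])
        choice = options[0]
        # Context rule: directly after a determiner, read the word as a noun if it can be one.
        if tagged and tagged[-1][1] == "determiner" and "noun" in options:
            choice = "noun"
        tagged.append((word, choice))
    return tagged


print(tag("I prefer walking there"))
print(tag("I prefer a walk there"))
```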
Consider the example we saw earlier in the article: “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo”, which means “Buffalo bison, that other Buffalo bison bully, also bully Buffalo bison”. We know what this sentence means by identifying the part of speech of each word, even though the words are spelled the same.
The following image, known as a ‘parse tree’, depicts this visually, and is how a program can understand the sentence:
NP = noun phrase
RC = relative clause
VP = verb phrase
PN = proper noun
N = noun
V = verb
Entities refer to any ‘thing’ that exists in the real world, whether conceptual or physical. Whether it is a person, object, place, number, address, or an idea, an entity allows us to distinguish one thing from another. For example, we can know that ‘Apple’ is a different corporation from ‘Microsoft’, and that ‘Apple’ the company is different from ‘apple’ the fruit. Entity recognition tries to extract and classify these objects and concepts from unstructured data.
By establishing a database of entities and their relationships with each other (also known as a ‘knowledge graph’), we can comb through lots of unstructured data and extract the entities that are mentioned within. This has incredible applications and economic value. It is a very important part of search algorithms, as people often search for information about things and concepts – or entities. Some other examples of the value of named entity recognition are in categorising articles, classifying support tickets, and providing recommendations.
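A minimal sketch of extracting entities with such a database is shown below. The three-entry knowledge graph is hypothetical, and telling ‘Apple’ from ‘apple’ purely by capitalisation is an assumption made to keep the example short; it would fail on a sentence that starts with the fruit.

```python
import re

# Hypothetical knowledge graph: entity name -> type and related entities.
KNOWLEDGE_GRAPH = {
    "Apple": {"type": "corporation", "related": ["Microsoft"]},
    "Microsoft": {"type": "corporation", "related": ["Apple"]},
    "apple": {"type": "fruit", "related": []},
}


def extract_entities(text):
    """Return (name, type) for every word that matches a known entity."""
    found = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token in KNOWLEDGE_GRAPH:
            found.append((token, KNOWLEDGE_GRAPH[token]["type"]))
    return found


print(extract_entities("Apple and Microsoft both sell software, but an apple is a fruit."))
```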
The Turing test has shown that it is extremely difficult for a computer to answer questions the same way a human does. However, as recent technologies have shown, a computer does not need to be indistinguishable from a human in order to provide valuable answers to questions.
By using some of the concepts outlined above, a machine learning algorithm can comb through large corpora of text to find the entities that match those mentioned in a question and find the information about them that is requested.
Search engines can be considered answer engines as they go through a very similar process to display results for a search query. While it is not always an ‘answer’ that is displayed, search engines can still often answer queries related to the time, weather, and a lot more, thanks to question answering algorithms. Voice assistants also use question answering algorithms to provide quick answers to user queries.
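The idea of matching the entities in a question and looking up the requested information can be sketched as follows. The `FACTS` table and the `answer` function are hypothetical, and the sketch only handles questions that literally contain an entity name and an attribute name.

```python
import re

# A hypothetical, hand-built fact table standing in for a large corpus.
FACTS = {
    "Apple": {"founded": "1976", "headquarters": "Cupertino"},
    "Microsoft": {"founded": "1975", "headquarters": "Redmond"},
}


def answer(question):
    """Find a known entity and a known attribute in the question, then look up the fact."""
    tokens = re.findall(r"[A-Za-z]+", question)
    entity = next((t for t in tokens if t in FACTS), None)
    if entity is None:
        return None
    attribute = next((t.lower() for t in tokens if t.lower() in FACTS[entity]), None)
    if attribute is None:
        return None
    return FACTS[entity][attribute]


print(answer("When was Microsoft founded?"))
print(answer("Where is the Apple headquarters?"))
```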
Sentiment analysis is the process of assigning an emotional context to a piece of unstructured data like voice or text. We can then infer its intent, or make decisions about its implications.
Sentiment analysis can be an incredibly useful but complex task, as it varies greatly with the domain it is used within. Contexts and cultural signifiers can heavily skew how a machine learning algorithm classifies the emotion of a text. For example, “You are smashing it with this product”, and “You make me want to smash this product” have similar entities and lemmas within them, but convey opposite sentiments. Machine learning is used alongside various rule-based processes to classify and understand this data better.
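To see why context matters, consider this small rule-based sketch. The phrase and word scores are made-up assumptions; the point is only that checking multi-word phrases before single words lets the two "smash" sentences land on opposite sides.

```python
# Hypothetical lexicon. Phrases are checked before single words so that
# "smashing it" (praise) is not confused with "want to smash" (anger).
PHRASES = {"smashing it": 2, "want to smash": -2}
WORDS = {"love": 1, "great": 1, "hate": -1, "broken": -1}


def sentiment(text):
    """Score a text as 'positive', 'negative' or 'neutral' using the lexicon above."""
    remaining = text.lower()
    score = 0
    for phrase, value in PHRASES.items():
        if phrase in remaining:
            score += value
            remaining = remaining.replace(phrase, " ")
    for word in remaining.split():
        score += WORDS.get(word.strip(".,!?"), 0)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("You are smashing it with this product"))
print(sentiment("You make me want to smash this product"))
```

A hand-written lexicon like this quickly runs out of phrases, which is where a model trained on labelled examples becomes useful.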
Sentiment analysis has revolutionised customer support and feedback systems. By being able to analyse many more avenues of conversation, such as phone calls, social media, and email, millions of dollars of value are generated for businesses and services.
In this article, we explored how computer science, data science, and linguistics come together to provide enormous economic value through natural language processing. We explored how challenges in its history were overcome, and the benefits it brings to society today. We explained a few sub-processes that aid NLP while being independently valuable and touched on how companies can use AI to improve their business ventures. AI has been said to be the next electricity, as it enhances and transforms industries around the world. It is important that we stay active in this conversation as it evolves, and learn how to use this technology ourselves.
In the next article, we will explore how machine learning and NLP aid the digital marketing industry, with a strong focus on their applications and importance within search engine optimisation.