What article on artificial intelligence is complete without a linguistics joke?
An esteemed professor of linguistics addresses his 101 class. “In English, a double negative is a positive. In Russian, a double negative is still a negative. But there is no language we know of in which a double positive can be a negative.” A kid in the back row scoffs, “Yeah, right.”
As machine learning becomes more commonplace every day, we encounter more obstacles. This article will explain some of the issues faced by companies analyzing Big Data sets, and the unusual challenges endemic to training machines to understand language, the written word, and how we communicate with each other.
First, you need data
Finding the right data is the first critical issue. Why? Several reasons, but here are some top ones.
- Humans make mistakes. One of those mistakes can be choosing the wrong dataset to begin with.
- Not choosing enough data to analyze. The smaller your dataset is, the greater the noise and the harder it is for the system to figure out what data is “normal” and what is an outlier.
- Choosing the wrong company to analyze the data. Again, blame humans for this.
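The point about small datasets can be sketched with a toy calculation. This is only an illustration under made-up assumptions: the function name and the values 10.0 and 100.0 are hypothetical, and real noise is messier than one bad reading. It shows how far a single odd value drags the average in a small sample versus a large one.

```python
from statistics import mean

def mean_shift_from_outlier(n, typical=10.0, outlier=100.0):
    """Return how far one outlier moves the mean of n otherwise identical values.

    The numbers are hypothetical; they only illustrate the effect of sample size.
    """
    clean = [typical] * n
    contaminated = [typical] * (n - 1) + [outlier]
    return mean(contaminated) - mean(clean)

# One odd value in 5 readings moves the average by 18.0;
# the same value in 500 readings moves it by about 0.18.
for size in (5, 500):
    print(size, round(mean_shift_from_outlier(size), 2))
```

With five data points, the system cannot easily tell whether 100 is an outlier or just part of a very spread-out “normal.” With five hundred, it stands out clearly.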
Training machines to grasp English slang is like teaching geometry to cats.
The first issue with language is related to the three previous bullet points. English is rather a hot mess as languages go. The language itself is not only a living history of invasions, it evolves daily, adding new terms, throwing out old ones, and even reversing meanings at the whim of the populace.
A good example of a word reversing meaning is ‘literally’. It used to mean that something actually happened. Now it is often used as an intensifier where ‘figuratively’ would be the accurate term (e.g., “My head literally exploded when he took my laptop.”). Many major English dictionaries now list this second definition to accommodate common usage.
Another good example of English as a hot mess: the days of the week. Three of our days are named after celestial bodies (Saturn Day, Sun Day, and Moon Day). The other four are named for Germanic and Norse deities (Tiw’s Day, Woden’s Day, Thor’s Day, and Frigg’s Day).
Thor is our weatherman
The Norse influence can also be seen in how we describe the weather. “It is raining,” we say. The pronoun “it” in that sentence refers to nothing we can name; it is as if Thor were making it rain. This makes no sense logically. We should follow the Cherokee and say, “We are experiencing rain.” But no one ever accused the English language of being logical. We also say “the sun rises and sets,” when we know full well that the Earth is rotating. What must machines think of us?
This sentence no verb.
What happens when a system encounters missing data? In filling out a form, this can be easily remedied by making a form field required. But you can’t require people to speak or write in complete sentences (except in school). How is a machine supposed to understand incomplete sentences, or the single-word replies that are so common these days (e.g., feels, relatable, oof, or bruh)?
Or how will the system respond to intentionally misused grammar, such as that found in memes? Consider a meme showing a picture of a large spider in a bathroom, captioned “So full of nope.” The system may recognize the image as “bathroom AND spider.” Will it be able to determine how a spider or a bathroom can be filled with nope? It’s unlikely. If most humans over 45 can’t grasp this, how can a software system be expected to?
Yet this is the same type of data we often see in Big Data collection. Social media conversations are among the most commonly accessed and analyzed big data sets because they are massive in scale, show sentiment, intent, and topics (all helpful for businesses), and are generally free to access.
Multiple meanings present huge challenges
Many words in English have multiple meanings. Some, like head and round, have over 100 meanings each. A word that is quite challenging to NLP systems is run. Take the following sentences literally:
- “I ran out of wine.”
- “I ran into a problem.”
- “Let me run something past you.”
- “I could really go for a run.”
- “His apartment is run down.”
You can see how this can quickly become an issue for NLP systems. And someone can be “pretty ugly.” We know from experience that pretty is used as an adverb in this case, but from a machine standpoint, this phrase presents two seeming opposites. Without context, or without all meanings available, the system will not interpret the data properly.
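To show why context matters, here is a rough sketch of how a rule-based system might guess the sense of run from the words around it. Everything in it is an assumption for illustration: the patterns, the sense labels, and the function name guess_run_sense are hypothetical and cover only the five sentences above. Real NLP systems learn from far more context than a handful of hand-written rules.

```python
import re

# Hypothetical patterns and sense labels, written only for the examples above.
RUN_PATTERNS = [
    (re.compile(r"\bran out of\b"), "exhaust a supply"),
    (re.compile(r"\bran into\b"), "encounter unexpectedly"),
    (re.compile(r"\brun \w+ past\b"), "ask for an opinion"),
    (re.compile(r"\b(?:a|the) run\b"), "a jog (noun)"),
    (re.compile(r"\brun down\b"), "dilapidated (adjective)"),
]

def guess_run_sense(sentence):
    """Return the first sense whose context pattern matches, else 'unknown'."""
    text = sentence.lower()
    for pattern, sense in RUN_PATTERNS:
        if pattern.search(text):
            return sense
    return "unknown"

for example in [
    "I ran out of wine.",
    "I ran into a problem.",
    "Let me run something past you.",
    "I could really go for a run.",
    "His apartment is run down.",
    "I ran a marathon.",  # no rule covers this one
]:
    print(f"{example!r} -> {guess_run_sense(example)}")
```

The last sentence falls through to “unknown,” which is the whole problem in miniature: any meaning the system was never shown is a meaning it cannot recognize.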
How about this one, which few humans can grasp:
“Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.”
Knowing that buffalo is an animal, a verb meaning “to bully,” and a city (Buffalo, NY), we can (sort of) grasp that the sentence means, “Buffalo from the city Buffalo — that the buffalo from the city Buffalo bully — are bullying buffalo from the city Buffalo.” Got that?
“I shot an elephant in my pajamas. How the elephant got in my pajamas, I’ll never know.”
Groucho Marx may have predicted one of the greatest language challenges in this old bit. We can say sentences that are technically correct but challenging to understand because their structure is ambiguous: “in my pajamas” can describe either the shooter or the elephant, and nothing in the word order says which.
“What kind of Coke do you want?” This is a strange sentence to hear for an American, unless they come from Texas or elsewhere in the South, where “Coke” can mean any soft drink. The word “cute” means pretty to most English speakers. For many Americans in the South, the word cute is usually taken negatively and used condescendingly (see also “Bless your heart.”).
The point is that the same word can mean different things depending on location. In England, a car has a boot and a bonnet. In the US? A trunk and a hood, but elephants also have trunks and cobras also have hoods.
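To make the two readings of the elephant sentence concrete, here is a minimal sketch that counts the possible parses with a CKY-style chart. The tiny grammar and lexicon are assumptions made up for this one sentence (they are not a real grammar of English), and the function name count_parses is hypothetical.

```python
from collections import defaultdict

# Toy lexicon: word -> possible categories (assumed for this example only).
LEXICON = {
    "i": ["NP"], "shot": ["V"], "an": ["Det"], "elephant": ["N"],
    "in": ["P"], "my": ["Det"], "pajamas": ["N"],
}

# Toy binary rules: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("PP", "P", "NP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # "an elephant [in my pajamas]"
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # "[shot an elephant] in my pajamas"
]

def count_parses(words, start="S"):
    """Count distinct parse trees for the words under the toy grammar."""
    n = len(words)
    chart = defaultdict(int)  # (start index, end index, symbol) -> tree count
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, []):
            chart[(i, i + 1, symbol)] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # every split point inside the span
                for parent, left, right in BINARY_RULES:
                    chart[(i, j, parent)] += (
                        chart[(i, k, left)] * chart[(k, j, right)]
                    )
    return chart[(0, n, start)]

print(count_parses("i shot an elephant in my pajamas".split()))  # 2
print(count_parses("i shot an elephant".split()))                # 1
```

Both parses are grammatically valid, so the grammar alone cannot choose between them. A human picks the sensible one from world knowledge about elephants and pajamas, which is exactly what the machine lacks.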
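One simple way to picture the regional problem is a lookup keyed by locale. This is a sketch only: the table, the locale codes used as keys, the concept labels, and the function name car_part_concept are all hypothetical, and a real system would need far more than a hand-built dictionary.

```python
# Hypothetical mapping from regional car vocabulary to a shared concept.
CAR_PARTS_BY_LOCALE = {
    "en-GB": {"boot": "rear storage compartment", "bonnet": "engine cover"},
    "en-US": {"trunk": "rear storage compartment", "hood": "engine cover"},
}

def car_part_concept(word, locale):
    """Return the car-part concept for a word in a locale, or None if it has none."""
    return CAR_PARTS_BY_LOCALE.get(locale, {}).get(word.lower())

print(car_part_concept("boot", "en-GB"))   # rear storage compartment
print(car_part_concept("trunk", "en-US"))  # rear storage compartment
print(car_part_concept("boot", "en-US"))   # None: in the US a boot is footwear
```

Even with the locale known, the lookup only says a word could be a car part. Deciding whether this particular trunk belongs to a car or an elephant still takes context.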
It’s not all bad news
- Big advances are being made in how NLP systems handle and understand Arabic.
- Companies and universities are working through these irregularities and challenges at a pace not previously believed possible.
So what do we do about all of this?
- Understand that the initial pitfalls and risks involved are all related to human decisions.
- Know that training a machine has its limitations, and patience (a lot of patience) is needed.
- Be part of the solution by trying out your own theories. Companies like MonkeyLearn let you try out your data samples on state-of-the-art algorithms (free and paid models).