It is one thing to slice and dice text, to chop it up into its component parts, count those parts into piles, and determine what’s in each pile. It is quite a different problem to comprehend the concepts and topics that a document contains. Natural Language Understanding (NLU) and its counterpart Natural Language Generation (NLG) are considered hard problems within AI, whether dealing with voice or text data, because both have machine reasoning as their end goal. In other words, how does a machine not just parse speech but actually stitch all those parts together to truly understand the concept or idea being conveyed by the speaker (voice) or writer (text)?
Sometimes people compare NLP to the process of taking text and breaking it
down into data, and NLU/NLG to the process of taking data and transforming it
into readable, grammatically correct text that can be used in articles, reports,
and many other places. Using data analytics, you can target your content to a
particular audience, transform information into a more readable form, and
scale up content creation, saving time and maintenance effort.
Natural Language Understanding
As noted in our earlier examination of How Language Works, the first step is
to break down the utterance, whether a phrase or a sentence, into its parts of
speech and tag them with a Part of Speech (POS) tagger. Step two is to
understand the grammatical structure of the phrase, that is, how to parse the
word order. A third step is to place the phrase into a known context or domain
of knowledge to guide comprehension or reasoning; this amounts to a kind of
classification that puts the communication into context. From there, it is
possible to dive into a more detailed syntactical analysis. Getting to the
ultimate meaning of a phrase is a hard problem because each domain of knowledge
requires a deep ontology to represent its lexical complexity. For example, the
Medical and Legal domains are very different from the Roman Classical
Literature domain, and yet all of them share Latin as a vocabulary source.
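As an illustration of the first two steps, here is a minimal sketch using the open-source spaCy library to tag parts of speech and expose the grammatical (dependency) structure of a sentence. The example sentence is invented, and the small English model must be downloaded separately.

```python
# A minimal sketch of POS tagging and dependency parsing with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("The doctor dictated her notes after the appointment.")

for token in doc:
    # token.pos_ is the part-of-speech tag (step one);
    # token.dep_ and token.head show the parsed grammatical
    # structure of the sentence (step two).
    print(f"{token.text:12} {token.pos_:6} {token.dep_:10} head={token.head.text}")
```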
Speech recognition has its own set of challenges and is a sub-discipline that incorporates recognizing languages, machine translation, and converting speech to text. This last technique is very important: there are many more tools for working with text than with speech, so if you can get the spoken word into written form, it is much easier to manipulate and process.
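For instance, the Python SpeechRecognition package wraps several speech-to-text engines. The sketch below transcribes a recording through Google’s free web speech API; the audio file name is a hypothetical placeholder.

```python
# A minimal speech-to-text sketch using the SpeechRecognition package.
# Assumes: pip install SpeechRecognition, plus a WAV file to transcribe
# ("dictation.wav" is a hypothetical example file).
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("dictation.wav") as source:
    audio = recognizer.record(source)  # read the entire file

try:
    # Send the audio to Google's free web speech API for transcription.
    text = recognizer.recognize_google(audio)
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible.")
```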
These models are often so large that they are deployed in the cloud and
accessed over a network connection. This is why virtual assistants have a lag
time of a few seconds when responding. When you say “Hey Siri,” it takes a few
moments for Siri to get your answer: she’s off in the cloud, looking up the
answer in a vast database of responses based on a model trained in your local
language.
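That round trip looks roughly like the sketch below: the device ships the recognized utterance to a remote service and waits on the network for a reply. The endpoint URL and response fields are purely hypothetical stand-ins for whatever proprietary API a given assistant actually uses.

```python
# A hypothetical sketch of the assistant round trip: local device -> cloud.
# The endpoint and JSON fields are made-up placeholders, not a real API.
import requests

def ask_assistant(utterance: str) -> str:
    response = requests.post(
        "https://assistant.example.com/v1/query",  # hypothetical endpoint
        json={"utterance": utterance, "lang": "en-US"},
        timeout=5,  # the network hop is where the few seconds of lag live
    )
    response.raise_for_status()
    return response.json()["answer"]  # hypothetical response field

# print(ask_assistant("What's the weather today?"))
```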
Virtual assistants such as Siri and Alexa are the intermediaries that put
a human face on computer technology. The first computer assistant, Clippy,
took the form of a paper clip that attempted to automate actions in Microsoft
Office applications, but it did not contain a speech component. The attempt to
put a human face on computers will one day extend to computers trying to read
lips, once facial recognition technology is good enough to parse subtle muscle
changes and map them to sound files. But this is no Max Headroom.
Another application of NLU is medical records transcription. You go for a
doctor’s appointment and notes are taken. In earlier times, those notes were
handwritten, sent to a typist for transcription, and then added to your
permanent file. Over time, the technology got a bit more sophisticated: the
doctor would dictate the notes onto a tape cassette, and the transcription
service listened to the recording and again typed out the data into a medical
record to be added to a paper file. Now, with NLU, a computer accomplishes this
task far more efficiently using speech-to-text, and the output only needs a
final quality check by a human reviewer. The resulting Electronic Medical
Record is added to your history instantly instead of being shipped to your
doctor’s office by post.
A third example is customer support: when you call and are asked to choose a
numeric option or to say Yes or No, the software recognizes your choice when
you speak it. If you state your birthday or account number, that is NLU in
action. When the automated voice repeats the number back to you, another
function is in play: Natural Language Generation (NLG).
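A toy sketch of that first half, the NLU step, appears below: given a transcribed caller utterance, it pulls out a date of birth and an account number. The regex patterns and the sample sentence are illustrative assumptions; real phone systems use far more robust grammars.

```python
# A toy sketch of NLU-style slot extraction from a transcribed utterance.
# The regex patterns and the sample sentence are illustrative only.
import re
from datetime import datetime

utterance = "my birthday is March 5 1984 and my account number is 4471983"

# Pull out an account number: a run of six or more digits.
account = re.search(r"\b\d{6,}\b", utterance)

# Pull out a date like "March 5 1984" and normalize it.
date_match = re.search(r"([A-Z][a-z]+ \d{1,2} \d{4})", utterance.title())
birthday = datetime.strptime(date_match.group(1), "%B %d %Y") if date_match else None

print("Account:", account.group() if account else None)
print("Birthday:", birthday.date() if birthday else None)
```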
Natural Language Generation
As the subject implies, once a machine knows how to parse language,
reversing the process to create speech is the next step in the journey. But we
are far from a Babel Fish or Jarvis level of operability. For a machine to
genuinely create thoughts and interact with a human being at the level of AI
portrayed in movies, the amount of computing power required for a true
simulation of the human brain and its vast network of neurons simply does not
yet exist.
There are some companies working on the Babel Fish translator problem,
including Waverly Labs and Timekettle, who provide smart earbuds that hook up
to real-time translation services in the cloud. These are paired with apps on
your smartphone containing phrase databases in which your most common speech
patterns can be stored. The translation software learns your patterns as you
use the app, in the same way that Alexa or Siri recognizes your unique voice
print. The system then responds with the corresponding translation when you
need, say, to conduct a conversation in French or Chinese. Conversely, if you
need something translated into your native tongue, it will translate into
French if you are a Frenchman traveling in China, or into Chinese if you are
on vacation in Paris. This is just one simple example of a speech generation
application using NLG.
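Under the hood, services like these rely on machine translation models. A minimal sketch with the open-source Hugging Face transformers library is shown below; its built-in English-to-French pipeline stands in here for whatever proprietary models the earbud vendors actually use.

```python
# A minimal machine translation sketch using Hugging Face transformers.
# Assumes: pip install transformers (a default translation model is
# downloaded on first use); a stand-in for a commercial service.
from transformers import pipeline

translator = pipeline("translation_en_to_fr")
result = translator("Where is the nearest train station?")
print(result[0]["translation_text"])  # the French rendering of the input
```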
NLG is far more complex than translating from one language to another. In
fact, it involves understanding the context of a conversation and creating
appropriate responses to the person the machine is interacting with. One of
the most common examples of NLG today is when you call a support line for help
or to make an appointment: the infamous “Press 1 to speak with Customer
Support, Press 2 to speak with Accounting to pay your bill, Press 3 to make an
appointment,” and so forth. And then you get stuck in phone-tree hell, where
you never get your question answered, just a computer-generated voice sending
you to menu after menu of options until you are disconnected.
This early attempt to replace humans in the phone system was just a lot of
prerecorded messages, much like today’s database lookups of prerecorded
responses from Alexa or Siri. Customer support solutions are now much more
sophisticated and responsive, with algorithms behind them designed to react
when you state your problem in a phrase or a few words. The computer in the
background then searches a knowledge base, and a voice synthesizer reads out
the response it finds.
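That search-then-speak loop can be sketched in a few lines, as below. The tiny keyword-matched FAQ dictionary is an invented stand-in for a real knowledge base, and the offline pyttsx3 engine stands in for a production voice synthesizer.

```python
# A toy sketch of the support-line loop: match the caller's words against
# a knowledge base, then speak the best response. The FAQ entries are
# invented; pyttsx3 (pip install pyttsx3) is an offline TTS stand-in.
import pyttsx3

faq = {  # hypothetical knowledge base keyed by topic keywords
    "bill": "Your current balance is available under the Billing menu.",
    "appointment": "The next available appointment is Monday at nine.",
}

def respond(utterance: str) -> str:
    for keyword, answer in faq.items():
        if keyword in utterance.lower():
            return answer
    return "Sorry, I did not understand. Let me connect you to an agent."

engine = pyttsx3.init()              # initialize the speech synthesizer
engine.say(respond("I need to pay my bill"))
engine.runAndWait()                  # block until the audio has played
```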
NLG also uses grammar and speech patterns to take common phrases and chunks of text and organize them into a document or voice response. Three basic steps need to occur for this process to be successful: content determination, information structuring, and aggregation (a toy sketch of the full pipeline follows the list).
- Content determination is basically looking at the context and topic being
addressed and deciding what information needs to be included in the
document or response. If the goal is to compose a text-based output such
as auto-generated webpages, news articles, or a business report, then
it’s crucial to have the right data as input. If the goal is a verbal
interchange between a person and a machine over a phone call, then
generating the right few sentences is even more important, since a person
asking questions or seeking information in a phone call does not devote a
lot of time to the task.
- Information structuring takes all of the content, orders it by
importance, and makes decisions about word order, lexical choices, and the
like. There is also a portion of the processing called “realization,”
whereby the code determines syntax, morphology, and word order: in other
words, how to write the actual sentences, or how to compose the morphemes
of speech in the case of verbal output through speech synthesizers.
- Aggregation serves to finalize the content by merging or consolidating
similar concepts for readability and conciseness. If an article is too
long, it will seem clunky and unnatural. This step is unnecessary for
speech-based NLG.
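The sketch below walks a toy weather report through those three steps. Every function, template, and data field in it is an invented illustration of the pipeline, not a real NLG framework.

```python
# A toy sketch of the three NLG steps on invented weather data.
facts = {"city": "Paris", "high": 21, "low": 12, "rain": True,
         "humidity": 0.8, "wind_kmh": 10}  # hypothetical input data

def determine_content(facts):
    # Step 1: decide which facts matter for this audience.
    keep = ["city", "high", "low"]
    if facts["rain"]:
        keep.append("rain")
    return {k: facts[k] for k in keep}

def structure_information(content):
    # Step 2: order the facts and realize them as sentences.
    sentences = [f"Today in {content['city']}, expect a high of "
                 f"{content['high']} and a low of {content['low']}."]
    if content.get("rain"):
        sentences.append("Rain is expected.")
        sentences.append("Carry an umbrella.")
    return sentences

def aggregate(sentences):
    # Step 3: merge similar sentences for conciseness.
    pair = ("Rain is expected.", "Carry an umbrella.")
    if all(s in sentences for s in pair):
        sentences = [s for s in sentences if s not in pair]
        sentences.append("Rain is expected, so carry an umbrella.")
    return " ".join(sentences)

print(aggregate(structure_information(determine_content(facts))))
```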
Another approach to NLG is to use large sets of labeled, tagged data to
train models with machine learning algorithms. The use of trained models is
most apparent in chatbots: text-based dialogue systems that conduct online
conversations, often using an avatar as an agent that guides users through a
process. The earliest chatbot was ELIZA, a 1966 effort to create an
interaction with a seemingly human program.
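ELIZA worked by matching the user’s input against simple patterns and reflecting the words back. A toy sketch in that style is shown below; the two rules are an invented subset, not Weizenbaum’s original script.

```python
# A toy ELIZA-style chatbot: match a pattern, reflect the words back.
# The two rules below are an invented subset of the classic script.
import re

rules = [
    (re.compile(r"i feel (.*)", re.I), "Why do you feel {}?"),
    (re.compile(r"i am (.*)", re.I), "How long have you been {}?"),
]

def eliza(utterance: str) -> str:
    for pattern, template in rules:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."

print(eliza("I feel anxious about my appointment"))
# -> "Why do you feel anxious about my appointment?"
```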
An often-overlooked use of NLG is the autocomplete function in many
applications. Essentially, as you type, the computer predicts which words you
will choose, which grammar is best to use, and so on. Whether you recognize it
or not, this is a form of generating language on your behalf. The technology
was originally developed by Nuance and is now so ubiquitous on smartphones
that most of us take it for granted.
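At its simplest, next-word prediction can be driven by counting which word most often follows the current one in some body of text. The sketch below builds that bigram table from a one-line invented corpus; real keyboards train far richer models on huge datasets.

```python
# A toy next-word predictor: count bigrams, suggest the most common follower.
# The one-line corpus is invented; real autocomplete uses huge datasets.
from collections import Counter, defaultdict

corpus = "the doctor read the notes and the doctor signed the record".split()

followers = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    followers[word][nxt] += 1

def suggest(word: str) -> str:
    counts = followers.get(word.lower())
    return counts.most_common(1)[0][0] if counts else ""

print(suggest("the"))  # -> "doctor" (the most frequent word after "the")
```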
Teaching a computer, a logic machine, to put words together into phrases and
complete sentences is a far cry from teaching it to reason, to think for
itself as Jarvis does. And Jarvis is a long way from the bumbling C3PO, who
shows pathos and sentiment as well as subtle humor in the face of crisis. C3PO
and his sidekick R2D2 are much more the traditional stereotype of AI inside a
metal skin: robots that click and whir while spitting out speech. They are
simulacra of humans, with speech built in. Jarvis, on the other hand, appears
fully human while remaining Other. With Jarvis we wonder, “Do we need Asimov’s
rules after all?” Thankfully, we are a long way from needing to answer that
question.
S. Bolding—Copyright © 2021 ·Boldingbroke.com