It is one thing to slice and dice text, to chop it up into its component parts, count those parts into piles, and determine what’s in each pile. It is quite a different problem to comprehend the concepts and topics that a document contains. Natural Language Understanding (NLU) and its counterpart Natural Language Generation (NLG) are considered hard problems within AI, whether dealing with voice or text data, because both have machine reasoning as their end goal. In other words, how does a machine not just parse speech but actually stitch all those parts together to truly understand the concept or idea being conveyed by the speaker (voice) or writer (text)?
Sometimes people compare NLP to the process of taking text and breaking it
down into data, and NLU/NLG to the process of taking data and transforming it
into readable, grammatically correct text that can be used in articles, reports,
and many other places. Using data analytics, you can target your content to a
particular audience, transform information into a more readable form, and
scale up content creation, saving time and maintenance effort.
Natural Language Understanding
As noted in our earlier examination of How Language Works, the first step is
to break down the utterance, whether a phrase or a sentence, into its parts of
speech and tag them with a Part of Speech (POS) tagger. Step two is to
understand the grammatical structure of the phrase, that is, how to parse the
word order. A third step is to place the phrase into a known context or domain
of knowledge to guide comprehension or reasoning; this amounts to a kind of
classification that puts the communication into context. From there, it is
possible to dive into a more detailed syntactical analysis. Getting to the
ultimate meaning of a phrase is a hard problem because each domain of knowledge
requires a deep ontology to represent its lexical complexity. For example, the
Medical and Legal domains are very different from the Roman Classical
Literature domain, and yet all of them share Latin as a vocabulary source.
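As an illustration of the first two steps, here is a minimal sketch using the open-source spaCy library to tag parts of speech and expose the grammatical (dependency) structure of a sentence. The example sentence is invented, and the small English model must be downloaded separately.

```python
# A minimal sketch of POS tagging and dependency parsing with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("The doctor dictated her notes after the appointment.")

for token in doc:
    # token.pos_ is the part-of-speech tag (step one);
    # token.dep_ and token.head show the parsed grammatical
    # structure of the sentence (step two).
    print(f"{token.text:12} {token.pos_:6} {token.dep_:10} head={token.head.text}")
```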
Speech recognition has its own set of challenges and is a sub-discipline that incorporates recognizing languages, machine translation, and converting speech to text. This last technique is very important: there are many more tools for working with text than with speech, so if you can get the spoken word into written form, it is much easier to manipulate and process.
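For instance, the Python SpeechRecognition package wraps several speech-to-text engines. The sketch below transcribes a recording through Google’s free web speech API; the audio file name is a hypothetical placeholder.

```python
# A minimal speech-to-text sketch using the SpeechRecognition package.
# Assumes: pip install SpeechRecognition, plus a WAV file to transcribe
# ("dictation.wav" is a hypothetical example file).
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("dictation.wav") as source:
    audio = recognizer.record(source)  # read the entire file

try:
    # Send the audio to Google's free web speech API for transcription.
    text = recognizer.recognize_google(audio)
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible.")
```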
These models are often so large that they are deployed in the cloud and
accessed over a network connection. This is why virtual assistants have a lag
time of a few seconds when responding. When you say “Hey Siri,” it takes a few
moments for Siri to get your answer: she’s off in the cloud, looking up the
answer in a vast database of responses based on a model trained in your local
language.
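That round trip looks roughly like the sketch below: the device ships the recognized utterance to a remote service and waits on the network for a reply. The endpoint URL and response fields are purely hypothetical stand-ins for whatever proprietary API a given assistant actually uses.

```python
# A hypothetical sketch of the assistant round trip: local device -> cloud.
# The endpoint and JSON fields are made-up placeholders, not a real API.
import requests

def ask_assistant(utterance: str) -> str:
    response = requests.post(
        "https://assistant.example.com/v1/query",  # hypothetical endpoint
        json={"utterance": utterance, "lang": "en-US"},
        timeout=5,  # the network hop is where the few seconds of lag live
    )
    response.raise_for_status()
    return response.json()["answer"]  # hypothetical response field

# print(ask_assistant("What's the weather today?"))
```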
Virtual assistants such as Siri and Alexa are the intermediaries that put
a human face on computer technology. The first computer assistant, Clippy,
took the form of a paper clip that attempted to automate actions in Microsoft
Office applications, but it did not contain a speech component. The attempt to
put a human face on computers will one day extend to computers trying to read
lips, once facial recognition technology is good enough to parse subtle muscle
changes and map them to sound files. But this is no Max Headroom.
Another application of NLU is medical records transcription. You go for a
doctor’s appointment and notes are taken. In earlier times, those notes were
handwritten, sent to a typist for transcription, and then added to your
permanent file. Over time, the technology got a bit more sophisticated: the
doctor would dictate the notes onto a tape cassette, and the transcription
service listened to the recording and again typed out the data into a medical
record to be added to a paper file. Now, with NLU, a computer accomplishes this
task far more efficiently using speech-to-text, and the output only needs a
final quality check by a human reviewer. The resulting Electronic Medical
Record is added to your history instantly instead of being shipped to your
doctor’s office by post.
A third example is customer support: when you call and are asked to choose a
numeric option or to say Yes or No, the software recognizes your choice when
you speak it. If you state your birthday or account number, that is NLU in
action. When the automated voice repeats the number back to you, another
function is in play: Natural Language Generation (NLG).
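A toy sketch of that first half, the NLU step, appears below: given a transcribed caller utterance, it pulls out a date of birth and an account number. The regex patterns and the sample sentence are illustrative assumptions; real phone systems use far more robust grammars.

```python
# A toy sketch of NLU-style slot extraction from a transcribed utterance.
# The regex patterns and the sample sentence are illustrative only.
import re
from datetime import datetime

utterance = "my birthday is March 5 1984 and my account number is 4471983"

# Pull out an account number: a run of six or more digits.
account = re.search(r"\b\d{6,}\b", utterance)

# Pull out a date like "March 5 1984" and normalize it.
date_match = re.search(r"([A-Z][a-z]+ \d{1,2} \d{4})", utterance.title())
birthday = datetime.strptime(date_match.group(1), "%B %d %Y") if date_match else None

print("Account:", account.group() if account else None)
print("Birthday:", birthday.date() if birthday else None)
```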
Natural Language Generation
As the subject implies, once a machine knows how to parse language,
reversing the process to create speech is the next step in the journey. But we
are far from a Babel Fish or Jarvis level of operability. For a machine to
genuinely create thoughts and interact with a human being at the level of AI
portrayed in movies, the amount of computing power required for a true
simulation of the human brain and its vast network of neurons simply does not
yet exist.
There are some companies working on the Babel Fish translator problem,
including Waverly Labs and Timekettle, who provide smart earbuds that hook up
to real-time translation services in the cloud. These are paired with apps on
your smartphone containing phrase databases in which your most common speech
patterns can be stored. The translation software learns your patterns as you
use the app, in the same way that Alexa or Siri recognizes your unique voice
print. The system then responds with the corresponding translation when you
need, say, to conduct a conversation in French or Chinese. Conversely, if you
need something translated into your native tongue, it will translate into
French if you are a Frenchman traveling in China, or into Chinese if you are
on vacation in Paris. This is just one simple example of a speech generation
application using NLG.
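Under the hood, services like these rely on machine translation models. A minimal sketch with the open-source Hugging Face transformers library is shown below; its built-in English-to-French pipeline stands in here for whatever proprietary models the earbud vendors actually use.

```python
# A minimal machine translation sketch using Hugging Face transformers.
# Assumes: pip install transformers (a default translation model is
# downloaded on first use); a stand-in for a commercial service.
from transformers import pipeline

translator = pipeline("translation_en_to_fr")
result = translator("Where is the nearest train station?")
print(result[0]["translation_text"])  # the French rendering of the input
```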
NLG is far more complex than translating from one language to another. In
fact, it involves understanding the context of a conversation and creating
appropriate responses to the person the machine is interacting with. One of
the most common examples of NLG today is when you call a support line for help
or to make an appointment: the infamous “Press 1 to speak with Customer
Support, Press 2 to speak with Accounting to pay your bill, Press 3 to make an
appointment,” and so forth. And then you get stuck in phone-tree hell, where
you never get your question answered, just a computer-generated voice sending
you to menu after menu of options until you are disconnected.
This early attempt to replace humans in the phone system was just a lot of
prerecorded messages, much like today’s database lookups of prerecorded
responses from Alexa or Siri. Customer support solutions are now much more
sophisticated and responsive, with algorithms behind them designed to react
when you state your problem in a phrase or a few words. The computer in the
background then searches a knowledge base, and a voice synthesizer reads out
the response it finds.
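That search-then-speak loop can be sketched in a few lines, as below. The tiny keyword-matched FAQ dictionary is an invented stand-in for a real knowledge base, and the offline pyttsx3 engine stands in for a production voice synthesizer.

```python
# A toy sketch of the support-line loop: match the caller's words against
# a knowledge base, then speak the best response. The FAQ entries are
# invented; pyttsx3 (pip install pyttsx3) is an offline TTS stand-in.
import pyttsx3

faq = {  # hypothetical knowledge base keyed by topic keywords
    "bill": "Your current balance is available under the Billing menu.",
    "appointment": "The next available appointment is Monday at nine.",
}

def respond(utterance: str) -> str:
    for keyword, answer in faq.items():
        if keyword in utterance.lower():
            return answer
    return "Sorry, I did not understand. Let me connect you to an agent."

engine = pyttsx3.init()              # initialize the speech synthesizer
engine.say(respond("I need to pay my bill"))
engine.runAndWait()                  # block until the audio has played
```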
NLG also uses grammar and speech patterns to take common phrases and chunks of text and organize them into a document or voice response. Three basic steps need to occur for this process to be successful: content determination, information structuring, and aggregation (a toy sketch of the full pipeline follows the list).
- Content determination is basically looking at the context and topic being
addressed and deciding what information needs to be included in the
document or response. If the goal is to compose a text-based output such
as auto-generated webpages, news articles, or a business report, then
it’s crucial to have the right data as input. If the goal is a verbal
interchange between a person and a machine over a phone call, then
generating the right few sentences is even more important, since a person
asking questions or seeking information in a phone call does not devote a
lot of time to the task.
- Information structuring takes all of the content, orders it by
importance, and makes decisions about word order, lexical choices, and the
like. There is also a portion of the processing called “realization,”
whereby the code determines syntax, morphology, and word order: in other
words, how to write the actual sentences, or how to compose the morphemes
of speech in the case of verbal output through speech synthesizers.
- Aggregation serves to finalize the content by merging or consolidating
similar concepts for readability and conciseness. If an article is too
long, it will seem clunky and unnatural. This step is unnecessary for
speech-based NLG.
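The sketch below walks a toy weather report through those three steps. Every function, template, and data field in it is an invented illustration of the pipeline, not a real NLG framework.

```python
# A toy sketch of the three NLG steps on invented weather data.
facts = {"city": "Paris", "high": 21, "low": 12, "rain": True,
         "humidity": 0.8, "wind_kmh": 10}  # hypothetical input data

def determine_content(facts):
    # Step 1: decide which facts matter for this audience.
    keep = ["city", "high", "low"]
    if facts["rain"]:
        keep.append("rain")
    return {k: facts[k] for k in keep}

def structure_information(content):
    # Step 2: order the facts and realize them as sentences.
    sentences = [f"Today in {content['city']}, expect a high of "
                 f"{content['high']} and a low of {content['low']}."]
    if content.get("rain"):
        sentences.append("Rain is expected.")
        sentences.append("Carry an umbrella.")
    return sentences

def aggregate(sentences):
    # Step 3: merge similar sentences for conciseness.
    pair = ("Rain is expected.", "Carry an umbrella.")
    if all(s in sentences for s in pair):
        sentences = [s for s in sentences if s not in pair]
        sentences.append("Rain is expected, so carry an umbrella.")
    return " ".join(sentences)

print(aggregate(structure_information(determine_content(facts))))
```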
Another approach to NLG is to use large sets of labeled, tagged data to
train models with machine learning algorithms. The use of trained models is
most apparent in chatbots: text-based dialogue systems that conduct online
conversations, often using an avatar as an agent that guides users through a
process. The earliest chatbot was ELIZA, a 1966 effort to create an
interaction with a seemingly human program.
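ELIZA worked by matching the user’s input against simple patterns and reflecting the words back. A toy sketch in that style is shown below; the two rules are an invented subset, not Weizenbaum’s original script.

```python
# A toy ELIZA-style chatbot: match a pattern, reflect the words back.
# The two rules below are an invented subset of the classic script.
import re

rules = [
    (re.compile(r"i feel (.*)", re.I), "Why do you feel {}?"),
    (re.compile(r"i am (.*)", re.I), "How long have you been {}?"),
]

def eliza(utterance: str) -> str:
    for pattern, template in rules:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."

print(eliza("I feel anxious about my appointment"))
# -> "Why do you feel anxious about my appointment?"
```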
An often-overlooked use of NLG is the autocomplete function in many
applications. Essentially, as you type, the computer predicts which words you
will choose, which grammar is best to use, and so on. Whether you recognize it
or not, this is a form of generating language on your behalf. The technology
was originally developed by Nuance and is now so ubiquitous on smartphones
that most of us take it for granted.
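At its simplest, next-word prediction can be driven by counting which word most often follows the current one in some body of text. The sketch below builds that bigram table from a one-line invented corpus; real keyboards train far richer models on huge datasets.

```python
# A toy next-word predictor: count bigrams, suggest the most common follower.
# The one-line corpus is invented; real autocomplete uses huge datasets.
from collections import Counter, defaultdict

corpus = "the doctor read the notes and the doctor signed the record".split()

followers = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    followers[word][nxt] += 1

def suggest(word: str) -> str:
    counts = followers.get(word.lower())
    return counts.most_common(1)[0][0] if counts else ""

print(suggest("the"))  # -> "doctor" (the most frequent word after "the")
```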
Teaching a computer, a logic machine, to put words together into phrases and
complete sentences is a far cry from teaching it to reason, to think for
itself as Jarvis does. And Jarvis is a long way from the bumbling C3PO, who
shows pathos and sentiment as well as subtle humor in the face of crisis. C3PO
and his sidekick R2D2 are much more the traditional stereotype of AI inside a
metal skin: robots that click and whir while spitting out speech. They are
simulacra of humans, with speech built in. Jarvis, on the other hand, appears
fully human while remaining Other. With Jarvis we wonder, “Do we need Asimov’s
rules after all?” Thankfully, we are a long way from needing to answer that
question.
S. Bolding—Copyright © 2021 ·Boldingbroke.com