Tuesday, April 23, 2024

What is NLP?

When users of a system (customers or employees) communicate with each other in the regular course of business, they convey their plans and intentions in written and spoken language. These communications are in turn recorded to ensure business continuity, financial record keeping, compliance with regulatory laws and codes of conduct, and other use cases. The work environment is dynamic, and language patterns and terminology change constantly. Correspondingly, systems and tools have to keep pace with that rate of change: they must continuously learn and reveal unforeseen, actionable connections that uncover both opportunities and risks. To do this, machines need to be able to understand human language.

Natural Language Processing (NLP) is the science of breaking down human language into discrete patterns that a machine can understand and interpret. While understanding and responding to basic commands is straightforward, a machine cannot grasp the nuances of why a person says what they say, or their intent and sentiment. This is where data science and computational linguistics have created tools to help machines understand what humans are really trying to accomplish when they type or say something.

NLP has several sub-disciplines, including Natural Language Understanding (NLU), which relies on disambiguation (determining precisely what a word means in context), and Natural Language Generation (NLG), or language creation, where a computer independently composes sentences, as with a “chatbot.” We will look at these sub-disciplines in future blog posts. For now, you can follow the links to Wikipedia for a quick reference definition of each area.

Why Should You Care?

In the 2021 Algorithmia Enterprise Trends in Machine Learning survey, there was a reported urgency around AI/ML projects: “When we asked respondents why, 43% said their AI/ML initiatives ‘matter way more than we thought.’ Nearly one in four said that their AI/ML initiatives should have been their top priority sooner.” (p. 6) While many areas of IT budgets are shrinking, the AI/ML line item is ramping up significantly. If 2020 taught businesses anything, it is that automation of DevOps and management of data assets are key strategic investments that recession-proof operations and keep a business viable in uncertain times.

Organizations are looking at an increasing number of use cases for ML and, with it, NLP. In future articles, we will dive into these scenarios and their ML/NLP applications in depth. In the meantime, here are just a few of the areas and outcomes where this technology can be applied:

  • Improving customer acquisition, retention, interactions, and experience, and therefore customer loyalty
  • Process, supply chain, and back-office automation, reducing operational costs and increasing ROI
  • Fraud and insider threat detection
  • Sales pipeline, recommendation systems, loyalty, brand awareness, and marketing program intelligence
  • Financial planning and insights
  • Governance, Risk, and Compliance management and workflow for audit and regulatory reporting

Governance is by far the most problematic of these use cases, with over half of all organizations ranking it as their top challenge. The ethics, explainability, and data privacy concerns inherent in AI/ML are fodder for much conversation and debate in this emerging space. As with any new technology, standards are not yet established. But governance mandates that the handling and processing of data, especially PII, be treated with kid gloves, and with an audit trail, in order to minimize risk. Data is exploding at a rate that makes it hard to trace: even with data lakes and cloud solutions, data is messy by nature.

Why Does “Big Data” Matter?

Big Data is a term that describes the growth in volume, velocity, and variety of data in the world. It is exemplified by an explosion in the quantity of data, primarily in the unstructured, “messy” data of chats and the semi-structured data of emails. Notes, images, and attachments increase the complexity of what must be captured and supervised. And while individual communications appear small when viewed independently, they reveal far more insightful patterns and context when viewed in aggregate. Until recently, a human eye was required to spot these patterns and understand the context, but now we have access to advanced tools such as statistical analysis, machine learning, data mining, NLP, information retrieval, and predictive analytics.

Fundamentals of Pattern Analysis in Language

Detecting behavioral patterns in unstructured, text-based data is often compared to a “needle in a haystack” scenario. The basic assumption is that the vast majority of people follow common patterns, and only a small percentage are outliers: first movers with a new technology (a positive use case), “bad actors” who try to mask their intent as insider threats (a negative use case), or a person acting under duress due to life circumstances (a neutral use case). Still, NLP systems must look at everything in order to find evidence of those few individuals in the communications. Most communications flagged as potentially interesting turn out to be innocuous; we call these “false positives” when we have to review them.

NLP software therefore seeks to sift through and filter out the vast majority of documents that are valid, business-related communications, and reduce the haystack down to an interesting set of data where the “needles” or outlier behaviors are hiding. It is these “interesting” communications that are considered high-value, or “true positives” and for which we want to generate data sets for further examination.

In any search application, there is a classic tradeoff between Recall and Precision [Fig.1]. Recall measures the scope of coverage: “Did I miss anything?” Precision measures how closely the results match what I intended to find: “How much of what I retrieved is actually relevant?” We seek to optimize the balance between reducing false positives (increasing Precision) and ensuring that as few true positives as possible are missed (increasing Recall). Complete coverage with high precision is the goal of all NLP solutions.
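To make the tradeoff concrete, here is a minimal sketch (in Python, with hypothetical counts chosen purely for illustration) of how Precision and Recall are computed from true positives, false positives, and false negatives:

```python
def precision(tp: int, fp: int) -> float:
    """Of the documents we flagged, what fraction were truly interesting?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of the truly interesting documents, what fraction did we actually flag?"""
    return tp / (tp + fn)

# A broad filter: catches nearly everything, but buries reviewers in false positives.
print(precision(tp=95, fp=900), recall(tp=95, fn=5))   # ~0.10 precision, 0.95 recall

# A narrow filter: very precise, but misses a third of the true positives.
print(precision(tp=65, fp=10), recall(tp=65, fn=35))   # ~0.87 precision, 0.65 recall
```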

The balance between Precision and Recall can be viewed as a balance between two competing business needs: a strong assurance of coverage, meaning that few positive cases are missed; and, concurrently, a low volume of highly targeted, precise documents for analysis.

One method for managing this competition is to map the business needs and risks to a behavior taxonomy and then link individual rules and algorithms to them. The taxonomic technique allows us to measure the tool’s performance with respect to each of the managed use cases, and to demonstrate that each risk has corresponding coverage. This management method renders the typical tradeoff of sacrificing Recall in favor of Precision far less significant, as there is now an assurance of business coverage alongside targeted risk reporting for the behaviors that senior management prioritizes.
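As a sketch of what this can look like in practice, the structure below (with hypothetical behavior categories and rule names, not taken from any specific product) links a behavioral taxonomy to individual detection rules so that hit counts and coverage can be reported per business risk rather than as a single global number:

```python
# Hypothetical taxonomy: each managed behavior is linked to the rules that cover it.
TAXONOMY = {
    "Insider Threat":      ["exfiltration_phrases", "external_channel_feature"],
    "Market Manipulation": ["rumor_spreading_terms", "coordinated_timing_rule"],
    "Customer Complaints": ["negative_sentiment_rule", "escalation_phrases"],
}

def coverage_report(hits_by_rule: dict) -> dict:
    """Roll rule-level hit counts up to the behaviors senior management track."""
    return {
        behavior: sum(hits_by_rule.get(rule, 0) for rule in rules)
        for behavior, rules in TAXONOMY.items()
    }

# Example: rule-level hits from one day's review queue (made-up numbers).
print(coverage_report({"exfiltration_phrases": 3, "negative_sentiment_rule": 12}))
# {'Insider Threat': 3, 'Market Manipulation': 0, 'Customer Complaints': 12}
```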

NLP—An In-Depth Explanation

NLP tries to break down the complexity of human speech into two parts and solve each part independently, using what are called Semiotic and Semantic analysis.

The first challenge is to understand word origins and their evolution over time, in order to find relationships between them. This is called “Semiotic Analysis” and is the subject of a great deal of industry research. It has been broadly addressed by breaking words down into character strings and sub-strings (their stem and “lemma,” or main representative form). For example, “run, runs, ran, running” are all forms of the lemma “run.” The system then clusters and counts words within documents, finding which most commonly appear in proximity to others. The tools for this work are effective and freely available as open-source toolkits.
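As a small illustration, the sketch below collapses the surface forms “runs,” “ran,” and “running” to the lemma “run” and counts them, using the open-source NLTK toolkit (one of several options; the sample sentence is invented):

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from collections import Counter
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

doc = "He runs every morning. Yesterday he ran ten miles and was still running at noon."
tokens = [w.strip(".,").lower() for w in doc.split()]

# For simplicity this toy example lemmatizes every token as a verb; a real
# pipeline would use part-of-speech tags to choose the correct lemma.
lemmas = [lemmatizer.lemmatize(t, pos="v") for t in tokens]

print(Counter(lemmas)["run"])  # "runs", "ran", and "running" all collapse to "run" -> 3
```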

Common toolkits used in the industry are Stanford CoreNLP and Apache OpenNLP, which support the following operations (a small sketch combining several of them follows the list):

  • Normalization: Correcting spelling errors and standardizing words.
  • Stemming: Looking for the stem, i.e. the most basic, common form of a word.
  • Entity Extraction: Identifying nouns and tagging them with properties that are useful for analysis (for example: Barclays can be tagged with “Bank” or “Counterparty”, and “cell phone” can be tagged with “communication channel”.)
  • Fuzzy Matching: A type of string-based matching of phrases to a dictionary of interesting words, or topics, that accounts for variations in position or spelling (so “call my cell” would be the same as “call @ cell”.)
  • Synonyms: Simple word substitutes to capture the context provided by other words (so “call my cell” would be the same as “call my mobile” in search terms.)
  • Feature Construction: The use or combination of the above techniques to generate and store complex context along with the data (so “call my cell” would be stored as “use of external communication channel”, and would match the text “reach me on my mobile”.)
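Below is a small, self-contained sketch (standard-library Python only; the synonym table, topic dictionary, and feature label are hypothetical) of how fuzzy matching, synonyms, and feature construction can combine so that “call @ cell” and “reach me on my mobile” both map to the same feature as “call my cell”:

```python
from difflib import SequenceMatcher

SYNONYMS = {"mobile": "cell", "reach": "call"}   # simple word substitutes
TOPIC_PHRASES = {"call my cell": "use of external communication channel"}

def normalize(text: str) -> str:
    """Lowercase the text and apply the synonym substitutions word by word."""
    return " ".join(SYNONYMS.get(w, w) for w in text.lower().split())

def match_features(text: str, threshold: float = 0.75) -> list:
    """Return feature labels whose dictionary phrase fuzzily matches the text."""
    cleaned = normalize(text)
    return [
        label
        for phrase, label in TOPIC_PHRASES.items()
        if SequenceMatcher(None, cleaned, phrase).ratio() >= threshold
    ]

print(match_features("call my cell"))           # exact match
print(match_features("call @ cell"))            # fuzzy match despite the variation
print(match_features("reach me on my mobile"))  # synonyms map it back to "call ... cell"
```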

The second challenge is to understand each word’s meaning, and how that meaning changes within a single sentence and within the context. This becomes more complex in larger passages because the context and sentiment will shift between sentences or chat lines in a conversation. This type of analysis is called “Semantic Analysis” and is a harder problem to address because it reflects the deeper linguistic intricacies of human communication.

In language, context is everything. We humans are exceptionally good at understanding emotions, nuance, and innuendo, all things that machines cannot readily grasp. A diagram of the language’s structure helps explain why [Fig.2]. There are two levels of semantics in human speech, shown here as breadth and depth.

The basic structure of a sentence is represented in the top line: Subject-->Verb-->Direct Object. The conduct risk behaviors that we are trying to detect are constructed as Verb-->Direct Object phrases, representing the discrete activities that the Subjects perform. The deeper structures of prepositional phrases provide context. This is where the intent can be best discerned: the “why” that explains a person’s motivation and actions.
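A brief sketch of how this structure can be surfaced automatically uses the open-source spaCy library (not mentioned above; chosen here purely for illustration, with an invented example sentence) to extract Verb-->Direct Object pairs and the prepositional phrases that carry the surrounding context:

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She transferred the files to her personal account before the deadline.")

for token in doc:
    if token.dep_ == "dobj":        # Verb --> Direct Object: the discrete activity
        print("activity:", token.head.lemma_, "->", token.text)
    if token.dep_ == "prep":        # prepositional phrase: the deeper context
        objects = [child.text for child in token.children if child.dep_ == "pobj"]
        print("context :", token.head.lemma_, token.text, " ".join(objects))
```

Exact labels depend on the model version, but on this sentence the parse typically yields the activity “transfer -> files” plus the context phrases “to ... account” and “before ... deadline”.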

NLP enables systems to learn the context and meaning of words from sentences, paragraphs, and entire documents by reading each line. Unfortunately, though, computers cannot yet read “between” the lines. It is instructive to point out the ambiguity inherent in human language by examining a few expressions:

  • “He was on fire lst nite.” (last night)

How does a machine know that a person is not really burning up when someone says that a sports player is “on fire” - meaning that they are performing well? In an NLP platform, the fuzzy matching capability is able to accommodate the spelling error or abbreviation and recognize “lst nite” as a “timeframe”. Its NLP semiotic techniques can then understand that “on fire” is a synonym for “high performance” when combined with the dual contexts of “person” and “timeframe”.

  • “She jumped for joy.”

In the same way, “jumping for joy” is usually not a literal action, though it could be, especially when the context involves small children.

  • “They really love to do bad things.”

In this example, the software needs to be trained to understand that while “love” is positive (think of it as +1) and “bad things” is negative (-1), loving a negative makes the statement doubly negative (-2) rather than neutral, even though a naive sum would suggest +1 - 1 = 0. In order for the machine to understand the above examples, the instructions we give have to remove the ambiguity that humans handle naturally. This is, in essence, the semantic power of NLP.
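The toy sketch below (with a hypothetical lexicon and a single hand-written rule, not a production sentiment model) shows why a naive additive score reads this sentence as neutral, and how a contextual rule recovers the strongly negative reading:

```python
AFFECT = {"love": +1, "hate": -1, "bad": -1, "good": +1}
INTENSIFIERS = {"really": 2.0, "maybe": 0.5}

def naive_score(words):
    # Simple sum of word polarities: love (+1) + bad (-1) = 0, i.e. "neutral".
    return sum(AFFECT.get(w, 0) for w in words)

def contextual_score(words):
    # Rule: a positive affect verb aimed at a negative object makes the whole
    # statement doubly negative, and any intensifier then scales that score.
    score = naive_score(words)
    if "love" in words and any(AFFECT.get(w, 0) < 0 for w in words):
        score = -2
    for w in words:
        score *= INTENSIFIERS.get(w, 1.0)
    return score

words = "they really love to do bad things".split()
print(naive_score(words))       # 0    -> misleadingly neutral
print(contextual_score(words))  # -4.0 -> doubly negative, amplified by "really"
```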

Our NLP tools can look at adverbs and prepositional phrases to discern sentiment or emotions (“love”, “hate”) and intensifiers (“really”, “maybe”, “hesitantly”). However, the ability to discern free-will choices and the intent behind a person’s speech patterns based on deeper semantic structures is still an area of research that has yet to enter the commercial space. To approximate an understanding of such human psychology, it is possible to create semi-static structures against which rules can be mapped. We call these structures “taxonomies of risk,” or “behavioral taxonomies.” In future posts, we will look at how language works and the way taxonomies help computers organize human knowledge.

In this post, we have looked at the basics of NLP at a very high level and posited several use cases for businesses to consider when applying advanced AI/ML techniques to their operations. Making an investment in ML for the long run is a strategic decision that should be mapped out with deliberation and an understanding of the investment in DevOps, infrastructure, and data science necessary to fully support the initiative. Accomplishing this type of transformation takes buy-in at the most senior levels of the company. Therefore, it is critical that the C-suite understand the concepts involved and the investments necessary to drive innovation in the new world of Predictive and Behavioral Analytics powered by NLP.

S Bolding—Copyright © 2020 · Boldingbroke.com
