Tuesday, April 30, 2024

What is an Entity?

As we begin to explore the world of Natural Language Processing (NLP) and other forms of Machine Learning (ML) or Artificial Intelligence (AI) tools, there is a foundational concept that will appear in various forms, often under more than one name.

That is the foundational concept of “entities.” Entities are observations in the data of real-life people, companies, places, or other things such as cell phones or vehicles. They represent a real-life “who” or “what.” Links in that data indicate shared attributes, which can establish relationships. These relationships create a context for exploring what the entities are doing and why they appear in their unique transactional history. Just as nouns have adjectives that describe them, entities have properties or attributes that distinguish them as unique or similar to other objects in the system. Data extraction techniques help tell one entity from another by its attributes. Another tool can then cluster similar entities into groups by their commonalities.

In machine learning with an AI focused on understanding real-world actors, their relationships, and the meaning of their communications, you will encounter two similarly described activities: Entity or Identity Resolution and Entity Extraction. Entity Resolution is the practice of distilling the individual identity of a person, place, or thing from pieces of structured data coming from many sources. It appears under many other names as well, such as identity resolution, record linking, relationship linkage, and record matching. If the job is to match up records from structured data to arrive at the ultimate identity of a real-world person, place, or thing, and the relationships between them, then the job is Entity Resolution. Entity Extraction is the practice of identifying the names of real-life people, places, and things mentioned in semi-structured and unstructured text. In other words, Entity Extraction identifies a unique person in the first place, and Entity Resolution makes sure they really are who they say they are.
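As a rough illustration of the record-matching side of Entity Resolution, here is a minimal sketch using only Python's standard library. The records, field names, and the 0.6 similarity threshold are all assumptions made up for this example. Production systems typically use far richer matching logic than a date-of-birth check plus fuzzy name comparison.

```python
from difflib import SequenceMatcher

# Hypothetical records from two source systems; field names are assumptions.
dmv_records = [
    {"id": "dmv-1", "name": "Jonathan Q. Smith", "dob": "1980-02-14"},
    {"id": "dmv-2", "name": "Maria Garcia", "dob": "1975-09-30"},
]
claim_records = [
    {"id": "clm-1", "name": "Jon Smith", "dob": "1980-02-14"},
    {"id": "clm-2", "name": "Maria Garcia", "dob": "1991-01-01"},
]


def normalize(name):
    """Lowercase and drop punctuation so formatting differences matter less."""
    kept = (ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return "".join(kept).strip()


def name_similarity(a, b):
    """Return a 0..1 similarity score between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def resolve(left, right, threshold=0.6):
    """Pair records that share a date of birth and have similar names."""
    matches = []
    for l_rec in left:
        for r_rec in right:
            same_dob = l_rec["dob"] == r_rec["dob"]
            similar = name_similarity(l_rec["name"], r_rec["name"]) >= threshold
            if same_dob and similar:
                matches.append((l_rec["id"], r_rec["id"]))
    return matches


matches = resolve(dmv_records, claim_records)
print(matches)
```

In this toy data, the two "Smith" records are linked because the birth dates agree and the names are close, while the two "Maria Garcia" records are kept apart because the birth dates differ. An identical name alone is not treated as proof of a shared identity.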


Entities are the beating heart of systems dedicated to taking action based on understanding who is who, how they are related, what they are doing, why they are doing it, and whether that is good or bad news for an organization. However straightforward this may sound, it is utterly non-trivial. Entity Management means enabling systems to match up all the data from different origin systems required to create a unified identity, and then to monitor transactions between entities and activities that originate from many other systems. These transactions can contain semi-structured, unstructured, and structured data. Each type may provide the context needed to gain insights from that data or to trigger system actions with it.

To build systems able to respond to actors’ behaviors, you must marry Entity Resolution with Entity Extraction, Semantic NLP tools, and a well-developed set of business and compliance rules that, when combined, permit context-driven flagging of activity. Under what circumstances would you want to invest in building systems that compose various ML/AI-driven components, given the expense and time involved? We can look to recent news and industry reports for some examples:

Fraud

The State of California lost roughly 8 billion taxpayer dollars to unemployment insurance fraud. Reporting shows that bad actors used the chaos of COVID and the shutdown of the economy to file false claims using stolen identities. The state was unable to verify the identity of claimants, some of whom were identity fraud victims as young as one year old, because it could not cross-check databases such as DMV, prison, and death records.

Insider Threat

Ponemon Institute reported in its 2020 Cost of Insider Threats: Global study that the three industries most affected (financial services, services, and technology and software) incurred average annual costs of $14.05 million, $12.31 million, and $12.30 million, respectively. Those are the hard costs of identification and containment; they do not include losses from events that materially damage customers or create public distrust and dislike.

Both examples involve two principal entity types, people and companies, and one or both may be bad actors.

  • The State of California might have avoided multi-billion dollar losses and processed more citizens more quickly during the COVID-19 crisis if the state's unemployment agency could efficiently marry up all the data the state holds on citizens and business entities, search that data, apply eligibility rules programmatically, and flag suspicious claims for verification. Systems can also compare answers between relatives to see whether the content is similar, use location data, and verify age and other demographics.
  • Insider Threat detection is an order of magnitude more nuanced. It requires a cross-system understanding of people, systems, permission levels, other access levels, relationships, and communication activities. Monitoring behaviors, access, and communication against rule sets creates the opportunity to neutralize a threat condition before it becomes a costly breach. (Read “What is NLP?” for a primer on teaching systems to recognize the intent in written communication.)
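To make the idea of applying eligibility rules programmatically and flagging suspicious claims more concrete, here is a small sketch. Everything in it is hypothetical: the person identifiers, the reference sets standing in for death and prison records, the claim fields, and the minimum age of 16 are assumptions for illustration, not a description of any real state system or its rules.

```python
from datetime import date

# Hypothetical reference data keyed by a made-up person identifier.
DEATH_RECORDS = {"P-001"}
PRISON_RECORDS = {"P-004"}

# Hypothetical claims; field names are assumptions for illustration.
claims = [
    {"claim_id": "c1", "person_id": "P-001", "dob": date(1950, 1, 1)},
    {"claim_id": "c2", "person_id": "P-002", "dob": date(2019, 6, 1)},
    {"claim_id": "c3", "person_id": "P-003", "dob": date(1985, 3, 3)},
]


def age_on(dob, as_of):
    """Whole years of age on the given date."""
    before_birthday = (as_of.month, as_of.day) < (dob.month, dob.day)
    return as_of.year - dob.year - before_birthday


def flag_claim(claim, as_of, min_age=16):
    """Return the reasons a claim should go to manual verification."""
    reasons = []
    if claim["person_id"] in DEATH_RECORDS:
        reasons.append("listed in death records")
    if claim["person_id"] in PRISON_RECORDS:
        reasons.append("listed in prison records")
    if age_on(claim["dob"], as_of) < min_age:
        reasons.append("below minimum age")
    return reasons


as_of = date(2020, 9, 1)
flagged = {}
for claim in claims:
    reasons = flag_claim(claim, as_of)
    if reasons:
        flagged[claim["claim_id"]] = reasons
print(flagged)
```

The rules only flag a claim for human review rather than rejecting it outright. That matches the goal described above of verifying suspicious claims, and it presumes the records have already been resolved to a single person identifier.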

There are other valuable business drivers pushing the desire to know the identity of people, their relationships, and the content of communications between them.

  • Customer 360, for a clear understanding of both individual and aggregate customer journeys with a product and its supporting services, which permits process automation, customization, retention, recommendations, and other marketing intelligence.
  • Process Automation for business processes, back office, supply chain to reduce operational costs. This includes log file processing for anomalies, maintaining inventory and fulfilling orders, and alerting humans when the assembly line breaks, to name a few examples. 
  • Financial Planning and insights for a view of customer plans and investing strategies including risk appetite. Estate planning strategies and trust management are also potential beneficiaries of relational content and identity management data.
  • Governance, Risk, and Compliance management with workflow, audit, and regulatory reporting. This includes Know-Your-Customer and other watchlist checking.

Entity Resolution also involves graph theory and can extend into Actor-Based Network Analysis, where the interactions between entities are mapped and resolved to determine patterns of behavior within communities of practice. This can occur between groups of people, between companies (think of Enron and the many companies that were impacted by its malfeasance), or even within germ cultures in medical research. Looking at entity interactions at this level is an advanced discussion and will be the topic of a future post on Actor-Based Network Analysis.
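For a first feel of the graph view, the sketch below treats entities as nodes and observed links as edges, then groups entities that are connected to each other, directly or indirectly. The names and links are invented for this example, and finding connected components is only one simple building block, not a full network analysis.

```python
from collections import defaultdict, deque

# Hypothetical links between people and companies, invented for illustration.
links = [
    ("Alice", "Acme Corp"),
    ("Bob", "Acme Corp"),
    ("Acme Corp", "Shell Co LLC"),
    ("Carol", "Globex"),
]


def connected_groups(links):
    """Group entities that are reachable from one another through links."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    groups = []
    for start in sorted(graph):  # sorted for a repeatable group order
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        group = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        groups.append(group)
    return groups


groups = connected_groups(links)
for group in groups:
    print(sorted(group))
```

Here Alice and Bob end up in the same group as "Shell Co LLC" even though neither is linked to it directly, which is the kind of indirect relationship that mapping interactions is meant to surface.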

As you continue to explore NLP and other ML/AI toolsets and how they can be composed into systems that deliver risk reduction, cost savings, and greater customer value in product experience delivery, understanding the concept of entities will make it easier to engage with many other topics in Data Science, Machine Learning, and AI engineering. When using AI toolsets in practical business applications, we use entities to understand who is doing what, when, and why. We often want to respond to actions and situations as they happen, understand the details after the fact, or predict the next actions.

S. Bolding —Copyright © 2021 · Boldingbroke.com

