Thursday, May 2, 2024

Generative AI: Risks and Rewards

Benefits of GenAI

The landscape of general computing has changed significantly since the initial introduction of ChatGPT in November 2022 and of similar LLMs for use by developers and end users alike. And yet the technology is not new, just expanded. Built on previous generations of research and model building, ChatGPT and its fellow models are examples of stitching together several targeted models in various domains to reflect a more global view of reality; ChatGPT itself reportedly comprises at least eight distinct models under the hood. It’s not the apocalypse some critics predict: the end of the middle class, the destruction of white-collar jobs, the advent of the Age of Machines.

Cynicism and fear generate clicks, and the specter of ‘radical tech’ is used for marketing purposes and to show you how important they, the titans of tech, are. Human-computer interactions via AI went from prescriptive to collaborative overnight. The shift centers on a new interface paradigm. No longer tied to the physical elements of screen, keyboard, and mouse, humans can now use natural language to order their machines around. This currently includes voice, gestures, and facial expressions, which augment traditional written text input. We talk, using words to explain what we mean, and AI talks back to us.

These more ‘human’ interactions imply collaboration, in that humans do part of the task. People scope, bound, and provide context to the questions, a practice now known as ‘prompt engineering,’ a whole new job category that emerged with the advent of interactive models. Humans do some of the tasks, AI does others, and the back and forth of it all feels more natural, with a better result at the end of the day. The trend is to view these tools and bots as augmentation of existing efforts. While Iron Man’s JARVIS is a long way from the old paradigm of Jeeves (as in AskJeeves of the Internet 1.0), generative AI is far from independent: it cannot have agency and purpose without human input.

There is a potential downside, in that AI amplifies what’s already there. If you automate great processes, they will become more efficient. If your processes are faulty, inefficient, or misdirected, then those flaws will multiply, causing more loss of time, income, and ultimately clients. It’s a double-edged sword.

Operationalizing AI

AI as a topic is about more than just LLMs and content creation. At their core, the LLMs that underpin today’s advances in AI are built on data. When you operationalize AI, you're essentially operationalizing data. For the longest time (in internet terms) the focus has been on building out networks, infrastructure, and the applications that run on top of them. Now, however, to protect your valuable information, the paradigm has shifted to protecting the data that runs through and powers that infrastructure. Think of data as the gas that fuels the engine. Without fuel, the systems and processes are just so much piping and potential.

To get to a point where companies are protecting both the infrastructure and the data, a core discipline of “always on encryption” will be needed to protect sensitive information and prevent breaches in both internal systems and cloud environments.

Data Security in Today’s Environment

Data, your personal information, is a form of currency in today’s online world. No longer is your privacy guaranteed, as geographical boundaries and jurisprudence are increasingly ignored by hackers, thieves, and well-intentioned analytics gurus. Data mining is a profitable business, and you are the product, milked for all you’re worth. There are reports every day of medical information being shared without permission with insurance groups, biostatisticians, and others in the name of the ‘public good.’ Standards organizations can’t even agree on basic definitions. Rules and regulations like GDPR give many the right to be forgotten, but you must deliberately opt out. And when you don’t realize you’re even ‘in’ the game in the first place, it's a never-ending battle to protect your digital footprint.

How does the smart consumer of digital life go about shielding their most intimate details, while still enjoying the many entertainments and services at hand on the web of today? Some experts say it’s already too late. And they may be right. Your cellphone alone provides a wealth of data via GPS, location tracking and connectivity services, the baseline necessary to stay connected while out and about, before you even begin to indulge in a little TikTok. 

Needless to say, data security is a core discipline for companies that want to protect sensitive information and prevent breaches in both internal systems and cloud environments. In the case of expert systems, where the software mines information internally from repositories and active channels such as MS Teams or Slack, employees are subject to exposure on a regular basis. These smart knowledge management systems advertise that they distribute the right information to the right people at the right time. However, if an employee is chatting about personal plans or sensitive health information, that could be shared with coworkers unintentionally. Another example: executives discussing the financial implications of an upcoming merger. That information must remain confidential to comply with SEC regulations and to avoid violating insider trading restrictions. Ensuring that such conversations remain limited to a small cohort is where data privacy rules and policies come into play.
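As one illustration of what such a policy can look like in practice, here is a minimal sketch of a redistribution check a knowledge-management bot might run before sharing a message outside its original channel. The patterns, channel names, and function are hypothetical and far simpler than a production data-loss-prevention rule set.

```python
import re

# Hypothetical sensitivity rules for a knowledge-management bot.
SENSITIVE_PATTERNS = [
    r"\b\d{3}-\d{2}-\d{4}\b",         # US Social Security number format
    r"\b(diagnosis|prescription)\b",   # simple health-related keywords
    r"\b(merger|acquisition)\b",       # deal talk stays with the deal team
]

def may_redistribute(message: str, source_channel: str, target_channel: str) -> bool:
    """Allow sharing outside the source channel only when nothing sensitive matches."""
    if target_channel == source_channel:
        return True
    return not any(re.search(p, message, re.IGNORECASE) for p in SENSITIVE_PATTERNS)

print(may_redistribute("Q3 merger numbers attached", "exec-deal-team", "all-hands"))  # False
```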

Encryption at Rest and in Transit

Maintaining data security is critical for most organizations. Unauthorized access to even moderately sensitive data can damage the trust, relationships, and reputation that you have with your customers. Any truly responsible company encrypts data stored at rest by default. Any object uploaded to the cloud can be encrypted using, for example, an Amazon-managed encryption key or a key held in Google Cloud KMS. Having a key management solution is a critical step for any company to take. Data in transit should also be encrypted to follow industry best practices and prevent leakage or exposure to bad actors. Again, each company will employ its own keys and methods to ensure this protection is in place. It’s all about good hygiene along the entire route of getting the bits of information to and from where they are most needed.
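As a concrete but minimal sketch of the at-rest side, an upload to Amazon S3 can request server-side encryption under a KMS key; the bucket name and key alias below are placeholders, and the SDK talks to S3 over HTTPS, which covers the in-transit leg.

```python
import boto3

s3 = boto3.client("s3")  # the SDK uses HTTPS endpoints, so the upload is encrypted in transit

with open("customer-report.csv", "rb") as f:
    s3.put_object(
        Bucket="example-data-bucket",            # placeholder bucket name
        Key="reports/customer-report.csv",
        Body=f,
        ServerSideEncryption="aws:kms",          # encrypt at rest under a KMS key
        SSEKMSKeyId="alias/example-data-key",    # placeholder key alias
    )
```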

Risks of Unbounded AI

The government, as usual, is behind the curve in regulating the industry. There are no clear ways to apply old paradigms enshrined in law to new ways of viewing the world. Regulatory controls and agencies such as the FEC and SEC are designed to monitor human beings and to enforce their compliance with laws governing money and markets. If an AI agent acts contrary to the rules, who do you punish? The programmers who created the agent or bot? The company that deployed the AI? The law of unintended consequences always rears its head.

A recent case before the Supreme Court, Gonzalez v. Google, addressed YouTube’s use of content to generate video recommendations for users. Does liability rest in the organization’s ability to shape content via recommendation engines and the algorithms that power them? The case is relevant to the debate, as ChatGPT and its brethren operate according to the same principles as those recommendation engines. In other words, do Section 230 protections, which generally apply to third-party content from users, extend to content or information a company creates out of that third-party data? Do they protect companies from the consequences of their own products? More importantly, who ‘owns’ the output of the engines? In Thaler v. Vidal, the courts maintained that US patent law requires a human inventor.

Many institutions and groups are organizing around ‘Ethical AI’ to lay the groundwork for a public policy debate that surely must come. Groups such as Oxford University, the Institute for Ethical AI and ML (UK), Stanford University, AI for Good (UN), and the Global AI Ethics Institute (Paris) are all attempting to lead the way. There are many more out there offering classes, certificates, and services such as audits and systemic quality reviews. It’s early days.

Ethical and Privacy Concerns Inherent in the AI/ML Space

When it comes to the use of data to create LLMs, and to the output of those models in the GenAI world, there are a few points of interest to consider. When you stop the flow of data, the gas that fuels the engine, you interrupt the flow of business. Effective cybersecurity means safeguarding the flow as well as the contents of data. In terms of AI, and how that data is used to build models, securing your digital assets starts at the moment of creation, and from then on curation is the name of the game. The points below describe how AI sucks in your digits, chews them up, and ultimately spits out a version of your reality, or view of the world.

Provenance: Who chooses the data to train the model?

There are many sources of data that a company generates: documents, communications, imagery, marketing, financial records, and so forth. Granting access to the databases, file repositories, and financial systems inherently opens those systems up to intrusions, because once you provide an API or other access path, it can be compromised. Guaranteeing security at the source is critical, and it always comes down to a people problem. Trust the people, secure the API endpoints, and institute best practices such as 2FA or MFA to ensure the person or system accessing your data is really authorized and authenticated to use it.
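For illustration only, here is a minimal sketch of the second-factor step, assuming a TOTP authenticator app and the pyotp library; the function names and the surrounding token gate are hypothetical.

```python
import pyotp

def verify_second_factor(user_totp_secret: str, submitted_code: str) -> bool:
    """Check a one-time code against the user's enrolled TOTP secret."""
    totp = pyotp.TOTP(user_totp_secret)
    # valid_window=1 tolerates slight clock drift between device and server
    return totp.verify(submitted_code, valid_window=1)

def grant_api_token(password_ok: bool, totp_secret: str, code: str) -> bool:
    """Issue an API token only when both factors succeed."""
    return password_ok and verify_second_factor(totp_secret, code)
```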

But people have biases, and data must be cleansed, normalized, labeled, and in many other ways massaged to be useful to machines. Unsupervised learning techniques avoid a lot of these issues; however, they are less efficient and more prone to statistical error than having humans curate the data set by preprocessing it. The quality of data matters, and just as ‘you are what you eat,’ your model is what it ingests. If you only feed it emails, then a skewed view of your company results.
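A hypothetical cleansing pass over a chat or email export might look like the sketch below (pandas; the file names, columns, and review keywords are illustrative, not a prescribed pipeline).

```python
import pandas as pd

df = pd.read_csv("messages_export.csv")        # hypothetical raw export

df = df.drop_duplicates(subset="message_id")   # remove exact duplicates
df = df.dropna(subset=["author", "body"])      # drop rows missing key fields
df["body"] = (
    df["body"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)      # normalize whitespace
    .str.lower()                               # normalize case
)

# Crude labeling step: flag rows a human curator should review before training
df["needs_review"] = df["body"].str.contains(r"\bssn\b|\bdiagnosis\b", regex=True)

df.to_csv("messages_curated.csv", index=False)
```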

Data has a sort of provenance to it, and the evidence can be found in the metadata: was it created by a human or a computer, when was it created, where is it housed, and so forth. Each element, while able to be faked, adds up to a sort of fingerprint or watermark, an information trail as it were.
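One lightweight way to make that trail explicit, sketched below with illustrative field names, is to hash each record’s metadata into a reproducible fingerprint that travels with the data.

```python
import hashlib
import json

def provenance_fingerprint(metadata: dict) -> str:
    """Hash a record's metadata into a short, reproducible provenance tag."""
    canonical = json.dumps(metadata, sort_keys=True)  # stable field ordering
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

tag = provenance_fingerprint({
    "source": "crm_export",                       # illustrative field names
    "created_by": "human",
    "created_at": "2024-03-14T09:22:00Z",
    "stored_in": "s3://example-bucket/crm/2024/",
})
print(tag)  # identical metadata always yields the same tag, so tampering is detectable
```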

Methodology: What technique is used to produce and maintain the model

There are many approaches to model generation: Supervised Learning, Unsupervised Learning, Reinforcement Learning, and Deep Learning, to name the most popular. You may also have heard of techniques like ‘zero-shot’ and ‘one-shot’ (or ‘few-shot’) learning.

Each method is an approach to statistically evaluating the data discussed under provenance above. How and why you pick one approach over another is based purely on the results you are looking for. Business applications and solutions to process bottlenecks require differing perspectives to achieve improved outcomes.
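As a toy contrast between two of the approaches named above, the sketch below trains a supervised classifier and an unsupervised clusterer on the same synthetic data (scikit-learn); it is illustrative, not a recommendation of either method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic stand-in for a curated feature matrix X and labels y
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Supervised learning: the model is shown the "right answers" (y) during training
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised learning: the model sees only X and must find structure on its own
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == c).sum()) for c in (0, 1)])
```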

These techniques are fundamentally descriptive and focus on what humans actually have done. They do not handle the ambiguity of what we ‘should do,’ or its negation, what we ‘shouldn’t do.’ Consider the method used to train an AI to identify which recruits to interview for a new hire. If the data set is skewed and the method is weighted toward a desired outcome, the machine’s decision making automatically throws out resumes it deems not statistically significant. There is no way to argue against it, and valid candidates don’t even get a rejection email. Classifications can be reviewed, but who’s to say it was the right set of classifiers in the first place? The explainability of your method is just as critical to the process as the data. Values such as beneficence, non-maleficence, autonomy, accountability, responsibility, and privacy are also key dimensions.
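One simple, widely cited sanity check on such a screening model is to compare selection rates across applicant groups; the sketch below uses fabricated, purely illustrative decisions and the common ‘four-fifths’ heuristic.

```python
import pandas as pd

# Fabricated, illustrative screening decisions: one row per applicant
results = pd.DataFrame({
    "group":    ["A", "A", "A", "A", "B", "B", "B", "B"],
    "selected": [1,   1,   1,   0,   1,   0,   0,   0],
})

rates = results.groupby("group")["selected"].mean()
impact_ratio = rates.min() / rates.max()

print(rates)
print(f"impact ratio: {impact_ratio:.2f}")
if impact_ratio < 0.8:  # the common four-fifths screening heuristic
    print("warning: selection rates differ enough to warrant human review")
```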

Don’t exaggerate what your algorithm or model can do. Don’t speculate that it delivers fair or unbiased results. Your statements must be backed up by evidence, and this is where we get to the next issue, one of quality and veracity.

Quality: What technique is used to test the model and validate its output

Transparency is the cornerstone of trust when it comes to what is, for all intents and purposes, a black-box process. The means by which you start out must be the way you finish, and the ends cannot justify the means. If a company is not careful, decisions and activities can be delegated to bots and justified by the thought that bots can get away with everything a human can’t. This is a logical fallacy: the evasion of responsibility ends up becoming no more than a legal dodge, a sort of ‘human behavior laundering’ in the guise of technical advances. Who’s to blame? You can’t point to a single person; the machine’s doing the work.

While the alphabet agencies are developing frameworks and talking about compliance and certification, there are still basic steps toward testing, validation, and certification that you can take. Obvious proactive measures are:

  • Independent audits
  • Legal reviews
  • Opening your data or source code to inspection by third parties

Quality control of the data input lays the foundation for verification of the output. Do the results match the expected area of information and the standard responses a reasonable person would accept? Are statistical measures within tolerances? All of this implies that you have established acceptance criteria and KPIs in advance, expectations that support your business goals.
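A minimal sketch of what ‘criteria established in advance’ can look like in code follows; the thresholds, metric names, and the evaluation numbers fed in are hypothetical and would come from your own KPIs and test runs.

```python
# Hypothetical acceptance criteria agreed on before the model ships
ACCEPTANCE_CRITERIA = {
    "min_accuracy": 0.90,       # at least 90% correct on the holdout set
    "max_refusal_rate": 0.05,   # no more than 5% of queries left unanswered
    "max_pii_leaks": 0,         # zero tolerance for leaked personal data
}

def validate(metrics: dict) -> list[str]:
    """Compare measured metrics to the acceptance criteria and return any failures."""
    failures = []
    if metrics["accuracy"] < ACCEPTANCE_CRITERIA["min_accuracy"]:
        failures.append("accuracy below tolerance")
    if metrics["refusal_rate"] > ACCEPTANCE_CRITERIA["max_refusal_rate"]:
        failures.append("refusal rate above tolerance")
    if metrics["pii_leaks"] > ACCEPTANCE_CRITERIA["max_pii_leaks"]:
        failures.append("PII leak detected")
    return failures

# Example with made-up numbers from a hypothetical evaluation run
print(validate({"accuracy": 0.93, "refusal_rate": 0.08, "pii_leaks": 0}))
# -> ['refusal rate above tolerance']
```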

In testing, skepticism about the data and the output is your best tool. Deepfakes are an increasing problem, and confirmation bias creates a reinforcing feedback loop that increases the willingness of some to believe whatever a machine creates. ‘The data says so’ and ‘that’s what the data shows’ are no longer strong arguments. Corroboration of information means human curation at some point, and quality control is that step.

Product: Who owns the output

The legal responsibility for the impact of a model is the question at hand. As ridiculous as it may seem, lawyers are already being hired to represent AI chatbots. This maneuver is an attempt to shift responsibility away from the humans behind the curtain. In effect, the developers and the companies who employ them are trying to say, ‘hand to heart, we can’t be held responsible for discrimination; it’s an algorithm.’

Logic dictates that we examine this attitude. Common sense says ‘the creator is greater than the creation.’ When humans encode instructions and give direction to a model, they are essentially the puppet masters controlling the strings. A computer sifts through data, looking for patterns. Who told it which patterns were relevant and which to ignore? The humans who wrote the algorithms and code, obviously. The teams that create, curate, process, and quality-control the data should be held responsible for the outcome of their creation.

The next question that immediately comes to mind is ‘who then owns the output?’ Some companies are arguing that the end user is the legal owner; OpenAI’s terms, for instance, assign the right, title, and interest in output to the user. Legal experts caution that while it may be a nice position to take, large corporations are wont to change their minds retroactively. Additional criteria may include ‘independent intellectual effort’ (perhaps demonstrated by prompt creation), ‘originality’ of authorship, and so forth. None of these are inherent in AI-generated output.

In Conclusion

Learning about all these issues and more will protect you and your company in the coming months and years as the current wild west of GenAI turns into the established way of doing business. Taking care now to balance and protect your interests will ensure a safe and prosperous future for all.

