Tuesday, April 30, 2024

Machine Learning Step by Step

Machine Learning, at its simplest, is a technique whereby the computer learns from data rather than being explicitly programmed to perform a task. Using labeled datasets, known as training data, algorithms look for patterns and are taught to recognize data that matches those patterns. A model is then produced and applied to a larger, unstructured set of data. Because the model has been taught, it can comb through vast amounts of unstructured data quickly and find the patterns we humans are seeking to uncover.

There, that wasn’t so hard, was it? But wait. It’s far more complicated than what I just described. How do you label the data? What are those pesky algorithms anyway, and how do they know how to read through the training data and pick out what’s important from the language? A model sounds easy, but there are many types of models that can be created depending on your business needs and goals. The approaches to machine learning are varied and evolving as research progresses in different directions. Supervised, unsupervised, reinforcement, and deep learning are just a few of the more popular approaches.

There are three steps to follow:

  1. Get some data
  2. Create and train the model
  3. Test and refine the model

Training Data

In any big data processing challenge, the first hurdle to overcome is finding the right set of data to work with. Sometimes too much data is just as much a problem as too little. The need to cleanse and normalize content so that it can be handled by software has long been a challenge in computer science. This process is known as ETL (Extract, Transform, Load). However, if you could work with unstructured data and avoid the ETL process altogether, there would be a huge time and cost savings. Enter the concepts of Machine Learning and Training Data.

With Training Data, you teach the computer to recognize patterns by labeling a small set of data with the information you are seeking. You place tags or metadata into the information and let the computer learn from that subset. The computer then sifts through all the unstructured data looking for the same patterns, discarding anything that doesn’t match and returning whatever matches the pattern, or nearly matches it within a configurable threshold. Another application is to tag, or add metadata to, the items that do match. This is one type of Auto-Labeling.
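To make the threshold idea concrete, here is a minimal sketch using TF-IDF similarity as a stand-in for whatever matching a real system would use. The documents, tags, and cutoff value are all invented for illustration.

```python
# A minimal sketch of threshold-based matching for auto-labeling.
# The labeled examples, unlabeled items, and THRESHOLD are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny hand-labeled training set: text plus the tag a human assigned.
labeled = [
    ("wire transfer flagged for manual review", "payments"),
    ("quarterly project status report submitted", "project-docs"),
]
unlabeled = [
    "international wire transfer pending approval",
    "notes from the holiday party",
]

vectorizer = TfidfVectorizer()
label_vectors = vectorizer.fit_transform([text for text, _ in labeled])
candidate_vectors = vectorizer.transform(unlabeled)

THRESHOLD = 0.3  # the configurable "near match" cutoff mentioned above

# Compare each unlabeled item to every labeled example; keep matches,
# discard everything below the threshold.
similarities = cosine_similarity(candidate_vectors, label_vectors)
for text, scores in zip(unlabeled, similarities):
    best = scores.argmax()
    if scores[best] >= THRESHOLD:
        print(f"tag {labeled[best][1]!r} -> {text!r} (score {scores[best]:.2f})")
    else:
        print(f"no match -> {text!r}")
```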

Right now, the process of creating training data is very manual. For most purposes, humans need to review the training data and add the tags according to the business goals. Depending on the outcomes, one set of data can have a variety of purposes. For example, reviewing project documentation can show the need for more job training or analyzing electronic and voice communications can reveal espionage.

There is a growing trend of using NLP for Auto-Labeling to create training data. This process looks at context and content to extract entities, determine the most important elements of a document or data item (such as a banking transaction, wire transfer, or communiqué), and add the metadata automatically. These services are still in the early stages of development, and their quality is not yet established. Therefore, the output is always validated by humans and is frequently subject to iterative learning processes (see below) and rigorous testing.
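As a rough illustration of what NLP-based auto-labeling looks like, the sketch below extracts named entities from a document and attaches them as metadata. It assumes spaCy and its small English model are installed; the sample text is made up.

```python
# A rough illustration of NLP-based auto-labeling with named entities.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

document = "Acme Bank wired $2.4 million to Globex Corp in Zurich on March 3."
doc = nlp(document)

# Each detected entity becomes a metadata tag: (text, entity type).
metadata = [(ent.text, ent.label_) for ent in doc.ents]
print(metadata)
# e.g. [('Acme Bank', 'ORG'), ('$2.4 million', 'MONEY'),
#       ('Globex Corp', 'ORG'), ('Zurich', 'GPE'), ('March 3', 'DATE')]
```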

Training Data carries a hidden risk. Whoever tags the data sets the focus or intent of the model that learns from it. In other words, those tags are the “teacher” of the model. If that person is biased, or leans even unintentionally in a particular direction, the model becomes biased as well. A lack of data also creates bias. For example, it is well known that much medical data is based on white males; there is comparatively little medical research data for, say, women in developing countries or elderly people living in poverty.

Therefore, models built on training data will naturally be biased towards the demographics used to train them. The way to fix this problem is to use a more diverse set of training data and to involve a more diverse group of people in tagging it. Another option is Auto-Labeling: let the machines tag the data, then have humans review the output for bias during quality control.

Algorithms for Model Creation

When a model is created, a specific set of code is used to process the data. These steps for processing and sifting through the information are called algorithms. Many types have been created by researchers who specialize in understanding how information is catalogued and broken down. Some of the most popular are regression algorithms, decision trees, clustering and association algorithms, and neural networks. Each one has its benefits and drawbacks. If you want to go deeper, IBM has a nice description of the various approaches here.

The process of creating a model based on the approach you choose is a topic in its own right for a future discussion. In brief, you train the model on your training data set, compare its output with what you expected it to produce, and then adjust the variables until you get the desired results. This produces a targeted, precise model trained to find what you are looking for. In other words, you have taught it to look for a specified set of data based on the problem to be solved. A model built to solve medical problems will be very different in nature from a model built to drive a robotic arm painting pictures.
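Here is a simplified sketch of that train-compare-adjust loop, using one of the algorithm families named above (a decision tree) on synthetic data. The only "variable" being adjusted is the tree depth; real projects tune many more, and the data here is generated rather than real.

```python
# A simplified train-compare-adjust loop on synthetic data (an assumption,
# standing in for a real labeled training set).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

best_depth, best_score = None, 0.0
for depth in (2, 4, 8, 16):                  # candidate settings to try
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)              # train on the training data
    score = accuracy_score(y_test, model.predict(X_test))  # compare with expected labels
    if score > best_score:                   # keep the adjustment that works best
        best_depth, best_score = depth, score

print(f"best max_depth={best_depth}, accuracy={best_score:.2f}")
```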

Iterative Learning

Iterative Learning is the process of repeatedly training on renewed data sets over time. When first training the model, as noted above, it is very important to have the right set of data to get the expected outcomes. Over time, however, that data becomes stale or ineffective. Situations and context change, and the information in the model is no longer relevant. The model must be retrained periodically to remain precise and effective. This is often required every six months, or even more frequently, and can be very expensive.

Techniques are being developed to keep models up to date without the need for intermittent retraining. Iterative learning is one of those techniques, where essentially the machine gets a lot of “practice” at targeting the search for information.

The process is basically one of continually retraining the base model on pre-processed training data sets and then spot-testing the results. Once the test cases pass muster, the feedback returned to the system shows that the “tuning,” or set of adjustments to the model, is acceptable. Once it passes that threshold, or bar of acceptability, the updated model is put into operation. This iterative process can take place offline, with the results promoted into the operational model once human users approve them.
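A sketch of that offline retrain-test-promote loop might look like the following. The data source, acceptance threshold, and "promotion" step are placeholders; a production system would pull freshly labeled data and actually deploy the approved model.

```python
# Sketch of an iterative retraining loop with threshold-gated promotion.
# fetch_fresh_training_data and ACCEPTANCE_THRESHOLD are hypothetical.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

ACCEPTANCE_THRESHOLD = 0.85   # the "bar of acceptability"
production_model = None       # whatever is currently serving requests

def fetch_fresh_training_data(seed):
    """Placeholder for pulling and pre-processing a renewed data set."""
    return make_classification(n_samples=400, n_features=8, random_state=seed)

for cycle in range(3):                        # e.g. one cycle every few months
    X, y = fetch_fresh_training_data(cycle)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=cycle)

    candidate = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    score = accuracy_score(y_test, candidate.predict(X_test))  # held-out spot test

    if score >= ACCEPTANCE_THRESHOLD:
        production_model = candidate          # promote into operation
        print(f"cycle {cycle}: promoted (accuracy {score:.2f})")
    else:
        print(f"cycle {cycle}: rejected (accuracy {score:.2f})")
```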

In a fully machine-learned system, the whole process can be automated and the model updated in real time for immediate use. It depends on the degree to which the humans using the system accept and trust the software to be accurate and of high quality. And of course, it also depends on the domain in which the software operates. Where safety is at stake, human-in-the-loop decision making is critical to ensure lives are not lost, for example in medical applications or airplane safety. For low-risk applications, there is less need for concern.

Supervised Learning

Supervised Machine Learning works on labeled data sets, with information classified in advance to ensure a predetermined outcome. This is what we most commonly think of when talking about machine learning. The model's output can be compared against the actual labels for testing, but that leads to the danger of “overfitting”: a model tied too closely to its training data doesn’t handle variation in new data very well. It is too strict in its definitions and will miss edge cases and emerging trends. This is where bias in models comes into play, leading us to the unsupervised approach.
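A quick way to see the overfitting danger is to compare a model's accuracy on the data it was trained on with its accuracy on data it has never seen. The deep, unconstrained tree and the noisy synthetic data below are assumptions chosen to make the gap obvious.

```python
# Overfitting check: training accuracy vs. accuracy on unseen data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise so a memorizing model looks worse on new data.
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)  # no depth limit

print("training accuracy:", model.score(X_train, y_train))  # typically near 1.0
print("test accuracy:    ", model.score(X_test, y_test))    # noticeably lower
```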

Unsupervised Learning

Unsupervised Learning starts by ingesting unlabeled data, avoiding the cost and time required to tag the training data. It requires algorithms to extract “features” or interesting elements such as nouns, phrases, and topics of interest. The algorithms then sort and classify the chunks of text into patterns that occur with varying degrees of frequency.

This approach is less about decisions and predictions and more about identifying patterns that humans would miss because the volume of data is so vast. Computers can process huge amounts of data far more efficiently than a team of humans could ever hope to tag and label the same data set. Spam filtering comes to mind in this regard. This approach is also popular in cybersecurity.
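A small sketch of the unsupervised idea, loosely inspired by the spam example: no labels, just raw text turned into features and grouped by similarity. The documents and the choice of two clusters are illustrative assumptions.

```python
# Unsupervised grouping of unlabeled text: extract features, then cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "win a free prize, click this link now",
    "limited time offer, claim your free reward",
    "meeting moved to 3pm, see the updated agenda",
    "please review the attached project agenda before the meeting",
]

features = TfidfVectorizer().fit_transform(documents)   # extract "features"
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

for text, cluster in zip(documents, labels):
    print(cluster, text)   # spam-like messages tend to land in one cluster
```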

Reinforcement Learning

Reinforcement Learning can be summed up as the “trial and error” approach. While it is somewhat similar to Supervised Learning, there is no sample set of training data involved. Rather, the algorithm attempts to map out the best decisions by trying a series of actions and recording the success or failure rate over time. It must know the state of the environment it is operating in to succeed. This approach is well suited to game theory applications.
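Here is a toy illustration of that trial-and-error loop: tabular Q-learning on a made-up five-cell corridor where the agent earns a reward only by reaching the rightmost cell. The environment, learning rate, and other parameters are all arbitrary.

```python
# Toy reinforcement learning: tabular Q-learning on an invented corridor.
import random

N_STATES, ACTIONS = 5, [0, 1]        # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2
q_table = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(200):           # many trials, learning from each one
    state = 0
    while state != N_STATES - 1:     # until the goal state is reached
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)                          # explore
        else:
            action = max(ACTIONS, key=lambda a: q_table[state][a])   # exploit
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # Update the estimate of how good this action was in this state.
        q_table[state][action] += ALPHA * (
            reward + GAMMA * max(q_table[next_state]) - q_table[state][action]
        )
        state = next_state

print("learned preference for 'right' in each state:",
      [round(q[1] - q[0], 2) for q in q_table])
```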

Deep Learning

Deep Learning frequently involves artificial neural networks. The process requires vast amounts of data because its design mimics the learning processes of the human brain. Data passes through multiple layers of calculations, some visible and others hidden, with learned weights and bias terms built in. The developers who create each unique neural network are looking to achieve a particular outcome, and they shape the architecture and training to direct the results toward a targeted goal. The learning process for these models can be unsupervised or semi-supervised, meaning the data can have some tagging or metadata applied to it before being ingested.
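As a minimal sketch of the layered idea, here is a small multi-layer neural network where data passes through hidden layers of weighted calculations. Real deep learning models are far larger and usually built with frameworks like TensorFlow or PyTorch; the layer sizes and synthetic data here are assumptions.

```python
# A tiny neural network with two hidden layers, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; each neuron applies learned weights plus a bias term.
network = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
network.fit(X_train, y_train)

print("test accuracy:", network.score(X_test, y_test))
```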


Applications for Deep Learning include computer vision, speech recognition, self-driving cars, and Natural Language Generation (NLG). Many of these innovations are critical in the field of applied robotics and space exploration. Other use cases are digital assistants, chatbots, medical image analysis, fraud detection, and cybersecurity.

A Cautionary Note

Machine Learning is already shaping our world and how we live, as the examples above show. It’s not just Siri and Alexa making our lives easier. Protecting our increasingly digital lives from cyber terrorism, identity theft, and malicious actors is one way data scientists work in the background to keep consumers from losing their digital right to privacy. We need an Internet Bill of Individual Data Rights.

We are well into a new era. It’s no longer the Age of the Internet; it’s the Age of Digital Data, where a person’s virtual identity is just as important to safeguard as their physical paper trail, if not more so. Increasingly, everything is about who owns the data and where it is housed. Who has access, and who is monetizing your digital footprint? Companies like Facebook, Amazon, and Apple have long made you the product, using your data, data they own by virtue of their terms of service, to create revenue. With machine learning techniques this is becoming more true than ever before. Be wise, be safe. Every click you take, every Like you make, they’re watching you.

It's called Predictive Analytics, another way to make even more revenue off of your data, and a topic for a future post.

S Bolding—Copyright © 2021 · Boldingbroke.com


