Machine Learning, at its simplest, is a technique whereby the computer learns from data rather than being explicitly programmed to perform a task. Using labeled datasets, known as training data, algorithms look for patterns and are taught to recognize the data that matches those patterns. A model is then produced and applied to a larger, unstructured set of data. Because the model was taught, it can comb through vast amounts of unstructured data quickly and find the patterns we humans are seeking to uncover.
There, that wasn’t so hard, was it? But wait. It’s far more complicated than what I just described. How do you label the data? What are those pesky algorithms anyway, and how do they know how to read through the training data and pick out what’s important in the language? A model sounds easy, but there are many types of models that can be created depending on your business needs and goals. The approaches to machine learning are varied and evolving as research progresses in orthogonal directions. Supervised, unsupervised, reinforcement, and deep learning are just a few of the more popular approaches.
There are three steps to follow (sketched in code below):
- Get some data
- Create and train the model
- Test and refine the model
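To make those three steps concrete, here is a minimal sketch in Python. The scikit-learn library and its built-in Iris dataset are stand-ins chosen purely for illustration; your own data and model will differ.

```python
# A minimal end-to-end sketch of the three steps, using scikit-learn and
# its built-in Iris dataset as illustrative stand-ins for real business data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 1. Get some data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 2. Create and train the model
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3. Test and refine the model
print(f"Accuracy on held-out data: {model.score(X_test, y_test):.2f}")
# If the score is not good enough, adjust the data or model settings and repeat.
```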
Training Data
In any big data processing challenge, the first hurdle to overcome is finding the right set of data to work with. Sometimes too much data is just as much a problem as too little data. The need to cleanse and normalize content so that it can be handled by software has long been a challenge in computer science. This process is known as ETL (Extract, Transform, Load). However, if you could work with unstructured data and avoid the ETL process altogether, there would be huge time and cost savings. Enter the concept of Machine Learning and Training Data.
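For readers who haven’t run into ETL before, here is a miniature sketch of the kind of pipeline that working directly on unstructured data lets you skip. The CSV file name, column names, and SQLite target are assumptions made purely for illustration.

```python
# A miniature ETL sketch: extract rows from a CSV export, transform
# (cleanse and normalize) them, and load them into a database.
# The file name, column names, and SQLite target are illustrative assumptions.
import csv
import sqlite3

conn = sqlite3.connect("cleaned.db")
conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")

with open("raw_export.csv", newline="") as f:                        # Extract
    for row in csv.DictReader(f):
        name = row["name"].strip().title()                           # Transform: cleanse
        age = int(row["age"]) if row["age"].isdigit() else None      # Transform: normalize
        conn.execute("INSERT INTO people VALUES (?, ?)", (name, age))  # Load

conn.commit()
conn.close()
```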
With Training Data, you teach the computer to recognize patterns by labeling a small set of data with the information you are seeking. You place tags or metadata into the information and let the computer learn from that subset. Then the computer sifts through all the unstructured data looking for the same patterns, discarding anything that doesn’t match and returning the items that match the pattern, or are near matches, within a threshold that can be set. Another application is to apply tags or metadata to the items that do match. This is one type of Auto-Labeling.
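Here is a rough sketch of that idea in Python: learn from a small hand-tagged set, then score the unstructured pile and auto-tag anything above a settable threshold. The scikit-learn classifier, the toy documents, and the 0.8 cutoff are all illustrative assumptions, not a recipe.

```python
# A minimal sketch of learning from a small labeled set and then
# auto-labeling unstructured text above a confidence threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A handful of hand-tagged documents: 1 = relevant, 0 = not relevant
labeled_docs = [
    "wire transfer flagged for review",
    "quarterly project status report",
    "suspicious offshore wire transfer",
    "team lunch scheduled for friday",
]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(labeled_docs)
model = LogisticRegression().fit(X_train, labels)

# Sift through the (much larger) unlabeled pile
unlabeled_docs = [
    "large wire transfer to unknown account",
    "reminder: submit your timesheet",
]
scores = model.predict_proba(vectorizer.transform(unlabeled_docs))[:, 1]

THRESHOLD = 0.8  # adjustable "near match" cutoff
for doc, score in zip(unlabeled_docs, scores):
    tag = "relevant" if score >= THRESHOLD else "no tag"
    print(f"{score:.2f}  {tag}: {doc}")
```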
Right now, the process of creating training data is very manual. For most purposes, humans need to review the training data and add the tags according to the business goals. Depending on the desired outcomes, one set of data can serve a variety of purposes. For example, reviewing project documentation can reveal the need for more job training, while analyzing electronic and voice communications can reveal espionage.
There is a growing trend of using NLP for Auto-Labeling to create training data. This process looks at context and content to extract entities and determine the most important elements of a document or data item (such as a banking transaction, wire transfer, or communiqué), then adds the metadata automatically. These services are still in the early stages of development, and the quality is not yet established. Therefore, the output is always validated by humans and is frequently subject to iterative learning processes (see below) and rigorous testing.
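As a hedged example of what such a service does under the hood, a pretrained named entity recognizer can propose tags automatically. spaCy and its "en_core_web_sm" model are just one illustrative choice, and as noted above the output still needs human validation.

```python
# A sketch of NLP-based Auto-Labeling: use a pretrained named entity
# recognizer to attach metadata automatically. spaCy and the
# "en_core_web_sm" model are illustrative choices, not the only option.
import spacy

nlp = spacy.load("en_core_web_sm")  # pip install spacy; python -m spacy download en_core_web_sm

doc = nlp("Acme Corp wired $2.4 million to a Zurich account on March 3rd.")

# Proposed tags, to be confirmed (or rejected) by a human reviewer
auto_labels = [(ent.text, ent.label_) for ent in doc.ents]
print(auto_labels)
# e.g. [('Acme Corp', 'ORG'), ('$2.4 million', 'MONEY'), ('Zurich', 'GPE'), ('March 3rd', 'DATE')]
```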
Training Data carries a hidden risk. Whoever tags the data sets the focus or intent of the model that learns from it. In other words, those tags are the “teacher” of the model. If that person is biased, or leans even unintentionally in a particular direction, the model becomes biased as well. A lack of data also creates bias. For example, a well-known bias is that much medical data is based on white males; there is comparatively little medical research data on, say, women in developing countries or elderly people living in poverty.
Therefore, models built on training data will naturally be biased toward the demographics used to train them. The way to fix this problem is to have a more diverse set of training data and more diversity among the people who tag it. Another solution is Auto-Labeling: let the machines tag the data, and then have humans review the output for bias during the quality control stages.
Algorithms for Model Creation
When a model is created, a specific set of code is used to process the data.
These steps to process and sift through the information are called algorithms.
There are many types that have been created by researchers who specialize in
understanding how information is catalogued and broken down. Some of the most
popular are regression algorithms, decision trees, clustering and associative
algorithms, and neural networks. Each one has its benefits and drawbacks. If
you want to go deeper, IBM has a nice description of the various approaches here.
The process of creating a model based on the approach you choose is a topic in its own right for a future discussion. In brief, you train the model using your training data set, compare the output with what you expected it to produce, and then adjust the variables until you get the desired results. This creates a targeted, precise model trained to find what you are looking for. In other words, you have taught it to look for a specified set of data based on the problem to be solved. A model built to solve medical problems will be very different in nature from a model built to drive a robotic arm painting pictures.
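Here is a rough sketch of that train, compare, and adjust loop, using a decision tree’s depth as the variable being tuned. The library, dataset, and candidate settings are illustrative assumptions; any algorithm and metric could stand in.

```python
# A rough sketch of "train, compare, adjust": sweep one model variable
# (here, tree depth) and keep whichever setting best matches expectations
# on held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_depth, best_score = None, 0.0
for depth in (1, 2, 3, 5, 8):           # the "variables" being adjusted
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    score = model.score(X_val, y_val)   # compare output with expectations
    print(f"max_depth={depth}: validation accuracy {score:.2f}")
    if score > best_score:
        best_depth, best_score = depth, score

print(f"Chosen setting: max_depth={best_depth} ({best_score:.2f})")
```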
Iterative Learning
Iterative Learning is the process of repeated training on renewed data sets over time. When first training the model, as noted above, it’s very important to have the right set of data to get the expected outcomes. However, over time that data becomes stale or ineffective. In reality, situations and context change, and the information in the model is no longer relevant. The model must be retrained periodically for it to remain precise and effective. This is often required every six months, or even more frequently, and can be very expensive. Techniques are being developed to keep models up to date without the need for costly, wholesale retraining. Iterative learning is one of those techniques, where essentially the machine gets a lot of “practice” at targeting the search for information.
The process is basically one of continually retraining the base model on pre-processed training data sets and then spot-testing the results. Once the test cases pass muster, the feedback returned to the system shows that the “tuning” or adjustments to the model are acceptable. Once it passes the threshold, or bar of acceptability, that portion of the model is updated and put into operation. This iterative process can take place offline, with the results promoted into the operational model once human users approve the end results.
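A hedged sketch of such an offline retrain-test-promote cycle might look like the following. The acceptance threshold, the stand-in dataset, the helper function, and the file name for the operational model are all assumptions for illustration.

```python
# A sketch of an offline retrain-test-promote loop. The accuracy threshold,
# file name, and data source are assumptions; a real pipeline would plug in
# its own refreshed data and a human review step before promotion.
import joblib
from sklearn.datasets import load_iris  # stand-in for "renewed" business data
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

ACCEPTANCE_THRESHOLD = 0.90  # the "bar of acceptability"

def retrain_and_maybe_promote(model_path="model.joblib"):
    # Retrain on the refreshed data set
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    candidate = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Spot-test against held-out cases
    score = candidate.score(X_test, y_test)
    print(f"Candidate model accuracy: {score:.2f}")

    # Promote into operation only if it passes the threshold
    if score >= ACCEPTANCE_THRESHOLD:
        joblib.dump(candidate, model_path)   # replaces the operational model
        print("Candidate promoted to operational model.")
    else:
        print("Candidate rejected; keeping the current model.")

retrain_and_maybe_promote()
```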
In a true machine-learned system, the whole process can be automated, and
the model updated in real-time for immediate use. It depends on the degree to
which the humans using the system accept and trust the software to be accurate
and of high quality. And of course, it also depends on the domain in which the
software operates. If safety concerns are an issue, human-in-the-loop decision
making is critical for oversight to ensure lives are not lost, for example with
medical applications or airplane safety. If you are talking about low-risk
applications, then there is less of a need for concern.
Supervised Learning
Supervised Machine Learning works on labeled data sets with information classified in advance to ensure a predetermined outcome. This is what we most commonly think of when talking about machine learning. The model’s output can be compared to the actual labeled results for testing, but that leads to the danger of “overfitting.” A model that is too closely tied to its training data doesn’t handle variation in new data very well. It is too strict in its definitions and will miss edge cases and emerging trends. This is also where bias in models comes into play, leading us to the unsupervised approach.
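A quick way to see overfitting is to compare accuracy on the training data with accuracy on data the model has never seen. In this sketch an unconstrained decision tree, an illustrative choice that memorizes easily, shows the telltale gap.

```python
# A small sketch of spotting overfitting in a supervised model: compare
# accuracy on the training data with accuracy on unseen data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)   # typically near 1.0 (memorized)
test_acc = model.score(X_test, y_test)      # noticeably lower on unseen data
print(f"train accuracy {train_acc:.2f} vs. test accuracy {test_acc:.2f}")
# A large gap between the two numbers is the classic sign of overfitting.
```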
Unsupervised Learning
Unsupervised Learning starts by ingesting unlabeled data, avoiding the cost
and time required to tag the training data. It requires algorithms to extract
“features” or interesting elements such as nouns, phrases, and topics of
interest. The algorithms then sort and classify the chunks of text into
patterns that occur with varying degrees of frequency.
This approach is less about decisions and predictions and more about identifying patterns that humans would miss because the volume of data being handled is just so vast. Computers can process huge amounts of data far more efficiently than a team of humans could ever hope to tag and label that same data set. Spam filtering comes to mind in this regard, and the approach is popular in cybersecurity.
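As a small illustration, here is an unsupervised sketch that extracts TF-IDF features from unlabeled text and lets a clustering algorithm group the documents on its own. The tiny corpus and the choice of two clusters are assumptions for demonstration only.

```python
# A sketch of unsupervised learning on unlabeled text: extract features
# with TF-IDF and group documents into clusters without any labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "urgent: your account password must be verified now",
    "win a free prize, click this link immediately",
    "meeting notes from the tuesday project review",
    "agenda for next week's project planning session",
]

# No labels are provided; the algorithm finds the structure on its own.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for doc, cluster in zip(docs, clusters):
    print(cluster, doc)
# Spam-like messages tend to land in one cluster, work messages in the other.
```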
Reinforcement Learning
Reinforcement Learning can be summed up as the “trial and error” approach.
While it is somewhat similar to Supervised Learning, there is no sample set of
training data involved. Rather, it attempts to map the best decision by trying
a series of answers and recording the success or failure rate over time. It
must know the state of the environment it is operating in to succeed. This
approach is great for game theory applications.
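Here is a toy sketch of that trial-and-error loop: tabular Q-learning on a made-up five-cell corridor where the agent is rewarded only for reaching the far end. Every number in it (learning rate, rewards, episode count) is an illustrative assumption.

```python
# A toy sketch of reinforcement learning as "trial and error": tabular
# Q-learning on a 5-cell corridor with a reward only at the rightmost cell.
import random

N_STATES = 5            # cells 0..4; reaching cell 4 ends the episode
ACTIONS = [-1, +1]      # step left or right
q_table = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Trial and error: sometimes explore a random action, otherwise exploit
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q_table[(state, a)])

        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0

        # Record success/failure by nudging the value of this state-action pair
        best_next = max(q_table[(next_state, a)] for a in ACTIONS)
        q_table[(state, action)] += alpha * (reward + gamma * best_next - q_table[(state, action)])
        state = next_state

# After training, the learned policy should step right (+1) from every cell.
print({s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)})
```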
Deep Learning
Deep Learning frequently involves artificial neural networks. The process requires vast amounts of data to work with because its design mimics the learning processes of the human brain. Data passes through multiple layers of calculations, some visible and others hidden, with weights and biases built in. The programmers who create each unique neural network are looking to achieve a particular outcome and put the biases in to craft and direct the results toward a targeted goal. The learning process for these models can be unsupervised or semi-supervised, meaning the data can have some tagging or metadata applied to it before being ingested.
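To make the “layers of calculations” idea concrete, here is a toy forward pass through a tiny network built with NumPy, with weights and biases at each layer. The layer sizes and random values are placeholders; real deep learning uses dedicated frameworks and vastly more data.

```python
# A toy sketch of the "multiple layers of calculations" idea: a tiny
# feed-forward network built with NumPy, using random placeholder weights.
import numpy as np

rng = np.random.default_rng(0)

def layer(inputs, n_outputs):
    """One layer: multiply by weights, add a bias, apply an activation."""
    weights = rng.normal(size=(inputs.shape[-1], n_outputs))
    bias = rng.normal(size=n_outputs)
    return np.maximum(0, inputs @ weights + bias)   # ReLU activation

x = rng.normal(size=(1, 4))       # one input example with 4 features
hidden1 = layer(x, 8)             # first hidden layer of calculations
hidden2 = layer(hidden1, 8)       # second hidden layer
output = layer(hidden2, 1)        # final output

print(output)
# Training would adjust the weights and biases so the output matches the goal.
```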
Applications for Deep Learning include computer vision, speech recognition, self-driving cars, and Natural Language Generation (NLG). Many of these innovations are critical in the field of applied robotics and space exploration. Other use cases are digital assistants, chatbots, medical image analysis, fraud detection, and cybersecurity.
A Cautionary Note
Machine Learning is already impacting our world today and how we live, as seen in the examples noted above. It’s not just Siri and Alexa making our lives easier. Protecting our increasingly digital lives from cyber terrorism, identity theft, and malicious actors is one way data scientists work in the background to keep consumers from losing their digital right to privacy. We need an Internet Bill of Individual Data Rights.
We are well into a new era. It’s no longer the Age of the Internet. It’s the Age of Digital Data, where a person’s virtual identity is just as important to safeguard as their physical paper trail, if not more so. Increasingly, everything is about who owns the data and where it is housed. Who has access, and who is monetizing your digital footprint? It has long been said of companies like Facebook, Amazon, and Apple that you are the product, because they use your data, data they own by virtue of their terms of service, to create revenue. With machine learning techniques this is becoming more true than ever before. Be wise, be safe. Every click you take, every Like you make, they’re watching you.
It’s called Predictive Analytics, another way to make even more revenue off of your data, and a topic for a future post.
S Bolding—Copyright © 2021 · Boldingbroke.com