Behind all of the buzz about NLP is Language. We wouldn’t have text-based communications, intermediated by machines, without two people using a common method of transferring ideas to one another. How did all of this arise? In How Language Works in a Nutshell, we examined the surface issues of social contracts, marketplace dynamics, and tribal interactions. In this post, I’ll take a deeper look at shifts in language, the micro and macro forces at play, and how to adjust and account for those elements. Some refer to this as Evolutionary Linguistics, which relies heavily on biology and comes across as rather Darwinian. This approach is based on a 19th-century understanding of language, and it is biased. For example, the authors of this entry state that there is no archaeological trace of early human language. This is false. Much of the research can be found in Sociology and Anthropology studies. In addition, there is a new and emerging trend in Linguistics, where Archaeo-linguistics seeks to combine archaeology and linguistics into a blended greenfield approach.
An interesting book from 2007 by anthropologist David W. Anthony, The Horse,
the Wheel, and Language, presents primarily archaeological findings about the
Kurgan hypothesis, and takes the position that language arises and evolves in
parallel with technical innovations: in this case, the inventions surrounding
the domestication of horses on the Sarmatian plains north of the Caspian Sea.
(This area is approximately the equivalent of modern-day Ukraine.) Adding the
invention of the wheel, which created a mobilized society in the Bronze Age,
this theory of evolutionary linguistics takes on the origins of
Proto-Indo-European (PIE) by means of archaeology and evolutionary biology,
and the book spends the greater portion of its length examining middens and
pottery shards.
Imagine, if you will, a Bronze Age innovator who decides to
stop eating horses and instead domesticates one of them. This enterprising
individual inserts a stick into the mouth of the horse, and puts two ropes on
the ends, to make the horse follow his guidance. That eventually leads to the
invention of a durable bit. Soon you have a set of traces connecting a pallet
of your worldly possessions to the sides of the horse, so it pulls your goods
to and from the winter camps. You no longer need to pull the sledge yourself.
Now someone else comes along and invents the wheel, turning that flat pallet
dragging along the ground into a cart that moves faster. Imagine the advantages
you have over your neighbors. The horse is doing the work of men.
There are several inventions here to take note of. One
technology builds on another. First, the idea of a pallet to stack food and
possessions on, instead of carrying it on your back or in your arms. Then the
horse or the wheel to make the transportation of items faster, easier. Now
instead of a cart, a chariot for war. And so it goes, as history attests.
But more importantly to our story of language evolution, our
clever inventor decides to trade copies of his bit, wheel, and other
improvements to his fellow tribesmen. We now have business that exchanges goods
for technology innovations. Perhaps a few sacks of grain for this new thing
they decide to call a bit. Or two cattle for a set of those so-called wheels.
As new inventions emerge, words are created to label that thing over there
versus this thing here. Pointing just doesn’t suffice.
Now imagine, if you will, that a neighboring tribe sees the
increased mobility, the speed and advantages of fighting off of the horse’s
back instead of on foot. One can easily see that there are two choices. The
tribes will align as allies and exchange technology and goods as friends. Alternatively,
one of the tribes decides that might makes right and goes to war against the other
tribe, and whoever wins incorporates the other tribe as slaves into their
society. The rise of wealth, the need to have a common medium of communication,
the desire to safely buy and sell possessions and crops all lead to the rise of
marketplaces: common meeting grounds where people must talk to each other in
order to achieve the desired outcomes. Soon, instead of a barter system, a
token is found to equate value to goods. And this is the rise of money.
Whatever a society values the most becomes an easy medium of value exchange. Is
it gold, beads, shells, or simply a piece of paper that promises there’s gold
behind it somewhere in a bank?
The gods must have their tithes and the king his taxes: Not
only to keep this social construct of the marketplace supported and protected,
but also to maintain their primacy of power over the people who gather to
exchange goods and ideas. It all needs a warrior class to guard against
invading neighbors. Authority is always based on power and money. Following the
money, the rise of a military state almost seems inevitable. Protection rackets
are not just for the mafia.
One can easily discern the causal links between technology,
commerce, and language development. Example: Google is a new noun and verb
based on technology shifts.
Shifts in Language
Language, while a sociocultural construct, is not a
constant. Definitions change, words drop out of popularity and, as we see, are
subject to the forces of history. You only need to look at English to know
that a speaker of Old English would have no clue what today’s Queen’s English
is conveying. Researchers refer to the concept of Language Shift as a
large-scale phenomenon, where a population changes from using one language to
another. But what are the forces that lead up to such a radical shift?
Realizing that the British Isles have been invaded and
conquered many times, by sundry Nordic groups from the far north and by
neighboring France (creating Anglo-Norman
in the 11th century), it is self-evident that Old English, primarily a
Germanic language, would be endangered and die out. Indeed, OE, or Anglo-Saxon,
was itself the language of an invading culture, brought over in the mid-5th
century. It replaced the
native Celtic languages. The dynamics of language communities demand a certain
amount of maintenance and care if a mother tongue is to survive and overcome
historic circumstances. Survival of language is why dictionaries exist: to
codify spelling, definition, etymology, and variants of words. Example: the Académie
Française in the 17th century forcing language standards on
publications and teaching institutions, and attempting to outlaw local dialects.
The progression of any type of speech within a new context
is characterized by migration, infiltration, or diffusion. When a whole speech
community moves to a new location, that group of people tends to cling to their
language, halting change for a time. Think of Québécois French, where a colony
tried to keep its connection to the old world by forcing the next generation to
maintain 17th century colloquialisms in the transmission of language
from the older generation. Then after that original set of colonists had died
off, the language began to change again, borrowing from the surrounding native
tribes, inventing new words for the discoveries they made in the conquest of
the continent. A variant, or creole, is created for that community, causing a
branching of the mother tongue in a new direction. Another New World example is
of course American vs. British English. Or Brazilian Portuguese vs. European
Portuguese. Mexican vs. Castilian vs. Andalusian and so forth. Spanish
has numerous dialects due to Spain’s colonization of many parts of the world
in the 16th century and onward.
War (infiltration) is another factor: forcing a conquered
people to adopt the language and culture of the victors, a sort of cultural
assimilation technique. Here a great example is the Russification efforts of
Soviet-era policies, where native language and songs were outlawed, people from
Russia were forcibly relocated to populate the territories (or encouraged to
settle there), and schools were banned from teaching literature and history that
might glorify the original regime. This happened in Estonia, Latvia, Lithuania, and
other post-WWII mid-European states like Ukraine. In reality, the policy of
forced Russification started under Tsar Alexander II in the 1860s and even earlier in
medieval times. It was most successful in Belarus.
Diffusion is the cultural spread of a language. Here, a more
modern example is English spreading through pop culture such as movies, books,
and the internet. Another example is the popularity of anime and manga helping
to promote the learning of Japanese.
Micro and macro forces at play
The sociological forces discussed above constitute the
obvious macro influencers for language shift. What are the micro forces?
Literacy is surely one of them. So is borrowing terminology to expand lexically
and grammatically, along with the individual’s choices that lead to localized
slang. It is through an individual’s speech behavior that language is either
maintained or lost in the family context, and hence in the broader society.
Slang
Trade slang is a particularly interesting case to examine.
When Dutch traders arrived in Indonesia in the late 16th century, they
surely did not speak the local language. Stepping off the ships, to the locals,
they must have appeared as aliens, unintelligible, and Oh So White. ‘Do we kill
them? Do we approach with caution? Do we try to make first contact?’ So many conflicting
thoughts must have gone through the minds of each side. ‘What have they got
that we want?’ This scene plays out repeatedly throughout history.
The need to have common terms arises. The forces of the
global marketplace win over the sword and/or spear. Of course, by the end of
the relationship, the sword wins after all. The need by the Dutch to keep the
British at bay, let alone the Spanish, would dictate having forts and closed
ports to protect their monopoly. Soon it means taking advantage of and
controlling the local population. A Dutch monopoly on export is paramount. But
back to pidgin, which usually evolves around the domains of trade and labor.
A jargon, or extremely limited set of vocabulary terms,
enables a basic form of communication between speakers of two mutually
unintelligible languages. It is often accompanied by hand signals and gestures.
Sometimes an imperfect grasp, but still some knowledge, of the other’s native
vernacular is required. There is a double illusion created when, for example,
the French think they’re speaking an Indian language, and the natives believe
they are speaking good French. The conversations result in slang developing.
A clear example is Russenorsk, which arose in Northern
Norway and was used by Russian merchants, Norwegian fishermen, and the like. The
first historic instance is from a 1785 lawsuit, and the last example shows it
being stamped out in WWI. It was a seasonal trade language for the summer
months, and never established itself as a creole with native speakers. Another
example is the Lingua Franca (Sabir) of the Middle Ages, which grew up
post-Crusades and dominated commerce in the Mediterranean, Black, and Irish seas.
Trade slang exists today, most notably on the trading floors
of major banks, where a specific vocabulary and shorthand grammar is used in
combination with hand gestures.
Phonology
Strong arguments are made on either side of the aisle about
phonology and the forced pronunciation rules governing ‘proper speech.’ How
words are pronounced influences spelling via errors in orthography. Consider a
few examples that are now accepted regional dialectal forms. ‘Y’all’ instead of
‘you’ for the second person plural: it started out as saying “you all” to
indicate a group of people, as distinct from the singular “you.” Then at some
point, the error becomes the new standard. ‘Can’t’ instead of ‘cannot’. ‘Thru’
instead of ‘through’. And a classic favorite, ‘Halloween’ instead of ‘All
Hallows’ Eve’. Now let’s look at a function shift from adjective to adverb. The
standard adverb is “well,” as in “I am doing well.” In the past twenty years or
so it has become fashionable to say “I’m doing good,” or just “Good,” as a
response to “How are you?” Many will look at you strangely if you respond “Well”
instead of “Good.” A proper English teacher of prior generations would not just
cringe, but flunk any student who speaks thusly. (And sound pompous for doing so.)
Adjusting and Accounting for These Elements
As always, why should we care about these issues of language
and grammar in the context of NLP? Firstly, language changes, and so models
must change to reflect the current state of the culture. After all, a model is
just a reflection of the data it ingests. And each domain within an area of
knowledge will have shifting patterns of language. There is “Banking English,”
“Healthcare English,” “Legal English,” and “Academic English” to contend with,
let alone “East Coast English,” “West Coast English,” “Street English,” “Australian
English,” and so forth. Each one requires an understanding or at least an
awareness of the culture that created it.
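One rough way to see that a model's training language has drifted from the language it now meets is to measure how many incoming words it has never seen. The sketch below is only an illustration under simple assumptions: whitespace tokenization, and a hypothetical helper named oov_rate that is not taken from any library. The two sample sentences are invented, echoing the "Google" example earlier in this post.

```python
def oov_rate(reference_text, new_text):
    """Share of tokens in new_text that never occur in reference_text.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split, which is far cruder than real NLP tooling.
    """
    known = set(reference_text.lower().split())
    new_tokens = new_text.lower().split()
    if not new_tokens:
        return 0.0
    unseen = [token for token in new_tokens if token not in known]
    return len(unseen) / len(new_tokens)


# Invented sample data: "older" usage versus "newer" usage.
older = "please search the web for the answer"
newer = "please google the answer"

# One of the four newer tokens ("google") is unknown to the older text.
print(oov_rate(older, newer))  # 0.25
```

A rising out-of-vocabulary rate between the text a model was built on and the text it is asked to handle is one hint that the model needs refreshing, whether the drift comes from time, from a new domain, or from a regional variety.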
All of this variation leads to the “natural” part of NLP. The discipline is not necessarily trying to reach a formal understanding of the rules, but rather a practical understanding of the usage. The challenge is not just to count nouns and the frequency of words in a text. It’s to understand the interrelated parts of speech that cause meaning to arise from an interaction between two individuals. That is a much more complicated challenge than TF-IDF (term frequency–inverse document frequency) or other statistical approaches. To truly perform NLP at a level that leads to meaning and intent, a data scientist must understand how language works. If practitioners truly love languages and want to understand them, they must study pure
linguistics as well as computational linguistics, the structure of speech as
well as the measurement and tallying of speech.
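To make the limits of counting concrete, here is a minimal TF-IDF sketch in plain Python. It assumes one common textbook weighting (term frequency as count divided by document length, inverse document frequency as the natural log of documents divided by documents containing the term); libraries use other variants. The function names and the three sample sentences are made up for illustration.

```python
import math
from collections import Counter

PUNCTUATION = ".,!?\"'"


def tokenize(text):
    # Lowercase, split on whitespace, and trim basic punctuation.
    words = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in words if word]


def tf_idf(docs):
    """Return one {term: score} dict per document (illustrative variant)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    "The dog bit the man.",
    "The man bit the dog.",
    "The horse pulled the cart.",
]
scores = tf_idf(docs)

# Opposite meanings, identical scores: word order is thrown away.
print(scores[0] == scores[1])  # True
```

The first two sentences describe opposite events, yet they receive exactly the same scores, because the method only tallies which words occur and how often. Who did what to whom lives in the structure of the sentence, which is the part that counting cannot see.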