

February 1


Me Me Me

I have not really had a break in a long, long time. I took some time off in August and was looking forward to a long break over the holidays. But I did not get my flu shot and was really sick from basically the first day of my ~2 week vacation straight through the holidays, and then spent another ~2 weeks recovering.

I took this next week off. Work is in a more stable place and the teams are doing well, so I can get a nice, well-deserved break.


I recently read the original transformer paper, "Attention Is All You Need". When I was studying computer science, this research was still fairly new. There was no public release of something at scale until GPT-2 in 2019, and I feel like it was not until a few years ago, after the crypto wave, that the AI discourse really took off. I have used GitHub Copilot since the beta, and I still think it is the most valuable use of LLMs in the industry right now, but there is a lot of potential.

The discourse around AI and chips is reaching a boiling point, so I figured I would read a bit more into what it's all about. Granted, I know basically all of the fundamentals needed to understand the research, but if you have done a little bit of computer science and calculus you can probably understand it too, without going all the way to nonlinear programming.

The transformer paper is interesting, and there is actually a lot of overlap between post-Heideggerian thought, analytic philosophy, and post-Sartre / post-WW2 French theory on language and meaning.

If I had to explain it simply: the model is actually made of a few submodels, like different parts of the brain. Different parts of the brain excel at different types of processing and pattern matching. However, the brain grows, gets feedback, and learns as a whole, not through individual parts; the brain is a team. Similarly, the transformer architecture, at least in the original paper (I have not yet read subsequent papers on transformers), has different parts that develop to specialize while they are all judged together.

The transformer architecture has the submodels start in randomized states, which allows for specialization. If they started in the same state, they would find the same patterns, be judged the same, and learn the same. The random individual starting points, combined with feedback given to the whole, help mimic the way brains determine patterns and meaning. Although I am not an evolutionary biologist, I do know that certain parts of the brain developed long before other parts, and not all of them work in unison. Sometimes certain parts of our brain are dominant, and different parts develop at different rates.
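To make that concrete, here is a toy sketch of multi-head attention in NumPy, reading the paper's attention heads as the "submodels". This is my own illustration, not the paper's code, and the sizes and names (`W_q`, `d_model`, and so on) are just placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention from the paper: softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

d_model, n_heads, seq_len = 64, 8, 10
d_k = d_model // n_heads

x = rng.normal(size=(seq_len, d_model))  # one "sentence" of token embeddings

# Each head is a "submodel" with its own randomly initialized projections.
# If every head started from identical weights, they would all compute the
# same thing, get the same gradient, and never specialize.
heads = []
for _ in range(n_heads):
    W_q = rng.normal(size=(d_model, d_k)) * 0.1
    W_k = rng.normal(size=(d_model, d_k)) * 0.1
    W_v = rng.normal(size=(d_model, d_k)) * 0.1
    heads.append(attention(x @ W_q, x @ W_k, x @ W_v))

# The heads are concatenated and projected as one unit, so whatever loss
# sits on top judges them together, as a whole, like the brain-as-a-team.
W_o = rng.normal(size=(d_model, d_model)) * 0.1
output = np.concatenate(heads, axis=-1) @ W_o
print(output.shape)  # (10, 64)
```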

Still, the transformer architecture does really well at modeling (which is the goal of math and computer science!) how interpretation of text (and even images) works.

For comparison, let us use Claude (Anthropic's LLM) to show some of the differences between architectures. This is how Recurrent Neural Networks (RNNs) work:

Imagine you're reading a book - as you read each word, your understanding depends not just on that word, but on all the words you've read before it. This is exactly how RNNs work. They process sequences by maintaining a "memory" of previous inputs through a simple feedback loop. At each step, the network combines the current input with its previous state to produce an output and update its state. However, basic RNNs face a significant challenge known as the vanishing gradient problem. Think of trying to remember the plot of a book you started reading weeks ago - the further back you go, the hazier the details become. Similarly, RNNs struggle to maintain information over long sequences because the gradient (the signal used to update the network's weights) becomes increasingly small as it's propagated backward through time.
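To make the feedback loop concrete, here is a bare-bones RNN cell in NumPy. This is my own sketch, not any library's implementation; the last line shows the vanishing gradient numerically, since backpropagating through many steps keeps multiplying by the recurrent weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hidden = 8, 16

W_xh = rng.normal(size=(d_in, d_hidden)) * 0.1      # input -> hidden
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1  # hidden -> hidden (the loop)

def rnn_step(x_t, h_prev):
    # The new state mixes the current word with the memory of all prior words.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh)

h = np.zeros(d_hidden)
sentence = rng.normal(size=(100, d_in))  # a toy 100-"word" sequence
for x_t in sentence:
    h = rnn_step(x_t, h)

# The vanishing gradient in one line: the backward signal through t steps is
# multiplied by W_hh (and tanh derivatives) t times, so the influence of
# early words fades roughly like (largest singular value of W_hh) ** t.
print(np.linalg.svd(W_hh, compute_uv=False)[0] ** 100)  # ~0
```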

To overcome this memory problem, researchers developed Long Short-Term Memory (LSTM) networks, which use gates that decide when to "forget" certain information and when to keep it. This helped with the memory problem somewhat, but it also added more complexity to the architecture. Gated Recurrent Units (GRUs) later simplified the model from three gates to two.
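Here is a sketch of the gating idea, following the standard LSTM and GRU formulations but with biases omitted and toy weights, so take it as an illustration rather than a faithful implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h = 8, 16

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate (biases omitted to keep the sketch short).
W = {name: rng.normal(size=(d_in + d_h, d_h)) * 0.1
     for name in ("f", "i", "o", "c", "u", "r", "h")}

def lstm_step(x, h, c):
    z = np.concatenate([x, h])
    f = sigmoid(z @ W["f"])  # forget gate: how much old memory to keep
    i = sigmoid(z @ W["i"])  # input gate: how much new information to write
    o = sigmoid(z @ W["o"])  # output gate: how much memory to reveal
    c = f * c + i * np.tanh(z @ W["c"])
    return o * np.tanh(c), c

def gru_step(x, h):
    z = np.concatenate([x, h])
    u = sigmoid(z @ W["u"])  # update gate: blend old state with candidate
    r = sigmoid(z @ W["r"])  # reset gate: how much of the past to consult
    h_cand = np.tanh(np.concatenate([x, r * h]) @ W["h"])
    return (1 - u) * h + u * h_cand

x = rng.normal(size=d_in)
h_lstm, c = lstm_step(x, np.zeros(d_h), np.zeros(d_h))
h_gru = gru_step(x, np.zeros(d_h))
print(h_lstm.shape, h_gru.shape)  # (16,) (16,)
```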

A different family of models is Convolutional Neural Networks (CNNs). Again, here is Claude's description of how they work:

The core building block of a CNN is the convolution operation. Imagine sliding a small window (called a kernel or filter) across an image. At each position, this window looks for a specific pattern - it might be searching for edges, curves, or more complex features. It's similar to how an art teacher might use a viewfinder (a small square frame) to help students focus on specific parts of a scene while drawing.
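The sliding window fits in a few lines of NumPy. This is a hand-rolled sketch assuming a single-channel image and one fixed 3x3 kernel; in a real CNN the kernel values are learned, not chosen:

```python
import numpy as np

rng = np.random.default_rng(3)
image = rng.normal(size=(28, 28))  # a toy grayscale "image"

# A classic vertical-edge detector: bright-to-dark transitions light it up.
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

kh, kw = kernel.shape
out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))

# Slide the "viewfinder" across the image: each output pixel measures how
# strongly the patch under the window matches the pattern in the kernel.
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)

print(out.shape)  # (26, 26)
```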

However, this is all artificial. The viewfinder is an educational tool. Similarly with RNNs, LSTMs, and GRUs: we would not say this is how humans actually "learn". That does not discount the models' value, though; a model is meant to solve a problem.

That being said ...

If you feel that finding out what something is must entail investigation of the world rather than of language, perhaps you are imagining a situation like finding out what somebody's name and address are, or what the contents of a will or a bottle are, or whether frogs eat butterflies. But now imagine that you are in your armchair reading a book of reminiscences and come across the word "umiak". You reach for your dictionary and look it up. Now what did you do? Find out what "umiak" means, or find out what an umiak is? But how could we have discovered something about the world by hunting in the dictionary? If this seems surprising, perhaps it is because we forget that we learn language and learn the world together, that they become elaborated and distorted together, and in the same places. We may also be forgetting how elaborate a process the learning is. We tend to take what a native speaker does when he looks up a noun in a dictionary as the characteristic process of learning language. (As, in what has become a less forgivable tendency, we take naming as the fundamental source of meaning.) But it is merely the end point in the process of learning the word. When we turned to the dictionary for "umiak" we already knew everything about the word, as it were, but its combination: we knew what a noun is and how to name an object and how to look up a word and what boats are and what an Eskimo is. We were all prepared for that umiak. What seemed like finding the world in a dictionary was really a case of bringing the world to the dictionary. We had the world with us all the time, in that armchair; but we felt the weight of it only when we felt a lack in it. Sometimes we will need to bring the dictionary to the world. — Must We Mean What We Say?, Stanley Cavell

How we learn, and the relation of language to the world, has been the question, as I have said, since Heidegger, Russell/Moore/Wittgenstein, and the French theorists (for what I mean when I say this, see here and here). I should also probably include the American pragmatists, especially folks like Rorty and Brandom who started in analytic philosophy and were drawn toward pragmatism.

There is still much to explore in the relations between philosophy, linguistics, biology and neuroscience, computer science, and math. Or, as my wife said when she watched me write this sentence, "that's cognitive science" (her major in college).


The transformer architecture raises age-old questions about meaning, existence, language, thought, and what it means to be human. The AI research here is not dissimilar to what philosophers have been doing for a while, especially the groups mentioned above, in the "linguistic turn" in philosophy, be it in the analytic or the continental/interpretive tradition. For Heidegger, language is the house of Being. The poet is the one who not only uses language but goes beyond language and lets Being shine forth. Analytic philosophy similarly started out mainly as a critique of the contemporary British idealists and figures like Meinong; to people like Russell and Moore, most, if not all, problems of philosophy were problems arising from the misuse of language. Wittgenstein took this even further in his Philosophical Investigations, which was one of the main influences on Cavell.

However, the two traditions are very different. The interpretive tradition, say, Kierkegaard, Heidegger, or Derrida, leaned more towards a sort of ineffability of Being, a mysticism. There was something that language could not capture, and meaning or meaningfulness could not be reduced to what Heidegger calls "intelligibility". The work of analytic philosophy, on the other hand, goes further than, say, Kant, by reducing the problem space of philosophy to logic, or, even further, by holding that the only job of the philosopher is to guard against the misuse of ordinary language and that philosophy does not really exist. Similarly, certain people in the interpretive tradition have advocated for a type of non-philosophy.

The artificial intelligence models, especially some of the fundamental approaches like the transformer architecture, eerily mirror certain theories about language and meaning in human life: that language and meaning are just an endless series of patterns with no root or core or base. If so, we are just the product of a large evolutionary feedback loop, like the LLMs, nothing more; LLMs are just a shallower speedrun of human intelligence, but in theory similar enough. This is not really anything new. The same problem has occurred many times in history, for example, Darwin's impact on religion and philosophy.

Yet something keeps drawing us back, as philosophers, to the sense that there is something more to this.

If there were no eternal consciousness in a man, if at the foundation of all there lay only a wildly seething power which writhing with obscure passions produced everything that is great and everything that is insignificant, if a bottomless void never satiated lay hidden beneath all– what then would life be but despair? If such were the case, if there were no sacred bond which united mankind, if one generation arose after another like the leafage in the forest, if the one generation replaced the other like the song of birds in the forest, if the human race passed through the world as the ship goes through the sea, like the wind through the desert, a thoughtless and fruitless activity, if an eternal oblivion were always lurking hungrily for its prey and there was no power strong enough to wrest it from its maw–how empty then and comfortless life would be! — Fear and Trembling, Søren Kierkegaard.

This is exactly the turn that Robert Pippin has made in the later part of his life: abandoning Kantian-Hegelian rationality and normativity for the Heideggerian pursuit of the question of the meaning of Being. It is not that this type of philosophy is wrong for folks like Pippin, but that it misses something more fundamental, or, as Heidegger says, primordial. That primordiality may just be a mirage, but I think the problem of Being as intelligibility versus Heidegger's "question of the meaning of Being" is the most interesting debate going on in philosophy, sometimes implicitly and under the surface.