One of the most peculiar aspects of existing as a sensory being is the ability to recognize an object the moment it is sensed. For some human senses, this is simple enough. A touch or a scent activates the related sense-memory and you immediately know you have touched a blanket or smelled a cheese pizza. Such sensory experiences are largely invariant and so need only trigger a specific sense-memory. That is to say, a blanket is almost always easy to identify by touch; food is usually easy to distinguish by scent. When it comes to sound, things get a little trickier. A non-language sound--say, an ambulance siren or a fork clattering on the floor--tends to be recognized quite easily, though it might take a moment to place a sound if it is muffled, distant, or competing with many other noises at the same time.
For sounds that carry language, another layer of complexity is added. It's not enough to simply recognize what the sound is--a human voice. You will also want to recognize who said it, and what they said. Looking at a person while they speak is one way of keeping your brain on track about who is speaking: their lips move and match the sounds you're hearing. Recognizing words as they are spoken requires a level of sensory sophistication beyond simply activating the correct sense-memory. There's also a form of memory chaining involved, in that particular words tend to follow other words, so even if you cannot fully distinguish every single word being spoken to you, so long as you recognize the vast majority, you can likely infer the complete meaning. Wh-n -t c-m-s t- r--d-ng, y-- d-n't -v-n n--d v-w-ls; y--r br--n c-n -s--lly t-ll wh-t th- w-rds -r- fr-m th- w-rd's -v-r-ll -pp--r-nc-.
In fact, this approaches the point I'm actually here to discuss: the notion of sensory invariance.
Picture an object in your head. Make it a simple one, like a bowl. If you hold the bowl at different angles, you can still recognize it as a bowl. You can tell it's a bowl whether it's sitting on its base in the middle of the floor or turned upside down in a dish rack. How does this work? People don't have complete 360-degree scans of every single object they're able to recognize in their heads, do they?
Ironically, much of what we have learned about the workings of human memory comes from attempts to replicate it using neural networks in computers. The idea that you can recognize a stimulus--an input--even when it is presented in a way you haven't experienced before is known as invariance. Human memories clearly contain some kind of invariant representation of physical objects, but exactly how it works is not known. Consider, though, all of the characteristics that must be accounted for in order to have an invariant representation of an object: you must be able to compensate for differences in viewing angle, orientation, distance, lighting, coloration, and how much of the object is visible at all.
Humans are evidently good at compensating for all of these. We typically recognize objects by shape first. It's not uncommon to teach children to recognize animals, for example, by showing them full-color illustrations of the animals, then testing their recall by showing only outlines or silhouettes. Intellectually, we know that just about any object is far more complex than its outline, yet the outline is often the only thing we need for our brains to tell us, "yup, that's a cow."
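The simplest forms of invariance can be sketched in code. The Python below is purely illustrative (real vision systems learn far richer invariances, covering rotation, lighting, and occlusion), but it shows the core principle: if we normalize a shape's outline points for position and size, two squares drawn at different locations and scales collapse into one canonical form.

```python
# Illustrative sketch only: normalizing a shape's outline points so that
# position and size no longer matter. Two squares drawn at different
# locations and scales collapse to the same canonical form.

def normalize(points):
    """Center a shape at the origin and scale it to unit size."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)   # centroid
    centered = [(x - cx, y - cy) for x, y in points]
    scale = max(max(abs(x), abs(y)) for x, y in centered) or 1.0
    return [(x / scale, y / scale) for x, y in centered]

small_square = [(0, 0), (0, 2), (2, 2), (2, 0)]
big_square = [(10, 10), (10, 18), (18, 18), (18, 10)]
# Both normalize to the same four corner points: a "square" is recognized
# regardless of where it sits or how large it is.
```

Mapping many varied inputs onto one stored canonical form is, in miniature, what an invariant representation does.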
Most importantly, an outline of an object is a type of pattern. Computerized recognition algorithms operate quite similarly, learning an invariant version of an object's appearance by distilling simple patterns by which to recognize it. Unlike human memories, however, these digital representations are just two-dimensional images, often stored sparsely so as to highlight the details most important for future recall. How does a computer know which details are relevant? The process is, again, modeled after concepts about how human brains work.
The computer "pays attention," which is to say that it examines a collection of labeled data and attempts to glean the most valuable insights from it in order to develop an invariant representation of what it has observed. Take as a trivial example a photograph of a dog standing on a table. It might be labeled, "a photo of a dog standing on a table." On the first pass, a learning model would not know what any of those words mean, but it would keep the image and the associated words in memory to compare against additional examples. So, suppose the next input were a photo of a dog lying down on a sidewalk. Since the two images differ in all respects except that they both contain "a photo of a dog," if the model were to learn anything, it would be the first inkling of what a "photo of a dog" is. If the next input were a photo of a cat, and in fact if every successive input were "a photo" of some kind, one of the strongest associations the model would make is that everything being fed to it is a photo.
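A toy sketch of this idea, purely for illustration (a real model learns statistical associations between image features and words, not literal word tallies, and the captions below are hypothetical): simply counting how often each word appears across a set of captions already reveals that "a photo of a" is common to everything, while words like "table" or "sidewalk" distinguish individual examples.

```python
# Purely illustrative: tally word frequencies across a handful of
# hypothetical captions. Words shared by every caption ("a", "photo")
# carry the least distinguishing information.
from collections import Counter

def word_counts(captions):
    counts = Counter()
    for caption in captions:
        counts.update(caption.lower().split())
    return counts

captions = [
    "a photo of a dog standing on a table",
    "a photo of a dog lying down on a sidewalk",
    "a photo of a cat sitting on a windowsill",
]
counts = word_counts(captions)
# "photo" appears in all three captions; "table" in only one.
```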
Does such a model know what a photo is? Not in any sense of conscious intelligence. It is simply an association now baked into the model. But this association has an ironic effect: if everything is a photo, then it doesn't matter what is or isn't a photo. It is a distinction without a difference. This means that the model will ultimately learn to ignore the phrase "a photo," as it imparts no useful information about what is or isn't important in its input.
In other words, attention is the sensitivity of a machine learning model to specific concepts. Extremely common words will receive low attention weights as they are unlikely to impart relevant information, whereas uncommon words will receive higher weights, since they tend to signify the most important aspects of an input.
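This weighting pattern closely resembles inverse document frequency (IDF), a classic information-retrieval technique, which makes for a concrete if simplified sketch. Note that attention weights in modern models are learned during training rather than computed by a formula like this; IDF is offered here only as an analogy, using the same hypothetical captions as before.

```python
# Sketch only: inverse document frequency (IDF). Words present in every
# caption get weight 0; rarer words get higher weights. Attention in
# real models is learned, not computed by a formula -- this is an analogy.
import math
from collections import Counter

def idf_weights(captions):
    n = len(captions)
    df = Counter()  # document frequency: how many captions contain each word
    for caption in captions:
        df.update(set(caption.lower().split()))
    return {word: math.log(n / count) for word, count in df.items()}

captions = [
    "a photo of a dog standing on a table",
    "a photo of a dog lying down on a sidewalk",
    "a photo of a cat sitting on a windowsill",
]
weights = idf_weights(captions)
# "photo" (in every caption) weighs 0; "table" (in only one) weighs most.
```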
Humans do not operate too dissimilarly. After all, if you are handed a photograph, you are unlikely to fixate on the fact that it is a photo; rather, you will focus on what the photo depicts. You would probably disregard background elements first, instead looking for objects of significance: people, animals, buildings, and so on. The background might be considered as an afterthought, and if you were to describe the image to a computer, you would likely describe the background last, too.
It is this conjunction of attention and invariance that allows us to recognize things immediately without getting bogged down in irrelevant details. Suppose you drive past a farm and see cows. Your first thought is likely that they are cows, not how many eyes or legs they have. Because they are cows, you would simply assume they have two eyes and four legs apiece, since in most cases that would be true.
But many animals have four legs and two eyes, don't they? How do you know a cow from a dog or a cat? How does a computer know the difference? This approaches a more complex facet of memory, both human and machine: context.
Think of a coworker you never see outside of work, or a relative you haven't seen in years. Imagine your reaction if you simply bumped into them at the grocery store. It would probably take you a moment to recognize them, wouldn't it? It's not because your brain suddenly failed at its job of remembering things, or even that social anxiety left you at a loss for words (though that certainly won't do you any favors). Rather, your memory has been trained to expect certain people to appear in certain contexts, and is confused when something different and unexpected happens. Your brain has to do more work to recognize someone you would ordinarily recognize instantly!
Computerized attention learns these contextual cues as well, essentially coming to associate concepts which often appear together, and likewise being prone to confusion or failure when concepts typically associated with one another are presented separately. Returning to the cow example: if you saw a cow in the middle of the street in a downtown area, you would likely be perplexed at least for a moment. You would expect to see a cow on a farm, out in the country. Seeing a cow somewhere else simply throws your memory for a loop. It may take a fraction of a second longer to recognize the cow because the usual contextual cues that help you know what a cow is--rural area, features of a farm--are missing, so your brain must fall back on other things it knows about cows, such as their size, shape, and coloration, in order to confirm with confidence that, although you are in Times Square, there is indeed a live cow wandering around. It wouldn't hurt if it made a "moo" or two, either.
There's one more property that emerges from this notion of representational invariance in memory, one essential for survival: prediction. Knowledge of what a cow is may not be that useful in and of itself, but what can you predict about a cow? Perhaps the first thing you might predict is that it won't harm you. Human reactions to other living creatures typically do involve that initial calculation: is this creature a threat to me? You won't spend minutes doing the mental math, of course. You know intuitively, almost immediately, whether it is safe to approach or whether you should be high-tailing it out of the situation. It is believed that some of this knowledge is encoded genetically, part of the toolset of basic survival information we're all born with. But in most cases, it is learned through experience and training. Just as with an encounter that defies your expected context, an encounter that violates your intuitive predictions is likely to leave you stumped for a moment or two until you can figure out how to respond. Imagine a cow that begins reciting lines of Shakespeare to you in the voice of Morgan Freeman. You probably didn't expect that.
If you laughed, you have exhibited perhaps the most strangely common human reaction to events which disrupt our constant stream of mental predictions: nervous laughter!