Humans--all beings with eyes, really--see by taking in light through a lens and processing it through brains, or at least nerve clusters. This produces an image that is coherent to the being in question. Beings with binocular vision, meaning the fields of view of their eyes overlap, also have depth perception and can thus determine the relative positions and proximity of objects. Even the compound eyes of insects have a limited form of this.
Computers, on the other hand, lack sense organs of any kind. They do not understand vision or visual processing, only binary code. Still, encoding an image for a computer's benefit is trivial, given a monitor to display it on. All you need is a grid of pixels and a color value for each one. Sophisticated compression algorithms can make the resulting files very small, which is good since a typical photograph could contain over 14 million pixels!
So, reconstructing an existing image isn't that hard. But creating a new image from scratch purely from calculations? As good as computers are at number crunching, this turns out to be an intensive endeavor. Real-time graphics add further constraints, since you need processes and formats simple enough and fast enough to allow for constant updates to the screen. Nobody likes low frame rates.
For applications with two-dimensional graphics, you can accomplish a lot with tiles and sprites. Breaking the screen up into smaller sections which are each updated independently is pretty efficient. Likewise, premade graphics that can be stamped anywhere on the screen--sprites, in other words--save the trouble of having to generate such graphics from scratch every time. From here, there are a few ways to keep screen updates smooth rather than jagged and unpleasant to the user. One is to simply prepare the entire image you want to display, then "blit" it (copy the whole block of pixels) to the screen all at once. This is effective but slow. You can make it a bit faster by updating only the sections of the screen that changed, which is fine for applications like word processors and other tools where updating the entire screen is usually unnecessary. It helps little with things like side-scrolling video games, however, where nearly the whole screen changes with every frame.
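As a rough illustration of stamping a sprite and keeping track of which part of the screen changed, here is a minimal Python sketch. The screen and sprite are plain grids of color values, and the name `blit` and its return value are assumptions made for this example, not any particular graphics library's API.

```python
def blit(screen, sprite, x, y):
    """Stamp sprite onto screen with its top-left corner at (x, y).

    Returns the region that changed as (x, y, width, height), clipped
    to the screen, so that only that region needs to be redrawn.
    """
    height, width = len(screen), len(screen[0])
    # Clip the sprite's rectangle to the edges of the screen.
    x0, y0 = max(x, 0), max(y, 0)
    x1 = min(x + len(sprite[0]), width)
    y1 = min(y + len(sprite), height)
    for row in range(y0, y1):
        for col in range(x0, x1):
            screen[row][col] = sprite[row - y][col - x]
    return (x0, y0, max(x1 - x0, 0), max(y1 - y0, 0))


# Hypothetical 8x4 screen filled with color 0, and a 2x2 sprite of color 7.
screen = [[0] * 8 for _ in range(4)]
sprite = [[7, 7], [7, 7]]
dirty = blit(screen, sprite, 3, 1)  # dirty is (3, 1, 2, 2)
```

A program that redraws only the returned rectangle is doing the "update only what changed" trick; when the whole screen scrolls, that rectangle grows to cover everything and the saving disappears.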
Another piece of technology makes this all a lot better: a frame buffer. It is essentially what it sounds like: a region of memory that holds a complete frame. Instead of having to push each frame to the screen as soon as it's done, you can draw frames into the buffer and have the display hardware read them out at a constant rate. This is fast and makes for a smooth experience for the user, whether they're playing a game or making a presentation. From the time computer systems began having real-time graphics in the first place, convenience features to make those graphics faster and better weren't far behind. A computer manufacturer could see their market share dwindle rapidly if they didn't incorporate the latest such features in a timely manner.
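One common way to use frame buffers is double buffering: draw into a hidden "back" buffer while the display shows the "front" one, then swap the two. The sketch below assumes that arrangement; the class and method names are invented for illustration and do not describe any specific machine.

```python
class DoubleBuffer:
    """Two frame buffers: one being displayed, one being drawn into."""

    def __init__(self, width, height, background=0):
        self.front = [[background] * width for _ in range(height)]
        self.back = [[background] * width for _ in range(height)]

    def draw_pixel(self, x, y, color):
        # Drawing never touches the frame currently on display.
        self.back[y][x] = color

    def swap(self):
        # The finished frame becomes visible all at once.
        self.front, self.back = self.back, self.front


buffers = DoubleBuffer(4, 2)
buffers.draw_pixel(1, 0, 9)
visible_before = buffers.front[0][1]  # still the background color
buffers.swap()
visible_after = buffers.front[0][1]   # now the newly drawn color
```

Because the viewer only ever sees completed frames, half-drawn images never reach the screen.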
This is all well and good for 2D graphics, but what about 3D? A computer obviously does not display graphics in three dimensions. What it does is concoct a 2D representation of a 3D scene, and it is the implied depth perception created by movement that generates the illusion that it is truly three-dimensional.
The tools that helped make 2D graphics fast and efficient were of little use here. If you were to treat a two-dimensional scene as if it had three dimensions, you might imagine a depth buffer (or "z-buffer") that records, for every pixel on the screen, how far away the nearest thing drawn there is. Every sprite can be thought of as a flat billboard that simply shows its single front face to the screen; it has no back side in any meaningful sense, unless it's just a mirror image of the front.
Tile updates aren't very helpful in this situation since it's likely most or all of the screen will be updated with each frame. The depth buffer is useful because nearer pixels replace farther ones no matter what order things are drawn in, so it is understood which objects are allowed to be in front of others, and which are obscured. However, working in three dimensions enables even more clever tricks.
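A depth buffer typically settles what is in front by storing, for every pixel, the depth of the nearest thing drawn there so far, and rejecting anything farther away. The Python sketch below assumes that smaller `z` means nearer to the viewer; the function names are hypothetical.

```python
import math


def make_buffers(width, height, background=" "):
    """Create a color buffer and a matching depth buffer."""
    color = [[background] * width for _ in range(height)]
    # Every pixel starts infinitely far away, so anything can cover it.
    depth = [[math.inf] * width for _ in range(height)]
    return color, depth


def plot(color, depth, x, y, z, value):
    """Write value at (x, y) only if z is nearer than what is stored."""
    if z < depth[y][x]:
        depth[y][x] = z
        color[y][x] = value
        return True
    return False


color, depth = make_buffers(3, 1)
plot(color, depth, 1, 0, z=2.0, value="near")
# The farther object arrives later but is rejected, so "near" stays visible.
far_was_drawn = plot(color, depth, 1, 0, z=5.0, value="far")
```

With this test in place, objects can be drawn in any order and the nearest one still wins at each pixel.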
The basic primitive of a three-dimensional graphics system is the vertex. Some might argue that it is the polygon, but the vertex is the true original. The simplest polygon--a triangle--is made up of three vertices. The positions of these vertices define the location and shape of the polygon. This is basic geometry, of course. But from those positions alone, it is possible to work out which polygons do not face the screen, or indeed which polygons are outside of the screen's field of view altogether. Remember that in the computer's memory is a complete representation of the scene at hand, as defined by vertices and the features attached to them, such as textures, lighting, and so on. Not all of that scene needs to be drawn--only what is visible to the user must be! And so we come to the culling: backface culling avoids drawing polygons that don't face the screen, and frustum culling avoids drawing polygons that are outside the viewing frustum, the truncated pyramid of space that the screen looks out on. You can often reduce the number of polygons you have to draw by half or more using these techniques, so they are well worth a little extra calculation.
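Both kinds of culling come down to a little arithmetic on vertex positions. The sketch below is one simple way to do each, under stated assumptions: a triangle counts as front-facing when its projected vertices wind counter-clockwise (with y pointing up), and the camera sits at the origin looking along +z. The function names and default parameters are invented for this example.

```python
import math


def is_back_facing(a, b, c):
    """True if triangle a-b-c, as 2D projected points, faces away.

    Assumes counter-clockwise winding marks a front face, so a
    non-positive signed area means a back (or edge-on) face.
    """
    signed_area_x2 = (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])
    return signed_area_x2 <= 0


def in_frustum(point, fov_degrees=90.0, near=0.1, far=100.0, aspect=1.0):
    """True if a camera-space point lies inside a symmetric frustum."""
    x, y, z = point
    if not (near <= z <= far):
        return False  # behind the camera, too close, or too far away
    # The visible area widens with distance from the camera.
    half_height = z * math.tan(math.radians(fov_degrees) / 2)
    half_width = half_height * aspect
    return abs(x) <= half_width and abs(y) <= half_height


front = is_back_facing((0, 0), (1, 0), (0, 1))  # counter-clockwise
back = is_back_facing((0, 0), (0, 1), (1, 0))   # clockwise
ahead = in_frustum((0.0, 0.0, 5.0))
behind = in_frustum((0.0, 0.0, -5.0))
```

This checks single points only; a real renderer would more likely test whole triangles or bounding volumes against each plane of the frustum, since a triangle can be partly visible even when some of its vertices are not.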
Humans never have to think about "backface culling," and anything outside our "viewing frustum" is still there. In many ways, this exercise illustrates just how radically different animals are from computers. Much is made of the notion that computers are capable of intelligence, even intelligence to surpass that of humans. But their ability to conceive of the world anything like we do is extremely limited. Vast increases in processing power have yet to close this gap, either.
It should also be noted that in a computerized 3D scene, the idea of a solid mass is purely by convention. There are algorithms dedicated to collision detection which determine when one object has overlapped with another. Sometimes this is solved as crudely as erecting an invisible box around each object and forbidding those boxes from overlapping. Additionally, this perceptual solidity is a hoax in the sense that the inner contents of any closed 3D object are nothing but dead space. A computer-generated building might have opaque windows and a door, but within is absolutely nothing. There is an interesting trick of illusion here, however: so long as we do not approach the door and do not stare too carefully at the windows, we might imagine that the building has something inside it that we simply don't have time to investigate. So long as we can maintain that illusion--that suspension of disbelief--we can enjoy the artifice of a computerized 3D scene and convince ourselves, if only briefly, that it has qualities comparable to our physical world.
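The crude invisible-box approach can be sketched in a few lines. This version assumes each box is aligned with the coordinate axes and described by its minimum and maximum corners; the representation and the name `boxes_overlap` are choices made for this example.

```python
def boxes_overlap(a, b):
    """True if two axis-aligned boxes share any volume.

    Each box is a pair (min_corner, max_corner) of (x, y, z) tuples.
    Boxes that merely touch at a face do not count as overlapping.
    """
    (a_min, a_max), (b_min, b_max) = a, b
    # The boxes overlap only if their extents overlap on every axis.
    return all(
        a_min[axis] < b_max[axis] and b_min[axis] < a_max[axis]
        for axis in range(3)
    )


crate = ((0, 0, 0), (2, 2, 2))
player = ((1, 1, 1), (3, 3, 3))
distant_tree = ((5, 0, 0), (6, 4, 1))

hit = boxes_overlap(crate, player)         # True
miss = boxes_overlap(crate, distant_tree)  # False
```

Nothing here knows or cares what is inside either box, which is exactly the point: solidity is enforced at the surface only.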
What humans and computers do have in common is constructive modeling, and it is for this reason that we can understand and interface with them easily, if we're so inclined. It is no surprise that we designed them to be comprehensible to us, after all. The images represented in our brains when we see are not the actual objects our eyes perceive, but constructed models of those things. If we see a bed, we know that it is likely soft and comfortable and will be nice to sleep on. We recognize food and know it can be eaten. We understand things not as they are at their most basic nature, but by the mental models we create of them. "The map is not the territory." Likewise, a 3D scene as programmed into a computer is only a model of the real thing. Even the most detailed, articulately produced 3D scene can only ever be an imperfect facsimile that represents the idea of a thing, not so much the thing itself.
The modeling goes even further, originating in no small part from language. Contrary to widespread belief, language does not define thought; rather, thought defines language. Languages develop and evolve based on their speakers' needs rather than any constraints placed upon them. Languages then model the priorities and perspectives of the speakers, hence the Scots and their 421 separate words for "snow." Language also models the humorous sensibilities of the speakers! Not every one of those 421 words for snow is serious in tone, after all.
As different as humans and computers are, then, it turns out we have some common ground after all. We work almost entirely in metaphors and representations and models. We work with computers using metaphors and abstractions. Computers are capable symbolic processors because humans evolved to be ardent cogitators of the abstract. We built them to be like us, though their abilities in many areas are quite far from matching our own.
Perhaps the most impressive symbolic work done by human brains, if only because it occurs so automatically, is pattern recognition. Indeed, the human brain is pattern-seeking. We tend to see patterns even where they don't exist. Humans know what a dog looks like from every possible angle and can recognize one on sight. Consider the way we draw diagrams of three-dimensional objects with the intention of building them: we may only need top, front, and side views to conceptualize what the final structure will look like. Our brains don't contain images of 3D objects from every possible angle such that we can recognize them. Rather, our brains have learned via millions of years of evolution how to pick out important features and base recognition on those.
How tragic it would be, then, if computers advanced to a level of comprehension similar to humans, only for one side or the other to fail to recognize the foibles we have in common. It is easy to think that what binds disparate groups together are common strengths, but it is just as plausible that common flaws and weaknesses could bring them together, too. Let it be so for man and machine, lest one destroy the other out of fear of the counterpart's power.