Let’s Make a Street View! Part 0

In what was once timely news (sorry), Google announced version 3 of the Google Maps API at their last I/O, designed specifically—but not exclusively—with the mobile space in mind. Since the new API isn’t used on the main Maps page, I didn’t notice that it includes an embeddable “HTML5” version of Street View until the news started making the Twitter rounds.

What’s especially interesting about this new Street View—aside from the fact that it’s now available without Flash—is that it takes a divide-and-conquer approach to ensuring good performance across browsers. There are at least three different rendering “modes” within the new library, triggered by different browsers and operating environments. I’m going to take a brief look at each of these before moving on to the main event: building our own Street View app. Though I might call it StreatVue.

I’m trying something new by embedding JavaScript demos within the page itself, hopefully keeping readability, WordPress content management, asynchronous script loading, and non-CPU-hogging demos all playing together nicely. Please let me know if this ends up stupid on your system; it’s convenient enough that I’d really like to get it right for the future. Source will be minified on this page but available in full (under an Apache 2.0 License, though that’s no guarantee it’ll be useful) in the link below each demo. Someday I will get good at Git, I promise.

Street View Ethology

This is a popular game here at Extremely Satisfactory Totalitarianism. Why dig into obfuscated source code—deminifying and rebeautifying—when we can make overly general conclusions based on limited data? And no I don’t mean ethnography. This has little to do with real ethology, of course, and it made more sense when it was about observing Firefox in action (get it??), but what are you going to do.

When using the new Maps API, you’re essentially linking against a dynamic library, loaded from Google’s servers, with no control over the exact version you receive or the way it will choose to render. This is great for bugfixes and ongoing expansion of the library, but it does mean that things could change completely before I even publish this blog entry. It’s a more explicit use of functionality already available when popular JavaScript libraries are loaded from publicly-accessible CDNs, for instance, but that’s an interesting topic for another day.

Conceptually, Street View (and pretty much any VR/Panorama/whatever program) takes images in all directions from a single position and places them on a virtual sphere that appears to surround the viewer. Looking at how the images are taken:

it’s at least easy to imagine how, starting with a view direction, the images could be placed on screen and distorted to give this impression. If the distortion is done correctly, spatial relationships between features in the scene are preserved and it appears as if that place is reconstructed on screen. Things don’t feel quite right (for instance, since this ideal viewer’s eye never moves, there is no parallax between the foreground and the background), but the effect is very convincing. The key is in how the images are projected onto the screen, but as long as pixels end up where they need to be, there are many ways to store and display the source images. Exploring one way of doing this will be the main concern of the next post.
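To make that a little more concrete (the function and names here are my own toy version, nothing from the Maps API): every view direction, given as a heading around the vertical axis and a pitch above or below the horizon, picks out a single point on that virtual unit sphere surrounding the viewer.

```javascript
// Convert a view direction (heading around the vertical axis, pitch up/down,
// both in radians) to a point on the unit sphere around the viewer.
// Convention assumed here: +x is east, +y is up, -z is straight ahead at
// heading 0.
function directionToSphere(heading, pitch) {
  var cosP = Math.cos(pitch);
  return {
    x: cosP * Math.sin(heading),
    y: Math.sin(pitch),
    z: -cosP * Math.cos(heading)
  };
}
```

Treat every screen pixel as one of these directions, look up the color at that point on the sphere, and you have a panorama viewer.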

I’ve picked a spot here in Austin that you can use to identify Street View’s rendering method in your favorite browser. It’s not the prettiest picture of a place ever taken, but it has some features that will be useful for our purposes, like the long straight lines of the road and the checkerboard patterns on the buildings up ahead. The top and bottom of the view are rather blurry—a hallmark of an older generation of Street View photos, I think—but on the plus side, there’s the totally sweet Texas State Capitol building, some kind of ghost car, and a stab at guerrilla product placement by FedEx-Kinko’s.

Click to see our spot in the new JavaScript Street View:

For reference, here is the same spot on maps.google.com, using the normal Flash client.

As I mentioned above, there appear to be three approaches to rendering in the new Street View.

Mode 1 – “HTML4”

Open the example page and look to the right, past the ghost car, at the bridge’s railing. If the railing and sidewalk remain curved no matter how you move the viewport (see picture), then Street View is using this type of rendering in your browser. If you can make the railing perfectly horizontal, move on to modes 2 and 3.

Google’s Geo team has informally referred to this mode as the “HTML4” approach (and if that makes your head a splode, trust me, just turn back now).

Street View images are stored as tiles using an equirectangular projection, which you can see directly in this great demo by Jamie Thompson (and related blog post). I’ve stolen an image from him of tiles for one location, but go take a look at his work, it’s rad:

Since the tiles will be projected onto the inside of a sphere, it makes some sense that Google would store them as if they were a map themselves. It’s also likely that this format was chosen because it is the same projection used for the regular satellite and street maps on Google Maps, allowing a lot of the math involved in things like coordinate picking to be shared.

Looking at Jamie Thompson’s examples and this cool Tissot’s indicatrix created by Eric Gaba,

you can see that the equirectangular projection has relatively minimal distortion surrounding the equator, which, in a Street View scene, just so happens to be the area filled with the things (buildings, streets) that we want to look at. As a result, the “HTML4” approach is to just display the Street View image tiles directly, exactly like satellite and street map tiles are rendered. The distortion increases as one looks toward the “poles” of the Street View virtual sphere, but the average user (aka not you) will probably not pan all the way up to check what the sky looked like on the day the Google-mobile passed by.
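The mapping that makes this direct display possible is delightfully simple. As a sketch (panorama dimensions are an assumption on my part; real Street View tiles come in zoom-dependent sizes), heading maps linearly to a column and pitch maps linearly to a row:

```javascript
// Map a direction (heading in [0, 2*PI), pitch in [-PI/2, PI/2]) to pixel
// coordinates in an equirectangular panorama of the given size.
// Pitch +PI/2 (straight up) lands on row 0; pitch -PI/2 on the bottom row.
function equirectPixel(heading, pitch, panoWidth, panoHeight) {
  return {
    x: (heading / (2 * Math.PI)) * panoWidth,
    y: (0.5 - pitch / Math.PI) * panoHeight
  };
}
```

Divide those pixel coordinates by the tile size and you have the tile to fetch, which is exactly the arithmetic the regular map tiles already use.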

Since the equirectangular projection is non-linear, almost no straight lines are mapped to straight lines—hence the tell-tale curved bridge railing. If the view is zoomed out and overflow clipping is disabled, this becomes more obvious:

The vector overlay that traces the street is done in SVG or VML, depending on the browser, both of which only provide 2d transform mechanisms. Any 3d effect has to be done manually, but the overlay doesn’t quite match the movement of the street. Without really investigating the cause (we are playing ethology, here, after all), it appears that while the length of the overlay is affected by perspective, the rotation parameterization is not (in perspective, the horizon maps to a hyperbola, not half of some ellipse, a fact I still find surprising).

Finally, on my Windows 7 machine, HTML4 mode is currently used in Internet Explorer and, somewhat unexpectedly, the newest Firefox and Safari. It also appears to be used on all iOS and Android devices; for older models this isn’t really surprising—without OpenGL ES 2.0 support they will likely never be able to support WebGL—but this situation will probably change soon for just about any smartphone released in the last year. Which brings us to…

Mode 2 – WebGL

If you can make the bridge railing completely horizontal on the example page, and the page is super-silky smooth in movement and interaction, you probably have yourself a WebGL-enabled browser. However, in the time I’ve been working on this project, WebGL mode has actually stopped appearing for me. If it’s not working for you either, here’s a video of Vladimir Vukićević demonstrating the WebGL Street View at the 2010 Mozilla Summit:

Please let me know if you know how to reliably get the WebGL version; I’d love to play with it some more.

Mode 3 – <canvas>

For those not reading this from the WebGL-blessed near future, we get this close approximation. If the example page’s projection is correct (the sidewalk and railing are straight), but the framerate is a bit lower and the imagery is a little jittery while in motion, this is your guy.

It appears that the straightforward WebGL approach is being emulated using the 2d canvas element. The geometry of a sphere is created—its vertices assigned texture coordinates within the image tiles for a Street View location—and then drawn to screen after being properly positioned around the viewer and projected in perspective.
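A minimal sketch of that kind of geometry (my own toy version, not Google’s actual mesh): walk over rings of latitude and segments of longitude, place each vertex on the unit sphere, and assign it equirectangular texture coordinates.

```javascript
// Build a latitude/longitude sphere: each vertex sits on the unit sphere and
// carries texture coordinates (u, v) in [0, 1] into an equirectangular image.
// Adjacent vertices would then be stitched into triangles for drawing.
function buildSphere(rings, segments) {
  var vertices = [];
  for (var r = 0; r <= rings; r++) {
    var v = r / rings;           // 0 at the top of the image, 1 at the bottom
    var phi = v * Math.PI;       // polar angle from the north pole
    for (var s = 0; s <= segments; s++) {
      var u = s / segments;
      var theta = u * 2 * Math.PI;
      vertices.push({
        x: Math.sin(phi) * Math.cos(theta),
        y: Math.cos(phi),
        z: Math.sin(phi) * Math.sin(theta),
        u: u,
        v: v
      });
    }
  }
  return vertices;
}
```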

Of course, canvas doesn’t currently support 3d transformations (it is the “2d” context, after all). This can be worked around for simple 3d line or shaded figures since drawing instructions must be specified with 2d coordinates anyway. To use the built-in transformation support for this purpose, every face of an object to be drawn would need its own transformation relative to the object as a whole, which would in turn be multiplied by whatever universe and view transforms were needed to correctly position the object with respect to the viewer (this was exactly the tack taken here, but in CSS).

However, it turns out it’s far faster and easier to just take 3d coordinates, apply the universe and view transform to every vertex, drop the z coordinates, and then feed the resulting [x, y] values into the canvas drawing routines. This is what is done in most of the canvas 3d engines out there, like Mr. doob’s awesome three.js.
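In sketch form, the per-vertex work looks something like the following (function names and the focal-length parameter are mine; real engines fold all of this into matrices): rotate into view space, divide by depth for the perspective effect, and keep only [x, y].

```javascript
// Transform a 3d point by the viewer's heading (rotation about the y axis),
// apply a simple pinhole projection, and drop z, leaving 2d coordinates ready
// for canvas moveTo()/lineTo(). The camera looks down -z; points at or behind
// the viewer return null. "focal" scales the projection to screen pixels.
function projectVertex(p, heading, focal) {
  var c = Math.cos(heading), s = Math.sin(heading);
  var x = c * p.x - s * p.z;  // view-space x
  var y = p.y;                // view-space y (no tilt in this sketch)
  var z = s * p.x + c * p.z;  // view-space depth
  if (z >= 0) return null;    // behind the viewer
  return { x: focal * x / -z, y: focal * y / -z };
}
```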

Simple drawings aren’t the problem, texturing is. And for that, we’ll need a subheading:

Affine Texture Mapping: It’s Fast

Textures could be applied to faces similarly to what is described above—calculating a texel position for each pixel and then plotting it in 2d—but that would necessitate opening up the canvas’s pixel data and setting each pixel manually. This works for small objects, but fillrate with this approach is too slow in current browsers for it to work with large scenes.

Instead, we can fall back to the earlier, slower solution of calculating a transformation matrix per face, distorting the image so that when drawn to the canvas and clipped to the edges of the transformed polygon, a textured shape appears. Because it uses built-in image-handling routines (and is possibly hardware accelerated, depending on the browser), this approach ends up being fairly fast even with all the extra calculations that have to be made.

There are two subtleties here. First, if you’ve read the blog post I promised to post two months ago but probably never will (I’m stealing all the best bits for this series), you’d know that it isn’t possible to transform a quadrilateral, for instance, into another (arbitrary) quadrilateral using only a 2d affine transformation (a 2d affine transformation is a 2d linear transformation (which can scale, rotate, and shear) plus a translation, and is supported in the canvas 2d context API). The problem is overdetermined: there are 8 coordinate components for a quadrilateral but only 6 entries in our matrix.

Like linear transformations, affine transformations map parallel lines to parallel lines, so a rectangle is forever destined to be a parallelogram no matter the transformation. This is actually the geometric manifestation of the fact that the system of equations is overdetermined: once three vertices of a rectangle are moved at will, the position of the fourth is already determined, and a parallelogram remains. But one of the key characteristics of perspective is the vanishing point, where parallel lines appear to converge (and so would not be parallel on the screen).

Parallelogram:

Not a parallelogram:
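A quick numeric check of that “fourth vertex is already determined” claim (my own sketch): for a parallelogram, the fourth corner is p1 + p2 − p0, and any affine image of it is forced to be the same combination of the other three transformed corners.

```javascript
// Apply a 2d affine transform (same a, b, c, d, e, f layout that the canvas
// transform() method uses) to a point.
function applyAffine(m, p) {
  return {
    x: m.a * p.x + m.c * p.y + m.e,
    y: m.b * p.x + m.d * p.y + m.f
  };
}

// The fourth corner of a parallelogram with corners q0, q1, q2.
function fourthCorner(q0, q1, q2) {
  return { x: q1.x + q2.x - q0.x, y: q1.y + q2.y - q0.y };
}
```

Move three corners wherever you like with an affine transform, and the fourth comes along for the ride: a parallelogram in, a parallelogram out, never a perspective-style trapezoid.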

However, three vertices can be transformed at will, which means, in terms of linear (and affine) transforms, a parallelogram is really just a fancy triangle. With a 2d affine transform, arbitrary triangles in texture space can be mapped to ones in screen space, and triangles can be put together to make just about anything. This is called “affine texture mapping” and has been used since time immemorial (or the 80s, who knows) to speed texturing on platforms that are especially slow or support only 2d transforms. This is again the technique used by most JavaScript 3d canvas engines to support textures (here is Mr. doob texturing a quad with data from an HTML5 video element), to the point that almost all of them draw textured triangles using a function descended from this library by Thatcher Ulrich.

The function (drawTriangle()) includes one of those classic “// TODO: optimize” comments that will probably be there forever because it doesn’t really need to be done; in practice, the time needed to calculate the transformation matrix is minuscule compared to the time the browser needs to complete the subsequent drawImage() call.

However, it was fun to play with, so I’ve taken a stab at reworking it, trying to bring out the geometric side of the problem. The solution becomes much simpler when both the input texture vertices and output screen vertices are shifted so that one vertex of each is at the origin. With this done, the origin now just maps to itself (the triangle is shifted back at the end), so only a 2×2 linear transformation is needed.

The full transform—using Paul Heckbert’s classic trick of mapping to and from a simple intermediate form (in this case, a “unit triangle”)—maps input [u, v, 1] to output [x, y, 1], making sure the input and output vertices are paired correctly:

$\begin{bmatrix}x \\ y \\ 1 \end{bmatrix}=\begin{bmatrix}1 & 0 & x_0 \\ 0 & 1 & y_0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix}x_1-x_0 & x_2-x_0 & 0 \\ y_1-y_0 & y_2-y_0 & 0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix}u_1-u_0 & u_2-u_0 & 0 \\ v_1-v_0 & v_2-v_0 & 0 \\ 0 & 0 & 1 \end{bmatrix}^{-1}\begin{bmatrix}1 & 0 & -u_0 \\ 0 & 1 & -v_0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix}u \\ v \\ 1 \end{bmatrix}$

If we set $u'_1=u_1 - u_0, x'_1=x_1 - x_0,$ etc, the middle two matrices combine to form

$\frac{1}{u'_1 v'_2 - u'_2 v'_1}\begin{bmatrix}v'_2 x'_1 - v'_1 x'_2 & u'_1 x'_2 - u'_2 x'_1 & 0 \\ v'_2 y'_1 - v'_1 y'_2 & u'_1 y'_2 - u'_2 y'_1 & 0 \\ 0 & 0 & u'_1 v'_2 - u'_2 v'_1 \end{bmatrix}$

while the translation matrices, surrounding a generic linear transformation like that, let the linear transform pass through unchanged, but add a new translation component

$\begin{bmatrix}a & c & x_0 - a u_0 - c v_0 \\ b & d & y_0 - b u_0 - d v_0 \\ 0 & 0 & 1 \end{bmatrix}=\begin{bmatrix}1 & 0 & x_0 \\ 0 & 1 & y_0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix}a & c & 0 \\ b & d & 0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix}1 & 0 & -u_0 \\ 0 & 1 & -v_0 \\ 0 & 0 & 1 \end{bmatrix}$

Here is the resulting code:

// uses affine texture mapping to draw a textured triangle
// at screen coordinates [x0, y0], [x1, y1], [x2, y2] from
// img *pixel* coordinates [u0, v0], [u1, v1], [u2, v2]
function drawTexturedTriangle(img, x0, y0, x1, y1, x2, y2,
                              u0, v0, u1, v1, u2, v2) {
  ctx.beginPath();
  ctx.moveTo(x0, y0);
  ctx.lineTo(x1, y1);
  ctx.lineTo(x2, y2);
  ctx.closePath();

  x1 -= x0; y1 -= y0;
  x2 -= x0; y2 -= y0;
  u1 -= u0; v1 -= v0;
  u2 -= u0; v2 -= v0;

  var det = 1 / (u1*v2 - u2*v1),

      // linear transformation
      a = (v2*x1 - v1*x2) * det,
      b = (v2*y1 - v1*y2) * det,
      c = (u1*x2 - u2*x1) * det,
      d = (u1*y2 - u2*y1) * det,

      // translation
      e = x0 - a*u0 - c*v0,
      f = y0 - b*u0 - d*v0;

  ctx.save();
  ctx.transform(a, b, c, d, e, f);
  ctx.clip();
  ctx.drawImage(img, 0, 0);
  ctx.restore();
}

It is now much clearer that we’re dividing by a determinant (there’s that parallelogram again), which means that everything will be fine as long as the three texture coordinates are not collinear. If they are, the input triangle has no area and the function will fail when it tries to divide by zero; this just reflects the reality that there’s no way to expand a line in texture space to a triangle in screen space. You could also, you know, put in a check for that situation, but be careful of numerical problems with nearly degenerate triangles. This can actually be more of an issue than is initially apparent: JavaScript uses double-precision floats for its numeric type, but for legacy and performance reasons, many browsers use single precision for things like transformations internally.
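One way to add that check (my own sketch, with an epsilon you’d want to tune for your coordinate scale): measure twice the signed area of the texture triangle and bail out before computing the transform if it’s too close to zero.

```javascript
// Twice the signed area of the texture-space triangle is exactly the
// determinant drawTexturedTriangle() divides by; if it is (nearly) zero, the
// texture coordinates are (nearly) collinear and the transform would blow up.
var AREA_EPSILON = 1e-6; // a guess; scale-dependent

function textureTriangleUsable(u0, v0, u1, v1, u2, v2) {
  var area2 = (u1 - u0) * (v2 - v0) - (u2 - u0) * (v1 - v0);
  return Math.abs(area2) > AREA_EPSILON;
}
```

Call it before drawing and skip (or slightly nudge) any triangle that fails.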

Hey, look, it’s interactive:

Hi! You'll need canvas support and running JavaScript to see this demo.

View Source!

But there’s something not quite right about that triangle, which brings us to the second subtlety I mentioned above (I hadn’t forgotten). It is, perhaps, not that subtle.

Affine Texture Mapping: It’s Really Wrong

Even though they’re clearly anchored to one another, when the box and the triangle are rotated, they don’t seem to move together correctly. It’s difficult to describe (it is for me, at least), but it seems that as the box is rotated one way and the triangle goes with it, some sort of distortion happens within the triangle that kind of shifts the texture back the other way.

The slider adjusts the opacity of the entire transformed checkerboard to show how the texture moves before it is clipped to the edges of the triangle (that is, if you see a slider; if not, sad face, but try entering a number between 0 (transparent) and 100 (opaque) in the text box to adjust). With the full checkerboard visible and manipulated, the texture looks like it’s being sheared and rotated—not put in perspective—which is because it is only being sheared and rotated. That’s all an affine transform can do.

Some less geometric textures can sometimes get away with this, but the checkerboard reveals all. It’s made up of squares, which we’re trying to put into perspective, which, if done correctly, means that the parallel lines within the texture would need to converge if extended. But affine transforms map parallel lines to parallel lines…you see where I’m going. It’s not possible to do what we’re trying to do with the tools that we have.

Here’s an analogy that will both inform and make the next reveal slightly more dramatic. Picture that classic vanishing-point subject, the train tracks. I’ll help:

Thinking of our triangle again, we’re putting the triangle in perspective by distorting its geometry, but we’re specifying what to draw with an affine transform, which is just a translation combined with a linear transform. We’re implicitly asking for a linear interpolation of the texture across the available space: if the triangle was originally 100 pixels tall and there was a checkerboard square every 10 of those pixels, for every orientation we’d ask the browser to draw a (distorted) square every 1/10 of whatever the triangle’s new, distorted height was.

In terms of the train tracks, we would (not exactly, but analogously) be putting them into perspective by tilting the rails inward as they recede, but keeping the ties (aka sleepers) equally spaced and sized.

That’s wrong, of course, since in reality the size of the ties would appear to shrink as they recede, as would the distance between them. Our brains actually make the problem worse by adjusting our perception of the size of the ties, so the top tie looks thicker than the bottom one. This (in my opinion) is what feels most wrong about the checkerboard triangle: the squares not only don’t become smaller as they tilt away, they actually appear to grow larger.

Looking at the train tracks in “correct” perspective (here are some real ones), there are some points to consider.

First, the intersection points of the ties and each rail stay in a line, even in perspective. More mathematically, we could say that the perspective transform maps collinear points to collinear points. The derivation of this fact ends up being pretty cute (Jim Blinn: “It never ceases to amaze me that plotting one hyperbola against the other yields…a straight line”), but is probably overkill at this point.

Second, the intersection points of the ties and the rails are equally spaced before projection, but a perspective transform maps them to unequally spaced points on the screen.
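To put numbers on that second point (a made-up scene, not actual Street View data): with a pinhole camera at height 1 and focal length 1, a tie on the ground at distance z lands at screen height −1/z. Equal one-unit gaps in the world become gaps of 1/(z(z+1)) on screen, shrinking as the ties recede.

```javascript
// Screen height of a ground-level point at distance z in front of a pinhole
// camera at height 1 with focal length 1.
function projectTie(z) {
  return -1 / z;
}

// Screen-space gaps between consecutive ties placed 1 world unit apart,
// starting 2 units from the camera.
var gaps = [];
for (var z = 2; z <= 5; z++) {
  gaps.push(projectTie(z + 1) - projectTie(z));
}
// gaps = [1/6, 1/12, 1/20, 1/30]: equal spacing in the world, shrinking
// spacing in the image. Affine interpolation would keep them all equal.
```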

I mentioned above that the railroad tracks are only analogous to our triangle (mildly dramatic reveal in 3, 2, 1…). Our situation is actually much worse. The incorrect railroad tracks are basically a distorted quad with an image linearly interpolated across it. It would take two triangles to actually do that, each with their own texture transform to get their own part of the texture distorted correctly within them. And while they would have two vertices in common, keeping the texture continuous across the divide, the difference between the two becomes very obvious:

Hi! Again, you'll need canvas support and running JavaScript to see this demo.

View Source!

That’s just two triangles. When a lot of large, affine-textured triangles are put together for a scene, well…

And that’s why the N64 was way more awesome. Also: OoT.

Affine Texture Mapping: It’s Still Useful (and Used)

Not surprisingly, there is a correct way to do texture mapping in perspective, but it involves hyperbolically interpolating the texture image over the surface of a transformed polygon. This requires a divide per pixel to do correctly, which is slow in software and impossible if we can only use our little 2d matrix; perspective projection is decidedly not a linear transformation.

So, how could affine texture mapping be useful? First, notice that when the textured quad is mostly upright and facing the viewer, the texture distortion isn’t too extreme, it just seems a bit “swimmy.”

Second, recall that the texture transform is constructed to exactly map the three texture coordinates to the three screen coordinates, regardless of the incorrect interpolation everywhere else. If every new vertex (along with a new triangle) guarantees one more correctly mapped point, subdividing the triangles into more triangles would anchor more texture coordinates correctly on the screen.

If our two-triangle quad is split into four triangles, it’s still wrong, but the movement within the quad is reduced significantly:

Hi! Again, you'll need canvas support and running JavaScript to see this demo.

They’re all the same source!

Each of those four triangles could be subdivided themselves, and the resulting eight triangles could be subdivided, and so on, as many times as we want. At the limit, where each triangle is only one pixel in size (or maybe three), the affine texturing would be exactly correct. The tradeoff is a bunch of calculations per triangle (at least 4 divisions, you’ll recall)—plus the probably overhead-heavy image machinery of the browser—which would end up slower than just doing the hyperbolic interpolation in the first place.

Instead, a middle ground can often be found, with triangles small enough to bring visible distortion down to a minimum, but big enough so that performance doesn’t suffer too much.
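A sketch of the uniform version of that subdivision (my own; the vertex layout is an assumption): split each edge at its midpoint in 3d and texture space, so that after projection the four child triangles pin three extra points to their true perspective positions.

```javascript
// Each vertex is { pos: [x, y, z], uv: [u, v] }. Midpoints are taken in 3d
// and texture space; projecting the new vertices afterward is what anchors
// the extra points correctly on screen.
function midpointVertex(a, b) {
  return {
    pos: [(a.pos[0] + b.pos[0]) / 2,
          (a.pos[1] + b.pos[1]) / 2,
          (a.pos[2] + b.pos[2]) / 2],
    uv: [(a.uv[0] + b.uv[0]) / 2, (a.uv[1] + b.uv[1]) / 2]
  };
}

// Recursively split a triangle [v0, v1, v2] into 4^depth smaller triangles.
function subdivide(tri, depth) {
  if (depth === 0) return [tri];
  var ab = midpointVertex(tri[0], tri[1]);
  var bc = midpointVertex(tri[1], tri[2]);
  var ca = midpointVertex(tri[2], tri[0]);
  var children = [[tri[0], ab, ca], [ab, tri[1], bc],
                  [ca, bc, tri[2]], [ab, bc, ca]];
  var out = [];
  for (var i = 0; i < children.length; i++) {
    out = out.concat(subdivide(children[i], depth - 1));
  }
  return out;
}
```

Note the midpoints must be taken before projection: averaging already-projected screen coordinates would reproduce the parent’s affine map exactly and gain nothing.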

And here at the end, things become relevant again. It appears (ethology!) that mode 3 of Street View is indeed using affine texture mapping to emulate the perspective-correct texture mapping that WebGL hands out for free. There is some noticeable jitteriness, but a large part of that could be the aliasing along the edges of the triangles. But if you zoom in, you’ll see the subtle swimming of the textures (when zoomed in, the triangles are drawn larger, but they are less tilted relative to the screen). This is the example page pre-zoomed for you. Click and drag slowly with the mouse, looking especially at the patterns of the windows and the trees below them (obviously if Street View uses modes 1 or 2 in your browser, you won’t see much).

Finally, just for completeness, note that there are a bunch of smart ways that the triangles can be subdivided to ensure better performance. Thatcher Ulrich demonstrates one method here, using a heuristic that compares the affine and perspective texturing for a triangle and subdividing it if the difference between the two is too great.
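A rough version of that kind of heuristic (my own sketch, not Ulrich’s actual code): project the true 3d midpoint of an edge, compare it with the screen midpoint that affine interpolation would produce, and split if they differ by more than a pixel tolerance.

```javascript
// a and b are 3d points [x, y, z]; project() is whatever maps 3d to screen
// coordinates [x, y]. Returns true if affine interpolation along this edge
// would be off by more than tolerancePx pixels at the midpoint.
function needsSplit(a, b, project, tolerancePx) {
  var mid3d = [(a[0] + b[0]) / 2, (a[1] + b[1]) / 2, (a[2] + b[2]) / 2];
  var correct = project(mid3d);                 // where the midpoint belongs
  var pa = project(a), pb = project(b);
  var affineMid = [(pa[0] + pb[0]) / 2,         // where affine would put it
                   (pa[1] + pb[1]) / 2];
  var dx = correct[0] - affineMid[0];
  var dy = correct[1] - affineMid[1];
  return dx * dx + dy * dy > tolerancePx * tolerancePx;
}
```

Edges facing the viewer head-on pass untouched, while edges receding sharply in depth get subdivided, which is exactly where affine texturing looks worst.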

Let’s Make a Street View!

Those are the rendering modes that I’ve come across. If you’d like to know more about the specific testing and decisions that went into making the new Street View, check out Marc Ridey’s segment of this Google I/O 2010 session: Google I/O 2010 – How Maps API v3 came to be. However, start at the beginning of that video to see a really great talk given by Susannah Raub on designing an inherently data-heavy API for those heavy-data-averse mobile devices.

I had actually been thinking about a completely different approach to a browser-based, “HTML5” Street View since before v3 of Maps was announced. Unsurprisingly, the Geo folks at Google seem pumped about the WebGL future, but I’ve gone a different route. Nothing novel, to be sure; the techniques all date back 30 years, but that only makes them way cooler. It has also been pretty remarkable to be working on the project, face a (usually application-level, but genuine) problem, realize some shiny new HTML5 feature solves it exactly, add it to the mix, and have it all actually work. It’s apparent to me—with the pretty astounding features just released or about to be released from every major browser vendor—that the web today should be much fancier than it is. I’m doing my part; are you?

Next time: perspective projections, a prototype, and maybe some authorial conciseness. It’s a thrill a minute.


3 Responses to Let’s Make a Street View! Part 0

1. Mr.doob says:

Good to see people messing with all this :)

While working on thewildernessdowntown.com, an approach I didn’t get to try (probably because it sounded way too crazy) was to do plane deformations (http://mrdoob.com/lab/javascript/effects/plane_deformations/). Because of the crossdomain limitation, the only way to do this would be by doing a drawImage per pixel on the screen.

The result/distortion should be 100% perfect, performance probably horrible.

• Brendan Kenny says:

haha, I had not thought of that.

Same origin restrictions definitely put a damper on some of the crazier things I’d like to do (and it looks like WebGL 1.0 might move to a similar conservative approach for at least the first version). The arguments for not relaxing them are of course good ones, but I do wish that I could just request and use images directly rather than ridiculously (and TOS-breakingly) proxying them through a server. Some day…

2. Mr.doob says:

By the way, I moved the three.js examples to github and I’ve broken some of the links on the post :/ You can get the new links from the project page: https://github.com/mrdoob/three.js/