Posted on January 2, 2019

One or two things you always wanted to know about tensors but were afraid to ask.

Tensor spaces and tensor products show up absolutely everywhere – physics, engineering, algebra, category theory, topology, geometry, machine learning, big data, you name it. Because they are used in so many different ways, there’s a significant amount of confusion about what they *are*. I, for one, didn’t really understand them until they had been explained to me in five different ways.

Part of the reason for all of this confusion is that the names “tensor” and “tensor space” are confusing. When I was in precalc, we learned about vectors, and I asked my teacher “but what *is* a vector?” She answered (and this is the correct answer) “an element of a vector space”. Of course, this is a very frustrating and unhelpful answer, because then what is a vector space? The point is, though, that the property of “vector-ness” is determined by how the vector interacts with other vectors and with numbers. Similarly, the property of “tensor-ness” is determined by how a tensor interacts with other tensors, vectors, and numbers. I’m not going to immediately explain this, but it’s good to keep in mind for what will come next.

I’m going to explain tensors in three different ways. Hopefully, the combination of the three will be better than any one alone, and hopefully there will be something here for everyone! The last thing that I will say is that this will all be non-rigorous. The point here is not to prove things about tensors; the point is to get an intuition for tensors that will help when you read a more formal account. For now I’m just going to concentrate on what a tensor *feels* like.

The first way of understanding tensors is the physicist’s definition. (Really, because I am a mathematician, this will be closer to the geometer’s definition, but also because I am a mathematician I consider physics to just be geometry, so whatever.) Because we are using the physicist’s definition, let’s start with a concrete example (gotta respect physics tradition). Suppose you and I are planning a hiking trip in the Alps. We have an elevation map of the Alps, and we draw a route on it. We can measure how long the route on the map is and multiply by the scale of the map to get a rough idea of the distance we will travel, but this does not account for elevation. What we want to do is figure out how long the route through the mountains will actually be, accounting for elevation.

Mathematically, the map of the Alps is a function h : [0,1]^2 \to \mathbb{R} which gives a height at each point on the map. A path drawn on the map is a function \mathbf{r}: [0,1] \to [0,1]^2, which gives the position \mathbf{r}(t) at time t. Let the scale factor of the map be k \in \mathbb{R}, so that the region of the Alps corresponding to the region on the map is a k \times k square. One way of computing the length of the path is by defining a function \mathbf{m}: [0,1]^2 \to \mathbb{R}^3, \mathbf{m}(x,y) = (kx, ky, h(x,y)). This is the “mountain” function – this takes a point on the map to a point on the mountains. Let’s call the image of this function M. We can compose this function with the path function to get our path through the mountains: \mathbf{c}= \mathbf{m}\circ \mathbf{r}. We can then do a bog-standard path integral to compute the length of the path:

l(\mathbf{c}) = \int_0^1 \lVert{}\mathbf{c}'(t)\rVert{} dt

(Quick refresher: \mathbf{c}'(t) is the velocity vector of a particle travelling along \mathbf{c} at time t, and so therefore \lVert \mathbf{c}'(t) \rVert is the speed of the particle, and \lVert \mathbf{c}'(t) \rVert dt is the distance that the particle travels in the infinitesimal slice of time [t, t + dt].)
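To make this concrete, here is a minimal numerical sketch of this integral in Python with NumPy. The specific height function h, the scale k, and the path \mathbf{r} are all made up for illustration; the integral is approximated by summing the lengths of many short chords along \mathbf{c}.

```python
import numpy as np

k = 1000.0  # map scale: the map corresponds to a k x k region (made up)

def h(x, y):
    """A made-up height function h : [0,1]^2 -> R."""
    return 0.1 * np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)

def m(x, y):
    """The 'mountain' function: a point on the map -> a point in R^3."""
    return np.array([k * x, k * y, h(x, y)])

def r(t):
    """A path drawn on the map (a straight line, for simplicity)."""
    return np.array([t, 0.5 * t])

def path_length(n=10000):
    """Approximate l(c) = integral of ||c'(t)|| dt by summing chord lengths."""
    ts = np.linspace(0.0, 1.0, n + 1)
    pts = np.array([m(*r(t)) for t in ts])  # sample c = m o r
    return np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
```

With these made-up numbers, the flat map distance would be k\lVert(1, 0.5)\rVert \approx 1118.03, and the computed length comes out a hair longer because of the terrain.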

However, this is a path integral through \mathbb{R}^3. This is like actually walking through mountains counting paces – we don’t want to do that! What we want to do is measure the length of the path without setting foot on the mountain. To do this, we need to measure the size of the tangent vector to the two-dimensional path, somehow taking into account changes in elevation.

l(\mathbf{r}) = \int_0^1 s(\mathbf{r}'(t)) dt

where s somehow measures the size of the tangent vector to \mathbf{r}, accounting for elevation. If we can derive s, then we truly will have made the mountain come to Muhammad. Let’s first write down what we know must be true about s.

s(\mathbf{r}'(t)) = \lVert{}\mathbf{c}'(t)\rVert{}

This is no good as a definition, because s doesn’t know anything about t so it can’t call \mathbf{c}' on t, but it’s a good start. We might get a bit farther if we remember how \mathbf{c} was defined: \mathbf{c}= \mathbf{m}\circ \mathbf{r}. This is hopeful, because it means that we can get from \mathbf{r} to \mathbf{c} by applying \mathbf{m}. However, we don’t want to get from \mathbf{r} to \mathbf{c}; we want to get from \mathbf{r}' to \mathbf{c}'. Let’s step back and visualize this a bit. \mathbf{r}'(t) is a *tangent vector* to the surface [0,1]^2 at the point \mathbf{r}(t), and \mathbf{c}'(t) is a tangent vector to the surface M at the point \mathbf{c}(t). There are several ways of thinking about what a tangent vector \mathbf{v} to a surface S means, but I like to think of it as a function \mathbf{v} : (-\epsilon, \epsilon) \to S that traces out a tiny curve in a specific direction.

\mathbf{c}'(t) = d \mapsto \mathbf{c}(t + d)

This definition simplifies our work a great deal, because we can map a tangent vector from \mathbf{R}^2 to M by just composing with \mathbf{m}.

\mathbf{c}'(t) = \mathbf{m}\circ \mathbf{r}'(t)

At this point you may be thinking that I played a trick on you. Aren’t tangent vectors things with components? How is a function from (-\epsilon, \epsilon) to M at all like a member of \mathbf{R}^2? The answer is that we can find a basis for the functions from (-\epsilon, \epsilon) to M through a given point. For instance, if \mathbf{p} is a point on a plateau of M, then a basis for the functions from (-\epsilon, \epsilon) to M through \mathbf{p} might look like:

f_1(d) = \mathbf{p}+ d \cdot \langle 1,0,0 \rangle

f_2(d) = \mathbf{p}+ d \cdot \langle 0,1,0 \rangle

Because \epsilon is very small, all functions from (-\epsilon, \epsilon) to M can be written as \alpha_1 f_1 + \alpha_2 f_2 for some \alpha_1, \alpha_2. A good way of thinking about this is that the function doesn’t have room to wiggle – it just has to go straight through \mathbf{p} in one direction. This means that we can identify the intuitive notion of functions from (-\epsilon, \epsilon) to M with actual members of \mathbf{R}^2.

Sometimes it helps to sit back and give names to things so that they occupy solid places in your mind, so let’s do that. Let us define T_{\mathbf{p}}(S) (pronounced “the tangent space at \mathbf{p}”) to be the set of all the tangent vectors at \mathbf{p} (functions \mathbf{v}: (-\epsilon, \epsilon) \to S with \mathbf{v}(0) = \mathbf{p}). For M the tangent space is always 2-dimensional because M is a 2-dimensional surface. We can now go back to our earlier question and phrase it better: we want to find a function from T_{\mathbf{r}(t)}([0,1]^2) to T_{\mathbf{c}(t)}(M). And we found this: the function is composition with \mathbf{m}. This is because composition with \mathbf{m} allows us to take a way of drawing a tiny curve on [0,1]^2 and turn it into a way of drawing a tiny curve on M. In math, this function is called d\mathbf{m}_{\mathbf{p}} (we add the subscript to remind ourselves that it takes vectors from the tangent space at \mathbf{p}). d\mathbf{m}_\mathbf{p} is called the *differential of \mathbf{m} at point \mathbf{p}*, but it’s also called the *push-forward of \mathbf{m} at point \mathbf{p}* because it takes tangent vectors from the domain of \mathbf{m} and “pushes them forward” to tangent vectors in the range of \mathbf{m}.

However, defining s raises a bit of a problem. We want s to take in vectors from the tangent space to a point \mathbf{p}, but in fact we want s to work on lots of different tangent spaces. To solve this, we make a family of functions \{ s_\mathbf{p}\}, where s_\mathbf{p}: T_\mathbf{p}([0,1]^2) \to \mathbf{R}.

s_\mathbf{p}(\mathbf{v}) = \lVert d\mathbf{m}_\mathbf{p}(\mathbf{v}) \rVert

Now s_\mathbf{p} encapsulates “how big” a tangent vector on [0,1]^2 is when we put it on the mountain. If at a given point \mathbf{p} on the map, the corresponding point \mathbf{m}(\mathbf{p}) is on a very steep part of the mountain, then s_\mathbf{p} will tell us which direction the steepness is in. For instance, s_\mathbf{p}(\langle 0, 1 \rangle) could be 1, if d\mathbf{m}_\mathbf{p}(\langle 0,1\rangle) points sideways on the mountain, but s_\mathbf{p}(\langle 1,0 \rangle) could be 5, if d\mathbf{m}_\mathbf{p}(\langle 1,0 \rangle) points straight up the mountain.

The upshot of all of this is that if someone else does the work of calculating s_{\mathbf{p}} for every \mathbf{p}, we can compute the length of our trip while conceptually staying on the map.

\int_0^1 s_{\mathbf{r}(t)}(\mathbf{r}'(t)) dt
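Here is a sketch of computing that integral while “staying on the map”: the push-forward d\mathbf{m}_\mathbf{p} is approximated by a finite difference, so we never need an explicit formula for s_\mathbf{p}. The height function and scale are made up for illustration.

```python
import numpy as np

k = 1000.0  # map scale (made up)

def m(p):
    """Map point (x, y) -> mountain point (kx, ky, h(x, y))."""
    x, y = p
    hxy = 0.1 * np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)
    return np.array([k * x, k * y, hxy])

def dm(p, v, eps=1e-6):
    """The push-forward dm_p(v), approximated by a finite difference."""
    return (m(p + eps * v) - m(p)) / eps

def s(p, v):
    """s_p(v) = ||dm_p(v)||: the 'mountain size' of a map tangent vector."""
    return np.linalg.norm(dm(p, v))

def length(r, rprime, n=2000):
    """l(r) = integral of s_{r(t)}(r'(t)) dt, by the midpoint rule."""
    ts = (np.arange(n) + 0.5) / n
    return sum(s(r(t), rprime(t)) for t in ts) / n
```

For a straight map path \mathbf{r}(t) = (t, 0.5t), this gives essentially the same answer as walking the 3D path directly, which is the whole point.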

What this gives you is a notion of distance on the map that’s effectively decoupled from the mountain. The interesting thing about this is that you don’t actually need a sense of elevation to measure this distance: if you were a short-sighted ant on the surface of the mountain with a tape measure and a GPS, you could actually measure the distance traveled between different coordinates, without knowing anything about heights. Moreover, you could change the shape of the actual mountain while still preserving all the distances. For instance, if the mountain were flat, you could roll it into a tube, and the ant wouldn’t notice.

Now, there is nothing special about two dimensions in this example. We could give some function s_\mathbf{p}: T_\mathbf{p}(\mathbf{R}^3) \to \mathbf{R} that gave a way of measuring distances of three-dimensional paths. The shortest path between two points might no longer be a straight line in this new metric, just like it might be faster to go around a peak instead of going over it in the Alps example. When people say that “space is curved”, this is what they mean. Light sometimes doesn’t travel in straight lines around large sources of gravity, because gravity changes the distance metric of space so that the shortest path is no longer a straight line. Pretty neat, huh?

Another thing that the ant can measure without having any idea about elevation is the angle between two tangent vectors. This is a pretty important thing to know, and when we talk about curved space, you would expect any structure that we put on space to give it a “curvature” would also be able to talk about angles. We can compute the angle between two tangent vectors on the mountain by remembering that \mathbf{u}\cdot \mathbf{v}= \lVert \mathbf{u}\rVert \lVert \mathbf{v}\rVert \cos \theta, where \theta is the angle between the two vectors. Therefore, \theta = \arccos\left((\mathbf{u}\cdot \mathbf{v})/(\lVert \mathbf{u}\rVert \lVert \mathbf{v}\rVert)\right). This suggests that we want a function that allows us to take the dot product of vectors “push-forwarded” to the mountain, so in analogy to the definition of s_\mathbf{p},

g_\mathbf{p}(\mathbf{u},\mathbf{v}) = d\mathbf{m}_\mathbf{p}(\mathbf{u}) \cdot d\mathbf{m}_\mathbf{p}(\mathbf{v})

Note that we can now define s_\mathbf{p} in terms of g_\mathbf{p}: s_\mathbf{p}(\mathbf{u}) = \sqrt{g_\mathbf{p}(\mathbf{u},\mathbf{u})}. The upshot of this is that g_\mathbf{p} gives us the information that we need to talk about curvature. It’s basically a “way of multiplying vectors” that varies along points in a space. And this is one example of what we call a “tensor field”.
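Numerically, g_\mathbf{p} can be packaged as a matrix at each point: if J is the Jacobian of \mathbf{m} at \mathbf{p} (its columns are the push-forwards of the basis vectors), then g_\mathbf{p}(\mathbf{u},\mathbf{v}) = (J\mathbf{u}) \cdot (J\mathbf{v}) = \mathbf{u}^T (J^T J) \mathbf{v}. A small sketch, using a made-up surface z = xy rather than the Alps:

```python
import numpy as np

def jacobian(m, p, eps=1e-6):
    """Finite-difference Jacobian of m at p; column i is dm_p(e_i)."""
    cols = [(m(p + eps * e) - m(p)) / eps for e in np.eye(len(p))]
    return np.stack(cols, axis=1)

def metric(m, p):
    """The matrix G of the tensor g_p, so that g_p(u, v) = u @ G @ v."""
    J = jacobian(m, p)
    return J.T @ J

def angle(m, p, u, v):
    """Angle between the push-forwards of u and v, computed on the map."""
    G = metric(m, p)
    return np.arccos((u @ G @ v) / np.sqrt((u @ G @ u) * (v @ G @ v)))
```

At the origin the made-up surface (x, y, xy) is flat, so G is (numerically) the identity matrix and the basis vectors meet at a right angle, as you’d hope.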

Now, I have so far cheated a bit, because I haven’t actually defined d\mathbf{m} rigorously. When you define d\mathbf{m} rigorously, you discover that d\mathbf{m}_\mathbf{p} is actually a *linear transformation* between the tangent space on the map and the tangent space on the mountain. This allows us to prove some properties about g_\mathbf{p}.

g_\mathbf{p}(\alpha \mathbf{u}, \mathbf{v}) = d\mathbf{m}_\mathbf{p}(\alpha \mathbf{u}) \cdot d\mathbf{m}_\mathbf{p}(\mathbf{v}) = \alpha d\mathbf{m}_\mathbf{p}(\mathbf{u}) \cdot d\mathbf{m}_\mathbf{p}(\mathbf{v}) = \alpha g_\mathbf{p}(\mathbf{u},\mathbf{v})

g_\mathbf{p}(\mathbf{u}_1 + \mathbf{u}_2,\mathbf{v}) = d\mathbf{m}_\mathbf{p}(\mathbf{u}_1 + \mathbf{u}_2) \cdot d\mathbf{m}_\mathbf{p}(\mathbf{v}) = d\mathbf{m}_\mathbf{p}(\mathbf{u}_1) \cdot d\mathbf{m}_\mathbf{p}(\mathbf{v}) + d\mathbf{m}_\mathbf{p}(\mathbf{u}_2) \cdot d\mathbf{m}_\mathbf{p}(\mathbf{v}) = g_\mathbf{p}(\mathbf{u}_1,\mathbf{v}) + g_\mathbf{p}(\mathbf{u}_2,\mathbf{v})

Similar properties hold for the second argument of g_\mathbf{p}. We say that a function is “linear in all arguments”, or “n-linear”, if these properties hold for all n arguments, so g_\mathbf{p} is 2-linear, or “bilinear”. With this in mind, one might guess that a tensor field is an n-linear function that varies along the points in a space and takes in arguments from the tangent space at that point. And indeed, anything following that description is a tensor field. However, there are also other types of tensor fields, but to get a better picture of that, we need to look at the mathematical approach.

In physics, we make up stuff to explain the world around us. In math, we make up stuff because we feel like it.

One thing that’s always kind of annoyed me is that there’s no good way of multiplying vectors. On the one hand you have the dot product, which is kind of sub-par because the result is a scalar, but on the other hand if you want a result which is a vector, you have to do cross product, which is also sub-par because it only works in three dimensions.

Let’s see if we can do better. I’m going to derive a way of multiplying vectors from the nice properties of multiplication.

First of all, we need a new symbol. We can’t use \times because that’s cross product, so let’s just put a circle around \times and call it a day: \otimes. With that out of the way, let’s state some nice properties:

\otimes should play nicely with scalar multiplication.

(t\mathbf{v}) \otimes \mathbf{w}= t(\mathbf{v}\otimes \mathbf{w}) = \mathbf{v}\otimes (t\mathbf{w})

\otimes should play nicely with vector addition

(\mathbf{u}+ \mathbf{v}) \otimes \mathbf{w}= \mathbf{u}\otimes \mathbf{w}+ \mathbf{v}\otimes \mathbf{w}

“Wow!” you say, “I would be really impressed if you could figure out a way of multiplying vectors with those properties (unless you just wuss out and give the scalar product).”

Well, prepare to be impressed and disappointed. You probably assumed the type of \otimes was V \times V \to V, but I never actually said that. The type of \otimes is V \times V \to V \otimes V, where the \otimes in V \otimes V is, confusingly enough, a completely different operator that works on vector spaces instead of elements of vector spaces. Yeah, I know it’s obnoxious. It’ll be more obnoxious if you go out into the real world not knowing the right notation.

“Wait a minute,” you say, “You can’t just make up an arbitrary new type for results of multiplication and expect things to just work out all fine and dandy!”

To which I answer, “Says who!”

V \otimes V is a vector space which has elements of the form \mathbf{v}_1 \otimes \mathbf{w}_1 + \ldots + \mathbf{v}_k \otimes \mathbf{w}_k, where k is not fixed. If you can deduce from the above laws (the so-called “nice properties”) that two elements are equal, then they are equal – otherwise they aren’t equal. Addition is defined by (\mathbf{v}_1 \otimes \mathbf{w}_1) + (\mathbf{v}_2 \otimes \mathbf{w}_2) = (\mathbf{v}_1 \otimes \mathbf{w}_1 + \mathbf{v}_2 \otimes \mathbf{w}_2). Scalar multiplication is defined by the above laws. We call V \otimes V a tensor space, and the elements of it are tensors. So the answer to the question “what are tensors?” is that they are elements of a tensor space, and a tensor space is just two vector spaces “tensored” together (which is how you pronounce the \otimes multiplication operation).

You might object that this all seems a bit weaselly — how would you implement this on a computer? It would be really obnoxious to have to do a proof search every time you wanted to compare equality. The answer lies in that magical property of vector spaces — they have bases. If \mathbf{e}_1, \ldots, \mathbf{e}_n is a basis for V, we can take any \mathbf{v}\otimes \mathbf{w} and write it as

\left(\sum_{i=1}^n\alpha_i\mathbf{e}_i\right) \otimes \left(\sum_{j=1}^n\beta_j\mathbf{e}_j\right)

and then use the tensor laws to get the result

\mathbf{v}\otimes \mathbf{w}= \sum_{i=1}^n \sum_{j=1}^n \alpha_i \beta_j (\mathbf{e}_i \otimes \mathbf{e}_j)

If we have a sum \mathbf{v}_1 \otimes \mathbf{w}_1 + \ldots + \mathbf{v}_k \otimes \mathbf{w}_k, we can reduce each term in the sum to the above form (often called a “normal form”), and then combine like terms.

Therefore, V \otimes V has a basis of \{\mathbf{e}_i \otimes \mathbf{e}_j\}, and this shows it has dimension n^2.
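In coordinates, this normal form is exactly the outer product of the coefficient vectors, which makes the worry about doing a proof search for equality go away: two tensors are equal precisely when their coefficient arrays are equal. A quick sketch with NumPy (the specific vectors are arbitrary):

```python
import numpy as np

def tensor(v, w):
    """Coefficients of v (x) w in the basis {e_i (x) e_j}: the outer product."""
    return np.outer(v, w)

v1, w1 = np.array([1.0, 2.0]), np.array([3.0, 4.0])
v2, w2 = np.array([0.0, 1.0]), np.array([5.0, 6.0])

# The "nice properties" hold on the nose for the coefficient arrays:
assert np.allclose(tensor(2 * v1, w1), 2 * tensor(v1, w1))
assert np.allclose(tensor(v1 + v2, w1), tensor(v1, w1) + tensor(v2, w1))

# A sum of simple tensors reduces to normal form by just adding arrays;
# for n = 2 the normal form has n^2 = 4 coefficients.
t = tensor(v1, w1) + tensor(v2, w2)
```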

Because we’re doing math, let’s generalize. There’s no reason that we have to restrict ourselves to tensoring vectors from the same vector space – we can do the whole thing with V \otimes W, and we end up with vectors in a normal form of \sum_{i=1}^n\sum_{j=1}^m\alpha_i\beta_j(\mathbf{e}_i\otimes\mathbf{e}'_j), where \mathbf{e}_1,\ldots,\mathbf{e}_n is a basis for V and \mathbf{e}'_1,\ldots,\mathbf{e}'_m is a basis for W.

Let’s play around with this. First of all, this means that we can tensor 3 or more vectors from the same space together – we get \mathbf{u}\otimes (\mathbf{v}\otimes \mathbf{w}) \in V \otimes (V \otimes V). It turns out that it doesn’t actually matter which order we tensor in, so we can simply write \mathbf{u}\otimes \mathbf{v}\otimes \mathbf{w}\in V \otimes V \otimes V. A similar argument to before shows that the dimension of V \otimes V \otimes V is n^3.
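In coordinates this is easy to see with NumPy’s einsum: \mathbf{u}\otimes\mathbf{v}\otimes\mathbf{w} becomes a rank-3 array of shape (n, n, n) (so n^3 coefficients), and grouping the product either way gives the same array:

```python
import numpy as np

u, v, w = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])

# u (x) v (x) w in coordinates: entry (i, j, k) is u_i * v_j * w_k.
t = np.einsum('i,j,k->ijk', u, v, w)

# Grouping doesn't matter: (u (x) v) (x) w has the same coefficients.
t2 = np.einsum('ij,k->ijk', np.outer(u, v), w)
```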

We can also (and this turns out to be really interesting) tensor a vector space with its *dual*. You may not have learned about duals in linear algebra, so I’m going to give a quick refresher first. The dual of a vector space V (written V^*) is the set of *linear transformations* from V to \mathbf{R}, also known as the set of *linear functionals* on V. One concrete way of thinking about dual spaces is that if the regular space is the space of *column* vectors, the dual space is the space of *row* vectors, because row vectors act on column vectors via multiplication to produce scalars.

Let’s get a bit more familiar with the dual space before we go any further. First of all, if we have a basis \mathbf{e}_1,\ldots,\mathbf{e}_n for a vector space V, we can form what is called a “dual basis” for V^*, \phi_1, \ldots, \phi_n, by defining \phi_i(\mathbf{e}_j) = [i = j], where [i = j] is the so-called “Iverson bracket”, whose value is 1 if the proposition inside is true and 0 otherwise. Recall that you can define a linear transformation by giving its value on a basis, so the \phi_i are well-defined. In the row-vector model, each \phi_i is a row vector with a 1 in the ith place and 0s everywhere else. You should think about why \phi_i(\mathbf{e}_j) = [i = j] in the row/column vector model.
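In the row/column-vector model this is a one-liner to check: the dual basis functionals are the rows of the identity matrix, the basis vectors are its columns, and \phi_i(\mathbf{e}_j) is row i times column j.

```python
import numpy as np

n = 3
e = np.eye(n)    # columns: e[:, j] is the basis vector e_j
phi = np.eye(n)  # rows: phi[i] is the dual basis functional phi_i

# phi_i(e_j) = [i = j]: a row of the identity times a column of the identity.
values = [[phi[i] @ e[:, j] for j in range(n)] for i in range(n)]
```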

Anyways, V^* \otimes V turns out to be a *very* familiar space. Suppose I have \psi_1 \otimes \mathbf{v}_1 + \ldots + \psi_k \otimes \mathbf{v}_k \in V^* \otimes V, and \mathbf{w}\in V. I can pass in \mathbf{w} to every element of the dual space to get scalars, and then change the tensor products to regular multiplications, and I end up with another element of V. This looks like:

\psi_1(\mathbf{w})\mathbf{v}_1 + \ldots + \psi_k(\mathbf{w})\mathbf{v}_k \in V

Therefore, an element of V^* \otimes V can act like a function V \to V. Moreover, because elements of the dual space are *linear* functionals, an element of V^* \otimes V when considered as a function V \to V is actually a *linear transformation*. And even better, *all* linear transformations can be written as elements of V^* \otimes V. To show this, let L be a linear transformation V \to V, and let \{\phi_i\}, \{\mathbf{e}_j\} be as before. Let \alpha_{ij} be defined by

L\mathbf{e}_i = \alpha_{i1}\mathbf{e}_1 + \ldots + \alpha_{in}\mathbf{e}_n

for each \mathbf{e}_i. Then we can define

\hat{L} = \sum_{i,j}\alpha_{ij}\phi_i \otimes \mathbf{e}_j

By definition, when we plug in \mathbf{e}_i to the linear functionals, every term in \hat{L} drops out except for the ones with \phi_i, and we are left with exactly \alpha_{i1}\mathbf{e}_1 + \ldots + \alpha_{in}\mathbf{e}_n, proving that the interpretation of \hat{L} as a linear transformation is exactly L.
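Here is that correspondence checked numerically. Starting from an arbitrary matrix M for L, the coefficients \alpha_{ij} are read off from the columns of M (so \alpha = M^T under the convention L\mathbf{e}_i = \sum_j \alpha_{ij}\mathbf{e}_j), and applying \hat{L} as described – replacing each \phi_i by the scalar \phi_i(\mathbf{w}) – recovers M\mathbf{w}:

```python
import numpy as np

n = 3
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))  # the matrix of an arbitrary linear map L

# alpha[i, j] is defined by L e_i = sum_j alpha[i, j] e_j; since column i
# of M is L e_i, alpha is just M transposed.
alpha = M.T

def L_hat(w):
    """Apply sum_{i,j} alpha_ij (phi_i (x) e_j) to w: each phi_i becomes
    the scalar phi_i(w) = w[i], each e_j stays a vector."""
    e = np.eye(n)
    return sum(alpha[i, j] * w[i] * e[j] for i in range(n) for j in range(n))
```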

However, this tensor formulation allows one to generalize. For instance, you can consider the space V^* \otimes V^* \otimes V^* \otimes V. Elements of this are functions that take in three vectors and return a vector (and are also linear in each argument, i.e. 3-linear if you read the physics section). You can view each V^* as a “hole” where a vector should go. You can also view each V as a hole where a linear functional can go – you could also view elements of that space as functions that take in two vectors and a linear functional and return a linear functional. If you view V^* \otimes V in this way, it’s like multiplying a matrix by a row vector on the left.

There’s a natural way of talking about tensor spaces that are the product of copies of a specific vector space and its dual. Because we can write an isomorphism between V \otimes W and W \otimes V, all that matters is the number of copies of the vector space, and the number of copies of its dual. If there are i copies of the vector space, and j copies of the dual, we call this space the space of tensors of type (i,j), and we give it the symbol T^{i,j}(V).

Now, I know this is somewhat confusing, because the notation from the previous section for the tangent space to a surface is T_\mathbf{p}(M). To resolve this, note that the tangent space to V when considered as a geometric object is actually always just V again, and define T^{i,j}_\mathbf{p}(M) to be the tensor product of i copies of T_\mathbf{p}(M) and j copies of T_\mathbf{p}(M)^*. Then T_\mathbf{p}(M) = T^{1,0}_\mathbf{p}(M), and the use of notation is actually (somewhat) consistent.

You should only read this section if you have read both previous sections, so if you decided to skip one of them, you should maybe take a break and come back to the other after a bit, and then finally go on to this section.

We can now define an (i,j) tensor field to be a function f that assigns to each point \mathbf{p}\in M an element f_\mathbf{p}\in T^{i,j}_\mathbf{p}(M), where M is some k-dimensional surface. The function g that we were talking about earlier is a (0,2) tensor field, while a normal vector field is a (1,0) tensor field.

And that’s pretty much it. Not very mysterious once you know what’s going on, huh?

If you are a programmer, and you’ve been slogging through this whole post waiting and waiting to learn what this all has to do with computer science, now is the time.

In computer science, a tensor is just a multidimensional array. Typically each array has a “shape”, which is a list of integers that gives the dimension of each index into the array. For instance, an m \times n matrix would have a shape [m,n]. The way this is implemented is that if the shape of the array is [m_1,\ldots,m_k], the array is a space of memory of size \prod_j m_j, and to access the element at index (i_1,\ldots,i_k) (using zero-based indices) you access the memory cell at index \sum_j i_j \prod_{l=j+1}^k m_l.
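That index formula (with zero-based indices) is the standard row-major layout, and it can be evaluated with Horner’s rule; a tiny sketch:

```python
def flat_index(idx, shape):
    """Row-major offset: sum_j idx[j] * prod(shape[j+1:]), via Horner's rule."""
    off = 0
    for i, m in zip(idx, shape):
        assert 0 <= i < m, "index out of bounds"
        off = off * m + i
    return off

# In a 4 x 5 array, element (2, 3) lives at memory cell 2*5 + 3 = 13.
```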

The reason this is connected to the mathematical view of tensors is that given a shape [m_1, \ldots, m_k], an array of this shape actually specifies an element of \mathbf{R}^{m_1} \otimes \ldots \otimes \mathbf{R}^{m_k}. Namely, if e_j refers to the jth standard basis element in \mathbf{R}^n, then an array a_{j_1,\ldots,j_k} corresponds to \sum_{j_1,\ldots,j_k} a_{j_1,\ldots,j_k} e_{j_1} \otimes \ldots \otimes e_{j_k}, where each j_l ranges over [1 \ldots m_l]. Similarly, any element of \mathbf{R}^{m_1} \otimes \ldots \otimes \mathbf{R}^{m_k} gives a multidimensional array by simply writing it out in the standard basis, and then taking the values for the array from the coefficients.

The interesting part of tensors in computer science comes from how they are used, and also from implementations that avoid allocating an incredibly large amount of memory by taking advantage of the fact that most of the elements of most arrays that you see in a specific application are in fact 0. However, I don’t really have much experience with either of these aspects, and this post is already too long, so I’ll stop the computer science section here.

There’s a large distance between knowing the definition of something, and knowing what it actually is. Although I haven’t been terribly rigorous (for instance, I haven’t actually given a definition of tangent space), I hope I have successfully bridged some of that distance. If you want to go on and learn this in more detail, I will conclude by listing some resources. The PDFs for the second two were on the first page of a Google search, but aren’t official in any way (insert disclaimer about legality?).

Linear Algebra Done Wrong, Sergei Treil

This one is very good, and written by one of my professors at Brown! See Chapter 8 for tensors.

Calculus on Manifolds, Michael Spivak

A classic; it does a good job of giving concrete definitions that still manage to be quite general. This will give a good definition of tangent space. If you want to get there fast, read chapters 1 and 2, skip chapters 3 and 4, and then read sections 5.1 and 5.2.

Functional Differential Geometry, Gerald Jay Sussman and Jack Wisdom

“What? I thought Sussman was the scheme guy?” you say. Yup, and this book is based on doing geometry in scheme. If you think that mathematical notation is inconsistent and confusing, and you think it would be much better if notation had to be vetted by computer implementation, this is the book for you. Sometimes the math going on under the surface is opaque; I would suggest reading Calculus on Manifolds for a bit if you get confused.