Essence of Linear Algebra (13): Tensors and Multilinear Algebra
Chen Kai

If you have worked with deep learning, you have certainly encountered the word "tensor"— PyTorch calls it torch.Tensor, and TensorFlow literally has "Tensor" in its name. But what exactly is a tensor? Why do deep learning frameworks use this physics-sounding term?

This chapter starts from the familiar concepts of scalars, vectors, and matrices, and guides you to understand the essence of tensors: they are simply arrays generalized to arbitrary dimensions. We will see how tensor operations naturally describe multidimensional data in images, videos, and recommender systems, and how decomposition techniques like CP and Tucker help us compress and understand these high-dimensional structures.

From Scalars to Tensors: A Natural Generalization of Dimensions

A Number, A Row, A Table, A Cube

Let us start from the simplest objects:

Scalar (0th-order tensor): Just a single number. For example, the temperature, or your bank balance in dollars. It has magnitude but no direction.

Vector (1st-order tensor): A sequence of numbers. For example, your location on a map requires two numbers $(x, y)$ to describe, and an RGB color needs three numbers $(R, G, B)$. Vectors have both magnitude and direction.

Matrix (2nd-order tensor): A table of numbers. An Excel spreadsheet is a typical matrix — it has rows and columns. A grayscale image is also a matrix, where each position stores the brightness of a pixel.

3rd-order tensor: A "data cube." A color image has exactly this structure — height × width × color channels (three for RGB).

The general definition of a tensor naturally follows:

A tensor is a natural generalization of vectors and matrices to arbitrary dimensions. An $N$th-order tensor is an $N$-dimensional array, denoted $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, where $I_1, \ldots, I_N$ are the sizes of each dimension.

Tensors Are Everywhere in Daily Life

You may not realize it, but tensors are everywhere:

| Data Type | Tensor Shape | Order | Intuitive Understanding |
| --- | --- | --- | --- |
| Sentiment score of a sentence | (none) | 0 (scalar) | A single number |
| Weekly temperature of a city | $7$ | 1 | 7 days, one temperature each |
| Grayscale photo | $H \times W$ | 2 | Height × Width pixel matrix |
| Color photo | $H \times W \times 3$ | 3 | Each pixel has RGB values |
| A video clip | $T \times H \times W \times 3$ | 4 | T frames stacked together |
| A batch of training images | $N \times H \times W \times 3$ | 4 | N images packed together |
| User-Movie-Time ratings | $U \times M \times T$ | 3 | User u's rating of movie m at time t |
| EEG data | $C \times T \times F \times S$ | 4 | Channels × Time × Frequency × Sessions |

A concrete example: Suppose you are building a video recommendation system. You have 1000 users, 10000 movies, and user preferences change over time (say, tracked weekly for 52 weeks). This data forms a $1000 \times 10000 \times 52$ third-order tensor! Using a traditional user-movie matrix would lose the temporal dimension entirely.

Order vs. Shape

Two important concepts to distinguish:

  • Order (or Mode): The number of dimensions of a tensor. A vector is 1st-order, a matrix is 2nd-order, an RGB image is 3rd-order. Sometimes called "dimensionality" or "number of modes."

  • Shape: The size of each dimension. An RGB image with height $H$ and width $W$ has shape $(H, W, 3)$ and order 3.

Note: Do not confuse "order" with "rank." Rank is a different concept that we will discuss in detail later.

Internal Structure of Tensors: Fibers and Slices

To understand tensors, you must learn to "see through" their internal structure. We have two core concepts: fibers and slices.

Fibers: "Toothpicks" Inside a Tensor

Imagine you are holding a Rubik's cube (a 3rd-order tensor). If you push a toothpick through the cube along one direction, the sequence of small blocks you pierce forms a "fiber."

Mathematical definition: Fix all indices except one to obtain a vector — that is a fiber.

For a third-order tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$:

  • Mode-1 fiber $\mathbf{x}_{:jk}$: Fix $j$ and $k$, traverse the first dimension, obtaining a vector of length $I$
  • Mode-2 fiber $\mathbf{x}_{i:k}$: Fix $i$ and $k$, obtaining a vector of length $J$
  • Mode-3 fiber $\mathbf{x}_{ij:}$: Fix $i$ and $j$, obtaining a vector of length $K$

Example in image processing: For an RGB image, the mode-3 fiber $\mathbf{x}_{ij:}$ is the color vector $(R, G, B)$ at pixel position $(i, j)$.

Slices: "Thin Sheets" Inside a Tensor

Using the Rubik's cube analogy again: if you cut the cube into thin sheets with a knife, each sheet is a "slice."

Mathematical definition: Fix one index to obtain a matrix — that is a slice.

For a third-order tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$:

  • Horizontal slice $X_{i::}$: Fix $i$, obtain a $J \times K$ matrix
  • Lateral slice $X_{:j:}$: Fix $j$, obtain an $I \times K$ matrix
  • Frontal slice $X_{::k}$: Fix $k$, obtain an $I \times J$ matrix

Example in video processing: For a video, the frontal slice $X_{::t}$ is the image at frame $t$.

Tensor Unfolding (Matricization)

Sometimes we need to "flatten" a high-dimensional tensor into a matrix for processing. This operation is called unfolding or matricization.

Mode-n unfolding: Unfold the tensor along the $n$th mode, where mode-$n$ fibers become columns of the resulting matrix.

For $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, the mode-$n$ unfolding yields a matrix:

$$X_{(n)} \in \mathbb{R}^{I_n \times (I_1 \cdots I_{n-1} I_{n+1} \cdots I_N)}$$

Concrete example: Let $\mathcal{X} \in \mathbb{R}^{3 \times 4 \times 2}$.

  • Mode-1 unfolding: $X_{(1)} \in \mathbb{R}^{3 \times 8}$ (4 × 2 = 8 columns)
  • Mode-2 unfolding: $X_{(2)} \in \mathbb{R}^{4 \times 6}$ (3 × 2 = 6 columns)
  • Mode-3 unfolding: $X_{(3)} \in \mathbb{R}^{2 \times 12}$ (3 × 4 = 12 columns)

Why is unfolding useful? Because many tensor algorithms can be transformed into matrix operations, and we are already very familiar with matrix operations.
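As a quick sketch in NumPy (following TensorLy's C-order convention; other conventions order the columns differently, but the shapes agree):

```python
import numpy as np

def unfold(X, mode):
    """Mode-n unfolding: move the chosen mode to the front,
    then flatten all remaining modes into columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

X = np.arange(24).reshape(3, 4, 2)  # a 3 x 4 x 2 tensor
print(unfold(X, 0).shape)  # (3, 8)
print(unfold(X, 1).shape)  # (4, 6)
print(unfold(X, 2).shape)  # (2, 12)
```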

Basic Tensor Operations

Addition and Scalar Multiplication

These are the simplest operations, completely analogous to vectors and matrices:

Addition: For two tensors of the same shape, add corresponding elements: $(\mathcal{X} + \mathcal{Y})_{ijk} = x_{ijk} + y_{ijk}$.

Scalar multiplication: Multiply every element by the same scalar: $(\alpha \mathcal{X})_{ijk} = \alpha \, x_{ijk}$.

In deep learning, these operations are ubiquitous. For example, the skip connection in ResNet, $\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$, is tensor addition.

Tensor Contraction

Contraction is one of the most important operations on tensors — it generalizes matrix multiplication.

Core idea: Select one dimension from each of two tensors, sum along that dimension, and eliminate it.

The most familiar example is matrix multiplication:

$$C_{ij} = \sum_k A_{ik} B_{kj}$$

Here, the 2nd dimension of $A$ and the 1st dimension of $B$ are "contracted" away.

General tensor contraction: Let $\mathcal{A} \in \mathbb{R}^{I \times J \times K}$ and $\mathcal{B} \in \mathbb{R}^{K \times L \times M}$. Contracting along the shared dimension $K$:

$$c_{ijlm} = \sum_{k=1}^{K} a_{ijk} \, b_{klm}$$

The result $\mathcal{C} \in \mathbb{R}^{I \times J \times L \times M}$ is a 4th-order tensor.

Intuitive understanding: Contraction is like "pairing and summing." Imagine $\mathcal{A}$ and $\mathcal{B}$ each have a dimension whose entries need to be paired. Paired numbers are multiplied and summed, and that dimension disappears.
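In NumPy, contractions are most naturally written with `einsum`. A minimal sketch of the two cases above:

```python
import numpy as np

# Matrix multiplication as a contraction: C_ij = sum_k A_ik B_kj
A2, B2 = np.random.rand(2, 4), np.random.rand(4, 3)
assert np.allclose(np.einsum('ik,kj->ij', A2, B2), A2 @ B2)

# General contraction of an (I,J,K) tensor with a (K,L,M) tensor over K
A = np.random.rand(2, 3, 4)
B = np.random.rand(4, 5, 6)
C = np.einsum('ijk,klm->ijlm', A, B)
print(C.shape)  # (2, 3, 5, 6): a 4th-order tensor
```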

Outer Product — The Basic Way to Build Tensors

The outer product is the "inverse" of contraction — it combines lower-order tensors into higher-order tensors.

Outer product of two vectors: $\mathbf{a} \circ \mathbf{b} = \mathbf{a}\mathbf{b}^T$. The result is a matrix whose $(i, j)$ element is $a_i b_j$.

Outer product of three vectors: $(\mathbf{a} \circ \mathbf{b} \circ \mathbf{c})_{ijk} = a_i b_j c_k$. The result is a third-order tensor.

Rank-1 tensor: A tensor that can be written as an outer product of vectors is called a rank-1 tensor. It is the "simplest" tensor because its structure is completely determined by a few vectors.

An important fact: Any tensor can be written as a sum of rank-1 tensors. This is the theoretical foundation of tensor decomposition.
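A rank-1 tensor can be built directly from its factor vectors, for example:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([1.0, 0.0, -1.0])
c = np.array([3.0, 4.0])

M = np.outer(a, b)                    # matrix outer product: (a ∘ b)_ij = a_i b_j
T = np.einsum('i,j,k->ijk', a, b, c)  # rank-1 third-order tensor

assert M.shape == (2, 3) and T.shape == (2, 3, 2)
assert T[1, 0, 1] == a[1] * b[0] * c[1]  # every element is a product a_i b_j c_k
```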

n-Mode Product

The n-mode product applies a matrix to one mode of a tensor.

Definition: The n-mode product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and a matrix $U \in \mathbb{R}^{J \times I_n}$:

$$(\mathcal{X} \times_n U)_{i_1 \cdots i_{n-1} \, j \, i_{n+1} \cdots i_N} = \sum_{i_n = 1}^{I_n} x_{i_1 i_2 \cdots i_N} \, u_{j i_n}$$

The result is $\mathcal{Y} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$.

Matrix form: Using unfolding, this becomes clearer:

$$Y_{(n)} = U X_{(n)}$$

Intuitive understanding: The n-mode product is "applying a linear transformation along the nth dimension."

Example: For a color image $\mathcal{X} \in \mathbb{R}^{H \times W \times 3}$, if we apply a $3 \times 3$ color transformation matrix $M$ via a 3-mode product:

$$\mathcal{Y} = \mathcal{X} \times_3 M$$

This performs a linear transformation on each pixel's color vector (e.g., white balance adjustment, tone mapping).

Important properties:

  1. Products on different modes can be reordered: $\mathcal{X} \times_m A \times_n B = \mathcal{X} \times_n B \times_m A$ (when $m \neq n$)

  2. Consecutive products on the same mode: $\mathcal{X} \times_n A \times_n B = \mathcal{X} \times_n (BA)$
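The matrix form $Y_{(n)} = U X_{(n)}$ suggests a direct implementation: unfold, multiply, fold back. A sketch (using C-order unfolding, and checked against an equivalent `einsum`):

```python
import numpy as np

def mode_n_product(X, U, mode):
    """Compute X x_n U: unfold, left-multiply by U, fold back."""
    Xn = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)   # X_(n)
    Yn = U @ Xn                                               # Y_(n) = U X_(n)
    rest = [s for i, s in enumerate(X.shape) if i != mode]
    return np.moveaxis(Yn.reshape([U.shape[0]] + rest), 0, mode)

# Color-transform example: apply a 3x3 matrix along the channel mode
X = np.random.rand(4, 5, 3)   # a tiny H x W x 3 "image"
M = np.random.rand(3, 3)
Y = mode_n_product(X, M, 2)
assert np.allclose(Y, np.einsum('hwc,dc->hwd', X, M))
```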

Kronecker and Khatri-Rao Products

These products appear frequently in tensor decompositions.

Kronecker product:

Let $A \in \mathbb{R}^{I \times J}$ and $B \in \mathbb{R}^{K \times L}$. Then:

$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1J} B \\ \vdots & \ddots & \vdots \\ a_{I1} B & \cdots & a_{IJ} B \end{bmatrix} \in \mathbb{R}^{IK \times JL}$$

Intuitive understanding: Replace each element of $A$ with "that element times the entire matrix $B$."

Khatri-Rao product (column-wise Kronecker product):

Let $A \in \mathbb{R}^{I \times R}$ and $B \in \mathbb{R}^{J \times R}$ (same number of columns). Then:

$$A \odot B = \left[ \mathbf{a}_1 \otimes \mathbf{b}_1, \; \mathbf{a}_2 \otimes \mathbf{b}_2, \; \ldots, \; \mathbf{a}_R \otimes \mathbf{b}_R \right] \in \mathbb{R}^{IJ \times R}$$

The Khatri-Rao product plays a central role in CP decomposition.
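Both products are easy to sketch in NumPy (`np.kron` is built in; TensorLy also ships a Khatri-Rao helper):

```python
import numpy as np

A = np.random.rand(2, 3)
B = np.random.rand(4, 3)

K = np.kron(A, B)            # Kronecker: each a_ij replaced by a_ij * B
assert K.shape == (2 * 4, 3 * 3)

# Khatri-Rao: column-wise Kronecker (A and B must share the column count R)
KR = np.stack([np.kron(A[:, r], B[:, r]) for r in range(A.shape[1])], axis=1)
assert KR.shape == (2 * 4, 3)
```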

Tensor Norms

Frobenius Norm

The Frobenius norm of a tensor is the square root of the sum of squares of all elements:

$$\|\mathcal{X}\|_F = \sqrt{\sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} x_{i_1 \cdots i_N}^2}$$

Important property: For any mode-$n$ unfolding, $\|\mathcal{X}\|_F = \|X_{(n)}\|_F$.

This means we can unfold a tensor into a matrix to compute the norm, and the result is unchanged.

Inner Product

The inner product of two tensors with the same shape:

$$\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1} \cdots \sum_{i_N} x_{i_1 \cdots i_N} \, y_{i_1 \cdots i_N}$$

Clearly, $\langle \mathcal{X}, \mathcal{X} \rangle = \|\mathcal{X}\|_F^2$.
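Both definitions, and the unfolding-invariance of the norm, can be checked numerically in a few lines:

```python
import numpy as np

X = np.random.rand(3, 4, 2)
Y = np.random.rand(3, 4, 2)

fro = np.sqrt((X ** 2).sum())          # Frobenius norm of the tensor
for mode in range(3):                  # equals the norm of every unfolding
    Xn = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
    assert np.isclose(fro, np.linalg.norm(Xn))

inner = (X * Y).sum()                  # tensor inner product
assert np.isclose((X * X).sum(), fro ** 2)   # <X, X> = ||X||_F^2
```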

Tensor Products: The Multilinear Mapping Perspective

From a more abstract viewpoint, tensors represent multilinear mappings.

Multilinear Mappings

Linear mappings are familiar: $f: V \to W$ satisfies $f(\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha f(\mathbf{x}) + \beta f(\mathbf{y})$.

Bilinear mappings take two vectors and are linear in each variable: $f: V \times W \to Z$ satisfying:

  • $f(\alpha \mathbf{x}_1 + \beta \mathbf{x}_2, \mathbf{y}) = \alpha f(\mathbf{x}_1, \mathbf{y}) + \beta f(\mathbf{x}_2, \mathbf{y})$
  • $f(\mathbf{x}, \alpha \mathbf{y}_1 + \beta \mathbf{y}_2) = \alpha f(\mathbf{x}, \mathbf{y}_1) + \beta f(\mathbf{x}, \mathbf{y}_2)$

Example: The outer product of vectors, $(\mathbf{u}, \mathbf{v}) \mapsto \mathbf{u}\mathbf{v}^T$, is bilinear.

Multilinear mappings generalize bilinear mappings to take $k$ vector inputs.

Tensor Product Spaces

Given vector spaces $V$ and $W$, their tensor product $V \otimes W$ is a new vector space satisfying the universal property:

For any bilinear map $f: V \times W \to Z$, there exists a unique linear map $\tilde{f}: V \otimes W \to Z$ such that $f(\mathbf{v}, \mathbf{w}) = \tilde{f}(\mathbf{v} \otimes \mathbf{w})$.

Intuitive understanding: The tensor product space converts "bilinear" problems into "linear" problems. This is why tensors are so useful — they let us apply linear algebra tools to multilinear problems.

CP Decomposition: Breaking Tensors into Simple Components

What is CP Decomposition?

CP decomposition (CANDECOMP/PARAFAC) expresses a tensor as a weighted sum of rank-1 tensors:

$$\mathcal{X} \approx \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$$

Or using more compact notation:

$$\mathcal{X} \approx [\![\boldsymbol{\lambda}; A, B, C]\!]$$

where $A = [\mathbf{a}_1, \ldots, \mathbf{a}_R] \in \mathbb{R}^{I \times R}$, $B \in \mathbb{R}^{J \times R}$, $C \in \mathbb{R}^{K \times R}$ are called factor matrices.

Intuitive Understanding: Superimposing Simple Patterns

Imagine you are analyzing a "user-movie-time" rating tensor. CP decomposition tells you:

$$x_{umt} \approx \sum_{r=1}^{R} a_{ur} \, b_{mr} \, c_{tr}$$

Each component $\mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$ represents a "simple pattern":

  • Component 1 might be "young users who like action movies, watching mostly on weekends"
  • Component 2 might be "middle-aged users who like art films, watching mostly in the evening"
  • ...

All simple patterns superimposed approximate the complex original data.

Tensor Rank

The rank of a tensor is the minimum number of rank-1 components needed to represent it exactly:

$$\text{rank}(\mathcal{X}) = \min\left\{ R : \mathcal{X} = \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r \right\}$$

Key differences from matrix rank:

  1. Matrix rank can be computed exactly via SVD, but computing tensor rank is NP-hard!
  2. The best low-rank approximation of a matrix is given by SVD, but the best low-rank approximation of a tensor may not exist (can be approached but not achieved)
  3. Tensor rank can exceed the size of any dimension (while matrix rank cannot exceed the smaller of rows and columns)

Uniqueness of CP Decomposition

An important advantage of CP decomposition is essential uniqueness. Under mild conditions (Kruskal conditions), CP decomposition is unique (up to permutation and scaling of columns).

Kruskal condition: If $k_A + k_B + k_C \geq 2R + 2$, where $k_A$ is the Kruskal rank of $A$ (the largest $k$ such that any $k$ columns of $A$ are linearly independent), then the rank-$R$ CP decomposition is essentially unique.

This uniqueness is a significant advantage over matrix decompositions (like SVD). The factors $U$ and $V$ of a matrix factorization can differ by an orthogonal transformation, while CP decomposition factors are largely determined, which is valuable for interpreting the practical meaning of factors.

Alternating Least Squares (ALS) Algorithm

The most common algorithm for CP decomposition is Alternating Least Squares (ALS):

Idea: Fix all but one factor matrix and optimize that one — this becomes an ordinary least squares problem.

Algorithm steps:

  1. Randomly initialize $A$, $B$, $C$
  2. Fix $B$, $C$, update $A$: $A \leftarrow X_{(1)} (C \odot B) \left[ (C^T C) * (B^T B) \right]^{\dagger}$
  3. Fix $A$, $C$, update $B$: $B \leftarrow X_{(2)} (C \odot A) \left[ (C^T C) * (A^T A) \right]^{\dagger}$
  4. Fix $A$, $B$, update $C$: $C \leftarrow X_{(3)} (B \odot A) \left[ (B^T B) * (A^T A) \right]^{\dagger}$
  5. Repeat steps 2-4 until convergence

where $*$ denotes the Hadamard (element-wise) product and $\dagger$ denotes the pseudo-inverse.

Note: ALS does not guarantee convergence to the global optimum, but typically performs well in practice. Multiple random initializations can be tried, keeping the best result.
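A minimal NumPy sketch of one such update. Note one convention detail: NumPy's C-order reshape produces the mode-1 unfolding $X_{(1)} = A (B \odot C)^T$, so the Khatri-Rao factors appear in the opposite order to the Fortran-order formula given in the algorithm steps:

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product, shape (I*J, R)."""
    return np.stack([np.kron(U[:, r], V[:, r]) for r in range(U.shape[1])], axis=1)

def als_update_A(X, B, C):
    """One ALS step for A with B and C fixed."""
    X1 = X.reshape(X.shape[0], -1)  # mode-1 unfolding (C-order)
    return X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))

# Sanity check: if X is exactly rank-2 with the true B and C,
# a single update recovers A
rng = np.random.default_rng(0)
A0, B, C = rng.random((4, 2)), rng.random((3, 2)), rng.random((2, 2))
X = np.einsum('ir,jr,kr->ijk', A0, B, C)
assert np.allclose(als_update_A(X, B, C), A0)
```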

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Create a simulated user-movie-time rating tensor
# 100 users, 50 movies, 12 months
np.random.seed(42)
n_users, n_movies, n_months = 100, 50, 12
X = np.random.rand(n_users, n_movies, n_months) * 5  # random ratings in [0, 5)

# CP decomposition, extracting 5 latent factors
weights, factors = parafac(tl.tensor(X), rank=5, n_iter_max=100)
user_factors, movie_factors, time_factors = factors

print(f"User factor matrix shape: {user_factors.shape}")    # (100, 5)
print(f"Movie factor matrix shape: {movie_factors.shape}")  # (50, 5)
print(f"Time factor matrix shape: {time_factors.shape}")    # (12, 5)
```

Tucker Decomposition: A More Flexible Tensor Representation

Definition of Tucker Decomposition

Tucker decomposition is a more general form than CP decomposition:

$$\mathcal{X} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C$$

Or in element form:

$$x_{ijk} \approx \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} g_{pqr} \, a_{ip} \, b_{jq} \, c_{kr}$$

where $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$ is the core tensor, and $A \in \mathbb{R}^{I \times P}$, $B \in \mathbb{R}^{J \times Q}$, $C \in \mathbb{R}^{K \times R}$ are factor matrices.

Relationship Between Tucker and CP

CP is a special case of Tucker: When the core tensor $\mathcal{G}$ is "superdiagonal" (only the $g_{rrr}$ are nonzero), Tucker reduces to CP.

Tucker is more flexible: CP requires all three factor matrices to have the same number of columns (rank), while Tucker allows different "ranks" for each mode.

HOSVD: Higher-Order Singular Value Decomposition

HOSVD (Higher-Order SVD) is a special form of Tucker decomposition that guarantees orthogonal factor matrices.

Computation steps:

  1. For each mode-$n$ unfolding $X_{(n)}$, compute its SVD
  2. Take the first $R_n$ left singular vectors as the factor matrix $U^{(n)}$
  3. Compute the core tensor: $\mathcal{G} = \mathcal{X} \times_1 U^{(1)T} \times_2 U^{(2)T} \times_3 U^{(3)T}$

Analogy with matrix SVD:

| Matrix SVD | HOSVD |
| --- | --- |
| $\Sigma$ is diagonal | $\mathcal{G}$ need not be diagonal (but has the "all-orthogonality" property) |
| $U$, $V$ orthogonal | Each $U^{(n)}$ orthogonal |

Applications of HOSVD:

  • Data compression: Retain main components of the core tensor
  • Initialization: Provide good starting point for refined Tucker decomposition
  • Denoising: Truncate small core tensor elements
```python
from tensorly.decomposition import tucker

# Tucker decomposition on the same rating tensor
# Compress 100 users to 10 dims, 50 movies to 8 dims, 12 months to 4 dims
core, factors = tucker(tl.tensor(X), rank=[10, 8, 4])

print(f"Core tensor shape: {core.shape}")  # (10, 8, 4)
print(f"Factor matrices shapes: {[f.shape for f in factors]}")  # [(100, 10), (50, 8), (12, 4)]

# Compression ratio
original_size = 100 * 50 * 12
compressed_size = 10 * 8 * 4 + 100 * 10 + 50 * 8 + 12 * 4
print(f"Compression ratio: {original_size / compressed_size:.2f}x")
```

Multilinear Rank

Tucker decomposition introduces the concept of multilinear rank:

$$\text{rank}_{\text{ml}}(\mathcal{X}) = \left( \text{rank}(X_{(1)}), \, \text{rank}(X_{(2)}), \, \text{rank}(X_{(3)}) \right)$$

The multilinear rank tells you how many dimensions the data essentially has along each mode.
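Since the multilinear rank is just the matrix rank of each unfolding, it is straightforward to compute:

```python
import numpy as np

def multilinear_rank(X):
    """Matrix rank of each mode-n unfolding."""
    return tuple(
        int(np.linalg.matrix_rank(np.moveaxis(X, n, 0).reshape(X.shape[n], -1)))
        for n in range(X.ndim)
    )

# A rank-1 tensor has multilinear rank (1, 1, 1)
rng = np.random.default_rng(0)
a, b, c = rng.random(3), rng.random(4), rng.random(5)
X = np.einsum('i,j,k->ijk', a, b, c)
assert multilinear_rank(X) == (1, 1, 1)
```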

Tensors in Deep Learning: CNN Example

Why are deep learning frameworks called "TensorFlow" or use "Tensor"? Because neural network computations are essentially tensor operations.

Tensor Perspective on Convolution

Consider a standard 2D convolutional layer:

Input: $\mathcal{X} \in \mathbb{R}^{N \times C_{in} \times H \times W}$
  • $N$: batch size
  • $C_{in}$: number of input channels
  • $H$, $W$: height and width

Convolution kernel: $\mathcal{K} \in \mathbb{R}^{C_{out} \times C_{in} \times k_h \times k_w}$
  • $C_{out}$: number of output channels
  • $k_h$, $k_w$: kernel height and width

Output: $\mathcal{Y} \in \mathbb{R}^{N \times C_{out} \times H' \times W'}$. The convolution operation can be written as:

$$y_{n, c_{out}, h, w} = \sum_{c_{in}} \sum_{i} \sum_{j} k_{c_{out}, c_{in}, i, j} \; x_{n, c_{in}, h+i, w+j}$$

This is a complex tensor contraction operation!
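The contraction view can be checked directly in NumPy: extract all sliding patches, then contract them with the kernel in a single `einsum`. This is a sketch of valid, stride-1 convolution (which, as in most deep learning frameworks, is really cross-correlation):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

N, C_in, H, W = 2, 3, 8, 8
C_out, kh, kw = 4, 3, 3
X = np.random.rand(N, C_in, H, W)
K = np.random.rand(C_out, C_in, kh, kw)

# patches: (N, C_in, H', W', kh, kw) where H' = H - kh + 1
patches = sliding_window_view(X, (kh, kw), axis=(2, 3))
Y = np.einsum('nchwij,ocij->nohw', patches, K)
assert Y.shape == (2, 4, 6, 6)

# Spot-check one output element against the sum formula
y = sum(X[0, c, i, j] * K[0, c, i, j] for c in range(C_in)
        for i in range(kh) for j in range(kw))
assert np.isclose(Y[0, 0, 0, 0], y)
```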

Compressing Neural Networks with Tensor Decomposition

Neural network weights are massive tensors. Through tensor decomposition, we can significantly reduce parameter counts.

CP decomposition of convolution kernels:

The original kernel $\mathcal{K} \in \mathbb{R}^{C_{out} \times C_{in} \times d \times d}$ can be approximated as:

$$k_{o, c, i, j} \approx \sum_{r=1}^{R} u_{or} \, v_{cr} \, p_{ir} \, q_{jr}$$

This is equivalent to replacing the original large convolution with 4 smaller convolutions:

  1. $1 \times 1$ convolution: $C_{in} \to R$ channels
  2. $d \times 1$ convolution: depth-wise
  3. $1 \times d$ convolution: depth-wise
  4. $1 \times 1$ convolution: $R \to C_{out}$ channels

Compression effect:

  • Original parameters: $C_{out} \, C_{in} \, d^2$
  • After decomposition: $R \, (C_{in} + C_{out} + 2d)$

When $R$ is small, the compression ratio can be very large.

Practical example: A $3 \times 3$ convolution layer in VGG-16 with 512 input and 512 output channels has about 2.3 million parameters. With rank-64 CP decomposition, parameters drop to about 66,000 — a 35x compression!
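The arithmetic behind these counts can be checked in a few lines (assuming the quoted layer has 512 input and 512 output channels with a 3×3 kernel, which matches the stated 2.3 million parameters):

```python
# Parameter counts for CP-compressing a VGG-style convolution layer
C_in, C_out, d, R = 512, 512, 3, 64

original = C_out * C_in * d * d            # full kernel
compressed = R * (C_in + C_out + 2 * d)    # four small convolutions

print(original)                            # 2359296, about 2.3 million
print(compressed)                          # 65920, about 66 thousand
print(round(original / compressed, 1))     # about 35.8x compression
```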

Tensor Decomposition in Recommender Systems

From Matrices to Tensors: Adding Context

Traditional recommender systems use a user-item matrix $R \in \mathbb{R}^{m \times n}$, where $r_{ui}$ is user $u$'s rating of item $i$. Matrix factorization (like SVD, NMF) seeks:

$$R \approx U V^T$$

where $U \in \mathbb{R}^{m \times k}$ (user features) and $V \in \mathbb{R}^{n \times k}$ (item features).

Problem: User preferences change over time and depend on context. You might want different movies on weekends versus workdays.

Tensor solution: Add a time dimension to create a user-item-time tensor $\mathcal{X} \in \mathbb{R}^{m \times n \times T}$. CP decomposition gives:

$$x_{uit} \approx \sum_{r=1}^{R} a_{ur} \, b_{ir} \, c_{tr}$$

where $c_{tr}$ captures the temporal pattern of factor $r$.

Handling Sparse Data

In real recommender systems, most users rate only a few items. The tensor is extremely sparse (perhaps only 0.1% observed).

CP decomposition with missing values: Only compute the loss at observed positions:

$$\min_{A, B, C} \; \sum_{(u, i, t) \in \Omega} \Big( x_{uit} - \sum_{r=1}^{R} a_{ur} b_{ir} c_{tr} \Big)^2 + \lambda \left( \|A\|_F^2 + \|B\|_F^2 + \|C\|_F^2 \right)$$

where $\Omega$ is the set of observed positions and $\lambda$ is the regularization parameter.
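The masked objective itself is simple to sketch with a 0/1 observation mask (recent TensorLy versions also accept a `mask` argument in `parafac` for the same purpose):

```python
import numpy as np

def masked_cp_loss(X, mask, A, B, C, lam=0.1):
    """Squared error over observed entries (mask == 1) plus L2 regularization."""
    X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
    fit = ((mask * (X - X_hat)) ** 2).sum()
    reg = lam * sum((M ** 2).sum() for M in (A, B, C))
    return fit + reg

# With the true factors and no regularization, the loss is zero
rng = np.random.default_rng(1)
A, B, C = rng.random((3, 2)), rng.random((4, 2)), rng.random((2, 2))
X = np.einsum('ir,jr,kr->ijk', A, B, C)
mask = (rng.random(X.shape) < 0.5).astype(float)
assert np.isclose(masked_cp_loss(X, mask, A, B, C, lam=0.0), 0.0)
```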

Other Tensor Decomposition Methods

Tensor Train (TT) Decomposition

For very high-order tensors (like many-body systems in quantum physics), CP and Tucker complexity can explode. Tensor Train (TT) decomposition is a more scalable method:

$$x_{i_1 i_2 \cdots i_N} = G_1(i_1) \, G_2(i_2) \cdots G_N(i_N)$$

where $G_n(i_n)$ is an $r_{n-1} \times r_n$ matrix ($r_0 = r_N = 1$).

Advantages of TT decomposition:

  • Parameter count grows linearly in the order: $O(N I r^2)$, not exponentially
  • Stable algorithms exist (TT-SVD)
  • Particularly suitable for high-order tensors
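TT-SVD is short enough to sketch in full: sweep the modes left to right, performing a truncated SVD at each step (TensorLy's `tensor_train` implements the same idea):

```python
import numpy as np

def tt_svd(X, max_rank):
    """TT-SVD sketch: sequential truncated SVDs produce the TT cores."""
    shape = X.shape
    cores, r_prev = [], 1
    M = X.reshape(shape[0], -1)
    for n in range(len(shape) - 1):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(U[:, :r].reshape(r_prev, shape[n], r))
        M = (np.diag(s[:r]) @ Vt[:r]).reshape(r * shape[n + 1], -1)
        r_prev = r
    cores.append(M.reshape(r_prev, shape[-1], 1))
    return cores

# Without truncation the decomposition is exact
X = np.random.rand(3, 4, 5)
G1, G2, G3 = tt_svd(X, max_rank=100)
X_rec = np.einsum('aib,bjc,ckd->ijk', G1, G2, G3)
assert np.allclose(X_rec, X)
```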

Non-negative Tensor Factorization (NTF)

When tensor elements are non-negative and interpretable factors are needed, use non-negative tensor factorization:

$$\mathcal{X} \approx \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r, \quad \mathbf{a}_r, \mathbf{b}_r, \mathbf{c}_r \geq 0$$

Applications:

  • Image analysis (pixel values are non-negative)
  • Text mining (word frequencies are non-negative)
  • Chemometrics (concentrations are non-negative)

Non-negativity constraints make factors easier to interpret as "parts" or "components."

Exercises

Conceptual Understanding

Exercise 1: Determine the tensor order of the following data structures:

  1. A mono MP3 song (44.1kHz sampling rate)

  2. A stereo song

  3. A 5-minute 1080p RGB video (30fps)

  4. The ImageNet dataset (about 1 million RGB images)

  5. Attention weights in a Transformer model

Exercise 2: For a tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$:

  1. How many mode-1 fibers are there? What is the length of each?

  2. How many frontal slices are there? What is the shape of each?

  3. What is the shape of the mode-2 unfolding?

Computational Exercises

Exercise 3: Let $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$ be vectors of your choice (say $\mathbf{a} \in \mathbb{R}^2$, $\mathbf{b} \in \mathbb{R}^3$, $\mathbf{c} \in \mathbb{R}^2$).

  1. Compute the outer product $\mathbf{a} \circ \mathbf{b}$

  2. Compute the third-order tensor $\mathcal{T} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c}$ and write out its elements $t_{ijk}$

  3. Compute $\|\mathcal{T}\|_F$

Exercise 4: Let $A$ and $B$ be small matrices of your choice.

  1. Compute the Kronecker product $A \otimes B$

  2. Verify that $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$ (for suitable matrices $C$ and $D$)

Decomposition Exercises

Exercise 5: Consider a rank-2 tensor $\mathcal{X} = \mathbf{a}_1 \circ \mathbf{b}_1 \circ \mathbf{c}_1 + \mathbf{a}_2 \circ \mathbf{b}_2 \circ \mathbf{c}_2$ for factor vectors of your choice.

(a) Compute the specific elements of $\mathcal{X}$
(b) Write out the three factor matrices $A$, $B$, $C$
(c) Compute the mode-1 unfolding $X_{(1)}$

Exercise 6: For the $3 \times 3 \times 3$ identity tensor $\mathcal{I}$ ($i_{jkl} = 1$ if and only if $j = k = l$):

  1. Compute its Frobenius norm

  2. What is its CP rank? (Hint: Consider whether it can be written as fewer than 3 rank-1 tensors)

  3. What is its multilinear rank?

Application Problems

Exercise 7 (Recommender System): A video platform has 3 users, 4 movies, and 2 time periods. Observed ratings:

User Movie Time Period Rating
1 1 1 5
1 2 1 4
2 1 2 3
2 3 1 5
3 2 2 2
3 4 1 4
  1. Construct the rating tensor $\mathcal{X} \in \mathbb{R}^{3 \times 4 \times 2}$ (set unobserved positions to 0)

  2. With rank-2 CP decomposition, how many parameters need to be estimated?

  3. Discuss: What trade-offs might exist between rank-1 vs rank-2 decomposition?

Exercise 8 (Image Compression): An $H \times W$ RGB image can be viewed as an $H \times W \times 3$ tensor.

  1. How many bytes for original storage (assuming 1 byte per pixel value)?

  2. Using Tucker decomposition with multilinear rank $(R_1, R_2, R_3)$, how many values must be stored after compression?

  3. Compute the compression ratio

  4. With CP decomposition of rank 100, how many values must be stored after compression?

Programming Exercises

Exercise 9: Implement the following in Python:

```python
# (a) Implement mode-n unfolding for third-order tensors
def unfold(X, mode):
    """
    Unfold third-order tensor X along mode into a matrix
    X: ndarray of shape (I, J, K)
    mode: 0, 1, or 2
    Returns: unfolded matrix
    """
    pass

# (b) Implement n-mode product
def mode_n_product(X, A, mode):
    """
    Compute mode-n product of tensor X and matrix A
    X: ndarray of shape (I, J, K)
    A: ndarray of shape (P, I_mode)
    mode: 0, 1, or 2
    Returns: product tensor
    """
    pass

# (c) Implement simple CP decomposition (using ALS)
def simple_cp_als(X, rank, n_iter=100):
    """
    CP decomposition of third-order tensor X
    Returns: (A, B, C) three factor matrices
    """
    pass
```

Exercise 10: Using the tensorly library:

  1. Generate a random rank-3 tensor (generate three factor matrices first, then construct the tensor via outer products)

  2. Perform CP decomposition with ranks 2, 3, 4, 5 and compare reconstruction errors

  3. Plot the reconstruction error vs rank curve

  4. Apply HOSVD to the same tensor and compare results

Proof Exercises

Exercise 11: Prove the relationship between the tensor Frobenius norm and unfolding: for any mode $n$, $\|\mathcal{X}\|_F = \|X_{(n)}\|_F$.

Exercise 12: Prove the following properties of n-mode products:

(a) $\mathcal{X} \times_m A \times_n B = \mathcal{X} \times_n B \times_m A$ when $m \neq n$
(b) $\mathcal{X} \times_n A \times_n B = \mathcal{X} \times_n (BA)$

Exercise 13: Let $\mathcal{X} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c}$ be a rank-1 tensor. Prove:

(a) $\|\mathcal{X}\|_F = \|\mathbf{a}\| \, \|\mathbf{b}\| \, \|\mathbf{c}\|$
(b) The mode-1 unfolding is $X_{(1)} = \mathbf{a} \, (\mathbf{c} \otimes \mathbf{b})^T$

Chapter Summary

This chapter systematically covered tensors and multilinear algebra from basic concepts:

Core concepts:
  • Tensors are generalizations of vectors and matrices to arbitrary dimensions
  • Fibers, slices, and unfolding are fundamental tools for analyzing tensor structure
  • Tensor operations include addition, scalar multiplication, contraction, outer product, and n-mode product

Tensor decompositions:
  • CP decomposition represents tensors as sums of rank-1 components with essential uniqueness
  • Tucker decomposition is more flexible, allowing different ranks for different modes
  • HOSVD is orthogonal Tucker decomposition, commonly used for initialization and compression

Practical applications:
  • Deep learning: convolution kernel compression, network acceleration
  • Recommender systems: multi-dimensional context modeling
  • Signal processing, chemometrics, and other fields

The core idea of tensor methods is: decompose complex high-dimensional data structures into combinations of simple components. This allows us to compress data, extract features, and discover hidden patterns. As data dimensionality continues to grow, tensor methods will only become more important.

References

  1. Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455-500.

  2. Sidiropoulos, N. D., et al. (2017). Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13), 3551-3582.

  3. Cichocki, A., et al. (2015). Tensor decompositions for signal processing applications. IEEE Signal Processing Magazine, 32(2), 145-163.

  4. TensorLy Documentation: Python tensor learning library

  5. Strang, G. (2019). Linear Algebra and Learning from Data. Chapter 7.


This is Chapter 13 of the "Essence of Linear Algebra" series.

  • Post title: Essence of Linear Algebra (13): Tensors and Multilinear Algebra
  • Post author: Chen Kai
  • Create time: 2019-03-09 10:15:00
  • Post link: https://www.chenk.top/chapter-13-tensors-and-multilinear-algebra/
  • Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.