If you have worked with deep learning, you have certainly encountered
the word "tensor" — PyTorch calls it torch.Tensor, and
TensorFlow literally has "Tensor" in its name. But what exactly is a
tensor? Why do deep learning frameworks use this physics-sounding
term?
This chapter starts from the familiar concepts of scalars, vectors,
and matrices, and guides you to understand the essence of tensors: they
are simply arrays generalized to arbitrary dimensions. We will see how
tensor operations naturally describe multidimensional data in images,
videos, and recommender systems, and how decomposition techniques like
CP and Tucker help us compress and understand these high-dimensional
structures.
From
Scalars to Tensors: A Natural Generalization of Dimensions
A Number, A Row, A Table, A
Cube
Let us start from the simplest objects:
Scalar (0th-order tensor): Just a single number. For example, a temperature reading, or your bank balance in dollars. It has magnitude but no direction.
Vector (1st-order tensor): A sequence of numbers. For example, your location on a map requires two numbers $(x, y)$ to describe, and an RGB color needs three numbers $(r, g, b)$. Vectors have both magnitude and direction.
Matrix (2nd-order tensor): A table of numbers. An
Excel spreadsheet is a typical matrix — it has rows and columns. A
grayscale image is also a matrix, where each position stores the
brightness of a pixel.
3rd-order tensor: A "data cube." A color image has
exactly this structure — height × width × color channels (three for
RGB).
The general definition of a tensor naturally follows:

A tensor is a natural generalization of vectors and matrices to arbitrary dimensions. An $N$th-order tensor is an $N$-dimensional array, denoted $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, where $I_1, I_2, \ldots, I_N$ are the sizes of each dimension.
Tensors Are Everywhere in
Daily Life
You may not realize it, but tensors are everywhere:
| Data Type | Tensor Shape | Order | Intuitive Understanding |
|---|---|---|---|
| Sentiment score of a sentence | (scalar) | 0 (scalar) | A single number |
| Weekly temperature of a city | $(7)$ | 1 | 7 days, one temperature each |
| Grayscale photo | $(H, W)$ | 2 | Height × Width pixel matrix |
| Color photo | $(H, W, 3)$ | 3 | Each pixel has RGB values |
| A video clip | $(T, H, W, 3)$ | 4 | T frames stacked together |
| A batch of training images | $(N, H, W, 3)$ | 4 | N images packed together |
| User-Movie-Time ratings | $(U, M, T)$ | 3 | User u's rating of movie m at time t |
| EEG data | $(C, T, F, S)$ | 4 | Channels × Time × Frequency × Sessions |
A concrete example: Suppose you are building a video recommendation system. You have 1000 users, 10000 movies, and user preferences change over time (say, tracked weekly for 52 weeks). This data forms a $1000 \times 10000 \times 52$ third-order tensor! Using a traditional user-movie matrix would lose the temporal dimension entirely.
Order vs. Shape
Two important concepts to distinguish:
Order (or Mode): The number of dimensions of a
tensor. A vector is 1st-order, a matrix is 2nd-order, an RGB image is
3rd-order. Sometimes called "dimensionality" or "number of
modes."
Shape: The size of each dimension. An RGB image of height $H$ and width $W$ has shape $(H, W, 3)$ and order 3.
Note: Do not confuse "order" with "rank." Rank is a different concept
that we will discuss in detail later.
Internal
Structure of Tensors: Fibers and Slices
To understand tensors, you must learn to "see through" their internal
structure. We have two core concepts: fibers and
slices.
Fibers: "Toothpicks" Inside a
Tensor
Imagine you are holding a Rubik's cube (a 3rd-order tensor). If you
push a toothpick through the cube along one direction, the sequence of
small blocks you pierce forms a "fiber."
Mathematical definition: Fix all indices except one to obtain a vector — that is a fiber.

For a third-order tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$:

Mode-1 fiber $\mathbf{x}_{:jk}$: Fix $j$ and $k$, traverse the first dimension, obtaining a vector of length $I$

Mode-2 fiber $\mathbf{x}_{i:k}$: Fix $i$ and $k$, obtaining a vector of length $J$

Mode-3 fiber $\mathbf{x}_{ij:}$: Fix $i$ and $j$, obtaining a vector of length $K$

Example in image processing: For an RGB image, the mode-3 fiber $\mathbf{x}_{ij:}$ is the color vector $(R, G, B)$ at pixel position $(i, j)$.
Slices: "Thin Sheets" Inside
a Tensor
Using the Rubik's cube analogy again: if you cut the cube into
several thin slices with a knife, each slice is a "slice."
Mathematical definition: Fix one index to obtain a matrix — that is a slice.

For a third-order tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$:

Horizontal slice $\mathbf{X}_{i::}$: Fix $i$, obtain a $J \times K$ matrix

Lateral slice $\mathbf{X}_{:j:}$: Fix $j$, obtain an $I \times K$ matrix

Frontal slice $\mathbf{X}_{::k}$: Fix $k$, obtain an $I \times J$ matrix

Example in video processing: For a video, the frontal slice $\mathbf{X}_{::t}$ is the image at frame $t$.
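These definitions map directly onto NumPy indexing; a small sketch (the tensor values and shape are illustrative):

```python
import numpy as np

# A small third-order tensor: shape (I, J, K) = (2, 3, 4)
X = np.arange(24).reshape(2, 3, 4)

# Fibers: fix all indices except one
mode1_fiber = X[:, 1, 2]   # fix j=1, k=2 -> vector of length I = 2
mode2_fiber = X[0, :, 2]   # fix i=0, k=2 -> vector of length J = 3
mode3_fiber = X[0, 1, :]   # fix i=0, j=1 -> vector of length K = 4

# Slices: fix exactly one index
horizontal = X[0, :, :]    # fix i -> J x K matrix
lateral    = X[:, 1, :]    # fix j -> I x K matrix
frontal    = X[:, :, 2]    # fix k -> I x J matrix
```

Each fiber is obtained with one `:` in the index, each slice with two — exactly "fix all but one index" versus "fix one index."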
Tensor Unfolding
(Matricization)
Sometimes we need to "flatten" a high-dimensional tensor into a
matrix for processing. This operation is called
unfolding or matricization.
Mode-n unfolding $X_{(n)}$: Unfold the tensor along the $n$th mode, so that mode-n fibers become columns of the resulting matrix.
Why is unfolding useful? Because many tensor
algorithms can be transformed into matrix operations, and we are already
very familiar with matrix operations.
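One way to sketch mode-n unfolding in NumPy (note: the column ordering below follows NumPy's C order, which may differ from the ordering used in some texts, but the columns are the same set of mode-n fibers):

```python
import numpy as np

def unfold(X, mode):
    """Mode-n unfolding: mode-n fibers become the columns of the result.
    Column ordering follows NumPy's C order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

X = np.arange(24).reshape(2, 3, 4)   # (I, J, K) = (2, 3, 4)
X1 = unfold(X, 0)                    # shape (2, 12)
# In this ordering, column j*K + k of the mode-1 unfolding is the fiber X[:, j, k]
```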
Basic Tensor Operations
Addition and Scalar
Multiplication
These are the simplest operations, completely analogous to vectors
and matrices:
Addition: For two tensors of the same shape, add corresponding elements: $(\mathcal{X} + \mathcal{Y})_{ijk} = x_{ijk} + y_{ijk}$.

Scalar multiplication: Multiply every element by the same scalar: $(\alpha \mathcal{X})_{ijk} = \alpha \, x_{ijk}$.

In deep learning, these operations are ubiquitous. For example, the skip connection in ResNet, $\mathcal{Y} = F(\mathcal{X}) + \mathcal{X}$, is tensor addition.
Tensor Contraction
Contraction is one of the most important operations on tensors — it
generalizes matrix multiplication.
Core idea: Select one dimension from each of two
tensors, sum along that dimension, and eliminate it.
The most familiar example — matrix multiplication:

$$C_{ik} = \sum_{j} A_{ij} B_{jk}$$

Here, the 2nd dimension of $A$ and the 1st dimension of $B$ are "contracted" away.

General tensor contraction: Let $\mathcal{A} \in \mathbb{R}^{I \times J \times K}$ and $\mathcal{B} \in \mathbb{R}^{K \times L \times M}$. Contracting along the shared dimension $K$:

$$\mathcal{C}_{ijlm} = \sum_{k=1}^{K} \mathcal{A}_{ijk} \, \mathcal{B}_{klm}$$

The result $\mathcal{C} \in \mathbb{R}^{I \times J \times L \times M}$ is a 4th-order tensor.

Intuitive understanding: Contraction is like "pairing and summing." Imagine $\mathcal{A}$ and $\mathcal{B}$ each have a row of numbers that need to be paired. Paired numbers are multiplied and summed, and that dimension disappears.
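Both the matrix case and the general case above can be checked directly with NumPy (shapes here are illustrative):

```python
import numpy as np

# Matrix multiplication is a contraction over the shared index j
A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
C = np.einsum('ij,jk->ik', A, B)            # same as A @ B

# General contraction: (I x J x K) with (K x L x M), contracted over K
T1 = np.random.rand(2, 3, 4)
T2 = np.random.rand(4, 5, 6)
C4 = np.tensordot(T1, T2, axes=([2], [0]))  # shape (2, 3, 5, 6): 4th-order
```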
Outer Product —
The Basic Way to Build Tensors
The outer product is the "inverse" of contraction — it combines
lower-order tensors into higher-order tensors.
Outer product of two vectors:

$$\mathbf{A} = \mathbf{a} \circ \mathbf{b}, \qquad A_{ij} = a_i b_j$$

The result is a matrix where the $(i, j)$ element is $a_i b_j$.

Outer product of three vectors:

$$\mathcal{T} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c}, \qquad t_{ijk} = a_i b_j c_k$$

The result is a third-order tensor.
Rank-1 tensor: A tensor that can be written as an
outer product of vectors is called a rank-1 tensor. It is the "simplest"
tensor because its structure is completely determined by a few
vectors.
An important fact: Any tensor can be written as a
sum of rank-1 tensors. This is the theoretical foundation of tensor
decomposition.
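A rank-1 tensor is cheap to build and store because only its generating vectors are needed; a short NumPy sketch (vectors chosen arbitrarily):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0, 5.0])
c = np.array([1.0, -1.0])

# Outer product of two vectors -> matrix, (a o b)_ij = a_i * b_j
M = np.outer(a, b)

# Outer product of three vectors -> third-order rank-1 tensor
T = np.einsum('i,j,k->ijk', a, b, c)
# T[i, j, k] equals a[i] * b[j] * c[k]
```

Storing `T` directly takes $2 \cdot 3 \cdot 2 = 12$ numbers, but the three vectors take only $2 + 3 + 2 = 7$ — the gap widens rapidly in higher dimensions.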
n-Mode Product
The n-mode product applies a matrix to one mode of a tensor.
Definition: The n-mode product of tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and matrix $U \in \mathbb{R}^{J \times I_n}$:

$$(\mathcal{X} \times_n U)_{i_1 \cdots i_{n-1} \, j \, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} x_{i_1 i_2 \cdots i_N} \, u_{j i_n}$$

The result $\mathcal{Y} = \mathcal{X} \times_n U \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$.

Matrix form: Using unfolding, this becomes clearer:

$$\mathcal{Y} = \mathcal{X} \times_n U \quad \Longleftrightarrow \quad Y_{(n)} = U X_{(n)}$$

Intuitive understanding: The n-mode product is "applying a linear transformation along the $n$th dimension."

Example: For a color image $\mathcal{X} \in \mathbb{R}^{H \times W \times 3}$, if we apply a $3 \times 3$ color transformation matrix $M$ via a 3-mode product, $\mathcal{Y} = \mathcal{X} \times_3 M$, this performs a linear transformation on each pixel's color vector (e.g., white balance adjustment, tone mapping).
Important properties:

Products on different modes can be reordered: $\mathcal{X} \times_m A \times_n B = \mathcal{X} \times_n B \times_m A$ (when $m \neq n$)

Consecutive products on the same mode: $\mathcal{X} \times_n A \times_n B = \mathcal{X} \times_n (BA)$
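A minimal NumPy sketch of the n-mode product, checking both the matrix form $Y_{(n)} = U X_{(n)}$ and the same-mode property (shapes and the random seed are illustrative):

```python
import numpy as np

def mode_n_product(X, U, mode):
    """Apply matrix U (shape J x I_mode) along the given mode of tensor X."""
    return np.moveaxis(np.tensordot(U, X, axes=(1, mode)), 0, mode)

def unfold(X, mode):
    """Mode-n unfolding with mode-n fibers as columns (NumPy C order)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

rng = np.random.default_rng(0)
X = rng.random((2, 3, 4))
U = rng.random((5, 3))          # acts on mode 1: size 3 -> 5
Y = mode_n_product(X, U, 1)     # shape (2, 5, 4)

# Matrix form of the definition: Y_(n) = U @ X_(n)
assert np.allclose(unfold(Y, 1), U @ unfold(X, 1))
```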
Kronecker and Khatri-Rao
Products
These products appear frequently in tensor decompositions.
Kronecker product: Let $A \in \mathbb{R}^{I \times J}$, $B \in \mathbb{R}^{K \times L}$, then:

$$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1J} B \\ \vdots & \ddots & \vdots \\ a_{I1} B & \cdots & a_{IJ} B \end{pmatrix} \in \mathbb{R}^{IK \times JL}$$

Intuitive understanding: Replace each element of $A$ with "that element times the entire matrix $B$."

Khatri-Rao product: Let $A \in \mathbb{R}^{I \times R}$, $B \in \mathbb{R}^{J \times R}$ (same number of columns), then:

$$A \odot B = [\, \mathbf{a}_1 \otimes \mathbf{b}_1, \; \mathbf{a}_2 \otimes \mathbf{b}_2, \; \ldots, \; \mathbf{a}_R \otimes \mathbf{b}_R \,] \in \mathbb{R}^{IJ \times R}$$

The Khatri-Rao product plays a central role in CP decomposition.
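NumPy provides the Kronecker product directly, and the Khatri-Rao product is a one-liner on top of it; a sketch with illustrative shapes:

```python
import numpy as np

A = np.random.rand(3, 2)
B = np.random.rand(4, 2)

# Kronecker product: (3 x 2) kron (4 x 2) -> (12, 4)
K = np.kron(A, B)

def khatri_rao(A, B):
    """Column-wise Kronecker product: column r is kron(A[:, r], B[:, r])."""
    I, R = A.shape
    J, _ = B.shape
    return np.einsum('ir,jr->ijr', A, B).reshape(I * J, R)

KR = khatri_rao(A, B)   # shape (12, 2): same number of columns as A and B
```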
Tensor Norms
Frobenius Norm
The Frobenius norm of a tensor is the square root of the sum of squares of all elements:

$$\|\mathcal{X}\|_F = \sqrt{\sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} x_{i_1 \cdots i_N}^2}$$

Important property: For any mode-n unfolding, $\|\mathcal{X}\|_F = \|X_{(n)}\|_F$. This means we can unfold a tensor into a matrix to compute the norm, and the result is unchanged.
Inner Product
The inner product of two tensors with the same shape: $\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1 \cdots i_N} x_{i_1 \cdots i_N} \, y_{i_1 \cdots i_N}$. Clearly, $\langle \mathcal{X}, \mathcal{X} \rangle = \|\mathcal{X}\|_F^2$.
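Both the norm-unfolding invariance and the norm/inner-product relation are easy to verify numerically (shapes are illustrative):

```python
import numpy as np

X = np.random.rand(2, 3, 4)

fro = np.sqrt((X ** 2).sum())          # Frobenius norm of the tensor

# The norm is invariant under any mode-n unfolding
unfold = lambda T, n: np.moveaxis(T, n, 0).reshape(T.shape[n], -1)
for n in range(3):
    assert np.isclose(fro, np.linalg.norm(unfold(X, n)))

# Inner product of two same-shaped tensors; <X, X> = ||X||_F^2
Y = np.random.rand(2, 3, 4)
inner = (X * Y).sum()
```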
Tensor
Products: The Multilinear Mapping Perspective
From a more abstract viewpoint, tensors represent multilinear
mappings.
Multilinear Mappings
Linear mappings are familiar: $f$ satisfies $f(\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha f(\mathbf{x}) + \beta f(\mathbf{y})$.

Bilinear mappings take two vectors and are linear in each variable: $f: V \times W \to Z$, satisfying:
- $f(\alpha \mathbf{x}_1 + \beta \mathbf{x}_2, \mathbf{y}) = \alpha f(\mathbf{x}_1, \mathbf{y}) + \beta f(\mathbf{x}_2, \mathbf{y})$
- $f(\mathbf{x}, \alpha \mathbf{y}_1 + \beta \mathbf{y}_2) = \alpha f(\mathbf{x}, \mathbf{y}_1) + \beta f(\mathbf{x}, \mathbf{y}_2)$

Example: The outer product of vectors, $(\mathbf{x}, \mathbf{y}) \mapsto \mathbf{x} \circ \mathbf{y}$, is bilinear.

Multilinear mappings generalize bilinear mappings to take $N$ vector inputs.
Tensor Product Spaces
Given vector spaces $V$ and $W$, their tensor product $V \otimes W$ is a new vector space satisfying:

For any bilinear map $f: V \times W \to Z$, there exists a unique linear map $\tilde{f}: V \otimes W \to Z$ such that $f(\mathbf{v}, \mathbf{w}) = \tilde{f}(\mathbf{v} \otimes \mathbf{w})$.
Intuitive understanding: The tensor product space
converts "bilinear" problems into "linear" problems. This is why tensors
are so useful — they let us apply linear algebra tools to multilinear
problems.
CP
Decomposition: Breaking Tensors into Simple Components
What is CP Decomposition?
CP decomposition (CANDECOMP/PARAFAC) expresses a tensor as a weighted sum of rank-1 tensors:

$$\mathcal{X} \approx \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$$

Or using more compact notation: $\mathcal{X} \approx [\![ \boldsymbol{\lambda};\, A, B, C ]\!]$, where $A = [\mathbf{a}_1, \ldots, \mathbf{a}_R]$, $B = [\mathbf{b}_1, \ldots, \mathbf{b}_R]$, $C = [\mathbf{c}_1, \ldots, \mathbf{c}_R]$ are called factor matrices.

Imagine you are analyzing a "user-movie-time" rating tensor. CP decomposition tells you:

$$\mathcal{X}_{\text{ratings}} \approx \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$$

Each component $\mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$ represents a "simple pattern":
Component 1 might be "young users who like action movies, watching
mostly on weekends"
Component 2 might be "middle-aged users who like art films, watching
mostly in the evening"
...
All simple patterns superimposed approximate the complex original
data.
Tensor Rank
The rank of a tensor is the minimum number of rank-1 components needed to represent it exactly:

$$\operatorname{rank}(\mathcal{X}) = \min \left\{ R : \mathcal{X} = \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r \right\}$$
Key differences from matrix rank:
Matrix rank can be computed exactly via SVD, but computing
tensor rank is NP-hard!
The best low-rank approximation of a matrix is given by SVD, but the
best low-rank approximation of a tensor may not exist (can be approached
but not achieved)
Tensor rank can exceed the size of any dimension (while matrix rank
cannot exceed the smaller of rows and columns)
Uniqueness of CP
Decomposition
An important advantage of CP decomposition is essential
uniqueness. Under mild conditions (Kruskal conditions), CP
decomposition is unique (up to permutation and scaling of columns).
Kruskal condition: If $k_A + k_B + k_C \geq 2R + 2$, the rank-$R$ CP decomposition is essentially unique, where $k_A$ is the Kruskal rank of $A$ (the largest $k$ such that any $k$ columns of $A$ are linearly independent).
This uniqueness is a significant advantage over matrix decompositions (like SVD). The $U$ and $V$ from SVD can differ by an orthogonal transformation, while CP decomposition factors are largely determined, which is valuable for interpreting the practical meaning of factors.
Alternating Least
Squares (ALS) Algorithm
The most common algorithm for CP decomposition is Alternating
Least Squares (ALS):
Idea: Fix all but one factor matrix and optimize
that one — this becomes an ordinary least squares problem.
Note: ALS does not guarantee convergence to the
global optimum, but typically performs well in practice. Multiple random
initializations can be tried, keeping the best result.
```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Build a synthetic user-movie-month rating tensor:
# 100 users x 50 movies x 12 months, with a known rank-3 structure
rng = np.random.RandomState(0)
A = rng.rand(100, 3)
B = rng.rand(50, 3)
C = rng.rand(12, 3)
X = np.einsum('ur,mr,tr->umt', A, B, C)

# Rank-3 CP decomposition via ALS
cp_tensor = parafac(tl.tensor(X), rank=3)
X_hat = tl.cp_to_tensor(cp_tensor)
print("Relative error:", tl.norm(X_hat - tl.tensor(X)) / tl.norm(tl.tensor(X)))
```
Tucker
Decomposition: A More Flexible Tensor Representation
Definition of Tucker
Decomposition
Tucker decomposition is a more general form than CP decomposition:

$$\mathcal{X} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C$$

Or in element form:

$$x_{ijk} \approx \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} g_{pqr} \, a_{ip} \, b_{jq} \, c_{kr}$$

where $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$ is the core tensor, and $A \in \mathbb{R}^{I \times P}$, $B \in \mathbb{R}^{J \times Q}$, $C \in \mathbb{R}^{K \times R}$ are factor matrices.
Relationship Between Tucker
and CP
CP is a special case of Tucker: When the core tensor $\mathcal{G}$ is "superdiagonal" (only $g_{rrr}$ nonzero), Tucker reduces to CP.

Tucker is more flexible: CP requires all three factor matrices to have the same number of columns (rank $R$), while Tucker allows different "ranks" $(P, Q, R)$ for each mode.
HOSVD:
Higher-Order Singular Value Decomposition
HOSVD (Higher-Order SVD) is a special form of Tucker
decomposition that guarantees orthogonal factor matrices.
Computation steps:

1. For each mode-n unfolding $X_{(n)}$, compute its SVD
2. Take the first $R_n$ left singular vectors as the factor matrix $U^{(n)}$
3. Compute the core tensor: $\mathcal{G} = \mathcal{X} \times_1 U^{(1)T} \times_2 U^{(2)T} \times_3 U^{(3)T}$

Analogy with matrix SVD:

| Matrix SVD | HOSVD |
|---|---|
| $\Sigma$ is diagonal | $\mathcal{G}$ need not be diagonal (but has the "all-orthogonality" property) |
| $U$, $V$ orthogonal | Each $U^{(n)}$ orthogonal |
Applications of HOSVD:
Data compression: Retain main components of the core tensor
Initialization: Provide good starting point for refined Tucker
decomposition
Denoising: Truncate small core tensor elements
```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

# Tucker decomposition on the same rating tensor
# Compress 100 users to 10 dims, 50 movies to 8 dims, 12 months to 4 dims
core, factors = tucker(tl.tensor(X), rank=[10, 8, 4])

print(f"Core tensor shape: {core.shape}")                        # (10, 8, 4)
print(f"Factor matrices shapes: {[f.shape for f in factors]}")   # [(100, 10), (50, 8), (12, 4)]
```
Tucker decomposition introduces the concept of multilinear rank:

$$\operatorname{rank}_n(\mathcal{X}) = \operatorname{rank}(X_{(n)})$$

The multilinear rank is the tuple $(R_1, R_2, R_3)$; it tells you how many dimensions the data essentially has along each mode.
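Unlike the CP rank, the multilinear rank can be read off numerically from the ranks of the unfoldings. A sketch with a tensor deliberately constructed to have multilinear rank $(2, 2, 1)$ (the construction and seed are illustrative):

```python
import numpy as np

# Two distinct factors along modes 1 and 2, but a single shared mode-3 factor
rng = np.random.default_rng(0)
a1, a2 = rng.random(5), rng.random(5)
b1, b2 = rng.random(6), rng.random(6)
c = rng.random(7)
X = np.einsum('i,j,k->ijk', a1, b1, c) + np.einsum('i,j,k->ijk', a2, b2, c)

unfold = lambda T, n: np.moveaxis(T, n, 0).reshape(T.shape[n], -1)
mlrank = tuple(int(np.linalg.matrix_rank(unfold(X, n))) for n in range(3))
# For generic random factors this gives (2, 2, 1)
```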
Tensors in Deep Learning:
CNN Example
Why are deep learning frameworks called "TensorFlow" or use "Tensor"?
Because neural network computations are essentially tensor
operations.
Tensor Perspective on
Convolution
Consider a standard 2D convolutional layer:
Input: $\mathcal{X} \in \mathbb{R}^{N \times C_{\text{in}} \times H \times W}$
- $N$: batch size
- $C_{\text{in}}$: number of input channels
- $H$, $W$: height and width

Convolution kernel: $\mathcal{K} \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times K_h \times K_w}$
- $C_{\text{out}}$: number of output channels
- $K_h$, $K_w$: kernel height and width

Output: $\mathcal{Y} \in \mathbb{R}^{N \times C_{\text{out}} \times H' \times W'}$

The convolution operation can be written as:

$$\mathcal{Y}_{n, c_{\text{out}}, h, w} = \sum_{c_{\text{in}}} \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} \mathcal{K}_{c_{\text{out}}, c_{\text{in}}, i, j} \, \mathcal{X}_{n, c_{\text{in}}, h+i, w+j}$$

This is a complex tensor contraction operation!
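That contraction view can be sketched directly in NumPy — extract all sliding patches, then contract over the input channels and kernel positions with one einsum (a minimal sketch: valid mode, stride 1, no padding; all shapes are illustrative):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

N, C_in, H, W = 2, 3, 6, 6
C_out, Kh, Kw = 4, 3, 3
X = np.random.rand(N, C_in, H, W)
K = np.random.rand(C_out, C_in, Kh, Kw)

# All Kh x Kw patches: shape (N, C_in, H', W', Kh, Kw)
patches = sliding_window_view(X, (Kh, Kw), axis=(2, 3))

# Contract over input channels and kernel positions
Y = np.einsum('ncxyij,ocij->noxy', patches, K)   # (N, C_out, H-Kh+1, W-Kw+1)
```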
Compressing
Neural Networks with Tensor Decomposition
Neural network weights are massive tensors. Through tensor
decomposition, we can significantly reduce parameter counts.
CP decomposition of convolution kernels: The original kernel $\mathcal{K} \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times K_h \times K_w}$ can be approximated as:

$$\mathcal{K}_{c_{\text{out}}, c_{\text{in}}, i, j} \approx \sum_{r=1}^{R} a_{c_{\text{out}}, r} \, b_{c_{\text{in}}, r} \, u_{i, r} \, v_{j, r}$$

This is equivalent to replacing the original large convolution with 4 smaller convolutions: a $1 \times 1$ convolution ($C_{\text{in}} \to R$ channels), a $K_h \times 1$ and a $1 \times K_w$ convolution applied channel-wise, and a final $1 \times 1$ convolution ($R \to C_{\text{out}}$ channels).

Parameter count before decomposition: $C_{\text{out}} C_{\text{in}} K_h K_w$. After decomposition: $R \, (C_{\text{out}} + C_{\text{in}} + K_h + K_w)$. When $R$ is small, the compression ratio can be very large.

Practical example: A $3 \times 3$ convolution layer in VGG-16 with 512 input and 512 output channels has about 2.36 million parameters. With rank-64 CP decomposition, parameters drop to about 66,000 — a 35x compression!
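The arithmetic behind that figure is worth checking directly (assuming, as above, a 512-in/512-out $3 \times 3$ layer):

```python
# Parameter count for a CP-compressed convolution layer
C_out, C_in, Kh, Kw, R = 512, 512, 3, 3, 64

original = C_out * C_in * Kh * Kw           # 2,359,296 (~2.36M)
compressed = R * (C_out + C_in + Kh + Kw)   # 65,920 (~66k)
ratio = original / compressed               # ~35.8x
```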
Tensor
Decomposition in Recommender Systems
From Matrices to
Tensors: Adding Context
Traditional recommender systems use a user-item matrix $R \in \mathbb{R}^{|U| \times |I|}$, where $r_{ui}$ is user $u$'s rating of item $i$. Matrix factorization (like SVD, NMF) seeks:

$$R \approx P Q^T$$

where $P \in \mathbb{R}^{|U| \times k}$ (user features), $Q \in \mathbb{R}^{|I| \times k}$ (item features).

Problem: User preferences change over time and depend on context. You might want different movies on weekends versus workdays.

Tensor solution: Add a time dimension to create a user-item-time tensor $\mathcal{X} \in \mathbb{R}^{|U| \times |I| \times |T|}$. CP decomposition gives:

$$x_{uit} \approx \sum_{r=1}^{R} a_{ur} \, b_{ir} \, c_{tr}$$

where $c_{tr}$ captures the temporal pattern of factor $r$.
Handling Sparse Data
In real recommender systems, most users rate only a few items. The
tensor is extremely sparse (perhaps only 0.1% observed).
CP decomposition with missing values: Only compute loss at observed positions:

$$\min_{A, B, C} \; \sum_{(u,i,t) \in \Omega} \left( x_{uit} - \sum_{r=1}^{R} a_{ur} b_{ir} c_{tr} \right)^2 + \lambda \left( \|A\|_F^2 + \|B\|_F^2 + \|C\|_F^2 \right)$$

where $\Omega$ is the set of observed positions and $\lambda$ is the regularization parameter.
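The masked objective itself is easy to write down as a NumPy sketch (all names, shapes, and the observation rate are illustrative; tensorly's `parafac` also accepts a `mask` argument for fitting with missing values):

```python
import numpy as np

def masked_cp_loss(X, mask, A, B, C, lam):
    """Squared error on observed entries only, plus L2 regularization.
    mask[u, i, t] = 1 where a rating was observed, 0 elsewhere."""
    Xhat = np.einsum('ur,ir,tr->uit', A, B, C)
    err = (((X - Xhat) ** 2) * mask).sum()
    reg = lam * sum((M ** 2).sum() for M in (A, B, C))
    return err + reg

# Tiny example: 3 users, 4 items, 2 time periods, rank 2
rng = np.random.default_rng(0)
X = rng.random((3, 4, 2))
mask = (rng.random((3, 4, 2)) < 0.3).astype(float)   # ~30% observed
A, B, C = rng.random((3, 2)), rng.random((4, 2)), rng.random((2, 2))
loss = masked_cp_loss(X, mask, A, B, C, lam=0.1)
```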
Other Tensor Decomposition
Methods
Tensor Train (TT)
Decomposition
For very high-order tensors (like multi-body systems in quantum physics), CP and Tucker complexity can explode. Tensor Train (TT) decomposition is a more scalable method:

$$x_{i_1 i_2 \cdots i_N} = G_1[i_1] \, G_2[i_2] \cdots G_N[i_N]$$

where each $G_n[i_n]$ is an $r_{n-1} \times r_n$ matrix ($r_0 = r_N = 1$).

Advantages of TT decomposition:

- Parameter count grows linearly: $O(N I r^2)$, not exponentially
- Stable algorithms exist (TT-SVD)
- Particularly suitable for high-order tensors
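A toy evaluation of the TT format in NumPy makes the "product of small matrices" structure concrete (core sizes and TT-ranks below are illustrative):

```python
import numpy as np

# TT cores for a 4 x 5 x 6 tensor with TT-ranks (1, 2, 3, 1):
# G1[i1] is 1x2, G2[i2] is 2x3, G3[i3] is 3x1
rng = np.random.default_rng(1)
G1 = rng.random((4, 1, 2))
G2 = rng.random((5, 2, 3))
G3 = rng.random((6, 3, 1))

def tt_entry(idx):
    """Evaluate a single entry as a chain of small matrix products."""
    i1, i2, i3 = idx
    return (G1[i1] @ G2[i2] @ G3[i3])[0, 0]

# Reconstruct the full tensor from the cores (only for checking;
# avoiding exactly this blow-up is the point of the TT format)
X = np.einsum('iab,jbc,kca->ijk', G1, G2, G3)
```

The cores store only $4 \cdot 2 + 5 \cdot 6 + 6 \cdot 3 = 56$ numbers versus $4 \cdot 5 \cdot 6 = 120$ for the dense tensor, and the gap grows linearly rather than exponentially with the order.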
Non-negative Tensor
Factorization (NTF)
When tensor elements are non-negative and interpretable factors are needed, use non-negative tensor factorization:

$$\mathcal{X} \approx \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r, \qquad \mathbf{a}_r, \mathbf{b}_r, \mathbf{c}_r \geq 0$$
Applications:
Image analysis (pixel values are non-negative)
Text mining (word frequencies are non-negative)
Chemometrics (concentrations are non-negative)
Non-negativity constraints make factors easier to interpret as
"parts" or "components."
Exercises
Conceptual Understanding
Exercise 1: Determine the tensor order of the
following data structures:
A mono MP3 song (44.1kHz sampling rate)
A stereo song
A 5-minute 1080p RGB video (30fps)
The ImageNet dataset (1 million RGB images)

Attention weights in a Transformer model

Exercise 2: For a tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$:
How many mode-1 fibers are there? What is the length of
each?
How many frontal slices are there? What is the shape of
each?
What is the shape of mode-2 unfolding?
Computational Exercises
Exercise 3: Let $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$ be three vectors.

Compute the outer product $\mathbf{a} \circ \mathbf{b}$

Compute the third-order tensor $\mathcal{T} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c}$, and write the value of a general element $t_{ijk}$

Compute $\|\mathcal{T}\|_F$

Exercise 4: Let $A$ and $B$ be two matrices.

Compute the Kronecker product $A \otimes B$

Verify that $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$ (for suitable matrices $C$, $D$)
Decomposition Exercises
Exercise 5: Consider a rank-2 tensor $\mathcal{X} = \mathbf{a}_1 \circ \mathbf{b}_1 \circ \mathbf{c}_1 + \mathbf{a}_2 \circ \mathbf{b}_2 \circ \mathbf{c}_2$.

(a) Compute the specific elements of $\mathcal{X}$
(b) Write out the three factor matrices $A$, $B$, $C$
(c) Compute the mode-1 unfolding $X_{(1)}$

Exercise 6: For the $3 \times 3 \times 3$ identity tensor $\mathcal{I}$ ($\mathcal{I}_{ijk} = 1$ if and only if $i = j = k$, and 0 otherwise):

Compute its Frobenius norm

What is its CP rank? (Hint: Consider whether it can be written as fewer than 3 rank-1 tensors)

What is its multilinear rank?
Application Problems
Exercise 7 (Recommender System): A video platform has 3 users, 4 movies, and 2 time periods. Observed ratings:

| User | Movie | Time Period | Rating |
|---|---|---|---|
| 1 | 1 | 1 | 5 |
| 1 | 2 | 1 | 4 |
| 2 | 1 | 2 | 3 |
| 2 | 3 | 1 | 5 |
| 3 | 2 | 2 | 2 |
| 3 | 4 | 1 | 4 |

Construct the rating tensor $\mathcal{X} \in \mathbb{R}^{3 \times 4 \times 2}$ (set unobserved positions to 0)

With rank-2 CP decomposition, how many parameters need to be estimated?

Discuss: What trade-offs might exist between rank-1 vs rank-2 decomposition?
Exercise 8 (Image Compression): An $H \times W$ RGB image can be viewed as an $H \times W \times 3$ tensor.

How many bytes for original storage (assuming 1 byte per pixel value)?

Using Tucker decomposition with multilinear rank $(R_1, R_2, 3)$, how many bytes after compression?

Compute the compression ratio

With CP decomposition of rank 100, how many bytes after compression?
Exercise 9 (Programming): Implement the following functions:

```python
import numpy as np

# (a) Implement mode-n unfolding for third-order tensors
def unfold(X, mode):
    """
    Unfold third-order tensor X along mode into a matrix
    X: ndarray of shape (I, J, K)
    mode: 0, 1, or 2
    Returns: unfolded matrix
    """
    pass

# (b) Implement n-mode product
def mode_n_product(X, A, mode):
    """
    Compute mode-n product of tensor X and matrix A
    X: ndarray of shape (I, J, K)
    A: ndarray of shape (P, I_mode)
    mode: 0, 1, or 2
    Returns: product tensor
    """
    pass

# (c) Implement simple CP decomposition (using ALS)
def simple_cp_als(X, rank, n_iter=100):
    """
    CP decomposition of third-order tensor X
    Returns: (A, B, C) three factor matrices
    """
    pass
```
Exercise 10: Using the tensorly
library:
Generate a rank-3 random tensor (generate the factors first, then construct the tensor via outer products)
Perform CP decomposition with ranks 2, 3, 4, 5 and compare
reconstruction errors
Plot the reconstruction error vs rank curve
Apply HOSVD to the same tensor and compare results
Proof Exercises
Exercise 11: Prove the relationship between tensor Frobenius norm and unfolding: for any mode $n$, $\|\mathcal{X}\|_F = \|X_{(n)}\|_F$.

Exercise 12: Prove properties of n-mode products:

(a) $\mathcal{X} \times_m A \times_n B = \mathcal{X} \times_n B \times_m A$ (when $m \neq n$)
(b) When $m = n$, $\mathcal{X} \times_n A \times_n B = \mathcal{X} \times_n (BA)$

Exercise 13: Let $\mathcal{X} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c}$ be a rank-1 tensor. Prove:

(a) $\|\mathcal{X}\|_F = \|\mathbf{a}\| \, \|\mathbf{b}\| \, \|\mathbf{c}\|$
(b) The mode-1 unfolding is $X_{(1)} = \mathbf{a} (\mathbf{c} \otimes \mathbf{b})^T$
Chapter Summary
This chapter systematically covered tensors and multilinear algebra
from basic concepts:
Core concepts:
- Tensors are generalizations of vectors and matrices to arbitrary dimensions
- Fibers, slices, and unfolding are fundamental tools for analyzing tensor structure
- Tensor operations include addition, scalar multiplication, contraction, outer product, and n-mode product

Tensor decompositions:
- CP decomposition represents tensors as sums of rank-1 components with essential uniqueness
- Tucker decomposition is more flexible, allowing different ranks for different modes
- HOSVD is orthogonal Tucker decomposition, commonly used for initialization and compression

Practical applications:
- Deep learning: convolution kernel compression, network acceleration
- Recommender systems: multi-dimensional context modeling
- Signal processing, chemometrics, and other fields
The core idea of tensor methods is: decompose complex
high-dimensional data structures into combinations of simple
components. This allows us to compress data, extract features,
and discover hidden patterns. As data dimensionality continues to grow,
tensor methods will only become more important.
References
Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions
and applications. SIAM Review, 51(3), 455-500.
Sidiropoulos, N. D., et al. (2017). Tensor decomposition for
signal processing and machine learning. IEEE Transactions on Signal
Processing, 65(13), 3551-3582.
Cichocki, A., et al. (2015). Tensor decompositions for signal
processing applications. IEEE Signal Processing Magazine, 32(2),
145-163.