
Notes on Chapter 2 (Linear Algebra) of the Deep Learning book.

Mathematical objects

  1. Scalar: single number \(x=1\)
  2. Vector: array of numbers \(\mathbf{x}=\begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}\)

    \(\mathbf{x}_{-1}\): the vector containing all elements of \(\mathbf{x}\) except \(x_1\)

  3. Matrix: 2D array of numbers \(\mathbf{A}\in\mathbb{R}^{m\times n}\)

    \(i\)th row (horizontal coordinate, all columns): \(\mathbf{A}_{i,:}\)

    \(i\)th column (vertical coordinate, all rows): \(\mathbf{A}_{:,i}\)

    single element: \(A_{i,j}\); element \((i,j)\) of a matrix-valued function: \(f(\mathbf{A})_{i,j}\)

  4. Tensor: array with more than two axes \(\mathbf{A}\)

    value: \(\mathbf{A}_{i,j,k}\)
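
As a minimal sketch (assuming NumPy, which these notes do not otherwise use), the four kinds of objects and the row/column indexing above map directly onto arrays:

```python
import numpy as np

x = np.array(1.0)                 # scalar: a single number
v = np.array([1.0, 2.0, 3.0])     # vector: 1D array
A = np.arange(6.0).reshape(2, 3)  # matrix: 2D array in R^{2x3}
T = np.zeros((2, 3, 4))           # tensor: array with more than two axes

A[0, :]   # first row of A      (A_{1,:} in the 1-based notation above)
A[:, 0]   # first column of A   (A_{:,1})
A[0, 1]   # single element      (A_{1,2})
```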

Transpose

  • Mirror across main diagonal \(\mathbf{A} \rightarrow \mathbf{A}^\intercal\)
  • A scalar is its own transpose: \(a=a^\intercal\)
  • Transposing a column vector \(\mathbf{x}\) gives a row vector \(\mathbf{x}^\intercal\)
  • Matrix product \((\mathbf{A}\mathbf{B})^\intercal=\mathbf{B}^\intercal\mathbf{A}^\intercal\)
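
A quick NumPy check of these rules (a sketch under the same NumPy assumption as above):

```python
import numpy as np

A = np.random.rand(2, 3)
B = np.random.rand(3, 4)

# Transposing mirrors across the main diagonal: shape (2, 3) -> (3, 2)
assert A.T.shape == (3, 2)

# (AB)^T = B^T A^T (up to floating-point error)
assert np.allclose((A @ B).T, B.T @ A.T)
```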

Addition

If \(\mathbf{A}\) and \(\mathbf{B}\) same shape, \(\mathbf{C}=\mathbf{A}+\mathbf{B}\), where \(\mathbf{C}_{i,j}=\mathbf{A}_{i,j}+\mathbf{B}_{i,j}\).

Scalar addition and multiplication

\(\mathbf{D}=a\cdot\mathbf{B}+c\), where \(\mathbf{D}_{i,j}=a\cdot\mathbf{B}_{i,j}+c\)

Addition matrix and vector

\(\mathbf{C}=\mathbf{A}+\mathbf{b}\), where \(\mathbf{C}_{i,j}=\mathbf{A}_{i,j}+\mathbf{b}_j\)

broadcasting: \(\mathbf{b}\) is added to each row of the matrix \(\mathbf{A}\).
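
A short NumPy sketch (again assuming NumPy) of the three kinds of addition above, including implicit broadcasting:

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)   # matrix, shape (2, 3)
B = np.ones((2, 3))                # matrix of the same shape
b = np.array([10.0, 20.0, 30.0])   # vector, shape (3,)

C = A + B          # elementwise matrix addition: C[i, j] = A[i, j] + B[i, j]
D = 2.0 * B + 1.0  # scalar multiplication and addition: D[i, j] = 2 * B[i, j] + 1
E = A + b          # broadcasting: b is added to each row, E[i, j] = A[i, j] + b[j]
```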

Operations

  1. Matrix product \(\mathbf{C}=\mathbf{A}\mathbf{B} \rightarrow \mathbf{C}_{m\times p}=\mathbf{A}_{m\times n}\mathbf{B}_{n\times p}\) and \(\mathbf{C}_{i,j}=\sum_k\mathbf{A}_{i,k}\mathbf{B}_{k,j}\)

    Dot product between two vectors of the same size: \(\mathbf{x}^\intercal\mathbf{y}\) (illustrated in the sketch after this list)

  2. Elementwise product / Hadamard product \(\mathbf{A}\odot\mathbf{B}\)
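
In NumPy terms (a sketch, same assumption as above), the two products are distinct operators:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

C = A @ B   # matrix product: C[i, j] = sum_k A[i, k] * B[k, j]
H = A * B   # Hadamard (elementwise) product: H[i, j] = A[i, j] * B[i, j]

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
d = x @ y   # dot product x^T y between two vectors of the same size
```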

Properties

  • Distributive: \(\mathbf{A}(\mathbf{B}+\mathbf{C})=\mathbf{A}\mathbf{B}+\mathbf{A}\mathbf{C}\)
  • Associative: \(\mathbf{A}(\mathbf{B}\mathbf{C})=(\mathbf{A}\mathbf{B})\mathbf{C}\)
  • Commutative: not in general, \(\mathbf{A}\mathbf{B}\neq \mathbf{B}\mathbf{A}\)

    but it does hold for the dot product between vectors \(\rightarrow\) \(\mathbf{x}^\intercal \mathbf{y}=\mathbf{y}^\intercal\mathbf{x}\)

Linear equations

\(\mathbf{A}\mathbf{x}=\mathbf{b}\) (compact notation)

  • \(\mathbf{A}\in\mathbb{R}^{m\times n}\) known
  • \(\mathbf{x}\in\mathbb{R}^n\) unknown
  • \(\mathbf{b}\in\mathbb{R}^m\) known

Expanded notation, as a system of equations:

\[A_{1,:}\mathbf{x}=b_1 \rightarrow A_{1,1}x_1+\dots+A_{1,n}x_n=b_1\\ \vdots\\ A_{m,:}\mathbf{x}=b_m \rightarrow A_{m,1}x_1+\dots+A_{m,n}x_n=b_m\]

Identity matrix

\(\mathbf{I}_n\in\mathbb{R}^{n\times n} \rightarrow \mathbf{I}_n\mathbf{x}=\mathbf{x}\:\:\:\:\:\:\: \forall\mathbf{x}\in\mathbb{R}^n\)

Inverse Matrix

\(A^{-1}A=\mathbf{I}_n\) and \(AA^{-1}=\mathbf{I}_n\); for a square matrix the left and right inverses are the same.

For example: \(\begin{align} A\mathbf{x}&=\mathbf{b} \\ A^{-1}A\mathbf{x}&=A^{-1}\mathbf{b} \\ \mathbf{I}_n\mathbf{x}&=A^{-1}\mathbf{b}\\ \mathbf{x}&=A^{-1}\mathbf{b} \end{align}\)

  • Should not be used in practical applications because of limited numerical precision
  • \(A^{-1}\) might not exist (singular matrix).

Note: the matrix must be square and non-singular; otherwise we cannot get \(A^{-1}\).
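
A minimal NumPy sketch (assumptions: NumPy, and the hypothetical \(2\times 2\) system below) contrasting the explicit inverse with a linear solver, which is what the precision warning above points towards in practice:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x_inv   = np.linalg.inv(A) @ b    # textbook route: x = A^{-1} b
x_solve = np.linalg.solve(A, b)   # preferred in practice: better numerical behaviour

assert np.allclose(x_inv, x_solve)
assert np.allclose(A @ x_solve, b)
```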

Linear Combination

Given a set of vectors \(\{\mathbf{v}^{(1)},\dots,\mathbf{v}^{(n)}\}\), a linear combination is obtained by multiplying each vector \(\mathbf{v}^{(i)}\) by a scalar coefficient \(c_i\) and adding the results:

\[\sum_i c_i\mathbf{v}^{(i)}\]
Span

Set of all points obtained by linear combination of the original vectors

\(A\mathbf{x}=\mathbf{b} \rightarrow\) has a solution exactly when \(\mathbf{b}\) is in the span of the columns of \(A\) (known as the column space or range of \(A\))

For \(A^{-1}\) to exist, \(A\) must be a square matrix (\(m=n\)) with all columns linearly independent.

Linear Dependence

At least one vector in the set is a linear combination of the others, so it adds nothing new to the span / column space.

Linear Independence

No vector in the set is a linear combination of the other vectors.

Square matrix

\(A \in\mathbb{R}^{m\times n}\) with \(m=n\)

Singular matrix

A square matrix with linearly dependent columns; it has no inverse.
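
One numerical way to check these distinctions (a sketch assuming NumPy; rank is not discussed in this part of the chapter) is the rank of the matrix:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0]])   # linearly independent columns -> invertible
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # second column = 2 * first column -> singular

np.linalg.matrix_rank(A)     # 2: full rank, A^{-1} exists
np.linalg.matrix_rank(S)     # 1: linearly dependent columns, no inverse
```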

Norms

  • Norms are functions mapping vectors to non-negative values (distance from origin to point \(\mathbf{x}\))
  • Measure the size of a vector
  • Any function \(f\) is a norm as long as:
    1. \(f(\mathbf{x})=0 \Rightarrow \mathbf{x}=\mathbf{0}\)
    2. \(f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y})\) (triangle inequality)
    3. \(\forall \alpha\in\mathbb{R}, f(\alpha\mathbf{x}) = \mid\alpha\mid f(\mathbf{x})\)
List:
  • \(L^p\) norm: \(\parallel\mathbf{x}\parallel_p=(\sum_i\mid x_i\mid^p)^{\frac{1}{p}}\) where \(p\in\mathbb{R}, p\geq 1\)
  • \(L^2\) Euclidean norm (distance from the origin): \(\parallel\mathbf{x}\parallel_2=\parallel\mathbf{x}\parallel=\sqrt{\mathbf{x}^\intercal\mathbf{x}}\)
    • the squared \(L^2\) norm \(\mathbf{x}^\intercal\mathbf{x}\) is easier to work with, both 1) mathematically and 2) computationally
    • but it increases very slowly near the origin
  • \(L^1\) norm: \(\parallel\mathbf{x}\parallel_1=\sum_i\mid x_i\mid\)
    • grows at the same rate in all locations
    • useful to discriminate between elements that are 1) exactly zero and 2) small but nonzero
  • \(L^\infty\) max norm: \(\parallel\mathbf{x}\parallel_\infty=\max_i\mid x_i\mid\)
  • Frobenius norm: measures the size of a matrix, analogous to the \(L^2\) norm of a vector

    \(\parallel A\parallel_F=\sqrt{\sum_{i,j}A_{i,j}^2}\)

  • Dot product rewritten in terms of norms: \(\mathbf{x}^\intercal\mathbf{y}=\parallel\mathbf{x}\parallel_2\parallel\mathbf{y}\parallel_2\cos\theta\), where \(\theta\) is the angle between \(\mathbf{x}\) and \(\mathbf{y}\)
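
As a small NumPy sketch (assumption: NumPy's norm routines) of the norms listed above:

```python
import numpy as np

x = np.array([3.0, -4.0])

np.linalg.norm(x, 1)       # L1 norm: |3| + |-4| = 7
np.linalg.norm(x, 2)       # L2 (Euclidean) norm: sqrt(9 + 16) = 5
np.linalg.norm(x, np.inf)  # L-infinity (max) norm: 4
x @ x                      # squared L2 norm as x^T x = 25

A = np.ones((2, 3))
np.linalg.norm(A, 'fro')   # Frobenius norm: sqrt(6)
```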

Special kinds of matrices and vectors

  • Diagonal matrix (\(D\))
    • \(diag(\mathbf{v})\): square diagonal matrix with the entries of \(\mathbf{v}\) on the diagonal
    • \(diag(\mathbf{v})^{-1}=diag(\left[1/v_1,\dots,1/v_n\right]^\intercal)\).
    • \(diag(\mathbf{v})\mathbf{x}=\mathbf{v}\odot\mathbf{x}\), i.e. each \(x_i\) is scaled by \(v_i\)
    • non-square diagonal matrices exist but have no inverse
  • Symmetric \(A=A^\intercal\)
  • Unit vector (unit norm): \(\parallel\mathbf{x}\parallel_2=1\)
  • Orthogonal vectors: \(\mathbf{x}^\intercal\mathbf{y}=0\); nonzero orthogonal vectors are perpendicular (90 degrees)
    • mutually orthogonal: every pair of vectors in the set is orthogonal
    • orthonormal: the vectors are mutually orthogonal and each has unit norm
    • orthogonal matrix: square matrix whose rows and columns are mutually orthonormal (see the sketch after this list)

      \[\begin{align} A^\intercal A&=AA^\intercal=I\\ A^{-1}&=A^\intercal \end{align}\]
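
A short NumPy sketch (assuming NumPy; the rotation matrix is just an illustrative example) of the diagonal and orthogonal properties above:

```python
import numpy as np

v = np.array([2.0, 3.0, 4.0])
x = np.array([1.0, 1.0, 1.0])

# diag(v) x equals the elementwise product v ⊙ x
assert np.allclose(np.diag(v) @ x, v * x)

# A 2D rotation matrix is orthogonal: A^T A = I, hence A^{-1} = A^T
theta = np.pi / 4
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(A.T @ A, np.eye(2))
assert np.allclose(np.linalg.inv(A), A.T)
```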

Eigendecomposition

Matrix Decomposition

  1. Eigenvectors (\(\mathbf{v}\) vector)
  2. Eigenvalues (\(\lambda\) scalar)
\[A\mathbf{v}=\lambda\mathbf{v}\]

Eigendecomposition

  1. \(V\): matrix with one linearly independent eigenvector per column
  2. \(\mathbf{\lambda}\): vector of the corresponding eigenvalues
\[A=V\,diag(\mathbf{\lambda})\,V^{-1}\]

Not every matrix can be decomposed

Real symmetric matrix

  1. \(Q\): orthogonal matrix whose columns are eigenvectors of \(A\)
  2. \(\Lambda\): diagonal matrix of the eigenvalues
\[A=Q\Lambda Q^\intercal\]

Not defined for non-square matrices
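
A NumPy sketch (assuming NumPy; `eigh` is the routine for symmetric matrices) of the symmetric case above:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])   # real symmetric matrix

lam, Q = np.linalg.eigh(A)   # eigenvalues lam, orthogonal eigenvector matrix Q

# Reconstruct A = Q diag(lambda) Q^T
assert np.allclose(Q @ np.diag(lam) @ Q.T, A)

# Check A v = lambda v for the first eigenpair
assert np.allclose(A @ Q[:, 0], lam[0] * Q[:, 0])
```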

Singular Value Decomposition (SVD)

It is more generally applicable: every real matrix has an SVD, even a non-square one.

  1. \(D\): diagonal matrix, not necessarily square; its diagonal entries are the singular values of \(A\)
  2. \(U\) and \(V\): orthogonal matrices; the columns of \(U\) are the left-singular vectors and the columns of \(V\) are the right-singular vectors
\[A=UDV^\intercal\] \[(m\times n) = (m\times m)(m\times n)(n\times n)\]

Useful to partially generalize matrix inversion to non-square matrices.
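
A minimal NumPy sketch (assuming NumPy; note that `np.linalg.svd` returns the singular values as a vector rather than the matrix \(D\)):

```python
import numpy as np

A = np.random.rand(3, 2)      # non-square matrix, m = 3, n = 2

U, s, Vt = np.linalg.svd(A)   # U: (3, 3), s: singular values, Vt: V^T of shape (2, 2)

# Rebuild the (3, 2) diagonal matrix D from the singular values
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)

assert np.allclose(U @ D @ Vt, A)   # A = U D V^T
```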

The Moore-Penrose Pseudoinverse

Matrix inversion is not defined for non-square matrices; the pseudoinverse provides a partial substitute for them.

  1. definition: \(A^+=\lim_{\alpha\searrow 0}(A^\intercal A+\alpha I)^{-1}A^\intercal\)
  2. practical computation: \(A^+=VD^+U^\intercal\)
  • \(U, D, V\): singular value decomposition of \(A\)
  • \(D^+\): obtained by taking the reciprocal of the nonzero elements of \(D\) and then transposing the result
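
A NumPy sketch (assuming NumPy) comparing the built-in pseudoinverse with the SVD-based construction above:

```python
import numpy as np

A = np.random.rand(3, 2)           # non-square: no ordinary inverse

A_pinv = np.linalg.pinv(A)         # Moore-Penrose pseudoinverse, shape (2, 3)

# Manual construction from the SVD: A^+ = V D^+ U^T
U, s, Vt = np.linalg.svd(A)
D_plus = np.zeros((2, 3))                     # transposed shape of D
D_plus[:len(s), :len(s)] = np.diag(1.0 / s)   # reciprocal of the nonzero singular values
assert np.allclose(Vt.T @ D_plus @ U.T, A_pinv)
```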

Trace operator

Determinant