Linear Algebra
Notes on chapter 2 of the Deep Learning book.
Mathematical objects
- Scalar: a single number, e.g. \(x=1\)
- Vector: array of numbers \(\mathbf{x}=\begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}\)
- \(\mathbf{x}_{-1}\): the vector containing all elements of \(\mathbf{x}\) except \(x_1\)
- Matrix: 2D array of numbers \(\mathbf{A}\in\mathbb{R}^{m\times n}\)
- \(\mathbf{A}_{i,:}\): \(i\)-th row (all horizontal coordinates)
- \(\mathbf{A}_{:,i}\): \(i\)-th column (all vertical coordinates)
- \(f(\mathbf{A})_{i,j}\): element \((i,j)\) of the matrix \(f(\mathbf{A})\)
- Tensor: array with more than two axes \(\mathbf{A}\)
- value: \(\mathbf{A}_{i,j,k}\)
Transpose
- Mirror across main diagonal \(\mathbf{A} \rightarrow \mathbf{A}^\intercal\)
- Scalar: a matrix with a single entry is its own transpose, \(a=a^\intercal\)
- Vector: \(\mathbf{x}^\intercal\) turns a column vector into a row vector
- Matrix product \((\mathbf{A}\mathbf{B})^\intercal=\mathbf{B}^\intercal\mathbf{A}^\intercal\)
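A quick numpy check of the product rule above (an illustrative sketch; the example matrices are arbitrary):

```python
import numpy as np

# Sketch: verify (AB)^T = B^T A^T on arbitrary example matrices.
A = np.array([[1., 2.], [3., 4.], [5., 6.]])   # 3x2
B = np.array([[1., 0., 2.], [0., 1., 3.]])     # 2x3

print(np.allclose((A @ B).T, B.T @ A.T))       # True
```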
Addition
If \(\mathbf{A}\) and \(\mathbf{B}\) have the same shape, \(\mathbf{C}=\mathbf{A}+\mathbf{B}\), where \(\mathbf{C}_{i,j}=\mathbf{A}_{i,j}+\mathbf{B}_{i,j}\).
Scalar addition and multiplication
\(\mathbf{D}=a\cdot\mathbf{B}+c\), where \(\mathbf{D}_{i,j}=a\cdot\mathbf{B}_{i,j}+c\)
Addition of matrix and vector
\(\mathbf{C}=\mathbf{A}+\mathbf{b}\), where \(\mathbf{C}_{i,j}=\mathbf{A}_{i,j}+\mathbf{b}_j\)
broadcasting: \(\mathbf{b}\) is added to each row of the matrix \(\mathbf{A}\), as in the sketch below.
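A minimal numpy sketch of broadcasting (the arrays are arbitrary examples):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # shape (2, 3)
b = np.array([10, 20, 30])       # shape (3,)

# numpy broadcasts b: it is implicitly copied and added to each row of A
print(A + b)
# [[10 21 32]
#  [13 24 35]]
```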
Operations
- Matrix product \(\mathbf{C}=\mathbf{A}\mathbf{B} \rightarrow \mathbf{C}_{m\times p}=\mathbf{A}_{m\times n}\mathbf{B}_{n\times p}\) and \(\mathbf{C}_{i,j}=\sum_k\mathbf{A}_{i,k}\mathbf{B}_{k,j}\)
- Dot product between vectors of the same size: \(\mathbf{x}^\intercal\mathbf{y}\)
- Elementwise product / Hadamard product \(\mathbf{A}\odot\mathbf{B}\)
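The difference between the two products, sketched in numpy (example values are mine):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

print(A @ B)        # matrix product: C[i,j] = sum_k A[i,k] * B[k,j]
print(A * B)        # Hadamard (elementwise) product
print(A[0] @ B[0])  # dot product of two same-size vectors: 17.0
```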
Properties
- Distributive: \(\mathbf{A}(\mathbf{B}+\mathbf{C})=\mathbf{A}\mathbf{B}+\mathbf{A}\mathbf{C}\)
- Associative: \(\mathbf{A}(\mathbf{B}\mathbf{C})=(\mathbf{A}\mathbf{B})\mathbf{C}\)
- Commutative: not in general, \(\mathbf{A}\mathbf{B}\neq \mathbf{B}\mathbf{A}\) for most matrices
- yes for the dot product of vectors \(\rightarrow\) \(\mathbf{x}^\intercal \mathbf{y}=\mathbf{y}^\intercal\mathbf{x}\)
Linear equations
\(\mathbf{A}\mathbf{x}=\mathbf{b}\) (compact notation)
- \(\mathbf{A}\in\mathbb{R}^{m\times n}\) known
- \(\mathbf{x}\in\mathbb{R}^n\) unknown variable
- \(\mathbf{b}\in\mathbb{R}^m\) known
Expanded (non-compact) notation, one equation per row:
\[A_{1,:}\mathbf{x}=b_1 \rightarrow A_{1,1}x_1+\dots+A_{1,n}x_n=b_1\\ \vdots\\ A_{m,:}\mathbf{x}=b_m \rightarrow A_{m,1}x_1+\dots+A_{m,n}x_n=b_m\]
Identity matrix
\(\mathbf{I}_n\in\mathbb{R}^{n\times n} \rightarrow \mathbf{I}_n\mathbf{x}=\mathbf{x}\quad \forall\mathbf{x}\in\mathbb{R}^n\)
Inverse Matrix
\(A^{-1}A=\mathbf{I}_n\) and \(AA^{-1}=\mathbf{I}_n\); for a square matrix the left and right inverses are the same.
For example: \(\begin{align} A\mathbf{x}&=\mathbf{b} \\ A^{-1}A\mathbf{x}&=A^{-1}\mathbf{b} \\ \mathbf{I}_n\mathbf{x}&=A^{-1}\mathbf{b}\\ \mathbf{x}&=A^{-1}\mathbf{b} \end{align}\)
- \(A^{-1}\) should not be used in practical applications because of limited numerical precision
- \(A^{-1}\) might not exist (singular matrix).
Note: the matrix must be square and nonsingular; otherwise \(A^{-1}\) does not exist. See the sketch below.
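A sketch of why solvers are preferred over the explicit inverse (numpy, arbitrary example system):

```python
import numpy as np

A = np.array([[3., 1.], [1., 2.]])
b = np.array([9., 8.])

x = np.linalg.solve(A, b)      # preferred: avoids forming A^{-1}, better precision
x_inv = np.linalg.inv(A) @ b   # works, but discouraged in practice

print(x)                       # [2. 3.]
print(np.allclose(x, x_inv))   # True
```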
Linear Combination
A linear combination of a set of vectors \(\{\mathbf{v}^{(1)},\dots,\mathbf{v}^{(n)}\}\) is obtained by multiplying each vector \(\mathbf{v}^{(i)}\) by a scalar coefficient and adding the results:
\[\sum_i c_i\mathbf{v}^{(i)}\]
Span
Set of all points obtained by linear combination of the original vectors
\(A\mathbf{x}=\mathbf{b}\) has a solution iff \(\mathbf{b}\) is in the span of the columns of \(A\), known as the column space / range of \(A\).
For exactly one solution for every \(\mathbf{b}\): \(A\) must be a square matrix (\(m=n\)) with all columns linearly independent.
Linear Dependence
Some vector in the set is a linear combination of the others; adding it to a set does not enlarge the column space.
Linear Independence
No vector in the set is a linear combination of the other vectors.
Square matrix
\(A \in\mathbb{R}^{m\times n}\) with \(m=n\)
Singular matrix
Square matrix with linearly dependent columns; it has no inverse. See the rank check below.
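A small rank check (numpy sketch with a made-up matrix):

```python
import numpy as np

# Second column is 2x the first, so the columns are linearly dependent.
A = np.array([[1., 2.], [2., 4.]])

print(np.linalg.matrix_rank(A))  # 1 (< 2): A is singular
print(np.linalg.det(A))          # 0.0; np.linalg.inv(A) would raise LinAlgError
```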
Norms
- Norms are functions mapping vectors to non-negative values (distance from origin to point \(\mathbf{x}\))
- Measure the size of a vector
- Any function \(f\) is a norm as long as:
- \(f(\mathbf{x})=0 \rightarrow \mathbf{x}=0\).
- \(f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y})\) triangle inequality
- \(\forall \alpha\in\mathbb{R}, f(\alpha\mathbf{x}) = \mid\alpha\mid f(\mathbf{x})\).
List:
- \(L^p\) norm: \(\parallel\mathbf{x}\parallel_p=\left(\sum_i\mid x_i\mid^p\right)^{\frac{1}{p}}\) where \(p\in\mathbb{R}, p\geq 1\)
- \(L^2\) Euclidean norm (distance): \(\parallel\mathbf{x}\parallel_2=\parallel\mathbf{x}\parallel=\sqrt{\mathbf{x}^\intercal\mathbf{x}}\)
- the squared \(L^2\) norm, \(\mathbf{x}^\intercal\mathbf{x}\), is easier to work with 1) mathematically and 2) computationally
- but it increases really slowly near the origin
- \(L^1\) norm: \(\parallel\mathbf{x}\parallel_1=\sum_i\mid x_i\mid\)
- a function that grows at the same rate in all locations
- useful to discriminate between elements that are 1) exactly zero and 2) small but nonzero
- \(L^\infty\) max norm: \(\parallel\mathbf{x}\parallel_\infty=\max_i\mid x_i\mid\)
- Frobenius norm: measures the size of a matrix: \(\parallel A\parallel_F=\sqrt{\sum_{i,j}A_{i,j}^2}\)
- Analogous to the \(L^2\) norm of a vector
- Dot product in terms of norms: \(\mathbf{x}^\intercal\mathbf{y}=\parallel\mathbf{x}\parallel_2\parallel\mathbf{y}\parallel_2\cos\theta\), where \(\theta\) is the angle between \(\mathbf{x}\) and \(\mathbf{y}\)
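The norms above, computed with numpy (an illustrative sketch):

```python
import numpy as np

x = np.array([3., -4.])
print(np.linalg.norm(x, 1))       # L1 norm: 7.0
print(np.linalg.norm(x))          # L2 norm (default): 5.0
print(np.linalg.norm(x, np.inf))  # max norm: 4.0

A = np.array([[1., 2.], [3., 4.]])
print(np.linalg.norm(A, 'fro'))   # Frobenius norm: sqrt(30)
```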
Special kinds of matrices and vectors
- Diagonal matrix (\(D\))
- \(diag(\mathbf{v})\): square diagonal matrix with the entries of \(\mathbf{v}\) on the main diagonal
- \(diag(\mathbf{v})^{-1}=diag(\left[1/v_1,\dots,1/v_n\right]^\intercal)\), which exists only if every \(v_i\) is nonzero
- \(diag(\mathbf{v})\mathbf{x}=\mathbf{v}\odot\mathbf{x} \rightarrow\) each \(x_i\) is scaled by \(v_i\)
- non-square diagonal matrix: no inverse
- Symmetric \(A=A^\intercal\)
- Unit vector (unit norm): \(\parallel\mathbf{x}\parallel_2=1\)
- Orthogonal: \(\mathbf{x}^\intercal\mathbf{y}=0\) Vectors are perpendicular (90 degrees)
- mutually orthogonal: a set of vectors whose members are pairwise orthogonal
- orthonormal: vectors that are mutually orthogonal and have unit norm
- orthogonal matrix: square matrix whose rows and columns are mutually orthonormal, as checked in the sketch below
\[\begin{align} A^\intercal A&=AA^\intercal=I\\ A^{-1}&=A^\intercal \end{align}\]
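A numpy sketch using a 2D rotation matrix, which is orthogonal (my example, not the book's):

```python
import numpy as np

theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))     # True: rows/columns are orthonormal
print(np.allclose(np.linalg.inv(Q), Q.T))  # True: the inverse is just the transpose
```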
Eigendecomposition
A form of matrix decomposition.
- Eigenvector (\(\mathbf{v}\), vector): nonzero vector such that \(A\mathbf{v}=\lambda\mathbf{v}\)
- Eigenvalue (\(\lambda\), scalar): the corresponding scaling factor
Eigendecomposition
\(A=V\,diag(\boldsymbol{\lambda})\,V^{-1}\)
- \(V\), matrix whose columns are the linearly independent eigenvectors
- \(\boldsymbol{\lambda}\), vector of the corresponding eigenvalues
Not every matrix can be decomposed this way.
Real Symmetric matrix
\(A=Q\Lambda Q^\intercal\)
- \(Q\), orthogonal matrix of eigenvectors
- \(\Lambda\), diagonal matrix of eigenvalues
Eigendecomposition is not defined for non-square matrices; a numpy check of the symmetric case follows.
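A minimal numpy sketch (the example matrix is arbitrary) verifying \(A=Q\Lambda Q^\intercal\):

```python
import numpy as np

A = np.array([[2., 1.], [1., 2.]])    # real symmetric

eigvals, Q = np.linalg.eigh(A)        # eigh targets symmetric/Hermitian matrices
Lam = np.diag(eigvals)

print(np.allclose(A, Q @ Lam @ Q.T))  # True: A = Q Λ Q^T
```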
Singular Value Decomposition (SVD)
More generally applicable than eigendecomposition: every real matrix has an SVD \(A=UDV^\intercal\)
- \(D\), diagonal matrix (not necessarily square); its diagonal entries are the singular values of \(A\)
- \(U\) and \(V\), orthogonal matrices; their columns are the left-singular and right-singular vectors, respectively
Useful to partially generalize matrix inversion to non-square matrices (see the Moore-Penrose pseudoinverse below).
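A numpy sketch of SVD on a non-square matrix (values are arbitrary):

```python
import numpy as np

A = np.array([[1., 2., 3.], [4., 5., 6.]])  # 2x3: no eigendecomposition

U, s, Vt = np.linalg.svd(A)                 # s holds the singular values
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)            # embed s in a 2x3 diagonal matrix

print(np.allclose(A, U @ D @ Vt))           # True: A = U D V^T
```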
The Moore-Penrose Pseudoinverse
Matrix inversion is not defined for non-square matrices; the pseudoinverse partially fills that gap.
- definition: \(A^+=\lim_{\alpha\searrow 0}(A^\intercal A+\alpha I)^{-1}A^\intercal\)
- practical computation: \(A^+=VD^+U^\intercal\)
- \(U, D, V\): singular value decomposition of \(A\)
- \(D^+\): take \(D\), replace each nonzero element by its reciprocal, and transpose the result
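A sketch building \(A^+\) from the SVD and comparing it with numpy's built-in (the example matrix is mine):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.], [5., 6.]])   # 3x2: no ordinary inverse

U, s, Vt = np.linalg.svd(A, full_matrices=False)
D_plus = np.diag(1.0 / s)                      # reciprocal of the nonzero singular values
A_plus = Vt.T @ D_plus @ U.T                   # A^+ = V D^+ U^T

print(np.allclose(A_plus, np.linalg.pinv(A)))  # True
print(np.allclose(A_plus @ A, np.eye(2)))      # True: left inverse (independent columns)
```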