Rademacher complexity




In computational learning theory (machine learning and theory of computation), Rademacher complexity, named after Hans Rademacher, measures the richness of a class of real-valued functions with respect to a probability distribution.




Contents

  • 1 Definitions
    • 1.1 Rademacher complexity of a set
    • 1.2 Rademacher complexity of a function class
  • 2 Examples
  • 3 Using the Rademacher complexity
    • 3.1 Bounding the representativeness
    • 3.2 Bounding the generalization error
  • 4 Bounding the Rademacher complexity
    • 4.1 Bounds related to the VC dimension
    • 4.2 Bounds related to linear classes
    • 4.3 Bounds related to covering numbers
  • 5 Gaussian complexity
  • 6 References



Definitions



Rademacher complexity of a set


Given a set $A \subseteq \mathbb{R}^m$, the Rademacher complexity of $A$ is defined as follows:[1][2]:326


$$\operatorname{Rad}(A) := \frac{1}{m}\,\mathbb{E}_{\sigma}\left[\sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i \right]$$

where $\sigma_1, \sigma_2, \dots, \sigma_m$ are independent random variables drawn from the Rademacher distribution, i.e. $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$ for $i = 1, 2, \dots, m$.
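For a finite set $A$, the expectation over $\sigma$ can be approximated by sampling sign vectors. The following minimal Python sketch (not part of the original article; the helper name rademacher_complexity_set, the sample count, and the example set are illustrative choices) estimates $\operatorname{Rad}(A)$ by Monte Carlo:

```python
import numpy as np

def rademacher_complexity_set(A, n_samples=100_000, rng=None):
    """Monte Carlo estimate of Rad(A) for a finite set A of vectors in R^m.

    A is given as an array of shape (N, m): one row per element of A.
    """
    rng = np.random.default_rng(rng)
    A = np.asarray(A, dtype=float)
    n_vectors, m = A.shape
    # Draw n_samples Rademacher sign vectors sigma in {-1, +1}^m.
    sigma = rng.integers(0, 2, size=(n_samples, m)) * 2 - 1
    # For each sigma, take the supremum over a in A of <sigma, a>, then average.
    sup_values = (sigma @ A.T).max(axis=1)
    return sup_values.mean() / m

# Example: the two-vector set from the Examples section below; the estimate is close to 0.25.
print(rademacher_complexity_set([[1, 1], [1, 2]]))
```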



Rademacher complexity of a function class


Given a sample $S = (z_1, z_2, \dots, z_m) \in Z^m$, and a class $F$ of real-valued functions defined on a domain space $Z$, the empirical Rademacher complexity of $F$ given $S$ is defined as:


$$\operatorname{Rad}_S(F) = \frac{1}{m}\,\mathbb{E}_{\sigma}\left[\sup_{f \in F} \sum_{i=1}^{m} \sigma_i f(z_i)\right]$$

This can also be written using the previous definition:[2]:326


$$\operatorname{Rad}_S(F) = \operatorname{Rad}(F \circ S)$$

where $F \circ S$ denotes function composition, i.e.:


$$F \circ S := \{(f(z_1), \ldots, f(z_m)) \mid f \in F\}$$

Let $P$ be a probability distribution over $Z$.
The Rademacher complexity of the function class $F$ with respect to $P$ for sample size $m$ is:


$$\operatorname{Rad}_{P,m}(F) := \mathbb{E}_{S \sim P^m}\left[\operatorname{Rad}_S(F)\right]$$

where the above expectation is taken over an independent and identically distributed (i.i.d.) sample $S = (z_1, z_2, \dots, z_m)$ generated according to $P$.
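Because $\operatorname{Rad}_S(F) = \operatorname{Rad}(F \circ S)$, the empirical quantity for a finite function class can be estimated by first forming the set $F \circ S$ and then reusing the set-level estimator. A sketch under illustrative assumptions (a finite class of Python callables, a small real-valued sample, and the rademacher_complexity_set helper sketched above):

```python
import numpy as np

def empirical_rademacher(functions, sample, n_samples=100_000, rng=None):
    """Estimate Rad_S(F) for a finite class F of real-valued functions.

    functions: list of callables f: Z -> R
    sample:    the points z_1, ..., z_m in Z
    """
    # Form F o S: one row per function, evaluated on the whole sample.
    F_circ_S = np.array([[f(z) for z in sample] for f in functions], dtype=float)
    # Rad_S(F) = Rad(F o S), so reuse the set-level Monte Carlo estimator.
    return rademacher_complexity_set(F_circ_S, n_samples=n_samples, rng=rng)

# Example: a small class of threshold classifiers evaluated on real-valued points.
sample = [0.1, 0.35, 0.4, 0.8, 0.9]
F = [lambda z, t=t: float(z >= t) for t in np.linspace(0.0, 1.0, 11)]
print(empirical_rademacher(F, sample))
```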



Examples


1. $A$ contains a single vector, e.g., $A = \{(a, b)\} \subset \mathbb{R}^2$. Then:


$$\operatorname{Rad}(A) = \frac{1}{2}\cdot\left(\frac{1}{4}\cdot(a+b) + \frac{1}{4}\cdot(a-b) + \frac{1}{4}\cdot(-a+b) + \frac{1}{4}\cdot(-a-b)\right) = 0$$

The same is true for every singleton hypothesis class.[3]:56


2. $A$ contains two vectors, e.g., $A = \{(1,1), (1,2)\} \subset \mathbb{R}^2$. Then:



$$\operatorname{Rad}(A) = \frac{1}{2}\cdot\left(\frac{1}{4}\cdot\max(1+1,\,1+2) + \frac{1}{4}\cdot\max(1-1,\,1-2) + \frac{1}{4}\cdot\max(-1+1,\,-1+2) + \frac{1}{4}\cdot\max(-1-1,\,-1-2)\right)$$

$$= \frac{1}{8}(3 + 0 + 1 - 2) = \frac{1}{4}$$
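Since $m = 2$ here, the expectation over $\sigma$ has only four equally likely terms and can be evaluated exactly. A short illustrative sketch (not from the source) reproducing the value $1/4$ by enumeration:

```python
from itertools import product

A = [(1, 1), (1, 2)]
m = 2
total = 0.0
for sigma in product([-1, 1], repeat=m):   # all 2^m sign vectors, each with probability 1/4
    total += max(sum(s * ai for s, ai in zip(sigma, a)) for a in A)
rad = total / 2 ** m / m
print(rad)  # 0.25
```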



Using the Rademacher complexity


The Rademacher complexity can be used to derive data-dependent upper bounds on the learnability of function classes. Intuitively, a function class with smaller Rademacher complexity is easier to learn.



Bounding the representativeness


In machine learning, it is desirable to have a training set that represents the true distribution of samples. This can be quantified using the notion of representativeness. Denote by $P$ the probability distribution from which the samples are drawn. Denote by $H$ the set of hypotheses (potential classifiers) and denote by $F$ the corresponding set of error functions, i.e., for every $h \in H$ there is a function $f_h \in F$ that maps each training sample (features, label) to the error of the classifier $h$ on that sample (for example, in binary classification with the simple 0-1 loss, $f_h$ is a function that returns 0 for each training sample on which $h$ is correct and 1 for each training sample on which $h$ is wrong). Define:




  • $L_P(f) := \mathbb{E}_{z \sim P}[f(z)]$ - the expected error on the real distribution;

  • $L_S(f) := \frac{1}{m}\sum_{i=1}^{m} f(z_i)$ - the estimated error on the sample.


The representativeness of the sample $S$, with respect to $P$ and $F$, is defined as:


$$\operatorname{Rep}_P(F, S) := \sup_{f \in F}\left(L_P(f) - L_S(f)\right)$$

Smaller representativeness is better, since it means that the empirical error of a classifier on the training set is not much lower than its true error.


The expected representativeness of a sample can be bounded by the Rademacher complexity of the function class:[2]:326



$$\mathbb{E}_{S \sim P^m}\left[\operatorname{Rep}_P(F, S)\right] \leq 2 \cdot \mathbb{E}_{S \sim P^m}\left[\operatorname{Rad}(F \circ S)\right]$$
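This inequality can be checked numerically in a toy setting. The sketch below uses illustrative assumptions not taken from the article: a finite domain Z = {0, ..., 9} with P uniform, and a small random class of functions Z → [0, 1], so that $L_P(f)$ is exact and both sides are estimated by Monte Carlo over samples $S$:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = np.arange(10)                    # finite domain; P is uniform over Z
F = rng.random((20, len(Z)))         # 20 functions Z -> [0, 1]; F[k, z] = f_k(z)
L_P = F.mean(axis=1)                 # exact expected error of each function under P

m, n_trials, n_sigma = 50, 2000, 200
rep_values, rad_values = [], []
for _ in range(n_trials):
    S = rng.integers(0, len(Z), size=m)                  # i.i.d. sample from P
    F_circ_S = F[:, S]                                   # F o S, shape (20, m)
    L_S = F_circ_S.mean(axis=1)                          # empirical error of each function
    rep_values.append((L_P - L_S).max())                 # Rep_P(F, S)
    sigma = rng.integers(0, 2, size=(n_sigma, m)) * 2 - 1
    rad_values.append((sigma @ F_circ_S.T).max(axis=1).mean() / m)  # Rad(F o S)

print("E[Rep_P(F, S)]    ~", np.mean(rep_values))
print("2 * E[Rad(F o S)] ~", 2 * np.mean(rad_values))    # should be at least the value above
```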


Bounding the generalization error


When the Rademacher complexity is small, it is possible to learn the hypothesis class H using empirical risk minimization.


For example (with a binary error function),[2]:328 for every $\delta > 0$, with probability at least $1 - \delta$, for every hypothesis $h \in H$:


$$L_P(h) - L_S(h) \leq 2\operatorname{Rad}(F \circ S) + 4\sqrt{\frac{2\ln(4/\delta)}{m}}$$
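Once the empirical Rademacher complexity, the sample size $m$, and the confidence parameter $\delta$ are fixed, the right-hand side is a plain number. A quick sketch with arbitrary illustrative values:

```python
import math

def generalization_gap_bound(rad_FS, m, delta):
    """Upper bound on L_P(h) - L_S(h), valid with probability at least 1 - delta."""
    return 2 * rad_FS + 4 * math.sqrt(2 * math.log(4 / delta) / m)

# Example: empirical Rademacher complexity 0.05, 1000 samples, 95% confidence.
print(generalization_gap_bound(rad_FS=0.05, m=1000, delta=0.05))
```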


Bounding the Rademacher complexity


Since smaller Rademacher complexity is better, it is useful to have upper bounds on the Rademacher complexity of various function sets. The following rules can be used to upper-bound the Rademacher complexity of a set $A \subset \mathbb{R}^m$.[2]:329–330


1. If all vectors in $A$ are translated by a constant vector $a_0 \in \mathbb{R}^m$, then Rad(A) does not change.


2. If all vectors in $A$ are multiplied by a scalar $c \in \mathbb{R}$, then Rad(A) is multiplied by $|c|$.


3. $\operatorname{Rad}(A+B) = \operatorname{Rad}(A) + \operatorname{Rad}(B)$, where $A + B := \{a + b \mid a \in A,\, b \in B\}$.[3]:56


4. (Kakade & Tewari Lemma) If all vectors in $A$ are operated on by a Lipschitz function, then Rad(A) is (at most) multiplied by the Lipschitz constant of the function. In particular, if all vectors in $A$ are operated on by a contraction mapping, then Rad(A) strictly decreases.


5. The Rademacher complexity of the convex hull of $A$ equals Rad(A).


6. (Massart Lemma) The Rademacher complexity of a finite set grows logarithmically with the set size. Formally, let $A$ be a set of $N$ vectors in $\mathbb{R}^m$, and let $\bar{a}$ be the mean of the vectors in $A$. Then:


$$\operatorname{Rad}(A) \leq \max_{a \in A} \|a - \bar{a}\| \cdot \frac{\sqrt{2\log N}}{m}$$

In particular, if $A$ is a set of binary vectors, the norm is at most $\sqrt{m}$, so:


$$\operatorname{Rad}(A) \leq \sqrt{\frac{2\log N}{m}}$$
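A small sketch comparing Massart's bound with a Monte Carlo estimate of Rad(A). The finite set here is random Gaussian data chosen only for illustration; the variable names are not from the source:

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 30, 100
A = rng.standard_normal((N, m))               # a finite set of N vectors in R^m

# Massart's bound: max_a ||a - a_bar|| * sqrt(2 log N) / m
a_bar = A.mean(axis=0)
massart_bound = np.linalg.norm(A - a_bar, axis=1).max() * np.sqrt(2 * np.log(N)) / m

# Monte Carlo estimate of Rad(A) for comparison
sigma = rng.integers(0, 2, size=(20_000, m)) * 2 - 1
rad_estimate = (sigma @ A.T).max(axis=1).mean() / m

print("Rad(A) estimate :", rad_estimate)
print("Massart's bound :", massart_bound)     # should be >= the estimate
```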


Bounds related to the VC dimension


Let $H$ be a set family whose VC dimension is $d$. It is known that the growth function of $H$ is bounded as:


for all $m > d+1$: $\quad \operatorname{Growth}(H, m) \leq (em/d)^d$

This means that, for every set $h$ with at most $m$ elements, $|H \cap h| \leq (em/d)^d$. The set family $H \cap h$ can be considered as a set of binary vectors over $\mathbb{R}^m$. Substituting this in Massart's lemma gives:


$$\operatorname{Rad}(H \cap h) \leq \sqrt{\frac{2d\log(em/d)}{m}}$$
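A brief illustrative sketch (values of d and m are arbitrary) showing how this VC-based bound decays as the sample size grows:

```python
import math

def vc_rademacher_bound(d, m):
    """The bound sqrt(2 d log(e m / d) / m) obtained via Massart's lemma."""
    return math.sqrt(2 * d * math.log(math.e * m / d) / m)

d = 10
for m in (100, 1_000, 10_000, 100_000):
    print(m, round(vc_rademacher_bound(d, m), 4))
```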

With more advanced techniques (Dudley's entropy bound and Haussler's upper bound[4]) one can show, for example, that there exists a constant $C$ such that any class of $\{0,1\}$-indicator functions with Vapnik–Chervonenkis dimension $d$ has Rademacher complexity upper-bounded by $C\sqrt{\frac{d}{m}}$.



Bounds related to linear classes


The following bounds are related to linear operations on $S$ - a constant set of $m$ vectors in $\mathbb{R}^n$.[2]:332–333


1. Define $A_2 = \{(w \cdot x_1, \ldots, w \cdot x_m) \mid \|w\|_2 \leq 1\}$ = the set of dot-products of the vectors in $S$ with vectors in the unit ball. Then:


$$\operatorname{Rad}(A_2) \leq \frac{\max_i \|x_i\|_2}{\sqrt{m}}$$

2. Define $A_1 = \{(w \cdot x_1, \ldots, w \cdot x_m) \mid \|w\|_1 \leq 1\}$ = the set of dot-products of the vectors in $S$ with vectors in the unit ball of the 1-norm. Then:


$$\operatorname{Rad}(A_1) \leq \max_i \|x_i\|_\infty \cdot \sqrt{\frac{2\log(2n)}{m}}$$
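For the $\ell_2$ case the supremum over the unit ball has a closed form, $\sup_{\|w\|_2 \leq 1} \sum_i \sigma_i (w \cdot x_i) = \|\sum_i \sigma_i x_i\|_2$, so Rad(A_2) is easy to estimate even though $A_2$ is infinite. A sketch on illustrative random data (not from the source):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 5
X = rng.standard_normal((m, n))               # the fixed vectors x_1, ..., x_m in R^n

# Rad(A_2) = (1/m) E_sigma || sum_i sigma_i x_i ||_2  (sup over the unit ball in closed form)
sigma = rng.integers(0, 2, size=(50_000, m)) * 2 - 1
rad_A2 = np.linalg.norm(sigma @ X, axis=1).mean() / m

bound = np.linalg.norm(X, axis=1).max() / np.sqrt(m)   # max_i ||x_i||_2 / sqrt(m)
print("Rad(A_2) estimate:", rad_A2)
print("Upper bound      :", bound)
```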


Bounds related to covering numbers


The following bound relates the Rademacher complexity of a set $A$ to its external covering number - the number of balls of a given radius $r$ whose union contains $A$. The bound is attributed to Dudley.[2]:338


Suppose $A \subset \mathbb{R}^m$ is a set of vectors whose length (norm) is at most $c$. Then, for every integer $M > 0$:


$$\operatorname{Rad}(A) \leq \frac{c \cdot 2^{-M}}{\sqrt{m}} + \frac{6c}{m} \cdot \sum_{i=1}^{M} 2^{-i}\sqrt{\log\left(N^{\text{ext}}_{c \cdot 2^{-i}}(A)\right)}$$

In particular, if $A$ lies in a $d$-dimensional subspace of $\mathbb{R}^m$, then:


$$\forall r > 0: \quad N^{\text{ext}}_{r}(A) \leq \left(2c\sqrt{d}/r\right)^{d}$$

Substituting this in the previous bound gives the following bound on the Rademacher complexity:


$$\operatorname{Rad}(A) \leq \frac{6c}{m}\cdot\left(\sqrt{d\log\left(2\sqrt{d}\right)} + 2\sqrt{d}\right) = O\left(\frac{c\sqrt{d\log d}}{m}\right)$$
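Dudley's bound is a finite sum once the covering-number estimate for a $d$-dimensional subspace is plugged in at the radii $r = c \cdot 2^{-i}$. A sketch with arbitrary illustrative values of c, d, m, evaluating the bound for several truncation levels M:

```python
import math

def dudley_bound(c, d, m, M):
    """Dudley's bound with N_r^ext(A) <= (2 c sqrt(d) / r)^d evaluated at r = c * 2**(-i)."""
    tail = c * 2 ** (-M) / math.sqrt(m)
    entropy_sum = sum(
        2 ** (-i) * math.sqrt(d * math.log(2 ** (i + 1) * math.sqrt(d)))
        for i in range(1, M + 1)
    )
    return tail + 6 * c / m * entropy_sum

c, d, m = 1.0, 5, 1000
for M in (1, 5, 10, 20):
    print(M, round(dudley_bound(c, d, m, M), 5))
```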


Gaussian complexity


Gaussian complexity is a similar complexity measure with a similar meaning, obtained by replacing the Rademacher variables $\sigma_i$ in the definition with Gaussian random variables $g_i$, where the $g_i$ are i.i.d. with zero mean and variance 1, i.e. $g_i \sim \mathcal{N}(0, 1)$.
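A sketch of the corresponding estimator (illustrative helper name and example set, mirroring the Monte Carlo sketch in the Definitions section): the only change is that the $\pm 1$ signs are swapped for standard normal draws.

```python
import numpy as np

def gaussian_complexity_set(A, n_samples=100_000, rng=None):
    """Monte Carlo estimate of the Gaussian complexity of a finite set A in R^m."""
    rng = np.random.default_rng(rng)
    A = np.asarray(A, dtype=float)
    n_vectors, m = A.shape
    g = rng.standard_normal((n_samples, m))   # g_i ~ N(0, 1) replaces the Rademacher signs
    return (g @ A.T).max(axis=1).mean() / m

# Compare with Rad(A) = 1/4 computed in the Examples section.
print(gaussian_complexity_set([[1, 1], [1, 2]]))
```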



References





  1. ^ Balcan, Maria-Florina (November 15–17, 2011). "Machine Learning Theory - Rademacher Complexity" (PDF). Retrieved 10 December 2016.


  2. ^ Chapter 26 in Shalev-Shwartz, Shai; Ben-David, Shai (2014). Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press. ISBN 9781107057135.


  3. ^ Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet (2012). Foundations of Machine Learning. Cambridge, MA: MIT Press. ISBN 9780262018258.


  4. ^ Bousquet, O. (2004). "Introduction to Statistical Learning Theory". Advanced Lectures on Machine Learning. Lecture Notes in Computer Science 3176, 169–207. doi:10.1007/978-3-540-28650-9_8




  • Peter L. Bartlett, Shahar Mendelson (2002). Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research 3: 463–482.

  • Giorgio Gnecco, Marcello Sanguineti (2008). Approximation Error Bounds via Rademacher's Complexity. Applied Mathematical Sciences, Vol. 2, no. 4, 153–176.



