The purpose of a loss function is to guide optimization. In classification, however, labels are discrete: they have no notion of continuity, and the distance between two labels carries no meaning. The squared difference between the prediction and the label therefore does not reflect how well the classification problem is being optimized. For example, suppose the classes are 1, 2, and 3, and a sample X has true class 2. Whether the model predicts 1 or 3, the squared loss is exactly the same, so it clearly cannot measure progress on the classification problem.
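A minimal numeric sketch of this point (the integer class encoding below is the example's own assumption):

```python
# Squared loss cannot tell "how wrong" a class prediction is:
# with true class 2, predicting 1 and predicting 3 cost the same.
def squared_loss(y_true: float, y_pred: float) -> float:
    return (y_true - y_pred) ** 2

print(squared_loss(2, 1))  # 1.0
print(squared_loss(2, 3))  # 1.0 -- identical, even though the classes differ
```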
(1) Let $R = \mathrm{diag}(r^{(1)}, r^{(2)}, \dots, r^{(N)})$. The empirical risk can then be written in matrix form as
$$R(w) = \frac{1}{2}\sum_{n=1}^{N} r^{(n)}\big(y^{(n)} - w^\top x^{(n)}\big)^2 = \frac{1}{2}(Y - Xw)^\top R\, (Y - Xw).$$
Taking the partial derivative,
$$\frac{\partial R(w)}{\partial w} = \frac{1}{2}\,\frac{\partial\big(Y^\top R Y - 2\,Y^\top R X w + w^\top X^\top R X w\big)}{\partial w} = -X^\top R^\top Y + X^\top R X w.$$
Setting the derivative to zero, and using $R^\top = R$ since $R$ is diagonal, gives
$$w^* = (X^\top R X)^{-1} X^\top R\, Y.$$
(2) The weight $r^{(n)}$ assigns a weight to each sample; it is equivalent to giving every sample its own learning rate, or, put differently, to valuing some samples more than others.
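A small NumPy sketch of this closed form (the sample data and the names `X`, `y`, `r` are illustrative assumptions; rows of `X` are samples, matching the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))          # rows are samples
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
r = rng.uniform(0.1, 1.0, size=N)    # per-sample weights r^(n)
R = np.diag(r)

# Closed-form weighted least squares: w* = (X^T R X)^{-1} X^T R y
w_star = np.linalg.solve(X.T @ R @ X, X.T @ R @ y)

# The gradient -X^T R (y - X w) should vanish at w*
grad = -X.T @ R @ (y - X @ w_star)
print(np.allclose(grad, 0))  # True
```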
Recall the theorem: if $A$ and $B$ are $n\times m$ and $m\times s$ matrices, then $\mathrm{rank}(AB) \leq \min\{\mathrm{rank}(A), \mathrm{rank}(B)\}$. Here $X \in \mathbb{R}^{(d+1)\times N}$ and $X^\top \in \mathbb{R}^{N\times(d+1)}$, so $\mathrm{rank}(X) = \mathrm{rank}(X^\top) \le \min(d+1, N)$. Since $N < d+1$, this means $\mathrm{rank}(X) \le N$, and hence
$$\mathrm{rank}(XX^\top) \leq \min\{N, N\} = N < d+1.$$
But $XX^\top$ is a $(d+1)\times(d+1)$ matrix, so it is rank-deficient and therefore not invertible.
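A quick NumPy check of this rank argument (the sizes $N=5$, $d=9$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 9                      # fewer samples than features: N < d + 1
X = rng.normal(size=(d + 1, N))  # X in R^{(d+1) x N}

G = X @ X.T                      # (d+1) x (d+1) matrix XX^T
print(np.linalg.matrix_rank(G))           # 5 = N, strictly less than d + 1 = 10
print(np.isclose(np.linalg.det(G), 0.0))  # True: XX^T is singular
```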
Given
$$R(w) = \frac{1}{2}\|y - X^\top w\|^2 + \frac{1}{2}\lambda\|w\|^2,$$
we show that $w^* = (XX^\top + \lambda I)^{-1}Xy$. Differentiating,
$$\frac{\partial R(w)}{\partial w} = \frac{1}{2}\,\frac{\partial\big(\|y - X^\top w\|^2 + \lambda\|w\|^2\big)}{\partial w} = -X(y - X^\top w) + \lambda w.$$
Setting $\frac{\partial R(w)}{\partial w} = 0$ gives
$$-Xy + XX^\top w + \lambda w = 0 \quad\Longrightarrow\quad (XX^\top + \lambda I)\,w = Xy,$$
that is, $w^* = (XX^\top + \lambda I)^{-1}Xy$.
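A brief NumPy sanity check of the ridge solution (the data and $\lambda$ below are illustrative assumptions; note this exercise's convention puts samples in the columns of `X`, so predictions are $X^\top w$):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 4, 30
X = rng.normal(size=(d, N))      # columns are samples: X in R^{d x N}
y = X.T @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
lam = 0.5

# Ridge closed form: w* = (X X^T + lambda I)^{-1} X y
w_star = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# The gradient -X(y - X^T w) + lambda w should vanish at w*
grad = -X @ (y - X.T @ w_star) + lam * w_star
print(np.allclose(grad, 0))  # True
```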
Given
$$\log p(y\mid X; w, \sigma) = \sum_{n=1}^N \log \mathcal{N}\big(y^{(n)}\mid w^\top x^{(n)}, \sigma^2\big),$$
set
$$\frac{\partial \log p(y\mid X; w, \sigma)}{\partial w} = 0.$$
The only terms of the log-likelihood that depend on $w$ are $-\frac{(y^{(n)} - w^\top x^{(n)})^2}{2\sigma^2}$, so this is equivalent to
$$\frac{\partial\Big(\sum_{n=1}^N -\frac{(y^{(n)} - w^\top x^{(n)})^2}{2\sigma^2}\Big)}{\partial w} = 0
\quad\Longleftrightarrow\quad
\frac{\partial\,\frac{1}{2}\|y - X^\top w\|^2}{\partial w} = 0,$$
and therefore $w^{ML} = (XX^\top)^{-1}Xy$.
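For completeness, a short worked expansion of the Gaussian log-density (a standard identity, not spelled out in the original) showing why only the quadratic term matters:
$$\log \mathcal{N}\big(y^{(n)}\mid w^\top x^{(n)}, \sigma^2\big) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y^{(n)} - w^\top x^{(n)})^2}{2\sigma^2},$$
where the first term does not depend on $w$, so maximizing the log-likelihood over $w$ is exactly minimizing $\frac{1}{2}\|y - X^\top w\|^2$.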
(1) From the maximum-likelihood estimation formula,
$$\frac{\partial \sum_{n=1}^N \log \mathcal{N}(y^{(n)}\mid \mu, \sigma^2)}{\partial \mu} = 0
\quad\Longleftrightarrow\quad
\frac{\partial \sum_{n=1}^N (y^{(n)} - \mu)^2}{\partial \mu} = 0,$$
which gives
$$\mu^{ML} = \frac{1}{N}\sum_{n=1}^N y^{(n)}.$$
(2) By Bayes' rule,
$$p(\mu\mid y, X; \mu_0, \sigma) = \frac{p(y\mid X, \mu;\sigma)\, p(\mu;\mu_0,\sigma_0)}{p(y\mid X)} \propto p(y\mid X, \mu;\sigma)\, p(\mu;\mu_0,\sigma_0),$$
so the maximum a posteriori estimate is
$$\mu^{MAP} = \arg\max_\mu\, p(y\mid X, \mu;\sigma)\, p(\mu;\mu_0,\sigma_0).$$
Taking logs,
$$\log p(\mu\mid y, X;\mu_0,\sigma) = \log p(y\mid X,\mu;\sigma) + \log p(\mu;\mu_0,\sigma_0) + \text{const}
= -\frac{1}{2\sigma^2}\sum_{n=1}^N \big(y^{(n)} - \mu\big)^2 - \frac{1}{2\sigma_0^2}(\mu - \mu_0)^2 + \text{const}.$$
The first term is the log-likelihood and the second acts as a regularization term. As $N$ grows, the likelihood term grows with $N$ while the prior term stays bounded, so the prior's relative influence vanishes and the MAP estimate converges to the ML estimate.
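Setting the derivative of the log-posterior to zero yields a closed form (a standard result, included here as a check rather than taken from the original):
$$\frac{1}{\sigma^2}\sum_{n=1}^N \big(y^{(n)} - \mu\big) - \frac{1}{\sigma_0^2}(\mu - \mu_0) = 0
\quad\Longrightarrow\quad
\mu^{MAP} = \frac{\sigma_0^2 \sum_{n=1}^N y^{(n)} + \sigma^2 \mu_0}{N\sigma_0^2 + \sigma^2},$$
which indeed tends to $\frac{1}{N}\sum_{n=1}^N y^{(n)} = \mu^{ML}$ as $N \to \infty$.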
(1) Start from
$$R(f) = \mathbb{E}_{(x,y)\sim p_r(x,y)}\big[(y - f(x))^2\big].$$
Adding and subtracting the conditional mean $\mathbb{E}_{y\sim p_r(y\mid x)}[y]$,
$$R(f) = \mathbb{E}_{(x,y)\sim p_r(x,y)}\Big[\big(y - \mathbb{E}_{y\sim p_r(y\mid x)}[y] + \mathbb{E}_{y\sim p_r(y\mid x)}[y] - f(x)\big)^2\Big]$$
$$= \mathbb{E}_{(x,y)\sim p_r(x,y)}\Big[\big(y - \mathbb{E}_{y\sim p_r(y\mid x)}[y]\big)^2 + \big(\mathbb{E}_{y\sim p_r(y\mid x)}[y] - f(x)\big)^2 + 2\big(y - \mathbb{E}_{y\sim p_r(y\mid x)}[y]\big)\big(\mathbb{E}_{y\sim p_r(y\mid x)}[y] - f(x)\big)\Big].$$
Conditioned on $x$, the factor $\mathbb{E}_{y\sim p_r(y\mid x)}[y] - f(x)$ is a constant while $\mathbb{E}_{y\sim p_r(y\mid x)}\big[y - \mathbb{E}_{y\sim p_r(y\mid x)}[y]\big] = 0$, so the cross term vanishes. The first term does not depend on $f$; only the middle term can be reduced by learning, so the expected risk is minimized by making it zero, i.e.
$$f^*(x) = \mathbb{E}_{y\sim p_r(y\mid x)}[y].$$
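A small simulation of this fact (the distribution of $y$ given a fixed $x$ is an illustrative assumption): among constant predictions $c$ for a fixed $x$, the squared risk is minimized at the conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=200_000)  # draws of y | x for one fixed x; E[y|x] = 2.0

# Empirical risk E[(y - c)^2] over a grid of candidate predictions c
cs = np.linspace(0.0, 4.0, 401)
risks = np.array([np.mean((y - c) ** 2) for c in cs])
print(cs[risks.argmin()])  # ~2.0: the minimizer is E[y | x]
```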
Causes of high bias:
(1) too few input features;
(2) model complexity that is too low;
(3) a regularization coefficient $\lambda$ that is too large.
Causes of high variance:
(1) too few training samples;
(2) model complexity that is too high;
(3) a regularization coefficient $\lambda$ that is too small;
(4) not using cross-validation.
Putting these together, a model that exhibits both high bias and high variance may be trained on data that has too few samples and, at the same time, too few features.
$$\mathbb{E}_D\big[(f_D(x) - f^*(x))^2\big] = \mathbb{E}_D\big[(f_D(x) - \mathbb{E}_D[f_D(x)] + \mathbb{E}_D[f_D(x)] - f^*(x))^2\big]$$
$$= \mathbb{E}_D\Big[\big(f_D(x) - \mathbb{E}_D[f_D(x)]\big)^2 + \big(\mathbb{E}_D[f_D(x)] - f^*(x)\big)^2 + 2\big(f_D(x) - \mathbb{E}_D[f_D(x)]\big)\big(\mathbb{E}_D[f_D(x)] - f^*(x)\big)\Big]$$
$$= \big(\mathbb{E}_D[f_D(x)] - f^*(x)\big)^2 + \mathbb{E}_D\Big[\big(f_D(x) - \mathbb{E}_D[f_D(x)]\big)^2\Big] + 2\,\mathbb{E}_D\Big[\big(f_D(x) - \mathbb{E}_D[f_D(x)]\big)\big(\mathbb{E}_D[f_D(x)] - f^*(x)\big)\Big].$$
For the cross term, note that over the ensemble of trained models both $\mathbb{E}_D[f_D(x)]$ and $f^*(x)$ are fixed, so writing $C = \mathbb{E}_D[f_D(x)] - f^*(x)$,
$$2\,\mathbb{E}_D\big[C\,(f_D(x) - \mathbb{E}_D[f_D(x)])\big] = 2C\,\mathbb{E}_D[f_D(x)] - 2C\,\mathbb{E}_D[f_D(x)] = 0.$$
Therefore
$$\mathbb{E}_D\big[(f_D(x) - f^*(x))^2\big] = \big(\mathbb{E}_D[f_D(x)] - f^*(x)\big)^2 + \mathbb{E}_D\Big[\big(f_D(x) - \mathbb{E}_D[f_D(x)]\big)^2\Big],$$
i.e. bias$^2$ plus variance.
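A numeric check of this decomposition (the data-generating process and the degree-1 fit below are illustrative assumptions): train the same estimator on many resampled datasets $D$ and compare $\mathbb{E}_D[(f_D(x)-f^*(x))^2]$ against bias$^2$ + variance at a fixed test point.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(3 * x)       # true regression function f*(x)
x_test = 0.5                           # fixed test point x

preds = []
for _ in range(2000):                  # many training sets D
    x = rng.uniform(-1, 1, 20)
    y = f_star(x) + 0.3 * rng.normal(size=20)
    coef = np.polyfit(x, y, deg=1)     # f_D: a line fitted on D
    preds.append(np.polyval(coef, x_test))
preds = np.array(preds)

lhs = np.mean((preds - f_star(x_test)) ** 2)   # E_D[(f_D - f*)^2]
bias2 = (preds.mean() - f_star(x_test)) ** 2   # (E_D[f_D] - f*)^2
var = preds.var()                              # E_D[(f_D - E_D[f_D])^2]
print(np.isclose(lhs, bias2 + var))            # True: error = bias^2 + variance
```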
Unigram bag-of-words features: "我", "打了", "张三"
$$x_1 = [1,1,1]^\top,\qquad x_2 = [1,1,1]^\top$$
Bigram bag-of-words features: "我", "我打了", "打了张三", "张三打了", "打了我", "张三"
$$x_1 = [1,1,1,0,0,1]^\top,\qquad x_2 = [1,0,0,1,1,1]^\top$$
Trigram bag-of-words features: "我", "我打了张三", "张三打了我", "张三"
$$x_1 = [1,1,0,1]^\top,\qquad x_2 = [1,0,1,1]^\top$$
For an n-gram bag-of-words model, the larger n is, the heavier the computation, the larger the parameter space, and the sparser the data. When n is too small, however, the representation is too coarse to be discriminative: the unigram model above ignores word order entirely, so it cannot tell $x_1$ and $x_2$ apart.
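A small sketch that builds such vectors mechanically (the tokenization into 我 / 打了 / 张三 follows the answer above; the helper name `ngrams` is illustrative, and this simple version omits the sentence-boundary features that the answer's bigram and trigram lists also include):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined into strings."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

s1 = ["我", "打了", "张三"]   # 我打了张三
s2 = ["张三", "打了", "我"]   # 张三打了我

for n in (1, 2, 3):
    vocab = sorted(set(ngrams(s1, n)) | set(ngrams(s2, n)))
    x1 = [int(g in ngrams(s1, n)) for g in vocab]
    x2 = [int(g in ngrams(s2, n)) for g in vocab]
    print(n, vocab, x1, x2)  # for n = 1 the two vectors coincide
```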
Precision:
$$P_1 = \frac{1}{2},\qquad P_2 = \frac{2}{4} = \frac{1}{2},\qquad P_3 = \frac{2}{3}$$
Recall:
$$R_1 = \frac{1}{2},\qquad R_2 = \frac{2}{3},\qquad R_3 = \frac{2}{4} = \frac{1}{2}$$
F1 scores:
$$F_1 = \frac{2\times 0.5\times 0.5}{0.5 + 0.5} = 0.5,\qquad
F_2 = \frac{2\times 0.5\times \frac{2}{3}}{0.5 + \frac{2}{3}} = \frac{4}{7},\qquad
F_3 = \frac{2\times \frac{2}{3}\times 0.5}{\frac{2}{3} + 0.5} = \frac{4}{7}$$
Macro average:
$$P_{macro} = \frac{1}{3}\sum_{c=1}^3 P_c = \frac{5}{9},\qquad
R_{macro} = \frac{1}{3}\sum_{c=1}^3 R_c = \frac{5}{9},\qquad
F1_{macro} = \frac{2\times P_{macro}\times R_{macro}}{P_{macro} + R_{macro}} = \frac{5}{9}$$
Micro average:
| Sample | Micro P | Micro R |
| --- | --- | --- |
| 1 | 1 | 1 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 1 |
| 5 | 0 | 0 |
| 6 | 1 | 1 |
| 7 | 1 | 1 |
| 8 | 0 | 0 |
| 9 | 0 | 0 |

Averaging over the nine samples, $P_{micro} = R_{micro} = \frac{5}{9}$, and hence $F1_{micro} = \frac{5}{9}$ as well.
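These numbers can be reproduced with scikit-learn (the label vectors below are one assignment consistent with the per-class counts above, not given in the original):

```python
from sklearn.metrics import precision_recall_fscore_support

# One labeling consistent with the per-class precision/recall counts above
y_true = [1, 1, 2, 2, 2, 3, 3, 3, 3]
y_pred = [1, 3, 2, 2, 1, 3, 3, 2, 2]

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(p, r)   # 0.5555... = 5/9 each, matching P_macro and R_macro
print(f1)     # ~0.5476: sklearn averages per-class F1 rather than using 2PR/(P+R)

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
print(p, r, f1)  # all 0.5555... = 5/9
```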