Suppose we want to estimate a parameter $\theta$ from i.i.d. samples $D=\{x_1,\dots,x_N\}$. The maximum likelihood estimate maximizes the likelihood

$$L(\theta)=P(D;\theta)=P(x_1,\dots,x_N;\theta)=P(x_1;\theta)P(x_2;\theta)\dots P(x_N;\theta)=\prod_{i=1}^N P(x_i;\theta)$$

$$\hat{\theta}=\arg\max_{\theta}L(\theta)=\arg\max_{\theta}\log L(\theta)$$
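Taking the log is not only algebraically convenient (it turns the product into a sum): numerically, a product of many probabilities underflows double precision, while the sum of logs stays finite. A minimal sketch, with made-up per-sample likelihood values:

```python
import math

# Hypothetical per-sample likelihoods P(x_i; theta): many small numbers.
probs = [1e-4] * 400

# Multiplying them directly underflows to exactly 0.0 ...
product = 1.0
for p in probs:
    product *= p

# ... while the log-likelihood, a sum of logs, stays finite.
log_likelihood = sum(math.log(p) for p in probs)  # 400 * log(1e-4)

print(product)  # 0.0 (underflow)
print(log_likelihood)
```

Since $\log$ is monotone, maximizing $\log L(\theta)$ gives the same $\hat{\theta}$ as maximizing $L(\theta)$.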
Example 1: Bernoulli model, $P(x)=\theta^x(1-\theta)^{1-x}$. Then

$$L(\theta)=\prod_{i=1}^N P(x_i;\theta)=\prod_{i=1}^N\theta^{x_i}(1-\theta)^{1-x_i}=\theta^{\sum_{i=1}^N x_i}(1-\theta)^{\sum_{i=1}^N(1-x_i)}$$

Writing $n_h=\sum_{i=1}^N x_i$ for the number of heads,

$$\log L(\theta)=\log\left[\theta^{\sum_{i=1}^N x_i}(1-\theta)^{\sum_{i=1}^N(1-x_i)}\right]=n_h\log\theta+(N-n_h)\log(1-\theta)$$

$$\frac{\partial\log L(\theta)}{\partial\theta}=\frac{n_h}{\theta}-\frac{N-n_h}{1-\theta}=0$$

$$(1-\theta)n_h=\theta(N-n_h)\Rightarrow\hat{\theta}_{MLE}=\frac{n_h}{N}$$
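The closed form $\hat{\theta}_{MLE}=n_h/N$ can be checked numerically: a brute-force search over $\log L(\theta)$ should peak at the sample frequency of heads. A sketch with hypothetical coin flips:

```python
import math

# Hypothetical data: 1 = heads, 0 = tails.
flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
N = len(flips)
n_h = sum(flips)

theta_closed = n_h / N  # closed-form MLE derived above

def log_likelihood(theta):
    return n_h * math.log(theta) + (N - n_h) * math.log(1 - theta)

# Brute-force maximization over a grid of theta values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)

print(theta_closed)  # 0.7
print(theta_grid)    # 0.7
```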
Example 2: univariate Gaussian model, $P(x)=(2\pi\sigma^2)^{-\frac{1}{2}}\exp\{-(x-\mu)^2/2\sigma^2\}$. Then

$$L(\theta)=\prod_{i=1}^N P(x_i)=\prod_{i=1}^N(2\pi\sigma^2)^{-\frac{1}{2}}\exp\{-(x_i-\mu)^2/2\sigma^2\}$$

$$\log L(\theta)=-\frac{N}{2}\log 2\pi\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2$$

$$\frac{\partial\log L(\theta)}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^N(x_i-\mu)=0\Rightarrow\mu_{MLE}=\frac{1}{N}\sum_{i=1}^N x_i$$

$$\frac{\partial\log L(\theta)}{\partial\sigma^2}=-\frac{N}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{i=1}^N(x_i-\mu)^2=0\Rightarrow\sigma^2_{MLE}=\frac{1}{N}\sum_{i=1}^N(x_i-\mu_{MLE})^2$$
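These two closed forms can be sanity-checked by sampling from a Gaussian with known parameters and confirming the estimates recover them; note that $\sigma^2_{MLE}$ divides by $N$, not $N-1$, so it is the biased variance estimator. A sketch on synthetic data (the true parameter values are arbitrary choices):

```python
import random

random.seed(0)  # deterministic synthetic data
mu_true, sigma_true = 2.0, 1.5
xs = [random.gauss(mu_true, sigma_true) for _ in range(100_000)]

N = len(xs)
mu_mle = sum(xs) / N                              # sample mean
var_mle = sum((x - mu_mle) ** 2 for x in xs) / N  # divides by N (biased)

print(mu_mle)   # close to 2.0
print(var_mle)  # close to 1.5**2 = 2.25
```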
The MLE $\hat{\theta}_{ML}=\frac{n_h}{n_h+n_t}$ fits the observed data exactly, so with few samples it can overfit (e.g., three heads in a row would give $\hat{\theta}=1$). To guard against this, we bring in prior knowledge about $\theta$.
By Bayes' theorem, $P(\theta|D)=\frac{P(D|\theta)P(\theta)}{P(D)}$, where $P(\theta|D)$ is the posterior, $P(D|\theta)$ is the likelihood derived above, $P(\theta)$ is a prior on $\theta$, and $P(D)$ acts as a normalizing constant. Hence $P(\theta|D)\propto P(D|\theta)P(\theta)$. The maximum a posteriori (MAP) estimate is $\hat{\theta}_{MAP}=\arg\max_\theta P(\theta|D)$. When the prior on $\theta$ is uniform, $P(\theta)$ is also a constant, so $P(\theta|D)\propto P(D|\theta)$ and MLE = MAP.
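The claim that a uniform prior makes MAP coincide with MLE can be seen directly: adding the constant $\log P(\theta)$ does not move the argmax. A sketch on the Bernoulli example, with made-up counts:

```python
import math

n_h, N = 7, 10  # hypothetical head count / total count

def log_likelihood(theta):
    return n_h * math.log(theta) + (N - n_h) * math.log(1 - theta)

def log_uniform_prior(theta):
    return 0.0  # log of a constant density on (0, 1)

# Brute-force both objectives over the same grid.
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=log_likelihood)
theta_map = max(grid, key=lambda t: log_likelihood(t) + log_uniform_prior(t))

print(theta_mle, theta_map)  # identical: 0.7 0.7
```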
Example 1: Beta prior, Bernoulli likelihood. Give $\theta$ a Beta prior, $P(\theta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}$. Then the posterior of $\theta$ is

$$P(\theta|D)\propto P(D|\theta)P(\theta)\propto\theta^{n_h}(1-\theta)^{N-n_h}\times\theta^{\alpha-1}(1-\theta)^{\beta-1}=\theta^{n_h+\alpha-1}(1-\theta)^{N-n_h+\beta-1}\propto Beta(\alpha+n_h,\beta+n_t)$$

That is, the prior and the posterior are conjugate: they belong to the same distribution family.

Example 2: Beta prior, binomial likelihood. Again give $\theta$ a Beta prior, $P(\theta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}$, with $P(D|\theta)=C_n^k\theta^k(1-\theta)^{n-k}$. The posterior is

$$P(\theta|D)\propto P(D|\theta)P(\theta)\propto\theta^{\alpha-1}(1-\theta)^{\beta-1}\theta^k(1-\theta)^{n-k}\propto Beta(\alpha+k,\beta+n-k)$$

Example 3: multinomial likelihood with a Dirichlet prior.

Multinomial distribution: $P(x_1,\dots,x_k;n,p_1,\dots,p_k)=\frac{n!}{x_1!\dots x_k!}p_1^{x_1}\dots p_k^{x_k}$

Dirichlet distribution: $P(\theta_1,\dots,\theta_K)=\frac{1}{B(\alpha)}\prod_{i=1}^K\theta_i^{\alpha_i-1}$

The same conjugacy holds here: the posterior is again Dirichlet, $Dir(\alpha_1+x_1,\dots,\alpha_K+x_K)$.
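Because the Beta prior is conjugate, the Bayesian update is just adding the observed counts to the prior's pseudo-counts; compared with the MLE, the resulting estimates are pulled toward the prior, which is exactly the anti-overfitting effect motivated above. A sketch with hypothetical prior parameters and counts:

```python
# Beta(alpha, beta) prior; observe n_h heads and n_t tails.
alpha, beta = 2.0, 2.0  # prior pseudo-counts (hypothetical choice)
n_h, n_t = 7, 3

# Conjugate update: posterior is Beta(alpha + n_h, beta + n_t).
alpha_post = alpha + n_h  # 9.0
beta_post = beta + n_t    # 5.0

theta_mle = n_h / (n_h + n_t)                                # 0.7
theta_post_mean = alpha_post / (alpha_post + beta_post)      # 9/14 ~ 0.643
# Posterior mode (the MAP estimate); valid when alpha_post, beta_post > 1.
theta_map = (alpha_post - 1) / (alpha_post + beta_post - 2)  # 8/12 ~ 0.667

print(theta_mle, theta_post_mean, theta_map)
```

Both Bayesian estimates sit between the MLE (0.7) and the prior mean (0.5), illustrating the shrinkage toward the prior.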
Reference: Harbin Institute of Technology Machine Learning lecture slides (哈工大机器学习PPT)
