MLE, MAP, Concentration (Pengtao)

1. MLE of Uniform Distributions [5 pts]

Given a set of i.i.d. samples X1, ..., Xn ∼ Uniform(0, θ), find the maximum likelihood estimator of θ.

(a) Write down the likelihood function (3 pts)

(b) Find the maximum likelihood estimator (2 pts)
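
A numerical sanity check for parts (a) and (b), offered as a sketch rather than a solution. It assumes the usual Uniform(0, θ) density, 1/θ on [0, θ] and 0 elsewhere. The names `uniform_likelihood`, `true_theta`, `sample`, and `grid` are illustrative and do not come from the assignment.

```python
import random


def uniform_likelihood(theta, xs):
    """Likelihood of i.i.d. Uniform(0, theta) data.

    Equals theta**(-n) when every observation lies in [0, theta], else 0.
    """
    if theta <= 0 or any(x < 0 or x > theta for x in xs):
        return 0.0
    return theta ** (-len(xs))


random.seed(0)
true_theta = 2.0  # hypothetical value, used only to simulate data
sample = [random.uniform(0.0, true_theta) for _ in range(20)]

# Evaluate the likelihood on an illustrative grid of candidate thetas.
grid = [0.01 * k for k in range(1, 401)]
best_theta = max(grid, key=lambda t: uniform_likelihood(t, sample))

print("largest observation:", max(sample))
print("grid maximizer:     ", best_theta)
```

Comparing the two printed numbers suggests what the closed-form estimator in part (b) should look like; the grid search only approximates it to within the grid spacing.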

2. Concentration [5 pts]

The instructors would like to know what percentage of the students like the Introduction to Machine Learning course. Let this unknown quantity, hopefully very close to 1, be denoted by µ. To estimate µ, the instructors created an anonymous survey which contains this question:

"Do you like the Intro to ML course? Yes or No"

Each student can answer this question only once, and we assume that the answers are i.i.d.

(a) What is the maximum likelihood estimate of µ? (1 pt)

(b) Let the above estimator be denoted by µ̂. How many students should the instructors ask if they want the estimate µ̂ to be close enough to the unknown µ that

P(|µ̂ − µ| > 0.1) < 0.05? (4 pts)
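
One common route to a sample size of this kind is Hoeffding's inequality, P(|µ̂ − µ| > ε) ≤ 2 exp(−2nε²) for the average of n i.i.d. variables in [0, 1]. The sketch below assumes that bound is the intended tool, which the assignment does not state. The function names and the value µ = 0.9 used in the simulation are hypothetical.

```python
import math
import random


def hoeffding_sample_size(eps, delta):
    """Smallest integer n with 2 * exp(-2 * n * eps**2) < delta."""
    return math.floor(math.log(2.0 / delta) / (2.0 * eps ** 2)) + 1


def deviation_frequency(n, mu, eps, trials, rng):
    """Monte Carlo estimate of P(|mu_hat - mu| > eps) for n Bernoulli(mu) answers."""
    bad = 0
    for _ in range(trials):
        yes = sum(1 for _ in range(n) if rng.random() < mu)
        if abs(yes / n - mu) > eps:
            bad += 1
    return bad / trials


n = hoeffding_sample_size(eps=0.1, delta=0.05)
rng = random.Random(0)
freq = deviation_frequency(n, mu=0.9, eps=0.1, trials=2000, rng=rng)  # mu is hypothetical

print("n from the Hoeffding bound:", n)
print("simulated deviation frequency:", freq)
```

The bound holds for every µ, so it is conservative: the simulated frequency is typically far below 0.05 at this n.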

3. MAP of Multinomial Distribution [10 pts]

You have just got a loaded 6-sided die from your statistician friend. Unfortunately, he does not remember its exact probability distribution p1, p2, ..., p6. He remembers, however, that he generated the vector (p1, p2, . . . , p6) from the following Dirichlet distribution:

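$$P(p_1, p_2, \ldots, p_6) = \frac{\Gamma\left(\sum_{i=1}^{6} u_i\right)}{\prod_{i=1}^{6} \Gamma(u_i)} \prod_{i=1}^{6} p_i^{u_i - 1} \, \delta\left(\sum_{i=1}^{6} p_i - 1\right),$$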

where he chose ui = i for all i = 1, . . . , 6. Here Γ denotes the gamma function, and δ is the Dirac delta. To estimate the probabilities p1, p2, . . . , p6, you roll the die 1000 times and observe that side i occurred ni times ($\sum_{i=1}^{6} n_i = 1000$).

(a) Prove that the Dirichlet distribution is a conjugate prior for the multinomial distribution.

(b) What is the posterior distribution of the side probabilities, P(p1, p2, . . . , p6|n1, n2, . . . , n6)?
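
The standard Dirichlet-multinomial update, which parts (a) and (b) ask you to derive, adds the observed counts to the prior parameters. The sketch below applies that update numerically. The counts are hypothetical (the assignment only fixes their sum at 1000), and `dirichlet_posterior` and `dirichlet_mode` are illustrative names.

```python
def dirichlet_posterior(prior, counts):
    """Dirichlet(prior) combined with multinomial counts gives Dirichlet(prior_i + count_i)."""
    return [u + n for u, n in zip(prior, counts)]


def dirichlet_mode(alpha):
    """Mode of Dirichlet(alpha); this formula assumes every alpha_i > 1."""
    total = sum(alpha) - len(alpha)
    return [(a - 1) / total for a in alpha]


prior = [1, 2, 3, 4, 5, 6]               # u_i = i, as in the problem
counts = [100, 120, 150, 180, 200, 250]  # hypothetical n_i summing to 1000
posterior = dirichlet_posterior(prior, counts)
map_estimate = dirichlet_mode(posterior)

print("posterior parameters:", posterior)
print("posterior mode:      ", map_estimate)
```

The posterior mode is the MAP estimate of the side probabilities. With 1000 rolls it stays close to the raw frequencies n_i / 1000, because the prior contributes only 21 pseudo-counts in total.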

Linear Regression (Dani)

1. Optimal MSE rule [10 pts]

Suppose we knew the joint distribution $P_{XY}$. The optimal rule $f^* : X \to Y$ that minimizes the MSE (mean squared error) is given by:

$$f^* = \arg\min_f \mathbb{E}\left[(f(X) - Y)^2\right]$$

Show that $f^*(X) = \mathbb{E}[Y \mid X]$.
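
An empirical illustration of the claim, not a proof. On simulated data, the per-value sample mean of Y plays the role of E[Y | X], and its squared error is compared with two other predictors. The joint distribution used here (X uniform on {0, 1, 2}, Y = X² plus Gaussian noise) is a hypothetical choice, as are all names in the code.

```python
import random
from collections import defaultdict

rng = random.Random(0)

# Hypothetical joint distribution: X uniform on {0, 1, 2}, Y = X**2 + N(0, 1) noise.
data = []
for _ in range(5000):
    x = rng.choice([0, 1, 2])
    data.append((x, x ** 2 + rng.gauss(0.0, 1.0)))


def mse(predict, pairs):
    """Average squared error of a predictor over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in pairs) / len(pairs)


# Empirical conditional mean of Y for each value of X.
groups = defaultdict(list)
for x, y in data:
    groups[x].append(y)
cond_mean = {x: sum(ys) / len(ys) for x, ys in groups.items()}

mse_cond = mse(lambda x: cond_mean[x], data)
mse_shifted = mse(lambda x: cond_mean[x] + 0.3, data)  # conditional mean plus a bias
mse_linear = mse(lambda x: 2.0 * x - 0.5, data)        # an arbitrary linear rule

print(mse_cond, mse_shifted, mse_linear)
```

The shifted predictor loses by exactly the squared bias (0.3² = 0.09, up to rounding), which mirrors the cross-term argument usually used in the proof.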

2. Ridge Regression [10 pts]

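In class, we discussed $\ell_2$-penalized linear regression:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (Y_i - X_i \beta)^2 + \lambda \|\beta\|_2^2$$

where $X_i = [X_i^{(1)} \ldots X_i^{(p)}]$.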

a) Show that a closed-form expression for the ridge estimator is $\hat{\beta} = (A^T A + \lambda I)^{-1} A^T Y$, where $A = [X_1; \ldots; X_n]$ and $Y = [Y_1; \ldots; Y_n]$.

b) An advantage of ridge regression is that a unique solution always exists, since $(A^T A + \lambda I)$ is invertible for $\lambda > 0$. To be invertible, a matrix needs to be full rank. Argue that $(A^T A + \lambda I)$ is full rank by characterizing its $p$ eigenvalues in terms of the singular values of $A$ and $\lambda$.
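
Both parts can be checked numerically before attempting the proofs. The sketch below assumes the penalized objective is the sum of squared residuals plus λ times the squared ℓ2 norm of β, uses NumPy, and works on random data with hypothetical sizes (n = 50, p = 5, λ = 0.7). It verifies that the gradient vanishes at the closed-form estimator and compares the eigenvalues of AᵀA + λI with the squared singular values of A shifted by λ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 0.7  # hypothetical sizes and penalty, with lam > 0
A = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Closed form from part (a), computed with a linear solve instead of an explicit inverse.
M = A.T @ A + lam * np.eye(p)
beta = np.linalg.solve(M, A.T @ Y)

# Gradient of sum_i (Y_i - X_i beta)^2 + lam * ||beta||^2, evaluated at beta.
grad = -2 * A.T @ (Y - A @ beta) + 2 * lam * beta

# Eigenvalues of M versus squared singular values of A shifted by lam.
# With n >= p, A has exactly p singular values, so the two arrays line up.
eig = np.sort(np.linalg.eigvalsh(M))
sing = np.linalg.svd(A, compute_uv=False)
shifted = np.sort(sing ** 2 + lam)

print("max |gradient|:", np.max(np.abs(grad)))
print("max |eig - (sigma^2 + lam)|:", np.max(np.abs(eig - shifted)))
```

When n < p, A has fewer than p nonzero singular values and the remaining eigenvalues of AᵀA are zero, which is the case where the λ shift matters most.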

Logistic Regression (Prashant)

1. Overfitting and Regularized Logistic Regression [10 pts]

a) Plot the sigmoid function $1/(1 + e^{-wX})$ vs. $X \in \mathbb{R}$ for increasing weight $w \in \{1, 5, 100\}$. A qualitative sketch is enough. Use these plots to argue why a solution with large weights can cause logistic regression to overfit.
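
If a plotting tool is not at hand, tabulating the function at a few points already shows the trend. This is a minimal sketch: the evaluation points in `xs` are an arbitrary illustrative choice, and `sigmoid` is written in a numerically stable form so that large values of |wX| do not overflow.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


xs = [-1.0, -0.1, 0.0, 0.1, 1.0]  # illustrative evaluation points
table = {w: [sigmoid(w * x) for x in xs] for w in (1, 5, 100)}

for w, row in table.items():
    print(f"w={w:>3}: " + "  ".join(f"{v:.4f}" for v in row))
```

As w grows, the values at X = ±0.1 move from near 0.5 toward 0 and 1, so the curve approaches a step function at X = 0.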

b) To prevent overfitting, we want the weights to be small. To achieve this, instead of maximum conditional likelihood estimation, M(C)LE, for logistic regression:

$$\max_{w_0, \ldots, w_d} \prod_{i=1}^{n} P(Y_i \mid X_i, w_0, \ldots, w_d),$$

we can consider maximum conditional a posteriori estimation, M(C)AP:

$$\max_{w_0, \ldots, w_d} \left( \prod_{i=1}^{n} P(Y_i \mid X_i, w_0, \ldots, w_d) \right) P(w_0, \ldots, w_d)$$

where P (w0, . . . , wd) is a prior on the weights.

Assuming a standard Gaussian prior N (0, I) for the weight vector, derive the gradient ascent update rules for the weights.
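
A minimal sketch of what such an update can look like in code, under assumptions the assignment leaves open: labels Y in {0, 1}, P(Y = 1 | X, w) given by the sigmoid of w0 + Σ wj Xj, and the N(0, I) prior applied to every weight including w0. Under those assumptions the gradient of the log posterior with respect to wj is Σi (Yi − pi) Xij − wj. The data set, step size, and function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def map_gradient_step(w, X, y, step):
    """One gradient-ascent step on the log posterior with an N(0, I) prior.

    w[0] is the intercept w0; each row of X holds the d features of one example.
    """
    grad = [-wj for wj in w]              # gradient of the log prior is -w
    for xi, yi in zip(X, y):
        feats = [1.0] + list(xi)          # constant feature for w0
        p = sigmoid(sum(wj * fj for wj, fj in zip(w, feats)))
        for j, fj in enumerate(feats):
            grad[j] += (yi - p) * fj      # gradient of the log likelihood
    return [wj + step * gj for wj, gj in zip(w, grad)]


def log_posterior(w, X, y):
    """Log likelihood plus log prior, up to an additive constant."""
    total = -0.5 * sum(wj * wj for wj in w)
    for xi, yi in zip(X, y):
        z = w[0] + sum(wj * fj for wj, fj in zip(w[1:], xi))
        # log P(y | x, w) = y * z - log(1 + exp(z)), written stably
        total += yi * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total


# Tiny hypothetical, linearly separable data set with one feature.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
w = [0.0, 0.0]
history = [log_posterior(w, X, y)]
for _ in range(200):
    w = map_gradient_step(w, X, y, step=0.1)
    history.append(log_posterior(w, X, y))

print("weights:", w)
print("log posterior:", history[-1])
```

On separable data like this, the unregularized likelihood keeps improving as the weight grows without bound, whereas the −wj term from the prior keeps the MAP weights finite.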
