(5/5)

Section 1: Generalisation and Data Noise (50 Total Points)

In the lectures, we explored several aspects of the generalisation bound. One simple variant we discussed is that of a finite hypothesis class, where it was shown that the generalisation gap can upper-bounded

by a quantity scaling with O(1/ N) many samples. However, this requires the independent samples to

follow the true distribution. Inspired by the discussions and questions given by students in that lecture, we will consider a very simple method1 of adapting the uniform bound for a finite hypothesis class in the presence of noise.

Setup. Let us set up the scene. Suppose we are interested in the classification of inputs X to classes Y, with corresponding distribution D and corresponding joint probability density function p. We also consider the random variables of input and output classes given by the distribution (X, Y ) ∼ D.

Define a hypothesis class as a set of classifiers of the form f : . The objective of learning is to find the best possible f , where “best” means the classifier that gives us the lowest risk based on some loss function ℓ : R. Recall from the lectures that there are two ways we can aggregate the loss: either using the theoretical distribution to get the true risk, or using the empirical distribution (through an i.i.d. dataset) to get the empirical risk. The following definition will make it clearer.

Definition 1 (Risk2). The (true) risk of a classifier f : X → Y with respect to (w.r.t.) a distribution D

Note the slight difference in notation from lectures. The empirical risk is denoted by a hat

□[f] and each type of risk has an explicit subscript to denote what distribution we are considering. This is so we can compare the true risk and empirical risk for different distributions.

From the lecture, we know that the the difference between these two types of risk — also called the generalisation gap — is important. It determines how close we can get to the true risk given only a finite amount of data. We also know in the lecture that this gap can be bounded no matter what classifier f we choose, using the two-sided Hoeffding’s inequality (see Proposition 1 in Appendix A). You may use this result without proof later.

General noise. To consider the effects data noise has on our generalisation bounds, we will consider two different distributions: (1) a clean distribution which we want to learn from; and (2) a noisy distribution which can cause problems in learning. For example, we consider a noisy distribution as a noisy version

of . The “tilde” □ symbol is used to indicate noisy variants of object we have already introduced. Table 1 in Appendix A summarises the differences.

In this general setup, we have a natural question about the effect of noise: how close can we get to the true risk if we only have a noisy sample of data? Or mathematically, can the generalisation gap when we have noisy data, defined as

be bounded? To answer this, let us first examine the difference between the true risk of clean distribution and that of the noisy version.

1There are more slick bounds available if we use more sophisticated frameworks, i.e., PAC-Bayes.

2Technically, the co-domain of the classifier does not need to match that of the output space. However, we simplify to this case for convenience of notation.

term in Eq. (1.6) is actually a well known distance function called the total variation distance. One nice characteristic of this distance is that it is (polynomially) upper-bounded by the KL-divergence3

A hint for later: The integral in Eq. (1.6) can be simplified to a “summation” if we are considering a countable domain, i.e., a classification task with finite Y.

Thus putting everything together, let us establish a uniform bound for the noisy sample gap (given by Eq. (1.4)) for a finite hypothesis class H.

Label noise. Now that we have a “general” bound for different noises, let us specialise the bound for label noise. In particular, for the next couple questions we will consider label noise for binary classification, where = 1, +1 . Simply, label noise can be summarised the case where class labels used to train have a chance of being incorrect.

Definition 2 ((Asymmetric) Label Noise). Let (X, Y ) ∼ D be a data distribution. Given label flipping probabilities σ−1, σ+1 ∈ [0, 1], the corresponding noisy distribution under (asymmetric) label noise is the distribution (X˜, Y˜) ∼ D˜ in which noisy random variables X˜, Y˜ satisfy the following properties:

(LN1): Label flipping probability Pr(Y˜ = −y | Y = y) = σy for all y ∈ Y; (LN2): Identical input marginals Pr(X˜ = x) = Pr(X = x) for all x ∈ X;

(LN3): Conditional independence Y˜ ⊥ X | Y .

Symmetric label noise is the noise model when σ = σ−1 = σ+1.

Given the definition of asymmetric label noise, we can calculate certain posterior probability quantities which are more intuitive. For instance, one can summarise Definition 2 by considering the noise posterior Pr(Y˜ | X).

Question 1.4: Noisy Posterior for Label Noise (5 Points)

Given a clean distribution D and a noisy version of the distribution D under asymmetric label noise with flipping probabilities σ−1, σ+1 ∈ [0, 1], show that for all y ∈ Y

Pr(Y = y | X) = σ−y • (1 − Pr(Y = y | X)) + (1 − σy) • Pr(Y = y | X). (1.8)

Using Eq. (1.8) we can simply adapt the uniform bound given in the general case for label noise. In particular, we will look at the symmetric label noise version of the bound.

Remark. In Eq. (1.9), we can see that the noise term σ is linear in σ: Thus the noise is maximally harmful with σ = 1 and vanishes when σ = 0.

Section 2: Gaussian Processes and Bayesian Optimisation (50 Total Points)

Gaussian Processes. For reference, we have defined the key concepts and notations of a Gaussian Process in Appendix B.

Gaussian Process Regression. Having defined the prior and the joint distributions (check Ap- pendix B), we are now in position to define the Gaussian Process predictive distribution as,

f∗ | X, y, X∗ ∼ N(¯f∗, cov(f∗)) (2.1)

where we have,

¯f d=ef E[f∗ | X, y, X∗] = K⊤K−1y (2.2a)

cov(f∗) = K∗∗ − K⊤K−1K∗. (2.2b)

If we have only a single point to predict, the above equations simplify to,

f¯ = k⊤K−1y (2.3a)

∗ ∗ y

V(f∗) = k∗∗ − k⊤K−1k∗. (2.3b)

Depending on our choice of the covariance function, we would have to choose its parameters as well. This choice must be diligent, otherwise we will have trouble predicting new points with high confidence. It would be ideal to have a field expert for every exact problem we are trying to solve to inform us of the underlying parameters of the model, but this isn’t always possible. In those cases, we have to treat those covariance function parameters as hyperparameters, which we denote as θ, and optimise them in our algorithm. There are many ways to achieve this hyperparameter optimisation, but here we will maximise the log marginal likelihood with respect to those parameters θ, with the log marginal likelihood defined as,

Now, as you have noticed we are required many times to compute the inverse of the training covariance matrix, Ky−1. Taking the inverse straight away would be a highly inefficient and numerically unstable way, especially as the training dataset grows. So, instead we should use a more appropriate method by utilising techniques from matrix factorisation theory. More specifically, we will perform Cholesky factorisation on the training covariance matrix. An analytical implementation of the steps to implement this decomposition is given in Algorithm 1. Note the backslash notation A b is the vector x which solves Ax = b.

Algorithm 1 Efficient Gaussian Process Regression

1: L ← cholesky(Ky) ▷ Cholesky factorisation for predictions and log marginal likelihood

2: α ← L⊤\(L\y)

3: f¯ ← k⊤α ▷ predictive mean Eq. 2.3

4: v ← L\k∗

5: V(f∗) ← k∗∗ − v⊤v

6: log p(y|X) ← − 1 y⊤α −

▷ predictive variance Eq. 2.3

i log Lii − n log 2π ▷ log marginal likelihood Eq. 2.4

Choice of Covariance. As we discussed before, the choice of covariance plays an important factor in the success of a GP regression. For this assignment, we will consider a general class of Covariance functions, called Mat´ern. This class is defined mathematically as,

21−ν √2νd!ν √2νd!

with positive parameters ν and ℓ. Kν is a modified Bessel function and d = dist(x, y) denotes the pairwise Euclidean distance.

We can easily observe that when ν , we obtain the well-known Radial Basis function (RBF). Therefore, this class of covariance functions is a generalisation of the RBF class. However, RBF can be

too smooth sometimes and this is not always appropriate in a machine learning setting. For machine learning solutions, we usually tend to use two other covariance functions. Mat´ern32 and Mat´ern52, when ν = 3/2 and ν = 5/2 respectfully. Mat´ern32 covariance function works for single-differentiable functions, whereas Mat´ern52 could work for up to twice-differentiable functions. In general, a Gaussian process with Mat´ern covariance is ⌈ν⌉ − 1 times differentiable in the mean-square sense.

The two covariance functions are defined below as,

Bayesian Optimisation - Intuition. Bayesian Optimisation (BayesOpt) is a zeroth-order optimisation method of finding the maximum of expensive cost functions. BayesOpt employs Bayesian techniques of setting a prior over the objective function and combining it with evidence to get a posterior function. This permits a utility-based selection of the next observation to make on the objective function, which must take into account both exploration (sampling from areas of high uncertainty) and exploitation (sampling areas likely to offer improvement over the current best observation). More specifically, it builds a surrogate model for the objective and quantifies the uncertainty in that surrogate using Gaussian process regression, and then uses an acquisition function defined from this surrogate to decide where to sample. We will define all those terms later.

Let’s assume we want to solve the following optimisation problem

max f(x). (2.7)

x∈X

The function defined in 2.7 also has the following important properties:

• The objective function f is L-Lipschitz-continuous in order to use Gaussian process regression.

• f is expensive to evaluate, either in the sense of time or money or other factors that will result in hurdles along the way.

• f lacks known special structure, such as convexity or quasi-convexity (similarly with concavity) to allow us to use known gradient-based optimisation algorithms. We say that f is a black-box in those cases.

Finally, our focus in on finding a global rather than local optimum.

Surrogate Model. As we said before, BayesOpt consists of two main components: a Bayesian statistical model for modelling the objective function, and an acquisition function for deciding where to sample next.

For the surrogate model we will use the Gaussian process regression model that we have implemented before. Since our evaluation functions will be continuous, a GPR is an ideal surrogate model. In general, you can read about surrogate models here.

Improvement-based Acquisition Functions. Now, let’s turn over focus on the acquisition functions. The role of the acquisition function is to guide the search for the optimum. Typically, acquisition functions are defined such that high acquisition corresponds to potentially high values of the objective function, whether because the prediction is high, the uncertainty is great, or both. Maximising the acquisition function is used to select the next point at which to evaluate the function. That is, we wish to sample f at arg maxxu(x | D), where u(•) is the generic symbol for an acquisition function.

There are many types of acquisition functions, but in this assignment we will focus only on improvement- based acquisition functions. This class of functions is based on the random variable Improvement :

I(x) = max{0, f(x) − f(x+) − ξ}, (2.8)

f(xi). Note that ξ ≥ 0 is the exploration-

exploitation trade-off hyperparameter.

Intuitively, I(x) assigns a reward of f(x) f(x+) ξ if f(x) > f(x+)+ξ, and zero otherwise. Therefore, we should select the point that has the highest probability of improving upon the incumbent function value f(x+). This effectively means that we should maximise the probability of the event {Pr (I(x) > 0)} or equivalently maximise the probability of the event {f(x) > f(x+) + ξ}. Another way of seeing this, is by taking the expectation of the utility function u(x) = 1{f(x)>f(x+)+ξ}, which is u(x) = 1 when f(x) > f(x+) + ξ and zero otherwise.

(5/5)

DescriptionIn this final assignment, the students will demonstrate their ability to apply two ma

Path finding involves finding a path from A to B. Typically we want the path to have certain properties,such as being the shortest or to avoid going t

Develop a program to emulate a purchase transaction at a retail store. Thisprogram will have two classes, a LineItem class and a Transaction class. Th

1 Project 1 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of

1 Project 2 Introduction - the SeaPort Project series For this set of projects for the course, we wish to simulate some of the aspects of a number of