
Glossary

A

B

bias
The bias of an estimator is the difference between its mean and the true value.
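In the standard notation, for an estimator \(\hat{\theta}\) of a quantity \(\theta\):
\[ \mathrm{Bias}(\hat{\theta}) \,=\, \mathbb{E}[\hat{\theta}] - \theta , \]
and the estimator is called unbiased when this difference is zero.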
bias-variance tradeoff
The tradeoff between the two sources of an estimator's error: increasing the flexibility of a model typically decreases its bias but increases its variance, and vice versa, so expected error is minimized by balancing the two.
boosting
Combining many weak classifiers, each performing only slightly better than chance, into a single accurate classifier, typically by training them in sequence on re-weighted versions of the training set (as in AdaBoost).
bootstrap
A method of statistical inference in which new training sets are created by re-sampling with replacement from the original training set, so that an example may occur more than once in a given sample.
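As an illustration, a minimal Python sketch of drawing one bootstrap sample (the data values are made up):

import random

def bootstrap_sample(data):
    # Re-sample with replacement: the result has the same size as the
    # original, so some examples may appear more than once, others not at all.
    return [random.choice(data) for _ in data]

print(bootstrap_sample([1, 2, 3, 4, 5]))  # e.g. [2, 5, 2, 1, 4]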

C

canonical
Of a separating hyperplane: scaled to canonical form with respect to the training set, so that the training point closest to the hyperplane satisfies |⟨w, x⟩ + b| = 1.
capacity
The flexibility or expressive power of a learning machine: a measure of the variety of functions it can implement, quantified for example by the VC dimension.
capacity control
Restricting the capacity of the hypothesis space, for example through regularization or Structural Risk Minimization, in order to avoid overfitting and improve generalization.
clustering algorithms
Given a fixed number of clusters, we aim to find a grouping of the objects such that similar objects belong to the same cluster and dissimilar objects to different clusters.
complexity
Any of various measures of the difficulty of a decision problem, computational method, or algorithm; for example, the total number of bits, flops, or operations used, which may be regarded approximately as a function of the size of the problem or of the amount of work involved in its solution.

D

dichotomy
Division into two parts or kinds; in learning theory, an assignment of one of two labels to every instance in a set.

E

empirical data
Data obtained by observation or experiment rather than derived from theory.
Empirical Risk Minimization
The inductive principle of choosing the hypothesis that minimizes the empirical risk, that is, the average loss over the training sample, as a proxy for the true expected risk.
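In symbols, given a loss function \(L\) and a training sample \((x_1, y_1), \ldots, (x_n, y_n)\), the empirical risk of a hypothesis \(h\) is
\[ R_{\mathrm{emp}}(h) \,=\, \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i), \]
and Empirical Risk Minimization selects the \(h\) that minimizes it.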

F

The Principle of Falsification
The logical principle that a theory or hypothesis can be proved false by evidence but never conclusively proved true. Introduced by Karl Popper.
feature extraction
Creating useful new features by combinations (usually linear) of existing features.
function learning
Supervised learning when the output space is a metric space, such as the real numbers.

G

generalization
A measure of the ability of a classifier to perform well on future examples, or such a measure applied to a method to design classifiers.
gradient descent
An iterative optimization method that minimizes a differentiable function by repeatedly stepping in the direction of the negative gradient, the direction of steepest descent.
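A minimal Python sketch on the made-up objective f(w) = (w − 3)², whose minimizer is w = 3:

def grad(w):
    # Derivative of f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w, step = 0.0, 0.1         # arbitrary starting point and learning rate
for _ in range(100):
    w -= step * grad(w)    # step against the gradient
print(w)                   # close to the minimizer w = 3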

H

hyperplane
A translate of the null space of a nonzero linear functional; for example, a three-dimensional space in four dimensions or, more generally, an (n−1)-dimensional flat in an n-dimensional space.
hypothesis space
The set of candidate functions (hypotheses), each mapping inputs to outputs, from which a learning algorithm selects its result.

I

J

K

kernel
The inner product function: a function k(x, z) giving the inner product ⟨φ(x), φ(z)⟩ of the images of its arguments under some feature mapping φ.
kernel classifier
A classifier whose decision function is linear in the feature space induced by a kernel, expressed through kernel evaluations at the training points.
kernel technique
Choosing a kernel, rather than an explicit feature mapping, before applying a learning algorithm; the algorithm then operates in the feature space implied by the kernel without ever computing the mapping.
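A minimal Python/NumPy sketch: a degree-2 polynomial kernel evaluated in the input space agrees with an explicit inner product in the mapped feature space, so the mapping never has to be computed:

import numpy as np

def k(x, z):
    # Degree-2 polynomial kernel, computed directly in the input space.
    return (x @ z) ** 2

def phi(x):
    # The feature mapping that k implicitly corresponds to (for 2-D inputs).
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(k(x, z), phi(x) @ phi(z))  # both approximately 121.0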

L

Lagrangian
The function formed from a constrained optimization problem by adding to the objective each constraint weighted by a Lagrange multiplier; its saddle points characterize the solutions of the original problem.
law of large numbers
The fundamental statistical result that the average of a sequence of n independent, identically distributed random variables tends to their common mean as n tends to infinity. Consequently, the relative frequency of an event in n independent repetitions of an experiment tends to its probability as n increases without limit.
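In symbols, for independent, identically distributed \(X_1, X_2, \ldots\) with common mean \(\mu\):
\[ \bar{X}_n \,=\, \frac{1}{n} \sum_{i=1}^{n} X_i \;\longrightarrow\; \mu \quad \text{as } n \to \infty . \]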
learning machine
Any algorithm or device that implements a learning method, selecting a hypothesis from a hypothesis space on the basis of a training sample.
learning problem
Finding a general rule that explains data, given only a sample of limited size.

M

maximum likelihood estimate
A value of the parameter that maximizes the likelihood function or, more loosely, a stationary point of the likelihood function.
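In symbols, for independent observations \(x_1, \ldots, x_n\) from a model with parameter \(\theta\):
\[ \hat{\theta} \,=\, \arg\max_{\theta} \; \prod_{i=1}^{n} p(x_i \mid \theta) . \]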

N

neural network
A learning machine built from layers of interconnected simple units, each computing a weighted sum of its inputs followed by a nonlinear activation function; the weights are fitted to the training data, typically by gradient descent.

O

Occam’s razor (also Ockham’s razor)
Translates as "Plurality should not be posited without necessity." In other words, we should prefer the simplest hypothesis that fits the data.
optimal separating hyperplane
The hyperplane that separates the two classes and maximizes the distance to the closest point from either class.
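In the standard formulation, for a linearly separable sample \((x_i, y_i)\) with \(y_i \in \{-1, +1\}\), the optimal separating hyperplane solves the quadratic programming problem
\[ \min_{w,\,b} \; \tfrac{1}{2} \lVert w \rVert^{2} \quad \text{subject to} \quad y_i \,(\langle w, x_i \rangle + b) \ge 1 \ \text{for all } i , \]
whose solution maximizes the margin \(2 / \lVert w \rVert\).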
overfitting
Fitting a classifier so closely to the training set that it captures the noise and idiosyncrasies of the particular sample at the expense of performance on future examples; characterized by low training error but high test error.

P

PAC learning
Probably Approximately Correct learning: a framework in which the learner must, with probability at least 1 − δ, output a hypothesis with error at most ε, using a number of examples and an amount of computation polynomial in 1/ε and 1/δ.
posterior probability
The probability of an event conditional on the observations.
preference learning
The problem of supervised learning when the output space is an ordered space.
principal components
Linear combinations of the features chosen to have maximal variance, each uncorrelated with the preceding components.
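A minimal Python/NumPy sketch, taking the first principal component to be the top eigenvector of the sample covariance matrix (the data are random placeholders):

import numpy as np

X = np.random.randn(100, 3)      # toy data: 100 samples, 3 features
C = np.cov(X, rowvar=False)      # sample covariance matrix
vals, vecs = np.linalg.eigh(C)   # eigenvalues in ascending order
pc1 = vecs[:, -1]                # direction of maximal variance
scores = X @ pc1                 # first principal component of each sample
print(pc1, scores.var())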
prior probability
Probabilities specified before seeing the data, and so based on prior experience or belief.

Q

quadratic programming problem
A problem in mathematical programming in which the objective function is quadratic and the constraints are linear.

R

ranks
The elements of the output space in preference learning.
regression
The analysis or measure of the association between a dependent variable and one or more independent variables. It is usually formulated as an equation in which the independent variables have parametric coefficients, which may enable future values of the dependent variable to be predicted.
regularization
A class of methods for avoiding over-fitting that penalize the fit to the training set by a measure of the 'smoothness' of the fitted function.
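Schematically, instead of minimizing the empirical risk alone, one minimizes
\[ R_{\mathrm{emp}}(h) + \lambda\, \Omega(h) , \]
where \(\Omega\) penalizes rough or complex functions and \(\lambda \ge 0\) sets the strength of the penalty.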
ridge regression
Least-squares linear regression with an added penalty proportional to the squared norm of the coefficient vector; a form of regularization that shrinks the coefficients toward zero.
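A minimal Python/NumPy sketch using the closed form \(w = (X^{\top}X + \lambda I)^{-1} X^{\top} y\) (the data and penalty weight are made up):

import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy design matrix
y = np.array([1.0, 2.0, 3.0])                       # toy targets
lam = 0.1                                           # penalty weight
# Solve (X^T X + lam*I) w = X^T y rather than forming an explicit inverse.
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w)  # coefficients shrunk toward zero relative to ordinary least squares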

S

shattered
a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
Statistical learning theory
The theory, developed principally by Vapnik and Chervonenkis, relating the generalization of learning machines to the capacity of the hypothesis space and the size of the training sample; it yields distribution-free bounds on the gap between training error and true error.
Structural Risk Minimization
An inductive principle that arranges the hypothesis space as a nested sequence of subsets of increasing capacity and chooses the hypothesis minimizing a bound that combines empirical risk with a capacity term, trading fit against complexity.
supervised learning
Choosing a classifier from a training set of correctly classified examples.

T

test error
The error rate of a classifier measured on a test set; an estimate of its generalization performance.
test set
A set of examples used only to assess the performance of a fully-specified classifier.
training error
The error rate (average loss) of a classifier measured on the training set; also called the empirical error or empirical risk.
training sample
A sample of input-output pairs.
training set
A set of examples used for learning, that is, to fit the parameters of the classifier.

U

unsupervised learning
Discovering groupings in the training set when none are pre-specified.

V

validation set
A set of examples used to tune the parameters of a classifier.
Vapnik
Vladimir Vapnik, who with Alexey Chervonenkis founded statistical learning theory; co-inventor of the support vector machine.
VC dimension (Vapnik-Chervonenkis dimension)
The VC dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) is defined to be infinite. For example, linear classifiers in the plane have VC dimension 3: some set of three points can be shattered, but no set of four can.

W

X

Y

Z
