# Glossary

## B

bias
The bias of an estimator is the difference between its expected value and the true value of the quantity being estimated.
boosting
A method of combining many weak learners into a single strong classifier by training them in sequence, re-weighting the training examples so that later learners concentrate on the examples earlier learners misclassified.
bootstrap
A technique for statistical inference using training sets created by re-sampling with replacement from the original training set, so that examples may occur more than once in a resample.
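As an illustration, a minimal sketch of drawing one bootstrap resample in plain Python (the function name and toy data are ours, not from any particular library):

```python
import random

def bootstrap_sample(data, rng):
    """Draw a resample of the same size as `data`, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

data = [2.0, 4.0, 6.0, 8.0, 10.0]
resample = bootstrap_sample(data, random.Random(0))

# every element comes from the original set, and repeats are allowed
assert len(resample) == len(data)
assert all(x in data for x in resample)
```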

## C

canonical
The standard or simplest representative of a class of equivalent objects; for a separating hyperplane, the scaling of (w, b) under which the training points closest to the hyperplane satisfy |<w, x> + b| = 1.
capacity
A measure of the richness or flexibility of a class of functions: the variety of data sets the class is able to fit. The VC dimension is one measure of capacity.
capacity control
Restricting the capacity of the hypothesis space in order to improve generalization, for example by regularization or Structural Risk Minimization.
clustering algorithms
Algorithms that, given a fixed number of clusters, find a grouping of the objects such that similar objects belong to the same cluster.
complexity
Any of various measures of the difficulty of a decision problem, computational method, or algorithm; for example, the total number of bits, flops, or operations used, regarded as a function of the size of the problem.

## D

dichotomy
Division into two parts or kinds; in learning theory, an assignment of one of two labels to each instance in a set.

## E

empirical data
Data obtained by observation or experiment rather than derived from theory.
Empirical Risk Minimization
The inductive principle of choosing the hypothesis that minimizes the average loss (the empirical risk) on the training set.

## F

The Principle of Falsification
The logical principle that a theory or hypothesis can only be proved false, never conclusively proved true. Introduced by Karl Popper.
feature extraction
Creating useful new features by combinations (usually linear) of existing features.
function learning
Supervised learning when the output space is a metric space, such as the real numbers.

## G

generalization
A measure of the ability of a classifier to perform well on future examples, or such a measure applied to a method of designing classifiers.

## H

hyperplane
A translate of the null space of a linear functional: a three-dimensional subspace of a four-dimensional space or, more generally, an (n-1)-dimensional affine subspace of an n-dimensional space.
hypothesis space
The set of candidate functions (hypotheses) from which a learning algorithm selects its answer.

## K

kernel
A function k(x, z) giving the inner product <phi(x), phi(z)> of two examples after they are mapped into a feature space by some map phi, computed without forming phi(x) and phi(z) explicitly.
kernel classifier
A classifier that is a linear function in the feature space induced by a kernel, and so can be written entirely in terms of kernel evaluations on training examples.
kernel technique
Choosing a kernel, rather than an explicit feature mapping, before applying a learning algorithm.
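For instance, the homogeneous polynomial kernel of degree 2 in two dimensions equals the inner product under an explicit feature map, which the kernel lets us avoid computing (a small sketch; the points and helper names are ours):

```python
import math

def phi(x):
    # explicit feature map for the degree-2 homogeneous polynomial
    # kernel in two dimensions: (x1^2, sqrt(2) x1 x2, x2^2)
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    # k(x, z) = <x, z>^2, computed without forming phi explicitly
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
# the kernel value agrees with the inner product in feature space
assert abs(poly_kernel(x, z) - dot(phi(x), phi(z))) < 1e-9
```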

## L

Lagrangian
The function formed by adding to the objective of a constrained optimization problem its constraints, each weighted by a Lagrange multiplier; used to derive optimality conditions and dual problems.
law of large numbers
The fundamental statistical result that the average of a sequence of n independent, identically distributed random variables tends to their common mean as n tends to infinity; hence the relative frequency of an event in n independent repetitions of an experiment tends to its probability as n increases without limit.
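A quick simulation of the relative-frequency statement, using a fair coin (event probability 0.5) as the illustrative experiment:

```python
import random

rng = random.Random(42)
n = 100_000
# count "heads" in n independent fair-coin flips
heads = sum(rng.random() < 0.5 for _ in range(n))
freq = heads / n

# for large n the relative frequency is close to the true probability 0.5
assert abs(freq - 0.5) < 0.01
```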
learning machine
Any algorithm or system that implements a family of functions and selects one of them on the basis of training data.
learning problem
The problem of finding a general rule that explains data, given only a sample of limited size.

## M

maximum likelihood estimate
A parameter value that maximises the likelihood function or, more loosely, a stationary point of the likelihood function.
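For Bernoulli data the maximum likelihood estimate is the sample proportion of successes; a grid-search sketch (the toy counts are ours) confirms this numerically:

```python
import math

# observed Bernoulli data: 7 successes out of 10 trials
k, n = 7, 10

def log_likelihood(p):
    # log of p^k (1-p)^(n-k)
    return k * math.log(p) + (n - k) * math.log(1 - p)

# grid search over candidate values of p in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)

# the maximum is attained at the sample proportion k/n
assert abs(p_hat - k / n) < 1e-9
```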

## N

neural network
A learning machine built from layers of simple interconnected units, each computing a weighted sum of its inputs followed by a nonlinear activation; the weights are fitted from training data.

## O

Occam’s razor (also Ockham’s razor)
The maxim, translated from the Latin as "Plurality should not be posited without necessity"; in other words, we should prefer the simplest hypothesis that fits the data.
optimal separating hyperplane
The hyperplane that separates two classes of training points and maximizes the distance to the closest point from either class.
overfitting
Fitting the training set too closely, so that the classifier captures noise and idiosyncrasies of the sample and performs poorly on future examples.

## P

PAC learning
Probably Approximately Correct learning: Valiant's framework in which a learner must, with probability at least 1 − δ, output a hypothesis whose error is at most ε, from a sample whose size is polynomial in 1/ε and 1/δ.
posterior probability
The probability of an event conditional on the observations.
preference learning
The problem of supervised learning when the output space is an ordered set, so that the outputs are ranks.
principal components
Linear combinations of the original features chosen to have maximal variance, each uncorrelated with the preceding components.
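A sketch of extracting the first principal component of toy two-dimensional data by power iteration on the sample covariance matrix (the data and tolerances are illustrative):

```python
import math

# toy 2-D data lying mostly along the direction (1, 1)
data = [(-2.0, -1.9), (-1.0, -1.1), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]

# entries of the 2x2 sample covariance matrix
cxx = sum(x * x for x, _ in centered) / (n - 1)
cyy = sum(y * y for _, y in centered) / (n - 1)
cxy = sum(x * y for x, y in centered) / (n - 1)

# power iteration: the leading eigenvector of the covariance matrix
# is the direction of maximal variance, i.e. the first principal component
vx, vy = 1.0, 0.0
for _ in range(100):
    wx = cxx * vx + cxy * vy
    wy = cxy * vx + cyy * vy
    norm = math.hypot(wx, wy)
    vx, vy = wx / norm, wy / norm

# for this data the first principal direction is close to (1, 1) / sqrt(2)
assert abs(abs(vx) - 1 / math.sqrt(2)) < 0.05
assert abs(abs(vy) - 1 / math.sqrt(2)) < 0.05
```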
prior probability
Probabilities specified before seeing the data, and so based on prior experience or belief.

## Q

quadratic programming
A problem in mathematical programming in which the objective function is quadratic and the constraints are linear.

## R

ranks
The elements of the output space in preference learning.
regression
the analysis or measure of the association between a dependent variable and one or more independent variables, usually formulated as an equation in which the independent variables have parametric coefficients, which may enable future values of the dependent variable to be predicted.
regularization
A class of methods of avoiding over-fitting to the training set by penalizing the fit by a measure of 'smoothness' of the fitted function.
ridge regression
Least-squares linear regression with a squared (L2) penalty on the coefficients; a form of regularization that shrinks the fitted coefficients toward zero.
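In one dimension with no intercept the ridge solution has a simple closed form, which makes the shrinkage effect easy to see (the toy data are ours):

```python
# one-dimensional ridge regression without an intercept:
# minimise sum_i (y_i - w x_i)^2 + lam * w^2,
# whose closed-form solution is w = sum(x*y) / (sum(x*x) + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.0, 8.1]

def ridge_1d(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

w_ols = ridge_1d(xs, ys, 0.0)      # ordinary least squares (no penalty)
w_ridge = ridge_1d(xs, ys, 10.0)   # penalised fit

# the penalty shrinks the coefficient toward zero
assert 0 < w_ridge < w_ols
```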

## S

shattered
A set of instances S is shattered by hypothesis space H if and only if, for every dichotomy of S, there exists some hypothesis in H consistent with that dichotomy.
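The definition can be checked mechanically for the class of closed intervals on the real line, whose VC dimension is 2: intervals shatter any two points but cannot realize the labelling (+, −, +) on three. The candidate endpoints below are an assumption sufficient for this particular sample:

```python
from itertools import product

# hypothesis class: closed intervals [a, b] on the real line,
# classifying points inside the interval as positive
def interval(a, b):
    return lambda x: a <= x <= b

def shatters(points, hypotheses):
    """True if every dichotomy of `points` is realised by some hypothesis."""
    for labels in product([False, True], repeat=len(points)):
        if not any(all(hyp(x) == lab for x, lab in zip(points, labels))
                   for hyp in hypotheses):
            return False
    return True

# endpoints interleaved with the sample points are enough here
endpoints = [-1, 0.5, 1.5, 2.5, 4]
hyps = [interval(a, b) for a in endpoints for b in endpoints if a <= b]

assert shatters([1, 2], hyps)         # intervals shatter any 2 points
assert not shatters([1, 2, 3], hyps)  # the dichotomy (+, -, +) is impossible
```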
Statistical learning theory
The theory, developed largely by Vapnik and Chervonenkis, of the conditions under which learning from finite samples generalizes, giving bounds on generalization error in terms of the capacity of the hypothesis space.
Structural Risk Minimization
Vapnik's inductive principle of choosing, from a nested sequence of hypothesis spaces of increasing capacity, the hypothesis that minimizes a bound combining the empirical risk with a capacity term.
supervised learning
Learning a classifier or function from a training set of correctly labelled examples.

## T

test error
The average error of a classifier measured on the test set; an estimate of its generalization error.
test set
A set of examples used only to assess the performance of a fully specified classifier.
training error
The average error (the empirical risk) of a classifier on the training set.
training sample
A sample of input-output pairs.
training set
A set of examples used for learning, that is, to fit the parameters of the classifier.

## U

unsupervised learning
Discovering groupings in the training set when none are pre-specified.

## V

validation set
A set of examples used to tune the parameters of a classifier.
Vapnik
Vladimir N. Vapnik, who with Alexey Chervonenkis founded statistical learning theory; co-inventor of the support vector machine.
VC dimension (Vapnik-Chervonenkis dimension)
The VC dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) is infinite.