Statistical model
A statistical or probabilistic model is a mathematical model that incorporates a set of statistical assumptions related to the generation of sample data (and similar data from a larger population). A statistical model represents, often in a highly idealized form, the process of generating data.
A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is 'a formal representation of a theory' (Herman Adèr citing Kenneth Bollen).
All statistical hypothesis tests and all statistical estimators are derived through statistical models. More generally, statistical models are part of the basis of statistical inference.
Introduction
Informally, a statistical model can be considered as a statistical assumption (or a set of statistical assumptions) with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions about the dice.
The first statistical assumption is the following: for each of the dice, the probability that each face (1, 2, 3, 4, 5, and 6) comes up is 1/6. From that assumption, we can calculate the probability that both dice come up 5: 1/6 × 1/6 = 1/36. More generally, we can calculate the probability of any event: for example (1 and 2) or (3 and 3) or (5 and 6).
The alternative statistical assumption is the following: for each of the dice, the probability that the face 5 comes up is 1/8 (because the dice are weighted). From that assumption, we can calculate the probability that both dice come up 5: 1/8 × 1/8 = 1/64. However, we cannot calculate the probability of any other non-trivial event, since the probabilities of the other faces are unknown.
The first statistical assumption constitutes a statistical model: with the assumption alone, we can calculate the probability of any event. The alternative statistical assumption does not constitute a statistical model: with the assumption alone, we cannot calculate the probability of every event.
In the example above, with the first assumption, calculating the probability of an event is easy. However, with some other examples, the computation may be difficult or even impractical (for example, it might require millions of years of computation). For an assumption to constitute a statistical model, such a difficulty is acceptable: doing the calculation need not be practicable, only theoretically possible.
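The calculation under the first assumption can be sketched in a few lines of Python; the enumeration of all 36 outcomes and the extra event "the dice sum to 7" are illustrative choices, not part of the original example.

```python
from fractions import Fraction
from itertools import product

# First assumption: each face of each fair die has probability 1/6.
p_face = Fraction(1, 6)

# Probability that both dice come up 5.
p_both_five = p_face * p_face           # 1/36

# Because every outcome's probability is known, the probability of ANY
# event is computable, e.g. "the two dice sum to 7" (a hypothetical event
# chosen for illustration).
outcomes = product(range(1, 7), repeat=2)
p_sum_seven = sum(p_face * p_face for (a, b) in outcomes if a + b == 7)

print(p_both_five)   # 1/36
print(p_sum_seven)   # 1/6
```

Under the alternative assumption, only events built from "a die shows 5" would be computable, which is why it does not constitute a statistical model.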
Formal definition
In mathematical terms, a statistical model is usually considered as a pair (S, P), where S is the set of possible observations, that is, the sample space, and P is a set of probability distributions on S.
The intuition behind this definition is as follows. It is assumed that there is a "true" probability distribution induced by the process that generates the observed data. We choose P to be a set (of distributions) that contains a distribution that adequately approximates the true distribution.
Note that we do not require that P contains the true distribution, and in practice this is rarely the case. Indeed, as Burnham & Anderson state, "A model is a simplification or approximation of reality and hence will not reflect all of reality", whence the saying "all models are wrong".
The set P is almost always parameterized: P = {P_θ : θ ∈ Θ}. The set Θ defines the parameters of the model. A parameterization is generally required to be such that distinct parameter values give rise to distinct distributions, that is, P_θ1 = P_θ2 must imply θ1 = θ2 (in other words, the mapping θ ↦ P_θ must be injective). A parameterization that meets the requirement is said to be identifiable.
An example
Suppose we have a population of children, with the ages of the children distributed uniformly in the population. The height of a child will be stochastically related to age: for example, when we know that a child is 7 years old, this influences the probability of the child being 1.5 metres tall. We could formalize that relationship in a linear regression model, like this: height_i = b0 + b1 age_i + ε_i, where b0 is the intercept, b1 is a parameter by which age is multiplied to obtain a predicted height, ε_i is the error term, and i identifies the child. This implies that height is predicted by age, with some error.
A plausible model must be consistent with all the data points. Thus, a straight line (height_i = b0 + b1 age_i) cannot be the equation for a model of the data, unless it exactly fits all the data points, that is, all the data points lie perfectly on the line. The error term, ε_i, must be included in the equation, so that the model is consistent with all the data points.
To make a statistical inference, we would first need to assume some probability distributions for the ε_i. For instance, we could assume that the ε_i are iid Gaussian, with zero mean. In this case, the model would have 3 parameters: b0, b1, and the variance of the Gaussian distribution.
We can formally specify the model in the form (S, P) as follows. The sample space, S, of our model comprises the set of all possible pairs (age, height). Each possible value of θ = (b0, b1, σ²) determines a distribution on S; denote that distribution by P_θ. If Θ is the set of all possible values of θ, then P = {P_θ : θ ∈ Θ}. (The parameterization is identifiable, and this is easy to check.)
In this example, the model is determined by (1) specifying S and (2) making certain assumptions relevant to P. There are two assumptions: that height can be approximated by a linear function of age, and that the errors in the approximation are distributed as iid Gaussian. The assumptions are sufficient to specify P, as they are required to do.
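The example above can be sketched numerically. The particular "true" parameter values, the age range, and the sample size below are hypothetical choices for illustration only; they are not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, chosen only for this illustration.
b0_true, b1_true, sigma_true = 0.75, 0.06, 0.05  # metres, metres/year, metres

# Ages distributed uniformly in the population; errors iid Gaussian, zero mean.
n = 500
age = rng.uniform(2, 12, size=n)
height = b0_true + b1_true * age + rng.normal(0.0, sigma_true, size=n)

# Estimate the three parameters (b0, b1, sigma) by least squares.
X = np.column_stack([np.ones_like(age), age])
(b0_hat, b1_hat), *_ = np.linalg.lstsq(X, height, rcond=None)
residuals = height - (b0_hat + b1_hat * age)
sigma_hat = residuals.std(ddof=2)  # 2 regression parameters estimated

print(b0_hat, b1_hat, sigma_hat)   # estimates close to the true values
```

With this simulated data set, the fitted parameters recover the assumed values up to sampling error, which is exactly what the model (S, P) asserts is possible.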
General remarks
A statistical model is a special kind of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is not deterministic. Thus, in a statistical model specified by mathematical equations, some of the variables do not have specific values, but instead have probability distributions; that is, some of the variables are stochastic. In the example above with the children's heights, ε is a stochastic variable; without that stochastic variable, the model would be deterministic.
Statistical models are often used even when the data generation process being modeled is deterministic. For example, flipping a coin is, in principle, a deterministic process; however, it is commonly modeled as stochastic (via a Bernoulli process).
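A minimal sketch of such a stochastic model of coin flipping, assuming a fair coin (p = 0.5 is an illustrative assumption):

```python
import random

random.seed(42)

# Model each flip as a Bernoulli trial with success probability p,
# even though the physical flip is, in principle, deterministic.
p = 0.5
flips = [1 if random.random() < p else 0 for _ in range(10_000)]

freq_heads = sum(flips) / len(flips)
print(freq_heads)  # close to 0.5 for a large number of flips
```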
Choosing an appropriate statistical model to represent a given data generation process is sometimes extremely difficult and may require knowledge of both the process and the relevant statistical analyses. Relatedly, statistician Sir David Cox has said, "The way in which [the] translation of the subject problem into the statistical model is done is often the most critical part of an analysis."
There are three purposes for a statistical model, according to Konishi & Kitagawa:
- Prediction
- Extraction of information
- Description of stochastic structures
Those three purposes are essentially the same as the three purposes outlined by Friendly & Meyer: prediction, estimation, description. The three purposes correspond to the three types of logical reasoning: deductive reasoning, inductive reasoning, abductive reasoning.
Dimension of a model
Suppose that we have a statistical model (S, P) with P = {P_θ : θ ∈ Θ}. The model is said to be parametric if Θ has finite dimension. In notation, we write Θ ⊆ ℝ^k, where k is a positive integer (ℝ denotes the real numbers; other sets can be used, in principle). Here, k is called the dimension of the model.
As an example, if we assume that the data arise from a univariate Gaussian distribution, then we are assuming that P = { P_(μ,σ)(x) = (1/(√(2π) σ)) exp(−(x − μ)²/(2σ²)) : μ ∈ ℝ, σ > 0 }.
In this example, the dimension, k , is equal to 2.
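The two-parameter family can be sketched directly from the density formula above; the evaluation point is an arbitrary illustrative choice.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of P_(mu, sigma): the parameter theta = (mu, sigma) has dimension 2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Evaluating one member of the family at x = 0 (illustrative point).
print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))  # 1/sqrt(2*pi) ≈ 0.3989
```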
As another example, suppose the data consists of points ( x , y ) that we assume are distributed according to a straight line with iid Gaussian residuals (with zero mean): this leads to the same statistical model as used in the example with children's heights. The dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note that in geometry, a straight line has dimension 1.)
Although θ ∈ Θ is formally a single parameter having dimension k, it is sometimes regarded as comprising k separate parameters. For example, with the univariate Gaussian distribution, θ is formally a single parameter with dimension 2, but it is sometimes regarded as comprising 2 separate parameters: the mean and the standard deviation.
A statistical model is nonparametric if the parameter set Θ is infinite dimensional. A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if k is the dimension of Θ and n is the number of samples, both semiparametric and nonparametric models have k → ∞ as n → ∞. If k/n → 0 as n → ∞, then the model is semiparametric; otherwise, the model is nonparametric.
Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said: "These typically involve fewer assumptions of structure and distributional form, but usually contain strong assumptions about independencies."
Nested models
Not to be confused with multilevel models.
Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to obtain the zero-mean distributions. As a second example, the quadratic model
y = b0 + b1 x + b2 x² + ε,  ε ~ N(0, σ²)
has, nested within it, the linear model
y = b0 + b1 x + ε,  ε ~ N(0, σ²):
we constrain the parameter b2 to equal 0.
In both those examples, the first model has a higher dimension than the second model (in the first example, the zero-mean model has dimension 1). Such is often, but not always, the case. As a different example, the set of positive-mean Gaussian distributions, which has dimension 2, is nested within the set of all Gaussian distributions.
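The nesting in the second example can be illustrated numerically: fitting the quadratic model to data that are in fact linear (a hypothetical data set, constructed for illustration) yields an estimate of b2 near 0, and imposing the constraint b2 = 0 recovers the linear model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: actually linear, y = 1 + 2x, with Gaussian noise.
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=200)

# Quadratic model: y = b0 + b1*x + b2*x^2 + eps.
Xq = np.column_stack([np.ones_like(x), x, x ** 2])
beta_q, *_ = np.linalg.lstsq(Xq, y, rcond=None)

# Linear model = quadratic model with the constraint b2 = 0
# (drop the x^2 column, i.e. fix its coefficient at zero).
Xl = Xq[:, :2]
beta_l, *_ = np.linalg.lstsq(Xl, y, rcond=None)

print(beta_q)  # the b2 estimate should be near 0
print(beta_l)
```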
Comparing models
Comparison of statistical models is fundamental to much of statistical inference. Indeed, Konishi & Kitagawa (2008, p. 75) state this: "Most problems in statistical inference can be considered problems related to statistical modeling. They are usually formulated as comparisons of various statistical models."
Common criteria for comparing models include the following: R², Bayes factor, Akaike information criterion, and the likelihood-ratio test together with its generalization, the relative likelihood.
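As a sketch of how such a comparison might proceed with the Akaike information criterion, assuming Gaussian errors (the data set and the two candidate models below are hypothetical illustrations):

```python
import math
import numpy as np

def gaussian_log_likelihood(residuals, sigma):
    """Log-likelihood of iid zero-mean Gaussian residuals."""
    n = len(residuals)
    return (-n / 2 * math.log(2 * math.pi * sigma ** 2)
            - (residuals ** 2).sum() / (2 * sigma ** 2))

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2*log-likelihood; lower is better."""
    return 2 * k - 2 * log_likelihood

# Hypothetical data: a linear trend with Gaussian noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=100)
y = 3.0 * x + rng.normal(0.0, 0.1, size=100)

def fit_aic(X, n_params):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma_mle = resid.std()  # maximum-likelihood estimate (ddof = 0)
    return aic(gaussian_log_likelihood(resid, sigma_mle), n_params)

# Candidate models: linear (b0, b1, sigma) vs quadratic (b0, b1, b2, sigma).
aic_linear = fit_aic(np.column_stack([np.ones_like(x), x]), 3)
aic_quadratic = fit_aic(np.column_stack([np.ones_like(x), x, x ** 2]), 4)
print(aic_linear, aic_quadratic)
```

The model with the lower AIC is preferred; the extra parameter of the quadratic model is penalized unless it improves the likelihood enough to justify it.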
See also
Statistical model validation
Null hypothesis
Opinion poll