The visual pathway (the 'what' pathway [41]) from the retina across the LGN to V1 is modelled. The discussion focuses on: 1. optimal detection in V1 of the retinal image R(x, y); 2. optimal matching between R(x, y) and the firing pattern of neuronal groups in V1; 3. the Kronecker product obtained from optimal detection and matching; 4. determination of the kernel function G(x, y); 5. realization of the Kronecker product; 6. numerical experiments and discussion.
Image division and cortical response
As a first step of visual perception, external stimuli form a retinal image R(x, y). In the retina, a ganglion cell receives inputs from about $10^3$–$10^4$ photoreceptor cells. Let the size of a ganglion cell's receptive field be a = Δx × Δy, and let the total imaging area of R(x, y) in the retina be A; then A is divided into M × N patches of the same size a as a ganglion cell's receptive field, namely A = (M × N)a. Here a is referred to as a sub-region, so a indicates only its size; the small piece of image covered by the area a is called an image unit or image primitive and is written r(a). In this way, the patch (image primitive) in row i and column j can be denoted $r_{i,j}(a)$ (i = 1, 2, ⋯, M; j = 1, 2, ⋯, N), and the entire image contained in the area A is also written R(A), as shown in Figure 1. That is to say, the whole image R(A) can be formed from the (M × N) patches by putting them together in order, i.e.
$$R(A) = \sum_{i=1}^{M}\sum_{j=1}^{N} r_{i,j}(a).$$
Figure 1. Visual image R(x, y) is divided into M × N local patches according to a ganglion cell's receptive field.
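The division and orderly reassembly above can be sketched numerically. The following is a minimal NumPy illustration; the function names and the 64 × 64 example size are ours, not part of the model:

```python
import numpy as np

def divide_image(R, dy, dx):
    """Divide a retinal image R(x, y) into M x N patches r_ij(a),
    each of size a = dy x dx (the assumed receptive-field size)."""
    H, W = R.shape
    M, N = H // dy, W // dx                      # patch grid dimensions
    return R[:M * dy, :N * dx].reshape(M, dy, N, dx).swapaxes(1, 2)

def reassemble(patches):
    """Form R(A) by putting the M x N patches together in order."""
    M, N, dy, dx = patches.shape
    return patches.swapaxes(1, 2).reshape(M * dy, N * dx)

# Example: a 64 x 64 image divided into an 8 x 8 grid of 8 x 8 patches.
R = np.random.rand(64, 64)
patches = divide_image(R, 8, 8)                  # patches[i, j] is r_ij(a)
assert np.allclose(reassemble(patches), R)       # orderly assembly restores R(A)
```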
This division involves two aspects. First, every patch $r_{i,j}(a)$ contains local features (such as receptive-field shapes and orientations) at (i, j) within the area (Δx × Δy), whereas a pixel at (i, j) contains only intensity information. Second, in the parallel multi-channel vision system, every channel deals with only one local patch. This division is thus consistent with the parallel, multi-channel properties of the vision system, and under the topographical mapping any patch can be located in the retina, so neural processing based on functional columns can be realized and a corresponding mathematical description becomes possible. A patch $r_{i,j}(a)$ activates the corresponding ganglion cell, which outputs a coded spike train. This is transferred across the LGN into V1 as a topographical mapping. There the spike train is decoded and the image represented by $r_{i,j}(a)$ is restored. A Kronecker product of the restored image with the functional columns $B_{k,l}(s)$ in V1 [40, 41, 33] leads to firing of neurons whose receptive fields have similar orientation and bandwidth (so-called firing under a preferential stimulus). We denote the firing pattern of a single neuron (or simple cell) by $\phi_{i,j}(b)$, where b denotes the area of the neuron's receptive field. Since the retinal image area A is enlarged in the cortex V1 [42], we take, for simplicity, a magnification factor $z = 2^h$ (h = 0, 1, 2, ⋯); if the area of the image on V1 is denoted by B, we have $B = zA = 2^h A$ and $b = za = 2^h a$. In this way, the orderly spatial sum of all signals $\phi_{i,j}(b)$ forms the overall firing pattern, that is
$$\Phi(B) = \sum_{i=1}^{M}\sum_{j=1}^{N} \phi_{i,j}(b),$$
which represents a reconstruction of the retinal image R(A) on V1.
Optimum detection in V1
For a single neuron in a functional column, the receptive field is characterized by orientation, band-pass properties and spatial location [43, 44], with strong selectivity for the visual image R(x, y) topographically mapped from the retina. When some patch $r_{i,j}(a)$ of the visual image R(x, y) shows properties at a specific orientation and a specific frequency similar to some receptive field $g_{i,j}(b)$, the corresponding neuron responds strongly. In other words, when a local patch coincides with the receptive field of a neuron at its preferred orientation and frequency, the neuron fires most strongly; this amounts to a detection and matching, in a random process, of the functional columns in V1 against the features of the local retinal patch $r_{i,j}(a)$. This process can be expressed as [45]
$$g_{i,j}(b) = a_{i,j}\, r_{i,j}(2^h a) + n_{i,j}, \qquad (1)$$
where $r_{i,j}(2^h a)$ is a patch of the input image R(A) containing basic features such as orientation, edge and contour within the receptive field, and $n_{i,j}$ is Gaussian white noise with zero mean and variance $\sigma_n^2$. It is assumed here that the noise environment in different visual pathways is the same, i.e. the $n_{i,j}$ are identically distributed. The weight coefficient $a_{i,j}$ reflects the excitation strength of an image primitive on the neuron $g_{i,j}(b)$. We want to know what value of $a_{i,j}$ can strongly activate the neuron $g_{i,j}(b)$; evidently $a_{i,j}$ is an unknown coefficient to be estimated. If we make K observations of the random process (1), that is, if we sample the random process (1) K times, then formula (1) can be written in the general form
$$g_{i,j}^{(k)}(b) = a_{i,j}\, r_{i,j}(2^h a) + n_{i,j}^{(k)}, \qquad k = 1, 2, \cdots, K. \qquad (2)$$
According to maximum likelihood estimation and the least-mean-square-error rule [45, 46], the optimum estimate of $a_{i,j}$ is
$$\hat{a}_{i,j} = c\,\frac{\langle r_{i,j}(2^h a),\, g_{i,j}(b)\rangle}{\|r_{i,j}(2^h a)\|\;\|g_{i,j}(b)\|}, \qquad (3)$$
where $c = \|g_{i,j}(b)\| \big/ \|r_{i,j}(2^h a)\|$ is a ratio coefficient. We can see that $\hat{a}_{i,j}$ reaches its optimum value when the image patch $r_{i,j}(a)$ coincides with the receptive-field feature $g_{i,j}(b)$, i.e. $\hat{a}_{i,j}\, r_{i,j}(2^h a) = g_{i,j}(b)$. So the normalized correlation
$$\frac{\langle r_{i,j}(2^h a),\, g_{i,j}(b)\rangle}{\|r_{i,j}(2^h a)\|\;\|g_{i,j}(b)\|}$$
can be taken as the measure of the matching extent between the neuronal receptive field $g_{i,j}(b)$ and the patch $r_{i,j}(a)$ in the image R(A).
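As a sketch of the estimate (3) and of this matching measure, the following NumPy functions compute the normalized correlation and the least-squares weight for the model of formula (1); the function names and the simple averaging over the K observations are our assumptions:

```python
import numpy as np

def matching_measure(r, g):
    """Normalized correlation <r, g> / (||r|| ||g||): the measure of
    matching extent between patch r_ij(a) and receptive field g_ij(b).
    It equals 1 when the two coincide up to a positive scale factor."""
    r, g = r.ravel(), g.ravel()
    return float(r @ g / (np.linalg.norm(r) * np.linalg.norm(g)))

def estimate_weight(r, observations):
    """Least-squares / ML estimate of a_ij in g_k = a_ij * r + n_k,
    averaged over the K noisy observations g_1, ..., g_K."""
    r = r.ravel()
    return float(np.mean([r @ g.ravel() for g in observations]) / (r @ r))
```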
The above process is illustrated in Figure 2, where the local patch is a horizontal edge and an optimal match is found in receptive fields of horizontal orientation, giving a strong response. No optimal matches are found in receptive fields of other orientations, so the responses there are weak.
Figure 2. Selective matching between $r_{i,j}(a)$ (a horizontal edge) and different receptive fields in functional columns. The receptive field $g_{i,j}(b)$ with horizontal orientation responds strongly.
It is worth noting that the multi-scale processing in the visual pathway guarantees a clear image at a proper resolution in V1. The optimal matching is therefore tied to a particular resolution. When the scale or resolution changes, the optimal matching may concentrate on different levels of detail, including or excluding some details, as determined by the circumstances under which the vision system is "observing" the world.
The whole matching between the retinal image and receptive field patterns in V1
In the previous section we described local matching or detection (formula 3). Next we analyze the image matching as a whole, to obtain a mathematical representation of the neural firing pattern in V1. The topographical mapping of the retinal image R(x, y) onto V1 is in effect a stimulation of R(x, y) upon selective simple cells with features of orientation, bandwidth and so on. The receptive fields G(x, y) of the activated neurons combine to form the global firing pattern Φ(x, y) in the cortex, as the response of simple-cell groups to the visual image. The firing process can thus be regarded as a global matching between G(x, y) and R(x, y) at the system level, from which the response pattern Φ(x, y) is eventually formed. Many measures of the extent of matching are possible; in order to ensure a minimal reconstruction error, we adopt the following:
$$\lambda_{RG} = \iint_B R(x, y)\, G(x, y)\, dx\, dy, \qquad (4)$$
where B is the imaging region in V1.
By the Cauchy–Schwarz inequality,
$$\lambda_{RG} \le \left(\iint_B R^2(x, y)\, dx\, dy\right)^{1/2} \left(\iint_B G^2(x, y)\, dx\, dy\right)^{1/2}. \qquad (5)$$
Let
$$\lambda_{MAX} = \left(\iint_B R^2(x, y)\, dx\, dy\right)^{1/2} \left(\iint_B G^2(x, y)\, dx\, dy\right)^{1/2}. \qquad (6)$$
In the optimal matching over region B we have R(x, y) = G(x, y); in this case equality holds in formula (5) and $\lambda_{RG}$ reaches its maximum value $\lambda_{MAX}$. Therefore, we can define the normalized matching coefficient $\rho_{rg}$ as follows:
$$\rho_{rg} = \frac{\lambda_{RG}}{\lambda_{MAX}}. \qquad (7)$$
Obviously, when $\rho_{rg} = 1$, G(x, y) and R(x, y) reach a complete match. In other words, the receptive fields of all activated neurons in V1 combine to form a response pattern Φ(x, y) identical to the whole visual image. This process can be described mathematically by the multiplication of R(x, y) and G(x, y):
$$\Phi(x, y) = R(x, y)\, G(x, y). \qquad (8)$$
As will be seen later, R(x, y) and G(x, y) can be expressed in matrix form, so formula (8) is essentially an inner-product operation; this is not only more elegant mathematically but also clearer in its neurobiological significance.
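A discrete sketch of formulas (4)–(7), with sums standing in for the integrals over B (the function name is ours):

```python
import numpy as np

def rho_rg(R, G):
    """Normalized matching coefficient rho_rg = lambda_RG / lambda_MAX.
    By the Cauchy-Schwarz inequality rho_rg <= 1, with equality exactly
    at a complete match (R proportional to G, e.g. R = G)."""
    lam_RG = float(np.sum(R * G))                        # formula (4)
    lam_MAX = np.linalg.norm(R) * np.linalg.norm(G)      # formula (6)
    return lam_RG / lam_MAX                              # formula (7)
```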
Additionally, we can see that the normalized matching coefficient $\rho_{rg}$ is equivalent to the normalized-correlation matching measure $\langle r_{i,j}(2^h a),\, g_{i,j}(b)\rangle \big/ \big(\|r_{i,j}(2^h a)\|\,\|g_{i,j}(b)\|\big)$ of the previous section.
Determination of integral kernel function
In formula (8), the receptive fields G(x, y) of cortical neurons play the same role as the integral kernel in the wavelet transform [46, 47]. They can be described by the oriented, bandwidth-selective two-dimensional Gabor function $G_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y)$ [34–39]
$$G_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\!\left(2\pi\frac{x'}{\lambda} + \varphi\right), \quad x' = x\cos\theta + y\sin\theta,\;\; y' = -x\sin\theta + y\cos\theta, \qquad (9)$$
where γ is the ratio of the length along the major axis to that along the minor axis, usually set to the constant 0.5; σ is the standard deviation of the Gaussian envelope, determining the size of the receptive field; φ is the phase: when φ = 0 or π, $G_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y)$ is symmetric about the origin, and when φ = -(π/2) or (π/2) it is anti-symmetric about the origin; θ is the optimal orientation; and λ is the wavelength. These parameters should be determined by experimental results from morphology and biophysics, but exact data are not available so far [48]. One plausible way is to set the parameters according to input image features in an input-driven topological mapping [2]. This will be explained in the last section.
Substituting (9) into (8) and considering the cortical responses to orientation and bandwidth, we replace Φ(x, y) with $\Phi_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y)$:
$$\Phi_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y) = R(x, y)\, G_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y). \qquad (10)$$
This, too, is an inner product, since the formula can also be expressed in matrix form. As inner products, formulae (8) and (10) express an important neurobiological fact: in the visual pathway, the topographical mapping positions the retinal image accurately in the visual cortex, so each image primitive can activate only the cells at the corresponding cortical location. Since the excitation is one-to-one, scanning and convolution are no longer needed.
Comparison of inner product with convolution
For comparison, the convolution operation may be expressed as
$$\Phi_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y) = \iint R(u, v)\, G_{\lambda,\sigma,\theta,\varphi,\gamma}(x - u,\, y - v)\, du\, dv. \qquad (11)$$
Convolution and cross-correlation are essentially filtering operations in the frequency domain, which V1 does not need, because such filtering would lose high- and low-frequency information from the retinal picture. A second reason is that the scanning in these operations (convolution and cross-correlation) is computationally costly (see the Discussion section for details): $G_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y)$ must be taken as a template and scanned over the whole image R(x, y) from top to bottom and from left to right. This is clearly not an efficient method, because the scanning would cause too many responses of the corresponding cells and the energy cost would be too great.
Mathematically, the discrete convolution of formula (11) can be expressed as
$$\Phi_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y) = \sum_{k}\sum_{l} R(k, l)\, G_{\lambda,\sigma,\theta,\varphi,\gamma}(x - k,\, y - l). \qquad (12)$$
For a point $(x_0, y_0)$ in the image, $G_{\lambda,\sigma,\theta,\varphi,\gamma}(x_0 - k,\, y_0 - l)$ is moved around the image (by changing k and l) to realize an optimal match between $\Phi(x_k, y_k)$ and R(x, y); the matched simple cell is then activated. Figure 3 shows an example in which the visual image R(x, y) is a small horizontal or vertical line. Obviously, a horizontal line in R(x, y) excites numerous cells whose receptive fields have a roughly horizontal orientation, and a vertical line acts similarly. Such a calculation comes at a high cost in time and complexity.
Figure 3. Convolution operation for horizontal lines (A) and vertical lines (B) in image patches in R(x, y).
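The cost difference can be seen in a small sketch: the topographic inner product of formula (10) needs one multiplication per pixel, while the scanned convolution of formula (12) repeats a template-sized sum at every pixel. The array sizes below are arbitrary:

```python
import numpy as np
from scipy.signal import convolve2d

R = np.random.rand(256, 256)        # retinal image
G = np.random.rand(256, 256)        # topographically mapped field pattern

# Inner product, formula (10): one multiply per pixel, O(P) for P pixels.
phi_inner = R * G

# Convolution, formula (12): a 31 x 31 template scanned over the image,
# O(P * T) operations for a template of T pixels.
template = np.random.rand(31, 31)
phi_conv = convolve2d(R, template, mode='same')
```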
The use of the inner product reveals that neuron firing caused by a visual stimulus is in fact a simple function. This is a simpler neural computation than the cross-correlation and convolution operations, because it needs only multiplications between corresponding pixels of $\Phi_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y)$ and $R(x_k, y_k)$. It is worth pointing out that the product $R(x_k, y_k)\,\Phi_{\lambda,\sigma,\theta,\varphi,\gamma}(x_k, y_l)$ means that the retinal image $R(x_k, y_k)$ excites all cortical cells and forms a global activity pattern $\Phi_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y)$ in V1. Different visual stimuli excite and form different activity patterns corresponding to each stimulus; the differences between activity patterns lie only in the topographically connected weight coefficients linking the pixels of $R(x_k, y_k)$ with the corresponding pixels of $\Phi_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y)$ in a fully mapped neural computation. Generally, the weight coefficients corresponding to detailed image information are much smaller than those corresponding to contour and edge information. The intensity of the spike firing of simple cells excited by the details of the stimulus is likewise weaker than the intensity corresponding to contours and edges.
The inner product $\Phi(x, y) = R(x, y)\, G(x, y)$ in equation (8) reveals the collective calculation of a simple neuronal "on" or "off" function. From this we can see that the inner-product calculation is very well suited to the visual system: it satisfies the prerequisites of efficiency, simplicity and robustness, and also provides an optimal means of detection under the condition of least-mean-square-error reconstruction.
In fact, formula (10) amounts to a specific wavelet transform of the retinal image R(x, y) with basis function $G_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y)$; it describes, at the system level, the neural firing stimulated by the retinal image. Next we discuss how to process visual images according to this formula. Two important problems will be addressed: how to divide the visual image R(x, y) according to the structures and functions of the visual pathway, and how to express the orientation selectivity of functional columns in V1 by the two-dimensional wavelet function $G_{\lambda,\sigma,\theta,\varphi,\gamma}(x, y)$.
Kronecker product in V1
Usually, the visual image R(x, y) is transferred independently to the LGN through about one million ganglion cells, then reaches layers 4Cα and 4Cβ in V1, and finally an image is reproduced in V1. Obviously, each image patch $r_{i,j}(a)$ is transferred through one channel. Suppose the number of channels is M × N, meaning the visual image R(x, y) is divided into M × N units. As pointed out in a previous section, each patch is assumed to have the same size as the receptive field of a ganglion cell, namely a = Δx × Δy, and the area of the whole image is A. Every channel transfers only one patch of $[R_{i,j}(a)]_{M\times N}$, with its local features. All the $r_{i,j}(a)$ together form $[R_{i,j}(a)]_{M\times N}$.
The receptive field of a neuron distributed in V1 is $g_{i,j}(b)$. After stimulation by $[R_{i,j}(a)]_{M\times N}$, a response pattern $[\Phi_{i,j}(b)]_{M\times N}$ (an M × N matrix) is formed from these receptive fields as
$$[\Phi_{i,j}(b)]_{M\times N} = \begin{bmatrix} \phi_{1,1}(b) & \cdots & \phi_{1,N}(b) \\ \vdots & \ddots & \vdots \\ \phi_{M,1}(b) & \cdots & \phi_{M,N}(b) \end{bmatrix}. \qquad (13)$$
According to neurophysiology and neuroanatomy [43], cortical modules are densely distributed in V1, with approximately $10^3$ modules; the area of each module is approximately 1.8 mm × 1.8 mm, containing two functional columns, one for the left eye and one for the right. Thus the area associated with each functional column $B_{k,l}(s)$ is 0.9 mm × 0.9 mm. At the system level, until adequate neurophysiological and neuroanatomical knowledge becomes available, these functional columns are assumed to have the same function and to be composed of many receptive fields with different orientations and frequencies [3].
In this paper, the receptive fields of a functional column are represented as a matrix. As in Figure 4, each row of the matrix contains eighteen oriented receptive fields of the same type, uniformly distributed from 0° to 180°; each column contains eight types of receptive field (orthogonal Gabors of different frequencies) with the same orientation. So $[B_{k,l}(s)]_{K\times L}$ is made up of 144 elements $g_{k,l}(b)$ (k = 1, 2, ⋯, K; l = 1, 2, ⋯, L).
Figure 4. Functional columns as basic information processing units. (A) Eight representative types of receptive fields in function columns in V1; (B) Orientations range from 0° to 180° with a same interval of 10°; (C) An example of receptive field calculated by formula (9).
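A functional column $[B_{k,l}(s)]_{K\times L}$ of this kind can be sketched as an 8 × 18 bank of Gabor fields, reusing the gabor() function from the sketch after formula (9); the wavelength spacing and the sigma-to-wavelength ratio are our assumptions:

```python
import numpy as np

K, L, size = 8, 18, 31
wavelengths = 4.0 * 2.0 ** np.linspace(0.0, 2.0, K)   # assumed frequency types
orientations = np.deg2rad(np.arange(0, 180, 10))      # 18 steps of 10 degrees

# column[k][l] is the receptive field g_kl(b): frequency type k, orientation l.
# gabor() is defined in the sketch after formula (9).
column = [[gabor(size, lam, sigma=0.5 * lam, theta=th, phi=0.0)
           for th in orientations] for lam in wavelengths]
```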
Therefore, an edge or contour located at (i, j) in the retinal image $[R_{i,j}(a)]_{M\times N}$ with area a will find the best match with a receptive field $g_{i,j}(b)$ of the same orientation and shape. When $\rho_{rg} = 1$, the patch at (i, j) in $[R_{i,j}(a)]_{M\times N}$ completely matches the cortical module $[B_{k,l}(s)]_{K\times L}$ at that specific orientation, and the neuron is activated with the strongest response. The process of this image reconstruction is shown in Figure 5.
Figure 5. Optimal matching between a patch (upper right corner of the hat) and receptive fields of specific orientations in cortical modules $[B_{k,l}(s)]_{K\times L}$.
When all M × N patches in the retinal image $[R_{i,j}(a)]_{M\times N}$ simultaneously (in parallel) activate the topographically corresponding neurons, the response pattern $[\Phi_{i,j}(b)]_{M\times N}$ is formed in V1. At the system level, this process can be described by a Kronecker product.
The optimal response of a neuron in V1 is actually reached through the detection by the cortical module $[B_{k,l}(s)]_{K\times L}$ of a patch $r_{i,j}(a)$ in the retinal image $[R_{i,j}(a)]_{M\times N}$, which can be represented as their product:
$$\phi_{i,j}(b) = r_{i,j}(2^h a)\,[B_{k,l}(s)]_{K\times L}\,\big|_{\max}, \qquad (14)$$
where $\big|_{\max}$ for $\{r_{i,j}(2^h a)\,B_{k,l}(s) \mid k = 1, 2, \cdots, 8;\; l = 1, 2, \cdots, 18\}$ means taking the maximum in
$$\big[r_{i,j}(2^h a)\,g_{1,1}(b),\, \cdots,\, r_{i,j}(2^h a)\,g_{1,18}(b);\; \cdots;\; r_{i,j}(2^h a)\,g_{8,18}(b)\big]. \qquad (15)$$
Over i = 1, 2, ⋯, M and j = 1, 2, ⋯, N, this is equivalent to the Kronecker product of $[R_{i,j}(a)]_{M\times N}$ and $[B_{k,l}(s)]_{K\times L}$; the two matrices need not have the same dimensions. It can be represented as
$$[\Phi_{i,j}(b)]_{M\times N} = [R_{i,j}(a)]_{M\times N} \otimes [B_{k,l}(s)]_{K\times L}\,\big|_{\max}, \qquad (16)$$
where ⊗ denotes the Kronecker product, and $\big|_{\max}$ for $\forall\{r_{i,j}(2^h a)\,B_{k,l}(s)\}$, with $\rho_{rg} = 1$ or $\rho_{r\phi} = 1$, denotes the maximum of all products between $r_{i,j}(2^h a)$ and $B_{1,1}(s), B_{1,2}(s), \cdots, B_{8,18}(s)$ (as a result of the improvement in signal-to-noise ratio, the noise is reduced). In the neurobiological sense, only a stimulus with the optimal orientation and frequency can activate the strongest response of a simple cell in V1. Formula (16) represents the activated pattern corresponding to a typical image. This pattern can be represented as a matrix
$$[\Phi_{i,j}(b)]_{M\times N} = \begin{bmatrix} \phi_{1,1}(b) & \cdots & \phi_{1,N}(b) \\ \vdots & \ddots & \vdots \\ \phi_{M,1}(b) & \cdots & \phi_{M,N}(b) \end{bmatrix}. \qquad (17)$$
In formula (17), $[\Phi_{i,j}(b)]_{M\times N}$ is the representation in V1 of the retinal image $[R_{i,j}(2^h a)]_{M\times N}$, and it differs essentially from traditional coding.
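A sketch of the max-selection in formulas (14)–(16): each patch is compared, by inner product, with every field $g_{k,l}$ in the functional column, and the best-matching field determines the firing. The function name is ours, and patches are assumed to have been resampled to the receptive-field size:

```python
import numpy as np

def v1_response(patches, column):
    """For each patch r_ij, take the maximum inner product over all
    receptive fields g_kl in the functional column (formula 14); over
    all (i, j) this realizes the |max Kronecker product of formula (16)."""
    M, N = patches.shape[:2]
    phi = np.zeros((M, N))                    # firing pattern [Phi_ij]
    winners = np.zeros((M, N, 2), dtype=int)  # indices (k, l) of best fields
    for i in range(M):
        for j in range(N):
            r = patches[i, j].ravel()
            scores = np.array([[g.ravel() @ r for g in row] for row in column])
            k, l = np.unravel_index(np.argmax(scores), scores.shape)
            phi[i, j], winners[i, j] = scores[k, l], (k, l)
    return phi, winners
```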
As is well known, an image I(x, y) can be represented as a linear combination of orthogonal basis functions $\psi_{i,j}(x, y)$:
$$I(x, y) = \sum_{i=1}^{M}\sum_{j=1}^{N} a_{i,j}\,\psi_{i,j}(x, y), \qquad (18)$$
where the $a_{i,j}$ are weights. Obviously, the intersection of the supports of the terms $a_{i,j}\psi_{i,j}(x, y)$ is not null, i.e.
$$a_{i,j}\,\psi_{i,j}(x, y) \cap a_{k,l}\,\psi_{k,l}(x, y) \neq \varnothing, \qquad (i, j) \neq (k, l). \qquad (19)$$
That is to say, the intensity at any location (x, y) in the image I(x, y) receives contributions from all basis functions $\psi_{i,j}(x, y)$ (i = 1, 2, ⋯, M; j = 1, 2, ⋯, N), so the computation can be highly complicated. In our case, the orthogonal division of the input image makes every patch $r_{i,j}(a)$ orthogonal, as shown in Figure 6; at the same time, the following condition is satisfied:
$$r_{i,j}(a) \cap r_{k,l}(a) = \varnothing, \qquad (i, j) \neq (k, l). \qquad (20)$$
Figure 6. Orthogonal division of a visual image.
Therefore, every patch $r_{i,j}(a)$ can be processed independently by functional columns. This is consistent with the neural mechanism of cortical information processing, and reduces computational complexity as well.
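A quick check of condition (20) in the discrete setting: distinct patches of the division occupy disjoint pixel sets, so their pointwise product is identically zero. This is a small sketch with an assumed 64 × 64 image and 8 × 8 patches:

```python
import numpy as np

# Embed the supports of two distinct patches in the full image area A:
# the supports are disjoint, so their intersection (overlap) is empty.
support = np.zeros((2, 64, 64))
support[0, 0:8, 0:8] = 1.0        # support of r_11(a)
support[1, 0:8, 8:16] = 1.0       # support of r_12(a)
assert not np.any(support[0] * support[1])   # condition (20) holds
```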