Recurrent Neural Networks to Correct Satellite Image Classification Maps

Authors : Samarth Chandra, Rupesh

Convolutional neural networks (CNNs) are increasingly being used for pixelwise semantic labelling of images, despite their origins in image categorization. However, the most popular CNN architectures are good at recognizing objects but not so good at precisely localising them. This issue is exacerbated in aerial and satellite image labelling, where spatially precise object outlining is critical. In the literature, various iterative enhancement algorithms have been presented to progressively improve the coarse CNN outputs, with the aim of sharpening object boundaries around real image edges. However, such algorithms must be carefully designed, selected, and tuned. Instead, we want to learn the iterative process itself. We do this by formulating a generic iterative enhancement process based on partial differential equations, which we show can be expressed as a recurrent neural network (RNN). We then train such a network on manually labelled data for our enhancement task.

The pixelwise labelling of satellite imagery is one of the most explored problems in remote sensing. Such labelling is used in a variety of practical applications, including precision agriculture and urban planning. Satellite data availability and resolution have significantly improved as a result of recent technical advancements. Beyond the issue of computational complexity, these advancements pose new challenges for image processing. The fact that vast surfaces are covered, for example, causes a lot of variation in the appearance of the objects. Furthermore, the fine details in high-resolution images make it difficult to classify pixels from elementary cues alone. As a result, taking context into account is critical for identifying object classes.

Convolutional neural networks (CNNs) are gaining popularity in image classification problems due to their ability to learn specific contextual features on their own. CNNs have already been used in remote sensing, and their powerful recognition capabilities have been demonstrated. Two recent approaches have been proposed to overcome the structural problems that lead to coarse classification maps. One is to employ new types of CNN architectures designed specifically for pixel labelling, which aim to resolve the detection/localization trade-off. Noh et al., for example, extend a base classification CNN with a mirrored “deconvolution” network that learns to upsample coarse classification maps. Another common practice is to use the base CNN as a coarse classifier of the objects' positions, and then process this classification using the original image as guidance, so that the output objects better fit the edges of the real image. While all of these studies have pushed the limits of CNNs' pixel labelling capabilities, they all presume the availability of vast quantities of precisely labelled training data.

Our work deals with more realistic datasets, with the aim of providing a method for refining classification maps that are too coarse due to a lack of precise reference data. Given the nature of the available training data, the first scheme, the use of novel CNN architectures, appears to be infeasible in the context of large-scale satellite imagery.

Our research is also relevant to semantic segmentation of natural images. Fully convolutional networks (FCNs) are particularly common for pixel-by-pixel image labelling. FCNs are made up of a stack of convolutional and pooling layers, followed by deconvolutional layers that upsample the resolution of the classification maps, potentially combining features at different scales.

Rather than predefining the algorithmic specifics as in previous works, we formulate a general iterative refinement algorithm through an RNN and let the network learn the particular algorithm. To our knowledge, very little research has been done on learning an iterative method.

Assume we are given a set of score (or heat) maps uk, one for each possible class k ∈ L, in a pixelwise labelling problem. According to the classifier's predictions, a pixel's score reflects the likelihood of belonging to a class. Every pixel's final class is the one with the highest value of uk.

Alternatively, a softmax function can be used to interpret the scores as probabilities:

P(k) = exp(uk) / Σ_{j ∈ L} exp(uj)
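The two readings of the score maps (highest score wins, or softmax probabilities) can be sketched in a few lines of NumPy. This is only an illustration: the function names `softmax_scores` and `label_map` and the toy numbers are assumptions, not part of the original work.

```python
import numpy as np

def softmax_scores(u):
    """Turn score maps u of shape (K, H, W) into per-pixel class probabilities."""
    # Subtracting the per-pixel maximum avoids overflow and does not change the result.
    shifted = u - u.max(axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def label_map(u):
    """Assign every pixel the class with the highest score."""
    return u.argmax(axis=0)

# Toy example with made-up scores: 3 classes on a 2x2 image.
u = np.array([
    [[2.0, 0.1], [0.3, 1.0]],
    [[1.0, 0.2], [0.3, 3.0]],
    [[0.1, 4.0], [0.3, 0.5]],
])
p = softmax_scores(u)   # probabilities sum to 1 at every pixel
labels = label_map(u)   # ties are resolved in favour of the lowest class index
```

Because softmax is monotonic in the scores, taking the most probable class gives the same labels as taking the highest score.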

In the context of satellite image classification, our aim is to use the score maps uk together with information extracted from the input image (e.g., edge features) to sharpen the scores near real object boundaries and improve the classification.

One way to accomplish this is to use partial differential equations (PDEs) to progressively improve the score maps. In this section, we first go through different types of PDEs that one could envision designing to solve our problem.

Partial differential equations (PDEs)

Different PDE approaches can be devised to enhance classification maps. For example, a diffusion process can be applied to the maps uk. Heat flow is described by the PDE:

∂uk(x)/∂t = div(∇uk(x))

Here div(·) denotes the divergence operator. This diffusion process smooths out the heat maps, but our goal is to design an image-dependent smoothing process that aligns the heat maps with the image features. This can be achieved by modulating the gradient by a scalar function g that depends on the input image I:

∂uk(x)/∂t = div(g(I,x) ∇uk(x))


g(I,x) denotes an edge-stopping function, which takes low values near the edges of I(x) to slow down the smoothing process there. The above equation is similar to Perona-Malik diffusion, except that Perona-Malik uses the smoothed function itself to guide the diffusion. We can replace g(I,x) by a matrix D(I,x), which acts as a diffusion tensor that redirects the flow based on image properties:

∂uk(x)/∂t = div(D(I,x) ∇uk(x))

The above formula corresponds to an anisotropic diffusion process. We can also take inspiration from the level set framework and formulate the differential equation as:

∂uk(x)/∂t = |∇uk(x)| div(g(I,x) ∇uk(x)/|∇uk(x)|)

This formulation favours aligning the zero level set with the minima of g(I,x), which could be used to improve the heat maps uk. These are all different approaches, and our main goal is to let the machine learn which one works well, instead of designing one by trial and error.
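To make the scalar edge-stopping variant concrete, here is a minimal NumPy sketch of one explicit time step of ∂u/∂t = div(g ∇u). It is an illustration under stated assumptions, not the method of the original work: the particular form of `edge_stopping`, the parameter `kappa`, the step size `dt`, and the zero-flux border handling are all hypothetical choices. Passing g = 1 everywhere gives plain heat flow.

```python
import numpy as np

def edge_stopping(image, kappa=0.1):
    """Hypothetical g(I, x): equals 1 in flat regions and drops towards 0 at strong edges."""
    gy, gx = np.gradient(image)
    return 1.0 / (1.0 + (gx**2 + gy**2) / kappa**2)

def diffusion_step(u, g, dt=0.2):
    """One explicit step of du/dt = div(g * grad(u)) with zero flux across the image border."""
    # Forward differences approximate the gradient; the flux leaving the grid stays zero.
    flux_x = np.zeros_like(u)
    flux_y = np.zeros_like(u)
    flux_x[:, :-1] = g[:, :-1] * (u[:, 1:] - u[:, :-1])
    flux_y[:-1, :] = g[:-1, :] * (u[1:, :] - u[:-1, :])
    # Backward differences of the flux approximate the divergence.
    div = flux_x + flux_y
    div[:, 1:] -= flux_x[:, :-1]
    div[1:, :] -= flux_y[:-1, :]
    return u + dt * div

# Toy data: an image with one sharp vertical edge and a noisy score map.
image = np.zeros((8, 8))
image[:, 4:] = 1.0
rng = np.random.default_rng(0)
u0 = rng.random((8, 8))

g = edge_stopping(image)
guided = u0.copy()
heat = u0.copy()
for _ in range(200):
    guided = diffusion_step(guided, g)                 # smoothing slows down at the edge
    heat = diffusion_step(heat, np.ones_like(heat))    # plain heat flow, g = 1
```

With g at most 1 and dt = 0.2, every update is a weighted average of a pixel and its four neighbours, so the scheme stays within the initial range of values and preserves the total score.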

Generic Classification Enhancement Process

PDEs are commonly discretized in space using finite differences, which express derivatives as discrete convolution filters. We use this technique to write a generic discrete formulation of an iterative enhancement process.
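As an illustration of derivatives written as convolution filters, the sketch below applies three standard central-difference stencils to a polynomial whose derivatives are known. The helper `filter2d` and the kernel names are hypothetical; the helper computes a cross-correlation (which is what CNN libraries usually call convolution) and replicates border values.

```python
import numpy as np

def filter2d(u, kernel):
    """Cross-correlate u with a small odd-sized kernel, replicating border values."""
    kh, kw = kernel.shape
    padded = np.pad(u, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros(u.shape, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + u.shape[0], dx:dx + u.shape[1]]
    return out

# Central-difference stencils as 3x3 filters (unit grid spacing).
DX = np.array([[0, 0, 0], [-1, 0, 1], [0, 0, 0]]) / 2.0          # d/dx
DXX = np.array([[0, 0, 0], [1, -2, 1], [0, 0, 0]], dtype=float)  # d^2/dx^2
LAPLACE = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

# Test function u(x, y) = x^2 + 3y, so du/dx = 2x, d2u/dx2 = 2, Laplacian = 2.
ys, xs = np.mgrid[0:6, 0:6].astype(float)
u = xs**2 + 3.0 * ys

ux = filter2d(u, DX)
uxx = filter2d(u, DXX)
lap = filter2d(u, LAPLACE)
```

Away from the border, the filter responses match the analytic derivatives exactly for this quadratic; near the border they depend on the padding choice.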

For a score map uk, differential operations of the type {∂/∂x, ∂/∂y, ∂²/∂x∂y, ∂²/∂x², ...} can be applied through convolution kernels. Let us consider an arbitrary number of feature maps {g1, ..., gp} derived from the image I, together with convolution kernels {M1, M2, ...} applied to the heat map uk and kernels {N^j_1, N^j_2, ...} applied to each feature gj. Instead of directly providing a bank of filters Mi and N^j_i, we let the system learn the filters by itself.

Grouping all features in a single set Φ(uk, I):

Φ(uk, I) = { Mi * uk, N^j_i * gj(I) ; ∀i,j }

Generic discretized scheme:

∂uk(x)/∂t = fk(Φ(uk,I)(x))

fk takes as input the values of all the features in Φ(uk, I) at an image point x and combines them. If we restrict the functions fk to be linear, we still obtain the set of all linear PDEs. PDEs are usually discretized in time, taking the form:

u_{k,t+1}(x) = u_{k,t}(x) + δu_{k,t}(x)

In a simplified scheme, we can operate directly on I by considering only convolutions Nj * I. The set then becomes

Φ(uk, I) = { Mi * uk, Nj * I ; ∀i,j }

The kernels Mi, Nj and N^j_i do not depend on the class k, while fk determines the contribution of each feature to the equation for class k.
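As a worked instance of the linear case, suppose (purely for illustration) that the only kernel is the 5-point Laplacian stencil and that fk simply returns that one feature. With a hypothetical time step τ, the generic scheme then reduces to an explicit discretization of the heat flow equation from the PDE section:

```latex
M_1 = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix},
\qquad
f_k\bigl(\Phi(u_k, I)\bigr)(x) = (M_1 * u_k)(x) \approx \operatorname{div}\bigl(\nabla u_k(x)\bigr)

u_{k,t+1}(x) = u_{k,t}(x) + \delta u_{k,t}(x),
\qquad
\delta u_{k,t}(x) = \tau \, (M_1 * u_{k,t})(x)
```

Learning the kernels and a more general fk lets the same update rule cover image-dependent schemes as well.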

Iterative Processes as RNN

The generic iterative process can be implemented as an RNN. One iteration of the generic classification enhancement process can be expressed in terms of common neural network layers. For simplicity, let us write u_{k,t}(x) as ut.

At each iteration, the image I and the heat map ut to be enhanced at time t are taken as input. At the first iteration, ut is the initial coarse heat map to be improved, which is output by another pre-trained neural network. From ut we derive a series of filter responses, which correspond to Mi * ut in Φ(uk, I). These responses are found by taking the dot product between each filter Mi and the values of uk,t(·) around a given point. A set of filter responses is likewise computed at the same spatial location on the input image, corresponding to Nj * I. These operations are convolutions when performed densely in space. The filter responses are concatenated to form a pool of features Φ. The function δut describes how the heat map ut is updated at iteration t.


δut is modelled through a multilayer perceptron (MLP), since MLPs can approximate functions within a bounded error. It includes one hidden layer with nonlinear activation functions, followed by an output neuron with a linear activation. The value of δut is added to ut to generate the map ut+1. This additive architecture conveys the intention of a progressive refinement of the classification much better. The iterative process is implemented by unrolling a finite number of iterations, and the parameters are shared among all iterations. This is enforced at training time by a simple modification to back-propagation, in which the derivatives of every instance of a weight at different iterations are averaged. The features extracted from the input image I are independent of the iteration.
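The unrolled forward pass described above can be sketched as follows. This is a minimal illustration, not the original implementation (which used Caffe): it handles a single class and a single-band image, uses random untrained weights, and the names `filter_bank` and `refine` are hypothetical. The sizes (32 filters of 5 × 5, 32 hidden neurons, rectified linear activations, five iterations) follow the implementation details given in the text. Training by back-propagation through the iterations is omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def filter_bank(x, kernels):
    """Apply P square odd-sized kernels (P, k, k) to x (H, W); returns (P, H, W)."""
    k = kernels.shape[-1]
    padded = np.pad(x, k // 2, mode="edge")
    windows = sliding_window_view(padded, (k, k))   # shape (H, W, k, k)
    return np.einsum("pij,yxij->pyx", kernels, windows)

def refine(u0, image, params, iterations=5):
    """Unrolled refinement: the same parameters are reused at every iteration."""
    M, N, W1, b1, w2, b2 = params
    image_feats = filter_bank(image, N)             # independent of t, so computed once
    u = u0.astype(float)
    for _ in range(iterations):
        phi = np.concatenate([filter_bank(u, M), image_feats])   # pool of features
        # MLP applied at every pixel: one ReLU hidden layer, then a linear output neuron.
        hidden = np.maximum(0.0, np.einsum("hp,pyx->hyx", W1, phi) + b1[:, None, None])
        delta = np.einsum("h,hyx->yx", w2, hidden) + b2
        u = u + delta                               # u_{t+1} = u_t + delta u_t
    return u

rng = np.random.default_rng(0)
M = rng.normal(scale=0.1, size=(32, 5, 5))    # filters on the heat map
N = rng.normal(scale=0.1, size=(32, 5, 5))    # filters on the image
W1 = rng.normal(scale=0.1, size=(32, 64))     # hidden layer: 64 features -> 32 neurons
b1 = np.zeros(32)
w2 = rng.normal(scale=0.01, size=32)          # linear output neuron
b2 = 0.0

u0 = rng.random((16, 16))      # stand-in for a coarse heat map
image = rng.random((16, 16))   # stand-in for a single-band image
u5 = refine(u0, image, (M, N, W1, b1, w2, b2))
```

Because the update is additive, setting the output weights to zero leaves the coarse map unchanged, which makes the "progressive refinement" reading of the architecture explicit.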

Implementation Details

First, we describe the CNN used to generate the coarse predictions, and then our RNN in detail. The network architectures were implemented using the Caffe deep learning library.

Our coarse prediction CNN is based on a remote sensing network previously presented by Mnih. We construct a fully convolutional version of Mnih's network, since recent remote sensing research has demonstrated the theoretical and practical benefits of this architecture. The CNN takes 3-band colour image patches at 1 m² resolution and generates as many heat maps as the number of classes considered.

We group 64 patches, with classification maps of size 64 × 64, into mini-batches to estimate the gradient of the loss with respect to the network's parameters and back-propagate it. The loss function is the cross-entropy between the target and predicted class probabilities. Optimization uses stochastic gradient descent with a learning rate of 0.01, a momentum of 0.9, and an L2 weight regularisation of 0.0002. Neither these hyperparameters nor the network architectures were optimised.
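For readers unfamiliar with these hyperparameters, the sketch below shows how a learning rate, momentum, and L2 weight regularisation typically enter a gradient descent update. It uses the common momentum formulation; whether this matches the authors' exact solver configuration is an assumption, and the function name and the toy quadratic problem are hypothetical.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=0.0002):
    """One update of SGD with momentum and L2 weight regularisation."""
    # L2 regularisation adds weight_decay * w to the gradient of the data loss.
    g = grad + weight_decay * w
    velocity = momentum * velocity - lr * g
    return w + velocity, velocity

# Toy problem (illustrative only): minimise 0.5 * ||w - target||^2,
# whose gradient with respect to w is simply w - target.
target = np.array([1.0, -2.0])
w = np.zeros(2)
v = np.zeros(2)
for _ in range(500):
    w, v = sgd_momentum_step(w, w - target, v)
```

On this toy problem the weights settle slightly short of the target, at target / (1 + weight_decay), which is the small shrinkage the L2 term introduces.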

We now describe the implementation of the RNN in detail. Note that the weights of the initial coarse CNN are fixed at this stage, and the manually labelled tile is used to train the RNN only. Our RNN learns 32 Mi and 32 Nj filters, each with spatial dimensions of 5 × 5. Each class has its own MLP, with 32 hidden neurons and rectified linear activations, while the Mi and Nj filters are shared across classes.

We unroll five RNN iterations, which allows us to improve the classification maps substantially without exhausting our GPU's memory.

As with the coarse CNN, training was performed on random patches with the cross-entropy loss function. We used AdaGrad, a variant of gradient descent, with a base learning rate of 0.01 (higher values make the loss diverge), as it converged faster in our case. All weights are initialised at random from a distribution that depends on the number of neuron inputs.
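As a reminder of what AdaGrad does differently from plain gradient descent, here is a minimal sketch of its standard update rule: each parameter's step is divided by the square root of its accumulated squared gradients. The function name, the `eps` constant, and the toy gradient are illustrative assumptions, not details from the original training setup.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad update with a per-parameter adaptive step size."""
    accum = accum + grad**2                       # running sum of squared gradients
    w = w - lr * grad / (np.sqrt(accum) + eps)    # larger history -> smaller step
    return w, accum

# Two steps with the same (made-up) gradient show the step size shrinking.
w = np.zeros(2)
accum = np.zeros(2)
grad = np.array([4.0, -0.5])
w1, accum = adagrad_step(w, grad, accum)
w2, accum = adagrad_step(w1, grad, accum)
```

The first step moves every parameter by roughly the base learning rate regardless of gradient magnitude, and later steps shrink as the squared gradients accumulate.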


