
Local Convolutional Features with Unsupervised Training for Image Retrieval

Updated: Aug 26, 2021

Author : Samarth Chandra, Rupesh

Image retrieval is a challenging task because different images of the same object or scene may exhibit large variations in viewpoint, illumination, and scale. Patch-level descriptors are the basic building block of many computer vision tasks, an important one of which is content-based image retrieval. The SIFT descriptor is one of the oldest and most widely used local descriptors for image retrieval.

This paper focuses on how improvements in patch retrieval affect image retrieval performance. For this purpose we introduce a new dataset, “Rome-Patches”, built from 16k Flickr images. A 3D reconstruction of these images provides sparse patch matches, which act as the ground truth for our dataset. In a nutshell, we provide the following:

  1. We put forward a patch descriptor that relies on the CKN architecture and uses a simple procedure to compute the feature embedding.

  2. We introduce a new dataset named “Rome-Patches” for the evaluation of patch and image retrieval, which also enables us to study the correlation between patch matching and image retrieval performance.

Figure_1 : Rome Patches

We have introduced a dataset “Rome Patches”, for the evaluation of patch and image retrieval, that will enable us to study the correlation between patch matching and image retrieval performance.

We present a comparison between current deep convolutional approaches and the proposed Patch-CKN for both patch and image retrieval on our novel dataset “Rome-Patches”. Interestingly, we found that the results obtained with the unsupervised Patch-CKN approach were competitive with those of the supervised CNN architecture.

Related work covers traditional (shallow) patch descriptors, deep learning for image retrieval, and deep learning for patch description. Standard patch descriptors include:

  • SIFT


  • LIOP

Problem with all these descriptors:

If the number of parameters to set is large, hand-tuning becomes infeasible and the optimal parameterization needs to be learned from data. This paper presents a deep kernel-based approach for describing image patches for image retrieval. The key idea comes from the expressive feature representations produced by deep CNNs used in image classification. The feature output of an intermediate layer of a CNN can be used as an image-level descriptor, and the output of an earlier layer, typically the 4th, is preferred as the patch-level descriptor. Our contribution, Patch-CKN, is based on Convolutional Kernel Networks (CKNs), which were initially introduced for image classification. The Patch-CKN we introduce generalizes kernel descriptors; the proposed procedure for computing an explicit feature embedding is faster and simpler.


SIFT - The classical SIFT pipeline works with local features. A SIFT feature is an engineered representation that describes a local patch.

  • SIFT is the most widely used two-layer architecture:

  • the first layer computes patch gradient orientations

  • these are average-pooled in the second layer.
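The two layers listed above can be illustrated with a toy descriptor. This is a minimal sketch of the idea only, not the real SIFT algorithm: the function name `orientation_histogram_descriptor`, the hard bin assignment, and the defaults `n_bins=8` and `grid=4` are assumptions chosen for illustration.

```python
import numpy as np

def orientation_histogram_descriptor(patch, n_bins=8, grid=4):
    """Toy two-layer descriptor: gradient orientations, then average pooling."""
    patch = np.asarray(patch, dtype=np.float64)
    # Layer 1: per-pixel gradient magnitude and orientation in [0, 2*pi).
    gy, gx = np.gradient(patch)
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    # Hard-assign each pixel to one orientation bin (real SIFT interpolates).
    bins = np.minimum((orientation / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    maps = np.zeros(patch.shape + (n_bins,))
    rows, cols = np.indices(patch.shape)
    maps[rows, cols, bins] = magnitude
    # Layer 2: average-pool every orientation map over a grid x grid layout of cells.
    h, w = patch.shape
    ch, cw = h // grid, w // grid
    cells = maps[:ch * grid, :cw * grid].reshape(grid, ch, grid, cw, n_bins)
    descriptor = cells.mean(axis=(1, 3)).ravel()
    # L2-normalise; an all-zero descriptor (flat patch) is returned unchanged.
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor
```

With the default settings the descriptor has 4 x 4 x 8 = 128 entries for any patch of at least 4 x 4 pixels.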


SIFT is used for many tasks such as stereo matching, content-based retrieval, or classification. State-of-the-art instance-level retrieval systems involve three steps:

  1. Interest point detection: to select key points.

  2. Description: the choice of a good local representation to ensure robustness to viewing conditions.

  3. Matching: the goal is to define a suitable metric between two patch sets.

When it comes to deep learning for image retrieval, the outputs of intermediate CNN layers are used as image descriptors. CNN responses at different scales and positions can be extracted, and the dense grid can be replaced with a patch detector. A CNN descriptor can also be used for instance-level image retrieval and fine-tuned on a surrogate landmark dataset. While fine-tuning improves results, it would be difficult to replicate this success beyond landmarks. Global CNN descriptors lack geometric invariance, so they produce results below the state of the art in instance-level image retrieval.

Image Retrieval Pipeline

The three-step pipeline is described below:

  • Interest point detection: Interest point detectors provide locations invariant to certain image transformations. This ensures that two views of the same scene, even with changes in viewpoint or illumination, share similar “interest points”. The idea is to extract points at their characteristic scale and estimate for each point an affine-invariant local region. Rotation invariance is obtained by rotating patches to align the dominant gradient orientation. This results in a set of interest points associated with locally affine-invariant regions.


  • Interest point description/patch description: for a normalized patch M, we compute a feature representation φ(M) in a Euclidean space. The representation is expected to be robust to the perturbations that are not covered by the detector.

  • Patch matching: Matching all pairs of patches is too expensive; we follow the standard practice of encoding the patch descriptors and aggregating them into a fixed-length image descriptor, using the VLAD (vector of aggregated local descriptors) representation.
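The aggregation step in the last bullet can be sketched as follows. This is a simplified VLAD under stated assumptions: the `centroids` (visual words) are taken as given, for example from k-means on training descriptors, and refinements such as power normalisation are left out. The function name `vlad` is chosen here for illustration.

```python
import numpy as np

def vlad(descriptors, centroids):
    """Aggregate local descriptors (n, d) into one vector of length k * d."""
    descriptors = np.asarray(descriptors, dtype=np.float64)
    centroids = np.asarray(centroids, dtype=np.float64)
    # Assign each descriptor to its nearest centroid (squared Euclidean distance).
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignment = d2.argmin(axis=1)
    # Sum the residuals (descriptor minus centroid) separately for each centroid.
    out = np.zeros_like(centroids)
    for k in range(centroids.shape[0]):
        members = descriptors[assignment == k]
        if len(members):
            out[k] = (members - centroids[k]).sum(axis=0)
    out = out.ravel()
    # L2-normalise so images with different numbers of patches are comparable.
    norm = np.linalg.norm(out)
    return out / norm if norm > 0 else out
```

Whatever the number of patches in an image, the output length is fixed at k * d, so two images can be compared with a single dot product or Euclidean distance.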


Convolutional Neural Networks

A convolutional neural network transforms an input image through a sequence of layers, each of which performs a linear operation followed by a pointwise nonlinearity. A great example of an off-the-shelf CNN is AlexNet, which won the ImageNet 2012 challenge. AlexNet has 7 layers: the first five are convolutional and the last ones are fully connected. The network is designed to process images of size 224 × 224, but the convolutional layers may be fed with smaller inputs to produce 1x1 maps that we can use as low-dimensional patch descriptors.

The output of a CNN for some image x is:

$$f(x) = \gamma_K(\sigma_K(W_K \cdots \gamma_2(\sigma_2(W_2 \gamma_1(\sigma_1(W_1 x)))) \cdots))$$

$W_k$ = matrices corresponding to linear operations

$\sigma_k$ = pointwise non-linear functions

$\gamma_k$ = functions that perform a downsampling operation
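The nesting in the formula above can be made concrete with a small sketch. The choices here are illustrative assumptions, not the networks used in the paper: dense matrices stand in for the convolutions $W_k$, ReLU for $\sigma_k$, and "keep every second entry" for $\gamma_k$.

```python
import numpy as np

def relu(x):
    # Pointwise non-linearity sigma_k.
    return np.maximum(x, 0.0)

def downsample(x, factor=2):
    # Downsampling gamma_k: keep every `factor`-th entry (a stand-in for pooling).
    return x[::factor]

def forward(x, weights):
    """Apply x -> gamma_k(sigma_k(W_k x)) for each W_k, innermost layer first."""
    x = np.asarray(x, dtype=np.float64)
    for W in weights:
        x = downsample(relu(W @ x))
    return x
```

For example, with two identity matrices of sizes 4 and 2, the input (1, -2, 3, -4) becomes (1, 0, 3, 0) after the first non-linearity, (1, 3) after the first downsampling, and (1) after the second layer.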

Convolutional Kernel Networks

CKNs (Convolutional Kernel Networks) were initially introduced for image classification. CKNs have the same architecture as classical CNNs. The feature representation of CKNs relies on a kernel feature map, which is data independent. An explicit kernel map can be computed to approximate it for computational efficiency. A fast and simple procedure for this purpose uses sub-sampling of patches and stochastic gradient optimization, resulting in the CKN patch descriptor. The idea behind the CKN is that it is possible to learn competitive patch-level descriptors without supervision, which reduces time and cost compared to supervised models because no labelled data has to be collected.


For patch and image retrieval, the CKN gives patch-level descriptors that are competitive with supervised CNNs, at a fraction of the computational and annotation cost of previous supervised alternatives.

Let us understand it by an example of the following kernel:

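$$K_1(M, M') = \sum_{z, z' \in \Omega} e^{-\|z - z'\|^2 / (2\beta_1^2)} \, k_1(p_z, p'_{z'})$$


$$k_1(p_z, p'_{z'}) = \|p_z\| \, \|p'_{z'}\| \, e^{-\|\tilde{p}_z - \tilde{p}'_{z'}\|^2 / (2\alpha_1^2)}$$

  • $\alpha_1$ and $\beta_1$ are the kernel hyperparameters,

  • $\|\cdot\|$ denotes the usual L2 norm, and $\tilde{p}_z$, $\tilde{p}'_{z'}$ are the L2-normalized versions of the sub-patches $p_z$ and $p'_{z'}$.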
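To make the kernel above tangible, here is a brute-force evaluation for two small grayscale patches. It is a sketch under assumptions: Ω is taken to be every position where a `size` x `size` sub-patch fits, and the helper names `sub_patches` and `match_kernel` are invented for this example. The CKN approximates this kernel with an explicit feature map rather than evaluating it pair by pair.

```python
import numpy as np

def sub_patches(patch, size):
    """Return (locations, flattened sub-patches) for every valid position z."""
    h, w = patch.shape
    locs, subs = [], []
    for i in range(h - size + 1):
        for j in range(w - size + 1):
            locs.append((i, j))
            subs.append(patch[i:i + size, j:j + size].ravel())
    return np.array(locs, dtype=np.float64), np.array(subs, dtype=np.float64)

def match_kernel(M, M_prime, size=3, alpha=1.0, beta=1.0):
    """Evaluate K_1(M, M') by summing over all location pairs (z, z')."""
    z, p = sub_patches(np.asarray(M, dtype=np.float64), size)
    z2, p2 = sub_patches(np.asarray(M_prime, dtype=np.float64), size)
    # Norms and L2-normalised sub-patches (all-zero sub-patches stay zero).
    n1 = np.linalg.norm(p, axis=1)
    n2 = np.linalg.norm(p2, axis=1)
    pt = p / np.maximum(n1, 1e-12)[:, None]
    pt2 = p2 / np.maximum(n2, 1e-12)[:, None]
    # Spatial term exp(-||z - z'||^2 / (2 beta^2)).
    dz = ((z[:, None, :] - z2[None, :, :]) ** 2).sum(axis=2)
    spatial = np.exp(-dz / (2 * beta ** 2))
    # Appearance term ||p|| ||p'|| exp(-||p~ - p~'||^2 / (2 alpha^2)).
    dp = ((pt[:, None, :] - pt2[None, :, :]) ** 2).sum(axis=2)
    appearance = np.outer(n1, n2) * np.exp(-dp / (2 * alpha ** 2))
    return float((spatial * appearance).sum())
```

A small `beta` only lets sub-patches at nearly the same location contribute, while a large `beta` makes the comparison more tolerant to shifts; `alpha` plays the same role for appearance.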

The above kernel provides a feature representation for patches and images. Moreover, this kernel is a match kernel. Such kernels offer a tunable level of invariance through the choice of hyperparameters, and produce hierarchical convolutional representations that are well suited to natural images. Kernels can also be stacked on top of one another to obtain a “deeper” and potentially better feature representation.

Implementation Process:

Patches of 51 x 51 pixels were extracted. Both of the above-mentioned approaches were run on these patches, as described below:

CNN Implementation :

For the CNN, we used the popular Caffe framework and AlexNet. We rescaled the 51 x 51 patches so that, when fed to the CNN, they produced 1x1 output maps.

CKN Implementation:

To learn the CKNs, we randomly selected a set of 100k patches from the train split of the RomePatches dataset. We ran stochastic gradient descent for 300K iterations with a batch size of 1000. We explored three input types separately. For each layer, four hyperparameters have to be determined.


Image Retrieval with AutoEncoders:

In [1]:
import tensorflow as tf
from tensorflow.keras.models import save_model
#from tensorflow.python.framework import ops
import tensorflow.keras.layers as L
#import tensorflow.compat.v1.keras.backend as K
import numpy as np
from sklearn.model_selection import train_test_split
from lfw_dataset import load_lfw_dataset
import matplotlib.pyplot as plt
#import keras_utils

Load dataset:

Relevant links for Dataset:




In [2]:
ATTRS_NAME = "lfw_attributes.txt" 
IMAGES_NAME = "lfw-deepfunneled.tgz" 
RAW_IMAGES_NAME = "lfw.tgz" 

To clear session/graph if you rebuild your graph to avoid out-of-memory errors:

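In [3]:
def reset_tf_session():
    # `K` is not imported above, so use the TF2 Keras backend directly:
    # this drops the current graph/session state and frees its memory.
    tf.keras.backend.clear_session()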

Load images

In [4]:
X, attr = load_lfw_dataset(use_raw=True, dimx=32, dimy=32)
IMG_SHAPE = X.shape[1:]

# center images
X = X.astype('float32') / 255.0 - 0.5

# split
X_train, X_test = train_test_split(X, test_size=0.1, random_state=42)

In [5]:
def show_image(x):
 plt.imshow(np.clip(x + 0.5, 0, 1))
In [6]:
plt.title('sample images')
for i in range(6):
    # show the first six images in a 2x3 grid
    plt.subplot(2, 3, i + 1)
    show_image(X[i])

print("X shape:", X.shape)
print("attr shape:", attr.shape)
# try to free memory
del X

import gc
gc.collect()
X shape: (13143, 32, 32, 3)
attr shape: (13143, 73)


Build Model

Going deeper: convolutional autoencoder

PCA is neat but surely we can do better. This time we want you to build a deep convolutional autoencoder by... stacking more layers.


The encoder part is fairly standard: we stack convolutional and pooling layers and finish with a dense layer to get a representation of the desired size (code_size). We suggest using activation='elu' for all convolutional and dense layers. We suggest repeating (conv, pool) four times with kernel size (3, 3), padding='same' and the following numbers of output channels: 32, 64, 128, 256. Remember to flatten (L.Flatten()) the output before adding the last dense layer!

Decoder: for the decoder we'll use so-called "transpose convolution".

A traditional convolutional layer takes a patch of an image and produces a number (patch -> number). In "transpose convolution" we want to take a number and produce a patch of an image (number -> patch). We need this layer to "undo" the convolutions in the encoder. Here's how "transpose convolution" works: in this example we use a stride of 2 to produce a 4x4 output; this way we "undo" pooling as well. Another way to think about it: we "undo" a convolution with stride 2 (which is similar to conv + pool).

We can add a "transpose convolution" layer in Keras:

L.Conv2DTranspose(filters=?, kernel_size=(3, 3), strides=2, activation='elu', padding='same')

Our decoder starts with a dense layer to "undo" the last layer of encoder. Remember to reshape its output to "undo" L.Flatten() in encoder.

Now we're ready to undo (conv, pool) pairs. For this we need to stack 4 L.Conv2DTranspose layers with the following numbers of output channels: 128, 64, 32, 3. Each of these layers will learn to "undo" (conv, pool) pair in encoder. For the last L.Conv2DTranspose layer use activation=None because that is our final image.
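The shape bookkeeping behind these instructions can be checked without TensorFlow. The sketch below assumes the 32 x 32 inputs loaded earlier, 2x2 max-pooling after each 'same' convolution, and stride-2 'same' transpose convolutions; the helper names `encoder_shapes` and `decoder_shapes` are made up for this illustration.

```python
def encoder_shapes(height, width, channels=(32, 64, 128, 256)):
    """Feature-map shape after each (conv 'same', 2x2 max-pool) pair."""
    shapes = []
    for c in channels:
        # A 'same' convolution keeps the spatial size; the pool halves it.
        height, width = height // 2, width // 2
        shapes.append((height, width, c))
    return shapes

def decoder_shapes(height, width, channels=(128, 64, 32, 3)):
    """Feature-map shape after each stride-2 'same' transpose convolution."""
    shapes = []
    for c in channels:
        # With padding='same', a stride of 2 doubles the spatial size.
        height, width = height * 2, width * 2
        shapes.append((height, width, c))
    return shapes

enc = encoder_shapes(32, 32)
dec = decoder_shapes(enc[-1][0], enc[-1][1])
print(enc)  # last entry is (2, 2, 256)
print(dec)  # last entry is (32, 32, 3)
```

This is why the decoder reshapes the output of its dense layer to (2, 2, 256): four halvings take 32 down to 2, and four doublings bring it back to 32.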

In [7]:
def build_deep_autoencoder(img_shape, code_size):
    """PCA's deeper brother. See instructions above. Use `code_size` in layer definitions."""
    H, W, C = img_shape

    # encoder: four (conv, pool) pairs, then flatten and project to `code_size`
    encoder = tf.keras.models.Sequential()
    encoder.add(L.InputLayer(img_shape))
    encoder.add(L.Conv2D(filters=32, kernel_size=(3, 3), activation='elu', padding='same'))
    encoder.add(L.MaxPooling2D(pool_size=(2, 2)))
    encoder.add(L.Conv2D(filters=64, kernel_size=(3, 3), activation='elu', padding='same'))
    encoder.add(L.MaxPooling2D(pool_size=(2, 2)))
    encoder.add(L.Conv2D(filters=128, kernel_size=(3, 3), activation='elu', padding='same'))
    encoder.add(L.MaxPooling2D(pool_size=(2, 2)))
    encoder.add(L.Conv2D(filters=256, kernel_size=(3, 3), activation='elu', padding='same'))
    encoder.add(L.MaxPooling2D(pool_size=(2, 2)))
    encoder.add(L.Flatten())
    encoder.add(L.Dense(code_size, activation='elu'))

    # decoder: a dense layer and a reshape undo Flatten (2x2x256 assumes 32x32 inputs),
    # then four transpose convolutions undo the (conv, pool) pairs
    decoder = tf.keras.models.Sequential()
    decoder.add(L.InputLayer((code_size,)))
    decoder.add(L.Dense(2 * 2 * 256, activation='elu'))
    decoder.add(L.Reshape((2, 2, 256)))
    decoder.add(L.Conv2DTranspose(filters=128, kernel_size=(3, 3), strides=2, activation='elu', padding='same'))
    decoder.add(L.Conv2DTranspose(filters=64, kernel_size=(3, 3), strides=2, activation='elu', padding='same'))
    decoder.add(L.Conv2DTranspose(filters=32, kernel_size=(3, 3), strides=2, activation='elu', padding='same'))
    decoder.add(L.Conv2DTranspose(filters=3, kernel_size=(3, 3), strides=2, activation=None, padding='same'))
    return encoder, decoder

In [8]:
# Check autoencoder shapes along different code_sizes
get_dim = lambda layer: np.prod(layer.output_shape[1:])
for code_size in [1, 8, 32, 128, 512]:
    encoder, decoder = build_deep_autoencoder(IMG_SHAPE, code_size=code_size)
    print("Testing code size %i" % code_size)
    assert encoder.output_shape[1:] == (code_size,), "encoder must output a code of required size"
    assert decoder.output_shape[1:] == IMG_SHAPE, "decoder must output an image of valid shape"
    assert len(encoder.trainable_weights) >= 6, "encoder must contain at least 3 layers"
    assert len(decoder.trainable_weights) >= 6, "decoder must contain at least 3 layers"
    for layer in encoder.layers + decoder.layers:
        assert get_dim(layer) >= code_size, "Encoder layer %s is smaller than bottleneck (%i units)" % (layer.name, get_dim(layer))

print("All tests passed!")
Testing code size 1
Testing code size 8
Testing code size 32
Testing code size 128
Testing code size 512
All tests passed!
In [9]:
encoder, decoder = build_deep_autoencoder(IMG_SHAPE, code_size=32)

In [10]:
inp = L.Input(IMG_SHAPE)
code = encoder(inp)
reconstruction = decoder(code)

In [11]:
autoencoder = tf.keras.models.Model(inputs=inp, outputs=reconstruction)
autoencoder.compile(optimizer="adamax", loss='mse')

In [12]:
# we will save model checkpoints here to continue training in case of kernel death
model_filename = 'autoencoder.{0:03d}.hdf5'
last_finished_epoch = None

#### uncomment below to continue training from model checkpoint
#### fill `last_finished_epoch` with your latest finished epoch
# from tensorflow.keras.models import load_model
# s = reset_tf_session()
# last_finished_epoch = 4
# autoencoder = load_model(model_filename.format(last_finished_epoch))
# encoder = autoencoder.layers[1]
# decoder = autoencoder.layers[2]

In [13]:
class ModelSaveCallback(tf.keras.callbacks.Callback):
    def __init__(self, file_name):
        super(ModelSaveCallback, self).__init__()
        self.file_name = file_name

    def on_epoch_end(self, epoch, logs=None):
        # save a checkpoint named after the epoch that just finished
        model_filename = self.file_name.format(epoch)
        save_model(self.model, model_filename)
        print("Model saved in {}".format(model_filename))

In [14]:
autoencoder.fit(x=X_train, y=X_train, epochs=25,
                validation_data=(X_test, X_test),
                callbacks=[ModelSaveCallback(model_filename)],
                initial_epoch=last_finished_epoch or 0)

<tensorflow.python.keras.callbacks.History at 0x1ba02257648>

Image retrieval with autoencoders

We have trained a network that converts an image into itself, imperfectly. This task is not that useful in and of itself, but it has a number of awesome side effects. Let's see them in action. The first thing we can do is image retrieval, also known as image search. We will give it an image and find similar images in latent space:

To speed up the retrieval process, one should use Locality Sensitive Hashing on top of the encoded vectors. This technique can narrow down the potential nearest neighbours of our image in latent space (encoder code). We will calculate nearest neighbours in a brute-force way for simplicity.
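For reference, the hashing idea mentioned above could look roughly like the sketch below. It is an illustrative assumption, not part of the notebook: the class `RandomHyperplaneLSH` and its parameters are hypothetical, it uses the random-hyperplane scheme (which groups vectors by direction), and it ranks only the candidates that share the query's bucket.

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    """Hash vectors by the sign pattern of a few random projections."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = defaultdict(list)
        self.vectors = None

    def _key(self, vector):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple((self.planes @ vector > 0).astype(int))

    def index(self, vectors):
        self.vectors = np.asarray(vectors, dtype=np.float64)
        for i, v in enumerate(self.vectors):
            self.buckets[self._key(v)].append(i)

    def query(self, vector, n_neighbors=5):
        # Only vectors in the query's bucket are compared exactly.
        candidates = self.buckets.get(self._key(vector), [])
        if not candidates:
            return []
        dists = np.linalg.norm(self.vectors[candidates] - vector, axis=1)
        order = np.argsort(dists)[:n_neighbors]
        return [candidates[i] for i in order]
```

The trade-off is recall: a true neighbour that falls into a different bucket is missed, which is why practical systems usually combine several hash tables. Fewer bits give larger buckets and higher recall at the cost of more exact comparisons.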

In [15]:
images = X_train
codes = encoder.predict(images) 
assert len(codes) == len(images)
In [16]:
from sklearn.neighbors import NearestNeighbors
nei_clf = NearestNeighbors(metric="euclidean")
nei_clf.fit(codes)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='euclidean',
metric_params=None, n_jobs=None, n_neighbors=5, p=2, radius=1.0)

In [17]:
def get_similar(image, n_neighbors=5):
    assert image.ndim == 3, "image must be [height,width,3]"
    # add a batch dimension, encode, then look up the closest training codes
    code = encoder.predict(image[None])
    (distances,), (idx,) = nei_clf.kneighbors(code, n_neighbors=n_neighbors)
    return distances, images[idx]

In [18]:
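def show_similar(image):
    distances, neighbors = get_similar(image, n_neighbors=3)
    # query image on the left, its three nearest neighbours to the right
    plt.subplot(1, 4, 1)
    show_image(image)
    plt.title("Original image")
    for i in range(3):
        plt.subplot(1, 4, i + 2)
        show_image(neighbors[i])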

In [19]:
# cherry picked smile images
# ethnicity
# glasses

