Citation Information : International Journal of Advanced Network, Monitoring and Controls. Volume 6, Issue 3, Pages 9-17, DOI: https://doi.org/10.21307/ijanmc-2021-022
License : (CC-BY-NC-ND 4.0)
Published Online: 08-October-2021
Semantic image segmentation is a vast area of interest in computer vision that has gained exceptional attention from the research community. It is the process of classifying each pixel of an image into its respective category. In this paper, we address the problem of scene understanding and perform segmentation by combining different classification models as feature encoders with segmentation models as feature decoders. All experiments were performed on the CamVid dataset. Semantic segmentation covers a wide range of real-world applications such as autonomous driving, virtual/augmented reality, and indoor navigation.
Semantic segmentation is the process of assigning a class label to each pixel in an image. Pixel-wise labels provide better descriptions of images than bounding-box labels. Inferring such labels is far more challenging because it involves an extremely complex structured prediction problem. Semantic image segmentation (pixel-level classification) is an immense area of interest for computer vision, machine learning, and deep learning researchers, with many open challenges. It has a wide array of practical applications such as remote sensing, autonomous driving, indoor navigation, video surveillance, and virtual or augmented reality systems.
Nowadays, deep learning techniques provide state-of-the-art performance for image segmentation and classification, as well as for detection and captioning tasks, using convolutional neural network (CNN) models. They have been the main accelerator of recent breakthroughs in semantic segmentation, using different combinations of CNN models such as VGGNet, AlexNet, and ResNet.
VGG is an advanced object-recognition convolutional neural network that supports up to 19 layers. Pre-trained on ImageNet, it achieves 92.7% top-5 accuracy and performs well on many datasets beyond ImageNet. ResNet is a deep neural network with 150+ trainable layers; the model achieved the highest accuracy in the 2015 ImageNet challenge. U-Net is a convolutional architecture designed for biomedical images that addresses the problem of determining both what is in an image and where it is.
In this paper, we propose a segmentation architecture that combines two models: a base model for feature extraction and a segmentation model for decoding. The base model acts as an object feature extractor, and the subsequent segmentation model segments the images based on the extracted features. We use different models within an encoder-decoder implementation with a skip architecture to segment boundaries accurately.
The second section of this paper gives a short survey of CNN-based segmentation. The third section describes the proposed methodology of our framework, and the fourth presents experimental results and graphs. The conclusion is given in the fifth section, and references are listed last.
Recent research in computer vision and pattern recognition has highlighted the capabilities of CNNs for solving challenging tasks such as segmentation and classification. Recent progress in semantic segmentation has mainly been driven by powerful DNN architectures, following the ideas of FCNs. Different architectures have been developed in this context. Deep learning-based works for semantic segmentation include fully convolutional networks, encoder-decoder based models, multi-scale and pyramid network-based models, dilated convolutional models, the DeepLab family, recurrent neural network-based models, and attention-based models. All of these approaches have in common that they rely on the powerful feature extraction provided by CNNs. Below is a brief review of the techniques most relevant to our work.
In 2014, Long and Shelhamer et al. presented the novel approach of FCNs for semantic segmentation. The approach represented the state of the art in semantic segmentation and has since set the standard for subsequent work. FCNs are trained end-to-end and provide pixel-to-pixel predictions. They also use skip architectures to combine semantic and appearance information. The authors demonstrated 62.2% mean IU on the PASCAL VOC 2011 dataset.
The work by Long and Shelhamer et al. builds on the concept of CNNs pioneered by Matan et al. and the concept of jets pioneered by Koenderink and Van Doorn. In 1991, Matan et al. used CNNs, with a feed-forward network architecture, to recognize unconstrained handwritten multi-digit strings, extending earlier work on recognizing isolated digits. In 1987, Koenderink and Van Doorn used local jets to give rich representations of local geometry and semantics with filters at multiple scales. Since the work of Long and Shelhamer et al., several other methods have been explored to improve semantic segmentation performance.
In 2017, Chen and Papandreou incorporated probabilistic graphical models in the form of fully connected Conditional Random Fields (CRFs) to overcome poor localization. They proposed the "DeepLab" system by applying 'atrous convolution' with upsampled filters, trained on image classification, to the task of semantic segmentation for dense feature extraction, and further extended it to atrous spatial pyramid pooling. They also combined ideas from DCNNs and fully connected CRFs to produce semantically precise predictions and detailed segmentation maps. The proposed technique significantly advanced the state of the art on several challenging datasets, including the PASCAL VOC 2012 semantic image segmentation benchmark, PASCAL-Context, and Cityscapes.
Later, Zheng and Jayasumana showed that unrolling dense CRF inference into individual computations and joining them to the network yields further improvement. They combined the strengths of CNNs and CRFs in a single deep network. Their formulation fully integrates CRF-based probabilistic graphical modeling with deep learning, passing error differentials from outputs to inputs during back-propagation-based training while learning the CRF parameters. The approach achieved state-of-the-art results on the popular PASCAL VOC segmentation benchmark.
In 2015, Noh et al. demonstrated a novel semantic segmentation algorithm based on a learned deconvolution network. Since coarse-to-fine structures of an object are reconstructed progressively through a sequence of deconvolution operations, the network generates dense and precise object segmentation masks. They further proposed an ensemble approach that combines the outputs of their algorithm with an FCN-based method, achieving substantially better performance by exploiting the characteristics of both.
The loss of context information during segmentation was a problem until it was addressed by Yuantao Chen et al. in the paper "Improving Semantic Image Segmentation Based on a Feature Fusion Model". They proposed a feature fusion model that incorporates context features layer by layer. First, an image pyramid is formed by preprocessing the original images. Second, the image pyramid is fed into the network structure, initializing the feature fusion and expanding receptive fields using atrous convolutions. Finally, the score map of the feature fusion model is computed and sent to a conditional random field for further processing to optimize the results. On the PASCAL VOC 2012 and PASCAL-Context datasets, the approach achieved better IU than state-of-the-art works, improving on conventional methods by about 6.3%.
We began by taking the CamVid dataset. Our segmentation task is carried out by combining two different models: one used as the base model and the other as the segmentation model. The base model is a feature extractor for a given image, pre-trained on the ImageNet dataset. We fine-tuned the base model on our dataset and use it as the encoder for our segmentation task. We employ a skip architecture by taking the outputs from the layers of interest.
The second model is our segmentation model, which serves as the decoder of our architecture. It takes the outputs from certain layers of the base model, delivered through skip connections, as its input. The segmentation model then segments the image based on the features extracted by the base model.
CNNs achieve state-of-the-art results in image classification and recognition because of their high accuracy. A CNN follows a hierarchical model that builds a network like a funnel, ending in a fully connected layer in which all neurons are connected to each other and the output is processed. Our base model is used as a feature extractor with an input size of 224×224, pre-trained on ImageNet with 1,000 classes. We take the classification features by removing the fully connected nodes and fine-tuning the model on specific layers. We transform the fully connected layers into convolution layers to produce a classification heatmap.
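The fully-connected-to-convolution transformation above can be sketched as follows. This is a minimal NumPy illustration with hypothetical shapes and random weights (not the paper's actual layers), showing why reinterpreting dense weights as a convolution kernel reproduces the same output, which is what lets a classifier emit a spatial heatmap on larger inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.standard_normal((2, 2, 3))        # small H x W x C feature map
W = rng.standard_normal((2 * 2 * 3, 5))      # dense weights: 12 inputs -> 5 classes

# Classic fully connected forward pass on the flattened feature map
dense_out = feat.reshape(-1) @ W

# Reinterpret the same weights as a 2x2 conv kernel (kH, kW, C, classes);
# evaluated at one valid position it reproduces the dense output exactly
kernel = W.reshape(2, 2, 3, 5)
conv_out = np.einsum('hwc,hwck->k', feat, kernel)

assert np.allclose(dense_out, conv_out)
```

On inputs larger than the training resolution, the convolutional form simply slides, producing one class vector per spatial position, i.e. the heatmap.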
We receive the image at the input layer and then initialize the weights so that layer activation outputs neither explode nor vanish during forward propagation. After weight initialization, all of the weights 'w' multiplied by the inputs 'x' are summed, and a bias is added to allow units to learn an appropriate threshold (Eq. 1).
We add zero padding, apply 3×3 kernels with a stride of 2, and apply max-pooling. We use the 'shift-and-stitch' trick, in which values are max-pooled after shifting and the results are then stitched back onto the original image grid. A ReLU activation function 'R' is then applied (Eq. 2).
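A minimal sketch of this forward step, assuming a single-channel input and hypothetical weights: zero padding, a 3×3 kernel at stride 2 with a bias as in Eq. (1), the ReLU of Eq. (2), and a 2×2 max-pool.

```python
import numpy as np

def relu(z):                                   # eq. (2): R(z) = max(0, z)
    return np.maximum(0.0, z)

def conv3x3_stride2(x, w, b):
    # Zero-pad, slide a 3x3 kernel with stride 2, add the bias (eq. (1)),
    # then apply the ReLU non-linearity.
    xp = np.pad(x, 1)
    out = np.empty(((x.shape[0] + 1) // 2, (x.shape[1] + 1) // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = xp[2 * i:2 * i + 3, 2 * j:2 * j + 3]
            out[i, j] = np.sum(w * patch) + b  # sum(w * x) + b
    return relu(out)

def maxpool2x2(x):
    # Non-overlapping 2x2 max-pooling over a 2D map
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))
```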
The base model works as the encoder of our architecture; the encoder-decoder module serves as the backbone for semantic segmentation tasks. The encoder extracts features from the input image, which are used to produce the segmentation output. We obtain abstract representations via downsampling, decreasing the number of pixels while keeping only those carrying salient features. This is done because of memory limits on the computer and to reduce processing time. The result of applying a pooling layer, creating downsampled or pooled feature maps, is a summarized version of the features detected in the input.
We use skip connections in our architecture: some layers in the neural network are skipped, and the output of one layer is fed as input to later layers rather than only to the immediately following one. A skip connection provides an alternate path for the gradient during backpropagation, making it easier to find good weight values and improving generalization. After cascading a set of CNN weights 'w', biases 'b', and non-linear layers applied to the input 'x', the image feature 'xf' extracted from each layer 'n' is defined by

xf_n = R(w_n · xf_(n−1) + b_n), with xf_0 = x,

and all outputs 'xf_n' are passed to the segmentation architecture through skip connections.
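The layer cascade described above can be sketched as follows. Each stage is given hypothetical dense weights standing in for the convolutional layers; every intermediate output xf_n is retained so the skip connections can hand it to the decoder.

```python
import numpy as np

def encoder_with_skips(x, stages):
    # xf_n = R(w_n @ xf_{n-1} + b_n): each stage applies weights, a bias,
    # and ReLU; all intermediate feature maps are kept for skip connections.
    skips = []
    for w, b in stages:
        x = np.maximum(0.0, x @ w + b)
        skips.append(x)
    return x, skips
```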
Image segmentation with CNNs involves feeding segments of an image as input to the segmentation model, which labels the pixels. Our segmentation module consists of several convolution layers that receive input from different levels of the base model. It upsamples the images via deconvolution (also known as transposed convolution) layers.
Let 'S_n' be a deconvolution layer in our segmentation model receiving inputs from the base model. Deconvolution layers must be stacked very deeply, which increases computation and memory allocation, so we use 1×1 convolutions with a stride of 1 and no bias. This gives faster computation with less information loss by reducing the dimensionality of the previous layer, and it adds more non-linearity to enhance the representational power of the network. The input samples 'xf_n' are average-pooled and passed through the 1×1 conv layer. By applying batch normalization we regularize the model, avoiding the need for dropout; it also reduces the number of training epochs required and yields higher accuracy. This is done before applying the ReLU activation function. So,
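A sketch of this decoder step under assumed shapes: 2×2 average pooling, a bias-free 1×1 convolution (implemented as per-pixel channel mixing), batch normalization, then ReLU. The weight and normalization parameters here are placeholders.

```python
import numpy as np

def decoder_block(xf, w1x1, gamma, beta, eps=1e-5):
    # Average-pool the incoming skip features (2x2, non-overlapping)
    h, w, c = xf.shape
    pooled = xf.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
    # 1x1 convolution = per-pixel channel mixing, stride 1, no bias
    z = pooled @ w1x1
    # Batch normalization over the spatial positions, then ReLU
    mu = z.mean(axis=(0, 1), keepdims=True)
    var = z.var(axis=(0, 1), keepdims=True)
    z = gamma * (z - mu) / np.sqrt(var + eps) + beta
    return np.maximum(0.0, z)
```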
After that, a concatenation layer 'C' concatenates, in linear form, all the inputs received from the previous layers as well as from the skip connections; the concatenated result is passed through a 1×1 conv layer with batch normalization and the ReLU function 'R'.
It passes the result to an output layer 'Z' with a softmax activation function 'SOF',
where 'Z' is the segmented image.
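The output head described here can be sketched as follows, with hypothetical feature shapes: the concatenation 'C' along the channel axis, a 1×1 convolution, and the per-pixel softmax 'SOF'.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last (class) axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def output_head(features, w1x1):
    c = np.concatenate(features, axis=-1)   # concatenation layer 'C'
    z = c @ w1x1                            # 1x1 conv down to n_classes
    return softmax(z)                       # 'SOF': per-pixel class probabilities
```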
1) Dataset: We use the CamVid dataset, which consists of 701 original images at 360×480 resolution. The images are divided into three sets: 367 training images, 233 test images, and 101 validation images. We created annotations for each image in the original dataset. Data augmentation was then performed on the training set: each image was flipped vertically and horizontally, producing two additional images per original, for a total of 1,101 RGB training images.
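The augmentation step can be sketched as follows; each image yields its horizontal and vertical flips, tripling the 367 training images to 1,101.

```python
import numpy as np

def augment(images):
    # For every image keep the original and add its horizontal and
    # vertical flips, tripling the training set (367 -> 1,101 images).
    out = []
    for img in images:
        out.extend([img, img[:, ::-1], img[::-1, :]])
    return out
```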
2) Models: Pre-trained classification and segmentation models are fine-tuned and combined to serve as the encoder and decoder parts of our architecture. Using transfer learning, we adopt VGG and ResNet with weights pre-trained on ImageNet as our base (encoder) module, and U-Net and PSP-Net as our segmentation (decoder) module. We train our model on the combinations VGG_U-net, VGG_PSP-net, ResNet_U-net, and ResNet_PSP-net.
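The four encoder-decoder pairings can be enumerated as below; the backbone and head names are placeholders for whichever framework is used to assemble the actual models.

```python
ENCODERS = ["vgg16", "resnet50"]     # base (encoder) backbones
DECODERS = ["unet", "pspnet"]        # segmentation (decoder) heads

# Every encoder is paired with every decoder, giving the four trained models
combinations = [f"{e}_{d}" for e in ENCODERS for d in DECODERS]
```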
3) Training Setup: We train against loss and accuracy. The loss 'L' is the difference between the predicted and actual values.
Weights are updated via backpropagation to minimize the loss according to

w_n = w_a − n · ∂L/∂w,

where 'n' is the learning rate (n = 2e−4), 'w_n' is the new weight, and 'w_a' is the old weight. Batch normalization is used to avoid the need for dropout and serves to regularize the model. We use the Adam optimizer to minimize the loss. Due to limited resources, we train for only 5 epochs with 512 steps per epoch. After every epoch the model saves its learned weights and evaluates itself on the given validation set; each step takes around 13 seconds (6,904 s per epoch). To check the overall performance of the model, we use the test set with previously unseen images for prediction.
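The update rule can be illustrated on a toy quadratic loss with the paper's learning rate and step budget; note that the actual training uses Adam, which adds momentum and per-parameter scaling on top of this plain gradient step.

```python
import numpy as np

lr = 2e-4                          # 'n' in the update rule, n = 2e-4
w = np.array([5.0])                # a single toy weight
for _ in range(5 * 512):           # 5 epochs x 512 steps, as in training
    grad = 2.0 * w                 # dL/dw for the toy loss L = w^2
    w = w - lr * grad              # w_n = w_a - n * dL/dw
```

After the full step budget the weight has decayed toward the loss minimum at zero, illustrating how repeated small updates minimize 'L'.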
The results shown below describe the model's performance. Due to limited resources, the experiments were carried out on a standard laptop (Core i7, 8 GB RAM). The results are satisfactory given that the model was trained for only 5 epochs.
In this work, we have demonstrated the concept of combining different pre-trained classification models and segmentation models for semantic image segmentation. We developed an end-to-end trainable model that achieved good performance on the CamVid dataset given the level of resources available. We trained our model for only 5 epochs due to limited resources. If this model is trained on a larger dataset for a larger number of epochs, more accurate and precise results will be achieved, suitable for many real-world applications such as autonomous driving.