SEARCH WITHIN CONTENT
Citation Information : International Journal of Advanced Network, Monitoring and Controls. Volume 6, Issue 3, Pages 42-49, DOI: https://doi.org/10.21307/ijanmc-2021-026
License : (CC-BY-NC-ND 4.0)
Published Online: 08-October-2021
Text is an important invention of humanity, which plays a key role in human life, so far from dark ages. Text in image is closely related to the scene or a product and is widely used in vision based application. In this paper we are addressing the problem of visual understanding with text. The main focus is combining textual cues and visual cues in deep neural network. First the text is recognized and classified from the image. Then we combine the attended word embedding and visual feature vector which are then optimized by CNN for Fine-grained image classification. We carried out the experiments on soft drink dataset in Pakistan. The results shows that the system achieves significant performance which can be potentially beneficial for real world application e.g. product search.
Fine-Grained image classification [1, 28] is a real-world emerging problem and it has received great attention from research communities around the globe. In computer vision, fine grained  image involves the problem of assigning images to classes where different instances of different classes differ slightly in their appearances e.g., flower species, animal species, product/place types. Therefore, fine-grained image classification  is a challenging assignment due to the slight variations among highly-confused categories of instances belonging to various classes of objects, which are hard to distinguish. Further, in some of the cases, human intervention or specific knowledge of a particular domain is also required to perform precise fine-grained image classification .
For some years, fine-grained image classification  is also being applied for natural scene classification. This involves the natural images of wide and diverse nature. It is witnessed that classification of shops, variety of products in shops, etc. are the considered areas where fine-grained image classification is being used . Using fine-grained image classification  techniques on the soft drink dataset is an area that has received limited attention from the researchers, whereas this area has also exciting applications in restaurants and shops, where automated orders could be places once a specific brand of soft drink is going to be out of stock.
In this work, we have exploited text and visual cues in form of features to be used with Convolutional Neural Network for attaining good performance in the fine-grained classification [1, 28] of soft drinks. To the best of our knowledge, it is a unique application of a classification technique in the domain of fine-grained image classification of soft drink datasets.
The second section of this paper discusses the proposed classification technique, the third section involves results and discussion and conclusions are drawn in the last section.
Text in images carries a high level of information which makes this property of images very rich in computer vision as well as for humans. The information encoded in the text can be very beneficial for many computer vision applications. Text detection and recognition  face some challenges like the diversity of nature, the complexity of background, interference factors, etc. The first novel approach of a real end-to-end model for text detection and recognition in a scene was proposed in 2010 by Neumann et al . It achieved a highly significant increase in recognition rate from 53% to 72% on the Char74k dataset (de Campos et al.) but there is a weakness in this system is that it was only applicable to horizontal texts. Later on Coates et al.  2011 apply large-scale algorithms to build highly effective classifying model for both detection and recognition. Their end-to-end system has high accuracy and performs well on complex natural images but the drawback is that it requires a relatively large volume of training data.
Further in 2014 Jaderberg et al  addresses the problem of text detection and recognition by generating text proposals with CNN and provides an end-to-end system for reading text in natural scene images. That system was capable of both text spotting and image retrieval and perform excellently on complex natural images. Yao et al  present a unified framework for detecting and recognizing the text in images to handle the texts of different orientations. They also provide a method for ‘search dictionary’ to correct the recognition errors. The system achieves highly competitive performance, especially on multi oriented texts. Also, the works of Shiva kumara et al.  and C Yao et al.  realize the significance of multi-oriented text detection and recognition to the research community. In the paper of Yingying Zhu et al. , they discussed in detail the recent advances and future trends for scene text detection and recognition.
Fine-Grained classification  aims for the deep insight into image that is why this problem got a lot of attention from researchers around the globe. Many approaches have been developed to address the particular problem till now with the margin of improvement in the future.
Existing deep learning-based fine-grained image classification approaches  could be sub-classified into the following according to the use of additional information or human inference:
1) Approaches that directly use the general deep CNNs for image fine-grained classification .
2) Part detection and alignment-based approaches .
3) Ensemble of network-based approaches .
4) Approaches based on attention mechanisms .
The prior work in fine-grained classification  can be simply divided into two paths. The first is to detect the discriminative object parts in the image to compensate for nuisance variations such as pose. Many parts-based methods with geometric constraints have been proposed for bird classification  and dogs .
The second track is to derive discriminative and robust features. Classic hand-crafted feature descriptors such as the Scale Invariant Feature Transform (SIFT) , Histogram of Oriented Gradients (HoG) , and Color Histogram  other methods such as the Part-based One-vs-One Features (POOFs)  focus on modeling corresponding parts activation. Deep convolutional neural network (DCNN) approaches for general object classification achieve state-of-the-art performance for fine-grained classification by applying transfer learning .
The idea of attention is one of the most influential ideas in deep learning allows the network to focus on specific aspects of a complex input. The main idea of the attention t mechanism is to allow the decoder to “look back” at the original input and extracts only the significant information that is important for decoding .
Consider we are attempting machine translation on the following sentence: “The cat is beautiful.” If you can ask someone to pick out the keywords of the sentence, i.e. which ones describe the most meaning, they would likely say “cat” and “beautiful.” Articles like “the” and “is” are not as relevant in translation as the previous words (though they aren’t completely insignificant). Therefore, we focus our attention on important words. We use the attention mechanism on texts to get our most relevant words from text features and we put attention on the visual features to get our most relevant features like edges, color, size of object, etc.
The attention mechanism  scores each input word (via dot product with attention weights), then to create a distribution scores are passed through softmax function. An attention vector is produce by multiplying distribution with the context vector and then passed to the decoder. The advantage of attention are its ability to identify the information in an input most pertinent to accomplishing a task, increasing performance especially in natural language processing but it increases computations unlike to human..
Multimodal processing  significantly enhances the understanding, modeling, and performance of human-computer interaction. In multimodal fusion , user interaction with system is through various input modalities like speech, gesture, and eye gaze. In our context, different multimedia researchers presented different fusion strategies used for combining multiple modalities in order to an encoder-decoder architecture accomplish various tasks [28, 29, 33].
The literature on multimodal fusion  research is presented through several classifications based on the fusion methodology. The methods can be described from their advantages, weaknesses, basic concept, and their usage in various analysis tasks but multimodal fusion has several issues that influence the process such as contextual information, confidence level, synchronization between different modalities, etc. In 2016  uses multilayer and multimodal fusion of deep neural networks for video classification. In 2017  uses weakly paired multimodal fusion for object recognition.  Uses Multimodal deep networks for image-based document and text classification by introduce an end-to-end learnable multimodal deep network that jointly learns text and image features and performs the final classification based on a fused heterogeneous representation of the document. They validated their approach on the Tobacco3482 and RVL-CDIP datasets. In 2020  did Fine-grained Classification by the Combination of Visual and Locally Pooled Textual Features.
Our approach is to classify soft drinks images into their respective classes with the help of text and visual features. We extract textual and visual features from the input image with the help of different models [37, 40] and treat those features as input for our multimodal  which combines both the inputs to anticipate the classification of the given image. Textual cues play a key role in fine-grained classification , especially in the classification of business places such as bakery, café, bookstore, and daily use products. Multiple models have been developed to address the particular problem. These models assist us to extract the textual information that is highly useful for image classification. We adopt word2vec .
Visual features are the second input to our modal. The visual feature is the information about the contents of an image which describes its specific structures such as shapes, edges, objects, patterns, and colors i.e. properties. In our case, we extract the visual features (works as second input) with the help of VGG pertained modal  on ImageNet  by fine-tuning the modal with our dataset. These visual features work as building blocks with the texture input to give us the desired results.
The 224x224 Input images are transferred to VGG model  for visual features extraction. We extract visual features by importing the VGG  model from tensor flow keras with the pretrained weights. The top layer is set to false when loading the model. Further, we unfreeze the last five layers to fine tune the modal. After defining the model, PIL object of the image has to be converted in a pixel data NumPy array where we only have one sample and the values are then appropriately scaled to get the features. We get the feature vector ‘yf’ from the last max-pooling layer as a 4096 dimensional features. Feature extraction part is from the input layer to the last max pooling layer.
At the same time, the input is send to OCR for character recognition. OCR gives us the classification of text by localizing and recognizing the text. Then OCR saves the recognized text in a file which is then passed to word2vec  model for textual representation ‘xf’. Word2Vec  is a two-layer neural networks which has been trained for the reconstruction of linguistic contexts of words. It takes a large corpus as its input and produces a vector space, of given dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. The purpose and usefulness of Word2vec  is to group similar words vectors together in vector space. That is, it detects similarity between vectors by using the ‘cosine similarity’ function. Let ‘a’ be the first vector and ‘b’ be the second vector,
After that, we put an attention mechanism  on both inputs of multimodal  i.e. visual features and textual features. There can be some recognized text that is more relevant than others at the moment of discriminating similar classes. So we need to capture the inner correlation between the textual and visual features. The attention mechanism learns a tensor of weight that is used between the visual features and textual features. Let ‘X’ be the extracted textual features and ‘Y’ is visual features and weight is ‘W’. We compute the attention mechanism by:
The resulting normalized attended vector ‘wa’, is multiplied with the textual features ‘xf’ to obtain the final attended textual features ‘wfa’. The obtained textual features ‘xfa’ and the visual features ‘yfa’ are concatenated in the multimodal to form the final features by
Finally, the resulting vector serves as input to a final classification layer that outputs the probability of a given class based on low-rank bilinear pooling operation.
We have collected the new dataset of soft drink bottles of 10 classes in Pakistan with 375 original images. The dataset contains several occluded, rotated, low quality and blurred text instances which increases the difficulty of performing successful text recognition. Due to limited resources, our dataset is not fully organized and has many limitations. The images are divided into 2 sets having 200 training images, 175 test images.
We start by augmentation the training images flipped vertically and horizontally to make 2 more images of every image to avoid the problem of overfitting for feature extraction mode. So the total number of training images is 600. We load the image and convert it to array using keras preprocessing. We also expand the dimensions and use a preprocess function in keras to fit the image according to our model. Then the predicted output is sent to for attention.
We extract text from all of the images by using the combination of tesseract  and easyOCR . EasyOCR is used to localize a text and tesseract is used for the recognition of the text. The extracted text is saved in two files each set of training set text and test set text. Then these sets are loaded for word embedding . To recognize context meanings we use two layer shallow neural network to describe word embedding. We import word2vec  from genism library.
We put the attention on both textual and visual features for the fine-grained classification  of our dataset. The network is trained for 5 epochs with Adam optimizer. The batch size employed in all our experiments is pre-defined from a library ‘config’, with a learning rate of 0.0001, momentum of 0.9.
These experiments were implemented by using tensor flow deep learning framework on a simple laptop of 8 GB ram and 2.7 GHz (i7).
Random results of some classes are shown below.
Bad results are because of lot of noise or the quality of the image.
In this paper we demonstrated the importance of textual and visual cues for fine grained classification. We developed a frame work for precise classification of soft drink bottles by combining pre-trained models. The results show that the textual features plays more important role then visual features for classification of real life products.
Furthermore, as the system is created with low resources it can be modified and enhanced for better performance on large datasets and real world classification applications.