ICMLT 2022, March 11–13, 2022, Rome, Italy
data from the semi-automatic checkouts to the SCO environment. This can be reflected as a
data distribution shift between the semi-automatic checkout, in which the system is trained,
and the SCO target domain, in which it is deployed.

In this study, we are interested in fixed SCO systems as described by Beck [2]. A fixed SCO
is a setup in which the customer scans the products at a designated machine. In a fixed SCO,
a camera can be placed where barcode readers are typically mounted, either below or facing
the product that is scanned. The camera stream provides images of the product when it is
positioned at the counter. Using the captured images and machine learning techniques, the
product can be classified into its correct category. Automatising the recognition process
has a number of benefits for both supermarkets and their customers, such as increased
revenue, identification of scanning errors, and detection of malicious behavior in the
checkout.

We have opted to use text extracted from product images. This is well motivated by our
previous paper [11], in which we showed that the use of Optical Character Recognition (OCR)
and NLP techniques is promising for the recognition and verification of retail products.
This success is due to the following reasons: text extraction using OCR is fairly robust to
changes in scale, rotation, and illumination. Furthermore, text recognition accuracy is high
even on images containing few textual elements.

Different camera placement and resolution, lighting conditions, and color information create
a significant domain shift between images in the source and target domain. The goal of this
paper is to exploit the source domain knowledge and apply it in the target domain. The source
domain consists of text extracted from RGB images, while the target domain consists of text
extracted from images taken by a monochrome camera. Consequently, the objective of this paper
is to assess the efficiency of transfer learning from a source domain, using an RGB camera,
to a target domain, using a monochrome camera, by using OCR and deep learning approaches for
NLP. Furthermore, we evaluate two different kinds of models for the product classification:
a Convolutional Neural Network (CNN) with GloVe embeddings, and BERT.

2 BACKGROUND

Learning models are commonly deployed in environments which are different from the training
environment. As a result, the data collected and used for training does not perfectly match
the data in the deployment environment. Usually, statistical learning theory is based on the
assumption that both training and test data are derived by the same process (i.e., on the
i.i.d. assumption), which is not practical for real world data in which the model can be used
in similar but not identical environments. Naturally, the performance of trained machine
learning models may decrease when the distributions of the training and test data differ.

Another challenge which may worsen the case is that collecting and labeling data sets that
are large enough to train an efficient model is costly for each deployment environment, so we
often have insufficient data, which may lead to overfitting or undesirable behavior. Transfer
Learning (TL) is an important new area of research which alleviates these issues and
ultimately tackles the data scarcity problem. The TL paradigm aims to leverage the
information gained in one or more learning tasks having a larger dataset to improve
performance in another related task with a smaller dataset [10, 22]. Thus, TL helps bridge
the distribution gap between datasets retrieved from the source domain $\mathcal{D}_S$ and
the target domain $\mathcal{D}_T$.

More formally (inspired by the description in [12]), a domain
$\mathcal{D} = \{\mathcal{X}, p(X)\}$ can be defined as a space of all feature vectors
$\mathcal{X}$ and a marginal probability distribution $p(X)$, where
$X = \{x_1, \ldots, x_n\} \in \mathcal{X}$ is a particular learning sample with $n$ feature
vectors. For a specific domain, the goal is to learn an objective predictive function $f$
that predicts a label $y_i$ from label space $\mathcal{Y}$ for the corresponding sample
$x_i$. This is called a task and can be noted $\mathcal{T} = \{\mathcal{Y}, f(x)\}$. Given
source and target data, transfer learning is used when there is a discrepancy in either the
domains ($\mathcal{D}_S \neq \mathcal{D}_T$) or the tasks
($\mathcal{T}_S \neq \mathcal{T}_T$). Source and target domains are said to be different if
their feature spaces are not the same ($\mathcal{X}_S \neq \mathcal{X}_T$) or their marginal
probability distributions are unequal ($p(X_S) \neq p(X_T)$). In such cases, an effective way
to learn the target task is to explicitly map the data to a common or domain invariant
representation. This learning strategy is called Domain Adaptation (DA).

The prevalent assumption is that the source and target label spaces are similar. DA methods
are classified further into three categories depending on the availability of labels in the
target domain [19]. Supervised DA methods assume that the target data is labeled (but
generally small). In the semi-supervised case, in addition to a small amount of labeled
target data, a larger amount of unlabeled target data is available. In the unsupervised
setting, no labeled data is available for the target domain. In this paper, we use supervised
DA techniques and we experiment with varying small amounts of labeled target data to evaluate
the performance of the retrained model.

2.1 Transfer Learning in NLP

The arrival of Transfer Learning and rapid improvements in the performance of language models
represent great progress in the NLP domain. Global Vectors (GloVe) is one of the models that
learn distributed word representations in a multi-dimensional space [13]. It is a global
log-bilinear unsupervised regression model that aims to obtain vector representations for
words. This is achieved by mapping words into a meaningful space where the distance between
semantically similar words is small. The model is trained on word-word co-occurrence
statistics from a large corpus. The resulting representations capture meaningful linear
substructures of the word vector space and the obtained pre-trained word embeddings can be
used for new tasks. However, these embeddings do not capture the context of words. Recently
developed language models, such as BERT [5] and ULMFiT [7], can learn the context of each
word and catch the nuances and inter-dependencies among the various words of the text. Such
language models allow data scientists to reach reasonable accuracy using pretrained models
without the need for an excessive amount of manually labeled data, and to achieve
state-of-the-art performance for various NLP tasks. One of the possibilities to use these
pretrained language models is to build new
neural network structures on top of them. Basically, there are three ways to train these
stacked neural networks: stack-and-finetune, stack-only and finetune-only. In the stack-only
method, the pretrained model works as a feature extractor and its final layer is removed and
replaced by one or several layers. The weights of the pretrained model are frozen during
training for the new task and only the additional layer weights can be updated. Finetune-only
methods use only the pretrained model and train it further on the new domain. Peters et al.
[14] compared the stack-only and finetune-only methods and concluded that finetune-only is
better than the stack-only option. Finally, stack-and-finetune combines both methods,
fine-tuning the pretrained language model while also training the stacked model.

An important feature of the recent language models which led to their success is the
attention mechanism. Instead of encoding a single vector to represent the sequence, the
attention mechanism computes, for each token in the output, a context vector over all tokens
in the input sequence [23]. The decoder computes a relevancy score for all tokens on the
input side. These scores are then normalized by performing a softmax operation to obtain the
attention weights. These weights are then used to perform a weighted sum of the encoder's
hidden states, thus obtaining the context vector $c_t$. Then, the attention vector is
obtained by performing a hyperbolic tangent operation on the concatenation of the context
vector and the target hidden state. This attention vector generally provides a better
representation of the sequence than traditional fixed-sized vector methods by identifying the
relevant input tokens while generating the output token. Using this mechanism, Bahdanau et
al. [1] were able to achieve state-of-the-art performance in machine translation tasks.

The Transformer architecture, building upon the significant improvements achieved by the
attention mechanism and consisting of an encoder and a decoder, was proposed in Vaswani et
al. [18]. The encoder includes a multi-head attention layer, residual connections, a
normalization layer, and a generic feed-forward layer. The decoder is similar to the encoder
except that it additionally contains a "masked" multi-head attention layer. The Transformer
has achieved new state-of-the-art results in various tasks such as machine translation and
entailment.

Based on transformers, Bidirectional Encoder Representations from Transformers (BERT) was
developed to learn contextual embeddings for words [5]. BERT has at its core a Transformer
language model with a variable number of encoder layers and self-attention heads. It is a
deeply bidirectional language model that is pretrained, without supervision, on unlabeled
data extracted from a large corpus. During the training, some of the words in the input are
randomly masked out and the model conditions on both the left and the right context of each
word to predict the masked words. Being bidirectional, BERT can efficiently learn the
contextual embeddings and achieve the state of the art in various tasks such as named entity
recognition and question answering.

2.2 Transfer learning for retail product recognition

Transfer learning is exploited for retail product recognition using image datasets. Tonioni
and Di Stefano [17] have used an image-to-image translation by means of a generative
adversarial network (GAN) to address the domain shift and generate resilient cross-domain
feature embeddings of product images. In the case where the source dataset includes subsets
with different distributions, Thota et al. [16] proposed a multi-source deep learning-based
domain adaptation system which serves to identify and verify the presence and legibility of
use-by date information on food packaging pictures. The method incorporates discrepancy
losses, i.e., maximum mean discrepancy (MMD) [9] and correlation alignment (CORAL) [15], to
extract domain-invariant representations for all domains. Furthermore, it aligns the
distributions of all pairs of source and target domains in a common feature space, along with
the class boundaries. Zhang et al. [23] developed a dual pyramid scale network to learn the
multi-scale features of the data using both detection and counting views. Furthermore, an
iterative knowledge distillation training strategy is applied to leverage both image and
instance levels and thereby narrow the semantic gap between source domain and target domain.
To tackle cross-domain fine-grained product recognition, Wang et al. [21] develop a CNN-based
model which consists of two specialized components. The first component is an adversarial
component which handles the domain shift by gradually minimizing the discrepancy between
different domains and ensuring alignment at both the domain level and the class level. The
second component is a self-attention module designed for fine-grained recognition by
capturing the discriminative image regions.

3 THE PROPOSED APPROACH

Domain adaptation is a common challenge for image recognition systems. The domain shift
between a source and target domain can be caused by a change of background, pose, or type of
camera. This may lead to a significant degradation in performance for the new domain, caused
by different data distributions between training and testing [4]. To mitigate this, the model
has to be re-calibrated to the target domain by collecting data from the target domain.
However, this is time-consuming since it requires a lot of manual work.

The setup of the training procedure of our proposed approach is illustrated in figure 1. The
images from the source domain consist of three-channel RGB data, while the target domain
captures images from a monochrome camera from another viewpoint. The image data from both
domains are processed by an OCR module that detects words in an image and outputs them in a
textual data format. The source textual data is used for training a product recognition model
from scratch with RGB textual data, while the target domain uses textual data from both the
source and target domain. Further details about the training and testing of the product
recognition model are explained in section 4.2.

In this study, we have used two machine learning models, namely, CNN with GloVe and BERT.
The architecture of the CNN consists of 3 one-dimensional convolutional layers separated by
two max pooling layers, followed by one global max pooling layer and two dense layers.
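
To make the layer ordering of the CNN concrete, the following sketch traces how the shape of one input sample could change through such a stack. It is only an illustration: the paper does not report the hyperparameters, so the sequence length, embedding size, filter count, kernel size, pool size, dense width and number of classes below are all hypothetical, as are the function names.

```python
# Shape walkthrough for a 1D text CNN with the layer order described above:
# conv -> pool -> conv -> pool -> conv -> global max pool -> dense -> dense.
# Every hyperparameter value here is a hypothetical placeholder.


def conv1d_len(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1


def maxpool1d_len(length, pool_size):
    """Output length of non-overlapping max pooling."""
    return length // pool_size


def trace_text_cnn(seq_len=128, embed_dim=100, filters=128,
                   kernel_size=5, pool_size=2, dense_units=128,
                   num_classes=10):
    """Return (layer name, output shape) pairs for one input sample."""
    shapes = [("glove_embedding", (seq_len, embed_dim))]

    length = conv1d_len(seq_len, kernel_size)
    shapes.append(("conv1d_1", (length, filters)))
    length = maxpool1d_len(length, pool_size)
    shapes.append(("max_pool_1", (length, filters)))

    length = conv1d_len(length, kernel_size)
    shapes.append(("conv1d_2", (length, filters)))
    length = maxpool1d_len(length, pool_size)
    shapes.append(("max_pool_2", (length, filters)))

    length = conv1d_len(length, kernel_size)
    shapes.append(("conv1d_3", (length, filters)))

    # Global max pooling keeps one value per filter, whatever the length.
    shapes.append(("global_max_pool", (filters,)))
    shapes.append(("dense_1", (dense_units,)))
    # Assumed output layer with one unit per product class.
    shapes.append(("dense_2", (num_classes,)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_text_cnn():
        print(f"{name:16s} {shape}")
```

Because the global max pooling layer collapses the sequence dimension, the dense layers receive a fixed-size vector regardless of how many words the OCR module returns for a product.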
Figure 3: Example images (a) and (b) from the dataset. Each sample shows an image captured by the RGB camera (left) and
the line scan camera (right). Note that the example images in (a) are different samples and are displayed to show the domain
To get more insight into the data distribution in the space, we used t-distributed stochastic
neighbor embedding (t-SNE) to visualize both domains. Figure 4 shows the source domain,
represented in red, and the target domain, drawn in blue, where each point in the scatter
plot represents a projection in 3D of a sentence embedding representing one product. The
shift between the domains is visually clear and it is evident that the data distributions of
the domains are different, with an overlap in some regions of the space.

Figure 4: Visualization of source (red) and target (blue) domains using t-SNE

In addition to the visualization, the gap between both domains is measured using the Word
Mover's Distance (WMD) technique [8] and is presented in table 2. WMD measures the similarity
between documents by computing the minimal traveling distance between them using word2vec
embeddings. The mean of the distances between Train_RGB and the target domain is 1.32, while
it is just 1.11 between Train_RGB and Test_RGB. The relative difference of these distances,
(1.32 - 1.11) / 1.32 ≈ 16%, reflects the gap existing between the two domains and supports
the visualization obtained by t-SNE.

Table 2: WMD distance between source and target domains
Domains                             Mean    Variance
Train_RGB -> Train_LS ∪ Test_LS     1.32    0.33
Train_RGB -> Test_RGB               1.11    0.31

4.2 Experiments

Three experiments are conducted to evaluate cross-domain recognition of retail products using
NLP. They are described in the following points, and a summary of the training and test
datasets for the experiments can be seen in table 3.

Experiment 1. The goal of the first experiment is to evaluate the performance of the model
pretrained using RGB data on the line scan (LS) data without any retraining of the model. The
objective is to get insights on the gap between the two domains. Thus, we train the CNN and
BERT models on a training set including only textual data from the RGB domain, Train_RGB.
Then we evaluate them on Test_RGB and Test_LS and report the obtained accuracies.

Experiment 2. The goal of the second experiment is to explore how additional training samples
from the target domain improve the model performance compared to experiment 1. Thus, we train
the model with Train_RGB and an increasing number of additional samples from Train_LS, using
random selection of 2, 5, 10, 20, 30 and 50 samples for each training run. Due to the random
selection of samples from Train_LS, 5 iterations are performed for each training run and the
mean accuracy is reported.

Experiment 3. In a realistic context such as a fixed SCO system, the products are often
occluded when scanned. In this experiment, we simulate how occlusion affects the performance
of the model by removing parts (either 25% or 50%) of the textual data. Since clustering is
used to find the clusters defining what to occlude in Test_LS_CC, 5 iterations are evaluated
for each case and the mean result is reported.

Table 3: Setup of datasets for the experiments
Experiment   Training dataset                      Test dataset
1            Train_RGB                             Test_RGB, Test_LS
2            Train_RGB + {2, 5, 10, 20, 30, 50}    Test_LS
             samples from Train_LS
3            Train_RGB                             Test_LS_CC with occlusion factors of
                                                   {0.25, 0.5} and cluster sizes of {4, 8}
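
The protocol of Experiment 2 (random target-domain samples added to the source training set, repeated 5 times, mean accuracy reported) could be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the `(text, label)` data layout and the `train_and_score` callback, which stands in for training a CNN or BERT model and returning its accuracy, are all hypothetical. The per-class selection follows the results section, which describes the added data as random text samples per class.

```python
import random
from statistics import mean


def sample_per_class(target_train, k, rng):
    """Pick up to k random (text, label) pairs for every class label."""
    by_class = {}
    for text, label in target_train:
        by_class.setdefault(label, []).append((text, label))
    picked = []
    for items in by_class.values():
        picked.extend(rng.sample(items, min(k, len(items))))
    return picked


def run_experiment2(source_train, target_train, target_test, train_and_score,
                    sizes=(2, 5, 10, 20, 30, 50), repeats=5, seed=0):
    """Mean score per sample size over `repeats` random draws.

    train_and_score(train_set, test_set) is a hypothetical callback that
    trains a model on train_set and returns its accuracy on test_set.
    """
    rng = random.Random(seed)
    results = {}
    for k in sizes:
        scores = []
        for _ in range(repeats):
            extra = sample_per_class(target_train, k, rng)
            scores.append(train_and_score(source_train + extra, target_test))
        results[k] = mean(scores)
    return results
```

Averaging over several draws reduces the influence of any single lucky or unlucky selection, which matters most for the smallest settings of 2 and 5 samples.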
For the BERT model, we used a sequence length of 128, which covered most of the text snippets
for the samples. On the few samples that contained sequences longer than 128, the sequence
was cut.

5 RESULTS AND DISCUSSION

The first experiment compares the performance of the two different kinds of models that we
have used, the CNN with GloVe and BERT. The results are reported in table 4. Once the CNN
model is trained using Train_RGB, it is evaluated on the RGB domain using the Test_RGB set
and achieves an accuracy of 81.3%. Without any retraining, the same pretrained model achieves
65.1% in the LS domain. On the other hand, the BERT model achieves an accuracy of 88.2% on
the test set from the source domain and as much as 80.3% in the target domain.

In summary, the results of this experiment show that when the model that has been pretrained
on the source domain is used directly on the target domain, the accuracy is reduced by
approximately 16 percentage points for CNN and 8 percentage points for BERT. In addition, the
BERT model shows a much higher performance than CNN.

Table 4: Accuracy results for Experiment 1 using BERT and CNN-based model
Model    Test_RGB    Test_LS
CNN      81.3%       65.1%
BERT     88.2%       80.3%

Table 5 shows the results of the second experiment, in which we train the models with varying
amounts of data from the target domain. The goal is to determine the amount of data required
for a reasonable performance, which should preferably be similar to the one obtained in the
source domain. As explained in the previous section, the test dataset is fixed and the
training data from the target domain used for fine-tuning varies and includes 2, 5, 10, 20,
30 and 50 random text samples per class. Both models learn from a relatively small amount of
data from the new domain. The accuracy of the CNN model increases by 4.4 percentage points by
adding only 2 samples per product and reaches 80.2% by adding 30 samples for each class,
which is close to the accuracy obtained on the source domain. Retraining the BERT model with
only 2 data points for each product yields an accuracy of 82.9%, which represents an increase
of 2.6 percentage points. Adding 20 samples to the training data results in an accuracy of
86.0%, whereas 50 samples were enough to achieve 88.1%, which is comparable to the accuracy
obtained in the source domain. The CNN model also achieves an accuracy comparable to that
obtained in the source domain when 50 samples are added to the training data, although it
remains below the BERT accuracy.

Table 5: Accuracy obtained in experiment 2 using BERT and CNN-based model

[...] accuracy of 80.1% achieved in experiment 1 for Test_LS drops around 5 percentage points
with an occlusion factor (OF) of 0.25 and around 15 percentage points with an occlusion
factor of 0.5 for Test_LS_CC. The classification difference is not very significant with an
OF of 0.25. This scenario is comparable to a hand covering a medium sized package, showing
that the BERT model is robust to occlusion. The number of clusters was varied to investigate
whether the cluster size affects our emulation of occlusion by a hand covering a product. The
results with cluster sizes of 4 and 8 show almost the same accuracy for OF 0.25 and 0.5.

Table 6: Accuracy obtained in experiment 3 for the BERT model testing the impact of occlusion
using Test_LS_CC
Dataset       Clusters    Occlusion factor    Accuracy
Test_LS_CC    4           0.25                76.2%
Test_LS_CC    4           0.5                 67.9%
Test_LS_CC    8           0.25                77.3%
Test_LS_CC    8           0.5                 67.9%

Summarizing the results, the experiments show that NLP can leverage the knowledge acquired in
different environments. Specifically, the BERT model can achieve good performance for retail
product recognition on a dataset with a large number of products, and it is robust to domain
change. After retraining the model on 20 samples per product from the target domain, the
accuracy is only about 2 percentage points below that obtained in the source domain (86.0%
versus 88.2%), and 50 samples were enough to obtain a performance comparable to that obtained
in the source domain.

6 CONCLUSIONS

A common aspect of many learning tasks, including retail product recognition, is that the
available labeled data used for training does not match the actual data in the deployment
environment. This paper aims to handle this domain shift by transferring the learning
acquired in the source domain to the deployment domain, which differs in aspects such as
sensors, occlusion, viewing angles, and lighting conditions. The results show that
cross-domain NLP retail recognition preserves similar performance when using the BERT
language model. We also show that a small amount of extra training data from the target
domain increases the performance to the same level as in the source domain. In the future,
the effects of using and combining different OCR engines could also be explored for real-time
applicability.

ACKNOWLEDGMENTS

This work was supported by the Swedish Knowledge Foundation (DATAKIND 20190194), the company
ITAB, and Smart Industry Sweden (KKS-2020-0044).
NLP Cross-Domain Recognition of Retail Products ICMLT 2022, March 11–13, 2022, Rome, Italy
