ABSTRACT Object detection is an essential step in various applications. Since the advent of deep learning, convolutional neural networks and transformers have shown significant improvement in object detection compared to statistically motivated algorithms. However, they still require improvement in multiple aspects. One is maintaining detection performance in unseen environments without retraining on images and labels from those environments. The other is reducing the tight requirements on labels. In object detection, a bounding box is usually used as the label for an object. In this paper, we propose an object detection algorithm that requires only images and the number of objects on images as labels. We approach the problem with deep reinforcement learning. The proposed algorithm uses an actor-critic algorithm that can produce continuous actions. The actor model produces multiple bounding boxes, and the critic model evaluates them as training goes on. We also propose a reward model that uses a model pre-trained on an object detection dataset. The proposed algorithm requires only images and the number of objects on images, not bounding boxes. Experiments show that the proposed algorithm gives results comparable to a transformer-based approach. It can also adapt to unseen environments using only images and the number of objects on images.
boxes without incrementally refining the bounding box's location.
The contributions of the proposed method are as follows.
First, we propose an object detection method that uses only images and the number of objects on the image as labels.
Second, we propose a structure for the actor and critic model and a reward model that reflects the evaluation result of the object's candidate region.
Third, we show that the proposed algorithm can cope with unseen environments only by using images and the number of objects in the image. Though the proposed algorithm detects objects without class categories, it can also be extended to distinguish object types.

II. RELATED WORKS
A. OBJECT DETECTION WITH CNN
R-CNN [3] is the first method for object detection that adopts a convolutional neural network. It combines selective search region proposals [4] and post-classification with the convolutional network. SPPnet [5] proposes a spatial pyramid pooling layer for the classification layers to reuse features, and it shows a significant speed-up compared to the original R-CNN. Fast R-CNN [6] improves SPPnet by fine-tuning all layers while minimizing a loss of both confidence and bounding box regression. It uses the loss term that was proposed in MultiBox [7]. Faster R-CNN [8] uses a region proposal network (RPN) instead of selective search proposals. It combines the RPN with Fast R-CNN.
Faster R-CNN can be categorized into two-stage anchor-based detectors. To reduce computation time, many one-stage anchor-based detectors, such as YOLOv2 [9], YOLOv3 [10], and SSD [11], have been introduced after Faster R-CNN.

B. OBJECT DETECTION WITH TRANSFORMER
Transformer [1] first showed a remarkable performance in natural language processing (NLP). Immediately, researchers tried its use on images. First, a transformer was applied to classification [12]. Then, the first transformer-based object detection algorithm called DETR [2] appeared. After DETR, many transformer-based object detection methods [13], [14], [15], [16], [17] have emerged, showing an improvement compared to CNN-based algorithms.
FIGURE 3. The structure of the proposed evaluation network of the object candidate area.
FIGURE 4. Training data generation process for training object candidate region evaluation model.
Some review papers [18], [19], [20], [21], [22] provide a detailed analysis of transformers and their applications in various areas, including object detection.

C. OBJECT DETECTION WITH REINFORCEMENT LEARNING
Caicedo et al. [23] proposed a deep reinforcement learning-based active localization method using actions composed of horizontal moves, vertical moves, scale changes, aspect ratio changes, and triggers. Mathe et al. [24] proposed an algorithm for object detection with reinforcement learning by configuring sequential search as reinforcement learning. Pirinen and Sminchisescu [25] proposed the drl-RPN algorithm consisting of a sequential region proposal network (RPN) and an object detector under the deep reinforcement learning framework.
Uzkent et al. [26] proposed an algorithm for object detection in large images with deep reinforcement learning. They automatically train the agent in a dual reward setting to choose low- or high-spatial-resolution images. Zhang et al. [27] proposed a deep reinforcement learning algorithm for object detection with weakly labeled training images. Liu et al. [28] proposed pay attention to them (PAT) for general object detection by integrating bottom-up single-shot convolutional neural networks and a top-down operating strategy. Zhang et al. [29] presented a detailed survey paper on weakly supervised object localization and detection.
Wu et al. [30] proposed a weakly supervised object detection algorithm to handle problems caused by part domination and untight boxes. Feng et al. [31] proposed a weakly supervised object detection algorithm that addresses issues caused by inconsistent learning for object variations and unawareness of localization quality. Fang et al. [32] proposed a reinforcement learning-based algorithm for small object detection by enhanced representation learning with spatial transformation and early convolution. We proposed an object detection algorithm based on reinforcement learning [39]. In this paper, we change the actor model from a CNN to a transformer and add experiments on unseen environments.
FIGURE 5. Trained object evaluation model results (a) when considered as an object (b) when considered as not an object.
FIGURE 6. Object selection process among candidate regions generated by the actor model.
TABLE 1. The proposed reward model reflects the result of object region evaluation.
III. PROPOSED METHOD
Object detection models produce outputs that consist of object type and location. When training a model under supervised learning, the actual location of an object on an image is considered ground truth, and the error is computed by comparing it to the model's output. This supervised learning method cannot be applied when object location information does not exist for the image. In this paper, we propose a reinforcement learning-based object detection method that can cope with situations without object location information for object detection.
We propose a method based on reinforcement learning. The actor model produces the locations of objects as output. We consider the model's action as the location of an object on an image, so reinforcement learning can be applied to learn the location of an object. When an object detection model produces location information on an image, we can assume that the evaluation results of similar locations are likely to have similar values and change continuously. It is also possible to determine whether an area is an object.
TABLE 3. The result of the region evaluation model on scenes different from the training dataset.
FIGURE 7. Result of applying region evaluation model on a different dataset (a) object case (b) non-object case.
As a reinforcement learning method to learn an object detection model, the A2C (Advantage Actor-Critic) and PPO (Proximal Policy Optimization) methods among policy-based reinforcement learning algorithms are applied and compared.

A. OVERVIEW OF ADVANTAGE ACTOR-CRITIC (A2C) ALGORITHM
In this section, we briefly explain the advantage actor-critic (A2C) algorithm for the completeness of the paper. In the standard reinforcement learning setting, an agent interacts with an environment ε over several discrete time steps. The agent selects an action a_t at a state s_t by policy π(a | s), where π is a mapping from states s_t to actions a_t. In the discrete case, an action a_t is chosen from the set of possible actions A. After the agent takes the action, it arrives at the next state s_{t+1} and receives a scalar reward r_t. The process continues until the agent reaches a terminal state. The return, defined as the total accumulated reward R_t from time step t, is as follows.

R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}, \quad \gamma \in (0, 1]   (1)

γ is the discounting factor. The agent wants to maximize the expected return from each state s_t. Policy-based model-free methods directly parameterize the policy π(a | s; θ) and find the parameters θ that maximize E[R_t].
The REINFORCE family of algorithms can be adopted [33]. Typical REINFORCE updates the policy parameters θ in the direction of ∇_θ log π(a_t | s_t; θ) R_t, which gives an unbiased estimate of ∇_θ E[R_t]. However, it has a high variance. It is possible to reduce the variance of this estimate while keeping it unbiased by subtracting a learned function of the state b_t(s_t), known as a baseline [33], from the return. The resulting gradient is as follows.

\nabla_\theta \log \pi(a_t \mid s_t; \theta) \, (R_t - b_t(s_t))   (2)

A learned estimate of the value function V^π(s_t) is commonly used as the baseline, leading to a much lower variance estimate of the policy gradient. The quantity R_t − V(s_t) used to scale the policy gradient can then be seen as an estimate of the advantage of action a_t at state s_t.
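To make Eq. (2) concrete, the following is a generic, self-contained sketch of a policy-gradient update with a learned value baseline, in the spirit of advantage actor-critic. It is not the authors' implementation; the toy PyTorch networks, the state and action sizes, and the single-step return are illustrative assumptions.

# Generic sketch of Eq. (2) with a learned value baseline (advantage actor-critic style).
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 2 * action_dim))
critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

state = torch.randn(1, state_dim)                      # s_t
mu, log_sigma = actor(state).chunk(2, dim=-1)          # Gaussian policy parameters
dist = torch.distributions.Normal(mu, log_sigma.exp())
action = dist.sample()                                 # a_t ~ pi(. | s_t; theta)
log_prob = dist.log_prob(action).sum(-1)               # log pi(a_t | s_t; theta)

R = torch.tensor([1.0])                                # return R_t (dummy value for the sketch)
value = critic(state).squeeze(-1)                      # baseline b_t(s_t) = V(s_t)
advantage = R - value                                  # R_t - V(s_t)

opt.zero_grad()
policy_loss = -(log_prob * advantage.detach()).mean()  # ascends the gradient in Eq. (2)
value_loss = advantage.pow(2).mean()                   # fit the baseline to the return
(policy_loss + value_loss).backward()
opt.step()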
FIGURE 8. Model training results using PPO (a) object evaluation average (b) reward average considering the number of objects.
FIGURE 9. Model training results using A2C (a) object evaluation average (b) reward average considering the number of objects.
µ(s_t; θ) and σ(s_t; θ) are the mean and standard deviation of the Gaussian distribution. The object detection model is required to detect objects of various sizes on an image. The object's location on an image is expressed as a continuous value between 0 and 1 through normalization.
Figure 1 shows the proposed actor model structure. It uses an image as input and produces location information for multiple candidate objects on the image. The actor model structure in Figure 1 was designed by referencing the DETR [2] structure. Each object query in the decoder corresponds to one object candidate area, and we used 100 queries in the experiments. A fully connected layer is applied to each query passed through the decoder. It produces three types of information: the mean and standard deviation related to the location of objects and the probability of being an object.
In Figure 1, the output structure is designed to have a fixed number of outputs. The mean and standard deviation of the Gaussian distribution are used for continuous object locations. The final object position is determined by sampling. At the same time, the model outputs whether the corresponding area is an object. If we only used the location information of the object candidate areas as output, all location information would have to be evaluated, which takes much time to learn. To solve this problem, the actor model was designed to also produce an output for the probability of whether the candidate area is an object.
Figure 2 shows the proposed critic model structure, which plays the role of V(s_t; θ_v) in Eq. (4). The proposed critic model consists of fully connected layers, and details are shown in Figure 2. The critic model produces an output for each candidate object region produced by the actor model.
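As an illustration of the output structure described above, the following PyTorch-style sketch shows per-query heads producing a Gaussian mean and standard deviation for the box location, an objectness output O_c, and a fully connected value head. It assumes a DETR-style decoder that yields one embedding per object query; the embedding size (d_model = 256), the four-value box encoding, and the layer sizes are assumptions, since the details shown in Figures 1 and 2 are not reproduced in this text.

import torch
import torch.nn as nn

class ActorHeads(nn.Module):
    """Per-query output heads: Gaussian box location (mean, std) and objectness O_c."""
    def __init__(self, d_model=256):
        super().__init__()
        self.mean_head = nn.Linear(d_model, 4)   # box mean (normalized to [0, 1])
        self.std_head = nn.Linear(d_model, 4)    # box standard deviation
        self.obj_head = nn.Linear(d_model, 2)    # [background, object] scores (O_c)

    def forward(self, queries):                  # queries: [batch, 100, d_model]
        mu = torch.sigmoid(self.mean_head(queries))
        sigma = nn.functional.softplus(self.std_head(queries)) + 1e-4
        dist = torch.distributions.Normal(mu, sigma)
        boxes = dist.sample().clamp(0.0, 1.0)    # sampled continuous box locations
        return boxes, mu, sigma, self.obj_head(queries)

class CriticHead(nn.Module):
    """Fully connected value head: one value estimate per candidate region."""
    def __init__(self, d_model=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, 1))

    def forward(self, queries):
        return self.mlp(queries).squeeze(-1)     # [batch, 100] value estimates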
C. THE PROPOSED REWARD MODEL
The critic model is required to correctly evaluate the object candidate regions produced by the actor model. The ground truth of object locations is generally used when evaluating an object detection model. Because the proposed method uses images and the number of objects on an image for training, there is no location information to evaluate the object detection results.
To solve this problem, we propose a method based on reinforcement learning. The actor model generates candidate areas of objects, and a pre-trained evaluation model evaluates them. Finally, the result of the evaluation model is used to determine the reward. Reward terms can be easily defined in, for example, the posture control of a biped robot. However, it is harder to determine reward terms for object detection. We introduce a model that evaluates candidate regions of objects, and the model's output is used to determine the reward value.
Figure 3 shows the proposed structure of the evaluation model. Each candidate region of an object is resized to a fixed size, and a pre-trained ResNet-101 is used for feature extraction. The final output produces a probability value between 0 and 1, referred to as O_e. The specific process for determining the reward value is as follows.

O_e = f_{eval}(O_s)   (6)

E_o = \begin{cases} O_e, & \arg\max(O_c) \neq 0 \\ 1 - O_e, & \arg\max(O_c) = 0 \end{cases}   (7)

O_s represents the image area corresponding to the bounding box presented by the actor model. f_{eval} represents the evaluation model shown in Figure 3. O_e represents the result of the evaluation model using O_s as input. In the actor model shown in Figure 1, each token has two output values for object classification after going through a fully connected layer. This output value is referred to as O_c. In O_c, the first output value indicates whether it is background, and the second output value indicates whether it is an object. Eq. (7) shows the correlation between the presence of an object in the actor model and the output of the object evaluation model. It is configured to match the object-guessing ability of the actor model with the evaluation ability of the object evaluation model.
In Eq. (7), if the actor model considers the candidate area to be an object, the output of the object evaluation model is used. If the index of the maximum value of O_c is 0, the actor model has determined that the candidate area is not an object, and the output value of the object evaluation model is used in reverse. The final output of Eq. (7) is denoted as E_o. It is used to compute reward values.
The object evaluation model in Figure 3 was trained using an existing object detection dataset. Object detection datasets provide object locations and class types for each image. Only object location information was used to train the proposed region evaluation model; object type information was not used. Training data for the model was created through the following process. Figure 4 shows the process of training data generation. Based on the location information of an object, the corresponding region is cropped and considered an object.
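The following sketch illustrates Eqs. (6) and (7): each candidate box is cropped and resized, scored by the evaluation model to obtain O_e, and combined with the actor's objectness output O_c to obtain E_o. The helper names, the corner-format boxes, the crop size, and the assumption that eval_model returns a single probability per crop are ours, not the paper's.

import torch
import torch.nn.functional as F

def compute_Eo(image, boxes, obj_logits, eval_model, crop_size=64):
    """image: [C, H, W]; boxes: [N, 4] normalized corners (x1, y1, x2, y2);
    obj_logits: [N, 2] actor output O_c as (background, object)."""
    H, W = image.shape[-2:]
    scale = torch.tensor([W, H, W, H], dtype=torch.float32)
    Eo = []
    for box, logit in zip(boxes, obj_logits):
        x1, y1, x2, y2 = (box * scale).long().tolist()
        x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)                # avoid empty crops
        crop = image[..., y1:y2, x1:x2].unsqueeze(0)             # O_s, shape [1, C, h, w]
        crop = F.interpolate(crop, size=(crop_size, crop_size))  # resize to a fixed size
        Oe = float(eval_model(crop))                             # Eq. (6): O_e = f_eval(O_s)
        Eo.append(Oe if logit.argmax().item() != 0 else 1.0 - Oe)  # Eq. (7)
    return Eo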
FIGURE 10. Changes in image detection results according to training progress (a) original images and ground-truth labels of objects (b) epoch 15 (c) epoch 50 (d) epoch 97 ((b)–(d) left: actor output results which display all 100 boxes regardless of object probability; right: model detection results after non-maximum suppression).
Non-object areas were created by moving the region horizontally based on its location. The horizontal movement size was randomly selected from a value between 1/4 and 3/4 of the area size, and an area created in this way was considered not an object.
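A minimal sketch of this training-data generation step is shown below: the ground-truth box is cropped as an object sample, and a horizontally shifted copy of the same box is cropped as a non-object sample. The helper name, the PIL-based cropping, and the random choice of shift direction are assumptions; the 1/4 to 3/4 range follows the text.

import random
from PIL import Image

def make_pair(image: Image.Image, box):
    """box: (x1, y1, x2, y2) in pixels, taken from an existing detection dataset label."""
    x1, y1, x2, y2 = box
    w = x2 - x1
    positive = image.crop((x1, y1, x2, y2))                 # object sample

    shift = random.uniform(0.25, 0.75) * w                  # 1/4 to 3/4 of the region size
    if random.random() < 0.5:                               # shift direction (assumption)
        shift = -shift
    nx1 = int(min(max(x1 + shift, 0), image.width - w))     # keep the shifted box in the image
    negative = image.crop((nx1, y1, nx1 + w, y2))           # non-object sample
    return positive, negative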
FIGURE 11. Object detection results (a) proposed algorithm (b) ground-truth.
FIGURE 12. Example of additional object detection results (a) proposed algorithm (b) ground-truth (In (a), purple rectangles correspond to objects where no labels are assigned in the original dataset).
The object evaluation model only decides whether a region is an object. It does not determine what class it is.
Figure 5 shows the results of the trained object evaluation model. Figure 5(a) shows a case where the object evaluation model considers the region an object. Figure 5(b) shows a case where the object evaluation model considers it not an object.
Object detection models generally have a structure that takes images as input and detects multiple objects existing on the images. Evaluating all candidate areas produced by the model is necessary during the training of object detection models. We reflect the number of objects in each image when computing the reward.
FIGURE 13. Example of misdetection results (a) proposed algorithm (b) ground-truth (In (a), red rectangles correspond to misdetection).
TABLE 5. Results of applying models trained using the COCO dataset or the PASCAL dataset to the PASCAL dataset.
Figure 6 shows the object selection process among candidate regions generated by the actor model. Object probability and degree of overlap are used in the decision. First, object candidates are created based on the output of the evaluation model. We investigate overlap among candidate regions and choose the area with the largest objectness value. The ratio of overlap is determined by IoU (Intersection over Union). We use 0.75 as the threshold. The threshold for judging objects is 0.5.
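The selection step can be sketched as a greedy suppression over the candidates: keep boxes whose objectness exceeds 0.5, then discard any box whose IoU with an already selected box exceeds 0.75. Greedy processing in descending objectness order is an assumption about the exact procedure; the thresholds follow the text.

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def select_objects(boxes, objectness, obj_thresh=0.5, iou_thresh=0.75):
    """boxes: list of (x1, y1, x2, y2); objectness: matching list of scores."""
    cands = sorted([(s, b) for s, b in zip(objectness, boxes) if s > obj_thresh],
                   key=lambda t: t[0], reverse=True)
    kept = []
    for score, box in cands:
        if all(iou(box, k) <= iou_thresh for _, k in kept):
            kept.append((score, box))
    return kept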
FIGURE 14. Model training results using PPO in a new environment (a) object evaluation average (b) reward average considering the number of objects.
The error was calculated based on the number of known objects and the number of objects detected by the actor, and it is used for computing rewards.

r_e = 1 - \frac{n - \sum E_o}{n}   (8)

n is the number of objects on an image. r_e is a reward term that reflects the difference between the known number of objects and the detected number of objects on an image.
Table 1 shows the proposed reward value configuration reflecting object region evaluation. We assign different reward values according to the value range of E_o in Eq. (7). The analysis is divided into three cases, and we select the best one by experiment. The final reward consists of two terms. One is the reward value computed for each candidate object region following Table 1. The other is the reward value calculated by reflecting the detected number of objects.

r_o = \sum_{i=1}^{n} r(E_{o_i})   (9)

r_f = \lambda_1 r_o + \lambda_2 r_e   (10)

n is the number of objects detected in the image. E_{o_i} means the value of the i-th detected object in the image in Eq. (7). r_o represents the reward sum over all candidate object regions. r_e is the reward value considering the total number of objects, shown in Eq. (8). λ_1 and λ_2 are the weights of the two reward terms, and we used 1.0 and 1.5.
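A minimal sketch of Eqs. (8)–(10) is given below. The per-region reward r(E_o) follows Table 1, whose exact value ranges are not reproduced in this text, so table1_reward is left as a labeled placeholder; the weights 1.0 and 1.5 follow the text.

def table1_reward(Eo: float) -> float:
    """Placeholder for Table 1: maps a value range of E_o to a reward value."""
    raise NotImplementedError("Table 1 value ranges are not reproduced here")

def final_reward(Eo_list, n_known, lam1=1.0, lam2=1.5):
    """Eo_list: E_o of each detected candidate (Eq. 7); n_known: labeled object count."""
    r_o = sum(table1_reward(Eo) for Eo in Eo_list)            # Eq. (9)
    r_e = 1.0 - (n_known - sum(Eo_list)) / n_known            # Eq. (8)
    return lam1 * r_o + lam2 * r_e                            # Eq. (10)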
FIGURE 15. Model training results using A2C in a new environment (a) object evaluation average (b) reward average considering the number of objects.
IV. EXPERIMENTAL RESULTS
A. RESULTS OF OBJECT EVALUATION MODEL
The COCO dataset [37] was used to train the object evaluation model of Figure 3. All 80 object classes in the COCO dataset were used when generating training data. We generated 502,606 object images from 118,287 training images. The same number (502,606) of non-object images was generated from the same dataset. The validation dataset was generated using 5,000 validation images. In total, 1,005,212 images are used for training the object evaluation model, and 42,737 images are used for validation during training.
Table 2 shows the object evaluation model's training results. If the output value exceeds 0.5, the region is considered an object; in the opposite case, it is considered not an object. Table 2 also shows the accuracy on the training and validation datasets. Resizing to a fixed size gives better results than using the original image size.
Table 3 shows the results of applying the region evaluation model trained using the COCO dataset to the PASCAL VOC dataset [38]. The input image is resized to a fixed size. Figure 7 shows some results of applying the trained region evaluation model with the COCO dataset to images of PASCAL VOC.

B. RESULTS IN THE SAME ENVIRONMENT AS TRAINING
This section analyzes detection performance in the same environment as training.
Figure 8 and Figure 9 show changes in reward values during training by the PPO and A2C algorithms, respectively. Figures 8(a) and 9(a) show changes in the output value of the object evaluation model during training. We can notice that the output value of the object evaluation model increases as training progresses for both the PPO and A2C algorithms. This indicates that the performance of the actor model improves as training progresses. Figures 8(b) and 9(b) show the change in reward value considering the number of objects according to training.
FIGURE 16. Variance changes during training in a new environment (a) training results by PPO (b) training results by A2C.
In the case of the PPO algorithm, it can be seen that a region evaluation model trained using input data converted to a fixed size provides higher training performance than the region evaluation model trained with the original image size. In the case of the A2C algorithm, this tendency is not visible, and both instances show similar performance. In both cases, the reward value increases as training progresses, showing that the actor model generates more accurate object candidates. As training goes on, the output value of the object evaluation model increases, which indirectly indicates that the proposed method is valid.
Table 4 shows the results of the proposed method for object detection. The proposed method determines whether a region is an object without identifying the class type. This is similar to the first stage of the region proposal network in a two-stage object detection algorithm. The evaluation used only object location information to compare performance with existing object detection methods without using the object's class type information.
In PPO and A2C, an object evaluation model trained using input data resized to a fixed size performs better than when it is not used. Additionally, using the number of objects in the reward computation improves results.
Figure 10 shows the change in image detection results according to the model's training process. In this case, we use the model that gives the best results in Table 4 (PPO method training, object area image evaluation using adjusted images, compensation case No. 2). As training progresses, the output of the actor model and the resulting object detection results are shown. We notice that the candidate objects become more accurate as training goes on.
Figure 11 shows the object detection results of the model that showed the best results in Table 4 (PPO method training, object area image evaluation using adjusted images, compensation case No. 2).
FIGURE 17. Changes in image detection results according to training progress during additional training in a new environment (a) original images and ground-truth labels of objects (b) epoch 0 (c) epoch 15 (d) epoch 95 ((b)–(d) left: actor output results which display all 100 boxes regardless of object probability; right: model detection results after non-maximum suppression).
Reinforcement learning was used, requiring only images and the number of objects on images as labels, without using objects' location information. The proposed method shows results comparable to object detection methods trained using objects' location information.
Figure 12 shows detection results of the proposed method in which it detects additional objects that do not exist in the ground-truth labels. They are displayed in purple, and it is reasonable to consider them objects. Specifically, it detects objects not included in the 80 types of classes, which can be viewed as an advantage. Figure 13 shows results of incorrect object detection. In Figure 13(a), red rectangles represent misdetections; most correspond to small objects.

C. RESULTS IN THE UNSEEN ENVIRONMENTS
We apply a model trained on one dataset to a different dataset to demonstrate the proposed method's generalization ability. We further divide the investigation into two cases. In the first case, we additionally train using the different dataset. In the second case, we do not use the different dataset in training, which corresponds to a zero-shot test.
Table 5 shows the results when the proposed model trained using the COCO dataset is applied to the COCO and PASCAL datasets. We can notice that the performance decreases when no additional training is done on the new dataset. However, performance improvement can be seen when we train the proposed model with only images and the number of objects on images in the new environment. Table 5 uses the region evaluation model trained with images converted into a fixed size. The PPO and A2C algorithms show performance improvement when training using images in a new environment.
Figures 14 and 15 show the results of the proposed algorithm trained in a new environment using PPO and A2C, respectively. In Figure 14 and Figure 15, tuning indicates using pre-trained results from the existing environment as initial values. When we train the model from scratch using images of the new environment, it is denoted as train.
Figures 14(a) and 15(a) show the mean evaluation values by the object evaluation model. Figures 14(b) and 15(b) show the average reward value considering the number of objects. When applied to a new environment, using the values trained in the existing environment as initial values provides improved results compared to not using them. The above results indicate that the proposed method trained with a large dataset can be used as an initial model when applied to a new environment.
Additionally, the proposed method shows some generalization ability based on the fact that it gives improved scores in unseen environments after additional training with images in the new environment. However, additional training in new environments requires further investigation. Figure 16 shows variance changes during training. We compare results according to whether we use the results trained with the COCO dataset as initial values. When we use the trained results with the COCO dataset as initial values, the variance shows high values compared to the case of training without using the COCO-trained results as initial values. In the graph, the amplitude of variance when we finetune the model takes a high value for an extended period compared to when we retrain the model.
Figure 17 shows detection results according to epochs when we train the model in the new environment. Figure 17(a) shows the original image and ground-truth bounding boxes of objects. Figure 17(b) shows the detection results of the proposed algorithm at epoch 0, which corresponds to zero shot. We can notice that the model can detect objects in new environments without additional training. But, after retraining using images of the new environment, we can obtain better performance, as shown in Figure 17(d). We can notice the improved results by comparing detection results in Figures 17(a) and 17(d).

V. CONCLUSION
In object detection by supervised learning, the class type and the location of the object are used as labels for each training image. The same labels are required to detect objects in an environment different from the training environment. In this paper, we proposed a reinforcement learning-based object detection method that only requires images and the number of objects on images as labels. A transformer-based object proposal model, an evaluation model for the proposed candidate areas, and a reward configuration are proposed. The model for evaluating the presented object candidate area was trained based on supervised learning, using an existing object detection dataset. Experimental results show that the proposed algorithm can cope with unseen environments using only labels of images and the number of objects on images.
However, the proposed method has the following areas for improvement. The object evaluation model is trained using an existing object detection dataset in supervised learning. Additionally, the proposed method only differentiates whether a detected region is an object. For future research, we plan to extend the proposed algorithm in two ways. First, we will consider a direction that does not require the number of objects as a label. Second, we want to add classification of object types to the proposed method.

REFERENCES
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[2] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 213–229.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[4] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Sep. 2013.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 346–361.
[6] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[7] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2155–2162.
[8] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015.
[9] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6517–6525.
[10] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21–37.
[12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[13] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable DETR: Deformable transformers for end-to-end object detection," 2020, arXiv:2010.04159.
[14] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang, "Conditional DETR for fast training convergence," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3631–3640.
[15] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang, "DAB-DETR: Dynamic anchor boxes are better queries for DETR," 2022, arXiv:2201.12329.
[16] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, "DN-DETR: Accelerate DETR training by introducing query DeNoising," 2022, arXiv:2203.01305.
[17] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, "DINO: DETR with improved DeNoising anchor boxes for end-to-end object detection," 2022, arXiv:2203.03605.
[18] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, "Transformers in vision: A survey," 2021, arXiv:2101.01169.
[19] E. Arkin, N. Yadikar, X. Xu, A. Aysa, and K. Ubul, "A survey: Object detection methods from CNN to transformer," Multimedia Tools Appl., vol. 82, no. 14, pp. 21353–21383, Jun. 2023.
[20] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao, "A survey on vision transformer," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 87–110, Jan. 2023.
[21] Y. Li, N. Miao, L. Ma, F. Shuang, and X. Huang, "Transformer for object detection: Review and benchmark," Eng. Appl. Artif. Intell., vol. 126, Nov. 2023, Art. no. 107021.
[22] A. M. Rekavandi, S. Rashidi, F. Boussaid, S. Hoefs, E. Akbas, and M. Bennamoun, "Transformers in small object detection: A benchmark and survey of state-of-the-art," 2023, arXiv:2309.04902.
[23] J. C. Caicedo and S. Lazebnik, "Active object localization with deep reinforcement learning," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2488–2496.
[24] S. Mathe, A. Pirinen, and C. Sminchisescu, "Reinforcement learning for visual object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2894–2902.
[25] A. Pirinen and C. Sminchisescu, "Deep reinforcement learning of region proposal networks for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6945–6954.
[26] B. Uzkent, C. Yeh, and S. Ermon, "Efficient object detection in large images using deep reinforcement learning," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 1813–1822.
[27] D. Zhang, J. Han, L. Zhao, and T. Zhao, "From discriminant to complete: Reinforcement searching-agent learning for weakly supervised object detection," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 12, pp. 5549–5560, Dec. 2020.
[28] S. Liu, D. Huang, and Y. Wang, "Pay attention to them: Deep reinforcement learning-based cascade object detection," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 7, pp. 2544–2556, Jul. 2020.
[29] D. Zhang, J. Han, G. Cheng, and M.-H. Yang, "Weakly supervised object localization and detection: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 9, pp. 5866–5885, Sep. 2022.
[30] Z. Wu, C. Liu, J. Wen, Y. Xu, J. Yang, and X. Li, "Selecting high-quality proposals for weakly supervised object detection with bottom-up aggregated attention and phase-aware loss," IEEE Trans. Image Process., vol. 32, pp. 682–693, 2023.
[31] X. Feng, X. Yao, H. Shen, G. Cheng, B. Xiao, and J. Han, "Learning an invariant and equivalent network for weakly supervised object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 10, pp. 11977–11992, May 2023.
[32] F. Fang, W. Liang, Y. Cheng, Q. Xu, and J.-H. Lim, "Enhancing representation learning with spatial transformation and early convolution for reinforcement learning-based small object detection," IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 1, pp. 315–328, Jan. 2024.
[33] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Mach. Learn., vol. 8, nos. 3–4, pp. 229–256, May 1992.
[34] T. Degris, P. M. Pilarski, and R. S. Sutton, "Model-free reinforcement learning with continuous action in practice," in Proc. Amer. Control Conf. (ACC), Jun. 2012, pp. 2177–2182.
[35] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602.
[36] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[37] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. Lawrence Zitnick, and P. Dollár, "Microsoft COCO: Common objects in context," 2014, arXiv:1405.0312.
[38] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[39] K.-H. Choi and J.-E. Ha, "Object detection using policy-based reinforcement learning," in Proc. 23rd Int. Conf. Control, Autom. Syst. (ICCAS), Oct. 2023.

KEONG-HUN CHOI received the B.S. and M.E. degrees in mechanical and automotive engineering from Seoul National University of Science and Technology, Seoul, South Korea, in 2019 and 2021, respectively, where he is currently pursuing the Ph.D. degree with the Graduate School of Automotive Engineering. His research interests include object detection, semantic segmentation, and scene understanding.

JONG-EUN HA received the B.S. and M.E. degrees in mechanical engineering from Seoul National University, Seoul, South Korea, in 1992 and 1994, respectively, and the Ph.D. degree in mechanical engineering from KAIST, Daejeon, South Korea, in 2000. From February 2000 to August 2002, he was with Samsung Corning, where he developed an algorithm for a machine vision system. From 2002 to 2005, he worked in multimedia engineering at Tongmyong University. Since 2005, he has been a Professor with the Department of Mechanical and Automotive Engineering, Seoul National University of Science and Technology. His research interests include deep learning, intelligent robots, and vehicles.