KR102143034B1

KR102143034B1 - Method and system for tracking object in video through prediction of future motion of object

Info

Publication number: KR102143034B1
Application number: KR1020180167361A
Authority: KR
Inventors: 정성균; 김경래; 김창수
Original assignee: 네이버랩스 주식회사; 고려대학교 산학협력단
Priority date: 2018-12-21
Filing date: 2018-12-21
Publication date: 2020-08-10
Also published as: KR20200077942A

Abstract

Disclosed are a method and system for tracking an object in a video through prediction of the object's future motion. The object tracking method includes: extracting, from a still image corresponding to one frame of a video, features of an object included in the still image and of the surrounding information of the object; predicting a future motion of the object from the extracted features; and tracking the object by generating a search area in a next frame based on the prediction result of the future motion.

Description

Method and system for tracking object in video through prediction of future motion of object

The description below relates to a technology for tracking objects in a video.

The understanding and representation of object motion is being studied in many vision and video problems, such as optical flow, action recognition, future frame prediction, and video compression.

For example, Korean Patent Application Publication No. 10-2012-0106279 (published on September 26, 2012) discloses a technology for predicting motion in an image signal based on block matching.

Most existing technologies model motion by finding the region that minimizes the difference in pixel values between adjacent frames. This model is an approximation: it exists because most image acquisition equipment is a frame-based camera, and it relies on the brightness consistency constraint, under which the pixels of the same object have similar values. It therefore falls short of human cognitive ability.

Humans, on the other hand, quickly and accurately predict the future motion of an object even from a still image. This can be attributed to the fact that humans comprehensively assess information about the object and its surroundings when perceiving motion.

Meanwhile, most existing tracking technologies perform an exhaustive search of the area surrounding the object detection result and determine that the object has moved to the position with the highest matching score.

Because existing tracking technologies predict the motion of an object by analyzing adjacent frames, tracking performance (in terms of matching accuracy, speed, and so on) varies with the frame rate of the input video, and stable tracking results cannot be guaranteed when the frame rate of the input video drops abruptly, for example due to communication delay.

Provided are a method and system capable of stably tracking an object by predicting the future motion of the object from a single still image.

Provided are a method and system capable of preventing degradation of tracking performance and ensuring stable tracking for a video with a significantly low frame rate or when a frame drop occurs.

Provided are a method and system capable of efficiently reducing the object search area by predicting the future motion of an object in one frame, which is a single still image, and searching for the object in the next frame based on the prediction result.

Provided is an object tracking method performed in a computer system, the method including: extracting, from a still image corresponding to one frame of a video, features of an object included in the still image and of the surrounding information of the object; predicting a future motion of the object from the extracted features; and tracking the object by generating a search area in a next frame based on the prediction result of the future motion.

According to one aspect, the predicting may include predicting an instance-level future motion in which at least one of a direction, a speed (velocity), and an action of the object is treated as a class, and the performing may include tracking the object by preferentially searching, in the next frame, a region with a high movement probability corresponding to at least one of the direction, speed, and action predicted as the future motion of the object.

According to another aspect, the performing may include reducing the search area according to the speed predicted as the future motion of the object.

According to another aspect, the performing may include determining a sector-shaped (fan-shaped) search area corresponding to the direction predicted as the future motion of the object.

According to another aspect, the performing may include tracking the object after compensating for background movement around the object.

According to another aspect, the predicting may include predicting a group direction for each cluster by clustering the directions predicted as the future motion of each object included in the still image.
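The group-direction aspect can be illustrated with a minimal sketch. The specification does not state the clustering criterion, so the sketch assumes that objects are grouped by spatial proximity (single linkage with a hypothetical `max_dist` threshold in pixels) and that the group direction is the majority vote of the per-object predicted directions. The function name `group_directions` and its input layout are illustrative only.

```python
from collections import Counter


def group_directions(objects, max_dist=50.0):
    """Cluster objects and vote a group direction per cluster.

    objects: list of (x, y, direction_label) tuples, one per detected object.
    max_dist: hypothetical linkage threshold in pixels (an assumption).
    Returns a list of (member_indices, group_direction) tuples.
    """
    n = len(objects)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of objects that lie within max_dist of each other.
    for i in range(n):
        for j in range(i + 1, n):
            xi, yi, _ = objects[i]
            xj, yj, _ = objects[j]
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= max_dist ** 2:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    result = []
    for members in clusters.values():
        # Majority vote over the predicted direction labels of the members.
        votes = Counter(objects[i][2] for i in members)
        result.append((members, votes.most_common(1)[0][0]))
    return result
```

Other criteria, such as clustering directly on the predicted direction angles, would fit the same description equally well.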

According to another aspect, the extracting may include extracting the features through a learning model trained by integrating objects included in training images with the surrounding information of the objects.

According to another aspect, the extracting may include extracting the features through a learning model trained on images in which at least one attribute among a direction, a speed, and an action representing a future motion is assigned to each object instance as ground-truth data.

According to another aspect, the extracting may include extracting, from the still image, an object feature for the object region, a local feature for the surrounding region including the object, and a global feature for the entire image region through a plurality of region of interest (RoI) pooling layers constituting a learning model.

According to another aspect, the predicting may include predicting at least one of the direction, speed, and action representing the future motion of the object by passing the output of the RoI pooling layers through a fully connected layer and a softmax layer constituting the learning model.
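The fully connected layer followed by a softmax layer can be sketched in plain Python. This is only an illustration of the generic operations: the weights, bias, feature length, and the function names `fully_connected`, `softmax`, and `classify` are assumptions, not values or names taken from the specification.

```python
import math


def fully_connected(features, weights, bias):
    """One dense layer: one output logit per weight row."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, bias, labels):
    """Return the most probable class label and the full distribution."""
    probs = softmax(fully_connected(features, weights, bias))
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs
```

In a model with three outputs, one such head would be used for each of direction, speed, and action, all fed by the same pooled feature vector.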

Provided is a computer-readable recording medium on which a program for causing a computer to execute the object tracking method is recorded.

Provided is a computer system including at least one processor implemented to execute computer-readable instructions, wherein the at least one processor includes: a feature extraction unit that extracts, from a still image corresponding to one frame of a video, features of an object included in the still image and of the surrounding information of the object; a motion prediction unit that predicts a future motion of the object from the extracted features; and an object tracking unit that tracks the object by generating a search area in a next frame based on the prediction result of the future motion.

According to embodiments of the present invention, an object can be stably tracked by predicting the future motion of the object from a single still image.

According to embodiments of the present invention, because the future motion of an object can be predicted from a single still image, degradation of tracking performance can be prevented and stable tracking can be ensured even for a video with a significantly low frame rate or when a frame drop occurs.

According to embodiments of the present invention, the object search area can be efficiently reduced by predicting the future motion of an object in one frame, which is a single still image, and searching for the object in the next frame based on the prediction result.

FIG. 1 is a block diagram illustrating an example of the internal configuration of a computer system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of components that a processor of a computer system according to an embodiment of the present invention may include.
FIG. 3 is a flowchart illustrating an example of an object tracking method that a computer system according to an embodiment of the present invention may perform.
FIG. 4 is a diagram illustrating an example of an image usable as a training image in an embodiment of the present invention.
FIG. 5 illustrates examples of speed classes and action classes for motion in an embodiment of the present invention.
FIG. 6 illustrates a DNN structure for future motion prediction in an embodiment of the present invention.
FIG. 7 shows a table describing the classification performance of a DNN that includes the MCRoI pooling layer.
FIG. 8 is an exemplary diagram illustrating a process of generating a search area for object tracking using a future motion prediction result in an embodiment of the present invention.
FIG. 9 illustrates an example of a crowd analysis result obtained using a future motion prediction result in an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Embodiments of the present invention relate to a technology for tracking an object in a video by predicting the future motion of the object from a single still image.

Embodiments, including those specifically disclosed in this specification, can stably track an object even in a video with a significantly low frame rate or when a frame drop occurs, and thereby achieve significant advantages in terms of efficiency, accuracy, speed, cost reduction, and the like.

FIG. 1 is a block diagram illustrating an example of the internal configuration of a computer system according to an embodiment of the present invention. For example, an object tracking system according to embodiments of the present invention may be implemented through the computer system 100 of FIG. 1. As shown in FIG. 1, the computer system 100 may include, as components for executing the object tracking method, a processor 110, a memory 120, a persistent storage device 130, a bus 140, an input/output interface 150, and a network interface 160.

The processor 110, as a component for tracking an object in a video, may include or be part of any device capable of processing a sequence of instructions. The processor 110 may include, for example, a computer processor, a processor in a mobile device or other electronic device, and/or a digital processor. The processor 110 may be included in, for example, a server computing device, a server computer, a series of server computers, a server farm, a cloud computer, a content platform, or the like. The processor 110 may be connected to the memory 120 through the bus 140.

The memory 120 may include volatile memory, persistent memory, virtual memory, or other memory for storing information used by or output by the computer system 100. The memory 120 may include, for example, random access memory (RAM) and/or dynamic RAM (DRAM). The memory 120 may be used to store arbitrary information, such as state information of the computer system 100. The memory 120 may also be used to store instructions of the computer system 100, including, for example, instructions for object tracking. The computer system 100 may include one or more processors 110 as needed or appropriate.

The bus 140 may include a communication infrastructure that enables interaction between the various components of the computer system 100. The bus 140 may carry data between components of the computer system 100, for example between the processor 110 and the memory 120. The bus 140 may include wireless and/or wired communication media between components of the computer system 100, and may include parallel, serial, or other topology arrangements.

The persistent storage device 130 may include components, such as memory or another persistent storage device, used by the computer system 100 to store data for an extended period (for example, compared to the memory 120). The persistent storage device 130 may include non-volatile main memory as used by the processor 110 in the computer system 100. The persistent storage device 130 may include, for example, flash memory, a hard disk, an optical disk, or another computer-readable medium.

The input/output interface 150 may include interfaces to a keyboard, a mouse, a voice command input, a display, or other input or output devices. Configuration commands and/or input for object tracking may be received through the input/output interface 150.

The network interface 160 may include one or more interfaces to networks such as a local area network or the Internet. The network interface 160 may include interfaces for wired or wireless connections. Configuration commands and/or input for object tracking may be received through the network interface 160.

In other embodiments, the computer system 100 may include more components than those shown in FIG. 1. However, most prior-art components need not be shown explicitly. For example, the computer system 100 may be implemented to include at least some of the input/output devices connected to the input/output interface 150 described above, or may further include other components such as a transceiver, a Global Positioning System (GPS) module, a camera, various sensors, and a database.

To emulate human-level motion prediction ability, the present invention proposes a learning model (for example, a deep neural network (DNN)) that integrates semantic information at various levels, and uses it to predict instance-level future motion from a single image.

This specification describes embodiments that predict the future motion of an object using a pedestrian as the representative object, but the invention is not limited thereto and may be applied to other types of objects such as automobiles and animals.

FIG. 2 is a diagram illustrating an example of components that a processor of a computer system according to an embodiment of the present invention may include, and FIG. 3 is a flowchart illustrating an example of an object tracking method that a computer system according to an embodiment of the present invention may perform.

As shown in FIG. 2, the processor 110 may include a feature extraction unit 210, a motion prediction unit 220, and an object tracking unit 230. These components of the processor 110 may be representations of different functions performed by the processor 110 according to control instructions provided by at least one program code. For example, the feature extraction unit 210 may be used as a functional representation of the processor 110 operating to control the computer system 100 so as to extract features of an object from a still image. The processor 110 and its components may perform steps S310 to S340 included in the object tracking method of FIG. 3. For example, the processor 110 and its components may be implemented to execute instructions according to the code of the operating system included in the memory 120 and the at least one program code described above. Here, the at least one program code may correspond to the code of a program implemented to process the object tracking method.

The steps of the object tracking method need not occur in the illustrated order; some steps may be omitted, and additional processes may be included.

In step S310, the processor 110 may load program code stored in a program file for the object tracking method into the memory 120. For example, the program file for the object tracking method may be stored in the persistent storage device 130 described with reference to FIG. 1, and the processor 110 may control the computer system 100 so that the program code is loaded through the bus from the program file stored in the persistent storage device 130 into the memory 120. Here, the processor 110 and each of the feature extraction unit 210, the motion prediction unit 220, and the object tracking unit 230 included in the processor 110 may be different functional representations of the processor 110 for executing the subsequent steps S320 to S340 by executing the corresponding portion of the program code loaded into the memory 120. To execute steps S320 to S340, the processor 110 and its components may directly process operations according to control instructions or may control the computer system 100.

The object tracking method according to the present invention may include: extracting an object region from a single still image, which is one frame of the input video, through object detection (S320); predicting the next direction, speed, and action of the object as predefined classes through future motion prediction for the object region (S330); and adaptively generating a search area in the next frame based on the future motion prediction result of the one frame and tracking the object within that area (S340).
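The control flow of steps S320 to S340 can be sketched as a simple loop. The four callables passed to `track_video` are hypothetical placeholders for the detector, the motion predictor, the search-area generator, and the matcher. The fallback to detection when matching fails is an assumption that the specification does not describe.

```python
def track_video(frames, detect, predict_motion, make_search_area, match):
    """Run detection, motion prediction, and search-area tracking per frame.

    detect(frame) -> box
    predict_motion(frame, box) -> predicted motion (direction, speed, action)
    make_search_area(box, motion) -> search area for the next frame
    match(frame, search_area) -> box found inside the area, or None
    """
    tracks = []
    search_area = None
    for frame in frames:
        box = None
        if search_area is not None:
            # S340: look for the object only inside the predicted area.
            box = match(frame, search_area)
        if box is None:
            # S320: first frame, or (as an assumption) recovery after a miss.
            box = detect(frame)
        # S330: predict the future motion from this single still frame.
        motion = predict_motion(frame, box)
        # Prepare the search area that the next frame will use.
        search_area = make_search_area(box, motion)
        tracks.append(box)
    return tracks
```

Because the search area for frame t+1 depends only on frame t, the loop makes no assumption about the time gap between consecutive frames.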

In step S320, the feature extraction unit 210 may receive a single still image corresponding to one frame of the video as an input image and extract features related to an object from the input image. In particular, the feature extraction unit 210 may extract features from blocks of various sizes in the input image, namely the object region, the surrounding region including the object (local context), and the entire image region (global context), and then combine all three kinds of extracted features.

In step S330, the motion prediction unit 220 may predict the future motion of the object from the features extracted in step S320. The motion prediction unit 220 may predict the future motion of the object as three separate kinds of motion: direction, speed (velocity), and action. The motion prediction unit 220 may predict, from a single image, an instance-level future motion with direction, speed, and action as classes. For example, direction may be divided into eight compass directions (N, E, S, W, NE, SE, SW, NW), which may be based on coordinates within the image. Speed may be divided into a plurality of levels, for example three levels: stop, slow, and fast. Action varies with the situation; when an autonomous driving system recognizes a pedestrian as the object, for example, the pedestrian's action may be divided into walking on the sidewalk (sidewalk), walking on a crosswalk (crosswalk), and jaywalking (jaywalk).
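As an illustration of the eight direction classes defined on image coordinates, the following sketch quantizes a displacement vector into one of the eight labels. It assumes the usual image convention in which y grows downward, so that N means "up" in the image, and it assumes equal 45-degree bins. Neither convention is stated in the specification, and `direction_class` is a hypothetical helper name.

```python
import math

# Labels ordered counter-clockwise starting from east, in 45-degree steps.
DIRECTIONS = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]
SPEEDS = ["stop", "slow", "fast"]
ACTIONS = ["sidewalk", "crosswalk", "jaywalk"]


def direction_class(dx, dy):
    """Map an image-space displacement (dx, dy) to one of eight directions.

    The image y axis is assumed to point down, so dy is negated to make
    N correspond to upward motion in the image.
    """
    angle = math.degrees(math.atan2(-dy, dx)) % 360.0
    # Shift by half a bin so that each label is centered on its angle.
    return DIRECTIONS[int((angle + 22.5) // 45) % 8]
```

For example, a displacement of `(1, 1)` (right and down in the image) maps to `SE` under these assumptions.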

In step S340, the object tracking unit 230 may generate a search area for the object based on the future motion predicted in step S330 and perform tracking within that area. Based on the future motion predicted from the single still image of the previous frame, the object tracking unit 230 may preferentially search the region to which the object is most likely to move in the next frame. By varying the pattern of search points for the object according to the future motion prediction result, the object tracking unit 230 can track the object stably without degrading speed performance.
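A motion-dependent search area can be sketched as a sector whose orientation follows the predicted direction and whose radius follows the predicted speed. The radius multipliers in `SPEED_RADIUS`, the 45-degree half angle, and the choice of a full disc for the stop class are all hypothetical values chosen for illustration, and the image y axis is assumed to point down.

```python
import math

# Center angle of each direction class, counter-clockwise from east.
DIRECTION_ANGLE = {"E": 0, "NE": 45, "N": 90, "NW": 135,
                   "W": 180, "SW": 225, "S": 270, "SE": 315}
# Hypothetical search radius per speed class, in units of object height.
SPEED_RADIUS = {"stop": 0.25, "slow": 1.0, "fast": 2.0}


def in_search_area(cx, cy, h, direction, speed, px, py, half_angle=45.0):
    """Return True if point (px, py) lies in the search sector.

    (cx, cy) is the object center in the current frame and h its height.
    """
    radius = SPEED_RADIUS[speed] * h
    dx, dy = px - cx, cy - py  # flip y so that "up" in the image is north
    dist = math.hypot(dx, dy)
    if dist > radius:
        return False
    if speed == "stop" or dist == 0:
        return True  # a stopped object is searched in a small disc
    angle = math.degrees(math.atan2(dy, dx))
    # Smallest absolute difference between the two angles, in [0, 180].
    diff = abs((angle - DIRECTION_ANGLE[direction] + 180) % 360 - 180)
    return diff <= half_angle
```

Candidate positions in the next frame can be filtered with this predicate before computing matching scores, which reduces the number of positions evaluated compared with an exhaustive search of the full neighborhood.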

Because the future motion of an object can be predicted from a single still image, the object can be tracked stably even in a video with a very low frame rate of 1 fps or less. When an autonomous vehicle, a robot, or the like perceives its surroundings, degradation of object tracking performance can be prevented even if a sudden frame drop occurs, and the behavior of surrounding objects can be predicted in advance.

The present invention can predict the future motion of an instance from a single image, with the direction, speed, and action of the object each treated as a class, and proposes a DNN model that integrates semantic information at various levels to perform this prediction.

Referring to FIG. 4, a human looking at a single still image 400 can predict, to some extent, the next movement of the object 401 included in the image 400. The present invention uses a DNN to implement this human-level ability to predict future motion.

Hereinafter, specific embodiments are described using the pedestrian instance, the object of greatest interest in many vision applications, as a representative example.

A future motion dataset may be collected by selecting images containing pedestrians from existing datasets or from image database systems on the Internet as training images. A bounding box annotation is applied to each pedestrian region included in a training image, and then the three attributes of future motion (that is, direction, speed, and action) are manually assigned to each pedestrian instance as ground-truth data.

The direction class may be defined as eight directions (N, E, S, W, NE, SE, SW, NW). Referring to FIG. 5, the speed class 510 may be defined as stop, slow, and fast, and the action class 520 may be defined as sidewalk walking, crosswalk walking, and jaywalking. This is only one example; the direction, speed, and action classes may be defined differently depending on the case. For each training image, the direction, speed, and action classes of each pedestrian instance are labeled manually.
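One possible in-memory layout for a labeled pedestrian instance is sketched below. The record type `PedestrianAnnotation`, its field names, and the `(x, y, w, h)` box layout are assumptions made for illustration. Only the class vocabularies come from the description above.

```python
from dataclasses import dataclass

DIRECTIONS = ("N", "E", "S", "W", "NE", "SE", "SW", "NW")
SPEEDS = ("stop", "slow", "fast")
ACTIONS = ("sidewalk", "crosswalk", "jaywalk")


@dataclass(frozen=True)
class PedestrianAnnotation:
    """Ground-truth record for one pedestrian instance (hypothetical layout)."""
    image_id: str
    box: tuple  # (x, y, w, h) bounding box in pixels
    direction: str
    speed: str
    action: str

    def __post_init__(self):
        # Reject labels outside the predefined class vocabularies.
        if self.direction not in DIRECTIONS:
            raise ValueError(f"unknown direction: {self.direction}")
        if self.speed not in SPEEDS:
            raise ValueError(f"unknown speed: {self.speed}")
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action}")

    def targets(self):
        """Integer class indices for the three classification tasks."""
        return (DIRECTIONS.index(self.direction),
                SPEEDS.index(self.speed),
                ACTIONS.index(self.action))
```

Storing the three labels on the same record keeps each training sample aligned across the three classification tasks.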

In general, when humans look at a moving object, they simultaneously perceive visual information about the object and its surroundings. Similarly, to exploit scene context for future motion prediction, the present invention proposes a multi-context region of interest (MCRoI) pooling layer that integrates object, local, and global features. The MCRoI pooling layer is integrated into a DenseNet (densely connected convolutional network) to train a unified model that predicts the future direction, speed, and action of an instance.

본 발명에서는 방향, 속도, 행동에 대한 세 가지 분류 작업을 수행하여 보행자의 미래 움직임을 예측하기 위해 DNN을 구축한다.In the present invention, a DNN is constructed to predict future movements of pedestrians by performing three classification tasks for direction, speed, and behavior.

도 6은 미래 움직임 예측을 위한 DNN 구조를 도시한 것이다.6 shows a DNN structure for future motion prediction.

도 6을 참조하면, DNN(600)은 입력 이미지 패치를 처리하게 되는데, 이때 입력 이미지 패치는 보행자가 중앙에 위치하며 세 가지 분류 결과를 생성한다.Referring to FIG. 6, the DNN 600 processes an input image patch, in which the input image patch is located in the center of a pedestrian and generates three classification results.

DNN(600)은 컴퓨터 시스템(100)의 프로세서(110)에 포함되는 것으로, 특징 추출부(210)와, 움직임 예측부(220)로 구성될 수 있다.The DNN 600 is included in the processor 110 of the computer system 100 and may include a feature extraction unit 210 and a motion prediction unit 220.

It is assumed that, from a training image in which a bounding box of height h is annotated for the pedestrian region 61, a patch of size 2h×2h cropped with respect to that bounding box is used as the input image 60.

The feature extraction unit 210 may include convolutional layers and an MCRoI pooling layer. The convolutional layers may be implemented as a DenseNet.

Based on DenseNet, the feature extraction unit 210 produces features for the pedestrian region 61 (the object region), the surrounding region 62 that contains the object, and the entire image region 63. That is, the feature extraction unit 210 extracts an object feature, local context features, and a global context feature from the DenseNet output.

The feature extraction unit 210 includes an MCRoI pooling layer that uses four RoI pooling layers.

The bounding box annotated for the pedestrian region 61 is the RoI of the object feature, which is pooled to a spatial resolution of 7×7. The pedestrian region 61, that is, the object region, contains the appearance information of the pedestrian with minimal background.

The surrounding region 62 is a patch of size 1.3h×1.3h cropped with respect to the bounding box, and it serves as the RoI for two local context features. The patch is pooled to resolutions of 9×9 and 11×11 by two RoI pooling layers, respectively. The local context features contain the background around the pedestrian as well as the pedestrian's appearance.

The 2h×2h patch corresponding to the entire image region 63 is the RoI for the global context feature and is pooled to a resolution of 5×5. The global context feature is important for predicting future motion because it provides overall semantic information about the scene.

The output sizes of the four RoI pooling layers are determined empirically.
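The geometry of the three regions and the four pooling resolutions can be sketched as below. Several points are assumptions made for illustration: the regions are taken to be centered on the pedestrian, the bounding box width `box_w` is a free parameter (the text only specifies the height h), and the four pooled maps are assumed to be flattened and concatenated when computing the feature length.

```python
# Pooled output resolutions stated in the text: 7x7 (object),
# 9x9 and 11x11 (local context), 5x5 (global context).
POOL_SIZES = {"object": [7], "local": [9, 11], "global": [5]}


def mcroi_regions(h, box_w):
    """Return RoIs as (x0, y0, x1, y1) inside a 2h x 2h patch.

    Assumes every region is centered on the patch center, where the
    pedestrian is located. box_w is a hypothetical bounding box width.
    """
    side = 2.0 * h
    cx = cy = side / 2.0

    def centered(width, height):
        return (cx - width / 2.0, cy - height / 2.0,
                cx + width / 2.0, cy + height / 2.0)

    return {
        "object": centered(box_w, h),          # annotated bounding box
        "local": centered(1.3 * h, 1.3 * h),   # surrounding region
        "global": (0.0, 0.0, side, side),      # entire patch
    }


def pooled_length(channels):
    """Length of the feature vector if all pooled maps are concatenated."""
    cells = sum(s * s for sizes in POOL_SIZES.values() for s in sizes)
    return channels * cells


regions = mcroi_regions(h=200, box_w=80)
```

With h = 200 the patch is 400×400, and the four pooled maps contribute 49 + 81 + 121 + 25 = 276 cells per feature channel.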

The motion prediction unit 220 may consist of fully connected layers and softmax layers. Specifically, the motion prediction unit 220 receives the output of the MCRoI pooling layer, processes it through three stages of fully connected layers, and then applies three softmax layers that classify the direction, speed, and action of the pedestrian, respectively.

The training process of the DNN is as follows.

The MCRoI pooling layer connects the convolutional layers to the fully connected layers using four pooling layers.

The DNN 600 of FIG. 6 can accept input patches of arbitrary size. For effective training and inference, however, the input patch is normalized to 400×400 so that it contains a pedestrian region 61 that is 200 pixels in size (2h = 400, hence h = 200).

The total loss function is defined as the sum of three losses, as in Equation 1.

[Equation 1]

$$L = L_{Dir} + L_{Vel} + L_{Act}$$

Here, $L_{Dir}$ is the loss for direction classification, $L_{Vel}$ is the loss for speed classification, and $L_{Act}$ is the loss for action classification.

Cross entropy is applied for each classification loss.

The direction classification loss is defined as in Equation 2.

[Equation 2]

$$L_{Dir} = -\sum_{i=1}^{8} y_i \log p_i$$

Here, $p = (p_1, \dots, p_8)$ is the softmax probability vector over the eight directions, and $y = (y_1, \dots, y_8)$ is the ground-truth binary vector.

$L_{Vel}$ and $L_{Act}$ are defined analogously to Equation 2.
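As a rough illustration of Equations 1 and 2, the sketch below computes the total loss as the sum of three softmax cross-entropy terms. The function names and the use of raw logits as inputs are assumptions for the example; the ground truth is given as a class index, which is equivalent to a one-hot binary vector.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    # With a one-hot ground-truth vector y, -sum(y_i * log p_i)
    # reduces to -log of the probability of the true class.
    return -math.log(probs[target_index])


def total_loss(dir_logits, vel_logits, act_logits, dir_gt, vel_gt, act_gt):
    """Sum of the direction, speed, and action classification losses."""
    return (cross_entropy(softmax(dir_logits), dir_gt)
            + cross_entropy(softmax(vel_logits), vel_gt)
            + cross_entropy(softmax(act_logits), act_gt))


# Uninformative predictions: 8 direction, 3 speed, and 3 action classes.
loss = total_loss([0.0] * 8, [0.0] * 3, [0.0] * 3, dir_gt=2, vel_gt=1, act_gt=0)
```

For uniform predictions the loss equals log 8 + 2 log 3, which is a convenient sanity check before training.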

In this embodiment, the DNN 600 of FIG. 6 is trained by stochastic gradient descent with a momentum of 0.9 and a batch size of 4 for 14 epochs. A DenseNet model pre-trained on ImageNet is used for the initial parameters.
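The training settings above can be pictured with a minimal sketch of one common formulation of SGD with momentum. The learning rate `lr` is not stated in the specification and is a placeholder here, and the helper names are hypothetical; only the momentum (0.9), batch size (4), and epoch count (14) come from the text.

```python
MOMENTUM = 0.9
BATCH_SIZE = 4
EPOCHS = 14


def sgd_momentum_step(params, grads, velocity, lr, momentum=MOMENTUM):
    """One in-place update: v <- momentum * v - lr * g, then p <- p + v."""
    for i in range(len(params)):
        velocity[i] = momentum * velocity[i] - lr * grads[i]
        params[i] += velocity[i]


def iterate_minibatches(n_samples, batch_size=BATCH_SIZE):
    """Yield index ranges that cover the dataset in batches."""
    for start in range(0, n_samples, batch_size):
        yield range(start, min(start + batch_size, n_samples))


# Toy usage: minimize f(p) = p^2, whose gradient is 2p.
params, velocity = [1.0], [0.0]
sgd_momentum_step(params, [2.0 * params[0]], velocity, lr=0.1)
sgd_momentum_step(params, [2.0 * params[0]], velocity, lr=0.1)
```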

Thus, the present invention can implement a network model that learns jointly from an object and its surrounding information through the MCRoI pooling layer, and can thereby predict the instance-level future motion (direction, speed, and action) of an object from a single still image.

FIG. 7 shows a table describing the classification performance of the DNN that includes the MCRoI pooling layer.

The table of FIG. 7 compares the performance of the 'Proposed' network, which includes the MCRoI pooling layer, with that of existing networks: the 'Object' network, which uses the object feature for classification; the 'Local context' network, which uses the local context features; the 'Global context' network, which uses the global context feature; and the 'Object+Local' network, which uses both the object feature and the local context features.

In FIG. 7, 'Direction+' accounts for the ambiguity that arises when even humans looking at the same image assign different classes, which makes the true direction hard to quantify. It denotes the accuracy obtained when a predicted direction is counted as correct if it is the same as, or adjacent to, the ground-truth direction.
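The relaxed 'Direction+' criterion can be sketched as follows. The clockwise ordering of the compass labels and the function names are assumptions made for the example; the sample predictions are made up and are not results from FIG. 7.

```python
# Clockwise compass order, so neighbours in the list are adjacent directions.
COMPASS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def cyclic_diff(a, b):
    """Number of 45-degree steps between two compass directions (0 to 4)."""
    d = abs(COMPASS.index(a) - COMPASS.index(b))
    return min(d, len(COMPASS) - d)


def direction_accuracy(pred, truth, tolerance=0):
    """Fraction of predictions within `tolerance` steps of the ground truth.

    tolerance=0 gives strict accuracy; tolerance=1 also accepts the two
    neighbouring directions, as in the 'Direction+' criterion.
    """
    hits = sum(1 for p, t in zip(pred, truth) if cyclic_diff(p, t) <= tolerance)
    return hits / len(truth)


pred = ["N", "NE", "S", "W"]
truth = ["N", "E", "N", "NW"]
strict = direction_accuracy(pred, truth)               # only exact matches
relaxed = direction_accuracy(pred, truth, tolerance=1)  # exact or adjacent
```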

The 'Proposed' network, which includes the MCRoI pooling layer, predicts future motion by combining all three kinds of features (object, local context, and global context), and it can be seen that it outperforms the existing networks overall.

Instance-level motion prediction with direction, speed, and action as classes has the following advantages over pixel-level motion prediction when deep learning is used: (1) Pixel-level annotation is time-consuming and costly, which makes it difficult to obtain enough data to train a DNN. (2) Object-level motion modeling is closer to the way humans perceive and understand motion, and it is easy to combine with other computer vision techniques such as detection and tracking.

In an embodiment of the present invention, a DNN trained by integrating semantic information at multiple levels can predict the future motion of an object in a single still image accurately and efficiently, providing reliable prediction performance for future motion.

The future motion prediction algorithm described above can make object tracking more efficient.

Between consecutive frames, for multiple objects as well as for a single object, the search area for tracking each object can be reduced by using the future motion prediction result for that object.

Reducing the search area for the target object by using the future motion prediction result allows the object to be tracked more quickly and efficiently.

FIG. 8 compares (a) the search area 81 of a conventional tracking method with (b) the search area 801 of a tracking method that uses the future motion prediction result.

The conventional approach samples a reference line from a Gaussian distribution and selects the search target within the search area 81, which corresponds to a rectangular window. In contrast, the tracking method according to the present invention can reduce the search area 801 based on the future motion prediction result. The search area 801 may be determined according to the direction, speed, and action predicted as the future motion. For example, the higher the predicted speed, the larger the search area 801, and the lower the predicted speed, the smaller the search area 801. For instance, when the predicted speed is 'stop', the object tracking unit 230 may apply a search area 801 that is four times smaller than the existing search area 81 or than a search area of predetermined size. As another example, the object tracking unit 230 may determine a sector-shaped search area 801 corresponding to the predicted direction.
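One way to picture a speed-scaled, sector-shaped search area is the sketch below. It rests on several assumptions: the search area is modeled as a circular sector around the object's current position, "four times smaller" is read as one quarter of the area, the scale values for 'slow' and 'fast' and the sector half-angle are placeholders, and no angular restriction is applied to a stationary target.

```python
import math

# Angles in degrees, counter-clockwise from east. Image y grows downward,
# so the y difference is flipped below to make "N" point upward.
DIRECTION_ANGLE = {"E": 0, "NE": 45, "N": 90, "NW": 135,
                   "W": 180, "SW": 225, "S": 270, "SE": 315}

# Area scale relative to the conventional search area. Only the value for
# "stop" follows the text; the other two are placeholders.
AREA_SCALE = {"stop": 0.25, "slow": 0.5, "fast": 1.0}


def search_radius(base_radius, speed):
    # Area grows with the square of the radius, so scale by the square root.
    return base_radius * math.sqrt(AREA_SCALE[speed])


def in_search_sector(center, point, base_radius, speed, direction,
                     half_angle=45.0):
    """Return True if `point` lies inside the reduced search area."""
    dx = point[0] - center[0]
    dy = center[1] - point[1]  # flip y so that north is positive
    dist = math.hypot(dx, dy)
    if dist > search_radius(base_radius, speed):
        return False
    if speed == "stop" or dist == 0:
        return True  # assumption: no sector restriction when stationary
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    diff = abs(angle - DIRECTION_ANGLE[direction])
    diff = min(diff, 360.0 - diff)
    return diff <= half_angle


ahead = in_search_sector((100, 100), (100, 70), 40, "fast", "N")
behind = in_search_sector((100, 100), (100, 130), 40, "fast", "N")
```

A candidate 30 pixels ahead of a fast, north-bound pedestrian falls inside the sector, while a candidate 30 pixels behind it does not.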

Because the search area for object tracking can be reduced, the present invention searches a smaller area than before and thereby improves tracking speed.

For multi-object tracking in low-frame-rate video, fine-grained tracking can be performed using a tracking-by-detection method. For example, for multi-object tracking, the search area can be reduced based on each object's future motion prediction result while the sampling density is increased.

Because the future motion of an object is predicted as the motion of the object relative to the background around it, reducing the search area by means of the future motion may become invalid when the background moves. Accordingly, in the present invention, object tracking may be performed after the background motion has first been compensated using algorithms such as key point matching and affine transformation.
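The background compensation step can be illustrated with a minimal sketch that fits an affine transform to exactly three matched background key points and uses it to map the object's previous position into the current frame. This is a simplification under stated assumptions: the matches are taken as given, and a practical system would typically use many matches with a robust estimator. All function names are hypothetical.

```python
def solve3(m, v):
    """Solve the 3x3 linear system m @ x = v with Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

    d = det(m)
    if abs(d) < 1e-12:
        raise ValueError("matched points are collinear")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = v[r]
        solution.append(det(replaced) / d)
    return solution


def affine_from_matches(src, dst):
    """Affine map (a, b, tx, c, d, ty) sending three src points to dst."""
    m = [[x, y, 1.0] for x, y in src]
    a, b, tx = solve3(m, [p[0] for p in dst])
    c, d, ty = solve3(m, [p[1] for p in dst])
    return a, b, tx, c, d, ty


def warp_point(affine, point):
    """Apply the affine map to a point of the previous frame."""
    a, b, tx, c, d, ty = affine
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


# Background key points in the previous frame and their matches in the
# current frame (made-up values: the camera shifted the scene by (5, -3)).
prev_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
curr_pts = [(5.0, -3.0), (6.0, -3.0), (5.0, -2.0)]
affine = affine_from_matches(prev_pts, curr_pts)
compensated = warp_point(affine, (10.0, 10.0))
```

The reduced search area would then be placed around `compensated` rather than around the uncorrected previous position.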

Beyond object tracking, the future motion prediction algorithm can also be applied to crowd analysis of a single image.

For example, the motion prediction unit 220 may divide a crowd into several clusters using the predicted direction of each pedestrian in the crowd, and then predict the group direction of each cluster. The K-means algorithm is used for the clustering. The weighted distance of Equation 3 may be used to compute the distance between two instances.

[Equation 3]

Here, the weighted distance combines two terms: the Euclidean distance between the instances and the cyclic difference between their direction indices. For example, the cyclic difference between adjacent directions is 1, and it reaches its maximum of 4 for opposite directions. The two terms are balanced by a weighting parameter.
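A distance of this kind might be computed as in the sketch below. The exact form of Equation 3 is not recoverable from the text, so the additive combination (Euclidean distance plus the weighting parameter times the cyclic direction difference) is an assumption, as are the function names and the default weight.

```python
import math


def cyclic_direction_diff(i, j, n_dirs=8):
    """Cyclic difference between two direction indices (0 to n_dirs // 2)."""
    d = abs(i - j) % n_dirs
    return min(d, n_dirs - d)


def weighted_distance(pos_a, dir_a, pos_b, dir_b, lam=1.0):
    """Assumed form: Euclidean distance + lam * cyclic direction difference."""
    d_euclid = math.hypot(pos_a[0] - pos_b[0], pos_a[1] - pos_b[1])
    return d_euclid + lam * cyclic_direction_diff(dir_a, dir_b)


# Two pedestrians 5 pixels apart, walking in opposite directions.
dist = weighted_distance((0, 0), 0, (3, 4), 4, lam=2.0)
```

With direction indices assigned clockwise, indices 0 and 7 are adjacent (difference 1) and indices 0 and 4 are opposite (difference 4), which matches the values given in the text.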

After clustering, the group direction of each cluster is obtained. To this end, the direction probabilities of the instances in the cluster are summed for each direction. The probability vectors may be generated by the motion prediction unit 220 of the DNN 600 of FIG. 6, and the direction with the largest summed probability may be chosen as the group direction of the cluster.
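The group direction rule can be sketched as follows: the per-instance softmax vectors of a cluster are summed per direction and the direction with the largest total is selected. The label ordering and the sample probabilities are made up for illustration.

```python
LABELS = ("N", "NE", "E", "SE", "S", "SW", "W", "NW")


def group_direction(prob_vectors, labels=LABELS):
    """Return the direction with the largest summed probability, plus totals."""
    totals = [sum(p[k] for p in prob_vectors) for k in range(len(labels))]
    best = max(range(len(labels)), key=lambda k: totals[k])
    return labels[best], totals


# Direction probabilities of three pedestrians in one cluster.
cluster = [
    [0.6, 0.3, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.7, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.5, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
]
direction, totals = group_direction(cluster)
```

Summing probabilities rather than counting hard votes lets uncertain predictions still contribute to the group decision.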

FIG. 9(a) shows the predicted moving direction of each pedestrian, and FIG. 9(b) shows the predicted moving direction of each group obtained by clustering. As in the image of FIG. 9(b), the direction information of the pedestrians can be condensed by data clustering and provided as a crowd analysis result.

As described above, according to embodiments of the present invention, the future motion of an object can be predicted from a single still image, so that degradation of tracking performance can be prevented and stable tracking ensured even for video with a very low frame rate or when frames are dropped. In addition, according to embodiments of the present invention, the future motion of an object is predicted in one frame (a single still image) and the object is searched for in the next frame based on the prediction, so that the object search area can be reduced efficiently.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, computer storage medium, or device so as to be interpreted by the processing device or to provide instructions or data to the processing device. Software may also be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The medium may store a computer-executable program persistently, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software.

Although the embodiments have been described with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or if components of the described systems, structures, devices, circuits, and the like are combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

1. An object tracking method performed in a computer system, the method comprising:
extracting, from a still image corresponding to one frame of a video, features of an object included in the still image and of surrounding information of the object;
predicting a future motion of the object from the extracted features; and
performing tracking of the object by generating a search area in a next frame based on a prediction result of the future motion,
wherein the extracting comprises
extracting, from the still image, an object feature for an object region, a local feature for a surrounding region including the object, and a global feature for an entire image region.

2. The method of claim 1,
wherein the predicting comprises
predicting the future motion at an instance level, with at least one of a direction, a speed, and an action of the object as classes, and
wherein the performing comprises
tracking the object by first searching, in the next frame, a region with a high probability of movement according to at least one of the direction, the speed, and the action predicted as the future motion of the object.

3. The method of claim 1,
wherein the performing comprises
reducing the search area according to a speed predicted as the future motion of the object.

4. The method of claim 1,
wherein the performing comprises
determining a sector-shaped search area corresponding to a direction predicted as the future motion of the object.

5. The method of claim 1,
wherein the performing comprises
compensating for background movement around the object and then tracking the object.

6. The method of claim 1,
wherein the predicting comprises
predicting a group direction of each cluster by clustering the directions predicted as the future motion of each object included in the still image.

7. The method of claim 1,
wherein the extracting comprises
extracting the features through a learning model trained by integrating an object included in a training image with surrounding information of the object.

8. The method of claim 1,
wherein the extracting comprises
extracting the features through a learning model trained on images in which at least one of a direction, a speed, and an action representing future motion is assigned to an object instance as ground-truth data.

9. The method of claim 1,
wherein the extracting comprises
extracting the object feature, the local feature, and the global feature through a plurality of region of interest (RoI) pooling layers constituting a learning model.

10. The method of claim 9,
wherein the predicting comprises
predicting at least one of a direction, a speed, and an action representing the future motion of the object by passing the output of the RoI pooling layers through a fully connected layer and a softmax layer constituting the learning model.

11. A computer-readable recording medium having recorded thereon a program for causing a computer to execute the object tracking method of any one of claims 1 to 10.

12. A computer system comprising:
at least one processor implemented to execute computer-readable instructions,
wherein the at least one processor comprises:
a feature extraction unit that extracts, from a still image corresponding to one frame of a video, features of an object included in the still image and of surrounding information of the object;
a motion prediction unit that predicts a future motion of the object from the extracted features; and
an object tracking unit that tracks the object by generating a search area in a next frame based on a prediction result of the future motion,
wherein the feature extraction unit
extracts, from the still image, an object feature for an object region, a local feature for a surrounding region including the object, and a global feature for an entire image region.