KR102498066B1

KR102498066B1 - Deep Reinforcement Learning Accelerator

Info

Publication number: KR102498066B1
Application number: KR1020200021129A
Authority: KR
Inventors: 유회준; 김창현
Original assignee: 한국과학기술원
Priority date: 2020-02-20
Filing date: 2020-02-20
Publication date: 2023-02-10
Also published as: KR20210106222A

Abstract

The present invention relates to deep reinforcement learning (DRL) technology. A DRL accelerator includes an actor circuit block that receives input data and outputs, as an action, the result of processing that data through a deep neural network (DNN), and a learner circuit block that learns the weight values of the DNN using a plurality of state data pairs.

Description

Deep Reinforcement Learning Accelerator

The present invention relates to deep reinforcement learning (DRL) technology and, in particular, to a DRL accelerator that dynamically switches between the different data reuse schemes of reinforcement-learning inference and training and that compresses and stores training data, so that it can be used in a mobile environment.

Recently, DRL has attracted attention as one of the most powerful techniques for sequential decision-making problems such as motion control and object tracking. Here, an artificial intelligence that performs DRL is referred to as a DRL agent; an intelligent robot is one example. Without human supervision, a DRL agent can learn optimal behavior from experience, that is, from the input/output information it collects by interacting with the external environment on its own, and can perform complex tasks at a nearly human level. Unlike inference-only AI, which has no ability to learn, a DRL agent can dynamically adjust its behavior toward its goal even under constantly changing environmental conditions.

In DRL, the learner uses the DNN for inference but must also update it (a form of training) at the same time, so a larger peak memory bandwidth is required than for a pure inference task. In addition, because a plurality of DNNs are required, the DNN circuits implemented on a chip must be mapped spatially and temporally to separate DNNs, which can degrade performance.

The prior art documents below present technical means for introducing an accelerator that exploits data parallelism into DNN hardware that performs inference.

J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, "UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018: IEEE, pp. 218-220.
K. Ueyoshi et al., "QUEST: A 7.49 TOPS multi-purpose log-quantized DNN inference engine stacked on 96MB 3D SRAM using inductive-coupling technology in 40nm CMOS," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018: IEEE, pp. 216-218.

The technical problem to be solved by the present invention is to overcome the limitation that memory access has not been optimized for DRL applications, which require adaptive data reuse during training as well, because conventional DNN hardware either performed inference only or provided only a fixed data path for data parallelism.

To solve the above technical problem, a deep reinforcement learning (DRL) accelerator according to an embodiment of the present invention includes: an actor circuit block that receives input data and outputs, as an action, the result of processing it through a deep neural network (DNN); and a learner circuit block that learns the weight values of the DNN using a plurality of data pairs each including {current state, intermediate value, action, reward, next state}.

In the DRL accelerator according to an embodiment, the actor circuit block may include MAC units, each combining a multiplier and an accumulator, arranged to form a two-dimensional processing element (PE) array. A plurality of such PE arrays are integrated in the block, which receives the current state input from outside and outputs an action and a next state through DNN inference.

Further, in the DRL accelerator according to an embodiment, during learner operation the plurality of PE arrays of the actor circuit block may, instead of taking external input, read a plurality of stored data pairs including {current state, intermediate value, action, reward, next state} to generate a loss function, and may perform a back-propagation operation based on the generated loss function to update the weights of the DNN.

In the DRL accelerator according to an embodiment, when the DNNs for the actor and the learner and other DNNs are mapped to the PE array, the available PEs may be spatially partitioned, and the DNNs for the actor and the learner and the other DNNs may be assigned to the respective partitions and processed in parallel. The accelerator may further include control logic (a controller) that detects a change in the amount of computation or the memory bandwidth required for DNN inference and changes the number of PEs allocated to each of the DNNs according to the detected result.
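One way to picture the spatial partitioning described above is a controller that divides the available PEs among the DNNs in proportion to each DNN's current compute demand. The following Python sketch is only an illustration under that assumption: the proportional rule, the function name `partition_pes`, and the workload figures are hypothetical and are not taken from the patent.

```python
def partition_pes(total_pes, workloads):
    """Split total_pes among DNNs in proportion to their relative demand.

    workloads maps a DNN name to a positive relative compute demand.
    Returns a dict mapping each DNN name to an integer PE count.
    """
    total_demand = sum(workloads.values())
    # Ideal (fractional) share of PEs for every DNN.
    shares = {name: total_pes * w / total_demand for name, w in workloads.items()}
    # Start from the integer part of each share.
    alloc = {name: int(share) for name, share in shares.items()}
    leftover = total_pes - sum(alloc.values())
    # Give the remaining PEs to the DNNs with the largest fractional remainder.
    by_fraction = sorted(workloads, key=lambda n: shares[n] - alloc[n], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical demands: the target network needs twice the compute here.
    print(partition_pes(64, {"actor": 1, "critic": 1, "target": 2}))
```

Calling the function again with updated demand values models the controller changing the number of PEs per DNN when the workload shifts.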

In the DRL accelerator according to an embodiment, when the DNNs for the actor and the learner and other DNNs are mapped to the PE array, these DNNs may be divided in time and processed by allocating the available PEs to them sequentially. Alternatively, the available PEs may be mapped to a plurality of DNNs according to a space-time division and processed in parallel, with the PEs that were allocated to a DNN that finishes first being allocated in turn to another waiting DNN. Furthermore, the accelerator may include control logic (a controller) that, when the operation of one DNN finishes, stores the current weight values, input values, and output values in an external memory and reads, from an internal or external memory, the weight values, input values, and output values required by another waiting DNN, so that the waiting DNN can use the PEs that were allocated to the finished DNN.
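The space-time division above, in which PEs freed by a finished DNN are handed to the next waiting DNN, can be sketched as a small event-driven simulation. This is a minimal sketch under stated assumptions: the PE array is modeled as a fixed number of equal PE groups, each DNN occupies one group for a known duration, and the function name `schedule_dnns` and the job durations are hypothetical. The context save and restore performed by the controller is not modeled.

```python
import heapq


def schedule_dnns(num_pe_groups, jobs):
    """Assign queued DNN jobs to PE groups as the groups become free.

    jobs is a list of (name, duration) in arrival order.
    Returns {name: (group_index, start_time, end_time)}.
    """
    # Each heap entry is (time at which the group becomes free, group index).
    free_at = [(0, group) for group in range(num_pe_groups)]
    heapq.heapify(free_at)
    timeline = {}
    for name, duration in jobs:
        # The group that frees up earliest serves the next waiting DNN.
        start, group = heapq.heappop(free_at)
        end = start + duration
        timeline[name] = (group, start, end)
        heapq.heappush(free_at, (end, group))
    return timeline


if __name__ == "__main__":
    plan = schedule_dnns(2, [("actor", 2), ("critic", 5), ("target", 3)])
    for name, (group, start, end) in plan.items():
        print(f"{name}: PE group {group}, cycles {start}-{end}")
```

In this example the target network waits until the actor's PE group is released and then reuses it, which is the behavior the paragraph describes.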

The DRL accelerator according to an embodiment may further include a compressor that encodes or decodes the plurality of data pairs including {current state, intermediate value, action, reward, next state}.

In the DRL accelerator according to an embodiment, the PE array may be configured so that the input direction for the actor and the input direction for the learner differ from each other, thereby forming a transposable processing element (tPE) structure that changes the data reuse of the input feature (IF) and the weight (W) between actor operation and learner operation.

To solve the above technical problem, a DRL accelerator according to another embodiment of the present invention includes: a plurality of DRL cores that perform deep reinforcement learning; and an upper-level shared memory connected to the plurality of DRL cores through an on-chip network under the control of an upper-level controller. While an actor, which receives input data and outputs as an action the result of processing it through a DNN, is running, the upper-level shared memory loads the weight values of the DNN into the DRL cores. While a learner, which learns the weight values of the DNN using a plurality of data pairs including {current state, intermediate value, action, reward, next state}, is running, experience data shared by the plurality of DRL cores is stored in the upper-level shared memory.

In the DRL accelerator according to another embodiment, each of the plurality of DRL cores may include: a core controller that controls the DRL core; a compressor that encodes or decodes the plurality of data pairs including {current state, intermediate value, action, reward, next state}; and a processing element (PE) array that receives the current state and outputs an action and a next state through DNN inference.

In the DRL accelerator according to another embodiment, each of the plurality of DRL cores may further include a broadcast memory (BMEM) and a unicast memory (UMEM) that are connected to the compressor and receive weights (W) and input features (IF), and the broadcast memory and the unicast memory may provide input data to the PE array through a B-buffer and a U-buffer, respectively.

In the DRL accelerator according to another embodiment, the core controller may automatically fetch the weights and the input features into the broadcast memory or the unicast memory according to the DNN network structure, and may set the configuration of the PE array so as to change the data reuse of the weights and the input features between actor operation and learner operation.

In the DRL accelerator according to another embodiment, each of the plurality of DRL cores may further include: an activation unit that processes nonlinear functions; and a 1-D SIMD unit that performs logarithm, addition, and multiplication operations for weight update and loss calculation.

In the DRL accelerator according to another embodiment, while the PE array processes a DNN operation, the compressor may scan the output buffer and, when the output buffer is full, sequentially encode a code word, an exponent, and a mantissa. When an input feature is transferred to the DRL core, the compressor may scan the code word, recombine the sequentially received mantissa and exponent, and stream out the input feature.
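The paragraph above does not specify the number format or the layout of the code word, so the following Python sketch only illustrates the general idea of separating a floating-point value into exponent and mantissa fields and recombining them losslessly. It assumes IEEE 754 single precision, which the patent does not state, and it does not attempt to reproduce the patent's code-word format. The function names `split_fp32` and `join_fp32` are hypothetical.

```python
import struct


def split_fp32(value):
    """Split a float into (sign, exponent, mantissa) fields of its fp32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits
    mantissa = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, mantissa


def join_fp32(sign, exponent, mantissa):
    """Recombine the three fields into the original fp32 value."""
    bits = (sign << 31) | (exponent << 23) | mantissa
    return struct.unpack(">f", struct.pack(">I", bits))[0]


if __name__ == "__main__":
    for v in (0.0, 1.0, -2.5, 0.15625):
        fields = split_fp32(v)
        print(v, fields, join_fp32(*fields))
```

Storing the exponents and mantissas as separate streams is one plausible reason for encoding them sequentially, since neighboring activations often have similar exponents; that motivation is an assumption here, not a statement from the source.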

In the DRL accelerator according to another embodiment, during learner operation the PE array may, instead of taking external input, read a plurality of stored data pairs including {current state, intermediate value, action, reward, next state} to generate a loss function, and may perform a back-propagation operation based on the generated loss function to update the weights of the DNN.

In the DRL accelerator according to another embodiment, the PE array may consist of MAC units arranged in a two-dimensional array, in which each row shares a row buffer that receives broadcast data and multiplies the input values through its MAC units. New data is broadcast every cycle to perform parallel matrix operations, and once all of the unicast data provided from the U-buffer has been reused, new unicast data may be input through the U-buffer.

In the DRL accelerator according to another embodiment, the PE array may be configured so that the input direction for the actor and the input direction for the learner differ from each other, thereby forming a transposable processing element (tPE) structure that changes the data reuse of the input feature (IF) and the weight (W) between actor operation and learner operation. Further, the PE array may select W broadcast (WBC) when the row length of the weight matrix (W matrix), which represents the number of output channels (Co), is larger than the column length of the input feature matrix (IF matrix), which represents the batch size (BA), and may select IF broadcast (IFBC) when the row length of the weight matrix is smaller than the column length of the input feature matrix.
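The selection rule in the preceding paragraph reduces to a comparison between the number of output channels (Co) and the batch size (BA). A minimal Python sketch follows; the function name is hypothetical, and because the text does not say what happens when the two lengths are equal, sending ties to IFBC is an assumption made here.

```python
def select_broadcast_mode(num_output_channels, batch_size):
    """Choose the broadcast mode from the W-matrix and IF-matrix dimensions.

    Returns "WBC" (weight broadcast) when Co > BA, otherwise "IFBC"
    (input-feature broadcast). The tie case Co == BA is not covered by the
    source text; returning "IFBC" for ties is an assumption of this sketch.
    """
    if num_output_channels > batch_size:
        return "WBC"
    return "IFBC"


if __name__ == "__main__":
    # Actor-style inference: a single sample against a wide layer.
    print(select_broadcast_mode(num_output_channels=512, batch_size=1))
    # Learner-style training: a large batch against a narrow layer.
    print(select_broadcast_mode(num_output_channels=64, batch_size=256))
```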

Embodiments of the present invention have the advantage that external peak memory bandwidth and power consumption can be significantly reduced by using a compressor together with a PE array structure that dynamically switches data reuse for both training and inference.

FIG. 1 is a diagram for explaining the processing performed by each entity in deep reinforcement learning.
FIG. 2 is a diagram for explaining the Actor-Critic algorithm, which can be utilized in the technical field to which the present invention belongs.
FIG. 3 is a diagram for explaining data reuse in matrix operations in a DRL agent.
FIG. 4 is a diagram showing the difference in data reuse between input features and weights.
FIG. 5 is a diagram showing the dynamic change of the data reuse pattern during training.
FIG. 6 is a diagram illustrating a PE array that can be utilized in a DNN chip.
FIGS. 7 and 8 are diagrams illustrating a space-division allocation method and a time-division allocation method, respectively, that can be utilized by embodiments of the present invention.
FIGS. 9 to 11 are diagrams illustrating a space-time-division allocation method and various combinations that can be utilized by embodiments of the present invention.
FIG. 12 is a block diagram showing a DRL accelerator that dynamically switches data reuse according to an embodiment of the present invention.
FIG. 13 is a diagram showing the structure of a DRL accelerator that dynamically switches data reuse according to another embodiment of the present invention.
FIG. 14 is a diagram showing the structure of the compressor proposed by embodiments of the present invention.
FIG. 15 is a diagram showing the structure of the transposable processing element (tPE) array proposed by embodiments of the present invention.
FIG. 16 is a diagram for explaining a method of dynamically selecting W broadcast (WBC) or IF broadcast (IFBC) using the tPE array structure.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings. Detailed descriptions of well-known functions or configurations that could obscure the gist of the present invention are omitted from the following description and the accompanying drawings. In addition, throughout the specification, a statement that something "includes" a component means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

The terms used herein are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in this application, are not to be interpreted in an idealized or excessively formal sense.

FIG. 1 is a diagram for explaining the processing performed by each entity in deep reinforcement learning. A DRL agent generally consists of two components, an actor and a learner, which perform inference and learning, respectively. Through them, the DNN embedded in the agent takes the observed environment state ($S_t$) as input and outputs the agent's own optimal action ($A_t$). From the observed current state $S_t$, the DRL agent determines the optimal action $A_t$ using the DNN weights ($\theta$) and the policy ($\pi$). During DRL processing, the actor continuously interacts with the environment, and the learner periodically trains the DNN to maximize the reward ($R_t$) from the environment. Strictly speaking, learning proceeds in the direction that maximizes the return $G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$, the total reward obtained when acting according to the current policy. Here $\gamma$ is the discounting factor, with $0 < \gamma < 1$; as in animal behavior, it places more weight on rewards that can be obtained immediately from the current action.
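As a small illustration of the return defined above, the following Python sketch computes $G_t$ for every step of a finite reward sequence using the backward recursion $G_t = R_t + \gamma G_{t+1}$. Truncating the infinite sum to a finite episode, the function name, and the sample rewards are illustrative assumptions rather than part of the source.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where G_t = R_t + gamma * G_{t+1}."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must satisfy 0 < gamma < 1")
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of its successor.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Three steps with reward 1 each and gamma = 0.5.
    print(discounted_returns([1.0, 1.0, 1.0], 0.5))
```

With $\gamma = 0.5$, the first return is $1 + 0.5 + 0.25 = 1.75$, which shows how later rewards contribute less than immediate ones.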

The actor observes the state of the environment ($S_t$) and selects the action ($A_t$) determined by the current DNN weights ($\theta$) and the policy ($\pi$). After the DRL agent performs the action, it receives as input the next observed state of the external environment ($S_{t+1}$) together with a scalar reward ($R_t$). At each step, experiences stored in the experience memory are sampled, and the learner uses them to train the DNN. To use the experience replay technique, which is widely used in DRL algorithms to stabilize agent training, the experience memory must store a large number of experiences (roughly 10,000 or more). The learner fetches several random experience samples as a batch and uses a loss function to maximize the reward for the action determined by the current DNN. The learner then updates the DNN weights using the gradient calculated from the loss function.
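The experience memory and batch fetch described above can be sketched in software as a fixed-capacity buffer with uniform random sampling. This is a minimal sketch, assuming oldest-first eviction and sampling without replacement, neither of which the source specifies; the class and method names are hypothetical, and the default capacity of 10,000 simply mirrors the figure quoted in the text.

```python
import random
from collections import deque


class ExperienceMemory:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=10_000, seed=None):
        # A bounded deque drops the oldest experience when it is full.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self._buffer.append((state, action, reward, next_state))

    def sample_batch(self, batch_size):
        """Draw a uniformly random batch without replacement."""
        if batch_size > len(self._buffer):
            raise ValueError("not enough experiences stored")
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


if __name__ == "__main__":
    memory = ExperienceMemory(capacity=3, seed=0)
    for step in range(5):
        memory.store(state=step, action=0, reward=1.0, next_state=step + 1)
    print(len(memory), memory.sample_batch(2))
```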

In general, DRL learning can be broadly divided into on-policy learning and off-policy learning. On-policy learning is a scheme in which the policy being learned (the learner's policy) and the policy used to act (the actor's policy) are the same; typically, the policy DNN is trained with the experience obtained at every time step. In contrast, off-policy learning is a scheme in which the two policies are not necessarily the same: the experiences produced by the DRL agent's action generation are accumulated over many steps, and the policy DNN is then trained on random samples of the accumulated experience. The biggest difference between the two lies in how they use past experience. With on-policy learning, the best policy is found from the current experience each time and new experience is gathered with that policy, so learning is faster, but the agent can fall into a local minimum and become unable to find a better policy. Off-policy learning, which compensates for this, reuses previously accumulated experience many times and therefore uses data efficiently for tasks in which experience is hard to collect. Because it learns from randomly sampled experience, it learns more slowly than on-policy learning, but it goes through more varied trial and error and provides stable learning that does not fall into local minima.

DQN (Deep Q-Network) is a representative off-policy learning algorithm, and Actor-Critic is a representative on-policy learning algorithm. (Actor-Critic can of course also be trained off-policy, and in some cases this converges better.)

DQN includes two DNNs: one main Q-network and one target network. DQN determines the action $a_t$ from the current state $s_t$ through the main Q-network. The actor's outputs are bundled into a set {current state $s_t$, action $a_t$, reward $R_{t+1}$, next state $s_{t+1}$} and stored in a separate memory (the replay memory); data sets are then drawn at random from the replay memory to train the target network. The target network is synchronized to the main Q-network at regular intervals, which reduces the noise introduced by updates and provides stable DQN training. The target network is trained using the equation below.

Here, $Q_m$ denotes the main Q-network, that is, the actor DNN, and $Q_t$ denotes the target network; the term [...] is obtained from the target network currently being trained. The first term of Equation (2) represents the total reward obtainable from the action inferred by the target network currently being trained in the next state $s_{t+1}$, and the second term represents the total reward obtainable from the action inferred by the current main Q-network in the next state $s_{t+1}$. In other words, the policy with the maximum reward can be found by minimizing the difference between the Q value of the action $a$ that maximizes the target network's inferred Q value in the next state $s_{t+1}$ and the Q value obtainable from the agent's current behavior policy. This difference is therefore used as the loss function to update the weight values of the target network and the main Q-network.
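The equations referred to in this passage are not reproduced in this text, so the following Python sketch shows the commonly used DQN temporal-difference loss instead, $\left(R_{t+1} + \gamma \max_a Q_t(s_{t+1}, a) - Q_m(s_t, a_t)\right)^2$. Whether this matches the patent's Equations (1) and (2) term for term is an assumption; the function and argument names are hypothetical.

```python
def dqn_td_loss(q_main_sa, q_target_next, reward, gamma):
    """Squared temporal-difference error for a single transition.

    q_main_sa     -- Q_m(s_t, a_t): main-network value of the action taken
    q_target_next -- list of Q_t(s_{t+1}, a) for every action a
    reward        -- R_{t+1} received after taking a_t
    gamma         -- discounting factor, 0 < gamma < 1
    """
    # Bootstrapped target built from the target network's best next action.
    td_target = reward + gamma * max(q_target_next)
    td_error = td_target - q_main_sa
    return td_error ** 2


if __name__ == "__main__":
    loss = dqn_td_loss(q_main_sa=1.0, q_target_next=[0.5, 2.0], reward=1.0, gamma=0.5)
    print(loss)
```

In a batched learner this value would be averaged over the sampled experiences before the gradient is taken.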

FIG. 2 is a diagram for explaining the Actor-Critic algorithm, which can be utilized in the technical field to which the present invention belongs. The Actor-Critic algorithm creates one more DNN and assigns it the approximation of the Q function. That is, a value network DNN that approximates the Q value is added to the existing policy network DNN that approximates the policy (here, the policy network encompasses the main Q-network and the target network of DQN). The policy network and the value network each include a main Q-network and a target network, so four DNNs are used in total. Two cases can be distinguished: one in which the policy network and the value network are completely independent, and one in which they share the front layers of a single DNN and only the last few layers are separate.

The Actor-Critic algorithm supports both On-Policy Learning and Off-Policy Learning; On-Policy Learning is used as the example here. A Cross-Entropy Error Function is obtained from the output of the Policy Network, and a Time Difference Error is obtained from the output of the Value Network. Learning proceeds by finding the weights θ that maximize the product of the two. The Value Network must be trained so that its output takes an accurate value, which ultimately means matching it to the action value obtained by actually taking the action, so learning drives the output toward that value. The Policy Network defines a new error function as the product of the Cross-Entropy Error Function and the Time Difference Error, and is updated with this error function.
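The equations referred to in this passage are not reproduced in the text. As a reading aid only, the block below writes them out in the notation of a standard one-step Actor-Critic formulation. All symbols are assumptions that do not appear in this document: the policy π_θ, the value estimate V_w with weights w, the reward r, and the discount factor γ. The patent's own notation may differ, and the sign convention (cross-entropy as the negative log-probability) is also assumed.

```latex
\begin{align}
% Cross-entropy error of the Policy Network for the action actually taken
L_{\mathrm{CE}}(\theta) &= -\log \pi_\theta(a_t \mid s_t) \\
% Time-difference (TD) error computed from the Value Network output
\delta_t &= r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t) \\
% Policy objective: product of log-probability and TD error, maximized over theta
J(\theta) &= \log \pi_\theta(a_t \mid s_t)\,\delta_t \\
% Value Network loss: match the output to the value observed after acting
L_V(w) &= \bigl(r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)\bigr)^2
\end{align}
```

Under this convention, maximizing J(θ) is the same as minimizing the product of the cross-entropy error and the TD error, which is one way to read the "new error function" described above.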

In the DQN algorithm described above, two DNNs are required, the actor's Main DNN and Target DNN, and the Actor-Critic algorithm requires a plurality of DNNs such as an Actor DNN and a Critic DNN. The present invention described below proposes a DRL CMOS chip that supports a plurality of DNNs and on which both the DQN algorithm and the Actor-Critic algorithm operate on a single chip.

It was pointed out earlier that, in DRL, the learner uses the DNN for inference but must also update it at the same time, so a larger peak memory bandwidth is required than for inference alone. It was also pointed out that, because a plurality of DNNs are required, the DNN circuits implemented on the chip must be mapped spatially and temporally to the separate DNNs, which can degrade performance.

For convenience of explanation, assume a DRL agent in which a simulated bipedal walking robot learns to walk by itself. The agent consists of four fully connected hidden layers with 512 neurons per layer and 212 input parameters. FIG. 3 is a diagram for explaining data reuse in the matrix operations of an FCL (Fully-Connected Layer) and an RNN in a DRL agent, and shows that the peak memory bandwidth required by the learner is 10 times larger than that required by the actor in the same DRL agent. FIG. 4 is a diagram showing the difference in data reuse methods for input features and weights. According to the memory bandwidth analysis, the actor and the learner differ, as shown in FIG. 4, in how they reuse input features (Input Feature, IF, or Activation) and weights (DNN Weight, W). In general, a DNN operation can be expressed in simplified form as the product of an IF matrix and a W matrix. If IF and W data that have already been input are reused for further operations, the same IF and W need not be fetched again, which reduces the input/output bandwidth.

Referring to FIGS. 3 and 4, one IF (shown as a circle) can be reused for the Ws (shown as triangles) located in the same row of the W matrix. The maximum reuse count of an IF is therefore the row length of the W matrix, which equals the number of output channels (Output Channel, Co) of a layer. Likewise, the maximum reuse count of one W is the column length of the IF matrix, which equals the batch size (Batch Size, B). While interacting with the environment, the actor generally receives a single IF as input, so IF reuse is more efficient. For the learner, Co differs from layer to layer, so either IF reuse or W reuse may be more efficient depending on the situation.
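The reuse bounds stated above can be made concrete with a small sketch. The class and function names, and the learner batch size of 64, are illustrative assumptions; only the 512-neuron layer width comes from the example agent described in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerShape:
    """Shape of one fully connected layer (hypothetical helper type)."""
    c_in: int    # number of input channels
    c_out: int   # number of output channels (Co)
    batch: int   # batch size (B)


def max_reuse(layer: LayerShape) -> dict:
    """Upper bound on how often one IF element and one W element can be reused."""
    return {
        "if_reuse": layer.c_out,  # one IF meets every W in its row: Co times
        "w_reuse": layer.batch,   # one W meets every IF column: B times
    }


# The actor usually sees a single input; the learner batch of 64 is assumed.
actor_layer = LayerShape(c_in=512, c_out=512, batch=1)
learner_layer = LayerShape(c_in=512, c_out=512, batch=64)

print("actor  :", max_reuse(actor_layer))
print("learner:", max_reuse(learner_layer))
```

With a batch of one, a weight can be used only once per fetch, which is why the actor can only benefit from IF reuse, whereas the learner's preferred reuse depends on how Co compares with B in each layer.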

FIG. 5 is a diagram showing the dynamic change of the data reuse pattern during learning, layer by layer during learner processing. As the figure shows, adaptive selection between IF reuse and W reuse can reduce the number of memory accesses by up to about 10×.

As noted above, previous DNN hardware performed only inference and had only fixed datapaths for data parallelism, so its memory access was optimized for DNN inference and not for DRL applications, which also need adaptive data reuse for learning.

Accordingly, the embodiments of the present invention presented below propose methods of mapping the PE array on a chip to DNNs by spatial division, temporal division, spatio-temporal division, and composite division, in order to implement with high performance the plurality of DNNs required by various DRL algorithms. They also propose an energy-efficient DRL accelerator that, through increased PE reconfigurability, is optimized for training various DNNs such as RNN (Recurrent Neural Network), FCL (Fully-Connected Layer), and CNN (Convolutional Neural Network). In particular, because the way W and IF data are reused in a DNN suited to learning differs from that in inference, a new reconfigurable PE array, the transposable PE (tPE) array, is proposed. Furthermore, a compressor is proposed that reduces data volume by compressing the experience and the intermediate layer data of the learner.

FIG. 6 is a diagram illustrating a PE array that can be utilized in a DNN chip. As illustrated, a typical DNN chip consists of PE arrays, and each PE array in turn consists of several to several thousand PEs. This large number of PEs can be spatially divided and assigned to the individual DNNs, which is called the spatial division allocation method.

FIG. 7 is a diagram illustrating a spatial division allocation method that embodiments of the present invention can utilize. A certain number of PEs is assigned to the Actor, a certain number to the Critic, and set numbers of PEs to the other DNNs required by the algorithm. This assignment can be changed dynamically over time, so it can be readjusted during operation to match the computational demand of each DNN. Each DNN performs its DNN operations independently on its assigned PEs, and because dependency between the DNNs is low, they can be processed in parallel. However, the PE arrays assigned to a DNN whose processing has finished remain on standby until the next processing request, which can lower PE utilization.
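One way to picture spatial division allocation that follows each DNN's computational demand is the sketch below. It is a software illustration under stated assumptions, not the chip's control logic: the function name `allocate_pes`, the demand figures, and the largest-remainder rounding rule are all hypothetical.

```python
def allocate_pes(total_pes: int, demand: dict) -> dict:
    """Split total_pes among DNNs in proportion to their compute demand.

    Largest-remainder rounding keeps the allocation summing to total_pes.
    The proportional rule itself is an assumption made for illustration.
    """
    total_demand = sum(demand.values())
    if total_demand <= 0:
        raise ValueError("total demand must be positive")
    exact = {name: total_pes * d / total_demand for name, d in demand.items()}
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = total_pes - sum(alloc.values())
    # Hand the remaining PEs to the DNNs with the largest fractional share.
    by_remainder = sorted(exact, key=lambda n: exact[n] - alloc[n], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


# Re-running the allocation with new demand figures models the dynamic
# readjustment during operation described in the text.
print(allocate_pes(256, {"actor": 1, "critic": 2, "other": 1}))
print(allocate_pes(256, {"actor": 3, "critic": 1, "other": 0}))
```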

Alternatively, all of the PE arrays can be used to process a single DNN, with the DNNs required by the algorithm, for example the Actor DNN, the Critic DNN, and other DNNs, taking turns in time to use the entire PE array. This is called the time division allocation method.

FIG. 8 is a diagram illustrating a time division allocation method that embodiments of the present invention can utilize. Parallel processing as in spatial division allocation is not possible, but because all PE arrays implement a single DNN, each DNN is processed faster. In particular, for algorithms in which a plurality of DNNs share many layers, time division is much simpler, because the other DNNs only need to receive the intermediate result of the first DNN operation and compute their own last few layers.

However, because the DNNs differ in required computation and memory bandwidth, a spatio-temporal division allocation method that combines spatial and time division allocation with this in mind can also be considered. Referring to FIG. 9, all PE arrays are assigned to the Actor DNN during the time interval t₁-t₂, and during t₂-t₃ the Critic DNN and the other DNNs share all the PEs by spatial division allocation. In this case, the PE arrays that were assigned to the other DNNs are idle during t₃-t₄, which can lower PE utilization. To prevent this drop, once the operations of the other DNNs finish, the PE arrays assigned to them can be assigned to the next stage, the Actor DNN, so that the Actor DNN operation starts early.

In addition, referring to FIG. 10, parallel processing by spatial division allocation can be developed further: all PE arrays are distributed among the Actor DNN, the Critic DNN, and other DNNs to perform DNN operations, and PE arrays whose work has finished are assigned to another DNN, which improves PE array utilization. Furthermore, a composite method that combines the methods described above is also possible. As in FIG. 11, time division, spatio-temporal division, and spatial division may all be used together.
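The utilization argument can be illustrated with a toy model that compares a fixed spatial split against one in which freed PEs are handed to a DNN that is still running. The model assumes perfect linear scaling with PE count and ignores the cost of saving and reloading weights; the function names, the workloads, and the rule of giving freed PEs to the DNN with the most remaining work are hypothetical.

```python
def static_makespan(work: dict, alloc: dict) -> float:
    """Finish time when every DNN keeps its initial PEs until all are done."""
    return max(work[name] / alloc[name] for name in work)


def realloc_makespan(work: dict, alloc: dict) -> float:
    """Finish time when PEs freed by a finished DNN are reassigned."""
    remaining = dict(work)
    pes = dict(alloc)
    now = 0.0
    while remaining:
        # The next DNN to finish at the current allocation.
        first = min(remaining, key=lambda n: remaining[n] / pes[n])
        step = remaining[first] / pes[first]
        now += step
        for name in remaining:
            remaining[name] -= step * pes[name]
        freed = pes.pop(first)
        del remaining[first]
        if remaining:
            # Assumed policy: give the freed PEs to the DNN with the most work left.
            busiest = max(remaining, key=remaining.get)
            pes[busiest] += freed
    return now


def utilization(work: dict, total_pes: int, makespan: float) -> float:
    """Fraction of PE-cycles doing useful work over the whole run."""
    return sum(work.values()) / (total_pes * makespan)


# Work is in PE-cycles; all figures are made up for illustration.
work = {"actor": 400, "critic": 1200, "other": 400}
alloc = {"actor": 4, "critic": 8, "other": 4}

t_static = static_makespan(work, alloc)
t_realloc = realloc_makespan(work, alloc)
print(t_static, utilization(work, 16, t_static))
print(t_realloc, utilization(work, 16, t_realloc))
```

In this example the fixed split leaves the Actor and other PEs idle while the Critic finishes, and reassigning them shortens the run from 150 to 125 time units. A real controller would also pay for the context switch, which this sketch leaves out.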

Because the DNNs differ in network structure and weight values, in time division any values of the previous DNN that will be used later must be stored in memory and reloaded afterward, and a separate controller that manages this must be integrated on the chip.

The following first outlines the DRL acceleration device proposed by embodiments of the present invention, which dynamically switches data reuse, and then describes the specific chip structure in order.

FIG. 12 is a block diagram of a DRL acceleration device (300) that dynamically switches data reuse according to an embodiment of the present invention, and explains the conceptual operation and processing scheme centered on the performing entities (actor and learner). The DRL acceleration device (300) may be implemented as a CMOS chip.

The actor circuit block (100) receives input data and outputs, as an action, the result of processing it through a DNN (Deep Neural Network). The learner circuit block (200) learns the weight values of the DNN using a plurality of data pairs each including {current state, intermediate value, action, reward, next state}.

Here, in the actor circuit block (100), MACs each combining a multiplier and an accumulator form a two-dimensional PE array (Processing Element Array); a plurality of such PE arrays are integrated internally, receive the current state input from outside, and can output the action and the next state through DNN inference. In addition, during learner operation, the plurality of PE arrays of the actor circuit block (100) can read, instead of an external input, the stored data pairs each including {current state, intermediate value, action, reward, next state}, generate a loss function, and update the DNN weights by performing a back-propagation operation based on the generated loss function. Various networks such as RNN (Recurrent Neural Network), CNN (Convolutional Neural Network), and FCL (Fully-Connected Layer) can be used as the DNN.

From an implementation point of view, it is also preferable to further include a compressor that encodes or decodes the plurality of data pairs each including {current state, intermediate value, action, reward, next state}. The compressor was devised because separate hardware that compresses and decompresses these pairs is needed to reduce the size of the memory that stores them and the memory bandwidth consumed by fetch and store operations.

Furthermore, the PE array is configured so that the input direction for the actor and the input direction for the learner differ, thereby forming a transposable PE (transposable Processing Element, tPE) structure that changes the data reuse of input features (Input Feature, IF) and weights (Weight, W) between actor and learner operation. As described above with reference to FIGS. 3 to 5, this was devised in view of the fact that the way W and IF data are reused in a DNN suited to learning differs from that in inference. The specific structure and switching scheme are described in detail later with reference to FIG. 16.

Meanwhile, as described above, the DRL acceleration device according to embodiments of the present invention can assign a plurality of DNNs to the PE array in various ways.

First, when the DNNs for the actor and the learner and other DNNs are mapped to the PE array, the available PEs can be spatially divided and the DNNs for the actor and the learner and the other DNNs assigned to the respective divided PEs for parallel processing. From an implementation point of view, it is preferable to further include control logic (a controller) that detects changes in the computation or memory bandwidth required for DNN inference and changes the number of PEs assigned to each DNN according to the detected result.

Second, when the DNNs for the actor and the learner and other DNNs are mapped to the PE array, these DNNs can be divided in time and the available PEs assigned to them sequentially for processing.

Third, when the DNNs for the actor and the learner and other DNNs are mapped to the PE array, the available PEs can be mapped to a plurality of DNNs by spatio-temporal division for parallel processing, with the PEs that were assigned to a DNN that finishes first being assigned sequentially to other waiting DNNs.

From an implementation point of view, in the second and third cases it is preferable to further include control logic (a controller) that, when the operation of one DNN finishes, stores its current weight values, input values, and output values in external memory, and reads from internal or external memory the weight values, input values, and output values required by another waiting DNN so that the waiting DNN can use the PEs that were assigned to the DNN that finished first.

FIG. 13 is a diagram showing the structure of a DRL acceleration device that dynamically switches data reuse according to another embodiment of the present invention, and presents a specific chip architecture.

A plurality of DRL cores (or t-Cores) (20) are provided in the DRL acceleration device (for example, four top-level controllers of 64 KB) and perform DRL (Deep Reinforcement Learning). The top shared memory (10) is connected to the plurality of DRL cores (20) through an on-chip network under the control of the top controller (Top Ctrlr). While the actor, which receives input data and outputs as an action the result of processing it through a DNN (Deep Neural Network), is running, the top shared memory (10) loads the DNN weight values into the DRL cores (20). While the learner, which learns the DNN weight values using a plurality of data pairs each including {current state, intermediate value, action, reward, next state}, is running, the experience data shared by the plurality of DRL cores (20) is stored in the top shared memory (10).

Each of the plurality of DRL cores (20) includes a core controller (21) that controls the DRL core, a compressor (22) that encodes or decodes the plurality of data pairs each including {current state, intermediate value, action, reward, next state}, and a PE array (Processing Element Array) (23) that receives the current state and outputs the action and the next state through DNN inference.

In addition, each of the plurality of DRL cores (20) may further include a broadcast memory (BMEM) (24) (illustrated as 64 KB) and a unicast memory (UMEM) (26) (illustrated as 32 KB), which are connected to the compressor (22) and receive weights (Weight, W) and input features (Input Feature, IF). The broadcast memory (24) and the unicast memory (26) can supply input data to the PE array (23) through a B-buffer (not shown) and a U-buffer (not shown), respectively, and the output of the PE array (23) is input to an accumulation unit and summed. Here, the core controller (21) can automatically fetch the weights and the input features into the broadcast memory (24) or the unicast memory (26) according to the DNN network structure, and can set the configuration of the PE array (23) so as to change the data reuse of weights (W) and input features (IF) between actor and learner operation.

In addition, each of the plurality of DRL cores (20) may further include an activation unit that processes non-linear functions and a 1-D SIMD unit (28) that performs logarithm, addition, and multiplication for weight update and loss calculation.

FIG. 14 is a diagram showing the structure of the compressor proposed by embodiments of the present invention. The compressor (22), which compresses, stores, and loads data collected in real time, may consist of an encoder and a decoder. Because small-valued errors and ΔW make the overall data distribution wide during learning, the input feature (IF) can be represented in bfloat16. The exponent values of the input features stored as experience are highly concentrated in a narrow range. For the learner's 20,480 experiences and for the intermediate layer data the concentration is even stronger: the three most frequent exponent values (the top-3 exponents) account on average for 68% and 85% of the data, respectively. A 2b code word is added for compression and indicates whether the current IF data is compressed. If the exponent of the current node is one of the top-3 exponents, the 8-bit exponent is skipped; if it is not, the exponent data cannot be skipped. Code '00' indicates that the exponent cannot be skipped because the current node does not fall within the top-3 exponents. The other codes '01', '10', and '11' indicate that the exponent of the current node equals the top-1, top-2, and top-3 exponent, respectively, and is skipped.
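The code-word scheme described above can be sketched in software as follows. This is a minimal sketch of the idea and not the hardware encoder: the bfloat16 field split (1 sign bit, 8 exponent bits, 7 mantissa bits) is the standard format, but the tuple layout, the function names, and the choice to always keep the sign bit with the mantissa are assumptions made for illustration.

```python
from collections import Counter


def split_bf16(word: int):
    """Split a 16-bit bfloat16 word into (sign, exponent, mantissa)."""
    return word >> 15, (word >> 7) & 0xFF, word & 0x7F


def compress(words):
    """Encode words as (code, sign, exponent-or-None, mantissa) tuples.

    code 0 ('00') keeps the 8-bit exponent; codes 1..3 ('01','10','11')
    name the top-1..top-3 exponent, which is then omitted.
    """
    counts = Counter(split_bf16(w)[1] for w in words)
    top3 = [exp for exp, _ in counts.most_common(3)]
    encoded = []
    for w in words:
        sign, exp, man = split_bf16(w)
        if exp in top3:
            encoded.append((top3.index(exp) + 1, sign, None, man))
        else:
            encoded.append((0, sign, exp, man))
    return top3, encoded


def decompress(top3, encoded):
    """Rebuild the original 16-bit words from the encoded tuples."""
    words = []
    for code, sign, exp, man in encoded:
        if code:  # exponent was skipped: look it up in the top-3 table
            exp = top3[code - 1]
        words.append((sign << 15) | (exp << 7) | man)
    return words


def compressed_bits(encoded) -> int:
    """2b code + sign + 7b mantissa, plus 8 exponent bits when not skipped."""
    return sum(10 if code else 18 for code, *_ in encoded)


# Made-up sample: three values share exponent 127, the rest are unique.
sample = [0x3F80, 0x3F80, 0x4000, 0xBF80, 0x3E00, 0x0000, 0x4100]
table, enc = compress(sample)
assert decompress(table, enc) == sample
print(compressed_bits(enc), "bits instead of", 16 * len(sample))
```

Note that a value whose exponent misses the top-3 table costs 18 bits rather than 16, so the scheme only pays off when the exponents are as concentrated as the text reports.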

Referring to FIGS. 13 and 14, while the PE array (23) processes DNN operations, the compressor (22) scans the output buffer to find the three most frequently used exponent values. When the output buffer is full, the compressor (22) sequentially encodes the code word, the exponent, and the mantissa. When input features (IF) are transferred to the DRL core, the compressor (22) (the Decompressor in FIG. 14) scans the code words and recombines the sequentially input mantissas and exponents to stream out the input features. Table 1 below shows the compression ratio measured with the DRL agent exemplified above, which can train a bipedal robot walker to walk without falling; according to the measurement, the average compression ratio is 35%.

FIG. 15 is a diagram showing the structure of the transposable PE (transposable Processing Element, tPE) array (23) proposed by embodiments of the present invention, and shows the architecture of a two-dimensional transposable PE array for reinforcement learning processing that performs matrix multiplication. The B₁₁, B₂₁, B₃₁, and B₄₁ data of the PEs in the first row can be multiplied by A₁, and new data is broadcast every cycle to perform parallel matrix operations. Sixteen 4×4 transposable PE arrays are integrated and share a row buffer that receives broadcast data from the BMEM. Each PE array is connected to a U-buffer (27) that supplies unicast data. An FP-FXP MAC, which multiplies 16b bfloat data by a 16b integer, is integrated in each PE. The results of four 4b multiplications and two 8b multiplications can be obtained simultaneously in parallel at no additional cost, so that the weight bit precision can be varied according to the accuracy and performance requirements. Once all unicast data has been reused, new unicast data is input to the PE array (23) through the U-buffer.

During learner operation, the PE array (23) can read, instead of an external input, the stored data pairs each including {current state, intermediate value, action, reward, next state}, generate a loss function, and update the DNN weights by performing a back-propagation operation based on the generated loss function. To this end, the MACs of the PE array (23) are arranged in a two-dimensional array, and each row shares a row buffer that receives broadcast data and multiplies the input values in its MACs. New data is broadcast every cycle to perform parallel matrix operations, and once all the unicast data supplied from the U-buffer has been reused, new unicast data is input through the U-buffer.

FIG. 16 is a diagram for explaining a method of dynamically selecting between W broadcast (WBC) and IF broadcast (IFBC) using the transposable PE array structure. As introduced above, the PE array is configured so that the input direction for the actor and the input direction for the learner differ, thereby forming a transposable PE (transposable Processing Element, tPE) structure that changes the data reuse of input features (Input Feature, IF) and weights (Weight, W) between actor and learner operation.

First, when C_O is larger than BA, reusing the IF is more efficient. In this case W broadcast is adopted, as shown in (A) of FIG. 16. PEs in the same row store IFs of the same input channel from different batches, and PEs in the same column store IFs of different input channels from the same batch. For example, at the start of DNN processing (T=0), PE₀ stores the circle IF (BA=0, C_i=0), PE₁ stores the rectangle IF (BA=1, C_i=0), and PE₄ stores the triangle IF (BA=0, C_i=1). The B-buffer broadcasts four different Ws of four different input channels every cycle, and the results of the PEs in the same column are summed in the accumulation unit. After the W broadcast for the four input channels is complete, the PE array fetches new IFs from the U-buffer and continues DNN processing. That is, the PE array preferably selects W broadcast (WBC) when the row length of the weight matrix (W matrix), which represents the number of output channels (Output Channel, Co), is larger than the column length of the input feature matrix (IF matrix), which represents the batch size (Batch Size, BA).

Second, when BA is larger than C_O, reusing W is more efficient than reusing the IF. In this case, IF broadcast is adopted, as shown in FIG. 16(B). PE₀ and PE₂ store Ws of different output channels, and PE₀ and PE₄ store Ws of the same output channel. That is, the PE array preferably selects IF broadcast (IFBC) when the row length of the weight matrix is smaller than the column length of the input feature matrix.
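The two cases above amount to a comparison between the number of output channels and the batch size. A minimal sketch of that rule is given below; the names `Dataflow` and `select_dataflow` are hypothetical, and the handling of the tie case (C_O equal to BA) is an arbitrary assumption because the description does not address it.

```python
from enum import Enum


class Dataflow(Enum):
    WBC = "W broadcast"    # IFs stay in the PEs, weights are broadcast
    IFBC = "IF broadcast"  # weights stay in the PEs, IFs are broadcast


def select_dataflow(num_output_channels, batch_size):
    """Pick the broadcast mode from the layer dimensions.

    C_O > BA -> W broadcast (IF reuse is more efficient).
    BA > C_O -> IF broadcast (W reuse is more efficient).
    C_O == BA is not covered by the description; IFBC is assumed here.
    """
    if num_output_channels > batch_size:
        return Dataflow.WBC
    return Dataflow.IFBC
```

For example, `select_dataflow(64, 1)` returns `Dataflow.WBC`, while `select_dataflow(16, 256)` returns `Dataflow.IFBC`.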

During actor operation, the batch size is larger than the output channel size, so W broadcast requires fewer external memory accesses than IF broadcast. W broadcast therefore appears more efficient in terms of memory access. However, many PEs cannot receive IF data and must remain idle, so W broadcast has lower core utilization and a lower frame rate than IF broadcast. IF broadcast is therefore more efficient for the actor. During actor and learner operation, the configuration of the transposable PE array can be selected automatically by the core controller.

Table 2 shows measurements of the proposed scheme in a worker simulation of the DRL agent according to embodiments of the present invention. When only IF broadcast (IFBC) is used, less power is consumed than when only W broadcast (WBC) is used, because the actor accounts for most of the DRL processing time. However, the IFBC-only case requires higher memory bandwidth than the WBC-only case, because WBC is more efficient than IFBC in memory access during learner processing. With the transposable PE array, IFBC and WBC are selected adaptively to reduce both power consumption and peak memory bandwidth. The power consumption and peak memory bandwidth of the transposable PE array were also measured with experience compression. The experimental results show that adaptive selection of IFBC and WBC combined with experience compression reduced average power consumption by 31% and peak memory bandwidth by 41%.

Embodiments of the present invention adopt a floating-point system and use a data compression scheme and an adaptive data path to handle the large memory bandwidth and the differing data reuse patterns of DRL processing. As a result, accurate DNN training is possible for real-time, low-power DRL operation. The embodiments propose a DRL accelerator equipped with a transposable PE array and an experience compressor to realize autonomous DRL operation in dynamic environments. In addition, adaptive data reuse for inference and training is possible, and the experimental results show that power and peak memory bandwidth were reduced by 31% and 41%, respectively.

Meanwhile, embodiments of the present invention can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all types of recording devices that store data readable by a computer system.

Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed across computer systems connected through a network, so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present invention can be readily inferred by programmers skilled in the art to which the present invention pertains.

The present invention has been described above with a focus on its various embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention can be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a limiting sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

100: Actor circuit block
200: Learner circuit block
300: DRL accelerator
10: Upper shared memory
20: DRL core
21: Core controller
22: Compressor
23: PE array
24: Broadcast memory (BMEM)
25: B-buffer
26: Unicast memory (UMEM)
27: U-buffer
28: SIMD unit

Claims

delete

A deep reinforcement learning (DRL) accelerator comprising an actor circuit block that receives input data and outputs, as an action, a result of processing through a deep neural network (DNN); and
a learner circuit block that learns weight values of the DNN using a plurality of data pairs including {current state, intermediate value, action, reward, next state},
wherein, in the actor circuit block, MACs each combining a multiplier and an accumulator form a two-dimensional processing element (PE) array, and a plurality of PE arrays are integrated internally to receive a current state from outside and output the action and the next state through DNN inference,
wherein, when DNNs for the actor and the learner and other DNNs are mapped to the PE array, the available PEs are spatially partitioned and the DNNs for the actor and the learner and the other DNNs are allocated to the respective partitions and processed in parallel, and
wherein the DRL accelerator further comprises control logic (a controller) that detects a change in the amount of computation or the memory bandwidth required for DNN inference and changes the number of PEs allocated to each of the DNNs according to the detection result.

A deep reinforcement learning (DRL) accelerator comprising an actor circuit block that receives input data and outputs, as an action, a result of processing through a deep neural network (DNN); and
a learner circuit block that learns weight values of the DNN using a plurality of data pairs including {current state, intermediate value, action, reward, next state},
wherein, in the actor circuit block, MACs each combining a multiplier and an accumulator form a two-dimensional processing element (PE) array, and a plurality of PE arrays are integrated internally to receive a current state from outside and output the action and the next state through DNN inference, and
wherein, when DNNs for the actor and the learner and other DNNs are mapped to the PE array, the DNNs for the actor and the learner and the other DNNs are divided in time, and the available PEs are sequentially allocated to them for processing.

A deep reinforcement learning (DRL) accelerator comprising an actor circuit block that receives input data and outputs, as an action, a result of processing through a deep neural network (DNN); and
a learner circuit block that learns weight values of the DNN using a plurality of data pairs including {current state, intermediate value, action, reward, next state},
wherein, in the actor circuit block, MACs each combining a multiplier and an accumulator form a two-dimensional processing element (PE) array, and a plurality of PE arrays are integrated internally to receive a current state from outside and output the action and the next state through DNN inference,
wherein, when DNNs for the actor and the learner and other DNNs are mapped to the PE array, the available PEs are spatially partitioned and the DNNs for the actor and the learner and the other DNNs are allocated to the respective partitions and processed in parallel, and
wherein, when the DNNs for the actor and the learner and the other DNNs are mapped to the PE array, the available PEs are mapped to a plurality of DNNs by space-time division for parallel processing, and the PEs allocated to a DNN that finished first are sequentially allocated to other DNNs that are waiting or in progress for processing.

The DRL accelerator according to any one of claims 6 and 7,
further comprising control logic (a controller) that, when the operation of one DNN is finished, stores the current weight values, input values, and output values in external memory, and reads weight values, input values, and output values from internal or external memory so that another waiting DNN can use the PEs allocated to the DNN that finished earlier.

delete

A deep reinforcement learning (DRL) accelerator comprising an actor circuit block that receives input data and outputs, as an action, a result of processing through a deep neural network (DNN); and
a learner circuit block that learns weight values of the DNN using a plurality of data pairs including {current state, intermediate value, action, reward, next state},
wherein, in the actor circuit block, MACs each combining a multiplier and an accumulator form a two-dimensional processing element (PE) array, and a plurality of PE arrays are integrated internally to receive a current state from outside and output the action and the next state through DNN inference,
wherein, when DNNs for the actor and the learner and other DNNs are mapped to the PE array, the available PEs are spatially partitioned and the DNNs for the actor and the learner and the other DNNs are allocated to the respective partitions and processed in parallel, and
wherein the PE array
is configured so that the input direction for the actor and the input direction for the learner differ from each other, thereby forming a transposable processing element (tPE) structure that changes the data reuse of input features (IF) and weights (W) during actor or learner operation, respectively.

A deep reinforcement learning (DRL) accelerator comprising a plurality of DRL cores that perform DRL; and
an upper shared memory connected to the plurality of DRL cores through an on-chip network under the control of an upper controller,
wherein, while an actor that receives input data and outputs, as an action, a result of processing through a deep neural network (DNN) is running, the upper shared memory loads the weight values of the DNN into the DRL cores, and
wherein, while a learner that learns the weight values of the DNN using a plurality of data pairs including {current state, intermediate value, action, reward, next state} is running, experience data shared by the plurality of DRL cores is stored in the upper shared memory.

The DRL accelerator according to claim 11,
wherein each of the plurality of DRL cores includes:
a core controller that controls the DRL core;
a compressor that encodes or decodes the plurality of data pairs including {current state, intermediate value, action, reward, next state}; and
a processing element (PE) array that receives a current state and outputs an action and a next state through DNN inference.

The DRL accelerator according to claim 12,
wherein each of the plurality of DRL cores
further includes a broadcast memory (BMEM) and a unicast memory (UMEM) that are connected to the compressor and receive weights (W) and input features (IF),
and wherein the broadcast memory and the unicast memory provide input data to the PE array through a B-buffer and a U-buffer, respectively.

The DRL accelerator according to claim 13,
wherein the core controller
automatically fetches the weights and the input features into the broadcast memory or the unicast memory according to the DNN network structure, and
sets the configuration of the PE array so as to change the data reuse of the weights and the input features during actor or learner operation, respectively.

The DRL accelerator according to claim 12,
wherein each of the plurality of DRL cores further includes:
an activation unit that processes nonlinear functions; and
a 1-D SIMD unit that performs log functions, addition, and multiplication for weight update and loss calculation.

The DRL accelerator according to claim 12,
wherein, while the PE array processes DNN operations, the compressor scans the output buffer,
when the output buffer is full, the compressor sequentially encodes a code word, an exponent, and a mantissa, and
when an input feature is transmitted to the DRL core, the compressor scans the code word and recombines the sequentially input mantissa and exponent to stream out the input feature.

The DRL accelerator according to claim 12,
wherein the PE array,
during learner operation, generates a loss function by reading a plurality of stored data pairs including {current state, intermediate value, action, reward, next state} instead of an external input, and updates the weights of the DNN by performing a back-propagation operation based on the generated loss function.

The DRL accelerator according to claim 12,
wherein, in the PE array,
MACs are arranged in a two-dimensional array,
each row shares a row buffer that receives broadcast data, which is multiplied by an input value in the MACs, and new data is broadcast every cycle to perform parallel matrix operations, and
when all unicast data provided from the U-buffer has been reused, new unicast data is input through the U-buffer.

The DRL accelerator according to claim 12,
wherein the PE array
is configured so that the input direction for the actor and the input direction for the learner differ from each other, thereby forming a transposable processing element (tPE) structure that changes the data reuse of input features (IF) and weights (W) during actor or learner operation, respectively.

The DRL accelerator according to claim 19,
wherein the PE array
selects W broadcast (WBC) when the row length of the weight matrix (W matrix), which represents the number of output channels (C_O), is larger than the column length of the input feature matrix (IF matrix), which represents the batch size (BA), and
selects IF broadcast (IFBC) when the row length of the weight matrix is smaller than the column length of the input feature matrix.

The DRL accelerator according to any one of claims 5, 6, 7, and 10,
wherein, in the actor circuit block,
during learner operation, the plurality of PE arrays generate a loss function by reading a plurality of stored data pairs including {current state, intermediate value, action, reward, next state} instead of an external input, and update the weights of the DNN by performing a back-propagation operation based on the generated loss function.

The DRL accelerator according to any one of claims 5, 6, 7, and 10,
further comprising a compressor that encodes or decodes the plurality of data pairs including {current state, intermediate value, action, reward, next state}.