개발 피드 목록

블로그 첫 글에서 비트윈의 시스템 아키텍처에 대해 다룬 적이 있습니다. 시스템 구성의 미래에 대한 계획으로 멀티티어 아키텍처에 대해 언급했었는데, 이는 프로토콜을 단순화시키고 배포 자동화를 가능하게 하기 위해서 클라이언트와 비즈니스 로직을 담당하는 서버 사이에 일종의 게이트웨이를 두는 것이었습니다. 그 외에도 여러 가지 필요성이 생겨 해당 역할을 담당하는 프레젠터라는 것을 만들게 되었고 비트윈의 채팅 시스템에 적용하게 되었습니다. 만드는 과정 중에 여러 기술적인 문제들이 있었고 이를 해결하기 위한 노력을 하였습니다. 이 글에서는 비트윈 시스템에서의 프레젠터에 대해 이야기 하고자 합니다.프레젠터프레젠터는 일종의 게이트웨이 입니다. 기존의 시스템에서는 클라이언트들이 ELB를 통해 채팅 서버에 직접 TCP 연결을 하였습니다. 하지만 비트윈 PC버전과 자체 푸시 서버를 만들면서 ELB로는 해결할 수 없는 부족한 점들이 생겼고, ELB의 부족한 점을 채워줄 수 있는 시스템이 필요하게 되었습니다. ELB를 대체하는 역할 외에도 다른 여러 필요했던 기능들을 제공하는 프레젠터를 만들기로 하였습니다.프레젠터는 ELB의 역할을 할 뿐만 아니라 여러 다른 기능들도 제공합니다.프레젠터의 기능패킷을 적절한 샤드로 중계비트윈에서는 커플 단위로 샤딩하여 같은 커플의 채팅 요청에 대해서는 같은 채팅 서버에서 처리하고 있습니다. Consistent Hash를 통해 커플을 여러 채팅 서버로 샤딩하고 ZooKeeper를 이용하여 이 정보를 여러 채팅 서버 간 공유합니다. 프레젠터 또한 ZooKeeper와 연결을 하여 어떤 채팅 서버가 어떤 커플을 담당하는지에 대한 정보를 알고 있도록 설계되어 있습니다. 따라서 프레젠터는 첫 연결 시 보내는 인증 패킷을 보고 해당 채팅 연결에서 오는 요청들을 어떤 채팅 서버로 보내야 할지 판단할 수 있습니다. 어떤 채팅 서버로 보낼지 판단하는 과정은 처음 한 번만 일어나며, 이후 패킷부터는 자동으로 해당 채팅 서버로 중계합니다.프레젠터의 이런 기능 덕분에 클라이언트는 더 이상 어떤 채팅 서버로 붙어야 하는지 알아내는 과정 없이 아무 프레젠터와 연결만 맺으면 채팅을 할 수 있게 되었습니다. 기존에는 클라이언트들이 여러 채팅 서버 중 어떤 서버에 붙어야 하는지 확인하는 작업을 한 후에 할당된 채팅 서버로 연결 맺어야 했습니다. 그래서 클라이언트가 채팅 서버와 연결을 맺기 위해 다소 복잡한 과정을 거쳐야 했지만, 이제는 클라이언트가 프레젠터의 주소로 연결 요청만 하면 DNS Round Robin 통해 아무 프레젠터와 연결하는 방식으로 프로토콜을 단순화할 수 있었습니다. 덕분에 새로운 채팅 서버를 띄울 때마다 ELB를 Warm-Up 시켜야 했던 기존 시스템의 문제가 없어졌습니다. 그래서 비트윈 개발팀의 오랜 염원이었던 채팅 서버 오토스케일의 가능성도 열렸습니다.많은 수의 연결을 안정적으로 유지PC버전과 푸시 서버를 만들면서 기존의 채팅 연결과 다르게 많은 수의 연결이 장시간 동안 유지 되는 경우를 처리할 수 있어야 했습니다. 기존에는 TCP 릴레이를 하도록 설정된 ELB가 연결들을 받아주었습니다. 한 머신당 6만 개 정도의 Outbound TCP 연결을 맺을 수 있는데, ELB도 트래픽에 따라 여러 대의 머신에서 돌아가는 일종의 프로그램이므로 이 제한에 걸린다고 생각할 수 있습니다. 따라서 많은 수의 연결을 맺어놓고 있어야 하는 경우 ELB에 문제가 생길 수 있다고 판단했습니다. (과거 ELB가 연결 개수가 많아지는 경우 스케일아웃이 안되는 버그 때문에 문제가 된 적이 있기도 했습니다) 또한 클라이어트 연결당 내부 연결도 하나씩 생겨야 하면 클라이언트가 연결을 끊거나 맺을 때마다 서버 내부 연결도 매번 끊거나 연결해야 하는 오버헤드가 발생합니다.이를 해결하기 위해 프레젠터에서는 TCP 연결을 Multiplexing하는 프로토콜을 구현하여 적은 수의 내부 연결로 많은 수의 클라이언트 연결을 처리할 수 있도록 하였습니다. 서버 내부에서는 고정된 개수의 몇 개의 연결만 맺어 놓고 이 연결들만으로 수많은 클라이언트 연결을 처리할 수 있습니다. 이처럼 TCP Multiplexing을 하는 것은 Finagle과 같은 다른 RPC 프로젝트에서도 지원하는 기능입니다.TCP Multiplexing 프로토콜을 통해 많은 수의 클라이언트 연결을 소수의 서버 내부 연결로 처리합니다.또한, 프레젠터는 많은 수의 SSL 연결을 처리해야 하므로 암복호화 로직을 실행하는데 퍼포먼스가 매우 중요하게 됩니다. 채팅 서버 한 대를 제거하거나 하는 경우 많은 연결이 한꺼번에 끊어지고 연이어 한꺼번에 연결을 시도하게 되는 경우가 있을 수 있는데, 이 때 대량의 SSL Handshaking을 하게 됩니다. 기존 서버들로 대량의 SSL Handshaking을 빠른 시간안에 처리하기 위해서는 높은 퍼포먼스가 필요합니다. Java로 작성된 프로그램만으로 이런 퍼포먼스 요구사항을 달성하기 어려우므로, 클라이언트와의 연결을 담당하는 부분은 OpenSSL, libevent를 이용한 C++로 코드로 작성하였습니다. 인증 패킷을 파싱하거나 패킷들을 릴레이 하는 등의 로직을 담당하는 부분은 Alfred라는 Netty를 이용하여 만든 인하우스 RPC 라이브러리를 이용해 작성되었습니다. 연결을 담당하는 부분은 TCP 연결을 유지하는 역할과 들어온 패킷들을 Netty로 작성된 모듈로 릴레이 하는 역할만 담당하므로 매우 간단한 형태의 프로그램입니다. 짧은 시간 안에 어럽지 않게 구현할 수 있었습니다.클라이언트의 연결을 받아주는 역할을 하는 부분은 C++, 실제 로직이 필요한 부분은 Java로 작성하였습니다.여러 네트워크 최적화 기술의 지원ELB에는 여러 네트워크 최적화 기술들을 아직 제공하지 않는 경우가 있습니다. 대표적으로 HTTP/2 혹은 SPDY, QUIC, TCP Fast Open 등이 있습니다. 특히 모바일 환경에서는 SSL Handshaking 등 부가적인 RTT로 인한 지연을 무시할 수 없으므로 이런 기술들을 이용한 초기 연결 시간 최적화는 서비스 퀄리티에 중요한 부분 중 하나입니다. ELB는 AWS에서 관리하는 서비스이므로 AWS에서 이런 기능들을 ELB에 적용하기 전에는 이용할 수 없지만, 프레젠터는 직접 운영하는 서버이므로 필요한 기능을 바로바로 적용하여 서비스 품질을 높일 수 있습니다. ELB에서 이미 제공하는 최적화 기술인 SSL Session Ticket이나 다른 몇몇 기술은 이미 적용되어 있고 아직 적용하지 않은 기술들도 필요에 따라 차차 적용할 예정입니다.프레젠터의 구현C++ 연결 유지 모듈프레젠터는 퍼포먼스를 위해 C++로 작성되었습니다. 이는 Pure Java를 이용한 암복호화는 프레젠터에서 원하는 정도의 퍼포먼스를 낼 수 없기 때문입니다. 처음에는 OpenSSL과 libevent를 이용해 작성된 코드를 JNI를 통해 Netty 인터페이스에 붙인 event4j라는 인하우스 라이브러리를 이용하려고 했으나, 코드가 복잡하고 유지보수가 어렵다는 점 때문에 포기하였습니다. 그 후에는 netty-tcnative를 이용해보고자 했으나 테스트 결과 연결당 메모리 사용량이 큰 문제가 있었고, 이를 수정하기에는 시간이 오래 걸릴 것 같아 포기하였습니다. 결국, 페이스북에서 오픈소스로 공개한 C++ 라이브러리인 folly를 활용하여 프레젠터를 작성하게 되었습니다. folly의 네트워크 API들이 OpenSSL과 libevent를 이용해 구현되어 있습니다.릴레이 로직프레젠터는 첫 인증 패킷을 파싱하여 릴레이할 채팅 서버를 판단하며, 이후의 패킷부터는 실제 패킷을 까보지 않고 단순 릴레이 하도록 설계하였습니다. 처음의 Netty 파이프라인에는 Alfred 프로토콜을 처리할 수 있는 핸들러들이 설정되어 있어 인증 패킷을 파싱 할 수 있으며 인증 패킷에 있는 정보를 바탕으로 어떤 채팅 서버로 패킷을 릴레이 할지 결정합니다. 그 이후 파이프라인에 있던 핸들러를 모두 제거 한 후, 읽은 byte 스트림을 Multiplexing Protocol 프레임으로 감싸서 그대로 릴레이 하는 매우 간단한 로직을 담당하는 핸들러 하나를 추가합니다. 덕분에 로직 부분의 구현도 매우 간단해질 수 있었으며, 채팅 서버에 API가 추가되거나 변경되어도 프레젠터는 업데이트할 필요가 없다는 운영상 이점도 있었습니다.Multiplexing Protocol프레젠터의 Multiplexing Protocol은 Thrift를 이용하여 직접 정의 하였으며, 비트윈 개발팀 내부적으로 사용 중인 RPC 라이브러리인 Alfred에 이 프로토콜을 구현하였습니다. Thrift를 통해 C++과 Java로 컴파일된 소스코드를 각각 프레젠터의 연결 처리 부분과 로직 처리 부분에서 이용하여 통신합니다. 프레젠터에서는 Multiplexing된 TCP 연결들을 Stream이라고 명명하였으며 이는 SPDY나 HTTP/2에서의 호칭 방법과 유사합니다. SPDY나 HTTP/2도 일종의 Multiplexing 기능을 제공하고 있으며, 프레젠터의 Multiplexing Protocol도 SPDY 프레임을 많이 참고하여 작성되었습니다.수 많은 클라이언트와의 TCP연결을 Stream으로 만들어 하나의 내부 TCP연결을 통해 처리합니다.Alfred에서는 Multiplexing 된 TCP 연결을 Netty의 Channel 인터페이스로 추상화하였습니다. Netty에서 TCP 연결 하나는 Channel 하나로 만들어지는데, 실제 Stream도 Channel 인터페이스로 데이터를 읽거나 쓸 수 있도록 하였습니다. 이 추상화 덕분에 비트윈 비즈니스 로직을 담당하는 코드에서는 Stream으로 Multiplexing 된 TCP 연결을 마치 기존의 TCP 연결과 똑같이 Channel을 이용해 사용할 수 있었습니다. 그래서 실제 비즈니스 로직 코드는 전혀 건드리지 않고 프레젠터를 쉽게 붙일 수 있었습니다.로드 밸런싱클라이언트는 Route53에서 제공하는 DNS Round Robin 기능을 이용하여 아무 프레젠터에 연결하여 채팅 요청을 날리게 됩니다. 하지만 무조건 동등하게 Round Robin 하게 되면 새로 켜지거나 하여 연결을 거의 맺지 않고 놀고 있는 프레젠터가 있는데도 연결을 많이 맺고 있는 기존 프레젠터에에 연결이 할당되는 문제가 생길 수 있습니다. 충분한 시간이 흐르면 결국에는 연결 개수는 동등하게 되겠지만, 처음부터 놀고 있는 프레젠터에 새로운 연결을 가중치를 주어 할당하면 로드를 분산되는 데 큰 도움이 될 것입니다. 그래서 Route53의 Weighted Routing Policy 기능을 이용하기로 하였습니다. 현재 연결 개수와 CPU 사용량 등을 종합적으로 고려하여 Weight를 결정하고 이를 주기적으로 Route53의 레코드에 업데이트합니다. 이런 방법으로 현재 로드가 많이 걸리는 서버로는 적은 수의 새로운 연결을 맺게 하고 자원이 많이 남는 프레젠터로 더 많은 새로운 연결이 맺어지도록 하고 있습니다.스케일 인/아웃AWS에서는 트래픽에 따라 서버 개수를 늘리기도 하고 줄이기도 하는 AutoScaling 이라는 기능이 있습니다. 프레젠터가 스케일 아웃될때에는 프레젠터가 스스로 Route53에 레코드를 추가하는 식으로 새로운 연결을 맺도록 할 수 있습니다. 하지만 스케일 인으로 프레젠터가 제거될 때에는 Route53에서 레코드를 삭제하더라도 함부로 프레젠터 서버를 종료시킬 수 없습니다. 종종 클라이언트의 DNS 캐싱 로직에 문제가 있어, Route53에서 레코드를 삭제되었는데도 불구하고 이를 업데이트하지 못해 기존 프레젠터로 연결을 시도하는 경우가 있을 수 있기 때문입니다. 따라서 프레젠터 클러스터가 스케일 인 될 때에는 기존의 모든 연결이 끊어지고 충분한 시간 동안 새로운 연결이 생기지 않은 경우에만 서버를 종료시켜야 합니다. AutoScaling Group의 LifeCycleHook을 이용하여 위와 같은 조건을 만족 시켰을 때에만 프레젠터 서버를 완전히 종료시키도록 하였습니다.못다 한 이야기프레젠터라는 이름이 이상하다고 생각하시는 분들이 있을 것으로 생각합니다. 멀티티어 아키텍처를 이야기할 때 프레젠테이션 티어, 어플리케이션 티어, 데이터베이스 티어로 구분하곤 하는데 이 프레젠테이션 티어에서 나온 이름입니다. 지금은 실제 프레젠터가 하는 역할과 프레젠테이션 티어가 보통 맡게 되는 역할에는 많은 차이가 있지만, 어쩌다 보니 이름은 그대로 가져가게 되었습니다.프레젠터에서 AutoScaling을 하기 위해 LifeCycleHook을 이용합니다. 이때 프레젠터를 위해 LifeCycleHook 이벤트를 처리하는 프로그램을 직접 짠 것이 아니라 비트윈 개발팀이 내부적으로 만든 Kharon이라는 프로그램을 이용하였습니다. Kharon은 인스턴스가 시작되거나 종료될 때 실행할 스크립트를 작성하고 인스턴스의 특정 위치에 놓는 것만으로 LifeCycleHook을 쉽게 이용할 수 있게 하는 프로그램입니다. Kharon 덕분에 비트윈 내 다양한 시스템에서 별다른 추가 개발 없이 LifeCycleHook을 쉽게 활용하고 있습니다. 후에 Kharon에 대해 자세히 다뤄보도록 하겠습니다.정리비트윈 개발팀에서는 오랫동안 유지되는 수많은 채팅 서버 연결들을 처리하고 클라이언트와 서버 간 프로토콜을 단순화시키는 등 여러 이점을 얻고자 ELB의 역할을 대신하는 프레젠터를 만들었습니다. 프레젠터를 만드는 과정에서 여러 기술적 문제가 있었습니다. 이를 해결하기 위해 C++로 연결 유지 모듈을 따로 작성하였고 Multiplexing Protocol을 따로 정의하였으며 그 외 여러 가지 기술적인 결정들을 하였습니다. 이런 과정에서 시행착오들이 있었지만 이를 발판 삼아 더 좋은 기술적 결정을 내리기 위해 고민하여 결국 기존 시스템에 쉽게 적용할 수 있고 쉽게 동작하는 프레젠터를 만들어 이용하고 있습니다.

PrefaceDeep learning has led to a series of breakthroughs in many areas. However, successful deep learning models often require significant amounts of computational resources, memory and power. Deploying efficient deep learning models on mobile devices became the main technological challenge for many mobile tech companies.Hyperconnect developed a mobile app named Azar which has a large fan base all over the world. Recently, Machine Learning Team has been focusing on developing mobile deep learning technologies which can boost user experience in Azar app. Below, you can see a demo video of our image segmentation technology (HyperCut) running on Samsung Galaxy J7. Our benchmark target is a real-time (>= 30 fps) inference on Galaxy J7 (Exynos 7580 CPU, 1.5 GHz) using only a single core.Figure 1. Our network runs fast on mobile devices, achieving 6 ms of single inference on Pixel 1 and 28 ms on Samsung Galaxy J7. Full length video. There are several approaches to achieve fast inference speed on mobile device. 8-bit quantization is one of the popular approaches that meet our speed-accuracy requirement. We use TensorFlow Lite as our main framework for mobile inference. TensorFlow Lite supports SIMD optimized operations for 8-bit quantized weights and activations. However, TensorFlow Lite is still in pre-alpha (developer preview) stage and lacks many features. In order to achive our goal, we had to do the following:Understand details of TensorFlow and Tensorflow Lite implementation.Design our own neural network that can fully utilize optimized kernels of TensorFlow Lite. (Refer to 1, 2 and 3)Modify TOCO: TensorFlow Lite Optimizing Converter to correctly convert unsupported layers. (Refer to 4)Accelerate several quantized-layers with SIMD optimized code. (Refer to 5 and 6)We are willing to share our trials and errors in this post and we hope that readers will enjoy mobile deep learning world :)1) Why is depthwise separable layer fast in Tensorflow Lite ?Implementing low-level programs requires a bit different ideas and approaches than usual. We should be aware that especially on mobile devices using cache memory is important for fast inference.Figure 2. Memory access requires much more energy (640 pJ) than addition or multiplication.Accessing cache memory (8 pJ) is much cheaper than using the main memory (table courtesy of Ardavan Pedram) In order to get insight into how intrinsics instructions are implemented in Tensorflow Lite, we had to analyze some implementations including depthwise separable convolution with 3x3 kernelsBelow we describe some of the main optimization techniques that are used for building lightweight and faster programs.Loop UnrollingCan you spot the difference between the following two code fragments?for (int i = 0; i < 32; i++) { x[i] = 1; if (i%4 == 3) x[i] = 3; } for (int i = 0; i < 8; i++) { x[4*i ] = 1; x[4*i+1] = 1; x[4*i+2] = 1; x[4*i+3] = 3; } The former way is what we usually write, and the latter is loop-unrolled version of former one. Even though unrolling loops are discouraged from the perspective of software design and development due to severe redundancy, with low-level architecture this kind of unrolling has non-negligible benefits. In the example described above, the unrolled version avoids examining 24 conditional statements in for loop, along with neglecting 32 conditional statements of if.Furthermore, with careful implementation, these advantages can be magnified with the aid of SIMD architecture. Nowadays some compilers have options which automatically unroll some repetitive statements, yet they are unable to deal with complex loops.Separate implementation for each caseConvolution layer can take several parameters. For example, in depthwise separable layer, we can have many combinations with different parameters (depth_multiplier x stride x rate x kernel_size). Rather than writing single program available to deal with every case, in low-level architectures, writing number of case-specific implementations is preferred. The main rationale is that we need to fully utilize the special properties for each case. For convolution operation, naive implementation with several for loops can deal with arbitrary kernel size and strides, however this kind of implementation might be slow. Instead, one can concentrate on small set of actually used cases (e.g. 1x1 convolution with stride 1, 3x3 convolution with stride 2 and others) and fully consider the structure of every subproblem.For example, in TensorFlow Lite there is a kernel-optimized implementation of depthwise convolution, targeted at 3x3 kernel size:template <int kFixedOutputY, int kFixedOutputX, int kFixedStrideWidth, int kFixedStrideHeight> struct ConvKernel3x3FilterDepth8 {}; Tensorflow Lite further specifies following 16 cases with different strides, width and height of outputs for its internal implementation:template <> struct ConvKernel3x3FilterDepth8<8, 8, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<4, 4, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<4, 2, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<4, 1, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<2, 2, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<2, 4, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<1, 4, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<2, 1, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<4, 2, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<4, 4, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<4, 1, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<2, 2, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<2, 4, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<2, 1, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<1, 2, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<1, 4, 2, 2> { ... } Smart Memory Access PatternSince low-level programs are executed many times in repetitive fashion, minimizing duplicated memory access for both input and output is necessary. If we use SIMD architecture, we can load nearby elements together at once (Data Parallelism) and in order to reduce duplicated read memory access, we can traverse the input array by means of a snake-path.Figure 3. Memory access pattern for 8x8 output and unit stride, implemented in Tensorflow Lite's depthwise 3x3 convolution. The next example which is targeted to be used in much smaller 4x1 output block also demonstrates how to reuse preloaded variables efficiently. Note that the stored location does not change for variables which are loaded in previous stage (in the following figure, bold variables are reused):Figure 4. Memory access pattern for 4x1 output and stride 2, implemented in Tensorflow Lite's depthwise 3x3 convolution. 2) Be aware of using atrous depthwise convolutionAtrous (dilated) convolution is known to be useful to achieve better result for image segmentation[1]. We also decided to use atrous depthwise convolution in our network. One day, we tried to set stride for atrous depthwise convolution to make it accelerate computation, however we failed, because the layer usage in TensorFlow (≤ 1.8) is constrained.In Tensorflow documentation of tf.nn.depthwise_conv2d (slim.depthwise_conv2d also wraps this function), you can find this explanation of rate parameter.rate: 1-D of size 2. The dilation rate in which we sample input values across the height and width dimensions in atrous convolution. If it is greater than 1, then all values of strides must be 1.Even though TensorFlow doesn’t support strided atrous function, it doesn’t raise any error if you set rate > 1 and stride > 1. >>> import tensorflow as tf >>> tf.enable_eager_execution() >>> input_tensor = tf.constant(list(range(64)), shape=[1, 8, 8, 1], dtype=tf.float32) >>> filter_tensor = tf.constant(list(range(1, 10)), shape=[3, 3, 1, 1], dtype=tf.float32) >>> print(tf.nn.depthwise_conv2d(input_tensor, filter_tensor, strides=[1, 2, 2, 1], padding="VALID", rate=[2, 2])) tf.Tensor( [[[[ 302.] [ 330.] [ 548.] [ 587.]] [[ 526.] [ 554.] [ 860.] [ 899.]] [[1284.] [1317.] [1920.] [1965.]] [[1548.] [1581.] [2280.] [2325.]]]], shape=(1, 4, 4, 1), dtype=float32) >>> 0 * 5 + 2 * 6 + 16 * 8 + 9 * 18 # The value in (0, 0) is correct 302 >>> 0 * 4 + 2 * 5 + 4 * 6 + 16 * 7 + 18 * 8 + 20 * 9 # But, the value in (0, 1) is wrong! 470 Let’s find the reason why this difference happened. If we just put tf.space_to_batch and tf.batch_to_space before and after convolution layer, then we can use convolution operation for atrous convolution (profit!). On the other hand, it isn’t straightforward how to handle stride and dilation together. In TensorFlow, we need to be aware of this problem in depthwise convolution.3) Architecture design principles for efficient segmentation networkUsually segmentation takes more time than classification since it has to upsample high spatial resolution map. Therefore, it is important to reduce inference time as much as possible to make the application run in real-time.It is important to focus on spatial resolution when designing real-time application. One of the easiest ways is to reduce the size of input images without losing accuracy. Assuming that the network is fully convolutional, you can accelerate the model about four times faster if the size of input is halved. In literature[2], it is known that small size of input images doesn’t hurt accuracy a lot.Another simple strategy to adopt is early downsampling when stacking convolution layers. Even with the same number of convolution layers, you can reduce the time with strided convolution or pooling within early layers.There is also a tip for selecting the size of input image when you use Tensorflow Lite quantized model. The optimized implementations of convolution run best when the width and height of image is multiple of 8. Tensorflow Lite first loads multiples of 8, then multiples of 4, 2 and 1 respectively. Therefore, it is the best to keep the size of every input of layer as a multiple of 8.Substituting multiple operations into single operation can improve speed a bit. For example, convolution followed by max pooling can be usually replaced by strided convolution. Transpose convolution can also be replaced by resizing followed by convolution. In general, these operations are substitutable because they perform the same role in the network. There are no big empirical differences among these operations. Tips described above help to accelerate inference speed but they can also hurt accuracy. Therefore, we adopted some state-of-the-art blocks rather than using naive convolution blocks.Figure 5. Atrous spatial pyramid pooling (figure courtesy of Liang-Chieh Chen) Atrous spatial pyramid pooling[1] is a block which mimics the pyramid pooling operation with atrous convolution. DeepLab uses this block in the last layer.We also substitute most of the convolution layers with efficient depthwise separable convolution layers. They are basic building blocks for MobileNetV1[3] and MobileNetV2[4] which are well optimized in Tensorflow Lite.4) Folding batchnorm into atrous depthwise convolutionWhen quantizing convolution operation followed by batchnorm, batchnorm layer must be folded into the convolution layers to reduce computation cost. After folding, the batchnorm is reduced to folded weights and folded biases and the batchnorm-folded convolution will be computed in one convolution layer in TensorFlow Lite[5]. Batchnorm gets automatically folded using tf.contrib.quantize if the batchnorm layer comes right after the convolution layer. However, folding batchnorm into atrous depthwise convolution is not easy.In TensorFlow’s slim.separable_convolution2d, atrous depthwise convolution is implemented by adding SpaceToBatchND and BatchToSpaceND operations to normal depthwise convolution as mentioned previously. If you add a batchnorm to this operation by including argument normalizer_fn=slim.batch_norm, batchnorm does not get attached directly to the convolution layer. Instead, the graph will look like the diagram below: SpaceToBatchND → DepthwiseConv2dNative → BatchToSpaceND → BatchNorm The first thing we tried was to modify TensorFlow quantization to fold batchnorm bypassing BatchToSpaceND without changing the order of operations. With this approach, the folded bias term remained after BatchToSpaceND, away from the convolution layer. Then, it became separate BroadcastAdd operation in TensorFlow Lite model rather than fused into convolution. Surprisingly, it turned out that BroadcastAdd was much slower than the corresponding convolution operation in our experiment:Timing log from the experiment on Galaxy S8 ... [DepthwiseConv] elapsed time: 34us [BroadcastAdd] elapsed time: 107us ... To remove BroadcastAdd layer, we decided to change the network itself instead of fixing TensorFlow quantization. Within slim.separable_convolution2d layer, we swapped positions of BatchNorm and BatchToSpaceND. SpaceToBatchND → DepthwiseConv2dNative → BatchNorm → BatchToSpaceND Even though batchnorm relocation could lead to different outputs values compared to the original, we did not notice any degradation in segmentation quality.5) SIMD-optimized implementation for quantized resize bilinear layerAt the time of accelerating TensorFlow Lite framework, conv2d_transpose layer was not supported. However, we could use ResizeBilinear layer for upsampling as well. The only problem of this layer is that there is no quantized implementation, therefore we implemented our own SIMD quantized version of 2x2 upsampling ResizeBilinear layer.Figure 6. 2x2 bilinear upsampling without corner alignnment. To upsample image, original image pixels (red circles) are interlayed with new interpolated image pixels (grey circles). In order to simplify implementation we do not compute pixel values for the bottommost and rightmost pixels, denoted as green circles.for (int b = 0; b < batches; b++) { for (int y0 = 0, y = 0; y <= output_height - 2; y += 2, y0++) { for (int x0 = 0, x = 0; x <= output_width - 2; x += 2, x0++) { int32 x1 = std::min(x0 + 1, input_width - 1); int32 y1 = std::min(y0 + 1, input_height - 1); ResizeBilinearKernel2x2(x0, x1, y0, y1, x, y, depth, b, input_data, input_dims, output_data, output_dims); } } } Every new pixel value is computed for each batch separately. Our core function ResizeBilinearKernel2x2 computes 4 pixel values across multiple channels at once.Figure 7. Example of 2x2 bilinear upsampling of top left corner of image. (a) Original pixel values are simply reused and (b) – (d) used to interpolate new pixel values. Red circles represent original pixel values. Blue circles are new interpolated pixel values computed from pixel values denoted as circles with black circumference. NEON (Advanced SIMD) intrinsics enable us to process multiple data at once with a single instruction, in other words processing multiple data at once. Since we deal with uint8 input values we can store our data in one of the following formats uint8x16_t, uint8x8_t and uint8_t, that can hold 16, 8 and 1 uint8 values respectively. This representation allows to interpolate pixel values across multiple channels at once. Network architecture is highly rewarded when channels of feature maps are multiples of 16 or 8:// Handle 16 input channels at once int step = 16; for (int ic16 = ic; ic16 <= depth - step; ic16 += step) { ... ic += step; } // Handle 8 input channels at a once step = 8; for (int ic8 = ic; ic8 <= depth - step; ic8 += step) { ... ic += step; } // Handle one input channel at once for (int ic1 = ic; ic1 < depth; ic1++) { ... } SIMD implementation of quantized bilinear upsampling is straightforward. Top left pixel value is reused (Fig. 7a). Bottom left (Fig. 7b) and top right (Fig. 7c) pixel values are mean of two adjacent original pixel values. Finally, botom right pixel (Fig. 7d) is mean of 4 diagonally adjacent original pixel values.The only issue that we have to take care of is 8-bit integer overflow. Without a solid knowledge of NEON intrinsics we could go down the rabbit hole of taking care of overflowing by ourself. Fortunately, the range of NEON intrinsics is broad and we can utilize those intrinsics that fit our needs. The snippet of code below (using vrhaddq_u8) shows an interpolation (Fig. 7d) of 16 pixel values at once for bottom right pixel value:// Bottom right output_ptr += output_x_offset; uint8x16_t left_interpolation = vrhaddq_u8(x0y0, x0y1); uint8x16_t right_interpolation = vrhaddq_u8(x1y0, x1y1); uint8x16_t bottom_right_interpolation = vrhaddq_u8(left_interpolation, right_interpolation); vst1q_u8(output_ptr, bottom_right_interpolation); 6) Pitfalls in softmax layer and demo codeThe first impression of inference in TensorFlow Lite was very slow. It took 85 ms in Galaxy J7 at that time. We tested the first prototype of TensorFlow Lite demo app by just changing the output size from 1,001 to 51,200 (= 160x160x2)After profiling, we found out that there were two unbelievable bottlenecks in implementation. Out of 85 ms of inference time, tensors[idx].copyTo(outputs.get(idx)); line in Tensor.java took up to 11 ms (13 %) and softmax layer 23 ms (27 %). If we would be able to accelerate those operations, we could reduce almost 40 % of the total inference time!First, we looked at the demo code and identified tensors[idx].copyTo(outputs.get(idx)); as a source of problem. It seemed that the slowdown was caused by copyTo operation, but to our surprise it came from int[] dstShape = NativeInterpreterWrapper.shapeOf(dst); because it checks every element (in our case, 51,200) of array to fill the shape. After fixing the output size, we gained 13 % speedup in inference time.<T> T copyTo(T dst) { ... // This is just example, of course, hardcoding output shape here is a bad practice // In our actual app, we build our own JNI interface with just using c++ code // int[] dstShape = NativeInterpreterWrapper.shapeOf(dst); int[] dstShape = {1, width*height*channel}; ... } The softmax layer was our next problem. TensorFlow Lite’s optimized softmax implementation assumes that depth (= channel) is bigger than outer_size (= height x width). In classification task, the usual output looks like [1, 1(height), 1(width), 1001(depth)], but in our segmentation task, depth is 2 and outer_size is multiple of height and width (outer_size » depth). Implementation of softmax layer in Tensorflow Lite is optimized for classification task and therefore loops over depth instead of outer_size. This leads to unacceptably slow inference time of softmax layer when used in segmentation network.We can solve this problem in many different ways. First, we can just use sigmoid layer instead of softmax in 2-class portrait segmentation. TensorFlow Lite has very well optimized sigmoid layer.Secondly, we could write SIMD optimized code and loop over depth instead of outer_size. You can see similar implementation at Tencent’s ncnn softmax layer, however, this approach has still its shortcomings. Unlike ncnn, TensorFlow Lite uses NHWC as a default tensor format:Figure 8. NHWC vs NCHW In other words, for NHWC, near elements of tensor hold channel-wise information and not spatial-wise. It is not simple to write optimized code for any channel size, unless you include transpose operation before and after softmax layer. In our case, we tried to implement softmax layer assumming 2-channel output.Thirdly, we can implement softmax layer using pre-calculated lookup table. Because we use 8-bit quantization and 2-class output (foreground and background) there are only 65,536 (= 256x256) different combinations of quantized input values that can be stored in lookup table:for (int fg = 0; fg < 256; fg++) { for (int bg = 0; bg < 256; bg++) { // Dequantize float fg_real = input->params.scale * (fg - input->params.zero_point); float bg_real = input->params.scale * (bg - input->params.zero_point); // Pre-calculating Softmax Values ... // Quantize precalculated_softmax[x][y] = static_cast<uint8_t>(clamped); } } ConclusionIn this post, we described the main challenges we had to solve in order to run portrait segmentation network on mobile devices. Our main focus was to keep high segmentation accuracy while being able to support even old devices, such as Samsung Galaxy J7. We wish our tips and tricks can give a better understanding of how to overcome common challenges when designing neural networks and inference engines for mobile devices.At the top of this post you can see portrait segmentation effect that is now available in Azar app. If you have any questions or want to discuss anything related to segmentation task, contact us at [email protected]. Enjoy Azar and mobile deep learning world!References[1] L. Chen, G. Papandreou, F. Schroff, H. Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. June 17, 2017, https://arxiv.org/abs/1706.05587[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna. Rethinking the Inception Architecture for Computer Vision. December 11, 2015, https://arxiv.org/abs/1512.00567[3] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, April 17, 2017, https://arxiv.org/abs/1704.04861[4] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L. Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. January 18, 2018, https://arxiv.org/abs/1801.04381[5] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. December 15, 2017, https://arxiv.org/abs/1712.05877