개발 피드 목록

PrefaceDeep learning has led to a series of breakthroughs in many areas. However, successful deep learning models often require significant amounts of computational resources, memory and power. Deploying efficient deep learning models on mobile devices became the main technological challenge for many mobile tech companies.Hyperconnect developed a mobile app named Azar which has a large fan base all over the world. Recently, Machine Learning Team has been focusing on developing mobile deep learning technologies which can boost user experience in Azar app. Below, you can see a demo video of our image segmentation technology (HyperCut) running on Samsung Galaxy J7. Our benchmark target is a real-time (>= 30 fps) inference on Galaxy J7 (Exynos 7580 CPU, 1.5 GHz) using only a single core.Figure 1. Our network runs fast on mobile devices, achieving 6 ms of single inference on Pixel 1 and 28 ms on Samsung Galaxy J7. Full length video. There are several approaches to achieve fast inference speed on mobile device. 8-bit quantization is one of the popular approaches that meet our speed-accuracy requirement. We use TensorFlow Lite as our main framework for mobile inference. TensorFlow Lite supports SIMD optimized operations for 8-bit quantized weights and activations. However, TensorFlow Lite is still in pre-alpha (developer preview) stage and lacks many features. In order to achive our goal, we had to do the following:Understand details of TensorFlow and Tensorflow Lite implementation.Design our own neural network that can fully utilize optimized kernels of TensorFlow Lite. (Refer to 1, 2 and 3)Modify TOCO: TensorFlow Lite Optimizing Converter to correctly convert unsupported layers. (Refer to 4)Accelerate several quantized-layers with SIMD optimized code. (Refer to 5 and 6)We are willing to share our trials and errors in this post and we hope that readers will enjoy mobile deep learning world :)1) Why is depthwise separable layer fast in Tensorflow Lite ?Implementing low-level programs requires a bit different ideas and approaches than usual. We should be aware that especially on mobile devices using cache memory is important for fast inference.Figure 2. Memory access requires much more energy (640 pJ) than addition or multiplication.Accessing cache memory (8 pJ) is much cheaper than using the main memory (table courtesy of Ardavan Pedram) In order to get insight into how intrinsics instructions are implemented in Tensorflow Lite, we had to analyze some implementations including depthwise separable convolution with 3x3 kernelsBelow we describe some of the main optimization techniques that are used for building lightweight and faster programs.Loop UnrollingCan you spot the difference between the following two code fragments?for (int i = 0; i < 32; i++) { x[i] = 1; if (i%4 == 3) x[i] = 3; } for (int i = 0; i < 8; i++) { x[4*i ] = 1; x[4*i+1] = 1; x[4*i+2] = 1; x[4*i+3] = 3; } The former way is what we usually write, and the latter is loop-unrolled version of former one. Even though unrolling loops are discouraged from the perspective of software design and development due to severe redundancy, with low-level architecture this kind of unrolling has non-negligible benefits. In the example described above, the unrolled version avoids examining 24 conditional statements in for loop, along with neglecting 32 conditional statements of if.Furthermore, with careful implementation, these advantages can be magnified with the aid of SIMD architecture. Nowadays some compilers have options which automatically unroll some repetitive statements, yet they are unable to deal with complex loops.Separate implementation for each caseConvolution layer can take several parameters. For example, in depthwise separable layer, we can have many combinations with different parameters (depth_multiplier x stride x rate x kernel_size). Rather than writing single program available to deal with every case, in low-level architectures, writing number of case-specific implementations is preferred. The main rationale is that we need to fully utilize the special properties for each case. For convolution operation, naive implementation with several for loops can deal with arbitrary kernel size and strides, however this kind of implementation might be slow. Instead, one can concentrate on small set of actually used cases (e.g. 1x1 convolution with stride 1, 3x3 convolution with stride 2 and others) and fully consider the structure of every subproblem.For example, in TensorFlow Lite there is a kernel-optimized implementation of depthwise convolution, targeted at 3x3 kernel size:template <int kFixedOutputY, int kFixedOutputX, int kFixedStrideWidth, int kFixedStrideHeight> struct ConvKernel3x3FilterDepth8 {}; Tensorflow Lite further specifies following 16 cases with different strides, width and height of outputs for its internal implementation:template <> struct ConvKernel3x3FilterDepth8<8, 8, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<4, 4, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<4, 2, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<4, 1, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<2, 2, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<2, 4, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<1, 4, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<2, 1, 1, 1> { ... } template <> struct ConvKernel3x3FilterDepth8<4, 2, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<4, 4, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<4, 1, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<2, 2, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<2, 4, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<2, 1, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<1, 2, 2, 2> { ... } template <> struct ConvKernel3x3FilterDepth8<1, 4, 2, 2> { ... } Smart Memory Access PatternSince low-level programs are executed many times in repetitive fashion, minimizing duplicated memory access for both input and output is necessary. If we use SIMD architecture, we can load nearby elements together at once (Data Parallelism) and in order to reduce duplicated read memory access, we can traverse the input array by means of a snake-path.Figure 3. Memory access pattern for 8x8 output and unit stride, implemented in Tensorflow Lite's depthwise 3x3 convolution. The next example which is targeted to be used in much smaller 4x1 output block also demonstrates how to reuse preloaded variables efficiently. Note that the stored location does not change for variables which are loaded in previous stage (in the following figure, bold variables are reused):Figure 4. Memory access pattern for 4x1 output and stride 2, implemented in Tensorflow Lite's depthwise 3x3 convolution. 2) Be aware of using atrous depthwise convolutionAtrous (dilated) convolution is known to be useful to achieve better result for image segmentation[1]. We also decided to use atrous depthwise convolution in our network. One day, we tried to set stride for atrous depthwise convolution to make it accelerate computation, however we failed, because the layer usage in TensorFlow (≤ 1.8) is constrained.In Tensorflow documentation of tf.nn.depthwise_conv2d (slim.depthwise_conv2d also wraps this function), you can find this explanation of rate parameter.rate: 1-D of size 2. The dilation rate in which we sample input values across the height and width dimensions in atrous convolution. If it is greater than 1, then all values of strides must be 1.Even though TensorFlow doesn’t support strided atrous function, it doesn’t raise any error if you set rate > 1 and stride > 1. >>> import tensorflow as tf >>> tf.enable_eager_execution() >>> input_tensor = tf.constant(list(range(64)), shape=[1, 8, 8, 1], dtype=tf.float32) >>> filter_tensor = tf.constant(list(range(1, 10)), shape=[3, 3, 1, 1], dtype=tf.float32) >>> print(tf.nn.depthwise_conv2d(input_tensor, filter_tensor, strides=[1, 2, 2, 1], padding="VALID", rate=[2, 2])) tf.Tensor( [[[[ 302.] [ 330.] [ 548.] [ 587.]] [[ 526.] [ 554.] [ 860.] [ 899.]] [[1284.] [1317.] [1920.] [1965.]] [[1548.] [1581.] [2280.] [2325.]]]], shape=(1, 4, 4, 1), dtype=float32) >>> 0 * 5 + 2 * 6 + 16 * 8 + 9 * 18 # The value in (0, 0) is correct 302 >>> 0 * 4 + 2 * 5 + 4 * 6 + 16 * 7 + 18 * 8 + 20 * 9 # But, the value in (0, 1) is wrong! 470 Let’s find the reason why this difference happened. If we just put tf.space_to_batch and tf.batch_to_space before and after convolution layer, then we can use convolution operation for atrous convolution (profit!). On the other hand, it isn’t straightforward how to handle stride and dilation together. In TensorFlow, we need to be aware of this problem in depthwise convolution.3) Architecture design principles for efficient segmentation networkUsually segmentation takes more time than classification since it has to upsample high spatial resolution map. Therefore, it is important to reduce inference time as much as possible to make the application run in real-time.It is important to focus on spatial resolution when designing real-time application. One of the easiest ways is to reduce the size of input images without losing accuracy. Assuming that the network is fully convolutional, you can accelerate the model about four times faster if the size of input is halved. In literature[2], it is known that small size of input images doesn’t hurt accuracy a lot.Another simple strategy to adopt is early downsampling when stacking convolution layers. Even with the same number of convolution layers, you can reduce the time with strided convolution or pooling within early layers.There is also a tip for selecting the size of input image when you use Tensorflow Lite quantized model. The optimized implementations of convolution run best when the width and height of image is multiple of 8. Tensorflow Lite first loads multiples of 8, then multiples of 4, 2 and 1 respectively. Therefore, it is the best to keep the size of every input of layer as a multiple of 8.Substituting multiple operations into single operation can improve speed a bit. For example, convolution followed by max pooling can be usually replaced by strided convolution. Transpose convolution can also be replaced by resizing followed by convolution. In general, these operations are substitutable because they perform the same role in the network. There are no big empirical differences among these operations. Tips described above help to accelerate inference speed but they can also hurt accuracy. Therefore, we adopted some state-of-the-art blocks rather than using naive convolution blocks.Figure 5. Atrous spatial pyramid pooling (figure courtesy of Liang-Chieh Chen) Atrous spatial pyramid pooling[1] is a block which mimics the pyramid pooling operation with atrous convolution. DeepLab uses this block in the last layer.We also substitute most of the convolution layers with efficient depthwise separable convolution layers. They are basic building blocks for MobileNetV1[3] and MobileNetV2[4] which are well optimized in Tensorflow Lite.4) Folding batchnorm into atrous depthwise convolutionWhen quantizing convolution operation followed by batchnorm, batchnorm layer must be folded into the convolution layers to reduce computation cost. After folding, the batchnorm is reduced to folded weights and folded biases and the batchnorm-folded convolution will be computed in one convolution layer in TensorFlow Lite[5]. Batchnorm gets automatically folded using tf.contrib.quantize if the batchnorm layer comes right after the convolution layer. However, folding batchnorm into atrous depthwise convolution is not easy.In TensorFlow’s slim.separable_convolution2d, atrous depthwise convolution is implemented by adding SpaceToBatchND and BatchToSpaceND operations to normal depthwise convolution as mentioned previously. If you add a batchnorm to this operation by including argument normalizer_fn=slim.batch_norm, batchnorm does not get attached directly to the convolution layer. Instead, the graph will look like the diagram below: SpaceToBatchND → DepthwiseConv2dNative → BatchToSpaceND → BatchNorm The first thing we tried was to modify TensorFlow quantization to fold batchnorm bypassing BatchToSpaceND without changing the order of operations. With this approach, the folded bias term remained after BatchToSpaceND, away from the convolution layer. Then, it became separate BroadcastAdd operation in TensorFlow Lite model rather than fused into convolution. Surprisingly, it turned out that BroadcastAdd was much slower than the corresponding convolution operation in our experiment:Timing log from the experiment on Galaxy S8 ... [DepthwiseConv] elapsed time: 34us [BroadcastAdd] elapsed time: 107us ... To remove BroadcastAdd layer, we decided to change the network itself instead of fixing TensorFlow quantization. Within slim.separable_convolution2d layer, we swapped positions of BatchNorm and BatchToSpaceND. SpaceToBatchND → DepthwiseConv2dNative → BatchNorm → BatchToSpaceND Even though batchnorm relocation could lead to different outputs values compared to the original, we did not notice any degradation in segmentation quality.5) SIMD-optimized implementation for quantized resize bilinear layerAt the time of accelerating TensorFlow Lite framework, conv2d_transpose layer was not supported. However, we could use ResizeBilinear layer for upsampling as well. The only problem of this layer is that there is no quantized implementation, therefore we implemented our own SIMD quantized version of 2x2 upsampling ResizeBilinear layer.Figure 6. 2x2 bilinear upsampling without corner alignnment. To upsample image, original image pixels (red circles) are interlayed with new interpolated image pixels (grey circles). In order to simplify implementation we do not compute pixel values for the bottommost and rightmost pixels, denoted as green circles.for (int b = 0; b < batches; b++) { for (int y0 = 0, y = 0; y <= output_height - 2; y += 2, y0++) { for (int x0 = 0, x = 0; x <= output_width - 2; x += 2, x0++) { int32 x1 = std::min(x0 + 1, input_width - 1); int32 y1 = std::min(y0 + 1, input_height - 1); ResizeBilinearKernel2x2(x0, x1, y0, y1, x, y, depth, b, input_data, input_dims, output_data, output_dims); } } } Every new pixel value is computed for each batch separately. Our core function ResizeBilinearKernel2x2 computes 4 pixel values across multiple channels at once.Figure 7. Example of 2x2 bilinear upsampling of top left corner of image. (a) Original pixel values are simply reused and (b) – (d) used to interpolate new pixel values. Red circles represent original pixel values. Blue circles are new interpolated pixel values computed from pixel values denoted as circles with black circumference. NEON (Advanced SIMD) intrinsics enable us to process multiple data at once with a single instruction, in other words processing multiple data at once. Since we deal with uint8 input values we can store our data in one of the following formats uint8x16_t, uint8x8_t and uint8_t, that can hold 16, 8 and 1 uint8 values respectively. This representation allows to interpolate pixel values across multiple channels at once. Network architecture is highly rewarded when channels of feature maps are multiples of 16 or 8:// Handle 16 input channels at once int step = 16; for (int ic16 = ic; ic16 <= depth - step; ic16 += step) { ... ic += step; } // Handle 8 input channels at a once step = 8; for (int ic8 = ic; ic8 <= depth - step; ic8 += step) { ... ic += step; } // Handle one input channel at once for (int ic1 = ic; ic1 < depth; ic1++) { ... } SIMD implementation of quantized bilinear upsampling is straightforward. Top left pixel value is reused (Fig. 7a). Bottom left (Fig. 7b) and top right (Fig. 7c) pixel values are mean of two adjacent original pixel values. Finally, botom right pixel (Fig. 7d) is mean of 4 diagonally adjacent original pixel values.The only issue that we have to take care of is 8-bit integer overflow. Without a solid knowledge of NEON intrinsics we could go down the rabbit hole of taking care of overflowing by ourself. Fortunately, the range of NEON intrinsics is broad and we can utilize those intrinsics that fit our needs. The snippet of code below (using vrhaddq_u8) shows an interpolation (Fig. 7d) of 16 pixel values at once for bottom right pixel value:// Bottom right output_ptr += output_x_offset; uint8x16_t left_interpolation = vrhaddq_u8(x0y0, x0y1); uint8x16_t right_interpolation = vrhaddq_u8(x1y0, x1y1); uint8x16_t bottom_right_interpolation = vrhaddq_u8(left_interpolation, right_interpolation); vst1q_u8(output_ptr, bottom_right_interpolation); 6) Pitfalls in softmax layer and demo codeThe first impression of inference in TensorFlow Lite was very slow. It took 85 ms in Galaxy J7 at that time. We tested the first prototype of TensorFlow Lite demo app by just changing the output size from 1,001 to 51,200 (= 160x160x2)After profiling, we found out that there were two unbelievable bottlenecks in implementation. Out of 85 ms of inference time, tensors[idx].copyTo(outputs.get(idx)); line in Tensor.java took up to 11 ms (13 %) and softmax layer 23 ms (27 %). If we would be able to accelerate those operations, we could reduce almost 40 % of the total inference time!First, we looked at the demo code and identified tensors[idx].copyTo(outputs.get(idx)); as a source of problem. It seemed that the slowdown was caused by copyTo operation, but to our surprise it came from int[] dstShape = NativeInterpreterWrapper.shapeOf(dst); because it checks every element (in our case, 51,200) of array to fill the shape. After fixing the output size, we gained 13 % speedup in inference time.<T> T copyTo(T dst) { ... // This is just example, of course, hardcoding output shape here is a bad practice // In our actual app, we build our own JNI interface with just using c++ code // int[] dstShape = NativeInterpreterWrapper.shapeOf(dst); int[] dstShape = {1, width*height*channel}; ... } The softmax layer was our next problem. TensorFlow Lite’s optimized softmax implementation assumes that depth (= channel) is bigger than outer_size (= height x width). In classification task, the usual output looks like [1, 1(height), 1(width), 1001(depth)], but in our segmentation task, depth is 2 and outer_size is multiple of height and width (outer_size » depth). Implementation of softmax layer in Tensorflow Lite is optimized for classification task and therefore loops over depth instead of outer_size. This leads to unacceptably slow inference time of softmax layer when used in segmentation network.We can solve this problem in many different ways. First, we can just use sigmoid layer instead of softmax in 2-class portrait segmentation. TensorFlow Lite has very well optimized sigmoid layer.Secondly, we could write SIMD optimized code and loop over depth instead of outer_size. You can see similar implementation at Tencent’s ncnn softmax layer, however, this approach has still its shortcomings. Unlike ncnn, TensorFlow Lite uses NHWC as a default tensor format:Figure 8. NHWC vs NCHW In other words, for NHWC, near elements of tensor hold channel-wise information and not spatial-wise. It is not simple to write optimized code for any channel size, unless you include transpose operation before and after softmax layer. In our case, we tried to implement softmax layer assumming 2-channel output.Thirdly, we can implement softmax layer using pre-calculated lookup table. Because we use 8-bit quantization and 2-class output (foreground and background) there are only 65,536 (= 256x256) different combinations of quantized input values that can be stored in lookup table:for (int fg = 0; fg < 256; fg++) { for (int bg = 0; bg < 256; bg++) { // Dequantize float fg_real = input->params.scale * (fg - input->params.zero_point); float bg_real = input->params.scale * (bg - input->params.zero_point); // Pre-calculating Softmax Values ... // Quantize precalculated_softmax[x][y] = static_cast<uint8_t>(clamped); } } ConclusionIn this post, we described the main challenges we had to solve in order to run portrait segmentation network on mobile devices. Our main focus was to keep high segmentation accuracy while being able to support even old devices, such as Samsung Galaxy J7. We wish our tips and tricks can give a better understanding of how to overcome common challenges when designing neural networks and inference engines for mobile devices.At the top of this post you can see portrait segmentation effect that is now available in Azar app. If you have any questions or want to discuss anything related to segmentation task, contact us at [email protected]. Enjoy Azar and mobile deep learning world!References[1] L. Chen, G. Papandreou, F. Schroff, H. Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. June 17, 2017, https://arxiv.org/abs/1706.05587[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna. Rethinking the Inception Architecture for Computer Vision. December 11, 2015, https://arxiv.org/abs/1512.00567[3] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, April 17, 2017, https://arxiv.org/abs/1704.04861[4] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L. Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. January 18, 2018, https://arxiv.org/abs/1801.04381[5] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. December 15, 2017, https://arxiv.org/abs/1712.05877

필자가 재직 중인 일정 데이터 스타트업 히든트랙(린더)은 현재 SKT NUGU, Google Assistant에서 '아이돌 캘린더'라는 이름의 일정 검색/구독 서비스를 운영 중이며, 삼성 빅스비와 협업을 통해 내년 상반기 전시/공연 일정 검색/구독 서비스 상용화를 앞두고 있다.https://blog.naver.com/nuguai/221387861674세계적으로도 아직 음성 관련 서비스 사례가 많지 않은 상황에서 VUI 기반 서비스 개발에 도움이 될만한 자료를 국내에서 찾기는 더더욱 쉽지 않았고, 향후 음성 기반 서비스를 준비하는 다른 이들이 우리가 겪었던 시행착오를 줄일 수 있기를 바라는 마음으로 간단하게 5부작 형태의 글로 우리가 고민해온 과정을 준비해보았다.음성 서비스 시장의 확대해외 리서치 업체 닐슨에 따르면 2018년 2분기 기준 미국 가구 중 4분의 1에 해당하는 24%가 최소 1대 이상의 AI 스피커를 소유하고 있으며 미국 성인의 20%가 하루 1회 이상 음성 검색 서비스를 활용하고 있다. 국내 리서치 전문 기관인 컨슈머 인사이트에 따르면 국내 AI 스피커 사용 경험률은 11%에 달하며 올해 안으로 세계 5위 수준의 스피커 시장 점유율(3%)을 확보할 것으로 예상된다.아마존 에코는 시각 장애인들이 콘텐츠에 접근하는 속도를 최대 10배까지 빠르게 만들어주었으며 SKT 내비게이션 서비스 T-Map은 NUGU의 음성 인터페이스를 통해 터치 인터랙션을 26%까지 감소시켜 사고 위험을 줄였다.음성 서비스 시장이 확대되고 있다는 것과, 그 변화가 사람들의 삶에 많은 영향을 끼치고 있다는 것은 누구도 부정할 수 없는 자명한 사실이다.하지만 여전히 아쉬운 일상 속 음성 서비스 만족도그렇다면 과연 우리의 일상 속 음성 서비스 경험의 만족도는 어떨까?지난 4월 진행된 컨슈머인사이트의 조사에 따르면 국내 주요 음성 서비스에 대한 사용자 만족률은 49%로, 절반에 채 못 미치고 있는 상황이다."국내 음성 서비스 만족도 - 49%"주요 불만족 이유로는 ‘음성 명령이 잘되지 않는다’(50%), ‘자연스러운 대화가 곤란하다’(41%), ‘소음을 음성 명령으로 오인한다’(36%) 등이 꼽혔으며, 아직도 대다수의 사용자들에게 AI 스피커는 기업들의 서툰 시도로 인식되고 있다.국내 음성 기반 서비스 만족도는 타 스피커 상용화 국가들과 대비해서도 현저히 낮은 편인데, 유독 국내의 사용자들이 만족스러운 음성 서비스 경험을 누리지 못하고 있는 이유가 대체 무엇인지, 이번 글을 통해 잠시 논해보고자 한다.1. 과열된 AI 마케팅국내 'AI 스피커' 시장은 타 국가 대비 매우 치열한 점유율 경쟁이 벌어지고 있는 곳이다. 미국의 경우만 하더라도 구글 어시스턴트, 아마존 알렉사, 애플 시리의 삼파전이 벌어지고 있는 상황에서 국내는 KT 기가지니, SKT NUGU, 네이버 클로바, 카카오 i, 삼성 빅스비 등 5개가 넘는 다양한 플레이어들이 이 작은 시장을 차지하기 위해 혈투를 벌이고 있다.AI, 즉 인공지능은 사전적으로 '인간의 지능으로 할 수 있는 사고, 학습, 자기 개발 등을 컴퓨터가 할 수 있도록 하는 방법'을 뜻하는데, 현존하는 대다수의 속칭 'AI' 서비스들이 해당 수준에 다다르기에는 아직 많은 시간이 필요하다는것은 누구도 부정할 수는 없을듯 하다. 경쟁이 과열되다 보면 제품을 판매하기 위해 다소 공격적인 선택을 하는 경우가 있고, 현재 국내에서 이루어지고 있는 AI라는 용어의 지나친 남발이 바로 그 대표적인 예시라고 할 수 있다.멀리 갈 것 없이 각 나라에서 스피커를 부르는 호칭을 보면 잘 알 수 있는데, 우리가 흔히 'AI 스피커'라 부르는 구글 홈, 아마존 에코 등 대다수의 스피커는 미국 내에서 '스마트 스피커'라는 단어로 통용된다.(구글에 AI Speaker를 검색해보면 Smart Speaker로 자동 대체되는 것을 확인할 수 있다)구글 내 AI 스피커 검색 결과(첫 두 검색은 광고)즉, 아직은 '스마트'하다고 부를 수밖에 없는 수준의 기능에 대한 과장 된 'AI 마케팅'으로 인해 국내 사용자들은 시장 생성 초기부터 고도화된 인공지능을 기대하게 되고, 이는 결국 자연스레 낮은 사용자 만족도로 이어질 수밖에 없는 것이다.향후 AI가 음성 기반 서비스의 핵심 기술이 될것은 분명하지만 당장의 지나친 기대감은 되려 국내 음성 기반 서비스의 *캐즘 기간을 장기화시킬 수 있을것으로 우려된다.*캐즘: 첨단기술 제품이 선보이는 초기 시장에서 주류시장으로 넘어가는 과도기에 일시적으로 수요가 정체되거나 후퇴하는 단절 현상2. 조금 더 시간이 필요한 기술력앞서 언급한 컨슈머 인사이트의 조사에 따르면 사용자의 불만족 이유 중 TOP 3 모두가 '낮은 인식률' 바탕으로 하고 있는 것을 재차 확인할 수 있다.1. 음성 명령이 잘되지 않는다(50%)2. 자연스러운 대화가 곤란하다(41%)3. 소음을 음성 명령으로 오인한다(36%) 컨슈머인사트 AI 스피커 만족도 통계음성 서비스 경험은 사용자의 명확한 의사가 전달되지 않는다면 애초에 시작될 수 없다. 자연스러운 대화를 진행하기 위해서는 결국 사람의 언어, 즉 자연어를 분석하여 의도를 파악할 수 있어야 하며 이를 실현하기 위해서는 아래에 소개 된 ASR(음성 인식)과 NLU(자연어 처리)가 높은 수준으로 구현되어야 한다.T map X NUGU 디자인 사례로 알아보는 음성인터페이스 디자인 1강 - https://youtu.be/Dz-rxGV-dOAASR과 NLU 성능이 뒷받침되지 않는 음성 서비스는 아무리 고도화 된 서비스 로직이 준비된들 '대화'가 진행될 수 없으며 부족한 성능은 결국 국내 대다수 스피커들이 "죄송합니다. 무슨 말인지 이해 못했어요"를 출력하며 사용자 불만족도를 상승시키는 주요 요인으로 볼 수 있다.인식 정확도를 상승시키기 위해서는 결과적으로 더 많은 양의 학습 데이터가 필요하며 대다수의 업체가 아직 관련 기술력이 많이 부족한 상황에서도 공격적으로 스피커를 출시하는 이유 또한 결국 초기 점유율 높여 이 학습 데이터를 지속적으로 쌓기 위해서다.국내에서는 아직 높은 수준으로 두 단계를 구축한 메이저 업체가 없는 상황에서, 국내 기업들은 경쟁력을 확보하기 위해 관련 기술력을 가진 국내외 다양한 기업에 지속적으로 투자를 늘려나가고 있는 상황이다.http://www.zdnet.co.kr/view/?no=201702231628363. 더 많은 고민이 필요한 음성 사용자 경험(VUX) 디자인이번 협업 프로젝트를 진행하며 VUX를 공부하는 과정에서 우리의 사례를 포함한 몇 가지 재미있는 질문들을 발견할 수 있었다.질문1. 음악 앱이 재생되는 상황에서 사용자가 "앞으로 10초"라고 말했다면, 빨리 감기를 하는 게 맞을까 되감기를 하는 게 맞을까? - 네이버 클로바 사례질문2. 자정이 살짝 넘은 새벽 1시, 사용자가 "내일 일정 알려줘"라고 말했다면, 향후 23시간 동안의 일정을 알려주는 게 맞을까 23시간이 지난 그 다음날 일정을 알려주는 게 맞을까? - 히든트랙 린더(빅스비, SKT 파트너 스타트업) 사례질문3. '오늘'이라는 이름의 기업이 존재하는 상황에서 "오늘 기업 정보 알려줘"라고 말했다면, 오늘의 주요 기업 정보를 제공하는게 맞을까 주식회사 '오늘'의 정보를 제공하는게 좋을까? - 딥서치(빅스비 파트너 스타트업) 사례앞서 언급했던 1,2번의 사용자 만족도 문제가 이미 어쩔 수 없는 국내 시장의 지나친 경쟁과 더 시간이 필요한 기술력에 대한 아쉬움을 토로하는 내용이었다면, 3번의 VUI상의 새로운 경험에 대한 고민들이 이번 글을 쓰게 된 계기이자 목적이라고 볼 수 있다. 아직도 각 질문에 대한 뚜렷한 정답이 없는 상황에서 위와 같은 고민들을 함께 논의하며 최대한으로 정답에 가까운 선택을 내릴 수 있었으면 한다.클로바의 "앞으로 10초", 린더의 "내일 일정 알려줘", 딥서치의 "오늘 기업 정보 알려줘"에 대한 해답과 같이 '최선'이라고 부를 수 있는 가이드가 아직 존재하지 않는 현 VUX 시장은 더욱더 깊은 고민과 통찰이 필요한 시점이다. 단순히 해외 사례를 그대로 인용하여 국내 서비스에 적용하는 것이 아닌 정서와 문화, 그리고 각 콘텐츠에 대한 높은 이해도를 바탕으로 적절히 녹여낼 수 있어야 한다.올해 초 처음으로 챗봇을 디자인해보며 겪었던 애로사항들을 적은 부족한 글이 새로운 디자인을 시도하는 이들에게 조금이나마 도움이 되었다는 피드백을 받을 수 있었고,http://magazine.ditoday.com/ui-ux/일정-구독-서비스-린더의-탄생/이에 용기를 얻어 이번에는 다소 길지만 조금 더 많은 내용을 담고 있는 글을 준비하게 되었다.SKT NUGU, 삼성 빅스비와의 협업 과정에서 '음성 기반 인터페이스(VUI)'는 챗봇과는 확연히 다른 또 다른 형태의 디자인이라는 것을 알 수 있었고, 단순히 대화형 인터페이스(CI: Chatting Interface)를 음성의 형태로 재가공하는 것이 아닌, 서비스 기반부터 리디자인이 필요하다는것을 깨달았다.이미 구글, 아마존, 애플 등 메이저 업체들이 수년간의 경험과 데이터를 기반으로 다양한 VUX 가이드라인을 제시하고 있으며, 최근에는 SKT NUGU, 네이버 클로바 등 국내 업체들도 조금씩 VUX 서비스 제작에 대한 구체적인 로드맵을 제공하고 있는 상황이다.https://developers.nugu.co.kr/docs/voice-service-design-guideline/앞으로 약 다섯 달간 연재 진행 예정인 향후 4편의 내용들은 위 가이드 문서들에서 언급하는 다양한 해외와 국내 사례들을 바탕으로 주제를 선정하였으며, 각 편의 내용들은 VUI 서비스 제작 경험이 있는 다양한 국내 회사들의 고민 과정을 조금씩 담고 있다.1편: 음성 기반 인터페이스의 등장2편: 음성 기반 인터페이스와 TPO3편: 음성 기반 인터페이스와 페르소나4편: 음성 기반 인터페이스 vs GUI5편: 국내 음성 기반 인터페이스 현황음성 인터페이스는 정말 유용할까?음성 인터페이스는 먼 미래의 것이 아니다. 우리는 이미 수 년 전부터 다양한 종류의 음성 인터페이스를 접해왔으며, 그중 대표적인 예시가 바로 누구나 한 번쯤은 경험해보았을 ARS, 자동응답 시스템이다.각종 정보를 음성으로 저장 한 후, 사용자가 전화를 이용하여 시스템에 접속하면 음성으로 필요한 정보를 검색할 수 있도록 사용법을 알려주고, 필요한 정보를 찾으면 이를 음성으로 들려 주는 바로 그 시스템이 현 음성 인터페이스 경험의 모태라 할 수 있다.예약을 진행하는 과정에서 어떤 제품군을 수리 맡기고 싶은지, 냉장고인지, 컴퓨터인지, 노트북인지, 핸드폰인지 '말로 검색하고 말로 예약 확인을 받는' 바로 그 과정이 바로 수년 전부터 존재해온 음성 인터페이스이다. 우리가 말로, 음성으로 수리하고 싶은 제품을 말하고 응답을 받아온 이유는 간단하다.더 편했기 때문이다.다만 그렇다고 해서 음성 인터페이스가 모든 분야를 혁신시킬 변화의 축이 되기는 힘들다.음성 입출력의 한계는 매우 명확하며, 시각적 입출력이 반드시 필요한 산업과 분야(음식, 지도 등)는 꾸준히 기존과 같은 시각 기반의 인터페이스를 필요로 할 것이다.모든 분야에 적용될 수는 없는 음성 인터페이스이지만 한가지 확실한 것은 이제 시작이라는 것이다.다소 장황하고 부족한 이 글이 조금이나마 앞으로의 험난한 여정을 도울 기초적인 가이드가 될 수 있었으면 하는 마음으로 연재를 시작해본다.저도 아직 많이 낯선 분야인만큼 의아하시거나 틀린부분이 있다면 댓글로 많은 지적 및 피드백 부탁드립니다. 감사합니다 :)#히든트랙 #음성기반기술 #스타트업인사이트 #UX디자인 #음성기반디자인