YUV 4:2:2 v210 10-bit packed decoder in GLSL

This is what I’ve done in a sunny day, when I’ve been asked to investigate and make a prototype of a YUV –> RGB 4:2:2 v210 converter to RGB in GLSL.

Following are some raw references to know more about the YUV standard and differences with YCbCr, a description of the memory layout of packed YUV 10-bit and the necessary informations to convert YUV to RGB:

YUV format:
- http://www.fourcc.org/yuv.php
- https://msdn.microsoft.com/en-us/library/windows/desktop/bb970578(v=vs.85).aspx
YUV pixel format v210:
- http://wiki.multimedia.cx/?title=V210
- http://www.digitalpreservation.gov/formats/fdd/fdd000353.shtml
- http://www.fourcc.org/yuv.php#V210
- example of v210 video: http://samples.mplayerhq.hu/V-codecs/v210/
- Apple technical note; it’s the best source of information about the memory layout
Implementation
- planar YUV 8-bit decoder in GLSL: https://code.google.com/p/o3d/source/browse/trunk/samples_webgl/shaders/yuv2rgb-glsl.shader?r=219
- Some interesting implementations with SIMD instructions: http://www.mikekohn.net/stuff/image_processing.php

The following is a short schema of the memory operations to perform, considering that the single Y,U,V channels are 10-bit wide.

I’m not sure that matrix coefficients are the correct ones, however they could be changed as required, even passed to the shader in a uniform variable.

Some notes about the compute shader I’ve done:

take an image1d as the input buffer containing the YUV encoded buffer (data_in)
the output is placed in the image2d (data_out)
as W,H are the image sizes, each workgroups is 6×1 sized; the size is selected because the decoding pattern is repeated every 6 pixels
the input YUV buffer is readed with a linearized mapping from the x,y pixel coordinates
the shader execution is divided in two steps:
the first step is a parallel load to a shared memory, that use 4 of 6 threads; I use a pre-calculated indexes buffer to copy each YUV components in 1 (Y) or 2 (U and V) location, avoiding unnecessary control divergence. A barrier is placed after the load phase
then all the 6 threads in the workgroup, one for each pixel, execute the YUV → RGB conversion, using a map matrix.

Something to improve in this code or you’re generally interested in this type of stuff, or maybe in GPGPU in general? GLSL in primis (and CUDA/openCL) are the disruptive technology to boost your software/algorithm. Send me a note, i will be happy to share my thoughts.

// =====================================
// YUV4:2:2 v210 10-bit decoder
// (C) by Andrea de Carolis 12/4/2015
// =====================================
// protected under MIT license

/*
Copyright (c) 2015 Andrea De Carolis

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
*/

// decode a packet of 6 neighbours pixels in a scanline. The input // image must have width = multiple of 6 pixel.

#version 430 core
//#extension GL_ARB_gpu_shader5 : require // not required by GTX9x

#define FACTOR 3

// size of workgroup: a single 6 pixel packet
// (4 encoded 32bit WORD)
layout (local_size_x = 6, local_size_y = 1) in; 
// single channel, 32bit WORD
layout (binding=1,r32ui) uniform uimage1D data_in; 
// output image RGBA buffer
layout (binding=2,rgba8ui) uniform uimage2D data_out; 
// local shared buffer for YUV components (6 pixels*3 YUV values)
shared uint shared_buf[18]; 
uniform int imageW;
uniform int imageH;
// number of DWORD in an encoded YUV scanline
uniform int stride_row;

const uint local_buf_ofs[24]=
{
 1,4,0,0,2,5, // Cb0,Y0,Cr0 offset into the shared buffer
 3,3,7,10,6,6, // Y1,Cb1,Y2
 8,11,9,9,13,16, // Cr1,Y3,Cb2
 12,12,14,17,15,15 // Y4,Cr2,Y5
};

void main (void)
{
 uint pxidx = gl_GlobalInvocationID.x%6;
 
 if (pxidx<4)
 {
 // parallel load of all 4 DWORD

uint cp1,cp2,cp3;
 // stride_row = selected by caller, generally W*4 
 // because all DWORD are consecutive for simplification.
 // that 
 uint value = imageLoad(data_in,int((gl_GlobalInvocationID.x/6)*4+
 pxidx+stride_row*gl_GlobalInvocationID.y)).r;

cp1 = value & 1023;
 cp2 = (value >> 10) & 1023;
 cp3 = (value >> 20) & 1023;

uint baseofs = pxidx*6;

// copy each component in at least one buffer location, 
 // and at max into two locations. For example
 // Yi will be written twice in the same location, 
 // Cb and Cr in two different location because they are 
 // shared across two consecutive pixels
 
 shared_buf[baseofs+0]=shared_buf[baseofs+1]=cp1;
 shared_buf[baseofs+2]=shared_buf[baseofs+3]=cp2;
 shared_buf[baseofs+4]=shared_buf[baseofs+5]=cp3;
 }
 
 barrier (); // wait all threads in the workgroup

// execute the YUV to RGB conversion for each of 6 pixel 
// of the workgroup
 //
 // there are many different version of this conversion. 
// This is one of them, but must be checked against 
// the specific values contained in Y'CbCr buffer

vec3 YUV = vec3(
  shared_buf[pxidx*3],
  shared_buf[pxidx*3+1],
  shared_buf[pxidx*3+2]);
 const mat3 YUV2RGB_MAPPING = mat3 (
 1.1678f, 0.0f ,1.6007f,
 1.1678f, -0.3929f ,-0.81532f,
 1.1678f, +2.0232f ,0.0f
 );
 YUV=YUV+vec3(-64,-512,-512);
 vec3 rgb = YUV2RGB_MAPPING*YUV;

ivec4 rgba = ivec4 (
 clamp(int(rgb.r) >> FACTOR, 0, 255),
 clamp(int(rgb.g) >> FACTOR, 0, 255),
 clamp(int(rgb.b) >> FACTOR, 0, 255),
 255
 );

// store final rgba value
 imageStore (data_out,ivec2(gl_GlobalInvocationID.xy),rgba);
}

YUV 4:2:2 v210 10-bit packed decoder in GLSL

Pubblicato da adec

Lascia un commento Annulla risposta