All times are UTC - 7 hours





PostPosted: Wed Jun 01, 2016 11:05 am 
Bisqwit
Joined: Fri Oct 14, 2011
Posts: 248
A year or two ago, someone on #nesdev talked with me about my NTSC decoder. I think it was Thefox. After I created the example code for the NTSC video wiki page, I rewrote the NTSC modem for my QuickBASIC NES emulator, resulting in a much more efficient design. I promised to share my code with them, but I never did. Here it is now:

Code:
/**
 * NTSC_DecodeLine(Width, Signal, Target, Phase0)
 *
 * Convert NES NTSC graphics signal into RGB using integer arithmetic only.
 *
 * Width: Number of NTSC signal samples.
 *        For a 256 pixels wide screen, this would be 256*8. 283*8 if you include borders.
 *
 * Signal: An array of Width samples.
 *         The following sample values are recognized:
 *          -29 = Luma 0 low   32 = Luma 0 high (-38 and  6 when attenuated)
 *          -15 = Luma 1 low   66 = Luma 1 high (-28 and 31 when attenuated)
 *           22 = Luma 2 low  105 = Luma 2 high ( -1 and 58 when attenuated)
 *           71 = Luma 3 low  105 = Luma 3 high ( 34 and 58 when attenuated)
 *         In this scale, sync signal would be -59 and colorburst would be -40 and 19,
 *         but these are not interpreted specially in this function.
 *         The value is calculated from the relative voltage with:
 *                   floor((voltage-0.518)*1000/12)-15
 *
 * Target: Pointer to a storage for Width RGB32 samples (00rrggbb).
 *         Note that the function will produce an RGB32 value for _every_ half-clock-cycle.
 *         This means 2264 RGB samples if you render 283 pixels per scanline (incl. borders).
 *         The caller can pick and choose those columns they want from the signal
 *         to render the picture at their desired resolution.
 *
 * Phase0: An integer in range 0-11 that describes the phase offset into colors on this scanline.
 *         Would be generated from the PPU clock cycle counter at the start of the scanline.
 *         In essence it conveys in one integer the same information that real NTSC signal
 *         would convey in the colorburst period in the beginning of each scanline.
 */
#include <algorithm> // for std::min and std::max

void NTSC_DecodeLine(int Width,
                     const char Signal[/*Width*/],
                     unsigned Target[/*Width*/],
                     int Phase0)
{
    static constexpr int Ywidth = 12, Iwidth = 23, Qwidth = 23;
    /* Ywidth, Iwidth and Qwidth are the filter widths for Y,I,Q respectively.
     * All widths at 12 produce the best signal quality.
     * 12,24,24 would be the closest values matching the NTSC spec.
     * But off-spec values 12,22,26 are used here, to bring forth mild
     * "chroma dots", an artifacting common with badly tuned TVs.
     * Larger values = more horizontal blurring.
     */
    static constexpr int Contrast = 167941, Saturation = 144044;

    static constexpr char sinetable[27] = {0,4,7,8,7,4, 0,-4,-7,-8,-7,-4,
                                           0,4,7,8,7,4, 0,-4,-7,-8,-7,-4,
                                           0,4,7}; // 8*sin(x*2pi/12)
    // To finetune hue, you would have to recalculate sinetable[].
    // Coarse changes can be made with Phase0.

    auto Read = [=](int pos) -> char { return pos>=0 ? Signal[pos] : 0; };
    auto Cos  = [=](int pos) -> char { return sinetable[(pos+36)%12  +Phase0]; };
    auto Sin  = [=](int pos) -> char { return sinetable[(pos+36)%12+3+Phase0];   };

    int ysum = 0, isum = 0, qsum = 0;
    for(int s=0; s<Width; ++s)
    {
        ysum += Read(s)          - Read(s-Ywidth);
        isum += Read(s) * Cos(s) - Read(s-Iwidth) * Cos(s-Iwidth);
        qsum += Read(s) * Sin(s) - Read(s-Qwidth) * Sin(s-Qwidth);
        constexpr int br=Contrast, sa=Saturation;
        constexpr int yr = br/Ywidth, ir = br* 1.994681e-6*sa/Iwidth, qr = br* 9.915742e-7*sa/Qwidth;
        constexpr int yg = br/Ywidth, ig = br* 9.151351e-8*sa/Iwidth, qg = br*-6.334805e-7*sa/Qwidth;
        constexpr int yb = br/Ywidth, ib = br*-1.012984e-6*sa/Iwidth, qb = br* 1.667217e-6*sa/Qwidth;
        int r = std::min(255,std::max(0, (ysum*yr + isum*ir + qsum*qr) / 65536 ));
        int g = std::min(255,std::max(0, (ysum*yg + isum*ig + qsum*qg) / 65536 ));
        int b = std::min(255,std::max(0, (ysum*yb + isum*ib + qsum*qb) / 65536 ));
        Target[s] = (r << 16) | (g << 8) | b;
    }
}

Tip: This code lends itself excellently to parallelization. I will leave the required changes as an exercise for the reader.


Last edited by Bisqwit on Sun Jun 05, 2016 8:20 pm, edited 4 times in total.

PostPosted: Wed Jun 01, 2016 11:41 am 
tepples
Joined: Sun Sep 19, 2004
Posts: 18651
Location: NE Indiana, USA (NTSC)
Under what license may this decoder be used?


PostPosted: Wed Jun 01, 2016 11:54 am 
Bisqwit
tepples wrote:
Under what license may this decoder be used?

MIT. Or CC-BY-SA, I guess, since this is more of a documentation thing.


PostPosted: Thu Jun 02, 2016 12:35 am 
feos
Joined: Tue Apr 19, 2011
Posts: 106
Location: RU
Whoa, how smart! That can still be done in old C++ with #define, can't it? I think the lambdas are also translatable. If I manage to tweak it to meet the PAL standard, it will save me months of brainstorming.


PostPosted: Thu Jun 02, 2016 8:14 am 
Bisqwit
feos wrote:
Whoa, how smart! That can still be done in old C++ with #define, can't it? I think the lambdas are also translatable. If I manage to tweak it to meet the PAL standard, it will save me months of brainstorming.

Yes: the lambdas can be converted into #defines and constexpr into const, to support compilation on compilers more than five years old.

As far as I know, to change it into PAL you have to convert the YIQ matrix (those floating-point constants) into a YUV matrix, and that's pretty much it for this function (and even that is unnecessary, since YUV is just a phase variant of YIQ). In the signal generator you would generate opposing phases on every other scanline, and set Phase0 accordingly.


PostPosted: Thu Jun 02, 2016 8:52 am 
feos
Right. Can you post some example images of your filter?


PostPosted: Thu Jun 02, 2016 9:43 am 
Bisqwit
feos wrote:
Right. Can you post some example images of your filter?

Well, I haven't put it into any emulator, but here are three snapshots created by recording the NTSC signal from my QuickBASIC NES emulator, saving it into a file, decoding it with this C++ program, rendering at 640x240 (with no interpolation), writing to stdout, and displaying with ffplay and its default sws scaler at 1280x960 (I believe that's bicubic). Click to view full size. I have not adjusted the hue/contrast/saturation settings to match any particular palette, so they are what they are.
[Three snapshot images]


Last edited by Bisqwit on Thu Jun 02, 2016 10:04 am, edited 1 time in total.

PostPosted: Thu Jun 02, 2016 9:57 am 
feos
If not inside an emulator, then how did you test it?


PostPosted: Thu Jun 02, 2016 9:59 am 
tepples
You may have missed this line:
Bisqwit wrote:
recording the NTSC signal from my QuickBASIC NES emulator, saving it into file, and decoding with this C++ program


Another way would involve a "preview" tool that works on BMP or PNG images.
  1. Load a screenshot of an NES game
  2. Convert RGB values to closest NES color numbers
  3. Run filter on each line
  4. Save filtered screenshot

These screenshots appear to be of one field. Normally the NES PPU's color subcarrier phase has a two-field sequence due to the missing dot between pre-render and the first line of picture on every other field. So to operate on still images (instead of live emulator output), use starting phase offsets 0 and 4 and average them.


This appears to be a generic NTSC signal decoder that operates at 12 times the color burst. I bet it wouldn't need any change to work on Apple II graphics, or on CGA graphics in the unofficial 640x200+color burst mode, which operate at 2 to 4 times the color burst: just feed each hires pixel 6 times, or each double hires pixel 3 times, with transitions delayed by 3 units in slivers whose bit 7 is set.

One optimization might be to produce output only for pixels at 0, 3, 7, 11, 14, ... which would have the side effect of also correcting the pixel aspect ratio to an even 1:2.


PostPosted: Thu Jun 02, 2016 10:07 am 
Bisqwit
As I just explained; maybe you didn't read it. My QuickBASIC NES emulator does contain an NTSC modem, and I save the generated NTSC signal into a file.*

Then this file gets read by the C++ program showcased in this thread. The C++ program generates RGB. The main program nearest-neighbor-scales the picture down to 640×240 (from the 2240×240 that it is), and writes into stdout. This output is piped into the following command, which renders it. ffplay -pix_fmt bgra -s 640x240 -vf scale=1280:960 -f rawvideo - And lastly a window screenshot is saved into a PNG file.

For the sake of completeness, here's the rest of the program.
Code:
#include <cmath>
#include <cstdio>     // std::fread, std::fwrite; popen() is from POSIX
#include <algorithm>
#include <iostream>

<<< Insert the function from the first post here>>>

int main()
{
    /* Read 280x240 image frames (2240*240), render them at 640*240 */
    FILE* fp = popen("lz4 -d </chii/bisqwit/qbexample/qbnes/ntsclog.bin.lz4", "r");
    FILE* fp2 = stdout;//fopen("rgb.bin", "wb");
    const unsigned W = 280 * 8;
    while(!feof(fp))
    {
        unsigned Screen[W*240];
        char Buf[W*240];
        for(int rem=sizeof(Buf), p=0, c; rem>0 && (c=std::fread(Buf+p,1,rem,fp)) > 0; rem-=c, p+=c)
            {}

        #pragma omp parallel for
        for(unsigned y=0; y<240; ++y)
            NTSC_DecodeLine(W, Buf+y*W, Screen+y*W, (11+y*4)%12);

        for(unsigned y=0; y<240; ++y)
            for(unsigned x=0; x<640; ++x)
            {
                unsigned p = Screen[y*W + (x*W/640)];
                std::fwrite(&p, 1, 4, fp2);
            }
    }
}

NES palette color numbers are not involved at any point.

*) Portions of the signal. The NTSC modem in that emulator actually processes contiguous signal for the whole frame, including the vblank, hsyncs, colorbursts and all that; but I only write into the file the portions corresponding to the visible region. Here's an example of that signal (LZ4 compressed): http://bisqwit.iki.fi/kala/snap/ntsc0_kby_log.lz4 And here's what it looks like if interpreted as monochrome television signal or using a digital oscilloscope, using lz4 -d < file.lz4 | ffplay -pix_fmt gray -s 2240x240 -vf scale=1280:960 -f rawvideo -:
http://bisqwit.iki.fi/kala/snap/ntsc0_kby_mono.png

tepples wrote:
This appears to be a generic NTSC signal decoder that operates at 12 times color burst. I bet it wouldn't need any change to work on Apple II graphics or CGA graphics in the unofficial 640x200+color burst mode, which operate at 2 to 4 times color burst: just feed each hires pixel 6 times or each double hires pixel 3 times, with transitions delayed by 3 units in slivers with bit 7 set to true.

One optimization might be to produce output only for pixels at 0, 3, 7, 11, 14, ... which would have the side effect of also correcting the pixel aspect ratio to an even 1:2.

You are correct.


Last edited by Bisqwit on Thu Jun 02, 2016 10:37 am, edited 1 time in total.

PostPosted: Thu Jun 02, 2016 10:35 am 
feos
It all makes sense now, thank you.

Can you feed it this ROM? Preferably, in both monochrome and color.


Attachments:
nes palette viewer.nes [24.02 KiB]
Downloaded 63 times
PostPosted: Thu Jun 02, 2016 10:43 am 
Bisqwit
feos wrote:
It all makes sense now, thank you.
Can you feed it this ROM? Preferably, in both monochrome and color.

[Two images: color render, grayscale render]
It seems to be a bit on the unsaturated side, but as I said before, I haven't particularly fine-tuned those settings. The second picture is not perfectly grayscale, because ffmpeg interprets the negative luma0-low value as a large positive number; a shortcoming of the simple ffmpeg command line I used. The pictures are also of different frames, which explains the different positions of the little artifacts.


PostPosted: Thu Jun 02, 2016 11:24 am 
feos
I just compared it to my FamicomAV captured through a tuner. It's damn close!

[Image: FamicomAV capture through a tuner]

Can you say a few words about how your filter adds moiré only to color borders? I can't immediately see it from the code.


PostPosted: Thu Jun 02, 2016 12:59 pm 
Bisqwit
The blue is kind of different though.

feos wrote:
Can you say a few words about how your filter adds moiré only to color borders? I can't immediately see it from the code.

It comes naturally from how NTSC works. To accurately determine both the color and its amplitude, you need 12 samples of the color signal (in terms of NES PPU half-clock cycles).
A color signal could look like this: LLLLHHHHHHLL, which repeated gives LLLLHHHHHHLLLLLLHHHHHHLLLLLLHHHHHHLLLLLLHHHHHHLL. A nice square wave.
Another color would look like this: LLHHHHHHLLLL, which repeated gives LLHHHHHHLLLLLLHHHHHHLLLLLLHHHHHHLLLLLLHHHHHHLLLL.
You can sample this square wave from any point you want, and you always get the same color, provided you know which phase you are comparing it against (this is the colorburst, or Phase0 in my decoder). But when these two signals are put next to each other:
LLLLHHHHHHLLLLLLHHHHHHLLLLHHHHHHLLLLLLHHHHHHLLLL
000000000000111111111111222222222222333333333333
And here in the middle of the transition, you get a signal sample like this:
HHHHLLLLHHHH
111111222222
Which decodes into something completely different from either source color. It's not even a nicely-proportioned square wave anymore: there's more "high" than "low" in it. That's how the color edge artifacts are born. That is the basic idea. This is compounded by the fact that NES pixels are 8 samples long, not 12, even though the color wavelength is 12 samples.

As for the staircase pattern, this is caused by the PPU clock cycle counter being different on each scanline, repeating every 3 scanlines. You can see this easily in the monochrome pictures shown earlier.
Suppose you have a 4x6-pixel region. Across this region, on consecutive clock cycles, the PPU signal generator clock could go like this:
pixel0..pixel1..pixel2..pixel3..
0123456789AB0123456789AB01234567
456789AB0123456789AB0123456789AB
89AB0123456789AB0123456789AB0123
0123456789AB0123456789AB01234567
456789AB0123456789AB0123456789AB
89AB0123456789AB0123456789AB0123
The color-wave clock (periodicity = 12) aligns differently with respect to screen pixels (periodicity = 8) on each scanline. Because of this, the pattern that gets falsely decoded between two pixels changes from scanline to the next. The pixels themselves do not move, but the decoding of colors produces different results at the edges of color changes.
There is nothing in the NTSC decoder that is responsible for producing or simulating this phenomenon. It is just a fact of how the NTSC signal is generated by the NES PPU in the first place.


PostPosted: Sun Jun 05, 2016 8:24 pm 
Bisqwit
I said earlier that the brightness & hue options had not been particularly fine-tuned. I have updated the code in the first post now. I generated a palette with Drag's tool at http://drag.wootest.net/misc/palgen.html , set brightness to -0.15 (the default was -0.20), set all clipping methods to desaturate, and downloaded a .pal file. Then I created a program that optimizes the parameters in my generator with a genetic algorithm until it produces a palette very close to what Drag's tool generated. (Tip: Instead of changing the sine wave offset, you can bake the phase rotation into the YIQ-RGB conversion matrix, for the same effect.)

Note that this process also involved changing the input data values.

Sample output:
[Image: sample output]

Here's a tiny version of the function, without a saturation adjustment knob:
Code:
void NTSC_DecodeLine(int Width, const char* Signal, unsigned* Target, unsigned Phase0)     
{
    static constexpr int Ywidth = 12, Cwidth = 23, Y = 168084/Ywidth;
    auto sine = [p=Phase0+11u](unsigned x)
        { return "\4\7\10\7\4\0\374\371\370\371\374"[(p+x)%12u]; }; // 8*sin((Phase0+x) * 2pi / 12)
    for(int y=0,i=0,q=0, s=0; s<Width; ++s)
    {
        y += Signal[s]             - (s>=Ywidth ? Signal[s-Ywidth]                       : 0);
        i += Signal[s] * sine(0+s) - (s>=Cwidth ? Signal[s-Cwidth] * sine(0+s-Cwidth%12) : 0);
        q += Signal[s] * sine(3+s) - (s>=Cwidth ? Signal[s-Cwidth] * sine(3+s-Cwidth%12) : 0);
        Target[s] = 0x1u * std::min(255u, unsigned(std::max(0, y*Y + i*(-25002/Cwidth) + q*( 40674/Cwidth))) >> 16)
                + 0x100u * std::min(255u, unsigned(std::max(0, y*Y + i*(  1852/Cwidth) + q*(-16386/Cwidth))) >> 16)
              + 0x10000u * std::min(255u, unsigned(std::max(0, y*Y + i*( 48343/Cwidth) + q*( 23925/Cwidth))) >> 16);
    }
}


EDIT: Here is a version that uses Intel AVX2 intrinsics to calculate eight pixels simultaneously with a single CPU core using the 256-bit YMM registers. It is significantly faster.
It requires a CPU capable of AVX2, i.e. Haswell or newer.

Code:
#include <immintrin.h>

void NTSC_DecodeLine(int Width, const char* Signal, unsigned* Target, unsigned Phase0)
{
    static constexpr int YW = 12, CW = 23;
    __m128i yiq = _mm_setzero_si128();
    for(int s=0; s<Width; s+=8)
    {
        auto to128lo = [](__m256i a) { return _mm256_extractf128_si256(a, 0); };
        auto to128hi = [](__m256i a) { return _mm256_extractf128_si256(a, 1); };
        auto combine128 = [](__m128i a,__m128i b) { return _mm256_insertf128_si256(_mm256_castsi128_si256(b), a, 1); };
        auto modulo12 = [&](__m256i v)
        {
            return _mm256_sub_epi16(v, _mm256_mullo_epi16(_mm256_set1_epi16(12),
                                       _mm256_srli_epi16(_mm256_mulhi_epu16(v,_mm256_set1_epi16(0xAAAB)),3)));
        };
        auto sine = [&](__m256i mod12) // 8*sin(x * 2pi / 12)
        {
            //static constexpr int st[12]={0,4,7,8,7,4,0,-4,-7,-8,-7,-4};
            //return _mm256_i32gather_epi32(st, mod12, 4);
            mod12 = _mm256_slli_epi32(mod12, 2); // Multiply by 4
            auto c = _mm256_set1_epi32(0x478740), t = _mm256_set1_epi32(24);
            // x<24 ? ((0x478740u>>(x)) & 0xF) : -((0x478740u>>(x-24)) & 0xF);
            return _mm256_blendv_epi8(
                _mm256_and_si256(_mm256_set1_epi32(0xF), _mm256_srlv_epi32(c, mod12)),
                _mm256_sub_epi32(_mm256_setzero_si256(),
                _mm256_and_si256(_mm256_set1_epi32(0xF), _mm256_srlv_epi32(c, _mm256_sub_epi32(mod12, t)))),
                _mm256_cmpgt_epi32(mod12, _mm256_set1_epi32(23)));
        };

        // Load signal samples. 3 values for each pixel (current, y-old, c-old)
        auto sigoffsets0 = _mm256_add_epi16(_mm256_set1_epi16(s),
            _mm256_set_epi16(3-CW,3-YW,7,5, 2-CW,2-YW,6,4,
                             1-CW,1-YW,3,1, 0-CW,0-YW,2,0));
        auto sigoffsets1 = _mm_add_epi16(_mm_set1_epi16(s),
            _mm_set_epi16(7-CW,7-YW,5-CW,5-YW, 6-CW,6-YW, 4-CW,4-YW));
        auto siggood0 = _mm256_cmpgt_epi16(sigoffsets0, _mm256_set1_epi16(-1));
        auto siggood1 =    _mm_cmpgt_epi16(sigoffsets1,    _mm_set1_epi16(-1));
        __m256i sigs0 = _mm256_sub_epi32(_mm256_and_si256(_mm256_add_epi32(_mm256_set1_epi32(128),
            _mm256_mask_i32gather_epi32(_mm256_setzero_si256(), (const int*)Signal,
            _mm256_cvtepi16_epi32(to128lo(sigoffsets0)),
            _mm256_cvtepi16_epi32(to128lo(siggood0)), 1)), _mm256_set1_epi32(0xFF)), _mm256_set1_epi32(128));
        __m256i sigs1 = _mm256_sub_epi32(_mm256_and_si256(_mm256_add_epi32(_mm256_set1_epi32(128),
            _mm256_mask_i32gather_epi32(_mm256_setzero_si256(), (const int*)Signal,
            _mm256_cvtepi16_epi32(to128hi(sigoffsets0)),
            _mm256_cvtepi16_epi32(to128hi(siggood0)), 1)), _mm256_set1_epi32(0xFF)), _mm256_set1_epi32(128));
        __m256i sigs2 = _mm256_sub_epi32(_mm256_and_si256(_mm256_add_epi32(_mm256_set1_epi32(128),
            _mm256_mask_i32gather_epi32(_mm256_setzero_si256(), (const int*)Signal,
            _mm256_cvtepi16_epi32(sigoffsets1),
            _mm256_cvtepi16_epi32(siggood1), 1)), _mm256_set1_epi32(0xFF)), _mm256_set1_epi32(128));

        // Load sinetable values. 4 values for each pixel (sin+cos now and back-then)
        __m256i nv0 = modulo12(_mm256_add_epi16(_mm256_set1_epi16(Phase0+s),
            _mm256_set_epi16(90-CW,87-CW,90,87, 89-CW,86-CW,89,86, 88-CW,85-CW,88,85, 87-CW,84-CW,87,84)));
        __m256i n0 = sine(_mm256_cvtepu16_epi32(to128lo(nv0))), n1 = sine(_mm256_cvtepu16_epi32(to128hi(nv0)));

        __m256i nv1 = modulo12(_mm256_add_epi16(_mm256_set1_epi16(Phase0+s),
            _mm256_set_epi16(94-CW,91-CW,94,91, 93-CW,90-CW,93,90, 92-CW,89-CW,92,89, 91-CW,88-CW,91,88)));
        __m256i n2 = sine(_mm256_cvtepu16_epi32(to128lo(nv1))), n3 = sine(_mm256_cvtepu16_epi32(to128hi(nv1)));

        // Multiply and subtract
        auto ca = _mm256_sub_epi32(
            _mm256_mullo_epi32(_mm256_shuffle_epi32(sigs0, 0x00), _mm256_blend_epi32(_mm256_set_epi32(0,0,0,1, 0,0,0,1), _mm256_shuffle_epi32(n0,16), 0x66)),
            _mm256_mullo_epi32(_mm256_shuffle_epi32(sigs0, 0xFE), _mm256_blend_epi32(_mm256_set_epi32(0,0,0,1, 0,0,0,1), _mm256_shuffle_epi32(n0,56), 0x66)));
        auto cb = _mm256_sub_epi32(
            _mm256_mullo_epi32(_mm256_shuffle_epi32(sigs0, 0x55), _mm256_blend_epi32(_mm256_set_epi32(0,0,0,1, 0,0,0,1), _mm256_shuffle_epi32(n1,16), 0x66)),
            _mm256_mullo_epi32(_mm256_shuffle_epi32(sigs1, 0xFE), _mm256_blend_epi32(_mm256_set_epi32(0,0,0,1, 0,0,0,1), _mm256_shuffle_epi32(n1,56), 0x66)));
        auto cc = _mm256_sub_epi32(
            _mm256_mullo_epi32(_mm256_shuffle_epi32(sigs1, 0x00), _mm256_blend_epi32(_mm256_set_epi32(0,0,0,1, 0,0,0,1), _mm256_shuffle_epi32(n2,16), 0x66)),
            _mm256_mullo_epi32(_mm256_shuffle_epi32(sigs2, 0x54), _mm256_blend_epi32(_mm256_set_epi32(0,0,0,1, 0,0,0,1), _mm256_shuffle_epi32(n2,56), 0x66)));
        auto cd = _mm256_sub_epi32(
            _mm256_mullo_epi32(_mm256_shuffle_epi32(sigs1, 0x55), _mm256_blend_epi32(_mm256_set_epi32(0,0,0,1, 0,0,0,1), _mm256_shuffle_epi32(n3,16), 0x66)),
            _mm256_mullo_epi32(_mm256_shuffle_epi32(sigs2, 0xFE), _mm256_blend_epi32(_mm256_set_epi32(0,0,0,1, 0,0,0,1), _mm256_shuffle_epi32(n3,56), 0x66)));
        __m128i a = _mm_add_epi32(yiq, to128lo(ca)), b = _mm_add_epi32(a, to128hi(ca)); __m256i com0=combine128(a,b);
        __m128i c = _mm_add_epi32(b,   to128lo(cb)), d = _mm_add_epi32(c, to128hi(cb)); __m256i com1=combine128(c,d);
        __m128i e = _mm_add_epi32(d,   to128lo(cc)), f = _mm_add_epi32(e, to128hi(cc)); __m256i com2=combine128(e,f);
        __m128i g = _mm_add_epi32(f,   to128lo(cd)), h = _mm_add_epi32(g, to128hi(cd)); __m256i com3=combine128(g,h); yiq = h;

        // Convert into RGB
        __m256i lo0 = _mm256_unpacklo_epi32(com0,com2), lo1 = _mm256_unpacklo_epi32(com1,com3);
        __m256i hi0 = _mm256_unpackhi_epi32(com0,com2), hi1 = _mm256_unpackhi_epi32(com1,com3);
        __m256i dy = _mm256_mullo_epi32(_mm256_set1_epi32( 168084/YW ),
                     _mm256_permutevar8x32_epi32(_mm256_unpacklo_epi32(lo0,lo1), _mm256_set_epi32(3,7,2,6,1,5,0,4)));
        __m256i di = _mm256_permutevar8x32_epi32(_mm256_unpackhi_epi32(lo0,lo1), _mm256_set_epi32(3,7,2,6,1,5,0,4));
        __m256i dq = _mm256_permutevar8x32_epi32(_mm256_unpacklo_epi32(hi0,hi1), _mm256_set_epi32(3,7,2,6,1,5,0,4));
        __m256i R = _mm256_max_epi32(_mm256_add_epi32(_mm256_add_epi32(dy,
                                     _mm256_mullo_epi32(di,_mm256_set1_epi32( -25002/CW ))),
                                     _mm256_mullo_epi32(dq,_mm256_set1_epi32(  40674/CW ))), _mm256_setzero_si256());
        __m256i G = _mm256_max_epi32(_mm256_add_epi32(_mm256_add_epi32(dy,
                                     _mm256_mullo_epi32(di,_mm256_set1_epi32(   1852/CW ))),
                                     _mm256_mullo_epi32(dq,_mm256_set1_epi32( -16386/CW ))), _mm256_setzero_si256());
        __m256i B = _mm256_max_epi32(_mm256_add_epi32(_mm256_add_epi32(dy,
                                     _mm256_mullo_epi32(di,_mm256_set1_epi32(  48343/CW ))),
                                     _mm256_mullo_epi32(dq,_mm256_set1_epi32(  23925/CW ))), _mm256_setzero_si256());
        R =                   _mm256_min_epu32(_mm256_set1_epi32(0xFF), _mm256_srli_epi32(R, 16));
        G = _mm256_slli_epi32(_mm256_min_epu32(_mm256_set1_epi32(0xFF), _mm256_srli_epi32(G, 16)), 8);
        B = _mm256_and_si256(_mm256_min_epu32(B, _mm256_set1_epi32(0xFF0000)), _mm256_set1_epi32(0xFF0000));
        _mm256_storeu_si256((__m256i*)(Target+s), _mm256_add_epi32(_mm256_add_epi32(R,G),B));
    }
}

