Archive for the ‘linaro’ Category

April 24, 2014

64-bit ARM usermode emulation in QEMU 2.0.0

The QEMU Project released version 2.0.0 of QEMU last week; this seems like a good time to summarise our progress with ARMv8 QEMU work.

One of the major new ARM related features in this release is support for emulating AArch64 processes in QEMU’s “linux-user” mode; in Linaro we’ve been working on this over the last few months (building on a great foundation established by SUSE) and we just managed to squeeze support for the last few instructions into 2.0.0.

“linux-user” mode is where we run a single Linux guest binary, and QEMU converts the system calls the guest makes into system calls to the host Linux kernel. Typically you’d use this to run an AArch64 binary on a more conveniently available host, usually x86_64, by setting up a cross-architecture chroot and putting QEMU in it. We’ve implemented support for all the mandatory A64 instructions, including floating point and Advanced SIMD, but not the optional instructions in the crypto and CRC extensions.
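As a small illustration of a guest system call passing through QEMU, here is a minimal sketch of a program one might cross-compile for AArch64 and run in linux-user mode. The helper name `machine_name` is hypothetical and not part of QEMU. It simply issues the `uname` system call. Under linux-user emulation one would expect the reported machine to describe the emulated guest rather than the host, but treat that as an assumption to check rather than something this post states.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Hypothetical helper: copy the machine name reported by the uname
 * system call into buf. Returns 0 on success, -1 if uname fails.
 * When run as a guest binary, the uname call is one of the system
 * calls that QEMU's linux-user mode has to handle on the guest's behalf.
 */
int machine_name(char *buf, size_t len)
{
    struct utsname u;

    assert(buf != NULL && len > 0);
    if (uname(&u) != 0)
        return -1;
    /* snprintf always NUL-terminates, truncating if buf is too small. */
    snprintf(buf, len, "%s", u.machine);
    return 0;
}
```

Calling `machine_name(buf, sizeof buf)` from `main` and printing `buf` is enough to compare a native run with an emulated one.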

As well as adding an entirely new instruction set for 64 bit support, the ARMv8 architecture included a few new instructions for the 32 bit A32 and T32 instruction sets. QEMU also now implements all the mandatory new instructions, though this will for the moment probably mostly be of use only to people running compiler test suites.

Two other uses for QEMU involve running it on AArch64 hardware. Firstly, you can use it to emulate other CPU architectures on AArch64 hosts, for instance running an x86 kernel in an emulated machine. This was contributed by Huawei last year, and has been supported since the previous release of QEMU (1.7).

You can also use QEMU as the userspace device emulation part of a virtual machine which uses KVM and the hardware’s virtualization extensions to provide fast AArch64-on-AArch64 VMs. This too has been supported since 1.7, though some features are not yet implemented (for instance, VM migration and debugging a guest VM are both not currently supported).

The final use for QEMU I want to talk about is the only one which isn’t in the 2.0.0 release, but many people have been waiting for it so here’s a status update. AArch64 system emulation is where you emulate a complete system and boot a full system including an AArch64 Linux kernel and user space, typically running on an x86 host. We’re working on this right now, and in fact as soon as QEMU’s git repository reopened for development after the 2.0.0 release we landed a large set of patches which implement all the necessary CPU emulation support. The only remaining missing piece in upstream QEMU master to be able to boot a kernel is to add support for running the “virt” board model with a Cortex-A57 and a GICv2 with an appropriate register layout. This last bit of work should be done shortly.

If you want to try out QEMU 2.0.0 you can build it yourself from the upstream released tarballs. If you’re an Ubuntu user then you’re in luck, because these changes are also in the QEMU shipped in the newly released Ubuntu 14.04 LTS.

Posted in linaro, qemu | No Comments »

Linaro 14.04 Release Now Available for Download!

 “The world is full of magical things patiently waiting for our wits to grow sharper.” ~ Bertrand Russell

Linaro 14.04 release is now available for download.  See the detailed highlights of this release to get an overview of what has been accomplished by the Working Groups, Landing Teams and Platform Teams. The release details are linked from the Details column for each released artifact on the release information:

We encourage everybody to use the 14.04 release.

This post includes links to more information and instructions for using the images. The download links for all images and components are available on our downloads page:

USING THE ANDROID-BASED IMAGES

The Android-based images come in three parts: system, userdata and boot. These need to be combined to form a complete Android install. For an explanation of how to do this please see:

If you are interested in getting the source and building these images yourself please see the following pages:

USING THE UBUNTU-BASED IMAGES

The Ubuntu-based images consist of two parts. The first part is a hardware pack, which can be found under the hwpacks directory and contains hardware specific packages (such as the kernel and bootloader). The second part is the rootfs, which is combined with the hardware pack to create a complete image. For more information on how to create an image please see:

USING THE OPEN EMBEDDED-BASED IMAGES

With the Linaro provided downloads and with ARM’s Fast Models virtual platform, you may boot a virtual ARMv8 system and run 64-bit binaries.  For more information please see:

GETTING INVOLVED

More information on Linaro can be found on our websites:

Also subscribe to the important Linaro mailing lists and join our IRC channels to stay on top of Linaro developments:

KNOWN ISSUES WITH THIS RELEASE

For any errata issues, please see:

Bug reports for this release should be filed in Launchpad against the individual packages that are affected. If a suitable package cannot be identified, feel free to assign them to:

UPCOMING LINARO CONNECT EVENTS: LINARO CONNECT USA 2014

Registration for Linaro Connect USA 2014 (LCU14), which will be in Burlingame, California from September 15 – 19, 2014 is now open.  More information on this event can be found at: http://www.linaro.org/connect/lcu/lcu14/

 


Posted in android, Evaluation builds, linaro, Linaro Blog, Release, release cycle, software, Toolchain, ubuntu | No Comments »

April 15, 2014

OpenCL accelerated sqlite on Shamrock, an open source CPU only driver

Within the GPGPU team Gil Pitney has been working on Shamrock which is an open source OpenCL implementation. It’s really a friendly fork of the clover project but taken in a bit of a new direction.

Over the past few months Gil has updated it to make use of the new MCJIT from llvm which works much better for ARM processors. Further he’s updated Shamrock so that it uses current llvm. I have a build based on 3.5.0 on my chromebook.

The other part of Gil’s Shamrock work is that it will in time also have the ability to drive Keystone hardware, which is TI’s ARM + DSPs on board computing solution. Being able to drive DSPs with OpenCL is quite an awesome capability. I do wish I had one of those boards.

The other capability Shamrock has is to provide a CPU driver for OpenCL on ARM. How does it perform? Good question!

I took my OpenCL accelerated sqlite prototype and built it to use the Shamrock CPU only driver. Which would you expect to be faster: a CPU only OpenCL driver offloading SQL SELECT queries, or the sqlite engine?

If you guessed OpenCL running on a CPU only driver, you’re right. Now remember the Samsung ARM based chromebook is a dual A15. The queries are against 100,000 rows in a single table database with 7 columns. Lower numbers are better, and times are in microseconds.

sql1 took 43653 microseconds
OpenCL handcoded-opencl/sql1.cl Interval took 17738 microseconds
OpenCL Shamrock 2.46x faster
sql2 took 62530 microseconds
OpenCL handcoded-opencl/sql2.cl Interval took 18168 microseconds
OpenCL Shamrock 3.44x faster
sql3 took 110095 microseconds
OpenCL handcoded-opencl/sql3.cl Interval took 18711 microseconds
OpenCL Shamrock 5.88x faster
sql4 took 143278 microseconds
OpenCL handcoded-opencl/sql4.cl Interval took 19612 microseconds
OpenCL Shamrock 7.30x faster
sql5 took 140398 microseconds
OpenCL handcoded-opencl/sql5.cl Interval took 18698 microseconds
OpenCL Shamrock 7.5x faster

These numbers for running on the CPU are pretty consistent and I was concerned there was some error in the process. Yet the returned number of matching rows is the same for both the sqlite engine and the OpenCL versions which helps detect functional problems. I’ve clipped the result row counts from the results above for brevity.

I wasn’t frankly expecting this kind of speed up, especially with a CPU only driver. Yet there it is in black and white. It does speak highly of the capabilities of OpenCL to be more efficient at computing when you have data parallel problems.

Another interesting thing to note in this comparison: the best results so far have been achieved on the Mali GPU using vload/vstore, which take advantage of SIMD vector instructions. On a CPU this would equate to use of NEON. The Shamrock CPU only driver doesn’t at the moment have support for vload/vstore, so the compiled OpenCL kernel isn’t even using NEON on the CPU to achieve these results.


Posted in linaro, OpenCL, open_source | No Comments »

April 2, 2014

sq-cl code posted and some words about vectors

I’ve posted my initial OpenCL accelerated sqlite prototype code:

http://git.linaro.org/people/tom.gall/sq-cl.git

Don’t get excited. Remember, it’s a prototype and a quite contrived one at that. It doesn’t handle the general case yet and of course it has bugs. But!  It’s interesting and I think shows what’s possible.

Over at the Mali developer community that ARM hosts, I happened to mention this work in a post, which resulted in some good suggestions to use vectors as well as other good feedback. While working with vectors was a bit painful due to the introduction of some bugs on my part, I made my way through it and have some initial numbers with a couple of kernels, so I can get an idea of just what a difference it makes.

A lot.

The core of the algorithm for sql1 changes from:

    do {
        if ((data[offset].v > 60) && (data[offset].w < 0)) {
            resultArray[roffset].id = data[offset].id;
            resultArray[roffset].v = data[offset].v;
            resultArray[roffset].w = data[offset].w;
            roffset++;
        }
        offset++;
        endRow--;
    } while (endRow);

to:

    do {
        v1 = vload4(0, data1+offset);
        v2 = vload4(0, data2+offset);
        r = (v1 > 60) && ( 0 > v2);
        vstore4(r,0, resultMask+offset);
        offset+=4;
        totalRows--;
    } while (totalRows);

With each spin through the loop, the vectorized version of course is operating over 4 values at once to check for a match. Obvious win. To do this the data has to come in as pure columns, and I’m using a vector as essentially a bitmask to indicate whether each row is a match or not. This requires a post processing loop to spin through and assemble the resulting data into a useful state. For the 100,000 row database I’m using it doesn’t seem to have as much of a performance impact as I thought it might.
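The post processing pass can be sketched on the host side in plain C. This is only an illustration under stated assumptions, not the code in the sq-cl repository: the `row_t` layout and the names `build_mask` and `gather_matches` are hypothetical, and `build_mask` is a scalar stand-in for the kernel that applies the same `v > 60` and `w < 0` test so the result can be cross-checked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical row layout, mirroring the id/v/w fields of the scalar kernel. */
typedef struct {
    int id;
    int v;
    int w;
} row_t;

/*
 * Scalar stand-in for the vectorized kernel: write one mask entry per row.
 * OpenCL vector comparisons yield -1 (all bits set) in lanes that match
 * and 0 elsewhere, so the same convention is used here.
 */
void build_mask(const int *v, const int *w, size_t rows, int *mask)
{
    assert(v && w && mask);
    for (size_t i = 0; i < rows; i++)
        mask[i] = (v[i] > 60 && w[i] < 0) ? -1 : 0;
}

/*
 * Post processing: walk the mask and copy matching rows from the
 * column arrays into a dense result array. Returns the match count.
 * Any non-zero mask entry is treated as a match.
 */
size_t gather_matches(const int *ids, const int *v, const int *w,
                      const int *mask, size_t rows, row_t *out)
{
    size_t n = 0;

    assert(ids && v && w && mask && out);
    for (size_t i = 0; i < rows; i++) {
        if (mask[i] != 0) {
            out[n].id = ids[i];
            out[n].v = v[i];
            out[n].w = w[i];
            n++;
        }
    }
    return n;
}
```

The gather is a single linear pass over the mask, which fits with the observation that it costs little at 100,000 rows. Comparing the mask from `build_mask` with the one the kernel writes is one way to look for rows that differ between versions.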

For the first sql1 test query the numbers look like this:

CPU sql1 took 43631 microseconds
OpenCL sql1  took 14545 microseconds  (2.99x or 199% better)
OpenCL (using vectors) took 4114 microseconds (10.6x or 960% better)

Not bad. sql3 sees even better results:

CPU sql3 took 111020 microseconds
OpenCL sql3 took 44533 microseconds (2.49x  or 149% better)
OpenCL (using vectors) took 4436 microseconds (25.02x or 2402% better)
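For anyone checking the arithmetic, the two ways of quoting the improvement are related as follows. This is a small sketch with hypothetical helper names: the speedup factor is the baseline time divided by the OpenCL time, and "percent better" is that factor minus one, times 100.

```c
#include <assert.h>

/* Speedup factor: how many times faster the OpenCL run is than the baseline. */
double speedup(double baseline_us, double opencl_us)
{
    assert(baseline_us > 0.0 && opencl_us > 0.0);
    return baseline_us / opencl_us;
}

/* "Percent better" as quoted in the listings: a 2x speedup is 100% better. */
double percent_better(double baseline_us, double opencl_us)
{
    return (speedup(baseline_us, opencl_us) - 1.0) * 100.0;
}
```

With the sql3 timings, `speedup(111020, 4436)` comes out at a little over 25, which is where the roughly 2402% figure comes from.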

There’s another reason why these vectorized versions are doing better. With the newer code I am using fewer registers on the Mali GPU and thus am able to up the number of work units from 64 to 128.

I do have one bug that I need to track down. I am (of course) validating that all the versions are coming up with the same matches. The new vector versions are off by a couple of rows. The missing rows don’t seem to follow a pattern. I’m sure I’ve done something dumb. Now that there is the ability for more eyes on the code perhaps someone will spot it.


Posted in linaro, OpenCL, open_source | No Comments »

March 27, 2014

Linaro 14.03 Release Now Available for Download!

When you have a great and difficult task, something perhaps almost impossible, if you only work a little at a time, every day a little, suddenly the work will finish itself.  ~  Isak Dinesen (Karen Blixen)

See the detailed highlights of this release to get an overview of what has been accomplished by the Working Groups, Landing Teams and Platform Teams. The release details are linked from the Details column for each released artifact on the release information:

https://wiki.linaro.org/Cycles/1403/Release#Release_Information

This post includes links to more information and instructions for using the images. The download links for all images and components are available on our downloads page:

USING THE ANDROID-BASED IMAGES

The Android-based images come in three parts: system, userdata and boot. These need to be combined to form a complete Android install. For an explanation of how to do this please see:

If you are interested in getting the source and building these images yourself please see the following pages:

USING THE UBUNTU-BASED IMAGES

The Ubuntu-based images consist of two parts. The first part is a hardware pack, which can be found under the hwpacks directory and contains hardware specific packages (such as the kernel and bootloader). The second part is the rootfs, which is combined with the hardware pack to create a complete image. For more information on how to create an image please see:

USING THE OPEN EMBEDDED-BASED IMAGES

With the Linaro provided downloads and with ARM’s Fast Models virtual platform, you may boot a virtual ARMv8 system and run 64-bit binaries.  For more information please see:

GETTING INVOLVED

More information on Linaro can be found on our websites:

Also subscribe to the important Linaro mailing lists and join our IRC channels to stay on top of Linaro developments:

KNOWN ISSUES WITH THIS RELEASE

For any errata issues, please see:

Bug reports for this release should be filed in Launchpad against the individual packages that are affected. If a suitable package cannot be identified, feel free to assign them to:

UPCOMING LINARO CONNECT EVENTS: LINARO CONNECT USA (LCU14)

Registration for Linaro Connect USA (LCU14), which will be in Burlingame, California from September 15 – 19, 2014 is now open.  More information on this event can be found at: http://www.linaro.org/connect-lcu14

Posted in android, big.little, connect, CortexA8, CortexA9, Evaluation builds, kernel, linaro, Linaro Connect, linux, Linux on ARM, release cycle, Releases, Toolchain, ubuntu | No Comments »