April 15, 2014

OpenCL accelerated sqlite on Shamrock, an open source CPU only driver

Within the GPGPU team Gil Pitney has been working on Shamrock which is an open source OpenCL implementation. It’s really a friendly fork of the clover project but taken in a bit of a new direction.

Over the past few months Gil has updated it to make use of the new MCJIT from llvm which works much better for ARM processors. Further he’s updated Shamrock so that it uses current llvm. I have a build based on 3.5.0 on my chromebook.

The other part about Gil’s Shamrock work is that it will in time also have the ability to drive Keystone hardware, which is TI’s ARM + DSPs on board computing solution. Being able to drive DSPs with OpenCL is quite an awesome capability. I do wish I had one of those boards.

The other capability Shamrock has is to provide a CPU driver for OpenCL on ARM. How does it perform? Good question!

I took my OpenCL accelerated sqlite prototype and built it to use the Shamrock CPU only driver. Which would you expect to be faster: a CPU only OpenCL driver running offloaded SQL SELECT queries, or the sqlite engine itself?

If you guessed OpenCL running on a CPU only driver, you’re right. Now remember the Samsung ARM based chromebook is a dual A15. The queries are against 100,000 rows in a single table database with 7 columns. Lower numbers are better and times are in microseconds.

sql1 took 43653 microseconds
OpenCL handcoded-opencl/sql1.cl Interval took 17738 microseconds
OpenCL Shamrock 2.46x faster
sql2 took 62530 microseconds
OpenCL handcoded-opencl/sql2.cl Interval took 18168 microseconds
OpenCL Shamrock 3.44x faster
sql3 took 110095 microseconds
OpenCL handcoded-opencl/sql3.cl Interval took 18711 microseconds
OpenCL Shamrock 5.88x faster
sql4 took 143278 microseconds
OpenCL handcoded-opencl/sql4.cl Interval took 19612 microseconds
OpenCL Shamrock 7.30x faster
sql5 took 140398 microseconds
OpenCL handcoded-opencl/sql5.cl Interval took 18698 microseconds
OpenCL Shamrock 7.5x faster
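
The "x faster" figures are just the sqlite time divided by the OpenCL time. A minimal sketch to recompute them from the timings listed above (the `speedup` helper and the `timings` dictionary are made up for illustration, they are not part of the prototype):

```python
def speedup(sqlite_us, opencl_us):
    """Ratio of sqlite engine time to OpenCL time, both in microseconds."""
    return sqlite_us / float(opencl_us)

# (sqlite microseconds, OpenCL microseconds) copied from the results above
timings = {
    "sql1": (43653, 17738),
    "sql2": (62530, 18168),
    "sql3": (110095, 18711),
    "sql4": (143278, 19612),
    "sql5": (140398, 18698),
}

for name in sorted(timings):
    sqlite_us, opencl_us = timings[name]
    print("%s: %.2fx faster" % (name, speedup(sqlite_us, opencl_us)))
```

The recomputed ratios agree with the reported ones to within rounding in the last digit.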

These numbers for running on the CPU are pretty consistent and I was concerned there was some error in the process. Yet the returned number of matching rows is the same for both the sqlite engine and the OpenCL versions which helps detect functional problems. I’ve clipped the result row counts from the results above for brevity.

I wasn’t frankly expecting this kind of speed up, especially with a CPU only driver. Yet there it is in black and white. It does speak highly of the capabilities of OpenCL to be more efficient at computing when you have data parallel problems.

Another interesting thing to note in this comparison: the best results achieved so far have been with the Mali GPU using vload/vstore, which takes advantage of SIMD vector instructions. On a CPU this would equate to use of NEON. The Shamrock CPU only driver doesn’t at the moment have support for vload/vstore, so the compiled OpenCL kernel isn’t even using NEON on the CPU to achieve these results.

Posted in linaro, OpenCL, open_source | No Comments »

April 12, 2014

Building Android KitKat on OS X Mavericks

I run OSX on my laptop. (gasp!) I ssh into my various linux boxes to work on various projects. As I’m doing a little work with Renderscript and my sqlite acceleration project I thought it would be handy to build Android on my OS X laptop. Turns out it’s not entirely difficult and required just one fix to the code.


There are several projects to bring various linux/unix tools onto OSX. I use MacPorts. Brew is probably another good option. Either way this gives us a foundation of tools that the android build system is going to need.

The MacPorts install instructions offer an extra easy pkg option.

Next we need to install some software.

sudo port install coreutils findutils pngcrush gsed gnupg


Xcode is of course Apple’s development environment for OSX and iOS. You need it, and it can be installed directly out of the App Store.


Make sure you have java installed.

java -version
java version "1.6.0_65"

If you don’t, you’ll get a popup dialog that will ask if you want to install it. Do!


Make sure you have python installed. If I recall correctly that’s a default install with OSX Mavericks. There is also an option to install it via ports.

sudo port install python


Pull down repo.

mkdir -p ~/bin && curl http://commondatastorage.googleapis.com/git-repo-downloads/repo > ~/bin/repo
chmod a+x ~/bin/repo

Make sure you add your ~/bin to your PATH

export PATH="$PATH:$HOME/bin"

Android SDK tools

You need to download the android sdk tools built for the Mac. Download these from here. Extract. At this point I created an android directory and put the tools inside of it.

mkdir -p ~/android
mv <wherever>/android-sdk ~/android

Filesystem setup

OSX for all its joys doesn’t deal with case differences in its file system unless you specifically created the file system to do so. The default doesn’t. It’s not 8.3, but it’s still 1990s lame. So you’ll need to create a file system for the Android source code to live in.

Make sure you have the space in your file system. I created a 100 gig file system. I wouldn’t go below 50. I also put this onto my desktop. Makes it easy to double click later to mount it. Feel free to mount it where it works best for you. However remember this location!

hdiutil create -type SPARSE -fs "Case-sensitive Journaled HFS+" -size 100g -volname "android" -attach ~/Desktop/Android
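
Before kicking off a long sync it can be worth confirming that the new volume really does distinguish case. A minimal sketch of one way to check, assuming you pass it the directory where the volume ended up mounted (the `is_case_sensitive` helper is hypothetical, not part of any Android tooling):

```python
import os
import tempfile

def is_case_sensitive(directory):
    """Return True if the file system holding `directory` distinguishes case."""
    # Create a probe file whose name contains upper-case letters.
    with tempfile.NamedTemporaryFile(prefix="CaseProbe", dir=directory) as probe:
        # If the lower-cased name also resolves, the file system folds case.
        lowered = os.path.join(directory, os.path.basename(probe.name).lower())
        return not os.path.exists(lowered)

if __name__ == "__main__":
    # Checks the system temp directory by default; point it at the mounted volume instead.
    print(is_case_sensitive(tempfile.gettempdir()))
```

On a default OSX file system this should print False, and True on the case-sensitive volume.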

Android source code

Download as you normally would. (Note the cd to the location where you just attached the new case sensitive HFS file system.)

cd /Volumes/android
git clone http://android.googlesource.com/platform/manifest.git
(cd manifest && git branch -r)   # this will show you all the branch options. I was after the latest.
repo init -u https://android.googlesource.com/platform/manifest -b android-4.4_r1.2
repo sync

Environment Setup

We need to set up a few environment variables. First add the android sdk tools to your path, then set the locale variables.

export PATH=~/android/android-sdk/sdk/platform-tools:$PATH
export LC_CTYPE=C
export LANG=C

The One Fix

So in jni_generator.py there is a slight issue: it doesn’t handle the situation where one of the tool parameters (-fpreprocessed, which it passes to cpp) isn’t available. So we need to defensively work around it. (yeah yeah I should just post the patch)

In external/chromium_org/base/android/jni_generator/jni_generator.py

At the top of the file (around line 20) add

import platform

Then lower down add the following if to check for Darwin so that -fpreprocessed isn’t passed:

  def _RemoveComments(self, contents):
    # We need to support both inline and block comments, and we need to handle
    # strings that contain '//' or '/*'. Rather than trying to do all that with
    # regexps, we just pipe the contents through the C preprocessor. We tell cpp
    # the file has already been preprocessed, so it just removes comments and
    # doesn't try to parse #include, #pragma etc.
    #
    # TODO(husky): This is a bit hacky. It would be cleaner to use a real Java
    # parser. Maybe we could ditch JNIFromJavaSource and just always use
    # JNIFromJavaP; or maybe we could rewrite this script in Java and use APT.
    # http://code.google.com/p/chromium/issues/detail?id=138941
    system = platform.system()
    if system == 'Darwin':
      cpp_args = ['cpp']
    else:
      cpp_args = ['cpp', '-fpreprocessed']
    p = subprocess.Popen(args=cpp_args,
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    stdout, _ = p.communicate(contents)

Ready To Build

That’s it. At least I hope I captured everything I had to do. Build away.

Posted in android | No Comments »

April 10, 2014

Q1 2014 CEO Note

The announcement of five new members at Linaro Connect, including Club members Qualcomm, MediaTek and ZTE, is a significant endorsement of the value of Linaro’s business model, and further increases the ROI for every one of our members.

By coincidence the Linux Foundation recently published a report on collaborative engineering subtitled:  “Companies increase collaborative development, view as essential to success”.  The report points out that technology companies face enormous pressure to innovate faster and cut costs. A survey carried out for the Linux Foundation shows that professionals in a wide variety of leading technology companies recognize that collaborative software development is increasing in their organization and across industries, and indeed that it’s becoming central to their company’s mission.

Key findings are that:

  • Companies get involved in collaborative software development to advance business objectives and to be part of industry innovation
  • Investments in collaborative software development are on the rise
  • Individual developers and businesses both benefit from the trend toward collaboration
  • Business needs are driving increased adoption of collaborative development practices

This is a further strong validation of Linaro’s approach. Rather than each company itself developing non-value adding but key technology, duplicating effort and in many cases complicating and delaying the open source process, Linaro offers a different path: our members come together to engineer open source software to meet their common product needs once. The result is significantly lower costs for everyone and a higher quality deliverable given the shared effort and knowledge that has been utilized. Furthermore, using the same code base reduces maintenance costs and can accelerate time to market. Using this open source software as the basis of members’ products, coupled with the knowledge gained from being part of the development process, enables Linaro members to focus their own resources on their own value add and accelerates their ability to innovate in their chosen market segment(s). The Linux Foundation report supports the idea that members derive substantially more value from being inside Linaro than outside.

The current work on ARMv8 inside Linaro is a good example of the results that collaborative engineering can deliver. Many engineers in Linaro are working on ARMv8 projects, from boot architecture to virtualization, from kernel device drivers to Java, and the results speak for themselves. This effort is delivering code that all members with ARMv8 projects will be using.

While Linaro’s work is deliberately carried out as new work in the relevant open source projects (for example, currently against the Linux 3.14 mainline kernel tree), most members cannot easily leverage this new work directly into products. To address this challenge, members asked Linaro to provide a stable kernel, based on the Linux Long Term Supported (LTS) kernel, but including backports of the latest stable Linaro technology.

The Linaro Stable Kernel (LSK) is based on the most recent LTS release (currently Linux 3.10). Today the LSK (base and Android versions) includes many of the latest ARMv8 patches from later kernels backported by both ARM into LTS (and therefore reflected into LSK) and by Linaro directly into the LSK. At present this functionality can only be tested on models, and in a limited manner on member hardware that we have in house under the terms of a restricted use license.

As additional member hardware becomes available to Linaro we are committed to having the LSK support the functionality, quality and stability that members need for their initial products. Of course a balance between stability and new features must always be made, but our goal is to enable our members to use the LSK as the basis of their products now. This will be a significant demonstration of the benefits of collaborative engineering that have been advocated by Linaro and the Linux Foundation.

1 Linux Foundation Collaborative Development Trends Report 2014

The post Q1 2014 CEO Note appeared first on Linaro.

Posted in armv8, connect, kernel, Linaro Blog, linux, Linux on ARM, LSK | No Comments »

April 5, 2014

A day in the life…

This video is so perfect... I could literally write a book about it. If you're studying to be a software developer and are wondering what your future will hold, don't imagine your life as the next Zuckerberg, watch this video. Watch this video until it is no longer funny, because it is not funny, this will be your career:

What it’s like to be an engineer in a sales meeting
Posted in Uncategorized | No Comments »

April 2, 2014

sq-cl code posted and some words about vectors

I’ve posted my initial OpenCL accelerated sqlite prototype code:


Don’t get excited. Remember, it’s a prototype and a quite contrived one at that. It doesn’t handle the general case yet and of course it has bugs. But!  It’s interesting and I think shows what’s possible.

Over at the Mali developer community that ARM hosts, I happened to mention this work in a post, which ended up resulting in some good suggestions to use vectors as well as other good feedback. While working with vectors was a bit painful due to the introduction of some bugs on my part, I made my way through it and have some initial numbers with a couple of kernels so I can get an idea of just what a difference it makes.


The core of the algorithm for sql1 changes from:

    do {
        if ((data[offset].v > 60) && (data[offset].w < 0)) {
            resultArray[roffset].id = data[offset].id;
            resultArray[roffset].v = data[offset].v;
            resultArray[roffset].w = data[offset].w;
        }
    } while (endRow);

to:

    do {
        v1 = vload4(0, data1+offset);
        v2 = vload4(0, data2+offset);
        r = (v1 > 60) && ( 0 > v2);
        vstore4(r,0, resultMask+offset);
    } while (totalRows);

With each spin through the loop, the vectorized version of course is operating over 4 values at once to check for a match. Obvious win. To do this the data has to come in as pure columns, and I’m using a vector as essentially a bitmask to indicate if that row is a match or not. This requires a post processing loop to spin through and assemble the resulting data into a useful state. For the 100,000 row database I’m using it doesn’t seem to have as much of a performance impact as I thought it might.
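
To make the mask-then-gather idea concrete, here is a rough sketch of it in plain Python rather than OpenCL. It assumes the same predicate as sql1 (v > 60 and w < 0), column data held in separate lists, and uses 1 to mark a match; the function names and the sample data are invented for illustration and are not taken from the prototype.

```python
def match_mask(v_col, w_col, width=4):
    """Evaluate (v > 60) && (w < 0) over `width` rows per step, in the spirit of vload4/vstore4."""
    mask = [0] * len(v_col)
    for offset in range(0, len(v_col), width):
        v1 = v_col[offset:offset + width]   # stands in for vload4 on the v column
        v2 = w_col[offset:offset + width]   # stands in for vload4 on the w column
        # One mask entry per row: 1 for a match, 0 otherwise.
        mask[offset:offset + width] = [int(v > 60 and w < 0) for v, w in zip(v1, v2)]
    return mask

def gather_matches(ids, v_col, w_col, mask):
    """Post processing pass: turn the per-row mask back into result rows."""
    return [(ids[i], v_col[i], w_col[i]) for i, hit in enumerate(mask) if hit]

# Hypothetical sample columns, six rows.
ids = [1, 2, 3, 4, 5, 6]
v_col = [61, 10, 99, 75, 60, 80]
w_col = [-1, -5, 3, -2, -9, -7]

mask = match_mask(v_col, w_col)
rows = gather_matches(ids, v_col, w_col, mask)
print(mask)   # [1, 0, 0, 1, 0, 1]
print(rows)   # [(1, 61, -1), (4, 75, -2), (6, 80, -7)]
```

The gather step is the extra pass mentioned above: it is serial, but it only touches the mask and copies out the rows that matched.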

For the first sql1 test query the numbers look like this:

CPU sql1 took 43631 microseconds
OpenCL sql1  took 14545 microseconds  (2.99x or 199% better)
OpenCL (using vectors) took 4114 microseconds (10.6x or 960% better)

Not bad. sql3 sees even better results:

CPU sql3 took 111020 microseconds
OpenCL sql3 took 44533 microseconds (2.49x  or 149% better)
OpenCL (using vectors) took 4436 microseconds (25.02x or 2402% better)
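
For anyone wondering how the "x" and "% better" figures relate: the percentage is the ratio minus one, times 100, and the numbers above look like they were truncated rather than rounded. A small sketch under that assumption (the `percent_better` helper is made up for illustration):

```python
def percent_better(cpu_us, opencl_us):
    """Percent improvement of the OpenCL time over the CPU time, truncated to an integer."""
    ratio = cpu_us / float(opencl_us)
    return int((ratio - 1.0) * 100)

print(percent_better(43631, 14545))    # sql1, plain OpenCL
print(percent_better(43631, 4114))     # sql1, vectors
print(percent_better(111020, 44533))   # sql3, plain OpenCL
print(percent_better(111020, 4436))    # sql3, vectors
```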

There’s another reason why these vectorized versions are doing better. With the newer code I am using fewer registers on the Mali GPU and thus am able to up the number of work units from 64 to 128.

I do have one bug that I need to track down. I am (of course) validating that all the versions are coming up with the same matches. The new vector versions are off by a couple of rows. The missing rows don’t seem to follow a pattern. I’m sure I’ve done something dumb. Now that there is the ability for more eyes on the code perhaps someone will spot it.

Posted in linaro, OpenCL, open_source | No Comments »