Random Access Peculiarities of Hadoop’s ABFSS support

Azure Data Lake Storage Accounts (Gen2) providing Blob Containers with hierarchical namespaces really are an interesting and versatile storage alternative to the (more or less outdated) Azure Data Lake Storage (Gen1) service.

That’s why they have been given a completely revamped filesystem abstraction in the Hadoop ecosystem (org.apache.hadoop.fs.azurebfs under the abfss:// scheme) alongside the already existing package for Gen1 (org.apache.hadoop.fs.adl under the adl:// scheme).

As is always the case with (good) abstractions, they hide most of the nitty-gritty details of dealing with a raw REST/authentication API such as OAuth2/WebHDFS in the case of ADL. But sometimes they (deliberately) behave very differently under the hood, much to the regret of anyone providing a service on top of them.

We have exactly such a setting: we have implemented a distributed parser which consists of

a) a single-threaded, but presumably very fast skip-scan initial phase

b) followed by a distributed parsing phase

Both phases operate “sequentially streaming” on their particular splits of the files. But it is phase a) which shows a severe performance degradation under abfss:// versus adl:// (for 200MB files, we have already seen slowdowns by a factor of 16; for 1GB files this grows to a factor of 40 and more).

We played a lot with buffer sizes and the like (which IMHO does not make sense if we want to skip most of the buffer anyway), and we were always surprised to see the most counterintuitive effects you could imagine.

It was not until I found this long-debated and finally accepted pull request that I understood what was happening. The sophisticated engineers (from Hortonworks?) really built a random access pattern detector into the stream, which is meant to optimize read-ahead and minimize network bloat!

Unfortunately (and our use case seems not to be the only one, as can be deduced from the pull request’s description), this detector kicks in no matter whether you seek() or skip(), rendering our scanning strategy basically useless. Fortunately, and thanks to Nokia’s persistence in that matter, there is now a switch in the form of the “fs.azure.read.alwaysReadBufferSize” option.
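
To make the access pattern of phase a) concrete, here is a minimal sketch of a skip-scan in Python. The record layout (a 4-byte big-endian length header followed by the payload) is purely hypothetical and not the format our parser actually handles; the point is only that the scan reads tiny headers and jumps over everything else, which is exactly the kind of forward skipping that the stream implementation has to handle well.

```python
import io
import struct

# Hypothetical record layout: 4-byte big-endian payload length, then the payload.
HEADER = struct.Struct(">I")


def skip_scan(stream):
    """Return the start offset of every record, reading only the headers."""
    offsets = []
    while True:
        pos = stream.tell()
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # end of stream (or a truncated trailing header)
        (length,) = HEADER.unpack(header)
        offsets.append(pos)
        # Jump over the payload instead of reading it.
        stream.seek(length, io.SEEK_CUR)
    return offsets


def make_sample(payloads):
    """Build an in-memory file in the hypothetical layout (for demonstration)."""
    return b"".join(HEADER.pack(len(p)) + p for p in payloads)


sample = make_sample([b"a" * 10, b"b" * 1000, b"c" * 3])
record_offsets = skip_scan(io.BytesIO(sample))
print(record_offsets)  # → [0, 14, 1018]
```

The resulting offsets are what a distributed phase b) could use as split boundaries. On a remote stream, each of those jumps is cheap only if the filesystem client does not turn it into a fresh small request.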

Set it to “true” and get predictable Gen1-like performance for your not-quite-random-access cases.
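
How the option reaches the filesystem depends on your deployment. As a rough sketch, the following Python helper renders it either as a core-site.xml fragment or as Spark properties. The helper names are made up for illustration, and the “spark.hadoop.” prefix assumes that you run on Spark, which passes such properties through to the Hadoop configuration.

```python
from xml.sax.saxutils import escape

# The option named above; the value is a string, as in any Hadoop property.
ABFS_SETTINGS = {"fs.azure.read.alwaysReadBufferSize": "true"}


def core_site_fragment(settings):
    """Render the settings as a core-site.xml style <configuration> block."""
    lines = ["<configuration>"]
    for name, value in settings.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(value)}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines)


def spark_conf(settings):
    """Prefix the keys so that Spark forwards them to the Hadoop configuration."""
    return {f"spark.hadoop.{name}": value for name, value in settings.items()}


fragment = core_site_fragment(ABFS_SETTINGS)
print(fragment)
print(spark_conf(ABFS_SETTINGS))
```

Keeping the setting in one dictionary makes it easy to apply the same value to every job that runs the scanning phase.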

Optimizing public meet.jit.si rooms for many participants (more than 30)

No! This will not be another post starting with “In the Corona crisis, …”. In fact, it starts with “No!”

In any case, people (your step-father) and groups (his regulars’ table, the orchestra, …) you would never have thought of have started to hold video conferences on public platforms, with the most diverse browsers and devices and with numbers of participants far beyond what is considered effective for collaboration. But hey, it’s better than no social exchange.

One of my preferred platforms is meet.jit.si for its simplicity, its scalable “out-of-the-box” behaviour and a (developer) community that really cares about such issues. Hence, with a little bit of background knowledge, public meet.jit.si rooms can be “tweaked” such that non-tech-savvy invitees enter the conference in the most resource-saving manner, which makes both their own conferencing experience and that of their companions satisfying.

Of course, the ultimate measure would be to set up your own, tailored instance of jit.si, but this is out of reach for most of the people this post is aimed at.

So usually, you start a meeting (room) by simply typing in a URL


Fortunately, jit.si accepts URL parameters in a very consistent manner, so we can simply add configurations by appending to it.


The default video resolution, which we force to the “medium” setting, has the biggest effect, followed by switching off some other fancy image and audio processing. The counterintuitive config.disableH264=true is debatable: usually H.264 is one of the most efficient codecs, but users have reported a performance issue in Firefox on macOS. If you will not host such companions at your virtual party, leave it out.

And if that is still not enough, you can trim down the video resolution even more:
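
If you hand out such links regularly, assembling them by hand gets error-prone. The sketch below builds an invitation URL from a dictionary of overrides. It assumes that the overrides go into the URL fragment (after “#”) as config.key=value pairs joined by “&”. The room name and the resolution value of 360 are placeholders of mine; only config.disableH264 is taken from the text above.

```python
from urllib.parse import quote

BASE_URL = "https://meet.jit.si"


def render(value):
    """Render booleans as true/false and integers as plain numbers."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    raise TypeError(f"unsupported override value: {value!r}")


def room_url(room, overrides):
    """Append config overrides to the room URL as a '#'-fragment."""
    url = f"{BASE_URL}/{quote(room)}"
    fragment = "&".join(
        f"config.{key}={render(value)}" for key, value in overrides.items()
    )
    return f"{url}#{fragment}" if fragment else url


# Placeholder room name and resolution; adjust both to your needs.
invite = room_url("MyHypotheticalRoom", {"resolution": 360, "disableH264": True})
print(invite)
```

Only booleans and integers are handled here on purpose; string-valued options may need extra quoting, which this sketch does not attempt.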


Reportedly, acquaintances have been able to run gatherings with more than 30 people. Beyond that, it really depends on the network quality and the attached devices whether you will experience some partial freezing. That said, please remember that you cannot really “test” the above parameters with only 1-2 participants, as jit.si won’t switch from P2P mode to conference mode (to which most of the above parameters apply) with fewer than 3 participants.

Decoding a Shark’s nervous system

So, time for a little calibration experiment. The flying shark is remote-controlled via infrared, so the best guess is that it uses ordinary IR codes. Hence, I’ve set up a digital IR receiver on the breadboard and used the well-known IRremote library to log/monitor the received readings via the serial port.


And this is how Frankenshark’s motor capabilities read over the wire:

// bootstrap for pairing & intermediate connect
// IR_Code: 444AD90D, Bits: 32
// IR_Code: 444AD90D, Bits: 32
// IR_Code: 444AD90D, Bits: 32
// IR_Code: 444AD90D, Bits: 32
// IR_Code: 444AD90D, Bits: 32
// IR_Code: 444AD90D, Bits: 32

// left flap
// IR_Code: 77CCE159, Bits: 32

// right flap
// IR_Code: 1DCF2C27, Bits: 32

// dive/nose down
// IR_Code: EF1AF1FB, Bits: 32

// climb/nose up
// IR_Code: 5F5B64A8, Bits: 32

// demo mode
// IR_Code: 6FECA321, Bits: 32
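
For later experiments, the readings above can be collected into a small lookup table. The following Python sketch parses log lines in the format shown and maps each code to its command; the command labels are my own shorthand for the comments in the log, and the function name is made up.

```python
import re

# Codes as captured above; the labels are shorthand for the log comments.
COMMANDS = {
    0x444AD90D: "pairing",
    0x77CCE159: "left flap",
    0x1DCF2C27: "right flap",
    0xEF1AF1FB: "dive",
    0x5F5B64A8: "climb",
    0x6FECA321: "demo mode",
}

LOG_LINE = re.compile(r"IR_Code:\s*([0-9A-Fa-f]+),\s*Bits:\s*(\d+)")


def decode(line):
    """Return the command for one serial log line, or None if it is no reading."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return COMMANDS.get(code, "unknown")


print(decode("// IR_Code: 77CCE159, Bits: 32"))  # → left flap
print(decode("// IR_Code: 5F5B64A8, Bits: 32"))  # → climb
```

The same table, read in the other direction, gives the codes an IR sender would have to transmit.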

Next, I need to order an IR sender and bloody helium …



Like onion skins,

shells of evolution stretch

around plants, the animals and man.


For them, truth is not a dimension.


So we must

chase ghosts,

collect phantoms.

Successfully importing 1.9-format VirtualBox machines into a 1.14 environment

A lot of prebuilt images on virtualboxes.org are still published in older formats.

If you try to import the machine.xml (which is called machine.vbox in current versions), you will be presented with an error message saying that the referenced machine.vdi file (which nowadays sits next to the vbox file) cannot be found in the media registry.

This is because in versions >=1.14, the media must be known either centrally in the VirtualBox.xml settings file (via the media manager) or locally in the vbox file (in which case it is automatically added to and removed from the media manager). Manually adding the VDI to the media manager does not always work.

So you can manipulate the .vbox file yourself:

The old 1.9 version looks like this. Please note the UUID of the VDI that needs to be registered; you will find it inside the AttachedDevice tag.

<?xml version="1.0"?>
<!--
** If you make changes to this file while any VirtualBox related application
** is running, your changes will be overwritten later, without taking effect.
** Use VBoxManage or the VirtualBox Manager GUI to make changes.
-->
<VirtualBox xmlns="http://www.innotek.de/VirtualBox-settings" version="1.9-macosx">
  <Machine uuid="{77f238d7-c16d-4082-8782-f542910ab1b3}" name="opensuse-11.2-x86" OSType="OpenSUSE" snapshotFolder="Snapshots" lastStateChange="2014-11-07T08:37:02Z">
    ...
    <StorageController name="Controller IDE" type="PIIX4" PortCount="2" useHostIOCache="true" Bootable="true">
      <AttachedDevice type="HardDisk" port="0" device="0">
        <Image uuid="{81e5deb2-a60f-4525-b219-858b5aefb54e}"/>
      </AttachedDevice>
    </StorageController>
    ...
  </Machine>
</VirtualBox>

And the manipulated 1.14 fake that can then be imported successfully looks like this:

<?xml version="1.0"?>
<!--
** If you make changes to this file while any VirtualBox related application
** is running, your changes will be overwritten later, without taking effect.
** Use VBoxManage or the VirtualBox Manager GUI to make changes.
-->
<VirtualBox xmlns="http://www.innotek.de/VirtualBox-settings" version="1.14-windows">
  <Machine uuid="{77f238d7-c16d-4082-8782-f542910ab1b3}" name="opensuse-11.2-x86" OSType="OpenSUSE" snapshotFolder="Snapshots" lastStateChange="2014-11-07T08:37:02Z">
    <MediaRegistry>
      <HardDisks>
        <HardDisk uuid="{81e5deb2-a60f-4525-b219-858b5aefb54e}" location="opensuse-11.2-x86.vdi" format="VDI" type="Normal"/>
      </HardDisks>
    </MediaRegistry>
    ...
  </Machine>
</VirtualBox>
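
If you have several such machines to convert, the manual edit can be scripted. The following Python sketch bumps the version attribute and registers the attached disk locally. It rests on two assumptions of mine: that the local registry entry belongs inside MediaRegistry/HardDisks below the Machine element, and that a shortened sample file like the one in the script is representative. The function name is made up, and the ElementTree round trip drops the warning comment at the top of the file.

```python
import xml.etree.ElementTree as ET

NS = "http://www.innotek.de/VirtualBox-settings"
ET.register_namespace("", NS)  # serialize with a default namespace, no prefix


def q(tag):
    """Qualify a tag name with the settings namespace."""
    return "{" + NS + "}" + tag


def upgrade_vbox(xml_text, vdi_location, version="1.14-windows"):
    """Bump the settings version and register the attached VDI locally."""
    root = ET.fromstring(xml_text)
    root.set("version", version)
    machine = root.find(q("Machine"))
    if machine is None:
        raise ValueError("no <Machine> element found")
    path = ".//" + q("AttachedDevice") + "[@type='HardDisk']/" + q("Image")
    image = machine.find(path)
    if image is None:
        raise ValueError("no attached hard disk image found")
    # Assumed layout: Machine > MediaRegistry > HardDisks > HardDisk
    registry = ET.Element(q("MediaRegistry"))
    disks = ET.SubElement(registry, q("HardDisks"))
    ET.SubElement(disks, q("HardDisk"), {
        "uuid": image.get("uuid"),
        "location": vdi_location,
        "format": "VDI",
        "type": "Normal",
    })
    machine.insert(0, registry)
    return ET.tostring(root, encoding="unicode")


# Shortened sample in the old format, for demonstration only.
SAMPLE = """<VirtualBox xmlns="http://www.innotek.de/VirtualBox-settings" version="1.9-macosx">
  <Machine uuid="{77f238d7-c16d-4082-8782-f542910ab1b3}" name="opensuse-11.2-x86">
    <StorageController name="Controller IDE">
      <AttachedDevice type="HardDisk" port="0" device="0">
        <Image uuid="{81e5deb2-a60f-4525-b219-858b5aefb54e}"/>
      </AttachedDevice>
    </StorageController>
  </Machine>
</VirtualBox>"""

upgraded = upgrade_vbox(SAMPLE, "opensuse-11.2-x86.vdi")
print(upgraded)
```

Work on a copy of the settings file and keep the original, since an import attempt may rewrite the file.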

That’s all folks!