Random Access Peculiarities of Hadoop’s ABFSS support

Azure Data Lake Storage (Gen2) accounts, which provide blob containers with hierarchical namespaces, really are an interesting and versatile storage alternative to the (more or less outdated) Azure Data Lake Storage (Gen1) service.

That’s why they were given a completely revamped filesystem abstraction in the Hadoop ecosystem (org.apache.hadoop.fs.azurebfs under the abfss:// scheme) alongside the already existing packages (org.apache.hadoop.fs.adl under the adl:// scheme).

As is always the case with (good) abstractions, they hide most of the nifty details of dealing with a raw REST/authentication API, such as OAuth2/WebHDFS in the case of ADL. But sometimes they (deliberately) behave very differently under the hood, to the utter regret of whoever provides a service on top of them.

In our case, we have exactly such a setting: we have implemented a distributed parser that consists of

a) a single-threaded, but presumably very fast, initial skip-scan phase,

b) followed by a distributed parsing phase.

Both phases stream sequentially over their particular splits of the files, as sketched below. But it is phase a) that shows severe performance degradation under abfss:// compared to adl://: for 200 MB files we have already seen slowdown factors of 16, and for 1 GB files this accumulates to factors of 40 and more.
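To make that access pattern concrete, here is a minimal sketch of what phase a) does. The record-header layout (HEADER_LEN, decodeLength) is purely hypothetical, but the seek-read-skip loop is the essence of it:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SkipScan {

    // Hypothetical fixed-size record header; the real format differs, but the
    // access pattern is the same: read a few bytes, then jump far ahead.
    private static final int HEADER_LEN = 16;

    public static void scan(Path file, long splitStart, long splitEnd) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = file.getFileSystem(conf);
        try (FSDataInputStream in = fs.open(file)) {
            byte[] header = new byte[HEADER_LEN];
            long pos = splitStart;
            while (pos + HEADER_LEN <= splitEnd) {
                in.seek(pos);             // jump to the next record header
                in.readFully(header);     // read only the header ...
                long payloadLen = decodeLength(header);
                pos += HEADER_LEN + payloadLen; // ... and skip the payload entirely
            }
        }
    }

    // Hypothetical: payload length as a 4-byte big-endian integer at offset 0.
    private static long decodeLength(byte[] header) {
        return ((header[0] & 0xFFL) << 24) | ((header[1] & 0xFFL) << 16)
             | ((header[2] & 0xFFL) << 8)  |  (header[3] & 0xFFL);
    }
}
```

Note that each iteration touches only a handful of bytes; how many bytes actually travel over the wire is entirely up to the underlying stream implementation.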

We played a lot with buffer sizes and the like (which does not make much sense IMHO if we want to skip most of the buffer anyway), and we were always surprised to see the most counterintuitive effects you could imagine.
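For reference, these are the kinds of knobs we turned (the option names below exist in recent hadoop-azure releases as far as I can tell; treat the values as examples, not recommendations):

```java
import org.apache.hadoop.conf.Configuration;

public class AbfsBufferTuning {

    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Size of a single ABFS read request in bytes (the read buffer).
        conf.set("fs.azure.read.request.size", String.valueOf(4 * 1024 * 1024));
        // Depth of the read-ahead queue per input stream.
        conf.set("fs.azure.readaheadqueue.depth", "2");
        return conf;
    }
}
```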

So it was not until I found this somewhat long-debated and finally accepted pull request that I could piece together what was happening: the sophisticated engineers (from Hortonworks?) really built a random access pattern detector into the stream, which is supposed to optimize read-ahead and minimize network bloat!
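I have not traced the actual AbfsInputStream code line by line, so take the following as my mental model of the detector, not as the real implementation:

```java
// Sketch of the read-mode heuristic as I understand it from the pull request;
// NOT the actual AbfsInputStream code.
public class ReadModeDetector {

    private boolean firstRead = true;
    private boolean randomAccessDetected = false;
    private long expectedNextPos = 0;

    // Decide how many bytes to actually fetch for a read of 'requestedLen'
    // bytes at 'pos', given a configured read buffer of 'bufferSize' bytes.
    public int bytesToFetch(long pos, int requestedLen, int bufferSize) {
        if (!firstRead && pos != expectedNextPos) {
            // A single non-contiguous read flips the stream into "random" mode.
            randomAccessDetected = true;
        }
        firstRead = false;
        expectedNextPos = pos + requestedLen; // where a sequential reader would continue
        return randomAccessDetected
                ? requestedLen   // random mode: fetch only what was asked for
                : bufferSize;    // sequential mode: fill the buffer, read ahead
    }
}
```

If that model is right, the irony is that the “optimization” pays one HTTPS round trip per tiny header read in random mode, whereas filling the whole buffer would serve many consecutive header reads from memory, as long as most records fit within one buffer.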

Unfortunately (and judging from the pull request’s description, our use case does not seem to be the only one), this detector kicks in regardless of whether you seek() or skip(), rendering our scanning strategy basically useless. Fortunately, and thanks to Nokia’s persistence in the matter, there is now a switch in the form of the “fs.azure.read.alwaysReadBufferSize” option.

Set it to “true” and get predictable Gen1-like performance for your not-quite-random-access cases.
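In code, assuming a hadoop-azure version recent enough to ship the option (you can of course also put the same key into core-site.xml), that amounts to:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Gen1LikeReads {

    public static FileSystem openWithFullBufferReads(String uri) throws IOException {
        Configuration conf = new Configuration();
        // Always fill the configured read buffer, even when the stream has
        // classified the access pattern as random.
        conf.setBoolean("fs.azure.read.alwaysReadBufferSize", true);
        return new Path(uri).getFileSystem(conf);
    }
}
```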
