The helium-based blood preserve as well as the digital spinal cord (IR transmitter) of Frankenshark just arrived by courier …

# Decoding a Shark`s nervous system

So, time for a little calibration experiment. Flying shark is remotely controlled via infrared, so the best guess is via ordinary IR codes. Hence, I’ve setup a digital IR receiver on the breadboard and used the well-known IRRemote library to log/monitor the received readings via the serial port.

An this is how Frankenshark`s motoric capabilities read over the wire:

// bootstrap for pairing & intermediate connect

// IR_Code: 444AD90D, Bits: 32

// IR_Code: 444AD90D, Bits: 32

// IR_Code: 444AD90D, Bits: 32

// IR_Code: 444AD90D, Bits: 32

// IR_Code: 444AD90D, Bits: 32

// IR_Code: 444AD90D, Bits: 32

// left flap

// IR_Code: 77CCE159, Bits: 32

// right flap

// IR_Code: 1DCF2C27, Bits: 32

// dive/nose down

// IR_Code: EF1AF1FB, Bits: 32

// climb/nose up

// IR_Code: 5F5B64A8, Bits: 32

// demo mode

// IR_Code: 6FECA321, Bits: 32

Next, I need to order an IR sender and bloody helium …

# Blup, Blup

Yesterday, Frankensharks mind began to glow.

Today, his body arrived.

# Blink, Blink

This is the start of a new project on autonomous behaviour.

At least, it can already autonomously blink 😦

Bliiiink, blink ….

# Sinnsucher

Gleich Zwiebelhäuten spannen sich

Schalen der Evolution

um Pflanzen, die Tiere und den Mensch.

Für sie ist Wahrheit keine Dimension.

So müssen wir

Gespenster jagen,

Phantome sammeln.

# Successfully importing 1.9er VirtualBox Machines into a 1.14 environment

A lot of prebuilt images on virtualboxes.org are still submitted in older formats.

If trying to import the machine.xml (which is in current versions called machine.vbox), you will be presented with the error message that the referred machine.vdi file (which is currently co-located to the vbox file) cannot be found in the media registry.

This is because in versions >=1.14, the media must be either centrally known in the VirtualBox.xml settings file (via the media manager) or locally to the vbox file (which is then automatically added and removed from the media manager). Manually adding the VDI to the media manager does not always work.

So you could manipulate the .vbox file by yourself:

Old 1.9 version looks like this, please note that you will see the UUID of the VDI that needs to be registered in the tag AttachedDevice.

<?xml version="1.0"?> <!-- ** DO NOT EDIT THIS FILE. ** If you make changes to this file while any VirtualBox related application ** is running, your changes will be overwritten later, without taking effect. ** Use VBoxManage or the VirtualBox Manager GUI to make changes. --> <VirtualBox xmlns="http://www.innotek.de/VirtualBox-settings" version="1.9-macosx"> <Machine uuid="{77f238d7-c16d-4082-8782-f542910ab1b3}" name="opensuse-11.2-x86" OSType="OpenSUSE" snapshotFolder="Snapshots" lastStateChange="2014-11-07T08:37:02Z"> <ExtraData> ... </ExtraData> <StorageControllers> <StorageController name="Controller IDE" type="PIIX4" PortCount="2" useHostIOCache="true" Bootable="true"> <AttachedDevice type="HardDisk" port="0" device="0"> <Image uuid="{81e5deb2-a60f-4525-b219-858b5aefb54e}"/> </AttachedDevice> ... </StorageController> </StorageControllers> </Machine> </VirtualBox>

And the manipulated 1.14 fake that can be successfully imported then looks like this:

<?xml version="1.0"?> <!-- ** DO NOT EDIT THIS FILE. ** If you make changes to this file while any VirtualBox related application ** is running, your changes will be overwritten later, without taking effect. ** Use VBoxManage or the VirtualBox Manager GUI to make changes. --> <VirtualBox xmlns="http://www.innotek.de/VirtualBox-settings" version="1.14-windows"> <Machine uuid="{77f238d7-c16d-4082-8782-f542910ab1b3}" name="opensuse-11.2-x86" OSType="OpenSUSE" snapshotFolder="Snapshots" lastStateChange="2014-11-07T08:37:02Z"> <MediaRegistry> <HardDisks> <HardDisk uuid="{81e5deb2-a60f-4525-b219-858b5aefb54e}" location="opensuse-11.2-x86.vdi" format="VDI" type="Normal"/> </HardDisks> <DVDImages/> <FloppyImages/> </MediaRegistry> ... </Machine> </VirtualBox>

That’s all folks!

# Germany´s Next Startup: Zielgruppengerichtetes, kollaboratives Pissoir-Marketing

An: Das Europäische Patentamt

Betreff: Patent-Anmeldung #8.343.852

Kurzbeschreibung: Wir schlagen eine Methode und einen Apparat vor, welcher die gegenwärtigen, sehr erfolgreichen Trends zu interaktiven Marketing-Installationen in der Öffentlichkeit (z.B. Litfaßsäulen oder Mülleimer mit Projektionsflächen und Bewegungsmeldern) sowie lokalitäts-spezifischen Werbemassnahmen (siehe rechts z.B. die beliebten Granufink- oder Dachdecker-Plakate – “Undicht? Wir haben was dagegen …” – über den Urinalen gut besuchter Männer-Toiletten an Raststätten und Biergärten) in optimaler Weise zusammenführt.

Ziel: Mit dieser Idee sollte es möglich sein, die nächsten 20 Mio. Wagniskapital aus freudigen Investorengesichtern ausgehändigt zu bekommen, um sie dann genüsslich im wahrsten Sinne des Wortes in den Orkus zu kippen.

Funktionsweise: Der mittels einer mobilen Urin-Analyse-Einheit auf Basis einer elektrochemischen Säure-/Base-Reaktion gekoppelt an die Lightning-Schnittstelle eines handelsüblichen Smartphone (iLoo) aufgebaute Apparat setzt sich mittels einer Alkalische-Analyse-Apparat-App (A-A-A-A) mit einem günstig von indischen Betreibern gemieteten Cloud-Service in Verbindung. Die festgestellten Rahmenparameter (z.B. PS – Peeing Strenght, DDA – Degree of Drug Abuse, DM – Diabetes Mellitus, …) werden mittels eines zuvor ermittelten Cluster-Modells den verschiedenen Käuferclustern aus dem Amazon-Kundenmodell verglichen. Mittels einer semantischen Engine werden daraufhin passgenaue Botschaften auf das Handy-Display zurückgespielt: “Los, einer geht noch.” oder “Andere Pisser haben sich danach die Hände gewaschen.” oder gar “Die Nummer des nächstgelegenen Taxi-Unternehmens ist: +49-…). Denkbar sind auch kollaborative Spiele, wenn sich etwa gegnerische Clans finden, welche eine bestimmte Literzahl zuerst vollmachen müssen. Wir denken in einem nächsten Schritt auch über eine Vernetzung mit einer passenden Architektur für Damentoiletten nach, so daß beispielsweise hormonell abgestimmte Vorschläge für potentielle Sexualpartner online vermittelt werden können.

# Efficient Subset Matching using SQL (with an open question regarding the physical versus the virtual appliance of Teradata)

Usually, I tend to publish only successful experiments. For the first time, however, I am seemingly not able to somehow emulate the soft- and hardware setup needed to verify my hypotheses. So I desperately need your help.

Today, we wander through the realm of “big” propositional rule matching. Suppose we’ve got on the one hand 10^7 formulas of the form precondition -> conclusion where the propositions are key-encoded and we assume for simplicity a fixed length for their conjunction.

On the other hand, we are given a set of 250.000 fact formulas which are represented quite similarly but with a variable conjunction size whose average is three times the length of the preconditions.

So here is the vertical DDL model, followed by a Teradata sample population code:

create table RULE ( ruleId integer not null, propositionId integer not null, conclusionId integer not null, constraint pk primary key(ruleId,propositionId) ) ; create table FACT ( factId integer not null, propositionId integer not null, constraint pk primary key(factId,propositionId) ) ; CREATE PROCEDURE populate_rule(in maxRules integer, in maxSteps integer) BEGIN DECLARE ruleCount INTEGER DEFAULT 0; DECLARE stepCount INTEGER DEFAULT 0; DECLARE currentProposition INTEGER DEFAULT 0; DECLARE conclusion INTEGER DEFAULT 0; loop_label: WHILE ruleCount maxSteps THEN SET stepCount = 1; SET ruleCount = ruleCount + 1; SET currentProposition = 0; END IF; Set currentProposition = currentProposition + random(1,30); Set conclusion = random(1,2100); insert into rule(:ruleCount, :currentProposition, :conclusion); if ruleCount mod 1000 = 0 THEN commit; END IF; END WHILE loop_label; commit; END; / CREATE PROCEDURE populate_fact(in maxFacts integer) BEGIN DECLARE factCount INTEGER DEFAULT 0; DECLARE stepCount INTEGER DEFAULT 0; DECLARE currentProposition INTEGER DEFAULT 0; DECLARE currentSteps INTEGER DEFAULT 0; loop_label: WHILE factCount currentSteps THEN SET stepCount = 1; SET currentSteps = random(1,40); SET factCount = factCount + 1; SET currentProposition = 0; END IF; Set currentProposition = currentProposition + random(1,10); insert into fact(:factCount, :currentProposition); if factCount mod 500 = 0 THEN commit; END IF; END WHILE loop_label; commit; END; /

Under the given model, the task of matching rules to facts equals to determining, which fixed-length sets in the RULE table correspond are subsets of the variable-length sets in the FACT table. Hence, a quite traditional JOIN followed by a filtered aggregation:

select factId , ruleId , conclusionId from FACT join RULE on FACT.propositionId=RULE.propositionId group by factId, ruleId having count(*)=7

The performance of this method, though at least partially covered by the primary keys, is not overwhelming (1.500 seconds in the standard TDExpress14 VMWare image on a Core i7). Which is not too surprising, given the obtained execution plan:

1) First, we lock a distinct PRODUCT_QUALITY."pseudo table" for read on a RowHash to prevent global deadlock for PRODUCT_QUALITY.FACT. 2) Next, we lock a distinct PRODUCT_QUALITY."pseudo table" for read on a RowHash to prevent global deadlock for PRODUCT_QUALITY.RULE. 3) We lock PRODUCT_QUALITY.FACT for read, and we lock PRODUCT_QUALITY.RULE for read. 4) We execute the following steps in parallel. 1) We do an all-AMPs RETRIEVE step from PRODUCT_QUALITY.RULE by way of an all-rows scan with no residual conditions into Spool 4 (all_amps) fanned out into 5 hash join partitions, which is duplicated on all AMPs. The result spool file will not be cached in memory. The size of Spool 4 is estimated with high confidence to be 206,708 rows (4,340,868 bytes). The estimated time for this step is 0.65 seconds. 2) We do an all-AMPs RETRIEVE step from PRODUCT_QUALITY.FACT by way of an all-rows scan with no residual conditions into Spool 5 (all_amps) fanned out into 5 hash join partitions, which is built locally on the AMPs. The input table will not be cached in memory, but it is eligible for synchronized scanning. The result spool file will not be cached in memory. The size of Spool 5 is estimated with high confidence to be 350,001 rows (7,350,021 bytes). The estimated time for this step is 1.32 seconds. 5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a RowHash match scan, which is joined to Spool 5 (Last Use) by way of a RowHash match scan. Spool 4 and Spool 5 are joined using a merge join, with a join condition of ("propositionId = propositionId"). The result goes into Spool 3 (all_amps), which is built locally on the AMPs. The size of Spool 3 is estimated with no confidence to be 61,145,139 rows (1,406,338,197 bytes). The estimated time for this step is 2 minutes and 42 seconds. 6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by way of an all-rows scan , grouping by field1 ( PRODUCT_QUALITY.FACT.factId ,PRODUCT_QUALITY.RULE.ruleId ,PRODUCT_QUALITY.RULE.conclusionId). Aggregate Intermediate Results are computed globally, then placed in Spool 6. The aggregate spool file will not be cached in memory. The size of Spool 6 is estimated with no confidence to be 45,858,855 rows (1,696,777,635 bytes). The estimated time for this step is 1 hour and 22 minutes. 7) We do an all-AMPs RETRIEVE step from Spool 6 (Last Use) by way of an all-rows scan with a condition of ("(Field_5 (DECIMAL(15,0)))= 7.") into Spool 1 (group_amps), which is built locally on the AMPs. The result spool file will not be cached in memory. The size of Spool 1 is estimated with no confidence to be 45,858,855 rows ( 1,696,777,635 bytes). The estimated time for this step is 3 minutes and 36 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 1 hour and 28 minutes.

The problem that we are faced with is that due to the nature of the primary key (leading key factId or ruleId) and the Teradata parallelization strategy, the rows cannot be locally joined in the AMPS (or as to put it in traditional SQL terms: we need to arrange a SORT-ORDER MERGE JOIN).

Hence we need to fiddle with the table layout itself, i.e., we manipulate the primary index to just point to the crucial propositionId column:

1) First, we lock a distinct PRODUCT_QUALITY."pseudo table" for read on a RowHash to prevent global deadlock for PRODUCT_QUALITY.FACT. 2) Next, we lock a distinct PRODUCT_QUALITY."pseudo table" for read on a RowHash to prevent global deadlock for PRODUCT_QUALITY.RULE. 3) We lock PRODUCT_QUALITY.FACT for read, and we lock PRODUCT_QUALITY.RULE for read. 4) We do an all-AMPs JOIN step from PRODUCT_QUALITY.RULE by way of a RowHash match scan with no residual conditions, which is joined to PRODUCT_QUALITY.FACT by way of a RowHash match scan with no residual conditions. PRODUCT_QUALITY.RULE and PRODUCT_QUALITY.FACT are joined using a merge join, with a join condition of ("PRODUCT_QUALITY.FACT.propositionId = PRODUCT_QUALITY.RULE.propositionId"). The result goes into Spool 3 (all_amps), which is built locally on the AMPs. The size of Spool 3 is estimated with low confidence to be 137,806 rows (3,169,538 bytes). The estimated time for this step is 1.02 seconds. 5) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by way of an all-rows scan , grouping by field1 ( PRODUCT_QUALITY.FACT.factId ,PRODUCT_QUALITY.RULE.ruleId ,PRODUCT_QUALITY.RULE.conclusionId). Aggregate Intermediate Results are computed globally, then placed in Spool 4. The aggregate spool file will not be cached in memory. The size of Spool 4 is estimated with no confidence to be 103,355 rows ( 3,824,135 bytes). The estimated time for this step is 1.42 seconds. 6) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of an all-rows scan with a condition of ("(Field_5 (DECIMAL(15,0)))= 7.") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 103,355 rows (3,824,135 bytes). The estimated time for this step is 0.51 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 2.95 seconds.

That looks like a nice parallel plan and an vene better prediction. But the runtime is … tatarata … 1.500 seconds!

Wait a minute. How is that possible?

It is possible (hypothesis), because

- a) the VMWare image is configured with 1 logical processor, mainly disabling any hyperthreading or multicore support by the hosting i7 and because
- b) the TDExpress configuration inside the VMWare is only equippied with two AMPs, hereby efficiently disabling any noticable partitioning/distribution effect at all.

# Efficient Exponential Regression Using Ansi-SQL Windowing

This is the fourth part of a posting series (I, II and III) which deals with pushing the declarative boundary of SQL engines in the domain of statistics as far as possible.

In the previous examples, we have gathered samples for (piecewise-linear) risk probabilities.

Now we would rather use these samples to regress a typical exponential risk probability disttibution. The appeal of this particular family of distributions is not only that it can adapt to the most reasonable shapes (increasing, decreasing, even a kind of normal distribution), but that it has most interesting numeric properties.

In particular, by transforming the exponential regression problem with a double-logarithmic representation, we gather the simplest linear regression problem (one variable, one target) which we can even compute “by hand”.

The remainder of this posting is now rather trivial, given the already presented definitions of the database schema:

select material -- the exponential shape parameter corresponds to the linear slope parameter , beta -- the exponential acceleration parameter can be computed out of the linear horizontal translation -- be aware of zero slopes , exp(avg_ln_age-case when beta 0 then avg_ln_ln_surv/beta else 0 end) as alpha from ( -- Aggregate the linear coefficients per material and solve the equation select material , (count(*)*sum(ln_age_surv)-sum(ln_age_sum_surv))/(count(*)*sum(ln_age_sqr)-sum(ln_age)*sum(ln_age)) as beta -- prepare computation of horizontal translation , avg(ln_age) as avg_ln_age , avg(ln_ln_surv) as avg_ln_ln_surv from ( -- Next we prepare the equation solving by some -- windowing/aggregation select material , ln_age , ln_ln_surv , ln_age*ln_age as ln_age_sqr , ln_age*ln_ln_surv as ln_age_surv , ln_age*sum(ln_ln_surv) over (partition by material) as ln_age_sum_surv from ( -- Here, we transform the original problem into the -- logarithmic scale, edge probabilities are somehow avoided select material , ln(age) as ln_age , ln(ln(1/(1-greatest(least(probability,0.999999),0.00001)))) as ln_ln_surv from repair_estimations ) transf(material, ln_age, ln_ln_surv) ) compl(material,ln_age,ln_ln_surv,ln_age_sqr, ln_age_surv, ln_age_sum_surv) group by material ) final(material,beta,avg_ln_age,avg_ln_ln_surv) ;

# Progressively Fusing Independent Probabilities using Proprietary SQL Windowing (Teradata version)

As announced in the second part of the posting series, the scheme to use aggregate User-Defined-Functions (UDF) for fusing probabilities will work in a similar fashion for all standard databases, such as Oracle and Teradata.

Similar to the Oracle formulation (in which the aggregation object itself keeps state and functional logic), here is the Teradata version consisting of a static logic class and a seperate state class that is serialized and passed):

package de.jasna.teradata.repair; import java.sql.SQLException; import com.teradata.fnc.Context; import com.teradata.fnc.Phase; import java.io.Serializable; /** * Storage class for ProbabilityUnion Java User Defined Function * Remember that you have to reserve the appropriate context memory for its serialized form * in the final function definition */ class ProStor implements Serializable { // memory private double probability=0; /** * Default Constructor */ public ProStor() { } /** * @return current probability */ public double getProbability() { return this.probability; } /** * @param aProbability to set */ public void setProbability(double aProbability) { this.probability = aProbability; } } /** * Teradata version of the union of independent probabilities */ public class ProbabilityUnion { /** * Compute method * @param phase of aggregation * @param context of aggregation including states * @param aProbability value to integrate * @return union of probabilities */ public static double unite(Phase phase , Context[] context , double aProbability) throws SQLException { try { switch (phase.getPhase()) { /* The AGR_INIT phase is executed once. */ case Phase.AGR_INIT: ProStor newStor=new ProStor(); context[0].initCtx(newStor); context[0].setObject(1, newStor); return 0; /* The AGR_DETAIL phase is executed once per row. */ case Phase.AGR_DETAIL: ProStor yet = (ProStor) context[0].getObject(1); yet.setProbability(yet.getProbability() + aProbability - yet.getProbability() * aProbability); context[0].setObject(1, yet); return 0; /* AGR_COMBINE runs over the distributed results */ case Phase.AGR_COMBINE: ProStor res1 = (ProStor) context[0].getObject(1); ProStor res2 = (ProStor) context[0].getObject(2); res1.setProbability(res1.getProbability() + res2.getProbability() - res1.getProbability() * res2.getProbability()); context[0].setObject(1, res1); return 0; /* we are through */ case Phase.AGR_FINAL: ProStor finale = (ProStor) context[0].getObject(1); return finale.getProbability(); /* prob of empty set is zero */ case Phase.AGR_NODATA: return 0; default: throw new SQLException("Invalid Phase", "38905"); } } catch (Exception ex) { throw new SQLException(ex.getMessage(), "38101"); } } }

Once you deployed the jar file under a unique id (e.g., “de.jasna.teradata.repair” for which your user needs “execute prodedure” grants on the ‘SQLJ’ database), you can define the aggregation UDF as follows (note that the storage class above needs some extra class aggregate space wrt the default 64 ?bytes?).

create function union_indep_prop(aProbability float) returns float class aggregate(128) specific union_indep_prop language java no sql no external data parameter style java deterministic called on null input external name 'de.jasna.teradata.repair:de.jasna.teradata.repair.ProbabilityUnion.unite(com.teradata.fnc.Phase,com.teradata.fnc.Context[],double) returns double' ;

Finally, we can define the hierarchical decomposition and probability propagation query (given the STRUCTURED_MATERIAL definition as in the previous post) as follows:

/* * A recursive view to decompose the bill of materials down to the leaf level */ create recursive view MATERIAL_HIERARCHY(root_material, component_material,depth) as ( SELECT STRUCTURED_MATERIAL.composite_material as root_material , STRUCTURED_MATERIAL.component_material as component_material , 0 as depth FROM STRUCTURED_MATERIAL union all SELECT MATERIAL_HIERARCHY.root_material , STRUCTURED_MATERIAL.component_material , MATERIAL_HIERARCHY.depth+1 as depth FROM MATERIAL_HIERARCHY join STRUCTURED_MATERIAL on MATERIAL_HIERARCHY.component_material=STRUCTURED_MATERIAL.composite_material ); /* * Here come the propagated probabilities */ insert into repair_estimations select MATERIAL_HIERARCHY.root_material , AGE.age , min(AGE.age-1) over (partition by MATERIAL_HIERARCHY.root_material order by AGE.age rows between 1 following and unbounded following) as next_age , union_indep_prop(ALL_REPAIRS.probability) as probability , union_indep_prop(ALL_REPAIRS.lower_bound_95) as lower_bound_95 , union_indep_prop(ALL_REPAIRS.upper_bound_95) as upper_bound_95 , max(MATERIAL_HIERARCHY.depth) as depth from MATERIAL_HIERARCHY join REPAIR_ESTIMATIONS AGE on AGE.material=MATERIAL_HIERARCHY.component_material and AGE.depth=0 join MATERIAL_HIERARCHY ALL_PARTS on ALL_PARTS.root_material=MATERIAL_HIERARCHY.root_material join REPAIR_ESTIMATIONS ALL_REPAIRS on ALL_REPAIRS.material=ALL_PARTS.component_material and ALL_REPAIRS.depth=0 and AGE.age between ALL_REPAIRS.age and coalesce(ALL_REPAIRS.next_age, AGE.age) group by MATERIAL_HIERARCHY.root_material , AGE.age ;

Some helpful links to get this implementation flying were this debugging guide and the Java Aggregate Function tutorial.