Progressively Fusing Independent Probabilities using Proprietary SQL Windowing (Teradata version)

As announced in the second part of this post series, the scheme of using aggregate User-Defined Functions (UDFs) for fusing probabilities works in a similar fashion for all standard databases, such as Oracle and Teradata.

Similar to the Oracle formulation (in which the aggregation object itself keeps both the state and the functional logic), here is the Teradata version, consisting of a static logic class and a separate state class that is serialized and passed:

package de.jasna.teradata.repair;

import java.sql.SQLException;

import com.teradata.fnc.Context;
import com.teradata.fnc.Phase;
import java.io.Serializable;

/**
 * Storage class for ProbabilityUnion Java User Defined Function
 * Remember that you have to reserve the appropriate context memory for its serialized form 
 * in the final function definition
 */
class ProStor implements Serializable {
	// memory
	private double probability=0;

	/**
	 * Default Constructor
	 */
	public ProStor() {
	}

	/**
	 * @return current probability
	 */
	public double getProbability() {
		return this.probability;
	}

	/**
	 * @param aProbability to set
	 */
	public void setProbability(double aProbability) {
		this.probability = aProbability;
	}
}

/**
 * Teradata version of the union of independent probabilities
 */
public class ProbabilityUnion {

	/**
	 * Compute method
	 * @param phase of aggregation
	 * @param context of aggregation including states
	 * @param aProbability value to integrate
	 * @return union of probabilities
	 */
	public static double unite(Phase phase
	                         , Context[] context
			                 , double aProbability) 
	   throws SQLException {

		try {
			switch (phase.getPhase()) {

			/* The AGR_INIT phase is executed once. */
			case Phase.AGR_INIT:
				ProStor newStor=new ProStor();
				context[0].initCtx(newStor);
				context[0].setObject(1, newStor);
				return 0;

			/* The AGR_DETAIL phase is executed once per row. */
			case Phase.AGR_DETAIL:
				ProStor yet = (ProStor) context[0].getObject(1);
				yet.setProbability(yet.getProbability() + aProbability
						- yet.getProbability() * aProbability);
				context[0].setObject(1, yet);
				return 0;

			/* AGR_COMBINE runs over the distributed results */
			case Phase.AGR_COMBINE:
				ProStor res1 = (ProStor) context[0].getObject(1);
				ProStor res2 = (ProStor) context[0].getObject(2);
				res1.setProbability(res1.getProbability()
						+ res2.getProbability() - res1.getProbability()
						* res2.getProbability());
				context[0].setObject(1, res1);
				return 0;

			/* we are through */
			case Phase.AGR_FINAL:
				ProStor finale = (ProStor) context[0].getObject(1);
				return finale.getProbability();

			/* prob of empty set is zero */
			case Phase.AGR_NODATA:
				return 0;

			default:
				throw new SQLException("Invalid Phase", "38905");
			}
		} catch (Exception ex) {
			throw new SQLException(ex.getMessage(), "38101");
		}
	}

}

Once you have deployed the jar file under a unique id (e.g., “de.jasna.teradata.repair”, for which your user needs “execute procedure” grants on the ‘SQLJ’ database), you can define the aggregation UDF. Note that the storage class above needs some extra class aggregate space compared to the default of 64 bytes.
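
The deployment step itself boils down to a call of the SQLJ installation procedure, roughly like the following sketch (the jar file name is an example; the 'CJ!' prefix marks a client-side jar):

call SQLJ.INSTALL_JAR('CJ!probability-union.jar', 'de.jasna.teradata.repair', 0);

With the jar installed, the aggregation UDF is defined as follows: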

create function union_indep_prop(aProbability float)
returns float
class aggregate(128)
specific union_indep_prop
language java
no sql
no external data
parameter style java
deterministic
called on null input
external name 'de.jasna.teradata.repair:de.jasna.teradata.repair.ProbabilityUnion.unite(com.teradata.fnc.Phase,com.teradata.fnc.Context[],double) returns double'
;
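
Being a true aggregate, the new function can be used like any built-in one. As a quick sanity check (a sketch that fuses the leaf-level estimations of the previous post across all materials, per age):

select REPAIR_ESTIMATIONS.age
     , union_indep_prop(REPAIR_ESTIMATIONS.probability) as fused_probability
  from REPAIR_ESTIMATIONS
 group by REPAIR_ESTIMATIONS.age
;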

Finally, we can define the hierarchical decomposition and probability propagation query (given the STRUCTURED_MATERIAL definition as in the previous post) as follows:

/*
 * A recursive view to decompose the bill of materials down to the leaf level
 */
create recursive view MATERIAL_HIERARCHY(root_material, component_material,depth) as (
  SELECT STRUCTURED_MATERIAL.composite_material as root_material
       , STRUCTURED_MATERIAL.component_material as component_material
       , 0 as depth
    FROM STRUCTURED_MATERIAL
  union all
  SELECT MATERIAL_HIERARCHY.root_material
       , STRUCTURED_MATERIAL.component_material
       , MATERIAL_HIERARCHY.depth+1 as depth
    FROM MATERIAL_HIERARCHY
       join STRUCTURED_MATERIAL on MATERIAL_HIERARCHY.component_material=STRUCTURED_MATERIAL.composite_material
);
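
Given the sample bill of materials of the previous post (composite 4 = parts {1, 2}, composite 5 = parts {4, 3}), the view decomposes, e.g., material 5 as follows:

select *
  from MATERIAL_HIERARCHY
 where root_material=5
 order by depth
;
/* yields (5,4,0), (5,3,0), (5,1,1), (5,2,1) */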

/*
 * Here come the propagated probabilities
 */
insert into repair_estimations
 select MATERIAL_HIERARCHY.root_material
      , AGE.age
      , min(AGE.age-1) over (partition by MATERIAL_HIERARCHY.root_material 
                             order by AGE.age
                             rows between 1 following 
                                  and unbounded following) as next_age
      , union_indep_prop(ALL_REPAIRS.probability) as probability
      , union_indep_prop(ALL_REPAIRS.lower_bound_95) as lower_bound_95
      , union_indep_prop(ALL_REPAIRS.upper_bound_95) as upper_bound_95
      , max(MATERIAL_HIERARCHY.depth) as depth
  from MATERIAL_HIERARCHY
    join REPAIR_ESTIMATIONS AGE on AGE.material=MATERIAL_HIERARCHY.component_material and AGE.depth=0
    join MATERIAL_HIERARCHY ALL_PARTS on ALL_PARTS.root_material=MATERIAL_HIERARCHY.root_material 
    join REPAIR_ESTIMATIONS ALL_REPAIRS on ALL_REPAIRS.material=ALL_PARTS.component_material and ALL_REPAIRS.depth=0
         and AGE.age between ALL_REPAIRS.age and coalesce(ALL_REPAIRS.next_age, AGE.age)
 group by MATERIAL_HIERARCHY.root_material
        , AGE.age
;

Some helpful links to get this implementation flying were this debugging guide and the Java Aggregate Function tutorial.

Progressively Fusing Independent Probabilities using Proprietary SQL Windowing (Oracle version)

This is the sequel post to this proposal on how to estimate the lifetime of spare parts using ordinary query language semantics.

This time, we are interested in propagating the individual probabilities obtained there up a more complex material hierarchy. We start with the simplifying assumption that our only variable (“age”, which could also be named or measured as general “wear and tear”) already takes into account all influences on a given material’s span of life. And when combining two materials into some higher structure, we assume that they do not influence each other’s durability all too much.

create table STRUCTURED_MATERIAL (
  composite_material int references MATERIAL(material) not null,
  component_material int references MATERIAL(material) not null
);

insert into STRUCTURED_MATERIAL values (4,1);
insert into STRUCTURED_MATERIAL values (4,2);

insert into STRUCTURED_MATERIAL values (5,4);
insert into STRUCTURED_MATERIAL values (5,3);

Unfortunately, even under these assumptions, fusing the individual probabilities turns out to require a “progressive” calculation, i.e., a formula which cannot be stated in terms of an ANSI SQL aggregation:

P(A ∪ B) = P(A) + P(B) - P(A)·P(B), or, applied progressively over all parts: P(∪ A(i)) = 1 - Π(1 - P(i))

Fortunately, the expressiveness of the typical User-Defined-Function (UDF) frameworks, such as the one of the Oracle Database, is well able to handle that requirement, even under parallel/distributed execution settings:

/* Oracle needs a type/object definition as the UDF-base */
create type ProbabilityUnionImpl as object
(
  probability NUMBER, -- memory just keeps current probability
  static function ODCIAggregateInitialize(sctx IN OUT ProbabilityUnionImpl) return number,
  member function ODCIAggregateIterate(self IN OUT ProbabilityUnionImpl
                                     , value IN number) return number,
  member function ODCIAggregateTerminate(self IN ProbabilityUnionImpl
                                       , returnValue OUT number
                                       , flags IN number) return number,
  member function ODCIAggregateMerge(self IN OUT ProbabilityUnionImpl
                                   , ctx2 IN ProbabilityUnionImpl) return number
);
/
/* and here are the method implementations */
create or replace type body ProbabilityUnionImpl is 

/* We start with an empty probability */
static function ODCIAggregateInitialize(sctx IN OUT ProbabilityUnionImpl) return number is 
begin
  sctx := ProbabilityUnionImpl(0);
  return ODCIConst.Success;
end;

/* Local iteration implements fusion under independence */
member function ODCIAggregateIterate(self IN OUT ProbabilityUnionImpl
                                   , value IN number) return number is
begin
  self.probability:=self.probability+value-self.probability*value;
  return ODCIConst.Success;
end;

/* At the end, just return the stored probability */
member function ODCIAggregateTerminate(self IN ProbabilityUnionImpl
                                     , returnValue OUT number
                                     , flags IN number) return number is
begin
  returnValue := self.probability;
  return ODCIConst.Success;
end;

/* Distributed/Parallel merge, same case as local iteration */
member function ODCIAggregateMerge(self IN OUT ProbabilityUnionImpl
                                 , ctx2 IN ProbabilityUnionImpl) return number is
begin
  self.probability:=self.probability+ctx2.probability-self.probability*ctx2.probability;
  return ODCIConst.Success;
end;

end;
/

/* finally, the aggregation function is defined in terms of the type */
create function union_indep_prop (input number) return number 
PARALLEL_ENABLE AGGREGATE USING ProbabilityUnionImpl;
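
A one-line sanity check (a sketch) confirms the fusion semantics:

/* two independent 50% chances fuse to 0.5 + 0.5 - 0.5*0.5 = 75% */
select union_indep_prop(p) as fused
  from (select 0.5 as p from dual
        union all
        select 0.5 from dual);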

With that backing installed, we are now able to complete our estimation table even for complex materials. Since we are Oracle-specific anyway, we also use the established recursion syntax when flattening the bill-of-materials tree.

insert into REPAIR_ESTIMATIONS
with
  /* we just need the root - leaf material relations */
  DECOMPOSITION as (
  select 
         CONNECT_BY_ROOT 
         composite.composite_material as root_material
       , composite.component_material as component_material
       , level as depth
    from STRUCTURED_MATERIAL composite
 connect by prior composite.component_material=composite.composite_material 
)
 /* for each age of an individual part, find all corresponding probabilities of
    sibling materials for fusion */
 select DECOMPOSITION.root_material
      , REPAIR.age
      , lead(REPAIR.age-1) over (partition by DECOMPOSITION.root_material order by REPAIR.age) as next_age
      , union_indep_prop(ALL_REPAIRS.probability) as probability
      , union_indep_prop(ALL_REPAIRS.lower_bound_95) as lower_bound_95
      , union_indep_prop(ALL_REPAIRS.upper_bound_95) as upper_bound_95
      , max(DECOMPOSITION.depth) as depth
  from DECOMPOSITION
    join REPAIR_ESTIMATIONS REPAIR on REPAIR.material=DECOMPOSITION.component_material and REPAIR.depth=0
    join DECOMPOSITION ALL_PARTS on ALL_PARTS.root_material=DECOMPOSITION.root_material 
    join REPAIR_ESTIMATIONS ALL_REPAIRS on ALL_REPAIRS.material=ALL_PARTS.component_material and
         REPAIR.age between ALL_REPAIRS.age and coalesce(ALL_REPAIRS.next_age, REPAIR.age)
 group by DECOMPOSITION.root_material
        , REPAIR.age
;

Stay tuned for the upcoming Teradata formulation.

Estimating Multiplicative Probabilities Using ANSI-SQL Windowing

SQL can be quite an efficient tool for implementing particular types of statistics. Since windowing became part of the ANSI standard, this holds for more and more databases as well as for more and more mathematical/structural complexity.

For example, product limits (Kaplan-Meier estimates) can be estimated incrementally by multiplying probabilities over age. However, the quite impressive list of ANSI SQL aggregate functions does not contain a corresponding product operation.

Fortunately, we know (or at least Google knows 😉) that for positive P(i), it holds that ΠP(i) = EXP(Σln(P(i))).
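
As a windowed running product, that identity reads like the following minimal sketch (assuming a hypothetical table SERIES(grp, ord, p) with p > 0):

select SERIES.grp
     , SERIES.ord
     , exp(sum(ln(SERIES.p))
           over (partition by SERIES.grp
                 order by SERIES.ord
                 rows between unbounded preceding
                          and current row)) as running_product
  from SERIES
;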

This allows us to realize a (batch) product limit estimation including the determination of confidence intervals such as in the following example which has been generalized from Oracle to Teradata. Already included are some objects needed for the sequel post, just in case you wonder about the verbosity:

create table MATERIAL (
	material int primary key not null,
	description varchar(100)
);
insert into MATERIAL values (1, 'sit');
insert into MATERIAL values (2, 'aenean');
insert into MATERIAL values (3, 'adipiscing');
insert into MATERIAL values (4, 'nulla');
insert into MATERIAL values (5, 'in');
insert into MATERIAL values (6, 'sed');

create table PRODUCT (
	serial int primary key not null,
	material INT references MATERIAL(material) not null,
	manufacturing_date date not null
);

insert into PRODUCT  values (1, 2, '2002-08-14');
insert into PRODUCT  values (2, 1, '2001-02-20');
insert into PRODUCT  values (3, 1, '2005-10-27');
insert into PRODUCT  values (4, 1, '2003-10-23');
insert into PRODUCT  values (5, 1, '2001-12-21');
insert into PRODUCT  values (6, 2, '2010-10-11');
insert into PRODUCT  values (7, 1, '2002-06-28');
insert into PRODUCT  values (8, 2, '2009-12-29');
insert into PRODUCT  values (9, 1, '2002-01-10');
insert into PRODUCT  values (10, 2, '2008-10-19');
insert into PRODUCT  values (11, 3, '2007-01-04');
insert into PRODUCT  values (12, 3, '2008-08-06');
insert into PRODUCT  values (13, 4, '2010-08-01');
insert into PRODUCT  values (14, 4, '2011-12-01');
insert into PRODUCT  values (15, 5, '2012-02-03');

create table REPAIR (
	serial_flawed int unique references PRODUCT(serial) not null,
	repair_date date not null
);

insert into REPAIR values (7,'2004-01-22');
insert into REPAIR values (6,'2011-04-09');
insert into REPAIR values (2,'2001-06-21');
insert into REPAIR values (4,'2010-07-25');
insert into REPAIR values (1,'2011-06-16');
insert into REPAIR values (10,'2009-04-17');
insert into REPAIR values (5,'2003-04-13');
insert into REPAIR values (3,'2011-12-31');
insert into REPAIR values (9,'2002-06-10');
insert into REPAIR values (11,'2010-06-10');

/* Not all ANSI databases cope with my beloved nested CTEs, so here is the explicit version */ 

/* aggregating repair data by part age */ 
create view BASE_EXPERIMENT as
   select PRODUCT.material
        , REPAIR.repair_date-PRODUCT.manufacturing_date as age
        , cast(sum(1) as float) as samples
     from PRODUCT 
       join REPAIR on REPAIR.serial_flawed=PRODUCT.serial
    group by PRODUCT.material
           , REPAIR.repair_date-PRODUCT.manufacturing_date
;

/* building intervals and base statistics */
create view POPULATION as 
   select BASE_EXPERIMENT.*
        , row_number() over (partition by BASE_EXPERIMENT.material 
                             order by BASE_EXPERIMENT.age)
          as experiment
        , cast(
             sum(BASE_EXPERIMENT.samples) over (partition by BASE_EXPERIMENT.material 
                                                order by BASE_EXPERIMENT.age 
                                                rows between current row 
                                                   and unbounded following) 
          as float) as population
        , min(BASE_EXPERIMENT.age) over (partition by BASE_EXPERIMENT.material 
                                         order by BASE_EXPERIMENT.age
                                         rows between 1 following 
                                           and unbounded following) 
          as next_age
     from BASE_EXPERIMENT
;

/* Here is the actual estimation magic done */
create view SURVIVAL as 
   select POPULATION.*
        , case 
             when POPULATION.population<=1 then null 
             else 
                exp(
                 sum(
                   ln(
                     case 
                        when POPULATION.age is null then 1.0
                        when POPULATION.population=POPULATION.samples then null 
                        else 1.0-POPULATION.samples/POPULATION.population 
                     end
                   )
                 ) over (partition by POPULATION.material 
                         order by POPULATION.age 
                         rows between unbounded preceding 
                                  and current row 
                 )
                )
          end as probability
        , sqrt(
             sum(
                 case 
                    when POPULATION.experiment=1 then 0.0
                    when POPULATION.population>=POPULATION.samples 
                         then POPULATION.samples/
                              (POPULATION.population*(POPULATION.population-POPULATION.samples))
                    else null 
                 end
             ) over (partition by POPULATION.material 
                     order by POPULATION.age 
                     rows between unbounded preceding and 1 preceding 
             )
          ) as confidence
     from POPULATION
;
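
For reference, the confidence column accumulates what is essentially Greenwood's variance term, so the standard deviation of the survival estimate amounts to sd(S) ≈ S * SQRT(Σ samples / (population * (population - samples))), summed over the preceding ages; this is exactly the factor scaled by 1.96 in the confidence bounds below.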

create table REPAIR_ESTIMATIONS as (
   select SURVIVAL.material
        , SURVIVAL.age
        , SURVIVAL.next_age
        , 1-coalesce(SURVIVAL.probability,0) as probability
        , 1-coalesce(least(SURVIVAL.probability+1.96*SURVIVAL.probability*SURVIVAL.confidence,1),1) as lower_bound_95
        , 1-coalesce(greatest(0,SURVIVAL.probability-1.96*SURVIVAL.probability*SURVIVAL.confidence),0) as upper_bound_95
        , 0 as depth
     from SURVIVAL  
) 
WITH DATA -- Teradata slang, remove for Oracle
;

/* That's all folks ... until we speak of structured materials next time */
select * 
  from REPAIR_ESTIMATIONS 
 order by material
        , age
;

BTW, Mockaroo is a really nice sample data generation service … if only we could influence random distributions and impose constraints …

Replacing the Bowden Cable at the Driver's Door of a Skoda Octavia

Finally, a successful weekend project again 😉

On the way home from the TÜV inspection shortly before Christmas, there was a sudden ripping sound when getting out of the car. The driver's door could no longer be opened from the inside, at least not in a dignified manner and without a smiling audience in the discounter parking lot.

During the first analysis, a small hook with a remnant of wire promptly fell out of the interior door panel, suggesting that it was the torn-off fastening of the Bowden cable.

On eBay you quickly find a replacement, although there is no oversupply of such spare parts. The exchange then turned out somewhat fiddly, but even with limited manual talent it can be carried out without major collateral damage:

  1. Gently loosen the interior cover of the outside mirror with a flat blade and lift it off. Disconnect the plug.
  2. Gently loosen the outside mirror control at the inner door opener with a flat blade and pry it out. Disconnect the plug. Undo the small screw behind the inner opener.
  3. Pry the rear plastic shell at the inner door handle slightly towards the car door, alternating between left and right, and pull it off. This also frees the rest of the control console, which can then be taken out upwards and towards the front. Disconnect the plug.
  4. Now undo the 8 Torx screws at the front edge (2), the foot side (2) and the bottom (3).
  5. Finally, undo the three essential Phillips screws in the remaining control console. The interior panel is now free.
  6. If you do not want to tear the interior panel off completely, you can now lift it and simply leave it folded up on a stand, a chair or the like in order to work underneath it.
  7. Now cut the insulation foil open with a sharp knife along the old Bowden cable and a few centimeters along the lock on the left, and fold it open. The foil will still be needed, so be careful.
  8. Undo the two Torx screws at the front edge that hold the lock.
  9. The lock now has to be pushed to the right towards the door opening until the suspension of the Bowden cable becomes visible. Release the old Bowden cable from its holders and hold it vertically downwards so that it can be pulled out of the suspension. Should the lock not have enough play, you can also loosen the screw of the cross plate mounted in front of it.
  10. Same game when installing the new Bowden cable: hold it vertically, thread it into the suspension, then bring it into its target position and hook it back into place!
  11. Reassemble everything in reverse order and glue the foil neatly back together along the cut, since otherwise condensation water can get in.

GMaps-API + Awesome Screenshot = High-Res Treasure Map

There is a certain positive correlation to be observed between

  • the age of a child and
  • the thrill expected from its birthday party.

So last week, I found myself thinking about some kind of Ingress competition for my son’s 10th birthday. With some intervention from my wife, fortunately, we came up with a more down-to-earth scheme: two teams of five boys each (the “red knights” versus the “blue monks”) had to hunt down 20 “magic symbols” (really horrifying tongue tattoos only to be “resolved by a truthful mouth”) which the “white sorcerer” had hidden in a rural area of ~20 square kilometers between our house and a playground/barbecue place in the village nearby.

From the two tasks of the white sorcerer

  • Hiding the treasures equipped with only a bike and a single hour of spare time
  • Printing two maps on A3 paper detailed enough such that the young lads stay on track

only the latter turned out to be easy enough, thanks to the Google Maps Javascript API.

Treasure Map using Google Maps

It allows you to create a web page with quite a high resolution (4200 × 2970 px matches A3’s 420 mm × 297 mm at 10 px per mm):

html { width: 4200px; height: 2970px }
body { height: 100%; width:100%; margin: 0; padding: 0 }

To leave out any unnecessary distractions, you can customize the map options:

var mapOptions = {
  center: new google.maps.LatLng(49.405500, 7.195000),
  zoom: 17,
  disableDefaultUI: true,
  mapTypeId: google.maps.MapTypeId.ROADMAP,
  styles: [
    {featureType: 'road', elementType: 'labels.text', stylers: [{visibility: 'off'}]},
    {featureType: 'poi', elementType: 'all', stylers: [{visibility: 'off'}]}
  ]
};

var map = new google.maps.Map(document.getElementById("map_canvas"),
                              mapOptions);

Then, you can place valuable hints on the resulting bitmap once you have determined their GPS coordinates (e.g., by using the right-click command “What’s here?” in the official Gmaps interface).

var start = new google.maps.Marker({
  position: new google.maps.LatLng(49.406500, 7.177241),
  title: "Start",
  icon: 'tower.png'
});
// To add the marker to the map, call setMap()
start.setMap(map);

Finally, you need to capture the single-bitmap web page all at once, independently of your screen resolution. This is the domain of Awesome Screenshot. Believe me, I tested a whole lot of these extensions in Chrome, but that was the only “awesome” one that didn’t crash and produced a reasonable result (click on the image above).

Trading ORA-30926 for ORA-08006

This one bit me this morning when I violated golden rule #3 (‘Never change two not completely predictable things at a common target simultaneously, at least unless you are in a hazardous mood.’ An even more popular exemplification of that rule would be the slogan ‘Don’t drink and drive.’).

So I changed

  • the semantics of one source table of a merge statement (extracting individual phrases out of a text field into multiple rows)
  • the partitioning of the target table (including the enabling of row movement)

The result of the former change alone should have produced an ORA-30926 (‘unable to get a stable set of rows’) when some … hmm, say humanoid at least … managed to restate the same phrase over and over again in the source system.

In conjunction with row movement, however, the merge statement will instead issue an ORA-08006 (‘Someone deleted that row. It wasn’t me.’), which leaves one puzzled until one manages to find Todor Botev’s helpful investigation.
Sampling consistently out of (unordered, split, fragmented) detail data

One of the problems when sampling data out of raw files is that there may be consistency constraints on the picked lines.

For example, if you would like to extract representative sales slips out of TLOG texts, you want all the positions belonging to one sales slip to be sampled as a whole … and you want that property to hold consistently over all processes/machines onto which the texts/sampling have been distributed.

Using the typical seeded pseudo-random sequences will not work as expected here. You would have to aggregate the sales slip headers first, sample at the level of headers and, again, join the resulting sample with the original TLOG data.

A nice idea that circumvents that necessity with only a minimal bit of overhead is inspired by Chuck Lam’s method of simply constructing Bloom-filter hash functions. For each line, the pseudo-random generator is seeded with the hash code of a given key/object. Then, a fixed position of the random sequence (the new “seed”) is read as the observed random value.

For our TLOG case, all the lines carrying the identical sales slip key will get the same random boolean computed and will be filtered out or passed through all the same.
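
As a minimal SQL sketch of that idea (assuming a hypothetical Teradata table TLOG carrying a sales_slip_key column; instead of literally seeding a generator, the key's hash bucket serves directly as the consistent pseudo-random value):

/* a ~10% sample in which all lines of one sales slip travel together */
select TLOG.*
  from TLOG
 where hashbucket(hashrow(TLOG.sales_slip_key)) mod 100 < 10
;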