Efficient Subset Matching using SQL (with an open question regarding the physical versus the virtual Teradata appliance)

Usually, I tend to publish only successful experiments. For the first time, however, I seem unable to emulate the soft- and hardware setup needed to verify my hypotheses. So I desperately need your help.

Today, we wander through the realm of “big” propositional rule matching. Suppose we have, on the one hand, 10^7 formulas of the form precondition -> conclusion, where the propositions are key-encoded and where we assume, for simplicity, a fixed length for each precondition conjunction.
On the other hand, we are given a set of 250,000 fact formulas which are represented quite similarly, but with a variable conjunction size whose average is three times the length of the preconditions.

So here is the vertical DDL model, followed by the Teradata sample population code:

create table RULE (
 ruleId integer not null,
 propositionId integer not null,
 conclusionId integer not null,
 constraint pk primary key(ruleId,propositionId)
)
;

create table FACT (
 factId integer not null,
 propositionId integer not null,
 constraint pk primary key(factId,propositionId)
)
;

CREATE PROCEDURE populate_rule(in maxRules integer, in maxSteps integer)
BEGIN 
  DECLARE ruleCount INTEGER DEFAULT 0; 
  DECLARE stepCount INTEGER DEFAULT 0; 
  DECLARE currentProposition INTEGER DEFAULT 0;
  DECLARE conclusion INTEGER DEFAULT 0;

  loop_label: WHILE ruleCount < maxRules DO

    SET stepCount = stepCount + 1;
    IF stepCount > maxSteps THEN
      SET stepCount = 1;
      SET ruleCount = ruleCount + 1;
      SET currentProposition = 0;
    END IF;

    Set currentProposition = currentProposition + random(1,30);
    Set conclusion = random(1,2100);

    insert into rule(:ruleCount, :currentProposition, :conclusion);

    if ruleCount mod 1000 = 0 THEN
       commit;
    END IF;

  END WHILE loop_label;  

  commit;

END;
/

CREATE PROCEDURE populate_fact(in maxFacts integer)
BEGIN 
  DECLARE factCount INTEGER DEFAULT 0; 
  DECLARE stepCount INTEGER DEFAULT 0; 
  DECLARE currentProposition INTEGER DEFAULT 0;
  DECLARE currentSteps INTEGER DEFAULT 0;

  loop_label: WHILE factCount < maxFacts DO

    SET stepCount = stepCount + 1;
    IF stepCount > currentSteps THEN
      SET stepCount = 1;
      SET currentSteps = random(1,40);
      SET factCount = factCount + 1;
      SET currentProposition = 0;
    END IF;

    Set currentProposition = currentProposition + random(1,10);

    insert into fact(:factCount, :currentProposition);

    if factCount mod 500 = 0 THEN
       commit;
    END IF;

  END WHILE loop_label;  

  commit;
END;
/
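The procedures are then invoked with the desired volumes. A hypothetical call matching the problem statement above (preconditions of fixed length 7, which is also the HAVING COUNT(*)=7 filter used below, and 250,000 facts) would be:

call populate_rule(10000000, 7);
call populate_fact(250000);

For the cramped VMware image, considerably smaller volumes are advisable.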

Under the given model, the task of matching rules to facts amounts to determining which fixed-length sets in the RULE table are subsets of the variable-length sets in the FACT table. Hence, a quite traditional JOIN followed by a filtered aggregation:

select factId
     , ruleId
     , conclusionId 
  from FACT join RULE on FACT.propositionId=RULE.propositionId 
 group by factId, ruleId, conclusionId
having count(*)=7

The performance of this method, though the join is at least partially covered by the primary keys, is not overwhelming: roughly 1,500 seconds in the standard TDExpress14 VMware image on a Core i7. That is not too surprising, given the obtained execution plan:

  1) First, we lock a distinct PRODUCT_QUALITY."pseudo table" for read           
     on a RowHash to prevent global deadlock for PRODUCT_QUALITY.FACT.           
  2) Next, we lock a distinct PRODUCT_QUALITY."pseudo table" for read            
     on a RowHash to prevent global deadlock for PRODUCT_QUALITY.RULE.           
  3) We lock PRODUCT_QUALITY.FACT for read, and we lock                          
     PRODUCT_QUALITY.RULE for read.                                              
  4) We execute the following steps in parallel.                                 
       1) We do an all-AMPs RETRIEVE step from PRODUCT_QUALITY.RULE  
          by way of an all-rows scan with no residual conditions into            
          Spool 4 (all_amps) fanned out into 5 hash join partitions,             
          which is duplicated on all AMPs.  The result spool file will           
          not be cached in memory.  The size of Spool 4 is estimated             
          with high confidence to be 206,708 rows (4,340,868 bytes).             
          The estimated time for this step is 0.65 seconds.                 
       2) We do an all-AMPs RETRIEVE step from PRODUCT_QUALITY.FACT 
          by way of an all-rows scan with no residual conditions into            
          Spool 5 (all_amps) fanned out into 5 hash join partitions,             
          which is built locally on the AMPs.  The input table will not          
          be cached in memory, but it is eligible for synchronized               
          scanning.  The result spool file will not be cached in memory.         
          The size of Spool 5 is estimated with high confidence to be            
          350,001 rows (7,350,021 bytes).  The estimated time for this           
          step is 1.32 seconds.                      
  5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a             
     RowHash match scan, which is joined to Spool 5 (Last Use) by way            
     of a RowHash match scan.  Spool 4 and Spool 5 are joined using a            
     merge join, with a join condition of ("propositionId =                      
     propositionId").  The result goes into Spool 3 (all_amps), which            
     is built locally on the AMPs.   The size of Spool 3 is estimated with no confidence to be                   
     61,145,139 rows (1,406,338,197 bytes).  The estimated time for              
     this step is 2 minutes and 42 seconds.                                              
  6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by          
     way of an all-rows scan , grouping by field1 (                              
     PRODUCT_QUALITY.FACT.factId ,PRODUCT_QUALITY.RULE.ruleId                    
     ,PRODUCT_QUALITY.RULE.conclusionId).  Aggregate Intermediate Results 
     are computed globally, then placed           
     in Spool 6.  The aggregate spool file will not be cached in memory.         
     The size of Spool 6 is estimated with no confidence to be                   
     45,858,855 rows (1,696,777,635 bytes).  The estimated time for              
     this step is 1 hour and 22 minutes.                  
  7) We do an all-AMPs RETRIEVE step from Spool 6 (Last Use) by way of           
     an all-rows scan with a condition of ("(Field_5 (DECIMAL(15,0)))=           
     7.") into Spool 1 (group_amps), which is built locally on the AMPs.         
     The result spool file will not be cached in memory.  The size of            
     Spool 1 is estimated with no confidence to be 45,858,855 rows (             
     1,696,777,635 bytes).  The estimated time for this step is 3                
     minutes and 36 seconds.                                                     
  -> The contents of Spool 1 are sent back to the user as the result of          
     statement 1.  The total estimated time is 1 hour and 28 minutes.

The problem we are faced with is that, due to the nature of the primary key (leading key factId or ruleId) and the Teradata parallelization strategy, the rows cannot be joined locally on the AMPs (or, to put it in traditional SQL terms: we need to arrange a SORT-ORDER MERGE JOIN).

Hence, we need to fiddle with the table layout itself, i.e., we change the primary index to point just to the crucial propositionId column.
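The redefined DDL is not spelled out in the original post; a minimal sketch of what it could look like is the following, where the (now non-unique) PRIMARY INDEX, i.e., Teradata’s distribution key, hashes on propositionId so that matching rows co-locate on the same AMP (the former primary key constraints are left out, since propositionId alone is not unique):

create table RULE (
 ruleId integer not null,
 propositionId integer not null,
 conclusionId integer not null
)
primary index (propositionId)
;

create table FACT (
 factId integer not null,
 propositionId integer not null
)
primary index (propositionId)
;

With both tables distributed on propositionId, the optimizer no longer needs to duplicate or redistribute any spool, as the new plan shows: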

  1) First, we lock a distinct PRODUCT_QUALITY."pseudo table" for read           
     on a RowHash to prevent global deadlock for PRODUCT_QUALITY.FACT.           
  2) Next, we lock a distinct PRODUCT_QUALITY."pseudo table" for read            
     on a RowHash to prevent global deadlock for PRODUCT_QUALITY.RULE.           
  3) We lock PRODUCT_QUALITY.FACT for read, and we lock                          
     PRODUCT_QUALITY.RULE for read.                                              
  4) We do an all-AMPs JOIN step from PRODUCT_QUALITY.RULE by way of a           
     RowHash match scan with no residual conditions, which is joined to          
     PRODUCT_QUALITY.FACT by way of a RowHash match scan with no                 
     residual conditions.  PRODUCT_QUALITY.RULE and                              
     PRODUCT_QUALITY.FACT are joined using a merge join, with a join             
     condition of ("PRODUCT_QUALITY.FACT.propositionId =                         
     PRODUCT_QUALITY.RULE.propositionId").  The result goes into Spool           
     3 (all_amps), which is built locally on the AMPs.   
     The size of Spool 3 is estimated with low confidence to be           
     137,806 rows (3,169,538 bytes).  The estimated time for this step           
     is 1.02 seconds.                             
  5) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by          
     way of an all-rows scan , grouping by field1 (                              
     PRODUCT_QUALITY.FACT.factId ,PRODUCT_QUALITY.RULE.ruleId                    
     ,PRODUCT_QUALITY.RULE.conclusionId).   Aggregate Intermediate                
     Results are computed globally, then placed in Spool 4.  The                 
     aggregate spool file will not be cached in memory.  The size of             
     Spool 4 is estimated with no confidence to be 103,355 rows (                
     3,824,135 bytes).  The estimated time for this step is 1.42                 
     seconds.                                                          
  6) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of           
     an all-rows scan with a condition of ("(Field_5 (DECIMAL(15,0)))=           
     7.")  into Spool 1 (group_amps), which is built locally on the AMPs.         
     The size of Spool 1 is estimated with no confidence to be 103,355           
     rows (3,824,135 bytes).  The estimated time for this step is 0.51           
     seconds.             
  -> The contents of Spool 1 are sent back to the user as the result of          
     statement 1.  The total estimated time is 2.95 seconds.

That looks like a nice parallel plan and an even better prediction. But the runtime is … tatarata … 1,500 seconds!

Wait a minute. How is that possible?

It is possible (hypothesis), because

  • a) the VMware image is configured with a single logical processor, effectively disabling any hyperthreading or multicore support of the hosting i7, and because
  • b) the TDExpress configuration inside the VM is equipped with only two AMPs, thereby effectively disabling any noticeable partitioning/distribution effect at all (the AMP count can be verified with the query below).
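A quick way to check hypothesis b) on any Teradata installation is the usual HASHAMP idiom, which should return 2 for the image in question:

-- HASHAMP() without an argument returns the highest AMP number,
-- so adding one yields the number of AMPs in the configuration
select hashamp() + 1 as number_of_amps;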

Efficient Exponential Regression Using ANSI-SQL Windowing

This is the fourth part of a posting series (I, II and III) which deals with pushing the declarative boundary of SQL engines in the domain of statistics as far as possible.

In the previous examples, we have gathered samples for (piecewise-linear) risk probabilities.

Now we would rather use these samples to fit a typical exponential risk probability distribution. The appeal of this particular family of distributions is not only that it can adapt to the most reasonable shapes (increasing, decreasing, even a kind of normal distribution), but also that it has very interesting numeric properties.

In particular, by transforming the exponential regression problem into a double-logarithmic representation, we obtain the simplest possible linear regression problem (one variable, one target), which we can even compute “by hand”.
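The post does not spell out the transformation; reading it off the innermost subquery below, the fitted two-parameter form and its double-logarithmic linearization appear to be

probability(age) = 1 - exp( -(age/alpha)^beta )

ln( ln( 1/(1 - probability(age)) ) ) = beta*ln(age) - beta*ln(alpha)

so that, in the double-logarithmic scale, the slope of the regression line is beta, and, since the fitted line passes through the point of means, ln(alpha) = avg_ln_age - avg_ln_ln_surv/beta, which is exactly what the outermost select computes.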

The remainder of this posting is now rather trivial, given the already presented definitions of the database schema:

select material
       -- the exponential shape parameter corresponds to the linear slope parameter     
     , beta     
       -- the exponential acceleration parameter can be computed out of the linear horizontal translation
       -- be aware of zero slopes
     , exp(avg_ln_age-case when beta <> 0 then avg_ln_ln_surv/beta else 0 end) as alpha
  from (
     -- Aggregate the linear coefficients per material and solve the equation
     select material
         , (count(*)*sum(ln_age_surv)-sum(ln_age_sum_surv))/(count(*)*sum(ln_age_sqr)-sum(ln_age)*sum(ln_age)) as beta
           -- prepare computation of horizontal translation
         , avg(ln_age) as avg_ln_age
         , avg(ln_ln_surv) as avg_ln_ln_surv
      from (
            -- Next we prepare the equation solving by some
            -- windowing/aggregation
         select material
              , ln_age
              , ln_ln_surv
              , ln_age*ln_age as ln_age_sqr
              , ln_age*ln_ln_surv as ln_age_surv
              , ln_age*sum(ln_ln_surv) over (partition by material) as ln_age_sum_surv
           from (
              -- Here, we transform the original problem into the
               -- logarithmic scale; edge probabilities (0 and 1) are clamped to avoid infinite logarithms
              select material
                   , ln(age) as ln_age
                   , ln(ln(1/(1-greatest(least(probability,0.999999),0.00001)))) as ln_ln_surv 
             from repair_estimations 
        ) transf(material, ln_age, ln_ln_surv)
     ) compl(material,ln_age,ln_ln_surv,ln_age_sqr, ln_age_surv, ln_age_sum_surv)
  group by material
  ) final(material,beta,avg_ln_age,avg_ln_ln_surv)
;

Progressively Fusing Independent Probabilities using Proprietary SQL Windowing (Teradata version)

As announced in the second part of the posting series, the scheme of using aggregate User-Defined Functions (UDFs) for fusing probabilities works in a similar fashion for all the standard databases, such as Oracle and Teradata.

Similar to the Oracle formulation (in which the aggregation object itself keeps both state and functional logic), here is the Teradata version, consisting of a static logic class and a separate state class that is serialized and passed along:

package de.jasna.teradata.repair;

import java.sql.SQLException;

import com.teradata.fnc.Context;
import com.teradata.fnc.Phase;
import java.io.Serializable;

/**
 * Storage class for ProbabilityUnion Java User Defined Function
 * Remember that you have to reserve the appropriate context memory for its serialized form 
 * in the final function definition
 */
class ProStor implements Serializable {
	// memory
	private double probability=0;

	/**
	 * Default Constructor
	 */
	public ProStor() {
	}

	/**
	 * @return current probability
	 */
	public double getProbability() {
		return this.probability;
	}

	/**
	 * @param aProbability to set
	 */
	public void setProbability(double aProbability) {
		this.probability = aProbability;
	}
}

/**
 * Teradata version of the union of independent probabilities
 */
public class ProbabilityUnion {

	/**
	 * Compute method
	 * @param phase of aggregation
	 * @param context of aggregation including states
	 * @param aProbability value to integrate
	 * @return union of probabilities
	 */
	public static double unite(Phase phase
	                         , Context[] context
			                 , double aProbability) 
	   throws SQLException {

		try {
			switch (phase.getPhase()) {

			/* The AGR_INIT phase is executed once. */
			case Phase.AGR_INIT:
				ProStor newStor=new ProStor();
				context[0].initCtx(newStor);
				context[0].setObject(1, newStor);
				return 0;

			/* The AGR_DETAIL phase is executed once per row. */
			case Phase.AGR_DETAIL:
				ProStor yet = (ProStor) context[0].getObject(1);
				yet.setProbability(yet.getProbability() + aProbability
						- yet.getProbability() * aProbability);
				context[0].setObject(1, yet);
				return 0;

			/* AGR_COMBINE runs over the distributed results */
			case Phase.AGR_COMBINE:
				ProStor res1 = (ProStor) context[0].getObject(1);
				ProStor res2 = (ProStor) context[0].getObject(2);
				res1.setProbability(res1.getProbability()
						+ res2.getProbability() - res1.getProbability()
						* res2.getProbability());
				context[0].setObject(1, res1);
				return 0;

			/* we are through */
			case Phase.AGR_FINAL:
				ProStor finale = (ProStor) context[0].getObject(1);
				return finale.getProbability();

			/* prob of empty set is zero */
			case Phase.AGR_NODATA:
				return 0;

			default:
				throw new SQLException("Invalid Phase", "38905");
			}
		} catch (Exception ex) {
			throw new SQLException(ex.getMessage(), "38101");
		}
	}

}

Once you have deployed the jar file under a unique id (e.g., “de.jasna.teradata.repair”, for which your user needs “execute procedure” grants on the ‘SQLJ’ database), you can define the aggregation UDF. Note that the storage class above needs some extra class aggregate space beyond the default of 64 bytes.
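The deployment step itself is not shown in this post; a minimal sketch using Teradata’s SQLJ.INSTALL_JAR external stored procedure could look like this (the client-side jar path is of course hypothetical):

call SQLJ.INSTALL_JAR('CJ!/home/jasna/repair.jar', 'de.jasna.teradata.repair', 0);

With the jar registered under that id, the aggregate function definition follows: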

create function union_indep_prop(aProbability float)
returns float
class aggregate(128)
specific union_indep_prop
language java
no sql
no external data
parameter style java
deterministic
called on null input
external name 'de.jasna.teradata.repair:de.jasna.teradata.repair.ProbabilityUnion.unite(com.teradata.fnc.Phase,com.teradata.fnc.Context[],double) returns double'
;
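As a quick smoke test of the new aggregate (a hypothetical check that assumes the repair_estimations table from the earlier posts), the fused probability per material can be requested directly:

select material
     , union_indep_prop(probability) as fused_probability
  from repair_estimations
 group by material
;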

Finally, we can define the hierarchical decomposition and probability propagation query (given the STRUCTURED_MATERIAL definition as in the previous post) as follows:

/*
 * A recursive view to decompose the bill of materials down to the leaf level
 */
create recursive view MATERIAL_HIERARCHY(root_material, component_material,depth) as (
  SELECT STRUCTURED_MATERIAL.composite_material as root_material
       , STRUCTURED_MATERIAL.component_material as component_material
       , 0 as depth
    FROM STRUCTURED_MATERIAL
  union all
  SELECT MATERIAL_HIERARCHY.root_material
       , STRUCTURED_MATERIAL.component_material
       , MATERIAL_HIERARCHY.depth+1 as depth
    FROM MATERIAL_HIERARCHY
       join STRUCTURED_MATERIAL on MATERIAL_HIERARCHY.component_material=STRUCTURED_MATERIAL.composite_material
);

/*
 * Here come the propagated probabilities
 */
insert into repair_estimations
 select MATERIAL_HIERARCHY.root_material
      , AGE.age
      , min(AGE.age-1) over (partition by MATERIAL_HIERARCHY.root_material 
                             order by AGE.age
                             rows between 1 following 
                                  and unbounded following) as next_age
      , union_indep_prop(ALL_REPAIRS.probability) as probability
      , union_indep_prop(ALL_REPAIRS.lower_bound_95) as lower_bound_95
      , union_indep_prop(ALL_REPAIRS.upper_bound_95) as upper_bound_95
      , max(MATERIAL_HIERARCHY.depth) as depth
  from MATERIAL_HIERARCHY
    join REPAIR_ESTIMATIONS AGE on AGE.material=MATERIAL_HIERARCHY.component_material and AGE.depth=0
    join MATERIAL_HIERARCHY ALL_PARTS on ALL_PARTS.root_material=MATERIAL_HIERARCHY.root_material 
    join REPAIR_ESTIMATIONS ALL_REPAIRS on ALL_REPAIRS.material=ALL_PARTS.component_material and ALL_REPAIRS.depth=0
         and AGE.age between ALL_REPAIRS.age and coalesce(ALL_REPAIRS.next_age, AGE.age)
 group by MATERIAL_HIERARCHY.root_material
        , AGE.age
;

Some helpful links to get this implementation flying were this debugging guide and the Java Aggregate Function tutorial.

Progressively Fusing Independent Probabilities using Proprietary SQL Windowing (Oracle version)

This is the sequel to this proposal on how to estimate the lifetime of spare parts using ordinary query language semantics.

This time, we are interested in propagating the individual probabilities obtained there up a more complex material hierarchy. We start from the simplifying assumption that our only variable (“age”, which could also be named or measured as general “wear and tear”) already takes into account all the influences on a given material’s span of life. And when combining two materials into some higher structure, we assume that they do not influence each other’s durability all too much.

create table STRUCTURED_MATERIAL (
  composite_material int references MATERIAL(material) not null,
  component_material int references MATERIAL(material) not null
);

insert into STRUCTURED_MATERIAL values (4,1);
insert into STRUCTURED_MATERIAL values (4,2);

insert into STRUCTURED_MATERIAL values (5,4);
insert into STRUCTURED_MATERIAL values (5,3);

Unfortunately, even under these assumptions fusing the individual probabilities turns out to require a “progressive” calculation, i.e., a formula which cannot be stated in terms of an ANSI SQL aggregation.

[Figure: union of independent probabilities]
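Reconstructed from the UDF code below (the original figure is not reproduced here): fusing two independent repair probabilities is the inclusion-exclusion identity, applied progressively row by row,

P(A ∪ B) = P(A) + P(B) - P(A)*P(B)

p_k = p_(k-1) + q_k - p_(k-1)*q_k,   with p_0 = 0

which in closed form equals 1 - (1-q_1)*(1-q_2)*...*(1-q_n), a running product rather than a running sum.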

Fortunately, the expressiveness of the typical User-Defined Function (UDF) frameworks, such as the one of the Oracle Database, is well able to handle that requirement, even under parallel/distributed execution settings:

/* Oracle needs a type/object definition as the UDF-base */
create type ProbabilityUnionImpl as object
(
  probability NUMBER, -- memory just keeps current probability
  static function ODCIAggregateInitialize(sctx IN OUT ProbabilityUnionImpl) return number,
  member function ODCIAggregateIterate(self IN OUT ProbabilityUnionImpl
                                     , value IN number) return number,
  member function ODCIAggregateTerminate(self IN ProbabilityUnionImpl
                                       , returnValue OUT number
                                       , flags IN number) return number,
  member function ODCIAggregateMerge(self IN OUT ProbabilityUnionImpl
                                   , ctx2 IN ProbabilityUnionImpl) return number
);
/
/* and here are the method implementations */
create or replace type body ProbabilityUnionImpl is 

/* We start with an empty probability */
static function ODCIAggregateInitialize(sctx IN OUT ProbabilityUnionImpl) return number is 
begin
  sctx := ProbabilityUnionImpl(0);
  return ODCIConst.Success;
end;

/* Local iteration implements fusion under independence */
member function ODCIAggregateIterate(self IN OUT ProbabilityUnionImpl
                                   , value IN number) return number is
begin
  self.probability:=self.probability+value-self.probability*value;
  return ODCIConst.Success;
end;

/* At the end, just return the stored probability */
member function ODCIAggregateTerminate(self IN ProbabilityUnionImpl
                                     , returnValue OUT number
                                     , flags IN number) return number is
begin
  returnValue := self.probability;
  return ODCIConst.Success;
end;

/* Distributed/Parallel merge, same case as local iteration */
member function ODCIAggregateMerge(self IN OUT ProbabilityUnionImpl
                                 , ctx2 IN ProbabilityUnionImpl) return number is
begin
  self.probability:=self.probability+ctx2.probability-self.probability*ctx2.probability;
  return ODCIConst.Success;
end;

end;
/

/* finally, the aggregation function is defined in terms of the type */
create function union_indep_prop (input number) return number 
PARALLEL_ENABLE AGGREGATE USING ProbabilityUnionImpl;

With that backing installed, we are now able to complete our estimation table even for complex materials. Since we are Oracle-specific anyway, we also use the established recursion syntax for flattening the bill-of-materials tree.

insert into REPAIR_ESTIMATIONS
with
  /* we just need the root - leaf material relations */
  DECOMPOSITION as (
  select 
         CONNECT_BY_ROOT 
         composite.composite_material as root_material
       , composite.component_material as component_material
       , level as depth
    from STRUCTURED_MATERIAL composite
 connect by prior composite.component_material=composite.composite_material 
)
 /* for each age of an individual component, find all corresponding probabilities of 
    sibling materials for fusion */
 select DECOMPOSITION.root_material
      , REPAIR.age
      , lead(REPAIR.age-1) over (partition by DECOMPOSITION.root_material order by REPAIR.age) as next_age
      , union_indep_prop(ALL_REPAIRS.probability) as probability
      , union_indep_prop(ALL_REPAIRS.lower_bound_95) as lower_bound_95
      , union_indep_prop(ALL_REPAIRS.upper_bound_95) as upper_bound_95
      , max(DECOMPOSITION.depth) as depth
  from DECOMPOSITION
    join REPAIR_ESTIMATIONS REPAIR on REPAIR.material=DECOMPOSITION.component_material and REPAIR.depth=0
    join DECOMPOSITION ALL_PARTS on ALL_PARTS.root_material=DECOMPOSITION.root_material 
    join REPAIR_ESTIMATIONS ALL_REPAIRS on ALL_REPAIRS.material=ALL_PARTS.component_material and
         REPAIR.age between ALL_REPAIRS.age and coalesce(ALL_REPAIRS.next_age, REPAIR.age)
 group by DECOMPOSITION.root_material
        , REPAIR.age
;

Stay tuned for the upcoming Teradata formulation.

Sampling consistently out of (unordered, split, fragmented) detail data

One of the problems when sampling data out of raw files is that there may be consistency constraints on the picked lines.

For example, if you want to extract representative sales slips out of TLOG texts, you want all the positions belonging to one sales slip to be sampled as a whole … and you want that property to hold consistently over all processes/machines onto which the texts/sampling have been distributed.

Using the typically seeded pseudo-random sequences will not work as expected here. You would have to first aggregate the sales slip headers, sample on the level of headers, and then join the resulting sample back with the original TLOG data.

A nice idea that circumvents that necessity with only a minimal bit of overhead is inspired by Chuck Lam’s method of simply constructing Bloom-filter hash functions. For each line, the pseudo-random generator is seeded with the hash code of a given key/object. Then, a fixed position of the random sequence (the new “seed”) is read off as the observed random value.

For our TLOG case, all the lines carrying the identical sales slip key will get the same random boolean computed, and will thus be filtered out or passed through as a whole.
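If the TLOG lines have already been landed in a staging table, the same key-hash idea can also be written directly in SQL. A minimal Teradata-flavoured sketch with hypothetical table and column names:

-- roughly a 10% sample that is consistent across sessions, AMPs and reloads:
-- every line of a slip carries the same slip_id and hence the same hash bucket,
-- so complete slips are either kept or dropped
select *
  from TLOG_LINE
 where hashbucket(hashrow(slip_id)) mod 100 < 10
;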

SCD3+1 – Save the Dimensional Bus

The standard ways of modelling dimensional qualitative change (as opposed to factual quantitative dynamics) have been widely accepted since Ralph Kimball published his Data Warehouse Toolkit bible.

  • SCD0 – Position the changing attribute in the fact table
  • SCD1 – Put the attribute into the dimension table and simply overwrite old attribute values
  • SCD2 – Reify temporal changes into separate attributes of the dimension table
  • SCD3 – Allow for separately attributed versions of the dimension entities in the dimension table
[Figure: Various Types of Slowly Changing Dimensions (SCD)]

Surely, the latter solution is the most flexible, but it comes at the price of losing the original granularity of the dimension (now rather versions or states than static entities) and hence the necessity of remodelling all the dependent fact tables, too – even those whose interpretation is not interesting with respect to the changing attribute in the first place. Once your data warehouse has grown to a respectable complexity and you have reused your dimensions well in the spirit of Mr. Kimball’s fruitful bus matrix design, the effort to introduce such a seemingly small dynamic feature into a running and well-fed information layer easily explodes.

An architectural solution to this dilemma is what we call SCD3+1, a combination of the SCD3 and SCD1 techniques. Here, we construct two variants of the dimension: a full-blown SCD3 variant to which all the required facts can be connected, and a (now derived) SCD1 view which stays connected to the rest of the data warehouse. For that purpose, each “current” version of an SCD3 entity is marked with a special-purpose attribute which can be filtered to obtain the corresponding SCD1 subset. This way, even state changes in the future can already be tracked and, e.g., connected to the planning measures.
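A minimal sketch of the SCD3+1 layout on a purely hypothetical CUSTOMER dimension (names and attributes are illustrative only): the versioned table carries all states and is what the new, change-aware facts connect to, while the derived view filters on the “current” marker and keeps the original one-row-per-entity grain for the remaining bus.

create table CUSTOMER_SCD3 (
  customer_version_id integer not null primary key,
  customer_id         integer not null,   -- the stable entity key
  segment             varchar(30),        -- the slowly changing attribute
  valid_from          date not null,
  is_current          char(1) not null    -- the special-purpose "current" marker
);

-- the SCD1 face of the same dimension for all legacy fact tables
create view CUSTOMER_SCD1 as
select customer_id
     , segment
  from CUSTOMER_SCD3
 where is_current = 'Y'
;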

Smarter than the average MSI (or: Running HDP-1.1.0-GA on Windows 2008 Server)

This is so great.

  1. Like to try out HDInsight Server Preview?
  2. Already got access to some “quite modern” Windows 2008 Server with most of the SDK, .NET and JDK orgy set up?
  3. Except that … this wonderful piece of hardware resides in the intranet only, and the web installer is not an option.
  4. Ummpf. So, the M$ infrastructure partner also provides its own Hortonworks Data Platform for Windows distribution.
  5. You also successfully managed to install the rest of the prerequisites (please don’t let the admins find out, especially about the remote PowerShell part) and to resolve the configuration issues.
  6. And all this %§$”%- MSI responds is “Visual Studio C++ Redistributable (x64) Package not installed” while it is indeed installed.

This is soo great. See, you are not alone.

I simply can’t believe that a distribution of a managed code framework depends on a particular minor version of the underlying operating system.

I simply can’t believe that a malfunctioning MSI is the dead end of a once promising long road.

Rightly believed. Here is the fundamental trick for tweaking such a package using the Orca tool shipped with the Windows SDK.

Look into the LaunchCondition table and see the overly strict tests on the Visual C++ redistributable and the operating system version.

[Screenshot: the MSI in Orca before the edit]

Delete the vc_redist dependency and adapt the operating system version accordingly.

[Screenshot: the MSI in Orca after the edit]

And make sure that your JAVA_HOME (the system-wide environment variable, not a shell variable, as the installer spawns a new environment during its run) contains neither spaces (like “Program Files”) nor double quotes – a lesson that can be learned by closely inspecting the resulting installer logs.

Installation completed successfully.