Efficient Subset Matching using SQL (with an open question regarding the physical versus the virtual appliance of Teradata)

Usually, I tend to publish only successful experiments. For the first time, however, I am seemingly not able to somehow emulate the soft- and hardware setup needed to verify my hypotheses. So I desperately need your help.

Today, we wander through the realm of “big” propositional rule matching. Suppose we’ve got on the one hand 10^7 formulas of the form precondition -> conclusion where the propositions are key-encoded and we assume for simplicity a fixed length for their conjunction.
On the other hand, we are given a set of 250.000 fact formulas which are represented quite similarly but with a variable conjunction size whose average is three times the length of the preconditions.

So here is the vertical DDL model, followed by a Teradata sample population code:

create table RULE (
 ruleId integer not null,
 propositionId integer not null,
 conclusionId integer not null,
 constraint pk primary key(ruleId,propositionId)
)
;

create table FACT (
 factId integer not null,
 propositionId integer not null,
 constraint pk primary key(factId,propositionId)
)
;

CREATE PROCEDURE populate_rule(in maxRules integer, in maxSteps integer)
BEGIN 
  DECLARE ruleCount INTEGER DEFAULT 0; 
  DECLARE stepCount INTEGER DEFAULT 0; 
  DECLARE currentProposition INTEGER DEFAULT 0;
  DECLARE conclusion INTEGER DEFAULT 0;

  loop_label: WHILE ruleCount maxSteps THEN
      SET stepCount = 1;
      SET ruleCount = ruleCount + 1;
      SET currentProposition = 0;
    END IF;

    Set currentProposition = currentProposition + random(1,30);
    Set conclusion = random(1,2100);

    insert into rule(:ruleCount, :currentProposition, :conclusion);

    if ruleCount mod 1000 = 0 THEN
       commit;
    END IF;

  END WHILE loop_label;  

  commit;

END;
/

CREATE PROCEDURE populate_fact(in maxFacts integer)
BEGIN 
  DECLARE factCount INTEGER DEFAULT 0; 
  DECLARE stepCount INTEGER DEFAULT 0; 
  DECLARE currentProposition INTEGER DEFAULT 0;
  DECLARE currentSteps INTEGER DEFAULT 0;

  loop_label: WHILE factCount currentSteps THEN
      SET stepCount = 1;
      SET currentSteps = random(1,40);
      SET factCount = factCount + 1;
      SET currentProposition = 0;
    END IF;

    Set currentProposition = currentProposition + random(1,10);

    insert into fact(:factCount, :currentProposition);

    if factCount mod 500 = 0 THEN
       commit;
    END IF;

  END WHILE loop_label;  

  commit;
END;
/

Under the given model, the task of matching rules to facts equals to determining, which fixed-length sets in the RULE table correspond are subsets of the variable-length sets in the FACT table. Hence, a quite traditional JOIN followed by a filtered aggregation:

select factId
     , ruleId
     , conclusionId 
  from FACT join RULE on FACT.propositionId=RULE.propositionId 
 group by factId, ruleId
having count(*)=7

The performance of this method, though at least partially covered by the primary keys, is not overwhelming (1.500 seconds in the standard TDExpress14 VMWare image on a Core i7). Which is not too surprising, given the obtained execution plan:

  1) First, we lock a distinct PRODUCT_QUALITY."pseudo table" for read           
     on a RowHash to prevent global deadlock for PRODUCT_QUALITY.FACT.           
  2) Next, we lock a distinct PRODUCT_QUALITY."pseudo table" for read            
     on a RowHash to prevent global deadlock for PRODUCT_QUALITY.RULE.           
  3) We lock PRODUCT_QUALITY.FACT for read, and we lock                          
     PRODUCT_QUALITY.RULE for read.                                              
  4) We execute the following steps in parallel.                                 
       1) We do an all-AMPs RETRIEVE step from PRODUCT_QUALITY.RULE  
          by way of an all-rows scan with no residual conditions into            
          Spool 4 (all_amps) fanned out into 5 hash join partitions,             
          which is duplicated on all AMPs.  The result spool file will           
          not be cached in memory.  The size of Spool 4 is estimated             
          with high confidence to be 206,708 rows (4,340,868 bytes).             
          The estimated time for this step is 0.65 seconds.                 
       2) We do an all-AMPs RETRIEVE step from PRODUCT_QUALITY.FACT 
          by way of an all-rows scan with no residual conditions into            
          Spool 5 (all_amps) fanned out into 5 hash join partitions,             
          which is built locally on the AMPs.  The input table will not          
          be cached in memory, but it is eligible for synchronized               
          scanning.  The result spool file will not be cached in memory.         
          The size of Spool 5 is estimated with high confidence to be            
          350,001 rows (7,350,021 bytes).  The estimated time for this           
          step is 1.32 seconds.                      
  5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a             
     RowHash match scan, which is joined to Spool 5 (Last Use) by way            
     of a RowHash match scan.  Spool 4 and Spool 5 are joined using a            
     merge join, with a join condition of ("propositionId =                      
     propositionId").  The result goes into Spool 3 (all_amps), which            
     is built locally on the AMPs.   The size of Spool 3 is estimated with no confidence to be                   
     61,145,139 rows (1,406,338,197 bytes).  The estimated time for              
     this step is 2 minutes and 42 seconds.                                              
  6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by          
     way of an all-rows scan , grouping by field1 (                              
     PRODUCT_QUALITY.FACT.factId ,PRODUCT_QUALITY.RULE.ruleId                    
     ,PRODUCT_QUALITY.RULE.conclusionId).  Aggregate Intermediate Results 
     are computed globally, then placed           
     in Spool 6.  The aggregate spool file will not be cached in memory.         
     The size of Spool 6 is estimated with no confidence to be                   
     45,858,855 rows (1,696,777,635 bytes).  The estimated time for              
     this step is 1 hour and 22 minutes.                  
  7) We do an all-AMPs RETRIEVE step from Spool 6 (Last Use) by way of           
     an all-rows scan with a condition of ("(Field_5 (DECIMAL(15,0)))=           
     7.") into Spool 1 (group_amps), which is built locally on the AMPs.         
     The result spool file will not be cached in memory.  The size of            
     Spool 1 is estimated with no confidence to be 45,858,855 rows (             
     1,696,777,635 bytes).  The estimated time for this step is 3                
     minutes and 36 seconds.                                                     
  -> The contents of Spool 1 are sent back to the user as the result of          
     statement 1.  The total estimated time is 1 hour and 28 minutes.

The problem that we are faced with is that due to the nature of the primary key (leading key factId or ruleId) and the Teradata parallelization strategy, the rows cannot be locally joined in the AMPS (or as to put it in traditional SQL terms: we need to arrange a SORT-ORDER MERGE JOIN).

Hence we need to fiddle with the table layout itself, i.e., we manipulate the primary index to just point to the crucial propositionId column:

  1) First, we lock a distinct PRODUCT_QUALITY."pseudo table" for read           
     on a RowHash to prevent global deadlock for PRODUCT_QUALITY.FACT.           
  2) Next, we lock a distinct PRODUCT_QUALITY."pseudo table" for read            
     on a RowHash to prevent global deadlock for PRODUCT_QUALITY.RULE.           
  3) We lock PRODUCT_QUALITY.FACT for read, and we lock                          
     PRODUCT_QUALITY.RULE for read.                                              
  4) We do an all-AMPs JOIN step from PRODUCT_QUALITY.RULE by way of a           
     RowHash match scan with no residual conditions, which is joined to          
     PRODUCT_QUALITY.FACT by way of a RowHash match scan with no                 
     residual conditions.  PRODUCT_QUALITY.RULE and                              
     PRODUCT_QUALITY.FACT are joined using a merge join, with a join             
     condition of ("PRODUCT_QUALITY.FACT.propositionId =                         
     PRODUCT_QUALITY.RULE.propositionId").  The result goes into Spool           
     3 (all_amps), which is built locally on the AMPs.   
     The size of Spool 3 is estimated with low confidence to be           
     137,806 rows (3,169,538 bytes).  The estimated time for this step           
     is 1.02 seconds.                             
  5) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by          
     way of an all-rows scan , grouping by field1 (                              
     PRODUCT_QUALITY.FACT.factId ,PRODUCT_QUALITY.RULE.ruleId                    
     ,PRODUCT_QUALITY.RULE.conclusionId).   Aggregate Intermediate                
     Results are computed globally, then placed in Spool 4.  The                 
     aggregate spool file will not be cached in memory.  The size of             
     Spool 4 is estimated with no confidence to be 103,355 rows (                
     3,824,135 bytes).  The estimated time for this step is 1.42                 
     seconds.                                                          
  6) We do an all-AMPs RETRIEVE step from Spool 4 (Last Use) by way of           
     an all-rows scan with a condition of ("(Field_5 (DECIMAL(15,0)))=           
     7.")  into Spool 1 (group_amps), which is built locally on the AMPs.         
     The size of Spool 1 is estimated with no confidence to be 103,355           
     rows (3,824,135 bytes).  The estimated time for this step is 0.51           
     seconds.             
  -> The contents of Spool 1 are sent back to the user as the result of          
     statement 1.  The total estimated time is 2.95 seconds.

That looks like a nice parallel plan and an vene better prediction. But the runtime is … tatarata … 1.500 seconds!

Wait a minute. How is that possible?

It is possible (hypothesis), because

  • a) the VMWare image is configured with 1 logical processor, mainly disabling any hyperthreading or multicore support by the hosting i7 and because
  • b) the TDExpress configuration inside the VMWare is only equippied with two AMPs, hereby efficiently disabling any noticable partitioning/distribution effect at all.
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s