Estimating Multiplicative Probabilities Using ANSI-SQL Windowing

SQL can be a quite efficient tool to implement particular types of statistics. Since windowing became a part of the ANSI standard, this holds for more and more databases as well as mathematical/structural complexities.

For example, product limits can be incrementally estimated by multiplying the probabilities over age. Only the quite impressive ANSI SQL aggregate function list  does not contain a corresponding operation.

Fortunately, we know (or at least Google knows 😉 that for positive P(i), it holds that ΠP(i) = EXP(Σln(P(i)))

This allows us to realize a (batch) product limit estimation including the determination of confidence intervals such as in the following example which has been generalized from Oracle to Teradata. Already included are some objects needed for the sequel post, just in case you wonder about the verbosity:

create table MATERIAL (
	material int primary key not null,
	description varchar(100)
);
insert into MATERIAL values (1, 'sit');
insert into MATERIAL values (2, 'aenean');
insert into MATERIAL values (3, 'adipiscing');
insert into MATERIAL values (4, 'nulla');
insert into MATERIAL values (5, 'in');
insert into MATERIAL values (6, 'sed');

create table PRODUCT (
	serial int primary key not null,
	material INT references MATERIAL(material) not null,
	manufacturing_date date not null
);

insert into PRODUCT  values (1, 2, '2002-08-14');
insert into PRODUCT  values (2, 1, '2001-02-20');
insert into PRODUCT  values (3, 1, '2005-10-27');
insert into PRODUCT  values (4, 1, '2003-10-23');
insert into PRODUCT  values (5, 1, '2001-12-21');
insert into PRODUCT  values (6, 2, '2010-10-11');
insert into PRODUCT  values (7, 1, '2002-06-28');
insert into PRODUCT  values (8, 2, '2009-12-29');
insert into PRODUCT  values (9, 1, '2002-01-10');
insert into PRODUCT  values (10, 2, '2008-10-19');
insert into PRODUCT  values (11, 3, '2007-01-04');
insert into PRODUCT  values (12, 3, '2008-08-06');
insert into PRODUCT  values (13, 4, '2010-08-01');
insert into PRODUCT  values (14, 4, '2011-12-01');
insert into PRODUCT  values (15, 5, '2012-02-03');

create table REPAIR (
	serial_flawed int unique references PRODUCT(serial) not null,
	repair_date date not null
);

insert into REPAIR values (7,'2004-01-22');
insert into REPAIR values (6,'2011-04-09');
insert into REPAIR values (2,'2001-06-21');
insert into REPAIR values (4,'2010-07-25');
insert into REPAIR values (1,'2011-06-16');
insert into REPAIR values (10,'2009-04-17');
insert into REPAIR values (5,'2003-04-13');
insert into REPAIR values (3,'2011-12-31');
insert into REPAIR values (9,'2002-06-10');
insert into REPAIR values (11,'2010-06-10');

/* Not all ANSI databases cope with my beloved nested CTEs, so here is the explicit version */ 

/* aggregating repair data by part age */ 
create view BASE_EXPERIMENT as
   select PRODUCT.material
        , REPAIR.repair_date-PRODUCT.manufacturing_date as age
        , cast(sum(1) as float) as samples
     from PRODUCT 
       join REPAIR on REPAIR.serial_flawed=PRODUCT.serial
    group by PRODUCT.material
           , REPAIR.repair_date-PRODUCT.manufacturing_date
;

/* building intervals and base statistics */
create view POPULATION as 
   select BASE_EXPERIMENT.*
        , row_number() over (partition by BASE_EXPERIMENT.material 
                             order by BASE_EXPERIMENT.age)
          as experiment
        , cast(
             sum(BASE_EXPERIMENT.samples) over (partition by BASE_EXPERIMENT.material 
                                                order by BASE_EXPERIMENT.age 
                                                rows between current row 
                                                   and unbounded following) 
          as float) as population
        , min(BASE_EXPERIMENT.age) over (partition by BASE_EXPERIMENT.material 
                                         order by BASE_EXPERIMENT.age
                                         rows between 1 following 
                                           and unbounded following) 
          as next_age
     from BASE_EXPERIMENT
;

/* Here is the actual estimation magic done */
create view SURVIVAL as 
   select POPULATION.*
        , case 
             when POPULATION.population<=1 then null 
             else 
                exp(
                 sum(
                   ln(
                     case 
                        when POPULATION.age is null then 1.0
                        when POPULATION.population=POPULATION.samples then null 
                        else 1.0-POPULATION.samples/POPULATION.population 
                     end
                   )
                 ) over (partition by POPULATION.material 
                         order by POPULATION.age 
                         rows between unbounded preceding 
                                  and current row 
                 )
                )
          end as probability
        , sqrt(
             sum(
                 case 
                    when POPULATION.experiment=1 then 0.0
                    when POPULATION.population>=POPULATION.samples 
                         then POPULATION.samples/
                              (POPULATION.population*(POPULATION.population-POPULATION.samples))
                    else null 
                 end
             ) over (partition by POPULATION.material 
                     order by POPULATION.age 
                     rows between unbounded preceding and 1 preceding 
             )
          ) as confidence
     from POPULATION
;

create table REPAIR_ESTIMATIONS as (
   select SURVIVAL.material
        , SURVIVAL.age
        , SURVIVAL.next_age
        , 1-coalesce(SURVIVAL.probability,0) as probability
        , 1-coalesce(least(SURVIVAL.probability+1.96*SURVIVAL.probability*SURVIVAL.confidence,1),1) as lower_bound_95
        , 1-coalesce(greatest(0,SURVIVAL.probability-1.96*SURVIVAL.probability*SURVIVAL.confidence),0) as upper_bound_95
        , 0 as depth
     from SURVIVAL  
) 
WITH DATA -- Teradata slang, remove for Oracle
;

/* Thats all folks ... until we speak of structured materials next time */
select * 
  from REPAIR_ESTIMATIONS 
 order by material
        , age
;

BTW, Mockaroo is a really nice sample data generation service … if only we could influence random distributions and impose constraints …

Advertisements

5 thoughts on “Estimating Multiplicative Probabilities Using ANSI-SQL Windowing

  1. Hi, I really like this… trying to run it on sql server… but the SURVIVAL.confidence seems to be missing some lines. There is no function specified… only the OVER clause is specified. Also, missing equals sign on the second WHEN on the same column.
    Then I am intrigued with this approach because, I have forgotten a lot of my stats, but I am wondering don’t you need distribution tables for which are generally not available to the database engine (unless you are using something like an R extension behind Postgres for example)
    Liam

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s