Efficient Exponential Regression Using Ansi-SQL Windowing

This is the fourth part of a posting series (I, II and III) which deals with pushing the declarative boundary of SQL engines in the domain of statistics as far as possible.

In the previous examples, we have gathered samples for (piecewise-linear) risk probabilities.

Now we would rather use these samples to regress a typical exponential risk probability disttibution. The appeal of this particular family of distributions is not only that it can adapt to the most reasonable shapes (increasing, decreasing, even a kind of normal distribution), but that it has most interesting numeric properties.

In particular, by transforming the exponential regression problem with a double-logarithmic representation, we gather the simplest linear regression problem (one variable, one target) which we can even compute “by hand”.

The remainder of this posting is now rather trivial, given the already presented definitions of the database schema:

select material
       -- the exponential shape parameter corresponds to the linear slope parameter     
     , beta     
       -- the exponential acceleration parameter can be computed out of the linear horizontal translation
       -- be aware of zero slopes
     , exp(avg_ln_age-case when beta  0 then avg_ln_ln_surv/beta else 0 end) as alpha
  from (
     -- Aggregate the linear coefficients per material and solve the equation
     select material
         , (count(*)*sum(ln_age_surv)-sum(ln_age_sum_surv))/(count(*)*sum(ln_age_sqr)-sum(ln_age)*sum(ln_age)) as beta
           -- prepare computation of horizontal translation
         , avg(ln_age) as avg_ln_age
         , avg(ln_ln_surv) as avg_ln_ln_surv
      from (
            -- Next we prepare the equation solving by some
            -- windowing/aggregation
         select material
              , ln_age
              , ln_ln_surv
              , ln_age*ln_age as ln_age_sqr
              , ln_age*ln_ln_surv as ln_age_surv
              , ln_age*sum(ln_ln_surv) over (partition by material) as ln_age_sum_surv
           from (
              -- Here, we transform the original problem into the
              -- logarithmic scale, edge probabilities are somehow avoided
              select material
                   , ln(age) as ln_age
                   , ln(ln(1/(1-greatest(least(probability,0.999999),0.00001)))) as ln_ln_surv 
             from repair_estimations 
        ) transf(material, ln_age, ln_ln_surv)
     ) compl(material,ln_age,ln_ln_surv,ln_age_sqr, ln_age_surv, ln_age_sum_surv)
  group by material
  ) final(material,beta,avg_ln_age,avg_ln_ln_surv)
;
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s