Spicing Your Reports With Future Predictions
<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p>By
John Adolph Palinski
</o:p>
A typical developer has created scores of reports that display summarized values for a series of periods. These reports show statistics such as sales by year, customer complaints by month, house starts per quarter, or personal disposable income (pdi) per year. The reports show results for multiple years allowing the report user to compare the current period results with other period results. This is an effective tool but is more effective with the inclusion of trend or regression analysis. Trend analysis components include:
<o:p> </o:p>
· The overall movement or trend (up or down) of the values.
· The degree of movement or trend.
· Difference between the actual value and the trend line.
· Expected future values.
<o:p> </o:p>
The overall movement or trend helps the user determine whether performance is moving in the proper direction. This fact determines whether sales are increasing or customer complaints are decreasing. These are important facts and hard to grasp when values from different periods move radically. The degree of movement is another important fact. It shows how fast the movement is changing. A large positive degree in the movement of sales is good. A low positive degree indicates sluggish sales. This may be expected, but if the company is aggressively trying to increase sales, it may indicate problems. On the other hand, a negative movement in sales indicates problems. The difference between the actual value and the trend values indicates the performance of the specific period. By identifying periods with a difference, the report user may be able to identify techniques or occurrences that may have caused the variance. It may be possible to repeat the technique or occurrence causing the same effect. Finally, expected future values are important for budgetary reasons.
<o:p> </o:p>
Oracle has database regression functions that enable the report developer to compute these values and add them to the result set. Regression functions allow you to perform regression analysis. Regression analysis consists of computing a value that corresponds to a second value. The computed value is the best fit based upon the set of values. This best fit is a straight line. It essentially smoothes the varying set of actual values. Regression analysis uses three important calculations:
<o:p> </o:p>
· Slope – A value or quantity change from one unit to the next. This value somewhat indicates the degree of overall movement.
<o:p> </o:p>
· Coefficient of correlation – The value determines how well the data correlates. A high degree indicates that a trend does exist. Values close to 0 or 1 indicate the highest degree of correlation. A value close to .5 indicates the values are not correlated.
<o:p> </o:p>
· Y Intercept – This value is a constant used to compute the value.
<o:p> </o:p>
The regression functions are used in the following formula to compute values:
<o:p> </o:p>
Y = mx + b
<o:p> </o:p>
Where
<o:p> </o:p>
Y is the computed value
m is the slope
x is the plugged in value
b is the intercept
<o:p> </o:p>
The Oracle functions that compute the formula values are:
<o:p> </o:p>
· Regr_intercept - Computes the Y intercept
<o:p> </o:p>
· Regr_slope - Computes the slope
<o:p> </o:p>
· Regr_r2 - Coefficient of correlation
<o:p> </o:p>
Note that these functions are aggregate functions similar to the Sum, Avg, or Var functions historically available in Oracle. To illustrate the functions, the remainder of this article depicts regression analysis performed on a set of data. The data consists of the <?xml:namespace prefix = st1 ns = "urn:schemas-microsoft-com:office:smarttags" /><st1:country-region><st1:place>US</st1:place></st1:country-region> personal disposable income per capita in the years 1998 – 2002. Listing 1 depicts the analyzed data.
<o:p> </o:p>
Listing 1 – US Personal Disposal Income per Capita (1998 – 2002)
SQL> select year, pdi_per_capita
2 from aggregate_income;
YEAR PDI_PER_CAPITA
---------- --------------
1998 22175
1999 22366
2000 22995
2001 22997
2002 23697
<o:p> </o:p>
The first computed value is the Coefficient of Correlation. Listing 2 shows this calculation using the Regr_r2 function. The computed value is .93 indicating a high degree of correlation. This indicates that the computed trend will have some validity. The Regr_r2 as well as the other regression functions have two parameters. The first value is the measured value. In the case of the Listing 2, it is the personal disposable income. The second value is the value that corresponds to the measured value. In the case of Listing 2, it is the year. This value will be plugged-in to compute the personal disposable income.
<o:p> </o:p>
Listing 2 – The Coefficient of Correlation
<o:p> </o:p>
SQL> select regr_r2 (pdi_per_capita, year)
2 from aggregate_income;
<o:p> </o:p>
REGR_R2(PDI_PER_CAPITA,YEAR)
----------------------------
.931522633
<o:p> </o:p>
The next calculation is the slope. This calculation uses the Regr_slope function. Listing 3 illustrates the calculation. The function uses the same parameters as depicted in Listing 2. The calculated value is 367.5. This means that personal disposable income increases $367.50 each year. This is a positive, but small degree of movement. A $367.50 change for a disposable income $22,995 (see Listing 1) is approximately 1.59%. This is a relatively small change from one year to the next.
<o:p> </o:p>
Listing 3 – The Slope
<o:p> </o:p>
SQL> select regr_slope (pdi_per_capita, year)
2 from aggregate_income;
REGR_SLOPE(PDI_PER_CAPITA,YEAR)
-------------------------------
367.5
<o:p> </o:p>
The final calculation is the Y intercept illustrated in Listing 4. The Y intercept is the value of Y when the X value is 0. It is the b value in the formula and is a constant in the point calculation. The Regr_intercept function calculates the value.
<o:p> </o:p>
Listing 4 – The Y intercept
<o:p> </o:p>
SQL> select regr_intercept (pdi_per_capita, year)
2 from aggregate_income ;
<o:p> </o:p>
REGR_INTERCEPT(PDI_PER_CAPITA,YEAR)
-----------------------------------
-712154
<o:p> </o:p>
At this point, we can put everything together. Listing 4 contains a Select statement that produces a result set that displays the actual personal disposable income values for the years 1998 – 2002. It also displays estimated values for the year 2003-2007. The Select statement is actually two unioned Select statements. The first statement adds the actual records to the result set. The text literal “Actual” is added to each of these records to denote their status.
<o:p> </o:p>
The second Select statement beginning on Line 4 computes the estimated values. The “Estimated” text literal denotes each record. This Select statement has several interesting features. These are:
<o:p> </o:p>
· The Select statement contains two in-line views. An in-line view is a Select statement that resides in another Select statement’s From clause. It produces a derived result set for the master Select statement. The first in-line view begins on Line 7. The purpose of the in-line view is to produce five records with values of 1, 2, 3, 4, and 5. Computations use these records to arrive at the years 2003 – 2007. The Select statement employs the User_tables table as a data source. This table is always available and has more than five records. However, any table with the desired number of records is suitable. The statement also uses the psuedo-column Rownum. The value of Rownum is the sequence number of the last fetched records. Thus a condition of “rownum <=5” will produce a result set of five records. Rownum is the only value in the Select clause. User_table values are unneeded. The statement simply needs the sequential Rownum values.
<o:p> </o:p>
· The second in-line view begins on line 10. This in-line view uses the Regr_intercept and Regr_slope functions to compute the Y intercept and slope values. The master Select statement uses these values to determine the estimated personal disposable income.
<o:p> </o:p>
· The Aggregate_income table (line 6) is used by the master Select statement to determine the last actual value year. It uses the Max function to determine the most recent year from the Aggregate_income table. The estimated years are determined by adding the values from the first in-line view (line 7) to the Aggregate_income maximum year. The result is the values 2003, 2004, 2005, 2006, and 2007.
<o:p> </o:p>
· Notice that a join condition does not exist for the Aggregate_income table and the two in-line views. This causes a Cartesian join. You will not often see a Cartesian join applied. A Cartesian join matches each record from one result set to each record in another. It is possible to use a Cartesian join because the values retrieved from the Aggregate_income table and the second in-line view (line 10) will be the same on every row. The effect of the Cartesian joins is to create a five record result set. Each record will have a distinct period (rownum) value and the same year, Y intercept, and slope.
<o:p> </o:p>
· The master Select statement’s Select clause calculates the predicted trend values. This expression is using the formula mx + b (slope * plugged in value + the intercept).
<o:p> </o:p>
Listing 4 illustrates the results of the Select statement.
<o:p> </o:p>
Listing 4 – Actual and Estimated Personal Disposable Income for the Years 1998 - 2007
<o:p> </o:p>
SQL> select year, 'Actual', pdi_per_capita<o:p></o:p>
2 from aggregate_income<o:p></o:p>
3 union<o:p></o:p>
4 select max(year) + period, 'Estimated',<o:p></o:p>
5 intercept + (slope*(max(year) + period)) y<o:p></o:p>
6 from aggregate_income,<o:p></o:p>
7 (select rownum period<o:p></o:p>
8 from user_tables<o:p></o:p>
9 where rownum <= 5),<o:p></o:p>
10 (select regr_intercept (pdi_per_capita, year) intercept,<o:p></o:p>
11 regr_slope (pdi_per_capita, year) slope<o:p></o:p>
12 from aggregate_income)<o:p></o:p>
13 group by period, intercept, slope;<o:p></o:p>
<o:p> </o:p>
YEAR 'ACTUAL' PDI_PER_CAPITA <o:p></o:p>
---------- --------- -------------- <o:p></o:p>
1998 Actual 22175 <o:p></o:p>
1999 Actual 22366 <o:p></o:p>
2000 Actual 22995 <o:p></o:p>
2001 Actual 22997 <o:p></o:p>
2002 Actual 23697 <o:p></o:p>
2003 Estimated 23948.5 <o:p></o:p>
2004 Estimated 24316 <o:p></o:p>
2005 Estimated 24683.5 <o:p></o:p>
2006 Estimated 25051 <o:p></o:p>
2007 Estimated 25418.5 <o:p></o:p>
<o:p> </o:p>
10 rows selected.<o:p></o:p>
<o:p> </o:p>
SQL>
<o:p> </o:p>
Listing 5 is another Select statement performing regression analysis. The output is slightly different. The output has the expected percentage change between years. It also has the difference between the expected value and the actual value. These are two additional pieces of information of interest to your users.
<o:p> </o:p>
Listing 5 – Regression analysis showing variances
<o:p> </o:p>
SQL> select slope/((year*slope)+intercept)"Expected Change",<o:p></o:p>
2 year, type, decode (pdi_per_capita, 0,<o:p></o:p>
3 ((year*slope)+intercept), pdi_per_capita) <o:p></o:p>
4 pdi_per_capita,<o:p></o:p>
5 decode (pdi_per_capita, 0,<o:p></o:p>
6 ((year*slope)+intercept), pdi_per_capita) -<o:p></o:p>
7 ((year*slope)+intercept) "Actual Variance"<o:p></o:p>
8 from<o:p></o:p>
9 (select regr_intercept (pdi_per_capita, year) intercept,<o:p></o:p>
10 regr_slope (pdi_per_capita, year) slope<o:p></o:p>
11 from aggregate_income),<o:p></o:p>
12 (select year, 'Actual' type, pdi_per_capita<o:p></o:p>
13 from aggregate_income<o:p></o:p>
14 union all<o:p></o:p>
15 select year + rownum, 'Estimated', 0<o:p></o:p>
16 from user_tables,<o:p></o:p>
17 (select max(year) year from aggregate_income)<o:p></o:p>
18 where rownum <= 5);<o:p></o:p>
<o:p>
</o:p>
Expected Change YEAR TYPE PDI_PER_CAPITA Actual Variance
--------------- -------- ------------ --------------- ----------------
.016620687 1998 Actual 22175 64
.016348956 1999 Actual 22366 -112.5
.016085967 2000 Actual 22995 149
.015831305 2001 Actual 22997 -216.5
.015584581 2002 Actual 23697 116
.015345429 2003 Estimated 23948.5 0
.015113506 2004 Estimated 24316 0
.014888488 2005 Estimated 24683.5 0
.014670073 2006 Estimated 25051 0
.014457974 2007 Estimated 25418.5 0
10 rows selected
SQL>
I hope these simple examples demonstrate how easy it is to add regression values to your reports. These values will certainly add more value to your output.
<o:p> </o:p>
About the Author
<o:p> </o:p>
John Adolph Palinski is an Oracle Certified Professional developer. He has been adjunct faculty at various institutions. He is the author of: “Oracle9i Developer: Developing Web Applications with Forms Builder”, Course Technology, 2003; “Oracle SQL and PL/SQL Handbook: A Guide for Data Administrators, Developers, and Business Analyst”, Addison Wesley, 2003; and the “Oracle Database Construction Kit”, Que, 1997. He has also written numerous Oracle articles. He provides various pre-developed and custom Oracle training courses through his company Realistic Software Training. He is always on the search for interesting topics to write. Contact him at www.oracle-trainer.com<o:p></o:p>
<o:p> </o:p>