Saturday, April 2, 2016

APPROX_COUNT_DISTINCT overview

Oracle has a new function advertised with 12.1.0.2 called APPROX_COUNT_DISTINCT. This SQL function is will allow the user to query a rough estimate of a group of distinct values. This feature is supposed to run more efficiently than the current methods.

This post will test that out with a couple of examples. I will look at accuracy and performance to evaluate the functionality. The first test will be of a full scan, the second test will use the SAMPLE clause, and the final test will use the new APPROX_COUNT_DISTINCT function.

To simplify the test I will create a copy of the DBA_OBJECTS table called BIG_TABLE. The query used will group by OWNER while performing a count on distinct OBJECT_ID. To evaluate the accuracy I will reduce the output of 38 rows to 7 key rows.

As a note, when I evaluate the execution plan I will be using the gather_plan_statistics hint to get ALLSTATS from dbms_xplan.display_cursor. When I execute different queries I often use an additional string in the hint to help search v$sql for the SQL_TEXT I want to review. That explains why you will see hints like /*+ gather_plan_statistics a*/.

Queries

The queries evaluated were the following:

A. Full Scan
select /*+ gather_plan_statistics a*/ owner, count(distinct object_id)
from big_table
group by owner
order by owner;

B. Using sample size of 20%
select /*+ gather_plan_statistics b*/ owner, count(distinct object_id)
from big_table sample (20)
group by owner
order by owner;

C. Using new feature
select /*+ gather_plan_statistics c*/ owner, approx_count_distinct (object_id)
from big_table
group by owner
order by owner;

The results will show that even though the sample size query ran faster than APPROX_COUNT_DISTINCT, the results of the latter were more accurate. The performance of the full scan had a run time of twice that of APPROX_COUNT_DISTINCT, so the performance was also much improved with the new function.

Detail Results

The details are provided here for anybody interested. Following this I will have a Summary of this information.

A:
OWNER                COUNT(DISTINCTOBJECT_ID)
-------------------- ------------------------
APEX_050000                              3600
BI                                          8
HR                                         34
LBACSYS                                   237
PUBLIC                                  37125
SYS                                     42110
XFILES                                    141


PLAN_TABLE_OUTPUT                                                               
--------------------------------------------------------------------------------
SQL_ID  07q279qdm8pgp, child number 0                                           
-------------------------------------                                           
select /*+ gather_plan_statistics a*/ owner, count(distinct object_id)          
from big_table group by owner order by owner                                    
                                                                                
Plan hash value: 2085152455                                                     
                                                                                
----------------------------------------------------------------------------------------------------------- 
| Id  | Operation            | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |           |      1 |        |     38 |00:00:00.30 |    1577 |   1565 |     
|   1 |  SORT GROUP BY       |           |      1 |  90869 |     38 |00:00:00.30 |    1577 |   1565 |    
|   2 |   VIEW               | VM_NWVW_1 |      1 |  90869 |  93680 |00:00:00.27 |    1577 |   1565 |      
|   3 |    HASH GROUP BY     |           |      1 |  90869 |  93680 |00:00:00.24 |    1577 |   1565 |    
|   4 |     TABLE ACCESS FULL| BIG_TABLE |      1 |  90869 |  93680 |00:00:00.18 |    1577 |   1565 |       
------------------------------------------------------------------------------------------------------------
                                                                                


B:
OWNER                COUNT(DISTINCTOBJECT_ID)
-------------------- ------------------------
APEX_050000                               741
HR                                          4
LBACSYS                                    49
PUBLIC                                   7455
SYS                                      8508
XFILES                                     30



PLAN_TABLE_OUTPUT                                                               
--------------------------------------------------------------------------------
SQL_ID  729uapczb0dd2, child number 0                                           
-------------------------------------                                           
select /*+ gather_plan_statistics b*/ owner, count(distinct object_id)          
from big_table sample (20) group by owner order by owner                        
                                                                                
Plan hash value: 1425374893                                                     
                                                                                
------------------------------------------------------------------------------------------------------------| Id  | Operation              | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | 
----------------------------------------------------------------------------------------------------------  
|   0 | SELECT STATEMENT       |           |      1 |        |     37 |00:00:00.06 |    1577 |   1565 |     
|   1 |  SORT GROUP BY         |           |      1 |   3635 |     37 |00:00:00.06 |    1577 |   1565 | 
|   2 |   VIEW                 | VM_NWVW_1 |      1 |   3635 |  18904 |00:00:00.05 |    1577 |   1565 |    
|   3 |    HASH GROUP BY       |           |      1 |   3635 |  18904 |00:00:00.05 |    1577 |   1565 | 
|   4 |     TABLE ACCESS SAMPLE| BIG_TABLE |      1 |   3635 |  18904 |00:00:00.03 |    1577 |   1565 |     
------------------------------------------------------------------------------------------------------------



C:
OWNER                APPROX_COUNT_DISTINCT(OBJECT_ID)
-------------------- --------------------------------
APEX_050000                                      3630
BI                                                  8
HR                                                 34
LBACSYS                                           236
PUBLIC                                          37338
SYS                                             42118
XFILES                                            140



PLAN_TABLE_OUTPUT                                                               
--------------------------------------------------------------------------------
SQL_ID  3bmz5b1a9uzqh, child number 0                                           
-------------------------------------                                           
select /*+ gather_plan_statistics c*/ owner, approx_count_distinct              
(object_id) from big_table group by owner order by owner                        
                                                                                
Plan hash value: 3184991183                                                     
                                                                                
------------------------------------------------------------------------------------------------------------| Id  | Operation            | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | 
------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |           |      1 |        |     38 |00:00:00.14 |    1577 |   1565 |      
|   1 |  SORT GROUP BY APPROX|           |      1 |  90869 |     38 |00:00:00.14 |    1577 |   1565 |   
|   2 |   TABLE ACCESS FULL  | BIG_TABLE |      1 |  90869 |  93680 |00:00:00.09 |    1577 |   1565 |     
--------------------------------------------------------------------------------------------------------

Summary

To summarize the information seen

Performance

Method Run time
Count Full 0.30
Count Sample 0.06
Count Approx 0.14

Accuracy

Owner Count full Count sample Count sample*5 Count Approx
APEX_050000 3600 741 3705 3630
BI 8 0 0 8
HR 34 4 20 34
LBACSYS 237 49 245 236
PUBLIC 37125 7455 37275 37338
SYS 42110 8508 42540 42118
XFILES 141 30 150 140


Conclusion

The new function runs faster than the full scan and it is more accurate than using the sample feature. With the 20% sample, the BI records were not even seen. If you multiple the 20% sample by 5 you can get an estimate of the full table values with this method, but the values are not as close as the APPROX_COUNT_DISTINCT.

If you need to get a rough estimate of counts in a table and you are using a version >= 12.1.0.2 then you should consider this feature. If exact numbers are still needed then you will need to perform a full scan.