Integrating Association Rule Mining with Relational Database Systems:
Alternatives and Implications
Sunita Sarawagi (IBM Almaden Research Center)
Shiby Thomas (University of Florida, Gainesville)
Rakesh Agrawal (IBM Almaden Research Center)
Data mining on large data warehouses is becoming increasingly
important. In support of this trend, we consider a spectrum of
architectural alternatives for coupling mining with database systems.
These alternatives include: loose-coupling through a SQL cursor
interface; encapsulation of a mining algorithm in a stored procedure;
caching the data to a file system on-the-fly and mining;
tight-coupling using primarily user-defined functions; and SQL
implementations for processing in the DBMS. We comprehensively study
the option of expressing the mining algorithm in the form of SQL
queries, using association rule mining as a case in point. We consider
four options in SQL-92 and six options in SQL enhanced with
object-relational extensions (SQL-OR). Our evaluation of the different
architectural alternatives shows that from a performance perspective,
the Cache-Mine option is superior, although the performance of the
SQL-OR option is within a factor of two. Both the Cache-Mine and the
SQL-OR approaches incur a higher storage penalty than the
loose-coupling approach, whose performance is a factor of 3 to 4
worse than that of Cache-Mine. The SQL-92 implementations were too slow to
qualify as a competitive option. We also compare these alternatives on
the basis of qualitative factors such as automatic parallelization,
ease of development, portability, and inter-operability.
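
To make the idea of expressing the mining algorithm as SQL queries concrete, the following is a minimal sketch of one support-counting step written with plain joins and GROUP BY. It is an illustration under stated assumptions, not the formulation evaluated in the paper: the table names T and F1, the function name frequent_pairs, and the use of an in-memory SQLite database are all hypothetical choices made for this example.

```python
import sqlite3


def frequent_pairs(transactions, min_support):
    """Count item pairs that occur in at least min_support transactions.

    Hypothetical helper: the schema T(tid, item) and the table F1 are
    assumptions made for this sketch, not names taken from the paper.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (tid INTEGER, item TEXT)")
    # One row per (transaction, item); set() drops duplicate items.
    rows = [
        (tid, item)
        for tid, items in enumerate(transactions)
        for item in set(items)
    ]
    conn.executemany("INSERT INTO T VALUES (?, ?)", rows)

    # Pass 1: frequent single items.
    conn.execute("CREATE TABLE F1 (item TEXT)")
    conn.execute(
        "INSERT INTO F1 SELECT item FROM T GROUP BY item HAVING COUNT(*) >= ?",
        (min_support,),
    )

    # Pass 2: self-join T on tid to enumerate pairs within a transaction,
    # keep only pairs whose items are both frequent, then count support.
    cursor = conn.execute(
        """
        SELECT t1.item, t2.item, COUNT(*) AS support
        FROM T t1
        JOIN T t2 ON t1.tid = t2.tid AND t1.item < t2.item
        JOIN F1 a ON a.item = t1.item
        JOIN F1 b ON b.item = t2.item
        GROUP BY t1.item, t2.item
        HAVING COUNT(*) >= ?
        """,
        (min_support,),
    )
    result = {(i1, i2): support for i1, i2, support in cursor}
    conn.close()
    return result


if __name__ == "__main__":
    baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
    print(frequent_pairs(baskets, min_support=2))
```

In this sketch all counting is done by the database engine, so the application code only issues queries and reads the result. Longer itemsets would need further passes of the same shape, which are omitted here.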