Performance on large datasets: Clementine® Server
In the past, Clementine’s approach worked best with sampled data. With the release of
Clementine Server, Clementine’s interactive data mining approach can be used on much
larger datasets. This is because Clementine Server scales the entire data mining process.
For example, visualization techniques are scaled for data understanding. Data preparation
steps such as field and record operations also see significant gains, as do modeling processes
that include pre-processing steps. Finally, model evaluation and deployment can be performed
more efficiently.
With Clementine Server, however, the stream processing is pushed back onto the DBMS via
SQL queries. Any operation that cannot be represented as SQL queries is performed in a
more powerful application server tier. Only relevant results are passed back to the client tier.
This approach takes advantage of optimized operations in a DBMS and increased processing
power found at the application server tier to deliver predictable, scalable performance
against large datasets.
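Clementine's SQL generation is internal to the product, but the principle can be sketched with a small example. Instead of pulling every row to the client and aggregating in application code, the aggregation is expressed as SQL and executed in-database, so only the summarized result crosses the wire. The schema and figures below are invented for illustration.

```python
import sqlite3

# In-memory database standing in for the DBMS tier (illustrative schema).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(cid, float(amount)) for cid in range(1, 101) for amount in (10, 20, 30)],
)

# Client-side approach: pull every row, then aggregate in application code.
rows = con.execute("SELECT customer_id, amount FROM transactions").fetchall()
totals = {}
for cid, amount in rows:
    totals[cid] = totals.get(cid, 0.0) + amount

# Pushed-back approach: the same aggregation expressed as SQL runs
# in-database, and only one summary row per customer is returned.
pushed = dict(
    con.execute(
        "SELECT customer_id, SUM(amount) FROM transactions GROUP BY customer_id"
    ).fetchall()
)

print(len(rows))    # 300 rows crossed the wire the first way...
print(len(pushed))  # ...but only 100 summary rows the second way.
print(totals == pushed)  # True: same result, far less data movement
```

Both approaches produce identical totals; the difference is where the work happens and how much data has to move between tiers.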
Clementine gives you feedback when in-database mining is activated. During execution, nodes turn purple if the operations represented by the node are executed in-database. At the point of the last purple node in the stream, the remaining data is extracted and processed at the application server tier. Since in-database mining is almost always faster than application server processing, the more nodes that are pushed back to the database, the better. Clementine maximizes in-database mining by using rules of thumb to order operations automatically. You don't have to worry about the mechanics of stream building, because operation reordering is automatic; instead, you can focus on the business problem at hand. Operations will not be reordered if the reordering would change the results of the stream.

[Figure: a Clementine stream with many nodes shown in purple during execution rather than the usual blue. Purple nodes mean that the operations represented by the node are being translated into SQL and executed in-database.]
Visualization:
■ Point or line plots
■ Distribution bar graphs
■ Web association graphs
disk space is required depending on the amount of data that is processed. Data volume is
proportional to both the number of columns and the number of rows in your dataset.
Also, more space is needed if you do not push data processing back into the database. When
processing cannot be done in the database because the operations cannot be expressed
as SQL queries or you are mining flat files, operations are done in the application server tier.
In these instances, use of the Aggregate, Distinct, Merge, Sort, or Table node, or of any modeling node,
will create temporary disk copies of some or all of the data, requiring additional disk space.
A good rule of thumb for allocating additional disk space for data is to measure the size of
the largest table to be mined as a flat file and multiply by at least three.
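That rule of thumb is simple enough to compute directly. The 2GB table size below is an invented example figure:

```python
# Rule of thumb: allow at least 3x the flat-file size of the largest
# table to be mined. The table size here is an invented example.
largest_table_gb = 2.0
extra_disk_gb = largest_table_gb * 3
print(extra_disk_gb)  # 6.0 GB of additional disk space
```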
Our intention with this white paper is to provide some test results of common operations
conducted with Clementine Server. These operations were selected because they are typical
of operations used in the different stages of the data mining process. The results of these
tests should provide you with an understanding of Clementine Server's performance on large datasets. The
operations included are:
Data processing
This involves accessing two data sources, a customer data table and a transaction table.
The transactions are aggregated to a customer key and then merged with the customer data.
Two fields are derived.
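The shape of that data-processing benchmark can be sketched in ordinary code: aggregate transactions to the customer key, merge with the customer table, then derive new fields. The records, field names, and the two derivations below are assumptions for illustration; the benchmark's actual fields are not specified in this paper.

```python
# Toy stand-ins for the two data sources (invented records).
customers = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 2, "region": "south"},
]
transactions = [
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 1, "amount": 60.0},
    {"customer_id": 2, "amount": 25.0},
]

# Step 1: aggregate transactions to the customer key.
agg = {}
for t in transactions:
    rec = agg.setdefault(t["customer_id"], {"total_amount": 0.0, "n_trans": 0})
    rec["total_amount"] += t["amount"]
    rec["n_trans"] += 1

# Step 2: merge the aggregates with the customer data.
merged = [{**c, **agg.get(c["customer_id"], {"total_amount": 0.0, "n_trans": 0})}
          for c in customers]

# Step 3: derive two new fields (illustrative derivations).
for rec in merged:
    rec["avg_amount"] = rec["total_amount"] / max(rec["n_trans"], 1)
    rec["big_spender"] = rec["total_amount"] > 50.0

print(merged[0]["avg_amount"], merged[0]["big_spender"])  # 50.0 True
```

In the benchmark itself, the aggregate and merge steps are exactly the kind of operations that can be translated into SQL and pushed back to the database.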
Modeling
A new field is derived and then a C&RT decision tree model is used. C&RT is used because
performance for this algorithm is a good indicator of overall model-building performance.
Neural networks, for example, tend to take longer to train and GRI tends to take less time.
The time taken to build models always depends on the data and the parameter settings of the
model. Default settings in Clementine attempt to build a more accurate model, so if speed
is more important, you may need to change the parameters.
Scoring
Unlike model building, which almost never requires using all the available cases in order to
receive a good result, scoring often requires that the whole population be scored. For example,
a response rate for a mailing may be one percent; building a model on who will respond requires
that the data is “balanced” — that the number of responders roughly equals the number of non-
responders. This means the data used for training the model could be about two percent of the total
population. Scoring, on the other hand, often needs to be done for the whole customer base
(or at least a whole potential mailing population). These benchmarks show real-time scoring
of a few cases and batch scoring (scoring a large batch of cases).
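The balancing step described above (keeping every responder and downsampling the non-responders until the two groups are roughly equal) can be sketched as follows, using an invented population with the one percent response rate from the example:

```python
import random

random.seed(0)

# Invented population: a 1% response rate over 10,000 customers.
population = [{"id": i, "responded": i < 100} for i in range(10_000)]

responders = [c for c in population if c["responded"]]
non_responders = [c for c in population if not c["responded"]]

# Keep every responder and a same-sized random sample of non-responders.
balanced = responders + random.sample(non_responders, len(responders))

print(len(balanced))  # 200 training cases, i.e. ~2% of the population
```

This is how a one percent response rate yields a training set of about two percent of the total population, while scoring still has to touch all 10,000 records.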
The benchmark testing presented in this paper was performed with the following client and
server specifications:
Client:
■ Windows 2000
■ Dell Latitude CPt C400GT
■ Intel Celeron 400
■ 130MB RAM
■ 6GB disk
■ 10Mb Ethernet

Server:
■ Windows NT Server
■ Dell PowerEdge 6300
■ 4 x 500MHz CPUs
■ 1GB RAM
■ 36GB disk
■ SQL Server
The dataset used in most of the testing has millions of records – from one million to 13 million.
One of the datasets in the data preparation benchmark has 16 fields — eight symbolic and
eight numeric — and the other dataset has eight fields — four symbolic and four numeric.
The model building dataset has nine fields, five of which are symbolic, and the table written
for scoring has eight fields with an equal number of symbolic and numeric variables.
All the figures shown (except real-time scoring) are from a testing environment in which
a database is used and SQL optimization is enabled using the appropriate check box in the
Clementine interface. Using a database but disabling SQL optimization means that the
processed data is still pulled from a database; this is different from reading flat files located in
the application server tier. You will get better performance using a database even without SQL
optimization enabled, because Clementine must read all of the data from a flat file but only the
relevant columns from a database. Using a database with Clementine is always strongly
encouraged for the best performance, and it is even more important if you have a large number
of fields. Using flat files in the server tier is still faster than using them in the desktop tier, however.
Caching, which creates an optimized copy of the data, can help with flat file performance in the
server tier. Caching incurs a performance hit the first time a stream is run (tests show that it
takes about twice as long to read the data). However, benchmarks have shown subsequent runs
of the stream to be as much as eight times faster than without caching.
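That trade-off behaves like any materialized cache: pay once to write an optimized local copy, then read the copy on later runs. A minimal sketch of the pattern, with Python's pickle format standing in for Clementine's cache file:

```python
import os
import pickle
import tempfile

def load_with_cache(source_rows, cache_path):
    """Read from the optimized cache if present; otherwise build it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)          # fast path on later runs
    rows = list(source_rows)               # slow path: read the source...
    with open(cache_path, "wb") as f:
        pickle.dump(rows, f)               # ...and write the cache copy
    return rows

cache = os.path.join(tempfile.mkdtemp(), "stream.cache")
first = load_with_cache(range(5), cache)   # first run builds the cache
second = load_with_cache(range(5), cache)  # later runs read the cache
print(first == second)  # True
```

The first call pays for both the read and the write, which is why the initial run of a cached stream is slower; every later call skips the source entirely.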
Benchmark testing results: data processing

[Figure: chart of data processing benchmark results.]
Benchmark testing results: modeling

[Figure: chart of model-building times for datasets of one million to 13 million records.]

The average increase in time required to process one million records increases slightly as
millions of records are added.
Benchmark testing results: scoring

[Figure: chart of real-time scoring results. The scoring was run 1,000 times with five
concurrent users; results are shown for concurrency levels x2 through x5.]
Conclusion
The ever-growing amount of data created by organizations presents opportunities and
challenges for data mining. Growing data warehouses that integrate all information about
customer interactions present new possibilities for delivering personalization, resulting in
increased profits and better service delivery. The challenge lies in making this vision a reality.
Scaling the entire data mining process with Clementine Server makes mining large datasets
more efficient, shortening the time needed to turn data into better customer relationships.
About SPSS
SPSS helps people solve business problems using statistics and data mining. This predictive
technology enables our customers in the commercial, higher education and public sectors to
make better decisions and improve results. SPSS software and services are used successfully
in a wide range of applications, including customer attraction and retention, cross-selling,
survey research, fraud detection, enrollment management, Web site performance, forecasting
and scientific research. SPSS' market-leading products and product lines include SPSS®,
Clementine®, AnswerTree®, DecisionTime®, SigmaPlot® and LexiQuest™. For more information,
visit our Web site at www.spss.com.
SPSS is a registered trademark and the other SPSS products named are trademarks of SPSS Inc.
All other names are trademarks of their respective owners. Printed in the U.S.A. © Copyright 2002 SPSS Inc.