DeVisa - Data Mining Models Management System

Abstract

DeVisa is a framework for unifying the expression of different prediction models using Web technologies. The prediction models are stored in a PMML repository that provides the following functions:

  • Applying scoring algorithms on new instances via XQuery queries and/or Web Services
  • Uploading/downloading models by data mining applications
  • Applying XML-specific queries (XQuery, etc.)
  • Composing the stored models via model selection or sequencing
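
As a rough illustration of the first function, scoring a new instance against a stored model, the sketch below walks a small PMML-style decision tree. It is written in Python rather than XQuery and is not DeVisa's implementation. The model fragment is a hand-written toy that omits the header, data dictionary and mining schema a complete PMML document carries, and the names `PMML_DOC`, `score_instance`, `matches` and `score` are assumptions made for this example. Only the `True` predicate and numeric `SimplePredicate` comparisons are handled.

```python
import operator
import xml.etree.ElementTree as ET

# Toy PMML-style fragment (illustrative only, not a complete PMML document).
PMML_DOC = """
<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2">
  <TreeModel modelName="toy_tree" functionName="classification">
    <Node score="no">
      <True/>
      <Node score="yes">
        <SimplePredicate field="age" operator="greaterThan" value="30"/>
      </Node>
      <Node score="no">
        <SimplePredicate field="age" operator="lessOrEqual" value="30"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

# Comparison operators of SimplePredicate that this sketch supports.
OPS = {
    "equal": operator.eq, "notEqual": operator.ne,
    "lessThan": operator.lt, "lessOrEqual": operator.le,
    "greaterThan": operator.gt, "greaterOrEqual": operator.ge,
}


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def matches(node, instance):
    """Evaluate the predicate attached to a tree node for one instance."""
    for child in node:
        name = local(child.tag)
        if name == "True":
            return True
        if name == "SimplePredicate":
            compare = OPS[child.get("operator")]
            return compare(float(instance[child.get("field")]),
                           float(child.get("value")))
    return False


def score(node, instance):
    """Descend into the first matching child; a leaf returns its score."""
    for child in node:
        if local(child.tag) == "Node" and matches(child, instance):
            return score(child, instance)
    return node.get("score")


def score_instance(pmml_text, instance):
    """Parse the document, locate the tree model and score one instance."""
    root = ET.fromstring(pmml_text.strip())
    tree = next(el for el in root.iter() if local(el.tag) == "TreeModel")
    top = next(el for el in tree if local(el.tag) == "Node")
    return score(top, instance) if matches(top, instance) else None
```

Calling `score_instance(PMML_DOC, {"age": 42})` follows the `greaterThan` branch of the toy tree. In a repository setting the same traversal would run next to the stored models instead of in the client.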

Motivation

Data mining is often characterized as either predictive or descriptive. It is predictive in that models built from historical data can predict outcomes. It is descriptive in that the model itself can be inspected to understand the knowledge or patterns found in the data. Some models serve both purposes. For example, a decision tree not only predicts outcomes but also provides human-interpretable rules that explain why a prediction was made. Clustering models not only assign a record to a cluster but also describe each cluster, either through a representative point called a centroid or through a rule that explains why a record belongs to the cluster.

The algorithms for building data mining models are computationally expensive, both because they analyze large volumes of data and because the algorithms themselves are complex. It is therefore practical to save the models so that they can be processed or queried later. Furthermore, the true value of data mining lies not in a set of complex algorithms but in the practical questions it can help answer. DeVisa focuses on maintaining and exploiting a repository of predictive models. The knowledge is thus treated as data, from which new knowledge can be derived.

The use of open standards provides wide access to the classification models. Users can search for a useful model, test it online, compare it with other models, and combine it with them (using techniques such as bagging or boosting). The models can be refined and enhanced while in use. Furthermore, domain experts can test the models on new data and give direct feedback to the mining experts who developed them. The use of Web services and of PMML, an open standard format designed specifically for expressing data mining models, improves the interoperability and scalability of the system.

DeVisa Features

DeVisa is built on top of the native XML database system eXist. The model repository thus takes advantage of the database management facilities that eXist provides.

  • DeVisa does not store the data the models were built on, but only the PMML models themselves;
  • The DeVisa approach leverages native XML storage and processing capabilities such as indexing (structural, full-text, and range), query optimization, and interoperation with the XML-based family of languages and technologies;
  • DeVisa defines an XML-based query language, PMQL, used for interaction with the data mining (DM) consumers. A PMQL query is wrapped in a SOAP message, interpreted within DeVisa, and executed against the PMML repository;
  • It provides an XML-based language for expressing the Metadata Catalog;
  • DeVisa deals with schema integration aspects in the scoring process and provides a 1:1 schema matching technique and an adaptive similarity measure;
  • It uses a functional dependency approach to verify whether a consumer’s schema can be derived from the existing schemas in DeVisa;
  • DeVisa allows online composition of prediction models either during the scoring process or explicitly;
  • Interoperability with other applications (e.g., consumers) is achieved exclusively through Web services;
  • DeVisa integrates a native XQuery library for processing PMML documents.
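
Two of the features above, 1:1 schema matching and the functional dependency check, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeVisa's algorithm. The source does not specify the adaptive similarity measure, so plain string similarity from the standard `difflib` module stands in for it, and a greedy pairing stands in for the matching technique. The derivability check uses the textbook attribute-closure computation. All function names, field names and the `threshold` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Name similarity in [0, 1]; a stand-in for an adaptive measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_one_to_one(consumer_fields, model_fields, threshold=0.6):
    """Greedy 1:1 matching: best-scoring pairs first, each field used once."""
    pairs = sorted(
        ((similarity(c, m), c, m)
         for c in consumer_fields for m in model_fields),
        reverse=True,
    )
    mapping, used = {}, set()
    for value, c, m in pairs:
        if value >= threshold and c not in mapping and m not in used:
            mapping[c] = m
            used.add(m)
    return mapping


def closure(attributes, dependencies):
    """Attribute closure under functional dependencies (lhs_set, rhs_set)."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in dependencies:
            # Fire a dependency when its left side is already covered.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def derivable(target, source, dependencies):
    """True if every target attribute follows from the source attributes."""
    return set(target) <= closure(source, dependencies)


# Hypothetical field names and dependencies, for illustration only.
mapping = match_one_to_one(["Age", "income_usd"], ["age", "income", "zip"])
deps = [({"zip"}, {"city"}), ({"city", "age"}, {"segment"})]
can_derive = derivable({"segment"}, {"zip", "age"}, deps)
```

Here `mapping` pairs each consumer field with at most one model field, and `can_derive` reports whether the `segment` attribute follows from `zip` and `age` under the listed dependencies. A greedy pairing is simple but not guaranteed to be globally optimal; an assignment algorithm would be the natural replacement if optimality matters.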

References

  1. http://www.exist-db.org
  2. http://www.dmg.org
  3. http://www.cs.waikato.ac.nz/ml/weka
  4. http://www.dmg.org/pmml-v3-2.html