A brief introduction to Weka

Weka is a GPL-licensed data mining tool written in Java and published by the University of Waikato. It includes an extensive collection of pre-implemented machine learning algorithms, including well-known classification and clustering algorithms. If you’ve ever been curious how Bayes’ theorem works in practice, Weka is a quick way to get up and running.

Weka uses a custom data format called ARFF (Attribute-Relation File Format). In essence, an ARFF file declares a table (its name and its columns) in a header, followed by a CSV-style listing of the data. The data types are a pared-down version of what a database offers: numerics, strings, dates, and nominal attributes (i.e. the equivalent of an enumeration or pick list).
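To make the format concrete, here is a minimal Java sketch that assembles a small ARFF file as a string. The relation name, attribute names, and data rows are invented for illustration; only the @relation, @attribute, and @data keywords and the four type names come from the ARFF format itself.

```java
import java.util.List;

class ArffSketch {

    // Hypothetical sample rows, one per line, in CSV style.
    // String values containing spaces are quoted.
    static final List<String> ROWS = List.of(
        "85,sunny,\"2020-06-01\",'dry and hot'",
        "64,rainy,\"2020-06-02\",'light showers'"
    );

    // Builds an ARFF document: a header that declares the table,
    // followed by the data section.
    static String buildArff() {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation weather\n\n");
        sb.append("@attribute temperature numeric\n");
        // A nominal attribute lists its allowed values in braces.
        sb.append("@attribute outlook {sunny,overcast,rainy}\n");
        // A date attribute may carry a format string.
        sb.append("@attribute recorded date \"yyyy-MM-dd\"\n");
        sb.append("@attribute note string\n\n");
        sb.append("@data\n");
        for (String row : ROWS) {
            sb.append(row).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildArff());
    }
}
```

Saving the printed text to a file with an .arff extension should give you something Weka can open, though the sample values here are made up and only meant to show the layout.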

You can also load data from any database that has a JDBC driver by supplying a JDBC connection string, provided the driver’s jars are on the classpath. Weka ships with a file called DatabaseUtils.props, which maps database column types to the Weka types listed above.
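The idea of a type mapping can be sketched in plain Java. The example below is only an illustration of the concept: the method name and the particular mapping are assumptions, not the contents of DatabaseUtils.props or any part of Weka's API. It uses the standard java.sql.Types constants to stand in for database column types.

```java
import java.sql.Types;

class TypeMappingSketch {

    // Hypothetical mapping from a JDBC type code to one of the
    // attribute types described above. Illustration only; this is
    // not how Weka itself stores or applies its mapping.
    static String toAttributeType(int sqlType) {
        switch (sqlType) {
            case Types.INTEGER:
            case Types.BIGINT:
            case Types.FLOAT:
            case Types.DOUBLE:
            case Types.DECIMAL:
                return "numeric";
            case Types.DATE:
            case Types.TIMESTAMP:
                return "date";
            case Types.CHAR:
            case Types.VARCHAR:
                return "string";
            default:
                // Unmapped types are rejected rather than guessed at.
                throw new IllegalArgumentException(
                    "No mapping for SQL type code " + sqlType);
        }
    }

    public static void main(String[] args) {
        System.out.println(toAttributeType(Types.VARCHAR));
        System.out.println(toAttributeType(Types.DOUBLE));
    }
}
```

Rejecting unmapped types is a design choice in this sketch: a missing mapping is easier to diagnose as an explicit error than as a silently mistyped column.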

Once you have some data loaded, you can try out different algorithms (and see how difficult it is to build predictive systems!).