Building a Terabyte-scale Math Platform

Cliff Click, 0xdata

Click represents 0xdata, which is building a system that handles R-style analysis at large scale and speed, aimed at companies doing advertising or credit card fraud detection, where transaction volume is large and money is lost while waiting for models to rebuild. Typically this data comes from a variety of sources and file formats and requires substantial cleaning before it is usable. Post-cleaning, it is usually loaded into a dimensional model, and models are built using linear models, k-means, or random forests, which Click said represent 80% of the models they see.

These types of models are used because they produce fast linear scoring equations: a credit card swipe, for instance, gets 2-10 ms to determine whether it is suspected fraud. The challenge is building the models in a time-efficient manner. A common response is to downsample the data, which costs accuracy; Click noted that a 10x increase in data leads to a 75-85% increase in model accuracy, a gain that can sometimes be repeated more than once.
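To make the scoring-speed point concrete, here is a minimal sketch of the kind of linear scoring such a model reduces to at swipe time. The class name, feature layout, weights, and the 0.5 cutoff are all illustrative assumptions, not anything from the talk:

// Illustrative sketch only: scoring one transaction with a precomputed
// logistic/linear model is a single dot product plus a threshold.
public class FraudScorer {
    private final double[] weights;   // learned offline by the model build
    private final double intercept;

    public FraudScorer(double[] weights, double intercept) {
        this.weights = weights;
        this.intercept = intercept;
    }

    // One dot product and a sigmoid: cheap enough for a 2-10 ms budget.
    public double score(double[] features) {
        double z = intercept;
        for (int i = 0; i < weights.length; i++)
            z += weights[i] * features[i];
        return 1.0 / (1.0 + Math.exp(-z));    // probability of fraud
    }

    public static void main(String[] args) {
        FraudScorer scorer = new FraudScorer(new double[] {0.8, -1.2, 2.5}, -0.3);
        double p = scorer.score(new double[] {1.0, 0.5, 0.2});
        System.out.println(p > 0.5 ? "flag for review" : "approve");  // arbitrary cutoff
    }
}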

Their database (H2O) can be started easily from just a jar:

java -jar h2o.jar

Servers find each other using multicast, and the system is designed to maximize CPU and memory use to reduce query completion time via a distributed Fork/Join. It is also designed to minimize GC pause overhead, presumably drawing on Click's past experience at Azul; the only GC parameter they set is -Xmx. Notably, they apparently make heavy use of UDP for communication.
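For readers unfamiliar with Fork/Join, the sketch below shows the divide-and-conquer pattern using the JDK's standard java.util.concurrent framework on a single node. It is not H2O's code; H2O's distributed version spreads the same split-compute-combine idea across machines:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative single-node Fork/Join: recursively split an array,
// sum the halves in parallel, then combine the partial results.
public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000;  // arbitrary split cutoff
    private final long[] data;
    private final int lo, hi;

    public SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {               // small enough: sum directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                              // run left half asynchronously
        return right.compute() + left.join();     // compute right, then combine
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        java.util.Arrays.fill(data, 1L);
        long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(total);                // prints 1000000
    }
}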

This talk also introduced me to fastr, which provides R semantics within Java. Awesome.

Interesting finds from this talk:

http://www.csg.is.titech.ac.jp/~chiba/javassist/

http://radare.org/doc/html/Chapter14.html