
Mining Association Rules with R and Postgres

In one of my earlier pieces I explored decision trees in Python, which let you train a machine learning algorithm to predict or classify data. I like this style of model because the model itself is valuable; I’m more interested in finding underlying patterns than attempting to predict the future. Decision trees are nice […]


Testing the output of tuned Postgres queries

Tuning SQL queries is a useful skill, and while many people struggle to manage complex SQL, the work is actually a series of simple tricks. For instance, refactoring a query often brings about algorithmic improvements, and if you tune enough queries, finding mathematically equivalent forms becomes muscle memory. This includes operations like inlining common table […]
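One way to check that a rewritten query is equivalent is to run both forms against the same data and compare the rows they return. The sketch below is only an illustration, not the method from the post: it uses Python's built-in sqlite3 module in place of Postgres, and the `orders` table, both queries, and the `rows_as_multiset` helper are hypothetical.

```python
import sqlite3
from collections import Counter


def rows_as_multiset(conn, sql):
    """Run a query and return its rows as a multiset, ignoring row order."""
    return Counter(conn.execute(sql).fetchall())


# In-memory database with a small, made-up table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER);
    INSERT INTO orders (customer, amount)
    VALUES ('ann', 10), ('ann', 25), ('bob', 5), ('cat', 40);
    """
)

# Original form: aggregate in a common table expression, then filter.
original_sql = """
WITH totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer, total FROM totals WHERE total > 20
"""

# Rewritten form: the CTE is inlined and the filter moves to HAVING.
tuned_sql = """
SELECT customer, SUM(amount) AS total
FROM orders
GROUP BY customer
HAVING SUM(amount) > 20
"""

queries_match = rows_as_multiset(conn, original_sql) == rows_as_multiset(conn, tuned_sql)
```

Comparing multisets rather than lists means row order is ignored while duplicate rows still count. A match on one data set is evidence of equivalence, not proof.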


Extracting Dates and Times from Text with Stanford NLP and Scala

Stanford NLP is a library for text manipulation that can parse and tokenize natural language text. Typically, applications that operate on text first split it into words, then annotate the words with their parts of speech, using a combination of heuristics and statistical rules. Other operations on the text build upon these results with […]

Exception in thread "main" java.lang.RuntimeException: Error parsing file: edu/stanford/nlp/models/sutime/defs.sutime.txt

The following error:

    Exception in thread "main" java.lang.RuntimeException: Error parsing file: edu/stanford/nlp/models/sutime/defs.sutime.txt

is caused by putting the Stanford NLP jars on the classpath without the rest of the project. To fix it, download the “full” distribution and add that folder to the classpath.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)

If you do the following:

    [str(x) for x in data]

you’ll sometimes get this error:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)

This indicates that you have non-ASCII characters in the data and should use a wider type. You can fix it by doing the following (alternately, you […]
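This error comes from Python 2, where calling str() on a unicode value implicitly encodes it with the ASCII codec. The excerpt is cut off before the fix, so the following is only a rough Python 3 sketch of the same idea, not necessarily what the post recommends: it reproduces the failure by encoding explicitly as ASCII, then avoids it by encoding as UTF-8. The sample values in `data` are made up.

```python
# Hypothetical sample data; the second value contains a non-ASCII character.
data = ["plain", "Bogot\xe1"]

# Forcing the ASCII codec fails on the first non-ASCII character it meets.
error_message = None
try:
    ascii_encoded = [x.encode("ascii") for x in data]
except UnicodeEncodeError as exc:
    error_message = str(exc)

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_encoded = [x.encode("utf-8") for x in data]

# Decoding with the same codec recovers the original text.
round_tripped = [b.decode("utf-8") for b in utf8_encoded]
```

In Python 3, str is already a Unicode type, so the original list comprehension no longer fails on its own. The error only appears when text is explicitly encoded with a codec that is too narrow.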

“ValueError: Mix of label input types (string and number)” when using LabelBinarizer

If you do this:

    from sklearn import preprocessing
    lb = preprocessing.LabelBinarizer()
    lb.fit(["a", 2])

you will get the following error:

    ValueError: Mix of label input types (string and number)

When you mix numbers and strings, it’s unclear whether you are mixing different types of classes or mixing continuous and non-continuous data. If the latter- […]
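If the mixed values really are all class labels (the first of the two cases), one possible workaround is to convert every label to a string before fitting. The sketch below uses only the standard library and does not call scikit-learn. The helper names `has_mixed_label_types` and `labels_as_strings` are hypothetical, and the check only approximates the condition the error describes.

```python
def has_mixed_label_types(labels):
    """Return True when some labels are strings and others are not."""
    return len({isinstance(label, str) for label in labels}) > 1


def labels_as_strings(labels):
    """Treat every label as a categorical class by converting it to str."""
    return [str(label) for label in labels]


raw_labels = ["a", 2]                      # the input from the example above
mixed_before = has_mixed_label_types(raw_labels)

clean_labels = labels_as_strings(raw_labels)
mixed_after = has_mixed_label_types(clean_labels)
```

After the conversion, `clean_labels` holds a single type and could be passed to `lb.fit(...)`. This is only appropriate when the number 2 is meant as a class name rather than a quantity.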

Decision Tree Testing Lessons

I’m running some tests on sklearn decision trees, and the lessons learned so far may be interesting. I’ve put my measurement code at the end; I’m tracking percent correct, the number of positive and negative tests, and the false positives and false negatives. When running predictions, if you have a defect where you include the ‘answer’ […]
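The measurements listed above can be computed from paired lists of actual and predicted labels. The following is a minimal sketch and not the measurement code from the post. It assumes binary labels encoded as 0 and 1, and the function names and sample data are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1      # predicted positive, actually positive
        elif p and not a:
            counts["fp"] += 1      # predicted positive, actually negative
        elif not p and a:
            counts["fn"] += 1      # predicted negative, actually positive
        else:
            counts["tn"] += 1      # predicted negative, actually negative
    return counts


def percent_correct(counts):
    """Share of predictions that matched the actual label, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (counts["tp"] + counts["tn"]) / total


# Made-up labels for illustration.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
accuracy = percent_correct(counts)
```

Keeping the four counts separate, rather than reporting only percent correct, makes it easier to see whether a model's errors lean toward false positives or false negatives.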

Inspecting Postgres column types with SqlAlchemy

SQLAlchemy makes it easy to get types out of the database:

    from sqlalchemy import *

    engine = create_engine(
        "postgresql+pg8000://postgres:postgres@localhost/postgres",
        isolation_level="READ UNCOMMITTED"
    )
    c = engine.connect()

    meta = MetaData()

    t = Table('table', meta, autoload=True, autoload_with=engine, schema='test')

    columns = [col for col in t.columns]

And then from there, you can filter the column […]