Sunday, February 23, 2014

AIWorld6 - The joy of small data

I'm happy because I've failed at clustering.

I do programming in my free time, and I'm working on a system that evolves artificial life (an example video). This might seem like cool-ass AI programming, but it's mostly the regular drudgery of C arrays and test-case writing. After all, I'm building the world, not making them smart. It's up to them to evolve smartness.

As a result, I was super psyched today when I realized I'd failed at my naive approach to detecting species. I knew they'd speciated because in every world I start from scratch, there's a clear moment when almost everyone dies, and after that I see a lot of agents with low energy and a small number with 100x as much. That die-off is the start of predation. Those high-energy agents are predators, and from then on, the two different types of agents are probably evolving differently. So how can I prove this? I want to gather stats on the two different kinds.

Of course, I could just check how often agents attack each other and label the ones that do it the most predators, but that's differentiating on the same thing I want to measure, which is pointless. It also doesn't scale; if there are other species doing other behaviours I want to find them too.

The first clustering algorithm I tried was simply taking the sum of all the connections in an agent's brain and assigning a color to that number. The idea is that if agents have similar brains, they should get similar colors, so as the brain construction of two species diverges, they turn different colors. When I see a bunch of agents around a given color, I know I have a species. This works OK, but it takes a lot of tuning to keep the color spectrum from being so wide that it's washed out or so narrow that everyone appears to be the same. Worst of all, it still relies on my visual ability to discern where one color band ends and the next begins.
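The brain-sum-to-color idea can be sketched like this (the function name and the lo/hi bounds are my own, not the simulation's actual code):

```python
import colorsys

def brain_sum_to_hue(brain_sum, lo, hi):
    """Map a brain sum linearly into [0, 1] hue space.

    lo/hi are the hand-tuned bounds: too wide and every agent lands
    on nearly the same hue, too narrow and distinct species wash out
    against each other.
    """
    if hi <= lo:
        return 0.0
    # Clamp so extreme outliers don't wrap around the hue circle.
    return min(max((brain_sum - lo) / (hi - lo), 0.0), 1.0)

# Turn the hue into an RGB color for drawing the agent.
r, g, b = colorsys.hsv_to_rgb(brain_sum_to_hue(15300, 14800, 16200), 1.0, 1.0)
```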

Then I tried what any thinking-inside-the-box, data-too-big programmer would do: I wrote a simple clustering algorithm that could run in linear time. Essentially, if there's a gap of more than X spaces where I see fewer than Y agents, it's probably a gap between species. Then I just call all the parts that aren't gaps different species. It's a shit algorithm. It takes a ton of tuning and I'm still missing the really uncommon species.
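The gap-splitting heuristic looks roughly like this (my reconstruction, with X and Y renamed to `gap_width` and `min_count`):

```python
def gap_clusters(counts, gap_width, min_count):
    """Split a histogram of brain sums at its empty stretches.

    counts: per-bin agent counts. Wherever at least gap_width
    consecutive bins each hold fewer than min_count agents, cut
    there; every stretch between cuts is called one species.
    Returns (start, end) bin ranges, end exclusive.
    """
    clusters = []
    start = None
    empty_run = 0
    for i, c in enumerate(counts):
        if c < min_count:
            empty_run += 1
            if start is not None and empty_run >= gap_width:
                clusters.append((start, i - empty_run + 1))
                start = None
        else:
            if start is None:
                start = i
            empty_run = 0
    if start is not None:
        clusters.append((start, len(counts)))
    return clusters
```

The tuning pain is visible in the two parameters: a rare species whose bins never reach `min_count` disappears into a "gap" entirely.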

At that point, I figured it was time to call in the big guns of a real algorithm and use a real comp-sci library for it. I looked up clustering algorithms, chose one arbitrarily, and spent an hour prototyping k-means clustering using scipy. The results are shit.

For the histogram of brain sums, even if I give k-means my personal guesses ([14900,15050,15300,15500,15650,15900,16100]) it still gives me results that I'd describe as not-helpful ([14828,15072,15451,15575,15694,15825,16056]). And it doesn't even automatically give me the min-max of those means.
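For reference, the SciPy experiment looks roughly like this. The brain sums below are synthetic stand-ins (the real ones come from the sim), and on cleanly separated fake data k-means behaves better than it did on the real histogram:

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Synthetic stand-in for the brain-sum data: one big common
# species and one rare one.
rng = np.random.default_rng(0)
sums = np.concatenate([
    rng.normal(14900, 30, 5000),
    rng.normal(15900, 30, 200),
]).reshape(-1, 1)  # SciPy's kmeans wants observations as rows

# Passing an array as the second argument uses it as the initial
# centroid guesses instead of a cluster count.
guesses = np.array([[14900.0], [15900.0]])
centroids, distortion = kmeans(sums, guesses)
# kmeans only hands back the means -- the min/max extent of each
# cluster still has to be computed separately.
```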

Which brings me to what makes me so happy. I'm reminded of all those comp-sci classes where they talked about picking the right algorithm for the problem. I have a small-data problem; there are never more than about 40k agents in the world, so I can pick almost anything. I can think about how my problem differs from generic clustering: outliers matter, because they're under the same selection pressure as everyone else and they're still alive. And small clusters are just as important as big ones, since each must still be a species. I'm hoping to find an algorithm that works for me. And if I don't, I'll invent one and code it myself.

As someone who spends most of his time crunching big data and having to use simple algorithms to do it, I'm gonna love picking just the right one for my small data.


  1. For clustering in 1D there are better solutions than k-means:

    1. I will totally check that out. Thanks for the pointer. :)

    2. Yea. One of the algorithms in there, KDE, did excellent.
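A sketch of how KDE can be applied to 1D clustering (my own guess at the approach, not the commenter's code): estimate a smooth density over the brain sums, then cut clusters at the density's local minima. Small, isolated species still show up as their own bumps.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_split(values, grid_points=512):
    """Cut 1D data at the valleys of a kernel density estimate."""
    kde = gaussian_kde(values)
    grid = np.linspace(values.min(), values.max(), grid_points)
    density = kde(grid)
    # A grid point is a valley if it sits below both neighbors;
    # each valley is a boundary between two clusters.
    return [grid[i] for i in range(1, grid_points - 1)
            if density[i] < density[i - 1] and density[i] < density[i + 1]]

# Synthetic example: a big population plus a small, distant one.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 1000), rng.normal(10, 1, 50)])
cuts = kde_split(data)  # a valley between the two bumps
```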