Monday, February 24, 2014

AIWorld6 - The Joy of Small Data - Follow up

I got some great feedback from people; props go to rolisz for suggesting a Stack Exchange conversation that led me to kernel density estimation.

You can see from the plot of the evaluated KDE, and the plot of its derivative, that it'll be easy to pick out the species. A naive parser that treats local maxima as species and local minima as the boundaries between species did quite well on this data. I would have guessed [14900, 15050, 15300, 15500, 15650, 15900, 16100], and the KDE with this naive parsing went with [14732, 14861, 15054, 15674, 15725, 16087]. With the exception of 15674 and 15725 being too close to each other, I think it was a total win. I'll be implementing this algorithm shortly, and perhaps even showing the histogram to let a viewer second-guess the automatic stats gathering if they feel the need.
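A minimal sketch of that naive parsing, assuming scipy is available (the function name and grid size are my own choices, not from the post):

```python
import numpy as np
from scipy.stats import gaussian_kde

def find_species_peaks(brain_sums, grid_size=512):
    """Fit a KDE over the brain-sum samples and treat local maxima
    of the estimated density as species centers."""
    kde = gaussian_kde(brain_sums)
    grid = np.linspace(min(brain_sums), max(brain_sums), grid_size)
    density = kde(grid)
    # A grid point is a local maximum if it's higher than both neighbors.
    return [grid[i] for i in range(1, grid_size - 1)
            if density[i] > density[i - 1] and density[i] > density[i + 1]]
```

Local minima of the same density (or zero crossings of its derivative) would give the boundaries between species.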

Sunday, February 23, 2014

AIWorld6 - The joy of small data

I'm happy because I've failed at clustering.

I do programming in my free time and I'm working on a system that evolves artificial life (there's an example video). This might seem like cool-ass AI programming, but it's mostly the regular drudgery of C arrays and test-case writing. After all, I'm building the world, not making them smart. It's up to them to evolve smartness.

As a result, I was super psyched today when I realized I'd failed at my naive approach to detecting species. I knew they'd speciated because in every world I start from scratch, there's a clear moment when almost everyone dies, and after that I see a lot of agents with low energy and a small number with 100x as much. That die-off is the start of predation. Those high-energy agents are predators. And from then on, those two different types of agents are probably evolving differently. So how can I prove this? I want to gather stats on the two different kinds.

Of course, I could just check how often agents attack each other and label the ones that do it the most as predators, but that's differentiating on the same thing I want to measure, which is pointless. It also doesn't scale: if there are other species doing other behaviours, I want to find them too.

The first clustering algorithm I tried was simply taking a sum of all the connections in their brain and assigning a color to that number. The idea is that if agents have similar brains, they should get similar colors. So as the brain construction of two species diverges, they turn different colors. When I see a bunch of them around a given color, I know I have a species. This works OK, but it takes a lot of tuning to keep the color spectrum from being so wide that it's washed out, or so narrow that everyone appears to be the same color. Worst of all, it still relies on my visual ability to discern where one color band ends and the next begins.
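The mapping is presumably something like a clamped linear scale from brain sum to a color channel; a sketch under that assumption (the function and parameter names are hypothetical):

```python
def sum_to_color(brain_sum, lo, hi):
    """Map a brain-connection sum onto a 0-255 color channel.
    lo/hi are the tuning knobs the post complains about: too wide a
    range washes everything out, too narrow and all agents look alike."""
    t = (brain_sum - lo) / (hi - lo)
    t = max(0.0, min(1.0, t))  # clamp outliers to the band edges
    return int(t * 255)
```

The clamping is exactly where the tuning pain lives: outliers either saturate at the edges or force the whole band wider.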

Then I tried what any thinking-inside-the-box, data-too-big programmer would do: I wrote a simple clustering algorithm that runs in linear time. Essentially, if there's a gap of more than X spaces where I see fewer than Y agents, it's probably a gap between species. Then I just call all the parts that aren't gaps different species. It's a shit algorithm. It takes a ton of tuning and I'm still missing the really uncommon species.
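That linear scan might look something like this (a sketch over a histogram of brain sums; names and the end-exclusive ranges are my own):

```python
def gap_cluster(hist, gap_width, min_count):
    """Linear scan over histogram bins: a run of at least `gap_width`
    bins each holding fewer than `min_count` agents is a gap; every
    stretch between gaps is called a species. Returns end-exclusive
    (start, end) bin ranges."""
    clusters, start, gap = [], None, 0
    for i, count in enumerate(hist):
        if count < min_count:
            gap += 1
            if gap >= gap_width and start is not None:
                clusters.append((start, i - gap + 1))
                start = None
        else:
            if start is None:
                start = i
            gap = 0
    if start is not None:
        clusters.append((start, len(hist)))
    return clusters
```

The failure mode the post describes falls straight out of `min_count`: a rare species whose bins never reach the threshold just never opens a cluster.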

At that point, I figured it was time to call in the big guns of a real algorithm and use a real comp-sci library for it. I looked up clustering algorithms, chose one arbitrarily, and spent an hour prototyping k-means clustering using scipy. The results are shit.

For the histogram of brain sums, even if I give k-means my personal guesses ([14900,15050,15300,15500,15650,15900,16100]) it still gives me results that I'd describe as not-helpful ([14828,15072,15451,15575,15694,15825,16056]). And it doesn't even automatically give me the min-max of those means.
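The prototype was probably along these lines; a sketch with synthetic stand-in data, since the post only gives the centroid guesses (scipy's `kmeans` accepts an array of initial centroids in place of k):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Synthetic stand-in for the real per-agent brain sums.
brain_sums = np.random.default_rng(1).normal(15500, 400, 1000).reshape(-1, 1)

# The post's hand-picked guesses, fed in as initial centroids.
guesses = np.array([14900., 15050., 15300., 15500., 15650., 15900., 16100.])
centroids, distortion = kmeans(brain_sums, guesses.reshape(-1, 1))

# vq assigns each agent to its nearest centroid -- but k-means reports
# only the means, not the min/max extent of each cluster, which is
# exactly the post's complaint.
labels, _ = vq(brain_sums, centroids)
```

Note k-means also tends to pull centroids toward dense regions and split big clusters rather than respect small ones, which matches the unhelpful means the post got back.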

Which brings me to what makes me so happy. I'm reminded of all those comp-sci classes where they talked about picking the right algorithm for the problem. I have a small-data problem: there are never more than about 40k agents in the world, so I can pick almost anything. I can think about why my problem is different from generic clustering: outliers matter, because they're under the same selection pressure as everyone else and they're still alive. Also, small clusters are just as important as big ones, since they must still be a species. I'm hoping to find an algorithm that works for me. And if I don't, I'll invent one and code it myself.

For someone who spends most of their time crunching big data and having to use simple algorithms to do it, I'm gonna love picking just the right one for my small data.

Tuesday, February 4, 2014

AIWorld6: Real Science

It's time to set up the first experiment. The system runs. The lifeforms do crazy things. Now the question is, what to start with first? Some options:

  • Is life more successful with a concept of age?
  • Is life more resilient when the population is genetically diverse?
  • Is life more resilient when the terrain is diverse?
  • Do more species occur when the terrain is diverse?

So far, the interesting things I've observed are die-offs from very aggressive predators, specialization into predators and prey, and specialization based on terrain. The speciation is the most interesting of these. I think the next step will be seeing what causes a diverse set of species, and then seeing whether that diversity prevents die-offs.