2006-07-08

Numenta white paper

Numenta published a white paper a while back.

It describes in a nutshell the hierarchical memory-prediction framework (they call it Hierarchical Temporal Memory) put forth in the On Intelligence book.

It also gives a lot of interesting implementation details on how learning and prediction work. The hints on using HTMs for different applications are also thought-provoking.

The most intriguing feature is the use of probability distributions for matching patterns. It seems reasonable once you consider it, but "matching patterns" always tricked me into thinking of matching exact patterns.
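To make the idea concrete for myself, here is a minimal sketch of what matching against a distribution could look like. This is only my own guess at the mechanics (the MatchDistribution name, the similarity measure and the exponential weighting are all made up), not what the white paper actually prescribes:

#include <algorithm>   // std::min
#include <cmath>       // std::exp
#include <map>
#include <vector>

typedef std::vector<int> Pattern;

// Instead of asking "which stored pattern equals the input?", ask
// "how well does the input match each stored pattern?" and return
// the whole distribution.
std::map<int, double> MatchDistribution(const Pattern& input,
                                        const std::map<Pattern, int>& dictionary)
{
    std::map<int, double> belief;   // pattern code -> probability
    double total = 0.0;

    for (std::map<Pattern, int>::const_iterator it = dictionary.begin();
         it != dictionary.end(); ++it)
    {
        // crude similarity: count the positions where input and stored pattern agree
        const Pattern& stored = it->first;
        size_t n = std::min(stored.size(), input.size());
        int matches = 0;
        for (size_t i = 0; i < n; ++i)
            if (stored[i] == input[i])
                ++matches;

        double score = std::exp((double)matches);   // sharpen the differences
        belief[it->second] = score;
        total += score;
    }

    // normalize the scores so they form a probability distribution
    for (std::map<int, double>::iterator b = belief.begin(); b != belief.end(); ++b)
        b->second /= total;

    return belief;
}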

I will start working on this probability distribution approach as soon as possible. So stay tuned.

2005-12-13

Pattern recognition

The question is: how do you recognize patterns?

Let's approach it systematically.

What is a pattern? If you look at a string [AAAAAAA...] it is clear that there is a simple pattern here: A is always followed by an A. Expressing this as a pattern however is a little bit problematic. What pattern should we choose? [AA]? Or [AAA]? Or [A]?

Then another pattern: [ABABABAB...]. This is simpler. The pattern is [AB]. Translate [AB] to [X] and you have the output [XXXX...], which is the same as the problematic pattern above.

Then you can go to [ABCABCABCABC...]. With the pattern being [ABC].

Then what about [ABABXYABXYABXYXYABABXYXY], with two patterns ([AB] and [XY]) repeating seemingly at random? Here we should learn the [AB] and [XY] patterns, translate them to [T] and [U], output [TTUTUTUUTTUU...], and let the next level find the higher-order patterns.
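A toy sketch of this translation step (the hard-coded pattern table stands in for whatever a learning algorithm would actually have discovered):

#include <iostream>
#include <map>
#include <string>

// Replace already-learned patterns with single symbols and pass everything
// else through untouched, so the next level can look for higher-order
// patterns in the translated stream.
std::string TranslatePatterns(const std::string& input,
                              const std::map<std::string, char>& patterns)
{
    std::string output;
    size_t pos = 0;
    while (pos < input.size())
    {
        bool matched = false;
        for (std::map<std::string, char>::const_iterator it = patterns.begin();
             it != patterns.end(); ++it)
        {
            if (input.compare(pos, it->first.size(), it->first) == 0)
            {
                output += it->second;        // emit the symbol for the pattern
                pos += it->first.size();
                matched = true;
                break;
            }
        }
        if (!matched)
            output += input[pos++];          // unknown input: pass it through
    }
    return output;
}

int main()
{
    std::map<std::string, char> patterns;
    patterns["AB"] = 'T';
    patterns["XY"] = 'U';
    std::cout << TranslatePatterns("ABABXYABXYABXYXYABABXYXY", patterns) << std::endl;
    // prints TTUTUTUUTTUU
}

Anything not covered by a known pattern is simply passed through, which is also one crude way of dealing with the noise problem below.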

Then there is the problem of noise. What if you have [ABABABXABABXABABXABAB] with a regular [AB] sometimes interrupted by an X?

Based on these cases I started to play with different algorithms. Some of them work well in simple cases, others work well in longer contexts, and others perform equally poorly on any input pattern.

Seems like a long way ahead...

I'll come back to keep you informed.

Sudden change of course

A long time has passed. Here is an update on what happened:

First: I created a project on sourceforge for the thinking machine. I called it brain-game. You'll find it here.

Second: I did some interesting experiments (programming-wise), which resulted in solving the layering problem. This predecessor/successor business gets confusing at times.

Third: after I set up the layering and ran some tests, I was deeply disappointed. The results were far worse than the single-layer prediction.

This failure to get results prompted me to start looking into the learning part of the Predictor.

And I found out that there are serious issues with it.

So I started to explore the core algorithms.

This somehow threw me off my original track and led me to a simple but essential problem: how do you do pattern recognition?

This is where I stand today.

2005-08-11

A Small Creepy Prototype

Have some news, and some time to report it.

After several false starts and failures, I finally managed to compile a C# GUI which uses the C++ predictor class.

I made a simple application in which I feed an array of predictors with a 32x32 image and check what its prediction is.

It's really creepy to see how the single-layer predictor-collection learns a circle.

It's probably something like what newborns see.

2005-04-28

Sleeping

Last night I watched my six-year-old son sleeping.

Besides the fact that it is aesthetically pleasing to watch a six-year-old sleeping (especially if he is your son), it's puzzling.

Why do we sleep? Why do we dream?

And of course I have an idea of mine about this: it's about sorting out the day's happenings.

Sleep probably has a biological/chemical explanation as well. I think it has to do with the synaptic links between neurons: they might get "tired". After a firing, the synapse is not as "clear" as before, because some chemicals are not completely cleaned out, and after a full day of firing the accuracy/precision of the firing degrades.

I think it goes something like:

At the cortical-region level, the new patterns learned during the day are integrated with the old patterns and the "dictionary" of each region is updated: common subpatterns are extracted, rarely seen patterns are discarded (forgetting?), and patterns that turn out to be common on one level are perhaps pushed down to the levels below (learning?).

We should devise experiments to test these ideas.

Possible experiments

I was thinking about how I will experiment with the predictors.

One idea was the car-driving game (Vehicle Control with Neural Networks).

Another idea (probably simpler) is the image recognition and prediction task used by Dileep George and Jeff Hawkins (see the previous post for references).

Yet another interesting idea would be to try the predictors on natural language.

A trickier one would be to use them on music.

And the ultimate challenge would be to make a robot: eyes, ears, the ability to move its head. As creepy as it may sound, it's not that far off :-)

A better day

Today is a better day.

I have had doubts about the simple approach which ignores the six layers of the cortex.

I just read two papers written by Dileep George and Jeff Hawkins: "A Hierarchical Bayesian Model of Invariant Pattern Recognition in the Visual Cortex" and "Invariant Pattern Recognition Using Bayesian Inference on Hierarchical Sequences".

And while there is a mention of a match between the Bayesian network mechanics and the biological happenings in the six layers, the experiments described do not replicate directly what happens in the brain at that level.

I might have to revisit the modified Markov models proposed in my degree paper however.

As soon as I get to a phase where prediction trials can be made, I might decide that the signals traveling up and down are not simple values but probability distributions over several values.

It's interesting to see how things converge:
  • the hierarchical concept-forming idea of mine (maybe someday I'll post it here) seems to resemble the hierarchical structure of the memory-prediction framework very closely,
  • the statistical NLP knowledge I accumulated and used in the degree paper seems to be of central importance in the learning and prediction of the memory-prediction framework.

2005-04-27

Numenta

Jeff Hawkins founded a company called Numenta.

It will build the things described in On Intelligence.

This is exciting.

Doubts creeping in

Today is a sad day.

I have doubts.

I have doubts that the simplicity of the predictor is a good thing.

Maybe I'm committing the same error that AI researchers (the computer guys) have committed all along.

I'm ignoring the intricacies of the human brain (the six-layered details of the cortex).

Maybe it's important not to ignore it.

Why do I have doubts?

There are several disturbing issues with the simple predictor:
- how it handles predictions coming from higher levels when several higher-level layers send contradictory predictions
- how it handles partial matches in the input
- how it generates its own predictions based on partial histories

These things are getting out of hand.

It might be the case that extending the prediction/expectation part to handle more than one expectation and a list of candidates will solve the problem.

On the other hand it might not.

I have doubts.

Today is a sad day.

2005-04-20

Combining Predictors

I say we start linking predictors together.

A linear string of Predictors

If we start simple and set the Predictor's input width to 1 (basically reducing it to a transcoder), we can create a string of Predictors.

Each Predictor will have its output set to the next Predictor's input.

The first Predictor gets the input and the last one produces the output.

What would a string like this do? It would compress the incoming stream of inputs.

Not much of a feat, eh?
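A minimal sketch of this wiring (it assumes the width-1, temporal Predictor class from the 2005-04-19 post below is in scope; the PredictorChain name is mine):

#include <vector>

// A chain of Predictors: the output of each Predictor is copied to the
// input of the next one; the last Predictor's output is the chain's output.
class PredictorChain
{
public:
    explicit PredictorChain(size_t length) : predictors(length) {}

    int Process(int input)
    {
        for (size_t i = 0; i < predictors.size(); ++i)
        {
            predictors[i].input = input;
            predictors[i].ProcessInput();
            input = predictors[i].output;   // feed the next Predictor in the string
        }
        return input;
    }

private:
    std::vector<Predictor> predictors;
};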

A hierarchy of layers

So we extend it: we allow inputs wider than 1, and organize these Predictors in layers.

The first layer will have N1 Predictors. Each predictor will have several inputs.

Each Predictor in a layer will redirect its output to every Predictor in the next layer.

By setting up hierarchies with various numbers of layers, and various numbers of Predictors per layer, we can create all kinds of compressors.
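Here is a rough sketch of such a hierarchy (it assumes the combined spatial/temporal Predictor from the 2005-04-19 post below; for simplicity every Predictor sees the whole signal vector of the layer beneath it, so the "several inputs" are not yet split into per-Predictor slices):

#include <vector>

// A fully connected hierarchy: every Predictor in a layer feeds its output
// to every Predictor in the next layer.
class PredictorHierarchy
{
public:
    // layer_sizes[k] = number of Predictors in layer k
    explicit PredictorHierarchy(const std::vector<size_t>& layer_sizes)
    {
        for (size_t k = 0; k < layer_sizes.size(); ++k)
            layers.push_back(std::vector<Predictor>(layer_sizes[k]));
    }

    // feed one input vector to the bottom layer, propagate upwards and
    // return the outputs of the top layer
    std::vector<int> Process(const std::vector<int>& input)
    {
        std::vector<int> signals = input;
        for (size_t k = 0; k < layers.size(); ++k)
        {
            std::vector<int> outputs;
            for (size_t i = 0; i < layers[k].size(); ++i)
            {
                layers[k][i].input = signals;   // every Predictor gets all signals from below
                layers[k][i].ProcessInput();
                outputs.push_back(layers[k][i].output);
            }
            signals = outputs;                   // these become the next layer's input
        }
        return signals;
    }

private:
    std::vector<std::vector<Predictor> > layers;
};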

Connecting Predictors

If we set every Predictor from a layer to every Predictor in the next layer then we end up with a very densely connected network.

Apart from these densely connected networks, we can devise various ways to link predictors from a layer to the next layers.

And if we want, we can also link Predictors from the same layer. And we can also link Predictors from a higher layer to one in the lower layers.

I think the possibilities are endless...

2005-04-19

Horizontal and vertical compression

Recall the Predictor class in the previous post.
What it basically does is collect histories of input vectors, build a dictionary out of them, and emit the corresponding signal for each one.

If you set the number of inputs to be received to 1, you will get a stream compressor.

Additionally set the maximum length of the history to 1 and you will get a repeater.

If you set the number of inputs to greater than 1 and the history to 1, you get a strange compressor: you can give it patterns, and it reduces each pattern to a single signal. I call this vertical compression (as opposed to the horizontal compression done by the stream compressor above).

Combine the two (horizontal and vertical) and you get an exciting compressor which compresses streams of patterns.

Someone should do some quick research on what is known about these two-dimensional compressors.

Basic Predictor

I was thinking about this predictor, and I arrived at a pretty simple solution.

I will try to explain it, and show how it builds up. So let's start.

The simplest predictor ever

Let's start with something simple like:

class Predictor
{
public:
    int input;
    int output;

    // simply copy the input to the output
    void ProcessInput()
    {
        output = input;
    }
};



This simple class does not learn anything, nor does it predict anything. It just repeats on its output whatever input it gets. It's a repeater: if you give it the input stream 1,2,1,2,1,2 you'll get 1,2,1,2,1,2 as output.

The transcoder

Let's add a dictionary to the repeater:


#include <cstdlib>   // rand()
#include <map>
#include <utility>   // std::pair

typedef std::map<int, int> Dictionary;

class Predictor
{
public:
    int input;
    int output;

private:
    Dictionary dictionary;

public:
    void ProcessInput()
    {
        Dictionary::const_iterator it = dictionary.find(input);

        // check if we already have this input
        if (it == dictionary.end())
        {
            // if this is a new input, add it to the dictionary
            // with a fresh (randomly chosen) output code
            int new_output = rand();
            std::pair<Dictionary::const_iterator, bool> result =
                dictionary.insert(std::pair<int, int>(input, new_output));
            it = result.first;
        }

        output = (*it).second;
    }
};



Now this does something. For each input it receives, it remembers it, transforms it into something else, and outputs this new thing. If it receives the same input again, it produces the same output: feeding it 1,2,3,1,2,3,1,2,3 will produce something like 42, 43, 44, 42, 43, 44, 42, 43, 44 (the actual output values are randomly generated for now, but I don't think it matters).
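A quick usage sketch of the transcoder above (it assumes the Predictor class is in scope; the concrete output numbers will of course differ, since they come from rand()):

#include <iostream>

int main()
{
    Predictor p;
    int stream[] = { 1, 2, 3, 1, 2, 3, 1, 2, 3 };

    for (int i = 0; i < 9; ++i)
    {
        p.input = stream[i];
        p.ProcessInput();
        std::cout << p.output << " ";   // the same input always yields the same output
    }
    std::cout << std::endl;
}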

Learning spatial patterns

We can do another trick: what if we don't get a single int as input, but a list of ints? Something like:


#include <cstdlib>   // rand()
#include <map>
#include <utility>   // std::pair
#include <vector>

typedef std::vector<int> Input;
typedef std::map<Input, int> Dictionary;

class Predictor
{
public:
    Input input;
    int output;

private:
    Dictionary dictionary;

public:
    void ProcessInput()
    {
        Dictionary::const_iterator it = dictionary.find(input);

        // check if we already have this input pattern
        if (it == dictionary.end())
        {
            // if this is a new pattern, add it to the dictionary
            // with a fresh (randomly chosen) output code
            int new_output = rand();
            std::pair<Dictionary::const_iterator, bool> result =
                dictionary.insert(std::pair<Input, int>(input, new_output));
            it = result.first;
        }

        output = (*it).second;
    }
};



What do we get? A thing which receives patterns and produces a single output representing each pattern.

What this thing learns are spatial patterns as they are described in the On Intelligence book: give it the inputs [1,2,3] [1,2,3] [2,3,4] [2,3,4] and it will produce outputs something like 42, 42, 43, 43.

Learning Temporal Patterns

Another trick we can do to the transcoder is to add a buffer that keeps the incoming inputs:


#include <cstdlib>   // rand()
#include <map>
#include <utility>   // std::pair
#include <vector>

typedef std::vector<int> History;
typedef std::map<History, int> Dictionary;

class Predictor
{
public:
    int input;
    int output;

private:
    History history;
    Dictionary dictionary;

public:
    void ProcessInput()
    {
        history.push_back(input);

        Dictionary::const_iterator it = dictionary.find(history);

        // check if we already know this history
        if (it == dictionary.end())
        {
            // if this is a new history, add it to the dictionary with a fresh
            // (randomly chosen) output code, emit that code and start a new history
            int new_output = rand();
            std::pair<Dictionary::const_iterator, bool> result =
                dictionary.insert(std::pair<History, int>(history, new_output));
            it = result.first;
            output = (*it).second;
            history.clear();
        }
        // if the history is already known, keep collecting inputs and emit nothing new
    }
};



What does this thing do? It learns simple streams of inputs.

Let's see a scenario. Assume that the following inputs are fed to this thing: 1, 2, 1, 2, 1, 2, 1, 2.

When it receives the first 1, it immediately adds it to the dictionary and outputs 42.
Then it sees a 2, adds it to the dictionary and outputs 43.
Then it sees another 1. It finds it in the dictionary and does nothing.
Then it receives the second 2, adds it to the history (which becomes (1,2)), looks it up, does not find it, adds it, and emits 44.
The next two inputs (1 and 2) are received and processed without any output, but they change the history to (1,2).
Then a new 1 comes and produces a previously unseen history (1,2,1), which is added to the dictionary, and 45 is emitted.
And this process continues, producing a dictionary which looks like:
(1) = 42
(2) = 43
(1,2) = 44
(1,2,1) = 45
(2,1) = 46
(2,1,2) = 47
(1,2,1,2) = 48
(2,1,2,1) = 49
and so on...

An interesting change is to place a limit on the size of the history. If we decide that the history cannot be longer than 3 inputs, and we emit the history's mapping whenever it fills up, we end up with a dictionary containing
(1) = 42
(2) = 43
(1,2) = 44
(1,2,1) = 45
(2,1) = 46
(2,1,2) = 47
and the output for a continuous 1,2,1,2,... being
(1)42, (2)43, (1,2)44, (1,2,1)45, (2,1)46, (2,1,2)47, (1,2,1)45, (2,1,2)47, (1,2,1)45, (2,1,2)47 and so on.
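A sketch of that change, assuming everything else in the temporal Predictor above stays the same and only ProcessInput is modified (MAX_HISTORY is the new limit):

static const size_t MAX_HISTORY = 3;

void ProcessInput()
{
    history.push_back(input);

    Dictionary::const_iterator it = dictionary.find(history);
    if (it == dictionary.end())
    {
        // unseen history: learn it, emit its new code and start over
        int new_output = rand();
        it = dictionary.insert(std::pair<History, int>(history, new_output)).first;
        output = (*it).second;
        history.clear();
    }
    else if (history.size() >= MAX_HISTORY)
    {
        // known history that has reached the limit: emit its code and start over
        output = (*it).second;
        history.clear();
    }
}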

Our little thing recognizes and encodes the simple temporal patterns in its input.

Combining the spatial and temporal pattern recognition

Our logical next step is to combine the history and the wide input:


#include <cstdlib>   // rand()
#include <map>
#include <utility>   // std::pair
#include <vector>

typedef std::vector<int> Input;
typedef std::vector<Input> History;
typedef std::map<History, int> Dictionary;

class Predictor
{
public:
    Input input;
    int output;

private:
    History history;
    Dictionary dictionary;

public:
    void ProcessInput()
    {
        output = 0;   // 0 means "nothing new to emit this step"
        history.push_back(input);

        Dictionary::const_iterator it = dictionary.find(history);

        // if this history of patterns is new, learn it, emit its code
        // and start collecting a new history
        if (it == dictionary.end())
        {
            int new_output = rand();
            std::pair<Dictionary::const_iterator, bool> result =
                dictionary.insert(std::pair<History, int>(history, new_output));
            it = result.first;
            output = (*it).second;
            history.clear();
        }
    }
};



I say this is one half of the Predictor.
The other half is surprisingly symmetric to this (includes the prediction thing).

Is this simple? Yes. Is this too simple? Dunno.

It basically does what a predictor should do.

Next steps are:
1. linking them together
2. adding the prediction part