Dataflow matrix machines as generalized recurrent neural networks

Dec 27, 2016 01:29

A year ago I posted about dataflow programming and linear models of computation:

http://anhinga-anhinga.livejournal.com/82757.html

It turns out that those dataflow matrix machines are a fairly powerful generalization of recurrent neural networks.

advanced software, artificial intelligence, machine learning, strange technology, computer science, dataflow matrix machines, neural nets, software continualization


anhinga_anhinga January 2 2017, 18:54:07 UTC
Hi!

1. Generalized animation is a rather beautiful thing. Here it is in half a page: Section 1.1, page 2 of

http://easychair.org/publications/download/Linear_Models_of_Computation_and_Program_Learning

(If you decide to read it at some point: it is a sandwich paper, two papers in one; the mathematical filling, which is Section 4, should be ignored (or treated as a separate paper).)

2. That is the right question; it prompted me to finally write this down and add it to GitHub:

https://github.com/jsa-aerial/DMM/blob/master/design-notes/Early-2017/vector-space-of-mixed-tensors.md

3. We already understood the two-stroke engine back then (we said "bipartite graph" at the time, but it served the same purpose: alternating linear transformations with nonlinear ones, which is exactly the two-stroke cycle). But we did not understand Self; we tried to make the matrix change itself at the level of individual elements, and that turned out to be terribly tedious and creaky...

Later we realized that one can introduce Self and work at the level of matrices and large fragments of matrices, and life became somewhat easier...


datjko January 2 2017, 20:50:39 UTC
1. thank you, it's indeed interesting.
" This architecture provides a direct way to incorporate aesthetic criteria into software systems."
" One should be able to formulate mechanisms of higher-order animation programming via variable illumination of elements of such structures."
- Wow, cool! I'm no longer put off by such a generic view in your papers. If anything, I regret that you did not make a series of masterpieces, making the concept generate explanations of itself on its own, with a formality/poetry ratio as a parameter :-)

2. Now I see, thank you.
"we are talking about a co-product (direct sum) of one-dimenstional spaces." - hmm. I'd say without this note it reads better. Simply, a linear space generated by finite linear combinations of strings of finite length over some alphabet ("a countable set L which is a subset of legal keys"), right? Or, if we merge common prefixes in the strings in these linear combinations, it's a space of prefix trees with coefficients in leaves (the strings of keys assumed to have an implicit sentinel end-of-string character at their ends which represents the leaf nodes in the trees), right?


anhinga_anhinga January 3 2017, 00:16:06 UTC
1. It would certainly be nice to create intelligent masterpieces :-) Not a bad subject for meditation and, with some luck, experiments :-) :-) :-)

2. You are right, thanks! I've commented this note out.

***

Yes, these are prefix trees with coefficients (which are real numbers at the moment, although we might make them something more interesting).
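
For concreteness, here is a toy sketch (illustrative only, with made-up keys; not the actual DMM code) of such a prefix tree with real coefficients in the leaves, written as nested Clojure hash maps, together with the obvious recursive addition:

;; The formal linear combination  2.0*(:x :a) + 0.5*(:x :b) - 1.0*(:y)
;; written as a prefix tree with the coefficients in the leaves.
(def example-vector
  {:x {:a 2.0
       :b 0.5}
   :y -1.0})

;; Adding two such vectors is a recursive merge of the trees,
;; summing the coefficients at coinciding leaves.
(defn add-vectors [v1 v2]
  (cond
    (and (map? v1) (map? v2)) (merge-with add-vectors v1 v2)
    (and (number? v1) (number? v2)) (+ v1 v2)
    :else (throw (ex-info "shape mismatch" {:v1 v1 :v2 v2}))))

;; (add-vectors example-vector {:x {:a 1.0} :z 3.0})
;; => {:x {:a 3.0, :b 0.5}, :y -1.0, :z 3.0}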

***

We do use a very rich alphabet, where everything that Clojure can use as a hash key is a legal letter; even so, the alphabet is countable. Yes, mathematically speaking, it would be enough to use just strings as letters (the alphabet is still countable), and if we then use an extra sentinel end-of-string character, one could concatenate the strings in the path, use the resulting finite alphabet, and get prefix trees over a finite alphabet, but with more depth.

This would indeed work.

***

Implementation-wise, we are actually in love with intermediate layers; e.g. one can see how we use them in our network matrix (https://github.com/jsa-aerial/DMM/blob/master/design-notes/recurrent-maps.md and the code), with the columns and rows of the network matrix being indexed not by "flat indices", but by a three-level hierarchy of indices. So in our implementation we might prefer not to actually concatenate, but to keep the alphabet countable and to keep the "letters" expressive.

But in general, this "concatenation idea" might be quite fruitful in a number of contexts. (The actual implementation would probably replace a sequence of keys, e.g. key1, key2, key3, with a single key which would be the vector [key1 key2 key3].) Basically, instead of concatenating to increase the tree depth, we would be concatenating to decrease the tree depth (using whole paths as single dictionary keys). So we should keep this possibility in mind; it might come in handy in a number of situations.
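
A minimal sketch of this flattening (illustrative only, not part of the DMM codebase): turn a nested prefix tree into a flat map whose keys are the whole paths, represented as vectors of keys.

(defn flatten-paths
  "Replaces each path of keys leading to a leaf, e.g. key1, key2, key3,
  with a single vector key [key1 key2 key3]."
  ([m] (flatten-paths [] m))
  ([path m]
   (if (map? m)
     (reduce-kv (fn [acc k v]
                  (merge acc (flatten-paths (conj path k) v)))
                {} m)
     {path m})))

;; (flatten-paths {:x {:a 2.0, :b 0.5}, :y -1.0})
;; => {[:x :a] 2.0, [:x :b] 0.5, [:y] -1.0}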


datjko January 3 2017, 07:14:30 UTC
I keep reading in small portions from time to time.

1.
"
let's agree that the first level of hash-map keys would name the respective input and output streams and map to their latest values.
...
Therefore the root keys should be types of neurons.
"

Neuron types or instances of neurons? - OK, I see, instances are the next level.

2.
"
A map from names of the built-in transformers to (the map from names of the individual neurons to (the map from names of their output streams to the respective non-zero matrix coefficients)).
"

In the first quote you spoke about naming both input and output streams, but here only about outputs?

3. Am I right that you don't keep the computational graph in mind "as a picture", but rather think of it as something implicitly given by the "computational matrix"?

Maybe the pictures of usual neural nets confuse me here. It seems I understand the idea. Could you please check whether I'm wrong in the following:

The "computation matrix" *is* that prefix tree of chains of keys that we were talking about (aka recurrent map), and a rule for computing the next step is to traverse it up to each individual neuron and apply the neuron's function to the subtree of arguments, producing a new subtree of output values, which is then split&replicated&wrapped with appropriate neurons keys to make up a computation matrix for next step.

I have a feeling that in the papers this (or some more generic operation?) is meant as a part of some simple transformation of a giant (or even countably-dimensional) but sparse matrix or tensor, is that right? Most likely it's explained in the papers you gave the links to, but I have not read or understood it yet. Do you have a simple hint of what that matrix transformation is which would apply the neuron functions to the appropriate data and produce a new computation matrix? Or maybe an example? Tomorrow I'll try to grasp the example of the interview task solution in https://arxiv.org/abs/1606.09470 so don't hurry with the answers if you think it's better to wait for me there.

At any moment of time t (and on the up movement of the two-stroke engine) there is a directed bipartite computational graph g_t with neurons as edges (or as terminal source or sink nodes), transforming an input recurrent-map vector of values (generated for the moment t) into an output recurrent-map vector of values. The nodes of g_t are the neurons' inputs and outputs (in the input and output partitions of the graph, correspondingly), where several recurrent-map streams are merged (concatenated) and split (or replicated).
(I wanted to erase this statement, as I thought I came to a better idea above after thinking about a recurrent map vs a vector of values for the input/output streams; I left it here just in case it is worth any comments.)

4. What are the requirements for neuron functions? Generically, at least, they are assumed to always produce exactly zero on zero input, right? Are there any other requirements? Stateless? Deterministic?


anhinga_anhinga January 3 2017, 15:11:34 UTC
Hi!

I'll be answering in small pieces...

2. Columns correspond to outputs, and they are indexed by this three-level structure: {name of the built-in transformer {name of the individual neuron {name of the output stream ...}}}.

When one considers a particular matrix row, this corresponds to "the second index, which changes when one moves along the matrix row".

Rows correspond to inputs, and they are indexed by this three-level structure: {name of the built-in transformer {name of the individual neuron {name of the input stream ...}}}.

So the matrix elements are indexed by the 6-level structure: {name of the built-in transformer of neuron-i {name of the individual neuron-i {name of the input stream {name of the built-in transformer of neuron-j {name of the individual neuron-j {name of the output stream ...}}}}}}
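
For example (with made-up transformer, neuron, and stream names, purely to illustrate the shape): a network matrix with a single nonzero element 0.7, connecting the output stream :accum of neuron :self-1 (built-in transformer :plus) to the input stream :delta of neuron :update-1 (also built-in transformer :plus), would look like this as a nested Clojure map:

(def matrix-fragment
  {:plus                   ; built-in transformer of neuron-i (row side)
   {:update-1              ; individual neuron-i
    {:delta                ; input stream of neuron-i
     {:plus                ; built-in transformer of neuron-j (column side)
      {:self-1             ; individual neuron-j
       {:accum 0.7}}}}}})  ; output stream of neuron-j -> matrix coefficient

;; Reading a coefficient back is just get-in with the 6-level path:
;; (get-in matrix-fragment [:plus :update-1 :delta :plus :self-1 :accum])  => 0.7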

3 (first paragraph). I love visualizing computational graphs. It is always good to have alternative angles of view (to view the network both as a matrix and as a graph).

In some of our earlier prototypes, we did visualize how the dataflow graph of a changing network changes with time. E.g. I have this auxiliary livejournal, and its top post has a video (made in June 2015) which shows the changing dataflow graph in the lower left corner: http://anhinga-drafts.livejournal.com/29929.html

This was made before we understood the matrix formalism. And I always wanted to have the option to use input interfaces to edit the dataflow graph directly at the graphical level (being inspired in this by some visual programming languages like Pure Data). But it's always a lot of software engineering work, and we don't have anything like this yet in our current Clojure system.


anhinga_anhinga January 3 2017, 15:35:33 UTC
> The "computation matrix" *is* that prefix tree of chains of keys that we were talking about (aka recurrent map), and a rule for computing the next step is to traverse it up to each individual neuron and apply the neuron's function to the subtree of arguments, producing a new subtree of output values

Yes, this is an "up movement" of the two-stroke engine. In the current version of

https://github.com/jsa-aerial/DMM/blob/master/src/dmm/core.clj

see the comments starting at line 303 near the end of that file; the two functions at the bottom implement the traversal. (When you decide you want to read the code, the important function to understand is reduce; this whole DMM codebase is constructed around it: https://clojuredocs.org/clojure.core/reduce )


anhinga_anhinga January 3 2017, 15:45:58 UTC
So this up-movement takes a three-level structure of all inputs, and produces a three-level structure of all outputs: from a dictionary of these things {name-of-the-built-in-transformer {name-of-the-individual-neuron {name-of-the-input-stream some-input}}} it builds a new dictionary of these things {name-of-the-built-in-transformer {name-of-the-individual-neuron {name-of-the-output-stream some-output}}}.
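
A rough sketch of this up movement with made-up names (the real implementation is the reduce-based traversal in dmm/core.clj mentioned above; this is only an illustration). Assume transformers is a map from the name of a built-in transformer to its function, which takes the map {input-stream-name some-input} of one neuron and returns the map {output-stream-name some-output}:

(defn up-movement [transformers all-inputs]
  (into {}
        (for [[f-name neurons] all-inputs]
          [f-name
           (into {}
                 (for [[neuron-name input-streams] neurons]
                   [neuron-name ((transformers f-name) input-streams)]))])))

;; Example with a single built-in transformer which adds its two arguments
;; (a toy stand-in for an accumulator-style Self):
;; (up-movement {:plus (fn [{:keys [accum delta]}]
;;                       {:accum (merge-with + accum delta)})}
;;              {:plus {:self-1 {:accum {:x 1.0} :delta {:x 0.25}}}})
;; => {:plus {:self-1 {:accum {:x 1.25}}}}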


anhinga_anhinga January 3 2017, 16:17:07 UTC
> Do you have a simple hint of what is that matrix transformation which would apply the neuron functions to appropriate data and produce a new computational matrix. Or, may be an example?

Currently, we use a very simple form of Self. It has two arguments, and on the up movement it simply adds them together. Its output is connected to one of its own inputs with weight 1, so this allows it to act as an accumulator (to keep the computed value through time).

The other input of Self accepts additive contributions from other neurons. So on the down movement, those additive contributions are added together, and on the up movement they are added to the previously existing matrix.
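
A toy sketch of this simple Self (hypothetical names, with the matrices flattened to plain maps of weights for brevity; not the library code):

(defn self-up [{:keys [accum delta]}]
  ;; up movement: add the accumulated matrix and the additive contributions
  {:accum (merge-with + (or accum {}) (or delta {}))})

;; Because the :accum output is wired back to Self's own :accum input with
;; weight 1, the accumulated matrix survives from step to step, while the
;; other neurons contribute updates through :delta on the down movement.
;; (self-up {:accum {[:w 1] 2.0} :delta {[:w 1] 0.5, [:w 2] 1.0}})
;; => {:accum {[:w 1] 2.5, [:w 2] 1.0}}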

***

A couple of simple examples are in Appendix D.2 on page 6 of https://arxiv.org/abs/1610.00831

The second of those examples is also implemented here: https://github.com/jsa-aerial/DMM/tree/master/examples/dmm


anhinga_anhinga January 3 2017, 17:08:06 UTC
4 is an interesting question.

On one hand, the requirements depend on which training methods are to be applicable. For example, people occasionally consider the step function (y = 1 if x > 0 else 0), but its derivative is 0 everywhere it is defined, so if one wants to do gradient descent, this is not a good neuron function to have.

At the same time, one can consider an integral of the step function (ReLU: y = max(0, x)). People ignored it for a long time because of the discontinuity in the derivative, but it turns out that it works fine in gradient methods, and that it is actually one of the best neuron functions available.
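
Written as ordinary one-argument neuron functions on numbers (just an illustration):

(defn step [x] (if (pos? x) 1.0 0.0))   ; derivative is 0 wherever it is defined
(defn relu [x] (max 0.0 x))             ; y = max(0, x), works fine with gradient methods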

***

A priori, we don't impose many requirements. When a new neuron is recruited into the active part of the network, it happens because Self has just changed the network matrix on the up movement. So the next thing that happens is that the inputs of that newly recruited neuron are computed on the down movement. Hence we don't even need to impose any particular requirement at zero.

In fact, we typically think about input neurons as neurons which ignore their own input (if any) and just inject their output into the network. So this would be an example of a neuron which is not trying to map zeros to zeros.

They are usually stateless, although when we described the general formalism in the second of those preprints, we specifically made sure to be general enough to allow state, if desired.

We found stochastic neurons quite interesting. E.g. all these funky self-organizing patterns in the videos linked from the first comment in http://anhinga-anhinga.livejournal.com/82757.html are obtained using neurons which are "randomized propagators" (they usually pass their input through unchanged, but with a small probability they output zero instead; this turns out to be enough for all that complexity in those videos).
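
A sketch of such a "randomized propagator" as described (the parameter name p is mine, just for illustration):

(defn randomized-propagator
  "Returns a stochastic one-argument neuron function: it usually passes its
  input through unchanged, but with a small probability p it outputs zero."
  [p]
  (fn [x] (if (< (rand) p) 0.0 x)))

;; ((randomized-propagator 0.02) 3.7)  => 3.7 most of the time, 0.0 with
;; probability 0.02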


datjko January 3 2017, 18:23:32 UTC
Misha, thank you very much for such detailed answers! I'll keep making small steps and asking new questions.


anhinga_drafts February 4 2017, 08:35:13 UTC
Hi!

This is one of the most interesting talks I've ever heard in my life:

http://anhinga-drafts.livejournal.com/29954.html


