Dataflow matrix machines as generalized recurrent neural networks

Dec 27, 2016 01:29

A year ago I posted about dataflow programming and linear models of computation:

http://anhinga-anhinga.livejournal.com/82757.html

It turns out that those dataflow matrix machines are a fairly powerful generalization of recurrent neural networks.

advanced software, artificial intelligence, machine learning, strange technology, computer science, dataflow matrix machines, neural nets, software continualization


datjko December 29 2016, 10:40:10 UTC
Misha, thank you, this looks very interesting, but I'm rather a newbie here. Could you answer a first bunch of my silly questions here, or give me a link to some introduction? I started learning ML more or less seriously only recently, and a quick look into the papers you linked makes it clear to me that this is yet another interesting story I'm starting to listen to from the middle of the book.

1. Why can't a single neuron with a fixed number of inputs be approximated by a cascade of layers of ordinary neurons? Or, what are the advantages of a single multi-input neuron compared to a subnetwork of ordinary neurons?

2. Are the multiple inputs of a DMM neuron supposed to have different roles/interpretations (as in your example of a "main signal" and a "modulation signal", or as 'recurrent' connections in an RNN), or are they thought of as being fed from homogeneous streams, like the inputs to the linear combination before the activation function in regular neurons? Or is it up to the designer of the network?

3. Do I understand correctly that, in general, the number of inputs of a DMM neuron is not necessarily fixed?

4. And new neurons may even be created dynamically? Only at training time, or even while processing a test stream?

5. "friendly to handling sparse vectors" is friendly only to "one-of-all" representation as for letters in an alphabet or there is some more generic friendship with sparsity?

I'm sorry, I can only guess how silly my understanding may be.

Reply

anhinga_anhinga December 30 2016, 04:22:44 UTC
Hi! Let me start by answering your questions, and then point to some links.

1. Indeed, there is a theorem that any reasonable function can be approximated by a neural cascade. But pragmatically speaking, it's a nightmare, not something easy to use in practice. (E.g. try to craft a network performing multiplication of two reals. Also, in general, the more precise the approximation, the larger the cascade/subnetwork must be.) So it is convenient to allow adding built-in activation functions as needed.
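For example, here is a tiny sketch in Clojure (the names here are made up for illustration, they are not from our codebase) of a built-in two-input multiplication neuron; getting the same behavior out of a cascade of ordinary sigmoid neurons would require a whole approximating subnetwork:

(defn multiply-inputs
  "A built-in activation function: the exact product of two scalar input streams."
  [x y]
  (* x y))

(multiply-inputs 3.0 -2.5) ;=> -7.5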

2. It's up to the designer. E.g. even for multiplication, sometimes an asymmetric interpretation is convenient, and sometimes it might be convenient to think, in a symmetric fashion, of both inputs as modulating each other.

3. Typically, we think about a particular neuron as having a type. Among other things, this type says how many inputs the neuron has, and what vectors are associated with them. For example, one input might take a vector representing a letter (so it would have the dimension of the alphabet in question), and another input might take a number (a one-dimensional vector), and so on. And the network can have neurons of different types.

So in that approach, the number of inputs for a single given neuron is fixed, but there can be neurons of varying type. This is our approach in the preprints.

Our latest version is more flexible (the one in our new open-source project in Clojure). There we use the vector space of recurrent maps to implement variadic neurons. Basically, recurrent maps allow us to assemble any number of arguments into one single argument, and so we can code neurons with an arbitrary (and not necessarily fixed) number of inputs via functions of one argument.
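A minimal sketch of what I mean (hypothetical names, not our actual code): a variadic neuron receives a single recurrent map whose keys name its input streams, so it is a function of one argument, and yet the number of inputs is not fixed:

(defn sum-all-inputs
  "A variadic neuron: adds up however many input streams are present."
  [input-map]
  (reduce + 0 (vals input-map)))

(sum-all-inputs {:x1 1.0 :x2 2.0})        ;=> 3.0
(sum-all-inputs {:a 0.5 :b 1.5 :c -1.0})  ;=> 1.0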

4. Yes, we think about countable-sized networks, but with only a finite number of non-zero elements in the network matrix at any given time. Only the neurons with a non-zero weight associated with them are considered active. So at any given moment, only a finite subnetwork is active and the rest is silent. New neurons can be recruited from the silent part by making some of their associated weights non-zero. So as long as the network weights are allowed to change, new neurons can be created dynamically (by recruiting them from the countable silent part).
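To illustrate with a toy example (this is just a sketch, not our actual matrix layout): only the non-zero weights are ever stored, so the countably infinite network is represented by a finite nested map at any given time, and recruiting a silent neuron is just a matter of giving it a non-zero weight:

;; only :neuron-a and :neuron-b are active at this moment
(def w {:neuron-a {:neuron-b 0.7}})

;; recruit the previously silent :neuron-c by making one of its weights non-zero
(def w-next (assoc-in w [:neuron-a :neuron-c] 0.1))
;; w-next => {:neuron-a {:neuron-b 0.7, :neuron-c 0.1}}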

5. There is actually friendship with any reasonable representation of vectors in computers. Even with very approximate representations (e.g. probability distributions, which are often infinite-dimensional as vectors, can be approximately represented by samples, and then variants of a stochastic sum can be used to implement linear combinations).
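Here is one possible sketch of a stochastic sum (my own illustration here, not code from our repository): a linear combination with non-negative weights of distributions given by samples can itself be sampled by drawing each element from one of the sample sets with probability proportional to its weight:

(defn stochastic-sum
  "weighted-samples is a sequence of [weight samples] pairs with non-negative
  weights; returns n samples approximately distributed as the normalized
  linear combination of the represented distributions."
  [weighted-samples n]
  (let [total (reduce + (map first weighted-samples))]
    (repeatedly n
                (fn []
                  (loop [r (* (rand) total)
                         [[w samples] & more] weighted-samples]
                    (if (or (< r w) (empty? more))
                      (rand-nth (vec samples))
                      (recur (- r w) more)))))))

;; 0.7 * (uniform over [0 1 2]) + 0.3 * (uniform over [10 11]), 5 samples
(stochastic-sum [[0.7 [0 1 2]] [0.3 [10 11]]] 5)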

I think this is somewhat difficult material. (I am telling various parts of this story to people, and it's not always easy for them to understand.)

It's not too difficult compared to many math texts, but as computer science it's not very simple.

I am happy to answer more questions. I can also put together more references to RNNs (let me know if I should).

But speaking about DMMs, the links above are the complete set of literature available at the moment. We discovered DMMs last summer, and this year we understood that what we have is a generalized version of RNNs; these preprints and the notes in the open-source projects (this one, and the previous one mentioned in the previous livejournal post) are all that exists at the moment on DMMs.

RNN-related literature is quite extensive. The link to Karpathy's post in my comments above is very useful. I found various Wikipedia pages helpful. On LSTMs, see Appendix C of the last of our preprints and the awesome paper it talks about: https://www.arxiv.org/abs/1603.09420 (this is the best way to understand LSTMs and similar architectures).

Reply

datjko December 30 2016, 05:56:58 UTC
Hi!
Wow! So generic and still workable!
I'll try to read through and come back with new questions. How nice to be able to get answers first-hand.
Maybe you could also write a post resolving some typical newbie questions and difficulties in understanding? Say, assuming a decent undergraduate background in CS/ML/FP on the reader's part? I believe many people would greatly appreciate it.
Thank you, Misha!

Reply

anhinga_anhinga December 30 2016, 17:49:05 UTC
Hi!

Yes, I think it is a very interesting class of models and a very interesting computational platform. All standard methods of machine learning should be applicable to it.

My real hope is, of course, that it will allow us to create better methods of machine learning in the coming year. So I invite people to try to make progress in that direction, both on their own and in collaboration with me :-)

***

Speaking of questions and difficulties, this conversation is a good start. Other people might read it and find it useful. So it would be great if you have more questions!

Reply

datjko December 30 2016, 20:57:41 UTC
Misha, thank you, I'll gladly try to read carefully through your group's papers, and maybe I'll try to port something to tensorflow next month. But I'm only an amateur, and my questions may actually lead away from the important essence rather than help good people see it. It would be more useful if you yourself (if I'm not mistaken, back when I was a schoolboy coming to CEMI to tinker with various little things, we were on first-name terms, right?) came up with and selected the important questions and answered them in some text on GitHub, where people will find it next to working code. And I'll still ask plenty of unneeded questions anyway :-)
Hanukkah sameach, and have a great 2017!

Reply

anhinga_anhinga December 31 2016, 02:51:23 UTC
I've also started looking a little at tensorflow, to see how easy or difficult it would be to port something from DMMs to it.

How much have you already tinkered with tensorflow? (One of my questions is what the situation with sparse tensors is there; I know they exist, but how easy is it to use them, for example, in the existing implementations of gradient steps?)

Yes, of course, we were on first-name terms :-)

***

Well, for now I've added a comment here with a list of Clojure links that were useful to me this year, for those who might want to learn Clojure or to read code written in it... But having versions in more widely accessible languages as well would of course be the right thing, as would having better tutorials/FAQs...

Hanukkah sameach and Happy New Year!

Reply

datjko January 1 2017, 10:31:06 UTC
Misha, alas, I'm still a complete newbie and have only managed to play a little with tensorflow (and haven't even played with all the rest: caffe/torch/theano). I didn't even expect that tensorflow already has sparse tensors; I was thinking about how to get by without them, and you opened my eyes. Well, now I just need to gradually funnel everything I've learned this month into one place; I hadn't gotten around to it before.

Thanks for the Clojure links too! (blushing) I don't know it either, but I was planning to learn it whenever it fell on my head. It seems it already has.

- and having tutorials/FAQs
- Yes! I'll try to read carefully and figure out what to ask and how to play with it so that it becomes clearer.
I'll keep you posted :-)

Reply

datjko January 2 2017, 10:39:58 UTC
I'm trying to understand the basics.
1. What is "generalized animation"? Is it something from earlier work? Do I need to know it?

2. "Recurrent maps are maps which map arbitrary legal hash-map keys into numbers or other recurrent maps." - OK. How do you make a vector space out of that? Can you briefly summarize what is written in src/dmm/core.clj, or point me to where to look?

3. In http://www.cs.brandeis.edu/~bukatin/DataFlowGraphsAsMatrices.pdf the Self neuron is not mentioned. Is that because you introduced it later, together with the two-stroke engine, or is the two-stroke engine a particular implementation detail?

This is me trying to figure out, before actually reading, how to read all of this.

Reply

datjko January 2 2017, 11:17:28 UTC
Oh, about 3: I see now that the two-stroke engine is about inputs and outputs, and not what I had pulled out of the mess in my head, namely that on the "up" stroke we process the data, and on the "down" stroke we update the weights with the Self neuron.

Reply

anhinga_anhinga January 2 2017, 18:54:07 UTC
Hi!

1. Generalized animation is a rather beautiful thing. Here it is in half a page: Section 1.1, page 2 of

http://easychair.org/publications/download/Linear_Models_of_Computation_and_Program_Learning

(If you want to read it at some point: it is a sandwich paper, two papers in one; the mathematical filling, which is Section 4, should be ignored (or treated as a separate paper).)

2. That's the right question; it prompted me to finally write this down and add it to GitHub:

https://github.com/jsa-aerial/DMM/blob/master/design-notes/Early-2017/vector-space-of-mixed-tensors.md
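The basic idea in one snippet (a simplified sketch, not the actual functions from src/dmm/core.clj): recurrent maps with numeric leaves add like vectors, by adding the coefficients at matching paths, with missing paths counting as zeros:

(defn add-rec-maps
  "Vector addition of two recurrent maps with numeric leaves."
  [u v]
  (merge-with (fn [a b]
                (if (and (map? a) (map? b))
                  (add-rec-maps a b)
                  (+ a b)))
              u v))

(add-rec-maps {:x 1.0 :y {:a 2.0}}
              {:x 0.5 :y {:b -1.0}})
;=> {:x 1.5, :y {:a 2.0, :b -1.0}}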

3. We already understood the two-stroke engine back then (we used to say "bipartite graph", but it served the same purpose: to alternate linear transformations with nonlinear ones, which is exactly the two-stroke cycle). But we did not yet understand Self; we tried to have the matrix change itself at the level of individual elements, and that turned out to be terribly tedious and creaky...

Later we figured out that one could introduce Self and work at the level of matrices and large fragments of matrices, and life became somewhat easier...

Reply

datjko January 2 2017, 20:50:39 UTC
1. Thank you, it's indeed interesting.
" This architecture provides a direct way to incorporate aesthetic criteria into software systems."
" One should be able to formulate mechanisms of higher-order animation programming via variable illumination of elements of such structures."
- Wow, cool! I'm no longer put off by such a generic view in your papers. I'd rather regret that you did not make a series of masterpieces, making the concept generate explanations of itself on its own, with a formality/poetry ratio as a parameter :-)

2. Now I see, thank you.
"we are talking about a co-product (direct sum) of one-dimenstional spaces." - hmm. I'd say without this note it reads better. Simply, a linear space generated by finite linear combinations of strings of finite length over some alphabet ("a countable set L which is a subset of legal keys"), right? Or, if we merge common prefixes in the strings in these linear combinations, it's a space of prefix trees with coefficients in leaves (the strings of keys assumed to have an implicit sentinel end-of-string character at their ends which represents the leaf nodes in the trees), right?

Reply

anhinga_anhinga January 3 2017, 00:16:06 UTC
1. It would certainly be nice to create intelligent masterpieces :-) Not a bad subject for meditation and, with some luck, experiments :-) :-) :-)

2. You are right, thanks! I've commented this note out.

***

Yes, these are prefix trees with coefficients (which are real numbers at the moment, although we might make them something more interesting).
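As a small illustration (again just a sketch, not our actual implementation), scalar multiplication on these prefix trees simply rescales every leaf coefficient while keeping the shape of the tree:

(defn scale-rec-map
  "Multiplies every numeric leaf of a recurrent map by the scalar c."
  [c m]
  (if (map? m)
    (into {} (map (fn [[k v]] [k (scale-rec-map c v)]) m))
    (* c m)))

(scale-rec-map 2.0 {:x 1.0 :y {:a -0.5}})
;=> {:x 2.0, :y {:a -1.0}}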

***

We do use a very rich alphabet, where everything which Clojure can use as a hash key is a legal letter. So even the alphabet is countable. Yes, mathematically speaking, it would be enough to use just strings as letters (the alphabet is still countable), and then, if we use an extra sentinel end-of-string character, one could concatenate the strings along the path, use the resulting finite alphabet, and get prefix trees over a finite alphabet, but with more depth.

This would indeed work.

***

Implementation-wise, we are actually in love with intermediate layers; e.g. one can see how we use them in our network matrix here (https://github.com/jsa-aerial/DMM/blob/master/design-notes/recurrent-maps.md and the code), with the columns and rows of the network matrix being indexed not by "flat indices", but by a three-level hierarchy of indices. So we might prefer not to actually concatenate in our implementation, but to keep the alphabet countable and to keep the "letters" expressive.

But in general, this "concatenation idea" might be quite fruitful in a number of contexts. (The actual implementation would probably replace a sequence of keys, e.g. key1, key2, key3, with a single key which would be a vector [key1 key2 key3].) Basically, instead of concatenating to increase the tree depth, we would be concatenating to decrease the tree depth (to use whole paths as single dictionary keys). So we should keep this possibility in mind; it might come in handy in a number of situations.
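A possible sketch of that (a hypothetical helper, not something we have implemented): flattening a recurrent map so that each whole path becomes a single vector-valued key:

(defn flatten-paths
  "Turns a nested recurrent map into a flat map keyed by whole paths."
  ([m] (flatten-paths [] m))
  ([path m]
   (if (map? m)
     (into {} (mapcat (fn [[k v]] (flatten-paths (conj path k) v)) m))
     {path m})))

(flatten-paths {:key1 {:key2 {:key3 0.5} :key4 1.0}})
;=> {[:key1 :key2 :key3] 0.5, [:key1 :key4] 1.0}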

Reply

datjko January 3 2017, 07:14:30 UTC
I keep reading in small portions from time to time.

1.
"
let's agree that the first level of hash-map keys would name the respective input and output streams and map to their latest values.
...
Therefore the root keys should be types of neurons.
"

Neuron types or instances of neurons? - OK, I see, instances are the next level.

2.
"
A map from names of the built-in transformers to (the map from names of the individual neurons to (the map from names of their output streams to the respective non-zero matrix coefficients)).
"

In the first quote you spoke about naming both input and output streams, and here only about output streams?

3. Am I right that you don't keep the computational graph "as a picture" in mind, and rather think of it as something implicitly given by the "computation matrix"?

Maybe the pictures of usual neural nets confuse me here. It seems I understood the idea. Could you please check whether I'm wrong in the following:

The "computation matrix" *is* that prefix tree of chains of keys that we were talking about (aka recurrent map), and a rule for computing the next step is to traverse it up to each individual neuron and apply the neuron's function to the subtree of arguments, producing a new subtree of output values, which is then split&replicated&wrapped with appropriate neurons keys to make up a computation matrix for next step.

I have a feeling that in the papers this (or some more general operation?) is meant as part of some simple transformation of a giant (or even countably dimensional) but sparse matrix or tensor; is that right? Most likely it's explained in the papers you linked to, but I haven't read or understood it yet. Do you have a simple hint about what that matrix transformation is which would apply the neuron functions to the appropriate data and produce a new computation matrix? Or maybe an example? Tomorrow I'll try to grasp the example of the interview task solution in https://arxiv.org/abs/1606.09470, so don't hurry with answers if you think it's better to wait for me there.

At any moment of time t (and on the up movement of the two-stroke engine) there is a directed bipartite computational graph g_t with neurons as edges (or terminal source or sink nodes), transforming an input recurrent-map vector of values (generated for the moment t) into an output recurrent-map vector of values. Nodes of g_t are neurons' inputs and outputs (in the input and output partitions of the graph, respectively), where several recurrent-map streams are merged (concatenated) and split (or replicated).
(I wanted to erase this statement, as I thought I came to a better idea above after thinking about recurrent maps vs. vectors of values for input/output streams. I left it here just in case it is worth any comments.)

4. What are the requirements for neuron functions? Generally, they are at least assumed to always produce exactly zero on zero input, right? Are there other requirements? Stateless? Deterministic?

Reply

anhinga_anhinga January 3 2017, 15:11:34 UTC
Hi!

I'll answer in small pieces...

2. Columns correspond to outputs, and they are indexed by this three-level structure: {name of the built-in transformer {name of the individual neuron {name of the output stream ...}}}.

When one considers a particular matrix row, this corresponds to "the second index, which changes when one moves along the matrix row".

Rows correspond to inputs, and they are indexed by this three-level structure: {name of the built-in transformer {name of the individual neuron {name of the input stream ...}}}.

So the matrix elements are indexed by the 6-level structure: {name of the built-in transformer of neuron-i {name of the individual neuron-i {name of the input stream {name of the built-in transformer of neuron-j {name of the individual neuron-j {name of the output stream ...}}}}}}
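Written out as a nested map, a single non-zero matrix element would look something like this (all the names here are made up for illustration):

{:linear-combination     ; built-in transformer of neuron-i
 {:neuron-i              ; the individual neuron-i
  {:accum                ; the name of its input stream
   {:sigmoid             ; built-in transformer of neuron-j
    {:neuron-j           ; the individual neuron-j
     {:out 0.42}}}}}}    ; the name of its output stream, mapped to the coefficient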

3 (first paragraph). I love visualizing computational graphs. It is always good to have alternative angles of view (to view the network both as a matrix and as a graph).

In some of our earlier prototypes, we did visualize how the dataflow graph of a changing network changes with time. E.g. I have this auxiliary livejournal, and its top post has a video (made in June 2015) which shows the changing dataflow graph in the lower left corner: http://anhinga-drafts.livejournal.com/29929.html

This was made before we understood the matrix formalism. And I always wanted to have options to use input interfaces to edit the dataflow graph directly at the graphical level (being inspired in this by some visual programming languages like Pure Data). But it's always a lot of software engineering work, and we don't have anything like this yet in our current Clojure system.

Reply

anhinga_anhinga January 3 2017, 15:35:33 UTC
> The "computation matrix" *is* that prefix tree of chains of keys that we were talking about (aka recurrent map), and a rule for computing the next step is to traverse it up to each individual neuron and apply the neuron's function to the subtree of arguments, producing a new subtree of output values

Yes, this is an "up movement" of the two-stroke engine. In the current version of

https://github.com/jsa-aerial/DMM/blob/master/src/dmm/core.clj

see the comments starting from line 303 near the end of this file; the two functions at the bottom implement the traversal (when you decide you want to read the code, the important function to understand is reduce; this whole DMM codebase is constructed around it: https://clojuredocs.org/clojure.core/reduce )

Reply

anhinga_anhinga January 3 2017, 15:45:58 UTC
So this up-movement takes a three-level structure of all inputs, and produces a three-level structure of all outputs: from a dictionary of these things {name-of-the-built-in transformer {name-of-the-individual-neuron {name-of-the-input-stream some-input}}} it builds a new dictionary of these things {name-of-the-built-in transformer {name-of-the-individual-neuron {name-of-the-output-stream some-output}}}
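Schematically, the up-movement does something like this (a simplified sketch, not the actual functions at the bottom of core.clj):

(defn up-movement
  "Applies each neuron's built-in transformer to its map of input streams,
  turning the three-level structure of inputs into the three-level structure
  of outputs; `transformers` maps transformer names to functions of one
  recurrent map."
  [transformers inputs]
  (into {}
        (for [[f-name neurons] inputs]
          [f-name
           (into {}
                 (for [[neuron-name input-streams] neurons]
                   [neuron-name ((transformers f-name) input-streams)]))])))

;; a single :plus neuron summing its input streams:
(def transformers {:plus (fn [streams] {:sum (reduce + 0 (vals streams))})})

(up-movement transformers {:plus {:n1 {:x 1.0 :y 2.0}}})
;=> {:plus {:n1 {:sum 3.0}}}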

Reply

