RFC: Standardizing Composite Operations In TensorFlow #113
Conversation
> 2. If we go org/name, do we need a centralized registry of accepted "org" names?
>
> 3. Do we need a `reference` field? The idea is that it points to the source of …
I like the idea of requiring a reference. However, I feel that a more natural way to document an op is by just writing a Python docstring which could contain one or more references if needed.
Right, the idea was that automatic tools could rely on it as a human-readable hash (or use a non-tensorflow implementation as a source of ground truth), but perhaps this would be too confusing.
I do like the idea, but I'm also in favor of just keeping references in doc string. If a downstream developer is re-implementing a composite op I think they'll be thoroughly reading the doc.
> * Tooling is not forward compatible.
>
> ## Questions and Discussion Topics
Do you want to cover specialization in this RFC? I can imagine a situation where an efficient implementation is device-specific, e.g. TPU vs GPU. A recent use-case we had in Sonnet is LSTM, which has a CuDNN-RNN implementation, a "block" implementation in contrib/rnn, and a fallback via tf.nn.dynamic_rnn(...).
Interesting question. My intent was to leave it up to tooling to decide what to do. In this sense the playing field is level for both core tensorflow and downstream tools (such as tf-lite). If tensorflow has an efficient implementation it can simply use it, or it can rely on the function implementation, or it can try both and decide what is better at runtime. Basically it brings what tensorflow already does with conv2d up to the API level. Or perhaps I misread this comment. Do you have a specific suggestion on specialization in mind?
Sorry if my initial comment has been unclear. Specialization itself will have to be implemented in downstream tools. I was actually thinking about implementation selector in Grappler while writing that. However, to allow that, an op should somehow indicate that it's targeting a specific device, or is device agnostic.
I see. If I understand it right, the idea is that different block implementations could be more efficient on different target devices, and grappler (or downstream tooling, for that matter) should be able to substitute the correct implementation based on the device it is targeting. I can certainly see the utility of doing that, though I think we will need to think about how to keep it from becoming very messy. Will add this to the discussion list.
@qlzh727 I think your specialization work should commute with this, can you comment?
> Arguments:
>
> * `implements`: we propose using namespaces with the top-level namespace …
What happens if there already is a function with a given name?
Good question. We can follow Python semantics of silently overriding such functions, or we can log an error message. I think crashing would probably be undesirable. Namespaces should help us avoid most name conflicts.
I'm mildly in favor of crashing. Reasoning about failures in the presence of silent overrides is tricky. Note also that Python overriding is lexical (unless you monkey-patch which is usually frowned upon), whereas here you could accidentally override an op by importing a third-party module.
Since "implements" is an indication for downstream tools to replace the current subgraph with a more optimized implementation, conflicts should not be a problem and should not involve any override. Having multiple functions provide different implementations of the same algorithm seems fine to me.
The problem is more if multiple downstream tools are looking for the same string to replace with different implementations, but I'm not sure how you could catch that here.
> ```
> @function.library_function(
>     implements="google.matmul_low_rank_matrix",
> ```
For any composite ops moved from contrib would you track down the original developers or will everything in core fall under google namespace?
> If so, we might also consider having another attribute 'aliases' to allow a
> history of names for backward compatibility.
>
> 2. If we go org/name, do we need a centralized registry of accepted "org" names?
I would vote no. IMO this adds bureaucracy that can probably be handled with an error that is raised when over-writing an existing namespace.function
> future collisions with function names. One option is to adopt java-style
> "org.opname", basically using the developing organization as a disambiguating
> factor. Another alternative is to use semantic namespaces, e.g. `linalg`. A
> third is to use org.semantic.opname.
`org.semantic.opname` for future-proofing with extra disambiguation would be my vote.
> 6) *Simpler automatic conversion* As tensorflow moves forward it is conceivable
> that its computation description language (today: graph def, tomorrow: MLIR)
> will be fully separated from the underlying tensorflow kernel implementations.
> This kind of standard attribute allows for more automatic conversions.
Just an observation: overall, the approach makes sense to me and is definitely progress from where we are today. Relying on user-supplied names is kind of brittle (as opposed to a strongly typed or well-defined schema), but at the same time is also very flexible.
LGTM overall. Would like to make sure the design review covers name collisions and the possibility of aligning definitions with MLIR.
> For example:
>
> ```
> @function.library_function(
> ```
Where is this symbol in the public API? Is `@tf.library_function` a new symbol, or are you proposing adding `reference` and `implements` to `@tf.function`?
The thought was to differentiate user functions (something that is very unlikely to be of interest downstream) from library functions; however, it might not be needed, and the two could be differentiated by the attributes.
> Arguments:
>
> * `implements`: we propose using namespaces with the top-level namespace …
> * `reference`: The goal for this attribute is to provide a stable reference for the function
> implementation details. This could be a reference to a non-tensorflow library (e.g. numpy.linalg.special_matmul)
> or an arxiv reference. Downstream tooling can use this value as an opaque cache key to verify that
> it implements the right flavor of the function.
The impact of this field on downstream tools is blurry to me. The "implements" field should be enough to replace a subgraph, and this field seems redundant to me right now.
(I can see it being useful for documentation purpose though)
(this paragraph is obsolete, the field is dropped right?)
>     def f(..):
>         pass
>
> [0] https://github.com/tensorflow/mlir/blob/master/g3doc/OpDefinitions.md#operation-definition
> [1] https://github.com/tensorflow/mlir/blob/master/g3doc/OpDefinitions.md#operation-name
My understanding is that the "name" could match the MLIR operation name used by the downstream tool, so "google.mm_low_rank_matrix" should be enough. Since there should be a registration in MLIR for the operation itself, you shouldn't need to replicate the summary and description here, though?
So there's good news and bad news. 👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there. 😕 The bad news is that it appears one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.
Talking to @alextp last week, I was wondering about the interaction with gradients here: the annotated function would be mutated, and the downstream tool would likely not get what it expects when looking at the function.
@joker-eph
I think taking the gradient of a function requires changing the function to escape temporary values during the forward pass and pass them to the gradient function. It seems to me that a downstream transformation that wants to use a more efficient implementation for such a function would need to match the pair of the original function and the gradient function, and replace both together to be correct.
Let me see if I understood. The actual gradient dL/dx of y = F(x), when viewed as dL/dx = G(x, y, dL/dy), needs to stay the same no matter the implementation. However, when implemented as a symbolic gradient, this function actually has dependencies on all intermediate tensors inside F of the composite op. There isn't much we can do about that, and I don't think we need to expose it. If downstream tooling wants to provide a gradient, it would either need to implement its own using the exposed arguments x, y, dL/dy (note it can pull whatever intermediate state it wants from its own representation), or recompute symbolically (and hence effectively not be able to use the efficient implementation when training), or alternatively mark all intermediate tensors as recomputable and compute them from x. To summarize: I think if downstream provides its own implementation of the forward pass, it won't be able to use it for training automatically, because gradients require the composite implementation. That's ok. We might want to support optional gradient functions (I think tf.function already supports this) that downstream tooling can implement if it so chooses. I'll add this to the discussion.
Right, replacing a function with the gradient computation seems to me like a key use-case that we will want to support. Not having a solution for this makes this proposal much less attractive.
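The constraint argued in this thread (a tool may only swap in a fast forward implementation for training if it also supplies a matching gradient; otherwise it must fall back to the composite definition) can be sketched with a toy selection routine. Everything here is hypothetical: the registry, the `select_impl` policy, and the op names are invented for illustration.

```python
# Hypothetical sketch: a tool replaces a composite op only when it can
# supply BOTH a forward and a matching gradient implementation; otherwise
# it falls back to the original composite pair for training.

def composite_forward(x):      # reference forward pass, y = F(x)
    return x * x

def composite_grad(x, y, dy):  # reference gradient, dL/dx = G(x, y, dL/dy)
    return 2.0 * x * dy

# Vendor registry: implements-name -> (forward_impl, grad_impl or None).
VENDOR_IMPLS = {
    "example.square": (lambda x: x * x, None),  # fast forward, no gradient
}

def select_impl(name, training, fallback):
    """Pick the vendor pair if usable, else the composite fallback pair."""
    fwd, grad = VENDOR_IMPLS.get(name, (None, None))
    if fwd is None or (training and grad is None):
        return fallback  # training needs the composite implementation
    return (fwd, grad)

fwd, grad = select_impl("example.square", training=True,
                        fallback=(composite_forward, composite_grad))
```

Under this policy the vendor's forward-only kernel is used for inference, while training silently keeps the composite pair, which matches the summary in the comment above.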
> ```
> @function.library_function(
>     implements="google.matmul_low_rank_matrix",
> ```
Do we plan to use composite ops for more complex structures (e.g. a Keras Layer)? It's helpful for downstream tools to identify complex structures without pattern matching.
In this case, using a single implements name might be insufficient. For example, a Keras RNN layer has many parameters (e.g. `return_sequence`, `go_backwards`) that can change the implementation. IMHO there are a few potential solutions:
- Encode the parameters in the implementation name, e.g. `implements="keras.layers.RNN,go_backwards=False,return_sequence=True"`
- Store parameters in separate attributes

Any thoughts around this?
An even simpler and very real example is the Conv2D layer. A Keras Conv2D or TFLite Conv2D layer is composed of TensorFlow Conv2D -> Add -> Activation (e.g. Relu, Relu6, etc.).
Ideally we need a way to know which activation function to use without looking into the function body: either make the implementation name something like "Conv2D,activation=Relu6", or design something smarter.
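The two options raised in this thread can be contrasted with a small sketch. Both the parser and the attribute layout below are hypothetical, invented only to show the trade-off between name-encoded parameters and separate attributes (the approach the review notes later favor).

```python
# Option 1 (hypothetical): encode parameters into the `implements` string
# and parse them back out on the tooling side. Note that parameter types
# are lost: everything comes back as a string.
def parse_implements(value):
    name, *params = value.split(",")
    attrs = dict(p.split("=", 1) for p in params)
    return name, attrs

name, attrs = parse_implements(
    "keras.layers.RNN,go_backwards=False,return_sequence=True")

# Option 2 (hypothetical): keep the name clean and carry parameters as
# separate named attributes, preserving their types.
annotation = {
    "implements": "keras.layers.RNN",
    "attrs": {"go_backwards": False, "return_sequence": True},
}
```

The type-loss in Option 1 (`"False"` as a string rather than a boolean) is one concrete argument for separate attributes.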
CLAs look good, thanks!
Notes from design meeting: Standardizing Composite Ops
@marksandler2 can you copy the notes from the design meeting directly into the comments here? That doc is not publicly viewable. Alternatively, grant me (ewj@google) access to the doc and I can do it. LMK. Thanks!
Notes from review meeting. This list reflects questions raised and the consensus reached during the Tensorflow review process.
Consensus was that this decision will be deferred to a later date: "we can cross that bridge when we get there".
Q: What will prevent users from placing files in old locations? Consensus: there is no incentive to do this intentionally.
Q: What does the reference add over the canonical org name? Pros: it can be machine-readable as a hash, and is better than having nothing. Cons: it cannot be guaranteed.
Q: Should this be Python semantics or Java semantics? Should we crash, or should we allow users to provide a different implementation?

The consensus appears to be that no-crash is the most natural outcome, despite initial misgivings. The argument that tilted in this direction is that the definition that will be invoked is actually well defined by Python code semantics (e.g. user code would call, say, python_lib.something.my_conv_op, which declares myai.my_conv_op), so user intent is clear. What the downstream tooling will do is up to the downstream tooling, as long as it follows the contract. If there are two implementations available and different parts of the code call different ones, we might end up with two definitions in the same function, but every invocation is still well defined in the graph itself, and thus this is preferable.

A: @alextp: This sounds like a poor choice to me, because it seems like the whole idea is to provide a single source of truth without hardcoding it into TensorFlow. If that's what we're doing, then we shouldn't support overriding.

Response: We don't see this proposal as being a single source of truth; it should be more like hints. If you try to impose a single source of truth, you need a centralized registry, and we don't have one of those right now. It's safe to replace the body of this function with a new implementation, but there are no guarantees. The most natural implementation of this does not crash; if you crash, you're adding complexity that would not be there otherwise. This is a step up in the sense that we define the interface for the operations. For now there is no way to build downstream tooling external to Google; override would allow you to do it without tooling. We should have some standardized way of doing graph rewrites within TensorFlow.
TensorFlow has a mechanism for registering operations and kernels, and it would be great to transfer this ownership to the user; we would like an additional mechanism for registering kernels. @alextp: This can come later. A function could bind either to a function or to a kernel (lightweight). We should make kernels and user-defined functions equal citizens, and that means you need a priority. It becomes target-specific once you pick a deployment strategy; by forcing this, you lose debuggability in eager mode. Q: When should we make the selection? Defer this decision to another proposal. **We agree on no-crash, and the annotation should not be target-specific.**
Q: I wonder if we should consider more metadata here, for example MLIR op definitions within a dialect? A: No, because MLIR dialects are not well aligned with the semantic dialects that we are considering. MLIR dialects follow framework dialects (e.g. TPU, or tensorflow, etc.) instead of "this is a group of related ops". Consensus: Maybe "dialect" is not a great match, because it describes a tool rather than a framework. We're not going to do alignment, because it's not safe.
Agnostic. Use this as an argument to tf.function.
Yes, eventually, but it seems not critical to have from the get-go. There should be a specialization mechanism, as the user might not be aware that the function is being called. We need a way for vendors to be able to specify a custom kernel. The current implementation allows downstream to substitute a custom version, but can't substitute a composite op implementation.
Consensus: named attributes (attrs).
From @joker-eph: "Right, replacing a function with the gradient computation seems to me like a key use-case that we will want to support. Not having a solution for this makes this proposal much less attractive." I think we ran out of time on this one, but the proposal is that this will probably be left unspecified in this iteration. The potential path forward is to automatically wrap the gradient into its own function that downstream tooling can identify and replace with its own implementation. If the tooling needs to support both forward and backward paths, it seems that to benefit from this the tooling would need to provide both implementations (and do internal caching), or simply rely on the default implementation. Recommendation: @alextp to walk through an example using softmax cross entropy with logits. What would the graph look like with fused implementations? Make sure that differentiation works in a way that doesn't break it. Look for a function with 1 output vs. 2. Naive implementation: have a hook, but that would mean that you wouldn't have a link into the Tensor; you would have to reimplement softmax, which would be a performance bottleneck. We need to address this with the simplest training example.
The consensus is that we will eventually make tf.function fast enough that it won't be an issue, and given that this is likely to have a very gradual roll out we will have time to adapt.
Hi @marksandler2, a couple of nits before I can merge this.
Thank you!
> * Discussion of what should live in "core" tensorflow (e.g. `tf.xxx.my_cool_op`)
> * Ultra-complex functions (e.g. trained models) that are unlikely to get
> specialized implementations in hardware.
> * Immediate support for replacing gradients of composite ops.
Suggested change:
- * Immediate support for replacing gradients of composite ops.
+ * Support for uses of the annotation during training; taking gradients eliminates the annotation.
Reworded to avoid mention of annotations since we haven't introduced them yet:
"Support for processing gradients (and forward inference in the presence of gradients) for composite ops."
> ### Behind the scenes changes
>
> When a new tf.function is created, behind the scenes tensorflow creates up-to 3 functions that are stored in the graph. These functions are:
Is that true? I thought that @tf.function would not do that, and that taking the gradient of a function would create the two extra functions?
Well, function creation is lazy in the first place, so tf.function creates no functions at all; they are created when you actually invoke them and/or take their gradient. So I think the statement is technically correct: note "up-to" :)
@ewilderj Done!
Co-Authored-By: Mehdi Amini <[email protected]>
The work on TFR Composable TensorFlow seems to overlap a bit with this RFC and follows a very similar rationale. Would you mind sharing whether there are plans to unify the approaches in the future, or to replace one with the other?
Review period will close 2019-06-25
Standardizing composite ops in tensorflow to support efficient inference.
Objective
The objective is to create a simple API that enables adding new composite ops in
a way that they can be automatically and robustly processed by downstream
inference tooling. The developers (of the composite op) should be able to
iterate on their implementation, or even replace it with a standard tensorflow
op, without breaking any existing tools.
Why? Composite ops often provide building blocks for complex models.
However, and this is especially true for embedded and specialized hardware,
these ops, when implemented naively, become unusable due to architectural
(hardware) details. It is thus preferable for downstream tools to be able to
extract such composite ops and, for instance, provide a specialized implementation.
The current proposal concentrates on supporting inference-only transformations
and optimizations. However, we leave the door open for follow-up gradient
optimization support. See the appendix for a few possible ways forward for gradients.
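The inference-only workflow the objective describes (downstream tooling extracting annotated composite ops and substituting specialized implementations) can be illustrated with a toy graph rewrite. The graph encoding, op names, and kernel names below are all invented for illustration; the real mechanism would operate on GraphDef or MLIR.

```python
# Toy model of a graph: a list of nodes, each optionally annotated with
# the composite op it implements. A downstream tool rewrites by the
# `implements` name only, never by pattern-matching the node's body.
graph = [
    {"name": "x", "implements": None},
    {"name": "mm", "implements": "google.matmul_low_rank_matrix"},
]

# Hypothetical vendor table of specialized kernels keyed by implements-name.
SPECIALIZED = {"google.matmul_low_rank_matrix": "fused_low_rank_matmul_kernel"}

def rewrite(graph):
    """Return a copy of the graph with known composites swapped for kernels."""
    out = []
    for node in graph:
        kernel = SPECIALIZED.get(node["implements"])
        if kernel is not None:
            node = dict(node, kernel=kernel)  # attach the specialized kernel
        out.append(node)
    return out

rewritten = rewrite(graph)
```

Because the rewrite keys only on the annotation, the composite op's author can freely change the function body without breaking the tool, which is exactly the robustness property the objective asks for.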