Crate caffe2_c10


Modules

Macros

Structs

  • | autograd_dispatch_keyset should include all | runtime autograd keys. | | Alias key DispatchKey::Autograd maps to | autograd_dispatch_keyset. | | NB: keys in this set also get associated with | CompositeImplicitAutograd |
  • | Given a sequence of allocations in a | thread, AllocationPlan records | | - 1. size of each allocation | | - 2. Lifetime of each allocation. | | - 3. allocation offsets: Memory offset | for each allocation in a single blob | of memory | | - 4. Total size of a blob of memory required | to satisfy all the allocations. |
  • | Map of memory ptr to allocation id. This | is auxiliary information only used | to establish lifetime of allocations. |
  • | Note [raw_allocate/raw_deallocate and Thrust] | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | Thrust’s support for custom allocators requires | us to write something like this: | | class ThrustAllocator { | char* allocate(size_t); | void deallocate(char*, size_t); | }; | | This is not good for our unique_ptr based | allocator interface, as there is no way to get | to the context when we free. | | However, in some cases the context is exactly | the same as the data pointer. In this case, we | can support the “raw” allocate and deallocate | interface. This is what raw_deleter signifies. | By default, it returns a nullptr, which means | that the raw interface is not implemented. Be | sure to implement it whenever possible, or the | raw interface will be incorrectly reported as | unsupported when it is actually possible.
  • | A RAII, thread local (!) guard that enables | or disables grad mode upon construction, | and sets it back to the original value | upon destruction. |
  • | This is a simple bitset class with sizeof(long | long int) bits. | | You can set bits, unset bits, query bits | by index, and query for the first set | bit. | | Before using this class, please also | take a look at bitset, which has more | functionality and is more generic. | It is probably a better fit for your use | case. The sole reason for utils::bitset | to exist is that bitset misses a find_first_set() | method. |
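As a rough illustration of the behavior described above (a minimal C++ sketch, not the crate's actual API; the type and method names here are made up), the whole class boils down to bit operations on a single 64-bit word:

```cpp
#include <cstdint>

// Illustrative sketch only: a fixed-width bitset over one 64-bit word,
// including the find_first_set() operation that std::bitset lacks.
struct TinyBitset {
  std::uint64_t bits = 0;

  void set(unsigned i) { bits |= (std::uint64_t{1} << i); }
  void unset(unsigned i) { bits &= ~(std::uint64_t{1} << i); }
  bool is_set(unsigned i) const { return (bits >> i) & 1; }

  // 1-based index of the lowest set bit, or 0 if no bit is set
  // (mirroring the semantics of ffsll / __builtin_ffsll).
  int find_first_set() const {
    return __builtin_ffsll(static_cast<long long>(bits));
  }
};
```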
  • | The primary ATen error class. | | Provides a complete error message with source | location information via what(), and a more | concise message via | what_without_backtrace(). | | Don’t throw this directly; use | TORCH_CHECK/TORCH_INTERNAL_ASSERT instead. | | NB: C10ErrorData is handled specially by the default | torch to suppress the backtrace, see | torch/csrc/Exceptions.h |
  • | A backend-generic movable, not copyable, | not thread-safe event. | | The design of this event follows that | of Cuda and HIP events. These events | are recorded and waited on by streams | and can be rerecorded to, each rerecording | essentially creating a new version | of the event. | | For example, if (in CPU time), stream | X is asked to record E, stream Y waits | on E, and stream X is asked to record E | again, then Y will wait for X to finish | the first call to record and not the second, | because it’s waiting on the first version | of event E, not the second. | | Querying an event only returns the status | of its most recent version. | | Backend-generic events are implemented | by this class and | | InlineEvent. In addition to these events | there are also some backend-specific | events, like ATen’s CudaEvent. Each | of these classes has its own use. | | InlineEvent<…> or a backend-specific | event should be preferred when the backend | is known at compile time and known to | be compiled. Backend-specific events | may have additional functionality. | | This C10Event should be used if a particular | backend may not be available, or the | backend required is not known at compile | time. | | These generic events are built on top | of DeviceGuardImpls, analogous to | DeviceGuard and InlineDeviceGuard. | The name “DeviceGuardImpls,” is no | longer entirely accurate, as these | classes implement the backend-specific | logic for a generic backend interface. | | See DeviceGuardImplInterface.h for | a list of all supported flags. |
  • | PyTorch ddp usage logging capabilities | DDPLoggingData holds data that can be logged in | applications for analysis and debugging. Data | structure is defined in c10 directory so that | it can be easily imported by both c10 and torch | files. |
  • | DataPtr is a unique pointer (with an attached | deleter and some context for the deleter) to | some memory, which also records what device is | for its data. | | nullptr DataPtrs can still have a nontrivial | device; this allows us to treat zero-size | allocations uniformly with non-zero | allocations. |
  • | DebugInfoGuard is used to set debug | information, ThreadLocalDebugInfo is | semantically immutable, the values are set | through the scope-based guard object. | | Nested DebugInfoGuard adds/overrides existing | values in the scope, restoring the original | values after exiting the scope. | | Users can access the values through the | ThreadLocalDebugInfo::get() call; |
  • | QNNPACK AND XNNPACK may out-of-bound access the | input and / or output tensors. This is | by-design, and chosen to make the | implementation of micro-kernels both simpler | and faster as a result of not having to | individually handle the corner cases where the | number of processed elements is not a multiple | of SIMD register width. | | This behavior will trigger ASAN though, and may | result in a segfault if the accessed memory | location just so happens to fall on a page the | current process has no read access to. Here we | define a custom allocator that allocates the | extra storage required to keep this behavior | safe. | | This allocator could have been restricted to | QNNPACK and XNNPACK only, but that would have | negative performance ramifications, as input | tensors must now be reallocated, and copied | over, if the tensor is not allocated with this | allocator to begin with. | | Making this allocator the default on mobile | builds minimizes the probability of unnecessary | reallocations and copies, and also enables | acceleration of operations where the output | tensor is allocated outside of the function | doing the implementation, wherein the | implementation cannot simply re-allocate the | output with the guarding allocator. | | PreGuardBytes: Number of guard bytes to | allocate before the allocation. | | PostGuardBytes: Number of guard bytes to | allocate after the allocation.
  • | Like TensorOptions, but all fields | are guaranteed to be filled. |
  • | Represents a compute device on which | a tensor is located. | | A device is uniquely identified by a type, | which specifies the type of machine it is | (e.g. CPU or Cuda GPU), and a device index or | ordinal, which identifies the specific compute | device when there is more than one of | a certain type. | | The device index is optional, and in its | defaulted state represents (abstractly) “the | current device”. | | Further, there are two constraints on the | value of the device index, if one is | explicitly stored: | | 1. A negative index represents the current | device, a non-negative index represents | a specific, concrete device, | | 2. When the device type is CPU, the device | index must be zero. |
  • | RAII guard that sets a certain default device | in its constructor, and changes it back to the | device that was originally active upon | destruction. | | The device is always reset to the one that was | active at the time of construction of the | guard. Even if you set_device after | construction, the destructor will still reset | the device to the one that was active at | construction time. | | This device guard does NOT have an | uninitialized state; it is guaranteed to reset | a device on exit. If you are in a situation | where you might want to setup a guard (i.e., | are looking for the moral equivalent of | optional), see | OptionalDeviceGuard.
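A minimal sketch of the RAII pattern described above (illustrative C++ only; get_device/set_device stand in for whatever backend calls the real guard makes):

```cpp
// Remember the device active at construction, switch to the requested one,
// and unconditionally restore the original on destruction.
int get_device();            // assumed backend query
void set_device(int device); // assumed backend switch

class ScopedDeviceGuard {
 public:
  explicit ScopedDeviceGuard(int new_device) : original_(get_device()) {
    set_device(new_device);
  }
  ~ScopedDeviceGuard() {
    // Always restore the construction-time device, even if set_device()
    // was called again while the guard was alive.
    set_device(original_);
  }
  ScopedDeviceGuard(const ScopedDeviceGuard&) = delete;
  ScopedDeviceGuard& operator=(const ScopedDeviceGuard&) = delete;

 private:
  int original_;
};
```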
  • | I can’t conveniently use c10/util/Registry.h | for the following reason: c10/util/Registry.h | gives me a slow way of Create’ing an object of | some interface from the registry, but no way of | quickly accessing an already created object. | | I’ll be banging on getDeviceGuardImpl every | time we do a DeviceGuard, so I really don’t | want to be doing an unordered_map | lookup. Better if the registration mechanism | directly drops its implementation into | device_guard_impl_registry.
  • | A representation of a set of DispatchKeys. | A tensor may have multiple tensor type ids, | e.g., a Variable tensor can also be a CPU | tensor; | | the DispatchKeySet specifies what type ids | apply. The internal representation is as | a 64-bit bit set (this means only 64 tensor | type ids are supported). | | Note that DispatchKeys are ordered; thus, we | can ask questions like “what is the highest | priority DispatchKey in the set”? (The set | itself is not ordered; two sets with the same | ids will always have the ids ordered in the | same way.) | | At the moment, there are no nontrivial uses of | this set; tensors are always singletons. In | the near future, this set will represent | variable? + tensor type id. In the far future, | it will be requires grad? + profiling? | + tracing? + lazy? + tensor type id. | | (The difference between variable and requires | grad, is that there are currently three states | a tensor can be: | | 1. Not a variable | 2. Variable with requires_grad=False | 3. Variable with requires_grad=True | | Eventually, we want to kill state (1), and only | dispatch to autograd handling code if one of | the inputs requires grad.) | | An undefined tensor is one with an empty tensor | type set. |
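A hedged sketch of the 64-bit representation described above (the key numbering and names here are illustrative, not the real DispatchKey enum):

```cpp
#include <cstdint>

enum class Key : std::uint8_t { Undefined = 0, CPU, CUDA, Autograd /* ... */ };

// One bit per key; "highest priority" is simply the most significant set bit.
struct KeySet {
  std::uint64_t repr = 0;

  KeySet add(Key k) const {
    return KeySet{repr | (std::uint64_t{1} << static_cast<unsigned>(k))};
  }
  bool has(Key k) const {
    return (repr >> static_cast<unsigned>(k)) & 1;
  }
  Key highest_priority() const {
    if (repr == 0) return Key::Undefined;
    return static_cast<Key>(63 - __builtin_clzll(repr));
  }
};
```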
  • | Used in ATen for non finite indices. These | turn into ExitException when they cross to | Python. |
  • | A fake implementation of | DeviceGuardImplInterface suitable | for testing. | | The current device is modeled as a mutable | field in the guard implementation class. | | See DeviceGuard_test.cpp for an example | use. |
  • | RAII API for manipulating the thread-local | dispatch state. |
  • | Used in ATen for out-of-bound indices that can | reasonably only be detected lazily inside | a kernel (See: advanced indexing). These turn | into IndexError when they cross to Python. |
  • | This context is used to generate DataPtr which | have arbitrary function deleters associated | with them. | | In some user facing functions, we give | a (user-friendly) interface for constructing | tensors from external data which take an | arbitrary function deleter. | | Grep for InefficientStdFunctionContext to find | these occurrences. | | This context is inefficient because we have to | do a dynamic allocation | InefficientStdFunctionContext, on top of the | dynamic allocation which is implied by function | itself. |
  • | A RAII, thread local (!) guard that enables or | disables inference mode upon construction, and | sets it back to the original value upon | destruction. |
  • | A DeviceGuard is an RAII class that sets | a device to some value on construction, and | resets the device to its original value on | destruction. | | InlineDeviceGuard is a helper class for | implementing DeviceGuards. | | It is templated over a DeviceGuardImpl | (anything that implements | DeviceGuardImplInterface). There are two | primary ways to instantiate InlineDeviceGuard: | | - With a concrete implementation of | DeviceGuardImpl, e.g., CUDAGuardImpl. | | This is the best way to use | InlineDeviceGuard, as all calls are | devirtualized, giving you code as efficient | as straight line calls to | cudaGetDevice/cudaSetDevice. | | - With VirtualGuardImpl, which does a virtual | dispatch to a DeviceGuardImpl retrieved from | a DeviceType registry. We have explicitly | instantiated InlineDeviceGuard this way as | DeviceGuard. | | If you are in a hurry, you can use | InlineDeviceGuard directly: | | using CUDAGuard = InlineDeviceGuard; | | However, you can provide a better user | experience if you explicitly write a wrapper | class that itself contains the template | instantiation: | | class CUDAGuard { |
    | // … the API … |
    | InlineDeviceGuard guard_; | } | | The wrapper class provides a good place to | write documentation, and helps avoid weird | template instantiation errors when a user | incorrectly uses the class. | | If you need to test this class, consider | instantiating it with FakeGuardImpl. | | Note [Omitted default constructor from RAII] | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | In principle, we could add a default | constructor to DeviceGuard which reads the | current device and promises to restore to | that device on exit. However, most cases | where you would have written this, you | probably meant to actually just use | OptionalDeviceGuard (since you don’t actually | need the restore to happen if you don’t ever | actually set the device). | | We remove the constructor here to encourage | you to think about what you actually want to | happen. |
  • | Copy is disallowed | | Move is disallowed, as StreamGuard does not | have an uninitialized state, which is | required for moves on types with nontrivial | destructors. |
  • | An OptionalDeviceGuard is an RAII class | that sets a device to some value on initialization, | and resets the device to its original | value on destruction. | | InlineOptionalDeviceGuard is a helper | class for implementing | | OptionalDeviceGuards. See guidance | in InlineDeviceGuard on how to use this. | See OptionalDeviceGuard for user-oriented | usage notes. | | Note [Explicit initialization of optional fields] | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | Explicit initialization of optional fields is | required to work around an nvcc bug; see | https://github.com/pytorch/pytorch/issues/12117
  • | An OptionalStreamGuard is an RAII class | that sets a device to some value on initialization, | and resets the device to its original | value on destruction. | | See InlineOptionalDeviceGuard for | more guidance on how to use this class. |
  • | A StreamGuard is an RAII class that changes | the current device to the device corresponding | to some stream, and changes the default | stream on that device to be this stream. | | InlineStreamGuard is a helper class | for implementing StreamGuards. | | See InlineDeviceGuard for guidance | on how to use this class. | | Copy is disallowed | | Move is disallowed, as StreamGuard does not | have an uninitialized state, which is | required for moves on types with nontrivial | destructors. |
  • | This class is used to explicitly ignore values | in the conditional logging macros. | | This avoids compiler warnings like “value | computed is not used” and “statement has no | effect”. |
  • | A MultiStreamGuard is an RAII class | that sets the current streams of a set | of devices all at once, and resets them | to their original values on destruction. |
  • | A RAII, thread local (!) guard that stops | future operations from building gradients. |
  • | A no-op device guard impl that doesn’t do | anything interesting. Useful for devices that | don’t actually have a concept of device index. | Prominent examples are CPU and Meta. |
  • | Used in ATen for functionality that is not | implemented. These turn into | NotImplementedError when they cross to Python. |
  • | Used in Onnxifi backend lowering. These | turn into ExitException when they cross to Python. |
  • | An OptionalDeviceGuard is an RAII class that | sets a device to some value on initialization, | and resets the device to its original value on | destruction. | | Morally, an OptionalDeviceGuard is equivalent to | optional<DeviceGuard>, but with extra | constructors and methods as appropriate. | | Besides its obvious use (optionally applying | a DeviceGuard), OptionalDeviceGuard is often | also used for the following idiom: | | OptionalDeviceGuard g; | for (const auto& t : tensors) { | g.set_device(t.device()); | do_something_with(t); | } | | This usage is marginally more efficient than | constructing a DeviceGuard every iteration of | the for loop, as it avoids an unnecessary | device reset. | | Unlike DeviceGuard, an OptionalDeviceGuard may | be uninitialized. This occurs when you use the | nullary constructor, or pass a nullopt to the | constructor. | | Uninitialized OptionalDeviceGuards do | nothing; they do not know what the original | device was and they do not reset on | destruction. This is why original_device() and | current_device() return optional<Device> rather | than Device (as they do in DeviceGuard), and | also is why we didn’t just provide | OptionalDeviceGuard by default and hide | DeviceGuard from users. | | The semantics of an OptionalDeviceGuard are | exactly explained by thinking of it as an | optional<DeviceGuard>. In particular, an | initialized OptionalDeviceGuard doesn’t restore | device to its value at construction; it | restores device to its value at | initialization. So if you have the program: | | setDevice(1); | OptionalDeviceGuard g; | setDevice(2); | g.reset_device(Device(DeviceType::CUDA, 3)); // initializes! | | On destruction, g will reset device to 2, | rather than 1. | | An uninitialized OptionalDeviceGuard is | distinct from an (initialized) DeviceGuard whose | original_device_ and current_device_ match, | since the DeviceGuard will still reset the | device to original_device_.
  • | An OptionalStreamGuard is an RAII class | that sets a device to some value on initialization, | and resets the device to its original | value on destruction. | | See OptionalDeviceGuard for more guidance | on how to use this class. |
  • | POD version of LocalDispatchKeySet. Declared | here just so that we can put it in the guards. | | This struct encapsulates special handling for | TLS initialization in set_included()/included() | API so that they reflect the truth. | | If you want to create PODLocalDispatchKeySet | with non-zero state, use set_included() instead | of default constructor. |
  • | A Context that will call extra placement | deleter during deconstruction. | | Accept a already constructed DataPtr | and store it as member during destruction, | we’ll call extra deleter on the underlying | data pointer before the DataPtr is destructed. | data_ptr_ owns the memory. |
  • | A simple struct that is used to report C10’s | memory allocation and deallocation status to | the profiler |
  • | Note [Python interpreter tag] | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | We store a PyObject on TensorImpl so that we | can efficiently translate tensors into the | Python representations. However, in some | situations (torchdeploy) there may be multiple | Python interpreters in a single process and we | must take care not to accidentally mix up | PyObjects with the wrong interpreters. Thus, | we also tag every TensorImpl with the Python | interpreter it corresponds to. | | With torchdeploy, we have these invariants: | | - Any given TensorImpl can be associated with | AT MOST one Python interpreter. | | We represent the interpreter tag as a memory | address to an instance of a virtual class | that is allocated once per interpreter (this | is so that we can request the interpreter to | perform operations for us, if necessary). | | - A given TensorImpl’s interpreter tag can | only go from uninitialized to tagged; once | tagged, this is a quiescent state (once | tagged to an interpreter, ALWAYS tagged to | that interpreter) | | - A thread may mutate the PyObject field of | a TensorImpl if and only if it holds the GIL | for the interpreter tagged on the | TensorImpl. (If the TensorImpl is not | tagged, it must first atomically claim its | tag before it can validly write) | | The PyInterpreter object itself is a class that | contains some function pointers for interacting | with the interpreter. For now this is just for | debugging, but if a C10Tensor can own a PyObject, | the interpreter can be used to free it. | | WARNING: This class has to be written very | carefully, because it may be possible for | a C10Tensor to have a reference an interpreter | corresponding to a shared library that has | ALREADY BEEN UNLOADED. This makes blindly | calling virtual methods very dangerous, because | the vtable may be garbage at that point (on | a good day, you might get “pure virtual method | called”). | | The idea to solve this problem is we always | leak PyInterpreters (so they always stay live | even after dlclose), and disarm the “virtual | methods” by replacing them with function | pointers that just no-op. This can’t be done | with a traditional C++ vtable, so we have to | roll our own. | | NB: The downside with representing | PyInterpreter tags as full objects is that it | takes an extra word on TensorImpl. If tags | were instead just integer indices, on 64-bit | architectures we could pack the tag and | PyObject together into a single atomic word. | On 32-bit architectures we could simply say | that only one Python interpreter is supported | (erroring if a nontrivial interpreter tag is | attempted to be set). | | The difficulty with this scheme is we need to | maintain an out-of-line table to get at the | PyInterpreters so that we can do virtual method | calls on them, and registration/deregistration | to this table must be done in a thread safe | manner. This can be easily done if the number | of possible PyInterpreters is small enough | (e.g., 8-bit integer) by simply preallocating | an array of sufficient size to hold all | possible interpreters. Surely 128 threads is | more than enough for anyone! | | I didn’t decide to do this technique at the | moment, because the extra word added by the | PyInterpreter tag takes us to 24 words, which | means that we still fit inside three eight word | cache lines. If you need to penny pinch | another word consider doing this!
  • | DO NOT call this registerer from a torch | deploy instance! You will clobber other | registrations |
  • | quint4x2 is for un-signed 4 bit quantized | Tensors that are packed to byte boundary. |
  • | ———– | @brief | | A template class that allows one to register | classes by keys. | | The keys are usually a string specifying | the name, but can be anything that can | be used in a map. | | You should most likely not use the Registry | class explicitly, but use the helper | macros below to declare specific registries | as well as registering objects. |
  • | Scalar represents a 0-dimensional | tensor which contains a single element. | | Unlike a tensor, numeric literals (in | C++) are implicitly convertible to | Scalar (which is why, for example, we | provide both add(Tensor) and add(Scalar) | overloads for many operations). | | It may also be used in circumstances where | you statically know a tensor is 0-dim | and single size, but don’t know its type. |
  • | Mostly copied from https://llvm.org/doxygen/ScopeExit_8h_source.html |
  • Represents a location in source code (for debugging).
  • | A storage represents the underlying backing | data buffer for a tensor. | | This concept was inherited from the original | Torch7 codebase; we’d kind of like to get rid | of the concept (see | https://github.com/pytorch/pytorch/issues/14797) | but it’s hard work and no one has gotten around | to doing it. | | NB: storage is supposed to uniquely own a data | pointer; e.g., two non-null data pointers alias | if and only if they are from the same storage. | | Technically you can violate this invariant | (e.g., you can create a non-owning StorageImpl | with from_blob) but a lot of things won’t work | correctly, including: | | - An ordinary deleter on such a storage is | wrong, because normal deleters assume unique | ownership, but if you have two storages at | the same data, that implies there is some | sort of shared ownership. So your deleter | would have to actually be internally doing | some sort of refcount thing | | - Deepcopy in Python side relies on storage | equality and not data pointer equality; so if | there are two separate storages pointing to | the same data, the data will actually get | duplicated in that case (one data ptr before, | two data ptrs after) | | - Version counts won’t work correctly, because | we do all VC tracking at the level of | storages (unless you explicitly disconnect | the VC with detach); mutation because data | pointers are the same are totally untracked |
  • | A stream is a software mechanism used | to synchronize launched kernels without | requiring explicit synchronizations | between kernels. | | The basic model is that every kernel | launch is associated with a stream: | every kernel on the same stream is implicitly | synchronized so that if I launch kernels | A and B on the same stream, A is guaranteed | to finish before B launches. If I want | B to run concurrently with A, I must schedule | it on a different stream. | | The Stream class is a backend agnostic | value class representing a stream which | I may schedule a kernel on. | | Every stream is associated with a device, | which is recorded in stream, which is | used to avoid confusion about which | device a stream refers to. | | Streams are explicitly thread-safe, | in the sense that it is OK to pass a Stream | from one thread to another, and kernels | queued from two different threads will | still get serialized appropriately. | | (Of course, the time when the kernels | get queued is undetermined unless you | synchronize host side ;) | | Stream does NOT have a default constructor. | | Streams are for expert users; if you | want to use Streams, we’re going to assume | you know how to deal with C++ template | error messages if you try to resize() | a vector of Streams. | | Known instances of streams in backends: | | - cudaStream_t (Cuda) | | - hipStream_t (HIP) | | - cl_command_queue (OpenCL) (NB: Caffe2’s | existing OpenCL integration does NOT | support command queues.) | | Because this class is device agnostic, | it cannot provide backend-specific | functionality (e.g., get the cudaStream_t | of a Cuda stream.) | | There are wrapper classes which provide | this functionality, e.g., CudaStream. |
  • | A StreamGuard is an RAII class that changes | the current device to the device corresponding | to some stream, and changes the default | stream on that device to be this stream. | | Use of StreamGuard is HIGHLY discouraged | in operator definitions. In a single | operator, you probably don’t know enough | about the global state of the world to | profitably decide how to set streams. | Let the caller handle this appropriately, | and just use the current stream in your | operator code. | | This StreamGuard does NOT have an uninitialized | state; it is guaranteed to reset the | stream and device on exit. If you are | in a situation where you might want | to setup a stream guard, see OptionalStreamGuard. | | Copy is disallowed | | Move is disallowed, as StreamGuard does not | have an uninitialized state, which is | required for moves on types with nontrivial | destructors.
  • | The low-level representation of a tensor, which | contains a pointer to a storage (which contains | the actual data) and metadata (e.g., sizes and | strides) describing this particular view of the | data as a tensor. | | Some basic characteristics about our in-memory | representation of tensors: | | - It contains a pointer to a storage struct | (Storage/StorageImpl) which contains the | pointer to the actual data and records the | data type and device of the view. This | allows multiple tensors to alias the same | underlying data, which allows to efficiently | implement differing views on a tensor. | | - The tensor struct itself records | view-specific metadata about the tensor, | e.g., sizes, strides and offset into | storage. Each view of a storage can have | a different size or offset. | | - This class is intrusively refcounted. It is | refcounted so that we can support prompt | deallocation of large tensors; it is | intrusively refcounted so that we can still | perform reference counted operations on raw | pointers, which is often more convenient | when passing tensors across language | boundaries. | | - For backwards-compatibility reasons, a tensor | may be in an uninitialized state. A tensor | may be uninitialized in the following two | ways: | | - A tensor may be DTYPE UNINITIALIZED. | A tensor of this form has an | uninitialized dtype. This situation | most frequently arises when a user | writes C10Tensor x(CPU). The dtype and is | subsequently initialized when | mutable_data() is | invoked for the first time. | | - A tensor may be STORAGE UNINITIALIZED. | A tensor of this form has non-zero size, | but has a storage with a null data | pointer. This situation most frequently | arises when a user calls Resize() or | FreeMemory(). This is because Caffe2 | historically does lazy allocation: allocation of data | doesn’t occur until mutable_data() is | invoked. A tensor with zero size is | always storage initialized, because no | allocation is necessary in this case. | | All combinations of these two uninitialized | states are possible. | | Consider the following transcript in | idiomatic Caffe2 API: | | // x is storage-initialized, dtype-UNINITIALIZED | C10Tensor x(CPU); | | // x is storage-UNINITIALIZED, dtype-UNINITIALIZED | x.Resize(4); | | // x is storage-initialized, dtype-initialized | x.mutable_data(); | | // x is storage-UNINITIALIZED, dtype-initialized. | x.FreeMemory(); | | All other fields on tensor are always | initialized. In particular, size is always | valid. (Historically, a tensor declared as | C10Tensor x(CPU) also had uninitialized size, | encoded as numel == -1, but we have now | decided to default to zero size, resulting | in numel == 0). | | Uninitialized storages MUST be uniquely | owned, to keep our model simple. Thus, we | will reject operations which could cause an | uninitialized storage to become shared (or | a shared storage to become uninitialized, | e.g., from FreeMemory). | | In practice, tensors which are | storage-UNINITIALIZED and | dtype-UNINITIALIZED are extremely | ephemeral: essentially, after you do | a Resize(), you basically always call | mutable_data() immediately afterwards. Most | functions are not designed to work if given | a storage-UNINITIALIZED, dtype-UNINITIALIZED | tensor. | | We intend to eliminate all uninitialized | states, so that every tensor is fully | initialized in all fields. Please do not | write new code that depends on these | uninitialized states.
  • | A class to encapsulate construction axes of a | Tensor. TensorOptions was designed to support | the Python style API for specifying | construction options on factory functions, | e.g., | | torch.zeros(2, 3, dtype=torch.int32) | | Because C++ doesn’t natively support keyword | arguments, there must be another way of | specifying keyword-like arguments. | TensorOptions is a builder class which can be | used to construct this “dictionary” of keyword | arguments: functions which support | TensorOptions conventionally take this | argument optionally as their last argument. | | WARNING: In PyTorch, there are torch:: | variants of factory functions, e.g., | torch::zeros for at::zeros. These return Variables | (while the stock ATen functions return plain | Tensors). If you mix these functions up, you | WILL BE SAD. | | Rather than use the constructor of this class | directly, you should prefer to use the | constructor functions, and then chain setter | methods on top of them. | | device(kCUDA).dtype(kInt) | dtype(kInt) | | Additionally, anywhere a TensorOptions is | expected, you can directly pass kCUDA / kInt, | and it will implicitly convert to | a TensorOptions. | | Here are some recommended ways to create a 2x2 | tensor of zeros with certain properties. | These all implicitly make use of | TensorOptions, even if they don’t mention the | class explicitly: | | zeros({2,2}, kCUDA); | zeros({2,2}, kLong); | zeros({2,2}, device(kCUDA).dtype(kLong())); | zeros({2,2}, device({kCUDA, 1})); // place on device 1 | zeros({2,2}, requires_grad()); | | | NOTE [ TensorOptions Constructors ] | | TensorOptions is like a dictionary with | entries from the set: {requires_grad, device, | dtype, layout}, where each entry may be | unspecified (i.e., is optional). It is used to | specify the properties of tensors in many | places both in C++ internal and API, e.g., | tensor factory methods like empty({10}, | options), tensor conversions like | tensor.to(...), etc. | | To provide a simple API that is consistent | with Python, where one can do | | torch.empty(sizes, X) with X being | a torch.device, torch.dtype, or a | | torch.layout, we want TensorOptions to be | implicitly convertible from | | ScalarType dtype, Layout layout and | Device device. | | Therefore, we have three implicit constructors | from each of these three types. | | This is sufficient for ScalarType and | Layout as they are simple Enum | classes. However, Device is an ordinary | class with implicit constructors | | Device(DeviceType, DeviceIndex = -1) and | Device(string) to be consistent with Python | API, where strings are treated as equivalent | with a | | torch.device object (e.g., “cuda:1” can be | passed to everywhere a | | torch.device("cuda:1") is accepted). To | support the syntax | | empty({10}, {kCUDA, 1}) and | tensor.to(kCUDA), we need to make sure that | TensorOptions is implicitly constructible | with any arguments that a | | Device can be constructed from. So we have, | | /* implicit */ TensorOptions(T&& device) : TensorOptions() { | this->set_device(device); | } | | template <typename… Args, | typename = enable_if_t<is_constructible<Device, | Args&&…>::value>> | /* implicit */ TensorOptions(Args&&… args) | : TensorOptions(Device(forward<Args>(args)…)) {} | | | But this will be problematic. Consider this: | TensorOptions({kCUDA, 1}). The compiler will | complain about ambiguity between the copy | constructor and the Device constructor | because {kCUDA, 1} can be converted to both | a TensorOptions and a Device.
 | | To get around this, we templatize the Device | constructor. Since overload resolution is done | before template resolution, our problem is | solved.
  • | Thread local debug information is propagated | across the forward (including async fork tasks) | and backward passes and is supposed to be | utilized by the user’s code to pass extra | information from the higher layers (e.g. model | id) down to the lower levels (e.g. to the | operator observers used for debugging, logging, | profiling, etc)
  • | Used in ATen for invalid types. These | turn into TypeError when they cross | to Python. |
  • | A type id is a unique id for a given C++ | type. | | You need to register your types using | CAFFE_KNOWN_TYPE(MyType) to be able | to use TypeIdentifier with custom types. | This is for example used to store the | dtype of tensors. |
  • | This struct holds the actual type | information. There will be one allocated per | type. TypeMeta objects will then point to the | struct instance for the type they’re configured | for.
  • | UniqueVoidPtr is an owning smart pointer like | unique_ptr, but with three major differences: | | 1) It is specialized to void | | 2) It is specialized for a function pointer | deleter void(void* ctx); i.e., the | deleter doesn’t take a reference to the | data, just to a context pointer (erased | as void*). In fact, internally, this | pointer is implemented as having an | owning reference to context, and | a non-owning reference to data; this is | why you release_context(), not release() | (the conventional API for release() | wouldn’t give you enough information to | properly dispose of the object later.) | | 3) The deleter is guaranteed to be called | when the unique pointer is destructed and | the context is non-null; this is | different from unique_ptr where the | deleter is not called if the data pointer | is null. | | Some of the methods have slightly different | types than unique_ptr to reflect this. |
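The ownership model described in (2) and (3) can be sketched roughly like this (illustrative C++, not the actual type; the deleter owns the context, not the data):

```cpp
#include <utility>

using DeleterFn = void (*)(void* ctx);

// Sketch: data_ is a non-owning view, ctx_ is owned and is what the
// deleter receives; the deleter runs whenever ctx_ is non-null.
class OwningVoidPtr {
 public:
  OwningVoidPtr(void* data, void* ctx, DeleterFn deleter)
      : data_(data), ctx_(ctx), deleter_(deleter) {}
  OwningVoidPtr(OwningVoidPtr&& other) noexcept
      : data_(std::exchange(other.data_, nullptr)),
        ctx_(std::exchange(other.ctx_, nullptr)),
        deleter_(other.deleter_) {}
  ~OwningVoidPtr() {
    if (ctx_ && deleter_) deleter_(ctx_);
  }

  void* get() const { return data_; }
  // Analogue of release_context(): hand back the context, which is the
  // thing a caller actually needs in order to free the allocation later.
  void* release_context() { return std::exchange(ctx_, nullptr); }

 private:
  void* data_;
  void* ctx_;
  DeleterFn deleter_;
};
```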
  • | Used in ATen for invalid values. These | turn into ValueError when they cross | to Python. |
  • | NOTE [ Version Counter Sharing ] | | Every C10Tensor has a version counter. Version | counters are incremented whenever the data or | size of a tensor changes through in-place | Variable operations. | | Version counters are used to detect | modifications to saved variables which would | result in incorrect gradient | calculations. Version counters may be shared | between Variables: | | 1. A view shares the version counter of the | base Variable, | | 2. x.detach() shares the version counter of | x, | | 3. Unpacked saved variables share the version | counter of the source. | | Version counters are not shared in these | scenarios: | | 1. When we replace a Variable’s underlying | C10Tensor by calling set_data(...), | | 2. x.data does not share the version counter | of x. (See discussion at | https://github.com/pytorch/pytorch/issues/5396) | | Question: Why do we put the version counter in | TensorImpl instead of AutogradMeta? | | Answer: After the Variable/C10Tensor merge, | a tensor will not have AutogradMeta when its | requires_grad_ is false, but when we use this | tensor in the forward pass of a function that | requires saving this tensor for backward, we | need to keep track of this tensor’s version to | make sure it’s always valid in the autograd | graph. | | To achieve this goal, we put the version | counter in TensorImpl instead of AutogradMeta, | and have it always be available. This allows us | to have the optimization of not carrying | AutogradMeta when a tensor doesn’t require | gradient. | | A hypothetical alternative way to achieve this | goal is to initialize AutogradMeta and create | the version counter for the non-requires-grad | tensor only when it’s saved for | backward. However, since saving a tensor for | backward happens in the forward pass, and our | invariant is that forward pass needs to be | thread-safe, lazy-initializing AutogradMeta | when saving a tensor can introduce race | conditions when we are running the forward pass | in multi-thread scenarios, thus making the | forward pass not thread-safe anymore, which | breaks the invariant.
  • | An implementation of DeviceGuardImplInterface | which delegates to virtual dispatch | on the DeviceGuardImpl registry. |
  • | A RAII guard that sets warn_always (not | thread-local) on construction, and sets it back | to the original value upon destruction. |
  • | Usage: Profile allocations made by one run of | the model. | | AllocationPlan plan; | | { | WithProfileAllocationGuard profile_guard(&plan); | module.forward(…); | } | plan now contains allocation plan.
  • | This is the data type for quantized Tensors. | Right now we only have qint8 which is | for 8 bit Tensors, and qint32 for 32 bit | int Tensors, we might have 4 bit, 2 bit | or 1 bit data types in the future. |
  • | qint32 is for signed 32 bit quantized | Tensors |
  • | quint8 is for unsigned 8 bit quantized | Tensors |

Enums

  • | This legacy enum class defines the set | of backends supported by old school, | code generated Type-based ATen. A “backend” | in this sense roughly corresponds to | the cartesian product of (device type, | layout), but restricted only to combinations | which we actually have kernels for. | Backend does NOT include dtype. | | The reason we are sunsetting this enum | class is because it doesn’t allow for | open registration; e.g., if you want | to add SparseXLA, you’d have to edit | this enum; you wouldn’t be able to do | it out of tree. DispatchKey is the replacement | for Backend which supports open registration. | | NB: The concept of ‘Backend’ here disagrees | with the notion of backend exposed to | users in torch.backends. Backend here | is something like “CPU” or “SparseCUDA”; | backend in torch.backends is something | like “MKL” or “CUDNN”. |
  • | Semantically, a dispatch key identifies | a possible “level” in our dispatch, for which | a handler may be registered. | | Traditional backends like CPU and Cuda get | dispatch keys; however, so do “wrapping” layers | like Variable (for autograd handling). | | In implementation terms, the dispatch key | identifies a specific “bit” in | a DispatchKeySet. Higher bit indexes get | handled by dispatching first (because we “count | leading zeros” when we extract the highest | priority dispatch key.) | | NOTE: Keep the list in sync with DispatchKey | in tools/codegen/model.py |
  • | Flags defining the behavior of events. | | PYTORCH_DEFAULT and BACKEND_DEFAULT | are valid for all backends. The | | BACKEND_DEFAULT is what a particular | backend would select if no flags were | given. | | PYTORCH_DEFAULT is the PyTorch’s framework | default choice for events on that backend, | which may not be the same. | | For example, when PyTorch creates a | Cuda event it sets the flag | | CUDA_EVENT_DISABLING_TIMING by default | to improve performance. | | The mapping of PYTORCH_DEFAULT and | BACKEND_DEFAULT is done by each backend | implementation. Backend-specific | flags, like CUDA_EVENT_DEFAULT, should | map one-to-one with actual event flags | for those backends. |
  • | Policy for adjusting the behavior of | is_contiguous(). Allows subclass customization | while still being able to inline | is_contiguous() in the common case. |
  • replace the joy of being right with the joy of learning what is true
  • | PyInterpreterStatus describes what the state of | its interpreter tag is, relative to the thread | currently holding the GIL. |
  • | QEngine is an enum that is used to select | the engine to run quantized ops. | | Keep this enum in sync with get_qengine_id() | in torch/backends/quantized/__init__.py |
  • | QScheme is an enum that specifies the | type of quantization. This has a one | to one correspondence with Quantizer | | Please refer to ATen/quantized/Quantizer.h | to see the Quantizers classes. | | Keep this file in sync with torch/nn/_qscheme.py |
  • | Note [StreamId assignment] | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | How do we assign stream IDs? | | – 57 bits – – 5 bits —– – 3 bits – | zeros stream id index StreamIdType | | Where StreamIdType: | 000 = default stream or externally allocated if id[63:3] != 0 | 001 = low priority stream | 010 = high priority stream | | This is not really for efficiency; it’s just | easier to write the code to extract the index | if we do this with bitmasks :) | | We are obligated to treat the stream ID 0 as | the default stream, per the invariant specified | in Stream. | | However, all other numbers are entirely an | internal implementation detail, we reserve the | right to renumber streams however we like. | | Note that it is really important that the MSB | is zero; StreamId is a signed integer, and | unsigned to signed conversion outside of the | bounds of signed integer representation is | undefined behavior. | | You could work around this with something like | https://stackoverflow.com/questions/13150449/efficient-unsigned-to-signed-cast-avoiding-implementation-defined-behavior | but it seems a bit overkill for this. | | Also, external managed stream pointers | (cudaStream_t) can be directly stored in the Id | field so in this case, we need to check the | stream alignment. | | The IdType uses an additional bit to match with | the 64-bit address alignment making easy to | identify an external stream when its value (X | & 7) > 0 |
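Reading the layout above back out of a StreamId amounts to a couple of masks and shifts (a sketch under the bit widths stated in the note; the helper names are made up):

```cpp
#include <cstdint>

using StreamId = std::int64_t;

// Low 3 bits: StreamIdType; next 5 bits: index into the stream pool.
inline std::uint64_t stream_id_type(StreamId id) {
  return static_cast<std::uint64_t>(id) & 0x7;
}
inline std::uint64_t stream_id_index(StreamId id) {
  return (static_cast<std::uint64_t>(id) >> 3) & 0x1F;
}
// Per the note, id 0 must always be treated as the default stream.
inline bool is_default_stream(StreamId id) { return id == 0; }
```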
  • | Note [Enum ImplType] | | This enum is temporary. | | In the followup refactor we should think about | how to specialize TensorImpl creation for | view tensors. | | Currently we only special case its key_set_ but | there’s also potential to share | version_counter_ directly without creating | first and then override in as_view.
  • | Note [Disabled VariableVersion] | | VariableVersion struct has an intrusive_ptr | pointing to a VersionCounter struct with an atomic | variable. Thus | VariableVersion(/*version=*/0) is not as | cheap as we expected. In some cases | constructing a VariableVersion with version | 0 is not necessary so we add a cheap | constructor which doesn’t allocate the | intrusive_ptr. | | Example use cases are: | | - Inference tensors don’t track version | counters, so they’ll just always have | a disabled VariableVersion. | | - In SavedVariable class we override | version_counter_ inside its constructor so | that we can use the cheap constructor there. |
  • | The behavior an operation has on an input | of 0. |

Constants

Traits

  • | Unfortunately, the definition of AutogradMeta | lives in a separate compilation unit than | TensorImpl (libtorch.so versus libc10.so) which | means that we cannot construct an AutogradMeta | from TensorImpl, not even from the cpp file. | | So we have to indirect it through a factory | function which will be initialized when we load | libtorch.so.
  • | DeviceGuardImplInterface represents | the virtual interface which provides | functionality to provide an RAII class | for device and stream switching, via | DeviceGuard. Every distinct device | type, e.g., Cuda and HIP, is expected | to implement and register an implementation | of this interface. | | All classes which inherit from DeviceGuardImplInterface | should be declared ‘final’. | | This class exists because we provide | a unified interface for performing | device guards via DeviceGuard, but | we cannot assume that we have actually | compiled against the, e.g., Cuda library, | which actually implements this guard | functionality. In this case, a dynamic | dispatch is required to cross the library | boundary. | | If possible, you should directly use | implementations of this interface; | those uses will be devirtualized. | | Intended use of this class is to leak | the DeviceGuardImpl at program end. | | So you better not call the destructor, | buster! | | NB: Implementations of exchangeDevice can be | a bit boilerplatey. | | You might consider replacing exchangeDevice | with a non-virtual function with a baked in | implementation; however, note that this will | triple the number of virtual calls (when you | implement exchangeDevice in a final subclass, | the compiler gets to devirtualize everything; | it won’t do that if you don’t define it in the | subclass!) | | A common way to solve this problem is to use | some sort of CRTP; however, we can’t template | DeviceGuardImplInterface since we really do | need it to be virtual. | | A little boilerplate seems easiest to explain. | (Another way around this problem is to provide | inline functions that provide the default | implementations, but this seems a little hard | to explain. | | In any case, we’re only going to have on the order | of ten implementations of this anyway.) |
  • | Common methods for all generators |
  • An interface for reporting thread local memory usage per device
  • | Returns true if we hold the GIL. If not | linked against Python we always return | false. |
  • | Note: [Verbatim Warnings] | | Warnings originating in C++ code can appear | out-of-place to Python users: a user runs | a line in Python, but the warning references | a line in C++. | | Some parts of PyTorch, like the JIT, are | cognizant of this mismatch and take care to map | warnings back to the user’s program, but most | of PyTorch simply throws a context-free | warning. To allow warning handlers to add | context where appropriate, warn takes the | “verbatim” flag. | | When this is false a warning handler might | append the C++ warning to a Python warning | message that relates the warning back to the | user’s program. Callers who have already | accounted for context in their warnings should | set verbatim to true so their warnings appear | without modification.

Functions

  • | Subtract two unsigned integers, X and Y, of | type T and return the absolute value of the | result. |
  • | Aligns \c Addr to \c Alignment bytes, rounding | up. | | Alignment should be a power of two. This | method rounds up, so alignAddr(7, 4) == 8 and | alignAddr(8, 4) == 8. |
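The rounding described here is the usual power-of-two alignment trick; a hedged sketch of the formula (assuming Alignment is a power of two):

```cpp
#include <cstdint>

inline std::uintptr_t align_addr_up(std::uintptr_t addr, std::uintptr_t alignment) {
  // Round up: e.g. align_addr_up(7, 4) == 8 and align_addr_up(8, 4) == 8.
  return (addr + alignment - 1) & ~(alignment - 1);
}
```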
  • | Returns the largest uint64_t less than or | equal to \p Value and is | | \p Skew mod \p Align. \p Align must be | non-zero |
  • | Returns the next integer (mod 2**64) that is greater than or equal to | \p Value and is a multiple of \p Align. \p Align must be non-zero. | | If non-zero \p Skew is specified, the return value will be a minimal | integer that is greater than or equal to \p Value and equal to | \p Align * N + \p Skew for some integer N. If \p Skew is larger than | \p Align, its value is adjusted to ‘\p Skew mod \p Align’. | | Examples: | \code | alignTo(5, 8) = 8 | alignTo(17, 8) = 24 | alignTo(~0LL, 8) = 0 | alignTo(321, 255) = 510 | | alignTo(5, 8, 7) = 7 | alignTo(17, 8, 1) = 17 | alignTo(~0LL, 8, 3) = 3 | alignTo(321, 255, 42) = 552 | \endcode
  • | Returns the next integer (mod 2**64) that is | greater than or equal to | | \p Value and is a multiple of \c Align. \c | Align must be non-zero. |
  • | Returns the necessary adjustment for | aligning \c Ptr to \c Alignment bytes, | rounding up. |
  • | This function takes a 64-bit integer | and returns the bit equivalent double. |
  • | This function takes a 32-bit integer | and returns the bit equivalent float. |
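Both of these bit-reinterpretation helpers are conventionally implemented with memcpy so that the exact bit pattern (NaN payloads included) survives; a sketch:

```cpp
#include <cstdint>
#include <cstring>

inline double bits_to_double(std::uint64_t bits) {
  double d;
  std::memcpy(&d, &bits, sizeof(d));
  return d;
}

inline float bits_to_float(std::uint32_t bits) {
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```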
  • This function is not exported
  • | see tensor_attributes.rst for detailed | explanation and examples of casting | rules. |
  • | Wrap around axis_index if it is negative, | s.t., -1 is the last dim |
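The wrap-around rule is the usual Python-style negative indexing; a minimal sketch (the dim/ndim names are illustrative):

```cpp
#include <cassert>

inline int wrap_dim(int dim, int ndim) {
  assert(ndim > 0);
  if (dim < 0) dim += ndim;  // -1 becomes the last dimension
  assert(dim >= 0 && dim < ndim);
  return dim;
}
// wrap_dim(-1, 4) == 3
```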
  • | Reads an environment variable and returns | - optional<true>, if set equal to “1” | - optional<false>, if set equal to “0” | - nullopt, otherwise | | NB: | Issues a warning if the value of the | environment variable is not 0 or 1.
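A sketch of the described behavior (the function name here is made up; only the “1”/“0”/unset semantics come from the note above):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <optional>

inline std::optional<bool> check_env_sketch(const char* name) {
  const char* value = std::getenv(name);
  if (value == nullptr) return std::nullopt;       // unset -> nullopt
  if (std::strcmp(value, "1") == 0) return true;
  if (std::strcmp(value, "0") == 0) return false;
  std::fprintf(stderr, "warning: %s should be 0 or 1, got '%s'\n", name, value);
  return std::nullopt;
}
```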
  • | Helper to verify the GPU index is valid |
  • | Helpers for CHECK_NOTNULL(). Two are necessary | to support both raw pointers and smart | pointers. |
  • | This is intended to be a centralized | location by which we can determine what | an appropriate DispatchKey for a tensor is. |
  • | Typed copy function for classes. |
  • | WARNING: Implementations for this | function are currently registered | from | | ATen and caffe2, not yet from c10. Don’t | use this unless either ATen or caffe2 | is also present. | | We can’t move them yet, because the Cuda | implementations aren’t unified yet | between ATen and caffe2. | | We’re planning to move the implementations | into c10/backend/xxx to make c10 self | contained again. |
  • | A placeholder function for types that | do not allow assignment. |
  • | Implement copysign for half precision floats | using bit ops | | Sign is the most significant bit for both half | and bfloat16 types |
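Since the sign lives in the most significant bit for both 16-bit formats, the operation is a two-mask merge; a sketch on raw bit patterns:

```cpp
#include <cstdint>

// Magnitude bits from `mag`, sign bit from `sgn`; works for IEEE half and
// bfloat16 alike because both keep the sign in bit 15.
inline std::uint16_t copysign_bits16(std::uint16_t mag, std::uint16_t sgn) {
  return static_cast<std::uint16_t>((mag & 0x7FFF) | (sgn & 0x8000));
}
```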
  • | Count the number of ones from the most | significant bit to the first zero bit. | | Ex. countLeadingOnes(0xFF0FFF00) | == 8. Only unsigned integral types are | allowed. | | ———– | @param ZB | | the behavior on an input of all ones. | Only ZeroBehavior::Width and ZeroBehavior::Undefined are | valid arguments. |
  • | Count number of 0’s from the most significant | bit to the least stopping at the first | 1. | | Only unsigned integral types are allowed. | | ———– | @param ZB | | the behavior on an input of 0. Only ZeroBehavior::Width | and ZeroBehavior::Undefined are valid arguments. |
  • Count the number of set bits in a value. Ex. countPopulation(0xF000F000) = 8 Returns 0 if the word is zero.
  • | Count the number of ones from the least | significant bit to the first zero bit. | | Ex. countTrailingOnes(0x00FF00FF) == 8. Only | unsigned integral types are allowed. | | \param ZB the behavior on an input of all | ones. Only ZeroBehavior::Width and ZeroBehavior::Undefined are valid | arguments.
  • | Count number of 0’s from the least significant | bit to the most stopping at the first 1. | | Only unsigned integral types are allowed. | | \param ZB the behavior on an input of 0. Only | ZeroBehavior::Width and ZeroBehavior::Undefined are valid | arguments. |
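The “count ones” variants typically reduce to the corresponding “count zeros” operation on the complemented value; a sketch for 32-bit inputs, returning the full width on zero (the ZeroBehavior::Width case):

```cpp
#include <cstdint>

inline unsigned count_leading_zeros32(std::uint32_t v) {
  return v == 0 ? 32u : static_cast<unsigned>(__builtin_clz(v));
}
inline unsigned count_leading_ones32(std::uint32_t v) {
  return count_leading_zeros32(~v);   // count_leading_ones32(0xFF0FFF00u) == 8
}
inline unsigned count_trailing_zeros32(std::uint32_t v) {
  return v == 0 ? 32u : static_cast<unsigned>(__builtin_ctz(v));
}
inline unsigned count_trailing_ones32(std::uint32_t v) {
  return count_trailing_zeros32(~v);  // count_trailing_ones32(0x00FF00FF) == 8
}
```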
  • Use this version where you’re sure a Cuda context exists already.
  • Utility to demangle a C++ symbol name.
  • Returns the printable name of the type.
  • | // Deprecation disabled until we fix | sites in our codebase | | C10_DEPRECATED_MESSAGE(“AT_ERROR(msg) | is deprecated, use TORCH_CHECK(false, | msg) instead.”) |
  • | Convenience function that returns | a TensorOptions object with the device set | to the given one. |
  • | NB: In the past, we were inconsistent about | whether or not this reported an error if there | were driver problems. Based on | experience interacting with users, it seems | that people basically ~never want this function | to fail; it should just return zero if things | are not working. | | Oblige them. | | It still might log a warning for the user the | first time it’s invoked
  • | Version of device_count that throws | if no devices are detected |
  • | Convenience function that returns | a TensorOptions object with the device set | to Cuda and the device_index set to the | given one. |
  • Returns the integer ceil(Numerator / Denominator).
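A sketch of the standard integer ceiling-division identity (assumes Denominator != 0 and that the addition does not overflow):

```cpp
#include <cstdint>

inline std::uint64_t divide_ceil(std::uint64_t numerator, std::uint64_t denominator) {
  return (numerator + denominator - 1) / denominator;
}
// divide_ceil(7, 4) == 2, divide_ceil(8, 4) == 2
```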
  • | This function takes a double and returns the | bit equivalent 64-bit integer. | | Note that copying doubles around changes the | bits of NaNs on some hosts, notably x86, so | this routine cannot be used if these bits are | needed.
  • returns -1 on failure
  • | legacy function to support ScalarType |
  • | Convenience function that returns | a TensorOptions object with the dtype | set to the given one. |
  • | Rich logging messages | | CAFFE_ENFORCE_THAT can be used with one of the | “checker functions” that capture input argument | values and add it to the exception | message. E.g. CAFFE_ENFORCE_THAT(Equals(foo(x), | bar(y)), "Optional additional message") would | evaluate both foo and bar only once and if the | results are not equal - include them in the | exception message. | | Some of the basic checker functions like Equals | or Greater are already defined below. Other | header might define customized checkers by | adding functions to enforce_detail | namespace. For example: | | namespace caffe2 { namespace enforce_detail { | inline EnforceFailMessage IsVector(const vector<int64_t>& shape) { | if (shape.size() == 1) { return EnforceOK(); } | return str(“Shape “, shape, “ is not a vector”); | } | }} | | With further usages like | CAFFE_ENFORCE_THAT(IsVector(Input(0).dims())) | | Convenient wrappers for binary operations like | CAFFE_ENFORCE_EQ are provided too. Please use | them instead of CHECK_EQ and friends for | failures in user-provided input.
  • | Get the index of the first set bit starting | from the least significant bit. | | Only unsigned integral types are allowed. | | ———– | @param ZB | | the behavior on an input of 0. Only ZeroBehavior::Max | and ZeroBehavior::Undefined are valid arguments. |
  • | Get the index of the last set bit starting | from the least significant bit. | | Only unsigned integral types are allowed. | | ———– | @param ZB | | the behavior on an input of 0. Only | | ZeroBehavior::Max and ZeroBehavior::Undefined are valid arguments. |
  • | This function takes a float and returns the | bit equivalent 32-bit integer. | | Note that copying floats around changes the | bits of NaNs on some hosts, notably x86, so | this routine cannot be used if these bits are | needed. |
  • Internal, use ThreadLocalStateGuard
  • | Returns a DispatchKeySet of autocast | related keys mapped to backend. |
  • | for a given backend key, return the associated | autograd key. | | for non-backend keys, return AutogradOther as | a default. | | Note: it’s convenient and fast to return | a default here rather than (say) returning an | optional, or throwing. | | But it makes callers responsible for either a) | enforcing the invariant that only backend keys | be passed as arguments, or b) interpreting our | return value carefully. |
  • | Returns a DispatchKeySet of autograd | related keys mapped to backend. |
  • | Returns a DispatchKeySet of all backend keys | mapped to Autograd dispatch key t, | DispatchKeySet is empty if t is not alias of | DispatchKey::Autograd. | | for a given autograd key, return the | (guaranteed nonempty) set of associated backend | keys. for a non-autograd key, return the empty | keyset. |
  • | ———– | @note | | Hardcoded the channel last stride indices | here to get better performance |
  • Get the CPU Allocator.
  • Get the CPU Caching Allocator
  • Get the Default CPU Allocator
  • Get the Default Mobile CPU Allocator
  • | A utility function to return an exception | string by prepending its exception type before | its what() content |
  • | Helper to determine the index of the stream to | return | | Note: Streams are returned round-robin (see | note in CudaStream.h) |
  • | Gets a non-deterministic random number | from either /dev/urandom or the current | time. | | For Cuda, gets random from random_device and | adds a transformation on it. | | FIXME: The behavior in this function is from | legacy code | | (THRandom_seed/THCRandom_seed) and is probably | not the right thing to do, even though our | tests pass. Figure out if tests get perturbed | | - when the same algorithm is used for all | backends. Note that the current behavior is | different for CPU, Cuda and Windows CPU. | | - when using C++11 std objects, such as | random_device | | - when constructing a 64 bit seed properly, | rather than static casting a 32 bit number to | 64 bit.
  • | Resolve alias dispatch key to DispatchKeySet | if applicable |
  • Gets the global warning handler.
  • | Return the greatest common divisor | of the values using Euclid's algorithm. | (See the sketch after this list.) |
  • Return the high 32 bits of a 64 bit value.
  • | Creates the low and high priority stream pools | for the specified device | | Warning: only call once per device! |
  • | Populates global values and creates a default | stream for each device. | | Note: the default stream on each device is | signified by a nullptr, and so is not created | as usual. | | In particular, we don’t need to switch devices | when creating the streams. | | Warning: this function must only be called | once! |
  • | Check if a DispatchKey is an alias mapping | to other runtime keys. |
  • true if t is a backend dispatch key
  • | Note [Ambiguous is_channels_last_strides_xd] | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | | The flaw of carrying memory_format implicitly | through strides is very hard to work around (WAR) | properly; see issue #24090. | | Without the history of permutation, we can't | infer the memory_format of a tensor from the | snapshot of its size & stride. | | E.g. | | 1. We can NOT specify the memory_format of an N111 | tensor through strides in a meaningful way; | | 2. Two paths can end up with identical size/stride: | | an N11W contiguous tensor sliced at the w-dimension | becomes [N,1,1,1]@[W,W,W,W]; | | an NC11 channels_last tensor sliced at the | c-dimension becomes [N,1,1,1]@[C,C,C,C]. | | So if we see a tensor [N,1,1,1]@[X,X,X,X], | there's no way for us to infer the | memory_format of the original tensor. | | Given these limitations, our temporary WAR | is_channels_last_strides makes a best effort | to infer whether the original memory_format of | a tensor is MemoryFormat::ChannelsLast. The two | objectives of this function (ordered by their | importance): | | 1. Ensure that normal shape manipulation does | not accidentally change the MemoryFormat | of an existing tensor. | | 2. Allow users to mark tensors as | MemoryFormat::ChannelsLast. | | The function does so by checking the strides of | the tensor, including the strides of size-1 | dimensions, although conventionally PyTorch | imposes no restriction on trivial strides | (strides of size-1 dimensions). | | Note that this approach is a compromise. We did | not solve the problem completely; in many cases | we will not be able to infer the correct memory | format. | | The implementation of | is_channels_last_strides serves these | objectives: | | MemoryFormat::ChannelsLast has to be explicitly | opted into (no accidental conversion); best | effort is made to maintain the ChannelsLast flag. | | Because this is not a bulletproof solution, | through testing | (aten/src/ATen/test/memory_format_test.cpp): | | a. we ensure that the common cases are | supported; | | b. we identify the corner cases on which the | implementation compromises. | | By the time accumulated permutation replaces | implicit memory_format through strides, we | should update our tests and fix the issues they | expose. | | We use Channels Last 2d as an example above. | | This is a general problem for all the | is_channels_last_strides_xd | implementations. Please check the helper | functions (is_channels_last_strides_d_s) for | more details.
  • | NOTE: | | Below are helper functions for | is_channels_last_strides_xd. | | 1. Please do not combine these helper | functions; each helper function handles exactly | one case of sizes + memory_format. This way the | stride indices form a constant array that can be | accessed with constant index numbers, and the | compiler will fully unroll the loop over the | stride indices for better performance. | | 2. No error checks in the helper functions; the | caller ensures the correctness of the input. | | 3. All helper functions have similar comments; | only the 1st helper function is commented here. |
  • | This API exists because we have a use case for | checking | getRuntimeDispatchKeySet(alias).has(DispatchKey::Undefined) | in OperatorEntry.cpp, but we disallow it in the | has() API. |
  • | Checks if an integer fits into the given | bit width. |
  • Checks if a signed integer fits into the given (dynamic) bit width.
  • Return true if the argument is a non-empty sequence of ones starting at the least significant bit with the remainder zero (32 bit version).
  • | Return true if the argument is a non-empty | sequence of ones starting at the least | significant bit with the remainder zero (64 | bit version). |
  • Return true if the argument is a power of two > 0.
  • Return true if the argument is a power of two > 0 (64 bit edition.)
  • Unix isprint, but insensitive to locale.
  • Checks if a signed integer is an N bit number shifted left by S.
  • | Return true if the argument contains | a non-empty sequence of ones with the | remainder zero (32 bit version.) | Ex. isShiftedMask_32(0x0000FF00U) == true. |
  • | Return true if the argument contains | a non-empty sequence of ones with the | remainder zero (64 bit version.) |
  • Checks if an unsigned integer is an N bit number shifted left by S.
  • Checks if an unsigned integer fits into the given (dynamic) bit width.
  • | NB: Per the C++ standard (e.g., | https://stackoverflow.com/questions/18195312/what-happens-if-you-static-cast-invalid-value-to-enum-class) | as long as you cast from the same underlying | type, it is always valid to cast into an enum | class (even if the value would be invalid by | the enum). | | Thus, the caller is allowed to cast a possibly | invalid int16_t to DeviceType and then pass it | to this function. | | (I considered making this function take an | int16_t directly, but that just seemed weird.) | (See the enum-cast sketch after this list.) |
  • | Convenience function that returns | a TensorOptions object with the layout | set to the given one. |
  • | Historically, every tensor only had a single | DispatchKey, and it was always something like | CPU, and there wasn’t any of this business | where TLS could cause the DispatchKey of | a tensor to change. | | But we still have some legacy code that is | still using DispatchKey for things like | instanceof checks; if at all possible, refactor | the code to stop using DispatchKey in those | cases. |
  • Return the low 32 bits of a 64 bit value.
  • Return the log base 2 of the specified value.
  • Return the floor log base 2 of the specified value, -1 if the value is zero.
  • | Return the ceil log base 2 of the specified | value, 32 if the value is zero. | | (32 bit edition). | | Ex. Log2_32_Ceil(32) == 5, Log2_32_Ceil(1) == 0, Log2_32_Ceil(6) == 3 | (See the log2 sketch after this list.) |
  • Return the floor log base 2 of the specified value, -1 if the value is zero.
  • | Return the ceil log base 2 of the specified | value, 64 if the value is zero. | | (64 bit edition.)
  • | Return value is needed to do the static | variable initialization trick |
  • Log a message and terminate.
  • Make a 64-bit integer from a high / low pair of 32-bit integers.
  • | Creates the filename pattern passed to and | completed by mkstemp. | | Returns a vector<char> because mkstemp needs | a (non-const) char* and std::string only | provides a const char* before C++17. |
  • | Keeps the callable object that is passed in, | and executes it at the destruction of the | returned object (usually at the scope exit | where the returned object is kept). | | The interface is specified by p0052r2. | (See the sketch after this list.) |
  • | Like try_make_tempdir, but throws | an exception if a temporary directory | could not be returned. |
  • | Like try_make_tempfile, but throws an | exception if a temporary file could not be | returned. |
  • | Create a bitmask with the N left-most bits set | to 1, and all other bits set to 0. Only | unsigned types are allowed. |
  • | Create a bitmask with the N left-most bits set | to 0, and all other bits set to 1. Only | unsigned types are allowed. |
  • | Create a bitmask with the N right-most bits | set to 1, and all other bits set to 0. | Only unsigned types are allowed. |
  • | Create a bitmask with the N right-most bits | set to 0, and all other bits set to 1. Only | unsigned types are allowed. |
  • Gets the maximum value for a N-bit signed integer.
  • Gets the maximum value for a N-bit unsigned integer.
  • | Convenience function that returns | a TensorOptions object with the | memory_format set to the given one. |
  • | Fill the data memory region of num bytes with | a particular garbage pattern. | | The garbage value is chosen to be NaN if | interpreted as a floating point value, or a very | large integer. |
  • | A and B are either alignments or | offsets. Return the minimum alignment that may | be assumed after adding the two together. |
  • Gets the minimum value for a N-bit signed integer.
  • | Product of a list of integers; accumulates | into the int64_t datatype |
  • | Returns the next power of two (in 64-bits) | that is strictly greater than A. | | Returns zero on overflow. |
  • | A helper function that is basically | doing nothing. |
  • | Product of all dims between k and l (including | dims[k] and excluding dims[l]); k and | l may be supplied in either order. | (See the sketch after this list.) |
  • | Return product of all dimensions starting from k | | Returns 1 if k>=dims.size() |
  • | Product of all dims up to k (not including | dims[k]) Throws an error if | k>dims.size() |
  • | Returns the offset to the next integer (mod | 2**64) that is greater than or equal to Value | and is a multiple of Align. Align must be | non-zero. | (See the alignment sketch after this list.) |
  • | typeMetaToScalarType(), lifted to | optional |
  • | Destructor for non-fundamental types. |
  • | Placement new function for the type. |
  • | Returns the power of two which is greater than | or equal to the given value. | | Essentially, it is a ceil operation across the | domain of powers of two. | (See the bit-mask and power-of-two sketch after this list.) |
  • | Returns the power of two which is less than or | equal to the given value. | | Essentially, it is a floor operation across | the domain of powers of two. |
  • | Gets a random number from /dev/urandom. | | Note this is a legacy method (from THRandom.cpp) | | FIXME: use random_device with entropy | information |
  • | Replace all occurrences of the "from" substring | with the "to" string. | | Returns the number of replacements. |
  • | Convenience function that returns | a TensorOptions object with the | requires_grad set to the given one. |
  • Reverse the bits in Val.
  • | Add two unsigned integers, X and Y, of type T. | Clamp the result to the maximum representable | value of T on overflow. ResultOverflowed | indicates if the result is larger than the | maximum representable value of type T. | (See the saturating-arithmetic sketch after this list.) |
  • | Multiply two unsigned integers, X and Y, of | type T. Clamp the result to the maximum | representable value of T on overflow. | ResultOverflowed indicates if the result is | larger than the maximum representable value of | type T. |
  • | Multiply two unsigned integers, X and Y, and | add the unsigned integer, A to the product. | | Clamp the result to the maximum representable | value of T on overflow. | | ResultOverflowed indicates if the result is | larger than the maximum representable value of | type T. |
  • | convert ScalarType enum values to TypeMeta | handles |
  • | Set the allocator for DeviceType t. | The passed-in allocator pointer is expected | to have static lifetime; this function | does NOT take ownership of the raw pointer. | (The reason for this is to prevent existing | pointers to an allocator of a particular | device from being invalidated when | SetAllocator is called.) | | Also note that this is not thread-safe, | and we assume this function will only | be called during initialization. | | The 'priority' flag is introduced for cases | where we want to overwrite the default allocator, | since the allocators are set statically. | The default priority is 0, which is the | lowest. Only a higher or equal priority | can overwrite existing ones. |
  • API usage logging capabilities
  • | Sets the CPU allocator to the given allocator: | the caller gives away the ownership of the | pointer. |
  • | The CPUCachingAllocator is experimental and | might disappear in the future. | | The only place that uses it is in | StaticRuntime. | | Set the CPU Caching Allocator |
  • | The TORCH_WARN_ONCE macro is difficult to test | for. Use setWarnAlways(true) to turn it into | TORCH_WARN, which can be tested for more | easily. |
  • | Sets the global warning handler. This is not | thread-safe, so it should generally be called | once during initialization or while holding | the GIL for programs that use python. | | User is responsible for keeping the | WarningHandler alive until it is not needed. |
  • | Sign-extend the number in the bottom B bits of | X to a 32-bit integer. | | Requires 0 < B <= 32. | (See the sign-extension sketch after this list.) |
  • | Sign-extend the number in the bottom B bits of | X to a 32-bit integer. | | Requires 0 < B < 32. |
  • | Sign-extend the number in the bottom B bits of | X to a 64-bit integer. | | Requires 0 < B < 64. |
  • | Sign-extend the number in the bottom B bits of | X to a 64-bit integer. | | Requires 0 < B < 64. |
  • | Product of all dims between k and l (not | including dims[k] and dims[l]) |
  • | Return product of all dimensions starting | from k |
  • | Product of all dims up to k (not including | dims[k]) |
  • | StreamId is 64-bit, so we can just rely on | regular promotion rules. | | We rely on streamIdIndex and streamIdType being | non-negative; see Note [Hazard when | concatenating signed integers]. | (An illustrative packing sketch appears after this list.) |
  • Obtains the base name from a full path.
  • | Sum of a list of integers; accumulates | into the int64_t datatype |
  • | Mechanism for throwing errors which can't be | prevented at compile time due to type | erasure, e.g. somebody calling TypeMeta::copy() | for a non-copyable type. Right now this just | throws an exception, but it is implemented in | the .cpp file to manage dependencies. |
  • | Non-RAII API | | Please prefer using the RAII API. See | declarations in LocalDispatchKeySet.h for | details. | | Non-RAII API for manipulating the thread-local | dispatch state. | | Please prefer the RAII API. The non-RAII API | may be useful when the included/excluded state | of a given DispatchKey must span many calls | from Python into C++, so you cannot | conveniently use an RAII guard. | | Example use case: a Python context manager | that includes a certain DispatchKey, to ensure | ops running under the context manager dispatch | through that DispatchKey's registered | overrides. | | The non-RAII API is less efficient than the | RAII guards because both the getter and setter | will do a tls_getaddr lookup (the RAII struct | only needs one!) |
  • | A utility function to convert vector | to vector<int64_t>. |
  • | The str() call that creates userMsg can have | 1 of 3 return types depending on the number and | types of arguments passed to | TORCH_INTERNAL_ASSERT. | | 0 arguments will get a CompileTimeEmptyString, | 1 const char * will be passed straight through, | and anything else will get converted to string. |
  • | This should never be called. It is provided in | case of compilers that don’t do any dead code | stripping in debug builds. |
  • | Attempts to return a temporary directory or | returns nullopt if an error occurred. | | The directory returned follows the pattern | <tmp-dir>/<name-prefix><random-pattern>/, | where <tmp-dir> is the value of the | "TMPDIR", "TMP", "TEMP" or "TEMPDIR" | environment variable if any is set, or | otherwise /tmp; <name-prefix> is the value | supplied to this function, and | <random-pattern> is a random sequence of | numbers. | | On Windows, name_prefix is ignored and | tmpnam is used. |
  • | Attempts to return a temporary file or returns | nullopt if an error occurred. | | The file returned follows the pattern | <tmp-dir>/<name-prefix><random-pattern>, | where <tmp-dir> is the value of the | "TMPDIR", "TMP", "TEMP" or "TEMPDIR" | environment variable if any is set, or | otherwise /tmp; <name-prefix> is the value | supplied to this function, and | <random-pattern> is a random sequence of | numbers. | | On Windows, name_prefix is ignored and | tmpnam is used. |
  • | convert TypeMeta handles to ScalarType | enum values |
  • | Issue a warning with a given message. | Dispatched to the current warning handler. |
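The sketches below illustrate several of the helpers listed above. They are standalone illustrations written against the behavior described in the entries, not the crate's actual implementations; any name that does not appear in an entry is an assumption.

The first sketch covers the bit-scan entries (index of the first/last set bit, with a ZeroBehavior parameter controlling the result for an input of 0):

    #include <cstdint>
    #include <limits>
    #include <type_traits>

    enum class ZeroBehavior { Undefined, Max };

    // Index of the least significant set bit. For an input of 0,
    // ZeroBehavior::Max yields the maximum value of T; with
    // ZeroBehavior::Undefined the result is unspecified.
    template <typename T>
    T findFirstSet(T value, ZeroBehavior zb = ZeroBehavior::Max) {
      static_assert(std::is_unsigned<T>::value, "only unsigned integral types");
      if (value == 0) {
        (void)zb;  // Undefined leaves the result unspecified; reuse Max's value
        return std::numeric_limits<T>::max();
      }
      T index = 0;
      while ((value & T(1)) == 0) { value >>= 1; ++index; }
      return index;
    }

    // Index of the most significant set bit, counted from the least
    // significant bit, so findLastSet(uint32_t(0x10)) == 4.
    template <typename T>
    T findLastSet(T value, ZeroBehavior zb = ZeroBehavior::Max) {
      static_assert(std::is_unsigned<T>::value, "only unsigned integral types");
      if (value == 0) {
        (void)zb;
        return std::numeric_limits<T>::max();
      }
      T index = 0;
      while (value >>= 1) { ++index; }
      return index;
    }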
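The float-to-bits entry is typically implemented with memcpy rather than a pointer cast (which would violate strict aliasing). FloatToBits below is an illustrative standalone helper; the NaN caveat from the entry still applies, since the float is already copied by value on the way in.

    #include <cstdint>
    #include <cstring>

    // Reinterpret the bit pattern of a float as a 32-bit integer.
    inline uint32_t FloatToBits(float f) {
      static_assert(sizeof(uint32_t) == sizeof(float), "unexpected float size");
      uint32_t bits;
      std::memcpy(&bits, &f, sizeof(f));
      return bits;
    }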
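The greatest-common-divisor entry uses Euclid's algorithm; a minimal 64-bit version (function name illustrative):

    #include <cstdint>

    // Euclid's algorithm: repeatedly replace (a, b) with (b, a mod b)
    // until b is zero; the remaining a is the GCD.
    inline uint64_t GreatestCommonDivisor64(uint64_t a, uint64_t b) {
      while (b != 0) {
        uint64_t t = b;
        b = a % b;
        a = t;
      }
      return a;
    }

    // GreatestCommonDivisor64(12, 18) == 6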
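The bit-mask and power-of-two entries (trailing/leading masks, the power-of-two predicate, and the ceil/floor/next operations over powers of two) behave as in this sketch; the names mirror the LLVM-style helpers the entries describe but are assumptions here.

    #include <cstdint>

    // Bitmask with the n right-most bits set to 1 (n == 0 gives 0, n == 64
    // gives all ones); the "trailing zeros" variant is simply its complement.
    inline uint64_t maskTrailingOnes(unsigned n) {
      return n == 0 ? 0 : (~uint64_t(0) >> (64 - n));
    }

    // True for 1, 2, 4, 8, ...; false for 0.
    inline bool isPowerOf2_64(uint64_t v) {
      return v && !(v & (v - 1));
    }

    // Smallest power of two >= v ("ceil across the domain of powers of two");
    // 0 maps to 0, and the result wraps to 0 if v > 2^63.
    inline uint64_t PowerOf2Ceil(uint64_t v) {
      if (v == 0) return 0;
      uint64_t p = 1;
      while (p != 0 && p < v) p <<= 1;
      return p;
    }

    // maskTrailingOnes(8) == 0xFF;  isPowerOf2_64(64) == true;
    // PowerOf2Ceil(100) == 128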
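The note on casting possibly-invalid integers into an enum class can be demonstrated directly. The DeviceType enumerators and the validity check below are illustrative, not the crate's actual definition:

    #include <cstdint>

    enum class DeviceType : int16_t { CPU = 0, CUDA = 1 };  // illustrative subset

    // Casting from the underlying type into an enum class with a fixed
    // underlying type is well-defined even when the value names no
    // enumerator, so validity has to be checked explicitly by the callee.
    inline bool isValidDeviceType(DeviceType d) {
      switch (d) {
        case DeviceType::CPU:
        case DeviceType::CUDA:
          return true;
        default:
          return false;
      }
    }

    // isValidDeviceType(static_cast<DeviceType>(int16_t(42))) == false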
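The floor/ceil log-base-2 entries follow the conventions quoted above (-1 for the floor of 0, and 32 for the 32-bit ceil of 0); the function names below are illustrative.

    #include <cstdint>

    // Floor of log2(v): index of the highest set bit, or -1 if v == 0.
    inline int log2Floor32(uint32_t v) {
      int r = -1;
      while (v) { v >>= 1; ++r; }
      return r;
    }

    // Ceil of log2(v): smallest n with 2^n >= v; 32 if v == 0.
    // log2Ceil32(32) == 5, log2Ceil32(6) == 3, log2Ceil32(1) == 0.
    inline unsigned log2Ceil32(uint32_t v) {
      if (v == 0) return 32;
      unsigned r = 0;
      while ((uint64_t(1) << r) < v) ++r;
      return r;
    }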
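The p0052r2-style entry keeps a callable and runs it when the returned guard is destroyed; a minimal sketch (names illustrative):

    #include <utility>

    // Minimal scope-exit guard: invokes the stored callable on destruction.
    template <typename F>
    class ScopeExit {
     public:
      explicit ScopeExit(F f) : f_(std::move(f)) {}
      ScopeExit(ScopeExit&& other)
          : f_(std::move(other.f_)), active_(other.active_) {
        other.active_ = false;  // a moved-from guard no longer fires
      }
      ~ScopeExit() { if (active_) f_(); }
      ScopeExit(const ScopeExit&) = delete;
      ScopeExit& operator=(const ScopeExit&) = delete;
     private:
      F f_;
      bool active_ = true;
    };

    template <typename F>
    ScopeExit<F> make_scope_exit(F f) {
      return ScopeExit<F>(std::move(f));
    }

    // auto guard = make_scope_exit([&] { release_resource(); });
    // (release_resource is a placeholder) -- it runs when `guard` leaves scope.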
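The alignment entries (minimum alignment after adding two alignments/offsets, and the offset needed to reach the next multiple of an alignment) are small arithmetic tricks; names illustrative.

    #include <cstdint>

    // The sum of a value aligned to a and a value aligned to b is aligned to
    // the lowest set bit of (a | b).
    inline uint64_t MinAlign(uint64_t a, uint64_t b) {
      return (a | b) & (~(a | b) + 1);
    }

    // Bytes to add to value so it becomes a multiple of align (align != 0).
    inline uint64_t OffsetToAlignment(uint64_t value, uint64_t align) {
      uint64_t rem = value % align;
      return rem == 0 ? 0 : align - rem;
    }

    // MinAlign(8, 4) == 4;  OffsetToAlignment(13, 8) == 3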
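The dimension-product entries are simple accumulations into int64_t. The helper below follows the "including dims[k] and excluding dims[l], in either order" convention of one of the entries above; the name and signature are illustrative.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Product of dims[k] * dims[k+1] * ... * dims[l-1].
    inline int64_t size_between_dim(size_t k, size_t l,
                                    const std::vector<int64_t>& dims) {
      if (k > l) std::swap(k, l);  // k and l may be supplied in either order
      int64_t r = 1;
      for (size_t i = k; i < l && i < dims.size(); ++i) {
        r *= dims[i];
      }
      return r;
    }

    // For dims = {2, 3, 4}:
    //   size_between_dim(0, 3, dims) == 24   (all elements)
    //   size_between_dim(1, 3, dims) == 12   ("product from dim 1")
    //   size_between_dim(0, 1, dims) == 2    ("product up to dim 1")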
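The saturating-arithmetic entries clamp to the maximum representable value and report overflow through an out-parameter; a 32-bit sketch (names illustrative):

    #include <cstdint>
    #include <limits>

    // Add with clamping to the maximum of uint32_t; *overflowed reports
    // whether clamping happened.
    inline uint32_t SaturatingAdd(uint32_t x, uint32_t y, bool* overflowed) {
      uint64_t wide = uint64_t(x) + uint64_t(y);
      *overflowed = wide > std::numeric_limits<uint32_t>::max();
      return *overflowed ? std::numeric_limits<uint32_t>::max()
                         : static_cast<uint32_t>(wide);
    }

    // Multiply with the same clamping behavior.
    inline uint32_t SaturatingMultiply(uint32_t x, uint32_t y, bool* overflowed) {
      uint64_t wide = uint64_t(x) * uint64_t(y);
      *overflowed = wide > std::numeric_limits<uint32_t>::max();
      return *overflowed ? std::numeric_limits<uint32_t>::max()
                         : static_cast<uint32_t>(wide);
    }

    // The multiply-add variant is then SaturatingAdd(SaturatingMultiply(x, y,
    // &o1), a, &o2) with the two overflow flags OR-ed together.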
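The sign-extension entries take the value stored in the bottom B bits and stretch its sign bit across the full word; a 32-bit sketch (the name mirrors the entries, the exact signature is assumed). Like the real helpers, it relies on two's-complement arithmetic right shift.

    #include <cstdint>

    // Sign-extend the value in the bottom b bits of x to a full 32-bit
    // integer; requires 0 < b <= 32.
    inline int32_t SignExtend32(uint32_t x, unsigned b) {
      const unsigned shift = 32 - b;
      return static_cast<int32_t>(x << shift) >> shift;
    }

    // SignExtend32(0x1F, 5) == -1   (5-bit pattern 11111)
    // SignExtend32(0x0F, 5) == 15   (5-bit pattern 01111)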
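The StreamId note is about packing two non-negative fields into one 64-bit value. The field widths and helper names below are purely illustrative; the point is that because both fields are non-negative, no sign bits can leak into the other field when they are concatenated.

    #include <cstdint>

    // Illustrative packing of a small non-negative "type" tag and a stream
    // index into a 64-bit id. If either field were negative, its sign bits
    // would bleed into the other field -- the hazard the note refers to.
    inline int64_t makeStreamId(int64_t stream_type, int64_t stream_index) {
      return (stream_type << 32) | stream_index;
    }

    inline int64_t streamType(int64_t id) { return id >> 32; }
    inline int64_t streamIndex(int64_t id) { return id & 0xFFFFFFFFLL; }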

Type Definitions

Unions

  • | Packed container for TensorImpl sizes and | strides. | | This design improves on the previous approach | of using a pair of SmallVector<int64_t, 5> by | specializing for the operations we actually use | and enforcing that the number of sizes is the | same as the number of strides. | | The memory layout is as follows: | | 1 size_t for the size | | 5 eightbytes of inline sizes and 5 eightbytes | of inline strides, OR pointer to out-of-line | array |
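A rough layout sketch of the container described above: one size_t for the rank, then either 5 inline sizes plus 5 inline strides, or a pointer to an out-of-line array once the rank exceeds the inline capacity. The member names and accessors are illustrative; only the layout follows the description.

    #include <cstddef>
    #include <cstdint>

    class SizesAndStridesSketch {
     public:
      bool is_inline() const { return size_ <= 5; }
      const int64_t* sizes_data() const {
        return is_inline() ? inline_storage_ : out_of_line_storage_;
      }
      const int64_t* strides_data() const {
        // Inline strides live right after the 5 inline size slots;
        // out-of-line strides follow the size_ out-of-line size entries.
        return is_inline() ? inline_storage_ + 5 : out_of_line_storage_ + size_;
      }
     private:
      size_t size_ = 1;  // number of dimensions, shared by sizes and strides
      union {
        int64_t inline_storage_[10] = {0};  // [0..4] sizes, [5..9] strides
        int64_t* out_of_line_storage_;      // heap array: sizes then strides
      };
    };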