| autograd_dispatch_keyset should include all
| runtime autograd keys.
|
| Alias key DispatchKey::Autograd maps to
| autograd_dispatch_keyset.
|
| NB: keys in this set also get associated with
| CompositeImplicitAutograd
|
| Given a sequence of allocations in a
| thread, AllocationPlan records
|
| - 1. Size of each allocation.
|
| - 2. Lifetime of each allocation.
|
| - 3. Allocation offsets: memory offset
| for each allocation in a single blob
| of memory.
|
| - 4. Total size of the blob of memory required
| to satisfy all the allocations.
|
| Map of memory ptr to allocation id. This
| is auxiliary information only used
| to establish lifetime of allocations.
|
| Note [raw_allocate/raw_deallocate and Thrust]
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| Thrust’s support for custom allocators requires
| us to write something like this:
|
| class ThrustAllocator {
| char* allocate(size_t);
| void deallocate(char*, size_t);
| };
|
| This is not good for our unique_ptr based
| allocator interface, as there is no way to get
| to the context when we free.
|
| However, in some cases the context is exactly
| the same as the data pointer. In this case, we
| can support the “raw” allocate and deallocate
| interface. This is what raw_deleter signifies.
| By default, it returns a nullptr, which means
| that the raw interface is not implemented. Be
| sure to implement it whenever possible, or the
| raw interface will be incorrectly reported as
| unsupported, when it is actually possible.
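|
| A minimal sketch of the idea (illustrative shapes only, not the real
| c10::Allocator signatures): when the deleter context is the data pointer
| itself, one function-pointer deleter serves both the unique_ptr-style
| interface and the Thrust-style raw interface.
|
| #include <cstdlib>
|
| struct RawCapableAllocator {
|   static void deleteFn(void* ctx) { std::free(ctx); } // ctx == data pointer
|
|   void* raw_allocate(std::size_t n) { return std::malloc(n); }
|   void raw_deallocate(void* p) { deleteFn(p); } // no separate context needed
|
|   // Analogue of raw_deleter(): returning non-null signals that the raw
|   // interface is implemented.
|   using DeleterFnPtr = void (*)(void*);
|   DeleterFnPtr raw_deleter() const { return &deleteFn; }
| };
|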
| A RAII, thread local (!) guard that enables
| or disables grad mode upon construction,
| and sets it back to the original value
| upon destruction.
|
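| A hedged usage sketch, assuming the AutoGradMode guard from
| c10/core/GradMode.h:
|
| {
|   c10::AutoGradMode guard(/*enabled=*/false); // grad mode disabled in this scope
|   // ... run ops without recording gradients ...
| } // original grad mode restored on destruction
|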
| This is a simple bitset class with
| sizeof(long long int) * 8 (i.e., 64) bits.
|
| You can set bits, unset bits, query bits
| by index, and query for the first set
| bit.
|
| Before using this class, please also
| take a look at std::bitset, which has more
| functionality and is more generic.
| It is probably a better fit for your use
| case. The sole reason for utils::bitset
| to exist is that std::bitset misses
| a find_first_set() method.
|
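| An illustrative stand-in (not the real utils::bitset API) for the
| operations described above, over a single long long word:
|
| struct TinyBitset {
|   unsigned long long bits = 0;
|   void set(unsigned i) { bits |= (1ULL << i); }
|   void unset(unsigned i) { bits &= ~(1ULL << i); }
|   bool get(unsigned i) const { return (bits >> i) & 1ULL; }
|   // 1-based index of the first set bit, 0 if none (GCC/Clang builtin).
|   int find_first_set() const {
|     return __builtin_ffsll(static_cast<long long>(bits));
|   }
| };
|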
| The primary ATen error class.
|
| Provides a complete error message with source
| location information via what(), and a more
| concise message via what_without_backtrace().
|
| Don’t throw this directly; use
| TORCH_CHECK/TORCH_INTERNAL_ASSERT instead.
|
| NB: C10ErrorData is handled specially by the default
| torch to suppress the backtrace, see
| torch/csrc/Exceptions.h
|
| A backend-generic movable, not copyable,
| not thread-safe event.
|
| The design of this event follows that
| of Cuda and HIP events. These events
| are recorded and waited on by streams
| and can be rerecorded to, each rerecording
| essentially creating a new version
| of the event.
|
| For example, if (in CPU time), stream
| X is asked to record E, stream Y waits
| on E, and stream X is asked to record E
| again, then Y will wait for X to finish
| the first call to record and not the second,
| because it’s waiting on the first version
| of event E, not the second.
|
| Querying an event only returns the status
| of its most recent version.
|
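| A hedged sketch of the example above, assuming the c10::Event API with
| record()/block()/query() (streamX and streamY are illustrative handles):
|
| c10::Event e(c10::DeviceType::CUDA);
| e.record(streamX);      // stream X records version 1 of E
| e.block(streamY);       // stream Y waits on version 1
| e.record(streamX);      // rerecord: version 2; Y still waits only on version 1
| bool ready = e.query(); // reports the status of the most recent version
|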
| Backend-generic events are implemented by
| this class and InlineEvent. In addition to
| these events there are also some
| backend-specific events, like ATen’s
| CudaEvent. Each of these classes has its
| own use.
|
| InlineEvent<…> or a backend-specific
| event should be preferred when the backend
| is known at compile time and known to
| be compiled. Backend-specific events
| may have additional functionality.
|
| This C10Event should be used if a particular
| backend may not be available, or the
| backend required is not known at compile
| time.
|
| These generic events are built on top
| of DeviceGuardImpls, analogous to
| DeviceGuard and InlineDeviceGuard.
| The name “DeviceGuardImpls” is no
| longer entirely accurate, as these
| classes implement the backend-specific
| logic for a generic backend interface.
|
| See DeviceGuardImplInterface.h for
| a list of all supported flags.
|
| PyTorch ddp usage logging capabilities
| DDPLoggingData holds data that can be logged in
| applications for analysis and debugging. Data
| structure is defined in c10 directory so that
| it can be easily imported by both c10 and torch
| files.
|
| DataPtr is a unique pointer (with an attached
| deleter and some context for the deleter) to
| some memory, which also records what device
| its data is for.
|
| nullptr DataPtrs can still have a nontrivial
| device; this allows us to treat zero-size
| allocations uniformly with non-zero
| allocations.
|
| DebugInfoGuard is used to set debug
| information; ThreadLocalDebugInfo is
| semantically immutable, and the values are set
| through the scope-based guard object.
|
| Nested DebugInfoGuard adds/overrides existing
| values in the scope, restoring the original
| values after exiting the scope.
|
| Users can access the values through the
| ThreadLocalDebugInfo::get() call.
|
| QNNPACK AND XNNPACK may out-of-bound access the
| input and / or output tensors. This is
| by-design, and chosen to make the
| implementation of micro-kernels both simpler
| and faster as a result of not having to
| individually handle the corner cases where the
| number of processed elements is not a multiple
| of SIMD register width.
|
| This behavior will trigger ASAN though, and may
| result in a segfault if the accessed memory
| location just so happens to fall on a page the
| current process has no read access to. Here we
| define a custom allocator that allocates the
| extra storage required to keep this behavior
| safe.
|
| This allocator could have been restricted to
| QNNPACK and XNNPACK only, but that would have
| negative performance ramifications, as input
| tensors must now be reallocated, and copied
| over, if the tensor is not allocated with this
| allocator to begin with.
|
| Making this allocator the default on mobile
| builds minimizes the probability of unnecessary
| reallocations and copies, and also enables
| acceleration of operations where the output
| tensor is allocated outside of the function
| doing the implementation, wherein the
| implementation cannot simply re-allocate the
| output with the guarding allocator.
|
| PreGuardBytes: Number of guard bytes to
| allocate before the allocation.
|
| PostGuardBytes: Number of guard bytes to
| allocate after the allocation.
| Like TensorOptions, but all fields
| are guaranteed to be filled.
|
| Represents a compute device on which
| a tensor is located.
|
| A device is uniquely identified by a type,
| which specifies the type of machine it is
| (e.g. CPU or Cuda GPU), and a device index or
| ordinal, which identifies the specific compute
| device when there is more than one of
| a certain type.
|
| The device index is optional, and in its
| defaulted state represents (abstractly) “the
| current device”.
|
| Further, there are two constraints on the
| value of the device index, if one is
| explicitly stored:
|
| 1. A negative index represents the current
| device, a non-negative index represents
| a specific, concrete device,
|
| 2. When the device type is CPU, the device
| index must be zero.
|
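| A short sketch of these rules using the c10::Device constructors:
|
| c10::Device cpu(c10::DeviceType::CPU);       // index defaulted: "current device"
| c10::Device cuda1(c10::DeviceType::CUDA, 1); // a specific, concrete device
| c10::Device parsed("cuda:1");                // string form, equivalent to cuda1
| // c10::Device bad(c10::DeviceType::CPU, 1); // rejected: CPU index must be zero
|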
| RAII guard that sets a certain default device
| in its constructor, and changes it back to the
| device that was originally active upon
| destruction.
|
| The device is always reset to the one that was
| active at the time of construction of the
| guard. Even if you set_device after
| construction, the destructor will still reset
| the device to the one that was active at
| construction time.
|
| This device guard does NOT have an
| uninitialized state; it is guaranteed to reset
| a device on exit. If you are in a situation
| where you might want to set up a guard (i.e.,
| are looking for the moral equivalent of
| optional<DeviceGuard>), see
| OptionalDeviceGuard.
| I can’t conveniently use c10/util/Registry.h
| for the following reason: c10/util/Registry.h
| gives me a slow way of Create’ing an object of
| some interface from the registry, but no way of
| quickly accessing an already created object.
|
| I’ll be banging on getDeviceGuardImpl every
| time we do a DeviceGuard, so I really don’t
| want to be doing an unordered_map
| lookup. Better if the registration mechanism
| directly drops its implementation into
| device_guard_impl_registry.
| A representation of a set of DispatchKeys.
| A tensor may have multiple tensor type ids,
| e.g., a Variable tensor can also be a CPU
| tensor;
|
| the DispatchKeySet specifies what type ids
| apply. The internal representation is as
| a 64-bit bit set (this means only 64 tensor
| type ids are supported).
|
| Note that DispatchKeys are ordered; thus, we
| can ask questions like “what is the highest
| priority DispatchKey in the set”? (The set
| itself is not ordered; two sets with the same
| ids will always have the ids ordered in the
| same way.)
|
| At the moment, there are no nontrivial uses of
| this set; tensors are always singletons. In
| the near future, this set will represent
| variable? + tensor type id. In the far future,
| it will be requires grad? + profiling?
| + tracing? + lazy? + tensor type id.
|
| (The difference between variable and requires
| grad, is that there are currently three states
| a tensor can be:
|
| 1. Not a variable
| 2. Variable with requires_grad=False
| 3. Variable with requires_grad=True
|
| Eventually, we want to kill state (1), and only
| dispatch to autograd handling code if one of
| the inputs requires grad.)
|
| An undefined tensor is one with an empty tensor
| type set.
|
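| A brief usage sketch of DispatchKeySet as described above (the exact
| member names are taken as assumptions from the c10 sources):
|
| DispatchKeySet ks(DispatchKey::CPU);
| ks = ks.add(DispatchKey::AutogradCPU); // value type: add() returns a new set
| bool on_cpu = ks.has(DispatchKey::CPU); // true
| bool undef = ks.empty();                // true only for an undefined tensor
|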
| Used in ATen for non-finite indices. These
| turn into ExitException when they cross to
| Python.
|
| A fake implementation of
| DeviceGuardImplInterface suitable
| for testing.
|
| The current device is modeled as a mutable
| field in the guard implementation class.
|
| See DeviceGuard_test.cpp for an example
| use.
|
| RAII API for manipulating the thread-local
| dispatch state.
|
| Used in ATen for out-of-bound indices that can
| reasonably only be detected lazily inside
| a kernel (See: advanced indexing). These turn
| into IndexError when they cross to Python.
|
| This context is used to generate DataPtr which
| have arbitrary function deleters associated
| with them.
|
| In some user facing functions, we give
| a (user-friendly) interface for constructing
| tensors from external data which take an
| arbitrary function deleter.
|
| Grep for InefficientStdFunctionContext to find
| these occurrences.
|
| This context is inefficient because we have to
| do a dynamic allocation of
| InefficientStdFunctionContext, on top of the
| dynamic allocation which is implied by
| std::function itself.
|
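| A hedged sketch of that user-facing pattern, assuming the
| InefficientStdFunctionContext::makeDataPtr() helper (treat the exact
| signature as an assumption):
|
| void* buf = std::malloc(128);
| c10::DataPtr ptr = c10::InefficientStdFunctionContext::makeDataPtr(
|     buf,
|     [](void* p) { std::free(p); },          // arbitrary function deleter
|     c10::Device(c10::DeviceType::CPU));
| // One extra heap allocation for the context, on top of whatever the
| // std::function itself allocates.
|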
| A RAII, thread local (!) guard that enables or
| disables inference mode upon construction, and
| sets it back to the original value upon
| destruction.
|
| A DeviceGuard is an RAII class that sets
| a device to some value on construction, and
| resets the device to its original value on
| destruction.
|
| InlineDeviceGuard is a helper class for
| implementing DeviceGuards.
|
| It is templated over a DeviceGuardImpl
| (anything that implements
| DeviceGuardImplInterface). There are two
| primary ways to instantiate InlineDeviceGuard:
|
| - With a concrete implementation of
| DeviceGuardImpl, e.g., CUDAGuardImpl.
|
| This is the best way to use
| InlineDeviceGuard, as all calls are
| devirtualized, giving you code as efficient
| as straight line calls to
| cudaGetDevice/cudaSetDevice.
|
| - With VirtualGuardImpl, which does a virtual
| dispatch to a DeviceGuardImpl retrieved from
| a DeviceType registry. We have explicitly
| instantiated InlineDeviceGuard this way as
| DeviceGuard.
|
| If you are in a hurry, you can use
| InlineDeviceGuard directly:
|
| using CUDAGuard = InlineDeviceGuard<CUDAGuardImpl>;
|
| However, you can provide a better user
| experience if you explicitly write a wrapper
| class that itself contains the template
| instantiation:
|
| class CUDAGuard {
|
| // … the API …
|
| InlineDeviceGuard<CUDAGuardImpl> guard_;
| };
|
| The wrapper class provides a good place to
| write documentation, and helps avoid weird
| template instantiation errors when a user
| incorrectly uses the class.
|
| If you need to test this class, consider
| instantiating it with FakeGuardImpl.
|
| Note [Omitted default constructor from RAII]
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| In principle, we could add a default
| constructor to DeviceGuard which reads the
| current device and promises to restore to
| that device on exit. However, in most cases
| where you would have written this, you
| probably meant to actually just use
| OptionalDeviceGuard (since you don’t actually
| need the restore to happen if you don’t ever
| actually set the device).
|
| We remove the constructor here to encourage
| you to think about what you actually want to
| happen.
|
| Copy is disallowed
|
| Move is disallowed, as StreamGuard does not
| have an uninitialized state, which is
| required for moves on types with nontrivial
| destructors.
|
| An OptionalDeviceGuard is an RAII class
| that sets a device to some value on initialization,
| and resets the device to its original
| value on destruction.
|
| InlineOptionalDeviceGuard is a helper
| class for implementing OptionalDeviceGuards.
| See guidance
| in InlineDeviceGuard on how to use this.
| See OptionalDeviceGuard for user-oriented
| usage notes.
|
| Note [Explicit initialization of optional fields]
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| Explicit initialization of optional fields is
| required to work around an nvcc bug; see
| https://github.com/pytorch/pytorch/issues/12117
| An OptionalStreamGuard is an RAII class
| that sets a device to some value on initialization,
| and resets the device to its original
| value on destruction.
|
| See InlineOptionalDeviceGuard for
| more guidance on how to use this class.
|
| A StreamGuard is an RAII class that changes
| the current device to the device corresponding
| to some stream, and changes the default
| stream on that device to be this stream.
|
| InlineStreamGuard is a helper class
| for implementing StreamGuards.
|
| See InlineDeviceGuard for guidance
| on how to use this class.
|
| Copy is disallowed
|
| Move is disallowed, as StreamGuard does not
| have an uninitialized state, which is
| required for moves on types with nontrivial
| destructors.
|
| This class is used to explicitly ignore values
| in the conditional logging macros.
|
| This avoids compiler warnings like “value
| computed is not used” and “statement has no
| effect”.
|
| A MultiStreamGuard is an RAII class
| that sets the current streams of a set
| of devices all at once, and resets them
| to their original values on destruction.
|
| A RAII, thread local (!) guard that stops
| future operations from building gradients.
|
| A no-op device guard impl that doesn’t do
| anything interesting. Useful for devices that
| don’t actually have a concept of device index.
| Prominent examples are CPU and Meta.
|
| Used in ATen for functionality that is not
| implemented. These turn into
| NotImplementedError when they cross to Python.
|
| Used in Onnxifi backend lowering. These
| turn into ExitException when they cross to Python.
|
| An OptionalDeviceGuard is an RAII class that
| sets a device to some value on initialization,
| and resets the device to its original value on
| destruction.
|
| Morally, an OptionalDeviceGuard is equivalent to
| optional<DeviceGuard>, but with extra
| constructors and methods as appropriate.
|
| Besides its obvious use (optionally applying
| a DeviceGuard), OptionalDeviceGuard is often
| also used for the following idiom:
|
| OptionalDeviceGuard g;
| for (const auto& t : tensors) {
| g.set_device(t.device());
| do_something_with(t);
| }
|
| This usage is marginally more efficient than
| constructing a DeviceGuard every iteration of
| the for loop, as it avoids an unnecessary
| device reset.
|
| Unlike DeviceGuard, an OptionalDeviceGuard may
| be uninitialized. This occurs when you use the
| nullary constructor, or pass a nullopt to the
| constructor.
|
| Uninitialized OptionalDeviceGuards do
| nothing; they do not know what the original
| device was and they do not reset on
| destruction. This is why original_device() and
| current_device() return optional<Device> rather
| than Device (as they do in DeviceGuard), and
| also is why we didn’t just provide
| OptionalDeviceGuard by default and hide
| DeviceGuard from users.
|
| The semantics of an OptionalDeviceGuard are
| exactly explained by thinking of it as an
| optional<DeviceGuard>. In particular, an
| initialized OptionalDeviceGuard doesn’t restore
| device to its value at construction; it
| restores device to its value at
| initialization. So if you have the program:
|
| setDevice(1);
| OptionalDeviceGuard g;
| setDevice(2);
| g.reset_device(Device(DeviceType::CUDA, 3)); // initializes!
|
| On destruction, g will reset device to 2,
| rather than 1.
|
| An uninitialized OptionalDeviceGuard is
| distinct from an (initialized) DeviceGuard whose
| original_device_ and current_device_ match,
| since the DeviceGuard will still reset the
| device to original_device_.
| An OptionalStreamGuard is an RAII class
| that sets a device to some value on initialization,
| and resets the device to its original
| value on destruction.
|
| See OptionalDeviceGuard for more guidance
| on how to use this class.
|
| POD version of LocalDispatchKeySet. Declared
| here just so that we can put it in the guards.
|
| This struct encapsulates special handling for
| TLS initialization in set_included()/included()
| API so that they reflect the truth.
|
| If you want to create PODLocalDispatchKeySet
| with non-zero state, use set_included() instead
| of default constructor.
|
| A Context that will call extra placement
| deleter during deconstruction.
|
| Accepts an already constructed DataPtr and
| stores it as a member; during destruction,
| we’ll call the extra deleter on the underlying
| data pointer before the DataPtr is destructed.
| data_ptr_ owns the memory.
|
| A simple struct that is used to report C10’s
| memory allocation and deallocation status to
| the profiler.
|
| Note [Python interpreter tag]
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| We store a PyObject on TensorImpl so that we
| can efficiently translate tensors into the
| Python representations. However, in some
| situations (torchdeploy) there may be multiple
| Python interpreters in a single process and we
| must take care not to accidentally mix up
| PyObjects with the wrong interpreters. Thus,
| we also tag every TensorImpl with the Python
| interpreter it corresponds to.
|
| With torchdeploy, we have these invariants:
|
| - Any given TensorImpl can be associated with
| AT MOST one Python interpreter.
|
| We represent the interpreter tag as a memory
| address to an instance of a virtual class
| that is allocated once per interpreter (this
| is so that we can request the interpreter to
| perform operations for us, if necessary).
|
| - A given TensorImpl’s interpreter tag can
| only go from uninitialized to tagged; once
| tagged, this is a quiescent state (once
| tagged to an interpreter, ALWAYS tagged to
| that interpreter)
|
| - A thread may mutate the PyObject field of
| a TensorImpl if and only if it holds the GIL
| for the interpreter tagged on the
| TensorImpl. (If the TensorImpl is not
| tagged, it must first atomically claim its
| tag before it can validly write)
|
| The PyInterpreter object itself is a class that
| contains some function pointers for interacting
| with the interpreter. For now this is just for
| debugging, but if a C10Tensor can own a PyObject,
| the interpreter can be used to free it.
|
| WARNING: This class has to be written very
| carefully, because it may be possible for
| a C10Tensor to have a reference to an interpreter
| corresponding to a shared library that has
| ALREADY BEEN UNLOADED. This makes blindly
| calling virtual methods very dangerous, because
| the vtable may be garbage at that point (on
| a good day, you might get “pure virtual method
| called”).
|
| The idea to solve this problem is we always
| leak PyInterpreters (so they always stay live
| even after dlclose), and disarm the “virtual
| methods” by replacing them with function
| pointers that just no-op. This can’t be done
| with a traditional C++ vtable, so we have to
| roll our own.
|
| NB: The downside with representing
| PyInterpreter tags as full objects is that it
| takes an extra word on TensorImpl. If tags
| were instead just integer indices, on 64-bit
| architectures we could pack the tag and
| PyObject together into a single atomic word.
| On 32-bit architectures we could simply say
| that only one Python interpreter is supported
| (erroring if a nontrivial interpreter tag is
| attempted to be set).
|
| The difficulty with this scheme is we need to
| maintain an out-of-line table to get at the
| PyInterpreters so that we can do virtual method
| calls on them, and registration/deregistration
| to this table must be done in a thread safe
| manner. This can be easily done if the number
| of possible PyInterpreters is small enough
| (e.g., 8-bit integer) by simply preallocating
| an array of sufficient size to hold all
| possible interpreters. Surely 128 interpreters
| is more than enough for anyone!
|
| I decided not to use this technique at the
| moment, because the extra word added by the
| PyInterpreter tag takes us to 24 words, which
| means that we still fit inside three eight-word
| cache lines. If you need to penny-pinch
| another word, consider doing this!
| DO NOT call this registerer from a torch
| deploy instance! You will clobber other
| registrations
|
| quint4x2 is for unsigned 4 bit quantized
| Tensors that are packed to byte boundary.
|
|
| A template class that allows one to register
| classes by keys.
|
| The keys are usually a string specifying
| the name, but can be anything that can
| be used in a map.
|
| You should most likely not use the Registry
| class explicitly, but use the helper
| macros below to declare specific registries
| as well as registering objects.
|
| Scalar represents a 0-dimensional
| tensor which contains a single element.
|
| Unlike a tensor, numeric literals (in
| C++) are implicitly convertible to
| Scalar (which is why, for example, we
| provide both add(Tensor) and add(Scalar)
| overloads for many operations).
|
| It may also be used in circumstances where
| you statically know a tensor is 0-dim
| and single size, but don’t know its type.
|
| Mostly copied from https://llvm.org/doxygen/ScopeExit_8h_source.html
|
| Represents a location in source code (for
| debugging).
| A storage represents the underlying backing
| data buffer for a tensor.
|
| This concept was inherited from the original
| Torch7 codebase; we’d kind of like to get rid
| of the concept (see
| https://github.com/pytorch/pytorch/issues/14797)
| but it’s hard work and no one has gotten around
| to doing it.
|
| NB: storage is supposed to uniquely own a data
| pointer; e.g., two non-null data pointers alias
| if and only if they are from the same storage.
|
| Technically you can violate this invariant
| (e.g., you can create a non-owning StorageImpl
| with from_blob) but a lot of things won’t work
| correctly, including:
|
| - An ordinary deleter on such a storage is
| wrong, because normal deleters assume unique
| ownership, but if you have two storages at
| the same data, that implies there is some
| sort of shared ownership. So your deleter
| would have to actually be internally doing
| some sort of refcount thing
|
| - Deepcopy in Python side relies on storage
| equality and not data pointer equality; so if
| there are two separate storages pointing to
| the same data, the data will actually get
| duplicated in that case (one data ptr before,
| two data ptrs after)
|
| - Version counts won’t work correctly, because
| we do all VC tracking at the level of
| storages (unless you explicitly disconnect
| the VC with detach); mutations via the other
| storage, whose data pointer is the same, go
| totally untracked
|
| A stream is a software mechanism used
| to synchronize launched kernels without
| requiring explicit synchronizations
| between kernels.
|
| The basic model is that every kernel
| launch is associated with a stream:
| every kernel on the same stream is implicitly
| synchronized so that if I launch kernels
| A and B on the same stream, A is guaranteed
| to finish before B launches. If I want
| B to run concurrently with A, I must schedule
| it on a different stream.
|
| The Stream class is a backend agnostic
| value class representing a stream which
| I may schedule a kernel on.
|
| Every stream is associated with a device,
| which is recorded in the stream; this is
| used to avoid confusion about which
| device a stream refers to.
|
| Streams are explicitly thread-safe,
| in the sense that it is OK to pass a Stream
| from one thread to another, and kernels
| queued from two different threads will
| still get serialized appropriately.
|
| (Of course, the time when the kernels
| get queued is undetermined unless you
| synchronize host side ;)
|
| Stream does NOT have a default constructor.
|
| Streams are for expert users; if you
| want to use Streams, we’re going to assume
| you know how to deal with C++ template
| error messages if you try to resize()
| a vector of Streams.
|
| Known instances of streams in backends:
|
| - cudaStream_t (Cuda)
|
| - hipStream_t (HIP)
|
| - cl_command_queue (OpenCL) (NB: Caffe2’s
| existing OpenCL integration does NOT
| support command queues.)
|
| Because this class is device agnostic,
| it cannot provide backend-specific
| functionality (e.g., get the cudaStream_t
| of a Cuda stream.)
|
| There are wrapper classes which provide
| this functionality, e.g., CudaStream.
|
| A StreamGuard is an RAII class that changes
| the current device to the device corresponding
| to some stream, and changes the default
| stream on that device to be this stream.
|
| Use of StreamGuard is HIGHLY discouraged
| in operator definitions. In a single
| operator, you probably don’t know enough
| about the global state of the world to
| profitably decide how to set streams.
| Let the caller handle this appropriately,
| and just use the current stream in your
| operator code.
|
| This StreamGuard does NOT have an uninitialized
| state; it is guaranteed to reset the
| stream and device on exit. If you are
| in a situation where you might want
| to set up a stream guard, see OptionalStreamGuard.
|
| Copy is disallowed
|
| Move is disallowed, as StreamGuard does not
| have an uninitialized state, which is
| required for moves on types with nontrivial
| destructors.
| The low-level representation of a tensor, which
| contains a pointer to a storage (which contains
| the actual data) and metadata (e.g., sizes and
| strides) describing this particular view of the
| data as a tensor.
|
| Some basic characteristics about our in-memory
| representation of tensors:
|
| - It contains a pointer to a storage struct
| (Storage/StorageImpl) which contains the
| pointer to the actual data and records the
| data type and device of the view. This
| allows multiple tensors to alias the same
| underlying data, which allows us to efficiently
| implement differing views on a tensor.
|
| - The tensor struct itself records
| view-specific metadata about the tensor,
| e.g., sizes, strides and offset into
| storage. Each view of a storage can have
| a different size or offset.
|
| - This class is intrusively refcounted. It is
| refcounted so that we can support prompt
| deallocation of large tensors; it is
| intrusively refcounted so that we can still
| perform reference counted operations on raw
| pointers, which is often more convenient
| when passing tensors across language
| boundaries.
|
| - For backwards-compatibility reasons, a tensor
| may be in an uninitialized state. A tensor
| may be uninitialized in the following two
| ways:
|
| - A tensor may be DTYPE UNINITIALIZED.
| A tensor of this form has an
| uninitialized dtype. This situation
| most frequently arises when a user
| writes C10Tensor x(CPU). The dtype is
| subsequently initialized when
| mutable_data() is
| invoked for the first time.
|
| - A tensor may be STORAGE UNINITIALIZED.
| A tensor of this form has non-zero size,
| but has a storage with a null data
| pointer. This situation most frequently
| arises when a user calls Resize() or
| FreeMemory(). This is because Caffe2
| historically does lazy allocation: allocation of data
| doesn’t occur until mutable_data() is
| invoked. A tensor with zero size is
| always storage initialized, because no
| allocation is necessary in this case.
|
| All combinations of these two uninitialized
| states are possible.
|
| Consider the following transcript in
| idiomatic Caffe2 API:
|
| // x is storage-initialized, dtype-UNINITIALIZED
| C10Tensor x(CPU);
|
| // x is storage-UNINITIALIZED, dtype-UNINITIALIZED
| x.Resize(4);
|
| // x is storage-initialized, dtype-initialized
| x.mutable_data();
|
| // x is storage-UNINITIALIZED, dtype-initialized.
| x.FreeMemory();
|
| All other fields on tensor are always
| initialized. In particular, size is always
| valid. (Historically, a tensor declared as
| C10Tensor x(CPU) also had uninitialized size,
| encoded as numel == -1, but we have now
| decided to default to zero size, resulting
| in numel == 0).
|
| Uninitialized storages MUST be uniquely
| owned, to keep our model simple. Thus, we
| will reject operations which could cause an
| uninitialized storage to become shared (or
| a shared storage to become uninitialized,
| e.g., from FreeMemory).
|
| In practice, tensors which are
| storage-UNINITIALIZED and
| dtype-UNINITIALIZED are extremely
| ephemeral: essentially, after you do
| a Resize(), you basically always call
| mutable_data() immediately afterwards. Most
| functions are not designed to work if given
| a storage-UNINITIALIZED, dtype-UNINITIALIZED
| tensor.
|
| We intend to eliminate all uninitialized
| states, so that every tensor is fully
| initialized in all fields. Please do not
| write new code that depends on these
| uninitialized states.
| A class to encapsulate construction axes of
| a Tensor. TensorOptions was designed to support
| the Python style API for specifying
| construction options on factory functions,
| e.g.,
|
| torch.zeros(2, 3, dtype=torch.int32)
|
| Because C++ doesn’t natively support keyword
| arguments, there must be another way of
| specifying keyword-like arguments.
| TensorOptions is a builder class which can be
| used to construct this “dictionary” of keyword
| arguments: functions which support
| TensorOptions conventionally take this
| argument optionally as their last argument.
|
| WARNING: In PyTorch, there are torch:: variants
| of factory functions, e.g., torch::zeros for
| zeros. These return Variables (while the stock
| ATen functions return plain Tensors). If you
| mix these functions up, you WILL BE SAD.
|
| Rather than use the constructor of this class
| directly, you should prefer to use the
| constructor functions, and then chain setter
| methods on top of them.
|
| device(kCUDA).dtype(kInt)
| dtype(kInt)
|
| Additionally, anywhere a TensorOptions is
| expected, you can directly pass kCUDA / kInt,
| and it will implicitly convert to
| a TensorOptions.
|
| Here are some recommended ways to create a 2x2
| tensor of zeros with certain properties.
| These all implicitly make use of
| TensorOptions, even if they don’t mention the
| class explicitly:
|
| zeros({2,2}, kCUDA);
| zeros({2,2}, kLong);
| zeros({2,2}, device(kCUDA).dtype(kLong()));
| zeros({2,2}, device({kCUDA, 1})); // place on device 1
| zeros({2,2}, requires_grad());
|
|
| NOTE [ TensorOptions Constructors ]
|
| TensorOptions is like a dictionary with
| entries from the set: {requires_grad, device,
| dtype, layout}, where each entry may be
| unspecified (i.e., is optional). It is used to
| specify the properties of tensors in many
| places both in C++ internal and API, e.g.,
| tensor factory methods like empty({10}, options),
| tensor conversions like tensor.to(...), etc.
|
| To provide a simple API that is consistent
| with Python, where one can do
| torch.empty(sizes, X) with X being
| a torch.device, torch.dtype, or a
| torch.layout, we want TensorOptions to be
| implicitly convertible from ScalarType dtype,
| Layout layout and Device device.
|
| Therefore, we have three implicit constructors
| from each of these three types.
|
| This is sufficient for ScalarType and Layout
| as they are simple Enum classes. However,
| Device is an ordinary class with implicit
| constructors Device(DeviceType, DeviceIndex = -1)
| and Device(string), to be consistent with the
| Python API, where strings are treated as
| equivalent with a torch.device object (e.g.,
| “cuda:1” can be passed everywhere
| a torch.device("cuda:1") is accepted). To
| support the syntax empty({10}, {kCUDA, 1}) and
| tensor.to(kCUDA), we need to make sure that
| TensorOptions is implicitly constructible with
| any arguments that a Device can be constructed
| from. So we have,
|
|   /* implicit */ TensorOptions(T&& device) : TensorOptions() {
|     this->set_device(device);
|   }
|
|   template <typename... Args,
|             typename = std::enable_if_t<
|                 std::is_constructible<Device, Args&&...>::value>>
|   /* implicit */ TensorOptions(Args&&... args)
|       : TensorOptions(Device(std::forward<Args>(args)...)) {}
|
|
| But this will be problematic. Consider this:
| TensorOptions({kCUDA, 1}). The compiler will
| complain about ambiguity between the copy
| constructor and the Device constructor,
| because {kCUDA, 1} can be converted to both
| a TensorOptions and a Device.
|
| To get around this, we templatize the Device
| constructor. Since overload resolution is done
| before template resolution, our problem is
| solved.
| Thread local debug information is propagated
| across the forward (including async fork tasks)
| and backward passes and is supposed to be
| utilized by the user’s code to pass extra
| information from the higher layers (e.g. model
| id) down to the lower levels (e.g. to the
| operator observers used for debugging, logging,
| profiling, etc)
| Used in ATen for invalid types. These
| turn into TypeError when they cross
| to Python.
|
| A type id is a unique id for a given C++
| type.
|
| You need to register your types using
| CAFFE_KNOWN_TYPE(MyType) to be able
| to use TypeIdentifier with custom types.
| This is for example used to store the
| dtype of tensors.
|
| This struct holds the actual type
| information. There will be one allocated per
| type. TypeMeta objects will then point to the
| struct instance for the type they’re configured
| for.
| UniqueVoidPtr is an owning smart pointer like
| unique_ptr, but with three major differences:
|
| 1) It is specialized to void
|
| 2) It is specialized for a function pointer
| deleter void(void* ctx); i.e., the
| deleter doesn’t take a reference to the
| data, just to a context pointer (erased
| as void*). In fact, internally, this
| pointer is implemented as having an
| owning reference to context, and
| a non-owning reference to data; this is
| why you release_context(), not release()
| (the conventional API for release()
| wouldn’t give you enough information to
| properly dispose of the object later.)
|
| 3) The deleter is guaranteed to be called
| when the unique pointer is destructed and
| the context is non-null; this is
| different from unique_ptr where the
| deleter is not called if the data pointer
| is null.
|
| Some of the methods have slightly different
| types than unique_ptr to reflect this.
|
| Used in ATen for invalid values. These
| turn into ValueError when they cross
| to Python.
|
| NOTE [ Version Counter Sharing ]
|
| Every C10Tensor has a version counter. Version
| counters are incremented whenever the data or
| size of a tensor changes through in-place
| Variable operations.
|
| Version counters are used to detect
| modifications to saved variables which would
| result in incorrect gradient
| calculations. Version counters may be shared
| between Variables:
|
| 1. A view shares the version counter of the
| base Variable,
|
| 2. x.detach() shares the version counter of x,
|
| 3. Unpacked saved variables share the version
| counter of the source.
|
| Version counters are not shared in these
| scenarios:
|
| 1. When we replace a Variable’s underlying
| C10Tensor by calling set_data(...),
|
| 2. x.data does not share the version counter
| of x. (See discussion at
| https://github.com/pytorch/pytorch/issues/5396)
|
| Question: Why do we put the version counter in
| TensorImpl instead of AutogradMeta?
|
| Answer: After the Variable/C10Tensor merge,
| a tensor will not have AutogradMeta when its
| requires_grad_ is false, but when we use this
| tensor in the forward pass of a function that
| requires saving this tensor for backward, we
| need to keep track of this tensor’s version to
| make sure it’s always valid in the autograd
| graph.
|
| To achieve this goal, we put the version
| counter in TensorImpl instead of AutogradMeta,
| and have it always be available. This allows us
| to have the optimization of not carrying
| AutogradMeta when a tensor doesn’t require
| gradient.
|
| A hypothetical alternative way to achieve this
| goal is to initialize AutogradMeta and create
| the version counter for the non-requires-grad
| tensor only when it’s saved for
| backward. However, since saving a tensor for
| backward happens in the forward pass, and our
| invariant is that forward pass needs to be
| thread-safe, lazy-initializing AutogradMeta
| when saving a tensor can introduce race
| conditions when we are running the forward pass
| in multi-thread scenarios, thus making the
| forward pass not thread-safe anymore, which
| breaks the invariant.
| An implementation of DeviceGuardImplInterface
| which delegates to virtual dispatch
| on the DeviceGuardImpl registry.
|
| A RAII guard that sets warn_always (not
| thread-local) on construction, and sets it back
| to the original value upon destruction.
|
| Usage: Profile allocations made by one run of
| the model.
|
| AllocationPlan plan;
|
| {
| WithProfileAllocationGuard profile_guard(&plan);
| module.forward(...);
| }
| // plan now contains the allocation plan.
| This is the data type for quantized Tensors.
| Right now we only have qint8 which is
| for 8 bit Tensors, and qint32 for 32 bit
| int Tensors; we might have 4 bit, 2 bit
| or 1 bit data types in the future.
|
| qint32 is for signed 32 bit quantized
| Tensors
|
| quint8 is for unsigned 8 bit quantized
| Tensors
|
| Subtract two unsigned integers, X and Y, of
| type T and return the absolute value of the
| result.
|
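| A minimal sketch of the described helper (signature is an assumption):
|
| template <typename T>
| T AbsoluteDifference(T X, T Y) {
|   return (X > Y) ? (X - Y) : (Y - X);
| }
|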
| Aligns \c Addr to \c Alignment bytes, rounding
| up.
|
| Alignment should be a power of two. This
| method rounds up, so alignAddr(7, 4) == 8 and
| alignAddr(8, 4) == 8.
|
| Returns the largest uint64_t less than or
| equal to \p Value and is \p Skew mod \p Align.
| \p Align must be non-zero.
|
| Returns the next integer (mod 2**64) that is greater than or equal to
| \p Value and is a multiple of \p Align. \p Align must be non-zero.
|
| If non-zero \p Skew is specified, the return value will be a minimal
| integer that is greater than or equal to \p Value and equal to
| \p Align * N + \p Skew for some integer N. If \p Skew is larger than
| \p Align, its value is adjusted to ‘\p Skew mod \p Align’.
|
| Examples:
| \code
| alignTo(5, 8) = 8
| alignTo(17, 8) = 24
| alignTo(~0LL, 8) = 0
| alignTo(321, 255) = 510
|
| alignTo(5, 8, 7) = 7
| alignTo(17, 8, 1) = 17
| alignTo(~0LL, 8, 3) = 3
| alignTo(321, 255, 42) = 552
| \endcode
| Returns the next integer (mod 2**64) that is
| greater than or equal to \p Value and is
| a multiple of \c Align. \c Align must be
| non-zero.
|
| Returns the necessary adjustment for
| aligning \c Ptr to \c Alignment bytes,
| rounding up.
|
| This function takes a 64-bit integer
| and returns the bit equivalent double.
|
| This function takes a 32-bit integer
| and returns the bit equivalent float.
|
| This function is not exported
| see tensor_attributes.rst for detailed
| explanation and examples of casting
| rules.
|
| Wrap around axis_index if it is negative,
| s.t., -1 is the last dim
|
| Reads an environment variable and returns
| - optional<true>, if set equal to “1”
| - optional<false>, if set equal to “0”
| - nullopt, otherwise
|
| NB:
| Issues a warning if the value of the
| environment variable is not 0 or 1.
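|
| A hedged sketch of such a helper (the real one lives in c10; the name
| here is illustrative):
|
| #include <cstdlib>
| #include <optional>
| #include <string>
|
| std::optional<bool> check_env_flag(const char* name) {
|   const char* raw = std::getenv(name);
|   if (raw == nullptr) return std::nullopt;
|   std::string value(raw);
|   if (value == "1") return true;
|   if (value == "0") return false;
|   // The real helper warns here about unexpected values.
|   return std::nullopt;
| }
|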
| Helper to verify the GPU index is valid
|
| Helpers for CHECK_NOTNULL(). Two are necessary
| to support both raw pointers and smart
| pointers.
|
| This is intended to be a centralized
| location by which we can determine what
| an appropriate DispatchKey for a tensor is.
|
| Typed copy function for classes.
|
| WARNING: Implementations for this
| function are currently registered
| from ATen and caffe2, not yet from c10.
| Don’t use this unless ATen or caffe2
| is also present.
|
| We can’t move them yet, because the Cuda
| implementations aren’t unified yet
| between ATen and caffe2.
|
| We’re planning to move the implementations
| into c10/backend/xxx to make c10 self
| contained again.
|
| A placeholder function for types that
| do not allow assignment.
|
| Implement copysign for half precision floats
| using bit ops
|
| Sign is the most significant bit for both half
| and bfloat16 types
|
| Count the number of ones from the most
| significant bit to the first zero bit.
|
| Ex. countLeadingOnes(0xFF0FFF00)
| == 8. Only unsigned integral types are
| allowed.
|
| \param ZB the behavior on an input of all
| ones. Only ZeroBehavior::Width and
| ZeroBehavior::Undefined are valid arguments.
|
| Count number of 0’s from the most significant
| bit to the least stopping at the first
| 1.
|
| Only unsigned integral types are allowed.
|
| \param ZB the behavior on an input of 0. Only
| ZeroBehavior::Width and ZeroBehavior::Undefined
| are valid arguments.
|
| Count the number of set bits in a value.
| Ex. countPopulation(0xF000F000) = 8
| Returns 0 if the word is zero.
| Count the number of ones from the least
| significant bit to the first zero bit.
|
| Ex. countTrailingOnes(0x00FF00FF) == 8. Only
| unsigned integral types are allowed.
|
| \param ZB the behavior on an input of all
| ones. Only ZeroBehavior::Width and ZeroBehavior::Undefined are valid
| arguments.
| Count number of 0’s from the least significant
| bit to the most stopping at the first 1.
|
| Only unsigned integral types are allowed.
|
| \param ZB the behavior on an input of 0. Only
| ZeroBehavior::Width and ZeroBehavior::Undefined are valid
| arguments.
|
| Use this version where you’re sure a Cuda
| context exists already.
| Utility to demangle a C++ symbol name.
| Returns the printable name of the type.
| // Deprecation disabled until we fix
| sites in our codebase
|
| C10_DEPRECATED_MESSAGE(“AT_ERROR(msg)
| is deprecated, use TORCH_CHECK(false,
| msg) instead.”)
|
| Convenience function that returns
| a TensorOptions object with the device set
| to the given one.
|
| NB: In the past, we were inconsistent about
| whether or not this reported an error if there
| were driver problems or not. Based on
| experience interacting with users, it seems
| that people basically ~never want this function
| to fail; it should just return zero if things
| are not working.
|
| Oblige them.
|
| It still might log a warning for the user the
| first time it’s invoked
| Version of device_count that throws
| if no devices are detected
|
| Convenience function that returns
| a TensorOptions object with the device set
| to Cuda and the device_index set to the
| given one.
|
| Returns the integer ceil(Numerator / Denominator).
| This function takes a double and returns the
| bit equivalent 64-bit integer.
|
| Note that copying doubles around changes the
| bits of NaNs on some hosts, notably x86, so
| this routine cannot be used if these bits are
| needed.
| returns -1 on failure
| legacy function to support ScalarType
|
| Convenience function that returns
| a TensorOptions object with the dtype
| set to the given one.
|
| Rich logging messages
|
| CAFFE_ENFORCE_THAT can be used with one of the
| “checker functions” that capture input argument
| values and add it to the exception
| message. E.g. CAFFE_ENFORCE_THAT(Equals(foo(x),
| bar(y)), "Optional additional message") would
| evaluate both foo and bar only once and if the
| results are not equal - include them in the
| exception message.
|
| Some of the basic checker functions like Equals
| or Greater are already defined below. Other
| header might define customized checkers by
| adding functions to enforce_detail
| namespace. For example:
|
| namespace caffe2 { namespace enforce_detail {
| inline EnforceFailMessage IsVector(const vector<int64_t>& shape) {
| if (shape.size() == 1) { return EnforceOK(); }
| return str("Shape ", shape, " is not a vector");
| }
| }}
|
| With further usages like
| CAFFE_ENFORCE_THAT(IsVector(Input(0).dims()))
|
| Convenient wrappers for binary operations like
| CAFFE_ENFORCE_EQ are provided too. Please use
| them instead of CHECK_EQ and friends for
| failures in user-provided input.
| Get the index of the first set bit starting
| from the least significant bit.
|
| Only unsigned integral types are allowed.
|
| \param ZB the behavior on an input of 0. Only
| ZeroBehavior::Max and ZeroBehavior::Undefined
| are valid arguments.
|
| Get the index of the last set bit starting
| from the least significant bit.
|
| Only unsigned integral types are allowed.
|
| \param ZB the behavior on an input of 0. Only
| ZeroBehavior::Max and ZeroBehavior::Undefined
| are valid arguments.
|
| This function takes a float and returns the
| bit equivalent 32-bit integer.
|
| Note that copying floats around changes the
| bits of NaNs on some hosts, notably x86, so
| this routine cannot be used if these bits are
| needed.
|
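| A minimal sketch of the float/bits round-trip described above and in the
| BitsToFloat entry; memcpy is used so NaN payload bits are not disturbed:
|
| #include <cstdint>
| #include <cstring>
|
| inline uint32_t FloatToBits(float f) {
|   uint32_t bits;
|   std::memcpy(&bits, &f, sizeof(bits));
|   return bits;
| }
|
| inline float BitsToFloat(uint32_t bits) {
|   float f;
|   std::memcpy(&f, &bits, sizeof(f));
|   return f;
| }
|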
| Internal, use ThreadLocalStateGuard
| Returns a DispatchKeySet of autocast
| related keys mapped to backend.
|
| for a given backend key, return the associated
| autograd key.
|
| for non-backend keys, return AutogradOther as
| a default.
|
| Note: it’s convenient and fast to return
| a default here rather than (say) returning an
| optional, or throwing.
|
| But it makes callers responsible for either a)
| enforcing the invariant that only backend keys
| be passed as arguments, or b) interpreting our
| return value carefully.
|
| Returns a DispatchKeySet of autograd
| related keys mapped to backend.
|
| Returns a DispatchKeySet of all backend keys
| mapped to Autograd dispatch key t; the
| DispatchKeySet is empty if t is not an alias of
| DispatchKey::Autograd.
|
| for a given autograd key, return the
| (guaranteed nonempty) set of associated backend
| keys. for a non-autograd key, return the empty
| keyset.
|
| \note Hardcoded the channel last stride
| indices here to get better performance
|
| Get the CPU Allocator.
| Get the CPU Caching Allocator
| Get the Default CPU Allocator
| Get the Default Mobile CPU Allocator
| A utility function to return an exception
| string by prepending its exception type before
| its what() content
|
| Helper to determine the index of the stream to
| return
|
| Note: Streams are returned round-robin (see
| note in CudaStream.h)
|
| Gets a non-deterministic random number from
| either /dev/urandom or the current
| time.
|
| For Cuda, gets random from random_device and
| adds a transformation on it.
|
| FIXME: The behavior in this function is from
| legacy code
|
| (THRandom_seed/THCRandom_seed) and is probably
| not the right thing to do, even though our
| tests pass. Figure out if tests get perturbed
|
| - when the same algorithm is used for all
| backends. Note that the current behavior is
| different for CPU, Cuda and Windows CPU.
|
| - when using C++11 std objects, such as
| random_device
|
| - when constructing a 64 bit seed properly,
| rather than static casting a 32 bit number to
| 64 bit.
| Resolve alias dispatch key to DispatchKeySet
| if applicable
|
| Gets the global warning handler.
| Return the greatest common divisor
| of the values using Euclid’s algorithm.
|
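| A short sketch of Euclid’s algorithm as described:
|
| template <typename T>
| T gcd(T a, T b) {
|   while (b != 0) {
|     T r = a % b;
|     a = b;
|     b = r;
|   }
|   return a;
| }
|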
| Return the high 32 bits of a 64 bit value.
| Creates the low and high priority stream pools
| for the specified device
|
| Warning: only call once per device!
|
| Populates global values and creates a default
| stream for each device.
|
| Note: the default stream on each device is
| signified by a nullptr, and so is not created
| as usual.
|
| In particular, we don’t need to switch devices
| when creating the streams.
|
| Warning: this function must only be called
| once!
|
| Check if a DispatchKey is an alias mapping
| to other runtime keys.
|
| true if t is a backend dispatch key
| Note [Ambiguous is_channels_last_strides_xd]
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
| The flaw of carrying memory_format implicitly
| through strides is very hard to WAR
| properly. issue #24090
|
| Without the history of permutation, we can’t
| infer the memory_format of a tensor from the
| snapshot of its size & stride
|
| e.g.
|
| 1. We can NOT specify the memory_format of N111
| tensor through strides in a meaningful way;
|
| 2. Two paths that ended up with identical size/stride
|
| N11W contiguous tensor sliced at w-dimension
| becomes [N,1,1,1]@[W,W,W,W]
|
| NC11 channels_last tensor sliced at
| c-dimension becomes [N,1,1,1]@[C,C,C,C]
|
| So if we see a tensor [N,1,1,1]@[X,X,X,X],
| there’s no way for us to infer the
| memory_format of the original tensor.
|
| Due to the limitations, our temporary WAR
| is_channels_last_strides does the best effort
| to infer whether the original memory_format of
| a tensor is MemoryFormat::ChannelsLast. The two
| objectives of this function (ordered by their
| importance):
|
| 1. Ensure that normal shape manipulation does
| not accidentally change the MemoryFormat
| of an existing tensor.
|
| 2. Allows users to mark tensors as
| MemoryFormat::ChannelsLast;
|
| The function does so by checking strides of
| the tensor, including strides of size-1
| dimensions. Although conventionally PyTorch
| implies no restriction on trivial stride
| (stride for size-1 dimension).
|
| Note that this approach is a compromise. We did
| not solve the problem completely. Many cases we
| will not be able to infer the correct memory
| format.
|
| The implementation of
| is_channels_last_strides is to serve the
| objectives:
|
| MemoryFormat::ChannelsLast has to be explicitly
| opted-in (no accidental conversion); Best
| effort to maintain the ChannelsLast flag.
|
| Due to the fact that this is not a bulletproof
| solution, through testing
| (aten/src/ATen/test/memory_format_test.cpp)
|
| a. we ensure that the common tasks are
| supported;
|
| b. we identify corner cases where the
| implementation compromises on.
|
| By the time accumulated permutation is enabled
| to replace implicit memory_format through
| strides, we should be updating our tests and
| fix the issues in our tests.
|
| We use Channels Last 2d as an example above.
|
| This is a general problem for all the
| is_channels_last_strides_xd
| implementations. Please check the helper
| functions (is_channels_last_strides_d_s) for
| more details.
| NOTE:
|
| Below are Helper functions for
| is_channels_last_strides_xd.
|
| 1. Please do not combine these helper
| functions; each helper function handles exactly
| one case of sizes + memory_format. By doing
| this, the strides indices will be a constant
| array and we can access them using a constant
| index number, and the compiler will fully
| unroll the loop on strides indices to gain
| better performance.
|
| 2. No error checks in the helper functions;
| the caller ensures the correctness of the input
|
| 3. All helper functions have similar comments,
| only 1st helper function is commented here.
|
| This API exists because we have a use case for
| checking
| getRuntimeDispatchKeySet(alias).has(DispatchKey::Undefined)
| in OperatorEntry.cpp but we disallow it in
| has() API.
| Checks if an integer fits into the given
| bit width.
|
| Checks if a signed integer fits into the
| given (dynamic) bit width.
| Return true if the argument is a non-empty
| sequence of ones starting at the least
| significant bit with the remainder zero (32
| bit version).
| Return true if the argument is a non-empty
| sequence of ones starting at the least
| significant bit with the remainder zero (64
| bit version).
|
| Return true if the argument is a power of two > 0.
| Return true if the argument is a power of two > 0 (64 bit edition.)
| unix isprint but insensitive to locale
| Checks if a signed integer is an N bit number
| shifted left by S.
| Return true if the argument contains
| a non-empty sequence of ones with the
| remainder zero (32 bit version.)
| Ex. isShiftedMask_32(0x0000FF00U) == true.
|
| Return true if the argument contains
| a non-empty sequence of ones with the
| remainder zero (64 bit version.)
|
| Checks if an unsigned integer is an N bit
| number shifted left by S.
| Checks if an unsigned integer fits into the
| given (dynamic) bit width.
| NB: Per the C++ standard (e.g.,
| https://stackoverflow.com/questions/18195312/what-happens-if-you-static-cast-invalid-value-to-enum-class)
| as long as you cast from the same underlying
| type, it is always valid to cast into an enum
| class (even if the value would be invalid by
| the enum.)
|
| Thus, the caller is allowed to cast a possibly
| invalid int16_t to DeviceType and then pass it
| to this function.
|
| (I considered making this function take an
| int16_t directly, but that just seemed weird.)
|
| Convenience function that returns
| a TensorOptions object with the layout
| set to the given one.
|
| Historically, every tensor only had a single
| DispatchKey, and it was always something like
| CPU, and there wasn’t any of this business
| where TLS could cause the DispatchKey of
| a tensor to change.
|
| But we still have some legacy code that is
| still using DispatchKey for things like
| instanceof checks; if at all possible, refactor
| the code to stop using DispatchKey in those
| cases.
|
| Return the low 32 bits of a 64 bit value.
| Return the log base 2 of the specified value.
| Return the floor log base 2 of the specified
| value, -1 if the value is zero.
| Return the ceil log base 2 of the specified
| value, 32 if the value is zero.
|
| (32 bit edition).
|
| Ex. Log2_32_Ceil(32) == 5, Log2_32_Ceil(1) == 0, Log2_32_Ceil(6) == 3
| Return the floor log base 2 of the specified
| value, -1 if the value is zero.
| Return the ceil log base 2 of the specified
| value, 64 if the value is zero.
|
| (64 bit edition.)
| Return value is needed to do the static
| variable initialization trick
|
| Log a message and terminate.
| Make a 64-bit integer from a high / low pair
| of 32-bit integers.
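|
| A hedged sketch of these 32/64-bit splitting helpers (LLVM-style names,
| treated as assumptions here):
|
| #include <cstdint>
|
| inline uint32_t Hi_32(uint64_t v) { return static_cast<uint32_t>(v >> 32); }
| inline uint32_t Lo_32(uint64_t v) { return static_cast<uint32_t>(v); }
| inline uint64_t Make_64(uint32_t hi, uint32_t lo) {
|   return (static_cast<uint64_t>(hi) << 32) | lo;
| }
|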
| Creates the filename pattern passed to and
| completed by mkstemp.
|
| Returns a vector<char> because mkstemp needs
| a (non-const) char* and std::string only
| provides const char* before C++17.
|
| Keeps the callable object that is passed in,
| and execute it at the destruction of the
| returned object (usually at the scope exit
| where the returned object is kept).
|
| Interface is specified by p0052r2.
|
| Like try_make_tempdir, but throws
| an exception if a temporary directory
| could not be returned.
|
| Like try_make_tempfile, but throws an
| exception if a temporary file could not be
| returned.
|
| Create a bitmask with the N left-most bits set
| to 1, and all other bits set to 0. Only
| unsigned types are allowed.
|
| Create a bitmask with the N left-most bits set
| to 0, and all other bits set to 1. Only
| unsigned types are allowed.
|
| Create a bitmask with the N right-most bits
| set to 1, and all other bits set to 0.
| Only unsigned types are allowed.
|
| Create a bitmask with the N right-most bits
| set to 0, and all other bits set to 1. Only
| unsigned types are allowed.
|
| Gets the maximum value for an N-bit signed
| integer.
| Gets the maximum value for an N-bit unsigned
| integer.
| Convenience function that returns
| a TensorOptions object with the
| memory_format set to the given one.
|
| Fill the data memory region of num bytes with
| a particular garbage pattern.
|
| The garbage value is chosen to be NaN if
| interpreted as floating point value, or a very
| large integer.
| A and B are either alignments or
| offsets. Return the minimum alignment that may
| be assumed after adding the two together.
|
| Gets the minimum value for an N-bit signed
| integer.
| Product of a list of integers; accumulates
| into the int64_t datatype
|
| Returns the next power of two (in 64-bits)
| that is strictly greater than A.
|
| Returns zero on overflow.
|
| A helper function that is basically
| doing nothing.
|
| Product of all dims between k and l (including
| dims[k] and excluding dims[l]). k and
| l may be supplied in either order
|
| Return product of all dimensions starting from k
|
| Returns 1 if k>=dims.size()
|
| Product of all dims up to k (not including
| dims[k]) Throws an error if
| k>dims.size()
|
| Returns the offset to the next integer (mod
| 2**64) that is greater than or equal to \p
| Value and is a multiple of \p Align. \p Align
| must be non-zero.
|
| typeMetaToScalarType(), lifted to
| optional
|
| Destructor for non-fundamental types.
|
| Placement new function for the type.
|
| Returns the power of two which is greater than
| or equal to the given value.
|
| Essentially, it is a ceil operation across the
| domain of powers of two.
|
| Returns the power of two which is less than or
| equal to the given value.
|
| Essentially, it is a floor operation across
| the domain of powers of two.
|
| Gets a random number from /dev/urandom
|
| Note this is a legacy method (from THRandom.cpp)
|
| FIXME: use random_device with entropy
| information
|
| Replace all occurrences of the “from” substring
| with the “to” string.
|
| Returns number of replacements
|
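| A minimal sketch of such a replacement helper (names are assumptions):
|
| #include <string>
|
| inline size_t ReplaceAll(std::string& s, const std::string& from,
|                          const std::string& to) {
|   if (from.empty()) return 0; // avoid an infinite loop on empty patterns
|   size_t count = 0;
|   size_t pos = 0;
|   while ((pos = s.find(from, pos)) != std::string::npos) {
|     s.replace(pos, from.size(), to);
|     pos += to.size();
|     ++count;
|   }
|   return count;
| }
|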
| Convenience function that returns
| a TensorOptions object with the
| requires_grad set to the given one.
|
| Reverse the bits in \p Val.
| Add two unsigned integers, X and Y, of type T.
| Clamp the result to the maximum representable
| value of T on overflow. ResultOverflowed
| indicates if the result is larger than the
| maximum representable value of type T.
|
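| A minimal sketch of the saturating add described above, for unsigned T
| (the signature is an assumption):
|
| #include <limits>
|
| template <typename T>
| T SaturatingAdd(T X, T Y, bool* ResultOverflowed = nullptr) {
|   T Z = X + Y;                // unsigned overflow wraps, so Z < X detects it
|   bool overflowed = (Z < X);
|   if (ResultOverflowed) *ResultOverflowed = overflowed;
|   return overflowed ? std::numeric_limits<T>::max() : Z;
| }
|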
| Multiply two unsigned integers, X and Y, of
| type T. Clamp the result to the maximum
| representable value of T on overflow.
| ResultOverflowed indicates if the result is
| larger than the maximum representable value of
| type T.
|
| Multiply two unsigned integers, X and Y, and
| add the unsigned integer, A to the product.
|
| Clamp the result to the maximum representable
| value of T on overflow.
|
| ResultOverflowed indicates if the result is
| larger than the maximum representable value of
| type T.
|
| convert ScalarType enum values to TypeMeta
| handles
|
| Set the allocator for DeviceType t.
| The passed in allocator pointer is expected
| to have static lifetime; this function
| does NOT take ownership of the raw pointer.
| (The reason for this is to prevent existing
| pointers to an allocator of a particular
| device from being invalidated when
| SetAllocator is called.)
|
| Also note that this is not thread-safe,
| and we assume this function will only
| be called during initialization.
|
| The ‘priority’ flag is introduced when
| we want to overwrite the default allocator,
| since the allocators are set statically.
| The default priority is 0, which means
| the lowest. Only higher or equal priority
| can overwrite existing ones.
|
| API usage logging capabilities
| Sets the CPU allocator to the given allocator:
| the caller gives away the ownership of the
| pointer.
|
| The CPUCachingAllocator is experimental and
| might disappear in the future.
|
| The only place that uses it is in
| StaticRuntime.
|
| Set the CPU Caching Allocator
|
| The TORCH_WARN_ONCE macro is difficult to test
| for. Use setWarnAlways(true) to turn it into
| TORCH_WARN, which can be tested for more
| easily.
|
| Sets the global warning handler. This is not
| thread-safe, so it should generally be called
| once during initialization or while holding
| the GIL for programs that use python.
|
| User is responsible for keeping the
| WarningHandler alive until it is not needed.
|
| Sign-extend the number in the bottom B bits of
| X to a 32-bit integer.
|
| Requires 0 < B <= 32.
|
| Sign-extend the number in the bottom B bits of
| X to a 32-bit integer.
|
| Requires 0 < B < 32.
| Sign-extend the number in the bottom B bits of
| X to a 64-bit integer.
|
| Requires 0 < B < 64.
|
| Sign-extend the number in the bottom B bits of
| X to a 64-bit integer.
|
| Requires 0 < B < 64.
|
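| A minimal sketch of the 32-bit variant described above (relies on
| arithmetic right shift of signed values; 0 < B <= 32 assumed):
|
| #include <cstdint>
|
| inline int32_t SignExtend32(uint32_t X, unsigned B) {
|   return int32_t(X << (32 - B)) >> (32 - B);
| }
|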
| Product of all dims between k and l (not
| including dims[k] and dims[l])
|
| Return product of all dimensions starting
| from k
|
| Product of all dims up to k (not including
| dims[k])
|
| StreamId is 64-bit, so we can just rely on
| regular promotion rules.
|
| We rely on streamIdIndex and streamIdType being
| non-negative; see Note [Hazard when
| concatenating signed integers]
| Obtains the base name from a full path.
| Sum of a list of integers; accumulates
| into the int64_t datatype
|
| Mechanism for throwing errors which can’t be
| prevented at compile time due to type
| erasure. E.g. somebody calling TypeMeta::copy()
| for non-copyable type. Right now just throws
| exception but is implemented in .cpp to manage
| dependencies
| Non-RAII API
|
| Please prefer using the RAII API. See
| declarations in LocalDispatchKeySet.h for
| details.
|
| Non-RAII API for manipulating the thread-local
| dispatch state.
|
| Please prefer the RAII API. The non-RAII API
| may be useful when the included/excluded state
| of a given DispatchKey must span many calls
| from the Python to the C++, so you cannot
| conveniently use an RAII guard.
|
| Example use case: a Python context manager
| that includes a certain DispatchKey, to ensure
| ops running under the context manager dispatch
| through that DispatchKey’s registered
| overrides.
|
| The non-RAII API is less efficient than the
| RAII guards because both the getter and setter
| will do a tls_getaddr lookup (the RAII struct
| only needs one!)
| A utility function to convert vector<int>
| to vector<int64_t>.
|
| The str() call that creates userMsg can have
| 1 of 3 return types depending on the number and
| types of arguments passed to
| TORCH_INTERNAL_ASSERT.
|
| 0 arguments will get a CompileTimeEmptyString,
| 1 const char * will be passed straight through,
| and anything else will get converted to string.
|
| This should never be called. It is provided in
| case of compilers that don’t do any dead code
| stripping in debug builds.
|
| Attempts to return a temporary directory or
| returns nullopt if an error occurred.
|
| The directory returned follows the pattern
| <tmp-dir>/<name-prefix><random-pattern>/,
| where <tmp-dir> is the value of the
| "TMPDIR", "TMP", "TEMP" or "TEMPDIR"
| environment variable if any is set, or
| otherwise /tmp; <name-prefix> is the value
| supplied to this function, and
| <random-pattern> is a random sequence of
| numbers.
|
| On Windows, name_prefix is ignored and
| tmpnam is used.
|
| Attempts to return a temporary file or returns
| nullopt if an error occurred.
|
| The file returned follows the pattern
| <tmp-dir>/<name-prefix><random-pattern>,
| where <tmp-dir> is the value of the
| "TMPDIR", "TMP", "TEMP" or "TEMPDIR"
| environment variable if any is set, or
| otherwise /tmp; <name-prefix> is the value
| supplied to this function, and
| <random-pattern> is a random sequence of
| numbers.
|
| On Windows, name_prefix is ignored and
| tmpnam is used.
|
| convert TypeMeta handles to ScalarType
| enum values
|
| Issue a warning with a given message.
| Dispatched to the current warning handler.
|