orx_concurrent_vec/grow.rs
use crate::{elem::ConcurrentElement, ConcurrentVec};
use core::sync::atomic::Ordering;
use orx_pinned_vec::IntoConcurrentPinnedVec;
impl<T, P> ConcurrentVec<T, P>
where
P: IntoConcurrentPinnedVec<ConcurrentElement<T>>,
{
    /// Concurrent, thread-safe method to push the given `value` to the back of the bag, returning the position or index of the pushed value.
///
/// It preserves the order of elements with respect to the order the `push` method is called.
///
/// # Panics
///
/// Panics if the concurrent bag is already at its maximum capacity; i.e., if `self.len() == self.maximum_capacity()`.
///
    /// Note that this is an important safety assertion in the concurrent context; it is, however, not a practical limitation.
/// Please see the [`orx_pinned_concurrent_col::PinnedConcurrentCol::maximum_capacity`] for details.
///
/// # Examples
///
/// We can directly take a shared reference of the bag, share it among threads and collect results concurrently.
///
/// ```rust
/// use orx_concurrent_vec::*;
///
/// let (num_threads, num_items_per_thread) = (4, 1_024);
///
/// let vec = ConcurrentVec::new();
///
/// std::thread::scope(|s| {
/// let vec = &vec;
/// for i in 0..num_threads {
/// s.spawn(move || {
/// for j in 0..num_items_per_thread {
/// // concurrently collect results simply by calling `push`
/// vec.push(i * 1000 + j);
/// }
/// });
/// }
/// });
///
/// let mut vec = vec.to_vec();
/// vec.sort();
/// let mut expected: Vec<_> = (0..num_threads).flat_map(|i| (0..num_items_per_thread).map(move |j| i * 1000 + j)).collect();
/// expected.sort();
/// assert_eq!(vec, expected);
/// ```
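    ///
    /// The returned index can further be used to relate pushed values to their positions.
    /// A minimal single-threaded sketch, using only `push` and `to_vec` shown above:
    ///
    /// ```rust
    /// use orx_concurrent_vec::*;
    ///
    /// let vec = ConcurrentVec::new();
    /// let i0 = vec.push('a'); // written to position 0
    /// let i1 = vec.push('b'); // written to position 1
    /// assert_eq!((i0, i1), (0, 1));
    /// assert_eq!(vec.to_vec(), vec!['a', 'b']);
    /// ```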
///
/// # Performance Notes - False Sharing
///
    /// [`ConcurrentVec::push`] implementation is lock-free and focuses on efficiency.
    /// However, we need to be aware of the potential [false sharing](https://en.wikipedia.org/wiki/False_sharing) risk,
    /// which might lead to significant performance degradation.
    /// Fortunately, it is possible to avoid in many cases.
///
/// ## When?
///
/// Performance degradation due to false sharing might be observed when both of the following conditions hold:
    /// * **small data**: the data to be pushed is small; the more elements fit in a cache line, the bigger the risk,
    /// * **little work**: multiple threads/cores are pushing to the concurrent bag with high frequency; i.e.,
    ///   very little or negligible work / time is required in between `push` calls.
///
/// The example above fits this situation.
/// Each thread only performs one multiplication and addition in between pushing elements, and the elements to be pushed are very small, just one `usize`.
///
/// ## Why?
///
    /// * `ConcurrentVec` assigns a unique position to each value to be pushed. There is no *true* sharing among threads at the position level.
    /// * However, cache lines contain more than one position.
    /// * One thread updating a particular position invalidates the entire cache line on another thread.
    /// * Threads end up frequently reloading cache lines instead of doing the actual work of writing elements to the bag.
    /// * This might lead to significant performance degradation.
///
/// ### Solution: `extend` rather than `push`
///
/// One very simple, effective and memory efficient solution to this problem is to use [`ConcurrentVec::extend`] rather than `push` in *small data & little work* situations.
///
/// Assume that we will have 4 threads and each will push 1_024 elements.
/// Instead of making 1_024 `push` calls from each thread, we can make one `extend` call from each.
/// This would give the best performance.
    /// Further, it has zero buffer or memory cost:
    /// * it is important to note that the batch of 1_024 elements is not stored temporarily in another buffer,
    /// * there is no additional allocation,
    /// * `extend` does nothing more than reserve the position range for the thread by incrementing the atomic counter accordingly.
///
    /// However, we do not need such perfect information about the number of elements to be pushed.
    /// Performance gains beyond the cache line size are much smaller.
///
/// For instance, consider the challenging super small element size case, where we are collecting `i32`s.
/// We can already achieve a very high performance by simply `extend`ing the bag by batches of 16 elements.
///
    /// As the element size gets larger, the required batch size to achieve high performance gets smaller.
///
/// Required change in the code from `push` to `extend` is not significant.
    /// The example above could be revised as follows to avoid the performance degradation due to false sharing.
///
/// ```rust
/// use orx_concurrent_vec::*;
///
/// let (num_threads, num_items_per_thread) = (4, 1_024);
///
/// let vec = ConcurrentVec::new();
/// let batch_size = 16;
///
/// std::thread::scope(|s| {
/// let vec = &vec;
/// for i in 0..num_threads {
/// s.spawn(move || {
/// for j in (0..num_items_per_thread).step_by(batch_size) {
/// let iter = (j..(j + batch_size)).map(|j| i * 1000 + j);
/// // concurrently collect results simply by calling `extend`
/// vec.extend(iter);
/// }
/// });
/// }
/// });
///
/// let mut vec = vec.to_vec();
/// vec.sort();
/// let mut expected: Vec<_> = (0..num_threads).flat_map(|i| (0..num_items_per_thread).map(move |j| i * 1000 + j)).collect();
/// expected.sort();
/// assert_eq!(vec, expected);
/// ```
pub fn push(&self, value: T) -> usize {
let idx = self.len_reserved().fetch_add(1, Ordering::Relaxed);
// # SAFETY: ConcurrentVec ensures that each `idx` will be written only and exactly once.
let maybe = unsafe { self.core.single_item_as_ref(idx) };
unsafe { maybe.0.initialize_unchecked(value) };
idx
}
/// Pushes the value which will be computed as a function of the index where it will be written.
///
    /// Note that when `push`ing, we cannot know the index of the element ahead of time, since there
    /// might be many pushes happening concurrently. In cases where we absolutely need to know the index,
    /// in other words, when the value depends on the index, we can use `push_for_idx`.
///
/// # Examples
///
/// ```rust
/// use orx_concurrent_vec::*;
///
/// let vec = ConcurrentVec::new();
/// vec.push(0);
/// vec.push_for_idx(|i| i * 2);
/// vec.push_for_idx(|i| i + 10);
/// vec.push(42);
///
/// assert_eq!(&vec, &[0, 2, 12, 42]);
/// ```
pub fn push_for_idx<F>(&self, f: F) -> usize
where
F: FnOnce(usize) -> T,
{
let idx = self.len_reserved().fetch_add(1, Ordering::Relaxed);
let value = f(idx);
// # SAFETY: ConcurrentVec ensures that each `idx` will be written only and exactly once.
let maybe = unsafe { self.core.single_item_as_ref(idx) };
unsafe { maybe.0.initialize_unchecked(value) };
idx
}
/// Concurrent, thread-safe method to push all `values` that the given iterator will yield to the back of the bag.
/// The method returns the position or index of the first pushed value (returns the length of the concurrent bag if the iterator is empty).
///
/// All `values` in the iterator will be added to the bag consecutively:
/// * the first yielded value will be written to the position which is equal to the current length of the bag, say `begin_idx`, which is the returned value,
/// * the second yielded value will be written to the `begin_idx + 1`-th position,
/// * ...
/// * and the last value will be written to the `begin_idx + values.count() - 1`-th position of the bag.
///
/// Important notes:
    /// * This method does not allocate to buffer the values to be pushed.
/// * All it does is to increment the atomic counter by the length of the iterator (`push` would increment by 1) and reserve the range of positions for this operation.
/// * If there is not sufficient space, the vector grows first; iterating over and writing elements to the vec happens afterwards.
/// * Therefore, other threads do not wait for the `extend` method to complete, they can concurrently write.
/// * This is a simple and effective approach to deal with the false sharing problem.
///
/// For this reason, the method requires an `ExactSizeIterator`.
/// There exists the variant [`ConcurrentVec::extend_n_items`] method which accepts any iterator together with the correct length to be passed by the caller.
    /// The caller must guarantee that the iterator yields at least the number of elements explicitly passed in as an argument; the method panics otherwise.
///
/// # Panics
///
/// Panics if not all of the `values` fit in the concurrent bag's maximum capacity.
///
    /// Note that this is an important safety assertion in the concurrent context; it is, however, not a practical limitation.
/// Please see the [`orx_pinned_concurrent_col::PinnedConcurrentCol::maximum_capacity`] for details.
///
/// # Examples
///
/// We can directly take a shared reference of the bag and share it among threads.
///
/// ```rust
/// use orx_concurrent_vec::*;
///
/// let (num_threads, num_items_per_thread) = (4, 1_024);
///
/// let vec = ConcurrentVec::new();
/// let batch_size = 16;
///
/// std::thread::scope(|s| {
/// let vec = &vec;
/// for i in 0..num_threads {
/// s.spawn(move || {
/// for j in (0..num_items_per_thread).step_by(batch_size) {
/// let iter = (j..(j + batch_size)).map(|j| i * 1000 + j);
/// // concurrently collect results simply by calling `extend`
/// vec.extend(iter);
/// }
/// });
/// }
/// });
///
/// let mut vec: Vec<_> = vec.to_vec();
/// vec.sort();
/// let mut expected: Vec<_> = (0..num_threads).flat_map(|i| (0..num_items_per_thread).map(move |j| i * 1000 + j)).collect();
/// expected.sort();
/// assert_eq!(vec, expected);
/// ```
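    ///
    /// The returned index marks the beginning of the reserved range.
    /// A minimal single-threaded sketch, using only methods shown above:
    ///
    /// ```rust
    /// use orx_concurrent_vec::*;
    ///
    /// let vec = ConcurrentVec::new();
    /// let begin = vec.extend([1, 2, 3]);
    /// assert_eq!(begin, 0); // first batch starts at position 0
    /// let begin = vec.extend([4, 5]);
    /// assert_eq!(begin, 3); // second batch starts right after the first
    /// assert_eq!(vec.to_vec(), vec![1, 2, 3, 4, 5]);
    /// ```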
///
/// # Performance Notes - False Sharing
///
    /// [`ConcurrentVec::push`] implementation is simple, lock-free and efficient.
/// However, we need to be aware of the potential [false sharing](https://en.wikipedia.org/wiki/False_sharing) risk.
/// False sharing might lead to significant performance degradation; fortunately, it is possible to avoid in many cases.
///
/// ## When?
///
/// Performance degradation due to false sharing might be observed when both of the following conditions hold:
    /// * **small data**: the data to be pushed is small; the more elements fit in a cache line, the bigger the risk,
    /// * **little work**: multiple threads/cores are pushing to the concurrent bag with high frequency; i.e.,
    ///   very little or negligible work / time is required in between `push` calls.
///
/// The example above fits this situation.
/// Each thread only performs one multiplication and addition for computing elements, and the elements to be pushed are very small, just one `usize`.
///
/// ## Why?
///
    /// * `ConcurrentVec` assigns a unique position to each value to be pushed. There is no *true* sharing among threads at the position level.
    /// * However, cache lines contain more than one position.
    /// * One thread updating a particular position invalidates the entire cache line on another thread.
    /// * Threads end up frequently reloading cache lines instead of doing the actual work of writing elements to the bag.
    /// * This might lead to significant performance degradation.
///
/// ### Solution: `extend` rather than `push`
///
/// One very simple, effective and memory efficient solution to the false sharing problem is to use [`ConcurrentVec::extend`] rather than `push` in *small data & little work* situations.
///
/// Assume that we will have 4 threads and each will push 1_024 elements.
/// Instead of making 1_024 `push` calls from each thread, we can make one `extend` call from each.
/// This would give the best performance.
    /// Further, it has zero buffer or memory cost:
    /// * it is important to note that the batch of 1_024 elements is not stored temporarily in another buffer,
    /// * there is no additional allocation,
    /// * `extend` does nothing more than reserve the position range for the thread by incrementing the atomic counter accordingly.
///
    /// However, we do not need such perfect information about the number of elements to be pushed.
    /// Performance gains beyond the cache line size are much smaller.
///
/// For instance, consider the challenging super small element size case, where we are collecting `i32`s.
/// We can already achieve a very high performance by simply `extend`ing the bag by batches of 16 elements.
///
    /// As the element size gets larger, the required batch size to achieve high performance gets smaller.
///
/// The example code above already demonstrates the solution to a potentially problematic case in the [`ConcurrentVec::push`] example.
pub fn extend<IntoIter, Iter>(&self, values: IntoIter) -> usize
where
IntoIter: IntoIterator<Item = T, IntoIter = Iter>,
Iter: Iterator<Item = T> + ExactSizeIterator,
{
let values = values.into_iter();
let num_items = values.len();
self.extend_n_items::<_>(values, num_items)
}
/// Extends the vector with the values of the iterator which is created as a function of the
/// index that the first element of the iterator will be written to.
///
    /// Note that when `extend`ing, we cannot know the indices of the elements ahead of time, since there
    /// might be many pushes or extends happening concurrently. In cases where we absolutely need to know
    /// the indices, in other words, when the values depend on the indices, we can use `extend_for_idx`.
///
/// # Panics
///
/// Panics if the iterator created by `f` does not yield `num_items` elements.
///
/// # Examples
///
/// ```rust
/// use orx_concurrent_vec::*;
///
/// let vec = ConcurrentVec::new();
///
/// vec.push(0);
///
    /// let iter = |begin_idx: usize| (begin_idx..(begin_idx + 3)).map(|i| i * 5);
/// vec.extend_for_idx(|begin_idx| iter(begin_idx), 3);
/// vec.push(42);
///
/// assert_eq!(&vec, &[0, 5, 10, 15, 42]);
/// ```
pub fn extend_for_idx<IntoIter, Iter, F>(&self, f: F, num_items: usize) -> usize
where
IntoIter: IntoIterator<Item = T, IntoIter = Iter>,
Iter: Iterator<Item = T> + ExactSizeIterator,
F: FnOnce(usize) -> IntoIter,
{
let begin_idx = self.len_reserved().fetch_add(num_items, Ordering::Relaxed);
let slices = unsafe { self.core.n_items_buffer_as_slices(begin_idx, num_items) };
let mut values = f(begin_idx).into_iter();
assert_eq!(values.len(), num_items);
for slice in slices {
for maybe in slice {
let value = values
.next()
.expect("provided iterator is shorter than expected num_items");
unsafe { maybe.0.initialize_unchecked(value) };
}
}
begin_idx
}
/// Concurrent, thread-safe method to push `num_items` elements yielded by the `values` iterator to the back of the bag.
/// The method returns the position or index of the first pushed value (returns the length of the concurrent bag if the iterator is empty).
///
/// All `values` in the iterator will be added to the bag consecutively:
/// * the first yielded value will be written to the position which is equal to the current length of the bag, say `begin_idx`, which is the returned value,
/// * the second yielded value will be written to the `begin_idx + 1`-th position,
/// * ...
/// * and the last value will be written to the `begin_idx + num_items - 1`-th position of the bag.
///
/// Important notes:
    /// * This method does not allocate to buffer the elements to be pushed.
/// * All it does is to increment the atomic counter by the length of the iterator (`push` would increment by 1) and reserve the range of positions for this operation.
/// * Iterating over and writing elements to the vec happens afterwards.
/// * This is a simple, effective and memory efficient solution to the false sharing problem.
///
/// For this reason, the method requires the additional `num_items` argument.
/// There exists the variant [`ConcurrentVec::extend`] method which accepts only an `ExactSizeIterator`.
///
/// # Panics
///
    /// Panics if the `values` iterator yields fewer than `num_items` elements.
///
/// # Examples
///
/// We can directly take a shared reference of the bag and share it among threads.
///
/// ```rust
/// use orx_concurrent_vec::*;
///
/// let (num_threads, num_items_per_thread) = (4, 1_024);
///
/// let vec = ConcurrentVec::new();
/// let batch_size = 16;
///
/// std::thread::scope(|s| {
/// let vec = &vec;
/// for i in 0..num_threads {
/// s.spawn(move || {
/// for j in (0..num_items_per_thread).step_by(batch_size) {
/// let iter = (j..(j + batch_size)).map(|j| i * 1000 + j);
/// // concurrently collect results simply by calling `extend_n_items`
    /// vec.extend_n_items(iter, batch_size);
/// }
/// });
/// }
/// });
///
/// let mut vec: Vec<_> = vec.to_vec();
/// vec.sort();
/// let mut expected: Vec<_> = (0..num_threads).flat_map(|i| (0..num_items_per_thread).map(move |j| i * 1000 + j)).collect();
/// expected.sort();
/// assert_eq!(vec, expected);
/// ```
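    ///
    /// `extend_n_items` is useful when the caller knows the length up front but the iterator itself
    /// is not an `ExactSizeIterator`. A minimal single-threaded sketch with a filtered range whose
    /// exact yield count we know ahead of time:
    ///
    /// ```rust
    /// use orx_concurrent_vec::*;
    ///
    /// let vec = ConcurrentVec::new();
    /// // `filter` loses `ExactSizeIterator`; we pass the known count of 5 explicitly
    /// let evens = (0..10).filter(|x| x % 2 == 0);
    /// let begin_idx = vec.extend_n_items(evens, 5);
    /// assert_eq!(begin_idx, 0);
    /// assert_eq!(vec.to_vec(), vec![0, 2, 4, 6, 8]);
    /// ```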
///
/// # Performance Notes - False Sharing
///
    /// [`ConcurrentVec::push`] implementation is simple, lock-free and efficient.
/// However, we need to be aware of the potential [false sharing](https://en.wikipedia.org/wiki/False_sharing) risk.
/// False sharing might lead to significant performance degradation; fortunately, it is possible to avoid in many cases.
///
/// ## When?
///
/// Performance degradation due to false sharing might be observed when both of the following conditions hold:
    /// * **small data**: the data to be pushed is small; the more elements fit in a cache line, the bigger the risk,
    /// * **little work**: multiple threads/cores are pushing to the concurrent bag with high frequency; i.e.,
    ///   very little or negligible work / time is required in between `push` calls.
///
/// The example above fits this situation.
/// Each thread only performs one multiplication and addition for computing elements, and the elements to be pushed are very small, just one `usize`.
///
/// ## Why?
///
    /// * `ConcurrentVec` assigns a unique position to each value to be pushed. There is no *true* sharing among threads at the position level.
    /// * However, cache lines contain more than one position.
    /// * One thread updating a particular position invalidates the entire cache line on another thread.
    /// * Threads end up frequently reloading cache lines instead of doing the actual work of writing elements to the bag.
    /// * This might lead to significant performance degradation.
///
/// ### Solution: `extend` rather than `push`
///
/// One very simple, effective and memory efficient solution to the false sharing problem is to use [`ConcurrentVec::extend`] rather than `push` in *small data & little work* situations.
///
/// Assume that we will have 4 threads and each will push 1_024 elements.
/// Instead of making 1_024 `push` calls from each thread, we can make one `extend` call from each.
/// This would give the best performance.
    /// Further, it has zero buffer or memory cost:
    /// * it is important to note that the batch of 1_024 elements is not stored temporarily in another buffer,
    /// * there is no additional allocation,
    /// * `extend` does nothing more than reserve the position range for the thread by incrementing the atomic counter accordingly.
///
    /// However, we do not need such perfect information about the number of elements to be pushed.
    /// Performance gains beyond the cache line size are much smaller.
///
/// For instance, consider the challenging super small element size case, where we are collecting `i32`s.
/// We can already achieve a very high performance by simply `extend`ing the bag by batches of 16 elements.
///
    /// As the element size gets larger, the required batch size to achieve high performance gets smaller.
///
/// The example code above already demonstrates the solution to a potentially problematic case in the [`ConcurrentVec::push`] example.
pub fn extend_n_items<IntoIter>(&self, values: IntoIter, num_items: usize) -> usize
where
IntoIter: IntoIterator<Item = T>,
{
let begin_idx = self.len_reserved().fetch_add(num_items, Ordering::Relaxed);
let slices = unsafe { self.core.n_items_buffer_as_slices(begin_idx, num_items) };
let mut values = values.into_iter();
for slice in slices {
for maybe in slice {
let value = values
.next()
.expect("provided iterator is shorter than expected num_items");
unsafe { maybe.0.initialize_unchecked(value) };
}
}
begin_idx
}
}