opencache 0.0.1

# opencache

Opencache is a cache server written in Rust that runs in single-node mode. It is an object store that provides two APIs,
ranged `write` and ranged `read`, plus background auto-eviction.

Cache cluster management is handled on the client side. Back-sourcing is not supported; an application has to explicitly
write objects into the cache.

# Protocols

- http2 PUT GET
- (maybe) aws-s3
- (maybe) grpc stream


# Storage model

CacheServer stores an object (a byte stream identified by a unique UTF-8 string key) in chunks.
Chunks belonging to the same object have equal sizes.

- When reading an object, the first and last chunk may be partially returned if
    the specified range does not align to the chunk size.

- When writing an object, leading and trailing non-aligned bytes are discarded,
    because opencache stores object content in fixed-size chunks.

    Optionally, the chunk size can be specified when an object is added (at the
    first write) and cannot be changed afterwards.

## Chunk

A `Chunk` is merely a byte slice.

Chunk size can be specified by an application (when writing an object for the
first time) or decided by the opencache server. The default chunk size is
calculated as:

```
default_chunk_size = max(min(file_size / 64, 2M), 64K)
```
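As a sketch, the default-size rule above is a single clamp of `file_size / 64` into `[64K, 2M]` (constant and function names here are illustrative, not opencache's real API):

```rust
const KB: u64 = 1 << 10;
const MB: u64 = 1 << 20;

/// Default chunk size: 1/64 of the file size, clamped to [64K, 2M].
fn default_chunk_size(file_size: u64) -> u64 {
    (file_size / 64).clamp(64 * KB, 2 * MB)
}

fn main() {
    assert_eq!(default_chunk_size(1 * MB), 64 * KB); // small file: clamped up
    assert_eq!(default_chunk_size(512 * MB), 2 * MB); // large file: clamped down
    assert_eq!(default_chunk_size(16 * MB), 256 * KB); // in between: size / 64
}
```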


## Chunk group

Chunks of different sizes are stored in different chunk groups (`ChunkGroup`), e.g.:
- `(0, 4KB]`
- `(4KB, 1MB]`
- `(1MB, 8MB]`
- `(8MB, +oo]`

Opencache has several predefined ChunkGroups, and the actual chunk size used
for an object is the smallest predefined chunk size that is no less than the
specified one.


## Object

An object in opencache consists of an object header and several chunks.

The header contains application metadata in `JSON` plus internal metadata.
The internal metadata contains:
- the time when the object was added,
- the total size,
- the chunk size,
- a bitmap of stored chunks,
- the chunk-id of every stored chunk.


```
   .------------.
   |   object   |
   '------------'
      |       |
      |       '-------------.
      | header              |
      |                     |
      v                     |
   .-----------------.      |        .----------.
   | json-meta       |      +------->| chunk-1  |
   | size            |      |        '----------'
   | chunk-bitmap    |      |
   | chunk-ids       |      |        .----------.
   '-----------------'      '------->| chunk-2  |
                                     '----------'
```


Opencache stores objects in a persistent map of `<key, header>`.
This map is loaded into memory when the server starts up.

# Data types

## Chunk group id

`ChunkGroupId` is 2 ASCII chars in a file name, or 1 byte in memory.
`ChunkGroupId` is an encoded representation of the size of the chunks in the
group, similar to a floating-point representation.

The first char (or the 4 higher bits) is the `power`.
The second char (or the 4 lower bits) is the `significand`.

```
        power          significand
ascii   0              1
bits    4 higher bits  4 lower bits
```

The chunk size is calculated as follows: `significand * 16^power`.

E.g., the chunk group id of `48K` chunks is `3c`: `12 * 16^3 = 48K  // c=12`.

The max chunk group id is `ff`: `15 * 16^15 = 15E`.

Several ChunkGroupId examples are listed as below:

```
02: 2 * 16^0 = 2        08: 8 * 16^0 = 8
12: 2 * 16^1 = 32       18: 8 * 16^1 = 128
22: 2 * 16^2 = 512      28: 8 * 16^2 = 2K
32: 2 * 16^3 = 8K       38: 8 * 16^3 = 32K
42: 2 * 16^4 = 128K     48: 8 * 16^4 = 512K
52: 2 * 16^5 = 2M       58: 8 * 16^5 = 8M
62: 2 * 16^6 = 32M      68: 8 * 16^6 = 128M
72: 2 * 16^7 = 512M     78: 8 * 16^7 = 2G
```
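A minimal sketch of this encoding, following the bit layout above; the brute-force `encode_chunk_group` helper is a hypothetical illustration, not opencache's real API:

```rust
/// Decode a ChunkGroupId: the high 4 bits are the `power`, the low 4 bits
/// are the `significand`; chunk size = significand * 16^power.
fn chunk_group_size(id: u8) -> u64 {
    let power = (id >> 4) as u32;
    let significand = (id & 0x0f) as u64;
    significand * 16u64.pow(power)
}

/// Hypothetical helper: the smallest encodable chunk size that is no less
/// than `size`, found by brute force over all 256 possible ids.
fn encode_chunk_group(size: u64) -> u8 {
    (0..=u8::MAX)
        .filter(|&id| chunk_group_size(id) >= size)
        .min_by_key(|&id| chunk_group_size(id))
        .expect("size exceeds the largest chunk group")
}

fn main() {
    assert_eq!(chunk_group_size(0x3c), 48 * 1024); // "3c" -> 12 * 16^3 = 48K
    assert_eq!(chunk_group_size(0x78), 2 * 1024 * 1024 * 1024); // "78" -> 2G
    assert_eq!(encode_chunk_group(48 * 1024), 0x3c);
}
```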

## Chunk id

`ChunkId` is a `u64`,
in which the most significant byte is the `ChunkGroupId`,
and the other 7 bytes are a monotonically increasing index (a 56-bit integer).

```
       <ChunkGroupId> <index>
byte:  [0]            [1, 7]
```
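The layout above can be sketched with a few bit operations (the function names are assumptions):

```rust
/// Pack a ChunkId: the most significant byte is the ChunkGroupId,
/// the low 7 bytes are a monotonically increasing 56-bit index.
fn make_chunk_id(group: u8, index: u64) -> u64 {
    debug_assert!(index < (1u64 << 56));
    ((group as u64) << 56) | index
}

/// Extract the ChunkGroupId byte.
fn chunk_group_of(chunk_id: u64) -> u8 {
    (chunk_id >> 56) as u8
}

/// Extract the 56-bit index.
fn chunk_index_of(chunk_id: u64) -> u64 {
    chunk_id & ((1u64 << 56) - 1)
}

fn main() {
    let id = make_chunk_id(0x3c, 42);
    assert_eq!(chunk_group_of(id), 0x3c);
    assert_eq!(chunk_index_of(id), 42);
}
```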

## Key

A key is an arbitrary UTF-8 string.


## Access entry

A cache replacement algorithm depends on the access log to decide which
object/chunk should be evicted.

An access entry is either a `Key` or a `ChunkId`, tracing object access or
chunk access respectively.


# Storage layout

Opencache on-disk storage includes:
- A manifest file that tracks all other kinds of on-disk files.
- Key files that store object keys and object headers.
- Chunk files that store cached data.


## Manifest file

The manifest is a snapshot of the current storage state,
i.e., a list of the files belonging to this snapshot.

```
data_ver: 0.1.0
key_meta_ver: 0.2.0
keys:
  key-wal-1, key-wal-2...
  key-compacted-1, key-compacted-2...
access:
  acc-1, acc-2...
chunk-groups:
  CG-1, CG-2...
chunks:
  chunk-id-1, chunk-id-2...
```

## Object store

The object store is a collection of files that persist the object map.
It includes one active WAL file and several compacted files.

The object header includes:
- The timestamp when this record was added.
- A flag indicating whether it is an add operation or a remove operation.
- The total size of this object.
- User metadata in JSON.
- The ChunkGroup (CG) this object is allocated in.
- The `ChunkId` list of the stored chunks. The ChunkGroup does not need to be
    stored in this list, because every chunk in an object belongs to the same
    ChunkGroup.
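The header fields listed above might map onto a record like the following (a hypothetical sketch; field names and types are assumptions):

```rust
/// Hypothetical sketch of the persisted object header record.
struct ObjectHeader {
    ts: u64,                 // timestamp when this record was added
    is_add: bool,            // add operation vs. remove operation
    size: u64,               // total object size in bytes
    user_meta: String,       // application metadata, JSON-encoded
    chunk_group: u8,         // ChunkGroupId this object is allocated in
    chunk_indexes: Vec<u64>, // index part of each ChunkId (the group is implied)
}

fn main() {
    let header = ObjectHeader {
        ts: 1_700_000_000,
        is_add: true,
        size: 3 * 1024 * 1024,
        user_meta: r#"{"content-type":"image/png"}"#.to_string(),
        chunk_group: 0x51,
        chunk_indexes: vec![100, 101, 102],
    };
    assert_eq!(header.chunk_indexes.len(), 3);
}
```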


## Chunk store

Object content is stored in chunks.
Chunks of the same size belong to the same ChunkGroup.

Every ChunkGroup consists of **one** active (writable) chunk file and several
closed (readonly) chunk files. The active chunk file is append-only.

Deletion is implemented with an additional per-chunk-file WAL (`ChunkIndex`),
to which chunk index entries are appended one by one:

- For an active chunk file, the ChunkIndex only contains removed chunk
    entries: `(chunk_id, REMOVE)`. This is because the chunk file and the
    ChunkIndex file are not `fsync`-ed every time they are updated.

    When restarting, present chunk ids are loaded from the chunk file, while
    removed chunk ids are loaded from the ChunkIndex file.

- For a closed chunk file, the ChunkIndex contains both present and removed
    chunk ids (a chunk can be removed after its file is closed or compacted):
    `(chunk_id, ADD)` or `(chunk_id, REMOVE)`.

    When the server restarts, chunk indexes are loaded from the ChunkIndex
    file alone.
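The recovery rule for closed chunk files can be sketched as a WAL replay (type and function names are assumptions): `ADD` inserts a chunk id, `REMOVE` deletes it, and whatever survives is the set of present chunks.

```rust
use std::collections::HashSet;

/// Entry kinds appended to a closed chunk file's ChunkIndex WAL.
enum IndexOp {
    Add,
    Remove,
}

/// Replay the WAL in order; the surviving ids are the chunks still present.
fn replay_chunk_index(entries: &[(u64, IndexOp)]) -> HashSet<u64> {
    let mut present = HashSet::new();
    for (chunk_id, op) in entries {
        match op {
            IndexOp::Add => {
                present.insert(*chunk_id);
            }
            IndexOp::Remove => {
                present.remove(chunk_id);
            }
        }
    }
    present
}

fn main() {
    let entries = vec![(1, IndexOp::Add), (2, IndexOp::Add), (1, IndexOp::Remove)];
    let present = replay_chunk_index(&entries);
    assert!(present.contains(&2));
    assert!(!present.contains(&1)); // added, then removed
}
```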


## Access store

Opencache evicts objects and chunks based on access patterns.
Thus the sequence of recent accesses has to be persisted;
otherwise, when a cache server restarts, cached data would be evicted
in unexpected ways due to the lack of access information.

Access data is tracked in two categories: by key and by chunk id.

Access data is stored in several append-only files, containing keys and
chunk ids respectively.

Old access data is removed periodically.
Access data does not need compaction.


```
      .----------.
      | Manifest |
      '----------'


  Access   Access    Keys             ChunkGroup-i      ChunkGroup-j   ..
  Key      Chunk

  .----.   .----.    Active           Active
  |key1|   |cid1|    .---------.      .----.  Chunk     ..
  |key2|   |cid2|    | k1,meta |      | c1 |  Index
  | .. |   | .. |    | k2,meta |    .-|-c2 |  .----.
  |CKSM|   |CKSM|    | ..      |    | | .. |<-|cid5|
  |key3|   |cid3|    | CKSM    |    | | .. |  |cid6|
  | .. |   | .. |    | k3,meta |    | '----'  | .. |
  '----'   '----'    | ..      |    |   ..    '----'
    ..       ..      '---------'    |
  .----.   .----.      ..           |
  |keyi|   |cid1|                   |
  |keyj|   |cid2|    Compacted Keys | Compacted Chunks
  | .. |   | .. |    .---------.    |
  |CKSM|   |CKSM|    | k1,meta |    | .----.  Chunk
  |keyk|   |cid3|    | k2,meta |    | | c1 |  Index
  | .. |   | .. |    | ..      |    +-|-c2 |  .----.
  '----'   '----'    | CKSM    |    | | .. |<-|cid5|
                     | k3,meta |    | | .. |  |cid6|
                     | ..      |    |         '----'
                     '---------'    |         '----'
                           |        '---.
                           v            v
                     KeyMeta          Chunk
                     .----------.     .----.
                     | ts       |     |data|
                     | add|rm   |     |CHSM|
                     | size     |     '----'
                     | userdata |
                     | CG       |
                     | chunkids |
                     '----------'
```

# Operations

## Write

A cache server does not need strict data durability; thus when new data is
written (a key, a chunk, or access data), it does not have to be fsync-ed at
once.

The fsync policies for the different kinds of data are:

- Keys: key files are fsync-ed every `k MB` (where `k` is a configurable
  parameter). A checksum record has to be written before each fsync, since the
  on-disk data still has to be internally consistent. The checksum record
  contains the checksum of the `k MB` of data preceding it, excluding itself.

  In other words, key files are split into `k MB` segments.

- Access data is handled the same way as key files.

- Chunks are different: every chunk contains a checksum of itself, and chunk
    files are fsync-ed periodically.
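A sketch of the segmented-checksum scheme for key files; the in-memory `out` buffer stands in for the on-disk file, and the toy FNV-1a hash stands in for whatever checksum opencache actually uses (all names here are assumptions):

```rust
/// Toy FNV-1a hash; a real implementation would likely use CRC32 or xxHash.
fn checksum(data: &[u8]) -> u64 {
    data.iter().fold(0xcbf29ce484222325u64, |h, &b| {
        (h ^ b as u64).wrapping_mul(0x100000001b3)
    })
}

struct SegmentWriter {
    buf: Vec<u8>,        // data buffered since the last fsync
    segment_size: usize, // the configurable `k MB`
    out: Vec<u8>,        // stands in for the on-disk file
}

impl SegmentWriter {
    fn write(&mut self, data: &[u8]) {
        self.buf.extend_from_slice(data);
        if self.buf.len() >= self.segment_size {
            // Append the payload, then a checksum record covering it
            // (the record excludes itself), then fsync the file.
            self.out.extend_from_slice(&self.buf);
            self.out.extend_from_slice(&checksum(&self.buf).to_le_bytes());
            // file.sync_data() would be called here on a real file
            self.buf.clear();
        }
    }
}

fn main() {
    let mut w = SegmentWriter { buf: Vec::new(), segment_size: 8, out: Vec::new() };
    w.write(b"hello"); // below the segment size: buffered, nothing flushed
    assert_eq!(w.out.len(), 0);
    w.write(b"world!"); // crosses the boundary: 11 bytes + 8-byte checksum
    assert_eq!(w.out.len(), 11 + 8);
}
```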


## Update manifest

A write operation does not update the `Manifest` file.
The `Manifest` is only replaced when files (key store files, chunk store files,
or access store files) are added or removed (or the configuration changes),
e.g., when a new key file is opened, or when a compaction of chunk files is
done.

Manifest files are named with monotonically increasing integers, e.g.,
`manifest-000021`.

Manifest files can only be added or removed.
When restarting, opencache lists all of the manifest files and uses the last
valid one (determined by checking the checksum).
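The restart step might look like this (a sketch; the `is_valid` predicate stands in for verifying a manifest file's checksum):

```rust
/// Pick the manifest with the largest sequence number that passes validation.
/// Zero-padded names like `manifest-000021` sort in sequence order.
fn pick_manifest(mut names: Vec<String>, is_valid: impl Fn(&str) -> bool) -> Option<String> {
    names.sort();
    names.into_iter().rev().find(|n| is_valid(n))
}

fn main() {
    let names = vec![
        "manifest-000019".to_string(),
        "manifest-000021".to_string(),
        "manifest-000020".to_string(),
    ];
    // Pretend manifest-000021 is corrupt (fails its checksum check).
    let picked = pick_manifest(names, |n| n != "manifest-000021");
    assert_eq!(picked.as_deref(), Some("manifest-000020"));
}
```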


# Memory layout

Keys and access data are stored in memory for quick access.
Chunks can optionally be cached in memory, but it is acceptable to have no
in-memory chunk cache.

Keys in memory form a plain HashMap: the `ObjectMap`.

Key access data is fed to the key-eviction algorithm.
Chunk access data is fed to the chunk-eviction algorithm.

When the cache server restarts, all of the keys are loaded into memory, and
all access data is replayed by the two eviction algorithm implementations.



# Cache Replacement

Objects and chunks are evicted independently.

The rule is: when a chunk is accessed, the object it belongs to counts as
accessed too. This way it is very likely that an object won't be evicted while
there are still chunk accesses to it.

- It is possible that a chunk is evicted while the object's other chunks still
    reside in the CacheServer.

- It is possible that an object resides in the CacheServer but all of its
    chunks are evicted.

- It is possible that an object is evicted while there are still orphan chunks.
  These chunks will eventually be evicted, since there won't be any further
  access to them.
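The coupling rule can be sketched as follows; the access-log type and function names are assumptions, not opencache's real API:

```rust
use std::collections::VecDeque;

/// Toy access log; opencache feeds such entries to the eviction algorithms.
struct AccessLog<T> {
    recent: VecDeque<T>,
}

impl<T> AccessLog<T> {
    fn record(&mut self, entry: T) {
        self.recent.push_back(entry);
    }
}

/// Recording a chunk access also records an access for the owning object,
/// so the key-eviction algorithm keeps seeing traffic for hot objects.
fn on_chunk_access(
    key: &str,
    chunk_id: u64,
    key_log: &mut AccessLog<String>,
    chunk_log: &mut AccessLog<u64>,
) {
    chunk_log.record(chunk_id);
    key_log.record(key.to_string());
}

fn main() {
    let mut keys = AccessLog { recent: VecDeque::new() };
    let mut chunks = AccessLog { recent: VecDeque::new() };
    on_chunk_access("obj-1", 42, &mut keys, &mut chunks);
    assert_eq!(keys.recent.len(), 1); // object access recorded implicitly
    assert_eq!(chunks.recent.len(), 1);
}
```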

The cache replacement policy is a modified and simplified version of `LIRS`.


# API

## Chunk API

ChunkStorage provides chunk access:

```rust
trait ChunkStorage {
    fn add(&mut self, chunk_id: ChunkId, bytes: Bytes);
    fn remove(&mut self, chunk_id: ChunkId);
    fn read(&mut self, chunk_id: ChunkId) -> ByteStream;
}
```

## Object API

ObjectStorage provides object access:

```rust
trait ObjectStorage {
    fn add(&mut self, key: String, size: u64, meta: JsonMeta, preferred_chunk_size: Option<u64>);
    fn remove(&mut self, key: &str);
    fn access(&mut self, key: &str);
}
```

## CacheServer API

Server-level API:

```rust
trait Cache {
    /// Write chunks to an object.
    /// A nonexistent object will be added implicitly.
    fn write(&mut self, key: &str, chunks: Vec<Bytes>);

    /// Read the specified `range` of bytes in an object.
    fn read(&mut self, key: &str, range: Range<u64>) -> ByteStream;
}
```


# Client

Client-side config versioning:
If a new cache node `C` is added to the cache cluster `{A,B}`,
and assuming the cache client uses consistent hashing,
then when the new cluster config `{A,B,C}` is published to every client, about
1/3 of data accesses will encounter a miss.

To upgrade the cluster config smoothly, a client needs to access the cache
cluster using two configs: read from the new cluster first, and if there is a
miss, read from the old cluster, until data migration is done. During this
period there are two active configs, `{{A,B},ver=1}` and `{{A,B,C},ver=2}`.

It is also possible that more than two configs are active.

Thus the cache client has to:
- be able to upgrade the cluster config on the fly,
- support a **joint** config,
- and be able to retry against every config.

A cluster upgrade may look like this:
- Initially, every cache-client has config `[{{A,B},ver=1}]`;
- A new cache node is added, and the new config is published to every client: `[{{A,B},ver=1}, {{A,B,C},ver=2}]`;
- After a while, config with ver=1 is removed, and another new config is published to every client: `[{{A,B,C},ver=2}]`.
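A joint-config read might be sketched like this (types and names are assumptions); the client walks the config list from newest to oldest until one cluster returns a hit:

```rust
/// A published cluster config (the `nodes` list would feed consistent hashing).
struct ClusterConfig {
    ver: u64,
    nodes: Vec<String>,
}

/// Try the newest config first; fall back to older ones on a miss.
/// `read_one` stands in for a real read against one cluster config.
fn read_joint(
    configs: &[ClusterConfig], // sorted by `ver` ascending
    key: &str,
    read_one: impl Fn(&ClusterConfig, &str) -> Option<Vec<u8>>,
) -> Option<Vec<u8>> {
    configs.iter().rev().find_map(|cfg| read_one(cfg, key))
}

fn main() {
    let configs = vec![
        ClusterConfig { ver: 1, nodes: vec!["A".into(), "B".into()] },
        ClusterConfig { ver: 2, nodes: vec!["A".into(), "B".into(), "C".into()] },
    ];
    // Simulate: the new cluster misses, the old one still holds the object.
    let got = read_joint(&configs, "k1", |cfg, _key| {
        if cfg.ver == 1 { Some(b"v".to_vec()) } else { None }
    });
    assert_eq!(got, Some(b"v".to_vec()));
}
```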

# Caveat

## Consistency concern

A cache has a consistency issue with the backend storage: if an object is
changed in the backend store, there is **no** way to provide a strongly
consistent view on the cache side,
unless a distributed consensus protocol runs across both the backend store and
the cache (which is expensive).

Thus it is recommended to only store objects that won't change, i.e., an
object in the backend should only ever be added or removed, never modified.

## Space management

It is recommended to set the cache space limit to under 80% of the disk,
since, in order to reduce IO, the space of removed chunks is not reclaimed
until the next `compaction`.