.TH VENTI 7
.SH NAME
venti \- archival storage server
.SH DESCRIPTION
Venti is a block storage server intended for archival data.
In a Venti server, the SHA1 hash of a block's contents acts
as the block identifier for read and write operations.
This approach enforces a write-once policy, preventing
accidental or malicious destruction of data.  In addition,
duplicate copies of a block are coalesced, reducing the
consumption of storage and simplifying the implementation
of clients.
.PP
This manual page documents the basic concepts of
block storage using Venti as well as the Venti network protocol.
.PP
.MR Venti (1)
documents some simple clients.
.MR Vac (1) ,
.MR vacfs (4) ,
and
.MR vbackup (8)
are more complex clients.
.PP
.MR Venti (3)
describes a C library interface for accessing
Venti servers and manipulating Venti data structures.
.PP
.MR Venti (8)
describes the programs used to run a Venti server.
.PP
.SS "Scores
The SHA1 hash that identifies a block is called its
.IR score .
The score of the zero-length block is called the
.IR "zero score" .
.PP
Scores may have an optional
.IB label :
prefix, typically used to
describe the format of the data.
For example,
.MR vac (1)
uses a
.B vac:
prefix, while
.MR vbackup (8)
uses prefixes corresponding to the file system
types:
.BR ext2: ,
.BR ffs: ,
and so on.
.SS "Files and Directories
Venti accepts blocks up to 56 kilobytes in size.
By convention, Venti clients use hash trees of blocks to
represent arbitrary-size data
.IR files .
The data to be stored is split into fixed-size
blocks and written to the server, producing a list
of scores.
The resulting list of scores is split into fixed-size pointer
blocks (using only an integral number of scores per block)
and written to the server, producing a smaller list
of scores.
The process continues, eventually ending with the
score for the hash tree's top-most block.
Each file stored this way is summarized by
a
.B VtEntry
structure recording the top-most score, the depth
of the tree, the data block size, and the pointer block size.
One or more
.B VtEntry
structures can be concatenated
and stored as a special file called a
.IR directory .
In this
manner, arbitrary trees of files can be constructed
and stored.
.PP
Scores passed between programs conventionally refer
to
.B VtRoot
blocks, which contain descriptive information
as well as the score of a directory block containing a small number
of directory entries.
.PP
Conventionally, programs do not mix data and directory entries
in the same file.  Instead, they keep two separate files, one with
directory entries and one with metadata referencing those
entries by position.
Keeping this parallel representation is a minor annoyance
but makes it possible for general programs like
.I venti/copy
(see
.MR venti (1) )
to traverse the block tree without knowing the specific details
of any particular program's data.
.SS "Block Types
To allow programs to traverse these structures without
needing to understand their higher-level meanings,
Venti tags each block with a type.  The types are:
.PP
.nf
.ft L
    VtDataType     000  \f1data\fL
    VtDataType+1   001  \fRscores of \fPVtDataType\fR blocks\fL
    VtDataType+2   002  \fRscores of \fPVtDataType+1\fR blocks\fL
    \fR\&...\fL
    VtDirType      010  VtEntry\fR structures\fL
    VtDirType+1    011  \fRscores of \fLVtDirType\fR blocks\fL
    VtDirType+2    012  \fRscores of \fLVtDirType+1\fR blocks\fL
    \fR\&...\fL
    VtRootType     020  VtRoot\fR structure\fL
.fi
.PP
The octal numbers listed are the type numbers used
by the commands below.
(For historical reasons, the type numbers used on
disk and on the wire are different from the above.
They do not distinguish
.BI VtDataType+ n
blocks from
.BI VtDirType+ n
blocks.)
.SS "Zero Truncation
To avoid storing the same short data blocks padded with
differing numbers of zeros, Venti clients working with fixed-size
blocks conventionally
`zero truncate' the blocks before writing them to the server.
For example, if a 1024-byte data block contains the
11-byte string
.RB ` hello " " world '
followed by 1013 zero bytes,
a client would store only the 11-byte block.
When the client later read the block from the server,
it would append zero bytes to the end as necessary to
reach the expected size.
.PP
When truncating pointer blocks
.RB ( VtDataType+ \fIn
and
.BI VtDirType+ n
blocks),
trailing zero scores are removed
instead of trailing zero bytes.
.PP
Because of the truncation convention,
any file consisting entirely of zero bytes,
no matter what its length, will be represented by the zero score:
the data blocks contain all zeros and are thus truncated
to the empty block, and the pointer blocks contain all zero scores
and are thus also truncated to the empty block,
and so on up the hash tree.
.SS Network Protocol
A Venti session begins when a
.I client
connects to the network address served by a Venti
.IR server ;
the conventional address is
.BI tcp! server !venti
(the
.B venti
port is 17034).
Both client and server begin by sending a version
string of the form
.BI venti- versions - comment \en \fR.
The
.I versions
field is a list of acceptable versions separated by
colons.
The protocol described here is version
.BR 02 .
The client is responsible for choosing a common
version and sending it in the
.B VtThello
message, described below.
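.PP
As an illustration of the version choice, the following C sketch
selects a version from a peer's greeting.
The helper
.I pickversion
and its preference table are hypothetical and are not part of
.MR venti (3) ;
the sketch assumes a client that prefers version
.B 04
and falls back to
.BR 02 ,
which is a choice made here for illustration only.
.IP
.nf
.ft L
```c
#include <string.h>

/* Assumption: versions this client speaks, most preferred first. */
static const char *supported[] = { "04", "02" };

/*
 * pickversion is a hypothetical helper, not a libventi function.
 * greeting has the form venti-versions-comment, where versions
 * is a colon-separated list.  The first entry of supported[] that
 * also appears in that list is copied to out (NUL-terminated).
 * Returns 0 on success, -1 if the greeting is malformed, no
 * version is shared, or out is too small.
 */
int
pickversion(const char *greeting, char *out, size_t nout)
{
	const char *list, *end, *p, *q;
	size_t i, len;

	if(strncmp(greeting, "venti-", 6) != 0)
		return -1;
	list = greeting + 6;
	end = strchr(list, '-');	/* versions field ends at the next dash */
	if(end == NULL)
		return -1;
	for(i = 0; i < sizeof supported / sizeof supported[0]; i++){
		len = strlen(supported[i]);
		for(p = list; p < end; p = q + 1){
			q = memchr(p, ':', end - p);
			if(q == NULL)
				q = end;	/* last entry in the list */
			if((size_t)(q - p) == len
			&& memcmp(p, supported[i], len) == 0){
				if(len + 1 > nout)
					return -1;
				memcpy(out, supported[i], len + 1);
				return 0;
			}
		}
	}
	return -1;
}
```
.ft P
.fi
.PP
Under these assumptions, a greeting whose
.I versions
field is
.B 02:04
yields
.BR 04 ,
and one offering only
.B 02
yields
.BR 02 ;
the selected string is what the client would then send as the
.I version
field of
.BR VtThello .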
.PP
After the initial version exchange, the client transmits
.I requests
.RI ( T-messages )
to the server, which subsequently returns
.I replies
.RI ( R-messages )
to the client.
The combined act of transmitting (receiving) a request
of a particular type, and receiving (transmitting) its reply
is called a
.I transaction
of that type.
.PP
Each message consists of a sequence of bytes.
Two-byte fields hold unsigned integers represented
in big-endian order (most significant byte first).
Data items of variable lengths are represented by
a one-byte field specifying a count,
.IR n ,
followed by
.I n
bytes of data.
Text strings are represented similarly,
using a two-byte count with
the text itself stored as a UTF-encoded sequence
of Unicode characters (see
.MR utf (7) ).
Text strings are not
.SM NUL\c
-terminated:
.I n
counts the bytes of UTF data, which include no final
zero byte.
The
.SM NUL
character is illegal in text strings in the Venti protocol.
The maximum string length in Venti is 1024 bytes.
.PP
Each Venti message begins with a two-byte size field
specifying the length in bytes of the message,
not including the length field itself.
The next byte is the message type, one of the constants
in the enumeration in the include file
.BR <venti.h> .
The next byte is an identifying
.IR tag ,
used to match responses to requests.
The remaining bytes are parameters of different sizes.
In the message descriptions, the number of bytes in a field
is given in brackets after the field name.
The notation
.IR parameter [ n ]
where
.I n
is not a constant represents a variable-length parameter:
.IR n [1]
followed by
.I n
bytes of data forming the
.IR parameter .
The notation
.IR string [ s ]
(using a literal
.I s
character)
is shorthand for
.IR s [2]
followed by
.I s
bytes of UTF-8 text.
The notation
.IR parameter []
where
.I parameter
is the last field in the message represents a
variable-length field that comprises all remaining
bytes in the message.
.PP
All Venti RPC messages are prefixed with a field
.IR size [2]
giving the length of the message that follows
(not including the
.I size
field itself).
The message bodies are:
.ta \w'\fLVtTgoodbye 'u
.IP
.ne 2v
.B VtThello
.IR tag [1]
.IR version [ s ]
.IR uid [ s ]
.IR strength [1]
.IR crypto [ n ]
.IR codec [ n ]
.br
.B VtRhello
.IR tag [1]
.IR sid [ s ]
.IR rcrypto [1]
.IR rcodec [1]
.IP
.ne 2v
.B VtTping
.IR tag [1]
.br
.B VtRping
.IR tag [1]
.IP
.ne 2v
.B VtTread
.IR tag [1]
.IR score [20]
.IR type [1]
.IR pad [1]
.IR count [2]
.br
.B VtRread
.IR tag [1]
.IR data []
.IP
.ne 2v
.B VtTwrite
.IR tag [1]
.IR type [1]
.IR pad [3]
.IR data []
.br
.B VtRwrite
.IR tag [1]
.IR score [20]
.IP
.ne 2v
.B VtTsync
.IR tag [1]
.br
.B VtRsync
.IR tag [1]
.IP
.ne 2v
.B VtRerror
.IR tag [1]
.IR error [ s ]
.IP
.ne 2v
.B VtTgoodbye
.IR tag [1]
.PP
Each T-message has a one-byte
.I tag
field, chosen and used by the client to identify the message.
The server will echo the request's
.I tag
field in the reply.
Clients should arrange that no two outstanding
messages have the same tag field so that responses
can be distinguished.
.PP
The type of an R-message will either be one greater than
the type of the corresponding T-message or
.BR VtRerror ,
indicating that the request failed.
In the latter case, the
.I error
field contains a string describing the reason for failure.
.PP
Venti connections must begin with a
.B hello
transaction.
The
.B VtThello
message contains the protocol
.I version
that the client has chosen to use.
The fields
.IR strength ,
.IR crypto ,
and
.I codec
could be used to add authentication, encryption,
and compression to the Venti session
but are currently ignored.
The
.I rcrypto
and
.I rcodec
fields in the
.B VtRhello
response are similarly ignored.
The
.I uid
and
.I sid
fields are intended to be the identity
of the client and server but, given the lack of
authentication, should be treated only as advisory.
The initial
.B hello
should be the only
.B hello
transaction during the session.
.PP
The
.B ping
message has no effect and
is used mainly for debugging.
Servers should respond immediately to pings.
.PP
The
.B read
message requests a block with the given
.I score
and
.IR type .
Use
.I vttodisktype
and
.I vtfromdisktype
(see
.MR venti (3) )
to convert between a block type enumeration value
.RB ( VtDataType ,
etc.)
and the
.I type
used on disk and in the protocol.
The
.I count
field specifies the maximum expected size
of the block.
The
.I data
in the reply is the block's contents.
.PP
The
.B write
message writes a new block of the given
.I type
with contents
.I data
to the server.
The response includes the
.I score
to use to read the block,
which should be the SHA1 hash of
.IR data .
.PP
The Venti server may buffer written blocks in memory,
waiting until after responding to the
.B write
message before writing them to
permanent storage.
The server will delay the response to a
.B sync
message until after all blocks in earlier
.B write
messages have been written to permanent storage.
.PP
The
.B goodbye
message ends a session.  There is no
.BR VtRgoodbye :
upon receiving the
.B VtTgoodbye
message, the server hangs up the connection.
.PP
Version
.B 04
of the Venti protocol is similar to version
.B 02
(described above)
but has two changes to accommodate larger payloads.
First, it replaces the leading 2-byte packet size with
a 4-byte size.
Second, the
.I count
in the
.B VtTread
packet may be either 2 or 4 bytes;
the total packet length distinguishes the two cases.
.SH SEE ALSO
.MR venti (1) ,
.MR venti (3) ,
.MR venti (8)
.br
Sean Quinlan and Sean Dorward,
``Venti: a new approach to archival storage'',
.IR "Usenix Conference on File and Storage Technologies" ,
2002.