plan9port

fork of plan9port with libvec, libstr and libsdb

venti.7 (11062B)


      1 .TH VENTI 7
      2 .SH NAME
      3 venti \- archival storage server
      4 .SH DESCRIPTION
      5 Venti is a block storage server intended for archival data.
      6 In a Venti server, the SHA1 hash of a block's contents acts
      7 as the block identifier for read and write operations.
      8 This approach enforces a write-once policy, preventing
      9 accidental or malicious destruction of data.  In addition,
     10 duplicate copies of a block are coalesced, reducing the
     11 consumption of storage and simplifying the implementation
     12 of clients.
     13 .PP
     14 This manual page documents the basic concepts of
     15 block storage using Venti as well as the Venti network protocol.
     16 .PP
     17 .MR Venti (1)
     18 documents some simple clients.
     19 .MR Vac (1) ,
     20 .MR vacfs (4) ,
     21 and
     22 .MR vbackup (8)
     23 are more complex clients.
     24 .PP
     25 .MR Venti (3)
     26 describes a C library interface for accessing
     27 Venti servers and manipulating Venti data structures.
     28 .PP
     29 .MR Venti (8)
     30 describes the programs used to run a Venti server.
     31 .PP
     32 .SS "Scores
     33 The SHA1 hash that identifies a block is called its
     34 .IR score .
     35 The score of the zero-length block is called the
     36 .IR "zero score" .
     37 .PP
     38 Scores may have an optional 
     39 .IB label :
     40 prefix, typically used to
     41 describe the format of the data.
     42 For example, 
     43 .MR vac (1)
     44 uses a
     45 .B vac:
     46 prefix, while
     47 .MR vbackup (8)
     48 uses prefixes corresponding to the file system
     49 types: 
     50 .BR ext2: ,
     51 .BR ffs: ,
     52 and so on.
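As a minimal sketch of the score concept, the following computes the score of a block and splits an optional label prefix from a textual score. The helper names (score, parse_score) and the assumption that a textual score is written as 40 hexadecimal digits are illustrative and not part of this manual page.

```python
import hashlib
from typing import Optional, Tuple

SCORE_SIZE = 20  # a SHA1 digest is 20 bytes


def score(block: bytes) -> bytes:
    """Return the score (SHA1 digest) of a block's contents."""
    return hashlib.sha1(block).digest()


# The zero score is the score of the zero-length block.
ZERO_SCORE = score(b"")


def parse_score(text: str) -> Tuple[Optional[str], bytes]:
    """Split an optional 'label:' prefix from a hex score (assumed 40 hex digits)."""
    label, sep, hexpart = text.rpartition(":")
    raw = bytes.fromhex(hexpart)
    if len(raw) != SCORE_SIZE:
        raise ValueError("a score must be 40 hexadecimal digits")
    return (label if sep else None), raw
```

For example, parse_score("vac:" + ZERO_SCORE.hex()) yields the label "vac" and the 20-byte zero score.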
     53 .SS "Files and Directories
     54 Venti accepts blocks up to 56 kilobytes in size.  
     55 By convention, Venti clients use hash trees of blocks to
     56 represent arbitrary-size data
     57 .IR files .
     58 The data to be stored is split into fixed-size
     59 blocks and written to the server, producing a list
     60 of scores.
     61 The resulting list of scores is split into fixed-size pointer
     62 blocks (using only an integral number of scores per block)
     63 and written to the server, producing a smaller list
     64 of scores.
     65 The process continues, eventually ending with the
     66 score for the hash tree's top-most block.
     67 Each file stored this way is summarized by
     68 a
     69 .B VtEntry
     70 structure recording the top-most score, the depth
     71 of the tree, the data block size, and the pointer block size.
     72 One or more 
     73 .B VtEntry
     74 structures can be concatenated
     75 and stored as a special file called a
     76 .IR directory .
     77 In this
     78 manner, arbitrary trees of files can be constructed
     79 and stored.
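The hash tree construction described above can be sketched as follows. This is an illustration only: write_block is a hypothetical in-memory stand-in for a server write, the default block sizes are arbitrary, zero truncation is omitted, and the depth convention (number of pointer levels) is an assumption rather than the definition used by VtEntry.

```python
import hashlib

SCORE_SIZE = 20  # bytes per SHA1 score


def write_block(store: dict, data: bytes) -> bytes:
    """Hypothetical stand-in for a server write: keep the block, return its score."""
    s = hashlib.sha1(data).digest()
    store[s] = data
    return s


def write_file(store: dict, data: bytes, dsize: int = 8192, psize: int = 8192):
    """Store data as a hash tree and return (top-most score, number of pointer levels)."""
    per_ptr = psize // SCORE_SIZE  # integral number of scores per pointer block
    # Split the data into fixed-size blocks, producing a list of scores.
    scores = [write_block(store, data[i:i + dsize])
              for i in range(0, len(data), dsize)]
    if not scores:
        scores = [write_block(store, b"")]
    depth = 0
    # Pack the scores into pointer blocks until a single score remains.
    while len(scores) > 1:
        scores = [write_block(store, b"".join(scores[i:i + per_ptr]))
                  for i in range(0, len(scores), per_ptr)]
        depth += 1
    return scores[0], depth
```

With dsize=4 and psize=40, twelve bytes of data become three data blocks, two pointer blocks, and one top-most pointer block.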
     80 .PP
     81 Scores passed between programs conventionally refer
     82 to
     83 .B VtRoot
     84 blocks, which contain descriptive information
     85 as well as the score of a directory block containing a small number
     86 of directory entries.
     87 .PP
     88 Conventionally, programs do not mix data and directory entries
     89 in the same file.  Instead, they keep two separate files, one with
     90 directory entries and one with metadata referencing those
     91 entries by position.
     92 Keeping this parallel representation is a minor annoyance
     93 but makes it possible for general programs like
     94 .I venti/copy
     95 (see
     96 .MR venti (1) )
     97 to traverse the block tree without knowing the specific details
     98 of any particular program's data.
     99 .SS "Block Types
    100 To allow programs to traverse these structures without
    101 needing to understand their higher-level meanings,
    102 Venti tags each block with a type.  The types are:
    103 .PP
    104 .nf
    105 .ft L
    106     VtDataType     000  \fRdata\fL
    107     VtDataType+1   001  \fRscores of \fLVtDataType\fR blocks\fL
    108     VtDataType+2   002  \fRscores of \fLVtDataType+1\fR blocks\fL
    109     \fR\&...\fL
    110     VtDirType      010  VtEntry\fR structures\fL
    111     VtDirType+1    011  \fRscores of \fLVtDirType\fR blocks\fL
    112     VtDirType+2    012  \fRscores of \fLVtDirType+1\fR blocks\fL
    113     \fR\&...\fL
    114     VtRootType     020  VtRoot\fR structure\fL
    115 .fi
    116 .PP
    117 The octal numbers listed are the type numbers used
    118 by the commands below.
    119 (For historical reasons, the type numbers used on
    120 disk and on the wire are different from the above.
    121 They do not distinguish
    122 .BI VtDataType+ n
    123 blocks from
    124 .BI VtDirType+ n
    125 blocks.)
    126 .SS "Zero Truncation
    127 To avoid storing the same short data blocks padded with
    128 differing numbers of zeros, Venti clients working with fixed-size
    129 blocks conventionally
    130 `zero truncate' the blocks before writing them to the server.
    131 For example, if a 1024-byte data block contains the 
    132 11-byte string 
    133 .RB ` hello " " world '
    134 followed by 1013 zero bytes,
    135 a client would store only the 11-byte block.
    136 When the client later reads the block from the server,
    137 it appends zero bytes to the end as necessary to
    138 reach the expected size.
    139 .PP
    140 When truncating pointer blocks
    141 .RB ( VtDataType+ \fIn
    142 and
    143 .BI VtDirType+ n
    144 blocks),
    145 trailing zero scores are removed
    146 instead of trailing zero bytes.
    147 .PP
    148 Because of the truncation convention,
    149 any file consisting entirely of zero bytes,
    150 no matter what its length, will be represented by the zero score:
    151 the data blocks contain all zeros and are thus truncated
    152 to the empty block, and the pointer blocks contain all zero scores
    153 and are thus also truncated to the empty block, 
    154 and so on up the hash tree.
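The truncation convention can be sketched as below. The function names are hypothetical, and the sketch assumes that a pointer block's length is a multiple of the 20-byte score size.

```python
import hashlib

SCORE_SIZE = 20
ZERO_SCORE = hashlib.sha1(b"").digest()  # score of the zero-length block


def truncate_data(block: bytes) -> bytes:
    """Remove trailing zero bytes from a data block."""
    return block.rstrip(b"\x00")


def truncate_pointer(block: bytes) -> bytes:
    """Remove trailing zero scores (not zero bytes) from a pointer block."""
    n = len(block)
    while n >= SCORE_SIZE and block[n - SCORE_SIZE:n] == ZERO_SCORE:
        n -= SCORE_SIZE
    return block[:n]


def extend_data(block: bytes, size: int) -> bytes:
    """Pad a data block read from the server back to its expected size."""
    return block.ljust(size, b"\x00")


def extend_pointer(block: bytes, size: int) -> bytes:
    """Pad a pointer block with zero scores up to its expected size."""
    missing = (size - len(block)) // SCORE_SIZE
    return block + ZERO_SCORE * max(missing, 0)
```

A data block of all zeros and a pointer block of all zero scores both truncate to the empty block, which is why an all-zero file of any length ends up with the zero score.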
    155 .SS "Network Protocol
    156 A Venti session begins when a
    157 .I client
    158 connects to the network address served by a Venti
    159 .IR server ;
    160 the conventional address is 
    161 .BI tcp! server !venti
    162 (the
    163 .B venti
    164 port is 17034).
    165 Both client and server begin by sending a version
    166 string of the form
    167 .BI venti- versions - comment \en \fR.
    168 The
    169 .I versions
    170 field is a list of acceptable versions separated by
    171 colons.
    172 The protocol described here is version
    173 .BR 02 .
    174 The client is responsible for choosing a common
    175 version and sending it in the
    176 .B VtThello
    177 message, described below.
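The version exchange might be handled along these lines. The function names and the comment text are hypothetical, and choosing the first common version in the client's own preference order is an assumption; the text above only requires that the client pick a common version.

```python
def make_version_string(versions, comment):
    """Build 'venti-<versions>-<comment>\\n' with versions separated by colons."""
    return "venti-%s-%s\n" % (":".join(versions), comment)


def parse_version_string(line):
    """Return (list of versions, comment) from a peer's version string."""
    prefix = "venti-"
    if not line.startswith(prefix) or not line.endswith("\n"):
        raise ValueError("malformed version string")
    versions, sep, comment = line[len(prefix):-1].partition("-")
    if not sep:
        raise ValueError("missing comment field")
    return versions.split(":"), comment


def choose_version(ours, theirs):
    """Pick the first version in our preference order that the peer also accepts."""
    for v in ours:
        if v in theirs:
            return v
    raise ValueError("no common protocol version")
```

The chosen version is what the client would then place in the version field of its VtThello message.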
    178 .PP
    179 After the initial version exchange, the client transmits
    180 .I requests
    181 .RI ( T-messages )
    182 to the server, which subsequently returns
    183 .I replies
    184 .RI ( R-messages )
    185 to the client.
    186 The combined act of transmitting (receiving) a request
    187 of a particular type, and receiving (transmitting) its reply
    188 is called a
    189 .I transaction
    190 of that type.
    191 .PP
    192 Each message consists of a sequence of bytes.
    193 Two-byte fields hold unsigned integers represented
    194 in big-endian order (most significant byte first).
    195 Data items of variable lengths are represented by
    196 a one-byte field specifying a count,
    197 .IR n ,
    198 followed by
    199 .I n
    200 bytes of data.
    201 Text strings are represented similarly,
    202 using a two-byte count with
    203 the text itself stored as a UTF-encoded sequence
    204 of Unicode characters (see
    205 .MR utf (7) ).
    206 Text strings are not
    207 .SM NUL\c
    208 -terminated:
    209 .I n
    210 counts the bytes of UTF data, which include no final
    211 zero byte.
    212 The
    213 .SM NUL
    214 character is illegal in text strings in the Venti protocol.
    215 The maximum string length in Venti is 1024 bytes.
    216 .PP
    217 Each Venti message begins with a two-byte size field 
    218 specifying the length in bytes of the message,
    219 not including the length field itself.
    220 The next byte is the message type, one of the constants
    221 in the enumeration in the include file
    222 .BR <venti.h> .
    223 The next byte is an identifying
    224 .IR tag ,
    225 used to match responses to requests.
    226 The remaining bytes are parameters of different sizes.
    227 In the message descriptions, the number of bytes in a field
    228 is given in brackets after the field name.
    229 The notation
    230 .IR parameter [ n ]
    231 where
    232 .I n
    233 is not a constant represents a variable-length parameter:
    234 .IR n [1]
    235 followed by
    236 .I n
    237 bytes of data forming the
    238 .IR parameter .
    239 The notation
    240 .IR string [ s ]
    241 (using a literal
    242 .I s
    243 character)
    244 is shorthand for
    245 .IR s [2]
    246 followed by
    247 .I s
    248 bytes of UTF-8 text.
    249 The notation
    250 .IR parameter []
    251 where 
    252 .I parameter
    253 is the last field in the message represents a 
    254 variable-length field that comprises all remaining
    255 bytes in the message.
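The field encodings described above can be sketched as follows for protocol version 02. The helper names are hypothetical; the message type is passed in as a plain number rather than taken from <venti.h>.

```python
import struct

MAX_STRING = 1024  # maximum string length in bytes, as stated above


def pack_param(data: bytes) -> bytes:
    """Encode parameter[n]: a one-byte count followed by that many bytes."""
    if len(data) > 255:
        raise ValueError("variable-length parameter exceeds 255 bytes")
    return bytes([len(data)]) + data


def pack_string(text: str) -> bytes:
    """Encode string[s]: a two-byte big-endian count, then UTF-8 text."""
    raw = text.encode("utf-8")
    if b"\x00" in raw or len(raw) > MAX_STRING:
        raise ValueError("string contains NUL or exceeds 1024 bytes")
    return struct.pack(">H", len(raw)) + raw


def frame(msgtype: int, tag: int, params: bytes = b"") -> bytes:
    """Prefix type[1] tag[1] params with size[2]; the size excludes itself."""
    body = bytes([msgtype, tag]) + params
    return struct.pack(">H", len(body)) + body
```

A message with no parameters therefore occupies four bytes on the wire: two for the size, one for the type, and one for the tag.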
    256 .PP
    257 All Venti RPC messages are prefixed with a field
    258 .IR size [2]
    259 giving the length of the message that follows
    260 (not including the
    261 .I size
    262 field itself).
    263 The message bodies are:
    264 .ta \w'\fLVtTgoodbye 'u
    265 .IP
    266 .ne 2v
    267 .B VtThello
    268 .IR tag [1]
    269 .IR version [ s ]
    270 .IR uid [ s ]
    271 .IR strength [1]
    272 .IR crypto [ n ]
    273 .IR codec [ n ]
    274 .br
    275 .B VtRhello
    276 .IR tag [1]
    277 .IR sid [ s ] 
    278 .IR rcrypto [1]
    279 .IR rcodec [1]
    280 .IP
    281 .ne 2v
    282 .B VtTping
    283 .IR tag [1]
    284 .br
    285 .B VtRping
    286 .IR tag [1]
    287 .IP
    288 .ne 2v
    289 .B VtTread
    290 .IR tag [1]
    291 .IR score [20]
    292 .IR type [1]
    293 .IR pad [1]
    294 .IR count [2]
    295 .br
    296 .B VtRread
    297 .IR tag [1]
    298 .IR data []
    299 .IP
    300 .ne 2v
    301 .B VtTwrite
    302 .IR tag [1]
    303 .IR type [1]
    304 .IR pad [3]
    305 .IR data []
    306 .br
    307 .B VtRwrite
    308 .IR tag [1]
    309 .IR score [20]
    310 .IP
    311 .ne 2v
    312 .B VtTsync
    313 .IR tag [1]
    314 .br
    315 .B VtRsync
    316 .IR tag [1]
    317 .IP
    318 .ne 2v
    319 .B VtRerror
    320 .IR tag [1]
    321 .IR error [ s ]
    322 .IP
    323 .ne 2v
    324 .B VtTgoodbye
    325 .IR tag [1]
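As an illustration of the layouts listed above, the following builds version-02 VtTread and VtTwrite packets. The numeric message types are assumptions and should be checked against the enumeration in <venti.h>; the function names are hypothetical. The btype argument is the type as used in the protocol, not the enumeration value.

```python
import hashlib
import struct

# Assumed message type numbers; confirm against <venti.h> before relying on them.
VT_TREAD = 12
VT_TWRITE = 14


def frame(msgtype: int, tag: int, params: bytes = b"") -> bytes:
    """Prefix type[1] tag[1] params with size[2]; the size excludes itself."""
    body = bytes([msgtype, tag]) + params
    return struct.pack(">H", len(body)) + body


def tread(tag: int, score: bytes, btype: int, count: int) -> bytes:
    """VtTread tag[1] score[20] type[1] pad[1] count[2]."""
    if len(score) != 20:
        raise ValueError("score must be 20 bytes")
    return frame(VT_TREAD, tag, score + struct.pack(">BBH", btype, 0, count))


def twrite(tag: int, btype: int, data: bytes) -> bytes:
    """VtTwrite tag[1] type[1] pad[3] data[]."""
    return frame(VT_TWRITE, tag, bytes([btype, 0, 0, 0]) + data)
```

Because data[] is the last field of VtTwrite, its length is implied by the size field rather than by a count of its own.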
    326 .PP
    327 Each T-message has a one-byte
    328 .I tag
    329 field, chosen and used by the client to identify the message.
    330 The server will echo the request's
    331 .I tag
    332 field in the reply.
    333 Clients should arrange that no two outstanding
    334 messages have the same tag field so that responses
    335 can be distinguished.
    336 .PP
    337 The type of an R-message will either be one greater than
    338 the type of the corresponding T-message or
    339 .BR VtRerror ,
    340 indicating that the request failed.
    341 In the latter case, the
    342 .I error
    343 field contains a string describing the reason for failure.
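A client might check a reply against its request roughly as follows. This is a sketch for version 02 framing; the value used for VtRerror is an assumption to be checked against <venti.h>, and VentiError and parse_reply are hypothetical names.

```python
import struct

VT_RERROR = 1  # assumed value; confirm against <venti.h>


class VentiError(Exception):
    """Raised when the server answers a request with VtRerror."""


def parse_reply(packet: bytes, ttype: int, tag: int) -> bytes:
    """Validate a framed reply to a request of type ttype; return its parameters."""
    if len(packet) < 4:
        raise ValueError("short packet")
    (size,) = struct.unpack(">H", packet[:2])
    body = packet[2:]
    if size != len(body):
        raise ValueError("size field does not match packet length")
    rtype, rtag, params = body[0], body[1], body[2:]
    if rtag != tag:
        raise ValueError("reply tag does not match request tag")
    if rtype == VT_RERROR:
        # error[s]: a two-byte count followed by UTF-8 text
        (n,) = struct.unpack(">H", params[:2])
        raise VentiError(params[2:2 + n].decode("utf-8"))
    if rtype != ttype + 1:
        raise ValueError("unexpected reply type")
    return params
```

Raising an exception for VtRerror keeps the normal return path limited to successful replies; returning a status value would work equally well.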
    344 .PP
    345 Venti connections must begin with a 
    346 .B hello
    347 transaction.
    348 The
    349 .B VtThello
    350 message contains the protocol
    351 .I version
    352 that the client has chosen to use.
    353 The fields
    354 .IR strength ,
    355 .IR crypto ,
    356 and
    357 .IR codec
    358 could be used to add authentication, encryption,
    359 and compression to the Venti session
    360 but are currently ignored.
    361 The 
    362 .I rcrypto
    363 and
    364 .I rcodec
    365 fields in the 
    366 .B VtRhello
    367 response are similarly ignored.
    368 The
    369 .IR uid 
    370 and
    371 .IR sid
    372 fields are intended to be the identity
    373 of the client and server but, given the lack of
    374 authentication, should be treated only as advisory.
    375 The initial
    376 .B hello
    377 should be the only
    378 .B hello
    379 transaction during the session.
    380 .PP
    381 The
    382 .B ping
    383 message has no effect and 
    384 is used mainly for debugging.
    385 Servers should respond immediately to pings.
    386 .PP
    387 The
    388 .B read
    389 message requests a block with the given
    390 .I score
    391 and
    392 .IR type .
    393 Use
    394 .I vttodisktype
    395 and
    396 .I vtfromdisktype
    397 (see
    398 .MR venti (3) )
    399 to convert between a block type enumeration value
    400 .RB ( VtDataType ,
    401 etc.)
    402 and the
    403 .I type
    404 used on disk and in the protocol.
    405 The
    406 .I count
    407 field specifies the maximum expected size
    408 of the block.
    409 The
    410 .I data
    411 in the reply is the block's contents.
    412 .PP
    413 The
    414 .B write
    415 message writes a new block of the given
    416 .I type
    417 with contents
    418 .I data
    419 to the server.
    420 The response includes the
    421 .I score
    422 to use to read the block,
    423 which should be the SHA1 hash of 
    424 .IR data .
    425 .PP
    426 The Venti server may buffer written blocks in memory,
    427 waiting until after responding to the
    428 .B write
    429 message before writing them to
    430 permanent storage.
    431 The server will delay the response to a
    432 .B sync
    433 message until after all blocks in earlier
    434 .B write
    435 messages have been written to permanent storage.
    436 .PP
    437 The
    438 .B goodbye
    439 message ends a session.  There is no
    440 .BR VtRgoodbye :
    441 upon receiving the
    442 .BR VtTgoodbye
    443 message, the server terminates the connection.
    444 .PP
    445 Version
    446 .B 04
    447 of the Venti protocol is similar to version
    448 .B 02
    449 (described above)
    450 but has two changes to accommodate larger payloads.
    451 First, it replaces the leading 2-byte packet size with
    452 a 4-byte size.
    453 Second, the
    454 .I count
    455 in the
    456 .B VtTread
    457 packet may be either 2 or 4 bytes;
    458 the total packet length distinguishes the two cases.
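The two version-04 changes can be sketched as follows. The function names are hypothetical, and the sketch assumes that four-byte fields are big-endian like the two-byte fields described earlier.

```python
import struct


def frame04(msgtype: int, tag: int, params: bytes = b"") -> bytes:
    """Version 04 framing: a four-byte size (excluding itself), then type, tag, params."""
    body = bytes([msgtype, tag]) + params
    return struct.pack(">I", len(body)) + body


def parse_tread_params(params: bytes):
    """Decode score[20] type[1] pad[1] count[2 or 4]; the length selects the count width."""
    if len(params) == 24:
        (count,) = struct.unpack(">H", params[22:])
    elif len(params) == 26:
        (count,) = struct.unpack(">I", params[22:])
    else:
        raise ValueError("unexpected VtTread parameter length")
    return params[:20], params[20], count
```

Here params is everything after the type and tag bytes, so the two cases are 24 and 26 bytes long.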
    459 .SH SEE ALSO
    460 .MR venti (1) ,
    461 .MR venti (3) ,
    462 .MR venti (8)
    463 .br
    464 Sean Quinlan and Sean Dorward,
    465 ``Venti: a new approach to archival storage'',
    466 .I "Usenix Conference on File and Storage Technologies" ,
    467 2002.