Data Center Format
This page describes the encrypted and compressed data center format used by TERA.
C/C++-like primitive types, enums, structs, and unions will be used.
bool is equivalent to uint8_t but only allows the values true (1) and false (0).
Integers (uint8_t, int8_t, uint16_t, int16_t, etc.) are little endian.
float and double are IEEE 754 binary32 and binary64, respectively.
Characters (i.e. char16_t) are UTF-16 and little endian.
Fields are laid out in the declared order with no implied padding anywhere.
Note that the format uses both zero-based and one-based array indexes in various, seemingly random places.
Encryption
Data center files are encrypted with the AES algorithm in CFB mode and with block, key, and feedback sizes all set to 128 bits. No padding is done for the final block.
The encryption key and initialization vector can both be extracted from a TERA client. This can be done with a running TERA process or an unpacked executable. These values are usually freshly generated for each client build.
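As a rough illustration of the mode described above, the following Python sketch implements CFB with a 128-bit feedback size and an unpadded final block. The block cipher is passed in as a function. toy_block_encrypt is a hypothetical stand-in that exists only so the example runs without third-party packages; it is not AES. A real reader would supply AES-128 block encryption from a cryptography library, keyed with the value extracted from the client.

```python
import hashlib
from typing import Callable

BLOCK_SIZE = 16  # 128-bit block and feedback size


def cfb128_decrypt(encrypt_block: Callable[[bytes], bytes], iv: bytes, data: bytes) -> bytes:
    """Decrypt data in CFB mode with full-block feedback and no padding."""
    if len(iv) != BLOCK_SIZE:
        raise ValueError("IV must be 16 bytes")
    out = bytearray()
    feedback = iv
    for offset in range(0, len(data), BLOCK_SIZE):
        chunk = data[offset:offset + BLOCK_SIZE]
        keystream = encrypt_block(feedback)
        # zip() stops at the shorter input, so a short final block is
        # handled by simply using fewer keystream bytes.
        out.extend(c ^ k for c, k in zip(chunk, keystream))
        feedback = chunk  # the ciphertext block feeds the next step
    return bytes(out)


def cfb128_encrypt(encrypt_block: Callable[[bytes], bytes], iv: bytes, data: bytes) -> bytes:
    """Inverse of cfb128_decrypt, included so the sketch can be round-tripped."""
    out = bytearray()
    feedback = iv
    for offset in range(0, len(data), BLOCK_SIZE):
        chunk = data[offset:offset + BLOCK_SIZE]
        keystream = encrypt_block(feedback)
        block = bytes(p ^ k for p, k in zip(chunk, keystream))
        out.extend(block)
        feedback = block
    return bytes(out)


def toy_block_encrypt(key: bytes) -> Callable[[bytes], bytes]:
    """Hypothetical stand-in for AES-128 block encryption (demo only, NOT AES)."""
    return lambda block: hashlib.sha256(key + block).digest()[:BLOCK_SIZE]
```

Note that CFB only ever uses the block cipher in the encrypt direction, for both encryption and decryption.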
Physical Structure
The overall structure can be described like this:
Compression Header
After decryption, there is a small header of the form:
All data immediately following this header is compressed with the Deflate algorithm.
uncompressed_size is the size of the data center file once inflated.
zlib_header is a zlib (RFC 1950, §2.2) header. In official data center files, it is usually 0x9c78, but it can be any valid zlib header.
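The following Python sketch shows one way the decrypted data could be inflated. The field widths are assumptions, since the header declaration is not reproduced here: uncompressed_size is taken to be a uint32_t and zlib_header a uint16_t, which matches the 0x9c78 value quoted above. inflate_data_center is a hypothetical helper name.

```python
import struct
import zlib


def inflate_data_center(decrypted: bytes) -> bytes:
    """Parse the compression header and inflate the payload that follows it."""
    # Assumed layout: uint32_t uncompressed_size, then uint16_t zlib_header.
    uncompressed_size, zlib_header = struct.unpack_from("<IH", decrypted, 0)

    # The first byte on disk (CMF) is the low byte of the little-endian field.
    # A valid zlib header has CMF * 256 + FLG divisible by 31, and the low
    # nibble of CMF is 8 (Deflate).
    cmf, flg = zlib_header & 0xFF, zlib_header >> 8
    if (cmf * 256 + flg) % 31 != 0 or cmf & 0x0F != 8:
        raise ValueError(f"invalid zlib header: {zlib_header:#06x}")

    # Negative wbits selects raw Deflate, since the zlib header was consumed above.
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)
    inflated = inflater.decompress(decrypted[6:])
    if len(inflated) != uncompressed_size:
        raise ValueError("inflated size does not match uncompressed_size")
    return inflated
```

Using a decompression object rather than zlib.decompress means any bytes after the end of the Deflate stream are ignored instead of raising an error.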
File Header
After decompression, a data center file starts with this header:
version is currently 6. A past version 3 also existed. Note that this field has no distinguishing value for the 32-bit and 64-bit formats, but version 3 was only 32-bit.
timestamp is a Unix timestamp indicating when the file was produced.
unknown_1, unknown_2, unknown_3, unknown_4, and unknown_5 are always 0. They are actually part of a tree structure describing the XSD schema of the data graph, but official data centers never include this information.
revision indicates the version of the data graph contained within the file. It is sometimes (but not always) equal to the value sent by the client in the C_CHECK_VERSION packet. This field is not present if version is 3.
File Footer
A data center file ends with this footer:
marker is always 0 and has no known purpose in the client.
Regions
Most of the content in data center files is arranged into regions, which may be segmented. The region structures used throughout the format are described here:
In a DataCenterRegion, the used_count can be less than the full_count. full_count is usually 65535, even when used_count is much smaller. All data in the region beyond used_count can be arbitrary and should be considered undefined.
A DataCenterSegmentedSimpleRegion is mostly the same as a DataCenterSegmentedRegion, with the main difference being that it has a static number of segments.
Addresses are frequently used to refer to elements within both types of segmented regions:
Here, segment_index is a zero-based index into the segments array of the segmented region, while element_index is a zero-based index into the elements array of the segment.
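A minimal Python sketch of address resolution follows. It assumes that an address consists of two consecutive uint16_t fields, segment_index then element_index; that width is inferred from the 65535:65535 placeholder address mentioned elsewhere on this page rather than from a declaration. Address, read_address, and resolve are hypothetical names, and segments are modeled as plain Python sequences.

```python
import struct
from typing import NamedTuple, Sequence, TypeVar

T = TypeVar("T")


class Address(NamedTuple):
    segment_index: int
    element_index: int


def read_address(buffer: bytes, offset: int = 0) -> Address:
    # Assumption: two consecutive little-endian uint16_t fields.
    return Address(*struct.unpack_from("<HH", buffer, offset))


def resolve(segments: Sequence[Sequence[T]], address: Address) -> T:
    """Look up the element an address refers to; both indexes are zero-based."""
    segment = segments[address.segment_index]
    return segment[address.element_index]
```

A stricter reader would also reject an element_index at or beyond the segment's used_count, since data past that point is undefined.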
String Tables
All strings, whether they are names or values, are arranged into string tables, which are effectively used as hash tables by the client. A string table has the form:
A string entry in the table region looks like this:
hash is a hash code for the string, given by the expression data_center_string_hash(value) where value is the string value. In a typical data center file, there are very few hash collisions.
length is the length of the string in characters, including the NUL character.
index is a one-based index into the string table's addresses region. The address at this index must match the address field exactly.
address is an address into the string table's data region. This address points to the actual string data. The string read from this address must have the same length as the length field. Notably, if the string data straddles the end of its segment, the NUL character may be omitted. Readers should therefore not rely exclusively on the presence of a NUL character, but also check the segment bounds.
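The bounds check described above could look roughly like this in Python. The sketch assumes that element_index counts char16_t units and that the segment is available as raw bytes trimmed to its used size; read_string is a hypothetical helper.

```python
def read_string(segment: bytes, element_index: int, length: int) -> str:
    """Read a UTF-16LE string of `length` characters (NUL included) from a segment."""
    start = element_index * 2        # element_index counts char16_t units
    end = start + (length - 1) * 2   # end of the characters before the NUL
    if length < 1 or end > len(segment):
        raise ValueError("string does not fit inside its segment")
    # The terminator is only checked when it actually lies inside the segment,
    # because it may be omitted at the very end of a segment.
    terminator = segment[end:end + 2]
    if terminator and terminator != b"\x00\x00":
        raise ValueError("string is not NUL-terminated")
    return segment[start:end].decode("utf-16-le")
```

The length field drives the read here, and the NUL character is treated only as a consistency check.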
A string entry must be placed in the correct table segment based on its hash field. The segment index is given by the expression (hash ^ hash >> 16) % count where count is the static size of the table region. Further, entries in a segment must be sorted by their hash code in ascending order.
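The placement rule can be sketched in Python as follows. The sketch assumes that hash is an unsigned 32-bit value. segment_index_for and build_table_segments are hypothetical helpers, and entries are modeled as simple (hash, value) pairs rather than full string entries.

```python
def segment_index_for(hash_code: int, count: int) -> int:
    """Pick the table segment for a hash; >> binds tighter than ^, as in C."""
    hash_code &= 0xFFFFFFFF  # treat the hash as an unsigned 32-bit value
    return (hash_code ^ (hash_code >> 16)) % count


def build_table_segments(entries, count):
    """Distribute (hash, value) pairs into `count` segments sorted by hash."""
    segments = [[] for _ in range(count)]
    for hash_code, value in entries:
        segments[segment_index_for(hash_code, count)].append((hash_code, value))
    for segment in segments:
        segment.sort(key=lambda entry: entry[0])  # ascending hash order
    return segments
```

For example, with a hash of 0x12345678 and a count of 1024, the XOR yields 0x1234444c and the entry lands in segment 76.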
For the names table, the special names __root__ and __value__ must be added to the table last, so that their index values are greater than all other entries. Also, they must be present even if they are not used.
Finally, it is worth noting that the names table is always referred to by index, whereas the values table is always referred to by address. The reason for this is that the names table is small enough that all entries can be accessed by a uint16_t index value into its addresses region, whereas that is not the case for the values table. In spite of this difference, names and values must both have valid addresses regions.
String Hash
The data_center_string_hash function is a bizarre variant of CRC32. It is defined as follows:
(Note that string_hash_table is the same as value_hash_table.)
Data Graph
The actual content in a data center file is stored as a directed acyclic graph, which is essentially XML serialized to a binary format.
Nodes
Each node is of the form:
name_index is a one-based index into the addresses region of the names table. If this value is 0, it indicates that this node has no name or associated data of any kind, and should be disregarded; in this case, all other fields of the node should be considered undefined. Such nodes are usually incidental leftovers in official data center files and need not be present.
key_flags is 0 in official data center files. It may have a combination of the following values:
If DATA_CENTER_KEY_FLAGS_UNCACHED is set, the results of a query against this node will not be cached by the client.
key_index is a zero-based index into the keys region.
attribute_count and child_count indicate how many attributes and child nodes should be read for this node, respectively. If a count field is 0, then the corresponding address field should be considered undefined, though it will usually be 65535:65535 in official data center files.
attribute_address is an address into the attributes region. attribute_count attributes should be read at this address. These attributes must be sorted by their name index in ascending order.
child_address is an address into the node region. child_count nodes should be read at this address. These children must be sorted first by their name index, then by the values of key attributes (if any), in ascending order. Note that the sort must be stable since the order of multiple sibling nodes with the same name can be significant for the interpretation of the data.
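The ordering requirements above could be applied by a writer roughly as in this Python sketch. The in-memory Node class, the sort_node helper, and the keys_by_name mapping (node name index to the name indexes of its key attributes) are all hypothetical. The sketch also assumes that every child carries its key attributes and that key values compare with Python's natural ordering, which this page does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name_index: int
    attributes: dict = field(default_factory=dict)  # name_index -> value
    children: list = field(default_factory=list)


def sort_node(node: Node, keys_by_name=None) -> None:
    """Order one node's attributes and children the way the format expects."""
    keys_by_name = keys_by_name or {}

    # Attributes: ascending by name index.
    node.attributes = dict(sorted(node.attributes.items()))

    def child_key(child: Node):
        key_indexes = keys_by_name.get(child.name_index, ())
        key_values = tuple(child.attributes[i] for i in key_indexes)
        return (child.name_index, key_values)

    # list.sort() is stable, so siblings that compare equal keep their order.
    node.children.sort(key=child_key)
```

Because the name index is the first element of the sort key, key values are only ever compared between siblings that share a name.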
padding_1 and padding_2 should be considered undefined. They were added in the 64-bit data center format, and are not present in the 32-bit format.
The root node of the data graph must be located at the address 0:0. It must have the name __root__ and have zero attributes.
Keys
Keys are used to signal to a data center reading layer (e.g. in the client) that certain attributes of a node will be used frequently for lookups. The reading layer can then decide to construct an optimized lookup table for those specific paths in the graph, transparently making those lookups faster. It is effectively a trade-off between speed and memory usage.
name_index_1 and friends are one-based indexes into the addresses region of the names table. A value of 0 indicates that the field does not define a key. A key definition can specify between zero and four keys. These fields may not refer to the special __value__ attribute.
There need not be any keys defined in a data center file at all, but the client will be very slow without certain key definitions. At minimum, a data center file must contain a key definition at index 0 with all fields 0 (i.e. with no keys) which all nodes can point to by default.
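As an illustration of the one-based indexing, the following Python sketch turns a key definition into a list of attribute names. Here, names is a hypothetical list holding the names table's strings in addresses-region order, the key definition is modeled as a tuple of four name indexes, and resolve_key_names is a made-up helper.

```python
def resolve_key_names(key_definition, names):
    """Map a key definition's one-based name indexes to attribute names."""
    resolved = []
    for name_index in key_definition:
        if name_index == 0:
            continue  # 0 means this slot does not define a key
        name = names[name_index - 1]  # indexes are one-based
        if name == "__value__":
            raise ValueError("keys may not refer to __value__")
        resolved.append(name)
    return resolved
```

The mandatory default definition, (0, 0, 0, 0), resolves to an empty list.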
Attributes
Each node in the data graph has zero or more attributes, which are name/value pairs. They are of the form:
name_index is a one-based index into the addresses region of the names table.
type_code specifies the kind of value the attribute holds. Valid values are as follows:
extended_code specifies extra information based on the value of type_code:
If type_code is DATA_CENTER_TYPE_CODE_INT, then the lowest bit of extended_code is set if the attribute's value should be considered a Boolean, meaning that value can only be 1 (true) or 0 (false). Either way, the higher bits are 0.
If type_code is DATA_CENTER_TYPE_CODE_FLOAT, then extended_code is 0.
If type_code is DATA_CENTER_TYPE_CODE_STRING, then extended_code is given by the expression data_center_value_hash(value) where value is the string value.
value holds the attribute value and is interpreted according to type_code and extended_code. In the case of DATA_CENTER_TYPE_CODE_STRING, the a field holds an address into the data region of the values table. For other type codes, the value is written directly and is accessed through the i, b, or f fields.
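The interpretation rules above can be sketched in Python as follows. Several details are assumptions, since the declarations are not reproduced here: value is taken to be 4 bytes wide, i a signed 32-bit integer, f a binary32 float, and a an address made of two uint16_t fields. The TypeCode members are placeholders whose numeric values are not taken from the format, and decode_value is a hypothetical helper.

```python
import enum
import struct


class TypeCode(enum.Enum):
    # Placeholder members: the numeric values are NOT taken from the format.
    INT = enum.auto()
    FLOAT = enum.auto()
    STRING = enum.auto()


def decode_value(type_code: TypeCode, extended_code: int, raw: bytes):
    """Interpret the 4 raw bytes of an attribute value (assumed width)."""
    if type_code is TypeCode.INT:
        (number,) = struct.unpack("<i", raw)
        if extended_code & 1:  # lowest bit marks a Boolean
            if number not in (0, 1):
                raise ValueError("Boolean value must be 0 or 1")
            return bool(number)
        return number
    if type_code is TypeCode.FLOAT:
        return struct.unpack("<f", raw)[0]
    if type_code is TypeCode.STRING:
        # Address into the values table's data region, returned as
        # (segment_index, element_index); extended_code holds the value hash.
        return struct.unpack("<HH", raw)
    raise ValueError(f"unknown type code: {type_code!r}")
```

For strings, the returned address still has to be resolved against the values table to obtain the text.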
padding_1 should be considered undefined. It was added in the 64-bit data center format, and is not present in the 32-bit format.
Some nodes will have a special attribute named __value__. In XML terms, this represents the text of a node. For example, <Foo>bar</Foo> would be serialized to a node called Foo containing an attribute named __value__ with the string value bar. It is worth noting that a node can have both text and child nodes, such as Foo in this example:
Note that the __value__ attribute, if present, may only be a string.
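To make the XML correspondence concrete, here is a small Python sketch that maps an XML element onto a dictionary shaped like a data graph node. The dictionary layout and the element_to_node helper are hypothetical. The sketch keeps every attribute as a string and only looks at the text that precedes the first child element, which is a simplification.

```python
import xml.etree.ElementTree as ET


def element_to_node(element: ET.Element) -> dict:
    """Convert an XML element into a plain dict shaped like a data graph node."""
    attributes = dict(element.attrib)
    text = (element.text or "").strip()
    if text:
        attributes["__value__"] = text  # node text is always stored as a string
    return {
        "name": element.tag,
        "attributes": attributes,
        "children": [element_to_node(child) for child in element],
    }
```

For instance, parsing the made-up document <Foo id="1">bar<Baz>qux</Baz></Foo> with ET.fromstring and passing the result to element_to_node gives a Foo node whose attributes include __value__ set to bar, with a single Baz child whose __value__ is qux.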
Value Hash
The data_center_value_hash function uses a bizarre variant of CRC32 combined with a minimal effort to ignore the casing of characters. It is defined as follows:
(Note that value_hash_table is the same as string_hash_table.)