48

Associations with a record structure (e.g., a flat table) can be organised in one of two main ways. The first is as a list of associations, which is also the form used by Mathematica's Dataset functionality, such as

assoclist = 
  {<|"id" -> 1, "a1" -> 1, "a2" -> 2|>, <|"id" -> 2, "a1" -> 3, "a2" -> 4|>};

or as an association of associations keyed by a particular value, in this case the unique id of each record.

assoc2lev = 
  <| 1 -> <|"a1" -> 1, "a2" -> 2|>, 2 -> <|"a1" -> 3, "a2" -> 4|>|>;

For many operations both structures can be used. For instance, to retrieve a particular record with a given record id:

Query[Select[#id == 1 &]] @ assoclist
(* {<|"id" -> 1, "a1" -> 1, "a2" -> 2|>} *)

assoc2lev[1]
(* <|"a1" -> 1, "a2" -> 2|> *)

Or to select a specific element of the retrieved record:

Query[Select[#id == 1 &], "a1"] @ assoclist // First
assoc2lev[1, "a1"]
1

1
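
Operations across all records also look much the same in the two forms; for instance, summing "a1" (a small additional example):

assoclist // Query[All, "a1"] // Total
(* 4 *)

assoc2lev // Query[All, "a1"] // Total
(* 4 *)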

While the Mathematica help pages on Dataset are quite extensive, there are few examples regarding the use of an association of associations. For simple retrievals the latter seems to be easier and more compact.

Are there any guidelines based on practical examples as to whether one form or other is better to use?

Mac
  • Your assoc2lev merges two nested Associations similar to Association[a->x,b->y,Association[c->z]]. Although this is syntactically correct, it is perhaps less error-prone to write assoc2lev=<|1-><|"a1"->1,"a2"->2|>,2-><|"a1"->3,"a2"->4|>|>. – Romke Bontekoe Jul 02 '15 at 15:11
  • You're right. The question remains (it doesn't affect the purpose and the results of my entry) but I've improved the syntax of the assoc2lev definition. I think it's an interesting question and I'm hoping for some suggestions. – Mac Jul 02 '15 at 15:53
  • 3
    My understanding of Dataset and Association was greatly improved by watching, and listening(!), to these videos: Wolfram, YouTube, and YouTube. – Romke Bontekoe Jul 03 '15 at 07:08
  • I am not sure about internal intricacies and performance aspects but the Dataset to me simply is a generalization of what can be built using List and Association: You can have a Dataset to be a List of Lists or an Association of Associations - that kind of flexibility is absent with Association. So I use Dataset whenever I think of a complete set of data (eg. like a database), while Association is a building block and thus on a slightly lower level of abstraction. – gwr Jul 03 '15 at 08:27

2 Answers

52

A Dataset represents an abstraction over a structured collection of data. Notionally, it is restricted to "well-behaved" data -- data that comes in simple forms that can be readily interchanged with external systems such as relational databases, XML documents, JSON documents, etc. These are commonplace forms such as vectors, records ("structs"), tuples, etc.

While it is presently possible to drop any arbitrary Wolfram Language (WL) expression into a Dataset, we get best results if we restrict ourselves to these commonplace types. This means that we should avoid tricky data structures that exploit some of the more powerful symbolic features of WL, such as held expressions, up-values, and so on.

As noted in the question, it is not necessary to put an expression into a Dataset in order to exploit the full power of Query. Query operates upon any "naked" expression just fine. In fact, once data is wrapped within a Dataset, some otherwise valid queries may become prohibited. This is due to the main feature of Dataset -- data-typing.
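
As a quick illustration of that point, Query can be applied straight to the plain list of associations from the question (no Dataset wrapper, and hence no type checks):

assoclist = 
  {<|"id" -> 1, "a1" -> 1, "a2" -> 2|>, <|"id" -> 2, "a1" -> 3, "a2" -> 4|>};

Query[All, "a1"] @ assoclist
(* {1, 3} *)

Query[Select[#a2 > 2 &], "a1"] @ assoclist
(* {3} *)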

Data-typing in Dataset

When data is placed within a Dataset, type information is generated for that data. In principle, this information can be used for the following purposes:

  • data visualization
  • storage optimization
  • query optimization
  • interoperability with external systems
  • proactive error-checking ("strong typing")
  • ... and more

At the present time (version 10.1) type information is essentially used only for data visualization. It is used to generate the display form of a Dataset expression.

Future releases of WL are likely to exploit this type information further. For example, early beta documentation of version 10 spoke extensively about accessing SQL databases through datasets. This feature may return. I also suspect that future releases will place more limitations as to what can be meaningfully placed into a dataset in order to maximize interoperability.

Type information is generated by two different type-analysis processes which go by the jargon names Type Deduction and Type Inference. Type deduction occurs when data is initially placed into a dataset. Type inferencing occurs when an operator is applied to a dataset.

Type Deduction

Type deduction is when concrete data is analyzed in order to determine its type. The function that performs this deduction is TypeSystem`DeduceType:

Needs["TypeSystem`"]

DeduceType[1]
(* Atom[Integer] *)

DeduceType["one"]
(* Atom[String] *)

DeduceType[{1, 2, 3}]
(* Vector[Atom[Integer], 3] *)

DeduceType[{1, "a"}]
(* Tuple[{Atom[Integer], Atom[String]}] *)

DeduceType[<|"a" -> 1, "b" -> 2|>]
(* Struct[{"a", "b"}, {Atom[Integer], Atom[Integer]}] *)

DeduceType[<|a -> 1, b -> 2|>]
(* Assoc[AnyType, Atom[Integer], 2] *)

Arbitrary expressions get a very generic type:

DeduceType[f[x, y, z]]
(* AnyType *)

The cases above show some interesting differences. The all-number list is typed as a Vector, whereas the list with an integer and a string is typed as a Tuple. The association with string keys is typed as a Struct, whereas the one with non-string keys is typed as an Assoc.

It is type differences like this that are responsible for behavioural differences in Dataset. For example, the Dataset display form of a Struct is not the same as the display form of an Assoc:

Dataset[<| "a" -> 1, "b" -> 2 |>]

Struct screenshot

Dataset[<| a -> 1, b -> 2 |>]

Assoc screenshot

The behavioural change is due to a very subtle difference: string versus non-string keys within an association.

Type Inference

The second typing process is called Type Inference. This refers to determining what type of data will result from applying a function to a known type. The relevant function is TypeSystem`TypeApply:

TypeApply[Plus, {Atom[Integer], Atom[Integer]}]
(* Atom[Integer] *)

TypeApply[Plus, {Atom[Integer], Atom[Real]}]
(* Atom[Real] *)

TypeApply[StringLength, {Atom[String]}]
(* Atom[Integer] *)

For general WL expressions, this can be a very difficult problem. Consider that the presence of held expressions, up-values, replacement rules and other symbolic constructs can make it literally impossible to determine the result of an expression without evaluating it completely. Side-effects in functions can also wreak havoc upon any static analysis. So TypeApply sometimes just has to give up for lack of complete information.

TypeApply[g, {Atom[String]}]
(* UnknownType *)

TypeApply will look into pure functions:

TypeApply[# <> "xxx" &, {Atom[String]}]
(* Atom[String] *)

... but it does not presently inspect user definitions:

f[x_] := x <> "xxx"
TypeApply[f, {Atom[String]}]
(* UnknownType *)

Datasets and Querying

One of the applications of the dataset type information is to proactively check whether an operation makes sense. For example, TypeApply knows that you cannot sensibly ask for a key that does not exist in an association:

TypeApply[Query["a" /* IntegerQ] // Normal, {DeduceType[<|"x" -> 1|>]}]
(* FailureType[{Part,"Mismatch"},<|"Type"->Struct[{"x"},{Atom[Integer]}],"Part"->"a"|>] *)

An attempt to execute this query will (by default) fail:

Dataset[<|"x" -> 1|>] // Query["a" /* IntegerQ]

dataset failure screenshot

As noted earlier, Query functionality can be used independently of Dataset objects. Queries can be applied to arbitrary WL expressions. If we attempt the same operation against the raw association, the evaluation runs to completion since there is no type-inferencing involved:

<|"x" -> 1|> // Query["a" /* IntegerQ]
(* False *)

This simple example shows how querying a dataset can, by design, produce different results than when querying a general WL expression. The proactive strong type-checking takes a conservative approach that normally will protect us from errors. But there are mechanisms to override some of this checking should we decide that we can tolerate the apparent issue. In this case, for example:

Dataset[<|"x" -> 1|>] // Query["a" /* IntegerQ, PartBehavior -> None]
(* False *)

WL syntax is vast, so sometimes TypeApply is unable to cope with unusual cases:

TypeApply[Lookup["key"], {Struct[{"key"},{Atom[Integer]}]}]
(* Atom[Integer] *)

TypeApply[Lookup[#, "key", 0]&, {Struct[{"key"},{Atom[Integer]}]}]
(* FailureType[{Lookup,Invalid},
     <|Head->Lookup,Arguments->{Struct[{key},{Atom[Integer]}],key,0}|>] *)

It is these type-inferencing failures that sometimes cause queries upon dataset objects to fail unexpectedly:

Dataset[<|"key" -> 1|>] // Query[Lookup[#, "key", 0] &]

dataset failure screenshot

The type-failure above can be avoided by querying the "naked" data directly:

<|"key"->1|> // Query[Lookup[#,"key",0]&]
(* 1 *)

Future releases are likely to close gaps such as these. (edit: it is indeed fixed in release 10.2)

Type System Heuristics (Edit: 2015-07-17)

Sometimes, the type system relies upon heuristics to make a type determination. As an example, consider this association:

$a = MapIndexed[#->#2[[1]]&, CharacterRange["a", "p"]] // Association

(* <| "a" ->  1, "b" ->  2, "c" ->  3, "d" ->  4
    , "e" ->  5, "f" ->  6, "g" ->  7, "h" ->  8
    , "i" ->  9, "j" -> 10, "k" -> 11, "l" -> 12
    , "m" -> 13, "n" -> 14, "o" -> 15, "p" -> 16
    |> *)

It is typed as an interoperable Struct with 16 integer fields ("members"):

$a//DeduceType

(* Struct[
   {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p"},
   {Atom[Integer], Atom[Integer], Atom[Integer], Atom[Integer], 
    Atom[Integer], Atom[Integer], Atom[Integer], Atom[Integer],
    Atom[Integer], Atom[Integer], Atom[Integer], Atom[Integer],
    Atom[Integer], Atom[Integer], Atom[Integer], Atom[Integer]}] *)

But if we increase the number of fields from 16 to 17 by adding a key, then the expression is no longer considered to be a structure type. Instead, it is typed as a native Assoc:

<| $a, "q" -> 17 |> // DeduceType

(* Assoc[Atom[String], Atom[Integer], 17] *)

This use of "rules of thumb" to determine type introduces a certain element of non-determinism into the type system. These heuristics may change in future releases, meaning that the types of expressions (and even their semantics) may also change over time as well.

Conclusion

A major goal of Dataset is to represent common data interchange formats. By limiting data to simple types, storage optimizations become possible. By limiting the operations that can be performed upon that data, query cross-compilation to other languages becomes possible (e.g. SQL, XQuery, JSON query-by-example).

If our goal is to operate with arbitrary WL constructs, then we should avoid wrapping them into Dataset objects. Operate upon them directly using Query. But if the data is meant to be some combination of basic data types like vectors, structures, tuples and atoms, then Dataset is a good choice -- especially with interoperability in mind. The choice will likely offer more benefits in future releases.
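
A minimal sketch of that recommendation, reusing assoclist from the question: operate on the naked data with Query, and wrap the (now simple) result in Dataset only as the last step.

result = assoclist // Query[Select[#a2 > 2 &], KeyDrop["id"]]
(* {<|"a1" -> 3, "a2" -> 4|>} *)

Dataset[result] (* wrap only once the data is in a simple, interchange-friendly form *)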

WReach
  • +1. Very helpful answer that seems to explain why I do have trouble putting TimeSeries and EventSeries in Dataset. It seems to work but I do get interpretation warnings and the Objects are displayed with a red frame. Maybe better to put the information in a List then apply the needed Head after extraction? – gwr Jul 03 '15 at 19:51
  • @gwr That might be better... or perhaps it is better to operate upon the rich structure directly, without wrapping it within Dataset first. This would allow the use of unrestricted operators. Then, if needs be, the final result could be converted to an interchange-friendly (dataset-friendly) format as the last step. Perhaps future releases might perform such conversions as needed automatically? – WReach Jul 03 '15 at 20:00
  • Wow - thanks for the detailed answer and context for Dataset. The general guideline is understood: Dataset[] for basic data types in record format and other constructions for more complicated expressions. – Mac Jul 04 '15 at 19:19
  • 1
    A great answer (I already voted)! One small addition is that even when the keys are all strings, the association will be interpreted as Assoc in cases when the number of keys is larger than a certain limit (16 currently), or when the values are all themselves associations. – Leonid Shifrin Jul 05 '15 at 12:45
  • Thanks again for the explanations and reference material. – alancalvitti Jul 06 '15 at 21:20
  • Small note: The bug where Dataset[<|"key" -> 1|>] // Query[Lookup[#, "key", 0] &] wasn't evaluating is fixed in the upcoming version 10.2.0 of Mathematica. – Stefan R Jul 07 '15 at 20:15
  • @LeonidShifrin I added the section Type System Heuristics to make that discussion more explicit. It uses that first example. Thanks. – WReach Jul 18 '15 at 03:26
  • @WReach Thanks. That's pretty useful IMO, not just because of that example itself, but because it demonstrates the non-determinism of type deduction, as you noted. – Leonid Shifrin Jul 18 '15 at 08:35
  • The Dataset display form of an Assoc depends on the structure of the Assoc. For example, compare your Dataset[<|a -> 1, b -> 2|>] (Assoc[AnyType, Atom[Integer], 2]) with Dataset[<|1 -> a, 2 -> b|>] (Assoc[Atom[Integer], AnyType, 2]). – Karsten7 Oct 29 '15 at 10:41
  • Dataset[Append[$a, "q" -> 17]] is displayed the same way Dataset[$a] is. I think AnyType causes the different display form. – Karsten7 Oct 29 '15 at 11:15
  • @Karsten7. Agreed, the use of heuristics in the display form adds yet another level of (apparent) non-determinism into the operation of Dataset. And, just like the data-type heuristics, these heuristics also change from release to release. I am hopeful that there will be convergence in these areas as WRI incorporates feedback from each release. – WReach Oct 29 '15 at 14:37
7

The dual forms are routinely seen in NoSQL data, when importing spreadsheets with an implicit primary key, etc.

Use these operators to transform between them:

(* lift the primary key out of each record: list of associations -> association keyed by the id *)
primaryKeyUp[key_] := Query[GroupBy[key], First /* KeyDrop[key]];

(* push the outer key back into each record: keyed association -> list of associations *)
primaryKeyDown[key_] := Query[Normal /* Map[Last[#]~Prepend~(key -> First[#]) &]];

For example:

assoclist // Dataset

dataset screenshot (flat table with id column)

assoclist // primaryKeyUp["id"] // Dataset

dataset screenshot (records keyed by id)

assoclist // primaryKeyUp["id"] // primaryKeyDown["id"] // Dataset

dataset screenshot (round trip back to the flat table)
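
For reference, the same pipeline applied to the raw data (outputs shown as comments) round-trips back to the original list:

assoclist // primaryKeyUp["id"]
(* <|1 -> <|"a1" -> 1, "a2" -> 2|>, 2 -> <|"a1" -> 3, "a2" -> 4|>|> *)

assoclist // primaryKeyUp["id"] // primaryKeyDown["id"]
(* {<|"id" -> 1, "a1" -> 1, "a2" -> 2|>, <|"id" -> 2, "a1" -> 3, "a2" -> 4|>} *)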

Comments:

  • The shared argument "id" and the dual relation Up:delete :: Down:insert could be overloaded into a single primaryKey, since the two operate on List vs Association, but why bother - it's more intuitive with explicit directions.

  • Association will discard values associated with a repeated key, hence "primary".

alancalvitti