Performing statistics on distributed datasets (especially ordered statistics,
such as 'median') was previously very difficult, if not impossible, without
centralizing and exposing the data. TripleBlind's TableAsset.get_statistics()
operation makes it quick and easy to generate a wide variety of statistics
across distributed datasets while retaining privacy.

The first example demonstrates calculating statistics on 1000 similar records
distributed across three organizations. It produces overall statistics as well
as statistics grouped by a class -- in this case by "sex".

Finally, another set of statistics is run grouped by "ethnicity". In this
case the k-grouping setting of 5 (the highest of the three datasets involved)
leaves out the small group of "hawaiian" records: no grouped statistics are
shown for them, to protect their privacy. However, only one of the data
owners requires a k-grouping of 5 -- the other two data owners decided that a
k-grouping of 3 provides acceptable privacy coverage. So statistics by
ethnicity can successfully be calculated against only those two datasets.
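
The k-grouping behavior described above can be sketched in plain Python. This
is only an illustration of the rule on unprotected data, not how TripleBlind
implements it. The organization names, record counts, and the "fewer than k
records" threshold are assumptions made for the example.

```python
from collections import Counter

# Hypothetical plaintext stand-ins for the three datasets:
# owner name -> (k-grouping setting, list of "ethnicity" values).
DATASETS = {
    "org_a": (3, ["asian"] * 4 + ["white"] * 5 + ["hawaiian"] * 2),
    "org_b": (3, ["asian"] * 3 + ["white"] * 6 + ["hawaiian"] * 2),
    "org_c": (5, ["asian"] * 5 + ["white"] * 7),
}


def grouped_counts(owner_names):
    """Count records per group across the chosen datasets.

    The strictest (largest) k among the chosen datasets applies, and any
    group with fewer than k records is withheld (assumed threshold).
    """
    k = max(DATASETS[name][0] for name in owner_names)
    counts = Counter()
    for name in owner_names:
        counts.update(DATASETS[name][1])
    shown = {group: n for group, n in counts.items() if n >= k}
    return k, shown


# All three datasets: k is 5, so the 4 "hawaiian" records are withheld.
k_all, shown_all = grouped_counts(["org_a", "org_b", "org_c"])

# Only the two datasets with k of 3: the "hawaiian" group is now reported.
k_two, shown_two = grouped_counts(["org_a", "org_b"])

print(k_all, shown_all)
print(k_two, shown_two)
```

Taking the maximum k means a single strict data owner raises the threshold
for the whole calculation, which is why dropping that dataset changes which
groups are reported.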


====== Vertically Split Statistics ======

In addition to the statistics on simple datasets and combinations of datasets
with similar records as described above, statistics can be calculated for a
"virtual record" where data is split across several datasets which are linked
by a single field holding common values.

In this example, we have three discrete datasets that look something like this:

table1
  identifier, Normal,     Continuous
  1,         -0.11,       0.1111
  2,          0.22,       0.2222
  3,          1.33,       0.3333

table2
  id,         Continuous, DiscreteA
  1,          0.1212121,  11
  3,          0.3232323,  33

table3
  id,         DiscreteA,  DiscreteB
  1,          311,        3113
  3,          333,        3333


These can be combined and thought of as a single virtual table of records that
look like this:

  id, Normal, 0.Continuous, 1.Continuous, 1.DiscreteA, 2.DiscreteA, DiscreteB
  1, -0.11,   0.1111,       0.1212121,    11,          311,         3113
  3,  1.33,   0.3333,       0.3232323,    33,          333,         3333

Note that the data record for identifier=2 in the first dataset is excluded
since there isn't a matching record in the other two datasets.
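
To make the shape of the virtual table concrete, here is a minimal plaintext
sketch of the same combination: an inner join on the matching field. It only
illustrates the result; the real operation never brings the data together in
the clear. The helper name virtual_table and the column-naming rule (prefix a
column with its dataset index when the name appears in more than one dataset)
are assumptions inferred from the table shown above.

```python
from collections import Counter

table1 = [
    {"identifier": 1, "Normal": -0.11, "Continuous": 0.1111},
    {"identifier": 2, "Normal": 0.22, "Continuous": 0.2222},
    {"identifier": 3, "Normal": 1.33, "Continuous": 0.3333},
]
table2 = [
    {"id": 1, "Continuous": 0.1212121, "DiscreteA": 11},
    {"id": 3, "Continuous": 0.3232323, "DiscreteA": 33},
]
table3 = [
    {"id": 1, "DiscreteA": 311, "DiscreteB": 3113},
    {"id": 3, "DiscreteA": 333, "DiscreteB": 3333},
]


def virtual_table(tables, match_columns):
    """Inner-join non-empty tables on their respective match columns."""
    # Count how many tables use each non-key column name.
    name_counts = Counter(
        col
        for rows, key in zip(tables, match_columns)
        for col in rows[0]
        if col != key
    )
    # Index every table by the value of its match column.
    indexed = [
        {row[key]: row for row in rows}
        for rows, key in zip(tables, match_columns)
    ]
    # Keep only the key values present in every table.
    shared = set(indexed[0]).intersection(*indexed[1:])

    result = []
    for key_value in sorted(shared):
        merged = {"id": key_value}
        for i, (index, key) in enumerate(zip(indexed, match_columns)):
            for col, value in index[key_value].items():
                if col == key:
                    continue
                # Disambiguate column names shared by several tables.
                name = f"{i}.{col}" if name_counts[col] > 1 else col
                merged[name] = value
        result.append(merged)
    return result


virtual = virtual_table([table1, table2, table3], ["identifier", "id", "id"])
for row in virtual:
    print(row)
```

The record with identifier 2 is absent from the output because it has no
counterpart in the second and third tables.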

Statistics on these combined datasets can be obtained by using the
combine_with= and match_column= parameters to create the single virtual
dataset. Then the function and group_by parameters can be used as normal.
For example:

   table1.get_statistics(
     function=StatFunc.MEAN,
     group_by="DiscreteA",
     combine_with=[table2, table3],
     match_column=["identifier", "id", "id"],
   )

Privacy amongst the dataset owners is protected throughout the calculation, and
k-grouping is honored using the strictest dataset value.

See the second example in this folder for more operations on vertically
partitioned data.
