CloverDX Blog on Data Integration

Data sampling with CloverDX

Written by Agata Vackova | January 31, 2011

Testing data transformations is generally not an easy task. When creating and testing a transformation you might want to get a data sample to check if your transformation works properly. In this point a question arises: How do you create a representative data probe on the full data set? Obviously, the easiest way is to read just part of data from the beginning. But such a data sample can be unreliable.  We've prepared a few simple graphs that create a data probe which can be regarded as representative for the full data set.

First things first: What is data sampling?

Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points. By doing this, an analyst can identify and examine patterns and trends in the larger data set.

Data sampling enables data scientists, predictive modelers and other data analysts to work with a small, manageable amount of data about a statistical population to build and run analytical models more quickly, while still producing accurate findings.

Here's how to produce a reliable data sample (all graphs were created based on the sampling methods described in the article Sampling (statistics)).

Simple random sampling

In this method each record has the same probability of selection. Filtering is based on double value chosen (approximately) uniformly from the range 0.0d (inclusive) to 1.0d (exclusive): record is selected if the drawn number is lower than required sample set size:

Systematic sampling

Systematic sampling relies on arranging the data set according to some ordering scheme and then selecting elements in regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every k-th element from then onwards:

Sorting can be disabled in this graph. Then it is selected just every k-th element from the full data set, starting from a randomly selected record from the interval [1, k].

Stratified sampling

If the data set embraces a number of distinct categories, the frame can be organized by these categories into separate strata. Each stratum is then sampled as an independent sub-population out of which individual elements can be randomly selected. At least one record from each stratum must be selected:

Probability proportional to size sampling

Probability for each record is set to be proportional to its stratum size, up to a maximum of 1. Strata are defined by the value of the selected field. For each group of records it it is used systematic sampling method:

Methods comparison

Simple random sampling method is the simplest and fastest. It is sufficient in most cases. Systematic sampling with disabled sorting is as fast as simple random sampling and produces also strongly representative data probe. The stratified sampling method is the trickiest one. It is useful only if the data set can be split into the separated groups that have reasonable sizes. In other cases the data probe is a lot of bigger than requested.

Please see the attached CloverDX project with the above graphs. It also contains the graph for comparison of samples created with different sampling methods. I've done some tests for the file containing 5,000,000 rows with information about financial transactions. Each row contains unique transaction id, id of a customer, transaction amount and currency info. Total number of customers is 50,001; number of possible currencies is 35. I performed two sets of tests: one for the group defined by customer id and one defined by currency id.

Results for the sampling_field = CustomerId

Stratum is defined by id of customer. All data can be split to 50,001 groups with sizes from 61 to 143 transactions.

Following table shows testing results for some groups. Sorting was enabled for systematic sampling method.

defined sample size ratio: 0.01

sampling field (CustomerId) value simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 22 s 772 ms 0 h 1 m 34 s 965 ms 0 h 1 m 33 s 831 ms 0 h 1 m 30 s 973 ms
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 4 71 0.0563 1 71 0.0140 1 71 0.0140 1 71 0.0140
1 0 94 0.0000 1 94 0.0106 1 94 0.0106 1 94 0.0106
10 2 110 0.0181 1 110 0.0090 2 110 0.0181 1 110 0.0090
100 1 93 0.0107 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 0 83 0.0000 1 83 0.0120 1 83 0.0120 1 83 0.0120
10000 2 101 0.0198 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 0 99 0.0000 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 0 109 0.0000 1 109 0.0091 3 109 0.0275 1 109 0.0091
10003 1 86 0.0116 1 86 0.0116 2 86 0.0232 1 86 0.0116
10004 1 86 0.0116 1 86 0.0116 1 86 0.0116 1 86 0.0116
total 49937 5000000 0.0099 50000 5000000 0.0100 68172 5000000 0.0136 50011 5000000 0.0100
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 28 s 741 ms 0 h 1 m 34 s 474 ms 0 h 1 m 32 s 628 ms 0 h 1 m 33 s 949 ms
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 0 71 0.0000 0 71 0.0000 1 71 0.0140 0 71 0.0000
1 0 94 0.0000 1 94 0.0106 1 94 0.0106 1 94 0.0106
10 1 110 0.0090 1 110 0.0090 3 110 0.0272 1 110 0.0090
100 1 93 0.0107 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 0 83 0.0000 1 83 0.0120 2 83 0.0240 0 83 0.0000
10000 1 101 0.0099 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 2 99 0.0202 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 1 109 0.0091 1 109 0.0091 1 109 0.0091 1 109 0.0091
10003 0 86 0.0000 1 86 0.0116 1 86 0.0116 1 86 0.0116
10004 1 86 0.0116 1 86 0.0116 1 86 0.0116 0 86 0.0000
total 49931 5000000 0.0099 50000 5000000 0.0100 68369 5000000 0.0136 50010 5000000 0.0100
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 24 s 975 ms 0 h 1 m 37 s 446 ms 0 h 1 m 29 s 98 ms 0 h 1 m 32 s 857 ms
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 0 71 0.0000 1 71 0.0140 1 71 0.0140 1 71 0.0140
1 1 94 0.0106 1 94 0.0106 1 94 0.0106 1 94 0.0106
10 0 110 0.0000 1 110 0.0090 2 110 0.0181 2 110 0.0181
100 0 93 0.0000 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 1 83 0.0120 1 83 0.0120 1 83 0.0120 1 83 0.0120
10000 2 101 0.0198 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 1 99 0.0101 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 0 109 0.0000 1 109 0.0091 3 109 0.0275 1 109 0.0091
10003 1 86 0.0116 1 86 0.0116 1 86 0.0116 1 86 0.0116
10004 0 86 0.0000 1 86 0.0116 1 86 0.0116 1 86 0.0116
total 49983 5000000 0.0099 50000 5000000 0.0100 68258 5000000 0.0136 49900 5000000 0.0099
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling

Results for the same test but with data sorting disabled in systematic sampling method:

defined sample size ratio: 0.01

sampling field (CustomerId) value simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 28 s 168 ms 0 h 0 m 23 s 117 ms 0 h 1 m 35 s 414 ms 0 h 1 m 30 s 985 ms
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1 71 0.0140 0 71 0.0000 1 71 0.0140 1 71 0.0140
1 1 94 0.0106 0 94 0.0000 2 94 0.0212 1 94 0.0106
10 1 110 0.0090 2 110 0.0181 1 110 0.0090 1 110 0.0090
100 0 93 0.0000 0 93 0.0000 1 93 0.0107 1 93 0.0107
1000 0 83 0.0000 0 83 0.0000 1 83 0.0120 1 83 0.0120
10000 2 101 0.0198 0 101 0.0000 1 101 0.0099 1 101 0.0099
10001 0 99 0.0000 1 99 0.0101 1 99 0.0101 1 99 0.0101
10002 3 109 0.0275 1 109 0.0091 3 109 0.0275 1 109 0.0091
10003 1 86 0.0116 2 86 0.0232 1 86 0.0116 1 86 0.0116
10004 1 86 0.0116 0 86 0.0000 2 86 0.0232 0 86 0.0000
total 50081 5000000 0.0100 50000 5000000 0.0100 68227 5000000 0.0136 49966 5000000 0.0099
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 23 s 78 ms 0 h 0 m 19 s 178 ms 0 h 1 m 33 s 148 ms 0 h 1 m 29 s 261 ms
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 0 71 0.0000 0 71 0.0000 1 71 0.0140 0 71 0.0000
1 0 94 0.0000 0 94 0.0000 1 94 0.0106 1 94 0.0106
10 0 110 0.0000 3 110 0.0272 1 110 0.0090 1 110 0.0090
100 3 93 0.0322 1 93 0.0107 1 93 0.0107 1 93 0.0107
1000 1 83 0.0120 0 83 0.0000 2 83 0.0240 1 83 0.0120
10000 0 101 0.0000 1 101 0.0099 1 101 0.0099 1 101 0.0099
10001 1 99 0.0101 1 99 0.0101 3 99 0.0303 1 99 0.0101
10002 1 109 0.0091 0 109 0.0000 1 109 0.0091 1 109 0.0091
10003 1 86 0.0116 0 86 0.0000 1 86 0.0116 1 86 0.0116
10004 3 86 0.0348 0 86 0.0000 1 86 0.0116 1 86 0.0116
total 50056 5000000 0.0100 50000 5000000 0.0100 68528 5000000 0.0137 50033 5000000 0.0100
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 28 s 244 ms 0 h 0 m 27 s 52 ms 0 h 1 m 35 s 49 ms 0 h 1 m 27 s 725 ms
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1 71 0.0140 0 71 0.0000 2 71 0.0281 0 71 0.0000
1 1 94 0.0106 0 94 0.0000 2 94 0.0212 1 94 0.0106
10 0 110 0.0000 2 110 0.0181 4 110 0.0363 1 110 0.0090
100 2 93 0.0215 2 93 0.0215 1 93 0.0107 1 93 0.0107
1000 2 83 0.0240 0 83 0.0000 1 83 0.0120 1 83 0.0120
10000 0 101 0.0000 0 101 0.0000 1 101 0.0099 1 101 0.0099
10001 0 99 0.0000 4 99 0.0404 1 99 0.0101 1 99 0.0101
10002 1 109 0.0091 2 109 0.0183 1 109 0.0091 1 109 0.0091
10003 1 86 0.0116 1 86 0.0116 2 86 0.0232 0 86 0.0000
10004 0 86 0.0000 0 86 0.0000 1 86 0.0116 1 86 0.0116
total 50116 5000000 0.0100 50000 5000000 0.0100 68470 5000000 0.0136 50010 5000000 0.0100
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CustomerId simple sampling systematic sampling stratified sampling pps sampling

Since the groups are really small, there should be selected none or one record from each group and for the smaller groups we should have more often zero selected records. In relation to this criteria the PPS sampling method and systematic sampling method with sorting data enabled give the best results. Data sample created with stratified method is always oversized.

Results for the sampling_field = CurrencyId

Stratum is defined by id of currency. All data can be split to 35 groups with very similar sizes from 142,042 to 143,572 transactions.

The following table shows testing results for some groups. Sorting was enabled for systematic sampling method.

defined sample size ratio: 0.01

sampling field (CurrencyId) value simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 21 s 681 ms 0 h 1 m 26 s 859 ms 0 h 1 m 25 s 970 ms 0 h 1 m 27 s 85 ms
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1450 142623 0.0101 1427 142623 0.0100 1447 142623 0.0101 1426 142623 0.0099
1 1371 142925 0.0095 1429 142925 0.0099 1430 142925 0.0100 1429 142925 0.0099
10 1420 142897 0.0099 1429 142897 0.0100 1432 142897 0.0100 1429 142897 0.0100
11 1448 142896 0.0101 1429 142896 0.0100 1443 142896 0.0100 1429 142896 0.0100
12 1383 142522 0.0097 1425 142522 0.0099 1488 142522 0.0104 1425 142522 0.0099
13 1468 142461 0.0103 1425 142461 0.0100 1395 142461 0.0097 1424 142461 0.0099
14 1449 142997 0.0101 1430 142997 0.0100 1479 142997 0.0103 1430 142997 0.0100
15 1401 142697 0.0098 1426 142697 0.0099 1438 142697 0.0100 1427 142697 0.0100
16 1396 143137 0.0097 1432 143137 0.0100 1387 143137 0.0096 1431 143137 0.0099
17 1464 142517 0.0102 1425 142517 0.0099 1413 142517 0.0099 1425 142517 0.0099
total 49959 5000000 0.0099 50000 5000000 0.0100 50075 5000000 0.0100 49997 5000000 0.0099
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CurrencyId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 22 s 949 ms 0 h 1 m 25 s 726 ms 0 h 1 m 27 s 629 ms 0 h 1 m 24 s 537 ms
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1449 142623 0.0101 1427 142623 0.0100 1496 142623 0.0104 1426 142623 0.0099
1 1468 142925 0.0102 1429 142925 0.0099 1442 142925 0.0100 1429 142925 0.0099
10 1436 142897 0.0100 1429 142897 0.0100 1406 142897 0.0098 1429 142897 0.0100
11 1436 142896 0.0100 1429 142896 0.0100 1402 142896 0.0098 1429 142896 0.0100
12 1410 142522 0.0098 1425 142522 0.0099 1454 142522 0.0102 1425 142522 0.0099
13 1438 142461 0.0100 1425 142461 0.0100 1414 142461 0.0099 1425 142461 0.0100
14 1420 142997 0.0099 1430 142997 0.0100 1450 142997 0.0101 1430 142997 0.0100
15 1412 142697 0.0098 1427 142697 0.0100 1400 142697 0.0098 1427 142697 0.0100
16 1453 143137 0.0101 1431 143137 0.0099 1442 143137 0.0100 1431 143137 0.0099
17 1431 142517 0.0100 1425 142517 0.0099 1372 142517 0.0096 1425 142517 0.0099
total 50163 5000000 0.0100 50000 5000000 0.0100 49709 5000000 0.0099 50000 5000000 0.0100
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CurrencyId simple sampling systematic sampling stratified sampling pps sampling
Sampling time 0 h 0 m 27 s 716 ms 0 h 1 m 26 s 865 ms 0 h 1 m 26 s 657 ms 0 h 1 m 26 s 254 ms
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
0 1488 142623 0.0104 1426 142623 0.0099 1416 142623 0.0099 1426 142623 0.0099
1 1353 142925 0.0094 1429 142925 0.0099 1434 142925 0.0100 1429 142925 0.0099
10 1417 142897 0.0099 1429 142897 0.0100 1390 142897 0.0097 1429 142897 0.0100
11 1448 142896 0.0101 1429 142896 0.0100 1438 142896 0.0100 1429 142896 0.0100
12 1448 142522 0.0101 1425 142522 0.0099 1408 142522 0.0098 1425 142522 0.0099
13 1412 142461 0.0099 1425 142461 0.0100 1432 142461 0.0100 1424 142461 0.0099
14 1440 142997 0.0100 1430 142997 0.0100 1471 142997 0.0102 1430 142997 0.0100
15 1445 142697 0.0101 1427 142697 0.0100 1530 142697 0.0107 1427 142697 0.0100
16 1436 143137 0.0100 1431 143137 0.0099 1456 143137 0.0101 1432 143137 0.0100
17 1381 142517 0.0096 1425 142517 0.0099 1365 142517 0.0095 1426 142517 0.0100
total 50089 5000000 0.0100 50000 5000000 0.0100 49707 5000000 0.0099 49999 5000000 0.0099
  sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio sample size dataset size sample size ratio
CurrencyId simple sampling systematic sampling stratified sampling pps sampling

With such large groups all the methods give very good results. Although no doubt we get the best results using the systematic sampling or PPS sampling methods where the sample size is always within the limits 0.0099 to 0.0100.

Download the transformation graph with data