Data sampling with CloverDX
Testing data transformations is generally not an easy task. When creating and testing a transformation you might want to get a data sample to check if your transformation works properly. In this point a question arises: How do you create a representative data probe on the full data set? Obviously, the easiest way is to read just part of data from the beginning. But such a data sample can be unreliable. We've prepared a few simple graphs that create a data probe which can be regarded as representative for the full data set.
First things first: What is data sampling?
Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points. By doing this, an analyst can identify and examine patterns and trends in the larger data set.
Data sampling enables data scientists, predictive modelers and other data analysts to work with a small, manageable amount of data about a statistical population to build and run analytical models more quickly, while still producing accurate findings.
Here's how to produce a reliable data sample (all graphs were created based on the sampling methods described in the article Sampling (statistics)).
Simple random sampling
In this method each record has the same probability of selection. Filtering is based on double value chosen (approximately) uniformly from the range 0.0d (inclusive) to 1.0d (exclusive): record is selected if the drawn number is lower than required sample set size:
Systematic sampling
Systematic sampling relies on arranging the data set according to some ordering scheme and then selecting elements in regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every k-th element from then onwards:
Sorting can be disabled in this graph. Then it is selected just every k-th element from the full data set, starting from a randomly selected record from the interval [1, k].
Stratified sampling
If the data set embraces a number of distinct categories, the frame can be organized by these categories into separate strata. Each stratum is then sampled as an independent sub-population out of which individual elements can be randomly selected. At least one record from each stratum must be selected:
Probability proportional to size sampling
Probability for each record is set to be proportional to its stratum size, up to a maximum of 1. Strata are defined by the value of the selected field. For each group of records it it is used systematic sampling method:
Methods comparison
Simple random sampling method is the simplest and fastest. It is sufficient in most cases. Systematic sampling with disabled sorting is as fast as simple random sampling and produces also strongly representative data probe. The stratified sampling method is the trickiest one. It is useful only if the data set can be split into the separated groups that have reasonable sizes. In other cases the data probe is a lot of bigger than requested.
Please see the attached CloverDX project with the above graphs. It also contains the graph for comparison of samples created with different sampling methods. I've done some tests for the file containing 5,000,000 rows with information about financial transactions. Each row contains unique transaction id, id of a customer, transaction amount and currency info. Total number of customers is 50,001; number of possible currencies is 35. I performed two sets of tests: one for the group defined by customer id and one defined by currency id.
Results for the sampling_field = CustomerId
Stratum is defined by id of customer. All data can be split to 50,001 groups with sizes from 61 to 143 transactions.
Following table shows testing results for some groups. Sorting was enabled for systematic sampling method.
defined sample size ratio: 0.01
sampling field (CustomerId) value | simple sampling | systematic sampling | stratified sampling | pps sampling | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sampling time | 0 h 0 m 22 s 772 ms | 0 h 1 m 34 s 965 ms | 0 h 1 m 33 s 831 ms | 0 h 1 m 30 s 973 ms | ||||||||
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
0 | 4 | 71 | 0.0563 | 1 | 71 | 0.0140 | 1 | 71 | 0.0140 | 1 | 71 | 0.0140 |
1 | 0 | 94 | 0.0000 | 1 | 94 | 0.0106 | 1 | 94 | 0.0106 | 1 | 94 | 0.0106 |
10 | 2 | 110 | 0.0181 | 1 | 110 | 0.0090 | 2 | 110 | 0.0181 | 1 | 110 | 0.0090 |
100 | 1 | 93 | 0.0107 | 1 | 93 | 0.0107 | 1 | 93 | 0.0107 | 1 | 93 | 0.0107 |
1000 | 0 | 83 | 0.0000 | 1 | 83 | 0.0120 | 1 | 83 | 0.0120 | 1 | 83 | 0.0120 |
10000 | 2 | 101 | 0.0198 | 1 | 101 | 0.0099 | 1 | 101 | 0.0099 | 1 | 101 | 0.0099 |
10001 | 0 | 99 | 0.0000 | 1 | 99 | 0.0101 | 1 | 99 | 0.0101 | 1 | 99 | 0.0101 |
10002 | 0 | 109 | 0.0000 | 1 | 109 | 0.0091 | 3 | 109 | 0.0275 | 1 | 109 | 0.0091 |
10003 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 | 2 | 86 | 0.0232 | 1 | 86 | 0.0116 |
10004 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 |
total | 49937 | 5000000 | 0.0099 | 50000 | 5000000 | 0.0100 | 68172 | 5000000 | 0.0136 | 50011 | 5000000 | 0.0100 |
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
CustomerId | simple sampling | systematic sampling | stratified sampling | pps sampling | ||||||||
Sampling time | 0 h 0 m 28 s 741 ms | 0 h 1 m 34 s 474 ms | 0 h 1 m 32 s 628 ms | 0 h 1 m 33 s 949 ms | ||||||||
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
0 | 0 | 71 | 0.0000 | 0 | 71 | 0.0000 | 1 | 71 | 0.0140 | 0 | 71 | 0.0000 |
1 | 0 | 94 | 0.0000 | 1 | 94 | 0.0106 | 1 | 94 | 0.0106 | 1 | 94 | 0.0106 |
10 | 1 | 110 | 0.0090 | 1 | 110 | 0.0090 | 3 | 110 | 0.0272 | 1 | 110 | 0.0090 |
100 | 1 | 93 | 0.0107 | 1 | 93 | 0.0107 | 1 | 93 | 0.0107 | 1 | 93 | 0.0107 |
1000 | 0 | 83 | 0.0000 | 1 | 83 | 0.0120 | 2 | 83 | 0.0240 | 0 | 83 | 0.0000 |
10000 | 1 | 101 | 0.0099 | 1 | 101 | 0.0099 | 1 | 101 | 0.0099 | 1 | 101 | 0.0099 |
10001 | 2 | 99 | 0.0202 | 1 | 99 | 0.0101 | 1 | 99 | 0.0101 | 1 | 99 | 0.0101 |
10002 | 1 | 109 | 0.0091 | 1 | 109 | 0.0091 | 1 | 109 | 0.0091 | 1 | 109 | 0.0091 |
10003 | 0 | 86 | 0.0000 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 |
10004 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 | 0 | 86 | 0.0000 |
total | 49931 | 5000000 | 0.0099 | 50000 | 5000000 | 0.0100 | 68369 | 5000000 | 0.0136 | 50010 | 5000000 | 0.0100 |
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
CustomerId | simple sampling | systematic sampling | stratified sampling | pps sampling | ||||||||
Sampling time | 0 h 0 m 24 s 975 ms | 0 h 1 m 37 s 446 ms | 0 h 1 m 29 s 98 ms | 0 h 1 m 32 s 857 ms | ||||||||
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
0 | 0 | 71 | 0.0000 | 1 | 71 | 0.0140 | 1 | 71 | 0.0140 | 1 | 71 | 0.0140 |
1 | 1 | 94 | 0.0106 | 1 | 94 | 0.0106 | 1 | 94 | 0.0106 | 1 | 94 | 0.0106 |
10 | 0 | 110 | 0.0000 | 1 | 110 | 0.0090 | 2 | 110 | 0.0181 | 2 | 110 | 0.0181 |
100 | 0 | 93 | 0.0000 | 1 | 93 | 0.0107 | 1 | 93 | 0.0107 | 1 | 93 | 0.0107 |
1000 | 1 | 83 | 0.0120 | 1 | 83 | 0.0120 | 1 | 83 | 0.0120 | 1 | 83 | 0.0120 |
10000 | 2 | 101 | 0.0198 | 1 | 101 | 0.0099 | 1 | 101 | 0.0099 | 1 | 101 | 0.0099 |
10001 | 1 | 99 | 0.0101 | 1 | 99 | 0.0101 | 1 | 99 | 0.0101 | 1 | 99 | 0.0101 |
10002 | 0 | 109 | 0.0000 | 1 | 109 | 0.0091 | 3 | 109 | 0.0275 | 1 | 109 | 0.0091 |
10003 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 |
10004 | 0 | 86 | 0.0000 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 |
total | 49983 | 5000000 | 0.0099 | 50000 | 5000000 | 0.0100 | 68258 | 5000000 | 0.0136 | 49900 | 5000000 | 0.0099 |
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
CustomerId | simple sampling | systematic sampling | stratified sampling | pps sampling |
Results for the same test but with data sorting disabled in systematic sampling method:
defined sample size ratio: 0.01
sampling field (CustomerId) value | simple sampling | systematic sampling | stratified sampling | pps sampling | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sampling time | 0 h 0 m 28 s 168 ms | 0 h 0 m 23 s 117 ms | 0 h 1 m 35 s 414 ms | 0 h 1 m 30 s 985 ms | ||||||||
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
0 | 1 | 71 | 0.0140 | 0 | 71 | 0.0000 | 1 | 71 | 0.0140 | 1 | 71 | 0.0140 |
1 | 1 | 94 | 0.0106 | 0 | 94 | 0.0000 | 2 | 94 | 0.0212 | 1 | 94 | 0.0106 |
10 | 1 | 110 | 0.0090 | 2 | 110 | 0.0181 | 1 | 110 | 0.0090 | 1 | 110 | 0.0090 |
100 | 0 | 93 | 0.0000 | 0 | 93 | 0.0000 | 1 | 93 | 0.0107 | 1 | 93 | 0.0107 |
1000 | 0 | 83 | 0.0000 | 0 | 83 | 0.0000 | 1 | 83 | 0.0120 | 1 | 83 | 0.0120 |
10000 | 2 | 101 | 0.0198 | 0 | 101 | 0.0000 | 1 | 101 | 0.0099 | 1 | 101 | 0.0099 |
10001 | 0 | 99 | 0.0000 | 1 | 99 | 0.0101 | 1 | 99 | 0.0101 | 1 | 99 | 0.0101 |
10002 | 3 | 109 | 0.0275 | 1 | 109 | 0.0091 | 3 | 109 | 0.0275 | 1 | 109 | 0.0091 |
10003 | 1 | 86 | 0.0116 | 2 | 86 | 0.0232 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 |
10004 | 1 | 86 | 0.0116 | 0 | 86 | 0.0000 | 2 | 86 | 0.0232 | 0 | 86 | 0.0000 |
total | 50081 | 5000000 | 0.0100 | 50000 | 5000000 | 0.0100 | 68227 | 5000000 | 0.0136 | 49966 | 5000000 | 0.0099 |
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
CustomerId | simple sampling | systematic sampling | stratified sampling | pps sampling | ||||||||
Sampling time | 0 h 0 m 23 s 78 ms | 0 h 0 m 19 s 178 ms | 0 h 1 m 33 s 148 ms | 0 h 1 m 29 s 261 ms | ||||||||
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
0 | 0 | 71 | 0.0000 | 0 | 71 | 0.0000 | 1 | 71 | 0.0140 | 0 | 71 | 0.0000 |
1 | 0 | 94 | 0.0000 | 0 | 94 | 0.0000 | 1 | 94 | 0.0106 | 1 | 94 | 0.0106 |
10 | 0 | 110 | 0.0000 | 3 | 110 | 0.0272 | 1 | 110 | 0.0090 | 1 | 110 | 0.0090 |
100 | 3 | 93 | 0.0322 | 1 | 93 | 0.0107 | 1 | 93 | 0.0107 | 1 | 93 | 0.0107 |
1000 | 1 | 83 | 0.0120 | 0 | 83 | 0.0000 | 2 | 83 | 0.0240 | 1 | 83 | 0.0120 |
10000 | 0 | 101 | 0.0000 | 1 | 101 | 0.0099 | 1 | 101 | 0.0099 | 1 | 101 | 0.0099 |
10001 | 1 | 99 | 0.0101 | 1 | 99 | 0.0101 | 3 | 99 | 0.0303 | 1 | 99 | 0.0101 |
10002 | 1 | 109 | 0.0091 | 0 | 109 | 0.0000 | 1 | 109 | 0.0091 | 1 | 109 | 0.0091 |
10003 | 1 | 86 | 0.0116 | 0 | 86 | 0.0000 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 |
10004 | 3 | 86 | 0.0348 | 0 | 86 | 0.0000 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 |
total | 50056 | 5000000 | 0.0100 | 50000 | 5000000 | 0.0100 | 68528 | 5000000 | 0.0137 | 50033 | 5000000 | 0.0100 |
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
CustomerId | simple sampling | systematic sampling | stratified sampling | pps sampling | ||||||||
Sampling time | 0 h 0 m 28 s 244 ms | 0 h 0 m 27 s 52 ms | 0 h 1 m 35 s 49 ms | 0 h 1 m 27 s 725 ms | ||||||||
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
0 | 1 | 71 | 0.0140 | 0 | 71 | 0.0000 | 2 | 71 | 0.0281 | 0 | 71 | 0.0000 |
1 | 1 | 94 | 0.0106 | 0 | 94 | 0.0000 | 2 | 94 | 0.0212 | 1 | 94 | 0.0106 |
10 | 0 | 110 | 0.0000 | 2 | 110 | 0.0181 | 4 | 110 | 0.0363 | 1 | 110 | 0.0090 |
100 | 2 | 93 | 0.0215 | 2 | 93 | 0.0215 | 1 | 93 | 0.0107 | 1 | 93 | 0.0107 |
1000 | 2 | 83 | 0.0240 | 0 | 83 | 0.0000 | 1 | 83 | 0.0120 | 1 | 83 | 0.0120 |
10000 | 0 | 101 | 0.0000 | 0 | 101 | 0.0000 | 1 | 101 | 0.0099 | 1 | 101 | 0.0099 |
10001 | 0 | 99 | 0.0000 | 4 | 99 | 0.0404 | 1 | 99 | 0.0101 | 1 | 99 | 0.0101 |
10002 | 1 | 109 | 0.0091 | 2 | 109 | 0.0183 | 1 | 109 | 0.0091 | 1 | 109 | 0.0091 |
10003 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 | 2 | 86 | 0.0232 | 0 | 86 | 0.0000 |
10004 | 0 | 86 | 0.0000 | 0 | 86 | 0.0000 | 1 | 86 | 0.0116 | 1 | 86 | 0.0116 |
total | 50116 | 5000000 | 0.0100 | 50000 | 5000000 | 0.0100 | 68470 | 5000000 | 0.0136 | 50010 | 5000000 | 0.0100 |
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
CustomerId | simple sampling | systematic sampling | stratified sampling | pps sampling |
Since the groups are really small, there should be selected none or one record from each group and for the smaller groups we should have more often zero selected records. In relation to this criteria the PPS sampling method and systematic sampling method with sorting data enabled give the best results. Data sample created with stratified method is always oversized.
Results for the sampling_field = CurrencyId
Stratum is defined by id of currency. All data can be split to 35 groups with very similar sizes from 142,042 to 143,572 transactions.
The following table shows testing results for some groups. Sorting was enabled for systematic sampling method.
defined sample size ratio: 0.01
sampling field (CurrencyId) value | simple sampling | systematic sampling | stratified sampling | pps sampling | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sampling time | 0 h 0 m 21 s 681 ms | 0 h 1 m 26 s 859 ms | 0 h 1 m 25 s 970 ms | 0 h 1 m 27 s 85 ms | ||||||||
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
0 | 1450 | 142623 | 0.0101 | 1427 | 142623 | 0.0100 | 1447 | 142623 | 0.0101 | 1426 | 142623 | 0.0099 |
1 | 1371 | 142925 | 0.0095 | 1429 | 142925 | 0.0099 | 1430 | 142925 | 0.0100 | 1429 | 142925 | 0.0099 |
10 | 1420 | 142897 | 0.0099 | 1429 | 142897 | 0.0100 | 1432 | 142897 | 0.0100 | 1429 | 142897 | 0.0100 |
11 | 1448 | 142896 | 0.0101 | 1429 | 142896 | 0.0100 | 1443 | 142896 | 0.0100 | 1429 | 142896 | 0.0100 |
12 | 1383 | 142522 | 0.0097 | 1425 | 142522 | 0.0099 | 1488 | 142522 | 0.0104 | 1425 | 142522 | 0.0099 |
13 | 1468 | 142461 | 0.0103 | 1425 | 142461 | 0.0100 | 1395 | 142461 | 0.0097 | 1424 | 142461 | 0.0099 |
14 | 1449 | 142997 | 0.0101 | 1430 | 142997 | 0.0100 | 1479 | 142997 | 0.0103 | 1430 | 142997 | 0.0100 |
15 | 1401 | 142697 | 0.0098 | 1426 | 142697 | 0.0099 | 1438 | 142697 | 0.0100 | 1427 | 142697 | 0.0100 |
16 | 1396 | 143137 | 0.0097 | 1432 | 143137 | 0.0100 | 1387 | 143137 | 0.0096 | 1431 | 143137 | 0.0099 |
17 | 1464 | 142517 | 0.0102 | 1425 | 142517 | 0.0099 | 1413 | 142517 | 0.0099 | 1425 | 142517 | 0.0099 |
total | 49959 | 5000000 | 0.0099 | 50000 | 5000000 | 0.0100 | 50075 | 5000000 | 0.0100 | 49997 | 5000000 | 0.0099 |
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
CurrencyId | simple sampling | systematic sampling | stratified sampling | pps sampling | ||||||||
Sampling time | 0 h 0 m 22 s 949 ms | 0 h 1 m 25 s 726 ms | 0 h 1 m 27 s 629 ms | 0 h 1 m 24 s 537 ms | ||||||||
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
0 | 1449 | 142623 | 0.0101 | 1427 | 142623 | 0.0100 | 1496 | 142623 | 0.0104 | 1426 | 142623 | 0.0099 |
1 | 1468 | 142925 | 0.0102 | 1429 | 142925 | 0.0099 | 1442 | 142925 | 0.0100 | 1429 | 142925 | 0.0099 |
10 | 1436 | 142897 | 0.0100 | 1429 | 142897 | 0.0100 | 1406 | 142897 | 0.0098 | 1429 | 142897 | 0.0100 |
11 | 1436 | 142896 | 0.0100 | 1429 | 142896 | 0.0100 | 1402 | 142896 | 0.0098 | 1429 | 142896 | 0.0100 |
12 | 1410 | 142522 | 0.0098 | 1425 | 142522 | 0.0099 | 1454 | 142522 | 0.0102 | 1425 | 142522 | 0.0099 |
13 | 1438 | 142461 | 0.0100 | 1425 | 142461 | 0.0100 | 1414 | 142461 | 0.0099 | 1425 | 142461 | 0.0100 |
14 | 1420 | 142997 | 0.0099 | 1430 | 142997 | 0.0100 | 1450 | 142997 | 0.0101 | 1430 | 142997 | 0.0100 |
15 | 1412 | 142697 | 0.0098 | 1427 | 142697 | 0.0100 | 1400 | 142697 | 0.0098 | 1427 | 142697 | 0.0100 |
16 | 1453 | 143137 | 0.0101 | 1431 | 143137 | 0.0099 | 1442 | 143137 | 0.0100 | 1431 | 143137 | 0.0099 |
17 | 1431 | 142517 | 0.0100 | 1425 | 142517 | 0.0099 | 1372 | 142517 | 0.0096 | 1425 | 142517 | 0.0099 |
total | 50163 | 5000000 | 0.0100 | 50000 | 5000000 | 0.0100 | 49709 | 5000000 | 0.0099 | 50000 | 5000000 | 0.0100 |
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
CurrencyId | simple sampling | systematic sampling | stratified sampling | pps sampling | ||||||||
Sampling time | 0 h 0 m 27 s 716 ms | 0 h 1 m 26 s 865 ms | 0 h 1 m 26 s 657 ms | 0 h 1 m 26 s 254 ms | ||||||||
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
0 | 1488 | 142623 | 0.0104 | 1426 | 142623 | 0.0099 | 1416 | 142623 | 0.0099 | 1426 | 142623 | 0.0099 |
1 | 1353 | 142925 | 0.0094 | 1429 | 142925 | 0.0099 | 1434 | 142925 | 0.0100 | 1429 | 142925 | 0.0099 |
10 | 1417 | 142897 | 0.0099 | 1429 | 142897 | 0.0100 | 1390 | 142897 | 0.0097 | 1429 | 142897 | 0.0100 |
11 | 1448 | 142896 | 0.0101 | 1429 | 142896 | 0.0100 | 1438 | 142896 | 0.0100 | 1429 | 142896 | 0.0100 |
12 | 1448 | 142522 | 0.0101 | 1425 | 142522 | 0.0099 | 1408 | 142522 | 0.0098 | 1425 | 142522 | 0.0099 |
13 | 1412 | 142461 | 0.0099 | 1425 | 142461 | 0.0100 | 1432 | 142461 | 0.0100 | 1424 | 142461 | 0.0099 |
14 | 1440 | 142997 | 0.0100 | 1430 | 142997 | 0.0100 | 1471 | 142997 | 0.0102 | 1430 | 142997 | 0.0100 |
15 | 1445 | 142697 | 0.0101 | 1427 | 142697 | 0.0100 | 1530 | 142697 | 0.0107 | 1427 | 142697 | 0.0100 |
16 | 1436 | 143137 | 0.0100 | 1431 | 143137 | 0.0099 | 1456 | 143137 | 0.0101 | 1432 | 143137 | 0.0100 |
17 | 1381 | 142517 | 0.0096 | 1425 | 142517 | 0.0099 | 1365 | 142517 | 0.0095 | 1426 | 142517 | 0.0100 |
total | 50089 | 5000000 | 0.0100 | 50000 | 5000000 | 0.0100 | 49707 | 5000000 | 0.0099 | 49999 | 5000000 | 0.0099 |
sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | sample size | dataset size | sample size ratio | |
CurrencyId | simple sampling | systematic sampling | stratified sampling | pps sampling |
With such large groups all the methods give very good results. Although no doubt we get the best results using the systematic sampling or PPS sampling methods where the sample size is always within the limits 0.0099 to 0.0100.
Download the transformation graph with data