| Date | Release | Release (Storm 0.10) | Highlights |
| ---- | ------- | -------------------- | ---------- |
| 2017-03-13 |[**0.3.1**](https://github.com/yahoo/bullet-storm/releases/tag/bullet-storm-0.3.1)|[**0.3.1**](https://github.com/yahoo/bullet-storm/releases/tag/bullet-storm-0.10-0.3.1)| Extra records accepted after query expiry bug fix |
| 2017-02-15 |[**0.2.1**](https://github.com/yahoo/bullet-storm/releases/tag/bullet-storm-0.2.1)|[**0.2.1**](https://github.com/yahoo/bullet-storm/releases/tag/bullet-storm-0.10-0.2.1)| Acking support, Max size and other bug fixes |
docs/quick-start.md (3 additions, 3 deletions)
@@ -4,7 +4,7 @@ This section gets you running a mock instance of Bullet to play around with. The
By following the steps in this section, you will:
- * Set up the Bullet topology using a custom spout on [bullet-storm-0.3.0](https://github.com/yahoo/bullet-storm/releases/tag/bullet-storm-0.3.0)
+ * Set up the Bullet topology using a custom spout on [bullet-storm-0.3.1](https://github.com/yahoo/bullet-storm/releases/tag/bullet-storm-0.3.1)
* Set up the [Web Service](ws/setup.md) talking to the topology and serving a schema for your UI using [bullet-service-0.0.1](https://github.com/yahoo/bullet-service/releases/tag/bullet-service-0.0.1)
* Set up the [UI](ui/setup.md) talking to the Web Service using [bullet-ui-0.1.0](https://github.com/yahoo/bullet-ui/releases/tag/v0.1.0)
- Now that Storm is up and running, we can put Bullet on it. We will use an example Spout that runs on Bullet 0.3.0 on our Storm cluster. The source is available [here](https://github.com/yahoo/bullet-docs/blob/master/examples/storm). This was part of the artifact that you installed in Step 1.
+ Now that Storm is up and running, we can put Bullet on it. We will use an example Spout that runs on Bullet 0.3.1 on our Storm cluster. The source is available [here](https://github.com/yahoo/bullet-docs/blob/master/examples/storm). This was part of the artifact that you installed in Step 1.
!!! note "Shouldn't the count be slightly less in the last example?"
97
+
!!! note "Shouldn't the count be slightly more in the last example?"
- The result of ```4040``` in the example is because of [the tick-based design](../backend/storm-architecture.md#topology). Everything in the Storm topology with respect to queries happens with ticks and since our tick granularity is set to 1s, that is our lowest visibility. Depending on when that last tick happened, the result could be off by as much as 1s worth of data. For this example, we should have had ```20000 ms / 101 ms``` or ```198``` periods or ```198 periods * 20 tuples/period``` or ```3960``` tuples. But since we can be off by 1s and we can produce ```20 * 1000/101``` or ```80``` tuples in that time, the result is ```4040```. You can always account for this by running your query with a duration that is 1 s shorter than what you desired.
+ **Short answer:** Yes, and it's because of the synthetic nature of the data generation.
+
+ **Long answer:** We should have had ```20000 ms / 101 ms``` or ```198``` periods or ```198 periods * 20 tuples/period``` or ```3960``` tuples with unique values for the ```uuid``` field. The example spout generates data in bursts of 20 at the start of every period (101 ms). However, the delay isn't exactly 101 ms between periods; it's a bit more, depending on when Storm decided to run the emission code. As a result, every period slowly adds a delay of a few ms. Eventually, this can lead to missing an entire period, and the effect grows the longer the query runs. Even a delay of 1 ms every period (a very likely scenario) can add up to 101 ms, or 1 period, in as little as 101 periods, or ```101 periods * 101 ms/period``` or ```~10 s```. A good rule of thumb is that for every 10 s your query runs, you are missing 20 tuples. You might also miss another 20 tuples at the beginning or the end of the window since the spout is bursty.
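To make the arithmetic concrete, here is a quick back-of-the-envelope check in plain Python. It uses only the numbers quoted above; the 1 ms per-period drift is the assumption stated in this note, not a measured value.

```python
# Sanity check of the numbers in the note above. The 1 ms/period drift is
# an assumption; the real delay depends on when Storm runs the spout code.
PERIOD_MS = 101          # the spout emits a burst every ~101 ms
TUPLES_PER_PERIOD = 20
QUERY_MS = 20_000        # a 20 s query

periods = QUERY_MS // PERIOD_MS           # 198 periods
expected = periods * TUPLES_PER_PERIOD    # 3960 tuples with unique uuids

# At ~1 ms of extra delay per period, a full period is lost roughly every
# 101 periods, i.e. every 101 * 101 ms ~= 10 s of query time.
lost_per_10s = TUPLES_PER_PERIOD          # ~20 tuples per ~10 s

print(periods, expected, lost_per_10s)    # 198 3960 20
```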
+
+ In most real streaming scenarios, data should be constantly flowing and there shouldn't be delays building up like this. Even so, for a distributed, streaming system like Bullet, you should always remember that data can be missed at either end of your query window due to inherent skews and timing issues.
!!! note "Why did the Maximum Records input disappear?"
Maximum Records as a query stopping criterion only makes sense when you are picking out raw records. While the API still supports using it as a limiting mechanism on the number of records that are returned to you, the UI eschews this and sets it to a value that you can [configure](setup.md#configuration). It is also particularly confusing to see a Maximum Records input when you are doing a Count Distinct operation, though it makes sense when you are Grouping data. You should ideally set this to the same value as the maximum aggregation size that you configure when launching your backend.
### Approximate
- When the result is approximate, it is shown as a decimal value. The Result Metadata section will reflect that the result was estimated and provide you with standard deviations for the true value. The errors are derived from [DataSketches here](https://datasketches.github.io/docs/Theta/ThetaErrorTable.html). Note the line for ```16384```, which was what we configured as the maximum unique values for the Count Distinct operation. That means if we want 99.73% confidence in the result, the ```3``` standard deviation entry says that the true count could vary from ```38603``` to ```40017```. The backend should have produced ```20 * 200000/101``` or ```39603``` tuples with unique uuids. The result from Bullet was ```39304```, which is pretty close.
+ When the result is approximate, it is shown as a decimal value. The Result Metadata section will reflect that the result was estimated and provide you with standard deviations for the true value. The errors are derived from [DataSketches here](https://datasketches.github.io/docs/Theta/ThetaErrorTable.html). Note the line for ```16384```, which was what we configured as the maximum unique values for the Count Distinct operation. In the example below, this means that if we want 99.73% confidence in the result, the ```3``` standard deviation entry says that the true count could vary from ```38194``` to ```39590```.
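As a quick sanity check on those bounds (assuming they are symmetric multiples of a single relative standard error around the returned estimate of ```38886```, quoted in the note below):

```python
estimate = 38886                # the value Bullet returned (see the note below)
lower, upper = 38194, 39590     # the 3-standard-deviation bounds above

# Implied relative error of one standard deviation, assuming symmetric bounds.
sigma_rel = (upper - estimate) / estimate / 3
print(f"{sigma_rel:.2%}")       # ~0.60% per standard deviation
```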
!!! note "So why is the approximate count what it is?"
- The 1s tick granularity still only affects the data by 80, so it can be largely ignored here.
+ The backend should have produced ```20 * 200000/101``` or ```39603``` tuples with unique uuids. Due to the synthetic nature of the data generation and the building delays mentioned above, we estimated that we lose about 20 tuples for every 10 s the query runs. Since this query ran for ```200 s```, the actual number of uuids generated is at best ```39603 - (200/10) * 20``` or ```39203```. The result from Bullet was ```38886```, which is an error of ```~0.8%```. The real error is probably about a *third* of that because we assumed the delay between periods to be 1 ms. It is more on the order of 2 or 3 ms, which makes the number of uuids actually generated even lower.
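The same arithmetic, spelled out (using only the numbers from this note; the 20-tuples-per-10 s loss rate is the assumption made above):

```python
PERIOD_MS = 101
TUPLES_PER_PERIOD = 20
QUERY_MS = 200_000                                   # this query ran for 200 s

ideal = QUERY_MS * TUPLES_PER_PERIOD // PERIOD_MS    # 39603 unique uuids
lost = (QUERY_MS // 10_000) * 20                     # ~20 tuples lost per 10 s (assumed)
adjusted = ideal - lost                              # 39203
result = 38886                                       # what Bullet returned

print(adjusted, f"{(adjusted - result) / adjusted:.1%}")  # 39203 0.8%
```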
## Group all
@@ -126,7 +130,7 @@ When choosing the Grouped Data option, you can choose to add fields to group by.
The metrics you apply on fields are all numeric presently. If you apply a metric on a non-numeric field, Bullet will try to **type-cast** your field into a number, and if that's not possible, the result will be ```null```. The result will also be ```null``` if the field was not present or no data matched your filters.
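This is not Bullet's actual code; the sketch below just illustrates the described semantics:

```python
def metric_input(value):
    """Sketch of the semantics described above (not Bullet's implementation):
    cast the field to a number if possible, otherwise the metric yields null."""
    if value is None:              # field missing, or no data matched the filters
        return None
    try:
        return float(value)        # "3.14" -> 3.14
    except (TypeError, ValueError):
        return None                # "abc", maps, lists -> null

print(metric_input("3.14"), metric_input("abc"), metric_input(None))
# 3.14 None None
```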
@@ -144,23 +148,23 @@ You can also choose Group fields and perform metrics per group. If you do not ad
**Example: Grouping by tuple_number**
- In this example, we group by ```tuple_number```. Recall that this is the number assigned to a tuple within a period. They range from 0 to 19. If we group by this, we expect to have 20 unique groups. In 20s, we have ```20000/101``` or ```198``` periods. Each period has one of each ```tuple_number```. With the 1s tick granularity, we expect ```199``` as the count for each group, which is what is seen in the results. Note that the average is also roughly ```0.50``` since the ```probability``` field is a uniformly distributed value between 0 and 1.
+ In this example, we group by ```tuple_number```. Recall that this is the number assigned to a tuple within a period. They range from 0 to 19. If we group by this, we expect to have 20 unique groups. In 5s, we have ```5000/101``` or ```49``` periods. Each period has one of each ```tuple_number```. We expect ```49``` as the count for each group, and this is what we see. The building delays mentioned [in the note above](#exact) have not really started affecting the data yet. Note that the average is also roughly ```0.50``` since the ```probability``` field is a uniformly distributed value between 0 and 1.
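A minimal sketch of why each group's count is ```49``` and its average is near ```0.5``` (plain Python; the uniform ```probability``` field is simulated with ```random.random()```):

```python
import random

PERIOD_MS = 101
QUERY_MS = 5_000

periods = QUERY_MS // PERIOD_MS     # 49 periods in 5 s
# Each period emits exactly one tuple per tuple_number (0..19), so each of
# the 20 groups should have a count of ~49.
print(periods)                      # 49

# probability is uniform on [0, 1), so each group's average tends to ~0.5.
print(sum(random.random() for _ in range(periods)) / periods)
```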
- Try it out! If the number of unique group values exceeds the [maximum configured](../quick-start.md#setting-up-the-example-bullet-topology) (we used 1024 for this example), you will receive a *uniform sample* across your unique group values. The results for your metrics, however, are **not sampled**. It is the groups that are sampled on. This means there is **no** guarantee of order if you were expecting the *most popular* groups or something. We are working on adding a ```TOP K``` query that can support these kinds of use-cases.
+ Try it out! Nothing bad should happen. If the number of unique group values exceeds the [maximum configured](../quick-start.md#setting-up-the-example-bullet-topology) (we used 1024 for this example), you will receive a *uniform sample* across your unique group values. The results for your metrics, however, are **not sampled**. It is the groups that are sampled on. This means there is **no** guarantee of order if you were expecting the *most popular* groups or similar. We are working on adding a ```TOP K``` query that can support these kinds of use-cases.
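Not Bullet's implementation, just a sketch of the sampling semantics described above (the ```1024``` maximum is the value we configured for this example):

```python
import random

MAX_GROUPS = 1024   # the maximum unique groups we configured for this example

def sample_groups(groups):
    """Sketch of the described semantics (not Bullet's code). When there are
    too many unique groups, a uniform sample of the *group keys* is kept;
    each surviving group's metrics remain exact, i.e. they are not sampled."""
    if len(groups) <= MAX_GROUPS:
        return groups
    kept = random.sample(sorted(groups), MAX_GROUPS)
    return {key: groups[key] for key in kept}
```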
!!! note "Why no Count Distinct after Grouping"
- At this time, we do not support counting distinct values per field because, with the current implementation of Grouping, it would involve storing Data Sketches within Data Sketches. We are considering this for a future release, however.
+ At this time, we do not support counting distinct values per field because, with the current implementation of Grouping, it would involve storing DataSketches within DataSketches. We are considering this for a future release, however.
!!! note "Aha, sorting by tuple_number didn't sort properly!"
- Good job, eagle eyes! Unfortunately, whenever we group on fields, those fields become strings under the current implementation. Rather than convert them back at the end, we have currently decided to leave it as is. This means that in your results, if you try and sort by a grouped field, it will perform a lexicographical sort.
+ Good job, eagle eyes! Unfortunately, whenever we group on fields, those fields become strings under the current implementation. Rather than convert them back at the end, we have currently decided to leave it as is. This means that in your results, if you try and sort by a grouped field, it will perform a lexicographical sort even if it was originally a number.
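A one-liner showing what lexicographical order does to numbers-as-strings:

```python
# tuple_number values come back as strings, so sorting is lexicographic:
print(sorted(["2", "10", "0", "19", "3", "1", "11"]))
# ['0', '1', '10', '11', '19', '2', '3'] -- not numeric order
```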
- This also means that you can actually group by any field - including non primitives such as maps and lists! The field will be converted to a string and that string will be used as the field's representation for uniqueness and grouping purposes.
+ However, this also means that you can actually group by any field - including non primitives such as maps and lists! The field will be converted to a string and that string will be used as the field's representation for uniqueness and grouping purposes.
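For instance (the ```demographics``` field below is hypothetical, just to illustrate the string-based grouping):

```python
# Illustrative only: the group key is the string form of any field value,
# even a map. Records group together iff the stringified values match.
record = {"demographics": {"country": "US", "device": "phone"}}
group_key = str(record["demographics"])
print(group_key)    # {'country': 'US', 'device': 'phone'}
```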