docs/backend/setup-storm.md
1 addition & 1 deletion
@@ -19,7 +19,7 @@ To use Bullet, you need to implement a way to read from your data source and con
1. You can implement a Spout that reads from your data source and emits Bullet Records, as sketched at the end of this section. This spout must have a constructor that takes a List of Strings.
2. You can pipe your existing Storm topology directly into Bullet. In other words, you convert the data you wish to be query-able through Bullet into Bullet Records from a bolt in your topology.
- Option 2 *directly* couples your topology to Bullet and as such, you would need to watch out for things like back-pressure etc.
+ Option 1 is the simplest to start with and should accommodate most scenarios. See [Pros and Cons](storm-architecture.md#data-processing).
You need a JVM-based project that implements one of the two options above. You include the Bullet artifact and Storm dependencies in your pom.xml or other dependency management system. The artifacts are available through JCenter, so you will need to add the repository.
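As a rough illustration of Option 1, here is a minimal spout sketch. It assumes Storm 1.x (the `org.apache.storm` packages); the class name and the `BulletRecord` setter shown are illustrative assumptions, not Bullet's exact API, so see [Getting your data into Bullet](ingestion.md) for the real record-building calls.

```java
import java.util.List;
import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import com.yahoo.bullet.record.BulletRecord;

public class MyDataSpout extends BaseRichSpout {
    private final List<String> args;                  // arguments passed in when the topology is launched
    private transient SpoutOutputCollector collector;

    // The constructor that takes a List of Strings, as Bullet requires.
    public MyDataSpout(List<String> args) {
        this.args = args;
    }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        // Open your data source here (a Kafka consumer, a file, etc.) using this.args.
    }

    @Override
    public void nextTuple() {
        // Read one event from your source and convert it into a Bullet Record.
        BulletRecord record = new BulletRecord();
        record.setString("some_field", "some_value"); // illustrative setter; see the ingestion docs
        collector.emit(new Values(record));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("record"));
    }
}
```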
docs/backend/storm-architecture.md
12 additions & 8 deletions
@@ -30,30 +30,34 @@ The red colored lines are the path for the queries that come in through Storm DR
Bullet can accept arbitrary sources of data as long as they can be read from Storm. You can either:
- 1. Write a Storm spout that reads your data from where ever it is (Kafka, etc) and [converts it to Bullet Records](ingestion.md). See [Quick Start](../quick-start.md#storm-topology) for an example.
+ 1. Write a Storm spout that reads your data from wherever it is (Kafka, etc.) and [converts it to Bullet Records](ingestion.md). See [Quick Start](../quick-start.md#storm-topology) for an example.
2. Hook up an existing topology that is doing something else directly to Bullet. You will still write and hook up a component that converts your data into Bullet Records in your existing topology.
- Option 2 is nice if you do not want to introduce a persistence layer between your existing Streaming pipeline and Bullet. For example, if you just want periodically look at some data within your topology, you could filter them, convert them into Bullet Records and send it into Bullet. You could also sample data. The downside of Option 2 is that you will directly couple your topology with Bullet leaving your topology to be affected by Bullet through Storm features like back-pressure (if you are on Storm 1.0) etc. You could also go with Option 2 if you need something more complex than just a spout from Option 1. For example, you may want to process your data in some fashion before emitting to Bullet.
+ | Option 1 | Very simple to get started. Just implement a spout | Need a storage layer that your spout pulls from, or some system has to push data to your spouts |
+ | Option 2 | Saves a persistence layer | Ties your topology to Bullet directly, making it affected by Storm back-pressure etc. |
+ | Option 2 | You can add bolts to do more processing on your data before sending it to Bullet | Increases the complexity of the topology |
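For Option 2, the conversion component can be a plain bolt along the lines of the sketch below. This is a sketch only, assuming Storm 1.x; the field names and the `BulletRecord` setter are illustrative assumptions, and the mapping logic will depend on what your existing topology emits.

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import com.yahoo.bullet.record.BulletRecord;

public class ConvertToBulletRecordBolt extends BaseRichBolt {
    private transient OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        // Map whatever your upstream component emitted onto a Bullet Record.
        BulletRecord record = new BulletRecord();
        record.setString("user_id", tuple.getStringByField("user_id")); // illustrative field
        collector.emit(tuple, new Values(record));                      // anchored emit for at-least-once processing
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("record"));
    }
}
```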
- Your data is then emitted to the Filter bolt which promptly drops all Bullet Records and does absolutely nothing if you have no queries in your system. If there are queries in the Filter bolt, the record is checked against the [filters](../index.md#filters) in each query and if it matches, it is processed by the query. Each query can choose to emit matched records in micro-batches. For example, queries that collect raw records (a LIMIT operation) do not micro-batch at all. Every matched record (up to the maximum for the query) is emitted. Queries that aggregate, on the other hand, keep the query around till its duration is up and emit the local result.
+ Your data is then emitted to the Filter bolt. If you have no queries in your system, the Filter bolt will promptly drop all Bullet Records and do absolutely nothing. If there are queries in the Filter bolt, the record is checked against the [filters](../index.md#filters) in each query and, if it matches, it is processed by the query. Each query type can choose to emit matched records in micro-batches. By default, ```RAW``` or ```LIMIT``` queries do not micro-batch: every matched record, up to the maximum for the query, is emitted immediately from the Filter bolt. Queries that aggregate, on the other hand, are kept around till their duration is up and then emit their local result. This is because these queries *cannot* return till they have seen all the data in your time window anyway, since some late-arriving data may update an existing aggregate.
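Conceptually, the Filter bolt's per-record work looks like the sketch below. The ```RunningQuery``` type and its methods are hypothetical, made up purely for illustration; they are not Bullet's internal classes.

```java
import java.util.Collection;

import com.yahoo.bullet.record.BulletRecord;

// Hypothetical interface for illustration only.
interface RunningQuery {
    boolean matches(BulletRecord record);  // evaluate the query's filters against the record
    void consume(BulletRecord record);     // update the query's local (intermediate) result
}

final class FilterLoopSketch {
    static void onRecord(BulletRecord record, Collection<RunningQuery> queries) {
        // With no queries registered, the loop body never runs and the record is simply dropped.
        for (RunningQuery query : queries) {
            if (query.matches(record)) {
                query.consume(record);     // RAW queries can emit matched records right away;
            }                              // aggregating queries hold their result till the query ends
        }
    }
}
```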
- !!! note "To micro-batch or not to micro-batch?"
+ !!! note "Why support micro-batching?"
- ```RAW``` queries micro-batch by size 1, which makes Bullet really snappy when running those queries. As soon as your maximum record limit is reached, the query immediately returns. On the other hand, the other queries do not micro-batch at all. ```GROUP``` and other aggregate queries *cannot* return till they see all the data in your time window because some late arriving data may update an existing aggregate. So, these other queries have to wait for the entire query duration anyway. Once the queries have timed out, we have to rely on the ticks to get all the intermediate results over to the combiner to merge. Micro-batches are still useful here because we can still emit intermediate aggregations (and they are [additive](#combining)) and relieve memory pressure by periodically purging intermediate results. In practice though, Bullet queries are generally short-lived, so this isn't as needed as it may seem on first glance. Depends on whether others (you) find it necessary, we may decide to implement micro-batching for other queries besides ```RAW``` types.
+ ```RAW``` queries do not micro-batch by default, which makes Bullet really snappy when running those queries. As soon as your maximum record limit is reached, the query immediately returns. You can use a setting in [bullet_defaults.yaml](https://github.com/yahoo/bullet-storm/blob/master/src/main/resources/bullet_defaults.yaml) to turn on batching if you like. At some point in the future, micro-batching will let Bullet provide incremental results, where partial results arrive over the duration of the query. Bullet can emit intermediate aggregations because they are all [additive](#combining).
### Request processing
Storm DRPC handles receiving REST requests for the whole topology. The DRPC spouts fetch these requests (DRPC knows a request is for the Bullet topology by the unique function name set when launching the topology) and shuffle them to the Prepare Request bolts. The request also contains information about how to return the response back to the DRPC servers. The Prepare Request bolts generate a unique identifier for each request (a Bullet query) and broadcast the query to every Filter bolt. Since every Filter bolt has a copy of every query, the shuffled data from the data source can be compared against a query no matter which particular Filter bolt it ends up at. Each Filter bolt has access to the unique query id and key groups its intermediate results for the query by that id to the Join bolts.
- The Prepare Request bolt also key groups the query and the return information to the Join bolts. This means that only *one* Join bolt ever gets one query.
+ The Prepare Request bolt also key groups the query and the return information to the Join bolts. This means that the query will be assigned to one and only one Join bolt.
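In Storm terms, the groupings described above look roughly like the wiring below. The component names, stream wiring, and helper class are assumptions for illustration; Bullet's actual topology code differs.

```java
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WiringSketch {
    public static TopologyBuilder wire(IRichSpout drpcSpout, IRichSpout dataSpout,
                                       IRichBolt prepareRequestBolt, IRichBolt filterBolt,
                                       IRichBolt joinBolt) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("drpc", drpcSpout);
        builder.setSpout("data", dataSpout);
        // Requests are shuffled from the DRPC spouts to the Prepare Request bolts.
        builder.setBolt("prepare", prepareRequestBolt).shuffleGrouping("drpc");
        // Broadcast: every Filter bolt instance gets a copy of every query,
        // while the data itself is shuffled across the Filter bolts.
        builder.setBolt("filter", filterBolt)
               .allGrouping("prepare")
               .shuffleGrouping("data");
        // Key group by query id: one Join bolt sees a query and all its intermediate results.
        builder.setBolt("join", joinBolt)
               .fieldsGrouping("prepare", new Fields("id"))
               .fieldsGrouping("filter", new Fields("id"));
        return builder;
    }
}
```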
### Combining
- Since the data from the Prepare Request bolt (a query and a piece of return information for the query) and the data from all Filter bolts (intermediate results) is key grouped by the unique query id, only one particular Join bolt receives both the query and all the intermediate results for a particular query. The Join bolt can then combine all the intermediate results and produce a final result. This final result is joined (hence the name) with the return information for the query and is shuffled to the Return Results bolt. This bolt then uses the return information to send the results back to a DRPC server, who then returns it back to the requester.
+ Since the data from the Prepare Request bolt (a query and a piece of return information for the query) and the data from all Filter bolts (intermediate results) is key grouped by the unique query id, only one particular Join bolt receives both the query and all the intermediate results for a particular query. The Join bolt can then combine all the intermediate results and produce a final result. This final result is joined (hence the name) with the return information for the query and is shuffled to the Return Results bolt. This bolt then uses the return information to send the results back to a DRPC server, which then returns it back to the requester.
!!! note "Combining and operations"
- In order to be able to combine intermediate results and process data in any order, all aggregations that Bullet does need to be associative and have an identity. In other words, they need to be [Monoids](https://en.wikipedia.org/wiki/Monoid). Luckily for us, the [DataSketches](http://datasketches.github.io) that we use are monoids (actually are commutative monoids). Sketches be unioned and thus all the aggregations we support - SUM, COUNT, MIN, MAX, AVG, COUNT DISTINCTS, DISTINCT - are monoidal. (AVG is monoidal if you store a SUM and a COUNT instead).
+ In order to be able to combine intermediate results and process data in any order, all aggregations that Bullet does need to be associative and have an identity. In other words, they need to be [Monoids](https://en.wikipedia.org/wiki/Monoid). Luckily for us, the [DataSketches](http://datasketches.github.io) that we use are monoids (actually, commutative monoids). Sketches can be unioned and thus all the aggregations we support - ```SUM```, ```COUNT```, ```MIN```, ```MAX```, ```AVG```, ```COUNT DISTINCT```, ```DISTINCT``` - are monoidal. (```AVG``` is monoidal if you store a ```SUM``` and a ```COUNT``` instead).
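As a tiny illustration of the ```AVG``` point (a sketch, not Bullet code): storing a running sum and count gives you an associative combine with an identity of (0, 0), which is exactly what lets partial results from different Filter bolts be merged in any order.

```java
// Illustrative only: a (sum, count) pair forms a commutative monoid, so partial
// averages can be combined associatively and in any order.
public final class AvgState {
    public static final AvgState IDENTITY = new AvgState(0.0, 0L);

    private final double sum;
    private final long count;

    public AvgState(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Associative and commutative: combine partial results from any Filter bolt in any order.
    public AvgState combine(AvgState other) {
        return new AvgState(this.sum + other.sum, this.count + other.count);
    }

    // The final average is only computed once all partials have been combined.
    public double average() {
        return count == 0 ? 0.0 : sum / count;
    }
}
```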
docs/index.md
8 additions & 8 deletions
@@ -12,7 +12,7 @@
* Provides a **UI and Web Service** that are also pluggable for a full end-to-end solution to your querying needs
- * Can be implemented on different Stream processing frameworks. Bullet on [Storm](http://storm.apache.org) is currently available
+ * Has an implementation on [Storm](http://storm.apache.org) currently. There are plans to implement it on other Stream Processors.
* Is **pluggable**. Any data source that can be read from Storm can be converted into a standard data container letting you query that data. Data is **typed**
@@ -32,15 +32,15 @@ This instance of Bullet also powers other use-cases such as letting analysts val
# Quick Start
- See [Quick Start](quick-start.md) to set up Bullet on a local Storm topology. We will generate some fake streaming data that you can then query with Bullet.
+ See [Quick Start](quick-start.md) to set up Bullet on a local Storm topology. We will generate some synthetic streaming data that you can then query with Bullet.
# Setting up Bullet on your streaming data
To set up Bullet on a real data stream, you need:
- 1. The backend set up on a Stream processor:
+ 1. To set up the Bullet backend on a stream processing framework. Currently, we support [Bullet on Storm](backend/setup-storm.md):
1. Plug in your source of data. See [Getting your data into Bullet](backend/ingestion.md) for details
- 2. Consume your data stream. Currently, we support [Bullet on Storm](backend/setup-storm.md)
+ 2. Consume your data stream
2. The [Web Service](ws/setup.md) set up to convey queries and return results back from the backend
3. The optional [UI](ui/setup.md) set up to talk to your Web Service. You can skip the UI if all your access is programmatic
@@ -54,11 +54,11 @@ To set up Bullet on a real data stream, you need:
Bullet queries allow you to filter, project and aggregate data. They let you fetch raw data (the individual data records) as well as aggregated data.
- See the [UI Usage section](ui/usage.md) for using the UI to build Bullet queries. This is the same UI you will build in the [Quick Start](quick-start.md)
+ * See the [UI Usage section](ui/usage.md) for using the UI to build Bullet queries. This is the same UI you will build in the [Quick Start](quick-start.md)
- See the [API section](ws/api.md) for building Bullet API queries.
+ * See the [API section](ws/api.md) for building Bullet API queries
- For examples using the API, see [Examples](ws/examples.md). These are actual albeit cleansed queries sourced from the instance at Yahoo.
+ * For examples using the API, see [Examples](ws/examples.md). These are actual, albeit cleansed, queries sourced from the instance at Yahoo.
## Termination conditions
@@ -134,7 +134,7 @@ Using Sketches, we have implemented ```COUNT DISTINCT``` and ```GROUP``` and are
The Bullet backend can be split into three main sub-systems:
1. Request Processor - receives queries, adds metadata and sends it to the rest of the system
- 2. Data Processor - converts the data from an stream and matches it against queries
+ 2. Data Processor - reads data from an input stream, converts it to a unified data format and matches it against queries
3. Combiner - combines results for different queries, performs final aggregations and returns results