
Commit 77a50ff: More spark updates (#9)
1 parent 0daea83

10 files changed: +388 additions, -369 deletions

docs/about/contributing.md (1 addition, 6 deletions)

```diff
@@ -16,13 +16,8 @@ This list is neither comprehensive nor in any particular order.
 
 | Feature | Components | Description | Status |
 |-------------------- | ----------- | ------------------------- | ------------- |
-| Incremental updates | BE, WS, UI | Push results back to users during the query lifetime. Micro-batching, windowing and other features need to be implemented | In Progress |
-| Bullet on Spark | BE | Implement Bullet on Spark Streaming. Compared with SQL on Spark Streaming which stores data in memory, Bullet will be light-weight | In Progress |
 | Security | WS, UI | The obvious enterprise security for locking down access to the data and the instance of Bullet. Considering SSL, Kerberos, LDAP etc. Ideally, without a database | Planning |
-| In-Memory PubSub | PubSub | For users who don't want a PubSub like Kafka, we could add REST based in-memory PubSub layer that runs in the WS. The backend will then communicate directly with the WS | Planning |
-| LocalForage | UI | Migration the UI to LocalForage to distance ourselves from the relatively small LocalStorage space | [#9](https://github.com/yahoo/bullet-ui/issues/9) |
 | Bullet on X | BE | With the pub/sub feature, Bullet can be implemented on other Stream Processors like Flink, Kafka Streaming, Samza etc | Open |
 | Bullet on Beam | BE | Bullet can be implemented on [Apache Beam](https://beam.apache.org) as an alternative to implementing it on various Stream Processors | Open |
-| SQL API | BE, WS | WS supports an endpoint that converts a SQL-like query into Bullet queries | Open |
+| SQL API | BE, WS | WS supports an endpoint that converts a SQL-like query into Bullet queries | In Progress |
 | Packaging | UI, BE, WS | Github releases and building from source are the only two options for the UI. Docker images or the like for quick setup and to mix and match various pluggable components would be really useful | Open |
-| Spring Boot Reactor | WS | Migrate the Web Service to use Spring Boot reactor instead of servlet containers | Open |
```

docs/backend/ingestion.md (13 additions, 5 deletions)

```diff
@@ -8,7 +8,13 @@ Bullet operates on a generic data container that it understands. In order to get
 
 ## Bullet Record
 
-The Bullet Record is a serializable data container based on [Avro](http://avro.apache.org). It is typed and has a generic schema. You can refer to the [Avro Schema](https://github.com/yahoo/bullet-record/blob/master/src/main/avro/BulletAvro.avsc) file for details if you wish to see the internals of the data model. The Bullet Record is also lazy and only deserializes itself when you try to read something from it. So, you can pass it around before sending to Bullet with minimal cost. Partial deserialization is being considered if performance is key. This will let you deserialize a much narrower chunk of the Record if you are just looking for a couple of fields.
+The Bullet backend processes data that must be stored in a [Bullet Record](https://github.com/bullet-db/bullet-record/blob/master/src/main/java/com/yahoo/bullet/record/BulletRecord.java), which is an abstract Java class that can
+be implemented so as to be optimized for different backends or use-cases.
+
+There are currently two concrete implementations of BulletRecord:
+
+1. [SimpleBulletRecord](https://github.com/bullet-db/bullet-record/blob/master/src/main/java/com/yahoo/bullet/record/SimpleBulletRecord.java), which is based on a simple Java HashMap
+2. [AvroBulletRecord](https://github.com/bullet-db/bullet-record/blob/master/src/main/java/com/yahoo/bullet/record/AvroBulletRecord.java), which uses [Avro](http://avro.apache.org) for serialization
 
 ## Types
 
@@ -17,9 +23,11 @@ Data placed into a Bullet Record is strongly typed. We support these types curre
 ### Primitives
 
 1. Boolean
-2. Long
-3. Double
-4. String
+2. Integer
+3. Long
+4. Float
+5. Double
+6. String
 
 ### Complex
 
@@ -31,7 +39,7 @@ With these types, it is unlikely you would have data that cannot be represented
 
 ## Installing the Record directly
 
-Generally, you depend on the Bullet Core artifact for your Stream Processor when you plug in the piece that gets your data into the Stream processor. The Bullet Core artifact already brings in the Bullet Record container as well. See the usage for the [Storm](storm-setup.md#installation) for an example.
+Generally, you depend on the Bullet Core artifact for your Stream Processor when you plug in the piece that gets your data into the Stream processor. The Bullet Core artifact already brings in the Bullet Record containers as well. See the usage for the [Storm](storm-setup.md#installation) for an example.
 
 However, if you need it, the artifacts are available through JCenter to depend on them in code directly. You will need to add the repository. Below is a Maven example:
 
```
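The ingestion change above describes BulletRecord as an abstract container with interchangeable concrete implementations: a simple map-backed one and a serializing Avro-based one. The pattern can be sketched as follows; this is a hypothetical Python analogue for illustration only, not the actual BulletRecord API, and `pickle` merely stands in for Avro serialization:

```python
from abc import ABC, abstractmethod
import pickle


class Record(ABC):
    """Toy analogue of the BulletRecord abstraction: a map-like
    container whose storage strategy is left to subclasses."""

    @abstractmethod
    def set(self, field, value): ...

    @abstractmethod
    def get(self, field): ...


class SimpleRecord(Record):
    """Analogue of SimpleBulletRecord: a plain in-memory map."""

    def __init__(self):
        self._data = {}

    def set(self, field, value):
        self._data[field] = value
        return self  # allow chaining, like fluent setters

    def get(self, field):
        return self._data.get(field)


class SerializingRecord(SimpleRecord):
    """Analogue of AvroBulletRecord: the same interface, but the
    payload can be round-tripped through bytes (pickle stands in
    for Avro here)."""

    def serialize(self):
        return pickle.dumps(self._data)

    @classmethod
    def deserialize(cls, payload):
        record = cls()
        record._data = pickle.loads(payload)
        return record
```

Because both classes share the abstract interface, code that fills records does not care which implementation the backend chose, which is the point of the abstraction.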

docs/index.md (24 additions, 12 deletions)

```diff
@@ -20,7 +20,7 @@
 
 * Big-data scale-tested - used in production at Yahoo and tested running 500+ queries simultaneously on up to 2,000,000 rps
 
-# How is this useful
+# How is Bullet useful
 
 How Bullet is used is largely determined by the data source it consumes. Depending on what kind of data you put Bullet on, the types of queries you run on it and your use-cases will change. As a look-forward query system with no persistence, you will not be able to repeat your queries on the same data. The next time you run your query, it will operate on the different data that arrives after that submission. If this usage pattern is what you need and you are looking for a light-weight system that can tap into your streaming data, then Bullet is for you!
 
@@ -40,15 +40,15 @@ This instance of Bullet also powers other use-cases such as letting analysts val
 
 See [Quick Start](quick-start/bullet-on-spark.md) to set up Bullet locally using spark-streaming. You will generate some synthetic streaming data that you can then query with Bullet.
 
-# Setting up Bullet on your streaming data
+# Set up Bullet on your streaming data
 
 To set up Bullet on a real data stream, you need:
 
-1. To set up the Bullet Backend on a stream processing framework. Currently, we support [Bullet on Storm](backend/storm-setup.md):
+1. To set up the Bullet Backend on a stream processing framework. Currently, we support [Bullet on Storm](backend/storm-setup.md) and [Bullet on Spark](backend/spark-setup.md).
     1. Plug in your source of data. See [Getting your data into Bullet](backend/ingestion.md) for details
     2. Consume your data stream
 2. The [Web Service](ws/setup.md) set up to convey queries and return results back from the backend
-3. To choose a [PubSub implementation](pubsub/architecture.md) that connects the Web Service and the Backend. We currently support [Kafka](pubsub/kafka.md) on any Backend and [Storm DRPC](pubsub/storm-drpc.md) for the Storm Backend.
+3. To choose a [PubSub implementation](pubsub/architecture.md) that connects the Web Service and the Backend. We currently support [Kafka](pubsub/kafka.md) and a [REST PubSub](pubsub/rest.md) on any Backend and [Storm DRPC](pubsub/storm-drpc.md) for the Storm Backend.
 4. The optional [UI](ui/setup.md) set up to talk to your Web Service. You can skip the UI if all your access is programmatic
 
 !!! note "Schema in the UI"
@@ -59,9 +59,9 @@ To set up Bullet on a real data stream, you need:
 
 # Querying in Bullet
 
-Bullet queries allow you to filter, project and aggregate data. It lets you fetch raw (the individual data records) as well as aggregated data.
+Bullet queries allow you to filter, project and aggregate data. You can also specify a window to get incremental results. Bullet lets you fetch raw (the individual data records) as well as aggregated data.
 
-* See the [UI Usage section](ui/usage.md) for using the UI to build Bullet queries. This is the same UI you will build in the [Quick Start](quick-start.md)
+* See the [UI Usage section](ui/usage.md) for using the UI to build Bullet queries. This is the same UI you will build in the [Quick Start](quick-start/bullet-on-spark.md)
 
 * See the [API section](ws/api.md) for building Bullet API queries
 
@@ -111,6 +111,16 @@ Currently we support ```GROUP``` aggregations with the following operations:
 | MAX | Returns the maximum of the non-null values in the provided field for all the elements in the group |
 | AVG | Computes the average of the non-null values in the provided field for all the elements in the group |
 
+## Windows
+
+Windows in a Bullet query allow you to specify how often you'd like Bullet to return results.
+
+For example, you could launch a query for 2 minutes, and have Bullet return a COUNT DISTINCT on a particular field every 3 seconds:
+
+![Time-Based Tumbling Windows](../img/time-based-tumbling.png)
+
+See documentation on [the Web Service API](ws/api.md) for more info.
+
 # Results
 
 The Bullet Web Service returns your query result as well as associated metadata information in a structured JSON format. The UI can display the results in different formats.
@@ -145,17 +155,19 @@ The Bullet Backend can be split into three main conceptual sub-systems:
 2. Data Processor - reads data from a input stream, converts it to an unified data format and matches it against queries
 3. Combiner - combines results for different queries, performs final aggregations and returns results
 
-The core of Bullet querying is not tied to the Backend and lives in a core library. This allows you implement the flow shown above in any stream processor you like. We are currently working on Bullet on [Spark Streaming](https://spark.apache.org/streaming).
+The core of Bullet querying is not tied to the Backend and lives in a core library. This allows you to implement the flow shown above in any stream processor you like.
 
-## PubSub
+Implementations of [Bullet on Storm](backend/storm-architecture.md) and [Bullet on Spark](backend/spark-architecture.md) are currently supported.
 
-The PubSub is responsible for transmitting queries from the API to the Backend and returning results back from the Backend to the clients. It decouples whatever particular Backend you are using with the API. We currently provide a PubSub implementation using Kafka as the transport layer. You can very easily [implement your own](pubsub/architecture.md#implementing-your-own-pubsub) by defining a few interfaces that we provide.
+## PubSub
 
-In the case of Bullet on Storm, there is an [additional simplified option](pubsub/storm-drpc.md) using [Storm DRPC](http://storm.apache.org/releases/1.0.0/Distributed-RPC.html) as the PubSub. This layer is planned to only support a request-response model for querying in the future.
+The PubSub is responsible for transmitting queries from the API to the Backend and returning results back from the Backend to the clients. It decouples whatever particular Backend you are using with the API.
+We currently support two different PubSub implementations:
 
-!!! note "DRPC PubSub"
+* [Kafka](pubsub/kafka.md)
+* [REST](pubsub/rest.md)
 
-    This was how Bullet was first implemented in Storm. Storm DRPC provided a really simple way to communicate with Storm that we took advantage of. We provide this as a legacy adapter or for users who use Storm but don't want a PubSub layer.
+You can also very easily [implement your own](pubsub/architecture.md#implementing-your-own-pubsub) by defining a few interfaces that we provide.
 
 ## Web Service and UI
 
```
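The new "Windows" section added to docs/index.md describes time-based tumbling windows: a long-running query emits an incremental result (such as a COUNT DISTINCT) once per fixed interval. The idea can be sketched with a toy batch version; this is an illustration of the windowing concept only, with a made-up event shape, not the Bullet API:

```python
from collections import defaultdict


def tumbling_count_distinct(events, field, window_size):
    """Group (timestamp, record) events into consecutive windows of
    `window_size` seconds and count distinct values of `field` per
    window, mimicking a time-based tumbling window."""
    windows = defaultdict(set)
    for timestamp, record in events:
        # each event belongs to exactly one window: floor(t / size)
        windows[int(timestamp // window_size)].add(record[field])
    # one incremental result per window, in time order
    return [(w * window_size, len(values)) for w, values in sorted(windows.items())]


events = [
    (0.5, {"user": "a"}), (1.2, {"user": "b"}), (2.9, {"user": "a"}),
    (3.1, {"user": "c"}), (4.0, {"user": "c"}), (7.5, {"user": "d"}),
]
print(tumbling_count_distinct(events, "user", 3))  # [(0, 2), (3, 1), (6, 1)]
```

In the real system the backend would emit each window's result as it closes rather than batching them at the end, but the per-window grouping is the same.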

docs/pubsub/architecture.md (4 additions, 3 deletions)

```diff
@@ -4,11 +4,11 @@ This section describes how the Publish-Subscribe or [PubSub layer](../index.md#p
 
 ## Why a PubSub?
 
-When we initially created Bullet, it was built on [Apache Storm](https://storm.apache.org) and leveraged a feature in it called [Storm DRPC](http://storm.apache.org/releases/1.0.3/Distributed-RPC.html) to deliver queries to and extract results from the Bullet Backend. Storm DRPC is supported by a set of clusters that are physically part of the Storm cluster and is a shared resource for the cluster. While many other stream processors support some form of RPC and we could support multiple versions of the Web Service for those, it quickly became clear that abstracting the transport layer from the Web Service to the Backend was needed. This was particularly highlighted when we wanted to switch Bullet queries from operating in a request-response model (one response at the end of the query) to a streaming model. Streaming responses back to the user for a query through DRPC would be cumbersome and require a lot of logic to handle. A PubSub system was a natural solution to this. Since DRPC was a shared resource per cluster, we also were [tying the Backend's scalability](../backend/storm-performance.md#test-4-improving-the-maximum-number-of-simultaneous-raw-queries) to a resource that we didn't control.
+When we initially created Bullet, it was built on [Apache Storm](https://storm.apache.org) and leveraged a feature in it called Storm DRPC to deliver queries to and extract results from the Bullet Backend. Storm DRPC is supported by a set of clusters that are physically part of the Storm cluster and is a shared resource for the cluster. While many other stream processors support some form of RPC and we could support multiple versions of the Web Service for those, it quickly became clear that abstracting the transport layer from the Web Service to the Backend was needed. This was particularly highlighted when we wanted to switch Bullet queries from operating in a request-response model (one response at the end of the query) to a streaming model. Streaming responses back to the user for a query through DRPC would be cumbersome and require a lot of logic to handle. A PubSub system was a natural solution to this. Since DRPC was a shared resource per cluster, we also were [tying the Backend's scalability](../backend/storm-performance.md#test-4-improving-the-maximum-number-of-simultaneous-raw-queries) to a resource that we didn't control.
 
 However, we didn't want to pick a particular PubSub like Kafka and restrict a user's choice. So, we added a PubSub layer that was generic and entirely pluggable into both the Backend and the Web Service. We would support a select few like [Kafka](https://github.com/yahoo/bullet-kafka) or [Storm DRPC](https://github.com/yahoo/bullet-storm). See [below](#implementing-your-own-pubsub) for how to create your own.
 
-With the transport mechanism abstracted out, it opens up a lot of possibilities like implementing Bullet on other stream processors ([Apache Spark](https://spark.apache.org) is in the works) and adding streaming, incremental results, sharding and much more.
+With the transport mechanism abstracted out, it opens up a lot of possibilities like implementing Bullet on other stream processors, allowing for the development of [Bullet on Spark](../backend/spark-architecture.md) along with other possible implementations in the future.
 
 ## What does it do?
 
@@ -28,7 +28,8 @@ The PubSub layer does not deal with queries and results and just works on instan
 If you want to use an implementation already built, we currently support:
 
 1. [Kafka](kafka.md#setup) for any Backend
-2. [Storm DRPC](storm-drpc.md#setup) if you're using Bullet on Storm as your Backend
+2. [REST](rest.md#setup) for any Backend
+3. [Storm DRPC](storm-drpc.md#setup) if you're using Bullet on Storm as your Backend
 
 ## Implementing your own PubSub
 
```
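The architecture doc says a custom PubSub is built "by defining a few interfaces". The shape of that contract can be sketched with a minimal in-memory PubSub; the names below are hypothetical illustrations, not the actual interfaces in bullet-core:

```python
import queue
from abc import ABC, abstractmethod


class Publisher(ABC):
    """Sends messages into a channel (e.g. the Web Service sending queries)."""

    @abstractmethod
    def send(self, message): ...


class Subscriber(ABC):
    """Receives messages from a channel (e.g. the Backend picking up queries)."""

    @abstractmethod
    def receive(self): ...


class InMemoryPubSub:
    """A toy PubSub: one queue per named channel, so the sender and the
    receiver never see each other directly - only the channel."""

    def __init__(self):
        self._channels = {}

    def _channel(self, name):
        return self._channels.setdefault(name, queue.Queue())

    def publisher(self, channel):
        chan = self._channel(channel)

        class _Pub(Publisher):
            def send(self, message):
                chan.put(message)

        return _Pub()

    def subscriber(self, channel):
        chan = self._channel(channel)

        class _Sub(Subscriber):
            def receive(self):
                return chan.get(timeout=1)

        return _Sub()
```

A real implementation would back `send`/`receive` with Kafka topics or REST endpoints, but the decoupling is the same: swapping transports changes neither the Web Service nor the Backend code.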

docs/pubsub/storm-drpc.md (3 additions, 0 deletions)

```diff
@@ -1,5 +1,8 @@
 # Storm DRPC PubSub
 
+!!! note "NOTE: This PubSub only works with old versions of the Storm Backend!"
+    Since DRPC is part of Storm and can only support a single query/response model, this PubSub implementation can only be used with the Storm backend and cannot support Windowed queries (bullet-storm 0.8.0 and later).
+
 Bullet on [Storm](https://storm.apache.org/) can use [Storm DRPC](http://storm.apache.org/releases/1.0.0/Distributed-RPC.html) as a PubSub layer. DRPC or Distributed Remote Procedure Call, is built into Storm and consists of a set of servers that are part of the Storm cluster.
 
 ## How does it work?
```

docs/quick-start/bullet-on-spark.md (1 addition, 3 deletions)

```diff
@@ -13,8 +13,6 @@ At the end of this section, you will have:
 * You will need to be on an Unix-based system (Mac OS X, Ubuntu ...) with ```curl``` installed
 * You will need [JDK 8](http://www.oracle.com/technetwork/java/javase/downloads/index.html) installed
 
-## To Install and Launch Bullet Locally:
-
 ### Setup Kafka
 
 For this instance of Bullet we will use the kafka PubSub implementation found in [bullet-spark](https://github.com/bullet-db/bullet-spark). So we will first download and run Kafka, and setup a couple Kafka topics.
@@ -180,7 +178,7 @@ Visit [http://localhost:8800](http://localhost:8800) to query your topology with
 If you access the UI from another machine than where your UI is actually running, you will need to edit ```config/env-settings.json```. Since the UI is a client-side app, the machine that your browser is running on will fetch the UI and attempt to use these settings to talk to the Web Service. Since they point to localhost by default, your browser will attempt to connect there and fail. An easy fix is to change ```localhost``` in your env-settings.json to point to the host name where you will hosting the UI. This will be the same as the UI host you use in the browser. You can also do a local port forward on the machine accessing the UI by running:
 ```ssh -N -L 8800:localhost:8800 -L 9999:localhost:9999 hostname-of-the-quickstart-components 2>&1```
 
-## Congratulations!! Bullet is all setup!
+### Congratulations!! Bullet is all setup!
 
 #### Playing around with the instance:
 
```
