sparkler 0.1
Quick Start Guide
Apache Solr (Tested on 6.0.1)
# A place to keep all the files organized
mkdir ~/work/sparkler/ -p
cd ~/work/sparkler/
# Download Solr Binary
wget "http://apache.mirrors.hoobly.com/lucene/solr/6.0.1/solr-6.0.1.tgz" # pick your version and mirror
# Extract Solr
tar xvzf solr-6.0.1.tgz
# Add crawldb config sets
cd solr-6.0.1/
cp -rv ${SPARKLER_GIT_SOURCE_PATH}/conf/solr/crawldb server/solr/configsets/
There are many ways to do this. Here is a relatively easy way to start Solr with the crawldb core:
# from the solr extracted directory
cp -r server/solr/configsets/crawldb server/solr/
./bin/solr start
Wait a moment for Solr to start, then open http://localhost:8983/solr/#/~cores/ in your browser. Follow Add Core, fill in 'crawldb' for both the name and instanceDir form fields, and click Add Core.
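Alternatively, if you prefer to skip the admin UI, the core can be created from the command line with Solr's bin/solr script. This is a sketch, assuming Solr is already running from the previous step and the 'crawldb' configset sits under server/solr/configsets/:
# create a core named 'crawldb' from the 'crawldb' configset (alternative to the Add Core UI)
./bin/solr create_core -c crawldb -d crawldb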
Now the crawldb core is ready; proceed to the inject URLs phase.
// Coming soon
Create a file called seed.txt and enter your seed URLs. Example:
http://nutch.apache.org/
http://tika.apache.org/
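One quick way to create this file from the shell, using the example seeds above (replace them with the URLs you want to crawl):
# write the seed URLs, one per line, into seed.txt
cat > seed.txt << 'EOF'
http://nutch.apache.org/
http://tika.apache.org/
EOF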
If you have not already done so, build the `sparkler-app` jar by following the build instructions (TODO: link here).
To inject the URLs, run the following command:
$ java -jar sparkler-app-0.1.jar inject -sf seed.txt
2016-06-07 19:22:49 INFO Injector$:70 [main] - Injecting 2 seeds
>>jobId = sparkler-job-1465352569649
This step injected 2 URLs and returned a jobId, `sparkler-job-1465352569649`. To inject more seeds into the crawldb at a later point, we can update the same job using this job id. Usage:
$ java -jar sparkler-app-0.1.jar inject
-id (--job-id) VAL : Id of an existing Job to which the urls are to be
injected. No argument will create a new job
-sf (--seed-file) FILE : path to seed file
-su (--seed-url) STRING[] : Seed Url(s)
For example:
java -jar sparkler-app-0.1.jar inject -id sparkler-job-1465352569649 \
-su http://www.bbc.com/news -su http://espn.go.com/
To see these URLs in the crawldb, open: http://localhost:8983/solr/crawldb/query?q=*:*&facet=true&facet.field=status&facet.field=depth&facet.field=group
// NOTE: the Solr URL can be updated in the `sparkler-[default|site].properties` file
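The same query can also be run from the shell with curl (assuming the default Solr URL above); the facets report how many URLs sit at each status, depth, and group:
# query the crawldb core for all documents, faceted by status, depth, and group
curl "http://localhost:8983/solr/crawldb/query?q=*:*&facet=true&facet.field=status&facet.field=depth&facet.field=group"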
To run a crawl:
$ java -jar sparkler-app-0.1.jar crawl
-i (--iterations) N : Number of iterations to run
-id (--id) VAL : Job id. When not sure, get the job id from injector
command
-m (--master) VAL : Spark Master URI. Ignore this if job is started by
spark-submit
-o (--out) VAL : Output path, default is job id
-tg (--top-groups) N : Max Groups to be selected for fetch..
-tn (--top-n) N : Top urls per domain to be selected for a round
Example:
java -jar sparkler-app-0.1.jar crawl -id sparkler-job-1465352569649 -m local[*] -i 1
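After the crawl round completes, one way to check progress is to re-run the faceted crawldb query; a minimal sketch, assuming the default Solr URL:
# rows=0 returns only the facet counts, i.e. how many URLs are in each status after the round
curl "http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&facet.field=status"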