This guide assumes that you already have the Sahara service and the Horizon dashboard up and running, and that Sahara is registered in Keystone. If you need help with any of that, please see the installation guide.
- To do this, start by choosing a Node Group Template from the dropdown and clicking the “+” button
- You can adjust the number of nodes to be spawned for this node group via the text box or the “-” and “+” buttons
- Repeat these steps if you need nodes from additional node group templates
- Your cluster’s status will be displayed in the Clusters table
- It will likely take several minutes to reach the “Active” state (see the status-polling sketch after this list if you would rather watch from a script)
- To add a new node group, select your desired Node Group Template from the dropdown and click the “+” button
- Your new Node Group will appear below and you can adjust the number of instances via the text box or the “+” and “-” buttons
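If you prefer to watch the cluster from a script rather than refreshing the Clusters table, the status is also exposed through Sahara’s REST API. The sketch below is a minimal example, assuming the default Sahara endpoint (port 8386, API v1.1), a valid Keystone token, and a known cluster id; the exact response layout may vary between releases, so verify against yours.

```python
import time

import requests

# Assumed endpoint and credentials -- adjust for your deployment.
SAHARA_URL = "http://SAHARA_HOST:8386/v1.1/PROJECT_ID"
HEADERS = {"X-Auth-Token": "KEYSTONE_TOKEN"}


def wait_for_active(cluster_id, poll_seconds=30):
    """Poll the cluster until it leaves its transient states."""
    while True:
        resp = requests.get("%s/clusters/%s" % (SAHARA_URL, cluster_id),
                            headers=HEADERS)
        resp.raise_for_status()
        status = resp.json()["cluster"]["status"]
        print("cluster status: %s" % status)
        if status in ("Active", "Error"):
            return status
        time.sleep(poll_seconds)
```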
Data Sources are where the input and output data for your jobs are stored.
- For a Swift object, enter <container>/<path> (e.g. mycontainer/inputfile). Sahara will prepend swift:// for you
- For an HDFS object, enter an absolute path, a relative path, or a full URL:
- /my/absolute/path indicates an absolute path in the cluster HDFS
- my/path indicates the path /user/hadoop/my/path in the cluster HDFS assuming the defined HDFS user is hadoop
- hdfs://host:port/path can be used to indicate any HDFS location
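The same fields the Create Data Source form collects can be posted directly to the Sahara API. A minimal sketch, assuming the v1.1 endpoint described above; the field names follow the Sahara EDP API, but treat them as an assumption and check them against your release. Note that, unlike the form, the swift:// prefix must be written out here.

```python
import requests

SAHARA_URL = "http://SAHARA_HOST:8386/v1.1/PROJECT_ID"  # assumed endpoint
HEADERS = {"X-Auth-Token": "KEYSTONE_TOKEN"}

payload = {
    "name": "my-input-ds",
    "type": "swift",
    # Equivalent of typing mycontainer/inputfile in the form.
    "url": "swift://mycontainer/inputfile",
    "credentials": {"user": "SWIFT_USER", "password": "SWIFT_PASSWORD"},
    "description": "input for my job",
}
resp = requests.post("%s/data-sources" % SAHARA_URL,
                     json=payload, headers=HEADERS)
resp.raise_for_status()
print(resp.json()["data_source"]["id"])
```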
Job Binaries are where you define/upload the source code (mains and libraries) for your job.
- For “Swift”, enter the URL of your binary (<container>/<path>) as well as the username and password (also see Additional Notes)
- For “Internal database”, you can choose between “Create a script” and “Upload a new file”
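Behind the “Internal database” option, Sahara stores the file itself and then records a job binary pointing at it via an internal-db:// URL. A rough equivalent over the REST API (the endpoint layout and response keys are assumptions, as above; the file name is hypothetical):

```python
import requests

SAHARA_URL = "http://SAHARA_HOST:8386/v1.1/PROJECT_ID"  # assumed endpoint
HEADERS = {"X-Auth-Token": "KEYSTONE_TOKEN"}

# Step 1: upload the raw script into Sahara's internal database.
with open("wordcount.pig", "rb") as f:  # hypothetical script
    resp = requests.put("%s/job-binary-internals/wordcount.pig" % SAHARA_URL,
                        data=f.read(), headers=HEADERS)
resp.raise_for_status()
internal_id = resp.json()["job_binary_internal"]["id"]

# Step 2: register a job binary that references the stored blob.
binary = {
    "name": "wordcount.pig",
    "url": "internal-db://%s" % internal_id,
    "description": "",
}
resp = requests.post("%s/job-binaries" % SAHARA_URL,
                     json=binary, headers=HEADERS)
print(resp.json()["job_binary"]["id"])
```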
Jobs are where you define the type of job you’d like to run, as well as which “Job Binaries” it requires.
Job Executions are what you get by “Launching” a job. You can monitor the status of your job to see when it has completed its run.
- Additional configuration properties can be defined by clicking on the “Add” button
- An example configuration entry might be mapred.mapper.class for the Name and org.apache.oozie.example.SampleMapper for the Value (the launch sketch after this list shows where such Name/Value pairs land in the API call)
- Relaunch on New Cluster will take you through the forms to start a new cluster before letting you specify input/output Data Sources and job configuration
- Relaunch on Existing Cluster will prompt you for input/output Data Sources as well as allow you to change job configuration before launching the job
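The launch forms correspond to a single “execute” call on the job. The sketch below shows how the Name/Value pairs from the configuration form, plus the input/output Data Sources and the target cluster, fit into the request body. This follows the Sahara EDP v1.1 layout as an assumption, and all ids are placeholders.

```python
import requests

SAHARA_URL = "http://SAHARA_HOST:8386/v1.1/PROJECT_ID"  # assumed endpoint
HEADERS = {"X-Auth-Token": "KEYSTONE_TOKEN"}

launch = {
    "cluster_id": "CLUSTER_ID",
    "input_id": "INPUT_DATA_SOURCE_ID",
    "output_id": "OUTPUT_DATA_SOURCE_ID",
    "job_configs": {
        # Name/Value rows added via the "Add" button end up here.
        "configs": {
            "mapred.mapper.class": "org.apache.oozie.example.SampleMapper",
        },
        "params": {},
        "args": [],
    },
}
resp = requests.post("%s/jobs/JOB_ID/execute" % SAHARA_URL,
                     json=launch, headers=HEADERS)
print(resp.json()["job_execution"]["id"])
```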
There are sample jobs located in the sahara repository. In this section, we walk through running those jobs via the Horizon UI. These steps assume that you already have a cluster up and running (in the “Active” state).
- Load the input data file from https://github.com/openstack/sahara/tree/master/etc/edp-examples/pig-job/data/input into Swift
- Click on Project/Object Store/Containers and create a container with any name (“samplecontainer” for our purposes here)
- Click on Upload Object and give the object a name (“piginput” in this case)
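If you would rather stage the input from a script than through Horizon, python-swiftclient can create the container and object directly. A sketch, assuming Keystone v2 credentials; substitute your own auth URL, tenant, and password.

```python
import swiftclient

# Assumed Keystone v2 credentials -- adjust for your cloud.
conn = swiftclient.Connection(
    authurl="http://KEYSTONE_HOST:5000/v2.0",
    user="demo",
    key="DEMO_PASSWORD",
    tenant_name="demo",
    auth_version="2",
)

conn.put_container("samplecontainer")
with open("input", "rb") as f:  # the downloaded pig-job data file
    conn.put_object("samplecontainer", "piginput", contents=f)
```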
- Navigate to Data Processing/Data Sources and click on Create Data Source
- Name your Data Source (“pig-input-ds” in this sample)
- Type = Swift, URL = samplecontainer/piginput, fill in the Source username/password fields with your credentials, and click “Create”
- Create another Data Source to use as output for the job
- Name = pig-output-ds, Type = Swift, URL = samplecontainer/pigoutput, Source username/password, “Create”
- Store your Job Binaries in the Sahara database
- Navigate to Data Processing/Job Binaries and click on Create Job Binary
- Name = example.pig, Storage type = Internal database, click Browse and locate example.pig in your checkout of the sahara project (<sahara root>/etc/edp-examples/pig-job)
- Create another Job Binary: Name = udf.jar, Storage type = Internal database, click Browse and locate udf.jar in the same directory (<sahara root>/etc/edp-examples/pig-job)
- Create a Job
- Navigate to Data Processing/Jobs and click on Create Job
- Name = pigsample, Job Type = Pig, Choose “example.pig” as the main binary
- Click on the “Libs” tab and choose “udf.jar”, click the “Choose” button beneath the dropdown, and then click on “Create”
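For reference, creating the same job over the REST API ties the main and library binaries together by id. Same assumed endpoint as before; mains and libs take lists of job binary ids.

```python
import requests

SAHARA_URL = "http://SAHARA_HOST:8386/v1.1/PROJECT_ID"  # assumed endpoint
HEADERS = {"X-Auth-Token": "KEYSTONE_TOKEN"}

job = {
    "name": "pigsample",
    "type": "Pig",
    "mains": ["EXAMPLE_PIG_BINARY_ID"],  # id of example.pig
    "libs": ["UDF_JAR_BINARY_ID"],       # id of udf.jar
    "description": "",
}
resp = requests.post("%s/jobs" % SAHARA_URL, json=job, headers=HEADERS)
print(resp.json()["job"]["id"])
```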
- Launch your job
- To launch your job from the Jobs page, click on the down arrow at the far right of the screen and choose “Launch on Existing Cluster”
- For the input, choose “pig-input-ds”; for the output, choose “pig-output-ds”. Also choose whichever cluster you’d like to run the job on
- For this job, no additional configuration is necessary, so you can just click on “Launch”
- You will be taken to the “Job Executions” page where you can see your job progress through the “PENDING”, “RUNNING”, and “SUCCEEDED” phases
- When your job finishes with “SUCCEEDED”, you can navigate back to Object Store/Containers and browse into samplecontainer to see your output. It should be in the “pigoutput” folder (a scripted version of this check follows these steps)
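The progress check and the output browse can both be scripted: poll the job execution until it leaves its transient states, then list the output objects in Swift. The endpoint layout, the info/status field, and the terminal status names are assumptions based on the Sahara EDP API and Oozie; verify them against your release.

```python
import time

import requests
import swiftclient

SAHARA_URL = "http://SAHARA_HOST:8386/v1.1/PROJECT_ID"  # assumed endpoint
HEADERS = {"X-Auth-Token": "KEYSTONE_TOKEN"}


def wait_for_job(execution_id, poll_seconds=30):
    """Poll a job execution until it reaches a terminal status."""
    while True:
        resp = requests.get(
            "%s/job-executions/%s" % (SAHARA_URL, execution_id),
            headers=HEADERS)
        status = resp.json()["job_execution"]["info"]["status"]
        print("job status: %s" % status)
        if status in ("SUCCEEDED", "KILLED", "FAILED"):  # assumed terminals
            return status
        time.sleep(poll_seconds)


if wait_for_job("EXECUTION_ID") == "SUCCEEDED":
    conn = swiftclient.Connection(
        authurl="http://KEYSTONE_HOST:5000/v2.0",  # assumed Keystone v2
        user="demo", key="DEMO_PASSWORD",
        tenant_name="demo", auth_version="2")
    # List everything under the pigoutput "folder".
    _, objects = conn.get_container("samplecontainer", prefix="pigoutput")
    for obj in objects:
        print(obj["name"])
```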
- Store the Job Binary in the Sahara database
- Navigate to Data Processing/Job Binaries and click on Create Job Binary
- Name = sparkexample.jar, Storage type = Internal database, click Browse, navigate to <sahara root>/etc/edp-examples/edp-spark, choose spark-example.jar, and click “Create”
- Create a Job
- Name = sparkexamplejob, Job Type = Spark, Main binary = Choose sparkexample.jar, Click “Create”
- Launch your job
- To launch your job from the Jobs page, click on the down arrow at the far right of the screen and choose “Launch on Existing Cluster”
- Choose whichever cluster you’d like to run the job on
- Click on the “Configure” tab
- Set the main class to be: org.apache.spark.examples.SparkPi
- Under Arguments, click Add and fill in the number of “Slices” you want to use for the job. For this example, let’s use 100 as the value
- Click on “Launch” (a sketch of the equivalent API call follows these steps)
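In API terms, the Configure tab for a Spark job fills in a main-class config and an argument list, and no input/output Data Sources are needed. A sketch with the usual assumed endpoint; the edp.java.main_class key follows Sahara’s EDP conventions, but treat it as an assumption.

```python
import requests

SAHARA_URL = "http://SAHARA_HOST:8386/v1.1/PROJECT_ID"  # assumed endpoint
HEADERS = {"X-Auth-Token": "KEYSTONE_TOKEN"}

launch = {
    "cluster_id": "CLUSTER_ID",
    "job_configs": {
        # Main class from the Configure tab (assumed config key).
        "configs": {
            "edp.java.main_class": "org.apache.spark.examples.SparkPi",
        },
        "args": ["100"],  # the number of slices
        "params": {},
    },
}
resp = requests.post("%s/jobs/SPARK_JOB_ID/execute" % SAHARA_URL,
                     json=launch, headers=HEADERS)
print(resp.json()["job_execution"]["id"])
```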
- You will be taken to the “Job Executions” page where you can see your job progress through the “PENDING”, “RUNNING”, and “SUCCEEDED” phases
- When your job finishes with “SUCCEEDED”, you can see your results by SSHing into the Spark “master” node
- The output is located at /tmp/spark-edp/<name of job>/<job execution id>. Running cat stdout there should display something like “Pi is roughly 3.14156132”
- Note that for more complex jobs, the input/output may be elsewhere. This particular job just writes to stdout, which is captured in the folder under /tmp
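For intuition about what SparkPi computes and what the slices argument controls, here is a rough PySpark equivalent of the bundled example (a sketch, not the code that actually ships in spark-example.jar): each slice contributes a batch of random samples to a Monte Carlo estimate of pi, so more slices means more parallel tasks.

```python
import random

from pyspark import SparkContext

sc = SparkContext(appName="PySparkPi")
slices = 100                 # same role as the "Slices" argument above
samples_per_slice = 100000   # arbitrary sample count per slice


def inside(_):
    # Sample a point in the unit square; count it if it falls
    # inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0


count = sc.parallelize(range(slices * samples_per_slice), slices) \
          .map(inside).sum()
print("Pi is roughly %f" % (4.0 * count / (slices * samples_per_slice)))
sc.stop()
```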