Installing PySpark with Jupyter notebook on Ubuntu 18.04 LTS
Upasana | December 07, 2019 | 4 min read | 1,534 views
In this tutorial we will learn how to install and work with PySpark in a Jupyter notebook on an Ubuntu machine, and how to build a Jupyter server by exposing it through an nginx reverse proxy over SSL. This way, the Jupyter server will be remotely accessible.
- Setup Virtual Environment
- Setup Jupyter notebook
- Jupyter Server Setup
- PySpark setup
- Configure bash profile
- Setup Jupyter notebook as a service on Ubuntu 18.04 LTS
- Nginx Setup
- SSL setup using LetsEncrypt
Virtual Environment Setup
Run the below command on the terminal to install virtualenv on your machine, if it is not there already. We will be using virtualenv to set up the virtual environment.
$ pip install virtualenv
Create the virtual environment with the following command:
$ virtualenv -p python3.6 venv
where venv is the name of the virtual environment. The above command creates a virtual environment named venv in the current directory.
To activate this newly created virtual environment, run the below command:
$ source venv/bin/activate
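A quick way to confirm that the activation worked is to ask the Python interpreter itself. The snippet below is a minimal sketch, not part of the original setup; the helper name in_virtual_environment is hypothetical. It relies on the fact that older virtualenv releases set sys.real_prefix, while newer ones (and the standard venv module) make sys.prefix differ from sys.base_prefix.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases expose sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer virtualenv releases and venv change sys.prefix but keep sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtual_environment())
print("interpreter prefix:", sys.prefix)
```

If the first line prints False after activation, the python on your PATH is probably not the one inside venv.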
Install jupyter notebook
To install Jupyter notebook, run the below command. Make sure the virtual environment is activated when you run it.
$ pip install jupyter notebook
Jupyter Server Setup
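Now we will set up a password for Jupyter notebook.
Generate the config for Jupyter notebook using the following command:
$ jupyter notebook --generate-config
Open the generated config file and set the two options shown below (the password and remote access), replacing the example hash with one generated for your own password: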
$ vi /home/<username>/.jupyter/jupyter_notebook_config.py
## Hashed password to use for web authentication.
#
# To generate, type in a python/IPython shell:
#
# from notebook.auth import passwd; passwd()
#
# The string should be of the form type:salt:hashed-password.
c.NotebookApp.password = u'sha1:020f1412ae63:227357c88b3996e75dcf85ea96c2d581db74ec1e'
## Allow requests where the Host header doesn't point to a local server
#
# By default, requests get a 403 forbidden response if the 'Host' header shows
# that the browser thinks it's on a non-local domain. Setting this option to
# True disables this check.
#
# This protects against 'DNS rebinding' attacks, where a remote web server
# serves you a page and then changes its DNS to send later requests to a local
# IP, bypassing same-origin checks.
#
# Local IP addresses (such as 127.0.0.1 and ::1) are allowed as local, along
# with hostnames configured in local_hostnames.
c.NotebookApp.allow_remote_access = True
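To make the type:salt:hashed-password form from the config comment more concrete, here is a small sketch that builds and checks such a string using only the standard library. It assumes that the sha1 variant is the SHA-1 hex digest of the passphrase followed by the salt, which is my understanding of the classic Notebook scheme rather than something stated in this article. The function names and the passphrase are hypothetical. For a real setup, generate the hash with passwd() from notebook.auth as the config comment says.

```python
import hashlib
import secrets


def make_password_hash(passphrase: str, salt_len: int = 12) -> str:
    """Build a 'sha1:salt:hash' string (assumed format, for illustration only)."""
    salt = secrets.token_hex(salt_len // 2)  # salt_len hexadecimal characters
    digest = hashlib.sha1((passphrase + salt).encode("utf-8")).hexdigest()
    return ":".join(("sha1", salt, digest))


def check_password(hashed: str, passphrase: str) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    algorithm, salt, digest = hashed.split(":", 2)
    candidate = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return secrets.compare_digest(candidate, digest)


hashed = make_password_hash("example-passphrase")  # hypothetical passphrase
print(hashed)
print("matches:", check_password(hashed, "example-passphrase"))
```

The salt is random, so the printed string differs on every run even for the same passphrase.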
PySpark Setup
We will install PySpark from PyPI. To install it, run the following command from inside the virtual environment:
$ pip install pyspark
For more information, see this web page: https://spark.apache.org/downloads.html
As of writing this article, v2.4.4 is the latest version of Apache Spark available, with Scala version 2.11.12.
Check the installation using the following command (Spark runs on the JVM, so a Java runtime must be installed on the machine):
$ spark-shell --version
Configure environment using Bash profile
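Set the following environment variables in the .bashrc file located under your home directory, adjusting the SPARK_HOME path to wherever your virtual environment lives, and then reload the file with source: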
export SPARK_HOME=/home/<username>/build/jupyter/venv/lib/python3.6/site-packages/pyspark/
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
$ source ~/.bashrc
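The SPARK_HOME path above depends on where the virtual environment was created, so it can help to look it up rather than type it by hand. The following is a minimal sketch, assuming PySpark was installed with pip into the active environment; the helper name package_dir is hypothetical. It uses importlib to find the package directory without importing PySpark itself.

```python
import importlib.util
import os
from typing import Optional


def package_dir(name: str) -> Optional[str]:
    """Return the directory of an installed top-level package, or None if it is missing."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # For a regular package, origin points at its __init__.py file.
    return os.path.dirname(spec.origin)


spark_home = package_dir("pyspark")
if spark_home is None:
    print("pyspark is not installed in this interpreter")
else:
    print("export SPARK_HOME=" + spark_home)
```

Run it with the virtual environment activated and paste the printed line into .bashrc.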
Now we can start the Jupyter notebook from command line:
$ pyspark
or using this command:
$ jupyter notebook
Run Pyspark on jupyter notebook
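Open a regular Python 3 notebook on the Jupyter server. We don't need a PySpark kernel, as we will use findspark to locate the Spark home. If findspark is not installed yet, install it inside the virtual environment with pip install findspark.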
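import findspark
findspark.find()  # returns the detected Spark home
findspark.init()  # adds pyspark to sys.path

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # Draw a random point in the unit square and test whether it falls inside the quarter circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Count the samples that land inside the quarter circle, in parallel
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()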
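The Spark job above is a Monte Carlo estimate: the fraction of random points in the unit square that fall inside the quarter circle approaches pi/4, so multiplying it by 4 approximates pi. The same logic can be sketched in plain Python without Spark, which is handy for checking the idea on a small sample before running the distributed version. The function name estimate_pi and the seed parameter are illustrative additions, not part of the original notebook.

```python
import random


def estimate_pi(num_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points in the unit square (single process, no Spark)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    return 4 * inside / num_samples


print(estimate_pi(100_000))
```

Spark distributes the same filter-and-count over the partitions of the range, which is what makes the 100 million sample run practical.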
Setup Jupyter notebook as a service in Ubuntu 18.04 LTS
We need a systemd service so that Jupyter notebook can run as a background service. Create a unit file, for example /etc/systemd/system/jupyter.service, with the following content:
[Unit]
Description=Jupyter Notebook
[Service]
Type=simple
PIDFile=/run/jupyter.pid
ExecStart=/bin/bash -c ". /home/<username>/build/jupyter/venv/bin/activate;jupyter-notebook --notebook-dir=/home/<username>/my-notebooks"
User=<username>
Group=<username>
WorkingDirectory=/home/<username>/my-notebooks
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
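Reload systemd so that it picks up the new unit file, then enable the service to start at boot and start it:
$ sudo systemctl daemon-reload
$ sudo systemctl enable jupyter.service
$ sudo systemctl start jupyter.service
To stop the service, run:
$ sudo systemctl stop jupyter.service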
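While the service is running, it should accept TCP connections on the notebook port. The sketch below is one possible check, assuming the notebook listens on localhost:8888 (the address the nginx configuration in this article proxies to); the helper name port_is_open is hypothetical.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


print("Jupyter reachable on localhost:8888:", port_is_open("localhost", 8888))
```

A False result after starting the service usually means it failed to start; sudo systemctl status jupyter.service shows the reason.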
Nginx setup as a reverse proxy
Jupyter notebook relies on WebSockets, so the proxy must use HTTP/1.1 and forward the WebSocket upgrade headers.
The following nginx configuration exposes Jupyter through the nginx proxy server:
server {
    server_name <dns-name>;
    location / {
        proxy_pass http://localhost:8888;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        client_max_body_size 10M;
        proxy_http_version 1.1;
        proxy_set_header Upgrade "websocket";
        proxy_set_header Connection "Upgrade";
        proxy_read_timeout 86400;
    }
}
SSL setup using LetsEncrypt
LetsEncrypt provides free SSL certificates that can be used to secure our site with HTTPS.