Setting up Pyspark in Jupyter through Homebrew on Mac M1

This is more of a “write it down in case I forget it” kind of post. When I switched to a Mac, I had to figure out how to install things with Homebrew, which somehow involved more random things to tinker with just to get stuff working. Anyway, I wanted to set up Pyspark on my Mac with an M1 processor, and because it seemed rather complicated when I was doing it, I wrote it all down so that I wouldn’t forget. Looking back, the steps aren’t actually that complicated :__).

Install Homebrew

If you don’t already have Homebrew installed on your Mac, install it with this command:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

If that doesn’t work, I recommend checking the Homebrew website for the correct, up-to-date install command.

Install Java and set JAVA_HOME

Java is required for Pyspark to run. Some articles tell you to install openjdk@8 (the version 8 JDK), but that won’t work on the M1’s processor architecture. So instead, install version 11:

brew install openjdk@11

Then take note of where your java is installed:

which java

This should point to a path inside Homebrew, something like this:

/opt/homebrew/opt/openjdk@11/bin/java

Copy this path and set your JAVA_HOME environment variable accordingly in your ~/.zshrc file:

export JAVA_HOME="/opt/homebrew/opt/openjdk@11/"

Important: Note that the bin/java part has been removed from the path.

Then source your ~/.zshrc file to make the environment variable available in your terminal session:

source ~/.zshrc
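If you want to double-check the result, here is a small Python sketch (my own addition, not a required step) that verifies JAVA_HOME is set and that the java binary actually lives under it:

```python
# Sanity-check sketch: confirm JAVA_HOME is set and that it contains
# bin/java (the part that was stripped off the `which java` output above).
import os

java_home = os.environ.get("JAVA_HOME", "")
java_bin = os.path.join(java_home, "bin", "java")

print("JAVA_HOME:", java_home or "(not set)")
print("bin/java exists:", os.path.isfile(java_bin))
```

If `bin/java exists` prints False, re-check the path you copied into your ~/.zshrc.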

Install python

If you don’t already have Python installed on your system, you can install the latest version with Homebrew:

brew install python

If you are using pyenv, there’s no need to install Python through Homebrew, as that might conflict with the Python version you want to use. Instead, just set a local pyenv version in your current directory, like this (for example, selecting version 3.10.10 for the current directory):

pyenv local 3.10.10

Install apache-spark

Now that you have Java and Python installed, you can install Pyspark:

brew install apache-spark

When the installation finishes, check the output text. It should show the path where apache-spark was installed by brew, something like this:

/opt/homebrew/Cellar/apache-spark/3.3.2

Take note of this path, then set your SPARK_HOME accordingly in your ~/.zshrc file:

export SPARK_HOME="/opt/homebrew/Cellar/apache-spark/3.3.2/libexec"

Important: Note the added “libexec” at the end of the path. This is apparently needed on Mac installations.

Then source your ~/.zshrc file to make the environment variable available in your terminal session:

source ~/.zshrc
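As with JAVA_HOME, you can sanity-check this with a small Python sketch (again, my own addition): SPARK_HOME should point at the libexec directory, which contains Spark’s bin/pyspark script.

```python
# Sanity-check sketch: confirm SPARK_HOME is set and that the libexec
# directory it points to contains Spark's bin/pyspark script.
import os

spark_home = os.environ.get("SPARK_HOME", "")
pyspark_bin = os.path.join(spark_home, "bin", "pyspark")

print("SPARK_HOME:", spark_home or "(not set)")
print("bin/pyspark exists:", os.path.isfile(pyspark_bin))
```

If `bin/pyspark exists` prints False, the most likely culprit is a missing “libexec” at the end of the path.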

You can verify the Pyspark installation by typing pyspark in your terminal, which starts a new Pyspark session:

(venv) shiela@Macbook spark-project % pyspark
Python 3.10.10 (main, Mar  2 2023, 22:59:21) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/

Using Python version 3.10.10 (main, Mar  2 2023 22:59:21)
Spark context Web UI available at http://192.168.0.16:4040
Spark context available as 'sc' (master = local[*], app id = local-1679398716738).
SparkSession available as 'spark'.

There may be a few WARNING messages, but you can ignore those for now.

To exit the Pyspark session, just type exit(), and you will be taken back to your terminal.

Install jupyter

OK, we’re almost done! All that’s left is to install Jupyter Notebook. You can install jupyter either through Homebrew:

brew install jupyter

or by using pip:

pip install jupyter

If you’re using pyenv and working inside a virtual environment, it’s better to install through pip, as this installs jupyter within your pyenv environment instead of globally on your system.

After installation, you can now run jupyter:

jupyter notebook

This will open a new jupyter notebook session in your browser. You can then run a Pyspark session by instantiating the SparkSession object: