Install Spark & Pyspark on Windows

Spark is a parallel data processing framework used for big data analytics. It is an open-source project written in Scala. It supports streaming data analysis, SQL, machine learning, and graph processing, and it offers APIs for R, Python, Java, and Scala.
Installing Spark locally on Windows may not be straightforward, so I decided to write a tutorial about its installation and how to call it from Python. After reading this tutorial you will know how to 1) install Spark on Windows and 2) create a Spark session using the pyspark library. The Spark version used in this tutorial is 3.1.2. Spark runs on Java 8/11, Scala 2.12, Python 3.6+, and R 3.5+.
Here is the list of steps that we will go through together:

  1. Java installation
  2. Apache Spark installation
  3. Move the Spark files
  4. Add Hadoop support
  5. Set environment variables
  6. Launch Spark
  7. Launch a Spark session in Python

1. Java installation

The first step is to check whether you already have Java installed on your Windows machine. You can check this by opening a Command Prompt or PowerShell and typing “java --version”.

Keep in mind that Spark works with Java 8/11. In the above screenshot, my default Java version is 16; however, I have also installed Java 11 and set it in the environment variables (we will get to this part in a moment).

If you don't have Java installed, you can download version 8 or 11 from one of the links below:

After downloading, execute the installer and install the package. We will return to Java later.

2. Apache Spark installation

The next step is to install Apache Spark. Here, you can find the Spark package prebuilt for Apache Hadoop. The latest version at the time of writing this tutorial is 3.1.2 (edit: a newer version, 3.2.0, has since been released, which should not change the installation steps anyway). Select the desired Spark and Hadoop versions, or leave the default options, and download the tgz file.

You may wonder why you need Hadoop. Hadoop provides a distributed file system, and Spark by itself doesn't have a storage system, so when it needs to run in multi-node mode it depends on Hadoop or a similar storage layer such as S3. Spark is an in-memory distributed computing engine and can run in standalone mode without Hadoop, but then you miss the distributed storage feature. For small workloads it is faster than Hadoop because it processes data in memory, but it cannot persist the data on its own. With the help of Hadoop, it can store the data as well.

3. Move the Spark files

After downloading the tgz file, extract it, create a folder named Spark, and move the extracted contents into that folder. For example, I have created a Spark folder in my Windows directory and pasted the contents there:

4. Add Hadoop support

To run Hadoop on Windows, we need an extra “.exe” file that gives Hadoop the required file access permissions. You can download this file from here: look for the Hadoop version that you downloaded before and download the corresponding winutils.exe. In my case, I downloaded the latest one, for Hadoop 3.2.2.

After the download, create a new folder named Hadoop (or anything you like), create another folder named bin inside it, and place the exe file there, so the structure will be "Hadoop/bin/winutils.exe". In my case, I have created the folder in the Windows directory:

5. Set environment variables

We are done with the downloads, so let's tell Windows where to find what we have installed. Specifically, we need to set three environment variables referencing Java, Spark, and Hadoop.

Go to your environment variables; you can find them by clicking the Windows button and typing “Environment”, and Windows should find the settings page automatically.

  • Set the Java environment variable
    • Click on New to add a variable
    • Name it JAVA_HOME
    • Specify where your Java installation is located (here you can see that although I have two versions of Java, I pointed the variable at version 11, which works with Spark)
  • Set the Spark environment variable
    • Click on New to add a variable
    • Name it SPARK_HOME
    • Specify where your Spark folder is located
  • Set the Hadoop environment variable
    • Click on New to add a variable
    • Name it HADOOP_HOME
    • Specify where your Hadoop folder is located

Then you need to add the bin directories of Spark, Hadoop, and Java to your path. For this, still in the environment variables window, edit the Path variable and add the bin paths for Spark, Hadoop, and Java:

After this, save and close the window. And you are done!
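
If you want to double-check the variables, a quick way (assuming Python 3 is already installed) is to read them from a freshly opened terminal:

```
import os

# The three variables we just set; each should point to the folder where the
# corresponding tool lives. Run this from a newly opened terminal so that the
# updated environment is picked up.
for name in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    print(name, "=", os.environ.get(name))
```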

6. Launch Spark

We are at the last step of checking that everything works. Open a Command Prompt or PowerShell and type “spark-shell”. If the installation was successful, you should see some informative output like this:

After seeing the above message, you can also check your Spark version by typing “spark.version”:

You can also access the Spark Web UI by navigating to http://localhost:4040/. Here you can find useful details about Spark jobs, the cluster, application status, and so on.

7. Launch a Spark session in Python

Now that we have set everything up, let's try to use Spark from Python. For this, we need the pyspark library.

1- Install pyspark

You can install the pyspark library using the pip command in a notebook:
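
The cell can be as simple as the following (the exclamation mark tells the notebook to run the command in the shell):

```
!pip install pyspark
```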

2- Create a session

To use Spark, we need to create a session. A session is going to be the entry point for Spark.

To create a session, we can specify some configuration, such as the name of the app and whether to run Spark on a cluster or in local mode, and then call the method that creates the session:
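
A minimal sketch of such a cell (the app name "MyApp" is just a placeholder):

```
from pyspark.sql import SparkSession

# Build (or reuse) a Spark session; local[*] runs Spark locally using all
# available cores, and "MyApp" is a placeholder application name.
spark = (
    SparkSession.builder
    .appName("MyApp")
    .master("local[*]")
    .getOrCreate()
)
```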

In the above command, the arguments are:

  • appName(): the name of the application
  • getOrCreate(): if there is an existing Spark session, it returns it; otherwise it creates one

If you print out the spark variable, you can see more details about it:
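
In a notebook, simply evaluating the variable in a cell is enough:

```
# Evaluating the session variable in a notebook cell displays a summary of it
spark
```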

Some details about the output:

  • SparkSession - in-memory: as we know Spark is an in-memory data engine
  • SparkContext: in earlier versions of Spark, SparkContext was the entry point to a Spark application, and it could only be used to create RDDs; for each kind of Spark interaction, such as SQL, Hive, or Streaming, we then needed a specific context, such as SQLContext, HiveContext, or StreamingContext. Since Spark 2.0, SparkSession has been introduced, which unifies all these contexts and becomes the entry point for all operations, so you no longer need to create multiple contexts for different use cases (see the short sketch after this list).
  • The URL for Spark Web UI: as mentioned before
  • The version: version of Spark
  • And the master mode: local[*] means that Spark is running locally and will use as many cores as are available on the machine. If you specify local[1], Spark uses one worker thread (so no parallelism). local[k] means Spark uses k threads to divide its tasks into k parts; k should match the number of cores of the machine.
  • App name: the name we specified while creating the SparkSession
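
As a small illustration of this unified entry point, the underlying SparkContext is still accessible from the session we created above:

```
# The session wraps the SparkContext, so no separate contexts are needed.
sc = spark.sparkContext
print(sc.master)    # e.g. local[*] when running on all local cores
print(sc.appName)   # the name passed to appName()
```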

3- Check Spark version

You can also check the Spark version in the notebook with “spark.version”; note that spark is the name of the variable I created for my Spark session:
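
For example (spark is the session variable created earlier):

```
print(spark.version)   # e.g. 3.1.2
```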

4- Other operations

For example, I want to read a CSV file and print its schema:
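
A sketch of such a cell, assuming a hypothetical CSV file at data/example.csv:

```
# Read a CSV file into a DataFrame; the path below is a placeholder.
df = spark.read.csv("data/example.csv", header=True, inferSchema=True)

# Print the inferred schema of the DataFrame
df.printSchema()
```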

Voilà! I hope you managed to install Spark and launch it successfully from Python!

Let me know in the comments in case of any trouble or questions, happy Sparking!

Author: Pari
