I installed pyspark recently, and it installed correctly. When I run the following simple program in Python, I get an error:
from pyspark import SparkContext
sc = SparkContext()
data = range(1, 1000)
rdd = sc.parallelize(data)
rdd.collect()
While running the last line, I get an error whose key line seems to be:
[Stage 0:> (0 + 0) / 4]18/01/15 14:36:32 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 123, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
I have the following variables in .bashrc
export SPARK_HOME=/opt/spark
export PYTHONPATH=$SPARK_HOME/python3
I am using Python 3.
You should set the following environment variables in $SPARK_HOME/conf/spark-env.sh:
export PYSPARK_PYTHON=/usr/bin/python
export PYSPARK_DRIVER_PYTHON=/usr/bin/python
If spark-env.sh doesn’t exist, you can rename spark-env.sh.template.
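Once both variables point at the same interpreter, the snippet from the question should run again. As a minimal re-test (this simply reuses the code from the question):

from pyspark import SparkContext

sc = SparkContext()
rdd = sc.parallelize(range(1, 1000))
print(rdd.count())  # prints 999 instead of raising the version-mismatch error
sc.stop()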
I got the same issue, and I set both variables as follows:
export PYSPARK_PYTHON=/usr/local/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3
But my problem was still there.
Then I found out that the problem was that my default Python version was 2.7, which I saw by typing python --version.
So I solved the problem by following the page below:
How to set Python’s default version to 3.x on OS X?
Just run the code below at the very beginning of your code. I am using Python 3.7. You might need to run locate python3.7 to get your Python path.
import os

os.environ["PYSPARK_PYTHON"] = "/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7"
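If locate is not available, an alternative sketch is to ask Python itself where a given interpreter lives using shutil.which from the standard library (python3.7 here is just the interpreter name used above):

import shutil

# Prints the absolute path of the first python3.7 found on PATH, or None if there is none.
print(shutil.which("python3.7"))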
This may also happen if you’re working within a virtual environment. In that case, it may be harder to retrieve the correct path to the Python executable (and in any case, I think it’s not a good idea to hardcode the path if you want to share the code with others).
If you run the following lines at the beginning of your script/notebook (at least before you create the SparkSession/SparkContext), the problem is solved:
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
The os package allows you to set environment variables; the sys package, through sys.executable, gives the string with the absolute path of the executable binary for the Python interpreter that is currently running.
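For example, a minimal end-to-end sketch (assuming pyspark is importable and a local master) that sets both variables to the current interpreter and then confirms the workers report the same Python version as the driver:

import os
import sys

# Point both the driver and the workers at the interpreter running this script.
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate(SparkConf().setMaster('local[*]').setAppName('version-check'))
driver_version = sys.version_info[:2]
# Ask one worker task which Python version it runs.
worker_version = sc.parallelize([0], 1).map(lambda _: __import__('sys').version_info[:2]).collect()[0]
print(driver_version, worker_version)  # the two tuples should be identical
sc.stop()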
I’m using Jupyter Notebook to study PySpark, and that’s what worked for me.
First, identify where python3 is installed by running which python3 in a terminal. Here, it points to /usr/bin/python3.
Now, at the beginning of the notebook (or .py script), do:
import os

# Set spark environments
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python3'
Restart your notebook session and it should work!
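Before restarting, a quick sanity check (just a sketch, reusing the same /usr/bin/python3 path as above) can confirm that the hardcoded interpreter actually exists and report its version:

import os
import subprocess

# The path below is the one used above; adjust it for your machine.
worker_python = '/usr/bin/python3'
assert os.path.exists(worker_python), worker_python + ' does not exist'
# --version may print to stdout or stderr depending on the Python version, so merge the streams.
print(subprocess.check_output([worker_python, '--version'], stderr=subprocess.STDOUT).decode().strip())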
Apache-Spark 2.4.3 on Arch Linux
I’ve just installed Apache-Spark-2.3.4 from the Apache-Spark website. I’m using the Arch Linux distribution; it’s a simple and lightweight distribution. So I installed and put the apache-spark directory in /opt/apache-spark/. Now it’s time to export our environment variables. Remember, I’m using Arch Linux, so keep in mind to use your own $JAVA_HOME, for example.
Exporting the environment variables
echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk/jre' >> /home/user/.bashrc
echo 'export SPARK_HOME=/opt/apache-spark' >> /home/user/.bashrc
echo 'export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH' >> /home/user/.bashrc
echo 'export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH' >> /home/user/.bashrc
source ~/.bashrc
$ echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk/jre' >> /home/emanuel/.bashrc
$ echo 'export SPARK_HOME=/opt/apache-spark' >> /home/emanuel/.bashrc
$ echo 'export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH' >> /home/emanuel/.bashrc
$ echo 'export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH' >> /home/emanuel/.bashrc
$ source .bashrc
$ python
Python 3.7.3 (default, Jun 24 2019, 04:54:02)
[GCC 9.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
Everything works fine as long as you have correctly exported the environment variables above.
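A quick way to confirm that the PYTHONPATH entries above are actually picked up is to check where the imported modules come from (just a sketch; the zip name must match the one you exported):

import pyspark
import py4j

print(pyspark.__file__)  # should point somewhere under /opt/apache-spark/python/
print(py4j.__file__)     # should point inside the py4j-0.10.7-src.zip added to PYTHONPATH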
Using Apache-Spark on Arch Linux via a Docker image
For my own purposes, I’ve created a Docker image with Apache-Spark and Jupyter. To run the image:
docker run -ti -p 8888:8888 emanuelfontelles/spark-jupyter
just go to your browser and type localhost:8888. You will be prompted with an authentication page; go back to the terminal, copy the token number, and voilà, you will have an Arch Linux container running an Apache-Spark distribution.
If you are using PyCharm, go to Run -> Edit Configurations and click on Environment variables to add PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON there (basically, both should point to the same version of Python). This solution worked for me. Thanks to the above posts.
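To confirm from inside the script that the run configuration actually exported them, a tiny check (nothing PyCharm-specific, just reading the environment):

import os

# Both should print the interpreter path entered in the run configuration.
print(os.environ.get('PYSPARK_PYTHON'))
print(os.environ.get('PYSPARK_DRIVER_PYTHON'))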
To make it easier to see: instead of having to set a specific path such as /usr/bin/python3, you can do this.
I put these lines in my ~/.zshrc:
export PYSPARK_PYTHON=python3.8
export PYSPARK_DRIVER_PYTHON=python3.8
When I type python3.8 in my terminal, Python 3.8 starts. I think it’s because I installed pipenv.
Another good website to reference for getting your SPARK_HOME set up is https://towardsdatascience.com/how-to-use-pyspark-on-your-computer-9c7180075617
(for permission denied issues use sudo mv)
I tried two methods for this question; the method shown in the picture works.