I am attempting to run the very basic Spark+Python pyspark tutorial; see http://spark.apache.org/docs/0.9.0/quick-start.html

When I attempt to initialize a new SparkContext,

from pyspark import SparkContext
sc = SparkContext("local[4]", "test")

I get the following error:

ValueError: Cannot run multiple SparkContexts at once

I’m wondering if my previous attempts at running example code loaded something into memory that didn’t clear out. Is there a way to list current SparkContexts already in memory and/or clear them out so the sample code will run?

This happens because when you start "pyspark" in the terminal, the shell automatically initializes a SparkContext (bound to the name sc), so you have to stop it before creating a new one.

You can use

sc.stop()

before you create your new SparkContext.

Also, you can use

sc = SparkContext.getOrCreate()

instead of

sc = SparkContext()

I am new to Spark and I don't know much about the meaning of the parameters of SparkContext(), but both of the snippets shown above worked for me.

Turns out that running ./bin/pyspark interactively AUTOMATICALLY LOADS A SPARKCONTEXT. Here is what I see when I start pyspark:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 0.9.1
      /_/

Using Python version 2.6.6 (r266:84292, Feb 22 2013 00:00:18)
Spark context available as sc.

…so you can either run sc.stop() at the beginning before creating your own context (del sc only removes the Python name; the underlying context keeps running), or else go ahead and use sc as automatically defined.

The other problem with the example is that it appears to read from an ordinary local (NFS) filesystem path, whereas it is actually trying to read from the HDFS filesystem for Hadoop. I had to upload the README.md file from $SPARK_HOME into HDFS using "hadoop fs -put README.md README.md" before running the code.

Here is the modified example program that I ran interactively:

# No "from pyspark import SparkContext" needed here: the interactive
# shell has already created the context and bound it to sc.
logFile = "README.md"  # uploaded to HDFS above
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

and here is the modified version of the stand-alone python file:

"""SimpleApp.py"""
from pyspark import SparkContext
logFile = "README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)

which I can now execute using $SPARK_HOME/bin/pyspark SimpleApp.py

Have you tried calling sc.stop() before creating another SparkContext?

Instead of setting custom configuration on the SparkContext at the PySpark prompt, you can set it when starting PySpark.

e.g.

pyspark --master yarn --queue my_spark_pool1 \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"

These settings are then applied to the sc object in PySpark.