png

  • Go to https://colab.research.google.com/ and create a NEW NOTEBOOK.
  • Give your notebook a name so you can reference this notebook later
  • The code below will install and configure the environment with lates Spark version 2.4.5
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Run the cell. Then

!wget -q https://apache.osuosl.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz

Run the cell. Then

!tar xf spark-2.4.5-bin-hadoop2.7.tgz

Run the cell. Then

!pip install -q findspark

Finally

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

Run the cell.

To verify type

os.environ["SPARK_HOME"]

And you should see

'/content/spark-2.4.5-bin-hadoop2.7'

Starting a Spark Session using the code below

import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Install spark-nlp

pip install spark-nlp==2.4.2

And run the cell.

Once completed for testing to make sure everything is loaded and ready run the code below

import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

You should see the below

Spark NLP version:  2.4.2
Apache Spark version:  2.4.5