PySpark HDFS Notebook

A simple PySpark word-count app that reads its input from HDFS

What does it do?

  • Connects to a specified Spark cluster
  • Reads a file specified by an HDFS URL
  • Splits each line into words on spaces and counts the occurrences of each word
  • Prints the counts for up to 20 words
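
The steps above can be sketched roughly as follows. This is a minimal illustration using the standard PySpark RDD API, not necessarily the notebook's exact code. The names `split_words` and `word_count`, the app name, and the example master and HDFS URLs are assumptions chosen for illustration. The sketch also drops the empty strings produced by repeated spaces, which the notebook may or may not do.

```python
from operator import add


def split_words(line):
    """Split a line on single spaces, dropping empty strings from repeated spaces."""
    return [word for word in line.split(" ") if word]


def word_count(master_url, hdfs_url, limit=20):
    """Count words in the file at hdfs_url and return up to `limit` (word, count) pairs."""
    # Imported here so split_words can be used without a Spark installation.
    from pyspark import SparkConf, SparkContext

    # Connect to the specified Spark cluster (app name is a placeholder).
    conf = SparkConf().setMaster(master_url).setAppName("hdfs-wordcount")
    sc = SparkContext(conf=conf)
    try:
        counts = (
            sc.textFile(hdfs_url)        # read the file from HDFS, one record per line
            .flatMap(split_words)        # one record per word
            .map(lambda word: (word, 1))
            .reduceByKey(add)            # sum the 1s for each distinct word
        )
        # take() returns at most `limit` pairs, in no particular order.
        return counts.take(limit)
    finally:
        sc.stop()


# Example call (hypothetical hosts and path, replace with your own):
# for word, count in word_count("spark://spark-master:7077",
#                               "hdfs://namenode:8020/path/to/input.txt"):
#     print(word, count)
```

Because `take` simply returns the first pairs it finds, the 20 words printed are not the most frequent ones. Sorting by count would need an extra step.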

Notes on permissions

  • This example uses an unsecured HDFS
  • The file must be readable by the user nbuser (1011:root)