1. Word Count Using RDD

Using the parallelize Method

data = ["I am a data engineer I am a data engineer"]  # renamed from input to avoid shadowing Python's built-in input()
rdd = spark.sparkContext.parallelize(data)
words = rdd.flatMap(lambda x: x.split(" "))  # split each element into words and flatten the result
wordsValue = words.map(lambda x: (x, 1))  # pair each word with an initial count of 1
wordsCount = wordsValue.reduceByKey(lambda counter, nextValue: counter + nextValue)  # sum the counts per word
wordsCount.collect()
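The three transformations map directly onto plain-Python operations, which can be handy for checking the logic without a cluster. A minimal sketch using the same sample sentence (a dict-based merge stands in for reduceByKey):

```python
# Pure-Python mirror of the RDD pipeline above (no Spark required).
data = ["I am a data engineer I am a data engineer"]

# flatMap: split every element into words and flatten into one list
words = [w for line in data for w in line.split(" ")]

# map: pair each word with a count of 1
words_value = [(w, 1) for w in words]

# reduceByKey: sum the 1s per word
words_count = {}
for word, count in words_value:
    words_count[word] = words_count.get(word, 0) + count

print(words_count)  # {'I': 2, 'am': 2, 'a': 2, 'data': 2, 'engineer': 2}
```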
Using the textFile Method

rdd = spark.sparkContext.textFile("/FileStore/tables/words-1.txt")  # one RDD element per line of the file
words = rdd.flatMap(lambda x: x.split(" "))  # split each line into words and flatten
wordsValue = words.map(lambda x: (x, 1))  # pair each word with an initial count of 1
wordsCount = wordsValue.reduceByKey(lambda counter, nextValue: counter + nextValue)  # sum the counts per word
wordsCount.collect()
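Note that collect() returns the (word, count) pairs in no particular order. In Spark you could rank them first with something like wordsCount.sortBy(lambda kv: -kv[1]). The same idea in plain Python, using a small hypothetical counts dict for illustration:

```python
# Sort word counts by descending frequency, mirroring the effect of
# wordsCount.sortBy(lambda kv: -kv[1]) on an RDD of (word, count) pairs.
words_count = {"spark": 3, "rdd": 1, "data": 2}  # illustrative counts, not from the file above

ranked = sorted(words_count.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('spark', 3), ('data', 2), ('rdd', 1)]
```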