Apache 2.0 Spark with Scala

CtrlK

18. [Activity] Improving the Word Count Script with Regular Expressions

IMPROVING WORD COUNT

Normalizing data, sorting the results

Text Normalization

Problem: word variants with different capitalization, punctuation, etc.
There are fancy natural language processing toolkits like NLTK
But we'll keep it simple, and use a regular expression.

Activity

Looking at WordCount.scala, we need to account for , in the word which is contentuated with the word
To improve upon WordCount.scala to account for , , we can look at WordCountBetter.scala from the resource folder
Import WordCountBetter.scala and open up in Eclipse-Scala IDE

    // Split using a regular expression that extracts words
    val words = input.flatMap(x => x.split("\\W+"))

The following code split the lines using REGEX for one word or more of them, and eliminate the additional , found

    // Normalize everything to lowercase
    val lowercaseWords = words.map(x => x.toLowerCase())

This is done to let all the words be in lowercase to avoid duplicates between capitalized words
Run it to see the difference in output
Now the output looks much cleaner than WordCount.scala
But now if the results were sorted, it would look much better
We would explore this in the next lecture

Previous17. [Activity] Counting Word Occurences using Flatmap()Next19. [Activity] Sorting the Word Count Results

Last updated 7 years ago