Reading a large body of text can be tedious. Be it a long speech, a blog post or a big database, you have surely run into this monotonous situation. What do you do when you need to convey the crucial information in a text document of 10,00,00,000 words (still counting the zeros!) in very limited time? Nothing to worry about. Data analytics and visualization come to the rescue! A word cloud can make even dull data beautifully insightful. Read on to learn more about this graphical representation and even build one using R.
What is a word cloud?
A word cloud is a simple yet powerful visualization technique, used primarily to highlight the most significant terms in textual data. It is also known as a text cloud or tag cloud. For instance, below is a word cloud created from wiki data on the Rio Olympics, 2016.
In short, the more frequently a word appears in the text, the more prominence it gets in the cloud. The size and boldness of a word therefore depend on how often that word occurs in the document.
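To make the frequency-to-size idea concrete, here is a minimal base R sketch. The helper name word_sizes, the sample sentence and the size range of 0.5 to 4 are all assumptions chosen for illustration; this is not how the wordcloud package computes sizes internally.

```r
# Hypothetical helper: count words and rescale the counts to a size range
word_sizes <- function(text, min_size = 0.5, max_size = 4) {
  # Lower-case the text and split it on anything that is not a letter
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]

  # Count how often each word occurs, most frequent first
  counts <- table(words)
  freq <- sort(setNames(as.numeric(counts), names(counts)), decreasing = TRUE)

  # If every word is equally frequent, give them all the same size
  if (max(freq) == min(freq)) {
    return(setNames(rep(max_size, length(freq)), names(freq)))
  }

  # Linear rescaling: the rarest word gets min_size, the most frequent max_size
  min_size + (max_size - min_size) * (freq - min(freq)) / (max(freq) - min(freq))
}

sizes <- word_sizes("Rio Olympics, Rio games, medals: Rio Olympics")
print(sizes)
```

In this made-up sentence "rio" occurs most often, so it receives the largest size, while words that occur only once receive the smallest.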
Create your own word cloud!
Want to make your own tag cloud? Let’s do it! Here, we will use R to convert a source text document into a vibrant word cloud with only a few lines of code. You will need:
- A working RStudio (open source integrated development environment (IDE) for R)
- Source text data
Install the following packages in RStudio:
- tm (a framework for text mining applications within R)
- wordcloud (package for creating pretty word clouds)
- RColorBrewer (provides color schemes for graphics)
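Note: To install a package in RStudio, use the install.packages() command:
install.packages("tm")
install.packages("wordcloud")
install.packages("RColorBrewer")
After successful installation of the above packages, load them into your R script using library():
library(tm)
library(wordcloud)
library(RColorBrewer)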
Now, we are all set to use the functionality provided by these packages in our script.
To create a word cloud, you need a text document as an input to your R script.
Note: The source text data used in this blog can be downloaded from here. Make sure that your R script and the text file are in the same working directory.
mydata <- readLines("Rio olympics.txt")
# readLines() reads the entire text document line by line into the user-defined variable (i.e. mydata)
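If the file is not in the working directory, readLines() stops with an error. A small defensive wrapper can make that failure easier to diagnose. The function name read_text_lines and the choice to drop blank lines are assumptions made for this sketch, not part of the original script.

```r
# Hypothetical wrapper around readLines() with a clearer error message
read_text_lines <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (working directory is ", getwd(), ")")
  }
  # warn = FALSE silences the warning about a missing final newline
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8")
  # Drop lines that are empty or contain only white space (an assumption)
  lines[nzchar(trimws(lines))]
}

# Usage with a temporary file standing in for the real text file
tmp <- tempfile(fileext = ".txt")
writeLines(c("Rio 2016", "", "Summer Olympics"), tmp)
demo_lines <- read_text_lines(tmp)
print(demo_lines)
```

With the article's data you would call read_text_lines("Rio olympics.txt") instead of using the temporary file.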
After loading the text data, we need to preprocess it and finally convert it into plain text. The first step is to build a corpus, which is a collection of documents.
We use Corpus() from ‘tm’ to create a corpus from a character vector via the ‘VectorSource’ method and assign it to the user-defined variable ‘mycorpus’.
mycorpus <- Corpus(VectorSource(mydata))
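To see what the corpus contains, it can help to build one from a tiny character vector and inspect it. The two sample lines below are made up for illustration, and the sketch assumes the tm package is installed.

```r
library(tm)

# Hypothetical sample lines standing in for the article's text file
sample_lines <- c("Rio hosted the 2016 Summer Olympics",
                  "Athletes competed for medals")

# Each element of the character vector becomes one document in the corpus
sample_corpus <- Corpus(VectorSource(sample_lines))

n_docs <- length(sample_corpus)           # number of documents in the corpus
first_doc <- content(sample_corpus[[1]])  # text of the first document
print(n_docs)
print(first_doc)
```

Because readLines() returns one element per line, a corpus built this way treats every line of the file as a separate document.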
Once our corpus is ready, we need to preprocess the documents in it, e.g. by removing stopwords, punctuation, numbers and extra white space, or by stemming.
These transformations are done with the tm_map() function, which applies a given function to every element of the corpus.
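# Convert all text to lower case (content_transformer() keeps the corpus structure intact)
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
# Remove common English stopwords such as "the", "is" and "and"
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))
# Remove numbers
mycorpus <- tm_map(mycorpus, removeNumbers)
# Remove punctuation marks
mycorpus <- tm_map(mycorpus, removePunctuation)
# Collapse runs of white space into a single space
mycorpus <- tm_map(mycorpus, stripWhitespace)
# Older tm releases needed a final tm_map(mycorpus, PlainTextDocument) step; with content_transformer() above it should no longer be required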
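To get a feel for what these transformations do, the sketch below imitates them on a single string using base R only. The function name clean_text and the five-word stopword list are assumptions made for illustration. tm's own functions and its English stopword list differ in detail, and the final trimming step is an extra that stripWhitespace does not perform.

```r
# Hypothetical base R imitation of the cleaning steps, for one string
clean_text <- function(x, stop_words = c("the", "is", "in", "of", "a")) {
  x <- tolower(x)                                   # lower case
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # drop stopwords
  x <- gsub("[[:digit:]]+", "", x)                  # drop numbers
  x <- gsub("[[:punct:]]+", "", x)                  # drop punctuation
  x <- gsub("\\s+", " ", x)                         # collapse white space
  trimws(x)                                         # trim both ends (extra step)
}

cleaned <- clean_text("The 2016 Summer Olympics is in Rio, Brazil!")
print(cleaned)
# prints "summer olympics rio brazil"
```

Only the content-bearing words survive, which is what the word cloud should be built from.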
Done filtering your text document? Great! You are just one step away from creating your own word cloud.
The final ‘mycorpus’ holds plain text without punctuation, extra white space, numbers and stopwords. Perfect condition to create a word cloud!
wordcloud(mycorpus, min.freq=9, colors=brewer.pal(5, "Dark2"))
The ‘wordcloud()’ function from the ‘wordcloud’ package takes the preprocessed corpus (i.e. mycorpus) and draws a word cloud from the words that occur at least min.freq times (here, 9).
The colors argument of ‘wordcloud()’ makes the cloud colorful; here it receives a five-color ‘Dark2’ palette from brewer.pal().
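If you want to see the word counts that sit behind the picture, one common approach is to compute the frequencies yourself and pass the words and counts to wordcloud() explicitly. The sketch below assumes the three packages are installed and uses three made-up documents; the min.freq value of 2 and the seed are arbitrary choices for this small example.

```r
library(tm)
library(wordcloud)
library(RColorBrewer)

# Hypothetical, already cleaned sample documents
docs <- c("rio olympics medals", "rio summer olympics", "rio brazil")
corp <- Corpus(VectorSource(docs))

# Term-document matrix: one row per term, one column per document
tdm <- TermDocumentMatrix(corp)

# Total count of each term across all documents, most frequent first
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
print(freq)

# Fix the random seed so the layout is reproducible
set.seed(2016)
wordcloud(words = names(freq), freq = freq, min.freq = 2,
          random.order = FALSE, colors = brewer.pal(5, "Dark2"))
```

Printing freq before plotting is a quick way to choose a sensible min.freq: only the words whose count reaches that threshold are drawn.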
Time to derive some insights
- Firstly, the text document is about the Summer Olympics.
- Secondly, Rio seems to be the host city.
- Thirdly, we can say that the document talks about medals, venues and stadiums.
- Since Brazil is the host nation, the document also says something about the Brazilian Olympic community.
A word cloud reveals the essentials by bringing brand names, keywords etc. to the surface. Moreover, it is engaging and sparks the audience’s interest. In fact, just glancing at a word cloud gives you an overall sense of the text. In short, a word cloud is a handy tool for quick visualization.