Data Science Fundamentals: Week 3

Data visualization is quite possibly one of the most important facets of data science. Being able to translate raw data into summary tables, graphs with concluding remarks, and scatter plots that point us to the next steps is a useful skill in any field. Presented appropriately and accurately, data can help us anticipate future patterns or identify metrics that signal success. This week, I explored some of the visualization tools in R, including standard tools such as bar plots, histograms, and scatterplots, as well as more advanced tools (i.e., ggplot2).

Using the iris flower dataset (a widely used dataset built into R), we can plot the distribution of sepal width (sepals are the leaf-like outer parts of the flower) across the three flower species using a boxplot.


I found that organizing the data into a table before calling the boxplot function was helpful.
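For anyone who wants to try this, here is a minimal sketch of what the boxplot step might look like. It is not necessarily the exact code I ran: the summary table built with aggregate(), the variable name sepal_width_by_species, and the titles are all assumptions for illustration.

```r
# iris ships with base R (datasets package), so nothing needs installing.
data(iris)

# Assumption: "organizing into a table" is shown here as a per-species
# summary table of median sepal width; other table layouts would also work.
sepal_width_by_species <- aggregate(Sepal.Width ~ Species, data = iris, FUN = median)
print(sepal_width_by_species)

# Boxplot of sepal width split by species, using the formula interface.
boxplot(Sepal.Width ~ Species, data = iris,
        main = "Sepal width by species",
        xlab = "Species",
        ylab = "Sepal width (cm)")
```

The formula Sepal.Width ~ Species reads as "sepal width grouped by species", so one call draws one box per species.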

We can also customize the graphs by including axis titles, a main title, and colours. I had some difficulty coding colours for my scatter plot comparing sepal length and petal length across the three flower species, so I used different point shapes to represent the different species (using the pch argument). I also included a legend.
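A rough sketch of the shape-per-species approach in base R could look like the following. The particular pch codes (16, 17, 15), the legend position, and the variable names are my own choices for illustration, not anything fixed by R.

```r
data(iris)

# One plotting symbol per species: 16 = filled circle, 17 = filled triangle,
# 15 = filled square. These specific codes are an arbitrary choice.
species_levels <- levels(iris$Species)
point_shapes <- c(16, 17, 15)

# Species is a factor, so its integer codes (1, 2, 3) can index the shapes.
shape_per_row <- point_shapes[as.integer(iris$Species)]

plot(iris$Sepal.Length, iris$Petal.Length,
     pch = shape_per_row,
     main = "Sepal length vs. petal length",
     xlab = "Sepal length (cm)",
     ylab = "Petal length (cm)")

# The legend reuses the same shapes so it matches the points.
legend("topleft", legend = species_levels, pch = point_shapes, title = "Species")
```

The same indexing trick should work for colours: build a vector of three colour names, index it by as.integer(iris$Species), and pass the result as col.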


I also worked with ggplot2 (an R package that provides a variety of graph types and formatting options) and found that it offered more formatting options and produced more refined-looking graphs. I started with a histogram showing the overall frequency of sepal length.
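Here is a minimal sketch of such a histogram, assuming the ggplot2 package is installed (install.packages("ggplot2")). The bin width, fill colour, and labels are illustrative assumptions rather than the settings from my original graph.

```r
library(ggplot2)

# Histogram of sepal length across all 150 flowers.
# binwidth = 0.25 is an assumed value; adjust it to taste.
sepal_hist <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of sepal length",
       x = "Sepal length (cm)",
       y = "Frequency")

print(sepal_hist)
```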


The ggplot code follows a simple pattern that can be reused with other plot types: ggplot(data, aes(...)) + geom_*(), where geom_*() stands for the geom of the chosen plot type (for example, geom_histogram() or geom_point()). Beyond that pattern, there are many more formatting options and plot types to use. For example, I was able to add colours and a legend to my scatterplot using the ggplot function, something that I struggled with earlier.
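As a sketch of the coloured scatterplot, again assuming ggplot2 is installed: mapping Species to colour inside aes() is what makes ggplot2 draw the legend automatically. The point size and labels are my own assumptions.

```r
library(ggplot2)

# Mapping Species to colour (and shape) inside aes() colours the points
# by species and generates the legend without any extra legend code.
species_scatter <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length,
                                    colour = Species, shape = Species)) +
  geom_point(size = 2) +
  labs(title = "Sepal length vs. petal length by species",
       x = "Sepal length (cm)",
       y = "Petal length (cm)")

print(species_scatter)
```

Compared with the base R version, there is no need to build a shape or colour vector by hand or to call a separate legend function.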



Other plot types that I found highly useful for analysis included heatmaps (showcasing gradations) and pairwise plots (analyzing relationships across multiple scatter plots). I would highly recommend this ggplot cheat sheet for all the different plot types and formatting techniques. Overall, I found the visualization tools within R useful and, most importantly, able to handle large data sets (relative to Excel). However, final formatting of the graphs (e.g., colours, titles, axis points) required additional code and was time-consuming.
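For reference, one possible way to produce pairwise scatter plots and a heatmap with base R is sketched below. Using pairs() and heatmap() on a correlation matrix is my assumption for illustration; ggplot2 and other packages offer their own versions.

```r
data(iris)

# Keep only the four numeric measurement columns.
measurements <- iris[, 1:4]

# Pairwise scatter plots: every measurement plotted against every other,
# with points coloured by species (integer codes 1 to 3).
pairs(measurements, col = as.integer(iris$Species),
      main = "Pairwise scatter plots of iris measurements")

# Heatmap of the correlation matrix between the measurements.
correlations <- cor(measurements)
heatmap(correlations, symm = TRUE,
        main = "Correlation between measurements")
```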

