Data Science Fundamentals: Week 2

This week I've been working on loading data sets, performing basic computations and visualizing data in R. I've also been using Jupyter Notebook (via the Anaconda distribution), which I've found extremely useful for printing code output, sharing live code and keeping everything in one window. If you'd like to follow along, I suggest downloading Anaconda (it also bundles Python, Jupyter Notebook and other packages); from the Anaconda command prompt you can launch Jupyter Notebook in your browser.
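Getting R to run inside Jupyter takes one extra step, because Jupyter only ships with a Python kernel: the R kernel has to be registered. Here's a minimal sketch of one common route, using the IRkernel package from a plain R console (IRkernel is an assumed choice; your setup may differ):

```r
# One common way to register R as a Jupyter kernel. Assumes Jupyter is
# already installed (e.g. via Anaconda); IRkernel is an assumed choice here.
install.packages("IRkernel")
IRkernel::installspec()  # makes "R" appear in Jupyter's New-notebook menu
```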

This week I loaded packages into Jupyter and ran some computations on the data sets. I found that opening the data sets in Excel beforehand, to get a feel for their structure before computing on them, was helpful.

The mort06.smpl data set contains mortality information on approximately 20,000 residents. With a data set this size, the attach() function was useful: it places the data frame on R's search path, so its columns can be referenced directly by name in later commands.
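In code, the loading step looks roughly like this; the file's exact format isn't shown here, so read.table() with a header row is an assumption:

```r
# Read the mortality sample; read.table() and header = TRUE are assumptions
# about the format of mort06.smpl.
mort <- read.table("mort06.smpl", header = TRUE)
dim(mort)     # roughly 20,000 rows
attach(mort)  # columns (e.g. Sex, Cause) can now be referenced by name
```

One caveat worth knowing: attach() can mask other objects that share a column's name, so for longer scripts many R users prefer with(mort, ...) or the mort$column syntax; for a quick interactive session it's very convenient.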

Some basic computations, such as the mean and the five-number summary (minimum, first quartile, median, third quartile and maximum), were completed. I found that adding the na.rm=TRUE argument was important: the data set contains cells with NA values, which are then acknowledged but excluded from the final computation.

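As a sketch, with a hypothetical numeric column Age standing in for whichever variable is being summarized:

```r
# "Age" is a hypothetical column name, used purely for illustration.
mean(Age, na.rm = TRUE)     # na.rm = TRUE drops NA cells before averaging
fivenum(Age, na.rm = TRUE)  # minimum, Q1, median, Q3, maximum
summary(Age)                # the five numbers plus the mean and the NA count
```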

You can analyze data through contingency tables (looking for associations between variables) or through proportions. I analyzed the variables Sex and Cause of death, passing the two variables to the tabulation and wrapping the call in print():

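A minimal version of that tabulation; the column names Cause and Sex are assumptions about how the data frame labels these variables:

```r
# Cross-tabulate cause of death against sex (column names are assumed).
tab <- table(Cause, Sex)  # rows = cause of death, columns = sex
print(tab)                # raw frequencies
```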

I can then report frequencies or proportions depending on the analysis; in this case I'm interested in the percentage of females/males who died as a result of accidents. By setting the margin argument to 1 (1 = rows, 2 = columns), I ensure that the proportions are computed within each row (i.e., within each cause of death):

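In code this is most likely prop.table() applied to the table from above; the row label "Accident" is an assumption about how the causes are coded:

```r
# Row-wise proportions with margin = 1: each cause-of-death row sums to 1.
prop.table(tab, margin = 1)

# The same proportions as rounded percentages; the "Accident" row then gives
# the female/male split of accidental deaths ("Accident" is an assumed label).
round(100 * prop.table(tab, margin = 1), 1)
```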

On the whole, I actually enjoyed using Jupyter Notebook (the set-up was a little difficult and downloading the R packages was a little time-consuming). I would still recommend RStudio for day-to-day programming; it's easy to access packages, and the separate panes help with tracking down errors. Next week, we'll take a look at some visualization techniques in R.