Intro To R: 2016

Sunday, April 3, 2016

Module 12 : R Package Creation

This week's module focuses on the various aspects of R package creation, with an assignment to create our very own R package! I am thankful that the video provides a good walkthrough, as well as the supporting resources offered for the week. Following are my experiences, good and bad!

The first step in the process was to create a project and a new package in the R Studio interface. Then I installed two packages, "devtools" and "roxygen2". These packages include developer tools that are a big help when creating R packages. A number of different files and folders were created for functions, help, description, etc.

After the appropriate development-related packages were installed, I got to work creating my function. Since e-commerce marketing is part of my job as ECommerce & Marketing Manager, I thought it would be a good idea to create a package that could calculate stats like click-through rate, conversion rate, cost per conversion and more. Although I named the package "mod12sem", I only created one function, "ctr", which calculates the click-through rate: clicks divided by impressions. Following is the function:

ctr <- function(impressions, clicks) {
  clicks / impressions
}
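As a rough sketch of where the package could go next, the function can carry roxygen2 comments (which devtools::document() turns into help files) and sit alongside two of the other metrics mentioned above. The helpers conversion_rate and cost_per_conversion, their argument names, and their formulas are assumptions made for illustration; they are not part of the mod12sem package as posted.

```r
#' Click-through rate
#'
#' @param impressions Number of times the ad was shown.
#' @param clicks Number of clicks the ad received.
#' @return Clicks divided by impressions.
#' @export
ctr <- function(impressions, clicks) {
  clicks / impressions
}

#' Conversion rate (hypothetical helper, not in the posted package)
#'
#' @param clicks Number of clicks.
#' @param conversions Number of conversions that followed a click.
#' @return Conversions divided by clicks.
#' @export
conversion_rate <- function(clicks, conversions) {
  conversions / clicks
}

#' Cost per conversion (hypothetical helper, not in the posted package)
#'
#' @param cost Total ad spend.
#' @param conversions Number of conversions.
#' @return Cost divided by conversions.
#' @export
cost_per_conversion <- function(cost, conversions) {
  cost / conversions
}

# Example calls with made-up numbers
ctr(impressions = 2000, clicks = 50)              # 0.025
conversion_rate(clicks = 50, conversions = 5)     # 0.1
cost_per_conversion(cost = 100, conversions = 5)  # 20
```

Naming the arguments in each call avoids mixing up their order, which is easy to do when impressions comes before clicks.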

Unfortunately, this is where I began to run into problems. I created my package, created my function, and built and reloaded it. When I attempted to install it, install.packages("mod12sem") returned an error message that stated "package ‘mod12sem’ is not available (for R version 3.2.3)". I spent a great deal of time researching this issue but I was unable to find a solution for it. Nonetheless, I have posted the package to Github. Perhaps someone can tell me what I am doing wrong!
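One likely explanation: when install.packages() is given only a package name, it searches the configured repositories (CRAN by default), and a package that exists only on the local machine will not be found there. A local source package can instead be installed by passing its folder path with repos = NULL and type = "source" (devtools::install() is another route). The sketch below builds a throwaway copy of the package in a temporary folder so it can run anywhere; the DESCRIPTION values are placeholders, not the contents of the real mod12sem package.

```r
# Build a minimal stand-in package in a temporary folder (placeholder metadata)
pkg_dir <- file.path(tempdir(), "mod12sem")
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

writeLines(c(
  "Package: mod12sem",
  "Title: Search Engine Marketing Helpers",
  "Version: 0.0.1",
  "Author: Your Name",
  "Maintainer: Your Name <you@example.com>",
  "Description: Placeholder description for an illustrative copy.",
  "License: GPL-3"
), file.path(pkg_dir, "DESCRIPTION"))

# Export the one function the package contains
writeLines("export(ctr)", file.path(pkg_dir, "NAMESPACE"))
writeLines("ctr <- function(impressions, clicks) clicks / impressions",
           file.path(pkg_dir, "R", "ctr.R"))

# Install from the local folder rather than from CRAN
lib_dir <- file.path(tempdir(), "lib")
dir.create(lib_dir, showWarnings = FALSE)
install.packages(pkg_dir, repos = NULL, type = "source", lib = lib_dir)

# Load the freshly installed package and try the function
library(mod12sem, lib.loc = lib_dir)
ctr(impressions = 1000, clicks = 25)
```

For a real project, pkg_dir would simply be the path to the package folder created in R Studio, and the lib argument could be left out to install into the default library.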

I was also not able to successfully integrate Github with R Studio because some of the steps in the PowerPoint were not available in my version of R Studio. I believe I was able to follow along all the way to the slides that discussed the command prompt, but ultimately failed when attempting git push -u origin master because I was being told that nschubert@mail.usf.edu.mod12 is not a valid repository name. I know that, and my repository name should be https://github.com/kalovast/Mod12 . Unfortunately I cannot seem to change the portion that shows my email address. I was ultimately unsuccessful at getting it onto Github.

Sunday, March 27, 2016

Module 11 : Debugging!

This module is perhaps one of the most helpful to a new programmer. At least for me it is, because my code is usually filled with bugs, and I spend a lot of time working through the code to resolve the errors and get it up and running. The study materials explored a variety of different ways to conduct sound debugging, and I appreciate that because it's good to have options. That said, nothing really seemed to work on the code we were presented in this assignment. It's highly probable that this is because I'm new at R and wasn't entirely successful implementing those debugging steps!

As you're well aware, we started out with the following code:

tukey_multiple <- function(x) {
outliers <- array(TRUE,dim=dim(x))
for (j in 1:ncol(x))
{
outliers[,j] <- outliers[,j] && tukey.outlier(x[,j])
}
outlier.vec <- vector(length=nrow(x))
for (i in 1:nrow(x))
{ outlier.vec[i] <- all(outliers[i,]) } return(outlier.vec) }

I set X equal to 5 and input the code into R Studio. This is the error message I received:

Error: unexpected symbol in:
" for (i in 1:nrow(x))
{ outlier.vec[i] <- all(outliers[i,]) } return"

How disappointing! I actually did a lot of research at this point into the error itself, looking over several entirely unhelpful posts from Stack Overflow. The good news is that in doing that kind of digging/browsing for information, you often learn about other aspects that maybe weren't part of the assignment. I don't mind this phase of research because it sparks my interest and keeps me engaged.

After a while I decided to clean up my code:

tukey_multiple <- function(x) {
  outliers <- array(TRUE, dim = dim(x))
  for (j in 1:ncol(x))
  {
    outliers[, j] <- outliers[, j] && tukey.outlier(x[, j])
  }
  outlier.vec <- vector(length = nrow(x))
  for (i in 1:nrow(x))
  {
    outlier.vec[i] <- all(outliers[i, ])
  }
  return(outlier.vec)
}

This code yields no errors in R Studio, so I can only assume the bug was resolved by making sure everything was on its own line where appropriate. The closing brace of the second loop and return() were sharing a line, and I think this is what caused the error in the first place, since R expects a line break or a semicolon between two statements. If there is something more that should have been done, please let me know! I was sort of expecting to see some kind of result in R Studio other than a simple return without an error message, but I'm not sure what else I may be missing.
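Two further points may be worth checking. Defining a function prints nothing; a result only appears once the function is called, and this one needs a matrix or data frame rather than a single number, because it relies on dim(x). The cleaned-up code also still uses &&, which is meant for single TRUE/FALSE values (recent versions of R raise an error when it is given longer vectors, and older versions silently use only the first element), so the element-wise & is probably what was intended. The sketch below makes that change. Since the course's tukey.outlier() is not shown in this post, the version here is a hypothetical stand-in using the usual 1.5 x IQR rule, and the test matrix is made up.

```r
# Hypothetical stand-in for the course's tukey.outlier():
# flags values more than 1.5 * IQR beyond the quartiles
tukey.outlier <- function(v) {
  q <- unname(quantile(v, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  v < q[1] - fence | v > q[2] + fence
}

tukey_multiple <- function(x) {
  outliers <- array(TRUE, dim = dim(x))
  for (j in seq_len(ncol(x))) {
    # `&` works element by element; `&&` is for single values only
    outliers[, j] <- outliers[, j] & tukey.outlier(x[, j])
  }
  outlier.vec <- vector(length = nrow(x))
  for (i in seq_len(nrow(x))) {
    # A row counts only if it is an outlier in every column
    outlier.vec[i] <- all(outliers[i, ])
  }
  outlier.vec
}

# Made-up test matrix: the last row is extreme in both columns
m <- cbind(a = c(1, 2, 3, 4, 100), b = c(5, 6, 7, 8, 200))
tukey_multiple(m)
```

With this matrix only the fifth row is flagged, because it is the only row that is an outlier in both columns.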

Sunday, March 13, 2016

Module 9 : Visualization, Graphics & R

This latest module in our course focuses on visualizing data sets using base graphics as well as the more complex lattice and ggplot2 packages. There are a lot of really excellent elements within the various packages and methods that allow the user to visualize data in ways that accentuate results or clarify scenarios.

I spent a lot of time reading about and playing around with the code, as well as how the code behaved with different data sets. The data set I decided to use for the submission of this assignment is USPop, which records the population of the United States from 1790 to 2000. Following is a visualization example that I created with each of base graphics, lattice and ggplot2.

Base Graphics

Base graphics consists of the most basic options that are available in the R programming language. I experimented a lot with plot(), and came up with this:

plot(USPop, col="green", type="b", cex=1.5, pch=4)

Apologies for the small size of the image, but making it any larger would have conflicted with blogger's format. You can click the image and save it so that you can view it larger, if you like! I made the following modifications:

Changed color from black to green.
Set point character to X.
Increased plot point size 1.5 times.
Set plot style to points connected by line segments (b).

Lattice Package

The Lattice package is useful because it creates the entire plot at once. It can also display many plot points and handles them more easily than base graphics would. Lattice also falls short in some areas: it can be difficult to manipulate, and you cannot make changes to a lattice plot after it has been created. Lattice did not suit my data set very well, I believe because I only had two variables. As such, I chose the xyplot:

xyplot(population~year, data=USPop, pch="*", cex=3)

I personally found this difficult to use, more so than base, because I don't think my data set was really complex enough to leverage the benefits of lattice.

ggplot2 Package

ggplot2 is more like a design app than either of the other graphical methods available in R. It is excellent, and it gives the user more variety in the changes they can make than either of the other approaches. Following is an example I created using my data set:

ggplot(USPop, aes(year, population))
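On its own, a ggplot() call with aes() only sets up the axes; at least one geom layer has to be added before any data is drawn. The sketch below is a minimal illustration of that, assuming ggplot2 is installed. Because USPop comes from an add-on package, it substitutes R's built-in uspop series (census counts from 1790 to 1970, in millions), converted to a data frame with the same column names. It is a stand-in, not the data set used for the plots in this post.

```r
# Stand-in data: built-in `uspop` time series reshaped into a data frame
pop <- data.frame(
  year       = as.numeric(time(uspop)),
  population = as.numeric(uspop)
)

# Only draw the plot if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(pop, aes(year, population)) +
    geom_line(colour = "green") +   # connect the census years
    geom_point(shape = 4, size = 3) # mark each census with an X
  print(p)
}
```

Layers are added with +, so swapping geom_line() for another geom changes the chart type without touching the data or the axis mapping.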

In my opinion, ggplot2 is the best package to use because it provides the most options for building and customizing visualizations, and it is far and away the most powerful tool in this respect. I would have liked to work longer on this particular portion but I'm afraid I've run out of time! I look forward to utilizing ggplot2 in future assignments and projects.

Sunday, March 6, 2016

Module 8 : Input/Output, String Manipulation & PLYR Package

Module 8 was one of the most interesting areas of focus so far in this course. I really enjoyed learning about how to use data from files, both in terms of input and output. I am beginning to envision different ways in which the R language can be used, and the ability to use files adds a very useful dimension to it.

The first step in this week's assignment was to import the dataset, which was provided as a .txt file initially. We have done this before, and it wasn't much of a challenge! I also installed the "plyr" package, which is used for working with groups of information in a larger set, among other things.

I then used plyr to generate the mean of both Age and Grade, split by gender. After this, I output the data from y into a file called Sorted_Average:
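A minimal sketch of this step, assuming plyr is installed. The data frame is a made-up stand-in for the assignment's file (which is not reproduced in this post), the column names Name, Age, Sex and Grade are assumptions, and the output goes to a temporary folder.

```r
library(plyr)  # assumes install.packages("plyr") has already been run

# Made-up stand-in for the assignment's data set
students <- data.frame(
  Name  = c("Alice", "Brian", "Irene", "Mona", "Oscar", "Tom"),
  Age   = c(21, 20, 22, 23, 24, 25),
  Sex   = c("Female", "Male", "Female", "Female", "Male", "Male"),
  Grade = c(80, 70, 85, 90, 90, 92),
  stringsAsFactors = FALSE
)

# Split by Sex, then compute the mean Age and mean Grade for each group
y <- ddply(students, "Sex", summarise,
           Age.Average   = mean(Age),
           Grade.Average = mean(Grade))
y

# Write the summary to a tab-separated text file
out_file <- file.path(tempdir(), "Sorted_Average.txt")
write.table(y, out_file, sep = "\t", row.names = FALSE)
```

ddply() takes a data frame, splits it on the named column, applies the summary to each piece, and returns one data frame, which is why y can go straight to write.table().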

In order to make a .csv file, I made the following adjustment:

The final phase of the assignment was to output only those names that included the letter i or I, and then to output that to a file:
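A minimal sketch of the filtering and the comma-separated output, using base R only. The names are made up and the column name Name is an assumption; the file is written to a temporary folder.

```r
# Made-up stand-in for the assignment's data set
students <- data.frame(
  Name  = c("Alice", "Brian", "Irene", "Mona", "Oscar", "Tom"),
  Grade = c(80, 70, 85, 90, 90, 92),
  stringsAsFactors = FALSE
)

# Keep rows whose Name contains "i" in either case
i_students <- subset(students, grepl("i", Name, ignore.case = TRUE))
i_students$Name

# Setting sep = "," is what turns write.table() output into a .csv file
csv_file <- file.path(tempdir(), "i_students.csv")
write.table(i_students, csv_file, sep = ",", row.names = FALSE)

# Read it back to confirm the file is valid CSV
read.csv(csv_file)
```

grepl() returns TRUE or FALSE for each name, so it can be used directly as a row filter; ignore.case = TRUE covers both "i" and "I" in one pattern.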

Saturday, February 27, 2016

Module 7 : R Objects

I came into this week feeling very optimistic about the coursework, due in large part to the fact that I’m familiar with object oriented programming from C++ and Java courses in past semesters. This doesn’t make the assignment any less challenging, mind you, but at least I have a more firm grasp on these concepts and can hit the ground running, so to speak. On to the assignment!

I chose to use a dataset called “discoveries” from the datasets package that consists of the yearly numbers of important discoveries from 1860 to 1959:

The base type of the 'discoveries' object is double, and this is easily determined by typeof(discoveries). I also tried some other datasets in my environment that were defined as integers and typeof(a) (for example) confirmed that.

The second step in our assignment addressed generic functions, which are functions that dispatch to a specific method depending on the class of the object they are given. Examples of generic functions in R include plot, mean, residuals, predict, summary and others. I chose to determine whether a generic function could be assigned to my "discoveries" dataset by using the plot function, plot(discoveries):

I also attempted some others such as summary(discoveries):

I had success with a variety of generic functions, but I also tried plenty that did not work, such as logLik and predict which returned errors stating:

Error in UseMethod("predict/logLik") :

no applicable method for 'predict/logLik' applied to an object of class "ts"
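The error comes from how S3 dispatch works: a generic looks for a method matching the object's class ("ts" here) and then for a default method, and stops with this message if it finds neither. A small sketch that checks which methods exist, using only functions from base R and the built-in discoveries data:

```r
# The class that S3 dispatch will use
class(discoveries)

# plot() has a method registered for "ts" objects, so plot(discoveries) works
has_plot_ts <- is.function(getS3method("plot", "ts"))

# predict() has no "ts" method; optional = TRUE returns NULL instead of an error
no_predict_ts <- is.null(getS3method("predict", "ts", optional = TRUE))

has_plot_ts
no_predict_ts

# summary() still works because it can fall back on a default method
summary(discoveries)
```

methods(class = "ts") lists every method registered for the class, which is a quick way to see in advance which generics will work.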

The final step of the assignment is to determine whether S3 or S4 can be assigned to the dataset I chose. S3 and S4 are two object systems that are used in the R programming language. S3 objects are informal and more interactive than S4 objects, which are more rigorous. The way to determine whether an object uses S4 is the isS4() function. For example, isS4(discoveries) returns FALSE, which means the dataset is not an S4 object and S3 can be applied.
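The checks described in this post can be collected in one short sketch. It uses only base R and the built-in discoveries data, so nothing here is assumed beyond a standard R installation.

```r
# Base type: how the values are stored internally
base_type <- typeof(discoveries)

# Class attribute: what S3 dispatch looks at
obj_class <- class(discoveries)

# FALSE means the object does not belong to the S4 system
uses_s4 <- isS4(discoveries)

# TRUE means the object carries a class attribute, as S3 objects do
has_class <- is.object(discoveries)

base_type   # "double"
obj_class   # "ts"
uses_s4     # FALSE
has_class   # TRUE
```

An object with a class attribute that is not S4 is handled by S3, which matches the conclusion above.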

Saturday, February 20, 2016

Module 6 : Math & Simulations II

Time is really moving fast this semester, isn't it? We are already submitting our Module 6 assignment, part two of a block focused on math and simulations. This module provided a lot of insight and instruction about transposing a matrix, multiplying it by a vector, finding the inverse of a matrix, and finding its determinant as well. These are some challenging concepts to grasp for some of us.

Instead of using 6 for nrow, I went with 10 for both the A and B matrices. In my opinion, the fewer conflicts the better. Transposing a matrix is very easy: t(matrix). Inputting the command t(A) output a matrix with 10 rows and 10 columns, cells numbering 1 to 100. Inputting the command t(B) output a matrix with 100 rows and 10 columns, cells numbering 1 to 1,000.
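A minimal sketch of the transposition step, assuming A and B were built with matrix() from 1:100 and 1:1000 as described:

```r
# 100 values in 10 rows gives a 10 x 10 matrix
A <- matrix(1:100, nrow = 10)

# 1000 values in 10 rows gives a 10 x 100 matrix
B <- matrix(1:1000, nrow = 10)

# t() swaps rows and columns
dim(t(A))  # 10 10
dim(t(B))  # 100 10

# matrix() fills column by column, so the first column of A becomes the first row of t(A)
t(A)[1, ]
```

Checking dim() before and after is a quick way to confirm the transpose did what was expected.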

Multiplying the matrix by a vector was the next step, and I needed to create a vector. I did so easily and multiplied: X = a*A, Y = b*B. I also created Z = a*B and displayed that as well to evaluate how it differed. After this, I went back to nrow=6 for both 1:100 and 1:1000. I then reassigned a to 1:17 and b to 1:167 and used %*% to multiply a against A, then B against b.
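The two operators used here do different things: * multiplies element by element (recycling the vector down the columns), while %*% is the matrix product and needs the vector's length to match the number of columns. With nrow = 6 the 100 values do not fill the matrix evenly and R warns about it, so this sketch sticks to the 10-row matrix; the vector a is an assumed example.

```r
A <- matrix(1:100, nrow = 10)  # 10 x 10
a <- 1:10

# Element-wise: every entry in row i is multiplied by a[i]
X <- a * A
X[2, 3]      # 2 * A[2, 3] = 2 * 22 = 44

# Matrix-vector product: each row of A is combined with a into one number
P <- A %*% a
dim(P)       # 10 1
P[1, 1]      # same as sum(A[1, ] * a)
```

If the lengths do not line up, %*% stops with a "non-conformable arguments" error, whereas * quietly recycles, so %*% is the safer check that the dimensions make sense.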

The next step in the assignment was to invert the matrix. I changed the A matrix to 1:4 with nrow=2 and was able to invert it using solve(A). It clearly became inverted! I then created a matrix using runif to generate random numbers ranging from 0 to 50, with 25 as the midpoint. I then found the determinant by using det(A).
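A short sketch of both steps. The 2 x 2 matrix follows the description above; the size of the random matrix (5 x 5) and the seed are assumptions, since the post does not state them.

```r
# 2 x 2 matrix filled column by column: rows are (1, 3) and (2, 4)
A <- matrix(1:4, nrow = 2)

# solve() with a single argument returns the inverse
A_inv <- solve(A)
A_inv

# Multiplying a matrix by its inverse gives the identity matrix
A %*% A_inv

# Determinant: 1 * 4 - 3 * 2 = -2
det(A)

# Random matrix with entries drawn uniformly between 0 and 50
set.seed(1)  # arbitrary seed so the run is repeatable
R <- matrix(runif(25, min = 0, max = 50), nrow = 5)
det(R)
```

A matrix can only be inverted when its determinant is not zero, so det() is a useful check before calling solve().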

R can be pretty simple if you know your equations and can wrap your head around these concepts. It's not always easy, but I am not finding it to be quite as difficult for me to understand as C++ or Java, for example. I'm looking forward to more!

Sunday, February 14, 2016

Module 5 : Math and Simulations

This was a very challenging and interesting assignment for me. I am familiar with entering data sets now, and so I was very comfortable doing that. I created a side by side box plot to represent the data:

The left-most boxplot represents the first assessment of blood pressure levels. The middle boxplot represents the second assessment of blood pressure levels. The right-most boxplot represents the final decision. It is very interesting to see how the data sets are transformed into visual representations, and it allows us to see them in a different way and perhaps gain a little better insight into what it actually means.

I also created histograms for this module assignment as follows:

I am a little less clear on this portion of the assignment, and I feel like I may have done something wrong here. I was looking for a way to combine these into one histogram rather than have four separate visualizations. The histogram that depicts blood pressure makes the most sense to me because it clearly shows the distribution of blood pressures according to frequency.

Sunday, January 31, 2016

Module 3 : Data.frame and Much More!

This was quite a difficult module for me! I didn't have problems understanding matrices, data tables, lists or the other topics discussed in the reading, but I did have some problems running the procedure in the assignment tutorial using the following as a data source:

Name <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hilary", "Bernie")
ABCP <- c(4, 62, 51, 21, 2, 14, 15)
CBSP <- c(12, 75, 43, 19, 1, 21, 19)

Following is what I was able to complete:

As you can see, I was able to create the data frame and display it. Looks nice! I started to have problems from this point down, where I am entering mean(results.df) and receiving the warning message mentioned in the tutorial. I am not sure how to avoid this, unfortunately.
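One probable cause of the warning: mean() expects a numeric vector, and a whole data frame (especially one with a text column such as Name) is not one. A minimal sketch of two ways around it, using the polling numbers from this post:

```r
Name <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hilary", "Bernie")
ABCP <- c(4, 62, 51, 21, 2, 14, 15)
CBSP <- c(12, 75, 43, 19, 1, 21, 19)

results.df <- data.frame(Name, ABCP, CBSP)

# Option 1: take the mean of one numeric column at a time
mean(results.df$ABCP)
mean(results.df$CBSP)

# Option 2: take the means of all numeric columns in one call
poll_means <- colMeans(results.df[, c("ABCP", "CBSP")])
poll_means
```

summary(results.df) is another option, since it reports the mean of each numeric column and skips the text column.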

From this part on, I am really not sure where the data is coming from. Nothing had been created for C previously and so R Studio is not recognizing it as an object. Further, I am not sure how I would apply that to the polling data we were given for this assignment. I do certainly understand the concepts behind the work, both from the book and in the assignment itself, but I would love some input or insight from anyone who has a better idea.

Sunday, January 24, 2016

Module 2 : Objects, Functions and Vectors

R is a very new programming language to me, and to be honest, I’m a little nervous about the course because I have not fared too well in C++ and Java alike. I really love statistics (also didn’t do great in that course) and I love big data. As such, the notion of learning and understanding R programming language is very attractive for me in the E-Commerce & Internet Marketing field where I am currently, and plan to be for many years to come. On its face, R doesn’t look to be too overwhelming in terms of understanding how the language and syntax works, as well as its capabilities.

Module #1 has been good to me so far! I was able to follow the assigned tutorial with relative ease, although I did have problems creating a scatter plot of a data set using plot(x = s$age_husband, y = s$age_wife, type = 'p') due to an error that the s object could not be found. I am not sure how the writer was able to accomplish this and I would welcome any comments that can provide me with the necessary insight. I thought the transformation of data was pretty smooth using the R interface, and I took some time to play around with different equations, based initially off of the equations presented in the tutorial.
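An "object not found" error usually means the object was never created in the current session; presumably the tutorial loaded its data into s in an earlier step. The sketch below shows the idea with a made-up data frame. The values are invented for illustration, and only the column names come from the plot() call above.

```r
# Made-up ages standing in for the tutorial's data set
s <- data.frame(
  age_husband = c(25, 31, 42, 38, 55),
  age_wife    = c(23, 30, 39, 40, 52)
)

# Once `s` exists, the scatter plot call works as written
plot(x = s$age_husband, y = s$age_wife, type = "p")
```

Running ls() lists the objects that currently exist, which is a quick way to check whether s has been created yet.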

This Module covered, in large part, Objects, Functions and Vectors. Data can be stored in Objects, and then when the Object is called, it is replaced with the data that is saved inside. Functions are types of procedures or routines, and are especially important in R because the language has so many pre-existing functions to execute complex tasks. Common examples of functions include mathematical functions that determine the mean, standard deviation, sum and others. Vectors are defined as a sequence of data elements of the same basic type, which is a lot like a string of common elements.
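A few lines are enough to see all three ideas at once. The numbers below are made up for illustration.

```r
# An object: the name `ages` now stands for the vector stored in it
ages <- c(23, 35, 41, 29)
ages

# Built-in functions applied to the vector
mean(ages)  # 32
sum(ages)   # 128
sd(ages)

# A vector holds one basic type, so mixing types converts everything
mixed <- c(1, "two", 3)
typeof(mixed)  # "character"
```

The last two lines show the "same basic type" rule in action: the numbers are turned into text so that every element of the vector has the same type.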