Subsetting Data in R - Andrew Jaffe

Subsetting data in R
John Muschelli
January 5, 2016

Overview

We showed one way to read data into R. In this module, we will show you how to:

1. Select specific elements of an object by an index or logical condition
2. Rename columns of a data.frame
3. Subset rows of a data.frame
4. Subset columns of a data.frame
5. Add/remove new columns to a data.frame
6. Order the columns of a data.frame
7. Order the rows of a data.frame

We will show you how to do each operation in base R, then show you how to use the dplyr package to do the same operation (if applicable).

Many resources on how to use dplyr exist and are straightforward.

Select specific elements using an index

Often you only want to look at subsets of a data set at any given time. As a review, elements of an R object are selected using the brackets ([ and ]).

For example, x is a vector of numbers and we can select the second element of x using the brackets and an index (2):

    x = c(1, 4, 2, 8, 10)
    x[2]
    [1] 4

Select specific elements using an index

We can select the fifth element, or the second AND fifth elements, below:

    x = c(1, 2, 4, 8, 10)
    x[5]
    [1] 10
    x[c(2,5)]
    [1]  2 10

Subsetting by deletion of entries

You can put a minus (-) before integers inside brackets to remove the elements at these indices:

    x[-2] # all but the second
    [1]  1  4  8 10

Note that you have to be careful with this syntax when dropping more than 1 element:

    x[-c(1,2,3)] # drop first 3
    [1]  8 10
    # x[-1:3] # shorthand. R sees as -1 to 3
    x[-(1:3)] # needs parentheses
    [1]  8 10

Select specific elements using logical operators

What about selecting rows based on the values of two variables?

We use logical statements. Here we select only elements of x greater than 2:

    x
    [1]  1  2  4  8 10
    x > 2
    [1] FALSE FALSE  TRUE  TRUE  TRUE
    x[ x > 2 ]
    [1]  4  8 10

Select specific elements using logical operators

You can have multiple logical conditions using the following:

- & : AND
- | : OR

    x[ x > 2 & x < 5 ]
    [1] 4
    x[ x > 5 | x == 2 ]
    [1]  2  8 10

which function

The which function takes in a logical vector and returns the indices of the elements where the logical value is TRUE.

    which(x > 5 | x == 2) # returns index
    [1] 2 4 5
    x[ which(x > 5 | x == 2) ]
    [1]  2  8 10
    x[ x > 5 | x == 2 ]
    [1]  2  8 10

Creating a data.frame to work with

Here we create a toy data.frame named df using random data:

    set.seed(2016) # reproducibility
    df = data.frame(x = c(1, 2, 4, 10, 10),
                    x2 = rpois(5, 10),
                    y = rnorm(5),
                    z = rpois(5, 6))

Renaming Columns of a data.frame: base R

We can use the colnames function to directly reassign column names of df.

    colnames(df) = c("x", "X", "y", "z")
    head(df)
       x  X   y  z
    1  1  7 ...  6
    2  2  6 ...  4
    3  4 10 ...  7
    4 10 13 ... 10
    5 10 13 ...  5
    colnames(df) = c("x", "x2", "y", "z") # reset

Renaming Columns of a data.frame: base R

We can assign the column names, change the ones we want, and then re-assign the column names:

    cn = colnames(df)
    cn[ cn == "x2"] = "X"
    colnames(df) = cn
    head(df)
       x  X   y  z
    1  1  7 ...  6
    2  2  6 ...  4
    3  4 10 ...  7
    4 10 13 ... 10
    5 10 13 ...  5
    colnames(df) = c("x", "x2", "y", "z") # reset

Renaming Columns of a data.frame: dplyr

    library(dplyr)

Note, when loading dplyr, it says objects can be "masked". That means if you use a function defined in 2 places, it uses the one that is loaded last.

Renaming Columns of a data.frame: dplyr

For example, if we print filter, then we see at the bottom namespace:dplyr, which means when you type filter, it will use the one from dplyr.

    filter
    function (.data, ...)
    {
        filter_(.data, .dots = lazyeval::lazy_dots(...))
    }
    <environment: namespace:dplyr>

Renaming Columns of a data.frame: dplyr

A filter function exists by default in the stats package. If you want to make sure you use that one, you use PackageName::Function with the colon-colon ("::") operator.

    head(stats::filter, 2)
    1 function (x, filter, method = c("convolution", "recursive"),
    2     sides = 2L, circular = FALSE, init = NULL)

This is important when loading many packages, because you may have some conflicts/masking.

Renaming Columns of a data.frame: dplyr

To rename columns in dplyr, you use the rename command:

    df = dplyr::rename(df, X = x2)
    head(df)
       x  X   y  z
    1  1  7 ...  6
    2  2  6 ...  4
    3  4 10 ...  7
    4 10 13 ... 10
    5 10 13 ...  5
    df = dplyr::rename(df, x2 = X) # reset

Subset columns of a data.frame

We can grab the x column using the $ operator.

    df$x
    [1]  1  2  4 10 10

Subset columns of a data.frame

We can also subset a data.frame using bracket [, ] subsetting. For data.frames and matrices (2-dimensional objects), the brackets are [rows, columns]. We can grab the x column using the index of the column or the column name ("x"):

    df[, 1]
    [1]  1  2  4 10 10
    df[, "x"]
    [1]  1  2  4 10 10

Subset columns of a data.frame

We can select multiple columns using multiple column names:

    df[, c("x", "y")]
       x   y
    1  1 ...
    2  2 ...
    3  4 ...
    4 10 ...
    5 10 ...

Select columns of a data.frame: dplyr

The select command from dplyr allows you to subset columns of a data.frame:

    select(df, x)
       x
    1  1
    2  2
    3  4
    4 10
    5 10

Select columns of a data.frame: dplyr

select also accepts several column names, or a helper such as starts_with:

    select(df, x, x2)
       x x2
    1  1  7
    2  2  6
    3  4 10
    4 10 13
    5 10 13
    select(df, starts_with("x"))
       x x2
    1  1  7
    2  2  6
    3  4 10
    4 10 13
    5 10 13

Subset rows of a data.frame with indices

Let's select rows 1 and 3 from df using brackets.

    df[c(1, 3), ]
      x x2   y z
    1 1  7 ... 6
    3 4 10 ... 7

Subset rows of a data.frame

Let's select the rows of df where the x column is greater than 5 or is equal to 2. Without any index for columns, all columns are returned:

    df[ df$x > 5 | df$x == 2, ]
       x x2   y  z
    2  2  6 ...  4
    4 10 13 ... 10
    5 10 13 ...  5

Subset rows of a data.frame

We can subset both rows and columns at the same time:

    df[ df$x > 5 | df$x == 2, c("y", "z")]
        y  z
    2 ...  4
    4 ... 10
    5 ...  5

Subset rows of a data.frame: dplyr

The command in dplyr for subsetting rows is filter. Try ?filter

    filter(df, x > 5 | x == 2)
       x x2   y  z
    1  2  6 ...  4
    2 10 13 ... 10
    3 10 13 ...  5

Note, no $ or bracket subsetting is necessary. R "knows" x refers to a column of df.

Subset rows of a data.frame: dplyr

You can separate conditions by commas, and by default filter assumes these statements are joined by &:

    filter(df, x > 2 & y < 0)
      x x2   y z
    1 4 10 ... 7
    filter(df, x > 2, y < 0)
      x x2   y z
    1 4 10 ... 7

Combining filter and select

You can combine filter and select to subset the rows and columns, respectively, of a data.frame:

    select(filter(df, x > 2 & y < 0), y, z)
        y z
    1 ... 7

In R, the common way to perform multiple operations is to wrap functions around each other in a nested way, such as above.

Assigning Temporary Objects

One can also create temporary objects and reassign them.

    df2 = filter(df, x > 2 & y < 0)
    df2 = select(df2, y, z)

Piping - a new concept

There is another (newer) way of performing these operations, called "piping". It is becoming more popular as it's easier to read:

    df %>% filter(x > 2 & y < 0) %>% select(y, z)
        y z
    1 ... 7

It is read: "take df, then filter the rows and then select y, z".

Adding new columns to a data.frame: base R

You can add a new column, called newcol, to df using the $ operator:

    df$newcol = 5:1
    df$newcol = df$x + 2

Removing columns from a data.frame: base R

You can remove a column by assigning NULL to it:

    df$newcol = NULL

or by selecting only the columns that are not newcol:

    df = df[, colnames(df) != "newcol"]

Adding new columns to a data.frame: base R

You can also column-bind a vector (or series of vectors) to a data.frame, using the cbind command:

    cbind(df, newcol = 5:1)
       x x2   y  z newcol
    1  1  7 ...  6      5
    2  2  6 ...  4      4
    3  4 10 ...  7      3
    4 10 13 ... 10      2
    5 10 13 ...  5      1

Adding columns to a data.frame: dplyr

The mutate function in dplyr allows you to add or replace columns of a data.frame:

    mutate(df, newcol = 5:1)
       x x2   y  z newcol
    1  1  7 ...  6      5
    2  2  6 ...  4      4
    3  4 10 ...  7      3
    4 10 13 ... 10      2
    5 10 13 ...  5      1
    print({df = mutate(df, newcol = x + 2)})
       x x2   y  z newcol
    1  1  7 ...  6      3
    2  2  6 ...  4      4
    3  4 10 ...  7      6
    4 10 13 ... 10     12
    5 10 13 ...  5     12

Removing columns from a data.frame: dplyr

The NULL method still works. With select, you can also remove a column with a minus (-), much like removing rows:

    select(df, -newcol)
       x x2   y  z
    1  1  7 ...  6
    2  2  6 ...  4
    3  4 10 ...  7
    4 10 13 ... 10
    5 10 13 ...  5

Removing columns from a data.frame: dplyr

Remove newcol and y:

    select(df, -one_of("newcol", "y"))
       x x2  z
    1  1  7  6
    2  2  6  4
    3  4 10  7
    4 10 13 10
    5 10 13  5

Ordering the columns of a data.frame: base R

We can use the colnames function to get the column names of df and then put newcol first by subsetting df using brackets:

    cn = colnames(df)
    df[, c("newcol", cn[cn != "newcol"]) ]
      newcol  x x2   y  z
    1      3  1  7 ...  6
    2      4  2  6 ...  4
    3      6  4 10 ...  7
    4     12 10 13 ... 10
    5     12 10 13 ...  5

Ordering the columns of a data.frame: dplyr

The select function can reorder columns.

Put newcol first, then select the rest of the columns:

    select(df, newcol, everything())
      newcol  x x2   y  z
    1      3  1  7 ...  6
    2      4  2  6 ...  4
    3      6  4 10 ...  7
    4     12 10 13 ... 10
    5     12 10 13 ...  5

Ordering the rows of a data.frame: base R

We use the order function on a vector or set of vectors; by default it sorts in increasing order:

    df[ order(df$x), ]
       x x2   y  z newcol
    1  1  7 ...  6      3
    2  2  6 ...  4      4
    3  4 10 ...  7      6
    4 10 13 ... 10     12
    5 10 13 ...  5     12

Ordering the rows of a data.frame: base R

The decreasing argument will order it in decreasing order:

    df[ order(df$x, decreasing = TRUE), ]
       x x2   y  z newcol
    4 10 13 ... 10     12
    5 10 13 ...  5     12
    3  4 10 ...  7      6
    2  2  6 ...  4      4
    1  1  7 ...  6      3

Ordering the rows of a data.frame: base R

You can pass multiple vectors, and must use the negative (using -) to mix decreasing and increasing orderings (here, sort increasing on x and decreasing on y):

    df[ order(df$x, -df$y), ]
       x x2   y  z newcol
    1  1  7 ...  6      3
    2  2  6 ...  4      4
    3  4 10 ...  7      6
    4 10 13 ... 10     12
    5 10 13 ...  5     12

Ordering the rows of a data.frame: dplyr

The arrange function can reorder rows. By default, arrange orders in ascending order.
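A minimal sketch of arrange, assuming dplyr is installed. The data frame toy and its values are hypothetical (they are not the random draws used for df above), and desc() is dplyr's helper for sorting a column in descending order.

```r
library(dplyr)

# Hypothetical toy data; values chosen by hand for illustration only
toy = data.frame(x = c(1, 2, 4, 10, 10),
                 y = c(0.5, -1.2, -0.3, 0.7, 2.1))

# Ascending order on x (the default behaviour of arrange)
asc = arrange(toy, x)

# Descending order on x, using desc()
dsc = arrange(toy, desc(x))

# Mixed ordering: ascending on x, ties on x broken by descending y
mixed = arrange(toy, x, desc(y))

print(asc)
print(dsc)
print(mixed)
```

One reason to prefer desc() over a unary minus is that it also works on character columns, where -column is not defined.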
