VOOZH about

URL: https://www.javacodegeeks.com/2014/12/r-vectorising-all-the-things.html

⇱ R: Vectorising all the things


After my last post about finding the distance a date/time is from the weekend Hadley Wickham suggested I could improve the function by vectorising it…

 
 
 
 
 
 
 
 

@markhneedham vectorise with pmin(pmax(dateToLookup – before, 0), pmax(after – dateToLookup, 0)) / dhours(1)

— Hadley Wickham (@hadleywickham) December 14, 2014

…so I thought I’d try and vectorise some of the other functions I’ve written recently and show the two versions.

I found the following articles useful for explaining vectorisation and why you might want to do it:

Let’s get started.

Distance from the weekend

We want to find out how many hours away from the weekend i.e. nearest Saturday/Sunday a particular date/time is. We’ll be using the following libraries and set of date/times:

library(dplyr)
library(lubridate)
library(geosphere)
options("scipen"=100, "digits"=4)
 
times = ymd_hms("2002-01-01 17:00:00") + c(0:99) * hours(1)
data = data.frame(time = times)
> data %>% head()
 time
1 2002-01-01 17:00:00
2 2002-01-01 18:00:00
3 2002-01-01 19:00:00
4 2002-01-01 20:00:00
5 2002-01-01 21:00:00
6 2002-01-01 22:00:00

Let’s have a look at the non vectorised version first:

distanceFromWeekend = function(dateToLookup) {
 before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
 after = ceiling_date(dateToLookup, "week") - days(1)
 timeToBefore = dateToLookup - before
 timeToAfter = after - dateToLookup
 
 if(timeToBefore < 0 || timeToAfter < 0) {
 0 
 } else {
 if(timeToBefore < timeToAfter) {
 timeToBefore / dhours(1)
 } else {
 timeToAfter / dhours(1)
 }
 }
}

Now let’s run it against our data frame:

> system.time(
 data %>% mutate(ind = row_number()) %>% group_by(ind) %>% mutate(dist = distanceFromWeekend(time)) 
 )
 user system elapsed 
 1.837 0.020 1.884

And now for Hadley’s vectorised version:

distanceFromWeekendVectorised = function(dateToLookup) {
 before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
 after = ceiling_date(dateToLookup, "week") - days(1)
 pmin(pmax(dateToLookup - before, 0), pmax(after - dateToLookup, 0)) / dhours(1)
}
 
> system.time(data %>% mutate(dist = distanceFromWeekendVectorised(time)))
 user system elapsed 
 0.020 0.001 0.023

Extracting start date

My next example was from cleaning up Google Trends data and extracting the start date from a cell inside a CSV file.

We’ll use this data frame:

googleTrends = read.csv("/Users/markneedham/Downloads/report.csv", row.names=NULL)
names(googleTrends) = c("week", "score")
> googleTrends %>% head(10)
 week score
1 Worldwide; 2004 - present 
2 Interest over time 
3 Week neo4j
4 2004-01-04 - 2004-01-10 0
5 2004-01-11 - 2004-01-17 0
6 2004-01-18 - 2004-01-24 0
7 2004-01-25 - 2004-01-31 0
8 2004-02-01 - 2004-02-07 0
9 2004-02-08 - 2004-02-14 0
10 2004-02-15 - 2004-02-21 0

The non vectorised version looked like this:

> system.time(
 googleTrends %>% 
 mutate(ind = row_number()) %>% 
 group_by(ind) %>%
 mutate(dates = strsplit(week, " - "),
 start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character())
 )
 user system elapsed 
 0.215 0.000 0.214

In this case it’s actually not possible to vectorise the code using the strsplit so we need to use something else. Antonios showed me how to do so using substr:

> system.time(googleTrends %>% mutate(start = substr(week, 1, 10) %>% ymd()))
 user system elapsed 
 0.018 0.000 0.017

Calculating haversine distance

I wanted to work out the great circular distance from a collection of venues to a centre point in London. I started out with this data frame:

centre = c(-0.129581, 51.516578)
venues = read.csv("/tmp/venues.csv")
 
> venues %>% head()
 venue lat lon
1 Skills Matter 51.52 -0.09911
2 Skinkers 51.50 -0.08387
3 Theodore Bullfrog 51.51 -0.12375
4 The Skills Matter eXchange 51.52 -0.09923
5 The Guardian 51.53 -0.12234
6 White Bear Yard 51.52 -0.10980

My non vectorised version looked like this:

> system.time(venues %>% 
 mutate(distanceFromCentre = by(venues, 1:nrow(venues), function(row) { distHaversine(c(row$lon, row$lat), centre) }))
 )
 user system elapsed 
 0.034 0.000 0.033

It’s pretty quick but we can do better – the distHaversine function allows us to calculate multiple distances if the first argument ot it is a matrix of lon/lat values rather than a vector:

> system.time(
 venues %>% mutate(distanceFromCentre = distHaversine(cbind(venues$lon, venues$lat), centre))
 )
 user system elapsed 
 0.001 0.000 0.001

One I can’t figure out…

And finally I have a function which I can’t figure out how to vectorise but maybe someone with more R skillz than me can?

I have a data frame containing the cumulative member counts of various NoSQL London groups:

cumulativeMeetupMembers = read.csv("/tmp/cumulativeMeetupMembers.csv")
> cumulativeMeetupMembers %>% sample_n(10)
 g.name dayMonthYear n
4734 Hadoop Users Group UK 2013-10-26 1144
4668 Hadoop Users Group UK 2013-08-03 979
4936 Hadoop Users Group UK 2014-07-31 1644
5150 Hive London 2012-10-15 109
8020 Neo4j - London User Group 2014-03-15 826
7666 Neo4j - London User Group 2012-08-06 78
1030 Big Data London 2013-03-01 1416
6500 London MongoDB User Group 2013-09-21 952
8290 Oracle Big Data 4 the Enterprise 2012-06-04 61
2584 Data Science London 2012-03-20 285

And I want to find out the number of members for a group on a specific date. e.g. given the following data…

> cumulativeMeetupMembers %>% head(10)
 g.name dayMonthYear n
1 Big Data / Data Science / Data Analytics Jobs 2013-01-29 1
2 Big Data / Data Science / Data Analytics Jobs 2013-02-06 15
3 Big Data / Data Science / Data Analytics Jobs 2013-02-07 28
4 Big Data / Data Science / Data Analytics Jobs 2013-02-10 31
5 Big Data / Data Science / Data Analytics Jobs 2013-02-18 33
6 Big Data / Data Science / Data Analytics Jobs 2013-03-27 38
7 Big Data / Data Science / Data Analytics Jobs 2013-04-16 41
8 Big Data / Data Science / Data Analytics Jobs 2013-07-17 53
9 Big Data / Data Science / Data Analytics Jobs 2013-08-28 58
10 Big Data / Data Science / Data Analytics Jobs 2013-11-11 63

…the number of members for the ‘Big Data / Data Science / Data Analytics Jobs’ group on the 10th November 2013 should be 58.

I created this data frame of groups and random dates:

dates = ymd("2014-09-01") + c(0:9) * weeks(1)
groups = cumulativeMeetupMembers %>% distinct(g.name) %>% select(g.name)
 
groupsOnDate = merge(dates, groups)
names(groupsOnDate) = c('date', 'name')
 
> groupsOnDate %>% sample_n(10)
 date name
156 2014-10-06 GridGain London
153 2014-09-15 GridGain London
70 2014-11-03 Couchbase London
185 2014-09-29 Hadoop Users Group UK
105 2014-09-29 Data Science London
137 2014-10-13 Equal Experts Technical Meetup Group
360 2014-11-03 Scale Warriors of London
82 2014-09-08 Data Science & Business Analytics London Meetup
233 2014-09-15 London ElasticSearch User Group
84 2014-09-22 Data Science & Business Analytics London Meetup

The non vectorised version looks like this:

memberCount = function(meetupMembers) {
 function(groupName, date) {
 (meetupMembers %>% 
 filter(g.name == groupName & dayMonthYear < date) %>% do(tail(., 1)))$n 
 } 
} 
 
findMemberCount = memberCount(cumulativeMeetupMembers)
 
> system.time(groupsOnDate %>% mutate(groupMembers = by(groupsOnDate, 1:nrow(groupsOnDate), function(row) { 
 findMemberCount(row$name, as.character(row$date))
 }) %>% 
 cbind() %>% 
 as.vector() ))
 user system elapsed 
 2.259 0.005 2.269

The output looks like this:

date name groupMembers
116 2014-10-06 DeNormalised London 157
322 2014-09-08 OpenCredo Tech Workshops 7
71 2014-09-01 Data Enthusiasts London 
233 2014-09-15 London ElasticSearch User Group 614
171 2014-09-01 HPC & GPU Supercomputing Group of London 80
109 2014-10-27 Data Science London 3632
20 2014-11-03 Big Data Developers in London 708
42 2014-09-08 Big Data Week London Meetup 96
127 2014-10-13 Enterprise Search London Meetup 575
409 2014-10-27 Women in Data 548

I’ve tried many different approaches but haven’t been able to come up with a version that lets me pass in all the rows to memberCount and calculate the count for each row in one go.

Any ideas/advice/hints welcome!

Reference: R: Vectorising all the things from our JCG partner Mark Needham at the Mark Needham Blog blog.
Do you want to know how to develop your skillset to become a Java Rockstar?
Subscribe to our newsletter to start Rocking right now!
To get you started we give you our best selling eBooks for FREE!
1. JPA Mini Book
2. JVM Troubleshooting Guide
3. JUnit Tutorial for Unit Testing
4. Java Annotations Tutorial
5. Java Interview Questions
6. Spring Interview Questions
7. Android UI Design
and many more ....
I agree to the Terms and Privacy Policy

Thank you!

We will contact you soon.

Tags
R
👁 Photo of Mark Needham
Mark Needham
December 29th, 2014Last Updated: December 28th, 2014
0 85 5 minutes read
Back to top button
Close
wpDiscuz