Event Date
This is a ‘LIVE COURSE’: the instructor will deliver lectures and coach attendees through the accompanying computer practicals via video link. A good internet connection is essential.
TIME ZONE – Central Time Zone. All sessions will be recorded and made available, allowing attendees in other time zones to follow.
Please email oliverhooker@prstatistics.com for full details or to discuss how we can accommodate you.
During this course we provide a comprehensive practical introduction to data wrangling using R. In particular, we focus on tools provided by R’s tidyverse, including dplyr, tidyr, and purrr. Data wrangling is the art of taking raw, messy data and cleaning and formatting it so that data analysis and visualization can be performed on it. Done poorly, it can be time-consuming, laborious, and error-prone. Fortunately, the tools provided by R’s tidyverse allow us to do data wrangling in a fast, efficient, and high-level manner, which can dramatically improve the ease and speed with which we analyse data.
We start with how to read data of different types into R. We then cover in detail the dplyr tools such as select, filter, and mutate. Here, we will also cover the pipe operator (%>%), which lets us create data wrangling pipelines that take raw, messy data at one end and return clean, tidy data at the other. We then cover how to compute descriptive or summary statistics using dplyr’s summarize and group_by functions. We then turn to combining and merging data. Here, we will consider how to concatenate data frames, including concatenating all data files in a folder, and cover the powerful SQL-like join operations that allow us to merge information held in different data frames. The final topic is how to “pivot” data from a “wide” to a “long” format and back using tidyr’s pivot_longer and pivot_wider.
Delivered remotely
Time zone – GMT+1
Availability – TBC
Duration – 3 x 1/2 days
Contact hours – Approx. 12 hours
ECTS – Equal to 1 ECTS credit
Language – English
Coming soon..
Minimal prior experience with R and RStudio is required. Attendees should be familiar with some basic R syntax and commands, how to write code in the RStudio console and script editor, how to load up data from files, etc.
A laptop computer with working versions of R and RStudio is required. R and RStudio are both available as free and open source software for PCs, Macs, and Linux computers.
Participants should be able to install additional software on their own computer during the course (please make sure you have administrator rights on your computer).
A large monitor and a second screen, although not absolutely necessary, could improve the learning experience. Participants are also encouraged to keep their webcam active to increase the interaction with the instructor and other students.
Cancellations are accepted up to 28 days before the course start date, subject to a 25% cancellation fee. Cancellations later than this may be considered; please contact oliverhooker@prstatistics.com. Failure to attend will result in the full cost of the course being charged. In the unfortunate event that a course is cancelled due to unforeseen circumstances, a full refund of the course fees will be credited.
If you are unsure about course suitability, please get in touch by email to find out more.
Classes from 12:00 to 16:00 (Central Time Zone)
DAY 1
Topic 1: Reading in data. We will begin by reading data into R using tools such as readr and readxl. Almost all types of data can be read into R, and here we will consider many of the main types, such as csv, xlsx, and sav. We will also consider how to control how data are parsed, e.g., so that they are read as dates, numbers, or strings.
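As a rough illustration of controlling how columns are parsed, the sketch below reads a small csv file with readr and declares each column's type explicitly. The file contents, the column names and the object name `visits` are hypothetical, invented only for this example.

```r
library(readr)

# Write a small, made-up csv to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,visit_date,score,group",
  "1,2023-01-15,4.5,a",
  "2,2023-02-01,3.0,b"
), csv_path)

# Declare how each column should be parsed instead of relying on guessing.
visits <- read_csv(
  csv_path,
  col_types = cols(
    id = col_integer(),
    visit_date = col_date(format = "%Y-%m-%d"),
    score = col_double(),
    group = col_character()
  )
)

print(visits)
```

For xlsx files, readxl's read_excel() plays a similar role. For sav files, one option (an assumption here, as the package is not named in the course outline) is read_sav() from the haven package.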
Topic 2: Wrangling with dplyr. For the remainder of Day 1, we will cover the very powerful dplyr R package. This package supplies a number of so-called “verbs” (select, rename, slice, filter, mutate, arrange, etc.), each of which focuses on a key data manipulation task, such as selecting or changing variables. All of these verbs can be chained together using “pipes” (represented by %>%). Together, these create powerful data wrangling pipelines that take raw data as input and return cleaned data as output. Here, we will also learn about the key concept of “tidy data”, which is roughly where each row of a data frame is an observation and each column is a variable.
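A pipeline of this kind might look like the following minimal sketch. The data frame `raw_df` and its columns are made up for illustration; only the dplyr verbs and the pipe come from the course outline.

```r
library(dplyr)

# A small, made-up "raw" data frame with an awkward column name,
# an unneeded column and a missing value.
raw_df <- tibble(
  ID = c(1, 2, 3, 4),
  height_cm = c(170, 165, NA, 180),
  weight_kg = c(70, 60, 80, 90),
  notes = c("ok", "ok", "missing height", "ok")
)

# Chain the verbs with %>%: each step takes a data frame and returns a data frame.
clean_df <- raw_df %>%
  select(-notes) %>%                                  # drop a column
  rename(id = ID) %>%                                 # rename a column
  filter(!is.na(height_cm)) %>%                       # keep rows with a known height
  mutate(bmi = weight_kg / (height_cm / 100)^2) %>%   # add a derived variable
  arrange(desc(bmi))                                  # sort rows, largest bmi first

print(clean_df)
```

Because every verb returns a data frame, steps can be added, removed or reordered without changing the surrounding code.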
Classes from 12:00 to 16:00 (Central Time Zone)
DAY 2
Topic 2 continued:
Topic 3: Summarizing data. The summarize and group_by tools in dplyr can be used to great effect to summarize data using descriptive statistics, either for the whole data set or separately for each group.
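As a small sketch of grouped summaries, the example below computes the count, mean and standard deviation of a variable within each group. The data frame `scores_df` and its values are hypothetical.

```r
library(dplyr)

# Made-up scores for two groups.
scores_df <- tibble(
  group = c("a", "a", "b", "b", "b"),
  score = c(1, 3, 2, 4, 6)
)

# group_by() defines the groups; summarize() collapses each group to one row.
summary_df <- scores_df %>%
  group_by(group) %>%
  summarize(
    n = n(),
    mean_score = mean(score),
    sd_score = sd(score),
    .groups = "drop"   # return an ungrouped data frame
  )

print(summary_df)
```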
Classes from 12:00 to 16:00 (Central Time Zone)
DAY 3
Topic 4: Merging and joining data frames. There are multiple ways to combine data frames, the simplest being “bind” operations, which are effectively horizontal or vertical concatenations. Much more powerful are the SQL-like “join” operations. Here, we will consider the inner_join, left_join, right_join, and full_join operations. In this section, we will also consider how to use purrr to read in and automatically merge large sets of files.
Topic 5: Pivoting data. Sometimes we need to change data frames from “long” to “wide” format, or vice versa. The R package tidyr provides the tools pivot_longer and pivot_wider for doing this.
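The difference between the joins, and the idea of reading and stacking every file in a folder, can be sketched as follows. All data frames, file names and the temporary folder are invented for this example. The folder-reading step is one possible approach using purrr's map() with dplyr's bind_rows(), not necessarily the one used in the course.

```r
library(dplyr)
library(purrr)
library(readr)

# Two made-up data frames that share an "id" key but do not fully overlap.
people <- tibble(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
scores <- tibble(id = c(2, 3, 4), score = c(10, 20, 30))

inner <- inner_join(people, scores, by = "id")  # ids present in both
left  <- left_join(people, scores, by = "id")   # every row of people
right <- right_join(people, scores, by = "id")  # every row of scores
full  <- full_join(people, scores, by = "id")   # ids present in either

# Create a temporary folder holding two small csv files with the same columns.
data_dir <- tempfile("csv_dir_")
dir.create(data_dir)
write_csv(tibble(id = 1:2, x = c(0.1, 0.2)), file.path(data_dir, "part1.csv"))
write_csv(tibble(id = 3:4, x = c(0.3, 0.4)), file.path(data_dir, "part2.csv"))

# List every csv in the folder, read each one, then stack the results by row.
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
all_parts <- csv_files %>%
  map(function(path) {
    read_csv(path, col_types = cols(id = col_integer(), x = col_double()))
  }) %>%
  bind_rows()

print(full)
print(all_parts)
```

In the joins, rows without a match are filled with NA rather than dropped, except in inner_join, which keeps only matching rows.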
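A minimal sketch of a round trip between the two formats is shown below. The data frame `wide_df` and its column names are hypothetical.

```r
library(tibble)
library(tidyr)

# Made-up "wide" data: one row per subject, one column per time point.
wide_df <- tibble(id = c(1, 2), t1 = c(5, 6), t2 = c(7, 8))

# Wide to long: the t1 and t2 columns become name/value pairs.
long_df <- pivot_longer(
  wide_df,
  cols = c(t1, t2),
  names_to = "time",
  values_to = "value"
)

# Long back to wide: the "time" values become column names again.
back_wide <- pivot_wider(long_df, names_from = time, values_from = value)

print(long_df)
print(back_wide)
```

In the long format each row is a single measurement, which is often the shape expected by plotting and modelling functions.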
Dr. Rafael De Andrade Moral
Rafael is an Associate Professor of Statistics at Maynooth University, Ireland. With a background in Biology and a PhD in Statistics from the University of São Paulo, Rafael has a deep passion for teaching and conducting research in statistical modelling applied to Ecology, Wildlife Management, Agriculture, and Environmental Science. As director of the Theoretical and Statistical Ecology Group, Rafael brings together a community of researchers who use mathematical and statistical tools to better understand the natural world. As an alternative teaching strategy, Rafael has been producing music videos and parodies to promote Statistics in social media and in the classroom. His personal webpage can be found here