Title: Tidyverse-Friendly Introductory Linear Regression
Description: Datasets and wrapper functions for tidyverse-friendly introductory linear regression, used in "Statistical Inference via Data Science: A ModernDive into R and the Tidyverse" available at <https://moderndive.com/>.
Authors: Albert Y. Kim [aut, cre], Chester Ismay [aut], Andrew Bray [ctb], Delaney Moran [ctb], Evgeni Chasnovski [ctb], Will Hopper [ctb], Benjamin S. Baumer [ctb], Marium Tapal [ctb], Wayne Ndlovu [ctb], Catherine Peppers [ctb], Annah Mutaya [ctb], Anushree Goswami [ctb], Ziyue Yang [ctb], Clara Li [ctb], Caroline McKenna [ctb], Catherine Park [ctb], Abbie Benfield [ctb], Georgia Gans [ctb], Kacey Jean-Jacques [ctb], Swaha Bhattacharya [ctb], Vivian Almaraz [ctb], Elle Jo Whalen [ctb], Jacqueline Chen [ctb], Michelle Flesaker [ctb], Irene Foster [ctb], Aushanae Haller [ctb], Benjamin Bruncati [ctb], Quinn White [ctb], Tianshu Zhang [ctb], Katelyn Diaz [ctb], Rose Porta [ctb], Renee Wu [ctb], Arris Moise [ctb], Kate Phan [ctb], Grace Hartley [ctb], Silas Weden [ctb], Emma Vejcik [ctb], Nikki Schuldt [ctb], Tess Goldmann [ctb], Hongtong Lin [ctb], Alejandra Munoz [ctb], Elina Gordon-Halpern [ctb], Haley Schmidt [ctb]
Maintainer: Albert Y. Kim <[email protected]>
License: GPL-3
Version: 0.7.0.9000
Built: 2024-11-09 06:10:18 UTC
Source: https://github.com/moderndive/moderndive
On-time data for all Alaska Airlines flights that departed NYC (i.e. JFK, LGA or EWR) in 2013. This is a subset of the flights data frame from nycflights13.
alaska_flights
alaska_flights
A data frame of 714 rows representing Alaska Airlines flights and 19 variables
year, month, day: Date of departure.
dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local tz.
sched_dep_time, sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local tz.
dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
carrier: Two letter carrier abbreviation. See nycflights13::airlines to get name.
flight: Flight number.
tailnum: Plane tail number. See nycflights13::planes for additional metadata.
origin, dest: Origin and destination. See nycflights13::airports for additional metadata.
air_time: Amount of time spent in the air, in minutes.
distance: Distance between airports, in miles.
hour, minute: Time of scheduled departure broken into hour and minutes.
time_hour: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to nycflights13::weather data.
RITA, Bureau of Transportation Statistics
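The origin and time_hour variables support joins to hourly weather data. A minimal base-R sketch of that join, using tiny synthetic stand-ins for alaska_flights and nycflights13::weather (the column names match the real data frames, but the rows here are invented):

```r
# Toy stand-ins for alaska_flights and nycflights13::weather:
flights_toy <- data.frame(
  origin = c("JFK", "EWR"),
  time_hour = as.POSIXct(c("2013-01-01 05:00", "2013-01-01 06:00"), tz = "UTC"),
  dep_delay = c(2, -5)
)
weather_toy <- data.frame(
  origin = c("JFK", "EWR"),
  time_hour = as.POSIXct(c("2013-01-01 05:00", "2013-01-01 06:00"), tz = "UTC"),
  temp = c(39.0, 39.9)
)

# Base-R equivalent of dplyr::inner_join(flights, weather,
# by = c("origin", "time_hour")):
joined <- merge(flights_toy, weather_toy, by = c("origin", "time_hour"))
nrow(joined)  # 2: each flight matched to its hourly weather record
```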
5000 chocolate-covered almonds selected from a large batch, weighed in grams.
almonds_bowl
almonds_bowl
A data frame with 5000 observations on the following 2 variables
Identification value for a given chocolate-covered almond
Weight of the chocolate-covered almond in grams (to the nearest tenth)
A sample of 25 chocolate-covered almonds, weighed in grams.
almonds_sample
almonds_sample
A data frame with 25 observations on the following 2 variables
Replicate number set to 1 since there is only one sample
Identification value for a given chocolate-covered almond
Weight of the chocolate-covered almond in grams (to the nearest tenth)
A sample of 100 chocolate-covered almonds, weighed in grams.
almonds_sample_100
almonds_sample_100
A data frame with 100 observations on the following 2 variables
Replicate number set to 1 since there is only one sample
Identification value for a given chocolate-covered almond
Weight of the chocolate-covered almond in grams (to the nearest tenth)
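The three almonds datasets support the usual sampling-for-estimation exercise: the mean weight of a small sample estimates the mean weight of the full batch. A hedged base-R sketch with invented weights (the actual almonds_bowl values may differ):

```r
# Simulate a batch of 5000 almond weights in grams (invented numbers),
# rounded to the nearest tenth as in the datasets above:
set.seed(2018)
batch <- round(rnorm(5000, mean = 3.6, sd = 0.2), 1)

# Draw a sample of 25; its mean is a point estimate of the batch mean:
sample_25 <- sample(batch, size = 25)
mean(sample_25)
mean(batch)
```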
A random sample of 325 books from Amazon.com.
amazon_books
amazon_books
A data frame of 325 rows representing books listed on Amazon and 13 variables.
Book title
Author who wrote the book
Recommended retail price of the book
Lowest price of the book shown on Amazon
Whether the book is hardcover or paperback
Number of pages in the book
Company that issues the book for sale
Year the book was published
10-character ISBN number
Height, width, weight and thickness of the book
The Data and Story Library (DASL) https://dasl.datadescription.com/datafile/amazon-books
Gathered from https://docs.google.com/spreadsheets/d/1cNuj9V-9Xe8fqV3DQRhvsXJhER3zTkO1dSsQ1Q0j96g/edit#gid=1419070688
avocados
avocados
A data frame of 54 regions over 3 years of weekly results
Week of Data Recording
Average Price of Avocado
Total Amount of Avocados
Amount of Small Haas Avocados Sold
Amount of Large Haas Avocados Sold
Amount of Extra Large Haas Avocados Sold
Total Amount of Bags of Avocados
Total Amount of Bags of Small Haas Avocados
Total Amount of Bags of Large Haas Avocados
Total Amount of Bags of Extra Large Haas Avocados
Type of Sale
Year of Sale
Region Where Sale Took Place
Data on maternal smoking and infant health
babies
babies
A data frame of 1236 rows of individual mothers.
Identification number
Marked 5 for single fetus, otherwise number of fetuses
Marked 1 for live birth that survived at least 28 days
Birth date where 1096 is January 1st, 1961
Birth date in mm-dd-yyyy format
Length of gestation in days, marked 999 if unknown
Infant's sex, where 1 is male, 2 is female, and 9 is unknown
Birth weight in ounces, marked 999 if unknown
Total number of previous pregnancies including fetal deaths and stillbirths, marked 99 if unknown
Mother's race where 0-5 is white, 6 is Mexican, 7 is Black, 8 is Asian, 9 is mixed, and 99 is unknown
Mother's age in years at termination of pregnancy, 99=unknown
Mother's education: 0 = less than 8th grade, 1 = 8th-12th grade (did not graduate), 2 = HS graduate, no other schooling, 3 = HS + trade, 4 = HS + some college, 5 = college graduate, 6 or 7 = trade school, HS unclear, 9 = unknown
Mother's height in inches to the last completed inch, 99=unknown
Mother prepregnancy wt in pounds, 999=unknown
Father's race, coding same as mother's race
Father's age, coding same as mother's age
Father's education, coding same as mother's education
Father's height, coding same as for mother's height
Father's weight coding same as for mother's weight
0 = legally separated, 1 = married, 2 = divorced, 3 = widowed, 5 = never married
Family yearly income in $2500 increments 0 = under 2500, 1=2500-4999, ..., 8= 12,500-14,999, 9=15000+, 98=unknown, 99=not asked
Does mother smoke? 0=never, 1= smokes now, 2=until current pregnancy, 3=once did, not now, 9=unknown
If mother quit, how long ago? 0=never smoked, 1=still smokes, 2=during current preg, 3=within 1 yr, 4= 1 to 2 years ago, 5= 2 to 3 yr ago, 6= 3 to 4 yrs ago, 7=5 to 9yrs ago, 8=10+yrs ago, 9=quit and don't know, 98=unknown, 99=not asked
Number of cigs smoked per day for past and current smokers 0=never, 1=1-4, 2=5-9, 3=10-14, 4=15-19, 5=20-29, 6=30-39, 7=40-60, 8=60+, 9=smoke but don't know, 98=unknown, 99=not asked
Data on maternal smoking and infant health from https://www.stat.berkeley.edu/~statlabs/labs.html
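The codebook above uses sentinel codes (9, 98, 99, 999) for unknown values, so analyses typically recode them to NA first. A minimal base-R sketch on a toy data frame with hypothetical column names chosen to match the descriptions:

```r
# Toy rows mimicking the babies codebook's sentinel codes:
babies_toy <- data.frame(
  gestation = c(284, 999, 270),  # 999 = unknown
  age       = c(27, 99, 33),     # 99 = unknown
  wt        = c(120, 999, 115)   # 999 = unknown
)

# Recode each sentinel to NA before computing summaries:
babies_toy$gestation[babies_toy$gestation == 999] <- NA
babies_toy$age[babies_toy$age == 99] <- NA
babies_toy$wt[babies_toy$wt == 999] <- NA

mean(babies_toy$gestation, na.rm = TRUE)  # 277
```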
A sampling bowl used as the population in a simulated sampling exercise. Also known as the urn sampling framework https://en.wikipedia.org/wiki/Urn_problem.
bowl
bowl
A data frame of 2400 rows representing different balls in the bowl, of which 900 are red and 1500 are white.
ID variable used to denote all balls. Note this value is not marked on the balls themselves
color of ball: red or white
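The tactile exercise this bowl supports can be simulated directly: draw a shovel of n = 50 balls without replacement from an urn of 900 red and 1500 white, and record the proportion red. A base-R sketch, with sample() standing in for the physical shovel:

```r
set.seed(76)
# Build the urn to match the bowl: 900 red and 1500 white balls
urn <- c(rep("red", 900), rep("white", 1500))

# One shovel of n = 50 balls, drawn without replacement:
shovel <- sample(urn, size = 50)

# One sample's estimate of the true proportion 900/2400 = 0.375:
prop_red <- mean(shovel == "red")
prop_red
```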
A single tactile sample of size n = 50 balls from https://github.com/moderndive/moderndive/blob/master/data-raw/sampling_bowl.jpeg
bowl_sample_1
bowl_sample_1
A data frame of 50 rows representing different balls and 1 variable.
Color of ball sampled
Counting the number of red balls in 10 samples of size n = 50 balls from https://github.com/moderndive/moderndive/blob/master/data-raw/sampling_bowl.jpeg
bowl_samples
bowl_samples
A data frame of 10 rows representing different groups of students' samples of size n = 50 and 5 variables
Group name
Number of red balls sampled
Number of white balls sampled
Number of green balls sampled
Total number of balls sampled
This dataset contains detailed information about coffee quality evaluations from various origins. It includes data on the country and continent of origin, farm name, lot number, and various quality metrics. The dataset also includes attributes related to coffee processing, grading, and specific sensory attributes.
coffee_quality
coffee_quality
A data frame with 207 rows and 30 variables:
character. The country where the coffee originated.
character. The continent where the coffee originated.
character. The name of the farm where the coffee was grown.
character. The lot number assigned to the batch of coffee.
character. The name of the mill where the coffee was processed.
character. The company associated with the coffee batch.
character. The altitude range (in meters) where the coffee was grown.
character. The specific region within the country where the coffee was grown.
character. The name of the coffee producer.
character. The in-country partner organization associated with the coffee batch.
character. The year or range of years during which the coffee was harvested.
date. The date when the coffee was graded.
character. The owner of the coffee batch.
character. The variety of the coffee plant.
character. The method used to process the coffee beans.
numeric. The aroma score of the coffee, on a scale from 0 to 10.
numeric. The flavor score of the coffee, on a scale from 0 to 10.
numeric. The aftertaste score of the coffee, on a scale from 0 to 10.
numeric. The acidity score of the coffee, on a scale from 0 to 10.
numeric. The body score of the coffee, on a scale from 0 to 10.
numeric. The balance score of the coffee, on a scale from 0 to 10.
numeric. The uniformity score of the coffee, on a scale from 0 to 10.
numeric. The clean cup score of the coffee, on a scale from 0 to 10.
numeric. The sweetness score of the coffee, on a scale from 0 to 10.
numeric. The overall score of the coffee, on a scale from 0 to 10.
numeric. The total cup points awarded to the coffee, representing the sum of various quality metrics.
numeric. The moisture percentage of the coffee beans.
character. The color description of the coffee beans.
character. The expiration date of the coffee batch.
character. The body that certified the coffee batch.
Coffee Quality Institute
1,340 digitized reviews on coffee samples from https://database.coffeeinstitute.org/.
coffee_ratings
coffee_ratings
A data frame of 1,340 rows representing each sample of coffee.
Number of points in final rating (scale of 0-100)
Species of coffee bean plant (Arabica or Robusta)
Owner of coffee plant farm
Coffee bean's country of origin
Name of coffee plant farm
Lot number for tested coffee beans
Name of coffee bean's processing facility
International Coffee Organization number
Name of coffee bean's company
Altitude at which coffee plants were grown
Region where coffee plants were grown
Name of coffee bean roaster
Number of tested bags
Tested bag weight
Partner for the country
Year the coffee beans were harvested
Day the coffee beans were graded
Owner of the coffee beans
Variety of the coffee beans
Method used for processing the coffee beans
Coffee aroma rating
Coffee flavor rating
Coffee aftertaste rating
Coffee acidity rating
Coffee body rating
Coffee balance rating
Coffee uniformity rating
Cup cleanliness rating
Coffee sweetness rating
Cupper Points, an overall rating for the coffee
Coffee moisture content
Number of category one defects for the coffee beans
Number of coffee beans that don't turn dark brown when roasted
Color of the coffee beans
Number of category two defects for the coffee beans
Expiration date of the coffee beans
Entity/Institute that certified the coffee beans
Body address of certification for coffee beans
Certification contact for coffee beans
Unit of measurement for altitude
Lower altitude level coffee beans grow at
Higher altitude level coffee beans grow at
Average altitude level coffee beans grow at
Coffee Quality Institute. Access cleaned data available at https://github.com/jldbc/coffee-quality-database
Number of Dunkin Donuts & Starbucks, median income, and population in 1024 census tracts in eastern Massachusetts in 2016.
DD_vs_SB
DD_vs_SB
A data frame of 1024 rows representing census tracts and 6 variables
County where census tract is located. Either Bristol, Essex, Middlesex, Norfolk, Plymouth, or Suffolk county
Federal Information Processing Standards code identifying census tract
Median income of census tract
Population of census tract
Coffee shop type: Dunkin Donuts or Starbucks
Number of shops
US Census Bureau. Code used to scrape data available at https://github.com/DelaneyMoran/FinalProject
Hourly meteorological data for LGA, JFK and EWR for the month of January 2023.
This is a subset of the weather data frame from nycflights23.
early_january_2023_weather
early_january_2023_weather
A data frame of 360 rows representing hourly measurements and 15 variables
Weather station. Named origin to facilitate merging with nycflights23::flights data.
Time of recording.
Temperature and dewpoint in F.
Relative humidity.
Wind direction (in degrees), speed and gust speed (in mph).
Precipitation, in inches.
Sea level pressure in millibars.
Visibility in miles.
Date and hour of the recording as a POSIXct date.
ASOS download from Iowa Environmental Mesonet, https://mesonet.agron.iastate.edu/request/download.phtml.
Hourly meteorological data for LGA, JFK and EWR for the month of January 2013.
This is a subset of the weather data frame from nycflights13.
early_january_weather
early_january_weather
A data frame of 358 rows representing hourly measurements and 15 variables
Weather station. Named origin to facilitate merging with nycflights13::flights data.
Time of recording.
Temperature and dewpoint in F.
Relative humidity.
Wind direction (in degrees), speed and gust speed (in mph).
Precipitation, in inches.
Sea level pressure in millibars.
Visibility in miles.
Date and hour of the recording as a POSIXct date.
ASOS download from Iowa Environmental Mesonet, https://mesonet.agron.iastate.edu/request/download.phtml.
On-time data for all Envoy Air flights that departed NYC (i.e. JFK, LGA or EWR) in 2023. This is a subset of the flights data frame from nycflights23.
envoy_flights
envoy_flights
A data frame of 357 rows representing Envoy Air flights and 19 variables
year, month, day: Date of departure.
dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local tz.
sched_dep_time, sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local tz.
dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
carrier: Two letter carrier abbreviation. See nycflights23::airlines to get name.
flight: Flight number.
tailnum: Plane tail number. See nycflights23::planes for additional metadata.
origin, dest: Origin and destination. See nycflights23::airports for additional metadata.
air_time: Amount of time spent in the air, in minutes.
distance: Distance between airports, in miles.
hour, minute: Time of scheduled departure broken into hour and minutes.
time_hour: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to nycflights23::weather data.
RITA, Bureau of Transportation Statistics
This dataset consists of information on 3,395 electric vehicle charging sessions across locations for a workplace charging program. The data contains information on multiple charging sessions from 85 electric vehicle drivers across 25 workplace locations, which are located at facilities of various types.
ev_charging
ev_charging
A data frame of 3,395 rows on 24 variables, where each row is an electric vehicle charging session.
Unique identifier specifying the electric vehicle charging session
Total energy used at the charging session, in kilowatt hours (kWh)
Quantity of money paid for the charging session in U.S. dollars
Date and time recorded at the beginning of the charging session
Date and time recorded at the end of the charging session
Hour of the day when the charging session began (1 through 24)
Hour of the day when the charging session ended (1 through 24)
Length of the charging session in hours
First three characters of the name of the weekday when the charging session occurred
Digital platform the driver used to record the session (android, ios, web)
Distance from the charging location to the driver's home, expressed in miles; NA if the driver did not report their address
Unique identifier for each driver
Unique identifier for each charging station
Unique identifier for each location owned by the company where charging stations were located
Binary variable that is 1 when the vehicle is a type commonly used by managers of the firm and 0 otherwise
Categorical variable that represents the facility type: 1 = manufacturing, 2 = office, 3 = research and development, 4 = other
Binary variables; 1 if the charging session took place on that day, 0 otherwise
Binary variable; 1 if the driver did report their zip code, 0 if they did not
Harvard Dataverse doi:10.7910/DVN/NFPQLW. Note data is released under a CC0: Public Domain license.
The data are gathered from end-of-semester student evaluations for a sample of 463 courses taught by 94 professors from the University of Texas at Austin. In addition, six students rated the professors' physical appearance. The result is a data frame where each row contains a different course and each column has information on either the course or the professor. https://www.openintro.org/data/index.php?data=evals
evals
evals
A data frame with 463 observations corresponding to courses on the following 13 variables.
Identification variable for course.
Identification variable for professor. Many professors are included more than once in this dataset.
Average professor evaluation score: (1) very unsatisfactory - (5) excellent.
Age of professor.
Average beauty rating of professor.
Gender of professor (collected as a binary variable at the time of the study): female, male.
Ethnicity of professor: not minority, minority.
Language of school where professor received education: English or non-English.
Rank of professor: teaching, tenure track, tenured.
Outfit of professor in picture: not formal, formal.
Color of professor’s picture: color, black & white.
Number of students in class who completed evaluation.
Total number of students in class.
Class level: lower, upper.
Çetinkaya-Rundel M, Morgan KL, Stangl D. 2013. Looking Good on Course Evaluations. CHANCE 26(2).
The data in evals is a slight modification of openintro::evals().
geom_categorical_model() fits a regression model using the categorical x axis as the explanatory variable, and visualizes the model's fitted values as piece-wise horizontal line segments. Confidence interval bands can be included in the visualization of the model. Like geom_parallel_slopes(), this function has the same nature as geom_smooth() from the ggplot2 package, but provides functionality that geom_smooth() currently doesn't have. When using a categorical predictor variable, the intercept corresponds to the mean for the baseline group, while coefficients for the non-baseline groups are offsets from this baseline. Thus in the visualization the baseline-for-comparison group's mean is marked with a solid line, whereas all offset groups' means are marked with dashed lines.
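This baseline/offset interpretation can be checked numerically with base lm() alone; here is a sketch using the built-in mtcars data, treating cylinder count as a stand-in categorical predictor:

```r
# Treat cyl as categorical; the baseline level is "4"
mtcars$cyl_f <- factor(mtcars$cyl)
fit <- lm(mpg ~ cyl_f, data = mtcars)

# The intercept equals the baseline group's mean ...
baseline_mean <- mean(mtcars$mpg[mtcars$cyl == 4])
isTRUE(all.equal(unname(coef(fit)["(Intercept)"]), baseline_mean))  # TRUE

# ... and each remaining coefficient is that group's offset from the baseline:
offset_6 <- mean(mtcars$mpg[mtcars$cyl == 6]) - baseline_mean
isTRUE(all.equal(unname(coef(fit)["cyl_f6"]), offset_6))  # TRUE
```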
geom_categorical_model(
  mapping = NULL,
  data = NULL,
  position = "identity",
  ...,
  se = TRUE,
  level = 0.95,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
)
mapping: Set of aesthetic mappings created by aes(). If specified and inherit.aes = TRUE (the default), it is combined with the default mapping at the top level of the plot.
data: The data to be displayed in this layer. If NULL, the default, the data is inherited from the plot.
position: A position adjustment to use on the data for this layer. This can be used in various ways, including to prevent overplotting and improving the display.
...: Other arguments passed on to the layer.
se: Display confidence interval around model lines? TRUE by default.
level: Level of confidence interval to use (0.95 by default).
na.rm: If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.
show.legend: logical. Should this layer be included in the legends? NA, the default, includes if any aesthetics are mapped.
inherit.aes: If FALSE, overrides the default aesthetics, rather than combining with them.
library(dplyr)
library(ggplot2)

p <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_point() +
  geom_categorical_model()
p

# In the above visualization, the solid line corresponds to the mean of 19.2
# for the baseline group "4", whereas the dashed lines correspond to the
# means of 28.19 and 21.02 for the non-baseline groups "f" and "r" respectively.
# In the corresponding regression table however the coefficients for "f" and "r"
# are presented as offsets from the mean for "4":
model <- lm(hwy ~ drv, data = mpg)
get_regression_table(model)

# You can use different colors for each categorical level
p %+% aes(color = drv)

# But mapping the color aesthetic doesn't change the model that is fit
p %+% aes(color = class)
geom_parallel_slopes() fits a parallel slopes model and adds its line output(s) to a ggplot object. Basically, it fits a unified model with intercepts varying between groups (which should be supplied as standard {ggplot2} grouping aesthetics: group, color, fill, etc.). This function has the same nature as geom_smooth() from the {ggplot2} package, but provides functionality that geom_smooth() currently doesn't have.
geom_parallel_slopes(
  mapping = NULL,
  data = NULL,
  position = "identity",
  ...,
  se = TRUE,
  formula = y ~ x,
  n = 100,
  fullrange = FALSE,
  level = 0.95,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
)
mapping: Set of aesthetic mappings created by aes(). If specified and inherit.aes = TRUE (the default), it is combined with the default mapping at the top level of the plot.
data: The data to be displayed in this layer. If NULL, the default, the data is inherited from the plot.
position: A position adjustment to use on the data for this layer. This can be used in various ways, including to prevent overplotting and improving the display.
...: Other arguments passed on to the layer.
se: Display confidence interval around model lines? TRUE by default.
formula: Formula to use per group in parallel slopes model. Basic linear y ~ x is the default.
n: Number of points per group at which to evaluate model.
fullrange: If TRUE, the fit spans the full range of the plot.
level: Level of confidence interval to use (0.95 by default).
na.rm: If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.
show.legend: logical. Should this layer be included in the legends? NA, the default, includes if any aesthetics are mapped.
inherit.aes: If FALSE, overrides the default aesthetics, rather than combining with them.
library(dplyr)
library(ggplot2)

# Basic usage
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
  geom_point() +
  geom_parallel_slopes()
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE)

# Supply custom aesthetics
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE, size = 4)

# Fit non-linear model
example_df <- house_prices %>%
  slice(1:1000) %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
  )
ggplot(example_df, aes(x = log10_size, y = log10_price, color = condition)) +
  geom_point(alpha = 0.1) +
  geom_parallel_slopes(formula = y ~ poly(x, 2))

# Different grouping
ggplot(example_df, aes(x = log10_size, y = log10_price)) +
  geom_point(alpha = 0.1) +
  geom_parallel_slopes(aes(fill = condition))
Determine the Pearson correlation coefficient between two variables in a data frame using pipeable and formula-friendly syntax.
get_correlation(data, formula, na.rm = FALSE, ...)
data: a data frame object
formula: a formula with the response variable name on the left and the explanatory variable name on the right
na.rm: a logical value indicating whether NA values should be stripped before the computation proceeds
...: further arguments passed to stats::cor()
A 1x1 data frame storing the correlation value
library(moderndive)

# Compute correlation between mpg and cyl:
mtcars %>%
  get_correlation(formula = mpg ~ cyl)

# Group by one variable:
library(dplyr)
mtcars %>%
  group_by(am) %>%
  get_correlation(formula = mpg ~ cyl)

# Group by two variables:
mtcars %>%
  group_by(am, gear) %>%
  get_correlation(formula = mpg ~ cyl)
Output information on each point/observation used in an lm() regression in "tidy" format. This function is a wrapper function for broom::augment() and renames the variables to have more intuitive names.
get_regression_points(
  model,
  digits = 3,
  print = FALSE,
  newdata = NULL,
  ID = NULL
)
model: an lm() model object
digits: number of digits precision in output table
print: If TRUE, return in print format suitable for R Markdown
newdata: A new data frame of points/observations to apply the model to, obtaining new fitted values and/or residuals
ID: A string indicating which variable in either the original data used to fit the model or in newdata should be used as an identification variable to distinguish the observational units in each row
A tibble-formatted regression table of outcome/response variable, all explanatory/predictor variables, the fitted/predicted value, and residual.
See also augment(), get_regression_table(), get_regression_summaries().
library(dplyr)
library(tibble)

# Convert rownames to column
mtcars <- mtcars %>%
  rownames_to_column(var = "automobile")

# Fit lm() regression:
mpg_model <- lm(mpg ~ cyl, data = mtcars)

# Get information on all points in regression:
get_regression_points(mpg_model, ID = "automobile")

# Create training and test set based on mtcars:
training_set <- mtcars %>%
  sample_frac(0.5)
test_set <- mtcars %>%
  anti_join(training_set, by = "automobile")

# Fit model to training set:
mpg_model_train <- lm(mpg ~ cyl, data = training_set)

# Make predictions on test set:
get_regression_points(mpg_model_train, newdata = test_set, ID = "automobile")
Output scalar summary statistics for an lm() regression in "tidy" format. This function is a wrapper function for broom::glance().
get_regression_summaries(model, digits = 3, print = FALSE)
model: an lm() model object
digits: number of digits precision in output table
print: If TRUE, return in print format suitable for R Markdown
A single-row tibble with regression summaries. Ex: r_squared and mse.
See also glance(), get_regression_table(), get_regression_points().
library(moderndive)

# Fit lm() regression:
mpg_model <- lm(mpg ~ cyl, data = mtcars)

# Get regression summaries:
get_regression_summaries(mpg_model)
Output regression table for an lm() regression in "tidy" format. This function is a wrapper function for broom::tidy() and includes confidence intervals in the output table by default.
get_regression_table(
  model,
  conf.level = 0.95,
  digits = 3,
  print = FALSE,
  default_categorical_levels = FALSE
)
model: an lm() model object
conf.level: The confidence level to use for the confidence interval. Must be strictly greater than 0 and less than 1. Defaults to 0.95.
digits: number of digits precision in output table
print: If TRUE, return in print format suitable for R Markdown
default_categorical_levels: If TRUE, do not change the non-baseline categorical variables in the term column. Otherwise non-baseline categorical variables will be displayed in the format "categorical_variable_name: level_name"
A tibble-formatted regression table along with the lower and upper end points of all confidence intervals for all parameters (lower_ci and upper_ci); the confidence levels default to 95%.
See also tidy(), get_regression_points(), get_regression_summaries().
library(moderndive)

# Fit lm() regression:
mpg_model <- lm(mpg ~ cyl, data = mtcars)

# Get regression table:
get_regression_table(mpg_model)

# Vary confidence level of confidence intervals
get_regression_table(mpg_model, conf.level = 0.99)
NOTE: This function is deprecated; please use geom_parallel_slopes() instead. Output a visualization of linear regression when you have one numerical and one categorical explanatory/predictor variable: a separate colored regression line for each level of the categorical variable.
gg_parallel_slopes(y, num_x, cat_x, data, alpha = 1)
y: Character string of outcome variable in data
num_x: Character string of numerical explanatory/predictor variable in data
cat_x: Character string of categorical explanatory/predictor variable in data
data: an optional data frame, list or environment (or object coercible by as.data.frame() to a data frame) containing the variables in the model
alpha: Transparency of points
A ggplot2::ggplot() object.
## Not run:
library(ggplot2)
library(dplyr)
library(moderndive)

# log10() transformations
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
  )

# Output parallel slopes model plot:
gg_parallel_slopes(
  y = "log10_price", num_x = "log10_size", cat_x = "condition",
  data = house_prices, alpha = 0.1
) +
  labs(
    x = "log10 square feet living space", y = "log10 price in USD",
    title = "House prices in Seattle: Parallel slopes model"
  )

# Compare with interaction model plot:
ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE, size = 1) +
  labs(
    x = "log10 square feet living space", y = "log10 price in USD",
    title = "House prices in Seattle: Interaction model"
  )

## End(Not run)
This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. This dataset was obtained from Kaggle.com https://www.kaggle.com/harlfoxem/housesalesprediction/data
house_prices
house_prices
A data frame with 21613 observations on the following 21 variables.
a notation for a house
Date house was sold
Price is prediction target
Number of Bedrooms/House
Number of bathrooms/bedrooms
square footage of the home
square footage of the lot
Total floors (levels) in house
House which has a view to a waterfront
Has been viewed
How good the condition is (Overall)
overall grade given to the housing unit, based on King County grading system
square footage of house apart from basement
square footage of the basement
Built Year
Year when house was renovated
zip code
Latitude coordinate
Longitude coordinate
Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area
Lot size area in 2015 (implies some renovations)
Kaggle https://www.kaggle.com/harlfoxem/housesalesprediction. Note data is released under a CC0: Public Domain license.
International Powerlifting Results: a subset of international powerlifting results.
ipf_lifts
ipf_lifts
A data frame with 41,152 entries, one entry per individual lifter
Individual lifter name
Binary sex (M/F)
The type of competition that the lifter entered
The equipment category under which the lifts were performed
The age of the lifter on the start date of the meet
The age class in which the lifter falls
division of competition
The recorded bodyweight of the lifter at the time of competition, to two decimal places
The weight class in which the lifter competed, to two decimal places
Maximum of the first three successful squat attempts
Maximum of the first three successful bench press attempts
Maximum of the first three successful deadlift attempts
The recorded place of the lifter in the given division at the end of the meet
Date of the event
The federation that hosted the meet
The name of the meet
This data is a subset of the open dataset Open Powerlifting
Data on Massachusetts public high schools in 2017
MA_schools
MA_schools
A data frame of 332 rows representing Massachusetts high schools and 4 variables
High school name.
Average SAT math score. Note 58 of the original 390 values of this variable were missing; these rows were dropped from consideration.
Percent of the student body that are considered economically disadvantaged.
Size of school enrollment: small (13-341 students), medium (342-541 students), large (542-4264 students).
The original source of the data is the Massachusetts Department of Education reports https://profiles.doe.mass.edu/state_report/; however, the data was downloaded from Kaggle at https://www.kaggle.com/ndalziel/massachusetts-public-schools-data
This dataset contains information about changes in speed, volume, and accidents of traffic between 2020 and 2019 by community and class of road in Massachusetts.
ma_traffic_2020_vs_2019
ma_traffic_2020_vs_2019
A data frame of 264 rows each representing a different community in Massachusetts.
City or Town
Class or group the road belongs to
Average estimated Speed (mph)
Average traffic
Average number of accidents
https://massdot-impact-crashes-vhb.opendata.arcgis.com/datasets/MassDOT::2020-vehicle-level-crash-details/explore
https://mhd.public.ms2soft.com/tcds/tsearch.asp?loc=Mhd&mod=
Ebay auction data for the Nintendo Wii game Mario Kart.
mario_kart_auction
mario_kart_auction
A data frame of 143 auctions.
Auction ID assigned by Ebay
Auction length in days
Number of bids
Game condition, either "new" or "used"
Price at the start of the auction
Shipping price
Total price, equal to auction price plus shipping price
Shipping speed or method
Seller's rating on Ebay, equal to the number of positive ratings minus the number of negative ratings
Whether the auction photo was a stock photo or not; pictures used in many auctions were considered stock photos
Number of Wii wheels included in the auction
The title of the auction
This data is from https://www.openintro.org/data/index.php?data=mariokart
2020 road traffic volume and crash level data for 13 Massachusetts counties
mass_traffic_2020
mass_traffic_2020
A data frame of 874 rows representing traffic data at the 874 sites
Site id
County in which the site is located
Community in which the site is located
Rural (R) or Urban (U)
Direction for traffic movement. Either 1-WAY, 2-WAY, EB (eastbound), RAMP or WB (westbound)
Classification of road. Either Arterial, Collector, Freeway & Expressway, Interstate or Local Road
Average traffic speed
Number of vehicles recorded at each site in 2020
Number of vehicle crashes at each site
Number of non-fatal injuries for all recorded vehicle crashes
Number of fatal injuries for all recorded vehicle crashes
Datasets and wrapper functions for tidyverse-friendly introductory linear regression, used in "Statistical Inference via Data Science: A ModernDive into R and the tidyverse" available at https://moderndive.com/.
Maintainer: Albert Y. Kim [email protected] (ORCID)
Authors:
Chester Ismay [email protected] (ORCID)
Other contributors:
Andrew Bray [email protected] (ORCID) [contributor]
Delaney Moran [email protected] [contributor]
Evgeni Chasnovski [email protected] (ORCID) [contributor]
Will Hopper [email protected] (ORCID) [contributor]
Benjamin S. Baumer [email protected] (ORCID) [contributor]
Marium Tapal [email protected] (ORCID) [contributor]
Wayne Ndlovu [email protected] [contributor]
Catherine Peppers [email protected] [contributor]
Annah Mutaya [email protected] [contributor]
Anushree Goswami [email protected] [contributor]
Ziyue Yang [email protected] (ORCID) [contributor]
Clara Li [email protected] (ORCID) [contributor]
Caroline McKenna [email protected] [contributor]
Catherine Park [email protected] (ORCID) [contributor]
Abbie Benfield [email protected] [contributor]
Georgia Gans [email protected] [contributor]
Kacey Jean-Jacques [email protected] [contributor]
Swaha Bhattacharya [email protected] [contributor]
Vivian Almaraz [email protected] [contributor]
Elle Jo Whalen [email protected] [contributor]
Jacqueline Chen [email protected] [contributor]
Michelle Flesaker [email protected] [contributor]
Irene Foster [email protected] [contributor]
Aushanae Haller [email protected] [contributor]
Benjamin Bruncati [email protected] (ORCID) [contributor]
Quinn White [email protected] (ORCID) [contributor]
Tianshu Zhang [email protected] (ORCID) [contributor]
Katelyn Diaz [email protected] (ORCID) [contributor]
Rose Porta [email protected] [contributor]
Renee Wu [email protected] [contributor]
Arris Moise [email protected] [contributor]
Kate Phan [email protected] [contributor]
Grace Hartley [email protected] [contributor]
Silas Weden [email protected] [contributor]
Emma Vejcik [email protected] [contributor]
Nikki Schuldt [email protected] [contributor]
Tess Goldmann [email protected] [contributor]
Hongtong Lin [email protected] [contributor]
Alejandra Munoz [email protected] [contributor]
Elina Gordon-Halpern [email protected] [contributor]
Haley Schmidt [email protected] (ORCID) [contributor]
Useful links:
Report bugs at https://github.com/moderndive/moderndive/issues
library(moderndive)

# Fit regression model:
mpg_model <- lm(mpg ~ hp, data = mtcars)

# Regression tables:
get_regression_table(mpg_model)

# Information on each point in a regression:
get_regression_points(mpg_model)

# Regression summaries
get_regression_summaries(mpg_model)

# Plotting parallel slopes models
library(ggplot2)
ggplot(evals, aes(x = age, y = score, color = ethnicity)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE)
A random sample of 32 action movies and 36 romance movies from https://www.imdb.com/ and their ratings.
movies_sample
movies_sample
A data frame of 68 rows representing movies.
Movie title
Year released
IMDb rating out of 10 stars
Action or Romance
This data was sampled from the movies data frame in the ggplot2movies package.
From a study on whether yawning is contagious (https://www.imdb.com/title/tt0768479/). The data here was derived from the final proportions of yawns given in the show.
mythbusters_yawn
mythbusters_yawn
A data frame of 50 rows representing each of the 50 participants in the study.
integer value corresponding to identifier variable of subject ID
string of either "seed", participant was shown a yawner, or "control", participant was not shown a yawner
string of either "yes", the participant yawned, or "no", the participant did not yawn
This dataset contains records of eruptions from the Old Faithful geyser in Yellowstone National Park, recorded in 2024. It includes details such as the eruption ID, date and time of eruption, waiting time between eruptions, webcam availability, and the duration of each eruption.
old_faithful_2024
old_faithful_2024
A data frame with 114 rows and 6 variables:
numeric. A unique identifier for each eruption.
date. The date of the eruption.
numeric. The time of the eruption in HHMM format (e.g., 538 corresponds to 5:38 AM, 1541 corresponds to 3:41 PM).
numeric. The waiting time in minutes until the next eruption.
character. Indicates whether the eruption was captured on webcam ("Yes" or "No").
numeric. The duration of the eruption in seconds.
Volunteer information from https://geysertimes.org/retrieve.php
A dataset of 40 pennies to be treated as a random sample, with the pennies data frame acting as the population. Data on these pennies were recorded in 2011.
orig_pennies_sample
orig_pennies_sample
A data frame of 40 rows representing 40 randomly sampled pennies from pennies and 2 variables
Year of minting
Age in 2011
StatCrunch https://www.statcrunch.com:443/app/index.html?dataid=301596
A dataset of 800 pennies to be treated as a sampling population. Data on these pennies were recorded in 2011.
pennies
pennies
A data frame of 800 rows representing different pennies and 2 variables
Year of minting
Age in 2011
StatCrunch https://www.statcrunch.com:443/app/index.html?dataid=301596
35 bootstrap resamples with replacement of a sample of 50 pennies contained in
a 50 cent roll from Florence Bank on Friday February 1, 2019 in downtown Northampton,
Massachusetts, USA https://goo.gl/maps/AF88fpvVfm12. The original sample
of 50 pennies is available in pennies_sample().
pennies_resamples
pennies_resamples
A data frame of 1750 rows representing 35 students' bootstrap resamples of size 50 and 3 variables
ID variable of replicate/resample number.
Name of student
Year on resampled penny
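To show how this long format is typically used, here is a base-R sketch that computes one bootstrap mean of minting year per resample. The small data frame below is a hypothetical stand-in mirroring the structure of pennies_resamples (the real data frame has 1750 rows); only the shape of the computation is the point.

```r
# Hypothetical toy stand-in mirroring the pennies_resamples structure
# (replicate, name, year); values are made up for illustration.
resamples <- data.frame(
  replicate = rep(1:3, each = 4),
  name = rep(c("A", "B", "C"), each = 4),
  year = c(1990, 2001, 1988, 2015, 1979, 2003, 1999, 2010, 1985, 1995, 2005, 2000)
)

# One bootstrap mean of minting year per student's resample:
aggregate(year ~ name, data = resamples, FUN = mean)
```

With the real data, collecting one such mean per resample yields the bootstrap distribution of the sample mean year.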
A sample of 50 pennies contained in a 50 cent roll from Florence Bank on Friday February 1, 2019 in downtown Northampton, Massachusetts, USA https://goo.gl/maps/AF88fpvVfm12.
pennies_sample
pennies_sample
A data frame of 50 rows representing 50 sampled pennies and 2 variables
Variable used to uniquely identify each penny.
Year of minting.
The original pennies_sample has been renamed orig_pennies_sample() as of moderndive v0.3.0.
This function calculates the population standard deviation for a numeric vector.
pop_sd(x)
x |
A numeric vector for which the population standard deviation should be calculated. |
A numeric value representing the population standard deviation of the vector.
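As a sanity check on what "population" means here, the population standard deviation divides the sum of squared deviations by n rather than n - 1. A base-R sketch of the equivalence (not using pop_sd() itself, so it runs without moderndive):

```r
# Population SD divides the sum of squared deviations by n, not n - 1:
x <- c(2, 4, 6, 8, 10)
n <- length(x)
pop <- sqrt(sum((x - mean(x))^2) / n)  # population SD computed directly
via_sd <- sd(x) * sqrt((n - 1) / n)    # rescaling R's sample SD gives the same value
all.equal(pop, via_sd)                 # TRUE
```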
# Example usage:
library(dplyr)
df <- data.frame(weight = c(2, 4, 6, 8, 10))
df |>
  summarize(population_mean = mean(weight), population_sd = pop_sd(weight))
Data from a 1970s study on whether gender influences hiring recommendations. Originally used in OpenIntro.org.
promotions
promotions
A data frame with 48 observations on the following 3 variables.
Identification variable used to distinguish rows.
gender (collected as a binary variable at the time of the study): a factor with two levels, male and female
a factor with two levels: promoted and not
Rosen B and Jerdee T. 1974. Influence of sex role stereotypes on personnel decisions. Journal of Applied Psychology 59(1):9-14.
The data in promotions is a slight modification of openintro::gender_discrimination().
Shuffled/permuted data from a 1970s study on whether gender influences hiring recommendations.
promotions_shuffled
promotions_shuffled
A data frame with 48 observations on the following 3 variables.
Identification variable used to distinguish rows.
shuffled/permuted (binary) gender: a factor with two levels, male and female
a factor with two levels: promoted and not
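The shuffling that distinguishes promotions_shuffled from promotions can be sketched in base R: permute the gender labels while holding the decisions fixed, so any association between the two columns is broken. The counts below are illustrative assumptions, not the actual data.

```r
# Illustrative sketch of permuting gender labels while holding decisions fixed;
# the 35/13 split and 24/24 gender balance are assumed for demonstration.
set.seed(2019)
gender <- rep(c("male", "female"), each = 24)             # 48 rows, as in promotions
decision <- sample(rep(c("promoted", "not"), c(35, 13)))  # fixed decision counts
shuffled_gender <- sample(gender)                          # break any gender/decision link
table(shuffled_gender, decision)
```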
Random sample of 1057 houses taken from full Saratoga Housing Data (De Veaux)
saratoga_houses
saratoga_houses
A data frame with 1057 observations on the following 8 variables
price (US dollars)
Living Area (square feet)
Number of Bathrooms (half bathrooms have no shower or tub)
Number of Bedrooms
Number of Fireplaces
Size of Lot (acres)
Age of House (years)
Whether the house has a Fireplace
Gathered from https://docs.google.com/spreadsheets/d/1AY5eECqNIggKpYF3kYzJQBIuuOdkiclFhbjAmY3Yc8E/edit#gid=622599674
This dataset contains a sample of 52 tracks from Spotify, focusing on two genres: deep-house and metal. It includes metadata about the tracks, the artists, and an indicator of whether each track is considered popular. This dataset is useful for comparative analysis between genres and for studying the characteristics of popular versus non-popular tracks within these genres.
spotify_52_original
spotify_52_original
A data frame with 52 rows and 6 columns:
character. Spotify ID for the track. See: https://developer.spotify.com/documentation/web-api/
character. Genre of the track, either "deep-house" or "metal".
character. Names of the artists associated with the track.
character. Name of the track.
numeric. Popularity score of the track (0-100). See: https://developer.spotify.com/documentation/web-api/reference/#/operations/get-track
character. Indicates whether the track is considered popular ("popular") or not ("not popular"). Popularity is defined as a score of 50 or higher, which corresponds to the 75th percentile of the popularity column.
https://developer.spotify.com/documentation/web-api/
data(spotify_52_original)
head(spotify_52_original)
This dataset contains a sample of 52 tracks from Spotify, focusing on two genres: deep-house and metal. It includes metadata about the tracks, the artists, and a shuffled indicator of whether each track is considered popular.
spotify_52_shuffled
spotify_52_shuffled
A data frame with 52 rows and 6 columns:
character. Spotify ID for the track. See: https://developer.spotify.com/documentation/web-api/
character. Genre of the track, either "deep-house" or "metal".
character. Names of the artists associated with the track.
character. Name of the track.
numeric. Popularity score of the track (0-100). See: https://developer.spotify.com/documentation/web-api/reference/#/operations/get-track
character. A shuffled version of the column of the same name in the spotify_52_original data frame.
https://developer.spotify.com/documentation/web-api/
data(spotify_52_shuffled)
head(spotify_52_shuffled)
This dataset contains information on 6,000 tracks from Spotify, categorized by one of six genres. It includes various audio features, metadata about the tracks, and an indicator of popularity. The dataset is useful for analysis of music trends, popularity prediction, and genre-specific characteristics.
spotify_by_genre
spotify_by_genre
A data frame with 6,000 rows and 21 columns:
character. Spotify ID for the track. See: https://developer.spotify.com/documentation/web-api/
character. Names of the artists associated with the track.
character. Name of the album on which the track appears.
character. Name of the track.
numeric. Popularity score of the track (0-100). See: https://developer.spotify.com/documentation/web-api/reference/#/operations/get-track
numeric. Duration of the track in milliseconds.
logical. Whether the track has explicit content.
numeric. Danceability score of the track (0-1). See: https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features
numeric. Energy score of the track (0-1).
numeric. The key the track is in (0-11, where 0 = C, 1 = C#/Db, etc.).
numeric. The loudness of the track in decibels (dB).
numeric. Modality of the track (0 = minor, 1 = major).
numeric. Speechiness score of the track (0-1).
numeric. Acousticness score of the track (0-1).
numeric. Instrumentalness score of the track (0-1).
numeric. Liveness score of the track (0-1).
numeric. Valence score of the track (0-1), indicating the musical positiveness.
numeric. Tempo of the track in beats per minute (BPM).
numeric. Time signature of the track (typically 3, 4, or 5).
character. Genre of the track (country, deep-house, dubstep, hip-hop, metal, and rock).
character. Indicates whether the track is considered popular ("popular") or not ("not popular"). Popularity is defined as a score of 50 or higher, which corresponds to the 75th percentile of the popularity column.
https://developer.spotify.com/documentation/web-api/
data(spotify_by_genre)
head(spotify_by_genre)
Counting the number of red balls in 33 tactile samples of size n = 50 balls from https://github.com/moderndive/moderndive/blob/master/data-raw/sampling_bowl.jpeg
tactile_prop_red
tactile_prop_red
A data frame of 33 rows representing different groups of students' samples of size n = 50 and 4 variables
Group members
Replicate number
Number of red balls sampled out of 50
Proportion red balls out of 50
This function calculates the five-number summary (minimum, first quartile, median, third quartile, maximum) for specified numeric columns in a data frame and returns the results in a long format. It also handles categorical, factor, and logical columns by counting the occurrences of each level or value, and includes the results in the summary. The type column indicates whether the data is numeric, character, factor, or logical.
tidy_summary(df, columns = names(df), ...)
df: A data frame containing the data. The data frame must have at least one row.
columns: Unquoted column names or tidyselect helpers specifying the columns for which to calculate the summary. Defaults to all columns in the inputted data frame.
...: Additional arguments passed to the quantile() function (e.g., na.rm = TRUE).
A tibble in long format with columns:
The name of the column.
The number of non-missing values in the column for numeric variables and the number of non-missing values in the group for categorical, factor, and logical columns.
The group level or value for categorical, factor, and logical columns.
The type of data in the column (numeric, character, factor, or logical).
The minimum value (for numeric columns).
The first quartile (for numeric columns).
The mean value (for numeric columns).
The median value (for numeric columns).
The third quartile (for numeric columns).
The maximum value (for numeric columns).
The standard deviation (for numeric columns).
# Example usage with a simple data frame
df <- tibble::tibble(
  category = factor(c("A", "B", "A", "C")),
  int_values = c(10, 15, 7, 8),
  num_values = c(8.2, 0.3, -2.1, 5.5),
  one_missing_value = c(NA, 1, 2, 3),
  flag = c(TRUE, FALSE, TRUE, TRUE)
)

# Specify columns
tidy_summary(df, columns = c(category, int_values, num_values, flag))

# Defaults to full data frame (note an error will be given without
# specifying `na.rm = TRUE` since `one_missing_value` has an `NA`)
tidy_summary(df, na.rm = TRUE)

# Example with additional arguments for quantile functions
tidy_summary(df, columns = c(one_missing_value), na.rm = TRUE)
This dataset contains information on 193 United Nations member states as of 2024. It includes various attributes such as country names, ISO codes, official state names, geographic and demographic data, economic indicators, and participation in the Olympic Games. The data is designed for use in statistical analysis, data visualization, and educational purposes.
un_member_states_2024
un_member_states_2024
A data frame with 193 rows and 39 columns:
character. Name of the country.
character. ISO 3166-1 alpha-3 country code. See: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3
character. Official name of the country. See: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_and_their_capitals_in_native_languages
factor. Continent where the country is located. See: https://en.wikipedia.org/wiki/Continent
character. Specific region within the continent.
character. Name of the capital city. See: https://en.wikipedia.org/wiki/List_of_national_capitals_by_population
numeric. Population of the capital city.
numeric. Percentage of the country's population living in the capital.
integer. Year the capital population data was collected.
numeric. GDP per capita in USD. See: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
numeric. Year the GDP per capita data was collected.
numeric. Number of times the country has competed in the Summer Olympics.
integer. Number of gold medals won in the Summer Olympics.
integer. Number of silver medals won in the Summer Olympics.
integer. Number of bronze medals won in the Summer Olympics.
integer. Total number of medals won in the Summer Olympics.
integer. Number of times the country has competed in the Winter Olympics.
integer. Number of gold medals won in the Winter Olympics.
integer. Number of silver medals won in the Winter Olympics.
integer. Number of bronze medals won in the Winter Olympics.
integer. Total number of medals won in the Winter Olympics.
integer. Total number of times the country has competed in both Summer and Winter Olympics. See: https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table
integer. Total number of gold medals won in both Summer and Winter Olympics.
integer. Total number of silver medals won in both Summer and Winter Olympics.
integer. Total number of bronze medals won in both Summer and Winter Olympics.
integer. Total number of medals won in both Summer and Winter Olympics.
character. Indicates whether the country drives on the left or right side of the road. See: https://en.wikipedia.org/wiki/Left-_and_right-hand_traffic
numeric. Percentage of the population classified as obese in 2024. See: https://en.wikipedia.org/wiki/List_of_countries_by_obesity_rate
numeric. Percentage of the population classified as obese in 2016.
logical. Indicates whether the country has nuclear weapons as of 2024. See: https://en.wikipedia.org/wiki/List_of_states_with_nuclear_weapons
numeric. Population of the country in 2024. See: https://data.worldbank.org/indicator/SP.POP.TOTL
numeric. Area of the country in square kilometers. See: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area
numeric. Area of the country in square miles.
numeric. Population density per square kilometer.
numeric. Population density per square mile.
factor. World Bank income group classification in 2024. See: https://data.worldbank.org/indicator/NY.GNP.PCAP.CD
numeric. Life expectancy at birth in 2022. See: https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy
numeric. Fertility rate in 2022 (average number of children per woman). See: https://en.wikipedia.org/wiki/List_of_countries_by_total_fertility_rate
numeric. Human Development Index in 2022. See: https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index
data(un_member_states_2024)
head(un_member_states_2024)