📢New #rstats blog post! 📢
(^New blog, for that matter)
If you're interested in how RMarkdowns evolve from analytical scratchpads to reproducible data products (projects & packages), pls give a read and let me know what you think!
Too many broadly useful stats methods are masked in domain-specific language. In my new pair of posts, I discuss formula-free #causalinference design patterns to help data analysts recognize frameworks as they encounter them in everyday work
1/3
Data dictionaries really need to document what constitutes an observation / unique row and not just what each column / variable means. This is a hill I will die on.
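One way to make the "unique row" part of a data dictionary concrete is to check it. A minimal sketch in Python, assuming a made-up table whose documented grain is one row per customer per month (the column names and the `duplicate_keys` helper are hypothetical, not from any package):

```python
from collections import Counter


def duplicate_keys(rows, key_cols):
    """Return key tuples that appear more than once; an empty dict means the grain holds."""
    counts = Counter(tuple(row[c] for c in key_cols) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical data: the documented grain is (customer_id, month)
rows = [
    {"customer_id": 1, "month": "2020-01", "spend": 10.0},
    {"customer_id": 1, "month": "2020-02", "spend": 12.5},
    {"customer_id": 2, "month": "2020-01", "spend": 7.0},
    {"customer_id": 2, "month": "2020-01", "spend": 7.0},  # violates the grain
]

print(duplicate_keys(rows, ["customer_id", "month"]))
```

If the dictionary states the grain, a check like this can run every time the table is refreshed.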
No one:
Absolutely no one:
Me: SO, I know we can't have a holiday party this year, but we CAN make our #rstats R Markdown reports snow before we send them to each other
HT to for the heavy lifting
Any R users looking for a good Docker crash course? Highly recommend this tutorial from @rOpenSci. Such a straightforward, practical, jargon-free what, why, and how 🤓:
Cut an #rstats script's runtime from 2+ hours to <5 minutes and feel extremely powerful (even though arguably the first version was just bad code)
Don’t know who needs this but a few random tips below. Easy once you’ve heard them but often outside of intro content 👇🏻
Many #rstats Shiny users query databases, but fewer have managed their own. This becomes necessary if your app needs persistent storage.
In this post, I share some tips for creating DBs for use with Shiny and what you (don't know that you) need to know
(1/2)
Have you ever needed to write typo-free
#sql
for a large number of repetitive calculations?
In this new blog post, I show how
#rstats
{dbplyr} and the {sqlfluff} CLI styler can be used as a preprocessor for readable, accurate SQL
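The post itself uses {dbplyr} and sqlfluff. The sketch below is not that workflow, only the general idea of generating repetitive SQL from a list of columns instead of typing it by hand. It is written in plain Python, and the table and column names are invented for illustration:

```python
import sqlite3

# Hypothetical count columns; in practice these might be read from the table's schema
cols = ["n_visits", "n_orders", "n_returns"]

# Build one pair of aggregate expressions per column
select_lines = ",\n".join(
    f"  sum({c}) as {c}_total,\n  avg({c}) as {c}_mean" for c in cols
)
query = f"select\n{select_lines}\nfrom activity"
print(query)

# Run the generated query against a throwaway in-memory table to confirm it is valid SQL
con = sqlite3.connect(":memory:")
con.execute("create table activity (n_visits int, n_orders int, n_returns int)")
con.executemany("insert into activity values (?, ?, ?)", [(1, 2, 3), (3, 4, 5)])
result = con.execute(query).fetchone()
print(result)
```

Because the column list is the only thing typed by hand, a typo can only happen in one place.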
**#rstats BLACK FRIDAY "DEALS"** (thread)
100% OFF on these awesome, always free ebooks I've read and/or recommended this year
BOGO: in true R fashion, each thoughtfully covers both code and theory
Thankful to all these authors for openly sharing such great content🙏
(1/n)
Causal inference in industry should be advantaged by greater data & context on past obsv data, but this advantage can only happen with proactive data, metadata, and knowledge mgmt
1/2
I don't know who needs this but I just spent 20 min tracking down this excellent git overview to send to someone, so check it out if you're interested! Great explanations to help build a real mental model and not just memorize a litany of commands
A very short
#rstats
post on a few ways that I like to organize my projects when sending SQL queries to a database from R
tldr
🗃️ modularize R/SQL code
🖊️ make templates
📦 enable sharing in pkg or GitHub
❓ use R's data gen to push test data to the db
Surprisingly few intro stats books (vs econ) introduce the Frisch–Waugh–Lovell Theorem when teaching multivariable regression
Such a beautiful little result that is so helpful building that intuition as students start to think in >3 dimensions
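For anyone who has not seen the result: the coefficient on x1 in a regression of y on x1 and x2 equals the slope from regressing the residuals of y (after removing x2) on the residuals of x1 (after removing x2). A small numeric check in plain Python, using made-up data and hypothetical helper names:

```python
def centered(v):
    """Subtract the mean, which absorbs the intercept."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def residualize(target, control):
    """Residuals from regressing target on an intercept and a single control."""
    t, c = centered(target), centered(control)
    slope = dot(t, c) / dot(c, c)
    return [ti - slope * ci for ti, ci in zip(t, c)]


# Made-up data for illustration only
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 8.2, 8.8, 13.1, 13.9]

# Full regression y ~ 1 + x1 + x2, solved via the normal equations on centered data
c1, c2, cy = centered(x1), centered(x2), centered(y)
s11, s22, s12 = dot(c1, c1), dot(c2, c2), dot(c1, c2)
s1y, s2y = dot(c1, cy), dot(c2, cy)
beta1_full = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

# FWL route: partial x2 out of both y and x1, then run a simple regression
r1 = residualize(x1, x2)
ry = residualize(y, x2)
beta1_fwl = dot(r1, ry) / dot(r1, r1)

print(beta1_full, beta1_fwl)  # the two agree up to floating point error
```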
New blog post on using data column names to form "contracts" between data producers and consumers. I demonstrate how pkgs like {pointblank}, {collapsibleTree}, and {dplyr} can make use of controlled vocabularies to enhance data management and wrangling
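To illustrate the idea without the R packages named above, here is a minimal Python sketch of a column-name contract. The prefixes (`ID`, `IND`, `N`) and their rules are assumptions chosen for the example, and `violations` is a hypothetical helper:

```python
# Hypothetical controlled vocabulary: name prefix -> rule every value must satisfy
CHECKS = {
    "ID": lambda v: v is not None,                    # identifiers are never missing
    "IND": lambda v: v in (0, 1),                     # indicators are 0/1
    "N": lambda v: isinstance(v, int) and v >= 0,     # counts are non-negative integers
}


def violations(rows):
    """List (row index, column, problem) for names or values that break the contract."""
    problems = []
    for i, row in enumerate(rows):
        for col, value in row.items():
            prefix = col.split("_", 1)[0]
            check = CHECKS.get(prefix)
            if check is None:
                problems.append((i, col, "unknown prefix"))
            elif not check(value):
                problems.append((i, col, "failed check"))
    return problems


rows = [
    {"ID_CUSTOMER": 1, "IND_ACTIVE": 1, "N_LOGINS": 4},
    {"ID_CUSTOMER": 2, "IND_ACTIVE": 2, "N_LOGINS": -1, "LOGIN_FLAG": 0},
]
print(violations(rows))
```

The point is that the name alone tells the consumer which checks to expect, so validation can be driven by the names rather than written per column.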
R Markdown Cookbook is now available! Check it out for short, snappy, real-world examples of how to customize and polish every aspect of your R Markdown.
To celebrate, a thread of 10 of my favorite tips (1/n)
R Markdown Cookbook is now available! Written by the developers of R Markdown, it is an essential reference that will help users learn and make full use of the software.
@xieyihui
@chrisderv
@rstudio
Get 20% off when you purchase on the Routledge website.
This time of year, I think a lot about the new stats grads starting their first jobs and all that I didn't know about real-world data when I began
I'm drafting thoughts on a lot of common stumbling blocks and thought I'd share a super early draft:
🧵👇🏻
I don't know who needs to hear this, but if you need an in-memory database to play with in #rstats, DuckDB has a more robust feature set than SQLite and is just as easy to use 🤓
🦆
Many organizations are now building internal
#rstats
packages📦, but optimal design / engineering decisions differ from open source
In this new post based on my
#rstudioglobal
talk, I explore strategies for API design, docs, testing, and more! (1/3)
Pleased to announce that I'm now an
@rstudio
Certified Instructor! Thanks to
@gvwilson
for the fantastic class.
To celebrate, I turned my 10-minute lesson on {crosstalk} into a {learnr} tutorial. Check it out here:
I previously wrote about using a controlled vocabulary to name variables in a dataset and how this can help encode metadata and create contracts.
I now have a work-in-progress
#rstats
package to create and apply 'convo's:
So what does it do?🧵 (1/n)
Any
#rstats
-ers have learning python in their 2024 resolutions?
Learning a new language, it's hard to abandon the workflow you know and love. In my last post of 2023, I recommend some of the latest python pkgs/versions with similar ergonomics
🧵1/n
Ironically, after we’ve largely struck out on “don’t use Excel because it’s not reproducible”, we’re landing on “don’t use Excel because it actually no-kidding makes your fraud quite easy to reproduce”
(See also Frank start-up fake data)
Come for the "fraudulent data was used in a paper about honesty", but stay for the fascinating deep dive into how the team investigated using Excel metadata files that you probably don't even know exist
Emily +
#rstudioconf
+ 4hr ✈️ = new blog
When talking RMarkdown Driven Development this wk, I tried to hit both the concepts and implementation. This technical appendix focuses on the latter to show how a plethora of great
#rstats
tools can help out.
Ladies, if he:
- wastes your money
- makes it hard to work with others
- locks you into long term contracts
- patronizes you with point-and-click GUIs that don’t preserve data lineage
He’s not your man. He’s legacy enterprise stats software. Dump him for some
#rstats
Today, I turn 30, which is exciting because I'm told that's when the Central Limit Theorem "kicks in"
While I look forward to my approximately Normal life from here on out, I'm enjoying this thread of other abused statistical fictions and methods
Loving
@rlmcelreath
‘s Statistical Rethinking from
@CRC_MathStats
as a stay-at-home read
Even if you already have a solid foundation in the methods described, the beautiful exposition is worth the read. Feels like getting to know an old friend even better
I promise you that the number of characters you save by randomly removing letters from variable names is far fewer than the number of characters you’ll type in vain misremembering those abbreviations
"Reading Bayesian Probability for Babies together, it became increasingly clear that her nephew might be more of a frequentist..."
(Sidenote: awesome, charming baby book. 10/10 would recommend! 📚)
In stats, we talk about the data generating process (DGP), yet data validation is often conducted without a theory of error generation
This post explores some failure models in ELT and implications for
#data
consumers on effective validation
🧵(1/6)
We should make 2023 the year where we stop lying to intro stats students that businesses are sitting around unsupervisedly clustering their customers with k-means all day long
NASA has lost contact with Voyager 2, the spacecraft that’s been exploring the universe for 46 years, after accidentally sending it the wrong command.
The craft was 12 billion miles away from Earth but NASA hopes they can resume communication when it is due to reset in October.
The python version of
@gt_package
just released nanoplots for tables! Loving the examples with
@DataPolars
-- beautiful syntax and output
Check it out:
New blog post on a lightweight approach to building an "advanced" but right-size data validation workflow with tools R users already know and love:
#rstats
(pointblank + projmgr pkgs), GitHub Actions / Pages / issues, and Slack notifications
I first read “Good Enough Practices in Scientific Computing” probably seven years ago now. Still so impactful - not just the recommendations but the permission structure of “good enough” which I just stole today for slides for a data engineering training
New blog post to introduce Rtistic, a hackathon-in-a-box
#rstats
repo. Blog discusses motivations and repo gives step-by-step instructions for planning a ggplot/Rmd theme building event for new useRs
Blog:
GitHub:
Apologies in advance to anyone who reads what I write in Quarto. I find the callout boxes so charming, I am 100% guilty of somewhat dramatically overusing them
Polyglot workflows are ever easier thanks to tools like Arrow, Quarto, and dbt python models. But how might these advances create new pitfalls for analysts?
In this post, I talk about how every analyst's favorite demon (nulls) behaves differently across languages
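A small taste of the problem, sketched with only the Python standard library (SQLite stands in for "SQL" here; other engines and the specific tools in the post may differ in details):

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")

# SQL: NULL = NULL evaluates to NULL (unknown), which arrives in Python as None
sql_null_eq = con.execute("select null = null").fetchone()[0]

# SQL aggregates silently skip NULLs
con.execute("create table t (x real)")
con.executemany("insert into t values (?)", [(1.0,), (None,), (3.0,)])
sql_sum, sql_avg, sql_count = con.execute(
    "select sum(x), avg(x), count(x) from t"
).fetchone()

# Python: None equals itself, while NaN does not and poisons arithmetic
py_none_eq = (None == None)
nan = float("nan")
py_nan_eq = (nan == nan)
py_sum = sum([1.0, nan, 3.0])

print(sql_null_eq, sql_sum, sql_avg, sql_count)
print(py_none_eq, py_nan_eq, math.isnan(py_sum))
```

Same "missing" value, three different behaviors: unknown in a SQL comparison, ignored in a SQL aggregate, and propagated as NaN in Python arithmetic.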
I don't know why, but my spidey senses tell me that having many analysts' first intro to scripting be via python-in-Excel will be a special kind of undebuggable reference-versus-copy quagmire
Teaching an intro data analysis class tomorrow which means I’ll be sitting alone in a room by myself for three hours and talking at my computer screen about how numbers can lie to you. Just another normal, healthy 2020 thing
Hey #rstats, what's your *most efficient* way of "stumbling upon" cool new R tools to try? Got a great question from a colleague about how they can start discovering new things to give them their own ideas, and I'm trying to think of practical advice beyond read Twitter 18 hr/day
@dgkeyes
My data never touches GitHub regardless of sensitivity. General pattern:
data lives in secure storage (S3, database, etc)
code authenticates with creds referenced as env variables
env vars live in a secured "secrets" lockbox on the server running it and populate at runtime
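The pattern above can be sketched in a few lines. This is an illustration in Python rather than the setup described in the thread, and the variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`) are hypothetical:

```python
import os


def get_db_credentials(env=os.environ):
    """Read credentials from environment variables and fail loudly if any is missing."""
    required = ["DB_USER", "DB_PASSWORD", "DB_HOST"]  # hypothetical names
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name.lower(): env[name] for name in required}


# Demo with a stand-in mapping; in real use the secrets store populates os.environ at runtime
fake_env = {"DB_USER": "analyst", "DB_PASSWORD": "not-a-real-secret", "DB_HOST": "db.internal"}
creds = get_db_credentials(fake_env)
print(sorted(creds))
```

Nothing secret ever appears in the script itself, so the code can live on GitHub while the credentials and data stay elsewhere.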
But is there really a better way to teach ggplot theme() options than to make a plot that very visibly (ab)uses all. the. options. ? 🤔
(Narrator: Yes, yes there definitely were. Many, in fact.)
Run iterations in parallel! If you’re using {purrr} this is *ridiculously* easy with
@dvaughan32
‘s {furrr}
You truly just add ‘future_’ prefixes to map functions
#data
column names can embed metadata and improve discoverability, validation, and wrangling
This is natural in
#rstats
but less so in
#sql
. In this post, I demo how custom
@getdbt
Jinja templates, macros, and schema tests can enforce con-vo contracts
🧵
I am probably never using group_by() again. To me, it’s always felt more like an adverb than a verb (and I prefer ungrouped final df) so the .by argument really jibes with me semantically!
dplyr 1.1.0 is coming soon!! 🎉🎉
We are so excited to introduce you to the new features we've been working on, including:
- Temporary inline grouping with `.by`
- Non-equi joins
- Faster `arrange()`
And SO much more!
#rstats
One huge, underappreciated value in tech twitter is that you mentally index what you learn both by the topic and the sharer for easier retrieval (mental or search). I strangely can't remember "code_download: true" for anything, but "
@apreshill
download button" never fails
TIL you can embed a "code download" button in an HTML
#rmarkdown
doc so that users can click to download your source .Rmd from the rendered HTML version...without GitHub 🤩
#rstats
YAML:
---
output:
  html_document:
    code_download: true
---
Test:
I'm increasingly developing a hypothesis that things we consider "advanced" topics in tech could be massively useful to beginners and should be introduced earlier
🧵Starting an open-ended thread to log some of these things and get reactions
(1/n)
I realized that my most manual, copy-pasty workflow was, ironically enough, hunting down the same set of links and notes about reproducibility. Now condensed in a blog post for future reference:
An interesting aspect of data work is that you need to rapidly switch between being obsessively detail-oriented and being comfortable dealing with ambiguity and stretching (w/out totally breaking) assumptions
Increasingly believe that’s a key differentiator in senior folks
Inspired by
@earino
's great GitHub on how to host a good panel, I started making some notes on how to create a good experience for speakers at satRdays Chicago
Right now, most (dis/)likes from my own experiences. Appreciate any ideas/suggestions via PR!
It's easy to go to Carolina in your mind but harder to fit
@NCSBE
's rich election
#data
into your RAM!
In this post, I explore how @duckdb (and @ApacheArrow) can help analysts tackle large datasets
+Take the batteries-included Codespaces demo for a spin
I love
@rstudio
's explicit philosophy of providing tools to make tasks easier in the IDE w/o hiding the code. If you find it challenging to access items in a nested list, for example, the Object Explorer *shows* you the correct R code. It's like having a personal
#rstats
tutor.
Continuing my look at
#python
pkgs for
#rstats
converts, this post explores polars ergonomics beyond the basics -- column selectors, window functions, nested data, etc
(polars post sponsored by the polar vortex's in-kind donation of "stay indoors time")
Another awesome thing about
@DataPolars
(for
#rstats
folks and beyond) -- it inspires equally ergonomic open-source addons
Neat project here finally makes calc'ing model metrics in a df as easy as it should be, just like any other aggregation
Nothing makes you understand the power of filter bubbles / algorithms quite like niche interests. My Twitter feed lately would imply 20% of the population is talking about nothing but
@duckdb
My Shiny hot take is that modules are **not** an advanced topic. IMHO it’s so much easier and more natural for
#rstats
users to write small, modular functions that they can independently play with and test than huge monolithic apps (1/3)
Pleased to share my
#UseR2020
lightning talk on {projmgr}. Take 5 minutes to see if this pkg can help you save hours in project management overhead
Plus, check out other videos and live talks / tutorials throughout the month thanks to
@useR2020stl
!
It's easy to learn the basics (loops!) or pkgs in a new language, but it's harder to rediscover the "didn't know I needed it, don't know what to call it, can't live w/out it" utility fxs
That's the topic of my next
#python
Rgonomics post for
#rstats
users
🧵
Today at
#rstudioconf
(
#rstudioglobal
?), I'll speak on design practices specifically for internal
#rstats
pkgs. We'll explore how strategies for API design, docs, testing, and more differ from your fav open source tools
Join me!
Thanks
@cnicault
for this excellent post on the interaction between text size and resolution in
#rstats
{ggplot2}.
One of those (previously) annoying things I always have to look up, but this incredibly cogent explanation may finally stick in my head
Shinylive +
@quarto_pub
Dashboards are almost perfect for shipping “desktop apps” as a directory to nontechnical users, but they still need to be "served" due to a quirk about WASM/browsers/https
Anyone have a process to create an executable-like experience without an install?
Four months into widespread work from home, I still cannot get it through my head that 99% of articles about "How to succeed in a virtual environment" will contain advice about working in my bedroom and not in conda
Does anyone have a Git flow they particularly like for collaborating on **analytics projects**? I have a hunch that the best branching strategy might look different than for software development but haven't fully thought through it
The beautiful thing about reproducible and tidy data analysis frameworks in
#rstats
/
@RStudio
: I don't know anything about ocean science, but reading this Nature article (), I get the sense their projects look just like my consumer credit projects
@ClausWilke
's Fundamentals of Data Visualization
Beautiful plots and brilliant advice on how to avoid "ugly", "bad", and "wrong" plots. Thoughtful analysis of what makes diff viz choices superior in diff contexts. Also goldmine repo for ggplot2 tricks
#rstats
friends, what are some of your favorite tidyverse functions that you wish were available in SQL?
I'm beginning to build a backlog for my {dbtplyr} dbt add-on package () to go further than the select-helpers. Would love ideas/priorities!
New social distancing hobby: Make list of friends you've lost touch with, mentors who gave good advice, people who showed you kindness. Every day, send one random 'thank you' note you always meant to but never "had the time". Slow COVID, spread gratitude instead
Preparing a talk:
*write half a slide*
*think of random esoteric metric about existing CRAN packages it would be nice to mention offhandedly*
*spend next half hour writing a script to generate ~3 seconds / 10 words of content*
😳
Building off the discussion below, I wrote a short blog post / code-through on how adopting Shiny modules can make app dev easier for newer developers
📜Post:
👩🏽💻GitHub:
📊App:
My Shiny hot take is that modules are **not** an advanced topic. IMHO it’s so much easier and more natural for
#rstats
users to write small, modular functions that they can independently play with and test than huge monolithic apps (1/3)
It's impossible to overstate
@xieyihui
's impact
R Markdown paved the way for best-in-class literate computation, became a core tool of open science, enabled some professors to share resources as free websites, inspired other professors to become textbook authors,
2/n🧶
It's December again and still no holiday parties. What better way to spread cheer than with a snowing #rstats Markdown?
This year, paired with a short post on a few of the neat R Markdown features that make this (and other more useful things) easy
@robjhyndman's Forecasting: Principles and Practice
Fantastic intro to forecasting building from basic principles to complex models. Also gives context to appreciate a lot of exciting work happening in {tidyverts}
This article on technical writing from
@JessHaberman
and
@gvwilson
is stellar
I’ve reviewed somewhere >75 proposals and/or manuscripts for
@CRC_MathStats
. User personas, differentiation, and forced tone particularly jump out for me
Someone asked today at the
@posit_pbc
Data Science Hangout hosted by
@_RachaelDempsey
about my book collection and some of my favorite
@CRCPress
titles in my library 📚
A **very non-comprehensive** 🧵(1/n)
Revamped blog is live thanks to the new {hugodown} package and
@juliasilge
and
@apreshill
's phenomenal blog repos which helped me hack Academic Hugo theme!
Check out {hugodown} here:
As an Anscombe's Quartet lover and summary-stats skeptic, I'm adoring these two new variants:
@PrzeBiec
's Rashomon Quartet for model performance
@StatModeling
's Causal Quartets for heterogeneous effects
(1/3)
You're telling me a duck made this database? 🦆
Not only does
@duckdb
let me bat around 22M records locally, but on top of that it's even kind enough to tell me the table I meant to query 🤩
Possibly one of the most useful lessons moving from college during the big data hype cycle to industry was that a highly novel result was more likely a data-quality issue than a groundbreaking data-driven insight
Quantitative social science is just demonstrating empirically what most people already assumed to be true, so your results should almost never be "surprising" in the literal sense and counterintuitive results often point to a modeling problem
Bragged to a coworker about the amazing thing that is @rstudio's multiline cursors, grabbed their keyboard to do a demo, and learned that on Windows 7 Ctrl+Alt+{{Arrow Key}} actually just rotates your whole screen upside down 🙃😳🙃
Excited to be talking about causal design patterns and why data scientists' love of AB tests shouldn't crowd out observational methods at
@DataSciSalon
on June 7!
Check out the line-up and consider joining us virtually (like me!) or in person:
Every GitHub Copilot demo gif makes me worry that such a tool would (among other issues) incentivize very bad code comments
Writing “what” comments that are self evident from code itself (“plot x vs y”) nudges code gen, but it’s far more useful to write “why” comments
Plenty of ink spilled already over why this is wrong... so I did mine as a picture instead! Sample size depends on incidence rate - not population size.
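The picture is not reproduced here, so as a stand-in, here is the textbook arithmetic for one common case: the sample size needed to estimate a proportion within a given margin of error. This is an assumption about the setting, not necessarily the one in the original picture, and `sample_size` is a hypothetical helper:

```python
import math


def sample_size(p, margin, population=None, z=1.96):
    """Sample size to estimate a proportion p within +/- margin at roughly 95% confidence."""
    n = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction: only matters when n is a large share of the population
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


# Changing the population by a factor of 100 barely moves the answer...
for population in (10 ** 6, 10 ** 8):
    print(population, sample_size(0.5, 0.05, population))

# ...while changing the assumed rate moves it a lot
for p in (0.5, 0.05):
    print(p, sample_size(p, 0.05))
```

The population size only enters through the correction term, which vanishes as the population grows, while the rate `p` sits in the main formula.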
Reading about expected baby milestones for my soon-to-be nephew. Delighted to learn that he'll be ready to discuss DAGs and the latest diff-in-diff literature at 5 months
H/T to tweet below for flagging this absolutely lovely paper on different causal estimands
Nice reflection on the relationships between metric collapsibility, result transportability, and baseline risk
Similarly, use simpler data types. In my ex, I was subsetting a ton of 0/1 indicators in each iteration. Order of magnitude improvement converting to logical (TRUE / FALSE). Intuitively, give R the benefit of knowing there are only two possible values
@bradleyboehmke's Hands-On Machine Learning with R
Haven't actually read, but hearing so many great things I can't help but include. First glance, I love the "Final Thoughts" sections ending each chapter highlighting cautions / shortcomings