I’d like to understand more about the world of statistics and data science, so I’ve been learning the programming language R. It has wide applications in the sciences (social and natural). There’s an IDE, freely available, called RStudio, which features a very large number of relatively sensible defaults and makes it easy to load up libraries. This meant that I didn’t have to learn much to get started, and had all the software I needed installed and ready to go in a few minutes. There’s also a set of connected libraries called the tidyverse that are obviously a huge improvement on whatever was there before.
These tools help you turn messy data (i.e., most data) into a more normalized form, and then you can filter it, reorganize it, and finally, chart it. I’ve munged a lot of CSVs in my day via all kinds of pipelines, and the tidyverse is exceptionally well organized and thoughtful. It isn’t magical, it’s reasonable.
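To give a flavor of that filter-reorganize-chart flow, here is a minimal, hypothetical tidyverse pipeline; the file name and columns (`survey.csv`, `score`, `region`) are invented for illustration:

```r
library(tidyverse)  # loads readr, dplyr, ggplot2, and friends

# A hypothetical messy CSV of survey responses
survey <- read_csv("survey.csv")

survey %>%
  filter(!is.na(score)) %>%               # drop rows with missing scores
  group_by(region) %>%                    # reorganize by region
  summarize(mean_score = mean(score)) %>% # one row per region
  ggplot(aes(region, mean_score)) +
  geom_col()                              # and finally, chart it
```

Each step takes a data frame and returns a data frame, which is what makes the pipeline style feel so organized: you can read it top to bottom as a sentence about the data.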
Like any language worth its salt, you can do anything in R, and people do. It has web servers and drum machines. But I’m suspicious of those. This is a language focused on transforming and visualizing data. The end result of an R program is something you feed into a report, a scientific paper, a work of analysis, a dashboard. RStudio makes it very easy to work with notebooks, which are a mix of prose, HTML, and code. It’s a special-purpose language but a very flexible one.
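For instance, an R Markdown notebook is just a text file that interleaves prose with runnable code chunks; this skeleton is illustrative, not a specific document from the essay:

````
---
title: "A small analysis"
output: html_document
---

Some prose explaining the chart below.

```{r}
library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + geom_point()
```
````

When you knit it, the prose becomes HTML, the code runs, and the chart is embedded in the output, which is exactly the report-or-paper shape an R program tends to aim at.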
I’ve always been very interested in special-purpose programming languages, like Csound for writing songs, or TeX for composing documents, or NetLogo for running simulations. For some reason I always come back to them at the holidays. Nothing is more fun after I get the kids to bed than learning some programming language that draws 3D scenes, like POV-Ray. So I’m also enjoying R at that level. These languages have rarely been a major topic of study in the history of programming languages, although that is thankfully changing.
And they all raise a question that I think about a lot: Why is one language “general purpose” and another “domain-specific”? As I was starting my career, the big battle was between “compiled” languages like C++ or FORTRAN, which required you to convert source code to machine code before it ran, and “interpreted” languages like Python, PHP, or Perl, which didn’t. Which was better?
It turned out that when you were programming web services, the ability to change a line of code, save it, and hit reload in the browser sped up everything dramatically. Eventually the lines always blur: People figure out ways to speed up the interpreted languages, and also create faster compiled languages. These battles can be surprisingly fierce, but over the years I’ve lost interest in them. I like whatever (1) makes the computer go; (2) can be understood by mortals in a year or two; and (3) makes the current programmers happy.
I’ve also been talking to and interacting with more people doing data science from inside different disciplines — not data scientists per se, just people doing their jobs — and what keeps surprising me is that they’re programmers but don’t identify as programmers. The data they know about is the data relevant to their discipline; the programming they know about consists of operations upon that data. They don’t think too much about the state of their code as it runs, or where the variables go once they’re used up. And a lot of their code is imperative with big, messy variable names. They write code that needs to be run once a week, not a million times a minute. The end result is not a web API or front-end that needs to run all day but a CSV or an SVG with a graph.
I kept wondering: What’s the difference between what they do with, say, climate data, in R, and what we do with TypeScript, Elixir, Java, Swift, or Rust? What am I doing differently when I write R code to convert messy Excel files into charts that I can use to support a business plan, versus when I write some Flask code in Python for a small web project?
I think it comes down to three things:
- First, we, the general-purpose software people, write functions that can be run millions or billions of times without breaking. (One hopes.) There’s a lot of subtlety in that work.
- Second, we create interfaces and experiences on top of our code that can be highly specific to a particular job or role — commodities trading platforms, or systems for reporting train schedules to millions of riders. That takes a ton of design work and product thinking, work that would probably get in the way of someone who just wants their CSV to export.
- And finally, we instrument and analyze the performance of our code and systems because we want them to be as efficient as possible, and we get extremely worried when something takes half a second, because we typically need it to take about 20 milliseconds. And because we need to report analytics to our clients, who want to know how people are using the app.
Do R programmers do all of this too? Sure, lots of them do. Especially people writing libraries or building big systems for other R users. Are there lots of exceptions to the above? Absolutely. Nonetheless, I think there’s something in it:
Domain-specific languages are designed to create specific kinds of outputs; general-purpose languages are designed to keep running forever without any problems.
Elixir is a great example: It is optimized for tons of parallel network connections, which is great for chatty client/server applications (like chat apps). You can do the same work in lots of languages, of course. We live in a world of riches. But compiled Elixir code is designed to stay up and run forever, catching errors, never failing.
When I asked people if I should use R, some said yes, and some said, “You already know Python, why bother?” RStudio is great, and R is great, and the tidyverse tooling is just a joy to use. Plus, the language itself wasn’t hard to learn. But it’s about what you can do with it.