
R or T-SQL? One Button Changes Everything

Here’s a story: a team trained a model, everything worked fine—until the dataset doubled. Suddenly, their R pipeline crawled for hours. The root cause wasn’t the algorithm at all. It was compute context. They were running in local compute, dragging every row across the network into memory. One switch to SQL compute context pushed the R script to run directly on the server, kept the data in place, and turned the crawl into a sprint.

That’s the rule of thumb: if your dataset is large, prefer SQL compute context to avoid moving rows over the network. Try it yourself—run the same R script locally and then in SQL compute. Compare wall-clock time and watch your network traffic. You’ll see the difference.
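
As a rough sketch of that experiment, the RevoScaleR calls might look like the block below. The connection string, the dbo.FlightData table, and its columns are placeholders, not anything from the episode.

```r
# Minimal sketch, assuming RevoScaleR is installed and a hypothetical
# dbo.FlightData table; swap in your own server, database, and columns.
library(RevoScaleR)

conn <- "Driver=SQL Server;Server=MYSERVER;Database=Analytics;Trusted_Connection=True"
flights <- RxSqlServerData(
  connectionString = conn,
  sqlQuery = "SELECT ArrDelay, DepDelay FROM dbo.FlightData"
)

# Run 1: local compute context -- every row travels over the network into local memory.
rxSetComputeContext("local")
t_local <- system.time(rxLinMod(ArrDelay ~ DepDelay, data = flights))

# Run 2: SQL Server compute context -- the same call executes next to the data.
rxSetComputeContext(RxInSqlServer(connectionString = conn))
t_sql <- system.time(rxLinMod(ArrDelay ~ DepDelay, data = flights))

rbind(local = t_local, sql = t_sql)  # compare the "elapsed" (wall-clock) column
```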

And once you understand that setting, the next question becomes obvious: where’s the real drag hiding when the data starts to flow?

The Invisible Bottleneck

What most people don’t notice at first is a hidden drag inside their workflow: the invisible bottleneck. It isn’t a bug in your model or a quirk in your code—it’s the way your compute context decides where the work happens.

When you run in local compute context, R runs on your laptop. Every row from SQL Server has to travel across the network and squeeze through your machine’s memory. That transfer alone can strangle performance. Switch to SQL Server compute context, and the script executes inside the server itself, right next to the data. No shuffling rows across the wire, no bandwidth penalty—processing stays local to the engine built to handle it.

A lot of people miss this because small test sets don’t show the pain. Ten thousand rows? Your laptop shrugs. Ten million rows? Now you’re lugging a library home page by page, wondering why the clock melted. The fix isn’t complex tuning or endless loop rewrites. It’s setting the compute context properly so the heavy lifting happens on the server that was designed for it.

That doesn’t mean compute context is a magic cure-all. If your data sources live outside SQL Server, you’ll still need to plan ETL to bring them in first. SQL compute context only removes the transfer tax if the data is already inside SQL Server. Think of it this way: the server’s a fortress smithy; if you want the blacksmith to forge your weapon fast, you bring the ore to him rather than hauling each strike back and forth across town.

This is why so many hours get wasted on what looks like “optimization.” Teams adjust algorithms, rework pipeline logic, and tweak parameters trying to speed things up. But if the rows themselves are making round trips over the network, no amount of clever code will win. You’re simply locked into bandwidth drag. Change the compute context, and the fight shifts in your favor before you even sharpen the code.

Still, it’s worth remembering: not every crawl is caused by compute context. If performance stalls, check three things in order. First, confirm compute context—local versus SQL Server. Second, inspect your query shape—are you pulling the right columns and rows, or everything under the sun? Third, look at batch size, because how many rows you feed into R at a time can make or break throughput. That checklist saves you from wasting cycles on the wrong fix.

Notice the theme: network trips are the real tax collector here. With local compute, you pay tolls on every row. With SQL compute, the toll booths vanish. And once you start running analysis where the data actually resides, your pipeline feels like it finally got unstuck from molasses.

But even with the right compute context, another dial lurks in the pipeline—how the rows are chunked and handed off. Leave that setting on default, and you can still find yourself feeding a beast one mouse at a time. That’s where the next performance lever comes in.

Batch Size: Potion of Speed or Slowness

Batch size is the next lever, and it behaves like a potion: dose it right and you gain speed, misjudge it and you stagger. In SQL Server, the batch size is controlled by the `rowsPerRead` parameter. By default, `rowsPerRead` is set to 50,000. That’s a safe middle ground, but once you start working with millions of rows, it often starves the process—like feeding a dragon one mouse at a time and wondering why it still looks hungry.

Adjusting `rowsPerRead` changes how many rows SQL Server hands over to R in each batch. Too few, and R wastes time waiting for its next delivery. Too many, and the server may choke, running out of memory or paging to disk. The trick is to find the point where the flow into R keeps it busy without overwhelming the system.

A practical way to approach this is simple: test in steps. Start with the default 50,000, then increase to 500,000, and if the server has plenty of memory, try one million. Each time, watch runtime and keep an eye on RAM usage. If you see memory paging, you’ve pushed too far. Roll back to the previous setting and call that your sweet spot. The actual number will vary based on your workload, but this test plan keeps you on safe ground.
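
A minimal sketch of that stepping test, reusing the hypothetical connection string and table from earlier, could look like this:

```r
# Minimal sketch of the stepping test; connection string and table are placeholders.
library(RevoScaleR)

conn  <- "Driver=SQL Server;Server=MYSERVER;Database=Analytics;Trusted_Connection=True"
query <- "SELECT ArrDelay, DepDelay FROM dbo.FlightData"

for (batch in c(50000, 500000, 1000000)) {
  ds <- RxSqlServerData(connectionString = conn,
                        sqlQuery    = query,
                        rowsPerRead = batch)   # rows handed to R per chunk
  secs <- system.time(rxSummary(~ ArrDelay + DepDelay, data = ds))["elapsed"]
  cat(sprintf("rowsPerRead = %7d : %.1f seconds\n", batch, secs))
  # Watch server RAM while each run executes; if it starts paging, step back down.
}
```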

The shape of your data matters just as much as the row count. Wide tables—those with hundreds of columns—or those that include heavy text or blob fields are more demanding. In those cases, even if the row count looks small, the payload per row is huge. Rule of thumb: if your table is wide or includes large object columns, lower `rowsPerRead` to prevent paging. Narrow, numeric-only tables can usually handle much larger values before hitting trouble.

Once tuned, the effect can be dramatic. Raising the batch size from 50,000 to 500,000 rows can cut wait times significantly because R spends its time processing instead of constantly pausing for the next shipment. Push past a million rows and you might get even faster results on the right hardware. The runtime difference feels closer to a network upgrade than a code tweak—even though the script itself hasn’t changed at all.

A common mistake is ignoring `rowsPerRead` entirely and assuming the default is “good enough.” That choice often leads to pipelines that crawl during joins, aggregations, or transformations. The problem isn’t the SQL engine or the R code—it’s the constant interruption from feeding R too slowly. On the flip side, maxing out `rowsPerRead` without testing can be just as costly, because one oversized batch can tip memory over the edge and stall the process completely.

That balance is why experimentation matters. Think of it as tuning a character build: one point too heavy on offense and you drop your defenses, one point too light and you can’t win the fight. Same here—batch size is a knob that lets you choose between throughput and resource safety, and only trial runs tell you where your system maxes out.

The takeaway is clear: don’t treat `rowsPerRead` as a background setting. Use it as an active tool in your tuning kit. Small increments, careful monitoring, and attention to your dataset’s structure will get you to the best setting faster than guesswork ever will.

And while batch size can smooth how much work reaches R at once, it can’t make up for sloppy queries. If the SQL feeding the pipeline is inefficient, then even a well-tuned batch size will struggle. That’s why the next focus is on something even more decisive: how the query itself gets written and whether the engine can break it into parallel streams.

The Query That Unlocks Parallel Worlds

Writing SQL can feel like pulling levers in a control room. Use the wrong switch and everything crawls through one rusty conveyor. Use the right one and suddenly the machine splits work across multiple belts at once. Same table, same data, but the outcome is night and day. The real trick isn’t about raw compute—it’s whether your query hands the optimizer enough structure to break the task into parallel paths.

SQL Server will parallelize happily—but only if the query plan gives it that chance. A naive “just point to the table” approach looks simple, but it often leaves the optimizer no option but a single-thread execution. That’s exactly what happens when you pass `table=` into `RxSqlServerData`. It pulls everything row by row, and parallelism rarely triggers. By contrast, defining `sqlQuery=` in `RxSqlServerData` with a well-shaped SELECT gives the database optimizer room to generate a parallel plan. One choice silently bottlenecks you; the other unlocks extra workers without touching your R code.

You see the same theme with SELECT statements. “SELECT *” isn’t clever, it’s dead weight. Never SELECT *. Project only what you need, and toss the excess columns early. Columns that R can’t digest cleanly—like GUIDs, rowguids, or occasionally odd timestamp formats—should be dropped or cast in SQL itself, or wrapped in a view before you hand them to R. A lean query makes it easier for the optimizer to split tasks, and it keeps memory from being wasted on junk you’ll never use.
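
Side by side, the two styles look roughly like this in RevoScaleR; dbo.Sales and its columns are stand-ins, not a real schema:

```r
# Minimal sketch contrasting a whole-table reference with a shaped query.
library(RevoScaleR)
conn <- "Driver=SQL Server;Server=MYSERVER;Database=Analytics;Trusted_Connection=True"

# Whole-table reference: simple, but typically pulled serially, row by row.
sales_all <- RxSqlServerData(connectionString = conn, table = "dbo.Sales")

# Shaped query: project only needed columns, cast awkward types in SQL,
# and give the optimizer a plan it can parallelize.
sales_lean <- RxSqlServerData(
  connectionString = conn,
  sqlQuery = "
    SELECT OrderId,
           CAST(OrderDate AS date)  AS OrderDate,
           CAST(Amount AS float)    AS Amount,
           Region
    FROM   dbo.Sales
    WHERE  OrderDate >= '2024-01-01'"
)  # no SELECT *, no GUID/rowguid columns, types settled before R sees them
```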

Parallelism also extends beyond query shape into how you call R from SQL Server. There are two main dials here. If you’re running your own scripts through `sp_execute_external_script` and not using RevoScaleR functions, explicitly set `@parallel = 1`. That tells SQL it can attempt parallel processes on your behalf. But if you are using the RevoScaleR suite—the functions with the rx* prefix—then parallel work is managed automatically inside the SQL compute context, and you steer it with the `numTasks` parameter. Just remember: asking for 8 or 16 tasks doesn’t guarantee that many will spin up. SQL still honors the server’s MAXDOP and resource governance. You might request 16, but get 6 if that’s all the server is willing to hand out under current load. The lesson is simple: test both methods against your workload, and watch how the server responds.
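
On the RevoScaleR side, that dial sits on the compute context itself. Here is a minimal sketch with a made-up server and task count; the `@parallel = 1` switch belongs in the T-SQL call to `sp_execute_external_script` and is not shown here.

```r
# Minimal sketch of requesting parallel tasks from the SQL compute context.
library(RevoScaleR)
conn <- "Driver=SQL Server;Server=MYSERVER;Database=Analytics;Trusted_Connection=True"

# Request up to 8 parallel tasks; the server may grant fewer under MAXDOP
# and resource-governance limits.
rxSetComputeContext(RxInSqlServer(connectionString = conn, numTasks = 8))

ds <- RxSqlServerData(connectionString = conn,
                      sqlQuery = "SELECT ArrDelay, DepDelay FROM dbo.FlightData")
rxLinMod(ArrDelay ~ DepDelay, data = ds)  # rx* functions pick up numTasks automatically
```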

One smart diagnostic step is to check your query in Management Studio before ever running it with R. Execute it, right-click the plan, and look: do you see parallel streams, or is it a single-line serial path? A missing index, a sloppy SELECT, or too broad a scan can quietly kill parallelism. Fix the index, rewrite the projection, give the optimizer better doors to walk through. Watching the execution plan is like scouting the dungeon map before charging in—you’ll know if you’re sending a whole party or just one unlucky rogue.

Small mistakes quickly stack. Ask for every column “just in case,” and you’ll drag twice as much payload as needed, only to drop most of it in R. Include a problem datatype, and rows get stuck in costly conversions. Directly reference the table without a query, and SQL plays it safe by running serial. None of this is glamorous debugging—it’s self-inflicted slog. Clean up the query, and parallelism often clicks on automatically, delivering a speed boost so sharp you wonder why you ever re-optimized R code instead.

Think of query structure as the difference between a narrow hallway and a set of double doors. With only one opening, threads line up one after another, processing until finished. Add multiple entry points through filters, joins, and selective column pulls, and the optimizer splits work across threads, chewing through the dataset far faster. It’s the same castle, but instead of one knight shuffling through a gate, you get squads breaching together.

Under the hood, SQL Server does the heavy decision-making: indexes, joins, datatypes, workload—all weighed before granting or denying a parallel plan. Your job is to tip the odds by making queries easier to split. Keep them lean. Project only the essentials. Test in Management Studio. And when possible, guide the system with `@parallel=1` or tuned `numTasks` in rx functions. Get those details right, and you’re not adding more compute—you’re multiplying efficiency by unlocking the workers already there.

The bigger point here is simple: sloppy SQL sabotages performance far more than clever batching or exotic R tricks. A query shaped cleanly, tested for parallelism, and trimmed of junk makes your pipelines feel light. A lazy one drags the entire server down long before your modeling code ever runs. You don’t need heroics to fix it—you just need to hand SQL Server a map it knows how to split.

Of course, even with a tuned query feeding rows quickly and in parallel, there’s another kind of slowdown waiting. It’s not about how the data enters R, but what you choose to do with it after. Because if you start reshaping fields and calculating extra columns in the middle of the fight, you’ll slow yourself down in ways you didn’t even expect.

The Trap of On-the-Fly Transformations

Here’s the next common snare: the trap of on-the-fly transformations. It looks convenient—tossing calculated fields, type conversions, or cleanup steps directly into your R model scripts—but it carries a hidden tax that grows heavier with scale.

The problem is how SQL Server and R actually talk. When you code a transformation inside R, it isn’t applied once to a column. It’s applied to every row in every batch. Each row must move from SQL storage into the analytics engine, then hop into the R interpreter, then back out again. That hop burns cycles. With small data, you barely notice. With millions of rows, the repeated trips pile up until your training loop crawls.

It’s a workflow design issue, not a math trick. The SQL engine is built to crunch set operations across entire datasets, while R is built to analyze data once it’s already clean. Forcing R to clean row by row means you lose both advantages at once. It’s extra communication overhead that doesn’t need to exist.

The faster, cleaner method is to stage transformed data before you begin analysis. Add derived variables in your source table where possible, or apply them through T-SQL in a view. If changing the base table isn’t an option, spin up a temp table or a dedicated staging table where the transformations are cast and materialized. Then point your `RxSqlServerData` call at that object. At runtime, R sees the ready-to-use columns, so the model focuses on analysis instead of constant prep.
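
As a rough illustration, suppose a staging view (a made-up dbo.vw_CustomerFeatures) already materializes the derived columns in T-SQL; the R side then simply points at it:

```r
# Minimal sketch; the view below is hypothetical and would be created once in T-SQL:
#   CREATE VIEW dbo.vw_CustomerFeatures AS
#   SELECT CustomerId,
#          DATEDIFF(day, FirstOrderDate, LastOrderDate) AS TenureDays,
#          TotalSpend / NULLIF(OrderCount, 0)           AS AvgOrderValue
#   FROM   dbo.Customers;
library(RevoScaleR)
conn <- "Driver=SQL Server;Server=MYSERVER;Database=Analytics;Trusted_Connection=True"

features <- RxSqlServerData(
  connectionString = conn,
  sqlQuery = "SELECT TenureDays, AvgOrderValue FROM dbo.vw_CustomerFeatures"
)

# R receives ready-made columns; nothing is reshaped inside the model loop.
rxSummary(~ TenureDays + AvgOrderValue, data = features)
```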

Yes, creating views or staging tables adds a little upfront work. But that investment pays back fast. Each query or batch now flows once, instead of bouncing between engines for every calculation. Removing those repeated per-row round trips often saves hours in full training runs. It’s one of those optimizations that feels small at setup but changes the whole cadence of your pipeline.

Even basic cleanup tasks fit better in SQL. Trim leading or trailing spaces with `TRIM()` or `LTRIM()`/`RTRIM()`. Normalize capitalization with `UPPER()` or `LOWER()`. Standardize string consistency with `REPLACE()`. By the time R sees the dataset, the inconsistencies are already gone, with no mid-loop conversions needed.

Type conversions are another classic slowdown if left in R. Many times, numerical fields arrive as text. Strip symbols or units inside SQL, then cast the field to integer or decimal before handing it to R. Converting a revenue column from “$10,000” strings into a numeric type is much cheaper in T-SQL once than across millions of rows in the R interpreter. The same goes for timestamps—cast them at the source instead of repeatedly parsing in R.

Even more advanced steps, like identifying outliers, can be offloaded. SQL functions can calculate percentiles, flag outliers based on interquartile range, or replace nulls with defaults. By the time the dataset lands in R compute, it’s already standardized and consistent. That avoids the cut-by-cut bleeding effect of running those transformations in every iteration.
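
A minimal sketch of pushing that cleanup and casting into the feeding query itself, with invented table and column names:

```r
# Minimal sketch: trims, casts, and NULL defaults happen in T-SQL before hand-off.
library(RevoScaleR)
conn <- "Driver=SQL Server;Server=MYSERVER;Database=Analytics;Trusted_Connection=True"

clean_orders <- RxSqlServerData(
  connectionString = conn,
  sqlQuery = "
    SELECT OrderId,
           UPPER(LTRIM(RTRIM(Region))) AS Region,
           TRY_CAST(REPLACE(REPLACE(Revenue, '$', ''), ',', '') AS decimal(18,2)) AS Revenue,
           COALESCE(DiscountPct, 0) AS DiscountPct
    FROM   dbo.Orders"
)

# By the time rows reach R, strings are trimmed, revenue is numeric,
# and NULL discounts already have a default.
rxGetInfo(clean_orders, getVarInfo = TRUE, numRows = 5)
```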

The payoff is speed now and stability later. Faster prep means shorter iteration loops, more time for tuning models, and lower server costs since resources aren’t wasted on redundant translation work. And because your transformations sit in views or staging tables, you have a consistent reference dataset for audits and re-runs. In production environments, that consistency matters as much as raw speed.

The opposite case is easy to spot. A script looks clean in the editor, but in runtime the job thrashes: huge back-and-forth chatter between SQL, the analytics engine, and R. CPUs run hot for the wrong reasons. The server is fine—it’s just doing the extra lifting you accidentally told it to. That’s why the rule is simple: transform before the model loop, never during it.

Treat this as table stakes. Once you move cleanup and formatting to SQL, R becomes a sharper tool—focused on modeling, not janitorial work. Your workflow simplifies, and the runtime penalty disappears without needing exotic configuration.

And just as important, when you think the data is finally “ready,” there’s another kind of variable waiting that can quietly tank performance. Not numeric, not continuous—categories. Handle them wrong, and they become the next hidden slowdown in your pipeline.

The Categorical Curse

The next hazard shows up when you start dealing with categorical data. This is the so‑called categorical curse, and it strikes when those fields aren’t flagged properly before they make the jump from SQL into R.

In R, categories are handled as factors. Factors aren’t just plain text—they’re objects with defined levels, labels, and pointers. That’s how R’s modeling functions know that “red,” “blue,” and “green” are classes, not just unrelated strings. The catch is that if your source data doesn’t come in with levels defined, R has to improvise. And that improvisation translates into wasted runtime.

Take a common setup: categories stored as integers in SQL Server. Database folks like it—compact storage, simple joins, fewer bytes on disk. But pass that column straight into R and suddenly R has to stop and decode. It does this by converting each integer into a string, then mapping those back into factors on the fly. That’s an extra round trip of conversions baked into every batch. It looks neat in SQL, but at R runtime it stacks into painful slowdowns.

Picture it like shelving items in a warehouse with boxes labeled 1 through 50, but tossing away the contents chart. Every time a picker shows up, they have to crack open the box to see what’s inside. It works, technically, but multiply that across thousands of picks and your “tidy numbering system” has turned the floor into a bottleneck.

The cleaner way is to bring a catalog with you. In practice, that means using the `colInfo` argument in RevoScaleR when you create your data source. With `colInfo`, you tell the system outright: “1 equals apple, 2 equals orange, 3 equals banana.” Defined once, R doesn’t need to guess or do runtime re‑mapping. The integers still store efficiently in SQL, but by the time they cross into R they arrive fully labeled, ready for modeling.

The same advice applies even if your column already uses strings. If your SQL column holds “apple,” “orange,” and “banana” in plain text, you could let R scan the column and infer levels. But that inference process eats cycles and can burn you later if an oddball value sneaks in. Instead, still set `colInfo` with the exact levels you expect. That way, R treats the values as factors as soon as they enter memory, no scanning, no guessing. It’s like giving the dungeon master the roster of NPCs before the game starts—the table knows who belongs before the party rolls initiative.

For example, when constructing `RxSqlServerData`, you might pass something like `colInfo = list(fruit = list(type = "factor", levels = as.character(1:3), newLevels = c("apple","orange","banana")))` if the source is integers. Or if the source is strings, you can simply declare `colInfo = list(fruit = list(type="factor", levels = c("apple","orange","banana")))`. Either way, you’re telling R what those categories mean before the rows leave SQL. That upfront declaration removes the need for runtime sniffing or triple conversions.
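
Spelled out as a runnable sketch, the integer-coded case looks like this; the dbo.FruitSales table and connection string are hypothetical:

```r
# Minimal sketch, assuming a hypothetical dbo.FruitSales table whose fruit
# column stores the integer codes 1-3.
library(RevoScaleR)
conn <- "Driver=SQL Server;Server=MYSERVER;Database=Analytics;Trusted_Connection=True"

fruit_info <- list(
  fruit = list(type      = "factor",
               levels    = as.character(1:3),               # codes as stored in SQL
               newLevels = c("apple", "orange", "banana"))  # labels R should attach
)

sales <- RxSqlServerData(connectionString = conn,
                         sqlQuery = "SELECT fruit, quantity FROM dbo.FruitSales",
                         colInfo  = fruit_info)

rxGetInfo(sales, getVarInfo = TRUE)  # fruit arrives as a factor with fixed levels
```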

Beyond speed, this has a stability payoff. Predefining factor levels ensures that training and scoring data agree on how categories are encoded. Without it, R might sort levels in the order it encounters them—which can change depending on the data slice. The result is unstable models, inconsistent encoding, and predictions that wobble for no good reason. With `colInfo`, you lock categories to the same map every time, regardless of order or sample.

One more trick: reuse those definitions. If you’ve declared `colInfo` for training, carry the same mapping into production scoring or retraining runs. That consistency means your factors never shift under your feet. Consistent factor encoding improves speed, keeps model inputs stable, and avoids surprise rerolls when you move from prototype to deployment.
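
Continuing that sketch, carrying the mapping forward is just a matter of passing the same list to each data source; the training and scoring tables here are made up:

```r
# Reuse the fruit_info list from the previous sketch for both data sources,
# so factor encoding never shifts between training and scoring.
train_ds <- RxSqlServerData(connectionString = conn,
                            sqlQuery = "SELECT fruit, quantity, sold FROM dbo.FruitSales_Train",
                            colInfo  = fruit_info)
score_ds <- RxSqlServerData(connectionString = conn,
                            sqlQuery = "SELECT fruit, quantity FROM dbo.FruitSales_New",
                            colInfo  = fruit_info)

model <- rxLogit(sold ~ fruit + quantity, data = train_ds)  # sold assumed 0/1
preds <- rxPredict(model, data = score_ds)                  # same level-to-label map as training
```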

If you ignore factor handling, the punishment comes slowly. On a small test set, you won’t see it. But scale to millions of rows and the runtime slug creeps in. Each batch grinds longer than the last. What looked efficient in design turns into clogged pipelines in practice. That’s the categorical curse—it doesn’t knock you down right away, but it builds until the backlog overwhelms the system.

The escape is simple: define levels up front with `colInfo` and let the database hold the raw codes. No runtime guessing, no constant conversions, no silent performance leak. Categories stop being a hidden curse and start behaving like any other well‑typed field in your analysis.

Handle them correctly, and suddenly your pipeline steps in rhythm. Compute context does its job, batch size feeds R efficiently, queries run parallel, transformations are cleaned before they hit the loop, and categorical variables arrive pre‑named. Each piece aligns, so instead of scattered fixes you get a system that feels like it’s actually built to run. And when every gear meshes, performance stops being luck and starts looking like something you can count on.

Conclusion

Here’s the bottom line: performance gains don’t come from flashy algorithms, they come from discipline in setup. There are three rules worth burning into memory. First, put compute where the data is—use SQL compute context so the server carries the load. Second, feed R in real meals, not crumbs—tune `rowsPerRead` so batches are big but still safe for memory. Third, let the database shape the data before hand‑off—tight queries, staged views, and clear `colInfo` for factors.

Data prep takes the lion’s share of effort in any project. Experts often cite it as 80–90% of the total work, which means slow prep wastes entire weeks, but smart prep gains them back.

If this saved you time, hit subscribe and ring the bell—your future self will thank you. And if you like the persona, keep the quip: resistance is futile.
