I note for future use a couple of things to be aware of when using R to schedule the execution of SQL scripts.
09 Dec 2017
I’ve been writing on this blog less frequently in the past few months. Mostly this is because I’ve been working on a very intensive and engaging professional project that is much more hands-on (i.e. coding) than my work has been for a while. So a lot of energy has been going into that. One interesting side effect for me has been diving deep into Structured Query Language (SQL), and the Microsoft-flavoured Transact-SQL (T-SQL) in particular. I’ve used SQL for a long time, but usually with the attitude of “get it out of the database as quickly as possible”. In a situation where this didn’t make sense, I’ve been pleasantly surprised at how powerful and flexible SQL is for the right sort of jobs.
The bulk of the work project is in SQL, but there is an R flavour with a Shiny front end and a bunch of testing and semi-automation of the build process going on in R (long story). Here are a couple of quirky and useful things relating to using R and SQL Server in combination.
Using R to schedule SQL scripts
Imagine a project that combines SQL and R. For example, SQL is used to do a bunch of heavy lifting data management and complex queries in the database; and R is used for statistical modelling and producing polished outputs. This is actually a very common scenario. Relational databases are very powerful tools with many decades of optimisation embedded in them. They aren’t going anywhere soon.
“Whether your first data analysis language is R, Python, Julia or SAS, your second language should be SQL”
Quote by me - something I just thought of.
It’s good practice to keep any non-trivial SQL queries in their own files with a .sql suffix, and to develop them in a database-aware environment. With SQL Server, this will often mean SQL Server Management Studio. But when you come to the final stage of reproducible analysis, you don’t want to be flicking between two applications; certainly not in terms of actually running anything. Although SQL Server since 2016 can include R code in a stored procedure, if it’s basically a statistical project it’s still going to have the workflow of “database first, R second”, with the statistical and presentation stage probably developed in RStudio or similar. So it’s very useful to be able to run a bunch of .sql scripts from R. This is commonly done by reading in the script with readLines() and executing it on the database via RODBC or other database connection software.
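In its simplest form, that pattern looks something like this (the connection string and file name here are placeholders to adapt to your own setup):

```r
library(RODBC)

# placeholder connection string - substitute your own server and database:
ch <- odbcDriverConnect(
  "driver={SQL Server};server=my-server;database=my-database;trusted_connection=true"
)

# read the script in as a single string and execute it on the database:
sql <- paste(readLines("my-query.sql"), collapse = "\n")
results <- sqlQuery(ch, sql)
```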
I developed an R function sql_execute() to make this process efficient. A fuller version is available in my pssmisc R package (only on GitHub at this point), which is a grab-bag of multi-use functionality associated with this blog; a simplified sketch is shown below. The original version had a few project-specific features as well as a few cut corners. It also had an accompanying simple function that uses sql_execute() to run all the SQL scripts in a given folder in order.
The sql_execute() function provides the following benefits:
combines the multiple steps of reading in SQL scripts and executing them in a single command
handles a common problem where files developed in Management Studio often aren’t saved in an encoding automatically recognised by R
allows the use of the GO batch separator, a Microsoft-specific addition to SQL that will cause problems if included in an ODBC query
lets you specify a search-and-replace for a string - very useful sometimes if you’re running lots of SQL scripts to be able to say something like “oh by the way, can you change all references to database X to database Y while you’re at it”
lets you specify if an error in one batch should be fatal, or whether to proceed with the rest of the batches from that script
logs execution and results in a table in the database.
Dealing with GO was particularly important, because it’s common (and often essential) in T-SQL development. GO divides a single script into batches. The sql_execute() function below embraces this, by splitting the original file into separate queries based on the location of GO, and sending the individual batches one at a time to the server.
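Here is a simplified sketch of the function, to show the idea. The full version in pssmisc adds the logging to a database table and more careful handling of Management Studio’s file encodings, and its argument names may differ from the illustrative ones used here:

```r
library(RODBC)
library(stringr)

# Simplified sketch of sql_execute(). The fuller version also logs
# execution and results to a database table, and copes with files saved
# by Management Studio in UCS-2 / UTF-16, which readLines() won't
# recognise by default.
sql_execute <- function(channel, filename, sub_in = NULL, sub_out = NULL,
                        error_action = "stop"){
  # read the script in as a single string:
  sql <- paste(readLines(filename, warn = FALSE), collapse = "\n")

  # optional search-and-replace, eg to point at a different database:
  if(!is.null(sub_in)){
    sql <- gsub(sub_in, sub_out, sql, fixed = TRUE)
  }

  # split the script into batches on the GO separator, which ODBC
  # can't handle if passed through as-is:
  batches <- str_split(sql, regex("^\\s*GO\\s*$", multiline = TRUE,
                                  ignore_case = TRUE))[[1]]

  res <- NULL
  for(batch in batches){
    if(str_trim(batch) == "") next  # skip empty batches
    res <- sqlQuery(channel, batch)
    # on error, sqlQuery() returns a character vector of messages
    # rather than a data frame:
    if(is.character(res) && length(res) > 0){
      if(error_action == "stop"){
        stop(paste(res, collapse = "\n"))
      } else {
        print(res)
      }
    }
  }
  # return the result of the last batch, invisibly:
  invisible(res)
}
```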
One thing to watch out for is that running a script via ODBC can give different results from hitting F5 in Management Studio. A key trap is what happens if the SQL includes a SELECT statement that doesn’t INSERT the results into another table or variable, but returns them as a table. In this case, ODBC considers its work done and will not execute anything else in the batch beyond that SELECT statement.
To clarify how this works, here is a potentially problematic SQL file:
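Something like this, with some_table standing in for a real table:

```sql
-- eg-sql.sql: three batches, separated by GO

SELECT TOP 5 * FROM some_table

GO

This is not legitimate SQL and will return an error

GO

SELECT TOP 7 * FROM some_table

SELECT TOP 10 * FROM some_table
```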
If I run that file via sql_execute(ch, "examples/eg-sql.sql"), it does the following:
executes the SELECT TOP 5 statement and returns the results as a data frame, which is discarded as it is not the result of the last batch of the script
tries to execute the non-legitimate SQL, gets an error and stops.
Alternatively, if I run it via sql_execute(ch, "examples/eg-sql.sql", error_action = "continue"), it does the following:
executes the SELECT TOP 5 statement and returns the results as a data frame, which is discarded as it is not the result of the last batch of the script
tries to execute the non-legitimate SQL, gets an error and prints it to the screen.
executes the SELECT TOP 7 statement, returns the results as a data frame, and stops. The SELECT TOP 10 statement in the same batch is never executed.
An odd quirk with SQL loops cutting short with ODBC
A second quirk that had me puzzled for a while (and indeed I am still puzzled and can’t get a fully reproducible example) seems to relate to the use of SQL WHILE loops in a script executed on the database from R via ODBC. I found many such SQL programs would silently stop after about 20 iterations of the loop under ODBC, even though they worked perfectly in Management Studio. The examples all look like this:
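In outline, they had this shape (with the real work of each iteration replaced by a placeholder):

```sql
DECLARE @i INT = 1

WHILE @i <= 100
BEGIN
  -- in the real examples, something useful happened here, such as
  -- acting on the @i'th table in a list:
  PRINT @i
  SET @i = @i + 1
END
```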
BTW, SQL is one of those languages where you avoid loops if you can, and think instead in terms of joining, aggregating and filtering tables. But there are times when it is necessary (for example, performing an action on each table in a database, such as creating a carefully chosen random sample of it in another database - one of the things we had to do in the work project mentioned above).
The solution to this mysterious refusal to go beyond about 20 (it varied) iterations in some loops was to wrap the whole action in a user-defined stored procedure, then execute the procedure. This seems satisfyingly secure in all sorts of ways. The procedure can be kept permanently or blown away depending on what makes sense:
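Something like this, with do_the_loop as a made-up name for illustration:

```sql
-- wrap the loop in a stored procedure so it runs to completion:
CREATE PROCEDURE do_the_loop AS
BEGIN
  DECLARE @i INT = 1
  WHILE @i <= 100
  BEGIN
    PRINT @i
    SET @i = @i + 1
  END
END
GO

-- run it, then drop it if it's not wanted permanently:
EXECUTE do_the_loop
GO

DROP PROCEDURE do_the_loop
```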
Worth noting - T-SQL distinguishes between its functions and stored procedures, whereas R lumps the two types of functionality together. Functions in SQL are true functions in the computer-science sense: they take inputs and return outputs, with strictly no side effects. Stored procedures can have side effects (like creating or modifying tables). In R, functions can have side effects (and frequently do, e.g. drawing plots), not just return outputs based on inputs.
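For illustration, here is a minimal made-up example of each:

```sql
-- a function: takes inputs, returns an output, no side effects:
CREATE FUNCTION dbo.add_one (@x INT)
RETURNS INT AS
BEGIN
  RETURN @x + 1
END
GO

-- a stored procedure: side effects allowed, such as creating a table:
CREATE PROCEDURE make_example_table AS
BEGIN
  CREATE TABLE dbo.example_table (x INT NOT NULL)
END
```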