10 Smart R programming tips to become a better R programmer

November 4, 2017 Chaitanya Sagar


Coding is the process by which a programmer converts tasks from human-readable logic into machine-readable language. Part of what makes coding so popular is its flexibility: there are so many ways to do the same thing that programmers often no longer know which choice is the right one.

As a result, each programmer has their own style of implementing the same part of an algorithm.

Writing code can sometimes be the most difficult and time-consuming part of any project. If the code is written in such a way that it is hard to change or requires a lot of work for every small update, then the investments will keep on piling up and more and more issues will crop up as the project progresses.

Good, well-written code is reusable, efficient and cleverly written by a smart programmer. This is what sets programmers apart from each other.

So, here are some tips for becoming a SMART coder:

1. Writing Code for Programmers, Developers, and Even Laymen

Though code is primarily written for the machine to understand, it should also be structured and well organized so that other developers, and even laymen, can understand it. In reality, code should be written for all three.

Those who keep this fact in mind are one step ahead of other coders, while those who make sure everyone can understand their code are miles ahead of their struggling peers.

Good programmers always document their code and make use of an IDE. I will use the R language to explain the concept. Using an IDE such as RStudio makes it easier to write code quickly.

The main advantage, available in almost every IDE, is the auto-completion feature, which suggests a function or command once part of it has been typed.

An IDE also suggests the syntax of the selected function, which saves time. The RStudio environment pane additionally lists the variables in the environment along with some basic details about each one.

Documentation is another skill that differentiates good programmers from the rest.

Let’s look at this viewpoint using an example. Say you read the following code:

# Code snippet 1
a <- 16
b <- a / 2
c <- (a + b) / 2

# Code snippet 2
# store the max memory size
a <- 16
# taking half of the maximum memory as the minimum memory
b <- a / 2
# taking mean of maximum and minimum memory as the recommended memory
c <- (a + b) / 2

# Code snippet 3
# store the max memory size
max_mem <- 16
# taking half of the maximum memory as the minimum memory
min_mem <- max_mem / 2
# taking mean of maximum and minimum memory as the recommended memory
mean_mem <- (max_mem + min_mem) / 2

These three snippets highlight the difference documentation makes, and this is just a simple demonstration of code understandability.

The first snippet is difficult to understand. It just sets the values of three variables. There are no comments, and the variable names do not explain anything.

The second snippet explains that ‘a’ is the maximum memory, ‘b’ is the minimum memory and ‘c’ is the mean of the two.

Without the comments in code snippet 2, no one can tell whether the calculation for ‘c’ is correct or not.

The third snippet goes a step further: the variable names describe what is stored in them.

The third snippet is the easiest to understand even though all three perform the same task. Moreover, when the variables are used elsewhere, the names in the third snippet are self-explanatory, so a programmer will not need to search the code for what they store unless an error occurs.

2. Knowing how to Improve

R offers multiple ways to achieve a task. The options differ in their trade-offs: one may use more memory, another may run faster, and a third may rely on a different algorithm or logic.

Whenever possible, good programmers make this choice wisely.

R can execute code in parallel. Lengthy tasks such as fitting models can be run in parallel to save time. Other tasks can also be made faster depending on the logic and packages used.
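As a minimal sketch of running model fitting in parallel, the snippet below fits one linear model per group on two worker processes, using the parallel package that ships with R. The data set (the built-in mtcars), the grouping by cyl, the model formula and the choice of two workers are all assumptions made purely for illustration.

```r
library(parallel)

# Split the built-in mtcars data into one data frame per cylinder count
groups <- split(mtcars, mtcars$cyl)

# Start two worker processes (an assumed number; adjust to your machine)
cl <- makeCluster(2)

# Fit one linear model per group, spreading the groups across the workers
models <- parLapply(cl, groups, function(d) lm(mpg ~ wt, data = d))

# Always release the workers when done
stopCluster(cl)

# One fitted model per group
length(models)
```

For a task this small the overhead of starting workers outweighs the gain; the pattern pays off only when each fit takes a noticeable amount of time.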

As an illustration of how the choice of package matters, the following two code snippets perform the same task, one with the sqldf package and the other with the dplyr package.

# Using sqldf version
install.packages("sqldf")
library(sqldf)
Out_df <- sqldf("select * from table_a left outer join table_b on table_a.var_x = table_b.var_x")

# Using dplyr version
install.packages("dplyr")
library(dplyr)
Out_df <- left_join(table_a, table_b)

I personally prefer the dplyr version whenever possible. However, there are some differences between the outputs.

By default, the dplyr version looks for all variables that have the same name in both tables and joins on them. If there is more than one such variable and I want to join on only some of them, I need to name them in the by argument. Moreover, a left join in dplyr does not keep both copies of the variable used to join the tables, whereas sqldf does.

One advantage of sqldf is that it is not case sensitive and can easily join tables even if the variable names in the two tables are completely different. However, it is slower than dplyr.
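To make the role of the by argument concrete, here is a small sketch. The tables, their columns and their values are hypothetical and exist only for this illustration; it assumes the dplyr package is installed.

```r
library(dplyr)

# Two small hypothetical tables that share the key column var_x
table_a <- data.frame(var_x = c(1, 2, 3), a_val = c("a", "b", "c"))
table_b <- data.frame(var_x = c(1, 3), b_val = c(10, 30))

# Name the join key explicitly instead of relying on matching column names
out_df <- left_join(table_a, table_b, by = "var_x")

# If the key has different names in the two tables, a named vector maps them
table_c <- data.frame(key_c = c(1, 3), c_val = c(TRUE, FALSE))
out_df2 <- left_join(table_a, table_c, by = c("var_x" = "key_c"))
```

Stating the key explicitly also protects the join from changing silently if a new column with a shared name is added to either table later.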

3. Writing Robust Code

While writing code, you can either keep the code simple but situation-specific or write generic code. One way in which programmers write simple but situation-specific code is ‘hard coding’.

This is the term for fixing the values of variables directly in the code, and it is never recommended.

For example, dividing the sum of all salaries in a 50,000-row salary data set by 50,000, rather than by the number of rows, may seem to amount to the same thing, but the two mean different things in a program.

If the number of rows in the data changes, the number 50,000 has to be found and updated. If the programmer misses this small change, all the work goes down the drain. The latter approach, on the other hand, adjusts automatically and is therefore the robust method.
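The salary example can be sketched as follows. The data frame and its four salary values are made up for illustration, which also shows how badly the hard-coded divisor goes wrong once the row count is no longer 50,000.

```r
# Hypothetical salary data; the values are made up for illustration
salary_data <- data.frame(salary = c(42000, 55000, 61000, 38000))

# Hard-coded: silently wrong as soon as the number of rows is not 50,000
mean_hard_coded <- sum(salary_data$salary) / 50000

# Robust: the divisor is read from the data itself
mean_robust <- sum(salary_data$salary) / nrow(salary_data)

# Equivalent built-in function
mean_builtin <- mean(salary_data$salary)
```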

Another common programming issue, quite specific to languages such as R, is code portability. Code that runs on one computer may not work on another because the other computer is missing some packages or has outdated versions of them.

Such cases can be handled by first checking which packages are installed and then installing the missing ones. Together, these practices can be called robust programming, and they help keep the code free of such errors.

Here is an illustration that checks for and installs or updates the h2o package.

# If the h2o package is already loaded, unload it
if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
# If the h2o package is already installed, uninstall it
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
# methods
if (!("methods" %in% rownames(installed.packages()))) { install.packages("methods") }
# statmod
if (!("statmod" %in% rownames(installed.packages()))) { install.packages("statmod") }
# stats
if (!("stats" %in% rownames(installed.packages()))) { install.packages("stats") }
# graphics
if (!("graphics" %in% rownames(installed.packages()))) { install.packages("graphics") }
# RCurl
if (!("RCurl" %in% rownames(installed.packages()))) { install.packages("RCurl") }
# jsonlite
if (!("jsonlite" %in% rownames(installed.packages()))) { install.packages("jsonlite") }
# tools
if (!("tools" %in% rownames(installed.packages()))) { install.packages("tools") }
# utils
if (!("utils" %in% rownames(installed.packages()))) { install.packages("utils") }

# Finally install and load h2o package
install.packages("h2o")
library(h2o)
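The repeated check-then-install lines above could also be folded into a small helper. The sketch below is one possible way to do that; the function names missing_packages and ensure_packages are hypothetical and not part of any package.

```r
# Return the subset of `pkgs` that cannot be loaded on this machine
missing_packages <- function(pkgs) {
  is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_available]
}

# Install only what is missing, so the script can be rerun safely
ensure_packages <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# stats and utils ship with R, so nothing is reported as missing here
missing_packages(c("stats", "utils"))
```

Calling ensure_packages(c("statmod", "RCurl", "jsonlite")) would then install only those of the three that are absent.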

4. When to Use Shortcuts and When Not to

Using shortcuts may be tempting when you want to write code swiftly, but the right practice is to know when to use them.

For instance, keyboard shortcuts are really helpful and can always be used. My favorite shortcuts in RStudio on Windows are Ctrl+L, which clears the console, and Ctrl+Shift+C, which comments or uncomments all selected lines of code in one go.

Another kind of shortcut is a temporary patch or a faulty fix, and these are not desirable.

Here are some examples of faulty fixes.

# This code changes a particular column name without checking its existing name
colnames(data_f)[5] <- "new_name"

# This drops column 5 by keeping columns 1 to 4 and 6 to 10 by position. It may drop an important column, and the code gives an error if there are fewer than 10 columns
data_f <- data_f[, c(1:4, 6:10)]

# This converts a value to numeric without checking if it actually contains only numbers. If the value is not a number, as.numeric() produces NA by coercion
Num_val <- "123"
# The following converts Num_val to 123 correctly
Num_val <- as.numeric(Num_val)
char_val <- "A_Name"
# The following issues a warning and converts char_val to NA as it is not a number
char_val <- as.numeric(char_val)
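For contrast, here is a sketch of how the same three operations might be written with the checks in place. The data frame, its column names and the helper to_numeric_checked are hypothetical and serve only as an illustration.

```r
# Hypothetical data frame used only for this illustration
data_f <- data.frame(
  id       = 1:3,
  old_name = c("a", "b", "c"),
  unused   = c(TRUE, FALSE, TRUE)
)

# Rename a column by name, and only if it is actually present
if ("old_name" %in% names(data_f)) {
  names(data_f)[names(data_f) == "old_name"] <- "new_name"
}

# Drop columns by name; names that are absent are simply ignored
cols_to_drop <- c("unused", "not_present")
data_f <- data_f[, setdiff(names(data_f), cols_to_drop), drop = FALSE]

# Convert to numeric, but stop with a clear error instead of creating NAs
to_numeric_checked <- function(x) {
  converted <- suppressWarnings(as.numeric(x))
  bad <- is.na(converted) & !is.na(x)
  if (any(bad)) {
    stop("cannot convert to numeric: ", paste(unique(x[bad]), collapse = ", "))
  }
  converted
}

num_val <- to_numeric_checked("123")
```

Calling to_numeric_checked("A_Name") stops with an error rather than quietly returning NA, so the problem surfaces where it occurs.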

5. Reduce Effort Through Code Reuse

When you start writing code, you don’t need to waste time on a piece of logic that has already been written. This is better known as “code reuse”: you can always reuse code you wrote previously, or search Google to tap into the large R community.

Don’t be afraid to search. Looking up solutions that have already been implemented is very helpful for learning the methods commonly used in similar situations and the pros and cons associated with them.

Even when it becomes necessary to reinvent the wheel, the existing solutions can serve as a benchmark for testing your new solution. An equally important part of writing code is making your own code reusable.

Here are two snippets which highlight reusability.

# Code which needs to be edited before reusing it
for (i in 1:501) {
  df[, i] <- as.numeric(df[, i])
}

# Code which can be reused with less editing
for (i in seq_len(ncol(df))) {
  df[, i] <- as.numeric(df[, i])
}
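Taking reusability one step further, the loop can be wrapped in a function so that it works on any data frame without editing. This is a sketch only; the function name convert_all_to_numeric and the example data are hypothetical.

```r
# Convert every column of a data frame to numeric, whatever its width
convert_all_to_numeric <- function(df) {
  df[] <- lapply(df, as.numeric)  # df[] keeps the data frame structure
  df
}

# Hypothetical example data with numbers stored as text
raw <- data.frame(x = c("1", "2"), y = c("3.5", "4.5"), stringsAsFactors = FALSE)
clean <- convert_all_to_numeric(raw)
```

Note that as.numeric() applied to a factor returns its internal integer codes, so factor columns would need as.character() first.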

6. Write Planned Out Code

Writing code on the fly may be a cool skill to have, but it does not help in writing efficient code. Coding is most efficient when you know what you are writing.

Always plan and write your logic on a piece of paper before implementing it. Getting into the habit of adding tabs, spaces and basic formatting as you code is another time-saving skill for a good programmer.

For instance, the body of every new ‘if’, ‘for’ or ‘while’ statement can be indented with tabs so that the nesting is clearly visible. Although optional, such formatting separates blocks of code and helps in identifying breakpoints as well as in debugging.

A more rigorous but helpful approach is to write code using functions and modules, to explain every section with examples in comments, and to print progress inside loops and conditions. Ultimately, it is up to each programmer how they choose to document and log their code.
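A small sketch of what this can look like in practice: a function with a usage example in its comments, consistent indentation for each nested block, and a progress message inside the loop. The function column_means and its verbose argument are hypothetical names chosen for this illustration.

```r
# Compute the mean of each numeric column of a data frame.
# Example: column_means(data.frame(a = c(1, 3), b = c(2, 4)))
column_means <- function(df, verbose = TRUE) {
  result <- numeric(0)
  for (col_name in names(df)) {
    if (is.numeric(df[[col_name]])) {
      # Log progress so a long run shows where it currently is
      if (verbose) {
        message("Processing column: ", col_name)
      }
      result[col_name] <- mean(df[[col_name]], na.rm = TRUE)
    }
  }
  result
}

means <- column_means(data.frame(a = c(1, 3), b = c(2, 4), label = c("x", "y")))
```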

7. Active Memory Management

Adding memory-handling code is a double-edged sword. It may not be useful for small-scale programs because it slows down execution, but it is nevertheless a great skill to have for writing scalable code.

In R, removing variables and data frames with the rm() function when they are no longer required, triggering garbage collection with the gc() command, and selecting only the relevant features and data before proceeding are all ways to manage memory.

Adjusting RAM usage with memory.limit() (available only on Windows, and no longer supported in recent versions of R) and setting up parallel processing are also part of managing your memory usage. Remember! Memory management goes hand in hand with data backup.

It only takes a few seconds to create and store copies of data. This should be done to ensure that no data is lost if backtracking is required.

Have a look at this example snippet, which stores the master data and then frees up memory.

# dividing master dataset into train and test with ratio 7:3
library(dplyr)
train <- sample_frac(master_data, 0.7)
train_ind <- as.numeric(rownames(train))
test <- master_data[-train_ind, ]
# saving backup of master_data and removing unneeded data
write.csv(master_data, "master_data.csv")
rm(master_data)
rm(train_ind)
gc()
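The snippet above recovers the training rows from row names, which only works if the row names still correspond to row positions after sampling. As an alternative sketch in base R, the split can be driven by sampled row positions directly. The small master_data frame, the seed and the temporary backup path are assumptions made for this illustration.

```r
# Hypothetical master data set; replace with your own data
master_data <- data.frame(id = 1:10, value = (1:10) * 1.5)

# Draw 70% of the row positions at random; set.seed makes the split repeatable
set.seed(42)
train_ind <- sample(nrow(master_data), size = round(0.7 * nrow(master_data)))

train <- master_data[train_ind, ]
test  <- master_data[-train_ind, ]

# Back up the master data before removing it from memory
backup_path <- file.path(tempdir(), "master_data.csv")
write.csv(master_data, backup_path, row.names = FALSE)

# Free the objects that are no longer needed
n_master <- nrow(master_data)
rm(master_data, train_ind)
invisible(gc())
```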

8. Remove Redundant Tasks

Sometimes programmers perform the same task repeatedly, or forget to remove code that is no longer needed, without realizing it.

Writing separate iterations for each data manipulation step, leaving libraries loaded after they are no longer required, not removing features until the last moment, and running multiple joins and queries are some examples of redundancy lurking in your code.

These happen somewhat naturally as more and more changes are made and new logic is added, so it is good practice to look at the existing code and fit your new lines into it to save runtime.

Redundancy can slow your code so much that removing it can do wonders for execution speed.

# Redundant code
# Takes about 0.5 seconds for iris data
# mis must exist before it can be filled element by element
mis <- numeric(ncol(df))
for (i in seq_len(ncol(df))) {
  df[, i] <- as.numeric(df[, i])
}
for (i in seq_len(ncol(df))) {
  # storing missing values per column in mis vector
  mis[i] <- length(which(is.na(df[, i])))
}

# Better implementation (implementations faster than the one below also exist)
# Gives a similar output but takes about 0.3 seconds for iris data - 35% improvement
mis <- numeric(ncol(df))
for (i in seq_len(ncol(df))) {
  df[, i] <- as.numeric(df[, i])
  # storing missing values per column in mis vector
  mis[i] <- length(which(is.na(df[, i])))
}
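As the comment above notes, faster implementations exist. One possible sketch replaces both loops with vectorized calls; the example data frame is hypothetical, and no timing is claimed for it.

```r
# Hypothetical data with numbers stored as text; "n/a" cannot be converted
df <- data.frame(a = c("1", "2", "n/a"), b = c("4", "5", "6"),
                 stringsAsFactors = FALSE)

# Convert every column in one call; unparseable entries become NA
df[] <- suppressWarnings(lapply(df, as.numeric))

# Count missing values per column without an explicit loop
mis <- colSums(is.na(df))
```

Here mis is a named vector with one count per column, so the result can be looked up by column name rather than by position.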

9. Learn to Adapt

No matter how good a programmer you are, you can always be better! This tip is not about typical coding practices but about teamwork. Sharing and understanding code from peers, reading code online (such as from repositories), keeping yourself up to date with books and blogs, and learning about new technologies and packages released for R are some ways to learn.

Being flexible and adaptive to new methods, and keeping up with what’s happening in the analytics industry, can help you avoid becoming obsolete with old practices.

10. Peer Review

The code you write may be straightforward for you but very complex for everyone else. How will you know that? The only way is to find out what others think about it.

Code review is thus last on this list but not least in importance for better coding. Ask people to go through your code and be open to suggested edits. You may come across situations where code you thought was beautifully written can be replaced with more efficient code.

Code review helps both the coder and the reviewer, as it is a way of helping each other to improve and move forward.

The Path is Not So Difficult: Conclusion

Becoming a good programmer is no easy feat, but becoming better at programming as you progress is possible. Though it will take time, persevering in building strong programming habits will make you a strong asset to every team.

These tips are just the beginning, and there may be more ways to improve. The drive to keep improving will take you forward and let you taste the sweet results of being a hi-tech programmer.

In the rapidly changing analytics world, staying current with the latest tools and techniques is a priority, and being good at R programming can be a prime factor in the progress of your analytics career.

So go out there and acquaint yourself with the techniques for becoming better at R programming.

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Chaitanya Sagar, and Saneesh Veetil contributed to this article.

Perceptive Analytics provides data analytics, business intelligence and reporting services to e-commerce, retail and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.
