Introduction
Starting to work with a new tool may not be easy, especially when that “tool” means a new programming language. At the same time, there might be an opportunity to build upon something already known and so make the transition smoother and less painful.
In my case, it was the transition from R to Python. Unfortunately, my former colleagues, pythonians, did not share my genuine fondness for R. Also, whether R enthusiasts like it or not, Python is a widely used tool in data analysis/engineering/science and beyond. So, I concluded that learning at least some Python is a reasonable thing to do.
For me, the first steps were perhaps the most difficult ones. Having grown used to the comfort of RStudio, I did not find IDEs like PyCharm or Atom familiar. This experience led me to begin in the well-known environment and test its limits when it comes to working with Python.
To tell the truth, I did not end up with RStudio as my weapon of choice for using Python in a general setting. Hopefully, the following text will explain why. However, I am convinced that for some use cases, like integrating R and Python in an ad hoc R Markdown analysis, RStudio still represents a viable way to go.
More importantly, it could be a convenient starting line for people whose primary background is in R.
So, what did I find?
Analysis
Packages and environment
First of all, let us set the environment and load the required packages.
- Global environment:
# Globally print numbers with 2 significant digits
options(digits=2)
# Force R to use regular numbers instead of using the e+10-like notation
options(scipen = 999)
- R libraries:
# Load the required packages.
# If these are not available, install them first.
ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg,
                     dependencies = TRUE)
  sapply(pkg,
         require,
         character.only = TRUE)
}
packages <- c("tidyverse", # Data wrangling
              "gapminder", # Data source
              "knitr", # R Markdown styling
              "htmltools") # .html files manipulation
ipak(packages)
tidyverse gapminder knitr htmltools
TRUE TRUE TRUE TRUE
In this analysis, I will be working with reticulate, a package developed at RStudio. However, feel free to look for alternatives.
Also, I am going to load the reticulate package separately for the sake of a clearer flow.
- Python activation:
# The "weapon of choice"
library(reticulate)
# Specify which version of Python to use.
use_python("/home/vg/anaconda3/bin/python3.7",
           required = TRUE) # Locate and run Python
- Python libraries:
import pandas as pd # data wrangling
import numpy as np # arrays, matrices, math
import statistics as st # functions for calculating mathematical statistics
import plotly.express as px # plotting package
One of the limits of working with Python in R (Studio) is that in some cases you do not receive the Python traceback by default. That is a problem because, to oversimplify, a traceback helps you identify where the problem is and therefore helps you fix it. So, think of it as an error message that goes missing.
- For example, when calling a library that you do not have installed, the Python chunk in R Markdown gives you a green light (so everything looks up and running), but this does not mean that the code ran the way you would expect (e.g. that it imported the library).
To deal with that, I suggest checking in Terminal that your libraries are installed. If they are not, you can always install them and import them afterwards.
For example, I will first import the json package, which is already installed on my machine. I will do so by using Terminal here in RStudio. In addition, let me try to import the TensorFlow package.
From the following picture, you can see that there is no package TensorFlow. So, let me switch back to bash and install the package:
The working directory was changed to /home/vg inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
Then go to the directory where the newly installed TensorFlow package is located, switch to Python and import the package once again:
All right. For more information, take a look at the Installing Python Modules page.
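The check described above can also be done from within Python itself. The following is only a minimal sketch: the helper name is_installed and the made-up module name are my own illustrations, not part of the original workflow. It relies on importlib.util.find_spec from the standard library, which looks a module up without importing it.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with the standard library, so it should always be found.
# "surely_not_a_real_module_xyz" is a made-up name used only for illustration.
for name in ("json", "surely_not_a_real_module_xyz"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Running such a check at the top of a Python chunk makes a missing library visible even when the chunk itself reports no error.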
Data
As we now have R and Python set up, let us import some data to play with. I will be working with a sample from Gapminder, an intriguing project related to the socio-demography of the world’s population. To be more specific, I will be working with the latest available data within the gapminder package.
- Data import in R:
# Let us begin with the most recent set of data
gapminder_latest <- gapminder %>%
filter(year == max(year))
So, now we have the data loaded in R. Unfortunately, it is not possible to access R objects (e.g. vectors or tibbles) directly from Python. So, we need to convert the R object(s) first.
- Data import in Python:
# Convert R Data Frame (tibble) into the pandas Data Frame
gapminder_latest_py = r['gapminder_latest']
One important thing to realise when working with Python objects (e.g. arrays or pandas Data Frames) is that they are not explicitly stored in the environment as R objects are.
- In other words, if we want to know what is stored in the workspace, we must call functions like dir(), globals() or locals():
['R', '__annotations__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', 'gapminder_latest_py', 'np', 'pd', 'px', 'r', 'st', 'sys']
Great, among the present objects, we can clearly see the data (gapminder_latest_py) and the libraries (e.g. px).
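As a small illustration of how such a listing can be narrowed down, here is a minimal sketch. The variable names are hypothetical stand-ins, not the objects from this analysis. It filters out the double-underscore entries that Python creates on its own, so that only user-defined names remain.

```python
# Hypothetical objects standing in for the data and results of an analysis.
life_expectancy = [43.828, 76.423, 72.301]
n_countries = len(life_expectancy)

# globals() returns a dictionary of every name defined at the top level;
# drop the double-underscore entries that Python creates on its own.
user_names = sorted(name for name in globals() if not name.startswith("__"))
print(user_names)
```

The result is the closest Python analogue to glancing at the Environment pane in RStudio.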
So, let us explore the data a bit!
Life Expectancy
For demonstration purposes, I will focus on life expectancy, i.e. the average number of years a person is expected to live.
Descriptive statistics
Let’s begin with calculating some descriptive statistics like mean, median or the number of rows in the data using Python:
# Descriptive statistics for the inline code in Python
## Data Frame Overview
### Number of rows
gapminder_latest_shape = gapminder_latest_py.shape[0]
### Number of distinct values within the life expectancy variable
gapminder_latest_count_py = gapminder_latest_py['lifeExp'].nunique()
## Life Expectancy
### Median (Life Expectancy)
gapminder_latest_median_lifeExp_py = st.median(gapminder_latest_py['lifeExp'])
### Mean
gapminder_latest_mean_lifeExp_py = st.mean(gapminder_latest_py['lifeExp'])
### Minimum
gapminder_latest_min_lifeExp_py = min(gapminder_latest_py['lifeExp'])
### Maximum
gapminder_latest_max_lifeExp_py = max(gapminder_latest_py['lifeExp'])
### Standard deviation
gapminder_latest_stdev_lifeExp_py = st.stdev(gapminder_latest_py['lifeExp'])
Nice. Unfortunately, we are not able to use the Python objects for inline code, one of the key features of literate programming in R Markdown. So, if we want to use the results inline, we need to transfer the Python objects back to R:
# Descriptive statistics for the inline code in Python - transformed to R
## Data Frame Overview
### Number of rows
gapminder_latest_nrow_r = py$gapminder_latest_shape
### Number of distinct values within the life expectancy variable
gapminder_latest_count_r = py$gapminder_latest_count_py
## Life Expectancy
### Median (Life Expectancy)
gapminder_latest_median_lifeExp_r = py$gapminder_latest_median_lifeExp_py
### Mean
gapminder_latest_mean_lifeExp_r = py$gapminder_latest_mean_lifeExp_py
### Minimum
gapminder_latest_min_lifeExp_r = py$gapminder_latest_min_lifeExp_py
### Maximum
gapminder_latest_max_lifeExp_r = py$gapminder_latest_max_lifeExp_py
### Standard deviation
gapminder_latest_stdev_lifeExp_r = py$gapminder_latest_stdev_lifeExp_py
So, what can we say about life expectancy in 2007?
First of all, there were 142 countries on the list. The minimum value of life expectancy was 39.61 years, the maximum 82.6 years.
The mean life expectancy was 67.01 years, and the median was 71.94 years, meaning that in half of the countries life expectancy was 71.94 years or more. Lastly, the standard deviation was 12.07 years.
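One detail worth knowing about the statistics module used above: st.stdev computes the sample standard deviation (dividing by n - 1), which matches what sd() does in R, whereas st.pstdev computes the population version. The following minimal sketch shows the difference on just the first five values of the lifeExp column, so its numbers differ from the full results reported above.

```python
import statistics as st

# First five life expectancy values from the 2007 data, used as a small sample.
sample = [43.828, 76.423, 72.301, 42.731, 75.32]

median_life_exp = st.median(sample)
mean_life_exp = st.mean(sample)
sample_sd = st.stdev(sample)       # sample standard deviation (divides by n - 1)
population_sd = st.pstdev(sample)  # population standard deviation (divides by n)

print(f"median: {median_life_exp}")
print(f"mean: {mean_life_exp:.2f}")
print(f"sample sd: {sample_sd:.2f}, population sd: {population_sd:.2f}")
```

Because of the smaller divisor, the sample standard deviation is always somewhat larger than the population one for the same data.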
Graphs (using Plotly)
Okay, let’s move to something else, like graphs.
For example, we can take a look at how life expectancy is distributed across the globe using Plotly:
fig = px.histogram(gapminder_latest_py, # package.function; Data Frame
x="lifeExp", # Variable on the X axis
range_x=(gapminder_latest_min_lifeExp_py,
gapminder_latest_max_lifeExp_py), # Minimum and maximum values for the X axis
labels={'lifeExp':'Life expectancy - in years'}, # Naming of the interactive part
color_discrete_sequence=['#005C4E']) # Colour of fill
lifeExpHist = fig.update_layout(
title="Figure 1. Life Expectancy in 2007 Across the Globe - in Years", # The name of the graph
xaxis_title="Years", # X-axis title
yaxis_title="Count", # Y-axis title
font=dict( # "css"
family="Roboto",
size=12,
color="#252A31"
))
lifeExpHist.write_html("lifeExpHist.html") # Save the graph as a .html object
Unfortunately, it is not possible to print interactive Plotly graphs in R Markdown via Python. Or, to be more precise, printing the graph (e.g. print(lifeExpHist)) only returns a textual Figure object:
Figure({
'data': [{'alignmentgroup': 'True',
'bingroup': 'x',
'hovertemplate': 'Life expectancy - in years=%{x}<br>count=%{y}<extra></extra>',
'legendgroup': '',
'marker': {'color': '#005C4E'},
'name': '',
'offsetgroup': '',
'orientation': 'v',
'showlegend': False,
'type': 'histogram',
'x': array([43.828, 76.423, 72.301, 42.731, 75.32 , 81.235, 79.829, 75.635, 64.062,
79.441, 56.728, 65.554, 74.852, 50.728, 72.39 , 73.005, 52.295, 49.58 ,
59.723, 50.43 , 80.653, 44.741, 50.651, 78.553, 72.961, 72.889, 65.152,
46.462, 55.322, 78.782, 48.328, 75.748, 78.273, 76.486, 78.332, 54.791,
72.235, 74.994, 71.338, 71.878, 51.579, 58.04 , 52.947, 79.313, 80.657,
56.735, 59.448, 79.406, 60.022, 79.483, 70.259, 56.007, 46.388, 60.916,
70.198, 82.208, 73.338, 81.757, 64.698, 70.65 , 70.964, 59.545, 78.885,
80.745, 80.546, 72.567, 82.603, 72.535, 54.11 , 67.297, 78.623, 77.588,
71.993, 42.592, 45.678, 73.952, 59.443, 48.303, 74.241, 54.467, 64.164,
72.801, 76.195, 66.803, 74.543, 71.164, 42.082, 62.069, 52.906, 63.785,
79.762, 80.204, 72.899, 56.867, 46.859, 80.196, 75.64 , 65.483, 75.537,
71.752, 71.421, 71.688, 75.563, 78.098, 78.746, 76.442, 72.476, 46.242,
65.528, 72.777, 63.062, 74.002, 42.568, 79.972, 74.663, 77.926, 48.159,
49.339, 80.941, 72.396, 58.556, 39.613, 80.884, 81.701, 74.143, 78.4 ,
52.517, 70.616, 58.42 , 69.819, 73.923, 71.777, 51.542, 79.425, 78.242,
76.384, 73.747, 74.249, 73.422, 62.698, 42.384, 43.487]),
'xaxis': 'x',
'yaxis': 'y'}],
'layout': {'barmode': 'relative',
'font': {'color': '#252A31', 'family': 'Roboto', 'size': 12},
'legend': {'tracegroupgap': 0},
'margin': {'t': 60},
'template': '...',
'title': {'text': 'Figure 1. Life Expectancy in 2007 Across the Globe - in Years'},
'xaxis': {'anchor': 'y', 'domain': [0.0, 1.0], 'range': [39.613, 82.603], 'title': {'text': 'Years'}},
'yaxis': {'anchor': 'x', 'domain': [0.0, 1.0], 'title': {'text': 'Count'}}}
})
So, we import the previously created .html file instead (e.g. using the includeHTML function from the htmltools package):