Data Must Flow

Introduction

Starting to work with a new tool may not be easy, especially when that “tool” means a new programming language. At the same time, there might be an opportunity to build upon something already known and so make the transition smoother and less painful.

In my case, it was the transition from R to Python. Unfortunately, my former colleagues, all Pythonistas, did not share my genuine fondness for R. Also, whether R enthusiasts like it or not, Python is widely used in data analysis, engineering, science and beyond. So, I concluded that learning at least some Python was a reasonable thing to do.

For me, the first steps were perhaps the most difficult ones. Having settled into the comfort of RStudio, I did not find IDEs like PyCharm or Atom familiar. This experience led me to begin in a well-known environment and test its limits when it comes to working with Python.

To tell the truth, I did not end up with RStudio as my weapon of choice for Python in a general setting. Hopefully, the following text will explain why. However, I am convinced that for some use cases, like integrating R and Python in an ad hoc R Markdown analysis, RStudio is still a viable way to go.

More importantly, it could be a convenient starting line for people whose primary background is in R.

So, what did I find?

Analysis

Packages and environment

First of all, let us set the environment and load the required packages.

  • Global environment:
# Globally limit printed numbers to 2 significant digits
options(digits = 2)

# Force R to use regular numbers instead of using the e+10-like notation
options(scipen = 999)
  • R libraries:
# Load the required packages.
# If these are not available, install them first.
ipak <- function(pkg) {
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) {
    install.packages(new.pkg, dependencies = TRUE)
  }
  sapply(pkg, require, character.only = TRUE)
}

packages <- c("tidyverse", # Data wrangling
              "gapminder", # Data source
              "knitr",     # R Markdown styling
              "htmltools") # .html files manipulation

ipak(packages)
tidyverse gapminder     knitr htmltools 
       TRUE      TRUE      TRUE      TRUE 

In this analysis, I will be working with the reticulate library, a package developed at RStudio. However, feel free to look for alternatives.

Also, I am going to load the reticulate package separately to keep the flow clearer.

  • Python activation:
# The "weapon of choice"
library(reticulate)

# Specify which version of Python to use.
use_python("/home/vg/anaconda3/bin/python3.7",
           required = TRUE) # Locate and run Python
  • Python libraries:
import pandas as pd # data wrangling
import numpy as np # arrays, matrices, math
import statistics as st # functions for calculating mathematical statistics
import plotly.express as px # plotting package
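
Since the interpreter is chosen on the R side, it can be worth confirming from inside a Python chunk which interpreter is actually running. The following is a minimal sketch using only the standard library; interpreter_info is a hypothetical helper name, and the sketch assumes that sys.executable reflects the interpreter reticulate bound to (on the R side, reticulate's py_config() reports similar information).

```python
import sys


def interpreter_info():
    """Return the path and version of the Python interpreter running this code."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    return {"executable": sys.executable, "version": version}


info = interpreter_info()
print(info["executable"])  # path of the active interpreter
print(info["version"])     # major.minor.micro version string
```

If the printed path is not the one passed to use_python(), the chunk is probably running against a different Python installation than intended.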

One of the limits of working with Python in R (Studio) is that in some cases you do not receive the Python traceback by default. That is a problem because, to oversimplify, a traceback helps you identify where the problem is, and therefore helps you fix it. So, think of it as an error message (or the lack of one).

  • For example, when you import a library that you do not have installed, the Python chunk in R Markdown gives you a green light (so everything looks up and running), but this does not mean that the code ran the way you would expect (e.g. that it imported the library).
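
One way to make a failed import visible regardless of how the chunk reports errors is to catch the exception and print the traceback explicitly. This is only a sketch, not something reticulate requires: try_import is a hypothetical helper, and not_a_real_package_xyz is a made-up module name used to force a failure.

```python
import importlib
import traceback


def try_import(module_name):
    """Import a module by name; return (module, None) or (None, traceback text)."""
    try:
        return importlib.import_module(module_name), None
    except ImportError:
        # format_exc() returns the full traceback as a string, so it can be
        # printed explicitly even if the chunk output hides it.
        return None, traceback.format_exc()


# "not_a_real_package_xyz" is a hypothetical name used to force a failure.
module, error = try_import("not_a_real_package_xyz")
if error is not None:
    print(error)
```

Because the traceback is returned as ordinary text, it can also be stored in a variable and inspected later.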

To deal with that, I suggest checking in Terminal that your libraries are installed. If they are not, you can always install them and import them afterwards.
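
The same check can also be sketched in Python itself, without importing the package. The example below relies on importlib.util.find_spec from the standard library; is_installed is a hypothetical helper name, and the check is meant for top-level module names only.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ("json", "tensorflow"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Note that the result describes the interpreter running the code, so it is only meaningful when that is the same interpreter the R Markdown chunks use.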

For example, I will first import the json package which is already installed on my machine. I will do so by using Terminal here in RStudio. In addition, let me try to import the TensorFlow package.

From the following picture, you can see that there is no package TensorFlow. So, let me switch back to bash and install the package:

The working directory was changed to /home/vg inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.


Then go to the directory where the newly installed TensorFlow package is located, switch to Python and import the package once again:


All right. For more information, take a look at the Installing Python Modules page.

Data

As we now have R and Python set up, let us import some data to play with. I will be working with a sample from Gapminder, an intriguing project related to the socio-demography of the world’s population. To be more specific, I will be working with the latest available data within the gapminder package.

  • Data import in R:
# Let us begin with the most recent set of data
  gapminder_latest <- gapminder %>%
            filter(year == max(year))

So, now we have the data loaded in R. Unfortunately, R objects (e.g. vectors or tibbles) are not directly accessible from Python. So, we need to convert the R object(s) first.

  • Data import in Python:
# Convert the R data frame (tibble) into a pandas DataFrame
gapminder_latest_py = r['gapminder_latest']

One important thing to realise when working with Python objects (e.g. arrays or pandas DataFrames) is that they are not explicitly stored in the environment the way R objects are.

  • In other words, if we want to know what is stored in the workspace, we must call functions like dir(), globals() or locals():
['R', '__annotations__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', 'gapminder_latest_py', 'np', 'pd', 'px', 'r', 'st', 'sys']

Great, among the objects present, we can clearly see the data (gapminder_latest_py) and the libraries (e.g. px).
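
The raw listing mixes interpreter internals with imported modules and data. As a rough sketch, the namespace can be filtered into something closer to an R-style environment overview; user_objects is a hypothetical helper, and the rule of skipping names that start with a double underscore is an assumption about what counts as "internal".

```python
import types


def user_objects(namespace):
    """Split a namespace dict into imported modules and other user-defined names."""
    modules, others = [], []
    for name, value in sorted(namespace.items()):
        if name.startswith("__"):
            continue  # skip interpreter internals such as __name__
        if isinstance(value, types.ModuleType):
            modules.append(name)
        else:
            others.append(name)
    return modules, others


modules, others = user_objects(globals())
print("modules:", modules)
print("objects:", others)
```

Passing globals() inspects the current session; any other dictionary of names and values works the same way.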

So, let us explore the data a bit!

Life Expectancy

For demonstration purposes, I will focus on life expectancy, the average number of years a person is expected to live.

Descriptive statistics

Let’s begin by calculating some descriptive statistics, like the mean, the median or the number of rows in the data, using Python:

# Descriptive statistics for the inline code in Python

## Data Frame Overview

### Number of rows
gapminder_latest_shape = gapminder_latest_py.shape[0]
### Number of distinct values within the life expectancy variable
gapminder_latest_count_py = gapminder_latest_py['lifeExp'].nunique()

## Life Expectancy

### Median (Life Expectancy)
gapminder_latest_median_lifeExp_py = st.median(gapminder_latest_py['lifeExp'])
### Mean
gapminder_latest_mean_lifeExp_py = st.mean(gapminder_latest_py['lifeExp'])
### Minimum
gapminder_latest_min_lifeExp_py = min(gapminder_latest_py['lifeExp'])
### Maximum
gapminder_latest_max_lifeExp_py = max(gapminder_latest_py['lifeExp'])
### Standard deviation
gapminder_latest_stdev_lifeExp_py = st.stdev(gapminder_latest_py['lifeExp'])
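
To see what these functions do on a small scale, here is a self-contained sketch that applies the same statistics module to the first five life expectancy values of the 2007 data. The summary dictionary is a hypothetical structure introduced for illustration. One detail worth knowing is that st.stdev computes the sample standard deviation (dividing by n - 1), while st.pstdev computes the population version (dividing by n).

```python
import statistics as st

# First five life-expectancy values from the 2007 data, used as a small sample.
sample = [43.828, 76.423, 72.301, 42.731, 75.32]

summary = {
    "n": len(sample),
    "median": st.median(sample),
    "mean": st.mean(sample),
    "min": min(sample),
    "max": max(sample),
    "stdev": st.stdev(sample),    # sample standard deviation (n - 1)
    "pstdev": st.pstdev(sample),  # population standard deviation (n)
}

for key, value in summary.items():
    print(f"{key}: {value:.3f}" if isinstance(value, float) else f"{key}: {value}")
```

Which of the two standard deviations is appropriate depends on whether the countries are treated as a sample or as the whole population of interest.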

Nice. Unfortunately, we are not able to use the Python objects for inline code, one of the key features of literate programming in R Markdown. So, if we want to use the results inline, we need to transfer the Python objects back to R:

# Descriptive statistics for the inline code in Python - transformed to R

## Data Frame Overview

### Number of rows
gapminder_latest_nrow_r = py$gapminder_latest_shape
### Number of distinct values within the life expectancy variable
gapminder_latest_count_r = py$gapminder_latest_count_py

## Life Expectancy

### Median (Life Expectancy)
gapminder_latest_median_lifeExp_r = py$gapminder_latest_median_lifeExp_py
### Mean
gapminder_latest_mean_lifeExp_r = py$gapminder_latest_mean_lifeExp_py
### Minimum
gapminder_latest_min_lifeExp_r = py$gapminder_latest_min_lifeExp_py
### Maximum
gapminder_latest_max_lifeExp_r = py$gapminder_latest_max_lifeExp_py
### Standard deviation
gapminder_latest_stdev_lifeExp_r = py$gapminder_latest_stdev_lifeExp_py

So, what can we say about life expectancy in 2007?

First of all, there were 142 countries on the list. The minimum value of life expectancy was 39.61 years, the maximum 82.6 years.

The mean life expectancy was 67.01 years and the median was 71.94 years, meaning that half of the countries had a life expectancy of 71.94 years or more. Lastly, the standard deviation was 12.07 years.

Graphs (using Plotly)

Okay, let’s move to something else, like graphs.

For example, we can take a look at how life expectancy is distributed across the globe using Plotly:

fig = px.histogram(gapminder_latest_py, # package.function; Data Frame
                   x="lifeExp", # Variable on the X axis
                   range_x=(gapminder_latest_min_lifeExp_py,
                            gapminder_latest_max_lifeExp_py), # Minimum and maximum values for the X axis
                   labels={'lifeExp':'Life expectancy - in years'}, # Naming of the interactive part
                   color_discrete_sequence=['#005C4E']) # Colour of fill

lifeExpHist = fig.update_layout(
    title="Figure 1. Life Expectancy in 2007 Across the Globe - in Years", # The name of the graph
    xaxis_title="Years", # X-axis title
    yaxis_title="Count", # Y-axis title
    font=dict( # "css"
        family="Roboto",
        size=12,
        color="#252A31"
    ))

lifeExpHist.write_html("lifeExpHist.html") # Save the graph as a .html object

Unfortunately, it is not possible to print interactive Plotly graphs in R Markdown via Python. Or, to be more precise, printing the figure (e.g. print(lifeExpHist)) only returns the Figure object:

Figure({
      'data': [{'alignmentgroup': 'True',
                'bingroup': 'x',
                'hovertemplate': 'Life expectancy - in years=%{x}<br>count=%{y}<extra></extra>',
                'legendgroup': '',
                'marker': {'color': '#005C4E'},
                'name': '',
                'offsetgroup': '',
                'orientation': 'v',
                'showlegend': False,
                'type': 'histogram',
                'x': array([43.828, 76.423, 72.301, 42.731, 75.32 , 81.235, 79.829, 75.635, 64.062,
                            79.441, 56.728, 65.554, 74.852, 50.728, 72.39 , 73.005, 52.295, 49.58 ,
                            59.723, 50.43 , 80.653, 44.741, 50.651, 78.553, 72.961, 72.889, 65.152,
                            46.462, 55.322, 78.782, 48.328, 75.748, 78.273, 76.486, 78.332, 54.791,
                            72.235, 74.994, 71.338, 71.878, 51.579, 58.04 , 52.947, 79.313, 80.657,
                            56.735, 59.448, 79.406, 60.022, 79.483, 70.259, 56.007, 46.388, 60.916,
                            70.198, 82.208, 73.338, 81.757, 64.698, 70.65 , 70.964, 59.545, 78.885,
                            80.745, 80.546, 72.567, 82.603, 72.535, 54.11 , 67.297, 78.623, 77.588,
                            71.993, 42.592, 45.678, 73.952, 59.443, 48.303, 74.241, 54.467, 64.164,
                            72.801, 76.195, 66.803, 74.543, 71.164, 42.082, 62.069, 52.906, 63.785,
                            79.762, 80.204, 72.899, 56.867, 46.859, 80.196, 75.64 , 65.483, 75.537,
                            71.752, 71.421, 71.688, 75.563, 78.098, 78.746, 76.442, 72.476, 46.242,
                            65.528, 72.777, 63.062, 74.002, 42.568, 79.972, 74.663, 77.926, 48.159,
                            49.339, 80.941, 72.396, 58.556, 39.613, 80.884, 81.701, 74.143, 78.4  ,
                            52.517, 70.616, 58.42 , 69.819, 73.923, 71.777, 51.542, 79.425, 78.242,
                            76.384, 73.747, 74.249, 73.422, 62.698, 42.384, 43.487]),
                'xaxis': 'x',
                'yaxis': 'y'}],
      'layout': {'barmode': 'relative',
                 'font': {'color': '#252A31', 'family': 'Roboto', 'size': 12},
                 'legend': {'tracegroupgap': 0},
                 'margin': {'t': 60},
                 'template': '...',
                 'title': {'text': 'Figure 1. Life Expectancy in 2007 Across the Globe - in Years'},
                 'xaxis': {'anchor': 'y', 'domain': [0.0, 1.0], 'range': [39.613, 82.603], 'title': {'text': 'Years'}},
                 'yaxis': {'anchor': 'x', 'domain': [0.0, 1.0], 'title': {'text': 'Count'}}}
  })

So, we import the previously created .html file instead (e.g. using the includeHTML function from the htmltools package):

htmltools::includeHTML("lifeExpHist.html") # Render the graph