Code snippets

Some useful tricks / hacks I use frequently, mostly in Python.

Pandas DataFrame

Read (*.csv) files

Since I’m French, I often struggle with comma / semicolon separators in csv files, comma / dot for decimals, and so on… The useful options I use:

import pandas as pd
data = pd.read_csv('/path/to/file.csv',   # Path to the file
                   delimiter=',',         # Column delimiter (often ';' in French files)
                   decimal='.',           # Decimal separator (often ',' in French files)
                   header='infer',        # Useful if there is a clean header
                   skiprows=1,            # Skip the first N rows
                   nrows=1000             # Number of rows to read
                   )

Official documentation: read_csv
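
As a minimal sketch of the French-style case, assuming a made-up file content where ';' separates the columns and ',' marks the decimals (the column names and values are hypothetical):

```python
import io
import pandas as pd

# Hypothetical French-style CSV content: ';' separates columns, ',' marks decimals
raw = "Date;Temperature\n01/01/2018;12,5\n02/01/2018;13,75\n"

data = pd.read_csv(io.StringIO(raw),  # a file path works here as well
                   sep=';',           # column separator
                   decimal=',')       # decimal separator

print(data.dtypes)   # 'Temperature' is parsed as a float column, not as text
```

Without decimal=',', the 'Temperature' column would be read as text, which is usually the first symptom of a separator problem.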

When working with time series, one often needs to index the data by time. If the date and / or time is included in the data, one can “pass” it as the index!

# If there are 'Date' and 'Time' columns with proper formats: 
data['Datetime'] = pd.to_datetime(data['Date'] + ' ' + data['Time'])

# If the dataframe contains a proper 'Datetime' column: 
data = data.set_index('Datetime')   # set_index returns a new dataframe

# To change the date format (note: strftime turns the index into plain strings): 
data.index = data.index.strftime('%d/%m/%Y')            # day first, then month and year
# or
data.index = data.index.strftime('%d-%m-%Y-%H:%M:%S')   # same, with hours, minutes and seconds added
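
To see the whole chain at once, here is a small self-contained sketch. The dataframe content is hypothetical, and dropping the original 'Date' and 'Time' columns is my own choice, not a required step:

```python
import pandas as pd

# Hypothetical data with separate 'Date' and 'Time' text columns
data = pd.DataFrame({
    "Date": ["2018-01-01", "2018-01-01", "2018-01-02"],
    "Time": ["08:00:00", "20:00:00", "08:00:00"],
    "Value": [1.0, 2.0, 3.0],
})

# Combine both columns into a single timestamp and use it as the index
data["Datetime"] = pd.to_datetime(data["Date"] + " " + data["Time"])
data = data.set_index("Datetime").drop(columns=["Date", "Time"])

# A DatetimeIndex allows date-based selection, e.g. every row of a given day
first_day = data.loc["2018-01-01"]
print(first_day)
```

Keeping the index as a DatetimeIndex (rather than formatted strings) is what makes this kind of selection, and the resampling below, possible.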

Resampling: upsample or downsample a time series

First let’s create a daily dataframe with a random column:

>>>> import numpy as np
>>>> import pandas as pd
# Create a random time indexed dataframe
>>>> df = pd.DataFrame(np.random.randint(0,10,size=10),
                  columns=["Random"],
                  index=pd.date_range("20180101", periods=10))
>>>> df
Out[1]: 
            Random
2018-01-01       5
2018-01-02       3
2018-01-03       5
2018-01-04       5
2018-01-05       3
2018-01-06       6
2018-01-07       0
2018-01-08       0
2018-01-09       3
2018-01-10       3

If we want to resample to, let’s say, hours, we can use the resample functionality, together with an interpolation to fill the new rows:

>>>> resampledDf =  df.resample('h')
>>>> resampledDf.interpolate(method='linear') 
Out[2]: 
                       Random
2018-01-01 00:00:00  5.000000
2018-01-01 01:00:00  4.916667
2018-01-01 02:00:00  4.833333
2018-01-01 03:00:00  4.750000
2018-01-01 04:00:00  4.666667
                      ...
2018-01-09 20:00:00  3.000000
2018-01-09 21:00:00  3.000000
2018-01-09 22:00:00  3.000000
2018-01-09 23:00:00  3.000000
2018-01-10 00:00:00  3.000000

[217 rows x 1 columns]

The 'linear' method fills the values linearly, treating them as equally spaced and ignoring the index. The other interpolation methods are listed in the pandas documentation.
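
On a regular index like the one above, ignoring the index makes no difference. On an irregular one it does, as this small sketch with made-up values suggests; it compares 'linear' with the 'time' method, which takes the length of each interval into account:

```python
import numpy as np
import pandas as pd

# Irregularly spaced index: the gap after the missing value is twice as long
s = pd.Series([0.0, np.nan, 3.0],
              index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-04"]))

linear = s.interpolate(method="linear")  # ignores the index: midpoint of 0 and 3
timed = s.interpolate(method="time")     # weights by elapsed time: 1 day out of 3

print(linear.iloc[1], timed.iloc[1])
```

Here the 'linear' method gives 1.5 for the missing value, while 'time' gives 1.0 because only one third of the elapsed time has passed.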

Example of downsampling

min max mean std …
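
A possible sketch for this part, assuming hourly data that we want to reduce to one row per day. The column name and the values are made up so that the result is easy to check:

```python
import numpy as np
import pandas as pd

# Hourly data over 10 days (values 0..239 so the result is easy to check)
df = pd.DataFrame({"Value": np.arange(240)},
                  index=pd.date_range("20180101", periods=240, freq="h"))

# Downsample to one row per day, keeping several statistics per day
daily = df["Value"].resample("D").agg(["min", "max", "mean", "std"])
print(daily.head())
```

Each statistic becomes a column of the result, so the first day gets min 0, max 23 and mean 11.5.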

groupby

difference with resampling
how to use it
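
One way to read the difference (my own interpretation of this outline, with made-up data): resample cuts the time axis into consecutive bins, while groupby pools rows sharing any key, for instance the hour of the day across all days.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Value": np.arange(240)},
                  index=pd.date_range("20180101", periods=240, freq="h"))

# resample: consecutive time bins, one row per calendar day (10 rows)
per_day = df["Value"].resample("D").mean()

# groupby: arbitrary key, here the hour of the day, pooled over all days (24 rows)
per_hour_of_day = df["Value"].groupby(df.index.hour).mean()

print(len(per_day), len(per_hour_of_day))
```

The groupby version is the one to use for a "typical day" profile, since every 00:00 value ends up in the same group whatever its date.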

Data selections and filters

Example of masks
between_time: select data between given hours of the day, across several days

# Create a 10 day dataframe with an hourly frequency
>>>> df = pd.DataFrame(
                np.random.randint(0,10,size=240),
                columns=["Random"],
                index=pd.date_range("20180101", periods=240, freq='h')
          )
>>>> df
Out[1]: 
                     Random
2018-01-01 00:00:00       8
2018-01-01 01:00:00       1
2018-01-01 02:00:00       8
2018-01-01 03:00:00       6
2018-01-01 04:00:00       9
                    ...
2018-01-10 19:00:00       2
2018-01-10 20:00:00       2
2018-01-10 21:00:00       4
2018-01-10 22:00:00       2
2018-01-10 23:00:00       3

# Select data only between 9 and 10 am (included)

>>>> df.between_time('09:00', '10:00')
Out[2]: 
                     Random
2018-01-01 09:00:00       6
2018-01-01 10:00:00       0
2018-01-02 09:00:00       7
2018-01-02 10:00:00       3
2018-01-03 09:00:00       0
...
2018-01-09 09:00:00       7
2018-01-09 10:00:00       1
2018-01-10 09:00:00       1
2018-01-10 10:00:00       5

np.where & .loc[]
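
As a rough sketch of how masks, np.where and .loc[] fit together (the values, the threshold of 5 and the 'Level' column are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Random": [8, 1, 8, 6, 9, 2]},
                  index=pd.date_range("20180101", periods=6, freq="h"))

# Boolean mask: True where the condition holds
mask = df["Random"] > 5

# .loc with a mask keeps only the matching rows
high = df.loc[mask]

# np.where builds a new column from the same condition
df["Level"] = np.where(mask, "high", "low")

# .loc can also assign values on the selected rows only
df.loc[df["Random"] < 2, "Random"] = 0

print(df)
```

Several conditions can be combined in a mask with & and |, as long as each condition is wrapped in parentheses.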

How to bin column values in categories

A common task when dealing with data is to bin values, either to count them or to classify them. Here we’ll focus on the latter. Pandas has a built-in function for this: cut. It bins values into categories and labels them, for instance in a new column. Personally, I often use it to bin air temperature values into comfort levels (e.g. 18 °C < T < 25 °C would be “Neutral”). Let’s look at some examples.

Official documentation: pandas.cut

# Create a dataframe
>>>> df = pd.DataFrame(np.random.randint(0,10,size=10),
                    columns=["some_data"],
                    index=pd.date_range("20180101", periods=10))
>>>> df
Out[1]: 
            some_data
2018-01-01          2
2018-01-02          8
2018-01-03          9
2018-01-04          2
2018-01-05          3
2018-01-06          6
2018-01-07          7
2018-01-08          7
2018-01-09          6
2018-01-10          6

# If we want to bin the values as follows, for example: (0, 5] / (5, 10]
>>>> df['binned_values'] = pd.cut(df['some_data'],
                                  bins=[0,5,10])

# With the default parameters, we get a result like the following, where the new column is labelled automatically
>>>> df
Out[2]: 
            some_data binned_values
2018-01-01          2        (0, 5]
2018-01-02          8       (5, 10]
2018-01-03          9       (5, 10]
2018-01-04          2        (0, 5]
2018-01-05          3        (0, 5]
2018-01-06          6       (5, 10]
2018-01-07          7       (5, 10]
2018-01-08          7       (5, 10]
2018-01-09          6       (5, 10]
2018-01-10          6       (5, 10]

# And for more readable labels, we could add 
>>>> df['binned_values'] = pd.cut(df['some_data'],
                                  bins = [0,5,10],
                                  labels = ["lower than 5","greater or equal to 5"],
                                  include_lowest = True,
                                  right = False)

>>>> df
Out[3]: 
            some_data          binned_values
2018-01-01          2           lower than 5
2018-01-02          8  greater or equal to 5
2018-01-03          9  greater or equal to 5
2018-01-04          2           lower than 5
2018-01-05          3           lower than 5
2018-01-06          6  greater or equal to 5
2018-01-07          7  greater or equal to 5
2018-01-08          7  greater or equal to 5
2018-01-09          6  greater or equal to 5
2018-01-10          6  greater or equal to 5
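
For the comfort level use case mentioned above, a sketch could look like this. The temperatures, the column names and the "Cold" / "Warm" labels are hypothetical, and the thresholds are only illustrative:

```python
import pandas as pd

# Hypothetical air temperatures in °C
df = pd.DataFrame({"T_air": [12.0, 18.0, 21.5, 25.0, 30.2]})

# Illustrative thresholds only; with the default right=True the bins are
# (-inf, 18], (18, 25] and (25, inf)
df["comfort"] = pd.cut(df["T_air"],
                       bins=[-float("inf"), 18, 25, float("inf")],
                       labels=["Cold", "Neutral", "Warm"])

# Count the number of values falling in each category
print(df["comfort"].value_counts(sort=False))
```

Using infinite outer edges avoids getting NaN for values that fall outside the first and last thresholds.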

pandas data exploration

correlation matrices
simple plots
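
While waiting for this section to be written, here is a minimal sketch of a first look at a dataframe, using made-up columns 'a', 'b' and 'c':

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe: 'b' is a linear function of 'a', 'c' decreases with 'a'
df = pd.DataFrame({"a": np.arange(10.0)})
df["b"] = 2 * df["a"] + 1
df["c"] = -df["a"] ** 2

print(df.describe())   # count, mean, std, min, quartiles, max per column
corr = df.corr()       # pairwise Pearson correlation by default
print(corr)
```

For quick plots, df.plot() draws every column against the index, but it needs matplotlib to be installed.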

Pyvista (soon…)

Recently I’ve been working with maps in Python applications. I mostly use geopandas, dash-leaflet, plotly.scattermapbox, etc.

Compute the distance in km (or meters) between 2 points defined by their respective lat / lon. This is known as the haversine formula:

from math import asin, cos, sqrt

def distance(lat1, lon1, lat2, lon2):
  # Haversine formula: distance in km between two lat/lon points (in degrees)
  p = 0.017453292519943295   # pi / 180, converts degrees to radians
  hav = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p) * \
      cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
  # 12742 km = 2 * 6371 km (twice the mean Earth radius)
  return 12742 * asin(sqrt(hav))
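
A quick way to check the function is a case with a known answer. One degree of longitude along the equator should be 2 * pi * 6371 / 360 km. The function is repeated here so that the snippet runs on its own:

```python
from math import asin, cos, sqrt

def distance(lat1, lon1, lat2, lon2):
    # Same haversine formula as above, result in km
    p = 0.017453292519943295  # pi / 180
    hav = 0.5 - cos((lat2 - lat1) * p) / 2 + cos(lat1 * p) * \
        cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2
    return 12742 * asin(sqrt(hav))

# One degree of longitude along the equator
d_km = distance(0.0, 0.0, 0.0, 1.0)
print(round(d_km, 2))   # 111.19
d_m = d_km * 1000       # same distance in meters
```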

Starting from a point on a map defined by its latitude and longitude, what are the new coordinates if one moves, say, 50 m “to the right” (i.e. East) and 10 m up (North)? I’ve been struggling with that in pure Python (some libraries probably do this type of calculation, but I wanted no dependency). The solution I found comes from here. It’s a function that allows you to compute exactly that: start from a point, move x meters in the “latitude” direction and y meters in the “longitude” direction, and compute the new coordinates:

import numpy as np

def translate_latlong(lat,long,lat_translation_meters,long_translation_meters):
  ''' method to move any lat,long point by provided meters in lat and long direction.
  params :
      lat,long: latitude and longitude in degrees as decimal values, e.g. 37.43609517497065, -122.17226450150885
      lat_translation_meters: movement of point in meters in latitude direction.
                              positive value: up move (north), negative value: down move (south)
      long_translation_meters: movement of point in meters in longitude direction.
                              positive value: right move (east), negative value: left move (west)
      '''
  earth_radius = 6378.137   # equatorial radius, in km

  #Calculate top, which is lat_translation_meters above
  m_lat = (1 / ((2 * np.pi / 360) * earth_radius)) / 1000   # 1 meter in degree
  lat_new = lat + (lat_translation_meters * m_lat)

  #Calculate right, which is long_translation_meters right
  m_long = (1 / ((2 * np.pi / 360) * earth_radius)) / 1000   # 1 meter in degree
  long_new = long + (long_translation_meters * m_long) / np.cos(lat * (np.pi / 180))
    
  return lat_new,long_new
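
Here is a usage sketch for the 50 m East / 10 m North example, with a compact copy of the function so that it runs on its own. The start point is just the one from the docstring, and the conversion back to meters is only a sanity check I added:

```python
import numpy as np

def translate_latlong(lat, long, lat_translation_meters, long_translation_meters):
    # Same approach as above: convert meters to degrees, then shift the point
    earth_radius = 6378.137  # km
    m = (1 / ((2 * np.pi / 360) * earth_radius)) / 1000  # 1 meter in degrees
    lat_new = lat + lat_translation_meters * m
    long_new = long + (long_translation_meters * m) / np.cos(lat * (np.pi / 180))
    return lat_new, long_new

# Start point taken from the docstring; move 10 m north and 50 m east
lat0, lon0 = 37.43609517497065, -122.17226450150885
lat1, lon1 = translate_latlong(lat0, lon0, 10, 50)

# Convert the shifts in degrees back to meters as a sanity check
deg_to_m = 2 * np.pi * 6378137 / 360
north_m = (lat1 - lat0) * deg_to_m
east_m = (lon1 - lon0) * deg_to_m * np.cos(np.radians(lat0))
print(round(north_m, 3), round(east_m, 3))
```

Note that this is a flat, local approximation: it should be fine for moves of a few meters to a few kilometers, but not for long distances or points close to the poles.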

Miscellaneous

Daylength computation

What is the length of the day in hours ?

I needed to compute the length of days for a complete year, and I started using Python libraries like pvlib (with its solarposition module) or suntime to get the hours of sunset and sunrise. I kept looking for other solutions as I wanted to avoid dependencies, and I found a post on Stack Overflow which gives exactly that.

It is based on a paper by [Forsythe et al., 1995], named A Model Comparison for Daylength as a Function of Latitude and Day of Year. It uses only the day_of_year and latitude and returns the length of the day in hours (and it even supports different definitions of the length of a day).

Conveniently, one can provide a list of days of the year (or rather an np.array), and the function will return an array of the same length containing the day lengths.

For a complete year calculation, I used J = np.arange(1, 366).

import numpy as np

def day_length(J, L):
    """
    -----------------------------------------------------------------------------------------
    Based upon : "A model comparison for daylength as a function of latitude and day of year"
    Forsythe et al., 1995, Ecological Modelling 80 (1995) 87-95
    -----------------------------------------------------------------------------------------
    Parameters
    ----------
    J: int / list of int / array 
        day of the year.
    L: float 
        latitude (in °)

    Returns
    -------
    Length of the day(s) in hours
    
    To account for various definitions of daylength, modify the "p" value accordingly.
    * Sunrise/Sunset is when the center of the sun is even with the horizon 
    p = 0
    * Sunrise/Sunset is when the top of the sun is even with horizon
    p = 0.26667
    * Sunrise/Sunset is when the top of the sun is apparently even with horizon
    p = 0.8333

    """
    J = np.asarray(J)   # so that a plain list of days works too
    p = 0.8333
    phi = np.arcsin(
            0.39795 * ( np.cos( 0.2163108 + 2 * np.arctan( 0.9671396 * np.tan( 0.00860 * (J-186) ) ) ) )
        )
    D = 24 - (24/np.pi)*np.arccos(
              ( np.sin( p*np.pi/180 ) + np.sin( L*np.pi/180 ) * np.sin( phi ) ) / (np.cos(L*np.pi/180) * np.cos( phi ) )
        )

    return D
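
A usage sketch for a full year, with a compact copy of the function so that the snippet runs on its own. The latitude of 45° is a hypothetical site, and exposing p as an argument is only for illustration here:

```python
import numpy as np

def day_length(J, L, p=0.8333):
    # Same formula as above (Forsythe et al., 1995)
    J = np.asarray(J)
    phi = np.arcsin(0.39795 * np.cos(
        0.2163108 + 2 * np.arctan(0.9671396 * np.tan(0.00860 * (J - 186)))))
    return 24 - (24 / np.pi) * np.arccos(
        (np.sin(p * np.pi / 180) + np.sin(L * np.pi / 180) * np.sin(phi))
        / (np.cos(L * np.pi / 180) * np.cos(phi)))

J = np.arange(1, 366)            # every day of a non-leap year
D = day_length(J, 45.0)          # hypothetical site at 45° N

print(D.shape)                   # (365,)
print(J[D.argmax()], D.max())    # longest day, close to the June solstice
print(J[D.argmin()], D.min())    # shortest day, close to the December solstice
```

At 45° N this should give a bit more than 15 hours for the longest day and a bit less than 9 hours for the shortest one.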