Mateusz BOGDAN

Code snippets

Some useful tricks / hacks I use frequently, mostly in Python.

pandas DataFrame

Read (*.csv) files

Since I’m French, I often struggle with comma / semicolon separators in CSV files, comma / dot for decimals, and so on… The useful options I use:

import pandas as pd
data = pd.read_csv('/path/to/file.csv',   # Path to the file
                   delimiter=',',         # Field separator (French files often use ';')
                   decimal='.',           # Decimal separator (French files often use the comma ',')
                   header='infer',        # Take the column names from the first row read
                   skiprows=1,            # Skip the first N rows of the file
                   nrows=1000             # Number of rows to read
                   )

Official documentation: read_csv
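As an illustration of the French case mentioned above, here is a minimal sketch that reads a semicolon-separated file with decimal commas. The file content and the column names (date, temperature) are made up for the example, and io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

# Made-up French-style CSV: ';' between fields, ',' as decimal separator
french_csv = io.StringIO(
    "date;temperature\n"
    "01/01/2018;12,5\n"
    "02/01/2018;13,1\n"
)

data = pd.read_csv(french_csv,
                   sep=';',       # same role as delimiter=';'
                   decimal=',')   # parse '12,5' as the float 12.5

print(data.dtypes)
print(data)
```

Without decimal=',' the temperature column would be read as text instead of numbers.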

When working with time series, the data usually needs to be indexed by time. If the date and / or time is included in the data, it can be used as the index!

# If there is a 'Date' and a 'Time' column with proper formats: 
data['Datetime'] = pd.to_datetime(data['Date']+' '+data['Time'])

# If the dataframe contains a proper 'Datetime' column: 
data = data.set_index('Datetime')   # set_index returns a new dataframe, so reassign it

# To change the date format (strftime returns strings, so the index is no longer a DatetimeIndex): 
data.index = data.index.strftime('%d/%m/%Y')            # day first, then month and year
# or
data.index = data.index.strftime('%d-%m-%Y-%H:%M:%S')   # same, with hours, minutes and seconds added
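To make these steps concrete, here is a small self-contained sketch on made-up data (the 'Value' column and the timestamps are assumptions for the example). It keeps the formatted dates in a separate variable, so the dataframe keeps its DatetimeIndex for later resampling or time-based selection.

```python
import pandas as pd

# Made-up data with separate 'Date' and 'Time' columns
data = pd.DataFrame({
    'Date': ['2018-01-01', '2018-01-02'],
    'Time': ['08:30:00', '09:45:00'],
    'Value': [1.0, 2.0],
})

# Combine both columns into one datetime column and use it as index
data['Datetime'] = pd.to_datetime(data['Date'] + ' ' + data['Time'])
data = data.set_index('Datetime')

# Formatted labels are plain strings, stored apart from the index
labels = data.index.strftime('%d/%m/%Y')

print(type(data.index).__name__)
print(list(labels))
```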

Resampling: upsample or downsample a time series

First let’s create a daily dataframe with a random column:

>>> import numpy as np
>>> import pandas as pd
# Create a random time indexed dataframe
>>> df = pd.DataFrame(np.random.randint(0,10,size=10),
                  columns=["Random"],
                  index=pd.date_range("20180101", periods=10))
>>> df
Out[1]: 
            Random
2018-01-01       5
2018-01-02       3
2018-01-03       5
2018-01-04       5
2018-01-05       3
2018-01-06       6
2018-01-07       0
2018-01-08       0
2018-01-09       3
2018-01-10       3

If we want to resample to, let’s say, hours, we can use the resample functionality, together with an interpolation to fill the new rows:

>>> resampledDf = df.resample('h')
>>> resampledDf.interpolate(method='linear')
Out[2]: 
2018-01-01 00:00:00  5.000000
2018-01-01 01:00:00  4.916667
2018-01-01 02:00:00  4.833333
2018-01-01 03:00:00  4.750000
2018-01-01 04:00:00  4.666667
                      ...
2018-01-09 20:00:00  3.000000
2018-01-09 21:00:00  3.000000
2018-01-09 22:00:00  3.000000
2018-01-09 23:00:00  3.000000
2018-01-10 00:00:00  3.000000

[217 rows x 1 columns]

The linear method ignores the index and fills the values as if they were equally spaced. The other interpolation methods are listed in the pandas documentation.
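The difference only shows up when the timestamps are irregular. A minimal sketch, with made-up values, comparing method='linear' with method='time' (which weights by the actual time gaps):

```python
import numpy as np
import pandas as pd

# Irregular index: 1 day, then 3 days between the points
s = pd.Series([0.0, np.nan, 10.0],
              index=pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-05']))

linear = s.interpolate(method='linear')   # ignores the index: midpoint of 0 and 10
timed = s.interpolate(method='time')      # uses the index: 1 day out of 4

print(linear.iloc[1])   # 5.0
print(timed.iloc[1])    # 2.5
```

On a regular grid, such as the hourly one above, both methods give the same result.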

Example of downsampling
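A minimal sketch of downsampling, going from hourly to daily values. Instead of an interpolation, an aggregation (here mean and max) is applied to each day. The data is deterministic and made up for the example.

```python
import numpy as np
import pandas as pd

# Two days of hourly data with values 0..47
hourly = pd.DataFrame({"Random": np.arange(48)},
                      index=pd.date_range("20180101", periods=48, freq="h"))

daily_mean = hourly.resample("D").mean()   # one row per day, mean of the 24 hourly values
daily_max = hourly.resample("D").max()     # one row per day, largest hourly value

print(daily_mean)
print(daily_max)
```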

groupby
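One possible use of groupby on a time-indexed dataframe, sketched on made-up data: grouping by the hour of the day gives an average daily profile, which resample cannot do since it only works on consecutive time bins.

```python
import numpy as np
import pandas as pd

# Two days of hourly data with values 0..47
hourly = pd.DataFrame({"Random": np.arange(48)},
                      index=pd.date_range("20180101", periods=48, freq="h"))

# Mean value for each hour of the day (0..23), across all days
by_hour = hourly.groupby(hourly.index.hour)["Random"].mean()

print(by_hour.head())
```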

Data selections and filters

# Create a 10 day dataframe with an hourly frequency
>>> df = pd.DataFrame(
                np.random.randint(0,10,size=240),
                columns=["Random"],
                index=pd.date_range("20180101", periods=240, freq='h')   # 'h' = hourly ('H' is deprecated in recent pandas)
          )
>>> df
Out[1]: 
                     Random
2018-01-01 00:00:00       8
2018-01-01 01:00:00       1
2018-01-01 02:00:00       8
2018-01-01 03:00:00       6
2018-01-01 04:00:00       9
                    ...
2018-01-10 19:00:00       2
2018-01-10 20:00:00       2
2018-01-10 21:00:00       4
2018-01-10 22:00:00       2
2018-01-10 23:00:00       3

# Select data only between 9 and 10 am (both ends included)

>>> df.between_time('09:00', '10:00')
Out[2]: 
                     Random
2018-01-01 09:00:00       6
2018-01-01 10:00:00       0
2018-01-02 09:00:00       7
2018-01-02 10:00:00       3
2018-01-03 09:00:00       0
...
2018-01-09 09:00:00       7
2018-01-09 10:00:00       1
2018-01-10 09:00:00       1
2018-01-10 10:00:00       5
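A few other selections that work on the same kind of hourly dataframe, as a minimal sketch: a single day, a range of days, and a boolean mask combining the hour with a condition on the values. The data here is deterministic (not random) so that the results are reproducible.

```python
import numpy as np
import pandas as pd

# 10 days of hourly data, values cycling through 0..9
df = pd.DataFrame({"Random": np.arange(240) % 10},
                  index=pd.date_range("20180101", periods=240, freq="h"))

one_day = df.loc["2018-01-03"]                   # all rows of a single day
two_days = df.loc["2018-01-03":"2018-01-04"]     # label slice, both ends included

# Rows before noon where the value is at least 5
morning_high = df[(df.index.hour < 12) & (df["Random"] >= 5)]

print(len(one_day), len(two_days), len(morning_high))
```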

How to bin column values into categories

A common task when dealing with data is to bin values, either to count them or to classify them. Here we’ll focus on the latter. pandas has a built-in function for this: cut. It bins values into categories and labels them, for instance in a new column. Personally, I often use it to bin air temperature values into comfort levels (e.g. 18 < T° < 25 would be “Neutral”). Let’s look at some examples.

Official documentation: pandas.cut

# Create a dataframe
>>> df = pd.DataFrame(np.random.randint(0,10,size=10),
                    columns=["some_data"],
                    index=pd.date_range("20180101", periods=10))
>>> df
Out[1]: 
            some_data
2018-01-01          2
2018-01-02          8
2018-01-03          9
2018-01-04          2
2018-01-05          3
2018-01-06          6
2018-01-07          7
2018-01-08          7
2018-01-09          6
2018-01-10          6

# If we want to bin the values into two intervals, for example (0, 5] and (5, 10]
>>> df['binned_values'] = pd.cut(df['some_data'],
                                  bins=[0,5,10])

# With the default parameters, we will get a result like the following, where the new column is labelled automatically
# (the intervals are open on the left, so a value of exactly 0 would end up as NaN)
>>> df
Out[2]: 
            some_data binned_values
2018-01-01          2        (0, 5]
2018-01-02          8       (5, 10]
2018-01-03          9       (5, 10]
2018-01-04          2        (0, 5]
2018-01-05          3        (0, 5]
2018-01-06          6       (5, 10]
2018-01-07          7       (5, 10]
2018-01-08          7       (5, 10]
2018-01-09          6       (5, 10]
2018-01-10          6       (5, 10]

# And for more readable labels, we could add 
>>> df['binned_values'] = pd.cut(df['some_data'],
                                  bins = [0,5,10],
                                  labels = ["lower than 5","greater or equal to 5"],
                                  include_lowest = True,
                                  right = False)   # intervals closed on the left: [0, 5) and [5, 10)

>>> df
Out[3]: 
            some_data          binned_values
2018-01-01          2           lower than 5
2018-01-02          8  greater or equal to 5
2018-01-03          9  greater or equal to 5
2018-01-04          2           lower than 5
2018-01-05          3           lower than 5
2018-01-06          6  greater or equal to 5
2018-01-07          7  greater or equal to 5
2018-01-08          7  greater or equal to 5
2018-01-09          6  greater or equal to 5
2018-01-10          6  greater or equal to 5
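The comfort-level use case mentioned earlier could look like the sketch below. The 18 and 25 thresholds come from the text; the "Cold" and "Hot" labels, the air_temp column name and the temperature values are assumptions made for the example. Infinite outer edges make sure no value falls outside the bins.

```python
import numpy as np
import pandas as pd

# Made-up air temperatures in °C
temps = pd.DataFrame({"air_temp": [15.0, 18.0, 21.5, 25.0, 30.0]})

temps["comfort"] = pd.cut(
    temps["air_temp"],
    bins=[-np.inf, 18, 25, np.inf],       # (-inf, 18], (18, 25], (25, inf)
    labels=["Cold", "Neutral", "Hot"],    # hypothetical label names
)

# The binned column can also be used to count the values per category
counts = temps["comfort"].value_counts()

print(temps)
print(counts)
```

With the default right=True, a temperature of exactly 25 counts as "Neutral" and exactly 18 as "Cold"; pass right=False to shift both boundaries to the upper category.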

pandas data exploration

Pyvista (soon…)

Maps and GIS related questions

Recently I’ve been working with maps in Python applications. I mostly use geopandas, dash-leaflet, plotly.scattermapbox, etc.

Miscellaneous

Daylength computation

What is the length of the day in hours ?

I needed to compute the length of days for a complete year, and I started using Python libraries like pvlib with its solarposition module, or suntime, to get the hours of sunset and sunrise. I kept looking for other solutions as I wanted to avoid dependencies, and I found a post on Stack Overflow which gives exactly that.

It is based on a paper by [Forsythe et al., 1995], named A Model Comparison for Daylength as a Function of Latitude and Day of Year. It uses only the day_of_year and latitude and returns the length of the day in hours (and even supports different definitions of the length of a day).

Usefully, one can provide a list of days of the year (or rather an np.array), and the function will return an array of the same length with daylengths :

import numpy as np

def day_length(J, L):
    """
    -----------------------------------------------------------------------------------------
    Based upon : "A model comparison for daylength as a function of latitude and day of year"
    Forsythe et al., 1995, Ecological Modelling 80 (1995) 87-95
    -----------------------------------------------------------------------------------------
    Parameters
    ----------
    J: int / list of int / array 
        day of the year.
    L: float 
        latitude (in °)

    Returns
    -------
    Length of the day(s) in hours
    
    To account for various definitions of daylength, modify the "p" value accordingly.
    * Sunrise/Sunset is when the center of the sun is even with the horizon 
    p = 0
    * Sunrise/Sunset is when the top of the sun is even with horizon
    p = 0.26667
    * Sunrise/Sunset is when the top of the sun is apparently even with horizon
    p = 0.8333

    """
    p = 0.8333
    J = np.asarray(J)  # so that a plain list of days works as well as an int or an array
    phi = np.arcsin(
            0.39795 * ( np.cos( 0.2163108 + 2 * np.arctan( 0.9671396 * np.tan( 0.00860 * (J-186) ) ) ) )
        )
    D = 24 - (24/np.pi)*np.arccos(
              ( np.sin( p*np.pi/180 ) + np.sin( L*np.pi/180 ) * np.sin( phi ) ) / (np.cos(L*np.pi/180) * np.cos( phi ) )
        )

    return D
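A usage sketch for a complete year. The function is repeated here in a compact form so that the snippet runs on its own; exposing p as a keyword argument is a change made for this example, and the latitude of 45° is an arbitrary choice.

```python
import numpy as np

def day_length(J, L, p=0.8333):
    """Daylength in hours, same formula as above, with p as an argument."""
    J = np.asarray(J)
    phi = np.arcsin(
        0.39795 * np.cos(0.2163108 + 2 * np.arctan(0.9671396 * np.tan(0.00860 * (J - 186))))
    )
    lat = np.radians(L)
    return 24 - (24 / np.pi) * np.arccos(
        (np.sin(np.radians(p)) + np.sin(lat) * np.sin(phi)) / (np.cos(lat) * np.cos(phi))
    )

days = np.arange(1, 366)            # day of year, 1 to 365
lengths = day_length(days, 45.0)    # one daylength per day, in hours

print(lengths.min(), lengths.max())     # shortest and longest day of the year
print(days[lengths.argmax()])           # day of year of the longest day
```

For latitudes inside the polar circles, the arccos argument can leave the [-1, 1] range on some days, in which case NumPy returns NaN for those days.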