My personal site, with a short presentation of who I am, my academic and professional activities, and my research interests.
Some useful tricks / hacks I use frequently, mostly in Python.
Since I’m French, I often struggle with comma / semicolon separation in CSV files, comma / dot for decimals, and so on… The useful options I use:
import pandas as pd
data = pd.read_csv('/path/to/file.csv',  # Path to the file
                   delimiter=',',        # Field delimiter of the CSV (often ';' in French files)
                   decimal='.',          # Decimal separator; set it to ',' for French files
                   header='infer',       # Useful if there is a clean header row
                   skiprows=1,           # Skip the first N rows
                   nrows=1000            # Number of rows to read
                   )
Official documentation: read_csv
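As a minimal sketch of the French-style case, the snippet below reads a small in-memory CSV that uses `;` between fields and `,` for decimals. The sample data and column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical French-style CSV: ';' separates fields, ',' marks decimals
raw = "date;temperature;humidite\n01/01/2018;12,5;80,1\n02/01/2018;13,0;78,4\n"

data = pd.read_csv(
    io.StringIO(raw),  # a file path works the same way
    delimiter=";",     # field separator
    decimal=",",       # decimal mark used in the file
)

# The numeric columns are parsed as floats, not as strings
print(data.dtypes)
print(data["temperature"].mean())
```

Without `decimal=","`, the two numeric columns would most likely come back as text (object dtype), which is usually the first symptom of this problem.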
When working with time series, the data often needs to be indexed by time. If the date and / or time is included in the data, one can “pass” it as an index!
# If there are 'Date' and 'Time' columns with proper formats:
data['Datetime'] = pd.to_datetime(data['Date'] + ' ' + data['Time'])
# If the dataframe contains a proper 'Datetime' column (set_index returns a new dataframe):
data = data.set_index('Datetime')
# To change the date format (note: strftime turns the index into plain strings):
data.index = data.index.strftime('%d/%m/%Y')  # day first, then month and year
# or
data.index = data.index.strftime('%d-%m-%Y-%H:%M:%S')  # same, with hours, minutes and seconds
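Here is a small end-to-end sketch of the same idea on made-up data. The column names and the explicit day-first `format` string are assumptions for the example; adapt them to your own file.

```python
import pandas as pd

# Hypothetical data with separate 'Date' and 'Time' text columns
data = pd.DataFrame({
    "Date": ["01/01/2018", "01/01/2018", "02/01/2018"],
    "Time": ["08:00:00", "09:00:00", "08:00:00"],
    "Value": [1.0, 2.0, 3.0],
})

# Combine both columns and parse them with an explicit day-first format
data["Datetime"] = pd.to_datetime(
    data["Date"] + " " + data["Time"], format="%d/%m/%Y %H:%M:%S"
)

# set_index returns a new dataframe, so the result is reassigned
data = data.set_index("Datetime").drop(columns=["Date", "Time"])

# With a DatetimeIndex, a whole day can be selected from a date string
print(data.loc["2018-01-01"])
```

Giving the format explicitly avoids any ambiguity between day-first and month-first dates, which matters for dates such as 02/01/2018.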
First let’s create a daily dataframe with a random column:
>>>> import numpy as np
>>>> import pandas as pd
# Create a random time indexed dataframe
>>>> df = pd.DataFrame(np.random.randint(0,10,size=10),
columns=["Random"],
index=pd.date_range("20180101", periods=10))
>>>> df
Out[1]:
Random
2018-01-01 5
2018-01-02 3
2018-01-03 5
2018-01-04 5
2018-01-05 3
2018-01-06 6
2018-01-07 0
2018-01-08 0
2018-01-09 3
2018-01-10 3
If we want to resample to, let’s say, hours, we can use the resample functionality, together with an interpolation to fill the new rows:
>>>> resampledDf = df.resample('h')
>>>> resampledDf.interpolate(method='linear')
Out[2]:
2018-01-01 00:00:00 5.000000
2018-01-01 01:00:00 4.916667
2018-01-01 02:00:00 4.833333
2018-01-01 03:00:00 4.750000
2018-01-01 04:00:00 4.666667
...
2018-01-09 20:00:00 3.000000
2018-01-09 21:00:00 3.000000
2018-01-09 22:00:00 3.000000
2018-01-09 23:00:00 3.000000
2018-01-10 00:00:00 3.000000
[217 rows x 1 columns]
The linear interpolation fills the values linearly, as if they were equally spaced (the index is not taken into account).
The various interpolation methods can be found here (pandas documentation).
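To see what ignoring the index means in practice, here is a small sketch comparing `method='linear'` with `method='time'` on an irregularly spaced series. The data is made up: the missing value sits one day after the first point and three days before the last one.

```python
import numpy as np
import pandas as pd

# Irregularly spaced index with one missing value in the middle
s = pd.Series(
    [0.0, np.nan, 4.0],
    index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-05"]),
)

linear = s.interpolate(method="linear")  # treats the points as equally spaced
by_time = s.interpolate(method="time")   # weights by elapsed time

print(linear.iloc[1])   # 2.0, halfway between the two neighbours
print(by_time.iloc[1])  # 1.0, a quarter of the way in time
```

After a `resample` to a regular frequency the rows are evenly spaced, so both methods should give the same result; the difference only shows up on irregular indexes like this one.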
# Create a 10 day dataframe with an hourly frequency
>>>> df = pd.DataFrame(
np.random.randint(0,10,size=240),
columns=["Random"],
index=pd.date_range("20180101", periods=240, freq='h')  # 'H' is deprecated in recent pandas versions
)
>>>> df
Out[1]:
Random
2018-01-01 00:00:00 8
2018-01-01 01:00:00 1
2018-01-01 02:00:00 8
2018-01-01 03:00:00 6
2018-01-01 04:00:00 9
...
2018-01-10 19:00:00 2
2018-01-10 20:00:00 2
2018-01-10 21:00:00 4
2018-01-10 22:00:00 2
2018-01-10 23:00:00 3
# Select data only between 9 and 10 am (included)
>>>> df.between_time('09:00', '10:00')
Out[2]:
Random
2018-01-01 09:00:00 6
2018-01-01 10:00:00 0
2018-01-02 09:00:00 7
2018-01-02 10:00:00 3
2018-01-03 09:00:00 0
...
2018-01-09 09:00:00 7
2018-01-09 10:00:00 1
2018-01-10 09:00:00 1
2018-01-10 10:00:00 5
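Two variations that may be handy, shown as a sketch on a small made-up dataframe: excluding one of the bounds, and selecting a window that crosses midnight. The `inclusive` keyword assumes a reasonably recent pandas (1.4 or later, as far as I know).

```python
import numpy as np
import pandas as pd

# Two days of hourly data (the values are just a counter)
df = pd.DataFrame(
    {"value": np.arange(48)},
    index=pd.date_range("2018-01-01", periods=48, freq="h"),
)

# Keep 09:00 but drop 10:00
morning = df.between_time("09:00", "10:00", inclusive="left")

# A window crossing midnight: the start time is later than the end time
night = df.between_time("22:00", "02:00")

print(morning)
print(sorted(night.index.hour.unique()))
```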
A common task when dealing with data is to bin “values”, either to count them or to classify them. Here we’ll focus on the latter.
Pandas has a built-in function for this: cut. It bins values into categories and labels them in a new column.
Personally, I often use it to bin air temperature values into comfort levels (e.g. 18 < T° < 25 would be “Neutral”).
Let’s look at some examples.
Official documentation: pandas.cut
# Create a dataframe
>>>> df = pd.DataFrame(np.random.randint(0,10,size=10),
columns=["some_data"],
index=pd.date_range("20180101", periods=10))
>>>> df
Out[1]:
some_data
2018-01-01 2
2018-01-02 8
2018-01-03 9
2018-01-04 2
2018-01-05 3
2018-01-06 6
2018-01-07 7
2018-01-08 7
2018-01-09 6
2018-01-10 6
# To bin the values into two intervals, for example (0, 5] and (5, 10]:
>>>> df['binned_values'] = pd.cut(df['some_data'],
bins=[0,5,10])
# With the default parameters, we get a result like the following, where the new column is labelled automatically
>>>> df
Out[2]:
some_data binned_values
2018-01-01 2 (0, 5]
2018-01-02 8 (5, 10]
2018-01-03 9 (5, 10]
2018-01-04 2 (0, 5]
2018-01-05 3 (0, 5]
2018-01-06 6 (5, 10]
2018-01-07 7 (5, 10]
2018-01-08 7 (5, 10]
2018-01-09 6 (5, 10]
2018-01-10 6 (5, 10]
# And for more readable labels, we could add
>>>> df['binned_values'] = pd.cut(df['some_data'],
bins = [0,5,10],
labels = ["lower than 5","greater or equal to 5"],
include_lowest = True,
right = False)
>>>> df
Out[3]:
some_data binned_values
2018-01-01 2 lower than 5
2018-01-02 8 greater or equal to 5
2018-01-03 9 greater or equal to 5
2018-01-04 2 lower than 5
2018-01-05 3 lower than 5
2018-01-06 6 greater or equal to 5
2018-01-07 7 greater or equal to 5
2018-01-08 7 greater or equal to 5
2018-01-09 6 greater or equal to 5
2018-01-10 6 greater or equal to 5
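The comfort-level use case mentioned above could look like the sketch below. The temperatures, the “Cold” and “Warm” labels and the handling of the bin edges are assumptions for the example; only the 18 to 25 “Neutral” range comes from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical air temperatures
df = pd.DataFrame({"temperature": [12.0, 18.0, 21.5, 25.0, 30.2]})

# Open-ended outer bins catch everything below 18 and from 25 upwards
df["comfort"] = pd.cut(
    df["temperature"],
    bins=[-np.inf, 18, 25, np.inf],
    labels=["Cold", "Neutral", "Warm"],
    right=False,  # intervals are [low, high): 18 is "Neutral", 25 is "Warm"
)

print(df)
print(df["comfort"].value_counts(sort=False))
```

Using `-np.inf` and `np.inf` as outer edges avoids the NaN labels that appear when a value falls outside the listed bins.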
Recently I’ve been working with maps in Python applications. I mostly use geopandas, dash-leaflet, plotly.scattermapbox, etc.
A function I often need computes the distance between two points given by their lat / lon. It’s called a Haversine function:
from math import asin, cos, sqrt

def distance(lat1, lon1, lat2, lon2):
    # Haversine function to get the distance in km between two lat/lon points (in degrees)
    p = 0.017453292519943295  # pi / 180, converts degrees to radians
    hav = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p) * \
        cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
    # 12742 km = 2 * 6371 km (2 * mean Earth radius)
    return 12742 * asin(sqrt(hav))
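When many points are involved (a whole dataframe column, for instance), a vectorised variant can be convenient. The sketch below is my own NumPy rewrite, not a library function; it uses the equivalent form sin²(x/2) = (1 - cos x)/2 and the same mean Earth radius of 6371 km. The function name and the sample coordinates are made up.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, i.e. 12742 / 2


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in decimal degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    hav = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(hav))


# Distances from one reference point to several others in a single call
lats = np.array([1.0, 0.0, 0.0])
lons = np.array([0.0, 1.0, 180.0])
d = haversine_km(0.0, 0.0, lats, lons)
print(d)
```

One degree along the equator or along a meridian comes out at roughly 111 km, and the antipodal point at half the Earth's circumference, which is a quick sanity check.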
It is also possible to move a point by x meters in the “latitude” direction and y meters in the “longitude” direction, and compute the new coordinates:
import numpy as np

def translate_latlong(lat, long, lat_translation_meters, long_translation_meters):
    ''' Move any lat, long point by the provided meters in the lat and long directions.
    params :
    lat, long: latitude and longitude in degrees as decimal values, e.g. 37.43609517497065, -122.17226450150885
    lat_translation_meters: movement of the point in meters in the latitude direction.
        positive value: up move (north), negative value: down move (south)
    long_translation_meters: movement of the point in meters in the longitude direction.
        positive value: right move (east), negative value: left move (west)
    '''
    earth_radius = 6378.137  # equatorial radius, in km
    # Calculate top, which is lat_translation_meters above
    m_lat = (1 / ((2 * np.pi / 360) * earth_radius)) / 1000  # 1 meter in degrees
    lat_new = lat + (lat_translation_meters * m_lat)
    # Calculate right, which is long_translation_meters to the right
    m_long = (1 / ((2 * np.pi / 360) * earth_radius)) / 1000  # 1 meter in degrees
    # A degree of longitude shrinks with cos(latitude), hence the division
    long_new = long + (long_translation_meters * m_long) / np.cos(lat * (np.pi / 180))
    return lat_new, long_new
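A quick way to gain confidence in this kind of helper is a round trip: move a point by 1000 m, then measure the displacement with the Haversine function. The sketch below restates both functions with the `math` module so it runs on its own; the test point is arbitrary.

```python
from math import asin, cos, radians, sqrt


def distance(lat1, lon1, lat2, lon2):
    """Haversine distance in km (same formula as above, mean radius 6371 km)."""
    p = radians(1)
    hav = (0.5 - cos((lat2 - lat1) * p) / 2
           + cos(lat1 * p) * cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    return 12742 * asin(sqrt(hav))


def translate_latlong(lat, long, lat_translation_meters, long_translation_meters):
    """Same small-offset approximation as above, written with the math module."""
    earth_radius = 6378.137  # km
    deg_per_meter = 1 / (radians(1) * earth_radius) / 1000
    lat_new = lat + lat_translation_meters * deg_per_meter
    long_new = long + long_translation_meters * deg_per_meter / cos(radians(lat))
    return lat_new, long_new


lat0, lon0 = 45.0, 5.0  # arbitrary test point
north = translate_latlong(lat0, lon0, 1000, 0)
east = translate_latlong(lat0, lon0, 0, 1000)

# Both displacements should come out close to 1 km
d_north = distance(lat0, lon0, *north)
d_east = distance(lat0, lon0, *east)
print(d_north, d_east)
```

The two results are not exactly 1 km because the two functions use slightly different Earth radii (6378.137 km versus 6371 km), and because the translation is a flat approximation that is only meant for small offsets.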
What is the length of the day in hours?
I needed to compute the length of days for a complete year, and I started using Python libraries like pvlib (with its solarposition module) or suntime to get the hours of sunset and sunrise. I kept looking for other solutions as I wanted to avoid dependencies, and I found a post on Stack Overflow which gives you exactly that.
It is based on a paper by [Forsythe et al., 1995], named A Model Comparison for Daylength as a Function of Latitude and Day of Year. It uses only the day_of_year and latitude, and returns the length of the day in hours (and even supports different definitions of the length of a day).
Usefully, one can provide a list of days of the year (or rather an np.array), and the function will return an array of the same length with the day lengths:
import numpy as np

J = np.arange(1, 366)  # days of the year, from 1 to 365

def day_length(J, L):
    """
    -----------------------------------------------------------------------------------------
    Based upon : "A model comparison for daylength as a function of latitude and day of year"
    Forsythe et al., 1995, Ecological Modelling 80 (1995) 87-95
    -----------------------------------------------------------------------------------------
    Parameters
    ----------
    J: int / array of int
        day of the year.
    L: float
        latitude (in °)
    Returns
    -------
    Length of the day(s) in hours
    To account for various definitions of daylength, modify the "p" value accordingly.
    * Sunrise/Sunset is when the center of the sun is even with the horizon
        p = 0
    * Sunrise/Sunset is when the top of the sun is even with horizon
        p = 0.26667
    * Sunrise/Sunset is when the top of the sun is apparently even with horizon
        p = 0.8333
    """
    p = 0.8333
    # Sun declination angle (in radians) for each day of the year
    phi = np.arcsin(
        0.39795 * ( np.cos( 0.2163108 + 2 * np.arctan( 0.9671396 * np.tan( 0.00860 * (J-186) ) ) ) )
    )
    # Day length in hours
    D = 24 - (24/np.pi)*np.arccos(
        ( np.sin( p*np.pi/180 ) + np.sin( L*np.pi/180 ) * np.sin( phi ) ) / (np.cos(L*np.pi/180) * np.cos( phi ) )
    )
    return D
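As a usage sketch, here is the same formula applied to a whole year at one latitude. Two things are my own choices here and not part of the original function: `p` is exposed as a keyword argument, and 48.85° N is just an example latitude.

```python
import numpy as np


def day_length(J, L, p=0.8333):
    """Day length in hours (Forsythe et al., 1995), with p exposed as an argument."""
    phi = np.arcsin(
        0.39795 * np.cos(0.2163108 + 2 * np.arctan(0.9671396 * np.tan(0.00860 * (J - 186))))
    )
    return 24 - (24 / np.pi) * np.arccos(
        (np.sin(np.radians(p)) + np.sin(np.radians(L)) * np.sin(phi))
        / (np.cos(np.radians(L)) * np.cos(phi))
    )


J = np.arange(1, 366)     # day of year for a non-leap year
D = day_length(J, 48.85)  # example latitude, in degrees north

print(f"longest day:  day {J[D.argmax()]}, {D.max():.2f} h")
print(f"shortest day: day {J[D.argmin()]}, {D.min():.2f} h")
```

The longest and shortest days should fall close to the June and December solstices, which is an easy way to check that the latitude sign and the day numbering are right. Note that at latitudes with polar day or polar night, the `arccos` argument can leave the [-1, 1] range and the result becomes NaN.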