Data cleaning and processing functions are located in the
dato.process module. This module is largely comprised of
In general, any
pandas function should have an analogue in
dato, following an UpperCamelCase naming convention, and dropping underscores. For example,
df >> GroupBy('a') >> Sum('b') >> ValueCounts
However, aside from declarative-ready operations like these (which exist as methods in pandas), it can take fairly deep knowledge of
pandas to work out how to express simple operations like filtering or mutation (to use R's terminology) in this style. We'll briefly cover these on this page.
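For readers wondering how the >> chaining on this page can work at all, the following is a minimal sketch of one common way to build such a pipe in Python, using the reflected right-shift operator. It is not dato's actual implementation: the Pipeable class and the Query and Eval stand-ins defined here are hypothetical and only forward to the pandas methods of the same name.

```python
import pandas as pd


class Pipeable:
    """Hypothetical wrapper that defers a DataFrame method call until `df >> step`."""

    def __init__(self, method_name, *args, **kwargs):
        self.method_name = method_name
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, df):
        # Python falls back to this reflected operator because DataFrame
        # does not define `>>` itself.
        return getattr(df, self.method_name)(*self.args, **self.kwargs)


# Hypothetical stand-ins following the UpperCamelCase convention.
def Query(expr):
    return Pipeable("query", expr)


def Eval(expr):
    return Pipeable("eval", expr)


df = pd.DataFrame({"A": [1, 5, 3], "B": [4, 2, 3]})

# `>>` is left-associative, so Eval runs first and Query sees column C.
result = df >> Eval("C = A + B") >> Query("A > B")
print(result)
```

Each step receives the dataframe produced by the previous one, which is what lets a chain read left to right.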
A common operation in dealing with dataframes is to make a new column out of some combination of other columns. For example, in our work, the following is a common pattern:
df['C'] = df.A + df.B
In R, this is called mutation, but pandas also supports it through a little-known method called
eval. This can thus be implemented easily in dato:
df >> Eval('C = A + B')
For those familiar with
mutate, we've also implemented a function called
Mutate which does the same thing. See the pandas documentation on
eval for a full overview of the functionality.
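As a plain-pandas point of comparison, the sketch below checks that an assignment expression passed to DataFrame.eval produces the same column as direct assignment. The sample data is made up for illustration, and no dato code is involved.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Direct column assignment, done on a copy so `df` stays unchanged.
assigned = df.copy()
assigned["C"] = assigned.A + assigned.B

# DataFrame.eval with an assignment expression returns a new DataFrame;
# the original is untouched because `inplace` defaults to False.
evaluated = df.eval("C = A + B")

print(evaluated)
```

Because eval returns a new dataframe rather than modifying its input, it fits naturally into a chain of operations.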
Another common pattern in dataframe munging is to filter a dataframe on some condition (e.g. two columns are both true). For example:
df = df[df.A > df.B]
This can be accomplished with the
Query function as follows:
df >> Query('A > B')
See the pandas documentation on
query for a full overview of the functionality.
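The equivalence between boolean-mask filtering and DataFrame.query can be checked directly in pandas. This is a small sketch on made-up data and does not use dato.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 5, 3], "B": [4, 2, 3]})

# Boolean-mask filtering: keep rows where A is strictly greater than B.
masked = df[df.A > df.B]

# The same filter written as a query string; column names are
# resolved against the dataframe.
queried = df.query("A > B")

print(queried)
```

Both forms keep the original index labels of the surviving rows.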
Another common operation is to simply select columns.
df = df[['A', 'B']]
We enable this in
dato using the
Select function:
df >> Select('A', 'B')
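One way to picture the difference between the two spellings: pandas expects a list of column names inside the indexer, while the piped form takes the names as separate arguments. The helper below, select_columns, is a hypothetical illustration of that mapping and is not part of dato or pandas.

```python
import pandas as pd


def select_columns(df, *columns):
    # Hypothetical helper: gather the separate name arguments into the
    # list that pandas expects for multi-column selection.
    return df[list(columns)]


df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

selected = select_columns(df, "A", "B")
print(selected)
```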