This section, intended for more technically minded readers, deals with real data analysis problems and their solutions, mostly in R.
The data set on German fuel prices contains the fuel prices, but not the sales, of more than 14,000 fuel stations in Germany. It is made available by the web service www.tankerkoenig.de as a Postgres dump (from June 2014 onwards) under CC4.0.
This is a particularly interesting data set
from the general public point of view as
from the data science point of view as
I am treating this as an open research project and will post about interesting results on this blog. The original data is available at Tankerkoenig. The current status of the code for the data preparation, some of the additional external data, and some simple first models are available as a “workshop” on my GitHub account.
Reading, cleaning and consolidating multiple socio-demographic data files
Tidying, creation of brand and highway markers using regular expressions, parsing of JSON information on opening hours
Identification of the NUTS-3 region per station, station distance matrices to competitors, highways, traffic counters etc.
Reading from Postgres, cleaning implausible prices, imputing and aggregating the price data (see also the blog entry about efficient missing line imputation). Calculation of competitor prices
Moving to AWS, test of different (Linear, Panel, Spatial) models, collection of results
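Two of the preparation steps listed above, brand markers via regular expressions and station distance matrices, can be sketched in a few lines. What follows is a minimal Python sketch, not the R code from the workshop. The station names, coordinates, brand patterns and function names are all made up for illustration, and the great-circle (haversine) formula is only an assumption about how the distances might be computed.

```python
import math
import re

# Hypothetical brand patterns; the actual workshop may use different rules.
BRAND_PATTERNS = {
    "ARAL": re.compile(r"\baral\b", re.IGNORECASE),
    "SHELL": re.compile(r"\bshell\b", re.IGNORECASE),
    "ESSO": re.compile(r"\besso\b", re.IGNORECASE),
}


def brand_marker(name):
    """Return the label of the first brand pattern found in name, else "OTHER"."""
    for label, pattern in BRAND_PATTERNS.items():
        if pattern.search(name):
            return label
    return "OTHER"


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def distance_matrix(stations):
    """Full pairwise distance matrix (list of lists) for a list of stations."""
    return [
        [haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) for b in stations]
        for a in stations
    ]


# Invented example stations, for illustration only.
stations = [
    {"name": "Aral Tankstelle Musterstadt", "lat": 52.5200, "lon": 13.4050},
    {"name": "SHELL Station Beispielweg", "lat": 52.5300, "lon": 13.4050},
    {"name": "Freie Tankstelle Meier", "lat": 48.1372, "lon": 11.5756},
]

markers = [brand_marker(s["name"]) for s in stations]
dist = distance_matrix(stations)
```

A full pairwise matrix grows quadratically with the number of stations, so for more than 14,000 stations one would probably restrict the computation to neighbours within some radius. That restriction is not part of this sketch.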
You can find the print version of my useR! 2017 talk on this project here. There is also a video, courtesy of Channel 9, that you can watch here:
Boris Vaillant - Quantitative Consulting 17