🎙 Jan Beitner, Creator of PyTorch Forecasting
TheSequence interviews ML practitioners to immerse you in the real world of machine learning and artificial intelligence
There is nothing more inspiring than learning from practitioners. Getting to know the experience gained by researchers, engineers and entrepreneurs doing real ML work can be a great source of insights and inspiration. We’d like to introduce you to TheSequence Chat – the interviews that bring you closer to real ML practitioners. Please share these interviews if you find them enriching. No subscription is needed.
👤 Quick bio / Jan Beitner
Tell us a bit about yourself: your background, your current role and how you got started in machine learning.
Jan Beitner (JB): I am a Project Leader and Lead Data Scientist at BCG GAMMA, which is the data science practice of the Boston Consulting Group. At BCG GAMMA, I lead teams on high-impact machine learning use cases, working directly with the executive management and analytics units of our clients. I am also the global supporting lead for deep learning at BCG.
My PhD in Quantum Information at Cambridge was partially a coding bootcamp, and I wanted to keep using the quantitative skills that I had acquired over the years. Machine learning was just emerging as a big trend. After reading a couple of books and realizing that it was not very far from what I was already doing, I knew what would be next for me. Consulting appealed and still appeals to me because you can drive huge impact and see a lot of different companies in a very short time. BCG had just started GAMMA back then and offered a career track in data science that led straight up to the Partner level. It was an easy decision to join and it has been a great journey since.
🛠 ML Work
PyTorch Forecasting (covered in Edge#53) is a very exciting new project that applies modern deep learning to the world of time-series forecasting. What were the main motivations to launch that project?
JB: In the coming years, I believe deep learning models will become the method of choice for time-series forecasting. Many ML applications that drive value at companies are in essence some kind of time-series forecast. For example, for optimal pricing you need to predict demand as a function of price. The time-series forecasting field has undergone a lot of transitions in the past. In the beginning, models were problem-tailored expressions whose parameters were deduced from data. Later, the often-superior performance of statistical methods, such as ARIMA, convinced practitioners to switch. Recently, machine learning models, such as gradient boosting, have changed the game because they can make use of covariates far more effectively. I believe deep learning is the next revolution in the field because neural networks can incorporate the notion of time better than traditional machine learning and efficiently find similarities between multiple time-series.
Further, if you want to do computer vision or NLP with neural networks, there are lots of great tools out there. For time-series forecasting, there are some packages, but the fast.ai moment has not arrived yet. I hope that PyTorch Forecasting can contribute to the field and make deep learning for time-series mainstream. For me personally, it is a chance to give back to the amazing open-source community.
Time-series forecasting is one of the classic scenarios in machine learning and, yet, we haven’t seen the same level of advancements compared to other domains such as computer vision or language. Why do you think that has been the case?
JB: I believe there are a number of reasons. First, we deal with very heterogeneous datasets. Pixels and language each have a common underlying process generating them: pixels are recordings of light particles and language consists of words. A time-series, on the other hand, could be a stock price, a sensor reading from an IoT device or the sales of a product. For each, the process generating the data is vastly different. This makes it really difficult to build one model to rule them all.
Second, stacking convolutions to understand pixels has revolutionized computer vision, because it exploits the nature of images so well. In time-series forecasting, statistical models are already doing a pretty good job at understanding the nature of the problem. The bar to beat is higher.
Last but not least, there is a lack of common benchmarks. Everyone seems to evaluate their model on a different dataset. This is partially because there are so many different applications of time-series forecasting, but it also makes it very difficult to spot progress when it happens. I hope that PyTorch Forecasting can help on this front: it aims to provide an interface that makes it easy to apply your algorithm to multiple datasets.
⏳ Here you can read our coverage of PyTorch Forecasting and time-series forecasting in general ⌛️
How would you differentiate PyTorch Forecasting from alternatives such as Amazon GluonTS and others?
JB: First, GluonTS is built on MXNet, which is a considerably smaller platform than, for example, TensorFlow or PyTorch, upon which PyTorch Forecasting is based. The sheer popularity of those frameworks makes it easy for someone with deep learning experience to contribute to the field of time-series forecasting. Recognizing this, the team behind GluonTS is working on improving compatibility with PyTorch and simplifying the internal API. It is certainly the most mature package in the field and sets an aspirational standard for everyone else.
The second aspect is accessibility. PyTorch Forecasting aims to make it easy to implement new architectures and debug existing ones. There is minimal syntactic sugar compared to pure PyTorch models. Extensive tutorials and documentation are available to ensure users also know what is going on under the hood. In order to lower the entry barrier, there is also a unified interface to create datasets from pandas data frames. It does not aim to reinvent the wheel and focuses only on the time-series aspects of deep learning. For instance, the package relies heavily on PyTorch Lightning which is gaining a lot of popularity and is about to become the preferred way of training and tuning PyTorch networks.
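To make that dataset interface concrete, here is a minimal sketch of how a pandas data frame can be turned into a PyTorch Forecasting `TimeSeriesDataSet`; the toy data frame, column names and window lengths are illustrative choices, not taken from the interview:

```python
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet

# toy example: two series ("a" and "b") with 60 time steps each
data = pd.DataFrame(
    {
        "series": ["a"] * 60 + ["b"] * 60,
        "time_idx": list(range(60)) * 2,
        "demand": [float(i % 10) for i in range(120)],
    }
)

# a dataset that encodes 24 past steps and predicts 6 steps ahead
training = TimeSeriesDataSet(
    data,
    time_idx="time_idx",                    # integer column indexing time
    target="demand",                        # column to forecast
    group_ids=["series"],                   # column(s) identifying each time-series
    max_encoder_length=24,
    max_prediction_length=6,
    time_varying_unknown_reals=["demand"],  # values unknown in the future
)

# a regular PyTorch dataloader, ready for training
train_dataloader = training.to_dataloader(train=True, batch_size=32)
```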
GluonTS models are great if you use them exactly as prescribed. However, it can be rather difficult to customize them.
Using neural networks for forecasting is a very new field and I am excited about what the open-source community is developing and will develop. Other packages I am aware of, apart from GluonTS and PyTorch Forecasting, either focus on classifying time-series or are at a very early stage of development.
One of the things I find super interesting about PyTorch Forecasting is that it incorporates transformer architectures. Do you believe transformers can have an impact in time-series forecasting in the same way they have revolutionized language and now computer vision?
JB: I like the idea of attention (covered in Edge#3) and think it is very well suited to time-series forecasting. Many recent articles make use of some attention mechanism. The Temporal Fusion Transformer, which is implemented in PyTorch Forecasting, is a prime example of such an architecture delivering great results.
Will the transformer (covered in Edge#57), as we know it from NLP and CV, make a huge splash? I am cautious. In time-series forecasting, most architectures are rather shallow because the available datasets are small. A transformer with many layers is unlikely to perform well on so little data.
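As an illustration of the Temporal Fusion Transformer mentioned above, here is a minimal sketch of how such a model can be built from the dataset defined earlier and trained with PyTorch Lightning; the hyperparameters are illustrative and exact argument names can vary between library versions:

```python
import pytorch_lightning as pl
from pytorch_forecasting import TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

# build the network directly from the dataset definition above
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,         # deliberately small: forecasting datasets are often tiny
    attention_head_size=1,
    dropout=0.1,
    loss=QuantileLoss(),    # yields probabilistic (quantile) forecasts
)

# PyTorch Lightning takes care of the training loop
trainer = pl.Trainer(max_epochs=10, gradient_clip_val=0.1)
trainer.fit(tft, train_dataloader)
```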
If you could fast-forward 2-3 years from now, what are some milestones in the world of time-series forecasting that you are excited about?
JB: In the next few years, I am excited about more research on how to estimate the relations between different time-series and how they interact with each other. This is an important issue, for example, for predicting cannibalization due to price changes. It is a very difficult problem because the number of parameters to estimate is so large. One has to determine at every time step how each time-series is related to all other time-series. Neural networks are uniquely positioned for the task. There are some promising advances based on normalizing flows and graphs, for example, Transformer-MAF and Graph Deep Factors. I am keen to see how to make those approaches work for real-world datasets and scale them to large time-series collections.
Transfer learning is another area where I am hoping for progress, but it is notoriously difficult given that almost every dataset has a different underlying generation process, different covariates, etc.
💥 Miscellaneous – a set of rapid-fire questions
TensorFlow or PyTorch?
JB: PyTorch because it is more Pythonic. But I am by no means a fundamentalist. The experience of the team matters more than my preference.
Favorite math paradox?
JB: I came across Russell’s Paradox back in my undergrad days. Its formulation (the set of all sets that do not contain themselves: does it contain itself?) is very convincing, yet we know intuitively that it must be resolvable.
Any book you would recommend to aspiring data scientists?
JB: I always recommend “The Elements of Statistical Learning” (you can read it here for free). It has the right level of detail and is amazingly well written.
Does P equal NP?
JB: No, but my hope is that we can find good heuristics that are able to approximately solve many NP problems. Advances in reinforcement learning make me hopeful. Big improvements, for example, in mixed-integer programming are only a matter of time, in my opinion.