I went to the Open Data Science Conference in London last week and really enjoyed it. There was lots of practical, technical content, plenty of up-to-date data science, and not too much bullshit/sales. Here are my four highlights:
1. NLP is having a pretty big moment: we can all have a go!
There were quite a few good talks on NLP, mostly focusing on recent developments in deep neural networks - particularly large pre-trained transformer models. These models have been taking the NLP world by storm over the last couple of years, since the publishing of the seminal paper Attention Is All You Need in late 2017.
Natasha Latysheva, Machine Learning Engineer @ Welocalize, gave a really good tutorial on sequence modelling with deep learning. You can find the code related to it in this GitHub repo. It was pitched at just the right technical level (for me anyway). Quickly and clearly we were taken on a tour from feed-forward neural nets to recurrent NNs and GRUs/LSTMs, up to recent advances in encoder-decoders, Bi-RNNs, attention and transformers.
The tutorial took us through building a language model based on the Game of Thrones books: an LSTM trained on the text of the popular fantasy series. Here is a snippet of the model code, for making an LSTM model using Keras:
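    from keras.models import Sequential
    from keras.layers import Embedding
    from keras.layers import LSTM
    from keras.layers import Dense

    # vocabulary_size and max_sequence_length are assumed to be defined already
    model = Sequential()
    # map each word index to a 50-dimensional vector
    model.add(Embedding(vocabulary_size, 50, input_length=max_sequence_length-1))
    # two stacked LSTM layers; the first passes its full sequence to the second
    model.add(LSTM(100, return_sequences=True))
    model.add(LSTM(100))
    model.add(Dense(100, activation='relu'))
    # one probability per word in the vocabulary for the next word
    model.add(Dense(vocabulary_size, activation='softmax'))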
Here is a ‘picture’ of the model:
We looked at using the model generatively to simulate text from a ‘seed’ phrase. Here is an example where the seed phrase was:
A dragon, the dead, and Tyrion walk into a…
and the output was:
… great roast of rubies the wind was the same and the way he had been a man of the harpy
This kind of thing can give you a good intuition for how the model is working. What you tend to find with LSTMs is that, while they do have some ‘memory’ going back a few words, the topic quickly veers off from the starting point.
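The generation loop itself is simple, whatever model sits behind it. Here is a minimal sketch of it, with a toy bigram counter standing in for the trained LSTM. The corpus, `next_word_probs` and `generate` are all made up for illustration and are not from the tutorial; a real model would return its softmax output for the sequence so far instead of a lookup in a count table.

```python
import random
from collections import Counter, defaultdict

# Tiny stand-in corpus; the tutorial used the Game of Thrones books.
corpus = "the wind was cold and the night was dark and the wind was the same".split()

# Count which word follows which (a bigram table standing in for the LSTM).
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1


def next_word_probs(word):
    """Return {candidate: probability} for the word after `word`."""
    counts = follows[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def generate(seed, n_words, rng):
    """Extend `seed` by repeatedly sampling the next word and feeding it back in."""
    words = seed.split()
    for _ in range(n_words):
        probs = next_word_probs(words[-1])
        if not probs:  # dead end: this word was never followed by anything
            break
        candidates, weights = zip(*probs.items())
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)


print(generate("the wind", 8, random.Random(0)))
```

A bigram table only remembers one word, so it drifts even faster than an LSTM does, but the sample, append, repeat loop is the same idea.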
Attention was designed as a way to improve on the short memory of recurrent models, and transformer models eventually did away with recurrent layers altogether, replacing the whole thing with attention (and a few other bits and bobs…). These new models include BERT, XLNet, and GPT-2. GPT-2 is lots of fun to play with; you can try the GPT-2 transformer model here: https://talktotransformer.com/. I put in the start of this blog post and got this:
4 Things I Learnt at the open data science conference.
You are using data that has been curated by scientists. Ideally, these scientists are trying to find new techniques, methods, or concepts.
Typically, these scientists are using tools and techniques in their own disciplines and institutions to make sense of the data.
- You are relying on the original scientist to handcraft each piece of the analysis, so you have an opportunity to watch their previous work and learn something new.
- Writing code
Which, kind of, makes sense - there is at least a thread running through the whole paragraph, that is related to the seed phrase.
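For anyone curious what the attention step in these transformer models actually computes, here is a rough NumPy sketch of scaled dot-product attention as defined in Attention Is All You Need. It is my own illustration, not code from the tutorial, and it leaves out the learned projections and multiple heads that a real transformer layer has; the toy sizes and random inputs are arbitrary.

```python
import numpy as np


def scaled_dot_product_attention(queries, keys, values):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)        # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ values, weights


rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                  # arbitrary toy sizes
x = rng.normal(size=(seq_len, d_model))  # stand-in for 5 token embeddings

# Self-attention: queries, keys and values all come from the same sequence.
output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape)   # (5, 8)
print(weights.shape)  # (5, 5)
```

Every output row is a weighted average over all positions in the sequence, so nothing has to be carried along step by step as in an LSTM.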
2. I like Python (mostly)
After 10 years of using mostly R, lately I have been using more and more Python and I’ve really been enjoying it (with some misgivings that I won’t go into here). So at this conference I tried to go to as many Python-related talks as I could. It made me feel like a noob again, but was fun. Particularly good were these three:
Andreas Müller: Professor at Columbia and developer of scikit-learn.
He was giving a series of talks on using the preeminent Python machine learning library, scikit-learn. I couldn’t go to all of them (too much good stuff on!), but the one I went to (on using pipelines) was excellent. It’s nice to hear directly from the developer of such a popular package. Like he said:
“I can ignore the warnings because I wrote them!”
It was good to get a bit of a break from neural nets and focus on what is sometimes (a bit disparagingly maybe?) called ‘traditional machine learning’. This is actually the stuff that gets used by most data scientists in a commercial setting, and getting an in-depth tutorial from such an expert was really valuable. You can get the code for the whole series of tutorials at this GitHub repo.
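As a reminder of what a pipeline buys you, here is a small sketch of my own (not taken from the tutorial repo). It chains a scaler and a classifier so that cross-validation treats them as one estimator. The synthetic dataset and the choice of `StandardScaler` with `LogisticRegression` are just assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler is re-fitted on the training part of each fold only, so no
# information from the held-out part leaks into the preprocessing.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Doing the scaling by hand before calling `cross_val_score` would fit the scaler on all the data, which is exactly the kind of leak a pipeline avoids.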
Ian Ozsvald, organiser of PyData London, and self-styled ‘interim chief data scientist’, on tools for higher performance Python.
He had loads of tips for getting Python code to run faster. More details are here. These are the things I noted down:
- %%timeit - for timing stuff in Jupyter.
- line_profiler - profiling package.
- Dask - multicore vs distributed computing. dask-ml for distributed sklearn ML. Easy to parallelise Pandas functions.
- swifter - project for running code multicore. Sits on Dask.
- numba - precompile functions:

      import numba

      @numba.jit(nopython=True)  # compile without falling back to Python objects
      def foo(row):
          pass  # do stuff

- shelve - cache module.
- bulwark - pandas testing/schema.
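Two of the tips above need nothing beyond the standard library, so here is a small sketch of how I read them: the `timeit` module for timing outside Jupyter, and `shelve` as a simple on-disk cache. This is my own interpretation of the notes rather than code from the talk; `slow_square`, `cached_square` and the cache location are made up for the example.

```python
import os
import shelve
import tempfile
import timeit


def slow_square(n):
    """Stand-in for an expensive computation (adds n to itself n times)."""
    return sum(n for _ in range(n))


def cached_square(n, cache_path):
    """Look the result up in an on-disk shelf; compute and store it on a miss."""
    key = str(n)  # shelve keys must be strings
    with shelve.open(cache_path) as cache:
        if key not in cache:
            cache[key] = slow_square(n)
        return cache[key]


cache_path = os.path.join(tempfile.mkdtemp(), "squares")

# Outside Jupyter, the timeit module does the job of the %%timeit magic.
uncached = timeit.timeit(lambda: slow_square(200_000), number=5)
cached_square(200_000, cache_path)  # first call fills the cache
cached = timeit.timeit(lambda: cached_square(200_000, cache_path), number=5)

print(f"uncached: {uncached:.4f}s, cached: {cached:.4f}s")
```

Because the shelf lives on disk, the cached results survive between runs, which is handy for slow preprocessing steps in a notebook.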
Daniel Voigt Godoy from Deloitte, on PyTorch.
Colab link, GitHub. He showed how to implement a simple linear regression in PyTorch using gradient descent. This was a good intro as it used PyTorch’s main features in one simple example. A really useful introduction to the framework. I liked this image he used:
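To make the gradient descent loop for linear regression concrete, here is a rough sketch of my own, not the speaker's code. It is written in plain NumPy with hand-derived gradients so the update rule is visible and it runs without PyTorch installed. In PyTorch the two gradient lines would be replaced by autograd (`loss.backward()`) and the updates by an optimizer step. The data, the true coefficients and the learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 1 + 2x + noise (coefficients chosen arbitrarily).
x = rng.random(100)
y = 1 + 2 * x + 0.1 * rng.normal(size=100)

a, b = 0.0, 0.0   # intercept and slope, starting from zero
lr = 0.1          # learning rate

for epoch in range(1000):
    error = (a + b * x) - y            # forward pass: prediction minus target
    loss = (error ** 2).mean()         # mean squared error
    a_grad = 2 * error.mean()          # d(loss)/da
    b_grad = 2 * (x * error).mean()    # d(loss)/db
    a -= lr * a_grad                   # gradient descent update
    b -= lr * b_grad

print(a, b)
```

The fitted `a` and `b` should land close to the 1 and 2 used to generate the data, with the gap down to the added noise.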
3. AI has a long way to go but DeepMind are doing amazing stuff
Danilo J Rezende, from DeepMind, gave a fascinating talk on his research around model-based reinforcement learning. I find this stuff really interesting, although I don’t fully understand all (much?) of it. He was basically arguing that it is useful to learn generative models of the world. If we want really successful agents we will need to move beyond classification and model the inputs, not just the outputs. To do this you need to understand causal structure. This really chimes with me. The general approach echoes ideas put forward about how the human brain works by neuroscientist Karl Friston and my favourite philosopher Andy Clark. There seems to be loads of work in this area at the moment, and I’m keen to keep up with it. As yet it is mostly fairly theoretical without loads of applications, but I think that will change soon.
- paper: “One-shot generalisation in deep generative models”, Danilo Rezende
- paper: “Neural scene representation and rendering”: really interesting. Introduces the Generative Query Network (GQN), which fills in 3-D scenes from a small number of images. Actions driven by reducing uncertainty (à la Friston). The GQN learns a factorised representation of a scene. Full paper
4. Vendors don’t really have anything very good to offer data science
IBM’s ‘keynote’ talk was fairly dismal. I guess if you are the ‘diamond’ sponsor you get to do a sales pitch. Maybe it could be a bit better disguised. I’m never very impressed with Watson - it’s basically just branding of a fairly standard service. I was once pitched at by IBM and it was about 25 consultants in suits with 1 statistician on the phone suggesting we use a linear regression. Telling a room full of data scientists about how data can change our organisations is a bit redundant. 85% of statistics are made up on the spot.
They were pitching their AutoAI product (along with most of the vendors here). I’m fairly sceptical about these. Such a small percentage of a data scientist’s time is spent running through various models to select the best. Professional data science is nothing like a Kaggle competition where you are given a dataset nicely formatted in .csv. The work is in understanding the business problem and how the data relates to it. Prediction is the quick bit at the end.
To be fair, the organiser didn’t look too happy with the talk. And also to be fair, I think the guy who was supposed to be doing it had to pull out at the last minute. But anyway - sales pitches are not welcome at a conference you paid large amounts of money to attend.
All in all though, it was a really good conference and I reckon I’ll be back next year in Dublin.