It’s very common to see Data Science (DS) and Machine Learning (ML) practitioners enter the field after some years in academia. Even those who didn’t pursue advanced degrees are usually very interested in the evolution of the craft and its latest advancements.
Hosting a Journal Club is a good way to give intellectually curious Data Scientists and Machine Learning Engineers a sort of safety valve: it lets them keep in contact with what’s happening in the field even as they focus on their day-to-day work (which often doesn’t involve the latest shiny tools or technologies).
We have been running a DS Journal Club at Nubank since 2019 (kudos to Paulo Rossi and others) and we believe it has had many positive effects on our culture and on the company at large. Let us talk a little about why we think so and about the lessons we learned along the way.
Are you curious about the Journal Club? Keep reading this article!
The Journal Club is a recurring meeting in which members of Nubank’s Data Science chapter discuss scientific articles, conference papers, book chapters and/or industry blog posts.
Its format has varied over time, as we adapted to the different needs of remote and hybrid work environments and to the growth of the chapter.
We started off with in-person meetings. As the chapter grew and we moved to a remote environment, it became harder to find speakers, and we realized that the remote setting, coupled with the larger number of participants, inhibited contributions.
Finally, last year, with even more participants, we pre-selected the articles that received the most votes from Data Science chapter members. Instead of using slides, we provided a template on a Miro board, which helped our hosts (we purposefully avoided calling them presenters) guide the discussion with the participants.
Each session attracted a different set of participants; not everyone attended all the meetings and that made perfect sense, considering the size of the chapter now and all the different interests of our members.
Adapting to change is what has kept the Journal Club alive over the years, and we learned a few lessons that are worth sharing.
There are many reasons why it makes sense for data-driven companies to have some sort of Journal Club in place. Let’s go through some of these next:
The Journal Club provides opportunities for people to practice public speaking skills. We usually operate on a voluntary basis: people volunteer to host or present a given resource at the club.
The Journal Club sessions should be a safe space to practice talking about technical subjects and facilitate meetings. To ensure hosts are set up for success, the organization team always reaches out to them beforehand to offer some suggestions on how to conduct the meetings.
For companies that operate in a fully remote or hybrid mode, these meetings are also a great way to instill a sense of community among Data Scientists and Machine Learning Engineers.
A “Chapter” is what we call a “job family” at Nubank. The Data Science Chapter is therefore the collection of all Data Scientists and Machine Learning Engineers working at the company.
Sometimes, people from different teams get siloed in their respective business areas. Journal Club meetings are also a way for people to get together, chat and cross-pollinate experiences in general.
This is not so much a benefit to the people, but to the company at large.
The simple effort of reading and getting to know what is going on in academia and in the industry is a great way to avoid being left behind as a company. The first step to enable innovation is getting to know various other ways of doing things.
Journal clubs encourage people to read and share content with other like-minded folks, and this can be the first step towards testing out something new, trying out some new algorithm, etc.
Fostering leadership/organization skills
On the logistical/organization side of things, it takes some effort to put together recurring meetings and deal with the problems that come up. These are useful skills that transfer to other areas of your life, and they should help in your career as well.
As mentioned earlier, we’ve been organizing Journal Clubs at Nubank since 2019, and we experimented a lot until we reached the current state.
Our objective is for the Journal Club to be useful both for the participants and for the company (after all, there are always opportunity costs involved). Here are some of the lessons we learned along the way.
Not too much math
Articles and papers that are too math-heavy are usually not suitable for ~1 hour sessions, as there isn’t enough time to go through things like theorem proofs.
If people want to present an article that’s very heavy on mathematical notation, they should present a condensed version with the main takeaways only, because we can’t assume that everyone in the audience will read the paper in full.
Self-contained articles are better
Presenting articles or papers that require a lot of background knowledge is also not a very good idea.
If you want to present something that only makes sense if people have read 5 other articles before that, chances are nobody will understand a thing and we’ll have wasted everybody’s time.
Stick to articles and posts that are self-contained – those that don’t require a lot of background knowledge to be understood (other than a general understanding of DS/ML).
Set aside time for silent reading at the start
Participants naturally have their day-to-day tasks and it’s not always possible to read the article/post before the discussion meeting. Don’t assume everyone has read the resource before the discussion.
Setting aside the first 5-10 minutes of the meeting for silent reading of the selected resource gives some slack to people who haven’t had time to read the material beforehand. It brings everyone up to speed, so the discussion can proceed with everyone having at least a rough idea of what the article is about.
Select resources upfront
We found that it’s a bit stressful to have to choose resources and find volunteers before each session (we usually run sessions every 2 weeks). It quickly turns into a chore and we risk people just dropping out.
It works much better if organizers select the list of resources once, at the beginning of the season, every 6 months or so.
Our suggestion is to proceed as follows:
- Initial list (picked by organizers): At the beginning of the season, organizers pick an initial list with 10-15 resource suggestions (conference papers, preprints, journals, industry blogs, etc.).
- Additional list (suggestions from the team): In addition to the initial list, ask all members of the chapter for another list, with more suggestions.
- Merge and have people vote: Merge/deduplicate those 2 lists and put the result up for voting. Ask every member of the DS/MLE chapter to vote on the resources they are most interested in.
- Select top K and schedule sessions: After people have voted, select the resources with the most votes and start scheduling the sessions in the common calendar.
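The merge, vote and top-K steps above can be sketched in a few lines of Python. This is only an illustration, not the tooling we actually use: the function names, the title-based deduplication rule and the placeholder titles and ballots are all assumptions made for the example.

```python
from collections import Counter


def normalize(title: str) -> str:
    """Case- and whitespace-insensitive key used to spot duplicate suggestions."""
    return " ".join(title.lower().split())


def merge_suggestions(organizer_list, team_list):
    """Merge both lists, keeping the first spelling seen for each resource."""
    merged = {}
    for title in [*organizer_list, *team_list]:
        merged.setdefault(normalize(title), title.strip())
    return list(merged.values())


def select_top_k(candidates, ballots, k):
    """Count at most one vote per member per resource; return the k most voted."""
    known = {normalize(title): title for title in candidates}
    votes = Counter()
    for ballot in ballots:
        # A set prevents one member from voting twice for the same resource.
        for key in {normalize(title) for title in ballot}:
            if key in known:
                votes[known[key]] += 1
    # Most votes first; ties are broken alphabetically so the result is stable.
    ranked = sorted(candidates, key=lambda title: (-votes[title], title))
    return ranked[:k]


# Hypothetical data: placeholder titles and made-up ballots.
organizers = ["Paper A", "Blog Post B"]
team = ["paper  a", "Book Chapter C"]  # "paper  a" duplicates "Paper A"
candidates = merge_suggestions(organizers, team)

ballots = [
    ["Paper A", "Book Chapter C"],
    ["Book Chapter C"],
    ["Blog Post B", "Book Chapter C"],
]
top_two = select_top_k(candidates, ballots, k=2)
print(top_two)
```

Breaking ties alphabetically is an arbitrary choice; organizers may prefer to break ties by hand, for example in favor of resources that already have a volunteer host.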
This way of doing things accomplishes several objectives:
- Bottom-up: Organizers put up an initial list of suggestions but then they collect ideas from the rest of the team, rather than dictating what will be studied.
- Democratic + relevant: By selecting resources people have voted for, we guarantee that we’ll cover topics people are interested in – which means more people will attend the sessions.
- Planned upfront: Because we select the resources up front for the next months, we give people a lot of time to read up on what they are interested in and organizers have a lot of time to find volunteers to host or facilitate the sessions.
Drive engagement with collaboration tools
It is very useful to have a collaborative board (such as Miro) during the meeting, so that people have somewhere to draw and write notes.
We have a template for this board.
We make a copy of this for every new session and we ask the host to fill it out with information about the particular content they are talking about. This has several advantages:
- Collaborating: The audience can add their own notes to the documents, increasing the collaboration potential.
- Engaging: People have something to look at during the presentation, instead of just looking at the article text.
- Documenting: It serves as documentation of the meeting for those who were present, for those who could not attend and even for people who have not yet joined Nubank (and who will have access to past meetings).
Record if people are comfortable
We usually record the sessions and save them on internal websites where people can find and watch them later on.
You may even want to publish sessions outside of your organization, provided you are not revealing any business secrets or any other private information about your company!
All of these are good ideas – but make sure everybody is comfortable with having their presentation recorded for posterity, and always ask for permission first!
Organizing events is work too!
Hosting and organizing a Journal Club takes work, and it’s only fair that this work gets rewarded and recognized like any other work done for the benefit of the company.
So it’s important to make sure leadership and management are supportive of this, and that they take it into account for purposes of performance management, promotions and compensation for people who have taken the time to contribute to or organize the sessions.
If you’re interested in holding a journal club of your own, these are some of the sources we keep an eye on to look for interesting articles and/or blog posts:
Industry / Personal blogs
There are many good blogs around, mostly written by companies that apply DS/ML in their day to day operations but also from those that have dedicated research teams, such as Google and Meta.
Many of these sources provide RSS/Atom feeds you can subscribe to. It’s especially useful to plug RSS feeds into Slack channels so you get notified when there is new content!
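As a rough sketch of how such a notification could be built, the snippet below parses an RSS 2.0 document and prepares a JSON body with a `text` field, which is the shape Slack incoming webhooks accept. The feed content, the example.com URLs and the function names are hypothetical, Atom feeds (which use different element names) are not handled, and the final HTTP POST to your own webhook URL is left out.

```python
import json
import xml.etree.ElementTree as ET


def latest_items(rss_xml: str, limit: int = 3):
    """Return (title, link) pairs for the first `limit` <item> entries of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        items.append((title, link))
        if len(items) == limit:
            break
    return items


def slack_payload(feed_name: str, items) -> str:
    """Build the JSON body for a Slack incoming webhook (a single `text` field)."""
    lines = [f"New posts from {feed_name}:"]
    lines += [f"- {title}: {link}" for title, link in items]
    return json.dumps({"text": "\n".join(lines)})


# Hypothetical feed content, inlined so the example runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example ML Blog</title>
    <item><title>Post one</title><link>https://example.com/one</link></item>
    <item><title>Post two</title><link>https://example.com/two</link></item>
  </channel>
</rss>"""

items = latest_items(SAMPLE_FEED, limit=2)
payload = slack_payload("Example ML Blog", items)
print(payload)
```

In practice, a ready-made RSS integration for your chat tool is usually simpler than maintaining a script like this.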
Here are some high-quality blogs we usually pick posts from:
- DoorDash ML Blog
- Airbnb – Data Science
- OpenAI Blog
- Meta AI Blog
- Google AI Blog
- Chip Huyen’s Blog
- Sebastian Raschka’s Blog
Newsletters / Aggregators
Conferences / Journals
Articles/resources read last season
- BERT Rediscovers the Classical NLP Pipeline, hosted by Hellen Lima
- Efficient and Robust Automated ML, hosted by Daniel Boriero
- Real-time ML: Challenges and Solutions, hosted by George Salvino
- Fashion Classifier in Production using GCP, hosted by Pierre Krzisch
- To understand Deep Learning we need to understand Kernel Learning, hosted by Rafael Calsaverini
- Training ML Models More Efficiently with Dataset Distillation, hosted by Felipe Coelho (Lodur)
- Tabular Data: Deep Learning Is not all you need, hosted by Raphael Dayan
- On Challenges in ML Model Management, hosted by Luan Moura
- Learning to Complement Humans, hosted by Cinthia Tanaka
- ML Feature Serving Infrastructure at Lyft, hosted by Felipe Almeida
- Data Distribution Shifts and Monitoring, hosted by Jose Andrade
- Feature Selection using Boruta, hosted by Pedro Cardoso
- Interpretable Machine Learning Book: Chapter 3 Interpretability, hosted by Andryw Ramos
- Statistical Tests: A Guide to Misinterpretation, hosted by Pedro Lealdino
- ML With Graphs (post 1, post 2), hosted by Edesio Alcobaça
In the last season the following Nubankers participated as presenters: Daniel Boriero, George Salvino, Pierre Krzisch, Cinthia Tanaka, Felipe Coelho (Lodur), Raphael Dayan, Hellen Lima, Misael Moura, Rafael Calsaverini, Felipe Almeida, Jose Andrade, Pedro Cardoso, Andryw Ramos, Pedro Lealdino, Edesio Alcobaça