Data Umbrella Scikit Learn Online Sprint Report

Sprint Background

This sprint was organized by Reshama Shaikh of Data Umbrella and NYC PyLadies to increase the participation of underrepresented persons in data science. The sprint was organized entirely on volunteer time.

The sprint was originally scheduled as an in-person event in New York City, which would have been the fourth year in a row that I (Reshama Shaikh) organized a sprint in NYC. Due to the coronavirus pandemic, it became a virtual event.

This report covers the summary, impact, and lessons learned from the first online scikit-learn sprint.


To ensure that attendees had some knowledge of Python and scikit-learn, a brief application form was used. Attendees did not have to be experienced Git users, but some experience was helpful.

About 25% of the attendees had participated in a scikit-learn open source sprint before. The attendees were evenly split by gender. Participants from ten different countries joined. Over half the participants identified as “underrepresented.”

Most attendees learned of the event via Twitter and word of mouth, followed by the scikit-learn GitHub repo and then other social media and community platforms (e.g. Slack, Meetup, LinkedIn).

Impact Report for Data Umbrella Scikit-learn Sprint

                                  Sprint 2020
Report date                       27-Jun-2020
Sprint date                       06-Jun-2020
Location                          Online; Global
Sprint website                    2020 Online Sprint
                                  Twitter Moment
Open source library               scikit-learn
GitHub repository link            data-umbrella/2020-sklearn-sprint
List of Issues                    project list
Organizer                         Reshama Shaikh
Lead Facilitator                  Andreas Mueller
Scikit-learn core contributors    Adrin Jalali, Thomas Fan, Nicolas Hug, Chiara Marmo, Tom Dupre la Tour
Teaching Assistants               Mark Hannel & Shashank Singh
Helpers                           Mariam Haji, Noemi Derzsy, Melissa Ferrari
Platforms                         Discord & Zoom
Sponsor: venue                    Not applicable
Sponsor: food                     Not applicable
Cost of Sprint                    60 hours volunteer time to organize event
PRs [MRG] at sprint               27
PRs [MRG] post-sprint             30
PRs open                          4
PRs returned to issue pool        ?
PRs merged
PRs open
Attendees: Initial Registrations  51
Attendees: Participated           ~ 42
Attendee List                     2020
Post-sprint Survey                survey form (closed)
Blog 1: by Joe Lucas              Scikit-learn Sprint: My Open Source Adventure
Blog 2: by Jake Tae               A Reflection on My First Open Source Contribution Sprint
Blog 3: by C Thinwa               Why you should contribute to open-source as a data scientist
Blog 4: by Maren Westerman        Review of the Data Umbrella Scikit-learn Online Sprint June 2020

Impact Summary for 2020

Preparation Work

Because this was a virtual event and an 8-hour online sprint was not appealing (to me), I cut the sprint time in half and increased the preparation work that attendees could do beforehand.

Two videos help newcomers get started with contributing to scikit-learn.

Added Bonus: Many people who discovered these resources from social media and were not participating in the sprint watched the videos and began submitting PRs. The goal was to make contributing accessible to a larger pool of people.

Challenges for Me, as an Organizer

Participants can use a variety of different (user) names (and nicknames) for email, Discord, GitHub, social media, etc. It is challenging to connect their different profiles and assist them, and it makes getting to know participants more time-consuming and difficult.


Technology Platforms

At a typical in-person sprint, most interaction happens face to face, with some communication in the scikit-learn sprint Gitter channel.

For the virtual event, the following platforms were used, chosen in part to limit costs:

  1. Zoom: for the presentation, from 11:45 am EDT to about 12:15 pm EDT (Discord allows a maximum of only 25 people in any one channel, and about 50 people were joining)
  2. Discord: during sprint time
  3. Gitter: for use after the sprint. (Our sprints are typically in person, so participants would ask questions in person or on Gitter.) There are more core scikit-learn contributors on Gitter than on this Discord. Typically, though, if a question relates to a specific pull request (PR), the conversation happens on the GitHub PR.

Pair Programming

About 8 people were no-shows, and their pairs needed to be reassigned at the start of the event.


Discord was unfamiliar to some people, and there was a learning curve in navigating it.

Applicant Responsiveness

Over 80 applicants were sent acceptances and half did not RSVP. One reason is that the sprint emails were going to spam folders.

Virtual Environment Setup for Windows

A number of Windows users experienced challenges in setting up their virtual environment.

Pair Programming

This was an entirely online event. Participants were assigned their partner prior to sprint start. Where possible, a new contributor was matched with a returning contributor.

Follow-up Office Hours

Office hours were set up 2 weeks after the sprint where some of the scikit-learn core contributors were available to answer questions on open PRs. The office hours were scheduled for 7am PDT / 10am EDT / 5pm EAT / 7:30pm IST.
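As a quick check that the four listed times describe the same moment, here is a minimal Python sketch. The calendar date is hypothetical (the report only says "2 weeks after the sprint"), and fixed UTC offsets are assumed for each zone rather than a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets for the zones named above (assumption: June, when US
# daylight saving time is in effect).
PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")
EAT = timezone(timedelta(hours=3), "EAT")
IST = timezone(timedelta(hours=5, minutes=30), "IST")

# Hypothetical date; only the 7am PDT start time comes from the report.
start = datetime(2020, 6, 20, 7, 0, tzinfo=PDT)

# Convert the single start time into each participant time zone.
local_times = {
    tz.tzname(None): start.astimezone(tz).strftime("%H:%M")
    for tz in (PDT, EDT, EAT, IST)
}
print(local_times)
```

Listing one converted time per region in announcements saves participants from doing the conversion themselves.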

Non-measurable Impact

Aside from the number of PRs that were merged, the open source sprint has impact that cannot be quantified. Some examples include:

  • learning to set up a virtual environment
  • using Git (fork, clone, branch, fetching another’s PR)
  • introduction to tests such as: flake8 (linting, formatting), pytest, “continuous integration”
  • navigating through the codebase structure of scikit-learn
  • digging into functions, learning about errors
  • learning about unit tests
  • interacting with contributors on GitHub
  • learning, in general
  • networking
  • building confidence (making a dent in “imposter syndrome”)
  • having fun

Sprint Feedback

Feedback has been shared in a number of ways:

  • Twitter Moment
  • Blogs
  • Sprint survey
  • Casually, in conversation during the sprint


Adjustments for Next Sprint

Application Form: reminder for spam

Remind participants that communication is sent from other platforms (Mailchimp, Eventbrite, etc.) and may go to spam. They should keep an eye on their spam folder or email the sprint organizer if they have not heard back.

Application Form: pronouns

Ask for preferred pronouns on the application, and also include them on the website for contributors.


Add in a slide to explain to participants how to look for issues to work on.

Pair Programming

Add in a slide explaining how pair programming works.

Second Pull Request

Update slides / documentation to show how to submit a second PR.

Pair Partner

Explore how to optimally match participants as pair partners based on experience.


Using three platforms (Zoom, Discord and Gitter) was confusing for attendees. A single platform would be preferred.

Scikit-learn Mailing List

Include link to scikit-learn mailing list in communications. Encourage participants to sign up for the mailing list to keep up to date on discussions. The mailing list is also a good way to learn about open source, the library and the community.

Setting up virtual environment

We encouraged people to set up their virtual environment beforehand. The dilemma is that if we make it optional, more people probably will not do it, and if we make it required, people who have not set up beforehand may not attend the sprint. Some people had difficulty setting up their environment and thought they could only join the sprint if their setup was ready. Action: find a way to optimize setup before the sprint while providing support for those who need it.

Discord categories

For each virtual table, use categories to group the table's chat and voice channels. In the current setup, the voice and chat channels for each table are not adjacent to each other.


Query for PR:

  • Open PRs: 4 (Query: is:pr is:open created:>=2020-06-04 #DataUmbrella)
    • Open (w/o date range)
  • Merged PRs: 57 (Query: is:pr is:merged created:>=2020-06-04 #DataUmbrella)
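The queries above can be turned into shareable search links. The following is a minimal sketch, assuming the queries were run against the pull request list of the scikit-learn/scikit-learn repository on GitHub. The helper name `sprint_pr_search_url` is hypothetical and introduced only for illustration.

```python
from urllib.parse import urlencode

# Assumption: the queries target the scikit-learn/scikit-learn repository.
REPO_PULLS_URL = "https://github.com/scikit-learn/scikit-learn/pulls"


def sprint_pr_search_url(state, created_since="2020-06-04", tag="#DataUmbrella"):
    """Build a GitHub pull request search URL for sprint PRs in a given state.

    Hypothetical helper: `state` is "open" or "merged", matching the two
    queries listed in the report.
    """
    query = f"is:pr is:{state} created:>={created_since} {tag}"
    # urlencode percent-encodes ':', '>', '=' and '#', and turns spaces into '+'.
    return f"{REPO_PULLS_URL}?{urlencode({'q': query})}"


open_url = sprint_pr_search_url("open")
merged_url = sprint_pr_search_url("merged")
print(open_url)
print(merged_url)
```

Keeping the date and tag as parameters makes it easy to reuse the same query for a later sprint.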


  • [no addendums or updates at the time of publication]

