Like many data scientists, I wanted to contribute to open source, but I thought that “open source contribution” meant creating a new Python library. That would require expertise in objects, classes, inheritance, methods, decorators, parallelism, asynchronous programming, and more in order to write long, complex code. But I’m a statistician, and that level of Python and computer science is beyond my scope of knowledge.
In early 2017, Andreas Mueller, a core maintainer of the Python machine learning library scikit-learn and co-author of Introduction to Machine Learning with Python, reached out to me as an organizer of the NYC Women in Machine Learning & Data Science meetup group. His goal was to increase the participation of women in open source: a 2013 survey found that only 11 percent of open-source contributors were women.
After much planning, the event Scikit-learn Sprint was scheduled, the scikit-sprint GitHub repo was set up, prep work was created for participants, teaching assistants were recruited, Gitter was introduced, breakfast was ordered… All details were meticulously choreographed. I was enthusiastic about organizing this event for many reasons:
- to involve a community of users (particularly women) in open source
- to make a dent in the 700+ open issues at the time (now 1000+)
- for me to contribute by submitting a pull request to a critical Python library
After all that planning, I was unfortunately unable to attend due to a severe ankle injury. My dreams of contributing to open source in March 2017 were crushed. Perhaps next time.
Fortunately, the event was still a success.
“A big thank you to everyone that made it out. Open source needs you! I hope we'll see more from you all in the future!” https://t.co/7xgWklizMO (Andreas Mueller, @amuellerml, March 5, 2017)
What Is Open Source?
Open source has traditionally been known as:
Free software whose source code is open to the public.
Familiar examples include:
- Linux operating system
- Apache projects
- NumFocus Sponsored Projects
- more programming languages
- web browsers
- text editors
- and many other open source applications
In November 2017, I participated in the inaugural Diversity & Inclusion in Scientific Computing (DISC) Unconference, where I broadened my understanding of open source.
We can expand the purview of open source to include the following contributions:
- creating additional libraries for programming languages
- contributing to programming languages by fixing bugs or adding features via GitHub
- tutorials and training materials (text, video)
- research papers
- answering user questions on forums (Stack Overflow)
- useful tweets (Notable Data Scientists)
- talk slides and videos
- datasets (Kaggle)
- trending repositories on GitHub
- translations of English documentation
An updated definition could be:
Free software whose source code is open to the public, as well as any documentation, examples, applications, maintenance and support of software and related tools for all users.
Why Contribute to Open Source?
Because everyone uses it and benefits from it, and there is a lot of work to be done. Other advantages include:
- learning a skill by contributing
- sharing work
- augmenting your portfolio by including your open source contributions
- networking opportunities
fast.ai was founded by Jeremy Howard and Rachel Thomas with the goal of making deep learning (DL) accessible to the masses. Most importantly for me, the program also gives those who are interested, but not based in San Francisco, an opportunity to attend the lectures online, and it makes all course materials available free of charge, including an active and vibrant Discourse community. I learned of the fast.ai International Fellowship (for the Fall 2017 class) through a data science connection of mine just two days before the application deadline. I applied, and I am happy to say I was accepted! Approximately 300 students attended in person, and 400 international fellows participated online.
My personal experience in data science (both as a student and as a teacher) is that DevOps can be a barrier to entry. By DevOps I mean the skills needed to set up a workable environment before doing any data science: using Git (version control), setting up a machine in the cloud (via AWS, Paperspace, or a myriad of other options), configuring an environment, navigating bash and an editor in a terminal on the cloud machine, and downloading images and transferring them to the cloud machine. These skills are separate from the machine learning itself, and beginner data scientists often struggle with them.
Initially, while enrolled in this online course, I took notes in Google Docs for myself. After a week or so, I decided to create a repo on GitHub, fastai_deeplearn_part1, and share my notes in case a few other students might find them helpful.
To date, thousands of users have referenced the documents in this repo.
I had contributed to open source!
Conclusion & Inspiration
I live in the United States, specifically New York City. We are living in a time when any topic under discussion is inherently divisive. I am reminded of Charles Dickens’ famous lines:
What I love about data science is the passion that everyone bitten by the data bug brings to it. Data science ignores artificial boundaries of field of study, profession, nationality, religion, language, and time zone. I have benefited from the work and tools of many contributors in building my data science skills. It’s been enormously gratifying to create and share this repo, which enables so many aspiring and practicing data scientists all over the world to manage DevOps tools and move forward with deep learning. That is the true depth and value of open source. It really is open to everyone.
What the Future Holds
Given the expanded definition of open source, the opportunity to contribute is available to users of all levels. I foresee a resume section entitled “Open Source Contributions” where data scientists describe how they have contributed to open source.
Open Source Projects