Kaggle is the first place to go for anyone who is learning machine learning. This interactive online platform provides hundreds of datasets and tutorials that you can use to kick-start your ML career.
But the website is most famous for its competitions. It can be hard for a newcomer to find their way around the interface and understand where to get started. So in this post, we will get you started with your first Kaggle competition!
A few words about Kaggle competitions
Kaggle competitions are machine learning tasks set by Kaggle or by other organizations such as Google or the WHO. If you compete successfully, you can win real money prizes.
Competitions range in types of problems and complexity. You can take part in one even if you’re a beginner. However, advanced competitions are much more interesting, and a leading place in a competition is a great addition to your machine learning engineer resume.
Competitions are held in three different formats.
Simple competitions are your standard Kaggle competitions. You access data, build a model, and make a submission. Then, your results are checked by the hosts of the competition and you are assigned a score on the leaderboard. The majority of competitions on Kaggle follow this format.
In two-stage competitions, every challenge has two parts. Stage 2 offers a new test dataset that is released at the start of the stage. To access it, you have to make a submission at stage 1. To participate in such competitions successfully, you need to read the rules carefully and keep an eye on the timeline.
In code competitions, submissions are made from inside of a Kaggle Notebook (we explain later on what it is). These competitions, in a way, are fairer because all the users have the same hardware. Code competitions may have constraints on the Notebooks you can submit, for example, CPU or GPU runtime, ability to use external data, and access to the internet. So reading the rules here is also very important.
In Kaggle competitions, everyone competes in teams of one or more people. Every team must have a team leader. You can invite your friends or join a team of other users via the Team tab in the competition. Your team can also merge with another team, for example, if you realize that submitting a model on your own is too challenging. However, you can only do it until a certain deadline. If there are any other details you would like to clarify, feel free to explore the Kaggle Competitions page.
How to use Kaggle to compete
Let’s learn how to use the platform. After you register, you get redirected to a personalized feed with posts, competitions, and discussions that might interest you. From here, you can go to the Compete tab.
Click on the Titanic competition. All newly registered users are invited to participate in a simple competition to understand how Kaggle works. Even if you don’t know how to code, it is not a problem.
In the Titanic competition, you need to use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. In other words, your model should answer the question “which passengers were more likely to survive?” using data like name, age, gender, and socio-economic class.
Every competition has these tabs:
- In Data, you will find the datasets to train and test your model.
- Notebooks are your workspace. They contain tutorials, blog posts, documentation. They can also execute code without you installing anything.
- In Discussion, you can communicate with other people who participate in the competition, ask questions, and give advice.
- The Leaderboard shows the scores of the participants in the competition.
- Datasets include additional datasets that have been added by participants.
- Finally, Rules contain the rules of the competition.
If you want to participate, click Join the Competition.
How to make your first submission
Now let us learn how to participate in a competition step-by-step.
1. Download the data
If you want to do some preliminary exploratory analysis before starting a kernel or just have the files on your computer, go to Data, scroll down, and click Download all.
2. Read the rules
Yes, it’s important. For example, sometimes rules can impose limitations on the use of data. You need to follow those rules to avoid being disqualified, especially for those competitions that are conducted by third-party organizations.
3. Study the open kernels
You can access other people’s public notebooks and see how they approached the problem. Just go to the Notebooks tab in the competition. It is definitely useful to study the best examples that got to the top 1%-3%. Solutions that did not win are also handy because you can inspect them and look for what could be improved.
4. Create your kernel
You can build models on Kaggle Kernels. This customizable Jupyter Notebooks environment with free GPUs is quite convenient, has libraries pre-installed, and allows you to generate a CSV prediction file for submission.
To create your own Kernel, go to the Notebooks tab and click New Notebook. To add the competition data to your Notebook, click Add Data -> Competition Data -> Add. No need to upload anything.
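Once the data is attached, it helps to check which files the Notebook can actually see. The sketch below assumes that attached data is mounted under /kaggle/input, which is the usual location in Kaggle Notebooks but is not stated in this post; the helper name list_input_files is made up for illustration.

```python
import os


def list_input_files(root="/kaggle/input"):
    """Return the sorted paths of all files found under `root`.

    The default root is an assumption about where Kaggle mounts
    attached data; pass another directory to try this locally.
    """
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirname, filename))
    return sorted(paths)


# Prints nothing if the directory does not exist (for example, on your own computer).
for path in list_input_files():
    print(path)
```

On your own machine, you can point the same function at the folder where you unpacked the downloaded competition files.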
If you work with other people, you can easily share your Notebook with your colleagues by clicking the Share button. The Notebook allows you to switch between different versions of the submission and safely store them. You can write up to 20GB to the directory that gets preserved.
5. Make a submission
Submissions on Kaggle are CSV files and usually have two columns: an ID column and a prediction column. Upload your submission and receive a score (in the Titanic competition, this is accuracy).
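To make the two-column format concrete, here is a minimal sketch that builds such a file with Python's standard csv module. The column names PassengerId and Survived are the ones used in the Titanic competition; the function name build_submission, the IDs, and the predictions are invented for illustration, and other competitions will expect different column names.

```python
import csv
import io


def build_submission(ids, predictions,
                     id_column="PassengerId", target_column="Survived"):
    """Return CSV text: one header row, then one (id, prediction) row per example."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([id_column, target_column])
    writer.writerows(zip(ids, predictions))
    return buffer.getvalue()


# Made-up IDs and predictions, for illustration only.
submission_text = build_submission([892, 893, 894], [0, 1, 0])
print(submission_text)

# In a Notebook you would save the text to a file instead, for example:
# with open("submission.csv", "w", newline="") as f:
#     f.write(submission_text)
```

Always check the sample submission file in the Data tab of your competition for the exact header and row count that are expected.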
To try out submitting on Kaggle, follow this tutorial. It even provides you with code; no need to write anything.
6. Check the leaderboard
It’s always thrilling to see who won! See how your model ranks on the leaderboard. You can win a gold, silver, or bronze medal that will increase your reputation on the platform. In advanced competitions, it is possible to win good money.
7. Improve your score
It is usually possible to improve your score and go even higher on the leaderboard. Read discussions, ask questions, and get insights from other competitors to learn and improve.
How to find a contest for beginners on Kaggle
If you don’t have much experience with machine learning, it can be hard to find the competitions suitable for beginners. I recommend you use filtering.
The categories that interest us now are Getting Started and Playground. Other categories offer much more advanced tasks that might seem discouraging at the start. But let’s briefly talk about all of them:
- Featured. Competitions organized by third parties that offer generous prizes. They usually run for 2-3 months.
- Research. Research-oriented tasks have little to no prize money, but they are good for boosting your skills.
- Recruitment. Companies that want to hire the best of the best can host such competitions to find data scientists.
- Getting Started. These competitions are meant as practice for those who are just starting to learn ML. Usually, they don’t have any prize pools but sometimes can offer small rewards. Making submissions for Getting Started is easier because there are plenty of tutorials and example submissions online. You can enter them at any time.
- Masters. Kaggle has tiers, and being a Kaggle master is quite prestigious. It can even be considered by your employer in the hiring process. Sometimes Kaggle hosts competitions where the best of the best compete against each other.
- Playground. These competitions often have to do with games and are mostly for fun. They are suitable for beginners who have gained a little bit of experience and want to grow.
- Analytics. This category is dedicated to data analysis.
However, if you apply the filter Getting Started, you will see only a few competitions there. Once you get some experience, feel free to explore more options. It is a good idea to use search to find posts and discussion topics where people share their opinion on the competitions that are suitable for beginners. Here is what I would recommend:
- Learn to predict housing prices with linear regression.
- Predict whether the applicant is capable of repaying the loan.
- Experiment with computer vision and teach the machine to recognize hand-written digits.
- Predict keypoint positions on face images to analyze facial expressions or track faces in images and video.
- Make the first steps in sentiment analysis with Google’s Word2Vec.
- Practice decoding Morse code in audio files.
- Explore an unusual competition where your rival is the computer.
- Learn to use Tensor Processing Units for flower classification.
- Use GANs to make art.
One step above in difficulty:
- Create an algorithm that distinguishes dogs from cats.
- Teach your algorithm to classify leaves.
- Learn to predict the duration of a taxi trip.
- Practice regression skills with an approachable ML dataset.
- Create an AI to play against others in a simple game.
- Practice more building AI gaming agents.
If you have any competitions to add, feel free to send your idea to us, and we will include it in the post.
What kind of problems will I deal with in Kaggle competitions?
Kaggle competitions are diverse but, for the most part, they deal with one of the following problems:
- Computer vision;
- Image processing;
- Natural language processing.
The choice of the algorithm depends on the problem that you’re dealing with. It is tempting to solve everything by using only neural networks and deep learning, and quite often they do deliver good results. But not all the time. Sometimes sticking to ANNs is just inefficient. So don’t be afraid of simple solutions to simple problems and choose a machine learning technique wisely.
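One way to act on this advice is to start with a simple baseline and only move to heavier models if they beat it. The sketch below uses a one-line rule for the Titanic problem (predict that women survived and men did not) and measures its accuracy. The rule is an illustrative assumption rather than a recommended final solution, and the passenger rows are invented, not real Titanic records.

```python
def predict_survival(passenger):
    """Rule-based baseline: predict 1 (survived) for women, 0 otherwise."""
    return 1 if passenger["sex"] == "female" else 0


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical rows, invented for illustration (not real Titanic data).
passengers = [
    {"sex": "female", "survived": 1},
    {"sex": "male", "survived": 0},
    {"sex": "male", "survived": 1},
    {"sex": "female", "survived": 1},
    {"sex": "male", "survived": 0},
]

predictions = [predict_survival(p) for p in passengers]
labels = [p["survived"] for p in passengers]
print(accuracy(predictions, labels))  # 4 of 5 correct on this toy data -> 0.8
```

If a more complex model cannot clearly beat a baseline like this on held-out data, the extra complexity is probably not paying for itself.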
How to practice before the contest?
Now let us talk about what you can do before participating in a contest.
First of all, you need to choose a language: Kaggle competitions are usually tackled in either R or Python, sometimes in other languages like Julia, but mostly in Python.
Once you have chosen the language, you can start practicing on real datasets. I recommend the UCI Machine Learning repository. Try to solve a simple problem like classification or clustering and see what happens. The datasets on UCI are grouped by problem type, so it’s quite easy to orient yourself. Don’t forget to split the dataset into a training set and a test set, and then also split the test set into a ‘public’ and a ‘private’ set, because that is how submissions on Kaggle are scored. For more information about cool ML datasets, you can also explore our blog.
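The split described above can be sketched in a few lines of plain Python. The function name three_way_split, the 25% test fraction, and the 50/50 public/private division are assumptions chosen for illustration; actual competitions pick their own proportions.

```python
import random


def three_way_split(rows, test_fraction=0.25, public_fraction=0.5, seed=0):
    """Shuffle rows, then split them into train, public test and private test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    n_public = int(len(test) * public_fraction)
    return train, test[:n_public], test[n_public:]


# Stand-in for a real dataset: 100 row indices.
rows = list(range(100))
train, public_test, private_test = three_way_split(rows)
print(len(train), len(public_test), len(private_test))  # 75 12 13
```

Train on the first part, tune your model using only the public part, and look at the private part once at the very end. This mimics a final ranking computed on data you never got feedback on.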
In the Notebooks section on Kaggle or on GitHub you will probably find a solution to any simple problem that you’re trying to solve, maybe even using the same dataset that you use. Hundreds of people before you were also trying to learn ML. Feel free to use their notebooks for inspiration so you can get good at interpreting the results.
Practice on the simple Kaggle contests that we have mentioned. Don’t start with the ones that offer hundreds of dollars as a prize. Use the notebooks and tutorials published by other participants to gradually grow your skills while also familiarizing yourself with the particularities of the platform. Once your submissions start ranking high (top 10%-25%) on the leaderboard, you can start thinking of Featured or Research competitions.
What to do if I don’t have amazing hardware to use for Kaggle competitions?
Powerful hardware that doesn’t struggle with computations simplifies the work of a data scientist a lot, but it’s not always possible to upgrade your hardware.
First of all, you can run Kaggle Kernels with GPU to speed up the training of deep learning models. Learn how to set it up. A great advantage of this option is that it is free. However, judging by some comments, this option does not always work. So, here are some other options.
If you need more computing power, you can upgrade to Google Cloud AI Notebook directly in the Notebook. You can explore some options for free and also get a $300 credit. But when working with large amounts of data (for example, training a deep learning model), you will need to upgrade soon, and it’s quite costly. That is why some prefer to set up servers elsewhere. In that case, you will have to set up the environment on your own computer, and one of the most popular choices is Anaconda. If you decide to go for this option, use the tutorial by Faizan Ahemad on how to install it.
Another popular option is to rent computing power from AWS, Microsoft Azure, Digital Ocean, or something similar. Platforms that are recommended by many specifically for deep learning are FloydHub and Crestle.ai.
Now you know everything you need to make your first submission on Kaggle. The only thing left is to start.
If you are looking for more information about machine learning, feel free to explore our blog. Write to us on Twitter or use the form below if you have any questions or suggestions about what to cover next. Good luck, and may the odds be ever in your favor!