Syngenta Crop Challenge in Analytics

The population of Earth is growing by 200,000 people per day yet our world is running out of cropland - land needed to produce food. We’ll add 2 billion more people by the year 2050, but we’re currently using our arable land and water 50 percent faster than the planet can sustain. At the same time, the crops farmers plant face an unprecedented set of obstacles due to increasingly limited growing conditions and climate change.

How will we be able to grow enough food to meet world demand?

Today, the agriculture industry works to optimize the amount of food we gain from plants by breeding plants with the strongest, highest-yielding genetics. Scientists at R&D organizations like Syngenta create stronger plants by breeding and then selecting the best offspring over time to provide to farmers.

We believe data-driven strategies can help our industry breed better seeds, faster. Developing models that identify robust patterns in our experimental data may help scientists more accurately choose seeds that increase the productivity of the crops we plant – and help address the growing global food demand.

How can we use data to address the growing global food demand?


Corn is one of the world’s most important crops. Each year, the breeding process creates several new corn products known as experimental hybrids. Corn breeders work to create the best combination of genes that results in high-yielding corn hybrids. To find the best performers, these experimental hybrids are “tested” by planting them in a diverse set of locations and measuring their performance. Testing takes place across years, with the goal of advancing the best hybrids each year to test again. This is done to account for the GxE (genotype-by-environment) interactions described here.

However, there is a limit to the number of locations in which breeders can test. Having a limited number of observations creates uncertainty when choosing the best hybrids for growers to plant. If corn breeders could accurately predict the performance of each individual hybrid in untested environments, they could make better decisions on which hybrids to move forward and provide to growers, increasing productivity to meet the world’s growing demands.

Based on historic observations and genetic composition, how will a corn hybrid perform in new locations with varying environmental conditions?

Your goal is to develop a quantitative framework for predicting hybrid performance in new, untested locations. You will first need to build and validate predictive models using observations from previous years. You will then use these models to predict the performance of a set of hybrids tested in 2017, without precisely knowing the environmental conditions.



Submissions must be in MS-Word or LaTeX format using the appropriate submission template. You can download the submission template here (.zip).
Entries should provide predictions of performance for 2017 test hybrids at given locations in the test dataset.

Additionally, following the standards for academic publication, entries should provide:
  • Quantitative results to justify your predictive modeling process
  • A clear description of the methodology and theory used
  • References or citations as appropriate


The entries will be evaluated based on:

The quality of the proposed solution will be assessed by how well predicted performance aligns with observed hybrid performance for 2017. This is discussed further in the Model Validation section.

Additional criteria that will be considered are:


You are provided with the following training, validation and test datasets to create performance (yield) predictions.
  1. Training data: The training dataset includes all the current knowledge of experimental hybrids, including checks. There are 2,267 hybrids tested across 2,122 locations between 2008 and 2016. The ‘Training data’ contains three separate datasets: the performance dataset (year, hybrid name and yield difference from check), the environment dataset (latitude, longitude, weather information, soil information) and the genetic dataset (genetic markers).

    1. Performance Dataset: This dataset contains the observed yields from the tests (trials) of experimental hybrids. Each row represents one observation for one hybrid at a given location and year. Performance data of various hybrids in development is provided from 2008 to 2016. This data represents our current knowledge of how experimental hybrids have performed. The ‘performance dataset’ needs to be aligned with ‘genetic dataset’ by hybrid name and ‘environment dataset’ by latitude, longitude and year. (Training_performance_dataset.csv)
    2. Environment Dataset: This dataset contains the recorded weather and soil conditions for our selected growing region. Each latitude and longitude combination is represented by a unique Location ID. Across the growing region, differences in weather conditions and soil types will cause variation in a hybrid’s observed performance. For example, a hybrid with exceptional performance in Southern Illinois may be a very poor choice for a grower in Minnesota. This dataset needs to be aligned with the ‘performance dataset’ by latitude, longitude and year. (Training_weather_dataset.csv and Training_soil_dataset.csv)
    3. Genetic Dataset: This dataset provides genetic information for the experimental hybrids. Genetic information can be useful for predicting how a hybrid will respond in various environmental conditions. There are nearly 19,500 unique genetic markers provided in this dataset. The average number of markers assembled per hybrid is approximately 12,000. Part of your challenge is to determine which genetic markers are useful. Note that not all experimental hybrids have the same amount of genotypic information available. This dataset needs to be aligned with the ‘performance dataset’ by hybrid name. (Training_genetic_dataset.csv)
  2. Validation dataset: The validation dataset includes all the experimental hybrids for 2016. It contains performance information including latitude and longitude, environmental information including weather for previous years and soil conditions for test locations, and the genetic marker information for the hybrids. In order to properly use the validation dataset, 2016 data should not be included in the training dataset used to build your predictive model.
  3. Test dataset: The test dataset includes the experimental hybrids tested in 2017. This set will include location information (latitude and longitude), environmental information (weather for previous years and soil conditions) for test locations, and the genetic marker information of the hybrids.
  4. Key for datasets: the tables provide the meaning of each variable in the datasets.
The data structures for the training, validation and test datasets are as follows:

Training dataset
  • Performance Dataset: Hybrid name; Yield difference; Location (Latitude and Longitude)
  • Genetic Dataset: Hybrid name; Genetic Markers
  • Environment Dataset: Location (Latitude and Longitude); Weather Variables; Soil Variables

Validation dataset (for prediction of 2016 hybrids)
  • Performance Dataset: Hybrid name; Yield difference (need to predict); Location (Latitude and Longitude); Year = 2016
  • Genetic Dataset: Hybrid name; Genetic Markers
  • Environment Dataset: Location (Latitude and Longitude); Weather Variables (Years = 2001 - 2015); Soil Variables

Test dataset (prediction of 2017 hybrids for final submission)
  • Performance Dataset: Hybrid name; Yield difference (need to predict); Location (Latitude and Longitude); Year = 2017
  • Genetic Dataset: Hybrid name; Genetic Markers
  • Environment Dataset: Location (Latitude and Longitude); Weather Variables (Years = 2001 - 2016); Soil Variables
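
The alignment between the three datasets can be pictured as a set of key lookups: hybrid name for genetic markers, location and year for weather, and location for soil. The sketch below is only an illustration in plain Python. The field names (`hybrid`, `lat`, `lon`, `year`, `yield_diff`, `m1`, `precip`, `clay`) and all values are hypothetical and made up for the example; the actual column names are given in the key for datasets.

```python
# Hypothetical toy rows; real column names and values come from the challenge files.
performance = [
    {"hybrid": "H1", "lat": 40.1, "lon": -88.2, "year": 2015, "yield_diff": 3.2},
    {"hybrid": "H2", "lat": 44.9, "lon": -93.1, "year": 2015, "yield_diff": -1.5},
]
genetic = {"H1": {"m1": 1, "m2": 0}, "H2": {"m1": 0, "m2": 1}}
weather = {
    (40.1, -88.2, 2015): {"precip": 510.0},
    (44.9, -93.1, 2015): {"precip": 430.0},
}
soil = {(40.1, -88.2): {"clay": 0.27}, (44.9, -93.1): {"clay": 0.19}}


def align(performance, genetic, weather, soil):
    """Join each performance row to its markers, weather and soil records."""
    rows = []
    for obs in performance:
        loc = (obs["lat"], obs["lon"])
        row = dict(obs)
        # A missing key contributes nothing, so unmatched rows are still kept.
        row.update(genetic.get(obs["hybrid"], {}))
        row.update(weather.get(loc + (obs["year"],), {}))
        row.update(soil.get(loc, {}))
        rows.append(row)
    return rows


aligned = align(performance, genetic, weather, soil)
```

Matching on raw floating-point coordinates is fragile, so keying on the Location ID instead of latitude and longitude is likely the safer choice with the real files.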


The goal of this challenge is to create a model that predicts the general performance of a hybrid under uncertain conditions, as a farmer attempts to do each year. To align with this goal, the validation procedure will assess your model’s predictive ability across a set of hybrids tested in a single year, with unknown weather conditions.

Important: The uncertainty of the weather is part of the challenge, so there should be no attempt to collect data about the precise weather conditions experienced in 2017. Your prediction should rely on historical weather to create an “expectation” of the weather in 2017.

The validation dataset is included to simulate this procedure. It consists of hybrids tested in 2016. You should not use 2016 weather data to make your validation predictions, but you can include soil variables. Remember, you will not have weather data for 2017, so using 2016 weather in the validation process would produce results that look better than what you can actually achieve in 2017, when the precise weather conditions are unknown.
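
One simple way to form a weather “expectation” for a held-out year is to average each location's historical weather over the years before it. The sketch below illustrates that idea only; it is not the organizers' method, and other summaries (recent-year weighting, full distributions) may work better. The record layout `(location_id, year, value)` and all numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (location_id, year, precipitation). Values are made up.
weather = [
    ("loc1", 2013, 500.0),
    ("loc1", 2014, 520.0),
    ("loc1", 2015, 480.0),
    ("loc1", 2016, 900.0),  # held-out year: must not influence the expectation
    ("loc2", 2014, 400.0),
    ("loc2", 2015, 420.0),
]


def expected_weather(records, target_year):
    """Average each location's weather over years strictly before target_year."""
    by_location = defaultdict(list)
    for location, year, value in records:
        if year < target_year:
            by_location[location].append(value)
    return {location: mean(values) for location, values in by_location.items()}


# Validation: expectation for 2016 uses only weather from years before 2016.
expectation_2016 = expected_weather(weather, target_year=2016)
```

Calling the same function with `target_year=2017` mirrors the final test setting, where 2016 weather becomes part of the history.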


You will be evaluated by your model’s Root Mean Squared Error (RMSE) when predicting hybrid performance for 2017.

Let $y_i$ denote the yield difference with check of observation $i$, and $\hat{y}_i$ the predicted yield difference with check of observation $i$. Over $n$ observations, the Root Mean Squared Error is:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
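
For checking validation scores locally, RMSE can be computed in a few lines. This is a minimal sketch using only the Python standard library; the function name `rmse` and the sample values are illustrative assumptions, not part of the challenge materials.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted yield differences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: only the second prediction is off, by 2 units.
score = rmse([1.0, -2.0, 3.0], [1.0, 0.0, 3.0])
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.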


JAN 11, 2018
Deadline for Submissions

MARCH 1, 2018
Finalists Announced

APRIL 15-17, 2018
Finalist Presentations and Winner Announcement






