Syngenta Crop Challenge in Analytics

The population of Earth is growing by 200,000 people per day and is projected to reach 9.8 billion in 2050, yet our world is running out of cropland — land needed to produce food. And we’re currently using our arable land and water 50 percent faster than the planet can sustain. At the same time, the crops farmers plant face an unprecedented set of obstacles due to increasingly limited growing conditions and climate change.

How will we be able to grow enough food to meet world demand?

Today, the agriculture industry works to maximize the amount of food we gain from crops by breeding plants with the strongest, highest-yielding genetics. Scientists at research and development organizations like Syngenta create more resilient plants by breeding and then selecting the best offspring, over time, to provide to farmers.

We’ve proven that data-driven strategies can help our industry breed better seeds that require fewer resources and are adaptable to more diverse environments. Developing models that identify robust patterns in our experimental data may help scientists more accurately choose seeds that increase the productivity of the crops we plant – and will help address the growing global food demand.

How can we use data to accomplish this?

The performance of a plant is determined by three major factors: its genes, the environment in which it is grown, and the interaction between the two (GxE).

These three factors are explained below.

Genes are the building blocks of all living things. The genes present in a plant affect its productivity, influence how tall or short it is, and may protect the plant from a particular disease.

In addition to genes, a plant’s health and productivity are directly affected by the environment (weather and soil) in which it is grown. Plants need water and sunlight. However, too much rain can cause disease or flooding, and too much heat, especially in the absence of rainfall, can decrease productivity. The type of soil also has an effect on a plant. For example, a plant grown in soil that holds more water than average will better withstand an extended period of low rainfall. By characterizing the environments in which plants are grown, we can better understand how plants react to different environments. Scientists do this by precisely measuring the weather and soil in all growing locations.

A particular plant is adapted to grow best in a particular region due to many factors, including the length of the growing season (determined roughly by the time between the last frost in the spring and the first frost in the fall), expected rainfall, temperature, solar radiation, soil types and others. Some plants may tolerate drought better than others. Some plants may prefer a soil that is sandy, while others prefer clay. This is what is called a genetic by environment (GxE) interaction. The environment activates certain genes that allow the plant to thrive (or not) in that particular environment.

Plant breeders work to develop high-yielding plants for growers across a wide range of environments. Not all environments are productive growing environments; however, scientists are working to better understand GxE and to breed plants that can perform in highly stressed environments. Success could yield crops that make marginal cropland more productive, potentially reducing hunger in arid regions of the world.


Corn is one of the world’s most important crops. Each year, breeders create several new corn products, known as experimental hybrids. Corn breeders work to create hybrids that can maintain high yield across a wide range of environments. Historically, the best hybrids have been identified by trial and error: breeders test their experimental hybrids in a diverse set of locations and measure their performance to select the highest-yielding ones. This process can take many years. Corn breeders would benefit from accurate models that can predict performance across a range of environmental scenarios.

One way of modeling corn yield is that any particular hybrid (experimental cross of corn varieties) has a maximum yield potential, which then decreases depending on the environment in which it is grown. Every environment will have certain characteristics, or limiting factors, that are suboptimal for any hybrid, causing the actual yield to be less than the yield potential.
Can environmental data be aggregated into useful metrics representing the stresses encountered by corn throughout a growing season? Can these metrics be used to discriminate between hybrids that are tolerant of the stresses they represent and hybrids that are susceptible to them?

Some potential environmental stresses that can have a negative effect on yield are poor weather (heat, drought, cold, etc.), soil lacking nutrients, insect damage or pathogens. The degree of each stress, and how resistant a particular hybrid is to the stresses encountered, will determine how much the yield is reduced. In addition, certain stresses, when faced at the same time, can have a stronger impact than the sum of their individual effects.

A strong understanding of how a hybrid reacts when facing certain stresses (and combined stresses) could be a powerful tool for developing hybrids for regions that are less hospitable for corn, allowing farmers the potential to productively grow corn where currently it is challenging. Furthermore, individual farmers benefit from having access to this type of information because they can better manage risk across their acres.

Using feature engineering on environmental data (daily weather, soil, plant/harvest dates, any other available data), develop metrics representing the amount of stress that corn would face in any particular environment across a growing season. The objective is to individually model heat stress, drought stress, and stress due to the combination of heat and drought. Each stress depends on the weather at each location, but its impact can also vary with soil type and with when the stress occurs during the growing season. These stresses are not the only factors affecting yield but, typically, the higher the stress, the lower the yield.
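
As a minimal sketch of what such feature engineering might look like, the Python below aggregates daily weather between planting and harvest into three numbers for one environment. It uses only the DAY_NUM, TMAX and VP variables from the weather dataset. The 30 °C and 2 kPa thresholds, the units (TMAX in °C, VP in Pa), the use of vapor pressure deficit as a proxy for dryness, and the function names are all assumptions made for illustration; none of them is part of the challenge definition.

```python
import math


def saturation_vp_kpa(temp_c):
    """Tetens approximation of saturation vapor pressure (kPa) at temp_c (deg C)."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def season_stress(weather_rows, plant_day, harvest_day,
                  heat_threshold_c=30.0, vpd_threshold_kpa=2.0):
    """Aggregate daily weather for one environment into three stress metrics.

    weather_rows: iterable of dicts with numeric DAY_NUM, TMAX and VP values.
    Units (deg C for TMAX, Pa for VP) and both thresholds are assumptions.
    """
    heat = 0.0      # degree-days above the heat threshold
    drought = 0.0   # accumulated vapor pressure deficit above its threshold
    combined = 0    # number of days on which both thresholds are exceeded
    for row in weather_rows:
        if not plant_day <= row["DAY_NUM"] <= harvest_day:
            continue  # day falls outside the growing season
        hot = max(0.0, row["TMAX"] - heat_threshold_c)
        vpd = saturation_vp_kpa(row["TMAX"]) - row["VP"] / 1000.0
        dry = max(0.0, vpd - vpd_threshold_kpa)
        heat += hot
        drought += dry
        if hot > 0 and dry > 0:
            combined += 1
    return {"heat": heat, "drought": drought, "combined": combined}
```

A fuller drought metric would likely also account for the soil's available water capacity (AWC), irrigation status, and the timing of the stress within the season.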

A sub-analysis that can be done at this step is measuring the impact of the interaction of heat stress and drought stress. Can the yield loss due to these stresses be explained by the individual contributions of heat and drought stress, or does the interaction of the two stresses significantly contribute to yield loss?
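
One hedged way to approach this sub-analysis is to fit a regression with an interaction term, yield = b0 + b1*heat + b2*drought + b3*heat*drought, and inspect b3. The dependency-free sketch below solves the least-squares normal equations directly. The function names are hypothetical, and the inputs are assumed to be per-observation heat and drought stress values paired with observed yields.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_interaction(heat, drought, yields):
    """Least-squares fit of yield = b0 + b1*heat + b2*drought + b3*heat*drought.

    Returns [b0, b1, b2, b3]. A clearly negative b3 would suggest that combined
    heat and drought cost more yield than their separate effects add up to.
    """
    rows = [[1.0, h, d, h * d] for h, d in zip(heat, drought)]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, yields)) for i in range(k)]
    return solve(xtx, xty)
```

This only estimates the coefficients. Judging whether the interaction contributes significantly would additionally require standard errors or a model comparison, which a statistics package can provide.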

Using the stress metrics developed in Objective #1, classify hybrids as either tolerant or susceptible to each type of stress using the hybrid’s yield across different environments. One possible way of doing this is to run a linear regression of yield against each stress and classify hybrids based on the slope of the regression line. You are encouraged to use more complex or non-linear models in order to build a better classifier.
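
The slope-based approach described above can be sketched as follows. The input format, the minimum number of observations per hybrid, and the choice of the median slope as the cutoff between tolerant and susceptible are assumptions for illustration only; the challenge does not prescribe them.

```python
import statistics
from collections import defaultdict


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def classify_hybrids(observations, min_obs=3):
    """Label each hybrid by how steeply its yield falls as stress rises.

    observations: iterable of (hybrid_id, stress_value, yield_value) tuples.
    Hybrids with too few observations or no variation in stress are skipped.
    """
    by_hybrid = defaultdict(list)
    for hybrid_id, stress, yld in observations:
        by_hybrid[hybrid_id].append((stress, yld))

    slopes = {}
    for hybrid_id, points in by_hybrid.items():
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        if len(points) >= min_obs and len(set(xs)) > 1:
            slopes[hybrid_id] = slope(xs, ys)
    if not slopes:
        return {}

    # Assumed cutoff: hybrids at or above the median slope lose less yield
    # per unit of stress than the typical hybrid, so they are called tolerant.
    cutoff = statistics.median(slopes.values())
    return {h: ("tolerant" if s >= cutoff else "susceptible")
            for h, s in slopes.items()}
```

The same function can be run once per stress metric (heat, drought, combined) to produce the three classifications.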




Submissions must be in MS-Word or LaTeX format using the appropriate submission template. You can download the submission template here (.zip).
Entries should include:
  • Definition and interpretation of stress metrics (heat, drought, combined heat and drought)
  • Classifications of stress tolerance (heat, drought, combined heat and drought) for all hybrids
Additionally, following the standards for academic publication, entries should include:
  • Quantitative results to justify your modeling and classification techniques
  • A clear description of the methodology and theory used
  • References or citations as appropriate


The entries will be evaluated based on:


You are provided with the following training datasets to create stress models and classify hybrids.
  1. Performance Dataset: This dataset contains the observed yields from the tests (trials) of hybrids. Each row represents one observation for one hybrid at a given location and year. Performance data for 2,452 hybrids at 1,560 locations are provided for 2008 through 2017. In addition, plant date, harvest date, and irrigation status are included for each observation, along with information about the location such as average yield and soil properties (sourced from ISRIC). The performance dataset must be joined to the weather dataset on ENV_ID (a unique identifier combining latitude, longitude and year). (performance_data.csv)

  2. Weather Dataset: This dataset (sourced from Daymet) contains the recorded weather for each environment in which any hybrids were tested. Across the growing region, differences in weather conditions and soil types will cause variation in a hybrid’s observed performance, as well as a difference in the observed average yield of all hybrids tested in a location. Weather data is included in daily increments, labeled by the day number within the year (e.g. January 1 is day 1, December 31 is day 365 in non-leap years). This dataset needs to be aligned with the ‘performance dataset’ by ENV_ID (which is a unique identifier combining latitude, longitude and year). (weather_data.csv)
  3. Key for Datasets: This table provides the meaning of each variable in the two datasets.

Performance Dataset
  HYBRID_ID       ID for each hybrid in dataset
  ENV_ID          ID for each environment in dataset
  HYBRID_MG       Maturity group of hybrid – a higher number indicates a longer growing season needed to reach maturity
  ENV_MG          Typical maturity group of environment – a higher number indicates a longer growing season with more growing degree days; this can vary due to weather in any given year
  YIELD           Yield of hybrid in environment
  YEAR            Year grown
  PLANT_DATE      Plant date for this observation
  HARVEST_DATE    Harvest date for this observation
  IRRIGATION      Whether field was irrigated:
                    NULL – unknown irrigation
                    NONE or DRY – no irrigation
                    ECO – very light irrigation
                    LIRR – light irrigation
                    IRR – normal irrigation
  ENV_YIELD_STD   Standard deviation of yield for ENV_ID
  ELEVATION       Elevation of field
  CLAY            % of clay in soil
  SILT            % of silt in soil
  SAND            % of sand in soil
  AWC             Available water capacity in soil
  PH              pH of soil
  OM              Organic matter in soil
  CEC             Cation exchange capacity of soil
  KSAT            Saturated hydraulic conductivity of soil

Weather Dataset
  ENV_ID          ID for each environment in dataset
  DAY_NUM         Day number within year of weather variables
  DAYL            Day length
  SRAD            Solar radiation
  SWE             Snow water equivalent
  TMAX            Maximum temperature
  TMIN            Minimum temperature
  VP              Vapor pressure
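
A minimal sketch of aligning the two datasets by ENV_ID in Python, using only the standard library: weather rows are grouped by ENV_ID, then each performance row picks out the weather days between its plant and harvest dates. It assumes PLANT_DATE and HARVEST_DATE are ISO-formatted (YYYY-MM-DD), which the key does not state, and all helper names are hypothetical.

```python
import csv
from collections import defaultdict
from datetime import date


def read_rows(path):
    """Read a CSV file (e.g. weather_data.csv) into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def day_of_year(iso_date):
    """Convert 'YYYY-MM-DD' (assumed format) to a day number; January 1 is day 1."""
    return date.fromisoformat(iso_date).timetuple().tm_yday


def group_by_env(weather_rows):
    """Index weather rows by ENV_ID so each environment is looked up once."""
    by_env = defaultdict(list)
    for row in weather_rows:
        by_env[row["ENV_ID"]].append(row)
    return by_env


def season_weather(perf_row, weather_by_env):
    """Return the weather rows for one observation's growing season."""
    start = day_of_year(perf_row["PLANT_DATE"])
    end = day_of_year(perf_row["HARVEST_DATE"])
    return [w for w in weather_by_env.get(perf_row["ENV_ID"], [])
            if start <= int(w["DAY_NUM"]) <= end]
```

Grouping the weather once and reusing it avoids rescanning the full weather file for every performance row, since many hybrids share the same environment.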


JAN 18, 2019
Deadline for Submissions

MARCH, 2019
Finalists Announced

APRIL 14-16, 2019
Finalist Presentations and Winner Announcement







Two Q&A webinars, on October 11th and December 6th, will be open to all participants. Archives will be available to view here, and the distilled results will be added to the FAQ.

