FREQUENTLY ASKED QUESTIONS
Please check back frequently for new answers.
How do I submit my solution?
Submissions must be in MS-Word or LaTeX format using the appropriate submission template. You can download the submission template here (.zip).
Once your solution is completed, you can submit it on the submission page in the participants area.
Can several people enter as a team?
Yes, a team may participate. The only requirement is that each person on the team must register and click the "Download data" link in the participant's dashboard to sign the NDA. Please make note of all the team members on your submission.
I live outside the United States. Since the event will be held in the USA, will I be able to participate?
Yes, you are eligible to compete in the contest. The event will be held in the United States, but if you are selected as a finalist you will have the option to present via videoconference.
Am I allowed to publish my entry in a journal?
Technically, the work may be published. However, the data may not be published or publicized in any way; it is protected by the non-disclosure agreement that you sign when you download it.
May I use the Syngenta Crop Challenge in Analytics data for educational purposes?
No. The data may be used only for the purposes of the contest, not for educational or any other purposes.
I work for a large agricultural company. Can I compete?
Employees or people who are associated with large agricultural companies are not eligible to compete. Please contact us if you have any questions about whether or not you are allowed to participate.
What is the meaning of the data in the Genetic Dataset?
Each data point in the Genetic Dataset represents a pair of nucleotides (A, C, G, or T). The row defines the hybrid and the column defines a specific position within a chromosome (marker position). The reason it is a pair, and not a single nucleotide, is that each hybrid has two of each chromosome, one from each parent plant that produced it.
For each marker position, there are typically 2 possible values that can appear for each chromosome, which results in 3 possible values in total. For example, one marker position could contain the nucleotides A or C. Thus, the pair of chromosomes could be AA, CC, or AC (treating CA as identical to AC). In the Genetic Dataset, we map AA to 1, AC to 0, and CC to -1 (any hybrid that was not tested in a particular marker position has a value of NA). After this conversion, mathematical operations can be applied in order to extract value from the data, such as defining similarity between pairs of hybrids.
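The mapping described above can be sketched in a few lines of Python. This is only an illustration: the function name `encode_genotype` and its `first`/`second` allele arguments are assumptions made for the example, and the published Genetic Dataset already contains the converted values.

```python
from typing import Optional


def encode_genotype(pair: Optional[str], first: str, second: str) -> Optional[int]:
    """Encode a nucleotide pair for a marker whose two alleles are `first` and `second`.

    Mirrors the FAQ example (alleles A and C): AA -> 1, AC or CA -> 0, CC -> -1.
    An untested marker ("NA" or None) is returned as None.
    """
    if pair is None or pair == "NA":
        return None
    alleles = sorted(pair.upper())  # sorting makes "CA" equal to "AC"
    if alleles == [first, first]:
        return 1
    if alleles == [second, second]:
        return -1
    if alleles == sorted([first, second]):
        return 0
    raise ValueError(f"unexpected genotype {pair!r} for alleles {first}/{second}")


# Example with the alleles A and C used in the text.
codes = [encode_genotype(p, "A", "C") for p in ["AA", "AC", "CA", "CC", "NA"]]
```

Here `codes` holds one value per input pair, with `None` standing in for NA.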
Note that comparing the values in this dataset across a single hybrid is not useful. Having a value of 1 in multiple columns does not mean the true nucleotide is the same, only that it is homozygous (same nucleotide in both chromosomes) in both marker positions.
Another point to keep in mind is that not all markers are equally important. In fact, finding the relevant subset of markers that interact with the environment in positive or negative ways is one of the most important pieces of this challenge.
What meaning does the Hybrid Name have?
The name of the hybrid refers to the parents of that particular hybrid. For example, ‘P1234:P4321’ means that one of the hybrid’s parents is P1234, and the other is P4321. The order is not important, so another hybrid named ‘P1111:P1234’ would share one parent with the hybrid above. Further reading about ‘heterosis’ and ‘heterotic pools’ may provide some insight on how this parentage data could potentially be used.
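As a rough sketch of how the naming convention could be used, the Python below splits a hybrid name into an unordered set of parents and checks two hybrids for a shared parent. The function names are hypothetical and chosen only for this example.

```python
def parents(hybrid_name: str) -> frozenset:
    """Split a name such as 'P1234:P4321' into an unordered set of parent IDs."""
    return frozenset(part.strip() for part in hybrid_name.split(":"))


def shared_parents(name_a: str, name_b: str) -> frozenset:
    """Return the parent IDs that two hybrids have in common (possibly empty)."""
    return parents(name_a) & parents(name_b)


# The two hybrids from the text share the parent P1234.
common = shared_parents("P1234:P4321", "P1111:P1234")
```

Using a set rather than a tuple reflects the fact that the order of the parents in the name is not important.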
2018 Crop Challenge Overview Webinar Q&A
Q: Why do we have so much missing data in the genetic data set? What should be done with “NA” in the model?
A: We use genetic markers to detect genes, and the markers do not cover every position in the chromosome. The markers that are collected change from year to year as costs decrease, so hybrids do not all have the same set of markers.
Q: Why is 2016 excluded from training? Looks like we try to make a predictor for 2017 with data only until 2015. Right?
A: We do have 2016 data in the training dataset. When you use 2016 as a validation dataset, you should use weather (and performance) information from 2001 to 2015. When you make the final prediction for 2017, you should use the information from 2001 to 2016. The soil information stays consistent across years.
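The year ranges in this answer follow one rule: use only the years before the year being predicted. A minimal Python sketch, where `training_years` is a hypothetical helper and 2001 is taken as the first available year:

```python
def training_years(target_year: int, first_year: int = 2001) -> range:
    """Years whose weather and performance data may be used to predict `target_year`."""
    if target_year <= first_year:
        raise ValueError("target_year must be later than first_year")
    return range(first_year, target_year)  # the target year itself is excluded


validation_years = training_years(2016)  # 2001 through 2015
final_years = training_years(2017)       # 2001 through 2016
```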
Q: Why do we need latitude, longitude, and location ID?
A: Location ID should be used to join the data. Latitude and Longitude can be useful predictors in your model, though, so we included them separately. It shouldn't matter whether you use them from the performance dataset or environment dataset, but they may be very slightly different.
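One way to picture the join is the Python sketch below. The column names (`location_id`, `latitude`, `yield_diff`, `soil`) are placeholders invented for the example and may not match the actual files. Where both datasets supply latitude or longitude, this sketch keeps the value from the performance row, which is an arbitrary choice.

```python
def join_on_location(performance_rows, environment_rows, key="location_id"):
    """Attach environment fields to each performance row that has a matching location ID."""
    env_by_id = {row[key]: row for row in environment_rows}
    joined = []
    for perf in performance_rows:
        env = env_by_id.get(perf[key])
        if env is None:
            continue  # no environment record for this location
        # Performance values overwrite environment values for duplicate column names.
        joined.append({**env, **perf})
    return joined


performance = [
    {"location_id": 7, "latitude": 41.50, "yield_diff": 3.2},
    {"location_id": 9, "latitude": 39.10, "yield_diff": -1.0},
]
environment = [{"location_id": 7, "latitude": 41.51, "soil": "loam"}]
rows = join_on_location(performance, environment)
```

Joining on coordinates instead would be fragile, because the two datasets may differ very slightly in their latitude and longitude values.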
Q: What does maturity group data mean?
A: It is a variable that indicates how long a crop takes to mature. It is correlated to some degree with the latitude and longitude of the growing locations.
Q: For the longitude data, what do “negative number” and “positive number” mean?
A: Negative numbers are used for the western hemisphere, and positive numbers are used for the eastern hemisphere.
Q: What does check yield mean? Why do we need this variable?
A: It is included to provide some clarity on how the yield difference is calculated. I wouldn't expect it to be very useful for building your model.
Q: The prediction needs to be the yield or the check yield?
A: The prediction shall be the yield difference against the check. We do not want to predict the yield or the check yield directly.
Q: Should there be a relationship between maturity group data and yield?
A: There could potentially be a relationship there. Hybrids that take longer to mature have more time to collect energy from sunlight and nutrients from soil, so they may have higher yields. However, yield difference probably wouldn't show much of a relationship, as the checks are likely from the same maturity group.
Q: Does the prediction need to be the forecasted yield or the forecasted yield difference?
A: Yield difference.
Q: Should we ignore the weather data from 2001 to 2007?
A: It depends on how you forecast the weather for 2017. It can be useful for predicting the weather in 2017.
Q: Can we have access to last year's data set? (2017 Crop Challenge)
A: I would say the answer to that is no. I don't think it would be all that useful, as the 2017 challenge used soybean data.
Q: For a certain hybrid do we know which check hybrid it is compared to? From a previous answer it was implied that it is not always the same.
A: Yield difference is calculated against the mean of the top 3 checks in the same growing field, and the top 3 checks are not always the same.
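The calculation might look like the Python sketch below. It rests on two assumptions that the answer does not state: that "top 3" means the three highest-yielding checks in the field, and that the difference is the hybrid's yield minus the check mean. The function name is hypothetical.

```python
def yield_difference(hybrid_yield: float, check_yields, top_n: int = 3) -> float:
    """Hybrid yield minus the mean of the `top_n` highest check yields in the same field."""
    top = sorted(check_yields, reverse=True)[:top_n]  # assumption: top means highest yield
    if not top:
        raise ValueError("at least one check yield is required")
    return hybrid_yield - sum(top) / len(top)


# Checks of 105, 100, and 95 average 100, so a hybrid yielding 110 scores +10.
diff = yield_difference(110.0, [90.0, 100.0, 105.0, 95.0])
```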
Q: For the genetic data, I can see the same gene marker exists in some hybrids but is missing in most of the hybrids. Can we ignore this gene marker? How should we deal with these data?
A: Dealing with the missing data will likely be challenging. If only a very small percentage of hybrids have a particular marker, ignoring it may be the best choice. One way to use the genetic data is to determine how related two hybrids are; for that, you will likely be able to use only the markers that are defined for both hybrids.
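As one possible illustration of comparing hybrids on shared markers only, the Python sketch below skips any position where either hybrid is NA (represented here as `None`). The similarity measure, one minus half the mean absolute difference of the 1/0/-1 codes, is an assumption chosen for simplicity, not a method prescribed by the challenge.

```python
from typing import Optional, Sequence


def marker_similarity(
    a: Sequence[Optional[int]],
    b: Sequence[Optional[int]],
    min_shared: int = 1,
) -> Optional[float]:
    """Similarity in [0, 1] over markers that are defined for both hybrids.

    Returns None when fewer than `min_shared` markers are available in both.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(shared) < min_shared:
        return None
    # Codes differ by at most 2, so dividing by 2 scales the mean difference to [0, 1].
    return 1.0 - sum(abs(x - y) for x, y in shared) / (2 * len(shared))


hybrid_a = [1, 0, None, -1]
hybrid_b = [1, None, 0, 1]
score = marker_similarity(hybrid_a, hybrid_b)  # only positions 0 and 3 are compared
```

Raising `min_shared` gives a simple way to avoid reporting a similarity that is based on very few markers.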