Q: The test dataset does not contain loc information, but we observe yield difference across location in training data. Any specific reason for not having loc info in test data?
A: Because we want to predict/impute the mean for the combination in the test dataset, which could be grown in any location.
Q: We have location and year in training set, but not in test set. What’s location and year in test set?
A: We are not requesting performance in specific locations and years, but an overall average.
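Since the target is a per-combination mean over locations and years, one simple baseline is to average the training yields for each inbred x tester pair. Below is a minimal sketch using pandas; the column names (`Inbred`, `Tester`, `Location`, `Year`, `Yield`) and the values are assumptions for illustration, not the actual dataset:

```python
import pandas as pd

# Hypothetical training data; column names and values are assumptions
# for illustration only, not the actual challenge dataset.
train = pd.DataFrame({
    "Inbred":   ["I1", "I1", "I2", "I2"],
    "Tester":   ["T1", "T1", "T1", "T2"],
    "Location": ["LocA", "LocB", "LocA", "LocB"],
    "Year":     [2017, 2018, 2017, 2018],
    "Yield":    [1.02, 0.98, 1.10, 0.95],
})

# Average over all locations and years for each inbred x tester combination
baseline = train.groupby(["Inbred", "Tester"], as_index=False)["Yield"].mean()
print(baseline)
```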
Q: Is there data for environment factors such as soil quality and radiation?
A: No, we want to remove that part and focus on the combination between tester and inbred.
Q: Is Inbred 740 the same as Tester 740?
A: No, they are not; they should be considered independent.
Q: …The higher the yield number, the better?
A: Yes, a higher number indicates better yield.
Q: What do the yield values 0.9872… and 1.1293… mean exactly?
A: The yield is scaled to the internal benchmark with the same standard. The values in the dataset represent the ratio of the hybrid’s performance to the benchmark in that environment.
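As an illustration of this scaling, a value such as 1.1293 is the ratio of the hybrid's performance to the benchmark in the same environment. The numbers below are made up; the actual benchmark is internal:

```python
# Made-up numbers showing how a scaled value arises;
# the organizers' benchmark is internal and not provided.
hybrid_yield_t_ha = 11.5     # hypothetical raw hybrid yield (t/ha)
benchmark_yield_t_ha = 10.0  # hypothetical benchmark yield, same environment

scaled = hybrid_yield_t_ha / benchmark_yield_t_ha
print(scaled)  # 1.15 -> the hybrid outperformed the benchmark by 15%
```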
Q: Is there any metric to indicate how similar the varieties within one cluster are to each other?
A: No, we do not expect to put too much emphasis on the genetics in this challenge. We focus more on the performance.
Q: Why test hybrids instead of the original inbred?
A: Hybrids produce more grain than inbreds (the terms heterosis or hybrid vigor could be used to search for more information online), and are what is typically grown in farmers’ fields. Therefore, we care more about the performance of the hybrids than the inbreds used to produce them.
Q: When it was affirmed that testers and inbreds are unique, does that mean that no testers were ever used as inbreds?
A: Some testers could serve as inbreds as well, but in this challenge there is no case of both tester A * inbred B and inbred A * tester B. Each hybrid should be considered a unique set of genetics.
Q: From a breeder point of view the cluster is the genetic background, so for you this value is a major interest, right?
A: All information is of major interest to our organization. We do not want to influence any analysis, so if you deem the genetic clusters to be of major interest then please feel free to utilize the information.
Q: Is the model's level of interpretability important for the business? Meaning, are you just interested in the yield by tester and inbred, or do you want to know why the model predicts the yield it predicted?
A: Being able to clearly explain your process is important, but interpretability is not especially critical for this challenge.
Q: What are tester and inbred clusters? I am not sure about clusters.
A: That’s based on internal analysis. You may or may not use that info. Basically, the inbreds or testers from the same cluster have some similarity in their genetic structure.
Q: What is the difference between Inbred and Inbred_Cluster?
A: Inbred Cluster refers to a clustering of the inbred genetics. Two inbreds with the same cluster can be treated as having more similar genetics than two inbreds from different clusters.
Q: The test records are the same year or no information about this?
A: We are looking for a mean performance across years and locations, so the test records are not for a specific year.
Q: Is the number of pairs of inbreds and testers the same in each cluster?
Q: Online it mentions there is geolocation data that can/should be used. How are the locations supposed to be mapped to actual locations?
A: If geolocation data is mentioned somewhere online, it is probably a mistake held over from previous challenges. Apologies for any confusion.
Q: The goal of the challenge is to help farmers identify the best crops. Why wouldn’t a ranking methodology be a good way of doing this? Why do you require us to predict the yield?
A: Predicting yield allows a ranking to be developed from those results, and also shows the gap in performance between any two individual hybrids.
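Deriving a ranking from predicted yields can be sketched as follows; the hybrid names and values are purely illustrative:

```python
# Purely illustrative hybrid names and predicted scaled yields.
predictions = {
    "InbredA/Tester1": 1.12,
    "InbredB/Tester1": 1.05,
    "InbredA/Tester2": 0.97,
}

# Sorting the predictions yields the ranking...
ranking = sorted(predictions, key=predictions.get, reverse=True)
print(ranking)  # best hybrid first

# ...and the predicted values also quantify the gap between any two hybrids.
gap = predictions["InbredA/Tester1"] - predictions["InbredB/Tester1"]
print(round(gap, 2))
```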
Q: If we are using a ML package, do we necessarily need to understand the interworkings of the algorithm and explain it in the report? Or does it suffice to mention which algorithm we used?
A: If the ML algorithms are common, such as trees or linear regressions, then a brief overview of the algorithm will suffice. However, if your algorithm is cutting-edge, then we would appreciate a summary of the methodology and possibly a link to a research paper that describes the algorithm.
Q: Is there a preference between using machine learning algorithms and statistical analysis techniques?
A: There are no preferences. There are many techniques that can be utilized such as collaborative filtering or Bayesian Inference.
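As one way to picture the collaborative-filtering idea, inbreds can be treated as "users" and testers as "items", and the observed scaled yields factorized into latent factors. The sketch below is a toy matrix factorization trained with SGD on made-up data and hyperparameters, not a recommended solution:

```python
import numpy as np

# Toy matrix-factorization sketch of collaborative filtering:
# inbreds play the role of "users", testers of "items".
# All data and hyperparameters here are made up for illustration.
rng = np.random.default_rng(0)

# (inbred_index, tester_index, scaled_yield) observations
obs = [(0, 0, 1.10), (0, 1, 0.95), (1, 0, 1.05), (1, 2, 1.00), (2, 1, 0.90)]
n_inbreds, n_testers, k = 3, 3, 2

P = rng.normal(scale=0.1, size=(n_inbreds, k))  # inbred latent factors
Q = rng.normal(scale=0.1, size=(n_testers, k))  # tester latent factors
mu = sum(y for _, _, y in obs) / len(obs)       # global mean yield

lr, reg = 0.05, 0.01
for _ in range(500):
    for i, j, y in obs:
        err = y - (mu + P[i] @ Q[j])
        p_old = P[i].copy()
        P[i] += lr * (err * Q[j] - reg * P[i])
        Q[j] += lr * (err * p_old - reg * Q[j])

# Predict an unobserved inbred x tester combination
pred = mu + P[2] @ Q[0]
print(round(float(pred), 3))
```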
Q: Are there any species that are an Inbred and a Tester?
A: Yes, but it is not relevant to this challenge. This challenge focuses on corn.
Q: You mentioned that the yield variances are also of interest. But officially they are not part of the challenge?
A: Apologies, that is a mistake in the slides. We focus on the mean only. We will put that in the FAQ, but you are welcome to provide any interesting discoveries on the variance in your manuscript.
Q: Why did you scale the yields and provide relative yields instead of the raw data e.g. in t/ha?
A: We did this to reduce the complexity of the challenge by leaving out environmental data and other factors.
Q: I wanted to know the difference between inbreds and testers as I was late [to the webinar].
A: Inbreds and testers can simply be considered as two different parents. The naming convention is just the industry standard.
Q: Do we need to submit the code as well, and is there any format?
A: We focus on the documentation and test dataset performance. You are welcome to submit your code.
Q: What should we do if we have multiple ideas?
A: You could try multiple approaches and submit the best prediction as your final submission. Please document the other methods in the manuscript as well.
Q: After submission, do we get to see the actual solutions for the yield values?
A: This is to be determined.
Q: Will the data be available for publication eventually, or will the publicity ban always remain in effect?
A: I believe the data will not eventually be made available, but please refer to the NDA.
Q: What are the criteria that will be used to evaluate the best work?
A: RMSE (root mean square error) compared against an internal dataset, as well as clarity, documentation, and the other criteria listed on the webpage.
Q: Can we publish our paper somewhere after the final announcement?
A: From the FAQ page, “Am I allowed to publish my entry in a journal? Technically, the work may be published, however the data may not be published or publicized in any way, and is protected by the non-disclosure agreement that is signed when you download it.”
Q: Do we have to write/submit it as a research paper or only a technical report? Is there a template?
A: You can structure the paper submission in any format you like. The only requirement is that the paper is simple, intuitive, and clearly explained. Unlike a traditional journal paper, you may exclude a detailed literature review to keep the submission shorter.
Q: What key points do you need included in the report?
A: The major sections in the template.
Q: How long are the paper submissions expected to be?
A: There is no required length to the paper submission. If you cover all the necessary pieces of information then that is sufficient.
Q: How will the test set predictions be scored?
A: We will be using the mean squared error as the metric for accuracy. We have the actual performance of the hybrids listed in the test set.
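For concreteness, the mean squared error (and its square root, the RMSE also mentioned in this Q&A) can be computed on a toy set of scaled yields; all values below are made up:

```python
import math

# Toy illustration of the accuracy metric: mean squared error (MSE)
# between predicted and actual scaled yields; RMSE is its square root.
actual    = [1.05, 0.98, 1.10]
predicted = [1.00, 1.00, 1.05]

mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
rmse = math.sqrt(mse)
print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```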
Q: Would the rights of the submitter with respect to IP (if anything novel were to be created) reside with the submitter?
A: From the RULES page, “Participants represent and warrant that they have the right to submit their idea to the contest, that they are the sole and exclusive owner of the idea, and that the idea or submission of the idea does not violate any agreements with third parties or any applicable law in any country.”
Q: In the submission answer, you did not mention data submission. To confirm, you want a filled-out version of the testing data, correct?
A: For the submission we want the predictions on the test set to be submitted with the report.
Q: How are the models to be submitted? E.g., Python or any other source language modules?
A: The models can be described in the written report that should accompany your submission of the predicted values. Including code is not a requirement, but if you would like to submit code with your entry you are more than welcome to do so.
Q: It seems that most questions have also already been answered in your presentation, so will these still be posted somewhere?
A: We’ll review the questions for anything that hasn’t already been addressed elsewhere, and add anything new to the FAQ on the website. The presentation video will also be posted.
Q: When will the results be known?
A: At the 2020 INFORMS Business Analytics meeting.
Q: Will there be more future webinars held if we have more questions?
A: Yes, there will be one more by the end of Nov. or early Dec.
Q: How many people are competing in this year’s submission?
A: That is unknown. We will not know until the submission deadline, when all the submissions are completed.
Q: When will the winners be decided?
A: At the 2020 INFORMS Business Analytics conference. Please refer to the details on the webpage.