As near as I can tell, data read from files is cycled through. In essence, this means that values from files are uniformly distributed across the generated data. It would be nice if we could use the values based on some weighting factor. Here is an example:
I have a file that has each ZIP code in the US and the population in that ZIP code. Let's say there are 60,000 ZIP codes.
I want to generate 3 million people, 100,000 physicians, and 20,000 hospital records. Each of these has a ZIP code.
I want the likelihood that a person, physician, or hospital has a specific ZIP code to roughly follow the same distribution as the population in that ZIP code.
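To illustrate what I mean, here is a rough Python sketch of the behavior I'm after. The file name and column names are just placeholders, and I know this isn't Benerator syntax; it's only meant to show the weighted draw I'd like the generator to do:

    import csv
    import random

    # Placeholder file: one row per ZIP code with its population.
    with open("zip_population.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    zips = [row["zip"] for row in rows]
    weights = [int(row["population"]) for row in rows]

    # Each generated record gets a ZIP code drawn with probability
    # proportional to that ZIP code's population.
    person_zips = random.choices(zips, weights=weights, k=3_000_000)
    physician_zips = random.choices(zips, weights=weights, k=100_000)
    hospital_zips = random.choices(zips, weights=weights, k=20_000)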
Does this make sense? I can't see a way to do this with Benerator today - am I missing something?
As it is, the generator starts at the beginning of the file and cycles through it. So all my hospitals are clustered together on the East Coast, since that is where the ZIP codes begin, and each ZIP code gets a uniform number of people, even ones that have very few in real life.
Take Care