Assign by relative weighting

Postby MichaelWacey » Wed Apr 09, 2008 6:15 pm

As near as I can tell, data read from files is cycled through. In essence, this means that values from files are uniformly distributed across the generated data. It would be nice if we could select the values based on some weighting factor. Here is an example:

I have a file that has each ZIP Code in the US and the population in that ZIP Code. Let's say there are 60,000 zip codes.
I want to generate 3 million people, 100,000 physicians, and 20,000 hospital records. Each of these has a zip code.
I want the likelihood that a person, physician, or hospital has a specific zip code to roughly follow the same distribution as the population in that zip code.

Does this make sense? I can't see a way to do this with Benerator today - am I missing something?

As it is, the data is generated starting at the beginning of the file. So all my hospitals are close together on the East Coast, since that is where the zip codes begin. Each zip code gets a uniform number of people, even ones that have very few in real life.

Take Care
MichaelWacey
 
Posts: 17
Joined: Mon Apr 07, 2008 3:25 am

Postby Volker Bergmann » Thu Apr 10, 2008 5:47 am

Hi Michael,

Data is only cycled through if you use cyclic="true"; otherwise it is iterated once. If you use, for example, distribution="cumulated", you will receive values from the middle of the data range more frequently than values at the borders.
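To illustrate the "more in the middle" shape: one common way to get it is to average several uniform random draws. The following is only a sketch of that general idea; the class and method names are made up for illustration, and it is not taken from Benerator's actual implementation of distribution="cumulated".

```java
import java.util.Random;

public class CumulatedSketch {

    /**
     * Averages several uniform draws from [min, max].
     * The average tends towards the middle of the range,
     * so border values become rare.
     */
    public static long cumulated(Random random, long min, long max, int draws) {
        long sum = 0;
        for (int i = 0; i < draws; i++) {
            // one uniform value in [min, max]
            sum += min + (long) (random.nextDouble() * (max - min + 1));
        }
        return Math.round((double) sum / draws);
    }

    public static void main(String[] args) {
        Random random = new Random();
        int[] histogram = new int[10];
        for (int i = 0; i < 100_000; i++) {
            histogram[(int) cumulated(random, 0, 9, 5)]++;
        }
        // Prints how often each value 0..9 was produced
        for (int value = 0; value < histogram.length; value++) {
            System.out.println(value + ": " + histogram[value]);
        }
    }
}
```

With a single draw this degenerates to a plain uniform distribution; the more draws are averaged, the narrower the peak in the middle becomes.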

You can generate values of simple type with individual weighting.
I will support composite types in one of the next releases, too.

For simple types, you can already use a CSV file that lists the values in the first column and a weight in the second one, e.g. the file lastNames_US.csv, which is shipped with benerator (within lib/benerator-x.y.z.jar) and starts like this:

Code:
Smith,1.006
Johnson,0.810
Williams,0.699
Jones,0.621
Brown,0.621
Davis,0.480
Miller,0.424
Wilson,0.339
Moore,0.312
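
To make the meaning of the second column concrete, here is a small sketch that parses such "value,weight" lines and converts the weights into selection probabilities. It assumes the weights are relative (they need not sum to 1), and the class and method names are hypothetical, not part of Benerator.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WeightedCsvSketch {

    /** Parses "value,weight" lines and returns each value's share of the total weight. */
    public static Map<String, Double> probabilities(List<String> lines) {
        Map<String, Double> weights = new LinkedHashMap<>();
        double total = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            int comma = line.lastIndexOf(',');
            if (comma < 0) {
                throw new IllegalArgumentException("Missing weight column: " + line);
            }
            String value = line.substring(0, comma).trim();
            double weight = Double.parseDouble(line.substring(comma + 1).trim());
            weights.merge(value, weight, Double::sum); // add up duplicate values
            total += weight;
        }
        // Normalize: probability = weight / sum of all weights
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            result.put(entry.getKey(), entry.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Smith,1.006", "Johnson,0.810", "Williams,0.699");
        probabilities(lines).forEach(
                (name, p) -> System.out.printf("%s -> %.3f%n", name, p));
    }
}
```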


For using this programmatically, check out the source code of FamilyNameGenerator:

Code:
public class FamilyNameGenerator extends DatasetCSVGenerator<String> {

    public FamilyNameGenerator() {
        this(Locale.getDefault().getCountry());
    }

    public FamilyNameGenerator(String datasetName) {
        this(datasetName,
                "org/databene/dataset/region",
                "org/databene/domain/person/familyName_{0}.csv");
    }

    public FamilyNameGenerator(String datasetName, String nesting, String fileNamePattern) {
        super(fileNamePattern, datasetName, nesting, "UTF-8");
    }
}


You can then use it from the benerator file as follows:

Code:
<create-entities count="100">
    <attribute name="lastName" generator="org.databene.domain.person.FamilyNameGenerator"/>
    <consumer class="org.databene.model.consumer.LoggingConsumer"/>
</create-entities>


I will add generic support for such CSV files (without the need to write a Java wrapper) soon.

Regards,
Volker 'databene' Bergmann

Need faster response times? phone support? onsite support? training? custom extensions? immediate bug fixes? Support Benerator evolution by buying services from Volker Bergmann!
Volker Bergmann
 
Posts: 654
Joined: Sat Nov 10, 2007 2:40 pm

Randomly Selecting from a CSV

Postby MichaelWacey » Thu Apr 10, 2008 4:07 pm

Below is the code I am using with version 0.5.1. I have tried several variations of the <variable> entry for ZIP and of the <attribute> entries. In all cases, it reads from the ZIP file in sequence and puts all of my providers in Puerto Rico.

Is there a way to get it to select the ZIP code records (there are about 60,000) randomly?


<create-entities name="provider" count="{ftl:${provider_count}}">
    <variable name="person" generator="org.databene.domain.person.PersonGenerator" locale="US"/>
    <variable name="ZIP" type="entity" source="ZIP.import.csv" cyclic="false" distribution="random"/>
    <attribute name="id" type="long" min="1" max="1000000000" distribution="step"/>
    <attribute name="last_name" source="person.familyName"/>
    <attribute name="type" values="Hospital,Clinic,Office"/>
    <attribute name="ZIP_Code" source="ZIP.ZIP_CODE" type="string" minLength="5" maxLength="5"/>
    <attribute name="MSA_No" source="ZIP.MSA_No"/>
    <attribute name="MSA_Name" source="ZIP.MSA_Name"/>
    <consumer id="csv" class="org.databene.platform.csv.CSVEntityExporter">
        <property name="uri" value="BHI_Provider.csv"/>
        <property name="properties" value="id,last_name,type,ZIP_Code,MSA_No,MSA_Name"/>
    </consumer>
</create-entities>

Postby Volker Bergmann » Thu Apr 10, 2008 5:00 pm

Sorry, the weighted reuse of imported entities is a feature I planned but have not implemented yet. I will implement it in the next release, which is planned for the end of April, but I will try to provide you with a workaround soon.
Volker 'databene' Bergmann


Postby Volker Bergmann » Thu Apr 10, 2008 6:29 pm

Hi Michael,

here's the workaround:

Compile this class and add it to the classpath:

Code:
package org.databene.benerator.hotfix;

import java.io.FileNotFoundException;

import org.databene.benerator.InvalidGeneratorSetupException;
import org.databene.benerator.sample.WeightedSample;
import org.databene.benerator.sample.WeightedSampleGenerator;
import org.databene.commons.ArrayBuilder;
import org.databene.commons.SystemInfo;
import org.databene.model.data.Entity;
import org.databene.platform.csv.CSVEntityIterator;

public class WeightedCSVEntityGenerator extends WeightedSampleGenerator<Entity> {
   
   private String entity;
   private String uri;
   private String encoding = SystemInfo.fileEncoding();
   private char   separator = ',';
   
   private boolean dirty = true;

   public void setEntity(String entity) {
      this.entity = entity;
      this.dirty = true;
   }

   public void setUri(String uri) {
      this.uri = uri;
      this.dirty = true;
   }

   public void setSeparator(char separator) {
      this.separator = separator;
      this.dirty = true;
   }

   public void setEncoding(String encoding) {
      this.encoding = encoding;
      this.dirty = true;
   }

   @Override
   public void validate() {
      if (dirty) {
         try {
            // Collect every CSV row together with its weight
            ArrayBuilder<WeightedSample<Entity>> entities = new ArrayBuilder(WeightedSample.class);
            CSVEntityIterator iterator = new CSVEntityIterator(uri, entity, separator, encoding);
            while (iterator.hasNext()) {
               // Named 'row' so it does not shadow the 'entity' field
               Entity row = iterator.next();
               // Rows without a ben:weight value default to a weight of 1
               String weightString = (String) row.get("ben:weight");
               double weight = (weightString != null ? Double.parseDouble(weightString) : 1);
               entities.append(new WeightedSample<Entity>(row, weight));
            }
            setSamples(entities.toArray());
         } catch (FileNotFoundException e) {
            throw new InvalidGeneratorSetupException(e);
         }
         super.validate();
         dirty = false;
      }
   }
   
   @Override
   public boolean available() {
      if (dirty)
         validate();
      return true;
   }
}


If you create a file zip.csv like the following (note the last column, ben:weight, which contains a weight for each row):

Code:
ZIP_CODE,CITY,ben:weight
00000,Acity,7.2
11111,Btown,4.8
22222,Cville,2.2
33333,Dington,0


You can use it like this:

Code:
<?xml version="1.0" encoding="iso-8859-1"?>
<setup    xmlns="http://databene.org/benerator-0.5.0.xsd"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://databene.org/benerator-0.5.0.xsd org/databene/benerator/benerator-0.5.0.xsd">

   <bean id="ZIP_var" class="org.databene.benerator.hotfix.WeightedCSVEntityGenerator">
      <property name="entity" value="Address"/>
      <property name="uri" value="zip.csv"/>
      <property name="encoding" value="UTF-8"/>
      <property name="separator" value=","/>
   </bean>
   
   <create-entities name="sample" count="10">
      <variable name="ZIP" type="entity" source="ZIP_var"/>
       <attribute name="ZIP_Code" source="ZIP.ZIP_CODE"/>   
       <attribute name="City" source="ZIP.CITY"/>   
      <consumer class="org.databene.model.consumer.LoggingConsumer"/>
   </create-entities>

</setup>


This will provide you with entities from the csv file, weighted with the factor provided in the last column.
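
For readers who want to see what "weighted" means in practice, here is a generic sketch of weighted selection using cumulated weights. It is not the code of WeightedSampleGenerator (whose internals are not shown in this thread); the class name WeightedPickSketch and its pick method are made up for illustration. The weights are the ones from the zip.csv example above.

```java
import java.util.Random;

public class WeightedPickSketch {

    /** Returns an index chosen with probability proportional to its weight. */
    public static int pick(double[] weights, Random random) {
        double total = 0;
        for (double w : weights) {
            total += w;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("At least one weight must be positive");
        }
        double target = random.nextDouble() * total; // uniform in [0, total)
        double cumulated = 0;
        int lastPositive = -1;
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] <= 0) {
                continue; // rows with weight 0 are never chosen
            }
            cumulated += weights[i];
            lastPositive = i;
            if (target < cumulated) {
                return i;
            }
        }
        return lastPositive; // only reached through floating-point rounding
    }

    public static void main(String[] args) {
        String[] zips = {"00000", "11111", "22222", "33333"};
        double[] weights = {7.2, 4.8, 2.2, 0};
        int[] counts = new int[zips.length];
        Random random = new Random();
        for (int i = 0; i < 100_000; i++) {
            counts[pick(weights, random)]++;
        }
        // Counts should be roughly proportional to the weights
        for (int i = 0; i < zips.length; i++) {
            System.out.println(zips[i] + ": " + counts[i]);
        }
    }
}
```

With these weights, ZIP 00000 should come up in roughly 7.2 / 14.2 of all cases (about half), and ZIP 33333 never. For the original use case, the population of each ZIP code could serve directly as its weight.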

Regards
Volker 'databene' Bergmann


Random Selection rather than in Sequence

Postby MichaelWacey » Fri Apr 11, 2008 9:53 pm

Volker,

Thanks for the reply. I will try the workaround this weekend. I am also interested in being able to randomly select rows from a CSV file without weighting. I tried the code in my previous post, but it always selects rows in sequence. Should it work by random selection, or is that a future enhancement?

Thanks

Postby Volker Bergmann » Sat Apr 12, 2008 5:31 am

Hi Michael,

This is a future enhancement that I will ship with the next release, which is scheduled for the end of April. How urgent is the matter for you? I could provide you with a workaround before then.

Regards,
Volker 'databene' Bergmann
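
Until then, unweighted random selection can be sketched in plain Java: load the rows once, then draw an index uniformly at random for each generated record. The class below is a hypothetical stand-alone illustration, not a Benerator API. Judging only from the hotfix code earlier in this thread, rows without a ben:weight column get a weight of 1, so WeightedCSVEntityGenerator would presumably behave the same way on a file without that column, but this was not confirmed in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class RandomRowSketch {

    /** Returns 'count' rows drawn uniformly at random, with replacement. */
    public static List<String> sample(List<String> rows, int count, Random random) {
        if (rows.isEmpty()) {
            throw new IllegalArgumentException("No rows to sample from");
        }
        List<String> result = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            // every row has the same chance, independent of its position in the file
            result.add(rows.get(random.nextInt(rows.size())));
        }
        return result;
    }

    public static void main(String[] args) {
        // In real use the rows would be read from the CSV file (minus the header),
        // e.g. with java.nio.file.Files.readAllLines; here they are inlined.
        List<String> rows = Arrays.asList(
                "00000,Acity", "11111,Btown", "22222,Cville", "33333,Dington");
        for (String row : sample(rows, 10, new Random())) {
            System.out.println(row);
        }
    }
}
```

Holding about 60,000 ZIP rows in memory this way should be unproblematic, and it avoids the clustering that comes from reading the file in sequence.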
