Categorizing the Census Surname Data

(Maybe do this in class?)

What is a categorical variable vs. a continuous variable?

The problem with the Census surname data is that almost every column is a continuous variable (counts and percentages), which leaves few categories to group names by.

Calculating attributes of surnames

  • Length of name
  • First character
  • Last character
  • First/last 2 characters
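
A minimal sketch of how the attributes listed above could be computed from a single surname string. The function name surname_attributes and the dictionary keys are hypothetical, chosen for illustration; uppercasing the name first is an assumption about how the names should be normalized.

```python
def surname_attributes(name):
    """Return simple categorical attributes derived from a surname string."""
    # Assumption: normalize to uppercase with no surrounding whitespace.
    name = name.strip().upper()
    return {
        "length": len(name),        # length of name
        "first_char": name[:1],     # first character
        "last_char": name[-1:],     # last character
        "first_two": name[:2],      # first 2 characters
        "last_two": name[-2:],      # last 2 characters
    }


attrs = surname_attributes("Garcia")
print(attrs)
```

Slicing (name[:1], name[-2:]) is used instead of indexing so that an empty or one-letter name returns a short string rather than raising an error.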

Deriving more values from the pct columns

  • White vs non-white
  • Estimated number of people in each group (white, etc.) for a given name
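
One way the two derived values above might be calculated, as a sketch only. It assumes each row has a count column (total people with the name) and a pctwhite column on a 0 to 100 scale; those column names, the function derive_from_pct, and the sample row are assumptions for illustration, and the row's numbers are made up rather than taken from the Census file.

```python
def derive_from_pct(row):
    """Derive a non-white percentage and an estimated white headcount."""
    # Values are read as strings, as they would be from a CSV file.
    count = int(row["count"])
    pct_white = float(row["pctwhite"])
    return {
        # White vs. non-white: everything that is not white.
        "pct_nonwhite": round(100.0 - pct_white, 2),
        # Estimated number of white people with this name.
        "n_white": round(count * pct_white / 100.0),
    }


# Made-up row for illustration; not real Census values.
row = {"name": "EXAMPLE", "count": "1000", "pctwhite": "60.0"}
derived = derive_from_pct(row)
print(derived)
```

The same multiplication (count times percentage, divided by 100) would apply to any of the other pct columns.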

Categorizing the group-percentage values

  • Diverse vs. non-diverse name?
  • Labeling a name by its most dominant racial/ethnic group
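
The two labels above could be sketched as follows. The column names in PCT_COLUMNS are assumed, the function categorize is hypothetical, and the rule for "diverse" (no single group reaches 50 percent) is only one possible definition, picked here to have something concrete to discuss. The sample rows are made up, and the sketch assumes every pct value is numeric.

```python
# Assumed names of the group-percentage columns.
PCT_COLUMNS = ["pctwhite", "pctblack", "pctapi", "pctaian", "pct2prace", "pcthispanic"]


def categorize(row, diverse_cutoff=50.0):
    """Label a name by its largest group and flag whether it is 'diverse'."""
    shares = {col: float(row[col]) for col in PCT_COLUMNS}
    # The dominant group is the column with the largest percentage.
    dominant = max(shares, key=shares.get)
    return {
        "dominant_group": dominant[3:],  # drop the "pct" prefix
        # Assumption: diverse means the largest share is below the cutoff.
        "is_diverse": shares[dominant] < diverse_cutoff,
    }


# Made-up rows for illustration; not real Census values.
mixed = {"pctwhite": "40", "pctblack": "35", "pctapi": "5",
         "pctaian": "1", "pct2prace": "4", "pcthispanic": "15"}
single = {"pctwhite": "2", "pctblack": "1", "pctapi": "1",
          "pctaian": "0", "pct2prace": "1", "pcthispanic": "95"}

mixed_label = categorize(mixed)
single_label = categorize(single)
print(mixed_label, single_label)
```

Changing diverse_cutoff shows how sensitive the diverse vs. non-diverse split is to where the line is drawn.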