There are many reasons to dichotomize valued network data. It might be for methodological reasons, for example, in order to use a graph-theoretic concept such as a clique or an n-clan, or to use methods such as ERGMs or SAOMs, which largely assume binary data.
Whatever the reason, if we are going to dichotomize, the question is at what level we should dichotomize. In some cases, the choice is guided by theoretical meaningfulness and the research design. For example, suppose respondents are asked to rate others on a scale of 1 = do not know them, 2 = acquaintance, 3 = friend, and 4 = family. There is a loose gradation from “does not know” to “knows well”; however, categories 3 and 4 are not so much different degrees of closeness as different kinds of social relations. The choice of which to use is determined by the research question. A similar example is provided by questions that ask for a range of affect from negative to positive. If respondents are asked to rate others on a scale of 1 = dislike a lot, 2 = dislike somewhat, 3 = neither like nor dislike, 4 = like somewhat, and 5 = like a lot, for many analyses it will make sense to choose a cutoff of >3 or >4 for positive ties and <3 or <2 for negative ties. Note that in both of these examples, we are still confronted with a choice between two values. Moreover, if the scale points are more ambiguous than the ones above, or if the data are counts or rankings, there is likely no a priori way of deciding where to dichotomize.
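For concreteness, the positive/negative split just described can be written down directly. The following is an illustrative sketch (not from the original text): a hypothetical 4-person rating matrix on the 1–5 liking scale, where 0 marks the diagonal (people do not rate themselves).

```python
import numpy as np

# Hypothetical ratings on the 1-5 liking scale described above
# (0 on the diagonal: people do not rate themselves)
W = np.array([
    [0, 5, 1, 3],
    [4, 0, 2, 5],
    [1, 2, 0, 3],
    [3, 4, 1, 0],
])

positive = (W >= 4).astype(int)              # cutoff >3: "like somewhat" or more
negative = ((W > 0) & (W <= 2)).astype(int)  # cutoff <3: any degree of dislike
```

The same valued matrix thus yields two distinct binary networks, one of positive ties and one of negative ties, each suited to different analyses.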
Here, we propose a two-step approach to dichotomizing. Step 1 is to simply dichotomize at every level (or a collection of levels) and inspect the results; step 2 is to choose a single threshold and justify the choice.
For step 1, input your valued network into your favorite network data management software and dichotomize at every level of the scale (see the insert for information about how to do this in R and in UCINET). We recommend always spending some time visualizing the networks, which can be very informative regarding the emergence of clusters at certain levels of dichotomization. For example, consider Davis et al.’s (1941) women-by-events data (often referred to as the Davis data set or the DGG data). We construct a one-mode women-by-women network by multiplying the original matrix by its transpose. The result is shown in the following table.
One-mode DGG Women-by-Women network projection.
EV  LA  TH  BR  CH  FR  EL  PE  RU  VE  MY  KA  SY  NO  HE  DO  OL  FL  

EVELYN  8  6  7  6  3  4  3  3  3  2  2  2  2  2  1  2  1  1 
LAURA  6  7  6  6  3  4  4  2  3  2  1  1  2  2  2  1  0  0 
THERESA  7  6  8  6  4  4  4  3  4  3  2  2  3  3  2  2  1  1 
BRENDA  6  6  6  7  4  4  4  2  3  2  1  1  2  2  2  1  0  0 
CHARLOTTE  3  3  4  4  4  2  2  0  2  1  0  0  1  1  1  0  0  0 
FRANCES  4  4  4  4  2  4  3  2  2  1  1  1  1  1  1  1  0  0 
ELEANOR  3  4  4  4  2  3  4  2  3  2  1  1  2  2  2  1  0  0 
PEARL  3  2  3  2  0  2  2  3  2  2  2  2  2  2  1  2  1  1 
RUTH  3  3  4  3  2  2  3  2  4  3  2  2  3  2  2  2  1  1 
VERNE  2  2  3  2  1  1  2  2  3  4  3  3  4  3  3  2  1  1 
MYRNA  2  1  2  1  0  1  1  2  2  3  4  4  4  3  3  2  1  1 
KATHERINE  2  1  2  1  0  1  1  2  2  3  4  6  6  5  3  2  1  1 
SYLVIA  2  2  3  2  1  1  2  2  3  4  4  6  7  6  4  2  1  1 
NORA  2  2  3  2  1  1  2  2  2  3  3  5  6  8  4  1  2  2 
HELEN  1  2  2  2  1  1  2  1  2  3  3  3  4  4  5  1  1  1 
DOROTHY  2  1  2  1  0  1  1  2  2  2  2  2  2  1  1  2  1  1 
OLIVIA  1  0  1  0  0  0  0  1  1  1  1  1  1  2  1  1  2  2 
FLORA  1  0  1  0  0  0  0  1  1  1  1  1  1  2  1  1  2  2 
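The projection and the successive dichotomizations can be sketched as follows; this is an illustrative Python rendering with a small hypothetical incidence matrix, not the full DGG data (the R version appears in the insert at the end).

```python
import numpy as np

# Hypothetical women-by-events incidence matrix (1 = attended)
A = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
])

# One-mode projection: cell (i, j) counts the events co-attended by i and j
W = A @ A.T
np.fill_diagonal(W, 0)  # ignore self-ties

# Dichotomize at every possible level: layers[t][i, j] = 1 iff W[i, j] > t
layers = {t: (W > t).astype(int) for t in range(W.max())}
```

Each layer is then visualized in turn, exactly as described above for the DGG data.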
If we dichotomize at >1 and visualize, we get
DGG Women by Women dataset dichotomized above 1.
If we dichotomize at >2, we get
DGG Women by Women dataset dichotomized above 2.
And if we dichotomize at >3, we get
DGG Women by Women dataset dichotomized above 3.
Thus, the successive dichotomizations reveal a two-group structure, which is illuminating. Other datasets behave differently. Consider, for example, the BKS fraternity data. If we dichotomize at >0, we get
BKS FRATERNITY dataset dichotomized above 0.
If we dichotomize at >2, we get
BKS FRATERNITY dataset dichotomized above 2.
Dichotomizing at >4, we get
BKS FRATERNITY dataset dichotomized above 4.
Dichotomizing at >6, we get
BKS FRATERNITY dataset dichotomized above 6.
And so on. Core-periphery structures have a kind of self-similarity property in which the main component looks the same regardless of what level of dichotomization produced it.
Now, successive dichotomizations are informative, but our original question was about choosing a single dichotomization to be used in all further analyses, which is where step 2 becomes important. For step 2, we present three potential approaches. The first will horrify some people: choose the level of dichotomization that maximizes your results. For example, suppose you are predicting managers’ performance as a function of betweenness centrality. For each possible level of dichotomization, you measure betweenness centrality and regress performance on betweenness, along with any control variables. The level of dichotomization that yields the highest R-square is the one you use.
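As an illustrative sketch of this search (entirely synthetic data; degree centrality stands in for betweenness to keep the example dependency-free):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
W = rng.integers(0, 10, size=(n, n))  # hypothetical valued network
W = np.triu(W, 1)
W = W + W.T                           # symmetric, zero diagonal
performance = rng.normal(size=n)      # hypothetical outcome variable

def r_squared(x, y):
    # R-square of an OLS regression of y on x (with intercept)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Regress performance on centrality at every threshold; degree is used
# here as a simple stand-in for betweenness
fits = {}
for t in range(1, W.max() + 1):
    centrality = (W >= t).sum(axis=1).astype(float)
    fits[t] = r_squared(centrality, performance)

best = max(fits, key=fits.get)  # threshold that maximizes model fit
```

In a real analysis, one would compute betweenness on each dichotomized layer and include the control variables in the regression; the looping logic is the same.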
As we said, some people (scientists, statisticians, and people of good character) will be horrified by this approach. Suppose, for example, that the results were as follows.
R-square of models predicting performance using betweenness centrality at different levels of dichotomization.
Dichot. level  R-square
1  0.05
2  0.29
3  0.02
4  0.01
5  0.31
6  0.06
7  0.11
8  0.02
9  0.23
Clearly, we would choose 5, but how to make sense of these results? They rise and fall with no rhyme or reason. In this case, we would strongly advise against taking this approach. If, on the other hand, the results were something like this:
R-square of models predicting performance using betweenness centrality at different levels of dichotomization.
Dichot. level  R-square
1  0.05
2  0.09
3  0.12
4  0.23
5  0.31
6  0.27
7  0.22
8  0.15
9  0.07
We would be comforted by the underlying regularity and feel good about choosing 5, even though we might be hard-pressed to explain why medium density worked best.
A slightly less controversial version of this approach might be to choose the dichotomized version of your network that maximizes the replication of results from past studies. For example, we know from past studies that actors with higher levels of self-monitoring are more likely to receive more friendship nominations. We could choose the dichotomization threshold that maximizes the relationship between self-monitoring and friendship nominations, even if the test of our hypothesis has to do with betweenness centrality and performance.
That was the first approach. The second approach is less controversial. Dichotomization, by its very nature, is a distortion of the data. The idea here is to choose the threshold that does the least violence to the data, for example, the one whose dichotomized network correlates most strongly with the original valued network. For the DGG data, the relevant statistics are shown in the following table.
Z-score, correlation, number of ties, and density of the DGG dataset at different dichotomization levels.
Value  Z-score  Correlation  Ties  Density
7  3.352  0.271887  2  0.006536
6  2.667  0.646625  16  0.052288
5  1.983  0.666829  18  0.058824
4  1.298  0.781314  48  0.156863
3  0.613  0.811928  92  0.300654
2  −0.072  0.720115  190  0.620915
1  −0.756  0.457341  278  0.908497
0  −1.441  n/a  306  1.000000
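The computations behind such a table can be sketched as follows; this is an illustrative Python version using a small hypothetical matrix rather than the DGG data (the R version in the insert does the same job), comparing only off-diagonal cells and skipping layers where the correlation is undefined.

```python
import numpy as np

# Hypothetical symmetric valued network with zero diagonal
W = np.array([
    [0, 3, 1, 0, 2],
    [3, 0, 4, 1, 0],
    [1, 4, 0, 2, 5],
    [0, 1, 2, 0, 3],
    [2, 0, 5, 3, 0],
])

n = W.shape[0]
off = ~np.eye(n, dtype=bool)     # compare off-diagonal cells only

corr = {}
for t in range(1, W.max() + 1):
    D = (W >= t).astype(float)
    if D[off].std() == 0:        # all-0 or all-1 layer: correlation undefined
        continue
    corr[t] = np.corrcoef(W[off], D[off])[0, 1]

# Threshold that does the least violence to the valued data
best = max(corr, key=corr.get)
```

For this toy matrix the correlation peaks at a middle threshold, mirroring the pattern in the DGG table, where the maximum occurs at ≥3.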
Interestingly, ≥3 is the level just below the one at which the network splits into two large components (along with four isolates). At ≥4, the network looks like the following figure.
DGG Women by Women dataset dichotomized at 4.
The third approach is theory-based and can be harder to implement. In certain cases, we can use the emergent properties of the dichotomized networks themselves to identify the correct dichotomization threshold, just as when we noticed the appearance of clusters while visually inspecting different dichotomization thresholds in the DGG data. As an example, let us consider the approach proposed by Freeman, which counts g-transitive and intransitive triples at each threshold.
Number of g-transitive and intransitive triples in the DGG dataset at different dichotomization levels.
Value  Trans  Intrans 

7  0  0 
6  26  0 
5  30  0 
4  160  0 
3  526  4 
2  2,032  44 
1  3,786  292 
0  4,448  448 
The table shows that at ≥4, the number of g-transitive triples is 160 and the number of intransitive triples is 0. Hence, ties of 4 or above are strong ties, and ties <4 but >0 are weak ties.
Combining this with our previous approach, we might summarize the situation as follows. Dichotomizing at ≥3 optimally identifies ties of any kind in terms of the least-violence criterion and maintains a single large component (plus isolates). Dichotomizing at ≥4 identifies strong ties, which sharply fragment the network. The latter is useful for outlining a subgroup structure, while the former enables the calculation of measures that require connected networks (aside from isolates).
DGG Women by Women dataset dichotomized at 3. Strong ties in bold.
It is worth noting that Freeman’s approach need not be limited to maximizing g-transitivity. On theoretical grounds, we may identify a specific mechanism that organizes ties. For example, we may posit a status mechanism such as the Matthew effect, in which nodes that already have many ties tend to attract even more ties. To dichotomize valued data, we would then choose the cutoff that maximizes the extent to which there are just a few nodes with many ties and a great many nodes with few ties. Alternatively, we might choose the cutoff that maximizes the level of transitivity in the network.
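One way to operationalize the Matthew-effect criterion is to score each threshold by the skewness of the resulting degree distribution and pick the most right-skewed layer. This is our own illustrative operationalization, sketched on synthetic data, not a published procedure:

```python
import numpy as np

def degree_skewness(D):
    # Sample skewness of the degree distribution of a binary network
    deg = D.sum(axis=1).astype(float)
    s = deg.std()
    if s == 0:
        return 0.0
    return ((deg - deg.mean()) ** 3).mean() / s ** 3

rng = np.random.default_rng(2)
n = 20
W = rng.integers(0, 6, size=(n, n))  # hypothetical valued network
W = np.triu(W, 1)
W = W + W.T                          # symmetric, zero diagonal

# Score every threshold: larger skew = a few hubs, many low-degree nodes
skew = {t: degree_skewness(W >= t) for t in range(1, W.max() + 1)}
best = max(skew, key=skew.get)
```

Any other emergent property, such as the global transitivity mentioned above, can be substituted for the scoring function in the same loop.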
This “How to” guide on dichotomization is intended to provide guidance on finding a suitable dichotomization threshold for social network data. We propose that in all cases, one should start by creating dichotomized versions of the network at every possible threshold and inspecting them visually. We then suggest three separate approaches for choosing (and justifying) a single threshold, based on (i) maximizing expected results, (ii) minimizing distortion, and (iii) identifying specific emergent properties of the network.
#Import the Davis data set in R, assuming that it is already in a text file, for example exported from UCINET
davis <- as.matrix(read.csv("davis.txt", sep = "\t", row.names = 1))
#Create a one-mode network by multiplying the original matrix by its transpose
davisonemode <- davis %*% t(davis)
diag(davisonemode) <- 0
#Dichotomize the network at all values
davisonemodedic <- array(dim = c(NROW(davisonemode), NCOL(davisonemode), max(davisonemode)))
for (i in 1:max(davisonemode)) {
  davisonemodedic[,,i] <- ifelse(davisonemode >= i, 1, 0)
}
#Visualize all networks
library(network)
library(sna)
par(mfrow = c(4, 2))
for (i in 1:max(davisonemode)) {
  plot(as.network(davisonemodedic[,,i]))
}
#Correlation between original network and dichotomized networks, and some descriptive statistics
#(testval is the observed graph correlation from the QAP test)
stats <- array(dim = c(max(davisonemode), 4))
colnames(stats) <- c("Threshold", "Correlation", "Num of 1s", "Density")
for (i in 1:max(davisonemode)) {
  stats[i, 1] <- i
  stats[i, 2] <- qaptest(list(davisonemode, davisonemodedic[,,i]), gcor, g1 = 1, g2 = 2)$testval
  stats[i, 3] <- sum(davisonemodedic[,,i])
  stats[i, 4] <- stats[i, 3] / (NROW(davisonemode) * (NROW(davisonemode) - 1))
}
stats
To visualize successive dichotomizations in UCINET, one opens the valued data as usual and presses the + sign in the rels tab at right to raise the level of dichotomization by one unit, as in the following screenshot.
Screenshot of Netdraw.
This can also be done in the command line interface (CLI) as follows:
>d1 = dichot(women ge 1)
>d2 = dichot(women ge 2)
>d3 = dichot(women ge 3)
Etc.
In addition, the network could be drawn after each step:
>draw d1
>draw d2
Etc.
To compute the correlation between an original data set and successive dichotomizations of it, we can use UCINET’s Transform > Interactively Dichotomize procedure.
Screenshot of UCINET’s Interactive Dichotomization routine’s results.
Finally, to execute Freeman’s strong-weak-null tie decomposition based on g-transitivity, we can use UCINET’s command line interface (CLI), as shown in the following screenshot.
Gtransitivity decomposition command line instruction and output in UCINET.
Notes
There are, of course, many methods that do not require dichotomization. For example, we do not need to dichotomize in order to measure eigenvector centrality, nor to apply the relational event model.
However, this should not be taken as definitive. Various normalizations of the data, as well as bipartite representations, tend to show a third, smaller subgroup.
On the other hand, these same people are happy to use regression to find the optimal coefficients to show a relationship between their explanatory variable and a dependent variable. Perhaps we should ask them to choose the coefficients before looking at the data.
Of course, if you have these other datasets on hand, then you could pick the level of dichotomization that yields the highest average fit across all of them.
Clearly, in some cases, distorting the data is exactly what we are looking for, for example, when distinguishing between negative and positive ties. In such cases, we should not expect the dichotomized data to preserve the properties of the original dataset, and we should either use a theory- or literature-driven approach or revert to approach 1.