GRGR: Zipf

Glenn Scheper glenn_scheper at earthlink.net
Mon Dec 5 09:04:21 CST 2005


1. The rank-frequency law.

This is the most famous one; unfortunately many people call it "Zipf's law"
as if it were the only one. The procedure to estimate this relation is
very simple: the words in a text are sorted by decreasing frequency
and a rank number is assigned to each word. For words with the same
frequency, the sub-sorting and ranking is arbitrary.

The plot of log (frequency) versus log (rank) approximates a straight line
of slope -1.
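
The procedure described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names rank_frequency and loglog_slope are made up here, and the toy corpus is constructed so that frequency is exactly 120 / rank, which a real text will only approximate.

```python
import math
from collections import Counter

def rank_frequency(words):
    """Return (rank, frequency) pairs, most frequent word first."""
    counts = Counter(words)
    # Words with equal frequency get consecutive ranks in arbitrary order.
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) plotted against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Hypothetical toy corpus: frequencies 120, 60, 40, 30, 24 for ranks 1..5.
toy = ["a"] * 120 + ["b"] * 60 + ["c"] * 40 + ["d"] * 30 + ["e"] * 24
pairs = rank_frequency(toy)
ranks = [r for r, _ in pairs]
freqs = [f for _, f in pairs]
slope = loglog_slope(ranks, freqs)  # close to -1 for this idealized input
```

On a real text the fit would presumably be restricted to the range of ranks where the line looks straight, since the law is only approximate.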

2. The number-frequency law.

The plot of log (frequency) versus log (number of words with the
same frequency) approximates a straight line of slope -0.5.

While the rank-frequency law tends to occur for the high frequency words
(although not necessarily for the first few ranking positions),
the number-frequency law is observed for the low frequency words.
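
For the number-frequency law, the quantity needed is how many distinct words share each frequency. A minimal sketch, assuming the text is already split into words (frequency_spectrum is a name invented for this example):

```python
from collections import Counter

def frequency_spectrum(words):
    """Map each frequency to the number of distinct words that have it."""
    word_counts = Counter(words)              # word -> occurrences
    spectrum = Counter(word_counts.values())  # occurrences -> number of words
    return sorted(spectrum.items())

# Hypothetical toy sentence, not taken from either text.
toy = "the cat and the dog and the bird saw a fish".split()
spectrum = frequency_spectrum(toy)
# Six words occur once, one word occurs twice, one word occurs three times.
```

Multiplying each frequency by its number of words and summing gives back the total word count, which is a handy sanity check on tables of this kind.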


I'm not sure I got the idea, but let's try to compare
GR.TXT with KJV.TXT, after converting all non-alphabetic
characters to spaces and changing uppercase to lowercase.
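
The preprocessing step can be sketched as follows. This is not the program used for the numbers in this post; it assumes "alpha" means the ASCII letters A-Z and a-z, and the sample sentence is made up.

```python
import re
from collections import Counter

def normalize(text):
    """Replace every non-letter with a space, lowercase, and split into words."""
    return re.sub(r"[^A-Za-z]", " ", text).lower().split()

# Hypothetical sample; real input would be the contents of GR.TXT or KJV.TXT.
sample = "The LORD's rocket rose; the Rocket fell."
words = normalize(sample)
counts = Counter(words)

total_words = len(words)      # whole text word count
distinct_words = len(counts)  # number of distinct words
```

Because the apostrophe becomes a space, a possessive such as LORD's splits into the two words lord and s.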

Whole text word count:

 791442  KJV
 340838  GR

Number of distinct words:

 12558  KJV
 25490  GR

Words exceeding 1% of whole text word count:

KJV:
  63924 the
  51696 and
  34617 of
  13562 to
  12913 that
  12667 in
  10420 he
   9838 shall
   8997 unto
   8971 for
   8854 i
   8473 his
   8177 a
   7964 lord

GR:
  19442 the
   9313 of
   8197 a
   7794 and
   7787 to
   6493 in
   4374 s -- (from 's)
   3883 it
   3812 he
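
Lists like the two above can be produced by filtering the word counts against a share of the total. A small sketch, with an invented helper name (words_above_share) and invented toy counts:

```python
from collections import Counter

def words_above_share(counts, share=0.01):
    """Return (word, count) pairs whose count exceeds share * total, largest first."""
    total = sum(counts.values())
    threshold = share * total
    return [(word, n) for word, n in counts.most_common() if n > threshold]

# Hypothetical counts, with a 10% cutoff so the toy example stays small.
toy_counts = Counter({"the": 30, "of": 12, "rocket": 5, "zone": 3})
frequent = words_above_share(toy_counts, share=0.10)
```

With the totals reported above, the 1% threshold is 7914.42 for KJV, which lord (7964) just clears, and 3408.38 for GR, which he (3812) clears.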

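Number of distinct words at each frequency count. The left column
is how often a word occurs; the right column is how many distinct
words occur that often. E.g., at one end of the KJV list, 3947 words
occur exactly once; at the other end, one word (the) occurs 63924 times.
(Anybody understand how / want to graph these?)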
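
One possible way to graph these tables is to read the two columns, take logarithms of both, and fit or plot a line. The sketch below is an assumption-laden illustration: the helper names are invented, only the first ten KJV rows are included as sample data, and the optional plot_spectrum function assumes matplotlib is installed.

```python
import math

# First ten rows of the KJV table: (occurrences, number of distinct words).
KJV_HEAD = """
1 3947
2 1740
3 972
4 623
5 499
6 403
7 313
8 286
9 229
10 192
"""

def parse_table(text):
    """Parse whitespace-separated 'count number' lines into integer pairs."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            rows.append((int(fields[0]), int(fields[1])))
    return rows

def loglog_points(rows):
    """Convert (count, number) rows to (log10 count, log10 number) points."""
    return [(math.log10(c), math.log10(n)) for c, n in rows]

def fit_slope(points):
    """Least-squares slope of y against x for a list of (x, y) points."""
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

def plot_spectrum(rows, label):
    """Optional log-log scatter plot; requires matplotlib (an assumption)."""
    import matplotlib.pyplot as plt
    plt.loglog([c for c, _ in rows], [n for _, n in rows], "o", label=label)
    plt.xlabel("occurrences per word")
    plt.ylabel("number of distinct words")
    plt.legend()
    plt.show()

rows = parse_table(KJV_HEAD)
slope = fit_slope(loglog_points(rows))  # log(number) against log(occurrences)
```

Note the axes: this fits log (number of words) against log (frequency), so the number-frequency law as stated, with frequency on the vertical axis, corresponds to the reciprocal of this slope. For these ten rows the fitted slope comes out at roughly -1.3, so the sample only loosely follows the idealized value.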

KJV:
1	3947
2	1740
3	972
4	623
5	499
6	403
7	313
8	286
9	229
10	192
11	158
12	146
13	130
14	120
15	132
16	95
17	99
18	74
19	83
20	78
21	72
22	64
23	49
24	57
25	51
26	49
27	35
28	38
29	43
30	44
31	31
32	39
33	46
34	26
35	17
36	27
37	43
38	32
39	48
40	20
41	21
42	30
43	24
44	14
45	24
46	24
47	12
48	19
49	24
50	20
51	29
52	13
53	14
54	14
55	18
56	16
57	12
58	11
59	15
60	17
61	17
62	18
63	8
64	14
65	9
66	16
67	7
68	9
69	9
70	11
71	13
72	9
73	11
74	10
75	9
76	11
77	8
78	5
79	7
80	7
81	11
82	10
83	1
84	3
85	4
86	9
87	3
88	6
89	7
90	4
91	7
92	3
93	1
94	7
95	4
96	8
97	6
98	9
99	2
100	9
101	9
102	5
103	1
104	10
105	5
106	7
107	7
108	7
109	4
110	4
111	3
112	1
113	7
114	1
115	3
116	5
117	2
118	5
119	6
120	3
121	2
122	4
123	4
124	3
125	3
126	7
127	4
128	2
129	4
130	2
131	3
132	7
133	4
134	1
135	6
136	5
137	2
138	1
139	2
140	4
142	2
143	5
144	1
145	7
146	1
147	3
148	2
149	2
150	2
151	1
152	1
153	3
154	2
155	1
156	3
157	3
158	4
159	2
160	4
161	1
162	5
163	1
164	1
165	2
166	2
167	2
168	2
169	4
170	3
171	2
172	1
173	3
174	2
175	2
176	3
177	2
178	3
179	5
181	3
182	3
183	1
184	2
185	2
186	2
188	3
190	1
191	2
192	4
193	2
194	2
195	1
196	3
197	1
198	1
199	1
201	2
202	3
203	1
204	1
205	3
207	1
208	1
209	1
211	2
212	1
213	2
214	2
215	3
216	2
217	1
218	1
222	1
223	1
225	3
226	1
227	1
228	1
229	2
230	1
231	1
232	2
233	2
234	3
235	1
236	1
237	2
238	1
239	2
240	1
241	3
242	2
243	3
244	3
245	1
246	1
247	3
249	2
250	4
252	2
253	1
254	1
255	1
256	2
257	1
260	1
261	1
263	2
264	1
265	2
267	1
269	1
272	2
273	2
274	1
275	2
276	1
277	2
278	1
279	1
280	1
282	3
283	2
284	2
287	4
288	1
290	2
291	2
292	1
294	3
296	1
299	1
300	1
302	1
304	2
306	1
307	2
311	1
313	1
314	1
320	4
321	1
327	1
328	3
331	1
332	1
333	1
334	1
336	1
338	2
339	1
340	1
342	1
344	1
345	1
346	1
348	2
350	1
358	1
359	1
361	2
362	1
364	5
366	2
369	2
372	1
377	1
378	1
380	1
392	1
396	1
400	3
401	1
402	3
407	1
409	1
416	2
417	1
420	3
423	1
424	1
426	1
429	1
434	1
442	1
443	1
447	2
448	1
450	1
451	1
453	1
459	1
462	1
463	2
464	1
465	2
475	1
476	1
480	1
481	1
484	1
485	1
492	1
494	1
498	2
501	1
502	1
505	2
507	1
513	1
520	1
523	1
527	1
539	1
542	1
543	1
544	1
546	1
548	2
549	1
550	1
556	1
564	1
565	1
571	1
583	1
587	1
590	1
596	1
597	1
611	2
613	1
623	1
625	1
632	1
641	1
649	1
655	1
664	1
669	1
672	1
683	1
685	1
686	1
695	1
699	1
716	1
720	1
724	1
725	1
737	1
750	1
755	1
763	2
783	1
793	1
814	1
816	1
830	1
833	1
837	1
847	1
863	1
868	2
874	2
879	1
880	1
888	1
906	1
911	1
912	1
915	1
916	1
938	1
958	1
962	1
968	1
982	1
983	1
985	1
987	1
1006	1
1008	1
1027	1
1056	2
1064	1
1065	1
1070	1
1094	1
1121	1
1125	1
1127	1
1162	1
1172	1
1179	1
1209	1
1225	1
1236	1
1237	1
1262	1
1326	1
1356	1
1368	1
1393	1
1394	1
1400	1
1405	1
1445	1
1451	1
1466	1
1492	1
1511	1
1570	1
1595	1
1616	1
1667	1
1677	1
1689	1
1699	1
1718	1
1743	1
1769	1
1785	1
1795	1
1796	1
1821	1
1844	1
1969	1
1971	1
1995	1
2011	1
2015	1
2024	1
2026	1
2093	1
2143	1
2169	1
2264	1
2299	1
2380	1
2392	1
2540	1
2575	1
2617	1
2624	1
2735	1
2748	1
2772	1
2775	1
2785	1
2834	1
2950	1
3520	1
3642	1
3827	1
3836	1
3904	1
3946	1
3982	1
3992	1
3999	1
4096	1
4368	1
4413	1
4472	1
4522	1
4600	1
5474	1
5620	1
6012	1
6129	1
6430	1
6596	1
6659	1
6989	1
7013	1
7376	1
7964	1
8177	1
8473	1
8854	1
8971	1
8997	1
9838	1
10420	1
12667	1
12913	1
13562	1
34617	1
51696	1
63924	1

GR:
1	11469
2	4001
3	2239
4	1398
5	996
6	704
7	550
8	435
9	399
10	272
11	244
12	176
13	163
14	160
15	127
16	116
17	110
18	92
19	90
20	88
21	74
22	59
23	74
24	49
25	64
26	45
27	55
28	33
29	40
30	43
31	27
32	33
33	40
34	30
35	27
36	23
37	17
38	24
39	15
40	25
41	27
42	22
43	19
44	16
45	12
46	21
47	14
48	11
49	9
50	8
51	17
52	18
53	13
54	19
55	7
56	8
57	7
58	13
59	12
60	7
61	12
62	10
63	6
64	9
65	7
66	7
67	8
68	11
69	11
70	5
71	9
72	12
73	4
74	6
75	5
76	6
77	6
78	7
79	6
80	2
81	4
82	7
83	8
84	3
85	2
86	3
87	6
88	3
89	5
90	5
91	4
92	8
93	6
94	3
95	1
96	3
97	3
98	3
99	5
100	4
101	4
102	3
103	3
104	5
106	1
107	1
108	4
109	2
110	3
111	2
112	5
113	3
114	1
115	3
116	4
117	1
118	3
119	6
121	1
122	3
123	3
124	3
125	3
127	1
128	3
129	1
130	1
132	3
133	2
134	2
135	2
136	3
137	1
138	1
140	4
141	3
142	2
143	2
146	3
147	3
148	1
149	1
150	1
153	3
154	3
156	1
157	3
158	1
159	1
160	2
161	1
162	3
163	2
164	1
166	1
168	3
170	3
171	1
172	1
174	2
175	1
176	1
177	3
178	2
180	1
181	2
182	1
184	2
185	1
186	1
187	1
188	2
190	3
192	1
193	1
194	1
195	1
197	1
198	3
200	1
203	1
205	1
207	3
208	1
209	1
211	2
212	1
213	1
215	1
216	2
219	1
221	1
224	1
226	1
227	2
228	1
229	1
231	1
232	1
233	2
235	1
241	2
242	1
244	1
245	2
250	2
252	1
258	1
260	1
262	1
264	1
266	1
268	1
275	1
277	1
288	1
291	1
294	1
296	1
299	1
304	1
306	1
307	1
308	1
312	1
314	1
316	1
317	1
323	1
337	1
340	1
345	1
348	1
355	1
356	1
360	1
364	1
365	1
367	1
369	1
370	1
374	1
384	1
392	3
393	2
396	1
412	1
415	1
418	1
420	1
424	1
432	1
434	1
442	2
447	1
449	1
454	2
459	2
462	1
467	2
477	1
485	1
486	1
491	1
497	1
498	1
503	1
518	1
524	1
525	2
536	1
545	1
564	1
583	1
585	1
586	1
590	1
621	1
669	1
678	1
680	1
685	1
697	1
728	1
729	1
745	1
753	1
754	1
784	1
787	1
794	1
852	1
879	1
891	1
935	1
980	1
1000	1
1001	1
1011	1
1105	1
1129	1
1135	1
1158	1
1164	1
1173	1
1192	1
1200	1
1232	1
1250	1
1300	1
1305	1
1307	1
1313	1
1370	1
1414	1
1493	1
1502	1
1598	1
1613	1
1692	1
1704	1
1714	1
1807	1
1833	1
1973	1
2053	1
2119	1
2386	1
2393	1
2474	1
2784	1
2847	1
2987	1
3297	1
3812	1
3883	1
4374	1
6493	1
7787	1
7794	1
8197	1
9313	1
19442	1

Yours truly,
Glenn Scheper
http://home.earthlink.net/~glenn_scheper/
glenn_scheper + at + earthlink.net
Copyleft(!) Forward freely.



