GRGR: Zipf, logs, Napier / 1,6 Roger and Jessica

Michael Bailey michael.lee.bailey at gmail.com
Mon Dec 5 11:33:23 CST 2005


 Glenn Scheper explained:
> 1. The rank-frequency law.
>  The procedure to estimate this relation is
> very simple: the words in a text are sorted by decreasing frequency
> and a rank number is assigned to each word. For words with the same
> frequency, the sub-sorting and ranking is arbitrary.
>

So that gives you two co-ordinates:
a) how many times a word appears
b) its ranking (popularity)

Then you say the sub-sorting and ranking of ties is arbitrary.
I guess that would be because Rank-Frequency deals with the top of
the list, where ties aren't as common, because the largeness
of the counts makes them less probable, right?
Our data points on this graph are individual words.
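
A minimal Python sketch of the ranking procedure described above, assuming the text has already been split into a list of words. The name rank_frequency and the sample sentence are made up for illustration. Ties are broken alphabetically here only to make the result repeatable, since the law leaves their order arbitrary.

```python
from collections import Counter

def rank_frequency(words):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(words)
    # Sort by decreasing count; alphabetical order among ties is an
    # arbitrary choice made for this sketch.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

# Hypothetical sample text, not taken from GR or KJV.
sample = "the cat and the dog and the bird".split()
table = rank_frequency(sample)
for rank, word, count in table:
    print(rank, word, count)
```

Each row is one data point: the rank gives one co-ordinate and the count gives the other.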

> The plot of log (frequency) versus log (rank) approximates a straight line
> of slope -1.
>

Why do we bring "log" into the picture?  I guess the graph would get
too big otherwise, wouldn't it?  Log smooshes it down so it can fit on
the page.
I used to have a slide rule, never got great at it, but it does give
you sort of a feel for logarithms.  (mem check: I think the inventor
of logarithms was named Napier - just went and entered "John Napier
logarithm" as my first search in "surf4me.exe" on the Windows computer)

A straight line with a slope of -1 goes down 1 for each step rightward
on the x axis.
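
A short worked equation may help show where the logs and the slope of -1 come from. It assumes the idealized form of the law, in which frequency f is inversely proportional to rank r. The constant C is just a label introduced here, not something from the quoted post.

```latex
f(r) = \frac{C}{r}
\quad\Longrightarrow\quad
\log f = \log C - \log r
```

Under that assumption, plotting log f against log r gives a straight line with slope -1 and intercept log C. Without the logs the same relation is a steep curve, so the logs do more than shrink the graph: they turn the curve into a line.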

> 2. The number-frequency law.
>
> The plot of log (frequency) versus log (number of words with the
> same frequency) approximates a straight line of slope -0.5.
>
> While the rank-frequency law tends to occur for the high frequency words
> (although not necessarily for the first few ranking positions),
> the number-frequency law is observed for the low frequency words.
>

...because there are a lot of ties...using rank-frequency on those
would give an asymptotic tail to the straight line...so
number-frequency zooms in on the tail?
  a data point here is "a group of words that appear a certain # of times"
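
A rough sketch of how the data points for the number-frequency law could be built, assuming a plain list of words as input. The name number_frequency and the sample sentence are invented for illustration.

```python
from collections import Counter

def number_frequency(words):
    """Return sorted (frequency, number of distinct words with that frequency)."""
    word_counts = Counter(words)
    # Count the counts: how many distinct words share each frequency.
    groups = Counter(word_counts.values())
    return sorted(groups.items())

# Hypothetical sample text, not taken from GR or KJV.
sample = "the cat and the dog and the bird".split()
pairs = number_frequency(sample)
print(pairs)
```

Here three words tie at a frequency of 1, so they collapse into a single data point instead of three arbitrarily ordered ranks.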

> I'm not sure I got the idea, but let's try to compare
> GR.TXT with KJV.TXT, after converting all non-alphas
> to spaces, and after changing uppercase to lowercase.

that will make 2 words out of hyphenated words, and eliminate
sentence-starters from being considered different words - also will
eliminate the distinction of proper nouns, so for instance (Brigadier)
Pudding will be equivalent to pudding
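
A small sketch of that clean-up step, assuming "non-alphas" means anything other than the ASCII letters a-z and A-Z. The function name normalize and the sample phrase are made up here, and this is not necessarily how the original counts were produced.

```python
def normalize(text):
    """Lowercase ASCII letters, turn everything else into spaces, then split."""
    cleaned = "".join(
        ch.lower() if ch.isascii() and ch.isalpha() else " "
        for ch in text
    )
    return cleaned.split()

# Hypothetical sample phrase.
words = normalize("Brigadier Pudding's well-made pudding.")
print(words)
```

The hyphenated word splits in two, both spellings of "pudding" merge, and the possessive leaves a stray "s" behind as a word of its own.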

>
> Whole text word count:
>
>  791442  KJV
>  340838  GR
>
> Number of distinct words:
>
>  12558  KJV
>  25490  GR
>
> Words exceeding 1% of whole text word count:
>

that's the dividing line for the rank-frequency count

> KJV:
>   63924 the
>   51696 and
>   34617 of
>   13562 to
>   12913 that
>   12667 in
>   10420 he
>    9838 shall
>    8997 unto
>    8971 for
>    8854 i
>    8473 his
>    8177 a
>    7964 lord
>
> GR:
>   19442 the
>    9313 of
>    8197 a
>    7794 and
>    7787 to
>    6493 in
>    4374 s -- (from 's)
>    3883 it
>    3812 he
>
-------------------------------------------------------------
so if I were to graph this, one axis would be the log of the rank
for KJV, log(1) thru log(14)
for GR, log(1) thru log(9)

and the other axis would be the log of the frequency
that's why they use different scales...gotta read up on Napier and logs,
maybe I can graph this  - I need to figure out which axis to use for
which and what kind of scale to choose
--------------------------------------------------
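
A hedged sketch of checking the slope without drawing anything, using only the top-of-list counts quoted above. It assumes log(rank) goes on the horizontal axis and log(frequency) on the vertical one, reading "log (frequency) versus log (rank)" as "y versus x". The helper loglog_slope is a hypothetical name, and an ordinary least-squares fit is one choice among several.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), ranks from 1."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Counts copied from the lists quoted in this post, in rank order.
kjv = [63924, 51696, 34617, 13562, 12913, 12667, 10420,
       9838, 8997, 8971, 8854, 8473, 8177, 7964]
gr = [19442, 9313, 8197, 7794, 7787, 6493, 4374, 3883, 3812]

kjv_slope = loglog_slope(kjv)
gr_slope = loglog_slope(gr)
print(f"KJV slope: {kjv_slope:.2f}")
print(f"GR slope:  {gr_slope:.2f}")
```

With only 14 and 9 points, and the quoted caveat that the first few ranks may not follow the law, the fitted slopes are a rough check at best. The natural log is used here, but any base gives the same slope.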

> Counts of the word-frequency-count columns: E.g.,
> at one end of KJV list, the count 1 occurred 3947 x;

there were 3,947 words that only occurred once

> at other end of KJV list the count 63924 occurred 1 x.

there was one word ("the") that appeared 63,924 times.

> (Anybody understand how / want to graph these?)

 after I bone up on logs a bit (-:
Will it give a straight line or a bow?
>

--------------------
Roger and Jessica - brought together by the War, Jessica's fear of
strangers overcome by her greater fear of the bomb, Roger backing over
her bicycle; now on their way to a "certain high-class
vivisectionist", passing a bomb scene without stopping to help,
pulling away from the exhausting group drama to make a better drama of
their own -
the dubious morality of this within the context of "the Home Front"
vies with their own natural instincts (we all like to see young people
get together) - and casts doubt on the legitimacy of the propaganda.

Given enough data, Roger might be able to graph the numbers of
love-commitments made, changed, shaken, or ended by the War, and
compare them to the numbers in peacetime.

How does being ineluctably a statistic jibe with being a person with emotions?

Thesis: The Stakhanovite demands made on a warring populace at some
point ("n") become unbearable.  Slothrop "used to care" and Roger and
Jessica's body-encounter count n++ has been incremented past the point
where they can take any satisfaction in performing according to the
propaganda.

If one believes in the War (and after all, WWII was a "just war" on
the part of the Allies, wasn't it?) then are they falling away from
virtue?

If one doesn't believe in the War, then doesn't that call into
question the social hierarchy, the order of things, and the ease of
delegation of responsibility to "the authorities"?  Roger and Jessica
recap some of Winston Smith and Julia's dilemma.  Yet they aren't
overtly political, and Jessica (apparently) never will be.



--
"Acceptance, forgiveness, love - now that's a philosophy of life!"
-Woody Allen, as Broadway Danny Rose



