描述

Retrieving a Line at Random from a File of Unknown Size

读取未知大小文件的随机一行

问题 Problem

有文件，不清楚大小(但是可能非常大)，需要忽略文件本身，只读取数据的随机一行。

解决 Solution

We do need to read the whole file, but we don't have to read it all at once:

需要读取文件的全部数据,但不是一次全部读出:

import random

def randomLine(file_object):
    "顺序读取文件内容,取文件的随机的一行"
    lineNum = 0
    selected_line = ''

    while 1:
        aLine = file_object.readline(  )
        if not aLine: break
        lineNum = lineNum + 1
        # 本行有多大可能性是文件最后一行?          
        if random.uniform(0,lineNum)<1:              #译注：这里请思考1分钟(天才除外 :)                   
            selected_line = aLine
    file_object.close(  )
    return selected_line

#译注: 算法的解释见讨论部分

讨论 Discussion

当然,更明了的方法是这样的:

random.choice(file_object.readlines(  ))

但是,这需要将全部文件内容读入内存,对确实很大的文件可能有问题。

解决部分中算法的理论依据不是很明显，但是也不是太深奥: 当读取到文件第Ｎ行时，当前行(即变量selected_line指向的行)被随机取得的可能性是1/N。容易看出算法选择当前最后读取一行的可能性是对的，因为此处我们以可能性1.0/lineNum的选择最后一行。

由归纳法可以证明算法的合理性：如果读取第N-1行时，选择N-1行的可能性是对的，那么显然对于第N行时算法也是对的。由于(仅读取到第一行)选择第一行的可能性是1, 因此，算法的随机选择处理直到第N行都是对的。

Of course, the same technique holds for a uniform-probability selection from any finite sequence that, for whatever reason, is made available only one item at a time. But, apart from, for example, selecting a random word from /usr/dict/words, there aren't all that many practical applications of this pretty theorem.

同样的技巧对于从有限序列中，应用均匀分布可能性，一次处理一个元素(原因未知)的随机选择一个元素也是适用的。

-  ⇤ ← Revision 1 as of 2004-09-21 05:42:12 → 
  Size: 1207
  Editor: 61
  Comment:
+   ← Revision 6 as of 2004-09-21 06:28:20 → ⇥
  Size: 2640
  Editor: 61
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 12:
-Line 20:
+Line 21:
-需要读取文件的全部数据,但是不是一次全部读出:
+需要读取文件的全部数据,但不是一次全部读出:
-Line 33:
+Line 35:
-        # How likely is it that this is the last line of the file?
        if random.uniform(0,lineNum)<1:  #译注: 这里是随机的吗?
+        # 本行有多大可能性是文件最后一行?          
        if random.uniform(0,lineNum)<1:              #译注：这里请思考1分钟(天才除外 :)
-Line 38:
+Line 40:
+}}}
-Line 39:
+Line 42:
+#译注: 算法的解释见讨论部分
-Line 40:
+Line 44:
-}}}
-Line 43:
+Line 46:
-...
+当然,更明了的方法是这样的:
{{{
random.choice(file_object.readlines(  ))
}}}
但是,这需要将全部文件内容读入内存,对确实很大的文件可能有问题。

解决部分中算法的理论依据不是很明显，但是也不是太深奥: 当读取到文件第'''Ｎ'''行时，当前行(即变量'''selected_line'''指向的行)被随机取得的可能性是'''1/N'''。 容易看出算法选择当前最后读取一行的可能性是对的，因为此处我们以可能性'''1.0/lineNum'''的选择最后一行。

由归纳法可以证明算法的合理性：如果读取第'''N-1'''行时，选择'''N-1'''行的可能性是对的，那么显然对于第'''N'''行时算法也是对的。 由于(仅读取到第一行)选择第一行的可能性是'''1''', 因此，算法的随机选择处理直到第'''N'''行都是对的。   

Of course, the same technique holds for a uniform-probability selection from any finite sequence that, for whatever reason, is made available only one item at a time. But, apart from, for example, selecting a random word from /usr/dict/words, there aren't all that many practical applications of this pretty theorem. 

同样的技巧对于从有限序列中，应用均匀分布可能性，一次处理一个元素(原因未知)的随机选择一个元素也是适用的。

Diff for "PyCkBk-4-6"

描述

问题 Problem

解决 Solution

讨论 Discussion

参考 See Also