描述

Retrieving a Line at Random from a File of Unknown Size

读取未知大小文件的随机一行

问题 Problem

有文件，不清楚大小(但是可能非常大)，需要忽略文件本身，只读取数据的随机一行。

解决 Solution

We do need to read the whole file, but we don't have to read it all at once:

需要读取文件的全部数据,但不是一次全部读出:

import random

def randomLine(file_object):
    "顺序读取文件内容,取文件的随机的一行"
    lineNum = 0
    selected_line = ''

    while 1:
        aLine = file_object.readline(  )
        if not aLine: break
        lineNum = lineNum + 1
        # 本行有多大可能性是文件最后一行?          
        if random.uniform(0,lineNum)<1:              #译注：这里请思考1分钟(天才除外 :)                   
            selected_line = aLine
    file_object.close(  )
    return selected_line

#译注: 算法的解释见讨论部分

讨论 Discussion

当然,更明了的方法是这样的:

random.choice(file_object.readlines(  ))

但是,这需要将全部文件内容读入内存,对确实很大的文件可能有问题。

This recipe works thanks to an unobvious but not terribly deep theorem: when we have seen the first N lines in the file, there is a probability of exactly 1/N that each of them is the one selected so far (i.e., the one to which the selected_line variable refers at this point). This is easy to see for the last line (the one just read into the aLine variable), because we explicitly choose it with a probability of 1.0/lineNum. The general validity of the theorem follows by induction. If it was true after the first N-1 lines were read, it's clearly still true after the Nth one is read. As we select the very first line with probability 1, the limit case of the theorem clearly does hold when N=1.

上面算法的理论依据不是很明显，但是也不是太深奥: 当读取到文件第Ｎ行时，当前行(即变量selected_line指向的行)被随机取得的可能性是1/N。容易看出算法选择当前最后读取一行的可能性是对的，因为此处我们以可能性1.0/lineNum的选择最后一行。

由归纳法可以证明算法的合理性：如果读取第N-1行时，选择N-1行的可能性是对的，那么显然对于第N行时算法也是对的。由于(仅读取到)选择第一行的可能性是1,因此，算法的随机选择处理直到第N行都是对的。

...

-  ⇤ ← Revision 4 as of 2004-09-21 06:09:20 → 
  Size: 2511
  Editor: 61
  Comment:
+   ← Revision 5 as of 2004-09-21 06:19:42 → ⇥
  Size: 2850
  Editor: 61
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 36:
-        if random.uniform(0,lineNum)<1:
+        if random.uniform(0,lineNum)<1:              #译注：这里请思考1分钟(天才除外 :)
 Line 54:
-上面算法的理论依据不是很明显，但是也不是太深奥: 当读取到文件第'''Ｎ'''行时，当前行(即变量'''selected_line'''指向的行)被随机取得的可能性是'''1/N'''。 容易算法对于最后一行的可能性是对的，因为此处我们以可能性'''1.0/lineNum'''的选择最后一行。
+上面算法的理论依据不是很明显，但是也不是太深奥: 当读取到文件第'''Ｎ'''行时，当前行(即变量'''selected_line'''指向的行)被随机取得的可能性是'''1/N'''。 容易看出算法选择当前最后读取一行的可能性是对的，因为此处我们以可能性'''1.0/lineNum'''的选择最后一行。
 Line 56:
-由归纳法可以证明算法的合理性：
+由归纳法可以证明算法的合理性：如果读取第'''N-1'''行时，选择'''N-1'''行的可能性是对的，那么显然对于第'''N'''行时算法也是对的。由于(仅读取到)选择第一行的可能性是'''1''',因此，算法的随机选择处理直到第'''N'''行都是对的。

Diff for "PyCkBk-4-6"

描述

问题 Problem

解决 Solution

讨论 Discussion

参考 See Also