描述

Retrieving a Line at Random from a File of Unknown Size

读取未知大小文件的随机一行

问题 Problem

有文件，不清楚大小(但是可能非常大)，需要忽略文件本身，只读取数据的随机一行。

解决 Solution

We do need to read the whole file, but we don't have to read it all at once:

需要读取文件的全部数据,但不是一次全部读出:

import random

def randomLine(file_object):
    "顺序读取文件内容,取文件的随机的一行"
    lineNum = 0
    selected_line = ''

    while 1:
        aLine = file_object.readline(  )
        if not aLine: break
        lineNum = lineNum + 1
        # 本行有多大可能性是文件最后一行?          
        if random.uniform(0,lineNum)<1:                   
            selected_line = aLine
    file_object.close(  )
    return selected_line

#译注: 算法的解释见讨论部分

讨论 Discussion

当然,更明了的方法是这样的:

random.choice(file_object.readlines(  ))

但是,这需要将全部文件内容读入内存,对确实很大的文件可能有问题。

This recipe works thanks to an unobvious but not terribly deep theorem: when we have seen the first N lines in the file, there is a probability of exactly 1/N that each of them is the one selected so far (i.e., the one to which the selected_line variable refers at this point). This is easy to see for the last line (the one just read into the aLine variable), because we explicitly choose it with a probability of 1.0/lineNum. The general validity of the theorem follows by induction. If it was true after the first N-1 lines were read, it's clearly still true after the Nth one is read. As we select the very first line with probability 1, the limit case of the theorem clearly does hold when N=1.

上面算法的理论依据不是很明显，但是也不是太深奥: 当读取到文件第Ｎ行时，当前行(即变量selected_line指向的行)被随机取得的可能性是1/N。容易看出选择最后一行的可能性是1.0/lineNum

...

-  ⇤ ← Revision 2 as of 2004-09-21 05:52:40 → 
  Size: 1417
  Editor: 61
  Comment:
+   ← Revision 3 as of 2004-09-21 06:03:08 → ⇥
  Size: 2398
  Editor: 61
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 21:
-Line 41:
+Line 42:
-#译注: 算法的解释见讨论
+#译注: 算法的解释见讨论部分
-Line 51:
+Line 52:
+This recipe works thanks to an unobvious but not terribly deep theorem: when we have seen the first N lines in the file, there is a probability of exactly 1/N that each of them is the one selected so far (i.e., the one to which the selected_line variable refers at this point). This is easy to see for the last line (the one just read into the aLine variable), because we explicitly choose it with a probability of 1.0/lineNum. The general validity of the theorem follows by induction. If it was true after the first N-1 lines were read, it's clearly still true after the Nth one is read. As we select the very first line with probability 1, the limit case of the theorem clearly does hold when N=1. 

上面算法的理论依据不是很明显，但是也不是太深奥: 当读取到文件第'''Ｎ'''行时，当前行(即变量'''selected_line'''指向的行)被随机取得的可能性是'''1/N'''。 容易看出选择最后一行的可能性是'''1.0/lineNum'''

Diff for "PyCkBk-4-6"

描述

问题 Problem

解决 Solution

讨论 Discussion

参考 See Also