Build an ASCII chart of the most commonly used words in a given text

The challenge:

Build an ASCII chart of the most commonly used words in a given text.

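The rules:

  • Only accept a-z and A-Z (alphabetic characters) as part of a word.
  • Ignore casing (She == she for our purpose).
  • Ignore the following words (quite arbitrary, I know): the, and, of, to, a, i, it, in, or, is
  • Clarification: considering don't: this would be taken as 2 different 'words' in the ranges a-z and A-Z: (don and t).

  • Optionally (it's too late to be formally changing the specifications now) you may choose to drop all single-letter 'words' (this could potentially make for a shortening of the ignore list too).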
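
For reference, a minimal Python sketch of the word rules above. It is not one of the golfed entries; the names STOP_WORDS and count_words and the drop_single_letters flag are made up for illustration.

```python
import re
from collections import Counter

# Ignore list copied from the rules above.
STOP_WORDS = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}

def count_words(text, drop_single_letters=False):
    """Count runs of a-z/A-Z, ignoring case and the ignore list."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    if drop_single_letters:
        # Optional relaxation from the last rule.
        for w in [w for w in counts if len(w) == 1]:
            del counts[w]
    return counts

# "don't" splits into "don" and "t"; "it" is on the ignore list.
print(count_words("She said: don't do it. she SAID so!").most_common(3))
```

With drop_single_letters=True the stray "t" from "don't" disappears as well, which is why the relaxed rule lets some entries shorten their ignore list.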

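Parse a given text (read a file specified via command line arguments or piped in; presume us-ascii) and build a word frequency chart with the following characteristics:

  • Display the chart (also see the example below) for the 22 most common words (ordered by descending frequency).
  • The bar width represents the number of occurrences (frequency) of the word (proportionally). Append one space and print the word.
  • Make sure these bars (plus space-word-space) always fit: bar + [space] + word + [space] should always be <= 80 characters (make sure you account for possible differing bar and word lengths: e.g. the second most common word could be a lot longer than the first while not differing so much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent).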
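
The fitting rule can be sketched like this in Python (not golfed, and only one possible reading of the spec). It assumes each row is laid out as "|" + bar + "| " + word + a trailing space, so a word of length n leaves 76 - n columns for its bar. render_chart is a hypothetical helper name, and Fraction is used only to keep the scaling exact.

```python
from fractions import Fraction

def render_chart(pairs, width=80):
    """pairs: (word, count) tuples, already sorted by descending count."""
    # Row layout assumed here: "|" + bar + "| " + word + " ",
    # so a bar may use at most width - 4 - len(word) columns.
    scale = min(Fraction(width - 4 - len(word), count) for word, count in pairs)
    bars = ["_" * int(count * scale) for _, count in pairs]
    lines = [" " + bars[0]]  # top edge that closes the first bar
    for (word, _), bar in zip(pairs, bars):
        lines.append("|" + bar + "| " + word + " ")
    return lines

# Made-up counts: the second word is long, so it limits the scale.
for line in render_chart([("she", 50), ("superlongword", 48), ("so", 10)]):
    print(line)
```

Taking the minimum over every word, rather than looking only at the most frequent one, is what separates the "strict" entries from the "relaxed" ones.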

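An example:

The text for the example can be found here (Alice's Adventures in Wonderland, by Lewis Carroll).

This specific text would yield the following chart: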

_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so




For your information: these are the frequencies the above chart is built upon:

[('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358), ('that
', 330), ('as', 274), ('her', 248), ('with', 227), ('at', 227), ('s', 219), ('t'
, 218), ('on', 204), ('all', 200), ('this', 181), ('for', 179), ('had', 178), ('
but', 175), ('be', 167), ('not', 166), ('they', 155), ('so', 152)]

A second example (to check if you implemented the complete spec): replace every occurrence of you in the linked Alice in Wonderland file with superlongstringstring:

________________________________________________________________
|________________________________________________________________| she
|_______________________________________________________| superlongstringstring
|_____________________________________________________| said
|______________________________________________| alice
|________________________________________| was
|_____________________________________| that
|______________________________| as
|___________________________| her
|_________________________| with
|_________________________| at
|________________________| s
|________________________| t
|______________________| on
|_____________________| all
|___________________| this
|___________________| for
|___________________| had
|__________________| but
|_________________| be
|_________________| not
|________________| they
|________________| so

The winner:

Shortest solution (by character count, per language). Have fun!


Edit: Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):

Language          Relaxed  Strict
=========         =======  ======
GolfScript          130     143
Perl                        185
Windows PowerShell  148     199
Mathematica                 199
Ruby                185     205
Unix Toolchain      194     228
Python              183     243
Clojure                     282
Scala                       311
Haskell                     333
Awk                         336
R                   298
Javascript          304     354
Groovy              321
Matlab                      404
C#                          422
Smalltalk           386
PHP                 450
F#                          452
TSQL                483     507

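The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws |____| bars, closes the first bar on top with a ____ line, accounts for the possibility of long words with high frequency, etc.). "Relaxed" means some liberties were taken to shorten the solution.

Only solutions shorter than 500 characters are included. The list of languages is sorted by the length of the "strict" solution. "Unix Toolchain" is used to signify various solutions that use a traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).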


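Perl, 237 229 209 chars

(Updated again to beat the Ruby version with more dirty golf tricks, replacing split/[^a-z]/,lc with lc=~/[a-z]+/g, and eliminating a check for the empty string in another place. These were inspired by the Ruby version, so credit where credit is due.)

Update: now with Perl 5.10! Replace print with say, and use ~~ to avoid a map. This has to be invoked on the command line as perl -E '<one-liner>' alice.txt. Since the entire script is on one line, writing it as a one-liner shouldn't present any difficulty :).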

 @s=qw/the and of to a i it in or is/;$c{$_}++foreach grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>;@s=sort{$c{$b}<=>$c{$a}}keys%c;$f=76-length$s[0];say" "."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "foreach@s[0..21];

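Note that this version normalizes for case. This doesn't shorten the solution any, since removing ,lc (for lower-casing) requires you to add A-Z to the split regex, so it's a wash.

If you're on a system where a newline is one character and not two, you can shorten this by another two chars by using a literal newline in place of \n. However, I haven't written the above sample that way, since it's "clearer" (ha!) that way.


Here is a mostly correct, but not remotely short enough, perl solution: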

use strict;
use warnings;


my %short = map { $_ => 1 } qw/the and of to a i it in or is/;
my %count = ();


$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-zA-Z]/ } (<>);
my @sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
my $widest = 76 - (length $sorted[0]);


print " " . ("_" x $widest) . "\n";
foreach (@sorted)
{
my $width = int(($count{$_} / $count{$sorted[0]}) * $widest);
print "|" . ("_" x $width) . "| $_ \n";
}

The following is about as short as it can get while remaining relatively readable.

%short = map { $_ => 1 } qw/the and of to a i it in or is/;
%count;


$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-z]/, lc } (<>);
@sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
$widest = 76 - (length $sorted[0]);


print " " . "_" x $widest . "\n";
print"|" . "_" x int(($count{$_} / $count{$sorted[0]}) * $widest) . "| $_ \n" foreach @sorted;

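C# - 510 451 436 446 434 426 422 chars (minified)

Not that short, but now probably correct! Note, the previous version did not show the first line of the bars, did not scale the bars correctly, downloaded the file instead of getting it from stdin, and did not include all the required C# verbosity. You could easily shave many strokes if C# didn't need so much extra crap. Maybe Powershell could do better.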

using C=System.Console;   // alias for Console
using System.Linq;  // for Split, GroupBy, Select, OrderBy, etc.


class Class // must define a class
{
static void Main()  // must define a Main
{
// split into words
var allwords = System.Text.RegularExpressions.Regex.Split(
// convert stdin to lowercase
C.In.ReadToEnd().ToLower(),
// eliminate stopwords and non-letters
@"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+")
.GroupBy(x => x)    // group by words
.OrderBy(x => -x.Count()) // sort descending by count
.Take(22);   // take first 22 words


// compute length of longest bar + word
var lendivisor = allwords.Max(y => y.Count() / (76.0 - y.Key.Length));


// prepare text to print
var toPrint = allwords.Select(x=>
new {
// remember bar pseudographics (will be used in two places)
Bar = new string('_',(int)(x.Count()/lendivisor)),
Word=x.Key
})
.ToList();  // convert to list so we can index into it


// print top of first bar
C.WriteLine(" " + toPrint[0].Bar);
toPrint.ForEach(x =>  // for each word, print its bar and the word
C.WriteLine("|" + x.Bar + "| " + x.Word));
}
}

422 chars with lendivisor inlined (which makes it 22 times slower) in the form below (newlines used in place of select spaces):

using System.Linq;using C=System.Console;class M{static void Main(){var
a=System.Text.RegularExpressions.Regex.Split(C.In.ReadToEnd().ToLower(),@"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+").GroupBy(x=>x).OrderBy(x=>-x.Count()).Take(22);var
b=a.Select(x=>new{p=new string('_',(int)(x.Count()/a.Max(y=>y.Count()/(76d-y.Key.Length)))),t=x.Key}).ToList();C.WriteLine(" "+b[0].p);b.ForEach(x=>C.WriteLine("|"+x.p+"| "+x.t));}}
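
Why inlining costs roughly a factor of 22: the maximum over all the words is evaluated again for every row instead of once. A small Python illustration of hoisting versus recomputing, using made-up names (divisor, calls) and three of the words from the frequency table in the question:

```python
pairs = [("she", 553), ("you", 481), ("said", 462)]
calls = 0

def divisor(pairs):
    """Counts per available column, maximised over all words (like lendivisor)."""
    global calls
    calls += 1
    return max(count / (76.0 - len(word)) for word, count in pairs)

# Hoisted: the maximum is computed once.
d = divisor(pairs)
hoisted = [int(count / d) for _, count in pairs]
hoisted_calls = calls

# Inlined: the maximum is recomputed for every row.
calls = 0
inlined = [int(count / divisor(pairs)) for _, count in pairs]
inlined_calls = calls

print(hoisted == inlined, hoisted_calls, inlined_calls)
```

The results are identical; only the number of passes over the word list changes, which is the trade the golfed version makes for fewer characters.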

F#, 452 chars

Straightforward: get a sequence a of word-count pairs, find the best word-count-per-column multiplier k, then print results.

let a=
stdin.ReadToEnd().Split(" .?!,\":;'\r\n".ToCharArray(),enum 1)
|>Seq.map(fun s->s.ToLower())|>Seq.countBy id
|>Seq.filter(fun(w,n)->not(set["the";"and";"of";"to";"a";"i";"it";"in";"or";"is"].Contains w))
|>Seq.sortBy(fun(w,n)-> -n)|>Seq.take 22
let k=a|>Seq.map(fun(w,n)->float(78-w.Length)/float n)|>Seq.min
let u n=String.replicate(int(float(n)*k)-2)"_"
printfn" %s "(u(snd(Seq.nth 0 a)))
for(w,n)in a do printfn"|%s| %s "(u n)w

Example (I have different freq counts than you, unsure why):

% app.exe < Alice.txt


_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|___________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| t
|____________________________| s
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| had
|______________________| for
|_____________________| but
|_____________________| be
|____________________| not
|___________________| they
|__________________| so

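Gawk -- 336 (originally 507) characters

(after fixing the output formatting; fixing the contractions thing; tweaking; tweaking again; removing a wholly unnecessary sorting step; tweaking yet again; and again (oops, this one broke the formatting); tweaking some more; taking up Matt's challenge, I desperately tweaked some more; found another place to save a few, but gave two back to fix the bar length bug)

I am momentarily ahead of Matt's JavaScript solution counter-challenge! ;) And AKX's python.

The problem seems to call for a language that implements native associative arrays, so of course I've chosen one with a horribly deficient set of operators on them. In particular, you cannot control the order in which awk offers up the elements of a hash map, so I repeatedly scan the whole map to find the currently most numerous item, print it and delete it from the array.

It is all terribly inefficient, and with all the golfifications I've made it has gotten to be pretty awful as well.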
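
The scan-print-delete loop described above, sketched in Python for comparison. top_n_by_rescanning is a made-up name, and unlike the awk code this version works on a copy and stops early when fewer than n words exist.

```python
def top_n_by_rescanning(counts, n=22):
    """Repeatedly scan the whole map for the largest count, emit it, delete it."""
    remaining = dict(counts)  # work on a copy, leave the caller's map alone
    result = []
    for _ in range(min(n, len(remaining))):
        best = max(remaining, key=remaining.get)  # full scan each round
        result.append((best, remaining.pop(best)))
    return result

print(top_n_by_rescanning({"she": 553, "you": 481, "said": 462, "so": 152}, n=3))
```

Each round is a full pass over the map, so the cost grows with 22 times the number of distinct words, which is fine for a golf entry and avoids needing a sort at all.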

Minified:

{gsub("[^a-zA-Z]"," ");for(;NF;NF--)a[tolower($NF)]++}
END{split("the and of to a i it in or is",b," ");
for(w in b)delete a[b[w]];d=1;for(w in a){e=a[w]/(78-length(w));if(e>d)d=e}
for(i=22;i;--i){e=0;for(w in a)if(a[w]>e)e=a[x=w];l=a[x]/d-2;
t=sprintf(sprintf("%%%dc",l)," ");gsub(" ","_",t);if(i==22)print" "t;
print"|"t"| "x;delete a[x]}}

Line breaks are for clarity only: they are not necessary and should not be counted.


Output:

$ gawk -f wordfreq.awk.min < 11.txt
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so
$ sed 's/you/superlongstring/gI' 11.txt | gawk -f wordfreq.awk.min
______________________________________________________________________
|______________________________________________________________________| she
|_____________________________________________________________| superlongstring
|__________________________________________________________| said
|__________________________________________________| alice
|____________________________________________| was
|_________________________________________| that
|_________________________________| as
|______________________________| her
|___________________________| with
|___________________________| at
|__________________________| s
|__________________________| t
|________________________| on
|________________________| all
|_____________________| this
|_____________________| for
|_____________________| had
|____________________| but
|___________________| be
|___________________| not
|__________________| they
|_________________| so

Readable; 633 characters (originally 949):

{
gsub("[^a-zA-Z]"," ");
for(;NF;NF--)
a[tolower($NF)]++
}
END{
# remove "short" words
split("the and of to a i it in or is",b," ");
for (w in b)
delete a[b[w]];
# Find the bar ratio
d=1;
for (w in a) {
e=a[w]/(78-length(w));
if (e>d)
d=e
}
# Print the entries highest count first
for (i=22; i; --i){
# find the highest count
e=0;
for (w in a)
if (a[w]>e)
e=a[x=w];
# Print the bar
l=a[x]/d-2;
# make a string of "_" the right length
t=sprintf(sprintf("%%%dc",l)," ");
gsub(" ","_",t);
if (i==22) print" "t;
print"|"t"| "x;
delete a[x]
}
}

*sh (+curl), partial solution

This is incomplete, but for the hell of it, here's the word-frequency counting half of the problem in 192 bytes:

curl -s http://www.gutenberg.org/files/11/11.txt|sed -e 's@[^a-z]@\n@gi'|tr '[:upper:]' '[:lower:]'|egrep -v '(^[^a-z]*$|\b(the|and|of|to|a|i|it|in|or|is)\b)' |sort|uniq -c|sort -n|tail -n 22

JavaScript 1.8 (SpiderMonkey) - 354

x={};p='|';e=' ';z=[];c=77
while(l=readline())l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y)x[y]?x[y].c++:z.push(x[y]={w:y,c:1}))
z=z.sort(function(a,b)b.c-a.c).slice(0,22)
for each(v in z){v.r=v.c/z[0].c
c=c>(l=(77-v.w.length)/v.r)?l:c}for(k in z){v=z[k]
s=Array(v.r*c|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}

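Sadly, the for([k,v]in z) from the Rhino version doesn't seem to want to work in SpiderMonkey, and readFile() is a little easier than using readline(), but moving up to 1.8 allows us to use function closures to cut a few more lines...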
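
The regex in this entry skips the ignored words with a negative lookahead instead of filtering them afterwards, and i[tns]? folds i, it, in and is into one alternative. A rough Python equivalent, as a sketch only: it matches [a-z]+ where the JavaScript uses \w+, and the names WORD and words are made up.

```python
import re

# \b(?!...\b) refuses to start a match at any ignored word.
WORD = re.compile(r"\b(?!(?:the|and|of|to|a|i[tns]?|or)\b)[a-z]+")

def words(text):
    return WORD.findall(text.lower())

print(words("It is the cat in a hat, and Alice said so."))
```

Because the lookahead ends in \b, a word such as "alice" is still accepted even though it starts with the ignored word "a".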

Adding whitespace for readability:

x={};p='|';e=' ';z=[];c=77
while(l=readline())
l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,
function(y) x[y] ? x[y].c++ : z.push( x[y] = {w: y, c: 1} )
)
z=z.sort(function(a,b) b.c - a.c).slice(0,22)
for each(v in z){
v.r=v.c/z[0].c
c=c>(l=(77-v.w.length)/v.r)?l:c
}
for(k in z){
v=z[k]
s=Array(v.r*c|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)
}

Usage: js golf.js < input.txt

Output:

_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|___________________________________________| that
|___________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|____________________________| s
|____________________________| t
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| for
|______________________| had
|______________________| but
|_____________________| be
|_____________________| not
|___________________| they
|___________________| so

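(base version - doesn't handle bar widths correctly)

JavaScript (Rhino) - 405 395 387 377 368 343 304 chars

I think my sorting logic is off, but... I don't know.

Minified (abusing \n being interpreted as a ; sometimes):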

x={};p='|';e=' ';z=[]
readFile(arguments[0]).toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y){x[y]?x[y].c++:z.push(x[y]={w:y,c:1})})
z=z.sort(function(a,b){return b.c-a.c}).slice(0,22)
for([k,v]in z){s=Array((v.c/z[0].c)*70|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}

Python 2.6, 347 chars

import re
W,x={},"a and i in is it of or the to".split()
[W.__setitem__(w,W.get(w,0)-1)for w in re.findall("[a-z]+",file("11.txt").read().lower())if w not in x]
W=sorted(W.items(),key=lambda p:p[1])[:22]
bm=(76.-len(W[0][0]))/W[0][1]
U=lambda n:"_"*int(n*bm)
print "".join(("%s\n|%s| %s "%((""if i else" "+U(n)),U(n),w))for i,(w,n)in enumerate(W))

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so

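Ruby, 215, 216, 218, 221, 224, 236, 237 chars

update 1: Hurray! It's a tie with JS Bangs' solution. Can't think of a way to cut down any more :)

update 2: Played a dirty golf trick. Changed each to map to save 1 character :)

update 3: Changed File.read to IO.read, +2. Array.group_by wasn't very fruitful, changed to reduce, +6. A case-insensitive check is not needed in the regex after lower-casing with downcase, +1. Sorting in descending order is easily done by negating the value, +6. Total savings +15

update 4: [0] rather than .first, +3. (@Shtééf)

update 5: Expand variable l in-place, +1. Expand variable s in-place, +2. (@Shtééf)

update 6: Use string addition rather than interpolation for the first line, +2. (@Shtééf)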

w=(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take 22;m=76-w[0][0].size;puts' '+'_'*m;w.map{|x,f|puts"|#{'_'*(f*1.0/w[0][1]*m)}| #{x} "}

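update 7: I went through a whole lot of hoopla to detect the first iteration inside the loop, using instance variables. All I got is +1, though perhaps there is potential. Preserving the previous version, because I believe this one is black magic. (@Shtééf)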

(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take(22).map{|x,f|@f||(@f=f;puts' '+'_'*(@m=76-x.size));puts"|#{'_'*(f*1.0/@f*@m)}| #{x} "}

Readable version

string = File.read($_).downcase


words = string.scan(/[a-z]+/i)
allowed_words = words - %w{the and of to a i it in or is}
sorted_words = allowed_words.group_by{ |x| x }.map{ |x,y| [x, y.size] }.sort{ |a,b| b[1] <=> a[1] }.take(22)
highest_frequency = sorted_words.first
highest_frequency_count = highest_frequency[1]
highest_frequency_word = highest_frequency[0]


word_length = highest_frequency_word.size
widest = 76 - word_length


puts " #{'_' * widest}"
sorted_words.each do |word, freq|
width = (freq * 1.0 / highest_frequency_count) * widest
puts "|#{'_' * width}| #{word} "
end

To use:

echo "Alice.txt" | ruby -ln GolfedWordFrequencies.rb

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so

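Java, slowly getting shorter (1500 1358 1241 1020 913 890 chars)

Stripped even more white space and variable name length. Removed generics where possible, removed the inline class and a try/catch block. Too bad, my 900 version had a bug.

Removed another try/catch block.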

import java.net.*;import java.util.*;import java.util.regex.*;import org.apache.commons.io.*;public class G{public static void main(String[]a)throws Exception{String text=IOUtils.toString(new URL(a[0]).openStream()).toLowerCase().replaceAll("\\b(the|and|of|to|a|i[tns]?|or)\\b","");final Map<String,Integer>p=new HashMap();Matcher m=Pattern.compile("\\b\\w+\\b").matcher(text);Integer b;while(m.find()){String w=m.group();b=p.get(w);p.put(w,b==null?1:b+1);}List<String>v=new Vector(p.keySet());Collections.sort(v,new Comparator(){public int compare(Object l,Object m){return p.get(m)-p.get(l);}});boolean t=true;float r=0;for(String w:v.subList(0,22)){if(t){t=false;r=p.get(w)/(float)(80-(w.length()+4));System.out.println(" "+new String(new char[(int)(p.get(w)/r)]).replace('\0','_'));}System.out.println("|"+new String(new char[(int)(((Integer)p.get(w))/r)]).replace('\0','_')+"|"+w);}}}

Readable version:

import java.net.*;
import java.util.*;
import java.util.regex.*;
import org.apache.commons.io.*;


public class G{


public static void main(String[] a) throws Exception{
String text =
IOUtils.toString(new URL(a[0]).openStream())
.toLowerCase()
.replaceAll("\\b(the|and|of|to|a|i[tns]?|or)\\b", "");
final Map<String, Integer> p = new HashMap();
Matcher m = Pattern.compile("\\b\\w+\\b").matcher(text);
Integer b;
while(m.find()){
String w = m.group();
b = p.get(w);
p.put(w, b == null ? 1 : b + 1);
}
List<String> v = new Vector(p.keySet());
Collections.sort(v, new Comparator(){


public int compare(Object l, Object m){
return p.get(m) - p.get(l);
}
});
boolean t = true;
float r = 0;
for(String w : v.subList(0, 22)){
if(t){
t = false;
r = p.get(w) / (float) (80 - (w.length() + 4));
System.out.println(" "
+ new String(new char[(int) (p.get(w) / r)]).replace('\0',
'_'));
}
System.out.println("|"
+ new String(new char[(int) (((Integer) p.get(w)) / r)]).replace('\0',
'_') + "|" + w);
}
}
}

Javascript, 348 characters

After I finished mine, I stole some ideas from Matt :3

t=prompt().toLowerCase().replace(/\b(the|and|of|to|a|i[tns]?|or)\b/gm,'');r={};o=[];t.replace(/\b([a-z]+)\b/gm,function(a,w){r[w]?++r[w]:r[w]=1});for(i in r){o.push([i,r[i]])}m=o[0][1];o=o.slice(0,22);o.sort(function(F,D){return D[1]-F[1]});for(B in o){F=o[B];L=new Array(~~(F[1]/m*(76-F[0].length))).join('_');print(' '+L+'\n|'+L+'| '+F[0]+' \n')}

Requires print and prompt function support.

Mathematica (297 284 248 244 242 199 chars) Pure functional

and Zipf's Law testing

Look Mamma ... no vars, no hands, ... no head

Edit 1 > some shorthands defined (284 chars)

f[x_, y_] := Flatten[Take[x, All, y]];


BarChart[f[{##}, -1],
BarOrigin -> Left,
ChartLabels -> Placed[f[{##}, 1], After],
Axes -> None
]
& @@
Take[
SortBy[
Tally[
Select[
StringSplit[ToLowerCase[Import[i]], RegularExpression["\\W+"]],
!MemberQ[{"the", "and", "of", "to", "a", "i", "it", "in", "or","is"}, #]&]
],
Last],
-22]

Some explanations

Import[]
# Get The File


ToLowerCase []
# To Lower Case :)


StringSplit[ STRING , RegularExpression["\\W+"]]
# Split By Words, getting a LIST


Select[ LIST, !MemberQ[{LIST_TO_AVOID}, #]&]
#  Select from LIST except those words in LIST_TO_AVOID
#  Note that !MemberQ[{LIST_TO_AVOID}, #]& is a FUNCTION for the test


Tally[LIST]
# Get the LIST {word,word,..}
# and produce another {{word,counter},{word,counter}...}


SortBy[ LIST ,Last]
# Get the list produced by Tally and sort by counters
# Note that counters are the LAST element of {word,counter}


Take[ LIST ,-22]
# Once sorted, get the biggest 22 counters


BarChart[f[{##}, -1], ChartLabels -> Placed[f[{##}, 1], After]] &@@ LIST
# Get the list produced by Take as input and produce a bar chart


f[x_, y_] := Flatten[Take[x, All, y]]
# Auxiliary to get the list of the first or second element of lists of lists x_
# depending upon y
# So f[{##}, -1] is the list of counters
# and f[{##}, 1] is the list of words (labels for the chart)

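Output

(image: http://i49.tinypic.com/2n8mrer.jpg)

Mathematica is not well suited for golfing, and that is just because of the long, descriptive function names. Functions like "RegularExpression[]" or "StringSplit[]" just make me sob :(.

Zipf's Law testing

Zipf's law predicts that for a natural language text, the Log (Rank) vs Log (occurrences) plot follows a linear relationship.

The law is used in developing algorithms for cryptography and data compression. (But it's NOT the "Z" in the LZW algorithm.)

In our text, we can test it with the following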

 f[x_, y_] := Flatten[Take[x, All, y]];
ListLogLogPlot[
Reverse[f[{##}, -1]],
AxesLabel -> {"Log (Rank)", "Log Counter"},
PlotLabel -> "Testing Zipf's Law"]
& @@
Take[
SortBy[
Tally[
StringSplit[ToLowerCase[b], RegularExpression["\\W+"]]
],
Last],
-1000]

The result is (pretty well linear)

(image: http://i46.tinypic.com/33fcmdk.jpg)
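
For anyone without Mathematica at hand, the linearity check can be approximated by fitting a least-squares line to log(count) against log(rank). The Python below is only a sketch under that assumption; zipf_slope is a made-up name, and the data is synthetic (counts of exactly 1200/rank), not the Alice text.

```python
import math
from collections import Counter

def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank)."""
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic counts that follow count = 1200 / rank exactly.
ideal = Counter({"w%d" % r: 1200 // r for r in (1, 2, 3, 4, 5, 6)})
print(round(zipf_slope(ideal), 3))
```

A slope close to -1 on real word counts would be the textbook Zipf case; how close the Alice text comes is what the plot above shows.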

Edit 6 > (242 chars)

Refactoring the regex (no Select function anymore)
Dropping 1-char words
More efficient definition for function "f"

f = Flatten[Take[#1, All, #2]]&;
BarChart[
f[{##}, -1],
BarOrigin -> Left,
ChartLabels -> Placed[f[{##}, 1], After],
Axes -> None]
& @@
Take[
SortBy[
Tally[
StringSplit[ToLowerCase[Import[i]],
RegularExpression["(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"]]
],
Last],
-22]

Edit 7 → 199 chars

BarChart[#2, BarOrigin->Left, ChartLabels->Placed[#1, After], Axes->None]&@@
Transpose@Take[SortBy[Tally@StringSplit[ToLowerCase@Import@i,
RegularExpression@"(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"],Last], -22]
  • f replaced with Transpose and Slot (#1/#2) arguments.
  • We don't need no stinkin' brackets (use f@x instead of f[x] where possible)

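Java - 896 chars

931 chars

1233 chars made unreadable

1977 chars "uncompressed"


Update: I have aggressively reduced the character count. Omits single-letter words per the updated spec.

I envy C# and LINQ so much.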

import java.util.*;import java.io.*;import static java.util.regex.Pattern.*;class g{public static void main(String[] a)throws Exception{PrintStream o=System.out;Map<String,Integer> w=new HashMap();Scanner s=new Scanner(new File(a[0])).useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));while(s.hasNext()){String z=s.next().trim().toLowerCase();if(z.equals(""))continue;w.put(z,(w.get(z)==null?0:w.get(z))+1);}List<Integer> v=new Vector(w.values());Collections.sort(v);List<String> q=new Vector();int i,m;i=m=v.size()-1;while(q.size()<22){for(String t:w.keySet())if(!q.contains(t)&&w.get(t).equals(v.get(i)))q.add(t);i--;}int r=80-q.get(0).length()-4;String l=String.format("%1$0"+r+"d",0).replace("0","_");o.println(" "+l);o.println("|"+l+"| "+q.get(0)+" ");for(i=m-1;i>m-22;i--){o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");}}}

"Readable":

import java.util.*;
import java.io.*;
import static java.util.regex.Pattern.*;
class g
{
public static void main(String[] a)throws Exception
{
PrintStream o = System.out;
Map<String,Integer> w = new HashMap();
Scanner s = new Scanner(new File(a[0]))
.useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));
while(s.hasNext())
{
String z = s.next().trim().toLowerCase();
if(z.equals(""))
continue;
w.put(z,(w.get(z) == null?0:w.get(z))+1);
}
List<Integer> v = new Vector(w.values());
Collections.sort(v);
List<String> q = new Vector();
int i,m;
i = m = v.size()-1;
while(q.size()<22)
{
for(String t:w.keySet())
if(!q.contains(t)&&w.get(t).equals(v.get(i)))
q.add(t);
i--;
}
int r = 80-q.get(0).length()-4;
String l = String.format("%1$0"+r+"d",0).replace("0","_");
o.println(" "+l);
o.println("|"+l+"| "+q.get(0)+" ");
for(i = m-1; i > m-22; i--)
{
o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");
}
}
}

Output for Alice:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|___________________________| on
|__________________________| all
|________________________| this
|________________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

Output for Don Quixote (also from Gutenberg):

 ________________________________________________________________________
|________________________________________________________________________| that
|________________________________________________________| he
|______________________________________________| for
|__________________________________________| his
|________________________________________| as
|__________________________________| with
|_________________________________| not
|_________________________________| was
|________________________________| him
|______________________________| be
|___________________________| don
|_________________________| my
|_________________________| this
|_________________________| all
|_________________________| they
|________________________| said
|_______________________| have
|_______________________| me
|______________________| on
|______________________| so
|_____________________| you
|_____________________| quixote

Java - 991 chars (incl. newlines and indentation)

I took the code of @seanizer, fixed a bug (he omitted the 1st output line), and made some improvements to make the code more 'golfy'.

import java.util.*;
import java.util.regex.*;
import org.apache.commons.io.IOUtils;
public class WF{
public static void main(String[] a)throws Exception{
String t=IOUtils.toString(new java.net.URL(a[0]).openStream());
class W implements Comparable<W> {
String w;int f=1;W(String W){w=W;}public int compareTo(W o){return o.f-f;}
String d(float r){char[]c=new char[(int)(f/r)];Arrays.fill(c,'_');return "|"+new String(c)+"| "+w;}
}
Map<String,W>M=new HashMap<String,W>();
Matcher m=Pattern.compile("\\b\\w+\\b").matcher(t.toLowerCase());
while(m.find()){String w=m.group();W W=M.get(w);if(W==null)M.put(w,new W(w));else W.f++;}
M.keySet().removeAll(Arrays.asList("the,and,of,to,a,i,it,in,or,is".split(",")));
List<W>L=new ArrayList<W>(M.values());Collections.sort(L);int l=76-L.get(0).w.length();
System.out.println(" "+new String(new char[l]).replace('\0','_'));
for(W w:L.subList(0,22))System.out.println(w.d((float)L.get(0).f/(float)l));
}
}

Output:

_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so


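Java - 886 865 756 744 742 744 752 742 714 680 chars

  • Updates before the first 742: improved regex, removed superfluous parameterized types, removed superfluous whitespace.

  • Update 742 > 744 chars: fixed the fixed-length hack. It's only dependent on the 1st word, not other words (yet). Found several places to shorten the code (\\s in the regex replaced by a literal space, and ArrayList replaced by Vector). I'm now looking for a short way to remove the Commons IO dependency and read from stdin.

  • Update 744 > 752 chars: I removed the commons dependency. It now reads from stdin. Paste the text into stdin and hit Ctrl+Z to get the result.

  • Update 752 > 742 chars: I removed public and a space, made the classname 1 char instead of 2, and it's now ignoring one-letter words.

  • Update 742 > 714 chars: Updated as per comments of Carl: removed a redundant assignment (742 > 730), replaced m.containsKey(k) by m.get(k)!=null (730 > 728), introduced substringing of the line (728 > 714).

  • Update 714 > 680 chars: Updated as per comments of Rotsor: improved the bar size calculation to remove unnecessary casting, and improved split() to remove an unnecessary replaceAll().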
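
The split() trick from the last update treats the ignored words (and any single character) as part of the delimiter, so no separate replaceAll() pass is needed. A Python sketch of the same idea, not a translation of the Java: DELIM and count_by_split are made-up names, and the groups are non-capturing because Python's re.split would otherwise return the captured text too.

```python
import re
from collections import Counter

# Delimiter: runs of non-word characters, single characters, or ignored words.
DELIM = re.compile(r"(?:\b(?:.|the|and|of|to|i[tns]|or)\b|\W)+")

def count_by_split(text):
    pieces = DELIM.split(text.lower())
    return Counter(p for p in pieces if p)  # drop empty edge pieces

print(count_by_split("The cat and the hat: is it a cat?"))
```

As in the Java regex, \W leaves digits and underscores inside words, so this is slightly looser than the a-zA-Z rule in the question.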


import java.util.*;class F{public static void main(String[]a)throws Exception{StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);}}

More readable version:

import java.util.*;
class F{
public static void main(String[]a)throws Exception{
StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));
final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);
List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});
int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);
for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);
}
}

Output:

_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

It really sucks that Java doesn't have String#join() and closures (yet).

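Edit:

I have made several changes to your solution:

  • Replaced List with a String[]
  • Reused the 'args' argument instead of declaring my own String array
  • Replaced StringBuffer with a String (yes, yes, terrible performance)
  • Replaced Java sorting with a selection sort with early halting (only the first 22 elements have to be found)
  • Aggregated some int declarations into a single statement
  • Implemented the non-cheating algorithm for finding the most limiting line of output, without floating point.
  • Fixed the program crashing when the text contains fewer than 22 distinct words
  • Implemented a new input-reading algorithm, which is fast and only 9 characters longer than the slow one.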
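
The integer-only search for the most limiting line can be sketched in Python roughly as follows. This is an illustration rather than a port of the golfed Java: the names limiting_word and bar_lengths are made up here, and every word is assumed to be shorter than 76 characters. The comparison count/room > best_count/best_room is done by cross-multiplication, so no floating point is involved.

```python
def limiting_word(rows, width=76):
    """rows: (word, count) pairs, most frequent first.

    Returns (room, count) for the word whose line would overflow first,
    i.e. the one with the largest count / (width - len(word)) ratio.
    """
    room, count = width - len(rows[0][0]), rows[0][1]
    for word, c in rows[1:]:
        r = width - len(word)
        # c / r > count / room, compared without division
        if c * room > r * count:
            room, count = r, c
    return room, count


def bar_lengths(rows, width=76):
    room, count = limiting_word(rows, width)
    # Integer division truncates, as in the Java version.
    return [c * room // count for _, c in rows]


rows = [("she", 553), ("superlongstringstring", 481), ("said", 462)]
bars = bar_lengths(rows)
```

With these three rows the long second word is the limiting one, so its bar gets all 55 columns left over after its 21 letters, and the other bars are scaled relative to it.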

The condensed code is 684 characters long (previously 688, then 711):

import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;(j=System.in.read())>0;w+=(char)j);for(String W:w.toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(W,m.get(W)!=null?m.get(W)+1:1);l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}}

The fast version (693 characters, previously 720):

import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}}

More readable version:

import java.util.*;class F{public static void main(String[]l)throws Exception{
Map<String,Integer>m=new HashMap();String w="";
int i=0,k=0,j=8,x,y,g=22;
for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{
if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";
}}
l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;
for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}
for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}
String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');
System.out.println(" "+s);
for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}
}

The version without the behavior improvements is 615 characters:

import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);for(;i<g;++i)for(j=i;++j<l.length;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}i=76-l[0].length();String s=new String(new char[i]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/m.get(l[0]))+"| "+w);}}}

Python 2.x, latitudinarian approach: 183 characters (previously 227)

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)if w not in'andithetoforinis')[:22]
for l,w in r:print(78-len(r[0][1]))*l/r[0][0]*'=',w

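Allowing for freedom in the implementation, I constructed a string concatenation that contains all the words requested for exclusion (the, and, of, to, a, i, it, in, or, is). It also excludes the two infamous "words" s and t from the example, and I threw in the exclusion of an, for, he for free. I tried all concatenations of those words against the corpus of words from Alice, the King James Bible and the Jargon File to see whether any words would be mis-excluded by the string. That is how I ended up with two exclusion strings: itheandtoforinis and andithetoforinis.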
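
To see what the substring test actually excludes, here is a small Python 3 check (the code above is Python 2). The variable names are only for illustration. Any word that happens to be a substring of the exclusion string is dropped, including the empty string that re.split can produce at the ends of the text.

```python
EXCLUDE = "andithetoforinis"
required = "the and of to a i it in or is".split()

# Every word from the official ignore list is a substring of EXCLUDE.
covered = all(w in EXCLUDE for w in required)

# The extras mentioned above are excluded as well.
extras = [w for w in ("an", "for", "he", "s", "t") if w in EXCLUDE]

# Of these candidates, "she" and "that" survive; "dit", "for" and "" do not.
dropped = [w for w in ("she", "that", "dit", "for", "") if w in EXCLUDE]
print(covered, extras, dropped)
```

The cost of the trick is that unrelated fragments such as "dit" are excluded too, which is why the string was checked against several corpora.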

PS. Borrowed from other solutions to shorten the code.

=========================================================================== she
================================================================= you
============================================================== said
====================================================== alice
================================================ was
============================================ that
===================================== as
================================= her
============================== at
============================== with
=========================== on
=========================== all
======================== this
======================== had
======================= but
====================== be
====================== not
===================== they
==================== so
=================== very
=================== what
================= little

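Rant

Regarding the words to ignore, one would think those would be taken from the list of the most-used words in English. That list depends on the text corpus used. Per one of the most popular lists (http://en.wikipedia.org/wiki/Most_common_words_in_English, http://www.english-for-students.com/Frequently-Used-Words.html, http://www.sporcle.com/games/common_english_words.php), the top 10 words are: the be(am/are/is/was/were) to of and a in that have I

The top 10 words from the Alice in Wonderland text are: the and to a of it she i you said
The top 10 words from the Jargon File (v4.4.7) are: the a of to and in is that or for

So the question is why or was included in the problem's ignore list, where it is about 30th in popularity, when the word that (the 8th most used) is not. And so on. Hence I believe the ignore list should be provided dynamically (or could be omitted).

An alternative would be simply to skip the top 10 words of the result, which would actually shorten the solution (elementary: show only the 11th to 32nd entries).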


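Python 2.x, punctilious approach: 243 characters (previously 277)

The chart drawn by the code above is simplified (it uses only one character for the bars). If one wants to reproduce exactly the chart from the problem description (which was not required), this code will do it: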

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)-set(sys.argv))[:22]
h=min(9*l/(77-len(w))for l,w in r)
print'',9*r[0][0]/h*'_'
for l,w in r:print'|'+9*l/h*'_'+'|',w

I take issue with the somewhat random choice of the 10 words to exclude (the, and, of, to, a, i, it, in, or, is), so those are to be passed as command-line parameters, like so:
python WordFrequencyChart.py the and of to a i it in or is <"Alice's Adventures in Wonderland.txt"

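That is 213 characters + 30 if we account for the "original" ignore list passed on the command line = 243.

PS. The second code also "adjusts" for the lengths of all the top words, so none of them overflows in the degenerate case.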

 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|_____________________________________________________| said
|______________________________________________| alice
|_________________________________________| was
|______________________________________| that
|_______________________________| as
|____________________________| her
|__________________________| at
|__________________________| with
|_________________________| s
|_________________________| t
|_______________________| on
|_______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|___________________| not
|_________________| they
|_________________| so

I like big ones... Objective-C (905 characters, previously 1070, then 931)

#define S NSString
#define C countForObject
#define O objectAtIndex
#define U stringWithCString
main(int g,char**b){id c=[NSCountedSet set];S*d=[S stringWithContentsOfFile:[S U:b[1]]];id p=[NSPredicate predicateWithFormat:@"SELF MATCHES[cd]'(the|and|of|to|a|i[tns]?|or)|[^a-z]'"];[d enumerateSubstringsInRange:NSMakeRange(0,[d length])options:NSStringEnumerationByWords usingBlock:^(S*s,NSRange x,NSRange y,BOOL*z){if(![p evaluateWithObject:s])[c addObject:[s lowercaseString]];}];id s=[[c allObjects]sortedArrayUsingComparator:^(id a,id b){return(NSComparisonResult)([c C:b]-[c C:a]);}];g=[c C:[s O:0]];int j=76-[[s O:0]length];char*k=malloc(80);memset(k,'_',80);S*l=[S U:k length:80];printf(" %s\n",[[l substringToIndex:j]cString]),[[s subarrayWithRange:NSMakeRange(0,22)]enumerateObjectsUsingBlock:^(id a,NSUInteger x,BOOL*y){printf("|%s| %s\n",[[l substringToIndex:[c C:a]*j/g]cString],[a cString]);}];}

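Switched to using a lot of deprecated APIs, removed some memory management that wasn't needed, and removed whitespace more aggressively.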

 _________________________________________________________________________
|_________________________________________________________________________| she
|______________________________________________________________| said
|__________________________________________________________| you
|____________________________________________________| alice
|________________________________________________| was
|_______________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|___________________________| on
|__________________________| all
|________________________| this
|________________________| for
|________________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| so
|___________________| very
|__________________| what
|_________________| they

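Ruby: 200 characters (previously 207, 213, 211, 210, 207, 203, 201)

An improvement on Anurag's solution, incorporating a suggestion from rfusca. Also removes the argument to sort and includes a few other minor golfings.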

w=(STDIN.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort.take 22;k,l=w[0];m=76.0-l.size;puts' '+'_'*m;w.map{|f,x|puts"|#{'_'*(m*f/k)}| #{x} "}

Execute as:

ruby GolfedWordFrequencies.rb < Alice.txt

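Edit: Put 'puts' back in; it needs to be there to avoid having quotes in the output.
Edit 2: Changed File -> IO
Edit 3: Removed /i
Edit 4: Removed the parentheses around (f*1.0), recounted
Edit 5: Use string addition for the first line; expand s in place.
Edit 6: Made m a float, removed 1.0. EDIT: Doesn't work, changes lengths. EDIT: No worse than before
Edit 7: Use STDIN.read.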

shell, grep, tr, grep, sort, uniq, sort, head, perl: 206 characters

~ % wc -c wfg
209 wfg
~ % cat wfg
egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|of|to|a|i|it|in|or|is'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
~ % # usage:
~ % sh wfg < 11.txt

Hm, just seen above: sort -nr -> sort -n and then head -> tail => 208 :)
Update 2: Erm, of course the above is silly, as the output would then be reversed. So, 209.
Update 3: Optimized the exclusion regexp -> 206

egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'



For fun, here's a perl-only version (much faster):

~ % wc -c pgolf
204 pgolf
~ % cat pgolf
perl -lne'$1=~/^(the|and|o[fr]|to|.|i[tns])$/i||$f{lc$1}++while/\b([a-z]+)/gi}{@w=(sort{$f{$b}<=>$f{$a}}keys%f)[0..21];$Q=$f{$_=$w[0]};$B=76-y///c;print" "."_"x$B;print"|"."_"x($B*$f{$_}/$Q)."| $_"for@w'
~ % # usage:
~ % sh pgolf < 11.txt

GolfScript: 130 characters (previously 177, 175, 173, 167, 164, 163, 144, 131)

Slow: 3 minutes for the sample text (130)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*' '\@{"
|"\~1*2/0*'| '@}/

Explanation:

{           #loop through all characters
32|.       #convert to lowercase and duplicate
123%97<    #determine if is a letter
n@if       #return either the letter or a newline
}%          #return an array (of ints)
]''*        #convert array to a string with magic
n%          #split on newline, removing blanks (stack is an array of words now)
"oftoitinorisa"   #push this string
2/          #split into groups of two, i.e. ["of" "to" "it" "in" "or" "is" "a"]
-           #remove any occurrences from the text
"theandi"3/-#remove "the", "and", and "i"
$           #sort the array of words
(1@         #takes the first word in the array, pushes a 1, reorders stack
#the 1 is the current number of occurrences of the first word
{           #loop through the array
.3$>1{;)}if#increment the count or push the next word and a 1
}/
]2/         #gather stack into an array and split into groups of 2
{~~\;}$     #sort by the latter element - the count of occurrences of each word
22<         #take the first 22 elements
.0=~:2;     #store the highest count
,76\-:1     #store the length of the first line
'_':0*' '\@ #make the first line
{           #loop through each word
"
|"\~        #start drawing the bar
1*2/0       #divide by zero
*'| '@      #finish drawing the bar
}/

"Correct" (hopefully)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{"
|"\~1*^/0*'| '@}/

Less slow: half a minute

'"'/' ':S*n/S*'"#{%q
'\+"
.downcase.tr('^a-z','
')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*S\@{"
|"\~1*2/0*'| '@}/

Output is visible in the revision logs.

Windows PowerShell, 199 characters

$x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *
filter f($w){' '+'_'*$w
$x[-1..-22]|%{"|$('_'*($w*$_.Count/$x[-1].Count))| "+$_.Name}}
f(76..1|?{!((f $_)-match'.'*80)})[0]

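(The last line break isn't necessary, but is included here for readability.)

(The current code and my test files are available in my SVN repository. I hope my test cases catch the most common errors: bar length, problems with regex matching, and a few others.)

Assumptions:

  • US-ASCII as input. It probably gets weird with Unicode.
  • The text contains a minimum number of non-stop words

History

Relaxed version (137), since that is apparently counted separately by now: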

($x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *)[-1..-22]|%{"|$('_'*(76*$_.Count/$x[-1].Count))| "+$_.Name}
  • doesn't close the first bar
  • doesn't account for the word length of non-first words

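The bar lengths differ by one character from other solutions because PowerShell rounds instead of truncating when converting floating-point numbers to integers. Since the task requires only proportional bar lengths, this should be fine.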
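
A quick way to see where the one-character difference comes from, sketched in Python with counts taken from the sample chart (553 for "she", 73 columns for its bar). Python's round() is only standing in for PowerShell's conversion here, which is an assumption of this sketch; the function names are illustrative.

```python
# Widest bar (for "she") and its count, taken from the sample chart.
width, top = 73, 553


def truncated(count):
    return int(width * count / top)


def rounded(count):
    # round() is a stand-in for PowerShell's float-to-int conversion.
    return round(width * count / top)


for count in (481, 462, 403):
    print(count, truncated(count), rounded(count))
```

For 462 ("said") the exact value is about 60.99, so truncation gives 60 columns and rounding gives 61; the other two counts come out the same either way.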

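Compared to other solutions, I took a slightly different approach to determining the longest bar length: I simply try out lengths and take the largest one for which no line is longer than 80 characters.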
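
The try-every-width idea can be sketched in Python as below. This is not a translation of the PowerShell: render and widest_fitting are hypothetical names, bars are truncated rather than rounded, and a line is taken to fit when its length plus one trailing space is at most 80.

```python
def render(rows, width):
    """Draw the chart with the top bar scaled to `width` underscores."""
    top = rows[0][1]
    lines = [" " + "_" * width]
    for word, count in rows:
        lines.append("|" + "_" * (width * count // top) + "| " + word)
    return lines


def widest_fitting(rows, limit=80):
    # Try the widest chart first and fall back one column at a time.
    for width in range(limit, 0, -1):
        lines = render(rows, width)
        if all(len(line) + 1 <= limit for line in lines):  # +1: trailing space
            return lines
    raise ValueError("no width fits within the limit")


rows = [("she", 553), ("superlongstringstring", 481), ("said", 462)]
lines = widest_fitting(rows)
print("\n".join(lines))
```

Because it tests the rendered lines directly, this search can squeeze out a column that a closed-form scale factor would lose to truncation, at the cost of rendering the chart up to 80 times.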

An explanation of an older version can be found here.

Ruby 1.9, 185 characters

(heavily based on the other Ruby solutions)

w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
k,l=w[0]
puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]

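You can simply pass the filename as an argument, without any command-line switches as in the other solutions (i.e. ruby1.9 wordfrequency.rb Alice.txt).

Since I'm using character literals here, this solution only works in Ruby 1.9.

Edit: Replaced semicolons with line breaks for "readability".

Edit 2: Shtéf pointed out that I forgot the trailing space; fixed that.

Edit 3: Removed the trailing space again ;)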

R, 449 characters

It can probably get shorter.

bar <- function(w, l)
{
b <- rep("-", l)
s <- rep(" ", l)
cat(" ", b, "\n|", s, "| ", w, "\n ", b, "\n", sep="")
}


f <- "alice.txt"
e <- c("the", "and", "of", "to", "a", "i", "it", "in", "or", "is", "")
w <- unlist(lapply(readLines(file(f)), strsplit, s=" "))
w <- tolower(w)
w <- unlist(lapply(w, gsub, pa="[^a-z]", r=""))
u <- unique(w[!w %in% e])
n <- unlist(lapply(u, function(x){length(w[w==x])}))
o <- rev(order(n))
n <- n[o]
m <- 77 - max(unlist(lapply(u[1:22], nchar)))
n <- floor(m*n/n[1])
u <- u[o]


for (i in 1:22)
bar(u[i], n[i])

Groovy: 321 characters (previously 424, 389, 378)

Used b=map[a] instead of b=map.get(a), and replaced the split with a matcher/iterator.

def r,s,m=[:],n=0;def p={println it};def w={"_".multiply it};(new URL(this.args[0]).text.toLowerCase()=~/\b\w+\b/).each{s=it;if(!(s==~/(the|and|of|to|a|i[tns]?|or)/))m[s]=m[s]==null?1:m[s]+1};m.keySet().sort{a,b->m[b]<=>m[a]}.subList(0,22).each{k->if(n++<1){r=(m[k]/(76-k.length()));p" "+w(m[k]/r)};p"|"+w(m[k]/r)+"|"+k}

(Executed as a Groovy script, with a URL as the command-line argument. No imports required!)

Readable version:

def r,s,m=[:],n=0;
def p={println it};
def w={"_".multiply it};
(new URL(this.args[0]).text.toLowerCase()
=~ /\b\w+\b/
).each{
s=it;
if (!(s ==~/(the|and|of|to|a|i[tns]?|or)/))
m[s] = m[s] == null ? 1 : m[s] + 1
};
m.keySet()
.sort{
a,b -> m[b] <=> m[a]
}
.subList(0,22).each{
k ->
if( n++ < 1 ){
r=(m[k]/(76-k.length()));
p " " + w(m[k]/r)
};
p "|" + w(m[k]/r) + "|" + k
}

Scala, 368 characters

First, a legible version in 592 characters:

object Alice {
def main(args:Array[String]) {
val s = io.Source.fromFile(args(0))
val words = s.getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase)
val freqs = words.foldLeft(Map[String, Int]())((countmap, word)  => countmap + (word -> (countmap.getOrElse(word, 0)+1)))
val sortedFreqs = freqs.toList.sort((a, b)  => a._2 > b._2)
val top22 = sortedFreqs.take(22)
val highestWord = top22.head._1
val highestCount = top22.head._2
val widest = 76 - highestWord.length
println(" " + "_" * widest)
top22.foreach(t => {
val width = Math.round((t._2 * 1.0 / highestCount) * widest).toInt
println("|" + "_" * width + "| " + t._1)
})
}
}

The console output looks like this:

$ scalac alice.scala
$ scala Alice aliceinwonderland.txt
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

We can do some aggressive minifying and get it down to 415 characters:

object A{def main(args:Array[String]){val l=io.Source.fromFile(args(0)).getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase).foldLeft(Map[String, Int]())((c,w)=>c+(w->(c.getOrElse(w,0)+1))).toList.sort((a,b)=>a._2>b._2).take(22);println(" "+"_"*(76-l.head._1.length));l.foreach(t=>println("|"+"_"*Math.round((t._2*1.0/l.head._2)*(76-l.head._1.length)).toInt+"| "+t._1))}}

The console session looks like this:

$ scalac a.scala
$ scala A aliceinwonderland.txt
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

I'm sure a Scala expert could do even better.

Update: In the comments, Thomas gave an even shorter version, at 368 characters:

object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}}

More legibly, at 375 characters:

object Alice {
def main(a:Array[String]) {
val t = (Map[String, Int]() /: (
for (
x <- io.Source.fromFile(a(0)).getLines;
y <- "(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(x)
) yield y.toLowerCase
).toList)((c, x) => c + (x -> (c.getOrElse(x, 0) + 1))).toList.sortBy(_._2).reverse.take(22)
val w = 76 - t.head._1.length
print (" "+"_"*w)
t.map(s => "\n|" + "_" * (s._2 * w / t.head._2) + "| " + s._1).foreach(print)
}
}

Python 2.6: 266 characters (previously 273, 269, 267)

(EDIT: Props to ChristopheD for the character-shaving suggestions)

import sys,re
t=re.findall('[a-z]+',"".join(sys.stdin).lower())
d=sorted((t.count(w),w)for w in set(t)-set("the and of to a i it in or is".split()))[:-23:-1]
r=min((78.-len(m[1]))/m[0]for m in d)
print'','_'*(int(d[0][0]*r-2))
for(a,b)in d:print"|"+"_"*(int(a*r-2))+"|",b

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so

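Perl: 189 characters (previously 205, 191) / 205 characters (fully implemented)

Some parts were heavily inspired by the earlier perl/ruby submissions, a couple of similar ideas were arrived at independently, and the others are original. The shorter version also incorporates some things I had seen or learned from other submissions.

Original: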

$k{$_}++for grep{$_!~/^(the|and|of|to|a|i|it|in|or|is)$/}map{lc=~/[a-z]+/g}<>;@t=sort{$k{$b}<=>$k{$a}}keys%k;$l=76-length$t[0];printf" %s
",'_'x$l;printf"|%s| $_
",'_'x int$k{$_}/$k{$t[0]}*$l for@t[0..21];

Latest version, down to 191 characters:

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-y///c)/$k{$_=$e[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@e[0,0..21]

Latest version, down to 189 characters:

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@_=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-m//)/$k{$_=$_[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@_[0,0..21]

This version (205 characters) accounts for lines whose words are longer than those found later in the list.

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;($r)=sort{$a<=>$b}map{(76-y///c)/$k{$_}}@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
";}@e[0,0..21]

Perl: 203 characters (previously 203, 202, 201, 198, 195, 208) / 231 characters

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&warn" "."_"x($z*$y)."\n";printf"|%.78s\n","_"x($z*$y)."| $_"}(sort{$x{$b}<=>$x{$a}}keys%x)[0..21]

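Alternative, full implementation, including the indicated behaviour (global bar-squishing) for the pathological case in which the second word is both popular and long enough to push its line over 80 characters (this implementation is 231 characters):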

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;@e=(sort{$x{$b}<=>$x{$a}}keys%x)[0..21];for(@e){$p=(76-y///c)/$x{$_};($y&&$p>$y)||($y=$p)}warn" "."_"x($x{$e[0]}*$y)."\n";for(@e){warn"|"."_"x($x{$_}*$y)."| $_\n"}

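The spec didn't state anywhere that the output had to go to STDOUT, so I used perl's warn() instead of print, which saves four characters. Used map instead of foreach, but I feel there could still be more savings in the split(join()). Still, got it down to 203; might sleep on it. At least Perl is now under the "shell, grep, tr, grep, sort, uniq, sort, head, perl" character count ;)

PS: Reddit says "Hi" ;)

Update: Eliminated join() in favour of assignment and implicit scalar-conversion join. Down to 202. Also note that I have taken advantage of the optional "ignore 1-letter words" rule to shave 2 characters off, so bear in mind that the frequency counts will reflect this.

Update 2: Swapped out the assignment and implicit join for killing $/ to get the file in one gulp using <> in the first place. Same size, but nastier. Swapped out if(!$y){} for $y||{}&&, which saves 1 more character => 201.

Update 3: Took control of lowercasing early (lc<>) by moving lc out of the map block. Changed both regexes to no longer use the /i option, as it is no longer needed. Swapped the explicit conditional x?y:z construct for the traditional perlgolf || implicit conditional construct: /^...$/i?1:$x{$_}++ for /^...$/||$x{$_}++. Saved three characters! => 198, broke the 200 barrier. Might sleep soon... perhaps.

Update 4: Sleep deprivation has made me insane. Well. More insane. Figuring that this only has to parse normal happy text files, I made it give up if it hits a null. Saved two characters. Replaced "length" with the 1-character-shorter (and much more golfish) y///c. You hear me, GolfScript?? I'm coming for you!!! *sob*

Update 5: Sleep deprivation made me forget about the 22-row limit and the limiting of subsequent lines. Back up to 208 with those handled. Not too bad; 13 characters to handle it isn't the end of the world. Played around with perl's regex inline eval, but had trouble getting it to both work and save characters... lol. Updated the example to match the current output.

Update 6: Removed the unneeded braces protecting (...)for, since the syntactic candy ++ allows shoving it up against the for. Thanks to input from Chas. Owens (reminding my tired brain), got the character class i[tns] solution in there. Back down to 203.

Update 7: Added a second piece of work, a full implementation of the spec (including the full bar-squishing behaviour for secondary long words, instead of the truncation that most people are doing, which is based on the original spec without the pathological example case).

Examples: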

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

Alternative implementation on the pathological case example:

 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| with
|_________________________| at
|_______________________| on
|______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| so
|________________| very
|________________| what

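C++, 647 characters

I don't expect to score highly by using C++, but never mind. I'm pretty sure it meets all the requirements. Note that I used the C++0x auto keyword for variable declarations, so adjust your compiler appropriately if you decide to test my code.

Minimized version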

#include <iostream>
#include <cstring>
#include <map>
using namespace std;
#define C string
#define S(x)v=F/a,cout<<#x<<C(v,'_')
#define F t->first
#define G t->second
#define O &&F!=
#define L for(i=22;i-->0;--t)
int main(){map<C,int>f;char d[230];int i=1,v;for(;i<256;i++)d[i<123?i-1:i-27]=i;d[229]=0;char w[99];while(cin>>w){for(i=0;w[i];i++)w[i]=tolower(w[i]);char*p=strtok(w,d);while(p)++f[p],p=strtok(0,d);}multimap<int,C>c;for(auto t=f.end();--t!=f.begin();)if(F!="the"O"and"O"of"O"to"O"a"O"i"O"it"O"in"O"or"O"is")c.insert(pair<int,C>(G,F));auto t=--c.end();float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;t=--c.end();S( );L S(\n|)<<"| "<<G;}

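Here is a second version that is more "C++" by using string rather than char[] and strtok. It is a bit larger, at 669 characters (+22 vs. the above), but I can't get it any smaller at the moment, so I thought I'd post it anyway.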

#include <iostream>
#include <map>
using namespace std;
#define C string
#define S(x)v=F/a,cout<<#x<<C(v,'_')
#define F t->first
#define G t->second
#define O &&F!=
#define L for(i=22;i-->0;--t)
#define E e=w.find_first_of(d,g);g=w.find_first_not_of(d,e);
int main(){map<C,int>f;int i,v;C w,x,d="abcdefghijklmnopqrstuvwxyz";while(cin>>w){for(i=w.size();i-->0;)w[i]=tolower(w[i]);unsigned g=0,E while(g-e>0){x=w.substr(e,g-e),++f[x],E}}multimap<int,C>c;for(auto t=f.end();--t!=f.begin();)if(F!="the"O"and"O"of"O"to"O"a"O"i"O"it"O"in"O"or"O"is")c.insert(pair<int,C>(G,F));auto t=--c.end();float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;t=--c.end();S( );L S(\n|)<<"| "<<G;}

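I've removed the full version, because I don't want to keep updating it with my tweaks to the minimized version. See the edit history if you are interested in the (possibly outdated) long version.

Python, 320 characters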

import sys
i="the and of to a i it in or is".split()
d={}
for j in filter(lambda x:x not in i,sys.stdin.read().lower().split()):d[j]=d.get(j,0)+1
w=sorted(d.items(),key=lambda x:x[1])[:-23:-1]
m=sorted(dict(w).values())[-1]
print" %s\n"%("_"*(76-m)),"\n".join(map(lambda x:("|%s| "+x[0])%("_"*((76-m)*x[1]/w[0][1])),w))

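Python 3.1: 229 characters (previously 245)

I guess using Counter is kind of cheating :) I only read about it about a week ago, so this was the perfect chance to see how it works.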

import re,collections
o=collections.Counter([w for w in re.findall("[a-z]+",open("!").read().lower())if w not in"a and i in is it of or the to".split()]).most_common(22)
print('\n'.join('|'+76*v//o[0][1]*'_'+'| '+k for k,v in o))

Prints out:

|____________________________________________________________________________| she
|__________________________________________________________________| you
|_______________________________________________________________| said
|_______________________________________________________| alice
|_________________________________________________| was
|_____________________________________________| that
|_____________________________________| as
|__________________________________| her
|_______________________________| with
|_______________________________| at
|______________________________| s
|_____________________________| t
|____________________________| on
|___________________________| all
|________________________| this
|________________________| for
|________________________| had
|________________________| but
|______________________| be
|______________________| not
|_____________________| they
|____________________| so

Some of the code was "borrowed" from AKX's solution.
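
For readers who have not met collections.Counter, here is an ungolfed sketch of just the counting step. It works on an in-memory string instead of the file named "!" used above; top_words, IGNORED and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

IGNORED = set("the and of to a i it in or is".split())


def top_words(text, n=22):
    """Return the n most frequent (word, count) pairs, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in IGNORED).most_common(n)


sample = "She said the cat sat. She said it again, and the cat said no."
top = top_words(sample, 3)
print(top)
```

most_common(n) does the sorting and the cut to the top n entries in one call, which is where most of the savings over the hand-rolled versions come from.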

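MATLAB: 335 bytes (previously 404, 410, 357 and 390 bytes)

The updated code is now 335 characters rather than 404, and seems to work well for both examples.


Original message (for the 404-character code)

This version is a bit longer; however, it properly scales the bar lengths when there is a ridiculously long word, so that none of the lines goes over 80 characters.

So, my code is 357 bytes without rescaling, and 410 bytes with rescaling.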

A=textscan(fopen('11.txt'),'%s','delimiter',' 0123456789,.!?-_*^:;=+\\/(){}[]@&#$%~`|"''');
s=lower(A{1});s(cellfun('length', s)<2)=[];s(ismember(s,{'the','and','of','to','it','in','or','is'}))=[];
[w,~,i]=unique(s);N=hist(i,max(i)); [j,k]=sort(N,'descend'); b=k(1:22); n=cellfun('length',w(b));
q=80*N(b)'/N(k(1))+n; q=floor(q*78/max(q)-n); for i=1:22, fprintf('%s| %s\n',repmat('_',1,q(i)),w{k(i)});end

Results:

___________________________________________________________________________| she
_________________________________________________________________| you
______________________________________________________________| said
_______________________________________________________| alice
________________________________________________| was
____________________________________________| that
_____________________________________| as
_________________________________| her
______________________________| at
______________________________| with
____________________________| on
___________________________| all
_________________________| this
________________________| for
________________________| had
________________________| but
_______________________| be
_______________________| not
_____________________| they
____________________| so
___________________| very
___________________| what

For example, after replacing all instances of "you" in the Alice in Wonderland text with "superlongstringofridiculality", my code correctly scales the results:

____________________________________________________________________| she
_________________________________________________________| superlongstringstring
________________________________________________________| said
_________________________________________________| alice
____________________________________________| was
________________________________________| that
_________________________________| as
______________________________| her
___________________________| with
___________________________| at
_________________________| on
________________________| all
_____________________| this
_____________________| for
_____________________| had
_____________________| but
____________________| be
____________________| not
__________________| they
__________________| so
_________________| very
_________________| what

Here is the updated code written a bit more legibly:

A=textscan(fopen('t'),'%s','delimiter','':'@');
s=lower(A{1});
s(cellfun('length', s)<2|ismember(s,{'the','and','of','to','it','in','or','is'}))=[];
[w,~,i]=unique(s);
N=hist(i,max(i));
[j,k]=sort(N,'descend');
n=cellfun('length',w(k));
q=80*N(k)'/N(k(1))+n;
q=floor(q*78/max(q)-n);
for i=1:22,
fprintf('%s| %s\n',repmat('_',1,q(i)),w{k(i)});
end

Haskell: 333 characters (previously 366, 351, 344, 337)

(One line break in main was added for readability, and no line break is needed at the end of the last line.)

import Data.List
import Data.Char
l=length
t=filter
m=map
f c|isAlpha c=toLower c|0<1=' '
h w=(-l w,head w)
x!(q,w)='|':replicate(minimum$m(q?)x)'_'++"| "++w
q?(g,w)=q*(77-l w)`div`g
b x=m(x!)x
a(l:r)=(' ':t(=='_')l):l:r
main=interact$unlines.a.b.take 22.sort.m h.group.sort
.t(`notElem`words"the and of to a i it in or is").words.m f

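How it works is best seen by reading the argument to interact backwards:

  • map f lowercases letters and replaces everything else with spaces.
  • words produces a list of words, dropping the separating whitespace.
  • filter (`notElem` words "the and of to a i it in or is") discards all entries with forbidden words.
  • group . sort sorts the words and groups identical ones into lists.
  • map h maps each list of identical words to a tuple of the form (-frequency, word).
  • take 22 . sort sorts the tuples by descending frequency (the first tuple entry) and keeps only the first 22 tuples.
  • b maps tuples to bars (see below).
  • a prepends the first line of underscores, to complete the topmost bar.
  • unlines joins all these lines together with newlines.

The tricky bit is getting the bar length right. I assumed that only underscores count towards the length of the bar, so || would be a bar of zero length. The function b maps c x over x, where x is the list of histograms. The entire list is passed to c so that each invocation of c can compute the scale factor for itself by calling u. In this way, I avoid using floating-point math or rationals, whose conversion functions and imports would eat many characters.

Note the trick of using -frequency. This removes the need to reverse the sort, since sorting -frequency in ascending order puts the words with the largest frequency first. Later, in the function u, two -frequency values are multiplied, which cancels the negation out.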
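
The same pipeline, including the -frequency trick and the integer-only scaling, can be sketched in Python. This mirrors the steps listed above rather than the golfed Haskell itself; histogram and bars are names chosen here, and itertools.groupby plays the role of group.

```python
from itertools import groupby

IGNORED = "the and of to a i it in or is".split()


def histogram(text, n=22):
    # map f: lowercase letters, everything else becomes a space
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    words = [w for w in cleaned.split() if w not in IGNORED]
    # group . sort, then map h: one (-frequency, word) tuple per distinct word
    rows = [(-len(list(g)), w) for w, g in groupby(sorted(words))]
    # An ascending sort on -frequency puts the most frequent word first.
    return sorted(rows)[:n]


def bars(rows):
    # q * (77 - len(w)) // g, minimised over all rows; both q and g are
    # negative, so the quotient is positive and no floats are needed.
    return ["|" + "_" * min(q * (77 - len(w)) // g for g, w in rows) + "| " + word
            for q, word in rows]


rows = histogram("Cat dog cat. Bird, cat dog!")
chart = bars(rows)
```

A side effect of sorting (-frequency, word) tuples is that words with equal counts come out in alphabetical order, with no extra code.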

Python, 250 characters

Borrowing from all the other Python snippets

import re,sys
t=re.findall("\w+","".join(sys.stdin).lower())
W=sorted((-t.count(w),w)for w in set(t)-set("the and of to a i it in or is".split()))[:22]
Z,U=W[0],lambda n:"_"*int(n*(76.-len(Z[1]))/Z[0])
print"",U(Z[0])
for(n,w)in W:print"|"+U(n)+"|",w

If you're cheeky and pass the words to avoid as arguments, 223 characters

import re,sys
t=re.findall("\w+","".join(sys.stdin).lower())
W=sorted((-t.count(w),w)for w in set(t)-set(sys.argv[1:]))[:22]
Z,U=W[0],lambda n:"_"*int(n*(76.-len(Z[1]))/Z[0])
print"",U(Z[0])
for(n,w)in W:print"|"+U(n)+"|",w

The output is:

$ python alice4.py  the and of to a i it in or is < 11.txt
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so

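PHP CLI version (450 chars)

This solution takes into account the last requirement, which most purists have conveniently chosen to ignore. That cost 170 characters!

Usage: php.exe <this.php> <file.txt>

Minified: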

<?php $a=array_count_values(array_filter(preg_split('/[^a-z]/',strtolower(file_get_contents($argv[1])),-1,1),function($x){return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);}));arsort($a);$a=array_slice($a,0,22);function R($a,$F,$B){$r=array();foreach($a as$x=>$f){$l=strlen($x);$r[$x]=$b=$f*$B/$F;if($l+$b>76)return R($a,$f,76-$l);}return$r;}$c=R($a,max($a),76-strlen(key($a)));foreach($a as$x=>$f)echo '|',str_repeat('-',$c[$x]),"| $x\n";?>

Human readable:

<?php


// Read:
$s = strtolower(file_get_contents($argv[1]));


// Split:
$a = preg_split('/[^a-z]/', $s, -1, PREG_SPLIT_NO_EMPTY);


// Remove unwanted words:
$a = array_filter($a, function($x){
return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);
});


// Count:
$a = array_count_values($a);


// Sort:
arsort($a);


// Pick top 22:
$a=array_slice($a,0,22);




// Recursive function to adjust bar widths
// according to the last requirement:
function R($a,$F,$B){
$r = array();
foreach($a as $x=>$f){
$l = strlen($x);
$r[$x] = $b = $f * $B / $F;
if ( $l + $b > 76 )
return R($a,$f,76-$l);
}
return $r;
}


// Apply the function:
$c = R($a,max($a),76-strlen(key($a)));




// Output:
foreach ($a as $x => $f)
echo '|',str_repeat('-',$c[$x]),"| $x\n";


?>

Output:

|-------------------------------------------------------------------------| she
|---------------------------------------------------------------| you
|------------------------------------------------------------| said
|-----------------------------------------------------| alice
|-----------------------------------------------| was
|-------------------------------------------| that
|------------------------------------| as
|--------------------------------| her
|-----------------------------| at
|-----------------------------| with
|--------------------------| on
|--------------------------| all
|-----------------------| this
|-----------------------| for
|-----------------------| had
|-----------------------| but
|----------------------| be
|---------------------| not
|--------------------| they
|--------------------| so
|-------------------| very
|------------------| what

When there is a long word, the bars are adjusted properly:

|--------------------------------------------------------| she
|---------------------------------------------------| thisisareallylongwordhere
|-------------------------------------------------| you
|-----------------------------------------------| said
|-----------------------------------------| alice
|------------------------------------| was
|---------------------------------| that
|---------------------------| as
|-------------------------| her
|-----------------------| with
|-----------------------| at
|--------------------| on
|--------------------| all
|------------------| this
|------------------| for
|------------------| had
|-----------------| but
|-----------------| be
|----------------| not
|---------------| they
|---------------| so
|--------------| very

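Transact SQL set-based solution (SQL Server 2005) 1063 892 873 853 827 820 783 683 647 644 630 characters

Thanks to Gabe for some useful suggestions to reduce the character count.

NB: Line breaks were added to avoid scrollbars; only the last line break is required.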

DECLARE @ VARCHAR(MAX),@F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A',
SINGLE_BLOB)x;WITH N AS(SELECT 1 i,LEFT(@,1)L UNION ALL SELECT i+1,SUBSTRING
(@,i+1,1)FROM N WHERE i<LEN(@))SELECT i,L,i-RANK()OVER(ORDER BY i)R INTO #D
FROM N WHERE L LIKE'[A-Z]'OPTION(MAXRECURSION 0)SELECT TOP 22 W,-COUNT(*)C
INTO # FROM(SELECT DISTINCT R,(SELECT''+L FROM #D WHERE R=b.R FOR XML PATH
(''))W FROM #D b)t WHERE LEN(W)>1 AND W NOT IN('the','and','of','to','it',
'in','or','is')GROUP BY W ORDER BY C SELECT @F=MIN(($76-LEN(W))/-C),@=' '+
REPLICATE('_',-MIN(C)*@F)+' 'FROM # SELECT @=@+'
|'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @

Readable version

DECLARE @  VARCHAR(MAX),
@F REAL
SELECT @=BulkColumn
FROM   OPENROWSET(BULK'A',SINGLE_BLOB)x; /*  Loads text file from path
C:\WINDOWS\system32\A  */


/*Recursive common table expression to
generate a table of numbers from 1 to string length
(and associated characters)*/
WITH N AS
(SELECT 1 i,
LEFT(@,1)L


UNION ALL


SELECT i+1,
SUBSTRING(@,i+1,1)
FROM   N
WHERE  i<LEN(@)
)
SELECT   i,
L,
i-RANK()OVER(ORDER BY i)R
/*Will group characters
from the same word together*/
INTO     #D
FROM     N
WHERE    L LIKE'[A-Z]'OPTION(MAXRECURSION 0)
/*Assuming case insensitive accent sensitive collation*/


SELECT   TOP 22 W,
-COUNT(*)C
INTO     #
FROM     (SELECT DISTINCT R,
(SELECT ''+L
FROM    #D
WHERE   R=b.R FOR XML PATH('')
)W
/*Reconstitute the word from the characters*/
FROM             #D b
)
T
WHERE    LEN(W)>1
AND      W NOT IN('the',
'and',
'of' ,
'to' ,
'it' ,
'in' ,
'or' ,
'is')
GROUP BY W
ORDER BY C


/*Just noticed this looks risky as it relies on the order of evaluation of the
variables. I'm not sure that's guaranteed but it works on my machine :-) */
SELECT @F=MIN(($76-LEN(W))/-C),
@ =' '      +REPLICATE('_',-MIN(C)*@F)+' '
FROM   #


SELECT @=@+'
|'+REPLICATE('_',-C*@F)+'| '+W
FROM     #
ORDER BY C


PRINT @

Output

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| You
|____________________________________________________________| said
|_____________________________________________________| Alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| This
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| So
|___________________| very
|__________________| what

And with the long string

 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| Alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| at
|_________________________| with
|_______________________| on
|______________________| all
|____________________| This
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| So
|________________| very
|________________| what
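The `i-RANK()OVER(ORDER BY i)` expression in the query above is the part that groups characters into words: for letters in one unbroken run, position minus rank is constant. A minimal Python sketch of that idea (the function name words_by_islands is invented for this illustration, and it is not a translation of the whole query):

```python
from itertools import groupby

def words_by_islands(text):
    # Keep only the letters, remembering their 1-based position i.
    letters = [
        (i, ch) for i, ch in enumerate(text, start=1)
        if ch.isascii() and ch.isalpha()
    ]
    # rank = position among the kept letters; i - rank stays the same
    # for as long as the letters are adjacent in the original text.
    keyed = [(i - rank, ch) for rank, (i, ch) in enumerate(letters, start=1)]
    return [
        "".join(ch for _, ch in grp)
        for _, grp in groupby(keyed, key=lambda p: p[0])
    ]

print(words_by_islands("don't stop"))  # prints ['don', 't', 'stop']
```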

R, 298 chars

f=scan("stdin","ch")
u=unlist
s=strsplit
a=u(s(u(s(tolower(f),"[^a-z]")),"^(the|and|of|to|it|in|or|is|.|)$"))
v=unique(a)
r=sort(sapply(v,function(i) sum(a==i)),T)[2:23]  #the first item is an empty string, just skipping it
w=names(r)
q=(78-max(nchar(w)))*r/max(r)
cat(" ",rep("_",q[1])," \n",sep="")
for(i in 1:22){cat("|",rep("_",q[i]),"| ",w[i],"\n",sep="")}

The output is:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

And if "you" is replaced by something longer:

 ____________________________________________________________
|____________________________________________________________| she
|____________________________________________________| veryverylongstring
|__________________________________________________| said
|___________________________________________| alice
|______________________________________| was
|___________________________________| that
|_____________________________| as
|__________________________| her
|________________________| at
|________________________| with
|______________________| on
|_____________________| all
|___________________| this
|___________________| for
|___________________| had
|__________________| but
|__________________| be
|__________________| not
|________________| they
|________________| so
|_______________| very
|_______________| what

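LabVIEW 51 nodes, 5 structures, 10 diagrams

Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.

[image: LabVIEW code]

[image: results]

The program flows from left to right:

[image: LabVIEW code explained]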

Python 290, 255, 253


290 chars in Python (text read from standard input)

import sys,re
c={}
for w in re.findall("[a-z]+",sys.stdin.read().lower()):c[w]=c.get(w,0)+1-(","+w+","in",a,i,the,and,of,to,it,in,or,is,")
r=sorted((-v,k)for k,v in c.items())[:22]
sf=max((76.0-len(k))/v for v,k in r)
print" "+"_"*int(r[0][0]*sf)
for v,k in r:print"|"+"_"*int(v*sf)+"| "+k

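But... after reading the other solutions I suddenly realized that efficiency was not a requirement; so here is another, shorter and much slower one (255 characters)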

import sys,re
w=re.findall("\w+",sys.stdin.read().lower())
r=sorted((-w.count(x),x)for x in set(w)-set("the and of to a i it in or is".split()))[:22]
f=max((76.-len(k))/v for v,k in r)
print" "+"_"*int(f*r[0][0])
for v,k in r:print"|"+"_"*int(f*v)+"| "+k

After reading some other solutions...

import sys,re
w=re.findall("\w+",sys.stdin.read().lower())
r=sorted((-w.count(x),x)for x in set(w)-set("the and of to a i it in or is".split()))[:22]
f=max((76.-len(k))/v for v,k in r)
print"","_"*int(f*r[0][0])
for v,k in r:print"|"+"_"*int(f*v)+"|",k

And now this solution is almost byte-for-byte identical to Astatine's one :-D

C (828)

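It looks a lot like obfuscated code, and uses glib for strings, lists and hashes. The char count with wc -m says 828. It does not consider single-char words. To calculate the maximum length of the bar, it considers the longest word among all words, not only the first 22. Is this a deviation from the spec?

It does not handle failures and it does not release used memory.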

#include <glib.h>
#define S(X)g_string_##X
#define H(X)g_hash_table_##X
GHashTable*h;int m,w=0,z=0;y(const void*a,const void*b){int*A,*B;A=H(lookup)(h,a);B=H(lookup)(h,b);return*B-*A;}void p(void*d,void*u){int *v=H(lookup)(h,d);if(w<22){g_printf("|");*v=*v*(77-z)/m;while(--*v>=0)g_printf("=");g_printf("| %s\n",d);w++;}}main(c){int*v;GList*l;GString*s=S(new)(NULL);h=H(new)(g_str_hash,g_str_equal);char*n[]={"the","and","of","to","it","in","or","is"};while((c=getchar())!=-1){if(isalpha(c))S(append_c)(s,tolower(c));else{if(s->len>1){for(c=0;c<8;c++)if(!strcmp(s->str,n[c]))goto x;if((v=H(lookup)(h,s->str))!=NULL)++*v;else{z=MAX(z,s->len);v=g_malloc(sizeof(int));*v=1;H(insert)(h,g_strdup(s->str),v);}}x:S(truncate)(s,0);}}l=g_list_sort(H(get_keys)(h),y);m=*(int*)H(lookup)(h,g_list_first(l)->data);g_list_foreach(l,p,NULL);}

Common LISP, 670 characters

I'm a LISP newbie, and this is an attempt using a hash table for counting (so probably not the most compact method).

(flet((r()(let((x(read-char t nil)))(and x(char-downcase x)))))(do((c(
make-hash-table :test 'equal))(w NIL)(x(r)(r))y)((not x)(maphash(lambda
(k v)(if(not(find k '("""the""and""of""to""a""i""it""in""or""is"):test
'equal))(push(cons k v)y)))c)(setf y(sort y #'> :key #'cdr))(setf y
(subseq y 0(min(length y)22)))(let((f(apply #'min(mapcar(lambda(x)(/(-
76.0(length(car x)))(cdr x)))y))))(flet((o(n)(dotimes(i(floor(* n f)))
(write-char #\_))))(write-char #\Space)(o(cdar y))(write-char #\Newline)
(dolist(x y)(write-char #\|)(o(cdr x))(format t "| ~a~%"(car x))))))
(cond((char<= #\a x #\z)(push x w))(t(incf(gethash(concatenate 'string(
reverse w))c 0))(setf w nil)))))

It can be run, for example, with cat alice.txt | clisp -C golf.lisp.

In readable form it is

(flet ((r () (let ((x (read-char t nil)))
(and x (char-downcase x)))))
(do ((c (make-hash-table :test 'equal))  ; the word count map
w y                                 ; current word and final word list
(x (r) (r)))  ; iteration over all chars
((not x)


; make a list with (word . count) pairs removing stopwords
(maphash (lambda (k v)
(if (not (find k '("" "the" "and" "of" "to"
"a" "i" "it" "in" "or" "is")
:test 'equal))
(push (cons k v) y)))
c)


; sort and truncate the list
(setf y (sort y #'> :key #'cdr))
(setf y (subseq y 0 (min (length y) 22)))


; find the scaling factor
(let ((f (apply #'min
(mapcar (lambda (x) (/ (- 76.0 (length (car x)))
(cdr x)))
y))))
; output
(flet ((outx (n) (dotimes (i (floor (* n f))) (write-char #\_))))
(write-char #\Space)
(outx (cdar y))
(write-char #\Newline)
(dolist (x y)
(write-char #\|)
(outx (cdr x))
(format t "| ~a~%" (car x))))))


; add alphabetic to current word, and bump word counter
; on non-alphabetic
(cond
((char<= #\a x #\z)
(push x w))
(t
(incf (gethash (concatenate 'string (reverse w)) c 0))
(setf w nil)))))

Shell, 228 characters, with working 80 chars constraint

tr A-Z a-z|tr -Cs a-z "\n"|sort|egrep -v "^(the|and|of|to|a|i|it|in|or|is)$" |uniq -c|sort -r|head -22>g
n=1
while :
do
awk '{printf "|%0*s| %s\n",$1*'$n'/1e3,"",$2;}' g|tr 0 _>o
egrep -q .{80} o&&break
n=$((n+1))
done
cat o

I'm surprised nobody seems to use the amazing * feature of printf.
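For anyone who has not met it: a `*` in a printf format takes the field width from the argument list, which is how the awk call above draws a bar of variable length. Python's printf-style `%` operator accepts `*` too, so the trick can be sketched like this (bar_line is a made-up helper name, not part of the script):

```python
def bar_line(width, word):
    # "%0*d" pads the number 0 with zeros up to `width` digits; the width
    # itself comes from the argument tuple, as with awk's printf.
    # Note that a width of 0 still prints one digit.
    zeros = "%0*d" % (width, 0)
    # Swap the padding zeros for underscores, like `tr 0 _` in the script.
    return "|" + zeros.replace("0", "_") + "| " + word

print(bar_line(5, "she"))  # prints |_____| she
```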

cat 11-very.txt | golf.sh

|__________________________________________________________________________| she
|________________________________________________________________| you
|_____________________________________________________________| said
|______________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|________________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so

cat 11 | golf.sh

|_________________________________________________________________| she
|_________________________________________________________| verylongstringstring
|______________________________________________________| said
|_______________________________________________| alice
|__________________________________________| was
|_______________________________________| that
|________________________________| as
|_____________________________| her
|___________________________| with
|___________________________| at
|__________________________| s
|_________________________| t
|________________________| on
|_______________________| all
|_____________________| this
|_____________________| for
|_____________________| had
|____________________| but
|___________________| be
|___________________| not
|__________________| they
|__________________| so

Object Rexx 4.0 with PC-Pipes

Where the PC-Pipes library can be found.
This solution ignores single-letter words.


address rxpipe 'pipe (end ?) < Alice.txt',
'|regex split /[^a-zA-Z]/', -- split at non-alphabetic character
'|locate 2',                -- discard words shorter than 2 chars
'|xlate lower',             -- translate all words to lower case
,                           -- discard list words that match list
'|regex not match /^(the||and||of||to||it||in||or||is)$/',
'|l:lookup autoadd before count',  -- accumulate and count words
'? l:',                       -- no master records to feed into lookup
'? l:',                       -- list of counted words comes here
,                           -- columns 1-10 hold count, 11-n hold word
'|sort 1.10 d',             -- sort in descending order by count
'|take 22',                 -- take first 22 records only
'|array wordlist',          -- store into a rexx array
'|count max',               -- get length of longest record
'|var maxword'              -- save into a rexx variable


parse value wordlist[1] with count 11 .  -- get frequency of first word
barunit = count % (76-(maxword-10))      -- frequency units per chart bar char


say ' '||copies('_', (count+barunit)%barunit)  -- first line of the chart
do cntwd over wordlist
parse var cntwd count 11 word          -- get word frequency and the word
say '|'||copies('_', (count+barunit)%barunit)||'| '||word||' '
end
The output produced
________________________________________________________________________________
|________________________________________________________________________________| she
|_____________________________________________________________________| you
|___________________________________________________________________| said
|__________________________________________________________| alice
|____________________________________________________| was
|________________________________________________| that
|________________________________________| as
|____________________________________| her
|_________________________________| at
|_________________________________| with
|______________________________| on
|_____________________________| all
|__________________________| this
|__________________________| for
|__________________________| had
|__________________________| but
|________________________| be
|________________________| not
|_______________________| they
|______________________| so
|_____________________| very
|_____________________| what

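Yet another python 2.x - 206 chars (or 232 with width bar)

I believe this one is fully compliant with the question. The ignore list is here, and it fully checks line length (see the example, where I replaced Alice with Aliceinwonderlandbylewiscarroll throughout the text, making the fifth item the longest line). Even the filename is provided from the command line instead of being hardcoded (hardcoding it would remove about 10 chars). It has one drawback (but I believe it is OK with the question): because it computes an integer divider to make the lines shorter than 80 chars, the longest line is shorter than 80 characters, not exactly 80 characters. The python 3.x version does not have this defect (but is way longer).

I also believe it is not that hard to read.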

import sys,re
t=re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",sys.stdin.read().lower())
b=sorted((-t.count(x),x)for x in set(t))[:22]
for l,w in b:print"|"+l/min(z/(78-len(e))for z,e in b)*'-'+"|",w

|----------------------------------------------------------------| she
|--------------------------------------------------------| you
|-----------------------------------------------------| said
|----------------------------------------------| aliceinwonderlandbylewiscarroll
|-----------------------------------------| was
|--------------------------------------| that
|-------------------------------| as
|----------------------------| her
|--------------------------| at
|--------------------------| with
|-------------------------| s
|-------------------------| t
|-----------------------| on
|-----------------------| all
|---------------------| this
|--------------------| for
|--------------------| had
|--------------------| but
|-------------------| be
|-------------------| not
|------------------| they
|-----------------| so

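As it is not clear whether we have to print the max bar alone on its own line (as in the sample output), below is another one that does it, but at 232 chars.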

import sys,re
t=re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",sys.stdin.read().lower())
b=sorted((-t.count(x),x)for x in set(t))[:22]
f=min(z/(78-len(e))for z,e in b)
print"",b[0][0]/f*'-'
for y,w in b:print"|"+y/f*'-'+"|",w
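To make the integer-divider drawback concrete, here is a small Python 3 sketch (an illustration with the hypothetical name integer_divisor_bars, using `//` where the Python 2 code relies on `/` between integers). Because the divider is rounded to a whole number, the longest line usually ends up a little short of 80 characters.

```python
def integer_divisor_bars(pairs):
    # pairs: list of (-count, word). Floor division of a negative count
    # rounds away from zero, so the divider is never too small.
    divider = min(z // (78 - len(w)) for z, w in pairs)
    return [(z // divider, w) for z, w in pairs]

pairs = [(-553, "she"), (-403, "aliceinwonderlandbylewiscarroll")]
for bar, word in integer_divisor_bars(pairs):
    line = "|" + "-" * bar + "| " + word
    print(len(line), line)
```

With these two counts the lines come out at 67 and 78 characters, so nothing overflows but the full width is not used either.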

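Python 3.x - 256 chars

Using the Counter class from python 3.x, there were high hopes of making it shorter (as Counter does everything we need here). It turns out it is not better. Below is my attempt, at 266 chars: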

import sys,re,collections as c
b=c.Counter(re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",
sys.stdin.read().lower())).most_common(22)
F=lambda p,x,w:print(p+'-'*int(x/max(z/(77.-len(e))for e,z in b))+w)
F(" ",b[0][1],"")
for w,y in b:F("|",y,"| "+w)

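The problem is that collections and most_common are very long words, and even Counter is not short... really, not using Counter makes the code only 2 characters longer ;-(

Python 3.x also introduces other constraints: dividing two integers no longer yields an integer (so we have to cast to int), print is now a function (parentheses must be added), etc. That is why it comes out 22 chars longer than the python 2.x version, but it is faster. Maybe some more experienced python 3.x coder will have ideas for shortening the code.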

Ruby 205


This Ruby version handles "superlongstringstring". (The first two lines are almost identical to the previous Ruby programs.)

It must be run like this:

ruby -n0777 golf.rb Alice.txt


W=($_.upcase.scan(/\w+/)-%w(THE AND OF TO A I IT
IN OR IS)).group_by{|x|x}.map{|k,v|[-v.size,k]}.sort[0,22]
u=proc{|m|"_"*(W.map{|n,s|(76.0-s.size)/n}.max*m)}
puts" "+u[W[0][0]],W.map{|n,s|"|%s| "%u[n]+s}

The third line creates a closure or lambda that produces a correctly scaled string of underscores:

u = proc{|m|
"_" *
(W.map{|n,s| (76.0 - s.size)/n}.max * m)
}

.max is used instead of .min because the numbers are negative.
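A quick Python sketch of why max is the right choice here, assuming (as in the Ruby code) that counts are stored negated. The helper names scale and bar are invented for this example. Every ratio is negative, so the largest one is the smallest in magnitude, and multiplying it by a negated count gives a positive width.

```python
def scale(pairs):
    # pairs: list of (-count, word); each ratio below is negative, and the
    # maximum is the one closest to zero, i.e. the tightest constraint.
    return max((76.0 - len(w)) / n for n, w in pairs)

def bar(n, pairs):
    # negative scale times negative count gives a positive bar width
    return "_" * int(scale(pairs) * n)

pairs = [(-8, "ab"), (-4, "abcdefghijklmnopqrstuvwxyz")]
print(scale(pairs))            # prints -9.25
print(len(bar(-8, pairs)))     # prints 74
print(len(bar(-4, pairs)))     # prints 37
```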

Scala 2.8, 311 314 320 330 332 336 341 375 characters

Including the long-word adjustment. Ideas borrowed from the other solutions.

Now as a script (a.scala):

val t="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile(argv(0)).mkString.toLowerCase).toSeq.groupBy(w=>w).mapValues(_.size).toSeq.sortBy(-_._2)take 22
def b(p:Int)="_"*(p*(for((w,c)<-t)yield(76.0-w.size)/c).min).toInt
println(" "+b(t(0)._2))
for(p<-t)printf("|%s| %s \n",b(p._2),p._1)

Run with

scala -howtorun:script a.scala alice.txt

BTW, the edit from 314 to 311 characters actually removes only 1 character. Someone got the counting wrong before (Windows CRs?).

Bourne shell 213/240 characters

Improving on the shell version posted previously, I can get it down to 213 characters:

tr A-Z a-z|tr -Cs a-z \\n|sort|egrep -v '^(the|and|of|to|a|i|it|in|or|is)$'|uniq -c|sort -rn|sed 22q>g
n=1
>o
until egrep -q .{80} o
do
awk '{printf "|%0*d| %s\n",$1*'$n'/1e3,0,$2}' g|tr 0 _>o
((n++))
done
cat o

To get the outline on the top bar, I had to expand it to 240 characters:

tr A-Z a-z|tr -Cs a-z \\n|sort|egrep -v "^(the|and|of|to|a|i|it|in|or|is)$"|uniq -c|sort -r|sed 1p\;22q>g
n=1
>o
until egrep -q .{80} o
do
awk '{printf "|%0*d| %s\n",$1*'$n'/1e3,0,NR==1?"":$2}' g|sed '1s,|, ,g'|tr 0 _>o
((n++))
done
cat o

shell, grep, tr, grep, sort, uniq, sort, head, perl - 194 chars

Adding some -i flags can drop the overly long tr A-Z a-z | step; the spec says nothing about the case displayed, and uniq -ci drops any case differences.

egrep -oi [a-z]+|egrep -wiv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -ci|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'

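That is -11 for the tr plus 2 for the -i flags, compared with the original 206 chars.

Edit: minus 3 for the \\b's, which can be left out because the pattern match will start on a boundary anyway.

sort gives lowercase first, and uniq -ci takes the first occurrence, so the only real change in the output is that Alice retains her uppercase initial.

Go, 613 chars, could probably be much smaller: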

package main
import(r "regexp";. "bytes";. "io/ioutil";"os";st "strings";s "sort";. "container/vector")
type z struct{c int;w string}
func(e z)Less(o interface{})bool{return o.(z).c<e.c}
func main(){b,_:=ReadAll(os.Stdin);g:=r.MustCompile
c,m,x:=g("[A-Za-z]+").AllMatchesIter(b,0),map[string]int{},g("the|and|of|it|in|or|is|to")
for w:=range c{w=ToLower(w);if len(w)>1&&!x.Match(w){m[string(w)]++}}
o,y:=&Vector{},0
for k,v:=range m{o.Push(z{v,k});if v>y{y=v}}
s.Sort(o)
for i,v:=range *o{if i>21{break};x:=v.(z);c:=int(float(x.c)/float(y)*80)
u:=st.Repeat("_",c);if i<1{println(" "+u)};println("|"+u+"| "+x.w)}}

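I feel kinda dirty.

Perl, 188 characters

The perl version above (as well as any version based on regexp splitting) can get a few bytes shorter by including the list of forbidden words as a negative lookahead assertion, rather than as a separate list. Furthermore, the trailing semicolon can be omitted.

I also included some other suggestions (- instead of <=>, for/foreach, dropped "keys")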

$c{$_}++for grep{$_}map{lc=~/\b(?!(?:the|and|a|of|or|i[nts]?|to)\b)[a-z]+/g}<>;@s=sort{$c{$b}-$c{$a}}%c;$f=76-length$s[0];say$"."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "for@s[0..21]

I don't know Perl, but I assume the (?!(?:...)\b) could lose the ?: if the handling around it is fixed up.
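The negative lookahead idea carries over to other regex engines. A minimal sketch with Python's re module (an illustration only; the names WORD and wanted_words are made up): the stop list sits inside `(?!...)`, so forbidden words are never matched in the first place and no separate filtering pass is needed.

```python
import re

# \b anchors the match at a word start; the lookahead rejects the word if
# it is exactly one of the forbidden ones (the inner \b stops "the" from
# rejecting "theory").
WORD = re.compile(r"\b(?!(?:the|and|a|of|or|i[nts]?|to)\b)[a-z]+")

def wanted_words(text):
    return WORD.findall(text.lower())

print(wanted_words("The cat and I sat in it at theory"))
# prints ['cat', 'sat', 'at', 'theory']
```

In Python the `?:` does matter: with a capturing group, findall would return the group contents instead of the whole match.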

Scala 327 characters

This was adapted from mknissl's answer, inspired by a Python version, though it is bigger. I'm leaving it here in case someone can make it shorter.

val f="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile("11.txt").mkString.toLowerCase).toSeq
val t=f.toSet[String].map(x=> -f.count(x==)->x).toSeq.sorted take 22
def b(p:Int)="_"*(-p/(for((c,w)<-t)yield-c/(76.0-w.size)).max).toInt
println(" "+b(t(0)._1))
for(p<-t)printf("|%s| %s \n",b(p._1),p._2)

Perl, 185 char

200 (slightly broken) 199 197 195 193 187 185 characters. The last two newlines are significant. Complies with the spec.

map$X{+lc}+=!/^(.|the|and|to|i[nst]|o[rf])$/i,/[a-z]+/gfor<>;
$n=$n>($:=$X{$_}/(76-y+++c))?$n:$:for@w=(sort{$X{$b}-$X{$a}}%X)[0..21];
die map{$U='_'x($X{$_}/$n);" $U
"x!$z++,"|$U| $_
"}@w

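The first line loads counts of valid words into %X.

The second line computes the minimum scaling factor so that all output lines will be <= 80 characters.

The third line (which contains two newline characters) produces the output.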

GNU Smalltalk (386)

I think it can be made a bit shorter, but I still have no idea how to do it.

|q s f m|q:=Bag new. f:=FileStream stdin. m:=0.[f atEnd]whileFalse:[s:=f nextLine.(s notNil)ifTrue:[(s tokenize:'\W+')do:[:i|(((i size)>1)&({'the'.'and'.'of'.'to'.'it'.'in'.'or'.'is'}includes:i)not)ifTrue:[q add:(i asLowercase)]. m:=m max:(i size)]]].(q:=q sortedByCount)from:1to:22 do:[:i|'|'display.((i key)*(77-m)//(q first key))timesRepeat:['='display].('| %1'%{i value})displayNl]

Clojure 282 strict

(let[[[_ m]:as s](->>(slurp *in*).toLowerCase(re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")frequencies(sort-by val >)(take 22))[b](sort(map #(/(- 76(count(key %)))(val %))s))p #(do(print %1)(dotimes[_(* b %2)](print \_))(apply println %&))](p " " m)(doseq[[k v]s](p \| v \| k)))

Somewhat more legibly:

(let[[[_ m]:as s](->> (slurp *in*)
.toLowerCase
(re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")
frequencies
(sort-by val >)
(take 22))
[b] (sort (map #(/ (- 76 (count (key %)))(val %)) s))
p #(do
(print %1)
(dotimes[_(* b %2)] (print \_))
(apply println %&))]
(p " " m)
(doseq[[k v] s] (p \| v \| k)))

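Clojure - 611 chars (not minimized)

I tried to write the code in as idiomatic Clojure as I could this late at night. I am not too proud of the draw-chart function, but I guess the code will speak volumes about Clojure's succinctness.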

(ns word-freq
(:require [clojure.contrib.io :as io]))


(defn word-freq
[f]
(take 22 (->> f
io/read-lines ;;; slurp should work too, but I love map/red
(mapcat (fn [l] (map #(.toLowerCase %) (re-seq #"\w+" l))))
(remove #{"the" "and" "of" "to" "a" "i" "it" "in" "or" "is"})
(reduce #(assoc %1 %2 (inc (%1 %2 0))) {})
(sort-by (comp - val)))))


(defn draw-chart
[fs]
(let [[[w f] & _] fs]
(apply str
(interpose \newline
(map (fn [[k v]] (apply str (concat "|" (repeat (int (* (- 76 (count w)) (/ v f 1))) "_") "| " k " ")) ) fs)))))


;;; (println (draw-chart (word-freq "/Users/ghoseb/Desktop/alice.txt")))

Output:

|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| t
|____________________________| s
|__________________________| on
|__________________________| all
|_______________________| for
|_______________________| had
|_______________________| this
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so

I know, this doesn't follow the spec, but hey, this is some very clean Clojure code that is already so small :)

Lua solution: 478 characters.

t,u={},{}for l in io.lines()do
for w in l:gmatch("%a+")do
w=w:lower()if not(" the and of to a i it in or is "):find(" "..w.." ")then
t[w]=1+(t[w]or 0)end
end
end
for k,v in next,t do
u[#u+1]={k,v}end
table.sort(u,function(a,b)return a[2]>b[2]end)m,n=u[1][2],math.min(#u,22)for w=80,1,-1 do
s=""for i=1,n do
a,b=u[i][1],w*u[i][2]/m
if b+#a>=78 then s=nil break end
s2=("_"):rep(b)if i==1 then
s=s.." " ..s2.."\n"end
s=s.."|"..s2.."| "..a.."\n"end
if s then print(s)break end end

Readable version:

t,u={},{}
for line in io.lines() do
for w in line:gmatch("%a+") do
w = w:lower()
if not (" the and of to a i it in or is "):find(" "..w.." ") then
t[w] = 1 + (t[w] or 0)
end
end
end
for k, v in pairs(t) do
u[#u+1]={k, v}
end


table.sort(u, function(a, b)
return a[2] > b[2]
end)


local max = u[1][2]
local n = math.min(#u, 22)


for w = 80, 1, -1 do
s=""
for i = 1, n do
f = u[i][2]
word = u[i][1]
width = w * f / max
if width + #word >= 78 then
s=nil
break
end
s2=("_"):rep(width)
if i==1 then
s=s.." " .. s2 .."\n"
end
s=s.."|" .. s2 .. "| " .. word.."\n"
end
if s then
print(s)
break
end
end
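The width search in the readable Lua version can be hard to see through the string building, so here is the same loop sketched in Python. This is an illustration under assumptions: widest_fit is a made-up name, and the input is taken to be (word, count) pairs already sorted by descending count. It tries candidate widths for the top bar from 80 downwards and accepts the first one where every bar plus its word stays under 78 characters.

```python
def widest_fit(pairs, limit=78):
    top = pairs[0][1]                      # count of the most frequent word
    for w in range(80, 0, -1):             # candidate width of the top bar
        widths = [w * count / top for _, count in pairs]
        if all(b + len(word) < limit
               for (word, _), b in zip(pairs, widths)):
            return [int(b) for b in widths]
    return None                            # nothing fits (very long word)

print(widest_fit([("she", 4), ("you", 2)]))  # prints [74, 37]
```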

TCL 554 strict

foreach w [regexp -all -inline {[a-z]+} [string tolower [read stdin]]] {if {[lsearch {the and of to it in or is a i} $w]>=0} {continue};if {[catch {incr Ws($w)}]} {set Ws($w) 1}}
set T [lrange [lsort -decreasing -stride 2 -index 1 -integer [array get Ws]] 0 43]
foreach {w c} $T {lappend L [string length $w];lappend C $c}
set N [tcl::mathfunc::max {*}$L]
set C [lsort -integer $C]
set M [lindex $C end]
puts " [string repeat _ [expr {int((76-$N) * [lindex $T 1] / $M)}]] "
foreach {w c} $T {puts "|[string repeat _ [expr {int((76-$N) * $c / $M)}]]| $w"}

Or, more legibly

foreach w [regexp -all -inline {[a-z]+} [string tolower [read stdin]]] {
if {[lsearch {the and of to a i it in or is} $w] >= 0} { continue }
if {[catch {incr words($w)}]} {
set words($w) 1
}
}
set topwords [lrange [lsort -decreasing -stride 2 -index 1 -integer [array get words]] 0 43]
foreach {word count} $topwords {
lappend lengths [string length $word]
lappend counts $count
}
set maxlength [lindex [lsort -integer $lengths] end]
set counts [lsort -integer $counts]
set mincount [lindex $counts 0].0
set maxcount [lindex $counts end].0
puts " [string repeat _ [expr {int((76-$maxlength) * [lindex $topwords 1] / $maxcount)}]] "
foreach {word count} $topwords {
set barlength [expr {int((76-$maxlength) * $count / $maxcount)}]
puts "|[string repeat _ $barlength]| $word"
}

Groovy, 250

Code:

m=[:]
(new URL(args[0]).text.toLowerCase()=~/\w+/).each{it==~/(the|and|of|to|a|i[tns]?|or)/?:(m[it]=1+(m[it]?:0))}
k=m.keySet().sort{a,b->m[b]<=>m[a]}
b={d,c,b->println d+'_'*c+d+' '+b}
b' ',z=77-k[0].size(),''
k[0..21].each{b'|',m[it]*z/m[k[0]],it}

Execution:

$ groovy wordcount.groovy http://www.gutenberg.org/files/11/11.txt

Output:

 __________________________________________________________________________
|__________________________________________________________________________| she
|________________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|________________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so

NB: This follows the relaxed rule re: long strings

Another T-SQL solution, borrowing some ideas from Martin's solution (min76- etc).

declare @ varchar(max),@w real,@j int;select s=@ into[ ]set @=(select*
from openrowset(bulk'a',single_blob)a)while @>''begin set @=stuff(@,1,
patindex('%[a-z]%',@)-1,'')+'.'set @j=patindex('%[^a-z]%',@)if @j>2insert[ ]
select lower(left(@,@j-1))set @=stuff(@,1,@j,'')end;select top(22)s,count(*)
c into # from[ ]where',the,and,of,to,it,in,or,is,'not like'%,'+s+',%'
group by s order by 2desc;select @w=min((76.-len(s))/c),@=' '+replicate(
'_',max(c)*@w)from #;select @=@+'
|'+replicate('_',c*@w)+'| '+s+' 'from #;print @

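The entire solution should be on two lines (concatenate the first 7), although you can cut, paste and run it as is. Total chars = 507 (counting each line break as 1 if saved in Unix format and executed using SQLCMD)

Assumptions:

  1. There is no temp table #
  2. There is no table named [ ]
  3. The input is in the default system folder, e.g. C:\windows\system32\a
  4. Your query window has "set nocount on" active (prevents spurious "rows affected" msgs)

To get on the solution list (<500 char), here is the "relaxed" edition at 483 chars (no vertical bars/no top bar/no trailing space after word)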

declare @ varchar(max),@w real,@j int;select s=@ into[ ]set @=(select*
from openrowset(bulk'b',single_blob)a)while @>''begin set @=stuff(@,1,
patindex('%[a-z]%',@)-1,'')+'.'set @j=patindex('%[^a-z]%',@)if @j>2insert[ ]
select lower(left(@,@j-1))set @=stuff(@,1,@j,'')end;select top(22)s,count(*)
c into # from[ ]where',the,and,of,to,it,in,or,is,'not like'%,'+s+',%'
group by s order by 2desc;select @w=min((78.-len(s))/c),@=''from #;select @=@+'
'+replicate('_',c*@w)+' '+s from #;print @

Readable version

declare @ varchar(max), @w real, @j int
select s=@ into[ ] -- shortcut to create table; use defined variable to specify column type
-- openrowset reads an entire file
set @=(select * from openrowset(bulk'a',single_blob) a) -- a bit shorter than naming 'BulkColumn'


while @>'' begin -- loop until input is empty
set @=stuff(@,1,patindex('%[a-z]%',@)-1,'')+'.' -- remove lead up to first A-Z char *
set @j=patindex('%[^a-z]%',@) -- find first non A-Z char. The +'.' above makes sure there is one
if @j>2insert[ ] select lower(left(@,@j-1)) -- insert only words >1 char
set @=stuff(@,1,@j,'') -- remove word and trailing non A-Z char
end;


select top(22)s,count(*)c
into #
from[ ]
where ',the,and,of,to,it,in,or,is,' not like '%,'+s+',%' -- exclude list
group by s
order by 2desc; -- highest occurrence, assume no ties at 22!


-- 80 - 2 vertical bars - 2 spaces = 76
-- @w = weighted frequency
-- this produces a line equal to the length of the max occurrence (max(c))
select @w=min((76.-len(s))/c),@=' '+replicate('_',max(c)*@w)
from #;


-- for each word, append it as a new line. note: embedded newline
select @=@+'
|'+replicate('_',c*@w)+'| '+s+' 'from #;
-- note: 22 words in a table should always fit on an 8k page
--       the order of processing should always be the same as the insert-orderby
--       thereby producing the correct output


print @ -- output
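
For readers who do not use T-SQL, the scaling step above (`min((76.-len(s))/c)`) can be sketched in Python. This is only an illustration of the idea, not a port of the answer: the function name `bar_chart` and its parameters are made up here, and it uses exact `Fraction` arithmetic where the T-SQL relies on `real`, so bar lengths may differ by one character in edge cases.

```python
from fractions import Fraction

def bar_chart(freqs, width=80):
    """Render (word, count) pairs as bars that fit within `width` columns.

    `freqs` is assumed to be sorted by descending count already.
    """
    # 2 vertical bars + 2 spaces are fixed overhead on every line: 80 - 4 = 76.
    budget = width - 4
    # The tightest word/count pair decides the scale, so no line can overflow.
    scale = min(Fraction(budget - len(word), count) for word, count in freqs)
    top = max(count for _, count in freqs)
    # Top border: a space followed by a bar as long as the longest one.
    lines = [" " + "_" * int(top * scale)]
    for word, count in freqs:
        # int() truncates, like the implicit real-to-int conversion in replicate().
        lines.append("|" + "_" * int(count * scale) + "| " + word + " ")
    return "\n".join(lines)

if __name__ == "__main__":
    print(bar_chart([("she", 553), ("superlongstringstring", 481), ("said", 462)]))
```

Taking the minimum over all words is what handles the second example in the challenge: a long word with a high count lowers the scale for every bar, so its own line ends at exactly 80 characters.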

Q,194

{t::y;{(-1')t#(.:)[b],'(!:)[b:"|",/:(((_)70*x%(*:)x)#\:"_"),\:"|"];}desc(#:')(=)($)(`$inter\:[(,/)" "vs'" "sv/:"'"vs'a(&)0<(#:')a:(_:')read0 -1!x;52#.Q.an])except`the`and`of`to`a`i`it`in`or`is`}

The function takes two parameters: the file containing the text, and the number of chart lines to display.

q){t::y;{(-1')t#(.:)[b],'(!:)[b:"|",/:(((_)70*x%(*:)x)#\:"_"),\:"|"];}desc(#:')(=)($)(`$inter\:[(,/)" "vs'" "sv/:"'"vs'a(&)0<(#:')a:(_:')read0 -1!x;52#.Q.an])except`the`and`of`to`a`i`it`in`or`is`}[`a.txt;20]

Output

|______________________________________________________________________|she
|____________________________________________________________|you
|__________________________________________________________|said
|___________________________________________________|alice
|_____________________________________________|was
|_________________________________________|that
|__________________________________|as
|_______________________________|her
|_____________________________|with
|____________________________|at
|___________________________|t
|___________________________|s
|_________________________|on
|_________________________|all
|_______________________|this
|______________________|for
|______________________|had
|_____________________|but
|_____________________|be
|_____________________|not
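
The bar lengths in this output appear to come from the `(_)70*x%(*:)x` part of the function: each count is scaled against the largest count and floored, with the longest bar fixed at 70 characters. A rough Python sketch of that step follows; the name `simple_chart` and the `max_bar` parameter are illustrative and not taken from the Q code.

```python
def simple_chart(freqs, rows, max_bar=70):
    """Scale each count to floor(max_bar * count / largest count).

    `freqs` is assumed to be (word, count) pairs sorted by descending count.
    """
    top = freqs[0][1]  # largest count, since the list is sorted
    return [
        "|" + "_" * (max_bar * count // top) + "|" + word
        for word, count in freqs[:rows]
    ]

if __name__ == "__main__":
    for line in simple_chart([("she", 553), ("you", 481), ("said", 462)], 3):
        print(line)
```

Because the maximum bar width is a constant here, a line stays within 80 characters only while the word is at most 8 characters long; the scale does not adapt to long words.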