我想按两列对数据帧进行分组,然后在这些组中对聚合的结果进行排序。

In [167]: df

Out[167]:
   count     job source
0      2   sales      A
1      4   sales      B
2      6   sales      C
3      3   sales      D
4      7   sales      E
5      5  market      A
6      3  market      B
7      2  market      C
8      4  market      D
9      1  market      E


In [168]: df.groupby(['job','source']).agg({'count':sum})

Out[168]:
               count
job    source       
market A           5
       B           3
       C           2
       D           4
       E           1
sales  A           2
       B           4
       C           6
       D           3
       E           7

我现在想在每个组中按降序对“count”列排序,然后只取前三行。得到类似这样的东西:

                count
job     source
market  A           5
        D           4
        B           3
sales   E           7
        C           6
        B           4

这个问题是由我的另一个问题:如何在cdef等待?

网络上有大量关于asyncio的文章和博客文章,但它们都非常肤浅。我找不到任何关于asyncio实际是如何实现的,以及什么使I/O异步的信息。我试图阅读源代码,但它有数千行不是最高级的C代码,其中很多处理辅助对象,但最重要的是,它很难将Python语法和它将转换成的C代码联系起来。

Asycnio自己的文档就更没有帮助了。这里没有关于它如何工作的信息,只有一些关于如何使用它的指南,这些指南有时也会误导/写得很糟糕。

我熟悉Go的协程实现,并希望Python也能做同样的事情。如果是这样的话,我在上面链接的帖子中出现的代码应该是有效的。既然它没有,我现在正试图找出原因。到目前为止,我最好的猜测如下,请纠正我的错误:

Procedure definitions of the form async def foo(): ... are actually interpreted as methods of a class inheriting coroutine. Perhaps, async def is actually split into multiple methods by await statements, where the object, on which these methods are called is able to keep track of the progress it made through the execution so far. If the above is true, then, essentially, execution of a coroutine boils down to calling methods of coroutine object by some global manager (loop?). The global manager is somehow (how?) aware of when I/O operations are performed by Python (only?) code and is able to choose one of the pending coroutine methods to execute after the current executing method relinquished control (hit on the await statement).

换句话说,这是我试图将一些asyncio语法“糖化”成更容易理解的东西:

async def coro(name):
    print('before', name)
    await asyncio.sleep()
    print('after', name)

asyncio.gather(coro('first'), coro('second'))

# translated from async def coro(name)
class Coro(coroutine):
    def before(self, name):
        print('before', name)

    def after(self, name):
        print('after', name)

    def __init__(self, name):
        self.name = name
        self.parts = self.before, self.after
        self.pos = 0

    def __call__():
        self.parts[self.pos](self.name)
        self.pos += 1

    def done(self):
        return self.pos == len(self.parts)


# translated from asyncio.gather()
class AsyncIOManager:

    def gather(*coros):
        while not every(c.done() for c in coros):
            coro = random.choice(coros)
            coro()

Should my guess prove correct: then I have a problem. How does I/O actually happen in this scenario? In a separate thread? Is the whole interpreter suspended and I/O happens outside the interpreter? What exactly is meant by I/O? If my python procedure called C open() procedure, and it in turn sent interrupt to kernel, relinquishing control to it, how does Python interpreter know about this and is able to continue running some other code, while kernel code does the actual I/O and until it wakes up the Python procedure which sent the interrupt originally? How can Python interpreter in principle, be aware of this happening?

如果我有一张桌子

CREATE TABLE users (
  id int(10) unsigned NOT NULL auto_increment,
  name varchar(255) NOT NULL,
  profession varchar(255) NOT NULL,
  employer varchar(255) NOT NULL,
  PRIMARY KEY  (id)
)

我想获得所有专业领域的独特价值,什么会更快(或建议):

SELECT DISTINCT u.profession FROM users u

or

SELECT u.profession FROM users u GROUP BY u.profession

?

我玩周围的MongoDB试图弄清楚如何做一个简单的

SELECT province, COUNT(*) FROM contest GROUP BY province

但是我似乎不能用聚合函数来算出来。我可以用一些奇怪的组语法来做

db.user.group({
    "key": {
        "province": true
    },
    "initial": {
        "count": 0
    },
    "reduce": function(obj, prev) {
        if (true != null) if (true instanceof Array) prev.count += true.length;
        else prev.count++;
    }
});

但是是否有更简单/更快的方法使用聚合函数?

我需要计数每个域中的唯一ID值。

我有数据:

ID, domain
123, 'vk.com'
123, 'vk.com'
123, 'twitter.com'
456, 'vk.com'
456, 'facebook.com'
456, 'vk.com'
456, 'google.com'
789, 'twitter.com'
789, 'vk.com'

我试试df。groupby([‘域’,‘身份证’]).count ()

但是我想要

domain, count
vk.com   3
twitter.com   2
facebook.com   1
google.com   1

我得到一个错误

列的员工。EmpID'在选择列表中无效,因为它是无效的 不包含在聚合函数或GROUP BY子句中。


select loc.LocationID, emp.EmpID
from Employee as emp full join Location as loc 
on emp.LocationID = loc.LocationID
group by loc.LocationID 

这种情况符合Bill Karwin给出的答案。

修正以上,符合ExactaBox -的答案

select loc.LocationID, count(emp.EmpID) -- not count(*), don't want to count nulls
from Employee as emp full join Location as loc 
on emp.LocationID = loc.LocationID
group by loc.LocationID 

原来的问题

对于SQL查询-

select *
from Employee as emp full join Location as loc 
on emp.LocationID = loc.LocationID
group by (loc.LocationID)

我不明白为什么会得到这个错误。我所要做的就是连接这些表,然后将所有员工分组在一个特定的位置。

对于我自己的问题,我想我有一个部分的解释。告诉我是否可以

为了对在同一位置工作的所有员工进行分组,我们必须首先提到LocationID。

然后,我们不能/不提到它旁边的每个员工ID。相反,我们提到该位置的员工总数,即我们应该SUM()在该位置工作的员工。我不知道为什么我们要用后一种方式。 因此,这就解释了“它不包含在任何一个聚合函数中”这部分错误。

如何解释错误的GROUP BY子句部分?

有人给我发送了一个SQL查询,其中GROUP BY子句由语句组成:GROUP BY 1。

这一定是打错了,对吧?没有列被赋予别名1。这意味着什么呢?我认为这一定是打错了,对吗?

我正在使用这个数据框架:

Fruit   Date      Name  Number
Apples  10/6/2016 Bob    7
Apples  10/6/2016 Bob    8
Apples  10/6/2016 Mike   9
Apples  10/7/2016 Steve 10
Apples  10/7/2016 Bob    1
Oranges 10/7/2016 Bob    2
Oranges 10/6/2016 Tom   15
Oranges 10/6/2016 Mike  57
Oranges 10/6/2016 Bob   65
Oranges 10/7/2016 Tony   1
Grapes  10/7/2016 Bob    1
Grapes  10/7/2016 Tom   87
Grapes  10/7/2016 Bob   22
Grapes  10/7/2016 Bob   12
Grapes  10/7/2016 Tony  15

我想按名称聚合,然后按水果,以获得每个名称的水果总数。例如:

Bob,Apples,16

我尝试按名称和水果分组,但我如何得到水果的总数?

我用WAMP服务器在我的windows PC上使用MySQL 5.7.13

我的问题是在执行这个查询时

SELECT *
FROM `tbl_customer_pod_uploads`
WHERE `load_id` = '78' AND
      `status` = 'Active'
GROUP BY `proof_type`

我总是得到这样的错误

SELECT列表中的表达式#1不在GROUP BY子句中,包含未聚合的列returntr_prod.tbl_customer_pod_uploads。id',它不依赖于GROUP BY子句中的列;这与sql_mode=only_full_group_by不兼容

你能告诉我最好的解决办法吗?

我需要这样的结果

+----+---------+---------+---------+----------+-----------+------------+---------------+--------------+------------+--------+---------------------+---------------------+
| id | user_id | load_id | bill_id | latitude | langitude | proof_type | document_type | file_name    | is_private | status | createdon           | updatedon           |
+----+---------+---------+---------+----------+-----------+------------+---------------+--------------+------------+--------+---------------------+---------------------+
|  1 |       1 | 78      | 1       | 21.1212  | 21.5454   |          1 |             1 | id_Card.docx |          0 | Active | 2017-01-27 11:30:11 | 2017-01-27 11:30:14 |
+----+---------+---------+---------+----------+-----------+------------+---------------+--------------+------------+--------+---------------------+---------------------+

文档展示了如何在一个groupby对象上同时应用多个函数,使用输出列名作为键的dict:

In [563]: grouped['D'].agg({'result1' : np.sum,
   .....:                   'result2' : np.mean})
   .....:
Out[563]: 
      result2   result1
A                      
bar -0.579846 -1.739537
foo -0.280588 -1.402938

但是,这只适用于Series groupby对象。当dict类似地通过DataFrame传递给一个组时,它期望键是函数将应用到的列名。

What I want to do is apply multiple functions to several columns (but certain columns will be operated on multiple times). Also, some functions will depend on other columns in the groupby object (like sumif functions). My current solution is to go column by column, and doing something like the code above, using lambdas for functions that depend on other rows. But this is taking a long time, (I think it takes a long time to iterate through a groupby object). I'll have to change it so that I iterate through the whole groupby object in a single run, but I'm wondering if there's a built in way in pandas to do this somewhat cleanly.

例如,我曾经尝试过

grouped.agg({'C_sum' : lambda x: x['C'].sum(),
             'C_std': lambda x: x['C'].std(),
             'D_sum' : lambda x: x['D'].sum()},
             'D_sumifC3': lambda x: x['D'][x['C'] == 3].sum(), ...)

但正如预期的那样,我得到一个KeyError(因为键必须是一列,如果agg从一个DataFrame调用)。

是否有任何内置的方式来做我想做的事情,或者这种功能可能会被添加,或者我只需要手动遍历组?