Message ID | 1409170479-29955-4-git-send-email-lina.iyer@linaro.org |
---|---|
State | New |
Headers | show |
On Wed, 27 Aug 2014, Lina Iyer wrote: > PM QoS and other idle frameworks can do a better job of addressing power > and performance requirements for a cpu, knowing the IRQs that are > affine to that cpu. If a performance request is placed against serving > the IRQ faster and if the IRQ is affine to a set of cpus, then setting > the performance requirements only on those cpus help save power on the > rest of the cpus. PM QoS framework is one such framework interested in > knowing the smp_affinity of an IRQ and the change notificiation in this > regard. QoS requests for the CPU_DMA_LATENCY constraint currently apply > to all cpus, but when attached to an IRQ, can be applied only to the set > of cpus that IRQ's smp_affinity is set to. This allows other cpus to > enter deeper sleep states to save power. More than one framework/driver > can be interested in such information. All you are describing is the fact, that there is only a single notifier possible right now, but you completely miss to describe WHY you need multiple ones. The existing notifier was introduced to tell the driver that the irq affinity has changed, so it can move its buffers to the proper NUMA node. So if that driver gets this information then it can tell the PM QoS code that its affinity changed so that stuff can react accordingly, right? This is going nowhere unless you provide a proper usecase and arguments why this is necessary. Handwaving and I want a pony arguments are not sufficient. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Aug 27 2014 at 14:56 -0600, Thomas Gleixner wrote: >On Wed, 27 Aug 2014, Lina Iyer wrote: > >> PM QoS and other idle frameworks can do a better job of addressing power >> and performance requirements for a cpu, knowing the IRQs that are >> affine to that cpu. If a performance request is placed against serving >> the IRQ faster and if the IRQ is affine to a set of cpus, then setting >> the performance requirements only on those cpus help save power on the >> rest of the cpus. PM QoS framework is one such framework interested in >> knowing the smp_affinity of an IRQ and the change notificiation in this >> regard. QoS requests for the CPU_DMA_LATENCY constraint currently apply >> to all cpus, but when attached to an IRQ, can be applied only to the set >> of cpus that IRQ's smp_affinity is set to. This allows other cpus to >> enter deeper sleep states to save power. More than one framework/driver >> can be interested in such information. > >All you are describing is the fact, that there is only a single >notifier possible right now, but you completely miss to describe WHY >you need multiple ones. > >The existing notifier was introduced to tell the driver that the irq >affinity has changed, so it can move its buffers to the proper NUMA >node. So if that driver gets this information then it can tell the PM >QoS code that its affinity changed so that stuff can react >accordingly, right? > With the new PM QoS changes, multiple drivers would now be interested in knowing the smp_affinity changes of the same IRQ. PM QoS abstracts the notifier in the framework, so individual drivers dont have to register for notification themselves and handle affinity notifications. But PM QoS needs to coexist with NUMA and other arch drivers that need to modify based on their arch specific needs upon affinity change notifications. Modifying the IRQ notifications to list, benefits having a simpler mechanism to notify all arch drivers and frameworks like PM QoS. >This is going nowhere unless you provide a proper usecase and >arguments why this is necessary. Handwaving and I want a pony >arguments are not sufficient. > >Thanks, > > tglx -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2 Sep 2014, Lina Iyer wrote: > On Wed, Aug 27 2014 at 14:56 -0600, Thomas Gleixner wrote: > > On Wed, 27 Aug 2014, Lina Iyer wrote: > > > > All you are describing is the fact, that there is only a single > > notifier possible right now, but you completely miss to describe WHY > > you need multiple ones. > > > > The existing notifier was introduced to tell the driver that the irq > > affinity has changed, so it can move its buffers to the proper NUMA > > node. So if that driver gets this information then it can tell the PM > > QoS code that its affinity changed so that stuff can react > > accordingly, right? > > > With the new PM QoS changes, multiple drivers would now be interested in > knowing the smp_affinity changes of the same IRQ. PM QoS abstracts the Again, you fail to tell me WHY multiple drivers are interested and which drivers are interested and what they do with that information. > notifier in the framework, so individual drivers dont have to register > for notification themselves and handle affinity notifications. But PM > QoS needs to coexist with NUMA and other arch drivers that need to > modify based on their arch specific needs upon affinity change > notifications. Lots of unspecified blurb. What has NUMA to do with other arch drivers? Are you actually understanding what this is all about? > Modifying the IRQ notifications to list, benefits having a simpler > mechanism to notify all arch drivers and frameworks like PM QoS. Painting my bikeshed blue benefits all neighbours and institutions like the Royal Navy, right? Lina, this is leading nowhere. You just make completely unspecified claims and try to push your solution to a not explained problem through. That's not how it works, at least not with me. So please sit down and write up a proper description of the problem you are trying to solve. All I can see from your postings so far is: 1) You want to use the notification for PM QoS 2) You run into conflicts with the existing notifiers 3) You want to solve that conflict What you are not telling is: 1) In which way is PM QoS going to use that notifier 2) How does that conflict with the existing users. And we only have two of them: - InfiniPath 7322 chip driver - cpu_rmap driver, which is used by - solarflare NIC driver - mellanox mlx4 driver - cisco enic driver So how is that conflicting with your ARM centric PM QoS work? Please provide proper debug info about such a conflict, if it exists at all. 3) Which arch drivers are involved? So far none, at least not in mainline according to "grep irq_set_affinity_notifier arch/". If you have out of tree code which uses this, then we want to see it first to understand what the arch driver is trying to solve by using that notification mechanism. The more I think about what you are not telling the more I get the feeling that this whole PM QoS thing you try to do is a complete design disaster. From your changelog: "PM QoS and other idle frameworks can do a better job of addressing power and performance requirements for a cpu, knowing the IRQs that are affine to that cpu." I agree with that. PM might make better decisions if it knows which activities are targeted at a particular cpu. So someone figured that out and you got tasked to implement it. So you looked for a way to get to this information and you found the existing notifier and decided to (ab)use it for your purpose. But you did not spend a split second to figure out why this is wrong and even useless. At the point where PM QoS or whatever decides to use that information, it does not know at all what the current affinity mask is. So in your patch 4/4, which I ignored so far you do: + struct irq_desc *desc = irq_to_desc(req->irq); + struct cpumask *mask = desc->irq_data.affinity; WTF? You did not even have the courtesy to use the proper functions for this. And of course you access all of this unlocked. So how is that supposed to work with a concurrent affinity update? Not at all. Fiddling with irq_desc is a NONO: Consult your preferred search machine or just the kernel changelogs about the consequences. Looking at the other patches you really seem to have missed out on some other NONOs: struct pm_qos_request { + enum pm_qos_req_type type; + struct cpumask cpus_affine; + /* Internal structure members */ struct plist_node node; int pm_qos_class; struct delayed_work work; /* for pm_qos_update_request_timeout */ @@ -80,6 +90,7 @@ enum pm_qos_type { struct pm_qos_constraints { struct plist_head list; s32 target_value; /* Do not change to 64 bit */ + s32 target_per_cpu[NR_CPUS]; s32 default_value; s32 no_constraint_value; enum pm_qos_type type; Go figure how well "struct cpumask" and a s32 array sizeof NR_CPUS bodes on kernels compiled for NR_CPUS=8192 or larger. Just get it: You are hacking core code, not some random ARM SoC driver. That aside, your design or the lack thereof sucks. While your hackery might work somehow for the PM QoS problem at hand - ignoring the hard to debug race window of the concurrent affinity update - it's completely useless for other PM related decisions despite your claims that its a benefit for all of that. How would a general "keep track of the targets of all interrupts in the system" mechanism make use of this? Not at all. Simply because it has no idea which interrupts are requested at any given time. Sure your answer will be "add another notifier" to solve that problem. But that's the completely wrong answer because notifiers are a PITA and should go away. Here is Linus' POV of notifiers and I really agree with it: "Notifiers are a disgrace, and almost all of them are a major design mistake. They all have locking problems, the introduce internal arbitrary API's that are hard to fix later (because you have random people who decided to hook into them, which is the whole *point* of those notifier chains)." And so I really fight that notifier chain tooth and nails until someone provides me a proper argument WHY it cant be solved with proper explicit and understandable mechanisms. Back to your PM QoS trainwreck: I'm really impressed that you did not notice at all how inaccurate and therefor useless the affinity information which you are consulting is. Did it ever hit even the outer hemisphere of your brain, that irqdata.affinity is the affinity which was requested by the caller of an affinity setter function and not the effective affinity? I bet it did not, simply because you did not pay any attention and bother to analyze that proper. Otherwise you would have tried to hack around this as well in some disgusting way. Really great savings, if the affinity is left at default to all cpus, but the effective affinity is a single cpu due to irq routing restrictions in the hardware. Surely you tested against single CPU affinities and achieved great results, but that's not a generic and useful solution. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 02 2014 at 14:56 -0600, Thomas Gleixner wrote: >On Tue, 2 Sep 2014, Lina Iyer wrote: >> On Wed, Aug 27 2014 at 14:56 -0600, Thomas Gleixner wrote: >> > On Wed, 27 Aug 2014, Lina Iyer wrote: >> > >> > All you are describing is the fact, that there is only a single >> > notifier possible right now, but you completely miss to describe WHY >> > you need multiple ones. >> > >> > The existing notifier was introduced to tell the driver that the irq >> > affinity has changed, so it can move its buffers to the proper NUMA >> > node. So if that driver gets this information then it can tell the PM >> > QoS code that its affinity changed so that stuff can react >> > accordingly, right? >> > >> With the new PM QoS changes, multiple drivers would now be interested in >> knowing the smp_affinity changes of the same IRQ. PM QoS abstracts the > >Again, you fail to tell me WHY multiple drivers are interested and >which drivers are interested and what they do with that information. > >> notifier in the framework, so individual drivers dont have to register >> for notification themselves and handle affinity notifications. But PM >> QoS needs to coexist with NUMA and other arch drivers that need to >> modify based on their arch specific needs upon affinity change >> notifications. > >Lots of unspecified blurb. What has NUMA to do with other arch >drivers? Are you actually understanding what this is all about? > >> Modifying the IRQ notifications to list, benefits having a simpler >> mechanism to notify all arch drivers and frameworks like PM QoS. > >Painting my bikeshed blue benefits all neighbours and institutions >like the Royal Navy, right? Sorry about not being very clear in my responses. > >Lina, this is leading nowhere. You just make completely unspecified >claims and try to push your solution to a not explained problem >through. That's not how it works, at least not with me. > >So please sit down and write up a proper description of the problem >you are trying to solve. > >All I can see from your postings so far is: > > 1) You want to use the notification for PM QoS > > 2) You run into conflicts with the existing notifiers > > 3) You want to solve that conflict > Here is the premise of my patchset - When a cpuidle governor selects the idle state of the cpu, it uses the PM QoS CPU_DMA_LATENCY constraint value in determining the acceptable tolerance in exit latency of the cpu, before choosing the low power state for the cpu. Drivers specify the PM QoS request for this constraint, indicating their tolerance. The aggregated QoS request is used in determining the state of each cpu. Now, when drivers make a request, they usually have a device IRQ in mind, which generally needs to be addressed within the stipulated time after the interrupt fires. By default, the IRQs' have affinity towards many cpus. However, when used with a IRQ balancer, that sets the affinity only to a subset of the available cpus, it seems inefficient that all cpus obey the QoS requirement in terms of power. Say, when an USB driver specifies a 2 millisecond QoS constraint, it prevents all the cpus in the SoC from entering any power down state, which need more than 2 ms to be effective. Thats a lot of leakage current that could be saved. The IRQ in many instances would just wake up cpu0, while other cpus remain in high leakage idle states. This information is critical to the governor to determine the best idle state it could allow the cpu to be in. So the change here is to specify QoS requests against a subset of cpus, which when aggregated, the idle governor can distill out the QoS requirement only for that cpu. When requests are made with a set of cpus as the intended target for the QoS value, the PM QoS framework calculates the QoS value for each cpu. The governor can then be modified to request the QoS CPU_DMA_LATENCY constraint on an individual cpu and can choose the deepest idle state that would respect the requirement. By default QoS requests do not have a preference for cpus and that is the behavior with these patches too. Using the per-cpu PM QoS, the drivers can now specify a set of cpus that need to obey the QoS requests, if they know the cpus or they would specify an IRQ, if the request is to avoid latency in handling the device IRQ. The driver is not expected to handle the IRQ's smp_affinity changes and would be done by the PM QoS framework on behalf of the driver. This is not enforced on the driver that they need to specify an IRQ or a subset of cpus for their request. If there is no IRQ or cpu preference, then PM QoS defaults to applying the request for all cpus. I would expect many of the existing drivers still want this behavior. In the case a driver is aware that the QoS request to guarantee a certain performance in handling the device IRQ, the request can be made to track the IRQ. PM QoS registers for a smp_affinity change and does the best judgement based on the affinity. The IRQ may still fire on any of the cpus in the affinity mask which would make it comparable to the existing QoS behavior and no power saving is achieved. >What you are not telling is: > > 1) In which way is PM QoS going to use that notifier > > 2) How does that conflict with the existing users. And we only > have two of them: > > - InfiniPath 7322 chip driver > > - cpu_rmap driver, which is used by > > - solarflare NIC driver > - mellanox mlx4 driver > - cisco enic driver > > So how is that conflicting with your ARM centric PM QoS work? > > Please provide proper debug info about such a conflict, if it > exists at all. > > 3) Which arch drivers are involved? > > So far none, at least not in mainline according to > "grep irq_set_affinity_notifier arch/". > > If you have out of tree code which uses this, then we want to > see it first to understand what the arch driver is trying to > solve by using that notification mechanism. > I assumed that PM QoS is also used in conjunction with NUMA. I did notice a few systems that used the smp_affinity notifications. My assumption was such systems may have drivers that could prefer a per-cpu QoS request. per-cpu PM QoS is not an ARM specific change, though this benefits quite a bit in a power constrained environment. An octa core system with a low power or a high performance distinction amongst the cores, clearly benefits from this feature, when used with an IRQ balancer. It didnt seem right me that PM QoS gets notified from InifiniPath and NIC drivers, while others have PM QoS registering directly. As far my usecases are concerned in an ARM chipset, I did not see any conflict with the existing users of the smp affinity notifications. But, I assume the systems may want to use per-cpu PM QoS for some of their IRQs on a later date. Hence the change to a notification mechanism. >The more I think about what you are not telling the more I get the >feeling that this whole PM QoS thing you try to do is a complete >design disaster. > >From your changelog: > > "PM QoS and other idle frameworks can do a better job of addressing > power and performance requirements for a cpu, knowing the IRQs that > are affine to that cpu." > >I agree with that. PM might make better decisions if it knows which >activities are targeted at a particular cpu. > >So someone figured that out and you got tasked to implement it. So you >looked for a way to get to this information and you found the existing >notifier and decided to (ab)use it for your purpose. But you did not >spend a split second to figure out why this is wrong and even useless. > >At the point where PM QoS or whatever decides to use that information, >it does not know at all what the current affinity mask is. So in your >patch 4/4, which I ignored so far you do: > >+ struct irq_desc *desc = irq_to_desc(req->irq); >+ struct cpumask *mask = desc->irq_data.affinity; > >WTF? > Sorry, bad access on my part. I will fix it. >You did not even have the courtesy to use the proper functions for >this. And of course you access all of this unlocked. So how is that >supposed to work with a concurrent affinity update? Not at >all. > >Fiddling with irq_desc is a NONO: Consult your preferred search >machine or just the kernel changelogs about the consequences. > >Looking at the other patches you really seem to have missed out on >some other NONOs: > > struct pm_qos_request { >+ enum pm_qos_req_type type; >+ struct cpumask cpus_affine; >+ /* Internal structure members */ > struct plist_node node; > int pm_qos_class; > struct delayed_work work; /* for pm_qos_update_request_timeout */ >@@ -80,6 +90,7 @@ enum pm_qos_type { > struct pm_qos_constraints { > struct plist_head list; > s32 target_value; /* Do not change to 64 bit */ >+ s32 target_per_cpu[NR_CPUS]; > s32 default_value; > s32 no_constraint_value; > enum pm_qos_type type; > >Go figure how well "struct cpumask" and a s32 array sizeof NR_CPUS >bodes on kernels compiled for NR_CPUS=8192 or larger. > Sorry, I did not imagine the possiblity of these failures. I figured that I would not use up a per-cpu variable, since the access to the values may be from one cpu where the code is being processed, but seems like there are quite a few merits, that I overlooked. >Just get it: You are hacking core code, not some random ARM SoC >driver. > >That aside, your design or the lack thereof sucks. While your hackery >might work somehow for the PM QoS problem at hand - ignoring the hard >to debug race window of the concurrent affinity update - it's >completely useless for other PM related decisions despite your claims >that its a benefit for all of that. Yes, there is a race condition in PM QoS, where the notification between the time, the request is added and the notification is registered. AFAIK, the behavior in the irq/manage.c seems to be held the same. If there is a concurrent change in notification, the schedule_work would not reschedule and possibly not send a notification, which happens today as well. I will revisit the change and look, if I can make it better. > >How would a general "keep track of the targets of all interrupts in >the system" mechanism make use of this? Sorry, I do not understand your question. PM QoS is only interested in the IRQs specified in the QoS request. If there are no requests that need to be associated with an IRQ, then PM QoS will not register for an affinity change notification. >Not at all. Simply because it >has no idea which interrupts are requested at any given time. Sure >your answer will be "add another notifier" to solve that problem. But >that's the completely wrong answer because notifiers are a PITA and >should go away. Here is Linus' POV of notifiers and I really agree >with it: > > "Notifiers are a disgrace, and almost all of them are a major design > mistake. They all have locking problems, the introduce internal > arbitrary API's that are hard to fix later (because you have random > people who decided to hook into them, which is the whole *point* of > those notifier chains)." > >And so I really fight that notifier chain tooth and nails until someone >provides me a proper argument WHY it cant be solved with proper >explicit and understandable mechanisms. > FWIW, I do not like notifications either. I feel it adds unpredictablity to the system. If I didnt think, that NUMA and other systems might benefit from per-cpu PM QoS feature, I would not have added this notification mechanism and instead registered directly. I admit to having not much of a knowledge about NUMA, but atleast PM QoS seems to be architecture agnostic and all systems could use this generic kernel feature. Also, the notification added is restricted to PM QoS framework. The smp-affinity notifications are not relayed to the driver making the PM QoS request. The way it is with per-cpu PM QoS is that the framework registers for notification, *only* if there is a QoS request tagged with an IRQ. When a request does not specify an IRQ, the QoS framework does not register for any IRQ affinity notifications. Also, its possible that multiple drivers may have a QoS request and could tag the same IRQ, in which case, the notfication list helps. But surely, I can use other methods in PM QoS to do the same thing, without the need for a notification list in the IRQ framework. But that still doesnt solve for other systems that may have NUMA or NIC drivers using per-cpu PM QoS. When notified PM QoS updates its request list and amends the request affinity and recalculates the target value. In this patch, I currently update requests when the irq notifications are sent out. >Back to your PM QoS trainwreck: > >I'm really impressed that you did not notice at all how inaccurate and >therefor useless the affinity information which you are consulting is. > >Did it ever hit even the outer hemisphere of your brain, that >irqdata.affinity is the affinity which was requested by the caller of >an affinity setter function and not the effective affinity? > Yes and the effective affinity may not match up to irqdata.affinity, but the only reliable information we have is irqdata.affinity. We are trying to do the best with the information available. It still is better than having no information about affinity. >I bet it did not, simply because you did not pay any attention and >bother to analyze that proper. Otherwise you would have tried to hack >around this as well in some disgusting way. > >Really great savings, if the affinity is left at default to all cpus, >but the effective affinity is a single cpu due to irq routing >restrictions in the hardware. > >Surely you tested against single CPU affinities and achieved great >results, but that's not a generic and useful solution. Here is what needs to happen, before the power savings can be beneficial. - Drivers specifying PM QoS request update their drivers to track the request against an IRQ - SoCs that have a way of constraining the IRQ affinity to a subset of cpus. Or may use an IRQ balancer or may be design limited to have IRQs affine to a set of cpus. - Menu and other cpuidle governors consider per-cpu PM QoS in determining the idle state of the cpu. We may not see any benefit without all these in play on a given system. Like you say, with single CPU affinity and with these changes, we do see quite a bit of power savings from the performance cores. Obviously, I seem to be doing it wrong with this change. I am ready to amend and start over, if there are better ways to get affinity information. Thanks, Lina -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Resending... Hoping to retain the thread information, this time. On Tue, Sep 02 2014 at 14:56 -0600, Thomas Gleixner wrote: >On Tue, 2 Sep 2014, Lina Iyer wrote: >> On Wed, Aug 27 2014 at 14:56 -0600, Thomas Gleixner wrote: >> > On Wed, 27 Aug 2014, Lina Iyer wrote: >> > >> > All you are describing is the fact, that there is only a single >> > notifier possible right now, but you completely miss to describe WHY >> > you need multiple ones. >> > >> > The existing notifier was introduced to tell the driver that the irq >> > affinity has changed, so it can move its buffers to the proper NUMA >> > node. So if that driver gets this information then it can tell the PM >> > QoS code that its affinity changed so that stuff can react >> > accordingly, right? >> > >> With the new PM QoS changes, multiple drivers would now be interested in >> knowing the smp_affinity changes of the same IRQ. PM QoS abstracts the > >Again, you fail to tell me WHY multiple drivers are interested and >which drivers are interested and what they do with that information. > >> notifier in the framework, so individual drivers dont have to register >> for notification themselves and handle affinity notifications. But PM >> QoS needs to coexist with NUMA and other arch drivers that need to >> modify based on their arch specific needs upon affinity change >> notifications. > >Lots of unspecified blurb. What has NUMA to do with other arch >drivers? Are you actually understanding what this is all about? > >> Modifying the IRQ notifications to list, benefits having a simpler >> mechanism to notify all arch drivers and frameworks like PM QoS. > >Painting my bikeshed blue benefits all neighbours and institutions >like the Royal Navy, right? Sorry about not being very clear in my responses. > >Lina, this is leading nowhere. You just make completely unspecified >claims and try to push your solution to a not explained problem >through. That's not how it works, at least not with me. > >So please sit down and write up a proper description of the problem >you are trying to solve. > >All I can see from your postings so far is: > > 1) You want to use the notification for PM QoS > > 2) You run into conflicts with the existing notifiers > > 3) You want to solve that conflict > Here is the premise of my patchset - When a cpuidle governor selects the idle state of the cpu, it uses the PM QoS CPU_DMA_LATENCY constraint value in determining the acceptable tolerance in exit latency of the cpu, before choosing the low power state for the cpu. Drivers specify the PM QoS request for this constraint, indicating their tolerance. The aggregated QoS request is used in determining the state of each cpu. Now, when drivers make a request, they usually have a device IRQ in mind, which generally needs to be addressed within the stipulated time after the interrupt fires. By default, the IRQs' have affinity towards many cpus. However, when used with a IRQ balancer, that sets the affinity only to a subset of the available cpus, it seems inefficient that all cpus obey the QoS requirement in terms of power. Say, when an USB driver specifies a 2 millisecond QoS constraint, it prevents all the cpus in the SoC from entering any power down state, which need more than 2 ms to be effective. Thats a lot of leakage current that could be saved. The IRQ in many instances would just wake up cpu0, while other cpus remain in high leakage idle states. This information is critical to the governor to determine the best idle state it could allow the cpu to be in. So the change here is to specify QoS requests against a subset of cpus, which when aggregated, the idle governor can distill out the QoS requirement only for that cpu. When requests are made with a set of cpus as the intended target for the QoS value, the PM QoS framework calculates the QoS value for each cpu. The governor can then be modified to request the QoS CPU_DMA_LATENCY constraint on an individual cpu and can choose the deepest idle state that would respect the requirement. By default QoS requests do not have a preference for cpus and that is the behavior with these patches too. Using the per-cpu PM QoS, the drivers can now specify a set of cpus that need to obey the QoS requests, if they know the cpus or they would specify an IRQ, if the request is to avoid latency in handling the device IRQ. The driver is not expected to handle the IRQ's smp_affinity changes and would be done by the PM QoS framework on behalf of the driver. This is not enforced on the driver that they need to specify an IRQ or a subset of cpus for their request. If there is no IRQ or cpu preference, then PM QoS defaults to applying the request for all cpus. I would expect many of the existing drivers still want this behavior. In the case a driver is aware that the QoS request to guarantee a certain performance in handling the device IRQ, the request can be made to track the IRQ. PM QoS registers for a smp_affinity change and does the best judgement based on the affinity. The IRQ may still fire on any of the cpus in the affinity mask which would make it comparable to the existing QoS behavior and no power saving is achieved. >What you are not telling is: > > 1) In which way is PM QoS going to use that notifier > > 2) How does that conflict with the existing users. And we only > have two of them: > > - InfiniPath 7322 chip driver > > - cpu_rmap driver, which is used by > > - solarflare NIC driver > - mellanox mlx4 driver > - cisco enic driver > > So how is that conflicting with your ARM centric PM QoS work? > > Please provide proper debug info about such a conflict, if it > exists at all. > > 3) Which arch drivers are involved? > > So far none, at least not in mainline according to > "grep irq_set_affinity_notifier arch/". > > If you have out of tree code which uses this, then we want to > see it first to understand what the arch driver is trying to > solve by using that notification mechanism. > I assumed that PM QoS is also used in conjunction with NUMA. I did notice a few systems that used the smp_affinity notifications. My assumption was such systems may have drivers that could prefer a per-cpu QoS request. per-cpu PM QoS is not an ARM specific change, though this benefits quite a bit in a power constrained environment. An octa core system with a low power or a high performance distinction amongst the cores, clearly benefits from this feature, when used with an IRQ balancer. It didnt seem right me that PM QoS gets notified from InifiniPath and NIC drivers, while others have PM QoS registering directly. As far my usecases are concerned in an ARM chipset, I did not see any conflict with the existing users of the smp affinity notifications. But, I assume the systems may want to use per-cpu PM QoS for some of their IRQs on a later date. Hence the change to a notification mechanism. >The more I think about what you are not telling the more I get the >feeling that this whole PM QoS thing you try to do is a complete >design disaster. > >From your changelog: > > "PM QoS and other idle frameworks can do a better job of addressing > power and performance requirements for a cpu, knowing the IRQs that > are affine to that cpu." > >I agree with that. PM might make better decisions if it knows which >activities are targeted at a particular cpu. > >So someone figured that out and you got tasked to implement it. So you >looked for a way to get to this information and you found the existing >notifier and decided to (ab)use it for your purpose. But you did not >spend a split second to figure out why this is wrong and even useless. > >At the point where PM QoS or whatever decides to use that information, >it does not know at all what the current affinity mask is. So in your >patch 4/4, which I ignored so far you do: > >+ struct irq_desc *desc = irq_to_desc(req->irq); >+ struct cpumask *mask = desc->irq_data.affinity; > >WTF? > Sorry, bad access on my part. I will fix it. >You did not even have the courtesy to use the proper functions for >this. And of course you access all of this unlocked. So how is that >supposed to work with a concurrent affinity update? Not at >all. > >Fiddling with irq_desc is a NONO: Consult your preferred search >machine or just the kernel changelogs about the consequences. > >Looking at the other patches you really seem to have missed out on >some other NONOs: > > struct pm_qos_request { >+ enum pm_qos_req_type type; >+ struct cpumask cpus_affine; >+ /* Internal structure members */ > struct plist_node node; > int pm_qos_class; > struct delayed_work work; /* for pm_qos_update_request_timeout */ >@@ -80,6 +90,7 @@ enum pm_qos_type { > struct pm_qos_constraints { > struct plist_head list; > s32 target_value; /* Do not change to 64 bit */ >+ s32 target_per_cpu[NR_CPUS]; > s32 default_value; > s32 no_constraint_value; > enum pm_qos_type type; > >Go figure how well "struct cpumask" and a s32 array sizeof NR_CPUS >bodes on kernels compiled for NR_CPUS=8192 or larger. > Sorry, I did not imagine the possiblity of these failures. I figured that I would not use up a per-cpu variable, since the access to the values may be from one cpu where the code is being processed, but seems like there are quite a few merits, that I overlooked. >Just get it: You are hacking core code, not some random ARM SoC >driver. > >That aside, your design or the lack thereof sucks. While your hackery >might work somehow for the PM QoS problem at hand - ignoring the hard >to debug race window of the concurrent affinity update - it's >completely useless for other PM related decisions despite your claims >that its a benefit for all of that. Yes, there is a race condition in PM QoS, where the notification between the time, the request is added and the notification is registered. AFAIK, the behavior in the irq/manage.c seems to be held the same. If there is a concurrent change in notification, the schedule_work would not reschedule and possibly not send a notification, which happens today as well. I will revisit the change and look, if I can make it better. > >How would a general "keep track of the targets of all interrupts in >the system" mechanism make use of this? Sorry, I do not understand your question. PM QoS is only interested in the IRQs specified in the QoS request. If there are no requests that need to be associated with an IRQ, then PM QoS will not register for an affinity change notification. >Not at all. Simply because it >has no idea which interrupts are requested at any given time. Sure >your answer will be "add another notifier" to solve that problem. But >that's the completely wrong answer because notifiers are a PITA and >should go away. Here is Linus' POV of notifiers and I really agree >with it: > > "Notifiers are a disgrace, and almost all of them are a major design > mistake. They all have locking problems, the introduce internal > arbitrary API's that are hard to fix later (because you have random > people who decided to hook into them, which is the whole *point* of > those notifier chains)." > >And so I really fight that notifier chain tooth and nails until someone >provides me a proper argument WHY it cant be solved with proper >explicit and understandable mechanisms. > FWIW, I do not like notifications either. I feel it adds unpredictablity to the system. If I didnt think, that NUMA and other systems might benefit from per-cpu PM QoS feature, I would not have added this notification mechanism and instead registered directly. I admit to having not much of a knowledge about NUMA, but atleast PM QoS seems to be architecture agnostic and all systems could use this generic kernel feature. Also, the notification added is restricted to PM QoS framework. The smp-affinity notifications are not relayed to the driver making the PM QoS request. The way it is with per-cpu PM QoS is that the framework registers for notification, *only* if there is a QoS request tagged with an IRQ. When a request does not specify an IRQ, the QoS framework does not register for any IRQ affinity notifications. Also, its possible that multiple drivers may have a QoS request and could tag the same IRQ, in which case, the notfication list helps. But surely, I can use other methods in PM QoS to do the same thing, without the need for a notification list in the IRQ framework. But that still doesnt solve for other systems that may have NUMA or NIC drivers using per-cpu PM QoS. When notified PM QoS updates its request list and amends the request affinity and recalculates the target value. In this patch, I currently update requests when the irq notifications are sent out. >Back to your PM QoS trainwreck: > >I'm really impressed that you did not notice at all how inaccurate and >therefor useless the affinity information which you are consulting is. > >Did it ever hit even the outer hemisphere of your brain, that >irqdata.affinity is the affinity which was requested by the caller of >an affinity setter function and not the effective affinity? > Yes and the effective affinity may not match up to irqdata.affinity, but the only reliable information we have is irqdata.affinity. We are trying to do the best with the information available. It still is better than having no information about affinity. >I bet it did not, simply because you did not pay any attention and >bother to analyze that proper. Otherwise you would have tried to hack >around this as well in some disgusting way. > >Really great savings, if the affinity is left at default to all cpus, >but the effective affinity is a single cpu due to irq routing >restrictions in the hardware. > >Surely you tested against single CPU affinities and achieved great >results, but that's not a generic and useful solution. Here is what needs to happen, before the power savings can be beneficial. - Drivers specifying PM QoS request update their drivers to track the request against an IRQ - SoCs that have a way of constraining the IRQ affinity to a subset of cpus. Or may use an IRQ balancer or may be design limited to have IRQs affine to a set of cpus. - Menu and other cpuidle governors consider per-cpu PM QoS in determining the idle state of the cpu. We may not see any benefit without all these in play on a given system. Like you say, with single CPU affinity and with these changes, we do see quite a bit of power savings from the performance cores. Obviously, I seem to be doing it wrong with this change. I am ready to amend and start over, if there are better ways to get affinity information. Thanks, Lina -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Hi Thomas, Thomas Gleixner <tglx@linutronix.de> writes: [...] > All I can see from your postings so far is: > > 1) You want to use the notification for PM QoS > > 2) You run into conflicts with the existing notifiers > > 3) You want to solve that conflict > > What you are not telling is: > > 1) In which way is PM QoS going to use that notifier > > 2) How does that conflict with the existing users. And we only > have two of them: > > - InfiniPath 7322 chip driver > > - cpu_rmap driver, which is used by > - solarflare NIC driver > - mellanox mlx4 driver > - cisco enic driver > > So how is that conflicting with your ARM centric PM QoS work? > > Please provide proper debug info about such a conflict, if it > exists at all. Ignoring any new users of the affinity notifiers for now, aren't the existing users you list above conflicting with each other already? Maybe I'm missing something, or maybe we're just lucky and nobody uses them together, but irq_set_affinity_notifier() only allows a single notifier to be registered at any given time. So if you had a system with the inifinipath driver and a cpu_rmap user, only the one who's lucky enough to be the last to call irq_set_affinity_notifier() will ever be notified. So I think the main question for Lina is: assuming we need more than one co-existing driver/subsystem to get IRQ affinity notifiers (and we already seem to have them, but are lucky they don't currently co-exist), how should the IRQ core be changed to support that? Is the list approach proposed in $SUBJECT path reasonable? Kevin -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 25 Sep 2014, Kevin Hilman wrote: > Maybe I'm missing something, or maybe we're just lucky and nobody uses > them together, but irq_set_affinity_notifier() only allows a single > notifier to be registered at any given time. So if you had a system A single notifier per irq ..... -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 25 Sep 2014, Lina Iyer wrote: > > How would a general "keep track of the targets of all interrupts in > > the system" mechanism make use of this? > Sorry, I do not understand your question. > PM QoS is only interested in the IRQs specified in the QoS request. If > there are no requests that need to be associated with an IRQ, then PM > QoS will not register for an affinity change notification. Right, and I really hate the whole per irq notifier. It's a rats nest of life time issues and other problems. It also does not tell you whether an irq is disabled, reenabled or removed, which will change the qos constraints as well unless you plaster all drivers with updates to qos for those cases. So what about adding a qos field to irq_data itself, have a function to update it and let the irq core keep track of the per cpu irq relevant qos constraints and provide an evaluation function or a notifier for the PM/idle code? That's going to need some serious thought as well, but it should avoid most of the nasty notifier and lifetime issue which the per irq notifiers provide. Thoughts? tglx -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 08 2014 at 09:03 -0600, Thomas Gleixner wrote: >On Thu, 25 Sep 2014, Lina Iyer wrote: >> > How would a general "keep track of the targets of all interrupts in >> > the system" mechanism make use of this? >> Sorry, I do not understand your question. >> PM QoS is only interested in the IRQs specified in the QoS request. If >> there are no requests that need to be associated with an IRQ, then PM >> QoS will not register for an affinity change notification. > >Right, and I really hate the whole per irq notifier. It's a rats nest >of life time issues and other problems. > >It also does not tell you whether an irq is disabled, reenabled or >removed, which will change the qos constraints as well unless you >plaster all drivers with updates to qos for those cases. > >So what about adding a qos field to irq_data itself, have a function >to update it and let the irq core keep track of the per cpu irq >relevant qos constraints and provide an evaluation function or a >notifier for the PM/idle code? If that isnt intrusive in the IRQ core, then we can make it work for PM QoS. The issue that I am concerned is that, it might result in back and forth between IRQ and PM QoS frameworks. If that doesnt happen, then we are good with this approach. > >That's going to need some serious thought as well, but it should avoid >most of the nasty notifier and lifetime issue which the per irq >notifiers provide. Sure. I will look into this. > >Thoughts? Thank you. Lina > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 10 Oct 2014, Lina Iyer wrote: > On Wed, Oct 08 2014 at 09:03 -0600, Thomas Gleixner wrote: > > On Thu, 25 Sep 2014, Lina Iyer wrote: > > > > How would a general "keep track of the targets of all interrupts in > > > > the system" mechanism make use of this? > > > Sorry, I do not understand your question. > > > PM QoS is only interested in the IRQs specified in the QoS request. If > > > there are no requests that need to be associated with an IRQ, then PM > > > QoS will not register for an affinity change notification. > > > > Right, and I really hate the whole per irq notifier. It's a rats nest > > of life time issues and other problems. > > > > It also does not tell you whether an irq is disabled, reenabled or > > removed, which will change the qos constraints as well unless you > > plaster all drivers with updates to qos for those cases. > > > > So what about adding a qos field to irq_data itself, have a function > > to update it and let the irq core keep track of the per cpu irq > > relevant qos constraints and provide an evaluation function or a > > notifier for the PM/idle code? > > If that isnt intrusive in the IRQ core, then we can make it work for PM > QoS. The issue that I am concerned is that, it might result in back and > forth between IRQ and PM QoS frameworks. If that doesnt happen, then we > are good with this approach. I can't tell that upfront, but I think it's worth to explore it. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Thomas, On Fri, Oct 17 2014 at 01:29 -0600, Thomas Gleixner wrote: >On Fri, 10 Oct 2014, Lina Iyer wrote: >> On Wed, Oct 08 2014 at 09:03 -0600, Thomas Gleixner wrote: >> > On Thu, 25 Sep 2014, Lina Iyer wrote: >> > > > How would a general "keep track of the targets of all interrupts in >> > > > the system" mechanism make use of this? >> > > Sorry, I do not understand your question. >> > > PM QoS is only interested in the IRQs specified in the QoS request. If >> > > there are no requests that need to be associated with an IRQ, then PM >> > > QoS will not register for an affinity change notification. >> > >> > Right, and I really hate the whole per irq notifier. It's a rats nest >> > of life time issues and other problems. >> > >> > It also does not tell you whether an irq is disabled, reenabled or >> > removed, which will change the qos constraints as well unless you >> > plaster all drivers with updates to qos for those cases. >> > >> > So what about adding a qos field to irq_data itself, have a function >> > to update it and let the irq core keep track of the per cpu irq >> > relevant qos constraints and provide an evaluation function or a >> > notifier for the PM/idle code? >> >> If that isnt intrusive in the IRQ core, then we can make it work for PM >> QoS. The issue that I am concerned is that, it might result in back and >> forth between IRQ and PM QoS frameworks. If that doesnt happen, then we >> are good with this approach. > >I can't tell that upfront, but I think it's worth to explore it. > I was able to review the options and I attempted a few methods, but off-loading the QoS onto the IRQ framework, made it quite complex to manage it. QoS values for each of the four constraints and the constraints could be one of 3 types - min, max or sum, makes it a whole lot of mess handling it in IRQ code. I was able to get QoS to be notified of IRQ affinity changes without using notifiers, but, I still am yet to find a way to modify QoS requests on enable/disable of IRQ. I will send the RFC of the new patchset, would be interested in taking a look and let me know what you think. Thanks, Lina -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/infiniband/hw/qib/qib_iba7322.c b/drivers/infiniband/hw/qib/qib_iba7322.c index a7eb325..62cb77d 100644 --- a/drivers/infiniband/hw/qib/qib_iba7322.c +++ b/drivers/infiniband/hw/qib/qib_iba7322.c @@ -3345,9 +3345,7 @@ static void reset_dca_notifier(struct qib_devdata *dd, struct qib_msix_entry *m) "Disabling notifier on HCA %d irq %d\n", dd->unit, m->msix.vector); - irq_set_affinity_notifier( - m->msix.vector, - NULL); + irq_release_affinity_notifier(m->notifier); m->notifier = NULL; } diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 698ad05..c1e227c 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -203,7 +203,7 @@ static inline int check_wakeup_irqs(void) { return 0; } * struct irq_affinity_notify - context for notification of IRQ affinity changes * @irq: Interrupt to which notification applies * @kref: Reference count, for internal use - * @work: Work item, for internal use + * @list: Add to the notifier list, for internal use * @notify: Function to be called on change. This will be * called in process context. * @release: Function to be called on release. This will be @@ -214,7 +214,7 @@ static inline int check_wakeup_irqs(void) { return 0; } struct irq_affinity_notify { unsigned int irq; struct kref kref; - struct work_struct work; + struct list_head list; void (*notify)(struct irq_affinity_notify *, const cpumask_t *mask); void (*release)(struct kref *ref); }; @@ -265,6 +265,8 @@ extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m); extern int irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify); +extern int +irq_release_affinity_notifier(struct irq_affinity_notify *notify); #else /* CONFIG_SMP */ static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m) @@ -295,6 +297,12 @@ irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify) { return 0; } + +static inline int +irq_release_affinity_notifier(struct irq_affinity_notify *notify) +{ + return 0; +} #endif /* CONFIG_SMP */ /* diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h index 472c021..db3509e 100644 --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -31,7 +31,8 @@ struct irq_desc; * @threads_handled_last: comparator field for deferred spurious detection of theraded handlers * @lock: locking for SMP * @affinity_hint: hint to user space for preferred irq affinity - * @affinity_notify: context for notification of affinity changes + * @affinity_notify: list of notification clients for affinity changes + * @affinity_work: Work queue for handling affinity change notifications * @pending_mask: pending rebalanced interrupts * @threads_oneshot: bitfield to handle shared oneshot threads * @threads_active: number of irqaction threads currently running @@ -60,7 +61,8 @@ struct irq_desc { struct cpumask *percpu_enabled; #ifdef CONFIG_SMP const struct cpumask *affinity_hint; - struct irq_affinity_notify *affinity_notify; + struct list_head affinity_notify; + struct work_struct affinity_work; #ifdef CONFIG_GENERIC_PENDING_IRQ cpumask_var_t pending_mask; #endif diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index 1487a12..c95e1f3 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -91,6 +91,7 @@ static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node, for_each_possible_cpu(cpu) *per_cpu_ptr(desc->kstat_irqs, cpu) = 0; desc_smp_init(desc, node); + INIT_LIST_HEAD(&desc->affinity_notify); } int nr_irqs = NR_IRQS; diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 3dc6a61..b6ff79c 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -209,10 +209,9 @@ int irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask, irq_copy_pending(desc, mask); } - if (desc->affinity_notify) { - kref_get(&desc->affinity_notify->kref); - schedule_work(&desc->affinity_notify->work); - } + if (!list_empty(&desc->affinity_notify)) + schedule_work(&desc->affinity_work); + irqd_set(data, IRQD_AFFINITY_SET); return ret; @@ -248,14 +247,14 @@ EXPORT_SYMBOL_GPL(irq_set_affinity_hint); static void irq_affinity_notify(struct work_struct *work) { - struct irq_affinity_notify *notify = - container_of(work, struct irq_affinity_notify, work); - struct irq_desc *desc = irq_to_desc(notify->irq); + struct irq_desc *desc = + container_of(work, struct irq_desc, affinity_work); cpumask_var_t cpumask; unsigned long flags; + struct irq_affinity_notify *notify; if (!desc || !alloc_cpumask_var(&cpumask, GFP_KERNEL)) - goto out; + return; raw_spin_lock_irqsave(&desc->lock, flags); if (irq_move_pending(&desc->irq_data)) @@ -264,11 +263,20 @@ static void irq_affinity_notify(struct work_struct *work) cpumask_copy(cpumask, desc->irq_data.affinity); raw_spin_unlock_irqrestore(&desc->lock, flags); - notify->notify(notify, cpumask); + list_for_each_entry(notify, &desc->affinity_notify, list) { + /** + * Check and get the kref only if the kref has not been + * released by now. Its possible that the reference count + * is already 0, we dont want to notify those if they are + * already released. + */ + if (!kref_get_unless_zero(¬ify->kref)) + continue; + notify->notify(notify, cpumask); + kref_put(¬ify->kref, notify->release); + } free_cpumask_var(cpumask); -out: - kref_put(¬ify->kref, notify->release); } /** @@ -286,34 +294,50 @@ int irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify) { struct irq_desc *desc = irq_to_desc(irq); - struct irq_affinity_notify *old_notify; unsigned long flags; - /* The release function is promised process context */ - might_sleep(); - if (!desc) return -EINVAL; - /* Complete initialisation of *notify */ - if (notify) { - notify->irq = irq; - kref_init(¬ify->kref); - INIT_WORK(¬ify->work, irq_affinity_notify); + if (!notify) { + WARN("%s called with NULL notifier - use irq_release_affinity_notifier function instead.\n", + __func__); + return -EINVAL; } + notify->irq = irq; + kref_init(¬ify->kref); + INIT_LIST_HEAD(¬ify->list); raw_spin_lock_irqsave(&desc->lock, flags); - old_notify = desc->affinity_notify; - desc->affinity_notify = notify; + list_add(¬ify->list, &desc->affinity_notify); raw_spin_unlock_irqrestore(&desc->lock, flags); - if (old_notify) - kref_put(&old_notify->kref, old_notify->release); - return 0; } EXPORT_SYMBOL_GPL(irq_set_affinity_notifier); +/** + * irq_release_affinity_notifier - Remove us from notifications + * @notify: Context for notification + */ +int irq_release_affinity_notifier(struct irq_affinity_notify *notify) +{ + struct irq_desc *desc; + unsigned long flags; + + if (!notify) + return -EINVAL; + + desc = irq_to_desc(notify->irq); + raw_spin_lock_irqsave(&desc->lock, flags); + list_del(¬ify->list); + raw_spin_unlock_irqrestore(&desc->lock, flags); + kref_put(¬ify->kref, notify->release); + + return 0; +} +EXPORT_SYMBOL_GPL(irq_release_affinity_notifier); + #ifndef CONFIG_AUTO_IRQ_AFFINITY /* * Generic version of the affinity autoselector. @@ -348,6 +372,8 @@ setup_affinity(unsigned int irq, struct irq_desc *desc, struct cpumask *mask) if (cpumask_intersects(mask, nodemask)) cpumask_and(mask, mask, nodemask); } + INIT_LIST_HEAD(&desc->affinity_notify); + INIT_WORK(&desc->affinity_work, irq_affinity_notify); irq_do_set_affinity(&desc->irq_data, mask, false); return 0; } @@ -1413,14 +1439,15 @@ EXPORT_SYMBOL_GPL(remove_irq); void free_irq(unsigned int irq, void *dev_id) { struct irq_desc *desc = irq_to_desc(irq); + struct irq_affinity_notify *notify; if (!desc || WARN_ON(irq_settings_is_per_cpu_devid(desc))) return; -#ifdef CONFIG_SMP - if (WARN_ON(desc->affinity_notify)) - desc->affinity_notify = NULL; -#endif + WARN_ON(!list_empty(&desc->affinity_notify)); + + list_for_each_entry(notify, &desc->affinity_notify, list) + kref_put(¬ify->kref, notify->release); chip_bus_lock(desc); kfree(__free_irq(irq, dev_id)); diff --git a/lib/cpu_rmap.c b/lib/cpu_rmap.c index 4f134d8..0c8da50 100644 --- a/lib/cpu_rmap.c +++ b/lib/cpu_rmap.c @@ -235,7 +235,7 @@ void free_irq_cpu_rmap(struct cpu_rmap *rmap) for (index = 0; index < rmap->used; index++) { glue = rmap->obj[index]; - irq_set_affinity_notifier(glue->notify.irq, NULL); + irq_release_affinity_notifier(&glue->notify); } cpu_rmap_put(rmap);
PM QoS and other idle frameworks can do a better job of addressing power and performance requirements for a cpu, knowing the IRQs that are affine to that cpu. If a performance request is placed against serving the IRQ faster and if the IRQ is affine to a set of cpus, then setting the performance requirements only on those cpus help save power on the rest of the cpus. PM QoS framework is one such framework interested in knowing the smp_affinity of an IRQ and the change notificiation in this regard. QoS requests for the CPU_DMA_LATENCY constraint currently apply to all cpus, but when attached to an IRQ, can be applied only to the set of cpus that IRQ's smp_affinity is set to. This allows other cpus to enter deeper sleep states to save power. More than one framework/driver can be interested in such information. The current implementation allows only a single notification callback whenever the IRQ's SMP affinity is changed. Adding a second notification punts the existing notifier function out of registration. Add a list of notifiers, allowing multiple clients to register for irq affinity notifications. The kref object associated with the struct irq_affinity_notify was used to prevent the notifier object from being released if there is a pending notification. It was incremented before the work item was scheduled and was decremented when the notification was completed. If the kref count was zero at the end of it, the release function gets a callback allowing the module to release the irq_affinity_notify memory. This works well for a single notification. When multiple clients are registered, no single kref object can be used. Hence, the work function when scheduled, will increase the kref count using the kref_get_unless_zero(), so if the module had already unregistered the irq_affinity_notify object while the work function was scheduled, it will not be notified. Signed-off-by: Lina Iyer <lina.iyer@linaro.org>